The Washington Post has an interactive article picking apart where the data set of one of Google's large language models comes from. Toward the end, it has a search box where you can enter a domain and see its rank within the set and the percentage of tokens it contributed.
To my surprise, this very website was included, in spot 976 277 of about fifteen million, a rank which seems remarkably high to me.
I mostly feel weirdly flattered. But overall, I see this as a sign of how little actual good text they could find online, how much meaningless crap there is, and how much good text is in one way or another locked away from what more general search can dig up. (Or, perhaps, protected by organizations with lawyers well-funded enough that Google dare not risk upsetting them.) There is clearly so little good writing out in the open that a minimal site like mine gets ranked in the top 7%. No matter how small the contribution (0.00001%, the article says), it is enormously out of proportion to my presence on the web. It just goes to show how skewed the data set is.
A darker way to look at it is that I basically got taken advantage of because I am unlikely to be able to make a fuss about it. I truly have not made up my mind here. I do not know whether I want freely available text to be freely usable to train language models or not. But there is a strange feeling to it, and if there were some flag I could set or some opt-out registry I could enter my site into, I would probably do so.
I am also curious whether this means models can be made to imitate my writing style, and how well they might do it. I ran a few tests, asking ChatGPT to write in the style of bjoreman.com, but the results were all disappointing. I expect much more when they come out with a model trained on podcasts.
(Asking ChatGPT to write in the style of Cormac McCarthy is, however, very entertaining.)