wordfreq is not just concerned with formal printed words. It collected more conversational language usage from two sources in particular: Twitter and Reddit.
Now Twitter is gone anyway, and its public APIs have shut down.
Reddit also stopped providing public data archives, and now they sell their archives at a price that only OpenAI will pay.
There’s still the Fediverse.
I mean, that doesn’t solve the LLM pollution problem, but…
Danterious@lemmy.dbzer0.com 2 months ago
That sucks. So much research is being twisted by humanity’s greed. I hope that whatever comes after the internet becomes useless is better.
Anti Commercial-AI license (CC BY-NC-SA 4.0)
Zoot@reddthat.com 2 months ago
“Humanity is too greedy. My Facebook-esque pretend license will definitely keep me safe!” Lol.