AI has become invasively popular, and I’ve seen more evidence of its ineffectiveness than the opposite, but what I dislike most is that many models run on datasets of stolen data for the sake of profitability, à la OpenAI and DeepSeek
mashable.com/…/openai-chatgpt-class-action-lawsui… petapixel.com/…/openai-claims-deepseek-took-all-o…
Are there any AI services that run on ethically obtained datasets, like stuff people explicitly consented to submitting (not as some side clause of a T&C), data bought by properly compensating the data’s original owners, or datasets contributed by the service providers themselves?
decended_being@midwest.social 2 days ago
No
Treczoks@lemmy.world 1 day ago
There are no legal data sources big enough to train an AI to the level required to perform even basic interaction.
AmbitiousProcess@piefed.social 1 day ago
This is very true.
I was part of the OpenAssistant project, voluntarily submitting my own writing to help train open-source LLMs without stolen data, in the hope it would stop these companies from stealing people's work and make "AI" less of a black box.
After thousands of people submitted millions of prompt-response pairs, and after some researchers called it the highest-quality natural language dataset they'd seen in a while, the base model trained on that data alone was still almost always incoherent. You only got a functioning model if you used the data to fine-tune an existing, larger pretrained model (Llama, at the time).
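For anyone curious what that fine-tuning step looks like in practice, here's a minimal sketch using the Hugging Face `transformers` and `datasets` libraries and the publicly released `OpenAssistant/oasst1` dataset. The base model, preprocessing, and hyperparameters are illustrative stand-ins, not what the project actually used:

```python
# Minimal sketch: adapt an already-pretrained model with the consented
# OpenAssistant data, rather than pretraining from scratch (which this
# dataset is far too small for). Pythia-160m is an ungated stand-in
# for the Llama models used at the time.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# oasst1 stores conversation trees as individual messages; for this
# sketch, just do causal-LM training on each message's raw text.
ds = load_dataset("OpenAssistant/oasst1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="oasst-finetune",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The same data that can't bootstrap a coherent model from nothing works fine as a fine-tuning signal, because the base model already carries the language competence learned from web-scale (i.e., mostly unconsented) text.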