Comment on OpenAI says it’s “impossible” to create useful AI models without copyrighted material

<- View Parent
BraveSirZaphod@kbin.social ⁨9⁩ ⁨months⁩ ago

There is literally no resemblance between the training works and the model.

This is way too strong a statement when some LLMs can spit out copyrighted works verbatim.

https://www.404media.co/google-researchers-attack-convinces-chatgpt-to-reveal-its-training-data/

A team of researchers primarily from Google’s DeepMind systematically convinced ChatGPT to reveal snippets of the data it was trained on using a new type of attack prompt which asked a production model of the chatbot to repeat specific words forever.

Often, that “random content” is long passages of text scraped directly from the internet. I was able to find verbatim passages the researchers published from ChatGPT on the open internet: Notably, even the number of times it repeats the word “book” shows up in a Google Books search for a children’s book of math problems. Some of the specific content published by these researchers is scraped directly from CNN, Goodreads, WordPress blogs, on fandom wikis, and which contain verbatim passages from Terms of Service agreements, Stack Overflow source code, copyrighted legal disclaimers, Wikipedia pages, a casino wholesaling website, news blogs, and random internet comments.

Beyond that, copyright law was designed under the circumstances where creative works are only ever produced by humans, with all the inherent limitations of time, scale, and ability that come with that. Those circumstances have now fundamentally changed, and while I won't be so bold as to pretend to know what the ideal legal framework is going forward, I think it's also a much bolder statement than people think to say that fair use as currently applied to humans should apply equally to AI and that this should be accepted without question.

source
Sort:hotnewtop