Comment on OpenAI says it’s “impossible” to create useful AI models without copyrighted material

BraveSirZaphod@kbin.social 9 months ago

The key element here is that an LLM does not actually have access to its training data when it answers a query. At least as of now, I'm skeptical that it's technologically feasible to search the entire training corpus, which is an absolutely enormous amount of data, on every query to detect potential copyright violations, especially when you don't know in advance which portions of the response you need to search for. Even then, that only catches verbatim (or near-verbatim) violations, and plenty of copyright questions are a lot fuzzier.
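To make the verbatim-matching idea concrete, here's a toy sketch of what such a check might look like: index every word n-gram in the corpus, then look up the n-grams of a generated response. The function names (`word_ngrams`, `build_index`, `verbatim_overlaps`) and the choice of word-level n-grams are my own assumptions for illustration, not how any real system does it, and a real training corpus would be far too large for an in-memory dictionary like this.

```python
from typing import Dict, Set, Tuple

Gram = Tuple[str, ...]

def word_ngrams(text: str, n: int) -> Set[Gram]:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(corpus: Dict[str, str], n: int) -> Dict[Gram, Set[str]]:
    """Map each n-gram to the ids of the (hypothetical) corpus documents containing it."""
    index: Dict[Gram, Set[str]] = {}
    for doc_id, text in corpus.items():
        for gram in word_ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index

def verbatim_overlaps(output: str, index: Dict[Gram, Set[str]], n: int) -> Dict[str, int]:
    """Count, per corpus document, how many n-grams of the output appear in it verbatim."""
    hits: Dict[str, int] = {}
    for gram in word_ngrams(output, n):
        for doc_id in index.get(gram, ()):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return hits
```

Note that a paraphrase shares no long n-grams with its source, so it sails straight through a check like this, which is exactly the "only catches verbatim" limitation.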

For instance, say you tell GPT to generate a fan fiction story involving a romance between Draco Malfoy and Harry Potter. This would unquestionably violate JK Rowling's copyright on the characters if you published the output for commercial gain, but you might be okay if you just plop it on a fan fic site for free. You're unquestionably okay if you never publish it at all and just keep it to yourself (well, a lawyer might still argue that this harms JK Rowling by cutting into her profit if she were to publish a Malfoy-Harry romance, since people can just generate their own instead of buying hers, but that's a messier question). But it's also possible that, in the process of generating this story, GPT might unwittingly copy chunks of renowned fan fiction masterpiece My Immortal verbatim. Should GPT allow this, or would the copyright-management AI strike it? Legally, it's something of a murky question.

For yet another angle, there is of course a whole host of public domain text out there. GPT probably knows the text of the Lord's Prayer, for instance, and so even though that output would perfectly match some training material, it's legally perfectly okay. So, a copyright police AI would need to know the copyright status of all its training material, which is not something you can super easily determine by just ingesting the broad internet.
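As a rough sketch of why that metadata matters: once a checker has found which training documents an output matches, it still has to look up each one's status, and anything scraped without metadata falls into an "unknown" bucket. Everything here (the `Status` enum, `flag_matches`, the document ids) is hypothetical and only meant to illustrate the lookup.

```python
from enum import Enum
from typing import Dict, Iterable, List, Tuple

class Status(Enum):
    PUBLIC_DOMAIN = "public_domain"
    COPYRIGHTED = "copyrighted"
    UNKNOWN = "unknown"

def flag_matches(
    matched_doc_ids: Iterable[str],
    status_by_doc: Dict[str, Status],
) -> List[Tuple[str, Status]]:
    """Return the matched documents that are not known to be public domain.

    Documents with no recorded status are treated as UNKNOWN, which is the
    realistic default for text ingested from the broad internet.
    """
    flagged = []
    for doc_id in matched_doc_ids:
        status = status_by_doc.get(doc_id, Status.UNKNOWN)
        if status is not Status.PUBLIC_DOMAIN:
            flagged.append((doc_id, status))
    return flagged
```

A match against the Lord's Prayer passes cleanly only because someone recorded it as public domain; the hard part is filling in that table, not querying it.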
