balsoft@lemmy.ml 5 weeks ago

LLMs are absolutely trained on FOSS, including GPL’d stuff. Accelerating software development is also a large part of how the companies behind them make money. I believe training on GPL’d software and then charging for access is copyright infringement, but it doesn’t really matter because the entities that are supposed to enforce copyright are paid for by the same billionaires who run the AI companies, so literally nothing will happen.

mechoman444@lemmy.world 5 weeks ago
This argument has never made much sense to me.

Copyright protects the expression itself, not the ideas, facts, patterns, grammar, writing styles, or knowledge learned from that expression. Humans learn from copyrighted books, articles, movies, and music every day. Nobody claims that someone who read 10,000 copyrighted novels is committing copyright infringement every time they sit down and write a new story.

That’s the part I keep seeing people ignore.

If learning from copyrighted material is infringement, then every author, journalist, musician, engineer, and artist on the planet is infringing copyright because they all learned their craft from copyrighted works created by other people.

The real question is whether an AI is reproducing copyrighted content, not whether it learned from copyrighted content. Those are two completely different issues.

You don’t get to argue that learning is legal when humans do it and suddenly becomes theft when a machine does it. Either learning from publicly available information is allowed, or it isn’t. The standard cannot magically change because you dislike the technology.

- balsoft@lemmy.ml 2 weeks ago
  
  > Copyright protects the expression itself, not the ideas, facts, patterns, grammar, writing styles, or knowledge learned from that expression.
  
  Copyright absolutely does protect “ideas, patterns, grammar, writing styles”. It does not cover material facts, but this is beside the point here.
  
  > Nobody claims that someone who read 10,000 copyrighted novels is committing copyright infringement every time they sit down and write a new story.
  
  Actually, if they take certain elements from other works it often can be considered copyright infringement, but this is also beside the point.
  
  > You don’t get to argue that learning is legal when humans do it and suddenly becomes theft when a machine does it. Either learning from publicly available information is allowed, or it isn’t. The standard cannot magically change because you dislike the technology.
  
  This is the crux of the issue. The LLM is not a person from a legal perspective, therefore it cannot “learn”. What is happening is that a legal person - a company - is consuming a bunch of copyrighted material and transforming it into a bunch of data to be interpreted by a computer program. This makes that data definitionally a derivative work made by the company. Sure, the transformation process is probably “creative” enough for the company to be able to claim copyright on the resulting weights, as long as that company gains explicit or implicit approval from all the copyright holders of the works they used to create the weights. Legally speaking, this is no different from you pirating a bunch of movies and making a compilation of the funniest moments - see how the legal system would react to you doing that.
  
  Of course, I find the entire concept of the modern copyright system to be an awful idea whose entire purpose is impeding human creativity for the sake of monetization. But if we read the law as it is, the AI companies are absolutely committing copyright infringement by training their models on GPL code and not releasing the weights under a GPL-compatible license.
  
  - mechoman444@lemmy.world 2 weeks ago
    
    A copyright is a type of intellectual property that gives its owner the exclusive right to copy and distribute a creative work, usually for a limited time. The creative work may be in a literary, artistic, educational, or musical form. Copyright is intended to protect the original expression of an idea in the form of a creative work, but not the idea itself. A copyright is subject to limitations based on public interest considerations, such as the fair use doctrine in the United States.
    
    The explanation above comes from the Big Law Dictionary app, which is available on the Google Play Store for Android.
    
    Copyright protects the expression of an idea, not the idea itself. If ideas could be copyrighted, it would be absurd. I don’t even know how that could realistically be enforced.
    
    This is the problem with so many people talking out of their backsides on this platform. They don’t know what these legal terms actually mean or how they function. They see something they don’t like and immediately start making claims about it.
    
    LLMs do a lot of things that deserve criticism. Copyright isn’t necessarily one of them.
    
    If a company developing or operating an LLM is committing copyright infringement, then it should be prosecuted. If it has violated the law and hasn’t been held accountable, that is a separate issue that should be remedied. But that is a completely different argument from claiming that LLMs are inherently infringing copyright.
    
    An LLM uses training data to generate novel responses. That means it produces new output rather than copying source material verbatim. Simply using publicly available information to learn patterns is not, by itself, copyright infringement. If a company unlawfully obtained private or copyrighted material, or reproduced protected works in a way that violates copyright law, then it has broken the law and should be prosecuted.
    
    But the existence of an LLM, by itself, is not copyright infringement.
    
    So, moving forward, this can go one of two ways: either you concede that you were incorrect, or you double down and make even more absurd claims. Either way, I don’t know… I’ve had this argument so many times on this platform that it’s ridiculous.
    
    As of right now, there are multiple lawsuits against major LLM developers. In some of those cases, the courts have ruled that training on publicly available data can qualify as fair use.
    
    At this point, not a single court has issued a final ruling in favor of a plaintiff holding that LLM training itself is copyright infringement.
    
    I genuinely don’t know what else I’m supposed to do to prove this to you people.
    
    balsoft@lemmy.ml 2 weeks ago
    
    > If a company unlawfully … reproduced protected works in a way that violates copyright law, then it has broken the law and should be prosecuted.
    
    Merely training an LLM on copyrighted material without the holder’s permission (and then distributing the weights or selling access to inference on those weights) would be a violation of copyright law if the law were applied consistently (ignoring the fair use argument, which I’ll get back to). That is, if you applied any other computational process in this way, the result would be a derivative work. The reason it’s “different” this time is that the people violating the law are richer than those who wrote the law in the first place, not that there is any legal argument.
    
    > As of right now, there are multiple lawsuits against major LLM developers. In some of those cases, the courts have ruled that training on publicly available data can qualify as fair use.
    
    If a court ruled that it’s “fair use”, that actually lends more credence to the idea that LLM weights are a derivative work - “fair use” is a defense for copyright infringement that only makes sense in this case if the new work is an unauthorized derivative of the original.
    
    Whether it’s actually fair use or not is another question (I can see the fair use argument for open-weight non-commercial models, not so much for commercial offerings).
    
    BTW, I’m not even necessarily anti-AI (at least the open-weight, local models). I use a local model in my job almost daily, and I also think it’s mostly good that the entirety of the FOSS corpus is available for download in a compressed and easily remixable form. I’m just pointing out the hypocrisy of a legal system that applies its already unjust copyright law (and most other laws) only against poor people.
    