Apparently, stealing other people’s work to create a product for money is now “fair use” according to OpenAI, because they are “innovating” (stealing). Yeah. Move fast and break things, huh?
“Because copyright today covers virtually every sort of human expression—including blogposts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials,” wrote OpenAI in the House of Lords submission.
OpenAI claimed that the authors in that lawsuit “misconceive[d] the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence.”
Haus@kbin.social 11 months ago
Try to train a human comedian to make jokes without ever allowing him to hear another comedian's jokes, never watching a movie, never reading a book or magazine, never watching a TV show. I expect the jokes would be pretty weak.
Phanatik@kbin.social 11 months ago
A comedian isn't forming a sentence based on which word is most probable to appear after the previous one. This is such a bullshit argument that reduces human competency to "monkey see thing to draw thing" and completely overlooks the craft and intent behind creative works. Do you know why ChatGPT uses certain words over others? Probability. It decided, as a result of its training, that one word would appear after the previous in certain contexts. It absolutely doesn't take into account things like "maybe this word would be better here because the sound and syllables maintain the flow of the sentence".
Baffling takes from people who don't know what they're talking about.
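To make the "probability" point concrete, here is a minimal toy sketch of next-word prediction using bigram counts. The corpus and function names are invented for illustration; real LLMs use neural networks conditioned on long contexts, not raw word-pair counts, but the basic idea of "pick the most probable continuation" is the same:

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration.
corpus = "the cat sat on the mat the cat ran on the grass".split()

# Count how often each word follows each other word (bigram counts).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def most_probable_next(word):
    """Return the word most often seen after `word` in the corpus."""
    return following[word].most_common(1)[0][0]

print(most_probable_next("the"))  # prints "cat" ("cat" follows "the" twice)
```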
frog@beehaw.org 11 months ago
I wish I could upvote this more than once.
What people always seem to miss is that a human doesn’t need billions of examples to be able to produce something that’s kind of “eh, close enough”. Artists don’t look at billions of paintings. They look at a few, but do so deeply, absorbing not just the most likely distribution of brushstrokes, but why the painting looks the way it does. For a basis of comparison, I did an art and design course last year and looked at about 300 artworks in total (course requirement was 50-100). The research component on my design-related degree course is one page a week per module (so basically one example from the field the module is about, plus some analysis). The real bulk of the work humans do isn’t looking at billions of examples: it’s looking at a few, and then practicing the skill and developing a process that allows them to convey the thing they’re trying to express.
If the AI models were really doing exactly the same thing humans do, the models could be trained without any copyright infringement at all, because all of the public domain and creative commons content, plus maybe licensing a little more, would be more than enough.
DaDragon@kbin.social 11 months ago
That’s what humans do, though. Maybe not probability directly, but we all know that some words should be put in a certain order. We still operate within standard norms that apply to a particular group of people. LLMs just go about it in a different way, but they achieve the same general result. If I’m drawing a human, that means there’s a ‘hand’ here, and a ‘head’ there. ‘Head’ is a weird combination of pixels that mostly looks like this, ‘hand’ looks kinda like that. All depends on how the model is structured, but tell me that’s not very similar to a simplified version of how humans operate.
hascat@programming.dev 11 months ago
That’s not the point though. The point is that the human comedian and the AI both benefit from consuming creative works covered by copyright.
teawrecks@sopuli.xyz 11 months ago
Neither is an LLM. What you’re describing is a primitive Markov chain.
You may not like it, but brains really are just glorified pattern recognition and generation machines. So yes, “monkey see thing to draw thing”, except a really complicated version of that.
Think of it this way: if your brain weren’t a reorganization and regurgitation of the things you have observed before, it would just generate random noise. There’s no such thing as “truly original” art; if there were, it would be random noise. Every single word either of us is typing is the direct result of everything you and I have observed before this moment.
Ironic, to say the least.
The point you should be making is that a corporation will make the above argument up to, but not including, the point where they have to treat AIs ethically. So that’s the way to beat them. If they’re going to argue that they have created something that learns and creates content like a human brain, then they should need to treat it like a human: ensure it is well compensated, ensure it isn’t being overworked or enslaved, ensure it is being treated “humanely”. If they don’t want to do that, if they want it to just be a well-built machine, then they need to license all the proprietary data they used to build it. Make them pick a lane.
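For readers unfamiliar with the "primitive Markov chain" mentioned above, here is a minimal sketch: a generator where the next word depends only on the current word, sampled from observed frequencies. The corpus and seed are invented for illustration; this is the simple model the earlier description of word-by-word probability actually matches, not a full LLM:

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed so the toy example is repeatable
corpus = "see the thing draw the thing see the monkey draw the monkey".split()

# Record every word observed after each word (a first-order Markov chain).
transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

def generate(start, n_words):
    """Walk the chain: repeatedly sample a successor of the current word."""
    word, out = start, [start]
    for _ in range(n_words):
        word = random.choice(transitions[word])
        out.append(word)
    return " ".join(out)

print(generate("see", 4))
```

Sampling from `random.choice` (rather than always taking the most frequent successor) is what makes the output vary run to run when the seed isn't fixed.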
tryptaminev@feddit.de 11 months ago
You do know that comedians copy each other’s material all the time, though? Either making the same joke, or slightly adapting it?
So in the context of copyright vs. model training, I fail to see how the exact process of the model is relevant. In the end, copyrighted material goes in and material based on that copyrighted material comes out.
pupbiru@aussie.zone 11 months ago
you know how the neurons in our brain work, right?
because if not, well, it’s pretty similar… unless you say there’s a soul (in which case we can’t really have a conversation based on fact alone), we’re just big ol’ probability machines with tuned weights based on past experiences too
SuperSaiyanSwag@lemmy.zip 11 months ago
Am I a moron? How do you have more upvotes than the parent comment? Is it because you’re being more aggressive with your statement? I feel like you didn’t quite refute what the parent comment said. You’re just explaining how ChatGPT works, but you’re not really saying why it shouldn’t use our established media as a reference.
intensely_human@lemm.ee 11 months ago
Text prediction seems to be sufficient to explain all verbal communication to me. Until someone comes up with a use case that humans can handle but LLMs cannot (and I mean a specific use case, not general high-level concepts), I’m going to assume human verbal cognition works the same way as an LLM.
We are absolutely basing our responses on what words are likely to follow which other ones. It’s literally how a baby learns language from those around them.
luciole@beehaw.org 11 months ago
There’s this linguistic problem: when one word is used for two different things, it becomes difficult to tell them apart. “Training” or “learning” is a very poor choice of word to describe the calibration of a neural network. The actor and action are both fundamentally different from the accepted meaning. To start with, human learning is active whereas machine learning is strictly passive: it’s something done by someone, with the machine as a tool. Teachers know very well that’s not how it happens with humans.
When I compare training a neural network with how I trained to play the clarinet, I fail to see any parallel. The two are about as close as a horse and a seahorse.
intensely_human@lemm.ee 11 months ago
Not sure what you mean by passive. It takes a hell of a lot of electricity to train one of these LLMs so something is happening actively.
I often interact with ChatGPT 4 as if it were a child. I guide it through different kinds of mental problems, having it take notes and evaluate its own output, because I know our conversations become part of its training data.
It feels very much like teaching a kid to me.
sculd@beehaw.org 11 months ago
AIs are not humans. Humans cannot read millions of texts in seconds and cannot spit out millions of outputs at the same time.
Powderhorn@beehaw.org 11 months ago
A comedian walks on stage and says, “Why is there a mic here?”