If AI spits out stuff it's been trained on

Submitted ⁨⁨10⁩ ⁨months⁩ ago⁩ by ⁨ExtremeDullard@lemmy.sdf.org⁩ to ⁨[deleted]⁩

doesn’t it follow that AI-generated CSAM can only be generated if the AI has been trained on CSAM?

This article even explicitely says as much.

My question is: why aren’t OpenAI, Google, Microsoft, Anthropic… sued for possession of CSAM? It’s clearly in their training datasets.

source

Comments

Sort:hotnew top

hendrik@palaver.p3x.de ⁨10⁩ ⁨months⁩ ago
Well, it can draw an astronaut on a horse, and I doubt it had seen lots of astronauts on horses...

source
- ExtremeDullard@lemmy.sdf.org ⁨10⁩ ⁨months⁩ ago
  Yeah but the article suggests that pedos train their local AI on existing CSAM, which would indicate that it’s somehow needed to generate AI-generated CSAM. Otherwise why would they bother? They’d just feed images of children in innocent settings and images of ordinary porn to get their local AI to generate CSAM.
  
  source
  - rikudou@lemmings.world ⁨10⁩ ⁨months⁩ ago
    How do they know that? Did the pedos text them to let them know? Sounds very made up.
    
    source
    -> View More Comments
  - hendrik@palaver.p3x.de ⁨10⁩ ⁨months⁩ ago
    It's certainly technically possible. I suspect these AI models just aren't good at it. So the pedophiles need to train them on actual images.
    
    I can imagine for example AI doesn't know what puberty is since it has in fact not seen a lot of naked children. It would try to infer from all the internet porn it's seen, and draw any female with big breasts, disregarding age. And that's not how children actually look.
    
    I haven't tried, since it's illegal where I live. But that's my suspicion why pedophiles bother with training models.
    
    source
  - GBU_28@lemm.ee ⁨10⁩ ⁨months⁩ ago
    Training an existing model on a specific set of new data is known as “fine tuning”.
    
    A base model has broad world knowledge and the ability to generate outputs of things it hasn’t specifically seen, but a tuned model will provide “better” (fucking yuck to even write it) results.
    
    The closer your training data is to your desired result, the better.
    
    source
  - Deceptichum@quokk.au ⁨10⁩ ⁨months⁩ ago
    That’s not exactly how it works.
    
    It can “understand” different concepts and mix them, without having to see the combination before hand.
    
    As for the training thing, that would probably be more LORA. They’re like add-ons you can put on your AI to draw certain things better like a character, a pose, etc. not needed for the base model.
    
    source
  - MolochAlter@lemmy.world ⁨10⁩ ⁨months⁩ ago
    Why wouldn’t they? They have it on hand and it would obviously yield “better” results for their intended use case.
    
    If you’re going as far as trying to generate AI csam you’re probably quite deep in that hole already, why settle for less?
    
    source
  - AnAmericanPotato@programming.dev ⁨10⁩ ⁨months⁩ ago
    
    which would indicate that it’s somehow needed to generate AI-generated CSAM
    
    This is not strictly true in general. Generative AI is able to produce output that is not in the training data, by learning a broad range of concepts and applying them in novel ways. I can generate an image of a rollerskating astronaut even if there are no rollerskating astronauts in the training data.
    
    It is true that some training sets include CSAM, at least in the past. Back in 2023, researches found a few thousand such images in the LAION-5B dataset (roughly one per million images). 404 Media has an excellent article with details: www.404media.co/laion-datasets-removed-stanford-c…
    
    On learning of this, LAION took down their database until it could properly cleaned. Source: laion.ai/notes/laion-maintenance/
    
    Those images were collected from the public web. LAION took steps to avoid linking to illicit content (details in the link above), but clearly it’s an imperfect system. God only knows what closed companies (OpenAI, Google, etc.) are doing. With open data sets, at least any interested parties can review, verify, and report this stuff. With closed data sets, who knows?
    
    source
frightful_hobgoblin@lemmy.ml ⁨10⁩ ⁨months⁩ ago
a GPT can produce things it’s never seen.

It can produce a galaxy made out of dog food; doesn’t mean it was trained on pictures of galaxies made out of dog food.

source
Ragdoll_X@lemmy.world ⁨10⁩ ⁨months⁩ ago

doesn’t it follow that AI-generated CSAM can only be generated if the AI has been trained on CSAM?

Not quite, since the whole thing with image generators is that they’re able to combine different concepts to create new images. That’s why DALL-E 2 was able to create a images of an astronaut riding a horse on the moon, even though it never saw such images, and probably never even saw astronauts and horses in the same image. So in theory these models can combine the concept of porn and children even if they never actually saw any CSAM during training, though I’m not gonna thoroughly test this possibility myself.

Still, as the article says, since Stable Diffusion is publicly available someone can train it on CSAM images on their own computer specifically to make the model better at generating them. Based on my limited understanding of the litigations that Stability AI is currently dealing with (1, 2), whether they can be sued for how users employ their models will depend on how exactly these cases play out, and if the plaintiffs do win, whether their arguments can be applied outside of copyright law to include CSAM images generated with SD.

My question is: why aren’t OpenAI, Google, Microsoft, Anthropic… sued for possession of CSAM? It’s clearly in their training datasets.

Well they don’t own the LAION dataset, which is what their image generators are trained on. And to sue either LAION or the companies that use their datasets you’d probably have to clear a very high bar of proving that they have CSAM images downloaded, know that they are there and have not removed them. It’s similar to how social media companies can’t be held liable for users posting CSAM to their website if they can show that they’re actually trying to remove these images. Some things will slip through the cracks, but if you show that you’re actually trying to deal with the problem you won’t get sued.

LAION actually doesn’t even provide the images themselves, only linking to images on the internet, and they do a lot of screening to remove potentially illegal content. As they mention in this article there was a report showing that some CSAM images were linked in the dataset, but if my memory doesn’t fail me the researchers who found this did so by looking at the stored hashes of the images, which were matched to known CSAM hashes, but the images themselves had already been removed from the internet, so LAION technically only linked to unavailable images. Still they took down and revised the dataset after this report.

source
PM_ME_VINTAGE_30S@lemmy.sdf.org ⁨10⁩ ⁨months⁩ ago

If AI spits out stuff it’s been trained on

For Stable Diffusion, it really doesn’t just spit out what it’s trained on. Very loosely, it starts with white noise, then adds noise and denoises the result based on your prompt, and it keeps doing this until it converges to a representation of your prompt.

IMO your premise is closer to true in practice, but still not strictly true, about large language models.

source
- notfromhere@lemmy.ml ⁨10⁩ ⁨months⁩ ago
  It’s akin to virtually starting with a block of marble and removing every part (pixel) that isn’t the resulting image. Crazy how it works.
  
  source
rikudou@lemmings.world ⁨10⁩ ⁨months⁩ ago
The article is bullshit that wants to stir shit up for more clicks.

You don’t need a single CSAM image to train AI to make fake CSAM. In fact, if you used the images from the database of known CSAM, you’d get very shit results because most of them are very old and thus the quality most likely sucks.

Additionally, in another comment you mention that it’s users training their models locally, so that answers your 2nd question of why companies are not sued: they don’t have CSAM in their training dataset.

source
BradleyUffner@lemmy.world ⁨10⁩ ⁨months⁩ ago
The AI can generate a picture of cows dancing with roombas on the moon. Do you think it was trained on images of cows dancing with roombas on the moon?

source
- xia@lemmy.sdf.org ⁨10⁩ ⁨months⁩ ago
  Individually, yes. Thousands of cows, thousands of "dancing"s, thousands of roombas, and thousands of "on the moon"s.
  
  source
  - LovableSidekick@lemmy.world ⁨10⁩ ⁨months⁩ ago
    But a living human artist also learned to draw dancing cows and roombas on the moon that way. It just didn’t take thousands.
    
    source
  - SplashJackson@lemmy.ca ⁨10⁩ ⁨months⁩ ago
    And they all typed Shakespeare!
    
    source
    -> View More Comments
Hawk@lemmynsfw.com ⁨10⁩ ⁨months⁩ ago
It doesn’t need CSAM in the dataset to generate images that would be considered CSAM.

I’m sure they take good effort to stay away from that stuff as it’s bad for business.

source
Free_Opinions@feddit.uk ⁨10⁩ ⁨months⁩ ago
First of all, it’s by definition not CSAM if it’s AI generated. It’s simulated CSAM - no people were harmed doing it. That happened when the training data was created.

However it’s not necessary that such content even exists in the training data. Just like ChatGPT can generate sentences it has never seen before, image generators can also generate pictures it has not seen before. Ofcourse the results will be more accurate if that’s what it has been trained on but it’s not strictly necessary. It just takes a skilled person to write the prompt.

My understanding is that the simulated CSAM content you’re talking about has been made by people running their software locally and having provided the training data themselves.

source
- Buffalox@lemmy.world ⁨10⁩ ⁨months⁩ ago
  
  First of all, it’s by definition not CSAM if it’s AI generated. It’s simulated CSAM
  
  This is blatantly false. It’s also illegal and you can go to prison for owning selling or making child Lolita dolls.
  
  source
  - frightful_hobgoblin@lemmy.ml ⁨10⁩ ⁨months⁩ ago
    Dumb internet argument from here on down; advise the reader to do something else with their time.
    
    source
  - Free_Opinions@feddit.uk ⁨10⁩ ⁨months⁩ ago
    What’s blatantly false about what I said?
    
    source
    -> View More Comments
- ExtremeDullard@lemmy.sdf.org ⁨10⁩ ⁨months⁩ ago
  
  it’s by definition not CSAM if it’s AI generated
  
  Tell that to the judge. People caught with machine-made imagery go to the slammer just as much as those caught with the real McCoy.
  
  source
  - HK65@sopuli.xyz ⁨10⁩ ⁨months⁩ ago
    Have there been cases like that already?
    
    source
  - Free_Opinions@feddit.uk ⁨10⁩ ⁨months⁩ ago
    It’s not legal advice I’m giving here.
    
    source
ChaoticNeutralCzech@feddit.org ⁨10⁩ ⁨months⁩ ago
It probably won’t yield good results for the literal query “child porn” because such content on the open web is censored, but I’m pretty sure degenerates know workarounds such as “young, short, naked, flat chested, no pubic hair”, all of which exist plentifully in isolation. Just my guess, I haven’t tried of course.

source
LovableSidekick@lemmy.world ⁨10⁩ ⁨months⁩ ago
My question is why this logic doesn’t apply to anybody who learns about anything and goes on to use it in their work without explicit permission.

source
- ByteJunk@lemmy.world ⁨10⁩ ⁨months⁩ ago
  End stage capitalism of the brain, all your ideas are ours and you owe us money for thinking them.
  
  What a great idea there bud
  
  source
  - LovableSidekick@lemmy.world ⁨10⁩ ⁨months⁩ ago
    “All your ideas are ours and you owe us money for thinking them” is actually a good summary of anti-AI sentiment.
    
    source
    -> View More Comments
YungOnions@lemmy.world ⁨10⁩ ⁨months⁩ ago

Sexton says criminals are using older versions of AI models and fine-tuning them to create illegal material of children. This involves feeding a model existing abuse images or photos of people’s faces, allowing the AI to create images of specific individuals. “We’re seeing fine-tuned models which create new imagery of existing victims,” Sexton says. Perpetrators are “exchanging hundreds of new images of existing victims” and making requests about individuals, he says. Some threads on dark web forums share sets of faces of victims, the research says, and one thread was called: “Photo Resources for AI and Deepfaking Specific Girls.”

The model hasn’t necessarily been trained with CSAM, rather you can create things called LORAs which help influence the image output of a model so that it’s better at producing very specific content that it may have struggled with before. For example I downloaded some recently that help Stable Diffusion create better images of Battleships from Warhammer 40k. My guess is that criminals are creating their own versions for kiddy porn etc.

source
- OhNoMoreLemmy@lemmy.ml ⁨10⁩ ⁨months⁩ ago
  This is one of those things where both are likely to be true. All webscale datasets have a problem with porn and csam, and it’s like that people wanting to generate csam use their own fine tuned models.
  
  Here’s an example story. …stanford.edu/…/investigation-finds-ai-image-gene… and it’s very likely that this was the tip of the iceberg, and there’s more csam still in these datasets.
  
  source
southsamurai@sh.itjust.works ⁨10⁩ ⁨months⁩ ago
I think you misunderstand what’s happening.

It isn’t that, as an example to represent the idea, openai is training their models on kiddie porn.

It’s that people are taking ai software, and then training it on their existing material. The wired article even specifically says they’re issuing older versions of the software to bypass safeguards that are in place to prevent it now.

This isn’t to say that any of the companies involved in offering generative software don’t have such imagery in the data used to train their models. But they wouldn’t have to possess it for it to be in there. Most of those assholes just grabbed giant datasets and plugged them in. They even used scrapers for some of it. So all it would take is them accessing some of it unintentionally for their software to end up able to generate new material. They don’t need to store anything once the software is trained.

Currently, none of them lack some degree of prevention in their products to prevent it being used for that. How good those protections are, I have zero clue. But they’ve all made noises about it.

But don’t forget, one of the earlier iterations of software designed to identify kiddie porn was trained on seized materials. The point of that is that there are exceptions to possession. The various agencies that investigate sexual abuse of minors tend to keep materials because they need it to track down victims, have as evidence, etc. It’s that body of data that made detection something that can be automated. While I have no idea if it happened, it wouldn’t be surprising if some company or another did scrape that data at some point. That’s just a tangent rather than part of your question.

So, the reason that they haven’t been “sued” is that they likely don’t have any materials to be “sued” for in the first place.

Besides, not all generated materials are made based on existing supplies. Some of it is made akin to a deepfake, where someone’s face is pasted onto a different body. So, they can take materials of perfectly legal adults that look young, slap real or fictional children’s faces onto them, and have new stuff to spread around. That doesn’t require any original material at all. You could, as I understand it, train an generative model on that and it would turn out realistic fully generative materials. All of that is still illegal, but it’s created differently.

source
Dkarma@lemmy.world ⁨10⁩ ⁨months⁩ ago
Tldr op doesn’t understand how ML models work

source
DragonsInARoom@lemmy.world ⁨10⁩ ⁨months⁩ ago
I would imagine that ai generated csam can be “had” in big tech ai in two ways: contamination, and training from an analog. Contamination would be the training passes of the ai using the data being introduced into an uncontaminated training pool. (Not introducing raw csam material). Training from analogous data is what the name states, get as close to the csam material as possible without raising eyebrows. Or the criminals could train off of “fresh” unknown to lawenforcment csam.

source
Battle_Masker@lemmy.world ⁨10⁩ ⁨months⁩ ago
those are big companies. they have more legal protection than anyone in the world. and money if judges/law enforcement still consider moving a case forward

source