Comment on "If AI spits out stuff it's been trained on"
Ragdoll_X@lemmy.world 4 weeks ago
"…doesn’t it follow that AI-generated CSAM can only be generated if the AI has been trained on CSAM?"
Not quite, since the whole thing with image generators is that they’re able to combine different concepts to create new images. That’s why DALL-E 2 was able to create images of an astronaut riding a horse on the moon, even though it never saw such images, and probably never even saw astronauts and horses in the same image. So in theory these models can combine the concepts of porn and children even if they never actually saw any CSAM during training, though I’m not gonna thoroughly test this possibility myself.
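Just to illustrate what I mean by combining concepts through the prompt, here's a minimal sketch using the Hugging Face diffusers library. The model ID, prompt, and file name are only illustrative examples, not anything specific to the article:

```python
# Minimal sketch: composing concepts purely through the text prompt.
# Assumes the `diffusers` library and a CUDA GPU; the model ID below
# is just an example checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The model has almost certainly never seen this exact scene in training,
# but it can still compose "astronaut", "horse" and "moon" into one image.
image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut_horse_moon.png")
```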
Still, as the article says, since Stable Diffusion is publicly available, someone can train it on CSAM images on their own computer specifically to make the model better at generating them. Based on my limited understanding of the litigation that Stability AI is currently dealing with (1, 2), whether they can be sued for how users employ their models will depend on how exactly these cases play out, and, if the plaintiffs do win, on whether their arguments can be applied outside of copyright law to cover CSAM images generated with SD.
"My question is: why aren’t OpenAI, Google, Microsoft, Anthropic… sued for possession of CSAM? It’s clearly in their training datasets."
Well, they don’t own the LAION dataset, which is what their image generators are trained on. And to sue either LAION or the companies that use its datasets, you’d probably have to clear a very high bar: proving that they have CSAM images downloaded, know that they are there, and have not removed them. It’s similar to how social media companies can’t be held liable for users posting CSAM to their websites if they can show that they’re actually trying to remove these images. Some things will slip through the cracks, but if you show that you’re genuinely trying to deal with the problem you won’t get sued.
LAION actually doesn’t even provide the images themselves, only links to images on the internet, and they do a lot of screening to remove potentially illegal content. As they mention in this article, there was a report showing that some CSAM images were linked in the dataset. If my memory doesn’t fail me, the researchers who found this did so by looking at the stored hashes of the images, which were matched against known CSAM hashes, but the images themselves had already been removed from the internet, so LAION technically only linked to unavailable images. Still, they took the dataset down and revised it after that report.
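For a rough idea of what that kind of screening looks like, here's a hypothetical sketch of hash-based filtering over LAION-style records, which store URLs and captions rather than image bytes. Real pipelines use perceptual hash lists (e.g. PhotoDNA-style hashes from child-safety organizations), not the plain SHA-256 denylist used here for illustration, and the URLs, entries, and function names are all made up:

```python
# Hypothetical sketch: screening a link-only dataset against a denylist of
# hashes of known illegal images. If a linked image has been removed from
# the web, there is nothing left to fetch and hash.
import hashlib
import requests

# Example LAION-style entries: links plus metadata, no image bytes.
entries = [
    {"url": "https://example.com/cat.jpg", "caption": "a photo of a cat"},
]

# Placeholder denylist; in practice this would be a vetted hash list,
# not an empty set of SHA-256 digests.
KNOWN_BAD_HASHES: set[str] = set()

def is_flagged(url: str) -> bool:
    """Fetch the linked image and compare its hash against the denylist."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return hashlib.sha256(resp.content).hexdigest() in KNOWN_BAD_HASHES

clean_entries = [e for e in entries if not is_flagged(e["url"])]
```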