Comment on If AI spits out stuff it's been trained on
AnAmericanPotato@programming.dev 1 day ago
which would indicate that it’s somehow needed to generate AI-generated CSAM
This is not strictly true in general. Generative AI is able to produce output that is not in the training data, by learning a broad range of concepts and applying them in novel ways. I can generate an image of a rollerskating astronaut even if there are no rollerskating astronauts in the training data.
It is true that some training sets have included CSAM, at least in the past. Back in 2023, researchers found a few thousand such images in the LAION-5B dataset (roughly one per million images). 404 Media has an excellent article with details: www.404media.co/laion-datasets-removed-stanford-c…
On learning of this, LAION took down their database until it could be properly cleaned. Source: laion.ai/notes/laion-maintenance/
Those images were collected from the public web. LAION took steps to avoid linking to illicit content (details in the link above), but clearly it’s an imperfect system. God only knows what closed companies (OpenAI, Google, etc.) are doing. With open data sets, at least any interested parties can review, verify, and report this stuff. With closed data sets, who knows?