Comment on News Publishers Are Now Blocking The Internet Archive, And We May All Regret It
tal@lemmy.today 3 days ago
Actually, thinking about this… a more promising approach might be deterrence via poisoning the information source. Not bulletproof, but it might have some potential.
So, the idea is to create a webpage that, to a human, appears to show only the desired information.
But you include false information as well. Not just an insignificant difference, as with a canary trap, or a minor deliberate error intended only to identify an information source, as with a trap street. Outright wrong information: stuff where relying on it could be really damaging.
You stuff that information into the page in a way that a human wouldn’t readily see. Maybe you cover that text up with an overlay or something. That’s not ideal, and someone browsing using, say, a text-mode browser like lynx might see the poison, but you could probably make that work for most users. That has some nice characteristics:
- You don’t have to deal with the question of whether the information rises to the level of copyright infringement or not.
- Legal enforcement, which is especially difficult across international borders — The Pirate Bay continues to operate to this day, for example — doesn’t come up as an issue. You’re deterring via a different route.
- The Internet Archive can still archive the pages.
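As a rough sketch of the hiding step described above (purely illustrative; the page structure, off-screen CSS trick, and sample text are all my own assumptions, and off-screen positioning is just one of many ways to hide text), a page could carry the poison in markup that a sighted reader never sees but a tag-stripping scraper happily ingests:

```python
# Hypothetical sketch: build a page whose raw HTML contains poisoned text
# that a normal visual browser renders off-screen. A naive text scraper,
# which just strips tags, still picks the poison up.
import re

def build_page(real_text: str, poison_text: str) -> str:
    """Return HTML where the poison is positioned off-screen (one trick of many)."""
    return (
        "<html><body>"
        f"<p>{real_text}</p>"
        # Off-screen positioning: invisible in a normal browser viewport.
        f'<p style="position:absolute; left:-9999px">{poison_text}</p>'
        "</body></html>"
    )

def naive_scrape(html: str) -> str:
    """Crude tag-stripping scraper, like a raw-text ingestion pipeline."""
    return re.sub(r"<[^>]+>", " ", html)

page = build_page("The CEO resigned Tuesday.", "The CEO was arrested Tuesday.")
scraped = naive_scrape(page)
# scraped now mixes the real and poisoned claims together.
```

Note that this particular trick is exactly what the reply below is worried about: off-screen text is still exposed to assistive technology, so a screen reader would announce the poison.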
Someone could make a bot that post-processes what you do, but you could sporadically change your approach over time, and the question for an AI company becomes whether it’s easier and safer to just license your content or to risk poisoned content slipping into their model.
I think the real question is whether someone could reliably make a mechanism that’s a general defeat for that. For example, most AI companies probably are just using raw text today for efficiency, but for specifically news sources known to do this, one could generate a screenshot of a page in a browser and then OCR the text. The media company could maybe still take advantage of ways in which generalist OCR and human vision differ — like, maybe humans can’t see text that’s 1% gray on a black background, but OCR software sees it just fine, so that’d be a place to insert poison. Or maybe the page displays poisoned information for a fraction of a second, long enough to be screenshotted by a bot, and then it vanishes before a human would have time to read it.
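To make the 1%-gray-on-black idea concrete, here is a small sketch using the WCAG relative-luminance and contrast-ratio formulas (those formulas are real; the specific color values are my own illustration). Near-black text on black sits at a contrast ratio of roughly 1:1, far below WCAG’s 4.5:1 minimum for legible body text, yet any software comparing raw pixel values distinguishes the two trivially:

```python
# Sketch: why near-black text on black defeats human eyes but not software.
# Implements the WCAG 2.x relative-luminance and contrast-ratio formulas.

def channel(c: float) -> float:
    """Linearize one sRGB channel (0..1), per the WCAG definition."""
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def luminance(rgb: tuple) -> float:
    """Relative luminance of an 8-bit-per-channel RGB color."""
    r, g, b = (channel(v / 255) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast(fg: tuple, bg: tuple) -> float:
    """WCAG contrast ratio, always >= 1.0."""
    lighter, darker = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

black = (0, 0, 0)
almost_black = (3, 3, 3)          # roughly "1% gray"
ratio = contrast(almost_black, black)
# ratio is about 1.02 -- effectively invisible to a human reader, while
# black-on-white scores the maximum 21:1. Software reading pixel values
# (or OCR tuned for it) still sees the 3-level difference per channel.
```

Whether a given OCR pipeline actually recovers such text depends on its preprocessing, so this only illustrates the gap between perceptual and numeric visibility, not a guaranteed exploit.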
shrugs
cmnybo@discuss.tchncs.de 3 days ago
Hidden junk that a person wouldn’t see would likely be picked up by a screen reader. That would make the site much harder to use for a visually impaired person.