Last fall, I wrote about how the fear of AI was leading us to wall off the open internet in ways that would hurt everyone. At the time, I was worried about how companies were conflating legitimate concerns about bulk AI training with basic web accessibility. Not surprisingly, the situation has gotten worse. Now major news publishers are actively blocking the Internet Archive—one of the most important cultural preservation projects on the internet—because they’re worried AI companies might use it as a sneaky “backdoor” to access their content.
This is a mistake we’re going to regret for generations.
Nieman Lab reports that The Guardian, The New York Times, and others are now limiting what the Internet Archive can crawl and preserve:
When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit’s repository of over one trillion webpage snapshots.
Specifically, Hahn said The Guardian has taken steps to exclude itself from the Internet Archive’s APIs and filter out its article pages from the Wayback Machine’s URLs interface. The Guardian’s regional homepages, topic pages, and other landing pages will continue to appear in the Wayback Machine.
The Times has gone even further:
The New York Times confirmed to Nieman Lab that it’s actively “hard blocking” the Internet Archive’s crawlers. At the end of 2025, the Times also added one of those crawlers — archive.org_bot — to its robots.txt file, disallowing access to its content.
“We believe in the value of The New York Times’s human-led journalism and always want to ensure that our IP is being accessed and used lawfully,” said a Times spokesperson. “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization.”
I understand the concern here. I really do. News publishers are struggling, and watching AI companies hoover up their content to train models that might then, in some ways, compete with them for readers is genuinely frustrating. I run a publication myself, remember.
News Publishers Are Now Blocking The Internet Archive, And We May All Regret It
Submitted 2 hours ago by Powderhorn@beehaw.org to technology@beehaw.org
tal@lemmy.today 1 hour ago
I’m very far from sure that this is an effective way to block AI crawlers from pulling stories for training, if that’s their actual concern. Like…the rate of new stories just isn’t that high. This isn’t, say, Reddit, where someone trying to crawl the thing at least has to generate some abnormal traffic. Yeah, okay, maybe a human wouldn’t read all stories, but all you’d have to do is create a handful of paid accounts and then just pull the content, and I think that a bot would just fade into the noise. And my guess is that it is very likely that AI training companies will do that or something similar if knowledge of current news events is of interest to people.
You could use a canary trap, and that might be more-effective:
en.wikipedia.org/wiki/Canary_trap
There, you generate slightly different versions of articles for different people. Say that you have 100 million subscribers.
ln(100000000)/l(2)=26.57…So you’re talking about 27 bits of information that need to go into the article to uniquely describe each. The AI is going to be lossy, I imagine that but you can potentially manage to produce 27 unique bits of information per article that can reasonably-reliably be remembered by an AI after training. That’s 27 different memorable items that need to show up in either Form A or Form B. Then you search to see what a new LLM knows about and ban the bot identified.Cartographers have done that, introduced minor, intentional errors to see what errors maps used to see whether they were derived from their map.
en.wikipedia.org/wiki/Trap_street
en.wikipedia.org/wiki/Phantom_island
That has weaknesses. It’s possible to defeat that by requesting multiple versions using different bot accounts and identifying divergences and maybe merging them.
And even if you ban an account, it’s trivial to just create a new one, decoupled from the old one. Thus, there isn’t much that a media company can realistically do about it, as long as the generated material doesn’t rise to the level of a derived work and thus copyright infringement (and this is in the legal sense of derived — simply training something on something else isn’t sufficient to make it a derived work from a copyright law standpoint, any more than you reading a news report and then talking to someone else about it is).
Getting back to the citation issue…
Some news companies do keep archives (and often selling access to archives is a premium service), so for some, that might cover some of the “inability to cite” problem that not having Internet Archive archives produces, as long as the company doesn’t go under. It doesn’t help with a problem that many news companies have a tendency to silently modify articles without reliably listing errata, and that having an Internet Archive copy can be helpful. There are also some issues that I haven’t yet seen become widespread but worried about, like where a news source might provide different articles to people in different regions; there, having a trusted source like the Internet Archive can avoid that, and that could become a problem.
tal@lemmy.today 1 hour ago
Actually, thinking about this…a more-promising approach might be deterrent via poisoning the information source. Not bulletproof, but that might have some potential.
So, the idea here is that what you’d do there is to create a webpage that looks, to a human, as if only the desired information shows up.
But you include false information as well. Not just an insignificant difference, as with a canary trap, or a real error intended not to have minimal impact, only to identify an information source, as with a trap street. But outright wrong information, stuff where reliance on the stuff would potentially be really damaging to people relying on the information.
You stuff that information into the page in a way that a human wouldn’t readily see. Maybe you cover that text up with an overlay or something. That’s not ideal, and someone browsing using, say, a text-mode browser like
lynxmight see the poison, but you could probably make that work for most users. That has some nice characteristics:You don’t have to deal with the question of whether the information rises to the level of copyright infringement or not.
Legal enforcement, which is especially difficult across international borders — The Pirate Bay continues to operate to this day, for example — doesn’t come up as an issue. You’re deterring via a different route.
The Internet Archive can still archive the pages.
Someone could make a bot that post-processes what you do, but you could sporadically change up your approach, change it over time, and the question for an AI company is whether it’s easier and safer to just license your content or to risk poisoned content slipping into their model.
I think the real question is whether someone could reliably make a mechanism that’s a general defeat for that. For example, most AI companies probably are just using raw text today for efficiency, but for specifically news sources known to do this, one could generate a screenshot of a page in a browser and then OCR the text. The media company could maybe still take advantage of ways in which generalist OCR and human vision differ — like, maybe humans can’t see text that’s 1% gray on a black background, but OCR software sees it just fine, so that’d be a place to insert poison. Or maybe the page displays poisoned information for a fraction of a second, long enough to be screenshotted by a bot, and then it vanishes before a human would have time to read it.
shrugs
cmnybo@discuss.tchncs.de 40 minutes ago
Hidden junk that a person wouldn’t see would likely be picked up by a screen reader. That would make the site much harder to use for a visually impaired person.
Petter1@discuss.tchncs.de 1 hour ago
But, AI chatbots aren’t trained ob recent data, they just do a good old google style search and take the content of the top articles as reference while generating the response…