By now, it should be pretty clear that this is no coincidence. AI scrapers are getting more and more aggressive, and since FOSS projects rely on public collaboration, a requirement private companies don't share, this is putting an extra burden on Open Source communities.

So let’s try to get more details, going back to Drew’s blog post. According to Drew, LLM crawlers don’t respect robots.txt directives and crawl expensive endpoints like git blame, every page of every git log, and every commit in the repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.
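To make the robots.txt point concrete, here is a hedged sketch of the kind of rules a code forge might publish to steer crawlers away from those expensive endpoints. The paths are purely illustrative (not SourceHut’s actual URL scheme), and wildcard matching, while honored by most large crawlers, isn’t part of the original robots.txt standard. The point is that well-behaved bots would follow rules like these, while the crawlers Drew describes simply ignore them:

```
# Hypothetical robots.txt for a code forge (illustrative paths only).
# Well-behaved crawlers skip the expensive endpoints listed here;
# the LLM crawlers Drew describes ignore these rules entirely.
User-agent: *
Disallow: /*/blame/
Disallow: /*/log/
Disallow: /*/commit/
Crawl-delay: 10
```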

Because of this, it’s hard to come up with a good set of mitigations. Drew says that several high-priority tasks have been delayed for weeks or months due to these interruptions, users have occasionally been affected (because it’s hard to distinguish bots from humans), and, of course, this causes occasional outages of SourceHut.
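To see why the usual per-IP rate limiting gets you nowhere against this pattern, consider a minimal sketch (the log entries and threshold below are made up for illustration): when each of tens of thousands of addresses makes a single request, no individual address ever crosses a per-IP threshold, even though the aggregate load is crushing.

```python
from collections import Counter

# Hypothetical access-log entries (ip, path) matching the pattern Drew
# describes: tens of thousands of distinct addresses, one request each.
requests = [
    ("203.0.113.7",   "/~user/repo/blame/main/src/main.c"),
    ("198.51.100.23", "/~user/repo/log?page=4812"),
    ("192.0.2.154",   "/~user/repo/commit/abc123"),
    # ...and so on, each request from a different address
]

PER_IP_THRESHOLD = 10  # a typical rate-limiter cutoff, chosen for illustration

hits_per_ip = Counter(ip for ip, _path in requests)
over_limit = [ip for ip, n in hits_per_ip.items() if n > PER_IP_THRESHOLD]

print(over_limit)  # [] -- no single address stands out, yet the total load is huge
```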