AI crawler attack

Last week I wanted to demo my Open DMARC Analyzer installation to my colleagues. I shared my browser window, entered the analyzer's URL and ... waited for 15 seconds. Clicking on links also took over 10 seconds.

Afterwards I checked other websites on my server and also found them unresponsive; they took more than 10 seconds to load - even static pages. Sometimes they didn't load at all and Firefox showed a "could not connect" error.

The server load was ok - around 300% on a 12 core system. Next I checked the web server's load and was stunned: Apache's server-status page showed that all 150 workers were busy.

Nearly all of them were fetching content from my git server:

root@ahso5:~> curl -s localhost/server-status | html2text
[...]
0-0  x 195/ R 0.70 0  0   309  0.0   0.66  0.66 176.236.195.170 http/1.1 git.cweiske.de:443
2-0  x 123/ R 0.57 0  0   331  0.0   0.66  0.66 a100:a08b:7a85: http/1.1 git.cweiske.de:443
4-0  x 94/  R 0.48 28 0   37   0.0   1.57  1.57 81:e464:5ff7:   http/1.1 git.cweiske.de:443
5-0  x 84/  R 0.49 0  0   767  0.0   0.67  0.67 e254:8545:2fa1: http/1.1 git.cweiske.de:443
6-0  x 79/  R 0.47 4  0   59   0.0   0.46  0.46 181.94.228.41   http/1.1 git.cweiske.de:443
9-0  x 64/  R 0.43 0  0   211  0.0   0.43  0.43 177.170.151.186 http/1.1 git.cweiske.de:443
10-0 x 94/  R 0.47 3  0   25   0.0   0.50  0.50 168.220.176.186 http/1.1 git.cweiske.de:443
11-0 x 99/  R 0.48 0  0   1055 0.0   0.52  0.52 560e:54ad:4599: http/1.1 git.cweiske.de:443
12-0 x 145/ R 0.59 0  0   652  0.0   0.89  0.89 181.54.0.0      http/1.1 git.cweiske.de:443
13-0 x 120/ R 0.50 27 0   44   0.0   0.67  0.67 188.30.63.72    http/1.1 git.cweiske.de:443
[...]
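To see at a glance which vhost occupies the workers, the rows of such a listing can be tallied by their last column. This is only a minimal sketch: it assumes the html2text output keeps the `host:port` value as the last whitespace-separated field of every worker row, and the function name `count_by_vhost` is made up for this example.

```shell
# count_by_vhost: tally server-status worker rows by their last field (vhost:port).
# Assumption: worker rows start with "slot-generation", e.g. "0-0" or "13-0";
# all other lines (headers, "[...]") are ignored.
count_by_vhost() {
    awk '$1 ~ /^[0-9]+-[0-9]+$/ { n[$NF]++ }
         END { for (v in n) print n[v], v }' | sort -rn
}

# Possible usage on the server:
#   curl -s localhost/server-status | html2text | count_by_vhost
```

The busiest vhost ends up on the first line of the output.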

Disable git server

As a first measure, I wanted to stop returning any content from my git server and decided to answer all requests on that domain with "HTTP/1.1 402 Payment Required" errors.

/etc/apache/sites-available/cweiske/git.cweiske.de.conf

<VirtualHost *:443>
    [...]
    ServerAdmin "fuckoff@ai.bots"
    RedirectMatch 402 "^"
    ErrorDocument 402 "Fuckoff, AI bots"
    ServerSignature Off
</VirtualHost>

The RedirectMatch alone showed an "internal server error" message in the browser. Adding the error document solved it.

Unfortunately, the problem was not solved - loading the error page was still slow, took more than 10 seconds or timed out. I had so many AI crawler bots attacking my server that even sending out the 1k "402" error response was not fast enough and saturated the 150 workers:

Munin: Apache processes, day view

More workers

To cope with the onslaught, I increased the workers from 150 to 512 with a config option:

/etc/apache2/conf-enabled/90-cweiske-workers.conf

MaxRequestWorkers 512
ServerLimit 512

This temporarily worked until...

root@ahso5:~> curl -s localhost/server-status | html2text
Current Time: Monday, 20-Apr-2026 15:13:53 CEST
Restart Time: Monday, 20-Apr-2026 15:01:06 CEST
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 12 minutes 46 seconds
Server load: 1.51 1.49 1.04
Total accesses: 76795 - Total Traffic: 538.0 MB - Total Duration: 1199990
CPU Usage: u74.88 s62.19 cu73.39 cs58.62 - 35.1% CPU load
100 requests/sec - 0.7 MB/second - 7.2 kB/request - 15.6259 ms/request
512 requests currently being processed, 0 workers gracefully restarting, 0 idle workers
 
RRRRRRRRRRRRRRRRRCRRRRRRRRRCRRRRRRRRWRRRRRRRRRCRCRCRCCRRRCCRRRRC
RRRRCRRRRRRRRRCRCCRRRRRRRRRRRRRRRCRRCRRRRCRCRRRRRRCRRRRCRRCCRRRR
RRCRCCRRRRRRCRRRCCRRRRRRRRRRCRCRRRRRRRRRRRRCRCRCRRRCCRRRRRRCKRRC
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRKRRRRCRCRRRRCRRCCRCRRCRRRRRRRRR
CRRRRCRRRRRCCRRCRRRRRRRRRRCRCRCRRRRRRCRRCRRRRCRRRRRRRCRCRRRRRRRR
RCRRRRRRRRCRRCRRRRCRRRRRCCCKRCRRRKRRRKRRCRCRRRRCCRCRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRCRRRRCRRRRRRRRRRRRRRWRRRRRRRRRRRRCRRRRRRRCRRRRRR
RRRRKRRRRRRRRRRCRRRRRRRRCRCRRRRRRRRRCRCRRRRRRRRRRRRRRRRCRRRRRRCR

512 requests currently being processed. The bots adjusted and just requested more error responses!

Luckily this was only a spike; the number of parallel requests then averaged between 200 and 300, at about 80 requests per second:

$ tail -f /var/log/apache2/cweiske/git.cweiske.de-access.log | pv --line-mode --rate > /dev/null
[80,2 /s]
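When pv is not at hand, the rate can also be derived from the timestamps in the log itself. This is a sketch under the assumption that the access log uses the common or combined log format, where the fourth field is the timestamp with one-second resolution; `requests_per_second` is a hypothetical helper name.

```shell
# requests_per_second: print "count timestamp" per second, busiest second first.
# Assumption: field 4 looks like "[20/Apr/2026:15:13:53" (common/combined log format).
requests_per_second() {
    awk '{ sub(/^\[/, "", $4); n[$4]++ }
         END { for (t in n) print n[t], t }' | sort -rn
}

# Possible usage:
#   tail -n 100000 /var/log/apache2/cweiske/git.cweiske.de-access.log | requests_per_second | head
```

Unlike the live view, this works on old logs as well, so a spike can still be inspected after the fact.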

After ~3 hours, the bot controller noticed that no sensible responses came back from my server and the requests fell back to the normal ~20 per second.

Clients

Looking at the awstats log analytics I saw that most of the clients only fetched a single URL and then vanished:

Number of visits: 1,295,538
Hits: 1,544,307
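Dividing the two numbers makes the pattern explicit. The helper below is only illustrative (`hits_per_visit` is not part of awstats):

```shell
# hits_per_visit VISITS HITS: average number of requests per counted visit.
hits_per_visit() {
    LC_ALL=C awk -v v="$1" -v h="$2" 'BEGIN { printf "%.2f\n", h / v }'
}

hits_per_visit 1295538 1544307   # prints 1.19
```

That is about 1.19 hits per visit, so the typical "visitor" made one request and never came back.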

The "user-agent" header did not indicate bots: They masqueraded as normal browsers from all kinds of common operating systems with current version numbers.

Throwing 20 IP addresses into Maxmind's GeoIP demo page showed:

Requests are from all over the world - USA, Germany, Brazil, Iraq, Chile, ...
All requests came from "Cable/DSL" connections.

That makes them impossible to block: neither the user agent nor the IP address distinguishes them from regular visitors.

Limiting requests to the git vhost

My main problem was that the other websites on my server were unreachable because the bots downloaded every single page from the git server. The best solution would be to limit the parallel requests on this vhost, so that enough workers stay free for the other domains.

Apache2 has no native support for that (v1 had), but I found mod_vhost_limit. Compiled it, configured it and now I only allow 10 parallel requests to my git server.

Traffic

In 2026-01, the git vhost had 2.25 GiB of traffic. The April number is 120 GiB, and the month still has 4 days.

The sad thing is: Those git repositories take 419 MiB on the hard disk. When using git clone to download them, it would take ~500 MiB to get all of them at once.
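To put these figures into proportion, here is a small back-of-the-envelope calculation. It only uses the numbers from the text (120 GiB in April, 2.25 GiB in January, ~500 MiB for cloning everything); the function name `traffic_summary` is invented for this sketch.

```shell
# traffic_summary APRIL_GIB JANUARY_GIB CLONE_MIB
# Compares the two monthly totals and converts the larger one into
# "full clones of all repositories" (1 GiB = 1024 MiB).
traffic_summary() {
    LC_ALL=C awk -v a="$1" -v j="$2" -v c="$3" 'BEGIN {
        printf "growth vs. January: %.0fx\n", a / j
        printf "equivalent full clones: %.0f\n", a * 1024 / c
    }'
}

traffic_summary 120 2.25 500
```

The crawlers caused roughly 53 times the January traffic, which is the same volume as cloning every repository about 246 times.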

"AI" training data crawlers are dumb as hell, and their operators do not care that they waste resources and bring down servers one after the other.

When you work for one of those companies: I hate you.

Additional idea: Basic Auth for expensive pages

On 2026-05-15 my feed reader brought me this: Blocking DDoS from scraper bots the easy way via HTTP-401 Basic Auth.

I decided to implement that: Require a username and password for expensive pages like commit messages, diffs and file listings - but not for the other URL paths.

First I created an htpasswd file:

$ htpasswd -bc /etc/apache2/git-dummy.htpasswd let mein

and then I used that for some URL paths:

/etc/apache2/sites-available/cweiske/git.cweiske.de

<VirtualHost *:443>
    [...]
    <LocationMatch "/[^/]+\.git/(blob|commit|diff|log|patch|plain|snapshot|tree)">
        AuthType Basic
        AuthName "AI crawler block: Use let as username and mein as password."
        ErrorDocument 401 "AI crawler block: Use 'let' as username and 'mein' as password."
        AuthUserFile /etc/apache2/git-dummy.htpasswd
        Require valid-user
    </LocationMatch>
    [...]
</VirtualHost>

It turns out that browsers no longer show the text from the "AuthName" configuration, which is sent to the client in the WWW-Authenticate header:

WWW-Authenticate: Basic realm="AI crawler block: Use let as username and mein as password."

This is why the error page for status code 401 has to contain the same text.
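Once a visitor has typed in the credentials, the browser repeats the request with an Authorization header that is nothing more than the base64-encoded "username:password" pair (RFC 7617). The sketch below builds that header by hand to show how little is needed to pass the check; `basic_auth_header` is a made-up helper, and `$URL` in the comment is a placeholder for one of the protected paths.

```shell
# basic_auth_header USER PASSWORD: build the header a client sends after the
# login dialog was filled in (base64 of "user:password", no encryption).
basic_auth_header() {
    printf 'Authorization: Basic %s\n' "$(printf '%s:%s' "$1" "$2" | base64)"
}

basic_auth_header let mein

# curl can build the same header itself when checking the setup:
#   curl -s -o /dev/null -w '%{http_code}\n' -u let:mein "$URL"
```

The password is not a secret here. The block only works as long as the crawlers do not read the 401 page and follow its instructions.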

Links

Myriads of web server operators have the same problem.

2025-03-17: Source hut: Please stop externalizing your costs directly into my face
2025-03-20: thelibre.news: FOSS infrastructure is under attack by AI companies
2025-03-25: Open source devs say AI crawlers dominate traffic, forcing blocks on entire countries
2025-08-29: The Register: AI web crawlers are destroying websites in their never-ending hunger for any and all content
2026-06-04: Dolphin vs. Aggressive AI Scraper Bots
2025-06-12: Letter from Codeberg: We love our new infrastructure

AI crawling protection: The threat of gold-rush-style AI crawling of the web is not a new problem, but the impact on our infrastructure kept increasing. In the past, blocking of certain IP ranges would be enough and keep us protected for days. However, more and more AI companies seem to hide behind residential proxies, abusing consumer IP addresses to distribute traffic. Rate-limiting is no longer effective, because given IP addresses might only do about two requests per day.

As a consequence, we have decided to protect certain expensive routes of Codeberg.org and most of Weblate (Codeberg Translate) behind Anubis, which requires the browser to do some computation via JavaScript to prove they have capacity. While this causes a slight increase of energy demand, legitimate users will only solve the challenge rarely, while large amounts of energy usage due to the heavy crawling. Most current crawlers do not seem to execute the code, and their requests don't reach our backends.
2025-10-21: pushbx: Anonymous sign in now required for the hgweb server

The issue is residential proxies (RESIP):

2019-05-19: Resident Evil: Understanding Residential IP Proxy as a Dark Service
2026-06-22: Nearly Half of LG Smart TV Apps Are Laced with Proxies
