AI crawler attack

Last week I wanted to demo my Open DMARC Analyzer installation to my colleagues. I shared my browser window, entered the analyzer's URL and ... waited for 15 seconds. Clicking on links also took over 10 seconds.

Afterwards I checked other websites on my server and also found them unresponsive; they took more than 10 seconds to load - even static pages. Sometimes they didn't load at all and Firefox showed a "could not connect" error.

The server load was ok - around 300% on a 12 core system. Next I checked the web server's load and was overwhelmed: Apache's server-status page showed that of the 150 workers, 150 were busy.

Nearly all of them were fetching content from my git server:

root@ahso5:~> curl -s localhost/server-status | html2text
[...]
0-0  x 195/ R 0.70 0  0   309  0.0   0.66  0.66 176.236.195.170 http/1.1 git.cweiske.de:443
2-0  x 123/ R 0.57 0  0   331  0.0   0.66  0.66 a100:a08b:7a85: http/1.1 git.cweiske.de:443
4-0  x 94/  R 0.48 28 0   37   0.0   1.57  1.57 81:e464:5ff7:   http/1.1 git.cweiske.de:443
5-0  x 84/  R 0.49 0  0   767  0.0   0.67  0.67 e254:8545:2fa1: http/1.1 git.cweiske.de:443
6-0  x 79/  R 0.47 4  0   59   0.0   0.46  0.46 181.94.228.41   http/1.1 git.cweiske.de:443
9-0  x 64/  R 0.43 0  0   211  0.0   0.43  0.43 177.170.151.186 http/1.1 git.cweiske.de:443
10-0 x 94/  R 0.47 3  0   25   0.0   0.50  0.50 168.220.176.186 http/1.1 git.cweiske.de:443
11-0 x 99/  R 0.48 0  0   1055 0.0   0.52  0.52 560e:54ad:4599: http/1.1 git.cweiske.de:443
12-0 x 145/ R 0.59 0  0   652  0.0   0.89  0.89 181.54.0.0      http/1.1 git.cweiske.de:443
13-0 x 120/ R 0.50 27 0   44   0.0   0.67  0.67 188.30.63.72    http/1.1 git.cweiske.de:443
[...]
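
To see at a glance which virtual host is eating the workers, the listing can be tallied by its last column. The following is a minimal Python sketch, assuming the column layout shown above, where the vhost is the last field of each worker row. The function name `busy_workers_per_vhost` is made up for illustration, and rows that carry a request line after the vhost would need different handling.

```python
from collections import Counter


def busy_workers_per_vhost(status_text: str) -> Counter:
    """Count worker rows per virtual host in html2text'ed server-status output."""
    counts = Counter()
    for line in status_text.splitlines():
        fields = line.split()
        # Worker rows start with a "slot-generation" id such as "0-0" and,
        # in the listing above, end with "host:port". Skip everything else.
        if len(fields) < 3 or ":" not in fields[-1]:
            continue
        if "-" not in fields[0] or not fields[0].replace("-", "").isdigit():
            continue
        vhost = fields[-1].rsplit(":", 1)[0]
        counts[vhost] += 1
    return counts


# Two rows copied from the listing above, plus the elision marker.
sample = """\
0-0  x 195/ R 0.70 0  0   309  0.0   0.66  0.66 176.236.195.170 http/1.1 git.cweiske.de:443
6-0  x 79/  R 0.47 4  0   59   0.0   0.46  0.46 181.94.228.41   http/1.1 git.cweiske.de:443
[...]
"""
print(busy_workers_per_vhost(sample).most_common())
```

Feeding it the full output of `curl -s localhost/server-status | html2text` should give one count per vhost, sorted by the number of busy workers.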

Disable git server

As a first measure, I wanted to stop serving any content from my git server, so I decided to return "HTTP/1.1 402 Payment Required" errors for all requests on that domain.

/etc/apache2/sites-available/cweiske/git.cweiske.de.conf
<VirtualHost *:443>
    [...]
    ServerAdmin "fuckoff@ai.bots"
    RedirectMatch 402 "^"
    ErrorDocument 402 "Fuckoff, AI bots"
    ServerSignature Off
</VirtualHost>

The RedirectMatch alone showed an "internal server error" message in the browser. Adding the error document solved that.

Unfortunately, that did not solve the problem: loading the error page was still slow, taking more than 10 seconds or timing out. So many AI crawler bots were attacking my server that even sending out the 1k "402" error response was not fast enough, and the requests saturated all 150 workers:

Munin: Apache processes, day view

More workers

To cope with the onslaught, I increased the workers from 150 to 512 with a config option:

/etc/apache2/conf-enabled/90-cweiske-workers.conf
MaxRequestWorkers 512
ServerLimit 512

This temporarily worked until...

root@ahso5:~> curl -s localhost/server-status | html2text
Current Time: Monday, 20-Apr-2026 15:13:53 CEST
Restart Time: Monday, 20-Apr-2026 15:01:06 CEST
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 12 minutes 46 seconds
Server load: 1.51 1.49 1.04
Total accesses: 76795 - Total Traffic: 538.0 MB - Total Duration: 1199990
CPU Usage: u74.88 s62.19 cu73.39 cs58.62 - 35.1% CPU load
100 requests/sec - 0.7 MB/second - 7.2 kB/request - 15.6259 ms/request
512 requests currently being processed, 0 workers gracefully restarting, 0 idle workers
 
RRRRRRRRRRRRRRRRRCRRRRRRRRRCRRRRRRRRWRRRRRRRRRCRCRCRCCRRRCCRRRRC
RRRRCRRRRRRRRRCRCCRRRRRRRRRRRRRRRCRRCRRRRCRCRRRRRRCRRRRCRRCCRRRR
RRCRCCRRRRRRCRRRCCRRRRRRRRRRCRCRRRRRRRRRRRRCRCRCRRRCCRRRRRRCKRRC
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRKRRRRCRCRRRRCRRCCRCRRCRRRRRRRRR
CRRRRCRRRRRCCRRCRRRRRRRRRRCRCRCRRRRRRCRRCRRRRCRRRRRRRCRCRRRRRRRR
RCRRRRRRRRCRRCRRRRCRRRRRCCCKRCRRRKRRRKRRCRCRRRRCCRCRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRCRRRRCRRRRRRRRRRRRRRWRRRRRRRRRRRRCRRRRRRRCRRRRRR
RRRRKRRRRRRRRRRCRRRRRRRRCRCRRRRRRRRRCRCRRRRRRRRRRRRRRRRCRRRRRRCR

512 requests currently being processed. The bots adjusted and just requested more error responses!
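
The scoreboard is easier to read when the letters are counted. Below is a small Python sketch that does this, using the scoreboard key that mod_status prints below the board. The function name `summarize_scoreboard` is hypothetical, and the example input is only the first row of the board above.

```python
from collections import Counter

# Scoreboard key as printed by mod_status.
STATES = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive (read)",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup of worker",
    ".": "open slot with no current process",
}


def summarize_scoreboard(scoreboard: str) -> dict:
    """Map each worker state to how often it occurs, most frequent first."""
    # Characters that are not scoreboard states (newlines, spaces) are ignored.
    counts = Counter(ch for ch in scoreboard if ch in STATES)
    return {STATES[ch]: n for ch, n in counts.most_common()}


first_row = "RRRRRRRRRRRRRRRRRCRRRRRRRRRCRRRRRRRRWRRRRRRRRRCRCRCRCCRRRCCRRRRC"
for state, count in summarize_scoreboard(first_row).items():
    print(f"{count:4d}  {state}")
```

A board dominated by "R" means that most workers are tied up waiting for clients to finish sending their request, not doing any work for them.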

Luckily this was only a spike; the number of parallel requests then averaged between 200 and 300, at about 80 requests per second:

$ tail -f /var/log/apache2/cweiske/git.cweiske.de-access.log | pv --line-mode --rate > /dev/null
[80,2 /s]
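
The same rate can also be computed after the fact from the timestamps in the access log. This is a rough Python sketch, assuming Apache's default timestamp format such as `[20/Apr/2026:15:13:53 +0200]`. The function name `requests_per_second` and the sample log lines (with documentation-only IP addresses) are invented for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp in Apache's default log formats, e.g. [20/Apr/2026:15:13:53 +0200]
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def requests_per_second(lines):
    """Return (average, peak) requests per second over the span of the lines."""
    per_second = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
            per_second[stamp] += 1
    if not per_second:
        return 0.0, 0
    # Seconds between first and last request, counting both ends.
    span = (max(per_second) - min(per_second)).total_seconds() + 1
    return sum(per_second.values()) / span, max(per_second.values())


# Hypothetical log lines, not taken from the real log.
sample = [
    '192.0.2.1 - - [20/Apr/2026:15:13:53 +0200] "GET / HTTP/1.1" 402 16',
    '192.0.2.2 - - [20/Apr/2026:15:13:53 +0200] "GET / HTTP/1.1" 402 16',
    '192.0.2.3 - - [20/Apr/2026:15:13:54 +0200] "GET / HTTP/1.1" 402 16',
]
average, peak = requests_per_second(sample)
print(f"{average:.1f} requests/s on average, peak {peak}/s")
```

Passing an open log file instead of the sample list works as well, since the function only iterates over lines.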

After ~3 hours, the bot controller noticed that no sensible responses came back from my server and the requests fell back to the normal ~20 per second.

Clients

Looking at the awstats log analytics I saw that most of the clients only fetched a single URL and then vanished:

Number of visits: 1,295,538
Hits: 1,544,307
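
As a quick check of the "single URL" observation, the two awstats figures can be divided. This is plain arithmetic on the numbers above, written as a few lines of Python:

```python
visits = 1_295_538
hits = 1_544_307

# Average number of requests that awstats attributes to one visit.
hits_per_visit = hits / visits
print(f"{hits_per_visit:.2f} hits per visit")
```

That is about 1.19 hits per visit. A human browsing a git web interface would normally produce many hits per visit.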

The "user-agent" header did not indicate bots: They masqueraded as normal browsers on all kinds of common operating systems, with current version numbers.

Throwing 20 IP addresses into Maxmind's GeoIP demo page showed:

Maxmind GeoIP list #1
Maxmind GeoIP list #2

That makes them impossible to block because you can't identify them.

Limiting requests to the git vhost

My main problem was that the other websites on my server were unreachable because the bots downloaded every single page from the git server. The best solution would be to limit the parallel requests on this vhost, so that enough workers stay free for the other domains.

Apache2 has no native support for that (v1 had), but I found mod_vhost_limit. I compiled it, configured it, and now only allow 10 parallel requests to my git server.
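
Conceptually, such a per-vhost limit is a counter of active requests per host name that refuses new requests once a cap is reached. The Python sketch below only illustrates that idea; it is not how mod_vhost_limit is implemented, and the class and method names (`VhostLimiter`, `try_acquire`, `release`) are made up.

```python
import threading
from collections import defaultdict


class VhostLimiter:
    """Cap the number of requests handled in parallel per virtual host."""

    def __init__(self, limits):
        self._limits = limits            # e.g. {"git.cweiske.de": 10}
        self._active = defaultdict(int)  # currently running requests per vhost
        self._lock = threading.Lock()

    def try_acquire(self, vhost):
        """Reserve a slot; return False if the vhost is at its limit."""
        with self._lock:
            limit = self._limits.get(vhost)  # None means "no limit"
            if limit is not None and self._active[vhost] >= limit:
                return False
            self._active[vhost] += 1
            return True

    def release(self, vhost):
        """Free a slot once the request has been handled."""
        with self._lock:
            if self._active[vhost] > 0:
                self._active[vhost] -= 1


limiter = VhostLimiter({"git.cweiske.de": 10})
accepted = [limiter.try_acquire("git.cweiske.de") for _ in range(11)]
print(accepted.count(True), "accepted,", accepted.count(False), "rejected")
```

The important property is that rejected requests give up their worker immediately, so the remaining workers stay available for the other domains.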

Traffic

In 2026-01, the git vhost had 2.25 GiB of traffic. The April number is 120 GiB, and the month still has 4 days to go.

awstats: Year
awstats: Month

The sad thing is: Those git repositories take 419 MiB on the hard disk. When using git clone to download them, it would take ~500 MiB to get all of them at once.
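
To put these numbers into proportion, here is the arithmetic as a few lines of Python. It uses only the figures mentioned above, with the "~500 MiB" for a full clone of all repositories taken as an approximation.

```python
january_gib = 2.25    # git vhost traffic in 2026-01
april_gib = 120       # git vhost traffic in April so far
full_clone_mib = 500  # approximate size of cloning all repositories once

growth = april_gib / january_gib
clones = april_gib * 1024 / full_clone_mib

print(f"April traffic is about {growth:.0f}x January's")
print(f"That is roughly {clones:.0f} full clones of all repositories")
```

So the crawlers transferred about 53 times the January traffic, which is the equivalent of cloning every repository roughly 246 times.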

"AI" training data crawlers are dumb as hell, and their operators do not care that they waste resources and bring down servers one after the other.

When you work for one of those companies: I hate you.

Additional idea: Basic Auth for expensive pages

On 2026-05-15 my feed reader brought me this: "Blocking DDoS from scraper bots the easy way via HTTP-401 Basic Auth".

I decided to implement that: Require a username and password for expensive pages like commit messages, diffs and file listings - but not for the other URL paths.

First I created an htpasswd file:

$ htpasswd -bc /etc/apache2/git-dummy.htpasswd let mein

and then I used that for some URL paths:

/etc/apache2/sites-available/cweiske/git.cweiske.de
<VirtualHost *:443>
    [...]
    <LocationMatch "/[^/]+.git/(blob|commit|diff|log|patch|plain|snapshot|tree)">
        AuthType Basic
        AuthName "AI crawler block: Use let as username and mein as password."
        ErrorDocument 401 "AI crawler block: Use 'let' as username and 'mein' as password."
        AuthUserFile /etc/apache2/git-dummy.htpasswd
        Require valid-user
    </LocationMatch>
    [...]
</VirtualHost>
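
To check which URL paths that pattern covers, it can be tried out in Python. This is only a sketch: Apache uses PCRE, but for this simple, unanchored pattern Python's `re.search` should behave the same. The function name `needs_auth` and the repository name `example.git` are hypothetical.

```python
import re

# Pattern copied verbatim from the LocationMatch directive above.
# Note that the unescaped "." before "git" matches any character.
PROTECTED = re.compile(
    r"/[^/]+.git/(blob|commit|diff|log|patch|plain|snapshot|tree)"
)


def needs_auth(path: str) -> bool:
    """Return True if the URL path falls under the password-protected area."""
    return PROTECTED.search(path) is not None


for path in ["/", "/example.git/", "/example.git/tree/src", "/example.git/commit/"]:
    print(f"{path:25} {'auth required' if needs_auth(path) else 'open'}")
```

The start page and the repository overview stay open, while the expensive per-commit and per-file pages ask for the password.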

It turns out that browsers no longer show the text from the "AuthName" setting, which is sent to the client in the WWW-Authenticate header:

WWW-Authenticate: Basic realm="AI crawler block: Use let as username and mein as password."

This is why the error page for status code 401 has to contain the same text.
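
For completeness, this is what a client sends back once a human has typed in the credentials. With Basic authentication, user name and password are joined with a colon and Base64-encoded into the Authorization header. A minimal Python sketch, with `basic_auth_header` as a made-up helper name:

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of the Authorization header for HTTP Basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


print("Authorization:", basic_auth_header("let", "mein"))
```

This prints `Authorization: Basic bGV0Om1laW4=`. The encoding is trivial to produce, so the protection relies only on crawlers not bothering to read the error page.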

Myriads of web server operators have the same problem.

Written by Christian Weiske.

Comments? Please send an e-mail.