Last week I wanted to demo my Open DMARC Analyzer installation to my colleagues. I shared my browser window, entered the analyzer's URL and ... waited for 15 seconds. Clicking on links also took over 10 seconds.
Afterwards I checked other websites on my server and also found them unresponsive; they took more than 10 seconds to load - even static pages. Sometimes they didn't load at all and Firefox showed a "could not connect" error.
The server load was ok - around 300% on a 12 core system. Next I checked the web server's load and was overwhelmed: Apache's server-status page showed that of the 150 workers, 150 were busy.
Nearly all of them were fetching content from my git server:
root@ahso5:~> curl -s localhost/server-status | html2text
[...]
0-0 x 195/ R 0.70 0 0 309 0.0 0.66 0.66 176.236.195.170 http/1.1 git.cweiske.de:443
2-0 x 123/ R 0.57 0 0 331 0.0 0.66 0.66 a100:a08b:7a85: http/1.1 git.cweiske.de:443
4-0 x 94/ R 0.48 28 0 37 0.0 1.57 1.57 81:e464:5ff7: http/1.1 git.cweiske.de:443
5-0 x 84/ R 0.49 0 0 767 0.0 0.67 0.67 e254:8545:2fa1: http/1.1 git.cweiske.de:443
6-0 x 79/ R 0.47 4 0 59 0.0 0.46 0.46 181.94.228.41 http/1.1 git.cweiske.de:443
9-0 x 64/ R 0.43 0 0 211 0.0 0.43 0.43 177.170.151.186 http/1.1 git.cweiske.de:443
10-0 x 94/ R 0.47 3 0 25 0.0 0.50 0.50 168.220.176.186 http/1.1 git.cweiske.de:443
11-0 x 99/ R 0.48 0 0 1055 0.0 0.52 0.52 560e:54ad:4599: http/1.1 git.cweiske.de:443
12-0 x 145/ R 0.59 0 0 652 0.0 0.89 0.89 181.54.0.0 http/1.1 git.cweiske.de:443
13-0 x 120/ R 0.50 27 0 44 0.0 0.67 0.67 188.30.63.72 http/1.1 git.cweiske.de:443
[...]
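To see at a glance which vhost is eating the workers, the rows can be tallied per vhost. This is only a minimal sketch: it assumes the vhost is the field directly after the protocol column (http/1.1), as in the rows above, and count_by_vhost is a hypothetical helper name, not something Apache ships.

```shell
#!/bin/sh
# Tally server-status worker rows per virtual host.
# Assumption: the vhost is the field right after the protocol column
# ("http/1.1"), as in the html2text output shown above. HTTP/2 rows
# may label the protocol differently and are not counted here.
count_by_vhost() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^http\/[0-9.]+$/) { count[$(i + 1)]++; break }
    }
    END { for (v in count) print count[v], v }' | sort -rn
}
```

Usage would be along the lines of: curl -s localhost/server-status | html2text | count_by_vhost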
Disable git server
As a first measure, I wanted to stop serving any content from my git server and decided to return "HTTP/1.1 402 Payment Required" errors for all requests to that domain.
<VirtualHost *:443>
    [...]
    ServerAdmin "fuckoff@ai.bots"
    RedirectMatch 402 "^"
    ErrorDocument 402 "Fuckoff, AI bots"
    ServerSignature Off
</VirtualHost>
The RedirectMatch alone showed an "internal server error" message in the browser. Adding the error document solved it.
Unfortunately, the problem was not solved: loading the error page was still slow, took more than 10 seconds or timed out. I had so many AI crawler bots attacking my server that even sending out the 1k "402" error response was not fast enough, and the requests still saturated the 150 workers.
More workers
To cope with the onslaught, I increased the number of workers from 150 to 512 with two config options:
MaxRequestWorkers 512
ServerLimit 512
This temporarily worked until...
root@ahso5:~> curl -s localhost/server-status | html2text
Current Time: Monday, 20-Apr-2026 15:13:53 CEST
Restart Time: Monday, 20-Apr-2026 15:01:06 CEST
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 12 minutes 46 seconds
Server load: 1.51 1.49 1.04
Total accesses: 76795 - Total Traffic: 538.0 MB - Total Duration: 1199990
CPU Usage: u74.88 s62.19 cu73.39 cs58.62 - 35.1% CPU load
100 requests/sec - 0.7 MB/second - 7.2 kB/request - 15.6259 ms/request
512 requests currently being processed, 0 workers gracefully restarting, 0 idle workers
RRRRRRRRRRRRRRRRRCRRRRRRRRRCRRRRRRRRWRRRRRRRRRCRCRCRCCRRRCCRRRRC
RRRRCRRRRRRRRRCRCCRRRRRRRRRRRRRRRCRRCRRRRCRCRRRRRRCRRRRCRRCCRRRR
RRCRCCRRRRRRCRRRCCRRRRRRRRRRCRCRRRRRRRRRRRRCRCRCRRRCCRRRRRRCKRRC
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRKRRRRCRCRRRRCRRCCRCRRCRRRRRRRRR
CRRRRCRRRRRCCRRCRRRRRRRRRRCRCRCRRRRRRCRRCRRRRCRRRRRRRCRCRRRRRRRR
RCRRRRRRRRCRRCRRRRCRRRRRCCCKRCRRRKRRRKRRCRCRRRRCCRCRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRCRRRRCRRRRRRRRRRRRRRWRRRRRRRRRRRRCRRRRRRRCRRRRRR
RRRRKRRRRRRRRRRCRRRRRRRRCRCRRRRRRRRRCRCRRRRRRRRRRRRRRRRCRRRRRRCR
512 requests currently being processed.
The bots adjusted and just requested more error responses!
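Reading the scoreboard by eye gets old quickly. mod_status also offers a machine-readable variant at /server-status?auto, which contains BusyWorkers and IdleWorkers lines. A minimal sketch that extracts both numbers; busy_idle is a hypothetical helper name:

```shell
#!/bin/sh
# Extract busy and idle worker counts from "server-status?auto" output
# read on stdin. Relies on the "BusyWorkers: N" and "IdleWorkers: N"
# lines of mod_status' machine-readable format.
busy_idle() {
    awk -F': ' '
        $1 == "BusyWorkers" { busy = $2 }
        $1 == "IdleWorkers" { idle = $2 }
        END { printf "busy=%d idle=%d\n", busy, idle }'
}
```

It could be fed like this: curl -s 'localhost/server-status?auto' | busy_idle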
Luckily this was only a spike; on average there were between 200 and 300 parallel requests, arriving at about 80 requests per second:
$ tail -f /var/log/apache2/cweiske/git.cweiske.de-access.log | pv --line-mode --rate > /dev/null
[80,2 /s]
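pv shows the live rate only. To find the busiest seconds after the fact, the log lines can be grouped by timestamp. A minimal sketch, assuming the access log uses Apache's common or combined format, where the fourth whitespace-separated field is the timestamp; busiest_seconds is a hypothetical helper name:

```shell
#!/bin/sh
# Print the seconds with the most requests, busiest first.
# Assumption: common/combined log format, so field 4 looks like
# "[20/Apr/2026:15:13:53" (the leading bracket is stripped below).
# Optional argument: how many seconds to list (default 5).
busiest_seconds() {
    awk '{ count[$4]++ }
         END { for (t in count) print count[t], substr(t, 2) }' \
        | sort -rn | head -n "${1:-5}"
}
```

For example: busiest_seconds 10 < /var/log/apache2/cweiske/git.cweiske.de-access.log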
After ~3 hours, the bot controller noticed that no sensible responses came back from my server and the requests fell back to the normal ~20 per second.
Clients
Looking at the awstats log analytics I saw that most of the clients only fetched a single URL and then vanished:
- Number of visits: 1,295,538
- Hits: 1,544,307
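That is roughly 1.19 hits per visit (1,544,307 / 1,295,538). The same pattern can be checked directly in the access log by counting how many client addresses appear exactly once. This is only a rough sketch: it counts IP addresses, which is not the same as awstats' notion of a visit, and it assumes the client address is the first field of each log line. single_request_clients is a hypothetical helper name:

```shell
#!/bin/sh
# Count how many client addresses sent exactly one request.
# Assumption: the client address is the first field of each log line
# (Apache common/combined log format).
single_request_clients() {
    awk '{ n[$1]++ }
         END {
             for (ip in n) { total++; if (n[ip] == 1) single++ }
             printf "%d of %d clients sent exactly one request\n", single, total
         }'
}
```

For example: single_request_clients < /var/log/apache2/cweiske/git.cweiske.de-access.log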
The "user-agent" header did not indicate bots: They masqueraded as normal browsers on all kinds of common operating systems, with current version numbers.
Throwing 20 IP addresses into Maxmind's GeoIP demo page showed:
- Requests are from all over the world - USA, Germany, Brazil, Iraq, Chile, ...
- All requests came from "Cable/DSL" connections.
That makes them practically impossible to block: neither the user agent nor the IP address distinguishes them from real visitors.
Limiting requests to the git vhost
My main problem was that the other websites on my server were unreachable because the bots downloaded every single page from the git server. The best solution would be to limit the parallel requests on this vhost, so that enough workers stay free for the other domains.
Apache2 has no native support for that (v1 had), but I found mod_vhost_limit. I compiled it, configured it, and now only allow 10 parallel requests to my git server.
Traffic
In 2026-01, the git vhost had 2.25 GiB of traffic. The April number is 120 GiB, and the month still has 4 days.
The sad thing is: Those git repositories take 419 MiB on the hard disk. When using git clone to download them, it would take ~500 MiB to get all of them at once.
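To put those numbers in relation, here is the arithmetic as a small sketch. It uses only the figures from the text (2.25 GiB in January, 120 GiB in April so far, ~500 MiB for cloning everything); traffic_ratio is a hypothetical helper name:

```shell
#!/bin/sh
# Relate the April traffic to January and to the size of a full clone.
# All numbers are the ones quoted in the text; LC_ALL=C keeps the
# decimal point independent of the system locale.
traffic_ratio() {
    LC_ALL=C awk 'BEGIN {
        jan_gib = 2.25; apr_gib = 120; clone_mib = 500
        printf "growth factor: %.1f\n", apr_gib / jan_gib
        printf "equivalent full clones: %.0f\n", apr_gib * 1024 / clone_mib
    }'
}

traffic_ratio
```

In other words, the crawlers transferred about 53 times the January traffic, which is enough to clone every repository roughly 246 times.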
"AI" training data crawlers are dumb as hell, and their operators do not care that they waste resources and bring down servers one after the other.
When you work for one of those companies: I hate you.
Additional idea: Basic Auth for expensive pages
On 2026-05-15 my feed reader brought me this: Blocking DDoS from scraper bots the easy way via HTTP-401 Basic Auth.
I decided to implement that: Require a username and password for expensive pages like commit messages, diffs and file listings - but not for the other URL paths.
First I created an htpasswd file:
$ htpasswd -bc /etc/apache2/git-dummy.htpasswd let mein
and then I used that for some URL paths:
<VirtualHost *:443>
    [...]
    <LocationMatch "/[^/]+.git/(blob|commit|diff|log|patch|plain|snapshot|tree)">
        AuthType Basic
        AuthName "AI crawler block: Use let as username and mein as password."
        ErrorDocument 401 "AI crawler block: Use 'let' as username and 'mein' as password."
        AuthUserFile /etc/apache2/git-dummy.htpasswd
        Require valid-user
    </LocationMatch>
    [...]
</VirtualHost>
It turns out that browsers no longer show the text from the "AuthName" setting, which is sent to the client in the WWW-Authenticate header:
WWW-Authenticate: Basic realm="AI crawler block: Use let as username and mein as password."
This is why the error page for status code 401 has to contain the same text.
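To check which URL paths end up behind the password prompt, the pattern from the LocationMatch block can be tried against sample paths with grep -E. This is a sketch only: grep's extended regular expressions are not the same engine Apache uses, although this particular pattern should behave the same in both. needs_auth and the example.git paths are hypothetical:

```shell
#!/bin/sh
# Return success if the given URL path matches the LocationMatch
# pattern from the vhost config above (copied unchanged, including
# the unescaped dot before "git", which matches any character).
needs_auth() {
    printf '%s\n' "$1" \
        | grep -Eq '/[^/]+.git/(blob|commit|diff|log|patch|plain|snapshot|tree)'
}

# Hypothetical sample paths, for illustration only.
for path in /example.git/ /example.git/log/ /example.git/tree/src; do
    if needs_auth "$path"; then
        echo "auth  $path"
    else
        echo "open  $path"
    fi
done
```

Against the live server, curl -s -o /dev/null -w '%{http_code}\n' with and without -u let:mein on one of the protected paths should show whether the block is active.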
Links
Myriads of web server operators have the same problem.
- 2025-03-17: SourceHut: Please stop externalizing your costs directly into my face
- 2025-03-20: thelibre.news: FOSS infrastructure is under attack by AI companies
- 2025-03-25: Open source devs say AI crawlers dominate traffic, forcing blocks on entire countries
- 2025-08-29: The Register: AI web crawlers are destroying websites in their never-ending hunger for any and all content
- 2025-06-12: Letter from Codeberg: We love our new infrastructure
AI crawling protection: The threat of gold-rush-style AI crawling of the web is not a new problem, but the impact on our infrastructure kept increasing. In the past, blocking of certain IP ranges would be enough and keep us protected for days. However, more and more AI companies seem to hide behind residential proxies, abusing consumer IP addresses to distribute traffic. Rate-limiting is no longer effective, because given IP addresses might only do about two requests per day.
As a consequence, we have decided to protect certain expensive routes of Codeberg.org and most of Weblate (Codeberg Translate) behind Anubis, which requires the browser to do some computation via JavaScript to prove they have capacity. While this causes a slight increase of energy demand, legitimate users will only solve the challenge rarely, while large amounts of energy usage due to the heavy crawling. Most current crawlers do not seem to execute the code, and their requests don't reach our backends.