While implementing the crawler for my own search engine phinde, I tried to minimize the amount of data transferred between web servers and the crawler.
The crawler can only extract links from HTML, XHTML and Atom feeds, so it sends an HTTP Accept header stating that:
Accept: application/atom+xml, application/xhtml+xml, text/html
Unfortunately, my Apache still sends out the content for large .bz2 files that my crawler then has to throw away.
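Until servers honour the header, the crawler has to protect itself. Below is a minimal sketch of such a client-side guard, assuming Python's standard urllib. It is not phinde's actual code, and the names ACCEPTED_TYPES, media_type and fetch_if_acceptable are made up for illustration. It sends the Accept header from above and only reads the body when the response's Content-Type is one of the requested types.

```python
import urllib.request

# The three types the crawler can extract links from (taken from the Accept header above).
ACCEPTED_TYPES = ("application/atom+xml", "application/xhtml+xml", "text/html")


def media_type(content_type_header):
    """Return the bare media type, e.g. 'text/html' for 'text/html; charset=utf-8'."""
    return (content_type_header or "").split(";", 1)[0].strip().lower()


def fetch_if_acceptable(url, accepted=ACCEPTED_TYPES, timeout=10):
    """Return the response body, or None when the server ignored the Accept header."""
    request = urllib.request.Request(url, headers={"Accept": ", ".join(accepted)})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if media_type(response.headers.get("Content-Type")) not in accepted:
            # Leaving the with-block closes the connection without calling read().
            return None
        return response.read()
```

This does not stop the server from starting to send the file, and some bytes may already be in flight when the connection is closed, but the crawler at least avoids downloading a large .bz2 file in full.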
Specification
The HTTP/1.1 RFC 2616 states in section 10.4.7 (406 Not Acceptable):
Note: HTTP/1.1 servers are allowed to return responses which are not acceptable according to the accept headers sent in the request.
I think this note was added to make HTTP/1.1 servers easier to implement.
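For contrast, here is a rough sketch of what the strict behaviour from section 10.4.7 could look like, written with Python's built-in http.server. The handler serves a single hard-coded text/html page; the names accepts and StrictHandler are hypothetical, and the Accept parsing is deliberately simplified (q-values are ignored, so a "q=0" exclusion is not honoured).

```python
from http.server import BaseHTTPRequestHandler


def accepts(accept_header, offered_type):
    """Check whether offered_type matches any media range in accept_header.

    Simplified: parameters such as q-values are ignored.
    """
    if not accept_header:
        return True  # a missing Accept header means any type is fine
    offered_main = offered_type.split("/", 1)[0]
    for item in accept_header.split(","):
        media_range = item.split(";", 1)[0].strip().lower()
        if media_range in ("*/*", offered_type, offered_main + "/*"):
            return True
    return False


class StrictHandler(BaseHTTPRequestHandler):
    """Serves one HTML page and answers 406 when the client cannot accept it."""

    OFFERED_TYPE = "text/html"
    BODY = b"<html><body>hello</body></html>"

    def _respond(self, send_body):
        if not accepts(self.headers.get("Accept"), self.OFFERED_TYPE):
            self.send_error(406)  # "Not Acceptable"
            return
        self.send_response(200)
        self.send_header("Content-Type", self.OFFERED_TYPE)
        self.send_header("Content-Length", str(len(self.BODY)))
        self.end_headers()
        if send_body:
            self.wfile.write(self.BODY)

    def do_GET(self):
        self._respond(send_body=True)

    def do_HEAD(self):
        self._respond(send_body=False)
```

To try it, start the handler with http.server.HTTPServer(("127.0.0.1", 8000), StrictHandler).serve_forever() and request it with curl -IH 'Accept: image/png' http://127.0.0.1:8000/; with this sketch the answer should be a 406 instead of a 200.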
Server support
Unfortunately, none of the three big web servers makes it possible to send a 406 status code when the Accept condition cannot be fulfilled. I've opened a feature request for Apache: Option to send "406 Not Acceptable" when mime type in "Accept" header cannot be fulfilled
The standard configurations do not support it at all:
Apache
$ curl -IH 'Accept: image/png' http://httpd.apache.org/
HTTP/1.1 200 OK
[...]
Server: Apache/2.4.7 (Ubuntu)
[...]
Content-Type: text/html
Lighttpd
$ curl -IH 'Accept: image/png' http://www.lighttpd.net/
HTTP/1.1 200 OK
[...]
Content-Type: text/html
[...]
Server: lighttpd/2.0.0
nginx
$ curl -IH 'Accept: image/png' http://nginx.org/
HTTP/1.1 200 OK
Server: nginx/1.9.8
Date: Wed, 10 Feb 2016 20:11:30 GMT
Content-Type: text/html; charset=utf-8
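The same probe can be scripted. The following is only a sketch using Python's urllib, not something the tests above were run with; head_with_accept and describe are names invented here. It sends a HEAD request with an Accept header, like curl -IH does, and summarises whether the server honoured it. The comparison is an exact match on the media type, so wildcards in the Accept header are not handled.

```python
import urllib.error
import urllib.request


def head_with_accept(url, accept, timeout=10):
    """Send a HEAD request with the given Accept header; return (status, content_type)."""
    request = urllib.request.Request(url, method="HEAD", headers={"Accept": accept})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("Content-Type")
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses, which is where a 406 would show up.
        return error.code, error.headers.get("Content-Type")


def describe(status, content_type, accept):
    """Summarise whether the server honoured the Accept header (exact match only)."""
    if status == 406:
        return "406: server refused, Accept honoured"
    wanted = [item.split(";", 1)[0].strip().lower() for item in accept.split(",")]
    got = (content_type or "").split(";", 1)[0].strip().lower()
    if got in wanted:
        return f"{status}: acceptable type {got}"
    return f"{status}: Accept ignored, got {got or 'no Content-Type'}"
```

Calling describe(*head_with_accept(url, "image/png"), "image/png") for each of the three URLs above would reproduce the checks; what it prints depends on how those servers behave at the time of the request.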