I spent the last couple of days at work integrating REST API data into the search result list of TYPO3's indexed_search extension. Yesterday I wanted to run a last test on my development machine to see if everything worked as it should and if API data would be indexed correctly. It did not work.
Why didn't it work?
After several hours I found out that the crawler extension did indeed process the page with my special crawling configuration, but stopped in the middle.
Why did it stop?
The crawler caught an exception and stopped processing. Unfortunately it did not tell anyone about that. The exception was "HTTP/1.1 404 Not Found", from the API connector.
Why did the API connector throw an exception?
Our crawler hook thought it was running on the live (production) system and queried the production API. The new API methods had not yet been deployed to the production API system, so it returned a 404.
Why did the crawler think we were on prod?
The Docker container has an environment variable TYPO3_CONTEXT=Development, which tells the TYPO3 instance to use the development configuration. That variable was not set in the crawler's process.
Why was the environment variable not set?
To make the crawler process run correctly (write access to temporary directories + files), it must be run as the same user that the nginx web server runs under, www-data. I switched to the www-data user as I always do:
$ su - www-data -s /bin/bash
The - makes su start a login shell, which resets the environment. TYPO3_CONTEXT was thus no longer set.
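The effect is easy to reproduce without root. The sketch below uses `env -i`, which starts a command with an empty environment, as a stand-in for the reset that `su -` performs. It is only an approximation of what the login shell sees, not what su does internally.

```shell
# Simulate the container: the variable is exported in the current shell.
export TYPO3_CONTEXT=Development

# A child process started normally inherits the variable.
inherited=$(/bin/sh -c 'echo "${TYPO3_CONTEXT:-unset}"')

# env -i starts the child with an empty environment, roughly what
# a process started through "su -" ends up with.
cleared=$(env -i /bin/sh -c 'echo "${TYPO3_CONTEXT:-unset}"')

echo "inherited: $inherited"
echo "cleared:   $cleared"
```

The first line prints Development, the second prints unset.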
After 6 hours, I removed that minus and everything worked as it should.
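To catch this kind of mistake much earlier, a small guard could refuse to start when the variable is missing. This is a minimal sketch, not something from my actual setup: the function name require_typo3_context is hypothetical, and the crawler command itself is left out because it depends on the installation.

```shell
# Hypothetical helper: fail loudly if TYPO3_CONTEXT is missing or empty,
# instead of silently falling back to the production configuration.
require_typo3_context() {
    if [ -z "${TYPO3_CONTEXT:-}" ]; then
        echo "TYPO3_CONTEXT is not set; refusing to run" >&2
        return 1
    fi
    echo "Running with TYPO3_CONTEXT=$TYPO3_CONTEXT"
}
```

Calling `require_typo3_context || exit 1` at the top of a wrapper script around the crawler would have turned six hours of searching into one error message.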