During development of the TYPO3-based Wohnglück project, fellow developers ran into the dreaded "white pages" when accessing a certain page in the CMS. This happened on the production, test and local dev environments, but only sporadically, and we could not reproduce it on demand.
Our log server only showed:
nginx stdout | 2017/02/02 11:26:55 [error] 24#24: *40 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 42.0.23.0, server: _, request: "GET /some/path/ HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.0-fpm.sock:", host: "some.host", referrer: "https://some.host/other/path/"
I was totally certain that this must be PHP crashing. We were running PHP version 7.0.8ubuntu*, but the most recent one was 7.0.15. Upgrading would surely make the crash go away.
Another colleague did not want to go down that route and suggested collecting more error information first, which we did by changing the FPM log_level setting to warning (it was error before). Then we waited until it happened again a day later. The log was more verbose now:
php7.0-fpm stderr | WARNING: [pool www] child 27, script '/var/www/site/htdocs/index.php' (request: "GET /index.php") execution timed out (152.506891 sec), terminating
php7.0-fpm stderr | WARNING: [pool www] child 27 exited on signal 15 (SIGTERM) after 240.011067 seconds from start
nginx stdout | 24#24: *40 recv() failed (104: Connection reset by peer) while reading response header from upstream, ...
So PHP did not crash, it was killed by php-fpm because the process took longer than 4 minutes to run!
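Once the warning level is enabled, these timeout entries are easy to pick out of a large log automatically. The following is only a minimal sketch in Python, assuming the log lines look exactly like the ones quoted above; the regular expression and the function name find_timeouts are made up for illustration and are not part of php-fpm or TYPO3.

```python
import re

# Pattern modelled on the php-fpm warning quoted above (an assumption:
# other php-fpm versions or log formats may word this line differently).
TIMEOUT_RE = re.compile(
    r"child (?P<child>\d+), script '(?P<script>[^']+)' "
    r"\(request: \"(?P<request>[^\"]*)\"\) "
    r"execution timed out \((?P<seconds>[\d.]+) sec\)"
)


def find_timeouts(lines):
    """Yield (child, script, request, seconds) for every timed-out request."""
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            yield (
                int(match["child"]),
                match["script"],
                match["request"],
                float(match["seconds"]),
            )


# The two php-fpm lines from the log above; only the first is a timeout entry.
sample = [
    "php7.0-fpm stderr | WARNING: [pool www] child 27, script "
    "'/var/www/site/htdocs/index.php' (request: \"GET /index.php\") "
    "execution timed out (152.506891 sec), terminating",
    "php7.0-fpm stderr | WARNING: [pool www] child 27 exited on signal 15 "
    "(SIGTERM) after 240.011067 seconds from start",
]

timeouts = list(find_timeouts(sample))
for child, script, request, seconds in timeouts:
    print(f"child {child} timed out after {seconds:.1f}s: {request} -> {script}")
```

Feeding the whole FPM log through such a filter shows at a glance which scripts hit the time limit and how often.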
The log also contained error messages about TYPO3 calling GraphicsMagick but failing with strange errors like "invalid JPEG header data" and "no information read", which I had not been able to make sense of before.
But now it actually did make sense: the page contained a whopping 219 images, which were lazy-loaded by the browser, but all of them had to be scaled and cropped during page generation. TYPO3's whole cache is cleared during our automated Docker deployment, so the first people accessing the page after a deployment ran into the problem.
Multiple people accessing that page at the same time also explained the gm command errors: two PHP processes tried to generate the same files at once, and one read the partially written data of the other.
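To illustrate why a half-written file is the problem here: a common general way to avoid readers seeing partial output is to write to a temporary file and rename it into place once it is complete. The sketch below is in Python and is not how TYPO3 or GraphicsMagick work internally; the helper name write_atomically and the file name scaled.jpg are hypothetical, chosen only for this example.

```python
import os
import tempfile


def write_atomically(target_path, data):
    """Write data so that readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem, where os.replace() is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # Only now does the file appear under its final name.
        os.replace(tmp_path, target_path)
    except BaseException:
        # Clean up the temp file if writing or renaming failed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


# Usage: a stand-in for one scaled image (the bytes are placeholders, not a real JPEG).
workdir = tempfile.mkdtemp()
out_path = os.path.join(workdir, "scaled.jpg")
write_atomically(out_path, b"\xff\xd8 placeholder image bytes")

with open(out_path, "rb") as fh:
    result = fh.read()
```

With this pattern two processes may still do the same work twice, but neither can read a file that the other has only partly written.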
The solution to this problem is to not throw away the scaled images during deployment.