When I checked my mail this morning, I found many messages reporting errors on my server:
ssh_exchange_identification: Connection closed by remote host
In: STARTTLS
Out: 454 4.7.0 TLS not available due to local problem
Session aborted, reason: lost connection
sendmail: warning: fork: Cannot allocate memory
postdrop: fatal: inet_addr_local[getifaddrs]: getifaddrs: Cannot allocate memory
sendmail: warning: command "/usr/sbin/postdrop -r" exited with status 1
sendmail: fatal: cweiske(1001): unable to execute /usr/sbin/postdrop -r: Cannot allocate memory
spamassassin had also died several times yesterday, and it wasn't running this morning either.
Cause study
RAM
"Cannot allocate memory" sounds as if RAM is full. The server is a VPS at HostEurope with 8 GiB of RAM guaranteed, plus another 8 GiB of optional RAM that is shared with the other virtual servers on the same machine.
My first thought was that the guaranteed 8 GiB were in use and Linux had tried to allocate optional RAM that was already used up by the other servers. But Munin showed that I was using only 6 GiB of active memory, and htop showed active RAM usage of 3.7 GiB. RAM did not really look like the cause.
Control Panel
Then I looked at the VPS' Parallels Control Panel: The status indicator was red, "Resource shortage".
The "Resource Alerts" page showed:
Resource numothersock red alert on environment lvps5-35-241-22.dedicated.hosteurope.de current value: 1000 soft limit: hard limit: 1024
The help for numothersock was:
The number of sockets other than TCP ones.
Local (UNIX-domain) sockets are used for communications inside the system.
UDP sockets are used, for example, for Domain Name Service (DNS) queries.
UDP and other sockets may also be used in some very specialized applications (SNMP agents and others).
Bingo.
numothersock
/proc/user_beancounters contains the numothersock value:
$ cat /proc/user_beancounters
Version: 2.5
       uid  resource           held    maxheld    barrier      limit    failcnt
[...]
            numothersock       1021       1024       1024       1024         40
[...]
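Since failcnt counts how often a limit was actually hit, a small helper can list only the resources that have failed at least once. This is a minimal sketch, not something I ran during the outage: the function name beancounter_failures and the optional file argument are made up for illustration, and it assumes the column layout shown above (resource, held, maxheld, barrier, limit, failcnt), with an extra uid column on the first resource line.

```shell
# List beancounter resources whose failcnt is greater than zero.
# $1: file to read (hypothetical parameter; defaults to /proc/user_beancounters,
#     which usually requires root to read).
beancounter_failures() {
    # Fields are counted from the right so that the optional leading
    # "uid:" column on the first resource line does not shift them.
    awk 'NF >= 6 && $NF ~ /^[0-9]+$/ && $NF > 0 {
        print $(NF-5), "held=" $(NF-4), "limit=" $(NF-1), "failcnt=" $NF
    }' "${1:-/proc/user_beancounters}"
}
```

Calling beancounter_failures without an argument on the VPS should print one line per resource that has run into its limit, which in this case would have included numothersock.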
Then I tried to find out which process caused this high number by grepping for the numothersock value, stopping a service and grepping again. The resulting list was not very accurate and did not point to the cause:
apache 20
courier-authdaemon 5
courier-imap 5
courier-imap-ssl 100
elasticsearch 0
ejabberd 0
gearman 10
memcached 0
mysql 10
postfix 100
spamassassin 2
I tried to keep the server functional while doing this by quickly stopping, grepping and then restarting the services.
Failing to find the cause with that approach, I simply shut down all services one after another. Even with all of them down, numothersock still showed over 600.
In ps I saw a couple of systemd-logind instances:
$ ps aux|grep systemd-logind|wc -l
217
After killing them all, numothersock was down by 600.
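A more direct way than stopping services one by one might be to count the open sockets of each process through the file descriptor links in /proc. The following is only a sketch under some assumptions: the function name sockets_per_process and its optional root-directory argument are invented for this example, it counts all sockets (TCP included, so the totals will not match numothersock exactly), and it needs root to see the descriptors of other users' processes.

```shell
# Print "count pid name" for every process that holds open sockets,
# sorted by socket count in descending order.
# $1: proc root (hypothetical parameter, defaults to /proc).
sockets_per_process() {
    root="${1:-/proc}"
    for dir in "$root"/[0-9]*; do
        [ -d "$dir/fd" ] || continue
        count=0
        for fd in "$dir"/fd/*; do
            # Socket descriptors are symlinks of the form "socket:[inode]".
            case "$(readlink "$fd" 2>/dev/null)" in
                socket:*) count=$((count + 1)) ;;
            esac
        done
        [ "$count" -gt 0 ] || continue
        name=$(cat "$dir/comm" 2>/dev/null)
        echo "$count ${dir##*/} ${name:-?}"
    done | sort -rn
}
```

Running sockets_per_process | head would show the biggest socket holders first. Hundreds of small entries with the same process name, as with the systemd-logind instances here, would only stand out after summing them up per name.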
systemd?
Seeing those systemd-logind instances made me remember lines I saw in /var/log/syslog, sometimes 6 of them per minute:
dbus[2427]: [system] Activating service name='org.freedesktop.systemd1' (using servicehelper)
dbus[2427]: [system] Successfully activated service 'org.freedesktop.systemd1'
I asked for help in #systemd and was told that this isn't normal. That service was meant to be running as pid 1, and should never be able to be "successfully activated" otherwise. But pid 1 was init, and even /proc/1/comm showed nothing systemd-y.
The Debian 8 release notes talk about the package for the new init system, systemd-sysv:
This package is installed automatically on upgrades.
We had upgraded from Debian 7 to version 8 but did not have systemd-sysv installed; instead we had sysvinit-core. To permanently get rid of these errors, I decided to install systemd-sysv, typed reboot and crossed my fingers.
A minute later the server was running again, with:
$ cat /proc/1/comm
systemd
I have had no more errors since then.
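The check on PID 1 can be wrapped into a tiny function, for example to run after the next distribution upgrade. This is just a sketch: the name init_is_systemd and the optional file argument are my own invention for illustration.

```shell
# Succeeds if the process name of PID 1 is "systemd".
# $1: file to read the name from (hypothetical parameter; defaults to
#     /proc/1/comm and only exists so the check can be tried on a sample file).
init_is_systemd() {
    [ "$(cat "${1:-/proc/1/comm}" 2>/dev/null)" = "systemd" ]
}
```

A call like init_is_systemd || echo "PID 1 is not systemd" would then flag a machine that is still booting with another init.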