Hosteurope VPS: Cannot allocate memory

When I checked my mail this morning, many messages reported errors on my server:

ssh_exchange_identification: Connection closed by remote host
In: STARTTLS
Out: 454 4.7.0 TLS not available due to local problem
Session aborted, reason: lost connection
sendmail: warning: fork: Cannot allocate memory
postdrop: fatal: inet_addr_local[getifaddrs]: getifaddrs: Cannot allocate memory
sendmail: warning: command "/usr/sbin/postdrop -r" exited with status 1
sendmail: fatal: cweiske(1001): unable to execute /usr/sbin/postdrop -r: Cannot allocate memory

Also, spamassassin had died multiple times yesterday, and it wasn't running this morning either.

Finding the cause

RAM

"Cannot allocate memory" sounds as if RAM is full. The server is a VPS at HostEurope with 8 GiB of guaranteed RAM, plus another 8 GiB of optional RAM that is shared with the other virtual servers on the same machine.

My first thought was that the guaranteed 8 GiB were in use and Linux had tried to allocate optional RAM that the other servers had already used up. But Munin showed that I was using only 6 GiB of active memory, and htop showed active RAM usage of 3.7 GiB. That did not look like the cause.

Control Panel

Then I looked at the VPS' Parallels Control Panel: The status indicator was red, "Resource shortage".

The "Resource Alerts" page showed:

Resource numothersock red alert on environment lvps5-35-241-22.dedicated.hosteurope.de current value: 1000 soft limit: hard limit: 1024

The help for numothersock was:

The number of sockets other than TCP ones.

Local (UNIX-domain) sockets are used for communications inside the system.

UDP sockets are used, for example, for Domain Name Service (DNS) queries.

UDP and other sockets may also be used in some very specialized applications (SNMP agents and others).

Bingo.

numothersock

/proc/user_beancounters contains the numothersock value:

$ cat /proc/user_beancounters
Version: 2.5
uid  resource      held  maxheld  barrier  limit  failcnt
[...]
     numothersock  1021    1024      1024   1024       40
[...]
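The table is long, and the interesting rows are the ones with a non-zero failcnt. A minimal sketch for filtering them, assuming the column layout shown above; the function name beancounter_failures is made up for this example:

```shell
# List beancounter resources whose failcnt is non-zero.
# Fields are counted from the end of each line because the first resource
# row of a container carries an extra "uid:" field at the front.
beancounter_failures() {
    file="${1:-/proc/user_beancounters}"
    awk 'NF >= 6 && $NF ~ /^[0-9]+$/ && $NF > 0 {
        printf "%s held=%s limit=%s failcnt=%s\n", $(NF-5), $(NF-4), $(NF-1), $NF
    }' "$file"
}
```

Calling beancounter_failures without arguments reads /proc/user_beancounters; a different file can be passed as the first argument.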

Then I tried to find out which process caused this high number by grepping for the numothersock value, stopping a service, and grepping again. The resulting list was not very accurate and did not point to the culprit:

apache 20
courier-authdaemon 5
courier-imap 5
courier-imap-ssl 100
elasticsearch 0
ejabberd 0
gearman 10
memcached 0
mysql 10
postfix 100
spamassassin 2

I tried to keep the server functional while doing this by quickly stopping, grepping and then restarting the services.
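The stop, grep, restart cycle can be sketched as a small helper. This is only an illustration: beancounter_held is a made-up name, and the service in the usage comment is just an example.

```shell
# Print the current "held" value of one beancounter resource.
# Fields are counted from the end of the line so that the row carrying
# the extra "uid:" prefix is handled as well.
beancounter_held() {
    resource="$1"
    file="${2:-/proc/user_beancounters}"
    awk -v res="$resource" 'NF >= 6 && $(NF-5) == res { print $(NF-4) }' "$file"
}

# Example usage (run as root, service name is only an example):
#   before=$(beancounter_held numothersock)
#   service postfix stop
#   after=$(beancounter_held numothersock)
#   service postfix start
#   echo "postfix: $((before - after))"
```

Other processes open and close sockets between the two readings, which is one reason why the numbers obtained this way are only approximate.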

Failing to find the cause with that approach, I simply shut down all services one after another. After they were down, numothersock still showed over 600.

In the ps output I saw a large number of systemd-logind instances:

$ ps aux|grep systemd-logind|wc -l
217

After killing them all, numothersock was down by 600.
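A possible shortcut that avoids stopping services is to count open sockets per command name by looking at the file descriptors under /proc. This is not what was done here, just a rough sketch: it counts all sockets including TCP ones, so it only approximates numothersock, and the function name sockets_by_command is made up.

```shell
# Count open sockets per command name by inspecting /proc/<pid>/fd.
# Socket file descriptors are symlinks of the form "socket:[inode]".
# Reading the fd directories of other users' processes requires root.
sockets_by_command() {
    proc_root="${1:-/proc}"
    for pid_dir in "$proc_root"/[0-9]*; do
        [ -d "$pid_dir/fd" ] || continue
        # Skip processes that vanished while we were looking.
        name="$(cat "$pid_dir/comm" 2>/dev/null)" || continue
        for fd in "$pid_dir"/fd/*; do
            case "$(readlink "$fd" 2>/dev/null)" in
                socket:*) echo "$name" ;;
            esac
        done
    done | awk '{ n[$0]++ } END { for (c in n) print n[c], c }' | sort -rn
}
```

Running sockets_by_command as root prints one line per command name, with the largest socket count first.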

systemd?

Seeing those systemd-logind instances reminded me of lines I had seen in /var/log/syslog, sometimes six of them per minute:

dbus[2427]: [system] Activating service name='org.freedesktop.systemd1' (using servicehelper)
dbus[2427]: [system] Successfully activated service 'org.freedesktop.systemd1'
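How often these activations happen can be counted per minute. A minimal sketch, assuming each syslog line starts with the traditional "Mmm dd HH:MM:SS" timestamp, so that the first 12 characters identify the minute; the function name is made up:

```shell
# Count "Activating service" messages for org.freedesktop.systemd1 per minute.
# Assumes the traditional syslog timestamp at the start of every line.
systemd1_activations_per_minute() {
    log_file="${1:-/var/log/syslog}"
    awk -v pat="Activating service name='org.freedesktop.systemd1'" '
        index($0, pat) { n[substr($0, 1, 12)]++ }
        END { for (minute in n) print n[minute], minute }
    ' "$log_file" | sort -rn
}
```

The output lists the busiest minutes first, each line holding the count followed by the minute.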

I asked for help in #systemd and was told that this isn't normal. That service was meant to be running as pid 1, and should never be able to be "successfully activated" otherwise. But pid 1 was init, and even /proc/1/comm showed nothing systemd-y.
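The pid 1 check can be wrapped in a small function, for example for a monitoring script. A minimal sketch; the function name check_pid1_is_systemd is made up:

```shell
# Report whether PID 1 is systemd, based on /proc/1/comm.
# Returns 0 if it is, 1 otherwise.
check_pid1_is_systemd() {
    proc_root="${1:-/proc}"
    name="$(cat "$proc_root/1/comm")"
    if [ "$name" = "systemd" ]; then
        echo "PID 1 is systemd"
    else
        echo "PID 1 is '$name', not systemd"
        return 1
    fi
}
```

On a machine still booted with sysvinit this prints that PID 1 is 'init' and returns a non-zero status.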

The Debian 8 release notes talk about the new init system, systemd-sysv:

This package is installed automatically on upgrades.

We had upgraded from Debian 7 to Debian 8 but did not have systemd-sysv installed; instead we had sysvinit-core. To get rid of these errors permanently, I decided to install systemd-sysv, typed reboot and crossed my fingers.

A minute later the server was running again, with:

$ cat /proc/1/comm
systemd

I have had no more errors since then.

Written by Christian Weiske.
