On Mon, 1 Jun 2020 08:03:03 +0000, Steve Kersley <steve.kersley@keble.ox.ac.uk> wrote:
>> Also, supervisord will eventually cease with the restart attempts when smsd fails too many times (seems to be about 6 times for me).
> I'm certainly not seeing this behaviour. The container has currently been running for just short of 3 days, and it's still spawning smsd every second (will look at how to configure supervisord).
The default setting in supervisord is a `startretries` [1] value of only 3. This, however, only affects the number of restart attempts for processes that fail immediately when starting (i.e., they only get to the STARTING state before failing).
Once supervisord considers a process to be in the RUNNING state, the autorestart option applies. The supervisord config provided by the nav-container says to always autorestart an exited NAV process.
I'm not sure exactly what conditions supervisord applies before it considers a process to have moved from the STARTING to the RUNNING state, but it seems your supervisord instance must have considered those processes to be RUNNING before they died, which would cause it to attempt a restart every time.
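For reference, these are the knobs involved in a supervisord program section. The excerpt below is only an illustration of the options discussed here, not what the nav-container actually ships with; the comments reflect supervisord's documented defaults:

  [program:smsd]
  ; placeholder; whatever command the image uses to run smsd in the foreground
  command=...
  autostart=true
  ; restart the program whenever it exits from the RUNNING state
  autorestart=true
  ; a process must stay up this many seconds for STARTING to become RUNNING (default 1)
  startsecs=1
  ; only limits retries for processes that die while still STARTING (default 3)
  startretries=3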
> In fact, it's now also spawning alertengine, eventengine and snmptrapd every second (it wasn't doing this when the container first booted).
If that is the case, then something is really wrong, and you should check supervisord's logs for these processes.
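For instance, from inside the container, something along these lines (the container name here is just a placeholder, and supervisorctl may need a -c argument pointing at the right config file):

  # attach to the running container
  docker exec -it nav /bin/sh
  # per-program state, uptime and exit information as supervisord sees it
  supervisorctl status
  # follow a single program's stderr log
  supervisorctl tail -f smsd stderr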
> I'm not sure whether the point at which it started respawning those services tallies with when the gaps started appearing, but conceivably the CPU or I/O load from trying to run several Python scripts every second could cause postgres issues when they're sharing those resources? (Or even the fact that those services aren't running?)
> Having looked at why those other services (other than smsd) were respawning: they thought they were already running. I've been able to recreate this. If I stop and start the container (rather than destroying and recreating it via docker-compose), there are still .pid files for alertengine, eventengine, pping and snmptrapd in /tmp, some or all of which are datestamped from an earlier container start. It doesn't do this for all of those services every time, though - I'm guessing that if a service is busy and hasn't exited before the container stops, it doesn't clean up its pid file? Or maybe it only causes an issue if there's a (different) process running on the same PID when it next starts.
If a NAV daemon is killed or somehow dies without getting a chance to run its cleanup routine, its PID file will be left behind. On a normal system, that PID number will not be reused for a long time, so the chances of the PID being "alive" when attempting a restart are low.
However, in the context of restarting a container, PID numbers will be assigned from 1 and up again, so re-use of those stored PID numbers is highly likely, and would cause exactly this problem.
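As a rough illustration of why the usual liveness check breaks here (a simplified shell sketch, not NAV's actual Python code, and the path is only an example):

  pidfile=/tmp/smsd.pid
  if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
      # some process answers to the stored PID, so the daemon assumes it is
      # already running and refuses to start
      echo "smsd appears to be running already" >&2
      exit 1
  fi
  # After a container restart, PIDs are handed out from 1 again, so the stale
  # number in the file can easily belong to an unrelated process, making the
  # kill -0 test above a false positive.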
> Curiously, pping (and on the most recent restart of the container, apache - again, another process was running on the PID recorded in Apache's lock file) *do* get flagged by supervisord as being restarted too many times, and aborted as you had noted. Maybe there's a difference if it's a python script rather than a compiled executable (i.e. python itself starts cleanly but the script doesn't).
The difference might be whether the daemon forks into the background before checking the PID file, or bails out immediately while still in the foreground. pping is the oldest piece of code in NAV, so it might behave slightly differently from the rest.
> As a kludge of a workaround, I added commands to the entrypoint script which delete all of the PID files, on the basis that at container start the processes shouldn't be running anyway. That seemed to make it start cleanly, but I'm not sure whether there is a better or more 'Docker'ish way to do it?
That's the exact solution I would suggest: explicitly deleting the PID files on container startup. There might be a way to get supervisord to remove stale PID files as well, but cleaning the slate when the container starts is a much simpler solution. I wouldn't mind a pull request to that effect on GitHub :-)
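Roughly along these lines near the top of the entrypoint script (the /tmp path is taken from your description; adjust it if the daemons write their PID files elsewhere):

  #!/bin/sh
  # Nothing from a previous run can still be alive when the container
  # (re)starts, so any PID files lying around are stale by definition;
  # remove them before supervisord spawns the NAV daemons.
  rm -f /tmp/*.pid

  # hand over to the container's normal command (e.g. supervisord)
  exec "$@"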
[1] http://supervisord.org/configuration.html?highlight=startretries#program-x-s...