On Mon, 1 Jun 2020 08:03:03 +0000, Steve Kersley <steve.kersley@keble.ox.ac.uk> wrote:
>> Also, supervisord will eventually cease with the restart attempts when smsd fails too many times (seems to be about 6 times for me).
> I'm certainly not seeing this behaviour. The container has currently been running for just short of 3 days, and it's still spawning smsd every second (will look at how to configure supervisord).
The default setting in supervisord is a `startretries` [1] value of only 3. This, however, only affects the number of restart attempts for processes that fail immediately when starting (i.e., they only get to the STARTING state before failing).
Once supervisord considers a process to be in the RUNNING state, the autorestart option applies. The supervisord config provided by the nav-container says to always autorestart an exited NAV process.
I'm not sure exactly what conditions supervisord applies before it considers a process to have moved from the STARTING to the RUNNING state, but it seems your supervisord instance must have considered those processes to be RUNNING before they died, which would cause it to attempt a restart every time.
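For reference, these are the knobs involved in a supervisord program section. The excerpt below is only an illustration of the options discussed here, not what the nav-container actually ships with; the comments reflect supervisord's documented defaults:

  [program:smsd]
  ; placeholder; whatever command the image uses to run smsd in the foreground
  command=...
  autostart=true
  ; restart the program whenever it exits from the RUNNING state
  autorestart=true
  ; a process must stay up this many seconds for STARTING to become RUNNING (default 1)
  startsecs=1
  ; only limits retries for processes that die while still STARTING (default 3)
  startretries=3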
> In fact, it's now also spawning alertengine, eventengine and snmptrapd every second (it wasn't doing this when the container first booted).
If that is the case, then something is really wrong, and you should check supervisord's logs for these processes.
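For instance, from inside the container, something along these lines (the container name here is just a placeholder, and supervisorctl may need a -c argument pointing at the right config file):

  # attach to the running container
  docker exec -it nav /bin/sh
  # per-program state, uptime and exit information as supervisord sees it
  supervisorctl status
  # follow a single program's stderr log
  supervisorctl tail -f smsd stderr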
> I'm not sure whether the point at which it started respawning those services tallies with when the gaps started appearing, but conceivably the CPU or I/O load from trying to run several Python scripts every second could cause postgres issues when they're sharing those resources? (Or even the fact that those services aren't running?)
> Having looked at why those other services (other than smsd) were respawning: they thought they were already running. I've been able to recreate this. If I stop and start the container (rather than destroying and recreating it via docker-compose), there are still .pid files for alertengine, eventengine, pping and snmptrapd in /tmp, some or all of which are datestamped from an earlier container start. It doesn't do this for all of those services every time, though - I'm guessing that if a service is busy and hasn't exited before the container stops, it doesn't clean up its pid file? Or maybe it only causes an issue if there's a (different) process running on the same PID when it next starts.
If a NAV daemon is killed or somehow dies without getting a chance to run its cleanup routine, its PID file will be left behind. On a normal system, that PID number will not be reused for a long time, so the chances of the PID being "alive" when attempting a restart are low.
However, in the context of restarting a container, PID numbers will be assigned from 1 and up again, so re-use of those stored PID numbers is highly likely, and would cause exactly this problem.
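As a rough illustration of why the usual liveness check breaks here (a simplified shell sketch, not NAV's actual Python code, and the path is only an example):

  pidfile=/tmp/smsd.pid
  if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
      # some process answers to the stored PID, so the daemon assumes it is
      # already running and refuses to start
      echo "smsd appears to be running already" >&2
      exit 1
  fi
  # After a container restart, PIDs are handed out from 1 again, so the stale
  # number in the file can easily belong to an unrelated process, making the
  # kill -0 test above a false positive.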
> Curiously, pping (and on the most recent restart of the container, apache - again, another process was running on the PID recorded in Apache's lock file) *do* get flagged by supervisord as being restarted too many times, and aborted as you had noted. Maybe there's a difference if it's a python script rather than a compiled executable (i.e. python itself starts cleanly but the script doesn't).
The difference might be whether the daemon forks into the background before checking the PID file, or bails out immediately while still in the foreground. pping is the oldest piece of code in NAV, so it might behave slightly differently from the rest.
> As a kludge of a workaround, I added commands to the entrypoint script which delete all of the PID files, on the basis that at container start the processes shouldn't be running anyway. That seemed to make it start cleanly, but I'm not sure whether there is a better or more 'Docker'ish way to do it?
That's the exact solution I would suggest: explicitly deleting the PID files on container startup. There might be a way to get supervisord to remove stale PID files as well, but cleaning the slate when the container starts is a much simpler solution. I wouldn't mind a pull request to that effect on GitHub :-)
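Roughly along these lines near the top of the entrypoint script (the /tmp path is taken from your description; adjust it if the daemons write their PID files elsewhere):

  #!/bin/sh
  # Nothing from a previous run can still be alive when the container
  # (re)starts, so any PID files lying around are stale by definition;
  # remove them before supervisord spawns the NAV daemons.
  rm -f /tmp/*.pid

  # hand over to the container's normal command (e.g. supervisord)
  exec "$@"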
[1] http://supervisord.org/configuration.html?highlight=startretries#program-x-s...