On Thu, 28 May 2020 08:34:51 +0000 Steve Kersley <steve.kersley@keble.ox.ac.uk> wrote:
> TLDR: Gaps in graphs after the Nav container has been running for a day or two. The Nav container doesn't create logfiles. The Nav postgres container has lots of postgres processes running, consuming ¾ of the host's CPU. Is this normal or a bug (in the container, or with my setup)?
Can't really speak for container-specific behavior of PostgreSQL, but depending on the size of your NAV installation, PostgreSQL is the first thing you would outsource to dedicated hardware (it really shines when it doesn't need to compete for I/O with lots of other processes).
Also depending on the size of your installation/network: At some point, you will have enough data in PostgreSQL that you will want to act as a DBA and make tweaks to the standard PostgreSQL tuning options. YMMV.
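For example, a few of the standard knobs you would typically look at first (the values here are purely illustrative; they should be sized from the host's actual RAM and workload, and tools like pgtune can suggest starting points):

```ini
# postgresql.conf - illustrative values only, size these for your host
shared_buffers = 2GB          # ~25% of RAM is a common rule of thumb
effective_cache_size = 6GB    # rough estimate of OS cache available to queries
work_mem = 16MB               # per sort/hash operation, per backend
maintenance_work_mem = 256MB  # speeds up VACUUM and index builds
```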
> The next issue I discovered while looking into the graph gaps is that there are no logs. Docker does mount a volume on /var/log/nav (from the dockerfile), but nothing ever gets created in it, meaning it's hard to look for warnings or errors. Is that expected behaviour, or is it something I got wrong when adapting the docker-compose file?
Ideally, each daemon would have its own container and just log to stderr, but we're not there yet.
However, what you're describing wasn't the intended behavior, and is probably an error on our part. Since all the NAV processes run in the same container, supervisord is used to ensure the processes are always running. This also means that the processes DO actually log to stderr, but supervisord takes care of dispatching the log output to disk. These logs end up in /var/log/supervisor instead, which is not a mounted volume at all. I suspect some of the cron jobs may still log directly to the NAV-configured location, however, since any stderr output they have would otherwise be mailed to the crontab owner.
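As a workaround until that's fixed, you could mount /var/log/supervisor yourself. A sketch of what that might look like in the compose file (the service name "nav" and the host paths are just examples; adjust them to your setup):

```yaml
# docker-compose.yml fragment (service name is an example)
services:
  nav:
    volumes:
      - ./logs/nav:/var/log/nav
      - ./logs/supervisor:/var/log/supervisor
```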
The container distro is still very much a work in progress. We don't use it in production ourselves (yet), but there are some who do, and who have contributed back to the nav-container repository. If you have any useful improvements to the distro, they would be most welcome :-)
> 2020-05-28 07:29:25,355 INFO exited: smsd (exit status 1; not expected)
> (Looking into this, it appears to be exiting because python-gammu is not configured in the container, but this is unlikely to be connected to the issue.)
This assessment is correct. The container distribution doesn't configure smsd, and its default is to use Gammu, but since Gammu requires a physically connected GSM unit, the container distro doesn't set anything up. In any case, using SMS notifications will always require some level of configuring smsd - so in reality, the container default could be to just not start the daemon at all.
Also, supervisord will eventually give up on the restart attempts when smsd fails too many times (about 6 times, from what I can see).
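If you don't need SMS notifications at all, a sketch of how disabling the daemon could look in the supervisord config (the section name and existing options in nav-container's config may differ; this just shows the relevant supervisord settings):

```ini
; supervisord program section for smsd (illustrative)
[program:smsd]
autostart=false    ; don't start smsd at all
autorestart=false
startretries=3     ; number of failed starts before supervisord gives up
```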
> I did increase UDP receive buffers on the host, which I first thought had solved the issue, but the improvement was a side effect of restarting all of the containers, and the gaps returned later.
> Nav's seeddb has 65 devices, almost all switches, with a handful of VMware hosts. The job durations are peculiar too: the 1minstats job, for instance, takes a couple of seconds to run on some switches.
The duration of the 1minstats job would depend entirely on the amount of data points to collect (and somewhat on the SNMP agent implementation of the monitored device). The 1minstats job collects system metrics, which may be few, depending on the device vendor, and sensor metrics, which may range from none to many, also depending on the device vendor.
> On others it takes nearly a minute (or often several minutes on one in particular), but whether that's the cause or a symptom I don't know.
If you've read the NAV docs' guide to debugging gaps in graphs, you'll know that it's quite important for the metrics to arrive in the carbon backend at consistent intervals.
If the ipdevpoll stats-gathering jobs do not run at more or less exact intervals, but keep getting delayed past their schedules, you may end up with gaps in the Graphite data. Ideally, 1minstats should run every 60 seconds for each device, with as few deviations as possible (and 5minstats every 300 seconds). ipdevpoll has limits on how many concurrent jobs it will run, and if the total workload gets too high, jobs will start getting delayed.

So when you find the logs, make sure to check whether the actual job intervals are as expected. If they aren't, you may want to look into ipdevpoll's multiprocess mode, in which ipdevpoll uses multiple processes (and therefore, potentially, multiple cores) to accomplish its work.
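Checking the intervals by eye gets tedious; something like this quick sketch could help once you have the logs. The log line format here is made up for illustration (match the regex to whatever ipdevpoll actually emits on your install):

```python
import re
from datetime import datetime

# Hypothetical excerpt of job-completion log lines; the real format may differ.
LOG_LINES = [
    "2020-05-28 07:00:00,100 INFO [1minstats] [sw1.example.org] job completed",
    "2020-05-28 07:01:00,150 INFO [1minstats] [sw1.example.org] job completed",
    "2020-05-28 07:02:35,200 INFO [1minstats] [sw1.example.org] job completed",
]

TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+")

def job_intervals(lines):
    """Return the number of seconds between consecutive job runs."""
    stamps = []
    for line in lines:
        m = TIMESTAMP_RE.match(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

def delayed(intervals, expected=60, tolerance=10):
    """Flag intervals deviating more than `tolerance` seconds from schedule."""
    return [i for i in intervals if abs(i - expected) > tolerance]

intervals = job_intervals(LOG_LINES)
print(intervals)           # [60.0, 95.0]
print(delayed(intervals))  # [95.0]
```

For each device/job pair, any interval that keeps drifting well past the schedule is a sign ipdevpoll is saturated.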
> It looks as though the postgres database used by Nav is what's eating resources and presumably causing the graph gaps. The docker host typically has a load average of 9-12. Looking at the processes, there are typically at least 10 postgres processes running in the NavDB container, each continually using 20-30% of a CPU. That does not seem normal to me. I also gave that container more shared memory, as it was logging occasional errors about not being able to allocate enough. Again no change; worse, if anything.
Again, I can't speak from experience when it comes to running PostgreSQL in production in containers. It can certainly be done, but PostgreSQL will always perform best when it's close to the bare metal.
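That said, before moving anything, it might be worth peeking at what those busy backends are actually doing. From psql inside the NavDB container, something along these lines (pg_stat_activity is a standard PostgreSQL view; long-running entries and their query text are usually telling):

```sql
-- Show non-idle backends, how long their current query has been running,
-- and the query text itself
SELECT pid, state,
       now() - query_start AS runtime,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```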
There's a guy named Øystein Gyland who runs NAV in production using nav-container. He's made some contributions to the container repo in that regard, but I'm not sure whether he follows the mailing list. I normally find him active in the IRC channel. He might be able to share some insights into his experience with running NAV in containers.
> The docker host is running as a VM as an in-production 'test'; in the fullness of time (when I'm not working remotely), I'll likely move the containers onto a bare-metal host.
That will probably help, but I would still pursue putting the PostgreSQL server on bare metal before anything else (I've already lost count of how many times I've mentioned this :-D). If you have other software that depends on PostgreSQL, it would probably benefit from that too.