On Thu, 28 May 2020 08:34:51 +0000 Steve Kersley <steve.kersley@keble.ox.ac.uk> wrote:
> TLDR: Gaps in graphs after the Nav container has been running for a day or two. The Nav container doesn't create logfiles. The Nav postgres container has lots of postgres processes running, consuming ¾ of the host's CPU. Is this normal or a bug (in the container, or with my setup)?
Can't really speak for container-specific behavior of PostgreSQL, but depending on the size of your NAV installation, PostgreSQL is the first thing you would outsource to dedicated hardware (it really shines when it doesn't need to compete for I/O with lots of other processes).
Also depending on the size of your installation/network: At some point, you will have enough data in PostgreSQL that you will want to act as a DBA and make tweaks to the standard PostgreSQL tuning options. YMMV.
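For example, a few of the standard knobs you would typically look at first (the values here are purely illustrative; they should be sized from the host's actual RAM and workload, and tools like pgtune can suggest starting points):

```ini
# postgresql.conf - illustrative values only, size these for your host
shared_buffers = 2GB          # ~25% of RAM is a common rule of thumb
effective_cache_size = 6GB    # rough estimate of OS cache available to queries
work_mem = 16MB               # per sort/hash operation, per backend
maintenance_work_mem = 256MB  # speeds up VACUUM and index builds
```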
> The next issue I discovered while looking into the graph gaps is that there are no logs. Docker does mount a volume on /var/log/nav (from the dockerfile), but nothing ever gets created in it, meaning it's hard to look for warnings or errors. Is that expected behaviour, or is it something I got wrong when adapting the docker-compose file?
Ideally, each daemon would have its own container and just log to stderr, but we're not there yet.
However, what you're describing wasn't the intended behavior, and is probably an error on our part. Since all the NAV processes run in the same container, supervisord is used to ensure the processes are always running. This also means that the processes DO actually log to stderr, but supervisord takes care of dispatching the log output to disk. These logs end up in /var/log/supervisor instead, which is not a mounted volume at all. I suspect some of the cron jobs may still log directly to the NAV-configured location, however, since any stderr output they have would otherwise be mailed to the crontab owner.
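As a workaround until that's fixed, you could mount /var/log/supervisor yourself. A sketch of what that might look like in the compose file (the service name "nav" and the host paths are just examples; adjust them to your setup):

```yaml
# docker-compose.yml fragment (service name is an example)
services:
  nav:
    volumes:
      - ./logs/nav:/var/log/nav
      - ./logs/supervisor:/var/log/supervisor
```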
The container distro is still very much a work in progress. We don't use it in production ourselves (yet), but there are some who do, and who have contributed back to the nav-container repository. If you have any useful improvements to the distro, they would be most welcome :-)
> 2020-05-28 07:29:25,355 INFO exited: smsd (exit status 1; not expected)
> (Looking into this, it appears to be exiting because python-gammu is not configured in the container, but this is unlikely to be connected to the issue.)
This assessment is correct. The container distribution doesn't configure smsd, and its default is to use Gammu, but since Gammu requires a physically connected GSM unit, the container distro doesn't set anything up. In any case, using SMS notifications will always require some level of configuring smsd - so in reality, the container default could be to just not start the daemon at all.
Also, supervisord will eventually give up on the restart attempts when smsd fails too many times (about 6 times, from what I can see).
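If you don't need SMS notifications at all, a sketch of how disabling the daemon could look in the supervisord config (the section name and existing options in nav-container's config may differ; this just shows the relevant supervisord settings):

```ini
; supervisord program section for smsd (illustrative)
[program:smsd]
autostart=false    ; don't start smsd at all
autorestart=false
startretries=3     ; number of failed starts before supervisord gives up
```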
> I did increase UDP receive buffers on the host, which I first thought had solved the issue, but the improvement was a side effect of restarting all of the containers, and the gaps returned later.
> Nav's seeddb has 65 devices, almost all switches, with a handful of VMware hosts. The job durations are peculiar too: the 1minstats job, for instance, takes a couple of seconds to run on some switches.
The duration of the 1minstats job would depend entirely on the amount of data points to collect (and somewhat on the SNMP agent implementation of the monitored device). The 1minstats job collects system metrics, which may be few, depending on the device vendor, and sensor metrics, which may range from none to many, also depending on the device vendor.
> On others it takes nearly a minute (or often several minutes on one in particular), but whether that's the cause or a symptom I don't know.
If you've read the NAV docs' guide to debugging gaps in graphs, you'll know that it's quite important for the metrics to arrive in the carbon backend at consistent intervals.
If the ipdevpoll stats-gathering jobs do not run at more or less exact intervals, but keep getting delayed past their schedules, you may end up with gaps in the Graphite data. Ideally, 1minstats should run every 60 seconds for each device, with as few deviations as possible (and 5minstats every 300 seconds). ipdevpoll has limits on how many concurrent jobs it will run, and if the total workload gets too high, jobs will start getting delayed.

So when you find the logs, make sure to check whether the actual job intervals are as expected. If they aren't, you may want to look into ipdevpoll's multiprocess mode, in which ipdevpoll uses multiple processes (and therefore, potentially, multiple cores) to accomplish its work.
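Checking the intervals by eye gets tedious; something like this quick sketch could help once you have the logs. The log line format here is made up for illustration (match the regex to whatever ipdevpoll actually emits on your install):

```python
import re
from datetime import datetime

# Hypothetical excerpt of job-completion log lines; the real format may differ.
LOG_LINES = [
    "2020-05-28 07:00:00,100 INFO [1minstats] [sw1.example.org] job completed",
    "2020-05-28 07:01:00,150 INFO [1minstats] [sw1.example.org] job completed",
    "2020-05-28 07:02:35,200 INFO [1minstats] [sw1.example.org] job completed",
]

TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+")

def job_intervals(lines):
    """Return the number of seconds between consecutive job runs."""
    stamps = []
    for line in lines:
        m = TIMESTAMP_RE.match(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

def delayed(intervals, expected=60, tolerance=10):
    """Flag intervals deviating more than `tolerance` seconds from schedule."""
    return [i for i in intervals if abs(i - expected) > tolerance]

intervals = job_intervals(LOG_LINES)
print(intervals)           # [60.0, 95.0]
print(delayed(intervals))  # [95.0]
```

For each device/job pair, any interval that keeps drifting well past the schedule is a sign ipdevpoll is saturated.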
> It looks as though the postgres database used by Nav is what's eating resources and presumably causing the graph gaps. The docker host typically has a load average of 9-12. Looking at the processes, there are typically at least 10 postgres processes running in the NavDB container, each continually using 20-30% of a CPU. That does not seem normal to me. I also gave that container more shared memory, as it was logging occasional errors about not being able to allocate enough. Again no change; worse, if anything.
Again, I can't speak from experience when it comes to running PostgreSQL in production in containers. It can certainly be done, but PostgreSQL will always perform best when it's close to the bare metal.
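That said, before moving anything, it might be worth peeking at what those busy backends are actually doing. From psql inside the NavDB container, something along these lines (pg_stat_activity is a standard PostgreSQL view; long-running entries and their query text are usually telling):

```sql
-- Show non-idle backends, how long their current query has been running,
-- and the query text itself
SELECT pid, state,
       now() - query_start AS runtime,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```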
There's a guy named Øystein Gyland who runs NAV in production using nav-container. He's made some contributions to the container repo in that regard, but I'm not sure whether he follows the mailing list. I normally find him active in the IRC channel. He might be able to share some insights into his experience with running NAV in containers.
> The docker host is running as a VM as an in-production 'test'; in the fullness of time (when I'm not working remotely), I'll likely move the containers onto a bare-metal host.
That will probably help, but I would still pursue putting the PostgreSQL server on bare metal before anything else (I've already lost count of how many times I've mentioned this :-D). If you have other software that depends on PostgreSQL, it would probably benefit from that too.