TL;DR: Gaps appear in graphs after the Nav container has been running for a day or two. The Nav container doesn't create logfiles. The Nav postgres container has lots of postgres processes running, consuming ¾ of the host's CPU. Is this normal, or a bug (in the container, or with my setup)?
After killing my old Nav system with a series of conflicting upgrades, I decided to embrace the Docker approach, so that Nav (and other services) would be self-contained and not affected by upgrades to individual components.
All was well initially, but I'm running into increasing graph gaps that I've not been able to resolve. If I stop and rebuild all of the containers with docker-compose, everything seems fine to start with, but after a while (a few hours to a day or two) the gaps start appearing and it's downhill from there.

The next issue I discovered while looking into the graph gaps is that there are no logs. Docker does mount a volume on /var/log/nav (per the Dockerfile), but nothing ever gets created in it, so it's hard to look for warnings or errors. Is that expected behaviour, or is it something I got wrong when adapting the docker-compose file? The only output I get from 'docker logs' shows smsd being spawned and exiting every couple of seconds:
2020-05-28 07:29:19,089 INFO spawned: 'smsd' with pid 28112
2020-05-28 07:29:20,092 INFO success: smsd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-05-28 07:29:20,560 INFO exited: smsd (exit status 1; not expected)
2020-05-28 07:29:21,563 INFO spawned: 'smsd' with pid 28116
2020-05-28 07:29:22,565 INFO success: smsd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-05-28 07:29:22,971 INFO exited: smsd (exit status 1; not expected)
2020-05-28 07:29:23,975 INFO spawned: 'smsd' with pid 28118
2020-05-28 07:29:24,977 INFO success: smsd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-05-28 07:29:25,355 INFO exited: smsd (exit status 1; not expected)
(Looking into this, smsd appears to be exiting because python-gammu is not configured in the container, but that is unlikely to be connected to the main issue.)
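For what it's worth, here's how I've been poking around inside the container to see whether anything is being written to the log directory at all. The container name "nav" is an assumption; substitute whatever name docker-compose ps reports:

```
# Assumes the NAV container is named "nav" -- adjust to match your compose project.
# Is anything ever written to the log directory inside the container?
docker exec nav ls -la /var/log/nav

# Confirm the volume is actually mounted where expected:
docker inspect -f '{{ range .Mounts }}{{ .Destination }} <- {{ .Source }}{{ "\n" }}{{ end }}' nav
```

If the mount looks right but the directory stays empty, the next place I'd look is NAV's own logging configuration inside the container.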
I did increase the UDP receive buffers on the host, which I first thought had solved the issue, but the improvement turned out to be a side effect of restarting all of the containers, and the gaps returned later.
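For reference, this is roughly what I applied on the host. The values here are illustrative, not anything NAV recommends:

```
# /etc/sysctl.d/90-udp-buffers.conf  (example values, not a recommendation)
net.core.rmem_max = 26214400
net.core.rmem_default = 26214400
```

Applied with `sysctl -p /etc/sysctl.d/90-udp-buffers.conf` and verified with `sysctl net.core.rmem_max`.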
Nav's seeddb has 65 devices: almost all switches, plus a handful of VMware hosts. The job durations are peculiar too. The 1minstats job, for instance, takes a couple of seconds on some switches and nearly a minute on others (often several minutes on one in particular), but whether that's a cause or a symptom I don't know.
It looks as though the postgres database used by Nav is what's eating resources, and presumably what's causing the graph gaps. The docker host typically has a load average of 9-12. Looking at the processes, there are typically at least 10 postgres processes running in the NavDB container, each continually using 20-30% of a CPU. This does not seem normal to me. I also gave that container more shared memory, as it was logging occasional errors about not being able to allocate enough; again no change, worse if anything.
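To see what those backends are actually doing, I've been querying pg_stat_activity inside the database container. The container name "navdb" and the database user "nav" are assumptions; adjust to match your compose file:

```
# Container name and user are assumptions -- adjust to your setup.
docker exec -it navdb psql -U nav -c \
  "SELECT pid, state, now() - query_start AS runtime, left(query, 60) AS query
     FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC;"
```

If the same queries keep showing up with long runtimes, that would point at what's keeping the CPU busy. (The extra shared memory, incidentally, can be set with `shm_size:` on the database service in docker-compose.)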
The docker host is running as a VM as an in-production 'test'; in the fullness of time (when I'm not working remotely), I'll likely move the containers onto a bare-metal host. The docker host has 4 cores and 8 GB of RAM allocated, and is not hitting swap, but I can increase both. The VMware host it runs on has plenty of spare CPU and RAM, and the docker host runs from an array of SSDs. Besides the Nav and NavDB containers, it's running containers for Icinga and the Icinga backend MySQL database; a dedicated shared graphite container accessed by both Icinga and Nav; Grafana as a dashboard for both; and nginx as a web proxy for all services. (Previously these services all ran fine alongside each other on a physical host of similar specification, until conflicting dependencies broke things.)
Does anyone have any thoughts? Entries from my docker-compose.yml: