Morning,
TLDR: Gaps in graphs after Nav container running for a day or two. Nav container doesn't create logfiles. Nav postgres container has lots of postgres processes running, consuming ¾ of host CPU. Is this normal, or a bug (in the container, or with my setup)?
So, after killing my old Nav system with a series of conflicting upgrades, I decided to embrace the Docker approach, so that Nav (and other services) would be self-contained and not affected by upgrading components.
All was initially well, but I'm starting to run into increasing graph gaps that I've not been able to resolve. If I stop and rebuild all of the containers with docker-compose, all seems well to start with, but after a while (few hours to a day or two) I start to get gaps and then it's downhill from there. The next issue I discovered while looking into the graph gaps is that there are no logs. Docker does mount a volume on /var/log/nav (from the dockerfile) but nothing ever gets created in it, meaning it's hard to look for warnings or errors. Is that expected behaviour, or is it something I got wrong when adapting the docker-compose file? The only logs I get from 'docker logs' show smsd being spawned and exiting every second:

2020-05-28 07:29:19,089 INFO spawned: 'smsd' with pid 28112
2020-05-28 07:29:20,092 INFO success: smsd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-05-28 07:29:20,560 INFO exited: smsd (exit status 1; not expected)
2020-05-28 07:29:21,563 INFO spawned: 'smsd' with pid 28116
2020-05-28 07:29:22,565 INFO success: smsd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-05-28 07:29:22,971 INFO exited: smsd (exit status 1; not expected)
2020-05-28 07:29:23,975 INFO spawned: 'smsd' with pid 28118
2020-05-28 07:29:24,977 INFO success: smsd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-05-28 07:29:25,355 INFO exited: smsd (exit status 1; not expected)
(looking into this, appears to be exiting because python-gammu is not configured in the container, but is unlikely to be connected to the issue)
I did increase UDP receive buffers on the host, which I first thought had solved the issue, but the improvement was a side effect of restarting all of the containers and reverted to gaps later.
Nav's seeddb has 65 devices. Almost all just switches, with a handful of vmware hosts. The job durations are peculiar too - the 1minstats job for instance, on some switches takes a couple of seconds to run. On others, nearly a minute (or often several minutes on one in particular), but whether that's the cause or a symptom I don't know.
It looks as though the postgres database used by Nav is what's eating resources and presumably causing the graph gaps. The docker host typically has a load average of 9-12. Looking at the processes, there are typically at least 10 postgres processes running in the NavDB container, continually using 20-30% of CPU for each process. This does not seem normal to me? I also gave that container more shared memory, as it was logging occasional errors about not being able to allocate enough. Again no change - worse if anything.
The docker host is running as a VM as an in-production 'test' - in the fullness of time (when I'm not remote working), I'll likely move the containers onto a bare metal host. The docker host has 4 cores and 8 GB of RAM allocated, and is not hitting swap, but I can increase both. The vmware host it runs on has plenty of CPU and RAM with not a massive amount running on it, and the docker host is running from an array of SSDs. Besides the Nav and NavDB containers, it's running containers for Icinga, and the Icinga backend mysql database; a dedicated shared graphite container accessed by both Icinga and Nav; Grafana as a dashboard for both; and nginx as a web proxy for all services. (Previously these services were all running fine alongside each other on a physical host of similar specification, until conflicting dependencies broke things.)
Does anyone have any thoughts? Entries from my docker-compose.yml:

nav:
  container_name: nav
  build:
    context: /docker/build/nav
    args:
      NAV_VERSION: 5.0.5
  ports:
    - "8002:80"
    - "162:162/udp"
  volumes:
    - /docker/nav/storage/roomimages:/var/lib/nav/uploads/images/rooms
  depends_on:
    - navdb
  environment:
    - PGHOST=navdb
    - PGDATABASE=nav
    - PGUSER=nav
    - PGPASSWORD=Password
    - PGPORT=5432
    - CARBONHOST=graphite
    - CARBONPORT=2003
    - GRAPHITEWEB=http://graphite/
    - NOINITDB=0
  env_file:
    - /docker/build/nav/nav-variables.env
  restart: always
navdb:
  image: "postgres:12"
  container_name: navdb
  volumes:
    - /docker/navdb/storage/postgresql:/var/lib/postgresql/data
  environment:
    - POSTGRES_PASSWORD=Password
    - POSTGRES_USER=nav
    - POSTGRES_DB=nav
  restart: always
  shm_size: '256mb'
Steve.
On Thu, 28 May 2020 08:34:51 +0000 Steve Kersley steve.kersley@keble.ox.ac.uk wrote:
TLDR: Gaps in graphs after Nav container running for a day or two. Nav container doesn't create logfiles. Nav postgres container has lots of postgres processes running, consuming ¾ of host CPU. Is this normal, or a bug (in the container, or with my setup)?
Can't really speak for container-specific behavior of PostgreSQL, but depending on the size of your NAV installation, PostgreSQL is the first thing you would outsource to dedicated hardware (it really shines when it doesn't need to compete for I/O with lots of other processes).
Also depending on the size of your installation/network: At some point, you will have enough data in PostgreSQL that you will want to act as a DBA and make tweaks to the standard PostgreSQL tuning options. YMMV.
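Just as an illustration of the kind of knobs involved (these values are placeholders for a host with around 8 GB of RAM, not recommendations - the right numbers depend entirely on your workload), the usual starting points in postgresql.conf are along these lines:

# postgresql.conf - illustrative starting points only, tune for your workload
shared_buffers = 2GB           # often around 25% of available RAM
effective_cache_size = 6GB     # a hint about how much RAM the OS can use for caching
work_mem = 16MB                # per sort/hash operation, so keep it modest
maintenance_work_mem = 256MB   # used by VACUUM, CREATE INDEX and friends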
The next issue I discovered while looking into the graph gaps is that there are no logs. Docker does mount a volume on /var/log/nav (from the dockerfile) but nothing ever gets created in it, meaning it's hard to look for warnings or errors. Is that expected behaviour, or is it something I got wrong when adapting the docker-compose file?
Ideally, each daemon would have its own container and just log to stderr, but we're not there yet.
However, what you're describing wasn't the intended behavior, and is probably an error on our part. Since all the NAV processes run in the same container, supervisord is used to ensure the processes are always running. This also means that the processes DO actually log to stderr, but supervisord takes care of dispatching the log output to disk. These logs end up in /var/log/supervisor instead, which is not a mounted volume at all. I suspect some of the cron jobs may still log directly to the NAV-configured location, however, since any stderr output they have would otherwise be mailed to the crontab owner.
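Until that's sorted out properly, one workaround is to mount /var/log/supervisor from your compose file as well, so those logs end up somewhere persistent and reachable from the host. A sketch (the host path here is just a placeholder):

nav:
  volumes:
    - /docker/nav/storage/roomimages:/var/lib/nav/uploads/images/rooms
    # placeholder host path for the supervisord-captured logs
    - /docker/nav/storage/logs:/var/log/supervisor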
The container distro is still very much a work in progress. We don't use it in production ourselves (yet), but there are some who do, and who have contributed back to the nav-container repository. If you have any useful improvements to the distro, they would be most welcome :-)
2020-05-28 07:29:25,355 INFO exited: smsd (exit status 1; not expected)
(looking into this, appears to be exiting because python-gammu is not configured in the container, but is unlikely to be connected to the issue)
This assessment is correct. The container distribution doesn't configure smsd, and its default is to use Gammu, but since Gammu requires a physically connected GSM unit, the container distro doesn't set anything up. In any case, using SMS notifications will always require some level of configuring smsd - so in reality, the container default could be to just not start the daemon at all.
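If the constant restarts bother you in the meantime, you can tell supervisord to leave smsd alone. Roughly like this in the smsd program section of the container's supervisord configuration (a sketch - the exact file and section name depend on the image):

[program:smsd]
; keep the existing command line, but stop supervisord from (re)starting it
autostart = false
autorestart = false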
Also, supervisord will eventually cease with the restart attempts when smsd fails too many times (seems to be about 6 times for me).
I did increase UDP receive buffers on the host, which I first thought had solved the issue, but the improvement was a side effect of restarting all of the containers and reverted to gaps later.
Nav's seeddb has 65 devices. Almost all just switches, with a handful of vmware hosts. The job durations are peculiar too - the 1minstats job for instance, on some switches takes a couple of seconds to run.
The duration of the 1minstats job would depend entirely on the amount of data points to collect (and somewhat on the SNMP agent implementation of the monitored device). The 1minstats job collects system metrics, which may be few, depending on the device vendor, and sensor metrics, which may range from none to many, also depending on the device vendor.
On others, nearly a minute (or often several minutes on one in particular), but whether that's the cause or a symptom I don't know.
If you did read the NAV docs guide to debugging gaps in graphs, you'll know that it's quite important for the metrics to arrive in the carbon backend at specific intervals.
If the ipdevpoll stats gathering jobs do not run at more or less exact intervals, but keep getting delayed from their schedules, you may end up with gaps in the Graphite data. I.e. 1minstats should ideally run every 60 seconds for each device, with as few deviations as possible (300 seconds for the 5minstats job). ipdevpoll has limits on how many concurrent jobs it will run, and if the total workload gets too high, jobs will start getting delayed. So when you find the logs, you should check whether the actual job intervals are as expected. If not, you may want to look into ipdevpoll's multiprocess mode, in which ipdevpoll will use multiple processes (and therefore, potentially, multiple cores) to accomplish its work.
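Multiprocess mode is configured in ipdevpoll.conf - the sketch below is from memory, so verify the section and option names against the ipdevpoll.conf reference in the documentation:

# ipdevpoll.conf (sketch only - check the documented option names)
[multiprocess]
enabled = yes
# one worker process per CPU core is a reasonable starting point
workers = 4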
It looks as though the postgres database used by Nav is what's eating resources and presumably causing the graph gaps. The docker host typically has a load average of 9-12. Looking at the processes, there are typically at least 10 postgres processes running in the NavDB container, continually using 20-30% of CPU for each process. This does not seem normal to me? I also gave that container more shared memory, as it was logging occasional errors about not being able to allocate enough. Again no change - worse if anything.
Again, I can't speak from experience when it comes to running PostgreSQL in production in containers. It can certainly be done, but PostgreSQL will always perform best when it's close to the bare metal.
There's a guy named Øystein Gyland who runs NAV in production using nav-container. He's made some contributions to the container repo in that regard, but I'm not sure whether he follows the mailing list. I normally find him active in the IRC channel. He might be able to share some insights into his experience with running NAV in containers.
The docker host is running as a VM as an in-production 'test' - in the fullness of time (when I'm not remote working), I'll likely move the containers onto a bare metal host.
That will probably help, but I would still pursue putting the PostgreSQL server on bare metal before anything else (I've already lost count of how many times I've mentioned this :-D). If you have other software that depends on PostgreSQL, they would probably benefit from it too.
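In compose terms that change is mostly a matter of repointing the environment at the external server and dropping the navdb service - a sketch, with db.example.org as a placeholder for whatever host ends up running PostgreSQL:

nav:
  environment:
    # point NAV at the external PostgreSQL server instead of the navdb container
    - PGHOST=db.example.org   # placeholder hostname
    - PGPORT=5432
    - PGDATABASE=nav
    - PGUSER=nav
    - PGPASSWORD=Password
  # the navdb service and the depends_on: navdb entry are then no longer needed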
Hi Morten,

Thanks for that detailed response. I'll look at options for moving postgres. However, on one point:
Also, supervisord will eventually cease with the restart attempts when smsd fails too many times (seems to be about 6 times for me).
I'm certainly not seeing this behaviour. The container has currently been running for just short of 3 days, and it's still spawning smsd every second (will look at how to configure supervisord). In fact, it's now also spawning alertengine, eventengine and snmptrapd every second (it wasn't doing this when the container first booted). Not sure whether the start of that respawning tallies with the gaps starting, but conceivably the CPU or IO usage from trying to run several python scripts every second could cause postgres issues when sharing those resources? (Or even the fact that they're not running?)
Having looked at why those other services (other than smsd) were respawning, it turns out they thought they were already running. I've been able to recreate this: if I stop and start the container (rather than destroying and recreating it via docker-compose), then there are still .pid files for alertengine, eventengine, pping and snmptrapd in /tmp, some or all of which are datestamped from an earlier container start. It doesn't do this for all of those services every time, though - my guess is that if a service is busy and hasn't exited before the container stops, it doesn't clean up its pid file? Or maybe it only causes an issue if there's a (different) process running on the same PID when it next starts.
Curiously, pping (and on the most recent restart of the container, apache - again, there's another process running on the PID recorded in Apache's lockfile) *do* get flagged by supervisord as being restarted too many times, and aborted as you had noted. Maybe there's a difference if it's a python script rather than a compiled executable (i.e. python starts cleanly but the script doesn't).
As a kludge workaround I added commands to the entrypoint script which deleted all of the PID files on the basis that at container start, the processes shouldn't be running. That seemed to make it start cleanly, but not sure if there is a better or more 'Docker'ish way to do it?
Steve.
On Mon, 1 Jun 2020 08:03:03 +0000 Steve Kersley steve.kersley@keble.ox.ac.uk wrote:
Also, supervisord will eventually cease with the restart attempts when smsd fails too many times (seems to be about 6 times for me).
I'm certainly not seeing this behaviour. The container has currently been running for just short of 3 days, and it's still spawning smsd every second (will look at how to configure supervisord).
The default setting in supervisord is a `startretries` [1] value of only 3. This, however, only affects the number of restart attempts for processes that fail immediately when starting (i.e., they only get to the STARTING state before failing).
Once supervisord considers a process to be in the RUNNING state, the autorestart option applies. The supervisord config provided by the nav-container says to always autorestart an exited NAV process.
I'm not sure what conditions apply for supervisord to consider a process as having moved from the STARTING to the RUNNING state, but to me it seems that your supervisord instance must have considered those processes as having moved to the RUNNING state before they FAILED, which would always cause it to attempt a restart.
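For reference, the knobs involved in a supervisord program section look roughly like this (the values and the command are purely illustrative, not what the nav-container ships):

[program:some-nav-daemon]
; placeholder command, not an actual NAV daemon invocation
command = /path/to/some-nav-daemon --foreground
; how long the process must stay up before supervisord considers it RUNNING
startsecs = 1
; restart attempts allowed while the process keeps failing during STARTING
startretries = 3
; restart policy once a process has reached RUNNING: true, false or unexpected
autorestart = true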
In fact, it's now also spawning alertengine, eventengine and snmptrapd every second (it wasn't doing this when the container first booted).
If that is the case, then something is really wrong, and you should check supervisord's logs of these processes.
Not sure whether the start of that respawning tallies with the gaps starting, but conceivably the CPU or IO usage from trying to run several python scripts every second could cause postgres issues when sharing those resources? (Or even the fact that they're not running?)
Having looked at why those other services (other than smsd) were respawning, it turns out they thought they were already running. I've been able to recreate this: if I stop and start the container (rather than destroying and recreating it via docker-compose), then there are still .pid files for alertengine, eventengine, pping and snmptrapd in /tmp, some or all of which are datestamped from an earlier container start. It doesn't do this for all of those services every time, though - my guess is that if a service is busy and hasn't exited before the container stops, it doesn't clean up its pid file? Or maybe it only causes an issue if there's a (different) process running on the same PID when it next starts.
If a NAV daemon is killed or somehow dies without getting a chance to run its cleanup routine, its PID file will be left behind. On a normal system, this PID number will not be reused for a long time, so the chances of that PID being "alive" when attempting a restart are low.
However, in the context of restarting a container, PID numbers will be assigned from 1 and up again, so re-use and overlapping of those stored PID numbers is highly likely, and would cause a problem.
Curiously, pping (and on the most recent restart of the container, apache - again, there's another process running on the PID recorded in Apache's lockfile) *do* get flagged by supervisord as being restarted too many times, and aborted as you had noted. Maybe there's a difference if it's a python script rather than a compiled executable (i.e. python starts cleanly but the script doesn't).
The difference might be whether the daemon forks into the background before checking the PID file, or if it stops immediately. pping is the oldest piece of code in NAV, so it might behave slightly differently from the rest.
As a kludge workaround I added commands to the entrypoint script which deleted all of the PID files on the basis that at container start, the processes shouldn't be running. That seemed to make it start cleanly, but not sure if there is a better or more 'Docker'ish way to do it?
That's the exact solution I would suggest: Explicitly deleting the PID files on container startup. There might be a way to get supervisord to remove the pid file as well, but a much simpler solution would be to clean the slate at container startup. I wouldn't mind a pull request to that effect on GitHub :-)
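In the nav-container's case that could be as simple as something like the following near the top of the entrypoint script (a sketch - /tmp is where you reported the stale files, so adjust if your pidfile directory differs):

# remove stale NAV pidfiles left behind by an unclean container stop,
# so daemons don't mistake a recycled PID for an already-running instance
rm -f /tmp/*.pid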
[1] http://supervisord.org/configuration.html?highlight=startretries#program-x-s...