The fifth maintenance release of the NAV 4.2 series is now available.
This replaces the 4.2.4 release, which broke the parallel pinger and possibly other event-posting components. Again, to those who already upgraded to 4.2.4, we apologize for the inconvenience. More details are in the bug report here:
https://bugs.launchpad.net/nav/+bug/1434520
The source code is available for download at Launchpad [1]. A new package for Debian Wheezy has been published in our APT repository [2], as usual.
Changes
=======
The following 20 reported issues have been fixed:
* LP#1403365 (offMaintenance alerts for same device every 5 minutes)
* LP#1422298 (device history crashes when viewing weathergoose events)
* LP#1422316 (thresholdmon AttributeError crash)
* LP#1425536 (Status widget error when filtering on Device group)
* LP#1425846 (alert profiles does not display without refresh)
* LP#1427666 (watchdog is slow)
* LP#1428071 (Portadmin crashes when searching by sysname or ip)
* LP#1428578 (seeddb test for snmp version crashes if neither v1 or v2c is supported)
* LP#1429868 (ipdevpoll jobs that are no longer supposed to run for a device as flagged as "overdue")
* LP#1430795 (SeedDB room edit form asks for user's location, shows no map until permission is given)
* LP#1430797 (Geomap shows no map at all when no room positions are defined)
* LP#1430802 (SeedDB room edit insists on inserting a geoposition)
* LP#1430803 (SeedDB room form position indicator icon is missing)
* LP#1431780 (Trunk port status is never reset on non-Cisco equipment)
* LP#1432056 (mod_wsgi option WSGIApplicationGroup should be %{GLOBAL} by default)
* LP#1432057 ([appliance] missing python-dnspython)
* LP#1432620 (Unable to load Netmap layer 3 map with ELINK peers)
* LP#1432682 (Should be able to specify exact subnet prefixes to ignore)
* LP#1433063 (netmap zoom and pan does not work for some views)
* LP#1433120 (Shouldn't generate linkState alerts for intentionally shutdown interfaces)
Happy NAVing everyone!
Links
=====
[1] https://launchpad.net/nav/4.2/4.2.5
[2] https://nav.uninett.no/wiki/nav_on_debian
On 2015-03-20 13:46, Morten Brekkevold wrote:
The fifth maintenance release of the NAV 4.2 series is now available.
I'm kind of thinking there is something fishy here as well. I'm running ipdevpolld with "-m" and am now seeing hundreds of postgres processes looking like this in my "ps -ef" output:
postgres 13679 3183 4 08:56 ? 00:00:19 postgres: nav nav 192.168.217.121(55373) INSERT waiting
# ps -ef | grep -c INSERT\ waiting
263
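(The same count can also be taken from inside PostgreSQL; a small sketch, assuming a pre-9.6 server where pg_stat_activity still has the boolean "waiting" column:)

-- count backends currently waiting on a lock (pre-9.6 pg_stat_activity)
SELECT count(*) FROM pg_stat_activity WHERE waiting;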
Also from ipdevpoll.log for one of my routers (this is the default gateway for our nav-installation):
2015-03-23 06:15:12,773 [20245] [INFO schedule.netboxjobscheduler] [dns gw-1.somewhere] dns for gw-1.somewhere completed in 0:03:09.328468. next run in 0:56:50.671563.
2015-03-23 06:15:43,180 [20242] [INFO schedule.netboxjobscheduler] [topo gw-1.somewhere] topo for gw-1.somewhere completed in 0:20:08.509836. next run right now.
2015-03-23 06:29:23,451 [20241] [INFO schedule.netboxjobscheduler] [1minstats gw-1.somewhere] 1minstats for gw-1.somewhere completed in 0:17:25.663956. next run right now.
2015-03-23 06:29:54,689 [20246] [INFO schedule.netboxjobscheduler] [snmpcheck gw-1.somewhere] snmpcheck for gw-1.somewhere completed in 0:17:56.107511. next run in 0:12:03.892532.
2015-03-23 06:32:05,148 [20242] [INFO schedule.netboxjobscheduler] [topo gw-1.somewhere] topo for gw-1.somewhere completed in 0:16:21.966341. next run right now.
2015-03-23 06:36:16,453 [20240] [INFO schedule.netboxjobscheduler] [5minstats gw-1.somewhere] 5minstats for gw-1.somewhere completed in 0:17:51.246169. next run right now.
2015-03-23 06:39:11,529 [20244] [INFO schedule.netboxjobscheduler] [inventory gw-1.somewhere] inventory for gw-1.somewhere completed in 0:17:40.145423. next run in 5:42:19.854604.
2015-03-23 06:49:08,948 [20241] [INFO schedule.netboxjobscheduler] [1minstats gw-1.somewhere] 1minstats for gw-1.somewhere completed in 0:19:45.398619. next run right now.
2015-03-23 06:49:51,196 [20242] [INFO schedule.netboxjobscheduler] [topo gw-1.somewhere] topo for gw-1.somewhere completed in 0:17:46.015066. next run right now.
2015-03-23 06:50:52,032 [20239] [INFO schedule.netboxjobscheduler] [linkcheck gw-1.somewhere] linkcheck for gw-1.somewhere completed in 0:17:04.692522. next run right now.
2015-03-23 06:52:16,093 [20243] [INFO schedule.netboxjobscheduler] [ip2mac gw-1.somewhere] ip2mac for gw-1.somewhere completed in 8:19:02.775597. next run right now.
2015-03-23 07:03:59,843 [20241] [INFO schedule.netboxjobscheduler] [1minstats gw-1.somewhere] 1minstats for gw-1.somewhere completed in 0:14:50.893142. next run right now.
2015-03-23 07:04:50,713 [20242] [INFO schedule.netboxjobscheduler] [topo gw-1.somewhere] topo for gw-1.somewhere completed in 0:14:59.516408. next run in 0:00:00.483632.
2015-03-23 07:06:51,259 [20239] [INFO schedule.netboxjobscheduler] [linkcheck gw-1.somewhere] linkcheck for gw-1.somewhere completed in 0:15:59.195122. next run right now.
2015-03-23 07:18:12,938 [20246] [INFO schedule.netboxjobscheduler] [snmpcheck gw-1.somewhere] snmpcheck for gw-1.somewhere completed in 0:13:43.676428. next run in 0:16:16.323605.
2015-03-23 07:19:32,061 [20241] [INFO schedule.netboxjobscheduler] [1minstats gw-1.somewhere] 1minstats for gw-1.somewhere completed in 0:15:32.208171. next run right now.
I see the same thing for hundreds of other devices.
Cheers,
-Sigurd
On Mon, 23 Mar 2015 09:16:14 +0100 Sigurd Mytting sigurd@mytting.no wrote:
On 2015-03-20 13:46, Morten Brekkevold wrote:
The fifth maintenance release of the NAV 4.2 series is now available.
I'm kind of thinking there is something fishy here as well. I'm running ipdevpolld with "-m" and am now seeing hundreds of postgres processes looking like this in my "ps -ef" output:
postgres 13679 3183 4 08:56 ? 00:00:19 postgres: nav nav 192.168.217.121(55373) INSERT waiting
First of all, please remember that the "-m" mode is what we call "experimental". If you use it, each job configured in `ipdevpoll.conf` will have its own process, and each process will open 10 separate connections to PostgreSQL. If you have 10 jobs in `ipdevpoll.conf`, ipdevpoll will in total open 100 connections to PostgreSQL! This is nothing new in 4.2.5.
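If you want to verify the connection count, a quick sanity check is to group the open connections by client address and user directly in pg_stat_activity (a small sketch using only standard columns; the 'nav' user name is taken from the ps output above):

SELECT datname, usename, client_addr, count(*) AS connections
  FROM pg_stat_activity
 WHERE usename = 'nav'
 GROUP BY datname, usename, client_addr
 ORDER BY connections DESC;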
# ps -ef | grep -c INSERT\ waiting
263
Also from ipdevpoll.log for one of my routers (this is the default gateway for our nav-installation):
2015-03-23 06:15:12,773 [20245] [INFO schedule.netboxjobscheduler] [dns gw-1.somewhere] dns for gw-1.somewhere completed in 0:03:09.328468. next run in 0:56:50.671563.
Did this only occur after upgrading to 4.2.5? Does it go on like this "forever", or does it calm down a while after starting ipdevpoll?
I would connect to PostgreSQL and issue an "ALTER DATABASE nav SET log_min_duration_statement TO 1500;", then restart NAV.
PostgreSQL will now log any SQL statement in the NAV database that runs for more than 1500ms. Does PostgreSQL log many similar statements, and if so, which ones?
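(The 1500 ms threshold is just a suggestion, and the per-database override can be removed again once you're done debugging; a short sketch:)

ALTER DATABASE nav SET log_min_duration_statement TO 1500;  -- log statements slower than 1.5 s
-- remove the override again when done:
ALTER DATABASE nav RESET log_min_duration_statement;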
4.2.5 changed the way old job log entries are purged from the database, but it may still need some tuning to not cause inserts into the log table to take too long.
On 2015-03-24 16:05, Morten Brekkevold wrote:
Did this only occur after upgrading to 4.2.5? Does it go on like this "forever", or does it calm down a while after starting ipdevpoll?
It started with 4.2.4. I have Munin graphs of all servers: the postgres server's load went from 2 to 60 after upgrading to 4.2.4, while the load on the server running NAV dropped from 2-3 to less than 1. This goes on forever.
I would connect to PostgreSQL and issue an "ALTER DATABASE nav SET log_min_duration_statement TO 1500;", then restart NAV.
Just did; the main thing looks like this:
2015-03-24 23:05:58 CET LOG: duration: 130238.622 ms statement: INSERT INTO "ipdevpoll_job_log" ("netboxid", "job_name", "end_time", "duration", "success", "interval") VALUES (1187, 'dns', '2015-03-24 23:03:48.719354', 1285.223631, true, 3600) RETURNING "ipdevpoll_job_log"."id"
There were about 1800 lines with 'INSERT INTO "ipdevpoll_job_log"' in 45 minutes, but there are other statements as well.
Normally I would suspect something like a faulty RAID cache battery or a drive failure for these kinds of symptoms, but everything has been checked out.
Any ideas appreciated!
Cheers,
-Sigurd
On Tue, 24 Mar 2015 23:35:58 +0100 Sigurd Mytting sigurd@mytting.no wrote:
I would connect to PostgreSQL and issue an "ALTER DATABASE nav SET log_min_duration_statement TO 1500;", then restart NAV.
Just did; the main thing looks like this:
2015-03-24 23:05:58 CET LOG: duration: 130238.622 ms statement: INSERT INTO "ipdevpoll_job_log" ("netboxid", "job_name", "end_time", "duration", "success", "interval") VALUES (1187, 'dns', '2015-03-24 23:03:48.719354', 1285.223631, true, 3600) RETURNING "ipdevpoll_job_log"."id"
Ugh, that's a nasty run time!
Any ideas appreciated!
I immediately have one:
Connect to the nav database and issue this statement:
DROP TRIGGER IF EXISTS trig_trim_old_ipdevpoll_job_log_entries_on_insert ON ipdevpoll_job_log;
This trigger will try to prune old entries from the ipdevpoll_job_log table on about 20% of INSERT statements to said table. That is perhaps too often (and I'm not really comfortable with having it as a trigger, as it belongs in some log rotation job that we don't yet have).
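For illustration only, such a prune-on-insert trigger has roughly this shape (this is NOT NAV's actual definition, the retention rule is an assumed example, and it should not be run against a real installation):

-- Illustration only: not NAV's shipped trigger, just the general mechanism.
CREATE OR REPLACE FUNCTION illustrate_trim_job_log() RETURNS trigger AS $$
BEGIN
    -- run the cleanup on roughly 20% of insert statements
    IF random() < 0.2 THEN
        DELETE FROM ipdevpoll_job_log
         WHERE end_time < now() - interval '7 days';  -- assumed retention window
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER illustrate_trim_job_log_on_insert
    AFTER INSERT ON ipdevpoll_job_log
    FOR EACH STATEMENT EXECUTE PROCEDURE illustrate_trim_job_log();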
Removing the trigger will cause your ipdevpoll_job_log table to grow indefinitely, however. If this alleviates your problem, it's a strong indicator that we shall have to find another way to prune this table.
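One possible direction (just a sketch of the idea, not a decided solution) is to move the pruning into a periodic job run from cron or similar, which simply deletes old rows:

-- periodic cleanup, run from cron or similar; the 30-day retention is an example value
DELETE FROM ipdevpoll_job_log
 WHERE end_time < now() - interval '30 days';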
On 2015-03-25 08:44, Morten Brekkevold wrote:
Any ideas appreciated!
I immediately have one:
Spot on!
Load is down from 60 to less than 1, and the number of timeouts for the last 20 minutes is down to 10, most of them similar to this one:
2015-03-25 17:19:01 CET LOG: duration: 7756.695 ms statement: SELECT ijl.* FROM ipdevpoll_job_log AS ijl JOIN ( SELECT netboxid, job_name, MAX(end_time) AS end_time FROM ipdevpoll_job_log GROUP BY netboxid, job_name ) AS foo USING (netboxid, job_name, end_time) JOIN netbox ON (ijl.netboxid = netbox.netboxid) WHERE success = 'f' ORDER BY netboxid;
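(For reference, running the same query under EXPLAIN ANALYZE shows where the time goes; this is simply the statement from the log line above, reformatted:)

EXPLAIN ANALYZE
SELECT ijl.*
  FROM ipdevpoll_job_log AS ijl
  JOIN (SELECT netboxid, job_name, MAX(end_time) AS end_time
          FROM ipdevpoll_job_log
         GROUP BY netboxid, job_name) AS foo
       USING (netboxid, job_name, end_time)
  JOIN netbox ON (ijl.netboxid = netbox.netboxid)
 WHERE success = 'f'
 ORDER BY netboxid;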
Removing the trigger will cause your ipdevpoll_job_log table to grow indefinitely, however. If this alleviates your problem, it's a strong indicator that we shall have to find another way to prune this table.
I'm quite fond of cron for tasks like that, but I'm not a database person, so the preferred solution from that point of view might be a bit different ;)
The table seems to be growing quite fast. It's not a big issue with 43GB to spare on the postgres partition, but I can see a need for cleaning the table on a regular basis.
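(A small sketch for keeping an eye on the growth, using standard PostgreSQL size functions:)

SELECT pg_size_pretty(pg_total_relation_size('ipdevpoll_job_log')) AS table_size,
       (SELECT count(*) FROM ipdevpoll_job_log) AS row_count;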
Thanks for helping out, much appreciated. We only do electronic postcards, and only at Christmas, but I'll drop one in your direction in a few minutes since last Christmas's generator is still available on your intranet :)
Cheers!
-Sigurd
On Wed, 25 Mar 2015 17:50:25 +0100 Sigurd Mytting sigurd@mytting.no wrote:
Load is down from 60 to less than 1, and the number of timeouts for the last 20 minutes is down to 10, most of them similar to this one:
I thought that might help. I've filed this bug in response:
https://bugs.launchpad.net/nav/+bug/1437318