We are running NAV 3.13.0 since 08.02.13
Some days ago something happened in our environment that caused eventengine to throw an error:
2013-02-13 17:08:57,887 [ERROR nav.eventengine.engine] Unhandled exception in plugin <nav.eventengine.plugins.boxstate.BoxStateHandler object at 0x804e170d0> ; ignoring it Traceback (most recent call last): File "/usr/local/nav/lib/python/nav/eventengine/engine.py", line 190, in handle_event handler.handle() File "/usr/local/nav/lib/python/nav/eventengine/plugins/delayedstate.py", line 52, in handle return self._handle_start() File "/usr/local/nav/lib/python/nav/eventengine/plugins/delayedstate.py", line 68, in _handle_start self._set_internal_state_down() File "/usr/local/nav/lib/python/nav/eventengine/plugins/boxstate.py", line 29, in _set_internal_state_down shadow = self._verify_shadow() File "/usr/local/nav/lib/python/nav/eventengine/plugins/delayedstate.py", line 189, in _verify_shadow netbox.up = (Netbox.UP_DOWN if netbox_appears_reachable(netbox) File "/usr/local/nav/lib/python/nav/eventengine/topology.py", line 32, in netbox_appears_reachable target_path = get_path_to_netbox(netbox) File "/usr/local/nav/lib/python/nav/eventengine/topology.py", line 82, in get_path_to_netbox path = networkx.shortest_path(graph, netbox, router) File "/usr/local/lib/python2.7/site-packages/networkx-1.6-py2.7.egg/networkx/ algorithms/shortest_paths/generic.py", line 124, in shortest_path paths=nx.bidirectional_shortest_path(G,source,target) File "/usr/local/lib/python2.7/site-packages/networkx-1.6-py2.7.egg/networkx/ algorithms/shortest_paths/unweighted.py", line 138, in bidirectional_shortest_path results=_bidirectional_pred_succ(G,source,target) File "/usr/local/lib/python2.7/site-packages/networkx-1.6-py2.7.egg/networkx/ algorithms/shortest_paths/unweighted.py", line 205, in _bidirectional_pred_succ raise nx.NetworkXNoPath("No path between %s and %s." % (source, target)) NetworkXNoPath: No path between <server> and <router>
This happened for 5 servers all on the same GSW.
Four days later 4 of them are still marked as down even though ipdevinfo reports availability numbers like the following and the servers clearly are up:
Availability 100.00% last day, 99.77% last week, 99.94% last month
Where do I have to push (or kick) to make NAV recognise the servers as up?
--Ingeborg
On Mon, 18 Feb 2013 11:24:48 +0100 Ingeborg Hellemo ingeborg.hellemo@uit.no wrote:
We are running NAV 3.13.0 since 08.02.13
Some days ago something happened in our environment that caused eventengine to throw an error:
[snip]
File "/usr/local/lib/python2.7/site-packages/networkx-1.6-py2.7.egg/networkx/ algorithms/shortest_paths/unweighted.py", line 205, in _bidirectional_pred_succ raise nx.NetworkXNoPath("No path between %s and %s." % (source, target)) NetworkXNoPath: No path between <server> and <router>
This happened for 5 servers all on the same GSW.
You are using NetworkX 1.6, while NAV was developed on NetworkX 1.1. The latter returns an empty list when no path is found between the nodes, but it appears the version you're running raises an exception instead, which completely throws the event engine off.
Since the exception class doesn't even exist in NetworkX 1.1, we can't catch it explicitly. Rather we have to swallow and ignore any kind of exception from that function to work around this.
Four days later 4 of them are still marked as down even though ipdevinfo reports availability numbers like the following and the servers clearly are up:
Availability 100.00% last day, 99.77% last week, 99.94% last month
Where do I have to push (or kick) to make NAV recognise the servers as up?
I'm not sure, because the path checking is only performed when down events are received - I.e. the traceback you pasted should only have occurred as your servers went down, not when they came up (this can only be confirmed by checking the preceding log lines). Topology is mostly irrelevant when things are reported as up.
If the servers are still listed as down by NAV, it may be that the internal state of the pping daemon has become unsynced with the database. Restarting pping should dispatch the correct up events for the servers, if they are indeed reachable from your NAV server. If this makes eventengine crash, there must be another bug in there somewhere.
I would really be interested in the eventengine logs from around the time the servers actually came back up and the up events were received, if anything at all. You might also want to consider upping the default log level for the eventengine to INFO, using "nav.eventengine=INFO" in `logging.conf`.
morten.brekkevold@uninett.no said:
You are using NetworkX 1.6, while NAV was developed on NetworkX 1.1.
You are soooo 2010! I'll downgrade. On FreeBSD I can choose between 1.0.1 and 1.2. Go figure...
morten.brekkevold@uninett.no said:
If the servers are still listed as down by NAV, it may be that the internal state of the pping daemon has become unsynced with the database. Restarting pping should dispatch the correct up events for the servers, if they are indeed reachable from your NAV server.
That did the trick - today. I restarted all the NAV services yesterday while trying to figure out the error and pping did not then find the servers. But today - with the moon in first quarter - it worked.
morten.brekkevold@uninett.no said:
I would really be interested in the eventengine logs from around the time the servers actually came back up and the up events were received, if anything at all.
eventenginde was logging with the default level (which is..?) and except a couple of hobbitServiceDown/Up-messages (which shows that the servers became available after a few minutes) I find no log lines about the servers after the NetworkXNoPath. The servers were never really down, but something happened with the network.
I have upped the log level for eventengine. Is there a way to up the log level for all the services at the same time or is it one line per service?
--Ingeborg
On Tue, 19 Feb 2013 11:29:19 +0100 Ingeborg Hellemo ingeborg.hellemo@uit.no wrote:
morten.brekkevold@uninett.no said:
You are using NetworkX 1.6, while NAV was developed on NetworkX 1.1.
You are soooo 2010! I'll downgrade. On FreeBSD I can choose between 1.0.1 and 1.2. Go figure...
We can still try to catch the exception, though, if you would like to file a bug report.
That did the trick - today. I restarted all the NAV services yesterday while trying to figure out the error and pping did not then find the servers. But today - with the moon in first quarter - it worked.
Given the correct pping log level, it should be obvious when it can and cannot get a ping response from a device.
eventenginde was logging with the default level (which is..?)
I believe the default of Python's logging system is the WARNING level. Some of the NAV daemons are inconsistent here, and provide a way of configuring the default base level in its own config file, and only letting `logging.conf` override it. The two exceptions are the ancient code bases of pping and servicemon, which use a homegrown logging system (I believe they were written before Python grew a standard logging library).
and except a couple of hobbitServiceDown/Up-messages (which shows that the servers became available after a few minutes) I find no log lines about the servers after the NetworkXNoPath. The servers were never really down, but something happened with the network.
And as you know, to NAV it doesn't matter much if they were really down, only whether they were ping-reachable.
I have upped the log level for eventengine. Is there a way to up the log level for all the services at the same time or is it one line per service?
You could add "root=INFO" or "nav=INFO", but I would advise against it as a permanent solution. Some of the cron jobs will log to stdout/stderr instead of a file, and might spam you to death.