On Mon, 18 Feb 2013 11:24:48 +0100 Ingeborg Hellemo ingeborg.hellemo@uit.no wrote:
We have been running NAV 3.13.0 since 08.02.13
Some days ago something happened in our environment that caused eventengine to throw an error:
[snip]
  File "/usr/local/lib/python2.7/site-packages/networkx-1.6-py2.7.egg/networkx/algorithms/shortest_paths/unweighted.py", line 205, in _bidirectional_pred_succ
    raise nx.NetworkXNoPath("No path between %s and %s." % (source, target))
NetworkXNoPath: No path between <server> and <router>
This happened for 5 servers all on the same GSW.
You are using NetworkX 1.6, while NAV was developed on NetworkX 1.1. The latter returns an empty list when no path is found between the nodes, but it appears the version you're running raises an exception instead, which completely throws the event engine off.
Since the exception class doesn't even exist in NetworkX 1.1, we can't catch it explicitly. Rather, we have to swallow and ignore any kind of exception from that function to work around this.
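As a rough sketch of that workaround (not NAV's actual code; `safe_path` and the three finder functions are names made up for illustration), the path lookup can be wrapped so that any exception is treated the same as "no path found":

```python
def safe_path(find_path, graph, source, target):
    """Return the path from source to target, or [] if none can be found.

    find_path is any callable taking (graph, source, target). Catching
    Exception is deliberately broad: the NetworkXNoPath class does not
    exist in older releases, so it cannot be named in an except clause.
    """
    try:
        path = find_path(graph, source, target)
    except Exception:
        return []
    # Older releases signal "no path" with an empty/false value instead.
    return path or []


# Hypothetical stand-ins for the library behaviours described above.
def old_style_finder(graph, source, target):
    return []  # no path: empty result, no exception


def new_style_finder(graph, source, target):
    raise RuntimeError("No path between %s and %s." % (source, target))


def working_finder(graph, source, target):
    return [source, target]


print(safe_path(old_style_finder, None, "server", "router"))
print(safe_path(new_style_finder, None, "server", "router"))
print(safe_path(working_finder, None, "server", "router"))
```

The obvious cost is that a genuine bug inside the path lookup gets hidden as well, so this is a stopgap rather than a proper fix.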
Four days later, four of them are still marked as down, even though ipdevinfo reports availability numbers like the following and the servers are clearly up:
Availability 100.00% last day, 99.77% last week, 99.94% last month
Where do I have to push (or kick) to make NAV recognise the servers as up?
I'm not sure, because the path checking is only performed when down events are received, i.e. the traceback you pasted should only have occurred as your servers went down, not when they came back up (this can only be confirmed by checking the preceding log lines). Topology is mostly irrelevant when things are reported as up.
If the servers are still listed as down by NAV, it may be that the internal state of the pping daemon has become unsynced with the database. Restarting pping should dispatch the correct up events for the servers, if they are indeed reachable from your NAV server. If this makes eventengine crash, there must be another bug in there somewhere.
I would really be interested in the eventengine logs from around the time the servers actually came back up and the up events were received, if any were received at all. You might also want to consider setting the log level for the eventengine to INFO, using "nav.eventengine=INFO" in `logging.conf`.
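For what it's worth, here is a small illustration of what such a per-logger level means in plain Python `logging` terms. It assumes the setting maps onto a standard logger named `nav.eventengine`; the child logger name below is hypothetical.

```python
import logging

# Root logger stays at WARNING, as a stand-in for a quieter default.
logging.getLogger().setLevel(logging.WARNING)

# Raise verbosity for the event engine only.
engine_log = logging.getLogger("nav.eventengine")
engine_log.setLevel(logging.INFO)

# Loggers below nav.eventengine inherit INFO; sibling loggers do not.
child_log = logging.getLogger("nav.eventengine.example")  # hypothetical name
other_log = logging.getLogger("nav.other")  # hypothetical name

print(child_log.isEnabledFor(logging.INFO))
print(other_log.isEnabledFor(logging.INFO))
```

Only the event engine and anything logging beneath it becomes more verbose; the rest of the system keeps its existing level.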