On Tue, 13 Jan 2009 09:58:36 +0100 Ingeborg Hellemo Ingeborg.Hellemo@cc.uit.no wrote:
Yesterday we rebooted the switch to which our NAV-box is connected. The result was of course 1000+ alerts about our whole network going down and coming up again.
The correct behaviour would of course be that NAV found out that it was _itself_ that had lost network connectivity and that the rest of the world was undetermined. 1 alert about this would suffice.
Actually, I think this should be taken care of by the pre-existing shadow evaluation code in eventEngine. We have, however, multiple examples of this evaluation failing.
Ideally, eventEngine would first see that the NAV server's uplink switch is down. The 1000+ boxDown events it receives after that should be translated into shadow alerts, based on the topology information NAV has.
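To make the intended behaviour concrete, here is a minimal sketch in Python. It is not the eventEngine code; the function names, the adjacency-dict topology format and the box names are all made up for illustration. The assumption is that a down box counts as a real boxDown only if at least one of its neighbours is still reachable from the NAV server, and as a boxShadow otherwise.

```python
from collections import deque


def reachable_from(topology, start, down):
    """Return the nodes reachable from `start` without passing through a down node.

    `topology` is a hypothetical undirected adjacency dict: node -> iterable of neighbours.
    """
    if start in down:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in topology.get(node, ()):
            if neighbour not in seen and neighbour not in down:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def classify(topology, nav_server, down):
    """Map every down box to 'boxDown' or 'boxShadow'.

    A down box bordering the still-reachable part of the network is a real
    outage; a down box hidden behind other down boxes is only in shadow.
    """
    reachable = reachable_from(topology, nav_server, down)
    verdicts = {}
    for box in down:
        if any(n in reachable for n in topology.get(box, ())):
            verdicts[box] = "boxDown"
        else:
            verdicts[box] = "boxShadow"
    return verdicts


# Hypothetical network: nav -- uplink-sw -- core -- edge1 / edge2
topology = {
    "nav": ["uplink-sw"],
    "uplink-sw": ["nav", "core"],
    "core": ["uplink-sw", "edge1", "edge2"],
    "edge1": ["core"],
    "edge2": ["core"],
}
verdicts = classify(topology, "nav", {"uplink-sw", "core", "edge1", "edge2"})
```

With the uplink switch down, only uplink-sw comes out as boxDown; everything behind it comes out as boxShadow, which would amount to a single real alert.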
We suspect a timing problem here: the order and timing of the incoming boxDown events seem to determine whether a boxDown or a boxShadow event is generated. I've filed this as a new bug report [1] at Launchpad, and I hope to have time to look at it after 3.5.0 is released.
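As an illustration of how event ordering could matter, here is a small Python sketch. It is purely hypothetical and not taken from eventEngine: it assumes an evaluator that classifies each boxDown event the moment it arrives, using only the boxes already known to be down, and a simple tree topology given as a box-to-upstream-box dict.

```python
def classify_on_arrival(parent, events):
    """Classify each boxDown event at arrival time.

    `parent` is a hypothetical dict mapping a box to its upstream box
    (towards the NAV server); `events` lists box names in arrival order.
    """
    known_down = set()
    verdicts = {}
    for box in events:
        # Walk upstream towards the NAV server, looking for a box already known to be down.
        upstream = parent.get(box)
        shadowed = False
        while upstream is not None:
            if upstream in known_down:
                shadowed = True
                break
            upstream = parent.get(upstream)
        verdicts[box] = "boxShadow" if shadowed else "boxDown"
        known_down.add(box)
    return verdicts


# Hypothetical chain: nav <- uplink-sw <- core <- edge1
parent = {"uplink-sw": "nav", "core": "uplink-sw", "edge1": "core"}

in_order = classify_on_arrival(parent, ["uplink-sw", "core", "edge1"])
reversed_order = classify_on_arrival(parent, ["edge1", "core", "uplink-sw"])
```

When the uplink switch's event arrives first, the boxes behind it are classified as boxShadow. When the same events arrive in the opposite order, every box is classified as boxDown, because nothing upstream was known to be down at the time each event was evaluated.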
This also ties into a wishlist item [2] about re-evaluating shadow statuses when things change in the network.
[1] https://bugs.launchpad.net/bugs/317039
[2] https://bugs.launchpad.net/bugs/258329