Hi!
Running NAV 3.7.0 on Debian 5 (64-bit), with about 700 switches and routers currently in NAV.
There has been a lot of weird stuff (trouble) happening with Cricket after going from the latest in the 3.5 series up to 3.7. Some things got better from 3.6 to 3.7, but not all is perfect. Not complaining, just a summary from my point of view over the last few months of releases.
About a week ago my Cricket graphs just stopped, and after some debugging I found that the Cricket cron job, run as navcron, dies after a while and leaves a lock file behind. After cleaning that up and doing some optimizing (separating switch interfaces, router interfaces, and system values into separate cron jobs), I found the cron jobs still die and leave lock files. First of all: is anyone else seeing the Cricket collector dying? (In my case, on just about every run.)
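For what it's worth, the lock cleanup can be scripted with something along these lines; a minimal sketch, where the lock file path and the age threshold are placeholders (check where your install actually keeps the lock):

#!/usr/bin/env python
# Minimal sketch: remove a stale Cricket lock file before a new
# collector run. LOCK_FILE and MAX_AGE are placeholders -- adjust
# them to your own installation.
import os
import time

LOCK_FILE = "/tmp/cricket-collector.lock"  # placeholder path
MAX_AGE = 45 * 60  # seconds; treat a lock older than this as stale

if os.path.exists(LOCK_FILE):
    age = time.time() - os.path.getmtime(LOCK_FILE)
    if age > MAX_AGE:
        print("Removing stale lock file, %d seconds old" % age)
        os.remove(LOCK_FILE)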
Second: does anyone have a good tip on how to tune Cricket to run a sensible number of jobs and get through them all in a sensible amount of time? Currently, if the largest job (switch interfaces) doesn't die after only a few minutes, it usually runs for 25-30 minutes.
NAV in my environment isn't really a production system, but my pet project provides more accurate and sensible information than any system we have in production, so I'll hang on to it as long as I can.
Any input appreciated!
Cheers,
-Sigurd
On 2010-12-08 18:06, Sigurd Mytting wrote:
Hi! Second: does anyone have a good tip on how to tune Cricket to run a sensible number of jobs and get through them all in a sensible amount of time? Currently, if the largest job (switch interfaces) doesn't die after only a few minutes, it usually runs for 25-30 minutes.
Got a tip about running 50-ish devices per Cricket collector; this seems to both stop the collector from dying and let Cricket finish before its next run.
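In case it helps anyone else, the splitting itself is trivial to script. A minimal sketch that chunks a flat device list into groups of about 50, one list per collector cron job (the file names and chunk size are my assumptions, not anything NAV generates for you):

#!/usr/bin/env python
# Minimal sketch: split a flat device list into ~50-device groups,
# one file per Cricket collector instance. File names are assumptions.
CHUNK_SIZE = 50

with open("devices.txt") as infile:
    devices = [line.strip() for line in infile if line.strip()]

for n, start in enumerate(range(0, len(devices), CHUNK_SIZE)):
    # Each group file can then back its own collector invocation
    # from a separate cron entry.
    with open("collector-group-%02d.txt" % n, "w") as outfile:
        outfile.write("\n".join(devices[start:start + CHUNK_SIZE]) + "\n")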
Cheers,
-Sigurd
And FYI, as I've requested the exact same thing before (and thought there was already a blueprint on this), I took the liberty of submitting a blueprint on Launchpad to add this feature.
-Vidar
-----Original Message----- From: Sigurd Mytting [mailto:sigurd@mytting.no] Sent: 10 December 2010 00:18 To: nav-users@uninett.no Subject: Re: Cricket trouble/optimizing
On 2010-12-08 18:06, Sigurd Mytting wrote:
Hi! Second: does anyone have a good tip on how to tune Cricket to run a sensible number of jobs and get through them all in a sensible amount of time? Currently, if the largest job (switch interfaces) doesn't die after only a few minutes, it usually runs for 25-30 minutes.
Got a tip about running 50-ish devices per Cricket collector; this seems to both stop the collector from dying and let Cricket finish before its next run.
Cheers,
-Sigurd
On Fri, Dec 10, 2010 at 12:17:34AM +0100, Sigurd Mytting wrote:
Second: does anyone have a good tip on how to tune Cricket to run a sensible number of jobs and get through them all in a sensible amount of time? Currently, if the largest job (switch interfaces) doesn't die after only a few minutes, it usually runs for 25-30 minutes.
Got a tip about running 50-ish devices per Cricket collector; this seems to both stop the collector from dying and let Cricket finish before its next run.
Glad to see your problem was solved. I'll still add a few comments.
I've no experience with Cricket consistently crashing and leaving its lock files behind. I do have experience with Cricket going bananas and eating all available RAM and running forever until its config tree is recompiled.
If you set your Cricket installation to log debug info, can you glean any idea from the logs of why it crashes?
Also, when NTNU originally added the Cricket integration to NAV, they were completely unable to collect traffic statistics from all their access ports - there were simply too many of them to complete collection rounds in anything remotely close to five minutes.
They decided not to collect stats for access ports, and that is how the EDGE category was born. An EDGE device is the same as an SW device, except that no Cricket configuration is generated for switch ports on EDGE devices.
I do remember talk of someone who wrote a program to automatically split the Cricket config tree into sizable chunks so that multiple collectors could run in parallel and complete collection in a timely manner. There's also the issue of optimizing RRD writes, which can be horribly inefficient when attempting to scale up.
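For anyone who wants to experiment, the splitting part could be as simple as the sketch below: list the subtrees directly under the compiled config root and deal them round-robin into N buckets, one collector per bucket. The config path and bucket count are assumptions, and this says nothing about the RRD write problem:

#!/usr/bin/env python
# Sketch: deal Cricket config subtrees round-robin into N buckets so
# that N collectors can run in parallel. Path and N are assumptions.
import os

CONFIG_ROOT = "/home/navcron/cricket-config"  # placeholder path
N_BUCKETS = 4

subtrees = sorted(
    name for name in os.listdir(CONFIG_ROOT)
    if os.path.isdir(os.path.join(CONFIG_ROOT, name))
)

buckets = [[] for _ in range(N_BUCKETS)]
for i, subtree in enumerate(subtrees):
    buckets[i % N_BUCKETS].append(subtree)  # round-robin keeps sizes even

for n, bucket in enumerate(buckets):
    # Each printed line is one collector's worth of subtrees; feed it
    # to a separate cron entry (or a subtree-sets style file).
    print("collector %d: %s" % (n, " ".join("/" + s for s in bucket)))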