On Fri, 24 Feb 2017 08:15:50 +0100 Ingeborg Hellemo ingeborg.hellemo@uit.no wrote:
> I hope optimization of ipdevpoll is on the list of tasks.
Hi Ingeborg,
may I remind you that the preferred language on our mailing lists is still English? :-)
Some work on ipdevpoll is definitely on the list; it's even 3rd on the nav-ref work list (which YOU voted on):
https://nav.uninett.no/wiki/nav-ref:nav-ref-arbeidsliste
Specifically, reworking the multiprocess model is registered here:
https://github.com/UNINETT/nav/issues/1174
And, in case you haven't seen it, a pull request for this has already been accepted (#1422). Sigmund gladly took on the task while recently learning his way around the NAV code.
So, in the 4.7 release, you can specify the number of worker processes, and jobs will be assigned in a round-robin fashion to free workers.
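For illustration only, here is a minimal sketch of that dispatch model, using Python's multiprocessing module (this is not NAV's actual code; the job and device names are made up):

    from multiprocessing import JoinableQueue, Process

    def worker(jobs):
        """Pull and run jobs for as long as any are queued."""
        while True:
            job = jobs.get()
            if job is None:          # sentinel: no more jobs for this worker
                jobs.task_done()
                break
            job_name, device = job
            print("running job %s for device %s" % (job_name, device))
            jobs.task_done()

    def main(num_workers=8):
        jobs = JoinableQueue()
        workers = [Process(target=worker, args=(jobs,))
                   for _ in range(num_workers)]
        for proc in workers:
            proc.start()
        # Queued jobs are handed to whichever worker is free next
        for device in ("gw1.example.org", "sw1.example.org"):
            jobs.put(("1minstats", device))
        for _ in workers:
            jobs.put(None)           # one stop sentinel per worker
        jobs.join()

    if __name__ == "__main__":
        main()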
Things like killing workers after a set number of jobs, or spawning remote workers, are not implemented yet.
> Today we run 'ipdevpoll -m', i.e. 8 parallel jobs that each have 10 open connections to the database.
[snip]
> This time last year I had a problem with individual queries against boxes taking too long (more than a minute). We solved this by tuning max_concurrent_jobs down to 20, since a (much) larger number left individual jobs hanging for want of a database connection.
> What I saw the beginning of back then, but which has become more and more pronounced over the course of the year, is that 1minstats in particular can't spin fast enough. Grep(1) on the log shows that we manage to push through between 170 and 200 jobs per minute. When we have 645 devices (GW, SW, GSW), it goes without saying that this doesn't work. We get graphs full of holes.
If you are unable to scale 1minstats through other means (including the new, upcoming multiprocess model), you might want to consider whether you need those stats every minute. I do believe NTNU have moved several of the plugins from the 1minstats job to the 5minstats job (though this requires a schema resize of the corresponding Whisper files that store these stats; see the sketch below).
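If you go down that route, the resize itself can be scripted. A rough sketch, assuming a standard Graphite/Whisper install with the whisper-resize.py tool on the path; the storage path and retention scheme below are made-up examples, so match them to your own storage-schemas.conf:

    import os
    import subprocess

    WHISPER_ROOT = "/opt/graphite/storage/whisper/nav"  # example path
    NEW_RETENTIONS = ["5m:7d", "30m:120d", "1d:3y"]     # example scheme

    for dirpath, dirnames, filenames in os.walk(WHISPER_ROOT):
        for name in filenames:
            if name.endswith(".wsp"):
                path = os.path.join(dirpath, name)
                # Rewrites the archive layout in place, leaving a .bak
                # backup file next to the original
                subprocess.check_call(["whisper-resize.py", path] +
                                      NEW_RETENTIONS)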
> What can be done? I don't think the challenge lies in CPU/memory, but in the implementation itself.
It is quite difficult to get useful metrics on how much time is spent waiting for free DB connections, waiting for actual PostgreSQL responses, waiting for SNMP responses, or simply running Python code.
Some rudimentary metrics can be had. If you set DEBUG level logging for `nav.ipdevpoll.jobs.jobhandler_timings`, each job will log a summary of time spent in each plugin, and how much was spent in overhead (i.e. updating the database at the end of the job). There is, however, no separate metric for how much time was spent talking to the DB inside the plugins (normally, all the DB access is before or after the job, not inside plugins).
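For reference, enabling that logger would look something like this in NAV's logging.conf (assuming the usual [levels] section; double-check against the file in your own install):

    [levels]
    nav.ipdevpoll.jobs.jobhandler_timings = DEBUG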
There are also issues like this one:
https://github.com/UNINETT/nav/issues/1403
Some SQL statements may take longer to complete because of locking in the database, and this can get worse as the number of parallel connections increases.
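One way to catch this in the act is to ask PostgreSQL which backends are waiting on locks while ipdevpoll is running. A quick sketch using psycopg2 (the DSN is a placeholder, and the query needs PostgreSQL 9.2 or newer for the pg_stat_activity.query column):

    import psycopg2

    conn = psycopg2.connect("dbname=nav user=nav")  # placeholder DSN
    cur = conn.cursor()
    # List backends waiting on an ungranted lock, and what they are running
    cur.execute("""
        SELECT a.pid, a.query
        FROM pg_locks l
        JOIN pg_stat_activity a ON a.pid = l.pid
        WHERE NOT l.granted
    """)
    for pid, query in cur.fetchall():
        print("pid %d is blocked on: %s" % (pid, query))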
> If the solution is to increase the number of connections to the database, it must also be made possible to tune this per job. The smaller jobs, like topo, dns and ip2mac, don't strictly need to run as separate jobs, and certainly don't need to tie up lots of database resources.
Tuning this per job becomes a moot point in 4.7, so it will not be an issue by then.
It would be interesting, however, to be able to log more of the potential metrics mentioned above, such as the time spent just waiting for free DB threads. Those kinds of numbers might help you decide on tuning parameters.
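As an illustration of how such a number could be collected (a sketch using Twisted's adbapi pool, which is not necessarily how ipdevpoll talks to the database; the DSN and pool sizes are placeholders): timestamp the query when it is queued, and again when it actually gets a pool thread:

    import time

    from twisted.enterprise import adbapi
    from twisted.internet import reactor

    pool = adbapi.ConnectionPool("psycopg2", "dbname=nav user=nav",
                                 cp_min=1, cp_max=10)

    def timed_interaction(txn, queued_at, sql):
        # Time between queueing and actually getting a connection thread
        wait = time.time() - queued_at
        print("waited %.3f s for a free connection" % wait)
        txn.execute(sql)
        return txn.fetchall()

    d = pool.runInteraction(timed_interaction, time.time(), "SELECT 1")
    d.addCallback(lambda rows: print(rows))
    d.addBoth(lambda _: reactor.stop())
    reactor.run()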