I hope optimization of ipdevpoll is on the list of tasks.
Today we run 'ipdevpoll -m', i.e. 8 parallel jobs that each keep 10 connections open to the database:

    navcron 36782     1 36781 36781 0 S -    19:16.14 python /usr/local/nav/bin/ipdevpolld -m (python2.7)
    navcron 36783 36782 36781 36781 0 S - 10099:09.26 python /usr/local/nav/bin/ipdevpolld -J 5minstats -f -s -P --threadpoolsize=10 (python2.7)
    navcron 36784 36782 36781 36781 0 S -  9812:19.26 python /usr/local/nav/bin/ipdevpolld -J 1minstats -f -s -P --threadpoolsize=10 (python2.7)
    navcron 36785 36782 36781 36781 0 S -  8824:16.25 python /usr/local/nav/bin/ipdevpolld -J topo -f -s -P --threadpoolsize=10 (python2.7)
    navcron 36786 36782 36781 36781 0 I -   782:36.65 python /usr/local/nav/bin/ipdevpolld -J ip2mac -f -s -P --threadpoolsize=10 (python2.7)
    navcron 36787 36782 36781 36781 0 S -  1086:00.12 python /usr/local/nav/bin/ipdevpolld -J inventory -f -s -P --threadpoolsize=10 (python2.7)
    navcron 36788 36782 36781 36781 0 R -  9951:01.96 python /usr/local/nav/bin/ipdevpolld -J dns -f -s -P --threadpoolsize=10 (python2.7)
    navcron 36789 36782 36781 36781 0 S -  2040:33.93 python /usr/local/nav/bin/ipdevpolld -J snmpcheck -f -s -P --threadpoolsize=10 (python2.7)
    navcron 36790 36782 36781 36781 0 R - 10437:55.65 python /usr/local/nav/bin/ipdevpolld -J statuscheck -f -s -P --threadpoolsize=10 (python2.7)
Around this time last year I had a problem where individual queries against boxes took too long (more than one minute). We solved that by lowering max_concurrent_jobs to 20, since a (much) larger value made individual jobs hang waiting for a database connection.
What I saw the beginning of back then, but which has become more and more pronounced over the course of the year, is that 1minstats in particular can't keep up. A grep(1) through the log shows that we manage to push through between 170 and 200 jobs per minute. With 645 devices (GW, SW, GSW), it goes without saying that this doesn't work. We get graphs with holes in them.
What can be done? I don't think the challenge lies in CPU/memory, but in the implementation itself.
If the solution is to increase the number of database connections, it must also be made possible to tune this per job. The smaller jobs, e.g. topo, dns and ip2mac, don't strictly need to run as separate jobs, and certainly don't need to tie up lots of database resources.
--Ingeborg
On Fri, 24 Feb 2017 08:15:50 +0100 Ingeborg Hellemo ingeborg.hellemo@uit.no wrote:
I hope optimization of ipdevpoll is on the list of tasks.
Hi Ingeborg,
may I remind you that the preferred language on our mailing lists is still English? :-)
Some work on ipdevpoll is definitely on the list, even 3rd on the nav-ref list (which YOU have voted on):
https://nav.uninett.no/wiki/nav-ref:nav-ref-arbeidsliste
For instance, reworking the multiprocess model is tracked here:
https://github.com/UNINETT/nav/issues/1174
And, in case you don't see it, a pull request for this has already been accepted (#1422). Sigmund gladly took up the task as he was learning his way around the NAV code recently.
So, in the 4.7 release, you can specify the number of worker processes, and jobs will be assigned in a round-robin fashion to free workers.
Things like killing workers after a set number of jobs, or spawning remote workers, are not implemented yet.
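Purely as an illustration of the scheduling idea (this is not the actual NAV code; the class and attribute names below are made up), round-robin dispatch to free workers boils down to something like:

    from collections import deque

    class Worker:
        """Stand-in for a worker process; just tracks whether it is busy."""
        def __init__(self, name):
            self.name = name
            self.busy = False

        def run(self, job):
            self.busy = True
            print("%s runs %s" % (self.name, job))

    class RoundRobinDispatcher:
        """Hand jobs to the next free worker in rotation; queue them when all are busy."""
        def __init__(self, workers):
            self.workers = deque(workers)   # rotation order of workers
            self.pending = deque()          # jobs waiting for a free worker

        def dispatch(self, job):
            for _ in range(len(self.workers)):
                worker = self.workers[0]
                self.workers.rotate(-1)     # next dispatch starts with the next worker
                if not worker.busy:
                    worker.run(job)
                    return
            self.pending.append(job)        # everyone busy: queue the job

        def worker_finished(self, worker):
            worker.busy = False
            if self.pending:
                worker.run(self.pending.popleft())

    dispatcher = RoundRobinDispatcher([Worker("worker-%d" % i) for i in range(8)])
    for job in ["1minstats", "5minstats", "topo", "dns"]:
        dispatcher.dispatch(job)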
Today we run 'ipdevpoll -m', i.e. 8 parallel jobs that each keep 10 connections open to the database.
[snip]
Around this time last year I had a problem where individual queries against boxes took too long (more than one minute). We solved that by lowering max_concurrent_jobs to 20, since a (much) larger value made individual jobs hang waiting for a database connection.
What I saw the beginning of back then, but which has become more and more pronounced over the course of the year, is that 1minstats in particular can't keep up. A grep(1) through the log shows that we manage to push through between 170 and 200 jobs per minute. With 645 devices (GW, SW, GSW), it goes without saying that this doesn't work. We get graphs with holes in them.
If you are unable to scale 1minstats through other means (including the new, upcoming multiprocess model), you might want to consider whether you need those stats every minute. I do believe NTNU have moved several of the plugins from the 1minstats job to the 5minstats job (though, this requires a schema resize for the corresponding Whisper files that are storing these stats).
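If you do move plugins from 1minstats to 5minstats, the resize step itself can be scripted. A rough sketch, where the Whisper path and retention schema are placeholders you'd replace with whatever your storage-schemas.conf actually says:

    import os
    import subprocess

    # Placeholder values -- adjust to your own Graphite setup and schema.
    WHISPER_ROOT = "/opt/graphite/storage/whisper/nav"
    NEW_RETENTIONS = ["5m:7d", "30m:30d", "2h:1y"]

    def resize_all(root=WHISPER_ROOT, retentions=NEW_RETENTIONS):
        """Run whisper-resize.py on every .wsp file below root."""
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if name.endswith(".wsp"):
                    path = os.path.join(dirpath, name)
                    subprocess.check_call(["whisper-resize.py", path] + retentions)

    if __name__ == "__main__":
        resize_all()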
What can be done? I don't think the challenge lies in CPU/memory, but in the implementation itself.
It is quite difficult to get useful metrics on how much time was spent in waiting for free DB connections, waiting for actual PostgreSQL responses, waiting for SNMP responses, or how much time was just spent running Python code.
Some rudimentary metrics can be had. If you set DEBUG level logging for `nav.ipdevpoll.jobs.jobhandler_timings`, each job will log a summary of time spent in each plugin, and how much was spent in overhead (i.e. updating the database at the end of the job). There is, however, no separate metric for how much time was spent talking to the DB inside the plugins (normally, all the DB access is before or after the job, not inside plugins).
There are also issues like this one:
https://github.com/UNINETT/nav/issues/1403
Some SQL statements may take a longer time to complete because of locking in the database, and this potentially gets worse as the number of parallel connections increases.
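If you want to check whether lock waits are part of the picture, one option is to poll pg_stat_activity while ipdevpoll is busy. A rough sketch; the connection string is a placeholder, and the WHERE clause assumes PostgreSQL 9.6 or newer (on older versions, filter on the boolean 'waiting' column instead):

    import psycopg2

    # Placeholder connection string -- point it at the NAV database.
    conn = psycopg2.connect("dbname=nav user=nav host=localhost")

    # PostgreSQL 9.6+: sessions currently waiting on a lock.
    QUERY = """
        SELECT pid, usename, now() - query_start AS waited, query
          FROM pg_stat_activity
         WHERE wait_event_type = 'Lock'
    """

    with conn, conn.cursor() as cur:
        cur.execute(QUERY)
        for pid, user, waited, query in cur.fetchall():
            print("pid=%s user=%s waited=%s\n  %s" % (pid, user, waited, query))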
If the solution is to increase the number of database connections, it must also be made possible to tune this per job. The smaller jobs, e.g. topo, dns and ip2mac, don't strictly need to run as separate jobs, and certainly don't need to tie up lots of database resources.
Tuning per job is a moot point in 4.7, so this will not be an issue by that time.
It would be interesting, however, if we are able to log more of the potential metrics mentioned above, like the time spent just waiting for free DB threads. Those kinds of numbers might help you decide on tuning parameters.
morten.brekkevold@uninett.no said:
If you are unable to scale 1minstats through other means (including the new, upcoming multiprocess model), you might want to consider whether you need those stats every minute. I do believe NTNU have moved several of the plugins from the 1minstats job to the 5minstats job (though, this requires a schema resize for the corresponding Whisper files that are storing these stats).
Do you know which jobs NTNU moved? I have already moved statsensors which leaves statsystem and statmulticast.
When is NAV 4.7 scheduled?
--Ingeborg
On Fri, 24 Feb 2017 12:49:50 +0100 Ingeborg Hellemo ingeborg.hellemo@uit.no wrote:
Do you know which jobs NTNU moved? I have already moved statsensors which leaves statsystem and statmulticast.
Looks like they moved all of 1minstats into 5minstats...
When is NAV 4.7 scheduled?
I don't think we have set a date yet, but sometime in the period between the next nav-ref meeting and June would probably not be too far off.
morten.brekkevold@uninett.no said:
If you set DEBUG level logging for `nav.ipdevpoll.jobs.jobhandler_timings`
In case somebody reads this thread in the future: The variable is `nav.ipdevpoll.jobs.jobhandler-timings` (with '-')
It seems like typical values look like these (and it looks like I will gain nothing by moving statmulticast to 5minstats):
    2017-02-24 12:59:31,914 [27044] [DEBUG jobs.jobhandler-timings] [1minstats bodo-gsw.infra.uit.no] Job '1minstats' timings for bodo-gsw.infra.uit.no:
    StatSystem   : 0:00:10.781017
    StatMulticast: 0:00:00.000089
    -----------------------------
    Plugin total : 0:00:10.781106
    -----------------------------
    Job total    : 0:00:11.034561
    Job overhead : 0:00:00.253455
    2017-02-24 12:59:31,915 [27044] [INFO schedule.netboxjobscheduler] [1minstats bodo-gsw.infra.uit.no] 1minstats for bodo-gsw.infra.uit.no completed in 0:00:11.036677. next run in 0:00:48.963349.
But interesting things are happening. A lot of jobs have overhead values in the 0:00:00.X range, like the example above, but then suddenly a batch of jobs has overhead values of up to 10 seconds!
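To see how those overheads are distributed without eyeballing the log, a quick-and-dirty parser along these lines could help (it assumes the timing lines look exactly like the example above):

    import re
    import sys
    from datetime import timedelta

    # Matches "Job overhead : 0:00:00.253455" from the jobhandler-timings output.
    OVERHEAD = re.compile(r"Job overhead\s*:\s*(\d+):(\d+):(\d+)\.(\d+)")

    def overheads(stream):
        for line in stream:
            match = OVERHEAD.search(line)
            if match:
                hours, minutes, seconds, micros = (int(g) for g in match.groups())
                yield timedelta(hours=hours, minutes=minutes,
                                seconds=seconds, microseconds=micros)

    values = sorted(overheads(sys.stdin))
    if values:
        print("jobs: %d  median: %s  max: %s"
              % (len(values), values[len(values) // 2], values[-1]))

Feed it the ipdevpoll log on stdin, e.g. via grep for the job you care about.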
Looking forward to 4.7
--Ingeborg
On Fri, 24 Feb 2017 13:17:18 +0100 Ingeborg Hellemo ingeborg.hellemo@uit.no wrote:
But interesting things are happening. A lot of jobs have overhead-values in the 0:00:00.X range like the example above, but then suddenly a bulk of jobs have overhead values up to 10 seconds!
Well, I couldn't resist playing around a bit with this. Here's a patch to log some timings specific to threaded work, if you're interested:
https://gist.github.com/lunkwill42/6c14ea6829270b93a39829a9ee0c9c2b
The code path is only enabled when DEBUG logging is enabled for `nav.ipdevpoll.db.timings`. The logged overhead value here will represent the time spent waiting for an available thread worker to run the SQL code.
There is a margin of error here, since ipdevpoll may spend that time running other code, but I can see the overhead increase quickly (though temporarily) once ipdevpoll gets very busy, while increasing the threadpool size visibly counteracts these overheads.
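The idea is essentially to take a timestamp when a call is queued to the threadpool and another when a worker thread actually picks it up. As a standalone illustration (not the actual patch; the wrapper function name is made up, and it is meant to run inside code already driven by the Twisted reactor), it looks roughly like this:

    import time
    import logging
    from twisted.internet import threads

    _logger = logging.getLogger("nav.ipdevpoll.db.timings")

    def timed_defer_to_thread(func, *args, **kwargs):
        """Run func in the reactor's threadpool, logging how long the call
        sat in the queue before a worker thread picked it up."""
        queued_at = time.time()

        def wrapper(*a, **kw):
            waited = time.time() - queued_at
            _logger.debug("waited %.3fs for a free worker thread before %s",
                          waited, getattr(func, "__name__", func))
            return func(*a, **kw)

        return threads.deferToThread(wrapper, *args, **kwargs)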
Looking forward to 4.7
Ditto :)
Have a great weekend, Ingeborg!