Hi,
first of all welcome to the list!
we (UniBasel) have been using NAV for quite a while now and are very pleased with the possibilities NAV gives us. We use Debian as the OS with the following setup:
    root@urz-nav:~# uname -a
    Linux urz-nav 2.6.32-5-amd64 #1 SMP Fri May 10 09:44:53 UTC 2013 x86_64 GNU/Linux
    root@urz-nav:~# cat /etc/debian_version
    6.0.7
    root@urz-nav:~# dpkg -l | grep nav
    ii  nav   2+3.14.15-1   Network Administration Visualized
A problem we have had for a while is that our system is permanently under a lot of load (mostly CPU bound), and we haven't really found a way to reduce the pressure. The hardware we use is an HP blade (ProLiant BL460c G6) with:
Intel(R) Xeon(R) Processor X5550 (4 cores / 8 threads) (8M cache, 2.66 GHz, 6.40 GT/s Intel(R) QPI)
and:

    root@urz-nav:~# free
                 total       used       free     shared    buffers     cached
    Mem:      12321700   11673596     648104          0     387768    9123748
At the moment we have 1460 active Devices (mainly Cisco Switches). Around 30 or 40 are OVERDUE in: https://urz-nav/report/lastupdated
So my question is: do you have any good experience with hardware systems that actually deal with this number of devices, or are there any tuning possibilities (without losing functionality) we could try to reduce the pressure on the system?
Thanks in advance, Mischa Diehm
--
Mischa Diehm | Network Operation Center (NOC)
Universitaet Basel | Universitaetsrechenzentrum
Klingelbergstr. 70 | CH-4056 Basel | Switzerland
Tel. +41 61 267 15 74 | Fax +41 61 267 22 82 | http://urz.unibas.ch
On Tue, 18 Jun 2013 10:25:25 +0000 Mischa Diehm mischa.diehm@unibas.ch wrote:
Hi,
Hi Mischa!
first of all welcome to the list!
Uh, thank you? Welcome yourself :)
A problem we have had for a while is that our system is permanently under a lot of load (mostly CPU bound), and we haven't really found a way to reduce the pressure. The hardware we use is an HP blade (ProLiant BL460c G6) with:
At the moment we have 1460 active Devices (mainly Cisco Switches). Around 30 or 40 are OVERDUE in: https://urz-nav/report/lastupdated
Are all jobs overdue for these devices, or just some of the jobs? Does NAV consider the devices to be reachable and responding to SNMP requests? Does `ipdevpoll.conf` indicate that the jobs are failing due to errors, or just that they are delayed or time out?
So my question is: do you have any good experience with hardware systems that actually deal with this number of devices, or are there any tuning possibilities (without losing functionality) we could try to reduce the pressure on the system?
At the moment, the closest I have access to is a system monitoring 882 devices, but it still isn't in full production mode (meaning, they still have more devices to add). The load number of the system varies wildly with which collection jobs are running at any given moment. They might be seen as high numbers, but the system has 4 cores (with hyperthreading enabled), so the load average is mostly less than the number of cores.
This system is a HP DL360p Gen-8 server, with 12GB RAM and 4x600GB SAS 10K drives mounted in a hardware RAID 1+0 configuration.
We will very soon be migrating PostgreSQL off this server and onto a dedicated server with identical specifications, specifically to alleviate some of the load issues we are experiencing.
ipdevpoll is currently running in its "experimental" multiprocess mode on this system, which means each of the configured jobs in `ipdevpoll.conf` gets its own dedicated process (which improves things on multicore systems). This can be achieved on a more permanent basis by adding the "-m" switch to the ipdevpoll command in the `/etc/nav/init.d/ipdevpoll` script.
We will be using this system for testing performance optimizations to ipdevpoll once we migrate PostgreSQL to a dedicated server. I can post our findings here once we get there, but that probably won't be until August, as I'll be offline most of July.
Hi,
On 19.06.13 12:54, "Morten Brekkevold" morten.brekkevold@uninett.no wrote:
On Tue, 18 Jun 2013 10:25:25 +0000 Mischa Diehm mischa.diehm@unibas.ch wrote:
Hi,
Hi Mischa!
first of all welcome to the list!
Uh, thank you? Welcome yourself :)
A problem we have had for a while is that our system is permanently under a lot of load (mostly CPU bound), and we haven't really found a way to reduce the pressure. The hardware we use is an HP blade (ProLiant BL460c G6) with:
At the moment we have 1460 active Devices (mainly Cisco Switches). Around 30 or 40 are OVERDUE in: https://urz-nav/report/lastupdated
Are all jobs overdue for these devices, or just some of the jobs? Does
As far as I can see, it's mainly the inventory and topo jobs that are overdue.
NAV consider the devices to be reachable and responding to SNMP requests? Does `ipdevpoll.conf` indicate that the jobs are failing due to errors, or just that they are delayed or time out?
Devices are marked up and snmp_status = ok. I don't understand what you mean by the last sentence.
So my question is: do you have any good experience with hardware systems that actually deal with this number of devices, or are there any tuning possibilities (without losing functionality) we could try to reduce the pressure on the system?
At the moment, the closest I have access to is a system monitoring 882 devices, but it still isn't in full production mode (meaning, they still have more devices to add). The load number of the system varies wildly with which collection jobs are running at any given moment. They might be seen as high numbers, but the system has 4 cores (with hyperthreading enabled), so the load average is mostly less than the number of cores.
Yes indeed. The average on our machine is OK, but there are recurring peaks with very high load.
This system is a HP DL360p Gen-8 server, with 12GB RAM and 4x600GB SAS 10K drives mounted in a hardware RAID 1+0 configuration.
We will very soon be migrating PostgreSQL off this server and onto a dedicated server with identical specifications, specifically to alleviate some of the load issues we are experiencing.
OK. We will think about doing the same thing if that is what is necessary to return to more sane load levels.
ipdevpoll is currently running in its "experimental" multiprocess mode on this system, which means each of the configured jobs in `ipdevpoll.conf` gets its own dedicated process (which improves things on multicore systems). This can be achieved on a more permanent basis by adding the "-m" switch to the ipdevpoll command in the `/etc/nav/init.d/ipdevpoll` script.
Adding the -m switch seems not so easy here. The problem is the way daemon() is written and the way `su - xxx -c $CMD` works. How do I add the -m switch so that it is actually used (root is using bash on this system)?
Debug output when starting with the init script:
    Starting ipdevpoll:
    + daemon 'su - navcron -c /usr/lib/nav/ipdevpolld -m'
    + su - navcron -c /usr/lib/nav/ipdevpolld -m
This way the -m switch never actually reaches ipdevpolld... I couldn't find a way to integrate -m without changing the NAV code.
We will be using this system for testing performance optimizations to ipdevpoll once we migrate PostgreSQL to a dedicated server. I can post our findings here once we get there, but that probably won't be until August, as I'll be offline most of July.
That would still be very much appreciated.
Cheers, Mischa
-- Morten Brekkevold UNINETT
On Mon, 24 Jun 2013 07:22:58 +0000 Mischa Diehm mischa.diehm@unibas.ch wrote:
Are all jobs overdue for these devices, or just some of the jobs? Does
As far as I can see, it's mainly the inventory and topo jobs that are overdue.
Those are the heaviest jobs, so that's not surprising, given the circumstances.
NAV consider the devices to be reachable and responding to SNMP requests? Does `ipdevpoll.conf` indicate that the jobs are failing due to errors, or just that they are delayed or time out?
Devices are marked up and snmp_status = ok. I don't understand what you mean by the last sentence.
I'm sorry, there's a typo there. I meant the log file `ipdevpoll.log`. Please check it for error messages; are jobs failing due to errors or timeouts, or are they just running for a very long time? If you send a SIGUSR1 signal to the ipdevpoll daemon process, it will log a list of the currently running jobs and their current runtimes.
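In case it's useful, a minimal sketch of sending that signal on a Debian system; the pidfile path here is an assumption, so adjust it (or use pkill) to match your installation:

    # Assumed pidfile location; adjust for your setup
    kill -USR1 $(cat /var/run/nav/ipdevpolld.pid)

    # Or, without relying on the pidfile (signals all matching processes):
    pkill -USR1 -f ipdevpolld

The list of running jobs and their runtimes should then show up in `ipdevpoll.log`.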
We will very soon be migrating PostgreSQL off this server and onto a dedicated server with identical specifications, specifically to alleviate some of the load issues we are experiencing.
OK. We will think about doing the same thing if that is what is necessary to return to more sane load levels.
The first step to any scale-out operation involving a database is to give the database software dedicated hardware; it's also the fastest/easiest action to take (and as always: Adding more hardware is almost always cheaper than the man-hours required to optimize software).
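For reference, pointing NAV at a remote database is mostly a matter of editing `/etc/nav/db.conf` on the NAV server. A rough sketch, with key names as I recall them from a stock install (verify them against your own file and NAV version; the hostname is of course hypothetical):

    # /etc/nav/db.conf (sketch only; check key names against your version)
    dbhost=pgsql.example.org    # the dedicated PostgreSQL server
    dbport=5432
    db_nav=nav                  # database name
    script_default=nav          # default database user
    userpw_nav=secret           # password for the nav database user

Remember to allow connections from the NAV server in `pg_hba.conf` and to set `listen_addresses` accordingly on the PostgreSQL side.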
ipdevpoll is currently running in its "experimental" multiprocess mode on this system, which means each of the configured jobs in `ipdevpoll.conf` gets its own dedicated process (which improves things on multicore systems). This can be achieved on a more permanent basis by adding the "-m" switch to the ipdevpoll command in the `/etc/nav/init.d/ipdevpoll` script.
Adding the -m switch seems not so easy here. The problem is the way daemon() is written and the way `su - xxx -c $CMD` works. How do I add the -m switch so that it is actually used
You are correct. I checked what we did on this specific server, and we extracted the following separate bash function in `/etc/nav/init.d/ipdevpoll`:
    runit() {
        su - ${user} -c "${IPDEVPOLLD} -m"
    }
and replaced the daemon call under the start section with a call to "daemon runit". We'll be looking at different ways of implementing a multiprocess mode; whatever we come up with will probably use a config file option instead.
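In the meantime, to spell out the current workaround: the change in the start section boils down to something like this sketch (the exact surrounding script structure may differ between NAV versions):

    # In the start section of /etc/nav/init.d/ipdevpoll, the original daemon
    # invocation (roughly: daemon "su - ${user} -c ${IPDEVPOLLD}") is replaced
    # with a call to the runit function above. Because runit quotes the whole
    # command passed to su -c, the -m switch now reaches ipdevpolld instead of
    # being consumed by su.
    daemon runit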
ipdevpoll once we migrate PostgreSQL to a dedicated server. I can post our findings here once we get there, but that probably won't be until August, as I'll be offline most of July.
That would still be very much appreciated.
Making a note of that, then :)
On Tue, 25 Jun 2013 13:50:28 +0200 Morten Brekkevold morten.brekkevold@uninett.no wrote:
ipdevpoll once we migrate PostgreSQL to a dedicated server. I can post our findings here once we get there, but that probably won't be until August, as I'll be offline most of July.
That would still be very much appreciated.
Making a note of that, then :)
I just found that note in my TODO list - sorry about the late update, I hope this finds you well!
We migrated PostgreSQL off the heavily-loaded NAV server and onto a dedicated server early in September and have been running them like that ever since.
As expected, the results were very good. Neither server appears in our daily load graphs (though the main NAV server seems to have some loadavg peaks of 3.0 a few times a week).
Also, we upgraded all our production systems from Linux 2.6 to Linux 3.2 a few weeks later, and this seems to have massively improved the load numbers of all our production servers.
All in all, we are quite satisfied.
On 19 Jun 2013, at 12:54, Morten Brekkevold wrote:
So my question is: do you have any good experience with hardware systems that actually deal with this number of devices, or are there any tuning possibilities (without losing functionality) we could try to reduce the pressure on the system?
At the moment, the closest I have access to is a system monitoring 882 devices, but it still isn't in full production mode (meaning, they still have more devices to add).
So, a bit sorry for dragging up this old thread, but I figured it would be as good as starting a new one (-:
I'm an old NAV user (previous "job") and haven't used it in a while. We use some commercial systems at work that in some ways do the same as NAV (though most of them do some of those things in a far less user-friendly way than NAV). I'm thinking of trying it out, partly because I know it has some nice features, and partly to try out new/improved features since the last time I used it.
In the first round I'm only going to add routers and switches, which is about 1500 units, all Cisco. The last time I used NAV was with far fewer devices, so this time, especially when taking Morten's reply above into consideration, I guess I actually need to "plan ahead" so that I get a system that can handle this number of devices.
It also might be possible that we add servers at some point, but that's further down the road. I don't have an estimate of the number of servers, but we're talking at least 1000 of them (including VMs) -- probably more.
We also have ~1700 access points, but these are controller-based, so I guess there isn't much use in adding them to NAV. I'm not even sure they speak SNMP themselves (only via the CAPWAP tunnel). Maybe only for monitoring uptime, but, meh, we already have plenty of tools for that.
Anyway, what hardware requirements are we looking at here, in terms of storage, CPU and memory? I guess we need two servers: one for NAV, and one for PostgreSQL. Based on your experience with ~900 units, should I expect to double the hardware in order to handle ~1500 units? Keep in mind that I don't want to upgrade this setup if we go down the road where we add servers as well (but I guess they impose less load per unit compared to switches/routers that speak SNMP, so I'm not sure how much extra headroom we need?).
I'm not sure if the system load is very different depending on which NAV features one wants to use? 'Arnold' and 'Syslog analyzer' are two things that come to mind that we won't be using.
On Thu, 27 Feb 2014 07:23:41 +0100 "Joachim Tingvold" joachim@tingvold.com wrote:
So, a bit sorry for dragging up this old thread, but I figured it would be as good as starting a new one (-:
And sorry for the late reply!
I'm an old NAV user (previous "job") and haven't used it in a while. We use some commercial systems at work that in some ways do the same as NAV (though most of them do some of those things in a far less user-friendly way than NAV). I'm thinking of trying it out, partly because I know it has some nice features, and partly to try out new/improved features since the last time I used it.
Are you working in an academic or enterprise setting? We'd love to have some feedback on your ongoing experience with NAV :)
We also have ~1700 access points, but these are controller-based, so I guess there isn't much use in adding them to NAV. I'm not even sure they speak SNMP themselves (only via the CAPWAP tunnel). Maybe only for monitoring uptime, but, meh, we already have plenty of tools for that.
None of our customers monitor their WLC slave APs through NAV, just the wireless LAN controllers themselves. They prefer Cisco's own WLC-related software to monitor the slaves. Oftentimes, the slaves won't even have their own IP addresses.
Anyway, what hardware requirements are we looking at here, in terms of storage, CPU and memory? I guess we need two servers: one for NAV, and one for PostgreSQL. Based on your experience with ~900 units, should I expect to double the hardware in order to handle ~1500 units? Keep in mind that I don't want to upgrade this setup if we go down the road where we add servers as well (but I guess they impose less load per unit compared to switches/routers that speak SNMP, so I'm not sure how much extra headroom we need?).
There should be no need to double the hardware, the resource requirement is not linear. The single most important thing you can do is have a dedicated PostgreSQL server, so that the database doesn't have to compete with the rest of NAV's processes for system resources. Throw lots of RAM at PostgreSQL, so that the most important bits can be kept cached at all times.
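To make that concrete, here is a rough illustration of where one might start in `postgresql.conf` on a dedicated database host with around 12GB of RAM. The values are assumptions to be tuned against your own workload, not a recommended configuration:

    # postgresql.conf: illustrative starting values for a 12GB dedicated host
    shared_buffers = 3GB              # roughly 1/4 of RAM is a common rule of thumb
    effective_cache_size = 9GB        # estimate of RAM available for caching (PG + OS)
    work_mem = 16MB                   # per-sort/hash memory; keep modest
    maintenance_work_mem = 512MB      # speeds up VACUUM and index builds
    checkpoint_completion_target = 0.9

The main point is simply to let PostgreSQL and the OS page cache keep the hot parts of the NAV database in memory.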
Use reasonably modern hardware, fast disks (or even SSDs) and opt for RAID striping if you can (we use RAID 10 in production).
Also, unless you equip all your servers with SNMP agents that you wish NAV to monitor, they will not cause much load - NAV will just ping them (and monitor any services you explicitly configured it to).
I'm not sure if the system load is very different depending on which NAV features one wants to use? 'Arnold' and 'Syslog analyzer' are two things that come to mind that we won't be using.
These aren't really resource-intensive. SNMP processing and data storage are what make up the bulk of NAV's resource usage.
On 3 Mar 2014, at 13:30, Morten Brekkevold wrote:
So, a bit sorry for dragging up this old thread, but I figured it would be as good as starting a new one (-:
And sorry for the late reply!
No worries (-:
Are you working in an academic or enterprise setting? We'd love to have some feedback on your ongoing experience with NAV :)
Enterprise. I don't have any 'ongoing' experience with NAV, but when I get the time to set up NAV (around summer), I'll be sure to share my updated experience regarding NAV (-:
(I'll probably have shitloads of questions as well (-: ).
None of our customers monitor their WLC slave APs through NAV, just the wireless LAN controllers themselves. They prefer Cisco's own WLC-related software to monitor the slaves. Oftentimes, the slaves won't even have their own IP addresses.
Yeah, that's probably what we'll conclude as well. In our case they have their own IP addresses, so we could have NAV ping them, but again, that would probably cause a lot of noise, as there are always some access points going offline/online all over the place.
There should be no need to double the hardware, the resource requirement is not linear. The single most important thing you can do is have a dedicated PostgreSQL server, so that the database doesn't have to compete with the rest of NAV's processes for system resources. Throw lots of RAM at PostgreSQL, so that the most important bits can be kept cached at all times.
What is 'lots of RAM'? Are we talking ~32GB or ~128GB? :-P
Use reasonably modern hardware, fast disks (or even SSDs) and opt for RAID striping if you can (we use RAID 10 in production).
We'll probably use our VMware-environment. At least I think we'll start with that, to see how it copes with a virtualised environment. They just set up some new SAN solution that's supposedly pretty fast, so hopefully that'll do.
Do you have any indication as to what amount of storage is required? Both at the server running NAV, and the one running PostgreSQL? Probably depends on how long one wants to store things, but, yeah? At least some kind of indication of what sizes we're talking about? (I.e. 500GB vs. 2TB or whatnot?).
Also, unless you equip all your servers with SNMP agents that you wish NAV to monitor, they will not cause much load - NAV will just ping them (and monitor any services you explicitly configured it to).
Yes, this is what I suspected. _IF_ we decide to put in servers, it'll only be ping. However, we have a lot of churn, with servers being deleted/created (especially in our virtual environment), so we'd have to test how much noise that would create.
Is there some kind of API in NAV, that could help manage the device database? I.e. some way that we could automate the list of servers NAV should ping by making a script or whatnot? Or maybe one can just interact directly with the PostgreSQL-database?
I'm not sure if the system load is very different depending on which NAV features one wants to use? 'Arnold' and 'Syslog analyzer' are two things that come to mind that we won't be using.
These aren't really resource-intensive. SNMP processing and data storage are what make up the bulk of NAV's resource usage.
OK.
Thanks for your feedback so far (-:
On Mon, 03 Mar 2014 14:44:16 +0100 "Joachim Tingvold" joachim@tingvold.com wrote:
Enterprise. I don't have any 'ongoing' experience with NAV, but when I get the time to set up NAV (around summer), I'll be sure to share my updated experience regarding NAV (-:
(I'll probably have shitloads of questions as well (-: ).
Great :)
Throw lots of RAM at PostgreSQL, so that the most important bits can be kept cached at all times.
What is 'lots of RAM'? Are we talking ~32GB or ~128GB? :-P
In the aforementioned case study, the NAV server and the PostgreSQL server are equipped with 12GB each.
The PostgreSQL server is currently using about 10GB for buffers and cache, the rest is used by the running processes. The NAV server is currently using 7GB for buffers and cache, and still has ~1GB memory free.
This installation currently monitors 988 devices/52543 ports. I understand they have quite a few HP switches that are still waiting to be monitored, but they have been kept out because they have problems keeping up with NAV's SNMP requests.
Use reasonably modern hardware, fast disks (or even SSDs) and opt for RAID striping if you can (we use RAID 10 in production).
We'll probably use our VMware-environment. At least I think we'll start with that, to see how it copes with a virtualised environment. They just set up some new SAN solution that's supposedly pretty fast, so hopefully that'll do.
I would not normally recommend running a big PostgreSQL production installation in a virtual environment, but your mileage may vary.
Do you have any indication as to what amount of storage is required? Both at the server running NAV, and the one running PostgreSQL? Probably depends on how long one wants to store things, but, yeah? At least some kind of indication of what sizes we're talking about? (I.e. 500GB vs. 2TB or whatnot?).
The aforementioned PostgreSQL installation currently uses 17GB. It contains more than a year's worth of data and has approximately doubled in size in the past 6 months. You may want to clean out CAM and ARP logs regularly, though.
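For the CAM/ARP cleanup, something along these lines could be run periodically from cron. The table and column names (`cam`/`arp` with an `end_time` column, where open records use `end_time = 'infinity'`) are based on my reading of NAV's schema, so please verify them against your own database before deleting anything:

    #!/bin/sh
    # Prune closed CAM/ARP records older than six months (sketch only;
    # verify table/column names against your NAV schema first).
    psql -U nav nav <<'EOF'
    DELETE FROM cam WHERE end_time <> 'infinity' AND end_time < now() - interval '6 months';
    DELETE FROM arp WHERE end_time <> 'infinity' AND end_time < now() - interval '6 months';
    VACUUM ANALYZE cam;
    VACUUM ANALYZE arp;
    EOF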
RRD data on the NAV server is 6.8GB - this usage does not increase with time, only with the number of things monitored.
Is there some kind of API in NAV, that could help manage the device database? I.e. some way that we could automate the list of servers NAV should ping by making a script or whatnot? Or maybe one can just interact directly with the PostgreSQL-database?
Not really, though we are looking into it, as we added some API functionality last year. We are looking to synchronize our own NAV installation from our authoritative inventory database, and our currently working solution uses NAV's Django ORM models as its "API".
Thanks for your feedback so far (-:
No problem :)
On 3 Mar 2014 13:31, "Morten Brekkevold" morten.brekkevold@uninett.no wrote:
None of our customers monitor their WLC slave APs through NAV, just the wireless LAN controllers themselves. They prefer Cisco's own WLC-related software to monitor the slaves. Oftentimes, the slaves won't even have their own IP addresses.
We do, but ICMP ping only. The images on the APs have SNMP, but it's not configurable; I think they just couldn't remove it. The only useful thing you get is reachability stats, which is only useful if your distribution network is unstable (i.e. radio links/mesh usage on open bands).
As Morten said, there are very few use cases for adding the APs directly, and you'll know if you need it.
Both LWAPP and CAPWAP use IPv4 transport (not in the OSI sense), so those APs always have an IP address. I am not sure about other vendors (Aruba/Meru), though.
-- Morten Brekkevold UNINETT
Christoffer Viken / CVi
Network Elf .:|:.:|:. Trådløse Trondheim / Wireless Trondheim tradlosetrondheim.no
mischa.diehm@unibas.ch said:
A problem we have had for a while is that our system is permanently under a lot of load (mostly CPU bound), and we haven't really found a way to reduce the pressure.
mischa.diehm@unibas.ch said:
At the moment we have 1460 active Devices (mainly Cisco Switches)
Our NAV is monitoring 1370 devices (routers, switches, servers, printers, UPSes, misc), but as Morten has already suggested, we run PostgreSQL on a separate server. I think PostgreSQL and some of the NAV processes step on each other's toes and each uses more resources when they run on the same box than when they have a box to themselves.
--Ingeborg