Hi NAV gurus,
I have been testing NAV on and off for a while with a fraction of our devices on a small VM. I like it more and more as I am getting used to it. I would like to get a recommendation for a NAV setup that can handle our environment.
Our network is layered like the following:
Border Router: 2
Core and Area Router: ~30
Building Router: ~100
Edge Switch: ~1500
Shall we use one big server or multiple servers, physical or virtual for various NAV management components? What CPU, MEMORY, DISK IO, and DISK size shall we plan for?
Thank you very much for your help.
Zhi-Wei Lu
IET-CR-Network Operations Center University of California, Davis (530) 752-0155
On Mon, 30 Apr 2018 03:18:00 +0000 Zhi-Wei Lu zwlu@ucdavis.edu wrote:
I have been testing NAV on and off for a while with a fraction of our devices on a small VM. I like it more and more as I am getting used to it. I would like to get a recommendation for a NAV setup that can handle our environment.
[snip]
Shall we use one big server or multiple servers, physical or virtual for various NAV management components? What CPU, MEMORY, DISK IO, and DISK size shall we plan for?
Hi,
So, with reference to our short video conference yesterday, I'm posting some technical details here for all to see.
As some of you already know, Uninett operates multiple NAV installations on behalf of our customers, and we do this by provisioning physical servers which are installed on the customer site and operated centrally from Uninett.
Our current hardware generation is standardized on Dell PowerEdge R430 servers, with tech specs more or less like this:
- CPU: Intel Xeon E5-2660 v3 2.6GHz
- 32GB RAM
- 4x1TB 7.2K SATA 6Gbps hard drives in a software RAID 10 configuration
- 2x200GB Intel S3710 SSD in a software RAID 1 configuration
- Dual hot-plug redundant power supplies
What we look for is primarily multiple cores and DISK I/O. The minimum disk sizes available today will usually provide much more than any known NAV installation will need. A bunch of RAM for OS cache/buffering is ideal.
For most of our customers, one such server is enough to power a full NAV installation, with PostgreSQL, Graphite and other 3rd party software which is included in our service.
For the larger customers, we scale this out by adding more servers (for a total of 2 or 3, dependent on continuous performance measurements).
We find that the biggest bottleneck of a complete NAV system is Graphite's performance, which is why we now include SSDs in a standard server. The I/O requirements of Graphite tend to overtake a system as the number of metrics grows, and only SSDs appear to be able to take the write-I/O load. The SSD RAID is usually reserved for Graphite data only.
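To get a rough feel for how the Graphite load grows with the number of metrics, here is a back-of-the-envelope sketch in Python. It relies on the Whisper file layout (a 16-byte file header, 12 bytes per archive header, 12 bytes per datapoint, one file per metric). Everything else is an assumption made up for illustration: the figure of 200 metrics per device, the retention schema, and the function names. The device count of 1632 is simply the sum of the approximate figures in the original question.

```python
# Rough Graphite (Whisper) sizing sketch. The input figures are assumptions,
# not measurements from any real NAV installation.

WHISPER_METADATA_BYTES = 16      # per-file header
WHISPER_ARCHIVE_INFO_BYTES = 12  # one header entry per archive
WHISPER_POINT_BYTES = 12         # one (timestamp, value) datapoint


def whisper_file_size(archives):
    """Return the size in bytes of one Whisper file.

    archives is a list of (seconds_per_point, number_of_points) tuples.
    """
    header = WHISPER_METADATA_BYTES + WHISPER_ARCHIVE_INFO_BYTES * len(archives)
    points = sum(count for _, count in archives)
    return header + WHISPER_POINT_BYTES * points


def estimate(devices, metrics_per_device, archives):
    """Estimate metric count, disk usage and datapoint updates per second."""
    metrics = devices * metrics_per_device
    finest_interval = min(step for step, _ in archives)
    return {
        "metrics": metrics,
        "disk_bytes": metrics * whisper_file_size(archives),
        # Each metric receives one new datapoint per finest interval.
        "updates_per_second": metrics / finest_interval,
    }


# Hypothetical retention: 5-minute points for 30 days, hourly points for 1 year.
ARCHIVES = [(300, 30 * 24 * 12), (3600, 365 * 24)]

if __name__ == "__main__":
    result = estimate(devices=1632, metrics_per_device=200, archives=ARCHIVES)
    print(f"metrics:        {result['metrics']}")
    print(f"disk usage:     {result['disk_bytes'] / 1e9:.1f} GB")
    print(f"updates/second: {result['updates_per_second']:.0f}")
```

Since every metric lives in its own file, each update is a small write to a different place on disk, which is the kind of load that spinning disks handle poorly. The disk size itself is rarely the problem.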
While NAV itself was never designed with horizontal scaling in mind, the most efficient way to scale out is to run PostgreSQL and Graphite on separate servers. While Graphite may run just fine as long as you have the SSDs for metric storage, a PostgreSQL instance with a high workload benefits greatly from not sharing a processor with other services.
Graphite, however, was designed with horizontal scaling in mind. If necessary, it can be scaled out to multiple servers to support millions of metrics - but it requires planning and usually lots of configuration by hand.
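To illustrate the idea behind spreading metrics over several Graphite servers: a relay in front of the storage nodes decides which node owns each metric name, and consistent hashing is one common way to make that decision. The Python sketch below shows the general technique only. It is not Graphite's own implementation, and the class name, node names and replica count are all made up for the example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring (illustrative, not carbon's algorithm)."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring at `replicas` pseudo-random positions
        # so that metrics are spread reasonably evenly.
        self._ring = sorted(
            (self._position(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _position(key):
        # Use the first 32 bits of an MD5 digest as the ring position.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def node_for(self, metric):
        """Return the node owning the first ring position at or after the metric's."""
        idx = bisect.bisect_left(self._positions, self._position(metric))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    # Hypothetical storage nodes and metric name.
    ring = HashRing(["graphite-a", "graphite-b", "graphite-c"])
    print(ring.node_for("nav.devices.example-sw.ports.Gi1_0_1.ifInOctets"))
```

The useful property is that removing or adding a node only moves the metrics that belonged to that node; everything else stays where it was, so existing data files do not have to be reshuffled wholesale.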
If availability is critical, PostgreSQL can also be configured in a high-availability fashion with multiple servers, but this requires a high level of PostgreSQL/DBA skill (or at least, lots of time to acquire these skills).
Although NAV was not really designed for it, the individual backend processes can be split up and run on different servers, as long as they all can access the PostgreSQL and Graphite services. The SNMP collector (ipdevpoll) is not currently designed for horizontal scaling, but can scale to multiple cores/processors on a single server. This scaling model can, however, be adapted to a more horizontally distributed model with some work on the codebase (which we are considering putting on our roadmap).
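When backend processes are moved to a separate server, a first sanity check is whether that server can reach the PostgreSQL and Graphite services at all. Here is a small Python sketch for that. The host names are placeholders, and the ports are just the usual defaults (5432 for PostgreSQL, 2003 for Carbon's plaintext listener), so adjust both to your own setup. It only tests TCP reachability, not authentication or NAV's own configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical host names; the ports are the common defaults.
SERVICES = {
    "PostgreSQL": ("nav-db.example.org", 5432),
    "Carbon (Graphite ingest)": ("nav-graphite.example.org", 2003),
}


def check_services(services):
    """Map each service name to whether it is reachable from this host."""
    return {
        name: can_connect(host, port)
        for name, (host, port) in services.items()
    }


if __name__ == "__main__":
    for name, reachable in check_services(SERVICES).items():
        print(f"{name}: {'reachable' if reachable else 'NOT reachable'}")
```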
The network you are describing seems comparable in size to our largest customer in Norway. They currently run on three physical servers with the NAV+PostgreSQL+Graphite split described above. The Graphite server was introduced because there were no free slots for SSDs in their two existing servers. Now that our current hardware generation includes SSDs by default, we are considering testing a reduction to two servers at the next upgrade (but our service level does not include high availability, which would require more servers in general).
With regard to physical vs. virtual servers, I personally have some reservations about the performance of virtual servers, because of the high write I/O loads NAV can generate (specifically in PostgreSQL and Graphite), but your mileage may vary.
If you have more specific questions, please don't hesitate to ask them here :) I see now that we should maybe spend some time writing about how to scale a NAV installation in our docs.