Hi all!
I'm looking into upgrading our NAV installation; we're on 3.15 and it's time to move on to 4.
I'm looking for advice on hardware setup, filesystem layouts and so on.
Currently we have about 1500 devices, 99% routers and switches, in the database (mostly Catalysts ranging from 2940 to 3750, a few 6500 and Nexus 7k/5k, Juniper SRX/EX and a Juniper QFX, around 50 virtual Cisco ASA contexts as well).
Hardware is a single server: 2x dual-core Opteron 2.6 GHz, 32 GB RAM and several hardware RAID 1 arrays. I've put PostgreSQL on its own 15k RPM RAID 1, likewise /var/lib/nav, and confined /var/log/nav to its own partition on the OS drives.
I have been and will be using the excellent Debian packages from UNINETT for my installation.
I've done some tricks to optimize disk performance: switching to the "noop" disk scheduler makes a big difference in system sluggishness, even with loads around 30. This is mostly related to my hardware RAID; the default Debian scheduler does not perform well on top of a hardware RAID controller (the same goes for SSDs).
I'm planning a new installation, this time on two servers, but I have yet to determine disk and RAID layouts and what to put on each server.
Moving PostgreSQL away from the collector is rather a given. I'm unfamiliar with Graphite; could/should it be put on another server? (I might find a third server if it's recommended.)
What will the default datastores be? I'd like to set up separate partitions before installing the NAV packages, to make sure filesystems like /var don't run out of space when NAV logs grow, PostgreSQL goes wild and/or Graphite stores everything I'm actually asking it to store... What will typically be hit hard with random I/O, sequential reads/writes and so on?
Any input would be appreciated.
Cheers,
-Sigurd
On Mon, 30 Jun 2014 11:28:09 +0200 Sigurd Mytting sigurd@mytting.no wrote:
> Hi all!
Hi Sigurd,
> I'm looking into upgrading our NAV installation; we're on 3.15 and it's time to move on to 4.
> I'm looking for advice on hardware setup, filesystem layouts and so on.
> Currently we have about 1500 devices, 99% routers and switches, in the database (mostly Catalysts ranging from 2940 to 3750, a few 6500 and Nexus 7k/5k, Juniper SRX/EX and a Juniper QFX, around 50 virtual Cisco ASA contexts as well).
> Hardware is a single server: 2x dual-core Opteron 2.6 GHz, 32 GB RAM and several hardware RAID 1 arrays. I've put PostgreSQL on its own 15k RPM RAID 1, likewise /var/lib/nav, and confined /var/log/nav to its own partition on the OS drives.
Looks pretty good to me :)
> I have been and will be using the excellent Debian packages from UNINETT for my installation.
> I've done some tricks to optimize disk performance: switching to the "noop" disk scheduler makes a big difference in system sluggishness, even with loads around 30. This is mostly related to my hardware RAID; the default Debian scheduler does not perform well on top of a hardware RAID controller (the same goes for SSDs).
It's always good to experiment with I/O schedulers, as the default CFQ scheduler isn't the best for the kind of loads on a typical NAV server.
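For reference, you can check and switch the scheduler at runtime like this (sdb is just an example device; persist the choice with the elevator= kernel parameter or a udev rule if you want it to survive reboots):

  # show the available schedulers; the active one appears in brackets
  cat /sys/block/sdb/queue/scheduler
  # switch this device to noop until the next reboot
  echo noop > /sys/block/sdb/queue/scheduler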
> I'm planning a new installation, this time on two servers, but I have yet to determine disk and RAID layouts and what to put on each server.
> Moving PostgreSQL away from the collector is rather a given. I'm unfamiliar with Graphite; could/should it be put on another server? (I might find a third server if it's recommended.)
Our current template for customer deployments is a single Intel Xeon E5620 (Quad core @ 2.40GHz) based server, with 12GB RAM and 4x600GB 10K SAS disks in a RAID 10 configuration.
This has scaled well for all pre-4.0 deployments, except at NTNU, where we deployed two of these servers, separating PostgreSQL onto one of them.
Under NAV 4.0, more metrics are collected; with ~40k ports / ~1k devices monitored and their metrics stored in Graphite, these two servers no longer scale very well for NTNU, even with Graphite load-balanced across the two. One other customer deployment is also running into scaling problems now.
Therefore, we are currently deploying dedicated Graphite servers to these two customers. These servers feature an SSD RAID for metric storage, and we will see how they fare in the coming days.
> What will the default datastores be? I'd like to set up separate partitions before installing the NAV packages, to make sure filesystems like /var don't run out of space when NAV logs grow, PostgreSQL goes wild and/or Graphite stores everything I'm actually asking it to store... What will typically be hit hard with random I/O, sequential reads/writes and so on?
> Any input would be appreciated.
Our template is to have separate partitions for /, /boot, /home, /usr, /usr/local, /var, /var/log and /var/lib/postgresql. We use LVM for flexibility in partition sizes.
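Just as a rough sketch of how that kind of layout can be set up (the volume group name and sizes below are only examples, not a sizing recommendation):

  # carve out logical volumes from an example volume group "vg0"
  lvcreate -n var -L 50G vg0
  lvcreate -n varlog -L 20G vg0
  lvcreate -n pgsql -L 100G vg0
  mkfs.ext4 /dev/vg0/var && mkfs.ext4 /dev/vg0/varlog && mkfs.ext4 /dev/vg0/pgsql
  # and corresponding /etc/fstab entries, e.g.:
  # /dev/vg0/pgsql  /var/lib/postgresql  ext4  defaults,noatime  0  2

With LVM you can also leave free space in the volume group and grow whichever filesystem fills up first.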
Under NAV 4.0 on Debian, /var and /var/lib/postgresql will be hit the hardest, as Graphite metrics are stored in /var/lib/graphite/ - you should already know your PostgreSQL load :)
I can report back when we have more experience with an SSD-based setup, but my intuition is that you could probably get away with mounting /var/lib/graphite to an SSD-based RAID on an existing NAV server. Unfortunately, there are no free drive bays in our already deployed servers, which is why we are deploying a separate Graphite server to customers where things don't scale well.
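If you do try that on an existing server, remember to stop the carbon daemons and copy the existing whisper files onto the new filesystem before switching the mount point; roughly (device and service names depend on your installation):

  service carbon-cache stop                          # stop metric writes
  rsync -a /var/lib/graphite/ /mnt/ssd-graphite/     # copy existing whisper files
  # add an fstab entry mounting the SSD array at /var/lib/graphite, then:
  mount /var/lib/graphite
  service carbon-cache start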
If your existing server is running fine with both NAV 3.15 and PostgreSQL, I probably wouldn't deviate from that, but you might want to start thinking about a separate Graphite server with SSDs for NAV 4.0.
[Morten Brekkevold]
Thanks for your input, most valuable when designing our new NAV deployment!
| Our current template for customer deployments is a single Intel Xeon
| E5620 (Quad core @ 2.40GHz) based server, with 12GB RAM and 4x600GB 10K
| SAS disks in a RAID 10 configuration.
Out of curiosity, how many devices/ports would run fairly OK on such a configuration?
| If your existing server is running fine with both NAV 3.15 and
| PostgreSQL, I probably wouldn't deviate from that, but you might want to
| start thinking about a separate Graphite server with SSDs for NAV 4.0.
'Running fine' is a matter of definition; my current setup works, but it could be a lot snappier. From your input I guess I'll go for three servers, putting PostgreSQL and Graphite on dedicated iron.
Would Graphite hog memory? I have a couple of servers with 96 GB and one with 32 GB RAM available; should I give the memory to Graphite?
Cheers, -Sigurd
On Tue, 01 Jul 2014 14:47:19 +0200 Sigurd Mytting sigurd@mytting.no wrote:
> | Our current template for customer deployments is a single Intel Xeon
> | E5620 (Quad core @ 2.40GHz) based server, with 12GB RAM and 4x600GB 10K
> | SAS disks in a RAID 10 configuration.
> Out of curiosity, how many devices/ports would run fairly OK on such a configuration?
The biggest I see is 519 devices / 17701 ports, but this server stopped breathing OK after the upgrade to NAV 4 and Graphite.
We also have one with 194 devices / 19308 ports. Its load is a bit high, but not enough to alarm us in any way.
The port count will drive up the load for storing statistics, while the device count will drive up the general load for monitoring.
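As a very rough illustration of the storage side (the numbers here are assumptions for the sake of the example, not our actual retention settings): Whisper preallocates roughly 12 bytes per datapoint, so if each port produces, say, 5 metrics and each metric keeps on the order of 100,000 datapoints across its retention levels, 17701 ports works out to about 17701 x 5 x 100,000 x 12 bytes, i.e. roughly 100 GB of whisper files, all of them updated with small random writes.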
> | If your existing server is running fine with both NAV 3.15 and
> | PostgreSQL, I probably wouldn't deviate from that, but you might want to
> | start thinking about a separate Graphite server with SSDs for NAV 4.0.
> 'Running fine' is a matter of definition; my current setup works, but it could be a lot snappier. From your input I guess I'll go for three servers, putting PostgreSQL and Graphite on dedicated iron.
You will have a lot more leeway for scaling up with that configuration, at least :-)
> Would Graphite hog memory? I have a couple of servers with 96 GB and one with 32 GB RAM available; should I give the memory to Graphite?
I'd say the memory footprint of Graphite is negligible. The carbon-cache daemons will cache datapoints in memory to smooth out I/O-write activity, but you generally don't want them to keep a lot of uncommitted data around in memory.
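If you want to see how that trade-off is tuned, the relevant knobs live in carbon.conf; roughly along these lines (the values here are just examples, not recommendations):

  [cache]
  # cap on how many datapoints carbon-cache will hold in memory
  MAX_CACHE_SIZE = 2000000
  # throttle whisper updates per second to smooth out write I/O
  MAX_UPDATES_PER_SECOND = 500
  # limit how many new whisper files may be created per minute
  MAX_CREATES_PER_MINUTE = 50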
The graphite-web component will use a little bit more memory, but we're still talking sub-gigabyte levels. Once you start graphing a lot, it makes sense to throw a small memcached instance on it to improve performance, but still, I'd say 32GB is way overkill :) If you're not afraid of a little data loss, you could use that to store all your Whisper files in a ramdisk and sync them to physical, rotating drives every so often ;-)
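For reference, pointing graphite-web at a local memcached is a one-line setting in its local_settings.py once memcached is installed (the path below is where Debian tends to put it; adjust to your installation):

  # /etc/graphite/local_settings.py
  MEMCACHE_HOSTS = ['127.0.0.1:11211']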