On Tue, 31 Mar 2020 13:20:05 +0200 (CEST) carsten@strotmann.de wrote:
> As I have seen systems similar to NAV break down in large networks (> 100,000 devices), I would like to know whether anyone has experience with running NAV at that scale, and how well it scales in such networks.
> Anyone using NAV in such large environments?
I don't know of anyone, but you never know who you'll find on this mailing list. I'm just here to throw some theoretical considerations into the mix :-)
In a 100,000-node system, the biggest bottleneck would probably be Graphite, followed by NAV's SNMP collector (ipdevpoll).
Graphite is the third-party time-series database NAV stores its metrics in. It has a fairly high I/O bandwidth footprint, but it is designed to be scaled horizontally - it just takes some effort to configure.
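
For reference, "scaling Graphite horizontally" usually means putting a carbon-relay in front of several carbon-cache instances (on one or more hosts) and sharding metrics between them with consistent hashing. A rough sketch of the relevant carbon.conf bits, assuming two cache instances on the same box (the ports and instance names are just examples):

  [cache:a]
  LINE_RECEIVER_PORT = 2003
  PICKLE_RECEIVER_PORT = 2004
  CACHE_QUERY_PORT = 7002

  [cache:b]
  LINE_RECEIVER_PORT = 2103
  PICKLE_RECEIVER_PORT = 2104
  CACHE_QUERY_PORT = 7102

  [relay]
  LINE_RECEIVER_PORT = 2013
  PICKLE_RECEIVER_PORT = 2014
  RELAY_METHOD = consistent-hashing
  REPLICATION_FACTOR = 1
  # shard incoming metrics across the two cache instances
  DESTINATIONS = 127.0.0.1:2004:a, 127.0.0.1:2104:b

If I remember correctly, NAV's graphite.conf would then be pointed at the relay rather than at a single carbon-cache, and graphite-web's CARBONLINK_HOSTS setting has to list all the cache instances so the web interface can still query data that hasn't been flushed to disk yet.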
PostgreSQL has its own scaling strategies, but should always be hosted on dedicated hardware in a scenario like this.
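
Moving the database onto its own box is mostly a matter of pointing NAV's db.conf at it. Going from memory, so the exact key names may differ slightly between NAV versions (the hostname and password below are obviously placeholders):

  # /etc/nav/db.conf (sketch)
  dbhost=postgres.example.org
  dbport=5432
  db_nav=nav
  script_default=nav
  userpw_nav=changeme

The PostgreSQL server itself would of course also need to listen on the network (listen_addresses in postgresql.conf) and allow connections from the NAV host in pg_hba.conf.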
Some work is also being done on making ipdevpoll more horizontally scalable. Specifically, pull request #2128 (https://github.com/Uninett/nav/pull/2128) adds configuration options that let individual ipdevpoll instances collect from specific groups of devices only. The contributor's goal is to run distributed collection inside closed networks, but it would essentially work as a scaling mechanism as well.