Hello,
we're having problems with gaps in graphs (only for some interfaces) and I'm working through the https://nav.uninett.no/doc/4.8/faq/graph_gaps.html guide. I've arrived at the part with the "Carbon Cache" and I'd like to know whether those two graphs represent a healthy state: http://abload.de/img/e7zassl.png http://abload.de/img/jyz3sjf.png
What strikes me as odd is that there are negative numbers; I can't see any of those on the graph at https://nav.uninett.no/doc/4.8/faq/graph_gaps.html
Best regards Karl
On Fri, 16 Feb 2018 17:13:08 +0100 Karl Gerhard karl_gerh@gmx.at wrote:
Hello,
we're having problems with gaps in graphs (only for some interfaces) and I'm working through the https://nav.uninett.no/doc/4.8/faq/graph_gaps.html guide. I've arrived at the part with the "Carbon Cache" and I'd like to know whether those two graphs represent a healthy state: http://abload.de/img/e7zassl.png http://abload.de/img/jyz3sjf.png
This does not look like a cache that is being saturated, so if there's a problem, it's likely somewhere else. However, as you noted, it is a bit disconcerting that a negative number of cache entries is reported (which should be impossible).
Which version of carbon are you on?
Hello Morten,
thank you very much for your help.
Picture e7: graphite-carbon 0.9.15-1, nav 4.8.2-1stretch (debian stretch)
Picture jy: graphite-carbon 0.9.15-1, nav 4.8.2-2stretch (debian stretch)
We don't have any SNMP timeouts in our logs, ipdevpoll reports no errors, and strangely enough this issue affects only some interfaces:
* On one device we have an AE consisting of 4 interfaces, and all 4 member interfaces have graphs without gaps, while the AE interface's graph is mostly gaps. The problems are not limited to AE interfaces though, that would be too easy.
* On another device we have interfaces with 1000BASE-T (= copper), and some of them have no gaps at all while others are mostly gaps.
Best regards Karl
----------------------------------------
*From:* Morten Brekkevold [mailto:morten.brekkevold@uninett.no]
*Sent:* Monday, Feb 19, 2018 11:36 AM CET
*To:* Karl Gerhard
*Cc:* nav-users@uninett.no
*Subject:* Is that a healthy carbon cache?
On Fri, 16 Feb 2018 17:13:08 +0100 Karl Gerhard karl_gerh@gmx.at wrote:
Hello,
we're having problems with gaps in graphs (only for some interfaces) and I'm working through the https://nav.uninett.no/doc/4.8/faq/graph_gaps.html guide. I've arrived at the part with the "Carbon Cache" and I'd like to know whether those two graphs represent a healthy state: http://abload.de/img/e7zassl.png http://abload.de/img/jyz3sjf.png
This does not look like a cache that is being saturated, so if there's a problem, it's likely somewhere else. However, as you noted, it is a bit disconcerting that a negative number of cache entries is reported (which should be impossible).
Which version of carbon are you on?
On Mon, 19 Feb 2018 18:25:55 +0100 Karl Gerhard karl_gerh@gmx.at wrote:
Picture e7: graphite-carbon 0.9.15-1, nav 4.8.2-1stretch (debian stretch)
Picture jy: graphite-carbon 0.9.15-1, nav 4.8.2-2stretch (debian stretch)
The negative cache size thing seems to be a known problem in that version: https://github.com/graphite-project/carbon/issues/420 - but I'm not clear on whether the issue is more serious than faulty reports...
We don't have any SNMP timeouts in our logs, ipdevpoll reports no errors, and strangely enough this issue affects only some interfaces:
* On one device we have an AE consisting of 4 interfaces, and all 4 member interfaces have graphs without gaps, while the AE interface's graph is mostly gaps. The problems are not limited to AE interfaces though, that would be too easy.
* On another device we have interfaces with 1000BASE-T (= copper), and some of them have no gaps at all while others are mostly gaps.
Did you verify that the 5minstats jobs for these devices are running at the correct intervals, as suggested by the guide?
Also, are you sure you have configured NAV (in SeedDB) to use SNMP v2c and not SNMP v1 on these devices (just want to rule out the use of 32-bit counters, which would really make things bad on high-speed interfaces)?
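To illustrate why 32-bit counters would be so bad here: at 1 Gbit/s, a 32-bit octet counter wraps in roughly half a minute, so a 5-minute polling interval can miss multiple wraps entirely. A quick back-of-the-envelope check:

```shell
# Seconds until a 32-bit octet counter wraps at 1 Gbit/s:
# 2^32 bytes * 8 bits/byte / 1e9 bits/s
awk 'BEGIN { printf "%.1f\n", 2^32 * 8 / 1e9 }'
# prints 34.4
```

SNMP v2c gives access to the 64-bit ifHC* counters, which sidestep the problem.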
Hello Morten
apologies for the late reply, the flu season is terrible this year.
I have verified that the 5min jobs are running at the correct intervals and that we're using SNMP v2c everywhere. I checked everything suggested in the "Debugging gaps in graphs" article, and I'm looking for advice on how to proceed further.
Best regards Karl
----------------------------------------
*From:* Morten Brekkevold [mailto:morten.brekkevold@uninett.no]
*Sent:* Tuesday, Feb 20, 2018 8:49 AM CET
*To:* Karl Gerhard
*Cc:* nav-users@uninett.no
*Subject:* Is that a healthy carbon cache?
On Mon, 19 Feb 2018 18:25:55 +0100 Karl Gerhard karl_gerh@gmx.at wrote:
Picture e7: graphite-carbon 0.9.15-1, nav 4.8.2-1stretch (debian stretch)
Picture jy: graphite-carbon 0.9.15-1, nav 4.8.2-2stretch (debian stretch)
The negative cache size thing seems to be a known problem in that version: https://github.com/graphite-project/carbon/issues/420 - but I'm not clear on whether the issue is more serious than faulty reports...
We don't have any SNMP timeouts in our logs, ipdevpoll reports no errors, and strangely enough this issue affects only some interfaces:
* On one device we have an AE consisting of 4 interfaces, and all 4 member interfaces have graphs without gaps, while the AE interface's graph is mostly gaps. The problems are not limited to AE interfaces though, that would be too easy.
* On another device we have interfaces with 1000BASE-T (= copper), and some of them have no gaps at all while others are mostly gaps.
Did you verify that the 5minstats jobs for these devices are running at the correct intervals, as suggested by the guide?
Also, are you sure you have configured NAV (in SeedDB) to use SNMP v2c and not SNMP v1 on these devices (just want to rule out the use of 32-bit counters, which would really make things bad on high-speed interfaces)?
On Mon, 26 Feb 2018 10:03:51 +0100 Karl Gerhard karl_gerh@gmx.at wrote:
apologies for the late reply, the flu season is terrible this year.
I hope this message finds you well :)
I have verified that the 5min jobs are running at the correct intervals and that we're using SNMP v2c everywhere. I checked everything suggested in the "Debugging gaps in graphs" article, and I'm looking for advice on how to proceed further.
You seem to describe a situation where only some interfaces on some devices have gaps, and none of the issues described in the debugging guide apply.
At this point, I'd grab a Swiss army knife: I'd use tcpdump and/or Wireshark to verify a couple of things:
1. Does ipdevpoll actually consistently post the traffic counter values of these interfaces to the carbon backend? (UDP port 2003).
2. If not, does ipdevpoll consistently query and get a response for the SNMP traffic counters for these interfaces (UDP port 161)?
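A sketch of what I mean, assuming carbon's plaintext listener is on the loopback interface; the interface names and the router address below are placeholders:

```shell
# Check 1: watch counter values leaving for the carbon backend
# (requires root; Ctrl-C to stop):
#   tcpdump -i lo -A -l 'udp port 2003' | grep ifInOctets
#
# Check 2: if nothing shows up there, watch the SNMP exchange itself:
#   tcpdump -i eth0 -l 'udp port 161 and host 192.0.2.1'
#
# In carbon's plaintext protocol, each sample is a single line of the
# form "<metric-path> <value> <unix-timestamp>", e.g.:
echo 'nav.devices.example-gw.ports.ae3.ifInOctets 4267030379578 1519813102'
```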
Are you able to do this without further guidance?
Hello Morten, this is how I approached your instructions:
$ timeout 3600 tcpdump -i lo udp port 2003 -w tcpdump-1hour.pcap
$ grep --binary-files=text router01 tcpdump-1hour.pcap | grep --binary-files=text ae3.ifInOctets
This should give us about 12 entries, because the interface is being checked every 5 minutes, and we should see the counter rising with each entry. This is the output of the grep command above:
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4267030379578 151981310
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4272260055697 1519813402
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4276300392448 151981370
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4280165984433 1519814003
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4283800404729 151981430
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4287170549844 1519814603
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4290575217783 151981490
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4294304860274 1519815203
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4298158132556 151981550
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4302670400704 1519815803
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4306587460714 1519816103
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4310180259720 1519816403
12 entries, just as expected. Apparently everything is working fine, but the graph is still mostly gaps.
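As an extra sanity check, the spacing between samples can be verified mechanically; a small awk sketch, shown here on the last two lines of the capture:

```shell
# Print the gap in seconds between consecutive carbon samples.
# Input lines have the form "<metric-path> <value> <unix-timestamp>".
printf '%s\n' \
  'nav.devices.router1_domain_com.ports.ae3.ifInOctets 4306587460714 1519816103' \
  'nav.devices.router1_domain_com.ports.ae3.ifInOctets 4310180259720 1519816403' |
  awk '{ if (prev) print $3 - prev; prev = $3 }'
# prints 300, i.e. one sample every 5 minutes
```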
Best regards Karl
----------------------------------------
*From:* Morten Brekkevold [mailto:morten.brekkevold@uninett.no]
*Sent:* Tuesday, Feb 27, 2018 10:35 AM CET
*To:* Karl Gerhard
*Cc:* nav-users@uninett.no
*Subject:* Is that a healthy carbon cache?
On Mon, 26 Feb 2018 10:03:51 +0100 Karl Gerhard karl_gerh@gmx.at wrote:
apologies for the late reply, the flu season is terrible this year.
I hope this message finds you well :)
I have verified that the 5min jobs are running at the correct intervals and that we're using SNMP v2c everywhere. I checked everything suggested in the "Debugging gaps in graphs" article, and I'm looking for advice on how to proceed further.
You seem to describe a situation where only some interfaces on some devices have gaps, and none of the issues described in the debugging guide apply.
At this point, I'd grab a Swiss army knife: I'd use tcpdump and/or Wireshark to verify a couple of things:
1. Does ipdevpoll actually consistently post the traffic counter values of these interfaces to the carbon backend? (UDP port 2003).
2. If not, does ipdevpoll consistently query and get a response for the SNMP traffic counters for these interfaces (UDP port 161)?
Are you able to do this without further guidance?
On Wed, 28 Feb 2018 16:59:38 +0100 Karl Gerhard karl_gerh@gmx.at wrote:
This should give us about 12 entries, because the interface is being checked every 5 minutes, and we should see the counter rising with each entry.
[snip]
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4306587460714 1519816103
nav.devices.router1_domain_com.ports.ae3.ifInOctets 4310180259720 1519816403
12 entries, just as expected. Apparently everything is working fine, but the graph is still mostly gaps.
Perfect approach :) This means NAV is doing its job, but something is amiss with Graphite.
Were you able to confirm that nav/devices/router1_domain_com/ports/ae3/ifInOctets.wsp in the whisper data directory has a first archive with a 300 second resolution?
E.g.:
| $ whisper-info /var/lib/graphite/whisper/nav/devices/router1_domain_com/ports/ae3/ifInOctets.wsp
| maxRetention: 51840000
| xFilesFactor: 0.5
| aggregationMethod: last
| fileSize: 28864
|
| Archive 0
| retention: 180000
| secondsPerPoint: 300
| points: 600
| size: 7200
| ...
Archive 0 has 'secondsPerPoint: 300', which means it expects a data point every 5 minutes. If this is e.g. 60 instead, you will have mostly gaps, and you likely have (or had, at the time the wsp file was created) a problem with your `storage-schemas.conf`.
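For reference, a 300-second first archive for NAV's port metrics would come from a `storage-schemas.conf` stanza along these lines (the section name and the retention tail here are illustrative, not NAV's shipped defaults):

```ini
# Sections are matched top-down; the first matching pattern wins.
[nav-ports]
pattern = ^nav\.devices\..*\.ports\.
retentions = 5m:7d,30m:30d,2h:1y
```

Keep in mind that `storage-schemas.conf` is only consulted when a .wsp file is first created; existing files keep their old archive layout until they are rewritten, e.g. with whisper-resize.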