Service port is down on 158.XX.XX.XXX - Nav-users

20 Jan 2006

>>> alertengine@XXX.XXXX.no 20.01.2006 10:35:22 >>>
Service port on 158.XX.XX.XXX is down since 2006-01-20 10:34:34.
Server up: Yes
Status: Main building
This was a server crash, so the alert should not say "Server up: Yes".
Some log output from servicemon.log:
[2006-01-20 10:34:06] abstractChecker.py:run:117 [Info]
158.XX.XX.XXX:port    -> timed out
[2006-01-20 10:34:06] abstractChecker.py:run:131 [Notice]
158.XX.XX.XXX:port    -> State changed. New check in 5 sec. (DOWN, timed out)
This is normal if we stop the service that is being checked on the server.
[2006-01-20 10:34:34] abstractChecker.py:run:117 [Info]
158.XX.XX.XXX:port    -> (113, 'No route to host')
[2006-01-20 10:34:34] abstractChecker.py:run:140 [Alert ]
158.XX.XX.XXX:port    -> DOWN, (113, 'No route to host')
But these lines are specific to a server that has crashed, and should not
give "Server up: Yes".
Has this been solved in the latest nav-3.0.0-1.noarch.rpm?
Peder
From magnus at ntnu.no  Fri Jan 20 11:20:49 2006
From: magnus at ntnu.no (Magnus Nordseth)
Date: Fri Jan 20 11:21:07 2006
Subject: [Nav-users] Service port is down on 158.XX.XX.XXX
In-Reply-To: s3d0c2ce.064@HVO-3.hivolda.no
References: s3d0c2ce.064@HVO-3.hivolda.no
Message-ID: 20060120102048.GA15629@stud.ntnu.no
Peder Magne Sefland:
> This was a crash and should not be "Server up: Yes"
>
> Some logg from servicemon.log
> [2006-01-20 10:34:06] abstractChecker.py:run:117 [Info]
> 158.XX.XX.XXX:port    -> timed out
> [2006-01-20 10:34:06] abstractChecker.py:run:131 [Notice]
> 158.XX.XX.XXX:port    -> State changed. New check in 5 sec. (DOWN, timed out)
>
> This is normal if we stop the service  that we check on the server
>
> [2006-01-20 10:34:34] abstractChecker.py:run:117 [Info]
> 158.XX.XX.XXX:port    -> (113, 'No route to host')
> [2006-01-20 10:34:34] abstractChecker.py:run:140 [Alert ]
> 158.XX.XX.XXX:port    -> DOWN, (113, 'No route to host')
>
> But these line are special for the server that crash and should not
> give "Server up: Yes"
>
> Has this been solved in the latest nav-3.0.0-1.noarch.rpm?

Service state monitoring is handled by one subsystem, while pping
handles netbox state monitoring. pping usually operates at a higher
frequency than servicemon, and should detect the server crash before
servicemon.
Both pping and statemon post events to eventengine. Eventengine
should wait a few seconds before processing an event, to see whether
correlated events come in.
To debug this, I suggest that you check the pping and eventengine
logs to see whether a netbox down event was sent.
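
The correlation idea described above can be sketched roughly as follows. This is only an illustration and not NAV's actual eventengine code: the Event class, the event kind names "boxDown" and "serviceDown", the function server_up_flag and the 60-second window are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    kind: str       # hypothetical names: "boxDown" or "serviceDown"
    host: str
    time: datetime


def server_up_flag(service_event, events, window=timedelta(seconds=60)):
    """Return False if a box-down event for the same host lies within
    `window` of the service-down event, otherwise True.

    The window length is an assumption for this sketch.
    """
    for ev in events:
        if (ev.kind == "boxDown"
                and ev.host == service_event.host
                and abs(ev.time - service_event.time) <= window):
            return False
    return True


# Timestamps taken from the logs quoted in this thread.
service_down = Event("serviceDown", "158.XX.XX.XXX",
                     datetime(2006, 1, 20, 10, 34, 34))
box_down = Event("boxDown", "158.XX.XX.XXX",
                 datetime(2006, 1, 20, 10, 34, 47))

print(server_up_flag(service_down, [box_down]))  # box-down seen in window
print(server_up_flag(service_down, []))          # no correlated event
```

The point of the sketch is that the "Server up" value depends on whether the box-down event has already been seen when the service event is processed, which is why the pping and eventengine logs are worth checking.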
-- 
Magnus Nordseth
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc
From peder.sefland at hivolda.no  Fri Jan 20 12:57:44 2006
From: peder.sefland at hivolda.no (Peder Magne Sefland)
Date: Fri Jan 20 12:58:34 2006
Subject: SV: Re: [Nav-users] Service port is down on 158.XX.XX.XXX
Message-ID: s3d0de62.023@HVO-3.hivolda.no

>>> Magnus Nordseth magnus@ntnu.no 20.01.2006 11:20:49 >>>
Peder Magne Sefland:
> This was a crash and should not be "Server up: Yes"
> 
> Some logg from servicemon.log 
> [2006-01-20 10:34:06] abstractChecker.py:run:117 [Info]
> 158.XX.XX.XXX:port    -> timed out
> [2006-01-20 10:34:06] abstractChecker.py:run:131 [Notice]
> 158.XX.XX.XXX:port    -> State changed. New check in 5 sec. (DOWN, timed out)
> 
> This is normal if we stop the service  that we check on the server
> 
> 
> [2006-01-20 10:34:34] abstractChecker.py:run:117 [Info]
> 158.XX.XX.XXX:port    -> (113, 'No route to host')
> [2006-01-20 10:34:34] abstractChecker.py:run:140 [Alert ]
> 158.XX.XX.XXX:port    -> DOWN, (113, 'No route to host')
> 
> But these line are special for the server that crash and should not
> give "Server up: Yes"
> 
> Has this been solved in the latest nav-3.0.0-1.noarch.rpm?

Service state monitoring is handled by one subsystem, while pping
handles netbox state monitoring. pping usually operates at a higher
frequency than servicemon, and should detect the server crash before
servicemon.

Both pping and statemon posts events to eventengine. Eventengine
should wait for some seconds, before processing the event, to see if
correlated events comes in.

To debug this, I suggest that you check the log of pping and
eventengine to see if a netbox down event is sent.

-- 
Magnus Nordseth
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc

Thank you for your response.

pping.log says
[2006-01-20 10:34:46] pping.py:generateEvents:142 [Notice]
158.XX.XX.XXX (158.XX.XX.XXX) marked as down.
[2006-01-20 10:36:46] pping.py:generateEvents:159 [Notice]
158.XX.XX.XXX (158.XX.XX.XXX) marked as up.

eventEngine.log says
Jan 20 10:34:47 2006 eventEngine BOX_STATE_EVENTHANDLER-6-HANDLE Box
going down: Box [ip=158.XX.XX.XXX, sysname=158.XX.XX.XXX, status=up]
Jan 20 10:35:47 2006 eventEngine BOX_STATE_EVENTHANDLER-6-CALLBACK Box
down: 158.XX.XX.XXX

If we look at the time in the alert mail, it was sent before pping
detected that the server was down:

>>> alertengine@XXX.XXXX.no 20.01.2006 10:35:22 >>>
Service port on 158.XX.XX.XXX is down since 2006-01-20 10:34:34.
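
To make the ordering easier to see, the timestamps quoted in this thread can be laid on one timeline. The snippet below is just a sketch that computes offsets from the first event; the labels are my own wording, and it assumes that the clocks behind the log files and the mail timestamp agree.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (label, timestamp) pairs copied from servicemon.log, pping.log,
# eventEngine.log and the alert mail quoted in this thread.
events = [
    ("servicemon: service marked DOWN", "2006-01-20 10:34:34"),
    ("pping: host marked as down", "2006-01-20 10:34:46"),
    ("eventEngine: HANDLE, box going down", "2006-01-20 10:34:47"),
    ("alertengine: service alert mail", "2006-01-20 10:35:22"),
    ("eventEngine: CALLBACK, box down", "2006-01-20 10:35:47"),
]


def offsets(events):
    """Return (seconds since first event, label) pairs in time order."""
    parsed = sorted((datetime.strptime(ts, FMT), label) for label, ts in events)
    start = parsed[0][0]
    return [(int((t - start).total_seconds()), label) for t, label in parsed]


for secs, label in offsets(events):
    print(f"+{secs:3d}s  {label}")
```

On these numbers the alert mail is timestamped 48 seconds after servicemon marked the service down, and 25 seconds before the eventEngine CALLBACK line that reports the box as down.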

Peder 
From kaa at msu.net  Mon Jan 23 16:00:56 2006
From: kaa at msu.net (Alexander Krapivin)
Date: Mon Jan 23 14:00:59 2006
Subject: [Nav-users] Errors while building NAV from sources
Message-ID: 43D4D388.2060801@msu.net

Hello!
I'm trying to build NAV from the latest trunk sources. Everything is OK
until 'make install'.
Please advise me where I went wrong.
While running 'make install' I got the following errors:

--cut here--
compile:
     [javac] Compiling 1 source file to /usr/local/src/nav/trunk/src/Database/build
     [javac] /usr/local/src/nav/trunk/src/Database/src/no/ntnu/nav/Database/Database.java:40: package no.ntnu.nav.logger does not exist
     [javac] import no.ntnu.nav.logger.Log;
     [javac]                           ^
     [javac] /usr/local/src/nav/trunk/src/Database/src/no/ntnu/nav/Database/Database.java:774: cannot find symbol
     [javac] symbol  : variable Log
     [javac] location: class no.ntnu.nav.Database.Database
     [javac]                                     Log.e("DATABASE-QUERY", "Got Exception; database is probably down: " + msg);
     [javac]                                         ^
     [javac] /usr/local/src/nav/trunk/src/Database/src/no/ntnu/nav/Database/Database.java:1024: cannot find symbol
     [javac] symbol  : variable Log
     [javac] location: class no.ntnu.nav.Database.Database
     [javac]                                     Log.e("DATABASE-UPDATE", "Got Exception; database is probably down: " + msg);
     [javac]                                         ^
     [javac] 3 errors

BUILD FAILED
/usr/local/src/nav/trunk/src/Database/build.xml:30: Compile failed; see the compiler error output for details.

Total time: 1 second
make[1]: *** [install] Error 1
make[1]: Leaving directory `/usr/local/src/nav/trunk/src'
make: *** [install-all] Error 1
--cut here--

-- 
With best regards,

Alexander Krapivin            mailto:kaa@msu.net
NOC MSUNET                    ICQ UIN:3967345
MSU, Moscow, Russia
From magnus at ntnu.no  Mon Jan 23 14:59:34 2006
From: magnus at ntnu.no (Magnus Nordseth)
Date: Mon Jan 23 14:59:37 2006
Subject: [Nav-users] Service port is down on 158.XX.XX.XXX
In-Reply-To: s3d0de62.024@HVO-3.hivolda.no
References: s3d0de62.024@HVO-3.hivolda.no
Message-ID: 20060123135934.GA9534@stud.ntnu.no

Peder Magne Sefland:
> pping.log says
> [2006-01-20 10:34:46] pping.py:generateEvents:142 [Notice]
> 158.XX.XX.XXX (158.XX.XX.XXX) marked as down.
> [2006-01-20 10:36:46] pping.py:generateEvents:159 [Notice]
> 158.XX.XX.XXX (158.XX.XX.XXX) marked as up.
> 
> eventEngine.log says
> Jan 20 10:34:47 2006 eventEngine BOX_STATE_EVENTHANDLER-6-HANDLE Box
> going down: Box [ip=158.XX.XX.XXX, sysname=158.XX.XX.XXX, status=up]
> Jan 20 10:35:47 2006 eventEngine BOX_STATE_EVENTHANDLER-6-CALLBACK Box
> down: 158.XX.XX.XXX
> 
> If we look at the time in the alert-mail, it is send before pping
> dectect that the server is down
> 
> >>> alertengine@XXX.XXXX.no 20.01.2006 10:35:22 >>>
> Service port on 158.XX.XX.XXX is down since 2006-01-20 10:34:34.

Peder,

What is your pping and servicemon checkinterval? Check pping.conf and
servicemon.conf. 

-- 
Magnus Nordseth
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc
From peder.sefland at hivolda.no  Mon Jan 23 15:30:21 2006
From: peder.sefland at hivolda.no (Peder Magne Sefland)
Date: Mon Jan 23 15:30:48 2006
Subject: SV: Re: [Nav-users] Service port is down on 158.XX.XX.XXX
Message-ID: s3d4f69e.082@HVO-3.hivolda.no

>>> Magnus Nordseth magnus@ntnu.no 23.01.2006 14:59:34 >>>
Peder Magne Sefland:
> pping.log says
> [2006-01-20 10:34:46] pping.py:generateEvents:142 [Notice]
> 158.XX.XX.XXX (158.XX.XX.XXX) marked as down.
> [2006-01-20 10:36:46] pping.py:generateEvents:159 [Notice]
> 158.XX.XX.XXX (158.XX.XX.XXX) marked as up.
> 
> eventEngine.log says
> Jan 20 10:34:47 2006 eventEngine BOX_STATE_EVENTHANDLER-6-HANDLE Box
> going down: Box [ip=158.XX.XX.XXX, sysname=158.XX.XX.XXX, status=up]
> Jan 20 10:35:47 2006 eventEngine BOX_STATE_EVENTHANDLER-6-CALLBACK Box
> down: 158.XX.XX.XXX
> 
> If we look at the time in the alert-mail, it is send before pping
> dectect that the server is down
> 
> >>> alertengine@XXX.XXXX.no 20.01.2006 10:35:22 >>>
> Service port on 158.XX.XX.XXX is down since 2006-01-20 10:34:34.

> Peder,
>
> What is your pping and servicemon checkinterval? Check pping.conf and
> servicemon.conf.

pping.conf
# How often do you want to ping
checkinterval = 20

servicemon.conf
# How often do we want to check each service
checkinterval = 60

Peder