Discussion:
[Check_mk (english)] Help diagnose notification issues
Tim AtLee
2018-06-08 14:06:21 UTC
Good morning

We had a scheduled outage at one of our sites last night, and I noticed that I only received notifications when the check for the site came back up, not when it went down.

This morning I tested with some manual checks and received notifications for both host down and up.

Furthermore, the notification logs show that my colleague received both the CRITICAL and OK notifications.

We are both members of the same contact group. WATO isn't showing any difference in notification rules.

Last night's logs: my colleague gets "CRITICAL" notifications followed by "OK" notifications, while I (tima) only receive the "OK" notifications.
[1528415840] SERVICE NOTIFICATION: colleague;router;PING;CRITICAL;mail;CRITICAL - 128.144.97.254: rta nan, lost 100%
[1528415898] SERVICE NOTIFICATION: colleague;router;PING;CRITICAL;smsnotify;CRITICAL - 128.144.97.254: rta nan, lost 100%
[1528416235] SERVICE NOTIFICATION: colleague;router;PING;OK;mail;OK - 128.144.97.254: rta 0.983ms, lost 0%
[1528416235] SERVICE NOTIFICATION: colleague;router;PING;OK;smsnotify;OK - 128.144.97.254: rta 0.983ms, lost 0%
[1528416244] SERVICE NOTIFICATION: tima;router;PING;OK;mail;OK - 128.144.97.254: rta 0.983ms, lost 0%
[1528416244] SERVICE NOTIFICATION: tima;router;PING;OK;smsnotify;OK - 128.144.97.254: rta 0.983ms, lost 0%
[1528416467] SERVICE NOTIFICATION: colleague;router;PING;CRITICAL;mail;CRITICAL - 128.144.97.254: rta nan, lost 100%
[1528416494] SERVICE NOTIFICATION: colleague;router;PING;CRITICAL;smsnotify;CRITICAL - 128.144.97.254: rta nan, lost 100%
[1528416705] SERVICE NOTIFICATION: colleague;router;PING;OK;mail;OK - 128.144.97.254: rta 0.972ms, lost 0%
[1528416715] SERVICE NOTIFICATION: colleague;router;PING;OK;smsnotify;OK - 128.144.97.254: rta 0.972ms, lost 0%
[1528416724] SERVICE NOTIFICATION: tima;router;PING;OK;mail;OK - 128.144.97.254: rta 0.972ms, lost 0%
[1528416724] SERVICE NOTIFICATION: tima;router;PING;OK;smsnotify;OK - 128.144.97.254: rta 0.972ms, lost 0%
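To compare contacts without eyeballing the log, a small script can group the SERVICE NOTIFICATION lines by contact and list the states each contact never received. This is a minimal sketch that assumes every line has exactly the format shown above ([timestamp] SERVICE NOTIFICATION: contact;host;service;state;method;output); the function names are made up for this example and are not part of Check_MK.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the log excerpt:
# [1528415840] SERVICE NOTIFICATION: contact;host;service;state;method;output
LINE_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE NOTIFICATION: "
    r"(?P<contact>[^;]+);(?P<host>[^;]+);(?P<service>[^;]+);"
    r"(?P<state>[^;]+);(?P<method>[^;]+);(?P<output>.*)$"
)


def states_per_contact(lines):
    """Return {contact: set of notified states}; non-matching lines are skipped."""
    seen = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            seen[m.group("contact")].add(m.group("state"))
    return dict(seen)


def missing_states(lines):
    """For each contact, list states some contact received but this one did not."""
    per_contact = states_per_contact(lines)
    all_states = set().union(*per_contact.values()) if per_contact else set()
    return {contact: sorted(all_states - states)
            for contact, states in per_contact.items()}


if __name__ == "__main__":
    log = """\
[1528415840] SERVICE NOTIFICATION: colleague;router;PING;CRITICAL;mail;CRITICAL - 128.144.97.254: rta nan, lost 100%
[1528416235] SERVICE NOTIFICATION: colleague;router;PING;OK;mail;OK - 128.144.97.254: rta 0.983ms, lost 0%
[1528416244] SERVICE NOTIFICATION: tima;router;PING;OK;mail;OK - 128.144.97.254: rta 0.983ms, lost 0%
"""
    print(missing_states(log.splitlines()))
    # -> {'colleague': [], 'tima': ['CRITICAL']}
```

Run against the full log from last night, tima shows CRITICAL as missing while colleague misses nothing, which points at something specific to that user rather than to the host or the shared contact group.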


This morning's manual test logs: we both received the manual CRITICAL and OK messages.
[1528466074] SERVICE NOTIFICATION: colleague;router;PING;CRITICAL;mail;Manually set to Critical by tima
[1528466074] SERVICE NOTIFICATION: colleague;router;PING;CRITICAL;smsnotify;Manually set to Critical by tima
[1528466083] SERVICE NOTIFICATION: tima;router;PING;CRITICAL;mail;Manually set to Critical by tima
[1528466083] SERVICE NOTIFICATION: tima;router;PING;CRITICAL;smsnotify;Manually set to Critical by tima
[1528466093] SERVICE NOTIFICATION: colleague;router;PING;OK;mail;OK - 128.144.97.254: rta 0.954ms, lost 0%
[1528466093] SERVICE NOTIFICATION: colleague;router;PING;OK;smsnotify;OK - 128.144.97.254: rta 0.954ms, lost 0%
[1528466102] SERVICE NOTIFICATION: tima;router;PING;OK;mail;OK - 128.144.97.254: rta 0.954ms, lost 0%
[1528466102] SERVICE NOTIFICATION: tima;router;PING;OK;smsnotify;OK - 128.144.97.254: rta 0.954ms, lost 0%


The output of cmk -N router shows nearly identical contacts; the only differences are our email addresses and SMS numbers.

So I have a couple of questions:

Where do I start troubleshooting this? I would have expected that setting a service to CRIT manually behaves the same as the service going CRIT on its own.

How can I actually emulate the site outage without unplugging the device? I see that I can supply plugin output and performance data in Fake Check Results, but I'm not clear on the syntax.
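On the syntax question: in Nagios-compatible cores a passive result is submitted as the external command PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;plugin_output, where the return code is 0 (OK), 1 (WARN), 2 (CRIT) or 3 (UNKNOWN), and the plugin output is the status text, optionally followed by a pipe and the performance data. The sketch below builds such a command as a Livestatus COMMAND line and can send it over the Livestatus UNIX socket. It assumes the core accepts this command through Livestatus; the socket path /omd/sites/mysite/tmp/run/live (site name "mysite") and the perfdata thresholds are hypothetical placeholders, and whether the GUI's Fake Check Results uses this same command internally is not confirmed here.

```python
import socket
import time

# Nagios-style service return codes.
RETURN_CODES = {"OK": 0, "WARN": 1, "CRIT": 2, "UNKNOWN": 3}


def fake_service_result(host, service, state, output, perfdata="", now=None):
    """Build a Livestatus COMMAND line that submits a passive service check result."""
    # The command is newline-terminated, so embedded newlines would cut it short.
    if "\n" in output or "\n" in perfdata:
        raise ValueError("output and perfdata must be single-line")
    # Plugin output syntax: "<text>|<perfdata>", the perfdata part being optional.
    plugin_output = f"{output}|{perfdata}" if perfdata else output
    ts = int(time.time() if now is None else now)
    return (f"COMMAND [{ts}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{RETURN_CODES[state]};{plugin_output}\n")


def send_livestatus(command, socket_path):
    """Write one command to a Livestatus UNIX socket; commands get no reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode("utf-8"))


if __name__ == "__main__":
    cmd = fake_service_result(
        "router", "PING", "CRIT",
        "CRITICAL - 128.144.97.254: rta nan, lost 100%",
        perfdata="pl=100%;80;100;0;100",  # hypothetical thresholds
    )
    print(cmd, end="")
    # Hypothetical site "mysite"; run as the site user to actually submit:
    # send_livestatus(cmd, "/omd/sites/mysite/tmp/run/live")
```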

Thanks in advance.

Tim
Andreas Döhler
2018-06-08 17:24:37 UTC
Hi Tim,

From what you posted, it looks like you are not using rule-based
notifications but an "old school" Nagios notification setup. Is this the
old "flexible notification" of Check_MK?
If so, please take a look at the flexible notification definition of your
user. Flexible notifications can be configured separately for every user.
It looks like a time-period-dependent setting: up and down notifications in
one period, only up notifications in another.
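This hypothesis can be checked against the timestamps in the log: the numbers in brackets are Unix epoch seconds, so 1528415840 is 2018-06-07 23:57:20 UTC, while the manual tests around 1528466074 ran at about 13:54 UTC. The sketch below models a hypothetical per-user filter that allows only OK notifications inside a daily time window and both CRITICAL and OK outside it. The 22:00-06:00 UTC window is an invented example, not taken from Tim's configuration, and the functions are illustrative, not Check_MK code; for a real comparison, pass the site's time zone (for example a zoneinfo.ZoneInfo object) and the user's actual time period.

```python
from datetime import datetime, time, timezone


def in_daily_window(ts, start, end, tz=timezone.utc):
    """True if Unix timestamp ts falls in [start, end) on the clock of tz.

    A window whose start is later than its end (e.g. 22:00-06:00) wraps past midnight.
    """
    t = datetime.fromtimestamp(ts, tz).time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def would_notify(state, ts, window, inside_states, outside_states, tz=timezone.utc):
    """Hypothetical per-user filter: one set of states inside the window, another outside."""
    allowed = inside_states if in_daily_window(ts, *window, tz=tz) else outside_states
    return state in allowed


if __name__ == "__main__":
    night = (time(22, 0), time(6, 0))  # hypothetical time period, in UTC
    for state, ts in [("CRITICAL", 1528415840), ("OK", 1528416244),
                      ("CRITICAL", 1528466083)]:
        print(state, ts, would_notify(state, ts, night, {"OK"}, {"CRITICAL", "OK"}))
```

With these made-up settings the night-time CRITICAL is suppressed while the recovery and the afternoon manual test get through, which matches the pattern tima saw; if the user's real time period looks different, the hypothesis can be ruled out.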

With rule-based notifications you can simulate and test your rules. This
is a very nice feature, and migrating from flexible notifications to
rule-based ones is not too hard :)

br
Andreas
_______________________________________________
checkmk-en mailing list
Manage your subscription or unsubscribe
http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en
Paul
2018-06-08 17:26:34 UTC
Tim, in addition to Andreas' comments,

Create a clone of the router, give it a name like test-router, and supply a
bogus IP address that will simply not respond. This way the monitoring
system sees a genuine failure, just as it would for the 'router' device.
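For the test host, one option for an address that "will simply not work" is one of the IPv4 blocks that RFC 5737 reserves for documentation (192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24); they should not be assigned to real hosts, although a local network could in principle still use them. This small helper, with hypothetical function names, picks such an address and checks whether a given address falls in those blocks.

```python
import ipaddress

# IPv4 blocks reserved for documentation by RFC 5737 (TEST-NET-1/2/3).
TEST_NETS = [ipaddress.ip_network(n)
             for n in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")]


def is_documentation_address(addr):
    """True if addr lies in one of the RFC 5737 documentation blocks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in TEST_NETS)


def pick_bogus_ip(index=1):
    """Return host number `index` from TEST-NET-1, e.g. for a 'test-router' clone."""
    if not 1 <= index <= 254:
        raise ValueError("index must be between 1 and 254")
    return str(TEST_NETS[0][index])


if __name__ == "__main__":
    print(pick_bogus_ip())  # -> 192.0.2.1
    print(is_documentation_address("128.144.97.254"))  # -> False
```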

Do you get CRITICAL notifications for other devices and services?
Does the router have different notification settings than other
hosts/devices you monitor?