Discussion:
[Check_mk (english)] Retry interval for failed checks
Rafal Bialek
2017-12-29 09:12:04 UTC
Hello,

I'm hoping to set the following for agent based checks:

1. Normal check interval for services: 5 mins. This setting is managed by 'Normal check interval for service checks' and works.
2. When the first failure occurs (1 soft service alert), I would like to perform 3 attempts (rule 'Maximum number of check attempts for service') at an interval of 1 minute until the service alert changes from soft to hard:
Green(HARD) -(5 mins)-> RED(SOFT) -(1 min)-> RED(SOFT) -(1 min)-> RED(SOFT) -(1 min)-> RED(HARD)
e.g.
- service goes down and recovers after 2 mins of detection
RED(SOFT) -(1 min)-> RED(SOFT) -(1 min)-> RED(SOFT) -(1 min)-> Green(HARD)
- service goes down and recovers after 5 mins of detection
RED(SOFT) -(1 min)-> RED(SOFT) -(1 min)-> RED(SOFT) -(1 min)-> RED(HARD) -(1 min)-> RED(HARD) -(1 min)-> Green(HARD)

I hoped that I could achieve the interval part with the rule 'Retry check interval for service checks', but it doesn't work, and the description of the rule suggests it is valid only for 'active checks'.

Is there a way of implementing this?
I would like to avoid performing checks every minute when everything is OK, but waiting 15 mins for the state change from soft to hard is not acceptable.

Hope I make some sense here

Regards,

Rafal
Evy Bongers
2017-12-29 09:54:28 UTC
Hi Rafal,

On 2017-12-29 10:12, Rafal Bialek wrote:

> Hello,
>
> I'm hoping to set the following for agent based checks:
>
> 1. Normal check interval for services: 5 mins. This setting is managed by
> 'Normal check interval for service checks' and works.
>
> 2. When the first failure occurs (1 soft service alert), I would like to
> perform 3 attempts (rule 'Maximum number of check attempts for
> service') at an interval of 1 minute until the service alert changes from
> soft to hard:
>
> Green(HARD) -(5 mins)-> RED(SOFT) -(1 min)-> RED(SOFT) -(1 min)->
> RED(SOFT) -(1 min)-> RED(HARD)

This can be achieved by these settings:
- service check interval: 5 minutes
- retry check interval: 1 minute
- max check attempts: 3
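
To make the effect of these three settings concrete for an actively scheduled check, here is a minimal sketch in Python. It is not Check_MK or Nagios code. The function name `simulate_active_check` and its parameters are hypothetical, and it assumes Nagios-style counting, where the first non-OK result already counts as attempt 1 and the state turns hard once the attempt counter reaches the maximum. As a simplification, every OK result is labeled HARD.

```python
def simulate_active_check(results, check_interval=5, retry_interval=1, max_attempts=3):
    """Return (minute, state, state_type) tuples for a sequence of check results.

    Illustrative model only: SOFT states are rechecked after retry_interval,
    HARD states (OK or non-OK) after check_interval.
    """
    timeline = []
    minute = 0
    attempt = 0  # consecutive non-OK results seen so far
    for state in results:
        if state == "OK":
            attempt = 0
            state_type = "HARD"  # simplification: OK is always treated as hard
        else:
            attempt = min(attempt + 1, max_attempts)
            state_type = "HARD" if attempt == max_attempts else "SOFT"
        timeline.append((minute, state, state_type))
        # Only soft problem states use the shorter retry interval.
        minute += retry_interval if state_type == "SOFT" else check_interval
    return timeline


timeline = simulate_active_check(["OK", "CRIT", "CRIT", "CRIT", "CRIT", "OK"])
for minute, state, state_type in timeline:
    print(f"t={minute:>2} min  {state:<4} ({state_type})")
```

Under these assumptions, the third consecutive failure is already the hard one, and once the state is hard the model falls back to the 5-minute interval.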

Also, see my note below on active vs passive checks.

> e.g.
>
> - service goes down and recovers after 2 mins of detection
>
> RED(SOFT) -(1 min)-> RED(SOFT) -(1 min)-> RED(SOFT) -(1 min)->
> Green(HARD)
>
> - service goes down and recovers after 5 mins of detection
>
> RED(SOFT) -(1 min)-> RED(SOFT) -(1 min)-> RED(SOFT) -(1 min)->
> RED(HARD) -(1 min)-> RED(HARD) -(1 min)-> Green(HARD)

This seems to contradict what you state above. I know of no way to force
a service in a non-OK hard state to be rechecked every minute until
it recovers.

> I hoped that I could achieve the interval part with the rule 'Retry check
> interval for service checks', but it doesn't work, and the description of
> the rule suggests it is valid only for 'active checks'.

It's true that the check interval and retry check interval are only valid
for scheduling active checks, since passive checks aren't scheduled.
Keep in mind that agent-based checks receive updates only when the Check_MK
check is executed.
If any agent-based check goes into a non-OK state, it will still only
receive updates every 5 minutes as long as the Check_MK check is still OK.
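
As a rough illustration of what this means for the soft-to-hard delay of a passive, agent-based service, here is a small Python sketch. The helper `minutes_until_hard` is hypothetical, not part of Check_MK. It assumes that the Check_MK service itself stays OK, so results only arrive at its normal interval, and that the first non-OK result counts as attempt 1. If attempts are counted differently, the figures shift by one interval.

```python
def minutes_until_hard(agent_interval, max_attempts):
    """Minutes from the first non-OK result to the hard state of a passive service.

    Assumes one passive result per run of the active Check_MK service and that
    the first non-OK result is attempt 1 (illustrative model only).
    """
    return (max_attempts - 1) * agent_interval


for interval in (5, 1):
    delay = minutes_until_hard(interval, max_attempts=3)
    print(f"Check_MK every {interval} min: hard state {delay} min after first failure")
```

With a 5-minute Check_MK interval the retry interval of the passive service never comes into play, so the delay scales directly with that interval.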

> Is there a way of implementing this?
>
> I would like to avoid performing checks every minute when everything is
> OK, but waiting 15 mins for the state change from soft to hard is not
> acceptable.
>
> Hope I make some sense here

Why would you avoid performing checks every minute? The agent-based
checks are very lightweight, and since they're passive checks, they don't
stress the check scheduler.

My advice would be to set both intervals to 1 minute and (if required)
delay sending notifications through the notification settings (for
example 'Delay first service notification' if you're using RBN).

> Regards,
>
> Rafal
>
> _______________________________________________
> checkmk-en mailing list
> checkmk-***@lists.mathias-kettner.de
> http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en
Rafal Bialek
2017-12-29 11:57:21 UTC
Thank you, Evy.

I used the ‘retry check interval’ parameter, but that doesn’t seem to apply to agent-based checks.

I’m running a multisite environment with 6 sites. Some sites are small, but I have two sites with over 500 and over 300 hosts respectively. In total I monitor over 1200 hosts and over 50k services. If I change the normal check interval to 1 min, the CPU load on the slave servers monitoring the biggest sites becomes so extreme that they become unresponsive. Also, in the ‘Server Performance’ snapin I can see the ‘Service Checks’ rate going up to 5-6 times the rate at the 5 min interval.
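
The increase in the service check rate is roughly what simple arithmetic predicts. The Python sketch below is a back-of-the-envelope estimate, not a measurement. The function `checks_per_second` is hypothetical, and it assumes every service produces one result per interval and that results are spread evenly over time. The 50k figure is the total across all six sites, so per-site rates are lower.

```python
def checks_per_second(services, interval_minutes):
    """Average service results per second for a given check interval."""
    return services / (interval_minutes * 60)


rate_5 = checks_per_second(50_000, 5)
rate_1 = checks_per_second(50_000, 1)
print(f"{rate_5:.0f}/s at 5 min, {rate_1:.0f}/s at 1 min, factor {rate_1 / rate_5:.1f}")
```

Shortening the interval from 5 minutes to 1 minute multiplies the result rate by five in this model, which is in line with the observed 5-6 times.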

Also, taking this opportunity, I would like to check what exactly the following represent in the ‘Server Performance’ snapin:

New log messages vs. cached log messages

I’m in the process of migrating from CRE 1.2.8p10 to CRE 1.4.0p12. The older version has a very high cached value with virtually no new messages. The new version has a low cached value which never grows high, while the new message value goes up and down.

Regards,

Rafal Bialek



Andreas Döhler
2017-12-30 10:31:38 UTC
Hi Rafal,

The settings you want make no sense for agent-based checks.
Here is why :)

With agent-based checks, only the “check_mk” service is run as an active
check, and only for this check can you define the various check intervals.
For all the passive checks, the retry interval cannot be smaller than the
normal check interval, because the check_mk service is checked at the normal
interval as long as this service has no problem.
Normally I would recommend a small interval like 1 min for the check_mk
service; then you can define the 5 min interval for the other services, and
your recheck interval should also work.

Best regards


Andreas
