[Check_mk (english)] perfometers not displaying troubleshooting

Discussion:

Jason Humes

2012-06-14 13:51:10 UTC

Hi
Is there some way to troubleshoot why a perfometer is not being drawn, even in its most basic form:

def perfometer_debug_template(row, check_command, perf_data):
    return ' Hello World ', '<table><tr>' \
        + perfometer_td(20, '#fff') \
        + perfometer_td(80, '#ff0000') \
        + '</tr></table>'

perfometers['check_mk-cisco_v2qos'] = perfometer_debug_template

My check is called cisco_v2qos. I've restarted check_mk and OMD, yet the perfometer debug text is not showing up. I've tried putting this into both the OMD perfometer dir and my local/share/check_mk/web/plugins/perfometer/ folder, and still nothing.

Any ideas why or where I could start looking?

Thanks

J
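
A possible first step, sketched here and not taken from the thread: run the perfometer function outside the GUI to rule out a problem in the function itself. The `perfometers` dict and `perfometer_td` below are stand-ins (assumptions) for what the web GUI normally provides, and `render` is a hypothetical helper that mimics a lookup by check command.

```python
# Standalone harness. In the real GUI, `perfometers` and `perfometer_td`
# are provided for you; the versions here are stand-ins (assumptions).
perfometers = {}


def perfometer_td(percent, color):
    # Stub: one table cell whose width is `percent` percent.
    return '<td style="width: %d%%; background-color: %s"></td>' % (percent, color)


def perfometer_debug_template(row, check_command, perf_data):
    return ' Hello World ', '<table><tr>' \
        + perfometer_td(20, '#fff') \
        + perfometer_td(80, '#ff0000') \
        + '</tr></table>'


perfometers['check_mk-cisco_v2qos'] = perfometer_debug_template


def render(check_command, row=None, perf_data=None):
    """Hypothetical dispatcher: look up a perfometer by check command.

    Returns None when no perfometer is registered under that key.
    """
    func = perfometers.get(check_command)
    if func is None:
        return None
    return func(row or {}, check_command, perf_data or [])


if __name__ == "__main__":
    print(render('check_mk-cisco_v2qos'))
    # A key without the 'check_mk-' prefix does not match the registration.
    print(render('cisco_v2qos'))
```

If the function returns a (text, HTML) pair here, the remaining suspects are probably outside it: whether the plugin file is loaded at all, whether the service's check command really is `check_mk-cisco_v2qos`, and (a guess) whether the service delivers any performance data.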

Andy D'Arcy Jewell

2012-06-14 15:38:12 UTC

Hi list!

Sorry for this long post - I've tried to keep it as brief as possible.

I'm prototyping a workstation monitoring solution, based around Nagios
Core and Check_MK. I have a test rig running on some old test kit in our
lab, monitoring 1000 "fake" Windows XP workstations. I have some
observations and a few queries too.

GOAL

Monitor a large number (4000+) windows workstations, both real and
virtual, to provide stats graphs on resources and also event log
monitoring.

CONSTRAINTS

* Workstations are not powered on and active 24x7 - alerts must not be
generated when a workstation is powered off
* Alerting must be minimal - only certain "special" conditions should
generate alerts
* Workstation IPs change, but they are registered through the AD domain
in DDNS

TEST RIG SETUP
Nagios server: HP BL460 with Dual core xeon cpu @3GHz, 2GB RAM, 1x
143GB SAS disk, gigabit networking, running Nagios 3.2.1 with Check MK
1.1.12p7 and PNP4Nagios 0.6.12-1~bpo60+1
"Fake" Workstations: 200 different host names pointing to each of 5 VM's
"Real" Workstations: 5x KVM VM's with 512MB ram, 8GB HDD, 1 vCPU,
running on ProxmoxVE on a similar blade to the Nagios box

TEST 1 - MONITORING 1000 WORKSTATIONS THE "NORMAL WAY"

In this setup, Nagios is attempting to perform an active check per host,
per minute, forking the Nagios process in the normal manner, with
PNP4Nagios using the broker interface for speed. However, checks rapidly
begin to fall behind. Utilisation about 6-7 on 2 cores, mostly waiting
on SYS and IO. This is the forking conundrum MK has written about.
Reducing the check frequency to every 5 minutes still does not allow all
checks to complete on time. Iostat showing disk utilisation at 100%
constantly. Had to write a python script to front for check_mk_agent and
cache the results, as it can't do more than about 2-3 requests per
second on this platform (probably because it's a VM).

TEST 2 - AS TEST 1, USING MK LIVECHECK

In this setup, Nagios is using MK LiveCheck to speed up the check
process. SYS utilisation went down to about 1/3 of the original value,
but checks were still falling behind, even with a check interval of 5
minutes. System load about 3. Iostat shows disk utilisation peaking at
100% regularly. Switched off most Nagios logging and moved status.dat to
/dev/shm and disk utilisation fell to peaking at ~40% once every 20-30
seconds.

TEST 3 - PUSH CHECK MK AGENT

Jury rigged the windows check_mk_agent.exe by writing a simple .cmd
using windows ports of nc and curl to talk to the cmk agent on
localhost, and set up a scheduled task to run the script every 5
minutes. On the server side, set up a simple cgi to receive cmk agent
output and dump to a file in a queue directory. Modified main.mk to use
a "datasource program" to collect this data. Wrote a small program to
cat and delete these files. System load about 0.3.
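
The server side of this push setup could be sketched roughly as follows. This is not the author's actual code: the queue path, the `<hostname>.<timestamp>` file naming, and both function names are assumptions made for illustration. `store_agent_output` stands in for what the receiving CGI does, and `drain_host_queue` for the small "cat and delete" program called as the datasource program.

```python
import os
import sys
import tempfile
import time

# Hypothetical queue location; the original post does not name one.
QUEUE_DIR = "/var/spool/cmk_push"


def store_agent_output(queue_dir, hostname, payload):
    """What the receiving CGI might do: write one file per submission.

    The data is written under a temporary name and then renamed, so a
    reader never picks up a half-written file.
    """
    safe_host = os.path.basename(hostname)  # drop any path components
    fd, tmp_path = tempfile.mkstemp(dir=queue_dir, prefix=".tmp-")
    with os.fdopen(fd, "wb") as handle:
        handle.write(payload)
    final_name = "%s.%d" % (safe_host, time.time_ns())
    final_path = os.path.join(queue_dir, final_name)
    os.rename(tmp_path, final_path)
    return final_path


def drain_host_queue(queue_dir, hostname):
    """What the datasource program might do: return and delete queued output."""
    if not os.path.isdir(queue_dir):
        return b""
    # Match on everything before the last dot, so that host "ws1" does not
    # pick up files queued for "ws1.example.com".
    names = sorted(
        name for name in os.listdir(queue_dir)
        if name.rsplit(".", 1)[0] == hostname
    )
    chunks = []
    for name in names:
        path = os.path.join(queue_dir, name)
        with open(path, "rb") as handle:
            chunks.append(handle.read())
        os.remove(path)
    return b"".join(chunks)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: <script> <hostname>; prints the queued agent output to stdout.
    sys.stdout.buffer.write(drain_host_queue(QUEUE_DIR, sys.argv[1]))
```

If a workstation submits more than once between two checks, this sketch concatenates all queued submissions, oldest first; returning only the newest file would be an equally plausible choice.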

CONCLUSION

It looks like using the push approach, I can save a lot of CPU load and
some disk IO too.

Questions

If anyone is interested in a better write-up, just ask. I'm happy to
post the programs and scripts I've written.

Can anyone suggest any other approaches to optimisation? I need to have
"full" cmk stats, including logs, so just pings won't do.

Does anyone see any disadvantages to the push approach? It avoids a lot
of unnecessary check processing when workstations are unavailable.

Regards
-Andy D'Arcy Jewell

--
Andy D'Arcy Jewell

SysMicro Limited
Linux Support
T: 0844 9918804
M: 07961605631
E: ***@sysmicro.co.uk
W: www.sysmicro.co.uk

Tim AtLee

2012-06-14 15:49:22 UTC

Hi Andy

This is something I've been thinking about for some time for my own environment. I have a much smaller enterprise to work with (about 120 workstations, all physical) but would like to know about heavy RAM usage, disk space issues, event logs, etc. However, I run into the same issue you outlined in your constraints - workstations (specifically laptops) are not always on.

Have you considered whether collecting data by SNMP is viable? I imagine that "test 3" would be compatible with collecting data with SNMP if required.

While I don't have any suggestions for optimization, I would be interested in the finer details of your implementation.

Thanks,

Tim

-----Original Message-----
From: checkmk-en-***@lists.mathias-kettner.de [mailto:checkmk-en-***@lists.mathias-kettner.de] On Behalf Of Andy D'Arcy Jewell
Sent: Thursday, June 14, 2012 9:38 AM
To: checkmk-***@lists.mathias-kettner.de
Subject: [Check_mk (english)] Check MK Push Agent for workstation monitoring

Andy D'Arcy Jewell

2012-06-14 16:10:53 UTC

Post by Tim AtLee
This is something I've been thinking about for some time for my own environment. I have a much smaller enterprise to work with (about 120 workstations, all physical) but would like to know about heavy RAM usage, disk space issues, event logs, etc. However, I run into the same issue you outlined in your constraints - workstations (specifically laptops) are not always on.
Have you considered whether collecting data by SNMP is viable? I imagine that "test 3" would be compatible with collecting data with SNMP if required.

Have you considered if collecting data by SNMP is viable? I imagine that
"test 3" would be compatible with collecting data with SNMP if
required.. While I don't have any suggestions for optimization, I would
be interested in the finer details of your implementation.

Post by Tim AtLee
While I don't have any suggestions for optimization, I would be interested in the finer details of your implementation.

I'll put together a more detailed report when I'm done testing then. ;-)

Andy D'Arcy Jewell

2012-06-14 16:27:15 UTC

Post by Andy D'Arcy Jewell

Post by Tim AtLee
This is something I've been thinking about for some time for
my own environment. I have a much smaller enterprise to work with
(about 120 workstations, all physical) but would like to know about
heavy RAM usage, disk space issues, event logs, etc. However, I run
into the same issue you outlined in your constraints - workstations
(specifically laptops) are not always on.
Have you considered whether collecting data by SNMP is viable? I imagine
that "test 3" would be compatible with collecting data with SNMP if
required.

Have you considered if collecting data by SNMP is viable? I imagine
that "test 3" would be compatible with collecting data with SNMP if
required.. While I don't have any suggestions for optimization, I
would be interested in the finer details of your implementation.

Cut'n'paste mayhem, sorry! Meant to say: It would be the same as with
normal checks - you'd still have the forking problem, and you'll still
waste a lot of cpu contacting hosts that aren't powered on. I haven't
tested this tho, so it's just my guess.

Post by Andy D'Arcy Jewell

Post by Tim AtLee
While I don't have any suggestions for optimization, I would be
interested in the finer details of your implementation.

I'll put together a more detailed report when I'm done testing then. ;-)

Chris Beattie

2012-06-15 13:08:21 UTC

Post by Andy D'Arcy Jewell
Have you considered if collecting data by SNMP is viable? I imagine that
"test 3" would be compatible with collecting data with SNMP if required..

Going off-topic for a moment:

It's been my experience using Cacti to gather SNMP data from various
hosts and devices that Windows' SNMP service is undesirable if you
require high performance. If you request more than a couple OIDs at a
time, the SNMP service doesn't return data for all of them reliably.

I can't imagine I did anything wrong: there's not much involved in
setting up SNMP on Windows. Meanwhile, Nagios reliably checks 1,300
hosts and 13,000 services using either NSClient++ or Check_MK here, with
no problem except when the checked host is very heavily loaded.

So, I avoid Windows SNMP as a primary data-gathering solution.

--
-Chris

Florian Heigl

2012-06-15 08:59:42 UTC

Hi Andy,

that was an interesting read and a nice test setup you made.

Post by Andy D'Arcy Jewell
Hi list!
Sorry for this long post - I've tried to keep it as brief as possible.
I'm prototyping a workstation monitoring solution, based around Nagios
Core and Check_MK. I have a test rig running on some old test kit in our
lab, monitoring 1000 "fake" Windows XP workstations. I have some
observations and a few queries too.
GOAL
Monitor a large number (4000+) windows workstations, both real and
virtual, to provide stats graphs on resources and also event log
monitoring.
CONSTRAINTS
* Workstations are not powered on and active 24x7 - alerts must not be
generated when a workstation is powered off
* Alerting must be minimal - only certain "special" conditions should
generate alerts

What I personally never figured out well is monitoring transient systems:
workstations that need to be monitored perfectly well while up, but not
while down.

Oh well, maybe it's enough to just set 'none' for the host
notifications... d'oh

Post by Andy D'Arcy Jewell
* Workstation IP's change but they are registered through the AD Domain
in DDNS

For IP changes, dyndns_hosts is the right option, that's easy.

Post by Andy D'Arcy Jewell
TEST RIG SETUP
SAS disk, gigabit networking, running Nagios 3.2.1 with Check MK
1.1.12p7 and PNP4Nagios 0.6.12-1~bpo60+1
"Fake" Workstations: 200 different host names pointing to each of 5 VM's
"Real" Workstations: 5x KVM VM's with 512MB ram, 8GB HDD, 1 vCPU,
running on ProxmoxVE on a similar blade to the Nagios box
TEST 1 - MONITORING 1000 WORKSTATIONS THE "NORMAL WAY"
In this setup, Nagios is attempting to perform an active check per host,
per minute, forking the Nagios process in the normal manner, with
PNP4Nagios using the broker interface for speed. However, checks rapidly
begin to fall behind. Utilisation about 6-7 on 2 cores, mostly waiting
on SYS and IO. This is the forking conundrum MK has written about.
Reducing the check frequency to every 5 minutes still does not allow all
checks to complete on time. Iostat showing disk utilisation at 100%
constantly. Had to write a python script to front for check_mk_agent and
cache the results, as it can't do more than about 2-3 requests per
second on this platform (probably because it's a VM).

It will fall behind even more once you go over 1300 or so hosts; the
Nagios process bloats too much then.

Post by Andy D'Arcy Jewell
TEST 2 - AS TEST 1, USING MK LIVECHECK
In this setup, Nagios is using MK LiveCheck to speed up the check
process. SYS utilisation went down to about 1/3 of the original value,
but checks were still falling behind, even with a check interval of 5
minutes. System load about 3. Iostat shows disk utilisation peaking at
100% regularly. Switched off most Nagios logging and moved status.dat to
/dev/shm and disk utilisation fell to peaking at ~40% once every 20-30
seconds.

See below about OMD. Totally not worth running an oldschool nagios if
you need performance.
Also consider enabling 'livecheck', in any scenario.

Post by Andy D'Arcy Jewell
TEST 3 - PUSH CHECK MK AGENT
Jury rigged the windows check_mk_agent.exe by writing a simple .cmd
using windows ports of nc and curl to talk to the cmk agent on
localhost, and set up a scheduled task to run the script every 5
minutes. On the server side, set up a simple cgi to receive cmk agent
output and dump to a file in a queue directory. Modified main.mk to use
a "datasource program" to collect this data. Wrote a small program to
cat and delete these files. System load about 0.3.

We also use an SSH push solution so that our office build host can be
monitored from the internet, and another one for the demo site.
There the submission is even email-based.

So this is a perfectly fine thing to do.
What you should consider is using check_file_age as the host check
command for the systems, so that you have a valid mechanism for host
down detection.
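
The idea behind a file-age host check can be sketched like this. It is an illustration of the logic only, not the actual check_file_age plugin: the function name and the threshold are assumptions, and it presumes the receiving side keeps or touches one per-host file that is not deleted when the data is collected.

```python
import os
import time

# Exit codes a Nagios host check is expected to return.
UP, DOWN = 0, 2


def host_state_from_file_age(path, max_age_seconds, now=None):
    """Return UP if `path` was modified within `max_age_seconds`, else DOWN."""
    if now is None:
        now = time.time()
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # No file at all: the host has never reported, so treat it as down.
        return DOWN
    return UP if age <= max_age_seconds else DOWN
```

With a push interval of 5 minutes, a maximum age of two or three intervals would tolerate a single missed submission before the host is shown as down.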

Post by Andy D'Arcy Jewell
CONCLUSION
It looks like using the push approach, I can save a lot of CPU load and
some disk IO too.

I think it's great for that use case.

Post by Andy D'Arcy Jewell
Questions
If anyone is interested in a better write-up, just ask. I'm happy to
post the programs and scripts I've written.

Just do it, people invent too many wheels already.
(My ssh push agent for Linux/Unix is at http://bitbucket.org/darkfader/ )

Sadly I couldn't write something like that for windows, so it would be
great if you made it public.

Post by Andy D'Arcy Jewell
Can anyone suggest any other approaches to optimisation? I need to have
"full" cmk stats, including logs, so just pings won't do.

Addon hint:
Use agent based filtering for the eventlogs that are monitored to cut
down on traffic.

Post by Andy D'Arcy Jewell
Does anyone see any disadvantages to the push approach? It avoids a lot
of unnecessary check processing when workstations are unavailable.

Regarding performance:
You didn't say if you're using OMD for the nagios install.
This can improve performance considerably, due to its use of rrdcached and tmpfs.

I'm quite sure once you go over 4k hosts to monitor, you will run into
some additional bottlenecks.
Your choice of a push solution is very well suited there.

Greetings, and thanks for your experiences shared.
I know of a few customers who were recently looking for such massive
client monitoring.
Blog about your experiments and once it goes live, we need more SEO
points :)

Greetings,
Florian

p.s.:
If possible, take the hint and re-test with an OMD nagios install.
Just 2 days ago I did my Nagios benchmarking talk, it was fun:
Fired up over 340k service checks / min on a quadcore box and I
basically considered it as normal and went on about explaining
bottlenecks that come at higher load, instead of saying much about *that
is damn fast now*.
The audience's faces disagreed :)