Hi Andy,
that was an interesting read and a nice test setup you made.
Post by Andy D'Arcy Jewell
Hi list!
Sorry for this long post - I've tried to keep it as brief as possible.
I'm prototyping a workstation monitoring solution, based around Nagios
Core and Check_MK. I have a test rig running on some old test kit in our
lab, monitoring 1000 "fake" Win XP workstations. I have some
observations and a few queries too.
GOAL
Monitor a large number (4000+) of Windows workstations, both real and
virtual, to provide stats graphs on resources and also event log
monitoring.
CONSTRAINTS
* Workstations are not powered-on and active 24x7 - alerts must not be
generated when ws powered off
* Alerting must be minimal; only certain "special" conditions should
generate alerts
What I personally never figured out well is monitoring transient
systems: workstations that need to be monitored perfectly well while up,
but not while down.
Oh well, maybe it's enough to just set 'none' for the host
notifications... d'oh
Post by Andy D'Arcy Jewell
* Workstation IPs change but they are registered through the AD Domain
in DDNS
For IP changes, dyndns_hosts is the right option, that's easy.
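A rough sketch of what that could look like in main.mk, assuming the workstations carry a host tag. The tag name `dyndns` and the host names are made up, and the tuple layout is an assumption based on how host rule lists are usually written in the 1.1.x series, so check it against your version's documentation.

```python
# Fragment for main.mk (Check_MK reads this file as Python).
# ALL_HOSTS is predefined by Check_MK; the stand-in below only makes this
# fragment runnable on its own and should be dropped in a real main.mk.
ALL_HOSTS = ["@all"]

# Hypothetical workstations, each tagged "dyndns" via the name|tag notation.
all_hosts = [
    "ws0001|dyndns",
    "ws0002|dyndns",
]

# Assumed rule layout: (list of required tags, host list).
# The intent is that matching hosts are addressed by name, so the name is
# resolved when the check runs rather than when the config is generated.
dyndns_hosts = [
    (["dyndns"], ALL_HOSTS),
]
```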
Post by Andy D'Arcy Jewell
TEST RIG SETUP
SAS disk, gigabit networking, running Nagios 3.2.1 with Check MK
1.1.12p7 and PNP4Nagios 0.6.12-1~bpo60+1
"Fake" Workstations: 200 different host names pointing to each of 5 VM's
"Real" Workstations: 5x KVM VM's with 512MB ram, 8GB HDD, 1 vCPU,
running on ProxmoxVE on a similar blade to the Nagios box
TEST 1 - MONITORING 1000 WORKSTATIONS THE "NORMAL WAY"
In this setup, Nagios is attempting to perform an active check per host,
per minute, forking the Nagios process in the normal manner, with
PNP4Nagios using the broker interface for speed. However, checks rapidly
begin to fall behind. Utilisation about 6-7 on 2 cores, mostly waiting
on SYS and IO. This is the forking conundrum MK has written about.
Reducing the check frequency to every 5 minutes still does not allow all
checks to complete on time. Iostat showing disk utilisation at 100%
constantly. Had to write a python script to front for check_mk_agent and
cache the results, as it can't do more than about 2-3 requests per
second on this platform (probably because it's a VM).
It will fall behind even more once you go over 1300 or so hosts; the
Nagios process bloats too much by then.
Post by Andy D'Arcy Jewell
TEST 2 - AS TEST 1, USING MK LIVECHECK
In this setup, Nagios is using MK LiveCheck to speed up the check
process. SYS utilisation went down to about 1/3 of the original value,
but checks were still falling behind, even with a check interval of 5
minutes. System load about 3. Iostat shows disk utilisation peaking at
100% regularly. Switched off most Nagios logging and moved status.dat to
/dev/shm and disk utilisation fell to peaking at ~40% once every 20-30
seconds.
See below about OMD. Totally not worth running an oldschool nagios if
you need performance.
Also consider enabling 'livecheck', in any scenario.
Post by Andy D'Arcy Jewell
TEST 3 - PUSH CHECK MK AGENT
Jury rigged the windows check_mk_agent.exe by writing a simple .cmd
using windows ports of nc and curl to talk to the cmk agent on
localhost, and set up a scheduled task to run the script every 5
minutes. On the server side, set up a simple cgi to receive cmk agent
output and dump to a file in a queue directory. Modified main.mk to use
a "datasource program" to collect this data. Wrote a small program to
cat and delete these files. System load about 0.3.
We also use an SSH push solution so that our office build host can be
monitored from the internet, and another one for the demo site.
There the submission is even email-based.
So this is a perfectly fine thing to do.
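For readers who want to try it, here is a minimal sketch of the "cat and delete" step on the server side. It is not Andy's program; the queue directory and the one-file-per-host naming are assumptions made for the example.

```python
import os
import sys

# Hypothetical queue directory that the receiving CGI writes into,
# one file per host, named after the host.
QUEUE_DIR = "/var/spool/cmk_push"


def read_and_delete(queue_dir, hostname):
    """Return the queued agent output for hostname and delete the file.

    Returns None when nothing has been pushed since the last read.
    """
    # basename() keeps a crafted host name from escaping the queue directory.
    path = os.path.join(queue_dir, os.path.basename(hostname))
    try:
        with open(path) as handle:
            data = handle.read()
    except IOError:
        return None
    os.remove(path)
    return data


def main(argv, queue_dir=QUEUE_DIR):
    """Entry point for a wrapper script: print the data, return an exit code."""
    if len(argv) != 2:
        sys.stderr.write("usage: %s HOSTNAME\n" % argv[0])
        return 2
    output = read_and_delete(queue_dir, argv[1])
    if output is None:
        # Non-zero exit so the caller can tell that no fresh data exists.
        return 1
    sys.stdout.write(output)
    return 0
```

A wrapper script would end with `sys.exit(main(sys.argv))` and be referenced from a `datasource_programs` entry in main.mk, with the host name passed through the `<HOST>` placeholder. To avoid reading half-written files, the CGI should write to a temporary name and rename it into place.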
What you should consider is using check_file_age as the host check
command for the systems, so that you have a valid mechanism for host
down detection.
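check_file_age is the stock Nagios plugin, so nothing needs to be written for this. The Python below only illustrates the freshness logic behind such a host check. It assumes the receiving side keeps a per-host marker file whose modification time is the last push (a spool file that gets deleted after reading would not work for this), and the 900 second threshold is just an example value.

```python
import os
import time

# Nagios plugin exit codes used for the host state.
OK, CRITICAL = 0, 2


def host_state_from_file_age(path, max_age_seconds, now=None):
    """Return (exit_code, message) based on the age of a marker file."""
    if now is None:
        now = time.time()
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # No marker file at all: the host has never pushed anything.
        return CRITICAL, "DOWN - no data received (%s missing)" % path
    if age > max_age_seconds:
        return CRITICAL, "DOWN - last push %d seconds ago" % age
    return OK, "UP - last push %d seconds ago" % age
```

With a 5 minute push interval, `host_state_from_file_age(marker, 900)` would report the host as down after roughly three missed pushes. Combined with host notifications set to 'none', a powered-off workstation then shows as down without alerting anyone.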
Post by Andy D'Arcy Jewell
CONCLUSION
It looks like using the push approach, I can save a lot of CPU load and
some disk IO too.
I think it's great for that use case.
Post by Andy D'Arcy Jewell
Questions
If anyone is interested in a better write-up, just ask. I'm happy to
post up the programs and scripts I've written.
Just do it, people invent too many wheels already.
(My ssh push agent for Linux/Unix is at http://bitbucket.org/darkfader/ )
Sadly I couldn't write something like that for windows, so it would be
great if you made it public.
Post by Andy D'Arcy Jewell
Can anyone suggest any other approaches to optimisation? I need to have
"full" cmk stats, including logs, so just pings won't do.
Addon hint:
Use agent based filtering for the eventlogs that are monitored to cut
down on traffic.
Post by Andy D'Arcy Jewell
Does anyone see any disadvantages to the push approach? It avoids a lot
of unnecessary check processing when workstations are unavailable.
Regarding performance:
You didn't say if you're using OMD for the Nagios install.
It can save a lot of load, thanks to its use of rrdcached and tmpfs.
I'm quite sure once you go over 4k hosts to monitor, you will run into
some additional bottlenecks.
Your choice of a push solution is very well suited there.
Greetings, and thanks for your experiences shared.
I know of a few customers that have been looking for such massive client
monitoring recently.
Blog about your experiments once it goes live; we need more SEO
points :)
Greetings,
Florian
p.s.:
If possible, take the hint and re-test with an OMD nagios install.
Just 2 days ago I did my Nagios benchmarking talk, it was fun:
Fired up over 340k service checks / min on a quadcore box and I
basically considered it as normal and went on about explaining
bottlenecks that come at higher load, instead of saying much about *that
is damn fast now*.
The audience's faces disagreed :)