[Check_mk (english)] High Windows Agent execution time with wmi

Discussion:

[Check_mk (english)] High Windows Agent execution time with wmi_cpuload and msexch

Trahan, Jonathan

2016-05-25 15:11:36 UTC

Hi,
We are getting some slow execution time with the 1.2.8 Windows agent when wmi_cpuload and msexch sections are included in the .ini configuration file of the agent. We poll the agent every 5 minutes, and we get 1 slow poll out of 4 executions.

06:30:38 06:46:44 2.43% OK Agent version 1.2.8, execution time 2.6 sec
06:46:44 06:51:47 0.76% CRIT CRIT - Check_MK timed out after 60 seconds
06:51:47 07:07:52 2.42% OK Agent version 1.2.8, execution time 2.5 sec
07:07:52 07:12:55 0.76% CRIT CRIT - Check_MK timed out after 60 seconds
07:12:55 07:29:00 2.42% OK Agent version 1.2.8, execution time 2.6 sec
07:29:00 07:34:41 0.86% CRIT CRIT - Check_MK timed out after 60 seconds
07:34:41 07:50:47 2.43% OK Agent version 1.2.8, execution time 2.5 sec
07:50:47 07:55:49 0.76% CRIT CRIT - Check_MK timed out after 60 seconds
07:55:49 08:11:55 2.43% OK Agent version 1.2.8, execution time 2.5 sec
08:11:55 08:16:58 0.76% CRIT CRIT - Check_MK timed out after 60 seconds

The slowness seems to come directly from the agent execution, as a strace on the python execution when the agent is polled shows that it is seems to wait for the result to get back.

44507 17:28:31 connect(3, {sa_family=AF_INET, sin_port=htons(6556), sin_addr=inet_addr("10.16.30.98")}, 16) = -1 EINPROGRESS (Operation now in progress)
44507 17:28:31 poll([{fd=3, events=POLLOUT}], 1, 5000) = 1 ([{fd=3, revents=POLLOUT}])
44507 17:28:31 getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
44507 17:28:31 fcntl(3, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
44507 17:28:31 fcntl(3, F_SETFL, O_RDWR) = 0
44507 17:28:31 recvfrom(3, "<<<check_mk>>>\nVersion: 1.2.8\nBu"..., 4096, MSG_WAITALL, NULL, NULL) = 4096
44507 17:28:31 recvfrom(3, "veur DCOM\ndefragsvc stopped/dema"..., 4096, MSG_WAITALL, NULL, NULL) = 4096
44507 17:28:31 recvfrom(3, "osoft Exchange\nMSExchangeTranspo"..., 4096, MSG_WAITALL, NULL, NULL) = 4096
44507 17:28:31 recvfrom(3, "ing/auto HP ProLiant System Shut"..., 4096, MSG_WAITALL, NULL, NULL) = 4096
44507 17:28:31 recvfrom(3, "89120859161 131085989120859161 1"..., 4096, MSG_WAITALL, NULL, NULL) = 4096
44507 17:28:31 recvfrom(3, "760 7601664 14270908928 53718016"..., 4096, MSG_WAITALL, NULL, NULL) = 4096
44507 17:28:31 recvfrom(3, "54 1713826986 1671394714 1499169"..., 4096, MSG_WAITALL, NULL, NULL) = 4096
44507 17:30:59 recvfrom(3, "el,Name,NameFormat,NetworkServer"..., 4096, MSG_WAITALL, NULL, NULL) = 4096
44507 17:31:00 recvfrom(3, "etoProcessaFederatedFreeBusyRequ"..., 4096, MSG_WAITALL, NULL, NULL) = 4096
44507 17:31:00 recvfrom(3, "0,,0,0,1,0,1,11024,0,0,0,0,0,0,0"..., 4096, MSG_WAITALL, NULL, NULL) = 1339
44507 17:31:01 recvfrom(3, "", 4096, MSG_WAITALL, NULL, NULL) = 0
44507 17:31:01 close(3) = 0

Is this a known behaviour? Should I allow more time than 60 seconds for the agent to execute if those sections are included?
Is there a way to get more information from the Windows agent? I am not that familiar with C++ so I do not know if there is any debugging tool that could help me get more information to diagnose the issue.

Cheers,
Jonathan

Marcel Schulte

2016-05-26 06:27:55 UTC

Permalink

Hi Jonathan,

could you please show us your mentioned ini-sections?

Regards,
Marcel

Post by Trahan, Jonathan
Hi,
We are getting some slow execution time with the 1.2.8 Windows
agent when wmi_cpuload and msexch sections are included in the .ini
configuration file of the agent. We poll the agent every 5 minutes, and we
get 1 slow poll out of 4 executions.
06:30:38 06:46:44 2.43% OK Agent version 1.2.8,
execution time 2.6 sec
06:46:44 06:51:47 0.76% CRIT CRIT - Check_MK timed out
after 60 seconds
06:51:47 07:07:52 2.42% OK Agent version 1.2.8,
execution time 2.5 sec
07:07:52 07:12:55 0.76% CRIT CRIT - Check_MK timed out
after 60 seconds
07:12:55 07:29:00 2.42% OK Agent version 1.2.8,
execution time 2.6 sec
07:29:00 07:34:41 0.86% CRIT CRIT - Check_MK timed out
after 60 seconds
07:34:41 07:50:47 2.43% OK Agent version 1.2.8,
execution time 2.5 sec
07:50:47 07:55:49 0.76% CRIT CRIT - Check_MK timed out
after 60 seconds
07:55:49 08:11:55 2.43% OK Agent version 1.2.8,
execution time 2.5 sec
08:11:55 08:16:58 0.76% CRIT CRIT - Check_MK timed out
after 60 seconds
The slowness seems to come directly from the agent execution, as a
strace on the python execution when the agent is polled shows that it is
seems to wait for the result to get back.
44507 17:28:31 connect(3, {sa_family=AF_INET, sin_port=htons(6556),
sin_addr=inet_addr("10.16.30.98")}, 16) = -1 EINPROGRESS (Operation now in
progress)
44507 17:28:31 poll([{fd=3, events=POLLOUT}], 1, 5000) = 1 ([{fd=3, revents=POLLOUT}])
44507 17:28:31 getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
44507 17:28:31 fcntl(3, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
44507 17:28:31 fcntl(3, F_SETFL, O_RDWR) = 0
44507 17:28:31 recvfrom(3, "<<<check_mk>>>\nVersion: 1.2.8\nBu"..., 4096,
MSG_WAITALL, NULL, NULL) = 4096
44507 17:28:31 recvfrom(3, "veur DCOM\ndefragsvc stopped/dema"..., 4096,
MSG_WAITALL, NULL, NULL) = 4096
44507 17:28:31 recvfrom(3, "osoft Exchange\nMSExchangeTranspo"..., 4096,
MSG_WAITALL, NULL, NULL) = 4096
44507 17:28:31 recvfrom(3, "ing/auto HP ProLiant System Shut"..., 4096,
MSG_WAITALL, NULL, NULL) = 4096
44507 17:28:31 recvfrom(3, "89120859161 131085989120859161 1"..., 4096,
MSG_WAITALL, NULL, NULL) = 4096
44507 17:28:31 recvfrom(3, "760 7601664 14270908928 53718016"..., 4096,
MSG_WAITALL, NULL, NULL) = 4096
44507 17:28:31 recvfrom(3, "54 1713826986 1671394714 1499169"..., 4096,
MSG_WAITALL, NULL, NULL) = 4096
44507 17:30:59 recvfrom(3, "el,Name,NameFormat,NetworkServer"..., 4096,
MSG_WAITALL, NULL, NULL) = 4096
44507 17:31:00 recvfrom(3, "etoProcessaFederatedFreeBusyRequ"..., 4096,
MSG_WAITALL, NULL, NULL) = 4096
44507 17:31:00 recvfrom(3, "0,,0,0,1,0,1,11024,0,0,0,0,0,0,0"..., 4096,
MSG_WAITALL, NULL, NULL) = 1339
44507 17:31:01 recvfrom(3, "", 4096, MSG_WAITALL, NULL, NULL) = 0
44507 17:31:01 close(3) = 0
Is this a known behaviour? Should I allow more time than 60
seconds for the agent to execute if those sections are included?
Is there a way to get more information from the Windows agent? I
am not that familiar with C++ so I do not know if there is any debugging
tool that could help me get more information to diagnose the issue.
Cheers,
Jonathan
_______________________________________________
checkmk-en mailing list
http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en

Trahan, Jonathan

2016-05-26 10:31:09 UTC

Permalink

Hi Marcel,
Here it is! The commented sections line is the one which timeout sometimes.

Cheers,

# Created by Check_MK Agent Bakery.
# This file is managed via WATO, do not edit manually or you
# lose your changes next time when you update the agent.

[global]
# Only execute the following agent sections
#sections = check_mk uptime systemtime df mem services winperf logwatch logfiles fileinfo plugins local mrpe spool wmi_cpuload msexch
sections = check_mk uptime systemtime df mem services winperf logwatch logfiles fileinfo plugins local mrpe spool

# TCP port the agent is listening on
port = 6556

# Allow access only from the following IP addresses and networks
only_from = 10.8.0.2 10.8.0.3 10.8.0.4 10.8.0.5 10.8.0.6 127.0.0.1

[logwatch]
# Testmode, where all messages are sent over and over again
sendall = no

# Configuration for individual Eventlogs
logfile system = crit nocontext
logfile application = crit nocontext
logfile * = off nocontext

Marcel Schulte

2016-05-27 06:03:25 UTC

Permalink

Hi Jonathan,

I don't know what section msexch would be but we've this set in our
check_mk.ini on exchange hosts (additionally to the default entries):

[global]
async_script_execution = sequential

[plugins]
execution msexch_dag.ps1 = async
cache_age msexch_dag.ps1 = 90

execution mk_inventory.vbs = async
cache_age mk_inventory.vbs = 18000

execution msexch_database.ps1 = async
cache_age msexch_database.ps1 = 120
timeout msexch_database.ps1 = 120
retry_count msexch_database.ps1 = 1

[winperf]
counters = MSExchange ADAccess Domain Controllers:msx_dc_access
counters = MSExchange ActiveSync:msx_activesync
counters = MSExchange RpcClientAccess:msx_rpc_clientaccess
counters = MSExchange Database:msx_db_reads_avg_latency
counters = MSExchangeTransport Queues:msx_queues

------------

The msexch_database.ps1 plugin never came back (even after 1 hour), so we
removed it for the moment. msexch_dag.ps1 as well as the winperf counters
work perfectly.

For the winperf counters to produce services you've to copy checks from
~/share/doc/check_mk/treasures/msexchange/ to
~/local/share/check_mk/checks/.

HTH,
Marcel

Post by Trahan, Jonathan
Hi Marcel,
Here it is! The commented sections line is the one which timeout sometimes.
Cheers,
# Created by Check_MK Agent Bakery.
# This file is managed via WATO, do not edit manually or you
# lose your changes next time when you update the agent.
[global]
# Only execute the following agent sections
#sections = check_mk uptime systemtime df mem services winperf
logwatch logfiles fileinfo plugins local mrpe spool wmi_cpuload msexch
sections = check_mk uptime systemtime df mem services winperf logwatch
logfiles fileinfo plugins local mrpe spool
# TCP port the agent is listening on
port = 6556
# Allow access only from the following IP addresses and networks
only_from = 10.8.0.2 10.8.0.3 10.8.0.4 10.8.0.5 10.8.0.6 127.0.0.1
[logwatch]
# Testmode, where all messages are sent over and over again
sendall = no
# Configuration for individual Eventlogs
logfile system = crit nocontext
logfile application = crit nocontext
logfile * = off nocontext

Andreas Döhler

2016-05-28 14:32:14 UTC

Permalink

Hi Jonathan,

While testing the 1.2.8 agent I get the same result on bigger exchange
servers.
The problem is the included WMI section as you already found out. This
should be kept outside the core in scripts. If the execution of this
internal section could be configured asynchronous it would be no problem.
The wmi_cpuload was no problem alone on my testing servers.

Best regards
Andreas

Post by Marcel Schulte
Hi Jonathan,
I don't know what section msexch would be but we've this set in our
[global]
async_script_execution = sequential
[plugins]
execution msexch_dag.ps1 = async
cache_age msexch_dag.ps1 = 90
execution mk_inventory.vbs = async
cache_age mk_inventory.vbs = 18000
execution msexch_database.ps1 = async
cache_age msexch_database.ps1 = 120
timeout msexch_database.ps1 = 120
retry_count msexch_database.ps1 = 1
[winperf]
counters = MSExchange ADAccess Domain Controllers:msx_dc_access
counters = MSExchange ActiveSync:msx_activesync
counters = MSExchange RpcClientAccess:msx_rpc_clientaccess
counters = MSExchange Database:msx_db_reads_avg_latency
counters = MSExchangeTransport Queues:msx_queues
------------
The msexch_database.ps1 plugin never came back (even after 1 hour), so we
removed it for the moment. msexch_dag.ps1 as well as the winperf counters
work perfectly.
For the winperf counters to produce services you've to copy checks from
~/share/doc/check_mk/treasures/msexchange/ to
~/local/share/check_mk/checks/.
HTH,
Marcel

Post by Trahan, Jonathan
Hi Marcel,
Here it is! The commented sections line is the one which timeout sometimes.
Cheers,
# Created by Check_MK Agent Bakery.
# This file is managed via WATO, do not edit manually or you
# lose your changes next time when you update the agent.
[global]
# Only execute the following agent sections
#sections = check_mk uptime systemtime df mem services winperf
logwatch logfiles fileinfo plugins local mrpe spool wmi_cpuload msexch
sections = check_mk uptime systemtime df mem services winperf
logwatch logfiles fileinfo plugins local mrpe spool
# TCP port the agent is listening on
port = 6556
# Allow access only from the following IP addresses and networks
only_from = 10.8.0.2 10.8.0.3 10.8.0.4 10.8.0.5 10.8.0.6 127.0.0.1
[logwatch]
# Testmode, where all messages are sent over and over again
sendall = no
# Configuration for individual Eventlogs
logfile system = crit nocontext
logfile application = crit nocontext
logfile * = off nocontext

_______________________________________________
checkmk-en mailing list
http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en

Trahan, Jonathan

2016-05-30 14:26:27 UTC

Permalink

Hi,
Thank you for your answers, it is really appreciated.

So, if I understand you both correctly, if those two sections (wmi_cpuload and msexch) could be ran asynchronous outside the agent core, it could probably solve my issue in the long term.

That would be considered a feature / bug, depending on what causes the timeout.

Cheers,
Jonathan

Trahan, Jonathan

2016-05-30 21:41:55 UTC

Permalink

FYI,

I noticed that my problem seems to occur when a WMI call is done through C++ or powershell every X executions. It always occurs on the *first* WMI call of the script, and when the problem do occur, it takes around 2 minutes for that call to comeback, the other ones in the same script go through in a second or less.

I tried to look online for anything that talks about slow wmi call or cached wmi call, but did not get any pertinent results...

-----Message d'origine-----
De : Trahan, Jonathan
Envoyé : 30 mai 2016 10:26
À : Andreas Döhler; Marcel Schulte; checkmk-***@lists.mathias-kettner.de
Objet : RE: [Check_mk (english)] High Windows Agent execution time with wmi_cpuload and msexch

Hi,
Thank you for your answers, it is really appreciated.

So, if I understand you both correctly, if those two sections (wmi_cpuload and msexch) could be ran asynchronous outside the agent core, it could probably solve my issue in the long term.

That would be considered a feature / bug, depending on what causes the timeout.

Cheers,
Jonathan