Discussion:
[Check_mk (english)] CMC Stopping Periodically in 1.5.0p2
Adam Chesterton
2018-09-24 23:09:41 UTC
Hi Everyone,

Got another problem cropping up after our upgrade from 1.4.0p7 to 1.5.0p2
(running on CentOS 7). Periodically, the CMC will stop running, and we have
to manually recover it. This happens every 5 days or so.
I've looked at the Check_MK logs, and turned on debug level logs for the
cmc.log, but this hasn't revealed any new information. We get an entry in
the alerts.log file that configuration has changed and it is restarting
itself, and at the same time there is a traceback from an error in the
cmc.log. Shortly after this, we get another error in cmc.log ("could not
read signal byte: Connection reset by peer") and then things just stop.

Does anyone have any ideas on what is causing this and/or how to resolve it?
An extract of the logs is below.

Regards,
Adam Chesterton

----
ALERTS.LOG
07:52:07 Configuration has changed. Restarting myself.

CMC.LOG
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: Traceback (most recent call last):
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/bin/cmk", line 96, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: exit_status = modes.call(o, a, opts, args)
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/cmk_base/modes/__init__.py", line 80, in call
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: return mode.handler_function(*handler_args)
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/cmk_base/modes/cee.py", line 216, in mode_handle_alerts
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: import cmk_base.cee.alert_handling as alert_handling
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/cmk_base/cee/alert_handling.py", line 45, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: import cmk_base.events as events
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/cmk_base/events.py", line 46, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: import cmk_base.core as core
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/cmk_base/core.py", line 44, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: import cmk_base.core_nagios as core_nagios
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/cmk_base/core_nagios.py", line 45, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: import cmk_base.data_sources as data_sources
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/cmk_base/data_sources/__init__.py", line 62, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: from .ipmi import IPMIManagementBoardDataSource
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/cmk_base/data_sources/ipmi.py", line 27, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: import pyghmi.ipmi.command as ipmi_cmd
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/pyghmi/ipmi/command.py", line 25, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: from pyghmi.ipmi.oem.lookup import get_oem_handler
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/pyghmi/ipmi/oem/lookup.py", line 16, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: import pyghmi.ipmi.oem.lenovo.handler as lenovo
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/pyghmi/ipmi/oem/lenovo/handler.py", line 33, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: from pyghmi.ipmi.oem.lenovo import imm
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/pyghmi/ipmi/oem/lenovo/imm.py", line 25, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: import pyghmi.ipmi.private.session as ipmisession
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/pyghmi/ipmi/private/session.py", line 273, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: class Session(object):
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/pyghmi/ipmi/private/session.py", line 309, in Session
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: _crypto_backend = default_backend()
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/cryptography/hazmat/backends/__init__.py", line 15, in default_backend
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: from cryptography.hazmat.backends.openssl.backend import backend
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/cryptography/hazmat/backends/openssl/__init__.py", line 7, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: from cryptography.hazmat.backends.openssl.backend import backend
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/cryptography/hazmat/backends/openssl/backend.py", line 16, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: from cryptography import utils, x509
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: File "/omd/sites/melbourne/lib/python/cryptography/x509/__init__.py", line 8, in <module>
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: from cryptography.x509 import certificate_transparency
2018-09-25 07:52:07 [3] [alert helper 32312] Invalid response from alert helper: ImportError: cannot import name certificate_transparency
..........
2018-09-25 07:52:30 [0] [alert helper 32312] could not read signal byte: Connection reset by peer
2018-09-25 07:52:30 [5] [alert helper 32312] still 1 unsent events, sending them now
2018-09-25 07:52:30 [0] [alert helper 32312] could not read signal byte: Connection reset by peer
2018-09-25 07:52:30 [5] [alert helper 32312] still 1 unsent events, sending them now
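
The traceback in the extract is hard to read because every line carries the CMC log prefix. The following is a minimal sketch for stripping that prefix so the helper's Python traceback can be read on its own. It assumes the real cmc.log holds one entry per line (the mail client wrapped the lines in this post) and that the prefix always has the shape seen above. The name extract_traceback is made up for illustration and is not part of Check_MK.

```python
import re

# Prefix seen in the extract above: timestamp, log level, helper id, fixed text.
# Assumption: the real cmc.log has one entry per line with exactly this shape.
PREFIX = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[\d+\] \[alert helper \d+\] "
    r"Invalid response from alert helper: ?"
)


def extract_traceback(log_lines):
    """Return the alert helper's output lines with the CMC log prefix removed.

    Lines that do not carry the "Invalid response from alert helper" prefix
    (for example the "could not read signal byte" entries) are skipped.
    """
    stripped = []
    for line in log_lines:
        match = PREFIX.match(line)
        if match:
            stripped.append(line[match.end():].rstrip("\n"))
    return stripped
```

Feeding it the lines of cmc.log (for example via open(path).readlines()) should leave only the traceback text, ending in the ImportError line.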
Paul Dott
2018-09-25 05:29:03 UTC
Is it possible you have Check_MK Discovery running against some servers and
auto activating your changes? Seems like it would be a clean restart though
vs what you are seeing.
_______________________________________________
checkmk-en mailing list
Manage your subscription or unsubscribe
http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en
Adam Chesterton
2018-09-26 00:40:56 UTC
Yes, I do have Check_MK Discovery running on my environment. It's set to
discover once a day and automatically add missing services, however the
time period I have set (don't discover or activate between 0700-2100)
doesn't match up with the times the CMC has stopped, such as the most
recent one where it happened at 0752.

I also don't have this issue on our other CMK hosts, but this host does
have considerably more load on it (13000 services with 16 cores vs 1000
with 4 cores). I am looking to split out from a single host per site
(growth is a good problem to have), so hopefully we'll find a load level
that works.

As a short term workaround, I've set up an hourly cron job to run "omd
start" to keep things ticking over (particularly over weekends).

Regards,
Adam
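
The hourly "omd start" workaround described above could also be written as a small script that only starts the site when it is not fully running. This is a sketch under stated assumptions, not a tested fix: it assumes it runs as the site user (so that "omd" acts on that site) and that "omd status" exits with 0 only when all services of the site are running. The function name ensure_site_running is hypothetical.

```python
import subprocess


def ensure_site_running(run=subprocess.call):
    """Start the OMD site if `omd status` reports it is not fully running.

    `run` takes an argument list and returns the exit code; it defaults to
    subprocess.call and can be replaced for testing.
    Returns True if a start was attempted, False otherwise.
    """
    # Assumption: exit code 0 means every service of the site is running.
    if run(["omd", "status"]) == 0:
        return False
    # `omd start` only starts services that are not already running.
    run(["omd", "start"])
    return True
```

Calling ensure_site_running() from an hourly cron job of the site user would mirror the workaround; it does not address the underlying ImportError in the alert helper.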
Paul Dott
2018-09-26 13:36:01 UTC
Interesting you didn’t have this before. The message implies your auto
discovery did this:

07:52:07 Configuration has changed. Restarting myself.

But like you said times don’t add up.

Is this a distributed setup? Master or slave that is having an issue?
Greg Wildman
2018-09-26 07:58:11 UTC
You are not alone, I am experiencing the same problem. It is happening
over multiple sites and at what seems like random intervals. I have a
ticket opened and it is being looked at.

I will report back any findings. I am sure the devs will get to the
bottom of this and implement an update if necessary.

--
Greg
Adam Chesterton
2018-09-27 04:05:04 UTC
Thanks Greg, good to know I'm not the only one and that it's being looked
at. I'll wait to hear more from you.

On a side note, do you use the cmk-update-agent with Windows, and if so are
you having any problems with the updater falling over part-way through an
agent update and not installing any plugins? I emailed the list last
week about it, but haven't heard back from anyone.

Regards,
Adam