Univention Bugzilla – Bug 49967
Nagios reports false positive about nscd
Last modified: 2019-10-07 13:35:08 CEST
When Nagios checks 'nscd', the following might occur:

  /usr/lib/nagios/plugins/check_univention_nscd_suidwrapper
  CRITICAL: no instance of nscd bound to nscd socket. nscd might have crashed.

This is a false positive and seems to be caused by a bug in 'fuser'; when using 'lsof' instead, everything is fine. PH already has a fix for that.
(In reply to Nico Stöckigt from comment #0)
> PH already has a fix for that.

  sed -i -e 's/fuser /lsof -t /' /usr/lib/nagios/plugins/check_univention_nscd

For the customer, each failed Nagios check results in a ticket being opened, which is a pain. It happens on multiple hosts.

In the past I experienced similar problems with `fuser` on UNIX sockets not finding processes. I have not yet looked at src:psmisc, but using `lsof -t` instead looks a lot more stable.
I changed that to fuser because lsof had a problem:
https://forge.univention.org/bugzilla/show_bug.cgi?id=45414#c4
That bug was often triggered by OX.
"nscd -g" returns a non zero exit code if nscd isn't running. That might be a better alternative to lsof/fuser.
This happened again on the customer side. The manually added hot-fix got reverted during an update. The customer has serious issues with this, as internal controlling is metering the number of issues in Nagios and false alarms have to be explained. This is getting urgent.
Another ticket: #2019082921000331

I had a short look at `fuser.c`, but dropped that in favor of other, more important issues.

Why does the Nagios check complain about multiple NSCD? It should be sufficient that at least one NSCD is running AND responding. We should just use:

  timeout 3 nscd -g &>/dev/null
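As a rough illustration of the `nscd -g` idea, a Nagios-style check could be sketched as follows. This is not the code from the actual patch. The function name `check_nscd_responding` and the `NSCD_PROBE` variable are hypothetical; the variable exists only so the sketch can be exercised on a machine without nscd. The exit codes 0 (OK) and 2 (CRITICAL) follow the usual Nagios plugin convention.

```shell
#!/bin/sh
# Sketch only: report OK if the probe command succeeds within 3 seconds.
# NSCD_PROBE is a hypothetical override for testing; the default mirrors
# the "timeout 3 nscd -g" command proposed in the comment above.
check_nscd_responding() {
    # NSCD_PROBE is intentionally unquoted so "nscd -g" splits into words.
    if timeout 3 ${NSCD_PROBE:-nscd -g} >/dev/null 2>&1; then
        echo "OK: nscd is responding"
        return 0    # Nagios OK
    else
        echo "CRITICAL: nscd not responding!"
        return 2    # Nagios CRITICAL
    fi
}
```

Called without `NSCD_PROBE` set, the function runs `timeout 3 nscd -g`; a hung or missing nscd both end up as CRITICAL, because `timeout` returns a non-zero status in either case.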
Maybe also something like

  start-stop-daemon -T -p /var/run/nscd/nscd.pid --exec /usr/sbin/nscd

or some

  systemctl --quiet is-enabled nscd.service
  ! systemctl --quiet is-failed nscd.service

Also `lsof` has a strange behavior regarding paths containing symbolic links:

  # ls -lid /var/run /run /run/nscd/socket
    1142 drwxr-xr-x 30 root root 1180 Aug 29 12:35 /run
  108101 srw-rw-rw-  1 root root    0 Aug 29 12:55 /run/nscd/socket
  523581 lrwxrwxrwx  1 root root    4 Mär 11 15:02 /var/run -> /run
  # readlink -f /var/run/nscd/socket
  /run/nscd/socket
  # lsof /var/run/nscd/socket
  COMMAND   PID USER   FD   TYPE             DEVICE SIZE/OFF   NODE NAME
  nscd    25701 root    8u  unix 0xffff984f7d1e7800      0t0 108101 /var/run/nscd/socket type=STREAM
  # fuser /var/run/nscd/socket
  /run/nscd/socket: 25701
  # readlink -f /run/nscd/socket
  /run/nscd/socket
  # lsof /run/nscd/socket
  # fuser /run/nscd/socket
  /run/nscd/socket: 2570
<git:phahn/49967-nscd> → <https://git.knut.univention.de/univention/ucs/commits/phahn/49967-nscd>
(In reply to Philipp Hahn from comment #5)
> Why does the Nagios check complain about multiple NSCD?
> It should be sufficient that at least one NSCD is running AND responding.

AFAIR we introduced that check in the past because we had situations where parallel NSCD processes resulted in some sort of deadlock.
(In reply to Philipp Hahn from comment #5)
> Why does the Nagios check complain about multiple NSCD?

IIRC we had an issue and had to change the check, because nscd processes running in docker containers were detected and counted.
(In reply to Erik Damrose from comment #9)
> (In reply to Philipp Hahn from comment #5)
> > Why does the Nagios check complain about multiple NSCD?
>
> IIRC we had an issue and had to change the check, because nscd processes
> running in docker containers were detected and counted.

Yes, the appbox images start an nscd process as well. If we just check for at least one running nscd, it might be the one inside the appbox image, while the one on the host died.
(In reply to Ingo Steuwer from comment #8)
> (In reply to Philipp Hahn from comment #5)
> > Why does the Nagios check complain about multiple NSCD?
> > It should be sufficient that at least one NSCD is running AND responding.
>
> AFAIR we introduced that check in the past as we had situations where having
> parallel NSCD processes resulted in some sort of deadlock.

A Bugzilla search for "ALL comp:nscd" has not revealed such issues; any hint?
I can only think of running nscd with "persistent * yes" and some locking issues on those DB files (UCRV:nscd/.*/persistent).

(In reply to Erik Damrose from comment #9)
> (In reply to Philipp Hahn from comment #5)
> > Why does the Nagios check complain about multiple NSCD?
>
> IIRC we had an issue and had to change the check, because nscd processes
> running in docker containers were detected and counted.

Because of containers I tried using "start-stop-daemon --test", but it also does not handle that correctly. "/proc/net/unix", however, only lists the sockets of the same NS, so it also does not find the nscd of containers (as far as I tested that).

We can test for processes with /proc/$pid/ns/net == /proc/self/ns/net and having /proc/$pid/fd/* in /proc/net/unix, but is it necessary to warn about processes serving the NSCD socket but not being nscd (a container could use "unscd")?
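The /proc-based test described above could be sketched roughly like this. It is an illustration under assumptions, not the implementation that was committed: the helper names `unix_socket_inodes` and `same_netns` are hypothetical, and the optional second argument of `unix_socket_inodes` exists only so the parser can be tried against a sample file instead of the live /proc/net/unix.

```shell
#!/bin/sh
# Print the inode of every UNIX socket bound to path $1.
# $2 is a hypothetical override for the table file (default /proc/net/unix),
# added only so the function can be exercised against a sample file.
unix_socket_inodes() {
    # Columns: Num RefCount Protocol Flags Type St Inode Path
    # NR > 1 skips the header line.
    awk -v path="$1" 'NR > 1 && $8 == path { print $7 }' "${2:-/proc/net/unix}"
}

# Succeed if process $1 is in the same network namespace as this shell.
# Reading /proc/PID/ns/net of another user's process usually needs root.
same_netns() {
    mine=$(readlink /proc/self/ns/net) || return 1
    theirs=$(readlink "/proc/$1/ns/net" 2>/dev/null) || return 1
    [ "$theirs" = "$mine" ]
}
```

A process serves the socket if one of its /proc/$pid/fd/* links reads `socket:[INODE]` for an inode printed by `unix_socket_inodes`. The Path column holds the name the socket was bound with, so the sketch has to be called with that name, for example /var/run/nscd/socket as in the `lsof` output earlier in this thread. Paths containing whitespace are not handled.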
(In reply to Philipp Hahn from comment #11)
> (In reply to Ingo Steuwer from comment #8)
> > (In reply to Philipp Hahn from comment #5)
> > > Why does the Nagios check complain about multiple NSCD?
> > > It should be sufficient that at least one NSCD is running AND responding.
> >
> > AFAIR we introduced that check in the past as we had situations where having
> > parallel NSCD processes resulted in some sort of deadlock.
>
> A Bugzilla search for "ALL comp:nscd" has not revealed such issues; any hint?

Bug #42812
(In reply to Stefan Gohmann from comment #12)
> > > AFAIR we introduced that check in the past as we had situations where having
> > > parallel NSCD processes resulted in some sort of deadlock.
> >
> > A Bugzilla search for "ALL comp:nscd" has not revealed such issues; any hint?
>
> Bug #42812

That bug was about "NSCD in another container", not "a 2nd NSCD in the same name-space" (NS). It is what triggered the initial rewrite to use lsof/fuser to find the NSCD belonging to the same NS as the one the Nagios check runs in. This is why my patch uses `pgrep --ns $$`.
This is getting urgent for the customer, as all these false errors lead to discussions with controlling about whether each one is a real failure or not.
[4.4-2] 62d22bd0a4 Bug #49967: Check for running nscd only
 doc/errata/staging/univention-nagios.yaml          |  11 +++
 nagios/univention-nagios/debian/changelog          |   6 ++
 .../usr/lib/nagios/plugins/check_univention_nscd   | 108 +++++----------------
 3 files changed, 41 insertions(+), 84 deletions(-)

Package: univention-nagios
Version: 12.0.1-10A~4.4.0.201909301205
Branch: ucs_4.4-0
Scope: errata4.4-2

[4.4-2] 94fd35f577 Bug #49967: univention-nagios 12.0.1-10A~4.4.0.201909301205
 doc/errata/staging/univention-nagios.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

PS: We probably should backport this change back to UCS-4.3 after QA.
Tests look good and the check works via the command line, but not via Nagios NRPE (and so not in the Nagios web interface):

  $ /usr/lib/nagios/plugins/check_nrpe -H 10.200.7.80 -c UNIVENTION_NSCD2
  CRITICAL: nscd not responding!

It seems that when executed via NRPE, PATH does not contain /usr/sbin, so nscd can't be executed. If I change

  -timeout 3 nscd -g >/dev/null 2>/dev/null ||
  +timeout 3 /usr/sbin/nscd -g >/dev/null 2>/dev/null ||

it works.
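The failure mode described above (a command that resolves in an interactive shell but not under a reduced PATH) can be reproduced with a small sketch. The helper name `resolvable_with_path` and the `SAFE_PATH` value are hypothetical and only illustrate the idea; they are not taken from the plugin or from the SUID wrappers.

```shell
#!/bin/sh
# Succeed if external command $1 can be found using search path $2.
# Runs in a subshell so the caller's PATH is left untouched.
# Only meaningful for external programs: builtins resolve regardless of PATH.
resolvable_with_path() {
    ( PATH="$2"; command -v "$1" >/dev/null 2>&1 )
}

# Hypothetical defensive default for a check script: list the sbin
# directories explicitly instead of trusting the caller's environment.
SAFE_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
```

On a host where nscd is installed only in /usr/sbin, `resolvable_with_path nscd /usr/bin:/bin` would be expected to fail, while `resolvable_with_path nscd "$SAFE_PATH"` would succeed. Calling the binary by its absolute path, as in the diff above, avoids the lookup altogether.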
(In reply to Felix Botner from comment #16)
> seems that executed via NRPE PATH does not contain /usr/sbin and nscd can't
> be executed, if I change

I fixed all SUID wrappers to set up a proper PATH, which is the right thing to do. While doing that I also noticed that the wrappers did not terminate on errors. Also, they were not compiled with build-hardening enabled, which is a security risk.

[4.4-2] 849d7357ab Bug #49967 nagios: Set PATH in SUID wrappers
 nagios/univention-nagios/Makefile                  | 36 ++++++++++++++++++++++
 nagios/univention-nagios/debian/changelog          |  6 ++++
 nagios/univention-nagios/debian/control            |  7 +++--
 nagios/univention-nagios/debian/rules              | 20 ++----------
 .../check_univention_joinstatus_suidwrapper.c      | 24 ++++++++++-----
 .../plugins/check_univention_ldap_suidwrapper.c    |  3 +-
 .../plugins/check_univention_nscd_suidwrapper.c    | 24 ++++++++++-----
 ...heck_univention_slapd_mdb_maxsize_suidwrapper.c | 21 ++++++++-----
 .../plugins/check_univention_winbind_suidwrapper.c | 20 +++++++++---
 9 files changed, 111 insertions(+), 50 deletions(-)

To verify that all works well, I added a very simple test:

[4.4-2] 7f9c6510a6 Bug #49967 test: Add minimal Nagios SUID wrapper test
 test/ucs-test/debian/changelog                |  6 ++++++
 test/ucs-test/tests/22_nagios/01removedconfig |  1 -
 test/ucs-test/tests/22_nagios/07suidwrapper   | 28 +++++++++++++++++++++++++++
 3 files changed, 34 insertions(+), 1 deletion(-)

Package: univention-nagios
Version: 12.0.1-11A~4.4.0.201910021007
Branch: ucs_4.4-0
Scope: errata4.4-2

Package: ucs-test
Version: 9.0.3-75A~4.4.0.201910021005
Branch: ucs_4.4-0
Scope: errata4.4-2

[4.4-2] 0689fb9476 Bug #49967: univention-nagios 12.0.1-11A~4.4.0.201910021007
 doc/errata/staging/univention-nagios.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
What I tested:
  /usr/lib/nagios/plugins/check_nrpe -H $localhost -c UNIVENTION_NSCD2 -> Works again
  Nagios checks are green -> OK
  nscd check is critical if nscd is stopped -> OK
  ucs-test -s nagios -E dangerous -> OK
  YAML -> OK

-> Verified
<http://errata.software-univention.de/ucs/4.4/297.html>
*** Bug 50322 has been marked as a duplicate of this bug. ***