Bug 49967 - Nagios reports false positive about nscd
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Monitoring (Prometheus or Nagios)
Version: UCS 4.3
Hardware: Other Linux
Importance: P5 normal
Target Milestone: UCS 4.4-2-errata
Assigned To: Philipp Hahn
QA Contact: Jürn Brodersen
Duplicates: 50322
Depends on:
Blocks: 50319 50322
Reported: 2019-08-06 16:30 CEST by Nico Stöckigt
Modified: 2019-10-07 13:35 CEST

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 4: Minor Usability: Impairs usability in secondary scenarios
Who will be affected by this bug?: 4: Will affect most installed domains
How will those affected feel about the bug?: 3: A User would likely not purchase the product
User Pain: 0.274
Enterprise Customer affected?: Yes
School Customer affected?: Yes
ISV affected?:
Waiting Support: Yes
Flags outvoted (downgraded) after PO Review:
Ticket number: 2019062821000381, 2019082021000696, 2019082821000351, 2019082921000331
Bug group (optional):
Max CVSS v3 score:
hahn: Patch_Available+


Description Nico Stöckigt univentionstaff 2019-08-06 16:30:14 CEST
When Nagios checks 'nscd' the following might occur

/usr/lib/nagios/plugins/check_univention_nscd_suidwrapper
CRITICAL: no instance of nscd bound to nscd socket. nscd might have crashed.

This is a false positive. It seems to be caused by a bug in 'fuser'; with 'lsof' the process bound to the socket is found correctly.

PH already has a fix for that.
Comment 1 Philipp Hahn univentionstaff 2019-08-06 16:37:48 CEST
(In reply to Nico Stöckigt from comment #0)
> PH already has a fix for that.

sed -i -e 's/fuser /lsof -t /' /usr/lib/nagios/plugins/check_univention_nscd

For the customer each failed Nagios check results in a Ticket being opened, which is a pain. It happens on multiple hosts.

In the past I experienced similar problems with `fuser` on UNIX sockets not finding processes. I have not yet looked at src:psmisc, but using `lsof -t` instead looks a lot more stable.
Comment 2 Jürn Brodersen univentionstaff 2019-08-06 16:56:25 CEST
I changed that to fuser because lsof had a problem...
https://forge.univention.org/bugzilla/show_bug.cgi?id=45414#c4

That bug was often triggered by OX
Comment 3 Jürn Brodersen univentionstaff 2019-08-20 17:45:14 CEST
"nscd -g" returns a non zero exit code if nscd isn't running. That might be a better alternative to lsof/fuser.
Comment 4 Christian Völker univentionstaff 2019-08-28 11:58:38 CEST
Happened again on customer side.

Manually added hot-fix got reverted during an update.

Customer has serious issues with this as internal controlling is metering the number of issues in Nagios. False alarms have to be explained.


Getting urgent there...
Comment 5 Philipp Hahn univentionstaff 2019-08-29 12:24:14 CEST
Another Ticket #2019082921000331

I had a short look at `fuser.c`, but aborted that for other more important issues.
Why does the Nagios check complain about multiple NSCD?
It should be sufficient that at least one NSCD is running AND responding.

We should just use
  timeout 3 nscd -g &>/dev/null
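
A minimal sketch of how this probe could be wrapped into a Nagios-style result. It relies on the standard plugin exit codes (0 = OK, 2 = CRITICAL). The function name `check_responding` and the OK message text are made up for illustration; the probe command is passed as a parameter only so the sketch can be tried on a machine without nscd.

```shell
#!/bin/sh
# Standard Nagios plugin exit codes.
STATE_OK=0
STATE_CRITICAL=2

# check_responding PROBE [ARGS...]
# Hypothetical helper: run PROBE with a 3 second limit and report the
# outcome in Nagios style. A non-zero exit status or a timeout counts
# as "not responding". The OK message text is an assumption.
check_responding() {
    if timeout 3 "$@" >/dev/null 2>&1; then
        echo "OK: nscd is responding"
        return "$STATE_OK"
    fi
    echo "CRITICAL: nscd not responding!"
    return "$STATE_CRITICAL"
}

# Intended use on a host with nscd installed:
#   check_responding nscd -g
#   exit $?
```

Because only the exit status of the probe is evaluated, this does not depend on fuser or lsof finding the process behind the socket.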
Comment 6 Philipp Hahn univentionstaff 2019-08-29 13:00:42 CEST
Maybe also something like
  start-stop-daemon -T -p /var/run/nscd/nscd.pid --exec /usr/sbin/nscd
or some
  systemctl --quiet is-enabled nscd.service
  ! systemctl --quiet is-failed nscd.service


Also, `lsof` behaves strangely for paths containing symbolic links:

# ls -lid /var/run /run /run/nscd/socket
  1142 drwxr-xr-x 30 root root 1180 Aug 29 12:35 /run
108101 srw-rw-rw-  1 root root    0 Aug 29 12:55 /run/nscd/socket
523581 lrwxrwxrwx  1 root root    4 Mär 11 15:02 /var/run -> /run

# readlink -f /var/run/nscd/socket 
/run/nscd/socket
# lsof /var/run/nscd/socket
COMMAND   PID USER   FD   TYPE             DEVICE SIZE/OFF   NODE NAME
nscd    25701 root    8u  unix 0xffff984f7d1e7800      0t0 108101 /var/run/nscd/socket type=STREAM
# fuser /var/run/nscd/socket
/run/nscd/socket:    25701

# readlink -f /run/nscd/socket 
/run/nscd/socket
# lsof /run/nscd/socket
# fuser  /run/nscd/socket
/run/nscd/socket:    2570
Comment 7 Philipp Hahn univentionstaff 2019-08-30 15:49:30 CEST
<git:phahn/49967-nscd> → <https://git.knut.univention.de/univention/ucs/commits/phahn/49967-nscd>
Comment 8 Ingo Steuwer univentionstaff 2019-09-05 14:21:37 CEST
(In reply to Philipp Hahn from comment #5)
> Why does the Nagios check complain about multiple NSCD?
> It should be sufficient that at least one NSCD is running AND responding.

AFAIR we introduced that check in the past as we had situations where having parallel NSCD processes resulted in some sort of deadlock.
Comment 9 Erik Damrose univentionstaff 2019-09-05 14:31:08 CEST
(In reply to Philipp Hahn from comment #5)
> Why does the Nagios check complain about multiple NSCD?

IIRC we had an issue and had to change the check, because nscd processes running in docker containers were detected and counted.
Comment 10 Jürn Brodersen univentionstaff 2019-09-05 14:51:15 CEST
(In reply to Erik Damrose from comment #9)
> (In reply to Philipp Hahn from comment #5)
> > Why does the Nagios check complain about multiple NSCD?
> 
> IIRC we had an issue and had to change the check, because nscd processes
> running in docker containers were detected and counted.

Yes, the appbox images start an nscd process as well. If we only check for at least one running nscd, it might be the one inside the appbox image, while the one on the host has died.
Comment 11 Philipp Hahn univentionstaff 2019-09-05 17:24:09 CEST
(In reply to Ingo Steuwer from comment #8)
> (In reply to Philipp Hahn from comment #5)
> > Why does the Nagios check complain about multiple NSCD?
> > It should be sufficient that at least one NSCD is running AND responding.
> 
> AFAIR we introduced that check in the past as we had situations where having
> parallel NSCD processes resulted in some sort of deadlock.

A Bugzilla search for "ALL comp:nscd" has not revealed such issues; any hint?
I can only think of running nscd with "persistent * yes" and some locking issues on those DB files (UCRV:nscd/.*/persistent).

(In reply to Erik Damrose from comment #9)
> (In reply to Philipp Hahn from comment #5)
> > Why does the Nagios check complain about multiple NSCD?
> 
> IIRC we had an issue and had to change the check, because nscd processes
> running in docker containers were detected and counted.

Because of containers I tried using "start-stop-daemon --test", but it also does not handle that correctly:

However, "/proc/net/unix" only lists the sockets of the same NS, so it does not find the nscd of containers either (as far as I tested that).

We could test for processes with /proc/$pid/ns/net == /proc/self/ns/net that have one of their /proc/$pid/fd/* sockets listed in /proc/net/unix. But is it necessary to warn about processes serving the NSCD socket that are not nscd (a container could use "unscd")?
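
The two building blocks of that idea could look roughly like the sketch below. The function names and the optional second parameters (an alternative proc root and an alternative socket table file) are hypothetical and exist only so the sketch can be tried against fixture data. Matching the socket inode against a process's open file descriptors is left out.

```shell
#!/bin/sh
# same_netns PID [PROC_ROOT]
# Succeed if PID is in the same network namespace as the caller.
# The namespace links read like "net:[4026531992]"; equal link targets
# mean the same namespace. PROC_ROOT defaults to /proc (assumption: the
# parameter is only there for testing).
same_netns() {
    proc_root=${2:-/proc}
    mine=$(readlink "$proc_root/self/ns/net") || return 2
    theirs=$(readlink "$proc_root/$1/ns/net") || return 2
    [ "$mine" = "$theirs" ]
}

# socket_inodes SOCKET_PATH [UNIX_TABLE]
# Print the inode of every UNIX socket listed under exactly SOCKET_PATH.
# In /proc/net/unix the inode is column 7 and the path column 8; the
# first line is a header. No symlink resolution is done here, the path
# is compared as a plain string.
socket_inodes() {
    awk -v path="$1" 'NR > 1 && $8 == path { print $7 }' "${2:-/proc/net/unix}"
}
```

On a live system this might be used as `same_netns "$pid" && socket_inodes /var/run/nscd/socket`. An open socket shows up in /proc/$pid/fd as a link of the form `socket:[INODE]`, which is what the printed inode would then be compared against.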
Comment 12 Stefan Gohmann univentionstaff 2019-09-09 19:47:06 CEST
(In reply to Philipp Hahn from comment #11)
> (In reply to Ingo Steuwer from comment #8)
> > (In reply to Philipp Hahn from comment #5)
> > > Why does the Nagios check complain about multiple NSCD?
> > > It should be sufficient that at least one NSCD is running AND responding.
> > 
> > > AFAIR we introduced that check in the past as we had situations where having
> > parallel NSCD processes resulted in some sort of deadlock.
> 
> A Bugzilla search for "ALL comp:nscd" has not revealed such issues; any hint?

Bug #42812
Comment 13 Philipp Hahn univentionstaff 2019-09-10 07:18:30 CEST
(In reply to Stefan Gohmann from comment #12)
> > > AFAIR we introduced that check in the past as we had situations where having
> > > parallel NSCD processes resulted in some sort of deadlock.
> > 
> > A Bugzilla search for "ALL comp:nscd" has not revealed such issues; any hint?
> 
> Bug #42812

That Bug was about "NSCD in another container", not "a 2nd NSCD in the same name-space" (NS).
It is what triggered the initial rewrite to use lsof/fuser, in order to find the NSCD in the same NS as the one the Nagios check runs in.
This is why my patch uses `pgrep --ns $$`.
Comment 14 Christian Völker univentionstaff 2019-09-24 15:16:59 CEST
This is getting urgent for the customer: each of these false errors causes a discussion with controlling about whether it was a real failure or not.
Comment 15 Philipp Hahn univentionstaff 2019-09-30 12:08:59 CEST
[4.4-2] 62d22bd0a4 Bug #49967: Check for running nscd only
 doc/errata/staging/univention-nagios.yaml          |  11 +++
 nagios/univention-nagios/debian/changelog          |   6 ++
 .../usr/lib/nagios/plugins/check_univention_nscd   | 108 +++++----------------
 3 files changed, 41 insertions(+), 84 deletions(-)

Package: univention-nagios
Version: 12.0.1-10A~4.4.0.201909301205
Branch: ucs_4.4-0
Scope: errata4.4-2

[4.4-2] 94fd35f577 Bug #49967: univention-nagios 12.0.1-10A~4.4.0.201909301205
 doc/errata/staging/univention-nagios.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

PS: We should probably backport this change to UCS-4.3 after QA.
Comment 16 Felix Botner univentionstaff 2019-10-01 18:11:37 CEST
The tests look good and the check works on the command line, but not via Nagios NRPE (and so not in the Nagios web interface):

$ /usr/lib/nagios/plugins/check_nrpe -H 10.200.7.80 -c UNIVENTION_NSCD2
CRITICAL: nscd not responding!

It seems that when executed via NRPE, PATH does not contain /usr/sbin, so nscd can't be executed. If I change

-timeout 3 nscd -g >/dev/null 2>/dev/null ||
+timeout 3 /usr/sbin/nscd -g >/dev/null 2>/dev/null ||

it works.
Comment 17 Philipp Hahn univentionstaff 2019-10-02 10:10:39 CEST
(In reply to Felix Botner from comment #16)
> It seems that when executed via NRPE, PATH does not contain /usr/sbin, so nscd
> can't be executed. If I change

I fixed all SUID wrappers to set up a proper PATH, which is the right thing to do. While doing that I also noticed that the wrappers did not terminate on errors. They were also not compiled with build-hardening enabled, which is a security risk.
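
The actual change lives in the C wrappers and is not reproduced here. As a shell-level illustration of the same idea, the sketch below pins PATH before running a command, so the lookup no longer depends on the environment the caller provides. `SAFE_PATH`, its value, and the function name `run_with_safe_path` are assumptions made for this example.

```shell
#!/bin/sh
# Known-good search path; the exact value is an assumption for this sketch.
SAFE_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# run_with_safe_path CMD [ARGS...]
# Hypothetical helper: run an external CMD in a subshell whose PATH is
# pinned to SAFE_PATH. The caller's own PATH is left untouched because
# the assignment happens inside the subshell.
run_with_safe_path() {
    (
        PATH=$SAFE_PATH
        export PATH
        exec "$@"
    )
}
```

With this, a call such as `run_with_safe_path timeout 3 nscd -g` would find binaries in /usr/sbin even if the calling environment does not list that directory.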

[4.4-2] 849d7357ab Bug #49967 nagios: Set PATH in SUID wrappers
 nagios/univention-nagios/Makefile                  | 36 ++++++++++++++++++++++
 nagios/univention-nagios/debian/changelog          |  6 ++++
 nagios/univention-nagios/debian/control            |  7 +++--
 nagios/univention-nagios/debian/rules              | 20 ++----------
 .../check_univention_joinstatus_suidwrapper.c      | 24 ++++++++++-----
 .../plugins/check_univention_ldap_suidwrapper.c    |  3 +-
 .../plugins/check_univention_nscd_suidwrapper.c    | 24 ++++++++++-----
 ...heck_univention_slapd_mdb_maxsize_suidwrapper.c | 21 ++++++++-----
 .../plugins/check_univention_winbind_suidwrapper.c | 20 +++++++++---
 9 files changed, 111 insertions(+), 50 deletions(-)

To verify that all works well, I added a very simple test:

[4.4-2] 7f9c6510a6 Bug #49967 test: Add minimal Nagios SUID wrapper test
 test/ucs-test/debian/changelog                |  6 ++++++
 test/ucs-test/tests/22_nagios/01removedconfig |  1 -
 test/ucs-test/tests/22_nagios/07suidwrapper   | 28 +++++++++++++++++++++++++++
 3 files changed, 34 insertions(+), 1 deletion(-)

Package: univention-nagios
Version: 12.0.1-11A~4.4.0.201910021007
Branch: ucs_4.4-0
Scope: errata4.4-2

Package: ucs-test
Version: 9.0.3-75A~4.4.0.201910021005
Branch: ucs_4.4-0
Scope: errata4.4-2

[4.4-2] 0689fb9476 Bug #49967: univention-nagios 12.0.1-11A~4.4.0.201910021007
 doc/errata/staging/univention-nagios.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 18 Jürn Brodersen univentionstaff 2019-10-02 12:29:47 CEST
What I tested:

/usr/lib/nagios/plugins/check_nrpe -H $localhost -c UNIVENTION_NSCD2
-> Works again

Nagios checks are green -> OK
nscd check is critical if nscd is stopped -> OK
ucs-test -s nagios -E dangerous -> OK

YAML -> OK

-> Verified
Comment 19 Erik Damrose univentionstaff 2019-10-02 15:54:59 CEST
<http://errata.software-univention.de/ucs/4.4/297.html>
Comment 20 Philipp Hahn univentionstaff 2019-10-07 13:35:08 CEST
*** Bug 50322 has been marked as a duplicate of this bug. ***