Bug 43450 - slapd entered failed state in test case 25reconnect_uldap
slapd entered failed state in test case 25reconnect_uldap
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: LDAP
UCS 4.2
Other Linux
: P5 normal (vote)
: UCS 4.2
Assigned To: Janek Walkenhorst
Philipp Hahn
: interim-2, systemd
Depends on:
Blocks: 43874
  Show dependency treegraph
 
Reported: 2017-01-31 07:17 CET by Stefan Gohmann
Modified: 2017-09-26 10:47 CEST (History)
3 users (show)

See Also:
What kind of report is it?: Development Internal
What type of bug is this?: ---
Who will be affected by this bug?: ---
How will those affected feel about the bug?: ---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Stefan Gohmann univentionstaff 2017-01-31 07:17:18 CET
From syslog:

Jan 30 20:20:52 master090 systemd[1]: Stopping LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)...
Jan 30 20:20:52 master090 systemd[1]: Stopped LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
Jan 30 20:20:52 master090 systemd[1]: Starting LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)...
Jan 30 20:20:52 master090 slapd[23035]: LDAP server already running.
Jan 30 20:20:52 master090 systemd[1]: slapd.service: control process exited, code=exited status=2
Jan 30 20:20:52 master090 systemd[1]: Failed to start LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
Jan 30 20:20:52 master090 systemd[1]: Unit slapd.service entered failed state.

ucs-test output:

*** BEGIN *** ['/usr/bin/python2.7', '25reconnect_uldap'] ***
*** 10_ldap/25reconnect_uldap *** Test reconnect mechanism of uldap ***
*** START TIME: 2017-01-30 20:19:57 ***
Setting ldap/client/retry/count

************************************************************************
Test: connect, search, stop, start, search
************************************************************************
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.

************************************************************************
Test: stop, connect - start after 9 seconds, search
************************************************************************
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.

************************************************************************
Test: connect, stop,  - start after 9 seconds, search
************************************************************************
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.

************************************************************************
Test: stop, connect - start after 11 seconds (which is too late), search
************************************************************************
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.

************************************************************************
Test: connect, stop - start after 11 seconds (which is too late), search
************************************************************************
Job for slapd.service canceled.
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.
invoke-rc.d: initscript slapd, action "start" failed.
### FAIL ###
Search was successful
###      ###
CLEANUP
Setting ldap/client/retry/count
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.
Job for slapd.service failed. See 'systemctl status slapd.service' and 'journalctl -xn' for details.
invoke-rc.d: initscript slapd, action "start" failed.
*** END TIME: 2017-01-30 20:20:53 ***
*** TEST DURATION (H:MM:SS.ms): 0:00:56.360147 ***
*** END *** 1 ***
Comment 1 Stefan Gohmann univentionstaff 2017-01-31 07:39:11 CET
For this test case we could wait for a little bit more time but I think we need to fix it more generally.

If the slapd failed to start we need to ensure that the next restart will be possible, e.g.:
 systemctl start slapd  → failed 
 systemctl restart slapd → must be possible

Otherwise some basics will fail for example the LDAP extension registration.

@Janek, Philipp: is it possible to configure this in systemd?
Comment 2 Janek Walkenhorst univentionstaff 2017-01-31 17:54:29 CET
(In reply to Stefan Gohmann from comment #1)
> For this test case we could wait for a little bit more time but I think we
> need to fix it more generally.
> 
> If the slapd failed to start we need to ensure that the next restart will be
> possible, e.g.:
>  systemctl start slapd  → failed 
>  systemctl restart slapd → must be possible
This should be possible.

In this case the setup was:
                              systemctl start slapd
                              …check database
  systemctl stop slapd
  …no process → OK
                              …start process → OK but killed by stop thus FAILED

And then trying:
  systemctl stop slapd → no need to execute, already stopped
Or trying:
  systemctl start slapd
  …slapd is running → FAILED‽

So the "start" case was "lying" in that it says something went wrong when the daemon is running just fine.

Also the stop case seems to abort the start case but not cleanly, leading to a possible race.

My suggested fix is
	log_action_msg "LDAP server already running"
	exit 0
.

This should not be a problem with errors in starting, because if the start script fails it can be retried always.
Only script that say FAILED when it went well or say OK when it went wrong could lead to problems.


My guess is that the race is due to systemd killing the start script (TERM) but the script not killing its subprocesses. This may be because of the /lib/lsb/init-functions.d/40-systemd hook.
Comment 3 Janek Walkenhorst univentionstaff 2017-01-31 18:03:33 CET
I would also remove the
	sleep 5
It is there since CVS times and seems to not have any purpose given that the loop after it has error handling and retry.

This improves the performance if the init script and also reduces the chance of hitting the race in
 ucs-test/10_ldap/26reconnect_univention_ldapsearch
Comment 4 Philipp Hahn univentionstaff 2017-02-07 11:21:50 CET
OK: ucs-4.2-0@76262
 <https://refspecs.linuxbase.org/LSB_3.0.0/LSB-PDA/LSB-PDA/iniscrptact.html>
FIXED: ucs-4.2-0@76263 → r76482

OK: systemctl status slapd.service
OK: systemctl stop slapd.service
OK: systemctl start slapd.service
OK: systemctl restart slapd.service
~FAIL: killall slapd ; systemctl start slapd.service # NO slapd running

This is a generic systemd SysV-init compatibility bug, which violates the "behave sensibly" requirement from <https://www.debian.org/doc/debian-policy/ch-opersys.html#s-writing-init>. (It is not OpenLDAP specific, as I can repeat it with clamd, too.) Calling old SysV-init scripts would start the service again as expected, but due to the /lib/lsb/init-functions.d/40-systemd shim systemd can't distinguish "type=oneshot" services from "type=simple/forking" services and has to treat the exit code 0 from the init script as "everything okay", even when the service crashes later and is no longer running.
 killall slapd ; _SYSTEMCTL_SKIP_REDIRECT=1 /etc/init.d/slapd start
does show the difference as there slapd is restarted.

We may either switch slapd completely to systemd, or at least supplement the generated slapd.service:

/etc/systemd/system/slapd.service.d/custom.conf:
[Service]
Restart=on-failure  # or "always", as slapd exit(0) when SIGTERM killed?
PIDFile=/var/run/slapd/slapd.pid  # init.d tries to parse that dynamically
RemainAfterExit=no  # systemd-generator uses wrong "yes" here, causing this issue
Comment 5 Stefan Gohmann univentionstaff 2017-04-04 18:30:03 CEST
UCS 4.2 has been released:
 https://docs.software-univention.de/release-notes-4.2-0-en.html
 https://docs.software-univention.de/release-notes-4.2-0-de.html

If this error occurs again, please use "Clone This Bug".