Bug 43450 – slapd entered failed state in test case 25reconnect_uldap

Bug 43450 - slapd entered failed state in test case 25reconnect_uldap


Summary:	slapd entered failed state in test case 25reconnect_uldap

Status:	CLOSED FIXED

Product:	UCS
Classification:	Unclassified
Component:	LDAP
Version:	UCS 4.2
Hardware:	Other Linux

Importance:	P5 normal (vote)
Target Milestone:	UCS 4.2
Assigned To:	Janek Walkenhorst
QA Contact:	Philipp Hahn

URL:
Keywords:	interim-2, systemd

Depends on:
Blocks:	43874
	Show dependency tree / graph

Reported:	2017-01-31 07:17 CET by Stefan Gohmann
Modified:	2017-09-26 10:47 CEST (History)
CC List:	3 users (show)

See Also:	43313
What kind of report is it?:	Development Internal
What type of bug is this?:	---
Who will be affected by this bug?:	---
How will those affected feel about the bug?:	---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Stefan Gohmann

2017-01-31 07:17:18 CET

From syslog:

Jan 30 20:20:52 master090 systemd[1]: Stopping LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)...
Jan 30 20:20:52 master090 systemd[1]: Stopped LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
Jan 30 20:20:52 master090 systemd[1]: Starting LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)...
Jan 30 20:20:52 master090 slapd[23035]: LDAP server already running.
Jan 30 20:20:52 master090 systemd[1]: slapd.service: control process exited, code=exited status=2
Jan 30 20:20:52 master090 systemd[1]: Failed to start LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
Jan 30 20:20:52 master090 systemd[1]: Unit slapd.service entered failed state.

ucs-test output:

*** BEGIN *** ['/usr/bin/python2.7', '25reconnect_uldap'] ***
*** 10_ldap/25reconnect_uldap *** Test reconnect mechanism of uldap ***
*** START TIME: 2017-01-30 20:19:57 ***
Setting ldap/client/retry/count

************************************************************************
Test: connect, search, stop, start, search
************************************************************************
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.

************************************************************************
Test: stop, connect - start after 9 seconds, search
************************************************************************
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.

************************************************************************
Test: connect, stop,  - start after 9 seconds, search
************************************************************************
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.

************************************************************************
Test: stop, connect - start after 11 seconds (which is too late), search
************************************************************************
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.

************************************************************************
Test: connect, stop - start after 11 seconds (which is too late), search
************************************************************************
Job for slapd.service canceled.
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.
invoke-rc.d: initscript slapd, action "start" failed.
### FAIL ###
Search was successful
###      ###
CLEANUP
Setting ldap/client/retry/count
Warning: Unit file of slapd.service changed on disk, 'systemctl daemon-reload' recommended.
Job for slapd.service failed. See 'systemctl status slapd.service' and 'journalctl -xn' for details.
invoke-rc.d: initscript slapd, action "start" failed.
*** END TIME: 2017-01-30 20:20:53 ***
*** TEST DURATION (H:MM:SS.ms): 0:00:56.360147 ***
*** END *** 1 ***

Comment 1 Stefan Gohmann

2017-01-31 07:39:11 CET

For this test case we could wait for a little bit more time but I think we need to fix it more generally.

If the slapd failed to start we need to ensure that the next restart will be possible, e.g.:
 systemctl start slapd  → failed 
 systemctl restart slapd → must be possible

Otherwise some basics will fail for example the LDAP extension registration.

@Janek, Philipp: is it possible to configure this in systemd?

Comment 2 Janek Walkenhorst

2017-01-31 17:54:29 CET

(In reply to Stefan Gohmann from comment #1)
> For this test case we could wait for a little bit more time but I think we
> need to fix it more generally.
> 
> If the slapd failed to start we need to ensure that the next restart will be
> possible, e.g.:
>  systemctl start slapd  → failed 
>  systemctl restart slapd → must be possible
This should be possible.

In this case the setup was:
                              systemctl start slapd
                              …check database
  systemctl stop slapd
  …no process → OK
                              …start process → OK but killed by stop thus FAILED

And then trying:
  systemctl stop slapd → no need to execute, already stopped
Or trying:
  systemctl start slapd
  …slapd is running → FAILED‽

So the "start" case was "lying" in that it says something went wrong when the daemon is running just fine.

Also the stop case seems to abort the start case but not cleanly, leading to a possible race.

My suggested fix is
	log_action_msg "LDAP server already running"
	exit 0
.

This should not be a problem with errors in starting, because if the start script fails it can be retried always.
Only script that say FAILED when it went well or say OK when it went wrong could lead to problems.


My guess is that the race is due to systemd killing the start script (TERM) but the script not killing its subprocesses. This may be because of the /lib/lsb/init-functions.d/40-systemd hook.

Comment 3 Janek Walkenhorst

2017-01-31 18:03:33 CET

I would also remove the
	sleep 5
It is there since CVS times and seems to not have any purpose given that the loop after it has error handling and retry.

This improves the performance if the init script and also reduces the chance of hitting the race in
 ucs-test/10_ldap/26reconnect_univention_ldapsearch

Comment 4 Philipp Hahn

2017-02-07 11:21:50 CET

OK: ucs-4.2-0@76262
 <https://refspecs.linuxbase.org/LSB_3.0.0/LSB-PDA/LSB-PDA/iniscrptact.html>
FIXED: ucs-4.2-0@76263 → r76482

OK: systemctl status slapd.service
OK: systemctl stop slapd.service
OK: systemctl start slapd.service
OK: systemctl restart slapd.service
~FAIL: killall slapd ; systemctl start slapd.service # NO slapd running

This is a generic systemd SysV-init compatibility bug, which violates the "behave sensibly" requirement from <https://www.debian.org/doc/debian-policy/ch-opersys.html#s-writing-init>. (It is not OpenLDAP specific, as I can repeat it with clamd, too.) Calling old SysV-init scripts would start the service again as expected, but due to the /lib/lsb/init-functions.d/40-systemd shim systemd can't distinguish "type=oneshot" services from "type=simple/forking" services and has to treat the exit code 0 from the init script as "everything okay", even when the service crashes later and is no longer running.
 killall slapd ; _SYSTEMCTL_SKIP_REDIRECT=1 /etc/init.d/slapd start
does show the difference as there slapd is restarted.

We may either switch slapd completely to systemd, or at least supplement the generated slapd.service:

/etc/systemd/system/slapd.service.d/custom.conf:
[Service]
Restart=on-failure  # or "always", as slapd exit(0) when SIGTERM killed?
PIDFile=/var/run/slapd/slapd.pid  # init.d tries to parse that dynamically
RemainAfterExit=no  # systemd-generator uses wrong "yes" here, causing this issue

Comment 5 Stefan Gohmann

2017-04-04 18:30:03 CEST

UCS 4.2 has been released:
 https://docs.software-univention.de/release-notes-4.2-0-en.html
 https://docs.software-univention.de/release-notes-4.2-0-de.html

If this error occurs again, please use "Clone This Bug".