Univention Bugzilla – Bug 53461
UDN / notifier stops working during the UCS@school 5.0 singlemaster tests
Last modified: 2021-11-24 17:53:02 CET
In the UCS@school 5.0 singlemaster tests the notifier stopped working reproducably during the ucs-test run. The situation is that /var/lib/univention-ldap/notify/transaction shows 8726 as processed and /var/lib/univention-ldap/listener/listener shows 8728 as next, but 8727 is missing in the files. But it has been written into the `cn=translog in OpenLDAP: root@master201:~# univention-ldapsearch -LLL -b cn=translog reqSession=8727 dn: reqSession=8727,cn=translog objectClass: auditObject reqStart: 20210617112524Z reqSession: 8727 reqDN: cn=OUksy1aa0zjk-Member-Verwaltungsnetz,cn=ucsschool,cn=groups,dc=autotest201,dc=local reqType: a ## translog check says this: root@master201:~# /usr/share/univention-directory-notifier/univention-translog check 2021-06-17 15:54:59,881:ERROR:/var/lib/univention-ldap/listener/listener:1:'8728 cn=OUksy1aa0zjk-Member-Verwaltungsnetz,cn=groupName,cn=temporary,cn=univention,dc=autotest201,dc=local d\n': Not continuous with /var/lib/univention-ldap/notify/transaction: '8726 cn=OUksy1aa0zjk-Member-Verwaltungsnetz,cn=groupName,cn=temporary,cn=univention,dc=autotest201,dc=local a' /var/lib/univention-ldap/listener/listener needs fixing: - pending transactions are not consecutive ## The timestamps of the critical transaction points to this section in journalctl: ================================================================================== Jun 17 13:25:24 master201 ADMINDIARY[5294]: {"username": "cn=admin", "args": {"name": "OUksy1aa0zjk-Member-Edukativnetz", "module": "groups/group"}, "tags": [], "context_id": Jun 17 13:25:24 master201 ADMINDIARY[5294]: {"username": "cn=admin", "args": {"name": "OUksy1aa0zjk-DC-Verwaltungsnetz", "module": "groups/group"}, "tags": [], "context_id": " Jun 17 13:25:24 master201 slapd[18773]: send_ldap_response: ber write failed Jun 17 13:25:24 master201 systemd[1]: univention-directory-notifier.service: Main process exited, code=killed, status=6/ABRT -- Subject: Unit process exited -- Defined-By: systemd -- Support: https://www.debian.org/support -- -- An ExecStart= process belonging to unit univention-directory-notifier.service has exited. -- -- The process' exit code is 'killed' and its exit status is 6. Jun 17 13:25:24 master201 ADMINDIARY[5294]: {"username": "cn=admin", "args": {"name": "OUksy1aa0zjk-Member-Verwaltungsnetz", "module": "groups/group"}, "tags": [], "context_id Jun 17 13:25:24 master201 systemd[1]: univention-directory-notifier.service: Failed with result 'signal'. -- Subject: Unit failed -- Defined-By: systemd -- Support: https://www.debian.org/support -- -- The unit univention-directory-notifier.service has entered the 'failed' state with result 'signal'. Jun 17 13:25:25 master201 ADMINDIARY[5294]: {"username": "cn=admin", "args": {"name": "admins-ksy1aa0zjk", "module": "groups/group"}, "tags": [], "context_id": "0fcc9c2d-da4b- Jun 17 13:25:25 master201 ADMINDIARY[5294]: {"username": "cn=admin", "args": {"name": "admins-ksy1aa0zjk", "module": "groups/group"}, "tags": [], "context_id": "20e46ab8-69fb- ================================================================================== systemctl status: ================================================================================== root@master201:~# systemctl status univention-directory-notifier.service ● univention-directory-notifier.service - Univention Directory Notifier Daemon Loaded: loaded (/lib/systemd/system/univention-directory-notifier.service; enabled; vendor preset: enabled) Active: inactive (dead) (Result: signal) since Thu 2021-06-17 13:25:27 CEST; 2h 34min ago Process: 5083 ExecStartPre=/usr/share/univention-directory-notifier/import_translog.sh (code=exited, status=0/SUCCESS) Process: 5087 ExecStart=/usr/sbin/univention-directory-notifier -F $UDN_OPTS (code=killed, signal=ABRT) Main PID: 5087 (code=killed, signal=ABRT) Jun 17 13:25:24 master201 systemd[1]: univention-directory-notifier.service: Main process exited, code=killed, status=6/ABRT Jun 17 13:25:24 master201 systemd[1]: univention-directory-notifier.service: Failed with result 'signal'. Jun 17 13:25:27 master201 systemd[1]: Stopped Univention Directory Notifier Daemon. ================================================================================== notifier.log ================================================================================== 17.06.21 13:22:23.411 DEBUG_INIT 17.06.21 13:22:23.448 TRANSFILE ( WARN ) : 7906 ldap_add() already exists 17.06.21 13:22:31.368 TRANSFILE ( ERROR ) : ldap_sasl_interactive_bind_s(): Can't contact LDAP server 17.06.21 13:22:41.671 DEBUG_INIT 17.06.21 13:22:41.712 TRANSFILE ( WARN ) : 8071 ldap_add() already exists 17.06.21 13:23:02.226 DEBUG_INIT 17.06.21 13:23:46.502 DEBUG_INIT 17.06.21 13:24:21.629 DEBUG_INIT 17.06.21 13:25:01.900 TRANSFILE ( ERROR ) : ldap_sasl_interactive_bind_s(): Can't contact LDAP server 17.06.21 13:25:12.104 DEBUG_INIT 17.06.21 13:25:12.145 TRANSFILE ( WARN ) : 8567 ldap_add() already exists 17.06.21 13:25:24.987 TRANSFILE ( ERROR ) : ldap_sasl_interactive_bind_s(): Can't contact LDAP server ================================================================================== When I start the notifier again, this is logged in notifier.log: ================================================================================== 17.06.21 16:04:53.684 DEBUG_INIT 17.06.21 16:04:53.730 TRANSFILE ( WARN ) : 8727 ldap_add() already exists ================================================================================== And replication resumes as normal. So there are four questions here: 1. Why did the Notifier run into an abort between writing to LDAP and writing to transaction file? 2. Why didn't it log what is wrong= 3. what are those warnings that start with "7 ldap_add() already exists" (which are also present on a UCS 4.4 system)? 4. why doesn't systemd start the service again
> 4. why doesn't systemd start the service again There is a 10 second timeout between restart notifier restarts configured in the service definition. If the service is explicitly stopped during that time the restart cycle is aborted until the service is explicitly started again. In this case a ucs@school import test was runningt. The ucs@school import stops the notifier service but only starts it again if it was already running, which it wasn't because it had just crashed. (keyword: "stopped_notifier")
(In reply to Jürn Brodersen from comment #1) > > 4. why doesn't systemd start the service again > > There is a 10 second timeout between restart notifier restarts configured in > the service definition. If the service is explicitly stopped during that > time the restart cycle is aborted until the service is explicitly started > again. > > In this case a ucs@school import test was runningt. The ucs@school import > stops the notifier service but only starts it again if it was already > running, which it wasn't because it had just crashed. (keyword: > "stopped_notifier") Maybe it was caused by the broken migration of init.d to systemd in UCS@school. see stopped_notifer() in git:cc0a5d9d096a026ad8053a928989ff4ed0660ff6 ucs-school-lib/modules/ucsschool/lib/models/utils.py (which used "systemctl is-active" (instead of the process list), which does probably not correctly detect the service state). A fix might be git:9d51d3312c70d1f1a6c5068fbec715b87416b194, which is currently tested: start notifier always.
(In reply to Arvid Requate from comment #0) > 1. Why did the Notifier run into an abort between writing to LDAP and > writing to transaction file? Probably because the tests (re-)start slapd/UDN/UDL all the time. This might kill the connection between UDN and slapd, which otherwise runs daemonized in the background. > 3. what are those warnings that start with "7 ldap_add() already exists" > (which are also present on a UCS 4.4 system)? UDN read the pending transactions from /var/lib/univention-ldap/listener/listener and writes them to both file:/var/lib/univention-ldap/notify/transaction and ldap:cn=translog. When the (last) transaction has already been added to ldap:cn=translog with `ldap_add()` from the previous invocation (before the ABORT) the 2nd try after the restart will not succeed and print that warning instead. (FYI: UDN does NOT verify the transactions are the same - it assumes transactions to be immutable and `listener/listener` to be never changed) > '8728 cn=OUksy1aa0zjk-Member-Verwaltungsnetz,cn=groupName,cn=temporary,cn=univention,dc=autotest201,dc=local d\n': Not continuous with /var/lib/univention-ldap/notify/transaction: '8726 cn=OUksy1aa0zjk-Member-Verwaltungsnetz,cn=groupName,cn=temporary,cn=univention,dc=autotest201,dc=local a' ... > 17.06.21 16:04:53.730 TRANSFILE ( WARN ) : 8727 ldap_add() already exists ... > root@master201:~# univention-ldapsearch -LLL -b cn=translog reqSession=8727 > dn: reqSession=8727,cn=translog > objectClass: auditObject > reqStart: 20210617112524Z > reqSession: 8727 > reqDN: cn=OUksy1aa0zjk-Member-Verwaltungsnetz,cn=ucsschool,cn=groups,dc=autotest201,dc=local > reqType: a Please notice that UDN does TWO things per transaction: 1. save the transaction in LDAP to "cn=translog" 2. append the transaction to "/var/lib/univention-ldap/notify/transaction" This is NOT atomic! In this case (and again in a second case which just happened today) 1. succeeded but 2. failed: 1. The transaction was saved to LDAP as it can be retrieved by `univention-translog ldap 8727` 2. But the transaction was NOT properly appended to `listener/listener` leading to the above error message "Not continuous between 8726 and 8728" The error message today was slightly different where the text file contained binary garbage: > 2021-07-03 06:27:23,022:WARNING:Transaction 546 is invalid > Traceback (most recent call last): > File "/usr/share/univention-directory-notifier/univention-translog", line 1406, in <module> > exit(main()) > File "/usr/share/univention-directory-notifier/univention-translog", line 420, in main > return opt.func(opt) or 0 > File "/usr/share/univention-directory-notifier/univention-translog", line 464, in import_all > with Index(opt.translog_file) as index, Translog(opt.translog_file, index) as translog, ldapi(opt) as ld: > File "/usr/share/univention-directory-notifier/univention-translog", line 290, in __enter__ > self.seek(last) > File "/usr/share/univention-directory-notifier/univention-translog", line 361, in seek > raise IndexError(tid) > IndexError: 546 In both cases the file "listener/listener" was "corrupt" and UDN failed to parse it: It is supposed to be "one line per entry" with no binary garbage. During normal operation mode the file is only ever written by UDN src/notify.c:notify_dump_to_files(), which uses proper flock() and fallocate() to guarantee enough space for append is available. fclose() is supposed to flush any stream buffer. Currently I have no idea where that corruption comes from.
(In reply to Philipp Hahn from comment #3) ... > UDN read the pending transactions from > /var/lib/univention-ldap/listener/listener and writes them to both > file:/var/lib/univention-ldap/notify/transaction and ldap:cn=translog. ... > In both cases the file "listener/listener" was "corrupt" and UDN failed to > parse it: It is supposed to be "one line per entry" with no binary garbage. Correction: "notify/transaction" was corrupt! ("listener/listener" is written by our slapd.translog overlay or UDN when cascading)
(In reply to Philipp Hahn from comment #4) > (In reply to Philipp Hahn from comment #3) > ... > > UDN read the pending transactions from > > /var/lib/univention-ldap/listener/listener and writes them to both > > file:/var/lib/univention-ldap/notify/transaction and ldap:cn=translog. > ... > > In both cases the file "listener/listener" was "corrupt" and UDN failed to > > parse it: It is supposed to be "one line per entry" with no binary garbage. > > Correction: "notify/transaction" was corrupt! > ("listener/listener" is written by our slapd.translog overlay or UDN when > cascading) Correction: ... or UDL when cascading