Bug 53461 - UDN / notifier stops working during the UCS@school 5.0 singlemaster tests
UDN / notifier stops working during the UCS@school 5.0 singlemaster tests
Status: NEW
Product: UCS
Classification: Unclassified
Component: Notifier (univention-directory-notifier)
UCS 5.0
Other Linux
: P5 normal (vote)
: ---
Assigned To: Arvid Requate
UCS maintainers
:
Depends on: 43687 51804
Blocks:
  Show dependency treegraph
 
Reported: 2021-06-17 16:08 CEST by Arvid Requate
Modified: 2021-11-24 17:53 CET (History)
8 users (show)

See Also:
What kind of report is it?: Development Internal
What type of bug is this?: ---
Who will be affected by this bug?: ---
How will those affected feel about the bug?: ---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Arvid Requate univentionstaff 2021-06-17 16:08:17 CEST
In the UCS@school 5.0 singlemaster tests the notifier stopped working reproducably during the ucs-test run.

The situation is that /var/lib/univention-ldap/notify/transaction shows 8726 as processed and /var/lib/univention-ldap/listener/listener shows 8728 as next, but 8727 is missing in the files. But it has been written into the `cn=translog in OpenLDAP:

root@master201:~# univention-ldapsearch -LLL -b cn=translog reqSession=8727
dn: reqSession=8727,cn=translog
objectClass: auditObject
reqStart: 20210617112524Z
reqSession: 8727
reqDN: cn=OUksy1aa0zjk-Member-Verwaltungsnetz,cn=ucsschool,cn=groups,dc=autotest201,dc=local
reqType: a


## translog check says this:

root@master201:~# /usr/share/univention-directory-notifier/univention-translog check
2021-06-17 15:54:59,881:ERROR:/var/lib/univention-ldap/listener/listener:1:'8728 cn=OUksy1aa0zjk-Member-Verwaltungsnetz,cn=groupName,cn=temporary,cn=univention,dc=autotest201,dc=local d\n': Not continuous with /var/lib/univention-ldap/notify/transaction: '8726 cn=OUksy1aa0zjk-Member-Verwaltungsnetz,cn=groupName,cn=temporary,cn=univention,dc=autotest201,dc=local a'

/var/lib/univention-ldap/listener/listener needs fixing:
- pending transactions are not consecutive


## The timestamps of the critical transaction points to this section in journalctl:

==================================================================================
Jun 17 13:25:24 master201 ADMINDIARY[5294]: {"username": "cn=admin", "args": {"name": "OUksy1aa0zjk-Member-Edukativnetz", "module": "groups/group"}, "tags": [], "context_id": 
Jun 17 13:25:24 master201 ADMINDIARY[5294]: {"username": "cn=admin", "args": {"name": "OUksy1aa0zjk-DC-Verwaltungsnetz", "module": "groups/group"}, "tags": [], "context_id": "
Jun 17 13:25:24 master201 slapd[18773]: send_ldap_response: ber write failed
Jun 17 13:25:24 master201 systemd[1]: univention-directory-notifier.service: Main process exited, code=killed, status=6/ABRT
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- An ExecStart= process belonging to unit univention-directory-notifier.service has exited.
-- 
-- The process' exit code is 'killed' and its exit status is 6.
Jun 17 13:25:24 master201 ADMINDIARY[5294]: {"username": "cn=admin", "args": {"name": "OUksy1aa0zjk-Member-Verwaltungsnetz", "module": "groups/group"}, "tags": [], "context_id
Jun 17 13:25:24 master201 systemd[1]: univention-directory-notifier.service: Failed with result 'signal'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- The unit univention-directory-notifier.service has entered the 'failed' state with result 'signal'.
Jun 17 13:25:25 master201 ADMINDIARY[5294]: {"username": "cn=admin", "args": {"name": "admins-ksy1aa0zjk", "module": "groups/group"}, "tags": [], "context_id": "0fcc9c2d-da4b-
Jun 17 13:25:25 master201 ADMINDIARY[5294]: {"username": "cn=admin", "args": {"name": "admins-ksy1aa0zjk", "module": "groups/group"}, "tags": [], "context_id": "20e46ab8-69fb-
==================================================================================

systemctl status:
==================================================================================
root@master201:~# systemctl status univention-directory-notifier.service 
● univention-directory-notifier.service - Univention Directory Notifier Daemon
   Loaded: loaded (/lib/systemd/system/univention-directory-notifier.service; enabled; vendor preset: enabled)
   Active: inactive (dead) (Result: signal) since Thu 2021-06-17 13:25:27 CEST; 2h 34min ago
  Process: 5083 ExecStartPre=/usr/share/univention-directory-notifier/import_translog.sh (code=exited, status=0/SUCCESS)
  Process: 5087 ExecStart=/usr/sbin/univention-directory-notifier -F $UDN_OPTS (code=killed, signal=ABRT)
 Main PID: 5087 (code=killed, signal=ABRT)

Jun 17 13:25:24 master201 systemd[1]: univention-directory-notifier.service: Main process exited, code=killed, status=6/ABRT
Jun 17 13:25:24 master201 systemd[1]: univention-directory-notifier.service: Failed with result 'signal'.
Jun 17 13:25:27 master201 systemd[1]: Stopped Univention Directory Notifier Daemon.
==================================================================================

notifier.log
==================================================================================
17.06.21 13:22:23.411  DEBUG_INIT
17.06.21 13:22:23.448  TRANSFILE   ( WARN    ) : 7906 ldap_add() already exists
17.06.21 13:22:31.368  TRANSFILE   ( ERROR   ) : ldap_sasl_interactive_bind_s(): Can't contact LDAP server
17.06.21 13:22:41.671  DEBUG_INIT
17.06.21 13:22:41.712  TRANSFILE   ( WARN    ) : 8071 ldap_add() already exists
17.06.21 13:23:02.226  DEBUG_INIT
17.06.21 13:23:46.502  DEBUG_INIT
17.06.21 13:24:21.629  DEBUG_INIT
17.06.21 13:25:01.900  TRANSFILE   ( ERROR   ) : ldap_sasl_interactive_bind_s(): Can't contact LDAP server
17.06.21 13:25:12.104  DEBUG_INIT
17.06.21 13:25:12.145  TRANSFILE   ( WARN    ) : 8567 ldap_add() already exists
17.06.21 13:25:24.987  TRANSFILE   ( ERROR   ) : ldap_sasl_interactive_bind_s(): Can't contact LDAP server
==================================================================================

When I start the notifier again, this is logged in notifier.log:

==================================================================================
17.06.21 16:04:53.684  DEBUG_INIT
17.06.21 16:04:53.730  TRANSFILE   ( WARN    ) : 8727 ldap_add() already exists
==================================================================================

And replication resumes as normal.

So there are four questions here:

1. Why did the Notifier run into an abort between writing to LDAP and writing to transaction file?

2. Why didn't it log what is wrong=

3. what are those warnings that start with "7 ldap_add() already exists" (which are also present on a UCS 4.4 system)?

4. why doesn't systemd start the service again
Comment 1 Jürn Brodersen univentionstaff 2021-06-17 19:16:23 CEST
> 4. why doesn't systemd start the service again

There is a 10 second timeout between restart notifier restarts configured in the service definition. If the service is explicitly stopped during that time the restart cycle is aborted until the service is explicitly started again.

In this case a ucs@school import test was runningt. The ucs@school import stops the notifier service but only starts it again if it was already running, which it wasn't because it had just crashed. (keyword: "stopped_notifier")
Comment 2 Florian Best univentionstaff 2021-06-24 14:23:21 CEST
(In reply to Jürn Brodersen from comment #1)
> > 4. why doesn't systemd start the service again
> 
> There is a 10 second timeout between restart notifier restarts configured in
> the service definition. If the service is explicitly stopped during that
> time the restart cycle is aborted until the service is explicitly started
> again.
> 
> In this case a ucs@school import test was runningt. The ucs@school import
> stops the notifier service but only starts it again if it was already
> running, which it wasn't because it had just crashed. (keyword:
> "stopped_notifier")

Maybe it was caused by the broken migration of init.d to systemd in UCS@school.
see stopped_notifer() in git:cc0a5d9d096a026ad8053a928989ff4ed0660ff6 ucs-school-lib/modules/ucsschool/lib/models/utils.py (which used "systemctl is-active" (instead of the process list), which does probably not correctly detect the service state).
A fix might be git:9d51d3312c70d1f1a6c5068fbec715b87416b194, which is currently tested: start notifier always.
Comment 3 Philipp Hahn univentionstaff 2021-07-05 13:07:32 CEST
(In reply to Arvid Requate from comment #0)
> 1. Why did the Notifier run into an abort between writing to LDAP and
> writing to transaction file?

Probably because the tests (re-)start slapd/UDN/UDL all the time. This might kill the connection between UDN and slapd, which otherwise runs daemonized in the background.


> 3. what are those warnings that start with "7 ldap_add() already exists"
> (which are also present on a UCS 4.4 system)?

UDN read the pending transactions from /var/lib/univention-ldap/listener/listener and writes them to both file:/var/lib/univention-ldap/notify/transaction and ldap:cn=translog.

When the (last) transaction has already been added to ldap:cn=translog with `ldap_add()` from the previous invocation (before the ABORT) the 2nd try after the restart will not succeed and print that warning instead.
(FYI: UDN does NOT verify the transactions are the same - it assumes transactions to be immutable and `listener/listener` to be never changed)




> '8728 cn=OUksy1aa0zjk-Member-Verwaltungsnetz,cn=groupName,cn=temporary,cn=univention,dc=autotest201,dc=local d\n': Not continuous with /var/lib/univention-ldap/notify/transaction: '8726 cn=OUksy1aa0zjk-Member-Verwaltungsnetz,cn=groupName,cn=temporary,cn=univention,dc=autotest201,dc=local a'
...
> 17.06.21 16:04:53.730  TRANSFILE   ( WARN    ) : 8727 ldap_add() already exists
...
> root@master201:~# univention-ldapsearch -LLL -b cn=translog reqSession=8727
> dn: reqSession=8727,cn=translog
> objectClass: auditObject
> reqStart: 20210617112524Z
> reqSession: 8727
> reqDN: cn=OUksy1aa0zjk-Member-Verwaltungsnetz,cn=ucsschool,cn=groups,dc=autotest201,dc=local
> reqType: a

Please notice that UDN does TWO things per transaction:
1. save the transaction in LDAP to "cn=translog"
2. append the transaction to "/var/lib/univention-ldap/notify/transaction"
This is NOT atomic!

In this case (and again in a second case which just happened today) 1. succeeded but 2. failed:
1. The transaction was saved to LDAP as it can be retrieved by `univention-translog ldap 8727`
2. But the transaction was NOT properly appended to `listener/listener` leading to the above error message "Not continuous between 8726 and 8728"

The error message today was slightly different where the text file contained binary garbage:
> 2021-07-03 06:27:23,022:WARNING:Transaction 546 is invalid
> Traceback (most recent call last):
>   File "/usr/share/univention-directory-notifier/univention-translog", line 1406, in <module>
>     exit(main())
>   File "/usr/share/univention-directory-notifier/univention-translog", line 420, in main
>     return opt.func(opt) or 0
>   File "/usr/share/univention-directory-notifier/univention-translog", line 464, in import_all
>     with Index(opt.translog_file) as index, Translog(opt.translog_file, index) as translog, ldapi(opt) as ld:
>   File "/usr/share/univention-directory-notifier/univention-translog", line 290, in __enter__
>     self.seek(last)
>   File "/usr/share/univention-directory-notifier/univention-translog", line 361, in seek
>     raise IndexError(tid)
> IndexError: 546


In both cases the file "listener/listener" was "corrupt" and UDN failed to parse it: It is supposed to be "one line per entry" with no binary garbage.


During normal operation mode the file is only ever written by UDN src/notify.c:notify_dump_to_files(), which uses proper flock() and fallocate() to guarantee enough space for append is available. fclose() is supposed to flush any stream buffer.
Currently I have no idea where that corruption comes from.
Comment 4 Philipp Hahn univentionstaff 2021-07-05 13:13:13 CEST
(In reply to Philipp Hahn from comment #3)
...
> UDN read the pending transactions from
> /var/lib/univention-ldap/listener/listener and writes them to both
> file:/var/lib/univention-ldap/notify/transaction and ldap:cn=translog.
...
> In both cases the file "listener/listener" was "corrupt" and UDN failed to
> parse it: It is supposed to be "one line per entry" with no binary garbage.

Correction: "notify/transaction" was corrupt!
("listener/listener" is written by our slapd.translog overlay or UDN when cascading)
Comment 5 Philipp Hahn univentionstaff 2021-07-05 13:14:01 CEST
(In reply to Philipp Hahn from comment #4)
> (In reply to Philipp Hahn from comment #3)
> ...
> > UDN read the pending transactions from
> > /var/lib/univention-ldap/listener/listener and writes them to both
> > file:/var/lib/univention-ldap/notify/transaction and ldap:cn=translog.
> ...
> > In both cases the file "listener/listener" was "corrupt" and UDN failed to
> > parse it: It is supposed to be "one line per entry" with no binary garbage.
> 
> Correction: "notify/transaction" was corrupt!
> ("listener/listener" is written by our slapd.translog overlay or UDN when
> cascading)

Correction: ... or UDL when cascading