Univention Bugzilla – Bug 51804
UDN transaction processing delayed until next transaction
Last modified: 2021-06-21 17:14:52 CEST
univention-directory-notifier: 13.0.3-2A~4.4.0.202007291251 slapd: 2.4.45+dfsg-1~bpo9+1A~4.4.0.202007242143 ucs: 4.4-5 errata793 seen in a test setup for a customer the test goes like this * creates ldap objects * delete ldap object * waits until id's in /var/lib/univention-ldap/last_id and /var/lib/univention-directory-listener/notifier_id are equal but "sometimes" the notifier seems to miss a transaction and /var/lib/univention-directory-listener/notifier_id is not updated -> more /var/lib/univention-ldap/last_id /var/lib/univention-directory-listener/notifier_id :::::::::::::: /var/lib/univention-ldap/last_id :::::::::::::: 199635 :::::::::::::: /var/lib/univention-directory-listener/notifier_id :::::::::::::: 199634 -> ls -la /var/lib/univention-ldap/listener/listener -rw-r--r-- 1 listener root 107 Aug 10 22:08 /var/lib/univention-ldap/listener/listener more /var/lib/univention-ldap/listener/listener 199635 cn=alexander..... d -> usr/share/univention-directory-listener/get_notifier_id.py 199634 notifer.log 10.08.20 22:08:12.145 TRANSFILE ( INFO ) : want to write to transaction file; id=199634 10.08.20 22:08:12.145 TRANSFILE ( INFO ) : wrote to transaction file; id=199634; dn=cn=96112523..., cmd=d 10.08.20 22:08:12.145 TRANSFILE ( INFO ) : wrote to transaction file; close 10.08.20 22:08:12.145 TRANSFILE ( PROCESS ) : Added to cache pos 17, id 199634 listener.log 10.08.20 22:08:12.153 LISTENER ( INFO ) : dntree_del_id: deleted id=180596 10.08.20 22:08:12.155 LISTENER ( INFO ) : dntree_get_id4dn: DN cn=96112523,.... not in id2dn 10.08.20 22:08:12.157 LISTENER ( INFO ) : Last Notifier ID: 199634 10.08.20 22:08:27.172 LISTENER ( INFO ) : running postrun handlers -> univention-ldapsearch -b cn=translog reqSession=199634 dn: reqSession=199634,cn=translog ... reqType: d univention-ldapsearch -b cn=translog reqSession=199635 -> None So 199634 has been processed but for the pending transaction 199635 the notifier has not been "triggerd". Notifier/Listener/LDAP all running at this point but nothing happens. Creating a new LDAP object, or even echo aa > /var/lib/univention-ldap/listener/a triggers the notifier and the transaction is processed -> echo aa > /var/lib/univention-ldap/listener/a -> notifier 10.08.20 22:08:12.145 TRANSFILE ( INFO ) : wrote to transaction file; id=199634; dn=cn=96112523,c...., cmd=d 10.08.20 22:08:12.145 TRANSFILE ( INFO ) : wrote to transaction file; close 10.08.20 22:08:12.145 TRANSFILE ( PROCESS ) : Added to cache pos 17, id 199634 11.08.20 10:33:27.808 TRANSFILE ( INFO ) : want to write to transaction file; id=199635 11.08.20 10:33:27.808 TRANSFILE ( INFO ) : wrote to transaction file; id=199635; dn=cn=alexander..., cmd=d 11.08.20 10:33:27.808 TRANSFILE ( INFO ) : wrote to transaction file; close 11.08.20 10:33:27.808 TRANSFILE ( PROCESS ) : Added to cache pos 18, id 199635 Is there a problem in the notifiers signal handling/F_NOTIFY?
Is this a regression caused by Bug #51722?
(In reply to Florian Best from comment #1) > Is this a regression caused by Bug #51722? according to ProfServ this started to happen way before 2020-08-05_698_univention-directory-notifier.yaml
I was just able to trigger the observed behaviour in UCS 3.2. As a further notice: we also have a deployment with a backup server where this problem also occurs. The master already has the problems as described in the OP. When running a loop like this: while echo 'a' > /var/lib/univention-ldap/listener/a ; do sleep 10s; done everything seems to work as it should, but the backup server now seems to stop although there is a pending transaction in /var/lib/univention-ldap/listener/listener. Running the loop above on the backup fixes this issue as well.
Notes from Arvid: I guess that the unset_listener_callback() call which is made in check_callbacks() should be moved to notify_listener_change_callback() shortly after you create the new listener file and while you still have the lock. Otherwise you always have a race condition. Philipp mentioned the race condition already in Bug #41687.
I can recompile and test if you provide a patch.
I applied a patch as suggested by Arvid. The problem still occurs. diff --git a/management/univention-directory-notifier/src/network.c b/management/univention-directory-notifier/src/network.c index 809f217e57..8630182754 100644 --- a/management/univention-directory-notifier/src/network.c +++ b/management/univention-directory-notifier/src/network.c @@ -63,7 +63,6 @@ extern void set_listener_callback ( int sig, siginfo_t *si, void *data); extern int get_schema_callback (); extern int get_listener_callback (); extern void unset_schema_callback (); -extern void unset_listener_callback (); extern NotifyId_t notify_last_id; @@ -294,7 +293,6 @@ void check_callbacks() } if ( get_listener_callback () ) { notify_listener_change_callback ( 0, NULL, NULL); - unset_listener_callback(); } } int network_client_main_loop ( ) diff --git a/management/univention-directory-notifier/src/notify.c b/management/univention-directory-notifier/src/notify.c index 76baa23f88..801a48a4f7 100644 --- a/management/univention-directory-notifier/src/notify.c +++ b/management/univention-directory-notifier/src/notify.c @@ -56,6 +56,7 @@ extern NotifyId_t notify_last_id; extern Notify_t notify; extern long long notifier_lock_count; extern long long notifier_lock_time; +extern void unset_listener_callback (); extern unsigned long SCHEMA_ID; @@ -664,6 +665,7 @@ void notify_listener_change_callback(int sig, siginfo_t *si, void *data) close(fd); } + unset_listener_callback(); fclose_lock(&l_file); }
I think this is bug-type "7" (or at least 5) because a transaction is completely lost and not delivered to the listener. This may explain strange unpreroducible bugs.
(In reply to Florian Best from comment #7) > I think this is bug-type "7" (or at least 5) because a transaction is > completely lost and not delivered to the listener. This may explain strange > unpreroducible bugs. "completely lost" - No, not lost, just "ignored" until the next transaction, but this is bad too, im OK with 7
Seen again in one of our jenkins tests - UCS-4.4-6 Product Tests Component openid-connect * master & backup (4.4-6 e750) * setup-join on backup * on backup univention-join created the ldap object cn=backup * nothing more happened on the master, the ldap object cn=backup was present but the notifier wasn't triggered and cn=translog not updated, * as a result, the listener on masterdid not create ssl certs for the backup and the join failed * directly after creating another udm object on the master, everything worked like expected (cn=backup was processed, then the new object)
and another one -> /var/lib/univention-ldap/listener/listener 1169 cn=2009,cn=gidNumber,cn=temporary,cn=univention,dc=autotest125,dc=local a 1170 cn=2009,cn=gidNumber,cn=temporary,cn=univention,dc=autotest125,dc=local d 1171 cn=52:54:00:cd:d8:4c,cn=mac,cn=temporary,cn=univention,dc=autotest125,dc=local a 1172 cn=10.207.210.252,cn=aRecord,cn=temporary,cn=univention,dc=autotest125,dc=local a 1173 cn=uidNumber,cn=temporary,cn=univention,dc=autotest125,dc=local m 1174 cn=2009,cn=uidNumber,cn=temporary,cn=univention,dc=autotest125,dc=local d 1175 cn=slave125$,cn=uid,cn=temporary,cn=univention,dc=autotest125,dc=local a 1176 cn=slave125,cn=dc,cn=computers,dc=autotest125,dc=local a 1177 cn=slave125$,cn=uid,cn=temporary,cn=univention,dc=autotest125,dc=local d 1178 cn=slave125,cn=dc,cn=computers,dc=autotest125,dc=local m 1179 cn=slave125,cn=dc,cn=computers,dc=autotest125,dc=local m 1180 relativeDomainName=slave125,zoneName=autotest125.local,cn=dns,dc=autotest125,dc=local a 1181 zoneName=autotest125.local,cn=dns,dc=autotest125,dc=local m 1182 relativeDomainName=252.210,zoneName=207.10.in-addr.arpa,cn=dns,dc=autotest125,dc=local a 1183 zoneName=207.10.in-addr.arpa,cn=dns,dc=autotest125,dc=local m 1184 cn=10.207.210.252,cn=aRecord,cn=temporary,cn=univention,dc=autotest125,dc=local d 1185 cn=52:54:00:cd:d8:4c,cn=mac,cn=temporary,cn=univention,dc=autotest125,dc=local d 1186 cn=DC Slave Hosts,cn=groups,dc=autotest125,dc=local m -> ls -la /var/lib/univention-ldap/listener/listener -rw-r--r-- 1 listener root 1374 Okt 12 23:24 /var/lib/univention-ldap/listener/listener -> stat /var/lib/univention-ldap/listener/listener Datei: /var/lib/univention-ldap/listener/listener Größe: 1374 Blöcke: 8 EA Block: 4096 reguläre Datei Gerät: fd00h/64768d Inode: 540278 Verknüpfungen: 1 Zugriff: (0644/-rw-r--r--) Uid: ( 106/listener) Gid: ( 0/ root) Zugriff : 2020-10-13 09:51:54.769950073 +0200 Modifiziert: 2020-10-12 23:24:13.858872030 +0200 Geändert : 2020-10-12 23:24:13.858872030 +0200 Geburt : - so last change form 23:24 since then nothing happened in the notifier (and therefor listener) In the listener log i see 12.10.20 23:15:07.046 DEBUG_INIT 12.10.20 23:24:13.761 TRANSFILE ( ERROR ) : ldap_sasl_interactive_bind_s(): Can't contact LDAP server 12.10.20 23:24:19.120 DEBUG_INIT This comes from notify_dump_to_ldap from the notify_listener_change_callback callback. Could it be that we abort (and forget) the transaction when notify_dump_to_ldap aborts?
(In reply to Felix Botner from comment #10) > > In the listener log i see > > 12.10.20 23:15:07.046 DEBUG_INIT > 12.10.20 23:24:13.761 TRANSFILE ( ERROR ) : > ldap_sasl_interactive_bind_s(): Can't contact LDAP server > 12.10.20 23:24:19.120 DEBUG_INIT sorry, notifier.log
I guess it's a regression of Bug #51722 (commit a18a39c017), my current analysis is this: 1. We see the TRANSFILE log message complaining about a failed LDAP bind, which comes from notify_dump_to_ldap, called here, just before unlinking FILE_NAME_NOTIFIER_PRIV here: https://git.knut.univention.de/univention/ucs/-/blob/4.4-6/management/univention-directory-notifier/src/notify.c#L693 2. The notifier aborts and when it restarts it checks the FILE_NAME_LISTENER once, but only if FILE_NAME_NOTIFIER_PRIV doesn't exist. 3. Then it will not read the FILE_NAME_LISTENER again unless dnotify reports a modification. Workaround in this situation: pkill -RTMIN -f /usr/sbin/univention-directory-notifier I'll attach an idea for a patch.
Created attachment 10514 [details] Bug51804-idea.patch
I applied your patch. You can see the result here: https://github.com/thomasbach-dev/univention-corporate-server/tree/bug51804-idea.patch It doesn't fix the issue.
The tests 01_base/96rename* failed every jenkins run on multiple roles, when executing them, every fifth time or so because of this bug. I added a workaround 87c2f51f70 for Bug #52005. If this bug is fixed, please remove the workaround. This might be a good test to verify the fix.
UDN not processing transaction is responsible for Bug #51776 which we often observe with out Jenkins tests: This leads to tests not running which leads to delays in out daily development work.
Both of the patches here seemed to solve the issue as seen on the test. I applied them to the package. Waiting for test results. 04f798467d Bug #51804: yaml 367c1ad555 Bug #51804, Bug #52261, Bug #52439: yaml 68f6dd5597 Bug #51804: changelog 987df24e3c Revert "Bug #52005: Reactivate 96rename* and work around Bug#51804" 620bd158e7 Bug #51804: call unset_listener_callback before unlocking the listener file. Check listener file on restart, even if listener.priv exists Successful build Package: univention-directory-notifier Version: 13.0.3-3A~4.4.0.202012011723 Branch: ucs_4.4-0 Scope: errata4.4-7
The test did not fail since I applied these patches, which was five days ago. The tests previously failed every day, so this is a good indication that the patches worked.
MR: https://git.knut.univention.de/univention/ucs/-/merge_requests/50
OK - univention-directory-notifier OK - jenkins test OK - reactivate 96rename* tests OK - yaml OK - MR
<https://errata.software-univention.de/#/?erratum=4.4x835>