Bug 51804 - UDN transaction processing delayed until next transaction
UDN transaction processing delayed until next transaction
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Notifier (univention-directory-notifier)
UCS 4.4
Other Linux
: P5 normal (vote)
: UCS 4.4-7-errata
Assigned To: Julia Bremer
Felix Botner
https://git.knut.univention.de/univen...
:
Depends on:
Blocks: 53461 51776
  Show dependency treegraph
 
Reported: 2020-08-11 12:45 CEST by Felix Botner
Modified: 2021-06-21 17:14 CEST (History)
9 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 7: Crash: Bug causes crash or data loss
Who will be affected by this bug?: 2: Will only affect a few installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.400
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments
Bug51804-idea.patch (522 bytes, patch)
2020-10-13 11:57 CEST, Arvid Requate
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Felix Botner univentionstaff 2020-08-11 12:45:42 CEST
univention-directory-notifier: 13.0.3-2A~4.4.0.202007291251
slapd: 2.4.45+dfsg-1~bpo9+1A~4.4.0.202007242143
ucs: 4.4-5 errata793

seen in a test setup for a customer

the test goes like this

 * creates ldap objects
 * delete ldap object
 * waits until id's in /var/lib/univention-ldap/last_id and /var/lib/univention-directory-listener/notifier_id are equal

but "sometimes" the notifier seems to miss a transaction and /var/lib/univention-directory-listener/notifier_id is not updated

-> more /var/lib/univention-ldap/last_id /var/lib/univention-directory-listener/notifier_id
::::::::::::::
/var/lib/univention-ldap/last_id
::::::::::::::
199635
::::::::::::::
/var/lib/univention-directory-listener/notifier_id
::::::::::::::
199634

-> ls -la /var/lib/univention-ldap/listener/listener
-rw-r--r-- 1 listener root 107 Aug 10 22:08 /var/lib/univention-ldap/listener/listener

more /var/lib/univention-ldap/listener/listener
199635 cn=alexander..... d

-> usr/share/univention-directory-listener/get_notifier_id.py
199634

notifer.log
10.08.20 22:08:12.145  TRANSFILE   ( INFO    ) : want to write to transaction file; id=199634
10.08.20 22:08:12.145  TRANSFILE   ( INFO    ) : wrote to transaction file; id=199634; dn=cn=96112523..., cmd=d
10.08.20 22:08:12.145  TRANSFILE   ( INFO    ) : wrote to transaction file; close
10.08.20 22:08:12.145  TRANSFILE   ( PROCESS ) : Added to cache pos 17, id 199634

listener.log
10.08.20 22:08:12.153  LISTENER    ( INFO    ) : dntree_del_id: deleted id=180596
10.08.20 22:08:12.155  LISTENER    ( INFO    ) : dntree_get_id4dn: DN cn=96112523,.... not in id2dn
10.08.20 22:08:12.157  LISTENER    ( INFO    ) : Last Notifier ID: 199634
10.08.20 22:08:27.172  LISTENER    ( INFO    ) : running postrun handlers

-> univention-ldapsearch -b cn=translog reqSession=199634
dn: reqSession=199634,cn=translog
...
reqType: d

univention-ldapsearch -b cn=translog reqSession=199635
-> None

So 199634 has been processed but for the pending transaction 199635 the notifier has not been "triggerd".

Notifier/Listener/LDAP all running at this point but nothing happens.

Creating a new LDAP object, or even echo aa > /var/lib/univention-ldap/listener/a triggers the notifier and the transaction is processed

-> echo aa > /var/lib/univention-ldap/listener/a

-> notifier

10.08.20 22:08:12.145  TRANSFILE   ( INFO    ) : wrote to transaction file; id=199634; dn=cn=96112523,c...., cmd=d
10.08.20 22:08:12.145  TRANSFILE   ( INFO    ) : wrote to transaction file; close
10.08.20 22:08:12.145  TRANSFILE   ( PROCESS ) : Added to cache pos 17, id 199634

11.08.20 10:33:27.808  TRANSFILE   ( INFO    ) : want to write to transaction file; id=199635
11.08.20 10:33:27.808  TRANSFILE   ( INFO    ) : wrote to transaction file; id=199635; dn=cn=alexander..., cmd=d
11.08.20 10:33:27.808  TRANSFILE   ( INFO    ) : wrote to transaction file; close
11.08.20 10:33:27.808  TRANSFILE   ( PROCESS ) : Added to cache pos 18, id 199635

Is there a problem in the notifiers signal handling/F_NOTIFY?
Comment 1 Florian Best univentionstaff 2020-08-11 13:21:46 CEST
Is this a regression caused by Bug #51722?
Comment 2 Felix Botner univentionstaff 2020-08-11 16:35:40 CEST
(In reply to Florian Best from comment #1)
> Is this a regression caused by Bug #51722?

according to ProfServ this started to happen way before 2020-08-05_698_univention-directory-notifier.yaml
Comment 3 Thomas Bach univentionstaff 2020-08-13 09:26:12 CEST
I was just able to trigger the observed behaviour in UCS 3.2.

As a further notice: we also have a deployment with a backup server where this problem also occurs. The master already has the problems as described in the OP. When running a loop like this:

while echo 'a' > /var/lib/univention-ldap/listener/a ; do sleep 10s; done

everything seems to work as it should, but the backup server now seems to stop although there is a pending transaction in /var/lib/univention-ldap/listener/listener. Running the loop above on the backup fixes this issue as well.
Comment 4 Florian Best univentionstaff 2020-08-13 09:45:11 CEST
Notes from Arvid:

I guess that the unset_listener_callback() call which is made in check_callbacks() should be moved to notify_listener_change_callback() shortly after you create the new listener file and while you still have the lock. Otherwise you always have a race condition.

Philipp mentioned the race condition already in Bug #41687.
Comment 5 Thomas Bach univentionstaff 2020-08-13 09:47:21 CEST
I can recompile and test if you provide a patch.
Comment 6 Thomas Bach univentionstaff 2020-08-13 13:05:51 CEST
I applied a patch as suggested by Arvid. The problem still occurs.

diff --git a/management/univention-directory-notifier/src/network.c b/management/univention-directory-notifier/src/network.c
index 809f217e57..8630182754 100644
--- a/management/univention-directory-notifier/src/network.c
+++ b/management/univention-directory-notifier/src/network.c
@@ -63,7 +63,6 @@ extern void set_listener_callback ( int sig, siginfo_t *si, void *data);
 extern int get_schema_callback ();
 extern int get_listener_callback ();
 extern void unset_schema_callback ();
-extern void unset_listener_callback ();
 
 extern NotifyId_t notify_last_id;
 
@@ -294,7 +293,6 @@ void check_callbacks()
        }
        if ( get_listener_callback () ) {
                notify_listener_change_callback ( 0, NULL, NULL);
-               unset_listener_callback();
        }
 }
 int network_client_main_loop ( )
diff --git a/management/univention-directory-notifier/src/notify.c b/management/univention-directory-notifier/src/notify.c
index 76baa23f88..801a48a4f7 100644
--- a/management/univention-directory-notifier/src/notify.c
+++ b/management/univention-directory-notifier/src/notify.c
@@ -56,6 +56,7 @@ extern NotifyId_t notify_last_id;
 extern Notify_t notify;
 extern long long notifier_lock_count;
 extern long long notifier_lock_time;
+extern void unset_listener_callback ();
 
 extern unsigned long SCHEMA_ID;
 
@@ -664,6 +665,7 @@ void notify_listener_change_callback(int sig, siginfo_t *si, void *data)
                        close(fd);
                }
 
+               unset_listener_callback();
                fclose_lock(&l_file);
        }
Comment 7 Florian Best univentionstaff 2020-08-13 13:21:23 CEST
I think this is bug-type "7" (or at least 5) because a transaction is completely lost and not delivered to the listener. This may explain strange unpreroducible bugs.
Comment 8 Felix Botner univentionstaff 2020-08-13 13:50:56 CEST
(In reply to Florian Best from comment #7)
> I think this is bug-type "7" (or at least 5) because a transaction is
> completely lost and not delivered to the listener. This may explain strange
> unpreroducible bugs.

"completely lost" - No, not lost, just "ignored" until the next transaction, but this is bad too, im OK with 7
Comment 9 Felix Botner univentionstaff 2020-09-30 10:08:31 CEST
Seen again in one of our jenkins tests - UCS-4.4-6 Product Tests Component openid-connect

* master & backup (4.4-6 e750)
* setup-join on backup
* on backup univention-join created the ldap object cn=backup
* nothing more happened on the master, the ldap object cn=backup
  was present but the notifier wasn't triggered and cn=translog
  not updated, 
* as a result, the listener on masterdid not create ssl certs
  for the backup and the join failed
* directly after creating another udm object on the master,
  everything worked like expected (cn=backup was processed,
  then the new object)
Comment 10 Felix Botner univentionstaff 2020-10-13 10:21:11 CEST
and another one

-> /var/lib/univention-ldap/listener/listener
1169 cn=2009,cn=gidNumber,cn=temporary,cn=univention,dc=autotest125,dc=local a
1170 cn=2009,cn=gidNumber,cn=temporary,cn=univention,dc=autotest125,dc=local d
1171 cn=52:54:00:cd:d8:4c,cn=mac,cn=temporary,cn=univention,dc=autotest125,dc=local a
1172 cn=10.207.210.252,cn=aRecord,cn=temporary,cn=univention,dc=autotest125,dc=local a
1173 cn=uidNumber,cn=temporary,cn=univention,dc=autotest125,dc=local m
1174 cn=2009,cn=uidNumber,cn=temporary,cn=univention,dc=autotest125,dc=local d
1175 cn=slave125$,cn=uid,cn=temporary,cn=univention,dc=autotest125,dc=local a
1176 cn=slave125,cn=dc,cn=computers,dc=autotest125,dc=local a
1177 cn=slave125$,cn=uid,cn=temporary,cn=univention,dc=autotest125,dc=local d
1178 cn=slave125,cn=dc,cn=computers,dc=autotest125,dc=local m
1179 cn=slave125,cn=dc,cn=computers,dc=autotest125,dc=local m
1180 relativeDomainName=slave125,zoneName=autotest125.local,cn=dns,dc=autotest125,dc=local a
1181 zoneName=autotest125.local,cn=dns,dc=autotest125,dc=local m
1182 relativeDomainName=252.210,zoneName=207.10.in-addr.arpa,cn=dns,dc=autotest125,dc=local a
1183 zoneName=207.10.in-addr.arpa,cn=dns,dc=autotest125,dc=local m
1184 cn=10.207.210.252,cn=aRecord,cn=temporary,cn=univention,dc=autotest125,dc=local d
1185 cn=52:54:00:cd:d8:4c,cn=mac,cn=temporary,cn=univention,dc=autotest125,dc=local d
1186 cn=DC Slave Hosts,cn=groups,dc=autotest125,dc=local m

-> ls -la /var/lib/univention-ldap/listener/listener
-rw-r--r-- 1 listener root 1374 Okt 12 23:24 /var/lib/univention-ldap/listener/listener

-> stat  /var/lib/univention-ldap/listener/listener
  Datei: /var/lib/univention-ldap/listener/listener
  Größe: 1374      	Blöcke: 8          EA Block: 4096   reguläre Datei
Gerät: fd00h/64768d	Inode: 540278      Verknüpfungen: 1
Zugriff: (0644/-rw-r--r--)  Uid: (  106/listener)   Gid: (    0/    root)
Zugriff    : 2020-10-13 09:51:54.769950073 +0200
Modifiziert: 2020-10-12 23:24:13.858872030 +0200
Geändert   : 2020-10-12 23:24:13.858872030 +0200
 Geburt    : -


so last change form 23:24 since then nothing happened in the notifier (and therefor listener)

In the listener log i see

12.10.20 23:15:07.046  DEBUG_INIT
12.10.20 23:24:13.761  TRANSFILE   ( ERROR   ) : ldap_sasl_interactive_bind_s(): Can't contact LDAP server
12.10.20 23:24:19.120  DEBUG_INIT

This comes from notify_dump_to_ldap from the notify_listener_change_callback callback. Could it be that we abort (and forget) the transaction when  notify_dump_to_ldap aborts?
Comment 11 Felix Botner univentionstaff 2020-10-13 10:34:14 CEST
(In reply to Felix Botner from comment #10)

> 
> In the listener log i see
> 
> 12.10.20 23:15:07.046  DEBUG_INIT
> 12.10.20 23:24:13.761  TRANSFILE   ( ERROR   ) :
> ldap_sasl_interactive_bind_s(): Can't contact LDAP server
> 12.10.20 23:24:19.120  DEBUG_INIT


sorry, notifier.log
Comment 12 Arvid Requate univentionstaff 2020-10-13 11:56:36 CEST
I guess it's a regression of Bug #51722 (commit a18a39c017), my current analysis is this:

1. We see the TRANSFILE log message complaining about a failed LDAP bind, which comes from notify_dump_to_ldap, called here, just before unlinking FILE_NAME_NOTIFIER_PRIV here:

https://git.knut.univention.de/univention/ucs/-/blob/4.4-6/management/univention-directory-notifier/src/notify.c#L693

2. The notifier aborts and when it restarts it checks the FILE_NAME_LISTENER once, but only if FILE_NAME_NOTIFIER_PRIV doesn't exist.

3. Then it will not read the FILE_NAME_LISTENER again unless dnotify reports a modification.

Workaround in this situation:

 pkill -RTMIN -f /usr/sbin/univention-directory-notifier

I'll attach an idea for a patch.
Comment 13 Arvid Requate univentionstaff 2020-10-13 11:57:01 CEST
Created attachment 10514 [details]
Bug51804-idea.patch
Comment 14 Thomas Bach univentionstaff 2020-10-16 22:16:50 CEST
I applied your patch. You can see the result here:

https://github.com/thomasbach-dev/univention-corporate-server/tree/bug51804-idea.patch

It doesn't fix the issue.
Comment 16 Julia Bremer univentionstaff 2020-11-20 11:20:52 CET
The tests 01_base/96rename* failed every jenkins run on multiple roles,
when executing them, every fifth time or so because of this bug. 
I added a workaround 87c2f51f70 for Bug #52005.

If this bug is fixed, please remove the workaround.
This might be a good test to verify the fix.
Comment 17 Philipp Hahn univentionstaff 2020-11-30 11:39:30 CET
UDN not processing transaction is responsible for Bug #51776 which we often observe with out Jenkins tests: This leads to tests not running which leads to delays in out daily development work.
Comment 18 Julia Bremer univentionstaff 2020-12-02 10:08:50 CET
Both of the patches here seemed to solve the issue as seen on the test.
I applied them to the package.
Waiting for test results.


04f798467d Bug #51804: yaml
367c1ad555 Bug #51804, Bug #52261, Bug #52439: yaml
68f6dd5597 Bug #51804: changelog
987df24e3c Revert "Bug #52005: Reactivate 96rename* and work around Bug#51804"
620bd158e7 Bug #51804: call unset_listener_callback before unlocking the listener file. Check listener file on restart, even if listener.priv exists

Successful build
Package: univention-directory-notifier
Version: 13.0.3-3A~4.4.0.202012011723
Branch: ucs_4.4-0
Scope: errata4.4-7
Comment 19 Julia Bremer univentionstaff 2020-12-07 08:49:30 CET
The test did not fail since I applied these patches, which was five days ago.
The tests previously failed every day, so this is a good indication that the patches worked.
Comment 21 Felix Botner univentionstaff 2020-12-09 11:30:58 CET
OK - univention-directory-notifier
OK - jenkins test
OK - reactivate 96rename* tests
OK - yaml
OK - MR