Bug 52358 - S4C synchronization loop after add+delete
S4C synchronization loop after add+delete
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: S4 Connector
UCS 5.0
Other Linux
: P5 critical with 1 vote (vote)
: UCS 5.0
Assigned To: Julia Bremer
Felix Botner
:
Depends on: 52043
Blocks: 53061
  Show dependency treegraph
 
Reported: 2020-11-12 19:17 CET by Arvid Requate
Modified: 2021-05-25 16:03 CEST (History)
4 users (show)

See Also:
What kind of report is it?: Development Internal
What type of bug is this?: ---
Who will be affected by this bug?: ---
How will those affected feel about the bug?: ---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments
connector-s4.log (10.43 MB, text/x-log)
2020-11-12 19:21 CET, Arvid Requate
Details
listener.log (1.31 MB, text/x-log)
2020-11-12 19:22 CET, Arvid Requate
Details
/var/lib/univention-ldap/notify/transaction (754.98 KB, text/plain)
2020-11-12 19:22 CET, Arvid Requate
Details
Bug52358-workaround.patch (1.48 KB, patch)
2020-11-17 20:55 CET, Arvid Requate
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Arvid Requate univentionstaff 2020-11-12 19:17:18 CET
In the tests with the Python 3 univention-s4-connector the test case 51_samba4/51create_gpo triggers an infinite loop of delete-add operations by creating a GPO in Samba/AD and removing it again within a time window of 4 seconds.

Apart from increasing global warming this has the collateral effect that the Listener is busy processing the changes and doesn't call the Listener module postruns any longer.

I'm not sure yet if this is caused by just a subtle change of timing in the behavior of the S4-Connector or if it's caused by some issue overlooked during the Python3 migration.
Comment 1 Arvid Requate univentionstaff 2020-11-12 19:21:57 CET
Created attachment 10552 [details]
connector-s4.log
Comment 2 Arvid Requate univentionstaff 2020-11-12 19:22:23 CET
Created attachment 10553 [details]
listener.log
Comment 3 Arvid Requate univentionstaff 2020-11-12 19:22:59 CET
Created attachment 10554 [details]
/var/lib/univention-ldap/notify/transaction
Comment 4 Arvid Requate univentionstaff 2020-11-12 19:31:50 CET
For deletes triggered via UDM we have a mechanism to prevent this kind of ping-pong resurrection (sqlite table "UCS deleted", Bug #32263).

Maybe we need a similar mechanism in the other direction too. The way it currently is, it may be that it can be easily triggered by an unsuspecting Windows Admin that adds+deletes a GPO in the GPMC GUI.

I guess that the approach suggested in Bug #49442 could significantly reduce the time window for this to happen (or otherwise reducing connector/s4/poll/sleep, which is 5 seconds).
Comment 6 Arvid Requate univentionstaff 2020-11-17 10:42:25 CET
It's not only GPOs, 66_udm-computers/56_all_roles_univentionLastUsedValue also seems to trigger this pretty often with the UCS-5 S4C+UDM.

This looks like Bug #50593 Comment 5.
Comment 7 Arvid Requate univentionstaff 2020-11-17 18:45:42 CET
Ah, as just discussed with Julia, these are actually two different, but very similar issues here:


1) 51_samba4/51create_gpo does add+modify+delete in Samba/AD

2) 66_udm-computers/56_all_roles_univentionLastUsedValue does add+modify+delete in UDM/OpenLDAP


I don't see how the "UCS deleted" table can help us here, because it only records entryUUIDs.
That helps to filter out replays of leftover pickle files, but not against re-creation of
objects with same DN.


So, currently I only see two possible approaches (for case 2, not sure it helps for 1):

a) Update the s4cache content already directly after writing to Samba/AD.
   Currently it's only updated during reads, just like the Listener cache behaves.
   That way we could filter out the "pong".
   I guess this wouldn't help in case 1, because we don't have a cache in that
   direction which we could update during writes to UDM/OpenLDAP.

b) Remember what has been done in the previous replication stage ("ping") and
   use that memory to avoid doing things that don't look inconsistent. Like:
   We just synced an "Add" coming from the left side to the right side and
   now we want to do an add coming from the right side to the left? directly
   as the next step? That doesn't make sense.
   For "add"s could do this on the basis of DNs, that would probably not be
   too hard. For modifies (which could probably trigger a similar ping-pong) it
   would become harder to track this.
   My feeling is that this approach kind of strays away from the idea of keeping
   things in sync, as it decides about behavior on a more abstract level.
Comment 8 Arvid Requate univentionstaff 2020-11-17 20:55:17 CET
Created attachment 10556 [details]
Bug52358-workaround.patch

This patch adjusts the polling timing a bit to continue reading a bit longer from the side where changes come in last. It helped on my VM. This may be a workaround useful to avoid the damage in our CI tests as long as we don't have a better solution.
Comment 9 Julia Bremer univentionstaff 2020-11-27 10:07:34 CET
These are the differences in modification in UCS4, vs UCS5.
The case tested here is the pingpong starting in UCS.

UCS 4
01.11.2020 08:54:02.698 LDAP        (PROCESS): sync from ucs: [          user] [       add] cn=test777,DC=jbm20,DC=intranet
01.11.2020 08:54:02.753 LDAP        (PROCESS): sync from ucs: [         group] [    modify] cn=domain users,cn=groups,DC=jbm20,DC=intranet
01.11.2020 08:54:16.936 LDAP        (PROCESS): sync to ucs:   [          user] [       add] u'uid=test777,dc=jbm20,dc=intranet'
01.11.2020 08:54:23.511 LDAP        (PROCESS): sync from ucs: [          user] [    delete] cn=test777,DC=jbm20,DC=intranet
01.11.2020 08:54:23.527 LDAP        (PROCESS): sync from ucs: [         group] [    modify] cn=domain users,cn=groups,DC=jbm20,DC=intranet
01.11.2020 08:54:23.546 LDAP        (PROCESS): sync from ucs: [          user] [       add] cn=test777,DC=jbm20,DC=intranet
01.11.2020 08:54:23.595 LDAP        (PROCESS): sync from ucs: [         group] [    modify] cn=domain users,cn=groups,DC=jbm20,DC=intranet
01.11.2020 08:54:26.049 LDAP        (PROCESS): sync to ucs:   [          user] [    modify] u'uid=test777,dc=jbm20,dc=intranet'
01.11.2020 08:54:27.449 LDAP        (PROCESS): sync to ucs:   [          user] [    delete] u'uid=test777,dc=jbm20,dc=intranet'
01.11.2020 08:54:33.596 LDAP        (PROCESS): __sync_file_from_ucs: Object with entryUUID 2fc77fe0-b063-103a-9cf7-8900c1ec9fa7 has been removed before. Don't re-create.
01.11.2020 08:54:33.601 LDAP        (PROCESS): sync from ucs: [          user] [    delete] CN=test777,DC=jbm20,DC=intranet
01.11.2020 09:21:02.408 LDAP        (WARNING): Object to delete doesn't exists, ignore (u'uid=test777,dc=jbm20,dc=intranet')
(End)
================================================================================================================================
UCS 5
27.11.2020 09:40:36.638 LDAP        (PROCESS): sync from ucs: [          user] [       add] cn=test777,DC=ucs5alpha5,DC=dev
27.11.2020 09:40:36.726 LDAP        (PROCESS): sync from ucs: [         group] [    modify] cn=domain users,cn=groups,DC=ucs5alpha5,DC=dev
27.11.2020 09:40:43.211 LDAP        (PROCESS): sync to ucs:   [          user] [       add] 'uid=test777,dc=ucs5alpha5,dc=dev'
27.11.2020 09:40:48.534 LDAP        (PROCESS): sync from ucs: [          user] [    delete] cn=test777,DC=ucs5alpha5,DC=dev
27.11.2020 09:40:48.570 LDAP        (PROCESS): sync from ucs: [         group] [    modify] cn=domain users,cn=groups,DC=ucs5alpha5,DC=dev
27.11.2020 09:40:48.588 LDAP        (PROCESS): sync from ucs: [          user] [       add] cn=test777,DC=ucs5alpha5,DC=dev
27.11.2020 09:40:48.725 LDAP        (PROCESS): sync from ucs: [         group] [    modify] cn=domain users,cn=groups,DC=ucs5alpha5,DC=dev
27.11.2020 09:40:51.634 LDAP        (PROCESS): sync to ucs:   [          user] [    delete] 'uid=test777,dc=ucs5alpha5,dc=dev'
27.11.2020 09:40:52.571 LDAP        (PROCESS): sync to ucs:   [          user] [       add] 'uid=test777,dc=ucs5alpha5,dc=dev'
27.11.2020 09:40:57.903 LDAP        (PROCESS): sync from ucs: [          user] [    delete] cn=test777,DC=ucs5alpha5,DC=dev
27.11.2020 09:40:57.938 LDAP        (PROCESS): sync from ucs: [         group] [    modify] cn=domain users,cn=groups,DC=ucs5alpha5,DC=dev
27.11.2020 09:40:58.018 LDAP        (PROCESS): sync from ucs: [          user] [       add] cn=test777,DC=ucs5alpha5,DC=dev
(and so on, and so on .. )
Comment 10 Julia Bremer univentionstaff 2021-02-08 08:53:46 CET
01_base/00_univention-license-check always created a loop in our test environment, which inhibited the postrun. 
I added some extra waiting time before cleanup. 
62acf7cf81
Comment 11 Julia Bremer univentionstaff 2021-02-08 21:00:47 CET
Same thing with 01_base/101_ssl_cname_san:
It created a S4 loop pretty reliably. In the test I ran after adding waiting time to this one too, no loop was created.

bb6448fdcd Bug #52358: wait before cleanup 101_ssl_cname_san creates s4 loop


Please remove these workarounds after fixing this problem.
The tests can act as pretty reliable reproducers.
Comment 13 Julia Bremer univentionstaff 2021-02-11 18:38:57 CET
Successful build
Package: univention-s4-connector
Version: 14.0.6-4A~5.0.0.202102111810
Branch: ucs_5.0-0

f773e67cec Bug #52358: User LDAP post read control for all objects, not just msGPO.

During analysis of the loop we found out, that the main difference between UCS4 (which does not create a loop) and UCS5 (which does) was 
the sequence of modifications during sync_to_ucs.
An add, delete sequence used to be followerd up by samba with an delete,add sequence.
Because of our already_deleted cache, the then following add_by_ucs is then ignored. 
Now an add,delete is responded with an add,delete sequence.
This we couldn't handle and ended in this never-ending add, delete loop.

In Bug #43628, Arvid added LDAP PostReadControl to our S4-Connector, but only activated it for msGPO Objects, which are most susceptible to Ping-Pong problems, due to their user interface in windows. 
With this, we can get the entryCSN of the object we just created during add_in_ucs.
With this knowledge, we can identify the "pong" of this modification and skip it.
This interrupts the loop in the correct place, at the pong of an "add".

Sadly, this does not prevent the problem from Bug #50593, which dealt with an delete+add ping pong, while this bug dealt with an add+delete ping pong.

TODO:
Waiting for our test results.
Comment 14 Julia Bremer univentionstaff 2021-02-12 09:16:07 CET
Arvid noticed that 
52a4b0cd0c Bug #52043 s4c: initial s4 search: combine filters

May have introduced this change in behaviour. (sequence of add,deletes -> delete,add in sync_to_ucs). 
I tested this, and can confirm this. No loop can be triggered anymore after reverting this commit, the behaviour is just like UCS4. 


Sadly, I thought the solution of yesterday would have eliminated the loops completely. I still found one in:
http://jenkins.knut.univention.de:8080/job/UCS-5.0/job/UCS-5.0-0/job/AutotestJoin/SambaVersion=s4,Systemrolle=master-part-II/ws/test/connector-s4.log/*view*/

triggered by 10_ldap/49replication_modrdn_with_previous_modification,
which modifies and moves an object while the listener is stopped.


12.02.2021 01:16:35.671 LDAP        (PROCESS): sync from ucs: [windowscomputer] [       add] cn=z09rntxadq,cn=bpg2kk8txr,cn=memberserver,cn=computers,DC=autotest091,DC=local
12.02.2021 01:16:35.717 LDAP        (PROCESS): sync from ucs: [windowscomputer] [    modify] cn=z09rntxadq,cn=bpg2kk8txr,cn=memberserver,cn=computers,DC=autotest091,DC=local

12.02.2021 01:16:36.776 LDAP        (PROCESS): sync to ucs:   [windowscomputer] [    modify] 'cn=z09rntxadq,cn=bpg2kk8txr,cn=memberserver,cn=computers,dc=AutoTest091,dc=local'

12.02.2021 01:16:42.860 LDAP        (PROCESS): sync from ucs: [windowscomputer] [      move] CN=z09rntxadq,cn=memberserver,cn=computers,DC=autotest091,DC=local
12.02.2021 01:16:42.864 LDAP        (PROCESS): password_sync_ucs_to_s4: The UCS object (cn=z09rntxadq,cn=memberserver,cn=computers,dc=AutoTest091,dc=local) was not found. The object was removed.

12.02.2021 01:16:42.867 LDAP        (PROCESS): sync from ucs: [windowscomputer] [    modify] cn=z09rntxadq,cn=memberserver,cn=computers,DC=autotest091,DC=local

12.02.2021 01:16:43.905 LDAP        (PROCESS): sync to ucs:   [windowscomputer] [       add] 'cn=z09rntxadq,cn=memberserver,cn=computers,dc=AutoTest091,dc=local'

12.02.2021 01:16:50.036 LDAP        (PROCESS): sync from ucs: [windowscomputer] [    delete] cn=z09rntxadq,cn=memberserver,cn=computers,DC=autotest091,DC=local
12.02.2021 01:16:50.072 LDAP        (PROCESS): sync from ucs: [windowscomputer] [       add] cn=z09rntxadq,cn=memberserver,cn=computers,DC=autotest091,DC=local

12.02.2021 01:16:51.136 LDAP        (PROCESS): sync to ucs:   [windowscomputer] [    delete] 'cn=z09rntxadq,cn=memberserver,cn=computers,dc=AutoTest091,dc=local'
12.02.2021 01:16:51.171 LDAP        (PROCESS): sync to ucs:   [windowscomputer] [       add] 'cn=z09rntxadq,CN=memberserver,CN=Computers,dc=AutoTest091,dc=local'

12.02.2021 01:16:57.284 LDAP        (PROCESS): sync from ucs: [windowscomputer] [    delete] cn=z09rntxadq,cn=memberserver,cn=computers,DC=autotest091,DC=local
12.02.2021 01:16:57.311 LDAP        (PROCESS): sync from ucs: [windowscomputer] [    modify] cn=z09rntxadq,cn=memberserver,cn=computers,DC=autotest091,DC=local

12.02.2021 01:16:58.366 LDAP        (PROCESS): sync to ucs:   [windowscomputer] [    delete] 'cn=z09rntxadq,cn=memberserver,cn=computers,dc=AutoTest091,dc=local'
12.02.2021 01:16:58.416 LDAP        (PROCESS): sync to ucs:   [windowscomputer] [       add] 'cn=z09rntxadq,CN=memberserver,CN=Computers,dc=AutoTest091,dc=local'
.....
Comment 15 Julia Bremer univentionstaff 2021-02-15 08:48:45 CET
I added the test-case 99_s4connector_loop, which creates and deletes computer objects and checks if a loop was created. 
On my test machine, this always failed correctly. In Jenkins the test succeeded. I guess it needs more fidgeting to decrease false negatives.


Successful build
Package: univention-s4-connector
Version: 14.0.6-5A~5.0.0.202102142351
Branch: ucs_5.0-0

5a47a4b555 Bug #52358: Restore ordering of s4 chan
Arvid made a patch recommendation to restore the ordering of incoming s4 changes to how it was in UCS4.
Manual testing confirms this and fixes all kinds of loops we observed before.

I pushed this after the tests already started, to see if the automated tests fails at least once, which was not the case.
Comment 16 Julia Bremer univentionstaff 2021-02-22 15:52:13 CET
Also happens in adconnector
Comment 17 Julia Bremer univentionstaff 2021-02-23 11:54:58 CET
For QA: 
REPRODUCING FROM UCS:
To reproduce this reliably(from UCS), one has to insert a breakpoint at this place in the s4connector. 

+++ services/univention-s4-connector/modules/univention/s4connector/__init__.py
@@ -1316,6 +1316,7 @@ class ucs(object):
                        return True
 
                ucs_object_dn = object.get('olddn', object['dn'])
+               breakpoint()
                old_object = self.get_ucs_object(property_type, ucs_object_dn)
                if old_object and object['modtype'] == 'add':
                        object['modtype'] = 'modify'

stop the s4connector and start it in the foreground
/usr/bin/python3 -W ignore -m univention.s4connector.s4.main -n

first add an object (users won't work, they create a reject in this case)

udm computers/windowscomputer create --set name=test
when the breakpoint is hit, delete the object
udm computers/windowscomputer remove --dn=cn=test.$(ucr get ldap/base)
Continue and a loop should happen. 

REPRODUCING FROM S4:

In this case a breakpoint has to be inserted here:

+++ services/univention-s4-connector/modules/univention/s4connector/s4/__init__.py
@@ -2024,6 +2024,7 @@ class s4(univention.s4connector.ucs):
                        old_dn = object['olddn']
                        # the old object was moved in UCS, but does this object exist in S4?
                        try:
+                               breakpoint()
                                old_object = self.s4_search_ext_s(old_dn, ldap.SCOPE_BASE, 'objectClass=*')
                        except ldap.SERVER_DOWN:
                                raise

An object has to be created
samba-tool computer create test
when the breakpoint hits, it has to be deleted again
samba-tool computer delete test

Moving and then deleting an object should also be tested since this was the case that was not fixed by the patch in Comment 13. 
But this should be covered by the patch in Comment 16.
Comment 18 Felix Botner univentionstaff 2021-02-24 15:05:32 CET
OK - 5a47a4b555c1cdb0677bf405263b2a245907871e

OK - s4 connector
 * OK - 5a47a4b555c1cdb0677bf405263b2a245907871e
 * OK - f773e67cece739cc002fb1758ece7d4bcd12121e
 * OK - S4 REPRODUCING FROM UCS
 * OK - S4 Fixed FROM UCS
 * OK - S4 REPRODUCING FROM S4
@@ -2020,6 +2020,7 @@ class s4(univention.s4connector.ucs):
 
                # check for move, if old_object exists, set modtype move
                pre_mapped_ucs_old_dn = old_dn
+               breakpoint()
                if old_dn:
                        old_dn = object['olddn']
                        # the old object was moved in UCS, but does this object exist in S4?
 * OK - S4 Fixed FROM UCS
 * OK - 10_ldap/49replication_modrdn_with_previous_modification
 * OK - Jenkins

TODO - ad connector
TODO - tests
Comment 19 Felix Botner univentionstaff 2021-02-25 12:32:33 CET
OK - ad connector
 * OK - 152bc96aed076a2940f67ccca27e1870f8289b97
 * OK - 90835163c05f7a2e4c9c2d3f35c0771f0c01d67e
 * OK - AD REPRODUCING FROM UCS
 * OK - AD Fixed FROM UCS
 * OK - AD REPRODUCING FROM AD
@@ ????
                # check for move, if old_object exists, set modtype move
                pre_mapped_ucs_old_dn = old_dn
+               breakpoint()
                if old_dn:
                        ud.debug(ud.LDAP, ud.INFO, "move %s from [%s] to [%s]" % (property_type, old_dn, object['dn']))
 * OK - AD Fixed FROM AD
 * OK - 10_ldap/49replication_modrdn_with_previous_modification


TODO - tests
Comment 20 Florian Best univentionstaff 2021-03-02 21:33:50 CET
*** Bug 48988 has been marked as a duplicate of this bug. ***
Comment 21 Julia Bremer univentionstaff 2021-03-03 13:00:10 CET
I added three tests:

52_s4connector/99_s4connector_loop which tries to reproduce an s4connector loop
55_adconnector/99_adconnector_loop which tries to reproduce an adconnector loop
99_end/00_test_if_s4connector_loop_present, which checks if a loop is  present after executing all other tests in jenkins.
-------------------------------------------
73f2999205 Bug #52358: fixup! use python3 in new tests
e9dc83c706 fixup! Bug #52358: Add s4connector loop test
0d5f9bd128 Bug #52358: fixup! use python3 in new tests
ea7b2e2a00 fixup! Bug #52358: Add tests to create/detect s4/ad loops
29a1bfdd21 Bug #52358: Add tests to create/detect s4/ad loops
Comment 22 Julia Bremer univentionstaff 2021-03-03 13:04:19 CET
I reverted the test workarounds for this problem:
e3dd347fad
Comment 23 Felix Botner univentionstaff 2021-03-03 15:24:51 CET
OK - 99_adconnector_loop reproduces the problem, passes in Jenkins
OK - 99_s4connector_loop reproduces the problem, passes in Jenkins
OK - 99_end/00_test_if_s4connector_loop_present detects a loop
OK - e3dd347fad
Comment 24 Florian Best univentionstaff 2021-05-25 16:03:37 CEST
UCS 5.0 has been released:
 https://docs.software-univention.de/release-notes-5.0-0-en.html
 https://docs.software-univention.de/release-notes-5.0-0-de.html

If this error occurs again, please use "Clone This Bug".