Bug 50593 - Ping pong between LDAP and the Samba4 leads to disappearing users
Ping pong between LDAP and the Samba4 leads to disappearing users
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: S4 Connector
UCS 4.4
Other Mac OS X 10.1
: P5 normal (vote)
: UCS 5.0-0-errata
Assigned To: Florian Best
Felix Botner
:
Depends on:
Blocks: 51860
  Show dependency treegraph
 
Reported: 2019-12-04 15:56 CET by Michel Smidt
Modified: 2021-09-15 17:41 CEST (History)
9 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 2: Will only affect a few installed domains
How will those affected feel about the bug?: 4: A User would return the product
User Pain: 0.229
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2019120221000346
Bug group (optional):
Max CVSS v3 score:


Attachments
patch (git:fbest/50593-s4-connector-ping-pong) (2.15 KB, patch)
2021-09-07 15:01 CEST, Florian Best
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michel Smidt 2019-12-04 15:56:20 CET
In school environments we have problems in certain constellations with the synchronization of the S4 connector in exam mode. The class test mode is a key use case and is often started several times in a row to test or if certain configurations have to be re-set.
The problem is that ExamUser can than disappear.
This is a problem because the exam situation for teachers is a considerable stress situation and if the mode does not already work in the tests, the mode is not used altogether.

The reason for the disappearing ExamUser lies in a ping pong between LDAP and the Samba4, which occurs with high system load or too slow replication. This is how it works:
All (user) activities in OpenLDAP are transferred in parallel, but slightly delayed (!) from the S4 connector to Samba. From there, they are then transferred back from S4 to LDAP (because a few attributes have been added)!

However, the delay at high load may lead to the following order:

    User is created/modified in LDAP
    User is copied from LDAP to S4
    User is copied from S4 to LDAP
    User is deleted in LDAP
    User is created again in LDAP
    User is deleted in S4 ("copied from LDAP to S4")
    User is deleted from S4 in LDAP ("copied from S4 to LDAP")
    Software (here ExamMode) has an error because the user has disappeared in LDAP.
    User is copied from S4 to LDAP
    Admin/developer looks in LDAP, finds the user and wonders.

This problem also appears regularly in our Jenkins tests and could be fixed if UCS users are not deleted with the same DN but only with the same entryUUID.

To be a fix would be sufficient if it is activatable for school environments.
Comment 1 Julia Bremer univentionstaff 2020-03-26 13:54:47 CET
f3b062080f Bug #50593: yaml
706f9854e0 Bug #50593: Users in UCS wont be deleted in sync_to_ucs if deleted "by" UCS

Successful build
Package: univention-s4-connector
Version: 13.0.2-67A~4.4.0.202003261342
Branch: ucs_4.4-0
Scope: errata4.4-4
User: jbremer


Luckily, the logic to only delete objects if their entryUUID/objectGUID matches was introduced in Bug #47374 but it was only used for containers/ous. 

I tested this by stopping the s4-connector via pdb at line 1458, where it searches for a corresponding object with the same DN after removing an object.
Then I created an object with the same DN and let the connector continue.
Before this change the connector used to delete the new object which has the same DN, with these changes it does not.

Waiting for the test result.
Comment 2 Julia Bremer univentionstaff 2020-03-27 22:06:56 CET
109sync_create_and_modify_ucs_to_ad failed last build because a user still existed after udm remove.
Investigating.
Comment 3 Julia Bremer univentionstaff 2020-03-29 14:40:31 CEST
Because it just failed once on jenkins, I don't have logs to analyse the problem that led this test to fail.
The test did not fail again, it seems to be flaky.
Comment 5 Julia Bremer univentionstaff 2020-04-01 11:11:59 CEST
I was wrong.
With my change, we traded one problem for another one.
The tests failed because they added objects, modified them, then deleted them. 
Its something like this:

LDAP          ->         S4 
           add object   
              <-
           add object(nochange)

              ->
          modify object
              ->
          delete object
             
             <- 
          modify object(already delted/ add instead)

            <- 
         delete object (not the same entryUUID)

To avoid this, the pingback of all sync_to_ucs needs to be controlled, not just the deletes.

I reverted the change.
Comment 6 Julia Bremer univentionstaff 2020-04-30 20:59:42 CEST
Original Problem: 
If a user(A) is deleted and another user(B) with the same DN is created shortly thereafter in UCS, 
the sync_to_ucs which follows the deletion, wants to delete A, finds B, and deletes B. 
The user disappears. (Which leads to the Exam-Mode Error)
The sync_to_ucs which follows the creation, finds no object, then creates a new Object(C) with the DN in UCS.
The user reappears.

---------------------------------------------------------------------------

First we tried only deleting users if they have the same entryUUID. E.g. if it is the very same object, not just a similar object (same DN). 

This kind of inverted the problem. 
If users were added, then deleted shortly after, they may be left over.

Here is how that works:
A User (A) is added in UCS, User (A) is deleted in UCS.
The sync_to_ucs following the creation finds no user Object with the DN of A, recreates it (User B). User reappears. 
The sync_to_ucs following the deletion, finds User B, the entryUUID of B doesn't match the entryUUID of A, doesn't delete User B. 

Because of this, we added a new listener to the S4 Connector, which then cached all deleted objects with entryUUID and objectGUID which were removed by UCS. 
Before creating or deleting objects in UCS, the S4 Connector could look up the entryUUID of this object.
If it is found in the cache, the S4 Connector has nothing to do.

With this approach, some users were still left over after our nightly test runs, because of the small window between the user being deleted vs the listener reacting on the deletion.
We could not find a reasonable solution for this problem without changing the structure and functionality of the S4-Connector almost entirely. 

The easiest approach to tackle this problem would be in the exam mode. 
We could just catch the Error if a User disappeared for some time, wait a bit and try again (like we do in ucs-test).
We could wait for full replication after the creation on the Master. 
We could adjust the naming of the students to have a random element, so that those don't have the same DN in each exam run.
Comment 7 Michel Smidt 2020-05-05 15:37:22 CEST
Thank you for your work and analysis.
The UCS@school board will take over for now. Even if we don't go to the root of the problem, I think that the pragmatic way is now the solution in the exam module.

My first approach as a workaround would be the implementation: "We could just catch the Error if a User disappeared for some time, wait a bit and try again (like we do in ucs-test)".

The approach with the "random element" on the username I can exclude for usability reasons.

Further design decisions like "wait for completion of a KA", "permanent storage of users" or "computer name as login name" will be discussed later in the team.
Comment 8 Julia Bremer univentionstaff 2020-05-05 15:44:16 CEST
Thank you Michel. 
Just one new information. 
I actually found out recently, that the exam user will not reappear if created in the wrong time window,
as stated in the first comment, but stay deleted, since the S4-Connector has
some built in mechanics to prevent deleted Users from reappearing.
Waiting for the user to reappear won't work sadly. 

If the ucs-school team has any questions, I am happy to help.
This is a tricky bug :)
Comment 9 Arvid Requate univentionstaff 2020-05-05 17:27:12 CEST
There's another thing I suggested in the discussion with the developers:

* S4C currently uses one table 'UCS deleted' to store (entryUUID, objectGUID) of deleted objects
  to avoid re-appearance of deleted objects (Bug #32263)

* But the class method "s4.update_deleted_cache_after_removal()", which fills that table is used
  in both s4.delete_in_ucs() and s4.delete_in_s4()

* My impression is that it would help if we would use two separate tables here,
  i.e. s4.delete_in_ucs() should keep notes of objects deleted in Samba/AD in a separate table 'CON deleted' instead.
  That way, a delete that originated in OpenLDAP could be distinguished from a delete that originated in Samba/AD.
  That way, the "echo"/pong of a delete could be skipped, I guess.
  A similar strategy could be applied for add ops, but that may be more difficult due to the modify/add ambiguity in the S4C.
Comment 11 Nico Gulden univentionstaff 2020-12-04 12:41:41 CET
A workaround has been implemented to cope with the situation in the exam mode.

Nevertheless, the problem still exists in the S4 connector. I therefore change product, component and version and removed the hint to exam mode from the title.
Comment 12 Nico Gulden univentionstaff 2020-12-04 14:19:59 CET
Automatic tests showed that the problem occurs more often in the yet unreleased UCS 5.0 code because of optimizations in the S4 Connector code which lead to more timing issues, frankly said. Currently, now active support case is known so far on this topic.

When the problem occurs, logfiles grow and the CPU is under load. Postruns from the listener are no longer executed and therefore disturb further synchronizations.

The problem may also affect the AD connector.


After internal discussion we should iterate to a solution here.

1. For the direction Samba → LDAP the OpenLDAP standard feature Post Read Control could be used to kind of "remember" some past synchronizations. That seems to be rather easy to implement and offers some relief. 

2. For the direction LDAP → Samba some more differentiation will be needed which would probably have to be in the S4 connector code. It looks the connector also needs to remember some more of the past synchronizations for a proper handling. Some more thoughts will be needed for a solution.
Comment 13 Arvid Requate univentionstaff 2020-12-04 15:40:37 CET
See also Bug #52358 for UCS 5.0
Comment 14 Florian Best univentionstaff 2021-09-07 15:01:19 CEST
Created attachment 10816 [details]
patch (git:fbest/50593-s4-connector-ping-pong)

Patch, which restores 706f9854e073f674df08547258b2aaea65db1449 but also do the logic in add_in_ucs() to prevent comment #5.
Comment 15 Felix Botner univentionstaff 2021-09-08 11:59:41 CEST
Jenkins Tests looks good (green) https://jenkins.knut.univention.de:8181/job/UCS-5.0/job/UCS-5.0-0/job/S4Connector/cfg=master-s4connector/276/

[master091c] 2021-09-07T14:11:07.729128	__MSG__:98% [Waiting for headers]
[master091c] 2021-09-07T14:11:07.729439	__MSG__:Get:12 http://192.168.0.10/build2 ucs_5.0-0-ucs-school-5.0/all/ univention-s4-connector 14.1.7-4A~5.0.0.202109071536 [68.6 kB]
[master091c] 2021-09-07T14:11:07.729439	__MSG__:98% [12 univention-s4-connector 0 B/68.6 kB 0%]
Comment 16 Florian Best univentionstaff 2021-09-08 13:54:49 CEST
Applied the patch in:

univention-s4-connector.yaml
1a303ab905a2 | Bug #50593: do not remove/add users in sync_to_ucs if they were recently removed "by" UCS

univention-s4-connector (13.0.2-70)
88b4dc6a87c7 | Revert "Bug #50593: Only delete if same objectGUID, only add if not already deleted, only modify/add if not already deleted, POC"
Comment 18 Julia Bremer univentionstaff 2021-09-08 14:30:26 CEST
(In reply to Florian Best from comment #16)
> Applied the patch in:
> 
> univention-s4-connector.yaml
> 1a303ab905a2 | Bug #50593: do not remove/add users in sync_to_ucs if they
> were recently removed "by" UCS
> 
> univention-s4-connector (13.0.2-70)
> 88b4dc6a87c7 | Revert "Bug #50593: Only delete if same objectGUID, only add
> if not already deleted, only modify/add if not already deleted, POC"

Please check if it doesn't create leftover users in case the modify operation happens before the delete operation was used by the s4connector. 
As in: 


openLDAP     S4Connector       S4
delete                         modify
add        <modify / add object does not exist
            delete>             delete
            <delete (dont delete, different entryUUID)
object left iver

as in this case, the delete operation has not been handled by the s4connector yet 
and wasDeletedByUCS will return False. 
This was the reason we wrote a listener in comment6, that set wasDeletedByUCS on LDAP changes(mentioned in comment 6). But even this left a window in which objects were left over. 

This can be checked by setting a few strategic breakpoints in add_in_ucs and deleting the object in openLDAP, then continuing.
Comment 19 Julia Bremer univentionstaff 2021-09-08 14:34:32 CEST
AFAIR we discussed if this new problem was better than the old one, but we decided to leave it as it is and propose a change in the exam mode.
(I think the proposal was to delete the exam-users only at the end of the day)
Comment 20 Florian Best univentionstaff 2021-09-09 14:32:30 CEST
Ok, thanks for that hint again.
I saw it in d3fbdd8fba7bef49ea398f55c629cfbda07c34e8 / 88b4dc6a87c721a5b16f2a2a1beecd761370f51f but ignored it :-P
Comment 21 Florian Best univentionstaff 2021-09-09 17:01:27 CEST
OK, I addressed this issue by creating a new table `UCS added` which is a mapping of entryUUID → objectGUID.

If we create a object in S4 we are storing it's values in sync_from_ucs().
If we receive a modify in sync_to_ucs() we are checking if the equivalent object does not exists in UCS and if it was added once in UCS.
If that's the case, we don't modify the action from "modify" to an "add" and just ignore the operation (`return True`) with a log message on PROCESS level.
IF we remove an object on any side, the entryUUID is removed again from the sqlite table `UCS added`.

There is also the case that sync_to_ucs() can receive a move() operation, but I think we must use the old logic here.
There could also be the theoretical case of further race conditions e.g. when we have an add + remove + add + remove of the same object during the sync_from_ucs.
but this is unlikely to be caused in the real world and can be ignored.

univention-s4-connector (14.0.7-6)
a0366bed549d | Bug #50593: make sure no users are left over/re-created when they are removed in UCS parallel to the creating during sync_from_ucs
Comment 22 Florian Best univentionstaff 2021-09-09 17:06:45 CEST
P.S. I first tried to implement this with a PostReadControl but the Samba-4 server doesn't support this. So I am just doing a regular search after the object creation.
Comment 23 Florian Best univentionstaff 2021-09-10 16:32:23 CEST
I fixed the removal from the sqlite table and only clear the entry in sync_from_ucs not in sync_to_ucs.

Also improved and unified logging: 
* add context to log messages, e.g.: sync UCS > AD: [user] [delete] 'cn=foo,dc=base'
* drop noisy duplicated or repeating log messages to ud.ALL (4)
* add some important log messages, so that we know in which state we are

univention-s4-connector (14.0.7-6)
0e6c6afa71ab | Bug #50593: improve logging of context
bcae36b571ff | Bug #50593: unify context logging
b9ef8f9063a9 | fixup! fixup! Bug #50593: make sure no users are left over/re-created when they are removed in UCS parallel to the creating during sync_from_ucs
f619ca3bd6fd | fixup! Bug #50593: make sure no users are left over/re-created when they are removed in UCS parallel to the creating during sync_from_ucs
Comment 24 Felix Botner univentionstaff 2021-09-13 21:53:11 CEST
FAIL - version in yaml

OK - Jenkins looks good
OK - logging

also added two test case in tests/52_s4connector/516_create_delete_objects_in_ucs_check_sync.py for the (a) leftover problem and (b) falsely removed objects problem
Comment 25 Florian Best univentionstaff 2021-09-14 08:14:49 CEST
(In reply to Felix Botner from comment #24)
> FAIL - version in yaml
done, a679cc5860

> OK - Jenkins looks good
> OK - logging
> 
> also added two test case in
> tests/52_s4connector/516_create_delete_objects_in_ucs_check_sync.py for the
> (a) leftover problem and (b) falsely removed objects problem
great!
Comment 26 Felix Botner univentionstaff 2021-09-14 08:35:55 CEST
OK - yaml