Univention Bugzilla – Bug 50593
Ping pong between LDAP and the Samba4 leads to disappearing users
Last modified: 2021-09-15 17:41:44 CEST
In school environments we have problems in certain constellations with the synchronization of the S4 connector in exam mode. The class test mode is a key use case and is often started several times in a row to test or if certain configurations have to be re-set. The problem is that ExamUser can than disappear. This is a problem because the exam situation for teachers is a considerable stress situation and if the mode does not already work in the tests, the mode is not used altogether. The reason for the disappearing ExamUser lies in a ping pong between LDAP and the Samba4, which occurs with high system load or too slow replication. This is how it works: All (user) activities in OpenLDAP are transferred in parallel, but slightly delayed (!) from the S4 connector to Samba. From there, they are then transferred back from S4 to LDAP (because a few attributes have been added)! However, the delay at high load may lead to the following order: User is created/modified in LDAP User is copied from LDAP to S4 User is copied from S4 to LDAP User is deleted in LDAP User is created again in LDAP User is deleted in S4 ("copied from LDAP to S4") User is deleted from S4 in LDAP ("copied from S4 to LDAP") Software (here ExamMode) has an error because the user has disappeared in LDAP. User is copied from S4 to LDAP Admin/developer looks in LDAP, finds the user and wonders. This problem also appears regularly in our Jenkins tests and could be fixed if UCS users are not deleted with the same DN but only with the same entryUUID. To be a fix would be sufficient if it is activatable for school environments.
f3b062080f Bug #50593: yaml 706f9854e0 Bug #50593: Users in UCS wont be deleted in sync_to_ucs if deleted "by" UCS Successful build Package: univention-s4-connector Version: 13.0.2-67A~4.4.0.202003261342 Branch: ucs_4.4-0 Scope: errata4.4-4 User: jbremer Luckily, the logic to only delete objects if their entryUUID/objectGUID matches was introduced in Bug #47374 but it was only used for containers/ous. I tested this by stopping the s4-connector via pdb at line 1458, where it searches for a corresponding object with the same DN after removing an object. Then I created an object with the same DN and let the connector continue. Before this change the connector used to delete the new object which has the same DN, with these changes it does not. Waiting for the test result.
109sync_create_and_modify_ucs_to_ad failed last build because a user still existed after udm remove. Investigating.
Because it just failed once on jenkins, I don't have logs to analyse the problem that led this test to fail. The test did not fail again, it seems to be flaky.
I was wrong. With my change, we traded one problem for another one. The tests failed because they added objects, modified them, then deleted them. Its something like this: LDAP -> S4 add object <- add object(nochange) -> modify object -> delete object <- modify object(already delted/ add instead) <- delete object (not the same entryUUID) To avoid this, the pingback of all sync_to_ucs needs to be controlled, not just the deletes. I reverted the change.
Original Problem: If a user(A) is deleted and another user(B) with the same DN is created shortly thereafter in UCS, the sync_to_ucs which follows the deletion, wants to delete A, finds B, and deletes B. The user disappears. (Which leads to the Exam-Mode Error) The sync_to_ucs which follows the creation, finds no object, then creates a new Object(C) with the DN in UCS. The user reappears. --------------------------------------------------------------------------- First we tried only deleting users if they have the same entryUUID. E.g. if it is the very same object, not just a similar object (same DN). This kind of inverted the problem. If users were added, then deleted shortly after, they may be left over. Here is how that works: A User (A) is added in UCS, User (A) is deleted in UCS. The sync_to_ucs following the creation finds no user Object with the DN of A, recreates it (User B). User reappears. The sync_to_ucs following the deletion, finds User B, the entryUUID of B doesn't match the entryUUID of A, doesn't delete User B. Because of this, we added a new listener to the S4 Connector, which then cached all deleted objects with entryUUID and objectGUID which were removed by UCS. Before creating or deleting objects in UCS, the S4 Connector could look up the entryUUID of this object. If it is found in the cache, the S4 Connector has nothing to do. With this approach, some users were still left over after our nightly test runs, because of the small window between the user being deleted vs the listener reacting on the deletion. We could not find a reasonable solution for this problem without changing the structure and functionality of the S4-Connector almost entirely. The easiest approach to tackle this problem would be in the exam mode. We could just catch the Error if a User disappeared for some time, wait a bit and try again (like we do in ucs-test). We could wait for full replication after the creation on the Master. We could adjust the naming of the students to have a random element, so that those don't have the same DN in each exam run.
Thank you for your work and analysis. The UCS@school board will take over for now. Even if we don't go to the root of the problem, I think that the pragmatic way is now the solution in the exam module. My first approach as a workaround would be the implementation: "We could just catch the Error if a User disappeared for some time, wait a bit and try again (like we do in ucs-test)". The approach with the "random element" on the username I can exclude for usability reasons. Further design decisions like "wait for completion of a KA", "permanent storage of users" or "computer name as login name" will be discussed later in the team.
Thank you Michel. Just one new information. I actually found out recently, that the exam user will not reappear if created in the wrong time window, as stated in the first comment, but stay deleted, since the S4-Connector has some built in mechanics to prevent deleted Users from reappearing. Waiting for the user to reappear won't work sadly. If the ucs-school team has any questions, I am happy to help. This is a tricky bug :)
There's another thing I suggested in the discussion with the developers: * S4C currently uses one table 'UCS deleted' to store (entryUUID, objectGUID) of deleted objects to avoid re-appearance of deleted objects (Bug #32263) * But the class method "s4.update_deleted_cache_after_removal()", which fills that table is used in both s4.delete_in_ucs() and s4.delete_in_s4() * My impression is that it would help if we would use two separate tables here, i.e. s4.delete_in_ucs() should keep notes of objects deleted in Samba/AD in a separate table 'CON deleted' instead. That way, a delete that originated in OpenLDAP could be distinguished from a delete that originated in Samba/AD. That way, the "echo"/pong of a delete could be skipped, I guess. A similar strategy could be applied for add ops, but that may be more difficult due to the modify/add ambiguity in the S4C.
A workaround has been implemented to cope with the situation in the exam mode. Nevertheless, the problem still exists in the S4 connector. I therefore change product, component and version and removed the hint to exam mode from the title.
Automatic tests showed that the problem occurs more often in the yet unreleased UCS 5.0 code because of optimizations in the S4 Connector code which lead to more timing issues, frankly said. Currently, now active support case is known so far on this topic. When the problem occurs, logfiles grow and the CPU is under load. Postruns from the listener are no longer executed and therefore disturb further synchronizations. The problem may also affect the AD connector. After internal discussion we should iterate to a solution here. 1. For the direction Samba → LDAP the OpenLDAP standard feature Post Read Control could be used to kind of "remember" some past synchronizations. That seems to be rather easy to implement and offers some relief. 2. For the direction LDAP → Samba some more differentiation will be needed which would probably have to be in the S4 connector code. It looks the connector also needs to remember some more of the past synchronizations for a proper handling. Some more thoughts will be needed for a solution.
See also Bug #52358 for UCS 5.0
Created attachment 10816 [details] patch (git:fbest/50593-s4-connector-ping-pong) Patch, which restores 706f9854e073f674df08547258b2aaea65db1449 but also do the logic in add_in_ucs() to prevent comment #5.
Jenkins Tests looks good (green) https://jenkins.knut.univention.de:8181/job/UCS-5.0/job/UCS-5.0-0/job/S4Connector/cfg=master-s4connector/276/ [master091c] 2021-09-07T14:11:07.729128 __MSG__:98% [Waiting for headers] [master091c] 2021-09-07T14:11:07.729439 __MSG__:Get:12 http://192.168.0.10/build2 ucs_5.0-0-ucs-school-5.0/all/ univention-s4-connector 14.1.7-4A~5.0.0.202109071536 [68.6 kB] [master091c] 2021-09-07T14:11:07.729439 __MSG__:98% [12 univention-s4-connector 0 B/68.6 kB 0%]
Applied the patch in: univention-s4-connector.yaml 1a303ab905a2 | Bug #50593: do not remove/add users in sync_to_ucs if they were recently removed "by" UCS univention-s4-connector (13.0.2-70) 88b4dc6a87c7 | Revert "Bug #50593: Only delete if same objectGUID, only add if not already deleted, only modify/add if not already deleted, POC"
(In reply to Florian Best from comment #16) > Applied the patch in: > > univention-s4-connector.yaml > 1a303ab905a2 | Bug #50593: do not remove/add users in sync_to_ucs if they > were recently removed "by" UCS > > univention-s4-connector (13.0.2-70) > 88b4dc6a87c7 | Revert "Bug #50593: Only delete if same objectGUID, only add > if not already deleted, only modify/add if not already deleted, POC" Please check if it doesn't create leftover users in case the modify operation happens before the delete operation was used by the s4connector. As in: openLDAP S4Connector S4 delete modify add <modify / add object does not exist delete> delete <delete (dont delete, different entryUUID) object left iver as in this case, the delete operation has not been handled by the s4connector yet and wasDeletedByUCS will return False. This was the reason we wrote a listener in comment6, that set wasDeletedByUCS on LDAP changes(mentioned in comment 6). But even this left a window in which objects were left over. This can be checked by setting a few strategic breakpoints in add_in_ucs and deleting the object in openLDAP, then continuing.
AFAIR we discussed if this new problem was better than the old one, but we decided to leave it as it is and propose a change in the exam mode. (I think the proposal was to delete the exam-users only at the end of the day)
Ok, thanks for that hint again. I saw it in d3fbdd8fba7bef49ea398f55c629cfbda07c34e8 / 88b4dc6a87c721a5b16f2a2a1beecd761370f51f but ignored it :-P
OK, I addressed this issue by creating a new table `UCS added` which is a mapping of entryUUID → objectGUID. If we create a object in S4 we are storing it's values in sync_from_ucs(). If we receive a modify in sync_to_ucs() we are checking if the equivalent object does not exists in UCS and if it was added once in UCS. If that's the case, we don't modify the action from "modify" to an "add" and just ignore the operation (`return True`) with a log message on PROCESS level. IF we remove an object on any side, the entryUUID is removed again from the sqlite table `UCS added`. There is also the case that sync_to_ucs() can receive a move() operation, but I think we must use the old logic here. There could also be the theoretical case of further race conditions e.g. when we have an add + remove + add + remove of the same object during the sync_from_ucs. but this is unlikely to be caused in the real world and can be ignored. univention-s4-connector (14.0.7-6) a0366bed549d | Bug #50593: make sure no users are left over/re-created when they are removed in UCS parallel to the creating during sync_from_ucs
P.S. I first tried to implement this with a PostReadControl but the Samba-4 server doesn't support this. So I am just doing a regular search after the object creation.
I fixed the removal from the sqlite table and only clear the entry in sync_from_ucs not in sync_to_ucs. Also improved and unified logging: * add context to log messages, e.g.: sync UCS > AD: [user] [delete] 'cn=foo,dc=base' * drop noisy duplicated or repeating log messages to ud.ALL (4) * add some important log messages, so that we know in which state we are univention-s4-connector (14.0.7-6) 0e6c6afa71ab | Bug #50593: improve logging of context bcae36b571ff | Bug #50593: unify context logging b9ef8f9063a9 | fixup! fixup! Bug #50593: make sure no users are left over/re-created when they are removed in UCS parallel to the creating during sync_from_ucs f619ca3bd6fd | fixup! Bug #50593: make sure no users are left over/re-created when they are removed in UCS parallel to the creating during sync_from_ucs
FAIL - version in yaml OK - Jenkins looks good OK - logging also added two test case in tests/52_s4connector/516_create_delete_objects_in_ucs_check_sync.py for the (a) leftover problem and (b) falsely removed objects problem
(In reply to Felix Botner from comment #24) > FAIL - version in yaml done, a679cc5860 > OK - Jenkins looks good > OK - logging > > also added two test case in > tests/52_s4connector/516_create_delete_objects_in_ucs_check_sync.py for the > (a) leftover problem and (b) falsely removed objects problem great!
OK - yaml
<https://errata.software-univention.de/#/?erratum=5.0x89>