Univention Bugzilla – Bug 33495
duplication of ldap-entries after "udm move"
Last modified: 2013-12-12 11:32:02 CET
Ticket#: 2013111421008025
Ticket#: 2013111121001626

Two customers reported duplication of LDAP entries (users and computer objects, to be more precise) on one or more DC Backups after "udm move". Both customers use larger S4 environments with multiple sites, OUs and more than one DC Backup (all running S4). Samba 4 AD is not affected, so I assume this is a listener problem. Sadly, we have not yet been able to reproduce this behaviour or catch any notable logs.
We should also check a backport for UCS 3.1.
One problem is the handling of the modrdn status file: the file is removed before the modrdn process has finished. This should be changed, but I'm not sure whether this is the reason for the customer cases.
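For illustration, a minimal sketch of the intended lifecycle, assuming a single state file; the path and helper names are hypothetical and not the actual replication.py code:

import os

# Hypothetical location of the modrdn status file used in this sketch.
MODRDN_STATE = '/var/lib/univention-directory-listener/modrdn_in_progress'


def rename_phase_one(dn, entry_uuid):
    # Phase I (the "delete" half of the rename): remember that a rename
    # is pending so phase II can pick it up.
    with open(MODRDN_STATE, 'w') as fd:
        fd.write('%s %s\n' % (entry_uuid, dn))


def rename_phase_two(do_rename):
    # Phase II (the "add" half): perform the rename first and remove the
    # status file only after it succeeded, not before the process is done.
    do_rename()
    os.remove(MODRDN_STATE)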
We got more log output:

*****************************************************************************
22.11.13 17:21:38.866 LISTENER ( INFO ) : delete handlers for cn=OBJECT,cn=computers,LDAP_BASE
22.11.13 17:21:38.868 LISTENER ( ALL ) : replication: rename phase I: cn=OBJECT,cn=computers,LDAP_BASE (entryUUID=6c778548-84ec-1032-8de1-dda543e42327)
*****************************************************************************

The rename was considered, which is correct.

*****************************************************************************
22.11.13 17:21:38.868 LISTENER ( INFO ) : handler: replication (successful)
[...]
22.11.13 17:21:38.868 LISTENER ( INFO ) : handler: pkgdb-watch (skipped)
22.11.13 17:21:38.868 LISTENER ( INFO ) : deleted from cache: cn=OBJECT,cn=computers,LDAP_BASE
22.11.13 17:21:38.869 LISTENER ( INFO ) : write to transaction file dn=[cn=OBJECT,cn=computers,LDAP_BASE], command=[r]
22.11.13 17:21:38.870 LISTENER ( ALL ) : >>> MSGID: 4 GET_DN 89013
22.11.13 17:21:38.912 LISTENER ( ALL ) : RESULT: [(null)]
22.11.13 17:21:38.912 LISTENER ( ALL ) : <<< MSGID: 4 89013 cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE a
22.11.13 17:21:38.912 LISTENER ( INFO ) : notifier returned = id: 89013 dn: cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE cmd: a
22.11.13 17:21:38.912 LISTENER ( INFO ) : updating cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE
22.11.13 17:21:38.963 LISTENER ( ALL ) : >>> MSGID: 5 GET_SCHEMA_ID
22.11.13 17:21:39.012 LISTENER ( ALL ) : RESULT: [(null)]
22.11.13 17:21:39.012 LISTENER ( ALL ) : <<< MSGID: 5 5
22.11.13 17:21:39.012 LISTENER ( INFO ) : checking parent_dn of cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE in local LDAP
22.11.13 17:21:39.012 LDAP ( INFO ) : connecting to ldap://localhost:7389/
22.11.13 17:21:39.012 LDAP ( INFO ) : simple_bind as cn=admin,LDAP_BASE
22.11.13 17:21:39.013 LISTENER ( ALL ) : no cache entry found for cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE
22.11.13 17:21:39.013 LISTENER ( INFO ) : running handlers for cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE
22.11.13 17:21:39.013 LISTENER ( ALL ) : handler: replication considered
22.11.13 17:21:39.013 LISTENER ( INFO ) : LDAP keys=['uid', 'krb5PrincipalName', 'univentionOperatingSystemVersion', 'objectClass', 'entryUUID', 'sambaAcctFlags', 'sambaPrimaryGroupSID', 'entryDN', 'structuralObjectClass',
22.11.13 17:21:39.013 LISTENER ( INFO ) : replication: old entries from LDAP server and Listener do not match
22.11.13 17:21:39.013 LISTENER ( ALL ) : replication: modify: cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE
*****************************************************************************

The rename is not continued; it was treated as a normal modification. I'm still unable to reproduce it, but it could happen if the local listener cache still contains information about the old object. The rename phase should be continued even if the replication module gets both new and old (a modification).
The modrdn phase is now continued even if both new and old were given.

Fix: 46289
YAML: 46292
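A rough sketch of the changed decision logic (names and the state-file path are hypothetical, not the shipped replication.py): while the modrdn status file exists, an incoming change is handled as rename phase II even when the listener passes both "new" and "old".

import os

MODRDN_STATE = '/tmp/modrdn_in_progress'  # hypothetical path for this sketch


def choose_action(new, old):
    """Decide which replication step should run for an incoming change."""
    rename_pending = os.path.exists(MODRDN_STATE)
    if rename_pending and new:
        # Before the fix only "new and not old" continued the rename; now a
        # pending rename also wins when "old" is present, so the change is
        # not misclassified as a plain modification.
        return 'rename phase II'
    if new and old:
        return 'modify'
    if new:
        return 'add'
    return 'delete'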
Actually the listener.log entry at 17:21:39.013 shows that the listener did not pass any "old" object, but replication.py found the object in the local LDAP:

22.11.13 17:21:39.013 LISTENER ( INFO ) : LDAP keys=['uid', 'krb5PrincipalName', 'univentionOperatingSystemVersion', 'objectClass', 'entryUUID', 'sambaAcctFlags', 'sambaPrimaryGroupSID', 'entryDN', 'structuralObjectClass', 'krb5MaxLife', 'cn', 'hasSubordinates', 'userPassword', 'krb5Key', 'krb5MaxRenew', 'loginShell', 'aRecord', 'uidNumber', 'univentionObjectType', 'krb5KDCFlags', 'gidNumber', 'subschemaSubentry', 'displayName', 'sambaPwdLastSet', 'sambaNTPassword', 'modifyTimestamp', 'entryCSN', 'associatedDomain', 'creatorsName', 'modifiersName', 'sambaSID', 'createTimestamp', 'krb5KeyVersionNumber', 'sn', 'homeDirectory', 'univentionOperatingSystem']; listener keys=[]

Note that the listener keys are empty here. replication.py uses the result of the local LDAP search as the "old" object. So it would be interesting to know why the target DN cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE existed in the local (slave) LDAP at that point.

I guess the current patch would not fix the problem in this case, as an LDAP rename_s call would fail if the target exists, resulting in a traceback in line 930:

l.rename_s(old_dn, new_rdn, new_parent)

What do you think? At least I think we should avoid this potential traceback. Or maybe we even find another clue about the source of this issue.
I simulated this by manually pre-creating the target DN on a DC Slave and then moving the test user on the DC Master. This is what I get with the new code:

===========================================================================
24.10.13 07:55:09.611 LISTENER ( INFO ) : LDAP keys=['cn', 'krb5PrincipalName', 'objectClass', 'entryUUID', 'sambaAcctFlags', 'sambaPasswordHistory', 'entryDN', 'structuralObjectClass', 'krb5MaxLife', 'shadowLastChange', 'uid', 'hasSubordinates', 'userPassword', 'krb5Key', 'sambaMungedDial', 'krb5MaxRenew', 'gecos', 'uidNumber', 'loginShell', 'univentionObjectType', 'krb5KDCFlags', 'gidNumber', 'subschemaSubentry', 'entryCSN', 'sambaPwdLastSet', 'sambaPrimaryGroupSID', 'sambaNTPassword', 'modifyTimestamp', 'displayName', 'modifiersName', 'sambaSID', 'createTimestamp', 'krb5KeyVersionNumber', 'sn', 'pwhistory', 'homeDirectory', 'creatorsName']; listener keys=[]
24.10.13 07:55:09.611 LISTENER ( INFO ) : replication: old entries from LDAP server and Listener do not match
24.10.13 07:55:09.611 LISTENER ( ALL ) : replication: rename phase II: uid=t3,cn=c1,cn=users,dc=ar311r1,dc=qa (entryUUID=9b09af44-d091-1032-8ad8-fdff6bd9af56)
24.10.13 07:55:09.612 LISTENER ( WARN ) : Already exists: uid=t3,cn=c1,cn=users,dc=ar311r1,dc=qa; trying to apply changes
24.10.13 07:55:09.613 LISTENER ( INFO ) : LDAP keys=['cn', 'krb5PrincipalName', 'objectClass', 'entryUUID', 'sambaAcctFlags', 'sambaPasswordHistory', 'entryDN', 'structuralObjectClass', 'krb5MaxLife', 'shadowLastChange', 'uid', 'hasSubordinates', 'userPassword', 'krb5Key', 'sambaMungedDial', 'krb5MaxRenew', 'gecos', 'uidNumber', 'loginShell', 'univentionObjectType', 'krb5KDCFlags', 'gidNumber', 'subschemaSubentry', 'entryCSN', 'sambaPwdLastSet', 'sambaPrimaryGroupSID', 'sambaNTPassword', 'modifyTimestamp', 'displayName', 'modifiersName', 'sambaSID', 'createTimestamp', 'krb5KeyVersionNumber', 'sn', 'pwhistory', 'homeDirectory', 'creatorsName']; listener keys=['cn', 'krb5PrincipalName', 'objectClass', 'uidNumber', 'sambaAcctFlags', 'sambaPasswordHistory', 'krb5MaxLife', 'shadowLastChange', 'uid', 'userPassword', 'krb5Key', 'sambaMungedDial', 'krb5MaxRenew', 'gecos', 'loginShell', 'univentionObjectType', 'krb5KDCFlags', 'gidNumber', 'sambaPwdLastSet', 'sambaPrimaryGroupSID', 'sambaNTPassword', 'displayName', 'sambaSID', 'krb5KeyVersionNumber', 'sn', 'pwhistory', 'homeDirectory']
===========================================================================

So the beginning looks like the listener log file found in Ticket#: 2013111421008025. But now an ldap.ALREADY_EXISTS exception occurs during the ldap.rename_s, which is caught by the listener, and it tries again ("trying to apply changes"). This goes on until:

RuntimeError: maximum recursion depth exceeded while calling a Python object
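To make the failure mode explicit, here is a tiny self-contained sketch (not the real listener code) of why retrying by recursing on the same exception ends in the RuntimeError above:

class AlreadyExists(Exception):
    """Stands in for ldap.ALREADY_EXISTS in this sketch."""


def apply_changes(dn, target_exists=True):
    try:
        if target_exists:
            # This is where rename_s would raise ldap.ALREADY_EXISTS.
            raise AlreadyExists(dn)
    except AlreadyExists:
        # Nothing resolves the conflicting target before retrying, so the
        # call recurses with unchanged state until Python gives up with
        # "maximum recursion depth exceeded".
        return apply_changes(dn, target_exists)

Any fix therefore has to actually resolve the conflict (or bound the retries) before calling the handler again.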
(In reply to Arvid Requate from comment #5)
> Actually the listener.log entry at 17:21:39.013 shows that the listener did
> not pass any "old" object, but replication.py found the object in the local
> LDAP:

Ah, yes. You are right, I missed the re-check against the local slapd because I thought too much about a corrupted cache.

> What do you think? At least I think we should avoid this potential
> traceback. Or maybe we even find another clue about the source of this issue.

I think we should do the following (see the sketch below):
 - if rename and new and not old: do the rename like we always did
 - if rename and new and old: log the status, remove old and rename new to old

The rename from new to old sounds strange, but I think you know what I mean. That would be a kind of self-healing. Do you agree?
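A hedged sketch of that proposal using python-ldap; the function and its signature are invented for illustration, and the backup/logging step is only indicated by a comment:

import ldap
import ldap.dn
import ldap.modlist


def rename_with_selfhealing(lo, old_dn, new_dn, new_entry):
    # Split the target DN into RDN and parent for rename_s().
    new_rdn = ldap.dn.dn2str([ldap.dn.str2dn(new_dn)[0]])
    new_parent = ldap.dn.dn2str(ldap.dn.str2dn(new_dn)[1:])
    try:
        lo.rename_s(old_dn, new_rdn, new_parent)
    except ldap.ALREADY_EXISTS:
        # Self-healing: a stale local object occupies the target DN.
        # Log/back it up (omitted here), delete it, then retry the rename.
        lo.delete_s(new_dn)
        try:
            lo.rename_s(old_dn, new_rdn, new_parent)
        except ldap.NO_SUCH_OBJECT:
            # The source is already gone locally; recreate the object from
            # the "new" entry sent by the master instead.
            lo.add_s(new_dn, ldap.modlist.addModlist(new_entry))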
I've added it as described. Additionally, the locally deleted objects are dumped to /var/univention-backup/replication/<seconds since 1970>.

I've also added a test case for it: 46replication_modrdn_local_modification

Updates: r46420 + r46423
Test Case: r46417
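For reference, a small sketch of such a dump, assuming plain LDIF output; only the target directory comes from the comment above, the rest is illustrative:

import os
import time

import ldif  # shipped with python-ldap


BACKUP_DIR = '/var/univention-backup/replication'


def dump_before_delete(dn, entry):
    # Append the object as LDIF to a file named after the current epoch
    # seconds, so locally removed data can still be inspected later.
    if not os.path.isdir(BACKUP_DIR):
        os.makedirs(BACKUP_DIR)
    backup_file = os.path.join(BACKUP_DIR, '%d' % time.time())
    with open(backup_file, 'a') as fd:
        ldif.LDIFWriter(fd).unparse(dn, entry)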
We found another issue: a rename in UCS 2.4 results in different entryUUIDs on master and backup/slave. This is a problem for the 3.1 / 3.2 move because it is based on the same entryUUID.

If phase II is not recognized in the modrdn process, the old entry should be removed locally.
(In reply to Stefan Gohmann from comment #9)
> We found another issue: a rename in UCS 2.4 results in different entryUUIDs
> on master and backup/slave. This is a problem for the 3.1 / 3.2 move because
> it is based on the same entryUUID.
>
> If phase II is not recognized in the modrdn process, the old entry should be
> removed locally.

This has been fixed with r46471. Modrdn phase I now creates a symlink that points to the current entryUUID. If the entryUUID does not match in phase II, the old dn is removed and the new one is added.
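A minimal sketch of the idea behind r46471 (directory, key, and helper names are hypothetical): phase I stores the entryUUID as the target of a symlink, and phase II compares it before deciding between a rename and a delete+add.

import os

MODRDN_DIR = '/var/lib/univention-directory-listener/modrdn'  # hypothetical


def record_phase_one(key, entry_uuid):
    # Store the entryUUID of the object being renamed; a symlink target is
    # a cheap way to keep a single value per pending rename.
    link = os.path.join(MODRDN_DIR, key)
    if os.path.islink(link):
        os.remove(link)
    os.symlink(entry_uuid, link)


def same_object_in_phase_two(key, entry_uuid):
    # If the recorded UUID differs, the pending rename belongs to another
    # object: the old dn has to be deleted locally and the new entry added
    # instead of being renamed.
    link = os.path.join(MODRDN_DIR, key)
    return os.path.islink(link) and os.readlink(link) == entry_uuid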
Ok:
 * Advisory OK
 * code refactoring of _delete_dn_recursive is backwards compatible
 * target conflict resolution (recursive backup+delete+logging OK, move OK)
 * source missing during rename works as well (also with target conflict)
 * During QA I also saw evidence of the Comment 2 fix improving the behavior after a manual ldapmodrdn run on the master:

==============================================================================
04.12.13 11:21:36.341 LISTENER ( PROCESS ) : replication: rename phase I: uid=user3,ou=ou1,dc=ar320i10,dc=qa (entryUUID=30cc3eb6-f117-1032-8a08-6bfd03501c5a)
04.12.13 11:21:36.682 LISTENER ( PROCESS ) : replication: rename phase II: uid=user3,cn=usercontainer1,ou=ou2,dc=ar320i10,dc=qa (entryUUID=30cc3eb6-f117-1032-8a08-6bfd03501c5a)
04.12.13 11:21:36.682 LISTENER ( WARN ) : Can't contact LDAP server: retrying
04.12.13 11:21:36.684 LISTENER ( PROCESS ) : replication: rename phase II: uid=user3,cn=usercontainer1,ou=ou2,dc=ar320i10,dc=qa (entryUUID=30cc3eb6-f117-1032-8a08-6bfd03501c5a)
04.12.13 11:21:36.684 LISTENER ( PROCESS ) : replication: rename from uid=user3,ou=ou1,dc=ar320i10,dc=qa to uid=user3,cn=usercontainer1,ou=ou2,dc=ar320i10,dc=qa
==============================================================================

QA TODO left:
 * check handling of entryUUID difference
 * check _modify_object_from_old_and_new and _add_object_from_new refactoring
Ok, the test cases work as well.
http://errata.univention.de/ucs/3.2/11.html