33495 – duplication of ldap-entries after "udm move"

Bug 33495 - duplication of ldap-entries after "udm move"

Summary: duplication of ldap-entries after "udm move"

Status:	CLOSED FIXED

Alias:	None

Product:	UCS
Classification:	Unclassified
Component:	Listener (univention-directory-listener)
Version:	UCS 3.2
Hardware:	Other Linux

Importance:	P5 normal
Target Milestone:	UCS 3.2-0-errata
Assignee:	Stefan Gohmann
QA Contact:	Arvid Requate

URL:
Keywords:

Depends on:
Blocks:	33555
	Show dependency tree / graph

Reported:	2013-11-20 17:31 CET by Janis Meybohm
Modified:	2013-12-12 11:32 CET (History)
CC List:	2 users (show)

See Also:
What kind of report is it?:	---
What type of bug is this?:	---
Who will be affected by this bug?:	---
How will those affected feel about the bug?:	---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Customer ID:
Max CVSS v3 score:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Janis Meybohm

2013-11-20 17:31:31 CET

Ticket#: 2013111421008025
Ticket#: 2013111121001626

Two customers reported duplication (users and computer objects to be more precise) of ldap-entries on one or more DC-Backups after "udm move".
Both customers use larger S4 environments with multiple sites, OU's and more than one DC-Backup (all running S4).

Samba 4 AD is not affected so I assume this is a listener problem, sadly we were not yet able to reproduce this behaviour or catch any nameble logs.

Comment 1 Stefan Gohmann

2013-11-20 19:36:11 CET

We should also check a backport for UCS 3.1.

Comment 2 Stefan Gohmann

2013-11-22 12:36:55 CET

One problem is the handling of the modrdn status file. The file is removed before the modrn process has been finished. This should be changed.

But I'm not sure if this is the reason for the customer cases.

Comment 3 Stefan Gohmann

2013-11-25 08:06:24 CET

We got more log output:

*****************************************************************************
22.11.13 17:21:38.866  LISTENER    ( INFO    ) : delete handlers for cn=OBJECT,cn=computers,LDAP_BASE
22.11.13 17:21:38.868  LISTENER    ( ALL     ) : replication: rename phase I: cn=OBJECT,cn=computers,LDAP_BASE (entryUUID=6c778548-84ec-1032-8de1-dda543e42327)
*****************************************************************************

The rename was considered which is correct.

*****************************************************************************
22.11.13 17:21:38.868  LISTENER    ( INFO    ) : handler: replication (successful)
[...]
22.11.13 17:21:38.868  LISTENER    ( INFO    ) : handler: pkgdb-watch (skipped)
22.11.13 17:21:38.868  LISTENER    ( INFO    ) : deleted from cache: cn=OBJECT,cn=computers,LDAP_BASE
22.11.13 17:21:38.869  LISTENER    ( INFO    ) : write to transaction file dn=[cn=OBJECT,cn=computers,LDAP_BASE], command=[r]
22.11.13 17:21:38.870  LISTENER    ( ALL     ) : >>>
MSGID: 4
GET_DN 89013



22.11.13 17:21:38.912  LISTENER    ( ALL     ) : RESULT: [(null)]

22.11.13 17:21:38.912  LISTENER    ( ALL     ) : <<<
MSGID: 4
89013 cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE a


22.11.13 17:21:38.912  LISTENER    ( INFO    ) : notifier returned = id: 89013  dn: cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE   cmd: a
22.11.13 17:21:38.912  LISTENER    ( INFO    ) : updating cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE
22.11.13 17:21:38.963  LISTENER    ( ALL     ) : >>>
MSGID: 5
GET_SCHEMA_ID



22.11.13 17:21:39.012  LISTENER    ( ALL     ) : RESULT: [(null)]

22.11.13 17:21:39.012  LISTENER    ( ALL     ) : <<<
MSGID: 5
5


22.11.13 17:21:39.012  LISTENER    ( INFO    ) : checking parent_dn of cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE in local LDAP
22.11.13 17:21:39.012  LDAP        ( INFO    ) : connecting to ldap://localhost:7389/
22.11.13 17:21:39.012  LDAP        ( INFO    ) : simple_bind as cn=admin,LDAP_BASE
22.11.13 17:21:39.013  LISTENER    ( ALL     ) : no cache entry found for cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE
22.11.13 17:21:39.013  LISTENER    ( INFO    ) : running handlers for cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE
22.11.13 17:21:39.013  LISTENER    ( ALL     ) : handler: replication considered
22.11.13 17:21:39.013  LISTENER    ( INFO    ) : LDAP keys=['uid', 'krb5PrincipalName', 'univentionOperatingSystemVersion', 'objectClass', 'entryUUID', 'sambaAcctFlags', 'sambaPrimaryGroupSID', 'entryDN', 'structuralObjectClass',
22.11.13 17:21:39.013  LISTENER    ( INFO    ) : replication: old entries from LDAP server and Listener do not match
22.11.13 17:21:39.013  LISTENER    ( ALL     ) : replication: modify: cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE
*****************************************************************************

The rename is not continued. It was considered as a normal modification. I'm still unable to reproduce it but it could be possible if the local listener cache still contains informations about the old object.

The rename phase should be continued even if the replication module gets new and old (modification).

Comment 4 Stefan Gohmann

2013-11-25 08:54:38 CET

The modrdn phase is now continued even if new and old was given.

Fix: 46289
YAML: 46292

Comment 5 Arvid Requate

2013-11-26 18:18:48 CET

Actually the listener.log enty at 17:21:39.013  shows that the listener did not pass any "old" object, but replication.py found the object in the local local LDAP:

22.11.13 17:21:39.013  LISTENER    ( INFO    ) : LDAP keys=['uid', 'krb5PrincipalName', 'univentionOperatingSystemVersion', 'objectClass', 'entryUUID', 'sambaAcctFlags', 'sambaPrimaryGroupSID', 'entryDN', 'structuralObjectClass', 'krb5MaxLife', 'cn', 'hasSubordinates', 'userPassword', 'krb5Key', 'krb5MaxRenew', 'loginShell', 'aRecord', 'uidNumber', 'univentionObjectType', 'krb5KDCFlags', 'gidNumber', 'subschemaSubentry', 'displayName', 'sambaPwdLastSet', 'sambaNTPassword', 'modifyTimestamp', 'entryCSN', 'associatedDomain', 'creatorsName', 'modifiersName', 'sambaSID', 'createTimestamp', 'krb5KeyVersionNumber', 'sn', 'homeDirectory', 'univentionOperatingSystem']; listener keys=[]

Note that listener keys is empty here. replication.py uses the result of the local LDAP search as "old" object. So it would be interesting to know why the target DN

 cn=OBJECT,cn=memberserver,cn=computers,LDAP_BASE

existed in the local (slave) LDAP at that point.

I guess in this case the current patch would not fix the problem in this case, as an LDAP rename_s call would fail if the target exists, resulting in a traceback in line 930:

   l.rename_s(old_dn, new_rdn, new_parent)

What do you think? At least I think we should avoid this potential traceback? Or maybe we even find another clue about the source of this issue.

Comment 6 Arvid Requate

2013-11-26 19:10:34 CET

I simulated this by manually pre-creating the target DN on a DC Slave and then moving the test user on the DC Master. This is what I get with the new code:

===========================================================================
24.10.13 07:55:09.611  LISTENER    ( INFO    ) : LDAP keys=['cn', 'krb5PrincipalName', 'objectClass', 'entryUUID', 'sambaAcctFlags', 'sambaPasswordHistory', 'entryDN', 'structuralObjectClass', 'krb5MaxLife', 'shadowLastChange', 'uid', 'hasSubordinates', 'userPassword', 'krb5Key', 'sambaMungedDial', 'krb5MaxRenew', 'gecos', 'uidNumber', 'loginShell', 'univentionObjectType', 'krb5KDCFlags', 'gidNumber', 'subschemaSubentry', 'entryCSN', 'sambaPwdLastSet', 'sambaPrimaryGroupSID', 'sambaNTPassword', 'modifyTimestamp', 'displayName', 'modifiersName', 'sambaSID', 'createTimestamp', 'krb5KeyVersionNumber', 'sn', 'pwhistory', 'homeDirectory', 'creatorsName']; listener keys=[]
24.10.13 07:55:09.611  LISTENER    ( INFO    ) : replication: old entries from LDAP server and Listener do not match
24.10.13 07:55:09.611  LISTENER    ( ALL     ) : replication: rename phase II: uid=t3,cn=c1,cn=users,dc=ar311r1,dc=qa (entryUUID=9b09af44-d091-1032-8ad8-fdff6bd9af56)
24.10.13 07:55:09.612  LISTENER    ( WARN    ) : Already exists: uid=t3,cn=c1,cn=users,dc=ar311r1,dc=qa; trying to apply changes
24.10.13 07:55:09.613  LISTENER    ( INFO    ) : LDAP keys=['cn', 'krb5PrincipalName', 'objectClass', 'entryUUID', 'sambaAcctFlags', 'sambaPasswordHistory', 'entryDN', 'structuralObjectClass', 'krb5MaxLife', 'shadowLastChange', 'uid', 'hasSubordinates', 'userPassword', 'krb5Key', 'sambaMungedDial', 'krb5MaxRenew', 'gecos', 'uidNumber', 'loginShell', 'univentionObjectType', 'krb5KDCFlags', 'gidNumber', 'subschemaSubentry', 'entryCSN', 'sambaPwdLastSet', 'sambaPrimaryGroupSID', 'sambaNTPassword', 'modifyTimestamp', 'displayName', 'modifiersName', 'sambaSID', 'createTimestamp', 'krb5KeyVersionNumber', 'sn', 'pwhistory', 'homeDirectory', 'creatorsName']; listener keys=['cn', 'krb5PrincipalName', 'objectClass', 'uidNumber', 'sambaAcctFlags', 'sambaPasswordHistory', 'krb5MaxLife', 'shadowLastChange', 'uid', 'userPassword', 'krb5Key', 'sambaMungedDial', 'krb5MaxRenew', 'gecos', 'loginShell', 'univentionObjectType', 'krb5KDCFlags', 'gidNumber', 'sambaPwdLastSet', 'sambaPrimaryGroupSID', 'sambaNTPassword', 'displayName', 'sambaSID', 'krb5KeyVersionNumber', 'sn', 'pwhistory', 'homeDirectory']
===========================================================================

So, the beginning looks like the listener log file found in Ticket#: 2013111421008025. But now an ldap.ALREADY_EXISTS exception occurs during the ldap.rename_s, which is cought by the listener and he tries again "trying to apply changes". This goes on until 

RuntimeError: maximum recursion depth exceeded while calling a Python object

Comment 7 Stefan Gohmann

2013-11-27 06:38:02 CET

(In reply to Arvid Requate from comment #5)
> Actually the listener.log enty at 17:21:39.013  shows that the listener did
> not pass any "old" object, but replication.py found the object in the local
> local LDAP:

Ah, yes. You are right, I missed the re-check against the local slapd because I thought too much about a corrupted cache.

> What do you think? At least I think we should avoid this potential
> traceback? Or maybe we even find another clue about the source of this issue.

I think we should do the following:

 - if rename and new and not old: do rename like we always did
 - if rename and new and old: log the status, remove old and rename new to old

The rename from new to old sounds strange but I think you know what I mean.

That would be a kind of self-healing. Do you agree?

Comment 8 Stefan Gohmann

2013-11-29 13:51:47 CET

I've added it like described. Additional the local deleted objects are dumped to /var/univention-backup/replication/<seconds since 1970>. I've also added a test case for it: 46replication_modrdn_local_modification

Updates: r46420 + r46423
Test Case: r46417

Comment 9 Stefan Gohmann

2013-12-02 16:14:06 CET

We found another issue. The rename in 2.4 results to different entryUUIDs on master and backup/slave. This is a problem for the 3.1 / 3.2 move because it based on the same entryUUID.

If phase II is not recognized in the modrdn process, the old entry should be removed locally.

Comment 10 Stefan Gohmann

2013-12-03 07:22:39 CET

(In reply to Stefan Gohmann from comment #9)
> We found another issue. The rename in 2.4 results to different entryUUIDs on
> master and backup/slave. This is a problem for the 3.1 / 3.2 move because it
> based on the same entryUUID.
> 
> If phase II is not recognized in the modrdn process, the old entry should be
> removed locally.

This has been fixed with r46471.

The modrdn phase I now creates a symlink to point to the current entryUUID. If the entryUUID does not match in phase II, the old dn will be removed and the new one will be added.

Comment 11 Arvid Requate

2013-12-04 20:09:17 CET

Ok:
 * Advisory OK
 * code refactoring of _delete_dn_recursive is backwards compatible
 * target conflict resolution (recursive backup+delete+logging OK, move OK)
 * source missing during rename works as well (also with target conflict)
 * During the QA I also positively saw an evidence of the Comment 2 improving the behavior after a manual ldapmodrn run on the master:
==============================================================================
 04.12.13 11:21:36.341  LISTENER    ( PROCESS ) : replication: rename phase I: uid=user3,ou=ou1,dc=ar320i10,dc=qa (entryUUID=30cc3eb6-f117-1032-8a08-6bfd03501c5a)
04.12.13 11:21:36.682  LISTENER    ( PROCESS ) : replication: rename phase II: uid=user3,cn=usercontainer1,ou=ou2,dc=ar320i10,dc=qa (entryUUID=30cc3eb6-f117-1032-8a08-6bfd03501c5a)
04.12.13 11:21:36.682  LISTENER    ( WARN    ) : Can't contact LDAP server: retrying
04.12.13 11:21:36.684  LISTENER    ( PROCESS ) : replication: rename phase II: uid=user3,cn=usercontainer1,ou=ou2,dc=ar320i10,dc=qa (entryUUID=30cc3eb6-f117-1032-8a08-6bfd03501c5a)
04.12.13 11:21:36.684  LISTENER    ( PROCESS ) : replication: rename from uid=user3,ou=ou1,dc=ar320i10,dc=qa to uid=user3,cn=usercontainer1,ou=ou2,dc=ar320i10,dc=qa
==============================================================================
QA TODO left:
 * check handling of entryUUID difference
 * check _modify_object_from_old_and_new and _add_object_from_new refactoring

Comment 12 Arvid Requate

2013-12-05 16:15:33 CET

Ok, the test cases work as well.

Comment 13 Moritz Muehlenhoff

2013-12-12 11:32:02 CET

http://errata.univention.de/ucs/3.2/11.html