Bug 42662 - Sometimes objects are not properly moved during a mass move
Sometimes objects are not properly moved during a mass move
Status: RESOLVED WONTFIX
Product: UCS
Classification: Unclassified
Component: Listener (univention-directory-listener)
UCS 3.3
Other Linux
P5 normal
Assigned To: UCS maintainers
Depends on:
Blocks:
Reported: 2016-10-12 21:19 CEST by Ingo Sieverdingbeck
Modified: 2019-01-03 07:11 CET (History)
6 users

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 1: Will affect a very few installed domains
How will those affected feel about the bug?: 2: A Pain – users won’t like this once they notice it
User Pain: 0.057
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments
/var/lib/univention-ldap/listener/listener of a LDAP slave (1.43 KB, text/plain)
2016-10-12 21:19 CEST, Ingo Sieverdingbeck
listener.log of a LDAP slave (7.44 KB, text/x-log)
2016-10-12 21:20 CEST, Ingo Sieverdingbeck
listener.log and transaction of a backup instance (3.19 KB, text/plain)
2016-10-13 13:49 CEST, Ingo Sieverdingbeck
listener.log of a move_to without history (6.32 KB, text/x-log)
2017-02-06 16:18 CET, Ingo Sieverdingbeck

Description Ingo Sieverdingbeck univentionstaff 2016-10-12 21:19:34 CEST
Created attachment 8094 [details]
/var/lib/univention-ldap/listener/listener of a LDAP slave

During mass move operations we observed that the source object of a move is not always removed on systems that replicate from an LDAP backup instance. The failed move is reported with a 'move_to without history' error line in the listener.log of slave and member systems.

The move changes only the first part of the DN (the uid in this case), so source and destination of the move are in the same container and are visible on all affected LDAP instances. The moved objects had been created by a mass import of an LDIF file with slapadd, and the objects still have different entryUUID values in the different LDAP directories.

The behaviour described above could not be reproduced on LDAP backup instances, nor for objects that don't have the entryUUID mismatch.
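
For illustration only, a minimal python-ldap sketch of how the entryUUID mismatch mentioned above can be checked for one of the affected DNs (hostnames, DN and the anonymous bind are made up for this sketch):

    import ldap

    # Hypothetical servers that were each populated via slapadd.
    SERVERS = ['ldap://master.example.com:7389', 'ldap://backup.example.com:7389']
    DN = 'uid=user00001,cn=users,dc=example,dc=com'  # made-up example DN

    for uri in SERVERS:
        conn = ldap.initialize(uri)
        conn.simple_bind_s()  # anonymous read, just for the sketch
        dn, attrs = conn.search_s(DN, ldap.SCOPE_BASE, attrlist=['entryUUID'])[0]
        # If the values differ between the servers, the entry has the
        # mismatch described above.
        print(uri, attrs['entryUUID'][0].decode())
        conn.unbind_s()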
Comment 1 Ingo Sieverdingbeck univentionstaff 2016-10-12 21:20:21 CEST
Created attachment 8095 [details]
listener.log of a LDAP slave
Comment 2 Ingo Sieverdingbeck univentionstaff 2016-10-13 09:13:40 CEST
One addition: none of the affected objects are in the local listener cache because they are filtered out.
Comment 3 Arvid Requate univentionstaff 2016-10-13 11:36:44 CEST
Via the filter mechanism implemented for Bug 38823? The wording of the original feature request suggests that this was intended for use "on member server instances" and not on LDAP slaves, let alone DC Backups.
Comment 4 Ingo Sieverdingbeck univentionstaff 2016-10-13 13:49:13 CEST
The described problem has been reproduced by the customer after the entryUUIDs of the affected objects had been synced.
Comment 5 Ingo Sieverdingbeck univentionstaff 2016-10-13 13:49:19 CEST
Created attachment 8098 [details]
listener.log and transaction of a backup instance
Comment 6 Ingo Sieverdingbeck univentionstaff 2017-02-06 16:15:39 CET
I was able to reproduce the described behaviour on slaves in my test environment with a probability of about 50% to 100%.

The LDAP has been populated using slapadd -q and an LDIF file with 30k objects on each system. Slapd, notifier and listener were stopped until the last slapadd had finished. This results in objects which have the same DN across the whole environment but a different entryUUID on each system.
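
Roughly, the population step looks like this on each system (service names, LDIF path and ordering are written from memory and may differ in detail):

    import subprocess

    SERVICES = ['univention-directory-listener', 'univention-directory-notifier', 'slapd']
    LDIF = '/root/30k-users.ldif'  # made-up path

    # Stop listener, notifier and slapd before the offline import.
    for service in SERVICES:
        subprocess.check_call(['service', service, 'stop'])

    # Bulk-load the same LDIF on every system; slapadd generates the entryUUID
    # locally, so the same DN ends up with a different entryUUID per server.
    subprocess.check_call(['slapadd', '-q', '-l', LDIF])

    # Services are started again only after the last slapadd has finished.
    for service in reversed(SERVICES):
        subprocess.check_call(['service', service, 'start'])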

The mass rename operation is then started by a script that first enforces an entryUUID sync (MOD_REPLACE on univentionObjectType) and then, as a second step, moves the object. None of the affected objects is in the listener cache due to the 'listener/cache/filter' settings. A stripped-down sketch of the two steps is shown below.
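
What the two steps look like for a single object (python-ldap; DNs, credentials and the new RDN are made up):

    import ldap

    conn = ldap.initialize('ldap://master.example.com:7389')
    conn.simple_bind_s('cn=admin,dc=example,dc=com', 'secret')  # made-up credentials

    old_dn = 'uid=user00001,cn=users,dc=example,dc=com'
    new_rdn = 'uid=user00001-renamed'

    # Step 1: enforce the entryUUID sync by rewriting univentionObjectType
    # with its current value (MOD_REPLACE).
    _, attrs = conn.search_s(old_dn, ldap.SCOPE_BASE, attrlist=['univentionObjectType'])[0]
    conn.modify_s(old_dn, [(ldap.MOD_REPLACE, 'univentionObjectType',
                            attrs['univentionObjectType'])])

    # Step 2: the actual move; only the uid RDN changes, the container stays the same.
    conn.rename_s(old_dn, new_rdn, delold=1)

    conn.unbind_s()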

I was not able to reproduce this on a slave that replicates directly from the master. I was also unable to reproduce it on a slave if its listener is stopped while the moves are executed and restarted only after the backup instance it replicates from has finished replication.
Comment 7 Ingo Sieverdingbeck univentionstaff 2017-02-06 16:18:08 CET
Created attachment 8402 [details]
listener.log of a move_to without history
Comment 9 Arvid Requate univentionstaff 2017-03-27 23:07:56 CEST
I'm trying to catch up with the status here... Comment 2 says:

> One addition: none of the affected objects are in the local listener cache because they are filtered out.

Ok, so if they are not cached then it's always a move without history - where else should the Listener get the history from? The current listener implementation doesn't look at the local LDAP. I guess it should though, if the listener cache has effectively been disabled.
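
Just to sketch the idea (this is not the current listener code, and whether a lookup by entryUUID is even the right key is an open question), such a fallback could look up the old entry in the local LDAP when the cache has no history:

    import ldap

    def find_old_dn(local_uri, base_dn, entry_uuid):
        """Hypothetical fallback: if the listener cache has no history for a
        moved object, search the local LDAP for its entryUUID to recover the
        old DN. URI, base and the entryUUID key are assumptions."""
        conn = ldap.initialize(local_uri)
        conn.simple_bind_s()  # anonymous read, just for the sketch
        try:
            result = conn.search_s(base_dn, ldap.SCOPE_SUBTREE,
                                   '(entryUUID=%s)' % entry_uuid, attrlist=['1.1'])
        finally:
            conn.unbind_s()
        return result[0][0] if result else None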


And a question regarding Comment 8:

> Breakpoint 1, change_update_dn (trans=0x7fffffffddb0) at change.c:787
> 787             if (rv == LDAP_NO_SUCH_OBJECT) {
> $41 = 0
> $42 = 114 'r'

Can you elaborate on what $41 and $42 refer to here? From the code context I can only guess that

* $41 refers to rv
* $42 refers to trans->prev.notify.command


> So the Slave successfully retrieved the entry from the backup - while it
> should be gone there already as the listener only writes its cascaded
> transaction log *after* replication.py has finished updating the local LDAP.
> 
> LMDB is described as "Fully-transactional, full ACID semantics with MVCC"
> - is this a ACID problem in OpenLDAP?

If we really had to dig in that direction, and assuming that LMDB itself is ACID compliant, I could remotely imagine two issues in the way LMDB is used in UCS/OpenLDAP:

a) The translog overlay executes before the LMDB transaction is committed.
b) The Listener LDAP search connection holds a long-running LMDB transaction,
   in which case it would, theoretically, not see updates.

I guess the first race condition could be checked in an experimental setup in which the translog overlay is adjusted to sleep 300 seconds before continuing.
I would be very surprised if b) were the case. Many use cases of OpenLDAP would potentially suffer from that.
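
If needed, hypothesis a) could also be probed from the slave side without touching the overlay: whenever a move transaction becomes visible, immediately check whether the backup still answers for the source DN. A rough sketch (URI and the anonymous bind are assumptions):

    import ldap

    BACKUP_URI = 'ldap://backup.example.com:7389'  # made-up URI

    def source_dn_still_present(old_dn):
        """Return True if the backup still holds old_dn although the move
        transaction is already visible. Repeatedly seeing True right after
        new transactions appear would point towards race a)."""
        conn = ldap.initialize(BACKUP_URI)
        conn.simple_bind_s()  # anonymous read, just for the sketch
        try:
            conn.search_s(old_dn, ldap.SCOPE_BASE, attrlist=['1.1'])
            return True
        except ldap.NO_SUCH_OBJECT:
            return False
        finally:
            conn.unbind_s()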
Comment 11 Stefan Gohmann univentionstaff 2019-01-03 07:11:21 CET
This issue has been filed against UCS 3.3. Maintenance with bug and security fixes for UCS 3.3 ended on 31 December 2016.

Customers still on UCS 3.3 are encouraged to update to UCS 4.3. Please contact
your partner or Univention for any questions.

If this issue still occurs in newer UCS versions, please use "Clone this bug" or simply reopen the issue. In this case please provide detailed information on how this issue is affecting you.