Bug 49194

Summary: Improve robustness of UDN protocol
Product: UCS
Reporter: Christian Völker <voelker>
Component: Notifier (univention-directory-notifier)
Assignee: Stefan Gohmann <gohmann>
Status: CLOSED WORKSFORME
QA Contact: Philipp Hahn <hahn>
Severity: normal
Priority: P5
CC: gohmann, michelsmidt, requate
Version: UCS 4.4
Target Milestone: ---
Hardware: Other
OS: Linux
URL: https://etherpad-lite.knut.univention.de/etherpad/p/pullcord-udn-v2
See Also: https://forge.univention.org/bugzilla/show_bug.cgi?id=48617
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 4: Will affect most installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.571
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support: Yes
Flags outvoted (downgraded) after PO Review:
Ticket number: 2019032621001041, 2019032821000494, 2019031921001509, 2019030421001144
Bug group (optional):
Max CVSS v3 score:
Bug Depends on: 28233, 49198, 49199, 49200, 49201, 49202
Bug Blocks:

Description Christian Völker univentionstaff 2019-03-28 21:49:51 CET
In the last 14 days, Support has received at least four reports of UDN protocol v3 failing.

Troubleshooting has been very time-consuming and, at least in part, not possible.

When the issue occurs, synchronization fails at the customer site with the well-known symptoms, such as users not being able to log in.

The forum also shows numerous reports of issues related to this.

It looks like there are some quirks in the implementation which cause these major issues.

We should make the protocol more robust, both in terms of stability (it should not fail so frequently) and from a troubleshooting viewpoint (it should be possible to fix problems without massive manual synchronization and renumbering of the involved files).
Comment 2 Arvid Requate univentionstaff 2019-03-29 00:22:01 CET
FYI: Ticket#2019032821000494 mentions this error message coming from slapd: "MDB_MAP_FULL: Environment mapsize limit reached".

This looks like the "standard" lmdb error message indicating that the virtual memory limit of the MDB database has been reached. That value can be configured via UCR ldap/database/mdb/maxsize. The default value of 2147483648 (= 2*1024*1024*1024 = 2 GB) has been chosen as the maximum possible value for i386 (IIRC). For amd64 it can be increased as desired and the new value applies directly when the slapd is started again.
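
As a rough illustration of the arithmetic behind that value, the following Python sketch converts a size in GiB into the byte count used for ldap/database/mdb/maxsize and builds the matching `ucr set` line. The helper names and the 4 GiB figure are hypothetical and only serve as an example; the sketch prints the command and does not run it.

```python
# Sketch only: helper names are made up for illustration.
GIB = 1024 ** 3  # bytes per GiB


def mdb_maxsize_bytes(gib: int) -> int:
    """Return the byte value for a map size given in whole GiB."""
    if gib <= 0:
        raise ValueError("size must be a positive number of GiB")
    return gib * GIB


def ucr_set_command(gib: int) -> str:
    """Build the 'ucr set' line for the given size; nothing is executed."""
    return f"ucr set ldap/database/mdb/maxsize={mdb_maxsize_bytes(gib)}"


if __name__ == "__main__":
    # The default mentioned above: 2 GiB
    print(mdb_maxsize_bytes(2))
    # Example of a larger value for an amd64 system (assumed size)
    print(ucr_set_command(4))
```

The new value only takes effect once slapd has been started again, as noted above.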

On amd64 systems the new cn=translog is part of the overall MDB database size (on i386 we use BDB instead for it), so there is an increased virtual memory footprint for the slapd.

Maybe all tickets noted here share a common issue, but it could also be that they have different or multiple issues. We should keep that in mind when analyzing further.
Comment 4 Stefan Gohmann univentionstaff 2019-04-04 16:53:25 CEST
We have released the following updates:

 - Bug #49198: Multiple entries in the transaction file
 - Bug #28233: Notifier should check free space
 - Bug #49201: Extend univention-translog by various consistency checks

 - Bug #49199: [4.3] Multiple entries in the transaction file
 - Bug #49200: [4.3] Notifier should check free space
 - Bug #49202: [4.3] Extend univention-translog by various consistency checks

And we have created two SDB articles:
 https://help.univention.com/t/problem-umc-diagnostic-module-complains-about-problems-with-udn-replication/11707
 https://help.univention.com/t/how-to-reset-listener-notifier-replication/11710

As discussed, the following articles should no longer be used since they are wrong:
 - https://help.univention.com/t/transaction-file-checking/6418
 - https://help.univention.com/t/fixing-translog-issues/11613