Bug 40230 - Replication may run into LDAP search timeout and lets join fail
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Listener (univention-directory-listener)
Version: UCS 4.0
Hardware: Other Linux
Importance: P5 normal
Target Milestone: UCS 4.0-4-errata
Assigned To: Philipp Hahn
QA Contact: Arvid Requate
Depends on:
Blocks: 40373
Reported: 2015-12-11 16:36 CET by Alexander Kläser
Modified: 2016-02-04 15:58 CET (History)

Bug group (optional): External feedback, Large environments, UCS Performance
Description Alexander Kläser univentionstaff 2015-12-11 16:36:45 CET
Seen at Ticket#2015121121000514.

In a UCS@school scenario, the join of a DC slave failed as an LDAP search request (which presumably queried all LDAP objects) timed out. Subsequently, all following actions carried out by the listener join script failed. After a restart of the DC master, the LDAP server was more responsive and allowed a join.

========== join.log ==========
> [...]
> Configure 03univention-directory-listener.inst Fri Dec 11 16:20:57 CET 2015
> [...]
> 11.12.15 15:47:49.335  LISTENER    ( WARN    ) : initializing module replication
> File: /var/lib/univention-ldap/ldap/DB_CONFIG
> slapd: Kein Prozess gefunden
> File: /var/lib/univention-ldap/ldap/DB_CONFIG
> Starting ldap server(s): slapd ...done.
> Restarting ldap server(s).
> Stopping ldap server(s): slapd ...retry #1....done.
> Starting ldap server(s): slapd ...done.
> 11.12.15 15:53:04.076  LISTENER    ( ERROR   ) : could not get DNs when initializing replication: Timed out
> [...]
====================

In the source code, the timeout for this LDAP search is hard-coded to 5 minutes:

========== management/univention-directory-listener/src/change.c ==========
> [...]
>        struct timeval timeout = {
>            .tv_sec = 5*60,
>            .tv_usec = 0,
>        };
>        int sizelimit0 = 0;
>        if ((rv =  ldap_search_ext_s(lp->ld, (*f)->base, (*f)->scope, (*f)->filter, _attrs, attrsonly1,  serverctrls, clientctrls, &timeout, sizelimit0, &res)) != LDAP_SUCCESS) {
>            univention_debug(UV_DEBUG_LISTENER, UV_DEBUG_ERROR, "could not get DNs when initializing %s: %s", handler->name, ldap_err2string(rv));
>            return rv;
>        }
> [...]
====================

It would be nice to use paging for the request in order to avoid these problems.
Comment 1 Philipp Hahn univentionstaff 2015-12-11 16:49:35 CET

*** This bug has been marked as a duplicate of bug 34877 ***
Comment 2 Stefan Gohmann univentionstaff 2015-12-11 20:30:38 CET
(In reply to Alexander Kläser from comment #0)
> It would be nice to use paging for the request in order to avoid these
> problems.

Yes. Alternatively, we could simply increase the timeout or make the timeout configurable in a first step.
Comment 3 Alexander Kläser univentionstaff 2015-12-14 11:15:34 CET
(In reply to Stefan Gohmann from comment #2)
> Yes. Alternatively, we could simply increase the timeout or make the timeout
> configurable in a first step.

IMHO, paging would have a real benefit because it allows logging progress information. That would be especially helpful here, as it is difficult to tell at first glance whether a system is hanging or not.
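
To illustrate the argument above, here is a minimal sketch of a paged retrieval loop that logs progress after every page. It does not use libldap: `fetch_page_fn`, `fetch_all_paged` and `fake_directory_fetch` are hypothetical names invented for this example, and the in-memory "directory" only stands in for a server that supports paged searches (for LDAP, that would be the Simple Paged Results control from RFC 2696). It is not the listener's actual code.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical page source: returns up to `page_size` entries starting at
 * `*cookie`, advances the cookie, and returns the number of entries
 * delivered. A return value of 0 means the result set is exhausted. */
typedef size_t (*fetch_page_fn)(void *ctx, size_t *cookie, size_t page_size);

/* Retrieve the whole result set page by page and log progress after each
 * page, so an observer can tell a slow search from a hanging one.
 * Returns the total number of entries retrieved. */
static size_t fetch_all_paged(fetch_page_fn fetch, void *ctx, size_t page_size)
{
    size_t cookie = 0, total = 0, pages = 0, got;

    while ((got = fetch(ctx, &cookie, page_size)) > 0) {
        total += got;
        pages++;
        fprintf(stderr, "page %zu: %zu entries so far\n", pages, total);
    }
    return total;
}

/* In-memory stand-in for a directory; `ctx` points to the entry count. */
static size_t fake_directory_fetch(void *ctx, size_t *cookie, size_t page_size)
{
    size_t entries = *(const size_t *)ctx;
    size_t remaining = entries - *cookie;
    size_t got = remaining < page_size ? remaining : page_size;

    *cookie += got;
    return got;
}
```

With a real server, each page would be its own short search request, so a per-request timeout would bound the time for one page rather than for the entire directory.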
Comment 4 Stefan Gohmann univentionstaff 2015-12-29 10:09:39 CET
(In reply to Alexander Kläser from comment #3)
> IMHO, paging would have a real benefit because it allows logging progress
> information. That would be especially helpful here, as it is difficult to
> tell at first glance whether a system is hanging or not.

Yes, sure. Therefore we have Bug #34877.

I think increasing the timeout or making it configurable is a first step that could help in such a support scenario. Implementing paging would take much more effort.
Comment 5 Sönke Schwardt-Krummrich univentionstaff 2016-01-06 10:39:02 CET
The problem occurred two more times on different slaves. With univention-directory-listener version 9.0.2-5.269.201506171450, the joins completed successfully.

(In reply to Stefan Gohmann from comment #4)
> I think increasing the timeout or making it configurable is a first step
> that could help in such a support scenario. Implementing paging would take
> much more effort.

The problem seems to have been introduced with commit 63434, which added a hard timeout of 5 minutes. As discussed, the default timeout should be raised to 2 hours and has to be configurable via UCR.

The difficulty is that in problematic environments the timeout has to be set before the first join attempt, so setting the value via a UCR policy is not possible.
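
A minimal sketch of what such a configurable timeout could look like, assuming the configured value arrives as a string holding a number of seconds. The helper name `search_timeout_from_string` and the macro `DEFAULT_SEARCH_TIMEOUT_SEC` are hypothetical; the actual UCR variable name and value syntax are not stated in this report, and this is not the code that was committed.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/time.h>

/* Default proposed in this report: 2 hours. */
#define DEFAULT_SEARCH_TIMEOUT_SEC (2L * 60 * 60)

/* Hypothetical helper: turn a configuration string (seconds) into the
 * timeout passed to the LDAP search. NULL, empty, non-numeric,
 * out-of-range or non-positive values fall back to the default. */
static struct timeval search_timeout_from_string(const char *value)
{
    struct timeval timeout = {
        .tv_sec = DEFAULT_SEARCH_TIMEOUT_SEC,
        .tv_usec = 0,
    };
    char *end = NULL;
    long seconds;

    if (value == NULL || *value == '\0')
        return timeout;

    errno = 0;
    seconds = strtol(value, &end, 10);
    /* Accept the value only if the whole string was a positive number. */
    if (errno == 0 && *end == '\0' && seconds > 0)
        timeout.tv_sec = seconds;
    return timeout;
}
```

The resulting `struct timeval` would take the place of the hard-coded `5*60` value in the `ldap_search_ext_s()` call quoted in the description. Falling back to the default on invalid input keeps a mistyped setting from silently disabling or shortening the timeout.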
Comment 6 Sönke Schwardt-Krummrich univentionstaff 2016-01-28 12:02:30 CET
The customer asked for a fix because joining slave systems is not possible unless the listener is downgraded before the join.
Comment 7 Philipp Hahn univentionstaff 2016-02-01 11:52:47 CET
Backport from Bug #40373:
r67091 | Bug #40230 UDL: Abort on out-of-memory
r67090 | Bug #40230 UDL: Fix memory leaks
r67089 | Bug #40230 UDL: Free LDAP memory
r67088 | Bug #40230 UDL: Remove pointless free()
r67087 | Bug #40230 UDL: Only retrieve DNs
r67086 | Bug #40230 UDL: Fix long search timeout
r67085 | Bug #40230 UDL: static change_init_module()
r67084 | Bug #40230 UDL: Declare extern
r67083 | Bug #40230 UDL: Remove self-include
r67082 | Bug #40230 UDL: Copyright 2016

Package: univention-directory-listener
Version: 9.0.2-9.295.201602011148
Branch: ucs_4.0-0
Scope: errata4.0-4

r67093 | Bug #40338 UDL: Fix long search timeout YAML
 univention-directory-listener.yaml
Comment 8 Arvid Requate univentionstaff 2016-02-01 21:23:32 CET
Verified:
* Code review Ok
* Advisory / Versioning Ok
* Function Ok

Could you adjust the UCR variable description (and YAML) to state which time units or, more generally, which value syntax is allowed/expected? See Bug 40373 Comment 3.
Comment 9 Philipp Hahn univentionstaff 2016-02-02 08:12:23 CET
(In reply to Arvid Requate from comment #8)
> Could you adjust the UCR variable description (and YAML) to state which
> time units or, more generally, which value syntax is allowed/expected? See
> Bug 40373 Comment 3.

r67110 | Bug #40230 UDL: Improve UCR variable description

Package: univention-directory-listener
Version: 9.0.2-10.297.201602020808
Branch: ucs_4.0-0
Scope: errata4.0-4

r67111 | Bug #40230,Bug #40373 UDL: Improve UCR variable description YAML
 univention-directory-listener.yaml
Comment 10 Janek Walkenhorst univentionstaff 2016-02-04 15:58:14 CET
<http://errata.software-univention.de/ucs/4.0/395.html>