Bug 48030 - Listener doesn't reconnect to notifier after changing the LDAP server, which leads to inconsistencies
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Listener (univention-directory-listener)
Version: UCS 4.3
OS: Other Linux
Priority/Severity: P5 normal
Target Milestone: UCS 4.3-2-errata
Assigned To: Philipp Hahn
QA Contact: Arvid Requate
URL: https://trello.com/c/7jsYr4G2
Depends on:
Blocks: 48088
Reported: 2018-10-19 20:45 CEST by Stefan Gohmann
Modified: 2019-03-30 09:00 CET

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 7: Crash: Bug causes crash or data loss
Who will be affected by this bug?: 1: Will affect a very few installed domains
How will those affected feel about the bug?: 3: A User would likely not purchase the product
User Pain: 0.120
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2018100121000666
Bug group (optional):
Max CVSS v3 score:


Attachments
ucs-kvm-create.cfg (4.97 KB, text/plain)
2018-11-12 11:12 CET, Philipp Hahn
Log file (1.15 MB, text/plain)
2018-11-12 11:14 CET, Philipp Hahn
revert_flow_refactoring.diff (776 bytes, patch)
2018-11-12 18:21 CET, Arvid Requate

Description Stefan Gohmann univentionstaff 2018-10-19 20:45:00 CEST
The listener on a slave connects to the notifier and OpenLDAP on a master or backup. It is mandatory that notifier and OpenLDAP run on one server.

If the OpenLDAP server on the notifier system is restarted during replication, the listener may connect to the OpenLDAP server on another system while keeping its notifier connection. This leads to inconsistencies in the replication.

I was able to reproduce the behavior in a UCS 4.3-2 environment with a master, two backups and one slave. On the master I created users in an endless loop. The listener on the slave was connected to a backup (notifier and OpenLDAP). During the replication I repeatedly restarted the OpenLDAP server on that backup. At some point the slave connected to another OpenLDAP server but kept the notifier connection.
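The listener builds an ordered candidate list (printed in listener.log below as backup432, backup436, master431) and re-selects a server after a connection failure; the problem is that this re-selection switched only the LDAP connection while the notifier connection stayed where it was. A minimal sketch of such a round-robin re-selection, with hypothetical names (the real logic lives in src/select_server.c and is more involved):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of listener server re-selection: after the current
 * server fails, pick the next candidate in order, wrapping around.
 * Host names are taken from the listener.log excerpt below. */
static const char *candidates[] = {
    "backup432.deadlock43.intranet",
    "backup436.deadlock43.intranet",
    "master431.deadlock43.intranet",
};
static const size_t n_candidates = sizeof(candidates) / sizeof(candidates[0]);

const char *next_server(const char *failed) {
    for (size_t i = 0; i < n_candidates; i++)
        if (strcmp(candidates[i], failed) == 0)
            return candidates[(i + 1) % n_candidates];
    return candidates[0]; /* unknown server: start from the top */
}
```

If only the LDAP connection follows this re-selection while the notifier connection is kept, the listener ends up reading transactions from one host and objects from another, which is exactly the inconsistency described above.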

Connection list on the backup:

root@backup436:~# netstat -tupen | grep 10.201.43.4
tcp        0      0 10.201.43.6:7389        10.201.43.4:38926       ESTABLISHED 0          104284     9232/slapd
tcp6       0      0 10.201.43.6:6669        10.201.43.4:40560       ESTABLISHED 0          104004     9162/univention-dir
root@backup436:~#


listener.log on the slave:

15.10.18 14:55:03.744  DEBUG_INIT
0: backup432.deadlock43.intranet
1: backup436.deadlock43.intranet
2: master431.deadlock43.intranet
UNIVENTION_DEBUG_BEGIN  : uldap.__open host=backup436.deadlock43.intranet port=7389 base=dc=deadlock43,dc=intranet
UNIVENTION_DEBUG_END    : uldap.__open host=backup436.deadlock43.intranet port=7389 base=dc=deadlock43,dc=intranet
15.10.18 14:58:16.287  LISTENER    ( PROCESS ) : updating 'cn=test1000,cn=uid,cn=temporary,cn=univention,dc=deadlock43,dc=intranet' command a
[...]
15.10.18 14:58:36.189  LISTENER    ( PROCESS ) : updating 'uid=test1014,dc=deadlock43,dc=intranet' command a
15.10.18 14:58:36.191  LDAP        ( ERROR   ) : start_tls: Can't contact LDAP server
15.10.18 14:58:37.192  LDAP        ( ERROR   ) : start_tls: Can't contact LDAP server
15.10.18 14:58:39.193  LDAP        ( ERROR   ) : start_tls: Can't contact LDAP server
15.10.18 14:58:43.194  LDAP        ( ERROR   ) : start_tls: Can't contact LDAP server
15.10.18 14:58:51.195  LDAP        ( ERROR   ) : start_tls: Can't contact LDAP server
15.10.18 14:59:07.195  LISTENER    ( WARN    ) : error searching DN uid=test1014,dc=deadlock43,dc=intranet: -1
15.10.18 14:59:07.195  LISTENER    ( ERROR   ) : change_update_dn failed: -1
15.10.18 14:59:07.197  LDAP        ( ERROR   ) : start_tls: Can't contact LDAP server
15.10.18 14:59:07.197  LISTENER    ( WARN    ) : can not connect to ldap server (backup436.deadlock43.intranet)
15.10.18 14:59:07.197  LISTENER    ( WARN    ) : can not connect to any ldap server, retrying in 30 seconds
0: backup432.deadlock43.intranet
1: backup436.deadlock43.intranet
2: master431.deadlock43.intranet
15.10.18 14:59:37.197  LISTENER    ( WARN    ) : chosen server: backup436.deadlock43.intranet:7389
15.10.18 14:59:37.198  LDAP        ( ERROR   ) : start_tls: Can't contact LDAP server
15.10.18 14:59:37.198  LISTENER    ( WARN    ) : can not connect to ldap server (backup436.deadlock43.intranet)
15.10.18 14:59:37.198  LISTENER    ( WARN    ) : can not connect to any ldap server, retrying in 30 seconds
0: backup432.deadlock43.intranet
1: backup436.deadlock43.intranet
2: master431.deadlock43.intranet
15.10.18 15:00:07.199  LISTENER    ( WARN    ) : chosen server: backup436.deadlock43.intranet:7389
15.10.18 15:00:07.200  LDAP        ( ERROR   ) : start_tls: Can't contact LDAP server
15.10.18 15:00:07.200  LISTENER    ( WARN    ) : can not connect to ldap server (backup436.deadlock43.intranet)
15.10.18 15:00:07.200  LISTENER    ( WARN    ) : can not connect to any ldap server, retrying in 30 seconds
0: backup432.deadlock43.intranet
1: backup436.deadlock43.intranet
2: master431.deadlock43.intranet
15.10.18 15:00:37.200  LISTENER    ( WARN    ) : chosen server: backup432.deadlock43.intranet:7389
15.10.18 15:00:37.259  LISTENER    ( PROCESS ) : updating 'uid=test1014,dc=deadlock43,dc=intranet' command a
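The gaps between the "Can't contact LDAP server" lines above are 1 s, 2 s, 4 s, 8 s and 16 s, which suggests a doubling backoff between reconnect attempts. A minimal sketch of that timing, assuming plain doubling as the timestamps indicate; the helper names are hypothetical:

```c
/* Delay before retry number `attempt`, doubling each time:
 * attempt 0 -> 1 s, 1 -> 2 s, 2 -> 4 s, 3 -> 8 s, 4 -> 16 s. */
unsigned backoff_delay(unsigned attempt) {
    return 1u << attempt;
}

/* Total time spent before giving up after `attempts` retries. */
unsigned total_backoff(unsigned attempts) {
    unsigned sum = 0;
    for (unsigned i = 0; i < attempts; i++)
        sum += backoff_delay(i);
    return sum; /* 5 attempts -> 31 s overall */
}
```

This matches the log: the first error is at 14:58:36 and the listener gives up at 14:59:07, 31 seconds later.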

Connection list afterwards:

root@backup436:~# netstat -tupen | grep 10.201.43.4
tcp6       0      0 10.201.43.6:6669        10.201.43.4:40560       ESTABLISHED 0          104004     9162/univention-dir
root@backup436:~# 

root@backup432:~# netstat -tupen | grep 10.201.43.4
tcp        0      0 10.201.43.2:7389        10.201.43.4:41932       ESTABLISHED 0          645095     7374/slapd
root@backup432:~#
Comment 1 Philipp Hahn univentionstaff 2018-11-05 16:30:53 CET
[4.3-2] 083de5b0ab Bug #48030 UDL: Remove wrong select_server()
 .../univention-directory-listener/debian/changelog |  6 +++++
 .../univention-directory-listener/src/notifier.c   | 28 ++++++++++++----------
 .../src/select_server.c                            |  6 +++++
 3 files changed, 27 insertions(+), 13 deletions(-)

[4.3-2] 6dc20d9efa Bug #48030 UDL: Print selected server
 management/univention-directory-listener/src/main.c          | 8 +++-----
 management/univention-directory-listener/src/notifier.c      | 1 -
 management/univention-directory-listener/src/select_server.c | 2 +-
 3 files changed, 4 insertions(+), 7 deletions(-)

[4.3-2] 4dda7ca8a5 Bug #48030 UDL: Remove broken connection attempt tracking
 management/univention-directory-listener/src/main.c       |  2 --
 management/univention-directory-listener/src/notifier.c   |  2 --
 .../univention-directory-listener/src/select_server.c     | 15 ---------------
 .../univention-directory-listener/src/select_server.h     |  2 --
 4 files changed, 21 deletions(-)


Package: univention-directory-listener
Version: 12.0.0-32A~4.3.0.201811051607
Branch: ucs_4.3-0
Scope: errata4.3-2

[4.3-2] eed92d5e59 Bug #48030 UDL: Remove wrong select_server() YAML
 doc/errata/staging/univention-directory-listener.yaml | 13 +++++++++++++
 1 file changed, 13 insertions(+)


QA: I have not tested this, but the bug should be triggerable by this:

@backup2: stop the listener on the 2nd backup (so it no longer updates its slapd)
@master: create a new object
@slave: Restart UDL until connected to UDN@backup1
@slave: wait for the 15s timeout period of UDL to close the LDAP connection
@backup1: stop that upstream slapd
@master: modify the object

That should trigger the bug in UDL@slave:
it will try to reconnect to slapd@backup1, which is offline,
it will re-select slapd@backup2, but will not find the object there,
and @slave will remove the object instead of modifying it.

With the new version UDL@slave will terminate eventually,
get restarted by runit,
will select UDN@backup2,
and will only start doing things once UDL@backup2 is started again.
Comment 2 Philipp Hahn univentionstaff 2018-11-12 11:12:10 CET
Created attachment 9734 [details]
ucs-kvm-create.cfg

cfg file to set up a test environment using our internal KVM test environment and IPs from my range.
Comment 3 Philipp Hahn univentionstaff 2018-11-12 11:14:09 CET
Created attachment 9735 [details]
Log file

...
>[slave] 2018-11-09T10:19:55.558137      dn: uid=phahn,cn=users,dc=bug48030,dc=dev
>[backup1] 2018-11-09T10:19:55.608117    dn: uid=phahn,cn=users,dc=bug48030,dc=dev
>[backup1] 2018-11-09T10:19:55.608117    description: 1541755193

The entry is updated on both systems, but only the LDAP search on backup1 shows the modified entry, as UDL@slave replicates from backup2, which is NOT updated, and thus the old entry is replicated again.
Comment 4 Arvid Requate univentionstaff 2018-11-12 18:21:39 CET
Created attachment 9736 [details]
revert_flow_refactoring.diff

Commit 083de5b0ab changes the program flow in notifier_listen(...). I don't think this change is functionally required for this bug and I also think that it could introduce an infinite loop in case change_update_dn returns something that is neither LDAP_SUCCESS nor LDAP_SERVER_DOWN. The attached patch shows exactly what I would like to see reverted.
Comment 5 Arvid Requate univentionstaff 2018-11-12 18:40:06 CET
There is another question: In case the currently selected LDAP server process becomes unavailable, the code would now retry 5 times and wait 30 seconds before each retry. So now we have (at least) 150 seconds failover delay where before we had (at least) 30 seconds. So my question would be: why did you choose to retry 5 times? Do you think it is too expensive to just retry once (or maximally twice) and then give up on the currently selected server?
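The delay arithmetic in this comment can be made explicit. A trivial sketch, with a hypothetical helper name:

```c
/* Failover delay as described in the comment above: each retry waits
 * `wait_s` seconds (here 30) before reconnecting, so 5 retries mean at
 * least 150 s before the listener gives up on the selected server,
 * versus at least 30 s with a single attempt. */
unsigned failover_delay_s(unsigned retries, unsigned wait_s) {
    return retries * wait_s;
}
```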
Comment 6 Philipp Hahn univentionstaff 2018-11-13 16:29:37 CET
(In reply to Arvid Requate from comment #4)
> Commit 083de5b0ab changes the program flow in notifier_listen(...). I don't
> think this change is functionally required for this bug and I also think
> that it could introduce an infinite loop in case change_update_dn returns
> something that is neither LDAP_SUCCESS nor LDAP_SERVER_DOWN. The attached
> patch shows exactly what I would like to see reverted.
...

(In reply to Arvid Requate from comment #5)
> There is another question: In case the currently selected LDAP server
> process becomes unavailable, the code would now retry 5 times and wait 30
> seconds before each retry. So now we have (at least) 150 seconds failover
> delay where before we had (at least) 30 seconds. So my question would be:
> why did you choose to retry 5 times? Do you think it is too expensive to
> just retry once (or maximally twice) and then give up on the currently
> selected server?


As discussed, I removed the complete block, since every ldap_search_ext_s() call is already wrapped by LDAP_RETRY(), which handles LDAP_SERVER_DOWN.
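The fix relies on every ldap_search_ext_s() call already being wrapped by LDAP_RETRY(). A minimal sketch of such a wrapper, with hypothetical constants and a mocked operation (this is not the actual macro from the listener sources):

```c
/* Hypothetical sketch of an LDAP_RETRY()-style wrapper: repeat an LDAP
 * operation while it fails with LDAP_SERVER_DOWN, up to a fixed number
 * of attempts. OpenLDAP's API error code for a down server is -1. */
#define LDAP_SUCCESS      0
#define LDAP_SERVER_DOWN  (-1)
#define LDAP_MAX_RETRIES  6

#define LDAP_RETRY(rv, op)                                                   \
    do {                                                                     \
        int _attempt = 0;                                                    \
        do {                                                                 \
            (rv) = (op);                                                     \
        } while ((rv) == LDAP_SERVER_DOWN && ++_attempt < LDAP_MAX_RETRIES); \
    } while (0)

/* Mock operation standing in for a wrapped ldap_search_ext_s() call:
 * it fails twice with LDAP_SERVER_DOWN, then succeeds. */
static int failures_left = 2;
static int mock_search(void) {
    return (failures_left-- > 0) ? LDAP_SERVER_DOWN : LDAP_SUCCESS;
}

int search_with_retry(void) {
    int rv;
    LDAP_RETRY(rv, mock_search());
    return rv;
}
```

With such a wrapper in place, a separate retry block around the search is redundant, which is why it could be removed.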

[4.3-2] 9464dfbe05 Bug #48030: Fix retry logic
 .../univention-directory-listener/debian/changelog |  6 +++++
 .../univention-directory-listener/src/notifier.c   | 30 +++-------------------
 2 files changed, 10 insertions(+), 26 deletions(-)

Package: univention-directory-listener
Version: 12.0.0-33A~4.3.0.201811131559
Branch: ucs_4.3-0
Scope: errata4.3-2

[4.3-2] 63c1604429 Bug #48030 LDAP: Log connections at PROCESS level
 base/univention-policy/debian/changelog | 6 ++++++
 base/univention-policy/lib/ldap.c       | 4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

Package: univention-policy
Version: 9.0.0-4A~4.3.0.201811131609
Branch: ucs_4.3-0
Scope: errata4.3-2

[4.3-2] 06c5a2a306 Bug #48030 LDAP: Log connections at PROCESS level YAML
 doc/errata/staging/univention-directory-listener.yaml |  2 +-
 doc/errata/staging/univention-policy.yaml             | 10 ++++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

QA: tail /var/log/univention/listener.log
...
13.11.18 16:28:36.578  LISTENER    ( WARN    ) : Notifier/LDAP server is ma43.phahn.qa:7389
13.11.18 16:28:36.578  LDAP        ( PROCESS ) : connecting to ldap://ma43.phahn.qa:7389
Comment 7 Arvid Requate univentionstaff 2018-11-14 08:02:13 CET
Ok, it works and the listener.log looks good. Now the LDAP_RETRY macro handles the retries. I saw 6 attempts in my logs, spread over 15 seconds, before giving up. Then the listener is restarted.

Tonight's Jenkins run doesn't show replication-related regressions.

3ef7d7a035 | Advisory: Fix version number in univention-policy.yaml
8edde6fe5d | Minimal wording adjustment in univention-directory-listener.yaml