Bug 57024 – Join fails if notifier gets restarted

Bug 57024 - Join fails if notifier gets restarted


Summary:	Join fails if notifier gets restarted

Status:	CLOSED FIXED

Product:	UCS
Classification:	Unclassified
Component:	Listener (univention-directory-listener)
Version:	UCS 5.0
Hardware:	Other Linux

Importance:	P5 normal (vote)
Target Milestone:	UCS 5.0-6-errata
Assigned To:	Arvid Requate
QA Contact:	Julia Bremer

URL:	https://git.knut.univention.de/univen...
Keywords:

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2024-02-04 11:13 CET by Stefan Gohmann
Modified:	2024-02-28 13:17 CET (History)
CC List:	4 users (show)

See Also:
What kind of report is it?:	Bug Report
What type of bug is this?:	6: Setup Problem: Issue for the setup process
Who will be affected by this bug?:	1: Will affect a very few installed domains
How will those affected feel about the bug?:	5: Blocking further progress on the daily work
User Pain:	0.171
Enterprise Customer affected?:
School Customer affected?:	Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:	2023091921000063
Bug group (optional):
Max CVSS v3 score:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Stefan Gohmann

2024-02-04 11:13:12 CET

I've tried to join a school slave in a big customer environment (180k users and 290 school servers). The join failed in three of three cases because the notifer has been restarted while the directory-listener join script was running.

The notifer has been restarted due to different reasons:
----------------------------------------------------------------------------------
03.02.24 08:59:15.356  TRANSFILE   ( ERROR   ) : ldap_sasl_interactive_bind_s(): Can't contact LDAP server
03.02.24 08:59:25.933  DEBUG_INIT
--
03.02.24 12:00:31.106  TRANSFILE   ( ERROR   ) : ldap_sasl_interactive_bind_s(): Can't contact LDAP server
03.02.24 12:00:41.625  DEBUG_INIT
--
03.02.24 12:01:03.497  TRANSFILE   ( ERROR   ) : ldap_sasl_interactive_bind_s(): Can't contact LDAP server
03.02.24 12:01:14.181  DEBUG_INIT
--
03.02.24 15:46:17.359  TRANSFILE   ( ERROR   ) : ldap_sasl_interactive_bind_s(): Can't contact LDAP server
03.02.24 15:46:27.846  DEBUG_INIT
--
----------------------------------------------------------------------------------

----------------------------------------------------------------------------------
04.02.24 09:02:26.895  TRANSFILE   ( ERROR   ) : Normal exit
04.02.24 09:02:26.895  DEBUG_EXIT
04.02.24 09:02:27.425  DEBUG_INIT
----------------------------------------------------------------------------------


I saw different cases on the school slave. In two cases the replication module was successful but the following modules failed:
----------------------------------------------------------------------------------
03.02.24 08:32:12.403  LISTENER    ( WARN    ) : Set Schema ID to 72
03.02.24 08:32:12.403  LISTENER    ( WARN    ) : initializing module replication
File: /var/lib/univention-ldap/ldap/DB_CONFIG
slapd: Kein Prozess gefunden
File: /var/lib/univention-ldap/ldap/DB_CONFIG
Starting slapd (via systemctl): slapd.service.
Restarting slapd (via systemctl): slapd.service.
03.02.24 10:52:20.215  LISTENER    ( WARN    ) : finished initializing module replication with rv=0
03.02.24 10:52:20.220  LISTENER    ( WARN    ) : initializing module ldap-cache-baa04df67e7af6bb0769f5cb7e72dba9
03.02.24 10:52:32.508  LISTENER    ( ERROR   ) : connection to notifier was closed
03.02.24 10:52:32.508  LISTENER    ( ERROR   ) : failed to get schema DN
03.02.24 10:52:32.508  LISTENER    ( WARN    ) : finished initializing module ldap-cache-baa04df67e7af6bb0769f5cb7e72dba9 with rv=80
03.02.24 10:52:32.508  LISTENER    ( WARN    ) : initializing module well-known-sid-name-mapping
03.02.24 11:06:50.608  LISTENER    ( ERROR   ) : connection to notifier was closed
03.02.24 11:06:50.608  LISTENER    ( ERROR   ) : failed to get schema DN
03.02.24 11:06:50.609  LISTENER    ( WARN    ) : finished initializing module well-known-sid-name-mapping with rv=80
----------------------------------------------------------------------------------

In these cases the join failed because the listener modules were not initialized and the listener is busy with initializing the modules and doesn't register syntax extensions or similar things.

And in one case it happened during the replication module:
----------------------------------------------------------------------------------
04.02.24 08:53:22.918  LISTENER    ( WARN    ) : handler: s4-connector (not ready) (ignore)
04.02.24 08:53:22.929  LISTENER    ( WARN    ) : Set Schema ID to 72
04.02.24 08:53:22.930  LISTENER    ( WARN    ) : initializing module replication
File: /var/lib/univention-ldap/ldap/DB_CONFIG
slapd: Kein Prozess gefunden
File: /var/lib/univention-ldap/ldap/DB_CONFIG
Starting slapd (via systemctl): slapd.serviceWarning: The unit file, source configuration file or drop-ins of slapd.service changed on disk. Run 'systemctl daemon-reload' to reload units.
.
Restarting slapd (via systemctl): slapd.serviceWarning: The unit file, source configuration file or drop-ins of slapd.service changed on disk. Run 'systemctl daemon-reload' to reload units.
.
04.02.24 09:05:13.838  LISTENER    ( ERROR   ) : connection to notifier was closed
04.02.24 09:05:13.838  LISTENER    ( ERROR   ) : failed to get schema DN
04.02.24 09:05:13.842  LISTENER    ( WARN    ) : finished initializing module replication with rv=80
----------------------------------------------------------------------------------

One solution would be a reconnection if the notifier gets restarted. It is a really big environment and so the join could take more than a day and a restart could happen.

Comment 2 Stefan Gohmann

2024-02-08 06:46:12 CET

I've installed the test packages and two joins were successful:

root@schoolserver:~# grep RETRY /var/log/univention/join.log
06.02.24 06:57:47.444  LISTENER    ( WARN    ) : NOTIFIER_RETRY rv=1
06.02.24 09:05:46.447  LISTENER    ( WARN    ) : NOTIFIER_RETRY rv=1
06.02.24 09:51:12.421  LISTENER    ( WARN    ) : NOTIFIER_RETRY rv=1
06.02.24 12:04:19.994  LISTENER    ( WARN    ) : NOTIFIER_RETRY rv=1
07.02.24 09:52:43.667  LISTENER    ( WARN    ) : NOTIFIER_RETRY rv=1
07.02.24 11:12:43.727  LISTENER    ( WARN    ) : NOTIFIER_RETRY rv=1
07.02.24 12:53:02.368  LISTENER    ( WARN    ) : NOTIFIER_RETRY rv=1
root@schoolserver:~#

Comment 3 Arvid Requate

2024-02-22 00:25:10 CET

0c10d73e96 | retry in case notifier doesn't answer

Package: univention-directory-listener
Version: 14.0.9-2
Branch: ucs_5.0-0
Scope: errata5.0-6

New UCR variable listener/notifier/retries with default 30.

Comment 4 Julia Bremer

2024-02-25 16:39:06 CET

OK: Join works from each server role when notifier is stopped and started every 2 seconds
OK: Join works from ucs@school replicas when notifier is stopped and started every 2 seconds
OK: Code review
OK: Log messages understandable
OK: Package build and install
OK: YAML

Comment 5 Iván.Delgado

2024-02-28 13:17:25 CET

<https://errata.software-univention.de/#/?erratum=5.0x967>