Bug 47609 - Domain join takes 2-9 hours - multiple calls to ldap_setup_index
Status: NEW
Product: UCS
Classification: Unclassified
Component: LDAP
Version: UCS 4.4
Hardware: Other Linux
Importance: P5 normal
Target Milestone: ---
Assigned To: UCS maintainers
UCS maintainers
:
Depends on:
Blocks:
Reported: 2018-08-17 13:53 CEST by Philipp Hahn
Modified: 2022-08-03 18:41 CEST (History)
12 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 2: Will only affect a few installed domains
How will those affected feel about the bug?: 3: A User would likely not purchase the product
User Pain: 0.171
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2018080921000496, 2020060221000177, 2018042621000711
Bug group (optional): Large environments
Max CVSS v3 score:


Description Philipp Hahn univentionstaff 2018-08-17 13:53:54 CEST
A UCS@school slave is joined into a domain with many existing objects. The join takes ages:
1. The initial replication took ~60m (03univention-directory-listener.inst)
2. The first indexing took ~40m (10univention-ldap-server.inst)
3. The second indexing took another ~40m (10univention-ldap-server.inst)

univention-join itself shows no progress (Bug #47604), and neither does join.log.

We already have the UCR variables ldap/index/autorebuild¹ and ldap/index/quickmode, which could be set to true during the initial join until it has finished.
(similar to switching Qemu to cache=unsafe for new test installations)

`ps` showed slapindex in 'D' state, which would indicate 'waiting for I/O'.
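
One way to repeat this check while a join is running is to filter the process list for entries in uninterruptible sleep. This is only a sketch: it assumes the procps `ps` output format, and the function name `d_state_procs` is made up for illustration.

```shell
# Print PID, state and command of processes in uninterruptible sleep ('D').
# Reads `ps -eo pid,stat,comm` style output on stdin, so it can be tried
# on saved output as well as on a live system.
d_state_procs() {
    # Skip the header line, keep rows whose STAT column starts with D.
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $2, $3 }'
}

# Usage on a live system:
#   ps -eo pid,stat,comm | d_state_procs
```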

(LDAP has ~10k objects, VM has ~16 GiB RAM, using a SAS disk)

¹: <https://docs.software-univention.de/performance-guide-4.3.html#slapd:index>
Comment 1 Stefan Gohmann univentionstaff 2018-08-17 14:29:43 CEST
Maybe we can also set the MDB envflags, for example nosync, during the join:
https://docs.software-univention.de/performance-guide-4.3.html#slapd:bdb
Comment 2 Philipp Hahn univentionstaff 2018-08-23 15:33:26 CEST
IMHO we should move the configuration from UCRV ldap/index/* into LDAP, so a joining system can query them and set them locally BEFORE the replication module starts filling the local LDAP. That way the index is built on-the-fly and no extra call to slapindex should be needed.

With Bug #43515 fixed we could even query the current setting from the Master now:
# ldapsearch -LLLo ldif-wrap=no -b 'olcDatabase={1}mdb,cn=config' -H ldapi:// -Y EXTERNAL -s base olcDbIndex
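
A rough sketch of how that output could be turned into UCR-style settings on the joining system. It assumes `olcDbIndex` values of the form `<attr>[,<attr>...] <type>[,<type>...]`, and the per-type variable names `ldap/index/<type>` are an assumption derived from the `ldap/index/*` pattern mentioned above. The function name is hypothetical.

```shell
# Convert "olcDbIndex: <attrs> <types>" lines (LDIF on stdin) into one
# "ldap/index/<type>=<attr>,<attr>,..." line per index type.
olc_index_to_ucr() {
    awk '
        $1 == "olcDbIndex:" && NF >= 3 {
            n_attrs = split($2, attrs, ",")
            n_types = split($3, types, ",")
            # Append every attribute to the list of each index type it uses.
            for (t = 1; t <= n_types; t++) {
                for (a = 1; a <= n_attrs; a++) {
                    key = types[t]
                    if (key in list)
                        list[key] = list[key] "," attrs[a]
                    else
                        list[key] = attrs[a]
                }
            }
        }
        END {
            for (key in list)
                print "ldap/index/" key "=" list[key]
        }
    ' | sort
}

# Usage: pipe the ldapsearch output from above into the function:
#   ldapsearch ... -s base olcDbIndex | olc_index_to_ucr
```

Lines that carry only an attribute name without explicit index types are skipped here; handling the database default would need extra logic.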
Comment 3 Sönke Schwardt-Krummrich univentionstaff 2020-06-08 12:42:03 CEST
Ticket#2020060221000177:
Another customer with a slow block device.
ldap/index/quickmode should be set/unset in 10univention-ldap-server.inst before/after indexing.

Initial replication took ~100min.
The first indexing run has been running for ~40min and is still working.
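
A minimal sketch of such a set/unset wrapper, not the actual join script code. It assumes the `ucr get`, `ucr set` and `ucr unset` subcommands; the function name `with_ucr_override` and the `UCR` override variable are made up for illustration.

```shell
# Command used to talk to UCR; override it for a dry run.
UCR=${UCR:-ucr}

# Run a command with a UCR variable temporarily set, then restore the
# previous state (old value, or unset if there was none).
with_ucr_override() {
    key=$1 value=$2
    shift 2
    old=$("$UCR" get "$key")
    "$UCR" set "$key=$value"
    rc=0
    "$@" || rc=$?
    if [ -n "$old" ]; then
        "$UCR" set "$key=$old"
    else
        "$UCR" unset "$key"
    fi
    # Hand the wrapped command's exit status back to the caller.
    return "$rc"
}

# Hypothetical use around the indexing step:
#   with_ucr_override ldap/index/quickmode true <indexing command>
```

The sketch does not install a trap, so an interrupted run would leave the temporary value in place.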
Comment 4 Arvid Requate univentionstaff 2020-06-08 15:40:09 CEST
> IMHO we should move the configuration from UCRV ldap/index/* into LDAP, so a joining system can query them and set them locally BEFORE the replication module starts filling the local LDAP. That way the index is built on-the-fly and no extra call to slapindex should be needed.

This information is already present in cn=config, but not yet readable by remote servers in the domain.
So I see two options in univention-join:
a) read the index config from Primary OpenLDAP and initialize UCR-Vars accordingly
b) copy the UCR vars from the UCR Primary

I think b) is simpler.
Comment 5 Michael Grandjean univentionstaff 2020-07-29 12:55:59 CEST
I'm increasing the user pain here, because we get more and more pressure in projects with larger installations (resulting in more LDAP objects to index and replicate during a join).

The current join of a UCS@school 4.4 schoolserver in a larger customer environment takes 4 to 8 hours, depending on the hardware. This actually blocks the partner from keeping their planned schedule, especially when the initial join fails for some reason and they have to (re)join again.
Comment 6 Ingo Steuwer univentionstaff 2020-07-29 15:41:51 CEST
(In reply to Stefan Gohmann from comment #1)
> Maybe we can also set the mdb envflags for example nosync during the join:
> https://docs.software-univention.de/performance-guide-4.3.html#slapd:bdb

Do we have numbers about the speedup if joins are done with this flag?

My impression is: it would be way easier to set and remove this flag during join and would shorten the time in any case, even with the other suggestions implemented.
Comment 7 Arvid Requate univentionstaff 2020-07-29 17:21:04 CEST
Hard to say, probably depends on the host.

I'd propose profiling the join of a dc backup with a good range of apps
and a representative number of UDM objects, to collect numbers for the
individual join scripts and the final time for replication (listener, S4C).
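
Collecting such per-script numbers could look roughly like this. It is a sketch only: `time_step` is a hypothetical helper, and the join script directory in the usage comment is an assumption.

```shell
# Run a command and print "<label> <seconds>s rc=<exit status>" afterwards.
time_step() {
    label=$1
    shift
    start=$(date +%s)
    rc=0
    "$@" || rc=$?
    end=$(date +%s)
    printf '%s %ds rc=%d\n' "$label" "$((end - start))" "$rc"
    return "$rc"
}

# Hypothetical use, one line of timing output per join script:
#   for script in /usr/lib/univention-install/*.inst; do
#       time_step "$(basename "$script")" "$script" >> /tmp/join-timing.log
#   done
```

Second resolution is coarse, but it is sufficient for steps that run for minutes or hours.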
Comment 9 Nico Gulden univentionstaff 2020-08-04 16:31:09 CEST
(In reply to Michael Grandjean from comment #5)
> I'm increasing the user pain here, because we get more and more pressure in
> projects with larger installations (resulting in more LDAP objects to index
> and replicate during a join).
> 
> The current join of a UCS@school 4.4 schoolserver in a larger customer
> environment takes 4 to 8 hours, depending on the hardware. This actually
> blocks the partner from keeping their planned schedule, especially when the
> initial join fails for some reason and they have to (re)join again.

How big is "larger customer environment"? 

How many user accounts are present in the UCS directory?

What maximum time limit for the join would be acceptable for the customers?
Comment 10 Michael Grandjean univentionstaff 2020-08-05 16:09:51 CEST
Two recent examples:

# Customer A #
Users: 100.000
Join duration: ~9 hours

> Mon Jul 20 11:10:39 CEST 2020: starting /usr/sbin/univention-join
> [...]
> Configure 98univention-samba4-dns.inst Mon Jul 20 16:34:34 CEST 2020
This failed at 16:55, because the DNS SPN account was not synced fast enough to Samba. Therefore, "univention-run-join-scripts" was called manually the next day:
> univention-run-join-scripts started
> Di 21. Jul 10:35:16 CEST 2020
> [...]
> Di 21. Jul 13:50:23 CEST 2020
> univention-run-join-scripts finished


# Customer B #
Users: 20.000
Join duration: ~5 hours

> Mon Jul 20 09:39:12 CEST 2020: starting /usr/sbin/univention-join
> [...]
> Mon Jul 20 14:45:43 CEST 2020: finish /usr/share/univention-join/univention-join

I can provide the logs (join.log), if necessary.
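
For reference, the durations can be recomputed from the log timestamps. A minimal sketch, assuming GNU `date`, which parses the English timestamps directly; the German `Di 21. Jul` lines of Customer A's second run are rewritten in English form here. The helper name `elapsed` is made up.

```shell
# Print the time between two timestamps as <hours>h<minutes>m<seconds>s.
# Relies on GNU date's -d option to parse the timestamps.
elapsed() {
    start=$(date -d "$1" +%s) || return 1
    end=$(date -d "$2" +%s) || return 1
    secs=$((end - start))
    printf '%dh%02dm%02ds\n' "$((secs / 3600))" "$((secs % 3600 / 60))" "$((secs % 60))"
}

# Customer B, complete join:
elapsed 'Mon Jul 20 09:39:12 CEST 2020' 'Mon Jul 20 14:45:43 CEST 2020'   # 5h06m31s
# Customer A, manual univention-run-join-scripts on the second day:
elapsed 'Tue Jul 21 10:35:16 CEST 2020' 'Tue Jul 21 13:50:23 CEST 2020'   # 3h15m07s
```

Together with the first day's run (11:10 until the failure at 16:55), Customer A ends up at roughly 9 hours.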
Comment 11 Nico Gulden univentionstaff 2020-12-01 13:04:02 CET
There has not been any recent activity on this bug. Has the problem been seen somewhere else as well?
Comment 12 Oliver Friedrich univentionstaff 2020-12-02 17:44:04 CET
Hi, we also have this problem:
# Customer C #

Users: 42.000 (and still growing)
Join duration: ~5 hours

This occurs during the join of DC backups (so the whole LDAP is synced).
Comment 13 Nico Gulden univentionstaff 2020-12-04 14:55:52 CET
As far as I understand this problem, the long runtime of a join is caused by multiple calls to create the LDAP index. Two calls are too many and provide no additional benefit.

I discussed this problem this week. Here are some ideas on how to approach this issue:

1. Run A/B tests to find out which solution to implement:

a. Create a test environment reflecting the size of the affected environments and let a slave join this master. Measure the time.

b. Evaluate the different solution ideas and compare the join runtimes.

2. We should run this test regularly on the join process.

3. Use the tests and evaluate the effect of the different solutions.


Possible solutions have already been stated here. The following options should be evaluated:

1. Retrieve the LDAP index configuration from the master and apply it to the slave _before_ the replication starts. Let the index build up during the replication and thus avoid explicit index build calls.

2. Build the index after the join process. Adapt `ldap_setup_index` and let it consider the join status and decide when to actually rebuild the LDAP index.

3. I/O operations seem to slow down the performance significantly. The performance guide lists some options, e.g. deactivating `f_sync`. The options could be applied one after another and their improvement impact could be measured and compared to each other. This would also help to decide what optimization should be applied in a join process for large environments.
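
To illustrate option 2, a guard of the following kind could defer the rebuild until the join has finished, so that several requests during the join collapse into a single pending rebuild. This is a sketch and not the real `ldap_setup_index`: both marker paths and the function name are assumptions.

```shell
# Both paths are assumptions for illustration; adjust them to the real system.
JOINED_MARKER=${JOINED_MARKER:-/var/univention-join/joined}
PENDING_FLAG=${PENDING_FLAG:-/var/univention-join/index-rebuild-pending}

# Run the given rebuild command only once the join has finished;
# otherwise just remember that a rebuild is still due.
rebuild_index_when_joined() {
    if [ -e "$JOINED_MARKER" ]; then
        rm -f "$PENDING_FLAG"
        "$@"
    else
        : > "$PENDING_FLAG"
        echo "join not finished, index rebuild deferred" >&2
    fi
}

# Hypothetical use:
#   rebuild_index_when_joined <indexing command>
```

A final step of the join would then check for the pending flag and run the rebuild exactly once.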

Furthermore, the performance guide could be extended by a comparison on the tuning options, their effect and their possible side effects to help administrators with their decisions on what to choose in their situation and environment.
Comment 14 Riya Bhattacharjee univentionstaff 2021-11-11 11:24:44 CET
The story has been moved to GitLab.
Please find the story in the following GitLab issue: 
git.knut.univention.de/univention/ucs/-/issues/144.