Univention Bugzilla – Bug 47609
Domain join takes 2-9 hours - multiple calls to ldap_setup_index
Last modified: 2022-08-03 18:41:53 CEST
A UCS@school slave is joined into a domain with many existing objects. The join takes ages:
1. The initial replication took ~60m (03univention-directory-listener.inst)
2. The first indexing took ~40m (10univention-ldap-server.inst)
3. The second indexing took another ~40m (10univention-ldap-server.inst)

univention-join itself shows no progress (Bug #47604), and neither does join.log.

We already have the UCRVs ldap/index/autorebuild¹ and ldap/index/quickmode, which could be set to true during the initial join until it has finished (similar to switching Qemu to cache=unsafe for new test installations).

`ps` showed slapindex in 'D' state, which would indicate 'waiting for I/O'. (LDAP has ~10k objects, VM has ~16 GiB RAM, using a SAS disk)

¹: <https://docs.software-univention.de/performance-guide-4.3.html#slapd:index>
Maybe we can also set the mdb envflags, for example nosync, during the join: https://docs.software-univention.de/performance-guide-4.3.html#slapd:bdb
IMHO we should move the configuration from UCRV ldap/index/* into LDAP, so a joining system can query them and set them locally BEFORE the replication module starts filling the local LDAP. That way the index is built on-the-fly and no extra call to slapindex should be needed.

With Bug #43515 fixed we could even query the current setting from the Master now:
# ldapsearch -LLLo ldif-wrap=no -b 'olcDatabase={1}mdb,cn=config' -H ldapi:// -Y EXTERNAL -s base olcDbIndex
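To illustrate what a joining system would have to do with such a query result, here is a minimal Python sketch. It assumes the usual `olcDbIndex: <attributes> <index types>` LDIF form. The helper names, the sample input, and the mapping onto `ldap/index/<type>` variables are assumptions for illustration, not existing join-script code.

```python
from collections import defaultdict


def parse_olc_db_index(ldif_text):
    """Group indexed attributes by index type from ldapsearch LDIF output.

    Hypothetical helper: expects lines such as 'olcDbIndex: uid eq,pres,sub'.
    """
    by_type = defaultdict(set)
    for line in ldif_text.splitlines():
        # Skip unrelated lines and base64-encoded values ('olcDbIndex:: ...').
        if not line.startswith("olcDbIndex:") or line.startswith("olcDbIndex::"):
            continue
        fields = line.split(":", 1)[1].split()
        if len(fields) != 2:
            # e.g. an attribute list without explicit index types
            continue
        attrs, types = fields
        for index_type in types.split(","):
            by_type[index_type].update(attrs.split(","))
    return {t: ",".join(sorted(a)) for t, a in by_type.items()}


def to_ucr_assignments(index_map):
    """Render 'ldap/index/<type>=<attrs>' strings (naming is an assumption)."""
    return ["ldap/index/%s=%s" % (t, index_map[t]) for t in sorted(index_map)]


# Made-up sample input, not taken from a real system.
SAMPLE = """dn: olcDatabase={1}mdb,cn=config
olcDbIndex: objectClass pres,eq
olcDbIndex: uid,cn eq,sub
"""

if __name__ == "__main__":
    for assignment in to_ucr_assignments(parse_olc_db_index(SAMPLE)):
        print(assignment)
```

The resulting strings would have to be applied locally before the listener starts filling the local LDAP; how exactly that is done is left open here.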
Ticket#2020060221000177: another customer with a slow block device.

ldap/index/quickmode should be set/unset in 10univention-ldap-server.inst before/after indexing.

Initial replication took ~100min. The first indexing run has been going for ~40min and is still working.
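The set/unset pattern can be sketched as follows. This is only an illustration in Python: the real join script is a shell script, and the dict used here is a stand-in for UCR, not its actual API. The name `temporary_setting` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(store, key, value):
    """Set store[key] = value for the duration of the block, then restore.

    'store' is any dict-like object; here it stands in for UCR
    (an assumption for illustration only).
    """
    missing = object()
    previous = store.get(key, missing)
    store[key] = value
    try:
        yield store
    finally:
        # Restore even if indexing fails, so quickmode never stays enabled.
        if previous is missing:
            store.pop(key, None)
        else:
            store[key] = previous


if __name__ == "__main__":
    ucr_stand_in = {}
    with temporary_setting(ucr_stand_in, "ldap/index/quickmode", "true"):
        print("indexing with", ucr_stand_in)  # slapindex would run here
    print("after indexing", ucr_stand_in)
```

The important part is the restore in the `finally` branch: a failed or aborted indexing run must not leave the unsafe setting behind.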
> IMHO we should move the configuration from UCRV ldap/index/* into LDAP, so a joining system can query them and set them locally BEFORE the replication module starts filling the local LDAP. That way the index is built on-the-fly and no extra call to slapindex should be needed.

This information is already present in cn=config, but not yet readable by remote servers in the domain. So I see two options in univention-join:
a) read the index config from the Primary's OpenLDAP and initialize the UCR variables accordingly
b) copy the UCR variables from the Primary's UCR
I think b) is simpler.
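A rough Python sketch of option b), assuming the Primary's variables are available as `key: value` lines (the form printed by `ucr dump`). The function names and the sample text are hypothetical. Skipping ldap/index/quickmode and ldap/index/autorebuild is a design choice of this sketch, since those control local behaviour rather than define indices.

```python
def select_index_vars(dump_text, prefix="ldap/index/",
                      skip=("ldap/index/quickmode", "ldap/index/autorebuild")):
    """Return {key: value} for index definitions found in 'key: value' lines."""
    selected = {}
    for line in dump_text.splitlines():
        key, sep, value = line.partition(": ")
        if sep and key.startswith(prefix) and key not in skip:
            selected[key] = value
    return selected


def as_ucr_set_args(selected):
    """Build an argument list for 'ucr set key=value ...' (not executed here)."""
    return ["ucr", "set"] + ["%s=%s" % (k, selected[k]) for k in sorted(selected)]


# Made-up sample, not taken from a real Primary.
SAMPLE_DUMP = """ldap/index/eq: objectClass,uid
ldap/index/quickmode: false
ldap/server/name: primary.example.org
ldap/index/sub: cn,uid
"""

if __name__ == "__main__":
    print(as_ucr_set_args(select_index_vars(SAMPLE_DUMP)))
```

How the dump gets from the Primary to the joining system (for example over the existing SSH connection of univention-join) is not covered by this sketch.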
I'm increasing the user pain here, because we get more and more pressure in projects with larger installations (resulting in more LDAP objects to index and replicate during a join). The current join of a UCS@school 4.4 schoolserver in a larger customer environment takes 4 to 8 hours, depending on the hardware. This actually blocks the partner from keeping their planned schedule, especially when the initial join fails for some reason and they have to (re)join again.
(In reply to Stefan Gohmann from comment #1)
> Maybe we can also set the mdb envflags for example nosync during the join:
> https://docs.software-univention.de/performance-guide-4.3.html#slapd:bdb

Do we have numbers about the speedup if joins are done with this flag? My impression is: it would be way easier to set and remove this flag during join and would shorten the time in any case, even with the other suggestions implemented.
Hard to say, probably depends on the host. I'd propose profiling the join of a DC backup with a good range of apps and a representative number of UDM objects, to collect numbers for the individual join scripts and the final time for replication (listener, S4C).
(In reply to Michael Grandjean from comment #5)
> I'm increasing the user pain here, because we get more and more pressure in
> projects with larger installations (resulting in more LDAP objects to index
> and replicate during a join).
>
> The current join of a UCS@school 4.4 schoolserver in a larger customer
> environment takes 4 to 8 hours, depending on the hardware. This actually
> blocks the partner from keeping their planned schedule, especially when the
> initial join fails for some reason and they have to (re)join again.

How big is "larger customer environment"? How many user accounts are present in the UCS directory? What maximum time limit for the join would be acceptable for the customers?
Two recent examples:

# Customer A #
Users: 100,000
Join duration: ~9 hours
> Mon Jul 20 11:10:39 CEST 2020: starting /usr/sbin/univention-join
> [...]
> Configure 98univention-samba4-dns.inst Mon Jul 20 16:34:34 CEST 2020
This failed at 16:55, because the DNS SPN account was not synced fast enough to Samba. Therefore, "univention-run-join-scripts" was called manually the next day:
> univention-run-join-scripts started
> Di 21. Jul 10:35:16 CEST 2020
> [...]
> Di 21. Jul 13:50:23 CEST 2020
> univention-run-join-scripts finished

# Customer B #
Users: 20,000
Join duration: ~5 hours
> Mon Jul 20 09:39:12 CEST 2020: starting /usr/sbin/univention-join
> [...]
> Mon Jul 20 14:45:43 CEST 2020: finish /usr/share/univention-join/univention-join

I can provide the logs (join.log), if necessary.
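For reference, the durations can be recomputed from the quoted timestamps with a small Python sketch. It ignores the weekday and the zone name, which is fine here because all stamps are CEST. The helper names are made up for this comment. Note that the first run of Customer A is only measured up to the last quoted timestamp (16:34:34), not up to the failure at 16:55.

```python
from datetime import datetime, timedelta

MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}


def parse_stamp(text):
    """Parse 'Mon Jul 20 11:10:39 CEST 2020', ignoring weekday and zone name."""
    _weekday, month, day, clock, _zone, year = text.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(int(year), MONTHS[month], int(day), hour, minute, second)


def elapsed(start, end):
    """Time between two stamps in the join.log date format."""
    return parse_stamp(end) - parse_stamp(start)


def clock_delta(start_clock, end_clock):
    """Time between two 'HH:MM:SS' values on the same day."""
    def as_delta(clock):
        hour, minute, second = (int(part) for part in clock.split(":"))
        return timedelta(hours=hour, minutes=minute, seconds=second)
    return as_delta(end_clock) - as_delta(start_clock)


if __name__ == "__main__":
    a_first = elapsed("Mon Jul 20 11:10:39 CEST 2020", "Mon Jul 20 16:34:34 CEST 2020")
    a_second = clock_delta("10:35:16", "13:50:23")  # German-format lines, same day
    b_total = elapsed("Mon Jul 20 09:39:12 CEST 2020", "Mon Jul 20 14:45:43 CEST 2020")
    print("Customer A:", a_first, "+", a_second, "=", a_first + a_second)
    print("Customer B:", b_total)
```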
There has not been any recent activity on this bug. Has the problem been seen somewhere else as well?
Hi, we also have this problem:

# Customer C #
Users: 42,000 (and still growing)
Join duration: ~5 hours

It occurs during the join of DC backups (so the whole LDAP is synced).
As far as I understand this problem, the reason for the long runtime of a join is the multiple calls to create the LDAP index. Two calls are too many and of no additional use. I discussed this problem this week. Here are some ideas on how to approach this issue:
1. Run A/B tests to find out which solution to implement:
   a. Create a test environment reflecting the size of the affected environments and let a slave join this master. Measure the time.
   b. Evaluate the different solution ideas and compare the join runtimes.
2. We should run this test regularly on the join process.
3. Use the tests and evaluate the effect of the different solutions.

Possible solutions have already been stated here. The following options should be evaluated:
1. Retrieve the LDAP index configuration from the master and apply it to the slave _before_ the replication starts. Let the index build up during the replication and thus avoid explicit index build calls.
2. Build the index after the join process. Adapt `ldap_setup_index` and let it consider the join status and decide when to actually rebuild the LDAP index.
3. I/O operations seem to slow down the performance significantly. The performance guide lists some options, e.g. deactivating `f_sync`. The options could be applied one after another, and their impact could be measured and compared.

This would also help to decide which optimization should be applied in a join process for large environments. Furthermore, the performance guide could be extended by a comparison of the tuning options, their effects and their possible side effects, to help administrators decide what to choose in their situation and environment.
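The measuring part of such an A/B test could look roughly like the following Python sketch. It only shows how per-phase runtimes might be collected and compared. The names `measure` and `relative_saving` are hypothetical, and the phase in the demo is a stand-in (a short sleep), not a real join script.

```python
import time


def measure(label, func, results, clock=time.monotonic):
    """Run func(), add its wall-clock duration in seconds to results[label]."""
    start = clock()
    try:
        return func()
    finally:
        results[label] = results.get(label, 0.0) + (clock() - start)


def relative_saving(baseline_seconds, candidate_seconds):
    """Fraction of the baseline runtime saved by the candidate variant."""
    if baseline_seconds <= 0:
        raise ValueError("baseline runtime must be positive")
    return 1.0 - candidate_seconds / baseline_seconds


if __name__ == "__main__":
    timings = {}
    # Stand-in for a join phase such as indexing; replace with the real call.
    measure("indexing", lambda: time.sleep(0.01), timings)
    print(timings)
```

Running the same phases once per variant (for example with and without a tuning option) and feeding the two results into `relative_saving` would give comparable numbers per join script.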
The story has been moved to GitLab. Please find the story in the following GitLab issue: git.knut.univention.de/univention/ucs/-/issues/144.