Bug 51182 - ldap-group-to-file: huge performance loss by using ldap/server/addition
ldap-group-to-file: huge performance loss by using ldap/server/addition
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: PAM
UCS 4.4
Other Linux
: P5 normal (vote)
: UCS 4.4-4-errata
Assigned To: Sönke Schwardt-Krummrich
Arvid Requate
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2020-04-28 20:38 CEST by Sönke Schwardt-Krummrich
Modified: 2020-05-06 14:40 CEST (History)
5 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 3: Will affect average number of installed domains
How will those affected feel about the bug?: 4: A User would return the product
User Pain: 0.343
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support: Yes
Flags outvoted (downgraded) after PO Review:
Ticket number: 2020042221000437
Bug group (optional): Regression
Max CVSS v3 score:
troeder: Patch_Available+


Attachments
prefer local ldap server (917 bytes, patch)
2020-04-29 11:32 CEST, Daniel Tröder
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Sönke Schwardt-Krummrich univentionstaff 2020-04-28 20:38:04 CEST
ldap-group-to-file.py uses the UCR variable ldap/server/name as well as the LDAP servers from the UCR variable ldap/server/addition for an LDAP connection to query the group memberships. If the local LDAP server is used, this is usually not a problem and is done quickly.
However, if e.g. a DC Master or DC Backup is used, which can be reached with a few miliseconds latency, the runtime of the script increases to several minutes.

In the concrete customer case ldap-group-to-file was started on a UCS@school slave by the class work mode.
There are 1900 groups readable in the domain or for the slave.
If ldap-group-to-file uses the local LDAP server, the script finishes its work after only 6 seconds.
However, ldap/server/addition contains 3 external systems (master + 2 backups), which are also randomly selected by ldap-group-to-file.
The runtime of the script then increases to 4-5min.

The start of the class work mode therefore does not take ~45 seconds but >5min.

It should be checked if
- too many LDAP queries are performed (# of query * latency = 5min?)
- the LDAP ACL evaluation is too expensive on DC Master or DC Backup
- ...
Comment 1 Ingo Steuwer univentionstaff 2020-04-29 08:13:24 CEST
The intention of using ldap/server/addition for this script was to reduce the load on the DC Master initiated by memberserver instances.

It makes absolutely sense to always try the local LDAP server first if it is ldap/server/name or in ldap/server/addition. So I suggest to change that script in the first place.
Comment 2 Daniel Tröder univentionstaff 2020-04-29 11:31:39 CEST
IMHO getMachineConnection() should be modified - not ldap-group-to-file.py.

getMachineConnection() always uses [ldap/server/name + ldap/server/addition].
If called with random_server=True, it randomizes the list.

It should be changed to always prefer the local LDAP server, and then append the (optionally randomized) ldap/server/addition list.
Comment 3 Daniel Tröder univentionstaff 2020-04-29 11:32:23 CEST
Created attachment 10342 [details]
prefer local ldap server
Comment 4 Ingo Steuwer univentionstaff 2020-04-29 11:47:05 CEST
(In reply to Daniel Tröder from comment #2)
> IMHO getMachineConnection() should be modified - not ldap-group-to-file.py.
> 
> getMachineConnection() always uses [ldap/server/name + ldap/server/addition].
> If called with random_server=True, it randomizes the list.
> 
> It should be changed to always prefer the local LDAP server, and then append
> the (optionally randomized) ldap/server/addition list.

In which other usecases is getMachineConnection() used with random_server=true?
Comment 5 Daniel Tröder univentionstaff 2020-04-29 12:15:00 CEST
(In reply to Ingo Steuwer from comment #4)
> In which other usecases is getMachineConnection() used with
> random_server=true?
Only in test/ucs-test/tests/10_ldap/24uldap.
Comment 6 Sönke Schwardt-Krummrich univentionstaff 2020-04-29 12:19:32 CEST
(In reply to Daniel Tröder from comment #5)
> (In reply to Ingo Steuwer from comment #4)
> > In which other usecases is getMachineConnection() used with
> > random_server=true?
> Only in test/ucs-test/tests/10_ldap/24uldap.

To be precise: 
the occurrence in 24uldap is the actual test of the random_server argument.

I would still suggest, that ldap-group-to-file uses the local LDAP-server first (random_server=False) and if that fails, it retries with random_server=True.
Comment 7 Daniel Tröder univentionstaff 2020-04-29 12:27:29 CEST
(In reply to Sönke Schwardt-Krummrich from comment #6)
> I would still suggest, that ldap-group-to-file uses the local LDAP-server
> first (random_server=False) and if that fails, it retries with
> random_server=True.

That won't work... Or at least not how you expect it to.
With random_server=False it will still try out *all* servers, incl. those from ldap/server/addition. Just in a predefined order.
Comment 10 Sönke Schwardt-Krummrich univentionstaff 2020-04-29 22:59:33 CEST
(In reply to Daniel Tröder from comment #7)
> That won't work... Or at least not how you expect it to.
> With random_server=False it will still try out *all* servers, incl. those
> from ldap/server/addition. Just in a predefined order.

Yes. 

I've been doing some back and forth with Arvid and we decided on a variant that maintains the current state on member servers and shows improved behavior on domain controllers.

For this purpose getMachineConnection() was adapted and not the script 
ldap-group-to-file.py: 

With random_server==false, the servers from ldap/server/name and 
ldap/server/addition are still concatenated and iterate over the resulting server list until a connection is successfully established with an LDAP server.

If random_server==True, the system role is important. As is well known, no LDAP server runs on member servers, so the desired goal is to distribute the load by requests to all specified LDAP servers. Therefore, getMachineConnection first concatenates ldap/server/name and ldap/server/addition and then randomizes the order of the resulting list of hosts.

On all other system roles, with random_server==True only the order of the servers from ldap/server/addition is randomized. The server from ldap/server/name (usually the local system) remains in first position. 

This should eliminate the performance regression introduced with bug 50191.

[4.4-4] 89ab6ec640 Bug #51182: update advisory
[4.4-4] ffbda6baa3 Bug #51182: Merge branch 'sschwardt/bug51182/4.4-4' into 4.4-4
[4.4-4] 05bf9aa62a Bug #51182: add advisory
[4.4-4] 0c517a8197 Bug #51182: improve random_server in getMachineConnection

Package: univention-python
Version: 12.0.0-19A~4.4.0.202004292242
Branch: ucs_4.4-0
Scope: errata4.4-4

Waiting for jenkins results.
Comment 11 Sönke Schwardt-Krummrich univentionstaff 2020-04-30 17:31:06 CEST
> Waiting for jenkins results.

Results are good.

[4.4-4] 224ae72feb Bug #51182: update unit test for getMachineConnection

Package: ucs-test
Version: 9.0.3-195A~4.4.0.202004301730
Branch: ucs_4.4-0
Scope: errata4.4-4
Comment 12 Arvid Requate univentionstaff 2020-05-04 11:53:16 CEST
Verified:
* Code Ok
* Test-Case: Ok and green
* Advisory Ok
Comment 13 Erik Damrose univentionstaff 2020-05-06 14:40:02 CEST
<http://errata.software-univention.de/ucs/4.4/588.html>