Bug 46860 - 4.3: Listener module for generating netlogon scripts is too slow
4.3: Listener module for generating netlogon scripts is too slow
Status: CLOSED FIXED
Product: UCS@school
Classification: Unclassified
Component: Netlogon scripts
UCS@school 4.3
Other Linux
: P5 normal (vote)
: UCS@school 4.3 v3
Assigned To: Sönke Schwardt-Krummrich
Daniel Tröder
:
Depends on: 42997 45802
Blocks: 46924
  Show dependency treegraph
 
Reported: 2018-04-20 16:48 CEST by Sönke Schwardt-Krummrich
Modified: 2018-06-04 15:27 CEST (History)
4 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 4: Minor Usability: Impairs usability in secondary scenarios
Who will be affected by this bug?: 3: Will affect average number of installed domains
How will those affected feel about the bug?: 3: A User would likely not purchase the product
User Pain: 0.206
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2017112021000371
Bug group (optional): UCS Performance
Max CVSS v3 score:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Sönke Schwardt-Krummrich univentionstaff 2018-04-20 16:48:16 CEST
Port changes to UCS@school 4.3

+++ This bug was initially created as a clone of Bug #45802 +++

A large customer reported (2017112021000371) that the regeneration of netlogon script takes now 2 hours instead of ~20 minutes.

Also the new daemon throws tracebacks from time to time which (intentionally) slows down the daemon each time by ~5 seconds.


+++ This bug was initially created as a clone of Bug #42997 +++

Currently the listener module for generating netlogon scripts is triggered by LDAP modifications of user, group and share objects. For each event the corresponding netlogon scripts are rebuilt.

There are some situations when this approach takes too much time:
a) when rejoining slave systems, the rebuild o all netlogon scripts takes minutes up to hours
b) when starting the exam mode, there the replication might be too slow and hits the current timeout of 300 seconds for replication of all exam users to the corresponding school server.

Idea to speed up the rebuild:
Affected DNs are not immediately handled in handler() but collected until a certain number of changes is reached, e.g. handler() has been called 100 times (UCR variable!), or the listener module's postrun() method is called.
DNs might be collected by creating/touching the file /var/lib/ucs-school-netlogon-user-logonscripts/${DN}.
This way recurring changes to the same LDAP object are automatically merged and there is no information loss if the listener gets stopped.

Log excerpts from a customer system:
15.11.16 13:47:03.619 LISTENER (INFO): updating 'cn=exam-someuser1,cn=uid,cn=temporary,cn=univention,dc=example,dc=com' command a
15.11.16 13:47:03.733 LISTENER (INFO): updating 'cn=xxxxxx,cn=uidNumber,cn=temporary,cn=univention,dc=example,dc=com' command a
15.11.16 13:47:03.831 LISTENER (INFO): updating 'cn=xxxxxx,cn=gidNumber,cn=temporary,cn=univention,dc=example,dc=com' command a
15.11.16 13:47:03.925 LISTENER (INFO): updating 'cn=xxxxxx,cn=gidNumber,cn=temporary,cn=univention,dc=example,dc=com' command d
15.11.16 13:47:03.993 LISTENER (INFO): updating 'cn=S-1-5-21-xxxxxxxxxx-xxxxxxxxxx-xxxxxxxxxx-xxxxxxx,cn=sid,cn=temporary,cn=univention,dc=example,dc=com' command a
15.11.16 13:47:04.073 LISTENER (INFO): updating 'uid=exam-someuser1,cn=examusers,ou=schoolA,dc=example,dc=com' command a
15.11.16 13:47:04.318 LISTENER (INFO): updating 'cn=Domain Users schoolA,cn=groups,ou=schoolA,dc=example,dc=com' command m
15.11.16 13:47:04.653 LISTENER (INFO): updating 'cn=schuelerschoolA,cn=groups,ou=schoolA,dc=example,dc=com' command m
15.11.16 13:47:04.872 LISTENER (INFO): updating 'cn=schoolA-Q II,cn=klassen,cn=schueler,cn=groups,ou=schoolA,dc=example,dc=com' command m
15.11.16 13:47:05.002 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-mat 10,cn=kurse,ou=schoolA,dc=example,dc=com' command m
15.11.16 13:47:05.152 LISTENER (INFO): updating 'cn=schoolA-LKurs-13-BIO 4,cn=kurse,ou=schoolA,dc=example,dc=com' command m
15.11.16 13:47:05.290 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-che 30,cn=kurse,ou=schoolA,dc=example,dc=com' command m
15.11.16 13:47:05.382 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-kun 15,cn=kurse,ou=schoolA,dc=example,dc=com' command m
15.11.16 13:47:05.640 LISTENER (INFO): updating 'cn=schoolA-LKurs-13-GEG 2,cn=kurse,ou=schoolA,dc=example,dc=com' command m
15.11.16 13:47:05.812 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-spopr 12,cn=kurse,ou=schoolA,dc=example,dc=com' command m
15.11.16 13:47:05.952 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-deu 65,cn=kurse,ou=schoolA,dc=example,dc=com' command m
15.11.16 13:47:06.137 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-ges 35,cn=kurse,ou=schoolA,dc=example,dc=com' command m
15.11.16 13:47:06.443 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-eng 50,cn=kurse,ou=schoolA,dc=example,dc=com' command m
15.11.16 13:47:06.654 LISTENER (INFO): updating 'cn=OUschoolA-Klassenarbeit,cn=ucsschool,cn=groups,dc=example,dc=com' command m
15.11.16 13:47:06.788 LISTENER (INFO): updating 'cn=exam-someuser1,cn=uid,cn=temporary,cn=univention,dc=example,dc=com' command d
15.11.16 13:47:06.869 LISTENER (INFO): updating 'cn=S-1-5-21-xxxxxxxxxx-xxxxxxxxxx-xxxxxxxxxx-xxxxxxx,cn=sid,cn=temporary,cn=univention,dc=example,dc=com' command d
Comment 1 Sönke Schwardt-Krummrich univentionstaff 2018-04-20 16:56:25 CEST
Have a look at Bug #45802 for further details:

Several optimizations have been performed to improve performance of the listener module.
1) a missing SQL index has been added
2) the traceback of comment 1 has been fixed
3) groups (resp. the group's members) are only processed if the group has been referenced by at least one share
4) a much simpler LDAP query is used which reduced the initial LDAP query time by factor of ~4. In a customer environment the searchtime jumped from 15m56s to 4m36s. The found objects are verified later on against the UDM modules. So there should be no behaviour change.
5) As Daniel pointed out, the setup of a LDAP connection for each handled object is very expensive. The connection is now set up once and in case of an LDAP exception, and teared down in postrun().
6) The LDAP connection is not performed against the UCS master anymore but to the local LDAP server.

As shown below, this improved the processing speed of the listener module remarkably, so the LDAP replication is no longer block for longer periods.

A test (module resync) has been performed with the old and the new code. The created logon scripts have been compared from both test runs. No difference has been found.

* old code with benchmarking:
24.03.18 18:33:58.041  LISTENER    ( WARN    ) : initializing module ucs-school-user-logonscript
24.03.18 18:35:24.889  LISTENER    ( WARN    ) : finished initializing module ucs-school-user-logonscript with rv=0
24.03.18 18:35:39.908  LISTENER    ( PROCESS ) : ucs-school-user-logonscript: Timer total:
handler start__: 0.000000
lo ready_______: 59.822722
db ready_______: 1.116696
handlers called: 12.993600
handler end____: 0.262067
** …daemon.log
2018-03-24 18:33:58,261 [INFO]  INIT
2018-03-24 18:33:58,307 [INFO]  got global links {'Marktplatz': {'letter': 'M', 'server': 'slave111'}}
2018-03-24 18:34:08,434 [INFO]  signal received - reduced wait time by 162.9 seconds
[…]
2018-03-24 18:34:41,062 [INFO]  all items processed... sleeping...
2018-03-24 18:34:41,080 [INFO]  signal received - reduced wait time by 173.0 seconds
[…]
2018-03-24 18:34:49,676 [INFO]  all items processed... sleeping...
2018-03-24 18:34:49,682 [INFO]  signal received - reduced wait time by 173.0 seconds
[…]
2018-03-24 18:34:58,653 [INFO]  all items processed... sleeping...
2018-03-24 18:34:58,662 [INFO]  signal received - reduced wait time by 173.0 seconds
[…]
2018-03-24 18:35:07,420 [INFO]  all items processed... sleeping...
2018-03-24 18:35:07,438 [INFO]  signal received - reduced wait time by 173.0 seconds
[…]
2018-03-24 18:35:16,118 [INFO]  all items processed... sleeping...
2018-03-24 18:35:16,128 [INFO]  signal received - reduced wait time by 173.0 seconds
[…]
2018-03-24 18:35:31,794 [INFO]  all items processed... sleeping...

→ Start: 18:33:58 → End: 18:35:31 → 93 seconds
# egrep '^2018-04-09 15:0(4|5).*updating logon script for' /var/log/univention/ucs-school-user-logonscript-daemon.log | wc -l
2398
→ Listener: 2398 users / 86 seconds → 27.8 users/s
→ Daemon: 2398 users / 83 seconds → 28.9 users/s

* new code with benchmarking
** listener.log
24.03.18 21:41:48.343  LISTENER    ( WARN    ) : initializing module ucs-school-user-logonscript
24.03.18 21:41:59.219  LISTENER    ( WARN    ) : finished initializing module ucs-school-user-logonscript with rv=0
24.03.18 21:42:14.236  LISTENER    ( PROCESS ) : ucs-school-user-logonscript: Timer total:
handler start__: 0.000000
lo ready_______: 0.054209
db ready_______: 0.620861
handlers called: 3.995481
handler end____: 0.192979
** …daemon.log
2018-03-24 21:41:48,614 [INFO]  INIT
2018-03-24 21:41:48,665 [INFO]  got global links {'Marktplatz': {'letter': 'M', 'server': 'slave111'}}
2018-03-24 21:41:53,086 [INFO]  signal received - reduced wait time by 168.6 seconds
[…]
2018-03-24 21:42:09,424 [INFO]  all items processed... sleeping...

→ Start: 21:41:48 → End: 21:42:09 → 27 seconds
# egrep '^2018-04-09 15:0(4|5).*updating logon script for' /var/log/univention/ucs-school-user-logonscript-daemon.log | wc -l
2398
→ Listener: 2398 users / 11 seconds → 218 users/s
→ Daemon: 2398 users / 16 seconds → 150 users/s



Test run in customer environment:

* old package
12.04.18 14:45:09.031  LISTENER    ( WARN    ) : initializing module ucs-school-user-logonscript
12.04.18 14:45:09.263  LISTENER    ( INFO    ) : call handler_initialize for module ucs-school-user-logonscript
12.04.18 14:45:12.859  LISTENER    ( INFO    ) : initialize schema for module ucs-school-user-logonscript
12.04.18 17:09:54.627  LISTENER    ( WARN    ) : finished initializing module ucs-school-user-logonscript with rv=0
→ ~2h25m → ~145min

* new package
20.04.18 16:02:14.116  LISTENER    ( WARN    ) : initializing module ucs-school-user-logonscript
20.04.18 16:02:14.356  LISTENER    ( INFO    ) : call handler_initialize for module ucs-school-user-logonscript
20.04.18 16:02:17.799  LISTENER    ( INFO    ) : initialize schema for module ucs-school-user-logonscript
20.04.18 16:11:53.043  LISTENER    ( WARN    ) : finished initializing module ucs-school-user-logonscript with rv=0
→ ~9m30s → ~10min

ucs-school-netlogon-user-logonscripts (14.0.0-3)
b04806f38504 | Bug #46860/#45802: add changelog entry
f512ada97dcc | Bug #46860/#45802: use local LDAP connection instead of a LDAP connection to DC master
81dd0efff78d | Bug #46860/#45802: reuse LDAP connection between handler() calls
c7fd8761806a | Bug #46860/#45802: make mypy happy
675f33db024f | Bug #46860/#45802: cleanup
8fa6df62b39d | Bug #46860/#45802: use much simpler LDAP query to reduce query time
0dba7f35140a | Bug #46860/#45802: only process groups that are referenced by at least one share
63128aa5ad4d | Bug #46860/#45802: add mypy comments
98c814eb86d6 | Bug #46860/#45802: fix traceback by skipping users that have no gidNumer set
3b1b181cff87 | Bug #46860/#45802: add comment for mypy
a452b8b4f9da | Bug #46860/#45802: improve performance by adding a missing index

ucs-school-netlogon-user-logonscripts.yaml
7dd74fe646cc | Bug #46860/#45802: add advisory entry

Package: ucs-school-netlogon-user-logonscripts
Version: 14.0.0-3A~4.3.0.201804201654
Branch: ucs_4.3-0
Scope: ucs-school-4.3

(Hint: package does not work until Bug 46696 is fixed).
Comment 2 Sönke Schwardt-Krummrich univentionstaff 2018-04-25 16:04:32 CEST
OK: code change
OK: functional change
OK: tests
OK: changelog entry
OK: advisory
OK: package built
Comment 3 Sönke Schwardt-Krummrich univentionstaff 2018-04-25 16:05:32 CEST
(In reply to Sönke Schwardt-Krummrich from comment #2)
> OK: code change
> OK: functional change
> OK: tests
> OK: changelog entry
> OK: advisory
> OK: package built
→ Wrong bug → back to RESOLVED
Comment 4 Daniel Tröder univentionstaff 2018-05-02 14:11:40 CEST
OK: code changes, fixed in mypy hints:
[4.3] 78becbf6 Bug #46860: comment/mypy-only fix: class name of univention.admin.uldap.access is lower case
[4.3] 679c95f2 Bug #46860: fix mypy hint, enhance docstring

OK: manual tests:

================================================================

root@m61:~# /usr/share/ucs-school-import/scripts/ucs-school-testuser-import --staff 100 --students 700 --teachers 150 --staffteachers 50 --classes 30 --schools 1 SchuleM62

----------------------------------------------------------------

root@s62:~# dpkg -l ucs-school-netlogon-user-logonscripts
ii  ucs-school-netlogon-user 14.0.0-1A~4.3.0.2 all

root@s62:~# univention-ldapsearch -LLL '(&(uid=*)(!(uid=*$)))' dn | grep dn: | wc -l
919

root@s62:~# univention-directory-listener-ctrl resync ucs-school-user-logonscript


==> /var/log/univention/listener.log <==
02.05.18 13:42:56.415  LISTENER    ( WARN    ) : initializing module ucs-school-user-logonscript
[…]
02.05.18 13:43:27.204  LISTENER    ( WARN    ) : finished initializing module ucs-school-user-logonscript with rv=0

==> /var/log/univention/ucs-school-user-logonscript-daemon.log <==
2018-05-02 13:42:56,546 [INFO]  INIT
2018-05-02 13:42:56,570 [INFO]  got global links {'Marktplatz': {'letter': 'M', 'server': 's62'}}
2018-05-02 13:42:58,066 [INFO]  signal received - reduced wait time by 171.5 seconds
[…]
2018-05-02 13:43:32,249 [INFO]  all items processed... sleeping...

root@s62:~# egrep '2018-05-02 13:4.*updating logon script for' /var/log/univention/ucs-school-user-logonscript-daemon.log | wc -l
2241

root@s62:~# find /var/lib/samba/netlogon/ -name '*.vbs' -ls | grep 'Mai  2 13:4' | wc -l
908

---
listener  : 919 user / 31s = 29,65 user/s
daemon    : 919 user / 36s = 25,53 user/s (incl. sleeps)

wall clock resync -> files written: 13:42:56 -> 13:43:32 = 36s
wall clock: 919 user / 36s = 25,53 user/s

================================================================

root@s62:~# univention-upgrade --ignoressh --ignoreterm
root@s62:~# dpkg -l ucs-school-netlogon-user-logonscripts
ii  ucs-school-netlogon-user 14.0.0-4A~4.3.0.2 all
root@s62:~# service univention-directory-listener restart
root@s62:~# service ucs-school-netlogon-user-logonscripts restart
root@s62:~# univention-directory-listener-ctrl resync ucs-school-user-logonscript

==> /var/log/univention/listener.log <==
02.05.18 14:00:14.536  LISTENER    ( WARN    ) : initializing module ucs-school-user-logonscript
02.05.18 14:00:17.786  LISTENER    ( WARN    ) : finished initializing module ucs-school-user-logonscript with rv=0

==> /var/log/univention/ucs-school-user-logonscript-daemon.log <==
2018-05-02 14:00:14,845 [INFO]  INIT
2018-05-02 14:00:14,896 [INFO]  got global links {'Marktplatz': {'letter': 'M', 'server': 's62'}}
2018-05-02 14:00:15,657 [INFO]  signal received - reduced wait time by 172.2 seconds

2018-05-02 14:00:26,277 [INFO]  all items processed... sleeping...

root@s62:~# egrep '2018-05-02 14:.*updating logon script for' /var/log/univention/ucs-school-user-logonscript-daemon.log | wc -l
910

root@s62:~# find /var/lib/samba/netlogon/ -name '*.vbs' -ls | grep 'Mai  2 14:' | wc -l
908

---

listener  : 919 user / 3s = 306,33 user/s
daemon    : 919 user / 11s = 83,55 user/s (no sleeps)

wall clock resync -> files written: 14:00:14 -> 14:00:26 = 12s
wall clock: 919 user / 12s = 76,58 user/s
================================================================

root@s121:~# cd /var/lib/samba/netlogon/
root@s121:/var/lib/samba/netlogon# diff -Naur user.14.0.0-1/ user.14.0.0-4/
→ no difference

3 x faster on wall clock, same files generated.
Comment 5 Sönke Schwardt-Krummrich univentionstaff 2018-05-02 17:59:04 CEST
UCS@school 4.3 v3 has been released.

https://docs.software-univention.de/changelog-ucsschool-4.3v3-de.html

If this error occurs again, please clone this bug.