Univention Bugzilla – Bug 45802
4.2: Listener module for generating netlogon scripts is too slow
Last modified: 2018-06-04 15:34:08 CEST
A large customer reported (2017112021000371) that the regeneration of netlogon script takes now 2 hours instead of ~20 minutes. Also the new daemon throws tracebacks from time to time which (intentionally) slows down the daemon each time by ~5 seconds. +++ This bug was initially created as a clone of Bug #42997 +++ Currently the listener module for generating netlogon scripts is triggered by LDAP modifications of user, group and share objects. For each event the corresponding netlogon scripts are rebuilt. There are some situations when this approach takes too much time: a) when rejoining slave systems, the rebuild o all netlogon scripts takes minutes up to hours b) when starting the exam mode, there the replication might be too slow and hits the current timeout of 300 seconds for replication of all exam users to the corresponding school server. Idea to speed up the rebuild: Affected DNs are not immediately handled in handler() but collected until a certain number of changes is reached, e.g. handler() has been called 100 times (UCR variable!), or the listener module's postrun() method is called. DNs might be collected by creating/touching the file /var/lib/ucs-school-netlogon-user-logonscripts/${DN}. This way recurring changes to the same LDAP object are automatically merged and there is no information loss if the listener gets stopped. Log excerpts from a customer system: 15.11.16 13:47:03.619 LISTENER (INFO): updating 'cn=exam-someuser1,cn=uid,cn=temporary,cn=univention,dc=example,dc=com' command a 15.11.16 13:47:03.733 LISTENER (INFO): updating 'cn=xxxxxx,cn=uidNumber,cn=temporary,cn=univention,dc=example,dc=com' command a 15.11.16 13:47:03.831 LISTENER (INFO): updating 'cn=xxxxxx,cn=gidNumber,cn=temporary,cn=univention,dc=example,dc=com' command a 15.11.16 13:47:03.925 LISTENER (INFO): updating 'cn=xxxxxx,cn=gidNumber,cn=temporary,cn=univention,dc=example,dc=com' command d 15.11.16 13:47:03.993 LISTENER (INFO): updating 'cn=S-1-5-21-xxxxxxxxxx-xxxxxxxxxx-xxxxxxxxxx-xxxxxxx,cn=sid,cn=temporary,cn=univention,dc=example,dc=com' command a 15.11.16 13:47:04.073 LISTENER (INFO): updating 'uid=exam-someuser1,cn=examusers,ou=schoolA,dc=example,dc=com' command a 15.11.16 13:47:04.318 LISTENER (INFO): updating 'cn=Domain Users schoolA,cn=groups,ou=schoolA,dc=example,dc=com' command m 15.11.16 13:47:04.653 LISTENER (INFO): updating 'cn=schuelerschoolA,cn=groups,ou=schoolA,dc=example,dc=com' command m 15.11.16 13:47:04.872 LISTENER (INFO): updating 'cn=schoolA-Q II,cn=klassen,cn=schueler,cn=groups,ou=schoolA,dc=example,dc=com' command m 15.11.16 13:47:05.002 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-mat 10,cn=kurse,ou=schoolA,dc=example,dc=com' command m 15.11.16 13:47:05.152 LISTENER (INFO): updating 'cn=schoolA-LKurs-13-BIO 4,cn=kurse,ou=schoolA,dc=example,dc=com' command m 15.11.16 13:47:05.290 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-che 30,cn=kurse,ou=schoolA,dc=example,dc=com' command m 15.11.16 13:47:05.382 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-kun 15,cn=kurse,ou=schoolA,dc=example,dc=com' command m 15.11.16 13:47:05.640 LISTENER (INFO): updating 'cn=schoolA-LKurs-13-GEG 2,cn=kurse,ou=schoolA,dc=example,dc=com' command m 15.11.16 13:47:05.812 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-spopr 12,cn=kurse,ou=schoolA,dc=example,dc=com' command m 15.11.16 13:47:05.952 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-deu 65,cn=kurse,ou=schoolA,dc=example,dc=com' command m 15.11.16 13:47:06.137 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-ges 35,cn=kurse,ou=schoolA,dc=example,dc=com' command m 15.11.16 13:47:06.443 LISTENER (INFO): updating 'cn=schoolA-GKurs-13-eng 50,cn=kurse,ou=schoolA,dc=example,dc=com' command m 15.11.16 13:47:06.654 LISTENER (INFO): updating 'cn=OUschoolA-Klassenarbeit,cn=ucsschool,cn=groups,dc=example,dc=com' command m 15.11.16 13:47:06.788 LISTENER (INFO): updating 'cn=exam-someuser1,cn=uid,cn=temporary,cn=univention,dc=example,dc=com' command d 15.11.16 13:47:06.869 LISTENER (INFO): updating 'cn=S-1-5-21-xxxxxxxxxx-xxxxxxxxxx-xxxxxxxxxx-xxxxxxx,cn=sid,cn=temporary,cn=univention,dc=example,dc=com' command d
2017-11-30 14:36:58,005 [DEBUG] EXCEPTION: Traceback (most recent call last): File "/usr/share/ucs-school-netlogon-user-logonscripts/ucs-school-user-logonscript-daemon", line 659, in run self.handle_user_change(user_dn, username) File "/usr/share/ucs-school-netlogon-user-logonscripts/ucs-school-user-logonscript-daemon", line 453, in handle_user_change membershipIDs.add(new.get('gidNumber')[0]) TypeError: 'NoneType' object has no attribute '__getitem__'
Can we get the tracebacks?
Traceback (most recent call last): File "/usr/lib/univention-directory-listener/system/ucs-school-user-logonscript.py", line 172, in handler user_queue = SqliteQueue(logger=Log) File "/usr/lib/pymodules/python2.7/ucsschool/netlogon/__init__.py", line 73, in __init__ self.setup_database() File "/usr/lib/pymodules/python2.7/ucsschool/netlogon/__init__.py", line 101, in setup_database self.cursor.execute(u'CREATE TABLE IF NOT EXISTS user_queue (id INTEGER PRIMARY KEY AUTOINCREMENT, userdn TEXT, username TEXT)') sqlite3.OperationalError: database is locked
Traceback (most recent call last): File "/usr/lib/univention-directory-listener/system/ucs-school-user-logonscript.py", line 178, in handler handle_group(dn, new, old, lo, user_queue) File "/usr/lib/univention-directory-listener/system/ucs-school-user-logonscript.py", line 144, in handle_group user_queue.add(user_dn, db_commit=False) File "/usr/lib/pymodules/python2.7/ucsschool/netlogon/__init__.py", line 130, in add self.cursor.execute(u'insert or replace into user_queue (id, userdn, username) VALUES ((select id from user_queue where userdn = ?), ?, (select username from user_queue where userdn = ?))', (userdn, userdn, userdn)) sqlite3.OperationalError: database is locked
(In reply to Sönke Schwardt-Krummrich from comment #0) > A large customer reported (2017112021000371) that the regeneration of > netlogon script takes now 2 hours instead of ~20 minutes. All "Domain Users $OU" groups are replicated to all school slaves. When these groups are processed, all uniqueMember entries are written to queue sqlite db → all users in LDAP are considered for regeneration of the netlogon script but a small percentage of those users actually exists on the school slave. I prepared a check in the listener module: Upon group add/modify the listener checks if there is at least one share object in LDAP that uses the group. If not, the group is not considered. In the customer environment, this reduced the processing time down to ~20min. But the initial lookup in LDAP against the master (or backup) is too heavy: A shortened LDAP filter takes only ~2 minutes: time univention-ldapsearch -p 7389 -h $ldap_master \ '(|(uid=*)(objectClass=univentionGroup)(objectClass=univentionShare))' real 2m4.756s user 0m0.708s sys 0m1.104s The current LDAP filter takes much longer: time univention-ldapsearch -p 7389 -h $ldap_master '(|(&(|(&(objectClass=posixAccount)(objectClass=shadowAccount))(objectClass=univentionMail)(objectClass=sambaSamAccount)(objectClass=simpleSecurityObject)(objectClass=inetOrgPerson))(!(uidNumber=0))(!(uid=*$))(!(univentionObjectFlag=functional)))(&(cn=*)(|(&(objectClass=univentionGroup))(&(objectClass=sambaGroupMapping))))(&(objectClass=univentionShare)(cn=*)))' real 8m28.993s user 0m0.700s sys 0m1.028s We should consider reducing the LDAP filter complexity and check each retrieved LDAP object within the listener module. > Also the new daemon throws tracebacks from time to time which > (intentionally) slows down the daemon each time by ~5 seconds. (In reply to Sönke Schwardt-Krummrich from comment #1) > 2017-11-30 14:36:58,005 [DEBUG] EXCEPTION: > Traceback (most recent call last): > File > "/usr/share/ucs-school-netlogon-user-logonscripts/ucs-school-user- > logonscript-daemon", line 659, in run > self.handle_user_change(user_dn, username) > File > "/usr/share/ucs-school-netlogon-user-logonscripts/ucs-school-user- > logonscript-daemon", line 453, in handle_user_change > membershipIDs.add(new.get('gidNumber')[0]) > TypeError: 'NoneType' object has no attribute '__getitem__' Some users in the LDAP have no gidNumber because they are simple authentication accounts → slow down of 5 seconds per user object. I will skip all users with missing gidNumber.
(In reply to Florian Best from comment #4) > Traceback (most recent call last): > File > "/usr/lib/univention-directory-listener/system/ucs-school-user-logonscript. > py", line 172, in handler > user_queue = SqliteQueue(logger=Log) > File "/usr/lib/pymodules/python2.7/ucsschool/netlogon/__init__.py", line > 73, in __init__ > self.setup_database() > File "/usr/lib/pymodules/python2.7/ucsschool/netlogon/__init__.py", line > 101, in setup_database > self.cursor.execute(u'CREATE TABLE IF NOT EXISTS user_queue (id INTEGER > PRIMARY KEY AUTOINCREMENT, userdn TEXT, username TEXT)') > sqlite3.OperationalError: database is locked Maybe we should execute the following: # automatic retry every few miliseconds for up to 30 s if db is lockedconn.execute("PRAGMA busy_timeout = 30000")
Compare with timings in Bug #46696 comment2: Attached a patch that moves the LDAP initialization to prerun(). --------------------------------------------------------------------------------- 09.04.18 15:04:19.925 LISTENER ( WARN ) : initializing module ucs-school-user-logonscript ... 09.04.18 15:04:39.360 LISTENER ( WARN ) : finished initializing module ucs-school-user-logonscript with rv=0 09.04.18 15:04:54.373 LISTENER ( PROCESS ) : ucs-school-user-logonscript: Timer total: prerun start___________________: 0.000000 prerun end_____________________: 0.037327 handler start__________________: 0.000000 handler init 0_________________: 0.043483 handler init 1_________________: 0.052743 handler init 2_________________: 0.035691 handler init 3_________________: 1.345861 handle_group start_____________: 0.302925 handle_group end_______________: 4.732932 handler done___________________: 0.404676 handle_user start______________: 0.080790 handle_user end________________: 7.141573 handle_share start_____________: 0.004682 add_group_change_to_queue start: 0.001553 add_group_change_to_queue end__: 0.005056 handle_share end_______________: 0.002589 postrun start__________________: 15.014258 postrun end____________________: 0.000553 --------------------------------------------------------------------------------- root@m65:~# egrep '^09.04.18 15:04.*ucs-school-user-logonscript: Timer one' /var/log/univention/listener.log | wc -l 2368 --------------------------------------------------------------------------------- 2018-04-09 15:04:20,274 [INFO] INIT 2018-04-09 15:04:20,327 [INFO] got global links {'Marktplatz': {'letter': 'M', 'server': 'm65'}} 2018-04-09 15:04:20,848 [INFO] signal received - reduced wait time by 172.5 seconds 2018-04-09 15:04:27,855 [INFO] Timer one: run() start____________________________: 0.000000 run() self.user_queue.query_next_user(): 0.000377 run() after queue is not empty_________: 0.000029 run() end______________________________: 7.523433 ... 2018-04-09 15:05:28,258 [INFO] Shutdown as requested 2018-04-09 15:05:28,259 [INFO] Timer total: run() start_____________________________________: 0.000000 run() self.user_queue.query_next_user()_________: 2.140297 run() after queue is not empty__________________: 0.251143 run() end_______________________________________: 19.950882 run() self.user_queue.remove(user_dn)___________: 8.404595 run() self.handle_user_change(user_dn, username): 27.065045 run() self.user_queue.add([(user_dn, username)]): 9.149080 --------------------------------------------------------------------------------- root@m65:~# egrep '^2018-04-09 15:0(4|5).*updating logon script for' /var/log/univention/ucs-school-user-logonscript-daemon.log | wc -l 2072 --------------------------------------------------------------------------------- Listener module performance: 2008 user / 35s → 57.37 user/s Daemon performance: 2008 user / 61s → 32.91 user/s Total real time: 15:04:19 → 15:05:28: 69s → 29.10 user/s That's almost 3 times as fast as before :)
Created attachment 9492 [details] move LDAP init to prerun() Ignore the timing lines.
Could have posted this link https://git.knut.univention.de/univention/ucsschool/commit/1d9eab41da721175bcd1e228de1672cbdce8b77b instead of creating a patch... which one would you have preferred?
*** Bug 46851 has been marked as a duplicate of this bug. ***
(In reply to Daniel Tröder from comment #10) > Could have posted this link > https://git.knut.univention.de/univention/ucsschool/commit/ > 1d9eab41da721175bcd1e228de1672cbdce8b77b instead of creating a patch... > which one would you have preferred? In this case: a patch. Survives even if we deactivate our gitlab instance. :-)
Several optimizations have been performed to improve performance of the listener module. 1) a missing SQL index has been added 2) the traceback of comment 1 has been fixed 3) groups (resp. the group's members) are only processed if the group has been referenced by at least one share 4) a much simpler LDAP query is used which reduced the initial LDAP query time by factor of ~4. In a customer environment the searchtime jumped from 15m56s to 4m36s. The found objects are verified later on against the UDM modules. So there should be no behaviour change. 5) As Daniel pointed out, the setup of a LDAP connection for each handled object is very expensive. The connection is now set up once and in case of an LDAP exception, and teared down in postrun(). 6) The LDAP connection is not performed against the UCS master anymore but to the local LDAP server. As shown below, this improved the processing speed of the listener module remarkably, so the LDAP replication is no longer block for longer periods. A test (module resync) has been performed with the old and the new code. The created logon scripts have been compared from both test runs. No difference has been found. * old code with benchmarking: 24.03.18 18:33:58.041 LISTENER ( WARN ) : initializing module ucs-school-user-logonscript 24.03.18 18:35:24.889 LISTENER ( WARN ) : finished initializing module ucs-school-user-logonscript with rv=0 24.03.18 18:35:39.908 LISTENER ( PROCESS ) : ucs-school-user-logonscript: Timer total: handler start__: 0.000000 lo ready_______: 59.822722 db ready_______: 1.116696 handlers called: 12.993600 handler end____: 0.262067 ** …daemon.log 2018-03-24 18:33:58,261 [INFO] INIT 2018-03-24 18:33:58,307 [INFO] got global links {'Marktplatz': {'letter': 'M', 'server': 'slave111'}} 2018-03-24 18:34:08,434 [INFO] signal received - reduced wait time by 162.9 seconds […] 2018-03-24 18:34:41,062 [INFO] all items processed... sleeping... 2018-03-24 18:34:41,080 [INFO] signal received - reduced wait time by 173.0 seconds […] 2018-03-24 18:34:49,676 [INFO] all items processed... sleeping... 2018-03-24 18:34:49,682 [INFO] signal received - reduced wait time by 173.0 seconds […] 2018-03-24 18:34:58,653 [INFO] all items processed... sleeping... 2018-03-24 18:34:58,662 [INFO] signal received - reduced wait time by 173.0 seconds […] 2018-03-24 18:35:07,420 [INFO] all items processed... sleeping... 2018-03-24 18:35:07,438 [INFO] signal received - reduced wait time by 173.0 seconds […] 2018-03-24 18:35:16,118 [INFO] all items processed... sleeping... 2018-03-24 18:35:16,128 [INFO] signal received - reduced wait time by 173.0 seconds […] 2018-03-24 18:35:31,794 [INFO] all items processed... sleeping... → Start: 18:33:58 → End: 18:35:31 → 93 seconds # egrep '^2018-04-09 15:0(4|5).*updating logon script for' /var/log/univention/ucs-school-user-logonscript-daemon.log | wc -l 2398 → Listener: 2398 users / 86 seconds → 27.8 users/s → Daemon: 2398 users / 83 seconds → 28.9 users/s * new code with benchmarking ** listener.log 24.03.18 21:41:48.343 LISTENER ( WARN ) : initializing module ucs-school-user-logonscript 24.03.18 21:41:59.219 LISTENER ( WARN ) : finished initializing module ucs-school-user-logonscript with rv=0 24.03.18 21:42:14.236 LISTENER ( PROCESS ) : ucs-school-user-logonscript: Timer total: handler start__: 0.000000 lo ready_______: 0.054209 db ready_______: 0.620861 handlers called: 3.995481 handler end____: 0.192979 ** …daemon.log 2018-03-24 21:41:48,614 [INFO] INIT 2018-03-24 21:41:48,665 [INFO] got global links {'Marktplatz': {'letter': 'M', 'server': 'slave111'}} 2018-03-24 21:41:53,086 [INFO] signal received - reduced wait time by 168.6 seconds […] 2018-03-24 21:42:09,424 [INFO] all items processed... sleeping... → Start: 21:41:48 → End: 21:42:09 → 27 seconds # egrep '^2018-04-09 15:0(4|5).*updating logon script for' /var/log/univention/ucs-school-user-logonscript-daemon.log | wc -l 2398 → Listener: 2398 users / 11 seconds → 218 users/s → Daemon: 2398 users / 16 seconds → 150 users/s Test run in customer environment: * old package 12.04.18 14:45:09.031 LISTENER ( WARN ) : initializing module ucs-school-user-logonscript 12.04.18 14:45:09.263 LISTENER ( INFO ) : call handler_initialize for module ucs-school-user-logonscript 12.04.18 14:45:12.859 LISTENER ( INFO ) : initialize schema for module ucs-school-user-logonscript 12.04.18 17:09:54.627 LISTENER ( WARN ) : finished initializing module ucs-school-user-logonscript with rv=0 → ~2h25m → ~145min * new package 20.04.18 16:02:14.116 LISTENER ( WARN ) : initializing module ucs-school-user-logonscript 20.04.18 16:02:14.356 LISTENER ( INFO ) : call handler_initialize for module ucs-school-user-logonscript 20.04.18 16:02:17.799 LISTENER ( INFO ) : initialize schema for module ucs-school-user-logonscript 20.04.18 16:11:53.043 LISTENER ( WARN ) : finished initializing module ucs-school-user-logonscript with rv=0 → ~9m30s → ~10min ucs-school-netlogon-user-logonscripts (13.0.2-14) c6bc3eb78d6e | Bug #45802: add changelog entry c85b82abeb1f | Bug #45802: use local LDAP connection instead of a LDAP connection to DC master d891c30e88df | Bug #45802: reuse LDAP connection between handler() calls 8d574d94e4be | Bug #45802: make mypy happy 2a5ee03fec47 | Bug #45802: cleanup 0103df57e77f | Bug #45802: use much simpler LDAP query to reduce query time 94968015f220 | Bug #45802: only process groups that are referenced by at least one share 40737e25a8ba | Bug #45802: add mypy comments 9339623049cb | Bug #45802: fix traceback by skipping users that have no gidNumer set 04ee0f355139 | Bug #45802: add comment for mypy 590ffb5ac3f2 | Bug #45802: improve performance by adding a missing index ucs-school-netlogon-user-logonscripts.yaml 7b26bd6d9a02 | Bug #45802: add advisory entry Package: ucs-school-netlogon-user-logonscripts Version: 13.0.2-14A~4.2.0.201804201639 Branch: ucs_4.2-0 Scope: ucs-school-4.2 (Hint: package does not work until Bug 46696 is fixed).
OK: code changes, fixed class name in mypy hint (73f90f25, 9aae7699) OK: manual tests: ================================================================ root@s121:~# univention-ldapsearch -LLL '(&(uid=*)(!(uid=*$)))' dn | grep dn: | wc -l 927 root@s121:~# dpkg -l ucs-school-netlogon-user-logonscripts ii ucs-school-netlogon-user 13.0.2-12A~4.2.0. all root@s121:~# univention-directory-listener-ctrl resync ucs-school-user-logonscript --- /var/log/univention/listener.log --- 24.04.18 16:03:36.708 DEBUG_INIT 24.04.18 16:03:39.440 LISTENER ( WARN ) : initializing module ucs-school-user-logonscript […] 24.04.18 16:05:35.720 LISTENER ( WARN ) : finished initializing module ucs-school-user-logonscript with rv=0 --- root@s121:~# egrep '24.04.18 16:.*ucs-school-user-logonscript: added/updated entry' /var/log/univention/listener.log | wc -l 5389 --- ucs-school-user-logonscript-daemon.log --- 2018-04-24 16:03:39,983 [INFO] INIT 2018-04-24 16:04:14,331 [DEBUG] just woke up... 2018-04-24 16:04:14,340 [INFO] updating logon script for uid=Guest,cn=users,dc=uni,dc=dtr […] 2018-04-24 16:05:49,272 [INFO] updating logon script for uid=modest.stoeh,cn=schueler,cn=users,ou=SchuleEins,dc=uni,dc=dtr 2018-04-24 16:05:49,277 [INFO] all items processed... sleeping... --- root@s121:~# egrep '2018-04-24 16:.*updating logon script for' /var/log/univention/ucs-school-user-logonscript-daemon.log | wc -l 3816 root@s121:~# find /var/lib/samba/netlogon/ -name '*.vbs' -ls | grep 'Apr 24' | wc -l 914 --- listener : 927 user / 116s = 7,99 user/s daemon : 927 user / 95s = 7,75 user/s (incl. sleeps) wall clock resync -> files written: 16:03:39 -> 16:05:49 = 130s wall clock: 927 user / 130s = 7,13 user/s ================================================================ root@s121:~# univention-upgrade --ignoreterm --ignoressh root@s121:~# dpkg -l ucs-school-netlogon-user-logonscripts ii ucs-school-netlogon-user 13.0.2-15A~4.2.0. all root@s121:~# service univention-directory-listener restart root@s121:~# service ucs-school-netlogon-user-logonscripts restart root@s121:~# univention-directory-listener-ctrl resync ucs-school-user-logonscript root@s121:~# egrep '2018-04-24 16:(39|40).*updating logon script for' /var/log/univention/ucs-school-user-logonscript-daemon.log | wc -l 2197 --- /var/log/univention/listener.log --- 24.04.18 17:12:34.101 LISTENER ( WARN ) : initializing module ucs-school-user-logonscript […] 24.04.18 17:12:47.010 LISTENER ( WARN ) : finished initializing module ucs-school-user-logonscript with rv=0 --- -------------- ucs-school-user-logonscript-daemon.log --- 2018-04-24 17:12:42,335 [DEBUG] just woke up... […] 2018-04-24 17:12:58,244 [INFO] all items processed... sleeping... --- listener : 927 user / 13s = 71,30 user/s daemon : 927 user / 16s = 57,94 user/s (no sleeps) wall clock resync -> files written: 17:12:34 -> 17:12:58 = 24s wall clock: 927 user / 24s = 38,63 user/s ================================================================ root@s121:~# cd /var/lib/samba/netlogon/ root@s121:/var/lib/samba/netlogon# diff -Naur user.13.0.2-12/ user.13.0.2-15/ → no difference 5.4 x faster on wall clock, same files generated.
UCS@school 4.2 v9 has been released. https://docs.software-univention.de/changelog-ucsschool-4.2v9-de.html If this error occurs again, please clone this bug.