Univention Bugzilla – Bug 53669
[4.4] [UDM HTTP API] server does not scale horizontally
Last modified: 2022-02-23 17:06:50 CET
Backport to UCS 4.4

+++ This bug was initially created as a clone of Bug #50050 +++

The UDM HTTP API service uses only one process. Under medium load UDM is CPU-bound, so the UDM HTTP API cannot deliver the speedups available on multi-CPU/multi-core systems.

On a 4-core system, when creating 1000 users sequentially, using 3 or 4 processes makes a big difference:

parallelism=1:
  Seconds for creating 1000 Users: 127,57
  Seconds for reading 1000 Users: 3,15
  Seconds for modifying 1000 Users: 23,32
  Seconds for deleting 1000 Users: 71,95

parallelism=3:
  Seconds for creating 1000 Users: 47,55
  Seconds for reading 1000 Users: 1,25
  Seconds for modifying 1000 Users: 17,93
  Seconds for deleting 1000 Users: 42,69

parallelism=4:
  Seconds for creating 1000 Users: 47,53
  Seconds for reading 1000 Users: 1,04
  Seconds for modifying 1000 Users: 22,83
  Seconds for deleting 1000 Users: 53,81

The UDM REST API service should spawn [num-cores]-1 processes to be able to deliver the maximum speed available on the system.
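A minimal sketch of the proposed [num-cores]-1 default, assuming the core count is taken from Python's multiprocessing module. The helper name default_process_count is hypothetical and is not taken from the actual patch:

```python
import multiprocessing


def default_process_count(cpu_count=None):
    """Return [num-cores]-1 worker processes, but never fewer than one."""
    if cpu_count is None:
        # Assumption: the number of cores as seen by the Python runtime.
        cpu_count = multiprocessing.cpu_count()
    return max(1, cpu_count - 1)


if __name__ == "__main__":
    print(default_process_count())
```

The max(1, ...) guard only exists so that a single-core machine still gets one worker.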
MR: https://git.knut.univention.de/univention/ucs/-/merge_requests/208
Backported the changes in the following commits:

univention-directory-manager-rest.yaml
f8ead2434a4f | Bug #53669: debian/changelog + YAML + developer reference

univention-directory-manager-rest (9.0.16-16)
f8ead2434a4f | Bug #53669: debian/changelog + YAML + developer reference

univention-directory-manager-rest (9.0.16-15)
3a453aedf552 | Bug #53669: don't create global dictionaries during python import
6ef5d2c3a615 | Bug #53669: remove daemonizing handled by systemd
db3f8838899b | Bug #53669: Add multiprocessing to UDM REST API
Verified:
* Code review
* Package update
* Functional test
* Advisory

1000 parallel curl users/user lookups from a remote machine
* before (against VM with 8 cores):
  * ~132 seconds + 7 curl timeouts
* after:
  * ~20 seconds with directory/manager/rest/processes=8 (or =0)
  * ~27 seconds with directory/manager/rest/processes=4

Reopen:

root@master60:~# time systemctl stop univention-directory-manager-rest.service
real 0m20,122s
user 0m0,000s
sys 0m0,004s

Before:

root@master60:~# time systemctl stop univention-directory-manager-rest.service
real 0m1,093s
user 0m0,008s
sys 0m0,000s
Some debugging showed that the shared_memory "# manager" processes get killed by systemd and are gone even before signal_handler_stop runs. systemd uses the default "KillMode=control-group" here. It works when I add KillMode=process to /lib/systemd/system/univention-directory-manager-rest.service.

While debugging I found that the os.waitpid(pid, os.WNOHANG) call in the safe_kill methods doesn't do much. But when I change it to wait without os.WNOHANG, the behavior is worse, as I get a traceback

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/dist-packages/univention/admin/rest/__main__.py", line 232, in <module>
    Server.main()
  File "/usr/lib/python2.7/dist-packages/univention/admin/rest/__main__.py", line 228, in main
    server.run(args)
  File "/usr/lib/python2.7/dist-packages/univention/admin/rest/__main__.py", line 127, in run
    child_id = tornado.process.fork_processes(args.processes, 0)
  File "/usr/lib/python2.7/dist-packages/tornado/process.py", line 155, in fork_processes
    pid, status = os.wait()
OSError: [Errno 10] No child processes

from the parent processes (de_DE.UTF-8 and en_US.UTF-8). The manpage of waitpid seems to say that this is the expected behavior *in case* ...:

> POSIX.1-2001 specifies that if the disposition of SIGCHLD is set to SIG_IGN or the SA_NOCLDWAIT flag is set for SIGCHLD (see sigaction(2)), then children that terminate do not become zombies and a call to wait() or waitpid() will block until all children have terminated, and then fail with errno set to ECHILD. (The original POSIX standard left the behaviour of setting SIGCHLD to SIG_IGN unspecified.) Linux 2.6 conforms to this specification.
> However, Linux 2.4 (and earlier) does not: if a wait() or waitpid() call is made while SIGCHLD is being ignored, the call behaves just as though SIGCHLD were not being ignored, that is, the call blocks until the next child terminates and then returns the process ID and status of that child.

So I would propose that we don't try to improve that, as it's not really a problem here.
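For illustration, the ECHILD behavior quoted from the manpage can be reproduced in isolation. This is only a sketch for Linux, assuming SIGCHLD is set to SIG_IGN; whether that is actually the cause inside the REST service has not been confirmed in this bug. The function name is made up for the example:

```python
import errno
import os
import signal


def wait_with_sigchld_ignored():
    """Fork a child while SIGCHLD is ignored and return the errno of waitpid()."""
    # Must run in the main thread, because signal.signal() requires it.
    previous = signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately
        try:
            # Blocking wait: the child is reaped automatically, so there is
            # no status left to collect and the call fails.
            os.waitpid(pid, 0)
        except OSError as exc:
            return exc.errno
        return None
    finally:
        signal.signal(signal.SIGCHLD, previous)


if __name__ == "__main__":
    print(wait_with_sigchld_ignored() == errno.ECHILD)
```

ECHILD is errno 10 on Linux, which matches the "[Errno 10] No child processes" in the traceback.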
Move operations were fixed by
f83c95cbc8 Bug #53669: Fix move operations

The move operations used a nested shared dict. In Python 3.6, modifications to the inner shared dict were propagated to the containing dict proxy. In Python 2.7 this doesn't work: I had to make a local copy of the inner dict and reassign it to the outer dict to notify the outer dict proxy of the change.

Tests were successful.
https://jenkins.knut.univention.de:8181/job/UCS-4.4/job/UCS-4.4-8/job/AutotestJoin/lastCompletedBuild/testReport/
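The copy-and-reassign workaround can be sketched roughly as follows. This is not the code from f83c95cbc8; the helper name update_nested and the key and field names are hypothetical, and the only assumption is an outer dict created with multiprocessing.Manager().dict():

```python
import multiprocessing


def update_nested(shared, key, field, value):
    """Change shared[key][field] so that the outer dict proxy sees the change."""
    inner = dict(shared[key])  # local copy of the inner dict
    inner[field] = value       # modify the copy only
    shared[key] = inner        # reassign: triggers __setitem__ on the outer proxy


if __name__ == "__main__":
    manager = multiprocessing.Manager()
    shared = manager.dict()
    shared["move-1"] = {"status": "running"}  # hypothetical key and field
    update_nested(shared, "move-1", "status", "finished")
    print(shared["move-1"])
    manager.shutdown()
```

Modifying shared["move-1"]["status"] directly would only change a temporary copy of the inner dict, because the outer proxy is not notified. The reassignment is what propagates the change through the manager.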
univention-directory-manager-rest (9.0.16-18)
a47446852903 | Bug #53669: fix delayed restarting of univention-directory-manager-rest service

* KillMode=process does not kill the whole process group and therefore does not cause the 1-second delayed stop to hang when trying to access multiprocessing shared memory contents, which were already killed by systemd.
* os.waitpid(pid, os.WNOHANG) does not block and therefore is only useful for fetching status, which we don't need here.
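A small sketch of the WNOHANG point above, assuming a POSIX system and default SIGCHLD/SIGTERM handling. It is unrelated to the actual safe_kill implementation, and the function name is invented for the example:

```python
import os
import signal
import time


def poll_then_reap():
    """Show that WNOHANG only polls, while a blocking waitpid() reaps the child."""
    pid = os.fork()
    if pid == 0:
        time.sleep(30)  # child: stay alive long enough to be polled
        os._exit(0)
    # Non-blocking: returns (0, 0) right away because the child is still running.
    polled = os.waitpid(pid, os.WNOHANG)
    os.kill(pid, signal.SIGTERM)
    # Blocking: returns only after the child has terminated.
    reaped_pid, status = os.waitpid(pid, 0)
    return polled, reaped_pid == pid, os.WIFSIGNALED(status)


if __name__ == "__main__":
    print(poll_then_reap())
```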
bc85a89bb6 | Fix tab/space mix in debian/changelog

Reopen: Advisory version
(In reply to Arvid Requate from comment #10)
> Reopen: Advisory version

5956e1d91d
Verified:
* service shutdown works now w/o delay
* Advisory
<https://errata.software-univention.de/#/?erratum=4.4x1181>