Univention Bugzilla – Bug 33458
Increasing CPU usage over time
Last modified: 2014-03-12 14:43:30 CET
univention-virtual-machine-manager-daemon's CPU usage increases over time. In a customer's environment with 95 Xen hosts and ~240 virtual machines, UVMM (running as a virtual machine with 2 GB RAM and 2 vCPUs) becomes unusably slow after 2 days (even over the weekend, without heavy frontend usage).
We should also check a backport to 3.1-1.
Further debugging with "vmstat 1" running showed that after 1 h the maximum number of open files (1024) was reached and UVMMd bombed, because it could no longer write its (cache) files:

       in    cs us sy id
     6426  7984  5  2 93
     7230  8795 43 19 38
    35532 36021 41 18 40

/v/l/u/vmm.log shows failed TLS connection errors for 27 of the 93 servers, so after the initial ½, 1, 2 s, three new file descriptors are added every 5 minutes for each failing server. That is, after 50 minutes the maximum of 1024 is all but reached:

    echo $((27*3*10 + 66*3)) # 27 failing * 3 FD/try * 50min/5m + 66 working * 3 FD = 1008

I can reproduce it locally by adding other libvirtds which use a different SSL CA:

    uvmm add qemu://xen16.knut.univention.de/system
    uvmm add qemu://xen14.knut.univention.de/system
    uvmm add qemu://skepp.knut.univention.de/system
    uvmm add qemu://krus.knut.univention.de/system
    uvmm add qemu://isalla.knut.univention.de/system
    uvmm add qemu://boksel.knut.univention.de/system
    uvmm add qemu://utby.knut.univention.de/system
    uvmm add qemu://xen2.knut.univention.de/system

    # lsof -p $(pgrep -f /usr/sbin/univention-virtual-machine-manager-daemon) | awk -F ' ' '/TCP/ {print gensub(".*:[0-9]+->(.+):[0-9]+","\\1","g",$9),$10;}' | sort | uniq -c
         19 192.168.0.109 (CLOSE_WAIT)
         19 192.168.0.203 (CLOSE_WAIT)
         19 192.168.0.205 (CLOSE_WAIT)
         19 192.168.0.238 (CLOSE_WAIT)
         19 192.168.0.87 (CLOSE_WAIT)
          1 *:2106 (LISTEN)
          1 xen12.phahn.dev (ESTABLISHED)
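The arithmetic above can be written as a small model to estimate how fast the descriptor limit is approached. This is only a sketch: the function name and parameters are made up for illustration, it assumes each connection (working or failing) costs 3 descriptors as stated above, and it ignores the initial ½, 1, 2 s retries and any other files the daemon keeps open.

```python
def projected_fds(failing, working, minutes,
                  fds_per_connection=3, retry_interval_min=5):
    """Estimate the number of open FDs after `minutes`.

    Hypothetical model: every working host holds `fds_per_connection`
    descriptors once; every failing host leaks that many per retry.
    """
    retries = minutes // retry_interval_min
    leaked = failing * fds_per_connection * retries
    held = working * fds_per_connection
    return leaked + held


# Numbers from the comment above: 27 failing, 66 working, 50 minutes.
print(projected_fds(27, 66, 50))  # 1008
```

With these numbers the next retry round (55 minutes) already exceeds 1024 in this model.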
For debugging libvirt: LIBVIRT_GNUTLS_DEBUG=1

My UVMMd failed last night and reproduced the high CPU load. I've been unable to reproduce the FD leak using "virsh", but the attached code clearly shows the problem to be related to the event loop implementation. (Notice that UVMMd is still using the pure-Python implementation and not the default C implementation → Bug #31371.)

There's already a very similar problem with TLS not working and the connection not being closed properly: Bug #31370. Bug #20296 and Bug #20476 also look like event-loop problems.
Created attachment 5741
Threaded event loop UVMM performance problem

Registers the default event loop and starts 5 connections in parallel. The number of open file descriptors increases with each round.
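To watch the descriptor count of a running process between rounds, something like the following can be used. It is not part of the attachment; it is a minimal sketch that assumes a Linux system with /proc mounted, and the helper name is hypothetical.

```python
import os


def count_open_fds(pid):
    """Return the number of open file descriptors of process `pid`.

    Linux-only assumption: each entry in /proc/<pid>/fd is one open FD.
    """
    return len(os.listdir("/proc/%d/fd" % pid))


# Example: descriptor count of the current process.
print(count_open_fds(os.getpid()))
```

Calling it with the PID of UVMMd (as root or as the owning user) once per retry interval should show whether the count keeps growing.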
Created attachment 5759
Threaded event loop UVMM performance problem v2

Starts a thread to run the event loop. Enables debug output.
The problem was caused by using the outdated event loop Python code from libvirt. Changing to the internal libvirt implementation fixed the problem with open file handles. In my tests I found no regressions while using UVMM: Start, Stop, Destroy, Snapshot-{Create|Revert} from UVMM and virsh.

r47521 univention-virtual-machine-manager-daemon 3.0.17-4.483.201401301119
r47522 update copyright
r47524 2014-01-30-univention-virtual-machine-manager-daemon.yaml
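For illustration, switching to libvirt's internal event loop from Python usually looks roughly like the sketch below. This is not the change committed in r47521; the helper name and the stop_event parameter are assumptions. Only virEventRegisterDefaultImpl() and virEventRunDefaultImpl() are taken from the libvirt Python bindings, and the module is passed in as an argument so the sketch does not depend on how UVMMd imports it.

```python
import threading


def start_default_event_loop(libvirt_module, stop_event):
    """Register libvirt's built-in (C) event loop and run it in a thread.

    `libvirt_module` is expected to be the imported `libvirt` module.
    Registration has to happen before any connection is opened.
    """
    libvirt_module.virEventRegisterDefaultImpl()

    def run():
        # Each call processes one iteration of the default event loop.
        # Note: it may block until an event or timer fires, so the stop
        # flag is only checked between iterations.
        while not stop_event.is_set():
            libvirt_module.virEventRunDefaultImpl()

    thread = threading.Thread(target=run, name="libvirt-event-loop")
    thread.daemon = True
    thread.start()
    return thread


# Usage (assuming the libvirt Python bindings are installed):
#   import libvirt
#   stop = threading.Event()
#   start_default_event_loop(libvirt, stop)
#   conn = libvirt.open("qemu:///system")
```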
OK: r47521,r47522,r47524
OK: aptitude install '?source-package(univention-virtual-machine-manager)?installed'
OK: announce_errata -V 2014-01-30-univention-virtual-machine-manager-daemon.yaml
OK: uvmm add qemu+tls://*.knut.univention.de/system # rejected → no new connections
OK: uvmm add qemu+tls://$(hostname -f)/systen # 3 new FDs
OK: uvmm remove qemu+tls://$(hostname -f)/systen # 3 FDs closed
OK: uvmm add xen://xen14.knut.univention.de/
OK: uvmm remove xen://xen14.knut.univention.de/
OK: lsof -p `pgrep -f /usr/sbin/univention-virtual-machine-manager-daemon`
OK: less /var/log/univention/virtual-machine-manager-daemon.log
OK: 3.0.17-4.483.201401301119
http://errata.univention.de/ucs/3.2/51.html
*** Bug 28548 has been marked as a duplicate of this bug. ***
*** Bug 31370 has been marked as a duplicate of this bug. ***