Univention Bugzilla – Bug 35101
virsh times out when restoring a snapshot
Last modified: 2016-02-04 13:52:23 CET
When restoring a snapshot (taken while the domain was stopped) for a stopped domain, the command "virsh snapshot-restore <snapshot name> <domain name>" fails after some time with the error "error: End of file while reading data: Eingabe-/Ausgabefehler" (German for "Input/output error"). Afterwards libvirtd crashes and must be restarted. Restoring the snapshot with "qemu-img snapshot -a "<snapshot name>" /path/to/image/file" works without problems.
During a long-running operation such as "snapshot-revert", libvirtd no longer responds in a timely manner and is thus prone to being killed by /usr/lib/univention-virtual-machine-manager-node/libvirt-check.sh, which is run every 2 minutes by default:

# tail -f /var/log/univention/virtual-machine-manager-node-errors.log &
# virsh snapshot-revert ucs32-OX 321e77-oxed
libvirt-check.sh: libvirt does not response like expected. Restarting libvirt now.
Restarting UCS libvirt daemon: libvirtdkill: finish: univention-libvirt: (pid 24900) 6114s, normally down
.
error: End of file while reading data: Warning: Permanently added 'xen12.knut.univention.de,192.168.0.135' (RSA) to : Eingabe-/Ausgabefehler
Does libvirtd not respond at all during a long operation, or does it just take longer to respond? The UCR variable libvirt/check/timeout controls how many seconds the check will wait for a response before restarting libvirtd. The cron job itself can be deactivated by setting libvirt/check/interval to 0.
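
As a rough sketch of the workaround described above, the two variables could be adjusted with the "ucr" command line tool. The value 600 is only an illustrative assumption, not a recommended setting:

```shell
# Show the current watchdog settings (variable names as given in this comment).
ucr get libvirt/check/timeout
ucr get libvirt/check/interval

# Either give libvirtd more time to answer before it is restarted
# (600 seconds is a hypothetical value chosen for illustration) ...
ucr set libvirt/check/timeout=600

# ... or deactivate the cron job entirely.
ucr set libvirt/check/interval=0
```

Raising the timeout keeps the watchdog active but only helps if the long-running operation finishes within that window. Setting the interval to 0 removes the check altogether.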
(In reply to Erik Damrose from comment #2)
> Does libvirtd not respond at all while a long operation, or does it just
> take longer for a response?

libvirtd still works, but a "list" is stalled, probably because it iterates over all VMs. As the reverting VM is currently locked, the "list" stalls until that operation has finished.

Other commands like "hostname", "nodeinfo", and "nodecpustats" mostly return instantaneously while a "snapshot-revert" is still running, but not always: the first call returns immediately, the second call stalls.

libvirtd supports multiple concurrent calls, but some block others. (See /etc/libvirt/libvirtd.conf for max_clients, min_workers, max_workers, max_requests, max_client_requests, ...)

> The UCRV libvirt/check/timeout controls how many seconds the check will wait
> for a response before restarting libvirtd. The cronjob itself can be
> deactivated by setting libvirt/check/interval to 0

I know, but the default configuration breaks long-running operations, which is an annoying bug and breaks an important feature: snapshots.
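
The observation above can be reproduced with a small probe that runs a command under a time limit and reports whether it answered in time. This is only a sketch: the function name "probe" and the 5-second limit are assumptions made for illustration, and it is not the logic of libvirt-check.sh. It relies on "timeout" from GNU coreutils, which exits with status 124 when the limit is hit.

```shell
#!/bin/sh
# probe LIMIT COMMAND...: run COMMAND and give up after LIMIT seconds.
# Prints "ok", "stalled" (time limit hit) or "failed (<exit status>)".
probe() {
    limit=$1
    shift
    if timeout "$limit" "$@" >/dev/null 2>&1; then
        echo "ok: $*"
    else
        rc=$?
        if [ "$rc" -eq 124 ]; then
            # GNU timeout returns 124 when it had to stop the command.
            echo "stalled: $*"
        else
            echo "failed ($rc): $*"
        fi
    fi
}

# Only query libvirtd if virsh is installed on this host.
if command -v virsh >/dev/null 2>&1; then
    for cmd in hostname nodeinfo list; do
        probe 5 virsh "$cmd"
    done
fi
```

Running the loop repeatedly while a "snapshot-revert" is in progress should show which calls return and which ones block.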
(In reply to Philipp Hahn from comment #3)
> libvirtd still works, but a "list" is stalled because it probably iterates
> over all VMs. As the reverting VM is currently locked, the "list" stalls
> until that operation has finished.
>
> Other commands like "hostname", "nodeinfo", and "nodecpustats" mostly return
> instantaneously while a "snapshot-revert" is still running, but not always:
> The first call returns immediately, the second call stalls.

Could we skip the check/kill if we find such a blocking operation?
(In reply to Stefan Gohmann from comment #4)
> Could we skip the check/kill if we find such a blocking operation?

No: libvirtd serves multiple connections through TCP and UNIX domain sockets. The cron job is just one of many clients doing something, as is UVMMd (or another virsh) doing a revert (or another long-running operation).
- looking at open TCP/UNIX socket connections is not usable, as they would also be there when libvirtd is stuck in a futex and we want to restart it.
- doing a memory dump is not practical.

If I remember correctly, the locking in libvirt was changed between 0.9.12 and 1.2.6. An update might fix this issue of one operation blocking others, which should be checked first.
Again: libvirtd was killed while doing a "snapshot-delete":

$ date;ps wwwu `pidof libvirtd`
Mo 5. Jan 09:21:34 CET 2015
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     14625  5.8  0.0 466768 10764 ?        SLl  09:20   0:04 /usr/sbin/libvirtd -l
(In reply to Philipp Hahn from comment #6)
> Again: libvirtd was killed while doing a "snapshot-delete":

I inadvertently "reproduce" this every other time I restore some (2-4) snapshots on "skepp".
If I remember correctly, the virsh check was introduced because of stability problems with libvirt and Xen. Since we are no longer using Xen in UCS 4, we could remove the virsh check. Comments?
r66876 | Bug #35101 virt: Remove cron job to check/restart libvirtd
r66875 | Bug #35101 virt: Copyright 2016

Package: univention-virtual-machine-manager-node
Version: 4.0.1-2.89.201601191051
Branch: ucs_4.1-0
Scope: errata4.1-0

r66877 | Bug #35101 virt: Remove cron job to check/restart libvirtd YAML
 univention-virtual-machine-manager-node.yaml
OK: Package installation
OK: Package update
OK: cron job removed on update
OK: YAML, slightly adapted in r67070

Verified
<http://errata.software-univention.de/ucs/4.1/99.html>