Bug 35101 - virsh times out when restoring a snapshot
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Virtualization - UVMM
Version: UCS 4.1
Hardware: Other Linux
Importance: P2 normal
Target Milestone: UCS 4.1-0-errata
Assigned To: Philipp Hahn
QA Contact: Erik Damrose
Reported: 2014-06-11 10:43 CEST by Timo Denissen
Modified: 2016-02-04 13:52 CET
Description Timo Denissen univentionstaff 2014-06-11 10:43:59 CEST
When restoring a snapshot of a stopped domain, the command "virsh snapshot-restore <snapshot name> <domain name>" fails after some time with the error "error: End of file while reading data: Eingabe-/Ausgabefehler" (Input/output error). Afterwards libvirtd crashes and must be restarted.

When restoring the snapshot with "qemu-img snapshot -a "<snapshot name>" /path/to/image/file" the snapshot restores without problems.
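
The workaround above could be wrapped in a small shell helper. This is only a sketch: the function name revert_offline and the QEMU_IMG override variable are made up for illustration and are not part of UCS, while "qemu-img snapshot -l" (list) and "-a" (apply) are the regular qemu-img options. It assumes the domain is shut off, as in this report.

```shell
#!/bin/sh
# Hypothetical helper, not part of UCS: revert an internal snapshot offline.
# QEMU_IMG may be overridden (e.g. QEMU_IMG=echo) for a dry run.
QEMU_IMG="${QEMU_IMG:-qemu-img}"

revert_offline() {
    snapshot="$1"
    image="$2"
    if [ -z "$snapshot" ] || [ -z "$image" ]; then
        echo "usage: revert_offline SNAPSHOT IMAGE" >&2
        return 2
    fi
    # List the snapshots stored in the image, then apply the chosen one.
    "$QEMU_IMG" snapshot -l "$image" || return 1
    "$QEMU_IMG" snapshot -a "$snapshot" "$image"
}
```

With QEMU_IMG=echo the helper only prints the two qemu-img invocations, which is a cheap way to check the arguments before touching an image.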
Comment 1 Philipp Hahn univentionstaff 2014-06-25 11:15:14 CEST
During a long-running operation such as "snapshot-revert", libvirtd no longer responds in a timely manner and is thus prone to being killed by /usr/lib/univention-virtual-machine-manager-node/libvirt-check.sh, which is run every 2 minutes by default:

# tail -f /var/log/univention/virtual-machine-manager-node-errors.log &
# virsh snapshot-revert ucs32-OX 321e77-oxed

libvirt-check.sh: libvirt does not response like expected. Restarting libvirt now.
Restarting UCS libvirt daemon: libvirtdkill: finish: univention-libvirt: (pid 24900) 6114s, normally down
.

error: End of file while reading data: Warning: Permanently added 'xen12.knut.univention.de,192.168.0.135' (RSA) to : Eingabe-/Ausgabefehler
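
The failure mode can be illustrated with a minimal probe. This is a sketch under stated assumptions, not the actual libvirt-check.sh: the function name probe_responds is hypothetical, and it relies on coreutils timeout(1), which exits with status 124 when the limit is reached. It shows that a probe which merely waits behind a long-running operation is indistinguishable from one talking to a hung daemon.

```shell
#!/bin/sh
# Hypothetical sketch of a responsiveness probe; NOT the real libvirt-check.sh.
# Usage: probe_responds TIMEOUT_SECONDS COMMAND [ARGS...]
# Returns 0 if the command finished within the timeout, 1 otherwise.
probe_responds() {
    limit="$1"
    shift
    # coreutils timeout(1) exits with status 124 when the limit is hit.
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    # Any other exit status means the command did return in time.
    [ "$status" -ne 124 ]
}

# Illustrative use (the 30 second limit is an arbitrary example value):
# if ! probe_responds 30 virsh list; then
#     echo "libvirt did not respond in time" >&2
# fi
```

A watchdog built this way only sees "no answer within the limit"; it cannot tell a blocked call from a dead daemon, which is the problem observed above.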
Comment 2 Erik Damrose univentionstaff 2014-06-25 11:33:43 CEST
Does libvirtd not respond at all during a long operation, or does it just take longer to respond?
The UCRV libvirt/check/timeout controls how many seconds the check will wait for a response before restarting libvirtd. The cronjob itself can be deactivated by setting libvirt/check/interval to 0.
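
As a sketch, the two variables could be set with ucr as follows. The function names and the UCR override variable are assumptions for illustration, and any timeout value passed in is an example rather than a recommended setting.

```shell
#!/bin/sh
# Sketch only: UCR may be overridden (e.g. UCR=echo) for a dry run.
UCR="${UCR:-ucr}"

# Give libvirtd more time to answer before the check restarts it.
set_check_timeout() {
    "$UCR" set "libvirt/check/timeout=$1"
}

# Deactivate the periodic check (interval 0 disables the cron job).
disable_check() {
    "$UCR" set "libvirt/check/interval=0"
}
```

For example, "set_check_timeout 600" followed by a long revert, or "disable_check" before maintenance work involving many snapshots.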
Comment 3 Philipp Hahn univentionstaff 2014-06-25 13:47:41 CEST
(In reply to Erik Damrose from comment #2)
> Does libvirtd not respond at all during a long operation, or does it just
> take longer to respond?

libvirtd still works, but a "list" is stalled because it probably iterates over all VMs. As the reverting VM is currently locked, the "list" stalls until that operation has finished.

Other commands like "hostname", "nodeinfo", and "nodecpustats" mostly return instantaneously while a "snapshot-revert" is still running, but not always: the first call returns immediately, the second call stalls.

libvirtd supports multiple concurrent calls, but some block others.
(See /etc/libvirt/libvirtd.conf for max_clients, min_workers, max_workers, max_requests, max_client_requests, ...)
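
To see which of these limits are set on a node, something like the following could be used. It is a sketch: the function name show_libvirtd_limits is hypothetical, and the pattern also matches commented-out lines so that shipped defaults remain visible.

```shell
#!/bin/sh
# Sketch: print the concurrency-related settings from a libvirtd.conf,
# including commented-out defaults. The function name is hypothetical.
show_libvirtd_limits() {
    conf="${1:-/etc/libvirt/libvirtd.conf}"
    grep -E '^[[:space:]]*#?[[:space:]]*(max_clients|min_workers|max_workers|max_requests|max_client_requests)[[:space:]]*=' "$conf"
}
```

Called without an argument it reads /etc/libvirt/libvirtd.conf; a different path can be passed to inspect a copy.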

> The UCRV libvirt/check/timeout controls how many seconds the check will wait
> for a response before restarting libvirtd. The cronjob itself can be
> deactivated by setting libvirt/check/interval to 0

I know, but the default configuration breaks long-running operations, which is an annoying bug and breaks an important feature: snapshots.
Comment 4 Stefan Gohmann univentionstaff 2014-06-26 07:02:34 CEST
(In reply to Philipp Hahn from comment #3)
> libvirtd still works, but a "list" is stalled because it probably iterates
> over all VMs. As the reverting VM is currently locked, the "list" stalls
> until that operation has finished.
> 
> Other commands like "hostname", "nodeinfo", and "nodecpustats" mostly return
> instantaneously while a "snapshot-revert" is still running, but not always:
> the first call returns immediately, the second call stalls.

Could we skip the check/kill if we find such a blocking operation?
Comment 5 Philipp Hahn univentionstaff 2014-06-26 08:52:31 CEST
(In reply to Stefan Gohmann from comment #4)
> Could we skip the check/kill if we find such a blocking operation?

No: libvirtd is serving multiple connections through TCP and UNIX domain sockets. The cron job is just one of many clients doing something, as is UVMMd (or another virsh) doing a revert (or another long-running operation).
- looking at open TCP/UNIX socket connections is not usable, as they would also be there when libvirtd is stuck in a futex and we want to restart it.
- doing a memory dump is not practical.

If I remember correctly, the locking in libvirt was changed between 0.9.12 and 1.2.6. An update might fix this issue of one operation blocking others; this should be checked first.
Comment 6 Philipp Hahn univentionstaff 2015-01-05 09:27:00 CET
Again: libvirtd was killed while doing a "snapshot-delete":

$ date;ps wwwu `pidof libvirtd`
Mo 5. Jan 09:21:34 CET 2015
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     14625  5.8  0.0 466768 10764 ?        SLl  09:20   0:04 /usr/sbin/libvirtd -l
Comment 7 Janek Walkenhorst univentionstaff 2015-01-06 12:05:10 CET
(In reply to Philipp Hahn from comment #6)
> Again: libvirtd was killed while doing a "snapshot-delete":
I inadvertently "reproduce" this every other time I restore some (2-4) snapshots on "skepp".
Comment 8 Stefan Gohmann univentionstaff 2016-01-18 07:27:35 CET
If I remember correctly, the virsh check was introduced because of libvirt / Xen stability problems. Since we are no longer using Xen in UCS 4, we could remove the virsh check.

Comments?
Comment 9 Philipp Hahn univentionstaff 2016-01-19 10:55:37 CET
r66876 | Bug #35101 virt: Remove cron job to check/restart libvirtd
r66875 | Bug #35101 virt: Copyright 2016

Package: univention-virtual-machine-manager-node
Version: 4.0.1-2.89.201601191051
Branch: ucs_4.1-0
Scope: errata4.1-0

r66877 | Bug #35101 virt: Remove cron job to check/restart libvirtd YAML
 univention-virtual-machine-manager-node.yaml
Comment 10 Erik Damrose univentionstaff 2016-01-29 15:54:16 CET
OK: Package installation
OK: Package update
OK: cronjob removed on update
OK: YAML slightly adapted in r67070
Verified
Comment 11 Janek Walkenhorst univentionstaff 2016-02-04 13:52:23 CET
<http://errata.software-univention.de/ucs/4.1/99.html>