35101 – virsh times out when restoring a snapshot

Status:	CLOSED FIXED

Alias:	None

Product:	UCS
Classification:	Unclassified
Component:	Virtualization - UVMM
Version:	UCS 4.1
Hardware:	Other Linux

Importance:	P2 normal
Target Milestone:	UCS 4.1-0-errata
Assignee:	Philipp Hahn
QA Contact:	Erik Damrose

Reported:	2014-06-11 10:43 CEST by Timo Denissen
Modified:	2016-02-04 13:52 CET
CC List:	4 users

See Also:	35354

Description Timo Denissen

2014-06-11 10:43:59 CEST

When restoring a snapshot of a stopped domain, the command "virsh snapshot-revert <domain name> <snapshot name>" fails after some time with the error "error: End of file while reading data: Eingabe-/Ausgabefehler" (German for "Input/output error"). Afterwards libvirtd crashes and must be restarted.

Restoring the same snapshot directly with "qemu-img snapshot -a <snapshot name> /path/to/image/file" works without problems.

Comment 1 Philipp Hahn

2014-06-25 11:15:14 CEST

During a long-running operation such as "snapshot-revert", libvirtd no longer responds in a timely manner and is therefore prone to being killed by /usr/lib/univention-virtual-machine-manager-node/libvirt-check.sh, which runs every 2 minutes by default:

# tail -f /var/log/univention/virtual-machine-manager-node-errors.log &
# virsh snapshot-revert ucs32-OX 321e77-oxed

libvirt-check.sh: libvirt does not response like expected. Restarting libvirt now.
Restarting UCS libvirt daemon: libvirtdkill: finish: univention-libvirt: (pid 24900) 6114s, normally down
.

error: End of file while reading data: Warning: Permanently added 'xen12.knut.univention.de,192.168.0.135' (RSA) to : Eingabe-/Ausgabefehler
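
The failure mode can be illustrated with a minimal sketch. It assumes the check wraps a single query in a time limit and treats a timeout as a hang; the real libvirt-check.sh is not shown in this report, so the function name `probe` and the use of coreutils `timeout` are illustrative only. `sleep` stands in for a libvirt call of the given duration.

```shell
#!/bin/sh
# Hypothetical watchdog probe (not the real libvirt-check.sh):
# run a command under a time limit and report whether it answered.
# coreutils `timeout` exits with status 124 when the limit is hit.
probe() {
    limit=$1; shift
    rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "no answer within ${limit}s: a naive watchdog would restart the daemon"
        return 1
    fi
    echo "answered in time (exit status $rc)"
    return 0
}

# A fast call passes the check.
probe 2 sleep 0
# A slow but healthy call is indistinguishable from a hang.
probe 1 sleep 3 || true
```

The second call shows the problem: a time limit alone cannot tell a daemon that is busy with a long revert from one that is really stuck.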

Comment 2 Erik Damrose

2014-06-25 11:33:43 CEST

Does libvirtd not respond at all during a long operation, or does it just take longer to respond?
The UCR variable (UCRV) libvirt/check/timeout controls how many seconds the check waits for a response before restarting libvirtd. The cron job itself can be deactivated by setting libvirt/check/interval to 0.
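
For reference, both variables can be inspected and changed with the `ucr` command-line tool. This is a sketch only: the interval value 0 is the one named above, while the timeout value 300 in the commented-out line is an arbitrary example, not a recommendation. The guard lets the snippet run harmlessly on a non-UCS system.

```shell
#!/bin/sh
# Inspect and adjust the libvirt check settings via Univention Config Registry.
if command -v ucr >/dev/null 2>&1; then
    # Show the current values (empty output means the variable is unset).
    ucr get libvirt/check/timeout
    ucr get libvirt/check/interval

    # Disable the periodic check entirely.
    ucr set libvirt/check/interval=0

    # Alternative: keep the check but wait longer for an answer.
    # 300 is an arbitrary example value.
    # ucr set libvirt/check/timeout=300
else
    echo "ucr not found: not a UCS system, nothing changed"
fi
```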

Comment 3 Philipp Hahn

2014-06-25 13:47:41 CEST

(In reply to Erik Damrose from comment #2)
> Does libvirtd not respond at all during a long operation, or does it just
> take longer to respond?

libvirtd still works, but a "list" is stalled because it probably iterates over all VMs. As the reverting VM is currently locked, the "list" stalls until that operation has finished.

Other commands like "hostname", "nodeinfo", and "nodecpustats" mostly return instantaneously while a "snapshot-revert" is still running, but not always: the first call returns immediately, the second call stalls.

libvirtd supports multiple concurrent calls, but some block others.
(See /etc/libvirt/libvirtd.conf for max_clients, min_workers, max_workers, max_requests, max_client_requests, ...)
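
One way to reproduce this observation is to time the read-only calls from a second terminal while the long operation is running. The following is a rough sketch: the helper name `check_cmd` and the 10-second limit are made up for illustration, and each subcommand is called twice because the second call is the one that tends to stall.

```shell
#!/bin/sh
# Time a few read-only virsh calls to see which ones stall.
# Run this in a second terminal while the long operation is in progress.
LIMIT=${LIMIT:-10}   # seconds per call; an arbitrary example value

check_cmd() {
    # Usage: check_cmd COMMAND [ARGS...]; prints one status line per call.
    rc=0
    timeout "$LIMIT" "$@" >/dev/null 2>&1 || rc=$?
    case $rc in
        0)   echo "ok       $*" ;;
        124) echo "STALLED  $* (no answer within ${LIMIT}s)" ;;
        *)   echo "failed   $* (exit status $rc)" ;;
    esac
}

if command -v virsh >/dev/null 2>&1; then
    for sub in hostname nodeinfo nodecpustats list; do
        # Call each subcommand twice in a row.
        for attempt in 1 2; do
            check_cmd virsh "$sub"
        done
    done
else
    echo "virsh not found: skipping"
fi
```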

> The UCRV libvirt/check/timeout controls how many seconds the check will wait
> for a response before restarting libvirtd. The cronjob itself can be
> deactivated by setting libvirt/check/interval to 0

I know, but the default configuration breaks long-running operations, which is an annoying bug and breaks an important feature: snapshots.

Comment 4 Stefan Gohmann

2014-06-26 07:02:34 CEST

(In reply to Philipp Hahn from comment #3)
> libvirtd still works, but a "list" is stalled because it probably iterates
> over all VMs. As the reverting VM is currently locked, the "list" stalls
> until that operation has finished.
> 
> Other commands like "hostname", "nodeinfo", and "nodecpustats" mostly return
> instantaneously while a "snapshot-revert" is still running, but not always:
> The first call returns immediately, the second call stalls.

Could we skip the check/kill if we find such a blocking operation?

Comment 5 Philipp Hahn

2014-06-26 08:52:31 CEST

(In reply to Stefan Gohmann from comment #4)
> Could we skip the check/kill if we find such a blocking operation?

No: libvirtd is serving multiple connections through TCP and UNIX domain sockets. The cron job is just one of many jobs doing something, as is UVMMd (or another virsh) doing a revert (or other long running operation).
- looking at open TCP/UNIX socket connections is not usable, as they would also be there when libvirtd is stuck in a futex and we want to restart it.
- doing a memory dump is not practical.

If I remember correctly, the locking in libvirt was changed between 0.9.12 and 1.2.6. An update might fix this issue of one operation blocking others; this should be checked first.

Comment 6 Philipp Hahn

2015-01-05 09:27:00 CET

Again: libvirtd was killed while doing a "snapshot-delete":

$ date;ps wwwu `pidof libvirtd`
Mo 5. Jan 09:21:34 CET 2015
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     14625  5.8  0.0 466768 10764 ?        SLl  09:20   0:04 /usr/sbin/libvirtd -l

Comment 7 Janek Walkenhorst

2015-01-06 12:05:10 CET

(In reply to Philipp Hahn from comment #6)
> Again: libvirtd was killed while doing a "snapshot-delete":
I inadvertently "reproduce" this every other time I restore some (2-4) snapshots on "skepp".

Comment 8 Stefan Gohmann

2016-01-18 07:27:35 CET

If I remember correctly, the virsh check was introduced for libvirt/Xen stability reasons. Since we are no longer using Xen in UCS 4, we could remove the virsh check.

Comments?

Comment 9 Philipp Hahn

2016-01-19 10:55:37 CET

r66876 | Bug #35101 virt: Remove cron job to check/restart libvirtd
r66875 | Bug #35101 virt: Copyright 2016

Package: univention-virtual-machine-manager-node
Version: 4.0.1-2.89.201601191051
Branch: ucs_4.1-0
Scope: errata4.1-0

r66877 | Bug #35101 virt: Remove cron job to check/restart libvirtd YAML
 univention-virtual-machine-manager-node.yaml

Comment 10 Erik Damrose

2016-01-29 15:54:16 CET

OK: Package installation
OK: Package update
OK: cronjob removed on update
OK: YAML slightly adapted in r67070
Verified

Comment 11 Janek Walkenhorst

2016-02-04 13:52:23 CET

<http://errata.software-univention.de/ucs/4.1/99.html>