Univention Bugzilla – Bug 48536
Snapshot restore breaks <cpu model="host-model">
Last modified: 2019-05-09 20:18:09 CEST
When snapshots are created and restored the <cpu model="host-model"/> placeholder is replaced by the concrete host-model; this persists even when the VM is powered down, which will prevent migration to hosts with older CPUs. This can be fixed manually by running "virsh edit" and replacing the concrete model again with the placeholder. (This again will break with the next snapshot) Even worse libvirt seems to contain a bug where for an powered off VM a flag is mentioned as required in the concrete CPU model with QEMU is not able to provide/emulate. The VM then cannot be started even on the same host where the snapshot war created. (that does not happen with all CPU models: 'krus' works, 'lattjo' seems to be broken.)
*** Bug 48535 has been marked as a duplicate of this bug. ***
This also affects suspend-2-disk (AKA managed-save) @ lattjo: error: the CPU is incompatible with host CPU: Host CPU does not provide required features: cmt
echo "<domain type='kvm'> <name>ucs</name> <cpu mode='host-model'/> <memory unit='GiB'>1</memory> <os> <type arch='x86_64' machine='q35'>hvm</type> </os> <devices> <disk type='file' device='disk'> <driver name='qemu' type='qcow2'/> <source file='/var/lib/libvirt/images/ucs.qcow2'/> <target dev='sda' bus='scsi'/> </disk> </devices> </domain>" >ucs.xml virsh define --file ucs.xml virsh pool-create-as --name default --type dir --target /var/lib/libvirt/images virsh vol-create-as --pool default --name ucs.qcow2 --capacity 10G --format qcow2 virsh snapshot-create-as --domain ucs --name snap virsh snapshot-dumpxml --domain ucs --snapshotname snap # BUG virsh snapshot-revert --domain ucs --snapshotname snap virsh dumpxml --domain ucs virsh snapshot-delete --domain ucs --snapshotname snap virsh undefine --domain ucs --remove-all-storage
Created attachment 9958 [details] git bisect log git bisect --old=v3.0.0 --new=v5.0.0 log $ git dc 06f75ff2cb292e2658b4f2f6949c700550006272 v3.8.0-rc1~42 $ git log -1 06f75ff2cb292e2658b4f2f6949c700550006272 commit 06f75ff2cb292e2658b4f2f6949c700550006272 Author: Jiri Denemark <jdenemar@redhat.com> Date: Fri Jun 30 16:55:20 2017 +0200 qemu: Don't update CPU when formatting live def Since commit v2.2.0-199-g7ce711a30e libvirt stores an updated guest CPU in domain's live definition and there's no need to update it every time we want to format the definition. The commit itself tried to address this in qemuDomainFormatXML, but forgot to fix qemuDomainDefFormatLive. Not to mention that masking a previously set flag is only acceptable if the flag was set by a public API user. Internally, libvirt should have never set the flag in the first place. https://bugzilla.redhat.com/show_bug.cgi?id=1485022 Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
<https://libvirt.org/news.html> v3.0.0..v5.0.0 Other git commits related and of interest: >>06f75ff2cb292e2658b4f2f6949c700550006272 qemu: Don't update CPU when formatting live def > 7e874326a3eca1233017ab91774d845b99869af1 qemu: Use correct host model for updating guest cpu > 4fd179f518ad6765492b5dd66a9d8fbda8465ef5 cpu_conf: Drop updateCPU from virCPUDefFormat > > ee68bb391efb684341edb6286a1278631167f08c qemu: Don't update CPU when checking ABI stability > 063b2b8788db204d0ac2e1d367fd0c5eb9b81451 qemu: Add qemuDomainCheckABIStability > a0912df3faac8b58725f45b8cf928990999585a3 qemu: Add qemuDomainMigratableDefCheckABIStability > 0810d4f5e0d77354cdf7ad56e1e8b7748bdb1501 qemu: Introduce qemuDomainDefFromXML helper > > 8e34f478137c2a6b5e57e382729bd2776b042301 qemu: Use updated CPU when starting QEMU if possible > 8c19fbf452372c7c1becf987497525ee8b61abbf qemu: Store updated CPU in save cookie >>356a2161e293eaab57448a8e68f489e0841efe19 qemu: Report the original CPU in migratable xml > ea6d89831198ae1854dca15b2bcf081df2d866a2 qemu: Remember CPU def from domain start > 215476b64226275ac090ca0e957bfca76ba1549e qemu: Implement virSaveCookie object and callbacks > 957cd268a98043fe061227a8e829c622a6fb4b86 conf: Pass xmlopt to virDomainSnapshotDefFormat > 4f0aeed8713b679dd024542f4823efcef1473f4f virDomainXMLOption: Introduce virDomainABIStabilityDomain 06f75ff2cb292e2658b4f2f6949c700550006272 does not apply cleanly because of 356a2161e293eaab57448a8e68f489e0841efe19, but that is trivial to fix. But it now might break in other subtile ways like the ABI checking for migration might no longer work correctly. repo_admin.py --cherrypick -r 4.3 -s errata4.3-2 --releasedest 4.3 --dest errata4.3-3 -p libvirt r18539 | Bug #48536 libvirt: Fix CPU host model Package: libvirt Version: 3.0.0-4+deb9u3A~4.3.0.201904041619 Branch: ucs_4.3-0 Scope: errata4.3-3 [4.3-3] df3db897ea Bug #48536: libvirt 3.0.0-4+deb9u3A~4.3.0.201904041619 doc/errata/staging/libvirt.yaml | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) TODO: After QA the package must also be copied to errata4.4-0 and the YAML must be cloned.
QA: virsh define ucs43b # mode=host-model snapshot-create-as ucs43b stopped snapshot-dumpxml ucs43b stopped # mode=host-model start --paused ucs43b snapshot-create-as ucs43b running snapshot-dumpxml ucs43b running # mode=custom dumpxml ucs43b # mode=custom dumpxml --inactive ucs43b # mode=custom destroy ucs43b dumpxml ucs43b # mode=custrom snapshot-revert ucs43b running dumpxml ucs43b # mode=custom dumpxml --inactive ucs43b # mode=custom snapshot-revert ucs43b stopped dumpxml ucs43b start --paused ucs43b dumpxml ucs43b # mode=custom managedsave ucs43b start ucs43b dumpxml ucs43b # mode=custom destroy ucs43b dumpxml ucs43b # mode=host-model
QA: virsh migrate --live --persistent --undefinesource --verbose --domain ucs43b --desturi qemu://dc0.phahn.dev/system ... dumpxml ucs43b # mode=custom dumpxml --inactive ucs43b # mode=host-model destroy ucs43b dumpxml ucs43b # mode=host-model
Everything works fine from console, however... Reopen: Taking a snapshot via UMC module takes a long time, returns after about 2 minutes with a message 'snapshot created successfully'.. which is too long. uvmm-daemon.log shows a traceback: 2019-04-08 17:40:10,887 - uvmmd.node - WARNING - Failed to query job status a51d9903-808c-46bd-94e2-e26be44e60b0: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainSnapshotCreateXML) 2019-04-08 17:40:22,004 - uvmmd.node - WARNING - Failed to query job status a51d9903-808c-46bd-94e2-e26be44e60b0: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainSnapshotCreateXML) 2019-04-08 17:40:40,911 - uvmmd.node.migration - ERROR - qemu://master.mydomain.intranet/system: Exception handling callback Traceback (most recent call last): File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1023, in migration_event domStat.migration_status(dom.jobStats()) File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1370, in jobStats if ret is None: raise libvirtError ('virDomainGetJobStats() failed', dom=self) libvirtError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainSnapshotCreateXML) umc-module-uvmm.log shows: 08.04.19 17:21:32.938 MODULE ( PROCESS ) : Fehler beim Sichern der Domäne "08be1baa-dfbf-44c4-8c1e-85e5d565b9f7": Unable to read from monitor: Die Verbindung wurde vom Kommunikationspartner zurückgesetzt hypervisor: xen10, 10.200.29.189
(In reply to Erik Damrose from comment #8) > Reopen: Taking a snapshot via UMC module takes a long time, returns after > about 2 minutes with a message 'snapshot created successfully'.. which is > too long. uvmm-daemon.log shows a traceback: > > 2019-04-08 17:40:22,004 - uvmmd.node - WARNING - Failed to query job status > a51d9903-808c-46bd-94e2-e26be44e60b0: Timed out during operation: cannot > acquire state change lock (held by remoteDispatchDomainSnapshotCreateXML) > 2019-04-08 17:40:40,911 - uvmmd.node.migration - ERROR - > qemu://master.mydomain.intranet/system: Exception handling callback > Traceback (most recent call last): > File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1023, in > migration_event > domStat.migration_status(dom.jobStats()) > File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1370, in jobStats > if ret is None: raise libvirtError ('virDomainGetJobStats() failed', > dom=self) > libvirtError: Timed out during operation: cannot acquire state change lock > (held by remoteDispatchDomainSnapshotCreateXML) That issue about "cannot acquire state change lock" is not new and is not caused by the change related to this bug. For testing purpose I down-graded the version of libvirt* again to "3.0.0-4+deb9u3A~4.3.0.201811081529" on your host and the exception also happens there. The "state change lock" is a "design-flaw" of the Qemu driver in libvirt: libvirtd communicates with each running qemu process through a pre process UNIX socket. While libvirtd and qemu themselves are multi-threaded, the communication channel is not and must be serialized: therefore the lock. The command used to create the snapshot for a running VM is synchronous by nature, so the UNIX socket communication channel remains blocked while the snapshot is being created. During that time UVMMd runs its periodic update thread each 15 seconds and asks libvirtd for updates of all VMs, so also for the VM being snapshotted. This requires libvirtd to communicate again with the same qemu processes, which is not possible as the snapshot command still has not finished and still holds the "state change" lock (indicated by "remoteDispatchDomainSnapshotCreateXML" above). To not delay fetching information about the different domains too much libvirtd uses a timeout and aborts the query, it the query takes too long. The time required to create the snapshot mostly depends on the speed of the HD of the host and the RAM size of the VM. It you turn off the VM, start it again and stay in GRUB, the RAM is mostly empty and creating a snapshot takes < 5s, irrespective if I use the old or my new version of libvirt*. With cache="unsafe" it took ~15s for the VM while in UMC-USS. With cache="none" it took ~75s. So to me this looks like it is IO related and not a new issue introduced with this bug. Maybe we need to revisit Bug #30504? > umc-module-uvmm.log shows: 08.04.19 17:21:32.938 MODULE ( PROCESS ) : > Fehler beim Sichern der Domäne "08be1baa-dfbf-44c4-8c1e-85e5d565b9f7": > Unable to read from monitor: Die Verbindung wurde vom Kommunikationspartner > zurückgesetzt This 2nd error is way before you created the snapshot - probably caused by Bug #49403.
I agree, it looks like a timing issue caused by the slow server hardware. As mentioned before: OK: No wrong cpu model replacement when creating snapshots OK: yaml Verified
<http://errata.software-univention.de/ucs/4.3/496.html>