Bug 48536 - Snapshot restore breaks <cpu model="host-model">
Snapshot restore breaks <cpu model="host-model">
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Virtualization - KVM
UCS 4.3
Other Linux
: P5 normal (vote)
: UCS 4.3-4-errata
Assigned To: Philipp Hahn
Erik Damrose
:
: 48535 (view as bug list)
Depends on: 21386 49440
Blocks: 49425
  Show dependency treegraph
 
Reported: 2019-01-25 14:23 CET by Philipp Hahn
Modified: 2019-05-09 20:18 CEST (History)
4 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 3: Will affect average number of installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.429
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments
git bisect log (2.40 KB, text/plain)
2019-04-04 13:10 CEST, Philipp Hahn
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Philipp Hahn univentionstaff 2019-01-25 14:23:35 CET
When snapshots are created and restored the <cpu model="host-model"/> placeholder is replaced by the concrete host-model; this persists even when the VM is powered down, which will prevent migration to hosts with older CPUs.
This can be fixed manually by running "virsh edit" and replacing the concrete model again with the placeholder. (This again will break with the next snapshot)

Even worse libvirt seems to contain a bug where for an powered off VM a flag is mentioned as required in the concrete CPU model with QEMU is not able to provide/emulate. The VM then cannot be started even on the same host where the snapshot war created. (that does not happen with all CPU models: 'krus' works, 'lattjo' seems to be broken.)
Comment 1 Philipp Hahn univentionstaff 2019-01-25 14:23:53 CET
*** Bug 48535 has been marked as a duplicate of this bug. ***
Comment 2 Philipp Hahn univentionstaff 2019-03-08 09:31:22 CET
This also affects suspend-2-disk (AKA managed-save) @ lattjo:

error: the CPU is incompatible with host CPU: Host CPU does not provide required features: cmt
Comment 3 Philipp Hahn univentionstaff 2019-03-26 14:00:47 CET
echo "<domain type='kvm'>
 <name>ucs</name>
 <cpu mode='host-model'/>
 <memory unit='GiB'>1</memory>
 <os>
  <type arch='x86_64' machine='q35'>hvm</type>
 </os>
 <devices>
  <disk type='file' device='disk'>
   <driver name='qemu' type='qcow2'/>
   <source file='/var/lib/libvirt/images/ucs.qcow2'/>
   <target dev='sda' bus='scsi'/>
  </disk>
 </devices>
</domain>" >ucs.xml
virsh define --file ucs.xml
virsh pool-create-as --name default --type dir --target /var/lib/libvirt/images
virsh vol-create-as --pool default --name ucs.qcow2 --capacity 10G --format qcow2
virsh snapshot-create-as --domain ucs --name snap
virsh snapshot-dumpxml --domain ucs --snapshotname snap  # BUG
virsh snapshot-revert --domain ucs --snapshotname snap
virsh dumpxml --domain ucs
virsh snapshot-delete --domain ucs --snapshotname snap
virsh undefine --domain ucs --remove-all-storage
Comment 4 Philipp Hahn univentionstaff 2019-04-04 13:10:55 CEST
Created attachment 9958 [details]
git bisect log

git bisect --old=v3.0.0 --new=v5.0.0 log

$ git dc 06f75ff2cb292e2658b4f2f6949c700550006272
v3.8.0-rc1~42

$ git log -1 06f75ff2cb292e2658b4f2f6949c700550006272
commit 06f75ff2cb292e2658b4f2f6949c700550006272
Author: Jiri Denemark <jdenemar@redhat.com>
Date:   Fri Jun 30 16:55:20 2017 +0200

    qemu: Don't update CPU when formatting live def
    
    Since commit v2.2.0-199-g7ce711a30e libvirt stores an updated guest CPU
    in domain's live definition and there's no need to update it every time
    we want to format the definition. The commit itself tried to address
    this in qemuDomainFormatXML, but forgot to fix qemuDomainDefFormatLive.
    Not to mention that masking a previously set flag is only acceptable if
    the flag was set by a public API user. Internally, libvirt should have
    never set the flag in the first place.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1485022
    
    Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
Comment 5 Philipp Hahn univentionstaff 2019-04-04 16:32:22 CEST
<https://libvirt.org/news.html> v3.0.0..v5.0.0

Other git commits related and of interest:
>>06f75ff2cb292e2658b4f2f6949c700550006272 qemu: Don't update CPU when formatting live def
> 7e874326a3eca1233017ab91774d845b99869af1 qemu: Use correct host model for updating guest cpu
> 4fd179f518ad6765492b5dd66a9d8fbda8465ef5 cpu_conf: Drop updateCPU from virCPUDefFormat
>
> ee68bb391efb684341edb6286a1278631167f08c qemu: Don't update CPU when checking ABI stability
> 063b2b8788db204d0ac2e1d367fd0c5eb9b81451 qemu: Add qemuDomainCheckABIStability
> a0912df3faac8b58725f45b8cf928990999585a3 qemu: Add qemuDomainMigratableDefCheckABIStability
> 0810d4f5e0d77354cdf7ad56e1e8b7748bdb1501 qemu: Introduce qemuDomainDefFromXML helper
>
> 8e34f478137c2a6b5e57e382729bd2776b042301 qemu: Use updated CPU when starting QEMU if possible
> 8c19fbf452372c7c1becf987497525ee8b61abbf qemu: Store updated CPU in save cookie
>>356a2161e293eaab57448a8e68f489e0841efe19 qemu: Report the original CPU in migratable xml
> ea6d89831198ae1854dca15b2bcf081df2d866a2 qemu: Remember CPU def from domain start
> 215476b64226275ac090ca0e957bfca76ba1549e qemu: Implement virSaveCookie object and callbacks
> 957cd268a98043fe061227a8e829c622a6fb4b86 conf: Pass xmlopt to virDomainSnapshotDefFormat
> 4f0aeed8713b679dd024542f4823efcef1473f4f virDomainXMLOption: Introduce virDomainABIStabilityDomain

06f75ff2cb292e2658b4f2f6949c700550006272 does not apply cleanly because of 356a2161e293eaab57448a8e68f489e0841efe19, but that is trivial to fix. But it now might break in other subtile ways like the ABI checking for migration might no longer work correctly.

repo_admin.py --cherrypick -r 4.3 -s errata4.3-2 --releasedest 4.3 --dest errata4.3-3 -p libvirt

r18539 | Bug #48536 libvirt: Fix CPU host model

Package: libvirt
Version: 3.0.0-4+deb9u3A~4.3.0.201904041619
Branch: ucs_4.3-0
Scope: errata4.3-3

[4.3-3] df3db897ea Bug #48536: libvirt 3.0.0-4+deb9u3A~4.3.0.201904041619
 doc/errata/staging/libvirt.yaml | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

TODO: After QA the package must also be copied to errata4.4-0 and the YAML must be cloned.
Comment 6 Philipp Hahn univentionstaff 2019-04-05 18:31:27 CEST
QA: virsh
 define ucs43b  # mode=host-model
 snapshot-create-as ucs43b stopped
 snapshot-dumpxml ucs43b stopped # mode=host-model

 start --paused ucs43b
 snapshot-create-as ucs43b running
 snapshot-dumpxml ucs43b running # mode=custom
 dumpxml ucs43b # mode=custom
 dumpxml --inactive ucs43b # mode=custom

 destroy ucs43b
 dumpxml ucs43b # mode=custrom

 snapshot-revert ucs43b running
 dumpxml ucs43b # mode=custom
 dumpxml --inactive ucs43b # mode=custom

 snapshot-revert ucs43b stopped
 dumpxml ucs43b

 start --paused ucs43b
 dumpxml ucs43b # mode=custom
 managedsave ucs43b
 start ucs43b
 dumpxml ucs43b # mode=custom
 destroy ucs43b
 dumpxml ucs43b # mode=host-model
Comment 7 Philipp Hahn univentionstaff 2019-04-05 18:35:42 CEST
QA: virsh
 migrate --live  --persistent --undefinesource --verbose --domain ucs43b --desturi qemu://dc0.phahn.dev/system
...
 dumpxml ucs43b  # mode=custom
 dumpxml --inactive ucs43b  # mode=host-model
 destroy ucs43b
 dumpxml ucs43b  # mode=host-model
Comment 8 Erik Damrose univentionstaff 2019-04-08 17:55:15 CEST
Everything works fine from console, however...
Reopen: Taking a snapshot via UMC module takes a long time, returns after about 2 minutes with a message 'snapshot created successfully'.. which is too long. uvmm-daemon.log shows a traceback:

2019-04-08 17:40:10,887 - uvmmd.node - WARNING - Failed to query job status a51d9903-808c-46bd-94e2-e26be44e60b0: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainSnapshotCreateXML)
2019-04-08 17:40:22,004 - uvmmd.node - WARNING - Failed to query job status a51d9903-808c-46bd-94e2-e26be44e60b0: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainSnapshotCreateXML)
2019-04-08 17:40:40,911 - uvmmd.node.migration - ERROR - qemu://master.mydomain.intranet/system: Exception handling callback
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1023, in migration_event
    domStat.migration_status(dom.jobStats())
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1370, in jobStats
    if ret is None: raise libvirtError ('virDomainGetJobStats() failed', dom=self)
libvirtError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainSnapshotCreateXML)


umc-module-uvmm.log shows: 08.04.19 17:21:32.938  MODULE      ( PROCESS ) : Fehler beim Sichern der Domäne "08be1baa-dfbf-44c4-8c1e-85e5d565b9f7": Unable to read from monitor: Die Verbindung wurde vom Kommunikationspartner zurückgesetzt

hypervisor: xen10, 10.200.29.189
Comment 9 Philipp Hahn univentionstaff 2019-05-02 12:07:10 CEST
(In reply to Erik Damrose from comment #8)
> Reopen: Taking a snapshot via UMC module takes a long time, returns after
> about 2 minutes with a message 'snapshot created successfully'.. which is
> too long. uvmm-daemon.log shows a traceback:
> 
> 2019-04-08 17:40:22,004 - uvmmd.node - WARNING - Failed to query job status
> a51d9903-808c-46bd-94e2-e26be44e60b0: Timed out during operation: cannot
> acquire state change lock (held by remoteDispatchDomainSnapshotCreateXML)
> 2019-04-08 17:40:40,911 - uvmmd.node.migration - ERROR -
> qemu://master.mydomain.intranet/system: Exception handling callback
> Traceback (most recent call last):
>   File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1023, in
> migration_event
>     domStat.migration_status(dom.jobStats())
>   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1370, in jobStats
>     if ret is None: raise libvirtError ('virDomainGetJobStats() failed',
> dom=self)
> libvirtError: Timed out during operation: cannot acquire state change lock
> (held by remoteDispatchDomainSnapshotCreateXML)

That issue about "cannot acquire state change lock" is not new and is not caused by the change related to this bug. For testing purpose I down-graded the version of libvirt* again to "3.0.0-4+deb9u3A~4.3.0.201811081529" on your host and the exception also happens there.

The "state change lock" is a "design-flaw" of the Qemu driver in libvirt: libvirtd communicates with each running qemu process through a pre process UNIX socket. While libvirtd and qemu themselves are multi-threaded, the communication channel is not and must be serialized: therefore the lock.
The command used to create the snapshot for a running VM is synchronous by nature, so the UNIX socket communication channel remains blocked while the snapshot is being created.
During that time UVMMd runs its periodic update thread each 15 seconds and asks libvirtd for updates of all VMs, so also for the VM being snapshotted. This requires libvirtd to communicate again with the same qemu processes, which is not possible as the snapshot command still has not finished and still holds the "state change" lock (indicated by "remoteDispatchDomainSnapshotCreateXML" above).
To not delay fetching information about the different domains too much libvirtd uses a timeout and aborts the query, it the query takes too long.

The time required to create the snapshot mostly depends on the speed of the HD of the host and the RAM size of the VM. It you turn off the VM, start it again and stay in GRUB, the RAM is mostly empty and creating a snapshot takes < 5s, irrespective if I use the old or my new version of libvirt*.
With cache="unsafe" it took ~15s for the VM while in UMC-USS.
With cache="none" it took ~75s.

So to me this looks like it is IO related and not a new issue introduced with this bug.
Maybe we need to revisit Bug #30504?


> umc-module-uvmm.log shows: 08.04.19 17:21:32.938  MODULE      ( PROCESS ) :
> Fehler beim Sichern der Domäne "08be1baa-dfbf-44c4-8c1e-85e5d565b9f7":
> Unable to read from monitor: Die Verbindung wurde vom Kommunikationspartner
> zurückgesetzt

This 2nd error is way before you created the snapshot - probably caused by Bug #49403.
Comment 10 Erik Damrose univentionstaff 2019-05-07 16:24:39 CEST
I agree, it looks like a timing issue caused by the slow server hardware.

As mentioned before:
OK: No wrong cpu model replacement when creating snapshots
OK: yaml
Verified
Comment 11 Arvid Requate univentionstaff 2019-05-08 13:39:57 CEST
<http://errata.software-univention.de/ucs/4.3/496.html>