Univention Bugzilla – Bug 47617
Migration does not converge
Last modified: 2019-08-30 16:16:51 CEST
The live migration of a VM with 16 GiB RAM and 8 CPUs did not converge; UVMM returned an error which did not contain any specific details - probably just a timeout.

# virsh domjobinfo $DOM
Job type:         Unbounded
Time elapsed:     5499859 ms
Data processed:   530,696 GiB
Data remaining:   170,289 MiB
Data total:       12,009 GiB
Memory processed: 530,696 GiB
Memory remaining: 170,289 MiB
Memory total:     12,009 GiB
Dirty rate:       30740 pages/s
Iteration:        1695
Constant pages:   5252035
Normal pages:     138835932
Normal data:      529,617 GiB
Expected downtime: 1245 ms
Setup time:       167 ms

This is a known (Qemu) problem: <https://wiki.qemu.org/Features/AutoconvergeLiveMigration>

Using --postcopy the VM was migrated in ~1 minute:

# virsh migrate --domain $DOM --live --persistent --undefinesource --postcopy --postcopy-after-precopy --verbose qemu://$DEST/system

UVMMd should use --postcopy by default.
Alternative 1: --auto-converge --auto-converge-initial 20 --auto-converge-increment 10
Alternative 2: Add UCRV to make it configurable.
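The job statistics above already suggest why the migration kept iterating. A rough worked calculation, as a sketch only: it assumes the comma in the virsh output is a locale decimal separator and that "pages/s" refers to 4 KiB pages. The helper name `parse_size` is made up for this illustration.

```python
import re

UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def parse_size(text):
    """Convert a virsh size such as '530,696 GiB' (decimal comma) to bytes."""
    match = re.fullmatch(r"\s*([\d.,]+)\s*(B|KiB|MiB|GiB)\s*", text)
    if not match:
        raise ValueError("unrecognised size: %r" % (text,))
    number, unit = match.groups()
    return float(number.replace(",", ".")) * UNITS[unit]


# Values copied from the domjobinfo output in this report.
processed_bytes = parse_size("530,696 GiB")
elapsed_seconds = 5499859 / 1000.0
dirty_pages_per_second = 30740

# Average transfer rate over the whole job versus the rate at which
# the guest dirties memory.
throughput = processed_bytes / elapsed_seconds
dirty_rate = dirty_pages_per_second * PAGE_SIZE

print("average throughput: %.1f MiB/s" % (throughput / 2 ** 20))
print("dirty rate:         %.1f MiB/s" % (dirty_rate / 2 ** 20))
print("dirty rate exceeds throughput:", dirty_rate > throughput)
```

Under these assumptions the guest dirties memory faster than the average rate at which it was transferred, which is consistent with pre-copy never reaching its final iteration.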
Today the same issue occurred again. The migration was hanging for more than 1 hour. During that time the load on the hypervisor was higher than usual due to the ongoing migration. Other production VMs reported performance issues like
  NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s
After the migration was stopped with
  virsh domjobabort $DOM
and restarted with --postcopy as in comment 0, the migration completed successfully and the hypervisor load returned to normal.
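The abort-and-restart workaround described above could be scripted roughly as follows. This is a sketch that only builds and prints the command lines, using the virsh options quoted in this report; the function names and the example domain and host are placeholders, and nothing is executed.

```python
import shlex


def abort_command(domain):
    """Command that aborts the currently running migration job of a domain."""
    return ["virsh", "domjobabort", domain]


def postcopy_migrate_command(domain, dest_host):
    """Command that restarts the migration with post-copy enabled."""
    return [
        "virsh", "migrate",
        "--domain", domain,
        "--live", "--persistent", "--undefinesource",
        "--postcopy", "--postcopy-after-precopy",
        "--verbose",
        "qemu://%s/system" % dest_host,
    ]


def render(command):
    """Quote a command list for display in a shell."""
    return " ".join(shlex.quote(part) for part in command)


# Placeholder values, not taken from the report.
for command in (abort_command("example-vm"),
                postcopy_migrate_command("example-vm", "dest.example.org")):
    print(render(command))
```

To actually run the commands, each list could be passed to `subprocess.check_call`; that step is left out here on purpose.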
As stated in comment 1, this caused VMs to run into soft lockups.
(In reply to Valentin Heidelberger from comment #2)
> As stated in comment 1, this caused VMs to run into soft lockups.

A "soft lockup" is not a crash: It might look like one, but it can also be caused by other means and will heal itself once the underlying condition is removed. This is very likely to happen if the network is saturated by multiple migrations running in parallel: they create massive network IO and all other network IO gets delayed, including NFS traffic and others. If aborting all migrations fixes the soft lockup, it is not a crash.

If possible, please provide the captured kernel dmesg for further analysis.

<https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt>
(In reply to Philipp Hahn from comment #3)
> (In reply to Valentin Heidelberger from comment #2)
> > As stated in comment 1, this caused VMs to run into soft lockups.
>
> A "soft lockup" is not a crash: It might look like some, but it can also be
> caused by other means as will heal itself if the underlying condition is
> removed:
> This is very likely to happen if the network is saturated by multiple
> migrations happening in parallel: This will create massive network IO and
> all other network IO will get delayed, including NFS traffic and others.
> If aborting all migrations fixes the soft lockup, it is not a crash.
>
> If possible please provide the captured kernel dmesg for further analysation.
>
> <https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt>

Technically you're right, I just took the perspective of the customer here, because the soft lockups caused a situation similar to a crash from their point of view. All the services on the affected servers didn't respond anymore, SSH/VNC were useless.
I'll gladly change "What type of bug is this?" back again, if it is meant for the "technical description" instead of the "perceived effect".
Will also try and get the dmesg you asked for.
<https://developers.redhat.com/blog/2015/03/24/live-migrating-qemu-kvm-virtual-machines/>
<https://wiki.qemu.org/Features/PostCopyLiveMigration>

Patch @ phahn/47617-uvmm-converge, but <patches/libvirt/4.3-0-0-ucs/3.0.0-4+deb9u3-errata4.3-0/0023-Allow-to-migrate-and-undefine-domains-with-snapshots.quilt> introduces a bug which breaks migration.
repo_admin.py --cherrypick -r 4.3 --releasedest 4.3 --dest errata4.3-2 -p libvirt-python
r18349 | Bug #47617 libvirt-python: Fix missing migration hints
r18350 | Bug #47617 libvirt-python: Fix missing migration hints 2

Package: libvirt-python
Version: 3.0.0-2A~4.3.0.201811161443
Branch: ucs_4.3-0
Scope: errata4.3-2

[4.3-2] d5b2d0c0c4 Bug #47617: libvirt-python 3.0.0-2A~4.3.0.201811161443
 doc/errata/staging/libvirt-python.yaml | 10 ++++++++++
 1 file changed, 10 insertions(+)

[4.3-2] 865cf4e160 Bug #47617: libvirt 3.0.0-4+deb9u3A~4.3.0.201811081529
 doc/errata/staging/libvirt.yaml | 10 ++++++++++
 1 file changed, 10 insertions(+)
FYI: Postcopy doesn't support large page sizes yet (pc.ram), so the "pages/s" is in 4 KiB pages:

'migration': {'data_processed': 1885047899L,
              'data_remaining': 848883712L,
              'data_total': 1091379200L,
              'disk_processed': 0L,
              'disk_remaining': 0L,
              'disk_total': 0L,
              'downtime': 802L,
              'memory_constant': 6402L,
              'memory_dirty_rate': 28399L,
              'memory_iteration': 3L,
              'memory_normal': 459305L,
              'memory_normal_bytes': 1881313280L,
              'memory_processed': 1885047899L,
              'memory_remaining': 848883712L,
              'memory_total': 1091379200L,
              'msg': 'Migration in progress since 0:00:16.478, iteration 3',
              'setup_time': 7L,
              'time_elapsed': 16478L,
              'type': 2},

net_bandwidth := data_processed * 1000 / time_elapsed  # [B / s], time_elapsed is in ms
change_rate := memory_dirty_rate << 12  # [B / s]
if change_rate > net_bandwidth:
  "The current memory change rate exceeds the network bandwidth; under these conditions migration will probably not converge until either the VM is throttled or paused or switched to post-copy migration mode."

My code is checked into branch "phahn/47617-uvmm-converge" and is waiting for front-end changes (Bug #48083).
The code will be merged into 4.3-2 after first QA.
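The check outlined above can be tried against the posted statistics. The following is a minimal sketch, not the code from the branch: the function name `exceeds_bandwidth` is made up for this illustration, and it assumes that `time_elapsed` is in milliseconds (as the "0:00:16.478" message suggests) and that pages are 4 KiB.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, i.e. pages << 12 gives bytes


def exceeds_bandwidth(stats):
    """Return True if memory is dirtied faster than it has been transferred."""
    elapsed_ms = stats["time_elapsed"]
    if elapsed_ms <= 0:
        # No meaningful rate can be computed right after the job started.
        return False
    net_bandwidth = stats["data_processed"] * 1000.0 / elapsed_ms  # B/s
    change_rate = stats["memory_dirty_rate"] << PAGE_SHIFT  # B/s
    return change_rate > net_bandwidth


# Relevant fields copied from the job statistics in this comment.
stats = {
    "data_processed": 1885047899,
    "memory_dirty_rate": 28399,
    "time_elapsed": 16478,
}

if exceeds_bandwidth(stats):
    print("The current memory change rate exceeds the network bandwidth.")
```

With these numbers the dirty rate is roughly 116 MB/s against an average transfer rate of roughly 114 MB/s, so the warning would be shown for this sample.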
The qemu process on the target host, after a migration, throws an error on loading snapshots. I think it's this bug: https://patchwork.kernel.org/patch/10062159/ But I didn't test it yet.
r18361 | Bug #47617: Fix snapshot revert after postcopy migration

Package: qemu
Version: 1:2.8+dfsg-6+deb9u5A~4.3.0.201811261055
Branch: ucs_4.3-0
Scope: errata4.3-2

[4.3-2] 44dcfc6510 Bug #47617: qemu 1:2.8+dfsg-6+deb9u5A~4.3.0.201811261055
 doc/errata/staging/qemu.yaml | 10 ++++++++++
 1 file changed, 10 insertions(+)

https://git.knut.univention.de/univention/ucs/tree/phahn/47617-uvmm-converge
Looks all good to me :) Ready to merge
[4.3-3] f99717e953 Bug #47617, Bug #47741, Bug #36661, Bug #48199, Bug #48024, Bug #45498, Bug #35196

Package: univention-virtual-machine-manager-daemon
Version: 7.0.0-17A~4.3.0.201812111413
Branch: ucs_4.3-0
Scope: errata4.3-3

[4.3-3] 582fb65dce Bug #47617: univention-virtual-machine-manager-daemon 7.0.0-17A~4.3.0.201812111413
 doc/errata/staging/univention-virtual-machine-manager-daemon.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
<http://errata.software-univention.de/ucs/4.3/382.html> <http://errata.software-univention.de/ucs/4.3/383.html> <http://errata.software-univention.de/ucs/4.3/384.html> <http://errata.software-univention.de/ucs/4.3/385.html>