Bug 47617 - Migration does not converge
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Virtualization - KVM
UCS 4.3
Other Linux
Importance: P5 normal
Target Milestone: UCS 4.3-3-errata
Assigned To: Philipp Hahn
QA Contact: Jürn Brodersen
Depends on: 48024
Blocks: 47934 48083 50092
Reported: 2018-08-20 13:00 CEST by Philipp Hahn
Modified: 2019-08-30 16:16 CEST
CC: 6 users

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 7: Crash: Bug causes crash or data loss
Who will be affected by this bug?: 1: Will affect a very few installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.200
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support: Yes
Flags outvoted (downgraded) after PO Review:
Ticket number: 2018082021000474, 2018090421000967, 2018100821000804
Bug group (optional):
Max CVSS v3 score:


Attachments

Description Philipp Hahn univentionstaff 2018-08-20 13:00:01 CEST
The live migration of a VM with 16 GiB RAM and 8 CPUs did not converge; UVMM returned an error which did not contain any specific details - probably just a timeout.

# virsh domjobinfo $DOM
Job type:         Unbounded
Time elapsed:     5499859      ms
Data processed:   530,696 GiB
Data remaining:   170,289 MiB
Data total:       12,009 GiB
Memory processed: 530,696 GiB
Memory remaining: 170,289 MiB
Memory total:     12,009 GiB
Dirty rate:       30740        pages/s
Iteration:        1695
Constant pages:   5252035
Normal pages:     138835932
Normal data:      529,617 GiB
Expected downtime: 1245         ms
Setup time:       167          ms
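A quick sanity check on these numbers shows why the migration keeps iterating (a sketch; assumes 4 KiB pages, as noted in comment 7):

```python
GIB = 1 << 30
PAGE_SIZE = 4096  # the stats count 4 KiB pages

# Values copied from the domjobinfo output above
time_elapsed_s = 5499859 / 1000.0      # "Time elapsed" (ms -> s)
data_processed = 530.696 * GIB         # "Data processed"
dirty_rate = 30740                     # "Dirty rate" in pages/s

net_bandwidth = data_processed / time_elapsed_s  # ~99 MiB/s sustained
change_rate = dirty_rate * PAGE_SIZE             # ~120 MiB/s newly dirtied

# The guest dirties memory faster than the migration copies it,
# so plain pre-copy migration can never finish.
print(change_rate > net_bandwidth)  # True
```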

This is a known (Qemu) problem: <https://wiki.qemu.org/Features/AutoconvergeLiveMigration>

Using --postcopy, the VM was migrated in ~1 minute:
# virsh migrate --domain $DOM --live --persistent --undefinesource --postcopy --postcopy-after-precopy --verbose qemu://$DEST/system

UVMMd should use --postcopy by default.
Alternative 1: --auto-converge --auto-converge-initial 20 --auto-converge-increment 10
Alternative 2: Add UCRV to make it configurable.
Comment 1 Erik Damrose univentionstaff 2018-10-08 16:51:24 CEST
Today the same issue occurred again. The migration hung for more than an hour. During that time, the load on the hypervisor was higher than usual due to the ongoing migration. Other production VMs reported performance issues such as "NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s".

After the migration was stopped with "virsh domjobabort $DOM" and restarted with --postcopy as in comment 0, the migration completed successfully and the hypervisor load normalized.
Comment 2 Valentin Heidelberger univentionstaff 2018-10-18 16:40:09 CEST
As stated in comment 1, this caused VMs to run into soft lockups.
Comment 3 Philipp Hahn univentionstaff 2018-10-18 17:39:10 CEST
(In reply to Valentin Heidelberger from comment #2)
> As stated in comment 1, this caused VMs to run into soft lockups.

A "soft lockup" is not a crash: it might look like one, but it can also be caused by other means and will heal itself once the underlying condition is removed.
This is very likely to happen if the network is saturated by multiple migrations running in parallel: this creates massive network I/O, and all other network I/O gets delayed, including NFS traffic.
If aborting all migrations fixes the soft lockup, it is not a crash.

If possible, please provide the captured kernel dmesg for further analysis.

<https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt>
Comment 4 Valentin Heidelberger univentionstaff 2018-10-19 08:52:22 CEST
(In reply to Philipp Hahn from comment #3)
> (In reply to Valentin Heidelberger from comment #2)
> > As stated in comment 1, this caused VMs to run into soft lockups.
> 
> A "soft lockup" is not a crash: it might look like one, but it can also be
> caused by other means and will heal itself once the underlying condition is
> removed.
> This is very likely to happen if the network is saturated by multiple
> migrations running in parallel: this creates massive network I/O, and all
> other network I/O gets delayed, including NFS traffic.
> If aborting all migrations fixes the soft lockup, it is not a crash.
> 
> If possible, please provide the captured kernel dmesg for further analysis.
> 
> <https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt>

Technically you're right; I just took the customer's perspective here, because the soft lockups caused a situation similar to a crash from their point of view. None of the services on the affected servers responded anymore, and SSH/VNC were unusable.
I'll gladly change "What type of bug is this?" back again if it is meant for the "technical description" rather than the "perceived effect".

Will also try and get the dmesg you asked for.
Comment 5 Philipp Hahn univentionstaff 2018-11-05 16:55:59 CET
<https://developers.redhat.com/blog/2015/03/24/live-migrating-qemu-kvm-virtual-machines/>
<https://wiki.qemu.org/Features/PostCopyLiveMigration>

Patch @ phahn/47617-uvmm-converge, but <patches/libvirt/4.3-0-0-ucs/3.0.0-4+deb9u3-errata4.3-0/0023-Allow-to-migrate-and-undefine-domains-with-snapshots.quilt> introduces a bug which breaks migration.
Comment 6 Philipp Hahn univentionstaff 2018-11-16 14:53:12 CET
repo_admin.py --cherrypick -r 4.3 --releasedest 4.3 --dest errata4.3-2 -p libvirt-python

r18349 | Bug #47617 libvirt-python: Fix missing migration hints
r18350 | Bug #47617 libvirt-python: Fix missing migration hints 2

Package: libvirt-python
Version: 3.0.0-2A~4.3.0.201811161443
Branch: ucs_4.3-0
Scope: errata4.3-2

[4.3-2] d5b2d0c0c4 Bug #47617: libvirt-python 3.0.0-2A~4.3.0.201811161443
 doc/errata/staging/libvirt-python.yaml | 10 ++++++++++
 1 file changed, 10 insertions(+)

[4.3-2] 865cf4e160 Bug #47617: libvirt 3.0.0-4+deb9u3A~4.3.0.201811081529
 doc/errata/staging/libvirt.yaml | 10 ++++++++++
 1 file changed, 10 insertions(+)
Comment 7 Philipp Hahn univentionstaff 2018-11-19 13:15:49 CET
FYI: Postcopy doesn't support large page sizes yet (pc.ram)

so the "pages/s" values refer to 4 KiB pages:
     'migration': {'data_processed': 1885047899L,
                   'data_remaining': 848883712L,
                   'data_total': 1091379200L,
                   'disk_processed': 0L,
                   'disk_remaining': 0L,
                   'disk_total': 0L,
                   'downtime': 802L,
                   'memory_constant': 6402L,
                   'memory_dirty_rate': 28399L,
                   'memory_iteration': 3L,
                   'memory_normal': 459305L,
                   'memory_normal_bytes': 1881313280L,
                   'memory_processed': 1885047899L,
                   'memory_remaining': 848883712L,
                   'memory_total': 1091379200L,
                   'msg': 'Migration in progress since 0:00:16.478, iteration 3',
                   'setup_time': 7L,
                   'time_elapsed': 16478L,
                   'type': 2},

net_bandwidth := data_processed * 1000 / time_elapsed  # time_elapsed is in [ms] -> [B / s]
change_rate := memory_dirty_rate << 12  # 4 KiB pages -> [B / s]
if change_rate > net_bandwidth: "The current memory change rate exceeds the network bandwidth; under these conditions the migration will probably not converge until the VM is throttled, paused, or switched to post-copy migration mode."
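The check can be sketched as a small Python function (hypothetical helper name; fed with the libvirt stats dict shown above, with time_elapsed in milliseconds):

```python
PAGE_SHIFT = 12  # the dirty-rate stats count 4 KiB pages

def migration_will_converge(stats):
    """Heuristic from the pseudocode above: pre-copy can only converge
    if the observed network bandwidth exceeds the memory change rate."""
    # time_elapsed is reported in milliseconds
    net_bandwidth = stats['data_processed'] * 1000.0 / stats['time_elapsed']  # B/s
    change_rate = stats['memory_dirty_rate'] << PAGE_SHIFT                    # B/s
    return change_rate <= net_bandwidth

# With the sample stats above, the change rate (~116 MB/s) slightly
# exceeds the measured bandwidth (~114 MB/s):
stats = {'data_processed': 1885047899, 'time_elapsed': 16478,
         'memory_dirty_rate': 28399}
print(migration_will_converge(stats))  # False
```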

My code is checked into branch "phahn/47617-uvmm-converge" and waiting for front-end changes (Bug #48083). The code will be merged into 4.3-2 after first QA.
Comment 8 Jürn Brodersen univentionstaff 2018-11-23 13:48:13 CET
After a migration, the qemu process on the target host throws an error when loading snapshots.

I think it's this bug:
https://patchwork.kernel.org/patch/10062159/

But I haven't tested it yet.
Comment 9 Philipp Hahn univentionstaff 2018-11-26 16:23:43 CET
r18361 | Bug #47617: Fix snapshot revert after postcopy migration

Package: qemu
Version: 1:2.8+dfsg-6+deb9u5A~4.3.0.201811261055
Branch: ucs_4.3-0
Scope: errata4.3-2

[4.3-2] 44dcfc6510 Bug #47617: qemu 1:2.8+dfsg-6+deb9u5A~4.3.0.201811261055
 doc/errata/staging/qemu.yaml | 10 ++++++++++
 1 file changed, 10 insertions(+)

https://git.knut.univention.de/univention/ucs/tree/phahn/47617-uvmm-converge
Comment 10 Jürn Brodersen univentionstaff 2018-12-11 12:46:43 CET
Looks all good to me :)

Ready to merge
Comment 11 Philipp Hahn univentionstaff 2018-12-11 14:17:16 CET
[4.3-3] f99717e953 Bug #47617, Bug #47741, Bug #36661, Bug #48199, Bug #48024, Bug #45498, Bug #35196

Package: univention-virtual-machine-manager-daemon
Version: 7.0.0-17A~4.3.0.201812111413
Branch: ucs_4.3-0
Scope: errata4.3-3

[4.3-3] 582fb65dce Bug #47617: univention-virtual-machine-manager-daemon 7.0.0-17A~4.3.0.201812111413
 doc/errata/staging/univention-virtual-machine-manager-daemon.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)