Bug 48098 - Check RAM over-commitment before live migration
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Virtualization - UVMM
UCS 4.3
Other Linux
Importance: P5 normal
Target Milestone: UCS 4.3-3-errata
Assigned To: Philipp Hahn
QA Contact: Jürn Brodersen
Depends on:
Blocks: 48901 49940
Reported: 2018-11-06 10:36 CET by Valentin Heidelberger
Modified: 2019-07-31 08:26 CEST (History)
9 users

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 7: Crash: Bug causes crash or data loss
Who will be affected by this bug?: 1: Will affect a very few installed domains
How will those affected feel about the bug?: 4: A User would return the product
User Pain: 0.160
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support: Yes
Flags outvoted (downgraded) after PO Review:
Ticket number: 2018102521000291, 2019030521000474
Bug group (optional): External feedback, Large environments, Usability
Max CVSS v3 score:



Description Valentin Heidelberger univentionstaff 2018-11-06 10:36:17 CET
UVMM should detect CPU/RAM/... over-commitment when a live migration is triggered and issue a warning.
A customer regularly has problems with over-commitment because they run quite a lot of KVM servers, and manually checking the resources of both migration partners quickly becomes tedious.
Comment 1 Philipp Hahn univentionstaff 2018-11-06 10:52:09 CET
RAM is a hard limit: Actually starting too many VMs might crash the host system, which leads to the loss of all runtime state of all running VMs.

CPU is a soft limit: Less a problem, but all VMs are penalized by getting less CPU time.
Comment 2 Stefan Gohmann univentionstaff 2018-12-11 13:19:00 CET
A suggestion for the implementation:

- If we reach the RAM over-commitment limit while creating an instance, we should show a warning.

- If we reach the RAM over-commitment limit while starting an instance, we should abort with an error message.

- If we reach the RAM over-commitment limit while migrating a stopped instance, we should show a warning.

- If we reach the RAM over-commitment limit while performing a live migration, we should abort with an error message before starting the migration.
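The four cases above could be sketched as a small policy helper (all names here are hypothetical illustrations, not UVMM's actual implementation):

```python
# Sketch of the suggested warn-vs-abort policy. "Overcommitted" means the
# domain's memory no longer fits into the host's physical RAM minus the
# configured reserve. All memory values are in bytes.

class OvercommitError(Exception):
    pass

# action -> True if overcommitment must abort the operation
HARD_ACTIONS = {
    'create': False,           # warn only
    'start': True,             # abort with error
    'migrate_offline': False,  # warn only
    'migrate_live': True,      # abort before the migration starts
}

def check_overcommit(action, domain_mem, used_mem, phys_mem, reserved):
    """Return None if RAM suffices, a warning string for soft actions,
    or raise OvercommitError for hard actions."""
    available = phys_mem - reserved - used_mem
    if domain_mem <= available:
        return None  # enough free RAM, nothing to report
    msg = 'RAM overcommitment: need %d B, only %d B available' % (
        domain_mem, max(available, 0))
    if HARD_ACTIONS[action]:
        raise OvercommitError(msg)
    return msg
```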
Comment 3 Christian Völker univentionstaff 2019-03-05 15:26:41 CET
A simple warning will not prevent high I/O on the KVM servers during migration, at least not without including a huge buffer...

When a VM with a large amount of virtual memory is migrated to a KVM server, all of its memory pages appear "active" to the target KVM server.

In practice this resulted in heavy swapping on the target host even though memory overcommitment was only around +1% of RAM.

The only way to prevent such conditions seems to be to disallow memory overcommitment in the Linux kernel (see https://www.kernel.org/doc/Documentation/vm/overcommit-accounting), combined with disallowing swap usage for all VMs by locking their memory (see https://libvirt.org/formatdomain.html#elementsMemoryBacking) and setting a hard_limit through virsh (see ftp://libvirt.org/libvirt/virshcmdref/html/sect-memtune.html).
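On the libvirt side, the combination described above could look like the following domain XML fragment (values are illustrative; see the linked documentation for the exact semantics). The kernel side would additionally need vm.overcommit_memory=2 via sysctl to disable overcommit accounting:

```xml
<domain type='kvm'>
  <!-- lock all guest pages into host RAM so they cannot be swapped out -->
  <memoryBacking>
    <locked/>
  </memoryBacking>
  <!-- hard upper bound on the host memory the guest may consume -->
  <memtune>
    <hard_limit unit='KiB'>4194304</hard_limit>
  </memtune>
</domain>
```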
Comment 4 Philipp Hahn univentionstaff 2019-03-06 15:17:07 CET
[4.3-3] a1ab5d999d Bug #48098 UVMM: RAM overcommitment

Package: univention-virtual-machine-manager-daemon
Version: 7.0.0-20A~4.3.0.201903061506
Branch: ucs_4.3-0
Scope: errata4.3-3

[4.3-3] 3948ec7d05 Bug #48098 UVMM: RAM overcommitment YAML
 .../staging/univention-virtual-machine-manager-daemon.yaml    | 11 +++++++++++
 1 file changed, 11 insertions(+)

TODO: Merge to 4.4-0 after QA

FYI: UCRV 'uvmm/overcommit/reserved' is a *global* limit, which applies to *all* hosts and only needs to be defined on the host where UVMMd runs. By default it is unset, so the hard limit on start/migration is not enforced. To enforce it, set the UCRV to at least "1" byte. For simplicity that amount is subtracted from the node's physical memory.

FYI: Some useful commands for QA:
  uvmm query "qemu://$(hostname -f)/system" | grep Mem
    curMem := sum_{running VMs}(currently configured memory)¹
    maxMem := sum_{all VMs}(maximum configured memory)
    phyMem := physical memory of host - reserve
    ¹: think memory ballooning
  virsh nodememstats
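The quantities above could be computed roughly like this (a sketch with made-up domain data; UVMMd obtains the real numbers from libvirt):

```python
# Sketch of the memory accounting described above.
# Each domain is (running, current_memory, maximum_memory), all in bytes.
GiB = 1024 ** 3

def memory_stats(domains, phys_mem, reserved):
    # curMem: currently configured memory of running VMs only
    # (ballooning means this can be below the maximum)
    cur_mem = sum(cur for running, cur, _ in domains if running)
    # maxMem: maximum configured memory of all VMs, running or not
    max_mem = sum(mx for _, _, mx in domains)
    # phyMem: physical memory minus the global reserve
    phy_mem = phys_mem - reserved
    return cur_mem, max_mem, phy_mem

domains = [
    (True, 2 * GiB, 4 * GiB),   # running VM, ballooned down to 2 GiB
    (False, 1 * GiB, 1 * GiB),  # stopped VM: counts only towards maxMem
]
cur, mx, phy = memory_stats(domains, 16 * GiB, 1)
```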

QA: Feel free to use xen1(=dc0) and xen16
Comment 5 Philipp Hahn univentionstaff 2019-03-06 18:42:26 CET
[4.3-3] 8d17cf6118 Bug #48098 UVMM: Add RAM overcommit protection - spelling fixes
 .../univention-virtual-machine-manager-daemon/debian/changelog    | 6 ++++++
 .../univention-virtual-machine-manager-daemon/src/de.po           | 8 ++++----
 .../univention-virtual-machine-manager-daemon/umc/js/de.po        | 8 ++++----
 3 files changed, 14 insertions(+), 8 deletions(-)

Package: univention-virtual-machine-manager-daemon
Version: 7.0.0-21A~4.3.0.201903061839
Branch: ucs_4.3-0
Scope: errata4.3-3

[4.3-3] 836f614a3e Bug #48098: univention-virtual-machine-manager-daemon 7.0.0-21A~4.3.0.201903061839
 doc/errata/staging/univention-virtual-machine-manager-daemon.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 6 Jürn Brodersen univentionstaff 2019-03-07 11:53:28 CET
Offline migrated domains never appear on the target host.

2019-03-07 11:43:25,158 - uvmmd.node - INFO - Domain backuped to 8cc06646-245f-4a26-a662-05a2041a42ed..xml.save.
2019-03-07 11:43:25,160 - uvmmd.node - INFO - Starting migration of domain "8cc06646-245f-4a26-a662-05a2041a42ed" to host "qemu://slave3.univention.intranet/system" with flags 618
2019-03-07 11:43:25,284 - uvmmd.node.livecycle - ERROR - qemu://slave3.univention.intranet/system: Exception handling callback
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 925, in livecycle_event
    domStat = Domain(dom, node=self)
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 316, in __init__
    self.pd.os_type = domain.OSType()
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 455, in OSType
    if ret is None: raise libvirtError ('virDomainGetOSType() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid '8cc06646-245f-4a26-a662-05a2041a42ed' (ucs4-64-foo)
2019-03-07 11:43:25,285 - uvmmd.node - INFO - Finished migration of domain "8cc06646-245f-4a26-a662-05a2041a42ed" to host "qemu://slave3.univention.intranet/system" with flags 618
Comment 7 Philipp Hahn univentionstaff 2019-03-07 12:22:16 CET
(In reply to Jürn Brodersen from comment #6)
> Offline migrated host never appear on the target host.
...
> 2019-03-07 11:43:25,160 - uvmmd.node - INFO - Starting migration of domain "XXX" to host "XXX" with flags 618
> 2019-03-07 11:43:25,284 - uvmmd.node.livecycle - ERROR -
...
> libvirtError: Domain not found: no domain with matching uuid 'XXX' (XXX)

Your slave3 has an old version of the libvirt* packages, which is still affected by Bug #47617 comment 5.
Comment 8 Jürn Brodersen univentionstaff 2019-03-07 15:11:40 CET
OK:
uvmm/overcommit/reserved=0
  Only a warning is shown for new VMs with more RAM than physically available -> OK

uvmm/overcommit/reserved=1073741824
  reserved RAM is subtracted from total RAM in the tree -> OK
  No overcommit error if enough ram is available -> OK
  overcommit error before live migration -> OK
  overcommit error before offline migration -> OK
  overcommit error before vm start -> OK

Profiles work -> OK
Wizard works -> OK
YAML -> OK
Comment 9 Erik Damrose univentionstaff 2019-03-07 16:33:29 CET
<http://errata.software-univention.de/ucs/4.3/452.html>