Bug 49548 - [4.3] Live Migration Causes Memory Thrashing on Target Host
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Virtualization - KVM
Version: UCS 4.4
Hardware: Other Linux
Importance: P5 normal
Target Milestone: UCS 4.3-4-errata
Assigned To: Philipp Hahn
QA Contact: Erik Damrose
Depends on:
Blocks: 49573 49574 54615
Reported: 2019-05-24 14:07 CEST by Christian Völker
Modified: 2022-03-29 17:00 CEST (History)
CC: 3 users

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 1: Will affect a very few installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.143
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support: Yes
Flags outvoted (downgraded) after PO Review:
Ticket number: 2018052521000327
Bug group (optional):
Max CVSS v3 score:


Description Christian Völker univentionstaff 2019-05-24 14:07:12 CEST
A customer migrated a running VM with a large amount of RAM (100 GiB) to a target host that apparently had enough free physical RAM available.

However, the target host became unusable due to memory thrashing, and all VMs on it suddenly used a lot of swap. All VMs were very slow because their memory had been moved out to swap.

A "top" command still showed a good amount of free memory on the target host, yet all VMs were running at least partially from swap.
Comment 1 Philipp Hahn univentionstaff 2019-05-25 12:32:41 CEST
r18583 | Bug #49548 libvirt: Re-enable NUMA support

Package: libvirt
Version: 3.0.0-4+deb9u3A~4.3.0.201905250642
Branch: ucs_4.3-0
Scope: errata4.3-4

[4.3-4] aa47b58462 Bug #49548: numad 0.5+20150602-5
 doc/errata/staging/libvirt.yaml | 10 ++++++++++
 doc/errata/staging/numad.yaml   | 12 ++++++++++++
 2 files changed, 22 insertions(+)

Today's big CPU systems are ccNUMA (cache-coherent non-uniform memory access) systems: logically they consist of multiple nodes crammed into one CPU or case. To the inexperienced user they look like one big system with many CPUs and lots of RAM, but to the Linux operating system they look more like a distributed system:

Example "lattjo":
Logical view:
> $ nproc 
> 24
> $ free -h
>               total        used        free      shared  buff/cache   available
> Mem:           125G         65G         59G         18M        1,1G         59G
                                          ^^^
> Swap:           19G          0B         19G
Physical view:
> $ numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
> node 0 size: 64314 MB
> node 0 free: 29086 MB
               ^^^^^^^^
> node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
> node 1 size: 64509 MB
> node 1 free: 31572 MB
               ^^^^^^^^
> node distances:
> node   0   1 
>   0:  10  21 
>   1:  21  10

Logically 59 GiB are free, but they are distributed 50%/50% over both nodes.
When a process allocates memory, the Linux kernel optimizes for performance and tries to allocate _local_ memory, that is, memory from the node where that process is _currently_ running.
Linux prefers local memory because accessing memory attached to a remote node is more costly and takes longer: the distance matrix above shows that accessing remote memory is estimated to take about twice as long as accessing local memory.

UVMM does no CPU pinning by default, so the allocating thread might run on any of the 24 cores. That also changes over time, as the thread allocating memory might run on a different CPU each time.
As Qemu does _not_ allocate all memory of the VM at once, the resident size of the VM grows steadily, and each piece of memory is taken from the node where the thread happens to run at that moment. So the usage of one Qemu process ends up statistically distributed roughly evenly over all nodes:
> # numastat -c qemu-system-x86 
> Per-node process memory usage (in MBs)
> PID              Node 0 Node 1 Total
> ---------------  ------ ------ -----
> 21990 (qemu-syst  33929  31906 65835
> 24486 (qemu-syst     38   2126  2164
> ---------------  ------ ------ -----
> Total             33967  34033 67999

The behavior is different when migrating a VM or loading it from a saved state: then the memory is allocated _en bloc_, i.e. Qemu tries to allocate the full 64 GiB at once. The Linux kernel again tries to allocate it from _one_ node. As that much memory is not available on a single node, the kernel pushes other users out to swap to free it, leading to the observed behavior: it looks as if another 64 GiB are free (combined), but _not_ on a single node!
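
One way to observe this on the target host while such a migration runs (sketch; assumes "numastat" from the "numactl" package is installed) is to watch the per-node memory counters:

  # MemFree of one node drops much faster than on the others
  # while the incoming VM's memory is being allocated:
  watch -n1 'numastat -m | grep -E "MemTotal|MemFree|MemUsed"'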

The Linux kernel can be told to use a different allocation strategy by starting a process through the "numactl" wrapper:
  numactl --interleave=all ...
This tells the Linux kernel to spread all allocations over all nodes by default.
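
To check which policy a running process actually uses, the per-mapping policy can be read from /proc (sketch; PID 21990 is just the example Qemu PID from the numastat output above):

  # The second column of numa_maps is the memory policy of each mapping,
  # e.g. "default", "interleave:0-1" or "bind:0":
  awk '{print $2}' /proc/21990/numa_maps | sort | uniq -c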

libvirt has built-in NUMA support, but it was partly disabled: for UCS-4.2 we back-ported libvirt from Debian-Stretch, but not "numad". numad is a daemon running in the background which tries to consolidate a process's memory on a single NUMA node to optimize for local memory access, and it provides other features needed by libvirt: "libvirt" needs "numad" to enable the "interleave" mode after it has been configured in the VM XML like this:

  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <memory mode='interleave' placement='auto'/>
  </numatune>

The important thing here is "mode=interleave" as libvirt defaults to "mode=strict", which leads to the observed (bad) behavior.

Currently there is no way to change the default of libvirtd except by changing "libvirtd.service" to use "ExecStart=/usr/bin/numactl --interleave=all /usr/sbin/libvirtd -l".
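
A slightly less invasive variant of that workaround (sketch, untested; same binary path and options as above) would be a systemd drop-in instead of editing the unit file itself:

  systemctl edit libvirtd.service
  # then add in the editor:
  #   [Service]
  #   ExecStart=
  #   ExecStart=/usr/bin/numactl --interleave=all /usr/sbin/libvirtd -l
  systemctl restart libvirtd.service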

I also found no Linux sysctl to change the default to interleave.

We can change UVMM to generate the XML by default, but that still leaves the task of updating all existing VMs to include these statements as well.
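
For the existing VMs the tuning could probably be applied per domain without hand-editing each XML, roughly like this (sketch; the nodeset 0-1 matches the two host nodes from the example above and must be adapted per host):

  # Set interleaved memory allocation in the persistent config of all
  # defined domains; takes effect on the next (re)start of each domain.
  for dom in $(virsh list --all --name); do
      virsh numatune "$dom" --mode interleave --nodeset 0-1 --config
  done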


While researching this I found the following two excellent articles describing the problem (and the solution) - they are worth reading:
* <https://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/>
* <http://www.admin-magazine.com/Archive/2014/20/Best-practices-for-KVM-on-NUMA-servers/>
Comment 2 Philipp Hahn univentionstaff 2019-05-28 17:59:18 CEST
[4.3-4] be50c81216 Bug #49548 uvmm: Enable NUMA memory interleave by default
 .../conffiles/etc/systemd/system/libvirtd.service.d/ucr.conf   | 10 ++++++++++
 .../univention-virtual-machine-manager-node/debian/changelog   |  6 ++++++
 .../univention-virtual-machine-manager-node/debian/control     |  3 ++-
 ...virtual-machine-manager-node-kvm.univention-config-registry |  4 ++++
 ...chine-manager-node-kvm.univention-config-registry-variables |  6 ++++++
 5 files changed, 28 insertions(+), 1 deletion(-)

[4.4-0] 2d92b98729 Bug #49548 uvmm: Enable NUMA memory interleave by default
 .../conffiles/etc/systemd/system/libvirtd.service.d/ucr.conf   | 10 ++++++++++
 .../univention-virtual-machine-manager-node/debian/changelog   |  6 ++++++
 .../univention-virtual-machine-manager-node/debian/control     |  3 ++-
 ...virtual-machine-manager-node-kvm.univention-config-registry |  4 ++++
 ...chine-manager-node-kvm.univention-config-registry-variables |  6 ++++++
 5 files changed, 28 insertions(+), 1 deletion(-)

This changes libvirtd.service to use "numactl --interleave=all" by default. It can be disabled by setting UCRV "libvirt/numa/policy/memory=no".
Other policies can still be configured per domain using <https://libvirt.org/formatdomain.html#elementsNUMATuning>.
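
For reference, switching the new default off on a virtualization host would look roughly like this (sketch; the UCR template rewrites the libvirtd.service drop-in, and a restart picks it up):

  ucr set libvirt/numa/policy/memory=no
  systemctl daemon-reload
  systemctl restart libvirtd.service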

Package: univention-virtual-machine-manager-node
Version: 6.0.0-3A~4.3.0.201905281729
Branch: ucs_4.3-0
Scope: errata4.3-4

Package: univention-virtual-machine-manager-node
Version: 7.0.1-2A~4.4.0.201905281723
Branch: ucs_4.4-0
Scope: errata4.4-0

[4.4-0] c9843158b2 Bug #49548: numad 0.5+20150602-5
 doc/errata/staging/libvirt.yaml | 13 +++++++++++++
 doc/errata/staging/numad.yaml   | 13 +++++++++++++
 2 files changed, 26 insertions(+)

[4.4-0] e2cea49212 Bug #49548: univention-virtual-machine-manager-node 7.0.1-2A~4.4.0.201905281723
 .../staging/univention-virtual-machine-manager-node.yaml       | 10 ++++++++++
 1 file changed, 10 insertions(+)

[4.3-4] 651ec30aeb Bug #49548: univention-virtual-machine-manager-node 6.0.0-3A~4.3.0.201905281729
 doc/errata/staging/univention-virtual-machine-manager-node.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

[4.3-4] 51604d2d32 Bug #49548: univention-virtual-machine-manager-node 6.0.0-3A~4.3.0.201905281729
 .../staging/univention-virtual-machine-manager-node.yaml       | 10 ++++++++++
 1 file changed, 10 insertions(+)


TODO: After QA, clone this bug to 4.4-0 and fix the YAML in 4.4-0 to use the cloned Bug#
Comment 3 Philipp Hahn univentionstaff 2019-05-29 13:15:01 CEST
[4.3-4] 84494763d4 Bug #49548 uvmm: Restart libvirtd on package upgrade
 .../univention-virtual-machine-manager-node/debian/changelog  |  6 ++++++
 .../univention-virtual-machine-manager-node-kvm.postinst      | 11 +++++++++++
 2 files changed, 17 insertions(+)

Package: univention-virtual-machine-manager-node
Version: 6.0.0-4A~4.3.0.201905291312
Branch: ucs_4.3-0
Scope: errata4.3-4

[4.3-4] 345aaf1d1f Bug #49548: univention-virtual-machine-manager-node 6.0.0-4A~4.3.0.201905291312
 doc/errata/staging/univention-virtual-machine-manager-node.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 4 Erik Damrose univentionstaff 2019-05-29 16:01:33 CEST
OK: libvirt service extension, configurable with UCR libvirt/numa/policy/memory.
OK: optional numad support for libvirt
OK: tests with default and interleave option
OK: yamls