Univention Bugzilla – Bug 49548
[4.3] Live Migration Causes Memory Thrashing on Target Host
Last modified: 2022-03-29 17:00:48 CEST
CU migrated a running VM with a huge amount of RAM (100G) to a target host with apparently enough free physical RAM available. But the target host became unusable due to memory thrashing, and all VMs there suddenly used a lot of swap memory. All VMs were very slow, as their memory had been moved to the swap file. A "top" command still showed a good amount of free memory on the target host, yet all VMs were still running at least partially from swap.
r18583 | Bug #49548 libvirt: Re-enable NUMA support
Package: libvirt
Version: 3.0.0-4+deb9u3A~4.3.0.201905250642
Branch: ucs_4.3-0
Scope: errata4.3-4

[4.3-4] aa47b58462 Bug #49548: numad 0.5+20150602-5
 doc/errata/staging/libvirt.yaml | 10 ++++++++++
 doc/errata/staging/numad.yaml   | 12 ++++++++++++
 2 files changed, 22 insertions(+)

Today's big CPU systems are ccNUMA (cache-coherent non-uniform memory access) systems: logically they consist of multiple nodes crammed into one CPU package or case. To the inexperienced user they look like one big system with many CPUs and lots of RAM, but to the Linux kernel they look more like a distributed system.

Example "lattjo":

Logical view:
> $ nproc
> 24
> $ free -h
>        total  used  free  shared  buff/cache  available
> Mem:    125G   65G   59G     18M        1,1G        59G
>                      ^^^
> Swap:    19G    0B   19G

Physical view:
> $ numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
> node 0 size: 64314 MB
> node 0 free: 29086 MB
>              ^^^^^^^^
> node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
> node 1 size: 64509 MB
> node 1 free: 31572 MB
>              ^^^^^^^^
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10

Logically 59 GiB are free, but they are distributed roughly 50%/50% over both nodes. When a process allocates memory, the Linux kernel optimizes for performance and tries to allocate _local_ memory, that is, memory from the node where that process is _currently_ running. Linux prefers local memory because accessing memory connected to a remote node is more costly and takes longer: the distance matrix above shows that accessing remote memory is estimated to take twice as long as accessing local memory.

UVMM does no CPU pinning by default, so a VM thread might run on any of the 24 cores. That also changes over time, as the thread allocating memory might run on a different CPU each time. Since Qemu does _not_ allocate all memory of the VM at once, the size of the VM grows steadily, and each time a piece of memory is picked from the node where the thread happens to run at that moment.
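The mismatch between the logical and the physical view can be illustrated with a short script. This is only a sketch: it uses the per-node numbers quoted above as canned input; on a real host one would feed it the output of `numactl -H` directly.

```shell
#!/bin/sh
# Sketch: "free" memory in total does not mean a large single-node
# allocation can succeed. The canned input below copies the per-node
# free values from the "lattjo" example.
awk '
/^node [0-9]+ free:/ { free[$2] = $4; total += $4 }
END {
    max = 0
    for (n in free)
        if (free[n] > max)
            max = free[n]
    printf "free combined: %d MB\n", total
    printf "free on biggest single node: %d MB\n", max
    # A migrated 64 GiB VM needs ~65536 MB from *one* node (mode=strict):
    printf "64 GiB fits on one node: %s\n", (max >= 65536 ? "yes" : "no")
}' <<'EOF'
node 0 free: 29086 MB
node 1 free: 31572 MB
EOF
```

On the example host this reports 60658 MB free combined but only 31572 MB on the biggest node, so a strict 64 GiB allocation cannot be satisfied from one node without pushing other users out to swap.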
So memory usage is statistically distributed evenly over all nodes for one Qemu process:

> # numastat -c qemu-system-x86
> Per-node process memory usage (in MBs)
> PID              Node 0  Node 1  Total
> ---------------  ------  ------  -----
> 21990 (qemu-syst  33929   31906  65835
> 24486 (qemu-syst     38    2126   2164
> ---------------  ------  ------  -----
> Total             33967   34033  67999

The behavior is different when migrating a VM or loading it from a saved state: then the memory is allocated _en bloc_, so Qemu tries to allocate 64 GiB at once. The Linux kernel again tries to allocate it from _one_ node. As that much memory is not available there, the kernel pushes other users out to swap to free up that much memory, leading to the observed behavior: it looks like another 64 GiB are free (combined), but _not_ on a single node!

The Linux kernel can be told to use a different allocation strategy by starting a process through the "numactl" wrapper:

  numactl --interleave=all ...

This tells the Linux kernel to spread all allocations over all nodes by default.

libvirt has built-in NUMA support, but it was partly disabled: for UCS-4.2 we back-ported libvirt from Debian-Stretch, but not "numad" - a daemon running in the background which tries to move non-local NUMA memory onto one NUMA node to optimize for local memory access, and which provides other features needed by libvirt. "libvirt" needs "numad" to enable the "interleave" mode after it has been configured in the VM XML like this:

  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <memory mode='interleave' placement='auto'/>
  </numatune>

The important thing here is "mode=interleave", as libvirt defaults to "mode=strict", which leads to the observed (bad) behavior. Currently there is no way to change the default of libvirtd except changing "libvirtd.service" to use "ExecStart=/usr/bin/numactl --interleave=all /usr/sbin/libvirtd -l". I also found no Linux sysctl to change the default to interleave.
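For illustration, here is a minimal sketch of how these elements sit in a complete domain definition. The domain name and memory size are made up; the element names follow the libvirt domain XML format:

```xml
<domain type='kvm'>
  <name>example-vm</name>  <!-- illustrative name -->
  <memory unit='GiB'>64</memory>
  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <!-- 'interleave' spreads the VM's pages over all NUMA nodes;
         the libvirt default 'strict' tries to satisfy the whole
         allocation from one node, causing the swap storm above -->
    <memory mode='interleave' placement='auto'/>
  </numatune>
  <!-- os, devices, etc. omitted -->
</domain>
```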
We can change UVMM to generate this XML by default, but that still leaves the task of updating all existing VMs to include these statements as well.

While researching this I found the following two excellent articles describing the problem (and the solution) - they are worth reading:
* <https://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/>
* <http://www.admin-magazine.com/Archive/2014/20/Best-practices-for-KVM-on-NUMA-servers/>
[4.3-4] be50c81216 Bug #49548 uvmm: Enable NUMA memory interleave by default
 .../conffiles/etc/systemd/system/libvirtd.service.d/ucr.conf   | 10 ++++++++++
 .../univention-virtual-machine-manager-node/debian/changelog   |  6 ++++++
 .../univention-virtual-machine-manager-node/debian/control     |  3 ++-
 ...virtual-machine-manager-node-kvm.univention-config-registry |  4 ++++
 ...chine-manager-node-kvm.univention-config-registry-variables |  6 ++++++
 5 files changed, 28 insertions(+), 1 deletion(-)

[4.4-0] 2d92b98729 Bug #49548 uvmm: Enable NUMA memory interleave by default
 .../conffiles/etc/systemd/system/libvirtd.service.d/ucr.conf   | 10 ++++++++++
 .../univention-virtual-machine-manager-node/debian/changelog   |  6 ++++++
 .../univention-virtual-machine-manager-node/debian/control     |  3 ++-
 ...virtual-machine-manager-node-kvm.univention-config-registry |  4 ++++
 ...chine-manager-node-kvm.univention-config-registry-variables |  6 ++++++
 5 files changed, 28 insertions(+), 1 deletion(-)

This changes libvirtd.service to use "numactl --interleave=all" by default. It can be disabled by setting the UCR variable "libvirt/numa/policy/memory=no".
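The systemd drop-in generated from the UCR template presumably looks roughly like this. This is a sketch only - the exact content, any extra libvirtd options, and the UCR variable handling live in the ucr.conf template shipped by the package:

```ini
# /etc/systemd/system/libvirtd.service.d/ucr.conf
# Generated from a UCR template; active while libvirt/numa/policy/memory
# is unset or not "no".
[Service]
# Clear the packaged ExecStart, then re-add it wrapped in numactl so that
# libvirtd and all Qemu processes it spawns inherit the interleave policy.
ExecStart=
ExecStart=/usr/bin/numactl --interleave=all /usr/sbin/libvirtd
```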
Other policies can still be configured using <https://libvirt.org/formatdomain.html#elementsNUMATuning>

Package: univention-virtual-machine-manager-node
Version: 6.0.0-3A~4.3.0.201905281729
Branch: ucs_4.3-0
Scope: errata4.3-4

Package: univention-virtual-machine-manager-node
Version: 7.0.1-2A~4.4.0.201905281723
Branch: ucs_4.4-0
Scope: errata4.4-0

[4.4-0] c9843158b2 Bug #49548: numad 0.5+20150602-5
 doc/errata/staging/libvirt.yaml | 13 +++++++++++++
 doc/errata/staging/numad.yaml   | 13 +++++++++++++
 2 files changed, 26 insertions(+)

[4.4-0] e2cea49212 Bug #49548: univention-virtual-machine-manager-node 7.0.1-2A~4.4.0.201905281723
 .../staging/univention-virtual-machine-manager-node.yaml | 10 ++++++++++
 1 file changed, 10 insertions(+)

[4.3-4] 651ec30aeb Bug #49548: univention-virtual-machine-manager-node 6.0.0-3A~4.3.0.201905281729
 doc/errata/staging/univention-virtual-machine-manager-node.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

[4.3-4] 51604d2d32 Bug #49548: univention-virtual-machine-manager-node 6.0.0-3A~4.3.0.201905281729
 .../staging/univention-virtual-machine-manager-node.yaml | 10 ++++++++++
 1 file changed, 10 insertions(+)

TODO: After QA clone Bug to 4.4-0 and fix YAML in 4.4-0 to use the cloned Bug#
[4.3-4] 84494763d4 Bug #49548 uvmm: Restart libvirtd on package upgrade
 .../univention-virtual-machine-manager-node/debian/changelog |  6 ++++++
 .../univention-virtual-machine-manager-node-kvm.postinst     | 11 +++++++++++
 2 files changed, 17 insertions(+)

Package: univention-virtual-machine-manager-node
Version: 6.0.0-4A~4.3.0.201905291312
Branch: ucs_4.3-0
Scope: errata4.3-4

[4.3-4] 345aaf1d1f Bug #49548: univention-virtual-machine-manager-node 6.0.0-4A~4.3.0.201905291312
 doc/errata/staging/univention-virtual-machine-manager-node.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
OK: libvirt service extension, configurable with UCR libvirt/numa/policy/memory.
OK: optional numad support for libvirt
OK: tests with default and interleave option
OK: yamls
<http://errata.software-univention.de/ucs/4.3/521.html>
<http://errata.software-univention.de/ucs/4.3/522.html>
<http://errata.software-univention.de/ucs/4.3/523.html>