Bug 49614 - Memory leak in linux-4.9.168-1 (4.9.0-9 amd64)
Status: RESOLVED WORKSFORME
Product: UCS
Classification: Unclassified
Component: Kernel
Version: UCS 4.4
Hardware/OS: Other Linux
Importance: P5 normal
Target Milestone: ---
Assigned To: Philipp Hahn
QA Contact: UCS maintainers
Depends on:
Blocks: 49910 49941
 
Reported: 2019-06-07 15:05 CEST by Philipp Hahn
Modified: 2022-02-22 16:06 CET
CC: 5 users

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 2: Will only affect a few installed domains
How will those affected feel about the bug?: 3: A User would likely not purchase the product
User Pain: 0.171
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2019060621000636
Bug group (optional):
Max CVSS v3 score:


Attachments
collectd: memory usage statistics (22.69 KB, image/png)
2019-06-26 12:27 CEST, Philipp Hahn

Description Philipp Hahn univentionstaff 2019-06-07 15:05:45 CEST
Public report: <https://help.univention.com/t/memoryleak-auf-slave-contoller/11892>
Affected internal system: mysig

I suspected a buffer_head leak, but the value dropped again:
# grep ^buffer_head slabinfo.2019-0*
slabinfo.2019-05-28:13:37:17:buffer_head         9235  25584    104   39    1 : tunables  120   60    8 : slabdata    656    656      0
slabinfo.2019-05-28:13:38:41:buffer_head         9509  25584    104   39    1 : tunables  120   60    8 : slabdata    656    656      0
slabinfo.2019-05-29:11:43:54:buffer_head         5790  23127    104   39    1 : tunables  120   60    8 : slabdata    593    593      0
slabinfo.2019-06-01:03:10:27:buffer_head         4078  16458    104   39    1 : tunables  120   60    8 : slabdata    422    422      0
slabinfo.2019-06-07:10:12:47:buffer_head        30225  30225    104   39    1 : tunables  120   60    8 : slabdata    775    775      0
slabinfo.2019-06-07:14:25:02:buffer_head        30787  30810    104   39    1 : tunables  120   60    8 : slabdata    790    790      0
slabinfo.2019-06-07:14:46:21:buffer_head         4593  22074    104   39    1 : tunables  120   60    8 : slabdata    566    566      0
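
For the record, a minimal sketch of how such timestamped snapshots can be collected (the actual collection mechanism used here is not part of this report; the file names are chosen to match the ones grep'ed above):
  # take an hourly snapshot of the kernel slab statistics
  while :; do
      cp /proc/slabinfo "slabinfo.$(date +%Y-%m-%d:%H:%M:%S)"
      sleep 3600
  done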

Maybe <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v4.9.180&id=056066d8a769aa87f280f00c6c2d802ecc2c7a6b> fixes it?

No other idea from: git log --grep leak v4.9.168..v4.9.174
No Debian report yet.
No reproducer yet.

Stopping all processes did not fix it: the memory was still gone.

<https://www.kernel.org/doc/html/latest/dev-tools/kmemleak.html> cannot be used as Debian kernels are compiled without CONFIG_DEBUG_KMEMLEAK
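Whether a given kernel supports kmemleak at all can be verified against its build configuration; a minimal sketch, assuming the usual Debian /boot/config-* layout:
  # kmemleak needs CONFIG_DEBUG_KMEMLEAK, which Debian disables:
  grep CONFIG_DEBUG_KMEMLEAK /boot/config-"$(uname -r)"
  # without it, /sys/kernel/debug/kmemleak does not exist on the running system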
Comment 1 Christian Völker univentionstaff 2019-06-11 15:09:10 CEST
Currently happening for a customer on both master and backup every couple of days until the next reboot.

Do you need any details from there?
Comment 2 Christian Völker univentionstaff 2019-06-19 13:25:23 CEST
On a system where it is currently (and reproducibly) happening, with a load of 112 (!!!) due to massive swapping:


smem:
[...]
samba                      460.0K 
vmtoolsd                   622.0K 
smbstatus                  938.0K 
sshd                       958.0K 
smbstatus                    1.1M 
univention-mana              1.2M 
systemd                      1.2M 
top                          1.3M 
dockerd                      2.0M 
slapd                        2.1M 
apt-get                      3.0M 
apt-get                      3.7M 
samba                        4.4M 
iotop                        5.3M 
smem                         6.1M 
----------------------------------
                            39.7M 


free:
              total        used        free      shared  buff/cache   available
Mem:        2033684      854784      462676       28324      716224      998104
Swap:       2097148           0     2097148
Total:      4130832      854784     2559824



top:
top - 10:19:09 up 4 days, 20:40,  3 users,  load average: 112,06, 111,67, 117,23
Tasks: 326 total,   1 running, 325 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3,4 us, 18,9 sy,  0,0 ni,  0,0 id, 76,3 wa,  0,0 hi,  1,4 si,  0,0 st
KiB Mem :  2033684 total,    68240 free,  1920544 used,    44900 buff/cache
KiB Swap:  2097148 total,  1351752 free,   745396 used.    15400 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 78113 root      20   0  103224  17256   6584 D  3,4  0,8   0:20.70 apt-get -o DPkg::Options::=--force-confold -o DPkg::Options::=--force-overwrite -o DPkg::Options::=--force-overwrite-dir --trivial-only=no --assume-yes --quiet=1 -us d+
 90512 root      20   0   55896   7980   1284 S  3,5  0,4  29:15.57 /usr/bin/python3 /usr/sbin/iotop
 67417 root      20   0   89668   2768    516 D  0,9  0,1   1:45.12 apt-get check -qq
  9202 root      20   0   38336   1708   1252 R  0,0  0,1   0:18.21 top icbd 600
     1 root      20   0  204880   1668   1364 S  0,1  0,1   3:02.91 /sbin/init splash
 76894 root      20   0   54396   1400   1208 D  0,3  0,1   0:03.13 /lib/systemd/systemd-journald
 66954 root      20   0  167228   1352    596 D  0,0  0,1   0:06.85 /usr/bin/python2.7 /usr/lib/univention-pam/ldap-group-to-file.py --check_member


vmstat:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st                CEST
 0  0 664588  67684  11020 138304  194   72   312   159  188  593  5  5 86  4  0 2019-06-14 14:19:18
 1  0 655032  76488   7528 105064  181  154   368   218  190  586  5  5 85  5  0 2019-06-14 14:29:18
 0  0 706180  79472   8908 106864  186  253   698   315  212  632  6  5 83  6  0 2019-06-14 14:39:18
 0  0 717104  79228   7648 102576  197  198   494   262  204  626  6  5 85  5  0 2019-06-14 14:49:18
 0  0 731224  80884   3852  94064  201  205   496   270  207  624  5  5 85  5  0 2019-06-14 14:59:18
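
The smem total (~40 MB of process memory) versus the "used" figures in top already points at memory being lost on the kernel side rather than in user space. A quick way to cross-check this (a sketch, not the exact commands run on the customer system):
  # total memory attributed to processes
  smem -t | tail -n 1
  # kernel-side consumers visible in /proc/meminfo
  grep -E 'Slab|SReclaimable|SUnreclaim|KernelStack|PageTables' /proc/meminfo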
Comment 3 Christian Völker univentionstaff 2019-06-21 11:00:09 CEST
Raising priority, as the customer needs a fix: the system is not usable this way.

Tunnel access can be granted, if needed.
Comment 4 Philipp Hahn univentionstaff 2019-06-26 12:27:28 CEST
Created attachment 10090 [details]
collectd: memory usage statistics

A self-compiled 4.9.174 does not show the increasing memory usage after half a day, but we need a longer look.
Please note that swap usage still increases, so the bug might only be partly fixed: <http://xen1.knut.univention.de/collectd/bin/index.cgi?hostname=mysig.pingst.univention.de&plugin=memory&plugin=swap&timespan=86400&action=show_selection&ok_button=OK>

I filed a Debian bug for this: <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=931111>
Comment 5 Philipp Hahn univentionstaff 2019-07-03 13:31:52 CEST
> I filed a Debian bug for this:
> <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=931111>

There already was a bug in Debian, which looks similar:
  <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=912411>

Looking deeper, I back-ported the following patches to better diagnose the CGroups (a cherry-pick sketch follows the list):
  <https://www.spinics.net/lists/cgroups/msg20611.html>
  0679dee03c6d7  cgroup: keep track of number of descent cgroups
  1a926e0bbab83  cgroup: implement hierarchy limits
  4dcabece4c3a9  cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
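A back-port of that kind can be done roughly like this (a sketch only; the IDs are the upstream commits listed above, and cherry-picking them onto a 4.9 tree may require manual conflict resolution):
  # in a checkout of the 4.9 kernel tree being built
  git cherry-pick -x 0679dee03c6d7 1a926e0bbab83 4dcabece4c3a9
  # -x records the upstream commit ID in each back-ported commit message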

> nr_dying_descendants
>   This is the total number of dying descendant cgroups
>   underneath this cgroup.  A cgroup enters the dying state after
>   being deleted.  It remains in that state for an undefined
>   period (which will depend on system load) while resources are
>   freed before the cgroup is destroyed.  Note that the presence
>   of some cgroups in the dying state is normal, and is not
>   indicative of any problem.
>
>   A process can't be made a member of a dying cgroup, and a
>   dying cgroup can't be brought back to life.

While looking for further ideas I stumbled upon the following patch series, which is not yet applied upstream, but describes a problem with the current CG implementation:
  <https://lwn.net/Articles/782633/>

This led to <linux/Documentation/cgroup-v1/memory.txt>:
As CONFIG_MEMCG_KMEM is enabled, the Linux kernel keeps track of which CG allocated which kernel memory regions. This includes:
* stack pages
* slab pages
* sockets memory pressure
* TCP memory pressure
SLAB/SLUB-allocated objects in particular can outlive the processes of a CG. Furthermore, this changes the SLAB allocator to allocate pages on a per-CG basis.
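Whether kernel memory accounting is compiled in, and how much kernel memory a single CG currently holds, can be inspected like this (a sketch; the cgroup path is only an example):
  grep '^CONFIG_MEMCG' /boot/config-"$(uname -r)"
  # per-CG kernel memory counters of the cgroup-v1 memory controller
  cat /sys/fs/cgroup/memory/system.slice/memory.kmem.usage_in_bytes
  cat /sys/fs/cgroup/memory/system.slice/memory.kmem.slabinfo   # may be absent on some configs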
When 'libpam_systemd' removes the per-session CG, the CG is not destroyed immediately, but marked as "dying":
  # find /sys/fs/cgroup/memory -name cgroup.stat -exec grep '^nr_dying_descendants [^0]'  {} +
  /sys/fs/cgroup/memory/cgroup.stat:nr_dying_descendants 360
  /sys/fs/cgroup/memory/user.slice/cgroup.stat:nr_dying_descendants 320
  /sys/fs/cgroup/memory/user.slice/user-0.slice/cgroup.stat:nr_dying_descendants 303
  /sys/fs/cgroup/memory/system.slice/cgroup.stat:nr_dying_descendants 40
  # grep ^memory /proc/cgroups 
  memory  10      452     1

So the system has a high number of dying CGs. Jürn observed the same on one of our own systems, where we had ~88k dying CGs after one week of uptime.

As a first test we disabled 'pam_systemd' on the affected system:
  # grep pam_systemd.so /etc/pam.d/*
  /etc/pam.d/common-session:#session      optional        pam_systemd.so
The test is still running, but that looks promising.

Tracking the kernel memory can be disabled by setting the following Linux kernel command line argument:
  cgroup.memory=nokmem
My test is still running, but that does not seem to fix it.
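On a standard Debian/UCS installation with GRUB, the parameter can be added like this (a sketch, assuming the usual /etc/default/grub setup):
  # append cgroup.memory=nokmem to the default kernel command line
  sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 cgroup.memory=nokmem"/' /etc/default/grub
  update-grub        # regenerate grub.cfg, then reboot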

We can also try to patch systemd/src/basic/cgroup-util.c:731 cg_trim() to flush the memory before the 'rmdir()':
  # echo 0 >/sys/fs/cgroup/memory/.../memory.force_empty
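Until such a patch exists, a similar effect can be approximated from user space by forcing the per-session memory CGs empty before systemd removes them; a minimal sketch, assuming the cgroup-v1 layout shown above:
  # drop charged pages of all user-session memory CGs, so that dying CGs keep
  # fewer slab pages pinned (memory.force_empty is documented in
  # Documentation/cgroup-v1/memory.txt)
  for f in /sys/fs/cgroup/memory/user.slice/user-*.slice/*/memory.force_empty; do
      [ -e "$f" ] && echo 0 > "$f"
  done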
Comment 6 Philipp Hahn univentionstaff 2019-07-24 09:15:36 CEST
Feedback to <https://www.spinics.net/lists/cgroups/msg22853.html> from Roman Gushchin and Ben Hutchings:
- Fixed in Linux 5.[23]
- Upstream patch-set at <https://lwn.net/ml/linux-kernel/20190611231813.3148843-1-guro@fb.com/>
- Alternative hack at <https://patchwork.kernel.org/cover/10772277/>
Comment 7 Philipp Hahn univentionstaff 2019-08-02 17:16:58 CEST
Some patches of interest:
6631142229005e1b1c311a09efe9fb3cfdac8559 blkcg, writeback: dead memcgs shouldn't contribute to writeback ownership arbitration [v5.3-rc1~164^2~108]
68600f623d69da428c6163275f97ca126e1a8ec5 mm: don't miss the last page because of round-off error [v4.20-rc1~92^2~112]
172b06c32b949759fe6313abec514bc4f15014f4 mm: slowly shrink slabs with a relatively small number of objects [v4.19-rc5~13]
aeed1d325d429ac9699c4bf62d17156d60905519 mm/vmscan.c: generalize shrink_slab() calls in shrink_node() [v4.19-rc1~105^2~29]
9092c71bb724dba2ecba849eae69e5c9d39bd3d2 mm: use sc->priority for slab shrink targets [v4.16-rc1~122^2~80]
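Whether any of these are already contained in a given stable release can be checked in a clone of the stable tree (assuming a remote-tracking branch origin/linux-4.9.y); stable back-ports record the upstream commit ID in their message, so grepping the log works (a sketch):
  # empty output means the fix has not been back-ported to linux-4.9.y yet
  for c in 6631142229005e1b1c311a09efe9fb3cfdac8559 \
           68600f623d69da428c6163275f97ca126e1a8ec5 \
           172b06c32b949759fe6313abec514bc4f15014f4 \
           aeed1d325d429ac9699c4bf62d17156d60905519 \
           9092c71bb724dba2ecba849eae69e5c9d39bd3d2; do
      echo "== $c"
      git log --oneline --grep="$c" v4.9.168..origin/linux-4.9.y
  done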
Comment 8 Philipp Hahn univentionstaff 2019-08-21 10:22:05 CEST
No priority at the moment
Comment 9 Philipp Hahn univentionstaff 2022-02-22 16:06:47 CET
Not reported or observed again.
UCS 4.4-8 ships 4.9.272 by now, so this is probably fixed.
Most probably caused by Bug 49910 and fixed by 4.4-1+e191.