Bug 49614 - Memory leak in linux-4.9.168-1 (4.9.0-9 amd64)
Status: RESOLVED WORKSFORME
Product: UCS
Classification: Unclassified
Component: Kernel
Version: UCS 4.4
Hardware/OS: Other Linux
Importance: P5 normal
Target Milestone: ---
Assigned To: Philipp Hahn
QA Contact: UCS maintainers
Depends on:
Blocks: 49910 49941
 
Reported: 2019-06-07 15:05 CEST by Philipp Hahn
Modified: 2022-02-22 16:06 CET
CC: 5 users

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 2: Will only affect a few installed domains
How will those affected feel about the bug?: 3: A User would likely not purchase the product
User Pain: 0.171
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2019060621000636
Bug group (optional):
Max CVSS v3 score:


Attachments
collectd: memory usage statistics (22.69 KB, image/png)
2019-06-26 12:27 CEST, Philipp Hahn

Description Philipp Hahn univentionstaff 2019-06-07 15:05:45 CEST
Public report: <https://help.univention.com/t/memoryleak-auf-slave-contoller/11892>
Affected internal system: mysig

I suspected a buffer_head leak, but the value dropped again:
# grep ^buffer_head slabinfo.2019-0*
slabinfo.2019-05-28:13:37:17:buffer_head         9235  25584    104   39    1 : tunables  120   60    8 : slabdata    656    656      0
slabinfo.2019-05-28:13:38:41:buffer_head         9509  25584    104   39    1 : tunables  120   60    8 : slabdata    656    656      0
slabinfo.2019-05-29:11:43:54:buffer_head         5790  23127    104   39    1 : tunables  120   60    8 : slabdata    593    593      0
slabinfo.2019-06-01:03:10:27:buffer_head         4078  16458    104   39    1 : tunables  120   60    8 : slabdata    422    422      0
slabinfo.2019-06-07:10:12:47:buffer_head        30225  30225    104   39    1 : tunables  120   60    8 : slabdata    775    775      0
slabinfo.2019-06-07:14:25:02:buffer_head        30787  30810    104   39    1 : tunables  120   60    8 : slabdata    790    790      0
slabinfo.2019-06-07:14:46:21:buffer_head         4593  22074    104   39    1 : tunables  120   60    8 : slabdata    566    566      0
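
For the record, a minimal sketch of how such timestamped snapshots can be collected (the actual collection mechanism used here is not part of this report; the file names are chosen to match the ones grep'ed above):
  # take an hourly snapshot of the kernel slab statistics
  while :; do
      cp /proc/slabinfo "slabinfo.$(date +%Y-%m-%d:%H:%M:%S)"
      sleep 3600
  done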

Maybe <https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v4.9.180&id=056066d8a769aa87f280f00c6c2d802ecc2c7a6b> fixes it?

No other idea from: git log --grep leak v4.9.168..v4.9.174
No Debian report yet.
No reproducer yet.

Stopping all processes did not fix it: the memory was still gone.

<https://www.kernel.org/doc/html/latest/dev-tools/kmemleak.html> cannot be used as Debian kernels are compiled without CONFIG_DEBUG_KMEMLEAK
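Whether a given kernel supports kmemleak at all can be verified against its build configuration; a minimal sketch, assuming the usual Debian /boot/config-* layout:
  # kmemleak needs CONFIG_DEBUG_KMEMLEAK, which Debian disables:
  grep CONFIG_DEBUG_KMEMLEAK /boot/config-"$(uname -r)"
  # without it, /sys/kernel/debug/kmemleak does not exist on the running system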
Comment 1 Christian Völker univentionstaff 2019-06-11 15:09:10 CEST
Currently happening for a customer on both master and backup every couple of days until the next reboot.

Do you need any details from there?
Comment 2 Christian Völker univentionstaff 2019-06-19 13:25:23 CEST
On a system where it is currently (and reproducibly) happening, with a load of 112 (!!!) due to massive swapping:


smem:
[...]
samba                      460.0K 
vmtoolsd                   622.0K 
smbstatus                  938.0K 
sshd                       958.0K 
smbstatus                    1.1M 
univention-mana              1.2M 
systemd                      1.2M 
top                          1.3M 
dockerd                      2.0M 
slapd                        2.1M 
apt-get                      3.0M 
apt-get                      3.7M 
samba                        4.4M 
iotop                        5.3M 
smem                         6.1M 
----------------------------------
                            39.7M 


free:
              total        used        free      shared  buff/cache   available
Mem:        2033684      854784      462676       28324      716224      998104
Swap:       2097148           0     2097148
Total:      4130832      854784     2559824



top:
top - 10:19:09 up 4 days, 20:40,  3 users,  load average: 112,06, 111,67, 117,23
Tasks: 326 total,   1 running, 325 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3,4 us, 18,9 sy,  0,0 ni,  0,0 id, 76,3 wa,  0,0 hi,  1,4 si,  0,0 st
KiB Mem :  2033684 total,    68240 free,  1920544 used,    44900 buff/cache
KiB Swap:  2097148 total,  1351752 free,   745396 used.    15400 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 78113 root      20   0  103224  17256   6584 D  3,4  0,8   0:20.70 apt-get -o DPkg::Options::=--force-confold -o DPkg::Options::=--force-overwrite -o DPkg::Options::=--force-overwrite-dir --trivial-only=no --assume-yes --quiet=1 -us d+
 90512 root      20   0   55896   7980   1284 S  3,5  0,4  29:15.57 /usr/bin/python3 /usr/sbin/iotop
 67417 root      20   0   89668   2768    516 D  0,9  0,1   1:45.12 apt-get check -qq
  9202 root      20   0   38336   1708   1252 R  0,0  0,1   0:18.21 top icbd 600
     1 root      20   0  204880   1668   1364 S  0,1  0,1   3:02.91 /sbin/init splash
 76894 root      20   0   54396   1400   1208 D  0,3  0,1   0:03.13 /lib/systemd/systemd-journald
 66954 root      20   0  167228   1352    596 D  0,0  0,1   0:06.85 /usr/bin/python2.7 /usr/lib/univention-pam/ldap-group-to-file.py --check_member


vmstat:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st                CEST
 0  0 664588  67684  11020 138304  194   72   312   159  188  593  5  5 86  4  0 2019-06-14 14:19:18
 1  0 655032  76488   7528 105064  181  154   368   218  190  586  5  5 85  5  0 2019-06-14 14:29:18
 0  0 706180  79472   8908 106864  186  253   698   315  212  632  6  5 83  6  0 2019-06-14 14:39:18
 0  0 717104  79228   7648 102576  197  198   494   262  204  626  6  5 85  5  0 2019-06-14 14:49:18
 0  0 731224  80884   3852  94064  201  205   496   270  207  624  5  5 85  5  0 2019-06-14 14:59:18
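
The smem total (~40 MB of process memory) versus the "used" figures in top already points at memory being lost on the kernel side rather than in user space. A quick way to cross-check this (a sketch, not the exact commands run on the customer system):
  # total memory attributed to processes
  smem -t | tail -n 1
  # kernel-side consumers visible in /proc/meminfo
  grep -E 'Slab|SReclaimable|SUnreclaim|KernelStack|PageTables' /proc/meminfo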
Comment 3 Christian Völker univentionstaff 2019-06-21 11:00:09 CEST
Raising priority, as the customer needs a fix: the system is not usable this way.

Tunnel access can be granted, if needed.
Comment 4 Philipp Hahn univentionstaff 2019-06-26 12:27:28 CEST
Created attachment 10090 [details]
collectd: memory usage statistics

A self-compiled 4.9.174 does not show the increasing memory usage after half a day, but we need a longer look.
Please note that swap usage still increases, so the bug might only be partly fixed: <http://xen1.knut.univention.de/collectd/bin/index.cgi?hostname=mysig.pingst.univention.de&plugin=memory&plugin=swap&timespan=86400&action=show_selection&ok_button=OK>

I filed a Debian bug for this: <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=931111>
Comment 5 Philipp Hahn univentionstaff 2019-07-03 13:31:52 CEST
> I filed a Debian bug for this:
> <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=931111>

There already was a bug in Debian, which looks similar:
  <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=912411>

Looking deeper, I back-ported the following patches to better diagnose the CGroups (a cherry-pick sketch follows the list):
  <https://www.spinics.net/lists/cgroups/msg20611.html>
  0679dee03c6d7  cgroup: keep track of number of descent cgroups
  1a926e0bbab83  cgroup: implement hierarchy limits
  4dcabece4c3a9  cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
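A back-port of that kind can be done roughly like this (a sketch only; the IDs are the upstream commits listed above, and cherry-picking them onto a 4.9 tree may require manual conflict resolution):
  # in a checkout of the 4.9 kernel tree being built
  git cherry-pick -x 0679dee03c6d7 1a926e0bbab83 4dcabece4c3a9
  # -x records the upstream commit ID in each back-ported commit message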

> nr_dying_descendants
>   This is the total number of dying descendant cgroups
>   underneath this cgroup.  A cgroup enters the dying state after
>   being deleted.  It remains in that state for an undefined
>   period (which will depend on system load) while resources are
>   freed before the cgroup is destroyed.  Note that the presence
>   of some cgroups in the dying state is normal, and is not
>   indicative of any problem.
>
>   A process can't be made a member of a dying cgroup, and a
>   dying cgroup can't be brought back to life.

While looking for further ideas I stumbled upon the following patch series, which is not yet applied upstream, but describes a problem with the current CG implementation:
  <https://lwn.net/Articles/782633/>

This led to <linux/Documentation/cgroup-v1/memory.txt>:
As CONFIG_MEMCG_KMEM is enabled, the Linux kernel keeps track of which CG allocated which kernel memory regions. This includes:
* stack pages
* slab pages
* sockets memory pressure
* TCP memory pressure
SLAB/SLUB-allocated objects in particular can outlive the processes of a CG. Furthermore, this changes the SLAB allocator to allocate pages on a per-CG basis.
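Whether kernel memory accounting is compiled in, and how much kernel memory a single CG currently holds, can be inspected like this (a sketch; the cgroup path is only an example):
  grep '^CONFIG_MEMCG' /boot/config-"$(uname -r)"
  # per-CG kernel memory counters of the cgroup-v1 memory controller
  cat /sys/fs/cgroup/memory/system.slice/memory.kmem.usage_in_bytes
  cat /sys/fs/cgroup/memory/system.slice/memory.kmem.slabinfo   # may be absent on some configs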
When 'libpam_systemd' removes the per-session CG, the CG is not destroyed immediately, but marked as "dying":
  # find /sys/fs/cgroup/memory -name cgroup.stat -exec grep '^nr_dying_descendants [^0]'  {} +
  /sys/fs/cgroup/memory/cgroup.stat:nr_dying_descendants 360
  /sys/fs/cgroup/memory/user.slice/cgroup.stat:nr_dying_descendants 320
  /sys/fs/cgroup/memory/user.slice/user-0.slice/cgroup.stat:nr_dying_descendants 303
  /sys/fs/cgroup/memory/system.slice/cgroup.stat:nr_dying_descendants 40
  # grep ^memory /proc/cgroups 
  memory  10      452     1

So the system has a high number of dying CGs. Jürn observed the same on one of our own systems, where we had ~88k dying CGs after one week of uptime.

As a first test we disabled 'pam_systemd' on the affected system:
  # grep pam_systemd.so /etc/pam.d/*
  /etc/pam.d/common-session:#session      optional        pam_systemd.so
The test is still running, but that looks promising.

Tracking the kernel memory can be disabled by setting the following Linux kernel command line argument:
  cgroup.memory=nokmem
My test is still running, but that does not seem to fix it.
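On a standard Debian/UCS installation with GRUB, the parameter can be added like this (a sketch, assuming the usual /etc/default/grub setup):
  # append cgroup.memory=nokmem to the default kernel command line
  sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 cgroup.memory=nokmem"/' /etc/default/grub
  update-grub        # regenerate grub.cfg, then reboot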

We can also try to patch systemd/src/basic/cgroup-util.c:731 cg_trim() to flush the memory before the 'rmdir()':
  # echo 0 >/sys/fs/cgroup/memory/.../memory.force_empty
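Until such a patch exists, a similar effect can be approximated from user space by forcing the per-session memory CGs empty before systemd removes them; a minimal sketch, assuming the cgroup-v1 layout shown above:
  # drop charged pages of all user-session memory CGs, so that dying CGs keep
  # fewer slab pages pinned (memory.force_empty is documented in
  # Documentation/cgroup-v1/memory.txt)
  for f in /sys/fs/cgroup/memory/user.slice/user-*.slice/*/memory.force_empty; do
      [ -e "$f" ] && echo 0 > "$f"
  done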
Comment 6 Philipp Hahn univentionstaff 2019-07-24 09:15:36 CEST
Feedback to <https://www.spinics.net/lists/cgroups/msg22853.html> from Roman Gushchin and Ben Hutchings:
- Fixed in Linux 5.[23]
- Upstream patch-set at <https://lwn.net/ml/linux-kernel/20190611231813.3148843-1-guro@fb.com/>
- Alternative hack at <https://patchwork.kernel.org/cover/10772277/>
Comment 7 Philipp Hahn univentionstaff 2019-08-02 17:16:58 CEST
Some patches of interest:
6631142229005e1b1c311a09efe9fb3cfdac8559 blkcg, writeback: dead memcgs shouldn't contribute to writeback ownership arbitration [v5.3-rc1~164^2~108]
68600f623d69da428c6163275f97ca126e1a8ec5 mm: don't miss the last page because of round-off error [v4.20-rc1~92^2~112]
172b06c32b949759fe6313abec514bc4f15014f4 mm: slowly shrink slabs with a relatively small number of objects [v4.19-rc5~13]
aeed1d325d429ac9699c4bf62d17156d60905519 mm/vmscan.c: generalize shrink_slab() calls in shrink_node() [v4.19-rc1~105^2~29]
9092c71bb724dba2ecba849eae69e5c9d39bd3d2 mm: use sc->priority for slab shrink targets [v4.16-rc1~122^2~80]
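Whether any of these are already contained in a given stable release can be checked in a clone of the stable tree (assuming a remote-tracking branch origin/linux-4.9.y); stable back-ports record the upstream commit ID in their message, so grepping the log works (a sketch):
  # empty output means the fix has not been back-ported to linux-4.9.y yet
  for c in 6631142229005e1b1c311a09efe9fb3cfdac8559 \
           68600f623d69da428c6163275f97ca126e1a8ec5 \
           172b06c32b949759fe6313abec514bc4f15014f4 \
           aeed1d325d429ac9699c4bf62d17156d60905519 \
           9092c71bb724dba2ecba849eae69e5c9d39bd3d2; do
      echo "== $c"
      git log --oneline --grep="$c" v4.9.168..origin/linux-4.9.y
  done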
Comment 8 Philipp Hahn univentionstaff 2019-08-21 10:22:05 CEST
No priority at the moment
Comment 9 Philipp Hahn univentionstaff 2022-02-22 16:06:47 CET
Not reported or observed again.
UCS 4.4-8 ships 4.9.272 by now, so this is probably fixed.
Most probably caused by Bug 49910 and fixed by 4.4-1+e191.