Bug 57297 - Samba LDAP connections hang with lmdb backend - due to stuck samba-tool domain backup offline
Status: CLOSED FIXED
Alias: None
Product: UCS
Classification: Unclassified
Component: Samba4
Version: UCS 5.0
Hardware: Other Linux
Importance: P5 normal
Target Milestone: UCS 5.2
Assignee: Arvid Requate
QA Contact: Julia Bremer
URL: https://bugzilla.samba.org/show_bug.c...
Keywords:
Depends on: 56434
Blocks: 57734
Reported: 2024-05-17 12:57 CEST by Arvid Requate
Modified: 2024-11-19 12:39 CET (History)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 1: Will affect a very few installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.143
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2024042221000091, 2024061021000083
Bug group (optional):
Customer ID:
Max CVSS v3 score:


Attachments
revert-series-sambabug14676.diff (14.58 KB, patch)
2024-10-23 19:33 CEST, Arvid Requate

Description Arvid Requate univentionstaff 2024-05-17 12:57:56 CEST
In UCS 5.2 test environments we saw Samba LDAP connections hang. The main ldap_server Samba process was in epoll_wait, but the four ldap_server worker child processes were all hanging in
==========
#0  0x00007ff1bc117700 in __GI___libc_fcntl64 (fd=11, cmd=7) at ../sysdeps/unix/sysv/linux/fcntl64.c:49
        sc_ret = -512
        sc_ret = <optimized out>
        ap = {{gp_offset = 16, fp_offset = 32753, overflow_arg_area = 0x7ffceee4d1f0, reg_save_area = 0x7ffceee4d1b0}}
        arg = 0x7ffceee4d200
#1  0x00007ff1bbdeed2c in ?? () from /lib/x86_64-linux-gnu/libtdb.so.1
No symbol table info available.
#2  0x00007ff1bbdef1fc in ?? () from /lib/x86_64-linux-gnu/libtdb.so.1
No symbol table info available.
#3  0x00007ff1bbdf3e39 in ?? () from /lib/x86_64-linux-gnu/libtdb.so.1
No symbol table info available.
#4  0x00007ff1b7ef8456 in partition_metadata_start_trans (module=0x55be421123d0) at ../../source4/dsdb/samdb/ldb_modules/partition_metadata.c:506
        data = 0x55be424b8d70
        tdb = 0x55be42111fb0
#5  0x00007ff1b7ef2a18 in partition_start_trans (module=0x55be421123d0) at ../../source4/dsdb/samdb/ldb_modules/partition.c:1105
        i = 0
        ret = 0
        data = 0x55be424b8d70
==========

A look into sam.ldb.d showed two interesting files: "metadata.tdb.bak-offline" and a zero-byte 'CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb.copy.mdb':
==========
root@master091:~# ls -l /var/lib/samba/private/sam.ldb.d/
insgesamt 24616
-rw-r--r-- 1 root root  7909376 17. Mai 00:17 'CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb'
-rw-r----- 1 root root        0 17. Mai 03:00 'CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb.copy.mdb'
-rw-r--r-- 1 root root  6400128 17. Mai 12:28 'CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r--r-- 1 root root 10465280 17. Mai 00:17 'CN=SCHEMA,CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb'
-rw-r--r-- 1 root root  6400128 17. Mai 12:28 'CN=SCHEMA,CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r--r-- 1 root root  5664768 17. Mai 03:00 'DC=AUTOTEST091,DC=TEST.ldb'
-rw-r--r-- 1 root root  6400128 17. Mai 12:28 'DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r--r-- 1 root root   503808 17. Mai 02:22 'DC=DOMAINDNSZONES,DC=AUTOTEST091,DC=TEST.ldb'
-rw-r--r-- 1 root root  6400128 17. Mai 12:28 'DC=DOMAINDNSZONES,DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r--r-- 1 root root   212992 17. Mai 00:24 'DC=FORESTDNSZONES,DC=AUTOTEST091,DC=TEST.ldb'
-rw-r--r-- 1 root root  6400128 17. Mai 12:28 'DC=FORESTDNSZONES,DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r----- 1 root root   421888 17. Mai 03:00  metadata.tdb
-rw-r----- 1 root root     8192 17. Mai 03:00  metadata.tdb.bak-offline
==========

ps showed a stuck mdb_copy process:
==========
   111155 ?        S      0:00  |       \_ /bin/bash /usr/sbin/univention-samba4-backup
 111176 ?        S      0:00  |           \_ /usr/bin/python3 /bin/samba-tool domain backup offline --targetdir=/var/univention-backup/samba
 111242 ?        S      0:00  |               \_ /bin/mdb_copy -n /var/lib/samba/private/sam.ldb.d/CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb /var/lib/samba/private/sam.ldb.d/CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb.copy.mdb
==========

I couldn't quickly make sense of the gdb bt output of the mdb_copy process (the code is optimized), but after killing it hard, Samba LDAP became responsive again. I removed the '.ldb.copy.mdb' file.
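
To check another system for the same symptoms, the two observations above (a leftover zero-byte '.copy.mdb' file and a lingering mdb_copy process) can be scripted. This is only a minimal sketch: the helper names find_leftover_copies and find_mdb_copy_pids are made up for illustration, and the second one assumes "ps" style lines with the PID in the first column, as in the listing above.

```python
import os


def find_leftover_copies(directory):
    """Return the sorted names of zero-byte '*.copy.mdb' files in directory."""
    leftovers = []
    for entry in os.scandir(directory):
        if (entry.is_file()
                and entry.name.endswith(".copy.mdb")
                and entry.stat().st_size == 0):
            leftovers.append(entry.name)
    return sorted(leftovers)


def find_mdb_copy_pids(ps_lines):
    """Return the PIDs of mdb_copy processes found in 'ps' style lines.

    Assumption: the PID is the first whitespace-separated column.
    """
    pids = []
    for line in ps_lines:
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue
        # Match the executable name only, not paths that merely contain it.
        if any(os.path.basename(field) == "mdb_copy" for field in fields[1:]):
            pids.append(int(fields[0]))
    return pids
```

Calling find_leftover_copies("/var/lib/samba/private/sam.ldb.d") on the affected system would have reported the '.ldb.copy.mdb' file shown in the listing. Whether a found mdb_copy process is really stuck still has to be judged by hand, e.g. by its age.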
Comment 1 Arvid Requate univentionstaff 2024-05-17 13:00:45 CEST
Either we find how to make "samba-tool domain backup offline" play nicely with an "online" Samba, or we may need to create a different backup solution (or stop Samba meanwhile).
Comment 2 Arvid Requate univentionstaff 2024-05-17 13:07:13 CEST
Ticket#2024042221000091 probably ran into the same problem with UCS 5.0-7 (or -6).
Comment 3 Arvid Requate univentionstaff 2024-05-17 14:21:23 CEST
With OpenLDAP, mdb_copy seems to be expected to work concurrently even with a live slapd ( https://openldap-technical.openldap.narkive.com/ExQrZ6ks/how-to-take-hot-backup-of-mdb ), but the access pattern in the Samba code may be different.

https://wiki.samba.org/index.php/Back_up_and_Restoring_a_Samba_AD_DC#Offline/local_DC_backup explicitly says that the Samba DC doesn't need to be "offline" for the "offline" backup. But apparently the tool has this issue with mdb/lmdb.
Comment 5 Arvid Requate univentionstaff 2024-06-07 16:11:33 CEST
In the past there have been adjustments made to the locking in https://gitlab.com/samba-team/samba/-/blob/master/python/samba/netcmd/domain/backup.py
( https://bugzilla.samba.org/show_bug.cgi?id=14676 ). So maybe it sh/could work without stopping Samba. For UCS 5.2-0, Debian Bookworm currently has version 0.9.24 of https://packages.debian.org/source/bookworm/lmdb . Maybe a more recent version would help?

See also https://bugs.openldap.org/show_bug.cgi?id=10095 ( https://github.com/openldap/openldap/blob/master/libraries/liblmdb/CHANGES )
Comment 6 Arvid Requate univentionstaff 2024-06-10 18:01:19 CEST
Testing with lmdb 0.9.33 (and Samba 4.18.3) showed that the locking problem is still not fixed that way (but if we go for a higher version of lmdb, then I think we should not use 0.9.31, judging by the changelog).
Comment 7 Arvid Requate univentionstaff 2024-06-10 22:46:54 CEST
The code at
https://gitlab.com/samba-team/samba/-/blob/master/python/samba/netcmd/domain/backup.py?ref_type=heads#L1044 
should protect "samba-tool domain backup offline" e.g. against ldbmodify. Yet, when I create a simple ldbmodify
loop (setting description: foo and then again description: bar) running in parallel to the backup, I can easily
get the backup (mdb_copy) to hang. Characteristically, mdb_copy hangs in pthread_mutex_lock
while some ldbmodify calls output the message "LMDB Stale readers, deleted" before finally hanging:

Modified 1 records successfully
Modified 1 records successfully
Modified 1 records successfully
LMDB Stale readers, deleted (1)
LMDB Stale readers, deleted (1)
LMDB Stale readers, deleted (1)
LMDB Stale readers, deleted (1)
LMDB Stale readers, deleted (1)
<bash while loop hangs>

The ldbmodify was hanging in tdb_brlock_retry -> tdb_brlock -> fcntl_lock (all in common/lock.c),
"br" is byte range I think and the last time I checked with strace it was attempting to
lock /var/lib/samba/private/sam.ldb.d/metadata.tdb.
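
To see whether some other process currently holds a conflicting fcntl byte-range lock on a file, a non-blocking probe can help. This is a rough sketch only: the function name is_range_locked is hypothetical, and the byte range that tdb actually locks is not known from the trace above, so start and length are placeholders. Note that the probe briefly takes an exclusive lock itself, so it should only be used on a test system.

```python
import errno
import fcntl


def is_range_locked(path, start=0, length=1):
    """Return True if another process holds a conflicting fcntl lock on
    bytes [start, start + length) of path, False otherwise.

    Caveat: when the range is free, this briefly takes and releases an
    exclusive lock on it.
    """
    with open(path, "r+b") as f:
        try:
            # LOCK_NB turns the blocking wait into an immediate error.
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB, length, start)
        except OSError as exc:
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                return True
            raise
        fcntl.lockf(f, fcntl.LOCK_UN, length, start)
        return False
```

Locks held by the calling process itself never conflict, so the probe only reports locks of other processes.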

Not sure how "real life" the ldbmodify test is, maybe one should test with nsupdate against
named to check if https://gitlab.com/samba-team/samba/-/blob/master/source4/dns_server/dlz_bind9.c#L1321
shows the same issue.
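
The ldbmodify loop described above could look roughly like the following. This is a sketch under assumptions, not the reproducer script attached upstream: the DN and the sam.ldb path must be supplied by the caller, the function names are made up, and it assumes that ldbmodify reads LDIF from stdin when no file argument is given. The timeout is only there so that a hanging ldbmodify is noticed instead of blocking the loop forever.

```python
import subprocess


def build_ldif(dn, value):
    """Build an LDIF modify record that replaces 'description' on dn."""
    return (
        f"dn: {dn}\n"
        "changetype: modify\n"
        "replace: description\n"
        f"description: {value}\n"
    )


def modify_loop(dn, sam_ldb, rounds=1000, timeout=30):
    """Alternate description between 'foo' and 'bar' via ldbmodify.

    Returns the index of the round in which ldbmodify exceeded the
    timeout (taken as a hang), or None if all rounds completed.
    """
    for i in range(rounds):
        value = "foo" if i % 2 == 0 else "bar"
        try:
            subprocess.run(
                ["ldbmodify", "-H", sam_ldb],
                input=build_ldif(dn, value),
                text=True,
                check=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return i
    return None
```

To reproduce, modify_loop would be started with the DN of some test object and the path to sam.ldb while "samba-tool domain backup offline" runs in a second shell.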
Comment 8 Arvid Requate univentionstaff 2024-08-06 17:39:00 CEST
Proposal: revert upstream workaround patches made for https://bugzilla.samba.org/show_bug.cgi?id=14676

* https://gitlab.com/samba-team/samba/-/commit/bb3dcd403ce
* https://gitlab.com/samba-team/samba/-/commit/d7c111514ad
* https://gitlab.com/samba-team/samba/-/commit/423f808ff48
* https://gitlab.com/samba-team/samba/-/commit/958931ad379

Up to Debian bookworm we only have lmdb 0.9.24, so it should work without the first and all the subsequent patches.

But we should find a proper solution anyway.
Comment 9 Arvid Requate univentionstaff 2024-10-23 19:33:23 CEST
Created attachment 11255
revert-series-sambabug14676.diff

This is the patch series that I created with git revert of the individual changes. I also needed to revert 739d7e54e780 in between. I'm not sure whether we want some of that committed again on top of the reverts; I guess yes. But first we should check that reverting this fixes the problem described in this bug.
Comment 10 Arvid Requate univentionstaff 2024-10-30 21:32:36 CET
ae71b3fe642636599f5f | revert upstream changes for Samba Bug 14676
b7352db2f72d73554f8b | re-order and adjust following patch context
0e511c2936000b2f6fce | fixup! re-order and adjust following patch context

Successful build                                  
Package: samba                                       
Version: 2:4.21.1-1A~5.2.0.202410291324                   
Branch: 5.2-0

2cded8af8d5 | Entry for release changelog

Tested with reproducer scripts which I attached to the upstream Samba bug.
Comment 11 Julia Bremer univentionstaff 2024-11-12 13:16:19 CET
OK: Patch
OK: Build
OK: Jenkins
OK: No occurrences of "hanging" Samba found in either 5.0 or 5.2
OK: No new ucs-test; the reproducer was not valuable as a single test case
OK: Reproducer reproduced the problem before the patch revert; afterwards the problem is no longer reproducible.
OK: YAML

Verified