Bug 57297 - Samba LDAP connections hang with lmdb backend - due to stuck samba-tool domain backup offline
Status: NEW
Product: UCS
Classification: Unclassified
Component: Samba4
Version: UCS 5.0
Hardware: Other Linux
Importance: P5 normal
Target Milestone: ---
Assigned To: Samba maintainers
QA Contact: Samba maintainers
Depends on: 56434
Blocks:
Reported: 2024-05-17 12:57 CEST by Arvid Requate
Modified: 2024-06-10 22:46 CEST

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 1: Will affect a very few installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.143
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2024042221000091,2024061021000083
Bug group (optional):
Max CVSS v3 score:


Description Arvid Requate univentionstaff 2024-05-17 12:57:56 CEST
In UCS 5.2 test environments we saw Samba LDAP connections hang. The main ldap_server Samba process was in epoll_wait, but the four ldap_server worker child processes were all hanging in
==========
#0  0x00007ff1bc117700 in __GI___libc_fcntl64 (fd=11, cmd=7) at ../sysdeps/unix/sysv/linux/fcntl64.c:49
        sc_ret = -512
        sc_ret = <optimized out>
        ap = {{gp_offset = 16, fp_offset = 32753, overflow_arg_area = 0x7ffceee4d1f0, reg_save_area = 0x7ffceee4d1b0}}
        arg = 0x7ffceee4d200
#1  0x00007ff1bbdeed2c in ?? () from /lib/x86_64-linux-gnu/libtdb.so.1
No symbol table info available.
#2  0x00007ff1bbdef1fc in ?? () from /lib/x86_64-linux-gnu/libtdb.so.1
No symbol table info available.
#3  0x00007ff1bbdf3e39 in ?? () from /lib/x86_64-linux-gnu/libtdb.so.1
No symbol table info available.
#4  0x00007ff1b7ef8456 in partition_metadata_start_trans (module=0x55be421123d0) at ../../source4/dsdb/samdb/ldb_modules/partition_metadata.c:506
        data = 0x55be424b8d70
        tdb = 0x55be42111fb0
#5  0x00007ff1b7ef2a18 in partition_start_trans (module=0x55be421123d0) at ../../source4/dsdb/samdb/ldb_modules/partition.c:1105
        i = 0
        ret = 0
        data = 0x55be424b8d70
==========

A look into sam.ldb.d showed two interesting files: "metadata.tdb.bak-offline" and a zero-byte 'CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb.copy.mdb':
==========
root@master091:~# ls -l /var/lib/samba/private/sam.ldb.d/
insgesamt 24616
-rw-r--r-- 1 root root  7909376 17. Mai 00:17 'CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb'
-rw-r----- 1 root root        0 17. Mai 03:00 'CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb.copy.mdb'
-rw-r--r-- 1 root root  6400128 17. Mai 12:28 'CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r--r-- 1 root root 10465280 17. Mai 00:17 'CN=SCHEMA,CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb'
-rw-r--r-- 1 root root  6400128 17. Mai 12:28 'CN=SCHEMA,CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r--r-- 1 root root  5664768 17. Mai 03:00 'DC=AUTOTEST091,DC=TEST.ldb'
-rw-r--r-- 1 root root  6400128 17. Mai 12:28 'DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r--r-- 1 root root   503808 17. Mai 02:22 'DC=DOMAINDNSZONES,DC=AUTOTEST091,DC=TEST.ldb'
-rw-r--r-- 1 root root  6400128 17. Mai 12:28 'DC=DOMAINDNSZONES,DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r--r-- 1 root root   212992 17. Mai 00:24 'DC=FORESTDNSZONES,DC=AUTOTEST091,DC=TEST.ldb'
-rw-r--r-- 1 root root  6400128 17. Mai 12:28 'DC=FORESTDNSZONES,DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r----- 1 root root   421888 17. Mai 03:00  metadata.tdb
-rw-r----- 1 root root     8192 17. Mai 03:00  metadata.tdb.bak-offline
==========
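
A check for these two kinds of leftover files could be scripted roughly as follows. This is only a sketch: the function name find_backup_leftovers is made up for illustration, and the two file name suffixes are taken from the listing above, not from the Samba source.

```python
from pathlib import Path


def find_backup_leftovers(sam_ldb_d):
    """Return (file name, size in bytes) for files that look like leftovers
    of an interrupted "samba-tool domain backup offline" run.

    Assumption: the suffixes below are simply the ones seen in this report.
    """
    leftovers = []
    for entry in sorted(Path(sam_ldb_d).iterdir()):
        if entry.name.endswith((".copy.mdb", ".bak-offline")):
            leftovers.append((entry.name, entry.stat().st_size))
    return leftovers
```

Calling find_backup_leftovers("/var/lib/samba/private/sam.ldb.d") on the affected system would list the '.ldb.copy.mdb' and 'metadata.tdb.bak-offline' files with their sizes. A hit only says that such files exist, not whether a backup is still running.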

ps showed a stuck mdb_copy process:
==========
 111155 ?        S      0:00  |       \_ /bin/bash /usr/sbin/univention-samba4-backup
 111176 ?        S      0:00  |           \_ /usr/bin/python3 /bin/samba-tool domain backup offline --targetdir=/var/univention-backup/samba
 111242 ?        S      0:00  |               \_ /bin/mdb_copy -n /var/lib/samba/private/sam.ldb.d/CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb /var/lib/samba/private/sam.ldb.d/CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb.copy.mdb
==========

I couldn't quickly make sense of the gdb bt output (the code is optimized), but after killing that process hard, Samba LDAP became responsive again. I removed the '.ldb.copy.mdb' file.
Comment 1 Arvid Requate univentionstaff 2024-05-17 13:00:45 CEST
Either we find out how to make "samba-tool domain backup offline" play nicely with an "online" Samba, or we may need to create a different backup solution (or stop Samba for the duration of the backup).
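
Until that is decided, the duration of such a hang could at least be bounded by running the backup under a watchdog. The following is a sketch, not a fix: run_with_timeout is a hypothetical helper, the timeout value is arbitrary, and killing the whole process group (so that a child like mdb_copy dies too) is my assumption about what is needed. A killed backup will probably leave files such as '.ldb.copy.mdb' behind that need cleaning up.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd in its own session; SIGKILL its process group on timeout.

    Returns the exit code, or None if the command had to be killed.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its process group id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Kill the whole group so that child processes are killed as well.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        return None
```

For example, run_with_timeout(["samba-tool", "domain", "backup", "offline", "--targetdir=/var/univention-backup/samba"], 3600) would give the backup one hour before killing it.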
Comment 2 Arvid Requate univentionstaff 2024-05-17 13:07:13 CEST
Ticket#:2024042221000091 has probably hit the same problem with UCS 5.0-7 (or -6).
Comment 3 Arvid Requate univentionstaff 2024-05-17 14:21:23 CEST
With OpenLDAP mdb_copy seems to be expected to work concurrently even with a live slapd ( https://openldap-technical.openldap.narkive.com/ExQrZ6ks/how-to-take-hot-backup-of-mdb ) but the access pattern in the samba code may be different.

https://wiki.samba.org/index.php/Back_up_and_Restoring_a_Samba_AD_DC#Offline/local_DC_backup explicitly says that the Samba DC doesn't need to be "offline" for the "offline" backup. But apparently the tool has this issue with mdb/lmdb.
Comment 5 Arvid Requate univentionstaff 2024-06-07 16:11:33 CEST
In the past there have been adjustments made to the locking in https://gitlab.com/samba-team/samba/-/blob/master/python/samba/netcmd/domain/backup.py
( https://bugzilla.samba.org/show_bug.cgi?id=14676 ). So maybe it should, or at least could, work without stopping Samba. For UCS 5.2-0, Debian Bookworm currently provides version 0.9.24 of https://packages.debian.org/source/bookworm/lmdb . Maybe a more recent version would help?

See also https://bugs.openldap.org/show_bug.cgi?id=10095 ( https://github.com/openldap/openldap/blob/master/libraries/liblmdb/CHANGES )
Comment 6 Arvid Requate univentionstaff 2024-06-10 18:01:19 CEST
Testing with lmdb 0.9.33 (and Samba 4.18.3) showed that upgrading lmdb does not fix the locking problem. If we do move to a newer lmdb version, I think we should not use 0.9.31, judging by the changelog.
Comment 7 Arvid Requate univentionstaff 2024-06-10 22:46:54 CEST
The code at
https://gitlab.com/samba-team/samba/-/blob/master/python/samba/netcmd/domain/backup.py?ref_type=heads#L1044 
should protect "samba-tool domain backup offline" e.g. against ldbmodify. Yet when I run a simple ldbmodify
loop (setting description: foo and then description: bar again) in parallel to the backup, I can easily
get the backup (mdb_copy) to hang. Characteristically, mdb_copy hangs in pthread_mutex_lock
while some ldbmodify calls output the message "LMDB Stale readers, deleted" before they finally hang too:

Modified 1 records successfully
Modified 1 records successfully
Modified 1 records successfully
LMDB Stale readers, deleted (1)
LMDB Stale readers, deleted (1)
LMDB Stale readers, deleted (1)
LMDB Stale readers, deleted (1)
LMDB Stale readers, deleted (1)
<bash while loop hangs>
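
A loop of this kind could look roughly like the sketch below. The exact commands used for the test are not recorded here, so this is a reconstruction under assumptions: the DN is a placeholder, the helper names are made up, and the timeout is only there to turn a hang into a return value instead of blocking forever. ldbmodify reads the LDIF from stdin when no file is given.

```python
import subprocess


def modify_ldif(dn, value):
    """Build an LDIF modify record that replaces the description attribute."""
    return (
        f"dn: {dn}\n"
        "changetype: modify\n"
        "replace: description\n"
        f"description: {value}\n"
    )


def toggle_description(sam_ldb, dn, rounds, timeout_s=30):
    """Alternate description between foo and bar via ldbmodify.

    Returns the index of the first round that exceeded timeout_s (a likely
    hang), or None if all rounds completed.
    """
    for i in range(rounds):
        value = "foo" if i % 2 == 0 else "bar"
        try:
            subprocess.run(
                ["ldbmodify", "-H", sam_ldb],
                input=modify_ldif(dn, value),
                text=True,
                check=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return i
    return None
```

Running toggle_description("/var/lib/samba/private/sam.ldb", "<DN of a test object>", 1000) while "samba-tool domain backup offline" is active in another shell should correspond to the test described above. Note that subprocess.run kills the hanging ldbmodify when the timeout expires.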

The ldbmodify was hanging in tdb_brlock_retry -> tdb_brlock -> fcntl_lock (all in common/lock.c),
"br" is byte range I think and the last time I checked with strace it was attempting to
lock /var/lib/samba/private/sam.ldb.d/metadata.tdb.
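
To see which process actually holds the lock that ldbmodify (and the ldap_server workers in the description, which sit in fcntl with cmd=7, i.e. F_SETLKW on Linux x86_64) are waiting for, /proc/locks can be filtered by the inode of metadata.tdb. The following is a sketch with made-up helper names. It matches on the inode number only and ignores the device, which is an assumption that is good enough on a single filesystem.

```python
import os


def parse_proc_locks(text):
    """Parse the content of /proc/locks into a list of dicts.

    A "->" after the ordinal marks a request that is blocked by the lock
    listed before it.
    """
    locks = []
    for line in text.splitlines():
        fields = line.split()
        blocked = len(fields) > 1 and fields[1] == "->"
        if blocked:
            fields.pop(1)
        if len(fields) < 8:
            continue
        # fields[5] is "major:minor:inode"; the inode part is decimal.
        inode = int(fields[5].split(":")[2])
        locks.append({
            "blocked": blocked,
            "kind": fields[1],      # e.g. POSIX or FLOCK
            "access": fields[3],    # READ or WRITE
            "pid": int(fields[4]),
            "inode": inode,
            "start": fields[6],
            "end": fields[7],
        })
    return locks


def locks_on_file(path, proc_locks_text):
    """Return the parsed lock entries that refer to the inode of path."""
    inode = os.stat(path).st_ino
    return [l for l in parse_proc_locks(proc_locks_text) if l["inode"] == inode]
```

Usage would be locks_on_file("/var/lib/samba/private/sam.ldb.d/metadata.tdb", open("/proc/locks").read()); entries with blocked set to False are the current holders, and their pid can then be looked up with ps.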

I'm not sure how realistic the ldbmodify test is. It may be worth testing with nsupdate against
named to check whether https://gitlab.com/samba-team/samba/-/blob/master/source4/dns_server/dlz_bind9.c#L1321
shows the same issue.