Univention Bugzilla – Bug 57297
Samba LDAP connections hang with lmdb backend - due to stuck samba-tool domain backup offline
Last modified: 2024-06-10 22:46:54 CEST
In UCS 5.2 test environments we saw Samba LDAP connections hang. The main ldap_server Samba process was in epoll_wait, but the four ldap_server worker child processes were all hanging in

==========
#0 0x00007ff1bc117700 in __GI___libc_fcntl64 (fd=11, cmd=7) at ../sysdeps/unix/sysv/linux/fcntl64.c:49
        sc_ret = -512
        sc_ret = <optimized out>
        ap = {{gp_offset = 16, fp_offset = 32753, overflow_arg_area = 0x7ffceee4d1f0, reg_save_area = 0x7ffceee4d1b0}}
        arg = 0x7ffceee4d200
#1 0x00007ff1bbdeed2c in ?? () from /lib/x86_64-linux-gnu/libtdb.so.1
No symbol table info available.
#2 0x00007ff1bbdef1fc in ?? () from /lib/x86_64-linux-gnu/libtdb.so.1
No symbol table info available.
#3 0x00007ff1bbdf3e39 in ?? () from /lib/x86_64-linux-gnu/libtdb.so.1
No symbol table info available.
#4 0x00007ff1b7ef8456 in partition_metadata_start_trans (module=0x55be421123d0) at ../../source4/dsdb/samdb/ldb_modules/partition_metadata.c:506
        data = 0x55be424b8d70
        tdb = 0x55be42111fb0
#5 0x00007ff1b7ef2a18 in partition_start_trans (module=0x55be421123d0) at ../../source4/dsdb/samdb/ldb_modules/partition.c:1105
        i = 0
        ret = 0
        data = 0x55be424b8d70
==========

A look into sam.ldb.d showed two interesting files: "metadata.tdb.bak-offline" and a zero-byte 'CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb.copy.mdb':

==========
root@master091:~# ls -l /var/lib/samba/private/sam.ldb.d/
insgesamt 24616
-rw-r--r-- 1 root root 7909376 17. Mai 00:17 'CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb'
-rw-r----- 1 root root 0 17. Mai 03:00 'CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb.copy.mdb'
-rw-r--r-- 1 root root 6400128 17. Mai 12:28 'CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r--r-- 1 root root 10465280 17. Mai 00:17 'CN=SCHEMA,CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb'
-rw-r--r-- 1 root root 6400128 17. Mai 12:28 'CN=SCHEMA,CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r--r-- 1 root root 5664768 17. Mai 03:00 'DC=AUTOTEST091,DC=TEST.ldb'
-rw-r--r-- 1 root root 6400128 17. Mai 12:28 'DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r--r-- 1 root root 503808 17. Mai 02:22 'DC=DOMAINDNSZONES,DC=AUTOTEST091,DC=TEST.ldb'
-rw-r--r-- 1 root root 6400128 17. Mai 12:28 'DC=DOMAINDNSZONES,DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r--r-- 1 root root 212992 17. Mai 00:24 'DC=FORESTDNSZONES,DC=AUTOTEST091,DC=TEST.ldb'
-rw-r--r-- 1 root root 6400128 17. Mai 12:28 'DC=FORESTDNSZONES,DC=AUTOTEST091,DC=TEST.ldb-lock'
-rw-r----- 1 root root 421888 17. Mai 03:00 metadata.tdb
-rw-r----- 1 root root 8192 17. Mai 03:00 metadata.tdb.bak-offline
==========

ps showed a stuck mdb_copy process:

==========
111155 ? S 0:00 | \_ /bin/bash /usr/sbin/univention-samba4-backup
111176 ? S 0:00 | \_ /usr/bin/python3 /bin/samba-tool domain backup offline --targetdir=/var/univention-backup/samba
111242 ? S 0:00 | \_ /bin/mdb_copy -n /var/lib/samba/private/sam.ldb.d/CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb /var/lib/samba/private/sam.ldb.d/CN=CONFIGURATION,DC=AUTOTEST091,DC=TEST.ldb.copy.mdb
==========

I couldn't quickly make sense of the gdb bt output (the code is optimized), but after I killed the process hard, Samba LDAP became responsive again. I removed the '.ldb.copy.mdb' file.
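To check other systems for the same symptom, a small script can list files in sam.ldb.d that look like leftovers of an interrupted backup. This is only a sketch: treating the two suffixes from the listing above ('.copy.mdb' and '.bak-offline') as leftover markers is an assumption, the function name is made up for illustration, and such files also exist briefly during a healthy backup run.

```python
from pathlib import Path

# Suffixes taken from the directory listing in this report. Treating them as
# markers of an interrupted backup is an assumption, not documented behaviour.
LEFTOVER_SUFFIXES = (".copy.mdb", ".bak-offline")


def find_backup_leftovers(sam_ldb_d):
    """Return (file name, size in bytes) for files ending in a leftover suffix."""
    hits = []
    for entry in sorted(Path(sam_ldb_d).iterdir()):
        if entry.is_file() and entry.name.endswith(LEFTOVER_SUFFIXES):
            hits.append((entry.name, entry.stat().st_size))
    return hits
```

Calling find_backup_leftovers("/var/lib/samba/private/sam.ldb.d") on the affected system should report the two files noted above. A hit is only suspicious if no backup is running at that moment, so it makes sense to check ps for mdb_copy as well.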
Either we find how to make "samba-tool domain backup offline" play nicely with an "online" Samba, or we may need to create a different backup solution (or stop Samba meanwhile).
Ticket#:2024042221000091 probably has suffered the same experience with UCS 5.0-7 (or -6).
With OpenLDAP, mdb_copy seems to be expected to work concurrently even with a live slapd ( https://openldap-technical.openldap.narkive.com/ExQrZ6ks/how-to-take-hot-backup-of-mdb ), but the access pattern in the Samba code may be different. https://wiki.samba.org/index.php/Back_up_and_Restoring_a_Samba_AD_DC#Offline/local_DC_backup explicitly says that the Samba DC doesn't need to be "offline" for the "offline" backup. But apparently the tool has this issue with mdb/lmdb.
In the past, adjustments have been made to the locking in https://gitlab.com/samba-team/samba/-/blob/master/python/samba/netcmd/domain/backup.py ( https://bugzilla.samba.org/show_bug.cgi?id=14676 ). So maybe it should (or could) work without stopping Samba. Debian Bookworm, as used in UCS 5.2-0, currently has version 0.9.24 of https://packages.debian.org/source/bookworm/lmdb . Maybe a more recent version would help? See also https://bugs.openldap.org/show_bug.cgi?id=10095 ( https://github.com/openldap/openldap/blob/master/libraries/liblmdb/CHANGES )
Testing with lmdb 0.9.33 (and Samba 4.18.3) showed that the locking problem is still not fixed that way (but if we go for a higher version of lmdb, then I think we should not use 0.9.31, judging by the changelog).
The code at https://gitlab.com/samba-team/samba/-/blob/master/python/samba/netcmd/domain/backup.py?ref_type=heads#L1044 should protect "samba-tool domain backup offline" e.g. against ldbmodify. Yet, when I create a simple ldbmodify loop (setting description: foo and then again description: bar) running in parallel to the backup, I can easily get the backup (mdb_copy) to hang. Characteristic of this is that mdb_copy hangs in pthread_mutex_lock while some ldbmodify calls output the message "LMDB Stale readers, deleted" before finally hanging:

==========
Modified 1 records successfully
Modified 1 records successfully
Modified 1 records successfully
LMDB Stale readers, deleted (1)
LMDB Stale readers, deleted (1)
LMDB Stale readers, deleted (1)
LMDB Stale readers, deleted (1)
LMDB Stale readers, deleted (1)
<bash while loop hangs>
==========

The ldbmodify was hanging in tdb_brlock_retry -> tdb_brlock -> fcntl_lock (all in common/lock.c); "br" is byte range, I think, and the last time I checked with strace it was attempting to lock /var/lib/samba/private/sam.ldb.d/metadata.tdb. Not sure how "real life" the ldbmodify test is; maybe one should test with nsupdate against named to check whether https://gitlab.com/samba-team/samba/-/blob/master/source4/dns_server/dlz_bind9.c#L1321 shows the same issue.
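For anyone who wants to repeat the ldbmodify test, the loop can be sketched roughly as follows. This is not the exact bash loop used above but a Python approximation: the function names, the per-call timeout and the idea of treating a timeout as "hung" are assumptions added for illustration. It only relies on ldbmodify accepting a database URL via -H and an LDIF file as argument.

```python
import subprocess
import tempfile


def description_ldif(dn, value):
    """Build an LDIF change record that replaces the description attribute."""
    return (
        f"dn: {dn}\n"
        "changetype: modify\n"
        "replace: description\n"
        f"description: {value}\n"
    )


def toggle_description(dn, sam_ldb, rounds=1000, timeout=30):
    """Alternate description between foo and bar.

    Returns the index of the first round in which ldbmodify did not finish
    within `timeout` seconds (taken here as a sign of the hang), or None if
    all rounds completed.
    """
    for i in range(rounds):
        value = "foo" if i % 2 == 0 else "bar"
        with tempfile.NamedTemporaryFile("w", suffix=".ldif") as ldif:
            ldif.write(description_ldif(dn, value))
            ldif.flush()
            try:
                # subprocess.run kills the child itself when the timeout expires.
                subprocess.run(
                    ["ldbmodify", "-H", sam_ldb, ldif.name],
                    check=True,
                    timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return i
    return None
```

It would be started with the DN of some test object (for example a hypothetical CN=testuser,CN=Users,DC=AUTOTEST091,DC=TEST) and /var/lib/samba/private/sam.ldb while "samba-tool domain backup offline" runs in a second shell. Note that killing a timed-out ldbmodify does not release a stuck mdb_copy; that process still has to be checked separately.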