Univention Bugzilla – Bug 39217
Samba hangs in futex after signal handler
Last modified: 2015-12-16 15:39:44 CET
Created attachment 7117 [details] coredump, memory map, lsof etc. (i386) 2015081921000151 Customer reported at least 4 School-DCs with a hanging samba process after server-password change: # ps aux | grep samba root 3394 0.0 0.0 176 4 ? Ss Aug16 0:00 runsv univention-bind-samba4 root 4160 17.6 1.5 691876 254556 ? S Aug16 829:46 /usr/sbin/samba -D root 5671 0.0 1.1 737896 184948 ? Sl Aug16 0:32 /usr/sbin/named -c /etc/bind/named.conf.samba4 -f -d 0 strace -ttT -s1024 -f -p $PID 53705 futex(0xb6f6b3c0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> * All of them running UCS 4.0-2 errata258 * i386 and x64 affected * samba 2:4.2.3-1.756.201507161454 * 3.16.0-ucs135-686-pae / 3.16.0-ucs135-amd64 Backtrace of the i386 system: (gdb) bt #0 0xb76f9d70 in __kernel_vsyscall () #1 0xb6eeddc1 in __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/i386/i686/../i486/lowlevellock.S:95 #2 0xb6e7fe67 in _L_lock_9721 () from /lib/i386-linux-gnu/i686/cmov/libc.so.6 #3 0xb6e7e636 in *__GI___libc_free (mem=0xb89fa1d8) at malloc.c:3736 #4 0xb6e9c77c in tzset_internal (always=<optimized out>, explicit=<optimized out>) at tzset.c:435 #5 0xb6e9c8b1 in __tz_convert (timer=0xbfa62c0c, use_localtime=1, tp=0xb6f6d6a0) at tzset.c:624 #6 0xb6e9afff in *__GI_localtime (t=0xbfa62c0c) at localtime.c:43 #7 0xb755284d in timeval_str_buf () from /usr/lib/i386-linux-gnu/samba/libsamba-debug.so.0 #8 0xb755365d in dbghdrclass () from /usr/lib/i386-linux-gnu/samba/libsamba-debug.so.0 #9 0xb772315c in ?? () #10 <signal handler called> #11 _int_malloc (av=<optimized out>, bytes=<optimized out>) at malloc.c:4373 #12 0xb6e7e70c in *__GI___libc_malloc (bytes=51) at malloc.c:3660 #13 0xb6f909fc in talloc_named_const () from /usr/lib/i386-linux-gnu/libtalloc.so.2 #14 0xb6fae7e1 in ldb_unpack_data () from /usr/lib/i386-linux-gnu/libldb.so.1 #15 0xb57bf03e in ?? () from /usr/lib/i386-linux-gnu/ldb/modules/ldb/tdb.so #16 0xb6bb525b in ?? () from /usr/lib/i386-linux-gnu/libtdb.so.1 #17 0xb6bb53f8 in tdb_traverse_read () from /usr/lib/i386-linux-gnu/libtdb.so.1 #18 0xb57bf4bb in ltdb_search () from /usr/lib/i386-linux-gnu/ldb/modules/ldb/tdb.so #19 0xb57bdf90 in ?? () from /usr/lib/i386-linux-gnu/ldb/modules/ldb/tdb.so #20 0xb6f831f8 in tevent_common_loop_timer_delay () from /usr/lib/i386-linux-gnu/libtevent.so.0 #21 0xb6f84374 in ?? () from /usr/lib/i386-linux-gnu/libtevent.so.0 #22 0xb6fc14f1 in ?? () from /usr/lib/i386-linux-gnu/libldb.so.1 Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Created attachment 7118 [details] coredump, memory map, lsof etc. (x64) Backtrace of the x64 system (gdb) bt #0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97 #1 0x00007f607dda0458 in _L_lock_9676 () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x00007f607dd9e7b1 in *__GI___libc_free (mem=0x7f607e0aae40) at malloc.c:3736 #3 0x00007f607ddc150d in tzset_internal (always=<optimized out>, explicit=<optimized out>) at tzset.c:435 #4 0x00007f607ddc1649 in __tz_convert (timer=0x7ffc57011e08, use_localtime=1, tp=0x7f607e0ae380) at tzset.c:624 #5 0x00007f60802d200e in timeval_str_buf () from /usr/lib/x86_64-linux-gnu/samba/libsamba-debug.so.0 #6 0x00007f60802d2c92 in dbghdrclass () from /usr/lib/x86_64-linux-gnu/samba/libsamba-debug.so.0 #7 0x00007f608267ba7a in ?? () #8 <signal handler called> #9 0x00007f607dd9b4f1 in _int_free (av=0x7f607e0aae40, p=0x7f6084f055a0) at malloc.c:4965 #10 0x00007f607dd9e7bc in *__GI___libc_free (mem=<optimized out>) at malloc.c:3738 #11 0x00007f607e4d28ec in ?? () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2 #12 0x00007f607e4d2660 in ?? () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2 #13 0x00007f607e4cd524 in _talloc_free () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2 #14 0x00007f60670edb0f in ?? () from /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/tdb.so #15 0x00007f607bacc90f in ?? () from /usr/lib/x86_64-linux-gnu/libtdb.so.1 #16 0x00007f607bacca7b in tdb_traverse_read () from /usr/lib/x86_64-linux-gnu/libtdb.so.1 #17 0x00007f60670ede78 in ltdb_search () from /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/tdb.so #18 0x00007f60670ecbf8 in ?? () from /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/tdb.so #19 0x00007f607e2c4878 in tevent_common_loop_timer_delay () from /usr/lib/x86_64-linux-gnu/libtevent.so.0 #20 0x00007f607e2c57cd in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0 #21 0x00007f607e2c3ef6 in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0 #22 0x00007f607e2c04dd in _tevent_loop_once () from /usr/lib/x86_64-linux-gnu/libtevent.so.0 #23 0x00007f607e6f8525 in ldb_wait () from /usr/lib/x86_64-linux-gnu/libldb.so.1 #24 0x00007f607060e643 in ldapsrv_do_call () from /usr/lib/x86_64-linux-gnu/samba/service/ldap.so #25 0x00007f607060bd07 in ?? () from /usr/lib/x86_64-linux-gnu/samba/service/ldap.so #26 0x00007f607e2c0d92 in tevent_common_loop_immediate () from /usr/lib/x86_64-linux-gnu/libtevent.so.0 #27 0x00007f607e2c57c1 in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0 #28 0x00007f607e2c3ef6 in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0 #29 0x00007f607e2c04dd in _tevent_loop_once () from /usr/lib/x86_64-linux-gnu/libtevent.so.0 #30 0x00007f607e2c069b in tevent_common_loop_wait () from /usr/lib/x86_64-linux-gnu/libtevent.so.0 #31 0x00007f607e2c3e66 in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0 #32 0x00007f6074fb194d in ?? () from /usr/lib/x86_64-linux-gnu/samba/process_model/standard.so #33 0x00007f608118154f in task_server_startup () from /usr/lib/x86_64-linux-gnu/samba/libservice.so.0 #34 0x00007f6081180093 in server_service_startup () from /usr/lib/x86_64-linux-gnu/samba/libservice.so.0 #35 0x00007f608267c6f0 in ?? () #36 0x00007f607dd41ead in __libc_start_main (main=<optimized out>, argc=<optimized out>, ubp_av=<optimized out>, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc570132f8) at libc-start.c:244 #37 0x00007f6082678871 in _start ()
All of them where running samba debuglevel 2. As it looks as if the problem occurs while printing debug in the signal handler, that might be relevant.
Created attachment 7124 [details] bt, memory map, lsof etc. 2015081321000117 Second customer has a similar situation but in this case the krc_server process died with SIGSEGV and that lead to the "signal handler futex situation". Linux ucs-gyjkep 3.16.0-ucs135-amd64 #1 SMP Debian 3.16.7-ckt11-1~bpo70+1.135.201507161851 (2015-07-1 x86_64 GNU/Linux No LSB modules are available. Distributor ID: Univention Description: Univention Corporate Server 4.0-2 errata285 (Walle) Release: 4.0-2 errata285 Codename: Walle samba 2:4.2.3-1.758.201507271307 Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97 Id Target Id Frame * 1 Thread 0x7f2970522720 (LWP 3044) "samba" __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97 #0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97 No locals. #1 0x00007f296ba7c458 in _L_lock_9676 () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. #2 0x00007f296ba7a7b1 in *__GI___libc_free (mem=0x7f296bd86e40) at malloc.c:3736 ignore1 = 128 ignore2 = <optimized out> ignore3 = -512 ar_ptr = 0x7f296bd86e40 p = 0x7f297091b3a0 #3 0x00007f296ba9d50d in tzset_internal (always=<optimized out>, explicit=<optimized out>) at tzset.c:435 is_initialized = 1 tz = <optimized out> #4 0x00007f296ba9d649 in __tz_convert (timer=0x7ffec3c99738, use_localtime=1, tp=0x7f296bd8a380) at tzset.c:624 leap_correction = <optimized out> leap_extra_secs = <optimized out> #5 0x00007f296dfae00e in timeval_str_buf (tp=tp@entry=0x7ffec3c99790, hires=true, dst=dst@entry=0x7ffec3c997a0) at ../lib/util/time_basic.c:51 t = 1439987657 tm = <optimized out> len = <optimized out> #6 0x00007f296dfaec92 in dbghdrclass (level=level@entry=0, cls=cls@entry=0, location=location@entry=0x7f296fd064a8 "../lib/util/fault.c:78", func=func@entry=0x7f296fd06660 "fault_report") at ../lib/util/debug.c:991 old_errno = 12 header_str = "\240\235\311\303\001\000\000\000\260\234\311\303\376\177\000\000samba_kd;M\316o)\177\000\000client_access\000\275\244j/\271\204\272\213{\300\071ݙ\271:\206\026\027阽\264\275mvT\257-\271\266\314\003\233|\003\230\350V\017\224-Ѹ@\027k\266y\004\aj\374\271L;\234\006\344w\271\221pP\000\000\000@\231\311\303\376\177\000\000\260\024sd)\177\000\000\000\312\372m)\177\000\000\000\305Tp)\177\000\000\000\000\000\000\376\177\000\000\377\377\377\377\000\000\000\000}z.p\246-\237\337\001\000\000\000\000\000\000\000\360\027#", '\000' <repeats 13 times>"\220, \004\000\000\000\000\000\000msNPCall\000\340\315o)\177\000\000\362\231\023p)\177\000\000\001\000\000\000)\177\000\000\000\000\000\000\000\000\000\000\240\002\000\000\000\000\000\000\000\312\372m)\177\000\000\000\000\000\000\000\000\000\000\340\004\033n)"... hs_len = <optimized out> tv = {tv_sec = 1439987657, tv_usec = 396226} tvbuf = {buf = "\206Hd\213\000\000\000\000\002\\\023p)\177\000\000\000\000\000\000\000\000\000\000\330\313Rp)\177\000\000\001\000\000\000)\177\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000H\332Up)\177\000\000\000a\315q)\177\000\000\200\300\032l)\177\000\000\327u\023۪\360,\226\036", '\000' <repeats 15 times>"\240, \335Up)\177\000\000\240\230\311\303\376\177\000\000\270\230\311\303\376\177\000"} #7 0x00007f296fcf9699 in fault_report (sig=<optimized out>) at ../lib/util/fault.c:78 counter = 1 #8 sig_fault (sig=11) at ../lib/util/fault.c:94 No locals. #9 <signal handler called> No symbol table info available. #10 0x00007f296ba76f08 in _int_free (av=0x7f296bd86e40, p=0x7f2971c2b080) at malloc.c:4993 size = 144 nextchunk = 0x7f2971c2b110 nextsize = 1168 prevsize = <optimized out> bck = 0x6974617453676e69 fwd = 0x6c6c6143504e736d errstr = <optimized out> __func__ = "_int_free" #11 0x00007f296ba7a7bc in *__GI___libc_free (mem=<optimized out>) at malloc.c:3738 ar_ptr = 0x7f296bd86e40 p = 0x7f2972563000 #12 0x00007f296533b356 in der_free_octet_string () from /usr/lib/x86_64-linux-gnu/libasn1.so.8 No symbol table info available. #13 0x00007f29652dc6bd in free_PA_DATA () from /usr/lib/x86_64-linux-gnu/libasn1.so.8 No symbol table info available. #14 0x00007f29652df20c in free_METHOD_DATA () from /usr/lib/x86_64-linux-gnu/libasn1.so.8 No symbol table info available. #15 0x00007f296040db37 in ?? () from /usr/lib/x86_64-linux-gnu/libkdc.so.2 No symbol table info available. #16 0x00007f296041a271 in ?? () from /usr/lib/x86_64-linux-gnu/libkdc.so.2 No symbol table info available. #17 0x00007f296041a60e in krb5_kdc_process_krb5_request () from /usr/lib/x86_64-linux-gnu/libkdc.so.2 No symbol table info available. #18 0x00007f2960c5bb5a in kdc_process (kdc=0x7f2971c4ada0, mem_ctx=0x7f2970db0fa0, input=0x7f2970db0fa8, reply=0x7f2970db0fb8, peer_addr=0x7f2970f11710, my_addr=<optimized out>, datagram_reply=0) at ../source4/kdc/kdc.c:161 ret = <optimized out> pa = <optimized out> ss = {ss_family = 2, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>} k5_reply = {length = 0, data = 0x0} __FUNCTION__ = "kdc_process" ... ...
Ticket#2015082421000169 Next one. This time kdc hung in futex without segfault probably because of signal_handler.
Regarding the situation in Comment 3: /usr/lib/x86_64-linux-gnu/libkdc.so.2 is from heimdal.
Created attachment 7130 [details] coredump of kdc_server in futex Release: 4.0-3 errata288 Linux master 3.16.0-ucs135-amd64 #1 SMP Debian 3.16.7-ckt11-1~bpo70+1.135.201507161851 (2015-07-1 x86_64 GNU/Linux samba 2:4.2.3-1.758.201507271307
Does it still happen?
I guess this has been fixed by building Samba against the internal Heimdal snapshot shipped with the Samba source package. At least we have not seen any reports since http://errata.software-univention.de/ucs/4.0/289.html has been released.
(In reply to Arvid Requate from comment #8) > I guess this has been fixed by building Samba against the internal Heimdal > snapshot shipped with the Samba source package. At least we have not seen > any reports since http://errata.software-univention.de/ucs/4.0/289.html has > been released. Unfortunately not -> 2015111621000392: 4.0-3-Errata336 14039 ? S 376:56 /usr/sbin/samba -D root@ucs-3:/etc/cron.d# strace -p 14039 Process 14039 attached - interrupt to quit futex(0x7f2106390e40, FUTEX_WAIT_PRIVATE, 2, NULL^C <unfinished ...> Linux ucs-3 3.16.0-ucs135-amd64 #1 SMP Debian 3.16.7-ckt11-1~bpo70+1.135.201507161851 (2015-07-1 x86_64 GNU/Linux
@Arvid: Perhaps we can simply trigger/reproduce this in a testing environment when rotating the password in a very high interval?
Happens here: 2015081921000151 occassionally too.
Ok, I propose to adjust the samba (or more specifically samba-ad-dc) init script to double check that everything is termintated at the end of the "stop" action. We already to a start-stop-daemon with a --retry plan ob "TERM/15/KILL" but apparently this doesn't cover the problem. We also give a PID-file with --pid, maybe that's a bad idea. Anyway, we should explicitly check for left-over processes after start-stop-daemon --stop to make sure that the problem doesn't come from that point. If something is left, we should log it and kill it for good. It would be ideal if we had an environment where the issue is reproducable. Ticket 2015081921000151 could be a candidate?
happens in several customer environments from our partner Onex: 2015102121000204 Mr. Frey could be a candidate to get in contact for testing purposes
happened again at a customer of our partner: 2015102121000204
happend most likely again in: 2015111821000638
> happend most likely again in: 2015111821000638 I cannot find any indication supporting this statement at that ticket. Please add the required information to that ticket if available.
Thie init script has been adjusted as proposed in Comment 12. Advisory: samba.yaml
Tests: OK. Unfortunately, I'm unable to reproduce the original bug. Code review: OK YAML: OK (small adjustment: r66137)
<http://errata.software-univention.de/ucs/4.0/374.html>