Bug 39217 - Samba hangs in futex after signal handler
Samba hangs in futex after signal handler
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Samba4
UCS 4.0
Other Linux
: P5 normal (vote)
: UCS 4.0-4-errata
Assigned To: Arvid Requate
Stefan Gohmann
:
Depends on:
Blocks: 40132
  Show dependency treegraph
 
Reported: 2015-08-19 10:28 CEST by Janis Meybohm
Modified: 2015-12-16 15:39 CET (History)
7 users (show)

See Also:
What kind of report is it?: ---
What type of bug is this?: ---
Who will be affected by this bug?: ---
How will those affected feel about the bug?: ---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments
coredump, memory map, lsof etc. (i386) (33.18 MB, application/x-compressed-tar)
2015-08-19 10:28 CEST, Janis Meybohm
Details
coredump, memory map, lsof etc. (x64) (19.08 MB, application/x-compressed-tar)
2015-08-19 10:29 CEST, Janis Meybohm
Details
bt, memory map, lsof etc. (16.46 KB, application/x-compressed-tar)
2015-08-20 16:47 CEST, Janis Meybohm
Details
coredump of kdc_server in futex (7.52 MB, application/gzip)
2015-08-24 17:03 CEST, Janis Meybohm
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Janis Meybohm univentionstaff 2015-08-19 10:28:14 CEST
Created attachment 7117 [details]
coredump, memory map, lsof etc. (i386)

2015081921000151

Customer reported at least 4 School-DCs with a hanging samba process after server-password change:
# ps aux | grep samba
root      3394  0.0  0.0    176     4 ?        Ss   Aug16   0:00 runsv univention-bind-samba4
root      4160 17.6  1.5 691876 254556 ?       S    Aug16 829:46 /usr/sbin/samba -D
root      5671  0.0  1.1 737896 184948 ?       Sl   Aug16   0:32 /usr/sbin/named -c /etc/bind/named.conf.samba4 -f -d 0

strace -ttT -s1024 -f -p $PID
53705 futex(0xb6f6b3c0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>



* All of them running UCS 4.0-2 errata258
* i386 and x64 affected
* samba 2:4.2.3-1.756.201507161454
* 3.16.0-ucs135-686-pae / 3.16.0-ucs135-amd64

Backtrace of the i386 system:
(gdb) bt
#0  0xb76f9d70 in __kernel_vsyscall ()
#1  0xb6eeddc1 in __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/i386/i686/../i486/lowlevellock.S:95
#2  0xb6e7fe67 in _L_lock_9721 () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#3  0xb6e7e636 in *__GI___libc_free (mem=0xb89fa1d8) at malloc.c:3736
#4  0xb6e9c77c in tzset_internal (always=<optimized out>, explicit=<optimized out>) at tzset.c:435
#5  0xb6e9c8b1 in __tz_convert (timer=0xbfa62c0c, use_localtime=1, tp=0xb6f6d6a0) at tzset.c:624
#6  0xb6e9afff in *__GI_localtime (t=0xbfa62c0c) at localtime.c:43
#7  0xb755284d in timeval_str_buf () from /usr/lib/i386-linux-gnu/samba/libsamba-debug.so.0
#8  0xb755365d in dbghdrclass () from /usr/lib/i386-linux-gnu/samba/libsamba-debug.so.0
#9  0xb772315c in ?? ()
#10 <signal handler called>
#11 _int_malloc (av=<optimized out>, bytes=<optimized out>) at malloc.c:4373
#12 0xb6e7e70c in *__GI___libc_malloc (bytes=51) at malloc.c:3660
#13 0xb6f909fc in talloc_named_const () from /usr/lib/i386-linux-gnu/libtalloc.so.2
#14 0xb6fae7e1 in ldb_unpack_data () from /usr/lib/i386-linux-gnu/libldb.so.1
#15 0xb57bf03e in ?? () from /usr/lib/i386-linux-gnu/ldb/modules/ldb/tdb.so
#16 0xb6bb525b in ?? () from /usr/lib/i386-linux-gnu/libtdb.so.1
#17 0xb6bb53f8 in tdb_traverse_read () from /usr/lib/i386-linux-gnu/libtdb.so.1
#18 0xb57bf4bb in ltdb_search () from /usr/lib/i386-linux-gnu/ldb/modules/ldb/tdb.so
#19 0xb57bdf90 in ?? () from /usr/lib/i386-linux-gnu/ldb/modules/ldb/tdb.so
#20 0xb6f831f8 in tevent_common_loop_timer_delay () from /usr/lib/i386-linux-gnu/libtevent.so.0
#21 0xb6f84374 in ?? () from /usr/lib/i386-linux-gnu/libtevent.so.0
#22 0xb6fc14f1 in ?? () from /usr/lib/i386-linux-gnu/libldb.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Comment 1 Janis Meybohm univentionstaff 2015-08-19 10:29:12 CEST
Created attachment 7118 [details]
coredump, memory map, lsof etc. (x64)

Backtrace of the x64 system
(gdb) bt
#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1  0x00007f607dda0458 in _L_lock_9676 () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f607dd9e7b1 in *__GI___libc_free (mem=0x7f607e0aae40) at malloc.c:3736
#3  0x00007f607ddc150d in tzset_internal (always=<optimized out>, explicit=<optimized out>) at tzset.c:435
#4  0x00007f607ddc1649 in __tz_convert (timer=0x7ffc57011e08, use_localtime=1, tp=0x7f607e0ae380) at tzset.c:624
#5  0x00007f60802d200e in timeval_str_buf () from /usr/lib/x86_64-linux-gnu/samba/libsamba-debug.so.0
#6  0x00007f60802d2c92 in dbghdrclass () from /usr/lib/x86_64-linux-gnu/samba/libsamba-debug.so.0
#7  0x00007f608267ba7a in ?? ()
#8  <signal handler called>
#9  0x00007f607dd9b4f1 in _int_free (av=0x7f607e0aae40, p=0x7f6084f055a0) at malloc.c:4965
#10 0x00007f607dd9e7bc in *__GI___libc_free (mem=<optimized out>) at malloc.c:3738
#11 0x00007f607e4d28ec in ?? () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2
#12 0x00007f607e4d2660 in ?? () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2
#13 0x00007f607e4cd524 in _talloc_free () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2
#14 0x00007f60670edb0f in ?? () from /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/tdb.so
#15 0x00007f607bacc90f in ?? () from /usr/lib/x86_64-linux-gnu/libtdb.so.1
#16 0x00007f607bacca7b in tdb_traverse_read () from /usr/lib/x86_64-linux-gnu/libtdb.so.1
#17 0x00007f60670ede78 in ltdb_search () from /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/tdb.so
#18 0x00007f60670ecbf8 in ?? () from /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/tdb.so
#19 0x00007f607e2c4878 in tevent_common_loop_timer_delay () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#20 0x00007f607e2c57cd in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#21 0x00007f607e2c3ef6 in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#22 0x00007f607e2c04dd in _tevent_loop_once () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#23 0x00007f607e6f8525 in ldb_wait () from /usr/lib/x86_64-linux-gnu/libldb.so.1
#24 0x00007f607060e643 in ldapsrv_do_call () from /usr/lib/x86_64-linux-gnu/samba/service/ldap.so
#25 0x00007f607060bd07 in ?? () from /usr/lib/x86_64-linux-gnu/samba/service/ldap.so
#26 0x00007f607e2c0d92 in tevent_common_loop_immediate () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#27 0x00007f607e2c57c1 in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#28 0x00007f607e2c3ef6 in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#29 0x00007f607e2c04dd in _tevent_loop_once () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#30 0x00007f607e2c069b in tevent_common_loop_wait () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#31 0x00007f607e2c3e66 in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#32 0x00007f6074fb194d in ?? () from /usr/lib/x86_64-linux-gnu/samba/process_model/standard.so
#33 0x00007f608118154f in task_server_startup () from /usr/lib/x86_64-linux-gnu/samba/libservice.so.0
#34 0x00007f6081180093 in server_service_startup () from /usr/lib/x86_64-linux-gnu/samba/libservice.so.0
#35 0x00007f608267c6f0 in ?? ()
#36 0x00007f607dd41ead in __libc_start_main (main=<optimized out>, argc=<optimized out>, ubp_av=<optimized out>, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc570132f8) at libc-start.c:244
#37 0x00007f6082678871 in _start ()
Comment 2 Janis Meybohm univentionstaff 2015-08-19 10:30:54 CEST
All of them where running samba debuglevel 2. As it looks as if the problem occurs while printing debug in the signal handler, that might be relevant.
Comment 3 Janis Meybohm univentionstaff 2015-08-20 16:47:12 CEST
Created attachment 7124 [details]
bt, memory map, lsof etc.

2015081321000117

Second customer has a similar situation but in this case the krc_server process died with SIGSEGV and that lead to the "signal handler futex situation".

Linux ucs-gyjkep 3.16.0-ucs135-amd64 #1 SMP Debian 3.16.7-ckt11-1~bpo70+1.135.201507161851 (2015-07-1 x86_64 GNU/Linux

No LSB modules are available.
Distributor ID: Univention
Description:    Univention Corporate Server 4.0-2 errata285 (Walle)
Release:        4.0-2 errata285
Codename:       Walle

samba    2:4.2.3-1.758.201507271307

Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
  Id   Target Id         Frame 
* 1    Thread 0x7f2970522720 (LWP 3044) "samba" __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
No locals.
#1  0x00007f296ba7c458 in _L_lock_9676 () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2  0x00007f296ba7a7b1 in *__GI___libc_free (mem=0x7f296bd86e40) at malloc.c:3736
        ignore1 = 128
        ignore2 = <optimized out>
        ignore3 = -512
        ar_ptr = 0x7f296bd86e40
        p = 0x7f297091b3a0
#3  0x00007f296ba9d50d in tzset_internal (always=<optimized out>, explicit=<optimized out>) at tzset.c:435
        is_initialized = 1
        tz = <optimized out>
#4  0x00007f296ba9d649 in __tz_convert (timer=0x7ffec3c99738, use_localtime=1, tp=0x7f296bd8a380) at tzset.c:624
        leap_correction = <optimized out>
        leap_extra_secs = <optimized out>
#5  0x00007f296dfae00e in timeval_str_buf (tp=tp@entry=0x7ffec3c99790, hires=true, dst=dst@entry=0x7ffec3c997a0) at ../lib/util/time_basic.c:51
        t = 1439987657
        tm = <optimized out>
        len = <optimized out>
#6  0x00007f296dfaec92 in dbghdrclass (level=level@entry=0, cls=cls@entry=0, location=location@entry=0x7f296fd064a8 "../lib/util/fault.c:78", func=func@entry=0x7f296fd06660 "fault_report") at ../lib/util/debug.c:991
        old_errno = 12
        header_str = "\240\235\311\303\001\000\000\000\260\234\311\303\376\177\000\000samba_kd;M\316o)\177\000\000client_access\000\275\244j/\271\204\272\213{\300\071ݙ\271:\206\026\027阽\264\275mvT\257-\271\266\314\003\233|\003\230\350V\017\224-Ѹ@\027k\266y\004\aj\374\271L;\234\006\344w\271\221pP\000\000\000@\231\311\303\376\177\000\000\260\024sd)\177\000\000\000\312\372m)\177\000\000\000\305Tp)\177\000\000\000\000\000\000\376\177\000\000\377\377\377\377\000\000\000\000}z.p\246-\237\337\001\000\000\000\000\000\000\000\360\027#", '\000' <repeats 13 times>"\220, \004\000\000\000\000\000\000msNPCall\000\340\315o)\177\000\000\362\231\023p)\177\000\000\001\000\000\000)\177\000\000\000\000\000\000\000\000\000\000\240\002\000\000\000\000\000\000\000\312\372m)\177\000\000\000\000\000\000\000\000\000\000\340\004\033n)"...
        hs_len = <optimized out>
        tv = {tv_sec = 1439987657, tv_usec = 396226}
        tvbuf = {buf = "\206Hd\213\000\000\000\000\002\\\023p)\177\000\000\000\000\000\000\000\000\000\000\330\313Rp)\177\000\000\001\000\000\000)\177\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000H\332Up)\177\000\000\000a\315q)\177\000\000\200\300\032l)\177\000\000\327u\023۪\360,\226\036", '\000' <repeats 15 times>"\240, \335Up)\177\000\000\240\230\311\303\376\177\000\000\270\230\311\303\376\177\000"}
#7  0x00007f296fcf9699 in fault_report (sig=<optimized out>) at ../lib/util/fault.c:78
        counter = 1
#8  sig_fault (sig=11) at ../lib/util/fault.c:94
No locals.
#9  <signal handler called>
No symbol table info available.
#10 0x00007f296ba76f08 in _int_free (av=0x7f296bd86e40, p=0x7f2971c2b080) at malloc.c:4993
        size = 144
        nextchunk = 0x7f2971c2b110
        nextsize = 1168
        prevsize = <optimized out>
        bck = 0x6974617453676e69
        fwd = 0x6c6c6143504e736d
        errstr = <optimized out>
        __func__ = "_int_free"
#11 0x00007f296ba7a7bc in *__GI___libc_free (mem=<optimized out>) at malloc.c:3738
        ar_ptr = 0x7f296bd86e40
        p = 0x7f2972563000
#12 0x00007f296533b356 in der_free_octet_string () from /usr/lib/x86_64-linux-gnu/libasn1.so.8
No symbol table info available.
#13 0x00007f29652dc6bd in free_PA_DATA () from /usr/lib/x86_64-linux-gnu/libasn1.so.8
No symbol table info available.
#14 0x00007f29652df20c in free_METHOD_DATA () from /usr/lib/x86_64-linux-gnu/libasn1.so.8
No symbol table info available.
#15 0x00007f296040db37 in ?? () from /usr/lib/x86_64-linux-gnu/libkdc.so.2
No symbol table info available.
#16 0x00007f296041a271 in ?? () from /usr/lib/x86_64-linux-gnu/libkdc.so.2
No symbol table info available.
#17 0x00007f296041a60e in krb5_kdc_process_krb5_request () from /usr/lib/x86_64-linux-gnu/libkdc.so.2
No symbol table info available.
#18 0x00007f2960c5bb5a in kdc_process (kdc=0x7f2971c4ada0, mem_ctx=0x7f2970db0fa0, input=0x7f2970db0fa8, reply=0x7f2970db0fb8, peer_addr=0x7f2970f11710, my_addr=<optimized out>, datagram_reply=0) at ../source4/kdc/kdc.c:161
        ret = <optimized out>
        pa = <optimized out>
        ss = {ss_family = 2, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>}
        k5_reply = {length = 0, data = 0x0}
        __FUNCTION__ = "kdc_process"
...
...
Comment 4 Janis Meybohm univentionstaff 2015-08-24 14:30:33 CEST
Ticket#2015082421000169

Next one. This time kdc hung in futex without segfault probably because of signal_handler.
Comment 5 Arvid Requate univentionstaff 2015-08-24 16:54:24 CEST
Regarding the situation in Comment 3:

/usr/lib/x86_64-linux-gnu/libkdc.so.2 is from heimdal.
Comment 6 Janis Meybohm univentionstaff 2015-08-24 17:03:02 CEST
Created attachment 7130 [details]
coredump of kdc_server in futex

Release:        4.0-3 errata288

Linux master 3.16.0-ucs135-amd64 #1 SMP Debian 3.16.7-ckt11-1~bpo70+1.135.201507161851 (2015-07-1 x86_64 GNU/Linux

samba    2:4.2.3-1.758.201507271307
Comment 7 Stefan Gohmann univentionstaff 2015-10-20 06:32:43 CEST
Does it still happen?
Comment 8 Arvid Requate univentionstaff 2015-11-17 15:24:00 CET
I guess this has been fixed by building Samba against the internal Heimdal snapshot shipped with the Samba source package. At least we have not seen any reports since http://errata.software-univention.de/ucs/4.0/289.html has been released.
Comment 9 Tim Petersen univentionstaff 2015-11-18 09:42:22 CET
(In reply to Arvid Requate from comment #8)
> I guess this has been fixed by building Samba against the internal Heimdal
> snapshot shipped with the Samba source package. At least we have not seen
> any reports since http://errata.software-univention.de/ucs/4.0/289.html has
> been released.

Unfortunately not -> 2015111621000392:
4.0-3-Errata336
14039 ?        S    376:56 /usr/sbin/samba -D
root@ucs-3:/etc/cron.d# strace -p 14039
    Process 14039 attached - interrupt to quit
    futex(0x7f2106390e40, FUTEX_WAIT_PRIVATE, 2, NULL^C <unfinished ...>


Linux ucs-3 3.16.0-ucs135-amd64 #1 SMP Debian 3.16.7-ckt11-1~bpo70+1.135.201507161851 (2015-07-1 x86_64 GNU/Linux
Comment 10 Tim Petersen univentionstaff 2015-11-18 09:47:20 CET
@Arvid: Perhaps we can simply trigger/reproduce this in a testing environment when rotating the password in a very high interval?
Comment 11 Jens Thorp-Hansen univentionstaff 2015-11-19 10:27:30 CET
Happens here: 2015081921000151 occassionally too.
Comment 12 Arvid Requate univentionstaff 2015-11-19 12:03:18 CET
Ok, I propose to adjust the samba (or more specifically samba-ad-dc) init script to double check that everything is termintated at the end of the "stop" action. We already to a start-stop-daemon with a --retry plan ob "TERM/15/KILL" but apparently this doesn't cover the problem. We also give a PID-file with --pid, maybe that's a bad idea.

Anyway, we should explicitly check for left-over processes after start-stop-daemon --stop to make sure that the problem doesn't come from that point. If something is left, we should log it and kill it for good.

It would be ideal if we had an environment where the issue is reproducable. Ticket 2015081921000151 could be a candidate?
Comment 13 Stephan Luft univentionstaff 2015-11-23 16:09:50 CET
happens in several customer environments from our partner Onex: 2015102121000204 Mr. Frey could be a candidate to get in contact for testing purposes
Comment 14 Jens Thorp-Hansen univentionstaff 2015-11-30 09:40:54 CET
happened again at a customer of our partner: 2015102121000204
Comment 15 Jens Thorp-Hansen univentionstaff 2015-11-30 11:36:51 CET
happend most likely again in: 2015111821000638
Comment 16 Arvid Requate univentionstaff 2015-11-30 14:32:44 CET
> happend most likely again in: 2015111821000638

I cannot find any indication supporting this statement at that ticket. Please add the required information to that ticket if available.
Comment 17 Arvid Requate univentionstaff 2015-11-30 16:24:48 CET
Thie init script has been adjusted as proposed in Comment 12.

Advisory: samba.yaml
Comment 18 Stefan Gohmann univentionstaff 2015-12-08 07:22:53 CET
Tests: OK. Unfortunately, I'm unable to reproduce the original bug.

Code review: OK

YAML: OK (small adjustment: r66137)
Comment 19 Janek Walkenhorst univentionstaff 2015-12-16 15:39:44 CET
<http://errata.software-univention.de/ucs/4.0/374.html>