Bug 49593 - Samba 4.10 panic: Bad talloc magic value - unknown value (from dsdb_module_am_system)
Samba 4.10 panic: Bad talloc magic value - unknown value (from dsdb_module_a...
Status: NEW
Product: UCS
Classification: Unclassified
Component: Samba4
UCS 4.4
Other Linux
: P5 normal (vote)
: ---
Assigned To: Samba maintainers
Samba maintainers
:
Depends on: 39806
Blocks:
  Show dependency treegraph
 
Reported: 2019-06-04 14:29 CEST by Arvid Requate
Modified: 2022-01-10 11:49 CET (History)
7 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 7: Crash: Bug causes crash or data loss
Who will be affected by this bug?: 1: Will affect a very few installed domains
How will those affected feel about the bug?: 3: A User would likely not purchase the product
User Pain: 0.120
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2019060321000589, 2019060421000434, 2019110821000999, 2020041421000523, 2021102921000312, 2021112421000212
Bug group (optional):
Max CVSS v3 score:


Attachments
bug49593-idea2.diff (1.09 KB, patch)
2021-11-04 14:20 CET, Arvid Requate
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Arvid Requate univentionstaff 2019-06-04 14:29:49 CEST
We have reports of Samba 4.10 coredumps which result in downtimes for all Samba related services delivered by the respective UCS Samba/AD server:

=========================================================================
[2019/05/30 13:02:31.927173,  0, pid=1531] ../../lib/util/fault.c:128(smb_panic_default)
  smb_panic_default: PANIC (pid 1531): Bad talloc magic value - unknown value
   #1 /usr/lib/x86_64-linux-gnu/libsamba-util.so.0(smb_panic+0x4b) [0x7f6ff19e503b]
account01
[2019/05/30 08:06:45.555284,  0, pid=1269] ../../lib/util/fault.c:128(smb_panic_default)
  smb_panic_default: PANIC (pid 1269): Bad talloc magic value - unknown value
   #1 /usr/lib/x86_64-linux-gnu/libsamba-util.so.0(smb_panic+0x4b) [0x7f818675d03b]
account02
[2019/05/30 22:39:52.296122,  0, pid=2054] ../../lib/util/fault.c:128(smb_panic_default)
  smb_panic_default: PANIC (pid 2054): Bad talloc magic value - unknown value
   #1 /usr/lib/x86_64-linux-gnu/libsamba-util.so.0(smb_panic+0x4b) [0x7fe6d29e503b]
=========================================================================

log.samba shows a Samba panic stack trace (we don't have a full core dump stack trace yet):
=========================================================================
[2019/05/31 05:45:15.216799,  0, pid=2881] ../../source4/lib/cmdline/popt_common.c:74(popt_s4_talloc_log_fn)
  Bad talloc magic value - unknown value
[2019/05/31 05:45:15.216935,  0, pid=2881] ../../lib/util/fault.c:128(smb_panic_default)
  smb_panic_default: PANIC (pid 2881): Bad talloc magic value - unknown value
[2019/05/31 05:45:15.223151,  0, pid=2881] ../../lib/util/fault.c:261(log_stack_trace)
  BACKTRACE: 58 stack frames:
   #0 /usr/lib/x86_64-linux-gnu/libsamba-util.so.0(log_stack_trace+0x2d) [0x7f3843b0ef1d]
   #1 /usr/lib/x86_64-linux-gnu/libsamba-util.so.0(smb_panic+0x4b) [0x7f3843b0f03b]
   #2 /usr/lib/x86_64-linux-gnu/libtalloc.so.2(talloc_check_name+0x63) [0x7f38364e3d53]
   #3 /usr/lib/x86_64-linux-gnu/samba/libdsdb-module.so.0(dsdb_module_am_system+0x27) [0x7f382c222ee7]
   #4 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/samba/acl.so(+0x7012) [0x7f38296b6012]
   #5 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #6 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/samba/descriptor.so(+0x30c3) [0x7f3828e960c3]
   #7 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #8 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/samba/extended_dn_in.so(+0x2788) [0x7f382846e788]
   #9 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #10 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/asq.so(+0x127b) [0x7f382ad0527b]
   #11 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #12 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/server_sort.so(+0x1693) [0x7f38220d6693]
   #13 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #14 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/samba/anr.so(+0x1b52) [0x7f38292a6b52]
   #15 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #16 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/samba/ranged_results.so(+0x15bb) [0x7f382598e5bb]
   #17 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #18 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/samba/vlv.so(+0x27cb) [0x7f38226e17cb]
   #19 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #20 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/samba/paged_results.so(+0x233b) [0x7f382640c33b]
   #21 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #22 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/samba/dirsync.so(+0x29d3) [0x7f3828c8d9d3]
   #23 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #24 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/samba/lazy_commit.so(+0x1673) [0x7f3827849673]
   #25 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #26 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/samba/dsdb_notification.so(+0x18c3) [0x7f382887a8c3]
   #27 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #28 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/samba/rootdse.so(+0x7839) [0x7f382535b839]
   #29 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #30 /usr/lib/x86_64-linux-gnu/ldb/modules/ldb/samba/resolve_oids.so(+0x222b) [0x7f382556922b]
   #31 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_next_request+0x172) [0x7f38366fe272]
   #32 /usr/lib/x86_64-linux-gnu/libldb.so.1(+0x214bb) [0x7f38367154bb]
   #33 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_request+0x8f8) [0x7f3836714ea8]
   #34 /usr/lib/x86_64-linux-gnu/libldb.so.1(ldb_search+0x1cd) [0x7f383671596d]
   #35 /usr/lib/x86_64-linux-gnu/samba/libsamdb-common.so.0(gendb_search_v+0x147) [0x7f38401d92d7]
   #36 /usr/lib/x86_64-linux-gnu/samba/libsamdb-common.so.0(gendb_search+0x87) [0x7f38401d9457]
   #37 /usr/lib/x86_64-linux-gnu/libsamdb.so.0(+0xbf13) [0x7f383e0fbf13]
   #38 /usr/lib/x86_64-linux-gnu/libsamdb.so.0(DsCrackNameOneName+0x18e) [0x7f383e0facae]
   #39 /usr/lib/x86_64-linux-gnu/libsamdb.so.0(dcesrv_drsuapi_CrackNamesByNameFormat+0xc0) [0x7f383e0fd450]
   #40 /usr/lib/x86_64-linux-gnu/libdcerpc-server.so.0(+0x5a6a6) [0x7f382bbae6a6]
   #41 /usr/lib/x86_64-linux-gnu/libdcerpc-server.so.0(+0x26c8c) [0x7f382bb7ac8c]
   #42 /usr/lib/x86_64-linux-gnu/libdcerpc-binding.so.0(+0x146af) [0x7f383cc546af]
   #43 /usr/lib/x86_64-linux-gnu/samba/libsamba-sockets.so.0(+0xb2e9) [0x7f383e3142e9]
   #44 /usr/lib/x86_64-linux-gnu/samba/libsamba-sockets.so.0(+0xa0bb) [0x7f383e3130bb]
   #45 /usr/lib/x86_64-linux-gnu/libtevent.so.0(tevent_common_invoke_immediate_handler+0x111) [0x7f3835427f11]
   #46 /usr/lib/x86_64-linux-gnu/libtevent.so.0(tevent_common_loop_immediate+0x1e) [0x7f3835427f5e]
   #47 /usr/lib/x86_64-linux-gnu/libtevent.so.0(+0xcdbd) [0x7f383542ddbd]
   #48 /usr/lib/x86_64-linux-gnu/libtevent.so.0(+0xafd7) [0x7f383542bfd7]
   #49 /usr/lib/x86_64-linux-gnu/libtevent.so.0(_tevent_loop_once+0xbd) [0x7f383542723d]
   #50 /usr/lib/x86_64-linux-gnu/libtevent.so.0(tevent_common_loop_wait+0x1b) [0x7f383542749b]
   #51 /usr/lib/x86_64-linux-gnu/libtevent.so.0(+0xaf77) [0x7f383542bf77]
   #52 /usr/lib/x86_64-linux-gnu/samba/process_model/standard.so(+0x2261) [0x7f383051c261]
   #53 /usr/lib/x86_64-linux-gnu/samba/libservice.so.0(task_server_startup+0x5c) [0x7f3842a573bc]
   #54 /usr/lib/x86_64-linux-gnu/samba/libservice.so.0(server_service_startup+0x96) [0x7f3842a55d36]
   #55 samba: task[dcesrv](+0x5781) [0x557da9989781]
   #56 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7f383480c2e1]
   #57 samba: task[dcesrv](_start+0x2a) [0x557da9987e1a]
[2019/05/31 05:45:15.632351,  0, pid=972] ../../source4/smbd/process_standard.c:160(standard_child_pipe_handler)
  standard_child_pipe_handler: Child 2881 (rpc) terminated with signal 6
=========================================================================

In summary the call stack looks like this:

dcesrv_drsuapi_CrackNamesByNameFormat -> DsCrackNameOneName -> dsdb_search -> ldb_request -> [...] -> ldb_next_request , repeating through all DSDB LDB modules up to the module "acl", probably running the method "acl_search".

From there the code path seems pretty clear:

dsdb_module_am_system -> talloc_get_type -> talloc_check_name -> __talloc_get_name -> talloc_chunk_from_ptr -> talloc_abort_unknown_value

Looks like a memory corruption that damaged a TALLOC_MAGIC_NON_RANDOM marker of some talloc chunk, apparently the ldb context of the "acl" module itself. Might be triggered during an ACL check but the cause of the corruption is probably "somewhere" else.
Comment 1 Christian Völker univentionstaff 2019-06-06 14:56:20 CEST
Core dumps available in both tickets now.
Comment 2 Arvid Requate univentionstaff 2019-09-04 16:45:55 CEST
We have taken three steps for Ticket #2019060321000589:

1) Implemented a workaround (samba restart via panic action)
2) Analyzed the core dumps
3) Replaced the SAM database of the UCS Master by a local clone (based on "samba-tool drs clone-dc-database" with suitable options)
   and added crash/restart timestamp logging (via samba panic action script).

The analysis of the core dumps from that environment didn't give any proper hint about the cause of the issue.
We may still receive feedback from the customer if step 3) has resulted in a stabilization.

While I agree that this is a considerable user pain, I suggest reducing the value, because I don't see any actionable proposals that could be considered for implementation by the product owner. We'll have to continue monitoring the support frequency and I hope that we can gather more info from Ticket #2019060421000434 too.
Comment 3 Christian Völker univentionstaff 2019-11-08 16:17:29 CET
Happening again on a customer site. Again urgent.
BTW: for the previous case rewriting the samba database did not resolve the issue. They are currently re-deploying samba.

    smb_panic_default: PANIC (pid 13256): Bad talloc magic value - unknown value
    smb_panic_default: PANIC (pid 11993): Bad talloc magic value - access after free

USI is attached to the last ticket.

Any further steps to do? What Infos do you need?
Comment 4 Arvid Requate univentionstaff 2020-04-09 18:58:04 CEST
My gut feeling is that our patch 35_allow_dns_update_for_samaccountname.quilt for the dlz_bind9 module introduced a memory handling but in dlz_ssumatch (via Bug #39806). Circumstantial evidence: 1. Only we seem to have this problem. 2. The failing ldb_get_opaque attempts to fetch DSDB_SESSION_INFO, which can only be set via a ldb_set_opaque call, which is used in the dlz_bind9 module to set DSDB_SESSION_INFO.
Comment 5 Christian Völker univentionstaff 2020-04-14 16:58:08 CEST
A workaround was suggested. I tried to pin it down here. Is there anything I am missing?
----------------------------------------------------

    1. Install gdb
    univention-install gdb
    
    2. Create a file which contains the following lines
    =========snip==8<=====
    #!/bin/sh
    gdb -p "$(pgrep -f dcesrv)" -batch -x /opt/gdb-panic-script.txt
    echo  "/etc/init.d/samba restart" | at 'NOW + 1 MINUTE'
    =========snip==8<=====
    Make this script executable with "chmod +x /opt/restart-samba.sh"
     
    3. Create a second script which will do some debugging called /opt/gdb-panic-script.txt
    =========snip==8<=====
    bt full
    frame 9
    call talloc_report_full(module->ldb, stderr)
    quit
    =========snip==8<=====
    
    4. Edit /etc/samba/local.conf and add the following line under the [global] section manually: 
    panic action = /opt/restart-samba.sh

----------------------------------------------------
Comment 7 syapollo 2020-06-17 08:53:30 CEST
(In reply to Christian Völker from comment #5)
> A workaround was suggested. I tried to pin it down here. Is there anything I
> am missing?
> ----------------------------------------------------
> 
>     1. Install gdb
>     univention-install gdb
>     
>     2. Create a file which contains the following lines
>     =========snip==8<=====
>     #!/bin/sh
>     gdb -p "$(pgrep -f dcesrv)" -batch -x /opt/gdb-panic-script.txt
>     echo  "/etc/init.d/samba restart" | at 'NOW + 1 MINUTE'
>     =========snip==8<=====
>     Make this script executable with "chmod +x /opt/restart-samba.sh"
>      
>     3. Create a second script which will do some debugging called
> /opt/gdb-panic-script.txt
>     =========snip==8<=====
>     bt full
>     frame 9
>     call talloc_report_full(module->ldb, stderr)
>     quit
>     =========snip==8<=====
>     
>     4. Edit /etc/samba/local.conf and add the following line under the
> [global] section manually: 
>     panic action = /opt/restart-samba.sh
> 
> ----------------------------------------------------


I executed the command /opt/restart-samba.sh
There seems to be a problem:

======================================================
root@ucs-101:/opt# ./restart-samba.sh 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f449285d2e3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
84	../sysdeps/unix/syscall-template.S: No such file or directory.
#0  0x00007f449285d2e3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
No locals.
#1  0x00007f44933b5e9a in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
No symbol table info available.
#2  0x00007f44933b3fd7 in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
No symbol table info available.
#3  0x00007f44933af23d in _tevent_loop_once () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
No symbol table info available.
#4  0x00007f44933af49b in tevent_common_loop_wait () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
No symbol table info available.
#5  0x00007f44933b3f77 in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
No symbol table info available.
#6  0x00007f448e4a3261 in ?? () from /usr/lib/x86_64-linux-gnu/samba/process_model/standard.so
No symbol table info available.
#7  0x00007f44a09df3bc in task_server_startup () from /usr/lib/x86_64-linux-gnu/samba/libservice.so.0
No symbol table info available.
#8  0x00007f44a09ddd36 in server_service_startup () from /usr/lib/x86_64-linux-gnu/samba/libservice.so.0
No symbol table info available.
#9  0x000055a06258a781 in ?? ()
No symbol table info available.
#10 0x00007f44927942e1 in __libc_start_main (main=0x55a062588dc0 <main>, argc=2, argv=0x7ffe4f1e4d18, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe4f1e4d08) at ../csu/libc-start.c:291
        result = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {0, 7874377618016858662, 94147333098992, 140730225806608, 0, 0, 4177988282206810662, 4075345324185252390}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x7ffe4f1e4d30, 0x7f44a1f19170}, data = {prev = 0x0, cleanup = 0x0, canceltype = 1327385904}}}
        not_first_call = <optimized out>
#11 0x000055a062588e1a in _start ()
No symbol table info available.
#9  0x000055a06258a781 in ?? ()
/opt/gdb-panic-script.txt:3: Error in sourced command file:
Attempt to extract a component of a value that is not a structure pointer.
=======================================
Comment 9 Arvid Requate univentionstaff 2021-04-21 12:30:16 CEST
Following up to Comment 4:

My impression is that it's caused by the 35_allow_dns_update_for_samaccountname.quilt patch introduced via Bug 39806,
which introduces two places where we set 

    state->session_info = NULL;                                                                      

I guess that should better be a talloc_free call? See https://talloc.samba.org/talloc/doc/html/group__talloc.html
Comment 10 Arvid Requate univentionstaff 2021-06-16 18:33:16 CEST
Ok, I think my idea that our patch for dlz_bind9 might be the culprit is not right, because the dlz module is loaded by the bind9 process and not by samba itself.

I found https://bugzilla.samba.org/show_bug.cgi?id=13214 and the discussion around https://www.spinics.net/lists/samba/msg147432.html and then I experimented with "rndc reload" and "pkill -HUP -f named" running parallel to a large nsupdate call (first I deleted+added a host record for a computers/windows object in the forward zone as Administrator and then I used that machine account to del+add it 1000 times, which should have activated the system session code introduced by the patch for Bug #39806) but I could not reproduce the issue (neither ours nor that of the upstream samba bug report).

So I think we need to drill down on samba itself and find out which process is actually affected (smbd? kdc? ...). For that access to an affected environment would be required.
Comment 13 Arvid Requate univentionstaff 2021-06-24 16:51:31 CEST
Looks like we are not alone: https://lists.samba.org/archive/samba/2020-August/231440.html
Comment 14 Arvid Requate univentionstaff 2021-07-01 20:08:42 CEST
Ok, I looked again at the recent core dumps from the Tickets related to Comment 8 / Comment 11
and I think they are a completely different issue that happens on a memberserver, not on a DC.
I've split it off as Bug #53540.

I'll attach the tickets from that customer to the new bug and remove them from this one:
2021032421000031 2021061821000191

Ticket 2021042121000222 also seems unrelated (SAML issue, different customer)


This bug here is about core dumps on AD DCs with "Bad talloc magic value - unknown value"
from the function "dsdb_module_am_system".
Comment 15 Ingo Steuwer univentionstaff 2021-07-08 16:42:30 CEST
Thanks for the analysis. So the bug hasn't been reported since >1 year, and nobody is able to reproduce it - right?

I reduce the "Who will be affected" based on these assumptions.
Comment 16 Arvid Requate univentionstaff 2021-09-08 12:08:19 CEST
Found again in a log file of a customer (Ticket #2021090221000277), no further circumstances known.
Comment 17 Arvid Requate univentionstaff 2021-11-04 12:54:01 CET
+1 Ticket #2021102921000312
Comment 18 Arvid Requate univentionstaff 2021-11-04 12:55:56 CET
Impact: rpc server on port 135 gone => Samba/AD DC unreachable for RPC calls.
Comment 19 Arvid Requate univentionstaff 2021-11-04 14:20:27 CET
Created attachment 10853 [details]
bug49593-idea2.diff

One could attempt to mitigate like sketched in the attached patch, but this will 1) not compile like this, because the talloc internals are encapsuled in lib/talloc/talloc.c and 2) will only avoid the panic at that location but may well result in a panic at some later stage, because the DSDB_SESSION_INFO is still invalid in that context.

We would need to catch the triggering action/system as it happens in an affected customer or test environment. Maybe it's an unjoined client/memberserver connecting or whatever.