Bug 34013 - univention-dire[3809]: segfault

Summary: univention-dire[3809]: segfault

Status:	CLOSED FIXED

Alias:	None

Product:	UCS
Classification:	Unclassified
Component:	Listener (univention-directory-listener)
Version:	UCS 3.2
Hardware:	amd64 Linux

Importance:	P5 normal
Target Milestone:	UCS 3.2-3-errata
Assignee:	Arvid Requate
QA Contact:	Stefan Gohmann

URL:
Keywords:

Depends on:
Blocks:

Reported:	2014-01-29 16:57 CET by Philipp Hahn
Modified:	2014-09-10 17:43 CEST
CC List:	2 users

See Also:	24411

Attachments
Proposed patch (1.80 KB, patch) 2014-04-17 15:19 CEST, Philipp Hahn

Description Philipp Hahn

2014-01-29 16:57:03 CET

ucs-kt-get -A amd64 -V 3.2 Generic
# run through Univention-System-Setup-Boot
## Master
## FQDN=test.phahn.dev, PWD=univention
## IP=10.200.17.x, GW=10.200.17.1
## SW=ø
# login as root on console

root@test:~# [  347.997015] univention-dire[3809]: segfault at 18 ip 00007f6f464112ba sp 00007fffb892d810 error 4 in libpython2.6.so.1.0[7f6f463c1000+240000]

Seen several times, but not reproducible right now.

Comment 1 Philipp Hahn

2014-02-11 09:26:44 CET

again

Comment 2 Philipp Hahn

2014-04-10 18:44:00 CEST

Bug #24411 reports another SEGV in the listener.

Comment 3 Philipp Hahn

2014-04-11 08:34:19 CEST

(In reply to Philipp Hahn from comment #2)
Wrong bug number: Bug #25094
I mention it here because in system-setup mode the listener gets restarted, which terminates the previous process.

Comment 4 Philipp Hahn

2014-04-17 15:19:36 CEST

Created attachment 5879
Proposed patch

While debugging Bug #34355, I triggered the following SIGSEGV under gdb:
667             if (dbp && (rv = dbp->close(dbp, 0)) != 0) {

In my case that happened because multiple signals were processed and the first SIGINT did not terminate the listener; only the subsequent SIGTERM did:

(gdb) print dbp
$1 = (DB *) 0x6471e0
(gdb) list
662     {
663             int rv;
664
665             if ( dbc_cur != NULL )
666                     cache_free_cursor(dbc_cur);
667             if (dbp && (rv = dbp->close(dbp, 0)) != 0) {
668                     dbp->err(dbp, rv, "close");
669             }
670     #ifdef WITH_DB42
671             if ((rv = dbenvp->close(dbenvp, 0)) != 0) {
(gdb) bt
#0  0x000000000040858d in cache_close () at cache.c:667
#1  0x000000000040f17a in exit_handler (sig=15) at signals.c:112
#2  <signal handler called>
#3  0x00007ffff6d783c3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:82
#4  0x000000000040e471 in notifier_wait (client=0x6148c0, timeout=300) at network.c:498
#5  0x0000000000404872 in notifier_listen (lp=0x617110, kp=0x0, write_transaction_file=1, lp_local=0x617190)
    at notifier.c:120
#6  0x00000000004046c5 in main (argc=17, argv=0x7fffffffe4a8) at main.c:611

The attached patch does 3 things:
1. Set dbp=NULL after closing it, so that a second cache_close() does not close (and free) it twice. This should at least fix this issue.
2. Remove the global "dbc_cur", as the 3 callers of cache_first_entry() correctly free their cursor themselves.
3. Replace lockf(F_TEST)+lockf(F_LOCK) with lockf(F_TLOCK), because the test-then-lock sequence of two system calls is not atomic.
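A minimal sketch of points 1 and 3 in plain C. It is not the attached patch: a stdio FILE* stands in for the Berkeley DB handle dbp, and cache_close_sketch(), lock_file_sketch() and cache_fp are hypothetical names.

```c
#define _DEFAULT_SOURCE /* for lockf() and F_TLOCK with glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for the listener's global DB handle "dbp". */
static FILE *cache_fp = NULL;

/* Point 1: close the handle and reset the global, so a second call
 * (e.g. from an exit handler that runs twice) is a harmless no-op
 * instead of a double close/free. */
static void cache_close_sketch(void)
{
	if (cache_fp != NULL) {
		if (fclose(cache_fp) != 0)
			perror("fclose");
		cache_fp = NULL;
	}
}

/* Point 3: F_TLOCK tests and acquires the lock in a single system call.
 * With F_TEST followed by F_LOCK, another process could take the lock
 * between the two calls.
 * Returns the locked fd, or -1 if the lock is held elsewhere or on error. */
static int lock_file_sketch(const char *path)
{
	int fd = open(path, O_WRONLY | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (lockf(fd, F_TLOCK, 0) != 0) {
		/* EACCES or EAGAIN: another process already holds the lock */
		close(fd);
		return -1;
	}
	return fd;
}
```

lockf() locks are released when the descriptor is closed or the process exits, so the caller keeps the returned fd open for as long as it needs the lock.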

Comment 5 Philipp Hahn

2014-05-02 12:36:17 CEST

(In reply to Philipp Hahn from comment #4)
> In my case that happened because multiple signals were processed and the
> first SIGINT did not terminate the listener; only the subsequent SIGTERM did:

While debugging Bug #34335, I noticed that SIGPIPE is used by libssl/gnutls/libldap to signal some timeout condition. If a second signal such as SIGINT or SIGTERM arrives while the first signal handler is still running, that could explain the segmentation fault.

I also found a two-week-old core file showing the SEGV occurring while a signal was being processed (the backtrace was not reliable, because the exact listener binary was no longer installed in my development environment).

Please also note that not all functions are async-signal-safe; see "man 7 signal", section "Async-signal-safe functions".
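For illustration, a handler that reports the signal using only write(2), which is on the async-signal-safe list, instead of printf() or a logging function that may allocate or take locks. format_signal_msg() and safe_log_handler() are hypothetical helpers, not the listener's debug-logging code.

```c
#include <errno.h>
#include <unistd.h>

/* Format "received signal N\n" into buf without stdio or malloc.
 * Returns the number of bytes placed in buf. */
static size_t format_signal_msg(char *buf, size_t cap, int sig)
{
	static const char prefix[] = "received signal ";
	char digits[12];
	size_t n = 0, d = 0;
	unsigned int v = (unsigned int) sig;

	/* collect decimal digits in reverse order */
	do {
		digits[d++] = (char) ('0' + v % 10);
		v /= 10;
	} while (v != 0 && d < sizeof(digits));

	for (size_t i = 0; prefix[i] != '\0' && n < cap; i++)
		buf[n++] = prefix[i];
	while (d > 0 && n < cap)
		buf[n++] = digits[--d];
	if (n < cap)
		buf[n++] = '\n';
	return n;
}

/* Only async-signal-safe calls: write(2). errno is saved and restored
 * because write() may change it under the interrupted code. */
static void safe_log_handler(int sig)
{
	int saved_errno = errno;
	char buf[64];
	size_t len = format_signal_msg(buf, sizeof(buf), sig);
	ssize_t rc = write(STDERR_FILENO, buf, len);
	(void) rc; /* nothing useful can be done about a failed write here */
	errno = saved_errno;
}
```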

Comment 6 Philipp Hahn

2014-05-26 14:40:22 CEST

(gdb) signal SIGINT
Continuing with signal SIGINT.
26.05.14 14:48:41.762  LISTENER    ( WARN    ) : received signal 2
26.05.14 14:48:41.762  LISTENER    ( INFO    ) : postrun handler: nfs-shares (prepared=0)
26.05.14 14:48:41.762  LISTENER    ( INFO    ) : postrun handler: nfs-homes (prepared=0)
26.05.14 14:48:41.762  LISTENER    ( INFO    ) : postrun handler: keytab-member (prepared=0)
26.05.14 14:48:41.762  LISTENER    ( INFO    ) : postrun handler: faillog (prepared=-1)
26.05.14 14:48:41.762  LISTENER    ( INFO    ) : postrun handler: ldap_server (prepared=0)
26.05.14 14:48:41.762  LISTENER    ( INFO    ) : postrun handler: license_uuid (prepared=0)
26.05.14 14:48:41.762  LISTENER    ( INFO    ) : postrun handler: bind (prepared=0)
26.05.14 14:48:41.762  LISTENER    ( INFO    ) : postrun handler: well-known-sid-name-mapping (prepared=-1)

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff73262ba in PyObject_Call () from /usr/lib/libpython2.6.so.1.0
(gdb) bt
#0  0x00007ffff73262ba in PyObject_Call () from /usr/lib/libpython2.6.so.1.0
#1  0x00007ffff73c6343 in PyEval_CallObjectWithKeywords () from /usr/lib/libpython2.6.so.1.0
#2  0x0000000000405f2c in handler_postrun (handler=0x9529a0) at handlers.c:348
#3  0x0000000000405fbd in handlers_postrun_all () at handlers.c:368
#4  0x000000000040f8f9 in exit_handler (sig=2) at signals.c:119
#5  <signal handler called>
#6  0x00007ffff6d72870 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:82
#7  0x00007ffff6d1ca1f in _IO_file_xsgetn (fp=0xf92ce0, data=0x1044b48, n=8192) at fileops.c:1465
#8  0x00007ffff6d12c42 in _IO_fread (buf=0x1044a94, size=1, count=8192, fp=0xffffffffffffffff) at iofread.c:44
#9  0x00007ffff734b7ac in ?? () from /usr/lib/libpython2.6.so.1.0
#10 0x00007ffff73cc1c0 in PyEval_EvalFrameEx () from /usr/lib/libpython2.6.so.1.0
#11 0x00007ffff73cdf00 in PyEval_EvalCodeEx () from /usr/lib/libpython2.6.so.1.0
#12 0x00007ffff73cc23b in PyEval_EvalFrameEx () from /usr/lib/libpython2.6.so.1.0
#13 0x00007ffff73cdf00 in PyEval_EvalCodeEx () from /usr/lib/libpython2.6.so.1.0
#14 0x00007ffff73cc23b in PyEval_EvalFrameEx () from /usr/lib/libpython2.6.so.1.0
#15 0x00007ffff73cdf00 in PyEval_EvalCodeEx () from /usr/lib/libpython2.6.so.1.0
#16 0x00007ffff7353b80 in ?? () from /usr/lib/libpython2.6.so.1.0
#17 0x00007ffff73262d3 in PyObject_Call () from /usr/lib/libpython2.6.so.1.0
#18 0x00007ffff73c6343 in PyEval_CallObjectWithKeywords () from /usr/lib/libpython2.6.so.1.0
#19 0x0000000000405f2c in handler_postrun (handler=0x9529a0) at handlers.c:348
#20 0x0000000000405fbd in handlers_postrun_all () at handlers.c:368
#21 0x0000000000404aa0 in notifier_listen (lp=0x617110, kp=0x0, write_transaction_file=0, lp_local=0x617190)
    at notifier.c:140
#22 0x00000000004047b3 in main (argc=16, argv=0x7fffffffe5f8) at main.c:612

Perhaps the signal handler should only set a global flag, so that the loop calling select() exits and the cleanup runs outside the signal handler.

Comment 7 Arvid Requate

2014-09-08 18:39:43 CEST

OK, the patch has been applied, and exit_handler() now checks whether it is already running. Advisory: 2014-09-08-univention-directory-listener.yaml
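The comment does not show the check itself, so the following is only a guess at what such a re-entrancy guard can look like: a static sig_atomic_t flag that makes a nested invocation return while the first one finishes cleanup and exits. run_exit_cleanup_once(), cleanup_sketch() and exit_handler_guarded() are hypothetical names.

```c
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t exit_in_progress = 0;
static volatile sig_atomic_t cleanup_runs = 0; /* counts cleanups, for illustration */

/* Hypothetical placeholder for handlers_postrun_all(), cache_close(), ... */
static void cleanup_sketch(void)
{
	cleanup_runs++;
}

/* Returns 1 if this call performed the cleanup, 0 if one was already
 * in progress. */
static int run_exit_cleanup_once(void)
{
	if (exit_in_progress)
		return 0;
	exit_in_progress = 1;
	cleanup_sketch();
	return 1;
}

static void exit_handler_guarded(int sig)
{
	if (!run_exit_cleanup_once())
		return; /* nested signal: let the outer invocation finish */
	_exit(128 + sig);
}
```

In a single-threaded process this is enough: a nested invocation either sees the flag and returns, or it runs the whole exit path itself. It is not a general-purpose lock and does not help against threads.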

Comment 8 Stefan Gohmann

2014-09-09 07:32:59 CEST

Tests: I was unable to reproduce it, but the code looks good and ucs-test ran successfully on all system roles → verified

Code review: OK

YAML: OK

UCS 4.0 merge: OK

Comment 9 Janek Walkenhorst

2014-09-10 17:43:40 CEST

http://errata.univention.de/ucs/3.2/201.html