40059 – No Login possible from one minute to another - smbd: nss_ldap: ldap_start_tls failed: Timed out

Bug 40059 - No Login possible from one minute to another - smbd: nss_ldap: ldap_start_tls failed: Timed out

Summary: No Login possible from one minute to another - smbd: nss_ldap: ldap_start_tls...

Status:	CLOSED FIXED

Alias:	None

Product:	UCS
Classification:	Unclassified
Component:	Kernel
Version:	UCS 4.1
Hardware:	amd64 Linux

Importance:	P1 critical
Target Milestone:	UCS 4.1-0-errata
Assignee:	Philipp Hahn
QA Contact:	Arvid Requate

URL:
Keywords:

Depends on:	40141
Blocks:
	Show dependency tree / graph

Reported:	2015-11-21 17:10 CET by snoopyf
Modified:	2016-01-21 09:15 CET (History)
CC List:	16 users (show)

See Also:	37643 39832
What kind of report is it?:	---
What type of bug is this?:	---
Who will be affected by this bug?:	---
How will those affected feel about the bug?:	---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Customer ID:
Max CVSS v3 score:

Attachments
Reproducer (2.04 KB, text/plain) 2015-11-26 16:21 CET, Philipp Hahn	Details
undo_netlink-replace-rhash_portid-with-bound.patch (10.09 KB, patch) 2015-11-29 14:42 CET, Stefan Gohmann	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description snoopyf 2015-11-21 17:10:14 CET

Hi everybody,

I had the following issue multiple times in the last week:

From one minute to another it isn't possible to login neither with SSH nor directly at the console. You enter the credentials, it seems the login is in progress but then there is no shell as if the login crashs.

The only way to get out is to reboot the machine with Magic Sysrq. Keys.

At this time this can be found in syslog:

Nov 21 16:52:57 server smbd: nss_ldap: ldap_start_tls failed: Timed out
Nov 21 16:53:09 server smbd: nss_ldap: ldap_start_tls failed: Timed out
Nov 21 16:53:28 server smbd: nss_ldap: ldap_start_tls failed: Timed out
Nov 21 16:53:39 server smbd: nss_ldap: ldap_start_tls failed: Timed out
Nov 21 16:54:09 server smbd: nss_ldap: ldap_start_tls failed: Timed out
Nov 21 16:54:40 server smbd: nss_ldap: ldap_start_tls failed: Timed out

So I assume, the ldap server somehow crashed but I can't find any log...

Any hints?

Thx in advance!

Comment 1 Dirk Ahrnke 2015-11-22 19:33:53 CET

also in http://forum.univention.de/viewtopic.php?f=48&t=4561

Comment 2 Stefan Gohmann

2015-11-23 06:57:13 CET

If someone is able to reproduce it, a slapd core file would be helpful.

Comment 3 Arvid Requate

2015-11-23 10:21:47 CET

http://sdb.univention.de/1316 explains how to enable core dumps and after that, in case the slapd still hangs around, a core dump may be generated by

 kill -11 $SLAPDPID

Comment 4 snoopyf 2015-11-24 07:19:34 CET

As one is not be able to login when the server is in this faulty state generating a core dump will be tricky...

Perhaps it works when there is a logged in root - shell somewhere during slapd crashes..

Just a thought without knowing how things r configured atm:

Wouldn't it be better to configure UCS in such a way, that root can ALWAYS login even if LDAP has gone as the root account resides in /etc/passwd?

I had a similar problem with a completely self - configured server years ago (even root couldn't login when LDAP was unavailable) and I don't really remember how, but I think I could enable the root login when LDAP wasn't available by changing some orders in /etc/nsswitch or perhaps it was somwhere in the pam - config so that LDAP was the second, not the first system to be considered during the login process....

Comment 5 Stefan Gohmann

2015-11-24 13:22:15 CET

Normally a login should be possible if the LDAP server is down.

Maybe it is a problem that the LDAP server accepts connections but don't answer. Does the root login work after executing the following command:
 ucr set homedir/mount=no quota/userdefault=no

Unfortunately, we are still unable to reproduce it.

Comment 6 Arvid Requate

2015-11-24 16:01:14 CET

In case someone faces this situation and has root access to the system, gathering a backtrace with the following commands might help to track this down:

apt-get install -y gdb slapd-dbg cyrus-sasl2-dbg cy2-saml-dbg
gdb /usr/sbin/slapd "$(pidof slapd)" \
      --batch --ex 'thread apply all bt'

Comment 7 Florian Best

2015-11-24 22:10:40 CET

Also reported in http://forum.univention.de/viewtopic.php?f=48&t=4584&p=17505

Comment 8 Arvid Requate

2015-11-25 13:15:31 CET

ucr set ldap/debug/level="trace conns stats" might also help narrow down context and timing, but that would have to be set *before* it happens (and restart slapd). Looks like it happens more often on hardware.

Comment 9 Philipp Hahn

2015-11-25 14:50:05 CET

Maybe <https://sourceware.org/bugzilla/show_bug.cgi?id=12926>?

Using a dummy prog to call getaddrinfo() shows the following stack trace:

#0  recvmsg () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007ffff7b4efcc in make_request (fd=7, pid=16647, seen_ipv4=<optimized out>, 
    seen_ipv6=<optimized out>, in6ai=<optimized out>, in6ailen=<optimized out>)
    at ../sysdeps/unix/sysv/linux/check_pf.c:119
#2  0x00007ffff7b4f44a in __check_pf (seen_ipv4=0x7fffffffe36f, seen_ipv6=0x7fffffffe36e, 
    in6ai=0x7fffffffe350, in6ailen=0x7fffffffe348) at ../sysdeps/unix/sysv/linux/check_pf.c:271
#3  0x00007ffff7b0c007 in *__GI_getaddrinfo (name=0x600ba0 "0.0.0.0", service=0x400844 "7636", 
    hints=0x600fc0, pai=0x600b88) at ../sysdeps/posix/getaddrinfo.c:2389
#4  0x000000000040076b in main () at /tmp/addr.c:38

The C-library tries to get the routing table to identify, if both IPv4 and IPv6 are supported. The recvmsg() either hands or is repeatedly called - "strace" would show.
The above mentioned patch fixes such a case, were the function returning 0 is interpreted as an error instead of a successful call.

Comment 10 Christina Scheinig

2015-11-25 15:08:57 CET

Most likely as happend here 2015112521000151

Comment 11 Stephan Hendl 2015-11-25 15:58:51 CET

Also happened in our (test)environment without any user interactions... Sent backtrace to feedback@univention.de.

Comment 12 Stephan Hendl 2015-11-25 16:07:35 CET

Well, I can login into the server as root although slapd doesn't answer on any questions. A "univention-ldapsearch uid=<soemone>" doesn't come to an end. 

If someone has ideas what I can debug - feel free to contact me;-)

Comment 13 Stephan Hendl 2015-11-25 17:00:47 CET

In our environment the problem is located on a DC-Master only. All DC-Backups are running fine... S4 seems to be not affected "univention-s4search cn=..." runs as expected - even on the DC-Maser with broken slapd. Maybe that helps...

Comment 14 Stefan Gohmann

2015-11-25 20:33:33 CET

(In reply to Philipp Hahn from comment #9)
> Maybe <https://sourceware.org/bugzilla/show_bug.cgi?id=12926>?

Excellent!

I've announced your test build as glibc component.

If someone would like to test it:

ucr set repository/online/component/glibc=true \
      repository/online/component/glibc/unmaintained=true
univention-upgrade

Comment 15 Philipp Hahn

2015-11-26 08:36:14 CET

Maybe worsened by Bug #37643 / CVE-2013-7423?

repo_admin.py --cherrypick -r 4.0-0-0 -s errata4.0-3 --releasedest 4.1 --dest errata4.1-0 -p eglibc

r15452 | Bug #40059: Fix infinite loop in check_pf (BZ #12926)

Package: eglibc
Version: 2.13-38.26.201511251527
Branch: ucs_4.1-0
Scope: errata4.1-0

Comment 16 Philipp Hahn

2015-11-26 08:38:50 CET

r65917 | Bug #40059: eglibc
 eglibc.yaml

Comment 17 Michael Voigt 2015-11-26 10:48:17 CET

Will leave an erratum to correct this problem?

Comment 18 Arvid Requate

2015-11-26 13:26:59 CET

Yes, there will be an erratum, currently we are still working on it.

Comment 19 Arvid Requate

2015-11-26 15:34:21 CET

We have another report from Dirk Ahrnke and are currently looking into this. In this particular case the lockup seems to have occurred even without the sasl2/slapd.conf.


(In reply to Philipp Hahn from comment #9)
> Maybe <https://sourceware.org/bugzilla/show_bug.cgi?id=12926>?

Unfortunately it looks like the bug is even below that point, as the patch for that didn't help in this case. That glibc bug report also says:

> strace shows un-ending stream of recvmsg() calls, which all return 0.

and we don't see that. Instead, the recvmsg itself seems to hang, as Philipp said, possibly being blocked while reading from an empty netlink socket. Our current "best bet" is, that the patch for CVE-2013-7423 might have this colateral effect (introduced via errata4.0-4 Bug 37643, Erratum 297).


Good latest news: At least Philipp now has a reproducer, independent of OpenLDAP, just a multithreaded C code which calls getaddrinfo concurrently.

Comment 20 Philipp Hahn

2015-11-26 16:20:32 CET

r15464 | Bug #40059: Fix potential double close in __check_fd if OOM
 <https://sourceware.org/git/?p=glibc.git;a=commit;h=636064eb4c03397c86aa26e489e68f952bd5e53f>

Now building...

Comment 21 Philipp Hahn

2015-11-26 16:21:49 CET

Created attachment 7321 [details]
Reproducer

10 threads doing 1000×1000 times getaddrinfo()
Reproduced the hang on an 8-core in about 2 minutes

Comment 22 snoopyf 2015-11-26 16:40:25 CET

I can confirm, that the glibc-update does NOT fix this issue (as me and some other reported here, too: http://forum.univention.de/viewtopic.php?f=48&t=4561&p=17580#p17580)

But finally I was able to create core-dumps at the moment when slapd hangs. Just upoaded them here:

https://upload.univention.de
ID is: upload_1izJNM.gz

Perhaps it helps...

Comment 23 Arvid Requate

2015-11-26 19:30:49 CET

The reproducer code of Comment 21 triggers the bug after a couple of minutes either on hardware or on a KVM machine if the system has more than one CPU (I tested with 4). Florian reproduced it on hardware even with glibc from ucs_4.0-0. I've started another test now with UCS 3.2-3 (errata 181, randomly picked), this is still running.

Bad news: The fresh UCS 4.1-0 build of eglibc with the patch from Comment 20 didn't fix the bug, it was still reproducible (on xen3).

Comment 24 Stefan Gohmann

2015-11-26 19:41:54 CET

(In reply to Arvid Requate from comment #23)
> The reproducer code of Comment 21 triggers the bug after a couple of minutes
> either on hardware or on a KVM machine if the system has more than one CPU
> (I tested with 4). Florian reproduced it on hardware even with glibc from
> ucs_4.0-0. I've started another test now with UCS 3.2-3 (errata 181,
> randomly picked), this is still running.

If it works with an older kernel, maybe this patch is relevant:
 https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/e341694e3eb57fcda9f1adc7bfea42fe080d8d7a

Comment 25 Philipp Hahn

2015-11-27 16:50:32 CET

Looks like a Bug in the Linux Kernel 4.1.10+: <http://marc.info/?l=linux-kernel&m=144863837825887&w=2>

Comment 26 Stefan Gohmann

2015-11-29 14:42:47 CET

Created attachment 7325 [details]
undo_netlink-replace-rhash_portid-with-bound.patch

This patch reverts the following upstream commits:
 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/patch/?id=4e27762417669cb459971635be550eb7b5598286
 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/patch/?id=c0bb07df7d981e4091432754e30c9c720e2c0c78
 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/patch/?id=c5adde9468b0714a051eac7f9666f23eb10b61f7
 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/patch/?id=d48623677191e0f035d7afd344f92cf880b01f8e

I've built a temporary kernel and the reproducer runs in the fifth round without any problems.

It was not a straight forward revert, so a careful check is required.

I'll build a new kernel in the glibc scope where it can be easily tested.

Comment 27 Stefan Gohmann

2015-11-30 06:24:44 CET

I've started a rebuild in the errata4.1-0 scope (r15481) and I've added a dummy YAML file (r65972).


Until the errata has been announced the best workaround is to downgrade to Kernel 4.16.

On amd64 systems:
 univention-install linux-image-3.16.0-ucs135-amd64
On i386 systems:
 univention-install linux-image-3.16.0-ucs135-686-pae

After the reboot the 3.16 kernel should be selected in the grub menu.

Comment 28 Michael Voigt 2015-11-30 11:09:31 CET

Hi Stefan,
Perfect!. Worked here, the only burden is the Docker does not rise in the kernel 3.16, but this is irrelevant, and we can do without for a while.
Thank you

Comment 29 Philipp Hahn

2015-11-30 14:24:46 CET

(In reply to Stefan Gohmann from comment #26)
> Created attachment 7325 [details]
> undo_netlink-replace-rhash_portid-with-bound.patch

No, that patch reverts more than the 4 named patches; some of them are even from older kernels:

> 4e27762417669cb459971635be550eb7b5598286
OK
> c0bb07df7d981e4091432754e30c9c720e2c0c78
FAIL: patch from v4.1 fixing the next
> c5adde9468b0714a051eac7f9666f23eb10b61f7
FAIL: patch from v4.0
> d48623677191e0f035d7afd344f92cf880b01f8e
OK

It reverts all patches back to v4.1.0, as that patch is equal to `git diff v4.1.12..v4.1 -- net/netlink/af_netlink.[ch]`

e962218 netlink: Trim skb to alloc size to avoid MSG_TRUNC
d486236 netlink: Replace rhash_portid with bound
4e27762 netlink: Fix autobind race condition that leads to zero port ID
65d48c6 netlink, mmap: transform mmap skb into full skb on taps
d397617 netlink: make sure -EBUSY won't escape from netlink_insert
b265c30 netlink: don't hold mutex in rcu callback when releasing mmapd ring

r15482 | Bug #40059: Linux af_netlink.[ch] revert

Comment 30 Janek Walkenhorst

2015-11-30 19:38:22 CET

r65999 univention-kernel-image-signed (2.0.1-1)
r66000 univention-kernel-image-signed.yaml

Comment 31 Stefan Gohmann

2015-11-30 20:56:15 CET

(In reply to Philipp Hahn from comment #29)
> (In reply to Stefan Gohmann from comment #26)
> > Created attachment 7325 [details]
> > undo_netlink-replace-rhash_portid-with-bound.patch
> 
> No, that patch reverts more than the 4 named patches; some of them are even
> from older kernels:

I know, I followed the "Fixes:" lines in the commits. However if it works, I'm fine with it.

Comment 32 Philipp Hahn

2015-12-01 12:09:24 CET

(In reply to Janek Walkenhorst from comment #30)
> r65999 univention-kernel-image-signed (2.0.1-1)

FAIL: Still references ucs1*5*3 instead of ucs1*6*3

Package: linux
Version: 4.1.6-1.163.201511301312
Branch: ucs_4.1-0
Scope: errata4.1-0

r66011 | Bug #40059: Update to ucs163
r66010 | Bug #40059: Revert Linux kernel NETLINK changes

Package: univention-kernel-image
Version: 9.0.0-5.83.201512011131
Branch: ucs_4.1-0-errata4.1-0
Scope: errata4.1-0

Package: univention-kernel-image-signed
Version: 2.0.0-3.11.201512011131
Branch: ucs_4.1-0
Scope: errata4.1-0

r66012 | Bug #40059: linux YAML
 linux.yaml
 univention-kernel-image-signed.yaml
 univention-kernel-image.yaml

Comment 33 Arvid Requate

2015-12-01 21:05:28 CET

Verified:
* linux:
** Injected debian patch exactly reverts d486236 and 4e27762
** Injected debian patch is applied during built
** advisory OK

* eglibc
** Injected debian patch includes these two upstream commits:
*** https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fda389c8f0311dd5786be91a7b54b9f935fcafa1
*** https://sourceware.org/git/?p=glibc.git;a=commit;h=636064eb4c03397c86aa26e489e68f952bd5e53f
** Injected debian patch is applied during built
** FAIL: eglibc.yaml: version number still was 2.13-38.26.201511251527
      I adjusted it to 2.13-38.27.201511261620


* Metapackages rebuilt:
** univention-kernel-image-signed
** univention-kernel-image
** advisories OK

* Package update test (amd64): Ok

* Functional test (amd64, KVM, 4 vCPUs): OK

Before:
* Linux master10 4.1.0-ucs153-amd64 #1 SMP Debian 4.1.6-1.153.201510270857 (2015-10-27) x86_64 GNU/Linux
* Reproducer deadlocks

After:
* Reboot OK
* Linux master10 4.1.0-ucs163-amd64 #1 SMP Debian 4.1.6-1.163.201511301312 (2015-11-30) x86_64 GNU/Linux
* Reproducer exits successfully

Comment 34 Stefan Gohmann

2015-12-02 09:39:28 CET

Second part of the QA:

* Functional test (amd64, Hardware, 4 vCPUs): OK
** Before:
*** 4.1.0-ucs153-amd64 #1 SMP Debian 4.1.6-1.153.201510270857 (2015-10-27) x86_64 GNU/Linux
*** Reproducer Fail (deadlocks)
** After:
*** Reboot OK
*** 4.1.0-ucs163-amd64 #1 SMP Debian 4.1.6-1.163.201511301312 (2015-11-30) x86_64 GNU/Linux
*** Reproducer: OK (exits successfully)

* Package update test (i386) and reproducer tests: OK

Comment 35 Janek Walkenhorst

2015-12-02 12:21:20 CET

http://errata.software-univention.de/ucs/4.1/8.html
http://errata.software-univention.de/ucs/4.1/9.html
http://errata.software-univention.de/ucs/4.1/10.html
http://errata.software-univention.de/ucs/4.1/11.html

Comment 36 Philipp Hahn

2015-12-02 19:07:58 CET

Another dead-lock report in Asterisk on LKML with a proposed patch: <http://marc.info/?l=linux-kernel&m=144905644311953&w=2>

Comment 37 Philipp Hahn

2015-12-07 08:38:22 CET

(In reply to Philipp Hahn from comment #36)
> Another dead-lock report in Asterisk on LKML with a proposed patch:
> <http://marc.info/?l=linux-kernel&m=144905644311953&w=2>

Here's the culprit/missing patch: <http://marc.info/?l=linux-kernel&m=144943540403972&w=2>

orig patch from 4.3: 1f770c0a09da855a2b51af6d19de97fb955eca85 
bad backport to 4.1: 4e27762417669cb459971635be550eb7b5598286

Comment 38 Philipp Hahn

2016-01-21 09:15:38 CET

The fix went into 4.1.14:
$ git tag --contains a52ec6de6d1638e8c203d7188c55627f75371612
v4.1.14
v4.1.15