Univention Bugzilla – Bug 40059
No Login possible from one minute to another - smbd: nss_ldap: ldap_start_tls failed: Timed out
Last modified: 2016-01-21 09:15:38 CET
Hi everybody, I had the following issue multiple times in the last week: From one minute to another it isn't possible to login neither with SSH nor directly at the console. You enter the credentials, it seems the login is in progress but then there is no shell as if the login crashs. The only way to get out is to reboot the machine with Magic Sysrq. Keys. At this time this can be found in syslog: Nov 21 16:52:57 server smbd: nss_ldap: ldap_start_tls failed: Timed out Nov 21 16:53:09 server smbd: nss_ldap: ldap_start_tls failed: Timed out Nov 21 16:53:28 server smbd: nss_ldap: ldap_start_tls failed: Timed out Nov 21 16:53:39 server smbd: nss_ldap: ldap_start_tls failed: Timed out Nov 21 16:54:09 server smbd: nss_ldap: ldap_start_tls failed: Timed out Nov 21 16:54:40 server smbd: nss_ldap: ldap_start_tls failed: Timed out So I assume, the ldap server somehow crashed but I can't find any log... Any hints? Thx in advance!
also in http://forum.univention.de/viewtopic.php?f=48&t=4561
If someone is able to reproduce it, a slapd core file would be helpful.
http://sdb.univention.de/1316 explains how to enable core dumps and after that, in case the slapd still hangs around, a core dump may be generated by kill -11 $SLAPDPID
As one is not be able to login when the server is in this faulty state generating a core dump will be tricky... Perhaps it works when there is a logged in root - shell somewhere during slapd crashes.. Just a thought without knowing how things r configured atm: Wouldn't it be better to configure UCS in such a way, that root can ALWAYS login even if LDAP has gone as the root account resides in /etc/passwd? I had a similar problem with a completely self - configured server years ago (even root couldn't login when LDAP was unavailable) and I don't really remember how, but I think I could enable the root login when LDAP wasn't available by changing some orders in /etc/nsswitch or perhaps it was somwhere in the pam - config so that LDAP was the second, not the first system to be considered during the login process....
Normally a login should be possible if the LDAP server is down. Maybe it is a problem that the LDAP server accepts connections but don't answer. Does the root login work after executing the following command: ucr set homedir/mount=no quota/userdefault=no Unfortunately, we are still unable to reproduce it.
In case someone faces this situation and has root access to the system, gathering a backtrace with the following commands might help to track this down: apt-get install -y gdb slapd-dbg cyrus-sasl2-dbg cy2-saml-dbg gdb /usr/sbin/slapd "$(pidof slapd)" \ --batch --ex 'thread apply all bt'
Also reported in http://forum.univention.de/viewtopic.php?f=48&t=4584&p=17505
ucr set ldap/debug/level="trace conns stats" might also help narrow down context and timing, but that would have to be set *before* it happens (and restart slapd). Looks like it happens more often on hardware.
Maybe <https://sourceware.org/bugzilla/show_bug.cgi?id=12926>? Using a dummy prog to call getaddrinfo() shows the following stack trace: #0 recvmsg () at ../sysdeps/unix/syscall-template.S:82 #1 0x00007ffff7b4efcc in make_request (fd=7, pid=16647, seen_ipv4=<optimized out>, seen_ipv6=<optimized out>, in6ai=<optimized out>, in6ailen=<optimized out>) at ../sysdeps/unix/sysv/linux/check_pf.c:119 #2 0x00007ffff7b4f44a in __check_pf (seen_ipv4=0x7fffffffe36f, seen_ipv6=0x7fffffffe36e, in6ai=0x7fffffffe350, in6ailen=0x7fffffffe348) at ../sysdeps/unix/sysv/linux/check_pf.c:271 #3 0x00007ffff7b0c007 in *__GI_getaddrinfo (name=0x600ba0 "0.0.0.0", service=0x400844 "7636", hints=0x600fc0, pai=0x600b88) at ../sysdeps/posix/getaddrinfo.c:2389 #4 0x000000000040076b in main () at /tmp/addr.c:38 The C-library tries to get the routing table to identify, if both IPv4 and IPv6 are supported. The recvmsg() either hands or is repeatedly called - "strace" would show. The above mentioned patch fixes such a case, were the function returning 0 is interpreted as an error instead of a successful call.
Most likely as happend here 2015112521000151
Also happened in our (test)environment without any user interactions... Sent backtrace to feedback@univention.de.
Well, I can login into the server as root although slapd doesn't answer on any questions. A "univention-ldapsearch uid=<soemone>" doesn't come to an end. If someone has ideas what I can debug - feel free to contact me;-)
In our environment the problem is located on a DC-Master only. All DC-Backups are running fine... S4 seems to be not affected "univention-s4search cn=..." runs as expected - even on the DC-Maser with broken slapd. Maybe that helps...
(In reply to Philipp Hahn from comment #9) > Maybe <https://sourceware.org/bugzilla/show_bug.cgi?id=12926>? Excellent! I've announced your test build as glibc component. If someone would like to test it: ucr set repository/online/component/glibc=true \ repository/online/component/glibc/unmaintained=true univention-upgrade
Maybe worsened by Bug #37643 / CVE-2013-7423? repo_admin.py --cherrypick -r 4.0-0-0 -s errata4.0-3 --releasedest 4.1 --dest errata4.1-0 -p eglibc r15452 | Bug #40059: Fix infinite loop in check_pf (BZ #12926) Package: eglibc Version: 2.13-38.26.201511251527 Branch: ucs_4.1-0 Scope: errata4.1-0
r65917 | Bug #40059: eglibc eglibc.yaml
Will leave an erratum to correct this problem?
Yes, there will be an erratum, currently we are still working on it.
We have another report from Dirk Ahrnke and are currently looking into this. In this particular case the lockup seems to have occurred even without the sasl2/slapd.conf. (In reply to Philipp Hahn from comment #9) > Maybe <https://sourceware.org/bugzilla/show_bug.cgi?id=12926>? Unfortunately it looks like the bug is even below that point, as the patch for that didn't help in this case. That glibc bug report also says: > strace shows un-ending stream of recvmsg() calls, which all return 0. and we don't see that. Instead, the recvmsg itself seems to hang, as Philipp said, possibly being blocked while reading from an empty netlink socket. Our current "best bet" is, that the patch for CVE-2013-7423 might have this colateral effect (introduced via errata4.0-4 Bug 37643, Erratum 297). Good latest news: At least Philipp now has a reproducer, independent of OpenLDAP, just a multithreaded C code which calls getaddrinfo concurrently.
r15464 | Bug #40059: Fix potential double close in __check_fd if OOM <https://sourceware.org/git/?p=glibc.git;a=commit;h=636064eb4c03397c86aa26e489e68f952bd5e53f> Now building...
Created attachment 7321 [details] Reproducer 10 threads doing 1000×1000 times getaddrinfo() Reproduced the hang on an 8-core in about 2 minutes
I can confirm, that the glibc-update does NOT fix this issue (as me and some other reported here, too: http://forum.univention.de/viewtopic.php?f=48&t=4561&p=17580#p17580) But finally I was able to create core-dumps at the moment when slapd hangs. Just upoaded them here: https://upload.univention.de ID is: upload_1izJNM.gz Perhaps it helps...
The reproducer code of Comment 21 triggers the bug after a couple of minutes either on hardware or on a KVM machine if the system has more than one CPU (I tested with 4). Florian reproduced it on hardware even with glibc from ucs_4.0-0. I've started another test now with UCS 3.2-3 (errata 181, randomly picked), this is still running. Bad news: The fresh UCS 4.1-0 build of eglibc with the patch from Comment 20 didn't fix the bug, it was still reproducible (on xen3).
(In reply to Arvid Requate from comment #23) > The reproducer code of Comment 21 triggers the bug after a couple of minutes > either on hardware or on a KVM machine if the system has more than one CPU > (I tested with 4). Florian reproduced it on hardware even with glibc from > ucs_4.0-0. I've started another test now with UCS 3.2-3 (errata 181, > randomly picked), this is still running. If it works with an older kernel, maybe this patch is relevant: https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/e341694e3eb57fcda9f1adc7bfea42fe080d8d7a
Looks like a Bug in the Linux Kernel 4.1.10+: <http://marc.info/?l=linux-kernel&m=144863837825887&w=2>
Created attachment 7325 [details] undo_netlink-replace-rhash_portid-with-bound.patch This patch reverts the following upstream commits: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/patch/?id=4e27762417669cb459971635be550eb7b5598286 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/patch/?id=c0bb07df7d981e4091432754e30c9c720e2c0c78 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/patch/?id=c5adde9468b0714a051eac7f9666f23eb10b61f7 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/patch/?id=d48623677191e0f035d7afd344f92cf880b01f8e I've built a temporary kernel and the reproducer runs in the fifth round without any problems. It was not a straight forward revert, so a careful check is required. I'll build a new kernel in the glibc scope where it can be easily tested.
I've started a rebuild in the errata4.1-0 scope (r15481) and I've added a dummy YAML file (r65972). Until the errata has been announced the best workaround is to downgrade to Kernel 4.16. On amd64 systems: univention-install linux-image-3.16.0-ucs135-amd64 On i386 systems: univention-install linux-image-3.16.0-ucs135-686-pae After the reboot the 3.16 kernel should be selected in the grub menu.
Hi Stefan, Perfect!. Worked here, the only burden is the Docker does not rise in the kernel 3.16, but this is irrelevant, and we can do without for a while. Thank you
(In reply to Stefan Gohmann from comment #26) > Created attachment 7325 [details] > undo_netlink-replace-rhash_portid-with-bound.patch No, that patch reverts more than the 4 named patches; some of them are even from older kernels: > 4e27762417669cb459971635be550eb7b5598286 OK > c0bb07df7d981e4091432754e30c9c720e2c0c78 FAIL: patch from v4.1 fixing the next > c5adde9468b0714a051eac7f9666f23eb10b61f7 FAIL: patch from v4.0 > d48623677191e0f035d7afd344f92cf880b01f8e OK It reverts all patches back to v4.1.0, as that patch is equal to `git diff v4.1.12..v4.1 -- net/netlink/af_netlink.[ch]` e962218 netlink: Trim skb to alloc size to avoid MSG_TRUNC d486236 netlink: Replace rhash_portid with bound 4e27762 netlink: Fix autobind race condition that leads to zero port ID 65d48c6 netlink, mmap: transform mmap skb into full skb on taps d397617 netlink: make sure -EBUSY won't escape from netlink_insert b265c30 netlink: don't hold mutex in rcu callback when releasing mmapd ring r15482 | Bug #40059: Linux af_netlink.[ch] revert
r65999 univention-kernel-image-signed (2.0.1-1) r66000 univention-kernel-image-signed.yaml
(In reply to Philipp Hahn from comment #29) > (In reply to Stefan Gohmann from comment #26) > > Created attachment 7325 [details] > > undo_netlink-replace-rhash_portid-with-bound.patch > > No, that patch reverts more than the 4 named patches; some of them are even > from older kernels: I know, I followed the "Fixes:" lines in the commits. However if it works, I'm fine with it.
(In reply to Janek Walkenhorst from comment #30) > r65999 univention-kernel-image-signed (2.0.1-1) FAIL: Still references ucs1*5*3 instead of ucs1*6*3 Package: linux Version: 4.1.6-1.163.201511301312 Branch: ucs_4.1-0 Scope: errata4.1-0 r66011 | Bug #40059: Update to ucs163 r66010 | Bug #40059: Revert Linux kernel NETLINK changes Package: univention-kernel-image Version: 9.0.0-5.83.201512011131 Branch: ucs_4.1-0-errata4.1-0 Scope: errata4.1-0 Package: univention-kernel-image-signed Version: 2.0.0-3.11.201512011131 Branch: ucs_4.1-0 Scope: errata4.1-0 r66012 | Bug #40059: linux YAML linux.yaml univention-kernel-image-signed.yaml univention-kernel-image.yaml
Verified: * linux: ** Injected debian patch exactly reverts d486236 and 4e27762 ** Injected debian patch is applied during built ** advisory OK * eglibc ** Injected debian patch includes these two upstream commits: *** https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fda389c8f0311dd5786be91a7b54b9f935fcafa1 *** https://sourceware.org/git/?p=glibc.git;a=commit;h=636064eb4c03397c86aa26e489e68f952bd5e53f ** Injected debian patch is applied during built ** FAIL: eglibc.yaml: version number still was 2.13-38.26.201511251527 I adjusted it to 2.13-38.27.201511261620 * Metapackages rebuilt: ** univention-kernel-image-signed ** univention-kernel-image ** advisories OK * Package update test (amd64): Ok * Functional test (amd64, KVM, 4 vCPUs): OK Before: * Linux master10 4.1.0-ucs153-amd64 #1 SMP Debian 4.1.6-1.153.201510270857 (2015-10-27) x86_64 GNU/Linux * Reproducer deadlocks After: * Reboot OK * Linux master10 4.1.0-ucs163-amd64 #1 SMP Debian 4.1.6-1.163.201511301312 (2015-11-30) x86_64 GNU/Linux * Reproducer exits successfully
Second part of the QA: * Functional test (amd64, Hardware, 4 vCPUs): OK ** Before: *** 4.1.0-ucs153-amd64 #1 SMP Debian 4.1.6-1.153.201510270857 (2015-10-27) x86_64 GNU/Linux *** Reproducer Fail (deadlocks) ** After: *** Reboot OK *** 4.1.0-ucs163-amd64 #1 SMP Debian 4.1.6-1.163.201511301312 (2015-11-30) x86_64 GNU/Linux *** Reproducer: OK (exits successfully) * Package update test (i386) and reproducer tests: OK
http://errata.software-univention.de/ucs/4.1/8.html http://errata.software-univention.de/ucs/4.1/9.html http://errata.software-univention.de/ucs/4.1/10.html http://errata.software-univention.de/ucs/4.1/11.html
Another dead-lock report in Asterisk on LKML with a proposed patch: <http://marc.info/?l=linux-kernel&m=144905644311953&w=2>
(In reply to Philipp Hahn from comment #36) > Another dead-lock report in Asterisk on LKML with a proposed patch: > <http://marc.info/?l=linux-kernel&m=144905644311953&w=2> Here's the culprit/missing patch: <http://marc.info/?l=linux-kernel&m=144943540403972&w=2> orig patch from 4.3: 1f770c0a09da855a2b51af6d19de97fb955eca85 bad backport to 4.1: 4e27762417669cb459971635be550eb7b5598286
The fix went into 4.1.14: $ git tag --contains a52ec6de6d1638e8c203d7188c55627f75371612 v4.1.14 v4.1.15