Univention Bugzilla – Bug 40558
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [smbd:4615]
Last modified: 2016-11-10 22:30:11 CET
With the latest Kernel upgrade (ucs167, upstream kernel 4.1.16) we received several kernel traces. These traces are critical because smbd isn't running properly. A reboot with the previous kernel (ucs163, upstream kernel 4.1.12) seems to work. Feb 1 11:48:13 host kernel: [307999.162254] INFO: rcu_sched self-detected stall on CPU { 0} (t=5250 jiffies g=6586733 c=6586732 q=6757) Feb 1 11:48:13 host kernel: [307999.162264] Task dump for CPU 0: Feb 1 11:48:13 host kernel: [307999.162267] smbd R running 0 4615 4609 0x00000008 Feb 1 11:48:13 host kernel: [307999.162272] 00200082 f5863b90 c10b3fe9 c1682cc0 c1682cc0 c1682cc0 f79d2b00 f 5863bdc Feb 1 11:48:13 host kernel: [307999.162276] c10b722d c15bd400 00001482 0064816d 0064816c 00001a65 f79cd840 c 108b166 Feb 1 11:48:13 host kernel: [307999.162280] 00000001 f5863bb8 f5863bb8 00000000 c1682cc0 f6808cf0 00001a65 f 6808cf0 Feb 1 11:48:13 host kernel: [307999.162285] Call Trace: Feb 1 11:48:13 host kernel: [307999.162296] [<c10b3fe9>] ? rcu_dump_cpu_stacks+0x79/0xc0 Feb 1 11:48:13 host kernel: [307999.162300] [<c10b722d>] ? rcu_check_callbacks+0x3cd/0x630 Feb 1 11:48:13 host kernel: [307999.162304] [<c108b166>] ? account_process_tick+0x66/0x160 Feb 1 11:48:13 host kernel: [307999.162307] [<c10bbe4f>] ? update_process_times+0x2f/0x60 Feb 1 11:48:13 host kernel: [307999.162310] [<c10cbf9d>] ? tick_sched_handle.isra.12+0x2d/0x60 Feb 1 11:48:13 host kernel: [307999.162328] [<c10cc210>] ? tick_sched_timer+0x40/0x80 Feb 1 11:48:13 host kernel: [307999.162331] [<c10bc6b0>] ? __remove_hrtimer+0x40/0xa0 Feb 1 11:48:13 host kernel: [307999.162334] [<c10bc97f>] ? __run_hrtimer+0x6f/0x190 Feb 1 11:48:13 host kernel: [307999.162337] [<c10cc1d0>] ? tick_sched_do_timer+0x30/0x30 Feb 1 11:48:13 host kernel: [307999.162339] [<c10bd15f>] ? hrtimer_interrupt+0xef/0x260 Feb 1 11:48:13 host kernel: [307999.162343] [<c119ae3d>] ? getname_kernel+0x2d/0x100 Feb 1 11:48:13 host kernel: [307999.162348] [<c1046f7f>] ? local_apic_timer_interrupt+0x2f/0x60 Feb 1 11:48:13 host kernel: [307999.162353] [<c14e4543>] ? smp_apic_timer_interrupt+0x33/0x50 Feb 1 11:48:13 host kernel: [307999.162355] [<c14e3c7c>] ? apic_timer_interrupt+0x34/0x3c Feb 1 11:48:13 host kernel: [307999.162358] [<c14e2dc1>] ? _raw_spin_lock+0x51/0x70 Feb 1 11:48:13 host kernel: [307999.162362] [<c148c075>] ? unix_state_double_lock+0x25/0x60 Feb 1 11:48:13 host kernel: [307999.162365] [<c148de10>] ? unix_dgram_connect+0x90/0x1f0 Feb 1 11:48:13 host kernel: [307999.162369] [<c13e4267>] ? SYSC_connect+0xc7/0xe0 Feb 1 11:48:13 host kernel: [307999.162371] [<c13e2931>] ? sock_map_fd+0x41/0x60 Feb 1 11:48:13 host kernel: [307999.162374] [<c13e5014>] ? SYSC_socketcall+0x1b4/0xa20 Feb 1 11:48:13 host kernel: [307999.162376] [<c10c2940>] ? ktime_get+0x50/0x100 Feb 1 11:48:13 host kernel: [307999.162379] [<c10466db>] ? lapic_next_event+0x1b/0x20 Feb 1 11:48:13 host kernel: [307999.162381] [<c10ca0ed>] ? clockevents_program_event+0x9d/0x140 Feb 1 11:48:13 host kernel: [307999.162385] [<c129e068>] ? list_del+0x8/0x20 Feb 1 11:48:13 host kernel: [307999.162388] [<c1097ef7>] ? remove_wait_queue+0x27/0x40 Feb 1 11:48:13 host kernel: [307999.162392] [<c11c8795>] ? inotify_read+0x295/0x340 Feb 1 11:48:13 host kernel: [307999.162396] [<c10acc76>] ? handle_irq_event_percpu+0xa6/0x1a0 Feb 1 11:48:13 host kernel: [307999.162399] [<c11a786f>] ? set_close_on_exec+0x2f/0x60 Feb 1 11:48:13 host kernel: [307999.162402] [<c119d084>] ? do_fcntl+0x2f4/0x4e0 Feb 1 11:48:13 host kernel: [307999.162405] [<c107d6df>] ? commit_creds+0xff/0x1f0 Feb 1 11:48:13 host kernel: [307999.162407] [<c119d380>] ? SyS_fcntl64+0x60/0x100 Feb 1 11:48:13 host kernel: [307999.162409] [<c13e5953>] ? SyS_socketcall+0x13/0x20 Feb 1 11:48:13 host kernel: [307999.162412] [<c14e30db>] ? sysenter_do_call+0x12/0x12 Feb 1 11:48:40 host kernel: [308026.678925] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [smbd:4615] Ticket #2016020121000167 Ticket #2016012821000323
Next ticket: Ticket #2016020221000432 See ticket for more logfiles: https://otrs.knut.univention.de/otrs/index.pl?Action=AgentTicketZoom;TicketID=909606;ArticleID=1832270#
Next report: http://forum.univention.de/viewtopic.php?f=48&t=4840&p=18914
Probaby another case via Presales (Issue #4231)
(In reply to Michael Grandjean from comment #3) > Probaby another case via Presales (Issue #4231) | "Sie sind nicht berechtigt, auf diese Seite zuzugreifen" The bug happens on both architectures: _raw_spin_lock is either 0x70(i386) or 0x50(amd64) bytes long. `smbtorture base.fdpass.fdpass` did not show anything thus far. Simple Python unix_dram_{server,cliens} are running fine also. Patch between 4.4 and 4.1 look same, but <https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=1586a5877db9eee313379738d6581bc7c6ffb5e3> is missing in 4.1.16? RFH @ LKML: <https://lkml.org/lkml/2016/2/2/474> Now building test kernel in 4.1.6-1-glibc with net/unix/reverted: > $ git log --oneline v4.1.12..v4.1.17 -- net/unix > dc6b0ec unix: properly account for FDs passed over unix sockets > cc01a0a af_unix: Revert 'lock_interruptible' in stream receive code > 5c77e26 unix: avoid use-after-free in ep_remove_wait_queue Plan it to look for testers, if the bug also occurs with that kernel.
Request for testers We've built a patched kernel, but we don't know yet if it fixes the problem. Willing testers can include the repository containing the patched kernel and should test, if the bug still occurs. Feedback is very much appreciated. ucr set repository/online/component/glibc{=yes,/parts=unmaintained} case "$(uname -m)" in x86_64) univention-install linux-image-4.1.0-ucs170-amd64 ;; i?86) univention-install linux-image-4.1.0-ucs170-686-pae ;; esac reboot The kernel is not signed for UEFI/secured-boot!
r15870 | Also build for errata4.1-0 Package: linux Version: 4.1.6-1.171.201602081336 Branch: ucs_4.1-0 Scope: errata4.1-0
happend again here: Ticket#2016021021000328 all related logs and traces within ticket.
<https://bugs.launchpad.net/ubuntu/+source/linux-lts-trusty/+bug/1543980> for 3.13, which also has <https://lkml.org/lkml/2016/2/5/587> a back-port of: willy tarreau (1): unix: properly account for FDs passed over unix sockets Not yet able to trigger bug with proposed test "samba3.raw.composite".
Prepare for UCS-4.1-1 release with net/unix/ reverted. We can't trigger the bug and have no proof that the bug is resolved. r67311 | Bug #40558 kernel: Update to ucs171 (4.1.16-unix) Package: univention-kernel-image-signed Version: 2.0.0-5.14.201602101441 Branch: ucs_4.1-0 Scope: errata4.1-0 r67312 | Bug #40558 kernel: Update to ucs171 (4.1.16-unix) Package: univention-kernel-image Version: 9.0.0-7.88.201602101444 Branch: ucs_4.1-0 Scope: errata4.1-0 r67315 | Bug #40558 kernel: Update to ucs171 (4.1.16-unix) YAML linux.yaml univention-kernel-image-signed.yaml univention-kernel-image.yaml
I can reproduce this on hardware (xen15 10.201.16.16) and KVM steps: * UCS 4.1-0 4.1.0-ucs167-amd64 * ucr set repository/online/unmaintained='yes' * apt-get install bison comerr-dev debhelper docbook-xml docbook-xsl faketime flex libacl1-dev libaio-dev libattr1-dev libblkid-dev libbsd-dev libcap-dev libcups2-dev libgnutls-dev xfslibs-dev libldap2-dev libldb-dev libncurses5-dev libntdb-dev libpam0g-dev libparse-yapp-perl libpopt-dev libreadline-dev libsubunit-dev libtalloc-dev libtdb-dev libtevent-dev perl perl-modules pkg-config po-debconf python-all-dev python-dnspython python-ldb python-ldb-dev python-ntdb python-talloc-dev python-tdb python-testtools python3 subunit xsltproc zlib1g-dev * git clone git://git.samba.org/samba.git samba * cd samba * ./configure.developer * TDB_NO_FSYNC=1 make -j test FAIL_IMMEDIATELY=1 SOCKET_WRAPPER_KEEP_PCAP=1 TESTS="samba3.raw.composite" Samba test works with 4.1.0-ucs171-amd64 so far (started overnight test loop ...) But there are some strange kernel messages ACPI Warning: SystemIO range 0x0000000000000828-0x000000000000082F conflicts with OpRegion 0x0000000000000800-0x000000000000084F (\PMRG) (20150410/utaddress-254) iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using workaround and in particularly Request for unknown module key 'Build time autogenerated kernel key: 006416f63733d99e57be1fd3a06d66c85b9e2c23' err -11 -> dmesg | grep "006416f63733d99e57be1fd3a06d66c85b9e2c23" | wc -l 194
(In reply to Felix Botner from comment #10) > I can reproduce this on hardware (xen15 10.201.16.16) and KVM > > steps: > * UCS 4.1-0 4.1.0-ucs167-amd64 > * ucr set repository/online/unmaintained='yes' > * apt-get install bison comerr-dev debhelper docbook-xml docbook-xsl > faketime flex libacl1-dev libaio-dev libattr1-dev libblkid-dev libbsd-dev > libcap-dev libcups2-dev libgnutls-dev xfslibs-dev libldap2-dev libldb-dev > libncurses5-dev libntdb-dev libpam0g-dev libparse-yapp-perl libpopt-dev > libreadline-dev libsubunit-dev libtalloc-dev libtdb-dev libtevent-dev perl > perl-modules pkg-config po-debconf python-all-dev python-dnspython > python-ldb python-ldb-dev python-ntdb python-talloc-dev python-tdb > python-testtools python3 subunit xsltproc zlib1g-dev If you do repository/online/sources=yes, you can then do "apt-get build-dep samba" > * git clone git://git.samba.org/samba.git samba > * cd samba > * ./configure.developer > * TDB_NO_FSYNC=1 make -j test FAIL_IMMEDIATELY=1 SOCKET_WRAPPER_KEEP_PCAP=1 > TESTS="samba3.raw.composite" > > Samba test works with 4.1.0-ucs171-amd64 so far (started overnight test loop > ...) Good. > But there are some strange kernel messages Comparing with 3.16.0-ucs135-amd64... > ACPI Warning: SystemIO range 0x0000000000000828-0x000000000000082F conflicts > with OpRegion 0x0000000000000800-0x000000000000084F (\PMRG) > (20150410/utaddress-254) Known firmware bug - can be ignored > iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS disabled in BIOS, so message is expected - can be ignored > kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using > workaround Known hardware bug - can be ignored > and in particularly > > Request for unknown module key 'Build time autogenerated kernel key: > 006416f63733d99e57be1fd3a06d66c85b9e2c23' err -11 > > -> dmesg | grep "006416f63733d99e57be1fd3a06d66c85b9e2c23" | wc -l > 194 New bug in this kernel, but problem is known to as as part of Bug #38214: <https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1494562> <https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1494943> <https://launchpadlibrarian.net/217914743/wily+master-next-0001-x509-only-prefix-strip-raw-serial-numbers.patch> Basically its the problem, that we as Univention don't use a fixed KEY to sign our kernel, so the build process is forced to generate a random key each time a kernel is built. This time (1 out of 16) the built generated a key starting with '0', which exposes this bug. Package: linux Version: 4.1.6-1.174.201602110938 Branch: ucs_4.1-0 Scope: errata4.1-0 I tracked it down by bisecting the linux kernel to the commit of "unix: avoid use-after-free in ep_remove_wait_queue"; see <https://lkml.org/lkml/2016/2/2/474> As that patch was reverted for ucs174, that kernel can be used for now. r67358 | Bug #40558 linux: Update to ucs174 (4.1.16-unix) Package: univention-kernel-image-signed Version: 2.0.0-6.15.201602111457 Branch: ucs_4.1-0 Scope: errata4.1-0 r67359 | Bug #40558 kernel: Update to ucs174 (4.1.16-unix) Package: univention-kernel-image Version: 9.0.0-8.90.201602111504 Branch: ucs_4.1-0 Scope: errata4.1-0 r67360 | Bug #40558 kernel: Update to ucs174 (4.1.16-unix) YAML branches/ucs-4.1/ucs-4.1-0/doc/errata/staging/linux.yaml | 2 +- branches/ucs-4.1/ucs-4.1-0/doc/errata/staging/univention-kernel-image-signed.yaml | 2 +- branches/ucs-4.1/ucs-4.1-0/doc/errata/staging/univention-kernel-image.yaml | 2 +-
kernel 174: * OK - on hardware * OK - amazon ec2 * OK - kvm on hardware: * OK - Samba test suite * OK - uvmm/kvm windows 10 installation * OK - uvmm/kvm ucs 4.1 installation * OK - uvmm/kvm ucs 4.1 installation with update to kernel 174 on all: * OK - TDB_NO_FSYNC=1 make -j \ test SOCKET_WRAPPER_KEEP_PCAP=1 TESTS="samba3.raw.composite" * OK - kernel make kselftest OK - dkms (sources, module build, modprobe) OK - linux.yaml OK - univention-kernel-image-signed.yaml OK - univention-kernel-image.yaml
<http://errata.software-univention.de/ucs/4.1/112.html> <http://errata.software-univention.de/ucs/4.1/113.html> <http://errata.software-univention.de/ucs/4.1/114.html>