Bug 40558 - NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [smbd:4615]
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [smbd:4615]
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Kernel
UCS 4.1
Other Linux
: P5 critical (vote)
: UCS 4.1-0-errata
Assigned To: Philipp Hahn
Felix Botner
:
Depends on:
Blocks: 42614 42927
  Show dependency treegraph
 
Reported: 2016-02-02 07:45 CET by Stefan Gohmann
Modified: 2016-11-10 22:30 CET (History)
9 users (show)

See Also:
What kind of report is it?: ---
What type of bug is this?: ---
Who will be affected by this bug?: ---
How will those affected feel about the bug?: ---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Stefan Gohmann univentionstaff 2016-02-02 07:45:56 CET
With the latest Kernel upgrade (ucs167, upstream kernel 4.1.16) we received several kernel traces. These traces are critical because smbd isn't running properly.

A reboot with the previous kernel (ucs163, upstream kernel 4.1.12) seems to work.

Feb  1 11:48:13 host kernel: [307999.162254] INFO: rcu_sched self-detected stall on CPU { 0}  (t=5250 jiffies                                                                   g=6586733 c=6586732 q=6757)
Feb  1 11:48:13 host kernel: [307999.162264] Task dump for CPU 0:
Feb  1 11:48:13 host kernel: [307999.162267] smbd            R running      0  4615   4609 0x00000008
Feb  1 11:48:13 host kernel: [307999.162272]  00200082 f5863b90 c10b3fe9 c1682cc0 c1682cc0 c1682cc0 f79d2b00 f                                                                  5863bdc
Feb  1 11:48:13 host kernel: [307999.162276]  c10b722d c15bd400 00001482 0064816d 0064816c 00001a65 f79cd840 c                                                                  108b166
Feb  1 11:48:13 host kernel: [307999.162280]  00000001 f5863bb8 f5863bb8 00000000 c1682cc0 f6808cf0 00001a65 f                                                                  6808cf0
Feb  1 11:48:13 host kernel: [307999.162285] Call Trace:
Feb  1 11:48:13 host kernel: [307999.162296]  [<c10b3fe9>] ? rcu_dump_cpu_stacks+0x79/0xc0
Feb  1 11:48:13 host kernel: [307999.162300]  [<c10b722d>] ? rcu_check_callbacks+0x3cd/0x630
Feb  1 11:48:13 host kernel: [307999.162304]  [<c108b166>] ? account_process_tick+0x66/0x160
Feb  1 11:48:13 host kernel: [307999.162307]  [<c10bbe4f>] ? update_process_times+0x2f/0x60
Feb  1 11:48:13 host kernel: [307999.162310]  [<c10cbf9d>] ? tick_sched_handle.isra.12+0x2d/0x60
Feb  1 11:48:13 host kernel: [307999.162328]  [<c10cc210>] ? tick_sched_timer+0x40/0x80
Feb  1 11:48:13 host kernel: [307999.162331]  [<c10bc6b0>] ? __remove_hrtimer+0x40/0xa0
Feb  1 11:48:13 host kernel: [307999.162334]  [<c10bc97f>] ? __run_hrtimer+0x6f/0x190
Feb  1 11:48:13 host kernel: [307999.162337]  [<c10cc1d0>] ? tick_sched_do_timer+0x30/0x30
Feb  1 11:48:13 host kernel: [307999.162339]  [<c10bd15f>] ? hrtimer_interrupt+0xef/0x260
Feb  1 11:48:13 host kernel: [307999.162343]  [<c119ae3d>] ? getname_kernel+0x2d/0x100
Feb  1 11:48:13 host kernel: [307999.162348]  [<c1046f7f>] ? local_apic_timer_interrupt+0x2f/0x60
Feb  1 11:48:13 host kernel: [307999.162353]  [<c14e4543>] ? smp_apic_timer_interrupt+0x33/0x50
Feb  1 11:48:13 host kernel: [307999.162355]  [<c14e3c7c>] ? apic_timer_interrupt+0x34/0x3c
Feb  1 11:48:13 host kernel: [307999.162358]  [<c14e2dc1>] ? _raw_spin_lock+0x51/0x70
Feb  1 11:48:13 host kernel: [307999.162362]  [<c148c075>] ? unix_state_double_lock+0x25/0x60
Feb  1 11:48:13 host kernel: [307999.162365]  [<c148de10>] ? unix_dgram_connect+0x90/0x1f0
Feb  1 11:48:13 host kernel: [307999.162369]  [<c13e4267>] ? SYSC_connect+0xc7/0xe0
Feb  1 11:48:13 host kernel: [307999.162371]  [<c13e2931>] ? sock_map_fd+0x41/0x60
Feb  1 11:48:13 host kernel: [307999.162374]  [<c13e5014>] ? SYSC_socketcall+0x1b4/0xa20
Feb  1 11:48:13 host kernel: [307999.162376]  [<c10c2940>] ? ktime_get+0x50/0x100
Feb  1 11:48:13 host kernel: [307999.162379]  [<c10466db>] ? lapic_next_event+0x1b/0x20
Feb  1 11:48:13 host kernel: [307999.162381]  [<c10ca0ed>] ? clockevents_program_event+0x9d/0x140
Feb  1 11:48:13 host kernel: [307999.162385]  [<c129e068>] ? list_del+0x8/0x20
Feb  1 11:48:13 host kernel: [307999.162388]  [<c1097ef7>] ? remove_wait_queue+0x27/0x40
Feb  1 11:48:13 host kernel: [307999.162392]  [<c11c8795>] ? inotify_read+0x295/0x340
Feb  1 11:48:13 host kernel: [307999.162396]  [<c10acc76>] ? handle_irq_event_percpu+0xa6/0x1a0
Feb  1 11:48:13 host kernel: [307999.162399]  [<c11a786f>] ? set_close_on_exec+0x2f/0x60
Feb  1 11:48:13 host kernel: [307999.162402]  [<c119d084>] ? do_fcntl+0x2f4/0x4e0
Feb  1 11:48:13 host kernel: [307999.162405]  [<c107d6df>] ? commit_creds+0xff/0x1f0
Feb  1 11:48:13 host kernel: [307999.162407]  [<c119d380>] ? SyS_fcntl64+0x60/0x100
Feb  1 11:48:13 host kernel: [307999.162409]  [<c13e5953>] ? SyS_socketcall+0x13/0x20
Feb  1 11:48:13 host kernel: [307999.162412]  [<c14e30db>] ? sysenter_do_call+0x12/0x12
Feb  1 11:48:40 host kernel: [308026.678925] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [smbd:4615]

Ticket #2016020121000167
Ticket #2016012821000323
Comment 1 Stefan Gohmann univentionstaff 2016-02-02 20:59:06 CET
Next ticket: Ticket #2016020221000432

See ticket for more logfiles:
 https://otrs.knut.univention.de/otrs/index.pl?Action=AgentTicketZoom;TicketID=909606;ArticleID=1832270#
Comment 2 Stefan Gohmann univentionstaff 2016-02-04 19:19:25 CET
Next report:
 http://forum.univention.de/viewtopic.php?f=48&t=4840&p=18914
Comment 3 Michael Grandjean univentionstaff 2016-02-05 13:38:54 CET
Probaby another case via Presales (Issue #4231)
Comment 4 Philipp Hahn univentionstaff 2016-02-05 17:07:32 CET
(In reply to Michael Grandjean from comment #3)
> Probaby another case via Presales (Issue #4231)
| "Sie sind nicht berechtigt, auf diese Seite zuzugreifen"


The bug happens on both architectures: _raw_spin_lock is either 0x70(i386) or 0x50(amd64) bytes long.

`smbtorture base.fdpass.fdpass` did not show anything thus far.
Simple Python unix_dram_{server,cliens} are running fine also.

Patch between 4.4 and 4.1 look same, but <https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=1586a5877db9eee313379738d6581bc7c6ffb5e3> is missing in 4.1.16?

RFH @ LKML: <https://lkml.org/lkml/2016/2/2/474>

Now building test kernel in 4.1.6-1-glibc with net/unix/reverted:
> $ git log --oneline  v4.1.12..v4.1.17 -- net/unix
> dc6b0ec unix: properly account for FDs passed over unix sockets
> cc01a0a af_unix: Revert 'lock_interruptible' in stream receive code
> 5c77e26 unix: avoid use-after-free in ep_remove_wait_queue
Plan it to look for testers, if the bug also occurs with that kernel.
Comment 5 Philipp Hahn univentionstaff 2016-02-08 09:05:40 CET
Request for testers

We've built a patched kernel, but we don't know yet if it fixes the problem.
Willing testers can include the repository containing the patched kernel and should test, if the bug still occurs. Feedback is very much appreciated.

 ucr set repository/online/component/glibc{=yes,/parts=unmaintained}
 case "$(uname -m)" in
 x86_64) univention-install linux-image-4.1.0-ucs170-amd64 ;;
 i?86) univention-install linux-image-4.1.0-ucs170-686-pae ;;
 esac
 reboot

The kernel is not signed for UEFI/secured-boot!
Comment 6 Philipp Hahn univentionstaff 2016-02-08 16:32:00 CET
r15870 | Also build for errata4.1-0

Package: linux
Version: 4.1.6-1.171.201602081336
Branch: ucs_4.1-0
Scope: errata4.1-0
Comment 7 Nico Stöckigt univentionstaff 2016-02-10 11:50:28 CET
happend again here: Ticket#2016021021000328
all related logs and traces within ticket.
Comment 8 Philipp Hahn univentionstaff 2016-02-10 13:28:01 CET
<https://bugs.launchpad.net/ubuntu/+source/linux-lts-trusty/+bug/1543980> for 3.13, which also has <https://lkml.org/lkml/2016/2/5/587> a back-port of:

willy tarreau (1):
      unix: properly account for FDs passed over unix sockets

Not yet able to trigger bug with proposed test "samba3.raw.composite".
Comment 9 Philipp Hahn univentionstaff 2016-02-10 14:53:25 CET
Prepare for UCS-4.1-1 release with net/unix/ reverted.
We can't trigger the bug and have no proof that the bug is resolved.

r67311 | Bug #40558 kernel: Update to ucs171 (4.1.16-unix)

Package: univention-kernel-image-signed
Version: 2.0.0-5.14.201602101441
Branch: ucs_4.1-0
Scope: errata4.1-0

r67312 | Bug #40558 kernel: Update to ucs171 (4.1.16-unix)

Package: univention-kernel-image
Version: 9.0.0-7.88.201602101444
Branch: ucs_4.1-0
Scope: errata4.1-0

r67315 | Bug #40558 kernel: Update to ucs171 (4.1.16-unix) YAML
 linux.yaml
 univention-kernel-image-signed.yaml
 univention-kernel-image.yaml
Comment 10 Felix Botner univentionstaff 2016-02-10 18:26:14 CET
I can reproduce this on hardware (xen15 10.201.16.16) and KVM

steps:
* UCS 4.1-0 4.1.0-ucs167-amd64
* ucr set repository/online/unmaintained='yes'
* apt-get install bison comerr-dev debhelper docbook-xml docbook-xsl faketime flex libacl1-dev libaio-dev libattr1-dev libblkid-dev libbsd-dev libcap-dev libcups2-dev libgnutls-dev xfslibs-dev libldap2-dev libldb-dev libncurses5-dev libntdb-dev libpam0g-dev libparse-yapp-perl libpopt-dev libreadline-dev libsubunit-dev libtalloc-dev libtdb-dev libtevent-dev perl perl-modules pkg-config po-debconf python-all-dev python-dnspython python-ldb python-ldb-dev python-ntdb python-talloc-dev python-tdb python-testtools python3 subunit xsltproc zlib1g-dev
* git clone git://git.samba.org/samba.git samba
* cd samba
* ./configure.developer
* TDB_NO_FSYNC=1 make -j test FAIL_IMMEDIATELY=1 SOCKET_WRAPPER_KEEP_PCAP=1 TESTS="samba3.raw.composite"

Samba test works with 4.1.0-ucs171-amd64 so far (started overnight test loop ...)

But there are some strange kernel messages

ACPI Warning: SystemIO range 0x0000000000000828-0x000000000000082F conflicts with OpRegion 0x0000000000000800-0x000000000000084F (\PMRG) (20150410/utaddress-254)

iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS

kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using workaround

and in particularly 

 Request for unknown module key 'Build time autogenerated kernel key: 006416f63733d99e57be1fd3a06d66c85b9e2c23' err -11

-> dmesg | grep "006416f63733d99e57be1fd3a06d66c85b9e2c23" | wc -l
194
Comment 11 Philipp Hahn univentionstaff 2016-02-11 15:08:31 CET
(In reply to Felix Botner from comment #10)
> I can reproduce this on hardware (xen15 10.201.16.16) and KVM
> 
> steps:
> * UCS 4.1-0 4.1.0-ucs167-amd64
> * ucr set repository/online/unmaintained='yes'
> * apt-get install bison comerr-dev debhelper docbook-xml docbook-xsl
> faketime flex libacl1-dev libaio-dev libattr1-dev libblkid-dev libbsd-dev
> libcap-dev libcups2-dev libgnutls-dev xfslibs-dev libldap2-dev libldb-dev
> libncurses5-dev libntdb-dev libpam0g-dev libparse-yapp-perl libpopt-dev
> libreadline-dev libsubunit-dev libtalloc-dev libtdb-dev libtevent-dev perl
> perl-modules pkg-config po-debconf python-all-dev python-dnspython
> python-ldb python-ldb-dev python-ntdb python-talloc-dev python-tdb
> python-testtools python3 subunit xsltproc zlib1g-dev

If you do repository/online/sources=yes, you can then do "apt-get build-dep samba"

> * git clone git://git.samba.org/samba.git samba
> * cd samba
> * ./configure.developer
> * TDB_NO_FSYNC=1 make -j test FAIL_IMMEDIATELY=1 SOCKET_WRAPPER_KEEP_PCAP=1
> TESTS="samba3.raw.composite"
> 
> Samba test works with 4.1.0-ucs171-amd64 so far (started overnight test loop
> ...)

Good.

> But there are some strange kernel messages

Comparing with 3.16.0-ucs135-amd64...

> ACPI Warning: SystemIO range 0x0000000000000828-0x000000000000082F conflicts
> with OpRegion 0x0000000000000800-0x000000000000084F (\PMRG)
> (20150410/utaddress-254)

Known firmware bug - can be ignored

> iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS

disabled in BIOS, so message is expected - can be ignored

> kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using
> workaround

Known hardware bug - can be ignored

> and in particularly 
> 
>  Request for unknown module key 'Build time autogenerated kernel key:
> 006416f63733d99e57be1fd3a06d66c85b9e2c23' err -11
> 
> -> dmesg | grep "006416f63733d99e57be1fd3a06d66c85b9e2c23" | wc -l
> 194

New bug in this kernel, but problem is known to as as part of Bug #38214:
<https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1494562>
<https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1494943>
<https://launchpadlibrarian.net/217914743/wily+master-next-0001-x509-only-prefix-strip-raw-serial-numbers.patch>

Basically its the problem, that we as Univention don't use a fixed KEY to sign our kernel, so the build process is forced to generate a random key each time a kernel is built. This time (1 out of 16) the built generated a key starting with '0', which exposes this bug.


Package: linux
Version: 4.1.6-1.174.201602110938
Branch: ucs_4.1-0
Scope: errata4.1-0


I tracked it down by bisecting the linux kernel to the commit of "unix: avoid use-after-free in ep_remove_wait_queue"; see <https://lkml.org/lkml/2016/2/2/474>
As that patch was reverted for ucs174, that kernel can be used for now.


r67358 | Bug #40558 linux: Update to ucs174 (4.1.16-unix)

Package: univention-kernel-image-signed
Version: 2.0.0-6.15.201602111457
Branch: ucs_4.1-0
Scope: errata4.1-0

r67359 | Bug #40558 kernel: Update to ucs174 (4.1.16-unix)

Package: univention-kernel-image
Version: 9.0.0-8.90.201602111504
Branch: ucs_4.1-0
Scope: errata4.1-0

r67360 | Bug #40558 kernel: Update to ucs174 (4.1.16-unix) YAML
 branches/ucs-4.1/ucs-4.1-0/doc/errata/staging/linux.yaml                          | 2 +-
 branches/ucs-4.1/ucs-4.1-0/doc/errata/staging/univention-kernel-image-signed.yaml | 2 +-
 branches/ucs-4.1/ucs-4.1-0/doc/errata/staging/univention-kernel-image.yaml        | 2 +-
Comment 12 Felix Botner univentionstaff 2016-02-12 11:58:49 CET
kernel 174:
 * OK - on hardware
 * OK - amazon ec2
 * OK - kvm

on hardware:
 * OK - Samba test suite
 * OK - uvmm/kvm windows 10 installation
 * OK - uvmm/kvm ucs 4.1 installation
 * OK - uvmm/kvm ucs 4.1 installation with update to kernel 174

on all:
 * OK - TDB_NO_FSYNC=1 make -j \
   test SOCKET_WRAPPER_KEEP_PCAP=1 TESTS="samba3.raw.composite"
 * OK - kernel make kselftest

OK - dkms (sources, module build, modprobe)

OK - linux.yaml
OK - univention-kernel-image-signed.yaml
OK - univention-kernel-image.yaml