Bug 41048 - kernel:[6201680.032002] BUG: soft lockup - CPU#1 stuck for 22s! [smbd:11900]
Status: CLOSED DUPLICATE of bug 40838
Product: UCS
Classification: Unclassified
Component: Kernel
Version: UCS 4.0
Hardware: Other Linux
Importance: P5 normal
Target Milestone: UCS 4.0-5-errata
Assigned To: Philipp Hahn
QA Contact: Janek Walkenhorst
Depends on:
Blocks: 41051
Reported: 2016-04-13 12:22 CEST by Stefan Gohmann
Modified: 2016-06-01 18:17 CEST (History)
Description Stefan Gohmann univentionstaff 2016-04-13 12:22:24 CEST
Ticket #2015121821000574

In a customer environment, the following kernel trace is logged from time to time:

Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:14256]
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] Modules linked in: cpuid ppdev lp ip6t_REJECT ipt_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_mangle ip6table_filter ip6_tables xt_state iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc quota_v2 quota_tree psmouse processor parport_pc i2c_piix4 parport pcspkr thermal_sys joydev serio_raw virtio_balloon evdev ext4 crc16 mbcache jbd2 dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_net virtio_blk uhci_hcd ehci_hcd usbcore floppy usb_common ata_piix ttm drm_kms_helper drm libata i2c_core virtio_pci virtio_ring virtio scsi_mod button
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] CPU: 0 PID: 14256 Comm: smbd Not tainted 3.16.0-ucs165-amd64 #1 Debian 3.16.7-ckt20-1+deb8u3~bpo70+1.165.201601221131
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] task: ffff8801639a8110 ti: ffff88014fb2c000 task.ti: ffff88014fb2c000
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] RIP: 0010:[<ffffffff815509fb>]  [<ffffffff815509fb>] _raw_spin_lock+0x1b/0x30
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] RSP: 0018:ffff88014fb2fe50  EFLAGS: 00000212
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] RAX: 0000000000000db6 RBX: ffffffff8119fffe RCX: 0000000100070007
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] RDX: 0000000000000d9d RSI: ffff880102405440 RDI: ffff880102405750
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] RBP: 0000000000000028 R08: 00000000570de732 R09: 0000000000000005
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] R10: ffffffffffffffff R11: 0000000000000000 R12: 000000000000006e
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] R13: 0000000400000001 R14: ffff88011b7140c8 R15: ffff880216604858
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] FS:  00007f40fd760720(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] CR2: 00007f40ff423df0 CR3: 00000001a37c4000 CR4: 00000000000006f0
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] Stack:
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006]  ffffffff814ee608 ffff8801feeaa800 0000000000000028 ffff8801feeaa800
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006]  ffffffff814f0543 ffffffff81673320 ffff88014fb2fe94 ffff880216f7b000
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006]  00000028bfb170b0 ffff8800730ef018 ffff8801bfb17080 000000000000006e
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] Call Trace:
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006]  [<ffffffff814ee608>] ? unix_state_double_lock+0x28/0x70
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006]  [<ffffffff814f0543>] ? unix_dgram_connect+0x93/0x250
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006]  [<ffffffff8143b128>] ? SYSC_connect+0xe8/0x100
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006]  [<ffffffff81550f8d>] ? system_call_fast_compare_end+0x10/0x15
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] Code: b8 01 00 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 03 c3 f3 90 <0f> b7 07 66 39 d0 75 f6 c3 66 66 66 2e 0f 1f 84 00 00 00 00 00
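
The call trace shows smbd spinning in _raw_spin_lock, reached from unix_dgram_connect via unix_state_double_lock. As background, a helper of that kind takes the state locks of two sockets together; the usual way to do this safely is to acquire the locks in a fixed order and to take the lock only once when both arguments are the same object. The following Python sketch illustrates that general technique with threading.Lock. It is not the kernel code, and Sock, double_lock, and double_unlock are hypothetical names chosen for illustration.

```python
import threading


class Sock:
    """Hypothetical stand-in for a socket that carries its own state lock."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()


def double_lock(a, b):
    """Acquire the locks of a and b without risking an ABBA deadlock."""
    if a is b:
        # Same object: a non-reentrant lock must be taken only once.
        a.lock.acquire()
        return
    # Fixed global order (here: by id()) so that two threads locking the
    # same pair always acquire the locks in the same sequence.
    first, second = sorted((a, b), key=id)
    first.lock.acquire()
    second.lock.acquire()


def double_unlock(a, b):
    """Release what double_lock(a, b) acquired."""
    a.lock.release()
    if a is not b:
        b.lock.release()
```

If some other code path leaves one of these locks held, or takes the pair in a different order, a later caller of double_lock never gets past its acquire. With a kernel spinlock the waiting CPU busy-loops instead of sleeping, which is the kind of situation the soft-lockup watchdog reports.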

The trace is logged repeatedly, roughly every 30 seconds:
root@ucs-server01:~# grep stuck /var/log/syslog 
Apr 13 08:29:54 ucs-server01 kernel: [6190608.036006] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:14256]
Apr 13 08:30:22 ucs-server01 kernel: [6190636.036008] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:14256]
Apr 13 08:30:54 ucs-server01 kernel: [6190668.036006] BUG: soft lockup - CPU#0 stuck for 22s! [smbd:14256]
Apr 13 08:31:22 ucs-server01 kernel: [6190696.036006] BUG: soft lockup - CPU#0 stuck for 22s! [smbd:14256]
Apr 13 08:31:58 ucs-server01 kernel: [6190732.036007] BUG: soft lockup - CPU#0 stuck for 22s! [smbd:14256]
[...]
Apr 13 10:10:06 ucs-server01 kernel: [6196620.036006] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:14256]
Apr 13 10:10:42 ucs-server01 kernel: [6196656.036007] BUG: soft lockup - CPU#0 stuck for 22s! [smbd:14256]
Apr 13 10:11:10 ucs-server01 kernel: [6196684.036010] BUG: soft lockup - CPU#0 stuck for 22s! [smbd:14256]
Apr 13 10:11:46 ucs-server01 kernel: [6196720.036006] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:14256]
Apr 13 10:12:15 ucs-server01 kernel: [6196748.036007] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:14256]
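
To gauge how often and for how long the lockup recurs, the syslog lines above can be summarized with a small script. The following is a minimal sketch in Python; the regular expression and the summarize helper are illustrative assumptions based only on the message format quoted in this report, and other kernel versions may word the message differently.

```python
import re
from collections import defaultdict

# Pattern derived from the messages quoted in this report.
LOCKUP_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def summarize(lines):
    """Group soft-lockup reports by (cpu, comm, pid).

    Returns a dict mapping each key to the number of reports and the first
    and last kernel uptime stamps (in seconds) at which it was seen.
    """
    stats = defaultdict(lambda: {"count": 0, "first": None, "last": None})
    for line in lines:
        m = LOCKUP_RE.search(line)
        if not m:
            continue  # not a soft-lockup line
        key = (int(m["cpu"]), m["comm"], int(m["pid"]))
        uptime = float(m["uptime"])
        entry = stats[key]
        entry["count"] += 1
        if entry["first"] is None:
            entry["first"] = uptime
        entry["last"] = uptime
    return dict(stats)


if __name__ == "__main__":
    # Assumes the same log file as the grep command above.
    with open("/var/log/syslog", errors="replace") as fh:
        for (cpu, comm, pid), s in summarize(fh).items():
            span = s["last"] - s["first"]
            print(f"CPU#{cpu} {comm}[{pid}]: {s['count']} reports over {span:.0f}s")
```

Keying on the kernel uptime stamp rather than the syslog date avoids having to guess the year, which the syslog timestamp format shown here does not carry.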

Eventually the server hangs.

Workaround: Downgrade to kernel 3.10.
Comment 1 Philipp Hahn univentionstaff 2016-04-13 17:10:42 CEST
3.16.0-ucs165-amd64 = 3.16.7-ckt20-1+deb8u3~bpo70+1.165.201601221131

which contains the problematic patch as a Debian-quilt-addon-patch:
  linux-3.16.7-ckt20/debian/patches/bugfix/all/unix-avoid-use-after-free-in-ep_remove_wait_queue.patch

The fix was released with 3.16.7-ckt27

$ git log --oneline v3.16.7..v3.16.7-ckt26 -- net/unix/af_unix.c 
456fe45 af_unix: Guard against other == sk in unix_dgram_sendmsg
c75afa1 af_unix: Don't set err in unix_stream_read_generic unless there was an error
03c7059 unix: correctly track in-flight fds in sending process user_struct
1906035 af_unix: fix struct pid memory leak
660f0e9 unix: properly account for FDs passed over unix sockets
9d81966 af_unix: Revert 'lock_interruptible' in stream receive code
6e23851 unix: avoid use-after-free in ep_remove_wait_queue
f465fb2 net/unix: fix logic about sk_peek_offset
f50e5d9 af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag
5d9ab1e unix/caif: sk_socket can disappear when state is unlocked

$ git describe --contains 456fe45 6e23851
v3.16.7-ckt26~91
v3.16.7-ckt22~89
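
For each commit, the git describe --contains output names a tag that contains it plus an ancestry suffix such as ~91, so v3.16.7-ckt26~91 means the commit is reachable from tag v3.16.7-ckt26. A minimal Python sketch for pulling the tag and the ckt level out of such a line follows; parse_contains and ckt_level are hypothetical helpers, not part of git or of any UCS tooling.

```python
import re


def parse_contains(described):
    """Split 'TAG~N' (or a bare 'TAG') into (tag, suffix)."""
    # Git ref names cannot contain '~' or '^', so the first such character
    # starts the ancestry suffix.
    m = re.match(r"^([^~^]+)(.*)$", described.strip())
    if not m:
        raise ValueError(f"unexpected describe output: {described!r}")
    return m.group(1), m.group(2)


def ckt_level(version):
    """Return the integer after '-ckt' in a tag or version string, or None."""
    m = re.search(r"-ckt(\d+)", version)
    return int(m.group(1)) if m else None


if __name__ == "__main__":
    for line in ("v3.16.7-ckt26~91", "v3.16.7-ckt22~89"):
        tag, suffix = parse_contains(line)
        print(tag, suffix, ckt_level(tag))
```

Applied to the package version at the top of this comment, ckt_level returns 20, which is below both tags. That comparison only covers the upstream stable base; it says nothing about patches carried separately in the Debian quilt series, such as the one named above.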

UCS-4.0-5 maintenance will end in 4 days <http://updates.software-univention.de/download/ucs-maintenance/4.0-5.yaml>, so upgrade to 4.1-1 now which has a fixed 4.1 kernel.
Comment 2 Stefan Gohmann univentionstaff 2016-04-14 06:06:49 CEST
(In reply to Philipp Hahn from comment #1)
> 3.16.0-ucs165-amd64 = 3.16.7-ckt20-1+deb8u3~bpo70+1.165.201601221131
> 
> which contains the problematic patch as a Debian-quilt-addon-patch:
>  
> linux-3.16.7-ckt20/debian/patches/bugfix/all/unix-avoid-use-after-free-in-ep_remove_wait_queue.patch
> 
> The fix was released with 3.16.7-ckt27

Then we should upgrade the kernel or adjust the patch.

> UCS-4.0-5 maintenance will end in 4 days
> <http://updates.software-univention.de/download/ucs-maintenance/4.0-5.yaml>,
> so upgrade to 4.1-1 now which has a fixed 4.1 kernel.

No, UCS 4.0 will remain under maintenance until at least the end of May.
Comment 3 Philipp Hahn univentionstaff 2016-04-15 16:21:40 CEST
<http://incoming.debian.org/debian-buildd/pool/main/l/linux/>
$ repo_admin.py -F -p linux -r 4.0 -s errata4.0-5
r16410 | linux-3.16.7-ckt25-2~bpo70+1 UCS-4.0-5
r16411 | repo_admin patch copy
r16412 | revert r16411

bugfix/all/af_unix-guard-against-other-sk-in-unix_dgram_sendmsg.patch

Package: linux
Version: 3.16.7-ckt25-2~bpo70+1.191.201604151111
Branch: ucs_4.0-0-errata4.0-5
Scope: errata4.0-5

*** This bug has been marked as a duplicate of bug 40838 ***