Univention Bugzilla – Bug 30369
Kernel traces on EC2 instance
Last modified: 2013-11-19 06:42:11 CET
We saw some kernel traces with different running AWS Amazon images with UCS. For example: [1585533.125527] Pid: 1342, comm: hald Not tainted 3.2.0-ucs17-amd64 #1 Debian 3.2.30-1.17.201210041157 [1585533.125530] RIP: e030:[<ffffffff8100e19f>] [<ffffffff8100e19f>] do_signal+0x28c/0x635 [1585533.125541] RSP: e02b:ffff8800e9f9fe38 EFLAGS: 00250246 [1585533.125543] RAX: ffff8800e9f9ffd8 RBX: ffff8800e9f9ff58 RCX: 0000000000000000 [1585533.125546] RDX: 0000000000000000 RSI: ffff8800e9f9fec8 RDI: 00007fc02613fd30 [1585533.125548] RBP: 0000000000000021 R08: ffff8800e9f9e000 R09: 0000000000000000 [1585533.125550] R10: ffff8800efc0b1b0 R11: 0000000000000000 R12: ffff880004237a70 [1585533.125552] R13: 00007fc02613fb78 R14: ffff8800042375d0 R15: 00007fc02613fd40 [1585533.125558] FS: 00007fc026140700(0000) GS:ffff8800efc00000(0000) knlGS:0000000000000000 [1585533.125561] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b [1585533.125563] CR2: 00007f4b3c7b0bc0 CR3: 00000000e9cb1000 CR4: 0000000000002660 [1585533.125566] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1585533.125568] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [1585533.125571] Process hald (pid: 1342, threadinfo ffff8800e9f9e000, task ffff8800042375d0) [1585533.125573] Stack: [1585533.125574] 0000000000000000 ffffffff81007555 0000000000000021 ffff8800fffffffa [1585533.125578] 000000000000053c ffffffff81107aa0 ffff8800ea3be3f8 ffff8800040c9bc0 [1585533.125581] ffff8800ea3be410 ffff880003f7bdc0 ffff88000439a140 ffff880003f7bdc0 [1585533.125584] Call Trace: [1585533.125591] [<ffffffff81007555>] ? get_phys_to_machine+0x16/0x58 [1585533.125597] [<ffffffff81107aa0>] ? fput+0x17a/0x1a2 [1585533.125602] [<ffffffff81104a42>] ? filp_close+0x64/0x6c [1585533.125605] [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1 [1585533.125608] [<ffffffff8100e56d>] ? do_notify_resume+0x25/0x67 [1585533.125613] [<ffffffff81047012>] ? schedule_tail+0x24/0x56 [1585533.125618] [<ffffffff813850e0>] ? int_signal+0x12/0x17 [1585533.125620] Code: bd 38 01 00 00 e8 14 a6 04 00 85 c0 0f 85 70 02 00 00 65 48 8b 04 25 88 c6 00 00 8b 90 88 e0 ff ff c7 80 88 e0 ff ff 00 00 00 00 <48> 8b 05 7e f0 67 00 a9 00 00 00 04 74 0a 49 c7 45 08 01 00 00 [1585533.125642] RIP [<ffffffff8100e19f>] do_signal+0x28c/0x635 [1585533.125645] RSP <ffff8800e9f9fe38> [1585533.125648] ---[ end trace 48c97fd9a1e209e1 ]--- More log files attached. Maybe the following links are helpful: http://xen.1045712.n5.nabble.com/xsave-0-workaround-needed-on-3-2-kernels-with-Xen-4-1-or-Xen-unstable-td5676755.html http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UserProvidedkernels.html#UsingUserProvidedKernel
Created attachment 5065 [details] syslog
Created attachment 5066 [details] kern.log
Created attachment 5067 [details] dmesg-out.txt
(In reply to comment #1) > Created an attachment (id=5065) syslog no useful information (In reply to comment #2) > Created an attachment (id=5066) kern.log Feb 4 16:53:28 dc01 kernel: [7197838.497921] alignment check: 0000 [#2] SMP Note that this is already the 2nd OOPs. (In reply to comment #3) > Created an attachment (id=5067) dmesg-out.txt [7127991.898321] alignment check: 0000 [#1] SMP 0000 (dec) --> Not instruction fetch|Kernel mode|Read|Invalid access [1585533.125530] RIP: e030:[<ffffffff8100e19f>] [<ffffffff8100e19f>] do_signal+0x28c/0x635 Looks similar to <http://lists.debian.org/debian-backports/2012/12/msg00044.html>: the user level process is doing some signal stuff. Possible patch <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a349e23d1cf746f8bdc603dcc61fae9ee4a695f6> [7127991.898426] [<ffffffff81007555>] ? get_phys_to_machine+0x16/0x58 # printf '%x\n' $((0xdb+0x16)) f1 # objdump -S ./arch/x86/xen/p2m.o | sed -ne '/^[a-z ]* get_phys_to_machine/,/^0/p' unsigned long get_phys_to_machine(unsigned long pfn) { db: 48 83 c8 ff or $0xffffffffffffffff,%rax df: 48 81 ff ff ff ff 07 cmp $0x7ffffff,%rdi e6: 53 push %rbx e7: 48 89 fb mov %rdi,%rbx unsigned topidx, mididx, idx; if (unlikely(pfn >= MAX_P2M_PFN)) ea: 77 45 ja 131 <get_phys_to_machine+0x56> return INVALID_P2M_ENTRY; topidx = p2m_top_index(pfn); ec: e8 d5 ff ff ff callq c6 <p2m_top_index> /* * The INVALID_P2M_ENTRY is filled in both p2m_*identity * and in p2m_*missing, so returning the INVALID_P2M_ENTRY * would be wrong. */ if (p2m_top[topidx][mididx] == p2m_identity) f1: 48 8b 0d 00 00 00 00 mov 0x0(%rip),%rcx # f8 <get_phys_to_machine+0x1d> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ f8: 89 c0 mov %eax,%eax mididx = p2m_mid_index(pfn); fa: 48 89 da mov %rbx,%rdx fd: 48 c1 ea 09 shr $0x9,%rdx 101: 81 e2 ff 01 00 00 and $0x1ff,%edx 107: 48 8b 04 c1 mov (%rcx,%rax,8),%rax 10b: 48 8b 04 d0 mov (%rax,%rdx,8),%rax 10f: 48 3b 05 00 00 00 00 cmp 0x0(%rip),%rax # 116 <get_phys_to_machine+0x3b> 116: 75 0f jne 127 <get_phys_to_machine+0x4c> return IDENTITY_FRAME(pfn); 118: 48 b8 00 00 00 00 00 mov $0x4000000000000000,%rax 11f: 00 00 40 122: 48 09 d8 or %rbx,%rax 125: eb 0a jmp 131 <get_phys_to_machine+0x56> idx = p2m_index(pfn); 127: 81 e3 ff 01 00 00 and $0x1ff,%ebx return p2m_top[topidx][mididx][idx]; 12d: 48 8b 04 d8 mov (%rax,%rbx,8),%rax } 131: 5b pop %rbx 132: c3 retq [7127991.898455] Code: ... <48> 8b 05 7e f0 67 00 a9 00 00 00 04 74 0a 49 c7 45 08 01 00 00 ^^^^^^^^^^^^^^ Even considering that the 4*00 get patched to point to p2m_top, the rest of the memory seems to be corrupted, as the rest does no longer match the disassembly above. Documentation for decoding Kernel OOPSes: <http://www.dedoimedo.com/computers/crash-analyze.html>
(In reply to comment #4) > Looks similar to > <http://lists.debian.org/debian-backports/2012/12/msg00044.html>: the user > level process is doing some signal stuff. > Possible patch > <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a349e23d1cf746f8bdc603dcc61fae9ee4a695f6> Is this patch part of the current 3.2 kernel?
(In reply to comment #5) > (In reply to comment #4) > > Looks similar to > > <http://lists.debian.org/debian-backports/2012/12/msg00044.html>: the user > > level process is doing some signal stuff. > > Possible patch > > <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a349e23d1cf746f8bdc603dcc61fae9ee4a695f6> > > Is this patch part of the current 3.2 kernel? No, the patch is not part of our current linux_3.2.30-1.17.201210041157 kernel package. It's part of <http://www.kernel.org/pub/linux/kernel/v3.0/ChangeLog-3.2.33>
*** Bug 29838 has been marked as a duplicate of this bug. ***
Created attachment 5116 [details] Simple signal tester A very simple program using signals and select().
3.2.39 has been imported and build for Bug #28978. I know of no explicit test to trigger that bug. A test-suite for checking the signal implementation would probably help. I've schedules a ucs-test run using the newer kernel: i-d8188492 So unless proven otherwise I consider this bug fixed by Bug #28978.
<http://jenkins.knut.univention.de:8080/job/ucs-test_EC2-SingleMaster_64_RESULT/250/> showed +112 failed test, which were mostly Samba-related. To me this looks unrelated to this kernel bug. The log only showed the following interesting message: Mar 7 19:41:28 ip36e46450 kernel: [13727.056589] named[1249]: segfault at 0 ip 00007ff26bdbf6a9 sp 00007ff26d8475d0 error 6 in dlz_bind9.so[7ff26bdbb000+a000]
I've updated a EC2 instance to the new kernel: root@master0:~# uname -a Linux master0 3.2.0-ucs27-amd64 #1 SMP Debian 3.2.39-2.27.201303061719 x86_64 GNU/Linux root@master0:~# dmesg | tail -20 [3038349.173341] CR2: 00007f0d700d99b0 CR3: 00000000658fb000 CR4: 0000000000002660 [3038349.173344] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [3038349.173346] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [3038349.173349] Process hald (pid: 1422, threadinfo ffff88006652c000, task ffff8800047995d0) [3038349.173351] Stack: [3038349.173353] 0000000000000000 ffffffff81007605 0000000000000021 ffff8800fffffffa [3038349.173357] 000000000000058c ffffffff81107f10 ffff88000479ae48 ffff88000417bbc0 [3038349.173361] ffff88000479ae60 ffff8800658c18c0 ffff8800044ee140 ffff8800658c18c0 [3038349.173365] Call Trace: [3038349.173372] [<ffffffff81007605>] ? get_phys_to_machine+0x16/0x58 [3038349.173379] [<ffffffff81107f10>] ? fput+0x17a/0x1a2 [3038349.173384] [<ffffffff81104eb2>] ? filp_close+0x64/0x6c [3038349.173388] [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1 [3038349.173392] [<ffffffff8100e56d>] ? do_notify_resume+0x25/0x67 [3038349.173396] [<ffffffff81047110>] ? schedule_tail+0x24/0x56 [3038349.173401] [<ffffffff813870e0>] ? int_signal+0x12/0x17 [3038349.173403] Code: bd 38 01 00 00 e8 4d a8 04 00 85 c0 0f 85 70 02 00 00 65 48 8b 04 25 88 c6 00 00 8b 90 88 e0 ff ff c7 80 88 e0 ff ff 00 00 00 00 <48> 8b 05 fe 03 68 00 a9 00 00 00 04 74 0a 49 c7 45 08 01 00 00 [3038349.173427] RIP [<ffffffff8100e19f>] do_signal+0x28c/0x635 [3038349.173431] RSP <ffff88006652de38> [3038349.173434] ---[ end trace 69e92d5474be2947 ]--- root@master0:~# I can reproduce it by installing xrdp via app center. This issue is not a blocker for the release, move to 3.1-1-errata.
I can reproduce this bug with a UCS 3.1-1 (3.1-0) amd64 Xen Server and a UCS 3.1-1 amd64 paravirtual machine, installed with the desktop. The virtual instance hangs starting the hald during boot: Pid: 1306, comm: hald Not tainted 3.2.0-ucs27-amd64 #1 Debian 3.2.39-2.27.201303061719 RIP: e030:[<ffffffff8100e19f>] [<ffffffff8100e19f>] do_signal+0x28c/0x635 RSP: e02b:ffff88003c5a5e38 EFLAGS: 00250246 RAX: ffff88003c5a5fd8 RBX: ffff88003c5a5f58 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff88003c5a5ec8 RDI: 00007f473e6cdd30 RBP: 0000000000000021 R08: ffff88003c5a4000 R09: 00000000fffffffa R10: 0000000000000000 R11: 7b89481089480842 R12: ffff88003acb4620 R13: 00007f473e6cdb78 R14: ffff88003acb4180 R15: 00007f473e6cdd40 FS: 00007f473e6ce700(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fc0a6220bc0 CR3: 000000003c412000 CR4: 0000000000002660 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process hald (pid: 1306, threadinfo ffff88003c5a4000, task ffff88003acb4180) Stack: 0000ff814810eb00 ffffffff81007605 0000000000000021 ffff8800fffffffa 0000000000000518 ffffffff81107f10 ffff88003c5496b8 ffff8800043c3bc0 ffff88003c5496d0 ffff88003cb6bb80 ffff88000478a180 ffff88003cb6bb80 Call Trace: [<ffffffff81007605>] ? get_phys_to_machine+0x16/0x58 [<ffffffff81107f10>] ? fput+0x17a/0x1a2 [<ffffffff81104eb2>] ? filp_close+0x64/0x6c [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1 [<ffffffff8100e56d>] ? do_notify_resume+0x25/0x67 [<ffffffff81047110>] ? schedule_tail+0x24/0x56 [<ffffffff813870e0>] ? int_signal+0x12/0x17 Code: bd 38 01 00 00 e8 4d a8 04 00 85 c0 0f 85 70 02 00 00 65 48 8b 04 25 88 c6 00 00 8b 90 > 8b 05 fe 03 68 00 a9 00 00 00 04 74 0a 49 c7 45 08 01 00 00 RIP [<ffffffff8100e19f>] do_signal+0x28c/0x635 RSP <ffff88003c5a5e38> ---[ end trace c73d95dcf9a82b39 ]--- After some time, the computer continues to boot. This does not happen with a i386 instance.
(In reply to Stefan Gohmann from comment #11) > I've updated a EC2 instance to the new kernel: > I can reproduce it by installing xrdp via app center. > > This issue is not a blocker for the release, move to 3.1-1-errata. I install xrdp using the App Center with a UCS 3.2 system running Linux 3.10 and the kernel trace didn't occur.
Fixed with the new kernel update.
OK: Could reproduce it with the 3.2.0-ucs27-amd64 kernel from UCS 3.1-1 OK: Could not reproduce it with the 3.10.0-ucs36-amd64 kernel from UCS-3.2
UCS 3.2 has been released: http://docs.univention.de/release-notes-3.2-en.html http://docs.univention.de/release-notes-3.2-de.html If this error occurs again, please use "Clone This Bug".