Bug 30369 - Kernel traces on EC2 instance
Kernel traces on EC2 instance
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Kernel
UCS 3.1
Other Linux
: P5 normal (vote)
: UCS 3.2
Assigned To: Moritz Muehlenhoff
Philipp Hahn
: interim-1
: 29838 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-02-10 15:10 CET by Stefan Gohmann
Modified: 2013-11-19 06:42 CET (History)
3 users (show)

See Also:
What kind of report is it?: ---
What type of bug is this?: ---
Who will be affected by this bug?: ---
How will those affected feel about the bug?: ---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments
syslog (175.12 KB, text/plain)
2013-02-10 15:12 CET, Stefan Gohmann
Details
kern.log (4.14 KB, text/plain)
2013-02-10 15:12 CET, Stefan Gohmann
Details
dmesg-out.txt (17.31 KB, text/plain)
2013-02-10 15:12 CET, Stefan Gohmann
Details
Simple signal tester (3.05 KB, text/plain)
2013-03-07 16:10 CET, Philipp Hahn
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Stefan Gohmann univentionstaff 2013-02-10 15:10:00 CET
We saw some kernel traces with different running AWS Amazon images with UCS.

For example:
[1585533.125527] Pid: 1342, comm: hald Not tainted 3.2.0-ucs17-amd64 #1 Debian 3.2.30-1.17.201210041157  
[1585533.125530] RIP: e030:[<ffffffff8100e19f>]  [<ffffffff8100e19f>] do_signal+0x28c/0x635
[1585533.125541] RSP: e02b:ffff8800e9f9fe38  EFLAGS: 00250246
[1585533.125543] RAX: ffff8800e9f9ffd8 RBX: ffff8800e9f9ff58 RCX: 0000000000000000
[1585533.125546] RDX: 0000000000000000 RSI: ffff8800e9f9fec8 RDI: 00007fc02613fd30
[1585533.125548] RBP: 0000000000000021 R08: ffff8800e9f9e000 R09: 0000000000000000
[1585533.125550] R10: ffff8800efc0b1b0 R11: 0000000000000000 R12: ffff880004237a70
[1585533.125552] R13: 00007fc02613fb78 R14: ffff8800042375d0 R15: 00007fc02613fd40
[1585533.125558] FS:  00007fc026140700(0000) GS:ffff8800efc00000(0000) knlGS:0000000000000000
[1585533.125561] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[1585533.125563] CR2: 00007f4b3c7b0bc0 CR3: 00000000e9cb1000 CR4: 0000000000002660
[1585533.125566] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1585533.125568] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[1585533.125571] Process hald (pid: 1342, threadinfo ffff8800e9f9e000, task ffff8800042375d0)
[1585533.125573] Stack:
[1585533.125574]  0000000000000000 ffffffff81007555 0000000000000021 ffff8800fffffffa
[1585533.125578]  000000000000053c ffffffff81107aa0 ffff8800ea3be3f8 ffff8800040c9bc0
[1585533.125581]  ffff8800ea3be410 ffff880003f7bdc0 ffff88000439a140 ffff880003f7bdc0
[1585533.125584] Call Trace:
[1585533.125591]  [<ffffffff81007555>] ? get_phys_to_machine+0x16/0x58
[1585533.125597]  [<ffffffff81107aa0>] ? fput+0x17a/0x1a2
[1585533.125602]  [<ffffffff81104a42>] ? filp_close+0x64/0x6c
[1585533.125605]  [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1
[1585533.125608]  [<ffffffff8100e56d>] ? do_notify_resume+0x25/0x67
[1585533.125613]  [<ffffffff81047012>] ? schedule_tail+0x24/0x56
[1585533.125618]  [<ffffffff813850e0>] ? int_signal+0x12/0x17
[1585533.125620] Code: bd 38 01 00 00 e8 14 a6 04 00 85 c0 0f 85 70 02 00 00 65 48 8b 04 25 88 c6 00 00 8b 90 88 e0 ff ff c7 80 88 e0 ff ff 00 00 00 00 <48> 8b 05 7e f0 67 00 a9 00 00 00 04 74 0a 49 c7 45 08 01 00 00 
[1585533.125642] RIP  [<ffffffff8100e19f>] do_signal+0x28c/0x635
[1585533.125645]  RSP <ffff8800e9f9fe38>
[1585533.125648] ---[ end trace 48c97fd9a1e209e1 ]---

More log files attached.

Maybe the following links are helpful: 
http://xen.1045712.n5.nabble.com/xsave-0-workaround-needed-on-3-2-kernels-with-Xen-4-1-or-Xen-unstable-td5676755.html

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UserProvidedkernels.html#UsingUserProvidedKernel
Comment 1 Stefan Gohmann univentionstaff 2013-02-10 15:12:01 CET
Created attachment 5065 [details]
syslog
Comment 2 Stefan Gohmann univentionstaff 2013-02-10 15:12:15 CET
Created attachment 5066 [details]
kern.log
Comment 3 Stefan Gohmann univentionstaff 2013-02-10 15:12:38 CET
Created attachment 5067 [details]
dmesg-out.txt
Comment 4 Philipp Hahn univentionstaff 2013-02-11 12:02:51 CET
(In reply to comment #1)
> Created an attachment (id=5065) syslog

no useful information

(In reply to comment #2)
> Created an attachment (id=5066) kern.log

Feb  4 16:53:28 dc01 kernel: [7197838.497921] alignment check: 0000 [#2] SMP

Note that this is already the 2nd OOPs.

(In reply to comment #3)
> Created an attachment (id=5067) dmesg-out.txt

[7127991.898321] alignment check: 0000 [#1] SMP 

0000 (dec)  --> Not instruction fetch|Kernel mode|Read|Invalid access

[1585533.125530] RIP: e030:[<ffffffff8100e19f>]  [<ffffffff8100e19f>]
do_signal+0x28c/0x635

Looks similar to <http://lists.debian.org/debian-backports/2012/12/msg00044.html>: the user level process is doing some signal stuff.
Possible patch <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a349e23d1cf746f8bdc603dcc61fae9ee4a695f6>

[7127991.898426]  [<ffffffff81007555>] ? get_phys_to_machine+0x16/0x58

# printf '%x\n' $((0xdb+0x16))
f1
# objdump -S ./arch/x86/xen/p2m.o | sed -ne '/^[a-z ]* get_phys_to_machine/,/^0/p'
unsigned long get_phys_to_machine(unsigned long pfn)
{
  db:   48 83 c8 ff             or     $0xffffffffffffffff,%rax
  df:   48 81 ff ff ff ff 07    cmp    $0x7ffffff,%rdi
  e6:   53                      push   %rbx
  e7:   48 89 fb                mov    %rdi,%rbx
        unsigned topidx, mididx, idx;
        if (unlikely(pfn >= MAX_P2M_PFN))
  ea:   77 45                   ja     131 <get_phys_to_machine+0x56>
                return INVALID_P2M_ENTRY;
        topidx = p2m_top_index(pfn);
  ec:   e8 d5 ff ff ff          callq  c6 <p2m_top_index>
        /*
         * The INVALID_P2M_ENTRY is filled in both p2m_*identity
         * and in p2m_*missing, so returning the INVALID_P2M_ENTRY
         * would be wrong.
         */
        if (p2m_top[topidx][mididx] == p2m_identity)
  f1:   48 8b 0d 00 00 00 00    mov    0x0(%rip),%rcx        # f8 <get_phys_to_machine+0x1d>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  f8:   89 c0                   mov    %eax,%eax
        mididx = p2m_mid_index(pfn);
  fa:   48 89 da                mov    %rbx,%rdx
  fd:   48 c1 ea 09             shr    $0x9,%rdx
 101:   81 e2 ff 01 00 00       and    $0x1ff,%edx
 107:   48 8b 04 c1             mov    (%rcx,%rax,8),%rax
 10b:   48 8b 04 d0             mov    (%rax,%rdx,8),%rax
 10f:   48 3b 05 00 00 00 00    cmp    0x0(%rip),%rax        # 116 <get_phys_to_machine+0x3b>
 116:   75 0f                   jne    127 <get_phys_to_machine+0x4c>
                return IDENTITY_FRAME(pfn);
 118:   48 b8 00 00 00 00 00    mov    $0x4000000000000000,%rax
 11f:   00 00 40 
 122:   48 09 d8                or     %rbx,%rax
 125:   eb 0a                   jmp    131 <get_phys_to_machine+0x56>
        idx = p2m_index(pfn);
 127:   81 e3 ff 01 00 00       and    $0x1ff,%ebx
        return p2m_top[topidx][mididx][idx];
 12d:   48 8b 04 d8             mov    (%rax,%rbx,8),%rax
}
 131:   5b                      pop    %rbx
 132:   c3                      retq   

[7127991.898455] Code: ... <48> 8b 05 7e f0 67 00  a9 00  00 00 04  74 0a 49 c7  45 08 01 00 00
                                   ^^^^^^^^^^^^^^

Even considering that the 4*00 get patched to point to p2m_top, the rest of the memory seems to be corrupted, as the rest does no longer match the disassembly above.

Documentation for decoding Kernel OOPSes: <http://www.dedoimedo.com/computers/crash-analyze.html>
Comment 5 Stefan Gohmann univentionstaff 2013-02-12 06:58:47 CET
(In reply to comment #4)
> Looks similar to
> <http://lists.debian.org/debian-backports/2012/12/msg00044.html>: the user
> level process is doing some signal stuff.
> Possible patch
> <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a349e23d1cf746f8bdc603dcc61fae9ee4a695f6>

Is this patch part of the current 3.2 kernel?
Comment 6 Philipp Hahn univentionstaff 2013-02-12 09:28:42 CET
(In reply to comment #5)
> (In reply to comment #4)
> > Looks similar to
> > <http://lists.debian.org/debian-backports/2012/12/msg00044.html>: the user
> > level process is doing some signal stuff.
> > Possible patch
> > <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a349e23d1cf746f8bdc603dcc61fae9ee4a695f6>
> 
> Is this patch part of the current 3.2 kernel?

No, the patch is not part of our current linux_3.2.30-1.17.201210041157 kernel package.
It's part of <http://www.kernel.org/pub/linux/kernel/v3.0/ChangeLog-3.2.33>
Comment 7 Philipp Hahn univentionstaff 2013-02-12 09:35:31 CET
*** Bug 29838 has been marked as a duplicate of this bug. ***
Comment 8 Philipp Hahn univentionstaff 2013-03-07 16:10:24 CET
Created attachment 5116 [details]
Simple signal tester

A very simple program using signals and select().
Comment 9 Philipp Hahn univentionstaff 2013-03-07 16:30:10 CET
3.2.39 has been imported and build for Bug #28978.
I know of no explicit test to trigger that bug.
A test-suite for checking the signal implementation would probably help.

I've schedules a ucs-test run using the newer kernel: i-d8188492
So unless proven otherwise I consider this bug fixed by Bug #28978.
Comment 10 Philipp Hahn univentionstaff 2013-03-12 14:32:17 CET
<http://jenkins.knut.univention.de:8080/job/ucs-test_EC2-SingleMaster_64_RESULT/250/> showed +112 failed test, which were mostly Samba-related. To me this looks unrelated to this kernel bug.

The log only showed the following interesting message:

Mar  7 19:41:28 ip36e46450 kernel: [13727.056589] named[1249]: segfault at 0 ip 00007ff26bdbf6a9 sp 00007ff26d8475d0 error 6 in dlz_bind9.so[7ff26bdbb000+a000]
Comment 11 Stefan Gohmann univentionstaff 2013-03-15 21:45:30 CET
I've updated a EC2 instance to the new kernel:

root@master0:~# uname -a
Linux master0 3.2.0-ucs27-amd64 #1 SMP Debian 3.2.39-2.27.201303061719 x86_64 GNU/Linux
root@master0:~# dmesg | tail -20
[3038349.173341] CR2: 00007f0d700d99b0 CR3: 00000000658fb000 CR4: 0000000000002660
[3038349.173344] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3038349.173346] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[3038349.173349] Process hald (pid: 1422, threadinfo ffff88006652c000, task ffff8800047995d0)
[3038349.173351] Stack:
[3038349.173353]  0000000000000000 ffffffff81007605 0000000000000021 ffff8800fffffffa
[3038349.173357]  000000000000058c ffffffff81107f10 ffff88000479ae48 ffff88000417bbc0
[3038349.173361]  ffff88000479ae60 ffff8800658c18c0 ffff8800044ee140 ffff8800658c18c0
[3038349.173365] Call Trace:
[3038349.173372]  [<ffffffff81007605>] ? get_phys_to_machine+0x16/0x58
[3038349.173379]  [<ffffffff81107f10>] ? fput+0x17a/0x1a2
[3038349.173384]  [<ffffffff81104eb2>] ? filp_close+0x64/0x6c
[3038349.173388]  [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1
[3038349.173392]  [<ffffffff8100e56d>] ? do_notify_resume+0x25/0x67
[3038349.173396]  [<ffffffff81047110>] ? schedule_tail+0x24/0x56
[3038349.173401]  [<ffffffff813870e0>] ? int_signal+0x12/0x17
[3038349.173403] Code: bd 38 01 00 00 e8 4d a8 04 00 85 c0 0f 85 70 02 00 00 65 48 8b 04 25 88 c6 00 00 8b 90 88 e0 ff ff c7 80 88 e0 ff ff 00 00 00 00 <48> 8b 05 fe 03 68 00 a9 00 00 00 04 74 0a 49 c7 45 08 01 00 00 
[3038349.173427] RIP  [<ffffffff8100e19f>] do_signal+0x28c/0x635
[3038349.173431]  RSP <ffff88006652de38>
[3038349.173434] ---[ end trace 69e92d5474be2947 ]---
root@master0:~# 

I can reproduce it by installing xrdp via app center.

This issue is not a blocker for the release, move to 3.1-1-errata.
Comment 12 Felix Botner univentionstaff 2013-03-21 14:20:34 CET
I can reproduce this bug with a UCS 3.1-1 (3.1-0) amd64 Xen Server and a UCS 3.1-1 amd64 paravirtual machine, installed with the desktop. The virtual instance hangs starting the hald during boot:

Pid: 1306, comm: hald Not tainted 3.2.0-ucs27-amd64 #1 Debian 3.2.39-2.27.201303061719  
RIP: e030:[<ffffffff8100e19f>]  [<ffffffff8100e19f>] do_signal+0x28c/0x635
RSP: e02b:ffff88003c5a5e38  EFLAGS: 00250246
RAX: ffff88003c5a5fd8 RBX: ffff88003c5a5f58 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88003c5a5ec8 RDI: 00007f473e6cdd30
RBP: 0000000000000021 R08: ffff88003c5a4000 R09: 00000000fffffffa
R10: 0000000000000000 R11: 7b89481089480842 R12: ffff88003acb4620
R13: 00007f473e6cdb78 R14: ffff88003acb4180 R15: 00007f473e6cdd40
FS:  00007f473e6ce700(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b  
CR2: 00007fc0a6220bc0 CR3: 000000003c412000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process hald (pid: 1306, threadinfo ffff88003c5a4000, task ffff88003acb4180)
Stack:
 0000ff814810eb00 ffffffff81007605 0000000000000021 ffff8800fffffffa
 0000000000000518 ffffffff81107f10 ffff88003c5496b8 ffff8800043c3bc0
 ffff88003c5496d0 ffff88003cb6bb80 ffff88000478a180 ffff88003cb6bb80
Call Trace:
 [<ffffffff81007605>] ? get_phys_to_machine+0x16/0x58
 [<ffffffff81107f10>] ? fput+0x17a/0x1a2
 [<ffffffff81104eb2>] ? filp_close+0x64/0x6c
 [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1
 [<ffffffff8100e56d>] ? do_notify_resume+0x25/0x67
 [<ffffffff81047110>] ? schedule_tail+0x24/0x56
 [<ffffffff813870e0>] ? int_signal+0x12/0x17
Code: bd 38 01 00 00 e8 4d a8 04 00 85 c0 0f 85 70 02 00 00 65 48 8b 04 25 88 c6 00 00 8b 90
> 8b 05 fe 03 68 00 a9 00 00 00 04 74 0a 49 c7 45 08 01 00 00  
RIP  [<ffffffff8100e19f>] do_signal+0x28c/0x635
 RSP <ffff88003c5a5e38>
---[ end trace c73d95dcf9a82b39 ]---

After some time, the computer continues to boot. This does not happen with a i386 instance.
Comment 13 Moritz Muehlenhoff univentionstaff 2013-07-05 11:44:59 CEST
(In reply to Stefan Gohmann from comment #11)
> I've updated a EC2 instance to the new kernel:
> I can reproduce it by installing xrdp via app center.
> 
> This issue is not a blocker for the release, move to 3.1-1-errata.

I install xrdp using the App Center with a UCS 3.2 system running Linux 3.10 and the kernel trace didn't occur.
Comment 14 Moritz Muehlenhoff univentionstaff 2013-07-09 14:45:21 CEST
Fixed with the new kernel update.
Comment 15 Philipp Hahn univentionstaff 2013-08-14 16:09:07 CEST
OK: Could reproduce it with the 3.2.0-ucs27-amd64 kernel from UCS 3.1-1
OK: Could not reproduce it with the 3.10.0-ucs36-amd64 kernel from UCS-3.2
Comment 16 Stefan Gohmann univentionstaff 2013-11-19 06:42:11 CET
UCS 3.2 has been released:
 http://docs.univention.de/release-notes-3.2-en.html
 http://docs.univention.de/release-notes-3.2-de.html

If this error occurs again, please use "Clone This Bug".