30369 – Kernel traces on EC2 instance

Bug 30369 - Kernel traces on EC2 instance

Summary: Kernel traces on EC2 instance

Status:	CLOSED FIXED

Alias:	None

Product:	UCS
Classification:	Unclassified
Component:	Kernel
Version:	UCS 3.1
Hardware:	Other Linux

Importance:	P5 normal
Target Milestone:	UCS 3.2
Assignee:	Moritz Muehlenhoff
QA Contact:	Philipp Hahn

URL:
Keywords:	interim-1

Duplicates (1):	29838 (view as bug list)
Depends on:
Blocks:

Reported:	2013-02-10 15:10 CET by Stefan Gohmann
Modified:	2013-11-19 06:42 CET (History)
CC List:	3 users (show)

See Also:
What kind of report is it?:	---
What type of bug is this?:	---
Who will be affected by this bug?:	---
How will those affected feel about the bug?:	---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Customer ID:
Max CVSS v3 score:

Attachments
syslog (175.12 KB, text/plain) 2013-02-10 15:12 CET, Stefan Gohmann	Details
kern.log (4.14 KB, text/plain) 2013-02-10 15:12 CET, Stefan Gohmann	Details
dmesg-out.txt (17.31 KB, text/plain) 2013-02-10 15:12 CET, Stefan Gohmann	Details
Simple signal tester (3.05 KB, text/plain) 2013-03-07 16:10 CET, Philipp Hahn	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Stefan Gohmann

2013-02-10 15:10:00 CET

We saw some kernel traces with different running AWS Amazon images with UCS.

For example:
[1585533.125527] Pid: 1342, comm: hald Not tainted 3.2.0-ucs17-amd64 #1 Debian 3.2.30-1.17.201210041157  
[1585533.125530] RIP: e030:[<ffffffff8100e19f>]  [<ffffffff8100e19f>] do_signal+0x28c/0x635
[1585533.125541] RSP: e02b:ffff8800e9f9fe38  EFLAGS: 00250246
[1585533.125543] RAX: ffff8800e9f9ffd8 RBX: ffff8800e9f9ff58 RCX: 0000000000000000
[1585533.125546] RDX: 0000000000000000 RSI: ffff8800e9f9fec8 RDI: 00007fc02613fd30
[1585533.125548] RBP: 0000000000000021 R08: ffff8800e9f9e000 R09: 0000000000000000
[1585533.125550] R10: ffff8800efc0b1b0 R11: 0000000000000000 R12: ffff880004237a70
[1585533.125552] R13: 00007fc02613fb78 R14: ffff8800042375d0 R15: 00007fc02613fd40
[1585533.125558] FS:  00007fc026140700(0000) GS:ffff8800efc00000(0000) knlGS:0000000000000000
[1585533.125561] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[1585533.125563] CR2: 00007f4b3c7b0bc0 CR3: 00000000e9cb1000 CR4: 0000000000002660
[1585533.125566] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1585533.125568] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[1585533.125571] Process hald (pid: 1342, threadinfo ffff8800e9f9e000, task ffff8800042375d0)
[1585533.125573] Stack:
[1585533.125574]  0000000000000000 ffffffff81007555 0000000000000021 ffff8800fffffffa
[1585533.125578]  000000000000053c ffffffff81107aa0 ffff8800ea3be3f8 ffff8800040c9bc0
[1585533.125581]  ffff8800ea3be410 ffff880003f7bdc0 ffff88000439a140 ffff880003f7bdc0
[1585533.125584] Call Trace:
[1585533.125591]  [<ffffffff81007555>] ? get_phys_to_machine+0x16/0x58
[1585533.125597]  [<ffffffff81107aa0>] ? fput+0x17a/0x1a2
[1585533.125602]  [<ffffffff81104a42>] ? filp_close+0x64/0x6c
[1585533.125605]  [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1
[1585533.125608]  [<ffffffff8100e56d>] ? do_notify_resume+0x25/0x67
[1585533.125613]  [<ffffffff81047012>] ? schedule_tail+0x24/0x56
[1585533.125618]  [<ffffffff813850e0>] ? int_signal+0x12/0x17
[1585533.125620] Code: bd 38 01 00 00 e8 14 a6 04 00 85 c0 0f 85 70 02 00 00 65 48 8b 04 25 88 c6 00 00 8b 90 88 e0 ff ff c7 80 88 e0 ff ff 00 00 00 00 <48> 8b 05 7e f0 67 00 a9 00 00 00 04 74 0a 49 c7 45 08 01 00 00 
[1585533.125642] RIP  [<ffffffff8100e19f>] do_signal+0x28c/0x635
[1585533.125645]  RSP <ffff8800e9f9fe38>
[1585533.125648] ---[ end trace 48c97fd9a1e209e1 ]---

More log files attached.

Maybe the following links are helpful: 
http://xen.1045712.n5.nabble.com/xsave-0-workaround-needed-on-3-2-kernels-with-Xen-4-1-or-Xen-unstable-td5676755.html

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UserProvidedkernels.html#UsingUserProvidedKernel

Comment 1 Stefan Gohmann

2013-02-10 15:12:01 CET

Created attachment 5065 [details]
syslog

Comment 2 Stefan Gohmann

2013-02-10 15:12:15 CET

Created attachment 5066 [details]
kern.log

Comment 3 Stefan Gohmann

2013-02-10 15:12:38 CET

Created attachment 5067 [details]
dmesg-out.txt

Comment 4 Philipp Hahn

2013-02-11 12:02:51 CET

(In reply to comment #1)
> Created an attachment (id=5065) syslog

no useful information

(In reply to comment #2)
> Created an attachment (id=5066) kern.log

Feb  4 16:53:28 dc01 kernel: [7197838.497921] alignment check: 0000 [#2] SMP

Note that this is already the 2nd OOPs.

(In reply to comment #3)
> Created an attachment (id=5067) dmesg-out.txt

[7127991.898321] alignment check: 0000 [#1] SMP 

0000 (dec)  --> Not instruction fetch|Kernel mode|Read|Invalid access

[1585533.125530] RIP: e030:[<ffffffff8100e19f>]  [<ffffffff8100e19f>]
do_signal+0x28c/0x635

Looks similar to <http://lists.debian.org/debian-backports/2012/12/msg00044.html>: the user level process is doing some signal stuff.
Possible patch <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a349e23d1cf746f8bdc603dcc61fae9ee4a695f6>

[7127991.898426]  [<ffffffff81007555>] ? get_phys_to_machine+0x16/0x58

# printf '%x\n' $((0xdb+0x16))
f1
# objdump -S ./arch/x86/xen/p2m.o | sed -ne '/^[a-z ]* get_phys_to_machine/,/^0/p'
unsigned long get_phys_to_machine(unsigned long pfn)
{
  db:   48 83 c8 ff             or     $0xffffffffffffffff,%rax
  df:   48 81 ff ff ff ff 07    cmp    $0x7ffffff,%rdi
  e6:   53                      push   %rbx
  e7:   48 89 fb                mov    %rdi,%rbx
        unsigned topidx, mididx, idx;
        if (unlikely(pfn >= MAX_P2M_PFN))
  ea:   77 45                   ja     131 <get_phys_to_machine+0x56>
                return INVALID_P2M_ENTRY;
        topidx = p2m_top_index(pfn);
  ec:   e8 d5 ff ff ff          callq  c6 <p2m_top_index>
        /*
         * The INVALID_P2M_ENTRY is filled in both p2m_*identity
         * and in p2m_*missing, so returning the INVALID_P2M_ENTRY
         * would be wrong.
         */
        if (p2m_top[topidx][mididx] == p2m_identity)
  f1:   48 8b 0d 00 00 00 00    mov    0x0(%rip),%rcx        # f8 <get_phys_to_machine+0x1d>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  f8:   89 c0                   mov    %eax,%eax
        mididx = p2m_mid_index(pfn);
  fa:   48 89 da                mov    %rbx,%rdx
  fd:   48 c1 ea 09             shr    $0x9,%rdx
 101:   81 e2 ff 01 00 00       and    $0x1ff,%edx
 107:   48 8b 04 c1             mov    (%rcx,%rax,8),%rax
 10b:   48 8b 04 d0             mov    (%rax,%rdx,8),%rax
 10f:   48 3b 05 00 00 00 00    cmp    0x0(%rip),%rax        # 116 <get_phys_to_machine+0x3b>
 116:   75 0f                   jne    127 <get_phys_to_machine+0x4c>
                return IDENTITY_FRAME(pfn);
 118:   48 b8 00 00 00 00 00    mov    $0x4000000000000000,%rax
 11f:   00 00 40 
 122:   48 09 d8                or     %rbx,%rax
 125:   eb 0a                   jmp    131 <get_phys_to_machine+0x56>
        idx = p2m_index(pfn);
 127:   81 e3 ff 01 00 00       and    $0x1ff,%ebx
        return p2m_top[topidx][mididx][idx];
 12d:   48 8b 04 d8             mov    (%rax,%rbx,8),%rax
}
 131:   5b                      pop    %rbx
 132:   c3                      retq   

[7127991.898455] Code: ... <48> 8b 05 7e f0 67 00  a9 00  00 00 04  74 0a 49 c7  45 08 01 00 00
                                   ^^^^^^^^^^^^^^

Even considering that the 4*00 get patched to point to p2m_top, the rest of the memory seems to be corrupted, as the rest does no longer match the disassembly above.

Documentation for decoding Kernel OOPSes: <http://www.dedoimedo.com/computers/crash-analyze.html>

Comment 5 Stefan Gohmann

2013-02-12 06:58:47 CET

(In reply to comment #4)
> Looks similar to
> <http://lists.debian.org/debian-backports/2012/12/msg00044.html>: the user
> level process is doing some signal stuff.
> Possible patch
> <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a349e23d1cf746f8bdc603dcc61fae9ee4a695f6>

Is this patch part of the current 3.2 kernel?

Comment 6 Philipp Hahn

2013-02-12 09:28:42 CET

(In reply to comment #5)
> (In reply to comment #4)
> > Looks similar to
> > <http://lists.debian.org/debian-backports/2012/12/msg00044.html>: the user
> > level process is doing some signal stuff.
> > Possible patch
> > <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a349e23d1cf746f8bdc603dcc61fae9ee4a695f6>
> 
> Is this patch part of the current 3.2 kernel?

No, the patch is not part of our current linux_3.2.30-1.17.201210041157 kernel package.
It's part of <http://www.kernel.org/pub/linux/kernel/v3.0/ChangeLog-3.2.33>

Comment 7 Philipp Hahn

2013-02-12 09:35:31 CET

*** Bug 29838 has been marked as a duplicate of this bug. ***

Comment 8 Philipp Hahn

2013-03-07 16:10:24 CET

Created attachment 5116 [details]
Simple signal tester

A very simple program using signals and select().

Comment 9 Philipp Hahn

2013-03-07 16:30:10 CET

3.2.39 has been imported and build for Bug #28978.
I know of no explicit test to trigger that bug.
A test-suite for checking the signal implementation would probably help.

I've schedules a ucs-test run using the newer kernel: i-d8188492
So unless proven otherwise I consider this bug fixed by Bug #28978.

Comment 10 Philipp Hahn

2013-03-12 14:32:17 CET

<http://jenkins.knut.univention.de:8080/job/ucs-test_EC2-SingleMaster_64_RESULT/250/> showed +112 failed test, which were mostly Samba-related. To me this looks unrelated to this kernel bug.

The log only showed the following interesting message:

Mar  7 19:41:28 ip36e46450 kernel: [13727.056589] named[1249]: segfault at 0 ip 00007ff26bdbf6a9 sp 00007ff26d8475d0 error 6 in dlz_bind9.so[7ff26bdbb000+a000]

Comment 11 Stefan Gohmann

2013-03-15 21:45:30 CET

I've updated a EC2 instance to the new kernel:

root@master0:~# uname -a
Linux master0 3.2.0-ucs27-amd64 #1 SMP Debian 3.2.39-2.27.201303061719 x86_64 GNU/Linux
root@master0:~# dmesg | tail -20
[3038349.173341] CR2: 00007f0d700d99b0 CR3: 00000000658fb000 CR4: 0000000000002660
[3038349.173344] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3038349.173346] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[3038349.173349] Process hald (pid: 1422, threadinfo ffff88006652c000, task ffff8800047995d0)
[3038349.173351] Stack:
[3038349.173353]  0000000000000000 ffffffff81007605 0000000000000021 ffff8800fffffffa
[3038349.173357]  000000000000058c ffffffff81107f10 ffff88000479ae48 ffff88000417bbc0
[3038349.173361]  ffff88000479ae60 ffff8800658c18c0 ffff8800044ee140 ffff8800658c18c0
[3038349.173365] Call Trace:
[3038349.173372]  [<ffffffff81007605>] ? get_phys_to_machine+0x16/0x58
[3038349.173379]  [<ffffffff81107f10>] ? fput+0x17a/0x1a2
[3038349.173384]  [<ffffffff81104eb2>] ? filp_close+0x64/0x6c
[3038349.173388]  [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1
[3038349.173392]  [<ffffffff8100e56d>] ? do_notify_resume+0x25/0x67
[3038349.173396]  [<ffffffff81047110>] ? schedule_tail+0x24/0x56
[3038349.173401]  [<ffffffff813870e0>] ? int_signal+0x12/0x17
[3038349.173403] Code: bd 38 01 00 00 e8 4d a8 04 00 85 c0 0f 85 70 02 00 00 65 48 8b 04 25 88 c6 00 00 8b 90 88 e0 ff ff c7 80 88 e0 ff ff 00 00 00 00 <48> 8b 05 fe 03 68 00 a9 00 00 00 04 74 0a 49 c7 45 08 01 00 00 
[3038349.173427] RIP  [<ffffffff8100e19f>] do_signal+0x28c/0x635
[3038349.173431]  RSP <ffff88006652de38>
[3038349.173434] ---[ end trace 69e92d5474be2947 ]---
root@master0:~# 

I can reproduce it by installing xrdp via app center.

This issue is not a blocker for the release, move to 3.1-1-errata.

Comment 12 Felix Botner

2013-03-21 14:20:34 CET

I can reproduce this bug with a UCS 3.1-1 (3.1-0) amd64 Xen Server and a UCS 3.1-1 amd64 paravirtual machine, installed with the desktop. The virtual instance hangs starting the hald during boot:

Pid: 1306, comm: hald Not tainted 3.2.0-ucs27-amd64 #1 Debian 3.2.39-2.27.201303061719  
RIP: e030:[<ffffffff8100e19f>]  [<ffffffff8100e19f>] do_signal+0x28c/0x635
RSP: e02b:ffff88003c5a5e38  EFLAGS: 00250246
RAX: ffff88003c5a5fd8 RBX: ffff88003c5a5f58 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88003c5a5ec8 RDI: 00007f473e6cdd30
RBP: 0000000000000021 R08: ffff88003c5a4000 R09: 00000000fffffffa
R10: 0000000000000000 R11: 7b89481089480842 R12: ffff88003acb4620
R13: 00007f473e6cdb78 R14: ffff88003acb4180 R15: 00007f473e6cdd40
FS:  00007f473e6ce700(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b  
CR2: 00007fc0a6220bc0 CR3: 000000003c412000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process hald (pid: 1306, threadinfo ffff88003c5a4000, task ffff88003acb4180)
Stack:
 0000ff814810eb00 ffffffff81007605 0000000000000021 ffff8800fffffffa
 0000000000000518 ffffffff81107f10 ffff88003c5496b8 ffff8800043c3bc0
 ffff88003c5496d0 ffff88003cb6bb80 ffff88000478a180 ffff88003cb6bb80
Call Trace:
 [<ffffffff81007605>] ? get_phys_to_machine+0x16/0x58
 [<ffffffff81107f10>] ? fput+0x17a/0x1a2
 [<ffffffff81104eb2>] ? filp_close+0x64/0x6c
 [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1
 [<ffffffff8100e56d>] ? do_notify_resume+0x25/0x67
 [<ffffffff81047110>] ? schedule_tail+0x24/0x56
 [<ffffffff813870e0>] ? int_signal+0x12/0x17
Code: bd 38 01 00 00 e8 4d a8 04 00 85 c0 0f 85 70 02 00 00 65 48 8b 04 25 88 c6 00 00 8b 90
> 8b 05 fe 03 68 00 a9 00 00 00 04 74 0a 49 c7 45 08 01 00 00  
RIP  [<ffffffff8100e19f>] do_signal+0x28c/0x635
 RSP <ffff88003c5a5e38>
---[ end trace c73d95dcf9a82b39 ]---

After some time, the computer continues to boot. This does not happen with a i386 instance.

Comment 13 Moritz Muehlenhoff

2013-07-05 11:44:59 CEST

(In reply to Stefan Gohmann from comment #11)
> I've updated a EC2 instance to the new kernel:
> I can reproduce it by installing xrdp via app center.
> 
> This issue is not a blocker for the release, move to 3.1-1-errata.

I install xrdp using the App Center with a UCS 3.2 system running Linux 3.10 and the kernel trace didn't occur.

Comment 14 Moritz Muehlenhoff

2013-07-09 14:45:21 CEST

Fixed with the new kernel update.

Comment 15 Philipp Hahn

2013-08-14 16:09:07 CEST

OK: Could reproduce it with the 3.2.0-ucs27-amd64 kernel from UCS 3.1-1
OK: Could not reproduce it with the 3.10.0-ucs36-amd64 kernel from UCS-3.2

Comment 16 Stefan Gohmann

2013-11-19 06:42:11 CET

UCS 3.2 has been released:
 http://docs.univention.de/release-notes-3.2-en.html
 http://docs.univention.de/release-notes-3.2-de.html

If this error occurs again, please use "Clone This Bug".