Bug 42582 – list_add corruption - probably timer/workqueue related


Summary:	list_add corruption - probably timer/workqueue related

Status:	RESOLVED WONTFIX

Product:	UCS
Classification:	Unclassified
Component:	Kernel
Version:	UCS 4.1
Hardware:	Other Linux

Importance:	P5 normal (vote)
Target Milestone:	---
Assigned To:	UCS maintainers
QA Contact:

URL:
Keywords:

Depends on:
Blocks:

Reported:	2016-10-06 10:40 CEST by Philipp Hahn
Modified:	2019-01-03 07:20 CET
CC List:	4 users

See Also:	41058
What kind of report is it?:	Bug Report
What type of bug is this?:	7: Crash: Bug causes crash or data loss
Who will be affected by this bug?:	3: Will affect average number of installed domains
How will those affected feel about the bug?:	4: A User would return the product
User Pain:	0.480
Enterprise Customer affected?:	Yes
School Customer affected?:	Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:	2016092221002158, 2016100621000124
Bug group (optional):	External feedback
Max CVSS v3 score:

Attachments
Kernel OOPS (10.04 KB, text/plain) 2016-10-06 10:40 CEST, Philipp Hahn
Kernel OOPS 2 (2.24 KB, text/plain) 2016-10-07 10:41 CEST, Philipp Hahn

Description Philipp Hahn

2016-10-06 10:40:19 CEST

Created attachment 8070 [details]
Kernel OOPS

Notice the swapped arguments between consecutive calls.

Comment 1 Philipp Hahn

2016-10-06 10:48:39 CEST

I asked on LKML but didn't get a reply yet: <https://marc.info/?l=linux-kernel&m=147508265316854&w=2>
I also haven't found a similar bug report.

Happened at a second school with (probably) a different virtualization: Bochs vs. VMware (?).

Comment 2 Philipp Hahn

2016-10-07 10:41:08 CEST

Created attachment 8076 [details]
Kernel OOPS 2

Comment 3 Philipp Hahn

2016-10-07 11:02:26 CEST

The common thing is a timer addition:
 [<ffffffff810da5a6>] ? internal_add_timer+0x36/0xa0
 [<ffffffff810dc42b>] ? add_timer_on+0x8b/0x100

 [<ffffffff810da5a6>] ? internal_add_timer+0x36/0xa0
 [<ffffffff810dc6fa>] ? mod_timer_pending+0xfa/0x140

So something is racing without proper locking.
The 2nd OOPS looks like some RCU locking might be missing for the WQ. See <http://linux-kernel.2935.n7.nabble.com/mod-timer-list-add-corruption-WARNING-CPU-1-PID-0-at-lib-list-debug-c-33-list-add-0xbe-0xd0-td684405.html>.

There have been updates between v4.1.16 and v4.1.33 in that field, e.g.
 add92082e2d14367b27b0e18b0deeaedd7c1f938
 68fce03ba7901aa338a566292a59e6a753948861 !
Especially the last one looks promising:
 v4.1.12~18 introduced the bug
 v4.1.19~70 fixed it
That would explain why no one except UCS customers sees this bug.

Will hopefully be fixed with the new linux-4.1.33 kernel from Bug #41058.

Maybe enabling CONFIG_DEBUG_OBJECTS could help.

Comment 4 Philipp Hahn

2016-10-20 10:36:28 CEST

Might need a debug-enabled kernel build: <https://marc.info/?l=linux-btrfs&m=147694635511693&w=2>

Comment 5 Stefan Gohmann

2016-12-23 08:47:56 CET

Did it happen again or has it been fixed with the latest kernel updates?

Comment 6 Florian Best

2017-06-28 14:52:13 CEST

There is a Customer ID set so I set the flag "Enterprise Customer affected".

Comment 7 Stefan Gohmann

2019-01-03 07:20:42 CET

This issue has been filed against UCS 4.1. The maintenance with bug and security fixes for UCS 4.1 has ended on 5th of April 2018.

Customers still on UCS 4.1 are encouraged to update to UCS 4.3. Please contact
your partner or Univention for any questions.

If this issue still occurs in newer UCS versions, please use "Clone this bug" or simply reopen the issue. In this case please provide detailed information on how this issue is affecting you.