Bug 42582 - list_add corruption - probably timer/workqueue related
Status: RESOLVED WONTFIX
Product: UCS
Classification: Unclassified
Component: Kernel
Version: UCS 4.1
Hardware: Other Linux
Importance: P5 normal
Target Milestone: ---
Assigned To: UCS maintainers
Depends on:
Blocks:
Reported: 2016-10-06 10:40 CEST by Philipp Hahn
Modified: 2019-01-03 07:20 CET (History)
CC List: 4 users

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 7: Crash: Bug causes crash or data loss
Who will be affected by this bug?: 3: Will affect average number of installed domains
How will those affected feel about the bug?: 4: A User would return the product
User Pain: 0.480
Enterprise Customer affected?: Yes
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2016092221002158, 2016100621000124
Bug group (optional): External feedback
Max CVSS v3 score:


Attachments
Kernel OOPS (10.04 KB, text/plain)
2016-10-06 10:40 CEST, Philipp Hahn
Kernel OOPS 2 (2.24 KB, text/plain)
2016-10-07 10:41 CEST, Philipp Hahn

Description Philipp Hahn univentionstaff 2016-10-06 10:40:19 CEST
Created attachment 8070 [details]
Kernel OOPS

Note the swapped arguments between consecutive calls.
Comment 1 Philipp Hahn univentionstaff 2016-10-06 10:48:39 CEST
I asked on LKML but haven't received a reply yet: <https://marc.info/?l=linux-kernel&m=147508265316854&w=2>
I also haven't found a similar bug report.

It has now happened at a second school, (probably) on a different virtualization: Bochs vs. VMware (?).
Comment 2 Philipp Hahn univentionstaff 2016-10-07 10:41:08 CEST
Created attachment 8076 [details]
Kernel OOPS 2
Comment 3 Philipp Hahn univentionstaff 2016-10-07 11:02:26 CEST
The common thing is a timer addition:
 [<ffffffff810da5a6>] ? internal_add_timer+0x36/0xa0
 [<ffffffff810dc42b>] ? add_timer_on+0x8b/0x100

 [<ffffffff810da5a6>] ? internal_add_timer+0x36/0xa0
 [<ffffffff810dc6fa>] ? mod_timer_pending+0xfa/0x140

So something is racing without proper locking.
The 2nd OOPS looks like some RCU locking might be missing for the WQ. See <http://linux-kernel.2935.n7.nabble.com/mod-timer-list-add-corruption-WARNING-CPU-1-PID-0-at-lib-list-debug-c-33-list-add-0xbe-0xd0-td684405.html>.
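
For context, the "list_add corruption" warning comes from the kernel's list debugging: before linking a new entry it checks that the two neighbours still point at each other. The following is only a simplified userspace sketch of that idea, not the kernel's code; `struct list_node`, `list_init`, `checked_list_insert` and `checked_list_add` are names made up for this illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Minimal doubly linked list node, modelled on the kernel's struct list_head.
 * All names in this sketch are hypothetical. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* Initialise an empty circular list: the head points at itself. */
static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/*
 * Link new_node between prev and next, but only if the two neighbours
 * still point at each other. Returns false and leaves the list untouched
 * when that invariant is broken, e.g. by an unlocked concurrent writer.
 */
static bool checked_list_insert(struct list_node *new_node,
                                struct list_node *prev,
                                struct list_node *next)
{
    if (next->prev != prev) {
        fprintf(stderr, "list_add corruption: next->prev should be prev\n");
        return false;
    }
    if (prev->next != next) {
        fprintf(stderr, "list_add corruption: prev->next should be next\n");
        return false;
    }
    next->prev = new_node;
    new_node->next = next;
    new_node->prev = prev;
    prev->next = new_node;
    return true;
}

/* Add new_node directly after head. */
static bool checked_list_add(struct list_node *new_node,
                             struct list_node *head)
{
    return checked_list_insert(new_node, head, head->next);
}
```

If two CPUs modify the same timer list without holding the same lock, one of them can observe exactly such a half-updated neighbour pair, which would match the traces above.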

There have been updates in that area between v4.1.16 and v4.1.33, e.g.
 add92082e2d14367b27b0e18b0deeaedd7c1f938
 68fce03ba7901aa338a566292a59e6a753948861 !
Especially the last one looks promising:
 v4.1.12~18 introduced the bug
 v4.1.19~70 fixed it
That would explain why no one except UCS customers sees this bug.
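
Reading the `~N` suffixes as git ancestor notation, the regression would be contained in the releases v4.1.12 up to and including v4.1.18, and v4.1.19 would be the first fixed one. A minimal sketch of that range check, assuming the release string carries the upstream sublevel (distribution kernels may report something else); `is_affected_4_1_release` is a hypothetical helper:

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumption derived from the comment above: the bug entered the 4.1.y
 * series with v4.1.12 and the fix first shipped with v4.1.19. */
#define FIRST_AFFECTED_SUBLEVEL 12
#define FIRST_FIXED_SUBLEVEL    19

/*
 * Return true if a release string such as "4.1.16" falls into the
 * suspected window 4.1.12 <= release < 4.1.19. Strings that do not start
 * with three dot-separated numbers are treated as not affected.
 */
static bool is_affected_4_1_release(const char *release)
{
    int major = 0, minor = 0, sublevel = 0;

    if (sscanf(release, "%d.%d.%d", &major, &minor, &sublevel) != 3)
        return false;
    if (major != 4 || minor != 1)
        return false;
    return sublevel >= FIRST_AFFECTED_SUBLEVEL &&
           sublevel < FIRST_FIXED_SUBLEVEL;
}
```

With these assumptions, v4.1.16 is inside the window and v4.1.33 is outside it.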

Will hopefully be fixed with the new linux-4.1.33 kernel from Bug #41058.

Maybe enabling CONFIG_DEBUG_OBJECTS could help.
Comment 4 Philipp Hahn univentionstaff 2016-10-20 10:36:28 CEST
Might need a debug-enabled kernel build: <https://marc.info/?l=linux-btrfs&m=147694635511693&w=2>
Comment 5 Stefan Gohmann univentionstaff 2016-12-23 08:47:56 CET
Did it happen again, or has it been fixed with the latest kernel updates?
Comment 6 Florian Best univentionstaff 2017-06-28 14:52:13 CEST
There is a Customer ID set so I set the flag "Enterprise Customer affected".
Comment 7 Stefan Gohmann univentionstaff 2019-01-03 07:20:42 CET
This issue has been filed against UCS 4.1. The maintenance with bug and security fixes for UCS 4.1 ended on the 5th of April 2018.

Customers still on UCS 4.1 are encouraged to update to UCS 4.3. Please contact
your partner or Univention for any questions.

If this issue still occurs in newer UCS versions, please use "Clone this bug" or simply reopen the issue. In this case please provide detailed information on how this issue is affecting you.