Bug 48232 - Journald should be restarted when watchdog steps in
Status: RESOLVED WONTFIX
Product: UCS
Classification: Unclassified
Component: Upstream packages
Version: UCS 4.2
Hardware: Other Linux
Importance: P5 normal
Target Milestone: ---
Assigned To: UCS maintainers
Reported: 2018-11-28 17:06 CET by Christian Völker
Modified: 2020-07-03 20:52 CEST (History)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 3: Will affect average number of installed domains
How will those affected feel about the bug?: 3: A User would likely not purchase the product
User Pain: 0.257
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2018112821000588
Bug group (optional):
Max CVSS v3 score:


Description Christian Völker univentionstaff 2018-11-28 17:06:54 CET
journald writes its data to disk every five minutes. Default value in /etc/systemd/journald.conf:

SyncIntervalSec=5m

If the system is under high load (or at least /var is, e.g. because of mail traffic) AND lots of log messages arrive, journald might need more than a minute to write its data to disk.

While it is writing, it does not send watchdog pings, so the systemd watchdog assumes the journald service is dead and kills it:

Nov 18 13:56:36 mailsrv systemd[1]: systemd-journald.service watchdog timeout (limit 1min)!

I have not confirmed this yet, but it looks like journald gets killed and is not restarted afterwards, which is bad.

At least systemd should restart journald.
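
To check how often this happened and whether journald is currently running, something like the following could be used. This is only a sketch: the helper name count_watchdog_timeouts is made up for illustration, and /var/log/syslog as the log location is an assumption about the affected host.

```shell
#!/bin/sh
# Hypothetical helper: count journald watchdog timeouts in a syslog-style file.
# The pattern matches the message quoted in the description above.
count_watchdog_timeouts() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at 0 when there are no matches.
    grep -c 'systemd-journald\.service watchdog timeout' "$1" || true
}

# Example (log path is an assumption):
#   count_watchdog_timeouts /var/log/syslog
#   systemctl is-active systemd-journald   # reports whether journald is running now
```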
Comment 1 Christian Völker univentionstaff 2018-11-28 17:10:02 CET
Possible workarounds so far:

1. Improve disk speed for /var/log/journal (e.g. SSD or physically separate storage)

2. Shorten the sync interval, e.g. SyncIntervalSec=1m (the default is 5m); there is currently no UCR variable for this

3. Tune the filesystem via mount options (e.g. data=writeback, not recommended)
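
For workaround 2, a minimal sketch of setting the value by hand, since no UCR variable exists. The function name set_sync_interval is hypothetical, and the script assumes GNU sed and that the file contains only the [Journal] section, so appending a new line is safe.

```shell
#!/bin/sh
# Hypothetical helper: set SyncIntervalSec in a journald.conf-style file.
# Usage: set_sync_interval FILE VALUE
set_sync_interval() {
    conf="$1"
    value="$2"
    if grep -q '^[#[:space:]]*SyncIntervalSec=' "$conf"; then
        # Replace the existing line, including a commented-out default.
        sed -i "s/^[#[:space:]]*SyncIntervalSec=.*/SyncIntervalSec=$value/" "$conf"
    else
        # No such line yet: append it (assumes [Journal] is the only section).
        printf 'SyncIntervalSec=%s\n' "$value" >> "$conf"
    fi
}

# Example (as root):
#   set_sync_interval /etc/systemd/journald.conf 1m
#   systemctl restart systemd-journald
```

A manual change like this may be overwritten by package updates, so it is only meant as a stopgap.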
Comment 2 Christian Völker univentionstaff 2018-11-28 17:11:18 CET
Possible solutions:

A. Increase the watchdog timeout from 1min to e.g. 3min. Is it hard-coded? I did not find any way to set this value.

B. At least start the process again after it got killed.
Comment 3 Arvid Requate univentionstaff 2018-11-28 18:29:59 CET
Which UCS release / systemd package version?

* https://github.com/systemd/systemd/issues/1804
* https://github.com/systemd/systemd/issues/6283

root@member55:~# grep WatchdogSec /lib/systemd/system/systemd-journald.service 
WatchdogSec=3min
root@member55:~# lsb_release -r
Release:        4.3-2 errata344
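
Since WatchdogSec is an ordinary setting in the unit file, as the grep output above shows, it could in principle be overridden on an older release with a unit drop-in. The following is a sketch only, not tested on UCS 4.2: the function name write_journald_override and the file name watchdog.conf are made up, and whether the stock unit there already sets Restart= has not been checked.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that raises the watchdog timeout
# and asks systemd to restart the service whenever it exits.
# Usage: write_journald_override DROPIN_DIR TIMEOUT
write_journald_override() {
    dir="$1"
    timeout="$2"
    mkdir -p "$dir"
    # WatchdogSec= and Restart= are [Service] options of systemd units.
    cat > "$dir/watchdog.conf" <<EOF
[Service]
WatchdogSec=$timeout
Restart=always
EOF
}

# Example (as root):
#   write_journald_override /etc/systemd/system/systemd-journald.service.d 3min
#   systemctl daemon-reload
#   systemctl restart systemd-journald
```

A drop-in keeps the packaged unit file under /lib untouched, so the override is not lost when the package is updated.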
Comment 4 Christian Völker univentionstaff 2018-11-29 07:50:50 CET
version/erratalevel: 425
version/patchlevel: 4
version/releasename: Lesum
version/version: 4.2
Comment 5 Arvid Requate univentionstaff 2018-11-29 13:08:39 CET
Ok, UCS 4.2-x has systemd version 215-17. There's a Debian bug tracker entry discussing this issue, where people report that it has been resolved (or significantly improved) in 227-3:

 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=805042

So we may want to either backport a newer systemd version or (more likely) apply the upstream patch

 https://github.com/systemd/systemd/commit/4de2402b603ea2f518f451d06f09e15aeae54fab

which seems to have fixed

 https://github.com/systemd/systemd/issues/1804
Comment 6 Ingo Steuwer univentionstaff 2020-07-03 20:52:39 CEST
This issue has been filed against UCS 4.2.

UCS 4.2 is out of maintenance and many UCS components have changed in later releases. Thus, this issue is now being closed.

If this issue still occurs in newer UCS versions, please use "Clone this bug" or reopen it and update the UCS version. In this case please provide detailed information on how this issue is affecting you.