Bug 57000 - OPA process restart fails under load
OPA process restart fails under load
Status: NEW
Product: UCS@school
Classification: Unclassified
Component: HTTP-API (Kelvin)
UCS@school 5.0
Other Linux
: P5 normal (vote)
: ---
Assigned To: UCS@school maintainers
https://git.knut.univention.de/univen...
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2024-01-24 14:31 CET by Moritz Bunkus
Modified: 2024-02-07 16:04 CET (History)
4 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 3: Simply Wrong: The implementation doesn't match the docu
Who will be affected by this bug?: 3: Will affect average number of installed domains
How will those affected feel about the bug?: 2: A Pain – users won’t like this once they notice it
User Pain: 0.103
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Moritz Bunkus 2024-01-24 14:31:33 CET
[0 root@dc-primary.ucs.production.univention-id-broker.com ~] univention-app info
UCS: 5.0-6 errata916
Installed: ucsschool=5.0 v4 4.4/ucsschool-kelvin-rest-api=1.8.9

On almost all of our current UCS systems that have the Kelvin REST API installed logrotation fails on an almost daily basis in the corresponding post rotate script. The output from the journal is:

[0 root@dc-primary.ucs.production.univention-id-broker.com ~] journalctl --since today -u logrotate
-- Logs begin at Fri 2024-01-12 13:00:06 CET, end at Wed 2024-01-24 14:27:38 CET. --
Jan 24 00:00:00 dc-primary systemd[1]: Starting Rotate log files...
Jan 24 00:00:21 dc-primary logrotate[17568]:  * Stopping UCS@school Kelvin REST API server ...
Jan 24 00:00:21 dc-primary logrotate[17568]:  * start-stop-daemon: 1 process refused to stop
Jan 24 00:00:21 dc-primary logrotate[17568]:  [ !! ]
Jan 24 00:00:21 dc-primary logrotate[17568]:  * ERROR: ucsschool-kelvin-rest-api failed to stop
Jan 24 00:00:21 dc-primary logrotate[17568]:  * ERROR: cannot stop opa as ucsschool-kelvin-rest-api is still up
Jan 24 00:00:23 dc-primary logrotate[17568]: error: error running non-shared postrotate script for /var/log/univention/ucsschool-kelvin-rest-api/opa.log of '/var/log/univention/ucsschool-kelvin-rest-api/opa.log '
Jan 24 00:00:23 dc-primary systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Jan 24 00:00:23 dc-primary systemd[1]: logrotate.service: Failed with result 'exit-code'.
Jan 24 00:00:23 dc-primary systemd[1]: Failed to start Rotate log files.

This means that "logrotate.service" is now in status "failed", triggering monitoring systems.

Additionally it seems like the rotation doesn't actually work if this happens.

Running the post-rotate command manually after such a failure is usually enough to whack it back  into working order:

[0 root@dc-primary.ucs.production.univention-id-broker.com ~] univention-app shell ucsschool-kelvin-rest-api service opa restart
 * Stopping UCS@school Kelvin REST API server ...
 * start-stop-daemon: no matching processes found
 [ ok ]
 * Stopping Open Policy Agent server ...w [ ok ]
 * Starting Open Policy Agent server ...ho
 [ ok ]
 * Starting UCS@school Kelvin REST API server ... [ ok ]
[0 root@dc-primary.ucs.production.univention-id-broker.com ~]

…until the service fails again the next night (or the one after).

That being said, judging from the output I don't think OPA was actually running after such a botched post-rotate script, which would be bad. Cannot verify that at the moment, though, as I've already triggered the restart manually.
Comment 1 Jürn Brodersen univentionstaff 2024-02-07 14:07:31 CET
This happens because the kelvin api takes a while to stop under load (we wait try to finish already opened requests). We should probably increase the time out in the service and handle the timeout better.

https://git.knut.univention.de/univention/components/ucsschool-kelvin-rest-api/-/issues/134
Comment 2 Jan-Luca Kiok univentionstaff 2024-02-07 16:04:43 CET
As mentioned this happens because the stopping under load takes too long too finish. Nonetheless OPA should be stopped afterwards, but indeed it seems that it is not started - For the scenario of the systems involved in this bug this should not be a problem, but nonetheless we will increase the stopping timeout.

As a workaround you could adapt the postrotate step to first only stop the service, wait a short amount of time and then start it.