Bug 56391 - UMC-Server: never ending module process during app installation possible if socket connections were closed by client
UMC-Server: never ending module process during app installation possible if s...
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: UMC (Generic)
UCS 5.0
Other Linux
: P5 normal (vote)
: UCS 5.0-4-errata
Assigned To: Marius Meschter
Florian Best
https://git.knut.univention.de/univen...
:
Depends on: 56198 56336
Blocks:
  Show dependency treegraph
 
Reported: 2023-08-02 11:32 CEST by Florian Best
Modified: 2023-09-07 12:20 CEST (History)
9 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 3: Will affect average number of installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.429
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2023072021000181, 2023081421000252, 2023080921000128, 2023082521000071, 2023082421000126, 2023082921000108, 2023083021000123, 2023090621000266, 2023090721000255
Bug group (optional): Regression
Max CVSS v3 score:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Florian Best univentionstaff 2023-08-02 11:32:29 CEST
We fixed the below error in a way that there can be situations where module processes don't end after the module timeout because they think there is still an active connection.

This is a very rare and theoretical case as most modules respond immediately or at least after some time which causes the module process to end.
But we can fake a situation where the module process is never closed via the following steps:

ucr set umc/module/timeout=30 umc/http/session-timeout=60
systemctl restart univention-management-console-server
touch /var/run/univention-appcenter.lock
curl http://Administrator:univention@localhost:8090/command/appcenter/keep_alive -X POST
^C  # after some seconds just exit that request

pgrep -fl 'univention-management-console-module.*appcenter' # process runs, seems to never stop..

+++ This bug was initially created as a clone of Bug #56336 +++

Version: 5.0-4 errata740

Error: 
Traceback (most recent call last):
  File "%PY3%/tornado/web.py", line 1595, in _execute
    result = yield result
  File "%PY3%/tornado/gen.py", line 1133, in run
    value = future.result()
  File "%PY3%/univention/management/console/resources.py", line 728, in get
    self.future = process.request(self.request.method, self.request.full_url(), body=self.request.body or None, headers=headers)
  File "%PY3%/univention/management/console/resources.py", line 263, in request
    response = self.do_request(method, uri, headers, body, self.socket)
  File "%PY3%/univention/management/console/resources.py", line 147, in do_request
    return self._wrap_future(self._client.fetch(request, raise_error=True))
  File "%PY3%/tornado/httpclient.py", line 256, in fetch
    raise RuntimeError("fetch() called on closed AsyncHTTPClient")
RuntimeError: fetch() called on closed AsyncHTTPClient
Comment 1 Florian Best univentionstaff 2023-08-03 21:03:33 CEST
Probably the following exception occurs due to the current changes:

03.08.23 21:00:45.272  MAIN        ( INFO    ) : session 'b9f732eb-5327-49e0-9b8e-287431a7e96f' timed out
03.08.23 21:00:45.272  MAIN        ( INFO    ) : Cleaning up session 'b9f732eb-5327-49e0-9b8e-287431a7e96f'
03.08.23 21:00:45.272  MAIN        ( WARN    ) : Closing all sockets to currently opened module processes for this session.
03.08.23 21:00:45       ERROR      (    12681) : Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f5f33ff9d08>)
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 758, in _run_callback
        ret = callback()
      File "/usr/lib/python3/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
        return fn(*args, **kwargs)
      File "/usr/lib/python3/dist-packages/univention/management/console/session.py", line 255, in _session_timeout_timer
        self.processes.timed_out()
      File "/usr/lib/python3/dist-packages/univention/management/console/session.py", line 409, in timed_out
        proc.abort()
      File "/usr/lib/python3/dist-packages/univention/management/console/resources.py", line 125, in abort
        self._client.close()
      File "/usr/lib/python3/dist-packages/tornado/curl_httpclient.py", line 68, in close
        self._force_timeout_callback.stop()
    AttributeError: 'NoneType' object has no attribute 'stop'
Comment 2 Christina Scheinig univentionstaff 2023-08-14 11:13:23 CEST
Happend again.
The error occurs during access of Domaine --> DNS.
Comment 3 Christina Scheinig univentionstaff 2023-08-21 11:53:36 CEST
The customer asked for the status here. Feedback from the customer: The UMC has become almost unusable since this bug, especially because you always have to click away the traceback 3-4 times before you have the possibility to log out and log in again. And it seems to occur in the @School modules as well.

I raised the pain!
Comment 4 Florian Best univentionstaff 2023-08-22 03:40:23 CEST
(In reply to Christina Scheinig from comment #2)
> Happend again.
> The error occurs during access of Domaine --> DNS.

hmm, what happened again? We need exact details if something happens in customer enviroments. The general problem was fixed by Bug #56336.

(In reply to Christina Scheinig from comment #3)
> The customer asked for the status here.
I am in vacation until end of August. Nothing is prepared here.

> Feedback from the customer: The UMC
> has become almost unusable since this bug, especially because you always
> have to click away the traceback 3-4 times before you have the possibility
> to log out and log in again. And it seems to occur in the @School modules as
> well.
which traceback?
why is a re-login necessary?
why is UMC not usable anymore? how often does it happen? can you reliably reproduce it?
which exact UCS@school modules cause this?
Comment 6 Christina Scheinig univentionstaff 2023-08-24 14:49:57 CEST
In the specific session, the customer was logged into the UMC for about 2 minutes before the error message appeared. As you can see, first of all, the message can only be removed with multiple "close" prompts and only logging out and logging back in helps to be able to work again.
Comment 7 Erik Damrose univentionstaff 2023-08-25 12:32:04 CEST
A workaround should exist, i.e. downgrade to the old UMC. It is described here: https://help.univention.com/t/21661
Comment 8 Florian Best univentionstaff 2023-08-25 13:30:05 CEST
(In reply to Erik Damrose from comment #7)
> A workaround should exist, i.e. downgrade to the old UMC. It is described
> here: https://help.univention.com/t/21661

This should not be necessary. Maybe it's easier to go back to the original UMC version from UCS 5.0-4 without errata updates.
Comment 11 Marius Meschter univentionstaff 2023-08-28 16:37:06 CEST
Bug #56336 actually made it worse as it caused Bug #56336 comment 3 exceptions not only on aborted requests but instead of timed out sessions. So UMC could get unusable after an amount of time in certain conditions (same file descriptor number was assigned for the new socket).

Instead of closing the socket connection (which causes a bug in tornado) in the UMC-server to the module process now, we don't close the connection anymore but sending a HTTP request alogn with a request-ID to the moduleprocess which has a new "/cancel" endpoint. The "/cancel" endpoint picks the original request and abort it there, so that it is responded with any response. We don't care about that response as the real client already closed the connection and will never receive the answer. Then the module processes aren't blocked anymore to shutdown if no other active requests (meanwhile) exists. Additionally we don't end the module process if there are still working threads invoked by univention.mangement.console.modules.decorators.SimpleThread.

univention-management-console.yaml
a062d8b339ae | chore(umc): adjust advisory

univention-management-console-server.yaml
a062d8b339ae | chore(umc): adjust advisory
9044299c48c2 | fix(UMC): properly cancel aborted requests

univention-management-console (12.0.31-12)
9044299c48c2 | fix(UMC): properly cancel aborted requests
Comment 12 Florian Best univentionstaff 2023-08-29 12:31:48 CEST
OK: code review
OK: no connections from UMC-server to module processes are aborted manually anymore. Therefore the exceptions cannot happen anymore.
OK: a cancel request is send to the module process when the connection breaks up, which then causes that the request is responded
OK: active threads prevent module shutdown
OK: advisory description