Bug 51986 - Race condition in App Center cache invalidation+rebuild can cause App Center listener converter services to fail
Race condition in App Center cache invalidation+rebuild can cause App Center ...
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: App Center
UCS 4.4
Other Linux
: P5 normal (vote)
: UCS 5.0-0-errata
Assigned To: Nikola Radovanovic
Dirk Wiesenthal
:
Depends on:
Blocks: 54452
  Show dependency treegraph
 
Reported: 2020-09-10 10:31 CEST by Erik Damrose
Modified: 2022-05-12 11:06 CEST (History)
11 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 7: Crash: Bug causes crash or data loss
Who will be affected by this bug?: 1: Will affect a very few installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.200
Enterprise Customer affected?: Yes
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Erik Damrose univentionstaff 2020-09-10 10:31:56 CEST
Due to a race condition during the rebuild of the App Center cache an app listener converter service may stop running, affected apps will then no longer process user or group changes.

When univention-app update is running, the app center cache is invalidated, new metafiles are downloaded and extracted. The first usage of appcenter python library functions finds no cache and rebuilds it, e.g.

from univention.appcenter.app_cache import Apps
Apps().find(<appid>)

The race condition can happen when the univention-app update process is still unpacking the downloaded metafiles while another process trigger the cache rebuild. The rebuild is done by taking the app ini files in certain directories and building the cache from them. As the cache is then present, there is no mechanism to detect that not all files have been unpacked and the cache has to be rebuild.

That can lead to the case that an app is missing in the cache. The App Center listener converter tries to get the App object from the cache upon starting, if that fails, no processing will happen. The check is done in 

management/univention-appcenter/scripts/univention-appcenter-listener-converter:182
...
    app = Apps().find(args.app)
    if not app:
        logger.critical('App not found')
        sys.exit(1)

Symptoms:
/var/log/univention/listener_modules/<appid>.log shows
2020-09-10 10:01:48 CRITICAL App not found

# service univention-appcenter-listener-converter@<appid> status will show
Sep 09 16:40:21 ucs systemd[1]: univention-appcenter-listener-converter@<appid>.service: Main process exited, code=exited, status=1/FAILURE
Sep 09 16:40:21 ucs systemd[1]: univention-appcenter-listener-converter@<appid>.service: Unit entered failed state.
Sep 09 16:40:21 ucs-ma systemd[1]: univention-appcenter-listener-converter@ucs-to-school-transformer.service: Failed with result 'exit-code'.

Another univention-app update will usually not invalidate the cache immediately, because the cache is only invalidated when the metadata files changed on the app center server.

The chance for the problem to occur rises the more apps are installed that use a listener converter, as each service gets triggered every 5 seconds. The affected customer system has 3 apps with converters installed.

Workaround: remove cache files to force a rebuild with the next access to the cache. (Check if the test app center is used and adapt the path below accordingly)
rm /var/cache/univention-appcenter/appcenter.software-univention.de/4.4/.apps.*json
Comment 2 Dirk Ahrnke univentionstaff 2021-05-11 11:48:21 CEST
Note: it is currently unknown if this bug will really "only affect a few installed domains", however a key customer was affected frequently.
Comment 3 Thorsten univentionstaff 2021-08-17 14:50:16 CEST
This happend today also on my customers installation (which is also a major customer). It actually started after the weekly errata update cycle was completed and affected the only installed app the ox-connector. The ox-connector was not part of the actual errata update, but of course was running in parallel with the update.
Comment 5 Nikola Radovanovic univentionstaff 2021-11-23 11:24:36 CET
Please excuse me if I make some rough errors/assumptions since I don't know the system atm.

So far I have found two modules having/implementing their own locking mechanisms:
1. ucs/base/univention-updater/modules/univention/updater/locking.py
2. ucs/base/univention-lib/python/locking.py

Possibly there are more, but did not go that deep. Second one looks more natural place, but its just my assumption.

I see two possible approaches to overcome this (and similar) issues:
1. Named semaphore/mutex and file lock are candidates, each with its own strengths/flaws. Things to consider: shall we use non-blocking or blocking mode with timeouts and retries (maybe also configurable with reasonable defaults)
2. Use separate "authority" like Redis running on node providing locks. Additional dependency, but can also be used to cache things in applications

In both cases, using some sort of scoped lock implemented with the help of contextmanager, but done in one place and used by all Univention apps/modules. Idea is to acquire the lock, do the stuff and release the lock when finished. 

Challenge with first solution (OS IPC mechanisms) is if we use blocking mode, say process P1 creates the named lock (so it can be shared among processes) it becomes the "owner" of the lock. It is doing its stuff, process P2 kicks in and since in blocking mode has to wait. P1 finishes and P2 acquires the lock. Problem is how to make P2 new owner, so that the name does not get unlink when P1 finishes with the resource. This can be done naturally with reference count in shared memory, but its not trivial to test (implementation is not hard itself)

If we go with first, I also have some proposal here: https://git.knut.univention.de/nradovanovic/ipc_lock/-/blob/master/src/univention_ipc/locking.py
Comment 7 Dirk Wiesenthal univentionstaff 2021-11-26 15:59:09 CET
I cherry-picked your commit, built it for 5.0-0 and adjusted the YAML file.

Everything works as expected. The caches are clear after the update and will be rebuilt next time the cache is accessed. We could optimize further and only invalidate the cache when a new archive is found. But this would require more work in the "zsync" part of the update which seems not trivial.

univention-appcenter.yaml
98a16ee53937 | Bug #51986: YAML
3f955c5686dd | Bug #51986: Race condition in App Center cache invalidation

univention-appcenter (9.0.2-69)
3f955c5686dd | Bug #51986: Race condition in App Center cache invalidation