Univention Bugzilla – Bug 51986
Race condition in App Center cache invalidation+rebuild can cause App Center listener converter services to fail
Last modified: 2022-05-12 11:06:36 CEST
Due to a race condition during the rebuild of the App Center cache, an app listener converter service may stop running; affected apps will then no longer process user or group changes.

When univention-app update runs, the App Center cache is invalidated and new meta files are downloaded and extracted. The first use of an App Center Python library function finds no cache and rebuilds it, e.g.:

    from univention.appcenter.app_cache import Apps
    Apps().find(<appid>)

The race condition can occur while the univention-app update process is still unpacking the downloaded meta files and another process triggers the cache rebuild. The rebuild collects the app ini files from certain directories and builds the cache from them. Once the cache is present, there is no mechanism to detect that not all files had been unpacked and that the cache has to be rebuilt. This can leave an app missing from the cache.

The App Center listener converter tries to get the App object from the cache on startup; if that fails, no processing happens. The check is done in management/univention-appcenter/scripts/univention-appcenter-listener-converter:182:

    ...
    app = Apps().find(args.app)
    if not app:
        logger.critical('App not found')
        sys.exit(1)

Symptoms:

/var/log/univention/listener_modules/<appid>.log shows:

    2020-09-10 10:01:48 CRITICAL App not found

# service univention-appcenter-listener-converter@<appid> status shows:

    Sep 09 16:40:21 ucs systemd[1]: univention-appcenter-listener-converter@<appid>.service: Main process exited, code=exited, status=1/FAILURE
    Sep 09 16:40:21 ucs systemd[1]: univention-appcenter-listener-converter@<appid>.service: Unit entered failed state.
    Sep 09 16:40:21 ucs-ma systemd[1]: univention-appcenter-listener-converter@ucs-to-school-transformer.service: Failed with result 'exit-code'.
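The failure mode above can be shown with a minimal simulation, assuming the rebuild simply collects whatever app .ini files are on disk at that instant (the helper name and the app names below are made up for illustration, not App Center code):

```python
import os
import tempfile

def rebuild_cache(meta_dir):
    # stand-in for the real rebuild: list the app ini files currently on disk
    return sorted(f[:-4] for f in os.listdir(meta_dir) if f.endswith('.ini'))

meta_dir = tempfile.mkdtemp()
# univention-app update has unpacked only the first of two app ini files ...
open(os.path.join(meta_dir, 'appA.ini'), 'w').close()
# ... when another process triggers the rebuild:
cache = rebuild_cache(meta_dir)
# The second file arrives after the cache has already been written.
open(os.path.join(meta_dir, 'appB.ini'), 'w').close()
print(cache)            # ['appA'] -- appB is missing
print('appB' in cache)  # False: nothing detects the incomplete rebuild
```

Since the cache now exists, no later access notices that it was built from a half-unpacked directory; appB stays invisible until the cache is invalidated again.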
Another univention-app update will usually not invalidate the cache again, because the cache is only invalidated when the metadata files on the App Center server have changed. The chance of hitting the problem rises with the number of installed apps that use a listener converter, as each service is triggered every 5 seconds. The affected customer system has 3 apps with converters installed.

Workaround: remove the cache files to force a rebuild on the next cache access. (Check whether the test App Center is used and adapt the path below accordingly.)

    rm /var/cache/univention-appcenter/appcenter.software-univention.de/4.4/.apps.*json
Note: it is currently unknown whether this bug will really "only affect a few installed domains"; however, a key customer was affected frequently.
This happened today on my customer's installation as well (also a major customer). It started after the weekly errata update cycle had completed and affected the only installed app, the ox-connector. The ox-connector was not part of the errata update itself, but of course was running in parallel with the update.
Please excuse me if I make some rough errors/assumptions, since I don't know the system well yet. So far I have found two modules implementing their own locking mechanisms:

1. ucs/base/univention-updater/modules/univention/updater/locking.py
2. ucs/base/univention-lib/python/locking.py

There may be more, but I did not dig deeper. The second one looks like the more natural place, but that is just my assumption.

I see two possible approaches to solve this (and similar) issues:

1. A named semaphore/mutex or a file lock; each has its own strengths and flaws. Things to consider: should we use non-blocking or blocking mode with timeouts and retries (maybe configurable, with reasonable defaults)?
2. A separate lock "authority" such as Redis running on the node. This is an additional dependency, but it could also be used for application-level caching.

In both cases the lock would be a scoped lock implemented with a context manager, built in one place and used by all Univention apps/modules: acquire the lock, do the work, release the lock when finished.

The challenge with the first approach (OS IPC mechanisms) in blocking mode: say process P1 creates the named lock (so it can be shared among processes) and thereby becomes its "owner". While P1 does its work, process P2 kicks in and has to wait. P1 finishes and P2 acquires the lock. The problem is how to make P2 the new owner, so that the name is not unlinked when P1 is finished with the resource. This can be solved naturally with a reference count in shared memory, but it is not trivial to test (the implementation itself is not hard).

If we go with the first approach, I also have a proposal here: https://git.knut.univention.de/nradovanovic/ipc_lock/-/blob/master/src/univention_ipc/locking.py
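For illustration, a minimal sketch of the file-lock variant as a scoped context manager (all names, the path, and the timeout/poll defaults are my own, not existing Univention code). It never unlinks the lock file, which sidesteps the ownership/unlink problem described above:

```python
import errno
import fcntl
import os
import time
from contextlib import contextmanager

@contextmanager
def scoped_lock(path, timeout=60.0, poll=0.5):
    """Cross-process scoped lock built on flock(2): blocking, with a timeout.

    Sketch only. The lock file is deliberately never unlinked, so no process
    has to decide when the name may go away.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        deadline = time.monotonic() + timeout
        while True:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except OSError as exc:
                if exc.errno not in (errno.EAGAIN, errno.EACCES):
                    raise
                if time.monotonic() >= deadline:
                    raise TimeoutError('could not lock %s within %ss' % (path, timeout))
                time.sleep(poll)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Applied to this bug, both the unpacking in univention-app update and the cache rebuild would take the same lock, e.g. `with scoped_lock('/var/lock/appcenter-cache.lock'): ...` (hypothetical path), so a rebuild could never observe a half-unpacked directory.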
https://git.knut.univention.de/univention/ucs/-/merge_requests/191
I cherry-picked your commit, built it for 5.0-0 and adjusted the YAML file. Everything works as expected: the caches are cleared after the update and will be rebuilt the next time the cache is accessed. We could optimize further and only invalidate the cache when a new archive is found, but that would require more work in the "zsync" part of the update, which seems non-trivial.

univention-appcenter.yaml
    98a16ee53937 | Bug #51986: YAML
    3f955c5686dd | Bug #51986: Race condition in App Center cache invalidation
univention-appcenter (9.0.2-69)
    3f955c5686dd | Bug #51986: Race condition in App Center cache invalidation
<https://errata.software-univention.de/#/?erratum=5.0x161>