Univention Bugzilla – Bug 56621
Monitor state of app queues
Last modified: 2024-04-08 15:18:07 CEST
Apps that connect UCS to additional services normally have to queue the changes that should be processed, for retries, robustness and load handling. If they subscribe to certain objects like users or groups, there is first an in-queue from the App Center listener converter that contains JSON representations of the changed objects; the app itself then often works with out-queue(s) that buffer data to be delivered to the respective service. If processing does not go as planned (service unavailable, app crashed, object faulty, ...), the queues might grow or stay at a certain level. Therefore their level should be monitored, and alerts should be raised via the Prometheus Alertmanager at a certain state to inform administrators that the sync is not working as intended.

Some notes on that:
- The queues can grow to a considerable size quickly (>1,000,000 files reported after some days) if the app does not run
- If an object is faulty it might stay queued forever

Based on that, at least three kinds of alerts come to mind:
- Monotonous growth for a certain amount of time (hours?)
- No change in length for a certain amount of time (hours?)
- Queue does not reach 0 after a certain amount of time (days?)
Implementation idea:
- The Prometheus Node Exporter checks all queue directories of apps installed locally and reports the number of contained files (`ls` might fail if the directory is too big, `stat` might work)
- One metric is written that distinguishes app, queue type and maybe the queues themselves (example: the UCS@school ID Connector can sync to multiple environments and uses an out-queue per connected school authority)

Mock-up for this:

```
# HELP app_queue_length Number of changes to be processed by the app
# TYPE app_queue_length gauge
app_queue_length{app_id="1234",type="IN"} 7
app_queue_length{app_id="1234",type="OUT"} 13
app_queue_length{app_id="6789",type="IN"} 0
app_queue_length{app_id="6789",type="OUT",name="authority1"} 1503884
app_queue_length{app_id="6789",type="OUT",name="authority2"} 393
```
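Such a metrics file could be produced by a small "textfile collector" script that the node exporter picks up from its textfile directory. A minimal sketch, assuming a hypothetical directory layout (`<base_dir>/<app_id>/data/listener`, one file per queued change) and hypothetical paths; the real App Center layout and the shipped implementation may differ:

```python
"""Sketch of a node-exporter textfile collector for app queue lengths.

The directory layout used here is an assumption for illustration:
one in-queue per app at <base_dir>/<app_id>/data/listener.
"""
import os


def count_queue_files(path):
    # os.scandir() streams directory entries (a single readdir() pass),
    # so even a million-file queue is counted without building a list.
    with os.scandir(path) as entries:
        return sum(1 for entry in entries if entry.is_file())


def collect_metrics(base_dir):
    """Yield lines in the Prometheus text exposition format."""
    yield "# HELP app_queue_length Number of changes to be processed by the app"
    yield "# TYPE app_queue_length gauge"
    for app_id in sorted(os.listdir(base_dir)):
        queue_dir = os.path.join(base_dir, app_id, "data", "listener")
        if os.path.isdir(queue_dir):
            yield 'app_queue_length{app_id="%s",type="IN"} %d' % (
                app_id, count_queue_files(queue_dir))


def write_textfile(base_dir, target):
    """Write the metrics atomically, so the node exporter's textfile
    collector never picks up a half-written file."""
    tmp = target + ".tmp"
    with open(tmp, "w") as fh:
        fh.write("\n".join(collect_metrics(base_dir)) + "\n")
    os.replace(tmp, target)
```

Run periodically (cron or a systemd timer), e.g. `write_textfile("/var/lib/univention-appcenter/apps", "/var/lib/prometheus/node-exporter/app_queue_length.prom")`; both paths are assumptions here.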
```
root@mailer:~# mkdir 1m_files
root@mailer:~# cd 1m_files
root@mailer:~/1m_files# time for i in $(seq -w 1000000); do touch "file_no_$i"; done

real    81m10,939s
user    18m52,284s
sys     62m30,129s

root@mailer:~/1m_files# time ls -1 | wc -l
1000000

real    0m8,183s
user    0m7,587s
sys     0m0,620s

root@mailer:~/1m_files# time python3 -c 'import glob; len(glob.glob("*"))'
1000000

real    0m2,403s
user    0m1,472s
sys     0m0,929s

root@mailer:~/1m_files# time python3 -c 'import os; _, _, files = next(os.walk(".")); print(len(files))'
1000000

real    0m1,064s
user    0m0,471s
sys     0m0,593s

root@mailer:~/1m_files# time python3 -c 'import os; lst = os.listdir("."); print(len(lst))'
1000000

real    0m0,676s
user    0m0,213s
sys     0m0,463s
```

Although not the fastest variant, glob would be the safest when used with a pattern that excludes unwanted files and directories, like: "20??-??-??-*.json".
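The glob-with-pattern idea above could look like this. A minimal sketch; the date-prefixed pattern is the one suggested in the comment and may differ per app:

```python
import glob
import os


def count_queued(queue_dir, pattern="20??-??-??-*.json"):
    """Count only entries matching the queue's expected naming scheme;
    stray files that do not match the pattern are ignored.
    glob.escape() guards against wildcard characters in the path itself."""
    return len(glob.glob(os.path.join(glob.escape(queue_dir), pattern)))
```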
"find" is very fast when used with a pattern:

```
root@mailer:~/1m_files# time find -type f -name '*' | wc -l
1000000

real    0m1,509s
user    0m1,066s
sys     0m0,491s
```
(In reply to Daniel Tröder from comment #1)
> root@mailer:~/1m_files# time ls -1 | wc -l

1× listdir() + 1e6× stat()

> root@mailer:~/1m_files# time python3 -c 'import glob; len(glob.glob("*"))'

1× listdir() + List[1e6] elements

> root@mailer:~/1m_files# time python3 -c 'import os; _, _, files = next(os.walk(".")); print(len(files))'

RECURSIVE× listdir() [+ 1e6× stat() with Python <= 3.5]

> root@mailer:~/1m_files# time python3 -c 'import os; lst = os.listdir("."); print(len(lst))'

1× listdir()

(In reply to Daniel Tröder from comment #2)
> root@mailer:~/1m_files# time find -type f -name '*' | wc -l

RECURSIVE 1× readdir() in C (add '-maxdepth 1' to disable recursion)

python3 -c 'import os;print(sum(1 for p in os.scandir(".")))'

1× readdir() returning a generator instead of a list

python3 -c 'from pathlib import Path;print(sum(1 for p in Path(".").glob("*")))'

1× readdir() returning a generator instead of a list

> Although not the fastest variant, glob would be the safest, when used with a
> pattern that excludes unwanted files and directories like:
> "20??-??-??-*.json".

Just forbid anyone from directly poking a /var/spool/ directory.
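The safest and fastest variants from the two comments can be combined: one streaming readdir() pass via os.scandir(), filtered by the naming pattern with fnmatch. A minimal sketch; the pattern is the one proposed earlier and is an assumption:

```python
import fnmatch
import os


def count_matching(path, pattern="20??-??-??-*.json"):
    """One readdir() pass via a generator: no million-element list in
    memory, and entry.is_file() is usually answered from the cached
    directory entry (d_type) without an extra stat() per file."""
    with os.scandir(path) as entries:
        return sum(
            1 for entry in entries
            if entry.is_file() and fnmatch.fnmatch(entry.name, pattern)
        )
```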
(In reply to Philipp Hahn from comment #3)
> (In reply to Daniel Tröder from comment #1)
> > Although not the fastest variant, glob would be the safest, when used with a
> > pattern that excludes unwanted files and directories like:
> > "20??-??-??-*.json".
>
> Just forbid anyone from directly poking a /var/spool/ directory.

It's in /var/lib/univention-appcenter, a directory apps are supposed to work in. What I am striving for is "robustness": handle the unexpected.
Added diagnostic module plugin to monitor the state of app queues.

5.0-7
Package: univention-management-console-module-diagnostic
Version: 6.0.8-2
Branch: ucs_5.0-0
Scope: errata5.0-7
---
univention-management-console-module-diagnostic.yaml
783c32e11c22 | Bug #56621: Monitor state of app queues
univention-management-console-module-diagnostic (6.0.8-2)
783c32e11c22 | Bug #56621: Monitor state of app queues

5.1
Package: univention-management-console-module-diagnostic
Version: 7.0.9
Branch: ucs_5.1-0
---
univention-management-console-module-diagnostic (7.0.9)
473b52639bd3 | Bug #56621: Monitor state of app queues

5.2
Package: univention-management-console-module-diagnostic
Version: 8.0.11
Branch: ucs_5.2-0
---
univention-management-console-module-diagnostic (8.0.11)
4468b6158d48 | Bug #56621: Monitor state of app queues
QA:
OK: successful build for 5.0, 5.1, 5.2
OK: new diagnostic module 69_check_app_listener_queue.py
OK: tested on UCS 5.0, 5.2
OK: translations
OK: advisories
OK: tests
The text for "resolving" this issue is unsatisfactory. Please describe the solution in technical and non-technical terms:
1. What solution has been chosen?
2. Why was it chosen? Why not another one, like the OP suggested?
3. What does the solution do?
4. How does the solution do it?
Thanks for reopening, I would have done the same, albeit for another reason: while I am of course OK with choosing a different approach than the proposed one, the delivered solution does *not* meet the requirements:

> Therefore their level should be monitored and alerts via the Prometheus Alertmanager should be given at a certain state to inform administrators that the sync is not working as intended.

A diagnostic module is something you have to call actively to be informed, so you already have to be aware of a problematic situation, and it then helps you debug. But that is not the situation this request is about: what we _need_ is something that monitors the queues on its own and alerts administrators about a possible problem in case they *do not know* about it otherwise.

I am a bit unhappy that (to my knowledge) neither I nor any other stakeholder was consulted; in fact, I raised concerns about the chosen approach to the respective PO months ago, when I learned about it because the MR was linked to this bug. While you are free to choose the implementation, as the one who opened the request I would have liked to be part of the discussion, because we could have avoided this early on.
*** Bug 51896 has been marked as a duplicate of this bug. ***
The way I look at it is: this bug describes the need for automated monitoring of the app queues. The issue that was being worked on describes the need for a UMC diagnostic plugin. So the reference to this bug is wrong. We have opened Bug#57217 for this now.

The diagnostic plugin could even be the first step towards a Prometheus integration, in that it can write the required file. Maybe that is not necessary, as the node exporter seems easy. Although... that name="authority1" thing could prove to be complicated. Anyway, for now, we focused on the manual UMC side of things. Sorry for the confusion.
> The issue that was being worked on describes the need for a UMC diagnostic plugin. So the reference to this bug is wrong. We have opened Bug#57217 for this now.

That is now even stranger: where does the request for the new Bug#57217 come from? Who requested a diagnostic module? If that bug was created to justify the already-written code, please delete both the issue and the code. I don't know of any need for manual checking of full queues. It's too late! If you want, you can add the created code to USI. But there is a need to check full queues automatically. Please do not release unrequested, unnecessary code.