Bug 56621 - Monitor state of app queues
Status: REOPENED
Product: UCS
Classification: Unclassified
Component: App Center
Version: UCS 5.0
Hardware: Other Linux
Importance: P5 normal
Target Milestone: UCS 5.0-7-errata
Assigned To: App Center maintainers
QA Contact: App Center maintainers
URL: https://git.knut.univention.de/univen...
Duplicates: 51896
Depends on:
Blocks: 56578 51896
Reported: 2023-09-18 17:32 CEST by Jan-Luca Kiok
Modified: 2024-04-08 15:18 CEST
CC: 7 users

What kind of report is it?: Feature Request
Bug group (optional): Large environments, Troubleshooting


Description Jan-Luca Kiok univentionstaff 2023-09-18 17:32:58 CEST
Apps that connect UCS to additional services usually have to queue the changes to be processed, for the sake of retries, robustness and load handling.
If they subscribe to certain objects such as users or groups, there is first an in-queue, filled by the App Center listener converter, which contains JSON representations of the changed objects; the app itself then often works with one or more out-queues that buffer the data to be delivered to the respective service.

If processing does not go as planned (service unavailable, app crashed, faulty object, ...), the queues may grow or remain stuck at a certain level.
Their fill level should therefore be monitored, and alerts should be raised via the Prometheus Alertmanager once a certain state is reached, so that administrators are informed that the sync is not working as intended.


Some notes on that:

- The queues can quickly grow to a considerable size (more than 1,000,000 files have been reported after a few days) if the app is not running
- If an object is faulty, it might stay queued forever

Based on that, at least three kinds of alerts come to mind (a rule sketch follows the list):

- Monotonic growth for a certain amount of time (hours?)
- No changes in length for a certain amount of time (hours?)
- Queue does not reach 0 after a certain amount of time (days?)
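
As a rough sketch only, and assuming a gauge metric such as the app_queue_length mocked up further below, these three conditions could be expressed as Prometheus alerting rules; the expressions and time windows are placeholders, not a finished design:

```
groups:
  - name: app_queue_alerts
    rules:
      # Queue grew over the last two hours
      - alert: AppQueueGrowing
        expr: delta(app_queue_length[2h]) > 0
      # Queue is non-empty but its length has not changed for two hours
      - alert: AppQueueStuck
        expr: app_queue_length > 0 and changes(app_queue_length[2h]) == 0
      # Queue has not been empty at any point during the last three days
      - alert: AppQueueNeverEmpty
        expr: min_over_time(app_queue_length[3d]) > 0
```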


Implementation idea:

- The Prometheus Node Exporter checks all queue directories of locally installed apps and reports the number of files they contain (`ls` might fail if the directory is too big, `stat`-based approaches might work)
- One metric is written that distinguishes the app, the queue type and maybe the individual queues (example: the UCS@school ID Connector can sync to multiple environments and uses an out-queue per connected school authority)

Mock-up for this:

```
# HELP app_queue_length Number of changes to be processed by the app
# TYPE app_queue_length gauge
app_queue_length{app_id="1234",type="IN"} 7
app_queue_length{app_id="1234",type="OUT"} 13
app_queue_length{app_id="6789",type="IN"} 0
app_queue_length{app_id="6789",type="OUT",name="authority1"} 1503884
app_queue_length{app_id="6789",type="OUT",name="authority2"} 393
```
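
To make the node exporter idea more concrete, here is a minimal sketch of a textfile-collector style script; the queue directory layout (one in/out queue directory per app below /var/lib/univention-appcenter) and the output path are assumptions for illustration only, not the actual App Center layout:

```
#!/usr/bin/env python3
# Sketch: count app queue files and expose them for the node exporter.
# The directory layout and the textfile-collector path are assumptions
# for illustration only; they do not reflect the real App Center layout.
import os

BASE = "/var/lib/univention-appcenter/apps"  # assumed queue base directory
TEXTFILE = "/var/lib/prometheus/node-exporter/app_queue_length.prom"  # assumed collector path


def count_queue_files(path):
    """Count queue entries without building a full list in memory."""
    try:
        return sum(1 for entry in os.scandir(path) if entry.is_file())
    except FileNotFoundError:
        return None


def main():
    lines = [
        "# HELP app_queue_length Number of changes to be processed by the app",
        "# TYPE app_queue_length gauge",
    ]
    if os.path.isdir(BASE):
        for app_id in sorted(os.listdir(BASE)):
            for queue_type in ("IN", "OUT"):
                # hypothetical per-app queue directory, e.g. .../<app_id>/queue_in
                path = os.path.join(BASE, app_id, "queue_" + queue_type.lower())
                length = count_queue_files(path)
                if length is not None:
                    lines.append(
                        'app_queue_length{app_id="%s",type="%s"} %d'
                        % (app_id, queue_type, length)
                    )
    # write atomically so the node exporter never reads a half-written file
    tmp = TEXTFILE + ".tmp"
    with open(tmp, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    os.replace(tmp, TEXTFILE)


if __name__ == "__main__":
    main()
```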
Comment 1 Daniel Tröder univentionstaff 2023-09-18 18:35:13 CEST
root@mailer:~# mkdir 1m_files
root@mailer:~# cd 1m_files
root@mailer:~/1m_files# time for i in $(seq -w 1000000); do touch "file_no_$i"; done

real	81m10,939s
user	18m52,284s
sys	62m30,129s

root@mailer:~/1m_files# time ls -1 | wc -l
1000000

real	0m8,183s
user	0m7,587s
sys	0m0,620s

root@mailer:~/1m_files# time python3 -c 'import glob; print(len(glob.glob("*")))'
1000000

real	0m2,403s
user	0m1,472s
sys	0m0,929s

root@mailer:~/1m_files# time python3 -c 'import os; _, _, files = next(os.walk(".")); print(len(files))'
1000000

real	0m1,064s
user	0m0,471s
sys	0m0,593s

root@mailer:~/1m_files# time python3 -c 'import os; lst = os.listdir("."); print(len(lst))'
1000000

real	0m0,676s
user	0m0,213s
sys	0m0,463s

Although not the fastest variant, glob would be the safest when used with a pattern that excludes unwanted files and directories, e.g. "20??-??-??-*.json".
Comment 2 Daniel Tröder univentionstaff 2023-09-18 18:38:16 CEST
"find" is very fast when used with a pattern:

root@mailer:~/1m_files# time find -type f -name '*' | wc -l
1000000

real	0m1,509s
user	0m1,066s
sys	0m0,491s
Comment 3 Philipp Hahn univentionstaff 2023-09-20 18:38:06 CEST
(In reply to Daniel Tröder from comment #1)
> root@mailer:~/1m_files# time ls -1 | wc -l

1× listdir() + 1e6× stat()

> root@mailer:~/1m_files# time python3 -c 'import glob; print(len(glob.glob("*")))'

1× listdir() + List[1e6] elements

> root@mailer:~/1m_files# time python3 -c 'import os; _, _, files =
> next(os.walk(".")); print(len(files))'

RECURSIVE× listdir() [+ 1e6× stat() with Python <= 3.5]

> root@mailer:~/1m_files# time python3 -c 'import os; lst = os.listdir("."); print(len(lst))'

1× listdir()

(In reply to Daniel Tröder from comment #2)
> root@mailer:~/1m_files# time find -type f -name '*' | wc -l

RECURSIVE 1× readdir() in C

(add '-maxdepth 1' to disable recursion)


python3 -c 'import os;print(sum(1 for p in os.scandir(".")))'

1× readdir() returning a generator instead of a list


python3 -c 'from pathlib import Path;print(sum(1 for p in Path(".").glob("*")))'

1× readdir() returning a generator instead of a list


> Although not the fastest variant, glob would be the safest, when used with a
> pattern that excludes unwanted files and directories like:
> "20??-??-??-*.json".

Just forbid anyone from directly poking a /var/spool/ directory.
Comment 4 Daniel Tröder univentionstaff 2023-09-21 09:11:23 CEST
(In reply to Philipp Hahn from comment #3)
> (In reply to Daniel Tröder from comment #1)
> > Although not the fastest variant, glob would be the safest, when used with a
> > pattern that excludes unwanted files and directories like:
> > "20??-??-??-*.json".
> 
> Just forbid anyone from directly poking a /var/spool/ directory.

It's in /var/lib/univention-appcenter, a directory apps are supposed to work in.
What I am striving for is "robustness": handle the unexpected.
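
A minimal sketch of what a "fast but robust" count could look like, combining os.scandir() from comment 3 with the filename pattern from comment 1; the pattern is just the example given there and would need to match the real queue file names:

```
import fnmatch
import os


def queue_length(path, pattern="20??-??-??-*.json"):
    """Count queue files matching the pattern, ignoring anything unexpected."""
    return sum(
        1
        for entry in os.scandir(path)
        if entry.is_file(follow_symlinks=False) and fnmatch.fnmatch(entry.name, pattern)
    )
```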
Comment 5 Juan Carlos univentionstaff 2024-04-04 13:06:13 CEST
Added diagnostic module plugin to monitor the state of app queues.


5.0-7

Package: univention-management-console-module-diagnostic
Version: 6.0.8-2
Branch: ucs_5.0-0
Scope: errata5.0-7

---

univention-management-console-module-diagnostic.yaml
783c32e11c22 | Bug #56621: Monitor state of app queues

univention-management-console-module-diagnostic (6.0.8-2)
783c32e11c22 | Bug #56621: Monitor state of app queues


5.1

Package: univention-management-console-module-diagnostic
Version: 7.0.9
Branch: ucs_5.1-0

---

univention-management-console-module-diagnostic (7.0.9)
473b52639bd3 | Bug #56621: Monitor state of app queues

5.2

Package: univention-management-console-module-diagnostic
Version: 8.0.11
Branch: ucs_5.2-0

---

univention-management-console-module-diagnostic (8.0.11)
4468b6158d48 | Bug #56621: Monitor state of app queues
Comment 6 Christian Castens univentionstaff 2024-04-05 09:25:51 CEST
QA:
  OK: successful build for 5.0, 5.1, 5.2
  OK: new diagnostic module 69_check_app_listener_queue.py
  OK: tested on UCS 5.0, 5.2
  OK: translations
  OK: advisories
  OK: tests
Comment 7 Daniel Tröder univentionstaff 2024-04-08 11:05:49 CEST
The text for "resolving" this issue is unsatisfactory.
Please describe the solution in technical and non-technical terms:

1. What solution has been chosen?
2. Why was it chosen?
   Why not another one - like the OP suggested?
3. What does the solution do?
4. How does the solution do it?
Comment 8 Jan-Luca Kiok univentionstaff 2024-04-08 11:43:09 CEST
Thanks for reopening; I would have done the same, albeit for another reason: while I am of course OK with choosing a different approach than the proposed one, the delivered solution does *not* meet the requirements:

> Therefore their level should be monitored and alerts via the Prometheus Alertmanager should be given at a certain state to inform administrators that the sync is not working as intended.

A diagnostic module is something you have to call actively in order to be informed, so you already have to be aware of a problematic situation; it helps you with debugging.

But that's not the situation this request is about:
What we _need_ is something that monitors the queues on its own and alerts administrators about a possible problem in case they *do not know* about it otherwise.

I am a bit unhappy that (to my knowledge) neither I nor any other stakeholder was consulted; in fact, I raised concerns about the chosen approach to the respective PO months ago, when I learned about it because the MR was linked to this bug. While you are free to choose the implementation, as the one who opened the request I would have liked to be part of the discussion, because we could have avoided this early on.
Comment 9 Christian Castens univentionstaff 2024-04-08 11:56:35 CEST
*** Bug 51896 has been marked as a duplicate of this bug. ***
Comment 10 Dirk Wiesenthal univentionstaff 2024-04-08 13:19:46 CEST
The way I look at it: this bug describes the need for automated monitoring of the app queues.

The issue that was being worked on describes the need for a UMC diagnostic plugin. So the reference to this bug is wrong. We have opened Bug#57217 for this now.

The diagnostic plugin could even be the first step towards a Prometheus integration, in that it can write the required file.
Maybe that is not even necessary, as the node exporter approach seems easy. Although... that name="authority1" label could prove to be complicated. Anyway, for now we focused on the manual UMC side of things. Sorry for the confusion.
Comment 11 Daniel Tröder univentionstaff 2024-04-08 15:18:07 CEST
> The issue that was being worked on describes the need for a UMC diagnostic plugin. So the reference to this bug is wrong. We have opened Bug#57217 for this now.

That is now even stranger.
Where does the request for the new Bug#57217 come from? Who requested a diagnostic module?
If that bug was created to justify the created code, please delete both the issue and the code.
I don't know of any need for manual checking of full queues. It's too late! If you want, you can add the created code to USI.
But there is a need to check full queues automatically.
Please do not release unrequested, unnecessary code.