Bug 56578 - Monitoring for ID Connector is missing
Status: NEW
Product: UCS@school
Classification: Unclassified
Component: ucsschool-id-connector
Version: UCS@school 5.0
Hardware/OS: Other Linux
Importance: P5 normal
Assigned To: UCS@school maintainers
Depends on: 56621 56623
Blocks:
Reported: 2023-09-13 13:26 CEST by Stefan Gohmann
Modified: 2023-12-13 11:55 CET
CC List: 4 users

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 1: Will affect a very few installed domains
How will those affected feel about the bug?: 3: A User would likely not purchase the product
User Pain: 0.086
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2023091221000601
Bug group (optional):
Max CVSS v3 score:


Description Stefan Gohmann univentionstaff 2023-09-13 13:26:37 CEST
We had a major outage in a big school environment where the ID Connector didn't work for a few weeks. This resulted in a queue with more than 1,000,000 operations after the ID Connector was restarted.

There should be default monitoring that warns when the ID Connector backlog grows too large. We have similar monitoring checks for LDAP and Samba replication.
Comment 1 Daniel Tröder univentionstaff 2023-09-13 15:26:02 CEST
For that purpose, the ID Connector has a REST API with an endpoint at: https://FQDN/ucsschool-id-connector/api/v1/queues

It returns a list of queues, each with the following attributes:

name: name of the queue, either "in-queue" or "out-queue(<school_authority>)"
head: listener JSON file currently being processed
length: number of items in the queue (files in the directory)
school_authority: identifier the customer used in the configuration


The endpoint ".../queues/{name}" can be used to retrieve the data for a single queue.

IMHO the monitoring should observe _all_ queues and, to do so, calculate the sum of all "length" attributes.

* A growing in-queue is an indicator that the Docker container / ID-C queue process is not running, is unstable, or is too slow.
* A growing out-queue is an indicator that the ID-C queue process cannot reach one of its targets.

In both cases, the system operator should be notified.

------

The thing is that when the Docker container / ID-C queue process is not running, the REST API will not work either.
But that can be seen as a clear indicator of a problem and used for an alert as well.
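
A minimal check combining both points could look roughly like this (a sketch only: the endpoint path and attribute names are from above, while the host name, threshold, and exit codes are placeholders, and any authentication the API requires is omitted):

    #!/usr/bin/env python3
    # Sketch of a backlog check against the queues endpoint described above.
    # Host, threshold and exit codes are illustrative; the real API may also
    # require authentication, which is omitted here.
    import sys

    import requests

    QUEUES_URL = "https://FQDN/ucsschool-id-connector/api/v1/queues"  # placeholder host
    BACKLOG_WARN_THRESHOLD = 10000  # example value, not an official default


    def check_backlog() -> int:
        try:
            resp = requests.get(QUEUES_URL, timeout=10)
            resp.raise_for_status()
        except requests.RequestException as exc:
            # If the Docker container / queue process is down, the REST API is
            # down too; treat that as an alert in its own right.
            print(f"CRITICAL: ID Connector API not reachable: {exc}")
            return 2
        total = sum(queue["length"] for queue in resp.json())
        if total > BACKLOG_WARN_THRESHOLD:
            print(f"WARNING: ID Connector backlog is {total} items")
            return 1
        print(f"OK: ID Connector backlog is {total} items")
        return 0


    if __name__ == "__main__":
        sys.exit(check_backlog())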

------

There is an alternative way to look at this problem: just look at the filesystem.
For _every_ Docker app, the appcenter listener drops small JSON files in /var/lib/univention-appcenter/listener/<app-id>/
Those are read by the appcenter listener converter, which then drops bigger JSON files in /var/lib/univention-appcenter/apps/<app-id>/data/listener/
That last directory is what the in-queue reads, and the number of files in there is the "length" in the REST API.

So a *generic solution to monitor _all_ Docker apps* would be to watch /var/lib/univention-appcenter/listener/<app-id>/ and /var/lib/univention-appcenter/apps/<app-id>/data/listener/.

A specialized solution for the ID-C app would additionally fetch the length of all out-queues and thus also detect connection problems to downstream systems.
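
A rough sketch of such a filesystem-based check for a single app (the app id and the assumption that queue items are the *.json files in those directories are for illustration only):

    # Count the pending listener files in the two directories named above.
    from pathlib import Path

    APP_ID = "ucsschool-id-connector"  # example app id
    RAW_DIR = Path(f"/var/lib/univention-appcenter/listener/{APP_ID}")
    CONVERTED_DIR = Path(f"/var/lib/univention-appcenter/apps/{APP_ID}/data/listener")


    def count_json_files(directory: Path) -> int:
        # Assumes the queue items are the *.json files in the directory.
        return len(list(directory.glob("*.json"))) if directory.is_dir() else 0


    if __name__ == "__main__":
        print(f"unconverted listener files: {count_json_files(RAW_DIR)}")
        print(f"in-queue length:            {count_json_files(CONVERTED_DIR)}")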

@Dirk: What do you think about the generic monitoring for all Docker apps?

------

@Stefan: What is the current monitoring solution in UCS 5.0? It doesn't have Nagios anymore. How is an operator now informed about problems with the LDAP or Samba replication?
Comment 2 Stefan Gohmann univentionstaff 2023-09-13 16:19:27 CEST
@Daniel: Prometheus Alertmanager

https://docs.software-univention.de/manual/5.0/en/monitoring/monitoring.html#monitoring
Comment 5 Daniel Tröder univentionstaff 2023-12-13 11:55:06 CET
If there are plans to implement a solution for this issue, I would like to discuss it.

IMHO, there is a chance for a big win for all of UCS if the generic solution is implemented, in addition to the specific one.
Actually, I don't see how the problem can be solved without implementing both!

The generic solution should be maintained by team Bitflip.
It can be implemented by them or the school team.

The generic solution should consist of a small HTTP service in a Docker container that read-only bind-mounts /var/lib/univention-appcenter/ and observes its directories.
It exposes a Prometheus metrics endpoint. A production example exists in the SDDB:

https://git.knut.univention.de/univention/ucsschool-components/id-broker-self-disclosure-db-builder/-/blob/main/id-broker-self-disclosure-db-builder/sddb_builder/rest/v1/sddb_metrics.py

Those metrics can be consumed by Prometheus.
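
A minimal sketch of such an exporter, assuming the prometheus_client library; the metric name, labels, port, and scrape interval are made up for illustration and are not taken from the SDDB code:

    # Tiny exporter that publishes the listener backlog of all apps as a
    # Prometheus gauge. Intended to run in a container with
    # /var/lib/univention-appcenter/ bind-mounted read-only.
    import time
    from pathlib import Path

    from prometheus_client import Gauge, start_http_server

    BASE = Path("/var/lib/univention-appcenter")
    QUEUE_LENGTH = Gauge(
        "appcenter_listener_queue_length",
        "Number of pending listener JSON files per app and stage",
        ["app_id", "stage"],
    )


    def collect() -> None:
        # "raw": files written by the appcenter listener,
        # "converted": files written by the listener converter (= in-queue).
        for stage, pattern in (("raw", "listener/*"), ("converted", "apps/*/data/listener")):
            for directory in BASE.glob(pattern):
                if not directory.is_dir():
                    continue
                app_id = directory.name if stage == "raw" else directory.parents[1].name
                QUEUE_LENGTH.labels(app_id=app_id, stage=stage).set(
                    len(list(directory.glob("*.json")))
                )


    if __name__ == "__main__":
        start_http_server(9200)  # arbitrary example port
        while True:
            collect()
            time.sleep(30)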

---

The specific _additional_ solution for the ID Connector would simply be to add a Prometheus metrics endpoint to the existing HTTP API. The code to produce the required statistics already exists; it just needs a Prometheus-compatible interface.
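
A sketch of what that could look like, assuming the existing API is an ASGI/FastAPI application and that the queue statistics are available through a helper like the hypothetical get_queues() below (all names are illustrative):

    from fastapi import FastAPI
    from fastapi.responses import Response
    from prometheus_client import CONTENT_TYPE_LATEST, CollectorRegistry, Gauge, generate_latest

    app = FastAPI()  # in the real service this object already exists


    def get_queues() -> list:
        # Hypothetical stand-in for the existing code that already knows
        # every queue's name and length.
        return [{"name": "in-queue", "length": 0}]


    @app.get("/metrics")
    def metrics() -> Response:
        registry = CollectorRegistry()
        gauge = Gauge(
            "id_connector_queue_length",
            "Number of items in an ID Connector queue",
            ["queue"],
            registry=registry,
        )
        for queue in get_queues():
            gauge.labels(queue=queue["name"]).set(queue["length"])
        return Response(content=generate_latest(registry), media_type=CONTENT_TYPE_LATEST)

Prometheus could then scrape that endpoint (wherever the route is mounted), and the Alertmanager setup mentioned in comment 2 could alert on the queue lengths.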