Univention Bugzilla – Bug 56578
Monitoring for ID Connector is missing
Last modified: 2023-12-13 11:55:06 CET
We had a major outage in a big school environment where the ID Connector didn't work for a few weeks. This resulted in a queue with more than 1.000.000 operations after the ID Connector was restarted. There should be a default monitoring check that warns when the ID Connector backlog grows too large. We have similar monitoring checks for the LDAP and Samba replication.
For that purpose, the ID Connector has a REST API with an endpoint at https://FQDN/ucsschool-id-connector/api/v1/queues. It returns a list of queues with the attributes:

* name: name of the queue, either "in-queue" or "out-queue(<school_authority>)"
* head: listener JSON file currently being worked on
* length: number of items in the queue (files in the directory)
* school_authority: identifier the customer used in the configuration

The endpoint ".../queues/{name}" can be used to retrieve the data for a single queue. IMHO the monitoring should observe _all_ queues and calculate the sum of all "length" attributes.

* A growing in-queue is an indicator that the Docker container / ID-C queue process is not running, unstable, or too slow.
* A growing out-queue is an indicator that the ID-C queue process cannot reach one of its targets.

In both cases, the system operator should be notified.

------

The catch is that when the Docker container / ID-C queue process is not running, the REST API will not work either. But that itself is a clear indicator of a problem and can be used for an alert as well.

------

There is an alternative way to look at this problem: just looking at the filesystem. For _every_ Docker app, the appcenter listener drops small JSON files into /var/lib/univention-appcenter/listener/<app-id>/. Those are read by the appcenter listener converter, which then drops bigger JSON files into /var/lib/univention-appcenter/apps/<app-id>/data/listener/. That last directory is what the in-queue reads, and the number of files in it is the "length" reported by the REST API.

So a *generic solution to monitor _all_ Docker apps* would be to watch /var/lib/univention-appcenter/listener/<app-id>/ and /var/lib/univention-appcenter/apps/<app-id>/data/listener/. A specialized solution for the ID-C app would additionally fetch the length of all out-queues and thereby detect connection problems to downstream systems.

@Dirk: What do you think about the generic monitoring for all Docker apps?
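A minimal sketch of such a check, assuming the endpoint and attribute names described above; the warning threshold and function names are made up for illustration:

```python
# Hypothetical backlog check: sum the "length" attribute over all ID
# Connector queues and compare against a warning threshold. A failed
# request is itself an alert condition, because the REST API is down
# whenever the queue process is not running.
import json
import urllib.request

WARN_THRESHOLD = 10000  # assumed value; tune per environment


def total_backlog(queues: list) -> int:
    """Sum the 'length' attribute over all queues (in-queue and out-queues)."""
    return sum(q.get("length", 0) for q in queues)


def check_id_connector(fqdn: str) -> tuple:
    """Return (ok, backlog); backlog is -1 when the API is unreachable."""
    url = f"https://{fqdn}/ucsschool-id-connector/api/v1/queues"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            queues = json.load(resp)
    except OSError:
        return False, -1  # API unreachable -> alert as well
    backlog = total_backlog(queues)
    return backlog < WARN_THRESHOLD, backlog
```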
------ @Stefan: What is the current monitoring solution in UCS 5.0? It no longer ships Nagios. How is an operator now informed about problems with the LDAP or Samba replication?
@Daniel: Prometheus Alertmanager https://docs.software-univention.de/manual/5.0/en/monitoring/monitoring.html#monitoring
If there are plans to implement a solution for this issue, I would like to discuss it. IMHO there is a chance for a big win for all of UCS if the generic solution is implemented in addition to the specific one. Actually, I don't see how the problem can be solved without implementing both!

The generic solution should be maintained by team Bitflip; it can be implemented by them or by the school team. It should consist of a small HTTP service in a Docker container that read-only bind-mounts /var/lib/univention-appcenter/ and observes its directories, exposing a Prometheus metrics endpoint. An example in production exists in the SDDB: https://git.knut.univention.de/univention/ucsschool-components/id-broker-self-disclosure-db-builder/-/blob/main/id-broker-self-disclosure-db-builder/sddb_builder/rest/v1/sddb_metrics.py Those metrics can be consumed by Prometheus.

---

The specific _additional_ solution for the ID Connector would be to simply add a Prometheus metrics endpoint to the existing HTTP API. It already contains the code to produce the required statistics; it just needs a Prometheus-compatible interface.