Univention Bugzilla – Bug 54947
Alert fires for expressions not assigned to the hostname
Last modified: 2022-07-20 18:20:24 CEST
Every host executes every monitoring plugin check script and therefore writes metric data to Prometheus via the Prometheus node exporter. The alert expressions (part of the alert objects in LDAP) aren't bound to specific hosts. E.g. an expression is `univention_cups_running != 1` and not `univention_cups_running{hostname="primary.domain"} != 1`.

If such an alert object is not assigned to a certain host (but to some other host), the configuration is still written and evaluated. The alert then fires because a system which doesn't have e.g. cups installed writes `univention_cups_running = 0` into Prometheus.

We should find a mechanism to bind the alerts to certain hosts.

1. Extend the expression in the listener by adding "{instance=…}" checks. This requires parsing and understanding the expressions.
Advantage:
* would be correct from the Prometheus point of view
Disadvantage:
* a very long expression could lead to performance problems? (e.g. with 100 hosts assigned)
* in Prometheus this is usually done via substring matching
* writing/using a parser/lexer is very complex and error-prone

2. Stop the scripts from writing the metric data to Prometheus (this is already possible via UCR for each check). For this we have to define a mapping of alert <> script.
* could be an attribute of the alert object
* could be some local file shipped by the script

Additionally we should add LDAP ACLs for DC/Memberservers which allow them to adjust/create the alert objects. This makes update handling easier, so that joinscript versions don't need to be increased during errata updates.

Additionally we have to remove the check whether the assigned hostname of an alert is the current hostdn of the server. Otherwise Prometheus must be installed on the host the alert is assigned to.
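To illustrate the problem described in this report, here is a minimal toy model in Python. It is not Prometheus and not the listener code; the `samples` dictionary, the host name `member.domain` and the function `firing_hosts` are made up for illustration. It only shows why an unscoped `univention_cups_running != 1` fires for a host the alert was never assigned to, and how restricting the evaluation to the assigned hosts avoids that.

```python
# Toy samples: one value per host for the metric univention_cups_running.
# "member.domain" is a hypothetical host without cups, so it reports 0.
samples = {"primary.domain": 1, "member.domain": 0}


def firing_hosts(samples, assigned_hosts=None):
    """Return the hosts for which `univention_cups_running != 1` holds.

    If assigned_hosts is None, every host is evaluated (the behaviour
    described in this bug). Otherwise only the assigned hosts are checked.
    """
    return sorted(
        host
        for host, value in samples.items()
        if value != 1 and (assigned_hosts is None or host in assigned_hosts)
    )


# Unscoped: the alert fires for a host it was never assigned to.
unscoped = firing_hosts(samples)
# Scoped to the assigned host only: nothing fires.
scoped = firing_hosts(samples, assigned_hosts={"primary.domain"})
```

With these samples, the unscoped evaluation reports `member.domain`, while the scoped one reports no host.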
All query expressions have been adjusted to include a "%instance%" marker, which the listener replaces with a regex matching all assigned hosts. Furthermore, a new property "templateValues" has been added, which allows placeholders such as "%max%" in the query expression, description and summary; each placeholder is replaced with the corresponding value.

An LDAP ACL is registered which allows DCs/Memberservers to adjust alerts, so that further errata upgrades do not necessarily need to increase the joinscript version.

The descriptions of most alerts have been improved as well.

univention-monitoring-client.yaml
57e5aa0087ba | YAML Bug #54947, Bug #54985

univention-monitoring-client (1.0.0-5)
e59136aaf04b | Bug #54947: add upgrade code
48050286d313 | Bug #54947: fix authentication when reloading metrics
42cb0f3c8ec8 | Bug #54947: enhance alert descriptions
ecfe4bb48a84 | Bug #54947: add temlating of query expression
335f1becee08 | Bug #54947: register ACL which allows DC hosts to change/add alerts
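For readers who want a feel for the "%instance%" and "templateValues" substitution mentioned in this comment, the following is a rough Python sketch. It is not the actual listener code: the function names `hosts_to_regex` and `render_expression`, the dictionary form of the template values, the `instance=~"..."` matcher and the metric name in the usage example are all assumptions made for illustration.

```python
import re


def hosts_to_regex(hosts):
    """Build a regex alternation that matches exactly the given host names.

    Regex metacharacters such as "." are escaped, and the backslashes are
    doubled so the result can sit inside a double-quoted PromQL string.
    """
    return "|".join(re.escape(host).replace("\\", "\\\\") for host in hosts)


def render_expression(template, hosts, template_values):
    """Replace %instance% and %<key>% placeholders in a query template.

    Unknown placeholders are left untouched (an assumption of this sketch).
    """
    values = {key: str(value) for key, value in template_values.items()}
    values["instance"] = hosts_to_regex(hosts)

    def substitute(match):
        # A function replacement keeps backslashes in the value literal.
        return values.get(match.group(1), match.group(0))

    return re.sub(r"%(\w+)%", substitute, template)


# Hypothetical template and threshold, only to show the call.
rendered = render_expression(
    'univention_example_metric{instance=~"%instance%"} > %max%',
    ["primary.domain", "backup.domain"],
    {"max": 90},
)
```

The sketch handles all placeholders in a single pass, so a substituted value that happens to contain a `%name%` sequence is not expanded again.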
OK: Host assignments are considered correctly
OK: DCs/Memberservers can adjust alerts
OK: Defined thresholds are used in prometheus query expressions correctly
OK: reloading prometheus config after config change works as expected
<https://errata.software-univention.de/#/?erratum=5.0x353>