Bug 54947 - Alert fires for expressions not assigned to the hostname
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Monitoring (Prometheus or Nagios)
Version: UCS 5.0
Hardware: Other Linux
Importance: P5 normal
Target Milestone: UCS 5.0-2-errata
Assigned To: Florian Best
QA Contact: Siavash Sefid Rodi
Reported: 2022-07-06 15:19 CEST by Florian Best
Modified: 2022-07-20 18:20 CEST (History)
Description Florian Best univentionstaff 2022-07-06 15:19:52 CEST
Every host executes every monitoring plugin check script and therefore writes metric data to Prometheus via the Prometheus node exporter.

The alert expressions (part of the alert objects in LDAP) aren't bound to specific hosts.
E.g. an expression is `univention_cups_running != 1` and not `univention_cups_running{hostname="primary.domain"} != 1`.

If such an alert object is not assigned to a given host (but to some other host), the configuration is still written and evaluated.
The alert then fires because a system which doesn't have e.g. CUPS installed still writes `univention_cups_running = 0` into Prometheus.

We should find a mechanism to bind the alerts to certain hosts.
1. Extend the expression in the listener by adding "{instance=…}" checks. This requires parsing and understanding the expressions.
Advantage:
* would be correct from the Prometheus point of view
Disadvantage:
* a very long expression could lead to performance problems? (e.g. with 100 hosts assigned)
* in Prometheus this is usually done via substring matching
* writing/using a parser/lexer is very complex and error-prone
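
To illustrate option 1 and why it needs a real parser, here is a minimal Python sketch. It only handles the trivial form `metric OP number` and rejects everything else. The function names, the `instance` label name and the assumption that label values equal the plain hostnames are hypothetical and not taken from the actual listener.

```python
import re

# Only the simplest expression shape is understood: <metric> <op> <number>.
SIMPLE_EXPR = re.compile(
    r"\s*([A-Za-z_:][A-Za-z0-9_:]*)\s*(==|!=|>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*"
)


def instance_matcher(hosts):
    """Build a PromQL regex matcher for the given hosts (hypothetical helper)."""
    # Backslashes are doubled because PromQL double-quoted strings use
    # backslash escapes themselves.
    alternatives = "|".join(
        re.escape(host).replace("\\", "\\\\") for host in sorted(hosts)
    )
    return '{instance=~"%s"}' % alternatives


def bind_simple_expression(expr, hosts):
    """Attach the host matcher to a trivial expression, refuse anything else."""
    match = SIMPLE_EXPR.fullmatch(expr)
    if match is None:
        # Functions, existing selectors, boolean operators etc. would need
        # a real PromQL parser.
        raise ValueError("expression too complex for naive rewriting: %r" % expr)
    metric, op, value = match.groups()
    return "%s%s %s %s" % (metric, instance_matcher(hosts), op, value)


print(bind_simple_expression("univention_cups_running != 1", ["primary.domain"]))
```

Anything beyond this shape, such as an expression that already carries a label selector, raises `ValueError`, which is exactly the parsing problem listed as a disadvantage above.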

2. Stop the scripts from writing the metric data to Prometheus (this is already possible via UCR for each check).

For this we have to define a mapping of alert <> script.
* could be an attribute of the alert object
* could be some local file shipped by the script
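
A rough Python sketch of how such an alert <> script mapping could be used to decide which checks a host should run. The data layout (a list of dicts with `script` and `assigned_hosts` keys), the script names and the function name are assumptions made for illustration only.

```python
def checks_to_enable(hostname, alerts):
    """Return the check scripts that have at least one alert assigned to hostname.

    `alerts` is a hypothetical in-memory view of the alert objects, e.g. as
    read from LDAP or from a local mapping file.
    """
    return sorted(
        {alert["script"] for alert in alerts if hostname in alert["assigned_hosts"]}
    )


# Hypothetical example data.
alerts = [
    {"name": "CUPS", "script": "check_cups", "assigned_hosts": ["print.domain"]},
    {"name": "DISK", "script": "check_disk", "assigned_hosts": ["print.domain", "primary.domain"]},
]

print(checks_to_enable("primary.domain", alerts))
```

Every script not returned for a host could then be disabled there, so that no metric data is written for alerts the host is not assigned to.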

Additionally we should add LDAP ACLs for DCs/Memberservers which allow them to adjust/create the alert objects. This makes update handling easier, because joinscript versions don't need to be increased during errata updates.

Additionally we have to remove the check whether the assigned hostname of an alert is the current hostdn of the server. Otherwise Prometheus must be installed on every host an alert is assigned to.
Comment 1 Florian Best univentionstaff 2022-07-18 15:57:24 CEST
All query expressions have been adjusted to include a "%instance%" marker, which the listener replaces with a regex matching all assigned hosts.
Furthermore a new property "templateValues" has been added, which allows placeholders such as "%max%" in the query expression, description and summary; each placeholder is replaced with the configured value.
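
The templating described above could look roughly like the following Python sketch. It is not the listener code: the function name, the example metric name, the expansion of "%instance%" to an `instance=~"…"` matcher and the representation of "templateValues" as a dict are all assumptions.

```python
import re


def render(template, hosts, template_values):
    """Expand %instance% and %<name>% placeholders in an alert template."""
    # Regex alternation over all assigned hosts; backslashes are doubled
    # because PromQL double-quoted strings use backslash escapes.
    host_regex = "|".join(
        re.escape(host).replace("\\", "\\\\") for host in sorted(hosts)
    )
    values = dict(template_values)
    values["instance"] = 'instance=~"%s"' % host_regex

    def substitute(match):
        # Unknown placeholders are left untouched.
        return str(values.get(match.group(1), match.group(0)))

    return re.sub(r"%(\w+)%", substitute, template)


# Hypothetical metric name and threshold.
print(render("univention_disk_usage{%instance%} > %max%", ["a.example"], {"max": 90}))
```

The same function can be applied to the description and summary strings, so a threshold such as "%max%" only has to be configured once per alert.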

An LDAP ACL is registered which allows DCs/Memberservers to adjust alerts, so that further errata upgrades do not necessarily need to increase the joinscript version.

The descriptions of most alerts have been improved as well.

univention-monitoring-client.yaml
57e5aa0087ba | YAML Bug #54947, Bug #54985

univention-monitoring-client (1.0.0-5)
e59136aaf04b | Bug #54947: add upgrade code
48050286d313 | Bug #54947: fix authentication when reloading metrics
42cb0f3c8ec8 | Bug #54947: enhance alert descriptions
ecfe4bb48a84 | Bug #54947: add temlating of query expression
335f1becee08 | Bug #54947: register ACL which allows DC hosts to change/add alerts
Comment 2 Siavash Sefid Rodi univentionstaff 2022-07-20 09:28:14 CEST
OK: Host assignments are considered correctly
OK: DCs/Memberservers can adjust alerts
OK: Defined thresholds are used in prometheus query expressions correctly
OK: reloading prometheus config after config change works as expected