Bug 50665 - DCs should periodically report their replication state to LDAP
Status: NEW
Product: UCS
Classification: Unclassified
Component: LDAP
Version: UCS 4.4
Hardware/OS: Other Linux
Importance: P5 normal
Assigned To: UCS maintainers
Depends on:
Blocks:
Reported: 2019-12-20 09:26 CET by Arvid Requate
Modified: 2021-09-22 17:57 CEST
CC List: 5 users

See Also:
What kind of report is it?: Development Internal
What type of bug is this?: ---
Who will be affected by this bug?: ---
How will those affected feel about the bug?: ---
User Pain:
Enterprise Customer affected?: Yes
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2019121921000565, 2021092221000186
Bug group (optional):
Max CVSS v3 score:


Description Arvid Requate univentionstaff 2019-12-20 09:26:43 CET
If an Administrator wants to shorten the transaction backlog, e.g. because the cn=translog DB has grown very large again, s/he first needs to find out the replication state of all DCs in the domain. For this purpose it could be helpful to be able to look up this information directly from the Primary LDAP ("Master").

To make this possible, all DCs could periodically report their replication state e.g. as an attribute of their machine account in LDAP, or, as an optimization to avoid unnecessary replication traffic, to a special branch in LDAP that is only replicated to DC Backups.

If this information were available, the univention-translog tool could offer an option to perform an automated, optimal prune. Thinking further, this automated prune could be signalled to the DC Backups too, which then also need to prune. Finally, this could be run periodically for a clean, hands-off operation.
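
To illustrate the idea: if such an attribute existed, an Administrator (or the univention-translog tool) could read the reported state of all DCs directly from the Primary. A minimal sketch using python-ldap, where the attribute name `univentionLastNotifierID`, the bind DN and the base DN are purely hypothetical examples:

  import ldap

  # Connect to the Primary ("Master"); DN, password and base DN are examples.
  conn = ldap.initialize("ldap://master.example.com:7389")
  conn.simple_bind_s("uid=Administrator,cn=users,dc=example,dc=com", "secret")

  # List every DC machine account that has reported its replication state.
  # `univentionLastNotifierID` is a hypothetical attribute, not part of the
  # current UCS schema.
  for dn, attrs in conn.search_s(
          "cn=computers,dc=example,dc=com",
          ldap.SCOPE_SUBTREE,
          "(&(objectClass=univentionDomainController)(univentionLastNotifierID=*))",
          ["cn", "univentionLastNotifierID"]):
      print(dn, attrs["univentionLastNotifierID"][0].decode())

  conn.unbind_s()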
Comment 1 Dirk Schnick univentionstaff 2021-09-22 14:39:28 CEST
Needed to increase the MDB size today and ran into bug 53821. If the translog were (automatically) reduced, the blood pressure could stay normal.
Comment 2 Philipp Hahn univentionstaff 2021-09-22 15:17:55 CEST
Simple solution
- add a new attribute for storing `notifier_id` with the host record to our LDAP schema for all UCS roles (and other systems which have a UDL)
- add code to /usr/lib/univention-server/server_password_change to also report /var/lib/univention-directory-listener/notifier_id when changing the server password every 3 weeks

As an alternative we can set up a cron job to do the reporting, but this might lead to undesired performance issues: when a server updates its `notifier_id`, this creates a new transaction which then must be processed by *all* other systems, as UDL is installed on all UCS roles.
By combining it with the regular server password change we can piggyback this update on a transaction which is already happening on a regular basis.
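
A minimal sketch of such a reporting step, assuming the hypothetical attribute `univentionLastNotifierID` mentioned in the description above and assuming the LDAP ACLs allow the machine account to write it on its own entry:

  import ldap
  from univention.config_registry import ConfigRegistry

  ucr = ConfigRegistry()
  ucr.load()

  # Last transaction ID processed by the local UDL.
  with open("/var/lib/univention-directory-listener/notifier_id") as fd:
      notifier_id = fd.read().strip()

  # Bind with the machine account (assumes the host may modify the
  # hypothetical attribute on its own entry).
  with open("/etc/machine.secret") as fd:
      secret = fd.read().strip()
  uri = "ldap://%s:%s" % (ucr["ldap/master"], ucr.get("ldap/master/port", "7389"))
  conn = ldap.initialize(uri)
  conn.simple_bind_s(ucr["ldap/hostdn"], secret)

  # Running this right after the password change piggybacks on a point in
  # time where a transaction happens anyway.
  conn.modify_s(ucr["ldap/hostdn"], [
      (ldap.MOD_REPLACE, "univentionLastNotifierID", notifier_id.encode()),
  ])
  conn.unbind_s()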

The UCS dashboard based on Prometheus/Grafana already reports the `notifier_id`, which is exported to `/var/lib/prometheus/node-exporter/univention-server-metrics.prom` and then collected by Prometheus.
This is done by https://git.knut.univention.de/univention/components/dashboard/prometheus-node-exporter/-/tree/master/univention-node-exporter every 20 minutes.
But this would create a dependency on an optional component, which is currently not even available for UCS-5.
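
For reference, the textfile-collector mechanism used there boils down to writing the metric into that file in the Prometheus text exposition format; a simplified sketch (the actual exporter code and metric metadata may differ):

  # Simplified illustration only; the real univention-node-exporter may
  # write additional metrics and metadata into this file.
  with open("/var/lib/univention-directory-listener/notifier_id") as fd:
      notifier_id = int(fd.read().strip())

  with open("/var/lib/prometheus/node-exporter/univention-server-metrics.prom", "w") as fd:
      fd.write("# TYPE ucs_notifier_id gauge\n")
      fd.write("ucs_notifier_id %d\n" % notifier_id)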
Comment 3 Daniel Tröder univentionstaff 2021-09-22 15:29:55 CEST
Such data in the primary node would be very welcome to monitor the state of the replication in the whole domain. It'd be simple to collect data / draw a graph and find bottlenecks. I know customers that would be happy.

But to be useful the data would have to be updated at least every 5min.
As suggested, if the subtree or the attribute is excluded from replication, that shouldn't create a performance problem.
I don't see why backups should have that data, as it is ephemeral and will automatically be recreated.
Comment 4 Florian Best univentionstaff 2021-09-22 17:45:15 CEST
(In reply to Philipp Hahn from comment #2)
> Simple solution
> - add a new attribute for storing `notifier_id` with the host record to our
> LDAP schema for all UCS roles (and other systems, which have UDL)
Could you also imagine - instead of extending the standard DNS schema for host records - to mis-use a SRV record "_ucs_replication_$hostname._tcp." or create a settings/data object for this?
Comment 5 Philipp Hahn univentionstaff 2021-09-22 17:56:28 CEST
(In reply to Daniel Tröder from comment #3)
> Such data in the primary node would be very welcome to monitor the state of
> the replication in the whole domain. It'd be simple to collect data / draw a
> graph and find bottlenecks. I know customers that would be happy.

Actually this is already part of the "UCS Dashboard" powered by "Prometheus / Grafana" and is available as `ucs_notifier_id`. Its visualization, on the other hand, is horrible, as it is simply displayed as a graph with no information about the delta or delay in replication.

> But to be useful the data would have to be updated at least every 5min.

univention-node-exporter by default uses 20 minutes, but this can be changed via UCR:
> cron/univention-metrics-server/time='*/20 * * * *'

> As suggested, if the subtree or the attribute is excluded from replication,
> that shouldn't create a performance problem.
> I don't see why backups should have that data, as it is ephemeral and will
> automatically be recreated.

For automatic pruning of old transactions you must have information about ALL UDLs: if you prune too many old transactions, some systems may no longer be able to catch up; in that case the server must be re-joined.
So you need
1. a list of all servers - available from LDAP
2. the last TID their UDL processed - either LDAP or somewhere else

What should happen when you know about a server from 1 but do not have any information about its last TID from 2?
Either you "don't do any automatic pruning then" or say "screw them, delete old transactions anyway and force them to be re-joined".
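
A sketch of that decision, assuming the reported last TIDs have already been collected into a mapping from server name to last processed TID (None where no report is available):

  from typing import Dict, Optional

  def prune_cutoff(last_tids: Dict[str, Optional[int]], force: bool = False) -> Optional[int]:
      """Return the highest TID that may safely be pruned, or None to skip pruning."""
      missing = [name for name, tid in last_tids.items() if tid is None]
      if missing and not force:
          # Option 1: "don't do any automatic pruning then".
          return None
      # Option 2 ("screw them"): ignore servers without a known TID; they
      # must be re-joined if they ever come back.
      known = [tid for tid in last_tids.values() if tid is not None]
      if not known:
          return None
      # Everything up to min(TID of all servers) has been processed everywhere.
      return min(known)

E.g. prune_cutoff({"backup1": 4200, "replica1": 4100, "replica2": None}) returns None, while the same call with force=True returns 4100 and implies that replica2 must be re-joined when it returns.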

My expectation: the UI gives me a list of servers which did not report back recently - because they are powered off, suspended, disconnected, stolen, destroyed, ... - with a timestamp and their last TID - so that I as Administrator can either
- de-register the computer object to permanently remove it
- skip for now and get a reminder (for example) in one week
- screw it, which gets the computer removed from calculating the `min(TID of all servers)`, so transactions can be pruned, but this computer must be re-joined when it returns.

Please keep "backup2master" in mind: after that it is probably okay if pruning old transactions does not work immediately, but eventually it should. There it might be helpful if the old TID is persistent and available to all Backups, for example.
Comment 6 Philipp Hahn univentionstaff 2021-09-22 17:57:14 CEST
(In reply to Florian Best from comment #4)
> (In reply to Philipp Hahn from comment #2)
> > Simple solution
> > - add a new attribute for storing `notifier_id` with the host record to our
> > LDAP schema for all UCS roles (and other systems, which have UDL)
> Could you also imagine - instead of extending the standard DNS schema for
> host records - to mis-use a SRV record "_ucs_replication_$hostname._tcp." or
> create a settings/data object for this?

DNS is stored in LDAP as well, and updating DNS would result in an LDAP transaction, which needs to be replicated.