Bug 54108 - univention-bind-ldap.serice fails to startup - zone transfer too big - killed by timeout
univention-bind-ldap.serice fails to startup - zone transfer too big - killed...
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: DNS
UCS 4.4
Other Linux
: P5 normal (vote)
: UCS 5.0-1-errata
Assigned To: Iván.Delgado
Arvid Requate
https://git.knut.univention.de/univen...
:
Depends on: 50662 54140
Blocks:
  Show dependency treegraph
 
Reported: 2021-11-19 15:09 CET by Philipp Hahn
Modified: 2022-10-13 12:00 CEST (History)
5 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 2: Will only affect a few installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.286
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2021111821000635, 2019121221000845
Bug group (optional): Large environments
Max CVSS v3 score:


Attachments
explain-daemon-reload.diff (1.78 KB, patch)
2022-01-05 18:21 CET, Arvid Requate
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Philipp Hahn univentionstaff 2021-11-19 15:09:55 CET
In a large school environment starting up BIND9-LDAP takes up 6+ minutes:

> Nov 19 14:45:44 dc-backup02 systemd[1]: Starting BIND Domain Name Server with LDAP backend...
...
> Nov 19 14:45:44 dc-backup02 named[28783]: command channel listening on 127.0.0.1#55555
> Nov 19 14:45:44 dc-backup02 named[28783]: managed-keys-zone: loaded serial 0
...
> Nov 19 14:51:52 dc-backup02 named[28783]: all zones loaded
> Nov 19 14:51:52 dc-backup02 named[28783]: running


Before our emergency work-around `named` was killed by `systemd` by default after 3 minutes:

> Nov 19 14:25:19 dc-backup03 systemd[1]: Starting BIND Domain Name Server with LDAP backend...
> Nov 19 14:25:19 dc-backup03 named[5876]: starting BIND 9.10.3-P4-Univention <id:ebd72b3> -c /etc/bind/named.conf -p 7777 -u bind -f -d 0
...
> Nov 19 14:25:20 dc-backup03 named[5876]: command channel listening on 127.0.0.1#55555
> Nov 19 14:25:20 dc-backup03 named[5876]: managed-keys-zone: loaded serial 0
> Nov 19 14:28:20 dc-backup03 systemd[1]: univention-bind-ldap.service: Start-post operation timed out. Stopping.

On top of systemds timeout behavior there also is a problem with our script for `ExecStartPost`:

# systemctl cat univention-bind-ldap.service
> # /lib/systemd/system/univention-bind-ldap.service
...
> ExecStartPost=/usr/lib/univention-bind/ldap wait-for-startup

# cat /usr/lib/univention-bind/ldap
...
>         wait-for-startup)
>                 timeout="$(ucr get dns/timeout-start)"
>                 /usr/bin/timeout ${timeout:-120} /bin/sh -c \
>                         "until rndc -p 55555 status | grep --quiet 'server is up and running'; do sleep 1; done"

The UCRV `dns/timeout-start` was introduced with Bug #50662 for the *same* customer.
Incrementing it to 8 minute was not enough as `systemd` still aborts the start after 3 minutes.
In addition `TimeoutStartSec` has do be increased to 8m as well:

# /etc/systemd/system/univention-bind-ldap.service.d/override.conf
> [Service]
> TimeoutStartSec=8m
> #ExecStartPost=

As an alternative the `ExecStartPost=` can be disabled completely by clearing it.
Comment 1 Philipp Hahn univentionstaff 2021-11-19 16:54:09 CET
After enabling debugging 
> printf 'dn: cn=config\nchangetype: modify\nreplace: olcLogLevel\nolcLogLevel: 256\n\n' | ldapmodify -xH ldapi:///
> dig +time=300 @127.0.0.1 -p 7777 schule.XXX. axfr
> printf 'dn: cn=config\nchangetype: modify\nreplace: olcLogLevel\nolcLogLevel: None\n\n' | ldapmodify -xH ldapi:///

we found out that the LDAP search for the ZONE by BIND-LDAP is very slow and takes ~3 minutes:

> Nov 19 16:32:38 dc-backup03 slapd[3473]: conn=77587 op=358 SRCH base="zoneName=schule.XXX,cn=dns,dc=schule,dc=XXX" scope=2 deref=0 filter="(&(zoneName=schule.XXX)(relativeDomainName=@))"
> Nov 19 16:32:38 dc-backup03 slapd[3473]: conn=77587 op=358 SEARCH RESULT tag=101 err=0 nentries=1 text=
> Nov 19 16:32:38 dc-backup03 slapd[3473]: conn=77587 op=359 SRCH base="zoneName=schule.XXX,cn=dns,dc=schule,dc=XXX" scope=2 deref=0 filter="(zoneName=schule.XXX)"
...
> Nov 19 16:35:21 dc-backup03 slapd[3473]: conn=77587 op=359 SEARCH RESULT tag=101 err=0 nentries=54056 text=
> Nov 19 16:35:21 dc-backup03 slapd[3473]: conn=77587 op=360 SRCH base="zoneName=schule.XXX,cn=dns,dc=schule,dc=XXX" scope=2 deref=0 filter="(&(zoneName=schule.XXX)(relativeDomainName=@))"
> Nov 19 16:35:21 dc-backup03 slapd[3473]: conn=77587 op=360 SEARCH RESULT tag=101 err=0 nentries=1 text=

The timeout for the PROXY-BIND is much less; the ZONE is never successfully downloaded for caching and querying the PROXY-BIND will return "zone not loaded":

> dig +time=300 @127.0.0.1 schule.XXX. soa
Comment 2 Philipp Hahn univentionstaff 2021-11-19 17:47:25 CET
# time univention-ldapsearch -b zoneName=schule.XXX,cn=dns,dc=schule,dc=XXX -s sub "(objectClass=*)" 1.1 | grep -c ^dn
54087
real    0m41,353s
user    0m0,996s
sys     0m1,212s

For comparison here's a different school of ⅓ size:

# time univention-ldapsearch -b zoneName=schule.YYY.de,cn=dns,dc=schule,dc=YYY,dc=de -s sub "(zoneName=schule.YYY.de)" 1.1 | grep -c ^dn
15957
real	0m14,391s
user	0m0,456s
sys	0m0,300s
Comment 3 Ingo Steuwer univentionstaff 2021-11-22 15:18:05 CET
Can we prevent a full retrieval of the zone from "blocking" the startup and move it into a backend process?

I assume that regular "background updates" while BIND is already started work even it they take >8 minutes - is that correct?
Comment 5 Philipp Hahn univentionstaff 2021-11-23 15:08:55 CET
(In reply to Ingo Steuwer from comment #3)
> Can we prevent a full retrieval of the zone from "blocking" the startup and
> move it into a backend process?

Not that I know off.

> I assume that regular "background updates" while BIND is already started
> work even it they take >8 minutes - is that correct?

Even after allowing LDAP-BIND's startup to take longer than 2 minutes, the PROXY-BIND was still unable to get the zone transferred and refused to answer any question regarding that zone - see my comment #1
> The timeout for the PROXY-BIND is much less; the ZONE is never successfully downloaded for caching and querying the PROXY-BIND will return "zone not loaded":
> > dig +time=300 @127.0.0.1 schule.XXX. soa

Maybe helpdesk can fetch the current status from all DNS servers, e.g.
###############################################
d="$(ucr get domainname)"
udm dns/forward_zone list --filter zone="$d" |
sed -ne 's/^  nameserver: //p' |
while read -r s
do
  echo "$s"
  for p in 7777=55555 53=953  # domain=rndc
  do
    dig +timeout=300 +short @"$s" -p "${p%=*}" "$d." soa
    rndc -p "${p#*=}" zonestatus "$d"
  done
  echo
done
###############################################  

One more observation:
After LDAP-BIND loaded the zone it sent out the notifications to the PROXY-BIND slaves, which then all dutifully re-fetched the zones. We noticed many notifications and it remains unclear, if the PROXY-BIND only sees the notify from "his" LDAP-BIND or also receives the notifies from all the other LDAP-BINDs running on other servers as well.

One closing note:
For LDAP support we're using patches/bind9/5.0-0-0-ucs/1:9.11.5.P4+dfsg-5.1+deb10u5/0004-Bug-41714-Add-LDAP-support.quilt which is unmaintained since 2007. The alternative `bind9-dyndb-ldap` from <https://pagure.io/bind-dyndb-ldap> is available as a Debian package, but seems to use a different LDAP Schema <https://pagure.io/bind-dyndb-ldap/blob/master/f/doc/schema.ldif>
Comment 6 Philipp Hahn univentionstaff 2021-11-23 20:06:04 CET
The slowness seems to be caused by LDAP ACLs as both

> time ldapsearch -x -D "cn=admin,$(ucr get ldap/base)" -y /etc/ldap.secret …
> 55124
> real    0m0,705s 

and

> time ldapsearch -QY EXTERNAL -H ldapi:/// …
> 55124
> real    0m0,539s 

are fast; the first ignores all ACLs and the second is allow-listed very early.
Comment 7 Philipp Hahn univentionstaff 2021-11-23 20:07:35 CET
(In reply to Philipp Hahn from comment #5)
###############################################
d="$(ucr get domainname)"

for p in 55555 953
do
  rndc -s 127.0.0.1 -p "$p" zonestatus "$d"
done

declare -a dns=($(udm dns/forward_zone list --filter zone="$d" | sed -ne 's/^  nameserver: //p'))
for p in 7777 53
do
  echo "$p"
  for s in "${dns[@]}"
  do
    ip=$(dig +short @"${dns[0]:-0}" "$s")
    set -- $(dig +timeout=3 +tries=1 +short @"$ip" -p "$p" "$d" soa "${dns[@]}" a) || set --
    soa="$3"
    shift 7
    printf "%s\t%s\t%s\n" "$ip" "$soa" "$*"
  done
  echo
done
###############################################
Comment 8 Iván.Delgado univentionstaff 2021-12-30 08:14:08 CET
I've made configurable the systemd unit univention-bind-ldap.service with UCRV dns/timeout-start

dbd4a13314
843b7909c0
dbf3a1a076

univention-bind: 14.0.1-4A~5.0.0.202112300807
Comment 9 Iván.Delgado univentionstaff 2022-01-04 09:13:18 CET
I've applied the suggestions to format UCR template with PEP8.

dbd4a13314
843b7909c0
dbf3a1a076

univention-bind: 14.0.1-5A~5.0.0.202201040910
Comment 10 Arvid Requate univentionstaff 2022-01-05 18:21:53 CET
Created attachment 10894 [details]
explain-daemon-reload.diff

Verified:
* Code review
* Package update
* Functional test
* Advisory

I'm reopening to suggest documenting the need for a "systemctl daemon-reload", see attached patch for a proposal.
Comment 11 Iván.Delgado univentionstaff 2022-01-06 09:35:50 CET
I've added the info 
After changing the variable, a "systemctl daemon-reload" is required.
To the advisory file in commit 1e7c63e2d9.
Comment 12 Arvid Requate univentionstaff 2022-01-11 19:50:19 CET
Please re-read the patch I attached to Comment 10.
I've proposed to add the advice also to the file 
services/univention-bind/debian/univention-bind.univention-config-registry-variables

Since it's just a small change in the text, you may fix it directly on the 5.0-1
release branch, but you'll still have to add a new debian/changelog entry
(or simply increase the patchlevel version three) and reimport+build the package
and the update the binary package version in the advisory.
Comment 13 Iván.Delgado univentionstaff 2022-01-12 14:48:28 CET
I've applied the suggestions to format UCR template with PEP8.

dbd4a13314
843b7909c0
dbf3a1a076
5a7f9dea97
daa91a2d4f
4ad49d954a
788868c8c0
6971866ad0
1e7c63e2d9
9bc5bf1297
e68193dfae

univention-bind: 14.0.1-6A~5.0.0.202201121445
Comment 14 Arvid Requate univentionstaff 2022-01-12 19:36:34 CET
Verified:
* All of Comment 10
* UCR variable description
* Advisory