Univention Bugzilla – Bug 54108
univention-bind-ldap.serice fails to startup - zone transfer too big - killed by timeout
Last modified: 2022-10-13 12:00:49 CEST
In a large school environment starting up BIND9-LDAP takes up 6+ minutes: > Nov 19 14:45:44 dc-backup02 systemd[1]: Starting BIND Domain Name Server with LDAP backend... ... > Nov 19 14:45:44 dc-backup02 named[28783]: command channel listening on 127.0.0.1#55555 > Nov 19 14:45:44 dc-backup02 named[28783]: managed-keys-zone: loaded serial 0 ... > Nov 19 14:51:52 dc-backup02 named[28783]: all zones loaded > Nov 19 14:51:52 dc-backup02 named[28783]: running Before our emergency work-around `named` was killed by `systemd` by default after 3 minutes: > Nov 19 14:25:19 dc-backup03 systemd[1]: Starting BIND Domain Name Server with LDAP backend... > Nov 19 14:25:19 dc-backup03 named[5876]: starting BIND 9.10.3-P4-Univention <id:ebd72b3> -c /etc/bind/named.conf -p 7777 -u bind -f -d 0 ... > Nov 19 14:25:20 dc-backup03 named[5876]: command channel listening on 127.0.0.1#55555 > Nov 19 14:25:20 dc-backup03 named[5876]: managed-keys-zone: loaded serial 0 > Nov 19 14:28:20 dc-backup03 systemd[1]: univention-bind-ldap.service: Start-post operation timed out. Stopping. On top of systemds timeout behavior there also is a problem with our script for `ExecStartPost`: # systemctl cat univention-bind-ldap.service > # /lib/systemd/system/univention-bind-ldap.service ... > ExecStartPost=/usr/lib/univention-bind/ldap wait-for-startup # cat /usr/lib/univention-bind/ldap ... > wait-for-startup) > timeout="$(ucr get dns/timeout-start)" > /usr/bin/timeout ${timeout:-120} /bin/sh -c \ > "until rndc -p 55555 status | grep --quiet 'server is up and running'; do sleep 1; done" The UCRV `dns/timeout-start` was introduced with Bug #50662 for the *same* customer. Incrementing it to 8 minute was not enough as `systemd` still aborts the start after 3 minutes. In addition `TimeoutStartSec` has do be increased to 8m as well: # /etc/systemd/system/univention-bind-ldap.service.d/override.conf > [Service] > TimeoutStartSec=8m > #ExecStartPost= As an alternative the `ExecStartPost=` can be disabled completely by clearing it.
After enabling debugging > printf 'dn: cn=config\nchangetype: modify\nreplace: olcLogLevel\nolcLogLevel: 256\n\n' | ldapmodify -xH ldapi:/// > dig +time=300 @127.0.0.1 -p 7777 schule.XXX. axfr > printf 'dn: cn=config\nchangetype: modify\nreplace: olcLogLevel\nolcLogLevel: None\n\n' | ldapmodify -xH ldapi:/// we found out that the LDAP search for the ZONE by BIND-LDAP is very slow and takes ~3 minutes: > Nov 19 16:32:38 dc-backup03 slapd[3473]: conn=77587 op=358 SRCH base="zoneName=schule.XXX,cn=dns,dc=schule,dc=XXX" scope=2 deref=0 filter="(&(zoneName=schule.XXX)(relativeDomainName=@))" > Nov 19 16:32:38 dc-backup03 slapd[3473]: conn=77587 op=358 SEARCH RESULT tag=101 err=0 nentries=1 text= > Nov 19 16:32:38 dc-backup03 slapd[3473]: conn=77587 op=359 SRCH base="zoneName=schule.XXX,cn=dns,dc=schule,dc=XXX" scope=2 deref=0 filter="(zoneName=schule.XXX)" ... > Nov 19 16:35:21 dc-backup03 slapd[3473]: conn=77587 op=359 SEARCH RESULT tag=101 err=0 nentries=54056 text= > Nov 19 16:35:21 dc-backup03 slapd[3473]: conn=77587 op=360 SRCH base="zoneName=schule.XXX,cn=dns,dc=schule,dc=XXX" scope=2 deref=0 filter="(&(zoneName=schule.XXX)(relativeDomainName=@))" > Nov 19 16:35:21 dc-backup03 slapd[3473]: conn=77587 op=360 SEARCH RESULT tag=101 err=0 nentries=1 text= The timeout for the PROXY-BIND is much less; the ZONE is never successfully downloaded for caching and querying the PROXY-BIND will return "zone not loaded": > dig +time=300 @127.0.0.1 schule.XXX. soa
# time univention-ldapsearch -b zoneName=schule.XXX,cn=dns,dc=schule,dc=XXX -s sub "(objectClass=*)" 1.1 | grep -c ^dn 54087 real 0m41,353s user 0m0,996s sys 0m1,212s For comparison here's a different school of ⅓ size: # time univention-ldapsearch -b zoneName=schule.YYY.de,cn=dns,dc=schule,dc=YYY,dc=de -s sub "(zoneName=schule.YYY.de)" 1.1 | grep -c ^dn 15957 real 0m14,391s user 0m0,456s sys 0m0,300s
Can we prevent a full retrieval of the zone from "blocking" the startup and move it into a backend process? I assume that regular "background updates" while BIND is already started work even it they take >8 minutes - is that correct?
(In reply to Ingo Steuwer from comment #3) > Can we prevent a full retrieval of the zone from "blocking" the startup and > move it into a backend process? Not that I know off. > I assume that regular "background updates" while BIND is already started > work even it they take >8 minutes - is that correct? Even after allowing LDAP-BIND's startup to take longer than 2 minutes, the PROXY-BIND was still unable to get the zone transferred and refused to answer any question regarding that zone - see my comment #1 > The timeout for the PROXY-BIND is much less; the ZONE is never successfully downloaded for caching and querying the PROXY-BIND will return "zone not loaded": > > dig +time=300 @127.0.0.1 schule.XXX. soa Maybe helpdesk can fetch the current status from all DNS servers, e.g. ############################################### d="$(ucr get domainname)" udm dns/forward_zone list --filter zone="$d" | sed -ne 's/^ nameserver: //p' | while read -r s do echo "$s" for p in 7777=55555 53=953 # domain=rndc do dig +timeout=300 +short @"$s" -p "${p%=*}" "$d." soa rndc -p "${p#*=}" zonestatus "$d" done echo done ############################################### One more observation: After LDAP-BIND loaded the zone it sent out the notifications to the PROXY-BIND slaves, which then all dutifully re-fetched the zones. We noticed many notifications and it remains unclear, if the PROXY-BIND only sees the notify from "his" LDAP-BIND or also receives the notifies from all the other LDAP-BINDs running on other servers as well. One closing note: For LDAP support we're using patches/bind9/5.0-0-0-ucs/1:9.11.5.P4+dfsg-5.1+deb10u5/0004-Bug-41714-Add-LDAP-support.quilt which is unmaintained since 2007. The alternative `bind9-dyndb-ldap` from <https://pagure.io/bind-dyndb-ldap> is available as a Debian package, but seems to use a different LDAP Schema <https://pagure.io/bind-dyndb-ldap/blob/master/f/doc/schema.ldif>
The slowness seems to be caused by LDAP ACLs as both > time ldapsearch -x -D "cn=admin,$(ucr get ldap/base)" -y /etc/ldap.secret … > 55124 > real 0m0,705s and > time ldapsearch -QY EXTERNAL -H ldapi:/// … > 55124 > real 0m0,539s are fast; the first ignores all ACLs and the second is allow-listed very early.
(In reply to Philipp Hahn from comment #5) ############################################### d="$(ucr get domainname)" for p in 55555 953 do rndc -s 127.0.0.1 -p "$p" zonestatus "$d" done declare -a dns=($(udm dns/forward_zone list --filter zone="$d" | sed -ne 's/^ nameserver: //p')) for p in 7777 53 do echo "$p" for s in "${dns[@]}" do ip=$(dig +short @"${dns[0]:-0}" "$s") set -- $(dig +timeout=3 +tries=1 +short @"$ip" -p "$p" "$d" soa "${dns[@]}" a) || set -- soa="$3" shift 7 printf "%s\t%s\t%s\n" "$ip" "$soa" "$*" done echo done ###############################################
I've made configurable the systemd unit univention-bind-ldap.service with UCRV dns/timeout-start dbd4a13314 843b7909c0 dbf3a1a076 univention-bind: 14.0.1-4A~5.0.0.202112300807
I've applied the suggestions to format UCR template with PEP8. dbd4a13314 843b7909c0 dbf3a1a076 univention-bind: 14.0.1-5A~5.0.0.202201040910
Created attachment 10894 [details] explain-daemon-reload.diff Verified: * Code review * Package update * Functional test * Advisory I'm reopening to suggest documenting the need for a "systemctl daemon-reload", see attached patch for a proposal.
I've added the info After changing the variable, a "systemctl daemon-reload" is required. To the advisory file in commit 1e7c63e2d9.
Please re-read the patch I attached to Comment 10. I've proposed to add the advice also to the file services/univention-bind/debian/univention-bind.univention-config-registry-variables Since it's just a small change in the text, you may fix it directly on the 5.0-1 release branch, but you'll still have to add a new debian/changelog entry (or simply increase the patchlevel version three) and reimport+build the package and the update the binary package version in the advisory.
I've applied the suggestions to format UCR template with PEP8. dbd4a13314 843b7909c0 dbf3a1a076 5a7f9dea97 daa91a2d4f 4ad49d954a 788868c8c0 6971866ad0 1e7c63e2d9 9bc5bf1297 e68193dfae univention-bind: 14.0.1-6A~5.0.0.202201121445
Verified: * All of Comment 10 * UCR variable description * Advisory
<https://errata.software-univention.de/#/?erratum=5.0x190>