Bug 39792 - Joins in Amazon EC2 environments fail from time to time
Joins in Amazon EC2 environments fail from time to time
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: DNS
UCS 4.1
Other Linux
: P5 normal (vote)
: UCS 4.1
Assigned To: Stefan Gohmann
Felix Botner
: interim-3
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2015-11-05 11:15 CET by Stefan Gohmann
Modified: 2015-11-17 12:12 CET (History)
0 users

See Also:
What kind of report is it?: ---
What type of bug is this?: ---
Who will be affected by this bug?: ---
How will those affected feel about the bug?: ---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Stefan Gohmann univentionstaff 2015-11-05 11:15:01 CET
The join of slaves or docker memmber servers fails from time to time with different error messages, for example:

[2015-11-05 03:28:14.115062]Calling app join scripts...
(2015-11-05 03:28:14.839737)ssh: Could not resolve hostname master092.autotest092.local: Name or service not known
(2015-11-05 03:28:14.951566)ssh: Could not resolve hostname master092.autotest092.local: Name or service not known
(2015-11-05 03:28:15.114579)ssh: Could not resolve hostname master092.autotest092.local: Name or service not known
(2015-11-05 03:28:15.243847)Something went wrong while installing the app
[2015-11-05 03:28:15.243962]Search LDAP binddn 
[2015-11-05 03:28:15.244097]
[2015-11-05 03:28:15.244223]**************************************************************************
[2015-11-05 03:28:15.244350]* Running join scripts failed!                                           *
[2015-11-05 03:28:15.244470]**************************************************************************
[2015-11-05 03:28:15.244590]* Message:  binddn for user Administrator not found
[2015-11-05 03:28:15.244709]**************************************************************************
Comment 1 Stefan Gohmann univentionstaff 2015-11-05 11:36:14 CET
The syslog and daemon.log are now copied: r65222
Comment 2 Stefan Gohmann univentionstaff 2015-11-05 17:00:37 CET
It seems to be a problem in the bind listener module. bind9 is restarted too often:

Nov  5 09:37:48 backup092 named[1875]: received control channel command 'reload autotest092.local'
Nov  5 09:37:48 backup092 named[1576]: LDAP sdb zone 'autotest092.local': ldap_result returned -1
Nov  5 09:37:48 backup092 named[1875]: zone autotest092.local/IN: Transfer started.
Nov  5 09:37:48 backup092 named[1875]: transfer of 'autotest092.local/IN' from 127.0.0.1#7777: connected using 127.0.0.1#34412
Nov  5 09:37:48 backup092 named[1576]: client 127.0.0.1#34412: transfer of 'autotest092.local/IN': AXFR-style IXFR started
Nov  5 09:37:48 backup092 named[1875]: zone autotest092.local/IN: transferred serial 25
Nov  5 09:37:48 backup092 named[1875]: transfer of 'autotest092.local/IN' from 127.0.0.1#7777: Transfer completed: 1 messages, 20 records, 918 bytes, 0.012 secs (76500 bytes/sec)
Nov  5 09:37:48 backup092 named[1875]: zone autotest092.local/IN: sending notifies (serial 25)
Nov  5 09:37:48 backup092 named[1576]: received control channel command 'reload autotest092.local'
Nov  5 09:37:48 backup092 named[1576]: zone autotest092.local/IN: _kerberos._udp.autotest092.local/SRV 'dy3a3rzffh-1446734083265519.autotest092.local' has no address records (A or AAAA)
Nov  5 09:37:48 backup092 named[1576]: zone autotest092.local/IN: _kerberos._tcp.autotest092.local/SRV 'dy3a3rzffh-1446734083265519.autotest092.local' has no address records (A or AAAA)
Nov  5 09:37:48 backup092 named[1576]: zone autotest092.local/IN: _ldap._tcp.autotest092.local/SRV 'dy3a3rzffh-1446734083265519.autotest092.local' has no address records (A or AAAA)
Nov  5 09:37:48 backup092 named[1576]: zone autotest092.local/IN: zone serial (25) unchanged. zone may fail to transfer to slaves.
Nov  5 09:37:48 backup092 named[1576]: client 127.0.0.1#34412: transfer of 'autotest092.local/IN': AXFR-style IXFR ended
Nov  5 09:37:48 backup092 named[1576]: zone autotest092.local/IN: sending notifies (serial 25)
Nov  5 09:37:48 backup092 named[1875]: client 127.0.0.1#9602: received notify for zone 'autotest092.local'
Nov  5 09:37:48 backup092 named[1875]: zone autotest092.local/IN: notify from 127.0.0.1#9602: zone is up to date
Nov  5 09:37:48 backup092 named[1576]: shutting down
Nov  5 09:37:48 backup092 named[1576]: stopping command channel on 127.0.0.1#55555
Nov  5 09:37:48 backup092 named[1576]: no longer listening on ::#7777
Nov  5 09:37:48 backup092 named[1576]: no longer listening on 127.0.0.1#7777
Nov  5 09:37:48 backup092 named[1576]: no longer listening on 10.210.53.115#7777
Nov  5 09:37:48 backup092 named[1576]: no longer listening on 172.17.42.1#7777
Nov  5 09:37:48 backup092 named[1576]: exiting
Nov  5 09:37:48 backup092 named[1875]: client 10.210.182.236#32986: received notify for zone 'autotest092.local'
Nov  5 09:37:48 backup092 named[1875]: zone autotest092.local/IN: refused notify from non-master: 10.210.182.236#32986
Nov  5 09:37:54 backup092 named[1875]: shutting down
Nov  5 09:37:54 backup092 named[1875]: stopping command channel on 127.0.0.1#953
Nov  5 09:37:54 backup092 named[1875]: no longer listening on ::#53
Nov  5 09:37:54 backup092 named[1875]: no longer listening on 127.0.0.1#53
Nov  5 09:37:54 backup092 named[1875]: no longer listening on 10.210.53.115#53
Nov  5 09:37:54 backup092 named[1875]: no longer listening on 172.17.42.1#53
Nov  5 09:37:54 backup092 named[1875]: exiting
Nov  5 09:37:54 backup092 named[26896]: starting BIND 9.8.4-rpz2+rl005.12-P1 -c /etc/bind/named.conf -p 7777 -u bind -f -d 0
Nov  5 09:37:54 backup092 named[26896]: built with '--prefix=/usr' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sysconfdir=/etc/bind' '--localstatedir=/var' '--enable-threads' '--enable-largefile' '--with-libtool' '--enable-shared' '--enable-static' '--with-openssl=/usr' '--with-gssapi=/usr' '--with-gnu-ld' '--with-geoip=/usr' '--enable-ipv6' '--with-dlz-dlopen' 'CFLAGS=-fno-strict-aliasing -DDIG_SIGCHASE -O2'
Comment 3 Stefan Gohmann univentionstaff 2015-11-05 21:11:16 CET
From my test machine:

root@backup092:~# ps -ef | grep -i named
bind      5138  2790  0 14:56 ?        00:00:00 /usr/sbin/named -c /etc/bind/named.conf -p 7777 -u bind -f -d 0
bind      5141  2786  0 14:56 ?        00:00:00 /usr/sbin/named -c /etc/bind/named.conf.proxy -u bind -f -d 0
root     24719 23842  0 15:08 pts/0    00:00:00 grep -i named
root@backup092:~# ls -la /var/cache/bind/
insgesamt 24
drwxrwxr-x  2 root bind 4096 Nov  5 15:04 .
drwxr-xr-x 25 root root 4096 Nov  5 14:37 ..
-rw-r--r--  1 bind bind  484 Nov  5 14:47 210.10.in-addr.arpa.zone
-rw-r--r--  1 bind bind 1082 Nov  5 14:47 autotest092.local.zone
-rw-r--r--  1 bind bind  221 Nov  5 14:47 managed-keys.bind
-rw-r--r--  1 bind bind  512 Nov  5 14:47 managed-keys.bind.jnl
root@backup092:~# ls -la /var/cache/univention-bind-proxy/
insgesamt 8
drwxr-xr-x  2 root root 4096 Nov  5 14:47 .
drwxr-xr-x 25 root root 4096 Nov  5 14:37 ..
root@backup092:~# 

From daemon.log:
Nov  5 15:04:22 backup092 named[5138]: received control channel command 'reload'
Nov  5 15:04:22 backup092 named[5138]: loading configuration from '/etc/bind/named.conf'
Nov  5 15:04:22 backup092 named[5138]: reading built-in trusted keys from file '/etc/bind/bind.keys'
Nov  5 15:04:22 backup092 named[5138]: using default UDP/IPv4 port range: [1024, 65535]
Nov  5 15:04:22 backup092 named[5138]: using default UDP/IPv6 port range: [1024, 65535]
Nov  5 15:04:22 backup092 named[5138]: listening on IPv4 interface AExBG0BjpXz3mo2, 192.0.2.1#7777
Nov  5 15:04:22 backup092 named[5138]: sizing zone task pool based on 2 zones
Nov  5 15:04:22 backup092 named[5138]: Warning: 'empty-zones-enable/disable-empty-zone' not set: disabling RFC 1918 empty zones
Nov  5 15:04:22 backup092 named[5138]: reloading configuration succeeded
Nov  5 15:04:22 backup092 named[5138]: LDAP sdb zone '210.10.in-addr.arpa': ldap_result returned -1
Nov  5 15:04:22 backup092 named[5138]: zone 210.10.in-addr.arpa/IN: zone serial (4) unchanged. zone may fail to transfer to slaves.
Nov  5 15:04:22 backup092 named[5138]: zone autotest092.local/IN: zone serial (15) unchanged. zone may fail to transfer to slaves.
Nov  5 15:04:22 backup092 named[5138]: reloading zones succeeded
Nov  5 15:04:22 backup092 named[5138]: zone 210.10.in-addr.arpa/IN: sending notifies (serial 4)
Nov  5 15:04:22 backup092 named[5141]: received control channel command 'reload'
Nov  5 15:04:22 backup092 named[5141]: loading configuration from '/etc/bind/named.conf.proxy'
Nov  5 15:04:22 backup092 named[5141]: reading built-in trusted keys from file '/etc/bind/bind.keys'
Nov  5 15:04:22 backup092 named[5141]: using default UDP/IPv4 port range: [1024, 65535]
Nov  5 15:04:22 backup092 named[5141]: using default UDP/IPv6 port range: [1024, 65535]
Nov  5 15:04:22 backup092 named[5141]: listening on IPv4 interface AExBG0BjpXz3mo2, 192.0.2.1#53
Nov  5 15:04:22 backup092 named[5141]: sizing zone task pool based on 7 zones
Nov  5 15:04:22 backup092 named[5141]: Warning: 'empty-zones-enable/disable-empty-zone' not set: disabling RFC 1918 empty zones
Nov  5 15:04:22 backup092 named[5141]: reloading configuration succeeded
Nov  5 15:04:22 backup092 named[5138]: zone autotest092.local/IN: sending notifies (serial 15)
Nov  5 15:04:22 backup092 named[5141]: reloading zones succeeded
Nov  5 15:04:22 backup092 named[5141]: client 127.0.0.1#39801: received notify for zone '210.10.in-addr.arpa'
Nov  5 15:04:22 backup092 named[5141]: zone 210.10.in-addr.arpa/IN: notify from 127.0.0.1#39801: zone is up to date
Nov  5 15:04:22 backup092 named[5138]: received control channel command 'reload'
Nov  5 15:04:22 backup092 named[5138]: loading configuration from '/etc/bind/named.conf'
Nov  5 15:04:22 backup092 named[5138]: reading built-in trusted keys from file '/etc/bind/bind.keys'
Nov  5 15:04:22 backup092 named[5138]: using default UDP/IPv4 port range: [1024, 65535]
Nov  5 15:04:22 backup092 named[5138]: using default UDP/IPv6 port range: [1024, 65535]
Nov  5 15:04:22 backup092 named[5138]: no longer listening on 192.0.2.1#7777
Nov  5 15:04:22 backup092 named[5138]: sizing zone task pool based on 2 zones
Nov  5 15:04:22 backup092 named[5138]: Warning: 'empty-zones-enable/disable-empty-zone' not set: disabling RFC 1918 empty zones
Nov  5 15:04:22 backup092 named[5138]: reloading configuration succeeded
Nov  5 15:04:22 backup092 named[5138]: zone 210.10.in-addr.arpa/IN: zone serial (4) unchanged. zone may fail to transfer to slaves.
Nov  5 15:04:22 backup092 named[5138]: zone autotest092.local/IN: zone serial (15) unchanged. zone may fail to transfer to slaves.
Nov  5 15:04:22 backup092 named[5138]: reloading zones succeeded
Nov  5 15:04:22 backup092 named[5141]: received control channel command 'reload'
Nov  5 15:04:22 backup092 named[5141]: loading configuration from '/etc/bind/named.conf.proxy'
Nov  5 15:04:22 backup092 named[5141]: reading built-in trusted keys from file '/etc/bind/bind.keys'
Nov  5 15:04:22 backup092 named[5141]: using default UDP/IPv4 port range: [1024, 65535]
Nov  5 15:04:22 backup092 named[5141]: using default UDP/IPv6 port range: [1024, 65535]
Nov  5 15:04:22 backup092 named[5141]: no longer listening on 192.0.2.1#53
Nov  5 15:04:22 backup092 named[5141]: sizing zone task pool based on 7 zones
Nov  5 15:04:22 backup092 named[5141]: Warning: 'empty-zones-enable/disable-empty-zone' not set: disabling RFC 1918 empty zones
Nov  5 15:04:22 backup092 named[5141]: reloading configuration succeeded
Nov  5 15:04:22 backup092 named[5141]: reloading zones succeeded
Nov  5 15:04:22 backup092 named[5141]: client 10.210.86.252#43370: received notify for zone '210.10.in-addr.arpa'
Nov  5 15:04:22 backup092 named[5141]: zone 210.10.in-addr.arpa/IN: refused notify from non-master: 10.210.86.252#43370
Nov  5 15:04:22 backup092 named[5141]: client 127.0.0.1#43370: received notify for zone 'autotest092.local'
Nov  5 15:04:22 backup092 named[5141]: zone autotest092.local/IN: notify from 127.0.0.1#43370: zone is up to date
Nov  5 15:04:22 backup092 named[5141]: client 10.210.86.252#43370: received notify for zone 'autotest092.local'
Nov  5 15:04:22 backup092 named[5141]: zone autotest092.local/IN: refused notify from non-master: 10.210.86.252#43370
Nov  5 15:04:27 backup092 named[5138]: zone 210.10.in-addr.arpa/IN: sending notifies (serial 4)
Nov  5 15:04:27 backup092 named[5141]: client 127.0.0.1#63372: received notify for zone '210.10.in-addr.arpa'
Nov  5 15:04:27 backup092 named[5141]: zone 210.10.in-addr.arpa/IN: notify from 127.0.0.1#63372: zone is up to date
Nov  5 15:04:27 backup092 named[5138]: zone autotest092.local/IN: sending notifies (serial 15)
Nov  5 15:04:27 backup092 named[5141]: client 10.210.86.252#9281: received notify for zone '210.10.in-addr.arpa'
Nov  5 15:04:27 backup092 named[5141]: zone 210.10.in-addr.arpa/IN: refused notify from non-master: 10.210.86.252#9281
Nov  5 15:04:27 backup092 named[5141]: client 127.0.0.1#9281: received notify for zone 'autotest092.local'
Nov  5 15:04:27 backup092 named[5141]: zone autotest092.local/IN: notify from 127.0.0.1#9281: zone is up to date
Nov  5 15:04:27 backup092 named[5141]: client 10.210.86.252#9281: received notify for zone 'autotest092.local'
Nov  5 15:04:27 backup092 named[5141]: zone autotest092.local/IN: refused notify from non-master: 10.210.86.252#9281
Comment 4 Stefan Gohmann univentionstaff 2015-11-05 21:40:53 CET
In my case the restarted is triggered because a reverse zone for the Docker interface is created. This is triggered by the  80_docker/52_app_install_dc_slave test case: 

-------------------------------------------------------------------------------
==> /var/log/univention/listener.log <==
05.11.15 09:59:29.003  LISTENER    ( PROCESS ) : DNS: /var/cache/bind/17.172.in-addr.arpa.zone does not exist. Triggering a bind9 restart.
-------------------------------------------------------------------------------

This is the normal DC slave behaviour. So, I'll disable the test case until Bug #39801 has been fixed.
Comment 5 Stefan Gohmann univentionstaff 2015-11-05 22:00:11 CET
(In reply to Stefan Gohmann from comment #4)
> This is the normal DC slave behaviour. So, I'll disable the test case until
> Bug #39801 has been fixed.

Done via r65258. Now, let's wait for the test results.
Comment 6 Stefan Gohmann univentionstaff 2015-11-06 06:17:03 CET
(In reply to Stefan Gohmann from comment #5)
> (In reply to Stefan Gohmann from comment #4)
> > This is the normal DC slave behaviour. So, I'll disable the test case until
> > Bug #39801 has been fixed.
> 
> Done via r65258. Now, let's wait for the test results.

Looks good.
Comment 7 Felix Botner univentionstaff 2015-11-16 16:14:55 CET
OK, test disabled
Comment 8 Stefan Gohmann univentionstaff 2015-11-17 12:12:18 CET
UCS 4.1 has been released:
 https://docs.software-univention.de/release-notes-4.1-0-en.html
 https://docs.software-univention.de/release-notes-4.1-0-de.html

If this error occurs again, please use "Clone This Bug".