Univention Bugzilla – Bug 41021
Samba join fails due to timeouts
Last modified: 2016-06-08 14:28:31 CEST
Ticket #2016032921000167

The join of a second Samba AD DC fails in an environment with about 60,000 users:

root@dc-d:~# samba-tool domain join xxx.yyy.de DC --kerberos=no -UAdministrator%XXXXXX --realm=XXX.YYY.DE --machinepass=XXXXXXXXXXXXXXXXX --verbose
Finding a writeable DC for domain 'xxx.yyy.de'
Found DC dc-n.xxx.yyy.de
workgroup is XXX
realm is xxx.yyy.de
checking sAMAccountName
Adding CN=DC-D,OU=Domain Controllers,DC=xxx,DC=yyy,DC=de
Adding CN=DC-D,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=xxx,DC=yyy,DC=de
Adding CN=NTDS Settings,CN=DC-D,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=xxx,DC=yyy,DC=de
Adding SPNs to CN=DC-D,OU=Domain Controllers,DC=xxx,DC=yyy,DC=de
Setting account password for DC-D$
Enabling account
Calling bare provision
Looking up IPv4 addresses
Looking up IPv6 addresses
No IPv6 address will be assigned
Setting up secrets.ldb
Setting up the registry
Setting up the privileges database
Setting up idmap db
Setting up SAM db
Setting up sam.ldb partitions and settings
Setting up sam.ldb rootDSE
Pre-loading the Samba 4 and AD schema
A Kerberos configuration suitable for Samba 4 has been generated at /var/lib/samba/private/krb5.conf
Provision OK for domain DN DC=xxx,DC=yyy,DC=de
Starting replication
Schema-DN[CN=Schema,CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[402/1550] linked_values[0/0]
Schema-DN[CN=Schema,CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[804/1550] linked_values[0/0]
Schema-DN[CN=Schema,CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[1206/1550] linked_values[0/0]
Schema-DN[CN=Schema,CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[1550/1550] linked_values[0/0]
Analyze and apply schema objects
Partition[CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[402/1625] linked_values[0/0]
Partition[CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[804/1625] linked_values[0/0]
Partition[CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[1206/1625] linked_values[0/0]
Partition[CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[1608/1625] linked_values[0/0]
Partition[CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[1625/1625] linked_values[28/0]
Replicating critical objects from the base DN of the domain
Partition[DC=xxx,DC=yyy,DC=de] objects[239/239] linked_values[216/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[641/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[1043/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[1445/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[1847/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[2249/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[2651/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[3053/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[3455/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[3857/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[4259/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[4661/101730] linked_values[0/0]
[...]
Partition[DC=xxx,DC=yyy,DC=de] objects[99131/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[99533/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[99935/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[100337/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[100739/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[101141/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[101543/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[101945/101730] linked_values[0/0]
Join failed - cleaning up
checking sAMAccountName
removing samaccount: CN=DC-D,OU=Domain Controllers,DC=xxx,DC=yyy,DC=de
Deleted CN=DC-D,OU=Domain Controllers,DC=xxx,DC=yyy,DC=de
Deleted CN=NTDS Settings,CN=DC-D,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=xxx,DC=yyy,DC=de
Deleted CN=DC-D,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=xxx,DC=yyy,DC=de
ERROR(runtime): uncaught exception - (-1073741643, '{Device Timeout} The specified I/O operation on %hs was not completed before the time-out period expired.')
  File "/usr/lib/python2.7/dist-packages/samba/netcmd/__init__.py", line 175, in _run
    return self.run(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/samba/netcmd/domain.py", line 628, in run
    keep_existing=keep_existing)
  File "/usr/lib/python2.7/dist-packages/samba/join.py", line 1190, in join_DC
    ctx.do_join()
  File "/usr/lib/python2.7/dist-packages/samba/join.py", line 1095, in do_join
    ctx.join_replicate()
  File "/usr/lib/python2.7/dist-packages/samba/join.py", line 835, in join_replicate
    replica_flags=ctx.domain_replica_flags)
  File "/usr/lib/python2.7/dist-packages/samba/drs_utils.py", line 253, in replicate
    (level, ctr) = self.drs.DsGetNCChanges(self.drs_handle, req_level, req)
root@dc-d:~#

The Samba version is 4.3.4. The replication of 'Partition[DC=xxx,DC=yyy,DC=de]' ran for about 15 minutes before the error occurred.
On the RPC server side, the sort in source4/rpc_server/drsuapi/getncchanges.c in the function dcesrv_drsuapi_DsGetNCChanges takes about 4 minutes and 45 seconds. While the server side is sorting, the client throws the exception. It looks like increasing DCERPC_REQUEST_TIMEOUT from 60 to 480 in source4/librpc/rpc/dcerpc.h fixes the issue.
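The failure mode described above (the client gives up while the server is still sorting) can be reproduced in miniature. The following is only a sketch and does not use any Samba code: `slow_server_call` and `call_with_timeout` are hypothetical stand-ins, and the real durations (a sort of 4 minutes 45 seconds, timeouts of 60 and 480 seconds) are scaled down by an assumed factor of 1000 so that it runs quickly.

```python
import concurrent.futures
import time

# Assumption for this sketch: 1 real second is simulated as 1 millisecond.
SCALE = 1000.0

# Server-side sort duration reported above: 4 minutes 45 seconds.
SORT_SECONDS = 4 * 60 + 45


def slow_server_call(server_seconds):
    """Hypothetical stand-in for a server that sorts before it replies."""
    time.sleep(server_seconds / SCALE)
    return "reply"


def call_with_timeout(server_seconds, timeout_seconds):
    """Return the reply, or None if the client-side timeout fires first."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_server_call, server_seconds)
        try:
            return future.result(timeout=timeout_seconds / SCALE)
        except concurrent.futures.TimeoutError:
            # The client stops waiting; the simulated server keeps working.
            return None


if __name__ == "__main__":
    for timeout in (60, 480):
        outcome = call_with_timeout(SORT_SECONDS, timeout)
        print("timeout=%ds -> %s" % (timeout, outcome or "client timed out"))
```

With the 60 second value the client stops waiting long before the simulated sort finishes, which matches the exception in the traceback. With 480 seconds the reply arrives in time.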
I've added a simple patch which increases the timeout from 60 to 480 seconds: r16549
I didn't send it upstream because it will be solved upstream in a different way. Once it has been fixed upstream, we can remove this patch.
YAML: r69577
The patch fixed the issue in Ticket #2016032921000167, whereas other measures, such as adjusting the smb.conf parameter drs:max work time (default: 10 seconds (?!)), didn't help.

The new DCERPC_REQUEST_TIMEOUT value of 8 minutes is in the same ballpark as the 5 minutes that Active Directory uses for "RPC Replication Timeout". Adjusting the RPC timeout globally (at least in all source4/ code components) might also increase the delay until real network or server failures are detected, but I guess it's worth going this way at this point. At least we know which parameter might need more adjustment in case we see undesirable side effects.

The patch is applied during build, and Samba replication continues to work, even while only one DC has been updated and the other has not (yet).

Advisory: Ok.
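To put numbers on the trade-off discussed above, the following sketch compares the three timeout values mentioned in this bug against the measured sort time of 4 minutes 45 seconds. It rests on a simplifying assumption that is not taken from the Samba code: a dead server is noticed on a single request only when the full timeout has elapsed. The names `CANDIDATES`, `headroom` and `summarize` are made up for this illustration.

```python
# Server-side sort duration reported in this bug: 4 minutes 45 seconds.
SORT_SECONDS = 4 * 60 + 45

# Timeout values mentioned in this bug, in seconds.
CANDIDATES = {
    "previous DCERPC_REQUEST_TIMEOUT": 60,
    "AD RPC Replication Timeout": 5 * 60,
    "patched DCERPC_REQUEST_TIMEOUT": 8 * 60,
}


def headroom(timeout_seconds, work_seconds=SORT_SECONDS):
    """Seconds left after the server-side work; negative means a timeout."""
    return timeout_seconds - work_seconds


def summarize(candidates=CANDIDATES):
    """Return (label, timeout, headroom, survives) for each candidate."""
    rows = []
    for label, timeout in candidates.items():
        spare = headroom(timeout)
        rows.append((label, timeout, spare, spare >= 0))
    return rows


if __name__ == "__main__":
    for label, timeout, spare, survives in summarize():
        # Simplifying assumption: worst-case failure detection == timeout.
        print("%-32s timeout=%3ds headroom=%4ds survives_sort=%s"
              % (label, timeout, spare, survives))
```

Under these assumptions the 5 minute value would leave only a small margin over the measured sort, while the 8 minute value leaves more room at the cost of a longer wait before a dead server is noticed.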
<http://errata.software-univention.de/ucs/4.1/193.html>