Bug 41021 - Samba join fails due to timeouts
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Samba4
Version: UCS 4.1
Hardware: Other Linux
Importance: P5 normal
Target Milestone: UCS 4.1-2-errata
Assigned To: Stefan Gohmann
QA Contact: Arvid Requate
Reported: 2016-04-08 09:16 CEST by Stefan Gohmann
Modified: 2016-06-08 14:28 CEST (History)
Description Stefan Gohmann univentionstaff 2016-04-08 09:16:45 CEST
Ticket #2016032921000167

Joining a second Samba AD DC fails in an environment with about 60,000 users:

root@dc-d:~# samba-tool domain join xxx.yyy.de DC --kerberos=no -UAdministrator%XXXXXX --realm=XXX.YYY.DE --machinepass=XXXXXXXXXXXXXXXXX --verbose
Finding a writeable DC for domain 'xxx.yyy.de'
Found DC dc-n.xxx.yyy.de
workgroup is XXX
realm is xxx.yyy.de
checking sAMAccountName
Adding CN=DC-D,OU=Domain Controllers,DC=xxx,DC=yyy,DC=de
Adding CN=DC-D,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=xxx,DC=yyy,DC=de
Adding CN=NTDS Settings,CN=DC-D,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=xxx,DC=yyy,DC=de
Adding SPNs to CN=DC-D,OU=Domain Controllers,DC=xxx,DC=yyy,DC=de
Setting account password for DC-D$
Enabling account
Calling bare provision
Looking up IPv4 addresses
Looking up IPv6 addresses
No IPv6 address will be assigned
Setting up secrets.ldb
Setting up the registry
Setting up the privileges database
Setting up idmap db
Setting up SAM db
Setting up sam.ldb partitions and settings
Setting up sam.ldb rootDSE
Pre-loading the Samba 4 and AD schema
A Kerberos configuration suitable for Samba 4 has been generated at /var/lib/samba/private/krb5.conf
Provision OK for domain DN DC=xxx,DC=yyy,DC=de
Starting replication
Schema-DN[CN=Schema,CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[402/1550] linked_values[0/0]
Schema-DN[CN=Schema,CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[804/1550] linked_values[0/0]
Schema-DN[CN=Schema,CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[1206/1550] linked_values[0/0]
Schema-DN[CN=Schema,CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[1550/1550] linked_values[0/0]
Analyze and apply schema objects
Partition[CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[402/1625] linked_values[0/0]
Partition[CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[804/1625] linked_values[0/0]
Partition[CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[1206/1625] linked_values[0/0]
Partition[CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[1608/1625] linked_values[0/0]
Partition[CN=Configuration,DC=xxx,DC=yyy,DC=de] objects[1625/1625] linked_values[28/0]
Replicating critical objects from the base DN of the domain
Partition[DC=xxx,DC=yyy,DC=de] objects[239/239] linked_values[216/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[641/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[1043/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[1445/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[1847/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[2249/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[2651/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[3053/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[3455/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[3857/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[4259/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[4661/101730] linked_values[0/0]
[...]
Partition[DC=xxx,DC=yyy,DC=de] objects[99131/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[99533/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[99935/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[100337/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[100739/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[101141/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[101543/101730] linked_values[0/0]
Partition[DC=xxx,DC=yyy,DC=de] objects[101945/101730] linked_values[0/0]
Join failed - cleaning up
checking sAMAccountName
removing samaccount: CN=DC-D,OU=Domain Controllers,DC=xxx,DC=yyy,DC=de
Deleted CN=DC-D,OU=Domain Controllers,DC=xxx,DC=yyy,DC=de
Deleted CN=NTDS Settings,CN=DC-D,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=xxx,DC=yyy,DC=de
Deleted CN=DC-D,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=xxx,DC=yyy,DC=de
ERROR(runtime): uncaught exception - (-1073741643, '{Device Timeout} The specified I/O operation on %hs was not completed before the time-out period expired.')
File "/usr/lib/python2.7/dist-packages/samba/netcmd/__init__.py", line 175, in _run
return self.run(*args, **kwargs)
File "/usr/lib/python2.7/dist-packages/samba/netcmd/domain.py", line 628, in run
keep_existing=keep_existing)
File "/usr/lib/python2.7/dist-packages/samba/join.py", line 1190, in join_DC
ctx.do_join()
File "/usr/lib/python2.7/dist-packages/samba/join.py", line 1095, in do_join
ctx.join_replicate()
File "/usr/lib/python2.7/dist-packages/samba/join.py", line 835, in join_replicate
replica_flags=ctx.domain_replica_flags)
File "/usr/lib/python2.7/dist-packages/samba/drs_utils.py", line 253, in replicate
(level, ctr) = self.drs.DsGetNCChanges(self.drs_handle, req_level, req)
root@dc-d:~#
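
The numeric code in the exception above is a signed 32-bit rendering of an NTSTATUS value. A minimal sketch, assuming only standard Python, that converts it to the usual unsigned hex form; the helper name to_ntstatus is made up for illustration and is not part of Samba. The resulting value, 0xC00000B5, should correspond to NT_STATUS_IO_TIMEOUT, which matches the "{Device Timeout}" message text.

```python
def to_ntstatus(signed_code):
    """Render a signed 32-bit error code as an unsigned NTSTATUS hex string.

    Hypothetical helper for illustration; not a Samba API.
    """
    # Masking with 0xFFFFFFFF reinterprets the negative value as unsigned 32-bit.
    return "0x{:08X}".format(signed_code & 0xFFFFFFFF)


print(to_ntstatus(-1073741643))  # 0xC00000B5
```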

The Samba version is 4.3.4. The replication of 'Partition[DC=xxx,DC=yyy,DC=de]' ran for about 15 minutes before the error occurred.
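
For reference, the progress lines in the log can be parsed to see how many objects arrive between consecutive lines. This is only a sketch: the regular expression and the helper name batch_sizes are illustrative assumptions, and the sample lines are copied from the output above.

```python
import re

# Matches lines such as:
# Partition[DC=xxx,DC=yyy,DC=de] objects[641/101730] linked_values[0/0]
PROGRESS_RE = re.compile(r"^Partition\[[^\]]+\] objects\[(\d+)/(\d+)\]")


def batch_sizes(lines):
    """Return the increase in the object counter between consecutive progress lines."""
    counts = []
    for line in lines:
        match = PROGRESS_RE.match(line)
        if match:
            counts.append(int(match.group(1)))
    # Pair each counter value with its successor and take the difference.
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


sample = [
    "Partition[DC=xxx,DC=yyy,DC=de] objects[641/101730] linked_values[0/0]",
    "Partition[DC=xxx,DC=yyy,DC=de] objects[1043/101730] linked_values[0/0]",
    "Partition[DC=xxx,DC=yyy,DC=de] objects[1445/101730] linked_values[0/0]",
]
print(batch_sizes(sample))  # [402, 402]
```

In the lines shown in this report the counter advances by 402 objects each time, so a partition of roughly 100,000 objects takes a few hundred such steps.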
Comment 1 Stefan Gohmann univentionstaff 2016-04-18 19:49:07 CEST
On the RPC server side, the sort in source4/rpc_server/drsuapi/getncchanges.c in the function dcesrv_drsuapi_DsGetNCChanges takes about 4 minutes and 45 seconds. While the server side is sorting, the client throws the exception.

It looks like increasing DCERPC_REQUEST_TIMEOUT from 60 to 480 in source4/librpc/rpc/dcerpc.h fixes the issue.
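
The arithmetic behind this, as a rough sketch: the sort time of 4 minutes 45 seconds is 285 seconds, which exceeds the old 60-second request timeout but fits within 480 seconds. The function call_times_out is a hypothetical illustration of that comparison, not Samba code; the real constant is a C define in source4/librpc/rpc/dcerpc.h.

```python
# Server-side sort time reported in this comment: 4 minutes 45 seconds.
SORT_DURATION_S = 4 * 60 + 45  # 285 seconds


def call_times_out(server_work_s, client_timeout_s):
    """Return True if the client would give up before the server replies.

    Simplified model: ignores network latency and any other work in the call.
    """
    return server_work_s > client_timeout_s


for timeout_s in (60, 480):
    outcome = "times out" if call_times_out(SORT_DURATION_S, timeout_s) else "completes"
    print("timeout={}s: request {}".format(timeout_s, outcome))
```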
Comment 2 Stefan Gohmann univentionstaff 2016-05-27 07:02:44 CEST
I've added a simple patch which increases the timeout from 60 to 480 seconds: r16549

I didn't send it upstream because it will be solved upstream in a different way. Once it has been fixed upstream, we can remove this patch.

YAML: r69577
Comment 3 Arvid Requate univentionstaff 2016-06-07 19:35:06 CEST
The patch fixed the issue in Ticket #2016032921000167, where other things like adjusting the smb.conf parameter drs:max work time (default: 10 seconds (?!)) didn't help. The new DCERPC_REQUEST_TIMEOUT value of 8 minutes is in the same ballpark as the 5 minutes that Active Directory uses for "RPC Replication Timeout".

Adjusting the RPC timeout globally (at least in all source4/ code components) might also increase the delay until real network or server failures are detected, but I guess it's worth going this way at this point. At least we know which parameter might need further adjustment in case we see undesirable side effects.

The patch is applied during the build, and Samba replication continues to work, even while only one DC has been updated and the other has not (yet).

Advisory: Ok.
Comment 4 Janek Walkenhorst univentionstaff 2016-06-08 14:28:31 CEST
<http://errata.software-univention.de/ucs/4.1/193.html>