Bug 35560 - Samba DRS replication hangs after DC Re-Join
Status: RESOLVED WONTFIX
Product: UCS
Classification: Unclassified
Component: Samba4
Version: UCS 4.2
Hardware: Other Linux
Importance: P5 normal
Assigned To: Samba maintainers
Reported: 2014-08-04 17:03 CEST by Arvid Requate
Modified: 2020-07-03 20:56 CEST

What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 2: Will only affect a few installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.286
Enterprise Customer affected?: Yes
Ticket number: 2017110221000459, 2018052221000458, 2020012721000219
Description Arvid Requate univentionstaff 2014-08-04 17:03:52 CEST
After re-joining a UCS Samba4 DC in a domain with more than one Samba4 DC, samba-tool drs showrepl on the other Samba4 DCs shows connection problems to the re-joined DC. It seems that Samba4 on the "other" DCs still tries to connect using Kerberos tickets that became invalid with the re-join. More details about this scenario are needed. Currently the only known workaround is to restart samba4 on the "other" Samba4 DCs in the domain.

In the case I just faced, the output of samba-tool drs showrepl on the master said "WERR_GENERAL_FAILURE" for the INBOUND and OUTBOUND connections to the re-joined slave. The log.samba on the slave showed bursts of 5 messages repeated every 5 seconds, probably one burst for each connect attempt by the Samba4 drepl server on the master:

[2014/08/04 17:52:00.943781,  1, pid=25917] ../source4/auth/gensec/gensec_gssapi.c:648(gensec_gssapi_update)
  GSS server Update(krb5)(1) Update failed:  Miscellaneous failure (see text): Decrypt integrity check failed for checksum type hmac-sha1-96-aes256, key type aes256-cts-hmac-sha1-96
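
To see how often the re-joined DC rejects such tickets, the failures can be counted per second of log time. This is only a minimal sketch, assuming the log format shown in the excerpt above; the function name decrypt_failures_per_second is made up for illustration and the log file is passed in by the caller:

```shell
# Hypothetical helper: count "Decrypt integrity check failed" messages per second.
# Assumes every message follows a header line of the form
# "[YYYY/MM/DD HH:MM:SS.usec, level, pid=N] source-location" as in the excerpt.
decrypt_failures_per_second() {
    awk '
        # Header line: remember date and time (truncated to whole seconds).
        /^\[/ { stamp = substr($1, 2) " " substr($2, 1, 8) }
        # Message line: count it under the most recent header timestamp.
        /Decrypt integrity check failed/ && stamp != "" { count[stamp]++ }
        END { for (s in count) print s, count[s] }
    ' "$1" | sort
}
```

It would be called with the path of the log.samba on the re-joined DC, for example decrypt_failures_per_second /var/log/samba/log.samba (the path is an assumption about the local setup).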


The log.samba on the master shows corresponding messages of this kind:

[2014/08/04 17:52:01.685103,  0, pid=18449] ../source4/librpc/rpc/dcerpc_util.c:681(dcerpc_pipe_auth_recv)
  Failed to bind to uuid e3514235-4b06-11d1-ab04-00c04fc2dcd2 for e3514235-4b06-11d1-ab04-00c04fc2dcd2@ncacn_ip_tcp:32af0d98-9d10-4805-bf97-99bebff7e62f._msdcs.w2k12.test[1024,seal,krb5] NT_STATUS_UNSUCCESSFUL

(32af0d98-9d10-4805-bf97-99bebff7e62f._msdcs points to the IP of the slave).
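
The name used for the bind can be rebuilt from the DSA objectGUID and the DNS domain, which makes it easy to check where it resolves. A small sketch; dsa_dns_alias is a made-up helper name, and the lookup itself is only shown as a comment:

```shell
# Hypothetical helper: build the <objectGUID>._msdcs.<domain> name that
# appears as the bind target in the master's log line above.
dsa_dns_alias() {
    guid=$1
    domain=$2
    printf '%s._msdcs.%s\n' "$guid" "$domain"
}

# Example (not run here): resolve the alias with a standard lookup tool.
# host "$(dsa_dns_alias 32af0d98-9d10-4805-bf97-99bebff7e62f w2k12.test)"
```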
Comment 1 Arvid Requate univentionstaff 2016-01-14 19:54:09 CET
Quite possibly this is a duplicate of Bug #37358.
Comment 2 Arvid Requate univentionstaff 2017-04-24 13:07:08 CEST

*** This bug has been marked as a duplicate of bug 37358 ***
Comment 3 Felix Botner univentionstaff 2017-07-06 15:29:08 CEST
I'm not sure if this is Bug #37358 (because no change on the master here, just a re-join of a non-Master UCS system).

I have the same issue with

s4 master + s4 backup + s4 slave

After the second re-join of the backup, DRS replication to the backup from both master and slave is broken.

CN=Schema,CN=Configuration,DC=four,DC=two
	Default-First-Site-Name\BACKUP via RPC
		DSA object GUID: 012a971c-bac3-4dc9-a036-5e4538c94a81
		Last attempt @ Thu Jul  6 00:17:12 2017 CEST failed, result 31 (WERR_GEN_FAILURE)
		6 consecutive failure(s).
		Last success @ NTTIME(0)
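
To pick the failing partners out of a longer showrepl listing, the output can be filtered for attempts marked as failed. A rough sketch, assuming the output layout shown above; failed_partners is a made-up helper name:

```shell
# Hypothetical helper: print the DSA object GUID of every replication partner
# whose last attempt is reported as failed. Reads showrepl output on stdin.
failed_partners() {
    awk '
        /DSA object GUID:/ { guid = $NF }               # remember current partner
        /Last attempt @/ && /failed/ { print guid }     # report it on failure
    '
}

# Example (not run here):
# samba-tool drs showrepl | failed_partners
```

A GUID is printed once per partition in which the partner fails, so piping the result through sort -u gives one line per DC.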


master log.samba:

[2017/07/06 00:17:37.955740,  0, pid=4283] ../source4/librpc/rpc/dcerpc_util.c:737(dcerpc_pipe_auth_recv)
  Failed to bind to uuid e3514235-4b06-11d1-ab04-00c04fc2dcd2 for ncacn_ip_tcp:10.200.7.52[1024,seal,krb5,target_hostname=012a971c-bac3-4dc9-a036-5e4538c94a81._msdcs.four.two,target_principal=GC/backup.four.two/four.two,abstract_syntax=e3514235-4b06-11d1-ab04-00c04fc2dcd2/0x00000004,localaddress=10.200.7.50] NT_STATUS_UNSUCCESSFUL


backup log.samba:

  GSS server Update(krb5)(1) Update failed:  Miscellaneous failure (see text): Decrypt integrity check failed for checksum type hmac-sha1-96-aes256, key type aes256-cts-hmac-sha1-96
[2017/07/06 00:17:58.776201,  1, pid=2203] ../source4/auth/gensec/gensec_gssapi.c:622(gensec_gssapi_update)
  GSS server Update(krb5)(1) Update failed:  Miscellaneous failure (see text): Decrypt integrity check failed for checksum type hmac-sha1-96-aes256, key type aes256-cts-hmac-sha1-96

Re-joining a UCS system has to work!
Comment 4 Arvid Requate univentionstaff 2018-06-05 19:24:17 CEST
IIRC in that case DRS replication between re-joined backup and master worked.

But the DRS replication between the backup and the other Slave DCs didn't work. That situation had another peculiarity: even a samba restart on the slaves didn't get the replication going again. The reason was that the Samba/AD data on the Slaves still had an old "CN=NTDS Settings" object for the backup DC. That object is stored in the CN=Configuration partition.

univention-s4search --cross-ncs "CN=NTDS Settings" objectGUID

The objectGUID of those objects is relevant because the replication uses it for a DNS lookup of a DNS alias. In the given case, the Slave DCs continued to look up the DNS alias for the old objectGUID and, worse, they seem to have fetched a Kerberos ticket for the FQDN. As a result we saw Kerberos authentication errors in the log.samba on the DC backup.
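
A stale object should show up when the objectGUID values returned by that search are compared between the DCs. A rough sketch, assuming the search prints LDIF-style "objectGUID: <value>" lines (as ldbsearch does); ntds_guids is a made-up helper name:

```shell
# Hypothetical helper: reduce LDIF search output on stdin to a sorted list
# of objectGUID values, so the lists from two DCs can be compared with diff.
ntds_guids() {
    awk '/^objectGUID: / { print $2 }' | sort
}

# Example (not run here), on each DC:
# univention-s4search --cross-ncs "CN=NTDS Settings" objectGUID | ntds_guids
```

A GUID that appears on a slave but not on the re-joined backup would be a candidate for the old NTDS Settings object.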
Comment 5 Felix Botner univentionstaff 2018-11-28 17:03:59 CET
Replication works again after merging the old password kvno entries from the old (pre-join) /etc/krb5.keytab

@slave cp /etc/krb5.keytab /etc/krb5.keytab.OLD
@slave re-join

@master samba-tool drs showrepl

DC=DomainDnsZones,DC=four,DC=three
	Default-First-Site-Name\SLAVE via RPC
		DSA object GUID: 0bd5a0f8-9a3d-41de-865a-940a59e47cc7
		Last attempt @ Wed Nov 28 16:45:10 2018 CET failed, result 31 (WERR_GEN_FAILURE)

@slave ktutil copy  /etc/krb5.keytab.OLD /etc/krb5.keytab

@master samba-tool drs showrepl
OK
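
The two keytab steps above could be wrapped like this. It is only a sketch: keytab_backup and keytab_merge_old are made-up helper names, the default path is the one used above, and the merge relies on the same ktutil copy call:

```shell
# Hypothetical helper: keep a copy of the keytab before the re-join.
keytab_backup() {
    keytab=${1:-/etc/krb5.keytab}
    cp -p "$keytab" "$keytab.OLD"
}

# Hypothetical helper: after the re-join, copy the saved entries (old kvno
# keys) back into the new keytab, using the "ktutil copy" call shown above.
keytab_merge_old() {
    keytab=${1:-/etc/krb5.keytab}
    if [ ! -f "$keytab.OLD" ]; then
        echo "no saved keytab found: $keytab.OLD" >&2
        return 1
    fi
    ktutil copy "$keytab.OLD" "$keytab"
}
```

On the slave this would be keytab_backup before the re-join and keytab_merge_old afterwards, followed by samba-tool drs showrepl on the master to check the result.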
Comment 6 Christian Völker univentionstaff 2020-01-27 16:46:17 CET
Just as an addition: I was able to fix this issue at a customer site using the steps in comment #5 from Felix.
Comment 7 Ingo Steuwer univentionstaff 2020-07-03 20:56:47 CEST
This issue has been filed against UCS 4.2.

UCS 4.2 is out of maintenance and many UCS components have changed in later releases. Thus, this issue is now being closed.

If this issue still occurs in newer UCS versions, please use "Clone this bug" or reopen it and update the UCS version. In this case please provide detailed information on how this issue is affecting you.