30836 – 98univention-samba4-dns stopped because no RID Set was replicated in 180 seconds

Summary: 98univention-samba4-dns stopped because no RID Set was replicated in 180 seconds

Status:	CLOSED FIXED

Alias:	None

Product:	UCS
Classification:	Unclassified
Component:	Samba4
Version:	UCS 4.1
Hardware:	Other Linux

Importance:	P3 normal
Target Milestone:	UCS 4.1-3-errata
Assignee:	Stefan Gohmann
QA Contact:	Arvid Requate

URL:
Keywords:

Duplicates (3):	32993 38228 38229
Depends on:
Blocks:

Reported:	2013-03-20 13:43 CET by Arvid Requate
Modified:	2016-10-20 12:39 CEST
CC List:	6 users

See Also:
What kind of report is it?:	Bug Report
What type of bug is this?:	5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?:	2: Will only affect a few installed domains
How will those affected feel about the bug?:	2: A Pain – users won’t like this once they notice it
User Pain:	0.114
Enterprise Customer affected?:	Yes
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:	2016060821000576 2016101121000687
Bug group (optional):
Customer ID:	00006, 01505
Max CVSS v3 score:

Attachments
DRSUAPI_EXOP_FSMO_RID_ALLOC.sh (3.84 KB, application/x-shellscript) 2013-11-13 17:01 CET, Ingo Steuwer
getncchanges (8.11 KB, text/x-python) 2013-11-13 17:02 CET, Ingo Steuwer
join_against_s4c_dc.patch (1.65 KB, patch) 2016-06-08 18:22 CEST, Arvid Requate
check_domain_info_for_bug30836.diff (973 bytes, patch) 2016-10-17 19:45 CEST, Arvid Requate

Description Arvid Requate

2013-03-20 13:43:17 CET

During the initial join of a UCS 3.1-1 Samba4 DC Backup, the joinscript 98univention-samba4-dns stopped because no RID Set was replicated to the DC Backup within 180 seconds. As a result, the "dns-$hostname" service account was not created. join.log:

==================================================================================
Configure 98univention-samba4-dns.inst Wed Feb 13 04:15:43 CET 2013
Waiting for RID Pool replication: ...........................................................................................................................
........................................................
Error no rIDSetReferences replicated for backup11
Wed Feb 13 04:19:31 CET 2013: finish /usr/sbin/univention-join
==================================================================================

Another call to univention-run-join-scripts fixed the issue. Maybe the timeout needs to be increased a bit.
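
The wait described above is a poll-until-timeout loop. The following is a minimal sketch of that pattern, not the joinscript's actual code (the real script is a shell script that is not shown in this report). The function name wait_for_rid_set, the rid_set_present callback and the one-second polling interval are assumptions made for illustration; only the 180-second limit comes from the report.

```python
import time
from typing import Callable


def wait_for_rid_set(
    rid_set_present: Callable[[], bool],
    timeout: float = 180.0,   # limit named in the report
    interval: float = 1.0,    # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until rid_set_present() returns True or the timeout expires.

    rid_set_present is a hypothetical callback that would check whether
    rIDSetReferences has been replicated for the local DC account.
    Returns True on success and False if the deadline passed first.
    """
    deadline = clock() + timeout
    while True:
        if rid_set_present():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Raising the timeout only helps if the RID Set eventually arrives on its own; it does not help if creation of the RID Set is never triggered.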

Comment 1 Arvid Requate

2013-10-28 20:56:49 CET

*** Bug 32993 has been marked as a duplicate of this bug. ***

Comment 2 Arvid Requate

2013-10-28 20:57:02 CET

I guess a better solution would be to address this via Bug 30115.

Comment 3 Ingo Steuwer

2013-11-13 17:00:51 CET

Raising the priority: the problem occurred in two larger environments (both UCS 3.1, the newer one at Errata 190) and could not be solved by waiting.

References: 2013111221001099 and 2013041821001047

Workaround in both cases:

- copy the attached DRSUAPI_EXOP_FSMO_RID_ALLOC.sh and getncchanges to the DC Slave and make them executable
- run ./DRSUAPI_EXOP_FSMO_RID_ALLOC.sh and enter the "Administrator" password
- the script waits for 10 seconds, which might be too short (change this in the code if necessary)

The RID Pool object should then be created and replicated.

Comment 4 Ingo Steuwer

2013-11-13 17:01:48 CET

Created attachment 5626
DRSUAPI_EXOP_FSMO_RID_ALLOC.sh

Comment 5 Ingo Steuwer

2013-11-13 17:02:22 CET

Created attachment 5627
getncchanges

Comment 6 Tim Petersen

2013-12-10 14:33:36 CET

Occurred again at ticket 2013121021002509

Comment 7 Tim Petersen

2014-06-04 09:50:48 CEST

Again 2014060421004323

Comment 8 Arvid Requate

2015-02-23 11:06:37 CET

I just faced this again on a DC Slave which picked the DC Backup as the system to join against. My impression is that this causes the problem: I guess it takes too long for the joining Slave until

1. The new DC Slave account is replicated to the DC Master (PDC emulator)
2. The PDC Emulator has created a "CN=RID Set" for the DC Slave
3. The "CN=RID Set" object has replicated to the DC Slave

Usually the "CN=RID Set" should be present at the time the Samba join has completed.

This is the join.log:
======================================================================
Finding a writeable DC for domain 'ar40i1.qa'
Found DC backup51.ar40i1.qa
workgroup is AR40I1
realm is ar40i1.qa
[...]


Configure 98univention-samba4-dns.inst Wed Nov 26 13:19:36 CET 2014
2014-11-26 13:19:36.874980262+01:00 (in joinscript_init)
Waiting for RID Pool replication: ...................................................................................................................................................................................
Error no rIDSetReferences replicated for slave52
======================================================================

In case this happens again before we have a go at fixing it, please attach the relevant join.log info, especially the "Found DC " line. It would also be relevant to know if there are "DC Master only" cases where this happens, which would help falsify my theory.

Comment 9 Arvid Requate

2015-02-23 11:11:19 CET

Btw. in my case the situation fixed itself, I just had to run univention-run-join-scripts again.

Also, no CNF objects appeared, i.e. Bug 33388 did not rear its head in this case.

I would propose the same fix though: make Samba join against the system which holds the "PDC Emulator" FSMO role (with a reasonable alternative for Slave PDCs like in UCS@school).

Comment 10 Michael Grandjean

2015-03-27 20:10:13 CET

Again via 2015032721000261

Comment 11 Arvid Requate

2015-04-13 12:52:58 CEST

*** Bug 38228 has been marked as a duplicate of this bug. ***

Comment 12 Arvid Requate

2015-04-13 12:57:39 CEST

*** Bug 38229 has been marked as a duplicate of this bug. ***

Comment 13 Stefan Gohmann

2015-11-18 09:07:57 CET

This happens again in a fresh UCS 4.1 test installation with 3 Samba 4 DCs (Master, Backup and Slave):

Waiting for RID Pool replication: ...................................................................................................................................................................................
Error no rIDSetReferences replicated for slave413

After rebooting and running univention-run-join-scripts it worked directly.

Comment 14 Arvid Requate

2015-11-23 18:48:53 CET

Ok, I could reproduce it: it looks like this happens when the slave joins against the DC Backup (i.e. not the RID Master).

So we have some options:

a) make Samba join against the master (S4-Connector or PDC emulator) always
b) make 96univention-samba4.inst trigger "RID Set" generation explicitly
c) make 98univention-samba4-dns.inst trigger "RID Set" generation

Comment 15 Arvid Requate

2015-11-23 18:49:59 CET

Bug 33388 could be an argument for option a)

Comment 16 Arvid Requate

2015-11-23 19:19:27 CET

Actually, in my test domain the CN=RID Set eventually got created on the DC Master, but the timestamps show that this happened more than 10 minutes after the account object was created on the DC Backup:
=============================================================================
dn: CN=SLAVE12,OU=Domain Controllers,DC=ar41i1,DC=qa
replPropertyMetaData:     NDR: struct replPropertyMetaDataBlob
        version                  : 0x00000001 (1)
        reserved                 : 0x00000000 (0)
        ctr                      : union replPropertyMetaDataCtr(case 1)
        ctr1: struct replPropertyMetaDataCtr1
            count                    : 0x0000001a (26)
            reserved                 : 0x00000000 (0)
            array: ARRAY(26)
                array: struct replPropertyMetaData1
                    attid                    : DRSUAPI_ATTID_objectClass (0x0)
                    version                  : 0x00000001 (1)
                    originating_change_time  : Mon Nov 23 18:37:56 2015 CET
                    originating_invocation_id: <ID of DC Backup>
=============================================================================

=============================================================================
dn: CN=RID Set,CN=SLAVE12,OU=Domain Controllers,DC=ar41i1,DC=qa
replPropertyMetaData:     NDR: struct replPropertyMetaDataBlob
        version                  : 0x00000001 (1)
        reserved                 : 0x00000000 (0)
        ctr                      : union replPropertyMetaDataCtr(case 1)
        ctr1: struct replPropertyMetaDataCtr1
            count                    : 0x0000000a (10)
            reserved                 : 0x00000000 (0)
            array: ARRAY(10)
                array: struct replPropertyMetaData1
                    attid                    : DRSUAPI_ATTID_objectClass (0x0)
                    version                  : 0x00000001 (1)
                    originating_change_time  : Mon Nov 23 18:48:45 2015 CET
                    originating_invocation_id: <ID of Master>
=============================================================================
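
The delay can be read directly off the two originating_change_time values quoted above. A small sketch of that arithmetic follows; the helper name parse_change_time is made up for this example, and dropping the trailing "CET" token assumes both timestamps are in the same time zone, as they are here.

```python
from datetime import datetime


def parse_change_time(value: str) -> datetime:
    """Parse a timestamp such as 'Mon Nov 23 18:37:56 2015 CET'.

    The trailing time zone abbreviation is dropped; this is only safe
    because both values compared below carry the same zone.
    """
    stamp = value.rsplit(" ", 1)[0]
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")


# Values copied from the two replPropertyMetaData dumps above.
account_created = parse_change_time("Mon Nov 23 18:37:56 2015 CET")
rid_set_created = parse_change_time("Mon Nov 23 18:48:45 2015 CET")

delay = rid_set_created - account_created
print(delay)  # 0:10:49
```

At 649 seconds, the delay is well beyond the 180 seconds the joinscript waits.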

Comment 17 Stefan Gohmann

2015-11-24 13:24:49 CET

(In reply to Arvid Requate from comment #15)
> Bug 33388 could be an argument for option a)

Yes, I vote for a).

Comment 18 Jens Thorp-Hansen

2016-06-08 15:43:52 CEST

Happened again at Ticket#2016060821000576

(slave joins against backup instead of master)

Comment 19 Arvid Requate

2016-06-08 18:22:08 CEST

Created attachment 7728
join_against_s4c_dc.patch

Via Bug 32257 we introduced a function get_available_s4connector_dc in the univention-samba4 join script. I guess we could use that, see attachment, untested.

Comment 20 Stefan Gohmann

2016-10-12 21:58:06 CEST

It looks like it happened again: Ticket #2016101121000687.

Comment 21 Stefan Gohmann

2016-10-14 07:52:51 CEST

* It is possible that Samba 4 joins against another DC and not against
  the master. This could lead to different problems. The join script
  now tries to join against the S4 Connector system first (Bug #30836).

UCS 4.1-3: r73194
UCS 4.2: r73195
YAML: r73196

Comment 22 Arvid Requate

2016-10-17 19:45:47 CEST

Created attachment 8127
check_domain_info_for_bug30836.diff

Ok, works.

Corner case: If I stop samba4 on the S4-Connector host (master in my case) then the first join attempt fails (python traceback) and continues as before by letting Samba choose any DC on the domain. So the script then falls back to the old behavior. That's ok.
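
The behavior described in this corner case is a "preferred target first, then fall back" pattern. The sketch below only illustrates that control flow; it is not the applied patch, which lives in the shell joinscript and is not reproduced in this report. The function join_with_fallback and the join callback are hypothetical names, and passing None is used here to stand for "let Samba choose any DC".

```python
from typing import Callable, Optional


def join_with_fallback(
    preferred_dc: Optional[str],
    join: Callable[[Optional[str]], None],
) -> Optional[str]:
    """Try to join against preferred_dc, then fall back to any DC.

    join is a hypothetical callback that performs the join and raises
    an exception on failure. join(None) means no server is pinned.
    Returns the server that was used, or None after a fallback.
    """
    if preferred_dc is not None:
        try:
            join(preferred_dc)
            return preferred_dc
        except Exception as exc:
            # Old behavior: do not abort, let Samba pick a DC itself.
            print(f"Join against {preferred_dc} failed ({exc}), falling back")
    join(None)
    return None
```

A stricter variant would drop the fallback branch and abort when the preferred host cannot be joined.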

Maybe we should also do the "samba-tool domain info" introduced for Bug 34422 comment 2 to avoid a broken sam.ldb in case of replication issues? See attached patch. On the other hand, we may want to avoid adding yet another layer of logic and instead choose for UCS 5.0 to simplify the joinscript to *always* join against the S4-Connector host and just immediately abort the join if that fails instead of desperately attempting to "somehow" get the join done and possibly ending up in an undefined state in the end.

Comment 23 Stefan Gohmann

2016-10-17 20:04:24 CEST

Thanks, the patch makes sense. Applied: r73304 + r73305 + r73306

Comment 24 Arvid Requate

2016-10-17 21:18:25 CEST

Ok, works, and the code is merged to UCS 4.2. The advisory is up to date too.

Comment 25 Janek Walkenhorst

2016-10-20 12:39:52 CEST

<http://errata.software-univention.de/ucs/4.1/309.html>