Bug 30836 - 98univention-samba4-dns stopped because no RID Set was replicated in 180 seconds
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Samba4
Version: UCS 4.1
Hardware: Other Linux
Importance: P3 normal
Target Milestone: UCS 4.1-3-errata
Assigned To: Stefan Gohmann
QA Contact: Arvid Requate
Duplicates: 32993 38228 38229
Depends on:
Blocks:
Reported: 2013-03-20 13:43 CET by Arvid Requate
Modified: 2016-10-20 12:39 CEST

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 2: Will only affect a few installed domains
How will those affected feel about the bug?: 2: A Pain – users won’t like this once they notice it
User Pain: 0.114
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2016060821000576 2016101121000687
Bug group (optional):
Max CVSS v3 score:


Attachments
DRSUAPI_EXOP_FSMO_RID_ALLOC.sh (3.84 KB, application/x-shellscript)
2013-11-13 17:01 CET, Ingo Steuwer
getncchanges (8.11 KB, text/x-python)
2013-11-13 17:02 CET, Ingo Steuwer
join_against_s4c_dc.patch (1.65 KB, patch)
2016-06-08 18:22 CEST, Arvid Requate
check_domain_info_for_bug30836.diff (973 bytes, patch)
2016-10-17 19:45 CEST, Arvid Requate

Description Arvid Requate univentionstaff 2013-03-20 13:43:17 CET
During the initial join of a UCS 3.1-1 Samba4 DC Backup, the joinscript 98univention-samba4-dns stopped because no RID Set was replicated to the DC Backup within 180 seconds. As a result, the "dns-$hostname" service account was not created. join.log:

==================================================================================
Configure 98univention-samba4-dns.inst Wed Feb 13 04:15:43 CET 2013
Waiting for RID Pool replication: ...........................................................................................................................
........................................................
Error no rIDSetReferences replicated for backup11
Wed Feb 13 04:19:31 CET 2013: finish /usr/sbin/univention-join
==================================================================================

Another call to univention-run-join-scripts fixed the issue. Maybe the timeout needs to be increased a bit.
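
To illustrate the wait-with-timeout behaviour described above, here is a minimal Python sketch of a polling loop with a configurable timeout. It is not the joinscript's actual code (the joinscript is a shell script); `wait_for` and its parameters are hypothetical names, and the `check` callable stands in for whatever test detects the replicated rIDSetReferences attribute. The 180-second default mirrors the timeout reported in the log.

```python
import time
from typing import Callable


def wait_for(
    check: Callable[[], bool],
    timeout: float = 180.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or `timeout` seconds have elapsed.

    `check` is a hypothetical stand-in for "has the RID Set been replicated?".
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Raising `timeout` only helps if the object eventually arrives; a caller would still need to report an error when `wait_for` returns False.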
Comment 1 Arvid Requate univentionstaff 2013-10-28 20:56:49 CET
*** Bug 32993 has been marked as a duplicate of this bug. ***
Comment 2 Arvid Requate univentionstaff 2013-10-28 20:57:02 CET
I guess a better solution would be to address this via Bug 30115.
Comment 3 Ingo Steuwer univentionstaff 2013-11-13 17:00:51 CET
Raising the priority: the problem occurred in two larger environments (both UCS 3.1, the newer one at Errata 190) and could not be solved by waiting.

References: 2013111221001099 and 2013041821001047

Workaround in both cases:

- copy the attached DRSUAPI_EXOP_FSMO_RID_ALLOC.sh and getncchanges to the DC Slave and make them executable
- run ./DRSUAPI_EXOP_FSMO_RID_ALLOC.sh, enter the "Administrator" password
- the script waits for 10 seconds, which might be too short (adjust the value in the code if necessary)

Afterwards the RID Pool object should be created and replicated.
Comment 4 Ingo Steuwer univentionstaff 2013-11-13 17:01:48 CET
Created attachment 5626
DRSUAPI_EXOP_FSMO_RID_ALLOC.sh
Comment 5 Ingo Steuwer univentionstaff 2013-11-13 17:02:22 CET
Created attachment 5627
getncchanges
Comment 6 Tim Petersen univentionstaff 2013-12-10 14:33:36 CET
Occurred again in ticket 2013121021002509
Comment 7 Tim Petersen univentionstaff 2014-06-04 09:50:48 CEST
Again 2014060421004323
Comment 8 Arvid Requate univentionstaff 2015-02-23 11:06:37 CET
I just faced this again on a DC Slave which picked the DC Backup as the system to join against. My impression is that this causes the problem: I guess that it takes too long for the joining Slave until

1. The new DC Slave account is replicated to the DC Master (PDC emulator)
2. The PDC Emulator has created a "CN=RID Set" for the DC Slave
3. The "CN=RID Set" object is replicated to the DC Slave

Usually the "CN=RID Set" should be present at the time the Samba join has completed.

This is the join.log:
======================================================================
Finding a writeable DC for domain 'ar40i1.qa'
Found DC backup51.ar40i1.qa
workgroup is AR40I1
realm is ar40i1.qa
[...]


Configure 98univention-samba4-dns.inst Wed Nov 26 13:19:36 CET 2014
2014-11-26 13:19:36.874980262+01:00 (in joinscript_init)
Waiting for RID Pool replication: ...................................................................................................................................................................................
Error no rIDSetReferences replicated for slave52
======================================================================

In case this happens again before we have a go at fixing it, please attach the relevant join.log info, especially the "Found DC " line. It would also be relevant to know if there are "DC Master only" cases where this happens, which would help falsify my theory.
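
For anyone collecting this information from a customer system, the following is a small Python sketch that pulls the two relevant lines out of a join.log. The function name `summarize_join_log` is made up for illustration, and the patterns assume the log lines look exactly like the excerpts quoted in this bug.

```python
import re

# Patterns are based on the join.log excerpts quoted in this report.
FOUND_DC = re.compile(r"^Found DC (\S+)", re.MULTILINE)
RID_ERROR = re.compile(
    r"^Error no rIDSetReferences replicated for (\S+)", re.MULTILINE
)


def summarize_join_log(text):
    """Return the DC that Samba joined against and the host that hit the error.

    Either value is None if the corresponding line is not in the log.
    """
    found = FOUND_DC.search(text)
    error = RID_ERROR.search(text)
    return {
        "join_target": found.group(1) if found else None,
        "failed_host": error.group(1) if error else None,
    }


sample = """Finding a writeable DC for domain 'ar40i1.qa'
Found DC backup51.ar40i1.qa
workgroup is AR40I1
Error no rIDSetReferences replicated for slave52
"""
print(summarize_join_log(sample))
```

If `join_target` is not the DC Master, the case supports the theory above; a failure with the DC Master as `join_target` would count against it.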
Comment 9 Arvid Requate univentionstaff 2015-02-23 11:11:19 CET
Btw. in my case the situation fixed itself, I just had to run univention-run-join-scripts again.

Also, no CNF-objects appeared, i.e. Bug 33388 did not rear its head in this case.

I would propose the same fix though: make Samba join against the system which holds the "PDC Emulator" FSMO role (with a reasonable alternative for Slave PDCs like in UCS@school).
Comment 10 Michael Grandjean univentionstaff 2015-03-27 20:10:13 CET
Again via 2015032721000261
Comment 11 Arvid Requate univentionstaff 2015-04-13 12:52:58 CEST
*** Bug 38228 has been marked as a duplicate of this bug. ***
Comment 12 Arvid Requate univentionstaff 2015-04-13 12:57:39 CEST
*** Bug 38229 has been marked as a duplicate of this bug. ***
Comment 13 Stefan Gohmann univentionstaff 2015-11-18 09:07:57 CET
This happens again in a fresh UCS 4.1 test installation with 3 Samba 4 DCs (Master, Backup and Slave):

Waiting for RID Pool replication: ...................................................................................................................................................................................
Error no rIDSetReferences replicated for slave413

After rebooting and running univention-run-join-scripts it worked directly.
Comment 14 Arvid Requate univentionstaff 2015-11-23 18:48:53 CET
Ok, I could reproduce it: it looks like this happens when the slave joins against the DC Backup (i.e. not the RID Master).

So we have some options:

a) make Samba join against the master (S4-Connector or PDC emulator) always
b) make 96univention-samba4.inst trigger "RID Set" generation explicitly
c) make 98univention-samba4-dns.inst trigger "RID Set" generation
Comment 15 Arvid Requate univentionstaff 2015-11-23 18:49:59 CET
Bug 33388 could be an argument for option a)
Comment 16 Arvid Requate univentionstaff 2015-11-23 19:19:27 CET
Actually in my test domain the CN=RID Set eventually got created on the DC Master, but the timestamps show that this happened more than 10 minutes after the account object was created on the DC Backup:
=============================================================================
dn: CN=SLAVE12,OU=Domain Controllers,DC=ar41i1,DC=qa
replPropertyMetaData:     NDR: struct replPropertyMetaDataBlob
        version                  : 0x00000001 (1)
        reserved                 : 0x00000000 (0)
        ctr                      : union replPropertyMetaDataCtr(case 1)
        ctr1: struct replPropertyMetaDataCtr1
            count                    : 0x0000001a (26)
            reserved                 : 0x00000000 (0)
            array: ARRAY(26)
                array: struct replPropertyMetaData1
                    attid                    : DRSUAPI_ATTID_objectClass (0x0)
                    version                  : 0x00000001 (1)
                    originating_change_time  : Mon Nov 23 18:37:56 2015 CET
                    originating_invocation_id: <ID of DC Backup>
=============================================================================

=============================================================================
dn: CN=RID Set,CN=SLAVE12,OU=Domain Controllers,DC=ar41i1,DC=qa
replPropertyMetaData:     NDR: struct replPropertyMetaDataBlob
        version                  : 0x00000001 (1)
        reserved                 : 0x00000000 (0)
        ctr                      : union replPropertyMetaDataCtr(case 1)
        ctr1: struct replPropertyMetaDataCtr1
            count                    : 0x0000000a (10)
            reserved                 : 0x00000000 (0)
            array: ARRAY(10)
                array: struct replPropertyMetaData1
                    attid                    : DRSUAPI_ATTID_objectClass (0x0)
                    version                  : 0x00000001 (1)
                    originating_change_time  : Mon Nov 23 18:48:45 2015 CET
                    originating_invocation_id: <ID of Master>
=============================================================================
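
The delay between the two originating_change_time values above can be checked with a few lines of Python. This is only a convenience sketch: `parse_change_time` is a hypothetical helper, and it drops the timezone abbreviation on the assumption that both timestamps are in the same zone (both are CET here).

```python
from datetime import datetime


def parse_change_time(value):
    """Parse a timestamp such as 'Mon Nov 23 18:37:56 2015 CET'.

    The trailing timezone abbreviation is discarded; this is only safe when
    the timestamps being compared share the same zone.
    """
    stamp, _tz = value.rsplit(" ", 1)
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")


account_created = parse_change_time("Mon Nov 23 18:37:56 2015 CET")
rid_set_created = parse_change_time("Mon Nov 23 18:48:45 2015 CET")
delay = rid_set_created - account_created
print(delay)  # 0:10:49
```

That is 649 seconds, well beyond the 180-second wait in the joinscript.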
Comment 17 Stefan Gohmann univentionstaff 2015-11-24 13:24:49 CET
(In reply to Arvid Requate from comment #15)
> Bug 33388 could be an argument for option a)

Yes, I vote for a).
Comment 18 Jens Thorp-Hansen univentionstaff 2016-06-08 15:43:52 CEST
Happened again in Ticket#2016060821000576

(slave joins against backup instead of master)
Comment 19 Arvid Requate univentionstaff 2016-06-08 18:22:08 CEST
Created attachment 7728
join_against_s4c_dc.patch

Via Bug 32257 we introduced a function get_available_s4connector_dc in the univention-samba4 join script. I guess we could use that, see attachment, untested.
Comment 20 Stefan Gohmann univentionstaff 2016-10-12 21:58:06 CEST
It looks like it happened again: Ticket #2016101121000687.
Comment 21 Stefan Gohmann univentionstaff 2016-10-14 07:52:51 CEST
* It is possible that Samba 4 joins against another DC and not against
  the master. This could lead to different problems. The join script
  now tries to join against the S4 Connector system first (Bug #30836).

UCS 4.1-3: r73194
UCS 4.2: r73195
YAML: r73196
Comment 22 Arvid Requate univentionstaff 2016-10-17 19:45:47 CEST
Created attachment 8127
check_domain_info_for_bug30836.diff

Ok, works.

Corner case: If I stop samba4 on the S4-Connector host (master in my case) then the first join attempt fails (python traceback) and continues as before by letting Samba choose any DC on the domain. So the script then falls back to the old behavior. That's ok.
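
The "preferred DC first, then fall back" behaviour described in this corner case can be sketched in Python as follows. This is an illustration only, not the joinscript's implementation: `join_with_fallback` is a hypothetical name, and `join` is a caller-supplied callable where `join(None)` is assumed to mean "let Samba choose any DC in the domain".

```python
def join_with_fallback(join, preferred_dc):
    """Try to join against preferred_dc; on failure, let Samba pick a DC.

    `join` is a hypothetical callable taking a server name or None.
    `preferred_dc` would be the S4-Connector host, or None if unknown.
    """
    if preferred_dc is not None:
        try:
            return join(preferred_dc)
        except Exception as exc:
            # Mirrors the old behaviour: report the failure and carry on.
            print(f"join against {preferred_dc} failed ({exc}); falling back")
    return join(None)


def fake_join(server):
    # Stand-in for the real join; simulates samba4 being stopped on the master.
    if server == "master.example.test":
        raise RuntimeError("connection refused")
    return f"joined via {server or 'any DC'}"


print(join_with_fallback(fake_join, "master.example.test"))
```

A stricter variant would re-raise the exception instead of falling back, which trades convenience for a well-defined failure.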


Maybe we should also run the "samba-tool domain info" check introduced for Bug 34422 comment 2 to avoid a broken sam.ldb in case of replication issues? See attached patch. On the other hand, we may want to avoid adding yet another layer of logic. For UCS 5.0 we could instead simplify the joinscript to *always* join against the S4-Connector host and immediately abort the join if that fails, rather than desperately attempting to "somehow" get the join done and possibly ending up in an undefined state.
Comment 23 Stefan Gohmann univentionstaff 2016-10-17 20:04:24 CEST
Thanks, the patch makes sense. Applied: r73304 + r73305 + r73306
Comment 24 Arvid Requate univentionstaff 2016-10-17 21:18:25 CEST
Ok, works, and the code is merged to UCS 4.2. The advisory is up to date too.
Comment 25 Janek Walkenhorst univentionstaff 2016-10-20 12:39:52 CEST
<http://errata.software-univention.de/ucs/4.1/309.html>