Bug 51776 - Hanging Join process during certificate retrieval
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Join (univention-join)
Version: UCS 5.0
Hardware: Other Linux
Importance: P5 normal
Target Milestone: UCS 5.0-0-errata
Assigned To: Philipp Hahn
QA Contact: Florian Best
Depends on: 51804
Blocks: 53810
Reported: 2020-08-06 09:59 CEST by Florian Best
Modified: 2021-09-20 08:57 CEST

What kind of report is it?: Development Internal
hahn: Patch_Available+


Description Florian Best univentionstaff 2020-08-06 09:59:30 CEST
One of our Jenkins test instances has been hanging for one day:

root     12539  0.0  0.0   7304  3628 ?        Ss   Aug05   0:00  \_ bash -c . utils.sh && run_setup_join_on_non_master
root     12545  0.0  0.0   7324  3888 ?        S    Aug05   0:00      \_ /bin/bash /usr/lib/univention-system-setup/scripts/setup-join.sh --dcaccount Administrator --password_file /tmp/univention
root      8270  0.0  0.0   7324  2448 ?        S    Aug05   0:00          \_ /bin/bash /usr/lib/univention-system-setup/scripts/setup-join.sh --dcaccount Administrator --password_file /tmp/univention
root      8274  0.0  0.1   7684  4208 ?        S    Aug05   0:00          |   \_ /bin/bash /usr/share/univention-join/univention-join -dcaccount Administrator -dcpwd /tmp/tmp.gkqnTREeI5
root     10031  0.0  0.0   2388  1708 ?        S    Aug05   0:04          |       \_ /bin/sh /usr/sbin/univention-fetch-certificate slave098 master098.autotest098.local
root     31664  1.0  0.1  15720  7804 ?        S    09:56   0:00          |           \_ /usr/bin/python2.7 /usr/sbin/univention-scp /etc/machine.secret -r slave098$@master098.autotest098.local:/etc/univention/ssl/slave098 slave098$@master098.autotest098.local:/etc/univention/ssl/slave098.autotest098.local /etc/univention/ssl/
root     31665  0.0  0.0   5356   760 ?        Ss   09:56   0:00          |               \_ scp -o StrictHostKeyChecking=no -o ControlPath=none -r slave098$@master098.autotest098.local:/etc/univention/ssl/slave098 slave098$@master098.autotest098.local:/etc/univention/ssl/slave098.autotest098.local /etc/univention/ssl/
root     31666  1.0  0.0      0     0 ?        Z    09:56   0:00          |                   \_ [ssh] <defunct>
root     31668  0.0  0.1  15816  7448 ?        S    09:56   0:00          |                   \_ /usr/bin/ssh -x -oForwardAgent=no -oPermitLocalCommand=no -oClearAllForwardings=yes -oRemoteCommand=none -oRequestTTY=no -o StrictHostKeyChecking=no -o ControlPath=none -l slave098$ -- master098.autotest098.local scp -r -d -f /etc/univention/ssl/slave098.autotest098.local
root      8271  0.0  0.0   7324  2368 ?        S    Aug05   0:00          \_ /bin/bash /usr/lib/univention-system-setup/scripts/setup-join.sh --dcaccount Administrator --password_file /tmp/univention
Comment 1 Florian Best univentionstaff 2020-10-30 09:17:03 CET
This happened twice again in our Jenkins environment, again on a DC Slave.
Comment 2 Felix Botner univentionstaff 2020-10-30 09:28:41 CET
I had a similar problem caused by Bug #51804 (comment 9). Could this be the problem here as well?
Comment 3 Florian Best univentionstaff 2020-10-30 09:34:22 CET
(In reply to Felix Botner from comment #2)
> I had a similar problem caused by this Bug #51804 (comment 9), could this be
> the problem here as well?

Sounds like it, yes.
Maybe we can use Bug #51804 to fix the underlying issue and this bug to fix the error handling, so that the join aborts after 1 hour if the certificates have not been created.
Comment 4 Felix Botner univentionstaff 2020-10-30 09:41:22 CET
(In reply to Florian Best from comment #3)
> (In reply to Felix Botner from comment #2)
> > I had a similar problem caused by this Bug #51804 (comment 9), could this be
> > the problem here as well?
> 
> Sounds like, yes.
> Maybe we can use Bug #51804 to fix the issue and this bug to fix the error
> handling, so that the join aborts after 1 hour when the certificates aren't
> created.

Yep, that makes sense.
Comment 5 Philipp Hahn univentionstaff 2020-11-21 08:25:05 CET
Jenkins → UCSschool-4.4 → Install U@S 4.4 Multiserver Large Env
has been stalled for 11 days:

@slave300-s3:

> 06:32:11 [slave300-s3]   . utils.sh; run_setup_join_on_non_master
# ps axfu
root      4040  0.0  0.1 144192 11012 ?        Ss   Nov15   0:05 sshd: root@notty
root      4051  0.0  0.0  13552  3716 ?        Ss   Nov15   0:00  \_ bash -c . utils.sh; run_setup_join_on_non_master
root      4059  0.0  0.0  13692  3808 ?        S    Nov15   0:00      \_ /bin/bash /usr/lib/univention-system-setup/scripts/setup-join.sh --dcaccount Administrator --password_file /tmp/univention
root      5555  0.0  0.0  13692  2828 ?        S    Nov15   0:00          \_ /bin/bash /usr/lib/univention-system-setup/scripts/setup-join.sh --dcaccount Administrator --password_file /tmp/univention
root      5559  0.0  0.0  13908  4296 ?        S    Nov15   0:00          |   \_ /bin/bash /usr/share/univention-join/univention-join -dcaccount Administrator -dcpwd /tmp/tmp.5xlPTsybgu
root      7838  0.0  0.0   4276  1596 ?        S    Nov15   0:19          |       \_ /bin/sh /usr/sbin/univention-fetch-certificate slave300-s3 master300.autotest300.local
root      8301  0.0  0.0   7364   668 ?        S    08:06   0:00          |           \_ sleep 20
root      5556  0.0  0.0  13692  2704 ?        S    Nov15   0:00          \_ /bin/bash /usr/lib/univention-system-setup/scripts/setup-join.sh --dcaccount Administrator --password_file /tmp/univention

# find  /etc/univention/ssl -ls
  2878206      4 drwxr-xr-x   3 root     root         4096 Nov 15 06:37 /etc/univention/ssl
  2878207      4 drwxr-xr-x   2 root     root         4096 Nov 15 06:37 /etc/univention/ssl/ucsCA
  2878246      4 -rw-r--r--   1 root     root         1948 Nov 15 06:37 /etc/univention/ssl/ucsCA/CAcert.pem


@master300:

# less /home/Administrator/.univention-server-join.log
10.11.20 05:52:24.077  DEBUG_INIT
univention-server-join called
Parameter: -bindpwfile /tmp/tmp.CIjCnemboE -binddn uid=Administrator,cn=users,dc=autotest300,dc=local -ip 10.207.229.64 -netmask 255.255.0.0 -mac 52:54:00:e0:d8:f4 -role domaincontroller_slave -hostname slave300-s3 -domainname autotest300.local
        Calculated subnet = 10.207
        forwardZone zoneName=autotest300.local,cn=dns,dc=autotest300,dc=local
        reverseZone zoneName=207.10.in-addr.arpa,cn=dns,dc=autotest300,dc=local
        dhcpEntry 
Join DC Slave
        Create new DC Slave 
15.11.20 06:37:00.591  DEBUG_INIT

# find /etc/univention/ssl -name slave300\* -ls
  2878533      4 drwxr-x---   2 slave300-s1$ DC Backup Hosts     4096 Nov 15 06:25 /etc/univention/ssl/slave300-s1.autotest300.local
  2878560      0 lrwxrwxrwx   1 root         nogroup               29 Nov 15 06:25 /etc/univention/ssl/slave300-s1 -> slave300-s1.autotest300.local

# grep cn=slave300-s3 /var/log/univention/listener.log
<EMPTY>

# cat /var/lib/univention-ldap/listener/listener
1335 cn=2012,cn=gidNumber,cn=temporary,cn=univention,dc=autotest300,dc=local a
1336 cn=2012,cn=gidNumber,cn=temporary,cn=univention,dc=autotest300,dc=local d
1337 cn=52:54:00:e0:d8:f4,cn=mac,cn=temporary,cn=univention,dc=autotest300,dc=local a
1338 cn=10.207.229.64,cn=aRecord,cn=temporary,cn=univention,dc=autotest300,dc=local a
1339 cn=uidNumber,cn=temporary,cn=univention,dc=autotest300,dc=local m
1340 cn=2012,cn=uidNumber,cn=temporary,cn=univention,dc=autotest300,dc=local d
1341 cn=slave300-s3$,cn=uid,cn=temporary,cn=univention,dc=autotest300,dc=local a
1342 cn=slave300-s3,cn=dc,cn=computers,dc=autotest300,dc=local a
1343 cn=slave300-s3$,cn=uid,cn=temporary,cn=univention,dc=autotest300,dc=local d
1344 cn=slave300-s3,cn=dc,cn=computers,dc=autotest300,dc=local m
1345 cn=slave300-s3,cn=dc,cn=computers,dc=autotest300,dc=local m
1346 relativeDomainName=slave300-s3,zoneName=autotest300.local,cn=dns,dc=autotest300,dc=local a
1347 zoneName=autotest300.local,cn=dns,dc=autotest300,dc=local m
1348 relativeDomainName=64.229,zoneName=207.10.in-addr.arpa,cn=dns,dc=autotest300,dc=local a
1349 zoneName=207.10.in-addr.arpa,cn=dns,dc=autotest300,dc=local m
1350 cn=10.207.229.64,cn=aRecord,cn=temporary,cn=univention,dc=autotest300,dc=local d
1351 cn=52:54:00:e0:d8:f4,cn=mac,cn=temporary,cn=univention,dc=autotest300,dc=local d
1352 cn=DC Slave Hosts,cn=groups,dc=autotest300,dc=local m


# cat /var/log/univention/notifier.log
15.11.20 06:25:11.908  DEBUG_INIT
15.11.20 06:37:00.646  TRANSFILE   ( ERROR   ) : ldap_sasl_interactive_bind_s(): Can't contact LDAP server
15.11.20 06:37:06.030  DEBUG_INIT

(gdb) bt full
#0  0x00007fb81ce135e3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
No locals.
#1  0x000055a3afcbffd4 in network_client_main_loop () at network.c:315
        fd = 1024
        testfds = {fds_bits = {656, 0 <repeats 15 times>}}
#2  0x000055a3afcc297f in main (argc=6, argv=0x7ffde6c44288) at univention-directory-notifier.c:237
        foreground = 1
        debug = 1

So this is Bug #51804 again.
Comment 6 Philipp Hahn univentionstaff 2020-11-21 08:27:08 CET
(In reply to Philipp Hahn from comment #5)
# /etc/init.d/univention-directory-notifier restart
# wc -l /var/lib/univention-ldap/listener/listener
0 /var/lib/univention-ldap/listener/listener
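Judging from the two listings above, the listener file held 18 pending transactions while the notifier was hung and was empty after the restart. A minimal sketch of a check for that condition follows. The function name pending_backlog_stuck and the 30-second default are hypothetical and not part of any UCS tool, and the heuristic may misfire on a busy server where new transactions arrive as fast as old ones are processed.

```shell
#!/bin/sh
# Hypothetical helper (not part of UCS): succeeds (exit 0) when FILE holds
# pending lines and the count has not shrunk after INTERVAL seconds.
# Usage: pending_backlog_stuck FILE [INTERVAL]
pending_backlog_stuck () {
	file="$1"
	interval="${2:-30}"   # seconds between the two samples (assumed default)
	# $(( ... )) strips any whitespace padding that wc may add
	before=$(($(wc -l < "$file")))
	if [ "$before" -eq 0 ]
	then
		return 1   # nothing pending, so nothing can be stuck
	fi
	sleep "$interval"
	after=$(($(wc -l < "$file")))
	[ "$after" -ge "$before" ]
}
```

It could be called as `pending_backlog_stuck /var/lib/univention-ldap/listener/listener && echo "notifier backlog is not draining"`.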
Comment 9 Philipp Hahn univentionstaff 2021-07-22 13:55:24 CEST
[phahn:~/REPOS/ucs/base/univention-ssl] 5.0-0+* 141 ± git cl -4
[5.0-0] e92d0a11bc style[ssl-download] Check also for machine.secret
 base/univention-ssl/univention-fetch-certificate | 3 +++
 1 file changed, 3 insertions(+)

[5.0-0] 4549820633 fix[ssl-download] univention-scp detection
 base/univention-ssl/debian/ucslint.overrides     | 2 ++
 base/univention-ssl/univention-fetch-certificate | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

[5.0-0] 700a44a507 style[ssl-download] shellcheck issues
 base/univention-ssl/univention-fetch-certificate | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

[5.0-0] 429be491fc fix[ssl-download] Abort after timeout
 base/univention-ssl/debian/changelog             |  6 ++++++
 base/univention-ssl/univention-fetch-certificate | 14 +++++---------
 doc/errata/staging/univention-ssl.yaml           | 10 ++++++++++
 3 files changed, 21 insertions(+), 9 deletions(-)

Package: univention-ssl
Version: 14.0.2-2A~5.0.0.202107221328
Branch: ucs_5.0-0
Scope: errata5.0-0

[5.0-0] b11b3a2d1c Bug #51776: ssl, Bug #53339: udm
 doc/errata/staging/univention-directory-manager-modules.yaml | 2 +-
 doc/errata/staging/univention-ssl.yaml                       | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
Comment 10 Florian Best univentionstaff 2021-07-23 13:45:43 CEST
OK: timeout after 10 minutes

# time univention-fetch-certificate msater.school.dev master.school.dev
Download host certificate for msater.school.dev:...............................univention-fetch-certificate: failed to get host certificate

real    10m20,468s
user    0m0,413s
sys     0m0,039s

OK: code review
~OK: YAML
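
The behaviour verified above (retry, print a progress dot, abort once a deadline has passed) can be sketched as a small POSIX shell function. This is only an illustration under stated assumptions: retry_with_timeout is a hypothetical name, and the code is not taken from univention-fetch-certificate.

```shell
#!/bin/sh
# Hypothetical helper, not the code shipped in univention-fetch-certificate.
# Usage: retry_with_timeout TIMEOUT INTERVAL COMMAND [ARGS...]
retry_with_timeout () {
	timeout="$1"    # total seconds to keep trying
	interval="$2"   # seconds to sleep between attempts
	shift 2
	waited=0
	# Run COMMAND until it succeeds or the deadline is reached
	until "$@"
	do
		if [ "$waited" -ge "$timeout" ]
		then
			echo "giving up after ${waited}s: $*" >&2
			return 1
		fi
		printf '.'   # progress marker, printed before each retry
		sleep "$interval"
		waited=$((waited + interval))
	done
	return 0
}
```

A call such as `retry_with_timeout 600 20 test -d /etc/univention/ssl/slave098.autotest098.local || exit 1` would mirror the 10-minute timeout and the `sleep 20` seen in this report. Both numbers are assumptions drawn from those observations, not from the actual script.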