Bug 30005 - endless loop in univention-join if no host certificate available
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Join (univention-join)
Version: UCS 4.3
Hardware: Other Linux
Importance: P5 normal
Target Milestone: UCS 4.3-2-errata
Assigned To: Erik Damrose
QA Contact: Philipp Hahn
Depends on:
Blocks:
Reported: 2013-01-11 12:26 CET by Tim Petersen
Modified: 2018-11-28 12:10 CET (History)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 2: Will only affect a few installed domains
How will those affected feel about the bug?: 3: A User would likely not purchase the product
User Pain: 0.171
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2013010921000847
Bug group (optional): Error handling, External feedback, Troubleshooting
Max CVSS v3 score:
hahn: Patch_Available+


Attachments
Always create missing SSL certificates. (3.41 KB, patch)
2015-03-08 13:19 CET, Philipp Hahn

Description Tim Petersen univentionstaff 2013-01-11 12:26:24 CET
Seen at Ticket #2013010921000847.

univention-join executes the following code when fetching the new host certificate:

"
[...]
download_host_certificate

if [ ! -d "/etc/univention/ssl/$hostname" ] &&  [ ! -d "/etc/univention/ssl/$hostname.$domainname" ]; then
    echo "failed to get host certificate"
    failed_message "failed to get host certificate"
fi
[...]
"

The check is never reached when the certificate is not available on the remote host, because the scp call is retried in an endless loop:

"
download_host_certificate () {
    echo -n "Download host certificate "
    local HOSTPWD="/etc/machine.secret"
    local HOSTACCOUNT="$hostname\$"
    while true
    do
        univention-scp "$HOSTPWD" -q -r \
            "$HOSTACCOUNT@$DCNAME:/etc/univention/ssl/$hostname" \
            "$HOSTACCOUNT@$DCNAME:/etc/univention/ssl/$hostname.$domainname" \
            /etc/univention/ssl/ >>/var/log/univention/join.log 2>&1
        [ -d "/etc/univention/ssl/$hostname" ] && [ -d "/etc/univention/ssl/$hostname.$domainname" ] && break
        echo -n "."
        sleep 20
    done

    echo -e "\033[60Gdone"
}
"

Because of that, no failure message ever appears in the shell; it just keeps printing dots.
Comment 1 Sebastian 2014-06-26 11:22:34 CEST
On different computers, we have the same problem, too. It seems that it happens when we re-join a system after changing its IP address.

Manually copying the certificate for that specific system from the DC was a workaround for us.
Comment 2 Florian Best univentionstaff 2014-08-26 18:09:25 CEST
Happened to me again in UCS 4.0 when joining a slave. The UMC system setup then hangs forever without giving any information. The join should fail after some time.
Comment 3 Sebastian 2015-01-05 17:51:37 CET
Still there when joining a UCS4 Memberserver.
Comment 4 Philipp Hahn univentionstaff 2015-01-06 08:46:47 CET
(In reply to Sebastian from comment #1)
> On different computers, we have the same problem, too. It seems that it
> happens when we re-join a system after changing its IP address.

This may be caused by Bug #31926, which triggers when a computer is removed and re-joined in short order.

This also happens with an older version of the univention-directory-listener (or a system upgraded from UCS-3.x), where the old cached data is not removed if a listener module fails. This then causes gencertificate.py:114 to take the "else" path instead of re-creating the missing certificate (Bug #35261).
Comment 5 Philipp Hahn univentionstaff 2015-03-08 12:57:29 CET
Happened again in a Jenkins UCS@school test, where
- the Listener was killed,
- it did not terminate because of Bug #27895,
- after the force-kill the cache already contained the new computer,
- the certificate was not created,
- the joining slave has been hanging for 2 days.

Time to debug this: 1h

gencertificate.py should be changed to *always* create any missing certificate instead of doing that only for *old_cache=None*.
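
The "always create" idea can be sketched as follows. This is only a shell illustration of the decision logic, not the code of gencertificate.py or of the attached patch; `ensure_certificate` and `create_certificate` are hypothetical names, and the stand-in merely creates a directory so that the sketch can be run anywhere.

```shell
# Hypothetical stand-in for the real certificate creation. It only
# creates the directory so that this sketch is runnable on any system.
create_certificate () {
    mkdir -p "$1"
}

# Create the certificate directory whenever it is missing, without
# consulting any cached state about the host.
# $1: base directory, $2: host name
ensure_certificate () {
    local cert_dir="$1/$2"
    if [ -d "$cert_dir" ]; then
        return 0    # already present, nothing to do
    fi
    create_certificate "$cert_dir"
}
```

The point of the sketch is that the only condition checked is whether the directory exists, so a stale cache entry cannot prevent the creation.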
Comment 6 Philipp Hahn univentionstaff 2015-03-08 13:19:09 CET
Created attachment 6749 [details]
Always create missing SSL certificates.
Comment 7 Florian Best univentionstaff 2016-09-21 19:50:45 CEST
hit me again!
Comment 8 Stefan Gohmann univentionstaff 2016-09-21 19:55:27 CEST
(In reply to Florian Best from comment #7)
> hit me again!

Version?
Comment 9 Florian Best univentionstaff 2016-09-21 19:56:01 CEST
(In reply to Stefan Gohmann from comment #8)
> (In reply to Florian Best from comment #7)
> > hit me again!
> 
> Version?

UCS 4.1-3
Comment 10 Florian Best univentionstaff 2016-10-21 14:51:22 CEST
One reason why the ssh login fails is the NSCD uid cache if the computer object was removed recently before (Bug #31926).
Comment 11 Florian Best univentionstaff 2017-11-30 15:28:28 CET
Today our UCS 4.2-3 Jenkins jobs are hanging in this loop.

root      3697  0.0  0.1  93676  7196 ?        Ss   Nov29   0:00 sshd: root@notty    
root      3699  0.0  0.0  15268  3568 ?        Ss   Nov29   0:00  \_ bash -c . utils.sh; run_setup_join_on_non_master "XXXXX"
root      3704  0.0  0.1  15520  3876 ?        S    Nov29   0:00      \_ /bin/bash /usr/lib/univention-system-setup/scripts/setup-join.sh --dcaccount Administrator --password_file /tmp/univention
root     11697  0.0  0.0  15520  2716 ?        S    Nov29   0:00          \_ /bin/bash /usr/lib/univention-system-setup/scripts/setup-join.sh --dcaccount Administrator --password_file /tmp/univention
root     11701  0.0  0.1  15908  4296 ?        S    Nov29   0:01          |   \_ /bin/bash /usr/share/univention-join/univention-join -dcaccount Administrator -dcpwd /tmp/tmp.Ctura0G3V5
root     21070  0.0  0.0   7272   652 ?        S    09:27   0:00          |       \_ sleep 20
Comment 12 Florian Best univentionstaff 2017-11-30 15:30:50 CET
Could not chdir to home directory /dev/null: Not a directory
scp: /etc/univention/ssl/slave095: No such file or directory
Comment 13 Erik Damrose univentionstaff 2018-11-27 15:53:23 CET
I added logic to univention-join that will try to download the host certificate for 10 minutes. If that does not succeed, abort the join with an error message.

ad722fbf Stop trying to download host certificate during join after 10 minutes have passed, then mark the join as failed
a1d148bc yaml

Package: univention-join
Version: 10.0.0-25A~4.3.0.201811271547
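
For illustration, a time-limited retry loop of the kind described above could look roughly like this. It is a minimal sketch and not the code from commit ad722fbf: `download_with_timeout` and `fetch_certificate` are hypothetical names, and `fetch_certificate` is only a placeholder for the univention-scp call quoted in the description.

```shell
# Hypothetical placeholder for the univention-scp call; it always fails
# here. Redefine it to perform the real download.
fetch_certificate () {
    return 1
}

# Retry fetch_certificate until the target directory exists or the
# timeout has expired.
# $1: directory that must exist on success
# $2: timeout in seconds (600 would correspond to 10 minutes)
# $3: pause between attempts in seconds
download_with_timeout () {
    local target_dir="$1"
    local timeout="$2"
    local interval="$3"
    local deadline=$(( $(date +%s) + timeout ))
    while true
    do
        # The exit status is ignored; the directory check decides.
        fetch_certificate "$target_dir" || true
        [ -d "$target_dir" ] && return 0
        # Give up once the deadline has passed, so the caller can
        # report the failure instead of waiting forever.
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        printf "."
        sleep "$interval"
    done
}
```

A caller could then run `download_with_timeout "/etc/univention/ssl/$hostname" 600 20` and print the "failed to get host certificate" message when it returns non-zero.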
Comment 14 Erik Damrose univentionstaff 2018-11-27 15:53:58 CET
I added logic to univention-join that will try to download the host certificate for 10 minutes. If that does not succeed, abort the join with an error message.

ad722fbf Stop trying to download host certificate during join after 10 minutes have passed, then mark the join as failed
a1d148bc yaml

Package: univention-join
Version: 10.0.0-25A~4.3.0.201811271547
Comment 15 Philipp Hahn univentionstaff 2018-11-27 18:30:15 CET
OK: ad722fbf
OK: a1d148bc
OK: errata-announce -V --only univention-join.yaml
OK: univention-join.yaml
OK: univention-join -dcaccount Administrator -dcpwd <(exec printf univention)
> Download host certificate: ..............................failed to get host certificate
Comment 16 Sebastian 2018-11-28 10:15:30 CET
The endless loop appears again when installing OpenProject 7.3.1 via:

univention-app install openproject

I think this app is based on a UCS 4.1 docker image, isn't it?
Comment 17 Arvid Requate univentionstaff 2018-11-28 12:10:40 CET
<http://errata.software-univention.de/ucs/4.3/340.html>