Bug 38984 - System setup stalls - broken ssh no longer reading full pipe?
System setup stalls - broken ssh no longer reading full pipe?
Status: CLOSED FIXED
Product: UCS Test
Classification: Unclassified
Component: Framework
unspecified
Other Linux
: P5 normal (vote)
: ---
Assigned To: Philipp Hahn
Sönke Schwardt-Krummrich
:
: 39356 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2015-07-20 11:31 CEST by Philipp Hahn
Modified: 2023-03-25 06:50 CET (History)
2 users (show)

See Also:
What kind of report is it?: Development Internal
What type of bug is this?: ---
Who will be affected by this bug?: ---
How will those affected feel about the bug?: ---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional): Error handling, Troubleshooting
Max CVSS v3 score:
hahn: Patch_Available+


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Philipp Hahn univentionstaff 2015-07-20 11:31:55 CEST
# S4 Slave <http://jenkins.knut.univention.de:8080/job/UCS-4.0/job/UCS-4.0-2/job/Autotest%20MultiEnv/SambaVersion=s4,Systemrolle=slave/68/>

ucs-test writes to full PIPE, which sshd is not reading:

@ Slave = 10.210.188.85
# ps axf
 3407 ?        Ss     0:00 sshd: root@notty 
18133 ?        Ss     0:00  \_ bash -c . utils.sh; run_setup_join_on_non_master "Univention@99"
18138 ?        S      0:00      \_ /bin/bash /usr/lib/univention-system-setup/scripts/setup-join.sh --dcaccount Admi
18232 ?        Sl     0:02          \_ /usr/bin/python2.7 /usr/lib/univention-system-setup/scripts/05_role/10role
 7805 ?        Z      0:00              \_ [10role] <defunct>

# strace -p 18232
Process 18232 attached - interrupt to quit
write(1, "__MSG__:Removing libobt0\n", 25^C <unfinished ...>
Process 18232 detached

# lsof -a -p 18232 -d 1
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
10role  18232 root    1w  FIFO    0,8      0t0 34118 pipe

# lsof -d 0-65535 | grep 34118
sshd       3407     root    8r     FIFO                0,8      0t0      34118 pipe
bash      18133     root    1w     FIFO                0,8      0t0      34118 pipe
setup-joi 18138     root    1w     FIFO                0,8      0t0      34118 pipe
10role    18232     root    1w     FIFO                0,8      0t0      34118 pipe

# lsof -a -p 3407 -d 0-65535
COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
sshd    3407 root    0u   CHR    1,3      0t0  5705 /dev/null
sshd    3407 root    1u   CHR    1,3      0t0  5705 /dev/null
sshd    3407 root    2u   CHR    1,3      0t0  5705 /dev/null
sshd    3407 root    3u  IPv4  14222      0t0   TCP unassigned-hostname.unassigned-domain:ssh->10.210.216.13:55893 (ESTABLISHED)
sshd    3407 root    5r  FIFO    0,8      0t0 14292 pipe
sshd    3407 root    6w  FIFO    0,8      0t0 14292 pipe
sshd    3407 root    8r  FIFO    0,8      0t0 34118 pipe
sshd    3407 root   10r  FIFO    0,8      0t0 34119 pipe

# strace -p 3407
Process 3407 attached - interrupt to quit
select(11, [3 5], [], NULL, NULL)       = 1 (in [3])



@ Jenkins:
$ lsof -a -p 909 -d 0-65535
COMMAND   PID    USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
...
ucs-ec2-c 909 jenkins   10r  FIFO    0,8      0t0 242771 pipe
ucs-ec2-c 909 jenkins   11w  FIFO    0,8      0t0 242771 pipe
ucs-ec2-c 909 jenkins   12u  IPv4 230926      0t0    TCP jenkins.knut.univention.de:38214->10.210.253.90:ssh (ESTABLISHED)
ucs-ec2-c 909 jenkins   13u  IPv4 230974      0t0    TCP jenkins.knut.univention.de:55893->10.210.188.85:ssh (ESTABLISHED)

$ strace -p 909 -f
Process 909 attached - interrupt to quit
[pid   909] select(11, [10], [], [], {2, 493150})   = 0 (Timeout)
[pid  1076] poll([{fd=12, events=POLLIN}], 1, 100) = 0
[pid  1098] poll([{fd=13, events=POLLIN}], 1, 100) = 0

$ lsof -d 0-65535 | grep 242771
ucs-ec2-c   909 jenkins   10r  FIFO                0,8      0t0  242771 pipe
ucs-ec2-c   909 jenkins   11w  FIFO                0,8      0t0  242771 pipe


Would need openssh-server-dbg package for further diagnostics.
Comment 1 Philipp Hahn univentionstaff 2015-09-03 16:31:34 CEST
Again seen multiple times:

$ ps axf
 1051 ?        Ss     0:00 /usr/sbin/sshd
 1137 ?        Ss     0:00  \_ sshd: root@notty 
 1423 ?        Ts     0:00  |   \_ /bin/bash ./in8.sh member 1
 3500 ?        S      0:00  |       \_ /usr/bin/python2.6 /usr/sbin/univention-upgrade --noninteractive --ignoreterm

# lsof -p 1051,1137,1423,3500 -a -d 0-65535
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
sshd      1051 root    0u   CHR    1,3      0t0    1365 /dev/null
sshd      1051 root    1u   CHR    1,3      0t0    1365 /dev/null
sshd      1051 root    2u   CHR    1,3      0t0    1365 /dev/null
sshd      1051 root    3u  IPv4   6132      0t0     TCP *:ssh (LISTEN)
sshd      1051 root    4u  IPv6   6134      0t0     TCP *:ssh (LISTEN)
sshd      1137 root    0u   CHR    1,3      0t0    1365 /dev/null
sshd      1137 root    1u   CHR    1,3      0t0    1365 /dev/null
sshd      1137 root    2u   CHR    1,3      0t0    1365 /dev/null
sshd      1137 root    3u  IPv6   6274      0t0     TCP [fe80::5054:ff:fee8:e5cd]:ssh->[fe80::219:66ff:fef7:46fb]:58844 (ESTABLISHED)
sshd      1137 root    5r  FIFO    0,8      0t0    6484 pipe
sshd      1137 root    6w  FIFO    0,8      0t0    6484 pipe
sshd      1137 root    8r  FIFO    0,8      0t0    7219 pipe
sshd      1137 root   10r  FIFO    0,8      0t0    7220 pipe
in8.sh    1423 root    0r  FIFO    0,8      0t0    7218 pipe
in8.sh    1423 root    1w  FIFO    0,8      0t0    7219 pipe
in8.sh    1423 root    2w  FIFO    0,8      0t0    7220 pipe
in8.sh    1423 root  255r   REG  253,0     3983 1707717 /root/in8.sh
univentio 3500 root    0r  FIFO    0,8      0t0    7218 pipe
univentio 3500 root    1w  FIFO    0,8      0t0    7219 pipe
univentio 3500 root    2w  FIFO    0,8      0t0    7220 pipe
univentio 3500 root    3u   REG  253,0    44393 2105005 /var/log/univention/updater.log

# strace -p 3500
write(1, "\nStarting univention-upgrade. Cu"..., 70^C <unfinished ...>

# strace -p 1137
select(11, [3 5], [], NULL, NULL^C <unfinished ...>


RFA: Build a openssh package without the debugging symbols being stripped to find the code fragment of the select() call.
Comment 2 Philipp Hahn univentionstaff 2015-09-04 16:06:37 CEST
A SSH "transport" can contains multiple "channels", one for each "sftp-session", "shell-session", ...
When executing a command that single channel is used to multiplex stdout and stderr; they can be filled independently and are buffered independently, but they share a common output buffer: If one fills the output buffer and isn't consumed by the other end, writing will block!

Also ssh operates in blocking mode, so reading from an empty stream on the receiving end will also block; the read(max_buf_size) is very misleading here, as a read(4096) blocks even when only a single byte is available.

I have an improved version of ucs-kvm-create in my in8 repo:
ec4bc86 ucs-kvm-create: More debugging output
1b8e912 ucs-kvm-create: Make ssh channel non-blocking
f1c0e7e ucs-kvm-create: Merge _ssh_exec
35c59ea ucs-kvm-create: Close server ssh connection
7880227 ucs-kvm-create: Allow non-consecutive commands
bf756fb ucs-kvm-create: Parallel commands
cf26c94 ucs-kvm-create: Argument quoting
Comment 4 Philipp Hahn univentionstaff 2015-09-15 11:01:33 CEST
f57f929 ucs-kvm-create: Add --terminate support
dfc847d ucs-kvm-create: Save name from ucs-kt-get
707ef83 ucs-kvm-create: Factor out ssh connect to host
72562b7 ucs-kvm-create: Code re-indent
3a946ec ucs-kvm-create: More debugging output
cd70d6a ucs-kvm-create: Make ssh channel non-blocking
efc8930 ucs-kvm-create: Merge _ssh_exec
35c59ea ucs-kvm-create: Close server ssh connection
7880227 ucs-kvm-create: Allow non-consecutive commands
bf756fb ucs-kvm-create: Parallel commands
Comment 5 Philipp Hahn univentionstaff 2015-09-15 12:56:08 CEST
r63722 | Bug #38984 kt: pyflakes
r63721 | Bug #38984 kt: Add --terminate support
r63720 | Bug #38984 kt: Save name from ucs-kt-get
r63719 | Bug #38984 kt: Factor out ssh connect to host
r63718 | Bug #38984 kt: Code re-indent
r63717 | Bug #38984 kt: More debugging output
r63716 | Bug #38984 kt: Make ssh channel non-blocking
r63715 | Bug #38984 kt: Merge _ssh_exec
r63714 | Bug #38984 kt: Close server ssh connection
r63713 | Bug #38984 kt: Allow non-consecutive commands
r63712 | Bug #38984 kt: Parallel commands
r63711 | Bug #38984 kt: Argument quoting
r63710 | Bug #38984 kt: pyflakes
r63709 | Bug #38984 kt: Add --terminate support
r63708 | Bug #38984 kt: Save name from ucs-kt-get
r63707 | Bug #38984 kt: Factor out ssh connect to host
r63706 | Bug #38984 kt: Code re-indent
r63705 | Bug #38984 kt: More debugging output
r63704 | Bug #38984 kt: Make ssh channel non-blocking
r63703 | Bug #38984 kt: Merge _ssh_exec
r63702 | Bug #38984 kt: Close server ssh connection
r63701 | Bug #38984 kt: Allow non-consecutive commands
r63700 | Bug #38984 kt: Parallel commands
r63699 | Bug #38984 kt: Argument quoting

@Sönke: Have a quick look at "ucs-kvm-create --parallel --terminate"; they are optional and should break anything from your setups.
Comment 6 Philipp Hahn univentionstaff 2015-09-17 06:29:31 CEST
*** Bug 39356 has been marked as a duplicate of this bug. ***
Comment 7 Philipp Hahn univentionstaff 2015-10-05 14:24:09 CEST
r64221 | Bug #38984 EC2: Detect ucs-kt-get errors
r64220 | Bug #38984 EC2: Detect ucs-kt-get errors
Comment 8 Philipp Hahn univentionstaff 2015-10-06 12:27:30 CEST
r64248 | Bug #38984 KVM: Fix template locking
r64247 | Bug #38984 KVM: Copyright 2015

Package: ucs-kvm-testenv
Version: 1.2.49-1.75.201510061224
Branch: ucs_4.0-0
Scope: internal
Comment 9 Stefan Gohmann univentionstaff 2015-11-15 11:09:33 CET
You added in r64221 a raise for unknown errors:
--- python/ec2/lib.py   (Revision 64220)
+++ python/ec2/lib.py   (Revision 64221)
@@ -196,6 +196,7 @@
                                        self._log('[%d]: Authentication failed... [%s]' % (timeout - now + start, ex))
                                except Exception, ex:
                                        self._log('[%d]: Unknown error "%s"...' % (timeout - now + start, ex))
+                                       raise
                                time.sleep(5)
                                now = time.time()

I don't think that is a good idea, at least some of my setups failed:

[2779] Network error: [Errno 111] Connection refused
[2774] Network error: [Errno 111] Connection refused
[2769] Network error: [Errno 111] Connection refused
[2764]: Authentication failed... [Authentication failed.]
[2744]: Authentication failed... [Authentication failed.]
[2736]: Authentication failed... [Authentication failed.]
[2729]: Authentication failed... [Authentication failed.]
[2721]: Authentication failed... [Authentication failed.]
[2714]: Authentication failed... [Authentication failed.]
[2706]: Authentication failed... [Authentication failed.]
[2699]: Authentication failed... [Authentication failed.]
[2691]: Authentication failed... [Authentication failed.]
[2684]: Authentication failed... [Authentication failed.]
[2676]: Authentication failed... [Authentication failed.]
[2669]: Authentication failed... [Authentication failed.]
[2661]: Unknown error "Error reading SSH protocol banner"...

I've removed the raise in UCS 4.1: r65518. Please have a closer look.
Comment 10 Philipp Hahn univentionstaff 2015-11-16 14:13:37 CET
(In reply to Stefan Gohmann from comment #9)
> You added in r64221 a raise for unknown errors:
...
> +                                       raise
...
> I've removed the raise in UCS 4.1: r65518. Please have a closer look.

Please don't ignore arbitrary errors.

> [2661]: Unknown error "Error reading SSH protocol banner"...

r65577 | Bug #38984: Revert "Don't fail during an unknown SSH error"
 Ignore only SSHException
Comment 11 Stefan Gohmann univentionstaff 2015-11-16 17:36:48 CET
(In reply to Philipp Hahn from comment #10)
> (In reply to Stefan Gohmann from comment #9)
> > You added in r64221 a raise for unknown errors:
> ...
> > +                                       raise
> ...
> > I've removed the raise in UCS 4.1: r65518. Please have a closer look.
> 
> Please don't ignore arbitrary errors.

Only if you ensure you catch all possible errors and if it is scope of this bug.
Comment 12 Stefan Gohmann univentionstaff 2016-10-12 07:48:05 CEST
For this bug is no separate QA needed.