Univention Bugzilla – Bug 38984
System setup stalls - broken ssh no longer reading full pipe?
Last modified: 2023-03-25 06:50:49 CET
# S4 Slave <http://jenkins.knut.univention.de:8080/job/UCS-4.0/job/UCS-4.0-2/job/Autotest%20MultiEnv/SambaVersion=s4,Systemrolle=slave/68/> ucs-test writes to full PIPE, which sshd is not reading: @ Slave = 10.210.188.85 # ps axf 3407 ? Ss 0:00 sshd: root@notty 18133 ? Ss 0:00 \_ bash -c . utils.sh; run_setup_join_on_non_master "Univention@99" 18138 ? S 0:00 \_ /bin/bash /usr/lib/univention-system-setup/scripts/setup-join.sh --dcaccount Admi 18232 ? Sl 0:02 \_ /usr/bin/python2.7 /usr/lib/univention-system-setup/scripts/05_role/10role 7805 ? Z 0:00 \_ [10role] <defunct> # strace -p 18232 Process 18232 attached - interrupt to quit write(1, "__MSG__:Removing libobt0\n", 25^C <unfinished ...> Process 18232 detached # lsof -a -p 18232 -d 1 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME 10role 18232 root 1w FIFO 0,8 0t0 34118 pipe # lsof -d 0-65535 | grep 34118 sshd 3407 root 8r FIFO 0,8 0t0 34118 pipe bash 18133 root 1w FIFO 0,8 0t0 34118 pipe setup-joi 18138 root 1w FIFO 0,8 0t0 34118 pipe 10role 18232 root 1w FIFO 0,8 0t0 34118 pipe # lsof -a -p 3407 -d 0-65535 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME sshd 3407 root 0u CHR 1,3 0t0 5705 /dev/null sshd 3407 root 1u CHR 1,3 0t0 5705 /dev/null sshd 3407 root 2u CHR 1,3 0t0 5705 /dev/null sshd 3407 root 3u IPv4 14222 0t0 TCP unassigned-hostname.unassigned-domain:ssh->10.210.216.13:55893 (ESTABLISHED) sshd 3407 root 5r FIFO 0,8 0t0 14292 pipe sshd 3407 root 6w FIFO 0,8 0t0 14292 pipe sshd 3407 root 8r FIFO 0,8 0t0 34118 pipe sshd 3407 root 10r FIFO 0,8 0t0 34119 pipe # strace -p 3407 Process 3407 attached - interrupt to quit select(11, [3 5], [], NULL, NULL) = 1 (in [3]) @ Jenkins: $ lsof -a -p 909 -d 0-65535 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ... ucs-ec2-c 909 jenkins 10r FIFO 0,8 0t0 242771 pipe ucs-ec2-c 909 jenkins 11w FIFO 0,8 0t0 242771 pipe ucs-ec2-c 909 jenkins 12u IPv4 230926 0t0 TCP jenkins.knut.univention.de:38214->10.210.253.90:ssh (ESTABLISHED) ucs-ec2-c 909 jenkins 13u IPv4 230974 0t0 TCP jenkins.knut.univention.de:55893->10.210.188.85:ssh (ESTABLISHED) $ strace -p 909 -f Process 909 attached - interrupt to quit [pid 909] select(11, [10], [], [], {2, 493150}) = 0 (Timeout) [pid 1076] poll([{fd=12, events=POLLIN}], 1, 100) = 0 [pid 1098] poll([{fd=13, events=POLLIN}], 1, 100) = 0 $ lsof -d 0-65535 | grep 242771 ucs-ec2-c 909 jenkins 10r FIFO 0,8 0t0 242771 pipe ucs-ec2-c 909 jenkins 11w FIFO 0,8 0t0 242771 pipe Would need openssh-server-dbg package for further diagnostics.
Again seen multiple times: $ ps axf 1051 ? Ss 0:00 /usr/sbin/sshd 1137 ? Ss 0:00 \_ sshd: root@notty 1423 ? Ts 0:00 | \_ /bin/bash ./in8.sh member 1 3500 ? S 0:00 | \_ /usr/bin/python2.6 /usr/sbin/univention-upgrade --noninteractive --ignoreterm # lsof -p 1051,1137,1423,3500 -a -d 0-65535 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME sshd 1051 root 0u CHR 1,3 0t0 1365 /dev/null sshd 1051 root 1u CHR 1,3 0t0 1365 /dev/null sshd 1051 root 2u CHR 1,3 0t0 1365 /dev/null sshd 1051 root 3u IPv4 6132 0t0 TCP *:ssh (LISTEN) sshd 1051 root 4u IPv6 6134 0t0 TCP *:ssh (LISTEN) sshd 1137 root 0u CHR 1,3 0t0 1365 /dev/null sshd 1137 root 1u CHR 1,3 0t0 1365 /dev/null sshd 1137 root 2u CHR 1,3 0t0 1365 /dev/null sshd 1137 root 3u IPv6 6274 0t0 TCP [fe80::5054:ff:fee8:e5cd]:ssh->[fe80::219:66ff:fef7:46fb]:58844 (ESTABLISHED) sshd 1137 root 5r FIFO 0,8 0t0 6484 pipe sshd 1137 root 6w FIFO 0,8 0t0 6484 pipe sshd 1137 root 8r FIFO 0,8 0t0 7219 pipe sshd 1137 root 10r FIFO 0,8 0t0 7220 pipe in8.sh 1423 root 0r FIFO 0,8 0t0 7218 pipe in8.sh 1423 root 1w FIFO 0,8 0t0 7219 pipe in8.sh 1423 root 2w FIFO 0,8 0t0 7220 pipe in8.sh 1423 root 255r REG 253,0 3983 1707717 /root/in8.sh univentio 3500 root 0r FIFO 0,8 0t0 7218 pipe univentio 3500 root 1w FIFO 0,8 0t0 7219 pipe univentio 3500 root 2w FIFO 0,8 0t0 7220 pipe univentio 3500 root 3u REG 253,0 44393 2105005 /var/log/univention/updater.log # strace -p 3500 write(1, "\nStarting univention-upgrade. Cu"..., 70^C <unfinished ...> # strace -p 1137 select(11, [3 5], [], NULL, NULL^C <unfinished ...> RFA: Build a openssh package without the debugging symbols being stripped to find the code fragment of the select() call.
A SSH "transport" can contains multiple "channels", one for each "sftp-session", "shell-session", ... When executing a command that single channel is used to multiplex stdout and stderr; they can be filled independently and are buffered independently, but they share a common output buffer: If one fills the output buffer and isn't consumed by the other end, writing will block! Also ssh operates in blocking mode, so reading from an empty stream on the receiving end will also block; the read(max_buf_size) is very misleading here, as a read(4096) blocks even when only a single byte is available. I have an improved version of ucs-kvm-create in my in8 repo: ec4bc86 ucs-kvm-create: More debugging output 1b8e912 ucs-kvm-create: Make ssh channel non-blocking f1c0e7e ucs-kvm-create: Merge _ssh_exec 35c59ea ucs-kvm-create: Close server ssh connection 7880227 ucs-kvm-create: Allow non-consecutive commands bf756fb ucs-kvm-create: Parallel commands cf26c94 ucs-kvm-create: Argument quoting
Again: <http://jenkins.knut.univention.de:8080/job/UCS-4.0/job/UCS-4.0-3/job/Autotest%20MultiEnv/40/>
f57f929 ucs-kvm-create: Add --terminate support dfc847d ucs-kvm-create: Save name from ucs-kt-get 707ef83 ucs-kvm-create: Factor out ssh connect to host 72562b7 ucs-kvm-create: Code re-indent 3a946ec ucs-kvm-create: More debugging output cd70d6a ucs-kvm-create: Make ssh channel non-blocking efc8930 ucs-kvm-create: Merge _ssh_exec 35c59ea ucs-kvm-create: Close server ssh connection 7880227 ucs-kvm-create: Allow non-consecutive commands bf756fb ucs-kvm-create: Parallel commands
r63722 | Bug #38984 kt: pyflakes r63721 | Bug #38984 kt: Add --terminate support r63720 | Bug #38984 kt: Save name from ucs-kt-get r63719 | Bug #38984 kt: Factor out ssh connect to host r63718 | Bug #38984 kt: Code re-indent r63717 | Bug #38984 kt: More debugging output r63716 | Bug #38984 kt: Make ssh channel non-blocking r63715 | Bug #38984 kt: Merge _ssh_exec r63714 | Bug #38984 kt: Close server ssh connection r63713 | Bug #38984 kt: Allow non-consecutive commands r63712 | Bug #38984 kt: Parallel commands r63711 | Bug #38984 kt: Argument quoting r63710 | Bug #38984 kt: pyflakes r63709 | Bug #38984 kt: Add --terminate support r63708 | Bug #38984 kt: Save name from ucs-kt-get r63707 | Bug #38984 kt: Factor out ssh connect to host r63706 | Bug #38984 kt: Code re-indent r63705 | Bug #38984 kt: More debugging output r63704 | Bug #38984 kt: Make ssh channel non-blocking r63703 | Bug #38984 kt: Merge _ssh_exec r63702 | Bug #38984 kt: Close server ssh connection r63701 | Bug #38984 kt: Allow non-consecutive commands r63700 | Bug #38984 kt: Parallel commands r63699 | Bug #38984 kt: Argument quoting @Sönke: Have a quick look at "ucs-kvm-create --parallel --terminate"; they are optional and should break anything from your setups.
*** Bug 39356 has been marked as a duplicate of this bug. ***
r64221 | Bug #38984 EC2: Detect ucs-kt-get errors r64220 | Bug #38984 EC2: Detect ucs-kt-get errors
r64248 | Bug #38984 KVM: Fix template locking r64247 | Bug #38984 KVM: Copyright 2015 Package: ucs-kvm-testenv Version: 1.2.49-1.75.201510061224 Branch: ucs_4.0-0 Scope: internal
You added in r64221 a raise for unknown errors: --- python/ec2/lib.py (Revision 64220) +++ python/ec2/lib.py (Revision 64221) @@ -196,6 +196,7 @@ self._log('[%d]: Authentication failed... [%s]' % (timeout - now + start, ex)) except Exception, ex: self._log('[%d]: Unknown error "%s"...' % (timeout - now + start, ex)) + raise time.sleep(5) now = time.time() I don't think that is a good idea, at least some of my setups failed: [2779] Network error: [Errno 111] Connection refused [2774] Network error: [Errno 111] Connection refused [2769] Network error: [Errno 111] Connection refused [2764]: Authentication failed... [Authentication failed.] [2744]: Authentication failed... [Authentication failed.] [2736]: Authentication failed... [Authentication failed.] [2729]: Authentication failed... [Authentication failed.] [2721]: Authentication failed... [Authentication failed.] [2714]: Authentication failed... [Authentication failed.] [2706]: Authentication failed... [Authentication failed.] [2699]: Authentication failed... [Authentication failed.] [2691]: Authentication failed... [Authentication failed.] [2684]: Authentication failed... [Authentication failed.] [2676]: Authentication failed... [Authentication failed.] [2669]: Authentication failed... [Authentication failed.] [2661]: Unknown error "Error reading SSH protocol banner"... I've removed the raise in UCS 4.1: r65518. Please have a closer look.
(In reply to Stefan Gohmann from comment #9) > You added in r64221 a raise for unknown errors: ... > + raise ... > I've removed the raise in UCS 4.1: r65518. Please have a closer look. Please don't ignore arbitrary errors. > [2661]: Unknown error "Error reading SSH protocol banner"... r65577 | Bug #38984: Revert "Don't fail during an unknown SSH error" Ignore only SSHException
(In reply to Philipp Hahn from comment #10) > (In reply to Stefan Gohmann from comment #9) > > You added in r64221 a raise for unknown errors: > ... > > + raise > ... > > I've removed the raise in UCS 4.1: r65518. Please have a closer look. > > Please don't ignore arbitrary errors. Only if you ensure you catch all possible errors and if it is scope of this bug.
For this bug is no separate QA needed.