Univention Bugzilla – Bug 36044
Connection to download server updates.software-univention.de gets stuck / stalls
Last modified: 2015-11-14 19:56:27 CET
The UCS 4 installation hangs, it seems the connection to our download server is the problem: root@ucs401:~# ps -ef | grep 11967 root 11967 11966 0 03:50 tty1 00:00:00 /usr/bin/python2.7 /usr/sbin/ucr commit root 13305 11967 0 03:51 tty1 00:00:00 /usr/bin/python2.7 root 13808 13720 0 07:45 pts/0 00:00:00 grep 11967 root@ucs401:~# ps -ef | grep 13305 root 13305 11967 0 03:51 tty1 00:00:00 /usr/bin/python2.7 root 13810 13720 0 07:45 pts/0 00:00:00 grep 13305 root@ucs401:~# strace -p 13305 Process 13305 attached - interrupt to quit recvfrom(3, ^C <unfinished ...> Process 13305 detached root@ucs401:~# ls -la /proc/13305/fd total 0 dr-x------ 2 root root 0 Oct 3 07:46 . dr-xr-xr-x 9 root root 0 Oct 3 04:36 .. lr-x------ 1 root root 64 Oct 3 07:46 0 -> pipe:[243560] l-wx------ 1 root root 64 Oct 3 07:46 1 -> pipe:[243561] l-wx------ 1 root root 64 Oct 3 07:46 2 -> pipe:[241391] lrwx------ 1 root root 64 Oct 3 07:46 3 -> socket:[243573] root@ucs401:~# netstat -tupen | grep 243573 tcp 0 0 10.201.40.1:44538 176.9.114.147:80 ESTABLISHED 0 243573 13305/python2.7 root@ucs401:~# host 176.9.114.147 147.114.9.176.in-addr.arpa domain name pointer download2.software-univention.de. root@ucs401:~#
The Python module univention.updater.tools does not use urrlib2.urlopen(timeout=...), so the updater is prone to such problems. See Bug #35760 and Bug #32935 for similar issues. <https://docs.python.org/2.6/library/urllib2.html#urllib2.urlopen> As the updater still initiates a lot of HTTP requests, it's likely that it triggers some DoS prevention mechanism and new connections are refused or silently dropped. We should really re-think implementing Bug #25891 comment 1 AKA Bug #33035.
> The Python module univention.updater.tools does not use > urrlib2.urlopen(timeout=...), so the updater is prone to such problems. > See Bug #35760 and Bug #32935 for similar issues. Does the timeout argument really fix the problem? IIRC the TCP connection has been established successfully and the HTTP request has been sent to the repo server but the server does not answer. I'm not sure if the "timeout" argument does also cover the HTTP response time but only the duration of "raw TCP actions" (e.g. TCP connect).
Again: # lsof -p 2520 python2.6 2520 root 3u IPv4 13557 0t0 TCP test35.phahn.qa:40333->download2.software-univention.de:www (ESTABLISHED) $ strace -p 2520 Process 2520 attached - interrupt to quit recvfrom(3, # ps axf 2144 pts/0 S 0:00 | \_ /bin/bash /tmp/root/tmpN1BgO2/postupDsrcDx.sh 3.2-4 2161 pts/0 S 0:00 | | \_ /usr/bin/python2.6 /usr/sbin/ucr set repository/online/component/ 2520 pts/0 S 0:00 | | \_ /usr/bin/python2.6 # ip -s link show eth0 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000 link/ether 52:54:00:94:3f:71 brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped overrun mcast 1641567 11071 0 360 0 0 TX: bytes packets errors dropped carrier collsns 480976 2705 0 0 0 0
Again: # ps axf 2103 pts/0 S 0:00 | \_ /bin/sh -c univention-config-registry commit /etc/apt/sources.list.d/20_u 2104 pts/0 S 0:00 | \_ /usr/bin/python2.6 /usr/sbin/univention-config-registry commit /etc/a 2105 pts/0 S 0:00 | \_ /usr/bin/python2.6 2141 pts/0 R+ 0:00 \_ ps axf # lsof -p 2105 python2.6 2105 root 3u IPv4 12234 0t0 TCP test37.phahn.qa:40971->download2.software-univention.de:www (ESTABLISHED) # ip -s link show eth0 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000 link/ether 52:54:00:e7:6f:f1 brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped overrun mcast 1209039 12658 0 443 0 0 TX: bytes packets errors dropped carrier collsns 458056 3168 0 0 0 0 # strace -p 2105 Process 2105 attached - interrupt to quit recv(3,
Again: # ps axf 1852 pts/0 S 0:10 \_ /usr/bin/python2.6 /usr/sbin/univention-repository-create -m /cdrom -n 6671 pts/0 S 0:00 | \_ /usr/bin/python2.6 # lsof -p 6671 python2.6 6671 root 3u IPv4 20969 0t0 TCP test35.phahn.qa:48043->download2.software-univention.de:www (CLOSE_WAIT) python2.6 6671 root 4u IPv4 20976 0t0 TCP test35.phahn.qa:48044->download2.software-univention.de:www (ESTABLISHED) # ip -s link show eth0 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000 link/ether 52:54:00:94:3f:71 brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped overrun mcast 1106955 9417 0 9 0 0 TX: bytes packets errors dropped carrier collsns 802874 6369 0 0 0 0 # strace -p 6671 Process 6671 attached - interrupt to quit recvfrom(4, # strace -f ucr commit /etc/apt/mirror.list ... [pid 6874] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3 [pid 6874] connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("176.9.114.147")}, 16) = 0 [pid 6874] sendto(3, "HEAD /3.0/maintained/errata139/a"..., 273, 0, NULL, 0) = 273 [pid 6874] recvfrom(3,
Again: # ps axfwww 18938 pts/0 S 0:00 \_ /usr/bin/python2.7 /usr/sbin/ucr set repository/online/unmaintained=yes repository/online/sources=yes 18942 pts/0 S 0:00 | \_ /usr/bin/python2.7 # strace -p 18942 recvfrom(3, # lsof -p 18942 python2.7 18942 root 3u IPv4 185744 0t0 TCP h70.phahn.pt:60237->download2.software-univention.de:http ( # ip -s link show eth0 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000 link/ether 52:54:00:c2:20:d1 brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped overrun mcast 52835273 110569 0 0 0 0 TX: bytes packets errors dropped carrier collsns 32737512 110529 0 0 0 0 # netstat -tp | grep download2 tcp 0 257 h70.phahn.pt:60237 download2.software:http VERBUNDEN 18942/python2.7
Again: While provisioning a cloud-init instance, setup-join.sh calls ucr commit, which got stuck in the same way as comment 4
Short question to all: what kind of machines were affected? internal KVM? internal VMWare? EC2-Instance? Just to narrow it down, if e.g. our internal firewall cuts open connections randomly or if the download server is the culprit.
Internal machines (kvm + openstack)
Again 2015-01-29: <http://jenkins.knut.univention.de:8080/job/UCS-3.2/job/UCS-3.2-4/job/Autotest%20MultiEnv/SambaVersion=s4,Systemrolle=slave/62/console> EC2 # ps axfwwwu root 11825 0.0 0.0 12536 1992 ? S Jan28 0:00 | \_ /bin/bash 18proxy-special root 11935 0.0 0.2 31184 7856 ? S Jan28 0:00 | \_ /usr/bin/python2.6 /usr/sbin/univention-config-registry set local/repository=no repository/mirror=no repository/online=yes repository/online/server=localhost repository/online/port=80 repository/online/prefix=univention-repository proxy/username= proxy/password= proxy/address= root 11936 0.0 0.2 36268 9196 ? S Jan28 0:00 | \_ /usr/bin/python2.6 # lsof -p 11936 python2.6 11936 root 3u IPv4 42364 0t0 TCP slave095.autotest095.local:60173->download2.software-univention.de:www (ESTABLISHED) python2.6 11936 root 4u IPv4 42343 0t0 TCP slave095.autotest095.local:60172->download2.software-univention.de:www (CLOSE_WAIT) # ip -s link show eth0 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000 link/ether 02:1c:a3:87:c5:ee brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped overrun mcast 577719347 403938 0 0 0 0 TX: bytes packets errors dropped carrier collsns 18011831 216844 0 0 0 0 # netstat -tp | grep download2 tcp 1 0 slave095.autotest:60172 download2.software-:www CLOSE_WAIT 11936/python2.6 tcp 0 0 slave095.autotest:60173 download2.software-:www VERBUNDEN 11936/python2.6 # ip route get `dig +short download2.software-univention.de` 176.9.114.147 via 10.210.216.13 dev eth0 src 10.210.42.76
Again 2015-01-29 xen12 # ps axfwwwu root 7338 0.0 0.0 9244 2160 ? S Jan28 0:00 \_ /bin/bash /usr/share/univention-updater/univention-updater-check root 7461 0.0 0.1 44004 13784 ? S Jan28 0:00 \_ /usr/bin/python2.7 /usr/sbin/univention-upgrade --check --setucr root 7464 0.0 0.0 4192 544 ? S Jan28 0:00 \_ /bin/sh -c univention-config-registry commit /etc/apt/sources.list.d/20_ucs-online-component.list; LC_ALL=C apt-get update >/dev/null; LC_ALL=C apt-get -o DPkg::Options::=--force-confold -o DPkg::Options::=--force-overwrite -o DPkg::Options::=--force-overwrite-dir --trivial-only=no --assume-yes --quiet=1 -us dist-upgrade root 7465 0.0 0.1 33908 10772 ? S Jan28 0:00 \_ /usr/bin/python2.7 /usr/sbin/univention-config-registry commit /etc/apt/sources.list.d/20_ucs-online-component.list root 7466 0.0 0.1 43500 12976 ? S Jan28 0:00 \_ /usr/bin/python2.7 # strace -p 7466 recvfrom(3, # lsof -p 7466 python2.7 7466 root 3u IPv4 27593 0t0 TCP xen12.phahn.dev:60828->download2.software-univention.de:http (ESTABLISHED) # ip -s link show eth0 2: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT qlen 1000 link/ether bc:ae:c5:07:cc:5c brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped overrun mcast 109928637 641657 0 0 0 583376 TX: bytes packets errors dropped carrier collsns 6082298 37400 0 0 0 0 # ip -s link show eth1 3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT qlen 1000 link/ether bc:ae:c5:07:cc:5c brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped overrun mcast 31183460 460836 0 460836 0 460836 TX: bytes packets errors dropped carrier collsns 0 0 0 0 0 0 # ip -s link show br0 12: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT link/ether bc:ae:c5:07:cc:5c brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped overrun mcast 87682122 557449 0 3136 0 0 TX: bytes packets errors dropped carrier collsns 6086870 37430 0 0 0 0 # netstat -tp | grep download2 tcp 0 0 xen12.phahn.dev:60828 download2.software:http VERBUNDEN 7466/python2.7 # ip route get `dig +short download2.software-univention.de` 176.9.114.147 via 192.168.0.240 dev br0 src 192.168.0.135
Seen this also on machines behind a DSL connection, all instances were affected: # ps xawuf| grep update root 1545 0.0 0.0 9912 1708 pts/1 S+ 13:20 0:00 \_ grep update root 16245 0.0 0.0 4192 328 ? Ss Jan28 0:00 \_ /bin/sh -c [ -x /usr/share/univention-updater/univention-updater-check ] && /usr/sbin/jitter 600 /usr/share/univention-updater/univention-updater-check 2> /dev/null > /dev/null root 16248 0.0 0.0 9244 1116 ? S Jan28 0:00 \_ /bin/bash /usr/share/univention-updater/univention-updater-check
Created attachment 6647 [details] Collect stall info Again 2015-02-05 ... did resolve itself after several minutes.
Another one: # ~/36044_apt-stall.py # date --rfc-3339=date 2015-02-05 # ip -s link show lo 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 RX: bytes packets errors dropped overrun mcast 1015209 5926 0 0 0 0 TX: bytes packets errors dropped carrier collsns 1015209 5926 0 0 0 0 # ip -s link show eth0 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000 link/ether 52:54:00:98:2d:30 brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped overrun mcast 70815131 362830 0 960 0 0 TX: bytes packets errors dropped carrier collsns 48768986 300762 0 0 0 0 # netstat -tp # dig +short download2.software-univention.de # ip route get 176.9.114.147 176.9.114.147 via 10.200.17.1 dev eth0 src 10.200.17.20 cache ipid 0xdf99 rtt 345ms rttvar 140ms cwnd 10 176.9.114.147 # lsof wget 23663 root 5u IPv4 148889 0t0 TCP deb00351.XXX.de:58495->download2.software-univention.de:www (ESTABLISHED) # ps afwwwu 23663 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 23824 0.2 0.3 12700 3764 pts/1 Ss 14:26 0:00 -bash root 23877 0.0 0.5 27240 5160 pts/1 S+ 14:26 0:00 \_ python /root/36044_apt-stall.py root 23887 0.0 0.0 0 0 pts/1 Z+ 14:26 0:00 \_ [lsof] <defunct> root 23889 0.0 0.0 6448 740 pts/1 R+ 14:26 0:00 \_ ps afwwwu 23663 root 23663 0.0 0.1 20412 1992 ? S 14:22 0:00 wget --no-host-directories --no-cache --limit-rate=100m -t 5 -r -N -l inf -o /var/lib/univention-repository/var/index-log.0 -i /var/lib/univention-repository/var/index-urls.0 # timeout 1s strace -q -p 23663 select(6, [5], NULL, NULL, {634, 560182} <unfinished ...> # ucr search --brief --non-empty ^repository/ repository/app_center/server: appcenter.software-univention.de repository/mirror/basepath: /var/lib/univention-repository repository/mirror/recreate_packages: no repository/mirror/server: univention-repository.knut.univention.de repository/mirror/threads: 10 repository/mirror/version/end: 3.2-1 repository/mirror/version/start: 2.0-0 repository/mirror: no repository/online/component/3.1-1-errata/description: Errata updates for UCS 3.1-1 repository/online/component/3.1-1-errata/version: 3.1 repository/online/component/3.1-1-errata: enabled repository/online/component/extsec2.4/localmirror: yes repository/online/component/extsec2.4/username: 00XXX repository/online/component/extsec2.4/version: 2.4 repository/online/component/extsec2.4: no repository/online/component/extsec3.1/localmirror: yes repository/online/component/extsec3.1/username: 00XXX repository/online/component/extsec3.1/version: 3.1 repository/online/component/extsec3.1: no repository/online/component/XXX/parts: maintained repository/online/component/XXX/prefix: apt/00XXX repository/online/component/XXX/server: service.univention.de repository/online/component/XXX/username: 00XXX repository/online/component/XXX: yes repository/online/hotfixes: no repository/online/port: 80 repository/online/server: univention-repository.knut.univention.de repository/online/unmaintained: yes repository/online: yes
Once again on billy: root@billy:~# ps -ef | grep 9515 root 9515 9514 0 Jan28 ? 00:00:00 /usr/bin/python2.7 /usr/sbin/univention-config-registry commit /etc/apt/sources.list.d/20_ucs-online-component.list root 9517 9515 0 Jan28 ? 00:00:00 /usr/bin/python2.7 root 31915 31651 0 15:56 pts/0 00:00:00 grep 9515 root@billy:~# strace -f -p 9515 Process 9515 attached - interrupt to quit restart_syscall(<... resuming interrupted call ...>^C <unfinished ...> Process 9515 detached root@billy:~# strace -f -p 9517 Process 9517 attached - interrupt to quit recv(3, ^C <unfinished ...> Process 9517 detached root@billy:~# ls -la /proc/9517/fd/3 lrwx------ 1 root root 64 Feb 5 15:58 /proc/9517/fd/3 -> socket:[133057786] root@billy:~# lsof | grep 133057786 python2.7 9517 root 3u IPv4 133057786 0t0 TCP billy.knut.univention.de:33249->download2.software-univention.de:http (ESTABLISHED) root@billy:~# I think we should add a maximum timeout for every connection to the repository server. @Philipp / @Sönke, possible or do you see other options?
# date --rfc-3339=date 2015-02-16 # ip -s link show lo 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 RX: bytes packets errors dropped overrun mcast 1493577 15114 0 0 0 0 TX: bytes packets errors dropped carrier collsns 1493577 15114 0 0 0 0 # ip -s link show eth0 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000 link/ether e0:69:95:ba:9f:08 brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped overrun mcast 782962732 9669679 0 11164 0 356790 TX: bytes packets errors dropped carrier collsns 26999142 211485 0 0 0 0 # netstat -tp # dig +short download2.software-univention.de # ip route get 176.9.114.147 176.9.114.147 via 10.200.17.1 dev eth0 src 10.200.17.243 cache 176.9.114.147 # lsof # ucr search --brief --non-empty ^repository/ repository/app_center/server: appcenter.software-univention.de repository/mirror/basepath: /var/lib/univention-repository repository/mirror/recreate_packages: yes repository/mirror/threads: 10 repository/mirror: no repository/online/component/4.0-0-errata/description: Errata updates for UCS 4.0-0 repository/online/component/4.0-0-errata/version: 4.0 repository/online/component/4.0-0-errata: enabled repository/online/port: 80 repository/online/server: updates.software-univention.de repository/online/unmaintained: yes repository/online: yes Seems to have resolved itself after a very long delay. Stefan did restart the apache processes on the download server.
r58465 | Bug #36044 Updater: Set timeout of 600s for HTTP connections r58469 | Bug #36044 Updater: Catch one more socket timeout exception Added a default timeout of 600s=10m for using urllib2.openener(). Package: univention-updater Version: 10.0.53-7.1379.201502261235 Branch: ucs_4.0-0 Scope: errata4.0-1 r58466 | Bug #36044 Updater: Set timeout of 600s for HTTP connections YAML 2015-02-26-univention-updater.yaml I verified that ucs-test/tests/09/28errors is not affected by that change: There is a test which detects blocked ports. If repository/online/timeout=10 is set, the test fails as 10s is lower then the time allowed for the initial connect. All other ucs-tests/tests/09/* appeared to pass as normal. For my internal testing I do something like this: ~/limit_bandwidth -n 192.168.0.10/32 1 iptables -I INPUT 1 -s 192.168.0.10 -p tcp --sport 80 -m connbytes --connbytes-mode bytes --connbytes-dir both --connbytes 40960 -j DROP ucr set repository/online/httpmethod=GET repository/online/timeout=10 python -c 'import logging,sys;logging.basicConfig(level=logging.DEBUG, stream=sys.stderr) from univention.updater.tools import UniventionUpdater u = UniventionUpdater() u.print_version_repositories()' iptables -D INPUT 1 ~/limit_bandwidth --stop ucr unset repository/online/httpmethod repository/online/timeout Needs more work as the raised exception DownloadError() is not handled correctly: Only the single "deb" statement gets skipped, but the user has no indication that the other repositories failed to download due to a timeout condition: >INFO:updater.UCSHttp:Requesting http://univention-repository.knut.univention.de/4.0/unmaintained/4.0-1/all/Packages. gz >ERROR:updater.UCSHttp:Failed GET http://univention-repository.knut.univention.de/4.0/unmaintained/4.0-1/all/Packages .gz: <urlopen error timed out> >Traceback (most recent call last): > File "/usr/lib/pymodules/python2.7/univention/updater/tools.py", line 443, in access > res = UCSHttpServer.opener.open(req, timeout=self.timeout) > File "/usr/lib/python2.7/urllib2.py", line 401, in open > response = self._open(req, data) > File "/usr/lib/python2.7/urllib2.py", line 419, in _open > '_open', req) > File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain > result = func(*args) > File "/usr/lib/python2.7/urllib2.py", line 1211, in http_open > return self.do_open(httplib.HTTPConnection, req) > File "/usr/lib/python2.7/urllib2.py", line 1181, in do_open > raise URLError(err) >URLError: <urlopen error timed out>
r58487 | Bug #36044 updater: Make timeout for fatal Switch to ConfigurationError Package: univention-updater Version: 10.0.53-8.1380.201502261547 Branch: ucs_4.0-0 Scope: errata4.0-1 r58488 | Bug #36044 updater: Make timeout for fatal YAML
The updater network connection no longer stall. $ cat /etc/apt/sources.list.d/15_ucs-online-version.list ... # ConfigurationError: Configuration error: timeout in network connection
REOPEN: code change → python code is OK, sent hints for typos in UCR description to phahn OK: functional test OK: changelog ??: YAML (currently ok; UCR description fix is currently missing)
(In reply to Sönke Schwardt-Krummrich from comment #20) > REOPEN: code change > → python code is OK, sent hints for typos in UCR description to phahn r58789 | Bug #36044: Reword UCR variable description > OK: functional test > OK: changelog > ??: YAML (currently ok; UCR description fix is currently missing) r58798 | Bug #36044: Reword UCR variable description YAML
OK: code change OK: functional test OK: changelog OK: YAML
http://errata.univention.de/ucs/4.0/106.html
*** Bug 35952 has been marked as a duplicate of this bug. ***