Univention Bugzilla – Bug 52196
stunnel stops working and hangs indefinitely
Last modified: 2021-01-27 17:36:59 CET
In multiple customer environments the stunnel service stopped working.

Impact: Users need to re-login at seemingly random times while their SAML session should still be valid.

Reason: The SAML sessions are no longer synchronised between the Backup Nodes and the Primary Node due to the hanging stunnel.

Announce: This is especially problematic since the hanging stunnel is not detected automatically: restarting the service does not work, yet systemctl exits successfully (note: systemd restarts the init script, which technically was successful).

Work around:
Kill the stunnel process with a SIGKILL signal (kill -9 $STUNNEL4_PID)
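For reference, a minimal sketch of the workaround as shell commands; using pidof to find the PID is an assumption here, not part of the original report:

  # Force-kill the hanging stunnel4 process; a plain restart via the
  # init script/systemctl reports success but has no effect.
  STUNNEL4_PID="$(pidof stunnel4)"   # assumption: a single stunnel4 instance is running
  kill -9 "$STUNNEL4_PID"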
gdb backtrace:
#0  0x00007f7909cd6470 in futex_wait (private=<optimized out>, expected=0, futex_word=0x5624fd61381c) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=0, futex_word=0x5624fd61381c) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_rwlock_wrlock_slow (rwlock=0x5624fd613810) at pthread_rwlock_wrlock.c:67
#3  0x00007f790a8d36e9 in CRYPTO_THREAD_write_lock () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00005624fc0f9eff in log_flush (new_mode=LOG_MODE_ERROR) at log.c:121
#5  0x00005624fc0f4e96 in main (argc=<optimized out>, argv=<optimized out>) at ui_unix.c:59
lsof:
COMMAND  PID  USER    FD  TYPE DEVICE SIZE/OFF NODE NAME
stunnel4 1429 samlcgi cwd DIR 254,0 4096 2 /
stunnel4 1429 samlcgi rtd DIR 254,0 4096 2 /
stunnel4 1429 samlcgi txt REG 254,0 172176 44582789 /usr/bin/stunnel4
stunnel4 1429 samlcgi mem REG 254,0 79936 48234581 /lib/x86_64-linux-gnu/libgpg-error.so.0.21.0
stunnel4 1429 samlcgi mem REG 254,0 468920 48234560 /lib/x86_64-linux-gnu/libpcre.so.3.13.3
stunnel4 1429 samlcgi mem REG 254,0 1112184 48234546 /lib/x86_64-linux-gnu/libgcrypt.so.20.1.6
stunnel4 1429 samlcgi mem REG 254,0 72024 44565974 /usr/lib/x86_64-linux-gnu/liblz4.so.1.7.1
stunnel4 1429 samlcgi mem REG 254,0 154376 48234558 /lib/x86_64-linux-gnu/liblzma.so.5.2.2
stunnel4 1429 samlcgi mem REG 254,0 31744 48234521 /lib/x86_64-linux-gnu/librt-2.24.so
stunnel4 1429 samlcgi mem REG 254,0 155400 48234562 /lib/x86_64-linux-gnu/libselinux.so.1
stunnel4 1429 samlcgi mem REG 254,0 1689360 48234504 /lib/x86_64-linux-gnu/libc-2.24.so
stunnel4 1429 samlcgi mem REG 254,0 135440 48234519 /lib/x86_64-linux-gnu/libpthread-2.24.so
stunnel4 1429 samlcgi mem REG 254,0 39784 48239423 /lib/x86_64-linux-gnu/libwrap.so.0.7.6
stunnel4 1429 samlcgi mem REG 254,0 89064 48234511 /lib/x86_64-linux-gnu/libnsl-2.24.so
stunnel4 1429 samlcgi mem REG 254,0 10688 48234523 /lib/x86_64-linux-gnu/libutil-2.24.so
stunnel4 1429 samlcgi mem REG 254,0 14640 48234507 /lib/x86_64-linux-gnu/libdl-2.24.so
stunnel4 1429 samlcgi mem REG 254,0 2715840 44565546 /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
stunnel4 1429 samlcgi mem REG 254,0 442984 44565547 /usr/lib/x86_64-linux-gnu/libssl.so.1.1
stunnel4 1429 samlcgi mem REG 254,0 153288 48234499 /lib/x86_64-linux-gnu/ld-2.24.so
stunnel4 1429 samlcgi mem REG 254,0 557552 48234585 /lib/x86_64-linux-gnu/libsystemd.so.0.17.0
stunnel4 1429 samlcgi 0u CHR 1,3 0t0 1028 /dev/null
stunnel4 1429 samlcgi 1u CHR 1,3 0t0 1028 /dev/null
stunnel4 1429 samlcgi 2u CHR 1,3 0t0 1028 /dev/null
stunnel4 1429 samlcgi 4r FIFO 0,10 0t0 18738 pipe
stunnel4 1429 samlcgi 5w FIFO 0,10 0t0 18738 pipe
stunnel4 1429 samlcgi 6u unix 0xffff924574129400 0t0 18750 type=DGRAM
Is this only in one customer environment? Might networking problems be the root cause, or is it a timeout?

If a timeout occurs, we might simply force a restart of stunnel on a daily basis?
(In reply to Ingo Steuwer from comment #3)
> Is this only in one customer environment? Might networking problems be the root cause, or is it a timeout?
>
> If a timeout occurs, we might simply force a restart of stunnel on a daily basis?

No, we found this in three different school environments.
I reduced the feeling to "A pain …", because a workaround exists. I don't think that affected customers would return the product because of this issue.
I now have a very good description of the impact of this issue.

We observe more and more(?) that a session is terminated after about 3 minutes even though there is active work going on; the session should be renewed automatically, because the timeout should only take effect in case of inactivity.
(In reply to Nico Gulden from comment #10)
> I reduced the feeling to "A pain …", because a workaround exists.

What is the workaround?

> I don't think that affected customers would return the product because of this issue.

The user pain was not set to "return the product" but to "would likely not purchase". And I think that this is blocking further progress on the daily work, because sessions can then only last 5 minutes.
(In reply to Florian Best from comment #12)
> (In reply to Nico Gulden from comment #10)
> > I reduced the feeling to "A pain …", because a workaround exists.
> What is the workaround?

See the bug description:
> Work around:
> Kill the stunnel process with a SIGKILL signal (kill -9 $STUNNEL4_PID)
(In reply to Christina Scheinig from comment #11)
> I now have a very good description of the impact of this issue.
>
> We observe more and more(?) that a session is terminated after about 3 minutes even though there is active work going on; the session should be renewed automatically, because the timeout should only take effect in case of inactivity.

How often does the problem occur? Is it still the case that package upgrades cause stunnel to stop, or does it happen more often?
(In reply to Ingo Steuwer from comment #14)
> (In reply to Christina Scheinig from comment #11)
> > I now have a very good description of the impact of this issue.
> >
> > We observe more and more(?) that a session is terminated after about 3 minutes even though there is active work going on; the session should be renewed automatically, because the timeout should only take effect in case of inactivity.
>
> How often does the problem occur? Is it still the case that package upgrades cause stunnel to stop, or does it happen more often?

I cannot say. This was just reported by our IT, and restarting by killing the process immediately solves the issue. I saw on one of the customer servers (master) that stunnel was hanging again and the sockets from the other backups were missing, but I don't know whether it was after some updates. At the moment I don't have more tickets with this issue (all together 4 different environments).
(In reply to Christina Scheinig from comment #15)
> At the moment I don't have more tickets with this issue (all together 4 different environments).

We do automated updates in our internal environment, so updates might be the trigger. Do we know from the other environments if updates are automated?
No luck so far in reproducing this, but we could add a (Nagios?) check like this

-> eval "$(ucr shell domainname)"
-> echo "stats" | openssl s_client -connect "ucs-sso.${domainname}:11212" \
     -cert "/etc/univention/ssl/ucs-sso.${domainname}/cert.pem" \
     -key "/etc/univention/ssl/ucs-sso.${domainname}/private.key"
...
STAT evictions 0
STAT reclaimed 0
STAT crawler_reclaimed 0
STAT crawler_items_checked 0
STAT lrutail_reflocked 0
END
DONE

to check the stunnel (openssl s_client) and memcached (echo stats) parts.

And we should think about importing a new stunnel version from Debian. We currently use the Debian/Stretch version, stunnel4 (3:5.39-2), but there is a stretch-backports version - 3:5.50-1 - with a lot of "upstream fixes".
(In reply to Felix Botner from comment #18)
> No luck so far in reproducing this, but we could add a (Nagios?) check like this
>
> -> eval "$(ucr shell domainname)"
> -> echo "stats" | openssl s_client -connect "ucs-sso.${domainname}:11212" \
>      -cert "/etc/univention/ssl/ucs-sso.${domainname}/cert.pem" \
>      -key "/etc/univention/ssl/ucs-sso.${domainname}/private.key"
> ...
> STAT evictions 0
> STAT reclaimed 0
> STAT crawler_reclaimed 0
> STAT crawler_items_checked 0
> STAT lrutail_reflocked 0
> END
> DONE

Hint from Michael: we should use ucs/server/sso/fqdn for the server (-connect) and add two optional UCR variables for -cert and -key (the defaults should be /etc/univention/ssl/${ucs_server_sso_fqdn}/cert.pem|private.key).
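A sketch of the check adjusted per this hint, assuming the UCR variable names ucs/server/sso/cert and ucs/server/sso/key that are introduced later in this bug:

  eval "$(ucr shell ucs/server/sso/fqdn)"
  # optional overrides via UCR; fall back to the host certificate paths
  cert="$(ucr get ucs/server/sso/cert)"
  key="$(ucr get ucs/server/sso/key)"
  cert="${cert:-/etc/univention/ssl/${ucs_server_sso_fqdn}/cert.pem}"
  key="${key:-/etc/univention/ssl/${ucs_server_sso_fqdn}/private.key}"
  echo "stats" | openssl s_client \
      -connect "${ucs_server_sso_fqdn}:11212" \
      -cert "$cert" -key "$key"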
Imported/built stunnel (3:)5.50-1 from stretch-backports.

Successful build
Package: stunnel4
Version: 3:5.50-1~bpo9+1A~4.4.0.202101141824
Branch: ucs_4.4-0
Scope: errata4.4-7
QA just for the stunnel4 import:

OK: patch for debhelper to enable the build for UCS 4, disable tests
OK: we need a fix in u-saml so that stunnel4 is restarted in u-saml.postinst, because the stunnel4 package does not restart the service on update; that u-saml fix is pending at this bug, thus OK
OK: SAML session via stunnel works with the new package version
OK: stunnel4 3:5.50-1~bpo9+1A~4.4.0.202101141824
OK: yaml

"Verified" for this part. Still missing is a systemd watchdog to monitor and restart a potentially hanging stunnel4 service.
Package: univention-saml
Version: 6.0.3-3A~4.4.0.202101192328
Branch: ucs_4.4-0
Scope: errata4.4-7

14b297c46e Bug #52196: yaml
2ff0062cf3 Bug #52196 create service that monitors stunnel4 and kills it if it hangs.

I created a systemd service which is supposed to kill the stunnel4 daemon if it does not respond any more and is marked as active. Since the bug description said that only kill -9 was effective, I used that in my script and configured stunnel4 to always restart. I added two UCR variables, ucs/server/sso/key and ucs/server/sso/cert, so that the paths can be adjusted (52196#c20).
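The always-restart part can be expressed as a systemd drop-in; a minimal sketch (the drop-in name 10-watchdog.conf appears in the service status further down in this bug, but its exact content here is an assumption):

  # /lib/systemd/system/stunnel4.service.d/10-watchdog.conf (assumed content)
  [Service]
  Restart=always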
ab8b065013 Bug #52196: ucslint fix, add copyright

Package: univention-saml
Version: 6.0.3-4A~4.4.0.202101201025
Branch: ucs_4.4-0
Scope: errata4.4-7
[1] univention-stunnel4-watchdog: should we move the eval "$(ucr shell)" outside of the loop?

[2] univention-stunnel4-watchdog:
> ucs_server_sso_key="/etc/univention/ssl/${ucs_server_sso_fqdn}/secret.key"
This should be ...private.key, not secret.key.

[3] Is this test correct (timeout 5 bash ...)? Currently it seems that we only check the timeout, not whether the openssl/echo command really works (as seen in [2]).

[4] Should we move systemctl is-active --quiet "stunnel4.service" to before the test?
[5] Please make that sleep timeout configurable via UCR (a sketch addressing [1]-[5] follows below).
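For illustration, a minimal sketch of a watchdog loop that addresses QA points [1]-[5]; the interval variable name ucs/server/sso/stunnel4/watchdog/interval and the overall structure are assumptions, not the actual script from univention-saml:

  # read configuration once, outside the loop [1]
  eval "$(ucr shell ucs/server/sso/fqdn ucs/server/sso/cert ucs/server/sso/key ucs/server/sso/stunnel4/watchdog/interval)"
  cert="${ucs_server_sso_cert:-/etc/univention/ssl/${ucs_server_sso_fqdn}/cert.pem}"
  key="${ucs_server_sso_key:-/etc/univention/ssl/${ucs_server_sso_fqdn}/private.key}"   # private.key, not secret.key [2]
  interval="${ucs_server_sso_stunnel4_watchdog_interval:-30}"   # sleep configurable via UCR [5]; variable name assumed
  while true; do
      sleep "$interval"
      # skip the check unless the unit is supposed to be running [4]
      systemctl is-active --quiet stunnel4.service || continue
      # fail on the command's own exit code as well as on the timeout [3]
      if ! timeout 5 bash -c \
          "echo stats | openssl s_client -connect '${ucs_server_sso_fqdn}:11212' -cert '$cert' -key '$key'" \
          >/dev/null 2>&1; then
          echo "Killing stunnel4 to restart, it seems to be hanging."
          kill -9 "$(pidof stunnel4)"
      fi
  done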
Package: univention-saml
Version: 6.0.3-6A~4.4.0.202101201351
Branch: ucs_4.4-0
Scope: errata4.4-7

33ddb200f0 Bug #52196: yaml
37da33f7a3 Bug #52196: make sleep configurable, cleanup watchdog-service

I incorporated your feedback. I hope this is OK now.
Sorry, I forgot one point: please update the stunnel4 dependency to stunnel4 (>= 3:5.50-1).
Successful build
Package: univention-saml
Version: 6.0.3-7A~4.4.0.202101201457
Branch: ucs_4.4-0
Scope: errata4.4-7

647386d764 Bug #52196: yaml bump
04075973c0 Bug #52196: bump stunnel4 dependency

Done.
Sorry ... we should check with

  timeout 5 bash -c "echo Q | openssl s_client -connect localhost:11212"

(without checking memcached, just stunnel).
Package: univention-saml
Version: 6.0.3-8A~4.4.0.202101201553
Branch: ucs_4.4-0
Scope: errata4.4-7

3703c2fb0c Bug #52196: yaml bump
ae35d72cd5 Bug #52196: Only check stunnel, not memcached. Remove now unnecessary ucr vars

And again :)
(1) Sorry, this is a never-ending story: the check with localhost does not work (as it returns 1):

  echo "Q" | openssl s_client -connect "localhost:11212"
  echo $?
  1

It seems that we have to use

  echo "Q" | openssl s_client -connect "ucs-sso.${domainname}:11212" -cert "/etc/univention/ssl/ucs-sso.${domainname}/cert.pem" -key "/etc/univention/ssl/ucs-sso.${domainname}/private.key" 2>/dev/null 1>/dev/null
  echo $?
  0

or we scrap that return code and just go for the timeout?

(2) In my test environment with a backup I have the problem that stunnel is "broken" after the reboot:

systemctl status stunnel4.service
● stunnel4.service - LSB: Start or stop stunnel 4.x (TLS tunnel for network daemons)
   Loaded: loaded (/etc/init.d/stunnel4; generated; vendor preset: enabled)
  Drop-In: /lib/systemd/system/stunnel4.service.d
           └─10-watchdog.conf
   Active: failed (Result: exit-code) since Wed 2021-01-20 21:14:37 CET; 3min 2s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 1984 ExecStop=/etc/init.d/stunnel4 stop (code=exited, status=0/SUCCESS)
  Process: 2219 ExecStart=/etc/init.d/stunnel4 start (code=exited, status=1/FAILURE)
      CPU: 24ms

Jan 20 21:14:36 master stunnel4[2219]: You should check that you have specified the pid= in you configuration file
Jan 20 21:14:36 master systemd[1]: Failed to start LSB: Start or stop stunnel 4.x (TLS tunnel for network daemons).
Jan 20 21:14:36 master systemd[1]: stunnel4.service: Unit entered failed state.
Jan 20 21:14:36 master systemd[1]: stunnel4.service: Failed with result 'exit-code'.
Jan 20 21:14:37 master systemd[1]: stunnel4.service: Service hold-off time over, scheduling restart.
Jan 20 21:14:37 master systemd[1]: Stopped LSB: Start or stop stunnel 4.x (TLS tunnel for network daemons).
Jan 20 21:14:37 master systemd[1]: stunnel4.service: Start request repeated too quickly.
Jan 20 21:14:37 master systemd[1]: Failed to start LSB: Start or stop stunnel 4.x (TLS tunnel for network daemons).
Jan 20 21:14:37 master systemd[1]: stunnel4.service: Unit entered failed state.
Jan 20 21:14:37 master systemd[1]: stunnel4.service: Failed with result 'exit-code'.

During the boot the watchdog killed stunnel, but why?

systemctl status univention-stunnel4-watchdog.service
● univention-stunnel4-watchdog.service - Restarting stunnel4 if it hangs (Bug 52196)
   Loaded: loaded (/lib/systemd/system/univention-stunnel4-watchdog.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2021-01-20 21:14:02 CET; 4min 15s ago
 Main PID: 453 (univention-stun)
    Tasks: 2 (limit: 4915)
   Memory: 3.1M
      CPU: 2.262s
   CGroup: /system.slice/univention-stunnel4-watchdog.service
           ├─ 453 /bin/bash /etc/init.d/univention-stunnel4-watchdog start
           └─6546 sleep 30

Jan 20 21:14:02 master systemd[1]: Started Restarting stunnel4 if it hangs (Bug 52196).
Jan 20 21:14:35 master univention-stunnel4-watchdog[453]: Killing stunnel4 to restart, it seems to be hanging. See https://forge.univention.org/bugzilla/show_bug.cgi?id=52196

This needs some investigation; maybe it is just an artifact of problem (1).
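If the return code is scrapped, timeout(1) can still distinguish a hang from a normal (even non-zero) exit: it returns the dedicated status 124 only when the command had to be killed. A sketch:

  timeout 5 bash -c 'echo Q | openssl s_client -connect localhost:11212' >/dev/null 2>&1
  # exit status 124 means the 5-second timeout fired, i.e. stunnel did not answer;
  # openssl's own exit status 1 (as seen above with localhost) is ignored
  if [ $? -eq 124 ]; then
      echo "stunnel4 did not respond within 5 seconds"
  fi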
Package: univention-saml
Version: 6.0.3-14A~4.4.0.202101221305
Branch: ucs_4.4-0
Scope: errata4.4-7

4003953f57 Bug #52196: yaml update
15d647560b Bug #52196: cleanup
8090439572 Bug #52196: changelog
2c1831d46a Bug #52196: Do not remove memcached.socket after killing the service
693d6fd7aa Bug #52196: fix typo
f5d8b7dc63 Bug #52196: rm sockets files of stunnel4 after kill or else it doesn't start any more
89214c47b1 Bug #52196: correct stunnel4 version

Many encountered problems later, we changed the test and added a UCR variable to activate the watchdog:

  ucr set ucs/server/sso/stunnel4/watchdog/active=true

followed by service univention-stunnel4-watchdog restart.

We had to remove the socket in /var/run/univention-saml, because killing stunnel4 meant that the socket was left over and stunnel4 wouldn't start any more. We are still not sure whether this watchdog correctly identifies the problems mentioned here, since we weren't able to reproduce this exact problem.
Integrated further QA feedback. We don't use is_ucr_true any more, to keep the resources used as low as possible. This means the watchdog can only be activated by setting ucs/server/sso/stunnel4/watchdog/active=true, not yes.

univention-saml (6.0.3-15)
404c10b818ce | Bug #52196: further cleanup, dont use is_ucr_true to keep ressources used minimal
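For illustration, the activation check then reduces to a plain string comparison against the variable set by ucr shell, with no call into the Python-based is_ucr_true helper; a sketch, assuming the watchdog script evaluates the UCR variable this way:

  eval "$(ucr shell ucs/server/sso/stunnel4/watchdog/active)"
  # only the literal value "true" activates the watchdog; "yes" or "1" do not
  if [ "$ucs_server_sso_stunnel4_watchdog_active" = "true" ]; then
      : # run the watchdog loop
  fi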
OK - watchdog disabled by default, can be started with ucr set ucs/server/sso/stunnel4/watchdog/active=true
OK - restarts stunnel after timeout
OK - yaml
<https://errata.software-univention.de/#/?erratum=4.4x876> <https://errata.software-univention.de/#/?erratum=4.4x877>