Bug 52196 – stunnel stops working and hangs indefinitely

Bug 52196 - stunnel stops working and hangs indefinitely


Summary:	stunnel stops working and hangs indefinitely

Status:	CLOSED FIXED

Product:	UCS
Classification:	Unclassified
Component:	SAML
Version:	UCS 4.4
Hardware:	Other Linux

Importance:	P5 normal (vote)
Target Milestone:	UCS 4.4-7-errata
Assigned To:	Julia Bremer
QA Contact:	Felix Botner

URL:
Keywords:

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2020-10-08 09:50 CEST by Jürn Brodersen
Modified:	2021-01-27 17:36 CET (History)
CC List:	8 users (show)

See Also:
What kind of report is it?:	Bug Report
What type of bug is this?:	5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?:	2: Will only affect a few installed domains
How will those affected feel about the bug?:	5: Blocking further progress on the daily work
User Pain:	0.286
Enterprise Customer affected?:
School Customer affected?:	Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:	2020092921000621, 2020110521000251
Bug group (optional):
Max CVSS v3 score:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jürn Brodersen

2020-10-08 09:50:34 CEST

stunnel stops working and hangs indefinitely

In multiple customer environments the stunnel service stopped working.


Impact:
Users need to relogin at seemingly random while their saml session should still be valid.

Reason:
The saml sessions are not synchronised between the Backup Nodes and Primary Node any more due to the hanging stunnel.

Announce:
This is especially problematic since the hanging stunnel is not detected automatically, restarting the service does not work and but systemctl exits successfully (Note: systemd restarts the init script, which technically was sucessful).

Work around:
Kill the stunnel process with a SIGKILL signal (kill -9 $STUNNEL4_PID)

Comment 1 Jürn Brodersen

2020-10-08 09:51:40 CEST

gdb backtrace:


#0  0x00007f7909cd6470 in futex_wait (private=<optimized out>, expected=0, futex_word=0x5624fd61381c) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=0, futex_word=0x5624fd61381c) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_rwlock_wrlock_slow (rwlock=0x5624fd613810) at pthread_rwlock_wrlock.c:67
#3  0x00007f790a8d36e9 in CRYPTO_THREAD_write_lock () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00005624fc0f9eff in log_flush (new_mode=LOG_MODE_ERROR) at log.c:121
#5  0x00005624fc0f4e96 in main (argc=<optimized out>, argv=<optimized out>) at ui_unix.c:59

Comment 2 Jürn Brodersen

2020-10-08 09:52:28 CEST

lsof:


COMMAND   PID    USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
stunnel4 1429 samlcgi  cwd    DIR              254,0     4096        2 /
stunnel4 1429 samlcgi  rtd    DIR              254,0     4096        2 /
stunnel4 1429 samlcgi  txt    REG              254,0   172176 44582789 /usr/bin/stunnel4
stunnel4 1429 samlcgi  mem    REG              254,0    79936 48234581 /lib/x86_64-linux-gnu/libgpg-error.so.0.21.0
stunnel4 1429 samlcgi  mem    REG              254,0   468920 48234560 /lib/x86_64-linux-gnu/libpcre.so.3.13.3
stunnel4 1429 samlcgi  mem    REG              254,0  1112184 48234546 /lib/x86_64-linux-gnu/libgcrypt.so.20.1.6
stunnel4 1429 samlcgi  mem    REG              254,0    72024 44565974 /usr/lib/x86_64-linux-gnu/liblz4.so.1.7.1
stunnel4 1429 samlcgi  mem    REG              254,0   154376 48234558 /lib/x86_64-linux-gnu/liblzma.so.5.2.2
stunnel4 1429 samlcgi  mem    REG              254,0    31744 48234521 /lib/x86_64-linux-gnu/librt-2.24.so
stunnel4 1429 samlcgi  mem    REG              254,0   155400 48234562 /lib/x86_64-linux-gnu/libselinux.so.1
stunnel4 1429 samlcgi  mem    REG              254,0  1689360 48234504 /lib/x86_64-linux-gnu/libc-2.24.so
stunnel4 1429 samlcgi  mem    REG              254,0   135440 48234519 /lib/x86_64-linux-gnu/libpthread-2.24.so
stunnel4 1429 samlcgi  mem    REG              254,0    39784 48239423 /lib/x86_64-linux-gnu/libwrap.so.0.7.6
stunnel4 1429 samlcgi  mem    REG              254,0    89064 48234511 /lib/x86_64-linux-gnu/libnsl-2.24.so
stunnel4 1429 samlcgi  mem    REG              254,0    10688 48234523 /lib/x86_64-linux-gnu/libutil-2.24.so
stunnel4 1429 samlcgi  mem    REG              254,0    14640 48234507 /lib/x86_64-linux-gnu/libdl-2.24.so
stunnel4 1429 samlcgi  mem    REG              254,0  2715840 44565546 /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
stunnel4 1429 samlcgi  mem    REG              254,0   442984 44565547 /usr/lib/x86_64-linux-gnu/libssl.so.1.1
stunnel4 1429 samlcgi  mem    REG              254,0   153288 48234499 /lib/x86_64-linux-gnu/ld-2.24.so
stunnel4 1429 samlcgi  mem    REG              254,0   557552 48234585 /lib/x86_64-linux-gnu/libsystemd.so.0.17.0
stunnel4 1429 samlcgi    0u   CHR                1,3      0t0     1028 /dev/null
stunnel4 1429 samlcgi    1u   CHR                1,3      0t0     1028 /dev/null
stunnel4 1429 samlcgi    2u   CHR                1,3      0t0     1028 /dev/null
stunnel4 1429 samlcgi    4r  FIFO               0,10      0t0    18738 pipe
stunnel4 1429 samlcgi    5w  FIFO               0,10      0t0    18738 pipe
stunnel4 1429 samlcgi    6u  unix 0xffff924574129400      0t0    18750 type=DGRAM

Comment 3 Ingo Steuwer

2020-10-09 17:15:16 CEST

Is this only in one customer environment? Might networking problems be the root cause, or is it a timeout?

If a timeout occures we might simply force a restart of stunnel on a daily base?

Comment 4 Christina Scheinig

2020-10-09 17:53:42 CEST

(In reply to Ingo Steuwer from comment #3)
> Is this only in one customer environment? Might networking problems be the
> root cause, or is it a timeout?
> 
> If a timeout occures we might simply force a restart of stunnel on a daily
> base?

No, we found this in three different school environments.

Comment 10 Nico Gulden

2020-11-02 09:59:43 CET

I reduced the feeling to "A pain …", because a workaround exists. I don't think that affected customers would return the product because of this issue.

Comment 11 Christina Scheinig

2020-11-10 16:41:11 CET

I have now a very good description of the impact of this issue.

We observe more and more(?) that after about 3 minutes a session is terminated even though there is active work going on and the session should be renewed automatically, because the timeout should only take effect in case of inactivity.

Comment 12 Florian Best

2020-11-10 18:08:42 CET

(In reply to Nico Gulden from comment #10)
> I reduced the feeling to "A pain …", because a workaround exists. 
What is the workaround?

> I don't
> think that affected customers would return the product because of this issue.
The user pain was not set to "return the product" but to "would likely not purchase".
And I think that this is blocking further progress on the daily work because it's only possible to have 5 minutes lasting sessions then.

Comment 13 Nico Gulden

2020-11-11 10:37:07 CET

(In reply to Florian Best from comment #12)
> (In reply to Nico Gulden from comment #10)
> > I reduced the feeling to "A pain …", because a workaround exists. 
> What is the workaround?

See the bug description:

> Work around:
> Kill the stunnel process with a SIGKILL signal (kill -9 $STUNNEL4_PID)

Comment 14 Ingo Steuwer

2020-11-11 15:27:08 CET

(In reply to Christina Scheinig from comment #11)
> I have now a very good description of the impact of this issue.
> 
> We observe more and more(?) that after about 3 minutes a session is
> terminated even though there is active work going on and the session should
> be renewed automatically, because the timeout should only take effect in
> case of inactivity.

How often does the problem occure? Is it still the case that package upgrades cause stunnel to stop, or does it happen more often?

Comment 15 Christina Scheinig

2020-11-11 15:56:11 CET

(In reply to Ingo Steuwer from comment #14)
> (In reply to Christina Scheinig from comment #11)
> > I have now a very good description of the impact of this issue.
> > 
> > We observe more and more(?) that after about 3 minutes a session is
> > terminated even though there is active work going on and the session should
> > be renewed automatically, because the timeout should only take effect in
> > case of inactivity.
> 
> How often does the problem occure? Is it still the case that package
> upgrades cause stunnel to stop, or does it happen more often?

I cannot say.

This was reported from our IT now, and restarting by killing the process immediately solves the issue.

I saw on one of the customer servers (master), that the stunnel was hanging again and the sockets from the other backups are missing, but I don't know if it was after some updates.

At the moment, I don't have more Tickets with this issue. (all together 4 different environments)

Comment 16 Ingo Steuwer

2020-11-12 16:34:02 CET

(In reply to Christina Scheinig from comment #15)
> At the moment, I don't have more Tickets with this issue. (all together 4
> different environments)

We do automated updates in our internal environment, so updates might be the trigger. Do we know from the other environments if updates are automated?

Comment 18 Felix Botner

2020-12-22 17:36:25 CET

No luck so far in reproducing this, but

we could add a (nagios?) check like this 

-> eval "$(ucr shell domainname)"
-> echo "stats" | openssl s_client -connect "ucs-sso.${domainname}:11212" \
  -cert "/etc/univention/ssl/ucs-sso.${domainname}/cert.pem" \
  -key "/etc/univention/ssl/ucs-sso.${domainname}/private.key
...
STAT evictions 0
STAT reclaimed 0
STAT crawler_reclaimed 0
STAT crawler_items_checked 0
STAT lrutail_reflocked 0
END
DONE

to check the stunnel (openssl s_client) and memcached (echo stats) part.

And we should think about importing a new stunnel version from debian. We currently use the debian/strech version 

  stunnel4 (3:5.39-2) 

But there is a stretch-backports version - 3:5.50-1 - with a lot of "upstream fixes"

Comment 19 Felix Botner

2020-12-23 14:15:30 CET

(In reply to Felix Botner from comment #18)
> No luck so far in reproducing this, but
> 
> we could add a (nagios?) check like this 
> 
> -> eval "$(ucr shell domainname)"
> -> echo "stats" | openssl s_client -connect "ucs-sso.${domainname}:11212" \
>   -cert "/etc/univention/ssl/ucs-sso.${domainname}/cert.pem" \
>   -key "/etc/univention/ssl/ucs-sso.${domainname}/private.key
> ...
> STAT evictions 0
> STAT reclaimed 0
> STAT crawler_reclaimed 0
> STAT crawler_items_checked 0
> STAT lrutail_reflocked 0
> END
> DONE

hint from Michael

we should use ucs/server/sso/fqdn for the server (-connect) and add two optional UCR variables for -cert and -key (default should be /etc/univention/ssl/${ucs_server_sso_fqdn}/cert.pem|private.key)

Comment 20 Felix Botner

2021-01-14 18:32:12 CET

imported/built stunnel (3:)5.50-1 from strech-backports

Successful build
Package: stunnel4
Version: 3:5.50-1~bpo9+1A~4.4.0.202101141824
Branch: ucs_4.4-0
Scope: errata4.4-7

Comment 21 Erik Damrose

2021-01-19 18:55:42 CET

QA just for the stunnel4 import:
OK: patch for debhelper to enable build for UCS 4, disable tests
OK: ~We need the some fix in u-saml so that stunnel4 is restarted in u-saml.postinst. The stunnel4 package does not restart the service on update. But a u-saml fix is pending at this bug, thus OK
OK: saml session via stunnel works with new package version
OK: stunnel4 3:5.50-1~bpo9+1A~4.4.0.202101141824
OK: yaml

"Verified" for this part. Missing is a systemd watchdog to monitor and restart potential hanging stunnel4 services

Comment 22 Julia Bremer

2021-01-19 23:42:05 CET

Package: univention-saml
Version: 6.0.3-3A~4.4.0.202101192328
Branch: ucs_4.4-0
Scope: errata4.4-7

14b297c46e Bug #52196: yaml
2ff0062cf3 Bug #52196 create service that monitors stunnel4 and kills it if it hangs.


I created a systemd service which is supposed to kill the stunnel4 daemon if it does not respond any more, and is marked as active.
Since the Bug description said that only kill -9 was effective, I used that in my script and configured the stunnel4 to always restart.
I added two ucr variables:ucs/server/sso/key and ucs/server/sso/cert, so that the paths can be adjusted 52196#c20.

Comment 23 Julia Bremer

2021-01-20 10:27:18 CET

ab8b065013 Bug #52196: ucslint fix add copyright

Package: univention-saml
Version: 6.0.3-4A~4.4.0.202101201025
Branch: ucs_4.4-0
Scope: errata4.4-7

Comment 24 Felix Botner

2021-01-20 11:56:12 CET

[1] univention-stunnel4-watchdog

should we move the "eval "$(ucr shell)" to outside of the loop?

[2] univention-stunnel4-watchdog

> ucs_server_sso_key="/etc/univention/ssl/${ucs_server_sso_fqdn}/secret.key"

this should be ...private.key not secret key

[3]

Is this test correct (timeout 5 bash ...)? Currently it seems that we only check the timeout and not if the openssl/echo command really works (as seen in [2])

[4]

Should we move "systemctl is-active --quiet "stunnel4.service"" to before the test?

Comment 25 Felix Botner

2021-01-20 12:09:42 CET

[5] please make that sleep timeout configurable via ucr

Comment 26 Julia Bremer

2021-01-20 13:53:26 CET

Package: univention-saml
Version: 6.0.3-6A~4.4.0.202101201351
Branch: ucs_4.4-0
Scope: errata4.4-7


33ddb200f0 Bug #52196: yaml
37da33f7a3 Bug #52196: make sleep configurable, cleanup watchdog-service

I incorporated your feedback. I hope this is ok now.

Comment 27 Felix Botner

2021-01-20 14:44:39 CET

sorry forgot one point

please update the stunnel4 dependency to 

stunnel4 (>=3:5.50-1),

Comment 28 Julia Bremer

2021-01-20 15:01:58 CET

Successful build
Package: univention-saml
Version: 6.0.3-7A~4.4.0.202101201457
Branch: ucs_4.4-0
Scope: errata4.4-7

647386d764 Bug #52196: yaml bump
04075973c0 Bug #52196: bump stunnel4 dependency

done

Comment 29 Felix Botner

2021-01-20 15:42:04 CET

sorry ...

we should check with

timeout 5 bash -c "echo Q | openssl s_client -connect localhost:11212"

(without checking memcached, just stunnel)

Comment 30 Julia Bremer

2021-01-20 15:55:56 CET

Package: univention-saml
Version: 6.0.3-8A~4.4.0.202101201553
Branch: ucs_4.4-0
Scope: errata4.4-7
3703c2fb0c Bug #52196: yaml bump
ae35d72cd5 Bug #52196: Only check stunnel, not memcached. Remove now unnecessary ucr vars

And again :)

Comment 31 Felix Botner

2021-01-20 21:38:04 CET

(1)
sorry, this is a never ending story, the check with localhost does not work (as it returns 1)

echo "Q" | openssl s_client -connect "localhost:11212"
echo $?
1

seems that we have to use

echo "Q" | openssl s_client -connect "ucs-sso.${domainname}:11212" -cert "/etc/univention/ssl/ucs-sso.${domainname}/cert.pem" -key "/etc/univention/ssl/ucs-sso.${domainname}/private.key" 2>/dev/null 1>/dev/null
echo $?
0

or we scrap that return code and just go for the timeout?

(2)

in my test env with a backup i have the problem that stunnel is "broken" after the reboot


systemctl status stunnel4.service 
● stunnel4.service - LSB: Start or stop stunnel 4.x (TLS tunnel for network daemons)
   Loaded: loaded (/etc/init.d/stunnel4; generated; vendor preset: enabled)
  Drop-In: /lib/systemd/system/stunnel4.service.d
           └─10-watchdog.conf
   Active: failed (Result: exit-code) since Wed 2021-01-20 21:14:37 CET; 3min 2s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 1984 ExecStop=/etc/init.d/stunnel4 stop (code=exited, status=0/SUCCESS)
  Process: 2219 ExecStart=/etc/init.d/stunnel4 start (code=exited, status=1/FAILURE)
      CPU: 24ms

Jan 20 21:14:36 master stunnel4[2219]: You should check that you have specified the pid= in you configuration file
Jan 20 21:14:36 master systemd[1]: Failed to start LSB: Start or stop stunnel 4.x (TLS tunnel for network daemons).
Jan 20 21:14:36 master systemd[1]: stunnel4.service: Unit entered failed state.
Jan 20 21:14:36 master systemd[1]: stunnel4.service: Failed with result 'exit-code'.
Jan 20 21:14:37 master systemd[1]: stunnel4.service: Service hold-off time over, scheduling restart.
Jan 20 21:14:37 master systemd[1]: Stopped LSB: Start or stop stunnel 4.x (TLS tunnel for network daemons).
Jan 20 21:14:37 master systemd[1]: stunnel4.service: Start request repeated too quickly.
Jan 20 21:14:37 master systemd[1]: Failed to start LSB: Start or stop stunnel 4.x (TLS tunnel for network daemons).
Jan 20 21:14:37 master systemd[1]: stunnel4.service: Unit entered failed state.
Jan 20 21:14:37 master systemd[1]: stunnel4.service: Failed with result 'exit-code'.

during the boot the watchdog killed stunnel, but why?

systemctl status univention-stunnel4-watchdog.service 
● univention-stunnel4-watchdog.service - Restarting stunnel4 if it hangs (Bug 52196)
   Loaded: loaded (/lib/systemd/system/univention-stunnel4-watchdog.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2021-01-20 21:14:02 CET; 4min 15s ago
 Main PID: 453 (univention-stun)
    Tasks: 2 (limit: 4915)
   Memory: 3.1M
      CPU: 2.262s
   CGroup: /system.slice/univention-stunnel4-watchdog.service
           ├─ 453 /bin/bash /etc/init.d/univention-stunnel4-watchdog start
           └─6546 sleep 30

Jan 20 21:14:02 master systemd[1]: Started Restarting stunnel4 if it hangs (Bug 52196).
Jan 20 21:14:35 master univention-stunnel4-watchdog[453]: Killing stunnel4 to restart, it seems to be hanging. See https://forge.univention.org/bugzilla/show_bug.cgi?id=52196

this needs some investigation, maybe it is just an artifact of problem (1)

Comment 32 Julia Bremer

2021-01-22 13:11:47 CET

Package: univention-saml
Version: 6.0.3-14A~4.4.0.202101221305
Branch: ucs_4.4-0
Scope: errata4.4-7


4003953f57 Bug #52196: yaml update
15d647560b Bug #52196: cleanup
8090439572 Bug #52196: changelog
2c1831d46a Bug #52196: Do not remove memcached.socket after killing the service
693d6fd7aa Bug #52196: fix typo
f5d8b7dc63 Bug #52196: rm sockets files of stunnel4 after kill or else it doesn't start any more
89214c47b1 Bug #52196: correct stunnel4 version

Many encountered problems later, we changed the test, 
added an ucr variable to activate the watchdog,
ucr set ucs/server/sso/stunnel4/watchdog/active=true  and
service univention-stunnel4-watchdog restart.

We had to remove the socket in /var/run/univention-saml, because killing the stunnel4 meant that that socket was left over and stunnel4 wouldn't start any more.
We are still not sure if this watchdog correctly identifies the problems mentioned here, since we werenÄt able to reproduce this exact problem.

Comment 33 Julia Bremer

2021-01-22 15:52:06 CET

integrated further qa feedback,
We dont use is_ucr_true anymore to keep the used ressources as low as possible.
this means, the watchdog can only be activated by setting 
ucs/server/sso/stunnel4/watchdog/active=true
not yes.

univention-saml (6.0.3-15)
404c10b818ce | Bug #52196: further cleanup, dont use is_ucr_true to keep ressources used minimal

Comment 34 Felix Botner

2021-01-25 09:23:52 CET

OK - watchdog disabled by default, can be started with 
     ucr set ucs/server/sso/stunnel4/watchdog/active=true
OK - restarts stunnel after timeout
OK - yaml

Comment 35 Erik Damrose

2021-01-27 17:36:59 CET

<https://errata.software-univention.de/#/?erratum=4.4x876>
<https://errata.software-univention.de/#/?erratum=4.4x877>