Bug 51047 – Nutzung des Self Service gestört (Errorcode 20)

Bug 51047 - Nutzung des Self Service gestört (Errorcode 20)


Summary:	Nutzung des Self Service gestört (Errorcode 20)

Status:	CLOSED DUPLICATE of bug 50583

Product:	UCS
Classification:	Unclassified
Component:	Self Service
Version:	UCS 4.4
Hardware:	Other Linux

Importance:	P5 normal with 2 votes (vote)
Target Milestone:	UCS 4.4-4-errata
Assigned To:	Florian Best
QA Contact:	Max Pohle

URL:
Keywords:

Depends on:	51366 51367 51402
Blocks:
	Show dependency tree / graph

Reported:	2020-03-31 14:35 CEST by Dirk Schnick
Modified:	2021-03-19 12:20 CET (History)
CC List:	15 users (show)

See Also:	50583 28239
What kind of report is it?:	Bug Report
What type of bug is this?:	5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?:	3: Will affect average number of installed domains
How will those affected feel about the bug?:	3: A User would likely not purchase the product
User Pain:	0.257
Enterprise Customer affected?:	Yes
School Customer affected?:	Yes
ISV affected?:
Waiting Support:	Yes
Flags outvoted (downgraded) after PO Review:
Ticket number:	2020031921000794, 2020032721000279, 2020041621000298, 2020041721000278, 2020042221000973
Bug group (optional):
Max CVSS v3 score:

Attachments
pwchange.py (2.31 KB, text/x-python) 2020-05-12 12:04 CEST, Felix Botner	Details
repr2.py (1.73 KB, text/x-python) 2020-05-25 11:35 CEST, Felix Botner	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Dirk Schnick

2020-03-31 14:35:12 CEST

The use of the Self Service is disrupted for both tickets. Bug will be opened in consultation with Sönke.
The restart of the univention-management-console-server helps with the first issue (analysis of the second issue by Christian only at the beginning)
Customer of the first ticket has noticed that the univention-management-console-server has many and increasing connections to LDAP.

Comment 1 Ingo Steuwer

2020-04-01 08:14:01 CEST

Please add more information about the issue and make the title more descriptive.

* what happens, what was expected?

* any hints about circumstances / root cause of the issue?

Comment 2 Dirk Schnick

2020-04-14 09:37:05 CEST

Changing the password via web-interface is not possible as server says "Errorcode 20" We had a similar problem in the past, the errorcode means unknown situation, if I got the old bug (41786 https://forge.univention.org/bugzilla/show_bug.cgi?id=41786) right. Florian fixed unknown situations there; now we seem to have a new unknown situation again.
The customer has observed a larger number of half-open connections (CLOSEWAIT lsof | grep -P '\ssock\s' | wc -l == 1257) between univention-management-console-server and LDAP. A restart of this component solves the problem. However, this only represents a WO.
A restart of the univention-management-console-web-server as suggested by Sönke does not solve the problem reproducibly.
Before Corona the students changed their passwords by using the windows client, now they need to use the webinterface. Problem may exists a longer time, but was not noticed. The webinterface is currently heavily used.

Comment 3 Florian Best

2020-04-15 15:47:44 CEST

/var/log/univention/management-console-server.log
15.04.20 15:44:20.950  AUTH        ( INFO    ) : Trying to authenticate user u'J***'
15.04.20 15:44:20.950  LDAP        ( INFO    ) : uldap.search filter=(&(uid=J***)(objectClass=person)) base= scope=sub attr=['uid'] unique=1 required=0 timeout=-1 sizelimit=0
15.04.20 15:44:20.951  AUTH        ( INFO    ) : Canonicalized username: 'j***'
15.04.20 15:44:20.959  AUTH        ( INFO    ) : PAM says: 'Sie m\xc3\xbcssen Ihr Passwort sofort \xc3\xa4ndern (Passwortablauf).'
15.04.20 15:44:20.959  AUTH        ( ERROR   ) : PAM: authentication error: ('Authentifizierungstoken ist nicht mehr g\xc3\xbcltig; neues erforderlich', 12)
15.04.20 15:44:20.959  AUTH        ( INFO    ) : Das Passwort ist abgelaufen und muss erneuert werden.
15.04.20 15:44:20.960  AUTH        ( WARN    ) : Changing password failed (('Fehler beim \xc3\x84ndern des Authentifizierungstoken', 20)). Prompts: [('Current Kerberos password: ', 1)]
15.04.20 15:44:20.961  AUTH        ( ERROR   ) : Passwort ändern fehlgeschlagen. Der Grund konnte nicht festgestellt werden. Für den Fall, dass es hilft, hier die originale Fehlernachricht: Errorcode 20: Das neue Passwort konnte nicht gesetzt werden.

Comment 4 Florian Best

2020-04-15 16:28:37 CEST

I connected with pyrasite to the UMC-Server process:

>>> len(set(univention.management.console.ldap.machine_connection.im_self._LDAP__ldap_connections.values()))
826

We see, there are 826 LDAP connections, which have been established:

>>> failed = []
>>> active = 0
>>> for i in univention.management.console.ldap.machine_connection.im_self._LDAP__ldap_connections.values():
...  try:
...   i[0].whoami()
...   active += 1
...  except Exception as exc:
...   failed.append(exc)
...


>>> len(failed)
578
→ These are invalid, e.g. because the SAML message is outdated or the bind state is lost.

>>> active
252
→ These are working connections.

# lsof -p 23616 | grep 7389 | wc -l
136
hm, we see 136 ldap connections open or in CLOSE_WAIT state.

Probably the ldap connection cache is the leak, where a missing unbind() or something like this is necessary when we don't need the connection anymore. As the self service uses anonymous connections these might not be cleaned up correctly or only after 15 minutes (session timeout). Maybe too many attempts in these 15 minutes are done?

Comment 5 Ingo Steuwer

2020-04-15 17:20:01 CEST

is increasing ldap/maxopenfiles (and correspondig other limits) a possible workaround?

Comment 6 Dirk Schnick

2020-04-16 09:36:21 CEST

Another school customer reports the occurrence of the problem. Ticket and customer entries updated

Comment 7 Christina Scheinig

2020-04-16 13:43:30 CEST

ucr get ldap/maxopenfiles
8192

Which other corresponding limits do you mean?

Comment 8 Erik Damrose

2020-04-16 13:48:52 CEST

Note: It also happend on our ISV provider portal, a system which is not mainly used for user self service.

Comment 9 Ingo Steuwer

2020-04-16 14:28:10 CEST

(In reply to Christina Scheinig from comment #7)
> ucr get ldap/maxopenfiles
> 8192
> 
> Which other corresponding limits do you mean?

AFAIR:

This defines the maximum number of open files for slapd, and each LDAP connection is one of those open files. The limit for open files is not only this "internal" limit of slapd, but also defined "global" on POSIX level. I don't remember how to increase the later one.

But: increasing these limits is only a workaround.

Comment 10 Lukas Zumvorde

2020-04-16 15:46:25 CEST

Another school customer is affected (list updated).

We first thought that setting the objectClass univentionPasswordSelfService would help. However, we also noticed that after a restart of the UMC we could use the self service with a user without this objectClass. The Problem is not the objectClass.

Comment 11 Sönke Schwardt-Krummrich

2020-04-16 16:04:08 CEST

(In reply to Florian Best from comment #4)
> Probably the ldap connection cache is the leak, where a missing unbind() or
> something like this is necessary when we don't need the connection anymore.
> As the self service uses anonymous connections these might not be cleaned up
> correctly or only after 15 minutes (session timeout). Maybe too many
> attempts in these 15 minutes are done?

Yes, I assume the same reason. How can we nail that down?

Comment 13 Lukas Zumvorde

2020-04-17 13:44:23 CEST

@Nico:
sadly i don't know how to reproduce it.

Comment 14 Christina Scheinig

2020-04-23 10:32:48 CEST

And the next schoolcustomer has this issue.

Is there a recommendation to which value ldap/maxopenfiles should be set to provide it as workaround

Comment 15 Erik Damrose

2020-04-23 10:42:26 CEST

(In reply to Christina Scheinig from comment #14)
> Is there a recommendation to which value ldap/maxopenfiles should be set to
> provide it as workaround

Has anyone confirmed that raising that specific value helps to mitigate the issue? I would double the current value for testing.

Comment 16 Jair Ritter 2020-05-05 09:44:03 CEST

(In reply to Erik Damrose from comment #15)
> Has anyone confirmed that raising that specific value helps to mitigate the
> issue? I would double the current value for testing.

After changing the value we only have the problem with Error 20 every 2-3 days instead of every day.

But I suspect that this is more related to the slightly decreasing usage of the portal.

Comment 17 Felix Botner

2020-05-12 11:59:16 CEST

So finally we are able to reproduce this with the following setup

1 UCS Server (singlemaster, no samba, 4.4-4)
 * users atest1 - atest1000,
         btest1 - btest1000,
         ...

5 testclients (4GB, 2Cores) with python/selenium

Test script pwchange.py. This scrip changes the password of e.g atest1 - atest1000 in 10 threads with selenium via http://$IP/univention/self-service/#page=passwordchange


Two testruns so far. After both runs a manual pw change via UMC (the script does check if the pw works, just starts the process) resulted in 


 Benachrichtigung
 Passwort ändern fehlgeschlagen. Der Grund konnte nicht festgestellt werden. Für 
 den Fall, dass es hilft, hier die originale Fehlernachricht: Errorcode 20: Das 
 neue Passwort konnte nicht gesetzt werden.

On the UCS Server (10.200.7.160):

activated kerberos logging in /root/.krb5/config
[logging]
krb5 = 0-/FILE:/tmp/krb.log

During a pw change attempt, the log says


2020-05-12T09:39:14 searching DNS for realm FOUR.FOUR udp.kerberos -> 0
2020-05-12T09:39:14 submissing new requests to new host
2020-05-12T09:39:14 error message: Did not find a plugin for send_to_kdc: 2

!!! 2020-05-12T09:39:14 fd too large for select !!!

2020-05-12T09:39:14 searching DNS for realm FOUR.FOUR tcp.kerberos -> 0
2020-05-12T09:39:14 submissing new requests to new host
2020-05-12T09:39:14 error message: Did not find a plugin for send_to_kdc: 2
2020-05-12T09:39:14 fd too large for select
2020-05-12T09:39:14 DNS lookup failed domain: _kerberos._http.FOUR.FOUR.


It seems the error is in the communication in the kerberos lib.

heimdal: lib/krb5/send_to_kdc.c
#ifndef NO_LIMIT_FD_SETSIZE
    if (fd >= FD_SETSIZE) {
        _krb5_debug(context, 0, "fd too large for select");
        rk_closesocket(fd);
        continue;
    }
#endif

samba: ./source4/heimdal/lib/krb5/send_to_kdc.c
#ifndef NO_LIMIT_FD_SETSIZE
     if (fd >= FD_SETSIZE) {
     return -1;
     }
#endif

Both heimdal and samba-heimdal use select to monitor the open fd's. But select has the (in)famous) upper limit FD_SETSIZE (1024).

and my umc process has fas more open fd's

-> ls -al /proc/26309/fd/| wc -l
2373

-> lsof -p 26309| grep sock| wc -l

And does anybody know what "kind" of socket this is?

-> lsof -p 26309| grep sock
univentio 26309 root 2376u  sock                0,8       0t0 1605880 protocol: TCP
univentio 26309 root 2377u  sock                0,8       0t0 1605987 protocol: TCP
univentio 26309 root 2378u  sock                0,8       0t0 1606102 protocol: TCP
univentio 26309 root 2380u  sock                0,8       0t0 1606154 protocol: TCP
univentio 26309 root 2381u  sock                0,8       0t0 1606185 protocol: TCP


So next steps:
 * Where do they come from?
 * How do we get rid of those fd's?

Comment 18 Felix Botner

2020-05-12 12:04:58 CEST

Created attachment 10359 [details]
pwchange.py

Comment 38 Max Pohle

2020-05-19 23:09:01 CEST

Hi! 

Good news! I think I have a working solution to offer :)
Please tell me, what you think about that.

1. In `/usr/lib/python2.7/dist-packages/univention/management/console/pam.py` append this method to the PamAuth class:
```
class PamAuth(object):
# [...] # here may or may not be other code
    def stop(self):
        syslog.syslog("CLEANING PAM (%s)" % str(type(self.pam)))
        del self.pam
```


2. In `/usr/lib/python2.7/dist-packages/univention/management/console/auth.py` call the new method after the thread has finished (see: "Furter Explanation")
```
def __authentication_result(self, thread, result, request):
    # [...] [skipped some code here for readability]
    # Note: self.pam is of type PamAuth, not pam in this file
    #       stop() is called in `pam.py` and deletes the real PAM object there.
    self.pam.stop() # append this line 
```


# Furter Explanation

That the PAM object has to be deleted as I did, that I verified against the authors testing code of the pam-python module, see here:

    https://github.com/Ralnoc/pam-python/blob/master/src/test.py

In our product the PamAuth object has an `init` method, which creates an instance of PAM like so: `self.pam = PAM.pam()`, which is never deleted and since this is a C wrapper, it does not work well together with __del__ destructors or the python garbage collector.

Our `auth.py` creates the PamAuth object first and then calls further functions from the notifier. threads.Simple calls `self.__authenticate_thread` first and when that is finish `self.__authentication_result` is called automatically (also in case of errors). That is why I decided to put the clean code there, even though I find it a little unfortunate to distribute the creation and destruction across the hole file:
```
def authenticate(self, msg):
    # [...] [skipped some code here for readability]
    self.pam = PamAuth(locale) # << HERE
    thread = threads.Simple('pam', \
        notifier.Callback(self.__authenticate_thread, **args),
        notifier.Callback(self.__authentication_result, msg))
    thread.run()
```


With this fix I have only two file handles remaining open. New file descriptor spawn during the login, but disappear again :D

I now think, that the remaining two file descriptors may have to do with the OpenSSL connection I was talking about earlier.

Comment 41 Ingo Steuwer

2020-05-22 13:43:10 CEST

FYI: There are 1-2 "feedback@" tickets mentioning this issue per week, I don't link all of them here.

Comment 45 Ingo Steuwer

2020-06-19 17:32:41 CEST

status update: most important root causes have been fixed with Bug #51366 and Bug #51367 - please update to the latest errata releases to improve the situation.

Bug #51402 is under review.

Comment 46 Ingo Steuwer

2020-09-01 18:02:39 CEST

(In reply to Ingo Steuwer from comment #45)
> status update: most important root causes have been fixed with Bug #51366
> and Bug #51367 - please update to the latest errata releases to improve the
> situation.
> 
> Bug #51402 is under review.

All related Bugs are fixed now. From my point of view, all known causes for this bug are fixed, so we can close this one too.

Do I miss something?

Comment 47 Christina Scheinig

2020-09-02 09:44:24 CEST

(In reply to Ingo Steuwer from comment #46)
> (In reply to Ingo Steuwer from comment #45)
> > status update: most important root causes have been fixed with Bug #51366
> > and Bug #51367 - please update to the latest errata releases to improve the
> > situation.
> > 
> > Bug #51402 is under review.
> 
> All related Bugs are fixed now. From my point of view, all known causes for
> this bug are fixed, so we can close this one too.
> 
> Do I miss something?

I have not heard any complaints, but I asked my customer for his observations again. So please wait some more days.

Comment 48 Florian Best

2020-09-02 09:47:51 CEST

This has been fixed in Bug #50583.

*** This bug has been marked as a duplicate of bug 50583 ***

Comment 49 Christina Scheinig

2020-09-02 10:34:48 CEST

(In reply to Christina Scheinig from comment #47)
> (In reply to Ingo Steuwer from comment #46)
> > (In reply to Ingo Steuwer from comment #45)
> > > status update: most important root causes have been fixed with Bug #51366
> > > and Bug #51367 - please update to the latest errata releases to improve the
> > > situation.
> > > 
> > > Bug #51402 is under review.
> > 
> > All related Bugs are fixed now. From my point of view, all known causes for
> > this bug are fixed, so we can close this one too.
> > 
> > Do I miss something?
> 
> I have not heard any complaints, but I asked my customer for his
> observations again. So please wait some more days.

So the first feedback is, that the issue seems to be solved

Comment 51 Dirk Schnick

2020-09-02 16:09:15 CEST

Informed my two customers and asked if ticket can be closed. Received one feedback already -> ticket is closed.
Seems to be fixed.