Univention Bugzilla – Bug 51047
Nutzung des Self Service gestört (Errorcode 20)
Last modified: 2021-03-19 12:20:41 CET
The use of the Self Service is disrupted for both tickets. Bug will be opened in consultation with Sönke. The restart of the univention-management-console-server helps with the first issue (analysis of the second issue by Christian only at the beginning) Customer of the first ticket has noticed that the univention-management-console-server has many and increasing connections to LDAP.
Please add more information about the issue and make the title more descriptive. * what happens, what was expected? * any hints about circumstances / root cause of the issue?
Changing the password via web-interface is not possible as server says "Errorcode 20" We had a similar problem in the past, the errorcode means unknown situation, if I got the old bug (41786 https://forge.univention.org/bugzilla/show_bug.cgi?id=41786) right. Florian fixed unknown situations there; now we seem to have a new unknown situation again. The customer has observed a larger number of half-open connections (CLOSEWAIT lsof | grep -P '\ssock\s' | wc -l == 1257) between univention-management-console-server and LDAP. A restart of this component solves the problem. However, this only represents a WO. A restart of the univention-management-console-web-server as suggested by Sönke does not solve the problem reproducibly. Before Corona the students changed their passwords by using the windows client, now they need to use the webinterface. Problem may exists a longer time, but was not noticed. The webinterface is currently heavily used.
/var/log/univention/management-console-server.log 15.04.20 15:44:20.950 AUTH ( INFO ) : Trying to authenticate user u'J***' 15.04.20 15:44:20.950 LDAP ( INFO ) : uldap.search filter=(&(uid=J***)(objectClass=person)) base= scope=sub attr=['uid'] unique=1 required=0 timeout=-1 sizelimit=0 15.04.20 15:44:20.951 AUTH ( INFO ) : Canonicalized username: 'j***' 15.04.20 15:44:20.959 AUTH ( INFO ) : PAM says: 'Sie m\xc3\xbcssen Ihr Passwort sofort \xc3\xa4ndern (Passwortablauf).' 15.04.20 15:44:20.959 AUTH ( ERROR ) : PAM: authentication error: ('Authentifizierungstoken ist nicht mehr g\xc3\xbcltig; neues erforderlich', 12) 15.04.20 15:44:20.959 AUTH ( INFO ) : Das Passwort ist abgelaufen und muss erneuert werden. 15.04.20 15:44:20.960 AUTH ( WARN ) : Changing password failed (('Fehler beim \xc3\x84ndern des Authentifizierungstoken', 20)). Prompts: [('Current Kerberos password: ', 1)] 15.04.20 15:44:20.961 AUTH ( ERROR ) : Passwort ändern fehlgeschlagen. Der Grund konnte nicht festgestellt werden. Für den Fall, dass es hilft, hier die originale Fehlernachricht: Errorcode 20: Das neue Passwort konnte nicht gesetzt werden.
I connected with pyrasite to the UMC-Server process: >>> len(set(univention.management.console.ldap.machine_connection.im_self._LDAP__ldap_connections.values())) 826 We see, there are 826 LDAP connections, which have been established: >>> failed = [] >>> active = 0 >>> for i in univention.management.console.ldap.machine_connection.im_self._LDAP__ldap_connections.values(): ... try: ... i[0].whoami() ... active += 1 ... except Exception as exc: ... failed.append(exc) ... >>> len(failed) 578 → These are invalid, e.g. because the SAML message is outdated or the bind state is lost. >>> active 252 → These are working connections. # lsof -p 23616 | grep 7389 | wc -l 136 hm, we see 136 ldap connections open or in CLOSE_WAIT state. Probably the ldap connection cache is the leak, where a missing unbind() or something like this is necessary when we don't need the connection anymore. As the self service uses anonymous connections these might not be cleaned up correctly or only after 15 minutes (session timeout). Maybe too many attempts in these 15 minutes are done?
is increasing ldap/maxopenfiles (and correspondig other limits) a possible workaround?
Another school customer reports the occurrence of the problem. Ticket and customer entries updated
ucr get ldap/maxopenfiles 8192 Which other corresponding limits do you mean?
Note: It also happend on our ISV provider portal, a system which is not mainly used for user self service.
(In reply to Christina Scheinig from comment #7) > ucr get ldap/maxopenfiles > 8192 > > Which other corresponding limits do you mean? AFAIR: This defines the maximum number of open files for slapd, and each LDAP connection is one of those open files. The limit for open files is not only this "internal" limit of slapd, but also defined "global" on POSIX level. I don't remember how to increase the later one. But: increasing these limits is only a workaround.
Another school customer is affected (list updated). We first thought that setting the objectClass univentionPasswordSelfService would help. However, we also noticed that after a restart of the UMC we could use the self service with a user without this objectClass. The Problem is not the objectClass.
(In reply to Florian Best from comment #4) > Probably the ldap connection cache is the leak, where a missing unbind() or > something like this is necessary when we don't need the connection anymore. > As the self service uses anonymous connections these might not be cleaned up > correctly or only after 15 minutes (session timeout). Maybe too many > attempts in these 15 minutes are done? Yes, I assume the same reason. How can we nail that down?
@Nico: sadly i don't know how to reproduce it.
And the next schoolcustomer has this issue. Is there a recommendation to which value ldap/maxopenfiles should be set to provide it as workaround
(In reply to Christina Scheinig from comment #14) > Is there a recommendation to which value ldap/maxopenfiles should be set to > provide it as workaround Has anyone confirmed that raising that specific value helps to mitigate the issue? I would double the current value for testing.
(In reply to Erik Damrose from comment #15) > Has anyone confirmed that raising that specific value helps to mitigate the > issue? I would double the current value for testing. After changing the value we only have the problem with Error 20 every 2-3 days instead of every day. But I suspect that this is more related to the slightly decreasing usage of the portal.
So finally we are able to reproduce this with the following setup 1 UCS Server (singlemaster, no samba, 4.4-4) * users atest1 - atest1000, btest1 - btest1000, ... 5 testclients (4GB, 2Cores) with python/selenium Test script pwchange.py. This scrip changes the password of e.g atest1 - atest1000 in 10 threads with selenium via http://$IP/univention/self-service/#page=passwordchange Two testruns so far. After both runs a manual pw change via UMC (the script does check if the pw works, just starts the process) resulted in Benachrichtigung Passwort ändern fehlgeschlagen. Der Grund konnte nicht festgestellt werden. Für den Fall, dass es hilft, hier die originale Fehlernachricht: Errorcode 20: Das neue Passwort konnte nicht gesetzt werden. On the UCS Server (10.200.7.160): activated kerberos logging in /root/.krb5/config [logging] krb5 = 0-/FILE:/tmp/krb.log During a pw change attempt, the log says 2020-05-12T09:39:14 searching DNS for realm FOUR.FOUR udp.kerberos -> 0 2020-05-12T09:39:14 submissing new requests to new host 2020-05-12T09:39:14 error message: Did not find a plugin for send_to_kdc: 2 !!! 2020-05-12T09:39:14 fd too large for select !!! 2020-05-12T09:39:14 searching DNS for realm FOUR.FOUR tcp.kerberos -> 0 2020-05-12T09:39:14 submissing new requests to new host 2020-05-12T09:39:14 error message: Did not find a plugin for send_to_kdc: 2 2020-05-12T09:39:14 fd too large for select 2020-05-12T09:39:14 DNS lookup failed domain: _kerberos._http.FOUR.FOUR. It seems the error is in the communication in the kerberos lib. heimdal: lib/krb5/send_to_kdc.c #ifndef NO_LIMIT_FD_SETSIZE if (fd >= FD_SETSIZE) { _krb5_debug(context, 0, "fd too large for select"); rk_closesocket(fd); continue; } #endif samba: ./source4/heimdal/lib/krb5/send_to_kdc.c #ifndef NO_LIMIT_FD_SETSIZE if (fd >= FD_SETSIZE) { return -1; } #endif Both heimdal and samba-heimdal use select to monitor the open fd's. But select has the (in)famous) upper limit FD_SETSIZE (1024). and my umc process has fas more open fd's -> ls -al /proc/26309/fd/| wc -l 2373 -> lsof -p 26309| grep sock| wc -l And does anybody know what "kind" of socket this is? -> lsof -p 26309| grep sock univentio 26309 root 2376u sock 0,8 0t0 1605880 protocol: TCP univentio 26309 root 2377u sock 0,8 0t0 1605987 protocol: TCP univentio 26309 root 2378u sock 0,8 0t0 1606102 protocol: TCP univentio 26309 root 2380u sock 0,8 0t0 1606154 protocol: TCP univentio 26309 root 2381u sock 0,8 0t0 1606185 protocol: TCP So next steps: * Where do they come from? * How do we get rid of those fd's?
Created attachment 10359 [details] pwchange.py
Hi! Good news! I think I have a working solution to offer :) Please tell me, what you think about that. 1. In `/usr/lib/python2.7/dist-packages/univention/management/console/pam.py` append this method to the PamAuth class: ``` class PamAuth(object): # [...] # here may or may not be other code def stop(self): syslog.syslog("CLEANING PAM (%s)" % str(type(self.pam))) del self.pam ``` 2. In `/usr/lib/python2.7/dist-packages/univention/management/console/auth.py` call the new method after the thread has finished (see: "Furter Explanation") ``` def __authentication_result(self, thread, result, request): # [...] [skipped some code here for readability] # Note: self.pam is of type PamAuth, not pam in this file # stop() is called in `pam.py` and deletes the real PAM object there. self.pam.stop() # append this line ``` # Furter Explanation That the PAM object has to be deleted as I did, that I verified against the authors testing code of the pam-python module, see here: https://github.com/Ralnoc/pam-python/blob/master/src/test.py In our product the PamAuth object has an `init` method, which creates an instance of PAM like so: `self.pam = PAM.pam()`, which is never deleted and since this is a C wrapper, it does not work well together with __del__ destructors or the python garbage collector. Our `auth.py` creates the PamAuth object first and then calls further functions from the notifier. threads.Simple calls `self.__authenticate_thread` first and when that is finish `self.__authentication_result` is called automatically (also in case of errors). That is why I decided to put the clean code there, even though I find it a little unfortunate to distribute the creation and destruction across the hole file: ``` def authenticate(self, msg): # [...] [skipped some code here for readability] self.pam = PamAuth(locale) # << HERE thread = threads.Simple('pam', \ notifier.Callback(self.__authenticate_thread, **args), notifier.Callback(self.__authentication_result, msg)) thread.run() ``` With this fix I have only two file handles remaining open. New file descriptor spawn during the login, but disappear again :D I now think, that the remaining two file descriptors may have to do with the OpenSSL connection I was talking about earlier.
FYI: There are 1-2 "feedback@" tickets mentioning this issue per week, I don't link all of them here.
status update: most important root causes have been fixed with Bug #51366 and Bug #51367 - please update to the latest errata releases to improve the situation. Bug #51402 is under review.
(In reply to Ingo Steuwer from comment #45) > status update: most important root causes have been fixed with Bug #51366 > and Bug #51367 - please update to the latest errata releases to improve the > situation. > > Bug #51402 is under review. All related Bugs are fixed now. From my point of view, all known causes for this bug are fixed, so we can close this one too. Do I miss something?
(In reply to Ingo Steuwer from comment #46) > (In reply to Ingo Steuwer from comment #45) > > status update: most important root causes have been fixed with Bug #51366 > > and Bug #51367 - please update to the latest errata releases to improve the > > situation. > > > > Bug #51402 is under review. > > All related Bugs are fixed now. From my point of view, all known causes for > this bug are fixed, so we can close this one too. > > Do I miss something? I have not heard any complaints, but I asked my customer for his observations again. So please wait some more days.
This has been fixed in Bug #50583. *** This bug has been marked as a duplicate of bug 50583 ***
(In reply to Christina Scheinig from comment #47) > (In reply to Ingo Steuwer from comment #46) > > (In reply to Ingo Steuwer from comment #45) > > > status update: most important root causes have been fixed with Bug #51366 > > > and Bug #51367 - please update to the latest errata releases to improve the > > > situation. > > > > > > Bug #51402 is under review. > > > > All related Bugs are fixed now. From my point of view, all known causes for > > this bug are fixed, so we can close this one too. > > > > Do I miss something? > > I have not heard any complaints, but I asked my customer for his > observations again. So please wait some more days. So the first feedback is, that the issue seems to be solved
Informed my two customers and asked if ticket can be closed. Received one feedback already -> ticket is closed. Seems to be fixed.