Bug 35741

Summary: Test 25reconnect_uldap failed in jenkins
Product: UCS Test Reporter: Alexander Kramer <kramer>
Component: GeneralAssignee: Philipp Hahn <hahn>
Status: CLOSED FIXED QA Contact: Stefan Gohmann <gohmann>
Severity: normal    
Priority: P5 CC: best, gohmann
Version: unspecifiedKeywords: interim-2
Target Milestone: UCS 4.0   
Hardware: Other   
OS: Linux   
See Also: https://forge.univention.org/bugzilla/show_bug.cgi?id=34292
https://forge.univention.org/bugzilla/show_bug.cgi?id=35148
What kind of report is it?: --- What type of bug is this?: ---
Who will be affected by this bug?: --- How will those affected feel about the bug?: ---
User Pain: Enterprise Customer affected?:
School Customer affected?: ISV affected?:
Waiting Support: Flags outvoted (downgraded) after PO Review:
Ticket number: Bug group (optional):
Max CVSS v3 score:
Bug Depends on: 35841, 35852    
Bug Blocks:    
Attachments: verbose strace
verbose stdout

Description Alexander Kramer univentionstaff 2014-08-28 13:34:52 CEST
During autotest-090-master-s3 the instance got stuck while performing:
25reconnect_uldap

from the ucs-test.log:
*** BEGIN *** ['/usr/bin/python2.7', '25reconnect_uldap'] ***
4274 *** 10_ldap/25reconnect_uldap *** Test reconnect mechanism of uldap ***
Comment 1 Alexander Kramer univentionstaff 2014-08-28 14:39:59 CEST
The strace for the process give the following endless output:
select(0, NULL, NULL, NULL, {0, 4286})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
...

If you run the script from the shell it also gets stuck after printing:
************************************************************************
Test: connect, stop,  - start after 9 seconds, search
************************************************************************
[ ok ] Stopping ldap server(s): slapd ...done.
[ ok ] Starting ldap server(s): slapd ...done.
[ ok ] Checking Schema ID: ...done.

And the endless strace for this process looks the same:
select(0, NULL, NULL, NULL, {0, 5463})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
...
Comment 2 Alexander Kramer univentionstaff 2014-08-28 14:50:31 CEST
Created attachment 6077 [details]
verbose strace
Comment 3 Alexander Kramer univentionstaff 2014-08-28 14:51:00 CEST
Created attachment 6078 [details]
verbose stdout
Comment 4 Stefan Gohmann univentionstaff 2014-08-30 06:37:27 CEST
r53174: I added a missing wait to a subprocess call.
Comment 5 Philipp Hahn univentionstaff 2014-09-02 16:00:56 CEST
"gdb -p %1" shows the process with a single thread hanging in

/usr/lib/python2.7/dist-packages/ldap/ldapobject.py:766
  def reconnect(self,uri):
    while self._pending_reconnect:
      time.sleep(0.01)
    else:
      self._pending_reconnect = 1

"timeval.tv_usec = 10_000" equals those 0.01s.

Adding that .wait()ing for the restart terminates the stall, but the test fails then for the negative check (delaying 11s while only 10s are allowed).
Comment 6 Philipp Hahn univentionstaff 2014-09-03 17:33:15 CEST
python-ldap_2.4.9 broke it:
* ldapobject.ReconnectLDAPObject.reconnect() now does kind of
  an internal locking to pause other threads while reconnecting
  is pending.

UCS-4.0 uses 2.4.10, which is the latest version in Debian.
Debian-RFP 2.4.15: <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=742487>

May-be fixed by 2.4.13:
<https://mail.python.org/pipermail/python-ldap/2014q1/003327.html>
* ldapobject.ReconnectLDAPObject now uses internal class-wide
  lock for serializing reconnects
* Method signature of ReconnectLDAPObject.reconnect() changed to be able
  to call it with separate retry_max and retry_delay values

See <https://mail.python.org/pipermail/python-ldap/2013q3/003265.html> for a patch; this alone does not seem to fix the bug.

A quick test with a self-compiled 2.4.15 also failed:The process is then multi-threaded and hangs in some lock.

The thread links to <https://bugs.launchpad.net/ldapuserfolder/+bug/650371>.
Comment 7 Philipp Hahn univentionstaff 2014-09-08 13:57:57 CEST
See Bug #35841 for a patch for python-ldap.
The patch is not enough to fix the test case, as some internal working seems to have changes.

(In reply to Stefan Gohmann from comment #4)
> r53174: I added a missing wait to a subprocess call.
this patch needs to be reverted, as it breaks the whole idea of the test case: slapd is (re-)started *while* the main test case is running to test the reconnect behavior. By adding the .wait() the tests runs only after the slapd is fully working again.
Comment 8 Philipp Hahn univentionstaff 2014-09-10 09:36:40 CEST
r53510 | Bug #35741 test: 10_ldap/25reconnect_uldap: revert subprocess.call.wait
 r53174 was reverted as Bug #35841 and Bug #35852 have been fixed.

Package: ucs-test
Version: 5.0.9-1.794.201409100922
Branch: ucs_4.0-0

No changelog as this is only a revert, so no visible source code change.
Comment 9 Stefan Gohmann univentionstaff 2014-10-01 15:57:39 CEST
OK, looks good.
Comment 10 Stefan Gohmann univentionstaff 2014-11-26 06:54:47 CET
UCS 4.0-0 has been released:
 http://docs.univention.de/release-notes-4.0-0-en.html
 http://docs.univention.de/release-notes-4.0-0-de.html

If this error occurs again, please use "Clone This Bug".