Bug 39439 - Replication module does not handle delete of listener/cache/filtered objects
Replication module does not handle delete of listener/cache/filtered objects
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Listener (univention-directory-listener)
UCS 3.2
amd64 Linux
Importance: P5 normal
Target Milestone: UCS 3.2-8
Assigned To: Philipp Hahn
QA Contact: Felix Botner
https://univention.plan.io/issues/3245
Depends on:
Blocks: 40066 40067
Reported: 2015-09-30 09:40 CEST by Ingo Sieverdingbeck
Modified: 2023-03-25 06:51 CET
CC List: 5 users

Description Ingo Sieverdingbeck univentionstaff 2015-09-30 09:40:17 CEST
The listener sometimes fails to delete an object on some or all Backup or Slave servers after it has been deleted from the LDAP Master.

This behaviour occurs only sporadically. Log files are attached to the Redmine issue #3245.
Comment 1 Arvid Requate univentionstaff 2015-09-30 13:06:01 CEST
Excellent logs! Brilliant.

According to the listener.log, the replication module has been called successfully for the delete:
========================================================================
28.09.15 17:13:06.358  LISTENER    ( ALL     ) : <<<
MSGID: 192409
785107 uid=mb20150928150634629169,cn=mailboxes,ou=<some_ou>,<$ldap_base> d
[...]
28.09.15 17:13:06.465  LISTENER    ( ALL     ) : no cache entry found for uid=mb20150928150634629169,cn=mailboxes,ou=<some_ou>,<$ldap_base>
28.09.15 17:13:06.465  LISTENER    ( ALL     ) : blocking signals (was 0)
28.09.15 17:13:06.465  LISTENER    ( INFO    ) : delete handlers for uid=mb20150928150634629169,cn=mailboxes,ou=<some_ou>,<$ldap_base>
28.09.15 17:13:06.465  LISTENER    ( INFO    ) : replication: Running handler for: uid=mb20150928150634629169,cn=mailboxes,ou=<some_ou>,<$ldap_base>
28.09.15 17:13:06.466  LISTENER    ( INFO    ) : handler: replication (successful)
========================================================================

But the slave's syslog shows no DEL. Looking back in the log, I see the ADD:
========================================================================
Sep 28 17:06:35 slave01-202 slapd[1024]: conn=1134 op=253420 ADD dn="uid=mb20150928150634629169,cn=mailboxes,ou=<some_ou>,<$ldap_base>"
========================================================================

There is no DEL at 17:13:06; instead I see that this connection (1134) gets closed due to idletimeout:
========================================================================
Sep 28 17:13:06 slave01-202 slapd[1024]: conn=1134 fd=32 closed (idletimeout)
Sep 28 17:13:06 slave01-202 slapd[1024]: connection_read(32): no connection!
Sep 28 17:13:06 slave01-202 slapd[1024]: conn=1369 fd=17 ACCEPT from IP=127.0.0.1:42627 (IP=0.0.0.0:7389)
Sep 28 17:13:06 slave01-202 slapd[1024]: conn=1369 op=0 BIND dn="cn=update,<$ldap_base>" method=128
Sep 28 17:13:06 slave01-202 slapd[1024]: conn=1369 op=0 BIND dn="cn=update,<$ldap_base>" mech=SIMPLE ssf=0
Sep 28 17:13:06 slave01-202 slapd[1024]: conn=1369 op=0 RESULT tag=97 err=0 text=
========================================================================

So I assume that replication.py somehow skips the delete; I currently don't see how or why. From reading the code, I would have thought that ldap.delete_s should be called in _delete_dn_recursive and could then raise some subclass of ldap.LDAPError, which would be caught by replication.py, triggering a univention.debug.ERROR log message, a reconnect and a retry.
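
For illustration, the error handling I would have expected looks roughly like this (only a sketch, not the actual replication.py code; delete_with_retry and the connect() helper are made up for this example):

import ldap

import univention.debug as ud


def delete_with_retry(connect, dn):
    # connect() is assumed to return a bound python-ldap connection to the
    # local LDAP server. On connection loss, log an error, reconnect and
    # retry the delete once.
    try:
        connect().delete_s(dn)
    except ldap.SERVER_DOWN:
        ud.debug(ud.LISTENER, ud.ERROR, 'replication: connection lost while deleting %s, retrying' % dn)
        connect().delete_s(dn)
    except ldap.NO_SUCH_OBJECT:
        ud.debug(ud.LISTENER, ud.WARN, 'replication: %s is already gone locally' % dn)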
Comment 2 Arvid Requate univentionstaff 2015-09-30 13:09:25 CEST
I would propose to start by backporting the patches from Bug #34763; they are a good idea for enhanced replication robustness anyway and may avoid the issue reported here.
Comment 3 Philipp Hahn univentionstaff 2015-09-30 16:15:19 CEST
If you use the UCRV listener/cache/filter, entries matching the filter are NOT added to the cache. On removal of such an object, the listener has no knowledge that the entry is handled by the replication module and thus does not call it on removal.


8: 785068 cn=mb20150928150634629169,cn=uid,cn=temporary,cn=univention, a
14: 28.09.15 17:06:34.822  LISTENER    ( ALL     ) : no cache entry found for cn=mb20150928150634629169,cn=uid,cn=temporary,cn=univention,
32: 28.09.15 17:06:34.834  LISTENER    ( INFO    ) : replication: Running handler for: cn=mb20150928150634629169,cn=uid,cn=temporary,cn=univention,
33: 28.09.15 17:06:34.837  LISTENER    ( INFO    ) : handler: replication (successful)
...
2738: 785087 uid=mb20150928150634629169,cn=mailboxes, a
2744: 28.09.15 17:06:35.241  LISTENER    ( ALL     ) : no cache entry found for uid=mb20150928150634629169,cn=mailboxes,ou=fr,
2745: 28.09.15 17:06:35.241  LISTENER    ( ALL     ) : updating by DN uid=mb20150928150634629169,cn=mailboxes,ou=fr,
...
2811: 785088 cn=mb20150928150634629169,cn=uid,cn=temporary,cn=univention, d
2817: 28.09.15 17:06:35.278  LISTENER    ( ALL     ) : got 792 bytes for cn=mb20150928150634629169,cn=uid,cn=temporary,cn=univention,
2120: 28.09.15 17:06:35.286  LISTENER    ( INFO    ) : deleted from cache: cn=mb20150928150634629169,cn=uid,cn=temporary,cn=univention,
...
8981: 785107 uid=mb20150928150634629169,cn=mailboxes,ou=fr, d
8990: 28.09.15 17:13:06.464  LISTENER    ( ALL     ) : no cache entry found for uid=mb20150928150634629169,cn=mailboxes,ou=fr,
8996: 28.09.15 17:13:06.465  LISTENER    ( INFO    ) : replication: Running handler for: uid=mb20150928150634629169,cn=mailboxes,ou=fr,
8997: 28.09.15 17:13:06.466  LISTENER    ( INFO    ) : handler: replication (successful):

/usr/lib/univention-directory-listener/system/replication.py:
 0890 def handler(dn, new, listener_old, operation):
 1075                 elif old and not new:

* old=None as there is no cache entry
* new is None because the entry is deleted
* nothing to do
* fast exit with success
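
Simplified, the decisive branches behave like the following runnable sketch (illustrative only, not the module code; handle() just names the operation that would be performed):

def handle(old, new):
    # Sketch of the branch structure around replication.py line 1075; the
    # real handler performs the actual LDAP operations here.
    if new and not old:
        return 'add'
    elif new and old:
        return 'modify'
    elif old and not new:
        return 'delete'
    return 'nothing to do'


# The failing case: no cache entry and the object already deleted on the
# master, i.e. both old and new are empty -> the delete branch is never taken.
print(handle({}, {}))  # -> 'nothing to do'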


See Bug #3882 comment 1:
> The test shows a fundamental problem in the implementation: the handler is
> not called when the object is deleted!
> 
> This is caused by the fact that on delete the listener neither knows about
> the object from the cache nor is able to retrieve the data from the (remote)
> LDAP server, as the object is already deleted there. So the listener can't
> decide, if the module should be called.
> Normally src/handlers.c:880 calls cache_entry_module_present() to check, if
> the module is registered for the object.
> 
> There already is a special-handling for the "replication" module in
> handlers_delete(). Maybe make that dynamic by using a UCRV to configure a
> list of modules to always call on delete?

The replication module is always called, but it doesn't handle the case old=new=None.


(In reply to Arvid Requate from comment #2)
> I would propose to start by backporting the patches from Bug #34763; they
> are a good idea for enhanced replication robustness anyway and may avoid the
> issue reported here.
No, very unlikely; see above.
Comment 4 Ingo Steuwer univentionstaff 2015-09-30 17:09:02 CEST
(In reply to Philipp Hahn from comment #3)
> If you use the UCRV listener/cache/filter, entries matching the filter are
> NOT added to the cache. On removal of such an object, the listener has no
> knowledge that the entry is handled by the replication module and thus does
> not call it on removal.

listener/cache/filter is defined in this scenario.

[..]
> 
> /usr/lib/univention-directory-listener/system/replication.py:
>  0890 def handler(dn, new, listener_old, operation):
>  1075                 elif old and not new:
> 
> * old=None as there is no cache entry
> * new is None because the entry is deleted
> * nothing to do
> * fast exit with success

replication.py handler is defined as

def handler(dn, new, listener_old, operation)

"old" is an LDAP object that has been retrieved from local LDAP. So deletion works even if the cache (listener_old) is empty.

This is also the case here: deletion works in >90% of our tests, and the logs also contain successful deletions.
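
When the cache is empty, the module can still recover "old" from the local LDAP server, roughly like this (a simplified sketch, not the actual getOldValues() implementation):

import ldap


def get_old_values(lo, dn):
    # Look the object up on the local LDAP server, including operational
    # attributes ('+'); an empty dict means it does not exist locally.
    try:
        res = lo.search_s(dn, ldap.SCOPE_BASE, '(objectClass=*)', ['*', '+'])
    except ldap.NO_SUCH_OBJECT:
        return {}
    return res[0][1] if res else {}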
Comment 5 Philipp Hahn univentionstaff 2015-11-20 15:32:24 CET
/var/log/univention/listener.log @ backup:
23.10.15 13:42:39.615  LISTENER    ( WARN    ) : LOCAL not found: cn=9999,cn=system users,ou=fr,BASE {'matched': 'cn=system users,ou=fr,BASE', 'desc': 'No such object'}
23.10.15 15:25:57.160  LISTENER    ( WARN    ) : LOCAL server down: cn=0,cn=mailboxes,ou=fr,BASE {'desc': "Can't contact LDAP server"}


/var/log/syslog @ backup:
Oct 23 13:42:39 in8-backup-201 slapd[5346]: conn=1057 op=239999 SRCH base="cn=9999,cn=system users,ou=fr,BASE" scope=0 deref=0 filter="(objectClass=*)"
Oct 23 13:42:39 in8-backup-201 slapd[5346]: conn=1057 op=239999 SRCH attr=* +
Oct 23 13:42:39 in8-backup-201 slapd[5346]: conn=1057 op=239999 SEARCH RESULT tag=101 err=32 nentries=0 text=
Oct 23 13:42:39 in8-backup-201 slapd[5346]: conn=1057 op=240000 ADD dn="cn=9999,cn=system users,ou=fr,BASE"
Oct 23 13:42:39 in8-backup-201 slapd[5346]: conn=1057 op=240000 RESULT tag=105 err=0 text=
...
Oct 23 13:50:49 in8-backup-201 slapd[5346]: conn=1057 fd=23 closed (idletimeout)

1. New objects are created, but not cached due to filter
2. Connection times out and is shut down by LDAP server
3. One of the non-cached objects is deleted
4. replication.py's getOldValues() catches ldap.SERVER_DOWN itself and returns {} (see the sketch below)
5. listener decides to do nothing, as old={} from getOldValues() and new={} because object is deleted
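
An empty dict from getOldValues() is indistinguishable from "object does not exist locally". Roughly, the fix amounts to no longer swallowing the lost connection and instead reconnecting and retrying the lookup (a sketch under that assumption, not the actual patch; connect() is a made-up helper returning a bound connection to the local LDAP server):

import ldap


def get_old_values(connect, dn):
    # Retry the local lookup once after a reconnect instead of returning {}
    # on ldap.SERVER_DOWN, so a timed-out connection is not mistaken for a
    # missing object.
    for attempt in (1, 2):
        try:
            res = connect().search_s(dn, ldap.SCOPE_BASE, '(objectClass=*)', ['*', '+'])
            return res[0][1] if res else {}
        except ldap.SERVER_DOWN:
            if attempt == 2:
                raise
        except ldap.NO_SUCH_OBJECT:
            return {}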
Comment 6 Philipp Hahn univentionstaff 2015-11-20 17:23:04 CET
r65814 | Bug #39439 replication: Fix LDAP server timeout handling
r65815 | Bug #39439 replication: Update copyright
r65825 | Bug #39439 replication: Fix LDAP server timeout handling
r65826 | Bug #39439 replication: Fix LDAP server timeout handling

Package: univention-directory-replication
Version: 7.0.1-16.97.201511201718
Branch: ucs_3.2-0
Scope: ucs3.2-8


r65817 | Bug #39439 replication: Fix LDAP server timeout handling CL
  Some replicated objects were not deleted, when the LDAP server closed its side of the LDAP connection (<u:bug>39439</u:bug>).


Test:
# @backup
ucr set ldap/idletimeout=60 ldap/debug/level=conns listener/cache/filter='(cn=filtered)'
/etc/init.d/slapd restart
/etc/init.d/univention-directory-listener restart

udm container/cn create --set name=filtered
univention-ldapsearch -LLL cn=filtered
# dn: cn=filtered,dc=phahn,dc=qa
mv /var/lib/univention-directory-listener/cache.db.lock{,.old}
univention-directory-listener-dump | grep filtered

sleep 60 ; sleep 15
udm container/cn remove --filter name=filtered
univention-ldapsearch -LLL cn=filtered
ldapsearch -H "ldap://$(ucr get ldap/master):$(ucr get ldap/master/port)/" -ZZ -D "$(ucr get ldap/hostdn)" -y /etc/machine.secret -LLL -b "$(ucr get ldap/base)" -s one dn
univention-directory-listener-dump | grep filtered
tail /var/log/univention/listener.log
...
20.11.15 17:21:21.536  LISTENER    ( INFO    ) : delete handlers for cn=filtered,dc=phahn,dc=qa
20.11.15 17:21:21.536  LISTENER    ( INFO    ) : replication: Running handler for: cn=filtered,dc=phahn,dc=qa
20.11.15 17:21:21.537  LISTENER    ( WARN    ) : Can't contact LDAP server: retrying
20.11.15 17:21:21.537  LISTENER    ( INFO    ) : replication: Running handler for: cn=filtered,dc=phahn,dc=qa
20.11.15 17:21:21.543  LISTENER    ( PROCESS ) : LOCAL found result: cn=filtered,dc=phahn,dc=qa ['20151120161945.405538Z#000000#000#000000']
20.11.15 17:21:21.544  LISTENER    ( INFO    ) : LDAP keys=['hasSubordinates', 'entryCSN', 'cn', 'objectClass', 'univentionObjectType', 'creatorsName', 'entryUUID', 'modifiersName', 'createTimestamp', 'entryDN', 'subschemaSubentry', 'structuralObjectClass', 'modifyTimestamp']; listener keys=[]
20.11.15 17:21:21.544  LISTENER    ( INFO    ) : replication: old entries from LDAP server and Listener do not match
20.11.15 17:21:21.544  LISTENER    ( ALL     ) : replication: delete: cn=filtered,dc=phahn,dc=qa
20.11.15 17:21:21.548  LISTENER    ( INFO    ) : handler: replication (successful)
Comment 7 Philipp Hahn univentionstaff 2015-11-20 17:25:04 CET
Needs forward-port to errata4.0-4 and errata4.1-0 after QA
Comment 8 Philipp Hahn univentionstaff 2015-11-23 11:19:11 CET
r65841 | Bug #39439 repl: Reduce error level for timeout handling
 ud.PROCESS -> ud.ALL

Package: univention-directory-replication
Version: 7.0.1-17.98.201511231118
Branch: ucs_3.2-0
Scope: ucs3.2-8
Comment 9 Felix Botner univentionstaff 2015-11-23 11:49:37 CET
OK, works fine

master + slave

@slave
ucr set ldap/idletimeout='30' 
ucr set listener/cache/filter='(cn=filtered)'
ucr set ldap/debug/level=256
/etc/init.d/univention-directory-listener restart
/etc/init.d/slapd restart

@master
udm container/cn create --set name=filtered

@slave
univention-ldapsearch -LLL cn=filtered -> OK
...
# listener ldap connection timed out
slave slapd[12810]: conn=1035 fd=17 closed (idletimeout)

@master
udm container/cn remove --filter name=filtered

@slave
11:36:44.995 LISTENER    ( WARN    ) : Can't contact LDAP server: retrying
11:36:44 slave slapd[12810]: conn=1036 fd=17 ACCEPT from IP=127.0.0.1:33537
univention-ldapsearch -LLL cn=filtered -> NOTHING, OK

OK - changelog