Univention Bugzilla – Bug 39439
Replication module does not handle delete of listener/cache/filtered objects
Last modified: 2023-03-25 06:51:30 CET
The listener sometimes does not always delete an object on all or some backup or Slave servers after it has been deleted from the LDAP Master. This behaviour occurs only sporadically. Log files are attached to the Redmine issue #3245.
Excellent logs! Brilliant. According to the listener.log the replication module has been called successfully for the delete ======================================================================== 28.09.15 17:13:06.358 LISTENER ( ALL ) : <<< MSGID: 192409 785107 uid=mb20150928150634629169,cn=mailboxes,ou=<some_ou>,<$ldap_base> d [...] 28.09.15 17:13:06.465 LISTENER ( ALL ) : no cache entry found for uid=mb20150928150634629169,cn=mailboxes,ou=<some_ou>,<$ldap_base> 28.09.15 17:13:06.465 LISTENER ( ALL ) : blocking signals (was 0) 28.09.15 17:13:06.465 LISTENER ( INFO ) : delete handlers for uid=mb20150928150634629169,cn=mailboxes,ou=<some_ou>,<$ldap_base> 28.09.15 17:13:06.465 LISTENER ( INFO ) : replication: Running handler for: uid=mb20150928150634629169,cn=mailboxes,ou=<some_ou>,<$ldap_base> 28.09.15 17:13:06.466 LISTENER ( INFO ) : handler: replication (successful) ======================================================================== But the slaves syslog shows no DEL. Looking back in the log I see the ADD: ======================================================================== Sep 28 17:06:35 slave01-202 slapd[1024]: conn=1134 op=253420 ADD dn="uid=mb20150928150634629169,cn=mailboxes,ou=<some_ou>,<$ldap_base>" ======================================================================== While there is no DEL at 17:13:06 instead I see that this connection (1134) gets closed due to idletimeout: ======================================================================== Sep 28 17:13:06 slave01-202 slapd[1024]: conn=1134 fd=32 closed (idletimeout) Sep 28 17:13:06 slave01-202 slapd[1024]: connection_read(32): no connection! Sep 28 17:13:06 slave01-202 slapd[1024]: conn=1369 fd=17 ACCEPT from IP=127.0.0.1:42627 (IP=0.0.0.0:7389) Sep 28 17:13:06 slave01-202 slapd[1024]: conn=1369 op=0 BIND dn="cn=update,<$ldap_base>" method=128 Sep 28 17:13:06 slave01-202 slapd[1024]: conn=1369 op=0 BIND dn="cn=update,<$ldap_base>" mech=SIMPLE ssf=0 Sep 28 17:13:06 slave01-202 slapd[1024]: conn=1369 op=0 RESULT tag=97 err=0 text= ======================================================================== So I assume that replication.py somehow skips the delete, I currently don't see how or why. From reading the code, I would have thought that the ldap.delete_s should be called in _delete_dn_recursive, then possibly throw some subclass of ldap.LDAPError which would get caught by replication.py, triggering a univention.debug.ERROR log message, a reconnect and a retry.
I would propose to start by backporting the patches from Bug #34763, they are a good idea for enhanced replication robustness anyway and may avoid the issue reported here.
If you use UCRV listener/cache/filter, the entries are NOT added to the cache. On removal of such an object, the listener has no knowledge, that the entry is handles by the replication module and thus does not call it on removal. 8: 785068 cn=mb20150928150634629169,cn=uid,cn=temporary,cn=univention, a 14: 28.09.15 17:06:34.822 LISTENER ( ALL ) : no cache entry found for cn=mb20150928150634629169,cn=uid,cn=temporary,cn=univention, 32: 28.09.15 17:06:34.834 LISTENER ( INFO ) : replication: Running handler for: cn=mb20150928150634629169,cn=uid,cn=temporary,cn=univention, 33: 28.09.15 17:06:34.837 LISTENER ( INFO ) : handler: replication (successful) ... 2738: 785087 uid=mb20150928150634629169,cn=mailboxes, a 2744: 28.09.15 17:06:35.241 LISTENER ( ALL ) : no cache entry found for uid=mb20150928150634629169,cn=mailboxes,ou=fr, 2745: 28.09.15 17:06:35.241 LISTENER ( ALL ) : updating by DN uid=mb20150928150634629169,cn=mailboxes,ou=fr, ... 2811: 785088 cn=mb20150928150634629169,cn=uid,cn=temporary,cn=univention, d 2817: 28.09.15 17:06:35.278 LISTENER ( ALL ) : got 792 bytes for cn=mb20150928150634629169,cn=uid,cn=temporary,cn=univention, 2120: 28.09.15 17:06:35.286 LISTENER ( INFO ) : deleted from cache: cn=mb20150928150634629169,cn=uid,cn=temporary,cn=univention, ... 8981: 785107 uid=mb20150928150634629169,cn=mailboxes,ou=fr, d 8990: 28.09.15 17:13:06.464 LISTENER ( ALL ) : no cache entry found for uid=mb20150928150634629169,cn=mailboxes,ou=fr, 8996: 28.09.15 17:13:06.465 LISTENER ( INFO ) : replication: Running handler for: uid=mb20150928150634629169,cn=mailboxes,ou=fr, 8997: 28.09.15 17:13:06.466 LISTENER ( INFO ) : handler: replication (successful): /usr/lib/univention-directory-listener/system/replication.py: 0890 def handler(dn, new, listener_old, operation): 1075 elif old and not new: * old=None as there is no cache entry * new is None because the entry is deleted * nothing to do * fast exit with success See Bug #3882 comment 1: > The test shows a fundamental problem in the implementation: the handler is > not called when the object is deleted! > > This is caused by the fact, that on delete the listener neither knows about > the object from the cache not is able to retrieve the data from the (remote) > LDAP server, as the object is already deleted there. So the listener can't > decide, if the module should be called. > Normally src/handlers.c:880 calls cache_entry_module_present() to check, if > the module is registered for the object. > > There already is a special-handling for the "replication" module in > handlers_delete(). Maybe make that dynamic by using a UCRV to configure a > list of modules to always call on delete? The replication module is always called, but it doesn't handle the case old=new=None. (In reply to Arvid Requate from comment #2) > I would propose to start by backporting the patches from Bug #34763, they > are a good idea for enhanced replication robustness anyway and may avoid the > issue reported here. No, very unlikely; see above.
(In reply to Philipp Hahn from comment #3) > If you use UCRV listener/cache/filter, the entries are NOT added to the > cache. On removal of such an object, the listener has no knowledge, that the > entry is handles by the replication module and thus does not call it on > removal. listener/cache/filter is defined in this scenario. [..] > > /usr/lib/univention-directory-listener/system/replication.py: > 0890 def handler(dn, new, listener_old, operation): > 1075 elif old and not new: > > * old=None as there is no cache entry > * new is None because the entry is deleted > * nothing to do > * fast exit with success replication.py handler is defined as def handler(dn, new, listener_old, operation) "old" is an LDAP object that has been retrieved from local LDAP. So deletion works even if the cache (listener_old) is empty. This is also the case here: deletion works in >90% of our tests. The logs contain successfull deletions also.
/var/log/univention/listener.log @ backup: 23.10.15 13:42:39.615 LISTENER ( WARN ) : LOCAL not found: cn=9999,cn=system users,ou=fr,BASE {'matched': 'cn=system users,ou=fr,BASE', 'desc': 'No such object'} 23.10.15 15:25:57.160 LISTENER ( WARN ) : LOCAL server down: cn=0,cn=mailboxes,ou=fr,BASE {'desc': "Can't contact LDAP server"} /var/log/syslog @ backup: Oct 23 13:42:39 in8-backup-201 slapd[5346]: conn=1057 op=239999 SRCH base="cn=9999,cn=system users,ou=fr,BASE" scope=0 deref=0 filter="(objectClass=*)" Oct 23 13:42:39 in8-backup-201 slapd[5346]: conn=1057 op=239999 SRCH attr=* + Oct 23 13:42:39 in8-backup-201 slapd[5346]: conn=1057 op=239999 SEARCH RESULT tag=101 err=32 nentries=0 text= Oct 23 13:42:39 in8-backup-201 slapd[5346]: conn=1057 op=240000 ADD dn="cn=9999,cn=system users,ou=fr,BASE" Oct 23 13:42:39 in8-backup-201 slapd[5346]: conn=1057 op=240000 RESULT tag=105 err=0 text= ... Oct 23 13:50:49 in8-backup-201 slapd[5346]: conn=1057 fd=23 closed (idletimeout) 1. New objects are created, but not cached due to filter 2. Connection times out and is shut down by LDAP server 3. One of the non-cached objects is deleted 4. replication.py # getOldValues() catches ldap.SERVER_DOWN itself and returns {} 5. listener decides to do nothing, as old={} from getOldValues() and new={} because object is deleted
r65814 | Bug #39439 replication: Fix LDAP server timeout handling r65815 | Bug #39439 replication: Update copyright r65825 | Bug #39439 replication: Fix LDAP server timeout handling r65826 | Bug #39439 replication: Fix LDAP server timeout handling Package: univention-directory-replication Version: 7.0.1-16.97.201511201718 Branch: ucs_3.2-0 Scope: ucs3.2-8 r65817 | Bug #39439 replication: Fix LDAP server timeout handling CL Some replicated objects were not deleted, when the LDAP server closed its side of the LDAP connection (<u:bug>39439</u:bug>). Test: # @backup ucr set ldap/idletimeout=60 ldap/debug/level=conns listener/cache/filter='(cn=filtered)' /etc/init.d/slapd restart /etc/init.d/univention-directory-listener restart udm container/cn create --set name=filtered univention-ldapsearch -LLL cn=filtered # dn: cn=filtered,dc=phahn,dc=qa mv /var/lib/univention-directory-listener/cache.db.lock{,.old} univention-directory-listener-dump | grep filtered sleep 60 ; sleep 15 udm container/cn remove --filter name=filtered univention-ldapsearch -LLL cn=filtered ldapsearch -H "ldap://$(ucr get ldap/master):$(ucr get ldap/master/port)/" -ZZ -D "$(ucr get ldap/hostdn)" -y /etc/machine.secret -LLL -b "$(ucr get ldap/base)" -s one dn univention-directory-listener-dump | grep filtered tail /var/log/univention/listener.log ... 20.11.15 17:21:21.536 LISTENER ( INFO ) : delete handlers for cn=filtered,dc=phahn,dc=qa 20.11.15 17:21:21.536 LISTENER ( INFO ) : replication: Running handler for: cn=filtered,dc=phahn,dc=qa 20.11.15 17:21:21.537 LISTENER ( WARN ) : Can't contact LDAP server: retrying 20.11.15 17:21:21.537 LISTENER ( INFO ) : replication: Running handler for: cn=filtered,dc=phahn,dc=qa 20.11.15 17:21:21.543 LISTENER ( PROCESS ) : LOCAL found result: cn=filtered,dc=phahn,dc=qa ['20151120161945.405538Z#000000#000#000000'] 20.11.15 17:21:21.544 LISTENER ( INFO ) : LDAP keys=['hasSubordinates', 'entryCSN', 'cn', 'objectClass', 'univentionObjectType', 'creatorsName', 'entryUUID', 'modifiersName', 'createTimestamp', 'entryDN', 'subschemaSubentry', 'structuralObjectClass', 'modifyTimestamp']; listener keys=[] 20.11.15 17:21:21.544 LISTENER ( INFO ) : replication: old entries from LDAP server and Listener do not match 20.11.15 17:21:21.544 LISTENER ( ALL ) : replication: delete: cn=filtered,dc=phahn,dc=qa 20.11.15 17:21:21.548 LISTENER ( INFO ) : handler: replication (successful)
Needs forward-port to errata4.0-4 and errata4.1-0 after QA
r65841 | Bug #39439 repl: Reduce error level for timeout handling ud.PROCESS -> ud.ALL Package: univention-directory-replication Version: 7.0.1-17.98.201511231118 Branch: ucs_3.2-0 Scope: ucs3.2-8
OK, works fine master + slave @slave ucr set ldap/idletimeout='30' ucr set listener/cache/filter='(cn=filtered)' ucr set ldap/debug/level=256 /etc/init.d/univention-directory-listener restart /etc/init.d/slapd restart @master udm container/cn create --set name=filtered @slave univention-ldapsearch -LLL cn=filtered -> OK ... # listener ldap connection timed out slave slapd[12810]: conn=1035 fd=17 closed (idletimeout) @master udm container/cn remove --filter name=filtered @slave 11:36:44.995 LISTENER ( WARN ) : Can't contact LDAP server: retrying 11:36:44 slave slapd[12810]: conn=1036 fd=17 ACCEPT from IP=127.0.0.1:33537 univention-ldapsearch -LLL cn=filtered -> NOTHING, OK OK - changelog