Univention Bugzilla – Bug 47587
cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND
Last modified: 2019-10-30 12:17:31 CET
We had two support cases lately where the removal of a reverse DNS zone caused the listener replication to get stuck: =============================================================================== 10.08.18 17:19:15.927 LISTENER ( PROCESS ) : updating 'zoneName=30.20.10.in-addr.arpa,cn=dns,dc=ucsdom,dc=local' command m 10.08.18 17:19:15.945 LISTENER ( ERROR ) : dntree_del_id: delete failed: subordinate objects must be deleted first 10.08.18 17:19:15.945 LISTENER ( ERROR ) : cache_update_master_entry: storing master entry in database failed 10.08.18 17:19:15.945 LISTENER ( ERROR ) : cache.c:429:cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND: Requested page not found (-30797) 10.08.18 17:19:15.945 LISTENER ( WARN ) : received signal 6 =============================================================================== A check with univention-directory-listener-dump -i showed that the master entry is actually ok. So I assume that MDB_PAGE_NOTFOUND might be a red herring. It could be that the listener mishandles the MDB API in this situation. We should try to reproduce this. In case this happens again, please try to raise the listener/debug/level to 4 and restart the listener before continuing with workarounds.
From Ticket#: 2018081621000464 we received a level 4 listener.log: ======================================================================= LISTENER (INFO ) : DNS: Removing zone 8.200.10.in-addr.arpa LISTENER (INFO ) : handler: bind (successful) LISTENER (INFO ) : deleted from cache: zoneName=8.200.10.in-addr.arpa,cn=dns,dc=schule,dc=local LISTENER (ALL ) : cache_delete_entry: Transaction begin LISTENER (ALL ) : blocking signals (was 1) LISTENER (INFO ) : dntree_lookup_id4ldapdn: found id=5170 LISTENER (ERROR) : dntree_del_id: delete failed: subordinate objects must be deleted first LISTENER (ALL ) : unblocking signals (was 2) LISTENER (ALL ) : cache_update_entry: Transaction abort LISTENER (ALL ) : cache_delete_entry: Transaction begin LISTENER (ALL ) : blocking signals (was 1) LISTENER (INFO ) : dntree_get_id4dn: DN zoneName=8.200.10.in-addr.arpa,cn=dns,dc=schulen,dc=local not in id2dn LISTENER (ALL ) : unblocking signals (was 2) LISTENER (ALL ) : cache_update_entry: Transaction abort LISTENER (ALL ) : unblocking signals (was 1) LISTENER (ALL ) : cache_update_master_entry: Transaction begin LISTENER (ALL ) : cache_update_master_entry: Transaction commit LISTENER (ERROR) : cache_update_master_entry: storing master entry in database failed LISTENER (ERROR) : cache.c:429:cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND: Requested page not found (-30797) LISTENER (WARN ) : received signal 6 ======================================================================= Looks strange that dntree_get_id4dn doesn't find the entry in the "id2dn" table when dntree_lookup_id4ldapdn found id=5170, but I have not looked into the code for a while. Looks almost like the MDB_PAGE_NOTFOUND is somehow propagated to the next transaction, but that would be pretty strange and also cache_update_master_entry operates on the "id2entry" table.
For a customer that has this issue, what is the workaround mentioned?
Happened again. Workaround is to re-join the affected servers. This will trigger a resync from scratch of the LDAP database and therefore overwrite the (at least partially) corrupted database of the listener.
Happened again. But after a few days the new added Objects are replicated to the server
univention-join should fix the issue.
Cannot reproduce this yet, tried by deleting reverse zone + subobjects via UMC on * 4.2-3 errata259 (liblmdb0 0.9.14-1) * 4.4-1 errata196 (liblmdb0 0.9.18-5) On a 4.4-1 DC Master get this listener.log without the MDB_PAGE_NOTFOUND: ========================================================================= 24.07.19 15:11:09.382 LISTENER ( INFO ) : DNS: Removing zone 8.200.10.in-addr.arpa 24.07.19 15:11:09.415 LISTENER ( INFO ) : handler: bind (successful) 24.07.19 15:11:09.415 LISTENER ( INFO ) : handler: pkgdb-watch (skipped) 24.07.19 15:11:09.415 LISTENER ( INFO ) : handler: portal_groups (skipped) 24.07.19 15:11:09.415 LISTENER ( INFO ) : deleted from cache: zoneName=8.200.10.in-addr.arpa,cn=dns,dc=ar41i1,dc=qa 24.07.19 15:11:09.415 LISTENER ( ALL ) : cache_delete_entry: Transaction begin 24.07.19 15:11:09.415 LISTENER ( ALL ) : blocking signals (was 1) 24.07.19 15:11:09.415 LISTENER ( INFO ) : dntree_lookup_id4ldapdn: found id=295 24.07.19 15:11:09.415 LISTENER ( ERROR ) : dntree_del_id: delete failed: subordinate objects must be deleted first 24.07.19 15:11:09.415 LISTENER ( ALL ) : unblocking signals (was 2) 24.07.19 15:11:09.415 LISTENER ( ALL ) : cache_update_entry: Transaction abort 24.07.19 15:11:09.415 LISTENER ( ALL ) : cache_delete_entry: Transaction begin 24.07.19 15:11:09.415 LISTENER ( ALL ) : blocking signals (was 1) 24.07.19 15:11:09.415 LISTENER ( INFO ) : dntree_get_id4dn: DN zoneName=8.200.10.in-addr.arpa,cn=dns,dc=ar41i1,dc=qa not in id2dn 24.07.19 15:11:09.415 LISTENER ( ALL ) : unblocking signals (was 2) 24.07.19 15:11:09.415 LISTENER ( ALL ) : cache_update_entry: Transaction abort 24.07.19 15:11:09.415 LISTENER ( ALL ) : unblocking signals (was 1) 24.07.19 15:11:09.415 LISTENER ( ALL ) : cache_update_master_entry: Transaction begin 24.07.19 15:11:09.415 LISTENER ( ALL ) : cache_update_master_entry: Transaction commit 24.07.19 15:11:09.422 LISTENER ( INFO ) : Last Notifier ID: 1482 ========================================================================= Two explanations for the details of this log: 1) The message "dntree_get_id4dn: DN zoneName=[...] not in id2dn" occurs because the listener first tries to remove the lowercase version and then the mixed case version, which is not present. That's ok. 2) The "dntree_del_id: delete failed: subordinate objects must be deleted first" occurs when deleting a DNS Zone which contains some subobjects. In that case, the listener repeatedly receives these events: * delete of a child DNS object * modification of the SOA-record of the zone (version number bump) For each of these SOA modifications the listener looks for the object in LDAP and - depending on timing - doesn't find it any more because the UDM/LDAP-Server is already finished with deleting the whole subtree. In a case like that, the Listener treats the modifcation as a delete and calls change_delete for the zone object. In this case the cache_delete_entry function (or rather dntree_del_id) detects that there are still subobjects in the cache and refuses to delete the zone object from the cache. That decision not too bad I guess. Finally, when the Listner processes the removeal of the last child object, the cache_delete_entry actually works. And when it finally receives the actual LDAP delete operation for the zone object itself, the zone object is already gone from the Listener cache and that's fine. But all of this doesn't explain the MDB_PAGE_NOTFOUND.
Did this ever happen on systems where the last initialization of the database was done with UCS 4.4-0 or newer? What I would like to understand is whether this might be an after-effect of an database corruption in older versions.
As this happens frequently on customers as soon as deleting reverse DNS zones I raise "affected by". Especially in larger environments the workaround to re-join the systems is not a suitable way. We need to fix this. A listener.log in Level 4 will be attached here shortly.
For a urgent support case I started writing <https://help.univention.com/t/howto-update-listener-cachemasterentry/13149>, which contains a Python script to read and update the CacheMasterEntry.
It happened again in a customer environment on 2 backups and 5 slaves. The customer has a lot more systems. root@backup:~# grep -A6 -B1 8431468 /var/lib/univention-ldap/notify/transaction 8431467 uid=xxxxxx,cn=users,dc=foo,dc=de m 8431468 relativeDomainName=10.4,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d 8431469 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m 8431470 relativeDomainName=11.4,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d 8431471 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m 8431472 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d 8431473 uid=yyyyyy,cn=users,dc=foo,dc=de m 8431474 uid=yyyyyy,cn=users,dc=foo,dc=de m root@backup:~# root@backup:/var/log/univention# cat /var/lib/univention-directory-listener/notifier_id ; echo 8431468 root@backup:/var/log/univention# 27.09.19 17:48:03.724 LISTENER ( PROCESS ) : updating 'zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de' command m 27.09.19 17:48:03.746 LISTENER ( ERROR ) : dntree_del_id: delete failed: subordinate objects must be deleted first 27.09.19 17:48:03.746 LISTENER ( ERROR ) : cache_update_master_entry: storing master entry in database failed 27.09.19 17:48:03.746 LISTENER ( ERROR ) : cache.c:429:cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND: Requested page not found (-30797) 27.09.19 17:48:03.746 LISTENER ( WARN ) : received signal 6 root@backup:~# grep 'zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de' /var/lib/univention-ldap/notify/transaction 8431459 relativeDomainName=200.50,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d 8431460 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m 8431468 relativeDomainName=10.4,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d 8431469 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m 8431470 relativeDomainName=11.4,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d 8431471 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m 8431472 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d root@backup:~#
Created attachment 10207 [details] close_local_cursor_in_dntree_del_id.patch After reading LMDB documentation very thoroughly and checking the code I could only find this one missing mdb_cursor_close. This is in the active code path, so it might affect things. According to the documentation it should not matter because we are calling mdb_txn_abort shortly after in one of the calling functions (cache_delete_entry). I also see some potential to use less of lmdb special functions in the code path. The chain of derived data is pretty long in this case: cache_delete_entry does: * mdb_txn_begin * mdb_cursor_open to obtain a cursor for the id2dn subdatabase * cache_delete_entry_in_transaction, passing id2dn_write_cursor_p in cache_delete_entry_in_transaction does: * mdb_cursor_txn to extract the transaction handle again from the id2dn cursor * mdb_del on the cache entry from the id2entry subdatabase * dntree_del_id, passing the id2dn cursor down to it dntree_del_id does: * mdb_cursor_txn to extract the transaction handle again from the id2dn cursor * mdb_cursor_dbi to extract the database handle from the id2dn cursor * mdb_cursor_open to obtain a *second* cursor for the id2dn subdatabase * then it returns from this function in the situation of this bug logging "subordinate objects must be deleted first" and doesn't free the second cursor for the id2dn subdatabase I'll stop here with this description because this is the point that the attached patch fixes. Maybe we could additionally restructure the code to pass both the transaction handle and the database handle down through the functions and each function opens and closes cursors as it needs. lmdb 0.9.18 had at least one issue with cursors ( www.openldap.org/its/index.cgi/Software Bugs?id=8542 ), so that might be a good idea anyway. Refectoring the cache code is a piece of work tough. We could also think about updating to a newer lmdb version, but I've not seen any obviously related commit in the upstream repository.
The central part comes down to this error message: mdb_txn_commit: failed: MDB_PAGE_NOTFOUND So this doesn't come from the mdb_put, which is used in cache_update_master_entry to write the new CacheMasterEntry structure at position 0 of the id2entry subdatabase. That would probably also emit "MDB_NOTFOUND" instead. Instead it comes in the commit after this put. The documentation says about MDB_PAGE_NOTFOUND: http://www.lmdb.tech/doc/group__errors.html#gabeb52e4c4be21b329e31c4add1b71926 ============================================================== #define MDB_PAGE_NOTFOUND (-30797) Requested page not found - this usually indicates corruption ============================================================== So something has gone haywire in the DB handling, but apparently it's not on disc, otherwise Philipp's workaround would probably not have made it work again. If only we could reproduce it live, we could attach with gdb and breakpoint at mdb_page_get, which is the only function in the mdb source code which returns MDB_PAGE_NOTFOUND. Then we could at least get a "bt full" output to see the call stack. There is a compile time macro "MDB_DEBUG" which we could set to 1 to gather audit output, but that's probably nothing you want in production.
Created attachment 10214 [details] close_and_reopen_cache_and_retry_on_failure_in_cache_update_master_entry.diff This rough patch exemplifies the idea we discussed about closing and reopening the cache and retrying cache_update_master_entry in case of error. Just to show another line of workaround. A small amount of refactoring is required to make this work.
I had a look at LMDB and UDL, too: - Currently we have LMDB-0.9.18 in UCS-4.4 from Debian-Stretch - Debian-Buster has 0.9.22 - stable 0.9.23 was released 2018/12/19 and 0.9.24 is WIP 0.9.22 contains fixes for 3 interesting issues: - <http://www.openldap.org/its/index.cgi/Software%20Bugs?id=8622> - <http://www.openldap.org/its/index.cgi/Software%20Bugs?id=8722> - <http://www.openldap.org/its/index.cgi/Software%20Bugs?id=8542> The first 2 fix bugs related to cursor handling when using MDB_DUPSORT or mdb_cursor_del(), which UDL uses. To me the commits look like some corruption to the cursor when entries are deleted: A cursor then may still point to some entry already deleted, which may trigger bugs later. This would be in-sync with out observation that the bug is triggered when DNS entries are deleted. Depending on the number of entries in the UDL cache trees deleting entries might trigger mdb_rebalance() and other functions to free() pages (or to move them to the free-list of LMDB.) As such the pending cursor might become invalid as it then points to a freed page. Committing the UDL transaction in cache.c:429 will free the left-over cursor, which only then performs the pending free'ing, mdb_cursor_del() then calls mdb_page_get(), which is the only function returning MDB_PAGE_NOTFOUND. This also would explain why the bug is not easily reproducible: The "to be deleted entry" must be properly aligned to be the "only / whatever..." entry per page. Or "there must be X other entries" to trigger the re-balance. Re-running `univention-join` "fixes" the problem as then the cache is re-build from the current state, so no delete operations are triggered. Why this still failed for customer 00026 I cannot explain. I tried to re-produce the bug by creating different numbers of DNS entries, but so far was unable to find breaking parameters. Adding the missing mdb_cursor_del() from attachment 10207 [details] should improve the situation as it - should protect us from any error in liblmdb - should prevent the late cursor freeing by mdb_txn_close() to triggering the latent error condition - moves the error detection closer to the error location We also should update our version of LMDB to 0.9.22 (or later) to get at least the known bugs fixed. If the bugs still occurs in the future, we still can invest more time than.
[4.4-2] 8697f3143e Bug #47587 UDL: Fix Makefile management/univention-directory-listener/debian/rules | 6 +++--- management/univention-directory-listener/tests/Makefile | 3 +++ 2 files changed, 6 insertions(+), 3 deletions(-) [4.4-2] d36c58159c Bug #47587 UDL: Switch to AutomaticDebugPackage management/univention-directory-listener/debian/control | 12 ------------ management/univention-directory-listener/debian/rules | 2 +- 2 files changed, 1 insertion(+), 13 deletions(-) [4.4-2] 5ccc2d8b7b Bug #47587 UDL: Fix error message management/univention-directory-listener/src/cache_dn.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) [4.4-2] 5400cd1209 Bug #47587 UDL: Close cursor on success doc/errata/staging/univention-directory-listener.yaml | 10 ++++++++++ management/univention-directory-listener/debian/changelog | 6 ++++++ management/univention-directory-listener/src/cache_dn.c | 1 + 3 files changed, 17 insertions(+) Package: univention-directory-listener Version: 13.0.2-12A~4.4.0.201910250856 Branch: ucs_4.4-0 Scope: errata4.4-2 [4.4-2] b432e55e77 Bug #47587: univention-directory-listener 13.0.2-12A~4.4.0.201910250856 doc/errata/staging/univention-directory-listener.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
Verified: * Code changes * Package update, manual tests (user creation, recursive ptr zone removal, univention-directory-listener-dump). * Advisory For reference: The lmdb update shows these upstream changes: =================================================================== LMDB 0.9.22 Release (2018-03-22) Fix MDB_DUPSORT alignment bug (ITS#8819) Fix regression with new db from 0.9.19 (ITS#8760) Fix liblmdb to build on Solaris (ITS#8612) Fix delete behavior with DUPSORT DB (ITS#8622) Fix mdb_cursor_get/mdb_cursor_del behavior (ITS#8722) LMDB 0.9.21 Release (2017/06/01) Fix xcursor after cursor_del (ITS#8622) LMDB 0.9.20 (Withdrawn) Fix mdb_load with escaped plaintext (ITS#8558) Fix mdb_cursor_last / mdb_put interaction (ITS#8557) LMDB 0.9.19 Release (2016/12/28) Fix mdb_env_cwalk cursor init (ITS#8424) Fix robust mutexes on Solaris 10/11 (ITS#8339) Tweak Win32 error message buffer Fix MDB_GET_BOTH on non-dup record (ITS#8393) Optimize mdb_drop Fix xcursors after mdb_cursor_del (ITS#8406) Fix MDB_NEXT_DUP after mdb_cursor_del (ITS#8412) Fix mdb_cursor_put resetting C_EOF (ITS#8489) Fix mdb_env_copyfd2 to return EPIPE on SIGPIPE (ITS#8504) Fix mdb_env_copy with empty DB (ITS#8209) Fix behaviors with fork (ITS#8505) Fix mdb_dbi_open with mainDB cursors (ITS#8542) Fix robust mutexes on kFreeBSD (ITS#8554) Fix utf8_to_utf16 error checks (ITS#7992) Fix F_NOCACHE on MacOS, error is non-fatal (ITS#7682) Build Make shared lib suffix overridable (ITS#8481) Documentation Cleanup doxygen nits Note reserved vs actual mem/disk usage ===================================================================
<http://errata.software-univention.de/ucs/4.4/327.html> <http://errata.software-univention.de/ucs/4.4/328.html>