47587 – cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND

Bug 47587 - cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND

Summary: cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND

Status:	CLOSED FIXED

Alias:	None

Product:	UCS
Classification:	Unclassified
Component:	Listener (univention-directory-listener)
Version:	UCS 4.3
Hardware:	Other Linux

Importance:	P5 normal
Target Milestone:	UCS 4.4-2-errata
Assignee:	Philipp Hahn
QA Contact:	Arvid Requate

URL:	https://help.univention.com/t/recreat...
Keywords:

Depends on:
Blocks:

Reported:	2018-08-14 15:33 CEST by Arvid Requate
Modified:	2019-10-30 12:17 CET (History)
CC List:	9 users (show)

See Also:	48613
What kind of report is it?:	Bug Report
What type of bug is this?:	5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?:	3: Will affect average number of installed domains
How will those affected feel about the bug?:	5: Blocking further progress on the daily work
User Pain:	0.429
Enterprise Customer affected?:	Yes
School Customer affected?:	Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:	2018080621000312, 2018081021000573, 2018081621000464, 2019013121000777, 2019020121000572, 2019092721000494
Bug group (optional):
Customer ID:	00026
Max CVSS v3 score:

Attachments
close_local_cursor_in_dntree_del_id.patch (796 bytes, patch) 2019-10-16 23:23 CEST, Arvid Requate	Details \| Diff
close_and_reopen_cache_and_retry_on_failure_in_cache_update_master_entry.diff (3.56 KB, patch) 2019-10-21 13:52 CEST, Arvid Requate	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Arvid Requate

2018-08-14 15:33:22 CEST

We had two support cases lately where the removal of a reverse DNS zone caused the listener replication to get stuck:

===============================================================================
10.08.18 17:19:15.927  LISTENER    ( PROCESS ) : updating 'zoneName=30.20.10.in-addr.arpa,cn=dns,dc=ucsdom,dc=local' command m
10.08.18 17:19:15.945  LISTENER    ( ERROR   ) : dntree_del_id: delete failed: subordinate objects must be deleted first
10.08.18 17:19:15.945  LISTENER    ( ERROR   ) : cache_update_master_entry: storing master entry in database failed
10.08.18 17:19:15.945  LISTENER    ( ERROR   ) : cache.c:429:cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND: Requested page not found (-30797)
10.08.18 17:19:15.945  LISTENER    ( WARN    ) : received signal 6
===============================================================================

A check with

 univention-directory-listener-dump -i

showed that the master entry is actually ok. So I assume that MDB_PAGE_NOTFOUND might be a red herring. It could be that the listener mishandles the MDB API in this situation. We should try to reproduce this.

In case this happens again, please try to raise the listener/debug/level to 4 and restart the listener before continuing with workarounds.

Comment 1 Arvid Requate

2018-08-21 14:03:41 CEST

From Ticket#: 2018081621000464 we received a level 4 listener.log:

=======================================================================
LISTENER (INFO ) : DNS: Removing zone 8.200.10.in-addr.arpa
LISTENER (INFO ) : handler: bind (successful)
LISTENER (INFO ) : deleted from cache: zoneName=8.200.10.in-addr.arpa,cn=dns,dc=schule,dc=local
LISTENER (ALL  ) : cache_delete_entry: Transaction begin
LISTENER (ALL  ) : blocking signals (was 1)
LISTENER (INFO ) : dntree_lookup_id4ldapdn: found id=5170
LISTENER (ERROR) : dntree_del_id: delete failed: subordinate objects must be deleted first
LISTENER (ALL  ) : unblocking signals (was 2)
LISTENER (ALL  ) : cache_update_entry: Transaction abort
LISTENER (ALL  ) : cache_delete_entry: Transaction begin
LISTENER (ALL  ) : blocking signals (was 1)
LISTENER (INFO ) : dntree_get_id4dn: DN zoneName=8.200.10.in-addr.arpa,cn=dns,dc=schulen,dc=local not in id2dn
LISTENER (ALL  ) : unblocking signals (was 2)
LISTENER (ALL  ) : cache_update_entry: Transaction abort
LISTENER (ALL  ) : unblocking signals (was 1)
LISTENER (ALL  ) : cache_update_master_entry: Transaction begin
LISTENER (ALL  ) : cache_update_master_entry: Transaction commit
LISTENER (ERROR) : cache_update_master_entry: storing master entry in database failed
LISTENER (ERROR) : cache.c:429:cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND: Requested page not found (-30797)
LISTENER (WARN ) : received signal 6
=======================================================================

Looks strange that dntree_get_id4dn doesn't find the entry in the "id2dn" table when dntree_lookup_id4ldapdn found id=5170, but I have not looked into the code for a while. Looks almost like the MDB_PAGE_NOTFOUND is somehow propagated to the next transaction, but that would be pretty strange and also cache_update_master_entry operates on the "id2entry" table.

Comment 2 bwesemann 2019-01-31 19:49:34 CET

For a customer that has this issue, what is the workaround mentioned?

Comment 3 Christian Völker

2019-02-01 10:41:48 CET

Happened again. Workaround is to re-join the affected servers. This will trigger a resync from scratch of the LDAP database and therefore overwrite the (at least partially) corrupted database of the listener.

Comment 4 Christina Scheinig

2019-02-04 16:02:24 CET

Happened again. But after a few days the new added Objects are replicated to the server

Comment 5 Arvid Requate

2019-02-21 20:31:07 CET

univention-join should fix the issue.

Comment 6 Arvid Requate

2019-07-24 19:54:23 CEST

Cannot reproduce this yet, tried by deleting reverse zone + subobjects via UMC on

* 4.2-3 errata259   (liblmdb0 0.9.14-1)
* 4.4-1 errata196   (liblmdb0 0.9.18-5)

On a 4.4-1 DC Master get this listener.log without the MDB_PAGE_NOTFOUND:
=========================================================================
24.07.19 15:11:09.382  LISTENER    ( INFO    ) : DNS: Removing zone 8.200.10.in-addr.arpa
24.07.19 15:11:09.415  LISTENER    ( INFO    ) : handler: bind (successful)
24.07.19 15:11:09.415  LISTENER    ( INFO    ) : handler: pkgdb-watch (skipped)
24.07.19 15:11:09.415  LISTENER    ( INFO    ) : handler: portal_groups (skipped)
24.07.19 15:11:09.415  LISTENER    ( INFO    ) : deleted from cache: zoneName=8.200.10.in-addr.arpa,cn=dns,dc=ar41i1,dc=qa
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : cache_delete_entry: Transaction begin
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : blocking signals (was 1)
24.07.19 15:11:09.415  LISTENER    ( INFO    ) : dntree_lookup_id4ldapdn: found id=295
24.07.19 15:11:09.415  LISTENER    ( ERROR   ) : dntree_del_id: delete failed: subordinate objects must be deleted first
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : unblocking signals (was 2)
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : cache_update_entry: Transaction abort
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : cache_delete_entry: Transaction begin
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : blocking signals (was 1)
24.07.19 15:11:09.415  LISTENER    ( INFO    ) : dntree_get_id4dn: DN zoneName=8.200.10.in-addr.arpa,cn=dns,dc=ar41i1,dc=qa not in id2dn
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : unblocking signals (was 2)
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : cache_update_entry: Transaction abort
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : unblocking signals (was 1)
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : cache_update_master_entry: Transaction begin
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : cache_update_master_entry: Transaction commit
24.07.19 15:11:09.422  LISTENER    ( INFO    ) : Last Notifier ID: 1482
=========================================================================

Two explanations for the details of this log:

1) The message "dntree_get_id4dn: DN zoneName=[...] not in id2dn" occurs because the listener first tries to remove the lowercase version and then the mixed case version, which is not present. That's ok.

2) The "dntree_del_id: delete failed: subordinate objects must be deleted first" occurs when deleting a DNS Zone which contains some subobjects. In that case, the listener repeatedly receives these events:

* delete of a child DNS object
* modification of the SOA-record of the zone (version number bump)

For each of these SOA modifications the listener looks for the object in LDAP and - depending on timing - doesn't find it any more because the UDM/LDAP-Server is already finished with deleting the whole subtree. In a case like that, the Listener treats the modifcation as a delete and calls change_delete for the zone object. In this case the cache_delete_entry function (or rather dntree_del_id) detects that there are still subobjects in the cache and refuses to delete the zone object from the cache. That decision not too bad I guess. Finally, when the Listner processes the removeal of the last child object, the cache_delete_entry actually works. And when it finally receives the actual LDAP delete operation for the zone object itself, the zone object is already gone from the Listener cache and that's fine.

But all of this doesn't explain the MDB_PAGE_NOTFOUND.

Comment 7 Ingo Steuwer

2019-08-13 08:51:07 CEST

Did this ever happen on systems where the last initialization of the database was done with UCS 4.4-0 or newer?

What I would like to understand is whether this might be an after-effect of an database corruption in older versions.

Comment 8 Christian Völker

2019-09-27 13:03:14 CEST

As this happens frequently on customers as soon as deleting reverse DNS zones I raise "affected by".

Especially in larger environments the workaround to re-join the systems is not a suitable way.

We need to fix this. A listener.log in Level 4 will be attached here shortly.

Comment 9 Philipp Hahn

2019-09-29 09:26:49 CEST

For a urgent support case I started writing <https://help.univention.com/t/howto-update-listener-cachemasterentry/13149>, which contains a Python script to read and update the CacheMasterEntry.

Comment 10 Stefan Gohmann

2019-09-30 09:06:22 CEST

It happened again in a customer environment on 2 backups and 5 slaves. The customer has a lot more systems.

root@backup:~# grep -A6 -B1  8431468 /var/lib/univention-ldap/notify/transaction
8431467 uid=xxxxxx,cn=users,dc=foo,dc=de m
8431468 relativeDomainName=10.4,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
8431469 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m
8431470 relativeDomainName=11.4,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
8431471 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m
8431472 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
8431473 uid=yyyyyy,cn=users,dc=foo,dc=de m
8431474 uid=yyyyyy,cn=users,dc=foo,dc=de m
root@backup:~# 


root@backup:/var/log/univention# cat /var/lib/univention-directory-listener/notifier_id ; echo
8431468
root@backup:/var/log/univention#  

27.09.19 17:48:03.724  LISTENER    ( PROCESS ) : updating 'zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de' command m
27.09.19 17:48:03.746  LISTENER    ( ERROR   ) : dntree_del_id: delete failed: subordinate objects must be deleted first
27.09.19 17:48:03.746  LISTENER    ( ERROR   ) : cache_update_master_entry: storing master entry in database failed
27.09.19 17:48:03.746  LISTENER    ( ERROR   ) : cache.c:429:cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND: Requested page not found (-30797)
27.09.19 17:48:03.746  LISTENER    ( WARN    ) : received signal 6 

root@backup:~# grep 'zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de' /var/lib/univention-ldap/notify/transaction
8431459 relativeDomainName=200.50,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
8431460 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m
8431468 relativeDomainName=10.4,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
8431469 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m
8431470 relativeDomainName=11.4,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
8431471 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m
8431472 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
root@backup:~#

Comment 12 Arvid Requate

2019-10-16 23:23:43 CEST

Created attachment 10207 [details]
close_local_cursor_in_dntree_del_id.patch

After reading LMDB documentation very thoroughly and checking the code I could only find this one missing mdb_cursor_close. This is in the active code path, so it might affect things. According to the documentation it should not matter because we are calling mdb_txn_abort shortly after in one of the calling functions (cache_delete_entry).

I also see some potential to use less of lmdb special functions in the code path. The chain of derived data is pretty long in this case:


cache_delete_entry does:
 * mdb_txn_begin
 * mdb_cursor_open to obtain a cursor for the id2dn subdatabase
 * cache_delete_entry_in_transaction, passing id2dn_write_cursor_p in 

cache_delete_entry_in_transaction does:
 * mdb_cursor_txn to extract the transaction handle again from the id2dn cursor
 * mdb_del on the cache entry from the id2entry subdatabase
 * dntree_del_id, passing the id2dn cursor down to it

dntree_del_id does:
 * mdb_cursor_txn to extract the transaction handle again from the id2dn cursor
 * mdb_cursor_dbi to extract the database handle from the id2dn cursor
 * mdb_cursor_open to obtain a *second* cursor for the id2dn subdatabase
 * then it returns from this function in the situation of this bug
   logging "subordinate objects must be deleted first"
   and doesn't free the second cursor for the id2dn subdatabase

I'll stop here with this description because this is the point that the attached patch fixes.

Maybe we could additionally restructure the code to pass both the transaction handle and the database handle down through the functions and each function opens and closes cursors as it needs. lmdb 0.9.18 had at least one issue with cursors ( www.openldap.org/its/index.cgi/Software Bugs?id=8542 ), so that might be a good idea anyway. Refectoring the cache code is a piece of work tough.

We could also think about updating to a newer lmdb version, but I've not seen any obviously related commit in the upstream repository.

Comment 13 Arvid Requate

2019-10-17 01:16:31 CEST

The central part comes down to this error message:

  mdb_txn_commit: failed: MDB_PAGE_NOTFOUND

So this doesn't come from the mdb_put, which is used in cache_update_master_entry to write the new CacheMasterEntry structure at position 0 of the id2entry subdatabase. That would probably also emit "MDB_NOTFOUND" instead.

Instead it comes in the commit after this put.


The documentation says about MDB_PAGE_NOTFOUND:

http://www.lmdb.tech/doc/group__errors.html#gabeb52e4c4be21b329e31c4add1b71926
==============================================================
#define MDB_PAGE_NOTFOUND   (-30797)

Requested page not found - this usually indicates corruption
==============================================================

So something has gone haywire in the DB handling, but apparently it's not on disc, otherwise Philipp's workaround would probably not have made it work again.


If only we could reproduce it live, we could attach with gdb and breakpoint at mdb_page_get, which is the only function in the mdb source code which returns MDB_PAGE_NOTFOUND. Then we could at least get a "bt full" output to see the call stack. There is a compile time macro "MDB_DEBUG" which we could set to 1 to gather audit output, but that's probably nothing you want in production.

Comment 14 Arvid Requate

2019-10-21 13:52:10 CEST

Created attachment 10214 [details]
close_and_reopen_cache_and_retry_on_failure_in_cache_update_master_entry.diff

This rough patch exemplifies the idea we discussed about closing and reopening the cache and retrying cache_update_master_entry in case of error. Just to show another line of workaround. A small amount of refactoring is required to make this work.

Comment 15 Philipp Hahn

2019-10-21 14:40:08 CEST

I had a look at LMDB and UDL, too:
- Currently we have LMDB-0.9.18 in UCS-4.4 from Debian-Stretch
- Debian-Buster has 0.9.22
- stable 0.9.23 was released 2018/12/19 and 0.9.24 is WIP

0.9.22 contains fixes for 3 interesting issues:
- <http://www.openldap.org/its/index.cgi/Software%20Bugs?id=8622>
- <http://www.openldap.org/its/index.cgi/Software%20Bugs?id=8722>
- <http://www.openldap.org/its/index.cgi/Software%20Bugs?id=8542>
The first 2 fix bugs related to cursor handling when using MDB_DUPSORT or mdb_cursor_del(), which UDL uses.
To me the commits look like some corruption to the cursor when entries are deleted: A cursor then may still point to some entry already deleted, which may trigger bugs later.
This would be in-sync with out observation that the bug is triggered when DNS entries are deleted.

Depending on the number of entries in the UDL cache trees deleting entries might trigger mdb_rebalance() and other functions to free() pages (or to move them to the free-list of LMDB.) As such the pending cursor might become invalid as it then points to a freed page.

Committing the UDL transaction in cache.c:429 will free the left-over cursor, which only then performs the pending free'ing, mdb_cursor_del() then calls mdb_page_get(), which is the only function returning MDB_PAGE_NOTFOUND.

This also would explain why the bug is not easily reproducible: The "to be deleted entry" must be properly aligned to be the "only / whatever..." entry per page.
Or "there must be X other entries" to trigger the re-balance.

Re-running `univention-join` "fixes" the problem as then the cache is re-build from the current state, so no delete operations are triggered. Why this still failed for customer 00026 I cannot explain.


I tried to re-produce the bug by creating different numbers of DNS entries, but so far was unable to find breaking parameters.


Adding the missing mdb_cursor_del() from attachment 10207 [details] should improve the situation as it
- should protect us from any error in liblmdb
- should prevent the late cursor freeing by mdb_txn_close() to triggering the latent error condition
- moves the error detection closer to the error location

We also should update our version of LMDB to 0.9.22 (or later) to get at least the known bugs fixed. If the bugs still occurs in the future, we still can invest more time than.

Comment 16 Philipp Hahn

2019-10-25 18:07:03 CEST

[4.4-2] 8697f3143e Bug #47587 UDL: Fix Makefile
 management/univention-directory-listener/debian/rules   | 6 +++---
 management/univention-directory-listener/tests/Makefile | 3 +++
 2 files changed, 6 insertions(+), 3 deletions(-)

[4.4-2] d36c58159c Bug #47587 UDL: Switch to AutomaticDebugPackage
 management/univention-directory-listener/debian/control | 12 ------------
 management/univention-directory-listener/debian/rules   |  2 +-
 2 files changed, 1 insertion(+), 13 deletions(-)

[4.4-2] 5ccc2d8b7b Bug #47587 UDL: Fix error message
 management/univention-directory-listener/src/cache_dn.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

[4.4-2] 5400cd1209 Bug #47587 UDL: Close cursor on success
 doc/errata/staging/univention-directory-listener.yaml     | 10 ++++++++++
 management/univention-directory-listener/debian/changelog |  6 ++++++
 management/univention-directory-listener/src/cache_dn.c   |  1 +
 3 files changed, 17 insertions(+)

Package: univention-directory-listener
Version: 13.0.2-12A~4.4.0.201910250856
Branch: ucs_4.4-0
Scope: errata4.4-2

[4.4-2] b432e55e77 Bug #47587: univention-directory-listener 13.0.2-12A~4.4.0.201910250856
 doc/errata/staging/univention-directory-listener.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comment 17 Arvid Requate

2019-10-28 18:51:37 CET

Verified:
* Code changes
* Package update, manual tests (user creation, recursive ptr zone removal, univention-directory-listener-dump).
* Advisory


For reference: The lmdb update shows these upstream changes:
===================================================================
LMDB 0.9.22 Release (2018-03-22)
       Fix MDB_DUPSORT alignment bug (ITS#8819)
       Fix regression with new db from 0.9.19 (ITS#8760)
       Fix liblmdb to build on Solaris (ITS#8612)
       Fix delete behavior with DUPSORT DB (ITS#8622)
       Fix mdb_cursor_get/mdb_cursor_del behavior (ITS#8722)

LMDB 0.9.21 Release (2017/06/01)
       Fix xcursor after cursor_del (ITS#8622)

LMDB 0.9.20 (Withdrawn)
       Fix mdb_load with escaped plaintext (ITS#8558)
       Fix mdb_cursor_last / mdb_put interaction (ITS#8557)

LMDB 0.9.19 Release (2016/12/28)
       Fix mdb_env_cwalk cursor init (ITS#8424)
       Fix robust mutexes on Solaris 10/11 (ITS#8339)
       Tweak Win32 error message buffer
       Fix MDB_GET_BOTH on non-dup record (ITS#8393)
       Optimize mdb_drop
       Fix xcursors after mdb_cursor_del (ITS#8406)
       Fix MDB_NEXT_DUP after mdb_cursor_del (ITS#8412)
       Fix mdb_cursor_put resetting C_EOF (ITS#8489)
       Fix mdb_env_copyfd2 to return EPIPE on SIGPIPE (ITS#8504)
       Fix mdb_env_copy with empty DB (ITS#8209)
       Fix behaviors with fork (ITS#8505)
       Fix mdb_dbi_open with mainDB cursors (ITS#8542)
       Fix robust mutexes on kFreeBSD (ITS#8554)
       Fix utf8_to_utf16 error checks (ITS#7992)
       Fix F_NOCACHE on MacOS, error is non-fatal (ITS#7682)
       Build
               Make shared lib suffix overridable (ITS#8481)
       Documentation
               Cleanup doxygen nits
               Note reserved vs actual mem/disk usage
===================================================================

Comment 18 Erik Damrose

2019-10-30 12:17:31 CET

<http://errata.software-univention.de/ucs/4.4/327.html>
<http://errata.software-univention.de/ucs/4.4/328.html>