Bug 47587 - cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND
cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND
Status: NEW
Product: UCS
Classification: Unclassified
Component: Listener (univention-directory-listener)
UCS 4.3
Other Linux
: P5 normal (vote)
: ---
Assigned To: Arvid Requate
UCS maintainers
https://help.univention.com/t/recreat...
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2018-08-14 15:33 CEST by Arvid Requate
Modified: 2019-10-17 01:16 CEST (History)
8 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 3: Will affect average number of installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.429
Enterprise Customer affected?: Yes
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Ticket number: 2018080621000312, 2018081021000573, 2018081621000464, 2019013121000777, 2019020121000572, 2019092721000494
Bug group (optional):
Max CVSS v3 score:


Attachments
close_local_cursor_in_dntree_del_id.patch (796 bytes, patch)
2019-10-16 23:23 CEST, Arvid Requate
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Arvid Requate univentionstaff 2018-08-14 15:33:22 CEST
We had two support cases lately where the removal of a reverse DNS zone caused the listener replication to get stuck:

===============================================================================
10.08.18 17:19:15.927  LISTENER    ( PROCESS ) : updating 'zoneName=30.20.10.in-addr.arpa,cn=dns,dc=ucsdom,dc=local' command m
10.08.18 17:19:15.945  LISTENER    ( ERROR   ) : dntree_del_id: delete failed: subordinate objects must be deleted first
10.08.18 17:19:15.945  LISTENER    ( ERROR   ) : cache_update_master_entry: storing master entry in database failed
10.08.18 17:19:15.945  LISTENER    ( ERROR   ) : cache.c:429:cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND: Requested page not found (-30797)
10.08.18 17:19:15.945  LISTENER    ( WARN    ) : received signal 6
===============================================================================

A check with

 univention-directory-listener-dump -i

showed that the master entry is actually ok. So I assume that MDB_PAGE_NOTFOUND might be a red herring. It could be that the listener mishandles the MDB API in this situation. We should try to reproduce this.

In case this happens again, please try to raise the listener/debug/level to 4 and restart the listener before continuing with workarounds.
Comment 1 Arvid Requate univentionstaff 2018-08-21 14:03:41 CEST
From Ticket#: 2018081621000464 we received a level 4 listener.log:

=======================================================================
LISTENER (INFO ) : DNS: Removing zone 8.200.10.in-addr.arpa
LISTENER (INFO ) : handler: bind (successful)
LISTENER (INFO ) : deleted from cache: zoneName=8.200.10.in-addr.arpa,cn=dns,dc=schule,dc=local
LISTENER (ALL  ) : cache_delete_entry: Transaction begin
LISTENER (ALL  ) : blocking signals (was 1)
LISTENER (INFO ) : dntree_lookup_id4ldapdn: found id=5170
LISTENER (ERROR) : dntree_del_id: delete failed: subordinate objects must be deleted first
LISTENER (ALL  ) : unblocking signals (was 2)
LISTENER (ALL  ) : cache_update_entry: Transaction abort
LISTENER (ALL  ) : cache_delete_entry: Transaction begin
LISTENER (ALL  ) : blocking signals (was 1)
LISTENER (INFO ) : dntree_get_id4dn: DN zoneName=8.200.10.in-addr.arpa,cn=dns,dc=schulen,dc=local not in id2dn
LISTENER (ALL  ) : unblocking signals (was 2)
LISTENER (ALL  ) : cache_update_entry: Transaction abort
LISTENER (ALL  ) : unblocking signals (was 1)
LISTENER (ALL  ) : cache_update_master_entry: Transaction begin
LISTENER (ALL  ) : cache_update_master_entry: Transaction commit
LISTENER (ERROR) : cache_update_master_entry: storing master entry in database failed
LISTENER (ERROR) : cache.c:429:cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND: Requested page not found (-30797)
LISTENER (WARN ) : received signal 6
=======================================================================

Looks strange that dntree_get_id4dn doesn't find the entry in the "id2dn" table when dntree_lookup_id4ldapdn found id=5170, but I have not looked into the code for a while. Looks almost like the MDB_PAGE_NOTFOUND is somehow propagated to the next transaction, but that would be pretty strange and also cache_update_master_entry operates on the "id2entry" table.
Comment 2 bwesemann 2019-01-31 19:49:34 CET
For a customer that has this issue, what is the workaround mentioned?
Comment 3 Christian Völker univentionstaff 2019-02-01 10:41:48 CET
Happened again. Workaround is to re-join the affected servers. This will trigger a resync from scratch of the LDAP database and therefore overwrite the (at least partially) corrupted database of the listener.
Comment 4 Christina Scheinig univentionstaff 2019-02-04 16:02:24 CET
Happened again. But after a few days the new added Objects are replicated to the server
Comment 5 Arvid Requate univentionstaff 2019-02-21 20:31:07 CET
univention-join should fix the issue.
Comment 6 Arvid Requate univentionstaff 2019-07-24 19:54:23 CEST
Cannot reproduce this yet, tried by deleting reverse zone + subobjects via UMC on

* 4.2-3 errata259   (liblmdb0 0.9.14-1)
* 4.4-1 errata196   (liblmdb0 0.9.18-5)

On a 4.4-1 DC Master get this listener.log without the MDB_PAGE_NOTFOUND:
=========================================================================
24.07.19 15:11:09.382  LISTENER    ( INFO    ) : DNS: Removing zone 8.200.10.in-addr.arpa
24.07.19 15:11:09.415  LISTENER    ( INFO    ) : handler: bind (successful)
24.07.19 15:11:09.415  LISTENER    ( INFO    ) : handler: pkgdb-watch (skipped)
24.07.19 15:11:09.415  LISTENER    ( INFO    ) : handler: portal_groups (skipped)
24.07.19 15:11:09.415  LISTENER    ( INFO    ) : deleted from cache: zoneName=8.200.10.in-addr.arpa,cn=dns,dc=ar41i1,dc=qa
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : cache_delete_entry: Transaction begin
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : blocking signals (was 1)
24.07.19 15:11:09.415  LISTENER    ( INFO    ) : dntree_lookup_id4ldapdn: found id=295
24.07.19 15:11:09.415  LISTENER    ( ERROR   ) : dntree_del_id: delete failed: subordinate objects must be deleted first
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : unblocking signals (was 2)
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : cache_update_entry: Transaction abort
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : cache_delete_entry: Transaction begin
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : blocking signals (was 1)
24.07.19 15:11:09.415  LISTENER    ( INFO    ) : dntree_get_id4dn: DN zoneName=8.200.10.in-addr.arpa,cn=dns,dc=ar41i1,dc=qa not in id2dn
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : unblocking signals (was 2)
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : cache_update_entry: Transaction abort
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : unblocking signals (was 1)
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : cache_update_master_entry: Transaction begin
24.07.19 15:11:09.415  LISTENER    ( ALL     ) : cache_update_master_entry: Transaction commit
24.07.19 15:11:09.422  LISTENER    ( INFO    ) : Last Notifier ID: 1482
=========================================================================

Two explanations for the details of this log:

1) The message "dntree_get_id4dn: DN zoneName=[...] not in id2dn" occurs because the listener first tries to remove the lowercase version and then the mixed case version, which is not present. That's ok.

2) The "dntree_del_id: delete failed: subordinate objects must be deleted first" occurs when deleting a DNS Zone which contains some subobjects. In that case, the listener repeatedly receives these events:

* delete of a child DNS object
* modification of the SOA-record of the zone (version number bump)

For each of these SOA modifications the listener looks for the object in LDAP and - depending on timing - doesn't find it any more because the UDM/LDAP-Server is already finished with deleting the whole subtree. In a case like that, the Listener treats the modifcation as a delete and calls change_delete for the zone object. In this case the cache_delete_entry function (or rather dntree_del_id) detects that there are still subobjects in the cache and refuses to delete the zone object from the cache. That decision not too bad I guess. Finally, when the Listner processes the removeal of the last child object, the cache_delete_entry actually works. And when it finally receives the actual LDAP delete operation for the zone object itself, the zone object is already gone from the Listener cache and that's fine.

But all of this doesn't explain the MDB_PAGE_NOTFOUND.
Comment 7 Ingo Steuwer univentionstaff 2019-08-13 08:51:07 CEST
Did this ever happen on systems where the last initialization of the database was done with UCS 4.4-0 or newer?

What I would like to understand is whether this might be an after-effect of an database corruption in older versions.
Comment 8 Christian Völker univentionstaff 2019-09-27 13:03:14 CEST
As this happens frequently on customers as soon as deleting reverse DNS zones I raise "affected by".

Especially in larger environments the workaround to re-join the systems is not a suitable way.

We need to fix this. A listener.log in Level 4 will be attached here shortly.
Comment 9 Philipp Hahn univentionstaff 2019-09-29 09:26:49 CEST
For a urgent support case I started writing <https://help.univention.com/t/howto-update-listener-cachemasterentry/13149>, which contains a Python script to read and update the CacheMasterEntry.
Comment 10 Stefan Gohmann univentionstaff 2019-09-30 09:06:22 CEST
It happened again in a customer environment on 2 backups and 5 slaves. The customer has a lot more systems.

root@backup:~# grep -A6 -B1  8431468 /var/lib/univention-ldap/notify/transaction
8431467 uid=xxxxxx,cn=users,dc=foo,dc=de m
8431468 relativeDomainName=10.4,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
8431469 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m
8431470 relativeDomainName=11.4,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
8431471 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m
8431472 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
8431473 uid=yyyyyy,cn=users,dc=foo,dc=de m
8431474 uid=yyyyyy,cn=users,dc=foo,dc=de m
root@backup:~# 


root@backup:/var/log/univention# cat /var/lib/univention-directory-listener/notifier_id ; echo
8431468
root@backup:/var/log/univention#  

27.09.19 17:48:03.724  LISTENER    ( PROCESS ) : updating 'zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de' command m
27.09.19 17:48:03.746  LISTENER    ( ERROR   ) : dntree_del_id: delete failed: subordinate objects must be deleted first
27.09.19 17:48:03.746  LISTENER    ( ERROR   ) : cache_update_master_entry: storing master entry in database failed
27.09.19 17:48:03.746  LISTENER    ( ERROR   ) : cache.c:429:cache_update_master_entry mdb_txn_commit: failed: MDB_PAGE_NOTFOUND: Requested page not found (-30797)
27.09.19 17:48:03.746  LISTENER    ( WARN    ) : received signal 6 

root@backup:~# grep 'zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de' /var/lib/univention-ldap/notify/transaction
8431459 relativeDomainName=200.50,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
8431460 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m
8431468 relativeDomainName=10.4,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
8431469 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m
8431470 relativeDomainName=11.4,zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
8431471 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de m
8431472 zoneName=153.184.in-addr.arpa,cn=dns,dc=foo,dc=de d
root@backup:~#
Comment 12 Arvid Requate univentionstaff 2019-10-16 23:23:43 CEST
Created attachment 10207 [details]
close_local_cursor_in_dntree_del_id.patch

After reading LMDB documentation very thoroughly and checking the code I could only find this one missing mdb_cursor_close. This is in the active code path, so it might affect things. According to the documentation it should not matter because we are calling mdb_txn_abort shortly after in one of the calling functions (cache_delete_entry).

I also see some potential to use less of lmdb special functions in the code path. The chain of derived data is pretty long in this case:


cache_delete_entry does:
 * mdb_txn_begin
 * mdb_cursor_open to obtain a cursor for the id2dn subdatabase
 * cache_delete_entry_in_transaction, passing id2dn_write_cursor_p in 

cache_delete_entry_in_transaction does:
 * mdb_cursor_txn to extract the transaction handle again from the id2dn cursor
 * mdb_del on the cache entry from the id2entry subdatabase
 * dntree_del_id, passing the id2dn cursor down to it

dntree_del_id does:
 * mdb_cursor_txn to extract the transaction handle again from the id2dn cursor
 * mdb_cursor_dbi to extract the database handle from the id2dn cursor
 * mdb_cursor_open to obtain a *second* cursor for the id2dn subdatabase
 * then it returns from this function in the situation of this bug
   logging "subordinate objects must be deleted first"
   and doesn't free the second cursor for the id2dn subdatabase

I'll stop here with this description because this is the point that the attached patch fixes.

Maybe we could additionally restructure the code to pass both the transaction handle and the database handle down through the functions and each function opens and closes cursors as it needs. lmdb 0.9.18 had at least one issue with cursors ( www.openldap.org/its/index.cgi/Software Bugs?id=8542 ), so that might be a good idea anyway. Refectoring the cache code is a piece of work tough.

We could also think about updating to a newer lmdb version, but I've not seen any obviously related commit in the upstream repository.
Comment 13 Arvid Requate univentionstaff 2019-10-17 01:16:31 CEST
The central part comes down to this error message:

  mdb_txn_commit: failed: MDB_PAGE_NOTFOUND

So this doesn't come from the mdb_put, which is used in cache_update_master_entry to write the new CacheMasterEntry structure at position 0 of the id2entry subdatabase. That would probably also emit "MDB_NOTFOUND" instead.

Instead it comes in the commit after this put.


The documentation says about MDB_PAGE_NOTFOUND:

http://www.lmdb.tech/doc/group__errors.html#gabeb52e4c4be21b329e31c4add1b71926
==============================================================
#define MDB_PAGE_NOTFOUND   (-30797)

Requested page not found - this usually indicates corruption
==============================================================

So something has gone haywire in the DB handling, but apparently it's not on disc, otherwise Philipp's workaround would probably not have made it work again.


If only we could reproduce it live, we could attach with gdb and breakpoint at mdb_page_get, which is the only function in the mdb source code which returns MDB_PAGE_NOTFOUND. Then we could at least get a "bt full" output to see the call stack. There is a compile time macro "MDB_DEBUG" which we could set to 1 to gather audit output, but that's probably nothing you want in production.