Bug 56423 - Inconsistent diskcache in id-connector stops replication
Status: NEW
Product: UCS@school
Classification: Unclassified
Component: ucsschool-id-connector
Version: UCS@school 5.0
Hardware: Other Linux
Importance: P5 normal
Target Milestone: ---
Assigned To: UCS@school maintainers
Depends on:
Blocks:
Reported: 2023-08-11 17:21 CEST by Erik Damrose
Modified: 2023-09-13 16:25 CEST
CC List: 4 users

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 1: Will affect a very few installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.143
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2023081021000198
Bug group (optional):
Max CVSS v3 score:



Description Erik Damrose univentionstaff 2023-08-11 17:21:11 CEST
If the diskcache in the id-connector is inconsistent, e.g. due to a power outage, items from the cache cannot be parsed. In queues.log:

2023-08-11 15:13:20 ERROR [InQueue.distribute_loop:356] During preprocessing of file (1/51132) '2023-08-10-17-43-59-023977.json': Ran out of input
Traceback (most recent call last):
  File "/ucsschool-id-connector/src/ucsschool_id_connector/queues.py", line 327, in distribute_loop
    new_path = await self.preprocess_file(path)
  File "/ucsschool-id-connector/src/ucsschool_id_connector/queues.py", line 288, in preprocess_file
    changed |= any([await coro for coro in result_coros])
  File "/ucsschool-id-connector/src/ucsschool_id_connector/queues.py", line 288, in <listcomp>
    changed |= any([await coro for coro in result_coros])
  File "/ucsschool-id-connector/src/plugins/plugins/listener_objects.py", line 172, in preprocess_add_mod_object
    obj.old_data = self.get_old_data(obj)
  File "/ucsschool-id-connector/src/plugins/plugins/listener_objects.py", line 139, in get_old_data
    return self.old_data_db.get(obj.id)
  File "/ucsschool-id-connector/src/ucsschool_id_connector/db.py", line 82, in get
    value = self._cache.get(key, default, *args, **kwargs)
  File "/usr/lib/python3.8/site-packages/diskcache/core.py", line 1173, in get
    value = self._disk.fetch(mode, filename, db_value, read)
  File "/usr/lib/python3.8/site-packages/diskcache/core.py", line 282, in fetch
    return pickle.load(reader)
EOFError: Ran out of input

Each time the LDAP object with the inconsistent cache entry is processed, the InQueue stops. As an administrator, you can move the JSON file mentioned in the log file out of the way, but the problem might reoccur if the object is modified again.

To me it is not obvious how this situation should be handled. The old object in the cache is invalid. Should we simply remove the corrupt cache entry? Should we write the current new object to the cache?
How would the ID Connector plugin logic deal with each solution?

As a workaround, the corrupt entries can be removed from the cache:
# univention-app shell ucsschool-id-connector
# python3
from ucsschool_id_connector.db import OldDataDB
from ucsschool_id_connector.models import ListenerUserOldDataEntry
from ucsschool_id_connector.constants import OLD_DATA_DB_PATH

odb = OldDataDB(OLD_DATA_DB_PATH, ListenerUserOldDataEntry)
# Sanity check: should roughly match the size of
# /var/lib/univention-appcenter/apps/ucsschool-id-connector/data/old_data_db/cache.db on the filesystem
odb._cache.volume()

for key in odb._cache:
    try:
        # Reading the value is what fails for corrupt entries (pickle.load -> EOFError)
        odb._cache.get(key)
    except EOFError:
        print(key)
        # del odb._cache[key]  # uncomment if the corrupt entries should be deleted

odb._cache.close()
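
Additionally, diskcache ships a built-in consistency check that could be run on the same cache. Sketch only, untested: check() verifies the SQLite metadata and the referenced value files, so it may not catch a truncated pickle like the one in the traceback above, and it assumes OLD_DATA_DB_PATH points at the cache directory.

import diskcache
from ucsschool_id_connector.constants import OLD_DATA_DB_PATH

cache = diskcache.Cache(str(OLD_DATA_DB_PATH))
# check() returns a list of warnings; fix=True lets diskcache repair what it can
for warning in cache.check(fix=False):
    print(warning)
cache.close()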
Comment 1 Daniel Tröder univentionstaff 2023-08-14 09:06:04 CEST
Corrupted "old data" is problematic.
Move and remove operations cannot be performed safely without it, as the old state of the object is unknown.

1. IMHO we should modify the documentation to inform operators of this, so they can employ a backup & restore strategy.

2. A cron job may also be installed that automatically creates daily backups in /var/backup on the host (see the backup sketch at the end of this comment).

3. The code should be made more robust to handle this kind of situation by removing the inconsistent entry and doing a "resync" of the affected object. It should inform the operator with an ERROR message that the object may not be synced correctly anymore (incomplete moves/removes). A rough sketch follows below.
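
For (2), a minimal sketch of what such a backup job could look like; the paths, the daily granularity and the plain directory copy are assumptions, and the connector should be stopped or idle while copying so the SQLite file is captured in a consistent state:

#!/usr/bin/env python3
"""Sketch: daily backup of the id-connector old_data_db cache, run from cron on the host."""
import datetime
import pathlib
import shutil

SRC = pathlib.Path("/var/lib/univention-appcenter/apps/ucsschool-id-connector/data/old_data_db")
DST_ROOT = pathlib.Path("/var/backup/ucsschool-id-connector")

def main() -> None:
    DST_ROOT.mkdir(parents=True, exist_ok=True)
    dst = DST_ROOT / f"old_data_db-{datetime.date.today().isoformat()}"
    # Copy the whole cache directory (cache.db plus the value files it references).
    # Note: fails if today's backup already exists; add rotation/retention as needed.
    shutil.copytree(SRC, dst)

if __name__ == "__main__":
    main()

For (3), a rough sketch of how get_old_data() in plugins/listener_objects.py could be hardened. Only self.old_data_db, obj.id and the _cache attribute are taken from the traceback and the workaround above; the logger attribute and the delete-on-corruption behaviour are assumptions, and a resync of the affected object would still have to be triggered separately:

def get_old_data(self, obj):
    try:
        return self.old_data_db.get(obj.id)
    except EOFError:
        # Corrupt pickle in the diskcache: log loudly and drop the entry so
        # the InQueue does not get stuck on this object again (assumption).
        self.logger.error(
            "Cached old data for %r is corrupt and will be removed; "
            "moves/removes of this object may not be synced correctly.",
            obj.id,
        )
        try:
            del self.old_data_db._cache[obj.id]
        except KeyError:
            pass
        return None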