Univention Bugzilla – Bug 56728
UnicodeDecodeError when trying to import non-unicode file (school import)
Last modified: 2023-12-19 16:01:56 CET
In a customer environment, user import files are encrypted with PGP and then get decrypted using a custom HttpApiCsvReader class. Currently, this fails if the encrypted file is a binary file:

2023-10-12 09:18:25 INFO  cmdline.prepare_import:197  ------ UCS@school import tool starting ------
2023-10-12 09:18:25 INFO  cmdline.prepare_import:199  Import started by HTTP API (class 'HttpApiImportFrontend').
2023-10-12 09:18:25 ERROR tasks.run_import_job:111  An error occurred while preparing the import job: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/ucsschool/http_api/import_api/tasks.py", line 104, in run_import_job
    runner.prepare_import()
  File "/usr/lib/python3/dist-packages/ucsschool/importer/frontend/cmdline.py", line 203, in prepare_import
    line = fin.readline()
  File "/usr/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte

The importer tries to print the first line from the file it wants to import, and fails if it is not a valid utf-8 file. In UCS 4 this used to work, and likely broke because of the upgrade to Python 3.
This happens not only with binary files, but with any input file that is not valid utf-8. The customer wants to import a file that is encoded in iso-8859-1/latin-1, which results in the same error. I changed the title to reflect that.
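The failure can be reproduced outside the importer. The following is a minimal sketch, not the importer's code: the CSV header, the helper names and the temporary file are made up for illustration, and encoding="utf-8" is passed explicitly so the sketch behaves the same under any locale.

```python
import os
import tempfile


def first_line_as_text(path):
    """Read the first line in text mode, decoding the file as UTF-8."""
    with open(path, encoding="utf-8") as fin:
        return fin.readline()


def reproduce():
    """Write a latin-1 encoded CSV and return the error raised when reading it as UTF-8."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as fout:
            # "\xdf" is the letter sharp s; in ISO-8859-1 it is the single byte 0xdf,
            # which is not a valid UTF-8 sequence when followed by an ASCII letter.
            fout.write("Schule;Vorname;Stra\xdfe\n".encode("iso-8859-1"))
        try:
            first_line_as_text(path)
        except UnicodeDecodeError as exc:
            return exc
        return None
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(repr(reproduce()))
```

Calling reproduce() returns a UnicodeDecodeError for the 'utf-8' codec, the same exception type as in the traceback above.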
This is indeed a bug. The importer generally supports different encodings and has methods to detect the encoding from the file.

BUT: At the spot where we get the exception, we print the first line of the import file for debugging purposes (verified by Daniel Tröder). This is useful for customers and support, because imports have often failed due to a faulty header. We do a simple open() call, which reads the file in text mode, which in turn defaults to the locale encoding (utf-8 here).

Since this is for debugging purposes only, the solution is quite simple:

diff --git a/ucs-school-import/modules/ucsschool/importer/frontend/cmdline.py b/ucs-school-import/modules/ucsschool/importer/frontend/cmdline.py
index 5136af597..3dff8743a 100755
--- a/ucs-school-import/modules/ucsschool/importer/frontend/cmdline.py
+++ b/ucs-school-import/modules/ucsschool/importer/frontend/cmdline.py
@@ -199,7 +199,7 @@ class CommandLine(object):
             "Import started by %s (class %r).", self.import_initiator, self.__class__.__name__
         )
-        with open(self.config["input"]["filename"]) as fin:
+        with open(self.config["input"]["filename"], "rb") as fin:
             line = fin.readline()
             self.logger.info("First line of %r:\n%r", self.config["input"]["filename"], line)

We just have to open the file in binary mode ("rb") instead, so that no decoding takes place and encoding problems cannot occur here.

Until this bug is fixed, you could apply this patch in the customer environment to fix the problem temporarily.
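As a standalone illustration of the binary-mode approach, here is a minimal sketch. The function log_first_line, the logger name and the sample file are hypothetical and not part of ucs-school-import; only open(..., "rb"), readline() and the %r log format are taken from the patch.

```python
import logging
import os
import tempfile

logger = logging.getLogger("import_sketch")  # hypothetical logger name


def log_first_line(path):
    """Log the raw first line of a file without decoding it, and return it."""
    with open(path, "rb") as fin:
        line = fin.readline()  # bytes up to and including the first b"\n"
    # %r prints the bytes literal, so undecodable bytes show up as \x.. escapes
    logger.info("First line of %r:\n%r", path, line)
    return line


def demo():
    """Run log_first_line() on a small latin-1 encoded sample file."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as fout:
            fout.write("Schule;Stra\xdfe\nDemo;Hauptstra\xdfe 1\n".encode("iso-8859-1"))
        return log_first_line(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    demo()
```

Because no decoding is involved, this works for latin-1 files and for arbitrary binary input alike. The trade-off is that non-ASCII characters in the logged header appear as byte escapes rather than as readable text.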
Fixed with ucs-school-import (18.0.40) 34c30aa9727dca72ee1c5853b16d2c85dcb3c992
Errata updates for UCS@school 5.0 v4 have been released.

https://docs.software-univention.de/ucsschool-changelog/5.0v4/en/changelog.html
https://docs.software-univention.de/ucsschool-changelog/5.0v4/de/changelog.html

If this error occurs again, please clone this bug.