Bug 56728 - UnicodeDecodeError when trying to import non-unicode file (school import)
Status: CLOSED FIXED
Product: UCS@school
Classification: Unclassified
Component: Import scripts
Version: UCS@school 5.0
Hardware: Other Linux
Importance: P5 normal
Target Milestone: UCS@school 5.0 v4-errata
Assigned To: Johannes Königer
QA Contact: Alexander Steffen
Depends on:
Blocks:
Reported: 2023-10-12 10:45 CEST by Jannik Ahlers
Modified: 2023-12-19 16:01 CET

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 1: Will affect a very few installed domains
How will those affected feel about the bug?: 2: A Pain – users won’t like this once they notice it
User Pain: 0.057
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Description Jannik Ahlers univentionstaff 2023-10-12 10:45:55 CEST
In a customer environment, user import files are encrypted with PGP and then decrypted by a custom HttpApiCsvReader class.
Currently, the import fails if the encrypted file is a binary file:

2023-10-12 09:18:25 INFO  cmdline.prepare_import:197  ------ UCS@school import tool starting ------
2023-10-12 09:18:25 INFO  cmdline.prepare_import:199  Import started by HTTP API (class 'HttpApiImportFrontend').
2023-10-12 09:18:25 ERROR tasks.run_import_job:111  An error occurred while preparing the import job: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/ucsschool/http_api/import_api/tasks.py", line 104, in run_import_job
    runner.prepare_import()
  File "/usr/lib/python3/dist-packages/ucsschool/importer/frontend/cmdline.py", line 203, in prepare_import
    line = fin.readline()
  File "/usr/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte


The importer logs the first line of the file it is about to import and fails if the file is not valid UTF-8.
This worked in UCS 4 and likely broke with the upgrade to Python 3.
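
A minimal sketch of the failure, assuming the importer's text-mode open() ends up decoding as UTF-8. The encoding is pinned explicitly here so the result does not depend on the locale, and the function name and sample bytes are made up for illustration:

```python
import os
import tempfile


def reproduce_decode_error():
    """Read the first line of a binary file in UTF-8 text mode.

    Returns the error message, or None if no error occurred.
    """
    # Hypothetical payload: 0x85 is not a valid UTF-8 start byte.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\x85\x01binary payload\n")
        path = tmp.name
    try:
        # Text mode: the bytes are decoded before readline() returns.
        with open(path, encoding="utf-8") as fin:
            fin.readline()
    except UnicodeDecodeError as exc:
        return str(exc)
    finally:
        os.remove(path)
    return None


if __name__ == "__main__":
    print(reproduce_decode_error())
```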
Comment 1 Jannik Ahlers univentionstaff 2023-10-20 15:57:35 CEST
This happens not only with binary files but with any file that is not UTF-8 encoded.
The customer wants to import a file encoded in ISO-8859-1 (Latin-1), which results in the same error.
I changed the title to reflect that.
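
To illustrate why a Latin-1 file triggers the same error, here is a small sketch with an invented CSV row (the sample data and the helper name are assumptions, not taken from the customer file):

```python
# Hypothetical CSV row containing umlauts, encoded as ISO-8859-1.
SAMPLE = "gym1;Jörg;Müller;5a\n".encode("iso-8859-1")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for this encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    # "ö" is the single byte 0xF6 in Latin-1, which is not valid UTF-8.
    print(try_decode(SAMPLE, "utf-8"))
    print(try_decode(SAMPLE, "iso-8859-1"))
```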
Comment 2 Ole Schwiegert univentionstaff 2023-10-24 12:03:01 CEST
This is indeed a bug. The importer generally supports different encodings and has methods to detect the encoding from the file. However:

At the spot where the exception is raised, we log the first line of the import file for debugging purposes (verified by Daniel Tröder). This is useful for customers and support, because imports have often failed due to a faulty header.

We use a plain open() call, which reads the file in text mode and therefore decodes it with the default encoding, which is UTF-8 here. Since the line is only read for debugging purposes, the solution is quite simple:


diff --git a/ucs-school-import/modules/ucsschool/importer/frontend/cmdline.py b/ucs-school-import/modules/ucsschool/importer/frontend/cmdline.py
index 5136af597..3dff8743a 100755
--- a/ucs-school-import/modules/ucsschool/importer/frontend/cmdline.py
+++ b/ucs-school-import/modules/ucsschool/importer/frontend/cmdline.py
@@ -199,7 +199,7 @@ class CommandLine(object):
             "Import started by %s (class %r).", self.import_initiator, self.__class__.__name__
         )
 
-        with open(self.config["input"]["filename"]) as fin:
+        with open(self.config["input"]["filename"], "rb") as fin:
             line = fin.readline()
             self.logger.info("First line of %r:\n%r", self.config["input"]["filename"], line)




We just have to open the file in binary mode ("rb") instead, so that no decoding takes place here.
Until this bug is fixed, this patch can be applied in the customer environment as a temporary workaround.
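
For illustration, a standalone sketch of the same idea outside the importer. The logger name, function names and sample file are hypothetical; only the open(..., "rb") call and the %r log format mirror the patch:

```python
import logging
import os
import tempfile

logger = logging.getLogger("import_demo")  # hypothetical logger name


def log_first_line(filename, log=logger):
    """Log the first line of a file as raw bytes, regardless of its encoding."""
    with open(filename, "rb") as fin:
        line = fin.readline()  # bytes, no decoding takes place
    # %r renders the bytes as an escaped literal, e.g. b'Stra\xdfe\n'.
    log.info("First line of %r:\n%r", filename, line)
    return line


def demo():
    """Write a Latin-1 encoded header to a temporary file and log its first line."""
    data = "Schule;Name;Straße\ngym1;Müller;Hauptstraße 1\n".encode("iso-8859-1")
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        return log_first_line(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    demo()
```

Because the line is logged with %r, non-ASCII bytes appear as escape sequences instead of raising an error, which is sufficient for checking the header.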
Comment 5 Johannes Königer univentionstaff 2023-12-15 09:39:54 CET
Fixed with ucs-school-import (18.0.40)
34c30aa9727dca72ee1c5853b16d2c85dcb3c992
Comment 6 Johannes Königer univentionstaff 2023-12-19 16:01:56 CET
Errata updates for UCS@school 5.0 v4 have been released.

https://docs.software-univention.de/ucsschool-changelog/5.0v4/en/changelog.html
https://docs.software-univention.de/ucsschool-changelog/5.0v4/de/changelog.html

If this error occurs again, please clone this bug.