Univention Bugzilla – Bug 47677
Allow ISO 8859-1 encoded CSV files (automatically convert to UTF-8)
Last modified: 2019-03-16 22:29:24 CET
Created attachment 9640 [details] Screenshot of Traceback Microsoft's Excel (at least 2016) makes it surprisingly hard to save a CSV file with any other encoding then the default (for german versions of Excel that's "Western Europe (ISO 8859-1)" (aka "Latin-1")). The unofficial workaround is to use Notepad for the conversion: https://answers.microsoft.com/en-us/msoffice/forum/msoffice_excel-mso_windows8-mso_2013_release/how-can-i-save-a-csv-with-utf-8-encoding-using/12801501-c1e4-4a64-80d9-96b680b64cfe When uploading a Latin-1 encoded CSV file, the importer will fail with an UnicodeDecodeError Traceback (see attached screenshot). Therefore, the importer sould be able to automatically detect Latin-1 encoded CSV files and convert them to UTF-8 before proceeding.
Fixed in branch dtroeder/47677_handle_8859-1_csv_files. The encoding of a CSV file is now always retrieved first. A conversion is done while reading the file. f3bca6e45: Bug #47677: handle CSV files with all kinds of encoding 2018-09-17 18:00:35 DEBUG csv_reader.__init__:80 Reading '/tmp/test' with encoding 'utf-8-sig'. 2018-09-17 18:01:01 DEBUG csv_reader.__init__:80 Reading 'test_users_2018-09-17_10:44:12.csv' with encoding 'us-ascii'. 2018-09-17 18:01:26 DEBUG csv_reader.__init__:80 Reading 'test_users_2018-08-29_10:44:24.csv' with encoding 'utf-8'. 2018-09-17 18:13:44 DEBUG csv_reader.__init__:80 Reading 'Import_UCS_Schule03.csv' with encoding 'iso-8859-1'.
Unfortunately the current approach won't work, since libmagic only check the first x bytes of the file: : > /tmp/magictest for i in $(seq 1 9999) ; do echo 'Dies;Ist;ein;test;von einem kleinen;"Beispielskript mit 9999";Zeilen;die hier;aufgelistet werden; foo bla baz bar foo;bla baz' >> /tmp/magictest done echo "öäüÖÄÜß" >> /tmp/magictest python >>> import magic >>> magic.detect_from_filename('/tmp/magictest') FileMagic(mime_type='text/plain', encoding='us-ascii', name='ASCII text') >>> open('/tmp/magictest').read().decode('us-ascii') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1169883: ordinal not in range(128) >>> open('/tmp/magictest').read().decode('utf8')[0:10] u'Dies;Ist;e' So if there is an umlaut or something else encoded in latin1, utf8, ... after x bytes, the check will fail.
The complete file is now used to detect the encoding. 3f197425b Bug #47677: read complete file to detect encoding
Commit f3bca6e457f1211388810bcfc980ad48330a475b of branch dtroeder/47677_handle_8859-1_csv_files contains code that is, to my knowledge, part of Bug #47681
encoding = magic.Magic(mime_encoding=True).from_buffer(open(filename, 'rb').read()) This change does test the entire file content indeed, though it would be best to ensure that the file is properly closed after checking.
I removed the code from Bug #47681 (was a merge error). I made some tests, and the behavior with open() is like this: When the reference returned by open() is deleted, the file descriptor is closed. That happens right aber the .read(), as there is no variable for the FD. I tested this with: >>> open(..).read() # lsof -p $PID # → no open file # lsof -p $PID > /tmp/start >>> fd = open(..) # lsof -p $PID > /tmp/opened # diff /tmp/start /tmp/opened # + an open FD >>> del fd # lsof -p $PID > /tmp/fd_deleted # diff /tmp/opened /tmp/fd_deleted # - an open FD So all fine? I made some reading, and it turns out that this behavior is not guaranteed. CPython does reference counting - so the above works just fine. But other Python implementations may not work that way, and could leave the FD open. So in UCS we're probably fine, because we won't move away from CPython. But in the interest of interoperability I think it's good style to support others too. ec77fba9c Bug #47677: read file one once, make sure its closed
Exactly the not guaranteed behavior was what worried me. Looks fine for me now. If you build the code I can do the final testing and Verify.
Addad a changelog entry, merged, built and created an advisory. ucs-school-import (16.0.2-49)
Advisory & Changelog: OK I tested with import files in utf-8 and some different ISO encodings. All seemed to work fine. Some tests seem to fail because of the changes made here. Waiting for them to be fixed before verifying
[4.3] d7aece015 Bug #47677: handle CsvReader initialization without filename [4.3] 21eb7d5f7 Bug #47677: Sniffer.sniff() expects a string (with multiple delimiter chars) [4.3] aebfca5e9 Bug #47677: add type annotations [4.3] ef0fb4e2b Bug #47677: changelog [4.3] a769c0777 Bug #47677: advisory update
Changelog & Advisory: OK Test 90_ucsschool.114_importuser_udm_properties_udm_syntax_checks.test does not fail anymore: OK Import via command line of utf-8, ascii, ISO 8859-1 files works: OK Import via UMC of utf-8, ascii, ISO 8859-1 files works: OK
UCS@school 4.3 v6 has been released. https://docs.software-univention.de/changelog-ucsschool-4.3v6-de.html If this error occurs again, please clone this bug.