Bug 47677 - Allow ISO 8859-1 encoded CSV files (automatically convert to UTF-8)
Allow ISO 8859-1 encoded CSV files (automatically convert to UTF-8)
Status: CLOSED FIXED
Product: UCS@school
Classification: Unclassified
Component: HTTP-API (Kelvin)
UCS@school 4.3
Other Windows NT
: P5 normal (vote)
: UCS@school 4.3 v6
Assigned To: Daniel Tröder
Ole Schwiegert
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2018-08-27 14:25 CEST by Michael Grandjean
Modified: 2019-03-16 22:29 CET (History)
6 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 2: Will only affect a few installed domains
How will those affected feel about the bug?: 4: A User would return the product
User Pain: 0.229
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments
Screenshot of Traceback (479.59 KB, image/png)
2018-08-27 14:25 CEST, Michael Grandjean
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Grandjean univentionstaff 2018-08-27 14:25:58 CEST
Created attachment 9640 [details]
Screenshot of Traceback

Microsoft's Excel (at least 2016) makes it surprisingly hard to save a CSV file with any other encoding then the default (for german versions of Excel that's "Western Europe (ISO 8859-1)" (aka "Latin-1")).

The unofficial workaround is to use Notepad for the conversion:
https://answers.microsoft.com/en-us/msoffice/forum/msoffice_excel-mso_windows8-mso_2013_release/how-can-i-save-a-csv-with-utf-8-encoding-using/12801501-c1e4-4a64-80d9-96b680b64cfe

When uploading a Latin-1 encoded CSV file, the importer will fail with an UnicodeDecodeError Traceback (see attached screenshot).

Therefore, the importer sould be able to automatically detect Latin-1 encoded CSV files and convert them to UTF-8 before proceeding.
Comment 1 Daniel Tröder univentionstaff 2018-09-17 18:14:35 CEST
Fixed in branch dtroeder/47677_handle_8859-1_csv_files. The encoding of a CSV file is now always retrieved first. A conversion is done while reading the file.

f3bca6e45: Bug #47677: handle CSV files with all kinds of encoding


2018-09-17 18:00:35 DEBUG csv_reader.__init__:80  Reading '/tmp/test' with encoding 'utf-8-sig'.
2018-09-17 18:01:01 DEBUG csv_reader.__init__:80  Reading 'test_users_2018-09-17_10:44:12.csv' with encoding 'us-ascii'.
2018-09-17 18:01:26 DEBUG csv_reader.__init__:80  Reading 'test_users_2018-08-29_10:44:24.csv' with encoding 'utf-8'.
2018-09-17 18:13:44 DEBUG csv_reader.__init__:80  Reading 'Import_UCS_Schule03.csv' with encoding 'iso-8859-1'.
Comment 2 Sönke Schwardt-Krummrich univentionstaff 2018-09-18 10:15:33 CEST
Unfortunately the current approach won't work, since libmagic only check the first x bytes of the file:

: > /tmp/magictest
for i in $(seq 1 9999) ; do 
   echo 'Dies;Ist;ein;test;von einem kleinen;"Beispielskript mit 9999";Zeilen;die hier;aufgelistet werden; foo bla baz bar foo;bla baz' >> /tmp/magictest
done
echo "öäüÖÄÜß" >> /tmp/magictest

python
>>> import magic
>>> magic.detect_from_filename('/tmp/magictest')
FileMagic(mime_type='text/plain', encoding='us-ascii', name='ASCII text')
>>> open('/tmp/magictest').read().decode('us-ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1169883: ordinal not in range(128)
>>> open('/tmp/magictest').read().decode('utf8')[0:10]
u'Dies;Ist;e'

So if there is an umlaut or something else encoded in latin1, utf8, ... after x bytes, the check will fail.
Comment 3 Daniel Tröder univentionstaff 2018-09-18 12:40:45 CEST
The complete file is now used to detect the encoding.

3f197425b Bug #47677: read complete file to detect encoding
Comment 4 Ole Schwiegert univentionstaff 2018-09-25 07:43:53 CEST
Commit f3bca6e457f1211388810bcfc980ad48330a475b of branch dtroeder/47677_handle_8859-1_csv_files contains code that is, to my knowledge, part of Bug #47681
Comment 5 Ole Schwiegert univentionstaff 2018-09-25 08:08:58 CEST
encoding = magic.Magic(mime_encoding=True).from_buffer(open(filename, 'rb').read())

This change does test the entire file content indeed, though it would be best to ensure that the file is properly closed after checking.
Comment 6 Daniel Tröder univentionstaff 2018-09-25 08:45:34 CEST
I removed the code from Bug #47681 (was a merge error).

I made some tests, and the behavior with open() is like this:

When the reference returned by open() is deleted, the file descriptor is closed. That happens right aber the .read(), as there is no variable for the FD.
I tested this with:

>>> open(..).read()
# lsof -p $PID
# → no open file

# lsof -p $PID > /tmp/start
>>> fd = open(..)
# lsof -p $PID > /tmp/opened
# diff /tmp/start /tmp/opened
# + an open FD
>>> del fd
# lsof -p $PID > /tmp/fd_deleted
# diff /tmp/opened /tmp/fd_deleted
# - an open FD

So all fine? I made some reading, and it turns out that this behavior is not guaranteed. CPython does reference counting - so the above works just fine. But other Python implementations may not work that way, and could leave the FD open. So in UCS we're probably fine, because we won't move away from CPython. But in the interest of interoperability I think it's good style to support others too.

ec77fba9c Bug #47677: read file one once, make sure its closed
Comment 7 Ole Schwiegert univentionstaff 2018-09-25 09:49:48 CEST
Exactly the not guaranteed behavior was what worried me. Looks fine for me now. If you build the code I can do the final testing and Verify.
Comment 8 Daniel Tröder univentionstaff 2018-09-25 11:59:37 CEST
Addad a changelog entry, merged, built and created an advisory.

ucs-school-import (16.0.2-49)
Comment 9 Ole Schwiegert univentionstaff 2018-09-26 08:59:54 CEST
Advisory & Changelog: OK
I tested with import files in utf-8 and some different ISO encodings. All seemed to work fine.

Some tests seem to fail because of the changes made here. Waiting for them to be fixed before verifying
Comment 10 Daniel Tröder univentionstaff 2018-09-26 10:19:43 CEST
[4.3] d7aece015 Bug #47677: handle CsvReader initialization without filename
[4.3] 21eb7d5f7 Bug #47677: Sniffer.sniff() expects a string (with multiple delimiter chars)
[4.3] aebfca5e9 Bug #47677: add type annotations
[4.3] ef0fb4e2b Bug #47677: changelog
[4.3] a769c0777 Bug #47677: advisory update
Comment 11 Ole Schwiegert univentionstaff 2018-09-28 08:58:15 CEST
Changelog & Advisory: OK
Test 90_ucsschool.114_importuser_udm_properties_udm_syntax_checks.test does not fail anymore: OK
Import via command line of utf-8, ascii, ISO 8859-1 files works: OK
Import via UMC of utf-8, ascii, ISO 8859-1 files works: OK
Comment 12 Sönke Schwardt-Krummrich univentionstaff 2018-11-16 11:48:09 CET
UCS@school 4.3 v6 has been released.

https://docs.software-univention.de/changelog-ucsschool-4.3v6-de.html

If this error occurs again, please clone this bug.