47677 – Allow ISO 8859-1 encoded CSV files (automatically convert to UTF-8)

Bug 47677 - Allow ISO 8859-1 encoded CSV files (automatically convert to UTF-8)

Summary: Allow ISO 8859-1 encoded CSV files (automatically convert to UTF-8)

Status:	CLOSED FIXED

Alias:	None

Product:	UCS@school
Classification:	Unclassified
Component:	HTTP-API (Kelvin)
Version:	UCS@school 4.3
Hardware:	Other Windows NT

Importance:	P5 normal
Target Milestone:	UCS@school 4.3 v6
Assignee:	Daniel Tröder
QA Contact:	Ole Schwiegert

URL:
Keywords:

Depends on:
Blocks:

Reported:	2018-08-27 14:25 CEST by Michael Grandjean
Modified:	2019-03-16 22:29 CET (History)
CC List:	6 users (show)

See Also:	47305
What kind of report is it?:	Bug Report
What type of bug is this?:	5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?:	2: Will only affect a few installed domains
How will those affected feel about the bug?:	4: A User would return the product
User Pain:	0.229
Enterprise Customer affected?:
School Customer affected?:	Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Customer ID:	02149
Max CVSS v3 score:

Attachments
Screenshot of Traceback (479.59 KB, image/png) 2018-08-27 14:25 CEST, Michael Grandjean	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Michael Grandjean

2018-08-27 14:25:58 CEST

Created attachment 9640 [details]
Screenshot of Traceback

Microsoft's Excel (at least 2016) makes it surprisingly hard to save a CSV file with any other encoding then the default (for german versions of Excel that's "Western Europe (ISO 8859-1)" (aka "Latin-1")).

The unofficial workaround is to use Notepad for the conversion:
https://answers.microsoft.com/en-us/msoffice/forum/msoffice_excel-mso_windows8-mso_2013_release/how-can-i-save-a-csv-with-utf-8-encoding-using/12801501-c1e4-4a64-80d9-96b680b64cfe

When uploading a Latin-1 encoded CSV file, the importer will fail with an UnicodeDecodeError Traceback (see attached screenshot).

Therefore, the importer sould be able to automatically detect Latin-1 encoded CSV files and convert them to UTF-8 before proceeding.

Comment 1 Daniel Tröder

2018-09-17 18:14:35 CEST

Fixed in branch dtroeder/47677_handle_8859-1_csv_files. The encoding of a CSV file is now always retrieved first. A conversion is done while reading the file.

f3bca6e45: Bug #47677: handle CSV files with all kinds of encoding


2018-09-17 18:00:35 DEBUG csv_reader.__init__:80  Reading '/tmp/test' with encoding 'utf-8-sig'.
2018-09-17 18:01:01 DEBUG csv_reader.__init__:80  Reading 'test_users_2018-09-17_10:44:12.csv' with encoding 'us-ascii'.
2018-09-17 18:01:26 DEBUG csv_reader.__init__:80  Reading 'test_users_2018-08-29_10:44:24.csv' with encoding 'utf-8'.
2018-09-17 18:13:44 DEBUG csv_reader.__init__:80  Reading 'Import_UCS_Schule03.csv' with encoding 'iso-8859-1'.

Comment 2 Sönke Schwardt-Krummrich

2018-09-18 10:15:33 CEST

Unfortunately the current approach won't work, since libmagic only check the first x bytes of the file:

: > /tmp/magictest
for i in $(seq 1 9999) ; do 
   echo 'Dies;Ist;ein;test;von einem kleinen;"Beispielskript mit 9999";Zeilen;die hier;aufgelistet werden; foo bla baz bar foo;bla baz' >> /tmp/magictest
done
echo "öäüÖÄÜß" >> /tmp/magictest

python
>>> import magic
>>> magic.detect_from_filename('/tmp/magictest')
FileMagic(mime_type='text/plain', encoding='us-ascii', name='ASCII text')
>>> open('/tmp/magictest').read().decode('us-ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1169883: ordinal not in range(128)
>>> open('/tmp/magictest').read().decode('utf8')[0:10]
u'Dies;Ist;e'

So if there is an umlaut or something else encoded in latin1, utf8, ... after x bytes, the check will fail.

Comment 3 Daniel Tröder

2018-09-18 12:40:45 CEST

The complete file is now used to detect the encoding.

3f197425b Bug #47677: read complete file to detect encoding

Comment 4 Ole Schwiegert

2018-09-25 07:43:53 CEST

Commit f3bca6e457f1211388810bcfc980ad48330a475b of branch dtroeder/47677_handle_8859-1_csv_files contains code that is, to my knowledge, part of Bug #47681

Comment 5 Ole Schwiegert

2018-09-25 08:08:58 CEST

encoding = magic.Magic(mime_encoding=True).from_buffer(open(filename, 'rb').read())

This change does test the entire file content indeed, though it would be best to ensure that the file is properly closed after checking.

Comment 6 Daniel Tröder

2018-09-25 08:45:34 CEST

I removed the code from Bug #47681 (was a merge error).

I made some tests, and the behavior with open() is like this:

When the reference returned by open() is deleted, the file descriptor is closed. That happens right aber the .read(), as there is no variable for the FD.
I tested this with:

>>> open(..).read()
# lsof -p $PID
# → no open file

# lsof -p $PID > /tmp/start
>>> fd = open(..)
# lsof -p $PID > /tmp/opened
# diff /tmp/start /tmp/opened
# + an open FD
>>> del fd
# lsof -p $PID > /tmp/fd_deleted
# diff /tmp/opened /tmp/fd_deleted
# - an open FD

So all fine? I made some reading, and it turns out that this behavior is not guaranteed. CPython does reference counting - so the above works just fine. But other Python implementations may not work that way, and could leave the FD open. So in UCS we're probably fine, because we won't move away from CPython. But in the interest of interoperability I think it's good style to support others too.

ec77fba9c Bug #47677: read file one once, make sure its closed

Comment 7 Ole Schwiegert

2018-09-25 09:49:48 CEST

Exactly the not guaranteed behavior was what worried me. Looks fine for me now. If you build the code I can do the final testing and Verify.

Comment 8 Daniel Tröder

2018-09-25 11:59:37 CEST

Addad a changelog entry, merged, built and created an advisory.

ucs-school-import (16.0.2-49)

Comment 9 Ole Schwiegert

2018-09-26 08:59:54 CEST

Advisory & Changelog: OK
I tested with import files in utf-8 and some different ISO encodings. All seemed to work fine.

Some tests seem to fail because of the changes made here. Waiting for them to be fixed before verifying

Comment 10 Daniel Tröder

2018-09-26 10:19:43 CEST

[4.3] d7aece015 Bug #47677: handle CsvReader initialization without filename
[4.3] 21eb7d5f7 Bug #47677: Sniffer.sniff() expects a string (with multiple delimiter chars)
[4.3] aebfca5e9 Bug #47677: add type annotations
[4.3] ef0fb4e2b Bug #47677: changelog
[4.3] a769c0777 Bug #47677: advisory update

Comment 11 Ole Schwiegert

2018-09-28 08:58:15 CEST

Changelog & Advisory: OK
Test 90_ucsschool.114_importuser_udm_properties_udm_syntax_checks.test does not fail anymore: OK
Import via command line of utf-8, ascii, ISO 8859-1 files works: OK
Import via UMC of utf-8, ascii, ISO 8859-1 files works: OK

Comment 12 Sönke Schwardt-Krummrich

2018-11-16 11:48:09 CET

UCS@school 4.3 v6 has been released.

https://docs.software-univention.de/changelog-ucsschool-4.3v6-de.html

If this error occurs again, please clone this bug.