Bug 52878 - Add turkish umlauts to list of umlauts that are converted to ASCII via <:umlauts>
Add turkish umlauts to list of umlauts that are converted to ASCII via <:umla...
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: UDM (Generic)
UCS 4.4
Other All
: P5 normal (vote)
: UCS 4.4-8-errata
Assigned To: Tobias Wenzel
Ole Schwiegert
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2021-03-09 10:09 CET by Michael Grandjean
Modified: 2021-05-12 13:37 CEST (History)
6 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 3: Will affect average number of installed domains
How will those affected feel about the bug?: 2: A Pain – users won’t like this once they notice it
User Pain: 0.171
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:
best: Patch_Available+


Attachments
patch (git:fbest/52878-unidecode-umlauts) (3.11 KB, patch)
2021-03-09 10:53 CET, Florian Best
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Grandjean univentionstaff 2021-03-09 10:09:26 CET
# univention-app info
UCS: 4.4-7 errata907
Installed: ucsschool=4.4 v8
Upgradable:

Scenario: I want to create users (via usertemplate or UCS@school import) and use the ":umlauts" function to replace umlauts in usernames.

Expected behaviour: turkish umlauts (Ç,ç,Ğ,ğ,İ,ı,Ş,ş) are replaced with their ASCII look-alikes (C,c,G,g,I,i,S,s), just like french and scandinavian umlauts.

Observed behaviour: turkish umlauts are stripped. Someone named "Pınar Ağrı" will end up with a username like "pnar.ar" instead of "pinar.agri".
Comment 2 Florian Best univentionstaff 2021-03-09 10:53:07 CET
Created attachment 10636 [details]
patch (git:fbest/52878-unidecode-umlauts)

We should change the implementation to use python3-unidecode:

>>> import unidecode
>>> unidecode.unidecode('Ç,ç,Ğ,ğ,İ,ı,Ş,ş,Pınar Ağrı')
'C,c,G,g,I,i,S,s,Pinar Agri'
Comment 3 Michael Grandjean univentionstaff 2021-03-09 11:05:00 CET
Ah, Florian was faster than me and I like his approach better. 
For the record: in the meanwhile I created https://git.knut.univention.de/univention/ucs/commit/2f1570ae
Comment 5 Tobias Wenzel univentionstaff 2021-04-14 12:38:12 CEST
The fix is implemented in 


[twenzel/4.4/52878_turkish_umlauts] 53ae6b4967 Bug #52878: fix umlauts in patterns

In python 2 we need text.decode() to make this work. Thanks for the patch!


# univention-app info
UCS: 4.4-7 errata947
Installed: cups=2.2.1 samba4=4.10 squid=3.5 ucsschool=4.4 v9

/var/lib/ucs-school-import/configs/user_import.json

...
 "username": {  
            "default": "<:umlauts><firstname>.<lastname><:lower>[COUNTER2]"  
 }
...

I got the following before the fix:

Pınar Ağrı -> pinar.aara

After the fix:

Pınar Ağrı -> pinar.agri
ÇçĞğİıŞş -> ccggiiss
Comment 6 Tobias Wenzel univentionstaff 2021-04-26 14:43:56 CEST
Unfortunately, the unidecode-patch is not enough.
ä would be converted to a and not to ae as before

We tried to implement this with py-icu, but we didn't succeed:

transliterator = icu.Transliterator.createInstance('de-ascii')

should do the trick but for some reason this didn't work. 

To catch cases like 'Pınar Ağrı' and 'Jürgen Deveaux' we added the suggested patch -> comment 2 after the original string replacement. 
A solution with icu would be better, but this 'does the trick' as well. 

[twenzel/4.4/52878_turkish_umlauts] 7b27cab427 Bug #52878: merge old and new conversion methods


New tests were added in ucs & ucsschool:

[twenzel/4.4/52878_turkish_umlauts] 304a8abe4 Bug #52878: umlauts tests

-> this commit will be dropped:
[twenzel/4.4/52878_turkish_umlauts] 6c4797176 Bug #52878: tests for turkish and other umlauts, selectable test methods
Comment 7 Tobias Wenzel univentionstaff 2021-04-27 09:19:26 CEST
The icu versions for python2 & python3 seem to be different. 

python2
```
transliterator.transliterate('Ä Ö Ü ä ö ü ß ł ŧ ø þ æ ſ ð đ ŋ ħ ĸ Ł Ŧ Ø
   ...:  Þ Æ ẞ Ð Ŋ Ħ') 
u'A O U a o u ss l t o th ae s d d n h q L T O TH AE SS D N H'
```

python3
```
In [4]: transliterator = icu.Transliterator.createInstance('de-ascii;')         
In [5]: transliterator.transliterate('Ä Ö Ü ä ö ü ß ł ŧ ø þ æ ſ ð đ ŋ ħ ĸ Ł Ŧ Ø Þ Æ ẞ Ð Ŋ Ħ')                                                           
Out[5]: 'AE OE UE ae oe ue ss l t o th ae s d d n h q L T O TH AE SS D N H'
```
→ This is still not what's expected, we want Ä → Ae

--------------

fixed encoding/ decoding

[twenzel/4.4/52878_turkish_umlauts] b8256eeca3 fixup! Bug #52878: remove encoding
[twenzel/4.4/52878_turkish_umlauts] ab22929716 Bug #52878: remove encoding
Comment 8 Daniel Tröder univentionstaff 2021-05-03 13:04:53 CEST
The package python-unidecode has been move to maintained in 4.4-8errata and will be released next Wednesday (05.05.2021).
Comment 9 Daniel Tröder univentionstaff 2021-05-03 13:05:52 CEST
The UDM package was built and an advisory created:

[4.4-8] 6d8685f127 Bug #52878: fix umlauts in patterns
[4.4-8] b6002480ee Bug #52878: changelogs and advisory
[4.4-8] 064cdc757d Bug #52878: advisory update

univention-directory-manager-modules (14.0.20-6)
Comment 10 Daniel Tröder univentionstaff 2021-05-03 13:36:35 CEST
A merge request for UCS 5.0-0 was created: https://git.knut.univention.de/univention/ucs/-/merge_requests/88
Comment 11 Tobias Wenzel univentionstaff 2021-05-04 09:29:50 CEST
Extracted code of loop in

[4.4-8] 6106bd360e Bug #52878: extract code from loop


Package: ucs-test
Version: 9.0.7-30A~4.4.0.202105040926
Branch: ucs_4.4-0
Scope: errata4.4-8
Comment 12 Ole Schwiegert univentionstaff 2021-05-05 09:50:46 CEST
Jenkins: OK
Changelog&Advisories: OK
Manual Tests: OK
Packages install: OK
Comment 13 Florian Best univentionstaff 2021-05-05 11:27:56 CEST
REOPEN: The merge request doesn't reflect the latest changes
REOPEN: The comments in the merge request aren't integrated.
Comment 14 Daniel Tröder univentionstaff 2021-05-05 13:02:13 CEST
(In reply to Florian Best from comment #13)
> REOPEN: The merge request doesn't reflect the latest changes
> REOPEN: The comments in the merge request aren't integrated.

The work on the 5.0 merge has been split into bug 53224.
Comment 15 Florian Best univentionstaff 2021-05-05 13:03:03 CEST
(In reply to Daniel Tröder from comment #14)
> (In reply to Florian Best from comment #13)
> > REOPEN: The merge request doesn't reflect the latest changes
> > REOPEN: The comments in the merge request aren't integrated.
> 
> The work on the 5.0 merge has been split into bug 53224.

The comments belong also to the UCS 4.4 code.
Comment 16 Tobias Wenzel univentionstaff 2021-05-06 11:34:14 CEST
Changes according to remarks have been added in

[4.4-8] a616d35a73 Bug #52878: qa remarks
[4.4-8] a92dcb0966 Bug #52878: advisory update

Package: univention-directory-manager-modules
Version: 14.0.20-7A~4.4.0.202105061110
Branch: ucs_4.4-0
Scope: errata4.4-8

Package: ucs-test
Version: 9.0.7-31A~4.4.0.202105061112
Branch: ucs_4.4-0
Scope: errata4.4-8

merge request has been updated in

[twenzel/5.0/52878_turkish_umlauts] ae31c8a30d Bug #52878: qa remarks
[twenzel/5.0/52878_turkish_umlauts] fec7dfb059 Bug #52878: fix umlauts in patterns

uas QA will check and verify
Comment 17 Julia Bremer univentionstaff 2021-05-07 11:04:02 CEST
Due to this commit, all UCS 4.4-8 Tests failed:

a616d35a73a40f8d9a05db9e6e3cee9ac90b10f5


Multiple join scripts could not be executed.

[master090] 2021-05-06T23:57:03.074057	+ univention-check-join-status
[master090] 2021-05-06T23:57:03.407947	Warning: 'univention-ldap-server' is not configured.
[master090] 2021-05-06T23:57:03.427876	Warning: 'univention-join' is not configured.
[master090] 2021-05-06T23:57:03.536059	Warning: 'univention-saml' is not configured.
[master090] 2021-05-06T23:57:03.538767	Warning: 'univention-management-console-web-server' is not configured.
[master090] 2021-05-06T23:57:03.543950	Error: Not all install files configured: 4 missing

In the join.log this traceback a bunch of times: 

Traceback (most recent call last):
  File "/usr/share/univention-directory-manager-tools/univention-cli-server", line 219, in doit
    output = univention.admincli.admin.doit(arglist)
  File "/usr/lib/python2.7/dist-packages/univention/admincli/admin.py", line 409, in doit
    out = _doit(arglist)
  File "/usr/lib/python2.7/dist-packages/univention/admincli/admin.py", line 755, in _doit
    dn = object.create()
  File "/usr/lib/python2.7/dist-packages/univention/admin/handlers/__init__.py", line 557, in create
    dn = self._create(response=response, serverctrls=serverctrls)
  File "/usr/lib/python2.7/dist-packages/univention/admin/handlers/__init__.py", line 1235, in _create
    self.set_default_values()
  File "/usr/lib/python2.7/dist-packages/univention/admin/handlers/__init__.py", line 1352, in set_default_values
    if p.default(self):
  File "/usr/lib/python2.7/dist-packages/univention/admin/__init__.py", line 312, in default
    return self._replace(base_default, object)
  File "/usr/lib/python2.7/dist-packages/univention/admin/__init__.py", line 301, in _replace
    return pattern_replace(copy.copy(res), object)
  File "/usr/lib/python2.7/dist-packages/univention/admin/__init__.py", line 189, in pattern_replace
    value = modify_text(value, global_commands)
  File "/usr/lib/python2.7/dist-packages/univention/admin/__init__.py", line 139, in modify_text
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
TypeError: normalize() argument 2 must be unicode, not str
Comment 18 Florian Best univentionstaff 2021-05-07 11:05:28 CEST
The reason is the missing second u'' string in the UMLAUTS mapping.
Comment 19 Tobias Wenzel univentionstaff 2021-05-10 08:30:54 CEST
I'm very sorry: I only had fixed this in the merge request.

[4.4-8] e50825bb5d Bug #52878: change advisory
[4.4-8] 35073d260b Bug #52878: add u-string
Comment 20 Tobias Wenzel univentionstaff 2021-05-11 12:58:56 CEST
As discussed, I removed the line 

text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

→ it is covered by unidecode. This should fix our test.

[4.4-8] bd6cc9a12b Bug #52878: modify advisory
[4.4-8] 421d6e552d Bug #52878: remove unicodedata.normalize call


Package: univention-directory-manager-modules
Version: 14.0.20-9A~4.4.0.202105111251
Branch: ucs_4.4-0
Scope: errata4.4-8

[twenzel/5.0/52878_turkish_umlauts] 8bad51859a Bug #52878: remove unicodedata.normalize call

@QA please check if tests pass

e.g. 01_base/01_ignore_disabled_users_in_license_count
Comment 21 Tobias Wenzel univentionstaff 2021-05-11 14:06:06 CEST
Package: univention-directory-manager-modules
Version: 14.0.20-10A~4.4.0.202105111403
Branch: ucs_4.4-0
Scope: errata4.4-8

[4.4-8] 5b40e539ea Bug #52878: modify advisory
[4.4-8] 9b7bc63e03 Bug #52878: remove unused import
Comment 22 Daniel Tröder univentionstaff 2021-05-11 14:34:39 CEST
OK: code changes
OK: automated tests
 - 11_use_usertemplate_umlauts
 - 90_ucsschool/219_import_users_umlauts.py
 - 90_ucsschool/219_import_users_umlauts_normalized.py

Also OK, but IMHO unrelated: 01_base/01_ignore_disabled_users_in_license_count

OK: manual test:

---------------------------------------------------------------------
# dpkg -l python-univention-directory-manager

ii  python-univention-directory-m 14.0.20-5A~4.4.0.20

# ipython

In [1]: from univention.admin import property as uadmin_property
In [2]: prop = uadmin_property("_replace")
In [3]: prop._replace("<:umlauts>Pınar Ağrı", {})
Out[3]: 'Pnar Agr'

-------------- update --------------

# dpkg -l python-univention-directory-manager

ii  python-univention-directory-m 14.0.20-10A~4.4.0.2

# ipython

In [1]: from univention.admin import property as uadmin_property
In [2]: prop = uadmin_property("_replace")
In [3]: prop._replace("<:umlauts>Pınar Ağrı", {})
Out[3]: 'Pinar Agri'
---------------------------------------------------------------------

OK: advisories

The funcionality in 5.0 was not tested.