Bug 47580 - Normalize in user templates of names with umlauts do not work completely
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: UDM (Generic)
Version: UCS 4.3
Hardware: Other Linux
Importance: P5 major
Target Milestone: UCS 4.3-2-errata
Assigned To: Ole Schwiegert
QA Contact: Johannes Keiser
Duplicates: 45387
Depends on: 44367 44370
Blocks:
Reported: 2018-08-13 12:33 CEST by Christina Scheinig
Modified: 2020-06-22 17:01 CEST (History)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 5: Major Usability: Impairs usability in key scenarios
Who will be affected by this bug?: 2: Will only affect a few installed domains
How will those affected feel about the bug?: 5: Blocking further progress on the daily work
User Pain: 0.286
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support: Yes
Flags outvoted (downgraded) after PO Review:
Ticket number: 2018080821000194
Bug group (optional):
Max CVSS v3 score:


Description Christina Scheinig univentionstaff 2018-08-13 12:33:54 CEST
Some characters, especially " ` ", are still not normalized and therefore remain in the mail address.

E.g. "Vivian D` Muster" should be changeable to VivianDMuster@schule.example.de

+++ This bug was initially created as a clone of Bug #44370 +++

If you use <:umlauts> in user templates (example of primary mail address: <firstname>[0].<lastname><:strip><:umlauts>@demo.univention.de), there will be a problem with names like "Ýlang Mustermann".


+++ This bug was initially created as a clone of Bug #44367 +++

Import of "Ýlang Müstèrmánn" produces "?" in the username and email address with default settings.

I think there are some umlauts missing in "class property" in /usr/share/pyshared/univention/admin/__init__.py

It may be better to use something like unicodedata.normalize() (see https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize) instead of hard-coding UMLAUTS!
Comment 1 Ole Schwiegert univentionstaff 2018-09-03 09:47:56 CEST
Also the strip/trim command seems to not work.

As far as I understand the Unicode documentation, Unicode normalization is intended to remove the case where the same character is represented by different code point sequences. That is useful if you always want the same representation of a character in your data(base), but it is not really intended to remove special characters from strings or to replace them with 'similar' ASCII characters via a normalization-then-encoding chain.
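
To illustrate the point, here is a minimal Python 3 sketch of the normalization-then-encoding chain. It is only an illustration, not the UDM code; the helper name nfkd_ascii is made up for this example.

```python
import unicodedata


def nfkd_ascii(text):
    """Decompose characters (NFKD), then drop every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# What normalization is meant for: two encodings of the same character.
precomposed = "\u00fc"   # "ü" as a single code point
combining = "u\u0308"    # "u" followed by COMBINING DIAERESIS
print(precomposed == combining)                                # False
print(unicodedata.normalize("NFC", combining) == precomposed)  # True

# Used as an ASCII folding trick, it works only where a decomposition exists.
print(nfkd_ascii("Ýlang Müstèrmánn"))   # Ylang Mustermann
print(nfkd_ascii("Straße"))             # Strae ("ß" has no decomposition and is dropped)
print(nfkd_ascii("Vivian D` Muster"))   # Vivian D` Muster (the backtick is ASCII and stays)
```

Note that this chain turns "ü" into "u" rather than "ue", and silently loses characters such as "ß".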

A better solution in my opinion would be unidecode (https://pypi.org/project/Unidecode/), which tries to do exactly what we intend: represent Unicode strings in ASCII as closely as possible. The library contains a lot of hand-crafted mappings (similar to our umlaut replacement code, but much more extensive).

Since ` is an ASCII character, neither Unicode normalization plus encoding nor unidecode would remove it. For that we should use Python's built-in string methods: isalpha() checks whether a given character is alphabetic and isalnum() whether it is alphanumeric (both also work on a whole string, which must then consist entirely of such characters).

If we filter the given string/name after applying asciification we should get quite a solid result.
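
The two-step idea (asciify first, then filter) could be sketched roughly as follows in Python 3. This is an assumption-laden illustration rather than the eventual implementation: REPLACEMENTS is a deliberately tiny, hypothetical stand-in for a complete table such as the one unidecode ships, and all function names are made up.

```python
import unicodedata

# Hypothetical mini table for illustration only; a real table is far larger.
REPLACEMENTS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def asciify(text):
    """Apply the replacement table, then strip remaining accents via NFKD."""
    replaced = "".join(REPLACEMENTS.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", replaced)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def keep_alnum(text):
    """Drop every character that str.isalnum() does not accept."""
    return "".join(ch for ch in text if ch.isalnum())


def normalize_name(name):
    """Asciify first, then filter out the remaining symbols and spaces."""
    return keep_alnum(asciify(name))


print(normalize_name("Vivian D` Muster"))   # VivianDMuster
print(normalize_name("Ýlang Müstèrmánn"))   # YlangMuestermann
```

The order matters: filtering first would keep "ü" (it is alphanumeric), so the ASCII conversion has to run before or independently of the filter.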
Comment 2 Ole Schwiegert univentionstaff 2018-09-03 10:10:55 CEST
'Also the strip/trim command seems to not work.'
--Ignore that comment
Comment 3 Ole Schwiegert univentionstaff 2018-09-10 12:56:37 CEST
Package: univention-management-console-module-udm
Version: 8.0.5-16A~4.3.0.201809101250

Package: univention-directory-manager-modules
Version: 13.0.22-3A~4.3.0.201809101251

The option :umlauts was not altered at all, since symbols like ' ` # etc. do not fall into the category of the umlauts option (it takes umlauts and transforms them into ASCII representations).

Instead there is the new option :alphanum which removes all symbols that are not alphanumerical or spaces. In the UCRV directory/manager/templates/alphanum/whitelist you can save a string containing all symbols that should be ignored by that option.

:alphanum should be used carefully since it removes even the @-sign of an email address, if it is applied to the entire email field for example. It is better to use it on specific attributes only, like <firstname>.
Comment 4 Jürn Brodersen univentionstaff 2018-09-10 16:37:36 CEST
Looks like jenkins doesn't like your docu commit:
http://jenkins.knut.univention.de:8080/job/UCS-4.3/job/UCS-4.3-2/job/HandbookUCS/71/warnings5Result/new/

I guess you need to whitelist whitelist for the spelling check... :)
Comment 5 Ole Schwiegert univentionstaff 2018-09-12 08:25:47 CEST
Package: univention-directory-manager-modules
Version: 13.0.22-4A~4.3.0.201809120823

Remove debug entry

whitelist added to english dict
Comment 6 Ole Schwiegert univentionstaff 2018-09-18 11:17:18 CEST
Package: univention-directory-manager-modules
Version: 13.0.22-5A~4.3.0.201809181105

Package: univention-management-console-module-udm
Version: 8.0.5-17A~4.3.0.201809181115

Integrated the discussed code improvements and aligned the umlauts dict between frontend and backend.

The discussed option to parametrize user template options was moved into a new Feature Request Bug #47830
Comment 7 Ole Schwiegert univentionstaff 2018-09-18 11:18:34 CEST
(In reply to Ole Schwiegert from comment #3)
> Package: univention-management-console-module-udm
> Version: 8.0.5-16A~4.3.0.201809101250
> 
> Package: univention-directory-manager-modules
> Version: 13.0.22-3A~4.3.0.201809101251
> 
> The option :umlauts was not altered at all. Since symbols like '`# etc are
> not falling into the category of the umlauts option (it takes umlauts and
> transforms them into ASCII-representations).
> 
> Instead there is the new option :alphanum which removes all symbols that are
> not alphanumerical or spaces. In the UCRV
> directory/manager/templates/alphanum/whitelist you can save a string
> containing all symbols that should be ignored by that option.
> 
> :alphanum should be used carefully since it removes even the @-sign of an
> email address, if it is applied to the entire email field for example. It is
> better to use it on specific attributes only, like <firstname>.

The whitespace character is also filtered by the option unless it is listed in the whitelist.
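
The behaviour described for :alphanum, including the whitespace handling and the @-sign pitfall, can be sketched in Python 3 as below. The function alphanum and its whitelist parameter are hypothetical stand-ins: in UCS the whitelist comes from the UCR variable directory/manager/templates/alphanum/whitelist, and the real code may differ in detail.

```python
def alphanum(value, whitelist=""):
    """Keep alphanumeric characters plus any character listed in whitelist."""
    return "".join(ch for ch in value if ch.isalnum() or ch in whitelist)


# Symbols and whitespace are removed by default.
print(alphanum("Vivian D` Muster"))                  # VivianDMuster

# Whitespace survives only if the whitelist contains it.
print(alphanum("Vivian D` Muster", whitelist=" "))   # Vivian D Muster

# Applied to a whole mail address, the option also removes "@" and ".".
print(alphanum("vivian.muster@schule.example.de"))   # vivianmusterschuleexamplede

# Applied to a single attribute, the rest of the address stays intact.
print(alphanum("D` Muster") + "@schule.example.de")  # DMuster@schule.example.de
```

This is why the option is better attached to individual attributes such as <firstname> than to the complete mail address template.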
Comment 8 Ole Schwiegert univentionstaff 2018-09-18 11:55:00 CEST
Package: univention-directory-manager-modules
Version: 13.0.22-6A~4.3.0.201809181151

Fix error in postinst script
Comment 9 Ole Schwiegert univentionstaff 2018-09-19 10:59:41 CEST
Package: univention-directory-manager-modules
Version: 13.0.23-4A~4.3.0.201809191058

fixed unicode bug
Comment 10 Johannes Keiser univentionstaff 2018-09-19 13:04:36 CEST
OK only alphanumeric characters are kept (in the frontend this is constrained to ASCII letters, basic Latin-1 letters, and the digits 0-9)
When IE11 is no longer supported we can use Unicode regex to also keep all Unicode alphanumeric characters in the frontend
OK Characters defined in the UCR variable directory/manager/templates/alphanum/whitelist are also kept
OK Code
OK YAML
-> verified
Comment 12 Florian Best univentionstaff 2020-06-22 17:01:25 CEST
*** Bug 45387 has been marked as a duplicate of this bug. ***