Normalised Unicode Strings
Due to inconsistent use of combining characters, and alternate ways of writing the same letter, Unicode defines ways of normalising characters before comparing strings. You could compare the strings without normalising them, but you might not care about certain differences between characters, such as whether someone used Å (U+00C5, Latin capital A with ring above) or the visually identical but distinct Å (U+212B, the Angstrom sign).
Unicode defines 2 yes-no normalisation options for Unicode characters, which results in 4 possible combinations, called NFC, NFD, NFKC and NFKD. The options are defined as follows:
- Composition (can be applied to un-canonised characters):
  - Compose (…C): squash the string down into the smallest number of code points. E.g. a + U+0300 ◌̀ ("Combining Grave Accent") is replaced by à.
  - Decompose (…D): stretch the string out into the largest number of code points; all the accents get spun out into combining characters. E.g. à is replaced by a + ◌̀.
- Canon (always applied together with either a fully composed or a fully decomposed string):
  - Canonise (…K…): convert each code point into the "canonical version" of the character. E.g. ² becomes 2. (The Unicode standard calls this "compatibility" decomposition, hence the K.)
  - Leave un-canonised (no K): leave the character as-is.
The French Wikipedia page has a better diagram showing examples of what "Composition" and "Canon" do.
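The two options above can be sketched with the standard-library unicodedata module (the code points are the ones from the examples: U+0300 is the combining grave accent, U+00E0 is à, U+00B2 is ²):

```python
import unicodedata as uc

# Compose (...C): "a" + combining grave accent squashes into one code point
assert uc.normalize("NFC", "a\u0300") == "\u00e0"  # à as a single code point

# Decompose (...D): à stretches out into base letter + combining accent
assert uc.normalize("NFD", "\u00e0") == "a\u0300"

# Canonise (...K...): superscript two becomes the plain digit 2
assert uc.normalize("NFKC", "\u00b2") == "2"
```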
These two options provide a matrix of 4 possible normalised forms of Unicode strings, with the cryptic names "NFC", "NFD", "NFKC" and "NFKD":
Normalised Forms | Composed | Decomposed |
---|---|---|
Non-Canonical | NFC | NFD |
Canonical | NFKC | NFKD |
The function unicodedata.normalize(form, string)
takes one of these four normalisation forms as its first argument and returns a normalised string.
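For example, the two visually identical Å characters from the introduction compare unequal until they are normalised (U+212B, the Angstrom sign, has a canonical mapping to U+00C5):

```python
import unicodedata as uc

a_ring   = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"  # ANGSTROM SIGN: visually identical, different code point

print(a_ring == angstrom)                                            # False
print(uc.normalize("NFC", a_ring) == uc.normalize("NFC", angstrom))  # True
```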
Examples
Let's compare some characters after applying the 4 different normalisation methods. Are the strings equal?
First import the unicodedata
module from the standard library:
import unicodedata as uc
The 4 normalised-forms:
NF = ["NFC", "NFD", "NFKC", "NFKD"]
Using the comparison a == A as an example:
>>> for x in NF: x, uc.normalize(x, "a") == uc.normalize(x, "A")
...
('NFC', False)
('NFD', False)
('NFKC', False)
('NFKD', False)
Different Capitalisation: a == A
Unnormalised, a ≠ A
Shown more clearly with both characters normalised in the same way:
a == A | Composed | Decomposed |
---|---|---|
Non-Canonical | NFC ❌ | NFD ❌ |
Canonical | NFKC ❌ | NFKD ❌ |
The normalize() function has no effect on case. You have to call the .lower() (or, more thoroughly, .casefold()) method on both strings to do a case-insensitive comparison.
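A minimal sketch of such a case-insensitive comparison, combining normalisation with case folding (the helper name is made up for illustration):

```python
import unicodedata as uc

def loose_equal(s1, s2):
    # Hypothetical helper: normalise to NFC, then case-fold both strings.
    return uc.normalize("NFC", s1).casefold() == uc.normalize("NFC", s2).casefold()

print(loose_equal("a", "A"))         # True
print(loose_equal("à", "A\u0300"))   # composed vs decomposed + case: True
```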
Different Form of Same “Canonical” Character: ① == 1
Unnormalised, ① ≠ 1
With both characters normalised in the same way:
① == 1 | Composed | Decomposed |
---|---|---|
Non-Canonical | NFC ❌ | NFD ❌ |
Canonical | NFKC ✅ | NFKD ✅ |
① and 1 are the same 'canonical' character despite being different code-points.
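The table above can be reproduced with the same loop as before (U+2460 is ①):

```python
import unicodedata as uc

for form in ["NFC", "NFD", "NFKC", "NFKD"]:
    print(form, uc.normalize(form, "\u2460") == uc.normalize(form, "1"))
# NFC False
# NFD False
# NFKC True
# NFKD True
```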
Different Form of Same “Canonical” Character: y == 𝒚
Unnormalised, y ≠ 𝒚
With both characters normalised in the same way:
y == 𝒚 | Composed | Decomposed |
---|---|---|
Non-Canonical | NFC ❌ | NFD ❌ |
Canonical | NFKC ✅ | NFKD ✅ |
y and 𝒚 are the same 'canonical' character despite being different code-points.
However, for information: ß ≠ SS and Latin a ≠ Greek α under every normalisation form.
Composed vs Decomposed Letter: à == a + ◌̀
Unnormalised, à ≠ a + ◌̀ (a combining accent follows its base letter).
With both strings normalised in the same way:
à == a + ◌̀ | Composed | Decomposed |
---|---|---|
Non-Canonical | NFC ✅ | NFD ✅ |
Canonical | NFKC ✅ | NFKD ✅ |
In composed form, a + ◌̀ becomes à. When decomposed, à becomes a + ◌̀. So after normalisation the two strings always consist of the same characters.