Normalised Unicode Strings
Due to inconsistent use of combining characters, and alternate ways of writing the same letter, Unicode defines ways of normalising characters before comparing strings. You could compare the strings without normalising them, but you might not care about certain differences between characters, such as whether someone used Å (U+00C5, Latin capital A with ring above) or the visually identical but distinct Å (U+212B, the Angstrom sign).
Unicode defines 2 yes-no normalisation options for Unicode characters, which results in 4 possible combinations, called NFC, NFD, NFKC and NFKD. The options are defined as follows:
- Composition (can be applied to un-canonised characters):
  - Compose (…C): squash the string down into the smallest number of code points. E.g. a + U+0300 ◌̀ ("Combining Grave Accent") is replaced by à.
  - Decompose (…D): stretch the string out into the largest number of code points; all the accents get spun out into combining characters. E.g. à is replaced by a + ◌̀.
- Canon (always applied together with either a fully composed or a fully decomposed string):
  - Canonise (…K…): convert each code point into the "canonical version" of the character. E.g. ² becomes 2. (The Unicode standard calls this "compatibility" decomposition, hence the K.)
  - Leave un-canonised (no K): leave the character as-is.
The French Wikipedia page has a better diagram showing examples of what "Composition" and "Canon" do.
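The two options above can be sketched with the standard-library unicodedata module (the code points are the ones from the examples: U+0300 is the combining grave accent, U+00E0 is à, U+00B2 is ²):

```python
import unicodedata as uc

# Compose (...C): "a" + combining grave accent squashes into one code point
assert uc.normalize("NFC", "a\u0300") == "\u00e0"  # à as a single code point

# Decompose (...D): à stretches out into base letter + combining accent
assert uc.normalize("NFD", "\u00e0") == "a\u0300"

# Canonise (...K...): superscript two becomes the plain digit 2
assert uc.normalize("NFKC", "\u00b2") == "2"
```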
These two options provide a matrix of 4 possible normalised forms of Unicode strings, with the cryptic names "NFC", "NFD", "NFKC" and "NFKD":
Normalised Forms | Composed | Decomposed |
---|---|---|
Non-Canonical | NFC | NFD |
Canonical | NFKC | NFKD |
The function unicodedata.normalize(form, string)
takes one of these four normalisation forms as its first argument and returns a normalised string.
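For example, the two visually identical Å characters from the introduction compare unequal until they are normalised (U+212B, the Angstrom sign, has a canonical mapping to U+00C5):

```python
import unicodedata as uc

a_ring   = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"  # ANGSTROM SIGN: visually identical, different code point

print(a_ring == angstrom)                                            # False
print(uc.normalize("NFC", a_ring) == uc.normalize("NFC", angstrom))  # True
```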
Examples
Let's compare some characters after applying the 4 different normalisation methods. Are the strings equal?
First import the unicodedata
module from the standard library:
import unicodedata as uc
The 4 normalised-forms:
NF = ["NFC", "NFD", "NFKC", "NFKD"]
Using the comparison a == A as an example:
>>> for x in NF: x, uc.normalize(x, "a") == uc.normalize(x, "A")
...
('NFC', False)
('NFD', False)
('NFKC', False)
('NFKD', False)
Different Capitalisation: a == A
Unnormalised, a ≠ A
Shown more clearly with both characters normalised in the same way:
a == A | Composed | Decomposed |
---|---|---|
Non-Canonical | NFC ❌ | NFD ❌ |
Canonical | NFKC ❌ | NFKD ❌ |
The normalize() function has no effect on case. You have to call the .lower() (or, more thoroughly, .casefold()) method on both strings to do a case-insensitive comparison.
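A minimal sketch of such a case-insensitive comparison, combining normalisation with case folding (the helper name is made up for illustration):

```python
import unicodedata as uc

def loose_equal(s1, s2):
    # Hypothetical helper: normalise to NFC, then case-fold both strings.
    return uc.normalize("NFC", s1).casefold() == uc.normalize("NFC", s2).casefold()

print(loose_equal("a", "A"))         # True
print(loose_equal("à", "A\u0300"))   # composed vs decomposed + case: True
```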
Different Form of Same “Canonical” Character: ① == 1
Unnormalised, ① ≠ 1
With both characters normalised in the same way:
① == 1 | Composed | Decomposed |
---|---|---|
Non-Canonical | NFC ❌ | NFD ❌ |
Canonical | NFKC ✅ | NFKD ✅ |
① and 1 are the same 'canonical' character despite being different code-points.
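The table above can be reproduced with the same loop as before (U+2460 is ①):

```python
import unicodedata as uc

for form in ["NFC", "NFD", "NFKC", "NFKD"]:
    print(form, uc.normalize(form, "\u2460") == uc.normalize(form, "1"))
# NFC False
# NFD False
# NFKC True
# NFKD True
```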
Different Form of Same “Canonical” Character: y == 𝒚
Unnormalised, y ≠ 𝒚
With both characters normalised in the same way:
y == 𝒚 | Composed | Decomposed |
---|---|---|
Non-Canonical | NFC ❌ | NFD ❌ |
Canonical | NFKC ✅ | NFKD ✅ |
y and 𝒚 are the same 'canonical' character despite being different code-points.
However, for information: ß ≠ SS and Latin a ≠ Greek α under every normalisation form.
Composed vs Decomposed Letter: à == a + ◌̀
Unnormalised, à ≠ a + ◌̀ (a combining accent follows its base letter).
With both strings normalised in the same way:
à == a + ◌̀ | Composed | Decomposed |
---|---|---|
Non-Canonical | NFC ✅ | NFD ✅ |
Canonical | NFKC ✅ | NFKD ✅ |
In composed form, a + ◌̀ becomes à. When decomposed, à becomes a + ◌̀. So after normalisation the two strings always consist of the same characters.