Python Strings
Default String Type
In Python 3, every string is a Unicode string, stored as a series of ‘code points’ (the integer reference numbers of the characters), unless the literal is preceded by b'...', which designates a byte string whose literal characters must be ASCII.
In Python 2, string literals were byte strings by default (with ASCII as the default encoding), unless the literal was preceded by u'...', which designated a Unicode string.
Summary of the two string types:
Literal | Type | Contents |
---|---|---|
'...' | Unicode string | Python 3 native string = sequence of integers of Unicode code points: \u0000 → \U0010ffff |
b'...' | Bytes string | Python 2 native string = sequence of integers (including encoded ASCII characters): \x00 → \xff |
A byte string literal only accepts literal ASCII characters or escape sequences such as hexadecimal escapes in the style \x##. Each element only represents an integer value between 0 and 255.
A Unicode string (the default Python 3 string) can represent all Unicode characters.
In Python 3, the u'string' notation is still accepted for backward compatibility, so it doesn't throw an error, but it changes nothing about the string.
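As a rough illustration of the difference between the two types, the sketch below compares the code points of a Unicode string with the byte values of the same text stored as UTF-8. The sample word and the variable names are made up for this example.

```python
# Hypothetical sample text; \u00e9 is "é", written as an escape to keep this source ASCII-only.
text = "caf\u00e9"
legacy = u"caf\u00e9"  # the u prefix is accepted in Python 3 and changes nothing

same_value = text == legacy
same_type = type(text) is type(legacy)

raw = b"caf\xc3\xa9"  # the same text as UTF-8 encoded bytes

code_points = [ord(ch) for ch in text]  # one integer per character
byte_values = list(raw)                 # one integer (0..255) per byte

print(type(text).__name__, code_points)
print(type(raw).__name__, byte_values)
```

The Unicode string has four elements while the byte string has five, because "é" takes two bytes in UTF-8.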
Writing Unicode Characters
You must remember the following escape sequences, used to enter Unicode characters into a string, with the code point number given by the ▯ placeholders (hexadecimal unless noted):
Escape Sequence to Write into Python | Character |
---|---|
\x▯▯ | Character for code points 00 to ff (hexadecimal); all ASCII characters can be entered in this way |
\u▯▯▯▯ | Unicode character for code points 0000 to ffff (hexadecimal) |
\U▯▯▯▯▯▯▯▯ | Unicode character for code points 00000000 to 0010ffff (hexadecimal) |
\▯▯▯ | Character with octal value ▯▯▯ (a backslash followed by up to three octal digits; there is no \o prefix) |
\N{name} | Character with the given name in the Unicode database |
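A small sketch of these escape sequences in use: it writes the same character several ways and checks that the results compare equal. The variable names are illustrative only. Note that the octal escape is written as a backslash followed directly by the digits.

```python
import unicodedata

# U+2705 written with the different escape styles.
check_mark = chr(0x2705)
spellings = ["\u2705", "\U00002705", "\N{WHITE HEAVY CHECK MARK}"]
all_same = all(s == check_mark for s in spellings)

# The letter "a" (code point 0x61, octal 141) written four ways.
letter_a = ["\x61", "\141", "\u0061", "\N{LATIN SMALL LETTER A}"]
all_a = all(s == "a" for s in letter_a)

# unicodedata.name() gives the name that \N{...} expects.
name = unicodedata.name(check_mark)
print(all_same, all_a, name)
```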
You must remember the following functions, which convert between a Unicode character and its decimal code-point value:
chr(▯) returns the character for a number
ord("c") returns the decimal number for a character
Get Character: chr(▯)
Returns the Unicode character from the Code-Point decimal number.
>>>chr(97)
'a'
Get Code-Point: ord("▯")
To get the decimal code-point value of a character (any Unicode character, not only ASCII), i.e. the reverse of chr(▯), use ord('▯').
>>>ord('a')
97
All the following representations are equal:
>>>"a" == "\x61" == "\u0061" == chr(0x61) == chr(ord("a"))
True
You can get a list of the ‘codepoint’ values of Unicode characters in this example string “✅❌✍” with:
>>>[ord(a) for a in "✅❌✍"]
[9989, 10060, 9997]
Get List of Byte Values: bytearray("…")
list(bytearray("…", encoding='utf-8')) returns the values of the bytes in an encoded string as a list; in Python 3 the encoding argument is required when passing a string. Notice how this shows the values of the encoded bytes, not the code-point integers.
>>>lst = list(bytearray("✅❌✍", encoding='utf-8'))
>>>print(lst)
[226, 156, 133, 226, 157, 140, 226, 156, 141]
To view these integers in binary, you have to apply some formatting. This has basically no practical function, but it is useful to visualize what is going on when you encode the text "✅❌✍":
>>>[format(a, '08b') for a in lst]
['11100010', '10011100', '10000101', '11100010', '10011101', '10001100', '11100010', '10011100', '10001101']
Encode and Decode
You must know that a u-string has a built-in .encode() method and a b-string has a built-in .decode() method. The default encoding argument for both is “utf-8”.
Encode means u-string → b-string
Decode means b-string → u-string
Encoding a u-string to a b-string via utf-8:
>>>"✅❌✍".encode()
b'\xe2\x9c\x85\xe2\x9d\x8c\xe2\x9c\x8d'
Decoding the above b-string to a u-string via utf-8:
>>>b'\xe2\x9c\x85\xe2\x9d\x8c\xe2\x9c\x8d'.decode()
'✅❌✍'
Calling bytes(string, "utf-8") gives the same result as the string's .encode() method:
>>>bytes("✅", "utf-8")
b'\xe2\x9c\x85'
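The encode and decode steps can be combined into a round trip, sketched below with the same example string. The variable names are made up for this example, and the attempt to encode as ASCII is included only to show that not every codec can represent every character.

```python
text = "✅❌✍"

encoded = text.encode()                       # utf-8 is the default
same_as_bytes = encoded == bytes(text, "utf-8")
round_trip = encoded.decode()                 # back to the original string

# Three characters become nine bytes in UTF-8.
char_count, byte_count = len(text), len(encoded)

# ASCII has no representation for these characters, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(char_count, byte_count, same_as_bytes, round_trip == text, ascii_ok)
```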
Specify Encoding When Writing a File
In Python 3 on Windows, with open('newfile.txt','w') as f makes a new file encoded with your system default, which is still apparently Windows-1252 (a.k.a. cp1252).
Therefore, you need to specify the encoding as “UTF-8” every time: with open('newfile.txt','w', encoding='UTF-8') as f.
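A minimal sketch of writing and reading back a file with an explicit encoding. It uses a temporary directory so nothing is left behind; the file name comes from the text above, and the other names are illustrative. What the locale default prints depends on your system.

```python
import locale
import os
import tempfile

text = "✅❌✍"

# The encoding open() would fall back to when none is given (system dependent).
print("locale default:", locale.getpreferredencoding(False))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "newfile.txt")

    # Write with an explicit encoding instead of relying on the default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read the raw bytes to see what actually landed on disk.
    with open(path, "rb") as f:
        raw = f.read()

    # Read it back as text, again naming the encoding explicitly.
    with open(path, "r", encoding="utf-8") as f:
        read_back = f.read()

print(raw)
print(read_back == text)
```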
Reading Python Strings
When Python reads a file in text mode, it replaces \r\n with \n by default. So bear this in mind when writing regular expressions that match line breaks.
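The line-ending translation can be observed with a short experiment, sketched here under the assumption that the file is opened with the built-in open(). Passing newline="" turns the translation off so the original \r\n survives. The file name and variable names are made up for this example.

```python
import os
import re
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "crlf.txt")

    # Write Windows-style line endings as raw bytes.
    with open(path, "wb") as f:
        f.write(b"one\r\ntwo\r\n")

    # Default text mode: \r\n is translated to \n on reading.
    with open(path, "r", encoding="utf-8") as f:
        translated = f.read()

    # newline="" keeps the line endings exactly as stored.
    with open(path, "r", encoding="utf-8", newline="") as f:
        untranslated = f.read()

# A pattern with an optional \r matches line breaks in both cases.
lines_a = re.split(r"\r?\n", translated)
lines_b = re.split(r"\r?\n", untranslated)
print(repr(translated), repr(untranslated), lines_a == lines_b)
```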