RegEx
A Regular Expression, (a.k.a. RegEx) is a sequence of characters that specifies a search pattern in text. The sequence is called a pattern or expression. The text that this expression finds in the text string is known as the ‘match’. Or more specifically in python the ‘match-object’, because it contains more than just the match text.
The python standard library has a regex package called re
. There is more feature rich PyPi package called regex
.
Quick Start Guide
A regular expression is useful if you have a string of text called string
and you want a list of all the matches of words starting with say express...
, best practice involves a few steps:
1. Get the string
Normally, this means opening a file but here I’ll write it directly:
>>> string = r"expression and expressive"
2. Compile the expression
I’ll use the expression r"\bexpress(.*?)\b"
which looks for all the words starting with expression. All these special characters are explained later, but the \b
means boundary between space and word-character, brackets make that part of an expression a ‘group’, which can be extracted independently from the match, later on.
>>> import re
>>> pattern = re.compile(r"\bexpress(.*?)\b")
Pattern
is now a re.compile-object.
3. Run a method on the pattern
You now run the .finditer(…)
method on pattern
, specifying string
as the argument:
>>> match_object_iterator = pattern.finditer(string)
You now have an iterator-object containing re.match-objects.
4. Return the Matching Group for each Match
Here I am returning:
- group
0
, which means the whole expression, - group
1
, which is the part in brackets.
>>> for i, mo in enumerate(match_object_iterator):
... print(i, mo.group(0))
... print(i, mo.group(1))
...
0 expression
0 ion
1 expressive
1 ive
This guide will now talk only about how to make the r"\bexpress(.*?)\b"
part of this ordeal, known as the ‘pattern’ or ‘expression’. I deal with python re functions elsewhere. RegEx is powerful and compelling because it makes otherwise complex find-and-replace, or parsing operations trivial and it is useful in every field of computing.
glob
Before talking about regular expressions, we should discuss ‘globbing’ which you have probably used in windows file explorer search boxes. glob is a simpler version of regex with much more limited options, which is used to search for file-names. Globbing is done by all modern shells, following the POSIX specification.
The syntax is simple:
Wildcard | Meaning and equivalent regex |
---|---|
* |
Any string but file separator: [^\]* in windows or [^/]* in Mac or Linux. |
** |
Any string including file separator .* . I.e. search recursive |
? |
. Any Single character |
[abc] |
[abc] Character class |
[a-z] |
[a-z] Character class range |
Conventions
In this guide, I use:
…
to refer to a part of an expression, any unconnected series of character. This is to distinguish it from three literal full-stops...
.▯
to refer to any single character.#
to refer to any singe digit character.r"…"
to refer to the complete expression raw-string, as entered into python.
It is important to contract these three elements:
- Expression (or pattern): the regular expression character-string e.g.
r"\bexpress(.*?)\b"
. - String: the text-string which you are searching inside of, using the regular expression.
- Match object: what is returned by the regex search. The ‘match’ refers to the match ofa a given group in the match-object.
Resources
Reference
Testers
Exercises
Visualisers
Normal Characters
All letters and numbers and many symbols in ASCII are typed in literally:
a
-z
A
-Z
0
-9
_
-
,
!
&
,
;
For example:
- The spacebar character `` is a literal whitespace also known as a blank.
,
is a literal comma`
is a literal backtick
Special Characters
[
\
^
$
.
|
?
*
+
(
)
need to be escaped with a preceding \
to match literally:
Their special purposes (in python re) are:
[
start of a character class. (The closing bracket]
only needs to be escaped to include it within a character class)(
and)
start and end of a string group\
is either escape or group back-reference character^
is either start of line anchor or NOT operator in bracket list as:[^…]
$
is end of line anchor.
matches any single character except newline|
is the OR operator in a strings group(…|…)
*
is the Kleene Star a 0 or more modifier.?
is either a 0 or 1 character-modifier, a non-greedy as modifier-modifier*?
or a group-type modifier as(?…)
,+
is a 1 or more modifier/
is Delimiter special in PERL, but not in Python
However, inside character-classes […]
, special characters are literal and do not to be escaped with the following exceptions:
\
]
-
/
in PERL, but not in Python
The Delimiter /
In PERL and other old RegEx languages a ‘delimiter’ marks the start and end of the whole expression e.g.: /^expression$/
. It must be escaped within the expression as\/
, even in character classes as […\/…]
. In the online checker tool, it assumes /
is the delimiter by default, so change it to a less common character like @
.
But, in python re
and regex
, the delimiter /
is not to be escaped. The expression is delimited with string quotes "
, e.g.: r"^expression$"
. So "
would need to be escaped in this python string like any other.
Meta-Characters
These are the meta-characters: \w
\s
\d
\n
. These are escape sequences. Escape sequences, were invented in the C programming language. It is the practice of representing a special character in a string by preceding it with a \
. The convention was expanded for Regex to include far more and calls them ‘meta-characters’. The following meta-characters are true for all RegEx flavours:
\w
a word character (letter, number or underscore), equivalent to[a-zA-Z0-9_]
\W
anything but a word character i.e.[^a-zA-Z0-9_]
\s
whitespace (space, tab, newline or, form-feed)\S
anything but whitespace\d
digit character equivalent to[0-9]
\D
anything but a digit, equivalent to[^0-9]
\r
carriage return\n
linefeed. Will match only theLF
inCR LF
Bucking the trend with the line separators:
\N
Is not a valid meta-character. It is NOT not-newline which is written with.
in normal mode or,[^\r\n]
in dot-all mode.\R
Is only valid in some flavours of regex,\R
matches all line-break characters including:CR LF
,CR
-LF
andU+0085
(a.k.a “New Line”). But\R
is not a valid meta-character in Python re or regex. To write whitespace apart from newline you need:[^\S\r\n]
Note on matching linebreaks in Python: When python reads a string. It will always replace
\r\n
with\n
when reading a CR-LF text file. So only write\n
into python regular expressions.
Start And End Anchors ^…$
Anchors are zero-length matches. I.e. they match the boundaries between or before or after words and string, and not characters themselves.
^
means before the first character in the line (normal-mode) or string (multi-line mode)$
means after the last character in the line (normal-mode) or string (multi-line mode)
By default they are tied to the start and end of the string. But, they can be made to match line-breaks too, by adding the multi-line
flag as an argument to the particular RegEx function.
In my experience, in multi-line mode, you can only use a single instance of each of these start and end anchors per pattern. To match line-breaks within an expression you still need to use \r?\n
There are two additional anchors which are useful in multi-line mode. They work in Python re, regex101 and Notepad++:
\A
means end of entire string\Z
means end of string,
Word Boundary Anchor \b…\b
The \b
anchor matches a cursor position between a word character and a non-word character.
- Words are made up of letters, numbers or underscore. As a character class they are:
[a-zA-Z0-9_]
- Non-Words are made up of anything but these characters, e.g .
SPACE
,-
,/
.
Never write two consecutive anchors e.g. \b\b
, ^\b...
or …\b$
as you cannot have two sequential cursor positions, un-separated by a character.
\b
does NOT refer to a space it means a text cursor position between two characters. Also called a ‘caret’ ‘pointer’ or ‘insertion point.
Note: Bucking the convention of meta-characters, there is no \B
(anything but a word boundary) as it is an anchor not a character.
Groups (…)
and (…|…)
Groups provide a way to:
- apply occurrence modifiers to a part of an expression e.g.
(…)*
. - later extract different parts of the match from the resulting match object.
A pipe separator |
within the group means either/or in an expression: (a|b|c)
means a b or c. Therefore: [abc]
is equivalent to (a|b|c)
Types of groups:
r"﹍"
The whole expression itself is group number 0"…(﹍)…
a normal group, number-labelled by order of opening bracket.(…|…)
Either a or b"…(?P<name>﹍)…"
A named group, that later can be later retrieved from a match object. These don’t work in Notepad++’s PERL RegEx implementation, so I try to avoid them.(?:…)
Non-capturing group Means the group cannot be back-referenced later on with\#
. I.e. it skips the numbering of this group.
Note
- You still need to escape characters in the brackets.
- Occurrence modifiers after the brackets apply to the whole string group e.g.
(…)*
- A repeated capturing group will only capture the last iteration. Optionally:
- Put a capturing group around the repeated group to capture all iterations in a single group or;
- Write individual groups for each value.
Character Class […]
and [^…]
Match any of the characters in the list. The whole list matches a single character.
[abc]
matches any of the characters a, b or c.[^abc]
matches anything but a, b or c. I.e. anything but any of the individual characters in this range. In other words,^
is a NOT modifier for the character class.
No need to use escape characters for special characters apart from \
, ]
and -
within these square brackets. In other words, the modifier characters and brackets have no power here.
Note:
-
is a NOT modifier for the subsequent character in a character class. So you can put it as the last character in the class rather than escaping it to be more concise.
Can represent a span of ASCII characters between a range demarcated by -
:
[a-z]
matches any lowercase letter a to z
[A-Z]
matches any uppercase letter A-Z
[a-zA-Z]
matches any uppercase or lowercase letter
[0-9]
matches any digit
Occurrence Modifiers {…}
, *
, ?
and +
A.K.A. Repetition Modifiers, Quantifiers or Alternations. These only have meaning in conjunction with the preceeding character in the regex expression.
I’ll use ▯
here as a placeholder to mean any non-modifier character, meta-character or group.
▯*
, the Kleene star, matches 0 or more of▯
.▯+
matches 1 or more of▯
,▯?
matches 0 or 1 of of▯
.?
doesn’t greedy-non-greedyness here.▯{…,﹍}
match a specific range of occurances of▯
:
For example:
▯{2}
matches exactly 2 sequential occurances▯{2,}
matches at least 2 occurances▯{,2}
matches up to 2 occurances▯{2,3}
matches 2 to 3 occurances
When a modifier is placed after another modifier, the behaviour of the preceeding character in the expression is different.
I didn’t initially understand modifiers:
- I thought it working as a character itself: operating as a lookahead in the string, not modifying the regex character to the left of it within the RegEx pattern itself.
- I didn’t realise you can only use 1 modifier at a time. You shouldn’t combine them. For example, to make a 0 or more modifier, don’t use
+?
, use*
. This combination means something else.
In RegEx. .*
is the wildcard character, not a solitary *
Most searching tools, such as windows explorer search bar, tend to use an isolated Kleene star*
to mean a “wildcard”. I.e. a sequence of 0 or more of any character.
But in a RegEx pattern the *
is not a character, it is a modifier:
.
means a single occurence of any character apart from newline equivalent to[^\n]
;.*
means 0 or more of any character apart from newline, i.e. the wildcard.
In other words, the *
is just a modifier which means 0 or more of the character immediately to the left of it in the RegEx pattern. In RegEx, *
doesn’t mean anything without a preceding character.
Greedy ▯*
vs Non-Greedy ▯*?
By default, the ▯*
modifier looks for the longest possible matching string. This is called ‘greedy’ behaviour.
To instead look for the shortest possible matching string, called non-greedy or lazy behaviour, write the modified-modifier: ▯*?
.
I initially misunderstood this symbol. The ?
can only change the *
modifier to non-greedy, not a whole group or a whole expression. It is meaningless to have a ‘greedy or non-greedy expression’, you can only have a ‘greedy or non-greedy modifier’.
Also note that without the preceding modifier, the ?
is a 0-or-1 modifier not a greedy-or-non-greedy modifier.
You can also change the greediness of the +
and the ?
modifiers with +?
, ??
respectively, but it is confusing and beyond what you would likely ever need.
Group Backreferences \#
I’ll use #
here to mean any digit. I am not using ▯
because the character must be a digit character specifically.
Note: in the python re module, syntax is
\#
but in PCRE syntax (used in the Notepad++ RegEx) it is instead$#
.
If you divide the expression into groups with: ungrouped(group1)ungrouped(group2), you can reference the same match later in the expression:
\1
first group’s match\2
second group’s match ⋮\#
is the#
th group’s match.
Note: You don’t put brackets around these things.
Nesting Groups
The group number is the order of declaration of opening brackets, even if one group is within another. e.g. :
|← Group 0 (Whole Expression) →|
||← Group 1 →| |
|||← Group 2 →| | |
||| |←No Group→||←Group 3→|| | |
r"^((………(?:……………………)(………………………))………………………………………)﹍﹍﹍﹍$" ← Expression
12 - 3 ← Opening bracket declaration order
Lookaheads ?=
and Lookbehinds ?<=
A lookahead (?=﹍)
is effectively just a non-caputuring group which matches to the end of the string by default. Its is equivalent to adding an end anchor to a non-capturing group as: (?:﹍$)
.
A lookbehind (?<=﹍)
is equivalent to the non-capturing group as: (?:^﹍)
.
A negative look-ahead or -behind, involves replacing the the =
with a !
.
(?=…)
Lookahead
(?!…)
Negative Lookahead
(?<=…)
Lookbehind. Yes, it has a redundant =
after the <
(?<!…)
Negative Lookbehind
Modifiers are not allowed in look-ahead or -behinds
In most RegEx flavours including the python re
package, you can’t have ‘variable width’ lookahead or lookbehind statements, meaning you can’t use modifiers (…*
, …{,}
, …?
or …+
) in them.
Instead, use the python regex
package or use a non-matching group .
See this stack overflow post for an explanation.
Advanced Grouping
Non-Capturing Group (?:pattern)
Recall that you can use Parenthesized Back-References to capture the matches. To disable capturing, use ?: inside the parentheses in the form of (?:pattern). In other words, ?: disables the creation of a capturing group, so as not to create an unnecessary capturing group.
Named Capturing Group (?P<name>pattern)
The capture group can be referenced later by name.
The allowed characters are [a-zA-Z0-9_] for a group name
Atomic Grouping (>pattern)
Disable backtracking, even if this may lead to match failure.
Conditional (?(Cond)then|else)
Flags
Also known as ‘match options’, these are not placed within the expression string itself. Instead, the flags are entered as arguments to the regex function.
The most common flags
Flag | Purpose | Default | Python Flag ▯ in:re.compile("pattern", re.▯) |
---|---|---|---|
Multi-Line | Make ^ and $ match the start and end of a each line |
^ and $ match only the start and end of string |
re.M or re.MULTILINE |
Dot-All | Make . match \n |
. doesn’t match \n |
re.S or re.DOTALL |
Case-Sensitive | Make case insensitive | Case sensitive | re.I or re.IGNORECASE |
Of Notepad++, Pycharm and VScode search, only Notepad++ has the option for DOTALL, so any of the patterns below, where .
needs to match \n
will not work in Pycharm and VScode.
Editor | DOTALL | MULTILINE |
---|---|---|
Notepad++ | ✅ Can be turned on or off. | ✅ |
Pycharm | ❌ | ✅ |
VScode | ❌ | ✅ |
To enter multiple flags into python, you need to use the OR operators:
mo = re.search(r'^test.*$(.*)', flags=re.I | re.DOTALL)
Rarely Used meta-characters, just for completeness
\B
a boundary between two word or two non-word characters.\t
Tab,\T
Not tab\b
or backspace,\B
Not backspace\h
horizontal whitespace,\H
not horizontal whitespace\f
form feed,\F
anything but form feed\v
Vertical Tab,\V
anything but vertical tab\a
bell character ␇, for ‘audible’ , is a now useless non-readable character. a stands for ‘audible’. Originally sent to ring a small electromechanical bell on tickers and other teleprinters and teletypewriters to alert operators at the other end of the line, often of an incoming message.\A
anything but bell character.
In Python re
, not but original PERL, PCRE (Notepad++) or PCRE2:
\A
and\Z
mean are start and end of string anchors\nooo
octal number whereooo
is an octal number\uhhhh
any unicode characters up to code-point 10,000 where hhhh is hex number\xhh
256 bit (i.e. ASCII) character with where hh is hex number\p{<name>}
matches any character in this ‘unicode general category’ or ‘named block’ called:<name>
eg.IsGreek
,IsBasicLatin
,Sc
(S symbol & c currency)\P{<name>}
not the unicode category name
Only in python regex
not re
:
\R
means eitherCR
LF
,CR
orU+0085
a.k.a. ‘Next Line’.\K
removes everything prior in the matching group.