Search Patterns



Regex pattern Match
^ Beginning of the string
$ End of the string

[a-e] = [abcde]
[0-5] = [012345]
[A-Z] = [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
[A-Za-z] = all ASCII letters
[-az] or [az-] = "-" or "a" or "z"
[-a-z] = "-" or "a...z"
[^abc] = not ("a", "b" or "c")
[a^bc] = "a", "b", "c" or "^"
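
A minimal sketch of the character-class rules above in Python. The sample string and variable names are made up for illustration.

```python
import re

text = "x-ray a^b 42 Zed"  # hypothetical sample text

# A range inside brackets matches one character from that range.
lowercase_runs = re.findall(r"[a-z]+", text)

# A leading ^ negates the class: runs that are neither letters nor spaces.
non_letters = re.findall(r"[^A-Za-z ]+", text)

# "-" placed first is literal, and "^" is literal when it is not first.
literal_marks = re.findall(r"[-^]", text)

print(lowercase_runs)  # ['x', 'ray', 'a', 'b', 'ed']
print(non_letters)     # ['-', '^', '42']
print(literal_marks)   # ['-', '^']
```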

() Defines a group
. Any character except a newline

>>> re.findall(r"^A(.*)B","A123B")
['123']
a|b a or b

>>> re.findall("1(a|b)2","001a20001b20")
['a', 'b']

a{4} Exactly 4 a's
a{4,8} Between (inclusive) 4 and 8 a's
a{9,} 9 or more a's
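
The counted forms can be tried out as follows. This is only a sketch: the test string (runs of 3, 4 and 9 a's) and the use of \b to pin each run to a whole word are choices made for this example.

```python
import re

# Three runs of "a" with lengths 3, 4 and 9, separated by spaces.
text = " ".join(["a" * 3, "a" * 4, "a" * 9])

# \b on both sides forces the whole run to have the requested length.
exactly_four = re.findall(r"\ba{4}\b", text)
four_to_eight = re.findall(r"\ba{4,8}\b", text)
nine_or_more = re.findall(r"\ba{9,}\b", text)

# Without the boundaries, a{4} also matches inside longer runs.
unanchored = re.findall(r"a{4}", "a" * 9)

print(exactly_four)   # ['aaaa']
print(four_to_eight)  # ['aaaa']
print(nine_or_more)   # ['aaaaaaaaa']
print(unanchored)     # ['aaaa', 'aaaa']
```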
? match 0 or 1 repetitions of the preceding re
ab? will match either ‘a’ or ‘ab’.

>>> re.findall("ab?","123abaacd")
['ab', 'a', 'a']

* match 0 or more repetitions of the preceding re
ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

>>> re.findall("ab*","123abbbbabacd")
['abbbb', 'ab', 'a']

+ match 1 or more repetitions of the preceding re
ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.

>>> re.findall("ab+","123abbbbabacd")
['abbbb', 'ab']

?, *, + are "greedy" patterns: they match as much text as possible.
Appending a '?' to them (??, *?, +?) makes them "non-greedy": as few characters as possible will be matched.

>>> re.findall(r"<.*>","<a> b <c>")
['<a> b <c>']
>>> re.findall(r"<.*?>","<a> b <c>")
['<a>', '<c>']
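
The same greedy versus non-greedy difference shows up with quoted strings and with +. A small sketch, with sample strings invented for this example:

```python
import re

text = 'say "a" and "bc"'  # hypothetical sample text

# Greedy: .* runs to the last quote it can still match against.
greedy = re.findall(r'".*"', text)

# Non-greedy: .*? stops at the first closing quote.
lazy = re.findall(r'".*?"', text)

# +? still needs at least one repetition, so each digit matches alone.
greedy_digits = re.findall(r"\d+", "2024")
lazy_digits = re.findall(r"\d+?", "2024")

print(greedy)         # ['"a" and "bc"']
print(lazy)           # ['"a"', '"bc"']
print(greedy_digits)  # ['2024']
print(lazy_digits)    # ['2', '0', '2', '4']
```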

\d Any decimal digit: [0-9]
\D complement of \d. Any non-digit character: [^0-9]
\s Any whitespace character: [ \t\n\r\f\v]
\S Complement of \s. Any non-whitespace character: [^ \t\n\r\f\v]
\w Any alphanumeric character or underscore: [a-zA-Z0-9_]
\W Complement of \w
\b A word boundary (empty string, but only at the start or end of a word)
\B A non-word boundary (empty string, but not at the start or end of a word)
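
A short sketch of these special sequences in use. The sentence is invented for this example; note that "_" counts as a word character, so there is no word boundary inside cat_food.

```python
import re

text = "Order 66: cat, cats, cat_food and bobcat"  # hypothetical sample text

digits = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)

# \b on both sides: only the standalone word "cat" qualifies.
whole_cat = re.findall(r"\bcat\b", text)

# \B before, \b after: "cat" at the end of a longer word (bobcat).
trailing_cat = [m.span() for m in re.finditer(r"\Bcat\b", text)]

print(digits)        # ['66']
print(words)         # ['Order', '66', 'cat', 'cats', 'cat_food', 'and', 'bobcat']
print(whole_cat)     # ['cat']
print(trailing_cat)  # [(37, 40)]
```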

Escape Sequences in Strings


Escape Sequence Meaning
\newline Ignored
\\ Backslash (\)
\' Single quote (')
\" Double quote (")
\a ASCII Bell (BEL)
\b ASCII Backspace (BS)
\f ASCII Formfeed (FF)
\n ASCII Linefeed (LF)
\N{name} Character named name in the Unicode database (Unicode only)
\r ASCII Carriage Return (CR)
\t ASCII Horizontal Tab (TAB)
\uxxxx Character with 16-bit hex value xxxx (Unicode only)
\Uxxxxxxxx Character with 32-bit hex value xxxxxxxx (Unicode only)
\v ASCII Vertical Tab (VT)
\ooo Character with octal value ooo
\xhh Character with hex value hh
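
These string escapes are processed before the re module ever sees the pattern, which is why patterns are usually written as raw strings. A minimal sketch of the \b clash (backspace in a string, word boundary in a regex); the sample text is invented for this example.

```python
import re

text = "a cat"  # hypothetical sample text

# In a normal string "\b" is one backspace character; in a raw string
# it stays as the two characters backslash and "b".
plain_len = len("\b")
raw_len = len(r"\b")

# The normal string asks for backspace + "cat" + backspace: no match.
without_raw = re.findall("\bcat\b", text)

# The raw string (or doubled backslashes) reaches re as a word boundary.
with_raw = re.findall(r"\bcat\b", text)
doubled = re.findall("\\bcat\\b", text)

print(plain_len, raw_len)  # 1 2
print(without_raw)         # []
print(with_raw)            # ['cat']
print(doubled)             # ['cat']
```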

Groups


(?P<name>...) define a capturing group named 'name'
(?P=name) refer to the captured group named 'name'
\number the n'th captured group (for example \1)
(?#...) a comment
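
A sketch of backreferences, both named and numbered. The group name "tag" and the sample strings are assumptions made for this example.

```python
import re

markup = "<b>bold</b> and <i>broken</b>"  # hypothetical sample text

# (?P<tag>...) captures the tag name; (?P=tag) must repeat the same text,
# so the mismatched <i>...</b> pair is rejected.
balanced = re.findall(r"<(?P<tag>[a-z]+)>(.*?)</(?P=tag)>", markup)

# \1 refers back to the first group; (?#...) is ignored by the engine.
sentence = "it was the the best of of times"
doubled_words = re.findall(r"\b(\w+) \1\b(?#same word twice)", sentence)

print(balanced)       # [('b', 'bold')]
print(doubled_words)  # ['the', 'of']
```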

>>> match = re.search(r"<([a-z]+)>(.*)","<name>Samuel")
>>> match.group(1)
'name'
>>> match.group(2)
'Samuel'

>>> re.findall(r"<([a-z]+)>(.*)","<name>Samuel")
[('name', 'Samuel')]

>>> m = re.search(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}

>>> re.findall(r"<(?P<tag>[a-z]+)>(.*)", "<name>Malcolm Reynolds")
[('name', 'Malcolm Reynolds')]

(?=...) positive lookahead

>>> re.findall('abc (?=def)', 'abc def')
['abc ']

(?!...) negative lookahead

>>> re.findall('abc(?!def)', 'abcde')
['abc']

(?<=...) positive lookbehind

>>> re.findall('(?<=abc)def', 'abcdef')
['def']

>>> re.findall(r'(?<=-)\w+', 'spam-egg')
['egg']

>>> re.findall(r'(?<=:).*\.(?#find the list)', 'This is a list: 1, 2, 3, 4 .')
[' 1, 2, 3, 4 .']

(?<!...) negative lookbehind

>>> re.findall('(?<!abc)def', 'abcdef defabc')
['def']
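
Lookarounds check the surrounding text without including it in the match. A sketch with made-up sample strings (sizes with units, amounts with currency prefixes):

```python
import re

# Lookahead: numbers that are followed by "px"; the unit is not captured.
px_values = re.findall(r"\d+(?=px)", "10px 20em 30px")

# Negative lookahead: "cat" that is not followed by "s".
singular = re.findall(r"cat(?!s)", "cat cats")

amounts = "USD100 EUR250 USD75"  # hypothetical sample text

# Lookbehind: digits directly preceded by "USD".
usd = re.findall(r"(?<=USD)\d+", amounts)

# Negative lookbehind: digits not preceded by "USD"; the second check
# stops the match from starting in the middle of a number.
not_usd = re.findall(r"(?<!USD)(?<!\d)\d+", amounts)

print(px_values)  # ['10', '30']
print(singular)   # ['cat']
print(usd)        # ['100', '75']
print(not_usd)    # ['250']
```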

Example usage:

>>> import re
>>> m = re.search(r"at","A cat in a hat.")
>>> m
<_sre.SRE_Match object; span=(3, 5), match='at'>
>>> m = re.search(r"(at)","A cat in a hat.")
>>> m.group(1)
'at'
>>> m.group(0)
'at'
>>> m.span()
(3, 5)
>>> m.start()
3
>>> m.end()
5

>>> re.findall(r"at","A cat in a hat.")
['at', 'at']

>>> re.sub("at","**", "A cat in a hat.")
'A c** in a h**.'
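
re.sub can also reuse captured groups in the replacement, limit the number of substitutions, or call a function for each match. A sketch on the same sentence; the replacement strings are invented for this example.

```python
import re

text = "A cat in a hat."

# \1 in the replacement inserts whatever the first group captured.
swapped = re.sub(r"(\w)at", r"\1og", text)

# count=1 replaces only the first occurrence.
first_only = re.sub(r"at", "**", text, count=1)

# A callable receives each match object and returns the replacement text.
shouted = re.sub(r"\b\w+at\b", lambda m: m.group(0).upper(), text)

print(swapped)     # A cog in a hog.
print(first_only)  # A c** in a hat.
print(shouted)     # A CAT in a HAT.
```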

>>> compiled_re = re.compile("at")
>>> compiled_re.search("A cat in a hat.")
<_sre.SRE_Match object; span=(3, 5), match='at'>

>>> re.findall(r"at","A cat in a hat.\nA rAt and a bAt.", re.IGNORECASE)
['at', 'at', 'At', 'At']

>>> "A cat  and  a  \n  rat".split(" ")
['A', 'cat', '', 'and', '', 'a', '', '\n', '', 'rat']
>>> "A cat  and  a  \n  rat".split(None)
['A', 'cat', 'and', 'a', 'rat']
>>> "A cat  and  a  \n  rat".split(r"at")
['A c', '  and  a  \n  r', '']
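
The calls above use str.split, which treats its argument as a literal separator. For splitting on a pattern, re.split is the regex counterpart. A sketch on the same sentence:

```python
import re

text = "A cat  and  a  \n  rat"

# Split on any run of whitespace.
on_whitespace = re.split(r"\s+", text)

# Split on the pattern "at"; a match at the very end leaves an empty string.
on_at = re.split(r"at", text)

# A capturing group in the pattern keeps the separators in the result.
with_separators = re.split(r"(\s+)", "a  b")

print(on_whitespace)    # ['A', 'cat', 'and', 'a', 'rat']
print(on_at)            # ['A c', '  and  a  \n  r', '']
print(with_separators)  # ['a', '  ', 'b']
```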
Abbreviation  Full name      Description
re.I          re.IGNORECASE  Makes the regular expression case-insensitive.
re.L          re.LOCALE      Makes the behaviour of some special sequences such as \w, \W, \b, \s and \S dependent on the current locale, i.e. the user's language, country and so on.
re.M          re.MULTILINE   ^ and $ match at the beginning and end of each line, not just at the beginning and end of the string.
re.S          re.DOTALL      The dot "." matches every character, including the newline.
re.U          re.UNICODE     Makes \w, \W, \b, \B, \d, \D, \s and \S dependent on Unicode character properties.
re.X          re.VERBOSE     Allows "verbose" regular expressions: whitespace in the pattern is ignored, so spaces, tabs and carriage returns are not matched as such. To match a space in a verbose regular expression, escape it with a backslash or include it in a character class. A "#" is also ignored, except when it is in a character class or preceded by an unescaped backslash; everything from the "#" to the end of the line is ignored, so this character can be used to start a comment.
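
A sketch of how some of these flags change matching. The sample strings and the name decimal_number are made up for this example; flags are combined with "|".

```python
import re

text = "cat\nCat\nbat"  # hypothetical three-line sample text

# Without re.M, ^ and $ only anchor to the whole string.
whole_string = re.findall(r"^cat$", text)
per_line = re.findall(r"^cat$", text, re.M)
per_line_any_case = re.findall(r"^cat$", text, re.M | re.I)

# Without re.S the dot refuses to cross the newline.
dot_default = re.findall(r"cat.Cat", text)
dot_all = re.findall(r"cat.Cat", text, re.S)

# re.X allows whitespace and "#" comments inside the pattern.
decimal_number = re.compile(r"""
    \d+   # integer part
    \.    # a literal dot
    \d*   # fractional digits, if any
""", re.X)
numbers = decimal_number.findall("pi is 3.14, e is 2.718")

print(whole_string)       # []
print(per_line)           # ['cat']
print(per_line_any_case)  # ['cat', 'Cat']
print(dot_default)        # []
print(dot_all)            # ['cat\nCat']
print(numbers)            # ['3.14', '2.718']
```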
