7.6.1. Regular expression language

In this section we explore Python’s string-matching capabilities. The ability to match patterns in strings depends on having a formally defined language for patterns that precisely define sets of strings. Expressions in this language are called regular expressions, regexes, or simply REs. Below, we collect some material introducing the Python regular expression language. Python REs are quite similar to those of other programming languages such as Perl, so students who have encountered regular expressions elsewhere will probably not find this section very difficult. However, there are some differences between Python REs and those of other popular string-manipulating languages like Perl. In particular, there are differences in how pattern-matching functions are called and how their results are retrieved. Those are covered in the next section.

7.6.1.1. Python for Google developers: Regular expression lesson

Three important sources

  1. Google Python course regular expression lesson
  2. Andrew Kuchling’s Regular expression HOWTO
  3. Python’s library page on REs

Read the Google lesson first for a very good introduction and overview, then Kuchling’s HOWTO for a slightly more in-depth picture, and use the library page as a reference.

7.6.1.2. Basics

Python regular expressions are strings: strings that are used to match sets of strings. In many cases a character just matches itself. All alphabetic characters and numerals match themselves, and case matters, so cat matches cat and CAT matches CAT. But a number of characters and character sequences have special meanings.

Below we list some of these, the regular expression operators of Python:

Operator    Behavior
--------    ---------

.           Wildcard, matches any character
^abc        Matches some pattern abc at the start of a string
abc$        Matches some pattern abc at the end of a string
[abc]       Matches one of a set of characters
[^abc]      Matches any character NOT in the set of characters
[A-Z0-9]    Matches one of a range of characters
ed|ing|s    Matches one of the specified strings (disjunction)
*           Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)
+           One or more of previous item, e.g. a+, [a-z]+
?           Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n}         Exactly n repeats where n is an integer
{n,}        At least n repeats
{,n}        No more than n repeats
{m,n}       At least m and no more than n repeats
a(b|c)+     Parentheses that indicate the scope of the operators
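
A few rows of the table can be checked directly. The following is a minimal sketch, assuming the standard re module; the helper matches_whole and the test strings are made up for illustration, and the calling conventions of re.fullmatch and re.search are covered in the next section.

```python
import re

def matches_whole(pattern, text):
    """True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Counted repetition applies to the item just before the braces.
exactly_three = [s for s in ["aa", "aaa", "aaaa"] if matches_whole(r"a{3}", s)]
at_least_two = [s for s in ["a", "aa", "aaaa"] if matches_whole(r"a{2,}", s)]
at_most_two = [s for s in ["", "a", "aa", "aaa"] if matches_whole(r"a{,2}", s)]
two_to_three = [s for s in ["a", "aa", "aaa", "aaaa"] if matches_whole(r"a{2,3}", s)]

# Disjunction: any one of the alternatives is accepted.
suffixes = [s for s in ["ed", "ing", "s", "er"] if matches_whole(r"ed|ing|s", s)]

# Anchors: re.search scans the string, while ^ and $ pin the match to one end.
abc_at_start = re.search(r"^abc", "abcdef") is not None      # True
abc_at_end = re.search(r"abc$", "xyzabc") is not None        # True
abc_at_start_of_xabc = re.search(r"^abc", "xabc") is not None  # False

print(exactly_three, at_least_two, at_most_two, two_to_three, suffixes)
```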

The power of the matching language comes from the wildcards and disjunctions. Thus cat matches cat, but c.t matches cat, cot, and cut, as well as cet, cit, czt, c t, and c#t. That is, the wildcard . matches any character, not just letters. Note that space (" ") counts as a character and that a sequence of two spaces counts as two characters. Line breaks and tabs also count as characters, and they are different characters from space, even though spaces and tabs may look the same to you when they print out on the screen. By default, however, the one character . can’t match is a line break, so by default you won’t get matches that cross lines.
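
The behavior of the wildcard can be checked with a short sketch. It assumes the standard re module; the candidate strings are chosen for illustration, and re.DOTALL is the flag that lets . match a line break as well.

```python
import re

# Candidate strings; the last one has a line break between c and t.
candidates = ["cat", "cot", "c t", "c#t", "c\tt", "ct", "c\nt"]

# re.fullmatch succeeds only if the whole string matches the pattern.
dot_matches = [s for s in candidates if re.fullmatch(r"c.t", s)]
print(dot_matches)  # ['cat', 'cot', 'c t', 'c#t', 'c\tt']

# With the re.DOTALL flag, . also matches a line break.
dotall_matches = [s for s in candidates if re.fullmatch(r"c.t", s, flags=re.DOTALL)]
print(dotall_matches)  # ['cat', 'cot', 'c t', 'c#t', 'c\tt', 'c\nt']
```

Note that ct never matches: the wildcard stands for exactly one character, not for zero characters.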

A frequent mistake for beginners is to forget the meaning of [ … ]. Without a repetition operator following, this specifies exactly one character, but allows it to be one of a set. Thus c[aou]t matches cat, cut and cot, but does not match ct, caut, or cout. It matches only strings that are 3 characters long. Adding the negation symbol doesn’t change this fact: c[^aou]t still matches 3-character strings, just different ones, for example, c#t, cit, cet and c t.
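
The point about character sets can be sketched as follows, assuming the standard re module; the helper full_matches and the word list are hypothetical and exist only for this illustration.

```python
import re

def full_matches(pattern, strings):
    """Return the strings that the pattern matches in their entirety."""
    return [s for s in strings if re.fullmatch(pattern, s)]

words = ["cat", "cot", "cut", "ct", "caut", "cit", "c#t", "c t"]

# A set without a repetition operator stands for exactly one character.
set_hits = full_matches(r"c[aou]t", words)
print(set_hits)      # ['cat', 'cot', 'cut']

# Negating the set changes which character is allowed, not how many.
negated_hits = full_matches(r"c[^aou]t", words)
print(negated_hits)  # ['cit', 'c#t', 'c t']
```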

The repetition operators + and * introduce patterns that match infinite sets of strings. Thus, ca*t matches any string beginning with c and ending with t that has 0 or more a’s in the middle: ct, cat, caat, caaat, …

Used with [ … ], * means 0 or more instances of the characters in the set. So c[aou]*t matches ct, cat, cot, cut, caut, caat, caout, cuaot, couat, coaut, cuuaaooouaot, and so on.
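
A small sketch of the repetition operators, assuming the standard re module; the sample strings are picked for illustration.

```python
import re

samples = ["ct", "cat", "caat", "caaat", "cot", "caut", "cuuaaooouaot", "cet"]

# * allows zero or more of the preceding item, here the single character a.
star_on_char = [s for s in samples if re.fullmatch(r"ca*t", s)]
print(star_on_char)  # ['ct', 'cat', 'caat', 'caaat']

# After a set, * allows zero or more characters drawn from the set, in any mix.
star_on_set = [s for s in samples if re.fullmatch(r"c[aou]*t", s)]
print(star_on_set)   # ['ct', 'cat', 'caat', 'caaat', 'cot', 'caut', 'cuuaaooouaot']

# + requires at least one occurrence, so ct drops out.
plus_on_char = [s for s in samples if re.fullmatch(r"ca+t", s)]
print(plus_on_char)  # ['cat', 'caat', 'caaat']
```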

Parentheses have more than one function in Python regular expressions. First, they can indicate the scope of an operator, that is, the regular expression the operator is intended to apply to. Thus a(b|c)+ is equivalent to a[bc]+. It matches ab, ac, abc, acb, abbc, accb, abbbcb, and so on. Since bc matches exactly one string, bc, the pattern a(bc)+ matches abc, abcbc, and abcbcbc, but not acb, abbcc, or abcb. Without the parentheses, the + is interpreted as applying only to the character immediately preceding it. So abc+ matches abc, abcc, and abccc, but not abbc or abcabc. To match the last string you might use (abc)+. Second, parentheses can identify a subpart of the match called a group; for example, we may want to extract a subpart of the match for further processing. In this function, the parentheses do not affect the set of strings matched. Thus, a(b)c matches exactly the same string as abc, but makes the b part of the match available for later processing. We will see how to make use of this feature in the next section.
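
Both roles of parentheses can be illustrated with a short sketch. It assumes the standard re module; the helper full_matches and the test strings are hypothetical, and the group() method is only previewed here since it is covered in the next section.

```python
import re

tests = ["ab", "ac", "abc", "abcbc", "acb", "abbc", "abcc", "abcabc"]

def full_matches(pattern):
    """Return the test strings that the pattern matches in their entirety."""
    return [s for s in tests if re.fullmatch(pattern, s)]

# Scope: what + repeats depends on what the parentheses enclose.
alt_plus = full_matches(r"a(b|c)+")   # repeats "b or c"
set_plus = full_matches(r"a[bc]+")    # same set of strings
seq_plus = full_matches(r"a(bc)+")    # repeats the sequence bc
no_parens = full_matches(r"abc+")     # repeats only c
whole_plus = full_matches(r"(abc)+")  # repeats the sequence abc

print(alt_plus)    # ['ab', 'ac', 'abc', 'abcbc', 'acb', 'abbc', 'abcc']
print(seq_plus)    # ['abc', 'abcbc']
print(no_parens)   # ['abc', 'abcc']
print(whole_plus)  # ['abc', 'abcabc']

# Grouping: a(b)c matches the same string as abc but remembers the b.
m = re.fullmatch(r"a(b)c", "abc")
whole, inner = m.group(0), m.group(1)
print(whole, inner)  # abc b
```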

In addition to the regular expression operators, there are special symbols that stand for important sets of strings.

Special symbols standing for sets of strings:

Symbol    Function

 \b       Word boundary (zero width)
 \B       This is the opposite of \b, only matching when the current position
          is not at a word boundary (also zero width)
 \d       Any decimal digit (equivalent to [0-9])
 \D       Any non-digit character (equivalent to [^0-9])
 \s       Any whitespace character (equivalent to [ \t\n\r\f\v])
 \S       Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
 \w       Any alphanumeric character (equivalent to [a-zA-Z0-9_])
 \W       Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
 \t       The tab character
 \n       The newline character
 ^        Matches at the start of the string (zero width); in multiline
          mode, also matches immediately after each \n
 $        Matches at the end of the string or just before a newline at the
          end of the string (zero width); in multiline mode, also matches
          just before each \n
 \A       Matches only at the start of the string (zero width), even in multiline
          mode
 \Z       Matches only at the end of the string (zero width)
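
The following sketch exercises several of the symbols in the table. It assumes the standard re module and uses re.findall, which returns all non-overlapping matches; the sample text is made up for illustration.

```python
import re

# Hypothetical two-line sample containing a tab and a line break.
text = "Room 42\tis open\nroom 7 is closed"

digits = re.findall(r"\d+", text)
print(digits)       # ['42', '7']

words = re.findall(r"\w+", text)
print(words)        # ['Room', '42', 'is', 'open', 'room', '7', 'is', 'closed']

whitespace = re.findall(r"\s", text)
print(whitespace)   # [' ', '\t', ' ', '\n', ' ', ' ', ' ']

# ^ and $ are sensitive to multiline mode; \A and \Z are not.
start_default = re.findall(r"^[Rr]oom", text)                        # ['Room']
start_multiline = re.findall(r"^[Rr]oom", text, flags=re.MULTILINE)  # ['Room', 'room']
start_absolute = re.findall(r"\A[Rr]oom", text, flags=re.MULTILINE)  # ['Room']

end_default = re.findall(r"\w+$", text)                        # ['closed']
end_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)  # ['open', 'closed']
end_absolute = re.findall(r"\w+\Z", text, flags=re.MULTILINE)  # ['closed']
```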

These special symbols are helpful for a variety of reasons. They shorten patterns and make them easier to understand. For example, to match Social Security numbers we would use:

\d{3}-\d{2}-\d{4}
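
As a quick check of this pattern, here is a minimal sketch assuming the standard re module. The sample text and the numbers in it are invented for illustration and are not real Social Security numbers.

```python
import re

ssn_pattern = r"\d{3}-\d{2}-\d{4}"

# Made-up sample: two strings with the 3-2-4 shape and one with the wrong grouping.
sample = "Records: 123-45-6789 (valid shape), 12-345-6789 (wrong grouping), 987-65-4321."

found = re.findall(ssn_pattern, sample)
print(found)  # ['123-45-6789', '987-65-4321']
```

The pattern checks only the shape of the string (digit counts and hyphens), not whether the number was ever issued.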

On the other hand suppose we wanted to count all instances of the word the, capitalized or not, in a text. The pattern:

[Tt]he

will also match instances of the that occur word-internally, and so will count cases like other, mother, and there. An accurate count would come from all strings matching the following pattern:

\b[Tt]he\b

The \b on either side matches only if no word-internal character immediately precedes or follows the string. For example, a space is allowed and an o is not. No character at all is also allowed, so the The that begins Samuel Beckett’s novel Murphy would match. (The sun shone, having no alternative, on the nothing new.)
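
The difference between the two patterns can be seen in a short sketch, assuming the standard re module. The text is the Beckett sentence quoted above followed by a made-up second sentence that contains the word-internal cases.

```python
import re

# Beckett's opening sentence plus a hypothetical sentence with word-internal "the".
text = ("The sun shone, having no alternative, on the nothing new. "
        "The other mother was there.")

loose = re.findall(r"[Tt]he", text)
strict = re.findall(r"\b[Tt]he\b", text)

print(len(loose))   # 6: also counts other, mother, and there
print(len(strict))  # 3
print(strict)       # ['The', 'the', 'The']
```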

7.6.1.3. Advice

Adapted from the Google online Python course:

Regular expression patterns pack a lot of meaning into just a few characters, but they are so dense that you can spend a lot of time debugging your patterns. Set up your runtime so you can easily run a pattern and print what it matches, for example by running it on a small test text and printing the result of match() or search() (see Regular expression functions and methods). If the pattern matches nothing, try weakening the pattern, removing parts of it so you get too many matches. When it’s matching nothing, you can’t make any progress since there’s nothing concrete to look at. Once it’s matching too much, you can work on tightening it up incrementally to hit just what you want.
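
The weaken-then-tighten workflow might look like the following sketch. It assumes the standard re module and uses re.findall rather than match() or search() so that every match is visible at once; the sample text, the addresses, and the three patterns are all hypothetical.

```python
import re

# Small made-up test text to debug against.
sample = "Contact: alice@example.com, bob@example.org"

# First attempt: too strict (it demands a digit before the @), so nothing matches.
strict = re.findall(r"\w+\d@\w+\.com", sample)
print(strict)     # []

# Weaken the pattern until it matches something, even if it is not quite right.
loose = re.findall(r"\w+@\w+", sample)
print(loose)      # ['alice@example', 'bob@example']

# Tighten incrementally, checking the printed matches at each step.
tightened = re.findall(r"\w+@\w+\.\w+", sample)
print(tightened)  # ['alice@example.com', 'bob@example.org']
```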

Section Regular expression examples provides some examples using match, and the regular expression IPython notebook provides specific Python code to implement other parts of this advice.