7.6.2. Regular expression functions and methods

In this section, we introduce some of the basic functions and methods that use regular expressions to retrieve information from files or long text strings. In the last section we sketched a language, the regular expression language, which with a few variations is used in many programming languages. This section is about how to use regular expressions in Python.

First, no Python regular expression matching is available until the re module is imported:

>>> import re

Second, while regular expressions can be defined with simple strings in the RE language, it is more efficient, for patterns that will be used over and over, to compile the expression into a Python regular expression object. The regular expression class defines match, search, findall, and other useful methods. For example:

>>> import re
>>> regex = re.compile(r'b.+')
>>> regex.match('ba')
<_sre.SRE_Match object at 0x10047d3d8>
>>> m = regex.match('ba')
>>> m.group()
'ba'

The match method of regular expression objects returns an instance of a match object (if there is a match), which has methods of its own (like group) for retrieving the results. Calling group with no arguments returns the entire match.
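Because match returns None when there is no match, it is usually worth testing the result before calling group. The following is a minimal sketch of that check; the helper name first_match is hypothetical and used only for illustration.

```python
import re

regex = re.compile(r'b.+')

def first_match(text):
    """Return the text matched at the start of `text`, or None if there is no match.

    first_match is a hypothetical helper, not part of the re module.
    """
    m = regex.match(text)
    # Calling m.group() on None would raise AttributeError, so test first.
    return m.group() if m is not None else None

print(first_match('ba'))    # 'b' followed by at least one character
print(first_match('abba'))  # does not start with 'b', so no match
print(first_match('b'))     # '.+' needs at least one character after 'b'
```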

In this section we discuss methods for regular expression objects and match objects. Taken together, these methods cover the most useful things you can do with regular expressions in Python.

7.6.2.1. Regular expression objects

Python regular expression objects are compiled regular expressions created by re.compile. In the example above:

>>> import re
>>> regex = re.compile(r'b.+')

The name regex is defined to be a Python regular expression object. Regular expression objects are then used to find patterns in and manipulate strings. The five most commonly used regular expression methods are:

  1. search

  2. match

  3. split

  4. findall

  5. sub

Each of these methods on a compiled regular expression object has a corresponding function in the re module, which takes the pattern as its first argument and the string as its second. Thus, using the regular expression object regex created above we could do:

>>> regex.search('ba')

Equivalently, we could bypass the step in which we created a regular expression object and just do:

>>> re.search(r'b.+', 'ba')

The same goes for match, split, findall and sub. The difference between the two forms is efficiency: a regular expression object is compiled once and reused, whereas the function form has to look up or recompile the pattern on every call. If you are doing multiple searches with the same pattern, use a regular expression object.
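As a quick check of the equivalence described above, the sketch below runs the same pattern through both forms and compares the results. The sample text and variable names are made up for illustration.

```python
import re

text = 'a bat and a ball'          # made-up sample text
regex = re.compile(r'b\w+')        # 'b' followed by one or more word characters

# Method on the compiled regular expression object.
compiled_result = regex.findall(text)

# Module-level function: the pattern string is the first argument.
function_result = re.findall(r'b\w+', text)

print(compiled_result)
print(function_result)
print(compiled_result == function_result)
```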

The following descriptions are taken from the Python docs (re module). In each case pattern refers to the regular expression object on which the method is called.

search(string)

Scan through string looking for a location where this regular expression produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
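The distinction between "no match" and "a zero-length match" can be seen in a short sketch. The patterns and strings here are illustrative assumptions; span is a match object method not otherwise covered in this section, and it reports where the match was found.

```python
import re

digits = re.compile(r'\d+')        # one or more digits
maybe_digits = re.compile(r'\d*')  # zero or more digits

# search scans the whole string, so the match need not be at the start.
found = digits.search('room 101')
print(found.group(), found.span())

# No position matches: the result is None, not a match object.
missing = digits.search('no numbers here')
print(missing)

# '\d*' can match zero characters, so search succeeds with an empty match.
empty = maybe_digits.search('no numbers here')
print(repr(empty.group()), empty.span())
```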

match(string)

If zero or more characters at the beginning of string match this regular expression, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match:

>>> pattern = re.compile("o")
>>> pattern.match("dog")      # No match as "o" is not at the start of "dog".

If you want to locate a match anywhere in string, use search() instead (see also "search() vs. match()" in the Python re documentation).
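A side-by-side sketch of the two methods on the same pattern may help; the second test string, "ox", is an added illustration.

```python
import re

pattern = re.compile("o")

# match only succeeds if the pattern matches at the beginning of the string.
at_start = pattern.match("dog")
print(at_start)            # "o" is not at the start of "dog"

# search finds the pattern wherever it first occurs.
anywhere = pattern.search("dog")
print(anywhere.group())

# When the string does begin with the pattern, match succeeds too.
begins_with_o = pattern.match("ox")
print(begins_with_o.group())
```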

split(string, maxsplit=0)

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
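The three behaviors described for split (plain splitting, maxsplit, and capturing parentheses) can be sketched as follows. The separator pattern and sample strings are assumptions chosen for illustration.

```python
import re

sep = re.compile(r',\s*')   # a comma followed by optional whitespace

# Plain split: the separators are discarded.
parts = sep.split('a, b,c,  d')
print(parts)

# With maxsplit=2, only two splits occur; the rest comes back unsplit.
limited = sep.split('a, b,c,  d', maxsplit=2)
print(limited)

# Capturing parentheses around the comma keep that text in the result.
keeping = re.compile(r'(,)\s*').split('a, b,c')
print(keeping)
```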

findall(string)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in pattern, return a list of groups; this will be a list of tuples if pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
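How the shape of findall's result depends on the number of groups is easiest to see by example. This sketch uses a made-up input string with three patterns that differ only in their grouping.

```python
import re

text = 'x=1, y=22, z=333'   # made-up sample text

# No groups: a list of the full matches.
no_groups = re.compile(r'\d+').findall(text)
print(no_groups)

# One group: a list of what that group matched.
one_group = re.compile(r'(\w)=\d+').findall(text)
print(one_group)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.compile(r'(\w)=(\d+)').findall(text)
print(two_groups)
```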

sub(repl, string)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. Any backslash escapes in repl are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone in older versions of Python; since Python 3.7, an unknown escape made of a backslash and an ASCII letter raises an error. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern. For example:

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'def my_\1 ():',
...        'def hello_world( ):')
'def my_hello_world ():'
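The same kind of substitution can be done with the sub method of a compiled regular expression object. The sketch below, using a made-up date pattern and sample sentence, reorders three groups with backreferences and shows that every non-overlapping occurrence is replaced.

```python
import re

# Three groups: year, month, day (an illustrative pattern, not a full date validator).
date = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# \3, \2 and \1 refer back to the day, month and year groups.
swapped = date.sub(r'\3/\2/\1', 'Released 2010-07-03, updated 2011-02-20.')
print(swapped)

# If the pattern is not found, the string comes back unchanged.
unchanged = date.sub(r'\3/\2/\1', 'no dates here')
print(unchanged)
```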

This concludes this partial description of regular expression methods. For more information, see the Python docs.

7.6.2.2. Match objects

Python match objects are what the regular expression methods search and match return when they find a match.

The two most important methods of match objects are group and groups. Both the following descriptions are from the Python re docs.

group(*groupN)

*groupN refers to any number of arguments, including none.

Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, *groupN defaults to a single zero argument. If a *groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned:

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')
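The three edge cases mentioned in the description of group (a group that did not participate, a group that matched several times, and an out-of-range group number) are not covered by the example above. A minimal sketch with made-up patterns:

```python
import re

# Alternation: only one of the two groups can take part in the match.
m = re.match(r'(a)|(b)', 'a')
skipped = m.group(2)
print(skipped)             # group 2 did not participate

# A repeated group keeps only its last match.
m = re.match(r'(\d)+', '123')
whole, last = m.group(0), m.group(1)
print(whole, last)

# Asking for a group the pattern does not define raises IndexError.
try:
    m.group(2)
    raised = False
except IndexError:
    raised = True
print(raised)
```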

If the regular expression uses the special (?P<name>...) syntax, the groupN arguments may also be strings identifying groups by their group name. If a string argument is not used as a group name in the pattern, an IndexError exception is raised.

A moderately complicated example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'
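Named groups are still numbered like any other group, and an unknown name raises IndexError as described above. The following sketch illustrates both points; the name middle_name is deliberately one that the pattern does not define.

```python
import re

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")

# A named group can also be retrieved by its position.
by_name = m.group('first_name')
by_number = m.group(1)
print(by_name == by_number)

# 'middle_name' is not a group name in this pattern, so group raises IndexError.
try:
    m.group('middle_name')
    unknown_name_raised = False
except IndexError:
    unknown_name_raised = True
print(unknown_name_raised)
```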

groups(default_value=None)

Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default_value argument is used for groups that did not participate in the match; it defaults to None.

For example:

>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
>>> m.groups()
('24', '1632')

If we make the decimal place and everything after it optional, not all groups might participate in the match. These groups will default to None unless a default_value argument is supplied:

>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
>>> m.groups()      # Second group defaults to None.
('24', None)
>>> m.groups('0')   # Now, the second group defaults to '0'.
('24', '0')