12.2. Regular expression functions and methods
In this section, we introduce some of the basic functions and methods that use regular expressions to retrieve information from files or long text strings. In the last section we sketched a language, the regular expression language, which with a few variations is used in many programming languages. This section is about how to use regular expressions in Python.
First, no Python regular expression matching is available until the re module is loaded:
>>> import re
Second, while regular expressions can be defined with
simple strings in the RE language, it is much more efficient,
for matches that will be used over and over, to compile the expression into
a Python regular expression object. The regular expression
object provides match, search, findall, and other useful methods. For example:
>>> import re
>>> regex = re.compile(r'b.+')
>>> regex.match('ba')
<_sre.SRE_Match object at 0x10047d3d8>
>>> m = regex.match('ba')
>>> m.group()
'ba'
The match method of regular expression objects returns a
match object (if there is a match), which has methods of its
own (such as group) for retrieving the results. Calling group
with no arguments returns the entire match.
In this section we discuss methods for regular expression objects and match objects. Taken together, these methods cover the useful things you can do with regular expressions in Python.
12.2.1. Regular expression objects
Regular expression objects are compiled regular expressions
created by re.compile. In the example above:
>>> import re
>>> regex = re.compile(r'b.+')
regex is defined to be a Python regular expression object.
Regular expression objects are then used to find patterns in and manipulate
strings. The five most commonly used regular expression methods are
search, match, split, findall, and sub.
Each of these methods on a compiled regular expression object has a corresponding
function in the re module, which takes the pattern as its first argument. Thus, using the regular expression object
regex created above, we could call regex.search(string) on some string.
Equivalently, we could bypass the step in which we created a regular expression object and call re.search(r'b.+', string) directly.
The same goes for
match, split, findall and
sub. The difference
between these two is efficiency. Using the regular expression object is faster.
If you are doing multiple searches with the same pattern, use a regular
expression object.
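As a minimal sketch of the two equivalent routes described above, the following compares the method on a compiled object with the module-level function. The sample string is a made-up assumption, not taken from the original text.

```python
import re

text = "abracadabra"  # hypothetical sample string

# Route 1: compile once, then call the method on the regular expression object.
regex = re.compile(r"b.+")
m1 = regex.search(text)

# Route 2: hand the pattern string to the module-level function.
m2 = re.search(r"b.+", text)

# Both routes find the same match: the first "b" and everything after it.
print(m1.group())  # bracadabra
print(m2.group())  # bracadabra
print(m1.span() == m2.span())  # True
```

The compiled form is the one to prefer when the same pattern is applied many times, for example inside a loop over the lines of a file.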
The following descriptions are taken from the Python docs (re module). In each case
pattern refers to the regular expression object
which calls the method.
pattern.search(string)
Scan through string looking for a location where this regular expression produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the
pattern; note that this is different from finding a zero-length match at some point in the string.
pattern.match(string)
If zero or more characters at the beginning of string match this regular expression, return a corresponding MatchObject instance. Return None if the string does not match the
pattern; note that this is different from a zero-length match:
>>> pattern = re.compile("o")
>>> pattern.match("dog")   # No match as "o" is not at the start of "dog".
If you want to locate a match anywhere in string, use
search() instead (see also the section "search() vs. match()" in the Python docs).
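The difference between match and search, and between None and a zero-length match, can be seen side by side in a short sketch. The second pattern, x*, is an illustrative assumption chosen because it can match the empty string.

```python
import re

pattern = re.compile("o")

# match only tries the start of the string, so "dog" gives None.
at_start = pattern.match("dog")

# search scans forward and finds the "o" at index 1.
anywhere = pattern.search("dog")

print(at_start)         # None
print(anywhere.span())  # (1, 2)

# A zero-length match is still a match object, not None:
# "x*" matches zero "x" characters at the start of "dog".
empty = re.compile("x*").match("dog")
print(empty is not None)    # True
print(repr(empty.group()))  # ''
```

Because a failed match returns None, code that goes on to call group should test the result first.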
pattern.split(string, maxsplit=0)
Split string by the occurrences of
pattern. If capturing parentheses are used in
pattern, then the text of all groups in
pattern is also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
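A small sketch of the three behaviours just described: a plain split, a split limited by maxsplit, and a split whose pattern contains capturing parentheses. The comma-separated sample strings are hypothetical.

```python
import re

line = "alpha, beta,gamma ,delta"  # hypothetical sample data
sep = re.compile(r"\s*,\s*")       # a comma with optional surrounding whitespace

# Plain split: the separators are discarded.
parts = sep.split(line)
print(parts)    # ['alpha', 'beta', 'gamma', 'delta']

# maxsplit=2: two splits occur, the rest comes back as the final element.
limited = sep.split(line, maxsplit=2)
print(limited)  # ['alpha', 'beta', 'gamma ,delta']

# Capturing parentheses: the text of the group is returned in the list too.
with_sep = re.compile(r"\s*(,)\s*").split("a , b,c")
print(with_sep)  # ['a', ',', 'b', ',', 'c']
```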
pattern.findall(string)
Return all non-overlapping matches of
pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in
pattern, return a list of groups; this will be a list of tuples if
pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
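How the shape of the findall result depends on the number of groups is easiest to see with one input and three patterns. This is only a sketch; the sample string and the key=value layout are assumptions made for illustration.

```python
import re

text = "width=10, height=25, depth=3"  # hypothetical sample string

# No groups: a list of the whole matches.
numbers = re.compile(r"\d+").findall(text)
print(numbers)  # ['10', '25', '3']

# One group: a list of the text matched by that group only.
names = re.compile(r"(\w+)=\d+").findall(text)
print(names)    # ['width', 'height', 'depth']

# More than one group: a list of tuples, one tuple per match.
pairs = re.compile(r"(\w+)=(\d+)").findall(text)
print(pairs)    # [('width', '10'), ('height', '25'), ('depth', '3')]
```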
pattern.sub(repl, string)
Return the string obtained by replacing the leftmost non-overlapping occurrences of
pattern in string by the replacement
repl. If the
pattern isn't found, string is returned unchanged. Any backslash escapes in
repl are processed. That is,
\n is converted to a single newline character,
\r is converted to a carriage return, and so forth. Unknown escapes such as
\j are left alone (in Python 3.7 and later, an unknown escape made of a backslash and an ASCII letter raises an error instead). Backreferences, such as
\6, are replaced with the substring matched by group 6 in the
pattern. For example, using the equivalent re.sub function, which takes the pattern as its first argument:
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'def my_\1 ():',
...        'def hello_world( ):')
'def my_hello_world ():'
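The same idea can be sketched with the method form on a compiled object. The date pattern and sample sentence are hypothetical. The count argument, which is not discussed above, is a real optional parameter of sub that limits the number of replacements.

```python
import re

# Three groups: year, month, day.
date = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
text = "Released 2011-03-14, updated 2012-07-01."  # hypothetical sample

# \3, \2 and \1 in the replacement refer back to the matched groups.
swapped = date.sub(r"\3/\2/\1", text)
print(swapped)     # Released 14/03/2011, updated 01/07/2012.

# count=1 replaces only the leftmost occurrence.
first_only = date.sub(r"\3/\2/\1", text, count=1)
print(first_only)  # Released 14/03/2011, updated 2012-07-01.

# If the pattern is not found, the string comes back unchanged.
unchanged = date.sub("X", "no dates here")
print(unchanged)   # no dates here
```

Writing the replacement as a raw string (r"...") keeps Python from interpreting the backslashes before the re module sees them.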
This concludes this partial description of regular expression methods. For more info, see the Python docs.
12.2.2. Match objects
Python match objects are what the regular expression methods
search and match return. The two most important methods of match objects are
group and groups. Both of the following descriptions are
from the Python re docs.
m.group(*groupN)
Here *groupN refers to any number of arguments, including none.
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments,
*groupN defaults to a single zero argument. If a
*groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an
IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned:
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')
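The three edge cases in the description above (a group that did not participate, a group that matched several times, and an out-of-range group number) are sketched below. The patterns and input strings are illustrative assumptions.

```python
import re

# The second group is optional and does not participate for this input.
m = re.match(r"(\w+)(?: (\w+))?", "Isaac")
first = m.group(1)
missing = m.group(2)
print(first, missing)  # Isaac None

# A group inside a repeated part keeps only its last match.
m = re.match(r"(?:(\d)-)+", "1-2-3-")
whole = m.group(0)
last = m.group(1)
print(whole, last)     # 1-2-3- 3

# A group number larger than the number of groups raises IndexError.
try:
    m.group(5)
    raised = False
except IndexError:
    raised = True
print(raised)          # True
```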
If the regular expression uses the special
(?P<name>...) syntax, the groupN arguments may also be strings identifying groups by their group name. If a string argument is not used as a group name in the pattern, an
IndexError exception is raised.
A moderately complicated example:
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'
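As a brief sketch building on the same pattern, named groups remain reachable by number, and a name the pattern does not define raises IndexError. The name middle_name is a hypothetical one chosen because it is absent from the pattern.

```python
import re

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")

# Named groups are still numbered, so both forms return the same text.
by_name = m.group("last_name")
by_number = m.group(2)
print(by_name, by_number)  # Reynolds Reynolds

# Asking for a name the pattern does not define raises IndexError.
try:
    m.group("middle_name")
    raised = False
except IndexError:
    raised = True
print(raised)              # True
```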
m.groups([default_value])
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The
default_value argument is used for groups that did not participate in the match; it defaults to None.
For example:
>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
>>> m.groups()
('24', '1632')
If we make the decimal place and everything after it optional, not all groups might participate in the match. These groups will default to
None unless the default_value argument is supplied:
>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
>>> m.groups()      # Second group defaults to None.
('24', None)
>>> m.groups('0')   # Now, the second group defaults to '0'.
('24', '0')