7.6.2. Regular expression functions and methods
In this section, we introduce some of the basic functions and methods that use regular expressions to retrieve information from files or long text strings. The last section sketched a regular expression language which, with a few variations, is used in many programming languages. This section is about how to use regular expressions in Python.
First, Python regular expression matching is not available until the re module is imported:
>>> import re
Second, while regular expressions can be given as plain strings in the RE language, for patterns that will be used over and over it is more efficient to compile the expression into a Python regular expression object. The regular expression class defines match, search, findall, and other useful methods. For example:
>>> import re
>>> regex = re.compile(r'b.+')
>>> regex.match('ba')
<_sre.SRE_Match object at 0x10047d3d8>
>>> m = regex.match('ba')
>>> m.group()
'ba'
The match method of regular expression objects returns an instance of a match object (if there is a match), which has methods of its own (like group) for retrieving the results. Calling group with no arguments returns the entire match.
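Because match returns None when there is no match, it is usual to test the result before calling group. The following is a minimal sketch of that check; the input strings are illustrative and not taken from the text above.

```python
import re

regex = re.compile(r'b.+')

# 'ba' starts with 'b' followed by at least one more character, so it matches.
m = regex.match('ba')
matched_text = m.group() if m is not None else None

# 'b' on its own fails: '.+' requires at least one character after the 'b'.
no_match = regex.match('b')

print(matched_text)  # ba
print(no_match)      # None
```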
In this section we discuss methods for regular expression objects and match objects. Taken together, these methods cover the most useful things you can do with regular expressions in Python.
7.6.2.1. Regular expression objects
Python regular expression objects are compiled regular expressions created by re.compile. In the example above:
>>> import re
>>> regex = re.compile(r'b.+')
The name regex is defined to be a Python regular expression object. Regular expression objects are then used to find patterns in strings and to manipulate strings. The five most commonly used regular expression methods are:
search
match
split
findall
sub
Each of these methods on a compiled regular expression object has a corresponding function in the re module, which takes the pattern string as its first argument. Thus, using the regular expression object regex created above we could do:
>>> regex.search('ba')
Equivalently, we could bypass the step in which we created a regular expression object and just do:
>>> re.search(r'b.+', 'ba')
The same goes for match, split, findall and sub. The difference between the two forms is efficiency: the compiled object avoids looking up the pattern on every call. The re module caches recently compiled patterns, so the gain is often small, but if you are doing many searches with the same pattern, use a regular expression object.
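As a quick check that the method form and the function form give the same results, here is a small sketch. The sample string 'abba' is an assumption chosen for illustration.

```python
import re

regex = re.compile(r'b.+')

# Method on the compiled regular expression object.
m1 = regex.search('abba')

# Module-level function: the pattern string is passed as the first argument.
m2 = re.search(r'b.+', 'abba')

print(m1.group())  # bba
print(m2.group())  # bba

# The same correspondence holds for findall (and for match, split, and sub).
same_findall = regex.findall('abba') == re.findall(r'b.+', 'abba')
print(same_findall)  # True
```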
The following descriptions are taken from the Python docs (re module). In each case pattern refers to the regular expression object which calls the method.
search(string)
Scan through string looking for a location where this regular expression produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
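The distinction between "no match" and "a zero-length match" can be seen in a short sketch. The patterns x* and x+ are illustrative choices, not examples from the Python docs.

```python
import re

# 'x*' may match zero characters, so search succeeds at position 0
# and returns a match object whose text is the empty string.
empty = re.compile(r'x*').search('abc')
print(empty is not None)     # True
print(repr(empty.group()))   # ''

# 'x+' needs at least one 'x'; no position in 'abc' matches, so the result is None.
missing = re.compile(r'x+').search('abc')
print(missing)               # None
```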
match(string)
If zero or more characters at the beginning of string match this regular expression, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match:
>>> pattern = re.compile("o")
>>> pattern.match("dog")   # No match as "o" is not at the start of "dog".
If you want to locate a match anywhere in string, use search() instead (see also the "search() vs. match()" section of the Python docs).
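To make the contrast concrete, the sketch below applies both methods to the same strings. The second string 'ox' is an added illustration.

```python
import re

pattern = re.compile(r'o')

# match only tries the beginning of the string.
at_start = pattern.match('dog')     # None: 'o' is not the first character
# search scans forward until it finds a position that matches.
anywhere = pattern.search('dog')    # match object for the 'o' at index 1

print(at_start)                     # None
print(anywhere.start())             # 1

# When the pattern does occur at the start, match succeeds as well.
print(pattern.match('ox').group())  # o
```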
split(string, maxsplit=0)
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in pattern is also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
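The three behaviors described for split (plain splitting, capturing parentheses, and maxsplit) can be sketched as follows. The sample text and the comma-separator patterns are assumptions made for illustration.

```python
import re

text = 'one, two,three ,four'

# No capturing group: only the pieces between separators are returned.
plain = re.compile(r'\s*,\s*').split(text)
print(plain)      # ['one', 'two', 'three', 'four']

# Capturing group around the comma: the captured text is returned too.
with_seps = re.compile(r'\s*(,)\s*').split(text)
print(with_seps)  # ['one', ',', 'two', ',', 'three', ',', 'four']

# maxsplit=1: split once, and keep the remainder as the final element.
limited = re.compile(r'\s*,\s*').split(text, maxsplit=1)
print(limited)    # ['one', 'two,three ,four']
```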
findall(string)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in pattern, return a list of groups; this will be a list of tuples if pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
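How the shape of the findall result depends on the number of groups is easiest to see side by side. This is a small sketch; the sample text and patterns are illustrative assumptions.

```python
import re

text = 'x=1, y=22, z=333'

# No groups: a list of the whole matches.
whole = re.compile(r'\w=\d+').findall(text)
print(whole)  # ['x=1', 'y=22', 'z=333']

# One group: a list of the text matched by that group.
one = re.compile(r'\w=(\d+)').findall(text)
print(one)    # ['1', '22', '333']

# Two groups: a list of tuples, one tuple per match.
two = re.compile(r'(\w)=(\d+)').findall(text)
print(two)    # [('x', '1'), ('y', '22'), ('z', '333')]
```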
sub(repl, string)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. Any backslash escapes in repl are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern. For example:
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'def my_\1 ():',
...        'def hello_world( ):')
'def my_hello_world ():'
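The same idea works with the method form on a compiled object. The sketch below uses backreferences to reorder the parts of a date; the date pattern and sample sentences are assumptions chosen for illustration.

```python
import re

# Three groups: year, month, day.
date = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# \3, \2 and \1 in the replacement refer back to the day, month and year groups.
result = date.sub(r'\3/\2/\1', 'Released 2020-01-31 and 2021-12-05')
print(result)     # Released 31/01/2020 and 05/12/2021

# If the pattern is not found, the string comes back unchanged.
unchanged = date.sub(r'\3/\2/\1', 'no dates here')
print(unchanged)  # no dates here
```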
This concludes our partial description of regular expression methods. For more information, see the Python docs.
7.6.2.2. Match objects
Python match objects are what the regular expression methods search and match return. The two most important methods of match objects are group and groups. Both of the following descriptions are from the Python re docs.
group(*groupN)
*groupN refers to any number of arguments, including none. Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, *groupN defaults to a single zero argument. If a *groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned:
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')
If the regular expression uses the special (?P<name>...) syntax, the groupN arguments may also be strings identifying groups by their group name. If a string argument is not used as a group name in the pattern, an IndexError exception is raised.
A moderately complicated example:
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'
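Three of the rules above have no example in the quoted description: a repeated group reports its last match, a group that did not participate yields None, and an undefined group number raises IndexError. The following sketch illustrates them with patterns and inputs chosen purely for that purpose.

```python
import re

# A group that matches several times reports only its last match.
m = re.match(r'(\w\d)+', 'a1b2c3')
print(m.group(0))   # a1b2c3
print(m.group(1))   # c3

# A group in a part of the pattern that did not match yields None.
m2 = re.match(r'(a)|(b)', 'b')
print(m2.group(1))  # None
print(m2.group(2))  # b

# Asking for a group number the pattern does not define raises IndexError.
try:
    m2.group(3)
    raised = False
except IndexError:
    raised = True
print(raised)       # True
```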
groups(default=None)
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.
For example:
>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
>>> m.groups()
('24', '1632')
If we make the decimal place and everything after it optional, not all groups might participate in the match. These groups will default to None unless a default argument is supplied:
>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
>>> m.groups()      # Second group defaults to None.
('24', None)
>>> m.groups('0')   # Now, the second group defaults to '0'.
('24', '0')