7.6.3. Regular expression examples
In this section, we give some examples of the uses of regular expressions. These examples are discussed in more detail in the IPython regular expression notebook.
7.6.3.1. Crossword puzzle solving
The following example is adapted from the NLTK Book, Ch. 3.
Let’s say we’re in the midst of doing a crossword puzzle and we need an 8-letter word meaning “sad” whose third letter is j and whose sixth letter is t. We want words that match the following regular expression pattern:
'^..j..t..$'
Notice that this pattern specifies a string of exactly 8 characters because of the ^ and the $, which mark the beginning and end of the string, respectively. Each . is a wildcard that matches any single character (except a newline).
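To see what the anchors contribute, here is a minimal sketch comparing the pattern with and without ^ and $. The word 'subjection' is just an illustrative test string chosen for this sketch; it is not taken from the NLTK example.

```python
import re

anchored = re.compile(r'^..j..t..$')   # the pattern from the text
unanchored = re.compile(r'..j..t..')   # same pattern without ^ and $

# 'subjection' has 10 letters, but it contains an 8-letter stretch
# whose third letter is j and whose sixth letter is t.
print(bool(unanchored.search('subjection')))  # True
print(bool(anchored.search('subjection')))    # False: the whole word must fit
print(bool(anchored.search('dejected')))      # True
```

Without the anchors, re.search is free to match anywhere inside a longer word, which is not what a crossword slot calls for.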
To use this pattern, let’s fetch a big list of known words and search it:
>>> import re
>>> from nltk.corpus import words
>>> wds = words.words()
>>> len(wds)
235786
>>> cands = [w for w in wds if re.search('^..j..t..$',w)]
>>> cands
['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic',
'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted',
'unjustly']
And now we check our list and there it is: dejected. Will you ever be stumped by a crossword puzzle again?
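The same idea can be wrapped in a small helper. The sketch below is one possible way to do it, not part of the NLTK Book example: the function names crossword_pattern and candidates, the underscore convention for blanks, and the short sample list are all assumptions made for illustration. In practice the word list would be wds from above.

```python
import re

def crossword_pattern(template, blank='_'):
    """Turn a template such as '__j__t__' into an anchored regex string."""
    # Blanks become the wildcard '.'; known letters are escaped literally.
    body = ''.join('.' if ch == blank else re.escape(ch) for ch in template)
    return '^' + body + '$'

def candidates(template, word_list):
    """Return the words in word_list that fit the template."""
    regex = re.compile(crossword_pattern(template))
    return [w for w in word_list if regex.search(w)]

# A tiny hypothetical word list standing in for the NLTK corpus.
sample = ['dejected', 'majestic', 'subjection', 'rejects', 'objector']
print(crossword_pattern('__j__t__'))   # ^..j..t..$
print(candidates('__j__t__', sample))  # ['dejected', 'majestic', 'objector']
```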
7.6.3.2. Textonyms (requires NLTK)
The NLTK Book, Ch. 3 introduces the concept of a textonym, defined below. To understand the definition, it may help to refer to a picture of a cellphone keypad.
The T9 system is used for entering text on mobile phones. Two or more words that are entered with the same sequence of keystrokes are known as textonyms. For example, both hole and golf are entered by pressing the sequence 4653. What other words could be produced with the same sequence?

Here we use the regular expression '^[ghi][mno][jlk][def]$':
>>> [w for w in wds if re.search('^[ghi][mno][jlk][def]$', w)]
['gold', 'golf', 'hold', 'hole']
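Rather than writing the character classes by hand, the pattern can be built from the digit sequence. The following is a minimal sketch under some assumptions: KEYPAD encodes the standard phone keypad layout, and textonym_pattern, textonyms, and the sample list are hypothetical names and data introduced here. The class for key 5 comes out as [jkl] instead of [jlk], which matches the same letters.

```python
import re

# Letters on each key of a standard phone keypad (assumed layout).
KEYPAD = {
    '2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
    '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz',
}

def textonym_pattern(digits):
    """Build an anchored regex with one character class per keypress."""
    return '^' + ''.join('[%s]' % KEYPAD[d] for d in digits) + '$'

def textonyms(digits, word_list):
    """Return the words in word_list typed by the given key sequence."""
    regex = re.compile(textonym_pattern(digits))
    return [w for w in word_list if regex.search(w)]

# A small stand-in for the NLTK word list.
sample = ['gold', 'golf', 'hold', 'hole', 'home', 'good']
print(textonym_pattern('4653'))   # ^[ghi][mno][jkl][def]$
print(textonyms('4653', sample))  # ['gold', 'golf', 'hold', 'hole']
```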
Try the following: find all words that can be spelled out with the sequence 3456.
7.6.3.3. Matching card hands (from the Python library page for REs)
In this example, we’ll use the following helper function to display match objects a little more gracefully:
def displaymatch(regex, text):
    # Print text, then the same text with the matched part in brackets (or None).
    match = regex.match(text)
    if match is None:
        matchstring = None
    else:
        matchstring = '%s[%s]%s' % (text[:match.start()],
                                    text[match.start():match.end()],
                                    text[match.end():])
    print('%-10s %s' % (text, matchstring))
Suppose you are writing a poker program where a player’s hand is represented as a 5-character string with each character representing a card, “a” for ace, “k” for king, “q” for queen, “j” for jack, “t” for 10, and “2” through “9” representing the card with that value.
To see if a given string is a valid hand, one could do the following:
>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid, "akt5q")
akt5q      [akt5q]
>>> displaymatch(valid, "akt5e")
akt5e      None
>>> displaymatch(valid, "akt")
akt        None
>>> displaymatch(valid, "727ak")
727ak      [727ak]
That last hand, “727ak”, contained a pair, or two of the same valued cards. To match this with a regular expression, one could use backreferences. Each time a pair of parentheses is used in a regular expression, a group is created: a subpart of the match that is made available for later reference. Backreferences to such groups are written \integer, where integer is the number of the group. For example:
>>> pair = re.compile(r".*(.).*\1.*")
This specifies a string in which one character is repeated. The first occurrence of the character matches (.) and the second matches \1. Only an exact repetition of the character in parentheses matches the \1.
>>> displaymatch(pair, "717ak") # Pair of 7s.
717ak      [717ak]
>>> displaymatch(pair, "718ak") # No pairs, no match.
718ak      None
>>> displaymatch(pair, "354aa") # Pair of aces.
354aa      [354aa]
To find out what card the pair consists of, one could use the groups() method of the match object in the following manner:
>>> pair.match("717ak").groups()[0]
'7'
>>> pair.match("354aa").groups()[0]
'a'
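Groups can also be given names, which some readers find easier to follow than numbers. The sketch below rewrites the pair pattern with Python's named-group syntax, (?P<name>...) for the group and (?P=name) for the backreference. The names pair_named and rank are chosen here for illustration and do not appear in the original example.

```python
import re

# Same idea as r".*(.).*\1.*", but the captured card is called "rank".
pair_named = re.compile(r".*(?P<rank>.).*(?P=rank).*")

for hand in ["717ak", "354aa"]:
    m = pair_named.match(hand)
    # Both hands contain a pair, so m is a match object here.
    print(hand, m.group("rank"))
```

This prints 7 for the first hand and a for the second, the same cards that groups()[0] returns above.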
Code that works with match objects must always check for success before attempting to retrieve results. The next example generates an error because pair.match() returns None, which doesn’t have a groups method:
>>> result = pair.match("718ak").groups()[0]
Traceback (most recent call last):
...
AttributeError: 'NoneType' object has no attribute 'groups'
This needs to be written as something like:
>>> m = pair.match("718ak")
>>> if m:
...     result = m.groups()[0]
... else:
...     result = None
The if test fails when m is None, so the groups method is not called and result is set to None instead.
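One way to avoid repeating this check is to put it inside a small function. The following is a sketch only; pair_rank is a hypothetical helper name introduced here, and it reuses the pair pattern from above.

```python
import re

pair = re.compile(r".*(.).*\1.*")

def pair_rank(hand):
    """Return the repeated card in hand, or None if there is no pair."""
    m = pair.match(hand)
    if m is None:
        # No match object, so there is nothing to call group() on.
        return None
    return m.group(1)

for hand in ["717ak", "718ak", "354aa"]:
    print(hand, pair_rank(hand))
```

Callers then only have to compare the return value against None, and no AttributeError can escape from the helper.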