7.6.3. Regular expression examples

In this section, we give some examples of the uses of regular expressions. These examples are discussed in more detail in the IPython regular expression notebook.

7.6.3.1. Crossword puzzle solving

The following example is adapted from NLTK Book, Ch. 3

Let’s say we’re in the middle of a crossword puzzle and need an 8-letter word meaning sad, whose third letter is j and whose sixth letter is t. We want words that match the following regular expression pattern:

'^..j..t..$'

Notice that this pattern specifies a string of exactly 8 characters: the ^ and the $ mark the beginning and end of the string, respectively, and each . is a wildcard that matches any single character.
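
To see why the anchors matter, here is a minimal sketch comparing the anchored pattern with an unanchored version. The name unanchored and the test string 'undejectedly' are made up for illustration; they are not part of the original example.

```python
import re

anchored = re.compile(r'^..j..t..$')
unanchored = re.compile(r'..j..t..')   # illustrative: same pattern without ^ and $

def matches(pattern, text):
    """Return True if pattern is found anywhere in text."""
    return pattern.search(text) is not None

# 'dejected' is exactly 8 characters with j third and t sixth.
exact = matches(anchored, 'dejected')
# 'undejectedly' contains such a stretch but is longer than 8 characters.
longer_anchored = matches(anchored, 'undejectedly')
longer_unanchored = matches(unanchored, 'undejectedly')
print(exact, longer_anchored, longer_unanchored)
```

Without the anchors, the pattern would accept any word that merely contains a suitable 8-character stretch, which is not what a crossword slot requires.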

To use this, let’s fetch a big list of known words and search it:

>>> import re
>>> from nltk.corpus import words
>>> wds = words.words()
>>> len(wds)
235786
>>> cands = [w for w in wds if re.search('^..j..t..$',w)]
>>> cands
['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic',
 'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted',
 'unjustly']

And now we check our list and there it is: dejected. Will you ever be stumped by a crossword puzzle again?
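
The same idea can be packaged for other puzzles. The following is a rough sketch, not part of the NLTK example: crossword_pattern is a hypothetical helper that builds the anchored pattern from a word length and the known letters, and sample is a small hand-picked word list used so the code runs without NLTK.

```python
import re

def crossword_pattern(length, known):
    """Build an anchored pattern; `known` maps 1-based positions to letters."""
    cells = []
    for position in range(1, length + 1):
        letter = known.get(position)
        # Known squares become literal letters, empty squares become wildcards.
        cells.append(re.escape(letter) if letter else '.')
    return '^' + ''.join(cells) + '$'

pattern = crossword_pattern(8, {3: 'j', 6: 't'})

# Hand-picked sample; with NLTK installed, words.words() could be used instead.
sample = ['abjectly', 'dejected', 'majestic', 'subject', 'rejection']
cands = [w for w in sample if re.search(pattern, w)]
print(pattern)
print(cands)
```

Here 'subject' and 'rejection' are rejected because they have 7 and 9 letters, so the anchors rule them out.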

7.6.3.2. Textonyms (requires NLTK)

The NLTK Book, Ch. 3 introduces the concept of a textonym, defined below. To understand the definition, it may help to refer to a picture of a cellphone keypad.

The T9 system is used for entering text on mobile phones: Two or more words that are entered with the same sequence of keystrokes are known as textonyms. For example, both hole and golf are entered by pressing the sequence 4653. What other words could be produced with the same sequence?

[Figure: T9 cellphone keypad]

Here we use the regular expression '^[ghi][mno][jlk][def]$':

>>> [w for w in wds if re.search('^[ghi][mno][jlk][def]$', w)]
['gold', 'golf', 'hold', 'hole']
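
Each character class in the pattern lists the letters on one key. As a sketch of how such a pattern might be built for any key sequence, the code below assumes the standard keypad layout; T9_KEYS, t9_pattern, textonyms, and sample_words are hypothetical names introduced for illustration, and the small word list stands in for the NLTK list.

```python
import re

# Standard phone keypad letters; keys 0 and 1 carry no letters.
T9_KEYS = {
    '2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
    '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz',
}

def t9_pattern(digits):
    """Return an anchored regex with one character class per key pressed."""
    return '^' + ''.join('[%s]' % T9_KEYS[d] for d in digits) + '$'

def textonyms(digits, wordlist):
    """Return the words in wordlist that are typed with the given keys."""
    pattern = re.compile(t9_pattern(digits))
    return [w for w in wordlist if pattern.search(w)]

sample_words = ['gold', 'golf', 'hold', 'hole', 'home', 'good']
found = textonyms('4653', sample_words)
print(t9_pattern('4653'))
print(found)
```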

Try the following. Find all words that can be spelled out with the sequence 3456.

7.6.3.3. Matching card hands (from the Python library page for REs)

In this example, we’ll use the following helper function to display match objects a little more gracefully:

def displaymatch(regex, text):
    # Print text, then the same text with the matched part in square brackets
    # (or None if there is no match).
    match = regex.match(text)
    if match is None:
        matchstring = None
    else:
        matchstring = '%s[%s]%s' % (text[:match.start()],
                                    text[match.start():match.end()],
                                    text[match.end():])
    print('%-10s %s' % (text, matchstring))

Suppose you are writing a poker program where a player’s hand is represented as a 5-character string with each character representing a card, “a” for ace, “k” for king, “q” for queen, “j” for jack, “t” for 10, and “2” through “9” representing the card with that value.

To see if a given string is a valid hand, one could do the following:

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid, "akt5q")
akt5q   [akt5q]
>>> displaymatch(valid, "akt5e")
akt5e   None
>>> displaymatch(valid, "akt")
akt     None
>>> displaymatch(valid, "727ak")
727ak   [727ak]

That last hand, “727ak”, contains a pair, that is, two cards of the same value. To match this with a regular expression, one could use backreferences. Each parenthesized subexpression in a regular expression creates a group, a subpart of the match that is made available for later reference. A backreference to such a group is written \integer, where integer is the number of the group. For example:

>>> pair = re.compile(r".*(.).*\1.*")

This specifies a string in which one character is repeated. The first occurrence of the character matches (.) and the second matches \1. Only an exact repetition of the character in parentheses matches the \1.

>>> displaymatch(pair, "717ak")     # Pair of 7s.
717ak   [717ak]
>>> displaymatch(pair, "718ak")     # No pairs, no match.
718ak   None
>>> displaymatch(pair, "354aa")     # Pair of aces.
354aa   [354aa]
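
To make the role of \1 concrete, the following sketch contrasts the pair pattern with a variant that uses a second, independent group instead of a backreference. The names any_two and has_match are made up for this illustration.

```python
import re

pair = re.compile(r".*(.).*\1.*")       # pattern from the text
any_two = re.compile(r".*(.).*(.).*")   # illustrative: two independent groups

def has_match(regex, hand):
    """Return True if regex matches starting at the beginning of hand."""
    return regex.match(hand) is not None

no_pair_backref = has_match(pair, "718ak")     # \1 needs a repeated character
no_pair_groups = has_match(any_two, "718ak")   # any two characters will do
print(no_pair_backref, no_pair_groups)
```

The second pattern accepts "718ak" because its two groups may capture different characters, whereas \1 must repeat exactly what group 1 captured.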

To find out what card the pair consists of, one could use the groups() method of MatchObject in the following manner:

>>> pair.match("717ak").groups()[0]
'7'

>>> pair.match("354aa").groups()[0]
'a'

Code that works with match objects must always check for success before attempting to retrieve results. The next example generates an error because pair.match() returns None, which doesn’t have a groups method:

>>> result = pair.match("718ak").groups()[0]
Traceback (most recent call last):
    ...
AttributeError: 'NoneType' object has no attribute 'groups'

This needs to be written something like:

>>> m = pair.match("718ak")
>>> if m:
...     result = m.groups()[0]
... else:
...     result = None

The if test fails when m is None, so the groups method is never called on it.
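
One way to keep this check in a single place is to wrap it in a small function. This is only a sketch; pair_card is a hypothetical helper name, not something from the Python library page.

```python
import re

pair = re.compile(r".*(.).*\1.*")

def pair_card(hand):
    """Return the repeated card in hand, or None if the hand has no pair."""
    m = pair.match(hand)
    if m is None:
        # No match: there is no match object to query, so stop here.
        return None
    return m.group(1)

print(pair_card("717ak"))
print(pair_card("718ak"))
print(pair_card("354aa"))
```

Callers can then test the return value against None instead of handling an AttributeError.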