7.4. Putting it all together

The following code examples put a lot of the concepts we have been using together and introduce some routine tasks of text processing:

1  import nltk
2  from collections import Counter
3
4  def word_freqs_file(filename):
5      with open(filename, 'r') as fh:
6          text = fh.read()
7      # Tokenize the resulting string using English orthography conventions
8      # and return a Counter for the word freqs
9      return Counter(nltk.word_tokenize(text))

Covering some of the key steps:

  1. Line 1: We import a new module, nltk, short for Natural Language Toolkit. This module contains a whole host of tools for processing text and language.
  2. Line 2: We import the Counter class, discussed in Dictionaries, as a way of recording word counts.
  3. Lines 4-9: We define a function word_freqs_file which opens and reads a file and returns the word frequencies for that file.
  4. Lines 5-6: Open the file and read it in as one long string with the read method on file handles, storing it in the variable text.
  5. Line 9: This single line calls the nltk function word_tokenize. This tokenizer does a much better job of accurately breaking the string text into words than text.split() would. The result is a list of words, and it is this list we pass to the Counter constructor to get the word frequencies for the file.
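Step 5 is worth dwelling on. A quick experiment with Counter shows why plain str.split is not good enough: punctuation stays attached to the words, which is exactly the problem nltk.word_tokenize solves.

```python
from collections import Counter

text = "The cat sat. The cat ran!"
naive = Counter(text.split())

# With str.split, punctuation clings to the tokens, so "sat." and
# "ran!" are counted instead of "sat" and "ran".
print(naive["sat."])  # 1
print(naive["sat"])   # 0
print(naive["cat"])   # 2
```

A real tokenizer would return "sat" and "." as separate tokens, so the counts reflect the words themselves.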

The following variant works on web pages instead of files. It uses an nltk function called clean_url to download the web page as a text string. We will look more closely at what is going on here in the chapter Introduction to web-crawling in Python; for now the main point of this example is that we don’t need to know all the details. What we need to know is what the function requires as input (a web address, or URL) and what it returns (a long string containing all the text content of the web page, with the HTML formatting markup stripped away). This is exactly what is meant by hiding complexity, one of the main motivations for bundling code up into functions and modules:

1  import nltk
2  from collections import Counter
3
4  def word_freqs_webpage(url):
5      # Download the web page and strip away all HTML
6      text = nltk.clean_url(url)
7      # Tokenize the resulting string using English orthography conventions
8      # and return a Counter for the word freqs
9      return Counter(nltk.word_tokenize(text))
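One caveat: clean_url is not available in NLTK 3 and later. If it is missing, a minimal stand-in built from the standard library might look like the sketch below. It is only a rough approximation of what clean_url did: it drops tags but keeps everything else, including the contents of any script or style elements.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, dropping the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    return parser.get_text()

# Usage (requires network access):
# html = urlopen("http://example.com").read().decode("utf-8")
# text = strip_tags(html)
```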

Notice that the two functions word_freqs_file and word_freqs_webpage share a line of code, line 9 above. This suggests that the line could be bundled up into a reusable function. The right organization for these lines of code is something like:

 1  import nltk
 2  from collections import Counter
 3
 4  def word_freqs_webpage(url):
 5      # Download the web page, strip away all HTML,
 6      # and return the frequency distribution
 7      return get_freq_dist(nltk.clean_url(url))
 8
 9  def word_freqs_file(filename):
10      with open(filename, 'r') as fh:
11          return get_freq_dist(fh.read())
12
13  def get_freq_dist(text):
14      # Tokenize the string using English orthography conventions
15      # and return the counts
16      return Counter(nltk.word_tokenize(text))

We have abstracted get_freq_dist out as a function of its own, which takes in a text string and returns a frequency distribution. This is a useful piece of abstraction because a text string can come from many different kinds of sources, from the web to files to user input to a graphical user interface, and this function does not need any information about where the string came from to operate correctly.
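To illustrate the point, here is a sketch of a third source feeding the same counting function: a string already in memory, as might come from user input. The sketch uses str.split as a stand-in tokenizer so that it runs without nltk; the chapter's version fixes the tokenizer to nltk.word_tokenize.

```python
from collections import Counter

def get_freq_dist(text, tokenize=str.split):
    # Stand-in version of the chapter's function; the tokenize
    # parameter is a hypothetical extension for illustration.
    return Counter(tokenize(text))

def word_freqs_string(text):
    # A third source of text: a string already in memory,
    # e.g. collected from user input.
    return get_freq_dist(text)

freqs = word_freqs_string("to be or not to be")
# freqs["to"] == 2 and freqs["be"] == 2
```

Because get_freq_dist knows nothing about where its string came from, word_freqs_string needed no changes to the counting logic at all.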

7.4.1. Summary

  1. Text needs to be properly tokenized if we are going to do some kind of statistics or modeling that depends on the words in it. The nltk module provides a good text tokenizer.

  2. NLTK also provides tools for downloading web pages and converting them to text, stripping away HTML. There are some downsides to doing things this way, to be covered when we look at web scraping in more detail in the chapter Introduction to web-crawling in Python.

  3. Text processing breaks down into certain natural reusable steps, and good reusable code should fit into a natural text-processing pipeline. In this section we looked at the following steps.

    [Figure: the text processing pipeline (text_process1.png)]
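The steps in the figure can be sketched as a chain of small functions, one per stage, each independent of the others (str.split again standing in for nltk.word_tokenize):

```python
from collections import Counter

def acquire(source):
    # Stage 1: get a text string. Here the source is already a string;
    # in this section it was a file or a web page.
    return source

def tokenize(text):
    # Stage 2: break the string into words
    # (stand-in for nltk.word_tokenize).
    return text.split()

def count(tokens):
    # Stage 3: tally the word frequencies.
    return Counter(tokens)

def pipeline(source):
    return count(tokenize(acquire(source)))

freqs = pipeline("a rose is a rose")
# freqs["rose"] == 2
```

Swapping in a different acquire or tokenize stage changes nothing downstream, which is what makes the pipeline reusable.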