5.9. Putting it all together

In this section we work through examples that combine some of the concepts we’ve been discussing into complete functions or programs.

5.9.1. Example One: Word counting

The following code puts a lot of the concepts we have been using together.

Let’s assume we want a program that does the following: it opens a text file, counts the words in the file, and prints out the 10 most common words. Let’s say we want to call the program like this:

% python count_words.py pride_and_prejudice.txt

Here pride_and_prejudice.txt is the name of the file whose words we want to count.

Then we create a Python file named count_words.py and put the following code into it:

from collections import Counter
import sys

def word_freqs(filename):
    freqdist = Counter()
    ## Open the file; the with block closes it when we are done
    with open(filename, 'r') as filehandle:
        ## Loop through each line and in each line loop
        ## through each word
        for line in filehandle:
            line_list = line.split()
            for word in line_list:
                ## tabulate all word counts
                freqdist[word] += 1
    print(freqdist.most_common(10))
    return freqdist


if __name__ == '__main__':
    if len(sys.argv) == 2:
        filename = sys.argv[1]
        word_freqs(filename)

Covering this line by line (each item is numbered with the code line it describes):

  1. We import from a new module, collections, to make Counters available.
  5. We create a counter.
  7. We open the file.
  10. We loop through it line by line.
  11. We split the current line into a list of words.
  12. We loop through the words in the current line.
  14. We update the count of the current word by one.
  15. Having exited the loops (look at the indentation), we print out some information from the counter.
  16. We return the counter.
  19. We enter the part of the program that is executed only when the file is run as the top-level script (that is, when it isn’t being imported by some other program).
  21. In that case there will be a list of command line arguments (all strings) in sys.argv. The first is the program name. The second is the filename the user supplied when calling the program. We set the variable filename to the second command line argument.
  22. We call the function word_freqs on filename. It counts the word frequencies and prints out the top 10 words.
  23. That’s all. The script (and Python) will now exit.
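
To see the counting step in isolation, here is a minimal sketch that runs the same two loops over a short in-memory string rather than a file. The sample text and the helper name count_words_in_text are made up for illustration. It shows why the statement freqdist[word] += 1 needs no initialization: a Counter reports 0 for a word it has not seen yet.

```python
from collections import Counter

def count_words_in_text(text):
    # Hypothetical helper: same loops as word_freqs, but over a string
    freqdist = Counter()
    for line in text.splitlines():
        for word in line.split():
            # A missing key counts as 0, so += 1 works the first time too
            freqdist[word] += 1
    return freqdist

sample = "the cat saw the dog\nthe dog ran"
counts = count_words_in_text(sample)
print(counts['the'])          # 3
print(counts['unicorn'])      # 0, and no KeyError
print(counts.most_common(2))  # [('the', 3), ('dog', 2)]
```

Note that split() separates on whitespace only, so 'dog' and 'dog.' are counted as different words, as are 'The' and 'the'.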

5.9.2. Example Two: Precision and recall

Let’s use what we know about Booleans to perform a simple data analysis task.

Suppose we have a system that predicts something, say, whether it will be sunny today, and let’s furthermore assume that we assembled a list of such predictions, together with the actual outcome of what was predicted. When the prediction and the outcome are the same, we have a successful prediction. When they differ, we have a failed prediction. So:

data = [('sunny','sunny'), ('sunny','cloudy'), ('sunny', 'rainy'), ('rainy','rainy')]

records a sequence of four (prediction, outcome) pairs, of which two are failures.

Three important scores for evaluating our performance are called accuracy, precision, and recall. For the above data, assuming that the task is predicting sunny days, we’ll call a ‘sunny’ day ‘positive’ and anything else ‘negative’. Putting all our system’s positive predictions in the first row and all the negative predictions in the second, we can tabulate our prediction success with a contingency table as follows:

Pred/Outcome   Pos      Neg
Pos            1 (tp)   2 (fp)
Neg            0 (fn)   1 (tn)
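
The four cells can also be tallied directly with Booleans. The following is a minimal sketch, assuming each pair is (prediction, outcome) and that 'sunny' is the positive class; the name cells is ours, chosen for illustration.

```python
from collections import Counter

data = [('sunny', 'sunny'), ('sunny', 'cloudy'),
        ('sunny', 'rainy'), ('rainy', 'rainy')]

# Key each pair by two Booleans: (predicted positive?, actually positive?)
cells = Counter()
for (predicted, actual) in data:
    cells[(predicted == 'sunny', actual == 'sunny')] += 1

tp = cells[(True, True)]    # predicted sunny, was sunny
fp = cells[(True, False)]   # predicted sunny, was not sunny
fn = cells[(False, True)]   # predicted not sunny, was sunny
tn = cells[(False, False)]  # predicted not sunny, was not sunny
print((tp, fp, fn, tn))     # (1, 2, 0, 1)
```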

Let N be the size of the dataset, tp and fp be the numbers of true and false positives respectively, and tn and fn be the numbers of true and false negatives respectively. Accuracy is the proportion of correct answers out of the total dataset (\frac{tp + tn}{N}). In this case the total dataset has size 4 and 2 items are correctly classified, so the accuracy is .5. Precision is the proportion of true positives out of all positive guesses the system made (\frac{tp}{tp + fp}), so in this case precision is 1/3, or about .33. Recall is the proportion of true positives out of actual positives (\frac{tp}{tp + fn}). In this case there was one actually sunny day out of the 4, and we predicted it, so recall is 1.0.
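
As a quick check of the arithmetic, the three scores can be computed directly from the cell counts. This is only a sketch; the variable names follow the definitions above, and the counts are the ones read off the sample data.

```python
# Cell counts for the sample data
tp, fp, fn, tn = 1, 2, 0, 1
N = tp + fp + fn + tn

accuracy = float(tp + tn) / N      # (1 + 1) / 4
precision = float(tp) / (tp + fp)  # 1 / (1 + 2)
recall = float(tp) / (tp + fn)     # 1 / (1 + 0)

print('{0:.2f} {1:.2f} {2:.2f}'.format(accuracy, precision, recall))
# prints: 0.50 0.33 1.00
```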

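def do_evaluation(pairs, pos_cls='pos'):
    (N, tp, tn, fp, fn) = (0, 0, 0, 0, 0)
    for (predicted, actual) in pairs:
        N += 1
        if predicted == actual:
            if actual == pos_cls:
                tp += 1  # correctly predicted positive
            else:
                tn += 1  # correctly predicted negative
        else:
            if actual == pos_cls:
                fn += 1  # missed an actual positive
            elif predicted == pos_cls:  # a mix-up of two negative labels is not a false positive
                fp += 1  # predicted positive, actually negative
    (accuracy, precision, recall) = (float(tp + tn)/N, float(tp)/(tp + fp), float(tp)/(tp + fn))
    return (accuracy, precision, recall)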

Considering this line by line (each item is numbered with the code line it describes):

  1. The function takes two arguments. The second, pos_cls, names the positive class (the one whose presence we’re testing for) and defaults to 'pos'.
  2. We set a number of variables in one line by assigning to a tuple of variables. We have to keep counts of the 4 cells in the contingency table, as well as the total number of items.
  8. Notice that the indentation is crucial. The else on line 8 has to belong to the if on line 6 (if actual == pos_cls), not to the if on line 5 (if predicted == actual), and the indentation shows that.
  10. Again, the indentation is crucial. The else on line 10 belongs to the if on line 5, not to the if on line 6, and the indentation shows that.

To do our evaluation, we would do:

>>> data = [('sunny','sunny'), ('sunny','cloudy'), ('sunny', 'rainy'), ('rainy','rainy')]
>>> (acc, pre, rec) = do_evaluation(data, pos_cls='sunny')
>>> print('Acc: {0:.2f} Prec: {1:.2f} Rec: {2:.2f}'.format(acc, pre, rec))
Acc: 0.50 Prec: 0.33 Rec: 1.00
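
One caveat: do_evaluation divides by tp + fp and by tp + fn, so it raises a ZeroDivisionError when the system never predicts the positive class or when no positive outcome occurs. A minimal sketch of one way around this follows. The helper safe_ratio is hypothetical, and returning None for an undefined score is our assumption, not something the definitions above prescribe.

```python
def safe_ratio(numerator, denominator):
    # Hypothetical helper: None signals "undefined" instead of crashing
    if denominator == 0:
        return None
    return float(numerator) / denominator

# No positive predictions and no positive outcomes: tp = fp = fn = 0
(tp, fp, fn, tn) = (0, 0, 0, 3)
N = tp + fp + fn + tn

accuracy = safe_ratio(tp + tn, N)
precision = safe_ratio(tp, tp + fp)
recall = safe_ratio(tp, tp + fn)
print((accuracy, precision, recall))  # (1.0, None, None)
```

Whether an undefined precision or recall should be reported as None, 0, or something else depends on how the scores will be used, so treat this choice as a placeholder.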