# Text classification with Naive Bayes

This notebook covers text classification using Naive Bayes.  Section 2 introduces the data
and the NLTK code used to read it.  Section 3 covers training a Naive Bayes classifier and how
you can modify the classifier's behavior.  Section 4 describes what you have
to hand in to get credit for this portion of the final.

# Introduction to Data and NLTK Code

## Getting the data

For this notebook you need to have some NLTK `movie reviews` data installed.  First make sure you have nltk installed as a Python module (or the first line in the code block below won't work).  Then install the data as follows:

```
import nltk
nltk.download()
```

This pops up a separate window.  It has several tabs, one of them labeled `corpora`.  Select that tab and you will see a long alphabetically ordered list of all the corpora `NLTK` supplies.  Scroll down and select `Movie Reviews` and click on `Download`.  This will install that data on your machine, and all the code in this notebook
should work.

## Preliminaries:  Code for getting data and extracting features

In [3]:
import nltk
#nltk.download()

In [9]:

def unigram_features (words):
    """ 
    This is the simplest possible feature representation of a document.
    
    Each word is a feature.  We return a dictionary that represents the
    set of vocabulary words in the document.
    
    You can improve the performance of your classifier by being more selective
    about what features you use.
    """
    return dict((word, True) for word in words)


def extract_features (corpus, file_ids, cls, feature_extractor=unigram_features):
    """
    Turn a set of files all belonging to one class into a list
    of (feature dictionary, cls) pairs, to be used in testing or training
    a classifier.
    
    Whatever you replace extract_features with must continue to return 
    a list of feature-dictionary cls pairs.  The features in the feature dictionary
    can change, but the second member of the pair must always be the class (`cls` in the
    code below).
    """
    return [(feature_extractor(corpus.words(i)), cls) for i in file_ids]


def get_words_from_corpus (corpus, file_ids):
    """
    Yield every word of every file in file_ids, one word at a time.
    """
    for file_id in file_ids:
        words = corpus.words(file_id)
        for word in words:
            yield word



The easiest approach to the problem below is to replace `unigram_features`.  Try that first.

You need to write another function that is more selective about what words will be used
as features.  A simple replacement function might be called `restricted_unigram_features` and 
defined as below:

In [8]:
def restricted_unigram_features (words):
    return dict((word, True) for word in words if word in my_preferred_words)

You would then replace the call to `unigram_features` with a call to `restricted_unigram_features` in the 
code below.  You would of course also have to define the list `my_preferred_words`.  In order to
do that you might want to pay attention to what `most_informative_features` tells you about your baseline
system.
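As a minimal sketch of that idea, the cell below defines a small, purely illustrative `my_preferred_words` and applies `restricted_unigram_features` to a toy word list.  The particular words are only placeholders (your own list will differ), and using a `set` rather than a list is an assumption made here so that membership tests are fast.

```python
# Hypothetical word list, chosen only for illustration.
my_preferred_words = set(['outstanding', 'ludicrous', 'astounding',
                          'idiotic', 'atrocious'])

def restricted_unigram_features(words):
    # Keep only the words we have decided are worth using as features.
    return dict((word, True) for word in words if word in my_preferred_words)

example = restricted_unigram_features(
    ['an', 'outstanding', 'cast', 'in', 'a', 'ludicrous', 'plot'])
# example keeps only the keys 'outstanding' and 'ludicrous'
```

Because the function still returns a dictionary of Boolean features, it can be passed as the `feature_extractor` argument of `extract_features` without further changes.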

## Loading the data

In line 3 of the next cell, we import Bo Pang and Lillian Lee's movie reviews corpus from NLTK.  The last line, if uncommented, prints the corpus collectors' introduction to the corpus; it appears in the output cell.

In [1]:
# Using a corpus of movie review data
# 2000 positive and negative reviews, evenly balanced.
from nltk.corpus import movie_reviews as mr

# If you want to read the corpus collectors' introduction to
# this corpus, uncomment the next line.
#print mr.readme()

The movie review data is packaged up as an NLTK corpus, which 
gives us access to a number of tools for text handling.  The simplest
is that we have two views of the movie review data, word by word and character by character.

The character by character view uses the `raw` method, which returns all the data with no argument, or just the data from a single file with a file-id argument.  In the example below we return the first 100 characters from the first positive movie review.   

In [2]:
data = dict(pos = mr.fileids('pos'),
            neg = mr.fileids('neg'))
print mr.raw(data['pos'][0])[:100]

films adapted from comic books have had plenty of success , whether they're about superheroes ( batm


The word by word view uses the `words` method:

In [3]:
print mr.words(data['pos'][0])[:10]

[u'films', u'adapted', u'from', u'comic', u'books', u'have', u'had', u'plenty', u'of', u'success']


We use the word by word view when we extract features (`mr.words` in the definition of `extract_features`).

You can experiment with using other views as well.  That would mean redefining
`extract_features`.  For example, it may help to pay more attention to the first and last
paragraphs of a movie review when classifying it.  Then you would want the
`mr.paras` view.  You would have to replace `unigram_features` with another function
that expected a sequence of paragraphs instead of a sequence of words.
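A paragraph-based feature function might look like the sketch below.  The name `first_last_paragraph_features` is hypothetical, and the code assumes each document arrives as a list of paragraphs, each paragraph a list of sentences, and each sentence a list of word strings; check what `mr.paras` actually returns on your installation before relying on it.

```python
def first_last_paragraph_features(paras):
    """
    Use only the words of the first and last paragraphs as features.

    paras is assumed to be a list of paragraphs, each a list of
    sentences, each a list of word strings.
    """
    if not paras:
        return {}
    selected = [paras[0]]
    if len(paras) > 1:
        selected.append(paras[-1])
    return dict((word, True)
                for para in selected
                for sent in para
                for word in sent)

# A toy three-paragraph "review" in the assumed nested-list format.
toy_paras = [
    [['a', 'stunning', 'opening', '.']],
    [['the', 'middle', 'drags', '.'], ['badly', '.']],
    [['still', ',', 'a', 'triumph', '.']],
]
feats = first_last_paragraph_features(toy_paras)
```

To use a function like this you would also need a variant of `extract_features` that calls `corpus.paras(i)` where the current one calls `corpus.words(i)`.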

The key point is that whatever you replace `extract_features` with must return 
the kind of thing `extract_features` returns now, a list of (dictionary, class) pairs,
one for each movie review file.  The class is always `pos` or `neg` (the review was positive or
negative).  The feature dictionary contains a set of Boolean features, whose values
are `True` or `False` (those are special Python constants and case must be preserved).
So for example, the unigram feature representation of a document containing just one sentence,
`See me walk`, is:

In [11]:
feat_dict = {'walk':True, 'see':True, 'me':True}
feat_dict

{'me': True, 'see': True, 'walk': True}

If you use bigram features too it would be:

feat_dict = {'walk':True, 'see':True, 'me':True, ('see','me'):True, ('me','walk'):True}
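A feature extractor that produces this kind of dictionary might look like the sketch below.  The function name `unigram_and_bigram_features` is hypothetical; bigrams are represented as tuples of two adjacent words, as in the example above.

```python
def unigram_and_bigram_features(words):
    words = list(words)
    # Every word is a feature, exactly as in unigram_features.
    feats = dict((word, True) for word in words)
    # zip pairs each word with its successor, giving the adjacent word pairs.
    for bigram in zip(words, words[1:]):
        feats[bigram] = True
    return feats

feat_dict = unigram_and_bigram_features(['see', 'me', 'walk'])
```

Since the keys of a Python dictionary are unique, a bigram that occurs several times in a document still yields a single feature.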

# Baseline System: Setup

The code below sets up the baseline system you will use for this exercise.  It uses
the unigram feature extractor defined above.

In [14]:

from nltk.corpus import movie_reviews as mr

data = dict(pos = mr.fileids('pos'),
            neg = mr.fileids('neg'))

#######################################################################
#
#  Dividing up the data
#######################################################################

# 1000 reviews. Use 90% of the data for training.  Use 10% for test.
test_start_index = 900

neg_training = extract_features(mr, data['neg'][:test_start_index], 'neg',
                                feature_extractor=unigram_features)

# Use 10% for testing the classifier on unseen data.
neg_test = extract_features(mr, data['neg'][test_start_index:], 'neg',
                                feature_extractor=unigram_features)

pos_training = extract_features(mr, data['pos'][:test_start_index],'pos',
                                feature_extractor=unigram_features)

pos_test = extract_features(mr, data['pos'][test_start_index:],'pos',
                                feature_extractor=unigram_features)

train_set = pos_training + neg_training

test_set = pos_test + neg_test

Line 2 imports the movie review data from NLTK, and lines 4 and 5 store the two halves of the corpus in a dictionary (positive and negative reviews, 1000 of each).  The commands on 
the next few lines extract features from the data files, sorting them into positive and negative training sets and positive and negative test sets.  The training set is 90% of the data; the test set is 10% of the data.  The feature extractor used is `unigram_features`, the simple feature extractor defined earlier in this notebook.  This feature extractor just uses every word that appears in a document as a feature.  

Finally, the positive and negative training data are combined into a single training set (`train_set`), and the positive and negative test data into a single test set (`test_set`).  A Naive Bayes (NB) classifier is trained on `train_set` in a later cell.  The code above works as is and defines what you may
view as your baseline classifier.

When you want to improve your classifier, your easiest strategy is to change only one thing
in the cell above: change the feature extractor to be something other than `unigram_features`.  You
would create a new function just like `unigram_features` except that it is more selective about
what its features are.

We have now split the data into two unequal parts, each with positive and negative examples, and called the larger part `train_set` and the smaller part `test_set`.

In [16]:
len(data['neg']),len(neg_training)

(1000, 900)

## Training the classifier

Next we **train** the classifier by giving it the training set.

In [18]:
# Use a Naive Bayes Classifier
from nltk.classify import NaiveBayesClassifier

#  Train a classifier on our training data.
classifier = NaiveBayesClassifier.train(train_set)
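
To give a rough idea of what training estimates, here is a from-scratch sketch of a Naive Bayes classifier over Boolean word features.  It is not NLTK's implementation: the class name `ToyNaiveBayes`, the add-one smoothing, and the decision to score only the words present in a document are all simplifying assumptions made for illustration.

```python
import math
from collections import defaultdict

class ToyNaiveBayes(object):
    def __init__(self, train_set):
        # Number of training documents per label.
        self.label_counts = defaultdict(int)
        # For each label, the number of documents containing each word.
        self.word_counts = defaultdict(lambda: defaultdict(int))
        for feats, label in train_set:
            self.label_counts[label] += 1
            for word in feats:
                self.word_counts[label][word] += 1
        self.total = float(sum(self.label_counts.values()))

    def log_score(self, feats, label):
        n = self.label_counts[label]
        # Start from the log prior probability of the label.
        score = math.log(n / self.total)
        for word in feats:
            # Add-one smoothing keeps unseen words from giving probability 0.
            p = (self.word_counts[label][word] + 1.0) / (n + 2.0)
            # The "naive" step: each word contributes independently.
            score += math.log(p)
        return score

    def classify(self, feats):
        return max(self.label_counts,
                   key=lambda label: self.log_score(feats, label))

toy_train = [
    ({'outstanding': True, 'film': True}, 'pos'),
    ({'great': True, 'film': True}, 'pos'),
    ({'ludicrous': True, 'film': True}, 'neg'),
    ({'idiotic': True, 'plot': True}, 'neg'),
]
toy_classifier = ToyNaiveBayes(toy_train)
toy_label = toy_classifier.classify({'outstanding': True, 'film': True})
```

The training data has the same shape as `train_set`: a list of (feature dictionary, class) pairs.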

## Basic feature representation

In the next cell, we look at an example representation of a movie review, as produced
by `unigram_features`.  It is basically just a set, telling us what words occurred in the review, but not how many times they occurred.  In terms of a Python data structure, it's a dictionary whose keys are the words in the document, and whose values are all `True`.  We're not bothering to represent the words
that don't occur in the document, so there are no keys whose value is `False`.

In [9]:
test_set[0]

({u'!': True,
  u'%': True,
  u"'": True,
  u'(': True,
  u')': True,
  u',': True,
  u'.': True,
  u'100': True,
  u'13': True,
  u'1912': True,
  u'3': True,
  u':': True,
  u'?': True,
  u'a': True,
  u'about': True,
  u'above': True,
  u'accomplished': True,
  u'accuracy': True,
  u'achieve': True,
  u'across': True,
  u'acting': True,
  u'actually': True,
  u'adding': True,
  u'adds': True,
  u'admirably': True,
  u'admired': True,
  u'aft': True,
  u'ago': True,
  u'aims': True,
  u'aliens': True,
  u'all': True,
  u'almost': True,
  u'alone': True,
  u'already': True,
  u'also': True,
  u'although': True,
  u'amaze': True,
  u'amazed': True,
  u'america': True,
  u'an': True,
  u'and': True,
  u'anew': True,
  u'any': True,
  u'anything': True,
  u'are': True,
  u'aren': True,
  u'as': True,
  u'ask': True,
  u'astonishing': True,
  u'at': True,
  u'atlantic': True,
  u'attention': True,
  u'backdrop': True,
  u'be': True,
  u'beautifully': True,
  u'because': True,
  u'been': T

In [10]:
len(data['neg'])

1000

## A demonstration

In the next code cells, we demonstrate what the classifier does on the first reviews in the positive and negative training set.  The output window shows that both reviews are correctly classified by the NB classifier we just trained.

First, we pick a review.

In [12]:
def get_review_text (clf,file_id,start=0,end=None):
    words = list(mr.words(data[clf][file_id]))
    return ' '.join(words[start:end])

print get_review_text('pos',0,end=95)

print '     . . . . . . '

print get_review_text('pos',0,start=-190)

films adapted from comic books have had plenty of success , whether they ' re about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there ' s never really been a comic book like from hell before . for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid ' 80s with a 12 - part series called the watchmen .
     . . . . . . 
, but cinematographer peter deming ( don ' t say a word ) ably captures the dreariness of victorian - era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black - and - white comic . oscar winner martin childs ' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . even the acting in from hell is solid , with the dreamy depp turning in a typically strong perfor

Reading this makes it clear why movie reviews are hard to classify, especially with the approach we're taking: each word is treated as an independent feature whose presence increases or decreases the probability the review is positive.  Overall it's a positive review, but it's not easy to find single words that might give us good evidence of that,
and there are words that probably point the other way as well:

```
Pos         Neg
-----------------
ably      dreariness
flashy    violence
surprise  bad
strong    creepy
deftly    cringed
good
oscar
winner
great
```

Nevertheless, let's try this out and see how we do.

In [13]:
predicted_label0 = classifier.classify(pos_test[0][0])

print 'Predicted: %s Actual: pos' % (predicted_label0,)

Predicted: pos Actual: pos


We got it right!  Let's try a negative review.

In [14]:
predicted_label1 = classifier.classify(neg_test[0][0])

print 'Predicted: %s Actual: neg' % (predicted_label1,)

# To see the feature dictionary passed in to the classifier,
# uncomment the next line
#pos_test[0][0]

Predicted: neg Actual: neg


Right again!

Let's try the examples we cooked up before.  We need to go from
a string like `"Inception is the best movie ever"` to a feature dictionary
and pass that to the classifier.


In [15]:
classifier.classify(unigram_features('Inception is the best movie ever'.split()))

'pos'

In [16]:
classifier.classify(unigram_features("I don't know how anyone could sit through Inception".split()))

'neg'

Interesting.  We may get some insight on this one below.
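
One caveat about these hand-made examples: the corpus output shown earlier is all lowercase with punctuation split off as separate tokens (`they ' re`), whereas `.split()` leaves `Inception` capitalized and `don't` in one piece, so those tokens are unlikely to match any word the classifier saw in training.  The sketch below normalizes a string in roughly the same way.  The function name `corpus_style_tokens` and the regular expression are assumptions; they only approximate the corpus tokenization.

```python
import re

def corpus_style_tokens(text):
    # Lowercase the text, then split it into runs of word characters
    # and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = corpus_style_tokens(
    "I don't know how anyone could sit through Inception")
# tokens starts with 'i', 'don', "'", 't', 'know'
```

You could then classify `unigram_features(corpus_style_tokens(some_string))` in place of `unigram_features(some_string.split())` and compare the answers.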

## Most informative features

Here's what our classifier learned.  These are the features for which the ratio of the positive to negative probability (or vice versa) is the highest.

In [17]:

classifier.show_most_informative_features()


Most Informative Features
             outstanding = True              pos : neg    =     15.6 : 1.0
               ludicrous = True              neg : pos    =     14.2 : 1.0
              astounding = True              pos : neg    =     12.3 : 1.0
                  avoids = True              pos : neg    =     12.3 : 1.0
                 idiotic = True              neg : pos    =     11.8 : 1.0
               atrocious = True              neg : pos    =     11.7 : 1.0
             fascination = True              pos : neg    =     11.0 : 1.0
                 offbeat = True              pos : neg    =     11.0 : 1.0
               animators = True              pos : neg    =     10.3 : 1.0
                  symbol = True              pos : neg    =     10.3 : 1.0


The correct response to the display above is to say, that's all very well but I need to
see more.  Here's how.

In [19]:
classifier.show_most_informative_features(n=100)

Most Informative Features
             outstanding = True              pos : neg    =     15.6 : 1.0
               ludicrous = True              neg : pos    =     14.2 : 1.0
              astounding = True              pos : neg    =     12.3 : 1.0
                  avoids = True              pos : neg    =     12.3 : 1.0
                 idiotic = True              neg : pos    =     11.8 : 1.0
               atrocious = True              neg : pos    =     11.7 : 1.0
             fascination = True              pos : neg    =     11.0 : 1.0
                 offbeat = True              pos : neg    =     11.0 : 1.0
               animators = True              pos : neg    =     10.3 : 1.0
                  symbol = True              pos : neg    =     10.3 : 1.0
                religion = True              pos : neg    =     10.2 : 1.0
                   jolie = True              neg : pos    =      9.7 : 1.0
                  hudson = True              neg : pos    =      9.7 : 1.0

We have more features showing up as good indicators of positive, and this is a fairly good indicator of a problem with our classifier, as we'll see below.  Actually the classifier is a little too reluctant to classify reviews as negative on the test set.  This suggests that reliable negative indicators are not that common, at least on the test set.
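
To make the ratios in the display above more concrete, here is a sketch of how such a ratio can be computed from document counts.  The function name `likelihood_ratio`, the smoothing constant, and the counts are all hypothetical; NLTK's own estimate may be computed differently.

```python
def likelihood_ratio(docs_with_word, docs_total, smoothing=0.5):
    """
    docs_with_word maps each label to the number of training documents
    containing the word; docs_total maps each label to the number of
    training documents.  Returns (favored_label, ratio).
    """
    def prob(label):
        # Smoothed estimate of P(word present | label).
        return ((docs_with_word.get(label, 0) + smoothing) /
                (docs_total[label] + 2 * smoothing))
    p_pos, p_neg = prob('pos'), prob('neg')
    if p_pos >= p_neg:
        return ('pos', p_pos / p_neg)
    return ('neg', p_neg / p_pos)

# Hypothetical counts: a word found in 20 of 900 positive training
# reviews and in 1 of 900 negative ones.
label, ratio = likelihood_ratio({'pos': 20, 'neg': 1},
                                {'pos': 900, 'neg': 900})
```

Note that a word does not have to be frequent to score highly here: with these made-up counts the word appears in barely 2% of the positive reviews, yet the ratio comes out above 13 to 1 because it is so rare in the negative ones.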

## Serious evaluation

The next cell takes the first step toward testing a classifier a little more seriously.  It defines some code for evaluating classifier output.  The evaluation metrics defined are precision, recall, and accuracy.  Let $N$ be the size of the dataset, $tp$ and $fp$ be the numbers of true and false positives respectively, and $tn$ and $fn$ be the numbers of true and false negatives respectively.  Accuracy is the percentage of correct answers out of the total corpus $\left(\frac{tp+tn}{N}\right)$, precision is the percentage of true positives out of all positive guesses the system made $\left(\frac{tp}{tp + fp}\right)$,
while recall is the percentage of true positives out of all good reviews $\left(\frac{tp}{tp + fn}\right)$.

In [18]:
from sklearn.metrics import precision_score, recall_score,accuracy_score

def do_evaluation (pairs, pos_label='pos', verbose=True):
    predicted, actual = zip(*pairs)
    (precision, recall,accuracy) = (precision_score(actual,predicted,pos_label=pos_label), 
                                    recall_score(actual,predicted,pos_label=pos_label),
                                    accuracy_score(actual,predicted))
    if verbose:
        print_results(precision, recall, accuracy, pos_label)
    return (precision, recall,accuracy)

def print_results (precision, recall, accuracy, pos_label):
    banner =  'Evaluation with class label = %s' % pos_label
    print
    print banner
    print '=' * len(banner)
    print '{0:10s} {1:.1f}'.format('Precision',precision*100)
    print '{0:10s} {1:.1f}'.format('Recall',recall*100)
    print '{0:10s} {1:.1f}'.format('Accuracy',accuracy*100)


The code in the next cell actually tests our NB classifier on the entire test set and prints out the result.  Note that precision and recall give different results depending on which class we think of ourselves as detecting (which class we think of as positive).  We give evaluation numbers with respect to positive and negative reviews.  These show that our classifier actually misses a number of negative reviews because it misclassifies them as positive (recall of positive high, recall of negative low).  Thus its high recall number when `pos_label='pos'` needs to be taken with a grain of salt.  It achieves this high recall by guessing positive a lot of the time.
In fact, it guesses positive 74% of the time, even though it was trained on data that was 50% positive and 50% negative.

This fact makes it even more interesting that we correctly classified

```
I don't know how anyone could sit through Inception.
```

as negative.  In fact it turns out 'sit' is just a pretty good indicator of a negative review. It occurs 79 times in our set of 1000 negative reviews, quite often followed by 'through'.  This tells us something important.  Our intuitions aren't always good at finding good features.

So why does our classifier guess positive so often?  Well, probably because it had more success finding strong positive indicators than it did finding strong negative indicators, as our glance at the most informative features suggested.  This is something we might want to worry about as we design good classifiers.

In [19]:
pairs = [(classifier.classify(example), actual)
            for (example, actual) in test_set]

do_evaluation (pairs)
pos_guesses = [p for (p,a) in pairs if p=='pos']
pos_actual = [a for (p,a) in pairs if a=='pos']
do_evaluation (pairs, pos_label='neg')
print 'Note that {:.1%} of our classifier guesses were positive'.format(float(len(pos_guesses))/len(pairs))
print 'While {:.1%} of the reviews were actually positive'.format(float(len(pos_actual))/len(pairs))
# to see the actual pairs that came out of the test uncomment the next line
#pairs


Evaluation with class label = pos
Precision  65.5
Recall     97.0
Accuracy   73.0

Evaluation with class label = neg
Precision  94.2
Recall     49.0
Accuracy   73.0
Note that 74.0% of our classifier guesses were positive
While 50.0% of the reviews were actually positive
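
As a check on the arithmetic, the figures above are consistent with a 200-review test set (100 positive, 100 negative) in which 97 positive and 49 negative reviews are classified correctly.  These confusion counts are inferred from the reported percentages; the notebook does not print them.  The helper name `metrics` is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    # float() guards against integer division on Python 2.
    precision = float(tp) / (tp + fp)
    recall = float(tp) / (tp + fn)
    accuracy = float(tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Treating 'pos' as the class being detected.
pos_metrics = metrics(tp=97, fp=51, fn=3, tn=49)

# Treating 'neg' as the class being detected swaps the roles of the counts.
neg_metrics = metrics(tp=49, fp=3, fn=51, tn=97)

# Share of all guesses that were 'pos': true positives plus false positives.
share_guessed_pos = float(97 + 51) / 200
```

Accuracy is the same under both labelings because it only counts correct answers, while precision and recall change with the class being detected.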


#  Your task

You will generate three experimental systems, each differing from the baseline system in the feature set
that is used for classification.  At least one system must make use of information about feature value that you obtained by looking at `most_informative_features`. For each system:

1. You will replace/modify `unigram_features` or replace/modify `extract_features` and hand in hard copies of your modified code.  If your code uses a list of words or any external data, you must include that data.  Note that this is Python code and it must be turned in in readable form. Python code whose indentation and formatting has been destroyed by placing it in a Word file will not be accepted. 
2. You will include hard copy showing the results of running your modified classifier on the test set.  It should look like the printout above, showing the results of precision and recall for both positive and negative labeling.

Finally you will hand in a brief discussion comparing your three experiments and what worked and what didn't.  In discussing which classifier is better/worse you must make reference to the evaluation
numbers, precision/recall for the positive and negative classification tasks.


Extra credit:  Use information obtained from word similarity measures together with
most informative features in one system.  The way to approach this is to take high value features and let highly similar words stand in place of actually seeing the word.  So if "outstanding" is a basic feature, but you've never seen "excellent" in the training data, let "excellent" count as making the "outstanding" feature be `True` (even though you never saw it in the training data) because `excellent` has a high similarity score to `outstanding`. See the optional [Word similarity assignment.](http://gawron.sdsu.edu/compling/course_core/assignments/word2Vec_assignment.html)
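
The stand-in idea can be sketched as follows.  Everything here is hypothetical: the table `similar_to` is written by hand, whereas in your system its entries would come from a word similarity measure, and `base_features` would come from your reading of the most informative features.

```python
# Hypothetical similarity table: each key is a word judged highly
# similar to the base feature it maps to.
similar_to = {
    'excellent': 'outstanding',
    'superb': 'outstanding',
    'absurd': 'ludicrous',
}

# Hypothetical set of high value features.
base_features = set(['outstanding', 'ludicrous'])

def similarity_features(words):
    feats = {}
    for word in words:
        if word in base_features:
            feats[word] = True
        elif word in similar_to:
            # A similar word switches on the base feature it stands in for.
            feats[similar_to[word]] = True
    return feats

feats = similarity_features(['an', 'excellent', 'if', 'absurd', 'film'])
# feats has the keys 'outstanding' and 'ludicrous', although neither
# word occurs in the input
```

Like the other extractors, this one returns a dictionary of Boolean features, so it can be passed to `extract_features` as the `feature_extractor` argument.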