Maximum Entropy modeling assignment

  1. For this assignment you need a Python package that is NOT part of the standard Python distribution. It is called nltk. A Google search on the string "nltk" will direct you to the nltk home page, or you can go to:
      NLTK home page
    The home page will tell you NOT to use nltk with Python 2.6 or later (in particular, the terrifying Python 3.0 that is now available). So if at all possible you should stick to Python 2.5.X. Among the optional Python packages needed for some portions of nltk, you will need numpy.

    You will also need to install the data for nltk after installing the software. So your install needs are:

    1. NLTK software: Follow the platform-specific directions on the NLTK download page
    2. Numpy: Follow the platform-specific directions on the NLTK download page

      NOTE!: Version 1.5.1 of numpy works on Python 2.6 and Python 2.7, but 1.6.1 (the latest version) only works on Python 2.6, NOT on Python 2.7, so DON'T get the latest version of numpy if you have Python 2.7. Get version 1.5.1! It can be found here:

        Numpy 1.5.1
    3. NLTK data: Follow the directions on the NLTK data page. Note: Installing the data is optional for this assignment; you can get all the data you need from the XML corpus below.
  2. For this assignment you will use the NLTK senseval data to do sense disambiguation for the word hard. Below, for generality, I will use word to designate the word whose tokens we are doing sense disambiguation on. But for this assignment in particular, the only value word will have is hard.
  3. To check for nltk access, start up Python and proceed as follows:
    [gawron@ngram ~]$ python
    Python 2.5.1 (r251:54863, Jul 10 2008, 17:24:48) 
    [GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import nltk.classify.maxent
    >>> nltk.classify.maxent.demo()
    Training classifier...
    
          Iteration    Log Likelihood    Accuracy
          ---------------------------------------
                 1          -0.69315        0.374
                 2          -0.61426        0.626
           .... [lots more iterations!]....
    Optimization terminated successfully.
             Current function value: 0.404530
             Iterations: 37
             Function evaluations: 77
             Gradient evaluations: 77
    Testing classifier...
    Accuracy: 0.7940
    Avg. log likelihood: -0.5958
    
    Unseen Names      P(Male)  P(Female)
    ----------------------------------------
      Octavius        *0.9756   0.0244
      Thomasina        0.0291  *0.9709
      Barnett         *0.6795   0.3205
      Angelina         0.0029  *0.9971
      Saunders        *0.8483   0.1517
    >>> 
    
    This demos the max ent "names" model, a classifier which assigns gender to proper names. Although the demo is run by a function named demo() defined in nltk.classify.maxent, that function is just a wrapper for the real names_demo function defined in nltk.classify.util.

    Note that the features used for name recognition are defined in the function nltk.classify.util.names_demo_features:

    
    def names_demo_features(name):
        features = {}
        features['alwayson'] = True
        features['startswith'] = name[0].lower()
        features['endswith'] = name[-1].lower()
        for letter in 'abcdefghijklmnopqrstuvwxyz':
            features['count(%s)' % letter] = name.lower().count(letter)
            features['has(%s)' % letter] = letter in name.lower()
        return features
    
    This illustrates the format for a feature function. It builds and returns a dictionary whose keys are feature names and whose values are the feature values of some context. In this application, a context is a name string and the features are:
    1. a count feature for each letter of the alphabet, e.g., 'count(h)', which returns the number of 'h's occurring in the name.
    2. an occurrence feature for each letter of the alphabet, e.g., 'has(h)', which returns True if 'h' occurs in the name, and False if not.
    3. 'startswith' and 'endswith' features, which return the letter that the name starts/ends with.
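    To see the format concretely, here is a quick check of the extractor's output. The function definition is repeated so the snippet is self-contained, and the sample name 'Anna' is just an illustration:

```python
# Repeats the names_demo_features definition above so this snippet
# runs on its own.
def names_demo_features(name):
    features = {}
    features['alwayson'] = True
    features['startswith'] = name[0].lower()
    features['endswith'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = letter in name.lower()
    return features

f = names_demo_features('Anna')
print(f['startswith'])  # 'a'
print(f['endswith'])    # 'a'
print(f['count(n)'])    # 2
print(f['has(z)'])      # False
```

    Note that every feature is present in every dictionary the extractor returns; only the values vary from name to name. That is the shape your own extractor should have too.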
  4. I have tried importing maxent on both bulba and ngram and it works for me on both machines.

    If you have a problem, make sure the following environment variable is set in Unix on bulba/ngram (not Python):

      NLTK_DATA=/opt/lib/nltk/data
    This can be checked by typing 'echo $NLTK_DATA' to a shell:
    [gawron@ngram ~]$ echo $NLTK_DATA
    
    The shell should reply:
    /opt/lib/nltk/data
    
    If the variable is not set, enter the following line in your .bash_profile file ($HOME/.bash_profile):
    export NLTK_DATA=/opt/lib/nltk/data
    
  5. Your task is to get a word sense disambiguation demo working for the word 'hard' (the subcorpus of senseval is called 'hard.pos') using a max entropy classifier.
  6. This corpus can be accessed as follows:
    >>> from nltk.corpus import senseval
    >>> korp = senseval.read('hard.pos')
    
    korp is a list of corpus instances:
    >>> len(korp)
    4333
    >>> i0 = korp[0]
    >>> i0.context
    [('``', '``'), ('he', 'PRP'), ('may', 'MD'), ('lose', 'VB'), ('all', 'DT'), ('popular', 'JJ'), ('support', 'NN'), (',', ','), ('but', 'CC'), ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'), ('kill', 'VB'), ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'), ('him', 'PRP'), ('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB'), ('.', '.'), ("''", "''")]
    
    An instance's context is a list of (word, POS) pairs.
    >>> i0.senses
    ('HARD1',)
    >>> i0.context[i0.position]
    ('hard', 'JJ')
    
    The instance's senses attribute tells you which of the 3 senses of the word 'hard' is being used in this example. More than one sense at a time is possible, so the value is a tuple of senses. All the examples in this corpus have exactly one sense.
  7. There are 3 senses for hard in this corpus.
    1. "HARD1": difficult to do
    2. "HARD2": potent (as in "the hard stuff")
    3. "HARD3": physically resistant to denting, bending, or scratching
    The first thing you should do is get a sense for how hard the disambiguation task is by computing a "baseline" score. This is the accuracy score earned by a word sense disambiguator that always guesses the most probable sense of hard.
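    A minimal sketch of the baseline computation. The sense counts here are made up for illustration; with the real corpus you would collect the senses from korp itself:

```python
from collections import Counter

# Made-up sense counts, for illustration only; with the real corpus
# you would build: senses = [inst.senses[0] for inst in korp]
senses = ['HARD1'] * 3000 + ['HARD2'] * 800 + ['HARD3'] * 533

counts = Counter(senses)
most_common_sense, n = counts.most_common(1)[0]
baseline = float(n) / len(senses)
print(most_common_sense, round(baseline, 3))  # HARD1 0.692
```

    Your trained classifier is only interesting to the extent that it beats this number.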
  8. Your task is to extract some features from the corpus and train a max ent model that learns how to predict senses of the word hard based on the features of the context it occurs in. You should test your model following the pattern of the names_demo example in the nltk.classify.maxent module.
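    Following the names_demo pattern, a max ent trainer takes a list of (feature-dictionary, label) pairs, with part of the list held out for testing. A sketch of that data shaping, using a toy extractor and made-up examples in place of the real instances (with nltk installed, training would then be roughly nltk.classify.MaxentClassifier.train(train)):

```python
import random

# Made-up labeled examples standing in for the senseval instances.
examples = [(['hard', 'to', 'do'], 'HARD1'),
            (['hard', 'rock'], 'HARD3')] * 50

def toy_features(words):
    # Stand-in for the real wsd_features extractor.
    return dict((w, True) for w in words)

# The trainer wants (feature-dictionary, label) pairs.
labeled = [(toy_features(ws), sense) for (ws, sense) in examples]
random.shuffle(labeled)
cutoff = int(len(labeled) * 0.9)       # 90% train, 10% test
train, test = labeled[:cutoff], labeled[cutoff:]
print(len(train), len(test))  # 90 10
```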

    If you want to ignore the most common words of English among your features, here is Google's stoplist, with most prepositions removed and "and" added:
        stopwords = ['I', 'a', 'an', 'are', 'as', 'and',
                     'be', 'com', 'how', 'is', 'it', 'of', 'or',
                     'that', 'the', 'this', 'to', 'was', 'what',
                     'when', 'where', 'who', 'will', 'with', 'www']
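    One way to build the vocab for your extractor is to count word occurrences across all the contexts, skip the stopwords, and keep the most frequent items. A sketch, with two hand-written contexts and a tiny stoplist standing in for the real ones:

```python
from collections import Counter

# Two made-up contexts standing in for [inst.context for inst in korp].
contexts = [
    [('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB')],
    [('a', 'DT'), ('hard', 'JJ'), ('rock', 'NN')],
]
# Tiny stand-in stoplist; use the fuller stopwords list above.
stoplist = set(['a', 'to', 'that', 'the', "'s"])

counts = Counter(w.lower() for context in contexts
                 for (w, pos) in context
                 if w.lower() not in stoplist)
vocab = [w for (w, n) in counts.most_common(300)]
print(vocab[0])  # 'hard' (the most frequent non-stopword)
```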
    
  • To modify names_demo into a sense disambiguation demo, you modify the part that loads the names corpus to load the senseval corpus instead. You should extract features using your own feature extraction function wsd_features(senseval_inst, vocab). This should return a dictionary of the features in the context of the occurrence of the word hard. You should experiment with new features for extra credit, but minimally, implement the following:
    1. a feature named 'alwaystrue' that always has the value 1. Thus the first two lines of code in the definition of wsd_features are:
      features = {}
      features['alwaystrue'] = 1
      
      And the last line should be:
      return features
      
    2. For each of the vocab items w (or maybe just the 300 most frequently occurring vocab items), implement a feature such that features[w] = True if and only if w occurs in senseval_inst.context. Note that the feature dictionary should always return a value for each of the vocab items you are using, whether or not w occurs, so make sure:
            features[w] = False
            
      when w does not occur in senseval_inst.context.
    3. Optional: A feature checking whether the part of speech of the FOLLOWING word is TO.
    Your assignment is to modify a version of names_demo to be a senseval demo, and to write a feature extractor that works in the demo. Then run your code to see how you do at classifying senses of the word hard, testing it by loading a trainer in the nltk maxent module, as is done at the very bottom of the file maxent.py.
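    The required extractor might look roughly like the sketch below. It is written against a stand-in instance type so it runs without nltk; the real senseval_inst comes from the corpus, the feature name 'next-is-TO' is my own invention for the optional item 3, and the exact feature inventory is up to you:

```python
from collections import namedtuple

# Stand-in for an nltk senseval instance, just for this sketch.
Inst = namedtuple('Inst', ['context', 'position'])

def wsd_features(senseval_inst, vocab):
    """Return a feature dictionary for one occurrence of 'hard'."""
    words = set(w.lower() for (w, pos) in senseval_inst.context)
    features = {}
    features['alwaystrue'] = 1
    # Occurrence features: one per vocab item, present whether or
    # not the item occurs in this context.
    for w in vocab:
        features[w] = w in words
    # Optional feature (item 3): POS of the following word is TO.
    nxt = senseval_inst.position + 1
    if nxt < len(senseval_inst.context):
        features['next-is-TO'] = (senseval_inst.context[nxt][1] == 'TO')
    return features

inst = Inst(context=[('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'),
                     ('to', 'TO'), ('do', 'VB')],
            position=2)
feats = wsd_features(inst, ['do', 'kill'])
print(feats['do'], feats['kill'], feats['next-is-TO'])  # True False True
```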
  • List of links
    1. NLTK home page
    2. NLTK download page
    3. NLTK data page.