Maximum Entropy modeling assignment (Due Mar. 14, 2017)

  1. Here are the main components of the assignment. They should all be downloaded and placed in a single directory on your local machine, which we will call $WORKING_DIR.
    1. XML data file
    2. XML data file backup
    3. Top level feature extractor. This is the file you will edit.
    4. Module loaded by the top level feature extractor. This must be present in the same directory in order for the feature extractor to work.
    5. Training and testing code. Uses NLTK
    6. Assignment slides: Background directly related to this assignment.
    7. Max entropy lecture slides
    Instructions for using these modules are given below.
  2. Here is what you need to email me by the deadline date above (gawron@mail.sdsu.edu):
    1. The edited version of call_extract_event.py that gave you the best score you achieved on the word sense disambiguation task.
    2. The output from call_maxent.py on your best scoring run. The trainer outputs a lot of stuff, starting with something like:
      Marks-MacBook-Pro:max_entropy_module gawron$ python -i call_maxent.py senseval-hard.evt 100
      Reading data...
        Senses: HARD1 HARD2 HARD3
      Splitting into test & train...
      Training classifier...
        ==> Training (100 iterations)
      
            Iteration    Log Likelihood    Accuracy
            ---------------------------------------
                   1          -1.09861        0.087
      
      
      and ending with something like:
      Testing classifier...
      Accuracy: 0.8756
      Total: 410
      
      
      Label                 Precision      Recall
      _____________________________________________
      HARD1                     0.873       1.000
      HARD2                     0.944       0.515
      HARD3                     0.833       0.125
      
      
      Label                  Num Corr
      HARD1                337
      HARD2                17
      HARD3                5
      
      
        -2.418 get_VB==True and label is 'HARD3'
        -2.212 look_NN==True and label is 'HARD1'
        -2.121 hardest_JJ==True and label is 'HARD3'
        -2.020 find_VB==True and label is 'HARD2'
             ....
      
      The final bit is the list of the most informative features in the trained model.

      Cut and paste all this output and send it.
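The precision and recall figures in the tables above are computed per label. A minimal check, where the predicted count (386) is inferred from the sample numbers (337 / 0.873) rather than printed by the trainer:

```python
# Per-label precision and recall, using HARD1's numbers from the sample output.
correct = 337    # HARD1 examples classified as HARD1 ("Num Corr")
predicted = 386  # all examples classified as HARD1 (inferred, not in the output)
actual = 337     # all HARD1 examples in the test split
precision = correct / float(predicted)
recall = correct / float(actual)
print(round(precision, 3), recall)  # 0.873 1.0
```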

    3. Answer the following questions by writing up a few sentences in a Word or PDF document.
      1. How does the performance of the Max Ent classifier compare with that of the Naive Bayes classifier? (See the output at the end of your training output; also see the NLTK Naive Bayes classifier.) Does the relative performance of the two classifiers change after modifying the feature extractor? Which classifier is better? Include a discussion of computational efficiency.
      2. How does the best Naive Bayes classifier differ from the best Max Ent classifier in the top 20 most informative features?
      3. Execute the following code after training and testing:
        >>> (feats, label) = test[0]
        >>> pdist = nb_classifier.prob_classify(feats)
        >>> pdist.prob('HARD3')
        
        The first line sets the variables feats and label to the extracted feature dictionary and class of the first example in the test set. It is instructive to look at these and see what you've got. The variable label is the correct class label, or word sense, for that example. It should correspond to one of the classes found when the data was loaded in at the beginning of your training output. The second line of code applies the classifier to that example and returns probabilities for all three classes in the form of an object called pdist, which has a method prob that returns the probability of a given class.

        What kind of probability is being returned? Is it:

        1. P(label | feats): the probability of the class 'HARD3' given the feature set; or
        2. P(feats | label): the probability of feature set given the label 'HARD3'; or
        3. P(feats, label): the joint probability of the feature set and label?
        Hint: Compute pdist.prob for all three classes.
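The hint can be worked through with toy numbers (the values below are illustrative, not from the real model): conditional probabilities P(label | feats) must sum to 1 over the three senses, while joint probabilities P(feats, label) sum only to P(feats).

```python
# Illustrative joint probabilities P(feats, label) for one fixed feature set:
joint = {'HARD1': 0.16, 'HARD2': 0.03, 'HARD3': 0.01}

p_feats = sum(joint.values())  # P(feats), by marginalizing over the labels

# Conditioning on feats renormalizes, so the three values sum to 1:
cond = {label: p / p_feats for label, p in joint.items()}
print(round(sum(cond.values()), 6))  # 1.0
```

So summing pdist.prob over all three classes tells you immediately which kind of probability you are looking at.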
      4. What feature or features are causing the probability of one class to be so high for the feature set in test[7]?
    4. For more background and info on maxent models, see Le Zhang's (U. Edinburgh) max ent page.
    5. For this assignment you need a Python package that is NOT part of the standard Python distro. It is called nltk. A Google search on the string "nltk" will direct you to the nltk home page, or you can go to:
        NLTK home page
      If at all possible you should stick to Python 2.6 or 2.7. Among the optional Python packages needed for some portions of nltk, you will need numpy.

      You will not need to install the NLTK data after installing the software. So your install needs are:

      1. NLTK software: Follow platform specific directions on NLTK download page
      2. Numpy: Follow platform specific directions on NLTK download page

        NOTE!: Version 1.6.1 of numpy works with Python 2.7.3 and with the latest version of nltk (2.0.4). See the NLTK website for numpy links.

      3. NLTK data: To do this, follow the directions on the NLTK data page. Note: Installing the data is definitely optional for this assignment. You can get all the data you need from the XML corpus below.
    6. To check for nltk access, start up Python and proceed as follows:
      [gawron@ngram ~]$ python
      Python 2.5.1 (r251:54863, Jul 10 2008, 17:24:48) 
      [GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import nltk
      

    Running your maxent Classifier

    1. To train and test your classifier you need to give it some labeled data.

      The training corpus is an xml file with example sentences using the word hard. This XML file contains about 4000 events. Each event is a sentence in which the word hard occurred with one of three senses.

      1. "HARD1": difficult to do
      2. "HARD2": potent (as in "the hard stuff")
      3. "HARD3": physically resistant to denting, bending, or scratching

      In the xml file an event looks something like this:

        <senseval_instance word="hard-a" sense="HARD1" position="20">
        ``_`` he_PRP may_MD lose_VB all_DT popular_JJ support_NN ,_, but_CC 
        someone_NN has_VBZ to_TO kill_VB him_PRP to_TO defeat_VB him_PRP 
        and_CC that_DT 's_VBZ hard_JJ to_TO do_VB ._. ''_''
        </senseval_instance>
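One way to see what each instance contains is to parse one with the standard library. This sketch is illustrative only (the real senseval-hard.xml wraps thousands of such elements in one document, so you would parse the whole file rather than a single string):

```python
import xml.etree.ElementTree as ET

# A single instance, trimmed from the sample above.
instance = '''<senseval_instance word="hard-a" sense="HARD1" position="20">
he_PRP may_MD lose_VB all_DT popular_JJ support_NN and_CC
that_DT 's_VBZ hard_JJ to_TO do_VB
</senseval_instance>'''

elem = ET.fromstring(instance)
sense = elem.get('sense')                       # the labeled word sense
tokens = elem.text.split()                      # word_TAG tokens
words = [t.rsplit('_', 1)[0] for t in tokens]   # strip the POS tags
print(sense, words[:3])  # HARD1 ['he', 'may', 'lose']
```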
        

      To disambiguate these sentences you need to turn the XML into a form the max ent trainer can use. Essentially you have to turn the raw data into feature dictionaries that represent each example with a set of binary yes/no features. So the steps are:

        1. Feature extraction Map raw data into features:
          XML file → EVENT file
        2. Training Train and test the maxent model on the event file
          EVENT file → maxent model

        These steps look like this on the command line:

        1. python -i call_extract_event.py senseval-hard.xml
          
            XML file → EVENT file
        2. python -i call_maxent.py senseval-hard.evt 3
          
            EVENT file → maxent model

        Let's take these steps in turn:

        1. feature extraction: When you execute call_extract_event.py, the xml file is read in and another file, called the event file, is created. This file is named senseval-hard.evt. In the course of this assignment you will experiment with different versions of feature extraction, which will produce different versions of senseval-hard.evt, but you should never change the original data, senseval-hard.xml. To help you out in case you accidentally change senseval-hard.xml, there is an exact copy (called senseval-hard.xml.saf).
        2. The second step above trains a max ent model with the script call_maxent.py. The second argument on the call_maxent.py line is the number of training iterations to run (here 3). This argument is optional, with a default value of 50; since 50 iterations can take a while, you should pass in a lower value when debugging. When training seriously, the right value is about 100.

        Preparing training data for your classifier

        The data in your event file, the data you train your classifier on, is not raw text data. It is a representation of the features of the raw data that you think are important for doing sense disambiguation.

        The XML file you have been given consists of some 4000 training instances. The event file you create also consists of 4000 examples: each begins with a line consisting solely of the string "BEGIN EVENT" and ends with a line consisting solely of "END EVENT". The second line of the event tells you which of the three senses of the word 'hard' is being used in this event. After the sense come the features of this example. In the sample event file you have been given, an example looks like this:

        BEGIN EVENT
        HARD1
        soft_JJ         0
        more_RBR        0
        new_JJ          0
        many_JJ         0
        has_VBZ         1
        stick_NN        0
        imagine_VB      0
        ,_,             1
        (_(             0
        even_RB         0
        were_VBD        0
        ...
        END EVENT
        
        This event representation says that in this example (with sense HARD1), the word has occurred, and a comma occurred, and the words soft, more, new, many, stick, imagine, left parenthesis, even and were did not occur (the event representation includes still other present and absent words, omitted here).

        This event representation is what is passed to the maxent module, which then builds a statistical model capturing which word features are the best predictors of senses of hard.

        The event file is computed by running a feature extraction function, which creates a Python dictionary for each event in the training set. The keys of the dictionary are the features, and the values are the feature values.

        The only features used in the default feature extractor you start with (in call_extract_event.py) are word features, which tell you when a word is present in a context. Thus, the function you should focus on is extract_vocab, which chooses which words will be features. Currently, it just uses the top 100 most frequent words in the XML corpus. It ignores the most common words of English, using the following list:
            stopwords = [ 'I',    'a',    'an',    'are',    'as',    'and',
                          'be',    'com',   'how',  'is',    'it',    'of',    'or',
                          'that',    'the',  'this',    'to',    'was',    'what',
                          'when',   'where',    'who',    'will',    'with',
                          'www']
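The selection extract_vocab performs can be sketched with collections.Counter. The corpus below is a tiny stand-in for the tokenized XML data, and the stopword list is abbreviated:

```python
from collections import Counter

# Stand-in for the tokenized corpus and an abbreviated stopword list.
stopwords = {'the', 'to', 'and', 'is', 'of', 'a'}
corpus = ['hard', 'to', 'do', 'the', 'hard', 'stuff', 'is', 'hard', 'work']

# Count everything except stopwords, then keep the most frequent 100 as features.
counts = Counter(w for w in corpus if w not in stopwords)
vocab = [w for w, _ in counts.most_common(100)]
print(vocab[0], 'to' in vocab)  # hard False
```

Raw frequency is a weak criterion: a word can be frequent overall yet occur equally often with all three senses, in which case it discriminates nothing. That is the kind of improvement to aim for.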
        
        You should modify extract_vocab to choose more informative features, and improve the test accuracy you are shown after training when you run call_maxent.py.

        The first thing you should do is get a sense for how hard the disambiguation task is. Do this by running the two commands call_extract_event.py and call_maxent.py as above. This will recreate your event file and run the maxent trainer on it.

        During feature extraction the extraction code collects and prints out information about the statistical distribution of the three senses. It looks like this:

        HARD1                3455 0.797
        HARD2                 502 0.116
        HARD3                 376 0.087
        
        This tells you that about 80% of the examples in the training set use the most common sense, HARD1. This gives you a baseline performance score: a sense classifier that always guessed the most common sense would be right about 80% of the time.
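The baseline can be checked directly from the counts printed above:

```python
# Sense counts from the extraction output above.
counts = {'HARD1': 3455, 'HARD2': 502, 'HARD3': 376}

total = sum(counts.values())                     # 4333 examples
baseline = max(counts.values()) / float(total)  # always guess HARD1
print(round(baseline, 3))  # 0.797
```

Your classifier only earns its keep to the extent that its test accuracy beats this number.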

      1. List of links
        1. NLTK home page
        2. NLTK download page
        3. NLTK data page.