Marks-MacBook-Pro:max_entropy_module gawron$ python -i call_maxent.py senseval-hard.evt 100
Reading data...
Senses: HARD1 HARD2 HARD3
Splitting into test & train...
Training classifier...
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.087

and ending with something like:

Testing classifier...
Accuracy: 0.8756
Total: 410

Label    Precision    Recall
_____________________________________________
HARD1      0.873       1.000
HARD2      0.944       0.515
HARD3      0.833       0.125

Label    Num Corr
HARD1       337
HARD2        17
HARD3         5

  -2.418 get_VB==True and label is 'HARD3'
  -2.212 look_NN==True and label is 'HARD1'
  -2.121 hardest_JJ==True and label is 'HARD3'
  -2.020 find_VB==True and label is 'HARD2'
  ....

The final bit is the list of the most informative features in the trained model.
Cut and paste all this output and send it.
>>> (feats, label) = test[0]
>>> pdist = nb_classifier.prob_classify(feats)
>>> pdist.prob('HARD3')

The first line sets the variables feats and label to the extracted feature dictionary and the class of the first example in the test set. It is instructive to look at these and see what you've got. The variable label is the correct class label, or word sense, for that example; it should correspond to one of the classes found when the data was loaded in at the beginning of your training output. The second line applies the classifier to that example and returns probabilities for all three classes in the form of an object called pdist, which has a method prob that returns the probability of a given class.
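To see concretely what kind of object comes back, here is a small pure-Python stand-in for prob_classify (this is not NLTK's actual implementation, and the counts and priors below are made up for illustration). It computes a smoothed Naive-Bayes-style score for each sense and renormalizes, so the returned values form a probability distribution over all three senses:

```python
from collections import Counter

def posterior(feats, counts_by_label, priors):
    """Smoothed Naive-Bayes-style posterior P(label | feats),
    renormalized so the sense probabilities sum to 1."""
    scores = {}
    for label, prior in priors.items():
        total = sum(counts_by_label[label].values())
        p = prior
        for f in feats:
            # add-one smoothing over a (made-up) 10-feature vocabulary
            p *= (counts_by_label[label][f] + 1) / (total + 10)
        scores[label] = p
    z = sum(scores.values())
    return {label: s / z for label, s in scores.items()}

# Hypothetical per-sense feature counts, not from the real corpus.
counts = {'HARD1': Counter({'hard_JJ': 8, 'do_VB': 5}),
          'HARD2': Counter({'look_NN': 6}),
          'HARD3': Counter({'hardest_JJ': 4})}
priors = {'HARD1': 0.8, 'HARD2': 0.1, 'HARD3': 0.1}

pdist = posterior(['hard_JJ', 'do_VB'], counts, priors)
```

Like the pdist object above, the result assigns some probability to every sense, and the values sum to 1.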
What kind of probability is being returned? Is it:
You will not need to install the NLTK data after installing the software. So your install needs are:
NOTE: Version 1.6.1 of numpy works with Python 2.7.3 and with the latest version of NLTK (2.0.4). See the NLTK website for numpy links.
[gawron@ngram ~]$ python
Python 2.5.1 (r251:54863, Jul 10 2008, 17:24:48)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
To train and test your classifier you need to give it some labeled data.
The training corpus is an XML file with example sentences using the word hard. The file contains about 4000 events. Each event is a sentence in which the word hard occurred in one of three senses.
In the XML file an event looks something like this:
<senseval_instance word="hard-a" sense="HARD1" position="20"> ``_`` he_PRP may_MD lose_VB all_DT popular_JJ support_NN ,_, but_CC someone_NN has_VBZ to_TO kill_VB him_PRP to_TO defeat_VB him_PRP and_CC that_DT 's_VBZ hard_JJ to_TO do_VB ._. ''_'' </senseval_instance>
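Assuming each event can be read with Python's standard xml.etree.ElementTree (the quote-mark tokens are lightly simplified here, and the real assignment code may use NLTK's own senseval reader instead), one instance can be pulled apart like this:

```python
import xml.etree.ElementTree as ET

# One event from the corpus, reproduced as a string.
raw = ('<senseval_instance word="hard-a" sense="HARD1" position="20">'
       "he_PRP may_MD lose_VB all_DT popular_JJ support_NN ,_, but_CC "
       "someone_NN has_VBZ to_TO kill_VB him_PRP to_TO defeat_VB him_PRP "
       "and_CC that_DT 's_VBZ hard_JJ to_TO do_VB ._."
       '</senseval_instance>')

inst = ET.fromstring(raw)
sense = inst.get('sense')    # the label: 'HARD1'
tokens = inst.text.split()   # the word_POS tokens of the sentence
```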
To disambiguate these sentences you need to turn the XML into a form the maxent trainer can use. Essentially, you have to turn the raw data into feature dictionaries that represent each example with a set of binary yes/no features. So the steps are:
1. Feature extraction: map the raw data into features.

2. Training: train and test the maxent model on the event file.
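The feature-extraction step can be sketched as follows. The helper name extract_features is hypothetical (the assignment's own scripts presumably handle this for you), but the binary yes/no encoding matches the features shown in the training output, e.g. look_NN==True:

```python
def extract_features(tokens):
    """Map a list of word_POS tokens to a binary feature dictionary:
    every token that occurs in the sentence becomes a True-valued feature."""
    return {tok: True for tok in tokens}

# One (feature-dict, label) pair, the form NLTK trainers expect
# for both training and test data.
tokens = ['he_PRP', 'may_MD', 'lose_VB', 'hard_JJ', 'to_TO', 'do_VB']
example = (extract_features(tokens), 'HARD1')
```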