Marks-MacBook-Pro:max_entropy_module gawron$ python -i call_maxent.py senseval-hard.evt 100
Reading data...
Senses: HARD1 HARD2 HARD3
Splitting into test & train...
Training classifier...
==> Training (100 iterations)
     Iteration    Log Likelihood    Accuracy
     ---------------------------------------
             1          -1.09861       0.087
and ending with something like:
Testing classifier...
Accuracy: 0.8756
Total: 410
Label     Precision    Recall
_____________________________________________
HARD1       0.873      1.000
HARD2       0.944      0.515
HARD3       0.833      0.125

Label     Num Corr
HARD1        337
HARD2         17
HARD3          5
-2.418 get_VB==True and label is 'HARD3'
-2.212 look_NN==True and label is 'HARD1'
-2.121 hardest_JJ==True and label is 'HARD3'
-2.020 find_VB==True and label is 'HARD2'
....
The final bit is the list of the most informative features in the trained
model.
Cut and paste all this output and send it.
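To see where lines like -2.418 get_VB==True and label is 'HARD3' come from, here is a toy sketch of ranking feature/label pairs by the magnitude of their learned weights. The weight dictionary below is invented for illustration (the values are just copied from the sample output above); the real weights live inside the trained model.

```python
# Toy sketch: rank (feature, label) pairs by |weight|, mimicking the
# "most informative features" listing. The weights here are invented.
weights = {
    ('get_VB', 'HARD3'): -2.418,
    ('look_NN', 'HARD1'): -2.212,
    ('hardest_JJ', 'HARD3'): -2.121,
    ('find_VB', 'HARD2'): -2.020,
}

# Sort by absolute weight, largest first.
ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
for (feat, label), w in ranked:
    print('%6.3f %s==True and label is %r' % (w, feat, label))
```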
>>> (feats, label) = test[0]
>>> pdist = nb_classifier.prob_classify(feats)
>>> pdist.prob('HARD3')
The first line sets the variables feats and label
to the extracted feature dictionary and class of the first example
in the test set. It is instructive to look at these and see what
you've got. The variable label is the correct class label, or word sense,
for that example; it should correspond to one of the senses listed
when the data was loaded at the beginning of your training output.
The second line applies the classifier to that example
and returns probabilities for all three classes in the form
of an object called pdist, which has a method prob
that returns the probability of a given class.
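To make the pdist object concrete, here is a toy stand-in for the distribution that prob_classify returns. The class name ToyProbDist and the probabilities are invented for illustration; the prob and max methods mirror the interface of NLTK's probability-distribution objects.

```python
# A toy stand-in for the object returned by prob_classify.
# The class name and the numbers are invented; prob() and max()
# mirror the methods used in the session above.
class ToyProbDist:
    def __init__(self, probs):
        self._probs = probs              # sense label -> probability

    def prob(self, label):
        """Probability assigned to one class label."""
        return self._probs.get(label, 0.0)

    def max(self):
        """The most probable class label."""
        return max(self._probs, key=self._probs.get)

pdist = ToyProbDist({'HARD1': 0.91, 'HARD2': 0.06, 'HARD3': 0.03})
print(pdist.prob('HARD3'))   # 0.03
print(pdist.max())           # HARD1
```

Note that the probabilities over the three senses sum to 1: pdist describes one distribution over the possible labels for this single example.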
What kind of probability is being returned? Is it:
You will not need to install the NLTK data after installing the software, so your install needs are:
NOTE: Version 1.6.1 of numpy works with Python 2.7.3 and with the latest version of NLTK (2.0.4). See the NLTK website for numpy links.
[gawron@ngram ~]$ python
Python 2.5.1 (r251:54863, Jul 10 2008, 17:24:48)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
To train and test your classifier you need to give it some labeled data.
The training corpus is an XML file with example sentences using the word hard. It contains about 4000 events; each event is a sentence in which the word hard occurs with one of three senses.
In the XML file, an event looks something like this:
<senseval_instance word="hard-a" sense="HARD1" position="20"> ``_`` he_PRP may_MD lose_VB all_DT popular_JJ support_NN ,_, but_CC someone_NN has_VBZ to_TO kill_VB him_PRP to_TO defeat_VB him_PRP and_CC that_DT 's_VBZ hard_JJ to_TO do_VB ._. ''_'' </senseval_instance>
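As a short sketch of reading one such event, the following pulls the sense label and the word_POS tokens out of the sample instance above using only Python's standard library. The tag and attribute names come from that sample; the variable names are mine.

```python
import xml.etree.ElementTree as ET

# The sample event from the handout, as a raw XML string.
raw = '''<senseval_instance word="hard-a" sense="HARD1" position="20">
``_`` he_PRP may_MD lose_VB all_DT popular_JJ support_NN ,_, but_CC
someone_NN has_VBZ to_TO kill_VB him_PRP to_TO defeat_VB him_PRP and_CC
that_DT 's_VBZ hard_JJ to_TO do_VB ._. ''_'' </senseval_instance>'''

event = ET.fromstring(raw)
sense = event.get('sense')                     # 'HARD1'
# Each token is word_POS; split on the last underscore.
tokens = [tok.rsplit('_', 1) for tok in event.text.split()]
print(sense, tokens[1])                        # HARD1 ['he', 'PRP']
```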
To disambiguate these sentences you need to turn the XML into a form the maxent trainer can use: feature dictionaries that represent each example as a set of binary yes/no features. So the steps are:
1. Feature extraction: map the raw data into features.
2. Training: train and test the maxent model on the event file.
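The two steps above can be sketched as follows. This is a minimal sketch, assuming the simplest feature scheme consistent with the feature names in the training output (get_VB, look_NN, and so on): one boolean feature per word_POS token in the sentence. The function and variable names are mine, not from the assignment code.

```python
# Step 1, feature extraction: map a tokenized sentence onto a binary
# feature dictionary with one True-valued feature per word_POS token.
def extract_features(tokens):
    return {tok: True for tok in tokens}

# A toy sentence in the corpus's word_POS format.
sent = "someone_NN has_VBZ to_TO find_VB it_PRP".split()
feats = extract_features(sent)
print(feats['find_VB'])        # True

# Step 2, training, would then look something like this
# (requires NLTK, so it is left as a comment here):
#   from nltk.classify import MaxentClassifier
#   classifier = MaxentClassifier.train(train_set, max_iter=100)
```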