You will also need to install the data for nltk after installing the software. So your install needs are:
NOTE!: Version 1.5.1 of numpy works on Python 2.6 and Python 2.7, but 1.6.1 (the latest version) only works on Python 2.6, NOT on Python 2.7, so DON'T get the latest version of numpy if you have Python 2.7. Get version 1.5.1! This can be found here:
[gawron@ngram ~]$ python
Python 2.5.1 (r251:54863, Jul 10 2008, 17:24:48)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk.classify.maxent
>>> nltk.classify.maxent.demo()
Training classifier...
   Iteration    Log Likelihood    Accuracy
   ---------------------------------------
       1           -0.69315        0.374
       2           -0.61426        0.626
   .... [lots more iterations!] ....
Optimization terminated successfully.
Current function value: 0.404530
Iterations: 37
Function evaluations: 77
Gradient evaluations: 77
Testing classifier...
Accuracy: 0.7940
Avg. log likelihood: -0.5958
Unseen Names    P(Male)   P(Female)
----------------------------------------
    Octavius    *0.9756    0.0244
   Thomasina     0.0291   *0.9709
     Barnett    *0.6795    0.3205
    Angelina     0.0029   *0.9971
    Saunders    *0.8483    0.1517
>>>
This demos the maxent "names" model, a classifier which assigns a gender to proper names. Although the demo is run by a function named demo() defined in nltk.classify.maxent, that function is just a wrapper for the real names_demo function defined in nltk.classify.util.
Note that the features used for name classification are defined in the function nltk.classify.util.names_demo_features:
def names_demo_features(name):
    features = {}
    features['alwayson'] = True
    features['startswith'] = name[0].lower()
    features['endswith'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = letter in name.lower()
    return features
This illustrates the format for a feature function. It builds
and returns a dictionary whose keys are feature names and whose values
are the feature values of some context. In this application,
a context is a name string and the features are: an 'alwayson' feature that is true for every name, the lowercased first and last letters of the name ('startswith' and 'endswith'), and, for each letter of the alphabet, how many times it occurs in the name ('count(x)') and whether it occurs at all ('has(x)').
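To see the format concretely, you can call the feature function directly on a name. This is a minimal self-contained sketch: the function body is reproduced here from nltk.classify.util so the snippet runs without NLTK installed.

```python
def names_demo_features(name):
    # Same feature function as nltk.classify.util.names_demo_features.
    features = {}
    features['alwayson'] = True
    features['startswith'] = name[0].lower()
    features['endswith'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = letter in name.lower()
    return features

feats = names_demo_features('Octavius')
print(feats['startswith'])   # 'o'
print(feats['endswith'])     # 's'
print(feats['count(u)'])     # 1
print(feats['has(z)'])       # False
```

The returned dictionary is exactly what the maxent trainer consumes: one (feature name, feature value) pair per feature.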
If you have a problem, make sure the following environment variable is set in Unix on bulba/ngram (not in Python):
[gawron@ngram ~]$ echo $NLTK_DATA
The shell should reply:
/opt/lib/nltk/data
If the variable is not set, add the following line to your .bash_profile file ($HOME/.bash_profile):
export NLTK_DATA=/opt/lib/nltk/data
>>> from nltk.corpus import senseval
>>> korp = senseval.read('hard.pos')
korp is a list of corpus instances!
>>> len(korp)
4333
>>> i0 = korp[0]
>>> i0.context
[('``', '``'), ('he', 'PRP'), ('may', 'MD'), ('lose', 'VB'), ('all', 'DT'), ('popular', 'JJ'), ('support', 'NN'), (',', ','), ('but', 'CC'), ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'), ('kill', 'VB'), ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'), ('him', 'PRP'), ('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB'), ('.', '.'), ("''", "''")]
An instance's context is a list of (word, POS-tag) pairs.
>>> i0.senses
('HARD1',)
>>> i0.context[i0.position]
('hard', 'JJ')
The instance's senses attribute tells you which of the 3 senses of the word 'hard' is being used
in this example. More than one sense at a time is possible, so senses is a tuple.
All the examples in this corpus have exactly one sense.
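The attributes used in the transcript above can be illustrated with a minimal stand-in for a senseval instance. This is a sketch: the SensevalInstance namedtuple here is hypothetical, built to mirror the real instances that come from nltk.corpus.senseval.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring an NLTK senseval instance.
SensevalInstance = namedtuple('SensevalInstance',
                              ['position', 'context', 'senses'])

i0 = SensevalInstance(
    position=20,  # index of the target word within context
    context=[('``', '``'), ('he', 'PRP'), ('may', 'MD'), ('lose', 'VB'),
             ('all', 'DT'), ('popular', 'JJ'), ('support', 'NN'), (',', ','),
             ('but', 'CC'), ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'),
             ('kill', 'VB'), ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'),
             ('him', 'PRP'), ('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'),
             ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB'), ('.', '.'),
             ("''", "''")],
    senses=('HARD1',))

# position indexes the target word in the context list.
print(i0.context[i0.position])   # ('hard', 'JJ')
print(i0.senses)                 # ('HARD1',)
```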
stopwords = ['I', 'a', 'an', 'are', 'as', 'and',
             'be', 'com', 'how', 'is', 'it', 'of', 'or',
             'that', 'the', 'this', 'to', 'was', 'what',
             'when', 'where', 'who', 'will', 'with',
             'the', 'www']
The first lines of your feature function should be:
features = {}
features['alwaystrue'] = 1
And the last line should be:
return features
In between, set
features[w] = False
when w does not occur in senseval_inst.context.
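Putting those pieces together, one way to fill in the middle of the feature function is sketched below. This is not the official solution: the name senseval_features and the vocabulary parameter are my own assumptions, and the small namedtuple instance at the end exists only to make the sketch runnable.

```python
from collections import namedtuple

stopwords = ['I', 'a', 'an', 'are', 'as', 'and',
             'be', 'com', 'how', 'is', 'it', 'of', 'or',
             'that', 'the', 'this', 'to', 'was', 'what',
             'when', 'where', 'who', 'will', 'with',
             'the', 'www']

def senseval_features(senseval_inst, vocabulary):
    # Words that actually occur in this instance's context, minus
    # stopwords.  The context is a list of (word, POS-tag) pairs.
    context_words = set(w for (w, pos) in senseval_inst.context
                        if w not in stopwords)
    features = {}
    features['alwaystrue'] = 1
    # One boolean feature per vocabulary word: True if the word occurs
    # in the context, False if it does not.
    for w in vocabulary:
        features[w] = w in context_words
    return features

# Hypothetical mini-instance, just for demonstration.
Inst = namedtuple('Inst', ['context'])
inst = Inst(context=[('kill', 'VB'), ('hard', 'JJ'), ('to', 'TO')])
feats = senseval_features(inst, ['kill', 'defeat'])
print(feats)   # {'alwaystrue': 1, 'kill': True, 'defeat': False}
```

Note that the stopword 'to' never becomes a feature value of True, because it is filtered out of context_words before the vocabulary loop runs.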