You will also need to install the data for nltk after installing the software. So your install needs are:
NOTE!: Version 1.5.1 of numpy works on Python 2.6 and Python 2.7, but 1.6.1 (the latest version) only works on Python 2.6, NOT on Python 2.7, so DON'T get the latest version of numpy if you have Python 2.7. Get version 1.5.1! This can be found here:
[gawron@ngram ~]$ python
Python 2.5.1 (r251:54863, Jul 10 2008, 17:24:48)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk.classify.maxent
>>> nltk.classify.maxent.demo()
Training classifier...
  Iteration    Log Likelihood    Accuracy
  ---------------------------------------
          1          -0.69315        0.374
          2          -0.61426        0.626
  .... [lots more iterations!]....
Optimization terminated successfully.
         Current function value: 0.404530
         Iterations: 37
         Function evaluations: 77
         Gradient evaluations: 77
Testing classifier...
Accuracy: 0.7940
Avg. log likelihood: -0.5958

Unseen Names      P(Male)  P(Female)
----------------------------------------
  Octavius        *0.9756   0.0244
  Thomasina        0.0291  *0.9709
  Barnett         *0.6795   0.3205
  Angelina         0.0029  *0.9971
  Saunders        *0.8483   0.1517
>>>

This demos the maxent "names" model, a classifier that assigns gender to proper names. Although the demo is run by a function named demo() defined in nltk.classify.maxent, that function, as shown below, is just a wrapper for the real names_demo function defined in nltk.classify.util.
Note that the features used for name recognition are defined in the function nltk.classify.util.names_demo_features:
def names_demo_features(name):
    features = {}
    features['alwayson'] = True
    features['startswith'] = name[0].lower()
    features['endswith'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = letter in name.lower()
    return features

This illustrates the format for a feature function. It builds and returns a dictionary whose keys are feature names and whose values are the feature values of some context. In this application, a context is a name string and the features are:
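To see concretely what such a dictionary contains, the feature function can be called on a single name. The sketch below repeats the definition quoted above so that it runs without NLTK installed; the variable name feats is just for illustration.

```python
# Copy of the feature function quoted above, so the sketch runs without NLTK.
def names_demo_features(name):
    features = {}
    features['alwayson'] = True
    features['startswith'] = name[0].lower()
    features['endswith'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = letter in name.lower()
    return features

feats = names_demo_features('Thomasina')

# One always-on feature, one feature each for the first and last letter,
# and a count(...) and a has(...) feature for each of the 26 letters.
print(len(feats))            # 3 + 2 * 26 = 55 features
print(feats['startswith'])   # t
print(feats['endswith'])     # a
print(feats['count(a)'])     # 2, since 'thomasina' contains two a's
print(feats['has(z)'])       # False
```

Every name gets the same 55 keys; only the values differ from name to name.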
If you have a problem, make sure the following environment variable is set in Unix on bulba/ngram (not Python):
[gawron@ngram ~]$ echo $NLTK_DATA

The shell should reply:
/opt/lib/nltk/data

If the variable is not set, enter the following line in your .bash_profile file ($HOME/.bash_profile):
export NLTK_DATA=/opt/lib/nltk/data
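If you want to double-check from inside Python that the shell setting was actually picked up, something like the following sketch should do. It uses only the standard library; the function name nltk_data_status is made up for this illustration and is not part of NLTK.

```python
import os

def nltk_data_status(environ=os.environ):
    """Describe whether NLTK_DATA is set in the given environment mapping."""
    value = environ.get('NLTK_DATA')
    if not value:
        return 'NLTK_DATA is not set'
    if not os.path.isdir(value):
        return 'NLTK_DATA is set to %s, but that directory does not exist' % value
    return 'NLTK_DATA is set to %s' % value

# Report on the environment this Python process inherited from the shell.
print(nltk_data_status())
```

Remember that a change to .bash_profile only takes effect in shells started after the change, so log in again (or source the file) before starting Python.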
>>> from nltk.corpus import senseval
>>> korp = senseval.read('hard.pos')

korp is a list of corpus instances!
>>> len(korp)
4333
>>> i0 = korp[0]
>>> i0.context
[('``', '``'), ('he', 'PRP'), ('may', 'MD'), ('lose', 'VB'), ('all', 'DT'),
 ('popular', 'JJ'), ('support', 'NN'), (',', ','), ('but', 'CC'),
 ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'), ('kill', 'VB'),
 ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'), ('him', 'PRP'),
 ('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'),
 ('do', 'VB'), ('.', '.'), ("''", "''")]

An instance's context is a list of (word, part-of-speech) pairs.
>>> i0.senses
('HARD1',)
>>> i0.context[i0.position]
('hard', 'JJ')

The instance's senses attribute tells you which of 3 senses of the word 'hard' is being used in this example. More than one sense at a time is possible, so senses is a tuple. All the examples in this corpus have exactly one sense.
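The three attributes fit together as in the sketch below. So that it runs without the corpus installed, it uses a toy stand-in for a corpus instance: the namedtuple Instance, the abridged context, and the helper name target_and_sense are all assumptions made for this illustration, not part of NLTK.

```python
from collections import namedtuple

# Hypothetical stand-in for a senseval corpus instance, with only the three
# attributes used in the transcript above.
Instance = namedtuple('Instance', ['position', 'context', 'senses'])

# Abridged version of the context of i0; 'hard' is at index 2 here.
toy = Instance(
    position=2,
    context=[('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'),
             ('to', 'TO'), ('do', 'VB'), ('.', '.')],
    senses=('HARD1',))

def target_and_sense(inst):
    """Return the target word, its POS tag, and its (single) sense label."""
    word, tag = inst.context[inst.position]
    return word, tag, inst.senses[0]

# The plain words of the context, with the POS tags dropped.
context_words = [w for (w, tag) in toy.context]

print(target_and_sense(toy))
print(context_words)
```

Taking senses[0] is only safe because every example in this corpus has exactly one sense.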
stopwords = [ 'I', 'a', 'an', 'are', 'as', 'and', 'be', 'com', 'how', 'is', 'it', 'of', 'or', 'that', 'the', 'this', 'to', 'was', 'what', 'when', 'where', 'who', 'will', 'with', 'the', 'www']
features = {}
features['alwaystrue'] = 1

And the last line should be:
return features
features[w] = False

when w does not occur in senseval_inst.context.
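Putting those pieces together, one possible shape for the whole feature function is sketched below. This is only one way to do it. The function name senseval_features, the vocab argument (the list of words that get a feature), the shortened stopword list, and the toy Instance stand-in are all assumptions made so the sketch runs without the corpus installed.

```python
from collections import namedtuple

# Hypothetical stand-in for a senseval corpus instance.
Instance = namedtuple('Instance', ['position', 'context', 'senses'])

# Shortened sample of the stopword list, for illustration only.
stopwords = ['a', 'is', 'that', 'the', 'to']

def senseval_features(senseval_inst, vocab):
    """Map each vocabulary word to True/False: does it occur in the context?"""
    features = {}
    features['alwaystrue'] = 1
    # The context is a list of (word, tag) pairs; keep just the words.
    context_words = set(w for (w, tag) in senseval_inst.context)
    for w in vocab:
        features[w] = w in context_words
    return features

toy = Instance(
    position=2,
    context=[('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'),
             ('to', 'TO'), ('do', 'VB'), ('.', '.')],
    senses=('HARD1',))

# Assumed vocabulary: candidate context words with the stopwords filtered out.
vocab = [w for w in ['do', 'to', 'work', 'rock'] if w not in stopwords]

print(senseval_features(toy, vocab))
```

Because every instance is mapped onto the same vocabulary, all the feature dictionaries have the same keys, just as with the names features.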