Starting
NLTK and NLTK_LITE are installed on bulba. You can do
all your assignments via the comp ling lab machines.
Start up Python and import an nltk module:
% python
>>> from nltk.tokenizer import *
Here nltk is a package and tokenizer is one
of its submodules; the import statement
imports all public names from that submodule into the current
namespace.
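To see what a star import actually binds, the sketch below runs one into an empty dictionary instead of the live session. It uses json.decoder from the standard library as a stand-in for nltk.tokenizer and assumes a current Python 3 interpreter; both choices, and the helper name star_import_names, are illustrative assumptions rather than part of the course setup.

```python
# Stand-in example: json.decoder plays the role of nltk.tokenizer here.
# Assumes Python 3; star_import_names is a helper made up for this sketch.

def star_import_names(module_path):
    """Return the sorted names that 'from module_path import *' would bind."""
    namespace = {}
    exec("from %s import *" % module_path, namespace)
    # exec adds __builtins__ to the dict; skip such double-underscore entries.
    return sorted(name for name in namespace if not name.startswith("__"))

names = star_import_names("json.decoder")
print(names)
```

Because a star import can silently overwrite names you have already defined, importing only the names you need is often the safer habit in longer scripts.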
NLTK Installation
If you wish, you may install NLTK on your home
machine, under either Linux or Windows,
provided you have Python 2.4 installed.
- You must have Python 2.4.
Upgrade if you don't.
- Use the following versions of nltk and nltk_lite, which have some additions
specific to local courses:
- Use NLTK 0.6.X or later from here
- Be sure to download the win32.exe version if you want this
to live in Windows.
- Be sure to download the separate platform
independent "nltk_lite-doc" file. You may also want to
download nltk_lite-corpora if you have the disk space to spare.
- For NLTK to work on your home machine you need the numarray module installed, which
is not part of the standard Python distro. Go to the
numarray module web site
and download the "Source gz" file for numarray 1.5.2 or later (which is described as working
on any platform) even if you're on Windows. This is a SourceForge website, so when you click
on what LOOKS LIKE the download file (a link titled "Download numarray-1.3.3.tar.gz")
you're actually taken to a page that lets you choose a "Mirror site" for
your download. Not until you click on one of those, say Phoenix AZ, does
your actual download begin.
- The install directions for "NLTK with small mods" are in the README file you'll
get in the directory created when you untar your tar
file. There isn't anything comparable for the numarray module
but the install directions are basically the same. To wit:
In Windows, do the following two commands
> C:\Python24\python.exe setup.py build
> C:\Python24\python.exe setup.py install
Of course if you've installed Python elsewhere
on your home machine, the full path to Python will look
different. This is in Windows syntax. The commands for Linux:
% python setup.py build
% python setup.py install
If you are on a Linux machine, you will have to be root to do
the second command, unless you have an unusual setup.
So you don't know how to untar a tar file and you don't have tar installed on your home
machine! Ah, but you DO!
There is a Python module called "tarfile".
>>> import tarfile
>>> tar = tarfile.open("numarray-1.3.3.tar.gz","r:gz")
>>> tar
>>> for tarinfo in tar:
...     tar.extract(tarinfo)
...
>>> tar.close()
Note, in order for the above to work your current working
directory must be the same as that of the tar file (if
not, use a full pathname in the tarfile.open command). The effect
of the above commands will then be
to untar the tar file into a subdirectory of the current working directory.
Where you untar these files doesn't matter, because
you are just using that directory as a workspace from which
the real installation into Python will happen.
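The same steps can be wrapped in a small function that accepts a full pathname, so the current working directory no longer matters. This is only a sketch: untar and its dest_dir parameter are names introduced here for illustration, and the archive name in the usage comment is the one from the transcript above.

```python
import os
import tarfile

def untar(archive_path, dest_dir="."):
    """Extract every member of a gzipped tar archive into dest_dir.

    Returns the list of member names that were extracted.
    """
    archive_path = os.path.abspath(archive_path)
    tar = tarfile.open(archive_path, "r:gz")
    try:
        names = []
        for member in tar.getmembers():
            tar.extract(member, dest_dir)
            names.append(member.name)
        return names
    finally:
        # Close the archive even if an extraction step fails.
        tar.close()

# Hypothetical usage; adjust the path to wherever your download landed:
# untar(os.path.join(os.getcwd(), "numarray-1.3.3.tar.gz"))
```

Passing dest_dir lets you keep the unpacked workspace separate from your download folder.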
Helpful hint: to check your current working directory in
Python, do:
>>> import os
>>> os.getcwd()
'/home/gawron/python/nltk'
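Once the installs finish, one way to confirm that Python can find the new packages is simply to try importing them. The sketch below uses a hypothetical helper, is_importable, defined here for illustration; the module names checked are the ones discussed in this section.

```python
import sys

def is_importable(module_name):
    """Return True if 'import module_name' succeeds in this interpreter."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False

print("Python version: %s" % sys.version.split()[0])
for name in ("numarray", "nltk", "nltk_lite"):
    if is_importable(name):
        print("%s: found" % name)
    else:
        print("%s: NOT found" % name)
```

A "NOT found" line usually means the setup.py install step was run with a different Python than the one you are testing with, or did not complete.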
Modules
The nltk modules are:
- token: classes for representing and processing individual elements of text, such as words and sentences
- probability: classes for representing and processing probabilistic information.
- tree: classes for representing and processing hierarchical information over text.
- cfg: classes for representing and processing context free grammars.
- fsa: finite state automata
- tagger: tagging each word with a part-of-speech, a sense, etc
- parser: building trees over text (includes chart, chunk and probabilistic parsers)
- classifier: classify text into categories
(includes feature, featureSelection, maxent, naivebayes)
- draw: visualize NLP structures and processes
- corpus: access (tagged) corpus data
Many of these have analogues in nltk_lite.
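To see which submodules an installed package actually contains, pkgutil from the standard library can walk the package directory. The sketch below assumes a current Python that ships pkgutil.iter_modules and uses the standard json package as a stand-in, since the set of nltk submodules depends on the version you installed; list_submodules is a name made up for this example.

```python
import pkgutil

def list_submodules(package_name):
    """Return the sorted names of the direct submodules of a package."""
    package = __import__(package_name)
    # iter_modules yields (finder, name, is_package) for each entry found.
    return sorted(name for _, name, _ in pkgutil.iter_modules(package.__path__))

print(list_submodules("json"))
```

Calling list_submodules("nltk") on a machine where NLTK is installed should give a list you can compare against the one above.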
Source Tree
The source trees for nltk and nltk_lite on bulba are at
/usr/lib/python2.4/site-packages/nltk
/usr/lib/python2.4/site-packages/nltk_lite
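On a home machine the packages will usually live somewhere else. A minimal sketch for locating an installed package follows; package_dir is a hypothetical helper, and json is used in the demonstration only because it is always available.

```python
import os

def package_dir(package_name):
    """Return the directory holding a top-level package's source files."""
    module = __import__(package_name)
    return os.path.dirname(os.path.abspath(module.__file__))

print(package_dir("json"))
# With NLTK installed, package_dir("nltk") would typically point
# into a site-packages directory like the ones listed above.
```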
Tokenizer
This is where you can learn how to use the tokenizer module:
Tokenization demo/tutorial.