NLTK Startup Directions


    Starting

    NLTK and NLTK_LITE are installed on bulba. You can do all your assignments via the comp ling lab machines.

    Start up Python and import an nltk module:

    % python
    >>> from nltk.tokenizer import *
    
    Here nltk is a module, tokenizer is one of its submodules, and the import statement imports all public names from the submodule into the current namespace.
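The same pattern works with any package; here is a minimal illustration using the standard library's os.path as a stand-in for nltk.tokenizer:

```python
# Module vs. submodule imports, illustrated with a standard-library
# analogue (os.path stands in for nltk.tokenizer here).
import os.path                      # binds only the top-level name "os"
print(os.path.join("dir", "file"))  # reach the submodule via dotted names

from os.path import *               # copy os.path's public names in
print(join("dir", "file"))          # "join" is now a local name
```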
    Installation

    If you wish, you may install NLTK on your home machine, under either Linux or Windows.

    1. You must have Python 2.4. Upgrade if you don't.
    2. Use the following versions of nltk and nltk_lite, which have some additions specific for local courses:
      1. Use NLTK 0.6.X or later from here
      2. Be sure to download the win32.exe version if you want this to live in Windows.
      3. Be sure to download the separate platform independent "nltk_lite-doc" file. You may also want to download nltk_lite-corpora if you have the disk space to spare.
      4. For NLTK to work on your home machine you need the numarray module installed, which is not part of the standard Python distribution. Go to the numarray module web site and download the "Source gz" file for numarray 1.5.2 or later (described as working on any platform), even if you're on Windows. This is a SourceForge website, so when you click on what LOOKS LIKE the download file (a link titled "Download numarray-1.3.3.tar.gz") you're actually taken to a page that lets you choose a "Mirror site" for your actual download. Only when you click on one of those, say Phoenix AZ, does your actual download begin.
      5. The install directions for "NLTK with small mods" are in the README file you'll get in the directory created when you untar your tar file. There isn't anything comparable for the numarray module but the install directions are basically the same. To wit:
          In Windows, run the following two commands:
          > C:\Python24\python.exe setup.py build
          > C:\Python24\python.exe setup.py install
          
          Of course if you've installed Python elsewhere on your home machine, the full path to Python will look different. This is in Windows syntax. The commands for Linux:
          % python setup.py build
          % python setup.py install
          
          If you are on a Linux machine, you will have to be root to do the second command, unless you have an unusual setup.

      So you don't know how to untar a tar file, and you don't have tar installed on your home machine? Ah, but you DO!

      There is a Python module called "tarfile".

      >>> import tarfile
      >>> tar = tarfile.open("numarray-1.3.3.tar.gz","r:gz")
      >>> for tarinfo in tar:
      ...  tar.extract(tarinfo)
      ... 
      >>> tar.close()
      

      Note: in order for the above to work, your current working directory must be the directory containing the tar file (if it isn't, use a full pathname in the tarfile.open command). The effect of the above commands will then be to untar the tar file into a subdirectory of the current working directory. Where you untar these files doesn't matter, because you are just using that directory as a workspace from which the real installation into Python will happen.
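      As a self-contained check that the extraction loop behaves as described, the following sketch builds a throwaway .tar.gz in a temporary directory and untars it with the same loop (the file names here are made up for the demo):

```python
import os
import tarfile
import tempfile

# Work in a scratch directory so nothing real is touched.
os.chdir(tempfile.mkdtemp())

f = open("hello.txt", "w")
f.write("hello\n")
f.close()

# Pack hello.txt into demo.tar.gz ...
tar = tarfile.open("demo.tar.gz", "w:gz")
tar.add("hello.txt")
tar.close()

# ... delete the original, then untar with the loop from the directions.
os.remove("hello.txt")
tar = tarfile.open("demo.tar.gz", "r:gz")
for tarinfo in tar:
    tar.extract(tarinfo)
tar.close()

print(os.path.exists("hello.txt"))  # the file has been restored
```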

      Helpful hint: to check your current working directory in Python, do:

      >>> import os
      >>> os.getcwd()
      '/home/gawron/python/nltk'
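      If the tar file is not in your current working directory, you can either pass a full pathname to tarfile.open or change directory first with os.chdir; the home directory below is only a stand-in for wherever you actually saved the download:

```python
import os

# Change the working directory before opening the tar file; the home
# directory is just an example target, not where your download lives.
os.chdir(os.path.expanduser("~"))
print(os.getcwd())
```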
      
    Modules  

    The nltk modules are:

    1. token: classes for representing and processing individual elements of text, such as words and sentences
    2. probability: classes for representing and processing probabilistic information.
    3. tree: classes for representing and processing hierarchical information over text.
    4. cfg: classes for representing and processing context free grammars.
    5. fsa: finite state automata
    6. tagger: tagging each word with a part-of-speech, a sense, etc
    7. parser: building trees over text (includes chart, chunk and probabilistic parsers)
    8. classifier: classify text into categories (includes feature, featureSelection, maxent, naivebayes)
    9. draw: visualize NLP structures and processes
    10. corpus: access (tagged) corpus data
    Many of these have analogues in nltk_lite.
    Source Tree

    The source trees for nltk and nltk_lite on bulba are at

    /usr/lib/python2.4/site-packages/nltk
    /usr/lib/python2.4/site-packages/nltk_lite
    
    Tokenizer  

    Here you can learn how to use the tokenizer module:

      Tokenization demo/tutorial.
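    Until you have worked through that tutorial, whitespace tokenization can be approximated with a plain Python string method; this stand-in is not the nltk.tokenizer API, just the underlying idea:

```python
# Crude whitespace tokenization with a plain string method; the real
# nltk tokenizer module wraps this idea in tokenizer classes.
text = "NLTK and NLTK_LITE are installed on bulba ."
tokens = text.split()
print(tokens)
```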