M.A. specialization in Computational Linguistics

Topics
Introduction
Faculty
Courses

Faculty
Jean Mark Gawron
Rob Malouf
Other faculty

Lab
Hardware
Software and Data

Program Website
http://bulba.sdsu.edu

Website for this flyer
http://bulba.sdsu.edu/flyer.htm

Introduction and Statement of Purpose

Computational Linguistics is an interdisciplinary field principally combining elements of linguistics and computer science. Also relevant are issues in cognitive and developmental psychology concerning language acquisition and processing, as well as various heuristic learning and search techniques taken from the fields of machine learning and artificial intelligence.

Through five decades of research in theoretical and applied language processing systems, the field has built up a well-defined body of results and a well-articulated set of research problems, culminating, over the last 10 years, in a period of explosive growth, both in jobs and in sophistication. There are two immediate sources of this growth: first, the successful expansion of information technology into every corner of our lives, and second, a series of recent successes related to the advances in speech recognition technology in the 1980s and 1990s.

The result has been a growth not only in the number of jobs but also in the number of application areas in computational linguistics. Natural language or speech interfaces have been developed or are in development for database programs, document search engines, operating systems and software desktops, computer games, and interactive computer simulations such as battlefield simulations. Natural language capability has thus become an important component of the study of human-computer interfaces, as speech and text input become an increasingly important auxiliary to graphical and multimodal systems. In addition, as integrated circuit technology has migrated out of the computer and into every corner of our lives, speech and language technology has been or is being integrated into toys, household appliances, medical instruments, air traffic control systems, and GPS navigation systems.

Responding to the University's goals of embracing technology, developing community partnerships, and heightening interdisciplinary connections, the Linguistics Department has designed a new M.A. specialization in computational linguistics. This specialization will provide a coherent array of courses leading to a level of expertise sufficient for employment in industry, thereby directly and significantly broadening our service to students and the community. The new specialization is also designed to provide a strong foundation for Ph.D. study.

Faculty

Jean Mark Gawron. Jean Mark Gawron has been at a number of centers of computational linguistic research, including the University of Edinburgh Epistemics and AI programs, the New York University Courant Institute, and the Center for the Study of Language and Information at Stanford. For a number of years he was a member of the Natural Language group at SRI International. He joined the Department of Linguistics and Oriental Languages to start up a computational linguistics program in the fall of 2000. His research interests include semantics and pragmatics, probabilistic parsing, interleaving of syntactic and semantic constraints, machine translation, and computational lexical semantics.

Web-site: http://www.rohan.sdsu.edu/~gawron.

Rob Malouf. Rob Malouf is currently a post-doc in the humanities computing department at the University of Groningen in the Netherlands and will be joining the SDSU Computational Linguistics program in the fall of 2002. Rob was a member of the NWO PIONIER research project Algorithms for Linguistic Processing. Before that, he was involved in the project "Computational and theoretical modeling of structures in performance," part of the School of Behavioral and Cognitive Neurosciences at the University of Groningen. His work is aimed at integrating symbolic grammars with knowledge derived automatically from corpora via machine learning techniques. He is particularly interested in data-driven approaches to modeling the development and behavior of syntactic categories. Recently, he worked on using memory-based learning to extract rules for ordering prenominal adjectives in English from large corpora. His current work focuses on applying maximum entropy/minimum divergence modeling to attribute-value grammar formalisms such as HPSG using big computers. He is also interested in other applications of probability theory.

Web-site: http://odur.let.rug.nl/~malouf/

Other faculty:

    http://www-rohan.sdsu.edu/dept/linguist/lol.html

Courses

Computational Linguistics Lab

A key feature of the computational linguistics program at SDSU is the integration of a practical lab component into most courses.

The computational linguistics lab, located in the Professional Studies and Fine Arts Building, is housed as a separate lab within the Social Science Research Lab.

The lab consists of a subnetwork of 15 Pentium III PCs running Red Hat Linux, with RAM ranging from half a gigabyte to 2 gigabytes. There is also a Sun Ultra 80. The PCs are equipped with read-write CD drives, microphones, sound and video cards, and individual hard disks.

Over and above the actual hardware, the language modeling lab has three components.

  1. Linguistic Corpora: Linguistic corpora are bodies of language data that have been preprocessed and partially or completely annotated for some kind of linguistic structure. Annotation ranges from simple part-of-speech tagging, to marking morphological roots (for example, marking a verb form such as broken with its root break), to marking complete syntactic analyses. The last kind of corpus, called a treebank, is the most relevant for statistically training language models to assign syntactic structure. Some of the corpora:
    • Penn Treebank
    • CSR I, II, and III (speech recognition training data)
    • Wall Street Journal, New York Times
    • British National Corpus
    • Switchboard
  2. Large-vocabulary Resources: Online resources that provide linguistic information about large vocabularies. Large-scale online linguistic resources in turn provide the bootstrapping gear for processing corpora. Online dictionaries provide the starting point for syntactic processing and information retrieval. Thesauri play a role in information retrieval and semantic processing. More sophisticated resources such as WordNet, a semantic network of English words encoding relations such as synonymy and hyponymy, and FrameNet, which encodes relational properties of words, will enable deeper semantic processing. Some of the large-vocabulary resources:
    • FrameNet
    • WordNet
    • Comlex
  3. A Natural Language Understanding (NLU) system: The NLU system assigns syntactic and semantic analyses to previously unseen data. This is the system to be trained on one subset of the data and tested on another.
    • Gemini Natural Language System (SRI)
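
As a rough illustration of the three components above, the following Python sketch shows one sentence at the three levels of annotation mentioned (part-of-speech tags, morphological roots, and a treebank-style bracketed parse), along with a simple way of holding out part of a corpus for testing. The sentence, tags, and function names are invented for illustration; they are not taken from the lab's corpora or from the Gemini system.

```python
import random

# Hypothetical toy data: one sentence at three levels of annotation.
pos_tagged = [("The", "DT"), ("window", "NN"), ("was", "VBD"), ("broken", "VBN")]
roots = {"was": "be", "broken": "break"}  # inflected form -> morphological root
bracketed = "(S (NP (DT The) (NN window)) (VP (VBD was) (VP (VBN broken))))"


def parse_tree(text):
    """Parse a bracketed string into nested lists of the form [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            node = [tokens[pos]]  # the constituent label, e.g. "NP"
            pos += 1
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # skip the closing parenthesis
            return node
        word = tokens[pos]
        pos += 1
        return word

    return read()


def leaves(tree):
    """Return the words of a parsed tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the label
        words.extend(leaves(child))
    return words


def split_corpus(items, test_fraction=0.2, seed=0):
    """Shuffle items and return (train, test) with no overlap between them."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


tree = parse_tree(bracketed)
train, test = split_corpus(range(10))
```

Here parse_tree(bracketed) recovers the nested structure of the parse, leaves(tree) returns the original words in order, and split_corpus keeps the training and test portions disjoint, so that a system is evaluated only on data it has not seen.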

For more detailed information about the lab, visit the lab website: http://bulba.sdsu.edu/research/description.html.

Website for this flyer
http://www.rohan.sdsu.edu/~gawron/curric/flyer.htm