Linguistics 571

Computational Corpus Linguistics

Prerequisites: At least two linguistics courses

Required Texts

Schwartz, R. L., Olson, E., and Christiansen, T. 1997. Learning Perl. O'Reilly.

Wall, L., Christiansen, T., and Orwant, J. 1997. Programming Perl. O'Reilly.

Course Description

This course is simultaneously a practical, hands-on introduction to computation with text corpora and an introduction to Perl. Issues covered include the structure of text databases and tools for searching them, as well as tokenizing, part-of-speech tagging, and lemmatizing (stemming) large corpora. Students are required to write Unix scripts and Perl programs. Ideally, the course should be useful both to those interested in pursuing linguistic research using large corpora and to linguists seeking an introduction to programming.

Grading

Programming projects (100%)

Course Outline

Week 1: Introduction to corpus linguistics and computational corpus linguistics.

Week 2: Introduction to Perl.

Week 3: Definition of word. Tokenization.

Week 4: Part-of-speech tag sets. Automatic tagging.

Week 5,6: Lemmatization.

Week 7: Pattern searches on corpora.

Week 8: Finding collocations.

Week 9: Building bigrams.

Week 10: Generalized text markup schemes. SGML.

Week 11: Syntactically annotated corpora. The Penn Treebank.

Week 12,13: Finding subcategorization frames.

Week 14,15: Semantic resources. WordNet and FrameNet.