Linguistics 596

Search with egrep

Examples using the Wall Street Journal corpus

The Wall Street Journal Corpus is in the following directory:

  1. /home/wsj/

When you log onto bulba, you see something like the following:

This means you can type commands. You can verify that you can "see" the corpus files by typing the command "ls" (for "list directory contents") and telling it to look at the wsj directory. Be sure to hit carriage return after typing the command. You should then see a list of directories. The files you're interested in are in these directories. They have names like ws940701.
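As a rough sketch of this step, the commands below list a directory by naming it as an argument to ls. Since /home/wsj/ exists only on bulba, the example first builds a small stand-in directory. The subdirectory name 1994 and the file names are assumptions, modeled on the ws940701 naming pattern.

```shell
# Hypothetical stand-in for /home/wsj/, which exists only on bulba.
wsj=$(mktemp -d)
mkdir "$wsj/1994"
touch "$wsj/1994/ws940701" "$wsj/1994/ws940702"

# Tell ls which directory to look at by naming it as an argument.
# On bulba the equivalent would be: ls /home/wsj/
listing=$(ls "$wsj")
echo "$listing"
```

Here ls shows only the subdirectory, not the files inside it. To see those you either name the subdirectory as well or connect to it first.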

First, connect to one of these directories.

Then list the contents:

Notice you don't have to say what directory to list. The default is always the one you're connected to.
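The two steps above, connecting to a directory and then listing it, might look like the following sketch. The directory is a throwaway stand-in created for the example. On bulba you would presumably connect to one of the directories under /home/wsj/ instead.

```shell
# Hypothetical stand-in for one of the directories under /home/wsj/.
wsj=$(mktemp -d)
mkdir "$wsj/1994"
touch "$wsj/1994/ws940701" "$wsj/1994/ws940702"

# "cd" (change directory) connects you to a directory.
cd "$wsj/1994"

# With no argument, ls lists the directory you are connected to.
listing=$(ls)
echo "$listing"
```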

Now you can do egrep commands on all these files. Let's try to find the word 'anointed' in the wsj corpus.

[gawron@bulba 1994]$ egrep 'anointed' *
ws940719: There is no home for males in Ms. McIntosh's creation; we are the anointed
ws940725:problem is that this portrait of the anointed dragon-slayer doesn't much square
ws940831: Enter MCI, which in late February anointed Nextel as its sole wireless play
ws940930: Dripping with arrogance and superiority, these self-anointed, credentialed
ws941003:caught the attention of advertising executives by dismissing its newly anointed
ws941012:Schroeder to label the coach a "self-anointed ayatollah."
ws941024:generosity toward the east Germans and was last summer anointed by President
ws941027:judiciary in an article debating Kishore Mahbubani, Singapore's anointed
ws941101: These lists usually consist of 20 to 50 stocks anointed by a stock-selection
ws941121:rules. There was the winner, newly anointed the best fighter in the world pound
ws941125:kids these days. Someone has, in fact, anointed this 27-year-old as a

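Egrep retrieves every line that contains the search string, and each line is prefixed with the name of the file it came from. Notice that "self-anointed" is retrieved too, because the string only has to occur somewhere in the line.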

There is a basic size facility for files in Unix called 'wc' (for "word count"). It measures the number of lines, the number of words, and the number of bytes in a file. We can look at the relative size of our two corpora using wc.
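A minimal sketch of wc on a single small file is given below. The file and its contents are made up for the example. On bulba you would point wc at the corpus files instead.

```shell
# Hypothetical two-line sample file standing in for a corpus file.
dir=$(mktemp -d)
printf 'one two three\nfour five\n' > "$dir/ws940701"

# wc prints three numbers and then the file name: lines, words, bytes.
wc "$dir/ws940701"

# -l, -w and -c each select a single count. Reading from standard
# input (with "<") keeps the file name out of the output.
lines=$(wc -l < "$dir/ws940701")
words=$(wc -w < "$dir/ws940701")
bytes=$(wc -c < "$dir/ws940701")
echo "lines=$lines words=$words bytes=$bytes"
```

Given several file names (for instance with *), wc prints one row per file followed by a total, which is a convenient way to compare the sizes of two directories of files.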

We can use the -c option on egrep to COUNT the number of lines in each file on which a word occurs.

For most of the files the result is 0.
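The counting search might look like the sketch below. It runs on two made-up files in a throwaway directory rather than on the real corpus, so the file contents are assumptions. Only the egrep -c call itself carries over to bulba.

```shell
# Hypothetical sample files standing in for the corpus.
dir=$(mktemp -d)
printf 'newly anointed winner\nself-anointed ayatollah\nno match\n' > "$dir/ws940701"
printf 'nothing relevant here\n' > "$dir/ws940702"
cd "$dir"

# -c prints, for each file, its name and the number of matching lines.
# It counts lines, not occurrences: a line with two matches counts once.
counts=$(egrep -c 'anointed' *)
echo "$counts"
```

Every file gets a row, including files with no matching lines, which is why most rows show 0 on a large corpus. On recent GNU systems egrep may also print a warning recommending grep -E, which gives the same results.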

Finally, it's useful to be able to store the results of a search. You can do this in Linux by REDIRECTING the results of your search into a file. The results then won't be displayed on screen.

Be sure to store it in a file in your OWN home directory. You don't have write permission in either my directory or the Wall Street Journal directories.
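A sketch of redirection follows. The sample files and the output location are hypothetical. On bulba the output file would be something under your own home directory, for example a name of your choosing such as ~/anointed.txt.

```shell
# Hypothetical sample files standing in for the corpus.
dir=$(mktemp -d)
printf 'newly anointed winner\nno match\n' > "$dir/ws940701"
printf 'the anointed dragon-slayer\n' > "$dir/ws940702"
cd "$dir"

# Hypothetical output file, created outside the directory being searched.
out=$(mktemp)

# ">" sends egrep's output into the file instead of onto the screen.
# Note that ">" overwrites the file if it already exists.
egrep 'anointed' * > "$out"

# Display the stored results.
cat "$out"
```

Keeping the output file outside the directory being searched also prevents the * pattern from picking up the results file itself on a later search.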

We can look at this file a screenful at a time by using the 'more' command:

A screenful of the file will appear in your bulba window. Every time you hit the space bar you see a new screenful; hitting return advances one line at a time. Typing "q" for "quit" will get you out of "more."