Search with egrep
The Wall Street Journal Corpus is in the following directory:
When you log onto bulba, you see something like the following:
First, connect to one of these directories.
Then list the contents:
Now you can do egrep commands on all these files. Let's try to find the word 'anointed' in the wsj corpus.

[gawron@bulba 1994]$ egrep 'anointed' *
ws940719: There is no home for males in Ms. McIntosh's creation; we are the anointed
ws940725:problem is that this portrait of the anointed dragon-slayer doesn't much square
ws940831: Enter MCI, which in late February anointed Nextel as its sole wireless play
ws940930: Dripping with arrogance and superiority, these self-anointed, credentialed
ws941003:caught the attention of advertising executives by dismissing its newly anointed
ws941012:Schroeder to label the coach a "self-anointed ayatollah."
ws941024:generosity toward the east Germans and was last summer anointed by President
ws941027:judiciary in an article debating Kishore Mahbubani, Singapore's anointed
ws941101: These lists usually consist of 20 to 50 stocks anointed by a stock-selection
ws941121:rules. There was the winner, newly anointed the best fighter in the world pound
ws941125:kids these days. Someone has, in fact, anointed this 27-year-old as a
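Egrep retrieves every line that contains the search string, and prefixes each line with the name of the file it was found in. Note that it matches the string anywhere in a line, so 'self-anointed' is retrieved too.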
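To try this without access to the corpus, here is a minimal sketch. It assumes a throwaway directory holding two tiny made-up files; the names file_a and file_b and their contents are invented for illustration and are not WSJ text. On recent GNU systems 'grep -E' is the equivalent spelling of egrep.

```shell
# Made-up stand-in for a corpus directory (NOT real WSJ text).
demo=$(mktemp -d)
printf 'the newly anointed champion\nnothing to see here\n' > "$demo/file_a"
printf 'no match on this line\na self-anointed expert\n' > "$demo/file_b"

# Search every file in the directory. When several files are searched,
# each matching line is prefixed with the name of the file it came from.
matches=$(cd "$demo" && egrep 'anointed' *)
echo "$matches"
```

Only the two lines containing 'anointed' come back, one from each file.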
Unix has a basic facility for measuring the size of files, called 'wc'. It reports the number of lines, the number of words, and the number of bytes in a file. We can compare the sizes of our two corpora using wc.
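As a small illustration of what wc reports, here is a sketch that assumes two tiny made-up files in a throwaway directory (the names first and second are hypothetical, and the files stand in for real corpus files).

```shell
# Two small made-up files in a temporary directory.
demo=$(mktemp -d)
printf 'one two three\nfour five\n' > "$demo/first"
printf 'a b\nc d\ne f\n' > "$demo/second"

# Columns are: lines, words, bytes, file name. With more than one
# file, wc also prints a final "total" line.
(cd "$demo" && wc first second)

# -l, -w and -c each select a single column.
lines=$(wc -l < "$demo/first" | tr -d ' ')
words=$(wc -w < "$demo/first" | tr -d ' ')
bytes=$(wc -c < "$demo/first" | tr -d ' ')
echo "$lines lines, $words words, $bytes bytes"
```

Running wc on all the files of each corpus in turn and comparing the "total" lines gives a rough idea of their relative size.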
We can use the -c option on egrep to COUNT the number of lines in each file on which a word occurs.
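Here is a minimal sketch of the -c option, again assuming made-up files in a throwaway directory (file_a, file_b and file_c are hypothetical names). Notice that -c counts matching LINES, not occurrences: the first line of file_a contains the word twice but counts once.

```shell
# Made-up files; file_a has one line with two occurrences of the word.
demo=$(mktemp -d)
printf 'anointed once, anointed twice\nno match here\n' > "$demo/file_a"
printf 'self-anointed\nnewly anointed\nnothing\n' > "$demo/file_b"
printf 'no hits at all\n' > "$demo/file_c"

# One "name:count" line per file, including files with no matches.
counts=$(cd "$demo" && egrep -c 'anointed' *)
echo "$counts"
```

Files with no matching lines are still listed, with a count of 0.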
Finally, it's useful to be able to store the results of a search. You can do this in Linux by REDIRECTING the results of your search into a file. The results then go into the file instead of being displayed onscreen.
We can look at this file a screenful at a time by using the 'more' command:
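The redirect-and-view steps can be sketched as follows. This assumes a made-up corpus directory and a separate throwaway directory for the saved results; the file name anointed.txt is hypothetical. Writing the results outside the directory being searched keeps a later 'egrep ... *' from picking up the results file itself.

```shell
demo=$(mktemp -d)      # made-up corpus directory (NOT real WSJ text)
out=$(mktemp -d)       # separate place for saved results
printf 'the newly anointed champion\nnothing here\n' > "$demo/file_a"
printf 'a self-anointed expert\n' > "$demo/file_b"

results="$out/anointed.txt"    # hypothetical file name
# '>' sends the matches into the file; nothing is printed by the search.
(cd "$demo" && egrep 'anointed' * > "$results")

# At a terminal, 'more' pages through the file a screenful at a time:
#   more "$results"
saved=$(wc -l < "$results" | tr -d ' ')
echo "$saved matching lines saved to $results"
```

Inside 'more', the space bar moves forward one screenful and 'q' quits.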