Top-Down Parsing
Think of this as an algorithm that takes one particular view of how to do a derivation. Suppose we are trying to parse the sentence John likes Richard M. Nixon (R.M.N. for short) with respect to the grammar whose rules are:

S -> NP VP, S -> VP, NP -> Art N, NP -> Art Adj N, NP -> PN, VP -> V NP, VP -> V PP

Non-terminals are symbols that occur on the LHS of a rule; the non-terminals in this grammar include S, NP, VP, Art, Adj, N, PN, V, and PP. The terminals include the words John, likes, R.M.N., eats, in, on, under, the, a, red, and big.
The START symbol of the grammar is S. Then a derivation that yields the sentence is:
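As a sketch, a top-down left-to-right derivation of the sentence can be written out and checked mechanically. The rules below are the subset actually used in this derivation (the category names and lexical entries John: PN, likes: V, R.M.N.: PN all come from the recognition trace later in this section):

```python
# Only the rules this particular derivation uses.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["PN"]],
    "VP": [["V", "NP"]],
    "PN": [["John"], ["R.M.N."]],
    "V":  [["likes"]],
}

def is_one_step(before, after):
    """True iff `after` results from rewriting exactly one symbol
    of `before` with one of RULES."""
    for i, sym in enumerate(before):
        for rhs in RULES.get(sym, []):
            if before[:i] + rhs + before[i + 1:] == after:
                return True
    return False

# Each line rewrites the leftmost non-word symbol of the line above.
derivation = [
    ["S"],
    ["NP", "VP"],
    ["PN", "VP"],
    ["John", "VP"],
    ["John", "V", "NP"],
    ["John", "likes", "NP"],
    ["John", "likes", "PN"],
    ["John", "likes", "R.M.N."],
]

# Every adjacent pair is a legal rewrite; the last line is all words.
assert all(is_one_step(a, b) for a, b in zip(derivation, derivation[1:]))
```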
Derivations
Notice there are many derivations that could produce the sentence from this grammar.
Both derivations start with the start symbol and begin a sequence of re-writes, terminating with a line that consists ONLY of terminals. Therefore these are top-down derivations. The first derivation picked a particular symbol and kept re-writing until the result was all terminals, then picked the next symbol, and so on, moving left to right. This is a top-down, left-to-right, depth-first derivation. The second derivation just rewrote symbols in order of occurrence from left to right. This is a top-down, left-to-right, breadth-first derivation.
Bottom-up Derivation
Derivations do not have to start with the start symbol in order to fulfill the role we want of them (justifying that a particular string is licensed by a particular grammar). We can also start with the string and work our way to the start symbol.
Notice this is exactly the reverse of the first top-down depth-first derivation.
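In code, a bottom-up derivation replaces an occurrence of some rule's right-hand side with its left-hand side at each step, ending at S. This sketch lists the steps (the reverse of the depth-first derivation) and checks each reduction; the rules are the ones named in the recognition trace below:

```python
# Rules used in the bottom-up derivation, written as (LHS, RHS) pairs.
RULES = [("S", ["NP", "VP"]), ("NP", ["PN"]), ("VP", ["V", "NP"]),
         ("PN", ["John"]), ("PN", ["R.M.N."]), ("V", ["likes"])]

def reduces_to(before, after):
    """True iff `after` results from one reduction (RHS -> LHS) of `before`."""
    for lhs, rhs in RULES:
        n = len(rhs)
        for i in range(len(before) - n + 1):
            if before[i:i + n] == rhs and \
               before[:i] + [lhs] + before[i + n:] == after:
                return True
    return False

# Start from the words, end at the start symbol.
bottom_up = [
    ["John", "likes", "R.M.N."],
    ["John", "likes", "PN"],
    ["John", "likes", "NP"],
    ["John", "V", "NP"],
    ["John", "VP"],
    ["PN", "VP"],
    ["NP", "VP"],
    ["S"],
]
assert all(reduces_to(a, b) for a, b in zip(bottom_up, bottom_up[1:]))
```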
Trees
A derivation can be associated with a tree in a fairly obvious way. All three of the derivations we just looked at produce the same tree. It turns out we don't usually care what order the steps in a derivation were found. What we're usually after is just the groupings of phrases they describe, because this is what is connected to meaning. Thus what we want is the tree, not the derivation. This is true in parsing computer languages as well. Derivations are still very useful, though, because computing a tree will always mean finding a particular derivation (the steps have to be done in some order!). So they help us classify parsing algorithms.
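The shared tree can be sketched as nested tuples of the form (category, child, ...), with words at the leaves; its fringe (the leaves read left to right) is the sentence itself:

```python
# The tree shared by all three derivations, as nested tuples.
tree = ("S",
        ("NP", ("PN", "John")),
        ("VP", ("V", "likes"),
               ("NP", ("PN", "R.M.N."))))

def yield_of(t):
    """Return the fringe (left-to-right leaves) of a tree."""
    if isinstance(t, str):          # a leaf: a single word
        return [t]
    words = []
    for child in t[1:]:             # t[0] is the category label
        words += yield_of(child)
    return words

assert yield_of(tree) == ["John", "likes", "R.M.N."]
```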
Parsing
We can now define what parsers and recognizers are. Recognizers determine whether a particular string is licensed by a particular grammar; they give a yes/no answer. They do this by finding a derivation. Parsers find the tree or trees for a string licensed by the grammar. They also do this by finding a derivation. So the two kinds of algorithms are very closely connected.
Parser states: Recognition
We pair a list of categories we are trying to find with the span of input the search starts at. A parser state is thus the current goal of the computation. It consists of two pieces of information: a goal list (the categories and/or words still to be found) and a word list (the input not yet consumed). For top-down parsing, we always start in the state ((S) (entire input)): the goal list contains just the start symbol, and the input span is the whole sentence.
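Concretely, the pair can be sketched as a tiny class (the name ParserState matches the one the recognizer code later in this section assumes):

```python
class ParserState:
    """A recognizer state: categories still to be found, paired with
    the remaining input."""
    def __init__(self, GoalList, WordList):
        self.GoalList = GoalList    # categories/words still to find
        self.WordList = WordList    # input not yet consumed

# Initial state for top-down recognition of "John likes R.M.N.":
initial = ParserState(["S"], ["John", "likes", "R.M.N."])
assert initial.GoalList == ["S"]
```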
Using the simple grammar from above, here is the recognizer's search for John likes R.M.N.:
Step | Current State | State Stack | Rewrite / Found
---|---|---|---
1. | ((S) (John likes R.M.N.)) |   | initial state
2. | ((NP VP) (John likes R.M.N.)) | ((VP) (John likes R.M.N.)) | S -> NP VP (current state); S -> VP (state stack)
3. | ((Art N VP) (John likes R.M.N.)) | ((Art Adj N VP) (John likes R.M.N.)); ((PN VP) (John likes R.M.N.)); ((VP) (John likes R.M.N.)) | NP -> Art N (current state); NP -> Art Adj N (state stack); NP -> PN (state stack)
4. | ((Art Adj N VP) (John likes R.M.N.)) | ((PN VP) (John likes R.M.N.)); ((VP) (John likes R.M.N.)) | Popped state stack!! "John" is not an Art
5. | ((PN VP) (John likes R.M.N.)) | ((VP) (John likes R.M.N.)) | Popped state stack!! "John" is not an Art
6. | ((VP) (likes R.M.N.)) | ((VP) (John likes R.M.N.)) | Lexical pop! "John" is a PN
7. | ((V NP) (likes R.M.N.)) | ((V PP) (likes R.M.N.)); ((VP) (John likes R.M.N.)) | VP -> V NP (current state); VP -> V PP (state stack)
8. | ((NP) (R.M.N.)) | ((V PP) (likes R.M.N.)); ((VP) (John likes R.M.N.)) | Lexical pop! "likes" is a V
9. | ((Art N) (R.M.N.)) | ((Art Adj N) (R.M.N.)); ((PN) (R.M.N.)); ((V PP) (likes R.M.N.)); ((VP) (John likes R.M.N.)) | NP -> Art N (current state); NP -> Art Adj N (state stack); NP -> PN (state stack)
10. | ((Art Adj N) (R.M.N.)) | ((PN) (R.M.N.)); ((V PP) (likes R.M.N.)); ((VP) (John likes R.M.N.)) | Popped state stack!! "Richard" is not an Art
11. | ((PN) (R.M.N.)) | ((V PP) (likes R.M.N.)); ((VP) (John likes R.M.N.)) | Popped state stack!! "Richard" is not an Art
12. | (() ()) | ((V PP) (likes R.M.N.)); ((VP) (John likes R.M.N.)) | Lexical pop! "R.M.N." is a PN. Done!!
Parser states: Parsing
For parsing, a parser state is a triple: a node list, the remaining input, and the tree built so far. We start in the state whose node list is (S), whose input is the whole sentence, and whose tree is just the node S.
Parsing
Here are the steps of the parsing computation for John likes R.M.N.; each step is like the corresponding recognition step, but the state additionally carries the tree built so far.
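To make the tree-building concrete, here is a minimal sketch of a parser that returns the tree rather than just yes/no. It is a greedy recursive-descent routine (it does not backtrack across earlier completed choices, which is enough for this sentence); the grammar is the one named in the recognition trace above, restricted to the rules that can be reached:

```python
# Grammar rules named in the recognition trace above; any category
# with no entry here is treated as a literal word.
GRAMMAR = {
    "S":   [["NP", "VP"], ["VP"]],
    "NP":  [["Art", "N"], ["Art", "Adj", "N"], ["PN"]],
    "VP":  [["V", "NP"], ["V", "PP"]],
    "PN":  [["John"], ["R.M.N."]],
    "V":   [["likes"]],
    "Art": [["the"], ["a"]],
    "Adj": [["red"], ["big"]],
}

def parse_cat(cat, words):
    """Parse category `cat` at the front of `words`.
    Return (tree, remaining_words) for the first parse found, else None.
    Trees are nested tuples: (category, child, child, ...)."""
    if cat not in GRAMMAR:                # a word: must match the input
        if words and words[0] == cat:
            return cat, words[1:]
        return None
    for rhs in GRAMMAR[cat]:              # try productions in order
        children, rest = [], words
        for sym in rhs:
            result = parse_cat(sym, rest)
            if result is None:
                break
            child, rest = result
            children.append(child)
        else:                             # every symbol of rhs parsed
            return (cat, *children), rest
    return None

tree, rest = parse_cat("S", "John likes R.M.N.".split())
assert rest == []
assert tree == ("S", ("NP", ("PN", "John")),
                     ("VP", ("V", "likes"), ("NP", ("PN", "R.M.N."))))
```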
Pseudocode Version
Note: This version has lots of errors. Please check the text errata for corrections. Pseudocode (my version, which disagrees even with the corrected version):

```
function TOP-DOWN-PARSE(Input, Grammar) returns a parse tree

  agenda <- (Initial S tree, Beginning of input)
  current-search-state <- POP(agenda)
  loop
    if SUCCESSFUL-PARSE?(current-search-state) then
      return TREE(current-search-state)
    else if CAT-OF-NODE(current-search-state) is a POS then
      if CAT-OF-NODE(current-search-state) in
          POS(CURRENT-INPUT(current-search-state)) then
        PUSH(LEXICAL-POP(current-search-state), agenda)
      else
        current-search-state <- POP(agenda)
    else
      expansions <- EXPANSIONS(CAT-OF-NODE(current-search-state))
      current-expansion <- POP(expansions)
      current-search-state <-
          EXPAND-CURRENT-NODE(current-search-state, current-expansion)
      agenda <- APPEND(expansions, agenda)
  end
```
Python Version
Assume a Parser class.
Full-blown code is given here.
In the following code many of the function definitions begin with long multi-line strings. These are Python documentation strings ("docstrings"). You can look at the documentation of ANY Python method or function by looking at its __doc__ attribute. For example, assume the definition of recognize_string below, assume it is a method defined for some class Parser, and that p1 is an instance of that class. Then I could look at the documentation of recognize_string as follows:

```
>>> print(p1.recognize_string.__doc__)

    {string}: A string.  The string to be parsed.  Immediately
        converted to a [WordList] by self.legal_input.
```

Now let's look at the code of a simple top-down recognizer (note: the code below differs in minor details from the code in the full-blown implementation, mostly in calls to trace functions).

```python
def recognize_string(self, string=''):
    """
    {string}: A string.  The string to be parsed.  Immediately
        converted to a [WordList] by self.legal_input.
    [Return]: The result of calling recurse_recognize on the
        start-cat, the empty agenda, and the wordlist generated
        from string.
    """
    if self.legal_input(string):
        return self.recurse_recognize([self.start_cat], [], self.input)
    else:
        return False

def recurse_recognize(self, goals, agenda, wordlist):
    """
    {goals}: [GoalList]: a list of categories and/or words
        proposed to cover wordlist.
    {agenda}: a list of [ParserState]s, each a pair
        ([GoalList], [WordList]).
    {wordlist}: a list of words.
    [Return]: True iff {goals} and {wordlist} are empty.

    Suppose {goals} is non-empty: if goals[0] is a non-terminal,
    call expand (expand it with the grammar and continue
    recognizing recursively); if goals[0] is a terminal, call
    match_word (try to match the next word and continue).  Both
    expand and match_word contain recursive calls to
    recurse_recognize.

    Suppose {goals} is empty: if {wordlist} is also empty,
    [Return] True.  Else backtrack (if {agenda} is non-empty,
    continue recognizing with the next state on the agenda; else
    [Return] False).  backtrack also contains a recursive call
    to recurse_recognize.
    """
    if len(goals) > 0:
        next_goal = goals[0]
        if next_goal not in self.terminals:
            return self.expand(next_goal, goals[1:], agenda, wordlist)
        else:
            return self.match_word(next_goal, wordlist, goals[1:], agenda)
    elif len(wordlist) == 0:
        return True
    else:
        return self.backtrack(agenda)

def expand(self, next_goal, rest_goals, agenda, wordlist):
    """
    Expand next_goal using the grammar (self.productions).
    Generate a new GoalList using one production and add the
    unused productions to the agenda.  Keep recognizing with the
    new GoalList, the new agenda, and the old WordList.
    """
    productions = self.productions[next_goal]
    next_production = productions[0]
    rest_productions = productions[1:]
    for p in rest_productions:
        agenda = [ParserState(p + rest_goals, wordlist)] + agenda
    return self.recurse_recognize(next_production + rest_goals,
                                  agenda, wordlist)

def match_word(self, next_goal, wordlist, rest_goals, agenda):
    """
    {next_goal} is a terminal.

    Suppose {wordlist} is non-empty: if next_goal matches
    wordlist[0], keep recognizing with {rest_goals}, {agenda},
    and wordlist[1:]; else this parse path fails: call backtrack
    (if {agenda} is non-empty, continue with the next parser
    state on {agenda}; if {agenda} is empty, [Return] False).

    Suppose {wordlist} is empty: this parse path fails.  Call
    backtrack.
    """
    if len(wordlist) > 0:
        if next_goal == wordlist[0]:
            # We just matched a word!  Discard it and keep
            # recognizing the rest of the wordlist.
            return self.recurse_recognize(rest_goals, agenda, wordlist[1:])
        else:
            return self.backtrack(agenda)
    else:
        return self.backtrack(agenda)

def backtrack(self, agenda):
    """
    Suppose {agenda} is non-empty: then agenda[0].GoalList gives
    the new goals and agenda[0].WordList the new wordlist.  Keep
    recognizing with the new goals, the new wordlist, and the
    popped agenda.

    Suppose {agenda} is empty: [Return] False.
    """
    if len(agenda) > 0:
        return self.recurse_recognize(agenda[0].GoalList, agenda[1:],
                                      agenda[0].WordList)
    else:
        return False
```
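The methods above presuppose a Parser class with self.productions, self.terminals, self.start_cat, self.input, and self.legal_input, plus the ParserState pair. Here is a hedged, self-contained completion you can actually run. Everything outside the methods is an assumption for illustration: legal_input here just splits on whitespace, terminals are computed as right-hand-side symbols that never head a rule, and the noun "sandwich" is a made-up lexical entry (the notes never name an N):

```python
class ParserState:
    """A pair of a GoalList (categories/words still to find) and a
    WordList (input not yet consumed)."""
    def __init__(self, GoalList, WordList):
        self.GoalList = GoalList
        self.WordList = WordList


class Parser:
    """Minimal top-down recognizer wrapping the methods shown above."""

    def __init__(self, productions, start_cat="S"):
        self.productions = productions   # nonterminal -> list of RHS lists
        self.start_cat = start_cat
        # Terminals are RHS symbols that never occur on a LHS.
        rhs_symbols = {sym for rhss in productions.values()
                       for rhs in rhss for sym in rhs}
        self.terminals = rhs_symbols - set(productions)

    def legal_input(self, string):
        # The full implementation checks the vocabulary; here we
        # just tokenize on whitespace.
        self.input = string.split()
        return len(self.input) > 0

    def recognize_string(self, string=''):
        if self.legal_input(string):
            return self.recurse_recognize([self.start_cat], [], self.input)
        return False

    def recurse_recognize(self, goals, agenda, wordlist):
        if len(goals) > 0:
            next_goal = goals[0]
            if next_goal not in self.terminals:
                return self.expand(next_goal, goals[1:], agenda, wordlist)
            return self.match_word(next_goal, wordlist, goals[1:], agenda)
        if len(wordlist) == 0:
            return True
        return self.backtrack(agenda)

    def expand(self, next_goal, rest_goals, agenda, wordlist):
        productions = self.productions[next_goal]
        for p in productions[1:]:        # save the unused alternatives
            agenda = [ParserState(p + rest_goals, wordlist)] + agenda
        return self.recurse_recognize(productions[0] + rest_goals,
                                      agenda, wordlist)

    def match_word(self, next_goal, wordlist, rest_goals, agenda):
        if wordlist and next_goal == wordlist[0]:
            return self.recurse_recognize(rest_goals, agenda, wordlist[1:])
        return self.backtrack(agenda)

    def backtrack(self, agenda):
        if agenda:
            return self.recurse_recognize(agenda[0].GoalList, agenda[1:],
                                          agenda[0].WordList)
        return False


# The grammar of this section; "sandwich" is a hypothetical noun.
grammar = {
    "S":    [["NP", "VP"], ["VP"]],
    "NP":   [["Art", "N"], ["Art", "Adj", "N"], ["PN"]],
    "VP":   [["V", "NP"], ["V", "PP"]],
    "PP":   [["Prep", "NP"]],
    "PN":   [["John"], ["R.M.N."]],
    "V":    [["likes"], ["eats"]],
    "Art":  [["the"], ["a"]],
    "Adj":  [["red"], ["big"]],
    "N":    [["sandwich"]],
    "Prep": [["in"], ["on"], ["under"]],
}

p1 = Parser(grammar)
assert p1.recognize_string("John likes R.M.N.") is True
assert p1.recognize_string("John likes") is False
```

Note that expand prepends the saved alternatives one by one, so they are tried in the reverse of grammar order; this changes the search order relative to the trace above, but not which strings are recognized.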