The code in cky_recognizer.py gives a simple implementation of a CKY recognizer which also accesses some small grammars suitable for debugging and demoing. The code requires several python modules to all be in the same directory to work, all downloadable from the links below.
The cky_recognizer module defines a class CKYParser.
To create CKYParser instance p with grammar g:
p = CKYParser(g)
The parse_input method of this class takes a string argument will parse the given string using the grammar given at initialization time.
To parse string i with context-free-grammar g:
>>> i = "salespeople sold the dog biscuits"
>>> g = not_so_toy_grammar()
>>> cnf_g = cnf_grammar(g)
>>> parser = CKYParser(cnf_g)
>>> parser.parse_input(i,True)
The cnf_grammar function being called in line 3 has a single argument which is an arbitrary context-free grammar. The call to cnf_grammar turns the grammar into a CNF grammar and stores it in a format sometimes called bottum-up grammar format. This format is suitable for efficient use with a CKY parser.
Bottum-up format is very simple. A CNF rule with probability:
is stored in a dictionary whose keys are first daughters and who values are pairs of mothers and second daughters. So in the simplest case, the above rule would be stored as:
>>> BU_Grammar[B] = (A,C, 0.8)
Actually there may be more than one rule with the same first daughter, so we might have:
So in general the bottum_grammar needs to associate a list of rules with any first daughter. For the above, we would have:
>>> BU_Grammar[B] = [(A,C,.8), (D,E,.6)]
The implementation below actually stores each pair as as an object instance with attributes next_dtr and mother, so if we do:
>>> rules = BU_Grammar[B]
And the rules[0] is the first rule above, then A and C can be accessed as follows:
>>> rules[0].next_dtr
C
>>> rules[0].mother_cat
A
For a grammar with probabilities, the probability can be accessed as:
>>> rules[0].prob
0.8
The heart of the parser is the CKY loop. This looks like this:
1 2 3 4 5 6 7 8 9 10 | for j in range(1,len(self.chart)):
for cat in self.grammar.lexicon[self.input[j-1]]:
self.add_lexical_edge(cat,j)
for i in range(j-2,-1,-1):
for k in range(i+1, j):
for cat in self.chart[k][i]:
for rule in self.grammar.find_BURules(cat):
cat2 = rule.next_dtr
if cat2 in self.chart[j][k]:
self.add_to_chart(j,i,rule.mother_cat)
|