Parser writing help

If you are stuck on getting started, forget about objects for now and just write a function that takes a string and a grammar and does CKY recognition. No probabilities. That is, it builds a chart. The following helps you assemble some pieces of that function.

What is a chart? It is what’s called _table_ in the J&M pseudocode for the CKY algorithm in Chapter 13. It IS just a table. In the enclosed file start_up_package.py it’s a list of lists of dictionaries.

Note

To loop through the keys in a dictionary do:

>>> for k in my_dict:
     print k

To check if a key already exists in a dictionary (to avoid that annoying key error), do:

>>> key in my_dict

If this returns True the key is there ; if not , it just returns a benign False.

chart[n][0][‘S’] is true if you found an S from 0 to n during parsing. (So is ‘S’ in chart[n][0]?)

A CKY recognizer just does bookkeeping with the table, looping through the cells in a particular order, recording where constituents can be found based on where they have already been found. At the end of recognition, you just check to see if chart[n][0][‘S’] is True.

Note

If ‘start’ is the starting index of an input span and ‘end’ is the ending index of the span, the span shd be accessed:

>>>  chart[end][start]

This is what is done in the example below.

To write your parser, you will use the pseudocode in J&M, but don’t write any code yet. Get used to some data structures first.

CKY Start up package

The file start_up_package.py has a small PCFG toy grammar ALREADY in CNF form and a chart initialization function, initialize_chart. The variable s1 has been set to the string:

"the flight includes a meal"

in that file. If you execute:

>>> initialize_chart(s1.split())

you will get an initial chart suitable for parsing s1. You will use such a chart in your parser. The first line of code in your Python parser will include a call to initialize_chart. After that, assuming you have a grammar, you can just add the loops written in the J&M pseudocode to this Python code.

The file start_up_package.py also includes a function make_bottum_up grammar that puts the grammar in a form usable by a cky parser. It takes an NLTK grammar and returns a pair of a binary grammar and a lexicon (the lexicon contains all the unary rules, since the grammar is in CNF form). When loaded into Python, the file start_up_package.py calls this function on the enclosed NLTK grammar. Use the resulting toy bottum up grammar for your first parser (the two parts are called lexicon and binary_grammar). Use the enclosed example sentence (The flight includes a meal) when debugging your parser.

If the NLTK production is:

S -> NP VP [prob]

it’s retrieved from binary_grammar as:

binary_grammar[NP][VP] = [(S, prob)]

That is, binary_grammar[NP][VP] is the list of things that can be built from an NP and VP. In this toy grammar there’s just one such thing, an S. But the implementation doesn’t assume there’s just one. And in a bottum up parser if we find an NP and a VP we’ll just build all of them.

The probabilities have been included but just ignore them in your first implementation. The point of the binary grammar is that when in the pseudocode you have i, j, and k fixed (you are trying to build something from i to j and you are considering the case of what you can do with dtrs from i to k and from k to j) you can loop through the cats in the chart from i to k, and quickly check whether there is a binary rule whose first dtr matches. Answering the following questions shd make this clear. If not, ask.

Coders: You are welcome to USE the start_up_package.py file to get started. If you dont understand something in the file or can’t get it to work, ask me. If you don’t use the package you are on your own until you get a parser working. I will answer no questions. I have tried to make the ingredients of the package as generic as possible. It relies on the most basic pieces of NLTK and that is all.

Before writing any code do all of the following. Make sure you can answer any questions asked below. If not, ask. In fact make sure that answering the questions is easy. All of the questions teach you relationships among the data structures that you should have at your fingertips while writing code.

(a) Read through the enclosed demo and make sure you can reproduce it in your own environment and understand it. I have enclosed an example PARTIAL chart (called partial_chart in start_up_package.py) which represents an intermediate state of the parser when parsing the sentence ‘the flight includes a meal’,

(b) Look at partial_chart and make sure you understand its contents. Answer the following questions.

  1. Is there an NP from 0 to 2? What piece of Python code verifies this is true? Is there an NP from 3 to 5?
  2. Is there a VP from 2 to 5? What piece of Python code verifies this?
  3. Is there an S from 0 to 5?
  4. Is there a determiner from 0 to 1?
  1. BE the parser and add something to the chart using the bottum up grammar.

To be specific, there is an NP from 0 to 2 (chart[2][0][NP] = True). So look in the binary_grammar to find a dictionary for all rules whose first dtr is NP if there is one (if NP in binary grammar: binary_grammar[NP]);

Go ahead and evaluate this in Python now. DONT just keep reading. Follow along using the interactive capabilities of Python. Load the enclosed file into Python. look at binary_grammar and look at partial_chart. What is the value of binary_grammar[NP]? What does this tell you?

(c) Suppose you are at the point in the CKY algorithm parsing “the flight includes a meal” where:

i = 0
j = 5
k = 2

This means you are building things from 0 to 5, trying to use first dtrs that start at 0 and end at 2, and second dtrs that start at 2 and end at 5. How do you loop through all the constituents from 0 to 2? That is what goes in place of _____ in the following:

for all cat in ______:

if you want to be looping through all constituients in the partial chart from 0 to 2?

Each such cat from 0 to 2 is potentially the first dtr in a constituent from 0 to 5, IF you can find a grammar rule that uses such a dtr as a first dtr, and IF the second dtr in the grammar rule can be found in the chart from 2 through 5. How do you check this? What piece of Python code do you write?

You have the binary_grammar; you have the partial chart; by hypothesis you have found all the cats from 0 to 2 and you are looping through them;

  1. What expression finds the part of binary_grammar containing all rules whose first dtr is cat? That same expression just gives you the set of possible second dtrs whose first dtr is cat.
  2. what expression looks in the chart from 2 to 5 to see if those potential second dtrs are there?
  3. The enclosed grammar of course licenses an S from 0 to 5 if you have found an NP from 0 to 2 and a VP from 2 to 5. What expression UPDATES the chart to record the fact that you have found an S from 0 to 5?

The main function of this package is to give you a concrete example of how to implement the main data structures a CKY parser will need, in particular, the chart (which our text calls the parse table).

CKY control help

The file cky_control.py shows you how to write the critical CKY control loop in Python. This is the heart of the CKY algorithm.

The relevant loop code is at the bottom of the file:

For each index j from 0 through end of string, Find all constituents
UP to index j.  Do this as follows: Find all constituents in the span
(i,j) counting DOWN from j-2.  So for each j we find progressively
longer constituents ending at j. If j = 5, we process in the
following order: (3,5), (2,5), (1,5).

Note

There is no parser here. Just print code revealing the order in which the parser processes spans.

When you run the code in cky_control.py it prints out a trace of what the CKY parser does for an example sentence, The flight includes a meal. It starts by printing out the sentence:

 the  flight  includes  a  meal
0    1       2         3  4     5

You can see, for example, that the span (2,4) includes the words includes a The code then prints out a trace of all the i,j, k bindings CKY uses, in the order CKY processes them.

Looking at the loop at the end of the file, should give you a pretty good idea of how to write the main loop of a CKY parser in Python:

for j in range(1, len(chart)):
    print
    print 'j: %d ' % j
    print '  word added to chart: %s' % inp[j-1]
    print '  range of i: %s' % range(j-2,-1,-1)
    for i in range(j-2,-1,-1):
        print_list(inp[i:j],i,j)
        ctr = 0
        for k in range(i+1, j):
            ctr += 1
            print 'k: %*d' % (ctr, k)

Table Of Contents

Previous topic

Welcome to parsing_tools’s documentation!

Next topic

Drawing parse trees

This Page