Probabilistic Context-Free Grammars


A probabilistic context-free grammar is
a context-free grammar with probabilities attached to the rules.
Model Parameters

The model parameters are the probabilities assigned
to grammar rules.
Computing Probabilities

We discuss how the model assigns probabilities
to strings and to analyses of strings.
Exploiting Probabilities in Parsing

We discuss how to find the most probable
parse given the model.
Estimating Parameters

We sketch how rule probabilities are estimated from a syntactically
annotated corpus.

Probability Model


Important Probabilities

Rule Probabilities

Probability of an expansion given the category being expanded.
P(γ | A), the probability that category A expands to the sequence
of grammar symbols γ. Since this is a conditional probability,
the probabilities of all the expansions of a single category must add to 1.
For example:
S → NP VP .8
S → Aux NP VP .2
Given that S is being expanded, the probability of the
NP VP rule is .8; the probability of
the Aux NP VP expansion is .2. The rule probabilities
are the parameters of a PCFG model.
Tree Probability

The probability the model assigns to a tree.
This is the product of the probabilities of all the rules
used to construct it. We motivate this below.
For now an example. Given this grammar:
the sentence
Book the dinner flight
has at least two trees:
P(T_{left}) = .05 * .20 * .20 * .20 * .75 * .30 * .60 * .10 * .40 = 2.2 × 10^{-6}
P(T_{right}) = .05 * .10 * .20 * .15 * .75 * .75 * .30 * .60 * .10 * .40 = 6.1 × 10^{-7}
Therefore the more probable of the two parses is T_{left},
which is the intuitive result.
String Probability

Sum of the tree probabilities for all analyses of the string.
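A small Python check of this arithmetic (the factor lists are simply the rule probabilities read off the two trees above; the trees themselves are not reproduced here):

import math

left_rules  = [.05, .20, .20, .20, .75, .30, .60, .10, .40]
right_rules = [.05, .10, .20, .15, .75, .75, .30, .60, .10, .40]

p_left  = math.prod(left_rules)    # ~2.2e-06, the probability of the left tree
p_right = math.prod(right_rules)   # ~6.1e-07, the probability of the right tree
p_string = p_left + p_right        # string probability: sum over all analyses
print(p_left, p_right, p_string)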

Estimating Parameters


Given a treebank, the maximum likelihood estimate for the probability of the PCFG rule
VP → VBD NP PP
is:
P(VP → VBD NP PP) = Count(VP → VBD NP PP) / Count(VP)
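As a sketch of how such counts might be collected, the following Python fragment computes maximum likelihood rule probabilities from a tiny hypothetical treebank encoded as nested tuples; the tree encoding and names are illustrative only, not a standard treebank format.

from collections import Counter

# A tree is (label, child, child, ...); a leaf is a bare word string.
treebank = [
    ("S", ("NP", ("Det", "the"), ("N", "flight")),
          ("VP", ("V", "includes"), ("NP", ("Det", "a"), ("N", "meal")))),
]

rule_counts = Counter()   # Count(A -> beta)
lhs_counts = Counter()    # Count(A)

def count_rules(tree):
    if isinstance(tree, str):      # a word: no rule to count
        return
    lhs, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(lhs, rhs)] += 1
    lhs_counts[lhs] += 1
    for c in children:
        count_rules(c)

for t in treebank:
    count_rules(t)

# Maximum likelihood estimate: Count(A -> beta) / Count(A)
pcfg = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}
for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}   {p:.2f}")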


Highest Probability Parse


Our primary subject of interest is finding
the highest probability
parse for a sentence.
This can be done with a rather simple variant of CKY.
CKY is a bottom-up algorithm. A bottom-up algorithm
is compatible with our probability model because the
probability of a subtree
is independent of anything outside that tree,
in particular of where that subtree occurs in a sentence
(for instance, subject position or object position),
so once we have found all ways of building a constituent,
we are guaranteed to have found the maximum
amount of probability mass that constituent
can add to any tree that includes it. 
CKY is ideal for our purposes because
no constituent spanning (i,j) is built until
all ways of building constituents spanning (k,l),
where (k,l) is a proper subspan of (i,j)
(i ≤ k < l ≤ j and (k,l) ≠ (i,j)),
have been found. Since we can find the most
probable parses for all possible subconstituents
of a constituent spanning (i,j) before we build
any constituents spanning (i,j), we have the essential
feature of a Viterbi-like algorithm.
We modify CKY as follows: instead of building
a parse table in which each
cell assigns value True or False for each
category, we build a parse table in which
each cell assigns a probability to each category.
Each time we find a new way of building an
A spanning (i,j), we compute its probability
and check to see if it is more probable than
the current value for table[i][j][A].
Look here for more
discussion of this Viterbi-like algorithm for finding the
best parse assigned by a PCFG.

Example


Partial PCFG:
S → NP VP .80
NP → Det N .30
VP → V NP .20
Det → the .40
Det → a .40
V → includes .05
N → meal .01
N → flight .02
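The following is a minimal Python sketch of the Viterbi-style CKY table just described, run on this partial PCFG. The sentence, function name, and variable names are illustrative only; the grammar is assumed to be in Chomsky normal form, and back-pointers (needed to recover the best tree itself, not just its probability) are omitted.

from collections import defaultdict

binary = {   # A -> B C rules: (A, B, C) -> probability
    ("S",  "NP",  "VP"): 0.80,
    ("NP", "Det", "N"):  0.30,
    ("VP", "V",   "NP"): 0.20,
}
lexical = {  # preterminal -> word rules: (A, word) -> probability
    ("Det", "the"): 0.40, ("Det", "a"): 0.40,
    ("N", "meal"): 0.01,  ("N", "flight"): 0.02,
    ("V", "includes"): 0.05,
}

def viterbi_cky(words):
    n = len(words)
    # table[i][j] maps each category to the best probability of that category
    # spanning words[i:j], instead of the True/False entries of plain CKY.
    table = defaultdict(lambda: defaultdict(dict))
    for i, w in enumerate(words):
        for (cat, word), p in lexical.items():
            if word == w and p > table[i][i + 1].get(cat, 0.0):
                table[i][i + 1][cat] = p
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):                     # split point
                for (a, b, c), p in binary.items():
                    prob = p * table[i][k].get(b, 0.0) * table[k][j].get(c, 0.0)
                    if prob > table[i][j].get(a, 0.0):    # keep only the best analysis
                        table[i][j][a] = prob
    return table[0][n]

print(viterbi_cky("the flight includes a meal".split()))
# prints the best probability for S over the whole sentence, about 2.3e-08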

Problems


Independence assumptions are (damagingly) false

Where particular kinds of category expansions
happen very much depends on where in the tree
the category occurs.
Lexical

Rule probabilities above the preterminal level are oblivious
to what words occur in the tree. But this is wrong.

Independence


Pronouns (Switchboard corpus [spoken, telephone], Francis et al. 1999)
          Pronoun   Non-Pronoun
Subject   91%       9%
Object    34%       66%

Lexical


 Left tree is correct for this example
 Right tree is correct for
Conjunction ambiguities can
create trees to which a PCFG can't assign
distinct probabilities, but intuitively
the trees are quite different in probability.
Notice the two trees employ exactly the same bags of
rules, so they must, according to a PCFG, be equiprobable.

Using Grandparents


Johnson (1998) demonstrates the utility of
using the grandparent node
as a contextual parameter of a probabilistic
parsing model.
This option uses more derivational history.

 NP_{1}: P(... | P = NP, GP = S)
 NP_{2}: P(... | P = NP, GP = VP)
Significant improvement on PCFG:
 Can capture distributional differences between subject and
object NPs (such as likelihood of pronoun)
 Outperforms the left-corner parser described in
Manning and Carpenter (1997)
Original rule     Split rules
S → NP VP         S → NP^S VP^S

NP → PRP          NP^S → PRP
                  NP^VP → PRP

NP → DT NN        NP^S → DT NN
                  NP^VP → DT NN
This strategy is called splitting. We split the category NP
in two so as to model distinct internal distributions depending
on the external context of the NP. Note that in the work
above, preterminals are not split.
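A minimal sketch of this splitting as tree relabeling, in the spirit of Johnson (1998): every nonterminal is annotated with the category of its parent, so the rules read off the annotated tree are exactly the split rules above. The annotate_preterminals flag anticipates the extension in the next section; the function name and tuple tree encoding are illustrative, not taken from any particular toolkit.

# A tree is (label, child, ...); a leaf is a bare word string.
def annotate(tree, parent=None, annotate_preterminals=False):
    if isinstance(tree, str):                 # a word: leave it alone
        return tree
    label, children = tree[0], tree[1:]
    is_preterminal = len(children) == 1 and isinstance(children[0], str)
    if parent is not None and (annotate_preterminals or not is_preterminal):
        label = f"{label}^{parent}"           # e.g. NP^S vs. NP^VP
    return (label,) + tuple(annotate(c, tree[0], annotate_preterminals)
                            for c in children)

tree = ("S", ("NP", ("PRP", "she")),
             ("VP", ("VBZ", "reads"), ("NP", ("DT", "the"), ("NN", "book"))))
print(annotate(tree))
# ('S', ('NP^S', ('PRP', 'she')),
#       ('VP^S', ('VBZ', 'reads'), ('NP^VP', ('DT', 'the'), ('NN', 'book'))))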

Splitting preterminals


Johnson's grandparent grammars work because they
reclaim some of the ground lost due
to the incorrectness of the independence assumptions
of PCFGs. But the other problem
for PCFGs that we noted is that they miss
lexical dependencies. If we extend the idea
of grandparent information to preterminals,
we can begin to capture some of that information.
The parse on the left is incorrect because of an incorrect choice
between the following two rules:
PP → IN NP
PP → IN SBAR
The parse on the right, in which the categories (including
preterminal categories) are annotated
with grandparent information, is correct.
The choice is made possible because the string advertising works
can be analyzed either as an SBAR or an NP:
[_{SBAR} Advertising works] when it is clever.
 [_{NP} The advertising works of Andy Warhol]
made him a legend.
Now, as it happens, words of preterminal category IN
differ greatly in how likely they are to take
sentential (SBAR) complements:
IN          SBAR
if          John comes
before      John comes
after       John comes
? under     John comes
? behind    John comes
(the ? marks IN words that do not comfortably take an SBAR complement)
So this is a clear case
where distributional facts about words help.
And the simple act of extending the grandparent
strategy to preterminals helps capture
the differences between these two classes of IN.

Generalized Splitting


Obviously there are many cases where splitting helps.
But there are also cases where splitting hurts. We already
have a sparseness problem. If a split results
in a defective or underrepresented distribution,
it can lead to overfitting and it can hurt.
Generalizations that were there before the split
can be blurred or lost.
Two approaches:
 Hand-tuned rules (Klein and Manning 2003b)
 Split-and-merge algorithm: search for optimal splits
(Petrov et al. 2006)

Lexicalized CFGs


An extreme form of splitting:
VP → VBD NP PP
VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,P)
VP(sold,VBD) → VBD(sold,VBD) NP(pictures,NNS) PP(of,P)
...

Notice categories are being annotated with two kinds of information:
 Head word
 Category of their immediately dominated head constituent
Lexical rules all have probability 1.0

Implementing Lexicalized Parsers


Mathematically, a lexicalized PCFG is a PCFG, so
in principle a lexicalized grammar has an O(n^{3}) CKY
parser just like any PCFG.
The problem is, we just blew up the grammar when we lexicalized it:
VP → VBD NP PP  
VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,P)  
VP(spilled,VBD) → VBD(spilled,VBD) NP(ketchup,NNS) PP(on,P)  
VP(X,VBD) → VBD(X,VBD) NP(Y,NNS) PP(Z,P)  
And the real parsing complexity is O(n^{3}G^{2}), where G
is the size of the grammar.
After lexicalization the G term implicitly depends on n: the lexicalized categories
that can actually occur in a sentence each pair a category with one of that
sentence's n words, and n is generally smaller than the size of the full lexicalized grammar.
So the real parsing complexity is O(n^{3}n^{2}) = O(n^{5}).
How many distinct entries can there now be in a cell for span (i,j)?
A nonterminal, a word (one of the words between i and j), and the preterminal of that word.
Therefore the worst-case number of items in a span is:
number of nonterminals × n (the number of words between i and j) × number of parts of speech
If we assume every part of speech can "project" only one nonterminal:
number of nonterminals × n
That's how many distinct floating point numbers (probabilities) we may need to store for a span (a cell
in the table).
Data structures?
Currently a cell is filled by a dictionary from nonterminals to floats:
table[3][5] = {NP: .003, VP: .0004}
What should replace it?
Answer:
A dictionary from (nonterminal, head word) pairs to floats:
{(NP,sacks): .003, (VP,sacks): .0004, (VP,dump): .01}
Note: At the same time, a lexicalized grammar rule may look like this:
(NP, sacks) → sacks 0.01
(VP, helps) → (VBZ,helps) (NP,students) 0.0001
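A small sketch of this data-structure change, using the illustrative numbers from above:

# Unlexicalized PCFG table cell: category -> best probability
table_3_5_plain = {"NP": 0.003, "VP": 0.0004}

# Lexicalized table cell: (category, head word) -> best probability
table_3_5_lex = {("NP", "sacks"): 0.003,
                 ("VP", "sacks"): 0.0004,
                 ("VP", "dump"): 0.01}

# Viterbi-style update when a new way of building (category, head) over the span is found
def update(cell, category, head, prob):
    key = (category, head)
    if prob > cell.get(key, 0.0):   # keep only the best probability
        cell[key] = prob

update(table_3_5_lex, "VP", "dump", 0.02)
print(table_3_5_lex[("VP", "dump")])   # 0.02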

Making a Lexicalized Grammar


S → NP VP 1.0
VP → VBD NP .1
VP → VBD .9
VBD → kicked .1
VBD → saw .8
VBD → drank .1
NP → Det N 1.0
Det → a .2
Det → the .8
N → book .1
N → ball .7
N → gin .2
Then the lexicalized grammar with uniform probabilities is:
S → (NP,book) (VP,kicked) .111
S → (NP,book) (VP,saw) .111
S → (NP,book) (VP,drank) .111
S → (NP,gin) (VP,kicked) .111
S → (NP,gin) (VP,saw) .111
S → (NP,gin) (VP,drank) .111
S → (NP,ball) (VP,kicked) .111
S → (NP,ball) (VP,saw) .111
S → (NP,ball) (VP,drank) .111
(VP,kicked) → (VBD,kicked) (NP,book) .033
(VP,kicked) → (VBD,kicked) (NP,gin) .033
(VP,kicked) → (VBD,kicked) (NP,ball) .033
(VP,kicked) → (VBD,kicked) .9
(VP,saw) → (VBD,saw) (NP,book) .033
(VP,saw) → (VBD,saw) (NP,gin) .033
(VP,saw) → (VBD,saw) (NP,ball) .033
(VP,saw) → (VBD,saw) .9
(VP,drank) → (VBD,drank) (NP,book) .033
(VP,drank) → (VBD,drank) (NP,gin) .033
(VP,drank) → (VBD,drank) (NP,ball) .033
(VP,drank) → (VBD,drank) .9
(VBD, kicked) → kicked 1.0
(VBD,saw) → saw 1.0
(VBD,drank) → drank 1.0
(NP,book) → (Det,a) (N,book) .5
(NP,book) → (Det,the) (N,book) .5
(NP,ball) → (Det,a) (N,ball) .5
(NP,ball) → (Det,the) (N,ball) .5
(NP,gin) → (Det,a) (N,gin) .5
(NP,gin) → (Det,the) (N,gin) .5
(Det,a) → a 1.0
(Det,the) → the 1.0
(N,book) → book 1.0
(N,ball) → ball 1.0
(N,gin) → gin 1.0
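The uniform splitting that produces this grammar can be sketched in Python as follows. The rule encoding, the head_index convention (which right-hand-side child supplies the head word), and the choice to leave S unlexicalized are assumptions made to match the listing above; names are illustrative only.

from itertools import product

# Base PCFG rules: (LHS, RHS, probability, head_index); head_index = None
# means the LHS is left unlexicalized (as S is in the listing above).
base_rules = [
    ("S",  ("NP", "VP"),  1.0, None),
    ("VP", ("VBD", "NP"), 0.1, 0),
    ("VP", ("VBD",),      0.9, 0),
    ("NP", ("Det", "N"),  1.0, 1),
]
lexical_rules = {            # preterminal -> {word: probability}
    "VBD": {"kicked": 0.1, "saw": 0.8, "drank": 0.1},
    "Det": {"a": 0.2, "the": 0.8},
    "N":   {"book": 0.1, "ball": 0.7, "gin": 0.2},
}

def head_words(cat):
    """The head words a category can carry (via its chain of head children)."""
    if cat in lexical_rules:
        return sorted(lexical_rules[cat])
    for lhs, rhs, _, head in base_rules:
        if lhs == cat and head is not None:
            return head_words(rhs[head])
    return [None]            # unlexicalized category such as S

def lexicalize():
    out = []
    for lhs, rhs, prob, head in base_rules:
        choices = [head_words(sym) for sym in rhs]
        # Uniform split: all lexicalized variants that share the same
        # (lexicalized) LHS divide the original rule probability evenly.
        share = 1
        for i, c in enumerate(choices):
            if head is None or i != head:
                share *= len(c)
        for combo in product(*choices):
            new_lhs = lhs if head is None else (lhs, combo[head])
            new_rhs = tuple(zip(rhs, combo))
            out.append((new_lhs, new_rhs, prob / share))
    for pre, words in lexical_rules.items():     # lexical rules get probability 1.0
        for w in words:
            out.append(((pre, w), (w,), 1.0))
    return out

for lhs, rhs, p in lexicalize():
    print(lhs, "->", rhs, round(p, 3))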

Sparseness Problems


Maximum likelihood estimates for the PCFG defined
by lexicalized splitting are doomed by sparseness:
The rule probability for
VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,P)
is:
Count(VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,P)) / Count(VP(dumped,VBD))

No corpus is likely to supply reliable counts for such events.
How do we usually solve sparseness problems?
We need to make some further independence assumptions!

Collins I


Generation story: break a rule down into a number
of contextually dependent sub-events which collectively "generate"
an occurrence of the rule:
LHS → L_{n} L_{n-1} ... L_{1} H R_{1} ... R_{m-1} R_{m}
Basic decomposition: generate the head first, then the L_{i}
modifiers terminating with STOP, then the R_{i} modifiers
terminating with STOP. For the tree above we have the following
sub-event probabilities:
 P(H | LHS) = P(VBD(dumped,VBD) | VP(dumped,VBD))
 P_{l}(STOP | VP(dumped,VBD), VBD(dumped,VBD))
 P_{r}(NP(sacks,NNS) | VP(dumped,VBD), VBD(dumped,VBD)):
note this is independent of any left modifiers
 P_{r}(PP(into,P) | VP(dumped,VBD), VBD(dumped,VBD)):
note this is independent of NP(sacks,NNS)
 P_{r}(STOP | VP(dumped,VBD), VBD(dumped,VBD), PP)
Let
R = VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,P)
We estimate the probability of R
by multiplying these probabilities together:
P(R) =
P(VBD(dumped,VBD) | VP(dumped,VBD)) ×
P_{l}(STOP | VP(dumped,VBD), VBD(dumped,VBD)) ×
P_{r}(NP(sacks,NNS) | VP(dumped,VBD), VBD(dumped,VBD)) ×
P_{r}(PP(into,P) | VP(dumped,VBD), VBD(dumped,VBD)) ×
P_{r}(STOP | VP(dumped,VBD), VBD(dumped,VBD), PP)
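A small sketch of this decomposition as code. The probability tables here are hypothetical stand-ins (a real model estimates and smooths them from a treebank), and the extra conditioning shown in the final STOP sub-event (on the already-generated PP) is ignored for simplicity.

# Hypothetical sub-event probability tables; keys pair the generated item
# with its conditioning context (LHS and head).
P_head = {(("VBD", "dumped"), ("VP", "dumped")): 0.3}            # P(H | LHS)
P_left = {("STOP", ("VP", "dumped"), ("VBD", "dumped")): 0.9}    # P_l(. | LHS, H)
P_right = {                                                      # P_r(. | LHS, H)
    (("NP", "sacks"), ("VP", "dumped"), ("VBD", "dumped")): 0.05,
    (("PP", "into"),  ("VP", "dumped"), ("VBD", "dumped")): 0.03,
    ("STOP",          ("VP", "dumped"), ("VBD", "dumped")): 0.2,
}

def collins_rule_prob(lhs, head, left_mods, right_mods):
    """P(LHS -> left_mods H right_mods), decomposed into independent sub-events."""
    p = P_head[(head, lhs)]
    for m in left_mods + ["STOP"]:        # left modifiers, then STOP
        p *= P_left[(m, lhs, head)]
    for m in right_mods + ["STOP"]:       # right modifiers, then STOP
        p *= P_right[(m, lhs, head)]
    return p

p = collins_rule_prob(("VP", "dumped"), ("VBD", "dumped"),
                      [], [("NP", "sacks"), ("PP", "into")])
print(p)   # 0.3 * 0.9 * 0.05 * 0.03 * 0.2 = 8.1e-05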

State of the art


Using the PARSEVAL metric for comparing gold-standard
and system parse trees (Black et al. 1991),
state-of-the-art statistical parsers trained on the Wall
Street Journal treebank perform
at about 90% precision and 90% recall.
