
Probabilistic Context-Free Grammars


A probabilistic context-free grammar (PCFG) is a context-free grammar with a probability attached to each rule.

Model Parameters
The model parameters are the probabilities assigned to grammar rules.
Computing Probabilities
We discuss how the model assigns probabilities to strings and to analyses of strings.
Exploiting Probabilities in Parsing
We discuss how to find the most probable parse given the model.
Estimating Parameters
We sketch how rule probabilities are estimated from a syntactically annotated corpus.

Important Probabilities
Rule Probabilities
Probability of an expansion given the category being expanded: P(γ | A), the probability that the sequence of grammar symbols γ expands category A. By the definition of conditional probability, the probabilities of all the expansions of a single category must sum to 1. For example:
S -> NP VP, .8 | Aux NP VP, .2
Given that S is being expanded, the probability of the NP VP rule is .8; the probability of the Aux NP VP expansion is .2. The rule probabilities are the parameters of a PCFG model.
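Since each category's expansions must sum to 1, a rule table can be sanity-checked directly. A minimal Python sketch (the dictionary layout is our own, not from the notes):

```python
# Rule probabilities keyed by the category being expanded; each expansion
# gamma maps to P(gamma | A).
rules = {
    "S": {("NP", "VP"): 0.8, ("Aux", "NP", "VP"): 0.2},
}

def distributions_ok(rules, tol=1e-9):
    """Every category's expansion probabilities must sum to 1."""
    return all(abs(sum(dist.values()) - 1.0) < tol for dist in rules.values())
```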
Tree Probability
The probability of a tree is the product of the probabilities of all the rules used to construct it. We motivate this below.

For now, an example. Given this grammar:

the sentence
    Book the dinner flight
has at least two trees:
    P(Tleft) = .05 × .20 × .20 × .20 × .75 × .30 × .60 × .10 × .40 = 2.2 × 10⁻⁶
    P(Tright) = .05 × .10 × .20 × .15 × .75 × .75 × .30 × .60 × .10 × .40 = 6.1 × 10⁻⁷
Therefore the more probable of the two parses is Tleft, which is the intuitive result.
String Probability
Sum of the tree probabilities for all analyses of the string.
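The tree and string probabilities above can be reproduced in a few lines (a sketch; the rule-probability lists are copied from the example):

```python
from math import prod

# Probabilities of the rules used by each of the two trees above.
left_tree = [.05, .20, .20, .20, .75, .30, .60, .10, .40]
right_tree = [.05, .10, .20, .15, .75, .75, .30, .60, .10, .40]

p_left = prod(left_tree)     # tree probability = product of its rules
p_right = prod(right_tree)
p_string = p_left + p_right  # string probability = sum over all analyses
```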

Given a treebank, the maximum likelihood estimate for the probability of a PCFG rule such as

    VP → VBD NP PP

is the count of the rule divided by the count of its left-hand side:

    P(VP → VBD NP PP | VP) = Count(VP → VBD NP PP) / Count(VP)
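A sketch of this estimate over a toy treebank, represented simply as a list of (LHS, RHS) rule occurrences (the representation and counts are our own):

```python
from collections import Counter

# A toy treebank flattened into (LHS, RHS) rule occurrences.
treebank_rules = [
    ("VP", ("VBD", "NP", "PP")),
    ("VP", ("VBD", "NP", "PP")),
    ("VP", ("VBD", "NP")),
    ("VP", ("VBD",)),
]

rule_counts = Counter(treebank_rules)
lhs_counts = Counter(lhs for lhs, _ in treebank_rules)

def mle(lhs, rhs):
    """P(lhs -> rhs | lhs) = Count(lhs -> rhs) / Count(lhs)."""
    return rule_counts[(lhs, rhs)] / lhs_counts[lhs]
```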


Our primary subject of interest is finding the highest probability parse for a sentence.

This can be done with a rather simple variant of CKY. CKY is a bottom-up algorithm, which is compatible with our probability model because the probability of a subtree is independent of anything outside that tree, in particular of where the subtree occurs in a sentence (for instance, subject position or object position). So once we have found all ways of building a constituent, we are guaranteed to have found the maximum probability mass that constituent can contribute to any tree that includes it.

CKY is ideal for our purposes because no constituent spanning (i,j) is built until all ways of building constituents spanning (k,l),

    i ≤ k < l ≤ j,  (k,l) ≠ (i,j)
have been found. Since we can find the most probable parses for all possible subconstituents of a constituent spanning (i,j) before we build any constituent spanning (i,j), we have the essential feature of a Viterbi-like algorithm.

We modify CKY as follows: instead of building a parse table in which each cell assigns the value True or False to each category, we build a parse table in which each cell assigns a probability to each category. Each time we find a new way of building an A spanning (i,j), we compute its probability and check whether it is more probable than the current value of table[i][j][A].
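A minimal sketch of this probabilistic CKY for a grammar in Chomsky normal form (function and variable names are ours; backpointers for recovering the best tree are omitted):

```python
from collections import defaultdict

def pcky(words, lexical, binary):
    """Probabilistic CKY for a CNF grammar.

    lexical: {word: {A: P(A -> word)}}
    binary:  {(B, C): {A: P(A -> B C)}}
    Returns table where table[i][j][A] is the probability of the best
    parse of category A spanning words[i:j].
    """
    n = len(words)
    table = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for cat, p in lexical.get(w, {}).items():
            table[i][i + 1][cat] = p
    for span in range(2, n + 1):            # smaller spans first
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):       # all split points
                for b, pb in table[i][k].items():
                    for c, pc in table[k][j].items():
                        for a, p_rule in binary.get((b, c), {}).items():
                            p = p_rule * pb * pc
                            if p > table[i][j][a]:  # keep only the best
                                table[i][j][a] = p
    return table

# Tiny example using rules from the partial PCFG in these notes:
lexical = {"the": {"Det": 0.40}, "flight": {"N": 0.02}}
binary = {("Det", "N"): {"NP": 0.30}}
best = pcky(["the", "flight"], lexical, binary)
```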



Partial PCFG:

    S → NP VP .80
    NP → Det N .30
    VP → V NP .20
    V → includes .05
    Det → the .40
    Det → a .40
    N → meal .01
    N → flight .02
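As a worked example with these rules, the probability of the parse of the flight includes a meal is the product of the rules used:

```python
from math import prod

# Rules used in the parse of "the flight includes a meal":
#   S -> NP VP, NP -> Det N, Det -> the, N -> flight,
#   VP -> V NP, V -> includes, NP -> Det N, Det -> a, N -> meal
p = prod([.80, .30, .40, .02, .20, .05, .30, .40, .01])
```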


Independence assumptions are (damagingly) false
Where particular kinds of category expansions happen depends very much on where in the tree the category occurs.
Rule probabilities above the preterminal level are oblivious to which words occur in the tree. But this is wrong.

Pronouns (Switchboard corpus [spoken, telephone], Francis et al. 1999)

              Pronoun   Non-Pronoun
    Subject   91%       9%
    Object    34%       66%


  1. Left tree is correct for this example
  2. Right tree is correct for

    Conjunction ambiguities can create trees to which a PCFG can't assign distinct probabilities, but intuitively, the trees are quite different in probability:

    Notice the two trees employ exactly the same bag of rules, so they must, according to a PCFG, be equiprobable.

Johnson (1998) demonstrates the utility of using the grandparent node as a contextual parameter of a probabilistic parsing model.

This option uses more derivational history.

  1. NP1: P(... | P = NP, GP = S)
  2. NP2: P(... | P = NP, GP = VP)

Significant improvement on PCFG:

  1. Can capture distributional differences between subject and object NPs (such as likelihood of pronoun)
  2. Outperforms left-corner parser described in Manning and Carpenter (1997)

S → NP VP   S → NP^S VP^S
    NP^VP → PRP
    NP^VP → DT NN

This strategy is called splitting. We split the category NP in two so as to model distinct internal distributions depending on the external context of an NP. Note that in the work above, preterminals are not split.
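Parent annotation can be implemented as a simple tree transform. A sketch assuming trees are (label, children) tuples (the representation is ours); following the note above, preterminals are left unsplit:

```python
def annotate_parent(tree, parent=None):
    """Append ^Parent to phrasal labels, as in NP^S vs NP^VP.

    A tree is (label, [children]); a leaf is a plain string.
    Preterminals (nodes whose children are all words) are not split.
    """
    if isinstance(tree, str):          # terminal: leave words alone
        return tree
    label, children = tree
    is_preterminal = all(isinstance(c, str) for c in children)
    new_label = label
    if parent and not is_preterminal:  # only phrasal categories are split
        new_label = f"{label}^{parent}"
    return (new_label, [annotate_parent(c, label) for c in children])

example = ("S", [("NP", [("PRP", ["he"])]),
                 ("VP", [("VBD", ["ran"])])])
split = annotate_parent(example)
```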

Splitting preterminals  

Johnson's grandparent grammars work because they reclaim some of the ground lost to the incorrect independence assumptions of PCFGs. But the other problem for PCFGs that we noted is that they miss lexical dependencies. If we extend the idea of grandparent information to preterminals, we can begin to capture some of that information.

The parse on the left is incorrect because of an incorrect choice between the following two rules:

    PP → IN NP
    PP → IN SBAR
The parse on the right, in which the categories (including preterminal categories) are annotated with grandparent information, is correct.

The choice is made possible because the string advertising works can be analyzed either as an SBAR or an NP:

  1. [SBAR Advertising works] when it is clever.
  2. [NP The advertising works of Andy Warhol] made him a legend.
Now, as it happens, words of preterminal category IN differ greatly in how likely they are to occur with sentential complements:
    if John comes
    before John comes
    after John comes
    ? under John comes
    ? behind John comes
So this is a clear case where distributional facts about words help. And the simple act of extending the grandparent strategy to preterminals helps capture the difference between these two classes of IN.


Obviously there are many cases where splitting helps.

But there are also cases where splitting hurts. We already have a sparseness problem. If a split results in a defective or under-represented distribution, it can lead to overfitting and it can hurt. Generalizations that were there before the split can be blurred or lost.

Three approaches:

  • Hand-tuned rules (Klein and Manning 2003b)
  • Split-and-merge algorithm: search for optimal splits (Petrov et al. 2006)
  • Lexicalized

    An extreme form of splitting:

      VP → VBD NP PP   VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,P)  
          VP(sold,VBD) → VBD(sold,VBD) NP(pictures,NNS) PP(of,P)  
          . . .

    Notice categories are being annotated with two kinds of information:
    1. Head word
    2. Category of the head constituent they immediately dominate

    Lexical rules all have probability 1.0


    Mathematically, a lexicalized PCFG is a PCFG, so in principle a lexicalized grammar has an O(n³) CKY parser just like any PCFG.

    The problem is, we just blew up the grammar when we lexicalized it:

      VP → VBD NP PP  
      VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,P)  
      VP(spilled,VBD) → VBD(spilled,VBD) NP(ketchup,NNS) PP(on,P)  
      VP(X,VBD) → VBD(X,VBD) NP(Y,NNS) PP(Z,P)  

    And the real parsing complexity is O(n³G²), where G is the size of the grammar.

    After lexicalization, the effective value of G depends on min(n, G), and n is generally SMALLER than G. So the real parsing complexity is O(n³ · n²) = O(n⁵).

    How many distinct entries can there now be in a cell for span (i,j)?

      nonterminal, head word (between i and j), preterminal of the head word
    Therefore the worst-case number of items in a span is:
      number of nonterminals × n (number of words between i and j) × number of parts of speech
    If we assume every part of speech can "project" only one nonterminal:
      |Nonterminals| × n
    That's how many distinct floating point numbers (probabilities) we may need to store for a span (a cell in the table).

    Data structures?

      Currently a cell is filled by a dictionary from nonterminals to floats:
      table[3][5] = {NP: .003, VP: .0004}
      What should replace it?
      A dictionary from pairs (nonterminal, head word) to floats:

      {(NP,sacks): .003, (VP,sacks): .0004, (VP,dump):.01}
      Note: At the same time, a lexicalized grammar rule may look like this:
      (NP, sacks) → sacks 0.01
      (VP, helps) → (VBZ,helps) (NP,students) 0.0001
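A cell update under this keying might be sketched as follows (the helper name and numbers are ours, echoing the examples above):

```python
# A chart cell for a lexicalized grammar: keys are (category, head word)
# pairs, values are the best probability found so far.
cell = {("NP", "sacks"): 0.003, ("VP", "sacks"): 0.0004}

def update(cell, cat, head, prob):
    """Record a new analysis only if it beats the current best for its key."""
    key = (cat, head)
    if prob > cell.get(key, 0.0):
        cell[key] = prob

update(cell, "VP", "dumped", 0.01)    # new key: added
update(cell, "NP", "sacks", 0.001)    # worse than 0.003: ignored
```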
    An example. Take this small PCFG:

    VP → VBD NP .1 
    VP → VBD .9
    VBD → kicked .1
    VBD → saw  .8
    VBD → drank .1
    NP → Det N  1.0
    Det → a  .2
    Det → the .8
    N → book  .1
    N → ball  .7
    N → gin  .2
    Then the lexicalized grammar with uniform probabilities is:
    S → (NP,book) (VP,kicked) .111
    S → (NP,book) (VP,saw) .111
    S → (NP,book) (VP,drank) .111
    S → (NP,gin) (VP,kicked) .111
    S → (NP,gin) (VP,saw) .111
    S → (NP,gin) (VP,drank) .111
    S → (NP,ball) (VP,kicked) .111
    S → (NP,ball) (VP,saw) .111
    S → (NP,ball) (VP,drank) .111
    (VP,kicked) → (VBD,kicked) (NP,book) .033
    (VP,kicked) → (VBD,kicked) (NP,gin) .033
    (VP,kicked) → (VBD,kicked) (NP,ball) .033
    (VP,kicked) → (VBD,kicked) .9
    (VP,saw) → (VBD,saw) (NP,book) .033
    (VP,saw) → (VBD,saw) (NP,gin) .033
    (VP,saw) → (VBD,saw) (NP,ball) .033
    (VP,saw) → (VBD,saw) .9
    (VP,drank) → (VBD,drank) (NP,book) .033
    (VP,drank) → (VBD,drank) (NP,gin) .033
    (VP,drank) → (VBD,drank) (NP,ball) .033
    (VP,drank) → (VBD,drank) .9
    (VBD, kicked) → kicked 1.0
    (VBD,saw) → saw  1.0
    (VBD,drank) → drank 1.0
    (NP,book) → (Det,a) (N,book)  .5
    (NP,book) → (Det,the) (N,book)  .5  
    (NP,ball) → (Det,a) (N,ball)  .5
    (NP,ball) → (Det,the) (N,ball)  .5
    (NP,gin) → (Det,a) (N,gin)  .5
    (NP,gin) → (Det,the) (N,gin)  .5
    (Det,a) → a  1.0
    (Det,the) → the 1.0
    (N,book) → book  1.0
    (N,ball) → ball  1.0
    (N,gin) → gin  1.0

    Maximum likelihood estimates for the PCFG defined by lexicalized splitting are doomed by sparseness. The rule probability for

    VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,P)  


      Count(VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,P)) / Count(VP(dumped,VBD))
    No corpus is likely to supply reliable counts for such events.

    How do we usually solve sparseness problems? We need to make some further independence assumptions!

    Collins I  

    Generation story: break a rule down into a number of contextually dependent subevents which collectively "generate" an occurrence of the rule:

      LHS → Lₙ Lₙ₋₁ … H R₁ … Rₙ₋₁ Rₙ

    Basic decomposition: generate the head first, then the Lᵢ modifiers terminating with STOP, then the Rᵢ modifiers terminating with STOP. For the tree above we have the following subevent probabilities:

    1. P(H|LHS) = P(VBD(dumped,VBD) | VP(dumped,VBD))
    2. Pl(STOP | VP(dumped,VBD), VBD(dumped,VBD))
    3. Pr(NP(sacks,NNS) | VP(dumped,VBD),VBD(dumped,VBD))
      note independent of any left modifiers
    4. Pr(PP(into,P) | VP(dumped,VBD),VBD(dumped,VBD)):
      note independent of NP(sacks,NNS)
    5. Pr(STOP | VP(dumped,VBD), VBD(dumped,VBD), PP)

    Let R = VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,P).
    We estimate the probability of R by multiplying these probabilities together:
      P(R) =
        P(VBD(dumped,VBD) | VP(dumped,VBD)) × Pl(STOP | VP(dumped,VBD), VBD(dumped,VBD)) × Pr(NP(sacks,NNS) | VP(dumped,VBD),VBD(dumped,VBD)) ×
        Pr(PP(into,P) | VP(dumped,VBD),VBD(dumped,VBD)) × Pr(STOP | VP(dumped,VBD), VBD(dumped,VBD),PP)
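The product can be sketched in code; the subevent values below are invented purely for illustration:

```python
from math import prod

# Hypothetical subevent probabilities for the rule R above
# (the numbers are made up for illustration only):
subevents = [
    0.25,  # P(VBD(dumped,VBD) | VP(dumped,VBD))
    0.90,  # Pl(STOP | VP(dumped,VBD), VBD(dumped,VBD))
    0.05,  # Pr(NP(sacks,NNS) | VP(dumped,VBD), VBD(dumped,VBD))
    0.10,  # Pr(PP(into,P) | VP(dumped,VBD), VBD(dumped,VBD))
    0.80,  # Pr(STOP | VP(dumped,VBD), VBD(dumped,VBD), PP)
]
p_rule = prod(subevents)  # estimate of P(R)
```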
    State of the art  

    Using the PARSEVAL metric for comparing gold-standard and system parse trees (Black et al. 1991), state-of-the-art statistical parsers trained on the Wall Street Journal treebank perform at about 90% precision and 90% recall.