Computational Linguistics

Linguistics 581

Smoothing language models

Smoothing

Note that most of the cells in our original count tables will be zero.

We don't see many of the words in English.
We don't see the huge majority of the bigrams of English.
We see only a tiny sliver of all the possible trigramns

Most of the time our bigram model assigns probability zero to a potential following word:

P(w_n | w_n-1) = |w_n-1w_n| ÷ |w_n |
P(food | want) = |want food| ÷ |want| = 0 ÷ 1215 = 0

Probability zero means it can't happen. But we aren't entitled to reach that conclusion.

Add-one
Smoothing
basic idea
We add one to every cell of this table
We get this table
We recompute our our total occurrences:

I 3437 +1616 =5053

want 1215 + 1616 = 2931

to 3256 + 1616 = 4872

eat 938 + 1616 = 2554

Chinese 213 + 1616 = 1829

food 1506 + 1616 = 3122

lunch 459 + 1616 = 2075

Now we recompute the probabilities:

P(w_n | w_n-1) = |w_n-1w_n| ÷ |w_n |
P(food | want) = |want to| ÷ |want| = 1 ÷ 2931 = .0003
P(to | want) = |want to| ÷ |want| = 787 ÷ 2931 = .27

This gives us this bigram probability table.
Compare this one.
Some things to notice:

The events that used to be zeroes don't all have the same probability.
All the events in the same row that were zeros in the old model get the same probability in the new model.
ALL the non-zero probabilities went down.
Sometimes the change doesn't look very large

P(eat | I)[.0038 -> .0028]
P(I | to)[.00092 -> .00082]

Some very predictable events became less predictable:

P(to|want)[.65-> .22]
P(food|Chinese) [.56 -> .066]

Other probabilities changed by large factors.

P(lunch|Chinese) [.0047 -> .0011]
P(food|want) [.0066 -> .0032]

Likelihood ratios changed

old model: P(I|lunch) = 4 * P(food|lunch)
new model: P(I|lunch) = 2.5 * P(food|lunch)

Conclusion: Increasing the zero probabilities from zero to a small number was good, but the effect on the non-zero probabilities was not always good. We're blurring our original model. (So smoothing is not a bad name for what we're doing; just remember it affects all the probabilities.)

Perhaps we've assigned too much probability to the zeros, with the result that sharply predictable events [P(to|want)] became much less so, and some moderately rare events became very rare.
Perhaps we would like a model that changes the existing model less, but still steals away some probability to assign to the zero events.

What went
Wrong
If we're going to assign the probability to zero-events, the probabilities of others has to go down.
Why? Because the probability of all the possible events we're looking at must add up to 1.
Take the case of want:

Count before smoothing: 1215
Count after smoothing: 1215 + 1616 = 2931
Number of word types not seen to follow want (estimating):

Top 4 words (to, a, some, Thai) = .75 of the probability mass
tokens not in top 4 = 304
(.25 * 1215)
a minimum of 1308 (1612 - 304) words never-before seen to follow want

This means that, in the model, following want, almost half of the probability mass is reserved for unseen events, 1308 events each of which has the probability 1/1308.

1308 ÷ 2931 = .45
Which means the probability of all the previously seen words has to go down precipitously (1.0 -> .55)

It's easy to see what the extreme case would be. Consider the word Francisco (capitalized) and suppose p(Francisco| San) = .5. and suppose but that San was a much rarer word than want, say, with count 100.
We'd still have pretty good evidence that Francisco was extremely likely after San. Our initial model would assign probability .5.
What would happen with add-one smoothing?

Count of San before smoothing: 100
Count after smoothing: 100 + 1616 [Vocab size!] = 1716
Min number of word types not seen to follow San: 1616 - 50 = 1566
Min probability for unseen events: 1566 / 1716 = .91
Francisco took half the probability for seen events in the original model (.5). Now after smoothing
p(Francisco| San) = 50/1716 = .03

Witten Bell
Discounting
The Idea
Key idea; Some words are promiscuous (they occur with a wide variety of words relative to their frequency).
Some are faithful: They occur with a very small number of words given their frequency.
Our fictional example of Francisco was a maximally faithful word. 50 occurrences all followed by the the same word San.
Key Idea: Find a way of measuring word promiscuity. Relativize the amount of probability mass a word receives for zero's to how promiscuous it is.
The more promiscuous a word is, the more word probability mass it receives for following zeroes (the more likely it is that we havent seen all the words that can follow it in any given corpus).

Probability of a new event
How likely is it that tyhe next word we see will be a word we've never seen before

in the corpus as a whole?
In a sub-corpus? [following the word San or (want)]

Probability of seeing a new type:
T ÷ (N + T)
T is the number of observed types. N is the number of words in (sub-)corpus:
N + T = the number of words plus the number of types
Corpus viewed as a set of N + T events.

Unigram
Discounting
For any corpus (sub-corpus), we will use
T ÷ (N + T)
as our estimate of how much probability mass to reserve for zeros.
if we divided this equally, anmd there are Z zero ngrams: each 0 ngram would get this much
T ÷ (Z*(N + T))

Bigram
Discounting

Looking at sub-corpus (for example, the words following San) we compute the probability of seeing a new type following the word San.
This becomes our promiscuity measure.

T(San) ÷ (N(San) + T(San))
The number of word types following San (T(San)) divided by the sum of the number of word tokens following San (= c(San), the count of San) and the number types following San. Suppose we had an absolutely faithful word. Maybe devein was always followed by shrimp in our corpus. And suppose we had 100 tokens of devein. What is the above quantity?
Total prob mass reserved for devein = 1 ÷ (100 + 1)
How about a maximally promiscuous word with the same frequency:
Total prob mass reserved for want = 100 ÷ (100 + 100)

Lets use Z(w) for the count of the words NOT seen to follow w. Then our new conditional probability for an unseen word has to be divided among those Z words:
prob(w'|w) = T(w) ÷ (Z(w)(N(w) + T(w)))
where w' has never been seen to follow w (c(ww')=0).
The tricky thing thing that each probability for a seen bigram has to get reduced by the right amount to make everything add up to 1:
p(w'|w)= c(ww') ÷ (c(w) + T(w))
where w' has never been seen to follow w (c(ww') is bigger than 0).
The bigram counts will add up to c(w) (=N(w)). So the total probability mass for SEEN bigrams following w will be:
N(w) ÷ (N(w) + T(w))
leaving us:
T(w) ÷ (N(w) + T(w))
for the unseen bigrams. And this agrees with the amount we decided to reserve for them!
Berkeley Restaurant example revisited:

Witten-Bell smoothed bigram counts.
Unsmoothed bigram counts.
Add-One smoothed bigram counts.
Smoothed counts: Multiply smoothed probabilities by corpus size N.

Add-one Smoothing basic idea		We add one to every cell of this table We get this table We recompute our our total occurrences: I 3437 +1616 =5053 want 1215 + 1616 = 2931 to 3256 + 1616 = 4872 eat 938 + 1616 = 2554 Chinese 213 + 1616 = 1829 food 1506 + 1616 = 3122 lunch 459 + 1616 = 2075 Now we recompute the probabilities: P(w_n \| w_n-1) = \|w_n-1w_n\| ÷ \|w_n \| P(food \| want) = \|want to\| ÷ \|want\| = 1 ÷ 2931 = .0003 P(to \| want) = \|want to\| ÷ \|want\| = 787 ÷ 2931 = .27 This gives us this bigram probability table. Compare this one. Some things to notice: The events that used to be zeroes don't all have the same probability. All the events in the same row that were zeros in the old model get the same probability in the new model. ALL the non-zero probabilities went down. Sometimes the change doesn't look very large P(eat \| I)[.0038 -> .0028] P(I \| to)[.00092 -> .00082] Some very predictable events became less predictable: P(to\|want)[.65-> .22] P(food\|Chinese) [.56 -> .066] Other probabilities changed by large factors. P(lunch\|Chinese) [.0047 -> .0011] P(food\|want) [.0066 -> .0032] Likelihood ratios changed old model: P(I\|lunch) = 4 * P(food\|lunch) new model: P(I\|lunch) = 2.5 * P(food\|lunch) Conclusion: Increasing the zero probabilities from zero to a small number was good, but the effect on the non-zero probabilities was not always good. We're blurring our original model. (So smoothing is not a bad name for what we're doing; just remember it affects all the probabilities.) Perhaps we've assigned too much probability to the zeros, with the result that sharply predictable events [P(to\|want)] became much less so, and some moderately rare events became very rare. Perhaps we would like a model that changes the existing model less, but still steals away some probability to assign to the zero events.
What went Wrong		If we're going to assign the probability to zero-events, the probabilities of others has to go down. Why? Because the probability of all the possible events we're looking at must add up to 1. Take the case of want: Count before smoothing: 1215 Count after smoothing: 1215 + 1616 = 2931 Number of word types not seen to follow want (estimating): Top 4 words (to, a, some, Thai) = .75 of the probability mass tokens not in top 4 = 304 (.25 * 1215) a minimum of 1308 (1612 - 304) words never-before seen to follow want This means that, in the model, following want, almost half of the probability mass is reserved for unseen events, 1308 events each of which has the probability 1/1308. 1308 ÷ 2931 = .45 Which means the probability of all the previously seen words has to go down precipitously (1.0 -> .55) It's easy to see what the extreme case would be. Consider the word Francisco (capitalized) and suppose p(Francisco\| San) = .5. and suppose but that San was a much rarer word than want, say, with count 100. We'd still have pretty good evidence that Francisco was extremely likely after San. Our initial model would assign probability .5. What would happen with add-one smoothing? Count of San before smoothing: 100 Count after smoothing: 100 + 1616 [Vocab size!] = 1716 Min number of word types not seen to follow San: 1616 - 50 = 1566 Min probability for unseen events: 1566 / 1716 = .91 Francisco took half the probability for seen events in the original model (.5). Now after smoothing p(Francisco\| San) = 50/1716 = .03
Witten Bell Discounting The Idea		Key idea; Some words are promiscuous (they occur with a wide variety of words relative to their frequency). Some are faithful: They occur with a very small number of words given their frequency. Our fictional example of Francisco was a maximally faithful word. 50 occurrences all followed by the the same word San. Key Idea: Find a way of measuring word promiscuity. Relativize the amount of probability mass a word receives for zero's to how promiscuous it is. The more promiscuous a word is, the more word probability mass it receives for following zeroes (the more likely it is that we havent seen all the words that can follow it in any given corpus).
Probability of a new event		How likely is it that tyhe next word we see will be a word we've never seen before in the corpus as a whole? In a sub-corpus? [following the word San or (want)] Probability of seeing a new type: T ÷ (N + T) T is the number of observed types. N is the number of words in (sub-)corpus: N + T = the number of words plus the number of types Corpus viewed as a set of N + T events.
Unigram Discounting		For any corpus (sub-corpus), we will use T ÷ (N + T) as our estimate of how much probability mass to reserve for zeros. if we divided this equally, anmd there are Z zero ngrams: each 0 ngram would get this much T ÷ (Z*(N + T))
Bigram Discounting		Looking at sub-corpus (for example, the words following San) we compute the probability of seeing a new type following the word San. This becomes our promiscuity measure. T(San) ÷ (N(San) + T(San)) The number of word types following San (T(San)) divided by the sum of the number of word tokens following San (= c(San), the count of San) and the number types following San. Suppose we had an absolutely faithful word. Maybe devein was always followed by shrimp in our corpus. And suppose we had 100 tokens of devein. What is the above quantity? Total prob mass reserved for devein = 1 ÷ (100 + 1) How about a maximally promiscuous word with the same frequency: Total prob mass reserved for want = 100 ÷ (100 + 100) Lets use Z(w) for the count of the words NOT seen to follow w. Then our new conditional probability for an unseen word has to be divided among those Z words: prob(w'\|w) = T(w) ÷ (Z(w)(N(w) + T(w))) where w' has never been seen to follow w (c(ww')=0). The tricky thing thing that each probability for a seen bigram has to get reduced by the right amount to make everything add up to 1: p(w'\|w)= c(ww') ÷ (c(w) + T(w)) where w' has never been seen to follow w (c(ww') is bigger than 0). The bigram counts will add up to c(w) (=N(w)). So the total probability mass for SEEN bigrams following w will be: N(w) ÷ (N(w) + T(w)) leaving us: T(w) ÷ (N(w) + T(w)) for the unseen bigrams. And this agrees with the amount we decided to reserve for them! Berkeley Restaurant example revisited: Witten-Bell smoothed bigram counts. Unsmoothed bigram counts. Add-One smoothed bigram counts. Smoothed counts: Multiply smoothed probabilities by corpus size N.