Linguistics 596

Probabilistic Models of Spelling and Pronunciation


Probability models as ignorance models

One important use of probabilistic models is to predict some phenomenon when we don't know all the factors that can influence it, or when we do know them, but they are much too complicated to build a precise model of.

Exactly the situation with language!

2-feature sample space

                     Manner
                  FRIC        -FRIC
Voicing   -VCD    f, th, s    p, t, k
           VCD    v, dh, z    b, d, g

Imagine a random sample of English consonants, clipped from out of streams of real speech.

cardinality

Let:

  1. T = Total set of consonants
  2. VCD = Total set of voiced consonants
  3. -VCD = Total set of unvoiced consonants
  4. FRIC = Total set of fricative consonants
  5. -FRIC = Total set of non-fricative consonants (stops, liquids, affricates, ...)

We define |VCD| as the number of elements in the set of voiced sounds (also called the cardinality of the set).

Frequency
Interpretation

    Prob(VCD) is a number x between 0 and 1
As you take larger and larger samples of events, the ratio of voiced events to the total number of events will tend to approach x.
    Prob(VCD) =~ |VCD| ÷|T|
Where =~ means "approximately equals".

This is a frequency interpretation of probability.
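The frequency interpretation can be simulated. A minimal sketch, assuming an invented "true" voicing probability of 0.4 just for illustration:

```python
import random

random.seed(0)

TRUE_P_VCD = 0.4   # invented "true" probability of a voiced consonant

def sample_ratio(n):
    """Fraction of voiced outcomes in a sample of n draws."""
    voiced = sum(random.random() < TRUE_P_VCD for _ in range(n))
    return voiced / n

# Larger and larger samples: the ratio tends toward 0.4
for n in (10, 100, 10000):
    print(n, sample_ratio(n))
```

As n grows, the printed ratios settle near 0.4, which is exactly the frequentist claim above.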

Sample
Space
 

The mathematical set of possible outcomes of the sort we care about, often called Omega.

When we are talking about VCD, -VCD, there are two outcomes. The sample space is

    Omega = {VCD, -VCD}
When we are talking about single coin tossings the set of outcomes is Head or Tails (H or T):
    Omega = { H, T}
When we are talking about sequences of 3 coin tosses, there are 8 possible outcomes:
    Omega = {HHH,HHT, HTH, HTT,THH, THT, TTH, TTT}
When we are talking about voicing contrasts and 3 different consonant choices, the sample space is:
    Omega = {VVV,VVN, VNV, VNN,NVV, NVN, NNV, NNN}
Where V represents a voiced outcome and N represents a nonvoiced outcome.
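The sample spaces above can be enumerated mechanically; a sketch:

```python
from itertools import product

# Sample space for sequences of 3 coin tosses
omega_coins = ["".join(seq) for seq in product("HT", repeat=3)]
print(omega_coins)   # ['HHH', 'HHT', 'HTH', 'HTT', 'THH', 'THT', 'TTH', 'TTT']

# Sample space for voicing outcomes over 3 consonant choices
omega_voicing = ["".join(seq) for seq in product("VN", repeat=3)]
print(len(omega_voicing))   # 8 possible outcomes
```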

Contrast the sample space with a sample, some actual set of events gotten by experiment, by sorting through public records, or by blind luck.

Each outcome we observe will be an element of the sample space. Suppose we run the 3-toss experiment 8 times and each time we get HHH. Hmm. It's beginning to look like the right probability model is:

Prob(TTT) = 0
Prob(TTH) = 0
Prob(THT) = 0
Prob(THH) = 0
Prob(HTT) = 0
Prob(HTH) = 0
Prob(HHT) = 0
Prob(HHH) = 1
In other words the coin is biased. Of course we're not SURE this is the right probability model. We might just have been unlucky....

With any real sample we can ask the question: Is it biased? That is, has it been chosen in some way that changes the probabilistic properties of the space. Is the coin weighted? Did we get our consonant corpus from an opera singer?

The general strategy of the frequency model definition: The larger our set of samples, the less the chance of bias.

Probability
Distribution
 

Prob is called a probability distribution function. It assigns a number between 0 and 1 to a set of events in the sample space.

Conditional
Probability
 

As our example suggests, we want to be able to talk about multiple kinds of events simultaneously. That is, with heads and tails there is only one property to worry about. With consonants there are many, VCD/-VCD, FRIC/-FRIC, and so on.

Now think of the sample space as different consonant utterances and think of VCD as some subset of those utterances and FRIC as some other subset.

Now we can talk about all the following:

  1. Prob(VCD) =~ |VCD| ÷|T|
  2. Prob(FRIC) =~ |FRIC| ÷|T|
  3. Prob(VCD, FRIC) = Prob(VCD Intersect FRIC) =~ |VCD Intersect FRIC| ÷ |T|

We also want to talk about conditional probabilities. By definition:

  • Prob(VCD | FRIC) = Prob(VCD, FRIC) ÷ Prob(FRIC)
Read this as "the probability of VCD given FRIC". This is the probability of a sound being voiced given that you know it's a fricative.

Conditional probabilities can be thought of as Probabilities relativized to subsets of the sample space. Start with the total sample space, restrict it to fricatives, use the same frequentist interpretation of probability, and you get the above formula. Here's why:

    Prob(VCD, FRIC) ÷ Prob(FRIC) =
    (|VCD Intersect FRIC| ÷ |T|) ÷ (|FRIC| ÷ |T|) =
    (|VCD Intersect FRIC| ÷ |T|) * (|T| ÷ |FRIC|) =
    |VCD Intersect FRIC| ÷ |FRIC|

Note and think about the following distinction:

  1. Prob(VCD | FRIC) = Prob(VCD,FRIC) ÷ Prob(FRIC) = (|VCD Intersect FRIC| ÷ |FRIC|)
  2. Prob(VCD, FRIC) = Prob(VCD Intersect FRIC) =~ |VCD Intersect FRIC| ÷|T|
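The distinction can be made concrete with a toy sample; the counts below are invented for illustration:

```python
# A toy sample of 100 consonant tokens, cross-classified (invented counts)
counts = {
    ("VCD", "FRIC"): 12,
    ("VCD", "-FRIC"): 28,
    ("-VCD", "FRIC"): 18,
    ("-VCD", "-FRIC"): 42,
}
total = sum(counts.values())   # |T| = 100

# Joint probability: relative to the whole sample space
p_joint = counts[("VCD", "FRIC")] / total                        # 0.12
# Conditional probability: relative to the FRIC subset only
p_fric = (counts[("VCD", "FRIC")] + counts[("-VCD", "FRIC")]) / total
p_vcd_given_fric = p_joint / p_fric                              # 12/30 = 0.4

print(p_joint, p_vcd_given_fric)
```

The joint probability divides by the whole space; the conditional divides by the FRIC subset, so the two numbers differ.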
Chain Rule

This definition immediately leads to something called the chain rule.

  1. Prob(VCD | FRIC) = Prob(VCD, FRIC) ÷ Prob(FRIC)
  2. Prob(VCD,FRIC) = Prob(VCD | FRIC) * Prob(FRIC)

independence  

Now one thing the chain rule leads to is an account of a special case. Suppose

  • Prob(VCD | FRIC) = Prob(VCD).
In this very special case, we call the features (or events in probability talk) FRIC and VCD independent. In this very special case (and ONLY then):
  • Prob(VCD,FRIC) = Prob(VCD) * Prob(FRIC)
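A quick numerical check, with counts invented so that independence holds exactly:

```python
# 100 invented tokens arranged so voicing and manner are independent
n = {("VCD", "FRIC"): 20, ("VCD", "-FRIC"): 20,
     ("-VCD", "FRIC"): 30, ("-VCD", "-FRIC"): 30}
total = sum(n.values())

p_vcd = (n[("VCD", "FRIC")] + n[("VCD", "-FRIC")]) / total    # 0.4
p_fric = (n[("VCD", "FRIC")] + n[("-VCD", "FRIC")]) / total   # 0.5
p_joint = n[("VCD", "FRIC")] / total                          # 0.2

# Independence: P(VCD, FRIC) = P(VCD) * P(FRIC)
print(abs(p_joint - p_vcd * p_fric) < 1e-12)   # True
```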
Bayes' rule

The chain rule is symmetric:

  1. Prob(VCD,FRIC) = Prob(VCD | FRIC) * Prob(FRIC)
  2. Prob(VCD,FRIC) = Prob(FRIC | VCD) * Prob(VCD)

So it follows

Prob(VCD | FRIC) * Prob(FRIC) = Prob(FRIC | VCD) * Prob(VCD)

This is called Bayes' Rule.

Bayes' Rule is often written in this form:

  • Prob(A | B) = Prob(B | A) * Prob(A) ÷ Prob(B)


Bayes' Rule: An example

Let's imagine we have a contagious disease and a test for the disease. The facts are the following:

  1. 1% of the population has the disease: Prob(D) = .01
  2. The test never misses the disease: Prob(P | D) = 1.0
  3. The test has a 5% false positive rate: Prob(P | H) = .05

(D = diseased, H = healthy, P = positive test result.)

Our question is a policy question. Do we quarantine subjects who have a positive test result?

We're interested in the probability that a subject has the disease GIVEN a positive result. We use Bayes' rule:

    Prob(D | P) = Prob(P | D) * Prob(D) ÷ Prob(P)

The facts of the problem directly give us several of the numbers on the right hand side:

  1. Prob(P |D) = 1.0
  2. Prob(D) = .01

We can compute Prob(P) as follows:

    Prob(P) = Prob(P | D) * Prob(D) + Prob(P | H) * Prob(H)
            = 1.0 * .01 + .05 * .99
            =~ .06

We now have all the numbers to plug into Bayes' rule:

    Prob(D | P) = Prob(P | D) * Prob(D) ÷ Prob(P) =~ 1.0 * .01 ÷ .06 =~ .17

Conclusion: A positive test result gives us a little better than a one in six chance that the subject has the disease!

What's going on? Rounding off some (since Prob(P | D) = 1.0):

    Prob(D | P) =~ Prob(D) ÷ Prob(P) = .01 ÷ .06 =~ 1/6

The ratio of the disease rate to the positive rate is what shrinks our result.

The large size of Prob(P) is due mostly to the false positive rate.

Conclusion: This test has WAY too large a false positive rate for a disease this rare.
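The arithmetic of the example, in code:

```python
# Disease example: P(D) = .01, P(P | D) = 1.0, false positive rate .05
p_d = 0.01
p_pos_given_d = 1.0
p_pos_given_healthy = 0.05

# Total probability of a positive result
p_pos = p_pos_given_d * p_d + p_pos_given_healthy * (1 - p_d)

# Bayes' rule: P(D | P) = P(P | D) * P(D) / P(P)
p_d_given_pos = p_pos_given_d * p_d / p_pos

print(round(p_pos, 4), round(p_d_given_pos, 3))   # 0.0595 0.168
```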

 
A Similar Test Population (300)

    Of 300 subjects, 3 have the disease (D) and 297 are healthy (H).
    All 3 diseased subjects test positive (bonafide positives, P).
    15 healthy subjects test positive (false positives).
    The remaining 282 subjects test negative (N).

False P = 15/297 =~ 5 %
Prob(H|P) = 15/18 = 83 1/3 %
Prob(D|P) = 3/18 = 16 2/3 %


The noisy channel model

Components of the noisy channel model

  1. An observation O
  2. A word w to which it corresponds
  3. A noisy channel C which may distort w.
  4. A decoder responsible for finding the most likely w given O. Note: because of noise, the same O may potentially correspond to many different w.

Some applications of the noisy channel model:

  1. Speech recognition. O is an acoustic signal. The word w is the word the speaker actually uttered.
  2. Orthography. O is a sequence of letters (possibly misspelled). The word w is the word the writer actually intended.
  3. Machine Translation. Say we are translating from English to French. O is a sentence in English. We view O as a distorted version of a French signal. Thus w is the French sentence we are trying to "recover."

We use a probabilistic model

    w = argmax P(O | w) * P(w)
        w in V

where P(O | w) is the likelihood and P(w) is the prior.

A balance must be struck:

  1. Rareness of word (small prior) may outweigh high likelihood
  2. Commonness of word (high prior) may outweigh low likelihood
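The balance can be sketched as a tiny decoder. All numbers here are invented for illustration, for an observation like the typo "acress":

```python
# Invented likelihoods P(O | w) and priors P(w) for a 3-word vocabulary
likelihood = {"actress": 0.000117, "cress": 0.0000005, "across": 0.000286}
prior =      {"actress": 0.0000315, "cress": 0.00000001, "across": 0.000318}

def decode(vocab):
    """Pick the word w maximizing P(O | w) * P(w)."""
    return max(vocab, key=lambda w: likelihood[w] * prior[w])

print(decode(likelihood))   # "across": its common prior outweighs the rest
```

On these made-up numbers the common word "across" wins even though "actress" has a respectable likelihood: the prior tips the balance, just as described above.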

Effects of the channel and effects of the message space

Spelling errors may be due to different kinds of transmission constraints (typographic and cognitive constraints in typing versus visual similarity constraints in OCR).

An OCR example:

Resemblance of characters is a critical factor in determining confusability. And different fonts will have different confusability issues.

A typing example:

Closeness of keys to one another enters in. Also motor control issues (deletion, letter transposition).

Another way of looking at likelihoods and priors:

  1. Channel modeling (likelihood model): Build a model of how signals are realized given the properties of the channel. (an error or distortion model)
  2. Message modeling (prior probability model): Build a model of what messages look like.

We will take a close look at how both kinds of probabilistic models work in two domains:

  1. Spelling correction
  2. Speech recognition:

Spelling Correction

We apply the Bayesian method to the problem of spelling correction first using an error model.

  1. Observation = t (for "typo")
  2. Word = c (for "correction"):

We're looking for the correction "c" which maximizes the probability of "c" given "t"; by Bayes' rule, that is the "c" which maximizes P(t | c) * P(c). So we bring in our probabilistic model thus:

As promised we will thus have two models:

  1. Likelihood model: P(t | c)
  2. Prior model: P(c)

Prior Model
P(c)
 

  1. Size of corpus (textbook): 44 Million
  2. For each word c, count occurrences to get its frequency (freq(c)).
  3. Divide by 44 Million to get its probability (P(c)).
    c        freq(c)   p(c)
    actress  1343      .0000315
    cress    0         0
    caress   4         .0000001
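The prior estimate is just a relative frequency; a sketch using the counts from the table:

```python
N = 44_000_000   # corpus size from the notes

freq = {"actress": 1343, "cress": 0, "caress": 4}
p = {c: f / N for c, f in freq.items()}

print(p["actress"])   # ~3.05e-05 (the table's .0000315 may reflect a
                      # slightly smoothed estimate)
print(p["caress"])    # ~9.1e-08, i.e. about .0000001
```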

Likelihood Model
P(t | c)

 

For the P(t | c) model, we first classify errors.

We estimate P(t | c) using 4 confusion matrices, one each for deletion, insertion, substitution, and transposition.

Using "[ c => t ]" for the number of times that the correct string "c" was realized as typo "t" we define 4 different counts:

  1. del[x,y] = [xy => x]
  2. ins[x,y] = [x => xy]
  3. sub[x,y] = [x => y]
  4. trans[x,y] = [xy => yx]

We estimate the probability of an error of a certain type as the number of times the error occurred divided by the total number of occurrences of the correction.

For a deletion of "y" following "x" ("[xy => x]"), for example, we estimate P(t | c) as follows:

    P(x | xy) =~ del[x,y] ÷ freq(xy)
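The deletion estimate can be sketched directly; the counts below are invented, not taken from any real confusion matrix:

```python
# Invented confusion counts: times correct "ct" was typed as just "c"
del_counts = {("c", "t"): 117}
# Invented corpus counts: times the bigram "ct" occurred
bigram_freq = {"ct": 4000}

def p_deletion(x, y):
    """Estimate P(x | xy): probability that y is deleted after x."""
    return del_counts.get((x, y), 0) / bigram_freq[x + y]

print(p_deletion("c", "t"))   # 117/4000 = 0.02925
```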

Table of final probabilities

Sparseness  

The sparseness of the raw probability table makes it unusable.

Problems:

  1. Too many zeroes. The zeroes in our table may sometimes be things that never occur, but among them there are certainly many things that do occur but which we didn't happen to see in our 44 Million words of data.
  2. The things that have count 1 in our data have a very similar problem. They may be extremely rare events that we just happen to have seen. But among them there are certainly much more common events which we happen to have seen only once.
  3. And so on for all events that only occurred a small number of times (2,3,4,.....)
  4. This is a general problem for all probabilistic modeling. We return to give it a serious treatment later. Two general strategies:
    1. Find some way of collapsing the data into larger classes so that we are dealing with event classes that have larger counts.
    2. Find some way of adjusting the probabilities of rare events uniformly that gives us better estimates of their probabilities (given that we know little about their true probabilities).
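The second strategy can be illustrated with the simplest uniform adjustment, add-one (Laplace) smoothing; the counts are invented:

```python
# Invented counts over V = 3 distinct event types
counts = {"a": 5, "b": 1, "c": 0}
N = sum(counts.values())   # 6 observed events
V = len(counts)            # 3 event types

def p_mle(e):
    """Raw relative-frequency estimate: zero for unseen events."""
    return counts[e] / N

def p_laplace(e):
    """Add-one smoothing: every event gets a nonzero probability."""
    return (counts[e] + 1) / (N + V)

print(p_mle("c"), p_laplace("c"))   # 0.0 vs 1/9, roughly 0.111
```

The smoothed estimate steals a little probability mass from the frequent events and spreads it over the rare and unseen ones.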
Cost-based
Model
 

Instead of a raw probabilistic model, we will pursue various muted alternatives.

The general idea: We pursue the concept of cost, which may or may not be probability-based. We look for solutions that minimize costs.


Minimum Edit Distance

We investigate the idea of edit distance. An edit distance is a special case of a cost: minimize it to find the best solution. Problem: We find the form acress. This could be a misspelling of any of a number of different words:

  1. actress
  2. cress
  3. caress
  4. access
  5. across
  6. acres
Many of these involve multiple errors. We don't actually have a model of multiple errors, even if we fix the sparseness problems in our probability model. We can't just multiply probabilities (or can we?)

Let us for the moment set the details of probability aside and just deal with approaches that incorporate the following assumption

    Two errors are less likely than one.

We imagine a great orthographic space populated by text strings. These text strings are at different distances from each other.

We call the misspelling in the text T (the target) and we call a potential word that might have given rise to this misspelling S (a candidate source, in the metaphor of the Noisy Channel Model).

Edit Distance between two strings T and S

    Sum of the cost of the individual editing operations needed to transform T into S
[Note: Even though we say SUM, we might still be using probabilities. How?]

Minimum editing distance between T and S.

  1. First assign a cost to each editing operation
      Alternative Levenshtein Distance (ALD): The operations insertion and deletion cost 1. Substitution (a combined deletion and insertion) costs 2, except in the case of substituting a character for itself, which costs 0.
    Note: We could use probabilities as well, using the model for Prob(T | S) sketched above. For simplicity we use ALD.
  2. Now compute the cost of a sequence of operations that take you from T to S.
  3. The ALD between T and S is the minimum cost possible for a sequence of editing operations that transform T into S.

Essential intuition of the algorithm. Each path from T to S goes through a sequence of intermediate strings. If Si lies on the optimal path P from T to S, then the sequence leading from T to Si must also be optimal.

We now need ways of computing the shortest distance between any target and any source: shortest_distance(s,t).

  1. We let concatenation be represented by "+".
  2. We give an inductive definition of path cost.
    1. shortest_distance(w, w) = 0
        [special case: shortest_distance(ε, ε) = 0]
    2. PathCost_ins(w, w'+x) = shortest_distance(w, w') + ins-cost(w', w'+x)
    3. PathCost_del(w+x, w') = shortest_distance(w, w') + del-cost(w+x, w)
    4. PathCost_subst(w+x, w'+y) = shortest_distance(w, w') + subst-cost(w+x, w'+y)
  3. Example
      source: linguis    target: lingusi
    1. PathCost_ins(linguis, lingus+i) = shortest_distance(linguis, lingus) + ins-cost(lingus, lingusi)
        shortest_distance(linguis, lingus):
        Source  l i n g u i s
        Target  l i n g u _ s
        Cost    0 0 0 0 0 1 0
      Note: shortest_distance(linguis, lingus) will involve a deletion (of "i")
  4. PathCost_del(lingui+s, lingusi) = shortest_distance(lingui, lingusi) + del-cost(linguis, lingui)
      shortest_distance(lingui, lingusi):
      Source  l i n g u _ i
      Target  l i n g u s i
      Cost    0 0 0 0 0 1 0
  5. PathCost_subst(lingui+s, lingus+i) = shortest_distance(lingui, lingus) + subst-cost(linguis, lingusi)
      shortest_distance(lingui, lingus):
      Source  l i n g u i
      Target  l i n g u s
      Cost    0 0 0 0 0 2
  • We define shortest_distance as the minimum of the 3 path costs:
              shortest_distance(w, w') =
                    Min { PathCost_ins(w, w'),
                          PathCost_del(w, w'),
                          PathCost_subst(w, w') }

    Computation of a minimum edit distance

    Partial ALD table for source "ex" and target "in" (the arrows in the
    original figure mark insert, delete, and substitute steps):

        Target  n | 2  3  4
                i | 1  2  3
                # | 0  1  2
                    #  e  x
                    Source
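The recurrence above can be sketched as a short dynamic-programming routine; a minimal implementation of ALD (insert/delete cost 1, substitution cost 2, 0 for a match):

```python
def ald(source, target):
    """Alternative Levenshtein Distance between two strings."""
    n, m = len(source), len(target)
    # d[i][j] = ALD between source[:i] and target[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i            # delete all source characters
    for j in range(1, m + 1):
        d[0][j] = j            # insert all target characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[n][m]

print(ald("linguis", "lingusi"))   # 2: the "is"/"si" transposition costs
                                   # a deletion plus an insertion under ALD
```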

    Different alignments, different costs:

    target  i r t i o n
    source  i _ t i o n    cost 1
    source  _ i t i o n    cost 6

    The algorithm