One important use of probabilistic models arises when we try to predict some phenomenon but don't know all the factors that can influence it, or we do know them but they are much too complicated to build a precise model of.
Exactly the situation with language!
| 2-feature sample space |  | Manner |  |
|---|---|---|---|
|  |  | FRIC | -FRIC |
| Voicing | -VCD | f, th, s | p, t, k |
|  | VCD | v, dh, z | b, d, g |
Imagine a random sample of English consonants, clipped out of streams of real speech.
**Cardinality**

Let |VCD| be the number of elements in the set of voiced sounds (also called the cardinality of the set).
**Frequency Interpretation**

We estimate Prob(VCD) as |VCD| ÷ |T|, where T is the whole sample. This is a frequency interpretation of probability.
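As a quick illustration of the frequentist idea, here is a minimal sketch in Python; the sample of consonant tokens and the VCD set are made up for illustration, not taken from the notes.

```python
# Hypothetical sample of consonant tokens clipped out of speech (illustrative only).
sample = ["p", "v", "s", "b", "t", "z", "f", "d", "k", "v", "s", "g"]

VCD = {"b", "d", "g", "v", "dh", "z"}            # voiced consonants
T = len(sample)                                  # |T|: size of the whole sample
vcd_count = sum(1 for c in sample if c in VCD)   # |VCD| within the sample

# Frequency interpretation: Prob(VCD) is estimated as |VCD| / |T|.
print(f"|VCD| = {vcd_count}, |T| = {T}, Prob(VCD) = {vcd_count / T:.2f}")
```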
**Sample Space**

The mathematical set of possible outcomes of the sort we care about, often called Omega (Ω). When we are talking about VCD and -VCD, there are two outcomes, and the sample space is {VCD, -VCD}.
Contrast the sample space with a sample: some actual set of events gotten by experiment, by sorting through public records, or by blind luck. Each outcome in a sample will be an element of the sample space. Suppose we do 8 experiments with the coin and each time we get HHH. Hmm. It's beginning to look like the right probability model is:

Prob(TTT) = 0, Prob(TTH) = 0, Prob(THT) = 0, Prob(THH) = 0, Prob(HTT) = 0, Prob(HTH) = 0, Prob(HHT) = 0, Prob(HHH) = 1

In other words, the coin is biased. Of course we're not SURE this is the right probability model. We might just have been unlucky.... With any real sample we can ask the question: Is it biased? That is, has it been chosen in some way that changes the probabilistic properties of the space? Is the coin weighted? Did we get our consonant corpus from an opera singer? The general strategy of the frequency model definition: the larger our set of samples, the less the chance of bias.
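A small simulation sketch of the point about sample size, assuming a fair three-flip coin experiment (the batch sizes 8 and 8000 are arbitrary choices for illustration):

```python
import random

def sample_triples(n, p_heads=0.5):
    """Run n three-flip experiments and count how often each outcome in the sample space occurs."""
    counts = {}
    for _ in range(n):
        outcome = "".join("H" if random.random() < p_heads else "T" for _ in range(3))
        counts[outcome] = counts.get(outcome, 0) + 1
    return counts

# With only 8 experiments the relative frequencies can look wildly biased;
# with 8000 they settle near the true value of 1/8 for each of the 8 outcomes.
for n in (8, 8000):
    counts = sample_triples(n)
    print(n, {k: round(v / n, 3) for k, v in sorted(counts.items())})
```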
**Probability Distribution**

Prob is called a probability distribution function. It assigns a number between 0 and 1 to a set of events in the sample space.
**Conditional Probability**
As our example suggests, we want to be able to talk about multiple kinds of events simultaneously. That is, with heads and tails there is only one property to worry about. With consonants there are many, VCD/-VCD, FRIC/-FRIC, and so on. Now think of the sample space as different consonant utterances and think of VCD as some subset of those utterances and FRIC as some other subset. Now we can talk about all the following:
We also want to talk about conditional probabilities. By definition:

Prob(VCD | FRIC) = Prob(VCD ∩ FRIC) ÷ Prob(FRIC)

Conditional probabilities can be thought of as probabilities relativized to subsets of the sample space. Start with the total sample space, restrict it to fricatives, use the same frequentist interpretation of probability, and you get the above formula. Here's why:

Prob(VCD | FRIC)
= (|VCD ∩ FRIC| ÷ |T|) ÷ (|FRIC| ÷ |T|)
= (|VCD ∩ FRIC| ÷ |T|) × (|T| ÷ |FRIC|)
= |VCD ∩ FRIC| ÷ |FRIC|

Note and think about the distinction between Prob(VCD | FRIC) and Prob(FRIC | VCD).
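Here is a counting sketch of the derivation above, reusing the same made-up sample and feature sets as in the earlier snippet (illustrative data only):

```python
sample = ["p", "v", "s", "b", "t", "z", "f", "d", "k", "v", "s", "g"]
VCD = {"b", "d", "g", "v", "dh", "z"}
FRIC = {"f", "th", "s", "v", "dh", "z"}

T = len(sample)
fric = sum(1 for c in sample if c in FRIC)                       # |FRIC|
vcd_and_fric = sum(1 for c in sample if c in VCD and c in FRIC)  # |VCD ∩ FRIC|

# Prob(VCD | FRIC) = |VCD ∩ FRIC| / |FRIC|, which equals (|VCD ∩ FRIC|/|T|) / (|FRIC|/|T|).
p_conditional = vcd_and_fric / fric
p_via_joint = (vcd_and_fric / T) / (fric / T)
assert abs(p_conditional - p_via_joint) < 1e-12
print(f"Prob(VCD | FRIC) = {p_conditional:.2f}")
```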
**Chain Rule**
This definition immediately leads to something called the chain rule:

Prob(VCD ∩ FRIC) = Prob(VCD | FRIC) × Prob(FRIC)
**Independence**
Now one thing the chain rule leads to is an account of a special case. Suppose

Prob(VCD | FRIC) = Prob(VCD)

that is, knowing that a consonant is a fricative tells us nothing about whether it is voiced. Then by the chain rule Prob(VCD ∩ FRIC) = Prob(VCD) × Prob(FRIC), and we say the two events are independent.
**Bayes' Rule**
The chain rule is symmetric: we can expand the joint probability in either order.

Prob(VCD ∩ FRIC) = Prob(VCD | FRIC) × Prob(FRIC)
Prob(VCD ∩ FRIC) = Prob(FRIC | VCD) × Prob(VCD)
So it follows that Prob(VCD | FRIC) × Prob(FRIC) = Prob(FRIC | VCD) × Prob(VCD). This is called Bayes' Rule. Bayes' Rule is often written in this form:

Prob(VCD | FRIC) = Prob(FRIC | VCD) × Prob(VCD) ÷ Prob(FRIC)
Let's imagine we have a contagious disease and a test for the disease. The facts are the following:

- 1 in 100 subjects (1%) has the disease.
- A subject who has the disease always tests positive.
- 5% of subjects who do not have the disease also test positive (false positives).
Our question is a policy question. Do we quarantine subjects who have a positive test result?
We're interested in the probability that a subject has the disease GIVEN a positive result. We use Bayes' rule:

Prob(D | P) = Prob(P | D) × Prob(D) ÷ Prob(P)
The facts of the problem directly give us several of the numbers on the right hand side: Prob(P | D) = 1, Prob(D) = 0.01, and Prob(P | not-D) = 0.05.
We can compute Prob(P) as follows:

Prob(P) = Prob(P | D) × Prob(D) + Prob(P | not-D) × Prob(not-D) = (1 × 0.01) + (0.05 × 0.99) ≈ 0.06
We now have all the numbers to plug into Bayes' rule:

Prob(D | P) = (1 × 0.01) ÷ 0.06 ≈ 0.17
What's going on? Rounding off some: Prob(D | P) ≈ 0.01 ÷ 0.06 ≈ 1/6, so only about 17% of the subjects who test positive actually have the disease.
The large size of Prob(P) is due mostly to the false positive rate.
Conclusion: This test has WAY too large a false positive rate for a disease this rare.
| Symbol | Meaning |
|---|---|
| D | bonafide positive (has the disease, tests positive) |
| H | false positive (healthy, tests positive) |
| N | negative |
| P | positive result of either kind (D or H) |

(Grid of 300 subjects omitted: 3 cells marked D, 15 marked H, and the rest marked N.)

| False P = 15/297 ≈ 5% |
|---|
| Prob(H \| P) = 15/18 = 83 1/3% |
| Prob(D \| P) = 3/18 = 16 2/3% |
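The same computation can be sketched in code, using the counts recoverable from the grid above (300 subjects, 3 bonafide positives, 15 false positives):

```python
# Counts from the grid: 300 subjects, 3 diseased (all of whom test positive),
# and 15 healthy subjects who nonetheless test positive.
total, diseased, false_positives = 300, 3, 15

p_d = diseased / total                                      # Prob(D) = 0.01
p_pos_given_d = 1.0                                         # every diseased subject tests positive
p_pos_given_healthy = false_positives / (total - diseased)  # ≈ 0.05

# Total probability for Prob(P), then Bayes' rule for Prob(D | P).
p_pos = p_pos_given_d * p_d + p_pos_given_healthy * (1 - p_d)
p_d_given_pos = p_pos_given_d * p_d / p_pos

print(f"Prob(P) = {p_pos:.3f}")              # 0.060
print(f"Prob(D | P) = {p_d_given_pos:.3f}")  # 0.167, i.e. 16 2/3 %
```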
Components of the noisy channel model
Some applications of the noisy channel model:
We use a probabilistic model:

| w | = | argmax over w in V | P(O \| w) | P(w) |
|---|---|---|---|---|
|   |   |   | Likelihood | Prior |
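A schematic sketch of this argmax; the vocabulary, the observation, and the toy probabilities below are assumptions made purely for illustration, standing in for real likelihood and prior models:

```python
# Toy vocabulary and placeholder models. In a real system the likelihood would come
# from an error (or acoustic) model and the prior from a language model.
VOCAB = ["their", "there", "three"]

def likelihood(observation, word):
    """P(O | w): toy numbers, assumed for illustration only."""
    toy = {("ther", "their"): 0.3, ("ther", "there"): 0.4, ("ther", "three"): 0.1}
    return toy.get((observation, word), 0.01)

def prior(word):
    """P(w): toy unigram probabilities, assumed for illustration only."""
    toy = {"their": 0.4, "there": 0.5, "three": 0.1}
    return toy.get(word, 0.0)

def decode(observation):
    """w = argmax over w in V of P(O | w) * P(w)."""
    return max(VOCAB, key=lambda w: likelihood(observation, w) * prior(w))

print(decode("ther"))   # "there" under these toy numbers
```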
A balance must be struck between the likelihood and the prior:
Spelling errors may be due to different kinds of transmission constraints (typographic and cognitive constraints in typing versus visual similarity constraints in OCR).
An OCR example:
A typing example:
Another way of looking at likelihoods and priors:
We will take a close look at how both kinds of probabilistic models work in two domains:
We apply the Bayesian method to the problem of spelling correction first using an error model.
We're looking for the correction "c" that is most probable given the typo "t". So we bring in our probabilistic model thus:

c = argmax over candidate corrections c of P(t | c) × P(c)
As promised we will thus have two models:
**Prior Model P(c)**
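A minimal sketch of one way to build a prior, assuming P(c) is estimated from relative word frequencies in a corpus with add-one smoothing (the tiny corpus and candidate list below are made up for illustration):

```python
from collections import Counter

def make_prior(corpus_tokens, candidates):
    """Estimate P(c) as a smoothed relative frequency of each candidate in the corpus."""
    counts = Counter(corpus_tokens)
    n, v = len(corpus_tokens), len(candidates)
    # Add-one smoothing keeps unseen candidates from getting probability 0.
    return {c: (counts[c] + 1) / (n + v) for c in candidates}

corpus = "the actress walked across the road to the acres of cress".split()
candidates = ["actress", "cress", "caress", "access", "across", "acres"]
prior = make_prior(corpus, candidates)
for c in candidates:
    print(c, round(prior[c], 3))
```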
**Likelihood Model**
For the P(t | c) model, we first classify errors. We estimate P(t | c) using 4 confusion matrices, one each for deletion, substitution, insertion, and transposition. Using "[xy => x]"-style notation for the number of times that the correct string was realized as the typo, we define 4 different counts:

We estimate the probability of an error of a certain type as the number of times the error occurred divided by the total number of occurrences of the correction. For a deletion of "y" following "x" ("[xy => x]"), for example, we estimate P(t | c) as follows:

P(t | c) = [xy => x] ÷ count of "xy" in the corpus
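A sketch of the deletion case, assuming we have a table of deletion confusion counts ([xy => x]) and bigram counts for the correct strings; the variable names and numbers are illustrative assumptions:

```python
# del_counts["xy"] = number of times correct "xy" was realized as just "x"  ([xy => x])
# bigram_counts["xy"] = number of times "xy" occurred in the training text
del_counts = {"ct": 12, "es": 5}          # toy counts
bigram_counts = {"ct": 300, "es": 2000}   # toy counts

def p_deletion(x, y):
    """Estimate P(t | c) for the typo made by deleting y after x: [xy => x] / count(xy)."""
    xy = x + y
    total = bigram_counts.get(xy, 0)
    return del_counts.get(xy, 0) / total if total else 0.0

# e.g. the estimated probability that a correct "ct" surfaces with the "t" dropped
print(p_deletion("c", "t"))   # 12/300 = 0.04 under these toy counts
```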
**Sparseness**
The sparseness of the raw probability table makes it unusable. Problems:
**Cost-based Model**
Instead of a raw probabilistic model, we will pursue various muted alternatives. The general idea: we pursue the concept of cost, which may or may not be probability-based, and we look for solutions that minimize cost.
We investigate the idea of edit distance. An edit distance is a special case of a cost; minimize it to find the best solution. Problem: We find the form acress. This could be a misspelling for any of a number of different words:

Let us for the moment set the details of probability aside and just deal with approaches that incorporate the following assumption:
| Two errors are less likely than one. |
We imagine a great orthographic space populated by text strings. These text strings are at different distances from each other.
We call the misspelling in the text T (the target), and we call a potential word that might have given rise to this misspelling S (a candidate source, following the metaphor of the Noisy Channel Model).
| Sum of the cost of the individual editing operations needed to transform T into S |
Minimum editing distance between T and S.
Essential intuition of the algorithm: each path from T to S goes through a sequence of intermediate strings. If Si lies on the optimal path P from T to S, then the sequence of edits leading from T to Si must also be optimal.
We now need ways of computing the shortest distance between any target and any source: shortest_distance(s,t).
| source | target |
|---|---|
| linguis | lingusi |
| Source | l | i | n | g | u | i | s |
|---|---|---|---|---|---|---|---|
| Target | l | i | n | g | u | ∅ | s |
| Cost | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Source | l | i | n | g | u | ∅ | i |
|---|---|---|---|---|---|---|---|
| Target | l | i | n | g | u | s | i |
| Cost | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Source | l | i | n | g | u | i |
|---|---|---|---|---|---|---|
| Target | l | i | n | g | u | s |
| Cost | 0 | 0 | 0 | 0 | 0 | 2 |
shortest_distance(w, w') =
    min { PathCost_ins(w, w'),
          PathCost_del(w, w'),
          PathCost_subst(w, w') }
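A sketch of the standard dynamic-programming computation of this quantity, assuming insertions and deletions cost 1 and substitutions cost 2 (the costs that appear in the worked matrix below):

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, subst_cost=2):
    """Minimum edit distance by dynamic programming (optimal substructure over prefixes)."""
    n, m = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                        # delete source[i-1]
                dist[i][j - 1] + ins_cost,                        # insert target[j-1]
                dist[i - 1][j - 1] + (0 if same else subst_cost)  # match or substitute
            )
    return dist[n][m]

print(min_edit_distance("linguis", "lingusi"))  # 2: one deletion plus one insertion
```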
Computation of a minimum edit distance
|  | n | 2 | 3 (insert) | 4 |  |
|---|---|---|---|---|---|
| Target | i | 1 | 2 (subst) | 3 (del) |  |
|  | # | 0 | 1 | 2 |  |
|  |  | # | e | x | Source |
Different alignments, different costs:
| target | i | r | t | i | o | n | cost |
|---|---|---|---|---|---|---|---|
| source | i | ∅ | t | i | o | n | 1 |
| source | i | t | i | o | n | ∅ | 6 |