Linguistics 596

Probabilistic Models of Spelling and Pronunciation


Probability models as ignorance models

One important use of probabilistic models is to predict some phenomenon when we don't know all the factors that can influence it, or when we do know them, but they are much too complicated to build a precise model of.

Exactly the situation with language!

2-feature sample space

                     Manner
                  FRIC        -FRIC
Voicing   -VCD    f, th, s    p, t, k
           VCD    v, dh, z    b, d, g

Imagine a random sample of English consonants, clipped from out of streams of real speech.

cardinality

Let:

  1. T = Total set of consonants
  2. VCD = Total set of voiced consonants
  3. -VCD = Total set of unvoiced consonants
  4. FRIC = Total set of fricative consonants
  5. -FRIC = Total set of non-fricative consonants (stops, liquids, affricates, ...)

We define |VCD| as the number of elements in the set of voiced sounds (also called the cardinality of the set).

Frequency
Interpretation

    Prob(VCD) is a number x between 0 and 1
As you take larger and larger samples of events, the ratio of voiced events to the total number of events will tend to approach x.
    Prob(VCD) =~ |VCD| ÷|T|
Where =~ means "approximately equals".

This is a frequency interpretation of probability.
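The frequency interpretation can be simulated. A minimal sketch, assuming an invented "true" voicing probability of 0.4 just for illustration:

```python
import random

random.seed(0)

TRUE_P_VCD = 0.4   # invented "true" probability of a voiced consonant

def sample_ratio(n):
    """Fraction of voiced outcomes in a sample of n draws."""
    voiced = sum(random.random() < TRUE_P_VCD for _ in range(n))
    return voiced / n

# Larger and larger samples: the ratio tends toward 0.4
for n in (10, 100, 10000):
    print(n, sample_ratio(n))
```

As n grows, the printed ratios settle near 0.4, which is exactly the frequentist claim above.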

Sample
Space
 

The mathematical set of possible outcomes of the sort we care about, often called Omega.

When we are talking about VCD, -VCD, there are two outcomes. The sample space is

    Omega = {VCD, -VCD}
When we are talking about single coin tossings the set of outcomes is Head or Tails (H or T):
    Omega = { H, T}
When we are talking about sequences of 3 coin tosses, there are 8 possible outcomes:
    Omega = {HHH,HHT, HTH, HTT,THH, THT, TTH, TTT}
When we are talking about voicing contrasts and 3 different consonant choices, the sample space is:
    Omega = {VVV,VVN, VNV, VNN,NVV, NVN, NNV, NNN}
Where V represents a voiced outcome and N represents a nonvoiced outcome.
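The sample spaces above can be enumerated mechanically; a sketch:

```python
from itertools import product

# Sample space for sequences of 3 coin tosses
omega_coins = ["".join(seq) for seq in product("HT", repeat=3)]
print(omega_coins)   # ['HHH', 'HHT', 'HTH', 'HTT', 'THH', 'THT', 'TTH', 'TTT']

# Sample space for voicing outcomes over 3 consonant choices
omega_voicing = ["".join(seq) for seq in product("VN", repeat=3)]
print(len(omega_voicing))   # 8 possible outcomes
```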

Contrast the sample space with a sample, some actual set of events gotten by experiment, by sorting through public records, or by blind luck.

Each outcome we observe will be an element of the sample space. Suppose we run the 3-toss experiment 8 times and each time we get HHH. Hmm. It's beginning to look like the right probability model is:

Prob(TTT) = 0
Prob(TTH) = 0
Prob(THT) = 0
Prob(THH) = 0
Prob(HTT) = 0
Prob(HTH) = 0
Prob(HHT) = 0
Prob(HHH) = 1
In other words the coin is biased. Of course we're not SURE this is the right probability model. We might just have been unlucky....

With any real sample we can ask the question: Is it biased? That is, has it been chosen in some way that changes the probabilistic properties of the space. Is the coin weighted? Did we get our consonant corpus from an opera singer?

The general strategy of the frequency model definition: The larger our set of samples, the less the chance of bias.

Probability
Distribution
 

Prob is called a probability distribution function. It assigns a number between 0 and 1 to a set of events in the sample space.

Conditional
Probability
 

As our example suggests, we want to be able to talk about multiple kinds of events simultaneously. That is, with heads and tails there is only one property to worry about. With consonants there are many, VCD/-VCD, FRIC/-FRIC, and so on.

Now think of the sample space as different consonant utterances and think of VCD as some subset of those utterances and FRIC as some other subset.

Now we can talk about all the following:

  1. Prob(VCD) =~ |VCD| ÷|T|
  2. Prob(FRIC) =~ |FRIC| ÷|T|
  3. Prob(VCD, FRIC) = Prob(VCD Intersect FRIC) =~ |VCD Intersect FRIC| ÷ |T|

We also want to talk about conditional probabilities. By definition:

  • Prob(VCD | FRIC) = Prob(VCD, FRIC) ÷ Prob(FRIC)
Read this as "the probability of VCD given FRIC". This is the probability of a sound being voiced given that you know it's a fricative.

Conditional probabilities can be thought of as Probabilities relativized to subsets of the sample space. Start with the total sample space, restrict it to fricatives, use the same frequentist interpretation of probability, and you get the above formula. Here's why:

    Prob(VCD, FRIC) ÷ Prob(FRIC) =
    (|VCD Intersect FRIC| ÷ |T|) ÷ (|FRIC| ÷ |T|) =
    (|VCD Intersect FRIC| ÷ |T|) * (|T| ÷ |FRIC|) =
    |VCD Intersect FRIC| ÷ |FRIC|

Note and think about the following distinction:

  1. Prob(VCD | FRIC) = Prob(VCD,FRIC) ÷ Prob(FRIC) = (|VCD Intersect FRIC| ÷ |FRIC|)
  2. Prob(VCD, FRIC) = Prob(VCD Intersect FRIC) =~ |VCD Intersect FRIC| ÷|T|
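The distinction can be made concrete with a toy sample; the counts below are invented for illustration:

```python
# A toy sample of 100 consonant tokens, cross-classified (invented counts)
counts = {
    ("VCD", "FRIC"): 12,
    ("VCD", "-FRIC"): 28,
    ("-VCD", "FRIC"): 18,
    ("-VCD", "-FRIC"): 42,
}
total = sum(counts.values())   # |T| = 100

# Joint probability: relative to the whole sample space
p_joint = counts[("VCD", "FRIC")] / total                        # 0.12
# Conditional probability: relative to the FRIC subset only
p_fric = (counts[("VCD", "FRIC")] + counts[("-VCD", "FRIC")]) / total
p_vcd_given_fric = p_joint / p_fric                              # 12/30 = 0.4

print(p_joint, p_vcd_given_fric)
```

The joint probability divides by the whole space; the conditional divides by the FRIC subset, so the two numbers differ.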
Chain Rule

This definition immediately leads to something called the chain rule.

  1. Prob(VCD | FRIC) = Prob(VCD, FRIC) ÷ Prob(FRIC)
  2. Prob(VCD,FRIC) = Prob(VCD | FRIC) * Prob(FRIC)

independence  

Now one thing the chain rule leads to is an account of a special case. Suppose

  • Prob(VCD | FRIC) = Prob(VCD).
In this very special case, we call the features (or events in probability talk) FRIC and VCD independent. In this very special case (and ONLY then):
  • Prob(VCD,FRIC) = Prob(VCD) * Prob(FRIC)
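A quick numerical check, with counts invented so that independence holds exactly:

```python
# 100 invented tokens arranged so voicing and manner are independent
n = {("VCD", "FRIC"): 20, ("VCD", "-FRIC"): 20,
     ("-VCD", "FRIC"): 30, ("-VCD", "-FRIC"): 30}
total = sum(n.values())

p_vcd = (n[("VCD", "FRIC")] + n[("VCD", "-FRIC")]) / total    # 0.4
p_fric = (n[("VCD", "FRIC")] + n[("-VCD", "FRIC")]) / total   # 0.5
p_joint = n[("VCD", "FRIC")] / total                          # 0.2

# Independence: P(VCD, FRIC) = P(VCD) * P(FRIC)
print(abs(p_joint - p_vcd * p_fric) < 1e-12)   # True
```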
Bayes' rule

The chain rule is symmetric:

  1. Prob(VCD,FRIC) = Prob(VCD | FRIC) * Prob(FRIC)
  2. Prob(VCD,FRIC) = Prob(FRIC | VCD) * Prob(VCD)

So it follows

Prob(VCD | FRIC) * Prob(FRIC) = Prob(FRIC | VCD) * Prob(VCD)

This is called Bayes' Rule.

Bayes' Rule is often written in this form:

  • Prob(A | B) = Prob(B | A) * Prob(A) ÷ Prob(B)


Bayes' Rule: An example

Let's imagine we have a contagious disease and a test for the disease. The facts are the following:

  1. 1% of the population has the disease: Prob(D) = .01
  2. The test never misses the disease: Prob(P | D) = 1.0
  3. The test has a 5% false positive rate: Prob(P | H) = .05

(D = diseased, H = healthy, P = positive test result.)

Our question is a policy question. Do we quarantine subjects who have a positive test result?

We're interested in the probability that a subject has the disease GIVEN a positive result. We use Bayes' rule:

    Prob(D | P) = Prob(P | D) * Prob(D) ÷ Prob(P)

The facts of the problem directly give us several of the numbers on the right hand side:

  1. Prob(P |D) = 1.0
  2. Prob(D) = .01

We can compute Prob(P) as follows:

    Prob(P) = Prob(P | D) * Prob(D) + Prob(P | H) * Prob(H)
            = 1.0 * .01 + .05 * .99
            =~ .06

We now have all the numbers to plug into Bayes' rule:

    Prob(D | P) = Prob(P | D) * Prob(D) ÷ Prob(P) =~ 1.0 * .01 ÷ .06 =~ .17

Conclusion: A positive test result gives us a little better than a one in six chance that the subject has the disease!

What's going on? Rounding off some (since Prob(P | D) = 1.0):

    Prob(D | P) =~ Prob(D) ÷ Prob(P) = .01 ÷ .06 =~ 1/6

The ratio of the disease rate to the positive rate is what shrinks our result.

The large size of Prob(P) is due mostly to the false positive rate.

Conclusion: This test has WAY too large a false positive rate for a disease this rare.
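The arithmetic of the example, in code:

```python
# Disease example: P(D) = .01, P(P | D) = 1.0, false positive rate .05
p_d = 0.01
p_pos_given_d = 1.0
p_pos_given_healthy = 0.05

# Total probability of a positive result
p_pos = p_pos_given_d * p_d + p_pos_given_healthy * (1 - p_d)

# Bayes' rule: P(D | P) = P(P | D) * P(D) / P(P)
p_d_given_pos = p_pos_given_d * p_d / p_pos

print(round(p_pos, 4), round(p_d_given_pos, 3))   # 0.0595 0.168
```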

 
A Similar Test Population (300)

    Of 300 subjects, 3 have the disease (D) and 297 are healthy (H).
    All 3 diseased subjects test positive (bonafide positives, P).
    15 healthy subjects test positive (false positives).
    The remaining 282 subjects test negative (N).

False P = 15/297 =~ 5 %
Prob(H|P) = 15/18 = 83 1/3 %
Prob(D|P) = 3/18 = 16 2/3 %


The noisy channel model

Components of the noisy channel model

  1. An observation O
  2. A word w to which it corresponds
  3. A noisy channel C which may distort w.
  4. A decoder responsible for finding the most likely w given O. Note: because of noise, the same O may potentially correspond to many different w.

Some applications of the noisy channel model:

  1. Speech recognition. O is an acoustic signal. The word w is the word the speaker actually uttered.
  2. Orthography. O is a sequence of letters (possibly misspelled). The word w is the word the writer actually intended.
  3. Machine Translation. Say we are translating from English to French. O is a sentence in English. We view O as a distorted version of a French signal. Thus w is the French sentence we are trying to "recover."

We use a probabilistic model

    w = argmax P(O | w) * P(w)
        w in V

where P(O | w) is the likelihood and P(w) is the prior.

A balance must be struck:

  1. Rareness of word (small prior) may outweigh high likelihood
  2. Commonness of word (high prior) may outweigh low likelihood
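The balance can be sketched as a tiny decoder. All numbers here are invented for illustration, for an observation like the typo "acress":

```python
# Invented likelihoods P(O | w) and priors P(w) for a 3-word vocabulary
likelihood = {"actress": 0.000117, "cress": 0.0000005, "across": 0.000286}
prior =      {"actress": 0.0000315, "cress": 0.00000001, "across": 0.000318}

def decode(vocab):
    """Pick the word w maximizing P(O | w) * P(w)."""
    return max(vocab, key=lambda w: likelihood[w] * prior[w])

print(decode(likelihood))   # "across": its common prior outweighs the rest
```

On these made-up numbers the common word "across" wins even though "actress" has a respectable likelihood: the prior tips the balance, just as described above.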

Effects of the channel and effects of the message space

Spelling errors may be due to different kinds of transmission constraints (typographic and cognitive constraints in typing versus visual similarity constraints in OCR).

An OCR example:

Resemblance of characters is a critical factor in determining confusability. And different fonts will have different confusability issues.

A typing example:

Closeness of keys to one another enters in. Also motor control issues (deletion, letter transposition).

Another way of looking at likelihoods and priors:

  1. Channel modeling (likelihood model): Build a model of how signals are realized given the properties of the channel. (an error or distortion model)
  2. Message modeling (prior probability model): Build a model of what messages look like.

We will take a close look at how both kinds of probabilistic models work in two domains:

  1. Spelling correction
  2. Speech recognition:

Spelling Correction

We apply the Bayesian method to the problem of spelling correction first using an error model.

  1. Observation = t (for "typo")
  2. Word = c (for "correction"):

We're looking for the correction "c" which maximizes the probability of "c" given "t"; by Bayes' rule, that is the "c" which maximizes P(t | c) * P(c). So we bring in our probabilistic model thus:

As promised we will thus have two models:

  1. Likelihood model: P(t | c)
  2. Prior model: P(c)

Prior Model
P(c)
 

  1. Size of corpus (textbook): 44 Million
  2. For each word c, count occurrences to get its frequency (freq(c)).
  3. Divide by 44 Million to get its probability (P(c)).
    c        freq(c)   p(c)
    actress  1343      .0000315
    cress    0         0
    caress   4         .0000001
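The prior estimate is just a relative frequency; a sketch using the counts from the table:

```python
N = 44_000_000   # corpus size from the notes

freq = {"actress": 1343, "cress": 0, "caress": 4}
p = {c: f / N for c, f in freq.items()}

print(p["actress"])   # ~3.05e-05 (the table's .0000315 may reflect a
                      # slightly smoothed estimate)
print(p["caress"])    # ~9.1e-08, i.e. about .0000001
```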

Likelihood Model
P(t | c)

 

For the P(t | c) model, we first classify errors.

We estimate P(t | c) using 4 confusion matrices, one each for deletion, insertion, substitution, and transposition.

Using "[ c => t ]" for the number of times that the correct string "c" was realized as typo "t" we define 4 different counts:

  1. del[x,y] = [xy => x]
  2. ins[x,y] = [x => xy]
  3. sub[x,y] = [x => y]
  4. trans[x,y] = [xy => yx]

We estimate the probability of an error of a certain type as the number of times the error occurred divided by the total number of occurrences of the correction.

For a deletion of "y" following "x" ("[xy => x]"), for example, we estimate P(t | c) as follows:

    P(x | xy) =~ del[x,y] ÷ freq(xy)
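The deletion estimate can be sketched directly; the counts below are invented, not taken from any real confusion matrix:

```python
# Invented confusion counts: times correct "ct" was typed as just "c"
del_counts = {("c", "t"): 117}
# Invented corpus counts: times the bigram "ct" occurred
bigram_freq = {"ct": 4000}

def p_deletion(x, y):
    """Estimate P(x | xy): probability that y is deleted after x."""
    return del_counts.get((x, y), 0) / bigram_freq[x + y]

print(p_deletion("c", "t"))   # 117/4000 = 0.02925
```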

Table of final probabilities

Sparseness  

The sparseness of the raw probability table makes it unusable.

Problems:

  1. Too many zeroes. The zeroes in our table may sometimes be things that never occur, but among them there are certainly many things that do occur but which we didn't happen to see in our 44 Million words of data.
  2. The things that have count 1 in our data have a very similar problem. They may be extremely rare events that we just happen to have seen. But among them there are certainly much more common events which we happen to have seen only once.
  3. And so on for all events that only occurred a small number of times (2,3,4,.....)
  4. This is a general problem for all probabilistic modeling. We return to give it a serious treatment later. Two general strategies:
    1. Find some way of collapsing the data into larger classes so that we are dealing with event classes that have larger counts.
    2. Find some way of adjusting the probabilities of rare events uniformly that gives us better estimates of their probabilities (given that we know little about their true probabilities).
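The second strategy can be illustrated with the simplest uniform adjustment, add-one (Laplace) smoothing; the counts are invented:

```python
# Invented counts over V = 3 distinct event types
counts = {"a": 5, "b": 1, "c": 0}
N = sum(counts.values())   # 6 observed events
V = len(counts)            # 3 event types

def p_mle(e):
    """Raw relative-frequency estimate: zero for unseen events."""
    return counts[e] / N

def p_laplace(e):
    """Add-one smoothing: every event gets a nonzero probability."""
    return (counts[e] + 1) / (N + V)

print(p_mle("c"), p_laplace("c"))   # 0.0 vs 1/9, roughly 0.111
```

The smoothed estimate steals a little probability mass from the frequent events and spreads it over the rare and unseen ones.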
Cost-based
Model
 

Instead of a raw probabilistic model, we will pursue various muted alternatives.

The general idea: We pursue the concept of cost, which may or may not be probability-based. We look for solutions that minimize costs.


Minimum Edit Distance

We investigate the idea of edit distance. An edit distance is a special case of a cost: minimize it to find the best solution. Problem: We find the form acress. This could be a misspelling of any of a number of different words:

  1. actress
  2. cress
  3. caress
  4. access
  5. across
  6. acres
Many of these involve multiple errors. We don't actually have a model of multiple errors, even if we fix the sparseness problems in our probability model. We can't just multiply probabilities (or can we?)

Let us for the moment set the details of probability aside and just deal with approaches that incorporate the following assumption

    Two errors are less likely than one.

We imagine a great orthographic space populated by text strings. These text strings are at different distances from each other.

We call the misspelling in the text T (the target) and we call a potential word that might have given rise to this misspelling S (a candidate source, in the metaphor of the Noisy Channel Model).

Edit Distance between two strings T and S

    Sum of the cost of the individual editing operations needed to transform T into S
[Note: Even though we say SUM, we might still be using probabilities. How?]

Minimum editing distance between T and S.

  1. First assign a cost to each editing operation
      Alternative Levenshtein Distance (ALD): The operations insertion and deletion cost 1. Substitution (a combined deletion and insertion) costs 2, except in the case of substituting a character for itself, which costs 0.
    Note: We could use probabilities as well, using the model for Prob(T | S) sketched above. For simplicity we use ALD.
  2. Now compute the cost of a sequence of operations that take you from T to S.
  3. The ALD between T and S is the minimum cost possible for a sequence of editing operations that transform T into S.

Essential intuition of the algorithm. Each path from T to S goes through a sequence of intermediate strings. If Si lies on the optimal path P from T to S, then the sequence leading from T to Si must also be optimal.

We now need ways of computing the shortest distance between any target and any source: shortest_distance(s,t).

  1. We let concatenation be represented by "+".
  2. We give an inductive definition of path cost.
    1. shortest_distance(w, w) = 0
        [special case: shortest_distance(ε, ε) = 0]
    2. PathCost_ins(w, w'+x) = shortest_distance(w, w') + ins-cost(w', w'+x)
    3. PathCost_del(w+x, w') = shortest_distance(w, w') + del-cost(w+x, w)
    4. PathCost_subst(w+x, w'+y) = shortest_distance(w, w') + subst-cost(w+x, w'+y)
  3. Example
      source: linguis    target: lingusi
    1. PathCost_ins(linguis, lingus+i) = shortest_distance(linguis, lingus) + ins-cost(lingus, lingusi)
        shortest_distance(linguis, lingus):
        Source  l i n g u i s
        Target  l i n g u _ s
        Cost    0 0 0 0 0 1 0
      Note: shortest_distance(linguis, lingus) will involve a deletion (of "i")
  4. PathCost_del(lingui+s, lingusi) = shortest_distance(lingui, lingusi) + del-cost(linguis, lingui)
      shortest_distance(lingui, lingusi):
      Source  l i n g u _ i
      Target  l i n g u s i
      Cost    0 0 0 0 0 1 0
  5. PathCost_subst(lingui+s, lingus+i) = shortest_distance(lingui, lingus) + subst-cost(linguis, lingusi)
      shortest_distance(lingui, lingus):
      Source  l i n g u i
      Target  l i n g u s
      Cost    0 0 0 0 0 2
  • We define shortest_distance as the minimum of the 3 path costs:
              shortest_distance(w, w') =
                    Min { PathCost_ins(w, w'),
                          PathCost_del(w, w'),
                          PathCost_subst(w, w') }

    Computation of a minimum edit distance

    Partial ALD table for source "ex" and target "in" (the arrows in the
    original figure mark insert, delete, and substitute steps):

        Target  n | 2  3  4
                i | 1  2  3
                # | 0  1  2
                    #  e  x
                    Source
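The recurrence above can be sketched as a short dynamic-programming routine; a minimal implementation of ALD (insert/delete cost 1, substitution cost 2, 0 for a match):

```python
def ald(source, target):
    """Alternative Levenshtein Distance between two strings."""
    n, m = len(source), len(target)
    # d[i][j] = ALD between source[:i] and target[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i            # delete all source characters
    for j in range(1, m + 1):
        d[0][j] = j            # insert all target characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[n][m]

print(ald("linguis", "lingusi"))   # 2: the "is"/"si" transposition costs
                                   # a deletion plus an insertion under ALD
```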

    Different alignments, different costs:

    target  i r t i o n
    source  i _ t i o n    cost 1
    source  _ i t i o n    cost 6

    The algorithm