Linguistics 596

Viterbi Algorithm


Decoding and Segmentation

Given a string of phones as input, find the most probable sequence of words it realizes.

Each phone gets one instant of time.

We use a language model to assign probabilities to sequences of phones. Paths through the model determine sequences of words. Our output is the most probable sequence of words.

We've simultaneously accomplished decoding and segmentation:
  1. Decoding: Finding which words are realized by a string of phones.
  2. Segmentation: Splitting up the sequence of phones into subsequences representing words (no white spaces between words!).

Language model

  1. Pronunciation network. The key ideas are:
    1. States in this diagram are labeled with phones. Being in a state labeled "dh" means that this state is only compatible with the input "dh".
    2. In formal HMM talk, this is captured by saying that the "observation likelihood" of phone "dh" (o(dh)) for that state is 1, and the observation likelihood for any other phone is 0.
  2. Bigram model
  3. The two models combined
We call the start state start. We name other states by the word subnetwork they belong to and the phone that labels them (e.g., Iaa is the aa state of the word I).
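
To make the combined model concrete, here is a minimal Python sketch of the tiny fragment used in the worked example below. The state names, phone labels, and probabilities are the ones from that example; the dictionary encoding (and the names phone_of and a) is just one convenient illustration, not part of the notes themselves.

  # States of the combined network, each labeled with the phone it accepts.
  # Names follow the word-plus-phone convention used in the example below.
  phone_of = {
      'onaa':   'aa',   # aa state of "on"
      'Iaa':    'aa',   # aa state of "I"
      'onaa n': 'n',    # n state of "on"
      'then':   'n',    # n state of "the"
      'needn':  'n',    # n state of "need"
  }

  # Combined transition probabilities a[s][s']; any pair not listed has probability 0.
  # A transition into a new word factors as pr(word-initial state | word) * pr(word | previous word).
  a = {
      'start': {'onaa': 1.0 * .00077,    # pr(onaa | on)    * pr(on | start)
                'Iaa':  .20 * .079},     # pr(Iaa | I)      * pr(I | start)
      'onaa':  {'onaa n': 1.0},          # word-internal transition within "on"
      'Iaa':   {'needn': 1.0 * .0016,    # pr(needn | need) * pr(need | I)
                'then':  .08 * .00018},  # pr(then | the)   * pr(the | I)
  }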

The core computation: Viterbi probability

In general a state s can be reached at time t via a number of paths. For each path reaching state s at time t, we compute a path probability. We call the best of these viterbi(s,t).

Basic idea: For a state s' compatible with the input phone at time t+1, compute viterbi(s',t+1), assuming we know viterbi(s,t) for all s.

  1. For each s, compute path-prob(s' | s, t), the path probability for reaching s' at time t+1 from s at time t: path-prob(s' | s, t) = viterbi(s,t) * a[s,s'].
  2. viterbi(s',t+1) = max over all s in STATES of path-prob(s' | s, t). (A code sketch of this step follows the list.)
  3. Assume viterbi(start,0) = 1.0. For all other states s except start, assume viterbi(s,0) = 0.
  4. If a state s is not compatible with the input phone at time t, assume viterbi(s,t) = 0.
  5. There are two ways a state can be a dead end.
    1. All its transitions to input-compatible states have probability 0.
    2. Its viterbi probability is 0.
    In either case, all paths continuing from the state will have path probability 0.
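
A minimal Python sketch of this recursion step, assuming the phone_of and a dictionaries sketched above (any state pair missing from a is treated as a zero-probability transition):

  def viterbi_step(viterbi_t, phone, states, a, phone_of):
      """Given viterbi(s, t) for all s, compute viterbi(s', t+1) for every state s'."""
      viterbi_next = {}
      for s_next in states:
          # A state not compatible with the input phone gets probability 0.
          if phone_of.get(s_next) != phone:
              viterbi_next[s_next] = 0.0
              continue
          # path-prob(s' | s, t) = viterbi(s, t) * a[s, s']; keep the best over all s.
          viterbi_next[s_next] = max(
              viterbi_t.get(s, 0.0) * a.get(s, {}).get(s_next, 0.0)
              for s in viterbi_t
          )
      return viterbi_next

Starting from viterbi(start,0) = 1.0, the call viterbi_step({'start': 1.0}, 'aa', list(phone_of), a, phone_of) yields the t = 1 values computed by hand below: .00077 for onaa, .0158 for Iaa, and 0 for the other states.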


The Viterbi algorithm

  1. Basic idea of the Viterbi algorithm: find the most likely path, given some input, through a probabilistic network. This path determines a sequence of words, thus accomplishing decoding and segmentation.
  2. So for an input of length n, we compute the Viterbi probability for each state s and each time 0 through n, keeping track of the path associated with each Viterbi score. The state with the best Viterbi score at time n is the state we want to end up in. What we want to return is the best-probability path to that state.
  3. The path probability matrix
  4. t = 0   Compatible state is start
    viterbi(start,0) = 1.0
    t = 1   o1 = aa; compatible states are onaa and Iaa

    viterbi(onaa,1)

    • path-prob(onaa | start, 0) = viterbi(start,0) * a[start,onaa]
    • a[start,onaa] = pr(onaa|on) * pr(on | start)
    • a[start,onaa] = 1.0 * .00077
    • path-prob(onaa | start, 0) = 1.0 * .00077 = .00077
    • viterbi(onaa,1) = .00077

    viterbi(Iaa,1)

    • path-prob(Iaa | start, 0) = viterbi(start,0) * a[start,Iaa]
    • a[start,Iaa] = pr(Iaa|I) * pr(I | start)
    • a[start,Iaa] = .20 * .079 = .0158
    • path-prob(Iaa | start, 0) = 1.0 * .0158 = .0158
    • viterbi(Iaa,1) = .0158
    t = 2   o2 = n; compatible states are onaa n, then, and needn

    viterbi(onaa n,2)

    • From onaa
      • path-prob(onaa n| onaa, 1) = viterbi(onaa,1) * a[onaa,onaa n]
      • path-prob(onaa n| onaa, 1) = .00077 * 1.0 = .00077
    • From Iaa
      • path-prob(onaa n| Iaa, 1) = viterbi(Iaa,1) * a[Iaa,onaa n]
      • path-prob(onaa n| Iaa, 1) = .0158 * 0 = 0
    • viterbi(onaa n,2) = Max({.00077, 0}) = .00077

    viterbi(then,2)

    • From onaa
      • path-prob(then| onaa, 1) = viterbi(onaa,1) * a[onaa,then]
      • path-prob(then| onaa, 1) = .00077 * 0 = 0
    • From Iaa
      • path-prob(then| Iaa, 1) = viterbi(Iaa,1) * a[Iaa,then]
      • a[Iaa,then] = pr(the|I) * Pr(then |the)
      • a[Iaa,then] = .00018 * .08
      • path-prob(then| Iaa, 1) = .0158 * .00018 * .08 = .00000023
    • viterbi(then,2) = Max({.00000023, 0}) = .00000023

    viterbi(needn,2)

    • From onaa
      • path-prob(needn| onaa, 1) = viterbi(onaa,1) * a[onaa,needn]
      • path-prob(needn| onaa, 1) = .00077 * 0 = 0
    • From Iaa
      • path-prob(needn| Iaa, 1) = viterbi(Iaa,1) * a[Iaa,needn]
      • a[Iaa,needn] = pr(need|I) * Pr(needn |need)
      • a[Iaa,needn] = .0016 * 1.0 = .0016
      • path-prob(needn| Iaa, 1) = .0158 * .0016 = .000025
    • viterbi(needn,2) = Max({.000025, 0}) = .000025
  5. The algorithm
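
    A minimal Python sketch of the whole algorithm, assuming the phone_of and a dictionaries sketched earlier. It records a backpointer alongside each Viterbi score so that the best path, and with it the word sequence and segmentation, can be read off at the end.

      def viterbi_decode(phones, states, a, phone_of):
          """Return the best path probability and best state sequence for the input phones."""
          # t = 0: only the start state has nonzero Viterbi probability.
          viterbi = [{'start': 1.0}]
          backpointer = [{}]
          for t, phone in enumerate(phones, start=1):
              viterbi.append({})
              backpointer.append({})
              for s_next in states:
                  if phone_of.get(s_next) != phone:
                      viterbi[t][s_next] = 0.0      # not compatible with the input phone
                      continue
                  # Best predecessor: path-prob(s' | s, t-1) = viterbi(s, t-1) * a[s, s']
                  best_s, best_p = None, 0.0
                  for s, v in viterbi[t - 1].items():
                      p = v * a.get(s, {}).get(s_next, 0.0)
                      if p > best_p:
                          best_s, best_p = s, p
                  viterbi[t][s_next] = best_p
                  backpointer[t][s_next] = best_s
          # Pick the best final state, then follow backpointers to recover the path.
          last = max(viterbi[-1], key=viterbi[-1].get)
          path = [last]
          for t in range(len(phones), 1, -1):
              path.append(backpointer[t][path[-1]])
          return viterbi[-1][last], list(reversed(path))

    For example, viterbi_decode(['aa', 'n'], list(phone_of), a, phone_of) returns .00077 with the path onaa, onaa n: after only the two phones worked through above, the path through on still has the best score, while the paths ending in needn and then carry the hand-computed values of about .000025 and .00000023.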