Statistical Methods in Computational Linguistics

Expectation, Information, and Entropy

Expectation

Car Prices

Let's reconsider the joint distribution of COLOR and MAKE in which the two variables are independent.

             Green   Red   Yellow   Total
      VW       30     10      10       50
      BMW      18      6       6       30
      Jag      12      4       4       20
      Total    60     20      20      100

Now let's consider prices:

             Green   Red   Yellow   Total
      VW      10K     7K      5K       50
      BMW     25K    20K     22K       30
      Jag     48K    42K     40K       20
      Total    60     20      20      100

    (The cells are prices; the Total row and column repeat the counts from the table above.)
What can we learn from this table besides the fact that you should immediately go out and paint your car green?
Expected Value

Suppose we want to get the average price of a car.

We don't just add the 9 prices together and divide by 9, because that would ignore the relative frequencies of the prices.

A green VW price needs to be weighted more in the final average because green VWs are more common.

Weighting by relative frequency:

      Make/Color    Price   Relative Frequency   Weighted Price
      Green VW      10000   30/100 = .3            3000
      Red VW         7000   10/100 = .1             700
      Yellow VW      5000   10/100 = .1             500
      Green BMW     25000   18/100 = .18           4500
      Red BMW       20000    6/100 = .06           1200
      Yellow BMW    22000    6/100 = .06           1320
      Green Jag     48000   12/100 = .12           5760
      Red Jag       42000    4/100 = .04           1680
      Yellow Jag    40000    4/100 = .04           1600
      Expected Price                              20260

This turns out just to be another way of taking an AVERAGE.

      Make/Color    Price   Frequency     Cost
      Green VW      10000      30       300000
      Red VW         7000      10        70000
      Yellow VW      5000      10        50000
      Green BMW     25000      18       450000
      Red BMW       20000       6       120000
      Yellow BMW    22000       6       132000
      Green Jag     48000      12       576000
      Red Jag       42000       4       168000
      Yellow Jag    40000       4       160000
      Total Car Costs                  2026000
      Average                            20260
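
A quick way to check this arithmetic is to recompute the expectation directly from the two tables. Here is a minimal sketch in Python; the dictionaries simply transcribe the count and price tables above.

      # Joint counts and prices transcribed from the tables above (100 cars in all).
      counts = {('VW', 'Green'): 30, ('VW', 'Red'): 10, ('VW', 'Yellow'): 10,
                ('BMW', 'Green'): 18, ('BMW', 'Red'): 6, ('BMW', 'Yellow'): 6,
                ('Jag', 'Green'): 12, ('Jag', 'Red'): 4, ('Jag', 'Yellow'): 4}
      prices = {('VW', 'Green'): 10000, ('VW', 'Red'): 7000, ('VW', 'Yellow'): 5000,
                ('BMW', 'Green'): 25000, ('BMW', 'Red'): 20000, ('BMW', 'Yellow'): 22000,
                ('Jag', 'Green'): 48000, ('Jag', 'Red'): 42000, ('Jag', 'Yellow'): 40000}

      total_cars = sum(counts.values())        # 100
      # Expected price = total cost of all cars / number of cars,
      # i.e. each price weighted by its relative frequency.
      expected_price = sum(counts[car] * prices[car] for car in counts) / total_cars
      print(expected_price)                    # 20260.0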
General Expected Value

The general idea of expected value is that we have some function that assigns a number to every member of a sample space.

    Since we think the cost is independent of the color, let's use MAKE as the sample space:
      MAKE = {VW, BMW, JAG}
    The number we want to assign to every member of the sample space is the cost.

The expected value of the function is just the sum of its values for each member of the sample space, weighted by probability.

    We computed the expected price by adding all the prices weighted by relative frequency.
    The general definition of expected value works the same way.
Note that we defined random variables formally so that the notion of expectation always applies to them. A random variable is a function that returns a number. So for any random variable X defined over sample space x1, x2, ..., xn:
    E(X) = p(x1)X(x1) + p(x2)X(x2) + ... + p(xn)X(xn)
But the COLOR, MAKE functions aren't very interesting random variables to take the expectation of. So we introduced the price function.
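
As a minimal sketch of this general definition (the function and variable names are only illustrative): expectation takes a probability distribution over a sample space and a random variable X, and returns the probability-weighted sum of X's values. Using MAKE as the sample space, with each make's price averaged over its colors from the tables above, gives the same answer as before.

      def expectation(pmf, X):
          # E(X) = p(x1)X(x1) + p(x2)X(x2) + ... + p(xn)X(xn)
          return sum(p * X(x) for x, p in pmf.items())

      # Sample space MAKE with its marginal probabilities (50, 30, 20 cars out of 100).
      p_make = {'VW': 0.5, 'BMW': 0.3, 'Jag': 0.2}
      # Average price of each make, averaged over colors using the counts above.
      avg_price = {'VW': 8400, 'BMW': 23400, 'Jag': 45200}
      print(round(expectation(p_make, avg_price.get)))   # 20260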

Information

We're interested in assigning a NUMBER to an event that characterizes the quantity of information it carries.

Next we'll be interested in the expected value of that number. The expected quantity of information per event. (Hint: that's the entropy)

    Two Probabilistic Criteria

    1. Significance: The more improbable an event is, the more information it carries.
        P(x1) > P(x2) ==> I(x1) < I(x2)
    2. Additivity: If x1 and x2 are independent events:
        I(x1x2) = I(x1) + I(x2)
    Inverse Probabilities (Attempt 1)

    Let's try:

      I(m) = 1/p(m)

    Examples:

      p(m) = 1/4 ==> I(m) = 4
      p(m) = 1/8 ==> I(m) = 8
    Criterion 1, significance, is satisfied.

    Result: Contradiction of criterion 2, additivity. Consider two independent events such that p(x1) =1/4 and p(x2)=1/8. Then:

    1. p(x1x2)= 1/32 = 1/4 * 1/8
    2. I(x1x2)= 32
    But:
    1. I(x1)= 4
    2. I(x2)= 8
    3. I(x1) + I(x2) = 12
    Log Probability (Attempt 2)

    Observation: We know the probabilities of independent events multiply, but we want their combined information content to add:

      I(p(x,y)) = I(p(x)) + I(p(y))
      I(p(x) * p(y)) = I(p(x)) + I(p(y))

    One function that does something VERY like what we want is log:

      log(x * y) = log x + log y

    Thus we have two ideas:

    1. Using inverse probabilities deals with our significance criterion.
    2. Using logs deals with additivity.
    Combining our two ideas we get:
      I(x) = log(1/p(x)) = - log p(x)

    Examples:

    1. p(x1) = 1/4 ==> I(x1) = -log (1/4) = - (- 2) = 2
    2. p(x2) = 1/8 ==> I(x2) = - log (1/8) = - ( - 3) = 3
    As desired, we satisfy significance.

    Now we satisfy additivity:

    1. p(x1x2) = 1/32 ==> I(x1x2) = - log (1/32) = - ( - 5) = 5
    2. I(x1x2) = I(x1) + I(x2)
      5 = 2 + 3

    The unit is bits. Think of bits as counting binary choices. Taking probabilities out of it:

      With n bits, you have enough to specify 2^n choices.
    Example (binary numbers). 3 bits. 8 choices.
    1. 000 = 0
    2. 001 = 1
    3. 010 = 2
    4. 011 = 3
    5. 100 = 4
    6. 101 = 5
    7. 110 = 6
    8. 111 = 7
    Information Quantity Defined

    Assume a random variable X with probability mass function (pmf) p. For each x in the range of X:

      I(x) = - log p(x)
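
    A minimal sketch of this definition in Python (log base 2, so the unit is bits), reproducing the examples above:

      import math

      def information(p):
          # I(x) = -log2 p(x): the information, in bits, of an event with probability p
          return -math.log2(p)

      print(information(1/4))          # 2.0
      print(information(1/8))          # 3.0
      # Additivity for independent events: I(1/4 * 1/8) = I(1/4) + I(1/8) = 5
      print(information(1/4 * 1/8))    # 5.0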

Entropy

We next define entropy as the expected information quantity.

    Expected Information Quantity

    I (Information quantity) is a function that returns a number for each member of a sample space of possible events. We can compute the expected value of I, which we call H:

      H(X) = Sum_x p(x) I(x) = - Sum_x p(x) log p(x)
    Coin tossing

    Suppose the probability of a head is 1/2. Then,

      H(X) = [1/2 * - log (1/2)] + [1/2 * - log (1/2)]
      H(X) = [1/2 * 1] + [1/2 * 1]
      H(X) = 1

    Suppose the probability of a head is 3/4. Then,

      H(X) = [1/4 * - log (1/4)] + [3/4 * - log (3/4)]
      H(X) = [1/4 * 2] + [3/4 * .415]
      H(X) = .5 + .311 = .811
    Suppose the probability of a head is 7/8. Then,
      H(X) = [1/8 * - log (1/8)] + [7/8 * - log (7/8)]
      H(X) = [1/8 * 3] + [7/8 * .193]
      H(X) = .375 + .169 = .544
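
    A small sketch that recomputes the three coin entropies (the printed values match the figures above up to rounding):

      import math

      def coin_entropy(p_head):
          # H(X) = p(head) * -log2 p(head) + p(tail) * -log2 p(tail)
          return sum(-p * math.log2(p) for p in (p_head, 1 - p_head) if p > 0)

      for p in (1/2, 3/4, 7/8):
          print(p, round(coin_entropy(p), 3))   # 1.0, 0.811, 0.544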
    Entropy as disorder

    Consider the graph of H(X) as p(H) ranges over all its possible values from 0 to 1.

      The more predictable the system becomes (the further p(H) moves from .5), the lower H is. Entropy is a measure of disorder.

    General fact:

      For any random variable X, varying p and keeping Omega the same, the uniform distribution gives the highest value of H.
    Entropy as Choice Number

    Consider the 8-sided die of the textbook:

    Suppose the probability of each face is 1/8. Then,

      H(X) = 8([1/8 * - log (1/8)])
      H(X) = 8([1/8 * 3])
      H(X) = 3

    Suppose the probability of one face is 1/4, and the others are each 3/28. Then,

      H(X) = [1/4 * - log (1/4)] + 7([3/28 * - log (3/28)])
      H(X) = [1/4 * 2] + [7 * 3/28 * 3.22]
      H(X) = .5 + 2.42 = 2.92
    Suppose the probability of one face is 7/8, and the others are each 1/56. Then,
      H(X) = [7/8 * - log (7/8)] + 7([1/56 * - log (1/56)])
      H(X) = [7/8 * .193] + [1/8 * 5.81]
      H(X) = .169 + .726 = .895
    Same trend as the examples with heads, but the numbers are staying higher.
      For a sample space of size 2^n, the entropy of a uniform distribution is n.
    So entropy goes up only slowly as the number of events goes up: it tracks the number of binary choices (bits) required to determine an event.
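
    A sketch that generalizes the same computation to any distribution; it reproduces the die figures above (up to rounding) and illustrates that a uniform distribution over 2^n outcomes has entropy n:

      import math

      def entropy(probs):
          # H = sum over outcomes of p * -log2 p (the expected information quantity)
          return sum(-p * math.log2(p) for p in probs if p > 0)

      print(round(entropy([1/8] * 8), 3))            # 3.0   (uniform over 2^3 faces)
      print(round(entropy([1/4] + [3/28] * 7), 3))   # 2.917
      print(round(entropy([7/8] + [1/56] * 7), 3))   # 0.894
      # Uniform over 2^n outcomes gives entropy n, the maximum for a space of that size.
      print(entropy([1/16] * 16))                    # 4.0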
    Optimal Encoding

    Consider a variant of 20 questions in which your pay-off halves after each question. To maximize average pay-off you want the strategy with the smallest average number of questions.

    Consider the following horse race:

      Distribution X for horses
      H1 1/2
      H2 1/4
      H3 1/8
      H4 1/16
      H5 1/64
      H6 1/64
      H7 1/64
      H8 1/64

    H(X) = - 1/2 log 1/2 - 1/4 log 1/4 - 1/8 log 1/8 - 1/16 log 1/16 - 4(1/64 log 1/64)
      = 2 bits
    The optimal encoding of winners:

      Horse   Encoding
      H1      0
      H2      10
      H3      110
      H4      1110
      H5      111100
      H6      111101
      H7      111110
      H8      111111

      Expected number of questions: 2.0 (a fixed-length code would need 3 for every horse)

    So the average message length is 2 bits, exactly the entropy H(X).
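
    A quick check of this claim, using the win probabilities and the code lengths (one yes/no question per bit) from the table above:

      import math

      probs   = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
      lengths = [1, 2, 3, 4, 6, 6, 6, 6]   # lengths of 0, 10, 110, 1110, 111100, ...

      entropy = sum(-p * math.log2(p) for p in probs)
      avg_questions = sum(p * n for p, n in zip(probs, lengths))
      print(entropy, avg_questions)        # 2.0 2.0 -- the code meets the entropy bound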
    Entropy of Joint Distribution

    Definition of H(X,Y):
      H(X,Y) = - Sum_{x,y} p(x,y) log p(x,y)

    Entropy of Conditional Distribution

    Definition of H(X|Y):
      H(X|Y) = - Sum_{x,y} p(x,y) log p(x|y) = H(X,Y) - H(Y)
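
    As a sketch, here are H(X,Y) and H(X|Y) computed for X = MAKE and Y = COLOR from the car tables (probabilities are the counts over 100); the helper names are only illustrative.

      import math

      # Joint distribution p(make, color) from the car table above.
      joint = {('VW', 'Green'): .30, ('VW', 'Red'): .10, ('VW', 'Yellow'): .10,
               ('BMW', 'Green'): .18, ('BMW', 'Red'): .06, ('BMW', 'Yellow'): .06,
               ('Jag', 'Green'): .12, ('Jag', 'Red'): .04, ('Jag', 'Yellow'): .04}

      # H(X,Y) = - sum over x,y of p(x,y) log2 p(x,y)
      H_joint = sum(-p * math.log2(p) for p in joint.values())

      # H(X|Y) = - sum over x,y of p(x,y) log2 p(x|y), with p(x|y) = p(x,y) / p(y)
      p_color = {}
      for (make, color), p in joint.items():
          p_color[color] = p_color.get(color, 0) + p
      H_make_given_color = sum(-p * math.log2(p / p_color[color])
                               for (make, color), p in joint.items())

      print(round(H_joint, 3), round(H_make_given_color, 3))   # 2.856 1.485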

    Mutual Information

    We write the Mutual Information of X and Y as I(X;Y). This is symmetric and defined as:

      I(X;Y) = H(X) - H(X | Y)

    Lots of different intuitions. Very interesting concept, lots of applications.

    Here's one intuition:

      I(X;Y) = Sum_{x,y} p(x,y) log [ p(x|y) / p(x) ]
    This is:
    1. The expected value of the log of the ratio of p(x|y) to p(x)
    2. The average amount of reduction in uncertainty about X from learning Y, weighted by p(x,y) (see the sketch below).
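
    A minimal sketch of this formula (written with p(x,y)/(p(x)p(y)), which equals p(x|y)/p(x)), applied to the MAKE/COLOR table; since those two variables are independent, they share no information:

      import math

      def mutual_information(joint):
          # I(X;Y) = sum over x,y of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
          p_x, p_y = {}, {}
          for (x, y), p in joint.items():
              p_x[x] = p_x.get(x, 0) + p
              p_y[y] = p_y.get(y, 0) + p
          return sum(p * math.log2(p / (p_x[x] * p_y[y]))
                     for (x, y), p in joint.items() if p > 0)

      joint = {('VW', 'Green'): .30, ('VW', 'Red'): .10, ('VW', 'Yellow'): .10,
               ('BMW', 'Green'): .18, ('BMW', 'Red'): .06, ('BMW', 'Yellow'): .06,
               ('Jag', 'Green'): .12, ('Jag', 'Red'): .04, ('Jag', 'Yellow'): .04}
      print(round(mutual_information(joint), 6))   # ~0.0 (zero up to floating point)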
    Interdisciplinary Nature of Information Theory
    Fields where the notion of entropy (or something like it) plays a role:
    1. Communication Theory
        Optimal encoding
    2. Computer Science
        Kolmogorov Complexity: the length of the shortest binary program for describing/computing a string
    3. Physics
        Thermodynamics: 2nd law. Entropy always increases. The universe tends toward heat death (= the uniform distribution).
    4. Probability Theory
    5. Statistics
    6. Economics
      1. Portfolio Theory
      2. Kelly Gambling
    Cross entropy (Intuition)

    Measure the average amount of surprise:

    1. Sum up the information of each new word, GIVEN the model:
        - log p(w1)
        - log p(w2|w1)
        - log p(w3|w1 w2)
        ...
        - log p(wn | w1, w2, w3, ... wn-1)
    2. = - log( p(w1) * p(w2|w1) * ... * p(wn | w1, ... wn-1) )
    3. = - log p(w1, ... wn) [chain rule]
    4. Divide by n (to make this a per-word measure).
    5. The smaller this number, the better the model (the less surprised it is by the corpus). The bigger p, the smaller this number. (See the sketch below.)
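
    A sketch of this procedure, assuming a hypothetical model_prob(word, history) function that returns the model's conditional probability of the next word; the toy unigram model and three-word corpus are made up purely for illustration:

      import math

      def per_word_surprise(words, model_prob):
          # Average of -log2 m(w_i | w_1 ... w_{i-1}) over the corpus; smaller is better.
          total_info = sum(-math.log2(model_prob(w, words[:i]))
                           for i, w in enumerate(words))
          return total_info / len(words)

      # Toy unigram model: it ignores the history entirely.
      unigram = {'the': 0.5, 'cat': 0.25, 'sat': 0.25}
      corpus = ['the', 'cat', 'sat']
      print(per_word_surprise(corpus, lambda w, history: unigram[w]))   # 1.666...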
    Cross Entropy Definition

    Cross-Entropy

      H(p,m) = - Sum_x p(x) log m(x)
    NB: This is never lower than the entropy of p itself (they are equal only when m = p):
      H(p) = - Sum_x p(x) log p(x)

    (Per Word) Cross-Entropy of Model m for a corpus w1 ... wn of size n

      H(p,m) = - 1/n Sum_{i=1..n} log m(wi | w1 ... wi-1) = - 1/n log m(w1 ... wn)

    (Per Word) Cross-Entropy of Model m for the Language

      H(p,m) = - Lim_{n->Inf} 1/n Sum_{w1 ... wn} p(w1 ... wn) log m(w1 ... wn)

    By a wonderful magical theorem (the Shannon-McMillan-Breiman theorem), this equals:

      H(p,m) = - Lim_{n->Inf} 1/n log m(w1 ... wn)
    Cross-entropy Intuition

    Entropy of a sub-distribution:

      Event (x)   p(x)    Info (-log p(x))   Weighted Info (-p(x) log p(x))
      e1          0.25    2                  0.5
      e2          0.125   3                  0.375

    Cross-entropy for that sub-distribution under a model m:

      Event (x)   p(x)    m(x)    Info (-log m(x))   Weighted Info (-p(x) log m(x))   Delta
      e1          0.25    0.125   3                  0.75                             +0.25
      e2          0.125   0.25    2                  0.25                             -0.125
      Total Delta                                                                     +0.125

    Probability mass m(e2) would need in order to make up for the delta on e1:

      Event (x)   p(x)    m(x)    Info (-log m(x))   Weighted Info (-p(x) log m(x))   Delta
      e2          0.125   0.5     1                  0.125                            -0.25

    Underestimating e1 costs the model +0.25, and to win that back on e2 it would have to give e2 four times its true probability, mass that would then be missing somewhere else. This is why H(p,m) never falls below H(p).
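
    The same point in code, with a toy full distribution p and a model m that swaps the probability mass of two events (both distributions are made up for illustration): the event m underestimates costs more than the event m overestimates saves, so H(p,m) comes out above H(p).

      import math

      def entropy(p):
          # H(p) = - sum over x of p(x) log2 p(x)
          return sum(-p[x] * math.log2(p[x]) for x in p)

      def cross_entropy(p, m):
          # H(p,m) = - sum over x of p(x) log2 m(x)
          return sum(-p[x] * math.log2(m[x]) for x in p)

      p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
      m = {'a': 0.25, 'b': 0.5, 'c': 0.125, 'd': 0.125}   # mass of 'a' and 'b' swapped
      print(entropy(p), cross_entropy(p, m))   # 1.75 2.0 -- H(p,m) >= H(p)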