Statistical Methods in Computational Linguistics

Expectation, Information, and Entropy

Expectation

Car Prices

Let's reconsider the joint distribution of COLOR and MAKE in which the two variables are independent.

             Green   Red   Yellow   Total
      VW       30     10      10       50
      BMW      18      6       6       30
      Jag      12      4       4       20
      Total    60     20      20      100

Now let's consider prices:

             Green   Red   Yellow   Total
      VW      10K     7K      5K       50
      BMW     25K    20K     22K       30
      Jag     48K    42K     40K       20
      Total    60     20      20      100

    (The cells are prices; the Total row and column repeat the counts from the table above.)
What can we learn from this table besides the fact that you should immediately go out and paint your car green?
Expected Value

Suppose we want to get the average price of a car.

We don't just add the 9 prices together and divide by 9, because that would ignore the relative frequencies of the prices.

A green VW price needs to be weighted more in the final average because green VWs are more common.

Weighting by relative frequency:

      Make/Color    Price   Relative Frequency   Weighted Price
      Green VW      10000   30/100 = .3            3000
      Red VW         7000   10/100 = .1             700
      Yellow VW      5000   10/100 = .1             500
      Green BMW     25000   18/100 = .18           4500
      Red BMW       20000    6/100 = .06           1200
      Yellow BMW    22000    6/100 = .06           1320
      Green Jag     48000   12/100 = .12           5760
      Red Jag       42000    4/100 = .04           1680
      Yellow Jag    40000    4/100 = .04           1600
      Expected Price                              20260

This turns out just to be another way of taking an AVERAGE.

      Make/Color    Price   Frequency     Cost
      Green VW      10000      30       300000
      Red VW         7000      10        70000
      Yellow VW      5000      10        50000
      Green BMW     25000      18       450000
      Red BMW       20000       6       120000
      Yellow BMW    22000       6       132000
      Green Jag     48000      12       576000
      Red Jag       42000       4       168000
      Yellow Jag    40000       4       160000
      Total Car Costs                  2026000
      Average                            20260
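
A quick way to check this arithmetic is to recompute the expectation directly from the two tables. Here is a minimal sketch in Python; the dictionaries simply transcribe the count and price tables above.

      # Joint counts and prices transcribed from the tables above (100 cars in all).
      counts = {('VW', 'Green'): 30, ('VW', 'Red'): 10, ('VW', 'Yellow'): 10,
                ('BMW', 'Green'): 18, ('BMW', 'Red'): 6, ('BMW', 'Yellow'): 6,
                ('Jag', 'Green'): 12, ('Jag', 'Red'): 4, ('Jag', 'Yellow'): 4}
      prices = {('VW', 'Green'): 10000, ('VW', 'Red'): 7000, ('VW', 'Yellow'): 5000,
                ('BMW', 'Green'): 25000, ('BMW', 'Red'): 20000, ('BMW', 'Yellow'): 22000,
                ('Jag', 'Green'): 48000, ('Jag', 'Red'): 42000, ('Jag', 'Yellow'): 40000}

      total_cars = sum(counts.values())        # 100
      # Expected price = total cost of all cars / number of cars,
      # i.e. each price weighted by its relative frequency.
      expected_price = sum(counts[car] * prices[car] for car in counts) / total_cars
      print(expected_price)                    # 20260.0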
General Expected Value

The general idea of expected value is that we have some function that assigns a number to every member of a sample space.

    Since we think the cost is independent of the color, let's use MAKE as the sample space:
      MAKE = {VW, BMW, JAG}
    The number we want to assign to every member of the sample space is the cost.

The expected value of the function is just the sum of its values for each member of the sample space, weighted by probability.

    We computed the expected price by adding all the prices weighted by relative frequency.
    The general definition of expected value works the same way.
Note that we defined random variables formally so that the notion of expectation always applies to them. A random variable is a function that returns a number. So for any random variable X defined over sample space x1, x2, ..., xn:
    E(X) = p(x1)X(x1) + p(x2)X(x2) + ... + p(xn)X(xn)
But the COLOR, MAKE functions aren't very interesting random variables to take the expectation of. So we introduced the price function.
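
As a minimal sketch of this general definition (the function and variable names are only illustrative): expectation takes a probability distribution over a sample space and a random variable X, and returns the probability-weighted sum of X's values. Using MAKE as the sample space, with each make's price averaged over its colors from the tables above, gives the same answer as before.

      def expectation(pmf, X):
          # E(X) = p(x1)X(x1) + p(x2)X(x2) + ... + p(xn)X(xn)
          return sum(p * X(x) for x, p in pmf.items())

      # Sample space MAKE with its marginal probabilities (50, 30, 20 cars out of 100).
      p_make = {'VW': 0.5, 'BMW': 0.3, 'Jag': 0.2}
      # Average price of each make, averaged over colors using the counts above.
      avg_price = {'VW': 8400, 'BMW': 23400, 'Jag': 45200}
      print(round(expectation(p_make, avg_price.get)))   # 20260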

Information

We're interested in assigning a NUMBER to an event that characterizes the quantity of information it carries.

Next we'll be interested in the expected value of that number. The expected quantity of information per event. (Hint: that's the entropy)

    Two Probabilistic Criteria

    1. Significance: The more improbable an event is, the more information it carries.
        P(x1) > P(x2) ==> I(x1) < I(x2)
    2. Additivity: If x1 and x2 are independent events:
        I(x1x2) = I(x1) + I(x2)
    Inverse Probabilities (Attempt 1)

    Let's try:

      I(m) = 1/p(m)

    Examples:

      p(m) = 1/4 ==> I(m) = 4
      p(m) = 1/8 ==> I(m) = 8
    Criterion 1, significance, is satisfied.

    Result: Contradiction of criterion 2, additivity. Consider two independent events such that p(x1) =1/4 and p(x2)=1/8. Then:

    1. p(x1x2)= 1/32 = 1/4 * 1/8
    2. I(x1x2)= 32
    But:
    1. I(x1)= 4
    2. I(x2)= 8
    3. I(x1) + I(x2) = 12
    Log Probability (Attempt 2)

    Observation: We know the probabilities of independent events multiply, but we want their combined information content to add:

      I(p(x,y)) = I(p(x)) + I(p(y))
      I(p(x) * p(y)) = I(p(x)) + I(p(y))

    One function that does something VERY like what we want is log:

      log(x * y) = log x + log y

    Thus we have two ideas:

    1. Using inverse probabilities deals with our significance criterion.
    2. Using logs deals with additivity.
    Combining our two ideas we get:
      I(x) = log(1/p(x)) = - log p(x)

    Examples:

    1. p(x1) = 1/4 ==> I(x1) = -log (1/4) = - (- 2) = 2
    2. p(x2) = 1/8 ==> I(x2) = - log (1/8) = - ( - 3) = 3
    As desired, we satisfy significance.

    Now we satisfy additivity:

    1. p(x1x2) = 1/32 ==> I(x1x2) = - log (1/32) = - ( - 5) = 5
    2. I(x1x2) = I(x1) + I(x2)
      5 = 2 + 3

    The unit is bits. Think of bits as counting binary choices. Taking probabilities out of it:

      With n bits, you have enough to specify 2^n choices.
    Example (binary numbers). 3 bits. 8 choices.
    1. 000 = 0
    2. 001 = 1
    3. 010 = 2
    4. 011 = 3
    5. 100 = 4
    6. 101 = 5
    7. 110 = 6
    8. 111 = 7
    Information Quantity Defined

    Assume a random variable X with probability mass function (pmf) p. For each x in the range of X:

      I(x) = - log p(x)
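
    A minimal sketch of this definition in Python (log base 2, so the unit is bits), reproducing the examples above:

      import math

      def information(p):
          # I(x) = -log2 p(x): the information, in bits, of an event with probability p
          return -math.log2(p)

      print(information(1/4))          # 2.0
      print(information(1/8))          # 3.0
      # Additivity for independent events: I(1/4 * 1/8) = I(1/4) + I(1/8) = 5
      print(information(1/4 * 1/8))    # 5.0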

Entropy

We next define entropy as the expected information quantity.

    Expected Information Quantity

    I (Information quantity) is a function that returns a number for each member of a sample space of possible events. We can compute the expected value of I, which we call H:

      H(X) = Sum_x p(x) I(x) = - Sum_x p(x) log p(x)
    Coin tossing

    Suppose the probability of a head is 1/2. Then,

      H(X) = [1/2 * - log (1/2)] + [1/2 * - log (1/2)]
      H(X) = [1/2 * 1] + [1/2 * 1]
      H(X) = 1

    Suppose the probability of a head is 3/4. Then,

      H(X) = [1/4 * - log (1/4)] + [3/4 * - log (3/4)]
      H(X) = [1/4 * 2] + [3/4 * .415]
      H(X) = .5 + .311 = .811
    Suppose the probability of a head is 7/8. Then,
      H(X) = [1/8 * - log (1/8)] + [7/8 * - log (7/8)]
      H(X) = [1/8 * 3] + [7/8 * .193]
      H(X) = .375 + .169 = .544
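
    A small sketch that recomputes the three coin entropies (the printed values match the figures above up to rounding):

      import math

      def coin_entropy(p_head):
          # H(X) = p(head) * -log2 p(head) + p(tail) * -log2 p(tail)
          return sum(-p * math.log2(p) for p in (p_head, 1 - p_head) if p > 0)

      for p in (1/2, 3/4, 7/8):
          print(p, round(coin_entropy(p), 3))   # 1.0, 0.811, 0.544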
    Entropy as disorder

    Consider the graph of H(X) as p(H) ranges over all its possible values from 0 to 1.

      The more predictable the system becomes (the further p(H) moves from .5), the lower H is. Entropy is a measure of disorder.

    General fact:

      For any random variable X, varying p and keeping Omega the same, the uniform distribution gives the highest value of H.
    Entropy as Choice Number

    Consider the 8-sided die of the textbook:

    Suppose the probability of each face is 1/8. Then,

      H(X) = 8([1/8 * - log (1/8)])
      H(X) = 8([1/8 * 3])
      H(X) = 3

    Suppose the probability of one face is 1/4, and the others are each 3/28. Then,

      H(X) = [1/4 * - log (1/4)] + 7([3/28 * - log (3/28)])
      H(X) = [1/4 * 2] + [7 * 3/28 * 3.22]
      H(X) = .5 + 2.42 = 2.92
    Suppose the probability of one face is 7/8, and the others are each 1/56. Then,
      H(X) = [7/8 * - log (7/8)] + 7([1/56 * - log (1/56)])
      H(X) = [7/8 * .193] + [1/8 * 5.81]
      H(X) = .169 + .726 = .895
    Same trend as the examples with heads, but the numbers are staying higher.
      For a sample space of size 2^n, the entropy of a uniform distribution is n.
    So entropy goes up only slowly as the number of events goes up: it tracks the number of binary choices (bits) required to determine an event.
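
    A sketch that generalizes the same computation to any distribution; it reproduces the die figures above (up to rounding) and illustrates that a uniform distribution over 2^n outcomes has entropy n:

      import math

      def entropy(probs):
          # H = sum over outcomes of p * -log2 p (the expected information quantity)
          return sum(-p * math.log2(p) for p in probs if p > 0)

      print(round(entropy([1/8] * 8), 3))            # 3.0   (uniform over 2^3 faces)
      print(round(entropy([1/4] + [3/28] * 7), 3))   # 2.917
      print(round(entropy([7/8] + [1/56] * 7), 3))   # 0.894
      # Uniform over 2^n outcomes gives entropy n, the maximum for a space of that size.
      print(entropy([1/16] * 16))                    # 4.0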
    Optimal Encoding

    Consider a variant of 20 questions in which your pay-off halves after each question. To maximize average pay-off you want the strategy with the smallest average number of questions.

    Consider the following horse race:

      Distribution X for horses
      H1 1/2
      H2 1/4
      H3 1/8
      H4 1/16
      H5 1/64
      H6 1/64
      H7 1/64
      H8 1/64

    H(X) = - 1/2 log 1/2 - 1/4 log 1/4 - 1/8 log 1/8 - 1/16 log 1/16 - 4(1/64 log 1/64)
      = 2 bits
    The optimal encoding of winners:

      Horse   Encoding
      H1      0
      H2      10
      H3      110
      H4      1110
      H5      111100
      H6      111101
      H7      111110
      H8      111111

      Expected number of questions: 2.0 (a fixed-length code would need 3 for every horse)

    So the average message length is 2 bits, exactly the entropy H(X).
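
    A quick check of this claim, using the win probabilities and the code lengths (one yes/no question per bit) from the table above:

      import math

      probs   = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
      lengths = [1, 2, 3, 4, 6, 6, 6, 6]   # lengths of 0, 10, 110, 1110, 111100, ...

      entropy = sum(-p * math.log2(p) for p in probs)
      avg_questions = sum(p * n for p, n in zip(probs, lengths))
      print(entropy, avg_questions)        # 2.0 2.0 -- the code meets the entropy bound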
    Entropy of Joint Distribution

    Definition of H(X,Y):
      H(X,Y) = - Sum_{x,y} p(x,y) log p(x,y)

    Entropy of Conditional Distribution

    Definition of H(X|Y):
      H(X|Y) = - Sum_{x,y} p(x,y) log p(x|y) = H(X,Y) - H(Y)
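
    As a sketch, here are H(X,Y) and H(X|Y) computed for X = MAKE and Y = COLOR from the car tables (probabilities are the counts over 100); the helper names are only illustrative.

      import math

      # Joint distribution p(make, color) from the car table above.
      joint = {('VW', 'Green'): .30, ('VW', 'Red'): .10, ('VW', 'Yellow'): .10,
               ('BMW', 'Green'): .18, ('BMW', 'Red'): .06, ('BMW', 'Yellow'): .06,
               ('Jag', 'Green'): .12, ('Jag', 'Red'): .04, ('Jag', 'Yellow'): .04}

      # H(X,Y) = - sum over x,y of p(x,y) log2 p(x,y)
      H_joint = sum(-p * math.log2(p) for p in joint.values())

      # H(X|Y) = - sum over x,y of p(x,y) log2 p(x|y), with p(x|y) = p(x,y) / p(y)
      p_color = {}
      for (make, color), p in joint.items():
          p_color[color] = p_color.get(color, 0) + p
      H_make_given_color = sum(-p * math.log2(p / p_color[color])
                               for (make, color), p in joint.items())

      print(round(H_joint, 3), round(H_make_given_color, 3))   # 2.856 1.485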

    Mutual Information

    We write the Mutual Information of X and Y as I(X;Y). This is symmetric and defined as:

      I(X;Y) = H(X) - H(X | Y)

    Lots of different intuitions. Very interesting concept, lots of applications.

    Here's one intuition:

      I(X;Y) = Sum_{x,y} p(x,y) log [ p(x|y) / p(x) ]
    This is:
    1. The expected value of the log of the ratio of p(x|y) to p(x)
    2. The average amount of reduction in uncertainty about X from learning Y, weighted by p(x,y) (see the sketch below).
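
    A minimal sketch of this formula (written with p(x,y)/(p(x)p(y)), which equals p(x|y)/p(x)), applied to the MAKE/COLOR table; since those two variables are independent, they share no information:

      import math

      def mutual_information(joint):
          # I(X;Y) = sum over x,y of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
          p_x, p_y = {}, {}
          for (x, y), p in joint.items():
              p_x[x] = p_x.get(x, 0) + p
              p_y[y] = p_y.get(y, 0) + p
          return sum(p * math.log2(p / (p_x[x] * p_y[y]))
                     for (x, y), p in joint.items() if p > 0)

      joint = {('VW', 'Green'): .30, ('VW', 'Red'): .10, ('VW', 'Yellow'): .10,
               ('BMW', 'Green'): .18, ('BMW', 'Red'): .06, ('BMW', 'Yellow'): .06,
               ('Jag', 'Green'): .12, ('Jag', 'Red'): .04, ('Jag', 'Yellow'): .04}
      print(round(mutual_information(joint), 6))   # ~0.0 (zero up to floating point)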
    Interdisciplinary Nature of Information Theory
    Fields where the notion of entropy (or something like it) plays a role:
    1. Communication Theory
        Optimal encoding
    2. Computer Science
        Kolmogorov Complexity: the length of the shortest binary program for describing/computing a string
    3. Physics
        Thermodynamics: 2nd law. Entropy always increases. The universe tends toward heat death (= the uniform distribution).
    4. Probability Theory
    5. Statistics
    6. Economics
      1. Portfolio Theory
      2. Kelly Gambling
    Cross entropy (Intuition)

    Measure the average amount of surprise:

    1. Sum up the information of each new word, GIVEN the model:
        - log p(w1)
        - log p(w2|w1)
        - log p(w3|w1 w2)
        ...
        - log p(wn | w1, w2, w3, ... wn-1)
    2. = - log( p(w1) * p(w2|w1) * ... * p(wn | w1, ... wn-1) )
    3. = - log p(w1, ... wn) [chain rule]
    4. Divide by n (to make this a per-word measure).
    5. The smaller this number, the better the model (the less surprised it is by the corpus). The bigger p, the smaller this number. (See the sketch below.)
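
    A sketch of this procedure, assuming a hypothetical model_prob(word, history) function that returns the model's conditional probability of the next word; the toy unigram model and three-word corpus are made up purely for illustration:

      import math

      def per_word_surprise(words, model_prob):
          # Average of -log2 m(w_i | w_1 ... w_{i-1}) over the corpus; smaller is better.
          total_info = sum(-math.log2(model_prob(w, words[:i]))
                           for i, w in enumerate(words))
          return total_info / len(words)

      # Toy unigram model: it ignores the history entirely.
      unigram = {'the': 0.5, 'cat': 0.25, 'sat': 0.25}
      corpus = ['the', 'cat', 'sat']
      print(per_word_surprise(corpus, lambda w, history: unigram[w]))   # 1.666...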
    Cross Entropy Definition

    Cross-Entropy

      H(p,m) = - Sum_x p(x) log m(x)
    NB: This is never lower than the entropy of p itself (they are equal only when m = p):
      H(p) = - Sum_x p(x) log p(x)

    (Per Word) Cross-Entropy of Model m for a corpus w1 ... wn of size n

      H(p,m) = - 1/n Sum_{i=1..n} log m(wi | w1 ... wi-1) = - 1/n log m(w1 ... wn)

    (Per Word) Cross-Entropy of Model m for the Language

      H(p,m) = - Lim_{n->Inf} 1/n Sum_{w1 ... wn} p(w1 ... wn) log m(w1 ... wn)

    By a wonderful magical theorem (the Shannon-McMillan-Breiman theorem), this equals:

      H(p,m) = - Lim_{n->Inf} 1/n log m(w1 ... wn)
    Cross-entropy Intuition

    Entropy of a sub-distribution:

      Event (x)   p(x)    Info (-log p(x))   Weighted Info (-p(x) log p(x))
      e1          0.25    2                  0.5
      e2          0.125   3                  0.375

    Cross-entropy for that sub-distribution under a model m:

      Event (x)   p(x)    m(x)    Info (-log m(x))   Weighted Info (-p(x) log m(x))   Delta
      e1          0.25    0.125   3                  0.75                             +0.25
      e2          0.125   0.25    2                  0.25                             -0.125
      Total Delta                                                                     +0.125

    Probability mass m(e2) would need in order to make up for the delta on e1:

      Event (x)   p(x)    m(x)    Info (-log m(x))   Weighted Info (-p(x) log m(x))   Delta
      e2          0.125   0.5     1                  0.125                            -0.25

    Underestimating e1 costs the model +0.25, and to win that back on e2 it would have to give e2 four times its true probability, mass that would then be missing somewhere else. This is why H(p,m) never falls below H(p).
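
    The same point in code, with a toy full distribution p and a model m that swaps the probability mass of two events (both distributions are made up for illustration): the event m underestimates costs more than the event m overestimates saves, so H(p,m) comes out above H(p).

      import math

      def entropy(p):
          # H(p) = - sum over x of p(x) log2 p(x)
          return sum(-p[x] * math.log2(p[x]) for x in p)

      def cross_entropy(p, m):
          # H(p,m) = - sum over x of p(x) log2 m(x)
          return sum(-p[x] * math.log2(m[x]) for x in p)

      p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
      m = {'a': 0.25, 'b': 0.5, 'c': 0.125, 'd': 0.125}   # mass of 'a' and 'b' swapped
      print(entropy(p), cross_entropy(p, m))   # 1.75 2.0 -- H(p,m) >= H(p)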