Car Prices 
Let's reconsider the joint distribution of COLOR and MAKE in which the two variables are independent.
Now let's consider prices:



Expected Value 
Suppose we want to get the average price of a car? We dont just add the 9 prices together and divide by 9 because that ignores the relative frequencies of the prices. A green VW price needs to be weighted more in the final average because green VWs are more common. Weighting by relative frequency:
This turns out just to be another way of taking an AVERAGE.


General Expected Value 
The general idea of expected value is that we have some function that assigns a number to every member of a sample space.
The expected value of the function is just the sum of its values for each member of the sample space weighted by probability
But the COLOR, MAKE functions aren't very interesting random variables to take the expectation of. So we introduced the price function. 
We're interested in assigning a NUMBER to an event that characterizes the quantity of information it carries.
Next we'll be interested in the expected value of that number. The expected quantity of information per event. (Hint: that's the entropy)
Two Probabilistic Criteria 



Inverse Probabilities (Attempt 1) 
Let's try:
Examples:
p(x_{2}) = 1/8 ==> I(x_{2}) = log (1/8) = 3 Result: Contradiction of criterion 2, additivity. Consider two independent events such that p(x_{1}) =1/4 and p(x_{2})=1/8. Then:


Log probability (Attempt 2) 
Observation: We know the probabilities of independent events multiply, but we want their combined information content to add:
I(p(x) * p(y)) = I(p(x)) + I(p(y)) One function that does something VERY like what we want is log:
Thus we have two ideas:
Examples:
Now we satisfy additivity:
The unit is bits. Think of bits as counting binary choices. Taking probabilities out of it:


Information Quantity Defined 
Assume a random variable X with probability mass function (pmf) p. For each x in the sample space of X:


Surprisal 
A lot of people prefer the term surprisal to information. That is, what
It's helpful to think about it this way, particularly for linguistics examples. But I will continue to use the term information, out of pure love for tradition. 
We next define entropy as the expected information quantity.
Expected Information Quantity (Entropy) 
I (Information quantity) is a function that returns a number for each member of a sample space of possible events. We can compute the expected value of I, which we call H: Let Ω for p be x_{1} ... x _{n}. Then.
H(p) is called the entropy or expected information value of p. It is one many numbers we can associate with a probability distribution, such as mean, standard deviation, or variance. Each of these tells us something significant about the distribution. You should think of entropy as a measurement you can take, in particular a measurement you can take of a probability distribution. Like measurements of physical quantities, like relative velocity, temperature, and density, it is significant, but there are many factors that conspire together to give us that one number, so it can be hard to interpret. Always remember it's an average (expected value). Thus distributions that are very different in nature can have the same entropy. 


Coin tossing 
Suppose the probability of a head is 1/2. Then,
H(X) = [1/2 * 1] + [1/2 * 1] H(X) = 1 Suppose the probability of a head is 3/4. Then,
H(X) = [1/4 * 2] + [3/4 * .415] H(X) = .5 + .311 = .811
H(X) = [1/8 * 3] + [7/8 * .193] H(X) = .375 + .144 = .519 And finally, suppose we have the perfectly engineered cheater's coin such that the probability of a head is 1:
H(X) = [0 *   ∞] + [1 * 0] H(X) = [0 * ∞] + [1 * 0] H(X) = 0 + 0 = 0 Lim_{z → 0} z *  log z = 0,]so this is legitimate.] 

Entropy as disorder 
Consider the graph of all possible values of p(H)(pdf, ps).
General fact:


Entropy as Choice Number 
Consider an 8sided die: Suppose the probability of each face is 1/8. Then,
H(X) = 8([1/8 * 3] H(X) = 3 Suppose the probability of one die is 1/4, and the others are all 3/28. Then,
H(X) = [1/4 * 2] + [7 * 3/28 * 3.22] H(X) = .5 + 2.42 = 2.92
H(X) = [1/8 * 3] + [1/8 * 5.81] H(X) = .375 + .726 = 1.101
Reviewing:
in general, entropy goes up slowly as the number of events goes up, as the number of binary choices (bits) required to determine an event goes up. 

Optimal Encoding 
Consider the following horse race:
Encoding winners (first try)
((3.0/64) * 4) + (3.0/16) + (3.0/8) + (3.0/4) + (3.0/2) = 3.0 Encoding winners (Optimal way)
((6.0/64) * 4) + (1.0/4) + (3.0/8) + (1.0/2) + (1.0/2) = 2.0 Shannon Weaver Encoding Theorem: A code averaging H(p) bits per horse always exists and is always the optimal encoding. Corollary: For a uniform distribution, we can do no better than the code we started with, averaging 3 bits per horse. Twenty QuestionsAnother way to look at entropy is that it measures the average number of yes/no questions it takes to determine an outcome, using an optimal questioning strategy:
Consider a variant of 20 questions in which your payoff halves after each question. To maximize average payoff you want a strategy that guarantees you the least average number of questions.


Perplexity 
Perplexity(p(x)) = 2^{H(p(x))} If H(p(x)) = 3, Perp(p(x))= 8. If H(p(x)) = 4, Perp(p(x))= 16 Speech recognition tradition uses perplexity. Perplexity is the measure of the average number of choices a random variable has to make. For a uniform distribution over horses, it's 8. For a skewed distribution like the horse race above, it's 4.  
Entropy of Joint Distribution 


Entropy of Conditional Distribution 


Mutual Information 
Suppose we want to define the amount that knowing an event y reduces the surprisal of knowing x. The probability of y when x is known is P(xy). The reduction in the surprisal for x  written I(x;y)  is then: (1) I(x;y) = I(x)  I(xy) =  log P(x)  ( log P(x  y)) =  log P(x) + log P(x  y) = log (1/P(x)) + log P(x  y) = log (1/P(x)) + log P(x,y)/P(y) (2a) = log (P(x,y)/(P(x)*P(y)) (2b) = log (P(x  y)/P(x)) I(x;y) is called the Pointwise Mutual Information of x and y. An interesting property, not obvious from (1), but clear from (2a), is that it is symmetric:
Pointwise mutual information, like surprisal, is a numerical function of events in a probability distribution, the distribution P(x,y). We can therefore take its expectation with respect to that distribution. This gives:
It can be shown that under this definition:
Summarizing the intuitions about (4):


Interdisiplinary Nature of Information Theory 
Fields where the notion of entropy (or something like it)
plays a role:


Cross entropy (Intuition) 
Measure average surprisal assigned to a corpus by our model.
Of course our model will always treat some maximum chunksize, say, n. And a good test corpus should naturally be much bigger than n. Let W_{1,n} be a corpus of nword chunks we write in lowercase, say w_{1,n}. We get the following definition of crossentropy:
Now the summation here is just the log of the probability the model assigns to the whole corpus (the log of the product of the probabilities of each nword chunk token). So retooling n to now be the length of the corpus, this is usually rewritten:


Cross Entropy Definition 
It is an interesting fact that this very simple definition of crossentropy as perword surprisal gives exactly the same results as a more mathematically motivated definition, given some natural assumptions. CrossEntropy
NB: This is always HIGHER than:
(discussion below). (Per Word) CrossEntropy of Model for a corpus of size n
(Per Word) CrossEntropy of Model for the Language
By a magical theorem, this turns into the definition of crossentropy as perword surprisal in a corpus, if we use a large enough corpus. Suppose we have a lot samples of text of size n which we know to be independent of each other, large enough so that for all strings of length n, all w_{1,n}, we have a valid estimate of the probability of w_{1,n}. Then we can show:
The idea is that a large enough corpus will have frequencies that faithfully reflect the true model of English. Following Charniak (1997), 2.7, consider an example. Suppose for the sake of concreteness that all 20 word sequences of English were independent of one another and suppose m was a complete model of such sequences. Assume there is one 20word sequence M1 whose probability according to the true model of English is .05. Then (1a) tells us the contribution M1 makes to the overall crossentropy of m is:
Now what does (1b) tell us? For that we need a corpus. Suppose again for concreteness, a 2000 word corpus (100 20 word sequences) was sufficient to adequately represent the distribution of all possible 20 word sequences. If the corpus gives a representative sampling of M1, there will be 5 occurrences of M1 in the 100 sequence corpus. So according to (1b) the overall contribution of M1 to the crossentropy of m will be:
Now the weightings in (2) and (3) are equal, and therefore (2) and (3) are equal. And in general for any message, they will be equal, because:
for any representative corpus. In our example:
If we now divide both sides of (4) by the length of M1 we get:
The LHS is the weighting of M1's contribution according to (3) and the RHS is the weighting according to (2). They are equal. 

Crossentropy: Intuitions 
For any model m differing from the true model p crossentropy(p,m) > entropy(p). Now in general any incorrect probability model m is going to do some underestimating of the event space of p (adding to H(p)), as well as some overestimating (subtracting from H(p)). Suppose we underestimate:
So in general underestimating the probability of an event pushes the crossentropy upward. Symmetrically, overestimating pushes it downward:
But the effect of underestimating always outweighs the effect of overestimating. How does this work? This is a property of the log function. Consider a subdistribution of p that only contains 1/2 of the prob mass.
Now consider a misguided model m: p m H(p) H(p,m) Δ_{p} Δ_{H} _____________________________________________ e_{1} 0.2500 0.1250 0.5000 0.7500 0.1250 +0.2500 e_{2} 0.2500 0.3750 0.5000 0.3538 +0.1250 0.1462 __________________________________________________ Totals: 0.5000 0.5000 1.0000 1.1038 +0.0000 +0.1038 So underestimating p(e_{1}) made H_{p,m}(e_{1}) exceed H(e_{1}) (.5199 > .3466), and the overestimation of e_{2} wasn't enough to make up for it.
How much do we have to overestimate e_{2} to make up for underestimating e_{1}? p m H(p) H(p,m) Δ_{p} Δ_{H} ___________________________________________ e_{1} 0.2500 0.1250 0.5000 0.7500 0.1250 +0.2500 e_{2} 0.2500 0.5000 0.5000 0.2500 +0.2500 0.2500 ___________________________________________ Totals: 0.5000 0.6250 1.0000 1.0000 +0.1250 +0.0000 So now the Δ_{H} balances, but the Δ_{p} does not. But this means the rest of m (.375) is going to underestimate the probability of the rest of p (.5), which will make the crossentropy for that set of events exceed the entropy. Varying the amount of the overestimations and breaking them up always leads to a net positive Δ H. For example: p m H(p) H(p,m) &Delta_{p} Δ_{H} ___________________________________________ 0.0500 0.0600 0.2161 0.2029 +0.0100 0.0132 0.0400 0.0500 0.1858 0.1729 +0.0100 0.0129 0.0300 0.0400 0.1518 0.1393 +0.0100 0.0125 0.0200 0.0300 0.1129 0.1012 +0.0100 0.0117 0.0100 0.0200 0.0664 0.0564 +0.0100 0.0100 0.0600 0.0500 0.2435 0.2593 0.0100 +0.0158 0.0500 0.0400 0.2161 0.2322 0.0100 +0.0161 0.0400 0.0300 0.1858 0.2024 0.0100 +0.0166 0.0300 0.0200 0.1518 0.1693 0.0100 +0.0175 0.0200 0.0100 0.1129 0.1329 0.0100 +0.0200 ___________________________________________ Totals: 0.3500 0.3500 1.6430 1.6688 0.0000 +0.0258Here the probabilities balance, but the entropy contributions do not, and the crossentropy once again is greater. In general it takes more than p+Δ probability mass to make up for an underestimation of Δ. To do your own crossentropy trials, try out cross_entropy computing mdule, with some help from the documentation. 