This post is for anybody that is confused about the numerous different types of entropy concepts out there, and how they relate to one another. The concepts covered are:
- Cross entropy
- KL divergence
- Relative entropy
- Log loss
- Akaike Information Criterion
- Cross validation
Let’s dive in!
Surprise and information
Previously, I talked about the relationship between surprise and information. It is expressed by the following equation:
Surprise = Information = – log(P)
I won’t rehash the justification for this equation, but highly recommend you check out the previous post if this seems unusual to you.
In addition, we introduced the ideas of expected surprise and total expected surprise, which were expressed by the following equations:
Expected surprise = – P log(P)
Total expected surprise = – ∑ P log(P)
As we saw previously, the total expected surprise for a distribution is synonymous with the entropy of somebody with that distribution.
Which leads us straight into the topic of this post!
The entropy of a distribution is how surprised we expect to be if we suddenly learn the truth about the distribution. It is also the amount of information we expect to gain upon learning the truth.
A small degree of entropy means that we expect to learn very little when we hear the truth. A large degree of entropy means that we expect to gain a lot of information upon hearing the truth. Therefore a large degree of entropy represents a large degree of uncertainty. Entropy is our distance from certainty.
Entropy = Total expected surprise = – ∑ P log(P)
Notice that this is not the distance from truth. We can be very certain, and very wrong. In this case, our entropy will be high, because it is our expected surprise. That is, we calculate entropy by looking at the average surprise over our probability distribution, not the true distribution. If we want to evaluate the distance from truth, we need to evaluate the average over the true distribution.
We can do this by using cross-entropy.
In general, the cross entropy is a function of two distributions P and Q. The cross entropy of P and Q is the surprise you expect somebody with the distribution Q to have, if you have distribution P.
Cross Entropy = Surprise P expects of Q = – ∑ P log(Q)
The actual average surprise of your distribution P is therefore the cross-entropy between P and the true distribution. It is how surprised somebody would expect you to be, if they had perfect knowledge of the true distribution.
Actual average surprise = – ∑ Ptrue log(P)
Notice that the smallest possible value that the cross entropy could take on is the entropy of the true distribution. This makes sense – if your distribution is as close to the truth as possible, but the truth itself contains some amount of uncertainty (for example, a fundamentally stochastic process), then the best possible state of belief you could have would be exactly as uncertain as the true distribution is. Maximum cross entropy between your distribution and the true distribution corresponds to maximum distance from the truth.
If we want a quantity that is zero when your distribution is equal to the true distribution, then you can shift the cross entropy H(Ptrue, P) over by the value of the true entropy S(Ptrue). This new quantity H(Ptrue, P) – S(Ptrue) is known as the Kullback-Leibler divergence.
Shifted actual average surprise = Kullback-Leibler divergence
= – ∑ Ptrue log(P) + ∑ Ptrue log(Ptrue)
= ∑ Ptrue log(Ptrue/P)
It represents the information gap, or the actual average difference in difference between your distribution and the true distribution. The smallest possible value of the Kullback-Leibler divergence is zero, when your beliefs are completely aligned with reality.
Since KL divergence is just a constant shift away from cross entropy, minimizing one is the same as minimizing the other. This makes sense; the only real difference between the two is whether we want our measure of “perfect accuracy” to start at zero (KL divergence) or to start at the entropy of the true distribution (cross entropy).
The negative KL divergence is just a special case of what’s called relative entropy. The relative entropy of P and Q is just the negative cross entropy of P and Q, shifted so that it is zero when P = Q.
Relative entropy = shifted cross entropy
= – ∑ P log(P/Q)
Since the cross entropy between P and Q measures how surprised P expects Q to be, the relative entropy measures P’s expected gap in average surprisal between themselves and Q.
KL divergence is what you get if you substitute in Ptrue for P. Thus it is the expected gap in average surprisal between a distribution and the true distribution.
Maximum KL divergence corresponds to maximum distance from the truth, while maximum entropy corresponds to maximum from certainty. This is why we maximize entropy, but minimize KL divergence. The first is about humility – being as uncertain as possible given the information that you possess. The second is about closeness to truth.
Since KL divergence is just a constant shift away from cross entropy, minimizing one is the same as minimizing the other. This makes sense, the only real difference between the two is whether we want our “perfectly accurate” measure to start at zero (KL divergence) or at the entropy of the true distribution (cross entropy).
Since we don’t start off with access to Ptrue, we can’t directly calculate the cross entropy H(Ptrue, P). But lucky for us, a bunch of useful approximations are available!
Log loss uses the fact that if we have a set of data D generated by the true distribution, the expected value of F(x) taken over the true distribution will be approximately just the average value of F(x), for x in D.
Cross Entropy = – ∑ Ptrue log(P)
(Data set D, N data points)
Cross Entropy ~ Log loss = – ∑x in D log(P(x)) / N
This approximation should get better as our data set gets larger. Log loss is thus just a large-numbers approximation of the actual expected surprise.
Akaike information criterion
Often we want to use our data set D to optimize our distribution P with respect to some set of parameters. If we do this, then the log loss estimate is biased. Why? Because we use the data in two places: first to optimize our distribution P, and second to evaluate the information distance between P and the true distribution.
This allows problems of overfitting to creep in. A distribution can appear to have a fantastically low information distance to the truth, but actually just be “cheating” by ensuring success on the existing data points.
The Akaike information criterion provides a tweak to the log loss formula to try to fix this. It notes that the difference between the cross entropy and the log loss is approximately proportional to the number of parameters you tweaked divided by the total size of the data set: k/N.
Thus instead of log loss, we can do better at minimizing cross entropy by minimizing the following equation:
AIC = Log loss + k/N
(The exact form of the AIC differs by multiplicative constants in different presentations, which ultimately is unimportant if we are just using it to choose an optimal distribution)
The explicit inclusion of k, the number of parameters in your model, represents an explicit optimization for simplicity.
The derivation of AIC relies on a complicated set of assumptions about the underlying distribution. These assumptions limit the validity of AIC as an approximation to cross entropy / KL divergence.
But there exists a different set of techniques that rely on no assumptions besides those used in the log loss approximation (the law of large numbers and the assumption that your data is an unbiased sampling of the true distribution). Enter the holy grail of model selection!
The problem, recall, was that we used the same data twice, allowing us to “cheat” by overfitting. First we used it to tweak our model, and second we used it to evaluate our model’s cross entropy.
Cross validation solves this problem by just separating the data into two sets, the training set and the testing set. The training set is used for tweaking your model, and the testing set is used for evaluating the cross entropy. Different procedures for breaking up the data result in different flavors of cross-validation.
There we go! These are some of the most important concepts built off of entropy and variants of entropy.