I’ve talked quite a bit about D_{KL} on this blog, but I think I’ve yet to give a simple introduction to the concept. That’s what this post is: an introduction to D_{KL} in all its wonder!

# What is D_{KL}?

Essentially, D_{KL} is a measure of the *information distance* between a given model of reality and reality itself. Information distance, more precisely, is a quantification of how many bits of information you would need to receive in order to update your model of reality into perfect alignment with reality. Said another way, it is how much information you would lose if you started with a perfectly aligned model of reality and ended with the model of reality that you currently have. And said one final way, it is *how much more* *surprised* you will be on average given your beliefs, than you would be if you had all true beliefs.

Here’s the functional form of D_{KL}:

D_{KL}(P_{true}, P) = ∫ P_{true} log(P_{true} / P)

= E_{true distribution}[ log(P_{true} / P) ]

The E_{true distribution}[] on the second line refers to an expected value taken over the true distribution.
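For a discrete set of outcomes the integral becomes a sum, which makes the definition easy to try out. Here is a minimal sketch; the function name `kl_divergence` and the two coin distributions are made up for illustration, and base-2 logs are used so the answer comes out in bits.

```python
import math

def kl_divergence(p_true, p_model):
    """D_KL(p_true, p_model) in bits, for discrete distributions given as lists."""
    total = 0.0
    for pt, pm in zip(p_true, p_model):
        if pt == 0.0:
            continue          # outcomes that never happen contribute nothing
        if pm == 0.0:
            return math.inf   # the model rules out something that can happen
        total += pt * math.log2(pt / pm)
    return total

p_true = [0.5, 0.5]    # hypothetical reality: a fair coin
p_model = [0.9, 0.1]   # hypothetical belief: the coin is heavily biased

print(kl_divergence(p_true, p_model))  # about 0.737 bits
print(kl_divergence(p_true, p_true))   # 0.0: a perfect model has no divergence
```

Note that swapping the two arguments generally gives a different number, so "distance" here should be read loosely.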

Why the log? I give some intuitive reasons here.

The problem, however, is that D_{KL} cannot be directly calculated. Notice that one of the inputs to the function is the true probability distribution over outcomes. If you had access to this from the start, then you would have no need for any fancy methods of inference in the first place.

However, there *are* good ways to indirectly approximate D_{KL}. How could we ever approximate a function that takes in an input that we don’t know? Through data!

Loosely speaking, data functions as a *window* that allows you to sneak peeks at reality. When you observe the outcomes of an experiment, the result you get will not always be aligned with your beliefs about reality. Instead, the outcomes of the experiment reflect the nature of reality itself, unfiltered by your beliefs.

(This is putting aside subtleties about good experimental design, but even those subtleties are unnecessary; technically the data you get is always a product of the nature of reality as it is, it’s just that our interpretation of the data might be flawed.)

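So if we have access to some set of data from well-designed experiments (that is, experiments whose results we are correctly interpreting), we can use it to form an approximation of the D_{KL} of any given model of reality. Strictly speaking, what the data lets us estimate is D_{KL} plus the entropy of the true distribution, but that entropy is the same for every model, so it does not affect comparisons between models. This first approximation is called the *log loss*, and takes the following form: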

Log Loss = – E_{data}[ log(P) ]
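As a rough illustration, the log loss can be computed from samples alone, without ever looking at the true distribution. In the sketch below the true distribution is only used to generate the data and to check the answer; `log_loss`, the coin distributions, and the sample size are all assumptions chosen for the example.

```python
import math
import random

def log_loss(samples, p_model):
    """Average surprise, in bits, of the model at the observed outcomes."""
    return -sum(math.log2(p_model[x]) for x in samples) / len(samples)

rng = random.Random(0)   # fixed seed so the run is repeatable
p_true = [0.5, 0.5]      # hypothetical reality, used only to generate data
p_model = [0.9, 0.1]     # the model being evaluated

samples = rng.choices([0, 1], weights=p_true, k=100_000)
loss = log_loss(samples, p_model)

# Entropy of the true distribution: the model-independent offset
# between log loss and D_KL.
entropy = -sum(p * math.log2(p) for p in p_true)

print(loss)            # close to 1.737 bits
print(loss - entropy)  # close to 0.737 bits, the D_KL of this model
```

In practice the entropy term is unknown, but since it is the same for every model, ranking models by log loss ranks them by D_{KL} as well.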

There is one more problem with this notion of using data to approximate D_{KL}. The problem is that normally, we use data to update our beliefs. If we first update our beliefs with the data, and then approximate the D_{KL} of our new distribution using the same data, then we are biasing our approximation. It’s sort of like assessing intelligence by giving people an IQ test after letting them study that exact test and its answer key. If they do well on that test, it might not be because they are actually intelligent, but rather just that they’ve memorized all of the answers (overfit to the data of the IQ test).

So we have a few choices. First, we could refuse to update our beliefs on the data, and then have a nice unbiased estimate of the D_{KL} of our un-updated distribution. Second, we could update our beliefs on the data, but give up hope of an unbiased estimate of D_{KL}. Third, we could use some of the data for updating our beliefs, and the rest of it for evaluating D_{KL} (this option is called cross-validation). Or finally, we could try to find some way to *approximate* the amount of bias introduced by a given update of beliefs to our estimate of D_{KL}.
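The bias, and the third option, can be sketched with a single train/held-out split (a simplification of full cross-validation). Everything here is hypothetical: the four-outcome "reality", the sample sizes, and the helper names. "Updating beliefs" is modeled in the simplest way possible, by setting each probability to its observed frequency.

```python
import math
import random

def fit_frequencies(samples, num_outcomes):
    """Update beliefs on data: set each probability to its observed frequency."""
    counts = [0] * num_outcomes
    for x in samples:
        counts[x] += 1
    return [c / len(samples) for c in counts]

def log_loss(samples, p_model):
    """Average surprise, in bits, of the model at the observed outcomes."""
    total = 0.0
    for x in samples:
        if p_model[x] == 0.0:
            return math.inf  # the model ruled out something that happened
        total -= math.log2(p_model[x])
    return total / len(samples)

rng = random.Random(0)
p_true = [0.4, 0.3, 0.2, 0.1]            # hypothetical reality
data = rng.choices(range(4), weights=p_true, k=80)
train, held_out = data[:40], data[40:]   # fit on one half, evaluate on the other

p_fit = fit_frequencies(train, 4)
print("fitted model, scored on its own training data:", log_loss(train, p_fit))
print("true model, scored on the same training data: ", log_loss(train, p_true))
print("fitted model, scored on held-out data:        ", log_loss(held_out, p_fit))
```

On its own training data, the fitted model always scores at least as well as the true distribution does, even though it cannot actually be closer to the truth than the truth itself. That is the memorized answer key at work; the held-out score does not have this built-in advantage.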

Amazingly, we actually know precisely how to do this final option! This was the great contribution of the brilliant Japanese statistician Hirotugu Akaike. The equation he derived when trying to quantify the degree of bias is called the *Akaike information criterion*.

AIC(Model M) = Number of parameters in M – log P(data | M)

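The best model in a set is the one with the lowest AIC score. (Here the likelihood is evaluated at the model’s best-fit parameter values; the conventional definition multiplies the whole expression by 2, which does not change the ranking.) It makes a lot of sense that models with more parameters are penalized; models with more tweakable parameters are like students who are better at memorizing the answer key to their test.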
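To make the penalty concrete, here is a small sketch that compares two models of a coin using the form of AIC given above (parameters minus log-likelihood, in natural-log units). The flip counts and the function names are invented for the example. Model A says the coin is fair and has no free parameters; model B has one free parameter, the probability of heads, set to the observed frequency.

```python
import math

def aic(num_params, log_likelihood):
    """AIC in the form used in this post: parameter count minus log-likelihood."""
    return num_params - log_likelihood

def coin_log_likelihood(heads, tails, p_heads):
    """Natural-log probability of the observed flips under a given p_heads."""
    return heads * math.log(p_heads) + tails * math.log(1 - p_heads)

def compare(heads, tails):
    # Model A: fair coin, nothing to tune.
    aic_fair = aic(0, coin_log_likelihood(heads, tails, 0.5))
    # Model B: one tunable parameter, set to its best-fit value.
    p_hat = heads / (heads + tails)
    aic_fitted = aic(1, coin_log_likelihood(heads, tails, p_hat))
    return aic_fair, aic_fitted

print(compare(55, 45))  # roughly (69.31, 69.81): the fair coin wins
print(compare(70, 30))  # roughly (69.31, 62.09): the fitted coin wins
```

With 55 heads out of 100, the fitted model explains the data slightly better, but not by enough to pay for its extra parameter. With 70 heads, the improvement in fit easily outweighs the penalty.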

Can we do any better than AIC? Yes, in fact! For small data sets, a better measure is AICc, which adds a correction term that scales like 1/N, where N is the number of data points.
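For reference, the commonly quoted correction in the conventional scaling is 2k(k+1)/(N - k - 1), where k is the number of parameters. Halving it to match the form of AIC used in this post gives the sketch below. The function name is an assumption, and this particular correction is derived under specific modeling assumptions, so treat it as illustrative rather than universal.

```python
def aicc(num_params, log_likelihood, num_samples):
    """Small-sample corrected AIC, scaled to match AIC = k - log-likelihood."""
    k, n = num_params, num_samples
    if n - k - 1 <= 0:
        raise ValueError("AICc needs more than k + 1 data points")
    correction = k * (k + 1) / (n - k - 1)
    return k - log_likelihood + correction

# The correction matters for small N and fades away as N grows.
print(aicc(3, 0.0, 10) - 3)    # 2.0
print(aicc(3, 0.0, 1000) - 3)  # about 0.012
```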

So to summarize everything in a few sentences:

- D_{KL} is a measure of how far your model of reality is from the truth.
- D_{KL} cannot be calculated without prior knowledge of the truth.
- However, we can use data to approximate D_{KL}, by calculating log loss.
- Unfortunately, if we are also using the data to update our beliefs, log loss is a biased estimator of D_{KL}.
- We can approximate the bias and negate it using the Akaike information criterion, AIC.
- An even better approximation for small data sets is AICc.
