Galileo and the Schelling point improbability principle

An alternative history interaction between Galileo and his famous statistician friend


In the year 1609, when Galileo Galilei finished the construction of his majestic artificial eye, the first place he turned his gaze was the glowing crescent moon. He reveled in the crevices and mountains he saw, knowing that he was the first man alive to see such a sight, and his mind expanded as he saw the folly of the science of his day and wondered what else we might be wrong about.

For days he was glued to his telescope, gazing at the Heavens. He saw the planets become colorful expressive spheres and reveal tiny orbiting companions, and observed the distant supernova which Kepler had seen blinking into existence only five years prior. He discovered that Venus had phases like the Moon, that some apparently single stars revealed themselves to be binaries when magnified, and that there were dense star clusters scattered through the sky. All this he recorded in frantic enthusiastic writing, putting out sentences filled with novel discoveries nearly every time he turned his telescope in a new direction. The universe had opened itself up to him, revealing all its secrets to be uncovered by his ravenous intellect.

It took him two weeks to pull himself away from his study room for long enough to notify his friend Bertolfo Eamadin of his breakthrough. Eamadin was a renowned scholar, having pioneered at age 15 his mathematical theory of uncertainty and created the science of probability. Galileo often sought him out to discuss puzzles of chance and randomness, and this time was no exception. He had noticed a remarkable confluence of three stars that were in perfect alignment, and needed the counsel of his friend to sort out his thoughts.

Eamadin arrived at the home of Galileo half-dressed and disheveled, obviously having leapt from his bed and rushed over immediately upon receiving Galileo’s correspondence. He practically shoved Galileo out from his viewing seat and took his place, eyes glued with fascination on the sky.

Galileo allowed his friend to observe unmolested for a half-hour, listening with growing impatience to the ‘oohs’ and ‘aahs’ being emitted as the telescope swung wildly from one part of the sky to another. Finally, he interrupted.

Galileo: “Look, friend, at the pattern I have called you here to discuss.”

Galileo swiveled the telescope carefully to the position he had marked out earlier.

Eamadin: “Yes, I see it, just as you said. The three stars form a seemingly perfect line, each of the two outer ones equidistant from the central star.”

Galileo: “Now tell me, Eamadin, what are the chances of observing such a coincidence? One in a million? A billion?”

Eamadin frowned and shook his head. “It’s certainly a beautiful pattern, Galileo, but I don’t see what good a statistician like myself can do for you. What is there to be explained? With so many stars in the sky, of course you would chance upon some patterns that look pretty.”

Galileo: “Perhaps it seems only an attractive configuration of stars spewed randomly across the sky. I thought the same myself. But the symmetry seemed too perfect. I decided to carefully measure the central angle, as well as the angular distance subtended by the path from each outer star to the central one. Look.”

Galileo pulled out a sheet of paper that had been densely scribbled upon. “My calculations revealed the central angle to be precisely 180.000º, with an error of ± .003º. And similarly, I found the difference in the two angular distances to be .000º, with a margin of error of ± .002º.”

Eamadin: “Let me look at your notes.”

Galileo handed over the sheets to Eamadin. “I checked over my calculations a dozen times before writing you. I found the angular distances by approaching and retreating from this thin paper, which I placed between the three stars and me. I found the distance at which the thin paper just happened to cover both stars on one extreme simultaneously, and did the same for the two stars on the other extreme. The distance was precisely the same, leaving measurement error only for the thickness of the paper, my distance from it, and the resolution of my vision.”

Eamadin: “I see, I see. Yes, what you have found is a startlingly clear pattern. Agreement in distance and angle this precise is quite unlikely to be the result of any natural phenomenon…”

Galileo: “Exactly what I thought at first! But then I thought about the vast quantity of stars in the sky, and the vast number of ways of arranging them into groups of three, and wondered if perhaps in fact such coincidences might be expected. I tried to apply your method of uncertainty to the problem, and came to the conclusion that the chance of such a pattern having occurred through random chance is one in a thousand million! I must confess, however, that at several points in the calculation I found myself confronted with doubt about how to progress and wished for your counsel.”

Eamadin stared at Galileo’s notes, then pulled out a pad of his own and began scribbling intensely. Eventually, he spoke. “Yes, your calculations are correct. The chance of such a pattern having occurred by random forces, to within the degree of measurement error you have specified, is 10⁻⁹.”

Galileo: “Aha! Remarkable. So what does this mean? What strange forces have conspired to place the stars in such a pattern? And, most significantly, why?”

Eamadin: “Hold it there, Galileo. It is not reasonable to jump from the knowledge that the chance of an event is remarkably small to the conclusion that it demands a novel explanation.”

Galileo: “How so?”

Eamadin: “I’ll show you by means of a thought experiment. Suppose that we found that instead of the angle being 180.000º with an experimental error of .003º, it was 180.001º with the same error. The probability of this outcome would be the same as the outcome we found – one in a thousand million.”

Galileo: “That can’t be right. Surely it’s less likely to find a perfectly straight line than a merely nearly perfectly straight line.”

Eamadin: “While that is true, it is also true that the exact calculation you did for 180.000º ± .003º would apply for 180.001º ± .003º. And indeed, it is less likely to find the stars at this precise angle, than it is to find the stars merely near this angle. We must compare like with like, and when we do so we find that 180.000º is no more likely than any other angle!”

Galileo: “I see your reasoning, Eamadin, but you are missing something of importance. Surely there is something objectively more significant about finding an exactly straight line than about a nearly straight line, even if they have the same probability. Not all equiprobable events should be considered to be equally important. Think, for instance, of a sequence of twenty coin tosses. While it’s true that the outcome HHTHTTTTHTHHHTHHHTTH has the same probability as the outcome HHHHHHHHHHHHHHHHHHHH, the second is clearly more remarkable than the first.”

Eamadin: “But what is significance if disentangled from probability? I insist that the concept of significance only makes sense in the context of my theory of uncertainty. Significant results are those that either have a low probability or have a low conditional probability given a set of plausible hypotheses. It is this second class that we may utilize in analyzing your coin tossing example, Galileo. The two strings of tosses you mention are only significant to different degrees in that the second more naturally lends itself to a set of hypotheses in which the coin is heavily biased towards heads. In judging the second to be a more significant result than the first, you are really just saying that you use a natural hypothesis class in which probability judgments are only dependent on the ratios of heads and tails, not the particular sequence of heads and tails. Now, my question for you is: since 180.000º is just as likely as 180.001º, what set of hypotheses are you considering in which the first is much less likely than the second?”

Galileo: “I must confess, I have difficulty answering your question. For while there is a simple sense in which the number of heads and tails is a product of a coin’s bias, it is less clear what would be the analogous ‘bias’ in angles and distances between stars that should make straight lines and equal distances less likely than any others. I must say, Eamadin, that in calling you here, I find myself even more confused than when I began!”

Eamadin: “I apologize, my friend. But now let me attempt to disentangle this mess and provide a guiding light towards a solution to your problem.”

Galileo: “Please.”

Eamadin: “Perhaps we may find some objective sense in which a straight line or the equality of two quantities is a simpler mathematical pattern than a nearly straight line or two nearly equal quantities. But even if so, this will only be a help to us insofar as we have a presumption in favor of less simple patterns inhering in Nature.”

Galileo: “This is no help at all! For surely the principle of Ockham should push us towards favoring more simple patterns.”

Eamadin: “Precisely. So if we are not to look for an objective basis for the improbability of simple and elegant patterns, then we must look towards the subjective. Here we may find our answer. Suppose I were to scribble down on a sheet of paper a series of symbols and shapes, hidden from your view. Now imagine that I hand the images to you, and you go off to some unexplored land. You explore the region and draw up cartographic depictions of the land, having never seen my images. It would be quite a remarkable surprise were you to find upon looking at my images that they precisely matched your maps of the land.”

Galileo: “Indeed it would be. It would also quickly lend itself to a number of possible explanations. Firstly, it may be that you were previously aware of the layout of the land, and drew your pictures intentionally to capture it – that is, that the layout directly caused the resemblance in your depictions. Secondly, it could be that there was a common cause between the resemblance and the layout; perhaps, for instance, the patterns that most naturally come to the mind are those that resemble common geographic features. And thirdly, included only for completeness, it could be that your images somehow caused the land to have the geographic features that it did.”

Eamadin: “Exactly! You catch on quickly. Now, this case of the curious coincidence of depiction and reality is exactly analogous to your problem of the straight line in the sky. The straight lines and equal distances are just like patterns on the slips of paper I handed to you. For whatever reason, we come pre-loaded with a set of sensitivities to certain visual patterns. And what’s remarkable about your observation of the three stars is that a feature of the natural world happens to precisely align with these patterns, where we would expect no such coincidence to occur!”

Galileo: “Yes, yes, I see. You are saying that the improbability doesn’t come from any objective unusual-ness of straight lines or equal distances. Instead, the improbability comes from the fact that the patterns in reality just happen to be the same as the patterns in my head!”

Eamadin: “Precisely. Now we can break down the suitable explanations, just as you did with my cartographic example. The first explanation is that the patterns in your mind were caused by the patterns in the sky. That is, for some reason the fact that these stars were aligned in this particular way caused you to be psychologically sensitive to straight lines and equal quantities.”

Galileo: “We may discard this explanation immediately, for such sensitivities are too universal and primitive to be the result of a configuration of stars that has only just now made itself apparent to me.”

Eamadin: “Agreed. Next we have a common cause explanation. For instance, perhaps our mind is naturally sensitive to visual patterns like straight lines because such patterns tend to commonly arise in Nature. This natural sensitivity is what feels to us on the inside as simplicity. In this case, you would expect it to be more likely for you to observe simple patterns than might be naively thought.”

Galileo: “We must deny this explanation as well, it seems to me. For the resemblance to a straight line goes much further than my visual resolution could even make out. The increased likelihood of observing a straight line could hardly be enough to outweigh our initial naïve calculation of the probability being 10⁻⁹. But thinking more about this line of reasoning, it strikes me that you have just provided an explanation of the apparent simplicity of the laws of Nature! We have developed to be especially sensitive to patterns that are common in Nature, we interpret such patterns as ‘simple’, and thus it is a tautology that we will observe Nature to be full of simple patterns.”

Eamadin: “Indeed, I have offered just such an explanation. But it is an unsatisfactory explanation, insofar as one is opposed to the notion of simplicity as a purely subjective feature. Most people, myself included, would strongly suggest that a straight line is inherently simpler than a curvy line.”

Galileo: “I feel the same temptation. Of course, justifying a measure of simplicity that does the job we want of it is easier said than done. Now, on to the third explanation: that my sensitivity to straight lines has caused the apparent resemblance to a straight line. There are two interpretations of this. The first is that the stars are not actually in a straight line, and I only think they are because of my predisposition towards identifying straight lines. The second is that the stars aligned in a straight line because of these predispositions. I’m sure you agree that both can be reasonably excluded.”

Eamadin: “Indeed. Although it may look like we’ve excluded all possible explanations, notice that we only considered one possible form of the common cause explanation. The other two categories of explanations seem more thoroughly ruled out; your dispositions couldn’t be caused by the star alignment given that you have only just found out about it and the star alignment couldn’t be caused by your dispositions given the physical distance.”

Galileo: “Agreed. Here is another common cause explanation: God, who crafted the patterns we see in Nature, also created humans to have similar mental features to Himself. These mental features include aesthetic preferences for simple patterns. Thus God causes both the salience of the line pattern to humans and the existence of the line pattern in Nature.”

Eamadin: “The problem with this is that it explains too much. Based solely on this argument, we would expect that when looking up at the sky, we should see it entirely populated by simple and aesthetic arrangements of stars. Instead it looks mostly random and scattershot, with a few striking exceptions like those which you have pointed out.”

Galileo: “Your point is well taken. All I can imagine now is that there must be some sort of ethereal force that links some stars together, gradually pushing them so that they end up in nearly straight lines.”

Eamadin: “Perhaps that will be the final answer in the end. Or perhaps we will discover that it is the whim of a capricious Creator with an unusual habit for placing unsolvable mysteries in our paths. I sometimes feel this way myself.”

Galileo: “I confess, I have felt the same at times. Well, Eamadin, although we have failed to find a satisfactory explanation for the moment, I feel much less confused about this matter. I must say, I find this method of reasoning by noticing similarities between features of our mind and features of the world quite intriguing. Have you a name for it?”

Eamadin: “In fact, I just thought of it on the spot! I suppose that it is quite generalizable… We come pre-loaded with a set of very salient and intuitive concepts, be they geometric, temporal, or logical. We should be surprised to find these concepts instantiated in the world, unless we know of some causal connection between the patterns in our mind and the patterns in reality. And by Eamadin’s rule of probability-updating, when we notice these similarities, we should increase our strength of belief in these possible causal connections. In the spirit of anachrony, let us refer to this as the Schelling point improbability principle!”

Galileo: “Sounds good to me! Thank you for your assistance, my friend. And now I must return to my exploration of the Cosmos.”

Why “number of parameters” isn’t good enough

A friend of mine recently pointed out a curious fact. Any set of two-dimensional data whatsoever can be perfectly fit by a simple two-parameter sinusoidal model.

y(x) = A sin(Bx)

Sound wrong? Check it out:


[Figure: the sine fit, zoomed out (small-sine.png)]

[Figure: the sine fit for N = 10 points (sine-overfit.png)]

As you see, as the number of data points goes up, all you need to do to accommodate this is increase the frequency in your sine function, and adjust the amplitude as necessary. Ultimately, you can fit any data set with a ridiculously quickly oscillating and large-amplitude sine function.
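One way to watch this happen numerically is a brute-force search over the frequency B with the amplitude A held fixed. The sketch below is only an illustration under assumptions chosen for the purpose: four random data points, A = 1.5, and a frequency grid with step 0.01. The helper `best_sine_fit` is a hypothetical name, not a standard routine. Because each wider frequency grid contains the narrower ones, the worst-case error can only shrink or stay the same as higher frequencies are allowed.

```python
import numpy as np

def best_sine_fit(x, y, amplitude, b_max, step=0.01):
    """Grid-search B in y = A*sin(B*x); return (best B, worst-case abs error)."""
    freqs = np.arange(step, b_max, step)
    # residuals[i, j] = A*sin(B_i * x_j) - y_j
    residuals = amplitude * np.sin(np.outer(freqs, x)) - y
    worst = np.max(np.abs(residuals), axis=1)
    i = int(np.argmin(worst))
    return float(freqs[i]), float(worst[i])

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1.0, 2.0, size=4))   # illustrative x positions
y = rng.uniform(-1.0, 1.0, size=4)           # illustrative y values
A = 1.5                                      # any amplitude larger than max|y|

errors = []
for b_max in (10, 100, 1000, 5000):
    b, err = best_sine_fit(x, y, A, b_max)
    errors.append(err)
    print(f"B up to {b_max:>4}: best B = {b:8.2f}, worst-case error = {err:.4f}")
```

The model never gains a third parameter; all of the extra flexibility comes from letting B range over larger values.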

Now, most model selection methods explicitly rely on the parameter count to estimate the potential of a model to overfit. For example, if k is the number of parameters in a model, and L is the log likelihood of the data given the model, we have:

AIC = L – k
BIC = L – (k/2)・log(N)

This little example represents a fantastic failure of parameter count to successfully do the job AIC and BIC ask of it. Evidently parameter count is too blunt an instrument to do the job we require of it, and we need something with more nuance.
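To make the two scores concrete, here is a small sketch in the same sign convention as the formulas above (log likelihood minus a penalty, so higher is better). The function names and the log-likelihood values are hypothetical, chosen only to show that two models with the same parameter count always receive the same penalty, no matter how differently they behave.

```python
import math

def aic_score(log_likelihood, k):
    """AIC in the convention used above: L - k (higher is better)."""
    return log_likelihood - k

def bic_score(log_likelihood, k, n):
    """BIC in the convention used above: L - (k/2) log(N) (higher is better)."""
    return log_likelihood - (k / 2) * math.log(n)

n = 10  # number of data points
# Hypothetical log likelihoods: a wildly oscillating sine that hits every
# point, and a straight line (slope and intercept) that fits only modestly.
for name, L, k in [("sine", 50.0, 2), ("line", -5.0, 2)]:
    print(f"{name}: AIC = {aic_score(L, k):.2f}, BIC = {bic_score(L, k, n):.2f}")
```

Since both models have k = 2, the penalty terms cancel when comparing them and the comparison is decided by L alone.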

One more example.

For any set of data, if you can fit a curve perfectly through every data point, and if your measurement error σ is an adjustable parameter, then you can take the measurement error to zero to have a fit with infinite accuracy. Now when we evaluate the likelihood of the data, we find it running off to infinity! Thus our ‘fit to data’ term L goes to infinity, while the model complexity penalty stays a small finite number.
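Here is a minimal sketch of that blow-up, assuming independent Gaussian measurement errors (the Gaussian form is an assumption made for illustration; the text only says that σ is the measurement error). With a perfect fit every residual is zero, so the log likelihood reduces to –N log(σ√(2π)), which grows without bound as σ shrinks.

```python
import math

def gaussian_log_likelihood(residuals, sigma):
    """Log likelihood of independent Gaussian errors with standard deviation sigma."""
    n = len(residuals)
    sum_sq = sum(r * r for r in residuals)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - sum_sq / (2 * sigma ** 2)

perfect_fit = [0.0] * 10   # a curve through every point leaves zero residuals
log_likelihoods = []
for sigma in (1.0, 1e-2, 1e-4, 1e-8):
    L = gaussian_log_likelihood(perfect_fit, sigma)
    log_likelihoods.append(L)
    print(f"sigma = {sigma:g}: L = {L:.2f}")
```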

Once again, we see the same lack of nuance dragging us into trouble. The number of parameters might do well at estimating overfitting potential for some types of well-behaved parameters, but it clearly doesn’t do the job universally. What we want is some measure that is sensitive to the potential for some parameters to capture “more” of the space of all possible distributions than others.

And lo and behold, we have such a measure! This is the purpose of information geometry, which uses the volume of a model in the space defined by the Fisher information metric as the penalty for overfitting potential. You can learn more about it in a post I wrote here.

Bayesian Occam’s Razor

A couple of days ago I posted a question that has been bugging me; namely, does Bayes overfit, and if not, why not?

Today I post the solution!

There are two parts: first, explaining where my initial argument against Bayes went wrong, and second, describing the Bayesian Occam’s Razor, the key to understanding how a Bayesian deals with overfitting.

Part 1: Why I was wrong

Here’s the argument I wrote initially:

  1. Overfitting arises from an excessive focus on accommodation. (If your only epistemic priority is accommodating the data you receive, then you will over-accommodate the data, by fitting the noise in the data instead of just the underlying trend.)
  2. We can deal with overfitting by optimizing for other epistemic virtues like simplicity, predictive accuracy, or some measure of distance to truth. (For example, minimum description length and maximum entropy optimize for simplicity, and cross validation optimizes for predictive accuracy).
  3. Bayesianism is an epistemological procedure that has two steps, setting of priors and updating those priors.
  4. Updating of priors is done via Bayes’ rule, which rewards theories according to how well they accommodate their data (creating the potential for overfitting).
  5. Bayesian priors can be set in ways that optimize for other epistemic virtues, like simplicity or humility.
  6. In the limit of infinite evidence, differences in priors between empirically distinguishable theories are washed away.
  7. Thus, in the limit, Bayesianism becomes a primarily accommodating procedure, as the strength of the evidential update swamps your initial differences in priors.

Here’s a more formal version of the argument:

  1. The relative probabilities of two models given data are calculated by Bayes’ rule:
    P(M | D) / P(M’ | D)  = [P(M) / P(M’)]・[P(D | M) / P(D | M’)]
  2. If M overfits the data and M’ does not, then as the size of the data set |D| goes to infinity, the likelihood factor P(D | M) / P(D | M’) goes to infinity.
  3. Thus the posterior probability P(M | D) should go to 1 for the model that most drastically overfits the data.
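The odds form of Bayes’ rule in step 1 can be sketched numerically. The prior odds and likelihood ratios below are hypothetical numbers, picked only to show how a growing likelihood factor would swamp even a very small prior if the premise in step 2 held.

```python
def posterior_odds(prior_odds, likelihood_ratio):
    """P(M | D) / P(M' | D) = prior odds times likelihood ratio."""
    return prior_odds * likelihood_ratio

def odds_to_probability(odds):
    """Convert odds for M against M' into P(M | D), with only two models in play."""
    return odds / (1.0 + odds)

prior_odds = 1e-6   # hypothetical: M starts out a million times less credible than M'
probabilities = []
for likelihood_ratio in (1e2, 1e6, 1e12):
    p = odds_to_probability(posterior_odds(prior_odds, likelihood_ratio))
    probabilities.append(p)
    print(f"likelihood ratio {likelihood_ratio:g}: P(M | D) = {p:.6f}")
```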

This argument is wrong for a couple of reasons. For one, the argument assumes that as the size of the data set grows, the model stays the same. But this is very much not going to be true in general. The task of overfitting gets harder and harder as the number of data points goes up. It’s not that there’s no longer noise in the data; it’s that the signal becomes more and more powerful.

A perfect polynomial fit on 100 data points must have, at the worst, 100 parameters. On 1000 data points: 1000 parameters. Etc. In general, as you add more data points, a model that was initially overfitting (e.g. the 100-parameter distribution) will find that it is harder and harder to ignore the signal for the noise, and the next best overfitting model will have more parameters (e.g. the 1000-parameter distribution).
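The parameter count needed for a perfect polynomial fit is easy to check on a small case. The sketch below uses 8 points rather than 100 to keep the numerics well behaved, and the data (a linear trend plus Gaussian noise) are made up for illustration. A polynomial of degree N – 1 has N coefficients and passes through all N points; a straight line does not.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
n_points = 8
x = np.linspace(0.0, 1.0, n_points)
y = 2.0 * x + rng.normal(scale=0.3, size=n_points)   # linear trend plus noise

# Degree n_points - 1 means n_points coefficients: enough to hit every point.
exact = Polynomial.fit(x, y, deg=n_points - 1)
line = Polynomial.fit(x, y, deg=1)

max_err_exact = float(np.max(np.abs(exact(x) - y)))
max_err_line = float(np.max(np.abs(line(x) - y)))
print(f"{exact.coef.size} parameters: max error = {max_err_exact:.2e}")
print(f"{line.coef.size} parameters: max error = {max_err_line:.2e}")
```

Adding a ninth data point would, in general, break the perfect fit of the 8-parameter polynomial and call for a ninth parameter.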

But now we have a very natural solution to the problem we started with! It is true that as the number of data points increases, the evidential support for the model that overfits the data will get larger and larger. It’s also true that the number of parameters required to overfit the data will grow as well. So if your prior in a model is a decreasing function of the number of parameters in the model, then you can in principle find a perfect balance and avoid overfitting. This perfect balance would be characterized by the following: each time you increase the number of parameters, the prior should decrease by an amount proportional to how much more you get rewarded by overfitting the data with the extra parameters.

How do we find this prior in practice? Beats me… I’d be curious to know, myself.

But what’s most interesting to me is that to solve overfitting as a Bayesian, you don’t even need the priors; the solution comes from the evidential update! It turns out that, in fact, the likelihood function for updating credences in a model given data automatically incorporates a penalty for model overparameterization. Which brings us to part 2!

Part 2: Bayesian Occam’s Razor

That last sentence bears repeating. In reality, although priors can play some role by manually penalizing models with high overfitting potential, the true source of the Bayesian Occam’s razor comes from the evidential update. What we’ll find by the end of this post is that models that overfit don’t actually get a stronger evidential update than models that don’t.

You might wonder how this is possible. Isn’t it practically the definition of overfitting that it is an enhancement of the strength of an evidential update through fitting to noise in the data?

Sort of. It is super important to keep in mind the distinction between a model and a distribution. A distribution is a single probability function over your possible observable data. A model is a set of distributions, characterized by a set of parameters. When we say that some models have the potential to overfit a set of data, what we are really saying is that some models contain distributions that overfit the data.

Why is this important? Because assessing the posterior probability of the model is not the same as assessing the posterior probability of the overfitting distribution within the model! Here’s Bayes’ rule, applied to the model and to the overfitting distribution:

(1) P(M | D) = P(M)・P(D | M) / P(D)

(2) P(theta hat | D) = P(theta hat)・P(D | theta hat) / P(D)

It’s clear how to evaluate equation (2). You have some prior probability assigned to theta hat, you know how to assess the likelihood function P(D | theta hat), and P(D) is an integral that is in principle do-able. In addition, equation (2) has the scary feature we’ve been talking about: the likelihood function P(D | theta hat) is really really large if our parameter theta hat overfits the data, potentially large enough to swamp the priors and screw up our Bayesian calculus.

But what we’re really interested in evaluating is not equation (2), but equation (1)! This is, after all, model selection; we are in the end trying to assess the quality of different models, not individual distributions.

So how do we evaluate (1)? The key term is P(D | M); your prior over the models and the data you receive are not too important for the moment. What is P(D | M)? This question does not actually have an obvious answer… M is a model, a set of distributions, not a single distribution. If we were looking at one distribution, it would be easy to assess the likelihood of the data given that distribution.

So what does P(D | M) really mean?

It represents the average probability of the data, given the model. It’s as if you were to draw a distribution at random from your model, and see how well it fits the data. More precisely, you draw a distribution from your model, according to your prior distribution over the distributions in the model.

That was a mouthful. But the basic idea is simple; a model is an infinite set of distributions, each corresponding to a particular set of values for the parameters that define the model. You have a prior distribution over these values for the parameters, and you use this prior distribution to “randomly” select a distribution in your model. You then assess the probability of the data given that distribution, and voila, you have your likelihood function.

In other words…

P(D | M) = ∫ P(D | θ) P(θ | M) dθ

Now, an overfitting model has a massive space of parameters, and some small region of this space contains distributions that fit the data really well. On the other hand, a simple model that generalizes well has a small space of parameters, and a region of this space contains distributions that fit the data well (though not as well as the overfitter).

So on average, you are much less likely to select the optimal distribution in the overfitting model than in the generalizable model. Why? Because the space of parameters you must search through to find it is so much larger!

True, when you do select the optimal distribution in the overfitting model, you get rewarded with a better fit to the data than you could have gotten from the nice model. But the balance, in the end, pushes you towards simpler and more general models.

This is the Bayesian Occam’s Razor! Models that are underparameterized do poorly on average, because they just can’t fit the data at all. Models that are overparameterized do poorly on average, because the subset of the parameter space that fits the data well is so tiny compared to the volume of the parameter space as a whole. And the models that strike the perfect balance are those that have enough parameters to fit the data well, but not so many as to excessively bloat the parameter space.
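A toy case makes the averaging explicit. The setup below is an illustrative assumption, not something from the lecture notes: a particular sequence of 10 coin tosses containing 6 heads, a “fair coin” model with no free parameter, and a flexible model whose bias θ has a uniform prior on [0, 1]. The flexible model contains a distribution (θ = 0.6) that fits the data better than the fair coin does, yet its average fit P(D | M) is lower.

```python
import numpy as np

def likelihood(theta, heads, tosses):
    """P(D | theta) for one particular sequence with the given number of heads."""
    return theta ** heads * (1.0 - theta) ** (tosses - heads)

def evidence_uniform_prior(heads, tosses, n_grid=100_000):
    """P(D | M) = integral of P(D | theta) P(theta | M) dtheta, uniform prior on [0, 1].

    Approximated with the midpoint rule: the mean likelihood over an even grid.
    """
    theta = (np.arange(n_grid) + 0.5) / n_grid
    return float(np.mean(likelihood(theta, heads, tosses)))

heads, tosses = 6, 10
evidence_fair = likelihood(0.5, heads, tosses)                  # no free parameter
evidence_flexible = evidence_uniform_prior(heads, tosses)       # averaged over theta
best_fit_flexible = likelihood(heads / tosses, heads, tosses)   # best single theta

print(f"fair coin, P(D | M):          {evidence_fair:.6f}")
print(f"flexible, best distribution:  {best_fit_flexible:.6f}")
print(f"flexible, P(D | M):           {evidence_flexible:.6f}")
```

If the data were instead 10 heads out of 10, the same calculation would favor the flexible model, since the fair coin can no longer fit the data well.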

Here are some lecture slides from these great notes that have some helpful visualizations:

[Images: two lecture slides]

Recapping in a few sentences: Simpler models are promoted, simply because they do well on average. And evidential support for a model comes down to the performance on average, not optimal performance. The likelihood in question is not P(data | best distribution in model), it’s P(data | average distribution in model). So overfitting models actually don’t get as much evidential support from data when assessing the model quality as a whole!

Ain’t that cool??

A model selection puzzle: Why is BIC ≠ AIC?

Slide 19 from this lecture:

[Image: slide 19 of the lecture]

This is a really important result. It says that Bayesian updating ultimately converges to the distribution in a model that minimizes DKL, even when the truth is not in your model.

But it is also confusing to me, for the following reason.

If Bayes converges to the minimum DKL solution, and BIC approximates Bayes, and if AIC approximately finds the minimum DKL solution… well, then how can they give different answers?

In other words, how can all three of the following statements be true?

  1. BIC approximates Bayes, which minimizes DKL.
  2. AIC approximates the minimum DKL solution.
  3. But BIC ≠ AIC.

Clearly we have a problem here.

It’s possible that the answer to this is just that the differences arise from the differences in approximations between AIC and BIC. But this seems like an inadequate explanation to account for such a huge difference, on the order of log(size of data set).

A friend of mine suggested that the reason is that the derivation of BIC assumes that the truth is in the set of candidate models, and this assumption is broken in the condition where Bayes optimizes for DKL.

I’m not sure how strongly ‘the truth is in your set of candidate models’ is actually assumed by BIC. I know that this is the standard thing people say about BIC, but is it really that the exact truth has to be in the model, or just that the model has a low overall bias? For instance, you can derive AIC by assuming that the truth is in your set of candidate models. But you don’t need this assumption; you can also derive AIC as an approximate measure of DKL when your set of candidate models contains models with low bias.

This question amounts to looking closely at the derivation of BIC to see what is absolutely necessary for the result. For now, I’m just pointing out the basic confusion, and will hopefully post a solution soon!

Bayes and overfitting

In a couple of previous posts, I’ve said some things about Bayesianism that I now think might not be right. Specifically, I claimed a few times that Bayesians will have trouble with overfitting data. Having looked into it more and seen some complicated arguments on either side, I’m less sure about this. I’m currently just confused about it, so I’m writing up my confusion here.

The reasoning behind my initial claim was something like this:

  1. Overfitting arises from an excessive focus on accommodation. (If your only epistemic priority is accommodating the data you receive, then you will over-accommodate the data, by fitting the noise in the data instead of just the underlying trend.)
  2. We can deal with overfitting by optimizing for other epistemic virtues like simplicity, predictive accuracy, or some measure of distance to truth. (For example, minimum description length and maximum entropy optimize for simplicity, and cross validation optimizes for predictive accuracy).
  3. Bayesianism is an epistemological procedure that has two steps, setting of priors and updating those priors.
  4. Updating of priors is done via Bayes’ rule, which rewards theories according to how well they accommodate their data (creating the potential for overfitting).
  5. Bayesian priors can be set in ways that optimize for other epistemic virtues, like simplicity or humility.
  6. In the limit of infinite evidence, differences in priors between empirically distinguishable theories are washed away.
  7. Thus, in the limit, Bayesianism becomes a primarily accommodating procedure, as the strength of the evidential update swamps your initial differences in priors.

In other words, the model that is best supported by the data will be the one that fits it perfectly (i.e. overfitting). We get out of this by giving overfitting models low priors… but we should expect that even this won’t be sufficient if we get strong enough evidence.

Is this wrong? And why?


The basics of information divergence

I’ve talked quite a bit about DKL on this blog, but I think I’ve yet to give a simple introduction to the concept. That’s what this post is about: an introduction to DKL in all its wonder!

What is DKL?

Essentially, DKL is a measure of the information distance between a given model of reality and reality itself. Information distance, more precisely, is a quantification of how many bits of information you would need to receive in order to update your model of reality into perfect alignment with reality.

Said another way, it is how much information you would lose if you started with a perfectly aligned model of reality and ended with the model of reality that you currently have. And said one final way, it is how much more surprised you will be on average given your beliefs, than you would be if you had all true beliefs.

Here’s the functional form of DKL:

DKL(Ptrue, P) = ∫ Ptrue log(Ptrue / P)
= Etrue[ log(Ptrue / P) ]

The Etrue[-] on the second line refers to an expected value taken over the true distribution.
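For discrete outcomes the integral becomes a sum, and the formula is only a few lines of code. This is a minimal sketch, assuming distributions are given as plain lists of probabilities and using base-2 logs so that the answer comes out in bits. The example distributions are made up.

```python
import math

def kl_divergence(p_true, p_model):
    """D_KL(p_true, p_model) in bits for discrete distributions.

    Outcomes with p_true == 0 contribute nothing to the sum.
    Assumes p_model > 0 wherever p_true > 0.
    """
    return sum(
        pt * math.log2(pt / pm)
        for pt, pm in zip(p_true, p_model)
        if pt > 0
    )

p_true = [0.5, 0.25, 0.25]      # hypothetical "reality"
p_model = [1 / 3, 1 / 3, 1 / 3]  # hypothetical model: uniform

print(kl_divergence(p_true, p_model))  # small positive number of bits
print(kl_divergence(p_true, p_true))   # a perfect model scores zero
```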

Why the log? I give some intuitive reasons here.

The problem, however, is that DKL cannot be directly calculated. Notice that one of the inputs to the function is the true probability distribution over outcomes. If you had access to this from the start, then you would have no need for any fancy methods of inference in the first place.

However, there are good ways to indirectly approximate DKL. How could we ever approximate a function that takes in an input that we don’t know? Through data!

Loosely speaking, data functions as a window that allows you to sneak peeks at reality. When you observe the outcomes of an experiment, the result you get will not always be aligned with your beliefs about reality. Instead, the outcomes of the experiment reflect the nature of reality itself, unfiltered by your beliefs.

(This is putting aside subtleties about good experimental design, but even those subtleties are unnecessary; technically the data you get is always a product of the nature of reality as it is. It’s just that our interpretation of the data might be flawed.)

So if we have access to some set of data from well-designed experiments (that is, experiments whose results we are correctly interpreting), we can use it to form an approximation of the DKL of any given model of reality. This first approximation is called the log loss, and takes the following form:

Log Loss = – Edata[ log(P) ]
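One detail worth spelling out: what the log loss converges to is the cross entropy, which equals DKL plus the entropy of the true distribution. That entropy term is the same for every model, so ranking models by log loss ranks them by DKL. Here is a small sketch with made-up distributions, where the samples are drawn from a known "truth" only so that we can check the answer.

```python
import math
import random

def log_loss(samples, p_model):
    """Average of -log2 P(x) over observed outcomes x (indices into p_model)."""
    return -sum(math.log2(p_model[x]) for x in samples) / len(samples)

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

rng = random.Random(0)                      # fixed seed for repeatability
p_true = [0.5, 0.25, 0.25]                  # hypothetical reality
p_uniform = [1 / 3, 1 / 3, 1 / 3]           # hypothetical poor model
samples = rng.choices(range(3), weights=p_true, k=10_000)

loss_true = log_loss(samples, p_true)
loss_uniform = log_loss(samples, p_uniform)
print(loss_true, loss_uniform)
# Subtracting the (normally unknown) entropy of reality recovers an
# estimate of DKL for the uniform model:
print(loss_uniform - entropy(p_true))
```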

There is one more problem with this notion of using data to approximate DKL. The problem is that normally, we use data to update our beliefs. If we first update our beliefs with the data, and then approximate the DKL of our new distribution using the same data, then we are biasing our approximation. It’s sort of like assessing intelligence by giving people an IQ test after letting them study that exact test and its answer key. If they do well on that test, it might not be because they are actually intelligent, but rather just that they’ve memorized all of the answers (overfit to the data of the IQ test).

So we have a few choices. First, we could refuse to update our beliefs on the data, and then have a nice unbiased estimate of the DKL of our un-updated distribution. Second, we could update our beliefs on the data, but give up hope of an unbiased estimate of DKL. Third, we could use some of the data for updating our beliefs, and the rest of it for evaluating DKL (this option is called cross validation). Or finally, we could try to find some way to approximate the amount of bias introduced by a given update of beliefs to our estimate of DKL.

Amazingly, we actually know precisely how to do this final option! This was the great contribution of the brilliant Japanese statistician Hirotugu Akaike. The equation he derived when trying to quantify the degree of bias is called the Akaike information criterion.

AIC(Model M) = Number of parameters in M – log P(data | M)

The best model in a set is the one with the lowest AIC score. It makes a lot of sense that models with more parameters are penalized; models with more tweakable parameters are like students that are better at memorizing an answer key to their test.

Can we do any better than AIC? Yes, in fact! For small data sets, a better measure is AICc, which adds a correction term that scales like 1/N.
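Both scores are one-liners once you have a model’s maximized log-likelihood. A minimal sketch follows, with two caveats. First, the textbook convention is AIC = 2k – 2 log L, which is just twice the form written above and ranks models identically; the code keeps the form above. Second, the usual AICc correction term 2k(k+1)/(N – k – 1) is halved here to stay on that same scale. The numbers in the example are made up.

```python
def aic(k, log_lik):
    """AIC in the form used above: number of parameters minus log-likelihood."""
    return k - log_lik

def aicc(k, log_lik, n):
    """Small-sample corrected AIC on the same (halved) scale as aic().

    The textbook correction 2k(k+1)/(n-k-1) is halved to match.
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    return aic(k, log_lik) + k * (k + 1) / (n - k - 1)

# Hypothetical model: 5 parameters, log-likelihood of -40.
print(aic(5, -40.0))         # same for any sample size
print(aicc(5, -40.0, 25))    # noticeable extra penalty at N = 25
print(aicc(5, -40.0, 1000))  # correction nearly vanishes at N = 1000
```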

So to summarize everything in a few sentences:

  1. DKL is a measure of how far your model of reality is from the truth.
  2. DKL cannot be calculated without prior knowledge of the truth.
  3. However, we can use data to approximate DKL, by calculating log loss.
  4. Unfortunately, if we are also using the data to update our beliefs, log loss is a biased estimator of DKL.
  5. We can approximate the bias and negate it using the Akaike information criterion, AIC.
  6. An even better approximation for small data sets is AICc.

More model selection visualizations

I added cross validation to my model selection program, and ooh boy do I now understand why people want to find more efficient alternatives.

Reliable CV calculations end up making the program run orders of magnitude slower, as they require re-fitting the curve for every partition of your data, and this is the most resource-intensive part of the process. While it’s beautiful in theory, this is a major setback in practice.

For a bunch of different true distributions I tried, I found that CV pretty much always lines up with all of the other methods, just with a more “jagged” curve and a steeper complexity penalty. This similarity looks like a point in favor of the other methods, assuming that CV does a good job of measuring predictive power. (And for those interested in the technical details, the cross validation procedure implemented here is leave-one-out CV.)
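To show where the cost comes from, here is a minimal sketch of leave-one-out CV. It is not the program used for the plots: it compares only a constant fit against a straight-line fit, scores predictions by squared error as a stand-in for predictive power, and uses made-up data. The thing to notice is the refit inside the loop, once per data point.

```python
import random

def fit_constant(xs, ys):
    """Best constant model, returned as (slope, intercept) with slope 0."""
    return 0.0, sum(ys) / len(ys)

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def loo_cv_score(xs, ys, fit):
    """Mean squared error on each held-out point, refitting every time."""
    total = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]   # drop point i
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit(train_x, train_y)    # one full refit per data point
        total += (ys[i] - (a * xs[i] + b)) ** 2
    return total / len(xs)

rng = random.Random(1)
xs = [i / 10 for i in range(30)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]  # hypothetical noisy line

score_constant = loo_cv_score(xs, ys, fit_constant)
score_line = loo_cv_score(xs, ys, fit_line)
print(score_constant, score_line)  # lower is better
```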

I also added an explicit calculation of DKL, which should help to give some idea of a benchmark against which we can measure all of these methods. Anyway, I have some more images!

True curve = e^(-x/3) – e^(x/3)

N = 100 data points


N = 100 data points, zoomed in a bit


For smaller data sets, you can see that AICc tracks DKL much more closely than any other technique (which is of course the entire purpose of AICc).

N = 25


N = 25, zoomed


N = 25, super zoom


Interestingly, you start to see BIC really suffering relative to the other methods, beginning to overfit the data. This is counterintuitive; BIC is supposed to be the method that penalizes complexity excessively and underfits the data. It may be relevant that in this program, I use the form of BIC that doesn’t make the large N approximation.

BIC (small N) = k log(N/2π) – 2L
BIC (ordinary) = k log(N) – 2L

When I use the approximation instead, the problem disappears. Of course, this is not too great a solution; why should the large N approximation be necessary for fixing BIC specifically when N is small?

(Edit: after many more runs, it’s looking more like it may have just been a random accident that BIC overfit in the first few runs.)
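One piece of arithmetic that may bear on this: the two forms differ by exactly k log(2π), about 1.84 per parameter, at every N. So the small N form always charges less for each extra parameter, and at N = 25 the per-parameter penalty is log(25/2π) ≈ 1.38 versus log(25) ≈ 3.22. A quick sketch, assuming L in the formulas above is the log-likelihood and using made-up values for k and L:

```python
import math

def bic_small_n(k, log_lik, n):
    """BIC without the large-N approximation: k log(N / 2π) - 2L."""
    return k * math.log(n / (2 * math.pi)) - 2 * log_lik

def bic_ordinary(k, log_lik, n):
    """Usual large-N form of BIC: k log(N) - 2L."""
    return k * math.log(n) - 2 * log_lik

k, log_lik = 5, -40.0  # hypothetical model
for n in (25, 100, 10_000):
    gap = bic_ordinary(k, log_lik, n) - bic_small_n(k, log_lik, n)
    # The gap is k * log(2π) no matter what n is.
    print(n, math.log(n / (2 * math.pi)), math.log(n), gap)
```

Whether this constant offset explains the runs where BIC overfit is a separate question; the sketch only shows that the small N form is the more lenient of the two.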

Just for the heck of it, I also decided to torture my polynomials a little bit by making them try to fit the function 1/x. I got dismal results no matter how I tried to tweak the hyper-parameters, which is, well, pretty much what I expected (1/x has no Taylor expansion around 0, for one thing).

More surprisingly, I tried fitting a simple Gaussian curve and again got fairly bad results. The methods disagreed with one another a lot of the time (although weirdly, AICc and BIC seemed to generally be in agreement), and gave curves that were overfitting the data a bit. The part that seems hardest for a polynomial to nail down is the flattened ends of the Gaussian curve.

True curve = 40 exp(-x^2/2), N = 100 data points


And zoomed in…


If the jaggedness of the cross validation score is not purely an artifact of random fluctuations in the data, I don’t really get it. Why should, for example, a 27-parameter model roughly equal a 25-parameter model in predictive power, but a 26-parameter model be significantly worse than both?