What is integrated information?

Integrated information theory relates consciousness to degrees of integrated information within a physical system. I recently became interested in IIT and found it surprisingly hard to locate a good simple explanation of the actual mathematics of integrated information online.

Having eventually just read through all of the original papers introducing IIT, I discovered that integrated information is closely related to some of my favorite bits of mathematics, involving information theory and causal modeling.  This was exciting enough to me that I decided to write a guide to understanding integrated information. My goal in this post is to introduce a beginner to integrated information in a rigorous and (hopefully!) intuitive way.

I’ll describe it in increasing levels of complexity, so that even if you eventually get lost somewhere in the post, you’ll be able to walk away having learned something. If you get to the end of this post, you should be able to sit down with a pencil and paper and calculate the amount of integrated information in simple systems, and to see how to calculate it in principle for any system.

Level 1

So first, integrated information is a measure of the degree to which the components of a system are working together to produce outputs.

A system composed of many individual parts that are not interacting with each other in any way is completely un-integrated – it has an integrated information ɸ = 0. On the other hand, a system composed entirely of parts that are tightly entangled with one another will have a high amount of integrated information, ɸ >> 0.

For example, consider a simple model of a camera sensor.

[Image: grid of camera sensor pixels]

This sensor is composed of many independent parts functioning completely separately. Each pixel stores a unit of information about the outside world, regardless of what its neighboring pixels are doing. If we were to somehow sever the causal connections between the two halves of the sensor, each half would still capture and store information in exactly the same way.

Now compare this to a human brain.

[Image: snapshot of neuron activity in the brain]

The nervous system is a highly entangled mesh of neurons, each interacting with many many neighbors in functionally important ways. If we tried to cut the brain in half, severing all the causal connections between the two sides, we would get an enormous change in brain functioning.

Makes sense? Okay, on to level 2.

Level 2

So, integrated information has to do with the degree to which the components of a system are working together to produce outputs. Let’s delve a little deeper.

We just said that we can tell that the brain is integrating lots of information, because the functioning would be drastically disrupted if you cut it in half. A keen reader might have realized that the degree to which the functioning is disrupted will depend a lot on how you cut it in half.

For instance, cut off the front half of somebody’s brain, and you will end up with total dysfunction. But you can entirely remove somebody’s cerebellum (roughly 80% of the brain’s neurons), and end up with a person who has difficulty with coordination and is a slow learner, but is otherwise pretty ordinary.

[Image: human head, MRI and 3D CT scans]

What this is really telling us is that different parts of the brain are integrating information differently. So how do we quantify the total integration of information of the brain? Which cut do we choose when evaluating the decrease in functioning?

Simple: We look at every possible way of partitioning the brain into two parts. For each one, we see how much the brain’s functioning is affected. Then we locate the minimum information partition, that is, the partition that results in the smallest change in brain functioning. The change in functioning that results from this particular partition is the integrated information!

Okay. Now, what exactly do we mean by “changes to the system’s functioning”? How do we measure this?

Answer: The functionality of a system is defined by the way in which the current state of the system constrains the past and future states of the system.

To make full technical sense of this, we have to dive a little deeper.

Level 3

How many possible states are there of a Connect Four board?

(I promise this is relevant)

The board is 6 by 7, and each spot can be either a red piece, a black piece, or empty.

[Image: an empty Connect Four board]

So a simple upper bound on the number of total possible board states is 3⁴² (of course, the actual number of possible states will be much lower than this, since some positions are impossible to get into).

Now, consider what you know about the possible past and future states of the board if the board state is currently…

[Image: a Connect Four board with a single piece played]

Clearly there’s only one possible past state:

[Image: the empty Connect Four board]

And there are seven possible future states, one for each column the next piece could be dropped into.

What this tells us is that the information about the current state of the board constrains the possible past and future states, selecting exactly one possible board out of the 3⁴² possibilities for the past, and seven out of 3⁴² possibilities for the future.

More generally, for any given system S we have a probability distribution over past and future states, given that the current state is X.

[Diagram: the system S in its present state X]

Pfuture(X, S) = Pr( Future state of S | Present state of S is X )
Ppast(X, S) = Pr( Past state of S | Present state of S is X )

For any partition of the system into two components, S1 and S2, we can consider the future and past distributions given that the states of the components are, respectively, X1 and X2, where X = (X1, X2).

[Diagram: the system partitioned into S1 and S2]

Pfuture(X, S1, S2) = Pr( Future state of S1 | Present state of S1 is X1 )・Pr( Future state of S2 | Present state of S2 is X2 )
Ppast(X, S1, S2) = Pr( Past state of S1 | Present state of S1 is X1 )・Pr( Past state of S2 | Present state of S2 is X2 )

Now, we just need to compare our distributions before the partition to our distributions after the partition. For this we need some type of distance function D that assesses how far apart two probability distributions are. Then we define the cause information and the effect information for the partition (S1, S2).

Cause information = D( Ppast(X, S), Ppast(X, S1, S2) )
Effect information = D( Pfuture(X, S), Pfuture(X, S1, S2) )

In short, the cause information is how much the distribution over past states changes when you partition your system into two separate systems, and the effect information is how much the distribution over future states changes when you partition the system.

The cause-effect information CEI is then defined as the minimum of the cause information CI and effect information EI.

CEI = min{ CI, EI }

We’ve almost made it all the way to our full definition of ɸ! Our last step is to calculate the CEI for every possible partition of S into two pieces, and then select the partition that minimizes CEI (the minimum information partition MIP).

The integrated information is just the cause effect information of the minimum information partition!

ɸ = CEI(MIP)

Level 4

We’ve now semi-rigorously defined ɸ. But to really get a sense of how to calculate ɸ, we need to delve into causal diagrams. At this point, I’m going to assume familiarity with causal modeling. The basics are covered in a series of posts I wrote starting here.

Here’s a simple example system:

[Diagram: a two-variable system with an XOR gate and an AND gate]

This diagram tells us that the system is composed of two variables, A and B. Each of these variables can take on the values 0 and 1. The system evolves according to the following simple update rule:

A(t + 1) = A(t) XOR B(t)
B(t + 1) = A(t) AND B(t)

We can redraw this as a causal diagram from A and B at time 0 to A and B at time 1:

[Diagram: causal diagram from A and B at time 0 to A and B at time 1]

What this amounts to is the following system evolution rule:

    ABt → ABt+1
    00 → 00
    01 → 10
    10 → 10
    11 → 01
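
If you’d like to double-check the table, here’s a tiny Python snippet (mine, not anything from the original papers) that generates it directly from the update rule:

    # Evolution rule for the example system: A' = A XOR B, B' = A AND B
    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(f"{a}{b} → {a ^ b}{a & b}")
    # prints: 00 → 00, 01 → 10, 10 → 10, 11 → 01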

Now, suppose that we know that the system is currently in the state AB = 00. What does this tell us about the future and past states of the system?

Well, since the system evolution is deterministic, we can say with certainty that the next state of the system will be 00. And since there’s only one way to end up in the state 00, we know that the past state of the system was 00.

We can plot the probability distributions over the past and future distributions as follows:

[Plot: past and future probability distributions for the full system]

This is not too interesting a distribution… no information is lost or gained going into the past or future. Now we partition the system:

[Diagram: the XOR/AND system with the partition cut]

The causal diagram, when cut, looks like:

[Diagram: the cut causal diagram, with noise variables replacing the severed inputs]

Why do we have the two “noise” variables? Well, both A and B take two variables as inputs. Since one of these causal inputs has been cut off, we replace it with a random variable that’s equally likely to be a 0 or a 1. This procedure is called “noising” the causal connections across the partition.

According to this diagram, we now have two independent distributions over the two parts of the system, A and B. To get the distribution over the total future state of the system, we just multiply the two together:

P(A1, B1 | A0, B0) = P(A1 | A0) P(B1 | B0)

We can compute the two distributions P(A1 | A0) and P(B1 | B0) straightforwardly, by looking at how each variable evolves in our new causal diagram.

A0 = 0 ⇒ A1 = 0, 1 (½ probability each)
B0 = 0 ⇒ B1 = 0

A0 = 0 ⇒ A-1 = 0, 1 (½ probability each)
B0 = 0 ⇒ B-1 = 0, 1 (probabilities ⅔ and ⅓)

This implies the following probability distribution for the partitioned system:

[Plot: past and future probability distributions for the partitioned system]

I recommend you go through and calculate this for yourself. Everything follows from the updating rules that define the system and the noise assumption.

Good! Now we have two distributions, one for the full system and one for the partitioned system. How do we measure the difference between these distributions?

There are a few possible measures we could use. My favorite of these is the Kullback-Leibler divergence DKL. Technically, this metric is only used in IIT 2.0, not IIT 3.0 (which uses the earth-mover’s distance). I prefer DKL, as it has a nice interpretation as the amount of information lost when the system is partitioned. I have a post describing DKL here.

Here’s the definition of DKL:

DKL(P, Q) = ∑ Pi log(Pi / Qi)

We can use this quantity to calculate the cause information and the effect information:

Cause information = log(3) ≈ 1.6
Effect information = log(2) = 1

These values tell us that our partition destroys about 0.6 more bits of information about the past than it does about the future. For the purpose of integrated information, we only care about the smaller of these two (for reasons that I don’t find entirely convincing).

Cause-effect information = min{ 1, 1.6 } = 1

Now, we’ve calculated the cause-effect information for this particular partition.

The integrated information is the cause-effect information of the minimum information partition. Since our system only has two components, the partition we’ve examined is the only possible partition, which makes it the minimum information partition by default. And thus, we’ve calculated ɸ for our system!

ɸ = 1
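
If you’d rather check this numerically than with pencil and paper, here’s a minimal Python sketch (my own, not code from the IIT papers) that takes the four distributions we just derived, computes the two KL divergences, and recovers ɸ = 1. States are ordered (00, 01, 10, 11).

    from math import log2

    # Distributions over the states (00, 01, 10, 11), given the present state AB = 00.
    p_past_full   = [1, 0, 0, 0]          # the full system pins the past to 00...
    p_future_full = [1, 0, 0, 0]          # ...and the future to 00
    q_past_cut    = [1/3, 1/6, 1/3, 1/6]  # the partitioned (noised) system
    q_future_cut  = [1/2, 0, 1/2, 0]

    def dkl(p, q):
        # Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing.
        return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    cause_info  = dkl(p_past_full, q_past_cut)      # log2(3) ≈ 1.585
    effect_info = dkl(p_future_full, q_future_cut)  # log2(2) = 1.0
    phi = min(cause_info, effect_info)              # the only partition, so this is ɸ
    print(cause_info, effect_info, phi)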

Level 5

Let’s now define ɸ in full generality.

Our system S consists of a vector of N variables X = (X1, X2, X3, …, XN), each an element in some space 𝒳. Our system also has an updating rule, which is a function f: 𝒳N → 𝒳N. In our previous example, 𝒳 = {0, 1}, N = 2, and f(x, y) = (x XOR y, x AND y).

More generally, our updating rule f can map X to a probability distribution over 𝒳N, rather than to a single state. We’ll denote P(Xt+1 | Xt) as the distribution over the possible future states, given the current state. P is defined by our updating rule: P(Xt+1 | Xt) = f(Xt). The distribution over possible past states will be denoted P(Xt-1 | Xt). We’ll obtain this using Bayes’ rule: P(Xt-1 | Xt) = P(Xt | Xt-1) P(Xt-1) / P(Xt) = f(Xt-1) P(Xt-1) / P(Xt).

A partition of the system is a subset of {1, 2, 3, …, N}, which we’ll label A. We define B = {1, 2, 3, …, N} \ A. Now we can define XA = (Xa)a∈A and XB = (Xb)b∈B. Loosely speaking, we can say that X = (XA, XB), i.e. that the total state is just the combination of the states of the two parts A and B.

We now define the distributions over future and past states in our partitioned system:

Q(Xt+1 | Xt) = P(XA, t+1 | XA, t) P(XB, t+1 | XB, t)
Q(Xt-1 | Xt) = P(XA, t-1 | XA, t) P(XB, t-1 | XB, t).

The effect information EI of the partition defined by A is the distance between P(Xt+1 | Xt) and Q(Xt+1 | Xt), and the cause information CI is defined similarly. The cause-effect information is defined as the minimum of these two.

CI(f, A, Xt) = D( P(Xt-1 | Xt), Q(Xt-1 | Xt) )
EI(f, A, Xt) = D( P(Xt+1 | Xt), Q(Xt+1 | Xt) )

CEI(f, A, Xt) = min{ CI(f, A, Xt), EI(f, A, Xt) }

And finally, we define the minimum information partition (MIP) and the integrated information:

MIP = argminA CEI(f, A, Xt)
ɸ(f, Xt) = minA CEI(f, A, Xt)
= CEI(f, MIP, Xt)

And we’re done!

Notice that our final result is a function of f (the updating function) as well as the current state of the system. What this means is that the integrated information of a system can change from moment to moment, even if the organization of the system remains the same.
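
Since ɸ is a function of the update rule f and the current state, it’s natural to write it as a function phi(update, state). Here’s a sketch of the whole recipe in Python (mine, not from the IIT papers), restricted to the simplifying assumptions used in this post: a deterministic update rule over binary variables, a uniform prior over past states, bipartitions only, noised cut connections, and KL divergence in bits as the distance measure D. Run on the XOR/AND system from Level 4, it returns ɸ = 1.

    from itertools import product, combinations
    from math import log2

    def phi(update, state):
        """A sketch of the full recipe, under the assumptions used above:
        deterministic update rule over binary variables, uniform prior over
        past states, bipartitions only, KL divergence (in bits) as D.
        (Assumes the present state has at least one possible past state.)"""
        n = len(state)
        all_states = list(product((0, 1), repeat=n))

        def part_future(part, part_state):
            # P(next state of `part` | its present state), with every variable
            # outside `part` replaced by uniform noise.
            outside = [i for i in range(n) if i not in part]
            dist = {}
            for noise in product((0, 1), repeat=len(outside)):
                full = [0] * n
                for i, v in zip(part, part_state):
                    full[i] = v
                for i, v in zip(outside, noise):
                    full[i] = v
                nxt = tuple(update(tuple(full))[i] for i in part)
                dist[nxt] = dist.get(nxt, 0.0) + 0.5 ** len(outside)
            return dist

        def part_past(part, part_state):
            # Bayes' rule with a uniform prior over the part's previous state.
            weights = {prev: part_future(part, prev).get(part_state, 0.0)
                       for prev in product((0, 1), repeat=len(part))}
            total = sum(weights.values())
            return {s: w / total for s, w in weights.items()}

        def product_dist(part_a, part_b, marg_a, marg_b):
            # Joint distribution over full states built from the parts' marginals.
            return {s: marg_a.get(tuple(s[i] for i in part_a), 0.0)
                       * marg_b.get(tuple(s[i] for i in part_b), 0.0)
                    for s in all_states}

        def dkl(p, q):
            return sum(p[s] * log2(p[s] / q[s]) for s in all_states if p[s] > 0)

        # Full-system distributions: deterministic future, Bayes for the past.
        p_future = {s: 1.0 if update(state) == s else 0.0 for s in all_states}
        preimage = [s for s in all_states if update(s) == state]
        p_past = {s: (1.0 / len(preimage) if s in preimage else 0.0)
                  for s in all_states}

        best = None
        for size in range(1, n // 2 + 1):
            for part_a in combinations(range(n), size):
                part_b = tuple(i for i in range(n) if i not in part_a)
                state_a = tuple(state[i] for i in part_a)
                state_b = tuple(state[i] for i in part_b)
                q_future = product_dist(part_a, part_b,
                                        part_future(part_a, state_a),
                                        part_future(part_b, state_b))
                q_past = product_dist(part_a, part_b,
                                      part_past(part_a, state_a),
                                      part_past(part_b, state_b))
                cei = min(dkl(p_past, q_past), dkl(p_future, q_future))
                best = cei if best is None else min(best, cei)
        return best

    # The XOR/AND system from Level 4, in the present state AB = 00:
    print(phi(lambda s: (s[0] ^ s[1], s[0] & s[1]), (0, 0)))   # prints 1.0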

By itself, this is not enough for the purposes of integrated information theory. Integrated information theory uses ɸ to define gradations of consciousness of systems, but the relationship between ɸ and consciousness isn’t exactly one-to-one (briefly, consciousness resides in non-overlapping local maxima of integrated information).

But this post is really meant to just be about integrated information, and the connections to the theory of consciousness are actually less interesting to me. So for now I’ll stop here! 🙂

Bayesian Occam’s Razor

A couple of days ago I posted a question that has been bugging me; namely, does Bayesian inference overfit, and if not, why not?

Today I post the solution!

There are two parts: first, explaining where my initial argument against Bayes went wrong, and second, describing the Bayesian Occam’s Razor, the key to understanding how a Bayesian deals with overfitting.

Part 1: Why I was wrong

Here’s the argument I wrote initially:

  1. Overfitting arises from an excessive focus on accommodation. (If your only epistemic priority is accommodating the data you receive, then you will over-accommodate the data, by fitting the noise in the data instead of just the underlying trend.)
  2. We can deal with overfitting by optimizing for other epistemic virtues like simplicity, predictive accuracy, or some measure of distance to truth. (For example, minimum description length and maximum entropy optimize for simplicity, and cross validation optimizes for predictive accuracy).
  3. Bayesianism is an epistemological procedure that has two steps, setting of priors and updating those priors.
  4. Updating of priors is done via Bayes’ rule, which rewards theories according to how well they accommodate their data (creating the potential for overfitting).
  5. Bayesian priors can be set in ways that optimize for other epistemic virtues, like simplicity or humility.
  6. In the limit of infinite evidence, differences in priors between empirically distinguishable theories are washed away.
  7. Thus, in the limit, Bayesianism becomes a primarily accommodating procedure, as the strength of the evidential update swamps your initial differences in priors.

Here’s a more formal version of the argument:

  1. The relative probability of two models given the data is calculated by Bayes’ rule:
    P(M | D) / P(M’ | D)  = P(M) / P(M’)・P(D | M) / P(D | M’)
  2. If M overfits the data and M’ does not, then as the size of the data set |D| goes to infinity, the likelihood factor P(D | M) / P(D | M’) goes to infinity.
  3. Thus the posterior probability P(M | D) should go to 1 for the model that most drastically overfits the data.

This argument is wrong for a couple of reasons. For one, the argument assumes that as the size of the data set grows, the model stays the same. But this is very much not going to be true in general. The task of overfitting gets harder and harder as the number of data points goes up. It’s not that there’s no longer noise in the data; it’s that the signal becomes more and more powerful.

A perfect polynomial fit on 100 data points needs, at worst, 100 parameters. On 1000 data points: 1000 parameters. And so on. In general, as you add more data points, a model that was initially overfitting (e.g. the 100-parameter model) will find it harder and harder to ignore the signal in favor of the noise, and the next-best overfitting model will have more parameters (e.g. the 1000-parameter model).

But now we have a very natural solution to the problem we started with! It is true that as the number of data points increases, the evidential support for the model that overfits the data will get larger and larger. It’s also true is that the number of parameters required to overfit the data will grow as well. So if your prior in a model is a decreasing function of the number of parameters in the model, then you can in principle find a perfect balance and avoid overfitting. This perfect balance would be characterized by the following: each time you increase the number of parameters, the prior should decrease by an amount proportional to how much more you get rewarded by overfitting the data with the extra parameters.

How do we find this prior in practice? Beats me… I’d be curious to know, myself.

But what’s most interesting to me is that to solve overfitting as a Bayesian, you don’t even need the priors; the solution comes from the evidential update! It turns out that the likelihood function for updating credences in a model given data automatically takes model overparameterization into account. Which brings us to part 2!

Part 2: Bayesian Occam’s Razor

That last sentence bears repeating. In reality, although priors can play some role by manually penalizing models with high overfitting potential, the true source of the Bayesian Occam’s razor is the evidential update. What we’ll find by the end of this post is that models that overfit don’t actually get a stronger evidential update than models that don’t.

You might wonder how this is possible. Isn’t it practically the definition of overfitting that it is an enhancement of the strength of an evidential update through fitting to noise in the data?

Sort of. It is super important to keep in mind the distinction between a model and a distribution. A distribution is a single probability function over your possible observable data. A model is a set of distributions, characterized by a set of parameters. When we say that some models have the potential to overfit a set of data, what we are really saying is that some models contain distributions that overfit the data.

Why is this important? Because assessing the posterior probability of the model is not the same as assessing the posterior probability of the overfitting distribution within the model! Here’s Bayes’ rule, applied to the model and to the overfitting distribution:

(1) P(M | D) = P(M)・P(D | M) / P(D)

(2) P(θ̂ | D) = P(θ̂)・P(D | θ̂) / P(D)

It’s clear how to evaluate equation (2). You have some prior probability assigned to θ̂, you know how to assess the likelihood function P(D | θ̂), and P(D) is an integral that is in principle do-able. In addition, equation (2) has the scary feature we’ve been talking about: the likelihood function P(D | θ̂) is really really large if our parameter θ̂ overfits the data, potentially large enough to swamp the priors and screw up our Bayesian calculus.

But what we’re really interested in evaluating is not equation (2), but equation (1)! This is, after all, model selection; we are in the end trying to assess the quality of different models, not individual distributions.

So how do we evaluate (1)? The key term is P(D | M); the prior over models P(M) and the normalization P(D) are not too important for the moment. What is P(D | M)? This question does not actually have an obvious answer… M is a model, a set of distributions, not a single distribution. If we were looking at one distribution, it would be easy to assess the likelihood of the data given that distribution.

So what does P(D | M) really mean?

It represents the average probability of the data, given the model. It’s as if you were to draw a distribution at random from your model, and see how well it fits the data. More precisely, you draw a distribution from your model, according to your prior distribution over the distributions in the model.

That was a mouthful. But the basic idea is simple; a model is an infinite set of distributions, each corresponding to a particular set of values for the parameters that define the model. You have a prior distribution over these values for the parameters, and you use this prior distribution to “randomly” select a distribution in your model. You then assess the probability of the data given that distribution, and voila, you have your likelihood function.

In other words…

P(D | M) = ∫ P(D | θ) P(θ | M) dθ

Now, an overfitting model has a massive space of parameters, and only a small region of this space contains distributions that fit the data really well. On the other hand, a simple model that generalizes well has a small space of parameters, and a region of this space contains distributions that fit the data well (though not as well as the overfitter).

So on average, you are much less likely to select the optimal distribution in the overfitting model than in the generalizable model. Why? Because the space of parameters you must search through to find it is so much larger!

True, when you do select the optimal distribution in the overfitting model, you get rewarded with a better fit to the data than you could have gotten from the nice model. But the balance, in the end, pushes you towards simpler and more general models.

This is the Bayesian Occam’s Razor! Models that are underparameterized do poorly on average, because they just can’t fit the data at all. Models that are overparametrized do poorly on average, because the subset of the parameter space that fits the data well is so tiny compared to the volume of the parameter space as a whole. And the models that strike the perfect balance are those that have enough parameters to fit the data well, but not too many as to excessively bloat the parameter space.
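
Here’s a toy numerical illustration of that balance (my own sketch, not from the original post). Both models below are Gaussians with unknown mean a and a uniform prior over a; they contain exactly the same best-fitting distribution, but the second spreads its prior over a ten-times-larger parameter range, and its marginal likelihood P(D | M) suffers for it.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=1.0, scale=1.0, size=20)   # the "true" mean is 1

    def log_likelihood(a, data):
        # Log-likelihood of the data under a unit-variance Gaussian with mean a.
        return np.sum(-0.5 * (data - a) ** 2 - 0.5 * np.log(2 * np.pi))

    def log_marginal_likelihood(data, a_min, a_max, grid_size=2001):
        # With a uniform prior on [a_min, a_max], P(D | M) is just the average of
        # P(D | a) over the parameter range (i.e. "draw a distribution at random
        # from your model and see how well it fits the data").
        a_grid = np.linspace(a_min, a_max, grid_size)
        log_liks = np.array([log_likelihood(a, data) for a in a_grid])
        m = log_liks.max()
        return m + np.log(np.mean(np.exp(log_liks - m)))

    narrow = log_marginal_likelihood(data, 0.0, 2.0)    # modest parameter space
    wide = log_marginal_likelihood(data, -10.0, 10.0)   # bloated parameter space
    print(narrow, wide)   # the narrow model gets the larger evidential update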

Here are some lecture slides from these great notes that have some helpful visualizations:

[Lecture slides: visualizations of the Bayesian Occam’s razor]

Recapping in a few sentences: Simpler models are promoted, simply because they do well on average. And evidential support for a model comes down to average performance, not optimal performance. The likelihood in question is not P(data | best distribution in model); it’s the average of P(data | distribution) over all the distributions in the model. So overfitting models actually don’t get as much evidential support from data when we assess model quality as a whole!

Ain’t that cool??

The basics of information divergence

I’ve talked quite a bit about DKL on this blog, but I think I’ve yet to give a simple introduction to the concept. That’s what this post is about: an introduction to DKL in all its wonder!

What is DKL?

Essentially, DKL is a measure of the information distance between a given model of reality and reality itself. Information distance, more precisely, is a quantification of how many bits of information you would need to receive in order to update your model of reality into perfect alignment with reality.

Said another way, it is how much information you would lose if you started with a perfectly aligned model of reality and ended with the model of reality that you currently have. And said one final way, it is how much more surprised you will be on average given your beliefs, than you would be if you had all true beliefs.

Here’s the functional form of DKL:

DKL(Ptrue, P) = ∫ Ptrue log(Ptrue / P)
= Etrue[ log(Ptrue / P) ]

The Etrue[-] on the second line refers to an expected value taken over the true distribution.

Why the log? I give some intuitive reasons here.

The problem, however, is that DKL cannot be directly calculated. Notice that one of the inputs to the function is the true probability distribution over outcomes. If you had access to this from the start, then you would have no need for any fancy methods of inference in the first place.

However, there are good ways to indirectly approximate DKL. How could we ever approximate a function that takes in an input that we don’t know? Through data!

Loosely speaking, data functions as a window that allows you to sneak peeks at reality. When you observe the outcomes of an experiment, the result you get will not always be aligned with your beliefs about reality. Instead, the outcomes of the experiment reflect the nature of reality itself, unfiltered by your beliefs.

(This is putting aside subtleties about good experimental design, but even those subtleties are unnecessary; technically the data you get is always a product of the nature of reality as it is,  it’s just that our interpretation of the data might be flawed.)

So if we have access to some set of data from well-designed experiments (that is, experiments whose results we are correctly interpreting), we can use it to form an approximation of the DKL of any given model of reality. This first approximation is called the log loss, and takes the following form:

Log Loss = – Edata[ log(P) ]
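
Here’s a small numerical sketch of the relationship (mine, not from the post). One caveat worth flagging: the log loss actually estimates DKL plus the entropy of the true distribution, but since that entropy term is the same for every model, ranking models by log loss is the same as ranking them by (approximate) DKL.

    import numpy as np

    rng = np.random.default_rng(0)
    p_true  = np.array([0.5, 0.3, 0.2])   # "reality"
    p_model = np.array([0.4, 0.4, 0.2])   # our model of reality

    # Exact quantities (these need access to p_true):
    dkl     = np.sum(p_true * np.log2(p_true / p_model))
    entropy = -np.sum(p_true * np.log2(p_true))

    # Data: our "window" onto reality, drawn from the true distribution.
    samples = rng.choice(len(p_true), size=100_000, p=p_true)

    # Log loss only needs the model and the data, not p_true.
    log_loss = -np.mean(np.log2(p_model[samples]))

    # log_loss ≈ dkl + entropy, so it estimates DKL up to a model-independent constant.
    print(dkl, log_loss - entropy)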

There is one more problem with this notion of using data to approximate DKL. The problem is that normally, we use data to update our beliefs. If we first update our beliefs with the data, and then approximate the DKL of our new distribution using the data, then we are biasing our approximation. It’s sort of like assessing intelligence by giving people an IQ test that they were allowed to study for by examining that exact IQ test and its answer key. If they do well on the test, it might not be because they are actually intelligent, but rather just because they’ve memorized all of the answers (overfit to the data of the IQ test).

So we have a few choices; first, we could refuse to update our beliefs on the data, and then have a nice unbiased estimate of the DKL of our un-updated distribution. Second, we could update our beliefs on the data, but give up hope of an unbiased estimate of DKL. Third, we could use some of the data for updating our beliefs, and the rest of it for evaluating DKL (this option is called cross validation). Or finally, we could try to find some way to approximate the amount of bias introduced by a given update of beliefs to our estimate of DKL.

Amazingly, we actually know precisely how to do this final option! This was the great contribution of the brilliant Japanese statistician Hirotugu Akaike. The equation he derived when trying to quantify the degree of bias is called the Akaike information criterion.

AIC(Model M) = Number of parameters in M – log P(data | best-fit distribution in M)

The best model in a set is the one with the lowest AIC score. It makes a lot of sense that models with more parameters are penalized; models with more tweakable parameters are like students that are better at memorizing an answer key to their test.

Can we do any better than AIC? Yes, in fact! For small data sets, a better measure is AICc, which adds a correction term that scales like 1/N.

So to summarize everything in a few sentences:

  1. DKL is a measure of how far your model of reality is from the truth.
  2. DKL cannot be calculated without prior knowledge of the truth.
  3. However, we can use data to approximate DKL, by calculating log loss.
  4. Unfortunately, if we are also using the data to update our beliefs, log loss is a biased estimator of DKL.
  5. We can approximate the bias and negate it using the Akaike information criterion, AIC.
  6. An even better approximation for small data sets is AICc.

More model selection visualizations

I added cross validation to my model selection program, and ooh boy do I now understand why people want to find more efficient alternatives.

Reliable CV calculations end up making the program run orders of magnitude slower, as they require re-fitting the curve for every partition of your data and this is the most resource-intensive part of the process. While it’s beautiful in theory, this is a major set-back in practice.

For a bunch of different true distributions I tried, I found that CV pretty much always lines up with all of the other methods, just with a more “jagged” curve and a steeper complexity penalty. This similarity looks like a point in favor of the other methods, assuming that CV does a good job of measuring predictive power. (And for those interested in the technical details, the cross validation procedure implemented here is leave-one-out CV.)
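
For those curious what the expensive part looks like, here’s roughly the shape of a leave-one-out CV calculation for the polynomial models (a sketch of mine, not the actual program; the true curve here is just a stand-in parabola). The inner polyfit call runs once per held-out data point, for every candidate model, which is where all the time goes.

    import numpy as np

    def loo_cv_score(x, y, degree):
        # Leave-one-out CV: refit the polynomial once per held-out point.
        errors = []
        for i in range(len(x)):
            mask = np.arange(len(x)) != i
            coeffs = np.polyfit(x[mask], y[mask], degree)
            prediction = np.polyval(coeffs, x[i])
            errors.append((y[i] - prediction) ** 2)
        return np.mean(errors)

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 100)
    y = x ** 2 + rng.normal(scale=1.0, size=x.size)   # stand-in true curve

    for degree in range(1, 10):
        print(degree, loo_cv_score(x, y, degree))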

I also added an explicit calculation of DKL, which should help to give some idea of a benchmark against which we can measure all of these methods. Anyway, I have some more images!

True curve = e^(-x/3) – e^(x/3)

N = 100 data points

[Plot: exps-wcv.png]

N = 100 data points, zoomed in a bit

[Plot: exps-wcv-zoom.png]

For smaller data sets, you can see that AICc tracks DKL much more closely than any other technique (which is of course the entire purpose of AICc).

N = 25

[Plot: exps-small-n.png]

N = 25, zoomed

[Plot: exps-small-n-zoom.png]

N = 25, super zoom

[Plot: exps-small-n-bigzoom.png]

Interestingly, you start to see BIC really suffering relative to the other methods, beginning to overfit the data. This is counterintuitive; BIC is supposed to be the method that penalizes complexity excessively and underfits the data. One relevant detail: in this program, I use the form of BIC that doesn’t take the large-N approximation.

BIC (small N) = k log(N/2π) – 2 log L
BIC (ordinary) = k log(N) – 2 log L

When I use the approximation instead, the problem disappears. Of course, this is not too great a solution; why should the large N approximation be necessary for fixing BIC specifically when N is small?

 (Edit: after many more runs, it’s looking more like it may have just been a random accident that BIC overfit in the first few runs)

Just for the heck of it, I also decided to torture my polynomials a little bit by making them try to fit the function 1/x. I got dismal results no matter how I tried to tweak the hyper-parameters, which is, well, pretty much what I expected (1/x has no Taylor expansion around 0, for one thing).

More surprisingly, I tried fitting a simple Gaussian curve and again got fairly bad results. The methods disagreed with one another a lot of the time (although weirdly, AICc and BIC seemed to generally be in agreement), and gave curves that were overfitting the data a bit. The part that seems hardest for a polynomial to nail down is the flattened ends of the Gaussian curve.

True curve = 40 exp(-x²/2), N = 100 data points

[Plot: gaussian-best.png]

And zoomed in…

[Plot: gaussian-best-zoom.png]

If the jaggedness of the cross validation score is not purely an artifact of random fluctuations in the data, I don’t really get it. Why should, for example, a 27-parameter model roughly equal a 25-parameter model in predictive power, but a 26-parameter model be significantly worse than both?

 

On complexity and information geometry

AIC and BIC, two of the most important model selection criteria, both penalize overfitting by looking at the number of parameters in a model. While this is a good first approximation to quantifying overfitting potential, it is overly simplistic in a few ways.

Here’s a simple example:

ℳ₁ = { y(x) = ax | a ∈ [0, 1] }
ℳ₂ = { y(x) = ax | a ∈ [0, 10] }

ℳ₁ is contained within ℳ₂, so we expect that it should be strictly less complex, with lesser overfitting potential, than ℳ₂. But both have the same number of parameters! So the difference between the two will be invisible to AIC and BIC (as well as all other model selection techniques that only make reference to the number of parameters in the model).

A more subtle approach to quantifying complexity and overfitting potential is given by the Fisher information metric. The idea is to define a geometric space over all possible values of the parameter, where distances in this space correspond to information gaps between different distributions.

Imagine a simple two-parameter model:

ℳ = { P(x | a, b) | a, b ∈ ℝ }

We can talk about the information distance between any particular distribution in this model and the true distribution by referencing the Kullback-Leibler divergence:

DKL = ∫ Ptrue(x) log( Ptrue(x) / P(x | a, b)) dx

The optimal distribution in the space of parameters is the distribution for which this quantity is minimized. We can find this by taking the derivative with respect to the parameters and setting it equal to zero:

∂a [DKL] = ∂a [ ∫ Ptrue(x) log( Ptrue(x) / P(x | a, b)) dx ]
= ∂a [ – ∫ Ptrue(x) log(P(x | a, b)) dx ]
= – ∫ Ptrue(x) ∂a log(P(x | a, b)) dx
= E[ – ∂a log(P(x | a, b)) ]

∂a [DKL] = E[ – ∂a log(P(x | a, b)) ]   (and similarly for ∂b)

We can form a geometric space out of DKL by looking at its second derivatives:

∂aa [DKL] = E[ – ∂aa log(P(x | a, b)) ] = gaa
∂ab [DKL] = E[ – ∂ab log(P(x | a, b)) ] = gab
∂ba [DKL] = E[ – ∂ba log(P(x | a, b)) ] = gba
∂bb [DKL] = E[ – ∂bb log(P(x | a, b)) ] = gbb

These four values make up what is called the Fisher information metric. Now, the quantity

ds² = gaa da² + 2 gab da db + gbb db²

defines the information distance between two infinitesimally close distributions. We now have a geometric space, where each point corresponds to a particular probability distribution, and distances correspond to information gaps. All of the nice geometric properties of this space can be discovered just by studying the metric ds². For instance, the volume of any region of this space is given by:

dV = √(det(g)) da db

Now, we are able to see the relevance of all of this to the question of model complexity and overfitting potential. Any model corresponds to some region in this space of distributions, and the complexity of the model can be measured by the volume it occupies in the space defined by the Fisher information metric.

This solves the problem that arose with the simple example that we started with. If one model is a subset of another, then the smaller model will be literally enclosed in the parameter space by the larger one. Clearly, then, the volume of the larger model will be greater, so it will be penalized with a higher complexity.

In other words, volumes in the geometric space defined by Fisher information metric give us a good way to talk about model complexity, in terms of the total information content of the model.

Here’s a quick example:

ℳ₁ = { y(x) = ax + b + U | a ∈ [0, 1], b ∈ [0, 10], U a Gaussian error term }
ℳ₂ = { y(x) = ax + b + U | a ∈ [-1, 1], b ∈ [0, 100], U a Gaussian error term }

Our two models are represented by a set of gaussians centered around the line ax + b. Both of these models have the same information geometry, since they only differ in the domain of their parameters:

gaa = ∂aa [DKL] = ⅓
gab = ∂ab [DKL] = ½
gba = ∂ba [DKL] = ½
gbb = ∂bb [DKL] = 1

From this, we can define lengths and volumes in the space:

ds² = ⅓ da² + da db + db²
dV = √(det(g)) da db = da db / 2√3

Now we can explicitly compare the complexities of ℳ₁ and ℳ₂:

C(ℳ₁) = 5/√3 ≈ 2.9
C(ℳ₂) = 100/√3 ≈ 57.7

In the end, we find that C(ℳ₂) > C(ℳ₁) by a factor of 20. This is to be expected; Model 2 has a 20 times larger range of parameters to search, and is thus 20 times more permissive than Model 1.
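
Here’s a quick numerical check of those two volumes (my own sketch; it just plugs in the metric components listed above, so it inherits whatever noise model and x-distribution the post is implicitly assuming):

    import numpy as np

    # Fisher metric components for the linear model, as given above.
    g = np.array([[1/3, 1/2],
                  [1/2, 1.0]])
    density = np.sqrt(np.linalg.det(g))   # √(det g) = 1 / (2√3) ≈ 0.289

    def complexity(a_range, b_range):
        # The metric is constant, so the volume is just the density times
        # the area of the parameter box.
        return density * (a_range[1] - a_range[0]) * (b_range[1] - b_range[0])

    print(complexity((0, 1), (0, 10)))     # C(M1) ≈ 2.89
    print(complexity((-1, 1), (0, 100)))   # C(M2) ≈ 57.7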

While the conclusion is fairly obvious here, using information geometry allows you to answer questions that are far from obvious. For example, how would you compare the following two models? (For simplicity, let’s suppose that the data is generated according to the line y(x) = 1, with x ∈ [0, 1].)

ℳ₃ = { y(x) = ax + b | a ∈ [2, 10], b ∈ [0, 2] }
ℳ₄ = { y(x) = aeᵇˣ | a ∈ [2, 10], b ∈ [0, 2] }

They both have two parameters, but express very different hypotheses about the underlying data. Intuitively, ℳ₄ feels more complex, but how do we quantify this? It turns out that ℳ₄ has the following Fisher information metric:

gaa = ∂aa [DKL] = (2b + 1)⁻¹
gab = ∂ab [DKL] = – (2b + 1)⁻²
gba = ∂ba [DKL] = – (2b + 1)⁻²
gbb = ∂bb [DKL] = 4a (2b + 1)⁻³ – 2 (b + 1)⁻³

Thus,

dV = (2b + 1)⁻² √( 4a + 1 – (2b+1)³/(b+1)³ ) da db

Combining this with the previously found volume element for ℳ₃, we find the following:

C(ℳ₃) ≈ 4.62
C(ℳ₄) ≈ 14.92

This tells us that ℳ₄ contains about 3 times as much information as ℳ₃, precisely quantifying our intuition about the relative complexity of these models.

Formalizing this as a model selection procedure, we get the Fisher information approximation (FIA).

FIA  = – log L + k/2 log(N/2π) + log(Volume in Fisher information space)
BIC  = – log L + k/2 log(N/2π)
AIC  = – log L + k
AICc = – log L + k + k ∙ (k+1)/(N – k – 1)

(In each formula, the – log L term measures goodness of fit, the k-dependent terms penalize dimensionality, and FIA’s final term penalizes complexity as measured in Fisher information space.)

 

Gibbs’ inequality

As a quick reminder from previous posts, we can define the surprise in an occurrence of an event E with probability P(E) as:

Sur(P(E)) = log(1/P) = – log(P).

I’ve discussed why this definition makes sense here. Now, with this definition, we can talk about expected surprise; in general, the surprise that somebody with distribution Q would expect somebody with distribution P to have is:

EQ[Sur(P)] = ∫ – Q log(P) dE

This integral is taken over all possible events. A special case of it is entropy, which is somebody’s own expected surprise. This corresponds to the intuitive notion of uncertainty:

Entropy = EP[Sur(P)] = ∫ – P log(P) dE

The actual average surprise for somebody with distribution P is:

Actual average surprise = ∫ – Ptrue log(P) dE

Here we are using the idea of a true probability distribution, which corresponds to the distribution over possible events that best describes the frequencies of each event. And finally, the “gap” in average surprise between P and Q is:

∫ Ptrue log(P/Q) dE

Gibbs’ inequality says the following:

For any two different probability distributions P and Q:
EP[Sur(P)] < EP[Sur(Q)]

This means that out of all possible ways of distributing your credences, you should always expect that your own distribution is the least surprising.

In other words, you should always expect to be less surprised than everybody else.
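
If you want to see the inequality in action, here’s a tiny numerical check (mine): it draws random pairs of distributions P and Q and confirms that the P-expected surprise of P never exceeds the P-expected surprise of Q.

    import numpy as np

    rng = np.random.default_rng(0)

    def expected_surprise(p, q):
        # Surprise of someone holding distribution q, averaged under p.
        return -np.sum(p * np.log2(q))

    for _ in range(5):
        p = rng.dirichlet(np.ones(4))   # a random distribution over 4 outcomes
        q = rng.dirichlet(np.ones(4))   # somebody else's random distribution
        assert expected_surprise(p, p) <= expected_surprise(p, q)
        print(round(expected_surprise(p, p), 3), round(expected_surprise(p, q), 3))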

This is really unintuitive, and I’m not sure how to make sense of it. Say that you think that a coin will either land heads or tails, with probability 50% for each. In addition, you are with somebody (who we’ll call A) that you know has perfect information about how the coin will land.

Does it make sense to say that you expect them to be more surprised about the result of the coin flip than you will be? This seems hardly intuitive. One potential way out of this is that the statement “A knows exactly how the coin will land” has not actually been included in your probability distribution, so it isn’t fair to stipulate that you know this. One way to try to add in this information is to model their knowledge by something like “There’s a 50% chance that A’s distribution is 100% H, and a 50% chance that it is 100% T.”

The problem is that when you average over these distributions, you get a new distribution that is identical to your own. This is clearly not capturing the state of knowledge in question.

Another possibility is that we should not be thinking about the expected surprise of people, but solely about distributions. In other words, Gibbs’ inequality tells us that you will expect a higher average surprise for any distribution that you are handed, than for your own distribution. This can only be translated into statements about people’s average surprise when their knowledge can be directly translated into a distribution.

Some simple visual comparisons of model selection techniques

The goal of model selection is to find a model that provides the best fit to a set of data, without overfitting the data. Different criteria for assessing the degree of overfitting abound; typically they make reference to the number of parameters a model includes. Too few parameters, and your model will not be flexible enough to fit the data. Too many, and your model will be too flexible and end up overfitting the data.

I made a little program that calculates and plots different measures of model quality as a function of the number of parameters in the model, for any choice of true distribution. The models used in this program are all just polynomial fits; the kth model is the set of all (k-1)-order polynomials. I’ll show off some of the resulting plots here!
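
For reference, here’s a stripped-down sketch of the kind of calculation involved (mine, not the actual program; it assumes unit-variance Gaussian noise and uses the same rescaled forms of AIC, AICc, and BIC that appear in the posts above). For each model it fits the best (k-1)-order polynomial and then computes the criteria from the resulting log likelihood.

    import numpy as np

    def criteria(x, y, max_params=10, sigma=1.0):
        # kth model = polynomials of order k-1; Gaussian noise with known sigma.
        n = len(x)
        results = []
        for k in range(1, max_params + 1):
            coeffs = np.polyfit(x, y, k - 1)
            residuals = y - np.polyval(coeffs, x)
            log_l = np.sum(-0.5 * (residuals / sigma) ** 2
                           - np.log(sigma * np.sqrt(2 * np.pi)))
            aic = -log_l + k
            aicc = aic + k * (k + 1) / (n - k - 1)
            bic = -log_l + 0.5 * k * np.log(n / (2 * np.pi))
            results.append((k, aic, aicc, bic))
        return results

    rng = np.random.default_rng(0)
    x = np.linspace(-5, 5, 100)
    y = x ** 2 + rng.normal(scale=1.0, size=x.size)   # true distribution: a parabola

    for k, aic, aicc, bic in criteria(x, y):
        print(k, round(aic, 1), round(aicc, 1), round(bic, 1))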

***

True distribution: y(x) = x²

10 data points

[Plot: Parabola N=10]

100 data points

[Plot: parabola-n1001.png]

1000 data points

[Plot: parabola-n10001.png]

Some things to notice:

  • All three of BIC, AIC, and AICc give the same (and correct) answer, even for a data set of only 10 points.
  • The difference between AICc and AIC becomes pretty much irrelevant for large enough data sets.
  • BIC always penalizes complexity more than AIC.
  • The complexity penalty is pretty nearly matched by the improvement in fit for large numbers of parameters, but slightly outweighs it.

True distribution: y = x³/10 + x² – 10x

10 data points

[Plot: 3rd-order-n10.png]

100 data points

[Plot: 3rd-order-n100.png]

1000 data points

[Plot: 3rd-order-n1000.png]

Now let’s look at an example where the true distribution is not actually in any of the models.

True distribution: y = e^(-x/2)

20 data points

[Plot: exp-n20.png]

100 data points

[Plot: exp-n100.png]

1000 data points

[Plot: exp-n1000.png]

Here we begin to see some disagreement between the different methods! For N=20, AICc would have recommended the optimal model as k = 4 (a third-order polynomial), while AIC and BIC both recommended k = 5. In addition, we see that the same method gives different answers as the number of data points rises (5 to 7 to 6 parameters).

Regardless, we still see that all three methods succeed in preventing overfitting, and do a fairly good job at catching the underlying trend in the data. However, the question of which model is optimal becomes a little more ambiguous.

One final example, which we’ll make especially difficult for a polynomial model:

True distribution: y = 10*sin(x)

N = 20

[Plot: sine-n20.png]

N = 100

[Plot: sine-n100.png]

N = 1000

[Plot: sine-n1000.png]

Again we see that all of the model selection criteria give similar answers, and the curves generated nicely align with the true curve. It looks like 11th- to 13th-order polynomials do a good job of modeling a sine wave on this scale.

It’s interesting to watch the jagged descent of the criteria as you approach the optimal number of parameters from below. For some reason, it looks like adding a single extra parameter is generally unhelpful for this problem, but adding two is helpful. I suspect that this is related to the fact that sin(x) is an odd function, so adding an even function with a tweakable parameter out front doesn’t do much for your model fit.

By the end, we see the optimal curve beautifully aligning with the true curve, not getting distracted by the noise in the data. Seeing these plots helps give a bit of an intuition about how different techniques penalize complexity and reward goodness of fit to data. I want to eventually add cross validation scores in to these plots as well, to see how they compare to the others.