AIC and BIC, two of the most important model selection criteria, both penalize overfitting by looking at the number of parameters in a model. While this is a good first approximation to quantifying overfitting potential, it is overly simplistic in a few ways.

Here’s a simple example:

ℳ₁ = { y(x) = ax | a ∈ [0, 1] }

ℳ₂ = { y(x) = ax | a ∈ [0, 10] }

ℳ₁ is contained within ℳ₂, so we expect that it should be strictly less complex, with less overfitting potential, than ℳ₂. But both have the same number of parameters! So the difference between the two will be invisible to AIC and BIC (as well as to every other model selection technique that only makes reference to the number of parameters in the model).
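To see this failure concretely, here is a minimal sketch (not from the original post) that fits y(x) = ax by maximum likelihood under both parameter ranges and evaluates the AIC form used later in this post, – log L + k. The data-generating slope, noise level, and sample size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 0.5 * x + rng.normal(0, 0.1, size=x.shape)  # true slope 0.5 lies inside both ranges

def aic(lo, hi, sigma=0.1):
    # MLE of a for y = ax with known Gaussian noise, clipped to the allowed range
    a_hat = np.clip(np.sum(x * y) / np.sum(x * x), lo, hi)
    log_l = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (y - a_hat * x) ** 2 / (2 * sigma**2))
    return -log_l + 1  # one free parameter, regardless of its allowed range

print(aic(0, 1) == aic(0, 10))  # True: the criterion never sees the bounds
```

Since the maximum-likelihood estimate lands inside both intervals, the clip never binds, the likelihoods coincide, and the two models receive identical scores.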

A more subtle approach to quantifying complexity and overfitting potential is given by the Fisher information metric. The idea is to define a *geometric space* over all possible values of the parameter, where distances in this space correspond to information gaps between different distributions.

Imagine a simple two-parameter model:

ℳ = { P(x | a, b) | a, b ∈ ℝ }

We can talk about the information distance between any particular distribution in this model and the true distribution by referencing the Kullback-Leibler divergence:

D_{KL} = ∫ P_{true}(x) log( P_{true}(x) / P(x | a, b) ) dx

The optimal distribution in the space of parameters is the distribution for which this quantity is minimized. We can find this by taking the derivative with respect to the parameters and setting it equal to zero:

∂_{a}[D_{KL}] = ∂_{a}[ ∫ P_{true}(x) log( P_{true}(x) / P(x | a, b) ) dx ]

= ∂_{a}[ – ∫ P_{true}(x) log(P(x | a, b)) dx ]

= – ∫ P_{true}(x) ∂_{a} log(P(x | a, b)) dx

= E[ – ∂_{a} log(P(x | a, b)) ]

∂_{b}[D_{KL}] = E[ – ∂_{b} log(P(x | a, b)) ]
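As a numerical sanity check that the optimal parameters minimize D_KL, here is a sketch (an illustrative one-parameter Gaussian family, not from the post) where the model is N(a, 1) and the true distribution is N(0, 1); analytically D_KL = a²/2, minimized at the true value a = 0:

```python
import numpy as np

x = np.linspace(-10, 10, 100001)
p_true = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # true distribution N(0, 1)

def kl(a):
    # D_KL between the true distribution and the model N(a, 1)
    p_model = np.exp(-(x - a) ** 2 / 2) / np.sqrt(2 * np.pi)
    return np.trapz(p_true * np.log(p_true / p_model), x)

a_grid = np.linspace(-2, 2, 401)
best = a_grid[np.argmin([kl(a) for a in a_grid])]
print(best)  # ≈ 0, recovering the true distribution
```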

We can form a geometric space out of D_{KL} by looking at its second derivatives:

∂_{aa}[D_{KL}] = E[ – ∂_{aa} log(P(x | a, b)) ] = g_{aa}

∂_{ab}[D_{KL}] = E[ – ∂_{ab} log(P(x | a, b)) ] = g_{ab}

∂_{ba}[D_{KL}] = E[ – ∂_{ba} log(P(x | a, b)) ] = g_{ba}

∂_{bb}[D_{KL}] = E[ – ∂_{bb} log(P(x | a, b)) ] = g_{bb}

These four values make up what is called the *Fisher information metric*. Now, the quantity

ds² = g_{aa }da² + 2 g_{ab }da db + g_{bb }db²

defines the *information distance* between two infinitesimally close distributions. We now have a geometric space, where each point corresponds to a particular probability distribution, and distances correspond to information gaps. All of the nice geometric properties of this space can be discovered just by studying the metric ds². For instance, the volume of any region of this space is given by:

d*V* = √(det(g)) da db

Now, we are able to see the relevance of all of this to the question of model complexity and overfitting potential. Any model corresponds to some region in this space of distributions, and the *complexity* of the model can be measured by the **volume it occupies** in the space defined by the Fisher information metric.

This solves the problem that arose with the simple example that we started with. If one model is a subset of another, then the smaller model will be literally *enclosed* in the parameter space by the larger one. Clearly, then, the volume of the larger model will be greater, so it will be penalized with a higher complexity.

In other words, volumes in the geometric space defined by the Fisher information metric give us a good way to talk about *model complexity*, in terms of the total information content of the model.

Here’s a quick example:

ℳ₁ = { y(x) = ax + b + U | a ∈ [0, 1], b ∈ [0, 10], U a Gaussian error term }

ℳ₂ = { y(x) = ax + b + U | a ∈ [-1, 1], b ∈ [0, 100], U a Gaussian error term }

Our two models are represented by sets of Gaussians centered on the line ax + b. Both models have the same information geometry, since they differ only in the domains of their parameters. Taking x uniform on [0, 1] and unit error variance, the metric entries are:

g_{aa} = ∂_{aa }[D_{KL}] = ⅓

g_{ab} = ∂_{ab }[D_{KL}] = ½

g_{ba} = ∂_{ba }[D_{KL}] = ½

g_{bb} = ∂_{bb }[D_{KL}] = 1

From this, we can define lengths and volumes in the space:

ds² = ⅓ da² + da db + db²

d*V* = √(det(g)) da db = da db / 2√3
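These entries are easy to verify numerically. The sketch below (an illustrative check, assuming unit noise variance and x uniform on [0, 1]) recovers the metric as expectations of x², x, and 1:

```python
import numpy as np

# For y = ax + b + N(0, 1) noise with x uniform on [0, 1]:
# -d_aa log P = x², -d_ab log P = x, -d_bb log P = 1,
# so each metric entry is an expectation over x.
x = np.linspace(0, 1, 100001)
g_aa = np.trapz(x**2, x)             # 1/3
g_ab = np.trapz(x, x)                # 1/2
g_bb = np.trapz(np.ones_like(x), x)  # 1
det_g = g_aa * g_bb - g_ab**2        # 1/12, so sqrt(det g) = 1/(2*sqrt(3))
print(g_aa, g_ab, g_bb, np.sqrt(det_g))
```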

Now we can explicitly compare the complexities of ℳ₁ and ℳ₂:

C(ℳ₁) = 5/√3 ≈ 2.9

C(ℳ₂) = 100/√3 ≈ 57.7

In the end, we find that C(ℳ₂) > C(ℳ₁) by a factor of 20. This is to be expected; Model 2 has a 20 times larger range of parameters to search, and is thus 20 times more permissive than Model 1.
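The arithmetic here is just the (constant) volume density times the area of each parameter rectangle; a quick sketch:

```python
import math

sqrt_det_g = 1 / (2 * math.sqrt(3))  # constant volume density for the linear model

# C(M) = integral of sqrt(det g) da db = sqrt(det g) x (range of a) x (range of b)
C1 = sqrt_det_g * (1 - 0) * (10 - 0)      # a in [0, 1],  b in [0, 10]
C2 = sqrt_det_g * (1 - (-1)) * (100 - 0)  # a in [-1, 1], b in [0, 100]
print(round(C1, 2), round(C2, 2))  # 2.89 57.74
```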

While the conclusion is fairly obvious here, using information geometry allows you to answer questions that are far from obvious. For example, how would you compare the following two models? (For simplicity, let’s suppose that the data is generated according to the line y(x) = 1, with x ∈ [0, 1].)

ℳ₃ = { y(x) = ax + b | a ∈ [2, 10], b ∈ [0, 2] }

ℳ₄ = { y(x) = aeᵇˣ | a ∈ [2, 10], b ∈ [0, 2] }

They both have two parameters, but express very different hypotheses about the underlying data. Intuitively, ℳ₄ *feels* more complex, but how do we quantify this? It turns out that ℳ₄ has the following Fisher information metric:

g_{aa} = ∂_{aa }[D_{KL}] = (2b + 1)^{-1}

g_{ab} = ∂_{ab }[D_{KL}] = – (2b + 1)^{-2}

g_{ba} = ∂_{ba }[D_{KL}] = – (2b + 1)^{-2}

g_{bb} = ∂_{bb }[D_{KL}] = 4a (2b + 1)^{-3} – 2 (b + 1)^{-3}

Thus,

d*V* = (2b + 1)^{-2 }(4a + 1 – (2b+1)^{3}/(b+1)^{3})^{½} da db

Combining this with the previously found volume element for ℳ₃, we find the following:

C(ℳ₃) ≈ 4.62
C(ℳ₄) ≈ 14.92

This tells us that ℳ₄ contains about 3 times as much information as ℳ₃, precisely quantifying our intuition about the relative complexity of these models.
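These numbers can be reproduced by numerically integrating each volume element over the parameter rectangle a ∈ [2, 10], b ∈ [0, 2]. The sketch below uses a simple trapezoidal rule on an arbitrary grid:

```python
import numpy as np

a = np.linspace(2, 10, 801)
b = np.linspace(0, 2, 801)
A, B = np.meshgrid(a, b, indexing="ij")

# Volume element for M4, taken from the formula above
dV = (2*B + 1)**-2 * np.sqrt(4*A + 1 - (2*B + 1)**3 / (B + 1)**3)
C4 = np.trapz(np.trapz(dV, b, axis=1), a)

# Linear model M3 has constant sqrt(det g) = 1/(2*sqrt(3))
C3 = (10 - 2) * (2 - 0) / (2 * np.sqrt(3))
print(round(C3, 2), round(C4, 2))
```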

Formalizing this as a model selection procedure, we get the Fisher information approximation (FIA).

FIA = – log L + k/2 log(N/2π) + log(Volume in Fisher information space)

BIC = – log L + k/2 log(N/2π)

AIC = – log L + k

AICc = – log L + k + k ∙ (k+1)/(N – k – 1)

Reading each formula left to right, the terms are: goodness-of-fit + dimensionality + complexity.
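As a sketch, the four criteria can be written down directly from the formulas above (following the post's own forms, including the N/2π inside the BIC term; the function names are mine):

```python
import math

def fia(neg_log_l, k, n, fisher_volume):
    # -log L + (k/2) log(N / 2pi) + log(Volume in Fisher information space)
    return neg_log_l + 0.5 * k * math.log(n / (2 * math.pi)) + math.log(fisher_volume)

def bic(neg_log_l, k, n):
    return neg_log_l + 0.5 * k * math.log(n / (2 * math.pi))

def aic(neg_log_l, k):
    return neg_log_l + k

def aicc(neg_log_l, k, n):
    return aic(neg_log_l, k) + k * (k + 1) / (n - k - 1)
```

At equal goodness-of-fit and equal k, BIC and AIC cannot separate ℳ₃ from ℳ₄, while FIA prefers ℳ₃ by log(14.92/4.62) ≈ 1.17 nats.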