I found a nice example of an application of model selection techniques in this paper.

The history of astronomy provides one of the earliest examples of the problem at hand. In Ptolemy’s geocentric astronomy, the relative motion of the earth and the sun is independently replicated within the model for each planet, thereby unnecessarily adding to the number of adjustable parameters in his system. Copernicus’s major innovation was to decompose the apparent motion of the planets into their individual motions around the sun together with a common sun-earth component, thereby reducing the number of adjustable parameters. At the end of the non-technical exposition of his programme in De Revolutionibus, Copernicus repeatedly traces the weakness of Ptolemy’s astronomy back to its failure to impose any principled constraints on the separate planetary models.

In a now famous passage, Kuhn claims that the unification or harmony of Copernicus’ system appeals to an aesthetic sense, and that alone. Many philosophers of science have resisted Kuhn’s analysis, but none has made a convincing reply. We present the maximization of estimated predictive accuracy as the rationale for accepting the Copernican model over its Ptolemaic rival. For example, if each additional epicycle is characterized by 4 adjustable parameters, then the likelihood of the best basic Ptolemaic model, with just twelve circles, would have to be e^{20} (or more than 485 million) times the likelihood of its Copernican counterpart with just seven circles for the evidence to favour the Ptolemaic proposal. Yet it is generally agreed that these basic models had about the same degree of fit with the data known at the time. The advantage of the Copernican model can hardly be characterized as merely aesthetic; it is observation, not a prioristic preference, that drives our choice of theory in this instance.

Forster

*How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions*
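The arithmetic behind the quoted figure is easy to check. This is a minimal sketch, assuming the convention used in the quote (a penalty of one unit of log-likelihood per adjustable parameter); the function name `required_likelihood_ratio` is my own and does not come from the paper.

```python
import math

def required_likelihood_ratio(extra_circles: int, params_per_circle: int = 4) -> float:
    """Likelihood ratio the more complex model needs just to break even,
    assuming a penalty of one log-likelihood unit per extra parameter."""
    extra_params = extra_circles * params_per_circle
    return math.exp(extra_params)

# Basic Ptolemaic model: 12 circles; basic Copernican model: 7 circles.
ratio = required_likelihood_ratio(12 - 7)
print(f"{ratio:,.0f}")
```

Five extra circles at four parameters each give 20 extra parameters, and e^{20} comes out a little over 485 million, matching the quote.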

Looking into this a little, I found on Wikipedia that increasingly complicated epicycle models were apparently developed after Ptolemy.

As a measure of complexity, the number of circles is given as 80 for Ptolemy, versus a mere 34 for Copernicus. The highest number appeared in the Encyclopædia Britannica on Astronomy during the 1960s, in a discussion of King Alfonso X of Castile’s interest in astronomy during the 13th century. (Alfonso is credited with commissioning the Alfonsine Tables.)

By this time each planet had been provided with from 40 to 60 epicycles to represent after a fashion its complex movement among the stars. Amazed at the difficulty of the project, Alfonso is credited with the remark that had he been present at the Creation he might have given excellent advice.

40 epicycles per planet, with five known planets in Ptolemy’s time, and four adjustable parameters per epicycle, gives *800 additional parameters*.

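AIC scores are given (up to a conventional factor of 2) by (# of parameters) – (log of likelihood of evidence), with lower scores being better. Writing k for a model’s number of adjustable parameters and L for the likelihood of the evidence under that model, we have:

AIC_{Copernicus} = k_{Copernicus} – ln L_{Copernicus}

AIC_{epicycles} = (k_{Copernicus} + 800) – ln L_{epicycles}

AIC_{epicycles} < AIC_{Copernicus} only if L_{epicycles} / L_{Copernicus} > e^{800}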

For the epicycle model to score even as well as the Copernican one under AIC, the evidence would have to be *at least e^{800} times* more likely under epicycles than under Copernicus. That factor is roughly 2.7 × 10^{347}, a number with 348 digits. This is a much clearer argument for the superiority of heliocentrism over geocentrism than a vague appeal to geocentrism having the lower prior.
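A factor this large cannot be evaluated directly in double-precision floating point (in Python, `math.exp(800)` raises `OverflowError`), so a quick check has to go through logarithms. A minimal sketch, with variable names of my own choosing:

```python
import math

epicycles_per_planet = 40   # lower end of the 40-60 range quoted above
planets = 5
params_per_epicycle = 4

extra_params = epicycles_per_planet * planets * params_per_epicycle  # 800

# e^800 = 10^(800 / ln 10); split the base-10 exponent into integer and fraction.
log10_ratio = extra_params / math.log(10)
exponent = math.floor(log10_ratio)
mantissa = 10 ** (log10_ratio - exponent)

print(f"e^{extra_params} is about {mantissa:.1f} x 10^{exponent}")
```

Using the upper figure of 60 epicycles per planet instead would raise the count to 1200 extra parameters and make the required factor larger still.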

I like this as a nice simple example of how AIC can be practically applied. It’s also interesting to see that the type of reasoning formalized by AIC is fairly intuitive, and that even scholars in the 1500s regarded excessive model flexibility, in the form of abundant adjustable parameters, as an epistemic failing.

Another example given in the same paper is Newton’s notion of admitting only as many causes as are necessary to explain the data. This is nicely formalized in terms of AIC using causal diagrams; if a model of a variable references more causes of that variable, then that model involves more adjustable parameters. In addition, adding causal dependencies to a causal model adds parameters to the description of the system as a whole.

One way to think about all this is that AIC and other model selection techniques provide a protection against unfalsifiability. A theory with too many tweakable parameters can be adjusted to fit a very wide range of data points, and therefore is harder to find evidence against.
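A toy comparison makes the trade-off concrete. The sketch below is my own illustration, not something from the paper: it assumes Gaussian noise with a known standard deviation, uses made-up measurements, and scores each model as (number of parameters) minus (log-likelihood), with lower being better. The "saturated" model has one free parameter per data point, so it can fit any data set perfectly.

```python
import math

def gaussian_log_likelihood(data, predictions, sigma):
    """Log-likelihood of data given per-point predictions and known noise sigma."""
    n = len(data)
    sum_sq = sum((y - p) ** 2 for y, p in zip(data, predictions))
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sum_sq / (2 * sigma ** 2)

def aic_score(num_params, log_likelihood):
    """Score as parameters minus log-likelihood (lower is better)."""
    return num_params - log_likelihood

# Hypothetical measurements of a single quantity, with assumed noise sigma = 1.
data = [9.1, 10.4, 10.9, 8.7, 10.2, 9.6, 11.3, 9.8]
sigma = 1.0

# Simple model: one adjustable parameter (a common mean), fitted by maximum likelihood.
mean = sum(data) / len(data)
simple = aic_score(1, gaussian_log_likelihood(data, [mean] * len(data), sigma))

# Saturated model: one adjustable parameter per data point, so it fits perfectly.
saturated = aic_score(len(data), gaussian_log_likelihood(data, data, sigma))

print(f"simple:    {simple:.2f}")
print(f"saturated: {saturated:.2f}")
```

With these numbers the saturated model gains about 2.7 units of log-likelihood from its perfect fit but pays for 7 extra parameters, so the one-parameter model ends up with the lower (better) score.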

I recall a discussion between two physicists somewhere about whether Newton’s famous equation *F* = *ma* counts as an unfalsifiable theory. The idea is just that for basically *any* interaction between particles, you could find some function F that makes the equation true. This has the effect of making the statement fairly vacuous, and carrying little content.

What does AIC have to say about this? The family of functions represented by *F* = *ma* is:

**ℱ** = { *F* = *ma* : *F* any function of the coordinates of the system }

How many parameters does this model have? Well, the ‘tweakable parameter’ lives inside an infinite-dimensional Hilbert space of functions, suggesting that the number of parameters is infinite! If this is right, then the overfitting penalty on Newton’s second law is infinitely large and should outweigh any amount of evidence that could support it. This is actually not too crazy; if a model can accommodate any data set, then the model is infinitely weak.

One possible response is that the equation F = ma is meant to be a definitional statement, rather than a claim about the laws of physics. This seems wrong to me for several reasons, the most important of which is that it is *not the case* that any set of laws of physics can be framed in terms of Newton’s equation.

Case in point: quantum mechanics. Try as you might, you won’t be able to express quantum happenings as the result of forces causing accelerations according to F = ma. This suggests that F = ma is at least somewhat of a contingent statement, one that is meant to model aspects of reality rather than simply define terms.