A note of ambiguity regarding model selection

A model is a family of probability distributions over a set of observable variables X, parameterized by some set of parameters a1, a2, …, ak.

M = { p(X | a1, a2, …, ak) : a1, a2, …, ak ranging over all allowed values }

Models arise naturally when we are unsure about some details of a distribution, but know its general form. For example, maybe we know that the positions of particles in a gas cloud are normally distributed, but don’t know the degree of spread of this cloud or the location of its center. Then we would want to represent the positions of our particles by a Gaussian distribution over all possible positions, parameterized by the mean and variance of the distribution.
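As a purely illustrative sketch, here is what this Gaussian model might look like in code; the specific means and spreads, and the use of NumPy and SciPy, are my own invented choices rather than anything forced by the setup.

```python
import numpy as np
from scipy.stats import norm

# A "model" here is a family of distributions over particle positions,
# indexed by two parameters: the cloud's center (mu) and spread (sigma).
def gaussian_model(mu, sigma):
    """Return one member of the model family: N(mu, sigma^2)."""
    return norm(loc=mu, scale=sigma)

# Different parameter values pick out different members of the family.
cloud_a = gaussian_model(mu=0.0, sigma=1.0)
cloud_b = gaussian_model(mu=5.0, sigma=0.3)

# Each member assigns a probability density to any observed position x.
print(cloud_a.pdf(0.0))   # density of a particle at x = 0 under cloud_a
print(cloud_b.pdf(0.0))   # much smaller: cloud_b is centered far from 0
```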

Given this model, we can now make observations of particle positions in order to gain information about the spread and center of the gas cloud. In other words, we have split our epistemological task into two questions:

  1. What model is best? (Model selection)
  2. What values of the parameters are best? (Parameter selection)

Parameter selection is generally accomplished by ordinary accommodation procedures. Broadly, these fall into two categories, both sketched in code below:

  • Likelihood maximization (which parameters make the data most likely?)
  • Posterior maximization (which parameters are made most likely by the data?)
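Here is a minimal sketch of both procedures for the Gaussian gas-cloud model above. The simulated data, the choice of scipy.optimize, and the standard normal prior in the posterior-maximization version are all invented for illustration.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=200)  # simulated particle positions

# Likelihood maximization: find (mu, sigma) that make the observed data most likely.
def negative_log_likelihood(params, x):
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

mle = minimize(negative_log_likelihood, x0=[0.0, 0.0], args=(data,))
print(mle.x[0], np.exp(mle.x[1]))     # should land near the true values (2.0, 1.5)

# Posterior maximization (MAP) just adds a log-prior term to the objective,
# here an arbitrary standard normal prior on mu, chosen only for illustration.
def negative_log_posterior(params, x):
    mu, log_sigma = params
    return negative_log_likelihood(params, x) - norm.logpdf(mu, loc=0.0, scale=1.0)

map_fit = minimize(negative_log_posterior, x0=[0.0, 0.0], args=(data,))
print(map_fit.x[0], np.exp(map_fit.x[1]))  # pulled slightly toward the prior mean of 0
```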

Model selection is where we correct for overfitting and prioritize simplicity. Two common optimization goals, both illustrated in the sketch below, are:

  • Minimize information divergence (which model is closest to the truth in information theoretic terms?)
  • Maximize predictive accuracy (which model does the best job at predicting the next data point?)
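These goals correspond to familiar criteria: AIC is one standard estimate of information divergence, and cross-validation is one standard estimate of predictive accuracy. The sketch below scores two candidate models of the same data on both counts; the data and the pair of candidate models are again invented for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=0.3, scale=1.0, size=100)

# Model 1: mean fixed at zero, one free parameter (sigma).
# Model 2: both mean and sigma free, two parameters.
def aic_score(x, fix_mean_at_zero):
    mu = 0.0 if fix_mean_at_zero else x.mean()
    sigma = np.sqrt(np.mean((x - mu) ** 2))      # MLE of sigma given mu
    log_lik = np.sum(norm.logpdf(x, loc=mu, scale=sigma))
    k = 1 if fix_mean_at_zero else 2             # number of fitted parameters
    return 2 * k - 2 * log_lik                   # lower AIC ~ smaller estimated divergence

# Leave-one-out cross-validation: average predictive log-density on held-out points.
def loo_predictive(x, fix_mean_at_zero):
    scores = []
    for i in range(len(x)):
        train = np.delete(x, i)
        mu = 0.0 if fix_mean_at_zero else train.mean()
        sigma = np.sqrt(np.mean((train - mu) ** 2))
        scores.append(norm.logpdf(x[i], loc=mu, scale=sigma))
    return np.mean(scores)                       # higher = better predictions

print("AIC (fixed mean):", aic_score(data, True))
print("AIC (free mean): ", aic_score(data, False))
print("LOO (fixed mean):", loo_predictive(data, True))
print("LOO (free mean): ", loo_predictive(data, False))
```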

So to summarize, we decide what to believe by (1) selecting a set of models, (2) optimizing each model to fit our data, and (3) comparing our optimized models using model selection criteria.

Now, while (2) and (3) are perfectly clear to me, (1) seems much less so. How do we decide what set of models we are working with? While in practice this might be easily solved by just using a standard set of models, it seems theoretically troubling. One problem is that the space of possible models is incredibly large and can be divided up in many different ways.

Another problem is that two people looking at exactly the same hypotheses might appear to disagree about what models they are using. Let’s look at an example. Person A and Person B are both looking at the same hypothesis set: the set of all lines through the origin with Gaussian measurement error. But they describe their epistemic frameworks as follows:

Person A: I have a single model, defined by a single parameter: M = { y = ax + U | a ∈ ℝ, U a Gaussian error term }.

Person B: I have an uncountable infinity of models, each defined by zero parameters. Labeling each model with an index a ∈ ℝ, I can describe the ath model: Ma = { y = ax + U | U a Gaussian error term }.

The difference between these two is clearly purely semantic; both are looking at the same set of hypotheses, but one is considering them to be contained in a single overarching model, and the other is looking at them each individually.

This becomes a problem when we consider that model selection techniques are sensitive to the number of parameters in a model: more parameters mean a larger penalty for overfitting. So while Person A will be penalized for having one tweakable parameter, Person B will be free from any penalty.
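To see the bookkeeping problem concretely, here is a sketch using AIC, one standard parameter-counting criterion; the data, the unit error variance, and the grid of slopes that Person B searches over are all invented for illustration. Both descriptions endorse essentially the same best-fit line, but Person A's description incurs a parameter penalty that Person B's does not.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 1.7 * x + rng.normal(scale=1.0, size=x.size)   # invented data from a line through the origin

def log_likelihood(slope):
    # Gaussian measurement error with (for simplicity) known unit variance.
    return np.sum(norm.logpdf(y - slope * x, scale=1.0))

# Person A: one model with one adjustable parameter (the slope). AIC penalizes k = 1.
best = minimize_scalar(lambda a: -log_likelihood(a))
aic_person_a = 2 * 1 - 2 * (-best.fun)

# Person B: a separate zero-parameter model for every slope. Each model's AIC
# penalizes k = 0, and B simply reports the best-scoring model on a grid of slopes.
slopes = np.linspace(0, 3, 301)
aic_person_b = min(2 * 0 - 2 * log_likelihood(a) for a in slopes)

print("Person A's AIC:     ", aic_person_a)   # roughly 2 units worse than B's best model...
print("Person B's best AIC:", aic_person_b)   # ...despite describing the same hypotheses
```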

The response we want to give here is that Person B is really working with a single model in all but name. What we really care about is an agent’s ability to search a large space of hypotheses, an excess of flexibility that lets them not only identify trends in the data but also track its noise. Both Person A and Person B have equal flexibility in this regard, and so should be penalized equally.

We could try to implement this formally by collapsing large collections of models into as few models as possible. The problem is that this reduction has no natural stopping point: any set of models can in principle be merged into a single larger model simply by treating the choice of model as one more adjustable parameter.
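As a small illustration of why the reduction never bottoms out, the sketch below merges two invented model families into one larger model simply by promoting the choice of model to an extra parameter.

```python
import numpy as np

# Two "separate" models of y as a function of x, each with its own parameters.
def linear_model(x, a):
    return a * x

def quadratic_model(x, a, b):
    return a * x + b * x ** 2

# One "larger" model: the choice between them is just another adjustable
# parameter (here a discrete switch), so the two-model set collapses into one.
def super_model(x, which, a, b=0.0):
    if which == "linear":
        return linear_model(x, a)
    return quadratic_model(x, a, b)

x = np.array([1.0, 2.0, 3.0])
print(super_model(x, "linear", a=2.0))            # behaves like the first model
print(super_model(x, "quadratic", a=2.0, b=0.5))  # behaves like the second
```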

In general, the problem of how to clearly distinguish between parameters and models seems like a fairly serious issue for any epistemology that fundamentally relies on this distinction.
