Regularization as approximately Bayesian inference

In an earlier post, I showed how the procedure of minimizing sum of squares falls out of regular old frequentist inference. This time I’ll do something similar, but with regularization and Bayesian inference.

Regularization is essentially a technique in which you evaluate models in terms of not just their fit to the data, but also the values of the parameters involved. For instance, say you are modeling some data with a second-order polynomial.

M = { f(x) = a + bx + cx2 | a, b, c ∈ R }
D = { (x1, y1), …, (xN, yN) }

We can evaluate our model’s fit to the data with SOS:

SOS = ∑ (yn – f(xn))2

Minimizing SOS gives us the frequentist answer – the answer that best fits the data. But what if we suspect that the values of a, b, and c are probably small? In other words, what if we have an informative prior about the parameter values? Then we can explicitly add on a penalty term that increases the SOS, such as…

SOS with L1 regularization = k1 |a| + k2 |b| + k3 |c| + ∑ (yn – f(xn))2

The constants k1, k2, and k3 determine how much we will penalize each parameter a, b, and c. This is not the only form of regularization we could use, we could also use the L2 norm:

SOS with L2 regularization = k1 a2 + k2 b2 + k3 c2 + ∑ (yn – f(xn))2

In both of these cases, the regularized SOS term grows as the values of the parameters grow. This makes the optimal choice of curve take into account not only the fit to data, but the desired size of the parameters.

You might, having heard of this procedure, already suspect it of having a Bayesian bent. The notion of penalizing large parameter values on the basis of a prior suspicion that the values should be small sounds a lot like what the Bayesian would call “low priors on high parameter values.”

We’ll now make the connection explicit.

Frequentist inference tries to select the theory that makes the data most likely. Bayesian inference tries to select the theory that is made most likely by the data. I.e. frequentists choose f to maximize P(D | f), and Bayesians choose f to maximize P(f | D).

Assessing P(f | D) requires us to have a prior over our set of functions f, which we’ll call π(f).

P(f | D) = P(D | f) π(f) / P(D)

We take a logarithm to make everything easier:

log P(f | D) = log P(D | f) + log π(f) – log P(D)

We already evaluated P(D | f) in the last post, so we’ll just plug it in right away.

log P(f | D) = – SOS/2σ2 – N/2 log(2πσ2)) + log π(f) – log P(D)

Since we are maximizing with respect to f, two of these terms will fall away.

log P(f | D) = – SOS/2σ2 + log π(f) + constant

Now we just have to decide on the form of π(f). Since the functional form of f is determined by the values of the parameters {a, b, c}, π(f) = π(a, b, c). One plausible choice is a Gaussian centered around the values of each parameter:

π(f) = exp( -a2 / 2σa2 ) exp( -b2 / 2σb2 ) exp( -c2 / 2σc2 ) / √(8π3σa2σb2σc2)
log π(f) = -a2/2σa2 – b2/2σb2 – c2/2σc2 – ½ log(8π3σa2σb2σc2)

Now, throwing out terms that don’t depend on the values of the parameters, we find:

log P(f | D) = – SOS/2σ2 -a2/2σa2 – b2/2σb2 – c2/2σc2 + constant

This is exactly L2 regularization, where each kn = σ2n2. In other words, L2 regularization is Bayesian inference with Gaussian priors over the parameters!

What priors does L1 regularization correspond to?

log π(f) = -k1 |a| – k2 |b| – k3 |c|
π(a, b, c) = e-k1|a| e-k2|b| e-k3|a|

I.e. the L1 regularization prior is an exponential distribution.

This can be easily extended to any regularization technique. This is a way to get some insight into what your favorite regularization methods mean. They are ultimately to be cashed out in the form of your prior knowledge of the parameters!

Leave a Reply