In an earlier post, I showed how the procedure of minimizing the sum of squares falls out of regular old frequentist inference. This time I’ll do something similar, but with regularization and Bayesian inference.

Regularization is essentially a technique in which you evaluate models not just by their fit to the data, but also by the values of the parameters involved. For instance, say you are modeling some data with a second-order polynomial.

M = { f(x) = a + bx + cx^{2} | a, b, c ∈ R }

D = { (x_{1}, y_{1}), …, (x_{N}, y_{N}) }

We can evaluate our model’s fit to the data with the sum of squares (SOS):

SOS = ∑ (y_{n} – f(x_{n}))^{2}
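To make this concrete, here’s a minimal sketch of the SOS objective for the quadratic model (the data points here are made up purely for illustration):

```python
import numpy as np

# Toy data: x-values and noisy y-values (made up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 7.1, 12.8, 21.0])

def sos(params, x, y):
    """Sum of squared residuals for f(x) = a + b*x + c*x^2."""
    a, b, c = params
    return np.sum((y - (a + b * x + c * x**2)) ** 2)

# SOS for one candidate parameter setting
print(sos([1.0, 1.0, 1.0], x, y))  # ≈ 0.1
```

Minimizing this function over (a, b, c) gives the ordinary least-squares fit.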

Minimizing SOS gives us the frequentist answer – the answer that best fits the data. But what if we suspect that the values of a, b, and c are probably small? In other words, what if we have an informative prior about the parameter values? Then we can explicitly add on a penalty term that increases the SOS, such as…

SOS with L_{1} regularization = k_{1}|a| + k_{2}|b| + k_{3}|c| + ∑ (y_{n} – f(x_{n}))^{2}

The constants k_{1}, k_{2}, and k_{3} determine how much we penalize each of the parameters a, b, and c. This is not the only form of regularization we could use; another common choice is the L_{2} norm:

SOS with L_{2} regularization = k_{1}a^{2} + k_{2}b^{2} + k_{3}c^{2} + ∑ (y_{n} – f(x_{n}))^{2}
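The two penalized objectives are straightforward to write down in code. Here’s a sketch (the toy data and penalty constants are made up for illustration):

```python
import numpy as np

# Toy data (made up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 7.1, 12.8, 21.0])

def sos(params):
    """Sum of squared residuals for f(x) = a + b*x + c*x^2."""
    a, b, c = params
    return np.sum((y - (a + b * x + c * x**2)) ** 2)

def sos_l1(params, k):
    """SOS plus an L1 penalty: k1*|a| + k2*|b| + k3*|c|."""
    return sos(params) + np.dot(k, np.abs(params))

def sos_l2(params, k):
    """SOS plus an L2 penalty: k1*a^2 + k2*b^2 + k3*c^2."""
    return sos(params) + np.dot(k, np.square(params))
```

Larger k’s pull the minimizer harder toward zero; with k = 0 both reduce to the plain SOS fit.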

You might, having heard of this procedure, already suspect it of having a Bayesian bent. The notion of penalizing large parameter values on the basis of a prior suspicion that the values should be small sounds a lot like what the Bayesian would call “low priors on high parameter values.”

We’ll now make the connection explicit.

Frequentist inference tries to select the theory that makes the data most likely. *Bayesian* inference tries to select the theory that is *made most likely* by the data. I.e. frequentists choose f to maximize P(D | f), and Bayesians choose f to maximize P(f | D).

Assessing P(f | D) requires us to have a prior over our set of functions f, which we’ll call π(f).

P(f | D) = P(D | f) π(f) / P(D)

We take a logarithm to make everything easier:

log P(f | D) = log P(D | f) + log π(f) – log P(D)

We already evaluated P(D | f) in the last post, so we’ll just plug it in right away.

log P(f | D) = – SOS/2σ^{2} – (N/2) log(2πσ^{2}) + log π(f) – log P(D)

Since we are maximizing with respect to f, the two terms that don’t depend on f (namely –(N/2) log(2πσ^{2}) and –log P(D)) can be lumped into a constant:

log P(f | D) = – SOS/2σ^{2} + log π(f) + constant

Now we just have to decide on the form of π(f). Since the functional form of f is determined by the values of the parameters {a, b, c}, π(f) = π(a, b, c). One plausible choice is a zero-mean Gaussian over each parameter, encoding our suspicion that the parameter values are probably small:

π(f) = exp( -a^{2} / 2σ_{a}^{2} ) exp( -b^{2} / 2σ_{b}^{2} ) exp( -c^{2} / 2σ_{c}^{2} ) / √(8π^{3}σ_{a}^{2}σ_{b}^{2}σ_{c}^{2})

log π(f) = -a^{2}/2σ_{a}^{2} – b^{2}/2σ_{b}^{2} – c^{2}/2σ_{c}^{2} – ½ log(8π^{3}σ_{a}^{2}σ_{b}^{2}σ_{c}^{2})

Now, throwing out terms that don’t depend on the values of the parameters, we find:

log P(f | D) = – SOS/2σ^{2} – a^{2}/2σ_{a}^{2} – b^{2}/2σ_{b}^{2} – c^{2}/2σ_{c}^{2} + constant

This is exactly L_{2} regularization with k_{1} = σ^{2}/σ_{a}^{2}, k_{2} = σ^{2}/σ_{b}^{2}, and k_{3} = σ^{2}/σ_{c}^{2} (multiply through by –2σ^{2} and minimize to see this). In other words, L_{2} regularization is Bayesian inference with Gaussian priors over the parameters!
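We can check this equivalence numerically. Here’s a sketch (the toy data, noise level, and prior widths are all made up for illustration): the L_{2}-penalized least-squares problem and the Gaussian-prior MAP problem share the same ridge-regression closed form.

```python
import numpy as np

# Toy data (made up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 7.1, 12.8, 21.0])
X = np.vstack([np.ones_like(x), x, x**2]).T   # columns for a, b, c

sigma = 0.5                           # assumed noise standard deviation
sigma_p = np.array([2.0, 2.0, 2.0])   # assumed prior std devs for a, b, c

# Gaussian-prior MAP: maximize -SOS/2sigma^2 - sum(theta_n^2 / 2sigma_n^2),
# i.e. minimize SOS + sum(k_n * theta_n^2) with k_n = sigma^2 / sigma_n^2
k = sigma**2 / sigma_p**2

# Both problems share the ridge-regression closed form:
theta = np.linalg.solve(X.T @ X + np.diag(k), X.T @ y)
print(theta)
```

The stationarity condition of the penalized SOS, –2Xᵀ(y – Xθ) + 2kθ = 0, is exactly the linear system solved above, and it is the same condition you get by setting the gradient of the log posterior to zero.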

What priors does L_{1} regularization correspond to? Matching terms as before (and absorbing the factor of 2σ^{2} into the constants k_{n}), we need

log π(f) = –k_{1}|a| – k_{2}|b| – k_{3}|c| + constant

π(a, b, c) ∝ e^{–k_{1}|a|} e^{–k_{2}|b|} e^{–k_{3}|c|}

I.e. the L_{1} regularization prior is a Laplace (double-exponential) distribution over each parameter.
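This, too, can be checked numerically. Here’s a sketch (the toy data, noise level, and penalty constants are made up for illustration): the L_{1}-penalized objective and the negative log posterior under Laplace priors differ only by a positive scale factor, so an optimizer finds the same minimizer for both.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (made up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 7.1, 12.8, 21.0])

def sos(p):
    a, b, c = p
    return np.sum((y - (a + b * x + c * x**2)) ** 2)

sigma = 0.5                    # assumed noise standard deviation
k = np.array([2.0, 2.0, 2.0])  # assumed L1 penalty constants

# L1-regularized least squares: minimize SOS + sum(k_n * |theta_n|)
l1_objective = lambda p: sos(p) + np.dot(k, np.abs(p))

# Bayesian MAP with Laplace priors pi(theta_n) proportional to
# exp(-k_n |theta_n| / (2 sigma^2)): minimize the negative log posterior,
# which is the same function scaled by 1/(2 sigma^2)
neg_log_posterior = lambda p: (sos(p) + np.dot(k, np.abs(p))) / (2 * sigma**2)

p0 = np.zeros(3)
opts = {"xatol": 1e-10, "fatol": 1e-12, "maxiter": 10000}
l1_fit = minimize(l1_objective, p0, method="Nelder-Mead", options=opts).x
map_fit = minimize(neg_log_posterior, p0, method="Nelder-Mead", options=opts).x

# The two estimates coincide (up to optimizer tolerance)
print(l1_fit, map_fit)
```

Nelder-Mead is used here because the |θ| terms make the objective non-differentiable at zero; since the two objectives are exact positive multiples of each other, the optimizer makes the same sequence of moves on both.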

This can be easily extended to any regularization technique. This is a way to get some insight into what your favorite regularization methods *mean*. They are ultimately to be cashed out in the form of your prior knowledge of the parameters!