Regularization as approximately Bayesian inference

In an earlier post, I showed how the procedure of minimizing the sum of squares falls out of regular old frequentist inference. This time I’ll do something similar, but with regularization and Bayesian inference.

Regularization is essentially a technique in which you evaluate models in terms of not just their fit to the data, but also the values of the parameters involved. For instance, say you are modeling some data with a second-order polynomial.

M = { f(x) = a + bx + cx² | a, b, c ∈ R }
D = { (x₁, y₁), …, (x_N, y_N) }

We can evaluate our model’s fit to the data with the sum of squares (SOS):

SOS = ∑ (y_n – f(x_n))²

Minimizing SOS gives us the frequentist answer – the answer that best fits the data. But what if we suspect that the values of a, b, and c are probably small? In other words, what if we have an informative prior about the parameter values? Then we can explicitly add on a penalty term that increases the SOS, such as…

SOS with L₁ regularization = k₁|a| + k₂ |b| + k₃ |c| + ∑ (y_n – f(x_n))²

The constants k₁, k₂, and k₃ determine how much we will penalize each parameter a, b, and c. This is not the only form of regularization we could use. We could also use the L₂ norm:

SOS with L₂ regularization = k₁a² + k₂ b² + k₃ c² + ∑ (y_n – f(x_n))²

In both of these cases, the regularized SOS term grows as the values of the parameters grow. This makes the optimal choice of curve take into account not only the fit to data, but the desired size of the parameters.
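To make the two penalized objectives concrete, here is a minimal Python sketch using NumPy. The function names (f, sos, sos_l1, sos_l2), the toy data, and the penalty weights are all made up for illustration; this is not any particular library's regularization API.

```python
import numpy as np

def f(x, params):
    """Second-order polynomial a + b*x + c*x**2."""
    a, b, c = params
    return a + b * x + c * x**2

def sos(params, x, y):
    """Plain sum of squared residuals."""
    return np.sum((y - f(x, params)) ** 2)

def sos_l1(params, x, y, k):
    """SOS plus k1*|a| + k2*|b| + k3*|c|."""
    return np.dot(k, np.abs(params)) + sos(params, x, y)

def sos_l2(params, x, y, k):
    """SOS plus k1*a^2 + k2*b^2 + k3*c^2."""
    return np.dot(k, np.square(params)) + sos(params, x, y)

# Made-up data lying exactly on the curve 1 + x^2.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 5.0, 10.0])

k = np.array([0.5, 0.5, 0.5])      # hypothetical penalty weights
small = np.array([1.0, 0.0, 1.0])  # the curve 1 + x^2
big = np.array([2.0, 0.0, 2.0])    # the curve 2 + 2x^2

for name, params in [("small", small), ("big", big)]:
    print(name, sos(params, x, y), sos_l1(params, x, y, k), sos_l2(params, x, y, k))
```

With this toy data the curve 1 + x² fits perfectly, so its SOS is zero and its only cost is the penalty. Doubling the parameters raises both the misfit and the penalty, and the L₂ version punishes the larger values harder than the L₁ version does.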

You might, having heard of this procedure, already suspect it of having a Bayesian bent. The notion of penalizing large parameter values on the basis of a prior suspicion that the values should be small sounds a lot like what the Bayesian would call “low priors on high parameter values.”

We’ll now make the connection explicit.

Frequentist inference tries to select the theory that makes the data most likely. Bayesian inference tries to select the theory that is made most likely by the data. I.e. frequentists choose f to maximize P(D | f), and Bayesians choose f to maximize P(f | D).

Assessing P(f | D) requires us to have a prior over our set of functions f, which we’ll call π(f).

P(f | D) = P(D | f) π(f) / P(D)

We take a logarithm to make everything easier:

log P(f | D) = log P(D | f) + log π(f) – log P(D)

We already evaluated P(D | f) in the last post, so we’ll just plug it in right away.

log P(f | D) = – SOS/2σ² – (N/2) log(2πσ²) + log π(f) – log P(D)

Since we are maximizing with respect to f, the two terms that don’t depend on f (the normalization term and log P(D)) can be folded into a constant.

log P(f | D) = – SOS/2σ² + log π(f) + constant
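As a quick sanity check on the likelihood term, here is a small Python sketch. It assumes, as the formula implies, independent Gaussian noise with a known standard deviation σ; the residual values and the helper name gaussian_log_likelihood are made up for illustration.

```python
import numpy as np

def gaussian_log_likelihood(residuals, sigma):
    """Sum over data points of log N(residual | 0, sigma^2)."""
    per_point = -0.5 * np.log(2 * np.pi * sigma**2) - residuals**2 / (2 * sigma**2)
    return np.sum(per_point)

sigma = 0.5  # assumed known noise level

# Hypothetical residuals y_n - f(x_n) for two candidate curves on the same data.
res_a = np.array([0.1, -0.4, 0.3, 0.0, -0.2])
res_b = np.array([0.5, -0.1, 0.6, -0.3, 0.2])

n = res_a.size
sos_a = np.sum(res_a**2)
sos_b = np.sum(res_b**2)

# Closed form: -SOS/(2 sigma^2) - (N/2) log(2 pi sigma^2)
closed_form_a = -sos_a / (2 * sigma**2) - n / 2 * np.log(2 * np.pi * sigma**2)
print(gaussian_log_likelihood(res_a, sigma), closed_form_a)

# When comparing two curves, the normalization term cancels and only SOS matters.
diff = gaussian_log_likelihood(res_a, sigma) - gaussian_log_likelihood(res_b, sigma)
print(diff, -(sos_a - sos_b) / (2 * sigma**2))
```

The second comparison is the point: the N/2 log(2πσ²) term is the same for every candidate f, which is why it can be dropped when maximizing.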

Now we just have to decide on the form of π(f). Since the functional form of f is determined by the values of the parameters {a, b, c}, π(f) = π(a, b, c). One plausible choice is an independent Gaussian for each parameter, centered at zero:

π(f) = exp( -a² / 2σ_a² ) exp( -b² / 2σ_b² ) exp( -c² / 2σ_c² ) / √(8π³σ_a²σ_b²σ_c²)
log π(f) = -a²/2σ_a² – b²/2σ_b² – c²/2σ_c² – ½ log(8π³σ_a²σ_b²σ_c²)

Now, throwing out terms that don’t depend on the values of the parameters, we find:

log P(f | D) = – SOS/2σ² – a²/2σ_a² – b²/2σ_b² – c²/2σ_c² + constant

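Multiplying through by –2σ² turns maximizing this expression into minimizing SOS + (σ²/σ_a²) a² + (σ²/σ_b²) b² + (σ²/σ_c²) c². This is exactly L₂ regularization, with k₁ = σ²/σ_a², k₂ = σ²/σ_b², and k₃ = σ²/σ_c². In other words, L₂ regularization is Bayesian inference with Gaussian priors over the parameters!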
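Here is a numerical check of that correspondence, as a sketch rather than a definitive implementation. The noise level, prior widths, and data are all invented. It relies on the standard closed-form minimizer of SOS + Σ kᵢθᵢ² for a model that is linear in its parameters, θ = (XᵀX + diag(k))⁻¹ Xᵀy, where X has columns 1, x, x².

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 0.5                          # assumed noise standard deviation
prior_sd = np.array([1.0, 2.0, 0.5])  # assumed sigma_a, sigma_b, sigma_c

# Made-up data from a quadratic plus Gaussian noise.
x = np.linspace(-1.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x, x**2])
y = X @ np.array([0.3, -0.7, 0.2]) + sigma * rng.normal(size=x.size)

# Penalty weights implied by the priors: k = sigma^2 / sigma_prior^2.
k = sigma**2 / prior_sd**2

# Minimizer of SOS + k1*a^2 + k2*b^2 + k3*c^2.
theta_ridge = np.linalg.solve(X.T @ X + np.diag(k), X.T @ y)

def log_posterior(theta):
    """log P(f | D) up to an additive constant, with Gaussian priors."""
    sos = np.sum((y - X @ theta) ** 2)
    return -sos / (2 * sigma**2) - np.sum(theta**2 / (2 * prior_sd**2))

# Gradient of the log posterior, evaluated at the regularized solution.
grad = X.T @ (y - X @ theta_ridge) / sigma**2 - theta_ridge / prior_sd**2
print("gradient at ridge solution:", grad)

# Nudging the parameters in random directions should only lower the posterior.
nudges = 0.1 * rng.normal(size=(100, 3))
best = log_posterior(theta_ridge)
worse = all(log_posterior(theta_ridge + d) < best for d in nudges)
print("ridge solution beats all nudged candidates:", worse)
```

The gradient vanishes (up to floating-point error) at the L₂-regularized solution, so the parameters that minimize the penalized SOS are also the ones that maximize the posterior.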

What priors does L₁ regularization correspond to?

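log π(f) = – k₁|a| – k₂|b| – k₃|c| + constant
π(a, b, c) ∝ e^(–k₁|a|) e^(–k₂|b|) e^(–k₃|c|)

I.e. the L₁ regularization prior is a Laplace (two-sided exponential) distribution centered at zero for each parameter. (Here the k’s have absorbed the factor of 1/2σ² that multiplies the SOS.)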
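To see that the prior above is a proper distribution, here is a short sketch. Because the density depends on |a|, it is symmetric about zero: a two-sided exponential, usually called the Laplace distribution. The sketch takes log π = –k₁|a| at face value, so k₁ plays the role of the rate; the helper name laplace_log_pdf and the value of k₁ are made up.

```python
import numpy as np

def laplace_log_pdf(a, rate):
    """log of the normalized density (rate/2) * exp(-rate * |a|)."""
    return np.log(rate / 2.0) - rate * np.abs(a)

k1 = 1.5  # hypothetical penalty weight for the parameter a

# The log density differs from the negative penalty only by a constant.
a = np.linspace(-3.0, 3.0, 7)
offset = laplace_log_pdf(a, k1) + k1 * np.abs(a)
print(offset)  # the same value, log(k1/2), at every a

# Numerical check that the density integrates to about 1.
grid = np.linspace(-40.0, 40.0, 400001)
step = grid[1] - grid[0]
total = np.sum(np.exp(laplace_log_pdf(grid, k1))) * step
print(total)
```

The sharp peak of this density at zero is one way to think about why L₁ penalties tend to push parameters all the way to zero, while the Gaussian prior behind L₂ only shrinks them.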

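The same argument extends to any additive penalty: minimizing SOS + R(a, b, c) is the same as maximizing the posterior under a prior proportional to exp(–R/2σ²), at least when that function can be normalized. This is a way to get some insight into what your favorite regularization methods mean. They are ultimately to be cashed out in the form of your prior knowledge of the parameters!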

Published by squarishbracket
