# How to Learn From Data, Part 2: Evaluating Models

I ended the last post by saying that the solution to the problem of overfitting all relied on the concept of a model. So what is a model? Quite simply, a model is just a set of functions, each of which is a candidate for the true distribution that generates the data.

Why use a model? If we think about a model as just a set of functions, this seems kinda abstract and strange. But I think it can be made intuitive, and that in fact we reason in terms of models all the time. Think about how physicists formulate their theories. Laws of physics have two parts: First they specify the types of functional relationships there are in the world. And second, they specify the value of particular parameters in those functions. This is exactly how a model works!

Take Newton’s theory of gravity. The fundamental claim of this theory is that there is a certain functional relationship between the quantities a (the acceleration of an object), M (the mass of a nearby object), and r (the distance between the two objects): a ~ M/r2. To make this relationship precise, we need to include some constant parameter G that says exactly what this proportionality is: a = GM/r2.

So we start off with a huge set of probability distributions over observations of the accelerations of particles (one for each value of G), and then we gather data to see which value of G is most likely to be right. Einstein’s theory of gravity was another different model, specifying a different functional relationship between a, M, and r and involving its own parameters. So to compare Einstein’s theory of gravity to Newton’s is to compare different models of the phenomenon of gravitation.

When you start to think about it, the concept of a model is an enormous one. Models are ubiquitous, you can’t escape from them. Many fancy methods in machine learning are really just glorified model selection. For instance, a neural network is a model: the architecture of the network specifies a particular functional relationship between the input and the output, and the strengths of connections between the different neurons are the parameters of the model.

So. What are some methods for assessing predictive accuracy of models? The first one we’ll talk about is a super beautiful and clever technique called…

## Cross Validation

The fundamental idea of cross validation is that you can approximate how well a model will do on the next data point by evaluating how the model would have done at predicting a subset of your data, if all it had access to was the rest of your data.

I’ll illustrate this with a simple example. Suppose you have a data set of four points, each of which is a tuple of two real numbers. Your two models are T1: that the relationship is linear, and T2: that the relationship is quadratic.

What we can do is go through each data point, selecting each in order, train the model on the non-selected data points, and then evaluate how well this trained model fits the selected data point. (By training the model, I mean something like “finding the curve within the model that best fits the non-selected data points.”) Here’s a sketch of what this looks like:

There are a bunch of different ways of doing cross validation. We just left out one data point at a time, but we could have left out two points at a time, or three, or any k less than the total number of points N. This is called leave-k-out cross validation. If we partition our data by choosing a fraction of it for testing, then we get what’s called n-fold cross validation (where 1/n is the fraction of the data that is isolated for testing).

We also have some other choices besides how to partition the data. For instance, we can choose to train our model via a Likelihoodist procedure or a Bayesian procedure. And we can also choose to test our model in various ways, by using different metrics for evaluating the distance between the testing set and the trained model. In leave-one-out cross validation (LOOCV), a pretty popular method, both the training and testing procedures are Likelihoodist.

Now, there’s a practical problem with cross-validation, which is that it can take a long time to compute. If we do LOOCV with N data points and look at all ways of leaving out one data point, then we end up doing N optimization procedures (one for each time you train your model), each of which can be pretty slow. But putting aside practical issues, for thinking about theoretical rationality, this is a super beautiful technique.

Next we’ll move on to…

## Bayesian Model Selection!

We had Bayesian procedures for evaluating individual functions. Now we can go full Bayes and apply it to models! For each model, we assess the posterior probability of the model given the data using Bayes’ rule as usual:

$Pr(M | D) = \frac{Pr(D | M)}{Pr(D)} Pr(M)$

Now… what is the prior Pr(M)? Again, it’s unspecified. Maybe there’s a good prior that solves overfitting, but it’s not immediately obvious what exactly it would be.

But there’s something really cool here. In this equation, there’s a solution to overfitting that pops up not out of the prior, but out of the LIKELIHOOD! I explain this here. The general idea is that models that are prone to overfitting get that way because their space of functions is very large. But if the space of functions is large, then the average prior on each function in the model must be small. So larger models have, on average, smaller values of the term Pr(f | M), and correspondingly (via $Pr(D | M) = \sum_{f \in M} Pr(D | f, M) Pr(f | M)$) get a weaker update from evidence.

So even without saying anything about the prior, we can see that Bayesian model selection provides a potential antidote to overfitting. But just like before, we have the practical problem that computing Pr(M | D) is in general very hard. Usually evaluating Pr(D | M) involves calculating a really complicated many-dimensional integral, and calculating many-dimensional integrals can be very computationally expensive.

Might we be able to find some simple unbiased estimator of the posterior Pr(M | D)? It turns out that yes, we can. Take another look at the equation above.

$Pr(M | D) = \frac{Pr(D | M)}{Pr(D)} Pr(M)$

Since $Pr(D)$ is a constant across models, we can ignore it when comparing models. So for our purposes, we can attempt to maximize the equation:

$Pr(D | M) Pr(M) = \sum_{f \in M} Pr(D | f, M) Pr(f | M) Pr(M)$

If we assume that P(D | f) is twice differentiable in the model parameters and that Pr(f) behaves roughly linearly around the maximum likelihood function, we can approximate this as:

$Pr(D | M) Pr(M) \approx \frac {Pr(D | f^*(D))} {\sqrt{N}}^k Pr(M)$

f* is the maximum likelihood function within the model, and k is the the number of parameters of the model (e.g. k = 2 for a linear model and k = 3 for a quadratic model).

\We can make this a little neater by taking a logarithm and defining a new quantity: the Bayesian Information Criterion.

$argmax_M [ Pr(M | D) ] \\~\\ = argmax_M [ Pr(D | M) Pr(M) ] \\~\\ \approx argmax_M [ \frac {Pr(D | f^*(D))} {\sqrt{N}}^k Pr(M)] \\~\\ = argmax_M [ \log(Pr(M)) + \log (Pr(D | f^*(D)) - \frac{k}{2} \log(N) ] \\~\\ = argmin_M [ BIC - \log Pr(M) ]$

$BIC = \frac{k}{2} log(N) - \log \left( Pr(D | f^*(D)) \right)$

Thus, to maximize the posterior probability of a model M is roughly the same as to minimize the quantity BIC – log Pr(M).

If we assume that our prior over models is constant (that is, that all models are equally probable at the outset), then we just minimize BIC.

## Bayesian Information Criterion

Notice how incredibly simple this expression is to evaluate! If all we’re doing is minimizing BIC, then we only need to find the maximum likelihood function f*(D), assess the value of Pr(D | f*), and then penalize this quantity with the factor k/2 log(N)! This penalty scales in proportion to the model complexity, and thus helps us avoid overfitting.

We can think about this as a way to make precise the claims above about Bayesian Model Selection penalizing overfitting in the likelihood. Remember that minimizing BIC made sense when we assumed a uniform prior over models (and therefore when our prior doesn’t penalize overfitting). So even when we don’t penalize overfitting in the prior, we still end up getting a penalty! This penalty must come from the likelihood.

Some more interesting facts about BIC:

• It is approximately equal to the minimum description length criterion (which I haven’t discussed here)
• It is only valid when N >> k. So it’s not good for small data sets or large models.
• It the truth is contained within your set of models, then BIC will select the truth with probability 1 as N goes to infinity.

Okay, stepping back. We sort of have two distinct clusters of approaches to model selection. There’s Bayesian Model Selection and the Bayesian Information Criterion, and then there’s Cross Validation. Both sound really nice. But how do they compare? It turns out that in general they give qualitatively different answers. And in general, the answers you get using cross validation tend to be more predictively accurate that those that you get from BIC/BMS.

A natural question to ask at this point is: BIC was a nice simple approximation to BMS. Is there a corresponding nice simple approximation to cross validation? Well first we must ask: which cross validation? Remember that there was a plethora of different forms of cross validation, each corresponding to slightly different criterion for evaluating fits. We can’t assume that all these methods give the same answer.

Let’s choose leave-one-out cross validation. It turns out that yes, there is a nice approximation that is asymptotically equivalent to LOOCV! This is called the Akaike information criterion.

## Akaike Information Criterion

First, let’s define AIC:

$AIC = k - \log \left( Pr(D | f^*(D)) \right)$

Like always, k is the number of parameters in M and f* is chosen by the Likelihoodist approach described in the last post.

What you get by minimizing this quantity is asymptotically equivalent to what you get by doing leave-one-out cross validation. Compare this to BIC:

$BIC = \frac{k}{2} \log (N) - \log \left( Pr(D | f^*(D) ) \right)$

There’s a qualitative difference in how the parameter penalty is weighted! BIC is going to have a WAY higher complexity penalty than AIC. This means that AIC should in general choose less simple models than BIC.

Now, we’ve already seen one reason why AIC is good: it’s a super simple approximation of LOOCV and LOOCV is good. But wait, there’s more! AIC can be derived as an unbiased estimator of the Kullback-Leibler divergence DKL.

What is DKL? It’s a measure of the information theoretic distance of a model from truth. For instance, if the true generating distribution for a process is f, and our model of this process is g, then DKL tells us how much information we lose by using g to represent f. Formally, DKL is:

$D_{KL}(f, g) = \int { f(x) \log \left(\frac{f(x)} {g(x)} \right) dx }$

Notice that to actually calculate this quantity, you have to already know what the true distribution f is. So that’s unfeasible. BUT! AIC gives you an unbiased estimate of its value! (The proof of this is complicated, and makes the assumptions that N >> k, and that the model is not far from the truth.)

Thus, AIC gives you an unbiased estimate of which model is closest to the truth. Even if the truth is not actually contained within your set of models, what you’ll end up with is the closest model to it. And correspondingly, you’ll end up getting a theory of the process that is very similar to how the process actually works.

There’s another version of AIC that works well for smaller sample sizes and larger models called AICc.

$AICc = AIC + \frac {k (k + 1)} {N - (k + 1)}$

And interestingly, AIC can be derived as an approximation to Bayesian Model Selection with a particular prior over models. (Remember earlier when we showed that $argmax_M \left( Pr(M | D) \right) \approx argmin_M \left( BIC - \log(Pr(M)) \right)$? Well, just set $\log Pr(M) = BIC - AIC = k \left( \frac{1}{2} \log N - 1 \right)$, and you get the desired result.) Interestingly, this prior ends up rewarding theories for having lots of parameters! This looks pretty bad… it seems like AIC is what you get when you take Bayesian Model Selection and then try to choose a prior that favors overfitting theories. But the practical use and theoretical virtues of AIC warrant, in my opinion, taking a closer look at what’s going on here. Perhaps what’s going on is that the likelihood term Pr(D | M) is actually doing too much to avoid overfitting, so in the end what we need in our prior is one that avoids underfitting! Regardless, we can think of AIC as specifying the unique good prior on models that optimizes for predictive accuracy.

There’s a lot more to be said from here. This is only a short wade into the waters of statistical inference and model selection. But I think this is a good place to stop, and I hope that I’ve given you a sense of the philosophical richness of the various different frameworks for inference I’ve presented here.

# How to Learn From Data, Part I: Evaluating Simple Hypotheses

Here’s a general question of great importance: How should we adjust our model of the world in response to data? This post is a tour of a few of the big ideas that humans have come up with to address this question. I find learning about this topic immensely rewarding, and so I shall attempt to share it! What I love most of all is thinking about how all of this applies to real world examples of inference from data, so I encourage you to constantly apply what I say to situations that you find personally interesting.

First of all, we want a formal description of what we mean by data. Let’s describe our data quite simply as a set: $D = \{(x_1, y_1), (x_2, y_2),..., (x_N, y_N)\}$, where each $x_i$ is a value that you choose (your independent variable) and each $y_i$ is the resulting value that you measure (the dependent variable). Each of these variables can be pretty much anything – real numbers, n-tuples of real numbers, integers, colors, whatever you want. The goal here is to embrace generality, so as to have a framework that applies to many kinds of inference.

Now, suppose that you have a certain theory about the underlying relationship between the x variables and the y variables. This theory might take the form of a simple function: $T: y = f(x)$ We interpret T as making a prediction of a particular value of the dependent variable for each value of the independent variable. Maybe the data is the temperatures of regions at various altitudes, and our theory T says that one over the temperature (1/T) is some particular linear function of the altitude.

What we want is a notion of how good of a theory T is, given our data. Intuitively, we might think about doing this by simply assessing the distance between each data point $y_n$ and the predicted value of y at that point: $f(x_n)$, using some metric, then adding them all up. But there are a whole bunch of distance metrics available to us. Which one should we use? Perhaps the taxicab measure comes to mind ($\sum_{n=1}^N {|y_n - f(x_n)|}$), or the sum of the squares of the differences ($SOS = \sum_{n=1}^N {(y_n - f(x_n))^2}$). We want a good theoretical justification for why any one of these metrics should be preferred over any other, since in general they lead to different conclusions. We’ll see just such justifications shortly. Keep the equation for SOS in mind, as it will turn up repeatedly ahead.

Now here’s a problem: If we want a probabilistic evaluation of T, then we have to face the fact that it makes a deterministic prediction. Our theory seems to predict with 100% probability that at $x_n$ the observed $y_n$ will be precisely $f(x_n)$. If it’s even slightly off of this, then our theory will be probabilistically disconfirmed to zero.

We can solve this problem by modifying our theory T to not just be a theory of the data, but also a theory of error. In other words, we expect that the value we get will not be exactly the predicted value, and give some account of how on average observation should differ from theory.

$T: y = f(x) + \epsilon$, where $\epsilon$ is some random variable drawn from a probability distribution $P_E$.

This error distribution can be whatever we please – Gaussian, exponential, Poisson, whatever. For simplicity let’s say that we know the error is normal (drawn from a Gaussian distribution) with a known standard deviation σ.

$T: y = f(x) + \epsilon, \epsilon \sim N(0, \sigma)$

A note on notation here: $\epsilon \sim N(\mu, \sigma)$ denotes a random variable drawn from a Gaussian distribution centered around $\mu$ with a standard deviation of $\sigma$.

This gives us a sensible notion of the probability of obtaining some value of y from a chosen x given the theory T.

$Pr(y | x, T) \sim N(f(x), \sigma) \\~\\ Pr(y | x, T) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2 \sigma^2} (y - f(x))^2}$

Nice! This is a crucial step towards figuring out how to evaluate theories: we’ve developed a formalism for describing precisely how likely a data point is, given a particular theory (which, remember, so far is just a function from values of independent variables to values of dependent variables, coupled with a theory of how the error in your observations works).

Let’s try to extend this to our entire data set D. We want to assess the probability of the particular values of the dependent variables, given the chosen values of the dependent variables and the function. We’ll call this the probability of D given f.

$Pr(D | f) = Pr(x_1, y_1, x_2, y_2,..., x_N, y_N | T) \\~\\ Pr(D | f) = Pr(y_1, y_2,...,y_N | x_1, x_2,..., x_N, T)$

(It’s okay to move the values of the independent variable x across the conditional bar because of our assumption that we know beforehand what values x will take on.)

But now we run into a problem: we can’t really do anything with this expression without knowing how the data points depend upon each other. We’d like to break it into individual terms, each of which can be evaluated by the above expression for $Pr(y | x, T)$. But that would require an assumption that our data points are independent. In general, we cannot grant this assumption. But what we can do is expand our theory once more to include a theory of the dependencies in our data points. For simplicity of explication, let’s proceed with the assumption that each of our observations is independent of the others. Said another way, we assume that given T, $x_n$ screens off all other variables from $y_n$). Combined with our assumption of normal error, this gives us a nice simple reduction.

$Pr(D | f) = Pr(y_1, y_2,...,y_N | x_1, x_2,..., x_N, T) \\~\\ Pr(D | f) = Pr(y_1 | x_1, T) Pr(y_2 | x_2, T) ... Pr(y_N | x_N, T) \\~\\ Pr(D | f) = \prod_{n=1}^N { \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2 \sigma^2} (y_n - f(x_n))^2} } \\~\\ Pr(D | f) = (2 \pi \sigma^2)^{-N/2} e^{-\frac{1}{2 \sigma^2} \sum_{n=1}^N {(y_n - f(x_n))^2}} \\~\\ Pr(D | f) = (2 \pi \sigma^2)^{-N/2} e^{-\frac{SOS(f, D)}{2 \sigma^2} }$

We see an interesting connection arise here between Pr(D | f) and the sum of squares evaluation of fit. More will be said of this in just a moment.

But first, let’s take a step back. Our ultimate goal here is to find a criterion for theory evaluation. And here we’ve arrived at an expression that looks like it might be right for that role! Perhaps we want to say that our criterion for evaluating theories is just maximization of Pr(D | f). This makes some intuitive sense… a good theory is one which predicts the observed data. If general relativity or quantum mechanics predicted that the sun should orbit the earth, or that atoms should be unstable, then we wouldn’t be very in favor of it.

And so we get the Likelihoodism Thesis!

Likelihoodism: The best theory f* given some data D is that which maximizes Pr(D | f). Formally: $f^*(D) = argmax_f [ Pr(D | f) ]$

Now, since logarithms are monotonic, we can express this in a more familiar form:

$argmax_f [Pr(D | F)] = argmax_f \left[ e^{-\frac{SOS(f, D)}{2 \sigma^2} } \right] = argmax_f \left[-\frac{SOS(f, D)}{2 \sigma^2} \right] = argmin_f [ SOS(f, D) ]$

Thus the best theory according to Likelihoodism is that which minimizes the sum of squares! People minimize sums of squares all the time, and most of the time they don’t realize that it can be derived from the Likelihoodism Thesis and the assumptions of Gaussian error and independent data. If our assumptions had been different, then we would have found a different expression, and SOS might no longer be appropriate! For instance, the taxicab metric arises as the correct metric if we assume exponential error rather than normal error. I encourage you to see why this is for yourself. It’s a fun exercise to see how different assumptions about the data give rise to different metrics for evaluating the distance between theory and observation. In general, you should now be able to take any theory of error and assess what the proper metric for “degree of fit of theory to data” is, assuming that degree of fit is evaluated by maximizing Pr(D | f).

Now, there’s a well-known problem with the Likelihoodism thesis. This is that the function f that minimizes SOS(f, D) for some data D is typically going to be a ridiculously overcomplicated function that perfectly fits each data points, but does terribly on future data. The function will miss the underlying trend in the data for the noise in the observations, and as a result fail to get predictive accuracy.

This is the problem of overfitting. Likelihoodism will always prefer theories that overfit to those that don’t, and as a result will fail to identify underlying patterns in data and do a terrible job at predicting future data.

How do we solve this? We need a replacement for the Likelihoodism thesis. Here’s a suggestion: we might say that the problem stems from the fact that the Likelihoodist procedure recommends us to find a function that makes the data most probable, rather than finding a function that is made most probable by the data. From this suggestion we get the Bayesian thesis:

Bayesianism: The best theory f given some data is that which maximizes Pr(f | D). $f^*(D) = argmax_f [ Pr(f | D) ]$

Now, what is this Pr(f | D) term? It’s an expression that we haven’t seen so far. How do we evaluate it? We simply use the theorem that the thesis is named for: Bayes’ rule!

$Pr(f | D) = \frac{Pr(D | f)}{Pr(D)} Pr(f)$

This famous theorem is a simple deductive consequence of the probability axioms and the definition of conditional probability. And it happens to be exactly what we need here. Notice that the right-hand side consists of the term Pr(D | f), which we already know how to calculate. And since we’re ultimately only interested in varying f to find the function that maximizes this expression, we can ignore the constant term in the denominator.

$argmax_f [ Pr(f | D) ] = argmax_f [ Pr(D | f) Pr(f) ] \\~\\ argmax_f [ Pr(f | D) ] = argmax_f [ \log Pr(D | f) + \log Pr(f) ] \\~\\ argmax_f [ Pr(f | D) ] = argmax_f [ -\frac{SOS(f, D)}{2 \sigma^2} + \log Pr(f) ] \\~\\ argmax_f [ Pr(f | D) ] = argmin_f [ SOS(f, D) - 2 \sigma^2 \log Pr(f) ]$

The last steps we get just by substituting in what we calculated before. The 2σ² in the first term comes from the fact that the exponent of our Gaussian is (y – f(x))² / 2σ². We ignored it before because it was just a constant, but since we now have another term in the expression being maximized, we have to keep track of it again.

Notice that what we get is just what we had initially (a sum of squares) plus an additional term involving a mysterious Pr(f). What is this Pr(f)? It’s our prior distribution over theories. Because of the negative sign in front of the second term, a larger value of Pr(f) gives a smaller value for the expression that we are minimizing. Similarly, the more closely our function follows the data, the smaller the SOS term becomes. So what we get is a balance between fitting the data and having a high prior. A theory that fits the data perfectly can still end up with a bad evaluation, as long as it has a low enough prior. And a theory that fits the data poorly can end up with a great evaluation if it has a high enough prior.

(Those that are familiar with machine learning might notice that this feels similar to regularization. They’re right! It turns out that different regularization techniques just end up corresponding to different Bayesian priors! L2 regularization corresponds to a Gaussian prior over parameters, and L1 regularization corresponds to an exponential prior over parameters. And so on. Different prior assumptions about the process generating the data lead naturally to different regularization techniques.)

What this means is that if we are trying to solve overfitting, then the Bayesianism thesis provides us a hopeful ray of light. Maybe we can find some perfect prior Pr(f) that penalizes complex overfitting theories just enough to give us the best prediction. But unsurprisingly, finding such a prior is no easy matter. And importantly, the Bayesian thesis as I’ve described it gives us no explicit prescription for what this prior should be, making it insufficient for a full account of inference. What Bayesianism does is open up a space of possible methods of theory evaluation, inside which (we hope) there might be a solution to our problem.

Okay, let’s take another big step back. How do we progress from here? Well, let’s think for a moment about what our ultimate goal is. We want to say that overfitting is bad, and that therefore some priors are better than others insofar as they prevent overfitting. But what is the standard we’re using to determine that overfitting is bad? What’s wrong with an overfitting theory?

Here’s a plausible answer: The problem with overfitting is that while it maximizes descriptive accuracy, it leads to poor predictive accuracy! I.e., theories that overfit do a great job at describing past data, but tend to do a poor job of matching future data.

Let’s take this idea and run with it. Perhaps all that we truly care about in theory selection is predictive accuracy. I’ll give a name to this thesis:

Predictivism: The ultimate standard for theory evaluation is predictive accuracy. I.e., the best theory is that which does the best at predicting future data.

Predictivism is a different type of thesis than Bayesianism and Likelihoodism. While each of those gave precise prescriptions for how to calculate the best theory, Predictivism does not. If we want an explicit algorithm for computing the best theory, then we need to say what exactly “predictive accuracy” means. This means that as far as we know so far, Predictivism could be equivalent to Likelihoodism or to Bayesianism. It all depends on what method we end up using to determine what theory has the greatest predictive accuracy. Of course, we have good reasons to suspect that Likelihoodism is not identical to Predictivism (Likelihoodism overfits, and we suspect that Predictivism will not) or to Bayesianism (Bayesianism does not prescribe a particular prior, so it gives no unique prescription for the best theory). But what exactly Predictivism does say depends greatly on how exactly we formalize the notion of predictive accuracy.

Another difference between Predictivism and Likelihoodism/Bayesianism is that predictive accuracy is about future data. While it’s in principle possible to find an analytic solution to P(D | f) or P(f | D), Predictivism seems to require us to compute terms involving future data! But these are impossible to specify, because we don’t know what the future data will be!

Can we somehow approximate what the best theory is according to Predictivism, by taking our best guess at what the future data will be? Maybe we can try to take the expected value of something like $Pr(y_{N+1} | x_{N+1}, T, D)$, where $(x_{N+1}, y_{N+1})$ is a future data point. But there are two big problems with this.

First, in taking an expected value, you must use the distribution of outcomes given by your own theory. But then your evaluation picks up an inevitable bias! You can’t rely on your own theories to assess how good your theories are.

And second, by the assumptions we’ve included in our theories so far, the future data will be independent of the past data! This is a really big problem. We want to use our past data to give some sense of how well we’ll do on future data. But if our data is independent, then our assessment of how well T will do on the next data point will have to be independent of the past data as well! After all, no matter what value $y_{N+1}$ has, $Pr(y_{N+1} | x_{N+1}, T, D) = Pr(y_{N+1} | x_{N+1}, T).$

So is there any way out of this? Surprisingly, yes! If we get creative, we can find ways to approximate this probability with some minimal assumptions about the data. These methods all end up relying on a new concept that we haven’t yet discussed: the concept of a model. The next post will go into more detail on how this works. Stay tuned!

# Some simple probability puzzles

(Most of these are taken from Ian Hacking’s Introduction to Probability and Inductive Logic.)

1. About as many boys as girls are born in hospitals. Many babies are born every week at City General. In Cornwall, a country town, there is a small hospital where only a few babies are born every week.

Define a normal week as one where between 45% and 55% of babies are female. An unusual week is one where more than 55% or less than 45% are girls.

Which of the following is true:
(a) Unusual weeks occur equally often at City General and at Cornwall.
(b) Unusual weeks are more common at City General than at Cornwall.
(c) Unusual weeks are more common at Cornwall than at City General.

2. Pia is 31 years old, single, outspoken, and smart. She was a philosophy major. When a student, she was an ardent supporter of Native American rights, and she picketed a department store that had no facilities for nursing mothers.

Which of the following statements are most probable? Which are least probable?

(a) Pia is an active feminist.
(b) Pia is a bank teller.
(c) Pia works in a small bookstore.
(d) Pia is a bank teller and an active feminist.
(e) Pia is a bank teller and an active feminist who takes yoga classes.
(f) Pia works in a small bookstore and is an active feminist who takes yoga classes.

3. You have been called to jury duty in a town with only green and blue taxis. Green taxis dominate the market, with 85% of the taxis on the road.

On a misty winter night a taxi sideswiped another car and drove off. A witness said it was a blue cab. This witness is tested under similar conditions, and gets the color right 80% of the time.

You conclude about the sideswiping taxi:
(a) The probability that it is blue is 80%.
(b) It is probably blue, but with a lower probability than 80%.
(c) It is equally likely to be blue or green.
(d) It is more likely than not to be green.

4. You are a physician. You think that it’s quite likely that a patient of yours has strep throat. You take five swabs from the throat of this patient and send them to a lab for testing.

If the patient has strep throat, the lab results are right 70% of the time. If not, then the lab is right 90% of the time.

The test results come back: YES, NO, NO, YES, NO

You conclude:
(a) The results are worthless.
(b) It is likely that the patient does not have strep throat.
(c) It is slightly more likely than not that the patient does have strep throat.
(d) It is very much more likely than not that the patient does have strep throat.

5. In a country, all families wants a boy. They keep having babies till a boy is born. What is the expected ratio of boys and girls in the country?
6.  Answer the following series of questions:

If you flip a fair coin twice, do you have the same chance of getting HH as you have of getting HT?

If you flip the coin repeatedly until you get HH, does this result in the same average number of flips as if you repeat until you get HT?

If you flip it repeatedly until either HH emerges or HT emerges, is either outcome equally likely?

You play a game with a friend in which you each choose a sequence of three possible flips (e.g HHT and TTH). You then flip the coin repeatedly until one of the two patterns emerges, and whosever pattern it is wins the game. You get to see your friend’s choice of pattern before deciding yours. Are you ever able to bias the game in your favor?

Are you always able to bias the game in your favor?

## Solutions (and lessons)

1. The correct answer is (a): Unusual weeks occur more often at Cornwall than at City General. Even though the chance of a boy is the same at Cornwall as it is at City General, the percentage of boys from week to week is larger in the smaller city (for N patients a week, the percentage boys goes like 1/sqrt(N)). Indeed, if you think about an extreme case where Cornwall has only one birth a week, then every week will be an unusual week (100% boys or 0% boys).
2. There is room to debate the exact answer but whatever it is, it has to obey some constraints. Namely, the most probable statement cannot be (d), (e), or (f), and the least probable statement cannot be (a), (b), or (c). Why? Because of the conjunction rule of probability: each of (d), (e), and (f) are conjunctions of (a), (b), and (c), so they cannot be more likely. P(A & B) ≤ P(A).

It turns out that most people violate this constraint. Many people answer that (f) is the most probable description, and (b) is the least probable. This result is commonly interpreted to reveal a cognitive bias known as the representativeness heuristic – essentially, that our judgements of likelihood are done by considering which descriptions most closely resemble the known facts. In this case,

Another factor to consider is that prior to considering the evidence, your odds on a given person being a bank teller as opposed to working in a small bookstore should be heavily weighted towards her being a bank teller. There are just far more bank tellers than small bookstore workers (maybe a factor of around 20:1). This does not necessarily mean that (b) is more likely than (c), but it does mean that the evidence must discriminate strongly enough against her being a bank teller so as to overcome the prior odds.

This leads us to another lesson, which is to not neglect the base rate. It is easy to ignore the prior odds when it feels like we have strong evidence (Pia’s age, her personality, her major, etc.). But the base rate on small bookstore workers and bank tellers are very relevant to the final judgement.

3. The correct answer is (d) – it is more likely than not that the sideswiper was green. This is a basic case of base rate neglect – many people would see that the witness is right 80% of the time and conclude that the witness’s testimony has an 80% chance of being correct. But this is ignoring the prior odds on the content of the witness’s testimony.

In this case, there were prior odds of 17:3 (85%:15%) in favor of the taxi being green. The evidence had a strength of 1:4 (20%:80%), resulting in the final odds being 17:12 in favor of the taxi being green. Translating from odds to probabilities, we get a roughly 59% chance of the taxi having been green.

We could have concluded (d) very simply by just comparing the prior probability (85% for green) with the evidence (80% for blue), and noticing that the evidence would not be strong enough to make blue more likely than green (since 85% > 80%). Being able to very quickly translate between statistics and conclusions is a valuable skill to foster.

4. The right answer is (d). We calculate this just like we did the last time:

The results were YES, NO, NO, YES, NO.

Each YES provides evidence with strength 7:1 (70%/10%) in favor of strep, and each NO provides evidence with strength 1:3 (30%/90%).

So our strength of evidence is 7:1 ⋅ 1:3 ⋅ 1:3 ⋅ 7:1 ⋅ 1:3 = 49:27, or roughly 1.81:1 in favor of strep. This might be a little surprising… we got more NOs than YESs and the NO was correct 90% of the time for people without strep, compared to the YES being correct only 70% of the time in people with strep.

Since the evidence is in favor of strep, and we started out already thinking that strep was quite likely, in the end we should be very convinced that they have strep. If our prior on the patient having strep was 75% (3:1 odds), then our probability after getting evidence will be 84% (49:9 odds).

Again, surprising! The patient who sees these results and hears the doctor declaring that the test strengthens their belief that the patient has strep might feel that this is irrational and object to the conclusion. But the doctor would be right!

5. Supposing as before that the chance of any given birth being a boy is equal to the chance of it being a girl, we end up concluding…

The expected ratio of boys and girls in the country is 1! That is, this strategy doesn’t allow you to “cheat” – it has no impact at all on the ratio. Why? I’ll leave this one for you to figure out. Here’s a diagram for a hint:

This is important because it applies to the problem of p-hacking. Imagine that all researchers just repeatedly do studies until they get the results they like, and only publish these results. Now suppose that all the researchers in the world are required to publish every study that they do. Now, can they still get a bias in favor of results they like? No! Even though they always stop when getting the result they like, the aggregate of their studies is unbiased evidence. They can’t game the system!

If you flip a fair coin twice, do you have the same chance of getting HH as you have of getting HT? (Yes)

If you flip it repeatedly until you get HH, does this result in the same average number of flips as if you repeat until you get HT? (No)

If you flip it repeatedly until either HH emerges or HT emerges, is either outcome equally likely? (Yes)

You play a game with a friend in which you each choose a sequence of three coin flips (e.g HHT and TTH). You then flip a coin repeatedly until one of the two patterns emerges, and whosever pattern it is wins the game. You get to see your friend’s choice of pattern before deciding yours. Are you ever able to bias the game in your favor? (Yes)

Are you always able to bias the game in your favor? (Yes!)

Here’s a wiki page with a good explanation of this: LINK. A table from that page illustrating a winning strategy for any choice your friend makes:

1st player’s choice 2nd player’s choice Odds in favour of 2nd player
HHH THH 7 to 1
HHT THH 3 to 1
HTH HHT 2 to 1
HTT HHT 2 to 1
THH TTH 2 to 1
THT TTH 2 to 1
TTH HTT 3 to 1
TTT HTT 7 to 1

# Simple induction

In front of you is a coin. You don’t know the bias of this coin, but you have some prior probability distribution over possible biases (between 0: always tails, and 1: always heads). This distribution has some statistical properties that characterize it, such as a standard deviation and a mean. And from this prior distribution, you can predict the outcome of the next coin toss.

Now the coin is flipped and lands heads. What is your prediction for the outcome of the next toss?

This is a dead simple example of a case where there is a correct answer to how to reason inductively. It is as correct as any deductive proof, and derives a precise and unambiguous result:

This is a law of rational thought, just as rules of logic are laws of rational thought. It’s interesting to me how the understanding of the structure of inductive reasoning begins to erode the apparent boundary between purely logical a priori reasoning and supposedly a posteriori inductive reasoning.

Anyway, here’s one simple conclusion that we can draw from the above image: After the coin lands heads, it should be more likely that the coin will land heads next time. After all, the initial credence was µ, and the final credence is µ multiplied by a value that is necessarily greater than 1.

You probably didn’t need to see an equation to guess that for each toss that lands H, future tosses landing H become more likely. But it’s nice to see the fundamental justification behind this intuition.

We can also examine some special cases. For instance, consider a uniform prior distribution (corresponding to maximum initial uncertainty about the coin bias). For this distribution (π = 1), µ = 1/2 and σ = 1/3. Thus, we arrive at the conclusion that after getting one heads, your credence in the next toss landing heads should be 13/18 (72%, up from 50%).

We can get a sense of the insufficiency of point estimates using this example. Two prior distributions with the same average value will respond very differently to evidence, and thus the final point estimate of the chance of H will differ. But what is interesting is that while the mean is insufficient, just the mean and standard deviation suffice for inferring the value of the next point estimate.

In general, the dynamics are controlled by the term σ/µ. As σ/µ goes to zero (which corresponds to a tiny standard deviation, or a very confident prior), our update goes to zero as well. And as σ/µ gets large (either by a weak prior or a low initial credence in the coin being H-biased), the observation of H causes a greater update.

How large can this term possibly get? Obviously, the updated point estimate should asymptote towards 1, but this is not obvious from the form of the equation we have (it looks like σ/µ can get arbitrarily large, forcing our final point estimate to infinity). What we need to do is optimize the updated point estimate, while taking into account the constraints implied by the relationship between σ and µ.

# Variational Bayesian inference

Today I learned a cool trick for practical implementation of Bayesian inference.

Bayesians are interested in calculating posterior probability distributions of unobserved parameters X, given data (which consists of the values of observed parameters Y).

To do so, they need only know the form of the likelihood function (the probability of Y given X) and their own prior distribution over X. Then they can apply Bayes’ rule…

P(X | Y) = P(Y | X) P(X) / P(Y)

… and voila, Bayesian inference complete.

The trickiest part of this process is calculating the term in the denominator, the marginal likelihood P(Y). Trying to calculate this term analytically is typically  very computationally expensive – it involves a sum over all possible values of the parameters of the likelihood multiplied by the prior. If Y is drawn from a continuous infinity of possible parameter values, then calculating the marginal likelihood amounts to solving a (typically completely intractable) integral.

P(Y) = ∫ P(Y | X) P(X) dX

Variational Bayesian inference is a procedure that solves this problem through a clever trick.

We start by searching for a posterior in a space of functions F that are easily integrable.

Our goal is not to find the exact form of the posterior, although if we do, that’s great. Instead, we want to find the function Q(X) within F that is as close to the posterior P(X | Y) as possible.

Distance between probability distributions is typically calculated by the information divergence D(Q, P), which is defined by…

D(Q, P) = ∫ Q(X) log(Q(X) / P(X|Y)) dX

To explicitly calculate and minimize this, we would need to know the form of the posterior P(X | Y) from the start. But let’s plug in the definition of conditional probability…

P(X | Y) = P(X, Y) / P(Y)

D(Q, P) = ∫ Q(X) log(Q(X) P(Y) / P(X, Y)) dX
= ∫ Q(X) log(Q(X) / P(X, Y)) dX  +  ∫ Q(X) log P(Y) dX

The second term is easily calculated. Since log(P(Y)) is not a function of X, the integral just becomes…

∫ Q(X) log P(Y) dX = log P(Y)

Rearranging, we get…

log P(Y) = D(Q, P)  –  ∫ Q(X) log(Q(X) / P(X, Y)) dX

The second term depends on Q(X) and the joint probability P(X, Y), which we can calculate easily as the product of the likelihood P(Y | X) and the prior P(X). We name it the variational free energy, L(Q).

log P(Y) = D(Q, P) + L(Q)

Now, on the left-hand side we have the log of the marginal likelihood, and on the right we have the information distance plus the variational free energy.

Notice that the left side is not a function of Q. This is really important! It tells us that if we’re trying to vary Q to minimize D(Q, P), then the right side will be a constant quantity.

In other words, any increase in L(Q) is necessarily a decrease in D(Q, P). What this means is that the Q that minimizes D(Q, P) is the same thing as the Q that maximizes L(Q)!

We can use this to minimize D(Q, P) without ever explicitly knowing P.

Recalling the definition of the variational free energy, we have…

L(Q) = – ∫ Q(X) log(Q(X) / P(X, Y)) dX
= ∫ Q(X) log P(X, Y) dX – ∫ Q(X) log Q(X) dX

Both of these integrals are computable insofar as we made a good choice for the function space F. Thus we can exactly find Q*, the best approximation to P in F. Then, knowing Q*, we can calculate L(Q*), which serves as a lower bound on the log of the marginal likelihood P(Y).

log P(Y) = D(Q, P) + L(Q)
so log P(Y) ≥ L(Q*)

Summing up…

1. Variational Bayesian inference approximates the posterior probability P(X | Y) with a function Q(X) in the function space F.
2. We find the function Q* that is as similar as possible to P(X | Y) by maximizing L(Q).
3. L(Q*) gives us a lower bound on the log of the marginal likelihood, log P(Y).

# Regularization as approximately Bayesian inference

In an earlier post, I showed how the procedure of minimizing sum of squares falls out of regular old frequentist inference. This time I’ll do something similar, but with regularization and Bayesian inference.

Regularization is essentially a technique in which you evaluate models in terms of not just their fit to the data, but also the values of the parameters involved. For instance, say you are modeling some data with a second-order polynomial.

M = { f(x) = a + bx + cx2 | a, b, c ∈ R }
D = { (x1, y1), …, (xN, yN) }

We can evaluate our model’s fit to the data with SOS:

SOS = ∑ (yn – f(xn))2

Minimizing SOS gives us the frequentist answer – the answer that best fits the data. But what if we suspect that the values of a, b, and c are probably small? In other words, what if we have an informative prior about the parameter values? Then we can explicitly add on a penalty term that increases the SOS, such as…

SOS with L1 regularization = k1 |a| + k2 |b| + k3 |c| + ∑ (yn – f(xn))2

The constants k1, k2, and k3 determine how much we will penalize each parameter a, b, and c. This is not the only form of regularization we could use, we could also use the L2 norm:

SOS with L2 regularization = k1 a2 + k2 b2 + k3 c2 + ∑ (yn – f(xn))2

In both of these cases, the regularized SOS term grows as the values of the parameters grow. This makes the optimal choice of curve take into account not only the fit to data, but the desired size of the parameters.

You might, having heard of this procedure, already suspect it of having a Bayesian bent. The notion of penalizing large parameter values on the basis of a prior suspicion that the values should be small sounds a lot like what the Bayesian would call “low priors on high parameter values.”

We’ll now make the connection explicit.

Frequentist inference tries to select the theory that makes the data most likely. Bayesian inference tries to select the theory that is made most likely by the data. I.e. frequentists choose f to maximize P(D | f), and Bayesians choose f to maximize P(f | D).

Assessing P(f | D) requires us to have a prior over our set of functions f, which we’ll call π(f).

P(f | D) = P(D | f) π(f) / P(D)

We take a logarithm to make everything easier:

log P(f | D) = log P(D | f) + log π(f) – log P(D)

We already evaluated P(D | f) in the last post, so we’ll just plug it in right away.

log P(f | D) = – SOS/2σ2 – N/2 log(2πσ2)) + log π(f) – log P(D)

Since we are maximizing with respect to f, two of these terms will fall away.

log P(f | D) = – SOS/2σ2 + log π(f) + constant

Now we just have to decide on the form of π(f). Since the functional form of f is determined by the values of the parameters {a, b, c}, π(f) = π(a, b, c). One plausible choice is a Gaussian centered around the values of each parameter:

π(f) = exp( -a2 / 2σa2 ) exp( -b2 / 2σb2 ) exp( -c2 / 2σc2 ) / √(8π3σa2σb2σc2)
log π(f) = -a2/2σa2 – b2/2σb2 – c2/2σc2 – ½ log(8π3σa2σb2σc2)

Now, throwing out terms that don’t depend on the values of the parameters, we find:

log P(f | D) = – SOS/2σ2 -a2/2σa2 – b2/2σb2 – c2/2σc2 + constant

This is exactly L2 regularization, where each kn = σ2n2. In other words, L2 regularization is Bayesian inference with Gaussian priors over the parameters!

What priors does L1 regularization correspond to?

log π(f) = -k1 |a| – k2 |b| – k3 |c|
π(a, b, c) = e-k1|a| e-k2|b| e-k3|a|

I.e. the L1 regularization prior is an exponential distribution.

This can be easily extended to any regularization technique. This is a way to get some insight into what your favorite regularization methods mean. They are ultimately to be cashed out in the form of your prior knowledge of the parameters!

# Why minimizing sum of squares is equivalent to frequentist inference

(This will be the first in a short series of posts describing how various commonly used statistical methods are approximate versions of frequentist, Bayesian, and Akaike-ian inference)

Suppose that we have some data D = { (x₁, y₁), (x₂, y₂), … , (xɴ, yɴ) }, and a candidate function y = f(x).

Frequentist inference involves the assessment of the likelihood of the data given this candidate function: P(D | f).

Since D is composed of N independent data points, we can assess the probability of each data point separately, and multiply them all together.

P(D | f) = P(x₁, y₁ | f) P(x₂, y₂ | f) … P(xɴ, yɴ | f)

So now we just need to answer the question: What is P(x, y | f)?

f predicts that for the value x, the most likely y-value is f(x).

The other possible y-values will be normally distributed around f(x).

The equation for this distribution is a Gaussian:

P(x, y | f) = exp[ -(y – f(x))² / 2σ² ] / √(2πσ²)

Now that we know how to find P(x, y | f), we can easily calculate P(D | F)!

P(D | f) = exp[ -(y – f(x))² / 2σ² ] /√(2πσ²) ･ exp[ -(y – f(x))² / 2σ² ] / √(2πσ²) … exp[ -(y – f(x))² / 2σ² ] / √(2πσ²)
= exp[ -(y – f(x))² / 2σ² ] ･ exp[ -(y – f(x))² / 2σ² ] … exp[ -(y – f(x))² / 2σ² ] / (2πσ²)N/2

Products are messy and logarithms are monotonic, so log(P(D | f)) is easier to work with: it turns the product into a sum.

log P(D | f) = log( exp[ -(y₁ – f(x₁))² / 2σ² ] … exp[ -(yɴ – f(xɴ))² / 2σ² ] / (2πσ²)N/2 )
= log( exp[ -(y₁ – f(x₁))² / 2σ² ] ) + … log( exp[ -(yɴ – f(xɴ))² / 2σ² ] ) – N/2 log(2πσ²)
= -(y₁ – f(x₁))² / 2σ² ) + -(yɴ – f(xɴ))² / 2σ² ) – N/2 log(2πσ²)
= -1/2σ² [ (y₁ – f(x₁))² + … +(yɴ – f(xɴ))² ] – N/2 log(2πσ²)

Now notice that the sum of squares just naturally pops out!

SOS = (y₁ – f(x₁))² + … + (yɴ – f(xɴ))²
log P(D | f) = -SOS/2σ² – N/2 log(2πσ²)

Frequentist inference chooses f to maximize P(D | f). We can now immediately see why this is equivalent to minimizing SOS!

argmax{ P(D | f) }
= argmax{ log P(D | f) }
= argmax{ – SOS/2σ² – N/2 log(2πσ²) }
= argmin{ SOS/2σ² + N/2 log(2πσ²) }
= argmin{ SOS/2σ² }
= argmin{ SOS }

Next, we’ll go Bayesian…