How to Learn From Data, Part 2: Evaluating Models

I ended the last post by saying that the solution to the problem of overfitting all relied on the concept of a model. So what is a model? Quite simply, a model is just a set of functions, each of which is a candidate for the true distribution that generates the data.

Why use a model? If we think about a model as just a set of functions, this seems kinda abstract and strange. But I think it can be made intuitive, and that in fact we reason in terms of models all the time. Think about how physicists formulate their theories. Laws of physics have two parts: First they specify the types of functional relationships there are in the world. And second, they specify the value of particular parameters in those functions. This is exactly how a model works!

Take Newton’s theory of gravity. The fundamental claim of this theory is that there is a certain functional relationship between the quantities a (the acceleration of an object), M (the mass of a nearby object), and r (the distance between the two objects): a ~ M/r2. To make this relationship precise, we need to include some constant parameter G that says exactly what this proportionality is: a = GM/r2.

So we start off with a huge set of probability distributions over observations of the accelerations of particles (one for each value of G), and then we gather data to see which value of G is most likely to be right. Einstein’s theory of gravity was another different model, specifying a different functional relationship between a, M, and r and involving its own parameters. So to compare Einstein’s theory of gravity to Newton’s is to compare different models of the phenomenon of gravitation.

When you start to think about it, the concept of a model is an enormous one. Models are ubiquitous, you can’t escape from them. Many fancy methods in machine learning are really just glorified model selection. For instance, a neural network is a model: the architecture of the network specifies a particular functional relationship between the input and the output, and the strengths of connections between the different neurons are the parameters of the model.

So. What are some methods for assessing predictive accuracy of models? The first one we’ll talk about is a super beautiful and clever technique called…

Cross Validation

The fundamental idea of cross validation is that you can approximate how well a model will do on the next data point by evaluating how the model would have done at predicting a subset of your data, if all it had access to was the rest of your data.

I’ll illustrate this with a simple example. Suppose you have a data set of four points, each of which is a tuple of two real numbers. Your two models are T1: that the relationship is linear, and T2: that the relationship is quadratic.

What we can do is go through each data point, selecting each in order, train the model on the non-selected data points, and then evaluate how well this trained model fits the selected data point. (By training the model, I mean something like “finding the curve within the model that best fits the non-selected data points.”) Here’s a sketch of what this looks like:

img_0045.jpeg

Notice that each different data point we select gives a different curve! This is important

There are a bunch of different ways of doing cross validation. We just left out one data point at a time, but we could have left out two points at a time, or three, or any k less than the total number of points N. This is called leave-k-out cross validation. If we partition our data by choosing a fraction of it for testing, then we get what’s called n-fold cross validation (where 1/n is the fraction of the data that is isolated for testing).

We also have some other choices besides how to partition the data. For instance, we can choose to train our model via a Likelihoodist procedure or a Bayesian procedure. And we can also choose to test our model in various ways, by using different metrics for evaluating the distance between the testing set and the trained model. In leave-one-out cross validation (LOOCV), a pretty popular method, both the training and testing procedures are Likelihoodist.

Now, there’s a practical problem with cross-validation, which is that it can take a long time to compute. If we do LOOCV with N data points and look at all ways of leaving out one data point, then we end up doing N optimization procedures (one for each time you train your model), each of which can be pretty slow. But putting aside practical issues, for thinking about theoretical rationality, this is a super beautiful technique.

Next we’ll move on to…

Bayesian Model Selection!

We had Bayesian procedures for evaluating individual functions. Now we can go full Bayes and apply it to models! For each model, we assess the posterior probability of the model given the data using Bayes’ rule as usual:

Pr(M | D) = \frac{Pr(D | M)}{Pr(D)} Pr(M)

Now… what is the prior Pr(M)? Again, it’s unspecified. Maybe there’s a good prior that solves overfitting, but it’s not immediately obvious what exactly it would be.

But there’s something really cool here. In this equation, there’s a solution to overfitting that pops up not out of the prior, but out of the LIKELIHOOD! I explain this here. The general idea is that models that are prone to overfitting get that way because their space of functions is very large. But if the space of functions is large, then the average prior on each function in the model must be small. So larger models have, on average, smaller values of the term Pr(f | M), and correspondingly (via Pr(D | M) = \sum_{f \in M} Pr(D | f, M) Pr(f | M) ) get a weaker update from evidence.

So even without saying anything about the prior, we can see that Bayesian model selection provides a potential antidote to overfitting. But just like before, we have the practical problem that computing Pr(M | D) is in general very hard. Usually evaluating Pr(D | M) involves calculating a really complicated many-dimensional integral, and calculating many-dimensional integrals can be very computationally expensive.

Might we be able to find some simple unbiased estimator of the posterior Pr(M | D)? It turns out that yes, we can. Take another look at the equation above.

Pr(M | D) = \frac{Pr(D | M)}{Pr(D)} Pr(M)

Since Pr(D) is a constant across models, we can ignore it when comparing models. So for our purposes, we can attempt to maximize the equation:

Pr(D | M) Pr(M) = \sum_{f \in M} Pr(D | f, M) Pr(f | M) Pr(M)

If we assume that P(D | f) is twice differentiable in the model parameters and that Pr(f) behaves roughly linearly around the maximum likelihood function, we can approximate this as:

Pr(D | M) Pr(M) \approx \frac {Pr(D | f^*(D))} {\sqrt{N}}^k Pr(M)

f* is the maximum likelihood function within the model, and k is the the number of parameters of the model (e.g. k = 2 for a linear model and k = 3 for a quadratic model).

\We can make this a little neater by taking a logarithm and defining a new quantity: the Bayesian Information Criterion.

argmax_M [ Pr(M | D) ] \\~\\  = argmax_M [ Pr(D | M) Pr(M) ] \\~\\  \approx argmax_M [ \frac {Pr(D | f^*(D))} {\sqrt{N}}^k Pr(M)] \\~\\  = argmax_M [ \log(Pr(M)) + \log (Pr(D | f^*(D)) - \frac{k}{2} \log(N) ] \\~\\  = argmin_M [ BIC - \log Pr(M) ]

BIC = \frac{k}{2} log(N) - \log \left( Pr(D | f^*(D)) \right)

Thus, to maximize the posterior probability of a model M is roughly the same as to minimize the quantity BIC – log Pr(M).

If we assume that our prior over models is constant (that is, that all models are equally probable at the outset), then we just minimize BIC.

Bayesian Information Criterion

Notice how incredibly simple this expression is to evaluate! If all we’re doing is minimizing BIC, then we only need to find the maximum likelihood function f*(D), assess the value of Pr(D | f*), and then penalize this quantity with the factor k/2 log(N)! This penalty scales in proportion to the model complexity, and thus helps us avoid overfitting.

We can think about this as a way to make precise the claims above about Bayesian Model Selection penalizing overfitting in the likelihood. Remember that minimizing BIC made sense when we assumed a uniform prior over models (and therefore when our prior doesn’t penalize overfitting). So even when we don’t penalize overfitting in the prior, we still end up getting a penalty! This penalty must come from the likelihood.

Some more interesting facts about BIC:

  • It is approximately equal to the minimum description length criterion (which I haven’t discussed here)
  • It is only valid when N >> k. So it’s not good for small data sets or large models.
  • It the truth is contained within your set of models, then BIC will select the truth with probability 1 as N goes to infinity.

Okay, stepping back. We sort of have two distinct clusters of approaches to model selection. There’s Bayesian Model Selection and the Bayesian Information Criterion, and then there’s Cross Validation. Both sound really nice. But how do they compare? It turns out that in general they give qualitatively different answers. And in general, the answers you get using cross validation tend to be more predictively accurate that those that you get from BIC/BMS.

A natural question to ask at this point is: BIC was a nice simple approximation to BMS. Is there a corresponding nice simple approximation to cross validation? Well first we must ask: which cross validation? Remember that there was a plethora of different forms of cross validation, each corresponding to slightly different criterion for evaluating fits. We can’t assume that all these methods give the same answer.

Let’s choose leave-one-out cross validation. It turns out that yes, there is a nice approximation that is asymptotically equivalent to LOOCV! This is called the Akaike information criterion.

Akaike Information Criterion

First, let’s define AIC:

AIC = k - \log \left( Pr(D | f^*(D)) \right)

Like always, k is the number of parameters in M and f* is chosen by the Likelihoodist approach described in the last post.

What you get by minimizing this quantity is asymptotically equivalent to what you get by doing leave-one-out cross validation. Compare this to BIC:

BIC = \frac{k}{2} \log (N) - \log \left( Pr(D | f^*(D) ) \right)

There’s a qualitative difference in how the parameter penalty is weighted! BIC is going to have a WAY higher complexity penalty than AIC. This means that AIC should in general choose less simple models than BIC.

Now, we’ve already seen one reason why AIC is good: it’s a super simple approximation of LOOCV and LOOCV is good. But wait, there’s more! AIC can be derived as an unbiased estimator of the Kullback-Leibler divergence DKL.

What is DKL? It’s a measure of the information theoretic distance of a model from truth. For instance, if the true generating distribution for a process is f, and our model of this process is g, then DKL tells us how much information we lose by using g to represent f. Formally, DKL is:

D_{KL}(f, g) = \int { f(x) \log \left(\frac{f(x)} {g(x)} \right) dx }

Notice that to actually calculate this quantity, you have to already know what the true distribution f is. So that’s unfeasible. BUT! AIC gives you an unbiased estimate of its value! (The proof of this is complicated, and makes the assumptions that N >> k, and that the model is not far from the truth.)

Thus, AIC gives you an unbiased estimate of which model is closest to the truth. Even if the truth is not actually contained within your set of models, what you’ll end up with is the closest model to it. And correspondingly, you’ll end up getting a theory of the process that is very similar to how the process actually works.

There’s another version of AIC that works well for smaller sample sizes and larger models called AICc.

AICc = AIC + \frac {k (k + 1)} {N - (k + 1)}

And interestingly, AIC can be derived as an approximation to Bayesian Model Selection with a particular prior over models. (Remember earlier when we showed that argmax_M \left( Pr(M | D) \right) \approx argmin_M \left( BIC - \log(Pr(M)) \right) ? Well, just set \log Pr(M) = BIC - AIC = k \left( \frac{1}{2} \log N - 1 \right) , and you get the desired result.) Interestingly, this prior ends up rewarding theories for having lots of parameters! This looks pretty bad… it seems like AIC is what you get when you take Bayesian Model Selection and then try to choose a prior that favors overfitting theories. But the practical use and theoretical virtues of AIC warrant, in my opinion, taking a closer look at what’s going on here. Perhaps what’s going on is that the likelihood term Pr(D | M) is actually doing too much to avoid overfitting, so in the end what we need in our prior is one that avoids underfitting! Regardless, we can think of AIC as specifying the unique good prior on models that optimizes for predictive accuracy.

There’s a lot more to be said from here. This is only a short wade into the waters of statistical inference and model selection. But I think this is a good place to stop, and I hope that I’ve given you a sense of the philosophical richness of the various different frameworks for inference I’ve presented here.

How to Learn From Data, Part I: Evaluating Simple Hypotheses

Here’s a general question of great importance: How should we adjust our model of the world in response to data? This post is a tour of a few of the big ideas that humans have come up with to address this question. I find learning about this topic immensely rewarding, and so I shall attempt to share it! What I love most of all is thinking about how all of this applies to real world examples of inference from data, so I encourage you to constantly apply what I say to situations that you find personally interesting.

First of all, we want a formal description of what we mean by data. Let’s describe our data quite simply as a set: D = \{(x_1, y_1), (x_2, y_2),..., (x_N, y_N)\}, where each x_i is a value that you choose (your independent variable) and each y_i is the resulting value that you measure (the dependent variable). Each of these variables can be pretty much anything – real numbers, n-tuples of real numbers, integers, colors, whatever you want. The goal here is to embrace generality, so as to have a framework that applies to many kinds of inference.

Now, suppose that you have a certain theory about the underlying relationship between the x variables and the y variables. This theory might take the form of a simple function: T: y = f(x) We interpret T as making a prediction of a particular value of the dependent variable for each value of the independent variable. Maybe the data is the temperatures of regions at various altitudes, and our theory T says that one over the temperature (1/T) is some particular linear function of the altitude.

What we want is a notion of how good of a theory T is, given our data. Intuitively, we might think about doing this by simply assessing the distance between each data point y_n and the predicted value of y at that point: f(x_n), using some metric, then adding them all up. But there are a whole bunch of distance metrics available to us. Which one should we use? Perhaps the taxicab measure comes to mind (\sum_{n=1}^N {|y_n - f(x_n)|}), or the sum of the squares of the differences (SOS = \sum_{n=1}^N {(y_n - f(x_n))^2}). We want a good theoretical justification for why any one of these metrics should be preferred over any other, since in general they lead to different conclusions. We’ll see just such justifications shortly. Keep the equation for SOS in mind, as it will turn up repeatedly ahead.

Now here’s a problem: If we want a probabilistic evaluation of T, then we have to face the fact that it makes a deterministic prediction. Our theory seems to predict with 100% probability that at x_n the observed y_n will be precisely f(x_n). If it’s even slightly off of this, then our theory will be probabilistically disconfirmed to zero.

We can solve this problem by modifying our theory T to not just be a theory of the data, but also a theory of error. In other words, we expect that the value we get will not be exactly the predicted value, and give some account of how on average observation should differ from theory.

T: y = f(x) + \epsilon, where \epsilon is some random variable drawn from a probability distribution P_E.

This error distribution can be whatever we please – Gaussian, exponential, Poisson, whatever. For simplicity let’s say that we know the error is normal (drawn from a Gaussian distribution) with a known standard deviation σ.

T: y = f(x) + \epsilon, \epsilon \sim N(0, \sigma)

A note on notation here: \epsilon \sim N(\mu, \sigma) denotes a random variable drawn from a Gaussian distribution centered around \mu with a standard deviation of \sigma.

This gives us a sensible notion of the probability of obtaining some value of y from a chosen x given the theory T.

Pr(y | x, T) \sim N(f(x), \sigma) \\~\\  Pr(y | x, T) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2 \sigma^2} (y - f(x))^2} 

Nice! This is a crucial step towards figuring out how to evaluate theories: we’ve developed a formalism for describing precisely how likely a data point is, given a particular theory (which, remember, so far is just a function from values of independent variables to values of dependent variables, coupled with a theory of how the error in your observations works).

Let’s try to extend this to our entire data set D. We want to assess the probability of the particular values of the dependent variables, given the chosen values of the dependent variables and the function. We’ll call this the probability of D given f.

Pr(D | f) = Pr(x_1, y_1, x_2, y_2,..., x_N, y_N | T) \\~\\  Pr(D | f) = Pr(y_1, y_2,...,y_N | x_1,  x_2,..., x_N, T)

(It’s okay to move the values of the independent variable x across the conditional bar because of our assumption that we know beforehand what values x will take on.)

But now we run into a problem: we can’t really do anything with this expression without knowing how the data points depend upon each other. We’d like to break it into individual terms, each of which can be evaluated by the above expression for Pr(y | x, T). But that would require an assumption that our data points are independent. In general, we cannot grant this assumption. But what we can do is expand our theory once more to include a theory of the dependencies in our data points. For simplicity of explication, let’s proceed with the assumption that each of our observations is independent of the others. Said another way, we assume that given T, x_n screens off all other variables from y_n). Combined with our assumption of normal error, this gives us a nice simple reduction.

Pr(D | f) = Pr(y_1, y_2,...,y_N | x_1,  x_2,..., x_N, T) \\~\\  Pr(D | f) = Pr(y_1 | x_1, T) Pr(y_2 | x_2, T) ... Pr(y_N | x_N, T) \\~\\  Pr(D | f) = \prod_{n=1}^N { \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2 \sigma^2} (y_n - f(x_n))^2} } \\~\\  Pr(D | f) = (2 \pi \sigma^2)^{-N/2} e^{-\frac{1}{2 \sigma^2} \sum_{n=1}^N {(y_n - f(x_n))^2}} \\~\\  Pr(D | f) = (2 \pi \sigma^2)^{-N/2} e^{-\frac{SOS(f, D)}{2 \sigma^2} }

We see an interesting connection arise here between Pr(D | f) and the sum of squares evaluation of fit. More will be said of this in just a moment.

But first, let’s take a step back. Our ultimate goal here is to find a criterion for theory evaluation. And here we’ve arrived at an expression that looks like it might be right for that role! Perhaps we want to say that our criterion for evaluating theories is just maximization of Pr(D | f). This makes some intuitive sense… a good theory is one which predicts the observed data. If general relativity or quantum mechanics predicted that the sun should orbit the earth, or that atoms should be unstable, then we wouldn’t be very in favor of it.

And so we get the Likelihoodism Thesis!

Likelihoodism: The best theory f* given some data D is that which maximizes Pr(D | f). Formally: f^*(D) = argmax_f [ Pr(D | f) ]

Now, since logarithms are monotonic, we can express this in a more familiar form:

argmax_f [Pr(D | F)] = argmax_f \left[ e^{-\frac{SOS(f, D)}{2 \sigma^2} } \right] = argmax_f \left[-\frac{SOS(f, D)}{2 \sigma^2} \right] = argmin_f [ SOS(f, D) ]

Thus the best theory according to Likelihoodism is that which minimizes the sum of squares! People minimize sums of squares all the time, and most of the time they don’t realize that it can be derived from the Likelihoodism Thesis and the assumptions of Gaussian error and independent data. If our assumptions had been different, then we would have found a different expression, and SOS might no longer be appropriate! For instance, the taxicab metric arises as the correct metric if we assume exponential error rather than normal error. I encourage you to see why this is for yourself. It’s a fun exercise to see how different assumptions about the data give rise to different metrics for evaluating the distance between theory and observation. In general, you should now be able to take any theory of error and assess what the proper metric for “degree of fit of theory to data” is, assuming that degree of fit is evaluated by maximizing Pr(D | f).

Now, there’s a well-known problem with the Likelihoodism thesis. This is that the function f that minimizes SOS(f, D) for some data D is typically going to be a ridiculously overcomplicated function that perfectly fits each data points, but does terribly on future data. The function will miss the underlying trend in the data for the noise in the observations, and as a result fail to get predictive accuracy.

This is the problem of overfitting. Likelihoodism will always prefer theories that overfit to those that don’t, and as a result will fail to identify underlying patterns in data and do a terrible job at predicting future data.

How do we solve this? We need a replacement for the Likelihoodism thesis. Here’s a suggestion: we might say that the problem stems from the fact that the Likelihoodist procedure recommends us to find a function that makes the data most probable, rather than finding a function that is made most probable by the data. From this suggestion we get the Bayesian thesis:

Bayesianism: The best theory f given some data is that which maximizes Pr(f | D). f^*(D) = argmax_f [ Pr(f | D) ]

Now, what is this Pr(f | D) term? It’s an expression that we haven’t seen so far. How do we evaluate it? We simply use the theorem that the thesis is named for: Bayes’ rule!

Pr(f | D) = \frac{Pr(D | f)}{Pr(D)} Pr(f)

This famous theorem is a simple deductive consequence of the probability axioms and the definition of conditional probability. And it happens to be exactly what we need here. Notice that the right-hand side consists of the term Pr(D | f), which we already know how to calculate. And since we’re ultimately only interested in varying f to find the function that maximizes this expression, we can ignore the constant term in the denominator.

argmax_f [ Pr(f | D) ] = argmax_f [ Pr(D | f) Pr(f) ] \\~\\  argmax_f [ Pr(f | D) ] = argmax_f [ \log Pr(D | f) + \log Pr(f) ] \\~\\  argmax_f [ Pr(f | D) ] = argmax_f [ -\frac{SOS(f, D)}{2 \sigma^2} + \log Pr(f) ] \\~\\  argmax_f [ Pr(f | D) ] = argmin_f [ SOS(f, D) - 2 \sigma^2 \log Pr(f) ]

The last steps we get just by substituting in what we calculated before. The 2σ² in the first term comes from the fact that the exponent of our Gaussian is (y – f(x))² / 2σ². We ignored it before because it was just a constant, but since we now have another term in the expression being maximized, we have to keep track of it again.

Notice that what we get is just what we had initially (a sum of squares) plus an additional term involving a mysterious Pr(f). What is this Pr(f)? It’s our prior distribution over theories. Because of the negative sign in front of the second term, a larger value of Pr(f) gives a smaller value for the expression that we are minimizing. Similarly, the more closely our function follows the data, the smaller the SOS term becomes. So what we get is a balance between fitting the data and having a high prior. A theory that fits the data perfectly can still end up with a bad evaluation, as long as it has a low enough prior. And a theory that fits the data poorly can end up with a great evaluation if it has a high enough prior.

(Those that are familiar with machine learning might notice that this feels similar to regularization. They’re right! It turns out that different regularization techniques just end up corresponding to different Bayesian priors! L2 regularization corresponds to a Gaussian prior over parameters, and L1 regularization corresponds to an exponential prior over parameters. And so on. Different prior assumptions about the process generating the data lead naturally to different regularization techniques.)

What this means is that if we are trying to solve overfitting, then the Bayesianism thesis provides us a hopeful ray of light. Maybe we can find some perfect prior Pr(f) that penalizes complex overfitting theories just enough to give us the best prediction. But unsurprisingly, finding such a prior is no easy matter. And importantly, the Bayesian thesis as I’ve described it gives us no explicit prescription for what this prior should be, making it insufficient for a full account of inference. What Bayesianism does is open up a space of possible methods of theory evaluation, inside which (we hope) there might be a solution to our problem.

Okay, let’s take another big step back. How do we progress from here? Well, let’s think for a moment about what our ultimate goal is. We want to say that overfitting is bad, and that therefore some priors are better than others insofar as they prevent overfitting. But what is the standard we’re using to determine that overfitting is bad? What’s wrong with an overfitting theory?

Here’s a plausible answer: The problem with overfitting is that while it maximizes descriptive accuracy, it leads to poor predictive accuracy! I.e., theories that overfit do a great job at describing past data, but tend to do a poor job of matching future data.

Let’s take this idea and run with it. Perhaps all that we truly care about in theory selection is predictive accuracy. I’ll give a name to this thesis:

Predictivism: The ultimate standard for theory evaluation is predictive accuracy. I.e., the best theory is that which does the best at predicting future data.

Predictivism is a different type of thesis than Bayesianism and Likelihoodism. While each of those gave precise prescriptions for how to calculate the best theory, Predictivism does not. If we want an explicit algorithm for computing the best theory, then we need to say what exactly “predictive accuracy” means. This means that as far as we know so far, Predictivism could be equivalent to Likelihoodism or to Bayesianism. It all depends on what method we end up using to determine what theory has the greatest predictive accuracy. Of course, we have good reasons to suspect that Likelihoodism is not identical to Predictivism (Likelihoodism overfits, and we suspect that Predictivism will not) or to Bayesianism (Bayesianism does not prescribe a particular prior, so it gives no unique prescription for the best theory). But what exactly Predictivism does say depends greatly on how exactly we formalize the notion of predictive accuracy.

Another difference between Predictivism and Likelihoodism/Bayesianism is that predictive accuracy is about future data. While it’s in principle possible to find an analytic solution to P(D | f) or P(f | D), Predictivism seems to require us to compute terms involving future data! But these are impossible to specify, because we don’t know what the future data will be!

Can we somehow approximate what the best theory is according to Predictivism, by taking our best guess at what the future data will be? Maybe we can try to take the expected value of something like Pr(y_{N+1} | x_{N+1}, T, D) , where (x_{N+1}, y_{N+1}) is a future data point. But there are two big problems with this.

First, in taking an expected value, you must use the distribution of outcomes given by your own theory. But then your evaluation picks up an inevitable bias! You can’t rely on your own theories to assess how good your theories are.

And second, by the assumptions we’ve included in our theories so far, the future data will be independent of the past data! This is a really big problem. We want to use our past data to give some sense of how well we’ll do on future data. But if our data is independent, then our assessment of how well T will do on the next data point will have to be independent of the past data as well! After all, no matter what value y_{N+1} has, Pr(y_{N+1} | x_{N+1}, T, D) = Pr(y_{N+1} | x_{N+1}, T).

So is there any way out of this? Surprisingly, yes! If we get creative, we can find ways to approximate this probability with some minimal assumptions about the data. These methods all end up relying on a new concept that we haven’t yet discussed: the concept of a model. The next post will go into more detail on how this works. Stay tuned!

Gödel’s Second Incompleteness Theorem: Explained in Words of Only One Syllable

Somebody recently referred me to a 1994 paper by George Boolos in which he writes out a description of Gödel’s Second Incompleteness Theorem, using only words of one syllable. I love it so much that I’m going to copy the whole thing here in this post. Enjoy!

First of all, when I say “proved”, what I will mean is “proved with the aid of the whole of math”. Now then: two plus two is four, as you well know. And, of course, it can be proved that two plus two is four (proved, that is, with the aid of the whole of math, as I said, though in the case of two plus two, of course we do not need the whole of math to prove that it is four). And, as may not be quite so clear, it can be proved that it can be proved that two plus two is four, as well. And it can be proved that it can be proved that it can be proved that two plus two is four. And so on. In fact, if a claim can be proved, then it can be proved that the claim can be proved. And that too can be proved.

Now, two plus two is not five. And it can be proved that two plus two is not five. And it can be proved that it can be proved that two plus two is not five, and so on.

Thus: it can be proved that two plus two is not five. Can it be proved as well that two plus two is five? It would be a real blow to math, to say the least, if it could. If it could be proved that two plus two is five, then it could be proved that five is not five, and then there would be no claim that could not be proved, and math would be a lot of bunk.

So, we now want to ask, can it be proved that it can’t be proved that two plus two is five? Here’s the shock: no, it can’t. Or, to hedge a bit: if it can be proved that it can’t be proved that two plus two is five, then it can be proved as well that two plus two is five, and math is a lot of bunk. In fact, if math is not a lot of bunk, then no claim of the form “claim X can’t be proved” can be proved.

So, if math is not a lot of bunk, then, though it can’t be proved that two plus two is five, it can’t be proved that it can’t be proved that two plus two is five.

By the way, in case you’d like to know: yes, it can be proved that if it can be proved that it can’t be proved that two plus two is five, then it can be proved that two plus two is five.

George Boolos, Mind, Vol. 103, January 1994, pp. 1 – 3

Anti-inductive priors

I used to think of Bayesianism as composed of two distinct parts: (1) setting priors and (2) updating by conditionalizing. In my mind, this second part was the crown jewel of Bayesian epistemology, while the first part was a little more philosophically problematic. Conditionalization tells you that for any prior distribution you might have, there is a unique rational set of new credences that you should adopt upon receiving evidence, and tells you how to get it. As to what the right priors are, well, that’s a different story. But we can at least set aside worries about priors with assurances about how even a bad prior will eventually be made up for in the long run after receiving enough evidence.

But now I’m realizing that this framing is pretty far off. It turns out that there aren’t really two independent processes going on, just one (and the philosophically problematic one at that): prior-setting. Your prior fully determines what happens when you update by conditionalization on any future evidence you receive. And the set of priors consistent with the probability axioms is large enough that it allows for this updating process to be extremely irrational.

I’ll illustrate what I’m talking about with an example.

Let’s imagine a really simple universe of discourse, consisting of just two objects and one predicate. We’ll make our predicate “is green” and denote objects a_1 and a_2 . Now, if we are being good Bayesians, then we should treat our credences as a probability distribution over the set of all state descriptions of the universe. These probabilities should all be derivable from some hypothetical prior probability distribution over the state descriptions, such that our credences at any later time are just the result of conditioning that prior on the total evidence we have by that time.

Let’s imagine that we start out knowing nothing (i.e. our starting credences are identical to the hypothetical prior) and then learn that one of the objects (a_1 ) is green. In the absence of any other information, then by induction, we should become more confident that the other object is green as well. Is this guaranteed by just updating?

No! Some priors will allow induction to happen, but others will make you unresponsive to evidence. Still others will make you anti-inductive, becoming more and more confident that the next object is not green the more green things you observe. And all of this is perfectly consistent with the laws of probability theory!

Take a look at the following three possible prior distributions over our simple language:

Screen Shot 2018-10-21 at 1.58.45 PM.png

According to P_1 , your new credence in Ga_2 after observing Ga_1 is P_1(Ga_2 | Ga_1) = 0.80 , while your prior credence in Ga_2 was 0.50. Thus P_1 is an inductive prior; you get more confident in future objects being green when you observe past objects being green.

For P_2 , we have that P_2(Ga_2 | Ga_1) = 0.50 , and P_2(Ga_2) = 0.50 as well. Thus P_2 is a non-inductive prior: observing instances of green things doesn’t make future instances of green things more likely.

And finally, P_3(Ga_2 | Ga_1) = 0.20 , while P_3(Ga_2) = 0.5 . Thus P_3 is an anti-inductive prior. Observing that one object is green makes you more than two times less confident confident that the next object will be green.

The anti-inductive prior can be made even more stark by just increasing the gap between the prior probability of Ga_1 \wedge Ga_2 and Ga_1 \wedge -Ga_2 . It is perfectly consistent with the axioms of probability theory for observing a green object to make you almost entirely certain that the next object you observe will not be green.

Our universe of discourse here was very simple (one predicate and two objects). But the point generalizes. Regardless of how many objects and predicates there are in your language, you can have non-inductive or anti-inductive priors. And it isn’t even the case that there are fewer anti-inductive priors than inductive priors!

The deeper point here is that the prior is doing all the epistemic work. Your prior isn’t just an initial credence distribution over possible hypotheses, it also dictates how you will respond to any possible evidence you might receive. That’s why it’s a mistake to think of prior-setting and updating-by-conditionalization as two distinct processes. The results of updating by conditionalization are determined entirely by the form of your prior!

This really emphasizes the importance of having good criterion for setting priors. If we’re trying to formalize scientific inquiry, it’s really important to make sure our formalism rules out the possibility of anti-induction. But this just amounts to requiring rational agents to have constraints on their priors that go above and beyond the probability axioms!

What are these constraints? Do they select one unique best prior? The challenge is that actually finding a uniquely rationally justifiable prior is really hard. Carnap tried a bunch of different techniques for generating such a prior and was unsatisfied with all of them, and there isn’t any real consensus on what exactly this unique prior would be. Even worse, all such suggestions seem to end up being hostage to problems of language dependence – that is, that the “uniquely best prior” changes when you make an arbitrary translation from your language into a different language.

It looks to me like our best option is to abandon the idea of a single best prior (and with it, the notion that rational agents with the same total evidence can’t disagree). This doesn’t have to lead to total epistemic anarchy, where all beliefs are just as rational as all others. Instead, we can place constraints on the set of rationally permissible priors that prohibit things like anti-induction. While identifying a set of constraints seems like a tough task, it seems much more feasible than the task of justifying objective Bayesianism.

Making sense of improbability

Imagine that you take a coin that you believe to be fair and flip it 20 times. Each time it lands heads. You say to your friend: “Wow, what a crazy coincidence! There was a 1 in 220 chance of this outcome. That’s less than one in a million! Super surprising.”

Your friend replies: “I don’t understand. What’s so crazy about the result you got? Any other possible outcome (say, HHTHTTTHTHHHTHTTHHHH) had an equal probability as getting all heads. So what’s so surprising?”

Responding to this is a little tricky. After all, it is the case that for a fair coin, the probability of 20 heads = the probability of HHTHTTTHTHHHTHTTHHHH = roughly one in a million.

Simpler Example_ Five Tosses.png

So in some sense your friend is right that there’s something unusual about saying that one of these outcomes is more surprising than another.

You might answer by saying “Well, let’s parse up the possible outcomes by the number of heads and tails. The outcome I got had 20 heads and 0 tails. Your example outcome had 12 heads and 8 tails. There are many many ways of getting 12 heads and 8 tails than of getting 20 heads and 0 tails, right? And there’s only one way of getting all 20 heads. So that’s why it’s so surprising.”

Probability vs. Number of heads (1).png

Your friend replies: “But hold on, now you’re just throwing out information. Sure my example outcome had 12 heads and 8 tails. But while there’s many ways of getting that number of heads and tails, there’s only exactly one way of getting the result I named! You’re only saying that your outcome is less likely because you’ve glossed over the details of my outcome that make it equally unlikely: the order of heads and tails!”

I think this is a pretty powerful response. What we want is a way to say that HHHHHHHHHHHHHHHHHHHH is surprising while HHTHTTTHTHHHTHTTHHHH is not, not that 20 heads is surprising while 12 heads and 8 tails is unsurprising. But it’s not immediately clear how we can say this.

Consider the information theoretic formalization of surprise, in which the surprisingness of an event E is proportional to the negative log of the probability of that event: Sur(E) = -log(P(E)). There are some nice reasons for this being a good definition of surprise, and it tells us that two equiprobable events should be equally surprising. If E is the event of observing all heads and E’ is the event of observing the sequence HHTHTTTHTHHHTHTTHHHH, then P(E) = P(E’) = 1/220. Correspondingly, Sur(E) = Sur(E’). So according to one reasonable formalization of what we mean by surprisingness, the two sequences of coin tosses are equally surprising. And yet, we want to say that there is something more epistemically significant about the first than the second.

(By the way, observing 20 heads is roughly 6.7 times more surprising than observing 12 heads and 8 tails, according to the above definition. We can plot the surprise curve to see how maximum surprise occurs at the two ends of the distribution, at which point it is 20 bits.)

Surprise vs. number of heads (1).png

So there is our puzzle: in what sense does it make sense to say that observing 20 heads in a row is more surprising than observing the sequence HHTHTTTHTHHHTHTTHHHH? We certainly have strong intuitions that this is true, but do these intuitions make sense? How can we ground the intuitive implausibility of getting 20 heads? In this post I’ll try to point towards a solution to this puzzle.

Okay, so I want to start out by categorizing three different perspectives on the observed sequence of coin tosses. These correspond to (1) looking at just the outcome, (2) looking at the way in which the observation affects the rest of your beliefs, and (3) looking at how the observation affects your expectation of future observations. In probability terms, these correspond to the P(E), P(T| T) and P(E’ | E).

Looking at things through the first perspective, all outcomes are equiprobable, so there is nothing more epistemically significant about one than the other.

But considering the second way of thinking about things, there can be big differences in the significance of two equally probable observations. For instance, suppose that our set of theories under consideration are just the set of all possible biases of the coin, and our credences are initially peaked at .5 (an unbiased coin). Observing HHTHTTTHTHHHTHTTHHHH does little to change our prior. It shifts a little bit in the direction of a bias towards heads, but not significantly. On the other hand, observing all heads should have a massive effect on your beliefs, skewing them exponentially in the direction of extreme heads biases.

Importantly, since we’re looking at beliefs about coin bias, our distributions are now insensitive to any details about the coin flip beyond the number of heads and tails! As far as our beliefs about the coin bias go, finding only the first 8 to be tails looks identical to finding the last 8 to be tails. We’re not throwing out the information about the particular pattern of heads and tails, it’s just become irrelevant for the purposes of consideration of the possible biases of the coin.

Visualizing change in beliefs about coin bias.png

If we want to give a single value to quantify the difference in epistemic states resulting from the two observations, we can try looking at features of these distributions. For instance, we could look at the change in entropy of our distribution if we see E and compare it to the change in entropy upon seeing E’. This gives us a measure of how different observations might affect our uncertainty levels. (In our example, observing HHTHTTTHTHHHTHTTHHHH decreases uncertainty by about 0.8 bits, while observing all heads decreases uncertainty by 1.4 bits.) We could also compare the means of the posterior distributions after each observation, and see which is shifted most from the mean of the prior distribution. (In this case, our two means are 0.57 and 0.91).

Now, this was all looking at things through what I called perspective #2 above: how observations affect beliefs. Sometimes a more concrete way to understand the effect of intuitively implausible events is to look at how they affect specific predictions about future events. This is the approach of perspective #3. Sticking with our coin, we ask not about the bias of the coin, but about how we expect it to land on the next flip. To assess this, we look at the posterior predictive distributions for each posterior:

Posterior Predictive Distributions.png

It shouldn’t be too surprising that observing all heads makes you more confident that the next coin will land heads than observing HHTHTTTHTHHHTHTTHHHH. But looking at this graph gives a precise answer to how much more confident you should be. And it’s somewhat easier to think about than the entire distribution over coin biases.

I’ll leave you with an example puzzle that relates to anthropic reasoning.

Say that one day you win the lottery. Yay! Super surprising! What an improbable event! But now compare this to the event that some stranger Bob Smith wins the lottery. This doesn’t seem so surprising. But supposing that Bob Smith buys lottery tickets at the same rate as you, the probability that you win is identical to the probability that Bob Smith wins. So… why is it any more surprising when you win?

This seems like a weird question. Then again, so did the coin-flipping question we started with. We want to respond with something like “I’m not saying that it’s improbable that some random person wins the lottery. I’m interested in the probability of me winning the lottery. And if we parse up the outcomes as that either I win the lottery or that somebody else wins the lottery, then clearly it’s much more improbable that I win than that somebody else wins.”

But this is exactly parallel to the earlier “I’m not interested in the precise sequence of coin flips, I’m just interested in the number of heads versus tails.” And the response to it is identical in form: If Bob Smith, a particular individual whose existence you are aware of, wins the lottery and you know it, then it’s cheating to throw away those details and just say “Somebody other than me won the lottery.” When you update your beliefs, you should take into account all of your evidence.

Does the framework I presented here help at all with this case?

The end goal of epistemology

What are we trying to do in epistemology?

Here’s a candidate for an answer: The goal of epistemology is to formalize rational reasoning.

This is pretty good. But I don’t think it’s quite enough. I want to distinguish between three possible end goals of epistemology.

  1. The goal of epistemology is to formalize how an ideal agent with infinite computational power should reason.
  2. The goal of epistemology is to formalize how an agent with limited computational power should reason.
  3. The goal of epistemology is to formalize how a rational human being should reason.

We can understand the second task as asking something like “How should I design a general artificial intelligence to most efficiently and accurately model the world?” Since any general AI is going to be implemented in a particular bit of hardware, the answer to this question will depend on details like the memory and processing power of the hardware.

For the first task, we don’t need to worry about these details. Imagine that you’re a software engineer with access to an oracle that instantly computes any function you hand it. You want to build a program that takes in input from its environment and, with the help of this oracle, computes a model of its environment. Hardware constraints are irrelevant, you are just interested in getting the maximum epistemic juice out of your sensory inputs as logically possible.

The third task is probably the hardest. It is the most constrained of the three tasks; to accomplish it we need to first of all have a descriptively accurate model of the types of epistemic states that human beings have (e.g. belief and disbelief, comparative confidence, credences). Then we want to place norms on these states that are able to accommodate our cognitive quirks (for example, that don’t call things like memory loss or inability to instantly see all the logical consequences of a set of axioms irrational).

But both of these goals are on a spectrum. We aren’t interested in fully describing our epistemic states, because then there’s no space for placing non-trivial norms on them. And we aren’t interested in fully accommodating our cognitive quirks, because some of these quirks are irrational! It seems really hard to come up with precise and non-arbitrary answers to how descriptive we want to be and how many quirks we want to accommodate.

Now, in my experience, this third task is the one that most philosophers are working on. The second seems to be favored by statisticians and machine learning researchers. The first is favored by LessWrong rationalist-types.

For instance, rationalists tend to like Solomonoff induction as a gold standard for rational reasoning. But Solomonoff induction is literally uncomputable, immediately disqualifying it as a solution to tasks (2) and (3). The only sense in which Solomonoff induction is a candidate for the perfect theory of rationality is the sense of task (1). While it’s certainly not the case that Solomonoff induction is the perfect theory of rationality for a human or a general AI, it might be the right algorithm for an ideal agent with infinite computational power.

I think that disambiguating these three different potential goals of epistemology allows us to sidestep confusion resulting from evaluating a solution to one goal according to the standards of another. Let’s see this by purposefully glossing over the differences between the end goals.

We start with pure Bayesianism, which I’ll take to be the claim that rationality is about having credences that align with the probability calculus and updating them by conditionalization. (Let’s ignore the problem of priors for the moment.)

In favor of this theory: it works really well, in principle! Bayesianism has a lot of really nice properties like convergence to truth and maximizing relative entropy in updating on evidence (which is sort of like squeezing out all the information out of your evidence).

In opposition: the problem of logical omniscience. A Bayesian expects that all of the logical consequences of a set of axioms should be immediately obvious to a rational agent, and therefore that all credences of the form P(logical consequence of axioms | axioms) should be 100%. But now I ask you: is 19,973 a prime number? Presumably you understand natural numbers, including how to multiply and divide them and what prime numbers are. But it seems wrong to declare that the inability to conclude that 19,973 is prime from this basic level is knowledge is irrational.

This is an appeal to task (2). We want to say that there’s a difference between rationality and computational power. An agent with infinite computational power can be irrational if it is running poor software. And an agent with finite computational power can be perfectly rational, in that it makes effective use of these limited computational resources.

What this suggests is that we want a theory of rationality that is indexed by the computational capacities of the agent in question. What’s rational for one agent might not be rational for another. Bayesianism by itself isn’t nuanced enough to do this; two agents with the same evidence (and the same priors) should always end up at the same final credences. What we want is a framework in which two agents with the same evidence, priors, and computational capacity have the same beliefs.

It might be helpful to turn to computational complexity theory for insights. For instance, maybe we want a principle that says that a polynomial-powered agent is not rationally expected to solve NP problems. But the exact details of how such a theory would turn out are not obvious to me. Nor is it obvious that there even is a single non-arbitrary choice.

Regardless, let’s imagine for the moment that we have in hand the perfect theory of rationality for task (2). This theory should reduce to (1) as a special case when the agent in question has infinite computational powers. And if we treat human beings very abstractly as having some well-defined quantity of memory and processing power, then the theory also places norms on human reasoning. But in doing this, we open a new possible set of objections. Might this theory condemn as irrational some cognitive features of humans that we want to label as arational (neither rational nor irrational)?

For instance, let’s suppose that this theory involves something like updating by conditionalization. Notice that in this process, your credence in the evidence being conditioned on goes to 100%. Perhaps we want to say that the only things we should be fully 100% confident in are our conscious experiences at the present moment. Your beliefs about past conscious experiences could certainly be mistaken (indeed, many regularly are). Even your beliefs about your conscious experiences from a moment ago are suspect!

What this implies is that the set of evidence you are conditioning on at any given moment is just the set of all your current conscious experiences. But this is way too small a set to do anything useful with. What’s worse, it’s constantly changing. The sound of a car engine I’m updating on right now will no longer be around to be updated in one more moment. But this can’t be right; if at time T we set our credence in the proposition “I heard a car engine at time T” to 100%, then at time T+1 our credence should still be 100%.

One possibility here is to deny that 100% credences always stay 100%, and allow for updating backwards in time. Another is to treat not just your current experiences but also all your past experiences as 100% certain. Both of these are pretty unsatisfactory to me. A more plausible approach is to think about the things you’re updating on as not just your present experiences, but the set of presently accessible memories. Of course, this raises the question of what we mean by accessibility, but let’s set that aside for a moment and rest on an intuitive notion that at a given moment there is some set of memories that you could call up at will.

If we allow for updating on this set of presently accessible memories as well as present experiences, then we solve the problem of the evidence set being too small. But we don’t solve the problem of past certainties becoming uncertain. Humans don’t have perfect memory, and we forget things over time. If we don’t want to call this memory loss irrational, then we have to abandon the idea that what counts as evidence at one moment will always count as evidence in the future.

The point I’m making here is that the perfect theory of rationality for task (2) might not be the perfect theory of rationality for task (3). Humans have cognitive quirks that might not be well-captured by treating our brain as a combination of a hard drive and processor. (Another example of this is the fact that our confidence levels are not continuous like real numbers. Trying to accurately model the set of qualitatively distinct confidence levels seems super hard.)

Notice that as we move from (1) to (2) to (3), things get increasingly difficult and messy. This makes sense if we think about the progression as adding more constraints to the problem (as well as making it increasingly vague constraints).

While I am hopeful that we can find an optimal algorithm for inference with infinite computing power, I am less hopeful that there is a unique best solution to (2), and still less for (3). This is not merely a matter of difficulty, the problems themselves become increasingly underspecified as we include constraints like “these rational norms should apply to humans.”

Deciphering conditional probabilities

How would you evaluate the following two probabilities?

  1. P(B | A)
  2. P(A → B)

In words, the first is “the probability that B is true, given that A is true” and the second is “the probability that if A is true, then B is true.” I don’t know about you, but these sound pretty darn similar to me.

But in fact, it turns out that they’re different. In fact, you can prove that P(B | A) is always greater than or equal to P(A → B) (the equality only in the case that P(A) = 1 or P(A → B) = 1). The proof of this is not too difficult, but I’ll leave it to you to figure out.

Conditional probabilities are not the same as probabilities of conditionals. But maybe this isn’t actually too strange. After all, material conditionals don’t do such a great job of capturing what we actually mean when we say “If… then…” For instance, consult your intuitions about the truth of the sentence “If 2 is odd then 2 is even.” This turns out to be true (because any material conditional with a true consequent is true). Similarly, think about the statement “If I am on Mars right now, then string theory is correct.” Again, this turns out to be true if we treat the “If… then…” as a material conditional (since any material conditional with a false antecedent is true).

The problem here is that we actually use “If… then…” clauses in several different ways, the logical structure of which are not well captured by the material implication. A → B is logically equivalent to “A is true or B is false,” which is not always exactly what we mean by “If A then B”. Sometimes “If A then B” means “B, because A.” Other times it means something more like “A gives epistemic support for B.” Still other times, it’s meant counterfactually, as something like “If A were to be the case, then B would be the case.”

So perhaps what we want is some other formula involving A and B that better captures our intuitions about conditional statements, and maybe conditional probabilities are the same as probabilities in these types of formulas.

But as we’re about to prove, this is wrong too. Not only does the material implication not capture the logical structure of conditional probabilities, but neither does any other logical truth function! You can prove a triviality result: that if such a formula exists, then all statements must be independent of one another (in which case conditional probabilities lose their meaning).

The proof:

  1. Suppose that there exists a function Γ(A, B) such that P(A | B) = P(Γ(A, B)).
  2. Then P(A | B & A) = P(Γ | A).
  3. So 1 = P(Γ | A).
  4. Similarly, P(A | B & -A) = (Γ | -A).
  5. So 0 = P(Γ | -A).
  6. P(Γ) = P(Γ | A) P(A) + P(Γ | -A) P(-A).
  7. P(Γ) = 1 * P(A) + 0 * P(-A).
  8. P(Γ) = P(A).
  9. So P(A | B) = P(A).

This is a surprisingly strong result. No matter what your formula Γ is, we can say that either it doesn’t capture the logical structure of the conditional probability P(B | A), or it trivializes it.

We can think of this as saying that the language of first order logic is insufficiently powerful to express the conditionals in conditional probabilities. If you take any first order language and apply probabilities to all its valid sentences, none of those credences will be conditional probabilities. To get conditional probabilities, you have to perform algebraic operations like division on the first order probabilities. This is an important (and unintuitive) thing to keep in mind when trying to map epistemic intuitions to probability theory.