Record-breaking Collatz chains!

One of my favorite bits of mathematics is the Collatz conjecture. It’s so simple that you can explain it to any grade schooler, and yet is so mysterious that Paul Erdős declared “Mathematics may not be ready for such problems.”

Here’s the conjecture:

Take any natural number N. If this number is even, divide it by two. If on the other hand N is odd, we multiply it by three and add 1. So N becomes N/2 if N is even, and 3N+1 if N is odd. Now, repeat this process for the resulting number! If even, divide by two. If odd, multiply by three and add one. And then repeat with the new number, and keep on chugging away.

The conjecture is that you will always eventually get to 1 (regardless of the initial value of N!)

That’s it! It’s that simple. Here are some examples:

Say we start with N=2. Since 2 is even, we divide by 2 and get 1. And that’s that, we’re done!

What if we start with N=3? Now, since 3 is odd, we multiply it by three and add 1. So 3 becomes 10. 10 is even, so we divide by 2 to get 5. And 5 is odd, so we multiply by three and add one to get 16. And lo and behold, 16 goes to 8, which goes to 4, which goes to 2, and finally down to 1!
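If it helps to see the rule as code, here's a quick Python sketch (the function name is just my own choice):

```python
def collatz_chain(n):
    """Return the full Collatz chain starting at n and ending at 1."""
    chain = [n]
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        chain.append(n)
    return chain

print(collatz_chain(3))  # [3, 10, 5, 16, 8, 4, 2, 1]
```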

Here’s a table of the first few “Collatz sequences”:

[Table: the Collatz chains for starting values 2 through 10]

A few comments:

First, if you look closely, you’ll notice that there’s a lot of repetition in that list. For instance, look at the chain that starts at 9. It ends up at 7 after just three steps. And then the whole rest of the chain is identical to the one two above it! And starting at 10, one step takes us to 5, the chain of which we had already calculated.

This hints that any method for calculating the Collatz sequences of numbers could likely benefit from dynamic programming: using the work you’ve previously done to make your future tasks quicker. Think about it: by calculating the chains of just the numbers 3 through 10, we’ve also managed, in the process, to figure out the exact chains for 11, 13, 14, 16, 17, 20, and more numbers all the way up to 52!
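Here's roughly what that memoization looks like in code (a sketch; the cache decorator and names are my own choices):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def chain_length(n):
    """Number of Collatz steps needed to get from n down to 1."""
    if n == 1:
        return 0
    return 1 + chain_length(n // 2 if n % 2 == 0 else 3 * n + 1)

# Computing the chain length of 7 silently caches 11, 13, 17, 20, 22, ... as well.
print(chain_length(7))   # 16
print(chain_length(27))  # 111
```

(For the really big searches later in this post you'd want an iterative version with a more memory-frugal cache, but this is enough to get the idea.)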

Second comment: Notice that these chains look really weird. They jump up and down unpredictably, and their lengths are all over the place. Here’s what the “9” chain looks like as it jumps up and down:

[Figure: the chain starting at 9, plotted step by step as it jumps up and down]

And, to take an even more dramatic example, here’s what the chain starting with 27 looks like:

[Figure: the chain starting at 27]

That takes 111 steps to stop! And in the process, it rises as high as 9,232!

Based on this example, you might be thinking that these chains are in general going to increase faster than their starting value. But don’t be deceived: I cherry-picked 27 because I knew that it would generate an interesting long chain. In fact, the lengths of the chains are in general remarkably small relative to their starting value. Here’s a plot of the first 100 chain lengths:

[Figure: chain lengths for the first 100 starting values]

Though the values are growing, and some of them are up in the 100s, most of them are in the sub-40 range.

If you’re like me, though, you might notice some very curious patterns in this graph. What’s going on with those clusters of three or two that keep showing up right beside each other? And why do we seem to have these two distinct “regions” of chain lengths? Let’s look to see if these patterns persist as we plot the chain lengths for higher values, this time up to 1000:

[Figure: chain lengths for starting values up to 1000]

Interesting! So it appears that the “two regions” kind of blend together, but this pattern of strings of nearby numbers with similar chain lengths stays the same! This is extremely mysterious to me. Much of the time the specific paths the numbers take on their way down to 1 are quite different, so why should we see them having the same length?

Well, part of the answer might be that, in fact, the paths taken by nearby numbers are sometimes not too different at all. For instance, 500 and 501 both take 110 steps to get to 1, and their chains become identical after just three steps. Let’s look closely at these steps:

500 becomes 250 becomes 125 becomes 376
501 becomes 1504 becomes 752 becomes 376

Let’s look in general at what happens to N and N+1 if they follow this exact sequence of moves:

N becomes N/2 becomes N/4 becomes 3N/4 + 1
N+1 becomes 3(N+1) + 1 becomes (3(N+1) + 1)/2 becomes (3(N+1) + 1)/4

But hold on: (3(N+1) + 1)/4 is just 3N/4 + 1! So in fact, we should expect that whenever N and N+1 follow these exact sequences of moves, they will have the same chain length. And when does this happen? Well, N has to be divisible by 4 but not by 8 (so that we divide by two exactly twice before hitting an odd number), and 3N + 4 has to be divisible by 4, which follows automatically once N is divisible by 4. This condition holds for one in every eight numbers, so this already tells us that at least 1/8 of neighboring pairs have the same chain length!

And this was only looking at chains that became identical after three steps! We could go further and look at other possible sequences of moves that would make our chains become identical quickly, and in the process get a better understanding of the structure of the graph.
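Here's a quick brute-force check of that claim, reusing the chain_length function sketched earlier (N = 4 is skipped because its chain hits 1 before the pattern gets going):

```python
def step(m):
    return m // 2 if m % 2 == 0 else 3 * m + 1

def merges_in_three_steps(n):
    """Check that n and n+1 land on the same value after three Collatz steps."""
    a, b = n, n + 1
    for _ in range(3):
        a, b = step(a), step(b)
    return a == b

# Every N that is divisible by 4 but not by 8 (i.e. N = 12, 20, 28, ...):
candidates = range(12, 10000, 8)
print(all(merges_in_three_steps(n) for n in candidates))                # True
print(all(chain_length(n) == chain_length(n + 1) for n in candidates))  # True
```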

By the way, want to see how this graph behaves for even larger starting values? Here ya go! N up to 10,000:

[Figure: chain lengths for N up to 10,000]

Our N has 10-tupled, but our chain lengths have barely increased at all! Same for N up to 100,000:

[Figure: chain lengths for N up to 100,000]

Same story! Increasing our N by a factor of 10 seems to really only increase our chain lengths by adding about 100! This indicates that the relationship between starting value and chain length might be logarithmic (a multiplicative increase in starting value corresponding to an additive increase in chain length). Let’s look at a semilog plot of the same range of starting values:

[Figure: semilog plot of chain lengths for N up to 100,000]

Wow! This exposes a whole new level of structure in the chain lengths. Look at all those parallel lines of clusters! And also look at how nicely linearly the chain lengths appear to grow with the exponent on the x-axis! It’s a bit mind-blowing to me that the chain length of N seems to be so naturally related to log(N).

Extending this graph to 100 times more values of N:

[Figure: semilog plot of chain lengths for N up to 10,000,000]

We can really clearly see how linear everything remains! Another feature that pops out here is the little “ledges” that occasionally jut out to the left in the graph. These tend to appear whenever we find an N whose chain length is dramatically larger than the previous numbers, and it then carries along some of its larger neighbors.

This might raise the question: how often do we actually see new record chain lengths? I mean, clearly we’re going to get larger and larger chain lengths forever, off to infinity, but how often do these record-breaking chains appear?

Surprisingly rarely! Let’s define a record-breaking Collatz chain as a Collatz chain such that all the Collatz chains with smaller starting values are shorter in length. Let’s look at where these record-breakers appear for N up to 100,000:

[Figure: record-breaking chain lengths for N up to 100,000, on a logarithmic x-axis]

They look pretty evenly spaced! But remember, the x-axis is growing exponentially. So in fact, they get exponentially farther apart.
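If you want to reproduce this, here's roughly the search I have in mind (a sketch built on the memoized chain_length from earlier):

```python
def record_breakers(limit):
    """Return (n, chain_length) pairs whose chain length beats every smaller starting value."""
    records = []
    best = -1
    for n in range(2, limit + 1):
        length = chain_length(n)
        if length > best:
            best = length
            records.append((n, length))
    return records

for n, length in record_breakers(100_000):
    print(f"{n}: {length}")
```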

By the way, look at that weird straight line that appears about midway between 10^1 and 10^2. The exact values of these record breakers are:

54: 112
73: 115
97: 118
129: 121
171: 124
231: 127
313: 130

That’s right, the record increases by exactly 3, six times in a row! This might be a fluke, or maybe there’s something deeper going on.

Also, I’m sure you’re curious about that one record-breaker that jumps way above the previous record-breakers. It turns out that you’ve already seen it: it’s 27!

27 is unique in just how much it exceeds all the previous records, going from 23 to 111 (increasing the record by 88). As you can see in the graph, no other record-breaker beats its previous record by nearly as much, even though we’re looking all the way up to 100,000. Perhaps this is just a fluke caused by the early records being especially low? This seems supported by the fact that if we look all the way up to 10 million, we still see that nothing beats 27.

[Figure: record-breaking chain lengths for N up to 10 million]

Define a record-breaking record-breaker as a record-breaker that beats the previous record by a greater amount than all previous record-breakers did. Now, 27 is clearly a record-breaking record-breaker. There are also two before it: 3, which beats the previous record by 6, and 7, which beats it by 8.

We might have a postulate in mind: 3, 7, and 27 are the only record-breaking record-breakers. Can we prove this?

Well, it turns out that we can’t. Because the postulate is not true! Let’s extend our graph by another factor of 10, going up to 100 million. (By the way, the dynamic programming I mentioned earlier is absolutely essential at this point for any of this to be possible.)

[Figure: record-breaking chain lengths for N up to 100 million]

Now we can see that in fact there is another record-breaking record-breaker, all the way over to the right side of the graph! It turns out to be 63,728,127, and it beats the previous record by 205! So 27 is apparently not the largest one. A new question that naturally arises, a question to which I have no answer, is: are there infinitely many record-breaking record-breakers? Or does the sequence end at some point? I conjecture that yes, there are infinitely many such overachievers.

I have two more things to geek out about before I finish up. First of all, take a look at the actual values of the record-breakers up to 100,000:

2: 1
3: 7
6: 8
7: 16
9: 19
18: 20
25: 23
27: 111
54: 112
73: 115
97: 118
129: 121
171: 124
231: 127
313: 130
327: 143
649: 144
703: 170
871: 178
1161: 181
2223: 182
2463: 208
2919: 216
3711: 237
6171: 261
10971: 267
13255: 275
17647: 278
23529: 281
26623: 307
34239: 310
35655: 323
52527: 339
77031: 350

If you skim over this list, you’ll notice that these record-breakers appear to all be odd. But there are in fact four exceptions, which appear near the beginning of the list: 2, 6, 18, and 54. And after these four, the even record-breakers appear to disappear. This sort of makes sense; an odd number gets larger on its first step (3N+1), while an even number shrinks, and starting off by growing will in general result in a longer chain. But are there any more even record-breakers larger than 54?

As it turns out, yes there are! But the next even record breaker doesn’t appear until we get into the tens of millions. It is 31,466,382. This is pretty baffling! Try showing your friends the sequence [2, 6, 18, 54, 31466382] and see if they can figure out what comes next! (It turns out that the next number is 127,456,254, which is exactly twice the previous record breaker!)

And now one final observation. There seem to be disproportionately many record-breakers whose chain lengths differ by only 1 or 3. For the heck of it, let’s define shiftless record-breakers as any record breakers who only beat the previous record by 3, and lazy record-breakers as any record-breakers who beat the previous record by 1.

One simple way this might happen is if the previous record-breaker is N, and there are no new record-breakers between N and 2N. Then 2N will be the new record-breaker, since its first step is dividing by 2, after which it takes the rest of N’s chain for itself. So its chain length will be one larger than the chain length of N. (By the way, it turns out that our large even record-breaker is in fact an example of a lazy record-breaker.)

Are there really disproportionately many lazy and shiftless record breakers? Let’s look at the amount that each record breaker beat its previous record breaker by. This sequence looks like:

6, 1, 8, 3, 1, 3, 88, 1, 3, 3, 3, 3, 3, 3, 13, 1, 26, 8, 3, 1, 26, 8, 21, 24, 6, 8, 3, 3, 26, 3, 13, 16, 11, 3, 21, 8, 3, 57, 6, 21, 39, 16, 3, 3, 26, 3, 3, 21, 13, 16, 52, 21, 3, 3, 13, 1, 39, 205
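(This is just the sequence of consecutive differences of the record lengths; with the record_breakers sketch from earlier, something like the following reproduces the start of it, though pushing the search all the way to 100 million needs a more memory-frugal cache than lru_cache:)

```python
lengths = [length for _, length in record_breakers(100_000)]
print([b - a for a, b in zip(lengths, lengths[1:])])  # 6, 1, 8, 3, 1, 3, 88, ...
```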

This is another really mind-blowing observation for me! It looks like the differences between record breakers always seem to take on only very specific values, like 1, 3, 6, 8, and 11. Is there ever a difference of 2? Or 5? I have no idea, and I’d love to know! I’ll leave you with this histogram showing how often each difference appears in all record-breakers up to 100 million:

[Figure: histogram of the differences between successive record chain lengths, for N up to 100 million]

Producing all numbers using just four fours

How many of the natural numbers do you think you can produce just by combining four 4s with standard mathematical operations?

The answer might blow your mind a little bit. It’s all of them!

Specifically, you can get any natural number by using just the symbols ‘4’, ‘√’, ‘log’, and ‘/’. And what’s more, you can do it with just four 4s!

Try it for yourself before moving on!

Continue reading “Producing all numbers using just four fours”

Firing Squads and The Fine Tuning Argument

I’m confused about how satisfactory a multiverse is as an alternative explanation for the fine-tuning of our universe (alternative to God, that is).

My initial intuition about this is that it is a perfectly satisfactory explanation. It looks like we can justify this on Bayesian grounds by noting that the probability of the universe we’re in being fine-tuned for intelligent life given that there is a multiverse is nearly 1. The probability of fine-tuning given God is also presumably nearly 1, so the observation of fine-tuning shouldn’t push us much in one direction or other.

(Obligatory photo of the theorem doing the work here)

[Image: Bayes’ theorem]
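For reference, the theorem presumably being pictured there is just

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}.$$

With E the observation of fine-tuning and H either hypothesis (multiverse or God), both likelihoods P(E | H) are close to 1, so the observation barely shifts the posterior odds between them.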

But here’s another argument I’m aware of: A firing squad of twenty sharpshooters aims at you and fires. They all miss. You are obviously very surprised by this. But now somebody comes up to you and tells you that in fact there is a multiverse full of “you”s in identical situations. They all faced down the firing squad, and the vast majority of them died. Now, given that you exist to ask the question, of COURSE you are in the universe in which they all missed. So should you be no longer surprised?

I take it the answer to this is “No, even though I know that I could only be alive right now asking this question if the firing squad missed, this doesn’t remove any mystery from the firing squad missing. It’s exactly as mysterious that I am alive right now as that the firing squad missed, so my existence doesn’t lessen the explanatory burden we face.”

The firing squad situation seems exactly parallel to the fine-tuning of the universe. We find ourselves in a universe that is remarkably fine tuned in a way that seems extremely a priori improbable. Now we’re told that there are in fact a massive number of universes out there, the vast majority of which are devoid of life. So of course we exist in one of the universes that is fine-tuned for our existence.

Let’s make this even more intuitive: The earth exists in a Goldilocks zone around the Sun. Much closer or farther away, and life would not be possible. Maybe this was mysterious at some point when humans still thought that there was just one solar system in the universe. But now we know that galaxies contain hundreds of billions of solar systems, most of which probably don’t have any planets in their Goldilocks zones. And with this knowledge, the mystery entirely disappears. Of course we’re on a planet that can support life, where else would we be??

So my question is: Why does this argument feel satisfactory in the fine-tuning and Goldilocks examples but not the firing squad example?

A friend I asked about this responded:

if you modify the firing squad scenario so that you don’t exist prior to the shooting and are only brought into existence if they all miss, does it still feel less satisfactory than the multiverse case?

And I responded that no, it no longer feels less satisfactory than the multiverse case! Somehow this tweak “fixes” the intuitions. This suggests that the relevant difference between the two cases is something about existence prior to the time of the thought experiment. But how do we formalize this difference? And why should it be relevant? I’m perplexed.

Is the double slit experiment evidence that consciousness causes collapse?

No! No no no.

This might be surprising to those that know the basics of the double slit experiment. For those that don’t, very briefly:

A bunch of tiny particles are thrown one by one at a barrier with two thin slits in it, with a detector sitting on the other side. The pattern on the detector formed by the particles is an interference pattern, which appears to imply that each particle went through both slits in some sense, like a wave would do. Now, if you peek really closely at each slit to see which one each particle passes through, the results seem to change! The pattern on the detector is no longer an interference pattern, but instead looks like the pattern you’d classically expect from a particle passing through only one slit!

When you first learn about this strange dependence of the experimental results on, apparently, whether you’re looking at the system or not, it appears to be good evidence that your conscious observation is significant in some very deep sense. After all, observation appears to lead to fundamentally different behavior, collapsing the wave to a particle! Right?? This animation does a good job of explaining the experiment in a way that really pumps the intuition that consciousness matters:

(Fair warning, I find some aspects of this misleading and just plain factually wrong. I’m linking to it not as an endorsement, but so that you get the intuition behind the arguments I’m responding to in this post.)

The feeling that consciousness is playing an important role here is a fine intuition to have before you dive deep into the details of quantum mechanics. But now consider that the exact same behavior would be produced by a very simple process that is very clearly not a conscious observation. Namely, just put a single spin qubit at one of the slits in such a way that if the particle passes through that slit, it flips the spin upside down. Guess what you get? The exact same results as you got by peeking at the slits. You never need to look at the particle as it travels through the slits to the detector in order to collapse the wave-like behavior. Apparently a single qubit is sufficient to do this!
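Here's a toy calculation of that point (a sketch with made-up numbers, not a model of any real experiment). All the which-path qubit does is make the two branches of the superposition orthogonal, and that alone erases the interference term:

```python
import numpy as np

# Each slit contributes a complex amplitude at detector position x,
# with a phase set by its path length to that point.
x = np.linspace(-10, 10, 500)        # detector positions (arbitrary units)
d, L, k = 1.0, 20.0, 10.0            # slit separation, screen distance, wavenumber
r1 = np.sqrt(L**2 + (x - d / 2)**2)  # path length from slit 1
r2 = np.sqrt(L**2 + (x + d / 2)**2)  # path length from slit 2
a1 = np.exp(1j * k * r1) / np.sqrt(2)
a2 = np.exp(1j * k * r2) / np.sqrt(2)

# No which-path marker: both branches end in the same qubit state,
# so the amplitudes add and the cross term (the fringes) survives.
p_coherent = np.abs(a1 + a2) ** 2

# Which-path qubit: slit 2 flips the qubit, so the branches are a1*|0> and
# a2*|1>. The orthogonal qubit states kill the cross term, leaving the
# flat, classical sum of the two single-slit contributions.
p_decohered = np.abs(a1) ** 2 + np.abs(a2) ** 2
```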

It turns out that what’s really going on here has nothing to do with the collapse of the wave function and everything to do with the phenomenon of decoherence. Decoherence is what happens when a quantum superposition becomes entangled with the degrees of freedom of its environment in such a way that the branches of the superposition end up orthogonal to each other. Interference can only occur between the different branches if they are not orthogonal, which means that decoherence is sufficient to destroy interference effects. This is all stuff that all interpretations of quantum mechanics agree on.

Once you know that decoherence destroys interference effects (which all interpretations of quantum mechanics agree on), and also that a conscious observer observing the state of a system is a process that results in extremely rapid and total decoherence (which everybody also agrees on), then the fact that observing the position of the particle causes interference effects to vanish becomes totally independent of the question of what causes wave function collapse. Whether or not consciousness causes collapse is 100% irrelevant to the results of the experiment, because regardless of which of these is true, quantum mechanics tells us to expect observation to result in the loss of interference!

This is why whether or not consciousness causes collapse has no real impact on what pattern shows up on the detector. All interpretations of quantum mechanics agree that decoherence is a thing that can happen, and decoherence is all that is required to explain the experimental results. The double slit experiment provides no evidence for consciousness causing collapse, but it also provides no evidence against it. It’s just irrelevant to the question! That said, given that people often hear the experiment presented in a way that makes it seem like evidence for consciousness causing collapse, hearing that qubits do the same thing should make them update downwards on this theory.

Consistently reflecting on decision theory

There are many principles of rational choice that seem highly intuitively plausible at first glance. Sometimes upon further reflection we realize that a principle we initially endorsed is not quite as appealing as we first thought, or that it clashes with other equally plausible principles, or that it requires certain exceptions and caveats that were not initially apparent. We can think of many debates in philosophy as clashes of these general principles, where the thought experiments generated by philosophers serve as the data that put on display their relative merits and deficits. In this essay, I’ll explore a variety of different principles for rational decision making and consider the ways in which they satisfy and frustrate our intuitions. I will focus especially on the notion of reflective consistency, and see what sort of decision theory results from treating this as our primary desideratum.

I want to start out by illustrating the back-and-forth between our intuitions and the general principles we formulate. Consider the following claim, known as the dominance principle: If a rational agent believes that doing A is better than not doing A in every possible world, then that agent should do A even if uncertain about which world they are in. Upon first encountering this principle, it seems perfectly uncontroversial and clearly valid. But now consider the following application of the dominance principle:

“A student is considering whether to study for an exam. He reasons that if he will pass the exam, then studying is wasted effort. Also, if he will not pass the exam, then studying is wasted effort. He concludes that because whatever will happen, studying is wasted effort, it is better not to study.” (Titelbaum 237)

The fact that this is clearly a bad argument casts doubt on the dominance principle. It is worth taking a moment to ask what went wrong here. How did this example turn our intuitions on their heads so completely? Well, the flaw in the student’s line of reasoning was that he was ignoring the effect of his studying on whether or not he ends up passing the exam. This dependency between his action and the possible world he ends up in should be relevant to his decision, and it apparently invalidates his dominance reasoning.

A restricted version of the dominance principle fixes this flaw: If a rational agent prefers doing A to not doing A in all possible worlds and which world they are in is independent of whether they do A or not, then that agent should do A even if they are uncertain about which world they are in. I’ll call this the simple dominance principle. This principle is much harder to disagree with than our starting principle, but the caveat about independence greatly limits its scope. It applies only when our uncertainty about the state of the world is independent of our decision, which is not the case in most interesting decision problems. We’ll see by the end of this essay that even this seemingly obvious principle can be made to conflict with another intuitively plausible principle of rational choice.

The process of honing our intuitions and fine-tuning our principles like this is sometimes called seeking reflective consistency, where reflective consistency is the hypothetical state you end up in after a sufficiently long period of consideration. Reflective consistency is achieved when you have reached a balance between your starting intuitions and other meta-level desiderata like consistency and simplicity, such that your final framework is stable against further intuition-pumping. This process has parallels in other areas of philosophy such as ethics and epistemology, but I want to suggest that it is particularly potent when applied to decision theory. The reason for this is that a decision theory makes recommendations for what action to take in any given setup, and we can craft setups where the choice to be made is about what decision theory to adopt. I’ll call these setups self-reflection problems. By observing what choices a decision theory makes in self-reflection problems, we get direct evidence about whether the decision theory is reflectively consistent or not. In other words, we don’t need to do all the hard work of allowing thought experiments to bump our intuitions around; we can just take a specific decision algorithm and observe how it behaves upon self-reflection!

What we end up with is the following principle: Whatever decision theory we end up endorsing should be self-recommending. We should not end up in a position where we endorse decision theory X as the final best theory of rational choice, but then decision theory X recommends that we abandon it for some other decision theory that we consider less rational.

The connection between self-recommendation and reflective consistency is worth fleshing out in a little more detail. I am not saying that self-recommendation is sufficient for reflective consistency. A self-recommending decision theory might be obviously in contradiction with our notion of rational choice, such that any philosopher considering this decision theory would immediately discard it as a candidate. Consider, for instance, alphabetical decision theory, which always chooses the option which comes alphabetically first in its list of choices. When faced with a choice between alphabetical decision theory and, say, evidential decision theory, alphabetical decision theory will presumably choose itself, reasoning that ‘a’ comes before ‘e’. But we don’t want to call this a virtue of alphabetical decision theory. Even if it is uniformly self-recommending, alphabetical decision theory is sufficiently distant from any reasonable notion of rational choice that we can immediately discard it as a candidate.

On the other hand, even though not all self-recommending theories are reflectively consistent, any reflectively consistent decision theory must be self-recommending. Self-recommendation is a necessary but not sufficient condition for an adequate account of rational choice.

Now, it turns out that this principle is too strong as I’ve phrased it and requires a few caveats. One issue with it is what I’ll call the problem of unfair decision problems. For example, suppose that we are comparing evidential decision theory (henceforth EDT) to causal decision theory (henceforth CDT). (For the sake of time and space, I will assume as background knowledge the details of how each of these theories work.) We put each of them up against the following self reflection problem:

An omniscient agent peeks into your brain. If they see that you are an evidential decision theorist, they take all your money. Otherwise they leave you alone. Before they peek into your brain, you have the ability to modify your psychology such that you become either an evidential or causal decision theorist. What should you do?

EDT reasons as follows: If I stay an EDT, I lose all my money. If I self-modify to CDT, I don’t. I don’t want to lose all my money, so I’ll self-modify to CDT. So EDT is not self-recommending in this setup. But clearly this is just because the setup is unfairly biased against EDT, not because of any intrinsic flaw in EDT. In fact, it’s a virtue of a decision theory to not be self-recommending in such circumstances, as doing so indicates a basic awareness of the payoff structure of the world it faces.

While this certainly seems like the right thing to say about this particular decision problem, we need to consider how exactly to formalize this intuitive notion of “being unfairly biased against a decision theory.” There are a few things we might say here. For one, the distinguishing feature of this setup seems to be that the payout is determined not based off the decision made by an agent, but by their decision theory itself. This seems to be at the root of the intuitive unfairness of the problem; EDT is being penalized not for making a bad decision, but simply for being EDT. A decision theory should be accountable for the decisions it makes, not for simply being the particular decision theory that it happens to be.

In addition, by swapping “evidential decision theory” and “causal decision theory” everywhere in the setup, we end up arriving at the exact opposite conclusion (evidential decision theory looks stable, while causal decision theory does not). As long as we don’t have any a priori reason to consider one of these setups more important to take into account than the other, then there is no net advantage of one decision theory over the other. If a decision problem belongs to a set of equally a priori important problems obtained by simply swapping out the terms for different decision theories, and no decision theory comes out ahead on the set as a whole, then perhaps we can disregard the entire set for the purposes of evaluating decision theories.

The upshot of all of this is that what we should care about is decision problems that don’t make any direct reference to a particular decision theory, only to decisions. We’ll call such problems decision-determined. Our principle then becomes the following: Whatever decision theory we end up endorsing should be self-recommending in all decision-determined problems.

There’s certainly more to be said about this principle and if any other caveats need be applied to it, but for now let’s move on to seeing what we end up with when we apply this principle in its current state. We’ll start out with an analysis of the infamous Newcomb problem.

You enter a room containing two boxes, one opaque and one transparent. The transparent box contains $1,000. The opaque box contains either $0 or $1,000,000. Your choice is to either take just the opaque box (one-box) or to take both boxes (two-box). Before you entered the room, a predictor scanned your brain and created a simulation of you to see what you would do. If the simulation one-boxed, then the predictor filled the opaque box with $1,000,000. If the simulation two-boxed, then the opaque box was left empty. What do you choose?

EDT reasons as follows: If I one-box, then this gives me strong evidence that I have the type of brain that decides to one-box, which gives me strong evidence that the predictor’s simulation of me one-boxed, which in turn gives me strong evidence that the opaque box is full. So if I one-box, I expect to get $1,000,000. On the other hand, if I two-box, then this gives me strong evidence that my simulation two-boxed, in which case the opaque box is empty. So if I two-box, I expect to get only $1,000. Therefore one-boxing is better than two-boxing.

CDT reasons as follows: Whether the opaque box is full or empty is already determined by the time I entered the room, so my decision has no causal effect upon the box’s contents. And regardless of the contents of the box, I always expect to leave $1,000 richer by two-boxing than by one-boxing. So I should two-box.

At this point it’s important to ask whether Newcomb’s problem is a decision-determined problem. After all, the predictor decides whether to fill the opaque box by scanning your brain and simulating you. Isn’t that suspiciously similar to our earlier example of penalizing agents based off their decision theory? No. The simulator decides what to do not by evaluating your decision theory, but by its prediction about your decision. You aren’t penalized for being a CDT, just for being the type of agent that two-boxes. To see this you only need to observe that any decision theory that two-boxes would be treated identically to CDT in this problem. The determining factor is the decision, not the decision theory.

Now, let’s make the Newcomb problem into a test of reflective consistency. Instead of your choice being about whether to one-box or to two-box while in the room, your choice will now take place before you enter the room, and will be about whether to be an evidential decision theorist or a causal decision theorist when in the room. What does each theory do?

EDT’s reasoning: If I choose to be an evidential decision theorist, then I will one-box when in the room. The predictor will simulate me as one-boxing, so I’ll end up walking out with $1,000,000. If I choose to be a causal decision theorist, then I will two-box when in the room, the predictor will predict this, and I’ll walk out with only $1,000. So I will stay an EDT.

Interestingly, CDT agrees with this line of reasoning. The decision to be an evidential or causal decision theorist has a causal effect on how the predictor’s simulation behaves, so a causal decision theorist sees that the decision to stay a causal decision theorist will end up leaving them worse off than if they had switched over. So CDT switches to EDT. Notice that in CDT’s estimation, the decision to switch ends up making them $999,000 better off. This means that CDT would pay up to $999,000 just for the privilege of becoming an evidential decision theorist!

I think that looking at an actual example like this makes it more salient why reflective consistency and self-recommendation is something that we actually care about. There’s something very obviously off about a decision theory that knows beforehand that it will reliably perform worse than its opponent, so much so that it would be willing to pay up to $999,000 just for the privilege of becoming its opponent. This is certainly not the type of behavior that we associate with a rational agent that trusts itself to make good decisions.

Classically, this argument has been phrased in the literature as the “why ain’tcha rich?” objection to CDT, but I think that the objection goes much deeper than this framing would suggest. There are several plausible principles that all apply here, such as that a rational decision maker shouldn’t regret having the decision theory they have, a rational decision maker shouldn’t pay to limit their future options, and a rational decision maker shouldn’t pay to decrease the values in their payoff matrix. The first of these is fairly self-explanatory. One influential response to it has been from James Joyce, who said that the causal decision theorist does not regret their decision theory, just the situation they find themselves in. I’d suggest that this response makes little sense when the situation they find themselves in is a direct result of their decision theory. As for the second and third of these, we could imagine giving a causal decision theorist the choice to pay money to remove the future possibility of two-boxing, or to hire a gunman that would shoot them if they try to take the transparent box. In each of these cases, CDT would endorse the decision to pay. I mention these other principles just to suggest that we have sound philosophical reason to care about self-recommendation independent of the reflective consistency considerations that we started out by discussing.

The takeaway from this modified Newcomb problem is that CDT is not reflectively consistent, and that it will self-modify to EDT in the right circumstances. Does this mean that EDT is more reflectively consistent than CDT? It turns out that no, this is not the case. We can construct another thought experiment in which EDT self-modifies to become CDT, called the Newcomb’s Soda problem:

You were just given either Soda 1 or Soda 2, which are indistinguishable. There was a 50% chance of each being the type you were given. You are now given a choice as to whether to choose chocolate ice cream or vanilla ice cream. Those that had Soda 1 have a strong unconscious inclination to choose chocolate ice cream and will be given $1,000,000 after they choose the ice cream. Those that had Soda 2 have a strong unconscious inclination to choose vanilla ice cream and are given nothing. If you choose vanilla ice cream, you get $1,000. What do you choose?

EDT reasons that choosing chocolate ice cream gives them strong evidence that they were given Soda 1, in which case they will be given $1,000,000. So they would be willing to give up the $1000 reward for choosing vanilla ice cream, in order to increase their chances of getting the million. CDT reasons that what soda they were given is a fixed fact that isn’t causally influenced by the decision they make. So they have a 50% chance of getting the million regardless of their choice, but choosing to take the vanilla guarantees them an additional $1000. Therefore they take vanilla.

This is clearly a decision-determined problem, so let’s see how EDT and CDT do upon self-reflection. Given the chance to modify themselves to EDT before being given the soda, a causal decision theorist would see no advantage to doing so. After all, being an evidential decision theorist doesn’t make you more or less likely to get the $1,000,000; it just ensures that you won’t get the $1000. And the evidential decision theorist agrees! By changing to causal decision theory, they guarantee themselves an extra $1000 while not making it any more or less likely that they were given Soda 1 (and accordingly, the million dollars).

The upshot of this is that both CDT and EDT are reflective inconsistent in the class of decision determined problems. What we seek, then, is a new decision theory that behaves like EDT in the Newcomb problem and like CDT in Newcomb’s Soda. One such theory was pioneered by machine learning researcher Eliezer Yudkowsky, who named it timeless decision theory (henceforth TDT). To deliver different verdicts in the two problems, we must find some feature that allows us to distinguish between their structure. TDT does this by distinguishing between the type of correlation arising from ordinary common causes (like the soda in Newcomb’s Soda) and the type of correlation arising from faithful simulations of your behavior (as in Newcomb’s problem).

This second type of correlation is called logical dependence, and is the core idea motivating TDT. The simplest example of this is the following: two twins, physically identical down to the atomic level, raised in identical environments in a deterministic universe, will have perfectly correlated behavior throughout the lengths of their lives, even if they are entirely causally separated from each other. This correlation is apparently not due to a common cause or to any direct causal influence. It simply arises from the logical fact that two faithful instantiations of the same function will return the same output when fed the same input. Considering the behavior of a human being as an instantiation of an extremely complicated function, it becomes clear why you and your parallel-universe twin behave identically: you are instantiations of the same function! We can take this a step further by noting that two functions can have a similar input-output structure, in which case the physical instantiations of each function will have correlated input-output behavior. This correlation is what’s meant by logical dependence.

To spell this out a bit further, imagine that in a far away country, there are factories that sell very simple calculators. Each calculator is designed to run only one specific computation. Some factories are multiplication-factories; they only sell calculators that compute 713*291. Others are addition-factories; they only sell calculators that compute 713+291. You buy two calculators from one of these factories, but you’re not sure which type of factory you purchased from. Your credences are 50/50 split between the factory you purchased from being a multiplication-factory and it being an addition-factory. You also have some logical uncertainty regarding what the value of 713*291 is. You are evenly split between the value being 207,481 and the value being 207,483. On the other hand, you have no uncertainty about what the value of 713+291 is; you know that it is 1004.

Now, you press “ENTER” on one of the two calculators you purchased, and find that the result is 207,483. For a rational reasoner, two things should now happen: First, you should treat this result as strong evidence that the factory from which both calculators were bought was a multiplication-factory, and therefore that the other calculator is also a multiplier. And second, you should update strongly on the other calculator outputting 207,483 rather than 207,481, since two calculators running the same computation will output the same result.

The point of this example is that it clearly separates out ordinary common cause correlation from a different type of dependence. The common cause dependence is what warrants you updating on the other calculator being a multiplier rather than an adder. But it doesn’t warrant you updating on the result on the other calculator being specifically 207,483; to do this, we need the notion of logical dependence, which is the type of dependence that arises whenever you encounter systems that are instantiating the same or similar computations.

Connecting this back to decision theory, TDT treats our decision as the output of a formal algorithm, which is our decision-making process. The behavior of this algorithm is entirely determined by its logical structure, which is why there are no upstream causal influences such as the soda in Newcomb’s Soda. But the behavior of this algorithm is going to be correlated with the parts of the universe that instantiate a similar function (as well as the parts of the universe it has a causal influence on). In Newcomb’s problem, for example, the predictor generates a detailed simulation of your decision process based off of a brain scan. This simulation of you is highly logically correlated with you, in that it will faithfully reproduce your behavior in a variety of situations. So if you decide to one-box, you are also learning that your simulation is very likely to one-box (and therefore that the opaque box is full).

Notice that the exact mechanism by which the predictor operates becomes very important for TDT. If the predictor operates by means of some ordinary common cause where no logical dependence exists, TDT will treat its prediction as independent of your choice. This translates over to why TDT behaves like CDT on Newcomb’s Soda, as well as other so-called “medical Newcomb problems” such as the smoking lesion problem. When the reason for the correlation between your behavior and the outcome is merely that both depend on a common input, TDT treats your decision as an intervention and therefore independent of the outcome.

One final way to conceptualize TDT and the difference between the different types of correlation is using structural equation modeling:

Direct causal dependence exists between A and B when A is a function of B or when B is a function of A.
 > A = f(B) or B = g(A)

Common cause dependence exists between A and B when A and B are both functions of some other variable C.
 > A = f(C) and B = g(C)

Logical dependence exists between A and B when A and B depend on their inputs in similar ways.
 > A = f(C) and B = f(D)

TDT takes direct causal dependence and logical dependence seriously, and ignores common cause dependence. We can formally express this by saying that TDT calculates the expected utility of a decision by treating it like a causal intervention and fixing the output of all other instantiations of TDT to be identical interventions. Using Judea Pearl’s do-calculus notation for causal intervention, this looks like:

[Image: TDT’s expected utility formula, written using Pearl’s do() operator]
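Since the original image isn’t available here, a plausible reconstruction of the formula from the description below would be something like

$$EU(D) \;=\; \sum_{w} U(w)\, P\big(w \mid \mathrm{do}(\mathrm{TDT}(K) = D)\big),$$

with the agent choosing the D that maximizes this quantity, and the do() intervention fixing every instantiation of TDT-with-knowledge-K to output that same D.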

Here K is the TDT agent’s background knowledge, D is chosen from a set of possible decisions, and the sum is over all possible worlds. This equation isn’t quite right, since it doesn’t indicate what to do when the computation a given system instantiates is merely similar to TDT but not logically identical, but it serves as a first approximation to the algorithm.

You might notice that the notion of logical dependence depends on the idea of logical uncertainty, as without it the result of the computations would be known with certainty as soon as you learn that the calculators came out of a multiplication-factory, without ever having to observe their results. Thus any theory that incorporates logical dependence into its framework will be faced with a problem of logical omniscience, which is to say, it will have to give some account of how to place and update reasonable probability distributions over tautologies.

The upshot of all of this is that TDT is reflectively consistent on a larger class of problems than both EDT and CDT. Both EDT and CDT would self-modify into TDT in Newcomb-like problems if given the choice. Correspondingly, if you throw a bunch of TDTs and EDTs and CDTs into a world full of Newcomb and Newcomb-like problems, the TDTs will come out ahead. However, it turns out that TDT is not itself reflectively consistent on the whole class of decision-determined problems. Examples like the transparent Newcomb problem, Parfit’s hitchhiker, and counterfactual mugging all expose reflective inconsistency in TDT.

Let’s look at the transparent Newcomb problem. The structure is identical to a Newcomb problem (you walk into a room with two boxes, $1000 in one and either $1000000 or $0 in the other, determined based on the behavior of your simulation), except that both boxes are transparent. This means that you already know with certainty the contents of both boxes. CDT two-boxes here like always. EDT also two-boxes, since any dependence between your decision and the box’s contents is made irrelevant as soon as you see the contents. TDT agrees with this line of reasoning; even though it sees a logical dependence between your behavior and your simulation’s behavior, knowing whether the box is full or empty fully screens off this dependence.

Two-boxing feels to many like the obvious rational choice here. The choice you face is simply whether to take $1,000,000 or $1,001,000 if the box is full. If it’s empty, your choice is between taking $1,000 or walking out empty-handed. But two-boxing also has a few strange consequences. For one, imagine that you are placed, blindfolded, in a transparent Newcomb problem. You can at any moment decide to remove your blindfold. If you are an EDT, you will reason that if you don’t remove your blindfold, you are essentially in an ordinary Newcomb problem, so you will one-box and correspondingly walk away with $1,000,000. But if you do remove your blindfold, you’ll end up two-boxing and most likely walking away with only $1000. So an EDT would pay up to $999,000, just for the privilege of staying blindfolded. This seems to conflict with an intuitive principle of rational choice, which goes something like: A rational agent should never expect to be worse off by simply gaining information. Paying money to keep yourself from learning relevant information seems like a sure sign of a pathological decision theory.

Of course, there are two ways out of this. One way is to follow the causal decision theorist and two-box in both the ordinary Newcomb problem and the transparent problem. This has all the issues that we’ve already discussed, most prominently that you end up systematically and predictably worse off by doing so. If you pit a causal decision theorist against an agent that always one-boxes, even in transparent Newcomb problems, CDT ends up the poorer. And since CDT can reason this through beforehand, they would willingly self-modify to become the other type of agent.

What type of agent is this? None of the three decision theories we’ve discussed give the reflectively consistent response here, so we need to invent a new decision theory. The difficulty with any such theory is that it has to be able to justify sticking to its guns and one-boxing even after conditioning on the contents of the box.

In general, similar issues will arise whenever the recommendations made by a decision theory are not time-consistent. For instance, the decision that TDT prescribes for an agent with background knowledge K depends heavily on the information that TDT has at the time of prescription. This means that at different times, TDT will make different recommendations for what to do in the same situation (before entering the room TDT recommends one-boxing once in the room, while after entering the room TDT recommends two-boxing). This leads to suboptimal performance. Agents that can decide on one course of action and credibly precommit to it get certain benefits that aren’t available to agents that don’t have this ability. I think the clearest example of this is Parfit’s hitchhiker:

You are stranded in the desert, running out of water, and soon to die. A Predictor approaches and tells you that they will drive you to town only if they predict you will pay them $100 once you get there.

All of EDT, CDT, and TDT wish that they could credibly precommit to paying the money once in town, but can’t. Once they are in town they no longer have any reason to pay the $100, since they condition on the fact that they have already been driven to town. The fact that EDT, CDT, and TDT all have time-sensitive recommendations makes them worse off, leaving them all stranded in the desert to die. Each of these agents would willingly switch to a decision theory that doesn’t change their recommendations over time. How would such a decision theory work? It looks like we need a decision theory that acts as if it doesn’t know whether it’s in town even once in town, and acts as if it doesn’t know the contents of the box even after seeing them. One strategy for achieving this behavior is simple: you just decide on your strategy without ever conditioning on the fact that you are in town!

The decision theory that arises from this choice is appropriately named updateless decision theory (henceforth UDT). UDT is peculiar in that it never actually updates on any information when determining how to behave. That’s not to say that the decisions a UDT agent makes don’t depend on the information it gets through its lifetime. Instead, UDT tells you to choose a policy – a mapping from the possible pieces of information you might receive to possible decisions you could make – that maximizes the expected utility, calculated using your prior on possible worlds. This policy is set from time zero and never changes, and it determines how the UDT agent responds to any information they might receive at later points. So, for instance, a UDT agent reasons that adopting a policy of one-boxing in the transparent Newcomb case regardless of what it sees maximizes expected utility as calculated using its prior. So once the UDT agent is in the room with the transparent box, it one-boxes. We can formalize this by analogy with TDT:

[Image: UDT’s expected utility formula]
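Again the original image isn’t available, but by analogy with the TDT formula above, the reconstruction would presumably be something like choosing the policy

$$\pi^* \;=\; \arg\max_{\pi} \sum_{w} U(w)\, P\big(w \mid \mathrm{do}(\mathrm{UDT} = \pi)\big),$$

where π maps possible observations to decisions and the probabilities come from the agent’s prior over worlds, with no conditioning on anything observed later.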

One concern with this approach is that a UDT agent might end up making silly decisions as a result of not taking into account information that is relevant to their decisions. But once again, the UDT agent does take into account the information they learn in their lifetime. It’s merely that they decide what to do with that information before receiving it and never update this prescription. For example, suppose that a UDT agent anticipates facing exactly one decision problem in their life, regarding whether to push a button or not. They have a 50% prior credence that pushing the button will result in the loss of $10, and 50% that it will result in gaining $10. Now, at some point before they decide whether to push the button, they are given the information about whether pushing the button causes them to gain $10 or to lose $10. UDT deals with this by choosing a policy for how to respond to that information in either case. The expected utility maximizing policy here would be to push the button if you learn that pushing the button leads to gaining $10, and to not push the button if you learn the opposite.
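Here's that button example as a toy computation (a sketch; the names and numbers are just taken from the example above):

```python
from itertools import product

observations = ["button_gains_10", "button_loses_10"]
actions = ["push", "dont_push"]
prior = {"button_gains_10": 0.5, "button_loses_10": 0.5}

def utility(obs, act):
    """Payoff of taking an action, given which world the observation reveals."""
    if act == "dont_push":
        return 0
    return 10 if obs == "button_gains_10" else -10

# A policy maps each possible observation to an action; UDT picks the policy
# that maximizes expected utility under the prior, before seeing anything.
policies = [dict(zip(observations, choice)) for choice in product(actions, repeat=2)]

def prior_expected_utility(policy):
    return sum(prior[obs] * utility(obs, policy[obs]) for obs in observations)

best = max(policies, key=prior_expected_utility)
print(best)                           # push if told it gains $10, don't push otherwise
print(prior_expected_utility(best))   # 5.0
```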

Since UDT chooses its preferred policy based on its prior, this recommendation never changes throughout a UDT agent’s lifetime. This seems to indicate that UDT will be self-recommending in the class of all decision-determined problems, although I’m not aware of a full proof of this. If this is correct, then we have reached our goal of finding a self-recommending decision theory. It is interesting to consider what other principles of rational choice ended up being violated along the way. The simple dominance principle that we started off by discussing appears to be an example of this. In the transparent Newcomb problem, there is only one possible world that the agent considers when in the room (the one in which the box is full, say), and in this one world, two-boxing dominates one-boxing. Given that the box is full, your decision to one-box or to two-box is completely independent of the box’s contents. So the simple dominance principle recommends two-boxing. But UDT disagrees.

Another example of a deeply intuitive principle that UDT violates is the irrelevance of impossible outcomes. This principle says that impossible outcomes should not factor into your decision-making process. But UDT seems to often recommend acting as if some impossible world might come to be. For instance, suppose a predictor walks up to you and gives you a choice to either give them $10 or to give them $100. You will not face any future consequences on the basis of your decision (besides whether you’re out $100 or only $10). However, you learn that the predictor only approached you because it predicted that you would give the $10. Do you give the $10 or the $100? UDT recommends giving the $100, because agents that do so are less likely to have been approached by the predictor. But if you’ve already been approached, then you are letting considerations about an impossible world influence your decision process!

Our quest for reflective consistency took us from EDT and CDT to timeless decision theory. TDT used the notion of logical dependence to get self-recommending behavior in the Newcomb problem and medical Newcomb cases. But we found that TDT was itself reflectively inconsistent in problems like the Transparent Newcomb problem. This led us to create a new theory that made its recommendations without updating on information, which we called updateless decision theory. UDT turned out to be a totally static theory, setting forth a policy determining how to respond to all possible bits of information and never altering this policy. The unchanging nature of UDT indicates the possibility that we have found a truly self-recommending decision theory, while also leading to some quite unintuitive consequences.


Decision Theory

Everywhere below where a Predictor is mentioned, assume that their predictions are made by scanning your brain to create a highly accurate simulation of you and then observing what this simulation does.

All the below scenarios are one-shot games. Your action now will not influence the future decision problems you end up in.

Newcomb’s Problem
Two boxes: A and B. B contains $1,000. A contains $1,000,000 if the Predictor thinks you will take just A, and $0 otherwise. Do you take just A or both A and B?

Transparent Newcomb, Full Box
Newcomb problem, but you can see that box A contains $1,000,000. Do you take just A or both A and B?

Transparent Newcomb, Empty Box
Newcomb problem, but you can see that box A contains nothing. Do you take just A or both A and B?

Newcomb with Precommitment
Newcomb’s problem, but you have the ability to irrevocably resolve to take just A in advance of the Predictor’s prediction (which will still be just as good if you do precommit). Should you precommit?

Take Opaque First
Newcomb’s problem, but you have already taken A and it has been removed from the room. Should you now also take B or leave it behind?

Smoking Lesion
Some people have a lesion that causes cancer as well as a strong desire to smoke. Smoking doesn’t cause cancer and you enjoy it. Do you smoke?

Smoking Lesion, Unconscious Inclination
Some people have a lesion that causes cancer as well as a strong unconscious inclination to smoke. Smoking doesn’t cause cancer and you enjoy it. Do you smoke?

Smoking and Appletinis
Drinking a third appletini is the kind of act much more typical of people with addictive personalities, who tend to become smokers. I’d like to drink a third appletini, but I really don’t want to be a smoker. Should I order the appletini?

Expensive Hospital
You just got into an accident which gave you amnesia. You need to choose to be treated at either a cheap hospital or an expensive one. The quality of treatment in the two is the same, but you know that billionaires, due to unconscious habit, will be biased towards using the expensive one. Which do you choose?

Rocket Heels and Robots
The world contains robots and humans, and you don’t know which you are. Robots rescue people whenever possible and have rockets in their heels that activate whenever necessary. Your friend falls down a mine shaft and will die soon without robotic assistance. Should you jump in after them?

Death in Damascus
If you and Death are in the same city tomorrow, you die. Death is a perfect predictor, and will come where he predicts you will be. You can stay in Damascus or pay $1000 to flee to Aleppo. Do you stay or flee?

Psychopath Button
If you press a button, all psychopaths will be killed. Only a psychopath would press such a button. Do you press the button?

Parfit’s Hitchhiker
You are stranded in the desert, running out of water, and soon to die. A Predictor will drive you to town only if they predict you will pay them $1000 once you get there. You have been brought into town. Do you pay?

XOR Blackmail
An honest predictor sends you this letter: “I sent this letter because I predicted that you have termites iff you won’t send me $100. Send me $100.” Do you send the money?

Twin Prisoner’s Dilemma
You are in a prisoner’s dilemma with a twin of yourself. Do you cooperate or defect?

Predictor Extortion
A Predictor approaches you and threatens to torture you unless you hand over $100. They only approached you because they predicted beforehand that you would hand over the $100. Do you pay up?

Counterfactual Mugging
A Predictor flips a coin, which lands heads, then approaches you and asks for $100. If the coin had landed tails, it would have tortured you if it predicted that you wouldn’t give the $100. Do you give?

Newcomb’s Soda
You have 50% credence that you were given Soda 1, and 50% that you were given Soda 2. Those that had Soda 1 have a strong unconscious inclination to choose chocolate ice cream and will be given $1,000,000. Those that had Soda 2 have a strong unconscious inclination to choose vanilla ice cream and are given nothing. If you choose vanilla ice cream, you get $1000. Do you choose chocolate or vanilla ice cream?

Meta-Newcomb Problem
Two boxes: A and B. A contains $1,000. Box B will contain either nothing or $1,000,000. What B will contain is (or will be) determined by a Predictor just as in the standard Newcomb problem. Half of the time, the Predictor makes his move before you by predicting what you’ll do. The other half, the Predictor makes his move after you by observing what you do. There is a Metapredictor, who has an excellent track record of predicting Predictor’s choices as well as your own. The Metapredictor informs you that either (1) you choose A and B and Predictor will make his move after you make your choice, or else (2) you choose only B, and Predictor has already made his choice. Do you take only B or both A and B?

Rationality in the face of improbability

I recently read my favorite Wikipedia article of all time. It’s about a park ranger named Roy Cleveland Sullivan, whose claim to fame was having been hit by lightning on seven different occasions and surviving all of them. The details of these events are both tragic and a little hilarious, and raise some interesting questions about rationality.

From the article:

In spring 1972, Sullivan was working inside a ranger station in Shenandoah National Park when another strike occurred. It set his hair on fire; he tried to smother the flames with his jacket. He then rushed to the restroom, but couldn’t fit under the water tap and so used a wet towel instead. Although he never was a fearful man, after the fourth strike he began to believe that some force was trying to destroy him and he acquired a fear of death. For months, whenever he was caught in a storm while driving his truck, he would pull over and lie down on the front seat until the storm passed. He also began to believe that he would somehow attract lightning even if he stood in a crowd of people, and carried a can of water with him in case his hair was set on fire.

Put yourself in his situation and ask yourself whether you might start doing the same thing after being struck four times. Now what if it kept happening?

On August 7, 1973, while he was out on patrol in the park, Sullivan saw a storm cloud forming and drove away quickly. But the cloud, he said later, seemed to be following him. When he finally thought he had outrun it, he decided it was safe to leave his truck. Soon after, he was struck by a lightning bolt.

The next strike, on June 5, 1976, injured his ankle. It was reported that he saw a cloud, thought that it was following him, tried to run away, but was struck anyway. His hair also caught fire.

He was struck the seventh time while fishing in a freshwater pool, which in a weird turn of events was followed by a confrontation with a bear over some trout that he had caught.

What’s more, Sullivan claimed to have been struck by lightning another time as a child, when out helping his father cut wheat in a field.

And furthermore…

Sullivan’s wife was also struck once, when a storm suddenly arrived as she was out hanging clothes in their backyard. Her husband was helping her at the time, but escaped unharmed.

Apparently his fear of lightning was a little contagious:

He was avoided by people later in life because of their fear of being hit by lightning, and this saddened him. He once recalled “For instance, I was walking with the Chief Ranger one day when lightning struck way off (in the distance). The Chief said, ‘I’ll see you later.'”

Okay, so besides being a hilariously weird series of events, this article does raise some issues related to anthropic reasoning. Namely: what would it be rational for Roy Sullivan to believe?

I want to say that this man had really really good evidence that some angry Thor-like deity existed and was actively hunting him down. In his position, I think I’d feel like it was only rational to try to run from approaching clouds and thunderstorms (although that strategy didn’t seem to be super effective for him).

But at the same time, in a world of billions of people, it’s almost guaranteed to be the case that somebody will find themselves in circumstances just as unlikely as this. If Sullivan had one day looked up lightning strike statistics, and found that the numbers for the overall population were perfectly consistent with a naturalistic hypothesis in which lightning doesn’t target any particular individuals, how should he have responded?

And what should we believe about Roy Sullivan and lightning? Presumably we should not accept his non-naturalistic conclusions. But then what exactly is the difference between what we know and what he knows? We both have the same statistical information about the general trends in lightning strikes, and we both know that Roy Cleveland Sullivan was hit by lightning a bunch of times, so why should we come to different conclusions?

The obvious thought here is that it has something to do with anthropic reasoning. Sure, I have the same non-anthropic evidence as Roy Sullivan, but we have different anthropic evidence. Sullivan doesn’t just know the comparatively unremarkable proposition that “somebody was hit by lightning seven times and survived,” he knows the indexical proposition that “I was hit by lightning seven times and survived.” The non-Sullivans of the world don’t have access to this proposition, and maybe this is the difference that matters.

Perhaps any population will end up having some individuals that happen to find themselves in very unusual situations, in which it becomes rational for them to come to bizarre conclusions about the world for anthropic reasons. And the bigger the population, the stranger and more rationally certain these beliefs might become.  Imagine a population big enough that it becomes not unlikely that some individual walks around commanding Thor to send bolts of lightning where they’re pointing, and then lo and behold it happens each time.

There would be many many many more individuals out there who succeeded only a few times, and even more who never succeeded at all. But for that tiny fraction that appears to manifest god-like powers, what should they believe? What should their friends and family believe? How far does the anthropic update extend? I’m not sure.

The history of lighting technology

Behold, one of my favorite tables of all time:

Screen-Shot-2018-12-23-at-4.08.12-AM-e1545556587809.png
Screen-Shot-2018-12-23-at-4.08.19-AM-e1545556574319.png
(Source.)

There’s so much to absorb here. Let’s look at just the “Light Price in Terms of Labor” column. At 500,000 BC, our starting point, we have this handsome guy:

Peking Man

The Peking man was a Homo erectus found in a cave with evidence of tool use and basic fire technology. At this point, it would have taken him about 58 hours of work for every 1000 hours of light. Lighting a fire by hand or even with basic tools is hard.

Nothing much changes for hundreds of thousands of years; even with basic candles and oil lamps, the labor price of light stays roughly flat until the early 1800s. After that, things slowly begin to accelerate, with gas lighting, incandescent lamps, and eventually fluorescent bulbs and LEDs…

Lights

Notice that this is a logarithmic plot! So a straight line corresponds to an exponential decrease in the amount of labor required to produce light. By the end, less than 1 second of labor buys 1000 hours of light. And this doesn’t even include LED technologies!

Here’s a more detailed timeline of milestones in lighting technology up to the 1980s:

Screen Shot 2018-12-23 at 4.06.50 AM

And finally, a comparison of the efficiency of different lighting technologies over time.

Screen-Shot-2018-12-23-at-4.07.31-AM.png

 

Kant’s attempt to save metaphysics and causality from Hume

TL;DR

  • Hume sort of wrecked metaphysics. This inspired Kant to try and save it.
  • Hume thought that terms were only meaningful insofar as they were derived from experience.
  • We never actually experience necessary connections between events, we just see correlations. So Hume thought that the idea of causality as necessary connection is empty and confused, and that all our idea of causality really amounts to is correlation.
  • Kant didn’t like this. He wanted to PROTECT causality. But how??
  • Kant said that metaphysical knowledge was both a priori and substantive, and justified this by describing these things called pure intuitions and pure concepts.
  • Intuitions are representations of things (like sense perceptions). Pure intuitions are the necessary preconditions for us to represent things at all.
  • Concepts are classifications of representations (like “red”). Pure concepts are the necessary preconditions underlying all classifications of representations.
  • There are two pure intuitions (space and time) and twelve pure concepts (one of which is causality).
  • We get substantive a priori knowledge by referring to pure intuitions (mathematics) or pure concepts (laws of nature, metaphysics).
  • Yay! We saved metaphysics!

 

(Okay, now on to the actual essay. This was not originally written for this blog, which is why it’s significantly more formal than my usual fare.)

 

***

 

David Hume’s Enquiry Concerning Human Understanding stands out as a profound and original challenge to the validity of metaphysical knowledge. Part of the historical legacy of this work is its effect on Kant, who describes Hume as being responsible for “[interrupting] my dogmatic slumber and [giving] my investigations in the field of speculative philosophy a completely different direction.” Despite the great inspiration that Kant took from Hume’s writing, their thinking on many matters is diametrically opposed. A prime example of this is their views on causality.

Hume’s take on causation is famously unintuitive. He gives a deflationary account of the concept, arguing that the traditional conception lacks a solid epistemic foundation and must be replaced by mere correlation. To understand this conclusion, we need to back up and consider the goal and methodology of the Enquiry.

He starts with an appeal to the importance of careful and accurate reasoning in all areas of human life, and especially in philosophy. In a beautiful bit of prose, he warns against the danger of being overwhelmed by popular superstition and religious prejudice when casting one’s mind towards the especially difficult and abstruse questions of metaphysics.

But this obscurity in the profound and abstract philosophy is objected to, not only as painful and fatiguing, but as the inevitable source of uncertainty and error. Here indeed lies the most just and most plausible objection against a considerable part of metaphysics, that they are not properly a science, but arise either from the fruitless efforts of human vanity, which would penetrate into subjects utterly inaccessible to the understanding, or from the craft of popular superstitions, which, being unable to defend themselves on fair ground, raise these entangling brambles to cover and protect their weakness. Chased from the open country, these robbers fly into the forest, and lie in wait to break in upon every unguarded avenue of the mind, and overwhelm it with religious fears and prejudices. The stoutest antagonist, if he remit his watch a moment, is oppressed. And many, through cowardice and folly, open the gates to the enemies, and willingly receive them with reverence and submission, as their legal sovereigns.

In less poetic terms, Hume’s worry about metaphysics is that its difficulty and abstruseness makes its practitioners vulnerable to flawed reasoning. Even worse, the difficulty serves to make the subject all the more tempting for “each adventurous genius[, who] will still leap at the arduous prize and find himself stimulated, rather than discouraged by the failures of his predecessors, while he hopes that the glory of achieving so hard an adventure is reserved for him alone.”

Thus, says Hume, the only solution is “to free learning at once from these abstruse questions [by inquiring] seriously into the nature of human understanding and [showing], from an exact analysis of its powers and capacity, that it is by no means fitted for such remote and abstruse questions.”

Here we get the first major divergence between Kant and Hume. Kant doesn’t share Hume’s eagerness to banish metaphysics. His Prolegomena to Any Future Metaphysics and Critique of Pure Reason are attempts to find it a safe haven from Hume’s attacks. However, while Kant doesn’t share Hume’s aims, he does take Hume’s methodology very seriously. He states in the preface to the Prolegomena that “since the origin of metaphysics as far as history reaches, nothing has ever happened which could have been more decisive to its fate than the attack made upon it by David Hume.” Many of the principles which Hume derives, Kant agrees with wholeheartedly, making the task of shielding metaphysics even harder for him.

With that understanding of Hume’s methodology in mind, let’s look at how he argues for his view of causality. We’ll start with a distinction that is central to Hume’s philosophy: that between ideas and impressions. The difference between the memory of a sensation and the sensation itself is a good example. While the memory may mimic or copy the sensation, it can never reach its full force and vivacity. In general, Hume suggests that our experiences fall into two distinct categories, separated by a qualitative gap in liveliness. The more lively category he calls impressions, which includes sensory perceptions like the smell of a rose or the taste of wine, as well as internal experiences like the feeling of love or anger. The less lively category he refers to as thoughts or ideas. These include memories of impressions as well as imagined scenes, concepts, and abstract thoughts. 

With this distinction in hand, Hume proposes his first limit on the human mind. He claims that no matter how creative or original you are, all of your thoughts are the product of “compounding, transposing, augmenting, or diminishing the materials afforded us by the senses and experiences.” This is the copy principle: all ideas are copies of impressions, or compositions of simpler ideas that are in turn copies of impressions.

Hume turns this observation about the nature of our mind into a powerful criterion of meaning. “When we entertain any suspicion that a philosophical term is employed without any meaning or idea (as is but too frequent), we need but enquire, From what impression is that supposed idea derived? And if it be impossible to assign any, this will serve to confirm our suspicion.”

This criterion turns out to be just the tool Hume needs in order to establish his conclusion. He examines the traditional conception of causation as a necessary connection between events, searches for the impressions that might correspond to this idea, and, failing to find anything satisfactory, declares that “we have no idea of connection or power at all and that these words are absolutely without any meaning when employed in either philosophical reasonings or common life.” His primary argument here is that all of our observations are of mere correlation, and that we can never actually observe a necessary connection.

Interestingly, at this point he refrains from recommending that we throw out the term causation. Instead he proposes a redefinition of the term, suggesting a more subtle interpretation of his criterion of meaning. Rather than eliminating the concept altogether upon discovering it to have no satisfactory basis in experience, he reconceives it in terms of the impressions from which it is actually formed. In particular, he argues that our idea of causation is really based on “the connection which we feel in the mind, this customary transition of the imagination from one object to its usual attendant.”

Here Hume is saying that humans have a rationally unjustifiable habit of thought where, when we repeatedly observe one type of event followed by another, we begin to call the first a cause and the second its effect, and we expect that future instances of the cause will be followed by future instances of the effect. Causation, then, is just this constant conjunction between events, and our mind’s habit of projecting the conjunction into the future. We can summarize all of this in a few lines:

Hume’s denial of the traditional concept of causation

  1. Ideas are always either copies of impressions or composites of simpler ideas that are copies of impressions.
  2. The traditional conception of causation is neither of these.
  3. So we have no idea of the traditional conception of causation.

Hume’s reconceptualization of causation

  1. An idea is the idea of the impression that it is a copy of.
  2. The idea of causation is copied from the impression of constant conjunction.
  3. So the idea of causation is just the idea of constant conjunction.

There we have Hume’s line of reasoning, which provoked Kant to examine the foundations of metaphysics anew. Kant wanted to resist Hume’s dismissal of the traditional conception of causation, while accepting that our sense perceptions reveal no necessary connections to us. Thus his strategy was to deny the Copy Principle and give an account of how we can have substantive knowledge that is not ultimately traceable to impressions. He does this by introducing the analytic/synthetic distinction and the notion of a priori synthetic knowledge.

Kant’s original definition of analytic judgments is that they “express nothing in the predicate but what has already been actually thought in the concept of the subject.” This suggests that the truth value of an analytic judgment is determined by purely the meanings of the concepts in use. A standard example of this is “All bachelors are unmarried.” The truth of this statement follows immediately just by understanding what it means, as the concept of bachelor already contains the predicate unmarried.  Synthetic judgments, on the other hand, are not fixed in truth value by merely the meanings of the concepts in use. These judgments amplify our knowledge and bring us to genuinely new conclusions about our concepts. An example: “The President is ornery.” This certainly doesn’t follow by definition; you’d have to go out and watch the news to realize its truth.

We can now put the challenge to metaphysics slightly differently. Metaphysics purports to be discovering truths that are both necessary (and therefore a priori) as well as substantive (adding to our concepts and thus synthetic). But this category of synthetic a priori judgments seems a bit mysterious. Evidently, the truth values of such judgments can be determined without referring to experience, but can’t be determined by merely the meanings of the relevant concepts. So apparently something further is required besides the meanings of concepts in order to make a synthetic a priori judgment. What is this thing?

Kant’s answer is that the further requirement is pure intuition and pure concepts. These terms need explanation.

Pure Intuitions

For Kant, an intuition is a direct, immediate representation of an object. An obvious example of this is sense perception; looking at a cup gives you a direct and immediate representation of an object, namely, the cup. But pure intuitions must be independent of experience, or else judgments based on them would not be a priori. In other words, the only type of intuition that could possibly be a priori is one that is present in all possible perceptions, so that its existence is not contingent upon what perceptions are being had. Kant claims that this is only possible if pure intuitions represent the necessary preconditions for the possibility of perception.

What are these necessary preconditions? Kant famously claimed that the only two are space and time. This implies that all of our perceptions have spatiotemporal features, and indeed that perception is only possible in virtue of the existence of space and time. It also implies, according to Kant, that space and time don’t exist outside of our minds!  Consider that pure intuitions exist equally in all possible perceptions and thus are independent of the actual properties of external objects. This independence suggests that rather than being objective features of the external world, space and time are structural features of our minds that frame all our experiences.

This is why Kant’s philosophy is a species of idealism. Space and time get turned into features of the mind, and correspondingly appearances in space and time become internal as well. Kant forcefully argues that this view does not make space and time into illusions, saying that without his doctrine “it would be absolutely impossible to determine whether the intuitions of space and time, which we borrow from no experience, but which still lie in our representation a priori, are not mere phantasms of our brain.”

The pure intuitions of space and time play an important role in Kant’s philosophy of mathematics: they serve to justify the synthetic a priori status of geometry and arithmetic. When we judge that the sum of the interior angles of a triangle is 180º, for example, we do so not purely by examining the concepts triangle, sum, and angle. We also need to consult the pure intuition of space! And similarly, our affirmations of arithmetic truths rely upon the pure intuition of time for their validity.

Pure Concepts

Pure intuition is only one part of the story. We don’t just perceive the world, we also think about our perceptions. In Kant’s words, “Thoughts without content are empty; intuitions without concepts are blind. […] The understanding cannot intuit anything, and the senses cannot think anything. Only from their union can cognition arise.” As pure intuitions are to perceptions, pure concepts are to thought. Pure concepts are necessary for our empirical judgments, and without them we could not make sense of perception. It is this category in which causality falls.

Kant’s argument for this is that causality is a necessary condition for the judgment that events occur in a temporal order. He starts by observing that we don’t directly perceive time. For instance, we never have a perception of one event being before another, we just perceive one and, separately, the other. So to conclude that the first preceded the second requires something beyond perception, that is, a concept connecting them.

He argues that this connection must be necessary: “For this objective relation to be cognized as determinate, the relation between the two states must be thought as being such that it determines as necessary which of the states must be placed before and which after.” And as we’ve seen, the only way to get a necessary connection between perceptions is through a pure concept. The required pure concept is the relation of cause and effect: “the cause is what determines the effect in time, and determines it as the consequence.” So starting from the observation that we judge events to occur in a temporal order, Kant concludes that we must have a pure concept of cause and effect.

What about particular causal rules, like that striking a match produces a flame? Such rules are not derived solely from experience, but also from the pure concept of causality, on which their existence depends. It is the presence of the pure concept that allows the inference of these particular rules from experience, even though they postulate a necessary connection.

Now we can see how different Kant and Hume’s conceptions of causality are. While Hume thought that the traditional concept of causality as a necessary connection was unrescuable and extraneous to our perceptions, Kant sees it as a bedrock principle of experience that is necessary for us to be able to make sense of our perceptions at all. Kant rejects Hume’s definition of cause in terms of constant conjunction on the grounds that it “cannot be reconciled with the scientific a priori cognitions that we actually have.”

Despite this great gulf between the two philosophers’ conceptions of causality, there are some similarities. As we saw above, Kant agrees wholeheartedly with Hume that perception alone is insufficient for concluding that there is a necessary connection between events. He also agrees that a purely analytic approach is insufficient. Since Kant sees pure intuitions and pure concepts as features of the mind, not the external world, both philosophers deny that causation is an objective relationship between things in themselves (as opposed to perceptions of things). Of course, Kant would deny that this makes causality an illusion, just as he denied that space and time are made illusory by his philosophy.

Of course, it’s impossible to know to what extent the two philosophers would have actually agreed, had Hume been able to read Kant’s responses to his works. Would he have been convinced that synthetic a priori judgments really exist? If so, would he accept Kant’s pure intuitions and pure concepts? I suspect that at the crux of their disagreement would be Kant’s claim that math is synthetic a priori. While Hume never explicitly proclaims math’s analyticity (he didn’t have the term, after all), it seems more in line with his views on algebra and arithmetic as purely concerning the way that ideas relate to one another. It is also more in line with the axiomatic approach to mathematics familiar to Hume, in which one defines a set of axioms from which all truths about the mathematical concepts involved necessarily follow.

If Hume did maintain math’s analyticity, then Kant’s arguments about the importance of synthetic a priori knowledge would probably hold much less sway for him, and would largely amount to an appeal to the validity of metaphysical knowledge, which Hume already doubted. Hume also would likely want to resist Kant’s idealism; in Section XII of the Enquiry he mocks philosophers that doubt the connection between the objects of our senses and external objects, saying that if you “deprive matter of all its intelligible qualities, both primary and secondary, you in a manner annihilate it and leave only a certain unknown, inexplicable something as the cause of our perceptions – a notion so imperfect that no skeptic will think it worthwhile to contend against it.”

How to Learn From Data, Part 2: Evaluating Models

I ended the last post by saying that the solutions to the problem of overfitting all rely on the concept of a model. So what is a model? Quite simply, a model is just a set of functions, each of which is a candidate for the true distribution that generates the data.

Why use a model? If we think about a model as just a set of functions, this seems kinda abstract and strange. But I think it can be made intuitive, and that in fact we reason in terms of models all the time. Think about how physicists formulate their theories. Laws of physics have two parts: First they specify the types of functional relationships there are in the world. And second, they specify the value of particular parameters in those functions. This is exactly how a model works!

Take Newton’s theory of gravity. The fundamental claim of this theory is that there is a certain functional relationship between the quantities a (the acceleration of an object), M (the mass of a nearby object), and r (the distance between the two objects): a ~ M/r². To make this relationship precise, we need to include some constant parameter G that says exactly what this proportionality is: a = GM/r².

So we start off with a huge set of probability distributions over observations of the accelerations of particles (one for each value of G), and then we gather data to see which value of G is most likely to be right. Einstein’s theory of gravity was another different model, specifying a different functional relationship between a, M, and r and involving its own parameters. So to compare Einstein’s theory of gravity to Newton’s is to compare different models of the phenomenon of gravitation.
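Here is a minimal sketch of that idea in code, with made-up data in arbitrary units and a made-up noise level. The point is just that “fitting the model” means picking out the best value of the parameter G from the family of functions a = GM/r²:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up observations of (M, r, a) in arbitrary units, generated from a
# "true" value of G plus Gaussian measurement noise on the acceleration.
G_TRUE = 4.0
M = rng.uniform(1.0, 10.0, size=200)
r = rng.uniform(1.0, 5.0, size=200)
a = G_TRUE * M / r**2 + rng.normal(0.0, 0.1, size=200)

# Newton's "model" is the family of functions a = G * M / r^2, one per value of G.
# Under Gaussian noise, the maximum-likelihood G is just the least-squares estimate.
x = M / r**2
G_hat = np.sum(x * a) / np.sum(x**2)
print(f"Estimated G: {G_hat:.3f}   (true value used to generate the data: {G_TRUE})")
```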

When you start to think about it, the concept of a model is an enormous one. Models are ubiquitous, you can’t escape from them. Many fancy methods in machine learning are really just glorified model selection. For instance, a neural network is a model: the architecture of the network specifies a particular functional relationship between the input and the output, and the strengths of connections between the different neurons are the parameters of the model.

So. What are some methods for assessing predictive accuracy of models? The first one we’ll talk about is a super beautiful and clever technique called…

Cross Validation

The fundamental idea of cross validation is that you can approximate how well a model will do on the next data point by evaluating how the model would have done at predicting a subset of your data, if all it had access to was the rest of your data.

I’ll illustrate this with a simple example. Suppose you have a data set of four points, each of which is a tuple of two real numbers. Your two models are T1: that the relationship is linear, and T2: that the relationship is quadratic.

What we can do is go through each data point, selecting each in order, train the model on the non-selected data points, and then evaluate how well this trained model fits the selected data point. (By training the model, I mean something like “finding the curve within the model that best fits the non-selected data points.”) Here’s a sketch of what this looks like:

img_0045.jpeg

There are a bunch of different ways of doing cross validation. We just left out one data point at a time, but we could have left out two points at a time, or three, or any k less than the total number of points N. This is called leave-k-out cross validation. If we instead partition the data into n equally sized chunks and rotate through them, training on the rest and testing on each chunk in turn, we get what’s called n-fold cross validation (so 1/n of the data is held out for testing in each round).

We also have some other choices besides how to partition the data. For instance, we can choose to train our model via a Likelihoodist procedure or a Bayesian procedure. And we can also choose to test our model in various ways, by using different metrics for evaluating the distance between the testing set and the trained model. In leave-one-out cross validation (LOOCV), a pretty popular method, both the training and testing procedures are Likelihoodist.
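Here is a minimal sketch of leave-one-out cross validation for the linear-vs-quadratic comparison, using four made-up data points and squared error as the test metric (both the data and the error metric are just illustrative choices):

```python
import numpy as np

# Four made-up (x, y) data points.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.8])

def loocv_error(x, y, degree):
    """Average squared prediction error, leaving out one point at a time."""
    errors = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        # "Training" = least-squares fit of a degree-d polynomial to the rest.
        coeffs = np.polyfit(x[mask], y[mask], degree)
        prediction = np.polyval(coeffs, x[i])
        errors.append((y[i] - prediction) ** 2)
    return np.mean(errors)

for degree, name in [(1, "T1: linear"), (2, "T2: quadratic")]:
    print(f"{name}: LOOCV error = {loocv_error(x, y, degree):.4f}")
```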

Now, there’s a practical problem with cross-validation, which is that it can take a long time to compute. If we do LOOCV with N data points and look at all ways of leaving out one data point, then we end up doing N optimization procedures (one for each time you train your model), each of which can be pretty slow. But putting aside practical issues, for thinking about theoretical rationality, this is a super beautiful technique.

Next we’ll move on to…

Bayesian Model Selection!

We had Bayesian procedures for evaluating individual functions. Now we can go full Bayes and apply it to models! For each model, we assess the posterior probability of the model given the data using Bayes’ rule as usual:

Pr(M | D) = \frac{Pr(D | M)}{Pr(D)} Pr(M)

Now… what is the prior Pr(M)? Again, it’s unspecified. Maybe there’s a good prior that solves overfitting, but it’s not immediately obvious what exactly it would be.

But there’s something really cool here. In this equation, there’s a solution to overfitting that pops up not out of the prior, but out of the LIKELIHOOD! I explain this here. The general idea is that models that are prone to overfitting get that way because their space of functions is very large. But if the space of functions is large, then the average prior on each function in the model must be small. So larger models have, on average, smaller values of the term Pr(f | M), and correspondingly (via Pr(D | M) = \sum_{f \in M} Pr(D | f, M) Pr(f | M) ) get a weaker update from evidence.
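Here is a toy numeric sketch of that effect, with two hypothetical models treated as finite sets of candidate functions and a uniform within-model prior (all the numbers are made up):

```python
# Toy illustration: a small model and a large model, each a set of candidate
# functions with a uniform within-model prior Pr(f | M). Exactly one function
# in each model fits the data well; the rest fit it poorly.
GOOD_FIT, BAD_FIT = 0.9, 0.001   # made-up values of Pr(D | f)

def marginal_likelihood(n_functions):
    prior_per_function = 1.0 / n_functions
    # Pr(D | M) = sum over f in M of Pr(D | f, M) * Pr(f | M)
    return GOOD_FIT * prior_per_function + BAD_FIT * prior_per_function * (n_functions - 1)

print(f"Small model (2 functions):    Pr(D | M) = {marginal_likelihood(2):.4f}")
print(f"Large model (1000 functions): Pr(D | M) = {marginal_likelihood(1000):.6f}")
# The large, flexible model spreads its prior over many functions, so even
# though it contains a function that fits the data just as well, its marginal
# likelihood -- and hence its posterior -- ends up much smaller.
```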

So even without saying anything about the prior, we can see that Bayesian model selection provides a potential antidote to overfitting. But just like before, we have the practical problem that computing Pr(M | D) is in general very hard. Usually evaluating Pr(D | M) involves calculating a really complicated many-dimensional integral, and calculating many-dimensional integrals can be very computationally expensive.

Might we be able to find some simple approximation to the posterior Pr(M | D)? It turns out that yes, we can. Take another look at the equation above.

Pr(M | D) = \frac{Pr(D | M)}{Pr(D)} Pr(M)

Since Pr(D) is a constant across models, we can ignore it when comparing models. So for our purposes, it’s enough to maximize the quantity:

Pr(D | M) Pr(M) = \sum_{f \in M} Pr(D | f, M) Pr(f | M) Pr(M)

If we assume that Pr(D | f) is twice differentiable in the model parameters and that Pr(f) behaves roughly linearly around the maximum likelihood function, we can approximate this as:

Pr(D | M) Pr(M) \approx \frac{Pr(D | f^*(D))}{N^{k/2}} Pr(M)

f* is the maximum likelihood function within the model, and k is the number of parameters of the model (e.g. k = 2 for a linear model and k = 3 for a quadratic model).

We can make this a little neater by taking a logarithm and defining a new quantity: the Bayesian Information Criterion.

argmax_M [ Pr(M | D) ] \\~\\ = argmax_M [ Pr(D | M) Pr(M) ] \\~\\ \approx argmax_M [ \frac{Pr(D | f^*(D))}{N^{k/2}} Pr(M) ] \\~\\ = argmax_M [ \log Pr(M) + \log Pr(D | f^*(D)) - \frac{k}{2} \log(N) ] \\~\\ = argmin_M [ BIC - \log Pr(M) ]

BIC = \frac{k}{2} \log(N) - \log \left( Pr(D | f^*(D)) \right)

Thus, to maximize the posterior probability of a model M is roughly the same as to minimize the quantity BIC – log Pr(M).

If we assume that our prior over models is constant (that is, that all models are equally probable at the outset), then we just minimize BIC.

Bayesian Information Criterion

Notice how incredibly simple this expression is to evaluate! If all we’re doing is minimizing BIC, then we only need to find the maximum likelihood function f*(D), compute its log likelihood log Pr(D | f*(D)), and add the penalty term (k/2) log(N)! This penalty scales in proportion to the model complexity, and thus helps us avoid overfitting.
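To make this concrete, here is a sketch that computes this post’s version of BIC for a few polynomial models on made-up data, assuming independent Gaussian noise with a known standard deviation (a simplifying assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: actually generated by a line plus Gaussian noise with a known
# standard deviation SIGMA.
SIGMA = 0.5
N = 50
x = np.linspace(0.0, 5.0, N)
y = 2.0 * x + 1.0 + rng.normal(0.0, SIGMA, N)

def bic(x, y, degree):
    k = degree + 1                        # number of parameters in the polynomial model
    coeffs = np.polyfit(x, y, degree)     # maximum likelihood fit f* under Gaussian noise
    residuals = y - np.polyval(coeffs, x)
    # log Pr(D | f*) for independent Gaussian noise with standard deviation SIGMA
    log_likelihood = np.sum(-0.5 * (residuals / SIGMA) ** 2
                            - np.log(SIGMA * np.sqrt(2.0 * np.pi)))
    return 0.5 * k * np.log(N) - log_likelihood   # BIC as defined in this post

for degree, name in [(1, "linear"), (2, "quadratic"), (5, "degree-5")]:
    print(f"{name:>9}: BIC = {bic(x, y, degree):.2f}")
```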

We can think about this as a way to make precise the claims above about Bayesian Model Selection penalizing overfitting in the likelihood. Remember that minimizing BIC made sense when we assumed a uniform prior over models (and therefore when our prior doesn’t penalize overfitting). So even when we don’t penalize overfitting in the prior, we still end up getting a penalty! This penalty must come from the likelihood.

Some more interesting facts about BIC:

  • It is approximately equal to the minimum description length criterion (which I haven’t discussed here).
  • It is only valid when N >> k. So it’s not good for small data sets or large models.
  • If the truth is contained within your set of models, then BIC will select the truth with probability 1 as N goes to infinity.

Okay, stepping back. We sort of have two distinct clusters of approaches to model selection. There’s Bayesian Model Selection and the Bayesian Information Criterion, and then there’s Cross Validation. Both sound really nice. But how do they compare? It turns out that in general they give qualitatively different answers. And in general, the answers you get using cross validation tend to be more predictively accurate than those that you get from BIC/BMS.

A natural question to ask at this point: BIC was a nice simple approximation to BMS. Is there a corresponding nice simple approximation to cross validation? Well, first we must ask: which cross validation? Remember that there was a plethora of different forms of cross validation, each corresponding to a slightly different criterion for evaluating fits. We can’t assume that all these methods give the same answer.

Let’s choose leave-one-out cross validation. It turns out that yes, there is a nice approximation that is asymptotically equivalent to LOOCV! This is called the Akaike information criterion.

Akaike Information Criterion

First, let’s define AIC:

AIC = k - \log \left( Pr(D | f^*(D)) \right)

Like always, k is the number of parameters in M and f* is chosen by the Likelihoodist approach described in the last post.

What you get by minimizing this quantity is asymptotically equivalent to what you get by doing leave-one-out cross validation. Compare this to BIC:

BIC = \frac{k}{2} \log (N) - \log \left( Pr(D | f^*(D) ) \right)

There’s a qualitative difference in how the parameter penalty is weighted! BIC is going to have a WAY higher complexity penalty than AIC. This means that AIC should in general choose less simple models than BIC.
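To see how big that difference is, here is a tiny sketch that just tabulates the two penalty terms (in this post’s conventions) for a few sample sizes and parameter counts:

```python
import numpy as np

# Compare the complexity penalties in this post's conventions:
#   AIC penalty = k,     BIC penalty = (k / 2) * log(N)
for N in [20, 100, 1_000, 100_000]:
    for k in [2, 3, 10]:
        aic_penalty = k
        bic_penalty = 0.5 * k * np.log(N)
        print(f"N = {N:>7,}, k = {k:>2}: AIC penalty = {aic_penalty:>4}, "
              f"BIC penalty = {bic_penalty:6.1f}")
# Once log(N) > 2 (roughly N > 8), BIC penalizes each extra parameter more
# heavily than AIC, and the gap keeps growing with the sample size.
```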

Now, we’ve already seen one reason why AIC is good: it’s a super simple approximation of LOOCV, and LOOCV is good. But wait, there’s more! AIC can be derived as an unbiased estimator of the Kullback-Leibler divergence D_KL.

What is D_KL? It’s a measure of the information-theoretic distance of a model from truth. For instance, if the true generating distribution for a process is f, and our model of this process is g, then D_KL tells us how much information we lose by using g to represent f. Formally, D_KL is:

D_{KL}(f, g) = \int { f(x) \log \left(\frac{f(x)} {g(x)} \right) dx }

Notice that to actually calculate this quantity, you have to already know what the true distribution f is. So that’s infeasible. BUT! AIC gives you an unbiased estimate of its value! (The proof of this is complicated, and makes the assumptions that N >> k and that the model is not far from the truth.)

Thus, AIC gives you an unbiased estimate of which model is closest to the truth. Even if the truth is not actually contained within your set of models, what you’ll end up with is the model closest to it. And correspondingly, you’ll end up with a theory of the process that is as close to how the process actually works as your set of models allows.
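As a quick illustration of what D_KL measures, here is a sketch that numerically approximates it for a made-up “true” distribution and two candidate models (all Gaussian, chosen arbitrarily):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def kl_divergence(f, g, x):
    """Grid approximation of D_KL(f, g) = integral of f(x) log(f(x) / g(x)) dx."""
    fx, gx = f(x), g(x)
    return np.sum(fx * np.log(fx / gx)) * (x[1] - x[0])

x = np.linspace(-10.0, 10.0, 10_000)
truth = lambda t: normal_pdf(t, 0.0, 1.0)        # made-up "true" distribution f
close_model = lambda t: normal_pdf(t, 0.2, 1.0)  # a model g that is close to f
far_model = lambda t: normal_pdf(t, 3.0, 2.0)    # a model g that is far from f

print(f"D_KL(truth, close model) = {kl_divergence(truth, close_model, x):.4f}")  # ~0.02
print(f"D_KL(truth, far model)   = {kl_divergence(truth, far_model, x):.4f}")    # ~1.44
```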

There’s another version of AIC that works well for smaller sample sizes and larger models called AICc.

AICc = AIC + \frac {k (k + 1)} {N - (k + 1)}

And interestingly, AIC can be derived as an approximation to Bayesian Model Selection with a particular prior over models. (Remember earlier when we showed that argmax_M \left( Pr(M | D) \right) \approx argmin_M \left( BIC - \log(Pr(M)) \right) ? Well, just set \log Pr(M) = BIC - AIC = k \left( \frac{1}{2} \log N - 1 \right) , and you get the desired result.) Interestingly, this prior ends up rewarding theories for having lots of parameters! This looks pretty bad… it seems like AIC is what you get when you take Bayesian Model Selection and then try to choose a prior that favors overfitting theories. But the practical use and theoretical virtues of AIC warrant, in my opinion, taking a closer look at what’s going on here. Perhaps what’s going on is that the likelihood term Pr(D | M) is actually doing too much to avoid overfitting, so in the end what we need in our prior is one that avoids underfitting! Regardless, we can think of AIC as specifying the unique good prior on models that optimizes for predictive accuracy.

There’s a lot more to be said from here. This is only a short wade into the waters of statistical inference and model selection. But I think this is a good place to stop, and I hope that I’ve given you a sense of the philosophical richness of the various different frameworks for inference I’ve presented here.