The problem with the many worlds interpretation of quantum mechanics

The Schrödinger equation is the formula that describes the dynamics of quantum systems – how small stuff behaves.

One fundamental feature of quantum mechanics that differentiates it from classical mechanics is the existence of something called superposition. In the same way that a particle can be in the state of “being at position A” and could also be in the state of “being at position B”, there’s a weird additional possibility that the particle is in the state of “being in a superposition of being at position A and being at position B”. It’s necessary to introduce a new word for this type of state, since it’s not quite like anything we are used to thinking about.

Now, people often talk about a particle in a superposition of states as being in both states at once, but this is not technically correct. The behavior of a particle in a superposition of positions is not the behavior you’d expect from a particle that was at both positions at once. Suppose you sent a stream of small particles towards each position and looked to see if either one was deflected by the presence of a particle at that location. You would always find that exactly one of the streams was deflected. Never would you observe the particle having been in both positions, deflecting both streams.

But it’s also just as wrong to say that the particle is in either one state or the other. Again, particles simply do not behave this way. Throw a bunch of electrons, one at a time, through a pair of thin slits in a wall and see how they spread out when they hit a screen on the other side. What you’ll get is a pattern that is totally inconsistent with the image of the electrons always being either at one location or the other. Instead, the pattern you’d get only makes sense under the assumption that the particle traveled through both slits and then interfered with itself.

If a superposition of A and B is not the same as “A and B” and it’s not the same as “A or B”, then what is it? Well, it’s just that: a superposition! A superposition is something fundamentally new, with some of the features of “and” and some of the features of “or”. We can do no better than to describe the empirically observed features and then give that cluster of features a name.

Now, quantum mechanics tells us that for any two possible states that a system can be in, there is another state that corresponds to the system being in a superposition of the two. In fact, there’s an infinity of such superpositions, each corresponding to a different weighting of the two states.

The Schrödinger equation is what tells us how quantum mechanical systems evolve over time. And since all of nature is just one really big quantum mechanical system, the Schrödinger equation should also tell us how we evolve over time. So what does the Schrödinger equation tell us happens when we take a particle in a superposition of A and B and make a measurement of it?

The answer is clear and unambiguous: The Schrödinger equation tells us that we ourselves enter into a superposition of states, one in which we observe the particle in state A, the other in which we observe it in B. This is a pretty bizarre and radical answer! The first response you might have may be something like “When I observe things, it certainly doesn’t seem like I’m entering into a superposition… I just look at the particle and see it in one state or the other. I never see it in this weird in-between state!”

But this is not a good argument against the conclusion, as it’s exactly what you’d expect by just applying the Schrödinger equation! When you enter into a superposition of “observing A” and “observing B”, neither branch of the superposition observes both A and B. And naturally, since neither branch of the superposition “feels” the other branch, nobody freaks out about being superposed.

But there is a problem here, and it’s a serious one. The problem is the following: Sure, it’s compatible with our experience to say that we enter into superpositions when we make observations. But what predictions does it make? How do we take what the Schrödinger equation says happens to the state of the world and turn it into a falsifiable experimental setup? The answer appears to be that we can’t. At least, not using just the Schrödinger equation on its own. To get out predictions, we need an additional postulate, known as the Born rule.

This postulate says the following: For a system in a superposition, each branch of the superposition has an associated complex number called the amplitude. The probability of observing any particular branch of the superposition upon measurement is simply the squared magnitude of that branch’s amplitude.

For example: A particle is in a superposition of positions A and B. The amplitude attached to A is 0.6. The amplitude attached to B is 0.8. If we now observe the position of the particle, we will find it to be at either A with probability (0.6)^2 (i.e. 36%), or B with probability (0.8)^2 (i.e. 64%).
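To make the arithmetic concrete, here’s a quick check of this example (a minimal sketch; the amplitudes 0.6 and 0.8 are just the illustrative numbers from the example, written as exact fractions):

```python
from fractions import Fraction

# Born rule for a two-branch superposition: the probability of observing
# a branch is the squared magnitude of that branch's amplitude.
amplitudes = {'A': Fraction(6, 10), 'B': Fraction(8, 10)}
probabilities = {state: amp ** 2 for state, amp in amplitudes.items()}

print(probabilities['A'])  # 9/25, i.e. 36%
print(probabilities['B'])  # 16/25, i.e. 64%
assert sum(probabilities.values()) == 1  # a valid quantum state is normalized
```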

Simple enough, right? The problem is to figure out where the Born rule comes from and what it even means. The rule appears to be completely necessary to make quantum mechanics a testable theory at all, but it can’t be derived from the Schrödinger equation. And it’s not at all inevitable; it could easily have been that probabilities went as the amplitude rather than the amplitude squared. Or why not the fourth power of the amplitude? There’s a substantive claim here, that probabilities go as the squares of the amplitudes that go into the Schrödinger equation, and it needs to be made sense of. There are a lot of different ways that people have tried to do this, and I’ll list a few of the more prominent ones here.

The Copenhagen Interpretation

(Prepare to be disappointed.) The Copenhagen interpretation, which has historically been the dominant position among working physicists, is that the Born rule is just an additional rule governing the dynamics of quantum mechanical systems. Sometimes systems evolve according to the Schrödinger equation, and sometimes according to the Born rule. When they evolve according to the Schrödinger equation, they split into superpositions endlessly. When they evolve according to the Born rule, they collapse into a single determinate state. What determines when the systems evolve one way or the other? Something measurement something something observation something. There’s no real consensus here, nor even a clear set of well-defined candidate theories.

If you’re familiar with the way that physics works, this idea should send your head spinning. The claim here is that the universe operates according to two fundamentally different laws, and that the dividing line between the two hinges crucially on what we mean by the words “measurement” and “observation”. Suffice it to say, if this were the right way to understand quantum mechanics, it would go entirely against the spirit of the goal of finding a fundamental theory of physics. In a fundamental theory of physics, macroscopic phenomena like measurements and observations need to be built out of the behavior of lots of tiny things like electrons and quarks, not the other way around. We shouldn’t find ourselves in the position of trying to give a precise definition to these words, debating whether frogs have the capacity to collapse superpositions or if that requires a higher “measuring capacity”, in order to make predictions about the world.

The Copenhagen interpretation is not an elegant theory, it’s not a clearly defined theory, and it’s fundamentally in tension with the project of theoretical physics. So why has it been, as I said, the dominant approach over the last century to understanding quantum mechanics? This really comes down to physicists not caring enough about the philosophy behind the physics to notice that the approach they are using is fundamentally flawed. In practice, the Copenhagen interpretation works. It allows somebody working in the lab to quickly assess the results of their experiments and to make predictions about how future experiments will turn out. It gives the right empirical probabilities and is easy to implement, even if the fuzziness in the details can start to make your head hurt if you think about it too much. As Jean Bricmont said, “You can’t blame most physicists for following this ‘shut up and calculate’ ethos because it has led to tremendous developments in nuclear physics, atomic physics, solid-state physics and particle physics.” But the Copenhagen interpretation is not good enough for us. A serious attempt to make sense of quantum mechanics requires something more substantive. So let’s move on.

Objective Collapse Theories

These approaches hinge on the notion that the Schrödinger equation really is the only law at work in the universe; it’s just that we have that equation slightly wrong. Objective collapse theories add slight nonlinearities to the Schrödinger equation so that systems sometimes spread out in superpositions and other times collapse into definite states, all according to one single equation. The most famous of these is the spontaneous collapse theory, according to which quantum systems collapse with a probability that grows with the number of particles in the system.

This approach is nice for several reasons. For one, it gives us the Born rule without requiring a new equation. It makes sense of the Born rule as a fundamental feature of physical reality, and makes precise and empirically testable predictions that can distinguish it from other interpretations. The drawback? It makes the Schrödinger equation ugly and complicated, and it adds extra parameters that determine how often collapse happens. And as we know, whenever you start adding parameters you run the risk of overfitting your data.

Hidden Variable Theories

These approaches claim that superpositions don’t really exist; they’re just a high-level consequence of the unusual behavior of the stuff at the smallest level of reality. They deny that the Schrödinger equation is truly fundamental, and say instead that it is a higher-level approximation of an underlying deterministic reality. “Deterministic?! But hasn’t quantum mechanics been shown conclusively to be indeterministic??” Well, not entirely. For a while there was a common sentiment amongst physicists that John von Neumann and others had proved beyond a doubt that no deterministic theory could make the predictions that quantum mechanics makes. Later, subtle mistakes were found in these purported proofs that left a door open for determinism. Today there are well-known fleshed-out hidden variable theories that successfully reproduce the predictions of quantum mechanics, and do so fully deterministically.

The most famous of these is certainly Bohmian mechanics, also called pilot wave theory. Here’s a nice video on it if you’d like to know more, complete with pretty animations. Bohmian mechanics is interesting, appears to work, gives us the Born rule, and is probably empirically distinguishable from other theories (at least in principle). A serious issue with it is that it requires nonlocality, which makes it hard to reconcile with special relativity. Locality is such an important and well-understood feature of our reality that this constitutes a major challenge to the approach.

Many-Worlds / Everettian Interpretations

Ok, finally we talk about the approach that is most interesting in my opinion, and get to the title of this post. The Many-Worlds interpretation says, in essence, that we were wrong to ever want more than the Schrödinger equation. This is the only law that governs reality, and it gives us everything we need. Many-Worlders deny that superpositions ever collapse. The result of us performing a measurement on a system in superposition is simply that we end up in superposition, and that’s the whole story!

So superpositions never collapse, they just go deeper into superposition. There’s not just one you, there’s every you, spread across the different branches of the wave function of the universe. All these yous exist beside each other, living out all your possible life histories.

But then where does Many-Worlds get the Born rule from? Well, uh, it’s kind of a mystery. The Born rule isn’t an additional law of physics, because the Schrödinger equation is supposed to be the whole story. It’s not an a priori rule of rationality, because, as we said before, probabilities could easily have gone as the fourth power of amplitudes, or something else entirely. But if it’s not an a posteriori fact about physics, and also not an a priori knowable principle of rationality, then what is it?

The more I have thought about it, the more important and challenging this issue has seemed for Many-Worlds. It’s hard to see what exactly the rule is even saying in this interpretation. Say I’m about to make a measurement of a system in a superposition of states A and B. Suppose that I know the amplitude of A is much smaller than the amplitude of B. I need some way to say “I have a strong expectation that I will observe B, but there’s a small chance that I’ll see A.” But according to Many-Worlds, a moment from now both observations will be made. There will be a branch of the superposition in which I observe A, and another branch in which I observe B. So what I appear to need to say is something like “I am much more likely to be the me in the branch that observes B than the me in the branch that observes A.” But this is a really strange claim that leads us straight into the thorny philosophical issue of personal identity.

In what sense are we allowed to say that one and only one of the two resulting humans is really going to be you? Don’t both of them have equal claim to being you? They each have your exact memories and life history so far, the only difference is that one observed A and the other B. Maybe we can use anthropic reasoning here? If I enter into a superposition of observing-A and observing-B, then there are now two “me”s, in some sense. But that gives the wrong prediction! Using the self-sampling assumption, we’d just say “Okay, two yous, so there’s a 50% chance of being each one” and be done with it. But obviously not all binary quantum measurements we make have a 50% chance of turning out either way!

Maybe we can say that the world actually splits into some huge number of branches, maybe even infinitely many, and the fraction of the total branches in which we observe A is exactly the square of the amplitude of A? But this is not what the Schrödinger equation says! The Schrödinger equation tells us exactly what happens after we make the observation: we enter a superposition of two states, no more, no less. We’re importing a whole lot into our interpretive apparatus by interpreting this result as claiming the literal existence of an infinity of separate worlds, most of which are identical, and the distribution of which is governed by the amplitudes.

What we’re seeing here is that Many-Worlds, by being too insistent on the reality of the superposition, the sole sovereignty of the Schrödinger equation, and the unreality of collapse, ends up running into a lot of problems in actually doing what a good theory of physics is supposed to do: making empirical predictions. The Many-Worlders can of course use the Born Rule freely to make predictions about the outcomes of experiments, but they have little to say in answer to what, in their eyes, this rule really amounts to. I don’t know of any good way out of this mess.

Basically where this leaves me is where I find myself with all of my favorite philosophical topics: totally puzzled and unsatisfied with all of the options that I can see.

A probability puzzle

[Image: the puzzle – a class of students stands in a circle, and each flips a fair coin. The teacher selects a student whose neighbors both flipped heads, and claims that this student’s own coin was equally likely to have landed heads or tails. Is the teacher right?]

To be totally clear: the question is not assuming that there is ONLY one student whose neighbors both flipped heads, just that there is AT LEAST one such student. You can imagine that the teacher first asks for all students whose neighbors both flipped heads to step forward, then randomly selects one of the students who stepped forward.

Now, take a minute to think about this before reading on…

It seemed initially obvious to me that the teacher was correct. There are exactly as many possible worlds in which the three students are HTH as there are worlds in which they are HHH, right? Knowing how your neighbors’ coins landed shouldn’t give you any information about how your own coin landed, and to think otherwise seems akin to the Gambler’s fallacy.

But in fact, the teacher is wrong! It is in fact more likely that the student flipped tails than heads! Why? Let’s simplify the problem.

Suppose there are just three students standing in a circle (/triangle). There are eight possible ways that their coins might have landed, namely:

HHH
HHT
HTH
HTT
THH
THT
TTH
TTT

Now, the teacher asks all those students whose neighbors both have “H” to step forward, and AT LEAST ONE steps forward. What does this tell us about the possible world we’re in? Well, it rules out all of the worlds in which no student is surrounded by two “H”s, namely TTT, TTH, THT, and HTT. We’re left with the following…

HHH
HHT
HTH
THH

One thing to notice is that we’re left with mostly worlds with lots of heads. The expected total of heads is 2.25, while the expected total of tails is just 0.75. So maybe we should expect that the student is actually more likely to have heads than tails!

But this is wrong. What we want to look at is, in each possible world, the proportion of the students surrounded by heads who themselves have heads.

HHH: 3/3 have H (100%)
HHT: 0/1 have H (0%)
HTH: 0/1 have H (0%)
THH: 0/1 have H (0%)

Since each of these worlds is equally likely, what we end up with is a 25% chance of 100% heads, and a 75% chance of 0% heads. In other words, our credence in the student having heads should be just 25%!

Now, what about for N students? I wrote a program that does a brute-force calculation of the final answer for any N, and here’s what you get:

N    cr(heads)               ≈
3    1/4                     0.25
4    3/7                     0.4286
5    4/9                     0.4444
6    13/32                   0.4063
7    1213/2970               0.4084
8    6479/15260              0.4209
9    10763/25284             0.4246
10   998993/2329740          0.4257
11   24461/56580             0.4323
12   11567641/26580015       0.4352
13   1122812/2564595         0.4378
14   20767139/47153106       0.4404
15   114861079/259324065     0.4430
16   2557308958/5743282545   0.4453
17   70667521/157922688      0.4475

These numbers are not very pretty, though they appear to be gradually converging (I’d guess to 50%).

Can anybody see any patterns here? Or some simple intuitive way to arrive at these numbers?
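For what it’s worth, a brute-force calculation along these lines (my reconstruction of the setup; the original program isn’t shown) can be done with exact arithmetic over all 2^N worlds:

```python
from fractions import Fraction
from itertools import product

def cr_heads(n):
    """Credence that the selected student's own coin landed heads,
    for n students standing in a circle."""
    total, worlds = Fraction(0), 0
    for flips in product('HT', repeat=n):
        # Students whose two neighbors in the circle both flipped heads
        # (index -1 wraps around, closing the circle).
        eligible = [i for i in range(n)
                    if flips[i - 1] == 'H' and flips[(i + 1) % n] == 'H']
        if not eligible:
            continue  # world ruled out: nobody steps forward
        worlds += 1
        heads = sum(1 for i in eligible if flips[i] == 'H')
        total += Fraction(heads, len(eligible))  # teacher picks uniformly
    return total / worlds

print(cr_heads(3))  # 1/4
print(cr_heads(4))  # 3/7
```

Running it reproduces the first rows of the table above.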

 

Anti-inductive priors

I used to think of Bayesianism as composed of two distinct parts: (1) setting priors and (2) updating by conditionalizing. In my mind, this second part was the crown jewel of Bayesian epistemology, while the first part was a little more philosophically problematic. Conditionalization tells you that for any prior distribution you might have, there is a unique rational set of new credences that you should adopt upon receiving evidence, and tells you how to get it. As to what the right priors are, well, that’s a different story. But we can at least set aside worries about priors with assurances about how even a bad prior will eventually be made up for in the long run after receiving enough evidence.

But now I’m realizing that this framing is pretty far off. It turns out that there aren’t really two independent processes going on, just one (and the philosophically problematic one at that): prior-setting. Your prior fully determines what happens when you update by conditionalization on any future evidence you receive. And the set of priors consistent with the probability axioms is large enough that it allows for this updating process to be extremely irrational.

I’ll illustrate what I’m talking about with an example.

Let’s imagine a really simple universe of discourse, consisting of just two objects and one predicate. We’ll make our predicate “is green” and denote the objects a_1 and a_2. Now, if we are being good Bayesians, then we should treat our credences as a probability distribution over the set of all state descriptions of the universe. These probabilities should all be derivable from some hypothetical prior probability distribution over the state descriptions, such that our credences at any later time are just the result of conditioning that prior on the total evidence we have by that time.

Let’s imagine that we start out knowing nothing (i.e. our starting credences are identical to the hypothetical prior) and then learn that one of the objects (a_1 ) is green. In the absence of any other information, then by induction, we should become more confident that the other object is green as well. Is this guaranteed by just updating?

No! Some priors will allow induction to happen, but others will make you unresponsive to evidence. Still others will make you anti-inductive, becoming more and more confident that the next object is not green the more green things you observe. And all of this is perfectly consistent with the laws of probability theory!

Take a look at the following three possible prior distributions over our simple language:

[Image: a table of three prior distributions P_1, P_2, and P_3 over the four state descriptions of this language]

According to P_1, your new credence in Ga_2 after observing Ga_1 is P_1(Ga_2 | Ga_1) = 0.80, while your prior credence in Ga_2 was 0.50. Thus P_1 is an inductive prior; you get more confident in future objects being green when you observe past objects being green.

For P_2, we have that P_2(Ga_2 | Ga_1) = 0.50, and P_2(Ga_2) = 0.50 as well. Thus P_2 is a non-inductive prior: observing instances of green things doesn’t make future instances of green things more likely.

And finally, P_3(Ga_2 | Ga_1) = 0.20, while P_3(Ga_2) = 0.50. Thus P_3 is an anti-inductive prior. Observing that one object is green makes you less than half as confident that the next object will be green.

The anti-inductive prior can be made even more stark by increasing the gap between the prior probability of Ga_1 \wedge Ga_2 and Ga_1 \wedge \neg Ga_2. It is perfectly consistent with the axioms of probability theory for observing a green object to make you almost entirely certain that the next object you observe will not be green.
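To make this concrete, here is a sketch in code. The joint probabilities below are illustrative numbers of my own choosing, picked only so that the conditional probabilities match the ones quoted above:

```python
from fractions import Fraction as F

# Joint priors over the four state descriptions, in the order:
# (Ga1 & Ga2, Ga1 & ~Ga2, ~Ga1 & Ga2, ~Ga1 & ~Ga2).
# Illustrative numbers, chosen to match the quoted conditionals.
priors = {
    'P1 (inductive)':      [F(4, 10), F(1, 10), F(1, 10), F(4, 10)],
    'P2 (non-inductive)':  [F(1, 4),  F(1, 4),  F(1, 4),  F(1, 4)],
    'P3 (anti-inductive)': [F(1, 10), F(4, 10), F(4, 10), F(1, 10)],
}

for name, (gg, gn, ng, nn) in priors.items():
    assert gg + gn + ng + nn == 1      # each prior is normalized
    prior_ga2 = gg + ng                # P(Ga2) before any evidence
    posterior_ga2 = gg / (gg + gn)     # P(Ga2 | Ga1), by conditionalization
    print(f'{name}: P(Ga2) = {prior_ga2}, P(Ga2 | Ga1) = {posterior_ga2}')
```

All three priors give P(Ga2) = 1/2 before any evidence; they differ only in how conditioning on Ga_1 moves that credence.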

Our universe of discourse here was very simple (one predicate and two objects). But the point generalizes. Regardless of how many objects and predicates there are in your language, you can have non-inductive or anti-inductive priors. And it isn’t even the case that there are fewer anti-inductive priors than inductive priors!

The deeper point here is that the prior is doing all the epistemic work. Your prior isn’t just an initial credence distribution over possible hypotheses, it also dictates how you will respond to any possible evidence you might receive. That’s why it’s a mistake to think of prior-setting and updating-by-conditionalization as two distinct processes. The results of updating by conditionalization are determined entirely by the form of your prior!

This really emphasizes the importance of having good criteria for setting priors. If we’re trying to formalize scientific inquiry, it’s really important to make sure our formalism rules out the possibility of anti-induction. But this just amounts to requiring rational agents to have constraints on their priors that go above and beyond the probability axioms!

What are these constraints? Do they select one unique best prior? The challenge is that actually finding a uniquely rationally justifiable prior is really hard. Carnap tried a bunch of different techniques for generating such a prior and was unsatisfied with all of them, and there isn’t any real consensus on what exactly this unique prior would be. Even worse, all such suggestions seem to end up being hostage to problems of language dependence – that is, that the “uniquely best prior” changes when you make an arbitrary translation from your language into a different language.

It looks to me like our best option is to abandon the idea of a single best prior (and with it, the notion that rational agents with the same total evidence can’t disagree). This doesn’t have to lead to total epistemic anarchy, where all beliefs are just as rational as all others. Instead, we can place constraints on the set of rationally permissible priors that prohibit things like anti-induction. While identifying a set of constraints seems like a tough task, it seems much more feasible than the task of justifying objective Bayesianism.

Making sense of improbability

Imagine that you take a coin that you believe to be fair and flip it 20 times. Each time it lands heads. You say to your friend: “Wow, what a crazy coincidence! There was a 1 in 2^20 chance of this outcome. That’s less than one in a million! Super surprising.”

Your friend replies: “I don’t understand. What’s so crazy about the result you got? Any other possible outcome (say, HHTHTTTHTHHHTHTTHHHH) had exactly the same probability as getting all heads. So what’s so surprising?”

Responding to this is a little tricky. After all, it is the case that for a fair coin, the probability of 20 heads = the probability of HHTHTTTHTHHHTHTTHHHH = roughly one in a million.

[Figure: Simpler Example – Five Tosses]

So in some sense your friend is right that there’s something unusual about saying that one of these outcomes is more surprising than another.

You might answer by saying “Well, let’s parse up the possible outcomes by the number of heads and tails. The outcome I got had 20 heads and 0 tails. Your example outcome had 12 heads and 8 tails. There are many more ways of getting 12 heads and 8 tails than of getting 20 heads and 0 tails, right? And there’s only one way of getting all 20 heads. So that’s why it’s so surprising.”

[Figure: Probability vs. Number of Heads]

Your friend replies: “But hold on, now you’re just throwing out information. Sure my example outcome had 12 heads and 8 tails. But while there’s many ways of getting that number of heads and tails, there’s only exactly one way of getting the result I named! You’re only saying that your outcome is less likely because you’ve glossed over the details of my outcome that make it equally unlikely: the order of heads and tails!”

I think this is a pretty powerful response. What we want is a way to say that HHHHHHHHHHHHHHHHHHHH is surprising while HHTHTTTHTHHHTHTTHHHH is not, not that 20 heads is surprising while 12 heads and 8 tails is unsurprising. But it’s not immediately clear how we can say this.

Consider the information theoretic formalization of surprise, in which the surprisingness of an event E is proportional to the negative log of the probability of that event: Sur(E) = -log_2(P(E)). There are some nice reasons for this being a good definition of surprise, and it tells us that two equiprobable events should be equally surprising. If E is the event of observing all heads and E’ is the event of observing the sequence HHTHTTTHTHHHTHTTHHHH, then P(E) = P(E’) = 1/2^20. Correspondingly, Sur(E) = Sur(E’). So according to one reasonable formalization of what we mean by surprisingness, the two sequences of coin tosses are equally surprising. And yet, we want to say that there is something more epistemically significant about the first than the second.

(By the way, observing 20 heads is roughly 6.5 times more surprising than observing 12 heads and 8 tails, according to the above definition. We can plot the surprise curve to see how maximum surprise occurs at the two ends of the distribution, at which point it is 20 bits.)
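These quantities are easy to compute directly (a quick sketch using the definition above, with logs taken base 2 so that surprise comes out in bits):

```python
import math

def surprise_bits(p):
    """Information-theoretic surprise of an event in bits: -log2(P)."""
    return -math.log2(p)

n = 20
p_sequence = 0.5 ** n                           # any one exact sequence
p_twelve_heads = math.comb(n, 12) * p_sequence  # the event "12 heads, 8 tails"

print(surprise_bits(p_sequence))                # 20.0 bits
print(round(surprise_bits(p_twelve_heads), 2))  # 3.06 bits
print(round(surprise_bits(p_sequence) / surprise_bits(p_twelve_heads), 1))  # 6.5
```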

[Figure: Surprise vs. Number of Heads]

So there is our puzzle: in what sense does it make sense to say that observing 20 heads in a row is more surprising than observing the sequence HHTHTTTHTHHHTHTTHHHH? We certainly have strong intuitions that this is true, but do these intuitions make sense? How can we ground the intuitive implausibility of getting 20 heads? In this post I’ll try to point towards a solution to this puzzle.

Okay, so I want to start out by categorizing three different perspectives on the observed sequence of coin tosses. These correspond to (1) looking at just the outcome, (2) looking at the way in which the observation affects the rest of your beliefs, and (3) looking at how the observation affects your expectation of future observations. In probability terms, these correspond to P(E), P(T | E) for theories T, and P(E’ | E).

Looking at things through the first perspective, all outcomes are equiprobable, so there is nothing more epistemically significant about one than the other.

But considering the second way of thinking about things, there can be big differences in the significance of two equally probable observations. For instance, suppose that our set of theories under consideration are just the set of all possible biases of the coin, and our credences are initially peaked at .5 (an unbiased coin). Observing HHTHTTTHTHHHTHTTHHHH does little to change our prior. It shifts a little bit in the direction of a bias towards heads, but not significantly. On the other hand, observing all heads should have a massive effect on your beliefs, skewing them exponentially in the direction of extreme heads biases.

Importantly, since we’re looking at beliefs about coin bias, our distributions are now insensitive to any details about the coin flip beyond the number of heads and tails! As far as our beliefs about the coin bias go, finding only the first 8 to be tails looks identical to finding the last 8 to be tails. We’re not throwing out the information about the particular pattern of heads and tails, it’s just become irrelevant for the purposes of consideration of the possible biases of the coin.

[Figure: Visualizing the change in beliefs about coin bias]

If we want to give a single value to quantify the difference in epistemic states resulting from the two observations, we can try looking at features of these distributions. For instance, we could look at the change in entropy of our distribution if we see E and compare it to the change in entropy upon seeing E’. This gives us a measure of how different observations might affect our uncertainty levels. (In our example, observing HHTHTTTHTHHHTHTTHHHH decreases uncertainty by about 0.8 bits, while observing all heads decreases uncertainty by 1.4 bits.) We could also compare the means of the posterior distributions after each observation, and see which is shifted most from the mean of the prior distribution. (In this case, our two means are 0.57 and 0.91).

Now, this was all looking at things through what I called perspective #2 above: how observations affect beliefs. Sometimes a more concrete way to understand the effect of intuitively implausible events is to look at how they affect specific predictions about future events. This is the approach of perspective #3. Sticking with our coin, we ask not about the bias of the coin, but about how we expect it to land on the next flip. To assess this, we look at the posterior predictive distributions for each posterior:

[Figure: Posterior Predictive Distributions]

It shouldn’t be too surprising that observing all heads makes you more confident that the coin will land heads on its next flip than observing HHTHTTTHTHHHTHTTHHHH does. But looking at this graph gives a precise answer to how much more confident you should be. And it’s somewhat easier to think about than the entire distribution over coin biases.
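As a concrete sketch, assume for simplicity a uniform prior over the coin’s bias (the prior in this post is peaked at 0.5, so the exact numbers here differ from the graphs); the posterior predictive probability of heads is then given by Laplace’s rule of succession:

```python
from fractions import Fraction

def predictive_heads(heads, tails, a=1, b=1):
    """P(next flip lands heads) after updating a Beta(a, b) prior on the
    coin's bias with the observed counts (Laplace's rule of succession)."""
    return Fraction(heads + a, heads + tails + a + b)

print(predictive_heads(20, 0))  # 21/22, about 0.95
print(predictive_heads(12, 8))  # 13/22, about 0.59
```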

I’ll leave you with an example puzzle that relates to anthropic reasoning.

Say that one day you win the lottery. Yay! Super surprising! What an improbable event! But now compare this to the event that some stranger Bob Smith wins the lottery. This doesn’t seem so surprising. But supposing that Bob Smith buys lottery tickets at the same rate as you, the probability that you win is identical to the probability that Bob Smith wins. So… why is it any more surprising when you win?

This seems like a weird question. Then again, so did the coin-flipping question we started with. We want to respond with something like “I’m not saying that it’s improbable that some random person wins the lottery. I’m interested in the probability of me winning the lottery. And if we partition the outcomes into ‘I win the lottery’ and ‘somebody else wins the lottery’, then clearly it’s much more improbable that I win than that somebody else wins.”

But this is exactly parallel to the earlier “I’m not interested in the precise sequence of coin flips, I’m just interested in the number of heads versus tails.” And the response to it is identical in form: If Bob Smith, a particular individual whose existence you are aware of, wins the lottery and you know it, then it’s cheating to throw away those details and just say “Somebody other than me won the lottery.” When you update your beliefs, you should take into account all of your evidence.

Does the framework I presented here help at all with this case?

A simple probability puzzle

In front of you is an urn containing some unknown quantity of balls. These balls are labeled 1, 2, 3, etc. They’ve been jumbled about so as to be in no particular order within the urn. You initially consider it equally likely that the urn contains 1 ball as that it contains 2 balls, 3 balls, and so on, up to 100 balls, which is the maximum capacity of the urn.

Now you reach in to draw out a ball and read the number on it: 34. What is the most likely theory for how many balls the urn contains?

 

 

(Think of an answer before reading on.)

(…)

The answer turns out to be 34!

This answer probably seems a little unintuitive. Specifically, what seems wrong is that you draw out a single ball and then conclude that it is most likely the highest-numbered ball in the urn. Shouldn’t extreme results be unlikely? But remember, the balls were randomly jumbled about inside the urn, so whether the number on the ball you drew falls at the beginning, middle, or end of the set of numbers is pretty much irrelevant.

What is relevant is the likelihood: Pr(I drew a ball numbered 34 | There are N balls). For N ≥ 34, this is simply 1/N (and for N < 34 it is zero).

In general, comparing the theory that there are N balls to the theory that there are M balls (both at least 34), we look at the likelihood ratio: Pr(I drew a ball numbered 34 | There are N balls) / Pr(I drew a ball numbered 34 | There are M balls). This is simply M/N.

Thus we see that our prior odds get updated by a factor that favors smaller values of N, so long as N ≥ 34. The likelihood is zero up through N = 33, peaks at N = 34, and then decreases steadily as N grows. Since our prior was spread evenly over N = 1 through 100 and zero everywhere else, our posterior is peaked at 34 and declines out to 100, after which it drops to zero.

One way to make this result seem more intuitive is to realize that while strictly speaking the most probable number of balls in the urn is 34, it’s not that much more probable than 35 or 36. The actual probability of 34 is still quite small; it just happens to be a little bit more probable than its larger neighbors (only 35/34 times as probable as 35, for instance). And indeed, for larger values of the maximum capacity of the urn, the absolute posterior probability of 34 shrinks further, even though its edge over 35 stays the same.
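A few lines of code make the shape of this posterior concrete; this is a sketch assuming the uniform prior over 1 to 100 described above.

```python
# Posterior over the number of balls N, assuming a uniform prior over
# N = 1..100 and a single draw of ball number 34.
CAPACITY = 100
DRAWN = 34

# Likelihood of drawing ball 34 is 1/N when N >= 34, and 0 otherwise.
unnormalized = [0.0 if n < DRAWN else 1.0 / n for n in range(1, CAPACITY + 1)]
total = sum(unnormalized)
posterior = {n: w / total for n, w in zip(range(1, CAPACITY + 1), unnormalized)}

best = max(posterior, key=posterior.get)
print(best)                           # 34: the single most probable count
print(posterior[34] / posterior[35])  # only 35/34, about 1.03 times as probable as 35
print(posterior[34])                  # and still a small probability in absolute terms
```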

Deciphering conditional probabilities

How would you evaluate the following two probabilities?

  1. P(B | A)
  2. P(A → B)

In words, the first is “the probability that B is true, given that A is true” and the second is “the probability that if A is true, then B is true.” I don’t know about you, but these sound pretty darn similar to me.

But in fact, it turns out that they’re different. You can prove that P(A → B) is always greater than or equal to P(B | A), with equality only when P(A) = 1 or P(A → B) = 1. The proof of this is not too difficult, but I’ll leave it to you to figure out.
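As a quick numerical sanity check, we can sample random joint distributions over the four truth assignments to A and B and compare the two quantities; they generically come apart, with the material conditional never the less probable of the two.

```python
# Sampling random joint distributions over the four truth assignments to
# (A, B) and comparing P(B | A) with P(A -> B) = P(-A or B).
import random

random.seed(0)

def quantities(w):
    """From weights for (A&B, A&-B, -A&B, -A&-B), return (P(B|A), P(A->B))."""
    total = sum(w)
    p_ab, p_anb, _, _ = (x / total for x in w)
    return p_ab / (p_ab + p_anb), 1.0 - p_anb

results = [quantities([random.random() for _ in range(4)]) for _ in range(10_000)]

# The material conditional is never less probable, and the two generically differ.
print(all(material >= cond - 1e-12 for cond, material in results))
print(max(material - cond for cond, material in results))
```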

Conditional probabilities are not the same as probabilities of conditionals. But maybe this isn’t actually too strange. After all, material conditionals don’t do such a great job of capturing what we actually mean when we say “If… then…” For instance, consult your intuitions about the truth of the sentence “If 2 is odd then 2 is even.” This turns out to be true (because any material conditional with a true consequent is true). Similarly, think about the statement “If I am on Mars right now, then string theory is correct.” Again, this turns out to be true if we treat the “If… then…” as a material conditional (since any material conditional with a false antecedent is true).

The problem here is that we actually use “If… then…” clauses in several different ways, the logical structures of which are not well captured by the material implication. A → B is logically equivalent to “A is false or B is true,” which is not always exactly what we mean by “If A then B”. Sometimes “If A then B” means “B, because A.” Other times it means something more like “A gives epistemic support for B.” Still other times, it’s meant counterfactually, as something like “If A were to be the case, then B would be the case.”

So perhaps what we want is some other formula involving A and B that better captures our intuitions about conditional statements, and maybe conditional probabilities are the same as probabilities in these types of formulas.

But as we’re about to prove, this is wrong too. Not only does the material implication not capture the logical structure of conditional probabilities, but neither does any other logical truth function! You can prove a triviality result: that if such a formula exists, then all statements must be independent of one another (in which case conditional probabilities lose their meaning).

The proof (which assumes the identity continues to hold when we conditionalize on any event):

  1. Suppose that there exists a formula Γ(A, B) such that P(A | B) = P(Γ(A, B)) for every probability function P.
  2. Applying this to the function P(· | A), we get P(A | B & A) = P(Γ | A).
  3. So 1 = P(Γ | A).
  4. Similarly, P(A | B & -A) = P(Γ | -A).
  5. So 0 = P(Γ | -A).
  6. By the law of total probability, P(Γ) = P(Γ | A) P(A) + P(Γ | -A) P(-A).
  7. P(Γ) = 1 * P(A) + 0 * P(-A).
  8. P(Γ) = P(A).
  9. So P(A | B) = P(A).

This is a surprisingly strong result. No matter what your formula Γ is, either it fails to capture the logical structure of the conditional probability P(A | B), or it trivializes it.

We can think of this as saying that the language of first order logic is insufficiently powerful to express the conditionals in conditional probabilities. If you take any first order language and assign probabilities to its sentences, none of those credences will be conditional probabilities. To get conditional probabilities, you have to perform algebraic operations like division on the first order probabilities. This is an important (and unintuitive) thing to keep in mind when trying to map epistemic intuitions onto probability theory.

The Problem of Logical Omniscience

Bayesian epistemology says that rational agents have credences that align with the probability calculus. A common objection is that this standard is really, really demanding. But we don’t have to say that rationality is about having perfectly calibrated credences that match the probability calculus to an arbitrary number of decimal places. Instead we want to say something like “Look, this is just our idealized model of perfectly rational reasoning. We understand that any agent with finite computational capacities is incapable of actually putting real numbers over the set of all possible worlds and updating them with perfect precision. All we say is that the closer to this ideal you are, the better.”

Which raises an interesting question: what do we mean by ‘closeness’? We want some metric to say how rational or irrational a given person is being (and how they can get closer to perfect rationality), but it’s not obvious what this metric should be. It’s also important to notice that the details of this metric are not specified by Bayesianism! If we want a precise theory of rationality that can be applied in the real world, we probably have to layer on at least this one additional premise.

Trying to think about candidates for a good metric is made more difficult by the realization that, descriptively, our actual credences almost certainly don’t form a probability distribution. Humans are notoriously subadditive when comparing the probabilities of disjuncts to those of their disjunctions. And I highly doubt that most of my actual credences are normalized.

That said, even if we imagine that we have some satisfactory metric for comparing probability distributions to non-probability-distributions-that-really-ought-to-be-probability-distributions, our problems still aren’t over. The demandingness objection doesn’t just say that it’s hard to be rational. It says that in some cases the Bayesian standard for rationality doesn’t actually make sense. Enter the problem of logical omniscience.

The Bayesian standard for ideal rationality is the Kolmogorov axioms (or something like it). One of these axioms says that for any tautology T, P(T) = 1. In other words, we should be 100% confident in the truth of any tautology. This raises some thorny issues.

For instance, if the Collatz conjecture is true, then it is a tautology (given the definitions of addition, multiplication, natural numbers, and so on). So a perfectly rational being should instantly adopt a 100% credence in its truth. This already seems a bit wrong to me. Whether or not we have deduced the Collatz conjecture from the axioms looks more like an issue of raw computational power than one of rationality. I want to make a distinction between what it takes to be rational and what it takes to be smart. Raw computing power is not necessarily rationality. Rationality is good software running on that hardware.

But even if we put that worry aside, things get even worse for the Bayesian. Not only is the Bayesian unable to allow reasonable non-1 credences in tautologies, they also have no way to account for the phenomenon of obtaining evidence for mathematical truths.

If somebody comes up to you and shows you that the first 10^20 numbers all satisfy the Collatz conjecture, then, well, the Collatz conjecture is still either a tautology or a contradiction. Updating on the truth of the first 10^20 cases shouldn’t sway your credences at all, because nothing should sway your credences in mathematical truths. Credences of 1 stay 1, always. Same for credences of 0.
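As a toy version of this kind of evidence-gathering, here is a sketch that checks the conjecture over a much smaller initial segment (nothing like 10^20, which is far beyond a quick script):

```python
# Checking the Collatz conjecture over a small initial segment of the
# naturals (we settle for 10^5 rather than 10^20).
def reaches_one(n, max_steps=10_000):
    """Follow the Collatz map from n; report whether we hit 1 within max_steps."""
    for _ in range(max_steps):
        if n == 1:
            return True
        n = 3 * n + 1 if n % 2 else n // 2
    return False

print(all(reaches_one(n) for n in range(1, 100_001)))  # True for this range
```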

That is really really undesirable behavior for an epistemic framework.  At this moment there are thousands of graduate students sitting around feeling uncertain about mathematical propositions and updating on evidence for or against them, and it looks like they’re being perfectly rational to do so. (Both to be uncertain, and to move that uncertainty around with evidence.)

The problem here is not a superficial one. It goes straight to the root of the Bayesian formalism: the axioms that define probability theory. You can’t just throw out the tautology axiom; what you end up with if you do is an entirely different mathematical framework. You’re not talking about probabilities anymore! And without it you don’t even have the ability to say things like P(X) + P(-X) = 1. But keeping it entails that you can’t have non-1 credences in tautologies, and correspondingly that you can’t get evidence for them. It’s just true that P(theorem | axioms) = 1.

Just to push this point one last time: Suppose I ask you whether 79 is a prime number. Probably the first thing that you automatically do is run a few quick tests (is it even? Does it end in a five or a zero? No? Okay, then it’s not divisible by 2 or 5.) Now you add 7 to 9 to see whether the sum (16) is divisible by three. Is it? No. Upon seeing this, you become more confident that 79 is prime. You realize that 79 is only 2 more than 77, which is a multiple of 7 and 11. So 79 can’t be divisible by either 7 or 11. Your credence rises still more. A reliable friend tells you that it’s not divisible by 13. Now you’re even more confident! And so on.
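The chain of checks in that paragraph can be sketched in a few lines (the is_prime helper here is just illustrative trial division, not anything from the text):

```python
# The informal divisibility tests from the paragraph above, followed by a
# full trial-division check.
def is_prime(n):
    """Trial division by every d with d * d <= n."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

print(79 % 2, 79 % 5)  # 1 4: not divisible by 2 or 5
print((7 + 9) % 3)     # 1: the digit sum 16 isn't divisible by 3, so neither is 79
print(77 == 7 * 11)    # True: 79 = 77 + 2, so 79 is divisible by neither 7 nor 11
print(is_prime(79))    # True: trial division settles it
```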

It sure looks like each step of this thought process was perfectly rational. But what is P(79 is prime | 79 is not divisible by 3)? The exact same thing as P(79 is prime): 100%. The challenge for Bayesians is to repair this undesirable behavior, and to explain how we can reason inductively about logical truths.