More on quantum entanglement and irreducibility

A few posts ago, I talked about how quantum mechanics entails the existence of irreducible states – states of particles that in principle cannot be described as the product of their individual components. The classic example of such an entangled state is the two qubit state

|Ψ⟩ = 1/√2 (|00⟩ + |11⟩)

This state describes a system which is in an equal-probability superposition of both particles being |0⟩ and both particles being |1⟩. As it turns out, this state cannot be expressed as the product of two single-qubit states.

A friend of mine asked me a question about this that was good enough to deserve its own post in response. Start by imagining that Alice and Bob each have a coin. They each put their coin inside a small box with heads facing up. Now they close their respective boxes, and shake them up in exactly the same way. This is important (as well as unrealistic)! We suppose that whatever happens to the coin in Alice’s box also happens to the coin in Bob’s box.

Now we have two boxes, each of which contains a coin, and these coins are guaranteed to be facing the same way. We just don’t know what way they are facing.

Alice and Bob pick up their boxes, being very careful to not disturb the states of their respective coins, and travel to opposite ends of the galaxy. The Milky Way is 100,000 light years across, so any communication between the two now would take a minimum of 100,000 years. But if Alice now opens her box, she instantly knows the state of Bob’s coin!

So while Alice and Bob cannot send messages about the state of their boxes any faster than 100,000 years, they can instantly receive information about each others’ boxes by just observing their own! Is this a contradiction?

No, of course not. While Alice does learn something about Bob’s box, this is not because of any message passed between the two. It is the result of the fact that in the past the configurations of their coins were carefully designed to be identical. So what seemed on its face to be special and interesting turns out to be no paradox at all.

Finally, we get to the question my friend asked. How is this any different from the case of entangled particles in quantum mechanics??

Both systems would be found in the states |00⟩ and |11⟩ with equal probability (where |0⟩ is heads and |1⟩ is tails). And both have the property that learning the state of one instantly tells you the state of the other. Indeed, the coins-in-boxes system also has the property of irreducibility that we talked about before! Try as we might, we cannot coherently treat the system of both coins as the product of two independent coins, as doing so would ignore the statistical dependence between them.

(Which, by the way, is exactly the sort of statistical dependence that justifies timeless decision theory and makes it a necessary update to decision theory.)

I love this question. The premise of the question is that we can construct a classical system that behaves in just the same supposedly weird ways that quantum systems behave, and thus make sense of all this mystery. And answering it requires that we get to the root of why quantum mechanics is a fundamentally different description of reality than anything classical.

So! I’ll describe the two primary disanalogies between entangled particles and “entangled” coins.

Epistemic Uncertainty vs Fundamental Indeterminacy

First disanalogy. With the coins, either they are both heads or they are both tails. There is an actual fact in the world about which of these two is true, and the probabilities we reference when we talk about the chance of HH or TT represent epistemic uncertainty. There is a true determinate state of the coins, and probability only arises as a way to deal with our imperfect knowledge.

On the other hand, according to the mainstream interpretation of quantum mechanics, the state of the two particles is fundamentally indeterminate. There isn’t a true fact out there waiting to be discovered about whether the state is |00⟩ or |11⟩. The actual state of the system is this unusual thing called a superposition of |00⟩ and |11⟩. When we observe it to be |00⟩, the state has now actually changed from the superposition to the determinate state.

We can phrase this in terms of counterfactuals: If when we look at the coins, we see that they are HH, then we know that they were HH all along. In particular, we know that if we had observed them a moment later or earlier, we would have gotten HH with 100% certainty. Given that we actually observed HH, the probability that we would have observed HH is 100%.

But if we observe the state of the particles to be |00⟩, this does not mean that had we observed it a moment before, we would be guaranteed to get the same answer. Given that we actually observed |00⟩, the probability that we would have observed |00⟩ is still 50%.

(A project for some enterprising reader: see what the truths of these counterfactuals imply for an interpretation of quantum mechanics in terms of Pearl-style causal diagrams. Is it even possible to do?)

Predictive differences

The second difference between the two cases is a straightforward experimental difference. Suppose that Alice and Bob identically prepare thousands of coins as we described before, and also identically prepare thousands of entangled particles. They ensure that the coins are treated exactly the same way, so that they are guaranteed to all be in the same state, and similarly for the entangled pairs.

If they now just observe all of their entangled pairs and coins, they will get similar results – roughly half of the coins will be HH and roughly half of the entangled pairs will be |00⟩. But there are other experiments they could run on the entangled pairs that would give different answers depending on whether we treat the particles as being in superposition or not. I described what these experiments could be in this earlier post – essentially they involve applying an operation that takes qubits in and out of superposition. The conclusion of this is that even if you tried to model the entangled pair as a simple probability distribution similar to the coins, you will get the wrong answer in some experiments.

So we have both a theoretical argument and a practical argument for the difference between these two cases. The key takeaway is the following:

According to quantum mechanics an entangled pair is in a state that is fundamentally indeterminate. When we describe it with probabilities, we are not saying “This probabilistic description is an account of my imperfect knowledge of the state of the system”. We’re saying that nature herself is undecided on what we will observe when we look at the state. (Side note: there is actually a way to describe epistemic uncertainty in quantum mechanics. It is called the density matrix, and is distinct from the description of superpositions.)

In addition, the most fundamental and accurate probability description for the state of the two particles is one that cannot be described as the product of two independent particles. This is not the case with the coins! The most fundamental and accurate probability description for the state of the two coins is either 100% HH or 100% TT (whichever turns out to be the case). What this means is that in the quantum case, not only is the state indeterminate, but the two particles are fundamentally interdependent – entangled. There is no independent description of the individual components of the system, there is only the system as a whole.

Quantum mechanics, reductionism, and irreducibility

Take a look at the following two qubit state:

|Ψ⟩ = 1/√2 (|00⟩ + |11⟩)

Is it possible to describe this two-qubit system in terms of the states of the two individual particles that compose it? The principle of reductionism suggests that it should be possible; after all, surely we can always take a larger system and describe it perfectly well in terms of its components.

But this turns out to not be the case! The above state is a perfectly allowable physical configuration of two particles, but there is no accurate description of the state of the individual particles composing the system!

Multi-particle systems cannot in general be reduced to their parts. This is one of the shocking features of quantum mechanics that is extremely easy to prove, but is rarely emphasized in proportion to its importance. We’ll prove it now.

Suppose we have a system composed of two qubits in states |Ψ1⟩ and |Ψ2⟩. In general, we may write:

|Ψ₁⟩ = α₁|0⟩ + β₁|1⟩
|Ψ₂⟩ = α₂|0⟩ + β₂|1⟩

Now, as we’ve seen in previous posts, we can describe the state of the two qubits as a whole by simply smushing them together as follows:

|Ψ₁⟩ ⊗ |Ψ₂⟩ = α₁α₂|00⟩ + α₁β₂|01⟩ + β₁α₂|10⟩ + β₁β₂|11⟩

So the set of all two-qubit states that can be split into their component parts is the set of all states arising from all possible values of α₁, α₂, β₁, and β₂ such that all states are normalized. I.e.

R = { α₁α₂|00⟩ + α₁β₂|01⟩ + β₁α₂|10⟩ + β₁β₂|11⟩ : the overall state is normalized, |α₁|² + |β₁|² = 1, and |α₂|² + |β₂|² = 1 }

However, there’s also a theorem that says that if any two states are physically possible, then all normalized linear combinations of them are physically possible as well. Because the states |00⟩, |01⟩, |10⟩, and |11⟩ are all physically possible, and because they form a basis for the set of two-qubit states, we can write out the set of all possible states:

A = { a|00⟩ + b|01⟩ + c|10⟩ + d|11⟩ : |a|² + |b|² + |c|² + |d|² = 1 }

Now the philosophical question of whether or not there exist states that are irreducible can be formulated as a precise mathematical question: Does A = R?

And the answer is no! It turns out that A is much much larger than R.

The proof of this is very simple. R and A are both sets defined by four complex numbers, and they share a constraint. But R also has two other constraints, independent of the shared constraint. That is, the two additional constraints cannot be derived from the first (try to derive them yourself! Or better, show that they cannot be derived). So the set of states that satisfy the conditions necessary to be in R must be smaller than the set of states that satisfy the conditions necessary to be in A. This is basically just the statement that when you take a set and impose a constraint on it, you get a smaller set.

An even simpler proof of the irreducibility of some states is to just give an example. Let’s return to our earlier example of a two-qubit state that cannot be decomposed into its parts:

|Ψ⟩ = 1/√2 (|00⟩ + |11⟩)

Suppose that |Ψ⟩ is reducible. Then for both |00⟩ and |11⟩ to have nonzero amplitude, the first qubit must have a nonzero amplitude for |0⟩ and the second must have a nonzero amplitude for |1⟩. But then the amplitude for |01⟩ can’t be zero. Q.E.D.!

More precisely: suppose

|Ψ⟩ = (α₁|0⟩ + β₁|1⟩) ⊗ (α₂|0⟩ + β₂|1⟩) = α₁α₂|00⟩ + α₁β₂|01⟩ + β₁α₂|10⟩ + β₁β₂|11⟩ = 1/√2 (|00⟩ + |11⟩)

Matching amplitudes gives α₁α₂ = 1/√2 and β₁β₂ = 1/√2, so α₁, α₂, β₁, and β₂ are all nonzero. But we also need α₁β₂ = 0, which is impossible.

So here we have a two-qubit state that is fundamentally irreducible. There is literally no possible description of the individual qubits on their own. We can go through all possible states that each qubit might be in, and rule them out one by one.
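The argument above can also be checked numerically. For a general state a|00⟩ + b|01⟩ + c|10⟩ + d|11⟩, being a product state is equivalent to the condition ad = bc (the 2×2 matrix of amplitudes must have zero determinant to factor into an outer product; this generalizes the proof above). A minimal sketch in Python:

```python
import math

def is_product_state(a, b, c, d, tol=1e-12):
    """a|00> + b|01> + c|10> + d|11> is a product state iff a*d == b*c
    (i.e. the 2x2 amplitude matrix [[a, b], [c, d]] has zero determinant)."""
    return abs(a * d - b * c) < tol

# The entangled state 1/sqrt(2) (|00> + |11>)
bell = (1 / math.sqrt(2), 0, 0, 1 / math.sqrt(2))

# A genuine product state: (0.6|0> + 0.8|1>) tensor (0.8|0> + 0.6|1>)
prod = (0.6 * 0.8, 0.6 * 0.6, 0.8 * 0.8, 0.8 * 0.6)

print(is_product_state(*bell))   # False: irreducible
print(is_product_state(*prod))   # True
```

The determinant test is the standard separability criterion for two-qubit pure states; for the entangled state above, ad = 1/2 while bc = 0, which is exactly the contradiction in the proof.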

Let’s pause for a minute to reflect on how totally insane this is. It is a definitive proof that according to quantum mechanics, reality cannot necessarily be described in terms of its smallest components. This is a serious challenge to the idea of reductionism, and I’m still trying to figure out how to adjust my worldview in response. While the notion of reductionism as “higher-level laws can be derived as approximations of the laws of physics” isn’t challenged by this, the notion that “the whole is always reducible to its parts” has to go.

In fact, I’ll show in the next section that if you try to make predictions about a system by analyzing it in terms of its smallest components, you will not in general get the right answer.

Predictive accuracy requires holism

So suppose that we have two qubits in the state we already introduced:

|Ψ⟩ = 1/√2 (|00⟩ + |11⟩)

You might think the following: “Look, the two qubits are either both in the state |0⟩, or both in the state |1⟩. There’s a 50% chance of either one happening. Let’s suppose that we are only interested in the first qubit, and don’t care what happens with the second one. Can’t we just say that the first qubit is in a state with amplitude 1/√2 in both states |0⟩ and |1⟩? After all, this will match the experimental results when we measure the qubit (50% of the time it is |0⟩ and 50% of the time it is |1⟩).”

Okay, but there are two big problems with this. First of all, while it’s true that each particle has a 50% chance of being observed in the state |0⟩, if you model these probabilities as independent of one another, then you will end up concluding that there is a 25% chance of the first particle being in the state |0⟩ and the second being in the state |1⟩. Whereas in fact, this will never happen!

You may reply that this is only a problem if you’re interested in making predictions about the state of the second qubit. If you are solely looking at your single qubit, you can still succeed at predicting what will happen when you measure it.

Well, fine. But the second, more important point is that even if you are able to accurately describe what happens when you measure your single qubit, you can always construct a different experiment you could perform that this same description will give the wrong answer for.

What this comes down to is the observation that quantum gates don’t operate the same way on 1/√2 (|00⟩ + |11⟩) as on 1/√2 (|0⟩ + |1⟩).

Suppose you take your qubit and pretend that the other one doesn’t exist. Then you apply a Hadamard gate to just your qubit and measure it. If you thought that the state was initially 1/√2 (|0⟩ + |1⟩), you will now think that your qubit is in the state |0⟩. You will predict with 100% confidence that if you measure it now, you will observe |0⟩.

But in fact when you measure it, you will find that 50% of the time it is |0⟩ and 50% of the time it is |1⟩! Where did you go wrong? You went wrong by trying to describe the particle as an individual entity.

Let’s prove this. First we’ll figure out what it looks like when we apply a Hadamard gate to only the first qubit, in the two-qubit representation:

(H ⊗ I) · 1/√2 (|00⟩ + |11⟩)
= 1/√2 ( (H|0⟩) ⊗ |0⟩ + (H|1⟩) ⊗ |1⟩ )
= 1/√2 ( 1/√2 (|0⟩ + |1⟩) ⊗ |0⟩ + 1/√2 (|0⟩ − |1⟩) ⊗ |1⟩ )
= 1/2 ( |00⟩ + |10⟩ + |01⟩ − |11⟩ )

So we have a 25% chance of observing each of |00⟩, |01⟩, |10⟩, and |11⟩. Looking at just your own qubit, then, you have a 50% chance of observing |0⟩ and a 50% chance of observing |1⟩.

While your single-qubit description told you to predict a 100% chance of observing |0⟩, you actually would get a 50% chance of |0⟩ and a 50% chance of |1⟩.
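This two-qubit calculation is easy to reproduce numerically. Here’s a small sketch in Python with NumPy, using the basis ordering |00⟩, |01⟩, |10⟩, |11⟩ (an assumed convention):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard gate
I2 = np.eye(2)

# The entangled state 1/sqrt(2) (|00> + |11>), in basis order |00>,|01>,|10>,|11>
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)

# Apply H to the first qubit only: the two-qubit operation is H tensor I
out = np.kron(H, I2) @ bell
probs = np.abs(out) ** 2

print(probs)                  # 25% for each of the four basis states
print(probs[0] + probs[1])    # marginal P(first qubit = |0>) is still ~0.5
```

The marginal for the first qubit stays 50/50, in direct conflict with the single-qubit model’s prediction of a certain |0⟩.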

Okay, but maybe the problem was that we were just using the wrong amplitude distribution for our single qubit. There are many choices we could have made for the amplitudes besides 1/√2 that would have kept the probabilities 50/50. Maybe one of these correctly simulates the behavior of the qubit in response to a quantum gate?

But no. It turns out that even though it is correct that there is a 50/50 chance of observing the qubit to be |0⟩ or |1⟩, there is no amplitude distribution matching this probability distribution that will correctly predict the results of all possible experiments.

Quick proof: We can describe a general single-qubit state with a 50/50 probability of being observed as |0⟩ or |1⟩ as follows:

|Ψ⟩ = 1/√2 (|0⟩ + e^(iφ)|1⟩)   (for some phase φ; an overall global phase has no observable consequences)

For any such |Ψ⟩, we can construct a specially designed quantum gate U that transforms |Ψ⟩ into |0⟩:

U|0⟩ = 1/√2 (|0⟩ + |1⟩),   U|1⟩ = e^(−iφ)/√2 (|0⟩ − |1⟩),   so that U|Ψ⟩ = |0⟩

Applying U to our single qubit, we now expect to observe |0⟩ with 100% probability. But now let’s look at what happens if we consider the state of the combined system. The operation of applying U to only the first qubit is represented by taking the tensor product of U with the identity matrix I: U ⊗ I.

(U ⊗ I) · 1/√2 (|00⟩ + |11⟩)
= 1/√2 ( (U|0⟩) ⊗ |0⟩ + (U|1⟩) ⊗ |1⟩ )
= 1/√2 ( 1/√2 (|0⟩ + |1⟩) ⊗ |0⟩ + e^(−iφ)/√2 (|0⟩ − |1⟩) ⊗ |1⟩ )
= 1/2 ( |00⟩ + |10⟩ + e^(−iφ)|01⟩ − e^(−iφ)|11⟩ )

What we see is that the two-qubit state ends up with a 25% chance of being observed as each of |00⟩, |01⟩, |10⟩, and |11⟩. This means that there is still a 50% chance of observing the first qubit as |0⟩ and a 50% chance of observing it as |1⟩.
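This failure can be checked for any choice of relative phase in the single-qubit model. The sketch below (Python with NumPy; the particular phase value is arbitrary) builds the gate U that sends 1/√2 (|0⟩ + e^(iφ)|1⟩) to |0⟩, then applies U ⊗ I to the actual entangled state:

```python
import numpy as np

phi = 1.234  # an arbitrary relative phase for the candidate single-qubit model

# Candidate single-qubit description: 1/sqrt(2) (|0> + e^{i phi}|1>)
psi = np.array([1, np.exp(1j * phi)]) / np.sqrt(2)

# Unitary designed to send this particular psi to |0>
U = np.array([[1, np.exp(-1j * phi)],
              [1, -np.exp(-1j * phi)]]) / np.sqrt(2)
assert np.allclose(U @ psi, [1, 0])   # the model predicts |0> with certainty

# But the actual state is the entangled 1/sqrt(2) (|00> + |11>)
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
out = np.kron(U, np.eye(2)) @ bell
probs = np.abs(out) ** 2
print(probs.round(3))   # 25% each: the first qubit is still 50/50
```

Whatever phase you pick, the marginal stays 50/50, so no single-qubit amplitude assignment survives this experiment.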

This means that for every possible single qubit description of the first qubit, we can construct an experiment that will give different results than the model predicts. And the only model that always gives the right experimental predictions is a model that considers the two qubits as a single unit, irreducible and impossible to describe independently. 

To recap: The lesson here is that for some quantum systems, if you describe them in terms of their parts instead of as a whole, you will necessarily make the wrong predictions about experimental results. And if you describe them as a whole, you will get the predictions spot on.

So how many states are irreducible?

Said another way, how much larger is A (the set of all states) than R (the set of reducible states)? Well, they’re both infinite sets with the same cardinality (they each have the cardinality of the continuum, |ℝ|). So in this sense, they’re the same size of infinity. But we can think about this by considering the dimensionality of these various spaces.

Let’s take another look at the definitions of A and R:

A = { a|00⟩ + b|01⟩ + c|10⟩ + d|11⟩ : |a|² + |b|² + |c|² + |d|² = 1 }

R = { α₁α₂|00⟩ + α₁β₂|01⟩ + β₁α₂|10⟩ + β₁β₂|11⟩ : the overall state is normalized, |α₁|² + |β₁|² = 1, and |α₂|² + |β₂|² = 1 }

Each set is defined by four complex numbers, or 8 real numbers. If we ignored all constraints, then, our sets would be isomorphic to ℝ⁸.

Now, both sets share the same first constraint, which says that the overall state must be normalized. This constraint cuts one dimension off of the space of solutions, making it isomorphic to ℝ⁷.

That’s the only constraint for A, so we can say that A ~ ℝ⁷. But R involves two further constraints (the normalization conditions for each individual qubit). So we have three total constraints. However, it turns out that one of them can be derived from the others – two normalized qubits, when smushed together, always produce a normalized state. This gives us a net two constraints, meaning that the space of reducible states is isomorphic to ℝ⁶.

The space of irreducible states is what’s left when we subtract all elements of R from A. The dimensionality of this is just the same as the dimensionality of A. (A 3D volume minus a plane is still a 3D volume, a plane minus a curve is still two dimensional, a curve minus a point is still one dimensional, and so on.)

So both the space of total states and the space of irreducible states are 7-real-dimensional, while the space of reducible states is 6-real-dimensional.

[Figure: the 7-dimensional space A of all two-qubit states, with the 6-dimensional subspace R of reducible states cutting through it]

You can visualize this as the space of all states being a volume, through which cuts a plane comprising all the reducible states. The entire rest of the volume is the set of irreducible states. Clearly there are a lot more irreducible states than reducible states.

What about if we consider totally reducible three-qubit states? Now things are slightly different.

The set of all possible three-qubit states (which we’ll denote A_3) is a set of 8 complex numbers (16 real numbers) with one normalization constraint. So A_3 ~ ℝ¹⁵.

The set of all totally reducible three-qubit states (which we’ll denote R_3) is a set of only six complex numbers. Why? Because we only need to specify two complex numbers for each of the three individual qubits that will be smushed together. So we start off with only 12 real numbers. Then we have three constraints, one for the normalization of each individual qubit. And the final normalization constraint (of the entire system) follows from the previous three constraints. In the end, we see that R_3 ~ ℝ⁹.

[Figure: A_3 ~ ℝ¹⁵ while R_3 ~ ℝ⁹]

Now the space of reducible states is six dimensions smaller than the space of all states.

How does this scale for larger quantum systems? Let’s look in general at a system of N qubits.

A_N is a set of 2^N complex amplitudes (2^(N+1) real numbers), one for each N-qubit basis state. There is just one normalization constraint. Thus we have a space with 2^(N+1) − 1 real dimensions.

On the other hand, R_N is a set of only 2N complex amplitudes (4N real numbers), two for each of the N individual qubits. And there are N independent constraints ensuring that each individual qubit’s state is normalized. So we have:

A_N ~ ℝ^(2^(N+1) − 1), while R_N ~ ℝ^(4N − N) = ℝ^(3N)

The point of all of this is that as you consider larger and larger quantum systems, the dimensionality of the space of irreducible states grows exponentially, while the dimensionality of the space of reducible states only grows linearly. If we were to imagine randomly selecting a 20-qubit state from the space of all possibilities, we would be exponentially more likely to end up with a state that cannot be described as a product of its parts.
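These dimension counts are easy to tabulate. A quick sketch, using the formulas from the counting argument above (2^(N+1) − 1 real dimensions for all states, 4N − N = 3N for the fully reducible ones):

```python
def dim_all(n):
    # 2^n complex amplitudes -> 2^(n+1) real numbers, minus one normalization constraint
    return 2 ** (n + 1) - 1

def dim_reducible(n):
    # 2n complex amplitudes (two per qubit) -> 4n real numbers, minus n constraints
    return 3 * n

for n in (2, 3, 10, 20):
    print(n, dim_all(n), dim_reducible(n))
```

For 2 and 3 qubits this reproduces the 7-vs-6 and 15-vs-9 counts above; by 20 qubits the gap is 2,097,151 dimensions versus 60.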

What this means is that irreducibility is not a strange exotic phenomenon that we shouldn’t expect to see in the real world. Instead, we should expect that basically all systems we’re surrounded by are irreducible. And therefore, we should expect that the world as a whole is almost certainly not describable as the sum of individual parts.

100 prisoners problem

I’m in the mood for puzzles, so here’s another one. This one is so good that it deserves its own post.

The setup (from wiki):

The director of a prison offers 100 death row prisoners, who are numbered from 1 to 100, a last chance. A room contains a cupboard with 100 drawers. The director randomly puts one prisoner’s number in each closed drawer. The prisoners enter the room, one after another. Each prisoner may open and look into 50 drawers in any order. The drawers are closed again afterwards.

If, during this search, every prisoner finds his number in one of the drawers, all prisoners are pardoned. If just one prisoner does not find his number, all prisoners die. Before the first prisoner enters the room, the prisoners may discuss strategy—but may not communicate once the first prisoner enters to look in the drawers. What is the prisoners’ best strategy?

Suppose that each prisoner selects 50 drawers at random, and that the prisoners don’t coordinate with one another. Then the chance that any particular prisoner finds their own number is 50%. This means that the chance that all 100 find their own numbers is 1/2¹⁰⁰.

Let me emphasize how crazily small this is. 1/2¹⁰⁰ is 1/1,267,650,600,228,229,401,496,703,205,376; less than one in a nonillion. If 100 prisoners tried exactly this setup every millisecond, it would take them about 40 billion billion years to get out alive once. This is roughly 3 billion times the age of the universe.

Okay, so that’s a bad strategy. Can we do better?

It’s hard to imagine how… While the prisoners can coordinate beforehand, they cannot share any information. So every time a prisoner comes in for their turn at the drawers, they are in exactly the same state of knowledge as if they hadn’t coordinated with the others.

Given this, how could we possibly increase the survival chance beyond 1/2¹⁰⁰?

 

 

(…)

 

 

 

(Try to answer for yourself before continuing)

 

 

 

(…)

 

 

 

Let’s consider a much simpler case. Imagine we have just two prisoners, two drawers, and each one can only open one of them. Now if both prisoners choose randomly, there’s only a 1 in 4 chance that they both survive.

What if they agree to open the same drawer? Then they have reduced their survival chance from 25% to 0%! Why? Because by choosing the same drawer, they either both get the number 1, or they both get the number 2. In either case, they are guaranteed that only one of them gets their own number.

So clearly the prisoners can decrease the survival probability by coordinating beforehand. Can they increase it?

Yes! Suppose that they agree to open different drawers. Then this doubles their survival chance from 25% to 50%. Either they both get their own number, or they both get the wrong number.

The key here is to minimize the overlap between the choices of the prisoners. Unfortunately, this sort of strategy doesn’t scale well. If we have four prisoners, each allowed to open two drawers, then random drawing gives a 1/16 survival chance.

Let’s say they open according to the following scheme: 12, 34, 13, 24 (first prisoner opens drawers 1 and 2, second opens 3 and 4, and so on). Then out of the 24 possible drawer layouts, the only layouts that work are 1432 and 3124:

1234 1243 1324 1342 1423 1432
2134 2143 2314 2341 2413 2431
3124 3142 3214 3241 3412 3421
4123 4132 4213 4231 4312 4321

This gives a 1/12 chance of survival, which is better but not by much.

What if instead they open according to the following scheme: (12, 23, 34, 14)?

1234 1243 1324 1342 1423 1432
2134 2143 2314 2341 2413 2431
3124 3142 3214 3241 3412 3421
4123 4132 4213 4231 4312 4321

Same thing: a 1/12 chance of survival.
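Both four-prisoner schemes can be verified by brute force over all 24 layouts. A sketch in Python (drawers and prisoners are 0-indexed here):

```python
from itertools import permutations

def survival_fraction(scheme):
    """scheme[i] = the drawers (0-indexed) that prisoner i opens.
    Returns the fraction of the 24 layouts in which everyone succeeds."""
    layouts = list(permutations(range(4)))   # layout[d] = number in drawer d
    wins = sum(
        all(any(layout[d] == i for d in drawers)
            for i, drawers in enumerate(scheme))
        for layout in layouts
    )
    return wins / len(layouts)

print(survival_fraction([(0, 1), (2, 3), (0, 2), (1, 3)]))  # 12, 34, 13, 24 -> 1/12
print(survival_fraction([(0, 1), (1, 2), (2, 3), (0, 3)]))  # 12, 23, 34, 14 -> 1/12
```

Each scheme succeeds on exactly 2 of the 24 layouts, confirming the 1/12 figure.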

Scaling this up to 100 prisoners, the odds of survival look pretty measly. Can they do better than this?

 

 

 

(…)

 

 

 

(Try to answer for yourself before continuing)

 

 

 

(…)

 

 

 

It turns out that yes, there is a strategy that does better at ensuring survival. In fact, it does so much better that the survival chance is over 30 percent!

Take a moment to boggle at this. Somehow we can leverage the dependency induced by the prisoners’ coordination to increase the chance of survival by a factor of roughly 4 × 10²⁹, even though none of their states of knowledge are any different. It’s pretty shocking to me that this is possible.

Here’s the strategy: Each time a prisoner opens a drawer, they consult the number in that drawer to determine which drawer they will open next. Thus each prisoner only has to decide on the first drawer to open, and all the rest of the drawers follow from this. Importantly, the prisoner only knows the first drawer they’ll pick; the other 49 are determined by the distribution of numbers in the drawers.

We can think about each drawer as starting a chain through the other drawers. These chains always cycle back to the drawer they started from, the longest possible cycle passing through 100 drawers and the shortest through just 1. Now, each prisoner can guarantee that they are searching a cycle that contains their own number by starting with the drawer labeled with their own number!

So, the strategy is that Prisoner N starts by opening Drawer N, looks at the number inside, then opens the drawer labeled with that number, and repeats until they either find their own number or have opened 50 drawers.
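A prisoner following this strategy succeeds exactly when the cycle containing their number has length at most 50, so we can estimate the survival chance by sampling random permutations and checking their cycle lengths. A Monte Carlo sketch in Python (the trial count is an arbitrary choice):

```python
import random

def all_survive(n=100, limit=50):
    """True iff every prisoner finds their number within `limit` openings,
    i.e. iff no cycle of the random drawer assignment is longer than `limit`."""
    drawers = list(range(n))
    random.shuffle(drawers)          # drawers[d] = number hidden in drawer d
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        length, d = 0, start
        while not seen[d]:           # walk the cycle containing `start`
            seen[d] = True
            d = drawers[d]
            length += 1
        if length > limit:
            return False
    return True

trials = 20_000
wins = sum(all_survive() for _ in range(trials))
print(wins / trials)   # ≈ 0.31
```

Running this gives a survival frequency close to 31%, in line with the exact calculation below.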

The wiki page has a good description of how to calculate the survival probability with this strategy:

The prison director’s assignment of prisoner numbers to drawers can mathematically be described as a permutation of the numbers 1 to 100. Such a permutation is a one-to-one mapping of the set of natural numbers from 1 to 100 to itself. A sequence of numbers which after repeated application of the permutation returns to the first number is called a cycle of the permutation. Every permutation can be decomposed into disjoint cycles, that is, cycles which have no common elements.

In the initial problem, the 100 prisoners are successful if the longest cycle of the permutation has a length of at most 50. Their survival probability is therefore equal to the probability that a random permutation of the numbers 1 to 100 contains no cycle of length greater than 50. This probability is determined in the following.

A permutation of the numbers 1 to 100 can contain at most one cycle of length l > 50. There are exactly (100 choose l) ways to select the numbers of such a cycle. Within this cycle, these numbers can be arranged in (l − 1)! ways, since by cyclic symmetry there are (l − 1)! distinct cycles of length l on a given set of numbers. The remaining numbers can be arranged in (100 − l)! ways. Therefore, the number of permutations of the numbers 1 to 100 with a cycle of length l > 50 is equal to

(100 choose l) · (l − 1)! · (100 − l)! = 100!/l

The probability that a (uniformly distributed) random permutation contains no cycle of length greater than 50 is calculated with the formula for single events and the formula for complementary events, and is thus given by

P = 1 − Σ_{l=51}^{100} (100!/l)/100! = 1 − Σ_{l=51}^{100} 1/l

This is about 31.1828%.

This formula easily generalizes to other numbers of prisoners. We can plot the survival chance using this strategy as a function of the number of prisoners:

[Plot: survival chance of the cycle-following strategy as a function of the number of prisoners]

Amazingly, we can see that this value asymptotes to a value greater than 30%. The precise limit is 1 − ln 2 ≈ 30.685%.
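The exact formula and its limit can be checked directly: for 2n prisoners the survival chance is 1 − Σ_{l=n+1}^{2n} 1/l, which tends to 1 − ln 2 as n grows. A quick check in Python:

```python
from math import log

def survival_probability(n):
    """Exact survival chance for 2n prisoners using the cycle strategy:
    1 - sum_{l = n+1}^{2n} 1/l."""
    return 1 - sum(1 / l for l in range(n + 1, 2 * n + 1))

print(round(survival_probability(50), 6))   # 100 prisoners: 0.311828
print(round(1 - log(2), 6))                 # the limit: 0.306853
```

The sum Σ_{l=n+1}^{2n} 1/l is a Riemann sum for ∫₁² dx/x = ln 2, which is where the limit comes from.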

In other words, no matter how many prisoners there are, we can always ensure that the survival probability is greater than 30%! This is pretty remarkable, and I think there are some deep lessons to be drawn from it, but I’m not sure what they are.

Some simple probability puzzles

(Most of these are taken from Ian Hacking’s Introduction to Probability and Inductive Logic.)

  1. About as many boys as girls are born in hospitals. Many babies are born every week at City General. In Cornwall, a country town, there is a small hospital where only a few babies are born every week.

    Define a normal week as one where between 45% and 55% of babies are female. An unusual week is one where more than 55% or less than 45% are girls.

    Which of the following is true:
    (a) Unusual weeks occur equally often at City General and at Cornwall.
    (b) Unusual weeks are more common at City General than at Cornwall.
    (c) Unusual weeks are more common at Cornwall than at City General.

  2. Pia is 31 years old, single, outspoken, and smart. She was a philosophy major. When a student, she was an ardent supporter of Native American rights, and she picketed a department store that had no facilities for nursing mothers.

    Which of the following statements are most probable? Which are least probable?

    (a) Pia is an active feminist.
    (b) Pia is a bank teller.
    (c) Pia works in a small bookstore.
    (d) Pia is a bank teller and an active feminist.
    (e) Pia is a bank teller and an active feminist who takes yoga classes.
    (f) Pia works in a small bookstore and is an active feminist who takes yoga classes.

  3. You have been called to jury duty in a town with only green and blue taxis. Green taxis dominate the market, with 85% of the taxis on the road.

    On a misty winter night a taxi sideswiped another car and drove off. A witness said it was a blue cab. This witness is tested under similar conditions, and gets the color right 80% of the time.

    You conclude about the sideswiping taxi:
    (a) The probability that it is blue is 80%.
    (b) It is probably blue, but with a lower probability than 80%.
    (c) It is equally likely to be blue or green.
    (d) It is more likely than not to be green.

  4. You are a physician. You think that it’s quite likely that a patient of yours has strep throat. You take five swabs from the throat of this patient and send them to a lab for testing.

    If the patient has strep throat, the lab results are right 70% of the time. If not, then the lab is right 90% of the time.

    The test results come back: YES, NO, NO, YES, NO

    You conclude:
    (a) The results are worthless.
    (b) It is likely that the patient does not have strep throat.
    (c) It is slightly more likely than not that the patient does have strep throat.
    (d) It is very much more likely than not that the patient does have strep throat.

  5. In a country, all families want a boy. They keep having babies till a boy is born. What is the expected ratio of boys to girls in the country?
  6.  Answer the following series of questions:

    If you flip a fair coin twice, do you have the same chance of getting HH as you have of getting HT?

    If you flip the coin repeatedly until you get HH, does this result in the same average number of flips as if you repeat until you get HT?

    If you flip it repeatedly until either HH emerges or HT emerges, is either outcome equally likely?

    You play a game with a friend in which you each choose a sequence of three flips (e.g. HHT and TTH). You then flip the coin repeatedly until one of the two patterns emerges, and whoever’s pattern it is wins the game. You get to see your friend’s choice of pattern before deciding yours. Are you ever able to bias the game in your favor?

    Are you always able to bias the game in your favor?

 

Solutions (and lessons)

  1. The correct answer is (a): Unusual weeks occur more often at Cornwall than at City General. Even though the chance of a boy is the same at Cornwall as it is at City General, the percentage of boys fluctuates more from week to week at the smaller hospital (with N births a week, the typical deviation of the percentage of boys from 50% goes like 1/sqrt(N)). Indeed, in the extreme case where Cornwall has only one birth a week, every week will be an unusual week (100% boys or 0% boys).
  2. There is room to debate the exact answer, but whatever it is, it has to obey some constraints. Namely, the most probable statement cannot be (d), (e), or (f), and the least probable statement cannot be (a), (b), or (c). Why? Because of the conjunction rule of probability, P(A & B) ≤ P(A): each of (d), (e), and (f) is a conjunction that includes one of (a), (b), or (c) as a conjunct, so it cannot be more probable than that conjunct.
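The conjunction rule is easy to check numerically. A minimal sketch, with made-up trait frequencies (the 5% and 30% below are hypothetical, chosen only for illustration):

```python
import random

# Toy world: sample random "people" with independent traits and check
# empirically that a conjunction is never more probable than its conjuncts.
# The trait frequencies are hypothetical.
random.seed(0)
people = [
    {"bank_teller": random.random() < 0.05, "feminist": random.random() < 0.3}
    for _ in range(100_000)
]

p_teller = sum(p["bank_teller"] for p in people) / len(people)
p_both = sum(p["bank_teller"] and p["feminist"] for p in people) / len(people)

assert p_both <= p_teller  # conjunction rule: P(A & B) <= P(A)
print(round(p_teller, 3), round(p_both, 3))
```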

    It turns out that most people violate this constraint. Many people answer that (f) is the most probable description, and (b) is the least probable. This result is commonly interpreted to reveal a cognitive bias known as the representativeness heuristic – essentially, that our judgements of likelihood are made by considering which descriptions most closely resemble the known facts. In this case, (f) most closely resembles the description of Pia, even though it is the most specific statement on the list.

    Another factor to consider is that prior to considering the evidence, your odds on a given person being a bank teller as opposed to working in a small bookstore should be heavily weighted towards her being a bank teller. There are just far more bank tellers than small bookstore workers (maybe a factor of around 20:1). This does not necessarily mean that (b) is more likely than (c), but it does mean that the evidence must discriminate strongly enough against her being a bank teller so as to overcome the prior odds.

    This leads us to another lesson: do not neglect the base rate. It is easy to ignore the prior odds when it feels like we have strong evidence (Pia’s age, her personality, her major, etc.). But the base rates for small bookstore workers and bank tellers are very relevant to the final judgement.

  3. The correct answer is (d) – it is more likely than not that the sideswiper was green. This is a basic case of base rate neglect – many people would see that the witness is right 80% of the time and conclude that the witness’s testimony has an 80% chance of being correct. But this is ignoring the prior odds on the content of the witness’s testimony.

    In this case, there were prior odds of 17:3 (85%:15%) in favor of the taxi being green. The evidence had a strength of 1:4 (20%:80%), resulting in the final odds being 17:12 in favor of the taxi being green. Translating from odds to probabilities, we get a roughly 59% chance of the taxi having been green.

    We could have concluded (d) very simply by just comparing the prior probability (85% for green) with the evidence (80% for blue), and noticing that the evidence would not be strong enough to make blue more likely than green (since 85% > 80%). Being able to very quickly translate between statistics and conclusions is a valuable skill to foster.
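    The odds arithmetic above can be written out explicitly. A sketch of the calculation, using the numbers from the problem:

```python
# Odds-form Bayes for the taxi problem. Prior odds (green : blue) are 85 : 15;
# the witness saying "blue" carries a likelihood ratio of 20 : 80 for green vs. blue.
prior_green, prior_blue = 0.85, 0.15
p_blue_report_given_green = 0.20  # witness wrong about a green cab
p_blue_report_given_blue = 0.80   # witness right about a blue cab

posterior_odds_green = (prior_green * p_blue_report_given_green) / (
    prior_blue * p_blue_report_given_blue
)
p_green = posterior_odds_green / (1 + posterior_odds_green)
print(round(posterior_odds_green, 3), round(p_green, 3))  # 1.417 0.586
```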

  4. The right answer is (d). We calculate this just like we did the last time:

    The results were YES, NO, NO, YES, NO.

    Each YES provides evidence with strength 7:1 (70%/10%) in favor of strep, and each NO provides evidence with strength 1:3 (30%/90%).

    So our strength of evidence is 7:1 ⋅ 1:3 ⋅ 1:3 ⋅ 7:1 ⋅ 1:3 = 49:27, or roughly 1.81:1 in favor of strep. This might be a little surprising… we got more NOs than YESs and the NO was correct 90% of the time for people without strep, compared to the YES being correct only 70% of the time in people with strep.

    Since the evidence is in favor of strep, and we started out already thinking that strep was quite likely, in the end we should be very convinced that the patient has strep. If our prior on the patient having strep was 75% (3:1 odds), then our posterior probability after the evidence will be about 84% (49:9 odds).
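    The same odds arithmetic, spelled out for the strep example. (The 10% figure is the chance of a false YES: the lab is right 90% of the time for patients without strep.)

```python
# Combine the five lab results as independent likelihood ratios, then
# apply the 3:1 prior. A YES is seen 70% of the time with strep vs. 10%
# of the time without strep.
p = {("YES", "strep"): 0.70, ("YES", "healthy"): 0.10,
     ("NO", "strep"): 0.30, ("NO", "healthy"): 0.90}

results = ["YES", "NO", "NO", "YES", "NO"]
evidence_odds = 1.0
for r in results:
    evidence_odds *= p[(r, "strep")] / p[(r, "healthy")]

prior_odds = 3.0  # prior P(strep) = 75%
posterior_odds = prior_odds * evidence_odds
p_strep = posterior_odds / (1 + posterior_odds)
print(round(evidence_odds, 3), round(p_strep, 3))  # 1.815 0.845
```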

    Again, surprising! A patient who sees these results and hears the doctor declare that the tests have strengthened the case for strep might object that this is irrational. But the doctor would be right!

  5. Supposing as before that the chance of any given birth being a boy is equal to the chance of it being a girl, we end up concluding…

    The expected ratio of boys to girls in the country is 1! That is, this strategy doesn’t allow you to “cheat” – it has no impact at all on the ratio. Why? I’ll leave this one for you to figure out. Here’s a diagram for a hint:

    (Diagram omitted.)

    This is important because it applies to the problem of p-hacking. Imagine that researchers repeatedly run studies until they get the results they like, and publish only those. Now suppose instead that all the researchers in the world are required to publish every study they do. Can they still produce a bias in favor of the results they like? No! Even though each researcher stops upon getting the result they like, the aggregate of their published studies is unbiased evidence. They can’t game the system!
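A quick Monte Carlo sketch makes the boys-to-girls result easy to check (the number of families is arbitrary):

```python
import random

# Monte Carlo sketch: every family has children until its first boy.
# Despite the stopping rule, boys and girls come out roughly 1:1.
random.seed(0)
boys = girls = 0
for _ in range(100_000):
    while random.random() < 0.5:  # a girl is born; the family tries again
        girls += 1
    boys += 1  # the family stops at its first boy

print(round(boys / girls, 2))  # close to 1
```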

  6.  Answers, in order:

    If you flip a fair coin twice, do you have the same chance of getting HH as you have of getting HT? (Yes)

    If you flip it repeatedly until you get HH, does this result in the same average number of flips as if you repeat until you get HT? (No)

    If you flip it repeatedly until either HH emerges or HT emerges, is either outcome equally likely? (Yes)

    You play a game with a friend in which you each choose a sequence of three coin flips (e.g. HHT and TTH). You then flip a coin repeatedly until one of the two patterns emerges, and whoever’s pattern it is wins the game. You get to see your friend’s choice of pattern before deciding yours. Are you ever able to bias the game in your favor? (Yes)

    Are you always able to bias the game in your favor? (Yes!)

    Here’s a wiki page with a good explanation of this (the game is known as Penney’s game): LINK. A table from that page illustrating a winning strategy for any choice your friend makes:

    1st player’s choice | 2nd player’s choice | Odds in favour of 2nd player
    HHH | THH | 7 to 1
    HHT | THH | 3 to 1
    HTH | HHT | 2 to 1
    HTT | HHT | 2 to 1
    THH | TTH | 2 to 1
    THT | TTH | 2 to 1
    TTH | HTT | 3 to 1
    TTT | HTT | 7 to 1
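The table’s claims are easy to verify by simulation. A sketch that plays each first-player choice against its counter-pattern from the table:

```python
import random

# Simulation sketch of the pattern game: for each first-player choice,
# the counter-pattern from the table wins well over half the time.
def play(p1, p2, rng):
    seq = ""
    while True:
        seq += rng.choice("HT")
        if seq.endswith(p1):
            return 1
        if seq.endswith(p2):
            return 2

counter = {"HHH": "THH", "HHT": "THH", "HTH": "HHT", "HTT": "HHT",
           "THH": "TTH", "THT": "TTH", "TTH": "HTT", "TTT": "HTT"}

rng = random.Random(0)
results = {}
for p1, p2 in counter.items():
    wins2 = sum(play(p1, p2, rng) == 2 for _ in range(20_000))
    results[p1] = wins2 / 20_000

print(results)  # second player wins a clear majority in every case
```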

On existence

Epistemic status: This is a line of thought that I’m not fully on board with, but have been taking more seriously recently. I wouldn’t be surprised if I object to all of this down the line.

The question of whether or not a given thing exists is not an empty question or a question of mere semantics. It is a question which you can get empirical evidence for, and a question whose answer affects what you expect to observe in the world.

Before explaining this further, I want to draw an analogy between ontology and causation (and my attitudes towards them).

Early in my philosophical education, my attitude towards causality was sympathetic to Humean-style eliminativism, on which causality is a useful construct that isn’t reflected in the fundamental structure of the world. That is, I quickly ‘tossed out’ the notion of causality, content to just talk about the empirical regularities governed by our laws of physics.

Later, upon encountering some statisticians who exposed me to the way that causality is actually inferred in the real world, I began to feel that I had been overly hasty. In fact, it turns out that there is a perfectly rigorous and epistemically accessible formalization of causality, and I now feel that there is no need to toss it out after all.

Here’s an easy way of thinking about this: While the slogan “Correlation does not imply causality” is certainly true, the reverse (“Causality does not imply correlation”) is trickier. In fact, whenever you have a causal relationship between variables, you do end up expecting some observable correlations. So while you cannot deductively conclude a causal relationship from a merely correlational one, you can certainly get evidence for some causal models.
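The “causation implies correlation” direction can be illustrated with a toy structural model (all numbers arbitrary): let X cause Y by setting Y equal to X plus independent noise, and the causal link shows up as a clearly nonzero correlation:

```python
import random

# Toy structural model: X causes Y (Y = X + independent noise).
# The causal link produces a robustly nonzero observed correlation.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(10_000)]
ys = [x + random.gauss(0, 1) for x in xs]

mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
var_x = sum((x - mx) ** 2 for x in xs) / len(xs)
var_y = sum((y - my) ** 2 for y in ys) / len(ys)
corr = cov / (var_x * var_y) ** 0.5

print(round(corr, 2))  # close to 1/sqrt(2), the theoretical value here
```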

This is just a peek into the world of statistical discovery of causal relationships – going further requires a lot more work. But that’s not necessary for my aim here. I just want to express the following parallel:

Rather than trying to set up a perfect set of necessary and sufficient conditions for application of the term ’cause’, we can just take a basic axiom that any account of causation must adhere to. Namely: Where there’s causation, there’s correlation.

And rather than trying to set up a perfect set of necessary and sufficient conditions for the term ‘existence’, we can just take a basic axiom that any account of existence must adhere to. Namely: If something affects the world, it exists.

This should seem trivially obvious. While there could conceivably be entities that exist without affecting anything, clearly any entity that has a causal impact on the world must exist.

The contrapositive of this axiom is that if something doesn’t exist, it does not affect the world.

Again, this is not a controversial statement. And importantly, it makes ontology amenable to scientific inquiry! Why? Because two worlds with different ontologies will have different repertoires of causes and effects. A world in which nothing exists is a world in which nothing affects anything – a dead, static region of nothingness. We can rule out this world on the basis of our basic empirical observation that stuff is happening.

This short argument attempts to show that ontology is a scientifically respectable concept, and not merely a matter of linguistic game-playing. Scientific theories implicitly assume particular ontologies by relying upon laws of nature which reference objects with causal powers. Fundamentally, evidence that reveals the impotence of these supposed causal powers serves as evidence against the ontological framework of such theories.

I think the temptation to wave off ontological questions as somehow disreputable and unscientific actually springs from the fundamentality of this concept. Ontology isn’t a minor add-on to our scientific theories done to appease the philosophers. Instead, it is built in from the ground floor. We can’t do science without implicitly making ontological assumptions. I think it’s better to make these assumptions explicit and debate about the fundamental principles by which we justify them, than to do it invisibly, without further analysis.

Concepts we keep and concepts we toss out

Often when we think about philosophical concepts like identity, existence, possibility, and so on, we find ourselves confronted with numerous apparent paradoxes that require us to revise our initial intuitive conception. Sometimes, however, the revisions necessary to render the concept coherent end up looking so extreme as to make us prefer to just throw out the concept altogether.

An example: claims about identity are implicit in much of our reasoning (“I was thinking about this in the morning” implicitly presumes an identity between myself now and the person resembling me in my apartment this morning). But when we analyze our intuitive conception of identity, we find numerous incoherencies (e.g. through Sorites-style paradoxes in which objects retain their identity through arbitrarily small transformations, but then end up changing their identity upon the conjunction of these transformations anyway).

When faced with these incoherencies, we have a few options: first of all, we can decide to “toss out” the concept of identity (i.e. determine that the concept is too fundamentally paradoxical to be saved), or we can decide to keep it. If we keep it, then we are forced to bite some bullets (e.g. by revising the concept away from our intuitions to a more coherent neighboring concept, or by accepting the incoherencies).

In addition, keeping the concept does not mean thinking that the concept actually successfully applies to anything. For instance, one might keep the concept of free will (in that they have a well-defined personal conception of it), while denying that free will exists. This is the difference between saying “People don’t have free will, and that has consequences X, Y, and Z” and saying “I think that contradictions are so deeply embedded in the concept of free will that it’s fundamentally unsavable, and henceforth I’m not going to reason in terms of it.” I often hop back and forth between these positions, but I think they are really quite different.

One final way to describe this distinction: When faced with a statement like “X exists,” we have three choices: We can say that the statement is true, that it is false, or that it is not a proposition. This third category is what we would say about statements like “Arghleschmargle” or “Colorless green ideas sleep furiously”. While they are sentences that we can speak, they just aren’t the types of things that could be true or false. To throw out the concept of existence is to say that a statement like “X exists” is neither true nor false, and to keep it is to treat it as having a truth value.

I have a clear sense for any given concept whether or not I think it’s better to keep or toss out, and I imagine that others can do the same.  Here’s a table of some common philosophical concepts and my personal response to each:

Keep
Causality
Existence
Justification
Free will
Time
Consciousness
Randomness
Meaning (of life)
Should (ethical)
Essences
Representation / Intentionality

Toss Out
Knowledge
Identity
Possibility
Objects
Forms
Purposes (in the teleological sense)
Beauty

Many of these I’m not sure about, and I imagine I could have my mind easily changed (e.g. identity, possibility, intentionality). Some I’ve even recently changed my mind about (causality, existence). And others I feel quite confident about (e.g. knowledge, randomness, justification).

I’m curious about how others’ would respond… What philosophical concepts do you lean towards keeping, and which concepts do you lean towards tossing out?

Against moral realism

Here’s my primary problem with moral realism: I can’t think of any acceptable epistemic framework that would give us a way to justifiably update our beliefs in the objective truth of moral claims. I.e. I can’t think of any reasonable account of how we could have justified beliefs in objectively true moral principles.

Here’s a sketch of a plausible-seeming account of epistemology. Broad-strokes, there are two sources of justified belief: deduction and induction.

Deduction refers to the process by which we define some axioms and then see what logically follows from them. So, for instance, the axioms of Peano Arithmetic entail the theorem that 1+1=2 – or, in Peano’s language, S(0) + S(0) = S(S(0)). The central reason why reasoning by deduction is reliable is that the truths established are true by definition – they are made true by the way we have constructed our terms, and are thus true in every possible world.
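For instance, in Lean 4 this theorem really does follow directly from the definitions (a sketch; `Nat.succ` plays the role of Peano’s S):

```lean
-- 1 + 1 = 2 holds by definitional unfolding of addition on ℕ.
example : 1 + 1 = 2 := rfl

-- Spelled out with successors, as in the post: S(0) + S(0) = S(S(0)).
example : Nat.succ 0 + Nat.succ 0 = Nat.succ (Nat.succ 0) := rfl
```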

Induction is scientific reasoning – it is the process of taking prior beliefs, observing evidence, and then updating these beliefs (via Bayes’ rule, for instance). The central reason why induction is reliable comes from the notion of causal entanglement. When we make an observation and update our beliefs based upon this observation, the brain state “believes X” has become causally entangled with the truth of the statement X. So, for instance, if I observe a positive result on a pregnancy test, then my belief in the statement “I am pregnant” has become causally entangled with the truth of the statement “I am pregnant.” It is exactly this that justifies our use of induction in reasoning about the world.

Now, where do moral claims fall? They are not derived from deductive reasoning… that is, we cannot just arbitrarily define right and wrong however we like, and then derive morality from these definitions.

And they are also not truths that can be established through inductive reasoning; after all, objective moral truths are not the types of things that have any causal effects on the world.

In other words, even if there are objective moral truths, we would have no way of forming justified beliefs about this. To my mind, this is a pretty devastating situation for a moral realist. Think about it like this: a moral realist who doesn’t think that moral truths have causal power over the world must accept that all of their beliefs about morality are completely causally independent of their truth. If we imagine keeping all the descriptive truths about the world fixed, and only altering the normative truths, then none of the moral realist’s moral beliefs would change.

So how do they know that they’re in the world where their moral beliefs actually do align with the moral reality? Can they point to any reason why their moral beliefs are more likely to be true than any other moral statements? As far as I can tell, no, they can’t!

Now, you might just object to the particular epistemology I’ve offered up, and suggest some new principle by which we can become acquainted with moral truth. This is the path of many professional philosophers I have talked to.

But every attempt that I’ve heard of for doing this begs the question or resorts to just gesturing at really deeply held intuitions of objectivity. If you talk to philosophers, you’ll hear appeals to a mysterious cognitive ability to reflect on concepts and “detect their intrinsic properties”, even if these properties have no way of interacting with the world, or elaborate descriptions of the nature of “self-evident truths.”

(Which reminds me of this meme)

(Meme omitted.)

None of this deals with the central issue in moral epistemology, as I see it. This central issue is: How can a moral realist think that their beliefs about morality are any more likely to be true than any random choice of a moral framework?

Explanation is asymmetric

We all regularly reason in terms of the concept of explanation, but rarely think hard about what exactly we mean by this explanation. What constitutes a scientific explanation? In this post, I’ll point out some features of explanation that may not be immediately obvious.

Let’s start with one account of explanation that should seem intuitively plausible. This is the idea that to explain X to a person is to give that person some information I that would have allowed them to predict X.

For instance, suppose that Janae wants an explanation of why Ari is not pregnant. Once we tell Janae that Ari is a biological male, she is satisfied and feels that the lack of pregnancy has been explained. Why? Well, because had Janae known that Ari was a male, she would have been able to predict that Ari would not get pregnant.

Let’s call this the “predictive theory of explanation.” On this view, explanation and prediction go hand-in-hand. When somebody learns a fact that explains a phenomenon, they have also learned a fact that allows them to predict that phenomenon.

 To spell this out very explicitly, suppose that Janae’s state of knowledge at some initial time is expressed by

K1 = “Males cannot get pregnant.”

At this point, Janae clearly cannot conclude anything about whether Ari is pregnant. But now Janae learns a new piece of information, and her state of knowledge is updated to

K2 = “Ari is a male & males cannot get pregnant.”

Now Janae is warranted in adding the deduction

K’ = “Ari cannot get pregnant”

This suggests that the added information explains Ari’s non-pregnancy precisely because it allows the deduction of Ari’s non-pregnancy.

Now, let’s consider a problem with this view: the problem of relevance.

Suppose a man named John is not pregnant, and somebody explains this with the following two premises:

  1. People who take birth control pills almost certainly don’t get pregnant.
  2. John takes birth control pills regularly.

Now, these two premises do successfully predict that John will not get pregnant. But the fact that John takes birth control pills regularly gives no explanation at all of his lack of pregnancy. Naively applying the predictive theory of explanation gives the wrong answer here.

You might have also been suspicious of the predictive theory of explanation on the grounds that it relied on purely logical deduction and a binary conception of knowledge, not allowing us to accommodate the uncertainty inherent in scientific reasoning. We can fix this by saying something like the following:

What it is to explain X to somebody that knows K is to give them information I such that

(1) P(X | K) is small, and
(2) P(X | K, I) is large.

“Small” and “large’ here are intentionally vague; it wouldn’t make sense to draw a precise line in the probabilities.

The idea here is that explanations are good insofar as they (1) make their explanandum sufficiently likely, where (2) it would be insufficiently likely without them.

We can think of this as a correlational account of explanation. It attempts to root explanations in sufficiently strong correlations.

First of all, we can notice that this doesn’t suffer from the problem of irrelevant information: we can detect irrelevance by looking for probabilistic independencies between variables (John’s pill-taking is independent of his non-pregnancy, since he couldn’t get pregnant either way). So maybe this is a good definition of scientific explanation?

Unfortunately, this “correlational account of explanation” has its own problems.

Take the following example.

(Figure omitted: a flagpole casting a shadow.)

This flagpole casts a shadow of length L because of the angle of elevation of the sun and the height of the flagpole (H). In other words, we can explain the length of the shadow with the following pieces of information:

I1 =  “The angle of elevation of the sun is θ”
I2 = “The height of the flagpole is H”
I3 = Details involving the rectilinear propagation of light and the formation of shadows

Both the predictive and correlational theories of explanation work fine here. If somebody wanted an explanation for why the shadow’s length is L, then telling them I1, I2, and I3 would suffice. Why? Because I1, I2, and I3 jointly allow us to predict the shadow’s length! Easy.

X = “The length of the shadow is L.”
(I1 & I2 & I3) ⇒ X
So I1 & I2 & I3 explain X.

And similarly, P(X | I1 & I2 & I3) is large, and P(X) is small. So on the correlational account, the information given explains X.

But now, consider the following argument:

(I1 & I3 & X) ⇒ I2
So I1 & I3 & X explain I2.

The predictive theory of explanation applies here. If we know the length of the shadow and the angle of elevation of the sun, we can deduce the height of the flagpole. And the correlational account tells us the same thing.

But it’s clearly wrong to say that the explanation for the height of the flagpole is the length of the shadow!

What this reveals is an asymmetry in our notion of explanation. If somebody already knows how light propagates and also knows θ, then telling them H explains L. But telling them L does not explain H!

In other words, the correlational theory of explanation fails, because correlation possesses symmetry properties that explanation does not.
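The flagpole example can be made concrete with a few lines of trigonometry (the height and angle below are arbitrary). The deduction runs equally well in both directions, which is exactly the asymmetry problem:

```python
import math

# The flagpole deduction is symmetric: shadow length and flagpole height
# determine each other, even though only one direction is explanatory.
theta = math.radians(40)  # hypothetical angle of elevation of the sun
H = 10.0                  # hypothetical flagpole height, in meters

L = H / math.tan(theta)            # predict shadow length from height
H_recovered = L * math.tan(theta)  # equally valid deduction in reverse

assert abs(H_recovered - H) < 1e-9
print(round(L, 2))
```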

This thought experiment also points the way to a more complete account of explanation. Namely, the relevant asymmetry between the length of the shadow and the height of the flagpole is one of causality. The reason why the height of the flagpole explains the shadow length but not vice versa, is that the flagpole is the cause of the shadow and not the reverse.

In other words, what this reveals to us is that scientific explanation is fundamentally about finding causes, not merely prediction or statistical correlation. This causal theory of explanation can be summarized in the following:

An explanation of A is a description of its causes that renders it intelligible.

More explicitly, an explanation of A (relative to background knowledge K) is a set of causes of A that render A intelligible to a rational agent that knows K.

Priors in the supernatural

A friend of mine recently told me the following anecdote.

Years back, she had visited an astrologer in India with her boyfriend, who told her the following things: (1) she would end up marrying her boyfriend at the time, (2) down the line they would have two kids, the first a girl and the second a boy, and (3) he predicted the exact dates of birth of both children.

Many years down the line, all of these predictions turned out to be true.

I trust this friend a great deal, and don’t have any reason to think that she misremembered the details or lied to me about them. But at the same time, I recognize that astrology is completely crazy.

Since that conversation, I’ve been thinking about the ways in which we can evaluate our de facto priors in supernatural events by consulting either real-world anecdotes or thought experiments. For instance, if we think that each of these two predictions gave us a likelihood ratio of 100:1 in favor of astrology being true, and if I ended up thinking that astrology was about as likely to be true as false, then I must have started with roughly 1:10,000 odds against astrology being true.
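The back-of-the-envelope calculation here is just odds arithmetic: posterior odds equal prior odds times the product of the likelihood ratios, so even posterior odds after two independent 100:1 updates pin down the prior. A sketch:

```python
# Posterior odds = prior odds × product of likelihood ratios. Working
# backwards from even posterior odds after two 100:1 updates gives
# the implied prior odds against astrology.
likelihood_ratio = 100
n_predictions = 2
posterior_odds = 1.0  # "about as likely to be true as false"

prior_odds = posterior_odds / likelihood_ratio ** n_predictions
print(prior_odds)  # 0.0001, i.e. 1:10,000 against
```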

That’s not crazily low for a belief that contradicts much of our understanding of physics. I would have thought that my prior odds would be something much lower, like 1:10^10 or something. But really put yourself in that situation.

Imagine that you go to an astrologer, who is able to predict an essentially unpredictable sequence of events years down the line, with incredible accuracy. Suppose that the astrologer tells you who you will marry, how many kids you’ll have, and the dates of birth of each. Would you really be totally unshaken by this experience? Would you really believe that it was more likely to have happened by coincidence?

Yes, yes, I know the official Bayesian response – I read it in Jaynes long ago. For beliefs like astrology that contradict our basic understanding of science and causality, we should always reserve some amount of credence for alternative explanations, even if we can’t think of any on the spot. This reserve of credence insures us against jumping to 99% credence upon seeing a psychic repeatedly predict the number in your head, ensuring sanity and a nice simple secular worldview.

But that response is not sufficient to rule out all strong evidence for the supernatural.

Here’s one such category of strong evidence: evidence for which all alternative explanations are ruled out by the laws of physics as strongly as the supernatural hypothesis is ruled out by the laws of physics.

I think that my anecdote is one such case. If it was true, then there is no good natural alternative explanation for it. The reason? Because the information about the dates of birth of my friend’s children did not exist in the world at the time of the prediction, in any way that could be naturally attainable by any human being.

By contrast, imagine you go to a psychic who tells you to put up some fingers behind your back and then predicts over and over again how many fingers you have up. There are hundreds of alternative explanations for this besides “psychics are real and science has failed us.” The reason that there are these alternative explanations is that the information predicted by the psychic existed in the world at the time of the prediction.

But in the case of my friend’s anecdote, the information predicted by the astrologer was lost far in the chaotic dynamics of the future.

What this rules out is the possibility that the astrologer somehow obtained the information surreptitiously by natural means. It doesn’t rule out a host of other explanations, such as that my friend’s perception at the time was mistaken, that her memory of the event is skewed, or that she is lying. I could even, as a last resort, consider the possibility that I hallucinated the entire conversation with her. (I’d like to give the formal title “unbelievable propositions” to the set of propositions that are so unlikely that we should sooner believe that we are hallucinating than accept evidence for them.)

But each of these sources of alternative explanations, with the possible exception of the last, can be made significantly less plausible.

Let me use a thought experiment to illustrate this.

Imagine that you are a nuclear physicist who, with a group of colleagues, has decided to test the predictive powers of a fortune teller. You carefully design an experiment in which a source of true quantum randomness will produce a number between 1 and N. Before the number has been produced, when it still exists only as an unrealized possibility in the wave function, you ask the fortune teller to predict its value.

Suppose that they get it correct. For what value of N would you begin to take their fortune telling abilities seriously?

Here’s how I would react to the success, for different values of N.

N = 10: “Haha, that’s a funny coincidence.”

N = 100: “Hm, that’s pretty weird.”

N = 1000: “What…”

N = 10,000: “Wait, WHAT!?”

N = 100,000: “How on Earth?? This is crazy.”

N = 1,000,000: “Ok, I’m completely baffled.”

I think I’d start taking them seriously as early as N = 10,000. This indicates prior odds of roughly 1:10,000 against fortune-telling abilities (roughly the same as my prior odds against astrology, interestingly!). Once again, this seems disconcertingly low.

But let’s try to imagine some alternative explanations.

As far as I can tell, there are only three potential failure points: (1) our understanding of physics, (2) our engineering of the experiment, (3) our perception of the fortune teller’s prediction.

First of all, if our understanding of quantum mechanics is correct, there is no possible way that any agent could do better than random at predicting the number.

Secondly, we stipulated that the experiment was designed meticulously so as to ensure that the information was truly random, and unavailable to the fortune-teller. I don’t think that such an experiment would actually be that hard to design. But let’s go even further and imagine that we’ve designed the experiment so that the fortune teller is not in causal contact with the quantum number-generator until after she has made her prediction.

And thirdly, we can suppose that the prediction is viewed by multiple different people, all of whom affirm that it was correct. We can even go further and imagine that video was taken, and broadcast to millions of viewers, all of whom agreed. Not all of them could just be getting it wrong over and over again. The only possibility is that we’re hallucinating not just the experimental result, but indeed also the public reaction and consensus on the experimental result.

But the hypothesis of a hallucination now becomes inconsistent with our understanding of how the brain works! A hallucination wouldn’t have the effect of creating a perception of a completely coherent reality in which everybody behaves exactly as normal except that they saw the fortune teller make a correct prediction. We’d expect that if this were a hallucination, it would not be so self-consistent.

Pretty much all that’s left, as far as I can tell, is some sort of Cartesian evil demon that’s cleverly messing with our brains to create this bizarre false reality. If this is right, then we’re left weighing the credibility of the laws of physics against the credibility of radical skepticism. And in that battle, I think, the laws of physics lose out. (Consider that the invalidity of radical skepticism is a precondition for the development of laws of physics in the first place.)

The point of all of this is just to sketch an example where I think we’d have a good justification for ruling out all alternative explanations, at least with an equivalent degree of confidence that we have for affirming any of our scientific knowledge.

Let’s bring this all the way back to where we started, with astrology. The conclusion of this blog post is not that I’m now a believer in astrology. I think that there’s enough credence in the buckets of “my friend misremembered details”, “my friend misreported details”, and “I misunderstood details” so that the likelihood ratio I’m faced with is not actually 10,000 to 1. I’d guess it’s something more like 10 to 1.

But I am now that much less confident that astrology is wrong. And I can imagine circumstances under which my confidence would be drastically decreased. While I don’t expect such circumstances to occur, I do find it instructive (and fun!) to think about them. It’s a good test of your epistemology to wonder what it would take for your most deeply-held beliefs to be overturned.

Patterns of inductive inference

I’m currently reading through Judea Pearl’s wonderful book Probabilistic Reasoning in Intelligent Systems. It’s chock-full of valuable insights into the subtle patterns involved in inductive reasoning.

Here are some of the patterns of reasoning described in Chapter 1, ordered in terms of increasing unintuitiveness. Any good system of inductive inference should be able to accommodate all of the following.

Abduction:

If A implies B, then finding that B is true makes A more likely.

Example: If fire implies smoke, smoke suggests fire.
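Abduction is just Bayes’ rule at work. Here’s a minimal sketch of the fire-and-smoke example, with made-up numbers:

```python
# Abduction: if fire implies smoke (P(smoke | fire) is high),
# then observing smoke should raise the probability of fire.
p_fire = 0.01                 # prior probability of fire (made-up)
p_smoke_given_fire = 0.95     # fire almost always produces smoke
p_smoke_given_no_fire = 0.05  # smoke from other sources

p_smoke = (p_smoke_given_fire * p_fire
           + p_smoke_given_no_fire * (1 - p_fire))
p_fire_given_smoke = p_smoke_given_fire * p_fire / p_smoke

print(p_fire_given_smoke)  # ≈ 0.161, up from the prior of 0.01
```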

Asymmetry of inference:

There are two types of inference that function differently: predictive and diagnostic. Predictive inference reasons from causes to consequences, whereas diagnostic (explanatory) inference reasons from consequences to causes.

Example: Seeing fire suggests that there is smoke (predictive). Seeing smoke suggests that there is a fire (diagnostic).

Induced Dependency:

If you know A, then learning B can suggest C where it wouldn’t have if you hadn’t known A.

Example: Ordinarily, burglaries and earthquakes are unrelated. But if you know that your alarm is going off, then whether or not there was an earthquake is relevant to whether or not there was a burglary.
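We can check this numerically by brute-force enumeration over a toy burglary–earthquake–alarm model (all numbers are made up). The same computation also exhibits the “explaining away” pattern described below:

```python
from itertools import product

P_B, P_E = 0.01, 0.02            # hypothetical priors for burglary and earthquake

def p_alarm(b, e):               # the alarm is a common effect of both causes
    return 0.95 if (b or e) else 0.001

def joint(b, e, a):              # full joint distribution over (B, E, A)
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    return p * (p_alarm(b, e) if a else 1 - p_alarm(b, e))

def prob(pred):                  # P(pred) by summing the joint over all worlds
    return sum(joint(b, e, a) for b, e, a in product([0, 1], repeat=3)
               if pred(b, e, a))

# A priori, burglary and earthquake are independent:
assert abs(prob(lambda b, e, a: b and e) - P_B * P_E) < 1e-12

# Conditioning on the alarm induces a dependency between them:
p_b_given_a   = prob(lambda b, e, a: b and a) / prob(lambda b, e, a: a)
p_b_given_a_e = prob(lambda b, e, a: b and a and e) / prob(lambda b, e, a: a and e)
print(p_b_given_a, p_b_given_a_e)  # ≈ 0.32 vs 0.01: the earthquake explains the alarm away
```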

Correlated Evidence:

Upon discovering that multiple items of evidence have a common origin, the credibility of the hypothesis they support should be decreased.

Example: You learn on a radio report, TV report, and newspaper report that thousands died. You then learn that all three reports got their information from the same source. This decreases the credibility that thousands died.
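In odds form, Bayes’ rule makes the effect easy to quantify. A sketch with made-up numbers, assuming each genuinely independent report is nine times more likely if the event actually happened:

```python
def posterior(prior, likelihood_ratio, n_independent_reports):
    """P(H | n conditionally independent reports), via the odds form of Bayes' rule."""
    odds = (prior / (1 - prior)) * likelihood_ratio ** n_independent_reports
    return odds / (1 + odds)

prior = 0.1    # hypothetical prior that thousands died
lr = 9.0       # each independent report is 9x more likely given the event

p_three_independent = posterior(prior, lr, 3)  # three independent sources
p_one_source        = posterior(prior, lr, 1)  # all three trace back to one source

print(p_three_independent, p_one_source)  # ≈ 0.988 vs 0.5
```

Discovering the common origin collapses three apparent updates into one, dropping the posterior from about 0.99 to 0.5.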

Explaining away:

Finding a second explanation for an item of data makes the first explanation less credible. If A and B are both possible explanations of C, and C is true, then finding that B is true makes A less credible.

Example: Finding that my light bulb emits red light makes it less credible that the red-hued object in my hand is truly red.

Rule of the hypothetical middle:

If two diametrically opposed assumptions impart two different degrees of belief onto a proposition Q, then the unconditional degree of belief should be somewhere between the two.

Example: The plausibility of an animal being able to fly is somewhere between the plausibility of a bird flying and the plausibility of a non-bird flying.
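This is just the law of total probability: P(Q) = P(Q|A)P(A) + P(Q|¬A)P(¬A) is a weighted average of the two conditional beliefs, so it must land between them. With made-up numbers for the flying example:

```python
p_bird = 0.3                  # hypothetical proportion of animals that are birds
p_fly_given_bird    = 0.9
p_fly_given_nonbird = 0.05

# Unconditional belief is a weighted average of the two conditional beliefs:
p_fly = p_fly_given_bird * p_bird + p_fly_given_nonbird * (1 - p_bird)
print(p_fly)                  # ≈ 0.305, between 0.05 and 0.9
```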

Defeaters or Suppressors:

Even if as a general rule B is more likely given A, this does not necessarily mean that learning A makes B more credible. There may be other elements in your knowledge base K that explain A away. In fact, learning A might even make B less likely (as in Simpson’s paradox). In other words, updating beliefs must involve searching your entire knowledge base for defeaters of general rules that are not directly inferentially connected to the evidence you receive.

Example 1: Learning that the ground is wet does not permit us to increase the certainty of “It rained”, because the knowledge base might contain “The sprinkler is on.”
Example 2: You have kidney stones and are seeking treatment. You additionally know that Treatment A makes you more likely to recover from kidney stones than Treatment B. But if you also have the background information that your kidney stones are large, then your recovery under Treatment A becomes less credible than under Treatment B.
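Here is Example 2 worked out with hypothetical counts, chosen to produce the reversal (they are not the numbers from any real study):

```python
# (treatment, stone_size) -> (recoveries, patients); counts are made up.
data = {
    ("A", "small"): (81, 90), ("A", "large"): (6, 10),
    ("B", "small"): (10, 10), ("B", "large"): (63, 90),
}

def rate(treatment, size=None):
    """Recovery rate for a treatment, optionally restricted to one stone size."""
    cells = [(r, n) for (t, s), (r, n) in data.items()
             if t == treatment and (size is None or s == size)]
    recovered, total = map(sum, zip(*cells))
    return recovered / total

print(rate("A"), rate("B"))                    # 0.87 vs 0.73: A wins overall
print(rate("A", "large"), rate("B", "large"))  # 0.60 vs 0.70: B wins for large stones
print(rate("A", "small"), rate("B", "small"))  # 0.90 vs 1.00: B wins for small stones
```

Aggregated over all patients, A looks better only because it was mostly given to the easier small-stone cases; within each subgroup, B wins.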

Non-Transitivity:

Even if A suggests B and B suggests C, this does not necessarily mean that A suggests C.

Example 1: Your card being an ace suggests it is an ace of clubs. If your card is an ace of clubs, then it is a club. But if it is an ace, this does not suggest that it is a club.
Example 2: If the sprinkler was on, then the ground is wet. If the ground is wet, then it rained. But it’s not the case that if the sprinkler was on, then it rained.
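Example 1 can be checked exactly on a standard 52-card deck:

```python
from fractions import Fraction as F

p_ace          = F(4, 52)    # four aces in a standard deck
p_ace_of_clubs = F(1, 52)
p_club         = F(13, 52)

# "Ace" suggests "ace of clubs": posterior 1/4, up from the prior 1/52.
p_aoc_given_ace = p_ace_of_clubs / p_ace
# "Ace of clubs" entails "club" (probability 1).
# But "ace" does not suggest "club": the one club among the four aces
# gives P(club | ace) = 1/4, which is exactly the prior P(club).
p_club_given_ace = p_ace_of_clubs / p_ace
print(p_aoc_given_ace, p_club_given_ace == p_club)
```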

Non-detachment:

Just learning that a proposition has changed in credibility is not enough to analyze the effects of the change; the reason for the change in credibility is relevant.

Example: You get a phone call telling you that your alarm is going off. Worried about a burglar, you head towards your home. On the way, you hear a radio announcement of an earthquake near your home. This makes it more credible that your alarm really is going off, but less credible that there was a burglary. In other words, your alarm going off decreased the credibility of a burglary, because it happened as a result of the earthquake, whereas typically an alarm going off would make a burglary more credible.

✯✯✯

All of these patterns should make a lot of sense to you when you give them a bit of thought. It turns out, though, that accommodating them in a system of inference is no easy matter.

Pearl distinguishes between extensional and intensional systems, and talks about the challenges for each approach. Extensional systems (including fuzzy logic and non-monotonic logic) focus on extending the truth values of propositions from {0,1} to a continuous range of uncertainty [0, 1], and then modifying the rules according to which propositions combine (for instance, the proposition “A & B” gets the truth value min{v(A), v(B)} in some extensional systems and v(A)·v(B) in others). The locality and simplicity of these combination rules turns out to be their primary failing: they lack the subtlety and nuance required to capture the complicated reasoning patterns above. Their syntactic simplicity makes them easy to work with, but curses them with semantic sloppiness.
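The core problem is locality: the probability of a conjunction simply isn’t a function of the probabilities of its conjuncts, so any local combination rule must sometimes get it wrong. A minimal demonstration with made-up distributions:

```python
def fuzzy_and_min(a, b):  return min(a, b)   # one common extensional rule
def fuzzy_and_prod(a, b): return a * b       # another

# Two joint distributions over (A, B) with identical marginals P(A) = P(B) = 0.5:
perfectly_correlated = {(1, 1): 0.5,  (0, 0): 0.5,  (1, 0): 0.0,  (0, 1): 0.0}
independent          = {(1, 1): 0.25, (0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.25}

for joint in (perfectly_correlated, independent):
    print(joint[(1, 1)])      # P(A & B) is 0.5 in one case, 0.25 in the other

# An extensional rule sees only the marginals, so it must give a single answer:
print(fuzzy_and_min(0.5, 0.5), fuzzy_and_prod(0.5, 0.5))  # 0.5 and 0.25
```

Each rule is right for exactly one of the two joints and wrong for the other; no rule operating on marginals alone can be right for both.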

On the other hand, intensional systems (like probability theory) assign degrees of plausibility to entire world-states rather than to individual propositions. This allows for the nuance required to capture all of the above patterns, but results in a huge blowup in complexity. True perfect Bayesianism is ridiculously computationally infeasible, as the operation of belief updating blows up exponentially as the number of atomic propositions increases. Thus, intensional systems are semantically clear, but syntactically messy.

A good summary of this from Pearl (p 12):

We have seen that handling uncertainties is a rather tricky enterprise. It requires a fine balance between our desire to use the computational permissiveness of extensional systems and our ability to refrain from committing semantic sins. It is like crossing a minefield on a wild horse. You can choose a horse with good instincts, attach certainty weights to it and hope it will keep you out of trouble, but the danger is real, and highly skilled knowledge engineers are needed to prevent the fast ride from becoming a disaster. The other extreme is to work your way by foot with a semantically safe intensional system, such as probability theory, but then you can hardly move, since every step seems to require that you examine the entire field afresh.

The challenge for extensional systems is to accommodate the nuance of correct inductive reasoning.

The challenge for intensional systems is to maintain their semantic clarity while becoming computationally feasible.

Pearl solves the second challenge by supplementing Bayesian probability theory with belief networks (Bayesian networks) that encode information about the relevance of propositions to each other, drastically simplifying the tasks of inference and belief propagation.

One more insight from Chapter 1 of the book… Pearl describes four primitive qualitative relationships in everyday reasoning: likelihood, conditioning, relevance, and causation. I’ll give an example of each, and how they are symbolized in Pearl’s formulation.

1. Likelihood (“Tim is more likely to fly than to walk.”)
P(A)

2. Conditioning (“If Tim is sick, he can’t fly.”)
P(A | B)

3. Relevance (“Whether Tim flies depends on whether he is sick.”)
P(A | B) ≠ P(A)

4. Causation (“Being sick caused Tim’s inability to fly.”)
P(A | do B)
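The gap between conditioning (item 2) and intervening (item 4) shows up concretely in a toy confounded model: a common cause C influences both A and B, but A has no effect on B. All numbers here are made up, and prob works by brute-force enumeration:

```python
from itertools import product

P_C = 0.5
p_a_given_c = {1: 0.9, 0: 0.1}   # A is influenced by C
p_b_given_c = {1: 0.8, 0: 0.2}   # B is influenced by C only, never by A

def joint(c, a, b, do_a=None):
    pc = P_C if c else 1 - P_C
    if do_a is None:             # observational regime: A follows its usual CPT
        pa = p_a_given_c[c] if a else 1 - p_a_given_c[c]
    else:                        # do(A): sever the C -> A link, set A by fiat
        pa = 1.0 if a == do_a else 0.0
    pb = p_b_given_c[c] if b else 1 - p_b_given_c[c]
    return pc * pa * pb

def prob(pred, do_a=None):
    return sum(joint(c, a, b, do_a) for c, a, b in product([0, 1], repeat=3)
               if pred(c, a, b))

p_b_given_a  = prob(lambda c, a, b: b and a) / prob(lambda c, a, b: a)
p_b_given_do = prob(lambda c, a, b: b, do_a=1)
print(p_b_given_a, p_b_given_do)   # ≈ 0.74 vs 0.5: seeing is not doing
```

Conditioning on A = 1 picks out worlds where C is probably 1 (since C causes A), which in turn raises B; intervening on A breaks that back-door inference and leaves B at its prior of 0.5.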

The challenge is to find a formalism that fits all four of these, while remaining computationally feasible.