A simple explanation of Bell’s inequality

Everybody knows that quantum mechanics is weird. But there are plenty of weird things in the world. We’ve pretty much come to expect that as soon as we look out beyond our little corner of the universe, we’ll start seeing intuition-defying things everywhere. So why does quantum mechanics get the reputation of being especially weird?

Bell’s theorem is a good demonstration of how the weirdness of quantum mechanics is in a realm of its own. It’s a set of proposed (and later actually verified) experimental results that seem to defy all attempts at classical interpretation.

The Experimental Results

Here is the experimental setup:


In the center of the diagram, we have a black box that spits out two particles every few minutes. These two particles fly in different directions to two detectors. Each detector has three available settings (marked by 1, 2, and 3) and two bulbs, one red and the other green.

Shortly after a particle enters the detector, one of the two bulbs flashes. Our experiment is simply this: we record which bulb flashes on both the left and right detector, and we take note of the settings on both detectors at the time. We then try randomly varying the detector settings, and collect data for many such trials.

Quick comprehension test: Suppose that what bulb flashes is purely a function of some property of the particles entering the detector, and the settings don’t do anything. Then we should expect that changes in the settings will not have any impact on the frequency of flashing for each bulb. It turns out that we don’t see this in the experimental results.

One more: Suppose that the properties of the particles have nothing to do with which bulb flashes, and all that matters is the detector settings. What do we expect our results to be in this case?

Well, then we should expect that changing the detector settings will change which bulb flashes, but that the variance in the bulb flashes should be able to be fully accounted for by the detector settings. It turns out that this also doesn’t happen.

Okay, so what do we see in the experimental results?

The results are as follows:

(1) When the two detectors have the same settings:
The same color of bulb always flashes on the left and right.

(2) When the two detectors have different settings:
The same color bulb flashes on the left and right 25% of the time.
Different colored bulbs flash on the left and right 75% of the time.

In some sense, the paradox is already complete. It turns out that some very minimal assumptions about the nature of reality tell us that these results are impossible.  There is a hidden inconsistency within these results, and the only remaining task is to draw it out and make it obvious.


We’ll start our analysis by detailing our basic assumptions about the nature of the process.

Assumption 1: Lawfulness
The probability of an event is a function of all other events in the universe.

This assumption is incredibly weak. It just says that if you know everything about the universe, then you are able to place a probability distribution over future events. This isn’t even as strong as determinism, as it’s only saying that the future is a probabilistic function of the past. Determinism would be the claim that all such probabilities are 1 or 0, that is, the facts about the past fix the facts about the future.

From Assumption 1 we conclude the following:

There exists a function P(R | everything else) that accurately reports the frequency of the red bulb flashing, given the rest of facts about the universe.

It’s hard to imagine what it would mean for this to be wrong. Even in a perfectly non-deterministic universe where the future is completely probabilistically independent of the past, we could still express what’s going to happen next probabilistically, just with all of the probabilities of events being independent. This is why even naming this assumption lawfulness is too strong – the “lawfulness” could be probabilistic, chaotic, and incredibly minimal.

The next assumption constrains this function a little more.

Assumption 2: Locality
The probability of an event only depends on events local to it.

This assumption is justified by virtually the entire history of physics. Over and over we find that particles influence each others’ behaviors through causal intermediaries. Einstein’s Special Theory of Relativity provides a precise limitation on causal influences; the absolute fastest that causal influences can propagate is the speed of light. The light cone of an event is defined as all the past events that could have causally influenced it, given the speed of light limit, and all future events that can be causally influenced by this event.

Combining Assumption 1 and Assumption 2, we get:

P(R | everything else) = P(R | local events)

So what are these local events? Given our experimental design, we have two possibilities; the particle entering the detector, and the detector settings. Our experimental design explicitly rules out the effects of other causal influences, by holding them fixed. The only thing that we, the experimenters, vary are the detector settings, and the variation in the particle types being produced by the central black box. All else is stipulated to be held constant.

Thus we get our third, and final assumption.

Assumption 3: Good experimental design
The only local events relevant to the bulb flashing are the particle that enters the detector and the detector setting.

Combining these three assumptions, we get the following:

P(R | everything else) = P(R | particle & detector setting)

We can think of this function a little differently, by asking about a particular particle with a fixed set of properties.

Pparticle(R | detector setting)

We haven’t changed anything but the notation – this is the same function as what we originally had, just carrying a different meaning. Now it tells us how likely a given particle is to cause the red bulb to flash, given a certain detector setting. This allows us to categorize all different types of particles by looking at all different settings.

Particle type is defined by
Pparticle(R | Setting 1), Pparticle(R | Setting 2), Pparticle(R | Setting 3) )

This fully defines our particle type for the purposes of our experiment. The set of particle types is the set of three-tuples of probabilities.

So to summarize, here are the only three assumptions we need to generate the paradox.

Lawfulness: Events happen with probabilities that are determined by facts about the universe.
Locality: Causal influences propagate locally.
Good experimental design: Only the particle type and detector setting influence the experiment result.

Now, we generate a contradiction between these assumptions and the experimental results!


Recall our experimental results:

(1) When the two detectors have the same settings:
The same color of bulb always flashes on the left and right.

(2) When the two detectors have different settings:
The same color bulb flashes on the left and right 25% of the time.
Different colored bulbs flash on the left and right 75% of the time.

We are guaranteed by Assumptions 1 to 3 that there exists a function Pparticle(R | detector setting) that describes the frequencies we observe for a detector. We have two particles and two detectors, so we are really dealing with two functions for each experimental trial.

Left particle: Pleft(R | left setting)
Right particle: Pright(R | right setting)

From Result (1), we see that when left settingright setting, the same color always flashes on both sides. This means two things: first, that the black box always produces two particles of the same type, and second, that the behavior observed in the experiment is deterministic.

Why must they be the same type? Well, if they were different, then we would expect different frequencies on the left and the right. Why determinism? If the results were at all probabilistic, then even if the probability functions for the left and right particles were the same, we’d expect to still see them sometimes give different results. Since they don’t, the results must be fully determined.

Pleft(R | setting 1) = Pright(R | setting 1) = 0 or 1
Pleft(R | setting 2) = Pright(R | setting 2) = 0 or 1
Pleft(R | setting 3) = Pright(R | setting 3) = 0 or 1

This means that we can fully express particle types by a function that takes in a setting (1, 2, or 3), and returns a value (0 or 1) corresponding to whether or not the red bulb will flash. How many different types of particles are there? Eight!

Abbreviation: Pn = P(R | setting n)
P1 = 1, P2 = 1, P3 = 1 : (RRR)
P1 = 1, P2 = 1, P3 = 0 : (RRG)
P1 = 1, P2 = 0, P3 = 1 : (RGR)
P1 = 1, P2 = 0, P3 = 0 : (RGG)
P1 = 0, P2 = 1, P3 = 1 : (GRR)
P1 = 0, P2 = 1, P3 = 0 : (GRG)
P1 = 0, P2 = 0, P3 = 1 : (GGR)
P1 = 0, P2 = 0, P3 = 0 : (GGG)

The three-letter strings (RRR) are short representations of which bulb will flash for each detector setting.

Now we are ready to bring in experimental result (2). In 25% of the cases in which the settings are different, the same bulbs flash on either side. Is this possible given our results? No! Check out the following table that describes what happens with RRR-type particles and RRG-type particles when the detectors have different settings different detector settings.

(Setting 1, Setting 2) RRR-type RRG-type
1, 2 R, R R, R
1, 3 R, R R, G
2, 1 R, R R, R
2, 3 R, R R, G
3, 1 R, R G, R
3, 2 R, R G, R
100% same 33% same

Obviously, if the particle always triggers a red flash, then any combination of detector settings will result in a red flash. So when the particles are the RRR-type, you will always see the same color flash on either side. And when the particles are the RRG-type, you end up seeing the same color bulb flash in only two of the six cases with different detector settings.

By symmetry, we can extend this to all of the other types.

Particle type Percentage of the time that the same bulb flashes (for different detector settings)
RRR 100%
RRG 33%
RGR 33%
RGG 33%
GRR 33%
GRG 33%
GGR 33%
GGG 100%

Recall, in our original experimental results, we found that the same bulb flashes 25% of the time when the detectors are on different settings. Is this possible? Is there any distribution of particle types that could be produced by the central black box that would give us a 25% chance of seeing the same color?

No! How could there be? No matter how the black box produces particles, the best it can do is generate a distribution without RRRs and GGGs, in which case we would see 33% instead of 25%. In other words, the lowest that this value could possibly get is 33%!

This is the contradiction. Bell’s inequality points out a contradiction between theory and observation:

Theory: P(same color flash | different detector settings) ≥ 33%
Experiment: P(same color flash | different detector settings) = 25%


We have a contradiction between experimental results and a set of assumptions about reality. So one of our assumptions has to go. Which one?

Assumption 3: Experimental design. Good experimental design can be challenged, but this would require more detail on precisely how these experiments are done. The key feature of this is that you would have to propose a mechanism by which changes to the detector setting end up altering other relevant background factors that affect the experiment results. You’d also have to be able to do this for all the other subtly different variants of Bell’s experiment that give the same result. While this path is open, it doesn’t look promising.

Assumption 1: Lawfulness. Challenging the lawfulness of the universe looks really difficult. As I said before, I can barely imagine what a universe that doesn’t adhere to some version of Assumption 1 looks like. It’s almost tautological that some function will exist that can probabilistically describe the behavior of the universe. The universe must have some behavior, and why would we be unable to describe it probabilistically?

Assumption 2: Locality. This leaves us with locality. This is also really hard to deny! Modern physics has repeatedly confirmed that the speed of light acts as a speed limit on causal interactions, and that any influences must propagate locally. But perhaps quantum mechanics requires us to overthrow this old assumption and reveal it as a mere approximation to a deeper reality, as has been done many times before.

If we abandon number 2, we are allowing for the existence of statistical dependencies between variables that are entirely causally disconnected. Here’s Bell’s inequality in a causal diagram:

Bell Causal w entanglement

Since the detector settings on the left and the right are independent by assumption, we end up finding an unexplained dependence between the left particle and the right particle. Neither the common cause between them or any sort of subjunctive dependence a la timeless decision theory are able to explain away this dependence. In quantum mechanics, this dependence is given a name: entanglement. But of course, naming it doesn’t make it any less mysterious. Whatever entanglement is, it is something completely new to physics and challenges our intuitions about the very structure of causality.

Iterated Simpson’s Paradox

Previous: Simpson’s paradox

In the last post, we saw how statistical reasoning can go awry in Simpson’s paradox, and how causal reasoning can rescue us. In this post, we’ll be generalizing the idea behind the paradox and producing arbitrarily complex versions of it.

The main idea behind Simpson’s paradox is that conditioning on an extra variable can sometimes reverse dependencies.

In our example in the last post, we saw that one treatment for kidney stones worked better than another, until we conditioned on the kidney stone’s size. Upon conditioning, the sign of the dependence between treatment and recovery changed, so that the first treatment now looked like it was less effective than the other.

We explained this as a result of a spurious correlation, which we represented with ‘paths of dependence’ like so:


But we can do better than just one reversal! With our understanding of causal models, we are able to generate new reversals by introducing appropriate new variables to condition upon.

Our toy model for this will be a population of sick people, some given a drug and some not (D), and some who recover and some who do not (R). If there are no spurious correlations between D and R, then our diagram is simply:

Iter Simpson's 0

Now suppose that we introduce a spurious correlation, wealth (W). Wealthy people are more likely to get the drug (let’s say that this occurs through a causal intermediary of education level E), and are more likely to recover (we’ll suppose that this occurs through a casual intermediary of nutrition level of diet N).

Now we have the following diagram:

Iter Simpson's 1

Where there was only previously one path of dependency between D and R, there is now a second. This means that if we observe W, we break the spurious dependency between D and R, and retain the true causal dependence.

Iter Simpson's 1 all paths          Iter Simpson's 1 broken.png

This allows us one possible Simpson’s paradox: by conditioning upon W, we can change the direction of the dependence between D and R.

But we can do better! Suppose that your education level causally influences your nutrition. This means that we now have three paths of dependency between D and R. This allows us to cause two reversals in dependency: first by conditioning on W and second by conditioning on N.

Iter Simpson's 2 all paths.png  Iter Simpson's 2 broke 1  Iter Simpson's 2 broke 2

And we can keep going! Suppose that education does not cause nutrition, but both education and nutrition causally impact IQ. Now we have three possible reversals. First we condition on W, blocking the top path. Next we condition on I, creating a dependence between E and N (via explaining away). And finally, we condition on N, blocking the path we just opened. Now, to discern the true causal relationship between the drug and recovery, we have two choices: condition on W, or condition on all three W, I, and N.

Iter Simpson's 3 all pathsiter-simpsons-3-cond-w-e1514586779193.pngIter Simpson's 3 cond WIIter Simpson's 3 cond WIN

As might be becoming clear, we can do this arbitrarily many times. For example, here’s a five-step iterated Simpson paradox set-up:

Big iter simpson

The direction of dependence switches when you condition on, in this order: A, X, B’, X’, C’. You can trace out the different paths to see how this happens.

Part of the reason that I wanted to talk about the iterated Simpson’s paradox is to show off the power of causal modeling. Imagine that somebody hands you data that indicates that a drug is helpful in the whole population, harmful when you split the population up by wealth levels, helpful when you split it into wealth-IQ classes, and harmful when you split it into wealth-IQ-education classes.

How would you interpret this data? Causal modeling allows you to answer such questions by simply drawing a few diagrams!

Next we’ll move into one of the most significant parts of causal modeling – causal decision theory.

Previous: Simpson’s paradox

Next: Causal decision theory

Causal decision theory

Previous: Iterated Simpson’s Paradox

We’ll now move on into slightly new intellectual territory, that of decision theory.

While what we’ve previously discussed all had to do with questions about the probabilities of events and causal relationships between variables, we will now discuss questions about what the best decision to make in a given context is.


Decision theory has two ingredients. The first is a probabilistic model of different possible events that allows an agent to answer questions like “What is the probability that A happens if I do B?” This is, roughly speaking, the agent’s beliefs about the world.

The second ingredient is a utility function U over possible states of the world. This function takes in propositions, and returns the value to a particular agent of that proposition being true. This represents the agent’s values.

So, for instance, if A = “I win a million dollars” and B = “Somebody cuts my ear off”, U(A) will be a large positive number, and U(B) will be a large negative number. For propositions that an agent feels neutral or apathetic about, the utility function assigns them a value of 0.

Different decision theories represent different ways of combining a utility function with a probability distribution over world states. Said more intuitively, decision theories are prescriptions for combining your beliefs and your values in order to yield decisions.

A proposition that all competing decision theories agree on is “You should act to maximize your expected utility.” The difference between these different theories, then, is how they think that expected utility should be calculated.

“But this is simple!” you might think. “Simply sum over the value of each consequence, and weight each by its likelihood given a particular action! This will be the expected utility of that action.”

This prescription can be written out as follows:

Evidential Decision Theory.png

Here A is an action, C is the index for the different possible world states that you could end up in, and K is the conjunction of all of your background knowledge.


While this is quite intuitive, it runs into problems. For instance, suppose that scientists discover a gene G that causes both a greater chance of smoking (S) and a greater chance of developing cancer (C). In addition, suppose that smoking is known to not cause cancer.

Smoking Lesion problem

The question is, if you slightly prefer to smoke, then should you do so?

The most common response is that yes, you should do so. Either you have the cancer-causing gene or you don’t. If you do have the gene, then you’re already likely to develop cancer, and smoking won’t do anything to increase that chance.

And if you don’t have the gene, then you already probably won’t develop cancer, and smoking again doesn’t make it any more likely. So regardless of if you have the gene or not, smoking does not affect your chances of getting cancer. All it does is give you the little utility boost of getting to smoke.

But our expected utility formula given above disagrees. It sees that you are almost certain to get cancer if you smoke, and almost certain not to if you don’t. And this means that the expected utility of smoking includes the utility of cancer, which we’ll suppose to be massively negative.

Let’s do the calculation explicitly:

EU(S) = U(C & S) * P(C | S) + U(~C & S) * P(~C| S)
= U(C & S) << 0
EU(~S) =  U(~S & C) * P(C | ~S) + U(~S & ~C) * P(~C | ~S)
= U(~S & ~C) ~ 0

Therefore we find that EU(~S) >> EU(S), so our expected utility formula will tell us to avoid smoking.

The problem here is evidently that the expected utility function is taking into account not just the causal effects of your actions, but the spurious correlations as well.

The standard way that decision theory deals with this is to modify the expected utility function, switching from ordinary conditional probabilities to causal conditional probabilities.

Causal Decision Theory.png

You can calculate these causal conditional probabilities by intervening on S, which corresponds to removing all its incoming arrows.

Smoking Lesion problem mutilated

Now our expected utility function exactly mirrors our earlier argument – whether or not we smoke has no impact on our chance of getting cancer, so we might as well smoke.

Calculating this explicitly:

EU(S) = U(S & C) * P(C | do S) + U(S & ~C) * P(~C | do S)
= U(S & C) * P(C) + U(S & ~C) * P(~C)
EU(~S) = U(~S & C) * P(C | do ~S) + U(S & ~C) * P(~C | do S)
= U(~S & C) * P(C) + U(~S & ~C) * P(~C)

Looking closely at these values, we can see that EU(S) must be greater than EU(~S), regardless of the value of P(C).


The first expected utility formula that we wrote down represents the branch of decision theory called evidential decision theory. The second is what is called causal decision theory.

We can roughly describe the difference between them as that evidential decision theory looks at possible consequences of your decisions as if making an external observation of your decisions, while causal decision theory looks at the consequences of your decisions as if determining your decisions.

EDT treats your decisions as just another event out in the world, while CDT treats your decisions like causal interventions.

Perhaps you think that the choice between these is obvious. But Newcomb’s problem is a famous thought experiment that famously splits people along these lines and challenges both theories. I’ve written about it here, but for now will leave decision theory for new topics.

Previous: Iterated Simpson’s Paradox

Next: Causality for philosophers

Simpson’s paradox

Previous: Screening off and explaining away

A look at admission statistics at a college reveals that women are less likely to be admitted to graduate programs than men. A closer investigation reveals that in fact when the data is broken down into individual department data, women are more likely to be admitted than men. Does this sound impossible to you? It happened at UC Berkeley in 1973.

When two treatments are tested on a group of patients with kidney stones, Treatment A turns out to lead to worse recovery rates than Treatment B. But when the patients are divided according to the size of their kidney stone, it turns out that no matter how large their kidney stone, Treatment A always does better than Treatment B. Is this a logical contradiction? Nope, it happened in 1986!

What’s going on here? How can we make sense of this apparently inconsistent data? And most importantly, what conclusions do we draw? Is Berkeley biased against women or men? Is Treatment A actually more effective or less effective than Treatment B?

In this post, we’ll apply what we’ve learned about causal modeling to be able to answer these questions.


Quine gave the following categorization of types of paradoxes: veridical paradoxes (those that seem wrong but are actually correct), falsidical paradoxes (those that seem wrong and actually are wrong), and antinomies (those that are premised on common forms of reasoning and end up deriving a contradiction).

Simpson’s paradox is in the first category. While it seems impossible, it actually is possible, and it happens all the time. Our first task is to explain away the apparent falsity of the paradox.

Let’s look at some actual data on the recovery rates for different treatments of kidney stones.

Treatment A Treatment B
All patients 78% (273/350) 83% (289/350)

The percentages represent the number of patients that recovered, out of all those that were given the particular treatment. So 273 patients recovered out of the 350 patients given Treatment A, giving us 78%. And 289 patients recovered out of the 350 patients given Treatment B, giving 83%.

At this point we’d be tempted to proclaim that B is the better treatment. But if we now break down the data and divide up the patients by kidney stone size, we see:

Treatment A Treatment B
Small stones 93% (81/87) 87% (234/270)
Large stones 73% (192/263) 69% (55/80)

And here the paradoxical conclusion falls out! If you have small stones, Treatment A looks better for you. And if you have large stones, Treatment A looks better for you. So no matter what size kidney stones you have, Treatment A is better!

And yet, amongst all patients, Treatment B has a higher recovery rate.

Small stones: A better than B
Large stones: A better than B
All sizes: B better than A

I encourage you to check out the numbers for yourself, in case you still don’t believe this.


The simplest explanation for what’s going on here is that we are treating conditional probabilities like they are joint probabilities. Let’s look again at our table, and express the meaning of the different percentages more precisely.

Treatment A Treatment B
Small stones P(Recovery | Small stones & Treatment A) P(Recovery | Small stones & Treatment B)
Large stones P(Recovery | Large stones & Treatment A) P(Recovery | Large stones & Treatment B)
Everybody P(Recovery | Treatment A) P(Recovery | Treatment B)

Our paradoxical result is the following:

P(Recovery | Small stones & Treatment A) > P(Recovery | Small stones & Treatment B)
P(Recovery | Large stones & Treatment A) > P(Recovery | Large stones & Treatment B)
P(Recovery | Treatment A) < P(Recovery | Treatment B)

But this is no paradox at all! There is no law of probability that tells us:

If P(A | B & C) > P(A | B & ~C)
and P(A | ~B & C) > P(A | ~B & ~C),
then P(A | C) > P(A | ~C)

There is, however, a law of probability that tells us:

If P(A & B | C) > P(A & B | ~C)
and P(A & ~B | C) > P(A & ~B | ~C),
then P(A | C) > P(A | ~C)

And if we represented the data in terms of these joint probabilities (probability of recovery AND small stones given Treatment A, for example) instead of conditional probabilities, we’d find that the probabilities add up nicely and the paradox vanishes.

Treatment A Treatment B
Small stones 23% (81/350) 67% (234/350)
Large stones 55% (192/350) 16% (55/350)
All patients 78% (273/350) 83% (289/350)

It is in this sense that the paradox arises from improper treatment of conditional probabilities as joint probabilities.


This tells us why we got a paradoxical result, but isn’t quite fully satisfying. We still want to know, for instance, whether we should give somebody with small kidney stones Treatment A or Treatment B.

The fully satisfying answer comes from causal modeling. The causal diagram we will draw will have three variables, A (which is true if you receive Treatment A and false if you receive Treatment B), S (which is true if you have small kidney stones and false if you have large), and R (which is true if you recovered).

Our causal diagram should express that there is some causal relationship between the treatment you receive (A) and whether you recover (R). It should also show a causal relationship between the size of your kidney stone (S) and your recovery, as the data indicates that larger kidney stones make recovery less likely.

And finally, it should show a causal arrow from the size of the kidney stone to the treatment that you receive. This final arrow comes from the fact that more people with large stones were given Treatment A than Treatment B, and more people with small stones were given Treatment B than Treatment B.

This gives us the following diagram:

Simpson's paradox

The values of P(S), P(A | S), and P(A | ~S) were calculated from the table we started with. For instance, the value of P(S) was calculated by adding up all the patients that had small kidney stones, and dividing by the total number of patients in the study: (87 + 270) / 700.

Now, we want to know if P(R | A) > P(R | ~A) (that is, if recovery is more likely given Treatment A than given Treatment B).

If we just look at the conditional probabilities given by our first table, then we are taking into account two sources of dependency between treatment type and recovery. The first is the direct causal relationship, which is what we want to know. The second is the spurious correlation between A and R as a result of the common cause S.

Simpson's paradox paths

Here the red arrows represent “paths of dependency” between A and R. For example, since those with small stones are more likely to get treatment B, and are also more likely to recover, this will result in a spurious correlation between small stones and recovery.

So how we do we determine the actual non-spurious causal dependency between A and R?


If we observe the value of S, then we screen A off from R through S! This removes the spurious correlation, and leaves us with just the causal relationship that we want.

Simpson's paradox broken

What this means is that the true nature of the relationship between treatment type and recovery can be determined by breaking down the data in terms of kidney stone size. Looking back at our original data:

Recovery rate Treatment A Treatment B
Small stones 93% (81/87) 87% (234/270)
Large stones 73% (192/263) 69% (55/80)
All patients 78% (273/350) 83% (289/350)

This corresponds to looking at the data divided up by size of stones, and not the data on all patients. And since for each stone size category, Treatment A was more effective than Treatment B, this is the true causal relationship between A and R!


A nice feature of the framework of causal modeling is that there are often multiple ways to think about the same problem. So instead of thinking about this in terms of screening off the spurious correlation through observation of S, we could also think in terms of causal interventions.

In other words, to determine the true nature of the causal relationship between A and R, we want to intervene on A, and see what happens to R.

This corresponds to calculating if P(R | do A) > P(R | do ~A), rather than if P(R | A) > P(R | ~A).

Intervention on A gives us the new diagram:

Simpson's paradox intervene

With this diagram, we can calculate:

P(R | do A)
= P(R & S | do A) + P(R & ~S | do A)
= P(S) * P(R | A & S) + P(~S) * P(R | A & ~S)
= 51% * 93% + 49% * 73%
= 83.2%


P(R | do ~A)
= P(R & S | do ~A) + P(R & ~S | do ~A)
= P(S) * P(R | ~A & S) + P(~S) * P(R | ~A & ~S)
= 51% * 87% + 49% * 69%
= 78.2%

Now not only do we see that Treatment A is better than Treatment B, but we can have the exact amount by which it is better – it improves recovery chances by about 5%!

Next, we’re going to go kind of crazy with Simpson’s paradox and show how to construct an infinite chain of Simpson’s paradoxes.

Fantastic paper on all of this here.

Previous: Screening off and explaining away

Next: Iterated Simpson’s paradox

Screening off and explaining away

Previous: Correlation and causation

In this post, I’ll explain three of the most valuable tools for inference that arise naturally from causal modeling.

Screening off via causal intermediary
Screening off via common cause
Explaining away


Suppose that the rain causes the sidewalk to get wet, and the sidewalk getting wet causes you to slip and break your elbow.

rain & slip & elbow.png

This means that if you know that it’s raining, then you know that a broken elbow is more likely. But if you also know that the sidewalk is wet, then learning whether or not it is raining no longer makes a broken elbow more likely. After all, the rain is only a useful piece of information for predicting broken elbows insofar as it allows you to infer sidewalk-wetness.

In other words, the information about sidewalk-wetness screens off the information about whether or not it is raining with respect to broken elbows. In particular, sidewalk-wetness screens off rain because it is a causal intermediary to broken elbows.


Suppose that being wealthy causes you to eat more nutritious food, and being wealthy also causes you to own fancy cars.

common cause.png

This means that if you see somebody in a fancy car, you know it is more likely that they eat nutritious food. But if you already knew that they were wealthy, then knowing that their car is fancy tells you no more about the nutritiousness of their diet. After all, the fanciness of the car is only a useful piece of information for predicting nutritious diets insofar as it allows you to infer wealth.

In other words, wealth screens off ownership of fancy cars with respect to nutrition. In particular, wealth screens off ownership of fancy cars because it is a common cause of nutrition and fancy car owning.


Suppose that being really intelligent causes you to get on television, and being really attractive causes you to get on television, but attractiveness and intelligence are not directly causally related.

smart & hot & tv.png

This means that in the general population, you don’t learn anything about somebody’s intelligence by assessing their attractiveness. But if you know that they are on television, then you do learn something about their intelligence by assessing their attractiveness.

In particular, if you know that somebody is on television, and then you learn that they are attractive, then it becomes less likely that they intelligent than it was before you learned this.

We say that in this scenario attractiveness explains away intelligence, given the knowledge that they are on television.


I want to introduce some notation that will allow us to really compactly describe these types of effects and visualize them clearly.

We’ll depict an ‘observed variable’ in a causal diagram as follows:

A&gt;B&gt;C with observed B

This diagram says that A causes B, B causes C, and the value of B is known.

In addition, we talked about the value of one variable telling you something about the value of another variable, given some information about other variables. For this we use the language of dependence.

To say, for example, that A and B are independent given C, we write:

(A ⫫ B) | C

And to say that A and B are dependent given C, we just write:

~(A ⫫ B) | C

With this notation, we can summarize everything I said above with the following diagram:

Screening off and explaining away

In words, the first row expresses dependent variables that become independent when conditioning on causal intermediaries. B screens off A from C as a causal intermediary.

The second expresses dependent variables that become independent when conditioning on common causes. B screens off A from C as a common cause.

And the third row expresses independent variables that become dependent when conditioning on common effects. A explains away C, given B.


Repeated application of these three rules allows you to determine dependencies in complicated causal diagrams. Let’s say that somebody gives you the following diagram:

Complex cause

First they ask you if E and F are going to be correlated.

We can answer this just by tracing causal paths through the diagram. If we look at all connected triples on paths leading from E to F and find that there is dependence between the end variables in each triple, then we know that E and F are dependent.

The path ECA is a causal chain, and C is not observed, so E and A are dependent along this path. Next, the path CAD is a common cause path, and the common cause (A) is not observed, thus retaining dependence again along the path. And finally, the path ADF is a causal chain with D unobserved, so A and F are dependent along the path.

So E and F are dependent.

Now your questioner tell you the value of D, and re-asks you if E and F are dependent.

Complex cause obs D

Now dependence still exists along the paths ECA and CAD, but the path ADF breaks the dependence. This follows from the rule in row 1: D is observed, so A is screened off from F. Since A is screened off, E is as well. This means that E and F are now independent.

Suppose they asked you if E and B were dependent before telling you the value of D. In this case, the dependence travels along ECA, and along CAD, but is broken along ADB by observation of D. This follows from our rule in row 3.

And if they asked you if E and B were dependent after telling you the value of D, then you would respond that they are dependent. Now the last leg of the path (ADB) is dependent, because A and B explain each other away.

The general ability to look at a complicated causal diagram is a valuable tool, and we will come back to it in the future.

Next, I’ll talk about one of my current favorite applications of causal diagrams: Simpson’s paradox!

Previous: Correlation and causation

Next: Simpson’s paradox

Dialogue: Why you should one-box in Newcomb’s problem

(Nothing original here, just my presentation of the most interesting arguments I’ve seen on the various sides)

Newcomb’s problem: You find yourself in a room with two boxes in it. Box #1 is clear, and you can see $10,000 inside. Box #2 is opaque. A loud voice announces to you: “Box 2 has either 1 million dollars inside of it or nothing. You have a choice: Either you take just Box 2 by itself, or you take both Box 1 and Box 2.”

As you’re reaching forward to take both boxes, the voice declares: “Wait! There’s a catch.

Sometime before you entered the room, a Predictor with enormous computing power scanned you, made an incredibly detailed simulation of you, and used it to make a prediction about what decision you would make. The Predictor has done similar simulations many times in the past, and has never been wrong. If the Predictor predicted that you would take just Box 2, then it filled up the box with 1 million dollars. And if the Predictor predicted that you would take both boxes, then it left Box 2 empty. Now you may make your choice.”


The most initially intuitive answer to most people is to take both boxes. Here’s the strongest argument for why this makes sense, presented by Claus the causal thinker.

Claus: “The Predictor has already made its prediction and fixed the contents of the box. So we know for sure that my decision can’t possibly have any impact on whether Box 2 is full or empty. And in either case, I am better off taking both boxes than just one! Think about it like this: whether I one-box or two-box, I still end up taking Box 2. So let’s consider Box 2 taken – I have no choice in the matter. Now the only real question is if I’m also going to take Box 1. And Box 1 has $10,000 inside it! I can see it right there! My choice is really whether to take the free $10,000 or not, and I’d be a fool to leave it behind.”


Claus makes a very convincing argument. On his calculation, the expected value of two-boxing is strictly greater than the expected value of one-boxing, regardless of what probabilities he puts on the second box being empty. We’ll call this the dominance argument.

But Claus is making a fundamental error in his calculation. Let’s let a different type of decision theorist named Eve interrogate Claus.

Eve: “So, Claus, I’m curious about how you arrived at your answer. You say that your decision about whether or not to take Box 1 can’t possibly impact the contents of Box 2. I think that I agree with this. But do you agree that if you don’t take Box 1, it is more likely that Box 2 has a million dollars inside it?”

Claus: “I can’t see how that could be the case. The box’s contents are already fixed. How could my decision about something entirely causally unrelated make it any more likely that the contents are one way or the other?”

Eve: “Well, it’s not actually that unusual. There are plenty of things that are correlated without any direct causal impact between them. For example, say that a certain gene causes you to be a good juggler, but also causes a high chance of a certain disease. In this case, juggling ability and incidence of the disease will end up being correlated in the population, even though neither one is directly causing the other. And if you’re a good juggler, then you should be more worried that maybe you also have the disease!”

Claus: “Sure, but I don’t see how that case is anything like this one…”

Eve: “The two cases are actually structurally identical! Let me draw some causal diagrams…” (Claus rummages around for paper and a pencil)

Newcomb vs Juggling alt.png

Eve: “In our disease example, we have a common cause (the gene) that is directly causally linked to both the disease and to being a good juggler. So the “disease” variable and the “good juggler” variable are dependent because of the “gene” variable. In your Newcomb problem, the common cause is your past self at the moment that the Predictor scanned you. This common cause is directly linked to both your decision to one-box or two-box in the present, and to the contents of the box. Which means that in the exact same way that being a good juggler makes you more likely to have the disease, two-boxing makes you more likely to end up with an empty Box 2! The two cases are exactly analogous!”

Claus: “Hmm, that all seems correct. But even if my decision to take Box 1 isn’t independent of the contents of Box 2, this doesn’t necessarily mean that I shouldn’t still take both.”

Eve: “Right! But it does invalidate your dominance argument, which implicitly rested on the assumption that you could treat the contents of the box as if they were unaffected by your action. While your actions do not strictly speaking causally effect the contents of the box, they do change the likelihoods of the different possible contents! So there is a real sense in which your actions do statistically affect the contents of the box, even though they don’t causally affect them. Anyway, we can just calculate the actual expected values and see whether one-boxing or two-boxing comes out ahead.”

Eve writes out some expected utility calculations:

Newcomb vs Juggling 2

Eve: “So you see, it actually turns out to always be better to one-box than to two-box!”

Claus: “Hmm, I guess you’re right. Okay never mind, I guess that I’ll one-box. Thanks!”


Claus goes away for a while, and comes back a more sophisticated causal thinker.

Claus: “Hey, remember that I agreed that my decision and the contents of Box 1 are actually dependent upon each other, just not causally?”

Eve: “Yes.”

Claus: “Well, I do still agree with that. But I am also still a two-boxer. I’ll explain – would you hand me that paper?”

Claus scribbles a few equations beneath Eve’s diagrams.

EDT vs CDT.png

Claus: “When you calculated the expected values of one-boxing and two-boxing, you implicitly used Equation (1). Let’s call this equation the “Evidential Decision Algorithm.” You summed over all the possible consequences of your actions, and multiplied the values of each consequence by the conditional probability of that consequence, given the decision.”

Eve: “Yes…”

Claus: “Well, I have a different way to calculate expected values! It’s Equation (2), and I call it the “Causal Decision Algorithm.” I also sum over all possible consequences, but I multiply the value of each consequence by its causal conditional probability, not it’s ordinary conditional probability! And when you calculate the expected value, it turns out to be larger for two-boxing!”

Eve: “Hmm, doesn’t this seem a little arbitrary? Maybe a little ad-hoc?”

Claus: “Not at all! The point of rational decision-making is to choose the decision that causes the best outcomes. What we should be interested in is only the causal links between our decisions and their possible consequences, not the spurious correlations.”

Eve: “Hmm, I can see how that makes sense…”

Claus: “Here, let’s look back at your earlier example about juggling and disease. I agree with what you said that if you observe that you’re a good juggler, you should be worried that you have the disease. But imagine that instead of just observing whether or not you’re a good juggler, you get to decide whether or not to be a good juggler. Say that you can decide to spend many hours training your juggling, and at the end of that process you know that you’ll be a good juggler. Now, according to your decision theory, deciding to train to become a good juggler puts you at a higher risk for having the disease. But that’s ridiculous! We know for sure that your decision to become a good juggler does not make you any more likely to have the disease. Since you’re deciding what actions to take, you should treat your decisions like causal interventions, in which you set the decision variable to one value or another and in the process break all other causal arrows directed at it. And that’s why you should be using the causal conditional probability, not the ordinary conditional probability!”

Eve: “Huh. What you’re saying does have some intuitive appeal. But now I’m starting to think that there is an important difference that we both missed between the juggling example and Newcomb’s problem.”

Eve draws two more diagrams on a new page.

Newcomb vs Juggling 3.png

Eve: “In the juggling case, it makes sense to describe your decision to become a good juggler or not as a causal intervention, because this decision is not part of the chain of causes leading from your genes to your juggling ability – it’s a separate cause, independent of whether or not you have the gene. But in Newcomb’s problem, your decision to one-box or two-box exists along the path of the causal arrow from your past character to your current action! The Predictor predicted every part of you, including the part of you that’s thinking about what action you’re going to take. So while modeling your decision as a causal intervention in the juggling example makes sense, doing so in Newcomb’s case is just empirically wrong! Whatever part of your brain ends up deciding to “intervene” and two-box, the Predictor predicted that this would happen! By the nature of the problem, any way in which you attempt to intervene on your decision will inevitably not actually be a causal intervention.”


(Tim, a new type of decision theorist, appears in a puff of smoke)

Claus and Eve: “Gasp! Who are you?”

Tim: “I’m Tim, a new type of decision theorist! And I’m here to say that you’re both wrong!”

Claus and Eve: “Gasp!”

Tim: “I’ll explain with a thought experiment. You both know the prisoner’s dilemma, right? Two prisoners each get to make a choice either to cooperate or defect. The best outcome for each one is that they defect and the other prisoner cooperates, the second best outcome is that both cooperate, the second worst is that they both defect, and the worst is that they cooperate and the other prisoner defects. Famously, two rational agents in a prisoner’s dilemma will end up both defecting, because defecting dominates cooperating as a strategy. If the other prisoner defects, you’re better off defecting, and if the other prisoner cooperates, you’re better off defecting. So you should defect.”

Claus and Eve: “Yes, that seems right…”

Tim: “Well, first of all notice that two rational agents end up behaving in a sub-optimal way. They would both be better off if they each cooperated. But apparently, being ‘rational’ in this case entails ending up worse off. This should be a little unusual to you if you think that rational decision-making is about optimizing your outcomes. But now consider this variant: now you are in a prisoner’s dilemma with an exact clone of yourself. You have identical brains, have lived identical lives, and are now making this decision in identical settings. Now what do you do?”

Claus: “Well, on my decision theory, it’s still the case that I can’t causally effect my clone with my decision. This means that when I treat my decision as an intervention, I won’t end up making the probability that my clone defects given that I defect any higher. So defecting still dominates cooperating as a strategy. I defect!”

Eve: “Well, my answer depends on the set-up of the problem. If there’s some common cause that explains why my clone and I are identical (like maybe we were both manufactured in a twin-clone-making factory), then our decisions will be dependent. If I defect, then my clone will certainly defect, and if I cooperate, then my clone will cooperate. So my algorithm will tell me that cooperation maximizes expected utility.”

Tim: “There is no common cause. It’s by an insanely unlikely coincidence that you and your clone happen to have the same brains and to have lived the same lives. Until this moment, the two of you have been completely causally cut off from each other, with no possibility of any type of causal relationship .

Eve: “Okay, then I gotta agree with Claus. With no possible common cause and no causal intermediaries, my decision can’t affect my clone’s decision, causally or statistically. So I’ll defect too.”

Tim: “You’re both wrong. Both of you end up defecting, along with your clones, and everybody is worse off. Look, both of you ended up concluding that your decision and the decision of your clone cannot be correlated, because there are no causal connections to generate that correlation. But you and your clone are completely physically identical. Every atom in your brain is in a functionally identical spot as the atoms in your clone’s brain. Are you determinists?”

Eve: “Well, in quantum mechanics -”

Tim: “Forget quantum mechanics! For the purpose of this thought experiment, you exist in a completely deterministic world, where the same initial conditions lead to the same final conditions in every case, always. You and your clone are in identical initial conditions. So your final condition – that is, your decision about whether to cooperate or defect, must be the same. In the setup as I’ve described it, it is logically impossible that you defect and your clone cooperates, or that you cooperate and your clone defects.”

Claus: “Yes, I think you’re right… but then how do we represent this extra dependence in our diagrams? We can’t draw any causal links connecting the two, so how can we express the logical connection between our actions?”

Tim: “I don’t really care how you represent it in your diagram. Maybe draw a special fancy common cause node with special fancy causal arrows that can’t be broken towards both your decision and your clone’s decision.”

TDT Clones.png

Tim: “The point is: there are really only two possible worlds. In World 1, you defect and your clone defects. In World 2, you cooperate and your clone cooperates. Which world would you rather be in?”

Claus and Eve: “World 2.”

Tim: “Good! So you’ll both cooperate. Now, what if the clone is not exactly identical to you? Let’s say that your clone only ends up doing the same thing as you 99.999% of the time. Now what do you do?”

Claus: “Well, if it’s no longer logically impossible for my clone to behave differently from me, then maybe I should defect again?”

Tim: “Do you really want a decision theory that has a discontinuous jump in your behavior from a 99.999% chance to a 100% chance? I mean, I’ve told you that the chance that the clone gives a different answer than your answer is .001%! Rational agents should take into account all of their information, not only selective pieces of it. Either you ignore this information and end up worse off, or you take it into account and win!”

Eve: “Okay, yes, it seems reasonable to still expect a 99.999% chance of identical choices in this case. So we should cooperate again. But what does all of this have to do with Newcomb’s problem?”

Tim: “It relates to your answers to Newcomb’s problem in two ways. First, it shows that both of your decision algorithms are wrong! They are failing to take into account that extra logical dependency between actions and consequences that we drew with fancy arrows. And second, Newcomb’s problem is virtually identical to the prisoner’s dilemma with a clone!”

Eve and Claus: “Huh?”

Tim: “Here, let’s modify the prisoner’s dilemma in the following way: If you both cooperate, then you get one million dollars. If you cooperate and your clone defects, you get $0. If you defect and your clone cooperates, you get $1,010,000. And if you both defect, then you get $10,000. Now “cooperating” is the same as one-boxing, and “defecting” is the same as two-boxing!”

Eve: “But hold on, isn’t the logical dependency between my actions and my clone’s actions not carried over to the prisoner’s dilemma? Like, it’s not logically impossible that I one-box and the box has a million dollars in it, right?”

Tim: “It is with a perfect Predictor, yes! Remember, the Predictor works by creating a perfect simulation of you and seeing what it does. This means that your decision to one-box or to two-box is logically dependent on the Predictor’s prediction of what you do (and thus the contents of the box) in the exact same way that your decision to cooperate is logically dependent on your clone’s decision to one-box!”


Claus: “Yes, I see. So with a perfect Predictor, there are really only two worlds to consider: one in which I one-box and get a million dollars, and another in which I two-box and get just $10,000. And of course I prefer the first, so I should one-box.”

Tim: “Exactly! And if the Predictor is not perfectly accurate, and is only right 99.999% of the time…”

Eve: “Well, then there’s still only a .001% chance that I two-box and get an extra million bucks. So, I’m still much better off if I one-box than if I two-box.”

Tim: “Yep! It sounds like we’re all on the same page then. There’s a logical dependence between your action and the contents of the box that you are rationally required to take into account, and when you do take it into account, you end up seeing that one-boxing is the rational action.”


The decision theory that “Tim” is using is called timeless decision theory. It’s also been variously called functional decision theory, logical decision theory, and updateless decision theory.

Timeless decision theory ends up better off in Newcomb-like problems, invariably walking away with 1 million dollars instead of $10,000. It also does better than evidential decision theory (Eve’s theory) and causal decision theory (Claus’s theory) at prisoner’s-dilemmas-with-a-clone. These are fairly contrived problems, and it’d be easy for Eve or Claus to just deny that these problems have any real-world application.

But timeless decision theorists also cooperate with each other in ordinary prisoner’s dilemmas. They have a much easier time with coordination problems in general. They do better in bargaining problems. And they can’t be blackmailed in a large general class of situations. It’s harder to write these results off as strange quirks that don’t relate to real life.

A society of TDTs wouldn’t be plagued with doubts about the rationality of voting, wouldn’t find themselves stuck in as many sub-optimal Nash equilibria, and would look around and see a lot fewer civilizational inadequacies and low-hanging policy fruit than we currently have. This is what’s most interesting to me about TDT – that it gives a foundation for rational decision-making that seems like it has potential for solving real civilizational problems.

Correlation and causation

Previous: Causal intervention

I’m feeling a bit uninspired today, so what I am going to do is take the path of least resistance. Instead of giving a thoughtful discussion of the merits and faults of the slogan “Correlation does not imply causation”, I’ll just disprove it with a counterexample.

We have some condition C. This condition affects some members of our population. We want to know if gender (A) and race (B) play a causal role in the incidence of this condition.

Some starting causal assumptions: Gender does not cause race. Race does not cause gender. And the condition does not cause either gender or race.

First we go search for numbers to determine possible correlations between gender and the condition or race and the conditions. Here’s what we find:

P(A & B & C) = 2%
P(A & B & ~C) = 3%
P(A & ~B & C) = 18%
P(A & ~B & ~C) = 27%
P(~A & B & C) = 0.5%
P(~A & B & ~C) = 4.5%
P(~A & ~B & C) = 4.5%
P(~A & ~B & ~C) = 40.5%

Alright, now what are the possible causal structures of race, gender, and condition consistent with our starting assumptions? There are 4: neither A nor B cause C, only B causes C, only A causes C, and both cause C.

ABC all models

Each of these causal models makes precise, empirical predictions about what sort of correlations we should expect to find. The first model tells us not to expect any correlations whatsoever – each of the variables should vary independently in the population. The second says that A and C will be independent, and B and C will not be. Etc.

We can test all of these straightforwardly: Is it true that P(A & C) = P(A) * P(C)? And is it true that P(B & C) = P(B) * P(C)? We calculate:

P(A & C) = 2% + 18% = 20%
P(B & C) = 2% + .5% = 2.5%

P(A) = 2% + 3% + 18% + 27% = 50%
P(B) = 2% + 3% + .5% + 4.5% = 10%
P(C) = 2% + 18% + .5% + 4.5% = 25%

P(A) * P(C) is 12.5%, and P(B) * P(C) is 2.5%.

So… our third model is correct! We have determined causation from correlation! So much for the famous slogan.


The studious one will object that the only way that we have determined causation from correlation in this case is because we started with causal assumptions. This is correct, at least in part. If we had started with no causal assumptions, we still would have found that race and gender are independent. But we would not have been able to determine the direction of our causal arrows.

Here’s a general principle: Purely observational data (read: correlations) cannot tell you on its own the direction of causation. Even this is not actually fully correct: in fact there are special situations called natural experiments in which purely observational data can tell you the direction of causation. We’ll save this discussion for later.

Another studious reader will object: But this is a threadbare notion of causation! On this view, causation is really just statistical dependence!

They are wrong. A causal diagram tells you two things. First, it tells you what correlations you should expect to observe in observational data. But second, it tells you what to expect when you intervene and perform experiments on your variables. This second feature packs in the rest of the intuitive substance of causality.

One final skeptic will point out: Even if we accept your causal assumptions, we cannot truly say that we have ruled out all other causal models. For instance, what if gender does not actually cause the condition, but both gender and the condition are the result of some hidden common cause? This new causal diagram is not ruled out by the data, as one still expects to see a correlation between gender and condition.

They are correct. I am being a little sly in ignoring these subtleties, but this is because they avoid the main point. Which is that causal diagrams are empirically falsifiable, even from purely correlational data. The sense in which the slogan “Correlation does not imply causation” is correct is the sense in which not literally every possible causal model can be eliminated just by observations of correlation. Some causal diagrams truly are empirically indistinguishable. But this doesn’t make causality any more mysterious or un-probeable with the scientific method. We can simply run experiments to deal with the remaining possibilities.

Here are three general ways that you can falsify causal diagrams:

  1. Through observations of correlation or lack of correlation between variables.
  2. Through relevant background information (like temporal order or impossibility of physical interaction between variables)
  3. Through experimental interventions, in which you fix some variables and observe what happens to the others.

Next we’ll discuss some of the useful conceptual tools that arise from this notion of causality.

Previous: Causal intervention

Next: Screening off and explaining away

Causal Intervention

Previous post: Causal arrows

Let’s quickly review the last post. A causal diagram for two variables A and B tells us how to factor the joint probability distribution P(A & B). The rule we use is that for each variable, we calculate its probability conditional upon all of its parent nodes. This can easily be generalized to any number of variables.

Quick exercises: See if you understand why the following are true.

1. If the causal relationships between three variables A, B, and C are:A&gt;B&gt;C

Then P(A & B & C) = P(A) · P(B | A) · P(C | B).

2. If the causal relationships are:


Then P(A & B & ~C) = P(A | B) · P(B) · P(~C | B).

3. If the causal relationships are:


Then P(~A & ~B & C) = P(~A) · P(~B | ~A & C) · P(C)

Got it? Then you’re ready to move on!


Two people are debating a causal question. One of them says that the rain causes the sidewalk to get wet. The other one says that the sidewalk being wet causes the rain. We can express their debate as:

2 var causal OR

We’ve already seen that the probability distributions that correspond to these causal models are empirically indistinguishable. So how do we tell who’s right?

Easy! We go outside with a bucket of water and splash it on the sidewalk. Then we check and see if it’s raining. Another day, we apply a high-powered blow-drier to the sidewalk and check if it’s raining.

We repeat this a bunch of times at random intervals, and see if we find that splashing the sidewalk makes it any more likely to rain than blow-drying the sidewalk. If so, then we know that sidewalk-wetness causes rain, not the other way around.

This is the process of intervention. When we intervene on a variable, we set it to some desired value and see what happens. Let’s express this with our diagrams.

When we splash the sidewalk with water, what we are in essence doing is setting the variable B (“The sidewalk is wet”) to true. And when we blow-dry the sidewalk, we are setting the variable B to false. Since we are now the complete determinant of the value of B, all causal arrows pointing towards B must be erased. So:

A&gt;B becomes A B


&lt; stays &lt;

And now our intervened-upon distributions are empirically distinguishable!

The person who thinks that sidewalk-wetness causes rain expects to find a probabilistic dependence between A and B when we intervene. In particular, they expect that it will be more likely to rain when you splash the sidewalk than when you blow-dry it.

And the person who thinks that rain causes sidewalk-wetness expects to find no probabilistic dependence between A and B. They’ll expect that it is equally likely to be raining if you’re splashing the sidewalk as if you’re blow-drying it.


This is how to determine the direction of causal arrows using causal models. The key insight here is that a causal model tells you what happens when you perform interventions.

The rule is: Causal intervention on a variable X is represented by erasing all incoming arrows to X and setting its value to its intervened value.

I’ll introduce one last concept here before we move on to the next post: the causal conditional probability.

In our previous example, we talked about the probability that it rains, given that you splash the sidewalk. This is clearly different than the probability that it rains, given that the sidewalk is wet. So we give it a new name.

Normal conditional probability = P(A | B) = probability that it rains given that the sidewalk is wet

Causal conditional probability = P(A | do B) = probability that it rains given that you splash the sidewalk.

The causal conditional probability of A given B, is just the probability of A given that you intervene on B and set it to “True”. And P(A | do ~B) is the probability of A given that you intervene on B and set it to “False”.
If we find that P(A | do B) = P(A | do ~B), then we have ruled out a-b-e1513835006290.png.


Previous: Causal arrows

Next: Correlation and causation

Causal Arrows

Previous post: Preliminaries

Let’s start discussing causality. The first thing I want to get across is that causal models tell us how to factor joint probability distributions.

Let’s say that we want to express a causal relationship between some variable A and another variable B. We’ll draw it this way:

A &gt; B

Let’s say that A = “It is raining”, and B = “The sidewalk is wet.”

Let’s assign probabilities to the various possibilities.

P(A & B) = 49%
P(A & ~B) = 1%
P(~A & B) = 5%
P(~A & ~B) = 45%

This is the joint probability distribution for our variables A and B. It tells us that it rains about half the time, that the sidewalk is almost always wet when it rains, and the sidewalk is rarely wet when it doesn’t rain.

Factorizations of a joint probability distribution express the joint probabilities in terms of a product of probabilities for each variable. Any given probability distribution may have multiple equivalent factorizations. So, for instance, we can factor our distribution like this:

Factorization 1:
P(A) = 50%
P(B | A) = 98%
P(B | ~A) = 10%

And we can also factor our distribution like this:

Factorization 2
P(B) = 54%
P(A | B) = 90.741%
P(A | ~B) = 2.174%

You can check for yourself that these factorizations are equivalent to our starting joint probability distribution by using the relationship between joint probabilities and conditional probabilities. For example, using Factorization 1:

P(A & ~B)
= P(A) · P(~B | A)
= 50% · 2%
= 1%

Just as expected! If any of this is confusing to you, go back to my last post.


Let’s rewind. What does any of this have to do with causality? Well, the diagram we drew above, in which rain causes sidewalk-wetness, instructs us as to how we should factor our joint probability distribution.

Here are the rules:

  1. If node X has no incoming arrows, you express its probability as P(X).
  2. If a node does have incoming arrows, you express its probability as conditional upon the values of its parent nodes – those from which the arrows originate.

Let’s look back at our diagram for rain and sidewalk-wetness.

A &gt; B

Which representation do we use?

A has no incoming arrows, so we express its probability unconditionally: P(A).

B has one incoming arrow from A, so we express its probability as conditional upon the possible values of A. That is, we use P(B | A) and P(B | ~A).

Which means that we use Factorization 1!

Say that instead somebody tells you that they think the causal relationship between rain and sidewalk-wetness goes the other way. I.e., they believe that the correct diagram is:

A &lt; B

Which factorization would they use?


So causal diagrams tell us how to factor a probability distribution over multiple variables. But why does this matter? After all, two different factorizations of a single probability distribution are empirically equivalent. Doesn’t this mean that “A causes B” and “B causes A” are empirically indistinguishable?

Two responses: First, this is only one component of causal models. Other uses of causal models that we will see in the next post will allow us to empirically determine the direction of causation.

And second: in fact, some causal diagrams can be empirically distinguished.

Say that somebody proclaims that there are no causal links between rain and sidewalk-wetness. We represent this as follows:


What does this tell us about how to express our probability distribution?

Well, A has no incoming arrows, so we use P(A). B has no incoming arrows, so we use P(B).

So let’s say we want to know the chance that it’s raining AND the sidewalk is wet. According to the diagram, we’ll calculate this in the following way:

P(A & B) = P(A) · P(B)

But wait! Let’s look back at our initial distribution:

P(A & B) = 49%
P(A & ~B) = 1%
P(~A & B) = 5%
P(~A & ~B) = 45%

Is it possible to get all of these values from just our two values P(A) and P(B)? No! (proof below)

In other words, our data rules out this causal model.

A X B crossed


To summarize: a causal diagram over two variables A and B tells you how to calculate things like P(A & B). It says that you break it into the probabilities of the individual propositions, and that the probability for each variable should be conditional on the possible values of its parents.

Next we’ll look at how we can empirically distinguish between &gt; and &lt;

Proof of dependence

Previous post: Preliminaries

Next post: Causal intervention

Causality: Preliminaries


One revolution in my thinking was Bayesianism – applying probability theory to beliefs. This has been thoroughly covered in self-contained series at all levels of accessibility elsewhere.

A more recent revolution in my thinking is causal modeling – using graphical networks to model causal relationships. There appears to be a lack of good online explanations of these tools for reasoning, so it seems worthwhile to create one.

My goal here is not to make you an expert in all things causal, but to pass on the key insights that have modified my thinking. Let’s get started!


Much of the framework of causal modeling relies on an understanding of probability theory. So in this first post, I’ll establish the basics that will be used in later posts. If you know how to factor a joint probability distribution, then you can safely skip this.

We’ll label propositions like “The movie has started” with the letters A, B, C, etc. Probability theory is about assigning probabilities to these propositions. A probability is a value between 0 and 1, where 0 is complete confidence that the statement is false and 1 is complete confidence that it is true.

Some notation:

The probability of A = P(A)
The negation of A = ~A
The joint probability that both A and B are true = P(A & B)
The conditional probability of A, given that B is true = P(A | B)

There are just five important things you need to know in order to understand the following posts:

  1. P(A & B) = P(A | B) · P(B)
  2. P(A) + P(~A) = 1
  3. A and B are independent if and only if P(A | B) = P(A). Otherwise, A and B are called dependent.
  4. A joint probability distribution over statements is an assignment of probabilities to all possible truth-values of those statements.
  5. factorization of a joint probability distribution is a way to break down the joint probabilities into products of probabilities of individual statements.

#1 should make some sense. To see how likely it is that A and B are both true, you can first calculate how likely it is that A is true given that B is true, then multiply by the chance that B is true. You can think of this as breaking a question about the probability of both A and B into two questions:

1. In a world in which B is true, how likely is it that A is true?
and 2. How likely is it that we are in that world where B is true?

#2 is just the idea that a proposition must be either true or false, and not both. This is the type of thing that sounds trivial, but ends up being extremely important for manipulating probabilities. For instance, it is also true that a proposition must be true or false and not both, given some other proposition. This means that the conditional probabilities P(B | A) and P(~B | A) must sum to 1 as well. From this we find that P(A) = P(A & B) + P(A & ~B). We’ll use this last identity often.

#3 is a definition of the terms dependence and independence. If two statements are independent, then the truth of one makes no difference to the probability of the other.  It also follows from #1 that if A and B are independent, then P(A & B) = P(A) · P(B). A lot of analysis of causality will be done by looking at probabilistic dependencies, so make sure that this makes sense.

I’ll explain #4 with a simple example. The possible truth-values of two variables A and B are the following:

Both are true: A & B
A is true, and B is false: A & ~B
A is false and B is true: ~A & B
Both are false: ~A & ~B

To specify the joint distribution, we assign probabilities to each of these. For instance:

P(A & B) = .25
P(A & ~B) = .25
P(~A & B) = .30
P(~A & ~B) = .20

In this case, the joint distribution is a set of four different joint probabilities.

And finally, #5 is a definition of factorization. We turn joint distributions into products of individual probabilities by using #1. For instance, one factorization of the joint distribution over A and B uses:

P(A & B) = P(A) · P(B | A)
P(A & ~B) = P(A) · P(~B | A)
P(~A & B) = P(~A) · P(B | ~A)
P(~A & ~B) = P(~A) · P(~B | ~A)

We can see that in order to express all four joint probabilities, we need to know the values of six probabilities. But as a result of #2, we only need to know three of them to find all six. If we specify P(A), P(B | A), and P(B | ~A), then we know the values of P(~A), P(~B | A) and P(~B | ~A). These three probabilities are the factors in our factorization.

P(B | A)
P(B | ~A)

One last thing to notice is that our joint distribution of A and B could have been factored in another way. This comes from the fact that we could use #1 to break down P(A & B), or equivalently to break down P(B & A). If we had done the second, then our factors would be P(B), P(A | B), and P(A | ~B).

And that’s everything!



We’ll apply all this by looking at one factorization of a joint probability distribution over three statements. With three statements, there are eight possible worlds:

A & B & C        A & B & ~C
A & ~B & C       A & ~B & ~C
~A & B & C       ~A & B & ~C
~A & ~B & C       ~A & ~B & ~C

The joint distribution over A, B and C is an assignment of probabilities to each of these worlds.

P(A & B & C)        P(A & B & ~C)
P(A & ~B & C)       P(A & ~B & ~C)
P(~A & B & C)       P(~A & B & ~C)
P(~A & ~B & C)       P(~A & ~B & ~C)

To factor our joint distribution, we just use Idea #1 twice, treating “B & C” as a single statement the first time:

P(A & B & C)
= P(A | B&C) · P(B&C)
= P(A | B&C) · P(B | C) · P(C)

This tells us that the factors we need to specify are:

P(B | C), P(B | ~C),
P(A | B & C), P(A | B & ~C), P(A | ~B & C), and P(A | ~B & ~C)


One last application, this time with actual numbers. Let’s revisit our earlier distribution:

P(A & B) = .25
P(A & ~B) = .25
P(~A & B) = .3
P(~A & ~B) = .2

To factor this distribution, we must find P(A), P(B | A), and P(B | ~A).

We’ll start by finding P(A) using #2.

Since P(B) + P(~B) = 1, P(A & B) + P(A & ~B) = P(A).

This means that P(A) = .5

We can now use #1 to find our remaining two numbers.

Plugging in values to P(A & B) = P(A) · P(B | A) and P(~A & B) = P(~A) · P(B | ~A), we have:

.25 = .5 · P(B | A)
.3 = .5 · P(B | ~A)

Therefore, P(B | A) = .5 and P(B | ~A) = .6

Next: Causal arrows