Causal Intervention

Previous post: Causal arrows

Let’s quickly review the last post. A causal diagram for two variables A and B tells us how to factor the joint probability distribution P(A & B). The rule we use is that for each variable, we calculate its probability conditional upon all of its parent nodes. This can easily be generalized to any number of variables.

Quick exercises: See if you understand why the following are true.

1. If the causal relationships between three variables A, B, and C are:

A > B > C

Then P(A & B & C) = P(A) · P(B | A) · P(C | B).

2. If the causal relationships are:

A < B > C

Then P(A & B & ~C) = P(A | B) · P(B) · P(~C | B).

3. If the causal relationships are:

A > B < C

Then P(~A & ~B & C) = P(~A) · P(~B | ~A & C) · P(C)
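
The rule behind these exercises is mechanical enough to sketch in a few lines of Python. This is only an illustration: the `factorization` helper and the dictionary encoding of a diagram (each variable mapped to a list of its parents) are my own assumptions, not notation from the posts. The fork diagram is the one implied by the formula in exercise 2, where B is the parent of both A and C.

```python
def factorization(parents):
    """Return the factorization implied by a causal diagram.

    `parents` (a hypothetical encoding) maps each variable name to the
    list of nodes that have an arrow pointing into it.
    """
    factors = []
    for var in sorted(parents):
        pa = parents[var]
        if pa:
            # Node with incoming arrows: condition on all of its parents.
            factors.append(f"P({var} | {' & '.join(sorted(pa))})")
        else:
            # Node with no incoming arrows: unconditional probability.
            factors.append(f"P({var})")
    return " · ".join(factors)


chain = {"A": [], "B": ["A"], "C": ["B"]}        # A > B > C
fork = {"A": ["B"], "B": [], "C": ["B"]}         # B causes both A and C
collider = {"A": [], "B": ["A", "C"], "C": []}   # A > B < C

print(factorization(chain))     # P(A) · P(B | A) · P(C | B)
print(factorization(fork))      # P(A | B) · P(B) · P(C | B)
print(factorization(collider))  # P(A) · P(B | A & C) · P(C)
```

The same pattern holds whichever truth-values you plug in, which is why the exercises can swap in ~A, ~B, or ~C freely.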

Got it? Then you’re ready to move on!

***

Two people are debating a causal question. One of them says that the rain causes the sidewalk to get wet. The other one says that the sidewalk being wet causes the rain. We can express their debate as:

A > B   OR   A < B

We’ve already seen that the probability distributions that correspond to these causal models are empirically indistinguishable. So how do we tell who’s right?

Easy! We go outside with a bucket of water and splash it on the sidewalk. Then we check and see if it’s raining. Another day, we apply a high-powered blow-drier to the sidewalk and check if it’s raining.

We repeat this a bunch of times at random intervals, and see if we find that splashing the sidewalk makes it any more likely to rain than blow-drying the sidewalk. If so, then we know that sidewalk-wetness causes rain, not the other way around.

This is the process of intervention. When we intervene on a variable, we set it to some desired value and see what happens. Let’s express this with our diagrams.

When we splash the sidewalk with water, what we are in essence doing is setting the variable B (“The sidewalk is wet”) to true. And when we blow-dry the sidewalk, we are setting the variable B to false. Since we are now the complete determinant of the value of B, all causal arrows pointing towards B must be erased. So:

A > B becomes A   B (no arrow)

and

A < B stays A < B

And now our intervened-upon distributions are empirically distinguishable!

The person who thinks that sidewalk-wetness causes rain expects to find a probabilistic dependence between A and B when we intervene. In particular, they expect that it will be more likely to rain when you splash the sidewalk than when you blow-dry it.

And the person who thinks that rain causes sidewalk-wetness expects to find no probabilistic dependence between A and B. They’ll expect that it is equally likely to be raining if you’re splashing the sidewalk as if you’re blow-drying it.
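
To make the two predictions concrete, here is a rough Python sketch. It borrows the rain and sidewalk numbers from the previous post, and it assumes that the two-variable diagram is the whole story (no hidden common causes). The function names and the model labels are made up for illustration.

```python
from fractions import Fraction as F

# Observational joint distribution from the previous post.
# Keys are (raining, sidewalk_wet).
joint = {
    (True, True): F(49, 100),
    (True, False): F(1, 100),
    (False, True): F(5, 100),
    (False, False): F(45, 100),
}


def p_rain():
    """Marginal probability of rain, P(A)."""
    return sum(p for (a, _), p in joint.items() if a)


def p_rain_given_wet(b):
    """Ordinary conditional probability of rain given B = b."""
    p_b = sum(p for (_, wet), p in joint.items() if wet == b)
    return joint[(True, b)] / p_b


def p_rain_do_wet(b, model):
    """Predicted probability of rain after we set B = b ourselves."""
    if model == "rain_causes_wet":
        # The arrow into B is erased, so A keeps its marginal probability.
        return p_rain()
    if model == "wet_causes_rain":
        # The arrow out of B survives, so A still responds to B.
        return p_rain_given_wet(b)
    raise ValueError(model)


for model in ("rain_causes_wet", "wet_causes_rain"):
    splash = p_rain_do_wet(True, model)
    blow_dry = p_rain_do_wet(False, model)
    print(model, float(splash), float(blow_dry))
```

Under "rain causes wetness" both interventions give 1/2; under "wetness causes rain" splashing gives 49/54 and blow-drying gives 1/46, so the two models disagree about what we will observe.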

***

This is how to determine the direction of causal arrows using causal models. The key insight here is that a causal model tells you what happens when you perform interventions.

The rule is: Causal intervention on a variable X is represented by erasing all incoming arrows to X and setting its value to its intervened value.
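
As a minimal sketch of this rule, we can treat a diagram as a dictionary from each variable to its list of parents and "erase incoming arrows" by emptying that list. The encoding and the `intervene` helper are assumptions made for illustration.

```python
def intervene(parents, target):
    """Return a copy of the diagram with every arrow into `target` erased."""
    surgered = {var: list(pa) for var, pa in parents.items()}
    surgered[target] = []  # we alone now determine the value of `target`
    return surgered


rain_causes_wet = {"A": [], "B": ["A"]}   # A > B
wet_causes_rain = {"A": ["B"], "B": []}   # A < B

print(intervene(rain_causes_wet, "B"))  # {'A': [], 'B': []}
print(intervene(wet_causes_rain, "B"))  # {'A': ['B'], 'B': []}
```

Only the first diagram changes, because only there does B have an incoming arrow to erase.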

I’ll introduce one last concept here before we move on to the next post: the causal conditional probability.

In our previous example, we talked about the probability that it rains, given that you splash the sidewalk. This is clearly different than the probability that it rains, given that the sidewalk is wet. So we give it a new name.

Normal conditional probability = P(A | B) = probability that it rains given that the sidewalk is wet

Causal conditional probability = P(A | do B) = probability that it rains given that you splash the sidewalk.

The causal conditional probability of A given B is just the probability of A given that you intervene on B and set it to “True”. And P(A | do ~B) is the probability of A given that you intervene on B and set it to “False”.

If we find that P(A | do B) = P(A | do ~B), then we have ruled out A < B, the model in which sidewalk-wetness causes rain.

 


Next: Correlation and causation

Causal Arrows

Previous post: Preliminaries

Let’s start discussing causality. The first thing I want to get across is that causal models tell us how to factor joint probability distributions.

Let’s say that we want to express a causal relationship between some variable A and another variable B. We’ll draw it this way:

A > B

Let’s say that A = “It is raining”, and B = “The sidewalk is wet.”

Let’s assign probabilities to the various possibilities.

P(A & B) = 49%
P(A & ~B) = 1%
P(~A & B) = 5%
P(~A & ~B) = 45%

This is the joint probability distribution for our variables A and B. It tells us that it rains about half the time, that the sidewalk is almost always wet when it rains, and that the sidewalk is rarely wet when it doesn’t rain.

Factorizations of a joint probability distribution express the joint probabilities in terms of a product of probabilities for each variable. Any given probability distribution may have multiple equivalent factorizations. So, for instance, we can factor our distribution like this:

Factorization 1:
P(A) = 50%
P(B | A) = 98%
P(B | ~A) = 10%

And we can also factor our distribution like this:

Factorization 2:
P(B) = 54%
P(A | B) = 90.741%
P(A | ~B) = 2.174%

You can check for yourself that these factorizations are equivalent to our starting joint probability distribution by using the relationship between joint probabilities and conditional probabilities. For example, using Factorization 1:

P(A & ~B)
= P(A) · P(~B | A)
= 50% · 2%
= 1%

Just as expected! If any of this is confusing to you, go back to my last post.
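
If you would rather let a computer do the checking, here is a small Python sketch. It derives both factorizations from the joint distribution and rebuilds all four joint probabilities from each. The variable names are my own, and exact fractions are used so that rounding does not muddy the comparison.

```python
from fractions import Fraction as F

joint = {("A", "B"): F(49, 100), ("A", "~B"): F(1, 100),
         ("~A", "B"): F(5, 100), ("~A", "~B"): F(45, 100)}

# Factorization 1: P(A), P(B | A), P(B | ~A)
p_a = joint[("A", "B")] + joint[("A", "~B")]
p_b_given_a = joint[("A", "B")] / p_a
p_b_given_not_a = joint[("~A", "B")] / (1 - p_a)

# Factorization 2: P(B), P(A | B), P(A | ~B)
p_b = joint[("A", "B")] + joint[("~A", "B")]
p_a_given_b = joint[("A", "B")] / p_b
p_a_given_not_b = joint[("A", "~B")] / (1 - p_b)

# Rebuild the joint distribution from each set of factors.
rebuilt_1 = {("A", "B"): p_a * p_b_given_a,
             ("A", "~B"): p_a * (1 - p_b_given_a),
             ("~A", "B"): (1 - p_a) * p_b_given_not_a,
             ("~A", "~B"): (1 - p_a) * (1 - p_b_given_not_a)}
rebuilt_2 = {("A", "B"): p_b * p_a_given_b,
             ("A", "~B"): (1 - p_b) * p_a_given_not_b,
             ("~A", "B"): p_b * (1 - p_a_given_b),
             ("~A", "~B"): (1 - p_b) * (1 - p_a_given_not_b)}

print(rebuilt_1 == joint, rebuilt_2 == joint)  # True True
```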

***

Let’s rewind. What does any of this have to do with causality? Well, the diagram we drew above, in which rain causes sidewalk-wetness, instructs us as to how we should factor our joint probability distribution.

Here are the rules:

  1. If node X has no incoming arrows, you express its probability as P(X).
  2. If a node does have incoming arrows, you express its probability as conditional upon the values of its parent nodes – those from which the arrows originate.

Let’s look back at our diagram for rain and sidewalk-wetness.

A > B

Which representation do we use?

A has no incoming arrows, so we express its probability unconditionally: P(A).

B has one incoming arrow from A, so we express its probability as conditional upon the possible values of A. That is, we use P(B | A) and P(B | ~A).

Which means that we use Factorization 1!

Say that instead somebody tells you that they think the causal relationship between rain and sidewalk-wetness goes the other way. I.e., they believe that the correct diagram is:

A < B

Which factorization would they use?

***

So causal diagrams tell us how to factor a probability distribution over multiple variables. But why does this matter? After all, two different factorizations of a single probability distribution are empirically equivalent. Doesn’t this mean that “A causes B” and “B causes A” are empirically indistinguishable?

Two responses: First, this is only one component of causal models. Other uses of causal models that we will see in the next post will allow us to empirically determine the direction of causation.

And second: in fact, some causal diagrams can be empirically distinguished.

Say that somebody proclaims that there are no causal links between rain and sidewalk-wetness. We represent this as follows:

A   B (no arrow between them)

What does this tell us about how to express our probability distribution?

Well, A has no incoming arrows, so we use P(A). B has no incoming arrows, so we use P(B).

So let’s say we want to know the chance that it’s raining AND the sidewalk is wet. According to the diagram, we’ll calculate this in the following way:

P(A & B) = P(A) · P(B)

But wait! Let’s look back at our initial distribution:

P(A & B) = 49%
P(A & ~B) = 1%
P(~A & B) = 5%
P(~A & ~B) = 45%

Is it possible to get all of these values from just our two values P(A) and P(B)? No! (proof below)

In other words, our data rules out this causal model.

[Diagram: the no-arrow model for A and B, crossed out]
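
Here is a quick numerical version of the argument, as a sketch rather than a formal proof. If the no-arrow model were right, then P(A) and P(B) would have to match the marginals of our data (sum the product P(A) · P(B) over the values of B and you get P(A) back), so the model's prediction for P(A & B) is pinned down completely. The variable names are my own.

```python
from fractions import Fraction as F

joint = {("A", "B"): F(49, 100), ("A", "~B"): F(1, 100),
         ("~A", "B"): F(5, 100), ("~A", "~B"): F(45, 100)}

# Marginals forced by the data.
p_a = joint[("A", "B")] + joint[("A", "~B")]   # 1/2
p_b = joint[("A", "B")] + joint[("~A", "B")]   # 27/50

# What the no-arrow model predicts for "raining and wet".
predicted = p_a * p_b

print(float(predicted), float(joint[("A", "B")]))  # 0.27 0.49
```

The model says 27%, the data says 49%, so no choice of P(A) and P(B) can reproduce our distribution.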

***

To summarize: a causal diagram over two variables A and B tells you how to calculate things like P(A & B). It says that you break it into the probabilities of the individual propositions, and that the probability for each variable should be conditional on the possible values of its parents.

Next we’ll look at how we can empirically distinguish between A > B and A < B.

Proof of dependence


Next post: Causal intervention

Causality: Preliminaries


One revolution in my thinking was Bayesianism – applying probability theory to beliefs. This has been thoroughly covered in self-contained series at all levels of accessibility elsewhere.

A more recent revolution in my thinking is causal modeling – using graphical networks to model causal relationships. There appears to be a lack of good online explanations of these tools for reasoning, so it seems worthwhile to create one.

My goal here is not to make you an expert in all things causal, but to pass on the key insights that have modified my thinking. Let’s get started!

***

Much of the framework of causal modeling relies on an understanding of probability theory. So in this first post, I’ll establish the basics that will be used in later posts. If you know how to factor a joint probability distribution, then you can safely skip this.

We’ll label propositions like “The movie has started” with the letters A, B, C, etc. Probability theory is about assigning probabilities to these propositions. A probability is a value between 0 and 1, where 0 is complete confidence that the statement is false and 1 is complete confidence that it is true.

Some notation:

The probability of A = P(A)
The negation of A = ~A
The joint probability that both A and B are true = P(A & B)
The conditional probability of A, given that B is true = P(A | B)

There are just five important things you need to know in order to understand the following posts:

  1. P(A & B) = P(A | B) · P(B)
  2. P(A) + P(~A) = 1
  3. A and B are independent if and only if P(A | B) = P(A). Otherwise, A and B are called dependent.
  4. A joint probability distribution over statements is an assignment of probabilities to all possible truth-values of those statements.
  5. A factorization of a joint probability distribution is a way to break down the joint probabilities into products of probabilities of individual statements.

#1 should make some sense. To see how likely it is that A and B are both true, you can first calculate how likely it is that A is true given that B is true, then multiply by the chance that B is true. You can think of this as breaking a question about the probability of both A and B into two questions:

1. In a world in which B is true, how likely is it that A is true?
and 2. How likely is it that we are in that world where B is true?

#2 is just the idea that a proposition must be either true or false, and not both. This is the type of thing that sounds trivial, but ends up being extremely important for manipulating probabilities. For instance, it is also true that a proposition must be true or false and not both, given some other proposition. This means that the conditional probabilities P(B | A) and P(~B | A) must sum to 1 as well. From this we find that P(A) = P(A & B) + P(A & ~B). We’ll use this last identity often.

#3 is a definition of the terms dependence and independence. If two statements are independent, then the truth of one makes no difference to the probability of the other.  It also follows from #1 that if A and B are independent, then P(A & B) = P(A) · P(B). A lot of analysis of causality will be done by looking at probabilistic dependencies, so make sure that this makes sense.

I’ll explain #4 with a simple example. The possible truth-values of two variables A and B are the following:

Both are true: A & B
A is true, and B is false: A & ~B
A is false and B is true: ~A & B
Both are false: ~A & ~B

To specify the joint distribution, we assign probabilities to each of these. For instance:

P(A & B) = .25
P(A & ~B) = .25
P(~A & B) = .30
P(~A & ~B) = .20

In this case, the joint distribution is a set of four different joint probabilities.
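
As a small illustration of #3 and #4 together, here is a Python sketch that stores this joint distribution as a dictionary and checks whether A and B are independent. The encoding (pairs of truth-values as keys) is just one convenient choice, not anything canonical.

```python
from fractions import Fraction as F

# Keys are (truth-value of A, truth-value of B).
joint = {(True, True): F(25, 100), (True, False): F(25, 100),
         (False, True): F(30, 100), (False, False): F(20, 100)}

# A joint distribution must account for every possible world.
assert sum(joint.values()) == 1

p_a = sum(p for (a, _), p in joint.items() if a)
p_b = sum(p for (_, b), p in joint.items() if b)
p_a_given_b = joint[(True, True)] / p_b   # idea #1, rearranged

# Idea #3: independent if and only if P(A | B) = P(A).
independent = (p_a_given_b == p_a)
print(p_a, p_b, p_a_given_b, independent)  # 1/2 11/20 5/11 False
```

Learning that B is true lowers the probability of A from 1/2 to 5/11, so in this distribution A and B are dependent.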

And finally, #5 is a definition of factorization. We turn joint distributions into products of individual probabilities by using #1. For instance, one factorization of the joint distribution over A and B uses:

P(A & B) = P(A) · P(B | A)
P(A & ~B) = P(A) · P(~B | A)
P(~A & B) = P(~A) · P(B | ~A)
P(~A & ~B) = P(~A) · P(~B | ~A)

We can see that in order to express all four joint probabilities, we need to know the values of six probabilities. But as a result of #2, we only need to know three of them to find all six. If we specify P(A), P(B | A), and P(B | ~A), then we know the values of P(~A), P(~B | A) and P(~B | ~A). These three probabilities are the factors in our factorization.

P(A)
P(B | A)
P(B | ~A)

One last thing to notice is that our joint distribution of A and B could have been factored in another way. This comes from the fact that we could use #1 to break down P(A & B), or equivalently to break down P(B & A). If we had done the second, then our factors would be P(B), P(A | B), and P(A | ~B).

And that’s everything!

***

Examples!

We’ll apply all this by looking at one factorization of a joint probability distribution over three statements. With three statements, there are eight possible worlds:

A & B & C        A & B & ~C
A & ~B & C       A & ~B & ~C
~A & B & C       ~A & B & ~C
~A & ~B & C      ~A & ~B & ~C

The joint distribution over A, B and C is an assignment of probabilities to each of these worlds.

P(A & B & C)        P(A & B & ~C)
P(A & ~B & C)       P(A & ~B & ~C)
P(~A & B & C)       P(~A & B & ~C)
P(~A & ~B & C)      P(~A & ~B & ~C)

To factor our joint distribution, we just use Idea #1 twice, treating “B & C” as a single statement the first time:

P(A & B & C)
= P(A | B&C) · P(B&C)
= P(A | B&C) · P(B | C) · P(C)

This tells us that the factors we need to specify are:

P(C),
P(B | C), P(B | ~C),
P(A | B & C), P(A | B & ~C), P(A | ~B & C), and P(A | ~B & ~C)
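
To see that this really does recover every joint probability, here is a sketch that checks the identity on all eight worlds. The joint distribution is made up (weights 1 through 8, normalized) purely for illustration, and the helper names are my own.

```python
from fractions import Fraction as F
from itertools import product

# The eight possible worlds, as (a, b, c) truth-value triples.
worlds = list(product([True, False], repeat=3))

# An arbitrary illustrative joint distribution: weights 1..8 out of 36.
joint = {w: F(i + 1, 36) for i, w in enumerate(worlds)}


def prob(pred):
    """Total probability of the worlds where pred(a, b, c) holds."""
    return sum(p for w, p in joint.items() if pred(*w))


def chain_rule(a, b, c):
    """Rebuild one joint probability from its three factors."""
    p_c = prob(lambda _a, _b, _c: _c == c)
    p_bc = prob(lambda _a, _b, _c: _b == b and _c == c)
    p_b_given_c = p_bc / p_c                 # idea #1
    p_a_given_bc = joint[(a, b, c)] / p_bc   # idea #1 again
    return p_a_given_bc * p_b_given_c * p_c


print(all(chain_rule(*w) == joint[w] for w in worlds))  # True
```

Counting the factors listed above gives 1 + 2 + 4 = 7 numbers, which is exactly what it takes to pin down eight joint probabilities that must sum to 1.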

***

One last application, this time with actual numbers. Let’s revisit our earlier distribution:

P(A & B) = .25
P(A & ~B) = .25
P(~A & B) = .3
P(~A & ~B) = .2

To factor this distribution, we must find P(A), P(B | A), and P(B | ~A).

We’ll start by finding P(A) using #2.

Since P(B | A) + P(~B | A) = 1, we have P(A & B) + P(A & ~B) = P(A).

This means that P(A) = .5

We can now use #1 to find our remaining two numbers.

Plugging in values to P(A & B) = P(A) · P(B | A) and P(~A & B) = P(~A) · P(B | ~A), we have:

.25 = .5 · P(B | A)
.3 = .5 · P(B | ~A)

Therefore, P(B | A) = .5 and P(B | ~A) = .6
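
The same steps can be written as a short Python function. This is a sketch that assumes P(A) is strictly between 0 and 1 (otherwise one of the divisions is undefined); the `factor` name and the dictionary layout are my own.

```python
from fractions import Fraction as F


def factor(joint):
    """Return (P(A), P(B | A), P(B | ~A)) for a joint over two statements.

    Keys of `joint` are (a, b) pairs of truth-values.
    """
    p_a = joint[(True, True)] + joint[(True, False)]     # idea #2
    p_b_given_a = joint[(True, True)] / p_a              # idea #1
    p_b_given_not_a = joint[(False, True)] / (1 - p_a)   # idea #1
    return p_a, p_b_given_a, p_b_given_not_a


joint = {(True, True): F(25, 100), (True, False): F(25, 100),
         (False, True): F(30, 100), (False, False): F(20, 100)}

print([float(x) for x in factor(joint)])  # [0.5, 0.5, 0.6]
```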

Next: Causal arrows