One revolution in my thinking was Bayesianism – applying probability theory to beliefs. This has been thoroughly covered in self-contained series at all levels of accessibility elsewhere.
A more recent revolution in my thinking is causal modeling – using graphical networks to model causal relationships. There appears to be a lack of good online explanations of these tools for reasoning, so it seems worthwhile to create one.
My goal here is not to make you an expert in all things causal, but to pass on the key insights that have modified my thinking. Let’s get started!
Much of the framework of causal modeling relies on an understanding of probability theory. So in this first post, I’ll establish the basics that will be used in later posts. If you know how to factor a joint probability distribution, then you can safely skip this.
We’ll label propositions like “The movie has started” with the letters A, B, C, etc. Probability theory is about assigning probabilities to these propositions. A probability is a value between 0 and 1, where 0 is complete confidence that the statement is false and 1 is complete confidence that it is true.
The probability of A = P(A)
The negation of A = ~A
The joint probability that both A and B are true = P(A & B)
The conditional probability of A, given that B is true = P(A | B)
There are just five important things you need to know in order to understand the following posts:
1. P(A & B) = P(A | B) · P(B)
2. P(A) + P(~A) = 1
3. A and B are independent if and only if P(A | B) = P(A). Otherwise, A and B are called dependent.
4. A joint probability distribution over statements is an assignment of probabilities to all possible truth-values of those statements.
5. A factorization of a joint probability distribution is a way to break down the joint probabilities into products of probabilities of individual statements.
#1 should make some sense. To see how likely it is that A and B are both true, you can first calculate how likely it is that A is true given that B is true, then multiply by the chance that B is true. You can think of this as breaking a question about the probability of both A and B into two questions:
1. In a world in which B is true, how likely is it that A is true?
2. How likely is it that we are in that world where B is true?
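Here is a minimal sketch of that two-question breakdown in Python. The numbers and the function name `product_rule` are made up purely for illustration; they don't come from any real data.

```python
def product_rule(p_a_given_b, p_b):
    """Idea #1: P(A & B) = P(A | B) * P(B)."""
    return p_a_given_b * p_b

# Hypothetical values, chosen only to illustrate the rule.
p_b = 0.4          # Question 2: how likely is the world where B is true?
p_a_given_b = 0.5  # Question 1: within that world, how likely is A?

p_a_and_b = product_rule(p_a_given_b, p_b)
print(p_a_and_b)
```

With these assumed values, the chance that both A and B are true comes out to one fifth.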
#2 is just the idea that a proposition must be either true or false, and not both. This is the type of thing that sounds trivial, but ends up being extremely important for manipulating probabilities. For instance, it is also true that a proposition must be true or false and not both, given some other proposition. This means that the conditional probabilities P(B | A) and P(~B | A) must sum to 1 as well. Multiplying both sides of P(B | A) + P(~B | A) = 1 by P(A), and applying #1 to each term on the left, we find that P(A) = P(A & B) + P(A & ~B). We’ll use this last identity often.
#3 is a definition of the terms dependence and independence. If two statements are independent, then the truth of one makes no difference to the probability of the other. It also follows from #1 that if A and B are independent, then P(A & B) = P(A) · P(B). A lot of analysis of causality will be done by looking at probabilistic dependencies, so make sure that this makes sense.
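To make the definition concrete, here is a rough sketch that checks independence directly from four joint probabilities. The function name `is_independent` and both sets of numbers are my own illustrative assumptions, and the check assumes P(B) is greater than 0 so that P(A | B) is defined.

```python
import math

def is_independent(p_ab, p_a_notb, p_nota_b, p_nota_notb):
    """Return True if P(A | B) equals P(A) for the given joint probabilities.

    Arguments are P(A & B), P(A & ~B), P(~A & B), P(~A & ~B).
    The last argument is not needed for the check; it is accepted
    so that callers can pass the whole joint distribution.
    """
    p_a = p_ab + p_a_notb      # P(A) = P(A & B) + P(A & ~B)
    p_b = p_ab + p_nota_b      # P(B) = P(A & B) + P(~A & B)
    p_a_given_b = p_ab / p_b   # #1 rearranged; assumes P(B) > 0
    return math.isclose(p_a_given_b, p_a)

# Hypothetical case 1: learning B tells us nothing about A.
print(is_independent(0.25, 0.25, 0.25, 0.25))  # True

# Hypothetical case 2: P(A) = 0.5 but P(A | B) = 0.8, so A depends on B.
print(is_independent(0.4, 0.1, 0.1, 0.4))      # False
```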
I’ll explain #4 with a simple example. The possible truth-values of two variables A and B are the following:
Both are true: A & B
A is true, and B is false: A & ~B
A is false and B is true: ~A & B
Both are false: ~A & ~B
To specify the joint distribution, we assign probabilities to each of these. For instance:
P(A & B) = .25
P(A & ~B) = .25
P(~A & B) = .30
P(~A & ~B) = .20
In this case, the joint distribution is a set of four different joint probabilities.
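One way to hold such a distribution in code is a dictionary keyed by truth-values. This is only a sketch of one possible representation; the variable names are my own, and I use `Fraction` so the arithmetic stays exact.

```python
from fractions import Fraction as F

# Keys are (A, B) truth-values; values are the joint probabilities above.
joint = {
    (True, True): F(25, 100),    # P(A & B)
    (True, False): F(25, 100),   # P(A & ~B)
    (False, True): F(30, 100),   # P(~A & B)
    (False, False): F(20, 100),  # P(~A & ~B)
}

# The four worlds are exhaustive and mutually exclusive, so they sum to 1.
total = sum(joint.values())

# Summing over the worlds where a statement is true gives its probability.
p_a = sum(p for (a, b), p in joint.items() if a)
p_b = sum(p for (a, b), p in joint.items() if b)

print(total, p_a, p_b)  # 1 1/2 11/20
```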
And finally, #5 is a definition of factorization. We turn joint distributions into products of individual probabilities by using #1. For instance, one factorization of the joint distribution over A and B uses:
P(A & B) = P(A) · P(B | A)
P(A & ~B) = P(A) · P(~B | A)
P(~A & B) = P(~A) · P(B | ~A)
P(~A & ~B) = P(~A) · P(~B | ~A)
We can see that in order to express all four joint probabilities, we need to know the values of six probabilities. But as a result of #2, we only need to know three of them to find all six. If we specify P(A), P(B | A), and P(B | ~A), then we know the values of P(~A), P(~B | A) and P(~B | ~A). These three probabilities are the factors in our factorization:
P(A)
P(B | A)
P(B | ~A)
One last thing to notice is that our joint distribution of A and B could have been factored in another way. This comes from the fact that we could use #1 to break down P(A & B), or equivalently to break down P(B & A). If we had done the second, then our factors would be P(B), P(A | B), and P(A | ~B).
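As a quick sanity check on that claim, the sketch below computes the B-first factors for the example distribution given earlier and rebuilds all four joint probabilities from them. The dictionary layout and variable names are illustrative assumptions on my part.

```python
from fractions import Fraction as F

# The example joint distribution, keyed by (A, B) truth-values.
joint = {
    (True, True): F(25, 100),
    (True, False): F(25, 100),
    (False, True): F(30, 100),
    (False, False): F(20, 100),
}

# The three B-first factors: P(B), P(A | B), P(A | ~B).
p_b = joint[(True, True)] + joint[(False, True)]
p_a_given_b = joint[(True, True)] / p_b
p_a_given_notb = joint[(True, False)] / (1 - p_b)

# Rebuild every joint probability using #1 and #2.
rebuilt = {
    (True, True): p_a_given_b * p_b,
    (True, False): p_a_given_notb * (1 - p_b),
    (False, True): (1 - p_a_given_b) * p_b,
    (False, False): (1 - p_a_given_notb) * (1 - p_b),
}

print(p_b, p_a_given_b, p_a_given_notb)  # 11/20 5/11 5/9
print(rebuilt == joint)                  # True
```

The factors look different from the A-first ones, but they carry exactly the same information about the joint distribution.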
And that’s everything!
We’ll apply all this by looking at one factorization of a joint probability distribution over three statements. With three statements, there are eight possible worlds:
A & B & C A & B & ~C
A & ~B & C A & ~B & ~C
~A & B & C ~A & B & ~C
~A & ~B & C ~A & ~B & ~C
The joint distribution over A, B and C is an assignment of probabilities to each of these worlds.
P(A & B & C) P(A & B & ~C)
P(A & ~B & C) P(A & ~B & ~C)
P(~A & B & C) P(~A & B & ~C)
P(~A & ~B & C) P(~A & ~B & ~C)
To factor our joint distribution, we just use Idea #1 twice, treating “B & C” as a single statement the first time:
P(A & B & C)
= P(A | B & C) · P(B & C)
= P(A | B & C) · P(B | C) · P(C)
This tells us that the factors we need to specify are:
P(C),
P(B | C), P(B | ~C),
P(A | B & C), P(A | B & ~C), P(A | ~B & C), and P(A | ~B & ~C)
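To see that these seven numbers really do pin down all eight worlds, here is a small sketch that builds the full joint distribution from them. Every factor value below is hypothetical, and the helper `prob` is just a name I made up for applying #2.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical factor values, invented for illustration only.
p_c = F(1, 2)                                # P(C)
p_b_given = {True: F(3, 4), False: F(1, 4)}  # P(B | C), P(B | ~C)
p_a_given = {                                # P(A | B, C), keyed by (B, C)
    (True, True): F(9, 10),
    (True, False): F(1, 2),
    (False, True): F(1, 2),
    (False, False): F(1, 10),
}

def prob(value, p_true):
    """Use #2: a statement is false with probability 1 - P(true)."""
    return p_true if value else 1 - p_true

# Apply #1 twice for each of the eight worlds.
joint = {}
for a, b, c in product([True, False], repeat=3):
    joint[(a, b, c)] = (
        prob(a, p_a_given[(b, c)]) * prob(b, p_b_given[c]) * prob(c, p_c)
    )

print(len(joint), sum(joint.values()))  # 8 1
print(joint[(True, True, True)])        # 27/80
```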
One last application, this time with actual numbers. Let’s revisit our earlier distribution:
P(A & B) = .25
P(A & ~B) = .25
P(~A & B) = .3
P(~A & ~B) = .2
To factor this distribution, we must find P(A), P(B | A), and P(B | ~A).
We’ll start by finding P(A) using the identity we derived from #2.
Since P(B | A) + P(~B | A) = 1, P(A & B) + P(A & ~B) = P(A).
This means that P(A) = .25 + .25 = .5
We can now use #1 to find our remaining two numbers.
Plugging in values to P(A & B) = P(A) · P(B | A) and P(~A & B) = P(~A) · P(B | ~A), and using P(~A) = 1 - P(A) = .5, we have:
.25 = .5 · P(B | A)
.3 = .5 · P(B | ~A)
Therefore, P(B | A) = .5 and P(B | ~A) = .6
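If you'd like to check the arithmetic, the same steps can be sketched as a small function. The name `factor_a_first` and the tuple-keyed dictionary are my own illustrative choices, and the function assumes P(A) is strictly between 0 and 1 so that both divisions are defined.

```python
import math

def factor_a_first(joint):
    """Return (P(A), P(B | A), P(B | ~A)) for a joint keyed by (A, B)."""
    p_a = joint[(True, True)] + joint[(True, False)]       # from #2
    p_b_given_a = joint[(True, True)] / p_a                # from #1
    p_b_given_nota = joint[(False, True)] / (1 - p_a)      # from #1 and #2
    return p_a, p_b_given_a, p_b_given_nota

joint = {
    (True, True): 0.25,
    (True, False): 0.25,
    (False, True): 0.3,
    (False, False): 0.2,
}

p_a, p_b_given_a, p_b_given_nota = factor_a_first(joint)

# Compare with the values worked out by hand above.
print(math.isclose(p_a, 0.5))             # True
print(math.isclose(p_b_given_a, 0.5))     # True
print(math.isclose(p_b_given_nota, 0.6))  # True
```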
Next: Causal arrows