Previous: Screening off and explaining away

A look at a university's admission statistics reveals that women are less likely than men to be admitted to its graduate programs. A closer look shows that when the data are broken down by department, women are in fact more likely to be admitted than men. Does this sound impossible to you? It happened at UC Berkeley in 1973.

When two treatments are tested on a group of patients with kidney stones, Treatment A turns out to lead to worse recovery rates than Treatment B. But when the patients are divided according to the size of their kidney stone, it turns out that *no matter how large their kidney stone,* Treatment A *always does better* than Treatment B. Is this a logical contradiction? Nope, it happened in 1986!

What’s going on here? How can we make sense of this apparently inconsistent data? And most importantly, what conclusions do we draw? Is Berkeley biased against women or men? Is Treatment A actually more effective or less effective than Treatment B?

In this post, we’ll apply what we’ve learned about causal modeling to be able to answer these questions.

*******

Quine gave the following categorization of types of paradoxes: veridical paradoxes (those that *seem* wrong but are actually correct), falsidical paradoxes (those that seem wrong and actually *are* wrong), and antinomies (those that are premised on common forms of reasoning and end up deriving a contradiction).

Simpson’s paradox is in the first category. While it seems impossible, it actually *is* possible, and it happens all the time. Our first task is to explain away the apparent falsity of the paradox.

Let’s look at some actual data on the recovery rates for different treatments of kidney stones.

| | Treatment A | Treatment B |
|---|---|---|
| All patients | 78% (273/350) | 83% (289/350) |

The percentages represent the number of patients that recovered, out of all those that were given the particular treatment. So 273 patients recovered out of the 350 patients given Treatment A, giving us 78%. And 289 patients recovered out of the 350 patients given Treatment B, giving 83%.

At this point we’d be tempted to proclaim that B is the better treatment. But if we now break down the data and divide up the patients by kidney stone size, we see:

| | Treatment A | Treatment B |
|---|---|---|
| Small stones | 93% (81/87) | 87% (234/270) |
| Large stones | 73% (192/263) | 69% (55/80) |

And here the paradoxical conclusion falls out! If you have small stones, Treatment A looks better for you. And if you have large stones, Treatment A looks better for you. So no matter *what* size kidney stones you have, Treatment A is better!

And yet, amongst all patients, Treatment B has a higher recovery rate.

Small stones: A better than B

Large stones: A better than B

All sizes: B better than A

I encourage you to check out the numbers for yourself, in case you still don’t believe this.
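If you'd rather let a computer do the checking, here is a minimal sketch in Python. The counts are copied from the tables above; the names `counts` and `recovery_rate` are just illustrative choices of mine.

```python
from fractions import Fraction

# (recovered, treated) counts copied from the tables above.
counts = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

def recovery_rate(treatment, size=None):
    """Recovery rate for a treatment, optionally restricted to one stone size."""
    cells = [
        c for (t, s), c in counts.items()
        if t == treatment and (size is None or s == size)
    ]
    recovered = sum(r for r, _ in cells)
    treated = sum(n for _, n in cells)
    return Fraction(recovered, treated)

# Compare the two treatments within each stone size, then over all patients.
for size in ("small", "large", None):
    a = recovery_rate("A", size)
    b = recovery_rate("B", size)
    label = size or "all"
    winner = "A" if a > b else "B"
    print(f"{label:>5}: A = {float(a):.1%}, B = {float(b):.1%} -> {winner} better")
```

Using `Fraction` keeps the comparisons exact, so the reversal can't be blamed on rounding.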

*******

The simplest explanation for what’s going on here is that we are treating conditional probabilities like they are joint probabilities. Let’s look again at our table, and express the meaning of the different percentages more precisely.

| | Treatment A | Treatment B |
|---|---|---|
| Small stones | P(Recovery \| Small stones & Treatment A) | P(Recovery \| Small stones & Treatment B) |
| Large stones | P(Recovery \| Large stones & Treatment A) | P(Recovery \| Large stones & Treatment B) |
| Everybody | P(Recovery \| Treatment A) | P(Recovery \| Treatment B) |

Our paradoxical result is the following:

P(Recovery | Small stones & Treatment A) > P(Recovery | Small stones & Treatment B)

P(Recovery | Large stones & Treatment A) > P(Recovery | Large stones & Treatment B)

P(Recovery | Treatment A) < P(Recovery | Treatment B)

But this is no paradox at all! There is no law of probability that tells us:

If P(A | B & C) > P(A | B & ~C)

and P(A | ~B & C) > P(A | ~B & ~C),

then P(A | C) > P(A | ~C)

There *is*, however, a law of probability that tells us:

If P(A & B | C) > P(A & B | ~C)

and P(A & ~B | C) > P(A & ~B | ~C),

then P(A | C) > P(A | ~C)
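To see why the joint version is a law while the conditional version is not, it may help to write P(A | C) out both ways. This is just standard probability theory, sketched in LaTeX:

```latex
\begin{align*}
P(A \mid C) &= P(A \,\&\, B \mid C) + P(A \,\&\, \lnot B \mid C) \\
P(A \mid C) &= P(A \mid B \,\&\, C)\, P(B \mid C)
             + P(A \mid \lnot B \,\&\, C)\, P(\lnot B \mid C)
\end{align*}
```

In the first line, P(A | C) is a plain sum of the two joint terms, so if both terms are larger given C than given ~C, the sum is too. In the second line, the conditional terms are weighted by P(B | C) and P(~B | C), and nothing forces those weights to match P(B | ~C) and P(~B | ~C). In the kidney stone data the weights differ a lot: 87 of the 350 Treatment A patients had small stones, compared with 270 of the 350 Treatment B patients.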

And if we represented the data in terms of these *joint* probabilities (probability of recovery AND small stones given Treatment A, for example) instead of *conditional* probabilities, we’d find that the probabilities add up nicely and the paradox vanishes.

| | Treatment A | Treatment B |
|---|---|---|
| Small stones | 23% (81/350) | 67% (234/350) |
| Large stones | 55% (192/350) | 16% (55/350) |
| All patients | 78% (273/350) | 83% (289/350) |

It is in this sense that the paradox arises from improper treatment of conditional probabilities as joint probabilities.
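Here is a rough Python sketch of that joint-probability table, assuming the same counts as before (the helper names `joint` and `overall` are mine). It checks that the joint terms really do add up to the overall recovery rates.

```python
from fractions import Fraction

TREATED = 350  # patients per treatment arm, from the tables above

# Recovered patients in each (treatment, stone size) cell.
recovered = {
    ("A", "small"): 81,
    ("A", "large"): 192,
    ("B", "small"): 234,
    ("B", "large"): 55,
}

def joint(treatment, size):
    """P(Recovery & size | treatment): cell recoveries over the whole arm."""
    return Fraction(recovered[(treatment, size)], TREATED)

def overall(treatment):
    """P(Recovery | treatment), obtained by summing the joint terms."""
    return joint(treatment, "small") + joint(treatment, "large")

for t in ("A", "B"):
    print(
        f"Treatment {t}: "
        f"small {float(joint(t, 'small')):.0%} + "
        f"large {float(joint(t, 'large')):.0%} = "
        f"{float(overall(t)):.0%}"
    )
```

Notice that in joint terms Treatment A wins the large-stone row but loses the small-stone row, so the premises of the joint law aren't both satisfied and nothing forces A to win overall.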

*******

This tells us *why* we got a paradoxical result, but isn’t quite fully satisfying. We still want to know, for instance, whether we should give somebody with small kidney stones Treatment A or Treatment B.

The fully satisfying answer comes from causal modeling. The causal diagram we will draw will have three variables, A (which is true if you receive Treatment A and false if you receive Treatment B), S (which is true if you have small kidney stones and false if you have large), and R (which is true if you recovered).

Our causal diagram should express that there is some causal relationship between the treatment you receive (A) and whether you recover (R). It should also show a causal relationship between the size of your kidney stone (S) and your recovery, as the data indicates that larger kidney stones make recovery less likely.

And finally, it should show a causal arrow from the size of the kidney stone to the treatment that you receive. This final arrow comes from the fact that more people with large stones were given Treatment A than Treatment B, and more people with small stones were given Treatment B than Treatment A.

This gives us the following diagram, with arrows from S to A, from S to R, and from A to R:

The values of P(S), P(A | S), and P(A | ~S) were calculated from the table we started with. For instance, the value of P(S) was calculated by adding up all the patients that had small kidney stones, and dividing by the total number of patients in the study: (87 + 270) / 700.
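As a minimal sketch, these three values can be recomputed from the treated-patient counts in the table. The variable names here are my own.

```python
from fractions import Fraction

# Number of patients treated in each (treatment, stone size) cell.
treated = {
    ("A", "small"): 87,
    ("A", "large"): 263,
    ("B", "small"): 270,
    ("B", "large"): 80,
}

total = sum(treated.values())
small = treated[("A", "small")] + treated[("B", "small")]
large = total - small

p_s = Fraction(small, total)                               # P(S)
p_a_given_s = Fraction(treated[("A", "small")], small)     # P(A | S)
p_a_given_not_s = Fraction(treated[("A", "large")], large) # P(A | ~S)

print(f"P(S)      = {float(p_s):.2f}")
print(f"P(A | S)  = {float(p_a_given_s):.2f}")
print(f"P(A | ~S) = {float(p_a_given_not_s):.2f}")
```

The gap between P(A | S) and P(A | ~S) is what justifies the arrow from S to A: patients with large stones were much more likely to be given Treatment A.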

Now, we want to know if P(R | A) > P(R | ~A) (that is, if recovery is more likely given Treatment A than given Treatment B).

If we *just* look at the conditional probabilities given by our first table, then we are taking into account *two* sources of dependency between treatment type and recovery. The first is the direct causal relationship, which is what we *want* to know. The second is the spurious correlation between A and R as a result of the common cause S.

Here the red arrows represent “paths of dependency” between A and R. For example, since those with small stones are more likely to get Treatment B, and are *also* more likely to recover, this will result in a spurious correlation between Treatment B and recovery.

So how do we determine the actual *non-spurious* causal dependency between A and R?

Easy!

If we *observe* the value of S, then we screen off the dependency between A and R that runs through S! This removes the spurious correlation, and leaves us with just the causal relationship that we want.

What this *means* is that the true nature of the relationship between treatment type and recovery can be determined by breaking down the data in terms of kidney stone size. Looking back at our original data:

| Recovery rate | Treatment A | Treatment B |
|---|---|---|
| Small stones | 93% (81/87) | 87% (234/270) |
| Large stones | 73% (192/263) | 69% (55/80) |
| All patients | 78% (273/350) | 83% (289/350) |

This corresponds to looking at the data divided up by size of stones, and not the data on all patients. And since for each stone size category, Treatment A was more effective than Treatment B, this is the true causal relationship between A and R!

*******

A nice feature of the framework of causal modeling is that there are often multiple ways to think about the same problem. So instead of thinking about this in terms of *screening off* the spurious correlation through observation of S, we could also think in terms of causal interventions.

In other words, to determine the true nature of the causal relationship between A and R, we want to *intervene* on A, and see what happens to R.

This corresponds to calculating if P(R | do A) > P(R | do ~A), rather than if P(R | A) > P(R | ~A).

Intervention on A gives us the new diagram, in which the arrow from S to A has been removed:

With this diagram, we can calculate:

P(R | do A)

= P(R & S | do A) + P(R & ~S | do A)

= P(S) * P(R | A & S) + P(~S) * P(R | A & ~S)

= 51% * 93% + 49% * 73%

= 83.2%

And…

P(R | do ~A)

= P(R & S | do ~A) + P(R & ~S | do ~A)

= P(S) * P(R | ~A & S) + P(~S) * P(R | ~A & ~S)

= 51% * 87% + 49% * 69%

= 78.2%

Now not only do we see that Treatment A is better than Treatment B, but we also have an estimate of *how much* better: it improves recovery chances by about 5 percentage points!
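The same adjustment can be sketched in Python. It assumes, as the causal diagram does, that stone size is the only common cause of treatment and recovery; the function names are hypothetical.

```python
from fractions import Fraction

# (recovered, treated) counts from the kidney stone tables.
counts = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}
SIZES = ("small", "large")

def p_size(size):
    """P(stone size), pooled over both treatments."""
    in_size = sum(n for (_, s), (_, n) in counts.items() if s == size)
    total = sum(n for _, n in counts.values())
    return Fraction(in_size, total)

def p_recover_given(treatment, size):
    """P(R | treatment, size), read straight off the stratified table."""
    recovered, treated = counts[(treatment, size)]
    return Fraction(recovered, treated)

def p_recover_do(treatment):
    """P(R | do(treatment)): weight each stratum by P(size), not P(size | treatment)."""
    return sum(p_size(s) * p_recover_given(treatment, s) for s in SIZES)

do_a = p_recover_do("A")
do_b = p_recover_do("B")
print(f"P(R | do A)  = {float(do_a):.3f}")
print(f"P(R | do ~A) = {float(do_b):.3f}")
print(f"difference   = {float(do_a - do_b):.3f}")
```

Because this version uses the unrounded fractions, its results can differ in the last digit from the hand calculation above, which plugs in rounded percentages. The conclusion is the same: roughly 83% versus 78%.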

Next, we’re going to go kind of crazy with Simpson’s paradox and show how to construct an infinite chain of Simpson’s paradoxes.

Fantastic paper on all of this here.
