Kant’s attempt to save metaphysics and causality from Hume

TL;DR

  • Hume sort of wrecked metaphysics. This inspired Kant to try and save it.
  • Hume thought that terms were only meaningful insofar as they were derived from experience.
  • We never actually experience necessary connections between events, we just see correlations. So Hume thought that the idea of causality as necessary connection is empty and confused, and that all our idea of causality really amounts to is correlation.
  • Kant didn’t like this. He wanted to PROTECT causality. But how??
  • Kant said that metaphysical knowledge was both a priori and substantive, and justified this by describing these things called pure intuitions and pure concepts.
  • Intuitions are representations of things (like sense perceptions). Pure intuitions are the necessary preconditions for us to represent things at all.
  • Concepts are classifications of representations (like “red”). Pure concepts are the necessary preconditions underlying all classifications of representations.
  • There are two pure intuitions (space and time) and twelve pure concepts (one of which is causality).
  • We get substantive a priori knowledge by referring to pure intuitions (mathematics) or pure concepts (laws of nature, metaphysics).
  • Yay! We saved metaphysics!

 

(Okay, now on to the actual essay. This was not originally written for this blog, which is why it’s significantly more formal than my usual fare.)

 

***

 

David Hume’s An Enquiry Concerning Human Understanding stands out as a profound and original challenge to the validity of metaphysical knowledge. Part of the historical legacy of this work is its effect on Kant, who describes Hume as being responsible for “[interrupting] my dogmatic slumber and [giving] my investigations in the field of speculative philosophy a completely different direction.” Despite the great inspiration that Kant took from Hume’s writing, their thinking on many matters is diametrically opposed. A prime example of this is their views on causality.

Hume’s take on causation is famously unintuitive. He gives a deflationary account of the concept, arguing that the traditional conception lacks a solid epistemic foundation and must be replaced by mere correlation. To understand this conclusion, we need to back up and consider the goal and methodology of the Enquiry.

He starts with an appeal to the importance of careful and accurate reasoning in all areas of human life, and especially in philosophy. In a beautiful bit of prose, he warns against the danger of being overwhelmed by popular superstition and religious prejudice when casting one’s mind towards the especially difficult and abstruse questions of metaphysics.

But this obscurity in the profound and abstract philosophy is objected to, not only as painful and fatiguing, but as the inevitable source of uncertainty and error. Here indeed lies the most just and most plausible objection against a considerable part of metaphysics, that they are not properly a science, but arise either from the fruitless efforts of human vanity, which would penetrate into subjects utterly inaccessible to the understanding, or from the craft of popular superstitions, which, being unable to defend themselves on fair ground, raise these entangling brambles to cover and protect their weakness. Chased from the open country, these robbers fly into the forest, and lie in wait to break in upon every unguarded avenue of the mind, and overwhelm it with religious fears and prejudices. The stoutest antagonist, if he remit his watch a moment, is oppressed. And many, through cowardice and folly, open the gates to the enemies, and willingly receive them with reverence and submission, as their legal sovereigns.

In less poetic terms, Hume’s worry about metaphysics is that its difficulty and abstruseness makes its practitioners vulnerable to flawed reasoning. Even worse, the difficulty serves to make the subject all the more tempting for “each adventurous genius[, who] will still leap at the arduous prize and find himself stimulated, rather than discouraged by the failures of his predecessors, while he hopes that the glory of achieving so hard an adventure is reserved for him alone.”

Thus, says Hume, the only solution is “to free learning at once from these abstruse questions [by inquiring] seriously into the nature of human understanding and [showing], from an exact analysis of its powers and capacity, that it is by no means fitted for such remote and abstruse questions.”

Here we get the first major divergence between Kant and Hume. Kant doesn’t share Hume’s eagerness to banish metaphysics. His Prolegomena to Any Future Metaphysics and Critique of Pure Reason are attempts to find it a safe haven from Hume’s attacks. However, while Kant does not share Hume’s temperament on this point, he does take Hume’s methodology very seriously. He states in the preface to the Prolegomena that “since the origin of metaphysics as far as history reaches, nothing has ever happened which could have been more decisive to its fate than the attack made upon it by David Hume.” Many of the principles which Hume derives, Kant agrees with wholeheartedly, making the task of shielding metaphysics even harder for him.

With that understanding of Hume’s methodology in mind, let’s look at how he argues for his view of causality. We’ll start with a distinction that is central to Hume’s philosophy: that between ideas and impressions. The difference between the memory of a sensation and the sensation itself is a good example. While the memory may mimic or copy the sensation, it can never reach its full force and vivacity. In general, Hume suggests that our experiences fall into two distinct categories, separated by a qualitative gap in liveliness. The more lively category he calls impressions, which includes sensory perceptions like the smell of a rose or the taste of wine, as well as internal experiences like the feeling of love or anger. The less lively category he refers to as thoughts or ideas. These include memories of impressions as well as imagined scenes, concepts, and abstract thoughts. 

With this distinction in hand, Hume proposes his first limit on the human mind. He claims that no matter how creative or original you are, all of your thoughts are the product of “compounding, transposing, augmenting, or diminishing the materials afforded us by the senses and experiences.” This is the copy principle: all ideas are copies of impressions, or compositions of simpler ideas that are in turn copies of impressions.

Hume turns this observation about the nature of our mind into a powerful criterion of meaning. “When we entertain any suspicion that a philosophical term is employed without any meaning or idea (as is but too frequent), we need but enquire, From what impression is that supposed idea derived? And if it be impossible to assign any, this will serve to confirm our suspicion.”

This criterion turns out to be just the tool Hume needs in order to establish his conclusion. He examines the traditional conception of causation as a necessary connection between events, searches for the impressions that might correspond to this idea, and, failing to find anything satisfactory, declares that “we have no idea of connection or power at all and that these words are absolutely without any meaning when employed in either philosophical reasonings or common life.” His primary argument here is that all of our observations are of mere correlation, and that we can never actually observe a necessary connection.

Interestingly, at this point he refrains from recommending that we throw out the term causation. Instead he proposes a redefinition of the term, suggesting a more subtle interpretation of his criterion of meaning. Rather than eliminating the concept altogether upon discovering it to have no satisfactory basis in experience, he reconceives it in terms of the impressions from which it is actually formed. In particular, he argues that our idea of causation is really based on “the connection which we feel in the mind, this customary transition of the imagination from one object to its usual attendant.”

Here Hume is saying that humans have a rationally unjustifiable habit of thought where, when we repeatedly observe one type of event followed by another, we begin to call the first a cause and the second its effect, and we expect that future instances of the cause will be followed by future instances of the effect. Causation, then, is just this constant conjunction between events, and our mind’s habit of projecting the conjunction into the future. We can summarize all of this in a few lines:

Hume’s denial of the traditional concept of causation

  1. Ideas are always either copies of impressions or composites of simpler ideas that are copies of impressions.
  2. The traditional conception of causation is neither of these.
  3. So we have no idea of the traditional conception of causation.

Hume’s reconceptualization of causation

  1. An idea is the idea of the impression that it is a copy of.
  2. The idea of causation is copied from the impression of constant conjunction.
  3. So the idea of causation is just the idea of constant conjunction.

There we have Hume’s line of reasoning, which provoked Kant to examine the foundations of metaphysics anew. Kant wanted to resist Hume’s dismissal of the traditional conception of causation, while accepting that our sense perceptions reveal no necessary connections to us. Thus his strategy was to deny the Copy Principle and give an account of how we can have substantive knowledge that is not ultimately traceable to impressions. He does this by introducing the analytic/synthetic distinction and the notion of a priori synthetic knowledge.

Kant’s original definition of analytic judgments is that they “express nothing in the predicate but what has already been actually thought in the concept of the subject.” This suggests that the truth value of an analytic judgment is determined purely by the meanings of the concepts in use. A standard example of this is “All bachelors are unmarried.” The truth of this statement follows immediately just by understanding what it means, as the concept of bachelor already contains the predicate unmarried. Synthetic judgments, on the other hand, are not fixed in truth value merely by the meanings of the concepts in use. These judgments amplify our knowledge and bring us to genuinely new conclusions about our concepts. An example: “The President is ornery.” This certainly doesn’t follow by definition; you’d have to go out and watch the news to realize its truth.

We can now put the challenge to metaphysics slightly differently. Metaphysics purports to be discovering truths that are both necessary (and therefore a priori) as well as substantive (adding to our concepts, and thus synthetic). But this category of synthetic a priori judgments seems a bit mysterious. Evidently, the truth values of such judgments can be determined without referring to experience, but can’t be determined merely by the meanings of the relevant concepts. So apparently something further, beyond the meanings of concepts, is required in order to make a synthetic a priori judgment. What is this thing?

Kant’s answer is that the further requirement is pure intuition and pure concepts. These terms need explanation.

Pure Intuitions

For Kant, an intuition is a direct, immediate representation of an object. An obvious example of this is sense perception; looking at a cup gives you a direct and immediate representation of an object, namely, the cup. But pure intuitions must be independent of experience, or else judgments based on them would not be a priori. In other words, the only type of intuition that could possibly be a priori is one that is present in all possible perceptions, so that its existence is not contingent upon what perceptions are being had. Kant claims that this is only possible if pure intuitions represent the necessary preconditions for the possibility of perception.

What are these necessary preconditions? Kant famously claimed that the only two are space and time. This implies that all of our perceptions have spatiotemporal features, and indeed that perception is only possible in virtue of the existence of space and time. It also implies, according to Kant, that space and time don’t exist outside of our minds!  Consider that pure intuitions exist equally in all possible perceptions and thus are independent of the actual properties of external objects. This independence suggests that rather than being objective features of the external world, space and time are structural features of our minds that frame all our experiences.

This is why Kant’s philosophy is a species of idealism. Space and time get turned into features of the mind, and correspondingly appearances in space and time become internal as well. Kant forcefully argues that this view does not make space and time into illusions, saying that without his doctrine “it would be absolutely impossible to determine whether the intuitions of space and time, which we borrow from no experience, but which still lie in our representation a priori, are not mere phantasms of our brain.”

The pure intuitions of space and time play an important role in Kant’s philosophy of mathematics: they serve to justify the synthetic a priori status of geometry and arithmetic. When we judge that the sum of the interior angles of a triangle is 180°, for example, we do so not purely by examining the concepts triangle, sum, and angle. We also need to consult the pure intuition of space! And similarly, our affirmations of arithmetic truths rely upon the pure intuition of time for their validity.

Pure Concepts

Pure intuition is only one part of the story. We don’t just perceive the world, we also think about our perceptions. In Kant’s words, “Thoughts without content are empty; intuitions without concepts are blind. […] The understanding cannot intuit anything, and the senses cannot think anything. Only from their union can cognition arise.” As pure intuitions are to perceptions, pure concepts are to thought. Pure concepts are necessary for our empirical judgments, and without them we could not make sense of perception. It is this category in which causality falls.

Kant’s argument for this is that causality is a necessary condition for the judgment that events occur in a temporal order. He starts by observing that we don’t directly perceive time. For instance, we never have a perception of one event being before another; we just perceive one and, separately, the other. So to conclude that the first preceded the second requires something beyond perception, that is, a concept connecting them.

He argues that this connection must be necessary: “For this objective relation to be cognized as determinate, the relation between the two states must be thought as being such that it determines as necessary which of the states must be placed before and which after.” And as we’ve seen, the only way to get a necessary connection between perceptions is through a pure concept. The required pure concept is the relation of cause and effect: “the cause is what determines the effect in time, and determines it as the consequence.” So starting from the observation that we judge events to occur in a temporal order, Kant concludes that we must have a pure concept of cause and effect.

What about particular causal rules, like that striking a match produces a flame? Such rules are not derived solely from experience, but also from the pure concept of causality, on which their existence depends. It is the presence of the pure concept that allows the inference of these particular rules from experience, even though they postulate a necessary connection.

Now we can see how different Kant and Hume’s conceptions of causality are. While Hume thought that the traditional concept of causality as a necessary connection was unrescuable and extraneous to our perceptions, Kant sees it as a bedrock principle of experience that is necessary for us to be able to make sense of our perceptions at all. Kant rejects Hume’s definition of cause in terms of constant conjunction on the grounds that it “cannot be reconciled with the scientific a priori cognitions that we actually have.”

Despite this great gulf between the two philosophers’ conceptions of causality, there are some similarities. As we saw above, Kant agrees wholeheartedly with Hume that perception alone is insufficient for concluding that there is a necessary connection between events. He also agrees that a purely analytic approach is insufficient. Since Kant sees pure intuitions and pure concepts as features of the mind, not the external world, both philosophers deny that causation is an objective relationship between things in themselves (as opposed to perceptions of things). Of course, Kant would deny that this makes causality an illusion, just as he denied that space and time are made illusory by his philosophy.

Of course, it’s impossible to know to what extent the two philosophers would have actually agreed, had Hume been able to read Kant’s responses to his works. Would he have been convinced that synthetic a priori judgments really exist? If so, would he accept Kant’s pure intuitions and pure concepts? I suspect that at the crux of their disagreement would be Kant’s claim that math is synthetic a priori. While Hume never explicitly proclaims math’s analyticity (he didn’t have the term, after all), treating math as analytic seems more in line with his view of algebra and arithmetic as purely concerning the way that ideas relate to one another. It is also more in line with the axiomatic approach to mathematics familiar to Hume, in which one defines a set of axioms from which all truths about the mathematical concepts involved necessarily follow.

If Hume did maintain math’s analyticity, then Kant’s arguments about the importance of synthetic a priori knowledge would probably hold much less sway for him, and would largely amount to an appeal to the validity of metaphysical knowledge, which Hume already doubted. Hume also would likely want to resist Kant’s idealism; in Section XII of the Enquiry he mocks philosophers that doubt the connection between the objects of our senses and external objects, saying that if you “deprive matter of all its intelligible qualities, both primary and secondary, you in a manner annihilate it and leave only a certain unknown, inexplicable something as the cause of our perceptions – a notion so imperfect that no skeptic will think it worthwhile to contend against it.”

What do I find conceptually puzzling?

There are lots of things that I don’t know, like, say, what the birth rate in Sweden is or what the effect of poverty on IQ is. There are also lots of things that I find really confusing and hard to understand, like quantum field theory and monetary policy. There’s also a special category of things that I find conceptually puzzling. These things aren’t difficult to grasp because the facts about them are difficult to understand or require learning complicated jargon. Instead, they’re difficult to grasp because I suspect that I’m confused about the concepts in use.

This is a much deeper level of confusion. It can’t be adjudicated by just reading lots of facts about the subject matter. It requires philosophical reflection on the nature of these concepts, which can sometimes leave me totally confused about everything and grasping for the solid ground of mere factual ignorance.

As such, it feels like a big deal when something I’ve been conceptually puzzled about becomes clear. I want to compile a list for future reference of things that I’m currently conceptually puzzled about and things that I’ve become un-puzzled about. (This is not a complete list, but I believe it touches on the major themes.)

Things I’m conceptually puzzled about

What is the relationship between consciousness and physics?

I’ve written about this here.

Essentially, at this point every available viewpoint on consciousness seems wrong to me.

Eliminativism amounts to a denial of pretty much the only thing that we can be sure can’t be denied – that we are having conscious experiences. Physicalism entails the claim that facts about conscious experience can be derived from laws of physics, which is wrong as a matter of logic.

Dualism entails that the laws of physics by themselves cannot account for the behavior of the matter in our brains, which is wrong. And epiphenomenalism entails that our beliefs about our own conscious experience are almost certainly wrong, and are no better representations of our actual conscious experiences than random chance.

How do we make sense of decision theory if we deny libertarian free will?

Written about this here and here.

Decision theory is ultimately about finding the decision D that maximizes expected utility EU(D). But to do this calculation, we have to specify the set of possible decisions we are searching over.


Make this set too large, and you end up getting fantastical and impossible results (like that the optimal decision is to snap your fingers and make the world into a utopia). Make it too small, and you end up getting underwhelming results (in the extreme case, you just get that the optimal decision is to do exactly what you are going to do, since this is the only thing you can do in a strictly deterministic world).

We want to find a nice middle ground between these two – a boundary where we can say “inside here the things that are actually possible for us to do, and outside are those that are not.” But any principled distinction between what’s in the set and what’s not must be based on some conception of some actions being “truly possible” to us, and others being truly impossible. I don’t know how to make this distinction in the absence of a robust conception of libertarian free will.

Are there objectively right choices of priors?

I’ve written about this here.

If you say no, then there are no objectively right answers to questions like “What should I believe given the evidence I have?” And if you say yes, then you have to deal with thought experiments like the cube problem, where any choice of priors looks arbitrary and unjustifiable.

(If you are going to be handed a cube, and all you know is that it has a volume less than 1 cm³, then setting maximum entropy priors over volumes gives different answers than setting maximum entropy priors over side areas or side lengths. This means that what qualifies as “maximally uncertain” depends on whether we frame our reasoning in terms of side length, areas, or cube volume. Other approaches besides MaxEnt have similar problems of concept dependence.)
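To make the concept dependence concrete, here's a little numerical sketch of my own (a toy illustration, nothing more): a “maximally uncertain” uniform prior over side length and a “maximally uncertain” uniform prior over volume assign different probabilities to the very same event.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Prior 1: "maximally uncertain" about side length -> uniform on (0, 1) cm
sides = rng.uniform(0, 1, n)
p_from_sides = np.mean(sides ** 3 < 0.5)      # implied P(volume < 0.5 cm^3), ~0.79

# Prior 2: "maximally uncertain" about volume -> uniform on (0, 1) cm^3
volumes = rng.uniform(0, 1, n)
p_from_volumes = np.mean(volumes < 0.5)       # P(volume < 0.5 cm^3), ~0.50

print(p_from_sides, p_from_volumes)

Same information about the cube, two “uninformative” priors, two different answers.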

How should we deal with infinities in decision theory?

I wrote about this here, here, here, and here.

The basic problem is that expected utility theory does great at delivering reasonable answers when the rewards are finite, but becomes wacky when the rewards become infinite. There are a huge number of examples of this. For instance, in the St. Petersburg paradox, you are given the option to play a game with an infinite expected payout, suggesting that you should buy into the game no matter how high the cost. Following that advice, you end up making obviously irrational choices, such as spending $1,000,000 on the hope that a fair coin will land heads 20 times in a row. Variants of this involve the inability of EU theory to distinguish between obviously better and worse bets that have infinite expected value.
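As a quick illustration of the divergence (my own toy calculation, using the standard version of the game where the pot doubles each round and the game ends on the first tails): each possible round contributes exactly one dollar of expected value, so the truncated expected payout grows without bound.

# Expected payout of the St. Petersburg game truncated after n rounds:
# the game ends on round k with probability (1/2)**k, paying 2**k dollars.
def truncated_expected_payout(n):
    return sum((0.5 ** k) * (2 ** k) for k in range(1, n + 1))

for n in (10, 100, 1000):
    print(n, truncated_expected_payout(n))    # 10.0, 100.0, 1000.0 -> diverges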

And Pascal’s mugging is an even worse case. Roughly speaking, a person comes up to you and threatens you with infinite torture if you don’t submit to them and give them 20 dollars. Now, the probability that this threat is credible is surely tiny. But it is non-zero! (as long as you don’t think it is literally logically impossible for this threat to come true)

An infinite penalty times a finite probability is still an infinite expected penalty. So we stand to gain an infinite expected utility by just handing over the 20 dollars. This seems ridiculous, but I don’t know any reasonable formalization of decision theory that allows me to refute it.

Is causality fundamental?

Causality has been nicely formalized by Pearl’s probabilistic graphical models. This is a simple extension of probability theory, out of which causality and counterfactuals naturally fall.
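To give a flavor of what I mean, here's a tiny sketch in plain Python (my own toy example, not drawn from Pearl) of the sort of thing the framework buys you: conditioning on an observation and intervening on a variable are different operations, and the difference falls right out of the structural model.

import random
random.seed(0)

# Structural model: rain -> wet_ground <- sprinkler
def sample(do_sprinkler=None):
    rain = random.random() < 0.3
    sprinkler = (random.random() < 0.5) if do_sprinkler is None else do_sprinkler
    wet_ground = rain or sprinkler
    return rain, sprinkler, wet_ground

N = 200_000

# Observing: P(rain | wet ground) rises above the base rate of 0.3
obs = [s for s in (sample() for _ in range(N)) if s[2]]
print(sum(s[0] for s in obs) / len(obs))          # ~0.46

# Intervening: P(rain | do(sprinkler on)) stays at the base rate,
# because forcing the sprinkler on tells us nothing about the weather
do_on = [sample(do_sprinkler=True) for _ in range(N)]
print(sum(s[0] for s in do_on) / len(do_on))      # ~0.30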

One can use this framework to represent the states of fundamental particles and how they change over time and interact with one another. What I’m confused about is that in some ways of looking at it, the causal relations appear to be useful but un-fundamental constructs for the sake of easing calculations. In other ways of looking at it, causal relations are necessarily built into the structure of the world, and we can go out and empirically discover them. I don’t know which is right. (Sorry for the vagueness in this one – it’s confusing enough to me that I have trouble even precisely phrasing the dilemma).

How should we deal with the apparent dependence of inductive reasoning upon our choices of concepts?

I’ve written about this here. Beyond just the problem of concept-dependence in our choices of priors, there’s also the problem presented by the grue/bleen thought experiment.

This thought experiment proposes two new concepts: grue (= the set of things that are either green before 2100 or blue after 2100) and bleen (the inverse of grue). It then shows that if we reasoned in terms of grue and bleen, standard induction would have us concluding that all emeralds will suddenly turn blue after 2100. (We repeatedly observed them being grue before 2100, so we should conclude that they will be grue after 2100.)

In other words, choose the wrong concepts and induction breaks down. This is really disturbing – choices of concepts should be merely pragmatic matters! They shouldn’t function as fatal epistemic handicaps. And given that they appear to, we need to develop some criterion we can use to determine what concepts are good and what concepts are bad.

The trouble with this is that the only proposals I’ve seen for such a criterion reference the idea of concepts that “carve reality at its joints”; in other words, the world is composed of green and blue things, not grue and bleen things, so we should use the former rather than the latter. But this relies on the outcome of our inductive process to draw conclusions about the starting step on which this outcome depends!

I don’t know how to cash out “good choices of concepts” without ultimately reasoning circularly. I also don’t even know how to make sense of the idea of concepts being better or worse for more than merely pragmatic reasons.

How should we reason about self defeating beliefs?

The classic self-defeating belief is “This statement is a lie.” If you believe it, then you are compelled to conclude that it is false and so to disbelieve it, undermining your original belief. Broadly speaking, self-defeating beliefs are those that undermine the justifications for belief in them.

Here’s an example that might actually apply in the real world: Black holes glow. The process of emission is known as Hawking radiation. In principle, any configuration of particles with a mass less than the black hole can be emitted from it. Larger configurations are less likely to be emitted, but even configurations such as a human brain have a non-zero probability of being emitted. Henceforth, we will call such configurations black hole brains.

Now, imagine discovering some cosmological evidence that the era in which life can naturally arise on planets circling stars is finite, and that after this era there will be an infinite stretch of time during which all that exists are black holes and their radiation. In such a universe, the expected number of black hole brains produced is infinite (a tiny finite probability multiplied by an infinite stretch of time), while the expected number of “ordinary” brains produced is finite (assuming a finite spatial extent as well).

What this means is that discovering this cosmological evidence should give you an extremely strong boost in credence that you are a black hole brain. (Simply because most brains in your exact situation are black hole brains.) But most black hole brains have completely unreliable beliefs about their environment! They are produced by a stochastic process which cares nothing for producing brains with reliable beliefs. So if you believe that you are a black hole brain, then you should suddenly doubt all of your experiences and beliefs. In particular, you have no reason to think that the cosmological evidence you received was veridical at all!

I don’t know how to deal with this. It seems perfectly possible to find evidence for a scenario that suggests that we are black hole brains (I’d say that we have already found such evidence, multiple times). But then it seems we have no way to rationally respond to this evidence! In fact, if we do a naive application of Bayes’ theorem here, we find that the probability of receiving any evidence in support of black hole brains is 0!

So we have a few options. First, we could rule out any possible skeptical scenarios like black hole brains, as well as anything that could provide any amount of evidence for them (no matter how tiny). Or we could accept the possibility of such scenarios but face paralysis upon actually encountering evidence for them! Both of these seem clearly wrong, but I don’t know what else to do.

How should we reason about our own existence and indexical statements in general?

This is called anthropic reasoning. I haven’t written about it on this blog, but expect future posts on it.

A thought experiment: imagine a murderous psychopath who has decided to go on an unusual rampage. He will start by abducting one random person. He rolls a pair of dice, and kills the person if they land snake eyes (1, 1). If not, he sets them free and hunts down ten new people. Once again, he rolls his pair of dice. If he gets snake eyes he kills all ten. Otherwise he frees them and kidnaps 100 new people. On and on until he eventually gets snake eyes, at which point his murder spree ends.

Now, you wake up and find that you have been abducted. You don’t know how many others have been abducted alongside you. The murderer is about to roll the dice. What is your chance of survival?

Your first thought might be that your chance of death is just the chance of both dice landing 1: 1/36. But think instead about the proportion of all people he ever abducts who end up dying. This value ends up being roughly 90%! So once you condition upon the information that you have been captured, you end up being much more pessimistic about your chance of survival.
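Here's a quick check of that 90% figure (my own back-of-the-envelope script): if the spree ends on round k, the group sizes were 1, 10, …, 10^(k-1), and only the final group dies.

from fractions import Fraction

def fraction_killed(k):
    # Fraction of all abductees who die if snake eyes first comes up on round k
    total_abducted = sum(10 ** i for i in range(k))   # 1 + 10 + ... + 10**(k-1)
    killed = 10 ** (k - 1)                            # only the last group dies
    return Fraction(killed, total_abducted)

for k in (1, 2, 3, 10):
    print(k, float(fraction_killed(k)))
# 1 -> 1.0, 2 -> 0.909..., 3 -> 0.9009..., 10 -> 0.90000...

So no matter when the spree happens to end (past the first round), roughly 90% of everyone ever abducted dies.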

But at the same time, it seems really wrong to be watching the two dice tumble and internally thinking that there is a 90% chance that they land snake eyes. It’s as if you’re imagining that there’s some weird anthropic “force” pushing the dice towards snake eyes. There’s way more to say about this, but I’ll leave it for future posts.

Things I’ve become un-puzzled about

Newcomb’s problem – one box or two box?

To almost everyone, it is perfectly clear and obvious what should be done. The difficulty is that these people seem to divide almost evenly on the problem, with large numbers thinking that the opposing half is just being silly.

– Nozick, 1969

I’ve spent months and months being hopelessly puzzled about Newcomb’s problem. I now am convinced that there’s an unambiguous right answer, which is to take the one box. I wrote up a dialogue here explaining the justification for this choice.

In a few words, you should one-box because one-boxing makes it nearly certain that the simulation of you run by the predictor also one-boxed, thus making it nearly certain that you will get 1 million dollars. The dependence between your action and the simulation is not an ordinary causal dependence, nor even a spurious correlation – it is a logical dependence arising from the shared input-output structure. It is the same type of dependence that exists in the clone prisoner’s dilemma, where you can defect or cooperate with an individual you are assured is identical to you in every single way. When you take into account this logical dependence (also called subjunctive dependence), the answer is unambiguous: one-boxing is the way to go.
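For concreteness, here's a rough sketch of the expected-value comparison once the dependence is taken into account. The 99% figure for how reliably the simulation matches your actual choice is just an illustrative assumption of mine.

# Illustrative payoffs under subjunctive dependence.
# Assumption: the predictor's simulation matches your actual choice 99% of the time.
accuracy = 0.99
opaque_box = 1_000_000    # filled iff the predictor predicted one-boxing
clear_box = 1_000

eu_one_box = accuracy * opaque_box
eu_two_box = (1 - accuracy) * opaque_box + clear_box

print(eu_one_box, eu_two_box)    # ~990,000 vs ~11,000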

Summing up:

Things I remain conceptually confused about:

  • Consciousness
  • Decision theory & free will
  • Objective priors
  • Infinities in decision theory
  • Fundamentality of causality
  • Dependence of induction on concept choice
  • Self-defeating beliefs
  • Anthropic reasoning

On existence

Epistemic status: This is a line of thought that I’m not fully on board with, but have been taking more seriously recently. I wouldn’t be surprised if I object to all of this down the line.

The question of whether or not a given thing exists is not an empty question or a question of mere semantics. It is a question which you can get empirical evidence for, and a question whose answer affects what you expect to observe in the world.

Before explaining this further, I want to draw an analogy between ontology and causation (and my attitudes towards them).

Early in my philosophical education, my attitude towards causality was sympathetic to Humean-style eliminativism, in which causality is a useful construct that isn’t reflected in the fundamental structure of the world. That is, I quickly ‘tossed out’ the notion of causality, comfortable just talking about the empirical regularities governed by our laws of physics.

Later, upon encountering some statisticians who exposed me to the way that causality is actually calculated in the real world, I began to feel that I had been overly hasty. In fact, it turns out that there is a perfectly rigorous and epistemically accessible formalization of causality, and I now feel that there is no need to toss it out after all.

Here’s an easy way of thinking about this: While the slogan “Correlation does not imply causality” is certainly true, the reverse (“Causality does not imply correlation”) is trickier. In fact, whenever you have a causal relationship between variables, you do end up expecting some observable correlations. So while you cannot deductively conclude a causal relationship from a merely correlational one, you can certainly get evidence for some causal models.
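A one-line illustration with made-up numbers: if X causally influences Y, the correlation shows up in the data.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)    # X causes Y (plus independent noise)

print(np.corrcoef(x, y)[0, 1])            # ~0.89, clearly nonzero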

This is just a peek into the world of statistical discovery of causal relationships – going further requires a lot more work. But that’s not necessary for my aim here. I just want to express the following parallel:

Rather than trying to set up a perfect set of necessary and sufficient conditions for application of the term ’cause’, we can just take a basic axiom that any account of causation must adhere to. Namely: Where there’s causation, there’s correlation.

And rather than trying to set up a perfect set of necessary and sufficient conditions for the term ‘existence’, we can just take a basic axiom that any account of existence must adhere to. Namely: If something affects the world, it exists.

This should seem trivially obvious. While there could conceivably be entities that exist without affecting anything, clearly any entity that has a causal impact on the world must exist.

The contrapositive of this axiom is that if something doesn’t exist, it does not affect the world.

Again, this is not a controversial statement. And importantly, it makes ontology amenable to scientific inquiry! Why? Because two worlds with different ontologies will have different repertoires of causes and effects. A world in which nothing exists is a world in which nothing affects anything – a dead, static region of nothingness. We can rule out this world on the basis of our basic empirical observation that stuff is happening.

This short argument attempts to show that ontology is a scientifically respectable concept, and not merely a matter of linguistic game-playing. Scientific theories implicitly assume particular ontologies by relying upon laws of nature which reference objects with causal powers. Fundamentally, evidence that reveals the impotence of these supposed causal powers serves as evidence against the ontological framework of such theories.

I think the temptation to wave off ontological questions as somehow disreputable and unscientific actually springs from the fundamentality of this concept. Ontology isn’t a minor add-on to our scientific theories done to appease the philosophers. Instead, it is built in from the ground floor. We can’t do science without implicitly making ontological assumptions. I think it’s better to make these assumptions explicit and debate about the fundamental principles by which we justify them than it is to do so invisibly, without further analysis.

Concepts we keep and concepts we toss out

Often when we think about philosophical concepts like identity, existence, possibility, and so on, we find ourselves confronted with numerous apparent paradoxes that require us to revise our initial intuitive conception. Sometimes, however, the revisions necessary to render the concept coherent end up looking so extreme as to make us prefer to just throw out the concept altogether.

An example: claims about identity are implicit in much of our reasoning (“I was thinking about this in the morning” implicitly presumes an identity between myself now and the person resembling me in my apartment this morning). But when we analyze our intuitive conception of identity, we find numerous incoherencies (e.g. through Sorites-style paradoxes in which objects retain their identity through arbitrarily small transformations, but then end up changing their identity upon the conjunction of these transformations anyway).

When faced with these incoherencies, we have a few options: first of all, we can decide to “toss out” the concept of identity (i.e. determine that the concept is too fundamentally paradoxical to be saved), or we can decide to keep it. If we keep it, then we are forced to bite some bullets (e.g. by revising the concept away from our intuitions to a more coherent neighboring concept, or by accepting the incoherencies).

In addition, keeping the concept does not mean thinking that the concept actually successfully applies to anything. For instance, one might keep the concept of free will (in that they have a well-defined personal conception of it), while denying that free will exists. This is the difference between saying “People don’t have free will, and that has consequences X, Y, and Z” and saying “I think that contradictions are so deeply embedded in the concept of free will that it’s fundamentally unsavable, and henceforth I’m not going to reason in terms of it.” I often hop back and forth between these positions, but I think they are really quite different.

One final way to describe this distinction: When faced with a statement like “X exists,” we have three choices: We can say that the statement is true, that it is false, or that it is not a proposition. This third category is what we would say about statements like “Arghleschmargle” or “Colorless green ideas sleep furiously”. While they are sentences that we can speak, they just aren’t the types of things that could be true or false. To throw out the concept of existence is to say that a statement like “X exists” is neither true nor false, and to keep it is to treat it as having a truth value.

For any given concept, I have a clear sense of whether I think it’s better to keep it or toss it out, and I imagine that others can do the same. Here’s a table of some common philosophical concepts and my personal response to each:

Keep

  • Causality
  • Existence
  • Justification
  • Free will
  • Time
  • Consciousness
  • Randomness
  • Meaning (of life)
  • Should (ethical)
  • Essences
  • Representation / Intentionality

Toss Out

  • Knowledge
  • Identity
  • Possibility
  • Objects
  • Forms
  • Purposes (in the teleological sense)
  • Beauty
Many of these I’m not sure about, and I imagine I could have my mind easily changed (e.g. identity, possibility, intentionality). Some I’ve even recently changed my mind about (causality, existence). And others I feel quite confident about (e.g. knowledge, randomness, justification).

I’m curious about how others would respond… What philosophical concepts do you lean towards keeping, and which concepts do you lean towards tossing out?

Explanation is asymmetric

We all regularly reason in terms of the concept of explanation, but rarely think hard about what exactly we mean by it. What constitutes a scientific explanation? In this post, I’ll point out some features of explanation that may not be immediately obvious.

Let’s start with one account of explanation that should seem intuitively plausible. This is the idea that to explain X to a person is to give that person some information I that would have allowed them to predict X.

For instance, suppose that Janae wants an explanation of why Ari is not pregnant. Once we tell Janae that Ari is a biological male, she is satisfied and feels that the lack of pregnancy has been explained. Why? Well, because had Janae known that Ari was a male, she would have been able to predict that Ari would not get pregnant.

Let’s call this the “predictive theory of explanation.” On this view, explanation and prediction go hand-in-hand. When somebody learns a fact that explains a phenomenon, they have also learned a fact that allows them to predict that phenomenon.

 To spell this out very explicitly, suppose that Janae’s state of knowledge at some initial time is expressed by

K1 = “Males cannot get pregnant.”

At this point, Janae clearly cannot conclude anything about whether Ari is pregnant. But now Janae learns a new piece of information, and her state of knowledge is updated to

K2 = “Ari is a male & males cannot get pregnant.”

Now Janae is warranted in adding the deduction

K’ = “Ari cannot get pregnant”

This suggests that the added information explains Ari’s non-pregnancy for the same reason that it allows the deduction of Ari’s non-pregnancy.

Now, let’s consider a problem with this view: the problem of relevance.

Suppose a man named John is not pregnant, and somebody explains this with the following two premises:

  1. People who take birth control pills almost certainly don’t get pregnant.
  2. John takes birth control pills regularly.

Now, these two premises do successfully predict that John will not get pregnant. But the fact that John takes birth control pills regularly gives no explanation at all of his lack of pregnancy. Naively applying the predictive theory of explanation gives the wrong answer here.

You might have also been suspicious of the predictive theory of explanation on the grounds that it relied on purely logical deduction and a binary conception of knowledge, not allowing us to accommodate the uncertainty inherent in scientific reasoning. We can fix this by saying something like the following:

What it is to explain X to somebody that knows K is to give them information I such that

(1) P(X | K) is small, and
(2) P(X | K, I) is large.

“Small” and “large” here are intentionally vague; it wouldn’t make sense to draw a precise line in the probabilities.

The idea here is that explanations are good insofar as they (1) make their explanandum sufficiently likely, where (2) it would be insufficiently likely without them.

We can think of this as a correlational account of explanation. It attempts to root explanations in sufficiently strong correlations.

First of all, we can notice that this doesn’t suffer from a problem with irrelevant information. We can find relevance relationships by looking for independencies between variables. So maybe this is a good definition of scientific explanation?

Unfortunately, this “correlational account of explanation” has its own problems.

Take the following example.

[Figure: a flagpole of height H casting a shadow of length L, with the sun at an angle of elevation θ]

This flagpole casts a shadow of length L because of the angle of elevation of the sun and the height of the flagpole (H). In other words, we can explain the length of the shadow with the following pieces of information:

I1 =  “The angle of elevation of the sun is θ”
I2 = “The height of the flagpole is H”
I3 = Details involving the rectilinear propagation of light and the formation of shadows

Both the predictive and correlational theories of explanation work fine here. If somebody wanted an explanation for why the shadow’s length is L, then telling them I1, I2, and I3 would suffice. Why? Because I1, I2, and I3 jointly allow us to predict the shadow’s length! Easy.

X = “The length of the shadow is L.”
(I1 & I2 & I3) ⇒ X
So I1 & I2 & I3 explain X.

And similarly, P(X | I1 & I2 & I3) is large, and P(X) is small. So on the correlational account, the information given explains X.

But now, consider the following argument:

(I1 & I3 & X) ⇒ I2
So I1 & I3 & X explain I2.

The predictive theory of explanation applies here. If we know the length of the shadow and the angle of elevation of the sun, we can deduce the height of the flagpole. And the correlational account tells us the same thing.

But it’s clearly wrong to say that the explanation for the height of the flagpole is the length of the shadow!

What this reveals is an asymmetry in our notion of explanation. If somebody already knows how light propagates and also knows θ, then telling them H explains L. But telling them L does not explain H!
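Here's a small numerical sketch (my own, with made-up values for H and θ) of just how symmetric the underlying inference is: the same trigonometric relation lets you compute L from H exactly as easily as H from L.

import math

theta = math.radians(30)    # angle of elevation of the sun
H = 10.0                    # flagpole height, in meters

# "Forward" deduction: height + angle -> shadow length
L = H / math.tan(theta)
print(L)                    # ~17.32

# "Backward" deduction: shadow length + angle -> height
print(L * math.tan(theta))  # 10.0, recovers H just as easily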

In other words, the correlational theory of explanation fails, because correlation possesses symmetry properties that explanation does not.

This thought experiment also points the way to a more complete account of explanation. Namely, the relevant asymmetry between the length of the shadow and the height of the flagpole is one of causality. The reason why the height of the flagpole explains the shadow length but not vice versa, is that the flagpole is the cause of the shadow and not the reverse.

In other words, what this reveals to us is that scientific explanation is fundamentally about finding causes, not merely prediction or statistical correlation. This causal theory of explanation can be summarized in the following:

An explanation of A is a description of its causes that renders it intelligible.

More explicitly, an explanation of A (relative to background knowledge K) is a set of causes of A that render A intelligible to a rational agent that knows K.

What is integrated information?

Integrated information theory relates consciousness to degrees of integrated information within a physical system. I recently became interested in IIT and found it surprisingly hard to locate a good simple explanation of the actual mathematics of integrated information online.

Having eventually just read through all of the original papers introducing IIT, I discovered that integrated information is closely related to some of my favorite bits of mathematics, involving information theory and causal modeling.  This was exciting enough to me that I decided to write a guide to understanding integrated information. My goal in this post is to introduce a beginner to integrated information in a rigorous and (hopefully!) intuitive way.

I’ll describe it in increasing levels of complexity, so that even if you eventually get lost somewhere in the post, you’ll be able to walk away having learned something. If you get to the end of this post, you should be able to sit down with a pencil and paper and calculate the amount of integrated information in simple systems, as well as understand how to calculate it in principle for any system.

Level 1

So first, integrated information is a measure of the degree to which the components of a system are working together to produce outputs.

A system composed of many individual parts that are not interacting with each other in any way is completely un-integrated – it has an integrated information ɸ = 0. On the other hand, a system composed entirely of parts that are tightly entangled with one another will have a high amount of integrated information, ɸ >> 0.

For example, consider a simple model of a camera sensor.

[Image: a grid of camera sensor pixels]

This sensor is composed of many independent parts functioning completely separately. Each pixel stores a unit of information about the outside world, regardless of what its neighboring pixels are doing. If we were to somehow sever the causal connections between the two halves of the sensor, each half would still capture and store information in exactly the same way.

Now compare this to a human brain.

[Image: neuron activity in a human brain]

The nervous system is a highly entangled mesh of neurons, each interacting with many many neighbors in functionally important ways. If we tried to cut the brain in half, severing all the causal connections between the two sides, we would get an enormous change in brain functioning.

Make sense? Okay, on to level 2.

Level 2

So, integrated information has to do with the degree to which the components of a system are working together to produce outputs. Let’s delve a little deeper.

We just said that we can tell that the brain is integrating lots of information, because the functioning would be drastically disrupted if you cut it in half. A keen reader might have realized that the degree to which the functioning is disrupted will depend a lot on how you cut it in half.

For instance, cut off the front half of somebody’s brain, and you will end up with total dysfunction. But you can entirely remove somebody’s cerebellum (which contains the majority of the brain’s neurons), and end up with a person who has difficulty with coordination and is a slow learner, but is otherwise a pretty ordinary person.


What this is really telling us is that different parts of the brain are integrating information differently. So how do we quantify the total integration of information of the brain? Which cut do we choose when evaluating the decrease in functioning?

Simple: We look at every possible way of partitioning the brain into two parts. For each one, we see how much the brain’s functioning is affected. Then we locate the minimum information partition, that is, the partition that results in the smallest change in brain functioning. The change in functioning that results from this particular partition is the integrated information!

Okay. Now, what exactly do we mean by “changes to the system’s functioning”? How do we measure this?

Answer: The functionality of a system is defined by the way in which the current state of the system constrains the past and future states of the system.

To make full technical sense of this, we have to dive a little deeper.

Level 3

How many possible states are there of a Connect Four board?

(I promise this is relevant)

The board is 6 by 7, and each spot can be either a red piece, a black piece, or empty.

[Image: a Connect Four board]

So a simple upper bound on the number of total possible board states is 3^42 (of course, the actual number of possible states will be much lower than this, since some positions are impossible to get into).

Now, consider what you know about the possible past and future states of the board if the board state is currently…

[Image: the current board state]

Clearly there’s only one possible past state:

[Image: the single possible past state of the board]

And there are seven possible future states:

What this tells us is that the information about the current state of the board constrains the possible past and future states, selecting exactly one possible board out of the 3^42 possibilities for the past, and seven out of 3^42 possibilities for the future.

More generally, for any given system S we have a probability distribution over past and future states, given that the current state is X.


Pfuture(X, S) = Pr( Future state of S | Present state of S is X )
Ppast(X, S) = Pr( Past state of S | Present state of S is X )

For any partition of the system into two components, S1 and S2, we can consider the future and past distributions given that the states of the components are, respectively, X1 and X2, where X = (X1, X2).


Pfuture(X, S1, S2) = Pr( Future state of S1 | Present state of S1 is X1 )・Pr( Future state of S2 | Present state of S2 is X2 )
Ppast(X, S1, S2) = Pr( Past state of S1 | Present state of S1 is X1 )・Pr( Past state of S2 | Present state of S2 is X2 )

Now, we just need to compare our distributions before the partition to our distributions after the partition. For this we need some type of distance function D that assesses how far apart two probability distributions are. Then we define the cause information and the effect information for the partition (S1, S2).

Cause information = D( Ppast(X, S), Ppast(X, S1, S2) )
Effect information = D( Pfuture(X, S), Pfuture(X, S1, S2) )

In short, the cause information is how much the distribution over past states changes when you partition your system into two separate systems. And the effect information is the change in the distribution over future states when you partition the system.

The cause-effect information CEI is then defined as the minimum of the cause information CI and effect information EI.

CEI = min{ CI, EI }

We’ve almost made it all the way to our full definition of ɸ! Our last step is to calculate the CEI for every possible partition of S into two pieces, and then select the partition that minimizes CEI (the minimum information partition MIP).

The integrated information is just the cause effect information of the minimum information partition!

ɸ = CEI(MIP)

Level 4

We’ve now semi-rigorously defined ɸ. But to really get a sense of how to calculate ɸ, we need to delve into causal diagrams. At this point, I’m going to assume familiarity with causal modeling. The basics are covered in a series of posts I wrote starting here.

Here’s a simple example system:

[Diagram: a two-variable system, A and B, connected by XOR and AND gates]

This diagram tells us that the system is composed of two variables, A and B. Each of these variables can take on the values 0 and 1. The system follows the following simple update rule:

A(t + 1) = A(t) XOR B(t)
B(t + 1) = A(t) AND B(t)

We can redraw this as a causal diagram from A and B at time 0 to A and B at time 1:

[Diagram: the causal graph from A and B at time 0 to A and B at time 1]

What this amounts to is the following system evolution rule:

ABt → ABt+1
00  →  00
01  →  10
10  →  10
11  →  01
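(If it's useful, here's a minimal sketch of mine, not from the original papers, that just encodes the update rule and reproduces that table.)

def step(a, b):
    # One time step of the example system: A' = A XOR B, B' = A AND B
    return a ^ b, a & b

for a in (0, 1):
    for b in (0, 1):
        print(f"{a}{b} -> {step(a, b)[0]}{step(a, b)[1]}")
# 00 -> 00, 01 -> 10, 10 -> 10, 11 -> 01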

Now, suppose that we know that the system is currently in the state AB = 00. What does this tell us about the future and past states of the system?

Well, since the system evolution is deterministic, we can say with certainty that the next state of the system will be 00. And since there’s only one way to end up in the state 00, we know that the past state of the system was 00.

We can plot the probability distributions over the past and future distributions as follows:

[Figure: past and future distributions for the full system, each concentrated entirely on the state 00]

This is not too interesting a distribution… no information is lost or gained going into the past or future. Now we partition the system:


The causal diagram, when cut, looks like:

[Diagram: the cut causal graph, with the severed inputs replaced by noise variables]

Why do we have the two “noise” variables? Well, both A and B take two variables as inputs. Since one of these causal inputs has been cut off, we replace it with a random variable that’s equally likely to be a 0 or a 1. This procedure is called “noising” the causal connections across the partition.

According to this diagram, we now have two independent distributions over the two parts of the system, A and B. To get the distribution over the total future state of the system, we multiply the two:

P(A1, B1 | A0, B0) = P(A1 | A0) P(B1 | B0)

We can compute the two distributions P(A1 | A0) and P(B1 | B0) straightforwardly, by looking at how each variable evolves in our new causal diagram.

Future:
A0 = 0 ⇒ A1 = 0, 1 (½ probability each)
B0 = 0 ⇒ B1 = 0

Past:
A0 = 0 ⇒ A-1 = 0, 1 (½ probability each)
B0 = 0 ⇒ B-1 = 0, 1 (probabilities ⅔ and ⅓)

This implies the following probability distribution for the partitioned system:

[Figure: past and future distributions for the partitioned system]

I recommend you go through and calculate this for yourself. Everything follows from the updating rules that define the system and the noise assumption.

Good! Now we have two distributions, one for the full system and one for the partitioned system. How do we measure the difference between these distributions?

There are a few possible measures we could use. My favorite of these is the Kullback-Leibler divergence DKL. Technically, this measure is only used in IIT 2.0, not IIT 3.0 (which uses the earth-mover’s distance). I prefer DKL, as it has a nice interpretation as the amount of information lost when the system is partitioned. I have a post describing DKL here.

Here’s the definition of DKL:

DKL(P, Q) = ∑ Pi log(Pi / Qi)

We can use this quantity to calculate the cause information and the effect information:

Cause information = log₂(3) ≈ 1.6 bits
Effect information = log₂(2) = 1 bit

These values tell us that our partition destroys about .6 more bits of information about the past than it does about the future. For the purpose of integrated information, we only care about the smaller of these two (for reasons that I don’t find entirely convincing).

Cause-effect information = min{ 1, 1.6 } = 1

Now, we’ve calculated the cause-effect information for this particular partition. The integrated information is the cause-effect information of the minimum information partition, and since our two-variable system admits only this one partition, it is automatically the minimum information partition. And thus, we’ve calculated ɸ for our system!

ɸ = 1
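
If you’d like to verify these numbers, here’s a minimal Python sketch of the calculation. The per-variable distributions are just read off from the partitioned-system results above, the KL divergence is computed in bits, and all the variable names are my own:

import math

def kl(p, q):
    # Kullback-Leibler divergence in bits; p and q are dicts over the same states.
    return sum(pv * math.log2(pv / q[s]) for s, pv in p.items() if pv > 0)

states = [(a, b) for a in (0, 1) for b in (0, 1)]

# Full system in state AB = 00: the past and future states are both certainly 00.
p_full = {s: (1.0 if s == (0, 0) else 0.0) for s in states}

# Partitioned system (cut connections replaced by uniform noise), per-variable
# distributions taken from the tables above:
a_future = {0: 0.5, 1: 0.5}    # A1 = 0 XOR noise
b_future = {0: 1.0, 1: 0.0}    # B1 = noise AND 0 = 0
a_past   = {0: 0.5, 1: 0.5}    # P(A-1 | A0 = 0)
b_past   = {0: 2/3, 1: 1/3}    # P(B-1 | B0 = 0)

# The partitioned joint distributions are products over the two parts.
q_future = {(a, b): a_future[a] * b_future[b] for a, b in states}
q_past   = {(a, b): a_past[a] * b_past[b] for a, b in states}

effect_info = kl(p_full, q_future)    # log2(2) = 1 bit
cause_info  = kl(p_full, q_past)      # log2(3) ≈ 1.58 bits
print(cause_info, effect_info, min(cause_info, effect_info))    # ɸ = 1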

Level 5

Let’s now define ɸ in full generality.

Our system S consists of a vector of N variables X = (X1, X2, X3, …, XN), each an element in some space 𝒳. Our system also has an updating rule, which is a function f: 𝒳^N → 𝒳^N. In our previous example, 𝒳 = {0, 1}, N = 2, and f(x, y) = (x XOR y, x AND y).

More generally, our updating rule f can map X to a probability distribution p: 𝒳^N → [0, 1]. We’ll denote P(Xt+1 | Xt) as the distribution over the possible future states, given the current state. P is defined by our updating rule: P(Xt+1 | Xt) = f(Xt). The distribution over possible past states will be denoted P(Xt-1 | Xt). We’ll obtain this using Bayes’ rule, taking the prior over past states to be uniform: P(Xt-1 | Xt) = P(Xt | Xt-1) P(Xt-1) / P(Xt) = f(Xt-1) P(Xt-1) / P(Xt).

A partition of the system is a subset of {1, 2, 3, …, N}, which we’ll label A (we require A to be nonempty and not the whole set, so that both pieces are non-trivial). We define B = {1, 2, 3, …, N} \ A. Now we can define XA = (Xa)a∈A, and XB = (Xb)b∈B. Loosely speaking, we can say that X = (XA, XB), i.e. that the total state is just the combination of the two parts A and B.

We now define the distributions over future and past states in our partitioned system:

Q(Xt+1 | Xt) = P(XA, t+1 | XA, t) P(XB, t+1 | XB, t)
Q(Xt-1 | Xt) = P(XA, t-1 | XA, t) P(XB, t-1 | XB, t).

The effect information EI of the partition defined by A is the distance between P(Xt+1 | Xt) and Q(Xt+1 | Xt), and the cause information CI is defined similarly. The cause-effect information is defined as the minimum of these two.

CI(f, A, Xt) = D( P(Xt-1 | Xt), Q(Xt-1 | Xt) )
EI(f, A, Xt) = D( P(Xt+1 | Xt), Q(Xt+1 | Xt) )

CEI(f, A, Xt) = min{ CI(f, A, Xt), EI(f, A, Xt) }

And finally, we define the minimum information partition (MIP) and the integrated information:

MIP = argminA CEI(f, A, Xt)
ɸ(f, Xt) = minA CEI(f, A, Xt)
= CEI(f, MIP, Xt)

And we’re done!
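
For concreteness, here’s a brute-force Python sketch of this definition for small deterministic systems of binary variables. It assumes (as in the Bayes’-rule step above) a uniform prior over past states, and uses the KL divergence in bits as the distance measure D; the function names and structure are my own, not anything official from the IIT literature:

import itertools
import math

def integrated_information(update, n, state):
    # Brute-force ɸ for a deterministic system of n binary variables.
    # update: function taking a state tuple and returning the next state tuple.
    # state:  the current state X_t, a tuple of 0s and 1s (assumed reachable,
    #         so that the backward distribution is well defined).
    all_states = list(itertools.product((0, 1), repeat=n))

    def forward(part, values):
        # Distribution over the next values of the variables in `part`, given their
        # current `values`, with all other inputs "noised" (averaged over uniformly).
        counts = {}
        for s in all_states:
            if tuple(s[i] for i in part) != values:
                continue
            key = tuple(update(s)[i] for i in part)
            counts[key] = counts.get(key, 0.0) + 1.0
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}

    def backward(part, values):
        # Posterior over the previous values of `part` given its current `values`:
        # uniform prior times the forward likelihood, normalized.
        weights = {past: forward(part, past).get(values, 0.0)
                   for past in itertools.product((0, 1), repeat=len(part))}
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}

    def kl(p, q):
        # KL divergence in bits between two dicts over the same states.
        return sum(pv * math.log2(pv / q[s]) for s, pv in p.items() if pv > 0)

    whole = tuple(range(n))
    p_future, p_past = forward(whole, state), backward(whole, state)

    best = float("inf")
    for r in range(1, n):    # all non-trivial bipartitions (each counted twice, harmlessly)
        for A in itertools.combinations(range(n), r):
            B = tuple(i for i in range(n) if i not in A)
            sA, sB = tuple(state[i] for i in A), tuple(state[i] for i in B)
            fA, fB = forward(A, sA), forward(B, sB)
            bA, bB = backward(A, sA), backward(B, sB)
            q_future = {s: fA.get(tuple(s[i] for i in A), 0.0) *
                           fB.get(tuple(s[i] for i in B), 0.0) for s in all_states}
            q_past = {s: bA.get(tuple(s[i] for i in A), 0.0) *
                         bB.get(tuple(s[i] for i in B), 0.0) for s in all_states}
            cei = min(kl(p_past, q_past), kl(p_future, q_future))
            best = min(best, cei)
    return best

# The XOR/AND system from Level 4, in state AB = 00:
print(integrated_information(lambda s: (s[0] ^ s[1], s[0] & s[1]), 2, (0, 0)))    # 1.0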

Notice that our final result is a function of f (the updating function) as well as the current state of the system. What this means is that the integrated information of a system can change from moment to moment, even if the organization of the system remains the same.

By itself, this is not enough for the purposes of integrated information theory. Integrated information theory uses ɸ to define gradations of consciousness of systems, but the relationship between ɸ and consciousness isn’t exactly one-to-one (briefly, consciousness resides in non-overlapping local maxima of integrated information).

But this post is really meant to just be about integrated information, and the connections to the theory of consciousness are actually less interesting to me. So for now I’ll stop here! 🙂

The problem with philosophy

(Epistemic status: I have a high credence that I’m going to disagree with large parts of this in the future, but it all seems right to me at present. I know that’s non-Bayesian, but it’s still true.)

Philosophy is great. Some of the clearest thinkers and most rational people I know come out of philosophy, and many of my biggest worldview-changing moments have come directly from philosophers. So why is it that so many scientists seem to feel contempt towards philosophers and condescension towards their intellectual domain? I can actually occasionally relate to the irritation, and I think I understand where some of it comes from.

Every so often, a domain of thought within philosophy breaks off from the rest of philosophy and enters the sciences. Usually when this occurs, the subfield (which had previously been stagnant and unsuccessful in its attempts to make progress) is swiftly revolutionized and most of the previous problems in the field are promptly solved.

Unfortunately, what often happens next is that the philosophers who were previously working in the field remain unaware of the change, or ignore it, and end up wasting a lot of time and looking pretty silly. Sometimes they even explicitly challenge the scientists at the forefront of this revolution, like Henri Bergson did with Einstein after he came out with his pesky new theory of time that swept away much of the past work of philosophers in one fell swoop.

Next you get a generation of philosophy students that are taught a bunch of obsolete theories, and they are later blindsided when they encounter scientists who inform them that the problems they’re working on were solved decades ago. And by this point the scientists have left the philosophers so far in the dust that the typical philosophy student is incapable of understanding the answers to their questions without learning a whole new area of math or something. Thus usually the philosophers just keep on their merry way, asking each other increasingly abstruse questions and working harder and harder to justify their own intellectual efforts. Meanwhile scientists move further and further beyond them, occasionally dropping in to laugh at their colleagues that are stuck back in the Middle Ages.

Part of why this happens is structural. Philosophy is the womb inside which develops the seeds of great revolutions of knowledge. It is where ideas germinate and turn from vague intuitions and hotbeds of conceptual confusion into precisely answerable questions. And once these questions are answerable, the scientists and mathematicians sweep in and make short work of them, finishing the job that philosophy started.

I think that one area in which this has happened is causality.

Statisticians now know how to model causal relationships, how to distinguish them from mere regularities, how to deal with common causes and causal pre-emption, how to assess counterfactuals and assign precise probabilities to these statements, and how to compare different causal models and determine which is most likely to be true.

(By the way, guess where I came to be aware of all of this? It wasn’t in the metaphysics class in which we spent over a month discussing the philosophy of causation. No, it was a statistician friend of mine who showed me a book by Judea Pearl and encouraged me to get up to date with modern methods of causal modeling.)

Causality as a subject has firmly and fully left the domain of philosophy. We now have a fully fleshed out framework of causal reasoning that is capable of answering all of the ancient philosophical questions and more. This is not to say that there is no more work to be done on understanding causality… just that this work is not going to be done by philosophers. It is going to be done by statisticians, computer scientists, and physicists.

Another area besides causality where I think this has happened is epistemology. Modern advances in epistemology are not coming out of the philosophy departments. They’re coming out of machine learning institutes and artificial intelligence researchers, who are working on turning the question of “how do we optimally come to justified beliefs in a posteriori matters?” into precise code-able algorithms.

I’m thinking about doing a series of posts called “X for philosophers”, in which I take an area of inquiry that has historically been the domain of philosophy, and explain how modern scientific methods have solved or are solving the central questions in this area.

For instance, here’s a brief guide to how to translate all the standard types of causal statements philosophers have debated for centuries into simple algebra problems:

Causal model

An ordered triple of exogenous variables, endogenous variables, and structural equations for each endogenous variable

Causal diagram

A directed acyclic graph representing a causal model, whose nodes represent the endogenous variables and whose edges represent the structural equations

Causal relationship

A directed edge in a causal diagram

Causal intervention

A mutilated causal diagram in which the edges between the intervened node and all its parent nodes are removed

Probability of A if B

P(A | B)

Probability of A if we intervene on B

P(A | do B) = P(A_B)

Probability that A would have happened, had B happened

P(A_B | ¬B)

Probability that B is a necessary cause of A

P(¬A_¬B | A, B)

Probability that B is a sufficient cause of A

P(A_B | ¬A, ¬B)

Right there is the guide to understanding the nature of causal relationships, and assessing the precise probabilities of causal conditional statements, counterfactual statements, and statements of necessary and sufficient causation. (Here A_B is the standard counterfactual notation for “the value A would have taken, had B been the case.”)
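
And if the algebra still feels abstract, here’s a small Python sketch of the difference between conditioning and intervening in a toy structural causal model. The model, its equations, and all the numbers are invented purely for illustration; the point is just that P(A | B) and P(A | do B) come apart when a confounder C influences both B and A:

from itertools import product

# Toy structural causal model (everything here is made up for the example):
#   C = U_C                     (confounder)
#   B = C or U_B                (treatment, partly caused by C)
#   A = (B and C) or U_A        (outcome, caused by B and C)
# with independent exogenous bits U_C, U_B, U_A.

p_uc, p_ub, p_ua = 0.5, 0.3, 0.1    # P(U = 1) for each exogenous variable

def prob(event, do_b=None):
    # P(event), optionally under the intervention do(B = do_b), computed by
    # enumerating the exogenous variables.
    total = 0.0
    for u_c, u_b, u_a in product((0, 1), repeat=3):
        w = ((p_uc if u_c else 1 - p_uc) *
             (p_ub if u_b else 1 - p_ub) *
             (p_ua if u_a else 1 - p_ua))
        c = u_c
        b = do_b if do_b is not None else (c or u_b)    # intervening cuts the C → B edge
        a = (b and c) or u_a
        if event(a, b, c):
            total += w
    return total

# Conditional: P(A = 1 | B = 1) = P(A = 1, B = 1) / P(B = 1) ≈ 0.79
p_cond = prob(lambda a, b, c: a == 1 and b == 1) / prob(lambda a, b, c: b == 1)

# Interventional: P(A = 1 | do(B = 1)) = 0.55
p_do = prob(lambda a, b, c: a == 1, do_b=1)

print(p_cond, p_do)    # the two differ, because C confounds B and A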

To most philosophy students and professors, what I’ve written is probably chicken-scratch. But understanding it is crucial if their thinking about causation is not to become obsolete.

There’s an unhealthy tendency amongst some philosophers to, when presented with such chicken-scratch, dismiss it as not being philosophical enough and then go back to reading David Lewis’s arguments for the existence of possible worlds. It is this that, I think, is a large part of the scientist’s tendency to dismiss philosophers as outdated and intellectually behind the times. And it’s hard not to agree with them when you’ve seen both the crystal-clear beauty of formal causal modeling, and also the debates over things like how to evaluate the actual “distance” between possible worlds.

Artificial intelligence researcher extraordinaire Stuart Russell has said that he knew immediately upon reading Pearl’s book on causal modeling that it was going to change the world. Philosophy professors should either teach graph theory and Bayesian networks, or they should not make a pretense of teaching causality at all.

Galileo and the Schelling point improbability principle

An alternative history interaction between Galileo and his famous statistician friend

***

In the year 1609, when Galileo Galilei finished the construction of his majestic artificial eye, the first place he turned his gaze was the glowing crescent moon. He reveled in the crevices and mountains he saw, knowing that he was the first man alive to see such a sight, and his mind expanded as he saw the folly of the science of his day and wondered what else we might be wrong about.

For days he was glued to his telescope, gazing at the Heavens. He saw the planets become colorful expressive spheres and reveal tiny orbiting companions, and observed the distant supernova which Kepler had seen blinking into existence only five years prior. He discovered that Venus had phases like the Moon, that some apparently single stars revealed themselves to be binaries when magnified, and that there were dense star clusters scattered through the sky. All this he recorded in frantic enthusiastic writing, putting out sentences filled with novel discoveries nearly every time he turned his telescope in a new direction. The universe had opened itself up to him, revealing all its secrets to be uncovered by his ravenous intellect.

It took him two weeks to pull himself away from his study room for long enough to notify his friend Bertolfo Eamadin of his breakthrough. Eamadin was a renowned scholar, having pioneered at age 15 his mathematical theory of uncertainty and created the science of probability. Galileo often sought him out to discuss puzzles of chance and randomness, and this time was no exception. He had noticed a remarkable confluence of three stars that were in perfect alignment, and needed the counsel of his friend to sort out his thoughts.

Eamadin arrived at the home of Galileo half-dressed and disheveled, obviously having leapt from his bed and rushed over immediately upon receiving Galileo’s correspondence. He practically shoved Galileo out from his viewing seat and took his place, eyes glued with fascination on the sky.

Galileo allowed his friend to observe unmolested for a half-hour, listening with growing impatience to the ‘oohs’ and ‘aahs’ being emitted as the telescope swung wildly from one part of the sky to another. Finally, he interrupted.

Galileo: “Look, friend, at the pattern I have called you here to discuss.”

Galileo swiveled the telescope carefully to the position he had marked out earlier.

Eamadin: “Yes, I see it, just as you said. The three stars form a seemingly perfect line, each of the two outer ones equidistant from the central star.”

Galileo: “Now tell me, Eamadin, what are the chances of observing such a coincidence? One in a million? A billion?”

Eamadin frowned and shook his head. “It’s certainly a beautiful pattern, Galileo, but I don’t see what good a statistician like myself can do for you. What is there to be explained? With so many stars in the sky, of course you would chance upon some patterns that look pretty.”

Galileo: “Perhaps it seems only an attractive configuration of stars spewed randomly across the sky. I thought the same myself. But the symmetry seemed too perfect. I decided to carefully measure the central angle, as well as the angular distance subtended by the paths from each outer star to the central one. Look.”

Galileo pulled out a sheet of paper that had been densely scribbled upon. “My calculations revealed the central angle to be precisely 180.000º, with an error of ± .003º. And similarly, I found the difference in the two angular distances to be .000º, with a margin of error of ± .002º.”

Eamadin: “Let me look at your notes.”

Galileo handed over the sheets to Eamadin. “I checked over my calculations a dozen times before writing you. I found the angular distances by approaching and retreating from this thin paper, which I placed between the three stars and me. I found the distance at which the thin paper just happened to cover both stars on one extreme simultaneously, and did the same for the two stars on the other extreme. The distance was precisely the same, leaving measurement error only for the thickness of the paper, my distance from it, and the resolution of my vision.”

Eamadin: “I see, I see. Yes, what you have found is a startlingly clear pattern. An agreement in distance and angle this precise is quite unlikely to be the result of any natural phenomenon… ”

Galileo: “Exactly what I thought at first! But then I thought about the vast quantity of stars in the sky, and the vast number of ways of arranging them into groups of three, and wondered if perhaps in fact such coincidences might be expected. I tried to apply your method of uncertainty to the problem, and came to the conclusion that the chance of such a pattern having occurred through random chance is one in a thousand million! I must confess, however, that at several points in the calculation I found myself confronted with doubt about how to progress and wished for your counsel.”

Eamadin stared at Galileo’s notes, then pulled out a pad of his own and began scribbling intensely. Eventually, he spoke. “Yes, your calculations are correct. The chance of such a pattern having occurred through random forces, to within the degree of measurement error you have specified, is 10⁻⁹.”

Galileo: “Aha! Remarkable. So what does this mean? What strange forces have conspired to place the stars in such a pattern? And, most significantly, why?”

Eamadin: “Hold it there, Galileo. It is not reasonable to jump from the knowledge that the chance of an event is remarkably small to the conclusion that it demands a novel explanation.”

Galileo: “How so?”

Eamadin: “I’ll show you by means of a thought experiment. Suppose that we found that instead of the angle being 180.000º with an experimental error of .003º, it was 180.001º with the same error. The probability of this outcome would be the same as the outcome we found – one in a thousand million.”

Galileo: “That can’t be right. Surely it’s less likely to find a perfectly straight line than a merely nearly perfectly straight line.”

Eamadin: “While that is true, it is also true that the exact calculation you did for 180.000º ± .003º would apply for 180.001º ± .003º. And indeed, it is less likely to find the stars at this precise angle, than it is to find the stars merely near this angle. We must compare like with like, and when we do so we find that 180.000º is no more likely than any other angle!”

Galileo: “I see your reasoning, Eamadin, but you are missing something of importance. Surely there is something objectively more significant about finding an exactly straight line than about a nearly straight line, even if they have the same probability. Not all equiprobable events should be considered to be equally important. Think, for instance, of a sequence of twenty coin tosses. While it’s true that the outcome HHTHTTTTHTHHHTHHHTTH has the same probability as the outcome HHHHHHHHHHHHHHHHHHHH, the second is clearly more remarkable than the first.”

Eamadin: “But what is significance if disentangled from probability? I insist that the concept of significance only makes sense in the context of my theory of uncertainty. Significant results are those that either have a low probability or have a low conditional probability given a set of plausible hypotheses. It is this second class that we may utilize in analyzing your coin tossing example, Galileo. The two strings of tosses you mention are only significant to different degrees in that the second more naturally lends itself to a set of hypotheses in which the coin is heavily biased towards heads. In judging the second to be a more significant result than the first, you are really just saying that you use a natural hypothesis class in which probability judgments are only dependent on the ratios of heads and tails, not the particular sequence of heads and tails. Now, my question for you is: since 180.000º is just as likely as 180.001º, what set of hypotheses are you considering in which the first is much less likely than the second?”

Galileo: “I must confess, I have difficulty answering your question. For while there is a simple sense in which the number of heads and tails is a product of a coin’s bias, it is less clear what would be the analogous ‘bias’ in angles and distances between stars that should make straight lines and equal distances less likely than any others. I must say, Eamadin, that in calling you here, I find myself even more confused than when I began!”

Eamadin: “I apologize, my friend. But now let me attempt to disentangle this mess and provide a guiding light towards a solution to your problem.”

Galileo: “Please.”

Eamadin: “Perhaps we may find some objective sense in which a straight line or the equality of two quantities is a simpler mathematical pattern than a nearly straight line or two nearly equal quantities. But even if so, this will only be a help to us insofar as we have a presumption in favor of less simple patterns inhering in Nature.”

Galileo: “This is no help at all! For surely the principle of Ockham should push us towards favoring more simple patterns.”

Eamadin: “Precisely. So if we are not to look for an objective basis for the improbability of simple and elegant patterns, then we must look towards the subjective. Here we may find our answer. Suppose I were to scribble down on a sheet of paper a series of symbols and shapes, hidden from your view. Now imagine that I hand the images to you, and you go off to some unexplored land. You explore the region and draw up cartographic depictions of the land, having never seen my images. It would be quite a remarkable surprise were you to find upon looking at my images that they precisely matched your maps of the land.”

Galileo: “Indeed it would be. It would also quickly lend itself to a number of possible explanations. Firstly, it may be that you were previously aware of the layout of the land, and drew your pictures intentionally to capture it – that is, that the layout directly caused the resemblance in your depictions. Secondly, it could be that there was a common cause between the resemblance and the layout; perhaps, for instance, the patterns that most naturally come to the mind are those that resemble common geographic features. And thirdly, included only for completeness, it could be that your images somehow caused the land to have the geographic features that it did.”

Eamadin: “Exactly! You catch on quickly. Now, this case of the curious coincidence of depiction and reality is exactly analogous to your problem of the straight line in the sky. The straight lines and equal distances are just like patterns on the slips of paper I handed to you. For whatever reason, we come pre-loaded with a set of sensitivities to certain visual patterns. And what’s remarkable about your observation of the three stars is that a feature of the natural world happens to precisely align with these patterns, where we would expect no such coincidence to occur!”

Galileo: “Yes, yes, I see. You are saying that the improbability doesn’t come from any objective unusual-ness of straight lines or equal distances. Instead, the improbability comes from the fact that the patterns in reality just happen to be the same as the patterns in my head!”

Eamadin: “Precisely. Now we can break down the suitable explanations, just as you did with my cartographic example. The first explanation is that the patterns in your mind were caused by the patterns in the sky. That is, for some reason the fact that these stars were aligned in this particular way caused you to by psychologically sensitive to straight lines and equal quantities.”

Galileo: “We may discard this explanation immediately, for such sensitivities are too universal and primitive to be the result of a configuration of stars that has only just now made itself apparent to me.”

Eamadin: “Agreed. Next we have a common cause explanation. For instance, perhaps our mind is naturally sensitive to visual patterns like straight lines because such patterns tend to commonly arise in Nature. This natural sensitivity is what feels to us on the inside as simplicity. In this case, you would expect it to be more likely for you to observe simple patterns than might be naively thought.”

Galileo: “We must deny this explanation as well, it seems to me. For the resemblance to a straight line goes much further than my visual resolution could even make out. The increased likelihood of observing a straight line could hardly be enough to outweigh our initial naïve calculation of the probability being 10⁻⁹. But thinking more about this line of reasoning, it strikes me that you have just provided an explanation of the apparent simplicity of the laws of Nature! We have developed to be especially sensitive to patterns that are common in Nature, we interpret such patterns as ‘simple’, and thus it is a tautology that we will observe Nature to be full of simple patterns.”

Eamadin: “Indeed, I have offered just such an explanation. But it is an unsatisfactory explanation, insofar as one is opposed to the notion of simplicity as a purely subjective feature. Most people, myself included, would strongly suggest that a straight line is inherently simpler than a curvy line.”

Galileo: “I feel the same temptation. Of course, justifying a measure of simplicity that does the job we want of it is easier said than done. Now, on to the third explanation: that my sensitivity to straight lines has caused the apparent resemblance to a straight line. There are two interpretations of this. The first is that the stars are not actually in a straight line, and you only think this because of your predisposition towards identifying straight lines. The second is that the stars aligned in a straight line because of these predispositions. I’m sure you agree that both can be reasonably excluded.”

Eamadin: “Indeed. Although it may look like we’ve excluded all possible explanations, notice that we only considered one possible form of the common cause explanation. The other two categories of explanations seem more thoroughly ruled out; your dispositions couldn’t be caused by the star alignment given that you have only just found out about it and the star alignment couldn’t be caused by your dispositions given the physical distance.”

Galileo: “Agreed. Here is another common cause explanation: God, who crafted the patterns we see in Nature, also created humans to have similar mental features to Himself. These mental features include aesthetic preferences for simple patterns. Thus God causes both the salience of the line pattern to humans and the existence of the line pattern in Nature.”

Eamadin: “The problem with this is that it explains too much. Based solely on this argument, we would expect that when looking up at the sky, we should see it entirely populated by simple and aesthetic arrangements of stars. Instead it looks mostly random and scattershot, with a few striking exceptions like those which you have pointed out.”

Galileo: “Your point is well taken. All I can imagine now is that there must be some sort of ethereal force that links some stars together, gradually pushing them so that they end up in nearly straight lines.”

Eamadin: “Perhaps that will be the final answer in the end. Or perhaps we will discover that it is the whim of a capricious Creator with an unusual habit for placing unsolvable mysteries in our paths. I sometimes feel this way myself.”

Galileo: “I confess, I have felt the same at times. Well, Eamadin, although we have failed to find a satisfactory explanation for the moment, I feel much less confused about this matter. I must say, I find this method of reasoning by noticing similarities between features of our mind and features of the world quite intriguing. Have you a name for it?”

Eamadin: “In fact, I just thought of it on the spot! I suppose that it is quite generalizable… We come pre-loaded with a set of very salient and intuitive concepts, be they geometric, temporal, or logical. We should be surprised to find these concepts instantiated in the world, unless we know of some causal connection between the patterns in our mind and the patterns in reality. And by Eamadin’s rule of probability-updating, when we notice these similarities, we should increase our strength of belief in these possible causal connections. In the spirit of anachronism, let us refer to this as the Schelling point improbability principle!”

Galileo: “Sounds good to me! Thank you for your assistance, my friend. And now I must return to my exploration of the Cosmos.”

The Monty Hall non-paradox

I recently showed the famous Monty Hall problem to a friend. This friend solved the problem right away, and we realized quickly that the standard presentation of the problem is highly misleading.

Here’s the setup as it was originally described in the magazine column that made it famous:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

I encourage you to think through this problem for yourself and come to an answer. I’ll provide some blank space so that you don’t accidentally read ahead.

 

 

 

 

 

 

Now, the writer of the column was Marilyn vos Savant, famous for having an impossible IQ of 228 according to an interpretation of a test that violated “almost every rule imaginable concerning the meaning of IQs” (psychologist Alan Kaufman). In her response to the problem, she declared that switching gives you a 2/3 chance of winning the car, as opposed to a 1/3 chance for staying. She argued by analogy:

Yes; you should switch. The first door has a 1/3 chance of winning, but the second door has a 2/3 chance. Here’s a good way to visualize what happened. Suppose there are a million doors, and you pick door #1. Then the host, who knows what’s behind the doors and will always avoid the one with the prize, opens them all except door #777,777. You’d switch to that door pretty fast, wouldn’t you?

Notice that this answer contains a crucial detail that is not contained in the statement of the problem! Namely, the answer adds the stipulation that the host “knows what’s behind the doors and will always avoid the one with the prize.”

The original statement of the problem in no way implies this general statement about the host’s behavior. All you are justified to assume in an initial reading of the problem are the observational facts that (1) the host happened to open door No. 3, and (2) this door happened to contain a goat.

When nearly a thousand PhDs wrote in to the magazine explaining that her answer was wrong, she gave further arguments that failed to reference the crucial point: that her answer was only true given additional unstated assumptions.

My original answer is correct. But first, let me explain why your answer is wrong. The winning odds of 1/3 on the first choice can’t go up to 1/2 just because the host opens a losing door. To illustrate this, let’s say we play a shell game. You look away, and I put a pea under one of three shells. Then I ask you to put your finger on a shell. The odds that your choice contains a pea are 1/3, agreed? Then I simply lift up an empty shell from the remaining other two. As I can (and will) do this regardless of what you’ve chosen, we’ve learned nothing to allow us to revise the odds on the shell under your finger.

Notice that this argument is literally just a restatement of the original problem. If one didn’t buy the conclusion initially, restating it in terms of peas and shells is unlikely to do the trick!

This problem was made even more famous by this scene in the movie “21”, in which the protagonist demonstrates his brilliance by coming to the same conclusion as vos Savant. While the problem is stated slightly better in this scene, enough ambiguity still exists that the proper response is that the problem is underspecified, or perhaps to give a set of different answers for different sets of auxiliary assumptions.

The wiki page on this ‘paradox’ describes it as a veridical paradox, “because the correct choice (that one should switch doors) is so counterintuitive it can seem absurd, but is nevertheless demonstrably true.”

Later on the page, we see the following:

In her book The Power of Logical Thinking, vos Savant (1996, p. 15) quotes cognitive psychologist Massimo Piattelli-Palmarini as saying that “no other statistical puzzle comes so close to fooling all the people all the time,” and “even Nobel physicists systematically give the wrong answer, and that they insist on it, and they are ready to berate in print those who propose the right answer.”

There’s something to be said about adequacy reasoning here; when thousands of PhDs and some of the most brilliant mathematicians in the world are making the same point, perhaps we are too quick to write it off as “Wow, look at the strength of this cognitive bias! Thank goodness I’m bright enough to see past it.”

In fact, the source of all of the confusion is fairly easy to understand, and I can demonstrate it in a few lines.

Solution to the problem as presented

Initially, all three doors are equally likely to contain the car.
So Pr(1) = Pr(2) = Pr(3) = ⅓

We are interested in how these probabilities update upon the observation that 3 does not contain the car.
Pr(1 | ~3) = Pr(1)・Pr(~3 | 1) / Pr(~3)
= (⅓ ・1) / ⅔ = ½

By the same argument,
Pr(2 | ~3) = ½

Voila. There’s the simple solution to the problem as it is presented, with no additional presumptions about the host’s behavior. Accepting this argument requires only accepting three premises:

(1) Initially all doors are equally likely to be hiding the car.

(2) Bayes’ rule.

(3) There is only one car.

(3) implies that Pr(the car is not behind a door | the car is behind a different door) = 100%, which we use when we replace Pr(~3 | 1) with 1.

The answer we get is perfectly obvious; in the end all you know is that the car is either in door 1 or door 2, and that you picked door 1 initially. Since which door you initially picked has nothing to do with which door the car was behind, and the host’s decision gives you no information favoring door 1 over door 2, the probabilities should be evenly split between the two.

It is also the answer that all the PhDs gave.
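
To see that this isn’t just symbol-pushing, here’s a quick simulation sketch. A simulation has to commit to some host behavior, so this one uses the weakest model consistent with the bare observations: the host blindly opens one of the two remaining doors, and we simply condition on the fact that door 3 was opened and happened to show a goat. (The code and names are mine, of course.)

import random

def posterior_door1(trials=200_000):
    # Estimate P(car behind door 1 | host opened door 3 and it showed a goat),
    # for a host who opens one of doors 2 and 3 at random, knowing nothing.
    hits = kept = 0
    for _ in range(trials):
        car = random.randint(1, 3)        # you always pick door 1
        opened = random.choice((2, 3))    # host opens one of the other doors blindly
        if opened == 3 and car != 3:      # condition on exactly what was observed
            kept += 1
            hits += (car == 1)
    return hits / kept

print(posterior_door1())    # ≈ 0.5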

Now, why does taking into account the host’s decision process change things? Simply because the host’s decision is now contingent on your decision, as well as the actual location of the car. Given that you initially picked door 1, the host is guaranteed not to open door 1 for you, and is also guaranteed not to open up a door hiding the car.

Solution with specified host behavior

Initially, all three doors are equally likely to contain the car.
So Pr(1) = Pr(2) = Pr(3) = ⅓

We update these probabilities upon the observation that 3 does not contain the car, using the likelihood formulation of Bayes’ rule.

Pr(1 | open 3) / Pr(2 | open 3)
= Pr(1) / Pr(2)・Pr(open 3 | 1) / Pr(open 3 | 2)
= ⅓ / ⅓・½ / 1 = ½

So Pr(1 | open 3) = ⅓ and Pr(2 | open 3) = ⅔

Pr(open 3 | 2) = 1, because the host has no choice of which door to open if you have selected door 1 and the car is behind door 2.

Pr(open 3 | 1) = ½, because the host has a choice of either opening 2 or 3.

In fact, it’s worth pointing out that this requires another behavioral assumption about the host that is nowhere stated in the original problem, nor in vos Savant’s solution. This is that if there is a choice about which of two doors to open, the host will pick randomly.

This assumption is again not obviously correct from the outset; perhaps the host chooses the larger of the two door numbers in such cases, or the one closer to themselves, or the one with the smaller number with 25% probability. There are infinitely many possible strategies the host could be using, and this particular strategy must be explicitly stipulated to get the answer that the wiki page proclaims to be correct.

It’s also worth pointing out that once these additional assumptions are made explicit, the ⅓ answer is fairly obvious and not much of a paradox. If you know that the host is guaranteed to choose a door with a goat behind it, and not one with a car, then of course their decision about which door to open gives you information. It gives you information because opening door 3 would have been less likely in the world where the car was behind door 1 than in the world where it was behind door 2.
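
Here’s the same simulation sketch with those extra assumptions made explicit (the host always reveals a goat and breaks ties at random); it recovers the ⅓ / ⅔ answer:

import random

def posterior_door1(trials=200_000):
    # Estimate P(car behind door 1 | host opened door 3), for a host who always
    # opens a goat door and chooses at random when both unchosen doors hide goats.
    hits = kept = 0
    for _ in range(trials):
        car = random.randint(1, 3)                      # you always pick door 1
        goat_doors = [d for d in (2, 3) if d != car]    # doors the host is allowed to open
        opened = random.choice(goat_doors)
        if opened == 3:                                 # condition on the observation
            kept += 1
            hits += (car == 1)
    return hits / kept

print(posterior_door1())    # ≈ 1/3, so switching wins ≈ 2/3 of the time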

In terms of causal diagrams, the second formulation of the Monty Hall problem makes your initial choice of door and the location of the car dependent upon one another once we condition on the host’s choice. There is a path of causal dependency that goes forwards from your decision to the host’s decision, which is conditioned upon, and then backward from the host’s decision to which door the car is behind.

Any unintuitiveness in this version of the Monty Hall problem is ultimately due to the unintuitiveness of the effects of conditioning on a common effect of two variables.

[Figure: causal diagram for the Monty Hall problem, with the host’s choice as a common effect of your choice and the car’s location]

In summary, there is no paradox behind the Monty Hall problem, because there is no single Monty Hall problem. There are two different problems, each containing different assumptions, and each with different answers. The answers to each problem are fairly clear after a little thought, and the only appearance of a paradox comes from apparent disagreements between individuals that are actually just talking about different problems. There is no great surprise when ambiguous wording turns up multiple plausible solutions; it’s just surprising that so many people see something deeper than mere ambiguity here.