Seeing, Doing, Imagining

A picture from Pearl that I think nicely sums up the different levels of reasoning:

[Image: Pearl's hierarchy of the levels of reasoning: seeing, doing, and imagining.]

Mere probability theory gives you only the first level. To get to the level of causal and counterfactual understanding, we must supplement probabilities with graphical models.

Against moral realism

Here’s my primary problem with moral realism: I can’t think of any acceptable epistemic framework that would give us a way to justifiably update our beliefs in the objective truth of moral claims. I.e. I can’t think of any reasonable account of how we could have justified beliefs in objectively true moral principles.

Here’s a sketch of a plausible-seeming account of epistemology. In broad strokes, there are two sources of justified belief: deduction and induction.

Deduction refers to the process by which we define some axioms and then see what logically follows from them. So, for instance, the axioms of Peano Arithmetic entail the theorem that 1+1=2 – or, in Peano’s language, S(0) + S(0) = S(S(0)). The central reason why reasoning by deduction is reliable is that the truths established are true by definition – they are made true by the way we have constructed our terms, and are thus true in every possible world.

Induction is scientific reasoning – it is the process of taking prior beliefs, observing evidence, and then updating these beliefs (via Bayes’ rule, for instance). The central reason why induction is reliable comes from the notion of causal entanglement. When we make an observation and update our beliefs based upon this observation, the brain state “believes X” has become causally entangled with the truth of the statement X. So, for instance, if I observe a positive result on a pregnancy test, then my belief in the statement “I am pregnant” has become causally entangled with the truth of the statement “I am pregnant.” It is exactly this that justifies our use of induction in reasoning about the world.

Now, where do moral claims fall? They are not derived from deductive reasoning… that is, we cannot just arbitrarily define right and wrong however we like, and then derive morality from these definitions.

And they are also not truths that can be established through inductive reasoning; after all, objective moral truths are not the types of things that have any causal effects on the world.

In other words, even if there are objective moral truths, we would have no way of forming justified beliefs about this. To my mind, this is a pretty devastating situation for a moral realist. Think about it like this: a moral realist who doesn’t think that moral truths have causal power over the world must accept that all of their beliefs about morality are completely causally independent of their truth. If we imagine keeping all the descriptive truths about the world fixed, and only altering the normative truths, then none of the moral realist’s moral beliefs would change.

So how do they know that they’re in the world where their moral beliefs actually do align with the moral reality? Can they point to any reason why their moral beliefs are more likely to be true than any other moral statements? As far as I can tell, no, they can’t!

Now, you might just object to the particular epistemology I’ve offered up, and suggest some new principle by which we can become acquainted with moral truth. This is the path of many professional philosophers I have talked to.

But every attempt that I’ve heard of for doing this begs the question or resorts to just gesturing at really deeply held intuitions of objectivity. If you talk to philosophers, you’ll hear appeals to a mysterious cognitive ability to reflect on concepts and “detect their intrinsic properties”, even if these properties have no way of interacting with the world, or elaborate descriptions of the nature of “self-evident truths.”

None of this deals with the central issue in moral epistemology, as I see it. This central issue is: How can a moral realist think that their beliefs about morality are any more likely to be true than any random choice of a moral framework?

Explanation is asymmetric

We all regularly reason in terms of the concept of explanation, but rarely think hard about what exactly we mean by it. What constitutes a scientific explanation? In this post, I’ll point out some features of explanation that may not be immediately obvious.

Let’s start with one account of explanation that should seem intuitively plausible. This is the idea that to explain X to a person is to give that person some information I that would have allowed them to predict X.

For instance, suppose that Janae wants an explanation of why Ari is not pregnant. Once we tell Janae that Ari is a biological male, she is satisfied and feels that the lack of pregnancy has been explained. Why? Well, because had Janae known that Ari was a male, she would have been able to predict that Ari would not get pregnant.

Let’s call this the “predictive theory of explanation.” On this view, explanation and prediction go hand-in-hand. When somebody learns a fact that explains a phenomenon, they have also learned a fact that allows them to predict that phenomenon.

 To spell this out very explicitly, suppose that Janae’s state of knowledge at some initial time is expressed by

K1 = “Males cannot get pregnant.”

At this point, Janae clearly cannot conclude anything about whether Ari is pregnant. But now Janae learns a new piece of information, and her state of knowledge is updated to

K2 = “Ari is a male & males cannot get pregnant.”

Now Janae is warranted in adding the deduction

K’ = “Ari cannot get pregnant”

This suggests that the added information explains Ari’s non-pregnancy for the same reason that it allows the deduction of Ari’s non-pregnancy.

Now, let’s consider a problem with this view: the problem of relevance.

Suppose a man named John is not pregnant, and somebody explains this with the following two premises:

  1. People who take birth control pills almost certainly don’t get pregnant.
  2. John takes birth control pills regularly.

Now, these two premises do successfully predict that John will not get pregnant. But the fact that John takes birth control pills regularly gives no explanation at all of his lack of pregnancy. Naively applying the predictive theory of explanation gives the wrong answer here.

You might have also been suspicious of the predictive theory of explanation on the grounds that it relied on purely logical deduction and a binary conception of knowledge, not allowing us to accommodate the uncertainty inherent in scientific reasoning. We can fix this by saying something like the following:

What it is to explain X to somebody that knows K is to give them information I such that

(1) P(X | K) is small, and
(2) P(X | K, I) is large.

“Small” and “large” here are intentionally vague; it wouldn’t make sense to draw a precise line in the probabilities.

The idea here is that explanations are good insofar as they (1) make their explanandum sufficiently likely, when (2) it would have been insufficiently likely without them.

We can think of this as a correlational account of explanation. It attempts to root explanations in sufficiently strong correlations.

First of all, we can notice that this doesn’t suffer from a problem with irrelevant information. We can find relevance relationships by looking for independencies between variables. So maybe this is a good definition of scientific explanation?

Unfortunately, this “correlational account of explanation” has its own problems.

Take the following example.

[Image: a flagpole of height H casting a shadow of length L, with the sun at elevation angle θ.]

This flagpole casts a shadow of length L because of the angle of elevation of the sun and the height of the flagpole (H). In other words, we can explain the length of the shadow with the following pieces of information:

I1 =  “The angle of elevation of the sun is θ”
I2 = “The height of the flagpole is H”
I3 = Details involving the rectilinear propagation of light and the formation of shadows

Both the predictive and correlational theories of explanation work fine here. If somebody wanted an explanation for why the shadow’s length is L, then telling them I1, I2, and I3 would suffice. Why? Because I1, I2, and I3 jointly allow us to predict the shadow’s length! Easy.

X = “The length of the shadow is L.”
(I1 & I2 & I3) ⇒ X
So I1 & I2 & I3 explain X.

And similarly, P(X | I1 & I2 & I3) is large, and P(X) is small. So on the correlational account, the information given explains X.

But now, consider the following argument:

(I1 & I3 & X) ⇒ I2
So I1 & I3 & X explain I2.

The predictive theory of explanation applies here. If we know the length of the shadow and the angle of elevation of the sun, we can deduce the height of the flagpole. And the correlational account tells us the same thing.

But it’s clearly wrong to say that the explanation for the height of the flagpole is the length of the shadow!

What this reveals is an asymmetry in our notion of explanation. If somebody already knows how light propagates and also knows θ, then telling them H explains L. But telling them L does not explain H!

In other words, the correlational theory of explanation fails, because correlation possesses symmetry properties that explanation does not.
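To see the symmetry concretely, here is a minimal sketch in Python (the specific numbers are made up for illustration). Given θ and the facts about light propagation, the height deductively fixes the shadow length and the shadow length deductively fixes the height; nothing in the arithmetic singles out one direction as explanatory.

import math

# Hypothetical numbers: a 10-meter flagpole with the sun at 30 degrees elevation.
theta = math.radians(30)
H = 10.0

L = H / math.tan(theta)            # "predict" the shadow length from the height: ~17.32 m
H_recovered = L * math.tan(theta)  # "predict" the height from the shadow length: 10.0 m

print(L, H_recovered)  # the deduction runs equally well in both directions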

This thought experiment also points the way to a more complete account of explanation. Namely, the relevant asymmetry between the length of the shadow and the height of the flagpole is one of causality. The reason why the height of the flagpole explains the shadow length but not vice versa, is that the flagpole is the cause of the shadow and not the reverse.

In other words, what this reveals to us is that scientific explanation is fundamentally about finding causes, not merely prediction or statistical correlation. This causal theory of explanation can be summarized in the following:

An explanation of A is a description of its causes that renders it intelligible.

More explicitly, an explanation of A (relative to background knowledge K) is a set of causes of A that render A intelligible to a rational agent that knows K.

Wave function entropy

Entropy is a feature of probability distributions, and can be taken to be a quantification of uncertainty.

Standard quantum mechanics takes its fundamental object to be the wave function – an amplitude distribution. And from an amplitude distribution Ψ you can obtain a probability distribution Ψ*Ψ.

So it is very natural to think about the entropy of a given quantum state. For some reason, it looks like this concept of wave function entropy is not used much in physics. The quantum-mechanical version of entropy that is typically referred to is the von Neumann entropy, which involves uncertainty over which quantum state a system is in (rather than uncertainty intrinsic to a quantum state).

I’ve been looking into some of the implications of the concept of wave function entropy, and found a few interesting things.

Firstly, let’s just go over what precisely wave function entropy is.

Quantum mechanics is primarily concerned with calculating the wave function Ψ(x), which distributes complex amplitudes over configuration space. The physical meaning of these amplitudes is interpreted by taking their absolute square Ψ*Ψ, which is a probability distribution.

Thus, the entropy of the wave function is given by:

S = – ∫ Ψ*Ψ ln(Ψ*Ψ) dx

As an example, I’ll write out some of the wave functions for the basic hydrogen atom:

(Ψ*Ψ)1s = e^(−2r) / π
(Ψ*Ψ)2s = (2 − r)² e^(−r) / 32π
(Ψ*Ψ)2p = r² e^(−r) cos²(θ) / 32π
(Ψ*Ψ)3s = (2r² − 18r + 27)² e^(−2r/3) / 19683π

With these wave functions in hand, we can go ahead and calculate the entropies! Some of the integrals are intractable, so using numerical integration, we get:

S1s ≈ 70
S2s ≈ 470
S2p ≈ 326
S3s ≈ 1320

The increasing values for (1s, 2s, 3s) make sense – higher energy wave functions are more dispersed, meaning that there is greater uncertainty in the electron’s spatial distribution.
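As a check on the definition itself (the numerical values above depend on one’s choice of units and normalization), here’s a minimal Python sketch, assuming SciPy is available, that computes S for a normalized one-dimensional Gaussian wave packet, where the exact answer is ½ ln(2πeσ²):

import numpy as np
from scipy.integrate import quad

sigma = 1.0  # width of the wave packet, in arbitrary units

def prob_density(x):
    # |Psi(x)|^2 for Psi(x) = (2*pi*sigma^2)^(-1/4) * exp(-x^2 / (4*sigma^2)): a normal distribution N(0, sigma^2)
    return np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def integrand(x):
    p = prob_density(x)
    return -p * np.log(p)

S_numeric, _ = quad(integrand, -20 * sigma, 20 * sigma)
S_exact = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(S_numeric, S_exact)  # both ≈ 1.4189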

Let’s go into something a bit more theoretically interesting.

We’ll be interested in a generalization of entropy – relative entropy. This will quantify, rather than pure uncertainty, changes in uncertainty from a prior probability distribution ρ to our new distribution Ψ*Ψ. This will be the quantity we’ll denote S from now on.

S = – ∫ Ψ*Ψ ln(Ψ*Ψ/ρ) dx

Now, suppose we’re interested in calculating the wave functions Ψ that are local maxima of entropy. This means we want to find the Ψ for which δS = 0. Of course, we also want to ensure that a few basic constraints are satisfied. Namely,

∫ Ψ*Ψ dx = 1
∫ Ψ*HΨ dx = E

These constraints are chosen by analogy with the constraints in ordinary statistical mechanics – normalization and average energy. H is the Hamiltonian operator, which corresponds to the energy observable.

We can find the critical points of entropy that satisfy the constraint by using the method of Lagrange multipliers. Our two Lagrange multipliers will be α (for normalization) and β (for energy). This gives us the following equation for Ψ:

Ψ ln(Ψ*Ψ/ρ) + (α + 1)Ψ + βHΨ = 0
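For the record, here’s a sketch of the variational step behind this equation, treating Ψ and Ψ* as independent fields and varying the constrained entropy with respect to Ψ*:

\[
\frac{\delta}{\delta \Psi^*}\left[\,-\int \Psi^*\Psi \ln\frac{\Psi^*\Psi}{\rho}\,dx
\;-\; \alpha\left(\int \Psi^*\Psi\,dx - 1\right)
\;-\; \beta\left(\int \Psi^* H \Psi\,dx - E\right)\right]
\;=\; -\Psi \ln\frac{\Psi^*\Psi}{\rho} \;-\; \Psi \;-\; \alpha\Psi \;-\; \beta H\Psi \;=\; 0,
\]

which, after multiplying through by −1 and grouping the Ψ terms, is exactly the equation above.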

We can rewrite this as an operator equation, which gives us

ln(Ψ*Ψ/ρ) + (α + 1) + βH = 0
Ψ*Ψ = (ρ/Z) e^(−βH)

Here we’ve renamed our constants so that Z = e^(α+1) is a normalization constant.

So we’ve solved the wave function equation… but what does this tell us? If you’re familiar with some basic quantum mechanics, our expression should look somewhat familiar to you. Let’s backtrack a few steps to see where this familiarity leads us.

Ψ ln(Ψ*Ψ/ρ) + (α + 1)Ψ + βHΨ = 0
HΨ + 1/β ln(Ψ*Ψ/ρ) Ψ = – (α + 1)/β Ψ

Let’s rename – (α + 1)/β to a new constant λ. And we’ll take a hint from statistical mechanics and call 1/β the temperature T of the state. Now our equation looks like

HΨ + T ln(Ψ*Ψ/ρ) Ψ = λΨ

This equation is almost the Schrodinger equation. In particular, the Schrodinger equation pops out as the zero-temperature limit of this equation:

As T → 0, our equation becomes
HΨ = λΨ

The obvious interpretation of the constant λ in the zero temperature limit is E, the energy of the state. 

What about in the infinite-temperature limit?

As T → ∞, our equation becomes
Ψ*Ψ = ρ

Why is this? Because the only solution to the equation in this limit is for ln(Ψ*Ψ/ρ) → 0, or in other words Ψ*Ψ/ρ → 1

And what this means is that in the infinite temperature limit, the critical entropy wave function is just that which gives the prior distribution.

We can interpret this result as a generalization of the Schrodinger equation. Rather than a linear equation, we now have an additional logarithmic nonlinearity. I’d be interested to see how the general solutions to this equation differ from the standard equations, but that’s for another post.

HΨ + T ln(Ψ*Ψ/ρ) Ψ = λΨ

Query sensitivity of evidential updates

Plausible reasoning, unlike logical deduction, is sensitive not only to the information at hand but also to the query process by which the information was obtained.

Judea Pearl, Probabilistic Reasoning in Intelligent Systems

This quote references an interesting feature of inductive reasoning that’s worth unpacking. It points at the greater complexity involved in formalizing induction compared to formalizing deduction.

A very simple example of this:

You spread a rumor to your neighbor N. A few days later you hear the same rumor from another neighbor N’. Should you increase your belief in the rumor now that N’ repeats it, or should you first determine whether N’ heard it from N?

Clearly, if the only source of information for N’ was N, then your belief should not change. But if N’ independently confirmed the validity of the rumor, you have good reason to increase your belief in it.

In general, when you have both top-down (predictive) and bottom-up (explanatory/diagnostic) inferences in evidential reasoning, it is important to be able to trace back queries. If not, one runs the risk of engaging in circular reasoning.

So far this is all fairly obvious. Now here’s an example that’s more subtle.

Three prisoners problem

Three prisoners have been tried for murder, and their verdicts will be read tomorrow morning. Only one will be declared guilty, and the other two will be declared innocent.

Before sentencing, Prisoner A asks the guard (who knows which prisoner will be declared guilty) to do him a favor and give a letter to one of the other two prisoners who will be released (since only one person will be declared guilty, Prisoner A knows that at least one of the other two prisoners will be released). The guard does so, and later, Prisoner A asks him which of the two prisoners (B or C) he gave the letter to. The guard responds “I gave the letter to Prisoner B.”

Now Prisoner A reasons as follows:

“Previously, my chances of being declared guilty were one in three. Now that I know that B will be released, only C and I remain as candidates for being declared guilty. So now my chances are one in two.”

Is this wrong?

Denote “A is guilty” as GA, and “B is innocent” as IB. Now, since GA → IB, we have that P(IB | GA) = 1. This tells us that we can write

P(GA | IB) = P(IB | GA) P(GA) / P(IB)
= P(GA) / P(IB) = ⅓ / ⅔ = ½

The problem with this argument is that we have excluded some of the context of the guard’s response, namely, that the guard could only have answered “I gave the letter to Prisoner B” or “I gave the letter to Prisoner C.” In other words, conditioning merely on the fact “Prisoner B will be declared innocent” leads to the wrong conclusion about the credibility of A’s guilt.

Let’s instead condition on IB’ = “Guard says that B will be declared innocent.” Now we get

P(GA | IB’) = P(IB’ | GA) P(GA) / P(IB’) = ½ ⋅ ⅓ / ½ = ⅓.

It’s not sufficient to just condition on what the guard said. We must consider the range of possible statements that the guard could have made.

In general, we cannot assess only the impact of the propositions implied by the information we actually received. We must also consider what other information we could have received.
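A quick Monte Carlo sanity check of this, in Python, assuming (as in the calculation above) that the guard picks uniformly at random between B and C when A is the guilty one:

import random

def simulate(trials=200_000):
    a_guilty_and_b_named = b_named = 0
    for _ in range(trials):
        guilty = random.choice("ABC")
        if guilty == "B":
            named = "C"                   # the guard must name the innocent one of B and C
        elif guilty == "C":
            named = "B"
        else:
            named = random.choice("BC")   # A is guilty, so the guard picks between B and C
        if named == "B":
            b_named += 1
            a_guilty_and_b_named += (guilty == "A")
    return a_guilty_and_b_named / b_named

print(simulate())  # ≈ 1/3, matching P(GA | IB') above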

Things get clearer if we consider a similar thought experiment.

1000 prisoners problem

You are one of 1000 prisoners awaiting sentencing, with the knowledge that only one of you will be declared guilty. You come across a slip of paper from the court listing 998 prisoners, each name marked ‘innocent’. You look through all 998 names and find that your name is missing.

This should worry you greatly – your chances of being declared guilty have gone from 1 in 1000 to 1 in 2.

But imagine that you now see the query that produced the list.

Query: “Print the names of any 998 innocent right-handed prisoners.”

If you are the only left-handed prisoner, then you should thank your lucky stars. Why? Because now that you know that the query couldn’t have produced your name, the fact that it didn’t gives you no information. In other words, your chances have gone from 1 in 2 back to 1 in 1000.

In this example you can see very clearly why information about the possible outputs of a query is relevant to how we should update on the actual output of the query. We must know the process by which we attain information in order to be able to accommodate that information into our beliefs.
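The same point in numbers, a minimal sketch treating the two queries as differing only in whether your name could ever have appeared on the list:

def posterior_guilty(prior, p_absent_if_guilty, p_absent_if_innocent):
    # Bayes' rule: P(you are guilty | your name is absent from the list)
    num = p_absent_if_guilty * prior
    return num / (num + p_absent_if_innocent * (1 - prior))

prior = 1 / 1000
# Query: "print any 998 innocent prisoners" -- an innocent you is left off with probability 1/999
print(posterior_guilty(prior, 1.0, 1 / 999))  # 0.5
# Query: "print any 998 innocent right-handed prisoners", and you are the only left-handed one --
# your name could never have appeared, so its absence carries no information
print(posterior_guilty(prior, 1.0, 1.0))      # 0.001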

But now what if we don’t have this information? Suppose that you only run into the list of prisoners, and have no additional knowledge about how it was produced. Well, then we must consider all possible queries that might have produced this output!

This is no small matter.

For simplicity, let’s reconsider the simpler example with just three prisoners: Prisoners A, B, and C. Imagine that you are Prisoner A.

You come across a slip of paper from the court containing the statement I, where

I = “Prisoner B will be declared innocent”.

Now, we must assess the impact of I on the proposition GA = “Prisoner A is guilty.”

P(GA | I) = P(I | GA) P(GA) / P(I) = ⅓ P(I | GA) / P(I)

The result of this calculation depends upon P(I | GA), or in other words, how likely it is that the slip would declare Prisoner B innocent, given that you are guilty. This depends on the query process, and can vary anywhere from 0 to 1. Let’s just give this variable a name: P(I | GA) = p.

We’ll also need to know two other probabilities: (1) that the slip declares B innocent given that B is guilty, and (2) that it declares B innocent given that C is guilty. We’ll assume that the slip cannot be lying (i.e. that the first of these is zero), and name the second probability q = P(I | GC).

P(I | GA) = p (slip could declare either B or C innocent)
P(I | GB) = 0 (slip could declare either A or C innocent)
P(I | GC) = q (slip could declare either A or B innocent)

Now we have

P(GA | I) = ⅓ p / P(I)
=⅓ p / [P(I | GA) P(GA) + P(I | GB) P(GB) + P(I | GC) P(GC)]
= ⅓ p / (⅓ p + 0 + ⅓ q)
= p/(p + q)

How do we assess this value, given that p and q are unknown? The Bayesian solution is to treat the probabilities p and q as random variables, and specify probability distributions over their possible values: f(p) and g(q). These distributions should contain all of your prior knowledge about the plausible queries that might have produced I.

The final answer is obtained by integrating over all possible values of p and q.

P(GA | I) = E[p/(p + q)]
= ∫ p/(p + q) f(p) g(q) dp dq

Supposing that our distributions over p and q are maximally uncertain (uniform over [0, 1]), the final answer we obtain is

P(GA | I) = ∫ p/(p + q) dp dq
= 0.5

Now suppose that we know that the slip could not declare A (yourself) innocent (as we do in the original three prisoners problem). Then we know that q = 1 (since if C is guilty and A couldn’t be on the slip, B is the only possible choice). This gives us

P(GA | I) = ∫ p/(p + 1) f(p) dp

If we are maximally uncertain about the value of p, we obtain

P(GA | I) = ∫ p/(p + 1) dp
= 1 – ln(2)
≈ 0.30685
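Both of these integrals are easy to check numerically; a quick sketch using SciPy:

from scipy.integrate import quad, dblquad

# E[p / (p + q)] with independent uniform priors over p and q (the maximally uncertain case)
val_2d, _ = dblquad(lambda q, p: p / (p + q), 0, 1, 0, 1)

# E[p / (p + 1)] with a uniform prior over p (the q = 1 case)
val_1d, _ = quad(lambda p: p / (p + 1), 0, 1)

print(val_2d, val_1d)  # ≈ 0.5 and ≈ 0.30685 (= 1 - ln 2)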

If, on the other hand, we are sure that the value of p is 50% (i.e., we know that in the case that A is guilty, the guard chooses randomly between B and C), we obtain

P(GA | I) = .5/(.5 + 1) = ⅓

We’ve re-obtained our initial result! Interestingly, we can see that being maximally uncertain about the guard’s procedure for choosing between B and C gives a different answer than knowing that the guard chooses totally randomly between B and C.

Notice that this is true even though these reflect the same expectation of what choice the guard will make!

I.e., in both cases (total uncertainty about p, and knowledge that p is exactly .5), we should have 50% credence in the guard choosing B. This gives us some insight into the importance of considering different types of uncertainty when doing induction, which is a topic for another post.

Summarizing conscious experience

There’s a puzzle for implementation of probabilistic reasoning in human beings. This is that the start of the reasoning process in humans is conscious experience, and it’s not totally clear how we should update on conscious experiences.

Jeffreys defined a summary of an experience E as a set B of propositions {B1, B2, … Bn} such that for every other proposition A in your belief system, P(A | B) = P(A | B, E).

In other words, B is a minimal set of propositions that fully screens off your experience.

This is a useful concept because summary propositions allow you to isolate everything that is epistemically relevant about a conscious experience. If you have a summary B of an experience E, then you only need to know P(A | B) and P(B | E) in order to calculate P(A | E).

Notice that the summary set is subjective; it is defined only in terms of properties of your personal belief network. The set of facts that screens off E for you might be different from the set of facts that screens it off for somebody else.

Quick example.

Consider a brief impression by candlelight of a cloth held some distance away from you. Call this experience E.

Suppose that all you could decipher from E is that the cloth was around 2 meters away from you, and that it was either blue (with probability 60%) or green (with probability 40%). Then the summary set for E might be {“The cloth is blue”, “The cloth is green”, “The cloth is 2 meters away from you”, “The cloth is 3 meters away from you”, etc.}.

If this is the right summary set, then the probabilities P(“The cloth is blue”), P(“The cloth is green”) and P(“The cloth is x meters away from you”) should screen off E from the rest of your beliefs.
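Here’s a minimal sketch of how such a summary would be used, with made-up numbers for some downstream proposition A (say, “this is my friend’s blue tablecloth”), restricted to the color propositions for brevity:

# P(Bi | E): what the experience tells you about the summary propositions
p_summary_given_E = {"cloth is blue": 0.6, "cloth is green": 0.4}

# P(A | Bi): hypothetical conditional credences in A given each summary proposition
p_A_given_summary = {"cloth is blue": 0.8, "cloth is green": 0.2}

# Because B screens off E, P(A | E) = sum over the partition of P(A | Bi) * P(Bi | E)
p_A_given_E = sum(p_summary_given_E[b] * p_A_given_summary[b] for b in p_summary_given_E)
print(p_A_given_E)  # 0.56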

One trouble is that it’s not exactly obvious how to go about converting a given experience into a set of summary propositions. We could always be leaving something out. For instance, one more thing we learned upon observing E was the proposition “I can see light.” This is certainly not screened off by the other propositions so far, so we need to add it in as well.

But how do we know that we’ve gotten everything now? If we think a little more, we realize that we have also learned something about the nature of the light given off by the candle flame. We learn that it is capable of reflecting the color of light that we saw!

But now this additional consideration is related to how we interpret the color of the cloth. In other words, not only might we be missing something from our summary set, but that missing piece might be relevant to how we interpret the others.

I’d like to think more about this question: In general, how do we determine the set of propositions that screens off a given experience from the rest of your beliefs? Ultimately, to be able to coherently assess the impact of experiences on your web of beliefs, your model of reality must contain a model of yourself as an experiencer.

The nature of this model is pretty interesting from a philosophical perspective. Does it arise organically out of factual beliefs about the physical world? Well, this is what a physicalist would say. To me, it seems quite plausible that modeling yourself as a conscious experiencer would require a separate set of rules relating physical happenings to conscious experiences. How we should model this set of rules as a set of a priori hypotheses to be updated on seems very unclear to me.

Simple induction

In front of you is a coin. You don’t know the bias of this coin, but you have some prior probability distribution over possible biases (between 0: always tails, and 1: always heads). This distribution has some statistical properties that characterize it, such as a standard deviation and a mean. And from this prior distribution, you can predict the outcome of the next coin toss.

Now the coin is flipped and lands heads. What is your prediction for the outcome of the next toss?

This is a dead simple example of a case where there is a correct answer to how to reason inductively. It is as correct as any deductive proof, and derives a precise and unambiguous result:

P(H on the next toss | H on the first toss) = [∫ θ² π(θ) dθ] / [∫ θ π(θ) dθ] = (µ² + σ²)/µ = µ (1 + (σ/µ)²)

where π(θ) is your prior density over the coin’s bias θ, with mean µ and standard deviation σ.

This is a law of rational thought, just as rules of logic are laws of rational thought. It’s interesting to me how the understanding of the structure of inductive reasoning begins to erode the apparent boundary between purely logical a priori reasoning and supposedly a posteriori inductive reasoning.

Anyway, here’s one simple conclusion that we can draw from the above image: After the coin lands heads, it should be more likely that the coin will land heads next time. After all, the initial credence was µ, and the final credence is µ multiplied by a value that is necessarily greater than 1.

You probably didn’t need to see an equation to guess that for each toss that lands H, future tosses landing H become more likely. But it’s nice to see the fundamental justification behind this intuition.

We can also examine some special cases. For instance, consider a uniform prior distribution (corresponding to maximum initial uncertainty about the coin bias). For this distribution (π = 1), µ = 1/2 and σ² = 1/12, so σ/µ = 1/√3. Thus, we arrive at the conclusion that after getting one heads, your credence in the next toss landing heads should be 2/3 (about 67%, up from 50%), which is just Laplace’s rule of succession.
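A quick numerical sketch of this calculation, assuming SciPy for the integrals:

from scipy.integrate import quad

def predictive_after_heads(prior_density):
    # P(next toss = H | first toss = H) = E[theta^2] / E[theta] under the prior
    num, _ = quad(lambda t: t**2 * prior_density(t), 0, 1)
    den, _ = quad(lambda t: t * prior_density(t), 0, 1)
    return num / den

print(predictive_after_heads(lambda t: 1.0))  # 2/3 for the uniform prior, Laplace's rule of succession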

We can get a sense of the insufficiency of point estimates using this example. Two prior distributions with the same average value will respond very differently to evidence, and thus the final point estimate of the chance of H will differ. But what is interesting is that while the mean is insufficient, just the mean and standard deviation suffice for inferring the value of the next point estimate.

In general, the dynamics are controlled by the term σ/µ. As σ/µ goes to zero (which corresponds to a tiny standard deviation, or a very confident prior), our update goes to zero as well. And as σ/µ gets large (either by a weak prior or a low initial credence in the coin being H-biased), the observation of H causes a greater update.

How large can this term possibly get? Obviously, the updated point estimate should asymptote towards 1, but this is not obvious from the form of the equation we have (it looks like σ/µ can get arbitrarily large, forcing our final point estimate to infinity). What we need to do is optimize the updated point estimate while taking into account the constraints implied by the relationship between σ and µ. (For any distribution supported on [0, 1], σ² ≤ µ(1 − µ), which is exactly what keeps µ + σ²/µ from exceeding 1.)

The North Korea problem isn’t solved

Donald Trump and Kim Jong Un just met and signed a deal committing North Korea to nuclear disarmament. Yay! Problem solved!

Except that there’s a long historical precedent of North Korea signing deals just like this one, only to immediately go back on them. Here’s a timeline for some relevant historical context.

1985: North Korea signs Nuclear Non-Proliferation Treaty
1992: North Korea signs historic agreement to halt nuclear program! (#1)
1993: North Korea is found to be cheating on its commitments under the NPT
1994: In exchange for US assistance in production of proliferation-free nuclear power plants, North Korea signs historic agreement to halt nuclear program! (#2)
1998: North Korea is suspected of having an underground nuclear facility
1998: North Korea launches missile tests over Japan
1999: North Korea signs historic agreement to end missile tests, in exchange for a partial lifting of economic sanctions by the US.
2000: North Korea signs historic agreement to reunify Korea! Nobel Peace Prize is awarded
2002-2003: North Korea admits to having a secret nuclear weapons program, and withdraws from the NPT
2004: North Korea allows an unofficial US delegation to visit its nuclear facilities to display a working nuclear weapon
2005: In exchange for economic and energy assistance, North Korea signs historic agreement to halt nuclear program and denuclearize! (#3)
2006: North Korea fires seven ballistic missiles and conducts an underground nuclear test
2006: North Korea declares support for denuclearization of Korean peninsula
2006: North Korea again supports denuclearization of Korean peninsula
2007: In exchange for energy aid from the US, North Korea signs historic agreement to halt nuclear program! (#4)
2007: N&S Korea sign agreement on reunification
2009: North Korea issues a statement outlining a plan to weaponize newly separated plutonium
2010: North Korea threatens war with South Korea
2010: North Korea again announces commitment to denuclearize
2011: North Korea announces plan to halt nuclear and missile tests
2012: North Korea announces halt to nuclear program
2013: North Korea announces intentions to conduct more nuclear tests
2014: North Korea test fires 30 short-range rockets, as well as two medium missiles into the Sea of Japan
2015: North Korea offers to halt nuclear tests
2016: North Korea announces that it has detonated a hydrogen bomb
2016: North Korea again announces support for denuclearization
2017: North Korea conducts its sixth nuclear test
2018: Kim Jong Un announces that North Korea will mass produce nuclear warheads and ballistic missiles for deployment
2018: In exchange for the cancellation of US-South Korea military exercises, North Korea, once again, commits to “work toward complete denuclearization on the Korean peninsula”

Maybe this time is really, truly different. But our priors should be informed by history, and history tells us that it’s almost certainly not.

Priors in the supernatural

A friend of mine recently told me the following anecdote.

Years back, she and her boyfriend had visited an astrologer in India, who told her the following things: (1) she would end up marrying her boyfriend at the time, (2) down the line they would have two kids, the first a girl and the second a boy, and (3) he predicted the exact dates of birth of both children.

Many years down the line, all of these predictions turned out to be true.

I trust this friend a great deal, and don’t have any reason to think that she misremembered the details or lied to me about them. But at the same time, I recognize that astrology is completely crazy.

Since that conversation, I’ve been thinking about the ways in which we can evaluate our de facto priors in supernatural events by consulting either real-world anecdotes or thought experiments. For instance, if we think that each of these two predictions gave us a likelihood ratio of 100:1 in favor of astrology being true, and if I ended up thinking that astrology was about as likely to be true as false, then I must have started with roughly 1:10,000 odds against astrology being true.
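In odds form the arithmetic is simple; here’s a sketch with the anecdote’s numbers, working backwards from roughly even posterior odds:

# Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio
likelihood_ratio = 100 * 100   # two independent 100:1 pieces of evidence
posterior_odds = 1.0           # "about as likely to be true as false"
prior_odds = posterior_odds / likelihood_ratio
print(prior_odds)              # 0.0001, i.e. odds of 1:10,000 against astrology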

That’s not crazily low for a belief that contradicts much of our understanding of physics. I would have thought that my prior odds would be something much lower, like 1:10^10 or something. But really put yourself in that situation.

Imagine that you go to an astrologer, who is able to predict an essentially unpredictable sequence of events years down the line, with incredible accuracy. Suppose that the astrologer tells you who you will marry, how many kids you’ll have, and the dates of birth of each. Would you really be totally unshaken by this experience? Would you really believe that it was more likely to have happened by coincidence?

Yes, yes, I know the official Bayesian response – I read it in Jaynes long ago. For beliefs like astrology that contradict our basic understanding of science and causality, we should always have reserved some amount of credence for alternate explanations, even if we can’t think of any on the spot. This reserve of credence will insure us against jumping in credence to 99% upon seeing a psychic continuously predict the number in our heads, ensuring sanity and a nice simple secular worldview.

But that response is not sufficient to rule out all strong evidence for the supernatural.

Here’s one such category of strong evidence: evidence for which all alternative explanations are ruled out by the laws of physics as strongly as the supernatural hypothesis is ruled out by the laws of physics.

I think that my anecdote is one such case. If it was true, then there is no good natural alternative explanation for it. The reason? Because the information about the dates of birth of my friend’s children did not exist in the world at the time of the prediction, in any way that could be naturally attainable by any human being.

By contrast, imagine you go to a psychic who tells you to put up some fingers behind your back and then predicts over and over again how many fingers you have up. There are hundreds of alternative explanations for this besides “Psychics are real and science has failed us.” The reason that there are these alternative explanations is that the information predicted by the psychic existed in the world at the time of the prediction.

But in the case of my friend’s anecdote, the information predicted by the astrologer was lost far in the chaotic dynamics of the future.

What this rules out is the possibility that the astrologer somehow obtained the information surreptitiously by any natural means. It doesn’t rule out a host of other explanations, such as that my friend’s perception at the time was mistaken, that her memory of the event is skewed, or that she is lying. I could even, as a last resort, consider the possibility that I hallucinated the entire conversation with her. (I’d like to give the formal title “unbelievable propositions” to the set of propositions that are so unlikely that we should sooner believe that we are hallucinating than accept evidence for them.)

But each of these sources of alternative explanations, with the possible exception of the last, can be made significantly less plausible.

Let me use a thought experiment to illustrate this.

Imagine that you are a nuclear physicist who, with a group of colleagues, has decided to test the predictive powers of a fortune teller. You carefully design an experiment in which a source of true quantum randomness will produce a number between 1 and N. Before the number has been produced, when it still exists only as an unrealized possibility in the wave function, you ask the fortune teller to predict its value.

Suppose that they get it correct. For what value of N would you begin to take their fortune telling abilities seriously?

Here’s how I would react to the success, for different values of N.

N = 10: “Haha, that’s a funny coincidence.”

N = 100: “Hm, that’s pretty weird.”

N = 1000: “What…”

N = 10,000: “Wait, WHAT!?”

N = 100,000: “How on Earth?? This is crazy.”

N = 1,000,000: “Ok, I’m completely baffled.”

I think I’d start taking them seriously as early as N = 10,000. This indicates prior odds of roughly 1:10,000 against fortune-telling abilities (roughly the same as my prior odds against astrology, interestingly!). Once again, this seems disconcertingly low.

But let’s try to imagine some alternative explanations.

As far as I can tell, there are only three potential failure points: (1) our understanding of physics, (2) our engineering of the experiment, (3) our perception of the fortune teller’s prediction.

First of all, if our understanding of quantum mechanics is correct, there is no possible way that any agent could do better than random at predicting the number.

Secondly, we stipulated that the experiment was designed meticulously so as to ensure that the information was truly random, and unavailable to the fortune-teller. I don’t think that such an experiment would actually be that hard to design. But let’s go even further and imagine that we’ve designed the experiment so that the fortune teller is not in causal contact with the quantum number-generator until after she has made her prediction.

And thirdly, we can suppose that the prediction is viewed by multiple different people, all of whom affirm that it was correct. We can even go further and imagine that video was taken, and broadcast to millions of viewers, all of whom agreed. Not all of them could just be getting it wrong over and over again. The only possibility is that we’re hallucinating not just the experimental result, but indeed also the public reaction and consensus on the experimental result.

But the hypothesis of a hallucination now becomes inconsistent with our understanding of how the brain works! A hallucination wouldn’t have the effect of creating a perception of a completely coherent reality in which everybody behaves exactly as normal except that they saw the fortune teller make a correct prediction. We’d expect that if this were a hallucination, it would not be so self-consistent.

Pretty much all that’s left, as far as I can tell, is some sort of Cartesian evil demon that’s cleverly messing with our brains to create this bizarre false reality. If this is right, then we’re left weighing the credibility of the laws of physics against the credibility of radical skepticism. And in that battle, I think, the laws of physics lose out. (Consider that the invalidity of radical skepticism is a precondition for the development of laws of physics in the first place.)

The point of all of this is just to sketch an example where I think we’d have a good justification for ruling out all alternative explanations, at least with an equivalent degree of confidence that we have for affirming any of our scientific knowledge.

Let’s bring this all the way back to where we started, with astrology. The conclusion of this blog post is not that I’m now a believer in astrology. I think that there’s enough credence in the buckets of “my friend misremembered details”, “my friend misreported details”, and “I misunderstood details” so that the likelihood ratio I’m faced with is not actually 10,000 to 1. I’d guess it’s something more like 10 to 1.

But I am now that much less confident that astrology is wrong. And I can imagine circumstances under which my confidence would be drastically decreased. While I don’t expect such circumstances to occur, I do find it instructive (and fun!) to think about them. It’s a good test of your epistemology to wonder what it would take for your most deeply-held beliefs to be overturned.

Patterns of inductive inference

I’m currently reading through Judea Pearl’s wonderful book Probabilistic Reasoning in Intelligent Systems. It’s chock-full of valuable insights into the subtle patterns involved in inductive reasoning.

Here are some of the patterns of reasoning described in Chapter 1, ordered in terms of increasing unintuitiveness. Any good system of inductive inference should be able to accommodate all of the following.

Abduction:

If A implies B, then finding that B is true makes A more likely.

Example: If fire implies smoke, smoke suggests fire.

Asymmetry of inference:

There are two types of inference that function differently: predictive vs explanatory. Predictive inference reasons from causes to consequences, whereas explanatory inference reasons from consequences to causes.

Example: Seeing fire suggests that there is smoke (predictive). Seeing smoke suggests that there is a fire (diagnostic).

Induced Dependency:

If you know A, then learning B can suggest C where it wouldn’t have if you hadn’t known A.

Example: Ordinarily, burglaries and earthquakes are unrelated. But if you know that your alarm is going off, then whether or not there was an earthquake is relevant to whether or not there was a burglary.

Correlated Evidence:

Upon discovering that multiple sources of evidence have a common origin, the credibility they lend to a hypothesis should be decreased.

Example: You learn on a radio report, TV report, and newspaper report that thousands died. You then learn that all three reports got their information from the same source. This decreases the credibility that thousands died.

Explaining away:

Finding a second explanation for an item of data makes the first explanation less credible. If A and B both suggest C, and C is true, then finding that B is true makes A less credible.

Example: Finding that my light bulb emits red light makes it less credible that the red-hued object in my hand is truly red.

Rule of the hypothetical middle:

If two diametrically opposed assumptions impart two different degrees of belief onto a proposition Q, then the unconditional degree of belief should be somewhere between the two.

Example: The plausibility of an animal being able to fly is somewhere between the plausibility of a bird flying and the plausibility of a non-bird flying.

Defeaters or Suppressors:

Even if, as a general rule, B is more likely given A, this does not necessarily mean that learning A makes B more credible. There may be other elements in your knowledge base K that explain A away. In fact, learning A might even cause B to become less likely (as in Simpson’s paradox). In other words, updating beliefs must involve searching your entire knowledge base for defeaters of general rules that are not directly inferentially connected to the evidence you receive.

Example 1: Learning that the ground is wet does not permit us to increase the certainty of “It rained”, because the knowledge base might contain “The sprinkler is on.”
Example 2: You have kidney stones and are seeking treatment. You additionally know that Treatment A makes you more likely to recover from kidney stones than Treatment B. But if you also have the background information that your kidney stones are large, then your recovery under Treatment A becomes less credible than under Treatment B.

Non-Transitivity:

Even if A suggests B and B suggests C, this does not necessarily mean that A suggests C.

Example 1: Your card being an ace suggests it is an ace of clubs. If your card is an ace of clubs, then it is a club. But if it is an ace, this does not suggest that it is a club.
Example 2: If the sprinkler was on, then the ground is wet. If the ground is wet, then it rained. But it’s not the case that if the sprinkler was on, then it rained.

Non-detachment:

Just learning that a proposition has changed in credibility is not enough to analyze the effects of the change; the reason for the change in credibility is relevant.

Example: You get a phone call telling you that your alarm is going off. Worried about a burglar, you head towards your home. On the way, you hear a radio announcement of an earthquake near your home. This makes it more credible that your alarm really is going off, but less credible that there was a burglary. In other words, your alarm going off decreased the credibility of a burglary, because it happened as a result of the earthquake, whereas typically an alarm going off would make a burglary more credible.
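To make the alarm examples concrete, here’s a minimal sketch of the burglary/earthquake/alarm setup with made-up numbers, computing the relevant conditional probabilities by brute-force enumeration; it exhibits both induced dependency and explaining away:

from itertools import product

P_B, P_E = 0.01, 0.02  # hypothetical prior probabilities of burglary and earthquake (independent)
P_ALARM = {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95}  # P(alarm | burglary, earthquake)

def joint(b, e, a):
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p_a = P_ALARM[(b, e)]
    return p * (p_a if a else 1 - p_a)

def p_burglary(alarm=None, earthquake=None):
    num = den = 0.0
    for b, e, a in product((0, 1), repeat=3):
        if alarm is not None and a != alarm:
            continue
        if earthquake is not None and e != earthquake:
            continue
        w = joint(b, e, a)
        den += w
        num += w * b
    return num / den

print(p_burglary())                       # prior: 0.01
print(p_burglary(alarm=1))                # the alarm makes a burglary far more credible (~0.57)
print(p_burglary(alarm=1, earthquake=1))  # learning of the earthquake explains it away (~0.03)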

✯✯✯

All of these patterns should make a lot of sense to you when you give them a bit of thought. It turns out, though, that accommodating them in a system of inference is no easy matter.

Pearl distinguishes between extensional and intensional systems, and talks about the challenges for each approach. Extensional systems (including fuzzy logic and non-monotonic logic) focus on extending the truth values of propositions from {0,1} to a continuous range of uncertainty [0, 1], and then modifying the rules according to which propositions combine (for instance, the proposition “A & B” has the truth value min{A, B} in some extensional systems and A*B in others). The locality and simplicity of these combination rules turns out to be their primary failing; they lack the subtlety and nuance required to capture the complicated reasoning patterns above. Their syntactic simplicity makes them easy to work with, but curses them with semantic sloppiness.

On the other hand, intensional systems (like probability theory) involve assigning a function from entire world-states (rather than individual propositions) to degrees of plausibility. This allows for the nuance required to capture all of the above patterns, but results in a huge blow up in complexity. True perfect Bayesianism is ridiculously computationally infeasible, as the operation of belief updating blows up exponentially as the number of atomic propositions increases. Thus, intensional systems are semantically clear, but syntactically messy.

A good summary of this from Pearl (p 12):

We have seen that handling uncertainties is a rather tricky enterprise. It requires a fine balance between our desire to use the computational permissiveness of extensional systems and our ability to refrain from committing semantic sins. It is like crossing a minefield on a wild horse. You can choose a horse with good instincts, attach certainty weights to it and hope it will keep you out of trouble, but the danger is real, and highly skilled knowledge engineers are needed to prevent the fast ride from becoming a disaster. The other extreme is to work your way by foot with a semantically safe intensional system, such as probability theory, but then you can hardly move, since every step seems to require that you examine the entire field afresh.

The challenge for extensional systems is to accommodate the nuance of correct inductive reasoning.

The challenge for intensional systems is to maintain their semantic clarity while becoming computationally feasible.

Pearl solves the second challenge by supplementing Bayesian probability theory with causal networks that give information about the relevance of propositions to each other, drastically simplifying the tasks of inference and belief propagation.

One more insight from Chapter 1 of the book… Pearl describes four primitive qualitative relationships in everyday reasoning: likelihood, conditioning, relevance, and causation. I’ll give an example of each, and how they are symbolized in Pearl’s formulation.

1. Likelihood (“Tim is more likely to fly than to walk.”)
P(A)

2. Conditioning (“If Tim is sick, he can’t fly.”)
P(A | B)

3. Relevance (“Whether Tim flies depends on whether he is sick.”)
P(A | B) ≠ P(A)

4. Causation (“Being sick caused Tim’s inability to fly.”)
P(A | do B)

The challenge is to find a formalism that fits all four of these, while remaining computationally feasible.