Wave function entropy

June 25, 2018 / May 9, 2019 ~ squarishbracket ~ 4 Comments

Entropy is a feature of probability distributions, and can be taken to be a quantification of uncertainty.

Standard quantum mechanics takes its fundamental object to be the wave function – an amplitude distribution. And from an amplitude distribution Ψ you can obtain a probability distribution Ψ^*Ψ.

So it is very natural to think about the entropy of a given quantum state. For some reason, it looks like this concept of wave function entropy is not used much in physics. The quantum-mechanical version of entropy that is typically referred to is the Von-Neumann entropy, which involves uncertainty over which quantum state a system is in (rather than uncertainty intrinsic to a quantum state).

I’ve been looking into some of the implications of the concept of wave function entropy, and found a few interesting things.

Firstly, let’s just go over what precisely wave function entropy is.

Quantum mechanics is primarily concerned with calculating the wave function Ψ(x), which distributes complex amplitudes over configuration space. The physical meaning of these amplitudes is interpreted by taking their absolute square Ψ^*Ψ, which is a probability distribution.

Thus, the entropy of the wave function is given by:

S = – ∫ Ψ^*Ψ ln(Ψ^*Ψ) dx
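To make the definition concrete, here is a minimal numerical sketch for a case that is not part of this post: a one-dimensional Gaussian wave packet, whose density Ψ^*Ψ is a normal distribution with an assumed width `sigma`. The function names, grid size, and integration range are illustrative choices. For a Gaussian density the integral has the closed form ½ ln(2πeσ²), which gives something to check the numerics against.

```python
import math

def gaussian_density(x, sigma):
    """Probability density Psi^* Psi of a Gaussian wave packet centred at 0."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

def wave_function_entropy(density, lo, hi, steps=20000):
    """Approximate S = -∫ ρ ln ρ dx on [lo, hi] with the midpoint rule."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        rho = density(lo + (i + 0.5) * dx)
        if rho > 0.0:  # ρ ln ρ → 0 as ρ → 0, so points that underflow are skipped
            total -= rho * math.log(rho) * dx
    return total

sigma = 1.5  # assumed width, arbitrary length units
numeric = wave_function_entropy(lambda x: gaussian_density(x, sigma), -12 * sigma, 12 * sigma)
exact = 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)
print(numeric, exact)
```

Note that the value depends on the unit of length: rescaling x by a factor c shifts S by ln(c), so only differences between entropies computed in the same units are meaningful.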

As an example, I’ll write out the probability densities for some of the lowest states of the hydrogen atom (in atomic units, with r measured in Bohr radii):

(Ψ^*Ψ)_1s = e^(-2r) / π
(Ψ^*Ψ)_2s = (2 – r)² e^(-r) / (32π)
(Ψ^*Ψ)_2p = r² e^(-r) cos²(θ) / (32π)
(Ψ^*Ψ)_3s = (2r² – 18r + 27)² e^(-2r/3) / (19683π)

With these wave functions in hand, we can go ahead and calculate the entropies! Some of the integrals are intractable, so using numerical integration, we get:

S_1s ≈ 70
S_2s ≈ 470
S_2p ≈ 326
S_3s ≈ 1320

The increasing values for (1s, 2s, 3s) make sense – higher energy wave functions are more dispersed, meaning that there is greater uncertainty in the electron’s spatial distribution.

Let’s go into something a bit more theoretically interesting.

We’ll be interested in a generalization of entropy – relative entropy. This will quantify, rather than pure uncertainty, changes in uncertainty from a prior probability distribution ρ to our new distribution Ψ^*Ψ. This will be the quantity we’ll denote S from now on.

S = – ∫ Ψ^*Ψ ln(Ψ^*Ψ/ρ) dx

Now, suppose we’re interested in calculating the wave functions Ψ that are local maxima of entropy. This means we want to find the Ψ for which δS = 0. Of course, we also want to ensure that a few basic constraints are satisfied. Namely,

∫ Ψ^*Ψ dx = 1
∫ Ψ^*HΨ dx = E

These constraints are chosen by analogy with the constraints in ordinary statistical mechanics – normalization and average energy. H is the Hamiltonian operator, which corresponds to the energy observable.

We can find the critical points of entropy that satisfy the constraint by using the method of Lagrange multipliers. Our two Lagrange multipliers will be α (for normalization) and β (for energy). This gives us the following equation for Ψ:

Ψ ln(Ψ^*Ψ/ρ) + (α + 1)Ψ + βHΨ = 0
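For readers who want the intermediate step, the condition above follows from varying the constrained functional with respect to Ψ^* while treating Ψ as independent. This is a sketch that assumes ρ is a fixed function of x and picks one sign convention for the multipliers, chosen so that the result matches the equation above:

```latex
\begin{aligned}
J[\Psi,\Psi^*] &= -\int \Psi^*\Psi \,\ln\frac{\Psi^*\Psi}{\rho}\,dx
  \;-\; \alpha\left(\int \Psi^*\Psi\,dx - 1\right)
  \;-\; \beta\left(\int \Psi^* H \Psi\,dx - E\right),\\
\frac{\delta J}{\delta \Psi^*} &= -\Psi \ln\frac{\Psi^*\Psi}{\rho} \;-\; \Psi \;-\; \alpha\Psi \;-\; \beta H\Psi \;=\; 0 .
\end{aligned}
```

The lone −Ψ term comes from differentiating the logarithm, which is where the "+ 1" next to α originates.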

We can rewrite this as an operator equation, which gives us

ln(Ψ^*Ψ/ρ) + (α + 1) + βH = 0
Ψ^*Ψ = (ρ/Z) e^(-βH)

Here we’ve renamed our constants so that Z = e^(α+1) is a normalization constant.

So we’ve solved the wave function equation… but what does this tell us? If you’re familiar with some basic quantum mechanics, our expression should look somewhat familiar to you. Let’s backtrack a few steps to see where this familiarity leads us.

Ψ ln(Ψ^*Ψ/ρ) + (α + 1)Ψ + βHΨ = 0
HΨ + 1/β ln(Ψ^*Ψ/ρ) Ψ = – (α + 1)/β Ψ

Let’s rename – (α + 1)/β to a new constant λ. And we’ll take a hint from statistical mechanics and call 1/β the temperature T of the state. Now our equation looks like

HΨ + T ln(Ψ^*Ψ/ρ) Ψ = λΨ

This equation is almost the time-independent Schrödinger equation. In particular, the Schrödinger equation pops out as the zero-temperature limit of this equation:

As T → 0,
our equation becomes…
HΨ = λΨ

The obvious interpretation of the constant λ in the zero temperature limit is E, the energy of the state.

What about in the infinite-temperature limit?

As T → ∞,
our equation becomes…
Ψ^*Ψ = ρ

Why is this? Because the only solution to the equation in this limit is for ln(Ψ^*Ψ/ρ) → 0, or in other words Ψ^*Ψ/ρ → 1.

And what this means is that in the infinite temperature limit, the critical entropy wave function is just that which gives the prior distribution.

We can interpret this result as a generalization of the Schrödinger equation. Rather than a linear equation, we now have an additional logarithmic nonlinearity. I’d be interested to see how the general solutions to this equation differ from those of the standard equation, but that’s for another post.

HΨ + T ln(Ψ^*Ψ/ρ) Ψ = λΨ

Query sensitivity of evidential updates

June 24, 2018 / July 3, 2018 ~ squarishbracket ~ Leave a comment

Plausible reasoning, unlike logical deduction, is sensitive not only to the information at hand but also to the query process by which the information was obtained.

Judea Pearl, Probabilistic Reasoning in Intelligent Systems

This quote references an interesting feature of inductive reasoning that’s worth unpacking. It points to how much more complex it is to formalize induction than to formalize deduction.

A very simple example of this:

You spread a rumor to your neighbor N. A few days later you hear the same rumor from another neighbor N’. Should you increase your belief in the rumor now that N’ acknowledges it, or should you determine first whether N’ heard it from N?

Clearly, if the only source of information for N’ was N, then your belief should not change. But if N’ independently confirmed the validity of the rumor, you have good reason to increase your belief in it.

In general, when you have both top-down (predictive) and bottom-up (explanatory/diagnostic) inferences in evidential reasoning, it is important to be able to trace back queries. If not, one runs the risk of engaging in circular reasoning.

So far this is all fairly obvious. Now here’s an example that’s more subtle.

Three prisoners problem

Three prisoners have been tried for murder, and their verdicts will be read tomorrow morning. Only one will be declared guilty, and the other two will be declared innocent.

Before sentencing, Prisoner A asks the guard (who knows which prisoner will be declared guilty) to do him a favor and give a letter to one of the other two prisoners who will be released (since only one person will be declared guilty, Prisoner A knows that at least one of the other two prisoners will be released). The guard does so, and later, Prisoner A asks him which of the two prisoners (B or C) he gave the letter to. The guard responds “I gave the letter to Prisoner B.”

Now Prisoner A reasons as follows:

“Previously, my chances of being executed were one in three. Now that I know that B will be released, only C and I remain as candidates for being declared guilty. So now my chances are one in two.”

Is this wrong?

Denote “A is guilty” as G_A, and “B is innocent” as I_B. Now, since G_A → I_B, we have that P(I_B | G_A) = 1. This tells us that we can write

P(G_A | I_B) = P(I_B | G_A) P(G_A) / P(I_B)
= P(G_A) / P(I_B) = ⅓ / ⅔ = ½

The problem with this argument is that we have excluded some of the context of the guard’s response, namely, that the guard could only have answered “I gave the letter to Prisoner B” or “I gave the letter to Prisoner C.” In other words, conditioning on the bare fact “Prisoner B will be declared innocent” leads to the wrong conclusion about the credibility of A’s guilt.

Let’s instead condition on I_B’ = “Guard says that B will be declared innocent.” Now we get

P(G_A | I_B’) = P(I_B’ | G_A) P(G_A) / P(I_B’) = ½ ⋅ ⅓ / ½ = ⅓.

It’s not sufficient to condition on the proposition implied by what the guard said. We must condition on the fact that he said it, which means considering the range of possible statements that he could have made.

In general, we cannot assess only the impact of propositions implied by information. We must also consider what information we could have received.
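The two calculations can be checked by brute-force enumeration. This is a minimal sketch: the function and variable names are made up for illustration, and it assumes (as the ½ in the calculation above does) that the guard flips a fair coin between B and C when A is the guilty one.

```python
from fractions import Fraction

def joint_distribution(p_b_when_a_guilty=Fraction(1, 2)):
    """Return {(guilty, letter_recipient): probability}.

    The guard hands the letter to an innocent prisoner other than A.
    When A is guilty he picks B with probability p_b_when_a_guilty
    (one half by default, i.e. a fair coin flip).
    """
    prior = Fraction(1, 3)
    return {
        ("A", "B"): prior * p_b_when_a_guilty,
        ("A", "C"): prior * (1 - p_b_when_a_guilty),
        ("B", "C"): prior,  # B guilty: C is the only possible recipient
        ("C", "B"): prior,  # C guilty: B is the only possible recipient
    }

def conditional(joint, event, given):
    """P(event | given); both arguments are predicates on (guilty, recipient)."""
    p_given = sum(p for outcome, p in joint.items() if given(outcome))
    p_both = sum(p for outcome, p in joint.items() if given(outcome) and event(outcome))
    return p_both / p_given

joint = joint_distribution()
a_guilty = lambda o: o[0] == "A"
b_innocent = lambda o: o[0] != "B"    # the bare proposition I_B
guard_says_b = lambda o: o[1] == "B"  # the actual observation I_B'

print(conditional(joint, a_guilty, b_innocent))    # 1/2
print(conditional(joint, a_guilty, guard_says_b))  # 1/3
```

The two events differ only in the outcome where B is innocent but the guard names C, and that single outcome accounts for the gap between ½ and ⅓.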

Things get clearer if we consider a similar thought experiment.

1000 prisoners problem

You are one of 1000 prisoners awaiting sentencing, with the knowledge that only one of you will be declared guilty. You come across a slip of paper from the court listing 998 prisoners; each name marked ‘innocent’. You look through all 998 names and find that your name is missing.

This should worry you greatly – your chances of being declared guilty have gone from 1 in 1000 to 1 in 2.

But imagine that you now see the query that produced the list.

Query: “Print the names of any 998 innocent right-handed prisoners.”

If you are the only left-handed prisoner, then you should thank your lucky stars. Why? Because now that you know that the query couldn’t have produced your name, the fact that it didn’t gives you no information. In other words, your chances have gone back from 1 in 2 to 1 in 1000.

In this example you can see very clearly why information about the possible outputs of a query is relevant to how we should update on the actual output of the query. We must know the process by which we attain information in order to be able to accommodate that information into our beliefs.
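Here is the same comparison as a small calculation. It is a sketch under one stated assumption: before you see the query, you take the list to be 998 of the 999 innocent prisoners chosen uniformly at random. The function name is illustrative.

```python
from fractions import Fraction

def posterior_guilty(prior, p_missing_if_guilty, p_missing_if_innocent):
    """Bayes' rule for 'I am guilty' given 'my name is missing from the list'."""
    joint_guilty = prior * p_missing_if_guilty
    joint_innocent = (1 - prior) * p_missing_if_innocent
    return joint_guilty / (joint_guilty + joint_innocent)

prior = Fraction(1, 1000)

# Assumed query 1: "print any 998 of the 999 innocent prisoners", uniformly.
# If I am innocent, I am the one innocent left off with probability 1/999.
any_innocents = posterior_guilty(prior, Fraction(1), Fraction(1, 999))

# Query 2 (from the text): only right-handed innocents are listed, and I am
# the only left-handed prisoner, so my name is missing whatever the verdict.
right_handed_only = posterior_guilty(prior, Fraction(1), Fraction(1))

print(any_innocents)      # 1/2
print(right_handed_only)  # 1/1000
```

The observation is identical in both cases. Only the likelihood of that observation under "I am innocent" changes, and that is what the query determines.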

But now what if we don’t have this information? Suppose that you only run into the list of prisoners, and have no additional knowledge about how it was produced. Well, then we must consider all possible queries that might have produced this output!

This is no small matter.

For simplicity, let’s reconsider the simpler example with just three prisoners: Prisoners A, B, and C. Imagine that you are Prisoner A.

You come across a slip of paper from the court containing the statement I, where

I = “Prisoner B will be declared innocent”.

Now, we must assess the impact of I on the proposition G_A = “Prisoner A is guilty.”

P(G_A | I) = P(I | G_A) P(G_A) / P(I) = ⅓ P(I | G_A) / P(I)

The result of this calculation depends upon P(I | G_A), or in other words, how likely it is that the slip would declare Prisoner B innocent, given that you are guilty. This depends on the query process, and can vary anywhere from 0 to 1. Let’s just give this variable a name: P(I | G_A) = p.

We’ll also need to know two other probabilities: (1) that the slip declares B innocent given that B is guilty, and (2) that it declares B innocent given that C is guilty. We’ll assume that the slip cannot be lying (i.e. that the first of these is zero), and name the second probability q = P(I | G_C).

P(I | G_A) = p (slip could declare either B or C innocent)
P(I | G_B) = 0 (slip could declare either A or C innocent)
P(I | G_C) = q (slip could declare either A or B innocent)

Now we have

P(G_A | I) = ⅓ p / P(I)
= ⅓ p / [P(I | G_A) P(G_A) + P(I | G_B) P(G_B) + P(I | G_C) P(G_C)]
= ⅓ p / (⅓ p + 0 + ⅓ q)
= p/(p + q)

How do we assess this value, given that p and q are unknown? The Bayesian solution is to treat the probabilities p and q as random variables, and specify probability distributions over their possible values: f(p) and g(q). These distributions should contain all of your prior knowledge about the plausible queries that might have produced I.

The final answer is obtained by integrating over all possible values of p and q.

P(G_A | I) = E[p/(p + q)]
= ∫ p/(p + q) f(p) g(q) dp dq

Supposing that we are maximally uncertain about p and q (uniform distributions on [0, 1]), the final answer we obtain is

P(G_A | I) = ∫ p/(p + q) dp dq
= 0.5

Now suppose that we know that the slip could not declare A (yourself) innocent (as we do in the original three prisoners problem). Then we know that q = 1 (since if C is guilty and A couldn’t be on the slip, B is the only possible choice). This gives us

P(G_A | I) = ∫ p/(p + 1) f(p) dp

If we are maximally uncertain about the value of p, we obtain

P(G_A | I) = ∫ p/(p + 1) dp
= 1 – ln(2)
≈ 0.30685

If, on the other hand, we are sure that the value of p is 50% (i.e., we know that in the case that A is guilty, the guard chooses randomly between B and C), we obtain

P(G_A | I) = .5/(.5 + 1) = ⅓

We’ve re-obtained our initial result! Interestingly, we can see that being maximally uncertain about the guard’s procedure for choosing between B and C gives a different answer than knowing that the guard chooses totally randomly between B and C.

Notice that this is true even though these reflect the same expectation of what choice the guard will make!

I.e., in both cases (total uncertainty about p, and knowledge that p is exactly .5), we should have 50% credence in the guard choosing B. This gives us some insight into the importance of considering different types of uncertainty when doing induction, which is a topic for another post.
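The three answers can be reproduced numerically. This is a rough sketch using a plain midpoint rule with illustrative function names, and it takes "maximally uncertain" to mean a uniform density on [0, 1].

```python
import math

def midpoint_1d(f, steps=2000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / steps
    return sum(f((i + 0.5) * h) for i in range(steps)) * h

def midpoint_2d(f, steps=400):
    """Midpoint-rule approximation of the integral of f over the unit square."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * h
        for j in range(steps):
            total += f(p, (j + 0.5) * h)
    return total * h * h

# Uniform f(p) and g(q): both p and q unknown.
both_unknown = midpoint_2d(lambda p, q: p / (p + q))

# q = 1 (the slip cannot name A), p uniform on [0, 1].
p_unknown = midpoint_1d(lambda p: p / (p + 1.0))

# q = 1 and p known to be exactly 1/2.
p_known = 0.5 / (0.5 + 1.0)

print(both_unknown, p_unknown, 1.0 - math.log(2.0), p_known)
```

The gap between the last two cases arises because p/(p + 1) is concave in p, so averaging it over a spread of p values gives less than evaluating it at the average p.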

Summarizing conscious experience

June 20, 2018 / June 24, 2018 ~ squarishbracket ~ 1 Comment

There’s a puzzle for implementing probabilistic reasoning in human beings: the reasoning process in humans starts from conscious experience, and it’s not totally clear how we should update on conscious experiences.

Jeffrey defined a summary of an experience E as a set B of propositions {B₁, B₂, … B_n} such that for every other proposition A in your belief system, P(A | B) = P(A | B, E).

In other words, B is a minimal set of propositions that fully screens off your experience.

This is a useful concept because summary sentences allow you to isolate everything that is epistemically relevant about conscious experience. If you have a summary B of an experience E, then you only need to know P(A | B) and P(B | E) in order to calculate P(A | E).

Notice that the summary set is subjective; it is defined only in terms of properties of your personal belief network. The set of facts that screens off E for you might be different from the set of facts that screens it off for somebody else.

Quick example.

Consider a brief impression by candlelight of a cloth held some distance away from you. Call this experience E.

Suppose that all you could decipher from E is that the cloth was around 2 meters away from you, and that it was either blue (with probability 60%) or green (with probability 40%). Then the summary set for E might be {“The cloth is blue”, “The cloth is green”, “The cloth is 2 meters away from you”, “The cloth is 3 meters away from you”, etc.}.

If this is the right summary set, then the probabilities P(“The cloth is blue”), P(“The cloth is green”) and P(“The cloth is x meters away from you”) should screen off E from the rest of your beliefs.
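As a rough sketch of how a summary gets used, suppose the color propositions alone screen off E for some other proposition A (ignoring distance, for simplicity). Then P(A | E) is the sum over i of P(A | B_i) P(B_i | E). The proposition A and its conditional probabilities below are hypothetical, and so is the function name.

```python
def update_via_summary(p_a_given_b, p_b_given_e):
    """P(A | E) = sum_i P(A | B_i) * P(B_i | E).

    Valid only if the B_i are mutually exclusive and exhaustive and,
    as in the definition of a summary, P(A | B_i, E) = P(A | B_i).
    """
    if abs(sum(p_b_given_e.values()) - 1.0) > 1e-9:
        raise ValueError("P(B_i | E) must sum to 1")
    return sum(p_a_given_b[b] * p_b_given_e[b] for b in p_b_given_e)

# From the text: what the glimpse E tells you about the color.
p_color_given_e = {"blue": 0.6, "green": 0.4}

# Hypothetical numbers for A = "the cloth matches my curtains".
p_match_given_color = {"blue": 0.9, "green": 0.1}

p_match_given_e = update_via_summary(p_match_given_color, p_color_given_e)
print(round(p_match_given_e, 3))  # 0.58
```

The experience E never appears in the calculation except through P(B_i | E), which is exactly what screening off buys you.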

One trouble is that it’s not exactly obvious how to go about converting a given experience into a set of summary propositions. We could always be leaving something out. For instance, one more thing we learned upon observing E was the proposition “I can see light.” This is certainly not screened off by the other propositions so far, so we need to add it in as well.

But how do we know that we’ve gotten everything now? If we think a little more, we realize that we have also learned something about the nature of the light given off by the candle flame. We learn that it is capable of reflecting the color of light that we saw!

But now this additional consideration is related to how we interpret the color of the cloth. In other words, not only might we be missing something from our summary set, but that missing piece might be relevant to how we interpret the others.

I’d like to think more about this question: In general, how do we determine the set of propositions that screens off a given experience from the rest of your beliefs? Ultimately, to be able to coherently assess the impact of experiences on your web of beliefs, your model of reality must contain a model of yourself as an experiencer.

The nature of this model is pretty interesting from a philosophical perspective. Does it arise organically out of factual beliefs about the physical world? Well, this is what a physicalist would say. To me, it seems quite plausible that modeling yourself as a conscious experiencer would require a separate set of rules relating physical happenings to conscious experiences. How we should model this set of rules as a set of a priori hypotheses to be updated on seems very unclear to me.

Simple induction

June 17, 2018 / June 18, 2018 ~ squarishbracket ~ Leave a comment

In front of you is a coin. You don’t know the bias of this coin, but you have some prior probability distribution over possible biases (between 0: always tails, and 1: always heads). This distribution has some statistical properties that characterize it, such as a standard deviation and a mean. And from this prior distribution, you can predict the outcome of the next coin toss.

Now the coin is flipped and lands heads. What is your prediction for the outcome of the next toss?

This is a dead simple example of a case where there is a correct answer to how to reason inductively. It is as correct as any deductive proof, and derives a precise and unambiguous result:

P(H₂ | H₁) = E[p²] / E[p] = µ (1 + σ²/µ²)

Here p is the coin’s bias, and µ and σ are the mean and standard deviation of your prior distribution over p.

This is a law of rational thought, just as rules of logic are laws of rational thought. It’s interesting to me how the understanding of the structure of inductive reasoning begins to erode the apparent boundary between purely logical a priori reasoning and supposedly a posteriori inductive reasoning.

Anyway, here’s one simple conclusion that we can draw from the above equation: After the coin lands heads, you should consider it more likely that the coin will land heads next time. After all, the initial credence was µ, and the final credence is µ multiplied by a value that is necessarily greater than 1 (as long as σ > 0).

You probably didn’t need to see an equation to guess that for each toss that lands H, future tosses landing H become more likely. But it’s nice to see the fundamental justification behind this intuition.

We can also examine some special cases. For instance, consider a uniform prior distribution (corresponding to maximum initial uncertainty about the coin bias). For this distribution (π = 1), µ = 1/2 and σ² = 1/12. Thus, we arrive at the conclusion that after getting one heads, your credence in the next toss landing heads should be 2/3 (67%, up from 50%).
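Here is a small numerical sketch of the update, assuming the rule in question is the standard one, P(next H | one H) = E[p²]/E[p], which can be rewritten as µ(1 + σ²/µ²). The function names are illustrative, and the `peaked` prior is an assumed example added to show that two priors with the same mean update differently.

```python
def moments(prior, steps=100000):
    """Mean and variance of a prior density on [0, 1] (midpoint rule)."""
    h = 1.0 / steps
    xs = [(i + 0.5) * h for i in range(steps)]
    norm = sum(prior(x) for x in xs) * h  # lets the prior be unnormalised
    mean = sum(x * prior(x) for x in xs) * h / norm
    second = sum(x * x * prior(x) for x in xs) * h / norm
    return mean, second - mean * mean

def credence_after_heads(prior):
    """P(next toss is H | one H observed) = mu * (1 + var / mu^2)."""
    mu, var = moments(prior)
    return mu * (1.0 + var / (mu * mu))

uniform = lambda p: 1.0
peaked = lambda p: (p * (1.0 - p)) ** 10  # assumed example: unnormalised Beta(11, 11)

for name, prior in (("uniform", uniform), ("peaked", peaked)):
    mu, var = moments(prior)
    print(name, round(mu, 4), round(var, 4), round(credence_after_heads(prior), 4))
```

Both priors have mean ½, but the peaked one has a much smaller variance and so moves far less after a single heads.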

We can get a sense of the insufficiency of point estimates using this example. Two prior distributions with the same average value will respond very differently to evidence, and thus the final point estimate of the chance of H will differ. But what is interesting is that while the mean is insufficient, just the mean and standard deviation suffice for inferring the value of the next point estimate.

In general, the dynamics are controlled by the term σ/µ. As σ/µ goes to zero (which corresponds to a tiny standard deviation, or a very confident prior), our update goes to zero as well. And as σ/µ gets large (either by a weak prior or a low initial credence in the coin being H-biased), the observation of H causes a greater update.

How large can this term possibly get? Obviously, the updated point estimate should asymptote towards 1, but this is not obvious from the form of the equation we have (it looks like σ/µ can get arbitrarily large, forcing our final point estimate to infinity). What we need to do is optimize the updated point estimate, while taking into account the constraints implied by the relationship between σ and µ.
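One way to see the bound, offered here as a supplementary sketch rather than as the optimization the post has in mind: because the bias p lies in [0, 1], we have p² ≤ p, and taking expectations constrains σ in terms of µ.

```latex
\begin{aligned}
E[p^2] \le E[p] \;&\Longrightarrow\; \sigma^2 + \mu^2 \le \mu
\;\Longrightarrow\; \sigma^2 \le \mu(1-\mu),\\
\mu\left(1 + \frac{\sigma^2}{\mu^2}\right) &= \frac{\mu^2 + \sigma^2}{\mu}
\;\le\; \frac{\mu^2 + \mu(1-\mu)}{\mu} \;=\; 1 .
\end{aligned}
```

So σ/µ can only become large when µ is small, and the updated credence never exceeds 1. Equality requires all of the prior mass to sit on p = 0 and p = 1.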

The North Korea problem isn’t solved

June 14, 2018 ~ squarishbracket ~ Leave a comment

Donald Trump and Kim Jong Un just met and signed a deal committing North Korea to nuclear disarmament. Yay! Problem solved!

Except that there’s a long historical precedent of North Korea signing deals just like this one, only to immediately go back on them. Here’s a timeline for some relevant historical context.

1985: North Korea signs Nuclear Non-Proliferation Treaty
1992: North Korea signs historic agreement to halt nuclear program! (#1)
1993: North Korea is found to be cheating on its commitments under the NPT
1994: In exchange for US assistance in production of proliferation-free nuclear power plants, North Korea signs historic agreement to halt nuclear program! (#2)
1998: North Korea is suspected of having an underground nuclear facility
1998: North Korea launches missile tests over Japan
1999: North Korea signs historic agreement to end missile tests, in exchange for a partial lifting of economic sanctions by the US.
2000: North Korea signs historic agreement to reunify Korea! Nobel Peace Prize is awarded
2002-2003: North Korea admits to having a secret nuclear weapons program, and withdraws from the NPT
2004: North Korea allows an unofficial US delegation to visit its nuclear facilities to display a working nuclear weapon
2005: In exchange for economic and energy assistance, North Korea signs historic agreement to halt nuclear program and denuclearize! (#3)
2006: North Korea fires seven ballistic missiles and conducts an underground nuclear test
2006: North Korea declares support for denuclearization of Korean peninsula
2006: North Korea again supports denuclearization of Korean peninsula
2007: In exchange for energy aid from the US, North Korea signs historic agreement to halt nuclear program! (#4)
2007: N&S Korea sign agreement on reunification
2009: North Korea issues a statement outlining a plan to weaponize newly separated plutonium
2010: North Korea threatens war with South Korea
2010: North Korea again announces commitment to denuclearize
2011: North Korea announces plan to halt nuclear and missile tests
2012: North Korea announces halt to nuclear program
2013: North Korea announces intentions to conduct more nuclear tests
2014: North Korea test fires 30 short-range rockets, as well as two medium missiles into the Sea of Japan
2015: North Korea offers to halt nuclear tests
2016: North Korea announces that it has detonated a hydrogen bomb
2016: North Korea again announces support for denuclearization
2017: North Korea conducts its sixth nuclear test
2018: Kim Jong Un announces that North Korea will mass produce nuclear warheads and ballistic missiles for deployment
2018: In exchange for the cancellation of US-South Korea military exercises, North Korea, once again, commits to “work toward complete denuclearization on the Korean peninsula”

Maybe this time is really, truly different. But our priors should be informed by history, and history tells us that it’s almost certainly not.

Priors in the supernatural

June 13, 2018 / June 24, 2018 ~ squarishbracket ~ Leave a comment

A friend of mine recently told me the following anecdote.

Years back, she and her boyfriend had visited an astrologer in India, who told her the following things: (1) she would end up marrying her boyfriend at the time, (2) down the line they would have two kids, the first a girl and the second a boy, and (3) the exact dates of birth of both children.

Many years down the line, all of these predictions turned out to be true.

I trust this friend a great deal, and don’t have any reason to think that she misremembered the details or lied to me about them. But at the same time, I recognize that astrology is completely crazy.

Since that conversation, I’ve been thinking about the ways in which we can evaluate our de facto priors in supernatural events by consulting either real-world anecdotes or thought experiments. For instance, if we think that each of these two predictions gave us a likelihood ratio of 100:1 in favor of astrology being true, and if I ended up thinking that astrology was about as likely to be true as false, then I must have started with roughly 1:10,000 odds against astrology being true.
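The arithmetic here is Bayes’ rule in odds form, run backwards. A minimal sketch follows; the function name is illustrative, and multiplying the ratios assumes the two predictions count as independent pieces of evidence.

```python
from fractions import Fraction

def implied_prior_odds(posterior_odds, likelihood_ratios):
    """Invert Bayes' rule in odds form: prior = posterior / product of ratios."""
    combined = Fraction(1)
    for ratio in likelihood_ratios:
        combined *= ratio
    return posterior_odds / combined

# Numbers from the text: two predictions at 100:1 each, ending at even odds.
prior_odds = implied_prior_odds(Fraction(1, 1), [Fraction(100), Fraction(100)])
print(prior_odds)  # 1/10000
```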

That’s not crazily low for a belief that contradicts much of our understanding of physics. I would have thought that my prior odds would be something much lower, like 1:10¹⁰ or something. But really put yourself in that situation.

Imagine that you go to an astrologer, who is able to predict an essentially unpredictable sequence of events years down the line, with incredible accuracy. Suppose that the astrologer tells you who you will marry, how many kids you’ll have, and the dates of birth of each. Would you really be totally unshaken by this experience? Would you really believe that it was more likely to have happened by coincidence?

Yes, yes, I know the official Bayesian response – I read it in Jaynes long ago. For beliefs like astrology that contradict our basic understanding of science and causality, we should always have reserved some amount of credence for alternate explanations, even if we can’t think of any on the spot. This reserve of credence will insure us against jumping in credence to 99% upon seeing a psychic continuously predict the number in your head, ensuring sanity and a nice simple secular worldview.

But that response is not sufficient to rule out all strong evidence for the supernatural.

Here’s one such category of strong evidence: evidence for which all alternative explanations are ruled out by the laws of physics as strongly as the supernatural hypothesis is ruled out by the laws of physics.

I think that my anecdote is one such case. If it was true, then there is no good natural alternative explanation for it. The reason? Because the information about the dates of birth of my friend’s children did not exist in the world at the time of the prediction, in any way that could be naturally attainable by any human being.

By contrast, imagine you go to a psychic who tells you to put up some fingers behind your back and then predicts over and over again how many fingers you have up. There are hundreds of alternative explanations for this besides “Psychics are real; science has failed us.” The reason that there are these alternative explanations is that the information predicted by the psychic existed in the world at the time of the prediction.

But in the case of my friend’s anecdote, the information predicted by the astrologer was lost far in the chaotic dynamics of the future.

What this rules out is the possibility that the astrologer somehow obtained the information surreptitiously by any natural means. It doesn’t rule out a host of other explanations, such as that my friend’s perception at the time was mistaken, that her memory of the event is skewed, or that she is lying. I could even, as a last resort, consider the possibility that I hallucinated the entire conversation with her. (I’d like to give the formal title “unbelievable propositions” to the set of propositions that are so unlikely that we should sooner believe that we are hallucinating than accept evidence for them.)

But each of these sources of alternative explanations, with the possible exception of the last, can be made significantly less plausible.

Let me use a thought experiment to illustrate this.

Imagine that you are a nuclear physicist who, with a group of colleagues, has decided to test the predictive powers of a fortune teller. You carefully design an experiment in which a source of true quantum randomness will produce a number between 1 and N. Before the number has been produced, when it still exists only as an unrealized possibility in the wave function, you ask the fortune teller to predict its value.

Suppose that they get it correct. For what value of N would you begin to take their fortune telling abilities seriously?

Here’s how I would react to the success, for different values of N.

N = 10: “Haha, that’s a funny coincidence.”

N = 100: “Hm, that’s pretty weird.”

N = 1000: “What…”

N = 10,000: “Wait, WHAT!?”

N = 100,000: “How on Earth?? This is crazy.”

N = 1,000,000: “Ok, I’m completely baffled.”

I think I’d start taking them seriously as early as N = 10,000. This indicates prior odds of roughly 1:10,000 against fortune-telling abilities (roughly the same as my prior odds against astrology, interestingly!). Once again, this seems disconcertingly low.
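To see how that threshold relates to the prior, here is a sketch of the posterior for each N. It assumes a genuine fortune teller would always be right and a fake one is right by chance with probability 1/N, and it ignores every alternative explanation. The function name is illustrative.

```python
from fractions import Fraction

def posterior_probability(prior_odds, n):
    """Posterior P(ability is real) after one correct guess out of n.

    Assumes likelihood 1 if the ability is real and 1/n if it is not,
    so the likelihood ratio is n:1.
    """
    posterior_odds = prior_odds * n
    return posterior_odds / (1 + posterior_odds)

prior_odds = Fraction(1, 10_000)  # the figure arrived at in the text
for n in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
    print(n, float(posterior_probability(prior_odds, n)))
```

With these prior odds, N = 10,000 is exactly the point at which the posterior reaches one half.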

But let’s try to imagine some alternative explanations.

As far as I can tell, there are only three potential failure points: (1) our understanding of physics, (2) our engineering of the experiment, (3) our perception of the fortune teller’s prediction.

First of all, if our understanding of quantum mechanics is correct, there is no possible way that any agent could do better than random at predicting the number.

Secondly, we stipulated that the experiment was designed meticulously so as to ensure that the information was truly random, and unavailable to the fortune-teller. I don’t think that such an experiment would actually be that hard to design. But let’s go even further and imagine that we’ve designed the experiment so that the fortune teller is not in causal contact with the quantum number-generator until after she has made her prediction.

And thirdly, we can suppose that the prediction is viewed by multiple different people, all of whom affirm that it was correct. We can even go further and imagine that video was taken, and broadcast to millions of viewers, all of whom agreed. Not all of them could just be getting it wrong over and over again. The only possibility is that we’re hallucinating not just the experimental result, but indeed also the public reaction and consensus on the experimental result.

But the hypothesis of a hallucination now becomes inconsistent with our understanding of how the brain works! A hallucination wouldn’t have the effect of creating a perception of a completely coherent reality in which everybody behaves exactly as normal except that they saw the fortune teller make a correct prediction. We’d expect that if this were a hallucination, it would not be so self-consistent.

Pretty much all that’s left, as far as I can tell, is some sort of Cartesian evil demon that’s cleverly messing with our brains to create this bizarre false reality. If this is right, then we’re left weighing the credibility of the laws of physics against the credibility of radical skepticism. And in that battle, I think, the laws of physics lose out. (Consider that the invalidity of radical skepticism is a precondition for the development of laws of physics in the first place.)

The point of all of this is just to sketch an example where I think we’d have a good justification for ruling out all alternative explanations, at least with an equivalent degree of confidence that we have for affirming any of our scientific knowledge.

Let’s bring this all the way back to where we started, with astrology. The conclusion of this blog post is not that I’m now a believer in astrology. I think that there’s enough credence in the buckets of “my friend misremembered details”, “my friend misreported details”, and “I misunderstood details” so that the likelihood ratio I’m faced with is not actually 10,000 to 1. I’d guess it’s something more like 10 to 1.

But I am now that much less confident that astrology is wrong. And I can imagine circumstances under which my confidence would be drastically decreased. While I don’t expect such circumstances to occur, I do find it instructive (and fun!) to think about them. It’s a good test of your epistemology to wonder what it would take for your most deeply-held beliefs to be overturned.

Patterns of inductive inference

June 12, 2018 / June 16, 2018 ~ squarishbracket ~ Leave a comment

I’m currently reading through Judea Pearl’s wonderful book Probabilistic Reasoning in Intelligent Systems. It’s chock-full of valuable insights into the subtle patterns involved in inductive reasoning.

Here are some of the patterns of reasoning described in Chapter 1, ordered in terms of increasing unintuitiveness. Any good system of inductive inference should be able to accommodate all of the following.

Abduction:

If A implies B, then finding that B is true makes A more likely.

Example: If fire implies smoke, smoke suggests fire.

Asymmetry of inference:

There are two types of inference that function differently: predictive vs explanatory. Predictive inference reasons from causes to consequences, whereas explanatory inference reasons from consequences to causes.

Example: Seeing fire suggests that there is smoke (predictive). Seeing smoke suggests that there is a fire (diagnostic).

Induced Dependency:

If you know A, then learning B can suggest C where it wouldn’t have if you hadn’t known A.

Example: Ordinarily, burglaries and earthquakes are unrelated. But if you know that your alarm is going off, then whether or not there was an earthquake is relevant to whether or not there was a burglary.
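Here is a minimal sketch of this pattern in Python, computed by brute-force enumeration over every world-state. All of the numbers (the base rates and the alarm probabilities) are hypothetical values picked only for illustration; they don’t come from Pearl.

```python
from itertools import product

# Hypothetical base rates, chosen only for illustration.
P_BURGLARY = 0.01
P_EARTHQUAKE = 0.02


def p_alarm(burglary, earthquake):
    """Hypothetical chance that the alarm sounds, given its two possible causes."""
    if burglary and earthquake:
        return 0.97
    if burglary:
        return 0.95
    if earthquake:
        return 0.30
    return 0.001


def joint(b, e, a):
    """Probability of one complete world-state (burglary, earthquake, alarm)."""
    pb = P_BURGLARY if b else 1 - P_BURGLARY
    pe = P_EARTHQUAKE if e else 1 - P_EARTHQUAKE
    pa = p_alarm(b, e) if a else 1 - p_alarm(b, e)
    return pb * pe * pa


def prob(event, given=lambda b, e, a: True):
    """P(event | given), summing the joint over all eight world-states."""
    worlds = list(product([False, True], repeat=3))
    numerator = sum(joint(*w) for w in worlds if given(*w) and event(*w))
    denominator = sum(joint(*w) for w in worlds if given(*w))
    return numerator / denominator


burglary = lambda b, e, a: b

p_b = prob(burglary)                                   # no evidence at all
p_b_given_e = prob(burglary, lambda b, e, a: e)        # earthquake only
p_b_given_a = prob(burglary, lambda b, e, a: a)        # alarm only
p_b_given_a_e = prob(burglary, lambda b, e, a: a and e)  # alarm and earthquake

print(p_b, p_b_given_e, p_b_given_a, p_b_given_a_e)
```

Without the alarm, learning about the earthquake leaves the burglary probability untouched. Once the alarm is known, the same earthquake report pulls the burglary probability back down.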

Correlated Evidence:

Upon discovering that multiple sources of evidence have a common origin, the credibility of the hypothesis should be decreased.

Example: You learn on a radio report, TV report, and newspaper report that thousands died. You then learn that all three reports got their information from the same source. This decreases the credibility that thousands died.

Explaining away:

Finding a second explanation for an item of data makes the first explanation less credible. If A and B both suggest C, and C is true, then finding that B is true makes A less credible.

Example: Finding that my light bulb emits red light makes it less credible that the red-hued object in my hand is truly red.

Rule of the hypothetical middle:

If two diametrically opposed assumptions impart two different degrees of belief onto a proposition Q, then the unconditional degree of belief should be somewhere between the two.

Example: The plausibility of an animal being able to fly is somewhere between the plausibility of a bird flying and the plausibility of a non-bird flying.
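In probability theory this rule falls straight out of the law of total probability. As a sketch, write A for one of the two opposed assumptions and ¬A for the other (the symbol A is my own labeling here, not Pearl’s):

```latex
\begin{align*}
P(Q) &= P(Q \mid A)\,P(A) + P(Q \mid \neg A)\,\bigl(1 - P(A)\bigr), \\
\min\{P(Q \mid A),\, P(Q \mid \neg A)\} &\le P(Q) \le \max\{P(Q \mid A),\, P(Q \mid \neg A)\}.
\end{align*}
```

The unconditional belief is a weighted average of the two conditional beliefs, with weights P(A) and 1 - P(A), so it can never fall outside the range they span.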

Defeaters or Suppressors:

Even if as a general rule B is more likely given A, this does not necessarily mean that learning A makes B more credible. There may be other elements in your knowledge base K that explain A away. In fact, given those other elements, learning A might cause B to become less likely (Simpson’s paradox). In other words, updating beliefs must involve searching your entire knowledge base for defeaters of general rules, even ones that are not directly inferentially connected to the evidence you receive.

Example 1: Learning that the ground is wet does not permit us to increase the certainty of “It rained”, because the knowledge base might contain “The sprinkler is on.”
Example 2: You have kidney stones and are seeking treatment. You additionally know that Treatment A makes you more likely to recover from kidney stones than Treatment B. But if you also have the background information that your kidney stones are large, then your recovery under Treatment A becomes less credible than under Treatment B.

Non-Transitivity:

Even if A suggests B and B suggests C, this does not necessarily mean that A suggests C.

Example 1: Your card being an ace suggests it is an ace of clubs. If your card is an ace of clubs, then it is a club. But if it is an ace, this does not suggest that it is a club.
Example 2: The sprinkler being on suggests that the ground is wet. The ground being wet suggests that it rained. But the sprinkler being on does not suggest that it rained.
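The card example is easy to check by counting. This is a small sketch that reads “A suggests B” as “P(B | A) is greater than P(B)”, which is my own reading of “suggests”, for a single card drawn uniformly from a standard 52-card deck.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely cards


def prob(event, given=lambda card: True):
    """P(event | given) for one card drawn uniformly from the deck."""
    pool = [card for card in DECK if given(card)]
    return Fraction(sum(1 for card in pool if event(card)), len(pool))


is_ace = lambda card: card[0] == "A"
is_club = lambda card: card[1] == "clubs"
is_ace_of_clubs = lambda card: is_ace(card) and is_club(card)

# Ace suggests ace of clubs: the probability rises above its baseline.
print(prob(is_ace_of_clubs), prob(is_ace_of_clubs, is_ace))
# Ace of clubs suggests club: the probability rises to certainty.
print(prob(is_club), prob(is_club, is_ace_of_clubs))
# But ace does not suggest club: the probability does not move.
print(prob(is_club), prob(is_club, is_ace))
```

Each link in the chain raises a probability, but the two ends of the chain are independent of each other.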

Non-detachment:

Just learning that a proposition has changed in credibility is not enough to analyze the effects of the change; the reason for the change in credibility is relevant.

Example: You get a phone call telling you that your alarm is going off. Worried about a burglar, you head towards your home. On the way, you hear a radio announcement of an earthquake near your home. This makes it more credible that your alarm really is going off, but less credible that there was a burglary. In other words, the rise in the credibility of your alarm going off came with a drop in the credibility of a burglary, because the rise was caused by the earthquake report, whereas typically evidence that an alarm is going off would make a burglary more credible.

✯✯✯

All of these patterns should make a lot of sense to you when you give them a bit of thought. It turns out, though, that accommodating them in a system of inference is no easy matter.

Pearl distinguishes between extensional and intensional systems, and talks about the challenges for each approach. Extensional systems (including fuzzy logic and non-monotonic logic) focus on extending the truth values of propositions from {0,1} to a continuous range of uncertainty [0, 1], and then modifying the rules according to which propositions combine (for instance, the proposition “A & B” has the truth value min{A, B} in some extensional systems and A*B in others). The locality and simplicity of these combination rules turn out to be their primary failing; they lack the subtlety and nuance required to capture the complicated reasoning patterns above. Their syntactic simplicity makes them easy to work with, but curses them with semantic sloppiness.
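To see the sloppiness concretely, here is a tiny sketch of the two combination rules mentioned above. The negation rule 1 - a is an assumption on my part (it is a common choice in fuzzy logic), and the belief value 0.5 is arbitrary.

```python
def and_min(a, b):
    """Conjunction under the min rule."""
    return min(a, b)


def and_product(a, b):
    """Conjunction under the product rule."""
    return a * b


def negate(a):
    """Assumed negation rule: the belief in not-A is 1 minus the belief in A."""
    return 1 - a


belief_rain = 0.5
belief_no_rain = negate(belief_rain)

# "Rain and not rain" is a contradiction, so its probability should be 0.
min_result = and_min(belief_rain, belief_no_rain)
product_result = and_product(belief_rain, belief_no_rain)

# "Rain and rain" is just "rain", so its probability should stay at 0.5.
product_same = and_product(belief_rain, belief_rain)

print(min_result, product_result, product_same)
```

Because each rule only looks at the two numbers it is handed, it cannot tell whether the propositions behind them are contradictory, identical, or unrelated. Both rules give the contradiction a nonzero value, and the product rule also shrinks the belief in “rain and rain”.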

On the other hand, intensional systems (like probability theory) assign degrees of plausibility to entire world-states (rather than to individual propositions). This allows for the nuance required to capture all of the above patterns, but results in a huge blow-up in complexity. Perfect Bayesianism is computationally infeasible, as the cost of belief updating grows exponentially with the number of atomic propositions. Thus, intensional systems are semantically clear, but syntactically messy.

A good summary of this from Pearl (p 12):

We have seen that handling uncertainties is a rather tricky enterprise. It requires a fine balance between our desire to use the computational permissiveness of extensional systems and our ability to refrain from committing semantic sins. It is like crossing a minefield on a wild horse. You can choose a horse with good instincts, attach certainty weights to it and hope it will keep you out of trouble, but the danger is real, and highly skilled knowledge engineers are needed to prevent the fast ride from becoming a disaster. The other extreme is to work your way by foot with a semantically safe intensional system, such as probability theory, but then you can hardly move, since every step seems to require that you examine the entire field afresh.

The challenge for extensional systems is to accommodate the nuance of correct inductive reasoning.

The challenge for intensional systems is to maintain their semantic clarity while becoming computationally feasible.

Pearl solves the second challenge by supplementing Bayesian probability theory with causal networks that give information about the relevance of propositions to each other, drastically simplifying the tasks of inference and belief propagation.

One more insight from Chapter 1 of the book… Pearl describes four primitive qualitative relationships in everyday reasoning: likelihood, conditioning, relevance, and causation. I’ll give an example of each, and how they are symbolized in Pearl’s formulation.

1. Likelihood (“Tim is more likely to fly than to walk.”)
P(A)

2. Conditioning (“If Tim is sick, he can’t fly.”)
P(A | B)

3. Relevance (“Whether Tim flies depends on whether he is sick.”)
¬(A ⊥ B), i.e. A and B are not independent

4. Causation (“Being sick caused Tim’s inability to fly.”)
P(A | do B)

The challenge is to find a formalism that fits all four of these, while remaining computationally feasible.

If all truths are knowable, then all truths are known

June 8, 2018 / June 13, 2018 ~ squarishbracket

The title of this post is what’s called Fitch’s paradox of knowability.

It’s a weird result that arises from a few very intuitive assumptions about the notion of knowability. I’ll prove it here.

First, let’s list five assumptions. The first of these will be the only strong one – the others should all seem very obviously correct.

Assumptions

1. All truths are knowable.
2. If P & Q is known, then both P and Q are known.
3. Knowledge entails truth.
4. If P is possible and Q can be derived from P, then Q is possible.
5. Contradictions are necessarily false.

Let’s put these assumptions in more formal language by using the following symbolization:

◇P means that P is possible
KP means that P is known by somebody at some time

Assumptions

1. From P, derive ◇KP
2. From K(P & Q), derive KP & KQ
3. From KP, derive P
4. From ◇P and a derivation of Q from P, derive ◇Q
5. –◇(P & –P)

Now, the proof. First in English…

Proof

1. Suppose that P is true and unknown.
2. Then it is knowable that P is true and unknown.
3. Thus it is possible that P is known and that it is known that P is unknown.
4. So it is possible that P is both known and not known.
5. Since 4 is a contradiction, it is not the case that P is true and unknown.
6. In other words, if P is true, then it is known.

Follow all of that? Essentially, we assume that there is some statement P that is both true and unknown. But if this last sentence is true, and if all truths are knowable, then it should be a knowable truth. I.e. it is knowable that P is both true and unknown. But of course this can’t be knowable, since to know that P is both true and unknown is to both know it and not know it. And thus it must be the case that if all truths are knowable, then all truths are known.

I’ll write out the proof more formally now.

Proof

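1. P & –KP    Provisional assumption
2. ◇K(P & –KP)    Assumption 1, applied to 1
3. ◇(KP & K–KP)    Assumptions 2 and 4, applied to 2
4. ◇(KP & –KP)    Assumptions 3 and 4, applied to 3
5. –(P & –KP)    Reductio ad absurdum of 1, since 4 contradicts Assumption 5
6. P → KP    Standard tautology, applied to 5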

I love finding little examples like these where attempts to formalize our intuitions about basic concepts we use all the time lead us into disaster. You can’t simultaneously accept all of the following:

Not all truths are known.
All truths are knowable.
If P & Q is known, then both P and Q are known.
Knowledge entails truth.
If P is possible and P implies Q, then Q is possible.
Contradictions are necessarily false.

Variational Bayesian inference

June 5, 2018 / August 1, 2018 ~ squarishbracket

Today I learned a cool trick for practical implementation of Bayesian inference.

Bayesians are interested in calculating posterior probability distributions of unobserved parameters X, given data (which consists of the values of observed parameters Y).

To do so, they need only know the form of the likelihood function (the probability of Y given X) and their own prior distribution over X. Then they can apply Bayes’ rule…

P(X | Y) = P(Y | X) P(X) / P(Y)

… and voila, Bayesian inference complete.

The trickiest part of this process is calculating the term in the denominator, the marginal likelihood P(Y). Calculating this term is typically very computationally expensive: it involves a sum, over all possible values of the parameters X, of the likelihood multiplied by the prior. If X ranges over a continuous infinity of possible values, then calculating the marginal likelihood amounts to solving a (typically completely intractable) integral.

P(Y) = ∫ P(Y | X) P(X) dX
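For a feel of what this integral looks like in a case where it is tractable, here is a sketch with a toy model of my own choosing (not from any particular source): X is the unknown bias of a coin with a uniform prior on [0, 1], and Y is one specific sequence of flips. The integral is approximated with a plain midpoint rule and compared to the known closed form.

```python
from math import factorial


def likelihood(x, heads, tails):
    """P(Y | X = x): probability of one specific flip sequence given bias x."""
    return x ** heads * (1 - x) ** tails


def prior(x):
    """Uniform prior density over the bias on [0, 1] (an assumed toy prior)."""
    return 1.0


def marginal_likelihood(heads, tails, steps=100_000):
    """Approximate P(Y) = integral of P(Y | X) P(X) dX with the midpoint rule."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * width
        total += likelihood(x, heads, tails) * prior(x) * width
    return total


heads, tails = 7, 3
numeric = marginal_likelihood(heads, tails)
# Closed form for this toy model: heads! * tails! / (heads + tails + 1)!
exact = factorial(heads) * factorial(tails) / factorial(heads + tails + 1)
print(numeric, exact)
```

With a single parameter a grid like this is cheap. The trouble is that the number of grid points grows exponentially with the number of parameters, which is what makes the general case intractable.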

Variational Bayesian inference is a procedure that solves this problem through a clever trick.

We start by searching for a posterior in a space of functions F that are easily integrable.

Our goal is not to find the exact form of the posterior, although if we do, that’s great. Instead, we want to find the function Q(X) within F that is as close to the posterior P(X | Y) as possible.

Closeness between probability distributions is typically measured by the information divergence D(Q, P) (also known as the Kullback-Leibler divergence), which is defined by…

D(Q, P) = ∫ Q(X) log(Q(X) / P(X|Y)) dX

To explicitly calculate and minimize this, we would need to know the form of the posterior P(X | Y) from the start. But let’s plug in the definition of conditional probability…

P(X | Y) = P(X, Y) / P(Y)

D(Q, P) = ∫ Q(X) log(Q(X) P(Y) / P(X, Y)) dX
= ∫ Q(X) log(Q(X) / P(X, Y)) dX + ∫ Q(X) log P(Y) dX

The second term simplifies easily. Since log P(Y) is not a function of X, and Q(X) integrates to 1, the integral just becomes…

∫ Q(X) log P(Y) dX = log P(Y)

Rearranging, we get…

log P(Y) = D(Q, P) – ∫ Q(X) log(Q(X) / P(X, Y)) dX

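The second term, including its minus sign, depends only on Q(X) and the joint probability P(X, Y), which we can calculate easily as the product of the likelihood P(Y | X) and the prior P(X). We name it the variational free energy, L(Q). (Many references call this same quantity the evidence lower bound, and reserve “free energy” for its negative.)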

log P(Y) = D(Q, P) + L(Q)
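This identity is easy to sanity-check numerically. The sketch below uses a discrete X with three values, so the integrals become sums; the prior, likelihood, and trial distribution Q are all hypothetical numbers picked for illustration.

```python
from math import log

# Hypothetical toy model: X takes three values, Y is fixed at its observed value.
prior = [0.5, 0.3, 0.2]        # P(X)
likelihood = [0.1, 0.4, 0.7]   # P(Y | X) at the observed Y

joint = [p * l for p, l in zip(prior, likelihood)]  # P(X, Y)
evidence = sum(joint)                                # P(Y)
posterior = [j / evidence for j in joint]            # P(X | Y)

# An arbitrary trial distribution Q(X); any positive Q that sums to 1 works.
q = [0.2, 0.5, 0.3]

# D(Q, P) = sum of Q log(Q / P(X | Y))
divergence = sum(qi * log(qi / pi) for qi, pi in zip(q, posterior))
# L(Q) = - sum of Q log(Q / P(X, Y))
free_energy = -sum(qi * log(qi / ji) for qi, ji in zip(q, joint))

print(log(evidence), divergence + free_energy)
```

The two printed numbers agree up to floating-point rounding, whichever Q you try. Note that computing L(Q) only ever touches the joint, never the posterior.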

Now, on the left-hand side we have the log of the marginal likelihood, and on the right we have the information divergence plus the variational free energy.

Notice that the left side is not a function of Q. This is really important! It tells us that as we vary Q to minimize D(Q, P), the sum on the right side stays constant.

In other words, any increase in L(Q) is necessarily a decrease in D(Q, P). What this means is that the Q that minimizes D(Q, P) is the same thing as the Q that maximizes L(Q)!

We can use this to minimize D(Q, P) without ever explicitly knowing P.

Recalling the definition of the variational free energy, we have…

L(Q) = – ∫ Q(X) log(Q(X) / P(X, Y)) dX
= ∫ Q(X) log P(X, Y) dX – ∫ Q(X) log Q(X) dX

Both of these integrals are computable insofar as we made a good choice for the function space F. Thus we can exactly find Q*, the best approximation to P in F. Then, knowing Q*, we can calculate L(Q*), which serves as a lower bound on the log of the marginal likelihood P(Y).

log P(Y) = D(Q*, P) + L(Q*), and D(Q*, P) ≥ 0
so log P(Y) ≥ L(Q*)

Summing up…

Variational Bayesian inference approximates the posterior probability P(X | Y) with a function Q(X) in the function space F.
We find the function Q* that is as similar as possible to P(X | Y) by maximizing L(Q).
L(Q*) gives us a lower bound on the log of the marginal likelihood, log P(Y).
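The whole procedure can be sketched on a toy problem. Everything here is an illustrative assumption: a discrete X with three values, hypothetical joint probabilities, a one-parameter family F of Binomial(2, t) distributions, and a crude grid search standing in for a real optimizer.

```python
from math import log

# Hypothetical P(X = x, Y = y) for x in {0, 1, 2}, at the observed y.
joint = [0.05, 0.12, 0.14]
evidence = sum(joint)                       # P(Y), normally unavailable
posterior = [j / evidence for j in joint]   # P(X | Y), normally unavailable


def family(t):
    """A member of the restricted family F: Binomial(2, t) over x in {0, 1, 2}."""
    return [(1 - t) ** 2, 2 * t * (1 - t), t ** 2]


def free_energy(q):
    """L(Q) = - sum of Q log(Q / P(X, Y)); needs only the joint."""
    return -sum(qi * log(qi / ji) for qi, ji in zip(q, joint))


def divergence(q):
    """D(Q, P) = sum of Q log(Q / P(X | Y)); computed here only as a check."""
    return sum(qi * log(qi / pi) for qi, pi in zip(q, posterior))


# Grid search over the family's single parameter t in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
t_star = max(grid, key=lambda t: free_energy(family(t)))
t_min_div = min(grid, key=lambda t: divergence(family(t)))
q_star = family(t_star)

print(t_star, t_min_div)
print(free_energy(q_star), log(evidence))
```

Maximizing L(Q) and minimizing D(Q, P) pick out the same member of F, and L(Q*) sits below log P(Y). The gap between the two is exactly D(Q*, P), which is nonzero here because the true posterior is not itself a member of F.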

The value of the personal

June 4, 2018 / June 3, 2018 ~ squarishbracket

I have been thinking about the value of powerful anecdotes. An easy argument for why we should be very cautious of personal experience and anecdotal evidence is that it has the potential to cause us to over-update. E.g. somebody who hears a few harrowing stories from friends about gun violence in Chicago is more likely to have an overly high estimation of how dangerous Chicago is.

Maybe the best way to formulate our worldview is in a cold and impersonal manner, disregarding most anecdotes in favor of hard data. This is the type of thing I might have once said, but I now think that this approach is likely just as flawed.

First of all, I think it’s an unrealistic demand on most people’s cognitive systems that they toss out the personal and compelling in their worldview.

And second of all, just like personal experience and anecdotal evidence have the potential to cause over-updating, statistical data and dry studies have the potential to cause under-updating.

Reading some psychological studies about the seriousness of the psychological harms of extended periods of solitary confinement is no match for witnessing or personally experiencing the effects of being locked in a tiny room alone for years. There’s a real and important difference between abstractly comprehending a fact and really understanding the fact. Other terms for this second thing include internalizing the fact, embodying it, and making it salient to yourself.

This difference is not easy to capture on a one-dimensional model of epistemology where beliefs are represented as simple real numbers. I’m not even sure if there’d be any good reason to build this distinction into artificial intelligences we might eventually construct. But it is there in us, and has a powerful influence.

How do we know whether somebody has really internalized a belief or not? I’m not sure, but here’s a gesture in what I think is the right direction.

We can conceive of somebody’s view of the world as a massive web of beliefs, where the connections between beliefs indicate dependencies and logical relations. To have fully internalized a belief is to have a web that is fully consistent with the truth of this belief. On the other hand, if you notice that somebody verbally reports that they believe A, but then also seems to believe B, C, and D, where all of these are inconsistent with A, then they have not really internalized A.

The worry is that a cold and impersonal approach to forming your worldview is the type of thing that would result in this type of inconsistency and disconnectedness in your web of beliefs, through the failure to internalize important facts.

Such failures become most obvious when you have a good sense of somebody’s values, and can simply observe their behavior to see what it reveals about their beliefs. If somebody is a pure act utilitarian (I know that nobody actually has a value system as simple as this, but just play along for a moment), then they should be sending their money wherever it would be better spent maximizing utility. If they are not doing so, then this reveals an implicit belief that there is no better way to be maximizing utility than by keeping their own money.

This is sort of an attempt to uncover somebody’s revealed beliefs, to steal the concept of revealed preferences from economics. Conflicts between revealed beliefs and stated beliefs indicate a lack of internalization.