Quantum Computing in a Nutshell

You can think about classical bits as like little switches. They have two possible states, ON and OFF, and are in either one or the other.

37081856_10216918284859437_1668214775789649920_n

Then we have logic gates, which are like operations we perform on the switches. If we have one switch, we might flip it from whatever state it’s in to the opposite state. If we have two, we might flip the second switch conditional on the first switch being ON, and otherwise do nothing. And so on.

37074006_10216918284939439_5899339380094402560_n

Put together a bunch of these bits and gates and you get an algorithm. If you arrange them cleverly enough, you end up getting interesting computations like adding binary numbers and factoring primes and playing Skyrim.

Quantum computing is a fundamentally different beast. To start with, the fundamental units of quantum computation are non-deterministic. We call them qubits, for quantum bits. While a qubit can, like a classical bit, be in the state ON and the state OFF, it can also be in other more exotic states:

37171485_10216918315460202_5674648183184031744_n

Important! Just like a classical bit, a qubit will only ever be observed in two possible states, ON and OFF. We never observe a qubit being in the state “36% ON and 64% OFF.” But if we prepare a qubit in that particular state, and then measure it over and over again, we could find that 36% of the time it is ON and 64% of the time it is OFF. That is, the state of a qubit specifies the probability distribution over possible observed outcomes.

Let’s compare the complexity of bits to the complexity of qubits.

To specify the state of a classical bit, we only need to answer a single yes or no question: Is it ON?

To specify the state of a quantum bit, we need to answer a more complicated question: What is the probability that it is ON?

Since probabilities are real numbers, answering the second question requires infinitely more information than answering the first. (In general, specifying a real number to perfect precision requires answering an unbounded number of yes or no questions.) The implication of this is that in some sense, the state of a qubit contains infinitely more information than the state of a classical bit. The trick of quantum computing is to exploit this enhanced information capacity in order to build quantum algorithms that beat out the best classical algorithms.

Moving on! The set of all states that a qubit can be in is the set of all probability distributions over the outcomes ON and OFF.

37156326_10216918369941564_7471468612121788416_n

Now, we could describe a simple probabilistic extension to classical computation, where logic gates transform probability distributions into different probability distributions, and algorithms neatly combine these gates to get useful computations, and be done with it. But it turns out that things are a slight bit more complicated than this. Having described this probabilistic extension, we would still not have quantum computing.

In fact, the states of qubits are specified not by probability distributions over the observations ON and OFF, but an amplitude distribution over these observations.

What are these amplitudes? An amplitude is simply a thing that you square to get the probability of the observation. Amplitudes have a wider range of possible values than probabilities. While a probability has to be greater than zero, an amplitude can be negative (since the square of a negative number will still be positive). In fact, amplitudes can even be complex numbers, in which case the square of the amplitude is just the square of its distance from zero in the complex plane.

37190749_10216918417262747_1993599483794948096_n

Now, why does it matter that the set of quantum states be described by amplitudes rather than probabilities? After all, the amplitudes will always just be converted back to probabilities when we observe the qubits, so what difference does it make if the probability came from a negative amplitude or a positive one? The answer to this question is difficult, but it comes down to this:

For some reason, it appears that on the quantum level, the universe calculates the states of particles in terms of these complex numbers we call amplitudes, not probabilities. This was the discovery of experimentalists in the 1900s that looked at things like the double slit experiment, the Stern-Gerlach experiment, and so on. If you try to just analyze everything in terms of probabilities instead of amplitudes, you will get the wrong answer.

We’ll write the state of a qubit that has an amplitude alpha of being ON and an amplitude beta of being OFF as follows:

37227024_10216918430103068_5777410781888905216_n

It’s useful to have a few different ways of visualizing the state of a qubit.

37179912_10216918454223671_661326054782140416_n.jpg

Now, quantum gates are transformations that take amplitude distributions to different amplitude distributions.

37199704_10216918490904588_4575874598293209088_n

The set of physically realizable quantum gates is the set of all transformations that take normalized states (|α|² + |β|² = 1) to other normalized states. Some of these gates are given special names like the Hadamard gate or the CNOT gate and crop up all the time. And just like with classical states, you can put together these gates in clever ways to construct algorithms that do interesting computations for you.

So, that’s quantum computing!

Now, the first thing to notice is that the set of all things that quantum computers can do must contain the set of all things that classical computers can do. Why? Because classical computers are just a special case of quantum computers where states are only allowed to be either ON or OFF. Every classical logic gate has a parallel quantum logic gate, and so every classical algorithm is a quantum algorithm is disguise.

But quantum computers can also do more than classical computers. In the next posts, I’ll give two examples of quantum algorithms we’ve discovered that solve problems at speeds that are classically impossible: the Deutsch-Jozsa algorithm and Grover’s algorithm. Stay tuned!

Matter and interactions in quantum mechanics

Here’s another post on how quantum mechanics is weird.

Let’s consider the operation of swapping around different quantum particles. If the coordinates of the first particle are denoted x1 and the second denoted x2, we’ll be interested in the transformation:

Ψ(x1, x2) → Ψ(x2, x1)

Let’s just give this transformation a name. We’ll call it S, for “swap”. S is an operator that is defined as follows:

S Ψ(x1, x2) = Ψ(x2, x1)

Now, clearly if you swap the particles’ coordinates two times, you get back the same thing you started with. That is:

S2 Ψ(x1, x2) = Ψ(x1, x2)

This implies that S2 is an operator with eigenvalue +1. What does this tell us about the spectrum of eigenvalues of S?

Suppose S has an eigenvalue λ. Then, from the above, we see:

S2 Ψ = Ψ
S SΨ = Ψ
S λΨ = Ψ
λ SΨ = Ψ
λ2 Ψ = Ψ

So λ = ±1.

In other words, the only possible eigenvalues of S are +1 and -1.

This tells us that the wave functions of particles must obey one of the following two equations:

SΨ = +Ψ
or
SΨ = -Ψ

Said another way…

Ψ(x1, x2) = Ψ(x2, x1)
or
Ψ(x1, x2) = -Ψ(x2, x1)

This is pretty important! It says that by the nature of what it means to swap particles around and the structure of quantum mechanics, it must be the case that the wave function obtained by switching around coordinates is either the same as the original wave function, or its negative.

This is a huge constraint on the space of functions we’re considering; most functions don’t have this type of symmetry (e.g. x2 + y ≠ y2 + x).

So we have two possible types of wave functions: those that flip signs upon coordinate swaps, and those that don’t. It turns out that these two types of wave functions describe fundamentally different kinds of particles. The first type of wave functions describes fermions, and the second type describes bosons.

All particles in the standard model of particle physics are either fermions or bosons (as expected by the above argument, since there are only two possible eigenvalues of S). Fermions include electrons, quarks, and neutrinos. Bosons include photons, gravitons, and the Higgs boson. This sampling of particles from each category should give you the intuitive sense that fermions are “matter-like” and bosons “force-like.”

Of course, the definition of fermions and bosons we gave above doesn’t reference matter-like or force-like behavior. Somehow the qualitative difference between fermions and bosons must arise from this simple sign difference. How?

We’ll start exploring this by examining the concept of non-separability of wave functions.

A wave function is separable if it is possible to write it as a product of two individual wave functions, one for each coordinate:

Ψ(x1, x2) = Ψ1(x1) Ψ2(x2)

This is a really nice property. For one thing, it allows us to solve the Schrodinger equation extremely easily by using the differential equations method of variable separation. For another, it tells us that the wave function is simple in a way. If we’re interested in only one set of coordinates, say x1, then we can easily disregard the whole rest of the wave function and just look at Ψ1.

But what does it mean? Well, if a wave function is separable, then it is possible to sensibly ask questions about the properties of individual particles, independent of each other. Why? Because you can just look at the component of the wave function corresponding to the particle you’re interested in, and the rest behaves just like a constant for all you care.

If a wave function is non-separable (i.e. if there aren’t any two functions Ψ1 and Ψ2 for which Ψ(x1, x2) = Ψ1(x1) Ψ2(x2)), then the story is trickier. For all intents and purposes, it loses meaning to talk about one particle independent of the other.

This is hard to wrap our classical brains around. If the wave function cannot be separated, then the particles just don’t have positions, momentums, and so on that can be described independently.

Now, it turns out that most of matter is composed of non-separable wave functions. Pretty much any time you have particles interacting, their wave function cannot be written as a product of independent particles. A lot of the time, wave functions are very approximately separable (consider the wave function describing two electrons light years away). But in such cases, when we talk about the two electrons as two distinct entities, we’re really using an approximation that is not fundamentally correct.

Now, this all relates back to the fermion/boson distinction in the following way. Suppose we had a system that was described by a separable wave function.

Ψ(x1, x2) = Ψ1(x1) Ψ2(x2)

Now what happens when we apply the swap operator to Ψ?

S Ψ(x1, x2) = Ψ(x2, x1) = Ψ1(x2) Ψ2(x1)

As we’d expect, we get the particle described by x2 in the wave function initially occupied by the particle described by x1, and vice versa. But now, our system must obey one of two constraints:

Since SΨ = ±Ψ,

Ψ1(x2) Ψ2(x1) = Ψ1(x1) Ψ2(x2)
or
Ψ1(x2) Ψ2(x1) = – Ψ1(x1) Ψ2(x2)

Let’s take the second possibility for fermions first. It turns out that this constraint can never be satisfied. Why? Look:

Suppose f(x) g(y) = – f(y) g(x)
Then we have:  f(x)/g(x) = -f(y)/g(y)

Since the left is a pure function of x, and the right of y, this implies that they are both equal to a constant.
I.e. f(x)/g(x) = k, for some constant k.
But then f(y)/g(y) = k as well.

Thus k = -k,
which can only be true if k = 0.

This argument tells us that the only function that satisfies (1) Ψ describes a fermion and (2) Ψ is separable, is Ψ(x1, x2) = 0. Of course, this is not a wave function, since it is not normalizable. In other words, fermion wave functions are always non-separable!

What about boson wave functions? The equation

Ψ1(x2) Ψ2(x1) = Ψ1(x1) Ψ2(x2)

does have some solutions, but they are highly constrained. Essentially, for this to be true for all x1 and x2, Ψ1 and Ψ2 must be the same function.

In other words, bosons are separable only in the case that their independent wave functions are completely identical.

So. If fermions cannot be described by a separable wave function, how can we describe them? We can be clever and notice the following:

While Ψ(x1, x2) = Ψ1(x1) Ψ2(x2) does not obey the requirement that SΨ = -Ψ,
Ψ(x1, x2) = Ψ1(x1) Ψ2(x2) – Ψ2(x1) Ψ1(x2) does!

Let’s check and see that this is right:

S Ψ(x1, x2)
= S [Ψ1(x1) Ψ2(x2) – Ψ2(x1) Ψ1(x2)]
= Ψ1(x2) Ψ2(x1) – Ψ2(x2) Ψ1(x1)
= -Ψ(x1, x2).

Aha! It works.

Now we have a general description for fermion wave functions that does not violate the swap constraints. We can do something similar for bosons, giving us:

Fermions
Ψ(x1, x2) = Ψ1(x1) Ψ2(x2) – Ψ2(x1) Ψ1(x2)

Bosons
Ψ(x1, x2) = Ψ1(x1) Ψ2(x2) + Ψ2(x1) Ψ1(x2)

(Note: These are not all possible fermion/boson wave functions. They only capture one set of allowable wave functions.)

Now, let’s notice a peculiar feature of the fermion wave function. What happens if we ask for the probability amplitude of the two particles being at the same place at the same time? We find out by taking the fermion equation and setting x1 = x2 = x:

Ψ(x, x) = Ψ1(x) Ψ2(x) – Ψ2(x) Ψ1(x)
= 0

More generally, we can prove that any fermion wave function must have this same property.

SΨ(x1, x2) = -Ψ(x1, x2)
Ψ(x2, x1) = -Ψ(x1, x2)
Ψ(x, x) = -Ψ(x, x)
Ψ(x, x) = 0

Crazy! Apparently, two fermions cannot be at the same position. And since wave functions are smooth, fermions will generally have a small probability of being close together.

This is the case even if the fermions are interacting by a strong attractive force. It’s almost as if there is an intrinsic repulsive force keeping fermions away from each other. But this would be the wrong way to think about it. This feature of the natural world is not discovered by exploring different possible forces and potentials. Instead we got here by reasoning purely abstractly about the nature of swapping particles. We are a priori guaranteed this unusual feature: that if fermions exist, they must have zero probability of being located at the same point in space.

The name for this property is the Pauli exclusion principle. It’s the reason why electrons in atoms spread out in ever-larger orbitals around nuclei instead of all settling down to the nucleus. It is responsible for the entire structure of the periodic table, and without it we couldn’t have chemistry and biology.

(Quick side note: You might have thought of something that seems to throw a wrench in this – namely, that the ground state of atoms can hold two electrons. In general, electron orbitals in atoms are populated by more than one electron. This is possible because the wave function has extra degrees of freedom such as spin, the details of which determine how many electrons can fit in the same spatial orbital. I.e. two fermions can have the same spatial amplitude distribution, so far as they have some other distinct property.)

Mathematically, this arises because of interference effects. As the two fermions get closer and closer to each other, their wave functions interfere with one another more and more, turning their joint probability to zero. Fermions destructively interfere.

And bosons? They constructively interfere! At x1 = x2 = x, for bosons, we have:

Ψ(x, x) = Ψ1(x) Ψ2(x) + Ψ2(x) Ψ1(x)
= 2 Ψ1(x) Ψ2(x)

In other words, while fermions disperse, bosons cluster! Just like how fermions seemed to be pulled away from each other with an imaginary force, bosons seem to be pulled together! (Once again, this is only a poetic description, not to be taken literally. There is no extra force concentrating bosons, as can be evidenced by the straight trajectory of parallel beams of light).

I’m not aware of a name for this principle to complement the Pauli exclusion principle. But it explains phenomena like lasers, where enormous numbers of photons concentrate together to form a powerful beam of light. By contrast, an “electron laser” that concentrates enormous numbers of electrons into a single beam would be enormously difficult to create.

The intrinsic tendency for fermions to repel each other leads them to form complicated structures that spread out through space, and give them the qualitative material resistance to objects passing through them. You just can’t pack them arbitrarily close together – eventually you’ll run out of degrees of freedom and the particles will push each other away.

Bosons, on the other hand, are like ghosts – they can vanish into each others wave functions, congregate together in large numbers and behave as a single entity, and so on.

These differences between fermions and bosons pop straight out of their definitions, and lead us to the qualitative differences between matter and forces!

100 prisoners problem

I’m in the mood for puzzles, so here’s another one. This one is so good that it deserves its own post.

The setup (from wiki):

The director of a prison offers 100 death row prisoners, who are numbered from 1 to 100, a last chance. A room contains a cupboard with 100 drawers. The director randomly puts one prisoner’s number in each closed drawer. The prisoners enter the room, one after another. Each prisoner may open and look into 50 drawers in any order. The drawers are closed again afterwards.

If, during this search, every prisoner finds his number in one of the drawers, all prisoners are pardoned. If just one prisoner does not find his number, all prisoners die. Before the first prisoner enters the room, the prisoners may discuss strategy—but may not communicate once the first prisoner enters to look in the drawers. What is the prisoners’ best strategy?

Suppose that each prisoner selects 50 at random, and don’t coordinate with one another. Then the chance that any particular prisoner gets their own number is 50%. This means that the chance that all 100 get their own number is 1/2¹⁰⁰.

Let me emphasize how crazily small this is. 1/2¹⁰⁰ is 1/1,267,650,600,228,229,401,496,703,205,376; less than one in a decillion. If there were 100 prisoners trying exactly this setup every millisecond, it would take them 40 billion billion years to get out alive once. This is 3 billion times longer than the age of the universe.

Okay, so that’s a bad strategy. Can we do better?

It’s hard to imagine how… While the prisoners can coordinate beforehand, they cannot share any information. So every time a prisoner comes in for their turn at the drawers, they are in exactly the same state of knowledge as if they hadn’t coordinated with the others.

Given this, how could we possibly increase the survival chance beyond 1/2¹⁰⁰?

 

 

(…)

 

 

 

(Try to answer for yourself before continuing)

 

 

 

(…)

 

 

 

Let’s consider a much simpler case. Imagine we have just two prisoners, two drawers, and each one can only open one of them. Now if both prisoners choose randomly, there’s only a 1 in 4 chance that they both survive.

What if they agree to open the same drawer? Then they have reduced their survival chance from 25% to 0%! Why? Because by choosing the same drawer, they either both get the number 1, or they both get the number 2. In either case, they are guaranteed that only one of them gets their own number.

So clearly the prisoners can decrease the survival probability by coordinating beforehand. Can they increase it?

Yes! Suppose that they agree to open different drawers. Then this doubles their survival chance from 25% to 50%. Either they both get their own number, or they both get the wrong number.

The key here is to minimize the overlap between the choices of the prisoners. Unfortunately, this sort of strategy doesn’t scale well. If we have four prisoners, each allowed to open two drawers, then random drawing gives a 1/16 survival chance.

Let’s say they open according to the following scheme: 12, 34, 13, 24 (first prisoner opens drawers 1 and 2, second opens 3 and 4, and so on). Then out of the 24 possible drawer layouts, the only layouts that work are 1432 and 3124:

1234 1243 1324 1342 1423 1432
2134 2143 2314 2341 2413 2431
3124 3142 3214 3241 3412 3421
4123 4132 4213 4231 4312 4321

This gives a 1/12 chance of survival, which is better but not by much.

What if instead they open according to the following scheme: (12, 23, 34, 14)?

1234 1243 1324 1342 1423 1432
2134 2143 2314 2341 2413 2431
3124 3142 3214 3241 3412 3421
4123 4132 4213 4231 4312 4321

Same thing: a 1/12 chance of survival.

Scaling this up to 100 prisoners, the odds of survival look pretty measly. Can they do better than this?

 

 

 

(…)

 

 

 

(Try to answer for yourself before continuing)

 

 

 

(…)

 

 

 

It turns out that yes, there is a strategy that does better at ensuring survival. In fact, it does so much better that the survival chance is over 30 percent!

Take a moment to boggle at this. Somehow we can leverage the dependency induced by the prisoners’ coordination to increase the chance of survival by a factor of one decillion, even though none of their states of knowledge are any different. It’s pretty shocking to me that this is possible.

Here’s the strategy: Each time a prisoner opens a drawer, they consult the number in that drawer to determine which drawer they will open next. Thus each prisoner only has to decide on the first drawer to open, and all the rest of the drawers follow from this. Importantly, the prisoner only knows the first drawer they’ll pick; the other 49 are determined by the distribution of numbers in the drawers.

We can think about each drawer as starting a chain through the other drawers. These chains always cycle back into the starting number, the longest possible cycle being 100 numbers and the shortest being 1. Now, each prisoner can guarantee that they are in a cycle that contains their own number by choosing the drawer corresponding to their own number!

So, the strategy is that Prisoner N starts by choosing Drawer N, looking at the number within, then choosing the drawer labeled with that number. Repeat 50 times per each prisoner.

The wiki page has a good description of how to calculate the survival probability with this strategy:

The prison director’s assignment of prisoner numbers to drawers can mathematically be described as a permutation of the numbers 1 to 100. Such a permutation is a one-to-one mapping of the set of natural numbers from 1 to 100 to itself. A sequence of numbers which after repeated application of the permutation returns to the first number is called a cycle of the permutation. Every permutation can be decomposed into disjoint cycles, that is, cycles which have no common elements.

In the initial problem, the 100 prisoners are successful if the longest cycle of the permutation has a length of at most 50. Their survival probability is therefore equal to the probability that a random permutation of the numbers 1 to 100 contains no cycle of length greater than 50. This probability is determined in the following.

A permutation of the numbers 1 to 100 can contain at most one cycle of length l>50. There are exactly {\tbinom  {100}{l}} ways to select the numbers of such a cycle. Within this cycle, these numbers can be arranged in (l-1)! ways since there are {\displaystyle l-1} permutations to represent distinct cycles of length l because of cyclic symmetry. The remaining numbers can be arranged in (100-l)! ways. Therefore, the number of permutations of the numbers 1 to 100 with a cycle of length l>50 is equal to

{\displaystyle {\binom {100}{l}}\cdot (l-1)!\cdot (100-l)!={\frac {100!}{l}}.}

The probability, that a (uniformly distributed) random permutation contains no cycle of length greater than 50 is calculated with the formula for single events and the formula for complementary events thus given by

Screen Shot 2018-07-04 at 5.44.43 AM

This is about 31.1828%.

This formula easily generalizes to other numbers of prisoners. We can plot the survival chance using this strategy as a function of the number of prisoners:

N prisoners

Amazingly, we can see that this value is asymptoting at a value greater than 30%. The precise limit is 1 – ln2 ≈ 30.685%.

In other words, no matter how many prisoners there are, we can always ensure that the survival probability is greater than 30%! This is pretty remarkable, and I think there are some deep lessons to be drawn from it, but I’m not sure what they are.

Some simple probability puzzles

(Most of these are taken from Ian Hacking’s Introduction to Probability and Inductive Logic.)

  1. About as many boys as girls are born in hospitals. Many babies are born every week at City General. In Cornwall, a country town, there is a small hospital where only a few babies are born every week.

    Define a normal week as one where between 45% and 55% of babies are female. An unusual week is one where more than 55% or less than 45% are girls.

    Which of the following is true:
    (a) Unusual weeks occur equally often at City General and at Cornwall.
    (b) Unusual weeks are more common at City General than at Cornwall.
    (c) Unusual weeks are more common at Cornwall than at City General.

  2. Pia is 31 years old, single, outspoken, and smart. She was a philosophy major. When a student, she was an ardent supporter of Native American rights, and she picketed a department store that had no facilities for nursing mothers.

    Which of the following statements are most probable? Which are least probable?

    (a) Pia is an active feminist.
    (b) Pia is a bank teller.
    (c) Pia works in a small bookstore.
    (d) Pia is a bank teller and an active feminist.
    (e) Pia is a bank teller and an active feminist who takes yoga classes.
    (f) Pia works in a small bookstore and is an active feminist who takes yoga classes.

  3. You have been called to jury duty in a town with only green and blue taxis. Green taxis dominate the market, with 85% of the taxis on the road.

    On a misty winter night a taxi sideswiped another car and drove off. A witness said it was a blue cab. This witness is tested under similar conditions, and gets the color right 80% of the time.

    You conclude about the sideswiping taxi:
    (a) The probability that it is blue is 80%.
    (b) It is probably blue, but with a lower probability than 80%.
    (c) It is equally likely to be blue or green.
    (d) It is more likely than not to be green.

  4. You are a physician. You think that it’s quite likely that a patient of yours has strep throat. You take five swabs from the throat of this patient and send them to a lab for testing.

    If the patient has strep throat, the lab results are right 70% of the time. If not, then the lab is right 90% of the time.

    The test results come back: YES, NO, NO, YES, NO

    You conclude:
    (a) The results are worthless.
    (b) It is likely that the patient does not have strep throat.
    (c) It is slightly more likely than not that the patient does have strep throat.
    (d) It is very much more likely than not that the patient does have strep throat.

  5. In a country, all families wants a boy. They keep having babies till a boy is born. What is the expected ratio of boys and girls in the country?
  6.  Answer the following series of questions:

    If you flip a fair coin twice, do you have the same chance of getting HH as you have of getting HT?

    If you flip the coin repeatedly until you get HH, does this result in the same average number of flips as if you repeat until you get HT?

    If you flip it repeatedly until either HH emerges or HT emerges, is either outcome equally likely?

    You play a game with a friend in which you each choose a sequence of three possible flips (e.g HHT and TTH). You then flip the coin repeatedly until one of the two patterns emerges, and whosever pattern it is wins the game. You get to see your friend’s choice of pattern before deciding yours. Are you ever able to bias the game in your favor?

    Are you always able to bias the game in your favor?

 

Solutions (and lessons)

  1. The correct answer is (a): Unusual weeks occur more often at Cornwall than at City General. Even though the chance of a boy is the same at Cornwall as it is at City General, the percentage of boys from week to week is larger in the smaller city (for N patients a week, the percentage boys goes like 1/sqrt(N)). Indeed, if you think about an extreme case where Cornwall has only one birth a week, then every week will be an unusual week (100% boys or 0% boys).
  2. There is room to debate the exact answer but whatever it is, it has to obey some constraints. Namely, the most probable statement cannot be (d), (e), or (f), and the least probable statement cannot be (a), (b), or (c). Why? Because of the conjunction rule of probability: each of (d), (e), and (f) are conjunctions of (a), (b), and (c), so they cannot be more likely. P(A & B) ≤ P(A).

    It turns out that most people violate this constraint. Many people answer that (f) is the most probable description, and (b) is the least probable. This result is commonly interpreted to reveal a cognitive bias known as the representativeness heuristic – essentially, that our judgements of likelihood are done by considering which descriptions most closely resemble the known facts. In this case,

    Another factor to consider is that prior to considering the evidence, your odds on a given person being a bank teller as opposed to working in a small bookstore should be heavily weighted towards her being a bank teller. There are just far more bank tellers than small bookstore workers (maybe a factor of around 20:1). This does not necessarily mean that (b) is more likely than (c), but it does mean that the evidence must discriminate strongly enough against her being a bank teller so as to overcome the prior odds.

    This leads us to another lesson, which is to not neglect the base rate. It is easy to ignore the prior odds when it feels like we have strong evidence (Pia’s age, her personality, her major, etc.). But the base rate on small bookstore workers and bank tellers are very relevant to the final judgement.

  3. The correct answer is (d) – it is more likely than not that the sideswiper was green. This is a basic case of base rate neglect – many people would see that the witness is right 80% of the time and conclude that the witness’s testimony has an 80% chance of being correct. But this is ignoring the prior odds on the content of the witness’s testimony.

    In this case, there were prior odds of 17:3 (85%:15%) in favor of the taxi being green. The evidence had a strength of 1:4 (20%:80%), resulting in the final odds being 17:12 in favor of the taxi being green. Translating from odds to probabilities, we get a roughly 59% chance of the taxi having been green.

    We could have concluded (d) very simply by just comparing the prior probability (85% for green) with the evidence (80% for blue), and noticing that the evidence would not be strong enough to make blue more likely than green (since 85% > 80%). Being able to very quickly translate between statistics and conclusions is a valuable skill to foster.

  4. The right answer is (d). We calculate this just like we did the last time:

    The results were YES, NO, NO, YES, NO.

    Each YES provides evidence with strength 7:1 (70%/10%) in favor of strep, and each NO provides evidence with strength 1:3 (30%/90%).

    So our strength of evidence is 7:1 ⋅ 1:3 ⋅ 1:3 ⋅ 7:1 ⋅ 1:3 = 49:27, or roughly 1.81:1 in favor of strep. This might be a little surprising… we got more NOs than YESs and the NO was correct 90% of the time for people without strep, compared to the YES being correct only 70% of the time in people with strep.

    Since the evidence is in favor of strep, and we started out already thinking that strep was quite likely, in the end we should be very convinced that they have strep. If our prior on the patient having strep was 75% (3:1 odds), then our probability after getting evidence will be 84% (49:9 odds).

    Again, surprising! The patient who sees these results and hears the doctor declaring that the test strengthens their belief that the patient has strep might feel that this is irrational and object to the conclusion. But the doctor would be right!

  5. Supposing as before that the chance of any given birth being a boy is equal to the chance of it being a girl, we end up concluding…

    The expected ratio of boys and girls in the country is 1! That is, this strategy doesn’t allow you to “cheat” – it has no impact at all on the ratio. Why? I’ll leave this one for you to figure out. Here’s a diagram for a hint:

    36666658_10216831977421805_8359037287605993472_n

    This is important because it applies to the problem of p-hacking. Imagine that all researchers just repeatedly do studies until they get the results they like, and only publish these results. Now suppose that all the researchers in the world are required to publish every study that they do. Now, can they still get a bias in favor of results they like? No! Even though they always stop when getting the result they like, the aggregate of their studies is unbiased evidence. They can’t game the system!

  6.  Answers, in order:

    If you flip a fair coin twice, do you have the same chance of getting HH as you have of getting HT? (Yes)

    If you flip it repeatedly until you get HH, does this result in the same average number of flips as if you repeat until you get HT? (No)

    If you flip it repeatedly until either HH emerges or HT emerges, is either outcome equally likely? (Yes)

    You play a game with a friend in which you each choose a sequence of three coin flips (e.g HHT and TTH). You then flip a coin repeatedly until one of the two patterns emerges, and whosever pattern it is wins the game. You get to see your friend’s choice of pattern before deciding yours. Are you ever able to bias the game in your favor? (Yes)

    Are you always able to bias the game in your favor? (Yes!)

    Here’s a wiki page with a good explanation of this: LINK. A table from that page illustrating a winning strategy for any choice your friend makes:

    1st player’s choice 2nd player’s choice Odds in favour of 2nd player
    HHH THH 7 to 1
    HHT THH 3 to 1
    HTH HHT 2 to 1
    HTT HHT 2 to 1
    THH TTH 2 to 1
    THT TTH 2 to 1
    TTH HTT 3 to 1
    TTT HTT 7 to 1

Deriving the Schrodinger equation

This video contains a really short and sweet derivation of the form of the Schrodinger equation from some fundamental principles. I want to present it here because I like it a lot.

I’m going to assume a lot of background knowledge of quantum mechanics for the purposes of this post, so as to keep it from getting too long. If you want to know more QM, I highly highly recommend Leonard Susskind’s online video lectures.

So! Very brief review of basic QM:

In quantum mechanics, the state of a system is described by a vector in a complex vector space. These vectors are all unit length, and encode all of the observable information about the system. The notation used for a state vector is |Ψ⟩, and is read as “the state vector psi”. By analogy with complex conjugation of numbers, you can also conjugate vectors. These conjugated vectors are written like ⟨φ|. Similarly, any operator A has a conjugate operator A*.

Inner products between vectors are expressed like ⟨φ|Ψ⟩, and represent the “closeness” between these states. If ⟨φ|Ψ⟩ = 0, then the states φ and Ψ are called orthogonal, and are as different as can be. (In particular, there is zero probability of either state being observed as the other.) And if ⟨φ|Ψ⟩ = 1, then the states are indistinguishable, and |Ψ⟩ =|φ⟩.

Now, we’re interested in the dynamics of quantum systems. How do they change in time?

Well, since we’re dealing with vectors, we can very generally suppose that there exists some operator that will take any state vector to the state vector that it evolves into after some amount of time t. Let’s just give this operator a name: U(t). We express the notion of U(t) as a time-evolution operator by writing

U(t)|Ψ(0)⟩ = |Ψ(t)⟩

In other words, take the state Ψ at time 0, apply the operator U(t) to it, and you get back the state Ψ at time t.

Now, what are some basic things we can say about the time-evolution operator?

First: if we evolve forwards in time by a length of time equal to zero, the state will not change. (This is basically definitional.)

I.e. U(0) = I (where I is the identity operator).

Second: Time evolution is always continuous, in that an evolution forwards by an arbitrarily small time period ε will change the state by an amount proportional to ε.

I.e. U(ε) = I + εG (where G is some other operator).

Third: Time evolution preserves orthogonality. If two states are ever orthogonal, then they are always orthogonal. (This is an assumption of conservation of information – the laws of physics don’t cause information to disappear or new information to pop up out of nowhere.)

I.e. ⟨φ(0)|Ψ(0)⟩ = 0  ⇒ ⟨φ(t)|Ψ(t)⟩ = 0

From this we can actually derive a stronger statement, which is that all inner products are conserved over time. (The intuition for this is that if all our orthogonal basis vectors stay orthogonal when we evolve forward in time, then time evolution is something like a rotation, and rotations preserve all inner products.)

I.e. ⟨φ(0)|Ψ(0)⟩ = ⟨φ(t)|Ψ(t)⟩

So our starting point is:

  1. U(t)|Ψ(0)⟩ = |Ψ(t)⟩
  2. U(0) = I
  3. U(ε) = I + εG
  4. ⟨φ(0)|Ψ(0)⟩ = ⟨φ(t)|Ψ(t)⟩

From (1) and (4), we get

U(t)|Ψ(0)⟩ = |Ψ(t)⟩
U(t)|φ(0)⟩ = |φ(t)⟩
so…
⟨φ(t)|Ψ(t)⟩ = ⟨φ(0)|U*(t) U(t)|Ψ(0)⟩
= ⟨φ(0)|Ψ(0)⟩
and therefore…
U*U = I

Operators that satisfy the identity on the final line this are called unitary – they are analogous to complex numbers of unit length.

Let’s use this identity together with (3):

U*(ε) U(ε) = I
(I + εG)(I + εG*) = I
I + ε(G + G*) + ε² G*G = I
ε(G + G*) + ε² G*G = 0
G + G* ≈ 0

In the last line, I’ve used the assumption that ε is arbitrarily small, so that we can throw out factors of ε².

Now, what does this final line tell us? Well, it says that the operator G (which dictates the change in state over an infinitesimal time) is purely imaginary. By analogy, any purely imaginary number y = ix satisfies the identity:

y + y* =  ix + (ix)* = ix – ix = 0

So if G is purely imaginary, it is convenient to consider a new purely real operator H = iG. This operator is Hermitian by construction – it is equal to its complex conjugate. Substituting this operator into our infinitesimal time evolution equation, we get

U(ε) = I – iεH

Now, let’s consider the derivative of a quantum state.

d|Ψ⟩/dt = (|Ψ(t + ε)⟩ – |Ψ(t)⟩) / ε
= (U(ε) – I)|Ψ(t)⟩ / ε
= -iεH|Ψ(t)⟩ / ε
= -iH|Ψ(t)⟩

Thus we get…

d/dt|Ψ⟩ = -iH|Ψ⟩

This is the time-dependent Schrodinger equation, although we haven’t yet specified what this operator H is supposed to be. However, since we know H is Hermitian, we also know that H corresponds to some observable quantity.

It turns out that if we multiply this operator by Planck’s constant ħ, it becomes the Hamiltonian – the operator that corresponds to the observable energy. We’ll just change notation subtly by taking H to be the Hamiltonian – that is, what we would previously have called ħH. Then we get the more familiar form of the time-dependent Schrodinger equation:

iħ d/dt|Ψ⟩ = H|Ψ⟩

 

On existence

Epistemic status: This is a line of thought that I’m not fully on board with, but have been taking more seriously recently. I wouldn’t be surprised if I object to all of this down the line.

The question of whether or not a given thing exists is not an empty question or a question of mere semantics. It is a question which you can get empirical evidence for, and a question whose answer affects what you expect to observe in the world.

Before explaining this further, I want to draw an analogy between ontology and causation (and my attitudes towards them).

Early in my philosophical education, my attitude towards causality was sympathetic to the Humean-style eliminativism, in which causality is a useful construct that isn’t reflected in the fundamental structure of the world. That is, I quickly ‘tossed out’ the notion of causality, comfortable to just talk about the empirical regularities governed by our laws of physics.

Later, upon encountering some statisticians that exposed me to the way that causality is actually calculated in the real world, I began to feel that I had been overly hasty. In fact, it turns out that there is a perfectly rigorous and epistemically accessible formalization of causality, and I now feel that there is no need to toss it out after all.

Here’s an easy way of thinking about this: While the slogan “Correlation does not imply causality” is certainly true, the reverse (“Causality does not imply correlation”) is trickier. In fact, whenever you have a causal relationship between variables, you do end up expecting some observable correlations. So while you cannot deductively conclude a causal relationship from a merely correlational one, you can certainly get evidence for some causal models.

This is just a peek into the world of statistical discovery of causal relationships – going further requires a lot more work. But that’s not necessary for my aim here. I just want to express the following parallel:

Rather than trying to set up a perfect set of necessary and sufficient conditions for application of the term ’cause’, we can just take a basic axiom that any account of causation must adhere to. Namely: Where there’s causation, there’s correlation.

And rather than trying to set up a perfect set of necessary and sufficient conditions for the term ‘existence’, we can just take a basic axiom that any account of existence must adhere to. Namely: If something affects the world, it exists.

This should seem trivially obvious. While there could conceivably be entities that exist without affecting anything, clearly any entity that has a causal impact on the world must exist.

The contrapositive of this axiom is that if something doesn’t exist, it does not affect the world.

Again, this is not a controversial statement. And importantly, it makes ontology amenable to scientific inquiry! Why? Because two worlds with different ontologies will have different repertoires of causes and effects. A world in which nothing exists is a world in which nothing affects anything – a dead, static region of nothingness. We can rule out this world on the basis of our basic empirical observation that stuff is happening.

This short argument attempts to show that ontology is a scientifically respectable concept, and not merely a matter of linguistic game-playing. Scientific theories implicitly assume particular ontologies by relying upon laws of nature which reference objects with causal powers. Fundamentally, evidence that reveals the impotence of these supposed causal powers serves as evidence against the ontological framework of such theories.

I think the temptation to wave off ontological questions as somehow disreputable and unscientific actually springs from the fundamentality of this concept. Ontology isn’t a minor add-on to our scientific theories done to appease the philosophers. Instead, it is built in from the ground floor. We can’t do science without implicitly making ontological assumptions. I think it’s better to make these assumptions explicit and debate about the fundamental principles by which we justify them, then it is to do it invisibly, without further analysis.

Concepts we keep and concepts we toss out

Often when we think about philosophical concepts like identity, existence, possibility, and so on, we find ourselves confronted with numerous apparent paradoxes that require us to revise our initial intuitive conception. Sometimes, however, the revisions necessary to render the concept coherent end up looking so extreme as to make us prefer to just throw out the concept altogether.

An example: claims about identity are implicit in much of our reasoning (“I was thinking about this in the morning” implicitly presumes an identity between myself now and the person resembling me in my apartment this morning). But when we analyze our intuitive conception of identity, we find numerous incoherencies (e.g. through Sorites-style paradoxes in which objects retain their identity through arbitrarily small transformations, but then end up changing their identity upon the conjunction of these transformations anyway).

When faced with these incoherencies, we have a few options: first of all, we can decide to “toss out” the concept of identity (i.e. determine that the concept is too fundamentally paradoxical to be saved), or we can decide to keep it. If we keep it, then we are forced to bite some bullets (e.g. by revising the concept away from our intuitions to a more coherent neighboring concept, or by accepting the incoherencies).

In addition, keeping the concept does not mean thinking that the concept actually successfully applies to anything. For instance, one might keep the concept of free will (in that they have a well-defined personal conception of it), while denying that free will exists. This is the difference between saying “People don’t have free will, and that has consequences X, Y, and Z” and saying “I think that contradictions are so deeply embedded in the concept of free will that it’s fundamentally unsavable, and henceforth I’m not going to reason in terms of it.” I often hop back and forth between these positions, but I think they are really quite different.

One final way to describe this distinction: When faced with a statement like “X exists,” we have three choices: We can say that the statement is true, that it is false, or that it is not a proposition. This third category is what we would say about statements like “Arghleschmargle” or “Colorless green ideas sleep furiously”. While they are sentences that we can speak, they just aren’t the types of things that could be true or false. To throw out the concept of existence is to say that a statement like “X exists” is neither true nor false, and to keep it is to treat it as having a truth value.

I have a clear sense for any given concept whether or not I think it’s better to keep or toss out, and I imagine that others can do the same.  Here’s a table of some common philosophical concepts and my personal response to each:

Keep
Causality
Existence
Justification
Free will
Time
Consciousness
Randomness
Meaning (of life)
Should (ethical)
Essences
Representation / Intentionality

Toss Out
Knowledge
Identity
Possibility
Objects
Forms
Purposes (in the teleological sense)
Beauty

Many of these I’m not sure about, and I imagine I could have my mind easily changed (e.g. identity, possibility, intentionality). Some I’ve even recently changed my mind about (causality, existence). And others I feel quite confident about (e.g. knowledge, randomness, justification).

I’m curious about how others’ would respond… What philosophical concepts do you lean towards keeping, and which concepts do you lean towards tossing out?

Against moral realism

Here’s my primary problem with moral realism: I can’t think of any acceptable epistemic framework that would give us a way to justifiably update our beliefs in the objective truth of moral claims. I.e. I can’t think of any reasonable account of how we could have justified beliefs in objectively true moral principles.

Here’s a sketch of a plausible-seeming account of epistemology. Broad-strokes, there are two sources of justified belief: deduction and induction.

Deduction refers to the process by which we define some axioms and then see what logically follows from them. So, for instance, the axioms of Peano Arithmetic entail the theorem that 1+1=2 – or, in Peano’s language, S(0) + S(0) = S(S(0)). The central reason why reasoning by deduction is reliable is that the truths established are true by definition – they are made true by the way we have constructed our terms, and are thus true in every possible world.

Induction is scientific reasoning – it is the process of taking prior beliefs, observing evidence, and then updating these beliefs (via Bayes’ rule, for instance). The central reason why induction is reliable comes from the notion of causal entanglement. When we make an observation and update our beliefs based upon this observation, the brain state “believes X” has become causally entangled with the truth of the the statement X. So, for instance, if I observe a positive result on a pregnancy test, then my belief in the statement “I am pregnant” has become causally entangled with the truth of the statement “I am pregnant.” It is exactly this that justifies our use of induction in reasoning about the world.

Now, where do moral claims fall? They are not derived from deductive reasoning… that is, we cannot just arbitrarily define right and wrong however we like, and then derive morality from these definitions.

And they are also not truths that can be established through inductive reasoning; after all, objective moral truths are not the types of things that have any causal effects on the world.

In other words, even if there are objective moral truths, we would have no way of forming justified beliefs about this. To my mind, this is a pretty devastating situation for a moral realist. Think about it like this: a moral realist who doesn’t think that moral truths have causal power over the world must accept that all of their beliefs about morality are completely causally independent of their truth. If we imagine keeping all the descriptive truths about the world fixed, and only altering the normative truths, then none of the moral realist’s moral beliefs would change.

So how do they know that they’re in the world where their moral beliefs actually do align with the moral reality? Can they point to any reason why their moral beliefs are more likely to be true than any other moral statements? As far as I can tell, no, they can’t!

Now, you might just object to the particular epistemology I’ve offered up, and suggest some new principle by which we can become acquainted with moral truth. This is the path of many professional philosophers I have talked to.

But every attempt that I’ve heard of for doing this begs the question or resorts to just gesturing at really deeply held intuitions of objectivity. If you talk to philosophers, you’ll hear appeals to a mysterious cognitive ability to reflect on concepts and “detect their intrinsic properties”, even if these properties have no way of interacting with the world, or elaborate descriptions of the nature of “self-evident truths.”

(Which reminds me of this meme)

self-evident-truth-5153703.png

None of this deals with the central issue in moral epistemology, as I see it. This central issue is: How can a moral realist think that their beliefs about morality are any more likely to be true than any random choice of a moral framework?

Explanation is asymmetric

We all regularly reason in terms of the concept of explanation, but rarely think hard about what exactly we mean by this explanation. What constitutes a scientific explanation? In this post, I’ll point out some features of explanation that may not be immediately obvious.

Let’s start with one account of explanation that should seem intuitively plausible. This is the idea that to explain X to a person is to give that person some information I that would have allowed them to predict X.

For instance, suppose that Janae wants an explanation of why Ari is not pregnant. Once we tell Janae that Ari is a biological male, she is satisfied and feels that the lack of pregnancy has been explained. Why? Well, because had Janae known that Ari was a male, she would have been able to predict that Ari would not get pregnant.

Let’s call this the “predictive theory of explanation.” On this view, explanation and prediction go hand-in-hand. When somebody learns a fact that explains a phenomenon, they have also learned a fact that allows them to predict that phenomenon.

 To spell this out very explicitly, suppose that Janae’s state of knowledge at some initial time is expressed by

K1 = “Males cannot get pregnant.”

At this point, Janae clearly cannot conclude anything about whether Ari is pregnant. But now Janae learns a new piece of information, and her state of knowledge is updated to

K2 = “Ari is a male & males cannot get pregnant.”

Now Janae is warranted in adding the deduction

K’ = “Ari cannot get pregnant”

This suggests that added information explains Ari’s non-pregnancy for the same reason that it allows the deduction of Ari’s non-pregnancy.

Now, let’s consider a problem with this view: the problem of relevance.

Suppose a man named John is not pregnant, and somebody explains this with the following two premises:

  1. People who take birth control pills almost certainly don’t get pregnant.
  2. John takes birth control pills regularly.

Now, these two premises do successfully predict that John will not get pregnant. But the fact that John takes birth control pills regularly gives no explanation at all of his lack of pregnancy. Naively applying the predictive theory of explanation gives the wrong answer here.

You might have also been suspicious of the predictive theory of explanation on the grounds that it relied on purely logical deduction and a binary conception of knowledge, not allowing us to accommodate the uncertainty inherent in scientific reasoning. We can fix this by saying something like the following:

What it is to explain X to somebody that knows K is to give them information I such that

(1) P(X | K) is small, and
(2) P(X | K, I) is large.

“Small” and “large’ here are intentionally vague; it wouldn’t make sense to draw a precise line in the probabilities.

The idea here is that explanations are good insofar as they (1) make their explanandum sufficiently likely, where (2) it would be insufficiently likely without them.

We can think of this as a correlational account of explanation. It attempts to root explanations in sufficiently strong correlations.

First of all, we can notice that this doesn’t suffer from a problem with irrelevant information. We can find relevance relationships by looking for independencies between variables. So maybe this is a good definition of scientific explanation?

Unfortunately, this “correlational account of explanation” has its own problems.

Take the following example.

uploaded image

This flagpole casts a shadow of length L because of the angle of elevation of the sun and the height of the flagpole (H). In other words, we can explain the length of the shadow with the following pieces of information:

I1 =  “The angle of elevation of the sun is θ”
I2 = “The height of the lamp post is H”
I3 = Details involving the rectilinear propagation of light and the formation of shadows

Both the predictive and correlational theory of explanation work fine here. If somebody wanted an explanation for why the shadow’s length is L, then telling them I1, I2, and I3 would suffice. Why? Because I1, I2, and Ijointly allow us to predict the shadow’s length! Easy.

X = “The length of the shadow is L.”
(I1 & I2 & I3) ⇒ X
So I1 & I2 & I3 explain X.

And similarly, P(X | I1 & I2 & I3) is large, and P(X) is small. So on the correlational account, the information given explains X.

But now, consider the following argument:

(I1 & I3 & X) ⇒ I2
So I1 & I3 & X explain I2.

The predictive theory of explanation applies here. If we know the length of the shadow and the angle of elevation of the sun, we can deduce the height of the flagpole. And the correlational account tells us the same thing.

But it’s clearly wrong to say that the explanation for the height of the flagpole is the length of the shadow!

What this reveals is an asymmetry in our notion of explanation. If somebody already knows how light propagates and also knows θ, then telling them H explains L. But telling them L does not explain H!

In other words, the correlational theory of explanation fails, because correlation possesses symmetry properties that explanation does not.

This thought experiment also points the way to a more complete account of explanation. Namely, the relevant asymmetry between the length of the shadow and the height of the flagpole is one of causality. The reason why the height of the flagpole explains the shadow length but not vice versa, is that the flagpole is the cause of the shadow and not the reverse.

In other words, what this reveals to us is that scientific explanation is fundamentally about finding causes, not merely prediction or statistical correlation. This causal theory of explanation can be summarized in the following:

An explanation of A is a description of its causes that renders it intelligible.

More explicitly, an explanation of A (relative to background knowledge K) is a set of causes of A that render X intelligible to a rational agent that knows K.