The Central Paradox of Statistical Mechanics: The Problem of The Past

This is the third part in a three-part series on the foundations of statistical mechanics.

  1. The Necessity of Statistical Mechanics for Getting Macro From Micro
  2. Is The Fundamental Postulate of Statistical Mechanics A Priori?
  3. The Central Paradox of Statistical Mechanics: The Problem of The Past

— — —

What I’ve argued for so far is the following set of claims:

  1. To successfully predict the behavior of macroscopic systems, we need something above and beyond the microphysical laws.
  2. This extra thing we need is the fundamental postulate of statistical mechanics, which assigns a uniform distribution over the region of phase space consistent with what you know about the system. This postulate allows us to prove all the things we want to say about the future, such as “gases expand”, “ice cubes melt”, “people age” and so on.
  3. This fundamental postulate is not justifiable on a priori grounds, as it is fundamentally an empirical claim about how frequently different micro states pop up in our universe. Different initial conditions give rise to different such frequencies, so that a claim to a priori access to the fundamental postulate is a claim to a priori access to the precise details of the initial condition of the universe.

 There’s just one problem with all this… apply our postulate to the past, and everything breaks.

 Notice that I said that the fundamental postulate allows us to prove all the things we want to say about the future. That wording was chosen carefully. What happens if you try to apply the microphysical laws + the fundamental postulate to predict the past of some macroscopic system? It turns out that all hell breaks loose. Gases spontaneously contract, ice cubes form from puddles of water, and brains pop out of thermal equilibrium.

 Why does this happen? Very simply, we start with two fully time reversible premises (the microphysical laws and the fundamental postulate). We apply it to present knowledge of some state, the description of which does not specify a special time direction. So any conclusion we get must as a matter of logic be time reversible as well! You can’t start with premises that treat the past as the mirror image of the future, and using just the rules of logical equivalence derive a conclusion that treats the past as fundamentally different from the future. And what this means is that if you conclude that entropy increases towards the future, then you must also conclude that entropy increases towards the past. Which is to say that we came from a higher entropy state, and ultimately (over a long enough time scale and insofar as you think that our universe is headed to thermal equilibrium) from thermal equilibrium.

Let’s flesh this argument out a little more. Consider a half-melted ice cube sitting in the sun. The microphysical laws + the fundamental postulate tell us that the region of phase space consisting of states in which the ice cube is entirely melted is much much much larger than the region of phase space in which it is fully unmelted. So much larger, in fact, that it’s hard to express using ordinary English words. This is why we conclude that any trajectory through phase space that passes through the present state of the system (the half-melted cube) is almost certainly going to quickly move towards the regions of phase space in which the cube is fully melted. But for the exact same reason, if we look at the set of trajectories that pass through the present state of the system, the vast vast vast majority of them will have come from the fully-melted regions of phase space. And what this means is that the inevitable result of our calculation of the ice cube’s history will be that a few moments ago it was a puddle of water, and then it spontaneously solidified and formed into a half-melted ice cube.

This argument generalizes! What’s the most likely past history of you, according to statistical mechanics? It’s not that the solar system coalesced from a haze of gases strewn through space by a past supernova, such that a planet would form in the Goldilocks zone and develop life, which would then gradually evolve through natural selection to the point where you are sitting in whatever room you’re sitting in reading this post. This trajectory through phase space is enormously unlikely. The much much much more likely past trajectory of you through phase space is that a little while ago you were a bunch of particles dispersed through a universe at thermal equilibrium, which happened to spontaneously coalesce into a brain that has time to register a few moments of experience before dissipating back into chaos. “What about all of my memories of the past?” you say. As it happens the most likely explanation of these memories is not that they are veridical copies of real happenings in the universe but illusions, manufactured from randomness.

Basically, if you buy everything I’ve argued in the first two parts, then you are forced to conclude that the universe is most likely near thermal equilibrium, with your current experience of it arising as a spontaneous dip in entropy, just enough to produce a conscious brain but no more. There are at least two big problems with this view.

Problem 1: This conclusion is, we think, extremely empirically wrong! The ice cube in front of you didn’t spontaneously form from a puddle of water, uncracked eggs weren’t a moment ago scrambled, and your memories are to some degree veridical. If you really believe that you are merely a spontaneous dip in entropy, then your prediction for the next minute will be the gradual dissolution of your brain and loss of consciousness. Now, wait a minute and see if this happens. Still here? Good!

Problem 2: The conclusion cannot be simultaneously believed and justified. If you think that you’re a thermal fluctuation, then you shouldn’t credit any of your memories as telling you anything about the world. But then your whole justification to coming to the conclusion in the first place (the experiments that led us to conclude that physics is time-reversible and that the fundamental postulate is true) is undermined! Either you believe it without justification, or you don’t believe despite justification. Said another way, no reflective equilibrium exists at an entropy minimum. David Albert calls this peculiar epistemic state cognitively unstable, as it’s not clear where exactly it should leave you.

Reflect for a moment on how strange of a situation we are in here. Starting from very basic observations of the world, involving its time-reversibility on the micro scale and the increase in entropy of systems, we see that we are inevitably led to the conclusion that we are almost certainly thermal fluctuations, brains popping out of the void. I promise you that no trick has been pulled here, this really is the state of the philosophy of statistical mechanics! The big issue is how to deal with this strange situation.

One approach is to say the following: Our problem is that our predictions work towards the future but not the past. So suppose that we simply add as a new fundamental postulate the proposition that long long ago the universe had an incredibly low entropy. That is, suppose that instead of just starting with the microphysical laws and the fundamental postulate of statistical mechanics, we added a third claims: the Past Hypothesis.

The Past Hypothesis should be understood as an augmentation of our Fundamental Postulate. Taken together, the two postulates say that our probability distribution over possible microstates should not be uniform over phase space. Instead, it should be what you get when you take the uniform distribution, and then condition on the distant past being extremely low entropy. This process of conditioning clearly preferences one direction of time over the other, and so the symmetry is broken.

 It’s worth reflecting for a moment on the strangeness of the epistemic status of the Past Hypothesis. It happens that we have over time accumulated a ton of observational evidence for the occurrence of the Big Bang. But none of this evidence has anything to do with our reasons for accepting the Past Hypothesis. If we buy the whole line of argument so far, our conclusion that something like a Big Bang occurred becomes something that we are forced to believe for deep logical reasons, on pain of cognitive instability and self-undermining belief. Anybody that denies that the Big Bang (or some similar enormously low-entropy past state) occurred has to contend with their view collapsing in self-contradiction upon observing the physical laws!

Is The Fundamental Postulate of Statistical Mechanics A Priori?

This is the second part in a three-part series on the foundations of statistical mechanics.

  1. The Necessity of Statistical Mechanics for Getting Macro From Micro
  2. Is The Fundamental Postulate of Statistical Mechanics A Priori?
  3. The Central Paradox of Statistical Mechanics: The Problem of The Past

— — —

The fantastic empirical success of the fundamental postulate gives us a great amount of assurance that the postulate is good one. But it’s worth asking whether that’s the only reason that we should like this postulate, or if it has some solid a priori justification. The basic principle of “when you’re unsure, just distribute credences evenly over phase space” certainly strikes many people as highly intuitive and justifiable on a priori grounds. But there are some huge problems with this way of thinking, one of which I’ve already hinted at. Here’s a thought experiment that illustrates the problem.

There is a factory in your town that produces cubic boxes. All you know about this factory is that the boxes that they produce all have a volume between 0 m3 and 1 m3. You are going to be delivered a box produced by this factory, and are asked to represent your state of knowledge about the box with a probability distribution. What distribution should you use?

Suppose you say “I should be indifferent over all the possible boxes. So I should have a uniform distribution over the volumes from 0 m3 to 1 m3.” This might seem reasonable at first blush. But what if somebody else said “Yes, you should be indifferent over all the possible boxes, but actually the uniform distribution should be over the side lengths from 0 m to 1 m, not volumes.” This would be a very different probability distribution! For example, if the probability that the side length is greater than .5 m is 50%, then the probability that the volume is greater than (.5)3 = 1/8 is also 50%! Uniform over side length is not the same as uniform over volume (or surface area, for that matter). Now, how do you choose between a uniform distribution over volumes and a uniform distribution over side lengths? After all, you know nothing about the process that the factory is using to produce the boxes, and whether it is based off of volume or side length (or something else); all you know is that all boxes are between 0 m3 and 1 m3.

The lesson of this thought experiment is that the statement we started with (“I should be indifferent over all possible boxes”) was actually not even well-defined. There’s not just one unique measure over a continuous space, and in general the notion that “all possibilities are equally likely” is highly language-dependent.

The exact same applies to phase space, as position and momentum are continuous quantities. Imagine that somebody instead of talking about phase space, only talked about “craze space”, in which all positions become positions cubed, and all momentum values become natural logs of momentum. This space would still contain all possible microstates of your system. What’s more, the fundamental laws of nature could be rewritten in a way that uses only craze space quantities, not phase space quantities. And needless to say, being indifferent over phase space would not be the same as being indifferent over craze space.

Spend enough time looking at attempts to justify a unique interpretation of the statement “All states are equally likely”, when your space of states is a continuous infinity, and you’ll realize that all such attempts are deeply dependent upon arbitrary choices of language. The maximum information entropy probability distribution is afflicted with the exact same problem, because the entropy of your distribution is going to depend on the language you’re using to describe it! The entropy of a distribution in phase space is NOT the same as the entropy of the equivalent distribution transformed to craze space.

Let’s summarize this section. If somebody tells you that the fundamental postulate says that all microstates compatible with what you know about the macroscopic features of your system are equally likely, the proper response is something like “Equally likely? That sounds like you’re talking about a uniform distribution. But uniform over what? Oh, position and momentum? Well, why’d you make that choice?” And if they point out that the laws of physics are expressed in terms of position and momentum, you just disagree and say “No, actually I prefer writing the laws of physics in terms of position cubed and log momentum!” (Substitute in any choice of monotonic functions).

If they object on the grounds of simplicity, point out that position and momentum are only simple as measured from a standpoint that takes them to be the fundamental concepts, and that from your perspective, getting position and momentum requires applying complicated inverse transformations to your monotonic transformation of the chosen coordinates.

And if they object on the grounds of naturalness, the right response is probably something like “Tell me more about this ’naturalness’. How do you know what’s natural or unnatural? It seems to me that your choice of what physical concepts count as natural is a manifestation of deep selection pressures that push any beings whose survival depends on modeling and manipulating their surroundings towards forming an empirically accurate model of the macroscopic world. So that when you say that position is more natural than log(position), what I hear is that the fundamental postulate is a very useful tool. And you can’t use the naturalness of the choice of position to justify the fundamental postulate, when your perception of the naturalness of position is the result of the empirical success of the fundamental postulate!”

In my judgement, none of the a priori arguments work, and fundamentally the reason is that the fundamental postulate is an empirical claim. There’s no a priori principle of rationality that tells us that boxes of gases tend to equilibrate, because you can construct a universe whose initial microstate is such that its entire history is one of entropy radically decreasing, gases concentrating, eggs unscrambling, ice cubes unmelting, and so on. Why is this possible? Because it’s consistent with the microphysical laws that the universe started in an enormously low entropy configuration, so it’s gotta also be consistent with the microphysical laws for the entire universe to spend its entire lifetime decreasing in entropy. The general principle is: If you believe that something is physically possible, then you should believe its time-inverse is possible as well.

Let’s pause and take stock. What I’ve argued for so far is the following set of claims:

  1. To successfully predict the behavior of macroscopic systems, we need something above and beyond the microphysical laws.
  2. This extra thing we need is the fundamental postulate of statistical mechanics, which assigns a uniform distribution over the region of phase space consistent with what you know about the system. This postulate allows us to prove all the things we want to say about the future, such as “gases expand”, “ice cubes melt”, “people age” and so on.
  3. This fundamental postulate is not justifiable on a priori grounds, as it is fundamentally an empirical claim about how frequently different microstates pop up in our universe. Different initial conditions give rise to different such frequencies, so that a claim to a priori access to the fundamental postulate is a claim to a priori access to the precise details of the initial condition of the universe.

There’s just one problem with all this… apply our postulate to the past, and everything breaks.

Up next: Why does statistical mechanics give crazy answers about the past? Where did we go wrong?

The Necessity of Statistical Mechanics for Getting Macro From Micro

This is the first part in a three-part series on the foundations of statistical mechanics.

  1. The Necessity of Statistical Mechanics for Getting Macro From Micro
  2. Is The Fundamental Postulate of Statistical Mechanics A Priori?
  3. The Central Paradox of Statistical Mechanics: The Problem of The Past

— — —

Let’s start this out with a thought experiment. Imagine that you have access to the exact fundamental laws of physics. Suppose further that you have unlimited computing power, for instance, you have an oracle that can instantly complete any computable task. What then do you know about the world?

The tempting answer: Everything! But of course, upon further consideration, you are missing a crucial ingredient: the initial conditions of the universe. The laws themselves aren’t enough to tell you about your universe, as many different universes are compatible with the laws. By specifying the state of the universe at any one time (which incidentally does not have to be an “initial” time), though, you should be able to narrow down this set of compatible universes. So let’s amend our question:

Suppose that you have unlimited computing power, that you know the exact microphysical laws, and that you know the state of the universe at some moment. Then what do you know about the world?

The answer is: It depends! What exactly do you know about the state of the universe? Do you know it’s exact microstate? As in, do you know the position and momentum of every single particle in the universe? If so, then yes, the entire past and future of the universe are accessible to you. But suppose that instead of knowing the exact microstate, you only have access to a macroscopic description of the universe. For example, maybe you have a temperature map as well as a particle density function over the universe. Or perhaps you know the exact states of some particles, just not all of them.

Well, if you only have access to the macrostate of the system (which, notice, is the epistemic situation that we find ourselves in, being that full access to the exact microstate of the universe is as technologically remote as can be), then it should be clear that you can’t specify the exact microstate at all other times. This is nothing too surprising or interesting… starting with imperfect knowledge you will not arrive at perfect knowledge. But we might hope that in the absence of a full description of the microstate of the universe at all other times, you could at least give a detailed macroscopic description of the universe at other times.

That is, here’s what seems like a reasonable expectation: If I had infinite computational power, knew the exact microphysical laws, and knew, say, that a closed box was occupied by a cloud of noninteracting gas in its corner, then I should be able to draw the conclusion that “The gas will disperse.” Or, if I knew that an ice cube was sitting outdoors on a table in the sun, then I should be able to apply my knowledge of microphysics to conclude that “The ice cube will melt”. And we’d hope that in addition to being able to make statements like these, we’d also be able to produce precise predictions for how long it would take for the gas to become uniformly distributed over the box, or for how long it would take for the ice cube to melt.

Here is the interesting and surprising bit. It turns out that this is in principle impossible to do. Just the exact microphysical laws and an infinity of computing power is not enough to do the job! In fact, the microphysical laws will in general tell us almost nothing about the future evolution or past history of macroscopic systems!

Take this in for a moment. You might not believe me (especially if you’re a physicist). For one thing, we don’t know the exact form of the microphysical laws. It would seem that such a bold statement about their insufficiencies would require us to at least first know what they are, right? No, it turns out that the statement that microphysics is is far too weak to tell us about the behavior of macroscopic systems holds for an enormously large class of possible laws of physics, a class that we are very sure that our universe belongs to.

Let’s prove this. We start out with the following observation that will be familiar to physicists: the microphysical laws appear to be time-reversible. That is, it appears to be the case that for every possible evolution of a system compatible with the laws of physics, the time-reverse of that evolution (obtained by simply reversing the trajectories of all particles) is also perfectly compatible with the laws of physics.*

This is surprising! Doesn’t it seem like there are trajectories that are physically possible for particles to take, such that their time reverse is physically impossible? Doesn’t it seem like classical mechanics would say that a ball sitting on the ground couldn’t suddenly bounce up to your hand? An egg unscramble? A gas collect in the corner of a room? The answer to all of the above is no. Classical mechanics, and fundamental physics in general, admits the possibilities of all these things. A fun puzzle for you is to think about why the first example (the ball initially at rest on the ground bouncing up higher and higher until it comes to rest in your hand) is not a violation of the conservation of energy.

Now here’s the argument: Suppose that you have a box that you know is filled with an ideal gas at equilibrium (uniformly spread through the volume). There are many many (infinitely many) microstates that are compatible with this description. We can conclusively say that in 15 minutes the gas will still be dispersed only if all of these microstates, when evolved forward 15 minutes, end up dispersed.

But, and here’s the crucial step, we also know that there exist very peculiar states (such as the macrostate in which all the gas particles have come together to form a perfect statuette of Michael Jackson) such that these states will in 15 minutes evolve to the dispersed state. And by time reversibility, this tells us that there is another perfectly valid history of the gas that starts uniformly dispersed and evolves over 15 minutes into a perfect statuette of Michael Jackson. That is, if we believe that complicated configurations of gases disperse, and believe that physics is time-reversible, then you must also believe that there are microstates compatible with dispersed states of gas that will in the next moment coalesce into some complicated configuration.

  1. A collection of gas shaped exactly like Michael Jackson will disperse uniformly across its container.
  2. Physics is time reversible.
  3. So uniformly dispersed gases can coalesce into a collection of gases shaped exactly like Michael Jackson.

At this point you might be thinking “yeah, sure, microphysics doesn’t in principle rule out the possibility that a uniformly dispersed gas will coalesce into Michael Jackson, or any other crazy configuration. But who cares? It’s so incredibly unlikely!” To which the response is: Yes, exactly, it’s extremely unlikely. But nothing in the microphysical laws says this! Look as hard as you can at the laws of motion, you will not find a probability distribution over the likelihood of the different microstates compatible with a given macrostate. And indeed, different initial conditions of the universe will give different such frequencies distributions! To make any statements about the relative likelihood of some microstates over others, you need some principle above and beyond the microphysical laws.

To summarize. All that microphysics + infinite computing power allows you to say about a macrostate is the following: Here are all the microstates that are compatible with that macrostate, and here are all the past and future histories of each of these microstates. And given time reversibility, these future histories cover an enormously diverse set of predictions about the future, from “the gas will disperse” to “the gas will form into a statuette of Michael Jackson”. To get reasonable predictions about how the world will actually behave, we need some other principle, a principle that allows us to disregard these “perverse” microstates. And microphysics contains no such principle.

Statistical mechanics is thus the study of the necessary augmentation to a fundamental theory of physics that allows us to make predictions about the world, given that we are not in the position to know its exact microstate. This necessary augmentation is known as the fundamental postulate of statistical mechanics, and it takes the form of a probability distribution over microstates. Some people describe the postulate as saying “all microstates being equally likely”, but that phrasing is a big mistake, as the sentence “all states are equally likely” is not well defined over a continuous set of states. (More on that in a bit.) To really understand the fundamental postulate, we have to introduce the notion of phase space.

The phase space for a system is a mathematical space in which every point represents a full specification of the positions and momenta of all particles in the system. So, for example, a system consisting of 1000 classical particles swimming around in an infinite universe would have 6000 degrees of freedom (three position coordinates and three momentum coordinates per particle). Each of these degrees of freedom is isomorphic to the real numbers. So phase space for this system must be 6000, and a point in phase space is a specification of the values of all 6000 degrees of freedom. In general, for N classical particles, phase space is 6N.

With the concept of phase space in hand, we can define the fundamental postulate of statistical mechanics. This is: the probability distribution over microstates compatible with a given macrostate is uniform over the corresponding volume of phase space.

It turns out that if you just measure the volume of the “perverse states” in phase space, you end up finding that it composes approximately 0% of the volume of compatible microstates in phase space. This of course allows us to say of perverse states, “Sure they’re there, and technically it’s possible that my system is in such a state, but it’s so incredibly unlikely that it makes virtually no impact on my prediction of the future behavior of my system.” And indeed, when you start going through the math and seeing the way that systems most likely evolve given the fundamental postulate, you see that the predictions you get match beautifully with our observations of nature.

Next time: What is the epistemic status of the fundamental postulate? Do we have good a priori reasons to believe it?

— — —

* There are some subtleties here. For one, we think that there actually is a very small time asymmetry in the weak nuclear force. And some collapse interpretations of quantum mechanics have the collapse of the wave function as an irreversible process, although Everettian quantum mechanics denies this. For the moment, let’s disregard all of that. The time asymmetry in the weak nuclear force is not going to have any relevant effect on the proof made here, besides making it uglier and more complicated. What we need is technically not exact time-reversibility, but very-approximate time-reversibility. And that we have. Collapsing wave functions are a more troubling business, and are a genuine way out of the argument made in this post.

Decoherence is not wave function collapse

In the double slit experiment, particles travelling through a pair of thin slits exhibit wave-like behavior, forming an interference pattern where they land that indicates that the particles in some sense travelled through both slits.


Now, suppose that you place a single spin bit at the top slit, which starts off in the state |↑⟩ and flips to |↓⟩ iff a particle travels through the top slit. We fire off a single particle at a time, and then each time swap out that spin bit for a new spin bit that also starts off in the state |↑⟩. This serves as an extremely simple measuring device which encodes the information about which slit each particle went through.

Now what will you observe on the screen? It turns out that you’ll observe the classically expected distribution, which is a simple average over the two individual possibilities without any interference.


Okay, so what happened? Remember that the first pattern we observed was the result of the particles being in a superposition over the two possible paths, and then interfering with each other on the way to the detector screen. So it looks like simply having one bit of information recording the path of the particle was sufficient to collapse the superposition! But wait! Doesn’t this mean that the “consciousness causes collapse” theory is wrong? The spin bit was apparently able to cause collapse all by itself, so assuming that it isn’t a conscious system, it looks like consciousness isn’t necessary for collapse! Theory disproved!

No. As you might be expecting, things are not this simple. For one thing, notice that this ALSO would prove as false any other theory of wave function collapse that doesn’t allow single bits to cause collapse (including anything about complex systems or macroscopic systems or complex information processing). We should be suspicious of any simple argument that claims to conclusively prove a significant proportion of experts wrong.

To see what’s going on here, let’s look at what happens if we don’t assume that the spin bit causes the wave function to collapse. Instead, we’ll just model it as becoming fully entangled with the path of the particle, so that the state evolution over time looks like the following:


Now if we observe the particle’s position on the screen, the probability distribution we’ll observe is given by the Born rule. Assuming that we don’t observe the states of the spin bits, there are now two qualitatively indistinguishable branches of the wave function for each possible position on the screen. This means that the total probability for any given landing position will be given by the sum of the probabilities of each branch:


But hold on! Our final result is identical to the classically expected result! We just get the probability of the particle getting to |j⟩ from |A⟩, multiplied by the probability of being at |A⟩ in the first place (50%), plus the probability of the particle going from |B⟩ to |j⟩ times the same 50% for the particle getting to |B⟩.


In other words, our prediction is that we’d observe the classical pattern of a bunch of individual particles, each going through exactly one slit, with 50% going through the top slit and 50% through the bottom. The interference has vanished, even though we never assumed that the wave function collapsed!

What this shows is that wave function collapse is not required to get particle-like behavior. All that’s necessary is that the different branches of the superposition end up not interfering with each other. And all that’s necessary for that is environmental decoherence, which is exactly what we had with the single spin bit!

In other words, environmental decoherence is sufficient to produce the same type of behavior that we’d expect from wave function collapse. This is because interference will only occur between non-orthogonal branches of the wave function, and the branches become orthogonal upon decoherence (by definition). A particle can be in a superposition of multiple states but still act as if it has collapsed!

Now, maybe we want to say that the particle’s wave function is collapsed when its position is measured by the screen. But this isn’t necessary either! You could just say that the detector enters into a superposition and quickly decoheres, such that the different branches of the wave function (one for each possible detector state) very suddenly become orthogonal and can no longer interact. And then you could say that the collapse only really happens once a conscious being observes the detector! Or you could be a Many-Worlder and say that the collapse never happens (although then you’d have to figure out where the probabilities are coming from in the first place).

You might be tempted to say at this point: “Well, then all the different theories of wave function collapse are empirically equivalent! At least, the set of theories that say ‘wave function collapse = total decoherence + other necessary conditions possibly’. Since total decoherence removes all interference effects, the results of all experiments will be indistinguishable from the results predicted by saying that the wave function collapsed at some point!”

But hold on! This is forgetting a crucial fact: decoherence is reversible, while wave function collapse is not!!! 

Screen Shot 2019-03-17 at 8.21.51 PM
Pretty picture from doi: 10.1038/srep15330

Let’s say that you run the same setup before with the spin bit recording the information about which slit the particle went through, but then we destroy that information before it interacts with the environment in any way, therefore removing any traces of the measurement. Now the two branches of the wave function have “recohered,” meaning that what we’ll observe is back to the interference pattern! (There’s a VERY IMPORTANT caveat, which is that the time period during which we’re destroying the information stored in the spin bit must be before the particle hits the detector screen and the state of the screen couples to its environment, thus decohering with the record of which slit the particle went through).

If you’re a collapse purist that says that wave function collapse = total decoherence (i.e. orthogonality of the relevant branches of the wave function), then you’ll end up making the wrong prediction! Why? Well, because according to you, the wave function collapsed as soon as the information was recorded, so there was no “other branch of the wave function” to recohere with once the information was destroyed!

This has some pretty fantastic implications. Since IN PRINCIPLE even the type of decoherence that occurs when your brain registers an observation is reversible (after all, the Schrodinger equation is reversible), you could IN PRINCIPLE recohere after an observation, allowing the branches of the wave function to interfere with each other again. These are big “in principle”s, which is why I wrote them big. But if you could somehow do this, then the “Consciousness Causes Collapse” theory would give different predictions from Many-Worlds! If your final observation shows evidence of interference, then “consciousness causes collapse” is wrong, since apparently conscious observation is not sufficient to cause the other branches of the wave function to vanish. Otherwise, if you observe the classical pattern, then Many Worlds is wrong, since the observation indicates that the other branches of the wave function were gone for good and couldn’t come back to recohere.

This suggests a general way to IN PRINCIPLE test any theory of wave function collapse: Look at processes right beyond the threshold where the theory says wave functions collapse. Then implement whatever is required to reverse the physical process that you say causes collapse, thus recohering the branches of the wave function (if they still exist). Now look to see if any evidence of interference exists. If it does, then the theory is proven wrong. If it doesn’t, then it might be correct, and any theory of wave function collapse that demands a more stringent standard for collapse (including Many-Worlds, the most stringent of them all) is proven wrong.

Wave function entropy

Entropy is a feature of probability distributions, and can be taken to be a quantification of uncertainty.

Standard quantum mechanics takes its fundamental object to be the wave function – an amplitude distribution. And from an amplitude distribution Ψ you can obtain a probability distribution Ψ*Ψ.

So it is very natural to think about the entropy of a given quantum state. For some reason, it looks like this concept of wave function entropy is not used much in physics. The quantum-mechanical version of entropy that is typically referred to is the Von-Neumann entropy, which involves uncertainty over which quantum state a system is in (rather than uncertainty intrinsic to a quantum state).

I’ve been looking into some of the implications of the concept of wave function entropy, and found a few interesting things.

Firstly, let’s just go over what precisely wave function entropy is.

Quantum mechanics is primarily concerned with calculating the wave function Ψ(x), which distributes complex amplitudes over phase space. The physical meaning of these amplitudes is interpreted by taking their absolute square Ψ*Ψ, which is a probability distribution.

Thus, the entropy of the wave function is given by:

S = – ∫ Ψ*Ψ ln(Ψ*Ψ) dx

As an example, I’ll write out some of the wave functions for the basic hydrogen atom:

*Ψ)1s = e-2r / π
*Ψ)2s = (2 – r)2 e-r / 32π

*Ψ)2p = r2 e-r cos(θ) / 32π
*Ψ)3s = (2r2 – 18r + 27)2 e-⅔r / 19683π

With these wave functions in hand, we can go ahead and calculate the entropies! Some of the integrals are intractable, so using numerical integration, we get:

S1s ≈ 70
S2s ≈ 470
S2p ≈ 326
S3s ≈ 1320

The increasing values for (1s, 2s, 3s) make sense – higher energy wave functions are more dispersed, meaning that there is greater uncertainty in the electron’s spatial distribution.

Let’s go into something a bit more theoretically interesting.

We’ll be interested in a generalization of entropy – relative entropy. This will quantify, rather than pure uncertainty, changes in uncertainty from a prior probability distribution ρ to our new distribution Ψ*Ψ. This will be the quantity we’ll denote S from now on.

S = – ∫ Ψ*Ψ ln(Ψ*Ψ/ρ) dx

Now, suppose we’re interested in calculating the wave functions Ψ that are local maxima of entropy. This means we want to find the Ψ for which δS = 0. Of course, we also want to ensure that a few basic constraints are satisfied. Namely,

∫ Ψ*Ψ dx = 1
∫ Ψ*HΨ = E

These constraints are chosen by analogy with the constraints in ordinary statistical mechanics – normalization and average energy. H is the Hamiltonian operator, which corresponds to the energy observable.

We can find the critical points of entropy that satisfy the constraint by using the method of Lagrange multipliers. Our two Lagrange multipliers will be α (for normalization) and β (for energy). This gives us the following equation for Ψ:

Ψ ln(Ψ*Ψ/ρ) + (α + 1)Ψ + βHΨ = 0

We can rewrite this as an operator equation, which gives us

ln(Ψ*Ψ/ρ) + (α + 1) + βH = 0
Ψ*Ψ = ρ/Z e-βH

Here we’ve renamed our constants so that Z =  eα+1 is a normalization constant.

So we’ve solved the wave function equation… but what does this tell us? If you’re familiar with some basic quantum mechanics, our expression should look somewhat familiar to you. Let’s backtrack a few steps to see where this familiarity leads us.

Ψ ln(Ψ*Ψ/ρ) + (α + 1)Ψ + βHΨ = 0
HΨ + 1/β ln(Ψ*Ψ/ρ) Ψ = – (α + 1)/β Ψ

Let’s rename – (α + 1)/β to a new constant λ. And we’ll take a hint from statistical mechanics and call 1/β the temperature T of the state. Now our equation looks like

HΨ + T ln(Ψ*Ψ/ρ) Ψ = λΨ

This equation is almost the Schrodinger equation. In particular, the Schrodinger equation pops out as the zero-temperature limit of this equation:

As T → 0,
our equation becomes…
HΨ = λΨ

The obvious interpretation of the constant λ in the zero temperature limit is E, the energy of the state. 

What about in the infinite-temperature limit?

As T → ∞,
our equation becomes…
Ψ*Ψ = ρ

Why is this? Because the only solution to the equation in this limit is for ln(Ψ*Ψ/ρ) → 0, or in other words Ψ*Ψ/ρ → 1

And what this means is that in the infinite temperature limit, the critical entropy wave function is just that which gives the prior distribution.

We can interpret this result as a generalization of the Schrodinger equation. Rather than a linear equation, we now have an additional logarithmic nonlinearity. I’d be interested to see how the general solutions to this equation differ from the standard equations, but that’s for another post.

HΨ + T ln(Ψ*Ψ/ρ) Ψ = λΨ

Bayesian experimental design

We can use the concepts in information theory that I’ve been discussing recently to discuss the idea of optimal experimental design. The main idea is that when deciding which experiment to run out of some set of possible experiments, you should choose the one that will generate the maximum information. Said another way, you want to choose experiments that are surprising as possible, since these provide the strongest evidence.

An example!

Suppose that you have a bunch of face-down cups in front of you. You know that there is a ping pong ball underneath one of the cups, and want to discover which one it is. You have some prior distribution of probabilities over the cups, and are allowed to check under exactly one cup. Which cup should you choose in order to get the highest expected information gain?

The answer to this question isn’t extremely intuitively obvious. You might think that you want to choose the cup that you think is most likely to hold the ball, because then you’ll be most likely to find the ball there and thus learn exactly where the ball is. But at the same time, the most likely choice of ball location is also the one that gives you the least information if the ball is actually there. If you were already fairly sure that the ball was under that cup, then you don’t learn much by discovering that it was.

Maybe instead the better strategy is to go for a cup that you think is fairly unlikely to be hiding the ball. Then you’ll have a small chance of finding the ball, but in that case will gain a huge amount of evidence. Or perhaps the maximum expected information gain is somewhere in the middle.

The best way to answer this question is to actually do the calculation. So let’s do it!

First, we’ll label the different theories about the cup containing the ball:

{C1, C2, C3, … CN}

Ck corresponds to the theory that the ball is under the kth cup. Next, we’ll label the possible observations you could make:

{X1, X2, X3, … XN}

Xk corresponds to the observation that the ball is under the kth cup.

Now, our prior over the cups will contain all of our past information about the ball and the cups. Perhaps we thought we heard a rattle when somebody bumped one of the cups earlier, or we notice that the person who put the ball under one of the cups was closer to the cups on the right hand side. All of this information will be contained in the distribution P:

(P1, P2, P3, … PN)

Pk is shorthand for P(Ck) – the probability of Ck being true.

Good! Now we are ready to calculate the expected information gain from any particular observation. Let’s say that we decide to observe X3. There are two scenarios: either we find the ball there, or we don’t.

Scenario 1: You find the ball under cup 3. In this case, you previously had a credence of P3 in X3 being true, so you gain -log(P3) bits of information.

Scenario 2: You don’t find the ball under cup 3. In this case, you gain –log(1 – P3) bits of information.

With probability P3, you gain –log(P3) bits of information, and with probability (1 – P3) you gain –log(1 – P3) bits of information. So your expected information gain is just –P3 logP3 – (1 – P3) logP3.

In general, we see that if you have a prior credence of P in the cup containing the ball, then your expected information gain is:

-P logP – (1 – P) logP

What does this function look like?

Experimental design

We see that it has a peak value at 50%. This means that you expect to gain the most information by looking at a cup that you are 50% sure contains the ball. If you are any more or less confident than this, then evidently you learn less than you would have if you were exactly agnostic about the cup.

Intuitively speaking, this means that we stand to learn the most by doing an experiment on a quantity that we are perfectly agnostic about. Practically speaking, however, the mandate that we run the experiment that maximizes information gain ends up telling us to always test the cup that we are most confident contains the ball. This is because if you split your credences among N cups, they will be mostly under 50%, so the closest you can get to 50% will be the largest credence.

Even if you are 99% confident that the fifteenth cup out of one hundred contains the ball, you will have just about .01% credence in each of the others containing the ball. Since 99% is closer to 50% than .01%, you will stand to gain the most information by testing the fifteenth ball (although you stand to gain very little information in a more absolute sense).

This generalizes nicely. Suppose that instead of trying to guess whether or not there is a ball under a cup, you are trying to guess whether there is a ball, a cube, or nothing. Now your expected information gain in testing a cup is a function of your prior over the cup containing a ball Pball, your prior over it containing a cube Pcube, and your prior over it containing nothing Pempty.

-Pball logPball – Pcube logPcube – Pempty logPempty

Subject to the constraint that these three priors must add up to 1, what set of (Pball, Pcube, Pempy) maximizes the information gain? It is just (⅓, ⅓, ⅓).

Optimal (Pball, Pcube, Pempy) = (⅓, ⅓, ⅓)

Imagine that you know that exactly one cup is empty, exactly one contains a cube, and exactly one contains a ball, and have the following distribution over the cups:

Cup 1: (⅓, ⅓, ⅓)
Cup 2: (⅔, ⅙, ⅙)
Cup 3: (0, ½, ½)

If you can only peek under a single cup, which one should you choose in order to learn the most possible? I take it that the answer to this question is not immediately obvious. But using these methods in information theory, we can answer this question unambiguously: Cup 1 is the best choice – the optimal experiment.

We can even numerically quantify how much more information you get by checking under Cup 1 than by checking under Cup 2:

Information gain(check cup 1) ≈ 1.58 bits
Information gain(check cup 2) ≈ 1.25 bits
Information gain(check cup 3) = 1 bits

Checking cup 1 is thus 0.33 bits better than checking cup 2, and 0.58 bits better than checking cup 3. Since receiving N bits of information corresponds to ruling out all but 1/2N possibilities, we rule out 20.33 ≈ 1.26 times more possibilities by checking cup 1 than cup 2, and 20.58 ≈ 1.5 times more possibilities than cup 3.

Even more generally, we see that when we can test N mutually exclusive characteristics of an object at once, the test is most informative when our credences in the characteristics are smeared out evenly; P(k) = 1/N.

This makes a lot of sense. We learn the most by testing things about which we are very uncertain. The more smeared out our probabilities are over the possibilities, the less confident we are, and thus the more effective a test will be. Here we see a case in which information theory vindicates common sense!

Why relative entropy

Background for this post: Entropy is expected surprise, A survey of entropy and entropy variants, and Maximum Entropy and Bayes

Suppose you have some old distribution Pold, and you want to update it to a new distribution Pnew given some information.

You want to do this in such a way as to be as uncertain as possible, given your evidence. One strategy for achieving this is to maximize the difference in entropy between your new distribution and your old one.

Max (Snew – Sold) = ∑ -Pnew logPnew – ∑ -Pold logPold

Entropy is expected surprise. So this quantity is the new expected surprise minus the old expected surprise. Maximizing this corresponds to trying to be as much more surprised on average as possible than you expected to be previously.

But this is not quite right. We are comparing the degree of surprise you expect to have now to the degree of surprise you expected to have previously, based on your old distribution. But in general, your new distribution may contain important information as to how surprised you should have expected to be.

Think about it this way.

One minute ago, you had some set of beliefs about the world. This set of beliefs carried with it some degree of expected surprise. This expected surprise is not the same as the true average surprise, because you could be very wrong in your beliefs. That is, you might be very confident in your beliefs (i.e. have very low EXPECTED surprise), but turn out to be very wrong (i.e. have very high ACTUAL average surprise).

What we care about is not how surprised somebody with the distribution Pold would have expected to be, but how surprised you now expect somebody with the distribution Pold to be. That is, you care about the average value of surprise, given your new distribution, your new best estimate of the actual distribution

That is to say, instead of using the simple difference in entropies S(Pnew) – S(Pold), you should be using the relative entropy Srel(Pnew, Pold).

Max Srel = ∑ -Pnew logPnew – ∑ -Pnew logPold

Here’s a diagram describing the three species of entropy: entropy, cross entropy, and relative entropy.

Types of Entropy.png

As one more example of why this makes sense: imagine that one minute ago you were totally ignorant and knew absolutely nothing about the world, but were for some reason very irrationally confident about your beliefs. Now you are suddenly intervened upon by an omniscient Oracle that tells you with perfect accuracy exactly what is truly going on.

If your new beliefs are designed by maximizing the absolute gain in entropy, then you will be misled by your old irrational confidence; your old expected surprise will be much lower than it should have been. If you use relative entropy, then you will be using your best measure of the actual average surprise for your old beliefs, which might have been very large. So in this scenario, relative entropy is a much better measure of your actual change in average surprise than the absolute entropy difference, as it avoids being misled by previous irrationality.

A good way to put this is that relative entropy is better because it uses your current best information to estimate the difference in average surprise. While maximizing absolute entropy differences will give you the biggest change in expected surprise, maximizing relative entropy differences will do a better job at giving you the biggest difference in *actual* surprise. Relative entropy, in other words, allows you to correct for previous bad estimates of your average surprise, and substitute in the best estimate you currently have.

These two approaches, maximizing absolute entropy difference and maximizing relative entropy, can give very different answers for what you should believe. It so happens that the answers you get by maximizing relative entropy line up nicely with the answers you get from just ordinary Bayesian updating, while the answers you get by maximizing absolute entropy differences do not, which is why this difference is important.

A survey of entropy and entropy variants

This post is for anybody that is confused about the numerous different types of entropy concepts out there, and how they relate to one another. The concepts covered are:

  • Surprise
  • Information
  • Entropy
  • Cross entropy
  • KL divergence
  • Relative entropy
  • Log loss
  • Akaike Information Criterion
  • Cross validation

Let’s dive in!

Surprise and information

Previously, I talked about the relationship between surprise and information. It is expressed by the following equation:

Surprise = Information = – log(P)

I won’t rehash the justification for this equation, but highly recommend you check out the previous post if this seems unusual to you.

In addition, we introduced the ideas of expected surprise and total expected surprise, which were expressed by the following equations:

Expected surprise = – P log(P)
Total expected surprise = – ∑ P log(P)

As we saw previously, the total expected surprise for a distribution is synonymous with the entropy of somebody with that distribution.

Which leads us straight into the topic of this post!


The entropy of a distribution is how surprised we expect to be if we suddenly learn the truth about the distribution. It is also the amount of information we expect to gain upon learning the truth.

A small degree of entropy means that we expect to learn very little when we hear the truth. A large degree of entropy means that we expect to gain a lot of information upon hearing the truth. Therefore a large degree of entropy represents a large degree of uncertainty. Entropy is our distance from certainty.

Entropy = Total expected surprise = – ∑ P log(P)

Notice that this is not the distance from truth. We can be very certain, and very wrong. In this case, our entropy will be high, because it is our expected surprise. That is, we calculate entropy by looking at the average surprise over our probability distribution, not the true distribution. If we want to evaluate the distance from truth, we need to evaluate the average over the true distribution.

We can do this by using cross-entropy.

Cross Entropy

In general, the cross entropy is a function of two distributions P and Q. The cross entropy of P and Q is the surprise you expect somebody with the distribution Q to have, if you have distribution P.

Cross Entropy = Surprise P expects of Q = – ∑ P log(Q)

The actual average surprise of your distribution P is therefore the cross-entropy between P and the true distribution. It is how surprised somebody would expect you to be, if they had perfect knowledge of the true distribution.

Actual average surprise = – ∑ Ptrue log(P)

Notice that the smallest possible value that the cross entropy could take on is the entropy of the true distribution. This makes sense – if your distribution is as close to the truth as possible, but the truth itself contains some amount of uncertainty (for example, a fundamentally stochastic process), then the best possible state of belief you could have would be exactly as uncertain as the true distribution is. Maximum cross entropy between your distribution and the true distribution corresponds to maximum distance from the truth.

Kullback-Leibler divergence

If we want a quantity that is zero when your distribution is equal to the true distribution, then you can shift the cross entropy H(Ptrue, P) over by the value of the true entropy S(Ptrue). This new quantity H(Ptrue, P) – S(Ptrue) is known as the Kullback-Leibler divergence.

Shifted actual average surprise = Kullback-Leibler divergence
= – ∑ Ptrue log(P) + ∑ Ptrue log(Ptrue)
= ∑ Ptrue log(Ptrue/P)

It represents the information gap, or the actual average difference in difference between your distribution and the true distribution. The smallest possible value of the Kullback-Leibler divergence is zero, when your beliefs are completely aligned with reality.

Since KL divergence is just a constant shift away from cross entropy, minimizing one is the same as minimizing the other. This makes sense; the only real difference between the two is whether we want our measure of “perfect accuracy” to start at zero (KL divergence) or to start at the entropy of the true distribution (cross entropy).

Relative Entropy

The negative KL divergence is just a special case of what’s called relative entropy. The relative entropy of P and Q is just the negative cross entropy of P and Q, shifted so that it is zero when P = Q.

Relative entropy = shifted cross entropy
= – ∑ P log(P/Q)

Since the cross entropy between P and Q measures how surprised P expects Q to be, the relative entropy measures P’s expected gap in average surprisal between themselves and Q.

KL divergence is what you get if you substitute in Ptrue for P. Thus it is the expected gap in average surprisal between a distribution and the true distribution.


Maximum KL divergence corresponds to maximum distance from the truth, while maximum entropy corresponds to maximum from certainty. This is why we maximize entropy, but minimize KL divergence. The first is about humility – being as uncertain as possible given the information that you possess. The second is about closeness to truth.

Since KL divergence is just a constant shift away from cross entropy, minimizing one is the same as minimizing the other. This makes sense, the only real difference between the two is whether we want our “perfectly accurate” measure to start at zero (KL divergence) or at the entropy of the true distribution (cross entropy).

Since we don’t start off with access to Ptrue, we can’t directly calculate the cross entropy H(Ptrue, P). But lucky for us, a bunch of useful approximations are available!

Log loss

Log loss uses the fact that if we have a set of data D generated by the true distribution, the expected value of F(x) taken over the true distribution will be approximately just the average value of F(x), for x in D.

Cross Entropy = – ∑ Ptrue log(P)
(Data set D, N data points)
Cross Entropy ~ Log loss = – ∑x in D log(P(x)) / N

This approximation should get better as our data set gets larger. Log loss is thus just a large-numbers approximation of the actual expected surprise.

Akaike information criterion

Often we want to use our data set D to optimize our distribution P with respect to some set of parameters. If we do this, then the log loss estimate is biased. Why? Because we use the data in two places: first to optimize our distribution P, and second to evaluate the information distance between P and the true distribution.

This allows problems of overfitting to creep in. A distribution can appear to have a fantastically low information distance to the truth, but actually just be “cheating” by ensuring success on the existing data points.

The Akaike information criterion provides a tweak to the log loss formula to try to fix this. It notes that the difference between the cross entropy and the log loss is approximately proportional to the number of parameters you tweaked divided by the total size of the data set: k/N.

Thus instead of log loss, we can do better at minimizing cross entropy by minimizing the following equation:

AIC = Log loss + k/N

(The exact form of the AIC differs by multiplicative constants in different presentations, which ultimately is unimportant if we are just using it to choose an optimal distribution)

The explicit inclusion of k, the number of parameters in your model, represents an explicit optimization for simplicity.

Cross Validation

The derivation of AIC relies on a complicated set of assumptions about the underlying distribution. These assumptions limit the validity of AIC as an approximation to cross entropy / KL divergence.

But there exists a different set of techniques that rely on no assumptions besides those used in the log loss approximation (the law of large numbers and the assumption that your data is an unbiased sampling of the true distribution). Enter the holy grail of model selection!

The problem, recall, was that we used the same data twice, allowing us to “cheat” by overfitting. First we used it to tweak our model, and second we used it to evaluate our model’s cross entropy.

Cross validation solves this problem by just separating the data into two sets, the training set and the testing set. The training set is used for tweaking your model, and the testing set is used for evaluating the cross entropy. Different procedures for breaking up the data result in different flavors of cross-validation.

There we go! These are some of the most important concepts built off of entropy and variants of entropy.

Entropy is expected surprise

Today we’re going to talk about a topic that’s very close to my heart: entropy. We’ll start somewhere that might seem unrelated: surprise.

Suppose that we wanted to quantify the intuitive notion of surprise. How should we do that?

We’ll start by analyzing a few base cases.

First! If something happens and you already were completely certain that it would happen, then you should completely unsurprised.

That is, if event E happens, and you had a credence P(E) = 100% in it happening, then your surprise S should be zero.

S(1) = 0

Second! If something happens that you were totally sure was impossible, with 100% credence, then you should be infinitely surprised.

That is, if E happens and P(E) = 0, then S = ∞.

S(0) = ∞

So far, it looks like your surprise S should be a function of your credence P in the event you are surprised at. That is, S = S(P). We also have the constraints that S(1) = 0 and S(0) = ∞.

There are many candidates for a function like this, for example: S(P) = 1/P – 1, S(P) = -log(P), S(P) = cot(πx/2). So we need more constraints.

Third! If an event E1 happens that is surprising to degree S1, and then another event E2 happens with surprisingness S2, then your surprise at the combination of these events should be S1 + S2.

I.e., we want surprise to be additive. If S(P(E1)) = S1 and S(P(E2 | E1)) = S2, then S(P(E1 & E2) = S1 + S2.

This entails a new constraint on our surprise function, namely:

S(PQ) = S(P) + S(Q)

Fourth, and finally! We want our surprise function to be continuous – free from discontinuous jumps. If your credence that the event will happen changes by an arbitrarily small amount, then your surprise if it does happen should also change by an arbitrarily small amount.

S(P) is continuous.

These four constraints now fully specify the form of our surprise function, up to a multiplicative constant. What we find is that the only function satisfying these constraints is the logarithm:

S(P) = k logP, where k is some negative number

Taking the simplest choice of k, we end up with a unique formalization of the intuitive notion of surprise:

S(P) = – logP

To summarize what we have so far: Four basic desideratum for our formalization of the intuitive notion of surprise have led us to a single simple equation.

This equation that we’ve arrived at turns out to be extremely important in information theory. It is, in fact, just the definition of the amount of information you gain by observing E. This reveals to us a deep connection between surprise and information. They are in an important sense expressing the same basic idea: more surprising events give you more information, and unsurprising events give you little information.

Let’s get a little better numerical sense of this formalization of surprise/information. What does a single unit of surprise or information mean? With some quick calculation, we see that a single unit of surprise, or bit of information corresponds to the observation of an event that you had a 50% expectation of. This also corresponds to a ruling out of 50% of the weight of possible other events you thought you might have observed. In essence, each bit of information you receive / surprise you experience corresponds to the total amount of possibilities being cut in half.

Two bits of information narrow the possibilities to one-fourth. Three cut out all but one-eighth. And so on. For a rational agent, the process of receiving more information or of being continuously surprised is the process of whittling down your models of reality to a smaller and better set!

The next great step forward is to use our formalization of surprise to talk not just about how surprised you are once an event happens, but how surprised you expect to be. If you have a credence of P in an event happening, then you expect a degree of surprise S(P) with credence P. In other words, the expected surprise you have with respect to that particular event is:

Expected surprise = – P logP

When summed over the totality of all possible events that occurred we get the following expression:

Total expected surprise = – ∑i Pi logPi

This expression should look very very familiar to you. It’s one of the most important quantities humans have discovered…


Now you understand the title of this post. Quite literally, entropy is total expected surprise!

Entropy = Total expected surprise

By the way, you might be wondering if this is the same entropy as you hear mentioned in the context of physics (that thing that always increases). Yes, it is identical! This means that we can describe the Second Law of Thermodynamics as a conspiracy by the universe to always be as surprising as possible to us! There are a bunch of ways to explore the exact implications of this, but that’s a subject for another post.

Getting back to the subject of this post, we can now make another connection. Surprise is information. Total expected surprise is entropy. And entropy is a measure of uncertainty.

If you think about this for a moment, this should start to make sense. If your model of reality is one in which you expect to be very surprised in the next moment, then you are very uncertain about what is going to happen in the next moment. If, on the other hand, your model of reality is one in which you expect zero surprise in the next moment, then you are completely certain!

Thus we see the beautiful and deep connection between surprise, information, entropy, and uncertainty. The overlap of these four concepts is rich with potential for exploration. We could go the route of model selection and discuss notions like mutual informationinformation divergence, and relative entropy, and how they relate to the virtues of predictive accuracy and model simplicity. We could also go the route of epistemology and discuss the notion of epistemic humility, choosing your beliefs to maximize your uncertainty, and the connection to Bayesian epistemology. Or, most tantalizingly, we could go the route of physics and explore the connection between this highly subjective sense of entropy as surprise/ uncertainty, and the very concrete notion of entropy as a physical quantity that characterizes the thermal properties of systems.

Instead of doing any of these, I’ll do none, and end here in hope that I’ve conveyed some of the coolness of this intersection of philosophy, statistics, and information theory.

Inference as a balance of accommodation, prediction, and simplicity

(This post is a soft intro to some of the many interesting aspects of model selection. I will inevitably skim over many nuances and leave out important details, but hopefully the final product is worth reading as a dive into the topic. A lot of the general framing I present here is picked up from Malcolm Forster’s writings.)

What is the optimal algorithm for discovering the truth? There are many different candidates out there, and it’s not totally clear how to adjudicate between them. One issue is that it is not obvious exactly how to measure correspondence to truth. There are several different criterion that we can use, and in this post, I want to talk about three big ones: accommodation, prediction, and simplicity.
The basic idea of accommodation is that we want our theories to do a good job at explaining the data that we have observed. Prediction is about doing well at predicting future data. Simplicity is, well, just exactly what it sounds like. Its value has been recognized in the form of Occam’s razor, or the law of parsimony, although it is famously difficult to formalize.
Let’s say that we want to model the relationship between the number of times we toss a fair coin and the number of times that it lands H. We might get a data set that looks something like this:

Now, our goal is to fit a curve to this data. How best to do this?

Consider the following two potential curves:

Curve fitting

Curve 1 is generated by Procedure 1: Find the lowest-order polynomial that perfectly matches the data.

Curve 2 is generated by Procedure 2: Find the straight line that best fits the data.

If we only cared about accommodation, then we’ll prefer Curve 1 over Curve 2. After all, Curve 1 matches our data perfectly! Curve 2, on the other hand, is always close but never exactly right.

On the other hand, regardless of how well Curve 1 fits the data, it entirely misses the underlying pattern in the data captured by Curve 2! This demonstrates one of the failure modes of a single-minded focus on accommodation: the problem of overfitting.

We might want to solve in this problem by noting that while Curve 1 matches the data better, it does so in virtue of its enormous complexity. Curve 2, on the other hand, matches the data pretty well, but does so simply. A combined focus on accommodation + simplicity might, therefore, favor Curve 2. Of course, this requires us to precisely specify what we mean by ‘simplicity’, which has been the subject of a lot of debate. For instance, some have argued that an individual curve cannot be said to be more or less simple than a different curve, as just rephrasing the data in a new coordinate system can flip the apparent simplicity relationship. This is a general version of the grue-bleen problem, which is a fantastic problem that deserves talking about in a separate post.

Another way to solve this problem is by optimizing for accommodation + prediction. The over-fitted curve is likely to be very off if you ask for predictions about future data, while the straight line is likely going to do better. This makes sense – a straight line makes better forecasts about future data because it has gotten to the true nature of the underlying relationship.

What if we want to ensure that our model does a good job at predicting future data, but are unable to gather future data? For example, suppose that we lost the coin that we were using to generate the data, but still want to know what model would have done best at predicting future flips? Cross-validation is a wonderful technique that can be used to deal with exactly this problem.

How does it work? The idea is that we randomly split up the data we have into two sets, the training set and the testing set. Then we train our models on the training set (see which curve each model ends up choosing as its best fit, given the training data), and test it on the testing set. For instance, if our training set is just the data from the early coin flips, we find the following:

Curve fitting cross validation
Cross validation

We can see that while the new Curve 2 does roughly as well as it did before, the new Curve 1 will do horribly on the testing set. We now do this for many different ways of splitting up our data set, and in the end accumulate a cross-validation “score”. This score represents the average success of the model at predicting points that it was not trained on.

We expect that in general, models that overfit will tend to do horribly badly when asked to predict the testing data, while models that actually get at the true relationship will tend to do much better. This is a beautiful method for avoiding overfitting by getting at the deep underlying relationships, and optimizing for the value of predictive accuracy.

It seems like predictive accuracy and simplicity often go hand-in-hand. In our coin example, the simpler model (the straight line) was also the more predictively accurate one. And models that overfit tend to be both bad at making accurate predictions and enormously complicated. What is the explanation for this relationship?

One classic explanation says that simpler models tend to be more predictive because the universe just actually is relatively simple. For whatever reason, the actual relationships between different variables in the universe happens to be best modeled by simple equations, not complicated ones. Why? One reason that you could point to is the underlying simplicity of the laws of nature.

The Standard Model of particle physics, which gives rise to basically all of the complex behavior we see in the world, can be expressed in an equation that can be written on a t-shirt. In general, physicists have found that reality seems to obey very mathematically simple laws at its most fundamental level.

I think that this is somewhat of a non-explanation. It predicts simplicity in the results of particle physics experiments, but does not at all predict simple results for higher-level phenomenon. In general, very complex phenomena can arise from very simple laws, and we get no guarantee that the world will obey simple laws when we’re talking about patterns involving 1020 particles.

An explanation that I haven’t heard before references possible selection biases. The basic idea is that most variables out there that we could analyze are likely not connected by any simple relationships. Think of any random two variables, like the number of seals mating at any moment and the distance between Obama and Trump at that moment. Are these likely to be related by a simple equation? Of course!

(Kidding. Of course not.)

The only times when we do end up searching for patterns in variables is when we have already noticed that some pattern does plausibly seem to exist. And since we’re more likely to notice simpler patterns, we should expect a selection bias among those patterns we’re looking at. In other words, given that we’re looking for a pattern between two variables, it is fairly likely that there is a pattern that is simple enough for us to notice in the first place.

Regardless, it looks like an important general feature of inference systems to provide a good balance between accommodation and either prediction or simplicity. So what do actual systems of inference do?

I’ve already talked about cross validation as a tool for inference. It optimizes for accommodation (in the training set) + prediction (in the testing set), but not explicitly for simplicity.

Updating of beliefs via Bayes’ rule is a purely accommodation procedure. When you take your prior credence P(T) and update it with evidence E, you are ultimately just doing your best to accommodate the new information.

Bayes’ Rule: P(T | E) = P(T) ∙ P(E | T) / P(T) 

The theory that receives the greatest credence bump is going to be the theory that maximizes P(E | T), or the likelihood of the evidence given the theory. This is all about accommodation, and entirely unrelated to the other virtues. Technically, the method of choosing the theory that maximizes the likelihood of your data is known as Maximum Likelihood Estimation (MLE).

On the other hand, the priors that you start with might be set in such a way as to favor simpler theories. Most frameworks for setting priors do this either explicitly or implicitly (principle of indifference, maximum entropy, minimum description length, Solomonoff induction).

Leaving Bayes, we can look to information theory as the foundation for another set of epistemological frameworks. These are focused mostly on minimizing the information gain from new evidence, which is equivalent to maximizing the relative entropy of your new distribution and your old distribution.

Two approximations of this procedure are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), each focusing on subtly different goals. Both of these explicitly take into account simplicity in their form, and are designed to optimize for both accommodation and prediction.

Here’s a table of these different procedures, as well as others I haven’t mentioned yet, and what they optimize for:

Optimizes for…




Maximum Likelihood Estimation

Minimize Sum of Squares

Bayesian Updating

Principle of Indifference

Maximum Entropy Priors

Minimum Message Length

Solomonoff Induction


Minimize Mallow’s Cp

Maximize Relative Entropy

Minimize Log Loss

Cross Validation

Minimize Akaike Information Criterion (AIC)

Minimize Bayesian Information Criterion (AIC)

Some of the procedures I’ve included are closely related to others, and in some cases they are in fact approximations of others (e.g. minimize log loss ≈ maximize relative entropy, minimize AIC ≈ minimize log loss).

We can see in this table that Bayesianism (Bayesian updating + a prior-setting procedure) does not explicitly optimize for predictive value. It optimizes for simplicity through the prior-setting procedure, and in doing so also happens to pick up predictive value by association, but doesn’t get the benefits of procedures like cross-validation.

This is one reason why Bayesianism might be seen as suboptimal – prediction is the great goal of science, and it is entirely missing from the equations of Bayes’ rule.

On the other hand, procedures like cross validation and maximization of relative entropy look like good candidates for optimizing for accommodation and predictive value, and picking up simplicity along the way.