The Central Paradox of Statistical Mechanics: The Problem of The Past

This is the third part in a three-part series on the foundations of statistical mechanics.

  1. The Necessity of Statistical Mechanics for Getting Macro From Micro
  2. Is The Fundamental Postulate of Statistical Mechanics A Priori?
  3. The Central Paradox of Statistical Mechanics: The Problem of The Past

— — —

What I’ve argued for so far is the following set of claims:

  1. To successfully predict the behavior of macroscopic systems, we need something above and beyond the microphysical laws.
  2. This extra thing we need is the fundamental postulate of statistical mechanics, which assigns a uniform distribution over the region of phase space consistent with what you know about the system. This postulate allows us to prove all the things we want to say about the future, such as “gases expand”, “ice cubes melt”, “people age” and so on.
  3. This fundamental postulate is not justifiable on a priori grounds, as it is fundamentally an empirical claim about how frequently different microstates pop up in our universe. Different initial conditions give rise to different such frequencies, so that a claim to a priori access to the fundamental postulate is a claim to a priori access to the precise details of the initial condition of the universe.

 There’s just one problem with all this… apply our postulate to the past, and everything breaks.

 Notice that I said that the fundamental postulate allows us to prove all the things we want to say about the future. That wording was chosen carefully. What happens if you try to apply the microphysical laws + the fundamental postulate to predict the past of some macroscopic system? It turns out that all hell breaks loose. Gases spontaneously contract, ice cubes form from puddles of water, and brains pop out of thermal equilibrium.

Why does this happen? Very simply, we start with two fully time-reversible premises (the microphysical laws and the fundamental postulate). We apply them to present knowledge of some state, the description of which does not specify a special time direction. So any conclusion we get must as a matter of logic be time-reversible as well! You can’t start with premises that treat the past as the mirror image of the future, and using just the rules of logical equivalence derive a conclusion that treats the past as fundamentally different from the future. And what this means is that if you conclude that entropy increases towards the future, then you must also conclude that entropy increases towards the past. Which is to say that we came from a higher entropy state, and ultimately (over a long enough time scale and insofar as you think that our universe is headed to thermal equilibrium) from thermal equilibrium.

Let’s flesh this argument out a little more. Consider a half-melted ice cube sitting in the sun. The microphysical laws + the fundamental postulate tell us that the region of phase space consisting of states in which the ice cube is entirely melted is much much much larger than the region of phase space in which it is fully unmelted. So much larger, in fact, that it’s hard to express using ordinary English words. This is why we conclude that any trajectory through phase space that passes through the present state of the system (the half-melted cube) is almost certainly going to quickly move towards the regions of phase space in which the cube is fully melted. But for the exact same reason, if we look at the set of trajectories that pass through the present state of the system, the vast vast vast majority of them will have come from the fully-melted regions of phase space. And what this means is that the inevitable result of our calculation of the ice cube’s history will be that a few moments ago it was a puddle of water, and then it spontaneously solidified and formed into a half-melted ice cube.
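Here is a minimal sketch of that symmetry in a toy model, not a real ice cube. The assumptions are mine: a one-dimensional box [0, 1] with elastic walls, noninteracting particles, a present macrostate of "every particle is in the left half", velocities bounded in (-1, 1), and a crude left/right mixing entropy standing in for the real thing. The names `evolve` and `mixing_entropy` are hypothetical helpers.

```python
import math
import random

def evolve(x, v, t):
    """Position at time t (t may be negative) for free flight in the box [0, 1]
    with elastic walls; reflections are handled by unfolding with period 2."""
    m = (x + v * t) % 2.0
    return m if m <= 1.0 else 2.0 - m

def mixing_entropy(positions):
    """Coarse-grained entropy per particle from the left/right occupancy."""
    f = sum(x < 0.5 for x in positions) / len(positions)
    return sum(-p * math.log(p) for p in (f, 1.0 - f) if p > 0.0)

rng = random.Random(0)
N, T = 5000, 5.0
# Uniform over the microstates compatible with the present macrostate:
# every particle in the left half, velocity anywhere in (-1, 1).
present = [(0.5 * rng.random(), rng.uniform(-1.0, 1.0)) for _ in range(N)]

s_past = mixing_entropy([evolve(x, v, -T) for x, v in present])
s_now = mixing_entropy([x for x, _ in present])
s_future = mixing_entropy([evolve(x, v, +T) for x, v in present])
print(f"S(-T) = {s_past:.3f}, S(0) = {s_now:.3f}, S(+T) = {s_future:.3f}")
```

In this toy setup the entropy at the present moment is zero and it sits near its maximum of ln 2 at both -T and +T. Nothing in the code distinguishes the two directions, which is the whole point.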

This argument generalizes! What’s the most likely past history of you, according to statistical mechanics? It’s not that the solar system coalesced from a haze of gases strewn through space by a past supernova, such that a planet would form in the Goldilocks zone and develop life, which would then gradually evolve through natural selection to the point where you are sitting in whatever room you’re sitting in reading this post. This trajectory through phase space is enormously unlikely. The much much much more likely past trajectory of you through phase space is that a little while ago you were a bunch of particles dispersed through a universe at thermal equilibrium, which happened to spontaneously coalesce into a brain that has time to register a few moments of experience before dissipating back into chaos. “What about all of my memories of the past?” you say. As it happens the most likely explanation of these memories is not that they are veridical copies of real happenings in the universe but illusions, manufactured from randomness.

Basically, if you buy everything I’ve argued in the first two parts, then you are forced to conclude that the universe is most likely near thermal equilibrium, with your current experience of it arising as a spontaneous dip in entropy, just enough to produce a conscious brain but no more. There are at least two big problems with this view.

Problem 1: This conclusion is, we think, extremely empirically wrong! The ice cube in front of you didn’t spontaneously form from a puddle of water, uncracked eggs weren’t a moment ago scrambled, and your memories are to some degree veridical. If you really believe that you are merely a spontaneous dip in entropy, then your prediction for the next minute will be the gradual dissolution of your brain and loss of consciousness. Now, wait a minute and see if this happens. Still here? Good!

Problem 2: The conclusion cannot be simultaneously believed and justified. If you think that you’re a thermal fluctuation, then you shouldn’t credit any of your memories as telling you anything about the world. But then your whole justification for coming to the conclusion in the first place (the experiments that led us to conclude that physics is time-reversible and that the fundamental postulate is true) is undermined! Either you believe it without justification, or you don’t believe it despite justification. Said another way, no reflective equilibrium exists at an entropy minimum. David Albert calls this peculiar epistemic state cognitively unstable, as it’s not clear where exactly it should leave you.

Reflect for a moment on how strange of a situation we are in here. Starting from very basic observations of the world, involving its time-reversibility on the micro scale and the increase in entropy of systems, we see that we are inevitably led to the conclusion that we are almost certainly thermal fluctuations, brains popping out of the void. I promise you that no trick has been pulled here, this really is the state of the philosophy of statistical mechanics! The big issue is how to deal with this strange situation.

One approach is to say the following: Our problem is that our predictions work towards the future but not the past. So suppose that we simply add as a new fundamental postulate the proposition that long long ago the universe had an incredibly low entropy. That is, suppose that instead of just starting with the microphysical laws and the fundamental postulate of statistical mechanics, we added a third claim: the Past Hypothesis.

The Past Hypothesis should be understood as an augmentation of our Fundamental Postulate. Taken together, the two postulates say that our probability distribution over possible microstates should not be uniform over phase space. Instead, it should be what you get when you take the uniform distribution, and then condition on the distant past being extremely low entropy. This process of conditioning clearly preferences one direction of time over the other, and so the symmetry is broken.

 It’s worth reflecting for a moment on the strangeness of the epistemic status of the Past Hypothesis. It happens that we have over time accumulated a ton of observational evidence for the occurrence of the Big Bang. But none of this evidence has anything to do with our reasons for accepting the Past Hypothesis. If we buy the whole line of argument so far, our conclusion that something like a Big Bang occurred becomes something that we are forced to believe for deep logical reasons, on pain of cognitive instability and self-undermining belief. Anybody that denies that the Big Bang (or some similar enormously low-entropy past state) occurred has to contend with their view collapsing in self-contradiction upon observing the physical laws!

Is The Fundamental Postulate of Statistical Mechanics A Priori?

This is the second part in a three-part series on the foundations of statistical mechanics.

  1. The Necessity of Statistical Mechanics for Getting Macro From Micro
  2. Is The Fundamental Postulate of Statistical Mechanics A Priori?
  3. The Central Paradox of Statistical Mechanics: The Problem of The Past

— — —

The fantastic empirical success of the fundamental postulate gives us a great amount of assurance that the postulate is a good one. But it’s worth asking whether that’s the only reason that we should like this postulate, or if it has some solid a priori justification. The basic principle of “when you’re unsure, just distribute credences evenly over phase space” certainly strikes many people as highly intuitive and justifiable on a priori grounds. But there are some huge problems with this way of thinking, one of which I’ve already hinted at. Here’s a thought experiment that illustrates the problem.

There is a factory in your town that produces cubic boxes. All you know about this factory is that the boxes that they produce all have a volume between 0 m³ and 1 m³. You are going to be delivered a box produced by this factory, and are asked to represent your state of knowledge about the box with a probability distribution. What distribution should you use?

Suppose you say “I should be indifferent over all the possible boxes. So I should have a uniform distribution over the volumes from 0 m³ to 1 m³.” This might seem reasonable at first blush. But what if somebody else said “Yes, you should be indifferent over all the possible boxes, but actually the uniform distribution should be over the side lengths from 0 m to 1 m, not volumes.” This would be a very different probability distribution! For example, if the probability that the side length is greater than .5 m is 50%, then the probability that the volume is greater than (.5)³ = 1/8 m³ is also 50%, whereas a uniform distribution over volumes would put that probability at 7/8! Uniform over side length is not the same as uniform over volume (or surface area, for that matter). Now, how do you choose between a uniform distribution over volumes and a uniform distribution over side lengths? After all, you know nothing about the process that the factory is using to produce the boxes, and whether it is based off of volume or side length (or something else); all you know is that all boxes are between 0 m³ and 1 m³.

The lesson of this thought experiment is that the statement we started with (“I should be indifferent over all possible boxes”) was actually not even well-defined. There’s not just one unique measure over a continuous space, and in general the notion that “all possibilities are equally likely” is highly language-dependent.
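If you want to see the disagreement in numbers, here is a quick Monte Carlo sketch. The helper `prob_volume_exceeds` and the two sampler names are hypothetical, and the sample size and seed are arbitrary choices of mine.

```python
import random

def prob_volume_exceeds(threshold, sampler, n=200_000, seed=0):
    """Monte Carlo estimate of P(volume > threshold) under a given sampler."""
    rng = random.Random(seed)
    hits = sum(sampler(rng) > threshold for _ in range(n))
    return hits / n

# Indifference over volume: V ~ Uniform(0, 1) in cubic meters.
def uniform_in_volume(rng):
    return rng.random()

# Indifference over side length: L ~ Uniform(0, 1) in meters, so V = L**3.
def uniform_in_side(rng):
    return rng.random() ** 3

p_vol = prob_volume_exceeds(1 / 8, uniform_in_volume)   # exact value: 7/8
p_side = prob_volume_exceeds(1 / 8, uniform_in_side)    # exact value: 1/2
print(f"P(V > 1/8), uniform in volume: {p_vol:.3f}")
print(f"P(V > 1/8), uniform in side:   {p_side:.3f}")
```

Both samplers are "indifferent over all possible boxes", yet they disagree about the same event by a wide margin.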

The exact same applies to phase space, as position and momentum are continuous quantities. Imagine that somebody instead of talking about phase space, only talked about “craze space”, in which all positions become positions cubed, and all momentum values become natural logs of momentum. This space would still contain all possible microstates of your system. What’s more, the fundamental laws of nature could be rewritten in a way that uses only craze space quantities, not phase space quantities. And needless to say, being indifferent over phase space would not be the same as being indifferent over craze space.

Spend enough time looking at attempts to justify a unique interpretation of the statement “All states are equally likely”, when your space of states is a continuous infinity, and you’ll realize that all such attempts are deeply dependent upon arbitrary choices of language. The maximum information entropy probability distribution is afflicted with the exact same problem, because the entropy of your distribution is going to depend on the language you’re using to describe it! The entropy of a distribution in phase space is NOT the same as the entropy of the equivalent distribution transformed to craze space.
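A small sketch of that last claim, under assumptions of my own choosing: a single coordinate x uniform on (0, 1], and a "craze" coordinate y = x³ (the position-cubed half of craze space). The entropy here is the differential entropy -∫ p ln p, estimated by averaging -ln p over samples.

```python
import math
import random

rng = random.Random(0)
# Samples of x from the uniform distribution on (0, 1]; 1 - random() avoids x = 0.
xs = [1.0 - rng.random() for _ in range(200_000)]

def density_x(x):
    """Density of the state of knowledge written in the x coordinate."""
    return 1.0

def density_y(y):
    """Density of the SAME state of knowledge written in y = x**3."""
    return (1.0 / 3.0) * y ** (-2.0 / 3.0)

# Differential entropy is the expected value of -ln(density).
h_x = -sum(math.log(density_x(x)) for x in xs) / len(xs)
h_y = -sum(math.log(density_y(x ** 3)) for x in xs) / len(xs)

print(f"entropy in x coordinates:        {h_x:.3f}")   # exactly 0
print(f"entropy in y = x^3 coordinates:  {h_y:.3f}")   # close to ln(3) - 2
```

Same state of knowledge, two different entropies. And the distribution that maximizes entropy in the y coordinate (uniform in y) is not the one that maximizes it in x.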

Let’s summarize this section. If somebody tells you that the fundamental postulate says that all microstates compatible with what you know about the macroscopic features of your system are equally likely, the proper response is something like “Equally likely? That sounds like you’re talking about a uniform distribution. But uniform over what? Oh, position and momentum? Well, why’d you make that choice?” And if they point out that the laws of physics are expressed in terms of position and momentum, you just disagree and say “No, actually I prefer writing the laws of physics in terms of position cubed and log momentum!” (Substitute in any choice of monotonic functions).

If they object on the grounds of simplicity, point out that position and momentum are only simple as measured from a standpoint that takes them to be the fundamental concepts, and that from your perspective, getting position and momentum requires applying complicated inverse transformations to your monotonic transformation of the chosen coordinates.

And if they object on the grounds of naturalness, the right response is probably something like “Tell me more about this ’naturalness’. How do you know what’s natural or unnatural? It seems to me that your choice of what physical concepts count as natural is a manifestation of deep selection pressures that push any beings whose survival depends on modeling and manipulating their surroundings towards forming an empirically accurate model of the macroscopic world. So that when you say that position is more natural than log(position), what I hear is that the fundamental postulate is a very useful tool. And you can’t use the naturalness of the choice of position to justify the fundamental postulate, when your perception of the naturalness of position is the result of the empirical success of the fundamental postulate!”

In my judgement, none of the a priori arguments work, and fundamentally the reason is that the fundamental postulate is an empirical claim. There’s no a priori principle of rationality that tells us that boxes of gases tend to equilibrate, because you can construct a universe whose initial microstate is such that its entire history is one of entropy radically decreasing, gases concentrating, eggs unscrambling, ice cubes unmelting, and so on. Why is this possible? Because it’s consistent with the microphysical laws that the universe started in an enormously low entropy configuration, so it’s gotta also be consistent with the microphysical laws for the entire universe to spend its entire lifetime decreasing in entropy. The general principle is: If you believe that something is physically possible, then you should believe its time-inverse is possible as well.

Let’s pause and take stock. What I’ve argued for so far is the following set of claims:

  1. To successfully predict the behavior of macroscopic systems, we need something above and beyond the microphysical laws.
  2. This extra thing we need is the fundamental postulate of statistical mechanics, which assigns a uniform distribution over the region of phase space consistent with what you know about the system. This postulate allows us to prove all the things we want to say about the future, such as “gases expand”, “ice cubes melt”, “people age” and so on.
  3. This fundamental postulate is not justifiable on a priori grounds, as it is fundamentally an empirical claim about how frequently different microstates pop up in our universe. Different initial conditions give rise to different such frequencies, so that a claim to a priori access to the fundamental postulate is a claim to a priori access to the precise details of the initial condition of the universe.

There’s just one problem with all this… apply our postulate to the past, and everything breaks.

Up next: Why does statistical mechanics give crazy answers about the past? Where did we go wrong?

The Necessity of Statistical Mechanics for Getting Macro From Micro

This is the first part in a three-part series on the foundations of statistical mechanics.

  1. The Necessity of Statistical Mechanics for Getting Macro From Micro
  2. Is The Fundamental Postulate of Statistical Mechanics A Priori?
  3. The Central Paradox of Statistical Mechanics: The Problem of The Past

— — —

Let’s start this out with a thought experiment. Imagine that you have access to the exact fundamental laws of physics. Suppose further that you have unlimited computing power, for instance, you have an oracle that can instantly complete any computable task. What then do you know about the world?

The tempting answer: Everything! But of course, upon further consideration, you are missing a crucial ingredient: the initial conditions of the universe. The laws themselves aren’t enough to tell you about your universe, as many different universes are compatible with the laws. By specifying the state of the universe at any one time (which incidentally does not have to be an “initial” time), though, you should be able to narrow down this set of compatible universes. So let’s amend our question:

Suppose that you have unlimited computing power, that you know the exact microphysical laws, and that you know the state of the universe at some moment. Then what do you know about the world?

The answer is: It depends! What exactly do you know about the state of the universe? Do you know its exact microstate? As in, do you know the position and momentum of every single particle in the universe? If so, then yes, the entire past and future of the universe are accessible to you. But suppose that instead of knowing the exact microstate, you only have access to a macroscopic description of the universe. For example, maybe you have a temperature map as well as a particle density function over the universe. Or perhaps you know the exact states of some particles, just not all of them.

Well, if you only have access to the macrostate of the system (which, notice, is the epistemic situation that we find ourselves in, being that full access to the exact microstate of the universe is as technologically remote as can be), then it should be clear that you can’t specify the exact microstate at all other times. This is nothing too surprising or interesting… starting with imperfect knowledge you will not arrive at perfect knowledge. But we might hope that in the absence of a full description of the microstate of the universe at all other times, you could at least give a detailed macroscopic description of the universe at other times.

That is, here’s what seems like a reasonable expectation: If I had infinite computational power, knew the exact microphysical laws, and knew, say, that a closed box was occupied by a cloud of noninteracting gas in its corner, then I should be able to draw the conclusion that “The gas will disperse.” Or, if I knew that an ice cube was sitting outdoors on a table in the sun, then I should be able to apply my knowledge of microphysics to conclude that “The ice cube will melt”. And we’d hope that in addition to being able to make statements like these, we’d also be able to produce precise predictions for how long it would take for the gas to become uniformly distributed over the box, or for how long it would take for the ice cube to melt.

Here is the interesting and surprising bit. It turns out that this is in principle impossible to do. Just the exact microphysical laws and an infinity of computing power is not enough to do the job! In fact, the microphysical laws will in general tell us almost nothing about the future evolution or past history of macroscopic systems!

Take this in for a moment. You might not believe me (especially if you’re a physicist). For one thing, we don’t know the exact form of the microphysical laws. It would seem that such a bold statement about their insufficiencies would require us to at least first know what they are, right? No, it turns out that the statement that microphysics is far too weak to tell us about the behavior of macroscopic systems holds for an enormously large class of possible laws of physics, a class that we are very sure that our universe belongs to.

Let’s prove this. We start out with the following observation that will be familiar to physicists: the microphysical laws appear to be time-reversible. That is, it appears to be the case that for every possible evolution of a system compatible with the laws of physics, the time-reverse of that evolution (obtained by simply reversing the trajectories of all particles) is also perfectly compatible with the laws of physics.*

This is surprising! Doesn’t it seem like there are trajectories that are physically possible for particles to take, such that their time reverse is physically impossible? Doesn’t it seem like classical mechanics would say that a ball sitting on the ground couldn’t suddenly bounce up to your hand? An egg unscramble? A gas collect in the corner of a room? The answer to all of the above is no. Classical mechanics, and fundamental physics in general, admits the possibilities of all these things. A fun puzzle for you is to think about why the first example (the ball initially at rest on the ground bouncing up higher and higher until it comes to rest in your hand) is not a violation of the conservation of energy.

Now here’s the argument: Suppose that you have a box that you know is filled with an ideal gas at equilibrium (uniformly spread through the volume). There are many many (infinitely many) microstates that are compatible with this description. We can conclusively say that in 15 minutes the gas will still be dispersed only if all of these microstates, when evolved forward 15 minutes, end up dispersed.

But, and here’s the crucial step, we also know that there exist very peculiar states (such as the macrostate in which all the gas particles have come together to form a perfect statuette of Michael Jackson) such that these states will in 15 minutes evolve to the dispersed state. And by time reversibility, this tells us that there is another perfectly valid history of the gas that starts uniformly dispersed and evolves over 15 minutes into a perfect statuette of Michael Jackson. That is, if we believe that complicated configurations of gases disperse, and believe that physics is time-reversible, then you must also believe that there are microstates compatible with dispersed states of gas that will in the next moment coalesce into some complicated configuration.

  1. A collection of gas shaped exactly like Michael Jackson will disperse uniformly across its container.
  2. Physics is time reversible.
  3. So uniformly dispersed gases can coalesce into a collection of gases shaped exactly like Michael Jackson.
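The three-step argument can be sketched numerically. This is a toy under my own assumptions: a two-dimensional unit box with elastic walls, noninteracting particles, and a tight cluster in one corner standing in for the Michael Jackson statuette. The helpers `evolve`, `step_all` and `corner_fraction` are hypothetical names.

```python
import random

def evolve(x, v, t):
    """Free flight of one coordinate for time t in [0, 1] with elastic walls.
    Returns the new (position, velocity); reflections are unfolded with period 2."""
    m = (x + v * t) % 2.0
    return (m, v) if m <= 1.0 else (2.0 - m, -v)

def step_all(state, t):
    """Evolve every coordinate of every particle for time t."""
    return [tuple(evolve(x, v, t) for x, v in particle) for particle in state]

def corner_fraction(state, size=0.1):
    """Fraction of particles whose coordinates all lie within `size` of the origin."""
    return sum(all(x < size for x, _ in p) for p in state) / len(state)

rng = random.Random(1)
N, T = 2000, 15.0
# Step 1: a peculiar low-entropy state (all particles in one corner) disperses.
start = [tuple((0.1 * rng.random(), rng.uniform(-1.0, 1.0)) for _ in range(2))
         for _ in range(N)]
dispersed = step_all(start, T)

# Step 2: time reversal means flipping every velocity.
reversed_state = [tuple((x, -v) for x, v in p) for p in dispersed]

# Step 3: the dispersed-looking reversed state evolves back into the corner.
regathered = step_all(reversed_state, T)
max_err = max(abs(a[0] - b[0])
              for p, q in zip(start, regathered) for a, b in zip(p, q))

print(f"in corner at start:     {corner_fraction(start):.3f}")
print(f"in corner when spread:  {corner_fraction(dispersed):.3f}")
print(f"max position mismatch after reversal: {max_err:.2e}")
```

The reversed state looks just as dispersed as any other dispersed state, since flipping velocities does not move a single particle. Its future is nonetheless a gas gathering itself into a corner.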

At this point you might be thinking “yeah, sure, microphysics doesn’t in principle rule out the possibility that a uniformly dispersed gas will coalesce into Michael Jackson, or any other crazy configuration. But who cares? It’s so incredibly unlikely!” To which the response is: Yes, exactly, it’s extremely unlikely. But nothing in the microphysical laws says this! Look as hard as you can at the laws of motion, you will not find a probability distribution over the likelihood of the different microstates compatible with a given macrostate. And indeed, different initial conditions of the universe will give different such frequency distributions! To make any statements about the relative likelihood of some microstates over others, you need some principle above and beyond the microphysical laws.

To summarize. All that microphysics + infinite computing power allows you to say about a macrostate is the following: Here are all the microstates that are compatible with that macrostate, and here are all the past and future histories of each of these microstates. And given time reversibility, these future histories cover an enormously diverse set of predictions about the future, from “the gas will disperse” to “the gas will form into a statuette of Michael Jackson”. To get reasonable predictions about how the world will actually behave, we need some other principle, a principle that allows us to disregard these “perverse” microstates. And microphysics contains no such principle.

Statistical mechanics is thus the study of the necessary augmentation to a fundamental theory of physics that allows us to make predictions about the world, given that we are not in the position to know its exact microstate. This necessary augmentation is known as the fundamental postulate of statistical mechanics, and it takes the form of a probability distribution over microstates. Some people describe the postulate as saying “all microstates are equally likely”, but that phrasing is a big mistake, as the sentence “all states are equally likely” is not well defined over a continuous set of states. (More on that in a bit.) To really understand the fundamental postulate, we have to introduce the notion of phase space.

The phase space for a system is a mathematical space in which every point represents a full specification of the positions and momenta of all particles in the system. So, for example, a system consisting of 1000 classical particles swimming around in an infinite universe would have 6000 degrees of freedom (three position coordinates and three momentum coordinates per particle). Each of these degrees of freedom is isomorphic to the real numbers. So phase space for this system must be ℝ^6000, and a point in phase space is a specification of the values of all 6000 degrees of freedom. In general, for N classical particles, phase space is ℝ^(6N).

With the concept of phase space in hand, we can define the fundamental postulate of statistical mechanics. This is: the probability distribution over microstates compatible with a given macrostate is uniform over the corresponding volume of phase space.

It turns out that if you just measure the volume of the “perverse states” in phase space, you end up finding that it composes approximately 0% of the volume of compatible microstates in phase space. This of course allows us to say of perverse states, “Sure they’re there, and technically it’s possible that my system is in such a state, but it’s so incredibly unlikely that it makes virtually no impact on my prediction of the future behavior of my system.” And indeed, when you start going through the math and seeing the way that systems most likely evolve given the fundamental postulate, you see that the predictions you get match beautifully with our observations of nature.
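To get a feel for "approximately 0%", here is a sketch for a deliberately crude macroscopic feature. The assumptions are mine: noninteracting particles in a box, where under the uniform measure each particle is independently equally likely to be in the left or right half, so counting left/right assignments is the same as measuring phase space volume. I call a state lopsided (a hypothetical stand-in for "perverse") if the left half holds less than 45% or more than 55% of the particles.

```python
import math

def log10_fraction_all_in_left_half(n_particles):
    """log10 of the phase space fraction in which every particle is on the left."""
    return -n_particles * math.log10(2)

def fraction_lopsided(n_particles):
    """Exact fraction of phase space volume in which the left half holds
    fewer than 45% or more than 55% of the particles."""
    lo = (45 * n_particles + 99) // 100      # smallest count that is >= 45%
    hi = (55 * n_particles) // 100           # largest count that is <= 55%
    total = 2 ** n_particles
    balanced = sum(math.comb(n_particles, k) for k in range(lo, hi + 1))
    return (total - balanced) / total        # exact integers, one final division

for n in (100, 1000, 10_000):
    print(f"N = {n:>6}: lopsided fraction = {fraction_lopsided(n):.3e}")

print(f"log10 of the all-left fraction for a mole of gas: "
      f"{log10_fraction_all_in_left_half(6.022e23):.3e}")
```

Already at ten thousand particles the lopsided fraction is far below anything you could ever hope to observe, and real gases have something like 10^23 particles.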

Next time: What is the epistemic status of the fundamental postulate? Do we have good a priori reasons to believe it?

— — —

* There are some subtleties here. For one, we think that there actually is a very small time asymmetry in the weak nuclear force. And some collapse interpretations of quantum mechanics have the collapse of the wave function as an irreversible process, although Everettian quantum mechanics denies this. For the moment, let’s disregard all of that. The time asymmetry in the weak nuclear force is not going to have any relevant effect on the proof made here, besides making it uglier and more complicated. What we need is technically not exact time-reversibility, but very-approximate time-reversibility. And that we have. Collapsing wave functions are a more troubling business, and are a genuine way out of the argument made in this post.

Wave function entropy

Entropy is a feature of probability distributions, and can be taken to be a quantification of uncertainty.

Standard quantum mechanics takes its fundamental object to be the wave function – an amplitude distribution. And from an amplitude distribution Ψ you can obtain a probability distribution Ψ*Ψ.

So it is very natural to think about the entropy of a given quantum state. For some reason, it looks like this concept of wave function entropy is not used much in physics. The quantum-mechanical version of entropy that is typically referred to is the von Neumann entropy, which involves uncertainty over which quantum state a system is in (rather than uncertainty intrinsic to a quantum state).

I’ve been looking into some of the implications of the concept of wave function entropy, and found a few interesting things.

Firstly, let’s just go over what precisely wave function entropy is.

Quantum mechanics is primarily concerned with calculating the wave function Ψ(x), which distributes complex amplitudes over configuration space (the space of positions x, rather than phase space). The physical meaning of these amplitudes is interpreted by taking their absolute square Ψ*Ψ, which is a probability distribution.

Thus, the entropy of the wave function is given by:

S = – ∫ Ψ*Ψ ln(Ψ*Ψ) dx
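As a sanity check on this definition, here is a sketch that evaluates the integral numerically for a case with a known answer. The assumptions are mine: one dimension, a real Gaussian wave packet of width σ = 1 in dimensionless units, and a plain midpoint rule. The function name `wave_function_entropy` is hypothetical. Keep in mind that the value of a continuous entropy like this depends on the units you measure x in.

```python
import math

def wave_function_entropy(density, lo, hi, steps=200_000):
    """Midpoint-rule estimate of S = -∫ ρ ln ρ dx for a density ρ = Ψ*Ψ."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        rho = density(lo + (i + 0.5) * dx)
        if rho > 0.0:                      # ρ ln ρ → 0 as ρ → 0
            total -= rho * math.log(rho) * dx
    return total

sigma = 1.0

def gaussian_density(x):
    """Ψ*Ψ for a Gaussian wave packet with position spread sigma."""
    return math.exp(-x * x / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

s_numeric = wave_function_entropy(gaussian_density, -12.0, 12.0)
s_exact = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
print(f"numeric: {s_numeric:.6f}   closed form: {s_exact:.6f}")
```

The closed form ½ ln(2πeσ²) is the standard differential entropy of a Gaussian, so the two numbers should agree closely.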

As an example, I’ll write out some of the wave functions for the basic hydrogen atom:

(Ψ*Ψ)_{1s} = e^{-2r} / π
(Ψ*Ψ)_{2s} = (2 – r)² e^{-r} / 32π
(Ψ*Ψ)_{2p} = r² e^{-r} cos²(θ) / 32π
(Ψ*Ψ)_{3s} = (2r² – 18r + 27)² e^{-2r/3} / 19683π

With these wave functions in hand, we can go ahead and calculate the entropies! Some of the integrals are intractable, so using numerical integration, we get:

S_{1s} ≈ 70
S_{2s} ≈ 470
S_{2p} ≈ 326
S_{3s} ≈ 1320

The increasing values for (1s, 2s, 3s) make sense – higher energy wave functions are more dispersed, meaning that there is greater uncertainty in the electron’s spatial distribution.

Let’s go into something a bit more theoretically interesting.

We’ll be interested in a generalization of entropy – relative entropy. This will quantify, rather than pure uncertainty, changes in uncertainty from a prior probability distribution ρ to our new distribution Ψ*Ψ. This will be the quantity we’ll denote S from now on.

S = – ∫ Ψ*Ψ ln(Ψ*Ψ/ρ) dx

Now, suppose we’re interested in calculating the wave functions Ψ that are local maxima of entropy. This means we want to find the Ψ for which δS = 0. Of course, we also want to ensure that a few basic constraints are satisfied. Namely,

∫ Ψ*Ψ dx = 1
∫ Ψ*HΨ dx = E

These constraints are chosen by analogy with the constraints in ordinary statistical mechanics – normalization and average energy. H is the Hamiltonian operator, which corresponds to the energy observable.

We can find the critical points of entropy that satisfy the constraint by using the method of Lagrange multipliers. Our two Lagrange multipliers will be α (for normalization) and β (for energy). This gives us the following equation for Ψ:

Ψ ln(Ψ*Ψ/ρ) + (α + 1)Ψ + βHΨ = 0
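For anyone who wants the intermediate step, here is a sketch of the variation. The assumptions are mine: Ψ and Ψ* are varied independently (the usual trick for complex fields), ρ is held fixed, and the signs of the multiplier terms are chosen so that the result matches the equation above.

```latex
\begin{aligned}
\mathcal{L}[\Psi, \Psi^*] &= -\int \Psi^*\Psi \,\ln\frac{\Psi^*\Psi}{\rho}\,dx
  \;-\; \alpha\left(\int \Psi^*\Psi\,dx - 1\right)
  \;-\; \beta\left(\int \Psi^* H \Psi\,dx - E\right) \\[4pt]
\frac{\delta \mathcal{L}}{\delta \Psi^*} &=
  -\Psi \ln\frac{\Psi^*\Psi}{\rho} \;-\; \Psi \;-\; \alpha\Psi \;-\; \beta H\Psi \;=\; 0 \\[4pt]
\Longrightarrow\quad & \Psi \ln\frac{\Psi^*\Psi}{\rho} + (\alpha + 1)\,\Psi + \beta H\Psi = 0
\end{aligned}
```

The lone Ψ in the second line (the source of the "+ 1") comes from differentiating the logarithm: Ψ*Ψ times the derivative of ln(Ψ*Ψ) with respect to Ψ* is Ψ*Ψ/Ψ* = Ψ.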

We can rewrite this as an operator equation, which gives us

ln(Ψ*Ψ/ρ) + (α + 1) + βH = 0
Ψ*Ψ = (ρ/Z) e^{-βH}

Here we’ve renamed our constants so that Z = e^{α+1} is a normalization constant.

So we’ve solved the wave function equation… but what does this tell us? If you’re familiar with some basic quantum mechanics, our expression should look somewhat familiar to you. Let’s backtrack a few steps to see where this familiarity leads us.

Ψ ln(Ψ*Ψ/ρ) + (α + 1)Ψ + βHΨ = 0
HΨ + 1/β ln(Ψ*Ψ/ρ) Ψ = – (α + 1)/β Ψ

Let’s rename – (α + 1)/β to a new constant λ. And we’ll take a hint from statistical mechanics and call 1/β the temperature T of the state. Now our equation looks like

HΨ + T ln(Ψ*Ψ/ρ) Ψ = λΨ

This equation is almost the time-independent Schrödinger equation. In particular, the time-independent Schrödinger equation pops out as the zero-temperature limit of this equation:

As T → 0,
our equation becomes…
HΨ = λΨ

The obvious interpretation of the constant λ in the zero temperature limit is E, the energy of the state. 

What about in the infinite-temperature limit?

As T → ∞,
our equation becomes…
Ψ*Ψ = ρ

Why is this? Because the only solution to the equation in this limit is for ln(Ψ*Ψ/ρ) → 0, or in other words Ψ*Ψ/ρ → 1

And what this means is that in the infinite temperature limit, the critical entropy wave function is just that which gives the prior distribution.
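Both the general solution and the infinite-temperature limit can be checked numerically in the simplest special case I can think of. The sketch below assumes (purely for illustration, none of this is in the derivation above) a one-dimensional grid, a Hamiltonian that acts as plain multiplication by a potential V(x) with no kinetic term, and a Gaussian prior ρ. In that case the critical state is Ψ*Ψ = ρ e^(-V/T)/Z, with λ = -T ln Z.

```python
import math

# Assumptions for illustration only: a 1-D grid, H acting as multiplication
# by a potential V(x) (no kinetic term), and a Gaussian prior rho.
dx = 0.01
xs = [-4.0 + i * dx for i in range(801)]
V = [0.5 * x * x for x in xs]
rho_raw = [math.exp(-0.5 * x * x) for x in xs]
norm = sum(rho_raw) * dx
rho = [r / norm for r in rho_raw]  # prior, normalized on the grid


def critical_state(T):
    """Return (psi, lam) with psi^2 = rho * exp(-V/T) / Z and lam = -T ln Z."""
    weights = [r * math.exp(-v / T) for r, v in zip(rho, V)]
    Z = sum(weights) * dx  # normalization constant
    psi = [math.sqrt(w / Z) for w in weights]
    return psi, -T * math.log(Z)


def max_residual(T):
    """Largest |H psi + T ln(psi^2/rho) psi - lam psi| over the grid."""
    psi, lam = critical_state(T)
    return max(
        abs(v * p + T * math.log(p * p / r) * p - lam * p)
        for v, p, r in zip(V, psi, rho)
    )


def max_gap_from_prior(T):
    """Largest |psi^2 - rho| over the grid."""
    psi, _ = critical_state(T)
    return max(abs(p * p - r) for p, r in zip(psi, rho))


print(max_residual(1.0))        # should sit at rounding-error level
print(max_gap_from_prior(1e6))  # should be tiny: psi^2 approaches rho
```

With a genuine kinetic term the operator H no longer commutes with the logarithm, so this check says nothing about the general case.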

We can interpret this result as a generalization of the Schrodinger equation. Rather than a linear equation, we now have an additional logarithmic nonlinearity. I’d be interested to see how the general solutions to this equation differ from the standard equations, but that’s for another post.

HΨ + T ln(Ψ*Ψ/ρ) Ψ = λΨ

Entropy is expected surprise

Today we’re going to talk about a topic that’s very close to my heart: entropy. We’ll start somewhere that might seem unrelated: surprise.

Suppose that we wanted to quantify the intuitive notion of surprise. How should we do that?

We’ll start by analyzing a few base cases.

First! If something happens and you already were completely certain that it would happen, then you should be completely unsurprised.

That is, if event E happens, and you had a credence P(E) = 100% in it happening, then your surprise S should be zero.

S(1) = 0

Second! If something happens that you were totally sure was impossible, with 100% credence, then you should be infinitely surprised.

That is, if E happens and P(E) = 0, then S = ∞.

S(0) = ∞

So far, it looks like your surprise S should be a function of your credence P in the event you are surprised at. That is, S = S(P). We also have the constraints that S(1) = 0 and S(0) = ∞.

There are many candidates for a function like this, for example: S(P) = 1/P – 1, S(P) = -log(P), S(P) = cot(πP/2). So we need more constraints.

Third! If an event E1 happens that is surprising to degree S1, and then another event E2 happens with surprisingness S2, then your surprise at the combination of these events should be S1 + S2.

I.e., we want surprise to be additive. If S(P(E1)) = S1 and S(P(E2 | E1)) = S2, then S(P(E1 & E2)) = S1 + S2.

This entails a new constraint on our surprise function, namely:

S(PQ) = S(P) + S(Q)

Fourth, and finally! We want our surprise function to be continuous – free from discontinuous jumps. If your credence that the event will happen changes by an arbitrarily small amount, then your surprise if it does happen should also change by an arbitrarily small amount.

S(P) is continuous.

These four constraints now fully specify the form of our surprise function, up to a multiplicative constant. What we find is that the only function satisfying these constraints is the logarithm:

S(P) = k logP, where k is some negative number

Taking the simplest choice of k, we end up with a unique formalization of the intuitive notion of surprise:

S(P) = – logP
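Here is a minimal sketch that checks the constraints against this formula. It assumes base-2 logarithms (so surprise is measured in bits); the function name `surprise` is just a label chosen for the example.

```python
import math


def surprise(p: float) -> float:
    """Surprise, in bits, at an event you gave credence p."""
    if p == 0.0:
        return math.inf  # an "impossible" event is infinitely surprising
    return math.log2(1.0 / p)


# Constraint 1: certain events carry no surprise.
print(surprise(1.0))
# Constraint 2: events you ruled out are infinitely surprising.
print(surprise(0.0))
# Constraint 3: surprise adds when credences multiply.
p, q = 0.5, 0.25
print(surprise(p * q), surprise(p) + surprise(q))
```

Continuity (the fourth constraint) is not something a handful of evaluations can verify, so it is left out of the check.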

To summarize what we have so far: Four basic desiderata for our formalization of the intuitive notion of surprise have led us to a single simple equation.

This equation that we’ve arrived at turns out to be extremely important in information theory. It is, in fact, just the definition of the amount of information you gain by observing E. This reveals to us a deep connection between surprise and information. They are in an important sense expressing the same basic idea: more surprising events give you more information, and unsurprising events give you little information.

Let’s get a little better numerical sense of this formalization of surprise/information. What does a single unit of surprise or information mean? With some quick calculation, we see that a single unit of surprise, or bit of information corresponds to the observation of an event that you had a 50% expectation of. This also corresponds to a ruling out of 50% of the weight of possible other events you thought you might have observed. In essence, each bit of information you receive / surprise you experience corresponds to the total amount of possibilities being cut in half.

Two bits of information narrow the possibilities to one-fourth. Three cut out all but one-eighth. And so on. For a rational agent, the process of receiving more information or of being continuously surprised is the process of whittling down your models of reality to a smaller and better set!

The next great step forward is to use our formalization of surprise to talk not just about how surprised you are once an event happens, but how surprised you expect to be. If you have a credence of P in an event happening, then you expect a degree of surprise S(P) with credence P. In other words, the expected surprise you have with respect to that particular event is:

Expected surprise = – P logP

When summed over the totality of all possible events that could occur, we get the following expression:

Total expected surprise = – ∑i Pi logPi
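As a small sketch of this sum in practice (base-2 logarithms assumed, and the three example distributions are made up for illustration):

```python
import math


def expected_surprise(probs):
    """Total expected surprise, in bits, of a discrete distribution."""
    if not math.isclose(sum(probs), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Outcomes with zero credence contribute nothing: p * log(1/p) -> 0 as p -> 0.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
certain = [1.0, 0.0]

for name, dist in [("fair", fair_coin), ("biased", biased_coin), ("certain", certain)]:
    print(name, expected_surprise(dist))
```

The fair coin comes out at exactly one bit, the biased coin at a bit less than half of that, and the certain outcome at zero.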

This expression should look very very familiar to you. It’s one of the most important quantities humans have discovered…


Now you understand the title of this post. Quite literally, entropy is total expected surprise!

Entropy = Total expected surprise

By the way, you might be wondering if this is the same entropy as you hear mentioned in the context of physics (that thing that always increases). Yes, it is identical! This means that we can describe the Second Law of Thermodynamics as a conspiracy by the universe to always be as surprising as possible to us! There are a bunch of ways to explore the exact implications of this, but that’s a subject for another post.

Getting back to the subject of this post, we can now make another connection. Surprise is information. Total expected surprise is entropy. And entropy is a measure of uncertainty.

If you think about this for a moment, this should start to make sense. If your model of reality is one in which you expect to be very surprised in the next moment, then you are very uncertain about what is going to happen in the next moment. If, on the other hand, your model of reality is one in which you expect zero surprise in the next moment, then you are completely certain!

Thus we see the beautiful and deep connection between surprise, information, entropy, and uncertainty. The overlap of these four concepts is rich with potential for exploration. We could go the route of model selection and discuss notions like mutual information, information divergence, and relative entropy, and how they relate to the virtues of predictive accuracy and model simplicity. We could also go the route of epistemology and discuss the notion of epistemic humility, choosing your beliefs to maximize your uncertainty, and the connection to Bayesian epistemology. Or, most tantalizingly, we could go the route of physics and explore the connection between this highly subjective sense of entropy as surprise/uncertainty, and the very concrete notion of entropy as a physical quantity that characterizes the thermal properties of systems.

Instead of doing any of these, I’ll do none, and end here in hope that I’ve conveyed some of the coolness of this intersection of philosophy, statistics, and information theory.

Principle of Maximum Entropy

Previously, I talked about the principle of maximum entropy as the basis of statistical mechanics, and gave some intuitive justifications for it. In this post I want to present a more rigorous justification.

Our goal is to find a function that uniquely quantifies the amount of uncertainty that there is in our model of reality. I’m going to use very minimal assumptions, and will point them out as I use them.


Here’s the setup. There are N boxes, and you know that a ball is in one of them. We’ll label the possible locations of the ball as:

B1, B2, B3, … BN
where Bn = “The ball is in box n”

The full state of our knowledge about which box the ball is in will be represented by a probability distribution.

(P1, P2, P3, … PN)
where Pn = the probability that the ball is in box n

Our ultimate goal is to uniquely prescribe an uncertainty measure S that will take in the distribution P and return a numerical value.

S(P1, P2, P3, … PN)

Our first assumption is that this function is continuous. When you make arbitrarily small changes to your distribution, you don’t get discontinuous jumps in your entropy. We’ll use this in a few minutes.

We’ll start with a simpler case than the general distribution – a uniform distribution, where the ball is equally likely to be in any of the N boxes.

For all n, Pn = 1/N


We’ll give the entropy of a uniform distribution a special name, labeled U for ‘uniform’:

S(1/N, 1/N, …, 1/N) = U(N)

Our next and final assumption is going to relate to the way that we combine our knowledge. In words, it will be that the uncertainty of a given distribution should be the same, regardless of how we represent the distribution. We’ll lay this out more formally in a moment.

Before that, imagine enclosing our boxes in M different containers, like this:


Now we can represent the same state of knowledge as our original distribution by specifying first the probability that the ball is in a given container, and then the probability that it is in a given box, given that it is in that container.

Qn = probability that the ball is in container n
Pm|n = probability that the ball is in box m, given that it’s in container n


Notice that the value of each Qn is just the number of boxes in the container divided by the total number of boxes. In addition, the conditional probability that the ball is in box m, given that it’s in container n, is just one over the number of boxes in the container. We’ll write these relationships as

Qn = |Cn| / N
Pm|n = 1 / |Cn|

The point of all this is that we can now formalize our second assumption. Our initial state of knowledge was given by a single distribution Pn. Now it is given by two distributions: Qn and Pm|n.

Since these represent the same amount of knowledge about which box the ball is in, the entropy of each should be the same.


And in general:

Initial entropy = S(1/N, 1/N, …, 1/N) = U(N)
Final entropy = S(Q1, Q2, …, QM) + ∑i Qi · S(P1|i, P2|i, …, PN|i)
= S(Q1, Q2, …, QM) + ∑i Qi · U(|Ci|)

The form of this final entropy is the substance of the uncertainty combination rule. First you compute the uncertainty of each individual distribution. Then you add them together, but weight each one by the probability that you encounter that uncertainty.

Why? Well, a conditional probability like Pm|n represents the probability that the ball is in box m, given that it’s in container n. You will only have to consider this probability if you discover that the ball is in container n, which happens with probability Qn.

With this, we’re almost finished.

First of all, notice that we have the following equality:

S(Q1, Q2, …, QM) = U(N) – ∑i [ Qi · U(|Ci|) ]

In other words, if we determine the general form of the function U, then we will have uniquely determined the entropy S for any arbitrary distribution!

And we can determine the general form of the function U by making a final simplification: assume that the containers all contain an equal number of boxes.

This means that Qn will be a uniform distribution over the M possible containers. And if there are M containers and N boxes, then this means that each container contains N/M boxes.

For all n, Qn = 1/M and |Cn| = N/M

If we plug this all in, we get that:

S(1/M, 1/M, …, 1/M) = U(N) – ∑i [1/M · U(N/M)]
U(M) = U(N) – U(N/M)
U(N/M) = U(N) – U(M)

There is only one continuous function that satisfies this equation, and it is the logarithm:

U(N) = K log(N) for some constant K

And we have uniquely determined the form of our entropy function, up to a constant factor K!

S(Q1, Q2, …, QM) = K log(N)  –  K ∑i Qi log(| Ci |)
= – K ∑i Qi log(| Ci |/N)
= – K ∑i Qi log(Qi)
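The chain of equalities can be sanity-checked numerically. This is only a sketch: the container sizes are hypothetical, K is set to 1, and natural logarithms are used.

```python
import math


def uniform_entropy(n, k=1.0):
    """U(n) = K log(n), the entropy of a uniform distribution over n boxes."""
    return k * math.log(n)


def shannon_entropy(probs, k=1.0):
    """S(Q) = -K * sum_i Q_i log(Q_i)."""
    return -k * sum(q * math.log(q) for q in probs if q > 0)


container_sizes = [1, 2, 5]      # hypothetical |C_i|, chosen for illustration
n_boxes = sum(container_sizes)   # N = 8
q = [size / n_boxes for size in container_sizes]

# Left side: entropy of the container distribution from the final formula.
lhs = shannon_entropy(q)
# Right side: U(N) minus the expected within-container uncertainty.
rhs = uniform_entropy(n_boxes) - sum(
    qi * uniform_entropy(size) for qi, size in zip(q, container_sizes)
)
print(lhs, rhs)
```

The two printed values should agree up to floating-point rounding, for any choice of container sizes.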

If we add as a third assumption that U(N) should be monotonically increasing with N (that is, more boxes means more uncertainty, not less), then we can also specify that K should be a positive constant.

The three basic assumptions from which we can find the form of the entropy:

  1. S(P) is a continuous function of P.
  2. S should assign the same uncertainty to different representations of the same information.
  3. The entropy of a wide uniform distribution is greater than the entropy of a thin uniform distribution.

Statistical mechanics is wonderful

The law that entropy always increases, holds, I think, the supreme position among the laws of Nature. If someone points out to you that your pet theory of the universe is in disagreement with Maxwell’s equations — then so much the worse for Maxwell’s equations. If it is found to be contradicted by observation — well, these experimentalists do bungle things sometimes. But if your theory is found to be against the second law of thermodynamics I can give you no hope; there is nothing for it but to collapse in deepest humiliation.

 – Eddington

My favorite part of physics is statistical mechanics.

This wasn’t the case when it was first presented to me – it seemed fairly ugly and complicated compared to the elegant and deep formulations of classical mechanics and quantum mechanics. There were too many disconnected rules and special cases messily bundled together to match empirical results. Unlike the rest of physics, I failed to see the same sorts of deep principles motivating the equations we derived.

Since then I’ve realized that I was completely wrong. I’ve come to appreciate it as one of the deepest parts of physics I know, and mentally categorize it somewhere in the intersection of physics, math, and philosophy.

This post is an attempt to convey how statistical mechanics connects these fields, and to show concisely how some of the standard equations of statistical mechanics arise out of deep philosophical principles.


The fundamental goal of statistical mechanics is beautiful. It answers the question “How do we apply our knowledge of the universe on the tiniest scale to everyday life?”

In doing so, it bridges the divide between questions about the fundamental nature of reality (What is everything made of? What types of interactions link everything together?) and the types of questions that a ten-year old might ask (Why is the sky blue? Why is the table hard? What is air made of? Why are some things hot and others cold?).

Statistical mechanics peeks at the realm of quarks and gluons and electrons, and then uses insights from this realm to understand the workings of the world on a scale a factor of 10^21 larger.

Wilfrid Sellars described philosophy as an attempt to reconcile the manifest image (the universe as it presents itself to us, as a world of people and objects and purposes and values), and the scientific image (the universe as revealed to us by scientific inquiry, empty of purpose, empty of meaning, and animated by simple exact mathematical laws that operate like clockwork). This is what I see as the fundamental goal of statistical mechanics.

What is incredible to me is how elegantly it manages to succeed at this. The universality and simplicity of the equations of statistical mechanics are astounding, given the type of problem we’re dealing with. Physicists would like to say that once they’ve figured out the fundamental equations of physics, then we understand the whole universe. Rutherford said that “all science is either physics or stamp collecting.” But you try to take some laws that tell you how two electrons interact, and then answer questions about how 10^23 electrons will behave when all crushed together.

The miracle is that we can do this, and not only can we do it, but we can do it with beautiful, simple equations that are loaded with physical insight.

There’s an even deeper connection to philosophy. Statistical mechanics is about epistemology. (There’s a sense in which all of science is part of epistemology. I don’t mean this. I mean that I think of statistical mechanics as deeply tied to the philosophical foundations of epistemology.)

Statistical mechanics doesn’t just tell us what the world should look like on the scale of balloons and oceans and people. Some of the most fundamental concepts in statistical mechanics are ultimately about our state of knowledge about the world. It contains precise laws telling us what we can know about the universe, what we should believe, how we should deal with uncertainty, and how this uncertainty is structured in the physical laws.

While the rest of physics searches for perfect objectivity (taking the “view from nowhere”, in Nagel’s great phrase), statistical mechanics has one foot firmly planted in the subjective. It is an epistemological framework, a theory of physics, and a piece of beautiful mathematics all in one.


Enough gushing.

I want to express some of these deep concepts I’ve been referring to.

First of all, statistical mechanics is fundamentally about probability.

It accepts that trying to keep track of the positions and velocities of 10^23 particles all interacting with each other is futile, regardless of how much you know about the equations guiding their motion.

And it offers a solution: Instead of trying to map out all of the particles, let’s coarse-grain our model of the universe and talk about the likelihood that a given particle is in a given position with a given velocity.

As soon as we do this, our theory is no longer just about the universe in itself, it is also about us, and our model of the universe. Equations in statistical mechanics are not only about external objective features of the world; they are also about properties of the map that we use to describe it.

This is fantastic and I think really under-appreciated. When we talk about the results of the theory, we must keep in mind that these results must be interpreted in this joint way. I’ve seen many misunderstandings arise from failures of exactly this kind, like when people think of entropy as a purely physical quantity and take the second law of thermodynamics to be solely a statement about the world.

But I’m getting ahead of myself.

Statistical mechanics is about probability. So if we have a universe consisting of N = 10^80 particles, then we will create a function P that assigns a probability to every possible position for each of these particles at a given moment:

P(x1, y1, z1, x2, y2, z2, …, xN, yN, zN)

P is a function of 3•10^80 values… this looks super complicated. Where’s all this elegance and simplicity I’ve been gushing about? Just wait.

The second fundamental concept in statistical mechanics is entropy. I’m going to spend way too much time on this, because it’s really misunderstood and really important.

Entropy is fundamentally a measure of uncertainty. It takes in a model of reality and returns a numerical value. The larger this value, the more coarse-grained your model of reality is. And as this value approaches zero, your model approaches perfect certainty.

Notice: Entropy is not an objective feature of the physical world!! Entropy is a function of your model of reality. This is very very important.

So how exactly do we define the entropy function?

Say that a masked inquisitor tells you to close your eyes and hands you a string of twenty 0s and 1s. They then ask you what your uncertainty is about the exact value of the string.

If you don’t have any relevant background knowledge about this string, then you have no reason to suspect that any digit in the string is more likely to be a 0 than a 1 or vice versa. So perhaps your model places equal likelihood in every possible string. (This corresponds to a probability of ½ • ½ • … • ½ twenty times, or 1/2^20).

The entropy of this model is 20.

Now your inquisitor allows you to peek at only the first number in the string, and you see that it is a 1.

By the same reasoning, your model is now an equal distribution of likelihoods over all strings that start with 1.

The entropy of this model? 19.

If now the masked inquisitor tells you that they have added five new numbers at the end of your string, the entropy of your new model will be 24.

The idea is that if you are processing information right, then every time you get a single bit of information, your entropy should decrease by exactly 1. And every time you “lose” a bit of information, your entropy should increase by exactly 1.

In addition, when you have perfect knowledge, your entropy should be zero. This means that the entropy of your model can be thought of as the number of pieces of binary information you would have to receive to have perfect knowledge.

How do we formalize this?

Well, your initial model (back when there were 20 numbers and you had no information about any of them) gave each outcome a probability of P = 1/2^20. How do we get a 20 out of this? Simple!

Entropy = S = log2(1/P)

(Yes, entropy is denoted by S. Why? Don’t ask me, I didn’t invent the notation! But you’ll get used to it.)

We can check if this formula still works out right when we get new information. When we learned that the first number was a 1, half of our previous possibilities disappeared. Given that the others are all still equally likely, our new probabilities for each should double from 1/2^20 to 1/2^19.

And S = log2(1/(1/2^19)) = log2(2^19) = 19. Perfect!

What if you now open your eyes and see the full string? Well now your probability distribution is 0 over all strings except the one you see, which has probability 1.

So S = log2(1/1) = log2(1) = 0. Zero entropy corresponds to perfect information.
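The whole inquisitor story can be replayed in a few lines. This sketch just counts the equally likely strings at each stage and applies S = log2(1/P) with P = 1/count; the variable names are made up for the example.

```python
import math


def uniform_entropy_bits(num_possibilities: int) -> float:
    """S = log2(1/P) for a uniform model with P = 1/num_possibilities."""
    return math.log2(num_possibilities)


no_information = 2 ** 20                      # every 20-digit string is possible
first_digit_seen = no_information // 2        # only strings starting with 1 remain
five_more_digits = first_digit_seen * 2 ** 5  # each string can continue in 32 ways
full_string_seen = 1                          # eyes open: one string left

for label, count in [
    ("no information", no_information),
    ("first digit seen", first_digit_seen),
    ("five digits appended", five_more_digits),
    ("full string seen", full_string_seen),
]:
    print(label, uniform_entropy_bits(count))
```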

This is nice, but it’s a simple idealized case. What if we only get partial information? What if the masked stranger tells you that they chose the numbers by running a process that 80% of the time returns 0 and 20% of the time returns 1, and you’re fifty percent sure they’re lying?

In general, we want our entropy function to be able to handle models more sophisticated than just uniform distributions with equal probabilities for every event. Here’s how.

We can write out any arbitrary probability distribution over N possible outcomes as follows:

(P1, P2, …, PN)

As we’ve seen, if they were all equal then we would just find the entropy according to the previous equation: S = log2(1/P).

But if they’re not equal, then we can just find the weighted average! In other words:

S = mean(log2(1/P)) =∑ Pn log2(1/Pn)

We can put this into the standard form by noting that log(1/P) = -log(P).

And we have our general definition of entropy!

For discrete probabilities: S = – ∑ Pn log Pn
For continuous probabilities: S = – ∫ P(x) log P(x) dx
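To see the general formula handle the masked stranger's 80/20 claim, here is a rough sketch. It leans on one assumption that the story does not settle: if the stranger is lying, I fall back on the uniform model over all strings. Strings with the same number of 1s share a probability, so they are counted together rather than enumerated.

```python
import math


def entropy_bits(weighted_probs):
    """S = sum over outcomes of P * log2(1/P).

    weighted_probs holds (multiplicity, probability) pairs, so outcomes that
    share a probability are counted together.
    """
    return sum(m * p * math.log2(1.0 / p) for m, p in weighted_probs if p > 0)


n = 20
# Hypothesis A: each digit is 0 with probability 0.8 and 1 with probability 0.2.
# A string with k ones then has probability 0.8**(n - k) * 0.2**k.
biased = [(math.comb(n, k), 0.8 ** (n - k) * 0.2 ** k) for k in range(n + 1)]
# Hypothesis B (an assumption): a lying stranger leaves us with the uniform model.
uniform = [(2 ** n, 0.5 ** n)]
# Fifty percent credence in each hypothesis, string by string.
mixture = [(m, 0.5 * p + 0.5 * 0.5 ** n) for m, p in biased]

print(entropy_bits(biased), entropy_bits(uniform), entropy_bits(mixture))
```

The mixed model lands between the other two: believing the stranger halfway buys you some information, but less than believing them fully.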

(Aside: Physicists generally use a natural logarithm instead of log2 when they define entropy. This is just a difference in convention: e pops up more in physics and 2 in information theory. It’s a little weird, because now when entropy drops by 1 this means you’ve narrowed the options down to 1/e of what they were, instead of ½. But it makes equations much nicer.)

I’m going to spend a little more time talking about this, because it’s that important.

We’ve already seen that entropy is a measure of how much you know. When you have perfect and complete knowledge, your model has entropy zero. And the more uncertainty you have, the more entropy you have.

You can visualize entropy as a measure of the size of your probability distribution. Some examples you can calculate for yourself using the above equations:

Roughly, when you double the “size” of your probability distribution, you increase its entropy by 1.

But what does it mean to double the size of your probability distribution? It means that there are two times as many possibilities as you initially thought – which is equivalent to you losing one piece of binary information! This is exactly the connection between these two different ways of thinking about entropy.
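One way to make "doubling the size" concrete, as a sketch: take any distribution and split every outcome into two equally likely halves. The example distribution below is made up, and `double_size` is just a name for this particular construction.

```python
import math


def entropy_bits(probs):
    """S = sum of P * log2(1/P), in bits."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


def double_size(probs):
    """Split every outcome into two equally likely halves."""
    return [p / 2 for p in probs for _ in range(2)]


p = [0.5, 0.25, 0.125, 0.125]
print(entropy_bits(p))
print(entropy_bits(double_size(p)))
print(entropy_bits(double_size(double_size(p))))
```

Each doubling adds one bit, whatever the starting distribution looks like.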

Third: (I won’t name it yet so as to not ruin the surprise). This is so important that I should have put it earlier, but I couldn’t have because I needed to introduce entropy first.

So I’ve been sneakily slipping in an assumption throughout the last paragraphs. This is that when you don’t have any knowledge about the probability of a set of events, you should act as if all events are equally likely.

This might seem like a benign assumption, but it’s responsible for god-knows how many hours of heated academic debate. Here’s the problem: sure it seems intuitive to say that 0 and 1 are equally likely. But that itself is just one of many possibilities. Maybe 0 comes up 57% of the time, or maybe 34%. It’s not like you have any knowledge that tells you that 0 and 1 are objectively equally likely, so why should you favor that hypothesis?

Statistical mechanics answers this by just postulating a general principle: Look at the set of all possible probability distributions, calculate the entropy of each of them, and then choose the one with the largest entropy.

In cases where you have literally no information (like our earlier inquisitor-string example), this principle becomes the principle of indifference: spread your credences evenly among the possibilities. (Prove it yourself! It’s a fun proof.)
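This is not the proof, but a quick numerical sanity check can build confidence before you attempt it. The sketch samples many random distributions over six outcomes (the number six and the sampling scheme are arbitrary choices) and compares their entropies with the uniform one.

```python
import math
import random


def entropy_bits(probs):
    """S = sum of P * log2(1/P), in bits."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


def random_distribution(n, rng):
    """A random point on the probability simplex (not uniformly sampled)."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 6
uniform = [1.0 / n] * n
best_random = max(entropy_bits(random_distribution(n, rng)) for _ in range(10_000))

print(entropy_bits(uniform), best_random)
```

No random candidate should beat the uniform distribution, whose entropy is log2(6).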

But as a matter of fact, this principle doesn’t only apply to cases where you have no information. If you have partial or incomplete information, you apply the exact same principle by looking at the set of probability distributions that are consistent with this information and maximizing entropy.

This principle of maximum entropy is the foundational assumption of statistical mechanics. And it is a purely epistemic assumption. It is a normative statement about how you should rationally divide up your credences in the absence of information.

Said another way, statistical mechanics prescribes an answer to the problem of the priors, the biggest problem haunting Bayesian epistemologists. If you want to treat your beliefs like probabilities and update them with evidence, you have to have started out with an initial level of belief before you had any evidence. And what should that prior probability be?

Statistical mechanics says: It should be the probability that maximizes your entropy. And statistical mechanics is one of the best-verified and most successful areas of science. Somehow this is not loudly shouted in the pages of every text on Bayesianism.

There’s much more to say about this, but I’ll set it aside for the moment.


So we have our setup for statistical mechanics.

  1. Coarse-grain your model of reality by constructing a probability distribution over all possible microstates of the world.
  2. Construct this probability distribution according to the principle of maximum entropy.

Okay! So going back to our world of N = 10^80 particles jostling each other around, we now know how to construct our probability distribution P(x1, …, xN). (I’ve made the universe one-dimensional for no good reason except to pretty it up – everything I say follows exactly the same if I left it in 3D. I’ll also start writing the set of all N coordinates as X, again for prettiness.)

What probability distribution maximizes S = – ∫ P logP dX?

We can solve this with the method of Lagrange multipliers:

∂/∂P [ P logP + λP ] = 0,
where λ is chosen to satisfy: ∫ P dX = 1

This is such a nice equation and you should do yourself a favor and learn it, because I’m not going to explain it (if I explained everything, this post would become a textbook!).

But it essentially maximizes the value of S, subject to the constraint that the total probability is 1. When we solve it we find:

P(x1, …, xN) = 1/V^N, where V is the volume of the universe
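For anyone who wants the intermediate steps, they can be sketched in two lines, using the same sign convention for λ as the condition above and taking each coordinate to range over a region of size V:

```latex
\begin{aligned}
\frac{\partial}{\partial P}\bigl[\,P\log P + \lambda P\,\bigr] &= \log P + 1 + \lambda = 0
  &&\Longrightarrow\quad P = e^{-(1+\lambda)} \quad\text{(the same constant for every } X\text{)} \\
\int P\,dX &= e^{-(1+\lambda)}\,V^{N} = 1
  &&\Longrightarrow\quad P(x_1,\dots,x_N) = \frac{1}{V^{N}}
\end{aligned}
```

The first line says the maximizing distribution cannot depend on X at all; the second fixes its value through normalization.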

Remember earlier when I said to just wait for the probability equation to get simple?

Okay, so this is simple, but it’s also not very useful. It tells us that every particle has an equal probability of being in any equally sized region of space. But we want to know more. Like, are the higher energy particles distributed differently than the lower energy?

The great thing about statistical mechanics is that if you want a better model, you can just feed in more information to your distribution.

So let’s say we want to find the probability distribution, given two pieces of information: (1) we know the energy of every possible configuration of particles, and (2) the average total energy of the universe is fixed.

That is, we have a function E(x1, …, xN) that tells us energies, and we know that the total energy E = ∫ P(x1, …, xN)•E(x1, …, xN) dX is fixed.

So how do we find our new P? Using the same method as before:

∂/∂P [ P logP + λP + βEP ] = 0,
where λ is chosen to satisfy: ∫ P dX = 1
and β is chosen to satisfy: ∫ P•E dX = E

This might look intimidating, but it’s really not. I’ll write out how to solve this:

∂/∂P [ P logP + λP + βEP ]
= logP + 1 + λ + βE = 0
So P = e^(-(1+λ)) • e^(-βE)
Renaming our first term, we get:
P(X) = 1/Z • e^(-βE(X))

This result is called the Boltzmann distribution, and it’s one of the incredibly important must-know equations of statistical mechanics. The amount of physics you can do with just this one equation is staggering. And we got it by just adding conservation of energy to the principle of maximum entropy.

Maybe you’re disturbed by the strange new symbols Z and β that have appeared in the equation. Don’t fear! Z is simply a normalization constant: it’s there to keep the probability of the total distribution at 1. We can calculate it explicitly:

Z = ∫ e^(-βE) dX
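To see λ and β being "chosen to satisfy" their constraints, here is a sketch on a toy system with four discrete states instead of a continuum. The energy levels, the target average energy and the bisection bounds are all hypothetical choices made for the example.

```python
import math


def boltzmann(energies, beta):
    """P(X) = exp(-beta * E(X)) / Z over a finite set of states."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)  # the normalization constant Z
    return [w / z for w in weights]


def mean_energy(energies, beta):
    """Average energy under the Boltzmann distribution at this beta."""
    return sum(p * e for p, e in zip(boltzmann(energies, beta), energies))


def solve_beta(energies, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on beta; mean energy falls monotonically as beta rises."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean_energy(energies, mid) > target:
            lo = mid  # too much energy on average: beta must be larger
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2


energies = [0.0, 1.0, 2.0, 3.0]  # hypothetical energy levels
beta = solve_beta(energies, target=1.0)
probs = boltzmann(energies, beta)
print(beta, probs)
```

Normalization is handled by dividing by Z, and β is whatever value makes the average energy hit its target.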

And β is really interesting. Notice that β came into our equations because we had to satisfy this extra constraint about a fixed total energy. Is there some nice physical significance to this quantity?

Yes, very much so. β is what we humans like to call ‘temperature’, or more precisely, inverse temperature.

β = 1/T

While avoiding the math, I can just say the following: Temperature is defined to be the change in the energy of a system when you change its entropy a little bit. (This definition is much more general than the special case definition of temperature as average kinetic energy)

And it turns out that when you manipulate the above equations a little bit, you see that ∂E/∂S = 1/β = T.
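The relation between energy, entropy and temperature can be checked with finite differences on a toy system. This is a sketch under assumptions: four hypothetical discrete energy levels, entropy in natural-log units, and an arbitrary β of 0.7.

```python
import math


def boltzmann(energies, beta):
    """P(X) = exp(-beta * E(X)) / Z over a finite set of states."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


def mean_energy(probs, energies):
    return sum(p * e for p, e in zip(probs, energies))


def entropy_nats(probs):
    """S = -sum P ln P, using the natural logarithm."""
    return -sum(p * math.log(p) for p in probs if p > 0)


energies = [0.0, 1.0, 2.0, 3.0]  # hypothetical energy levels
beta = 0.7
step = 1e-5

# Nudge beta slightly in both directions and compare the changes in E and S.
p_lo = boltzmann(energies, beta - step)
p_hi = boltzmann(energies, beta + step)
d_energy = mean_energy(p_hi, energies) - mean_energy(p_lo, energies)
d_entropy = entropy_nats(p_hi) - entropy_nats(p_lo)

temperature_estimate = d_energy / d_entropy
print(temperature_estimate, 1.0 / beta)
```

The ratio of the two small changes should match 1/β closely, which is the sense in which β is an inverse temperature.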

So we could rewrite our probability distribution as follows:

P(X) = 1/Z • e^(-E(X)/T)

Feed in your fundamental laws of physics to the energy function, and you can see the distribution of particles across the universe!

Let’s just look at the basic properties of this equation. First of all, we can see that the larger E(X)/T becomes, the smaller the probability of a particle being in X becomes. This corresponds both to particles scattering away from high-energy regions and to less densely populated systems having lower temperatures.

And the smaller E(X)/T, the larger P(X). This corresponds to particles densely clustering in low-energy areas, and dense clusters of particles having high temperatures.
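The dependence on E(X)/T is easy to see on a toy example. The sketch below uses four hypothetical energy levels and three arbitrary temperatures, in units where Boltzmann's constant is 1.

```python
import math


def boltzmann(energies, temperature):
    """P(X) = exp(-E(X) / T) / Z over a finite set of states."""
    weights = [math.exp(-e / temperature) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


energies = [0.0, 1.0, 2.0, 3.0]  # hypothetical energy levels

for temperature in (0.1, 1.0, 100.0):
    probs = boltzmann(energies, temperature)
    print(temperature, [round(p, 4) for p in probs])
```

At low temperature nearly all the probability piles onto the lowest-energy state; at high temperature the distribution flattens out toward uniform.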

There are too many other things I could say about this equation and others, and this post is already way too long. I want to close with a final note about the nature of entropy.

I said earlier that entropy is entirely a function of your model of reality. The universe doesn’t have an entropy. You have a model of the universe, and that model has an entropy. Regardless of what physical reality is like, if I hand you a model, you can tell me its entropy.

But at the same time, models of reality are linked to the nature of the physical world. So for instance, a very simple and predictable universe lends itself to very precise and accurate models of reality, and thus to lower-entropy models. And a very complicated and chaotic universe lends itself to constant loss of information and low-accuracy models, and thus to higher entropy.

It is this second world that we live in. Due to the structure of the universe, information is constantly being lost to us at enormous rates. Systems that start out simple eventually spiral off into chaotic and unpredictable patterns, and order in the universe is only temporary.

It is in this sense that statements about entropy are statements about physical reality. And it is for this reason that entropy always increases.

In principle, an omnipotent and omniscient agent could track the positions of all particles at all times, and this agent’s model of the universe would be always perfectly accurate, with entropy zero. For this agent, the entropy of the universe would never rise.

And yet for us, as we look at the universe, we seem to see only entropy-increasing interactions.

This might seem counterintuitive or maybe even impossible to you. How could the entropy rise for one agent and stay constant for another?

Imagine an ice cube sitting out on a warm day. The ice cube is in a highly ordered and understandable state. We could sit down and write out a probability distribution, taking into account the crystalline structure of the water molecules and the shape of the cube, and have a fairly low-entropy and accurate description of the system.

But now the ice cube starts to melt. What happens? Well, our simple model starts to break down. We start losing track of where particles are going, and having trouble predicting what the growing puddle of water will look like. And by the end of the transition, when all that’s left is a wide spread-out wetness across the table, our best attempts to describe the system will inevitably remain higher-entropy than what we started with.

Our omniscient agent looks at the ice cube and sees all the particles exactly where they are. There is no mystery to him about what will happen next – he knows exactly how all the water molecules are interacting with one another, and can easily determine which will break their bonds first. What looked like an entropy-increasing process to us was an entropy-neutral process to him, because his model never lost any accuracy.
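
The contrast between the two agents can be sketched with a toy model. This is not a simulation of melting ice: it is one particle hopping around a ring of cells, with entropy taken to be the Shannon entropy (in bits) of each agent's probability distribution. The ring size, the hopping rule, and the names `blur` and `entropy_bits` are all assumptions made for illustration. The limited observer only knows that the particle stays put or hops to a neighbor each step, while the omniscient agent knows exactly where it goes.

```python
import math
import random

def entropy_bits(distribution):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return sum(-p * math.log2(p) for p in distribution if p > 0)

def blur(distribution):
    """Limited observer's update: the particle stayed or hopped to a neighbor, each equally likely."""
    n = len(distribution)
    return [
        (distribution[(i - 1) % n] + distribution[i] + distribution[(i + 1) % n]) / 3
        for i in range(n)
    ]

n_cells = 32
rng = random.Random(0)  # stands in for the true dynamics, fully known to the omniscient agent
position = 0

# Both agents start out knowing exactly where the particle is.
observer = [0.0] * n_cells
observer[position] = 1.0

observer_entropy = [entropy_bits(observer)]
omniscient_entropy = [0.0]

for _ in range(20):
    position = (position + rng.choice([-1, 0, 1])) % n_cells
    observer = blur(observer)            # the observer's uncertainty spreads out
    omniscient = [0.0] * n_cells
    omniscient[position] = 1.0           # the omniscient agent still knows the exact cell
    observer_entropy.append(entropy_bits(observer))
    omniscient_entropy.append(entropy_bits(omniscient))

print("observer:  ", [round(h, 2) for h in observer_entropy[:5]], "...")
print("omniscient:", omniscient_entropy[:5], "...")
```

The particle follows the same trajectory in both descriptions. The observer's entropy climbs because his model loses track of the particle, while the omniscient agent's entropy stays at zero throughout.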

We saw the puddle as higher-entropy, because we started doing poorly at modeling it. And our models started performing poorly, because the system got too complex for our models.

In this sense, entropy is not just a physical quantity; it is also an epistemic quantity. It is both a property of the world and a property of our model of the world. The statement that the entropy of the universe increases is really the statement that the universe becomes harder for our models to compute over time.

Which is a really substantive statement. To know that we live in the type of universe that constantly increases in entropy is to know a lot about the way that the universe operates.

More reading here if you’re interested!