Constructing the world

In this six-and-a-half-hour lecture series, David Chalmers describes the concept of a minimal set of statements from which all other truths are a priori “scrutable” (meaning, basically, in-principle knowable or derivable).

What are the types of statements in this minimal set required to construct the world? Chalmers offers up four categories, and abbreviates this theory PQIT.


P is the set of physical facts (for instance, everything that would be accessible to a Laplacean demon). It can be thought of as essentially the initial conditions of the universe and the laws governing their changes over time.


Q is the set of facts about qualitative experience. We can see Chalmers’ rejection of physicalism here, as he doesn’t think that Q is eclipsed within P. Example of a type of statement that cannot be derived from P without Q: “There is a beige region in the bottom right of my visual field.”


Here’s a true statement: “I’m in the United States.” Could this be derivable from P and Q? Presumably not; we need another set of indexical truths that allows us to have “self-locating” beliefs and to engage in anthropic reasoning.


Suppose that P, Q, and I really are able to capture all the true statements there are to be captured. Well then, the statement “P, Q, and I really are able to capture all the true statements there are to be captured” is a true statement, and it is presumably not captured by P, Q, and I! In other words, we need some final negative statements that tell us that what we have is enough, and that there are no more truths out there. These “that’s all”-type statements are put into the set T.


So this is a basic sketch of Chalmers’ construction. I like that we can use these tags like PQIT or PT or QIT as a sort of philosophical zip code indicating the core features of a person’s philosophical worldview. I also want to think about developing this further. What other possible types of statements are out there that may not be captured in PQIT? Here is a suggestion for a more complete taxonomy:

p    microphysics
P    macrophysics (by which I mean all of science besides fundamental physics)
Q    consciousness
R    normative rationality
E    normative ethics
C    counterfactuals
L    mathematical / logical truths
I    indexicals
T    “that’s-all” statements

I’ve split P into big-P (macrophysics) and little-p (microphysics) to account for the disagreements about emergence and reductionism. Normativity here is broad enough to include both normative epistemic statements (e.g. “You should increase your credence in the next coin toss landing H after observing it land H one hundred times in a row”) and ethical statements. The others are fairly self-explanatory.

The most ontologically extravagant philosophical worldview would then be characterized as pPQRECLIT.

My philosophical address is pRLIT (with the addendum that I think C comes from p, and am really confused about Q). What’s yours?

A simple explanation of Bell’s inequality

Everybody knows that quantum mechanics is weird. But there are plenty of weird things in the world. We’ve pretty much come to expect that as soon as we look out beyond our little corner of the universe, we’ll start seeing intuition-defying things everywhere. So why does quantum mechanics get the reputation of being especially weird?

Bell’s theorem is a good demonstration of how the weirdness of quantum mechanics is in a realm of its own. It’s a set of proposed (and later actually verified) experimental results that seem to defy all attempts at classical interpretation.

The Experimental Results

Here is the experimental setup:


In the center of the diagram, we have a black box that spits out two particles every few minutes. These two particles fly in different directions to two detectors. Each detector has three available settings (marked by 1, 2, and 3) and two bulbs, one red and the other green.

Shortly after a particle enters the detector, one of the two bulbs flashes. Our experiment is simply this: we record which bulb flashes on both the left and right detector, and we take note of the settings on both detectors at the time. We then try randomly varying the detector settings, and collect data for many such trials.

Quick comprehension test: Suppose that what bulb flashes is purely a function of some property of the particles entering the detector, and the settings don’t do anything. Then we should expect that changes in the settings will not have any impact on the frequency of flashing for each bulb. It turns out that we don’t see this in the experimental results.

One more: Suppose that the properties of the particles have nothing to do with which bulb flashes, and all that matters is the detector settings. What do we expect our results to be in this case?

Well, then we should expect that changing the detector settings will change which bulb flashes, but that the variance in the bulb flashes should be able to be fully accounted for by the detector settings. It turns out that this also doesn’t happen.

Okay, so what do we see in the experimental results?

The results are as follows:

(1) When the two detectors have the same settings:
The same color of bulb always flashes on the left and right.

(2) When the two detectors have different settings:
The same color bulb flashes on the left and right 25% of the time.
Different color bulbs flash on the left and right 75% of the time.

In some sense, the paradox is already complete: some very minimal assumptions about the nature of reality tell us that these results are impossible. There is a hidden inconsistency within these results, and the only remaining task is to draw it out and make it obvious.


We’ll start our analysis by detailing our basic assumptions about the nature of the process.

Assumption 1: Lawfulness
The probability of an event is a function of all other events in the universe.

This assumption is incredibly weak. It just says that if you know everything about the universe, then you are able to place a probability distribution over future events. This isn’t even as strong as determinism, as it’s only saying that the future is a probabilistic function of the past. Determinism would be the claim that all such probabilities are 1 or 0, that is, the facts about the past fix the facts about the future.

From Assumption 1 we conclude the following:

There exists a function P(R | everything else) that accurately reports the frequency of the red bulb flashing, given the rest of the facts about the universe.

It’s hard to imagine what it would mean for this to be wrong. Even in a perfectly non-deterministic universe where the future is completely probabilistically independent of the past, we could still express what’s going to happen next probabilistically, just with all of the probabilities of events being independent. This is why even naming this assumption lawfulness is too strong – the “lawfulness” could be probabilistic, chaotic, and incredibly minimal.

The next assumption constrains this function a little more.

Assumption 2: Locality
The probability of an event only depends on events local to it.

This assumption is justified by virtually the entire history of physics. Over and over we find that particles influence each other’s behaviors through causal intermediaries. Einstein’s Special Theory of Relativity provides a precise limitation on causal influences; the absolute fastest that causal influences can propagate is the speed of light. The light cone of an event is defined as all the past events that could have causally influenced it, given the speed of light limit, and all future events that can be causally influenced by this event.

Combining Assumption 1 and Assumption 2, we get:

P(R | everything else) = P(R | local events)

So what are these local events? Given our experimental design, we have two candidates: the particle entering the detector, and the detector settings. Our experimental design explicitly rules out other causal influences by holding them fixed. The only things that vary are the detector settings and the types of particles produced by the central black box; all else is stipulated to be held constant.

Thus we get our third, and final assumption.

Assumption 3: Good experimental design
The only local events relevant to the bulb flashing are the particle that enters the detector and the detector setting.

Combining these three assumptions, we get the following:

P(R | everything else) = P(R | particle & detector setting)

We can think of this function a little differently, by asking about a particular particle with a fixed set of properties.

P_particle(R | detector setting)

We haven’t changed anything but the notation – this is the same function as what we originally had, just carrying a different meaning. Now it tells us how likely a given particle is to cause the red bulb to flash, given a certain detector setting. This allows us to categorize all the different types of particles by looking at all the different combinations of these three probabilities.

Particle type is defined by:
( P_particle(R | Setting 1), P_particle(R | Setting 2), P_particle(R | Setting 3) )

This fully defines our particle type for the purposes of our experiment. The set of particle types is the set of three-tuples of probabilities.

So to summarize, here are the only three assumptions we need to generate the paradox.

Lawfulness: Events happen with probabilities that are determined by facts about the universe.
Locality: Causal influences propagate locally.
Good experimental design: Only the particle type and detector setting influence the experiment result.

Now, we generate a contradiction between these assumptions and the experimental results!


Recall our experimental results:

(1) When the two detectors have the same settings:
The same color of bulb always flashes on the left and right.

(2) When the two detectors have different settings:
The same color bulb flashes on the left and right 25% of the time.
Different color bulbs flash on the left and right 75% of the time.

We are guaranteed by Assumptions 1 to 3 that there exists a function P_particle(R | detector setting) that describes the frequencies we observe for a detector. We have two particles and two detectors, so we are really dealing with two functions for each experimental trial.

Left particle: P_left(R | left setting)
Right particle: P_right(R | right setting)

From Result (1), we see that when left setting = right setting, the same color always flashes on both sides. This means two things: first, that the black box always produces two particles of the same type, and second, that the behavior observed in the experiment is deterministic.

Why must they be the same type? Well, if they were different, then we would expect different frequencies on the left and the right. Why determinism? If the results were at all probabilistic, then even if the probability functions for the left and right particles were the same, we’d expect to still see them sometimes give different results. Since they don’t, the results must be fully determined.

P_left(R | setting 1) = P_right(R | setting 1) = 0 or 1
P_left(R | setting 2) = P_right(R | setting 2) = 0 or 1
P_left(R | setting 3) = P_right(R | setting 3) = 0 or 1

This means that we can fully express particle types by a function that takes in a setting (1, 2, or 3), and returns a value (0 or 1) corresponding to whether or not the red bulb will flash. How many different types of particles are there? Eight!

Abbreviation: Pn = P(R | setting n)
P1 = 1, P2 = 1, P3 = 1 : (RRR)
P1 = 1, P2 = 1, P3 = 0 : (RRG)
P1 = 1, P2 = 0, P3 = 1 : (RGR)
P1 = 1, P2 = 0, P3 = 0 : (RGG)
P1 = 0, P2 = 1, P3 = 1 : (GRR)
P1 = 0, P2 = 1, P3 = 0 : (GRG)
P1 = 0, P2 = 0, P3 = 1 : (GGR)
P1 = 0, P2 = 0, P3 = 0 : (GGG)

The three-letter strings (like RRR) are short representations of which bulb will flash for each detector setting.

Now we are ready to bring in experimental result (2). In 25% of the cases in which the settings are different, the same bulbs flash on either side. Is this possible given our results? No! Check out the following table that describes what happens with RRR-type particles and RRG-type particles when the detectors have different settings.

(Left setting, Right setting)    RRR-type     RRG-type
1, 2                             R, R         R, R
1, 3                             R, R         R, G
2, 1                             R, R         R, R
2, 3                             R, R         R, G
3, 1                             R, R         G, R
3, 2                             R, R         G, R
                                 100% same    33% same

Obviously, if the particle always triggers a red flash, then any combination of detector settings will result in a red flash. So when the particles are the RRR-type, you will always see the same color flash on either side. And when the particles are the RRG-type, you end up seeing the same color bulb flash in only two of the six cases with different detector settings.

By symmetry, we can extend this to all of the other types.

Particle type    Percentage of the time that the same bulb flashes (different detector settings)
RRR              100%
RRG              33%
RGR              33%
RGG              33%
GRR              33%
GRG              33%
GGR              33%
GGG              100%
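You can verify this table by brute force: enumerate all eight deterministic particle types and count agreements over the six different-settings pairs. A minimal sketch in Python (the variable names are mine, purely illustrative):

```python
from itertools import product

# Each deterministic particle type maps the three settings (1, 2, 3) to a color.
settings = (1, 2, 3)
different_pairs = [(a, b) for a in settings for b in settings if a != b]

same_fraction = {}
for ptype in product("RG", repeat=3):  # RRR, RRG, ..., GGG
    same = sum(ptype[a - 1] == ptype[b - 1] for a, b in different_pairs)
    same_fraction["".join(ptype)] = same / len(different_pairs)

for name, frac in same_fraction.items():
    print(name, f"{frac:.0%}")
```

No mixture of these eight types can push the same-color rate below 2/6 ≈ 33%, which is exactly the content of the inequality.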

Recall, in our original experimental results, we found that the same bulb flashes 25% of the time when the detectors are on different settings. Is this possible? Is there any distribution of particle types that could be produced by the central black box that would give us a 25% chance of seeing the same color?

No! How could there be? No matter how the black box produces particles, the best it can do is generate a distribution without RRRs and GGGs, in which case we would see 33% instead of 25%. In other words, the lowest that this value could possibly get is 33%!

This is the contradiction. Bell’s inequality points out a contradiction between theory and observation:

Theory: P(same color flash | different detector settings) ≥ 33%
Experiment: P(same color flash | different detector settings) = 25%


We have a contradiction between experimental results and a set of assumptions about reality. So one of our assumptions has to go. Which one?

Assumption 3: Experimental design. Good experimental design can be challenged, but this would require more detail on precisely how these experiments are done. The key feature of this is that you would have to propose a mechanism by which changes to the detector setting end up altering other relevant background factors that affect the experiment results. You’d also have to be able to do this for all the other subtly different variants of Bell’s experiment that give the same result. While this path is open, it doesn’t look promising.

Assumption 1: Lawfulness. Challenging the lawfulness of the universe looks really difficult. As I said before, I can barely imagine what a universe that doesn’t adhere to some version of Assumption 1 looks like. It’s almost tautological that some function will exist that can probabilistically describe the behavior of the universe. The universe must have some behavior, and why would we be unable to describe it probabilistically?

Assumption 2: Locality. This leaves us with locality. This is also really hard to deny! Modern physics has repeatedly confirmed that the speed of light acts as a speed limit on causal interactions, and that any influences must propagate locally. But perhaps quantum mechanics requires us to overthrow this old assumption and reveal it as a mere approximation to a deeper reality, as has been done many times before.

If we abandon number 2, we are allowing for the existence of statistical dependencies between variables that are entirely causally disconnected. Here’s Bell’s inequality in a causal diagram:

[Causal diagram: Bell’s experimental setup, with an entanglement link between the left and right particles]

Since the detector settings on the left and the right are independent by assumption, we end up finding an unexplained dependence between the left particle and the right particle. Neither a common cause between them nor any sort of subjunctive dependence à la timeless decision theory is able to explain away this dependence. In quantum mechanics, this dependence is given a name: entanglement. But of course, naming it doesn’t make it any less mysterious. Whatever entanglement is, it is something completely new to physics, and it challenges our intuitions about the very structure of causality.

Statistical mechanics is wonderful

The law that entropy always increases, holds, I think, the supreme position among the laws of Nature. If someone points out to you that your pet theory of the universe is in disagreement with Maxwell’s equations — then so much the worse for Maxwell’s equations. If it is found to be contradicted by observation — well, these experimentalists do bungle things sometimes. But if your theory is found to be against the second law of thermodynamics I can give you no hope; there is nothing for it but to collapse in deepest humiliation.

 – Eddington

My favorite part of physics is statistical mechanics.

This wasn’t the case when it was first presented to me – it seemed fairly ugly and complicated compared to the elegant and deep formulations of classical mechanics and quantum mechanics. There were too many disconnected rules and special cases messily bundled together to match empirical results. Unlike in the rest of physics, I failed to see the same sorts of deep principles motivating the equations we derived.

Since then I’ve realized that I was completely wrong. I’ve come to appreciate it as one of the deepest parts of physics I know, and mentally categorize it somewhere in the intersection of physics, math, and philosophy.

This post is an attempt to convey how statistical mechanics connects these fields, and to show concisely how some of the standard equations of statistical mechanics arise out of deep philosophical principles.


The fundamental goal of statistical mechanics is beautiful. It answers the question “How do we apply our knowledge of the universe on the tiniest scale to everyday life?”

In doing so, it bridges the divide between questions about the fundamental nature of reality (What is everything made of? What types of interactions link everything together?) and the types of questions that a ten-year old might ask (Why is the sky blue? Why is the table hard? What is air made of? Why are some things hot and others cold?).

Statistical mechanics peeks at the realm of quarks and gluons and electrons, and then uses insights from this realm to understand the workings of the world on a scale a factor of 10^21 larger.

Wilfrid Sellars described philosophy as an attempt to reconcile the manifest image (the universe as it presents itself to us, as a world of people and objects and purposes and values), and the scientific image (the universe as revealed to us by scientific inquiry, empty of purpose, empty of meaning, and animated by simple exact mathematical laws that operate like clockwork). This is what I see as the fundamental goal of statistical mechanics.

What is incredible to me is how elegantly it manages to succeed at this. The universality and simplicity of the equations of statistical mechanics are astounding, given the type of problem we’re dealing with. Physicists would like to say that once they’ve figured out the fundamental equations of physics, then we understand the whole universe. Rutherford said that “all science is either physics or stamp collecting.” But try taking some laws that tell you how two electrons interact, and using them to answer questions about how 10^23 electrons will behave when all crushed together.

The miracle is that we can do this, and not only can we do it, but we can do it with beautiful, simple equations that are loaded with physical insight.

There’s an even deeper connection to philosophy. Statistical mechanics is about epistemology. (There’s a sense in which all of science is part of epistemology. I don’t mean this. I mean that I think of statistical mechanics as deeply tied to the philosophical foundations of epistemology.)

Statistical mechanics doesn’t just tell us what the world should look like on the scale of balloons and oceans and people. Some of the most fundamental concepts in statistical mechanics are ultimately about our state of knowledge about the world. It contains precise laws telling us what we can know about the universe, what we should believe, how we should deal with uncertainty, and how this uncertainty is structured in the physical laws.

While the rest of physics searches for perfect objectivity (taking the “view from nowhere”, in Nagel’s great phrase), statistical mechanics has one foot firmly planted in the subjective. It is an epistemological framework, a theory of physics, and a piece of beautiful mathematics all in one.


Enough gushing.

I want to express some of these deep concepts I’ve been referring to.

First of all, statistical mechanics is fundamentally about probability.

It accepts that trying to keep track of the positions and velocities of 10^23 particles all interacting with each other is futile, regardless of how much you know about the equations guiding their motion.

And it offers a solution: Instead of trying to map out all of the particles, let’s coarse-grain our model of the universe and talk about the likelihood that a given particle is in a given position with a given velocity.

As soon as we do this, our theory is no longer just about the universe in itself, it is also about us, and our model of the universe. Equations in statistical mechanics are not only about external objective features of the world; they are also about properties of the map that we use to describe it.

This is fantastic and I think really under-appreciated. When we talk about the results of the theory, we must keep in mind that these results must be interpreted in this joint way. I’ve seen many misunderstandings arise from failures of exactly this kind, like when people think of entropy as a purely physical quantity and take the second law of thermodynamics to be solely a statement about the world.

But I’m getting ahead of myself.

Statistical mechanics is about probability. So if we have a universe consisting of N = 10^80 particles, then we will create a function P that assigns a probability to every possible position for each of these particles at a given moment:

P(x1, y1, z1, x2, y2, z2, …, xN, yN, zN)

P is a function of 3·10^80 values… this looks super complicated. Where’s all this elegance and simplicity I’ve been gushing about? Just wait.

The second fundamental concept in statistical mechanics is entropy. I’m going to spend way too much time on this, because it’s really misunderstood and really important.

Entropy is fundamentally a measure of uncertainty. It takes in a model of reality and returns a numerical value. The larger this value, the more coarse-grained your model of reality is. And as this value approaches zero, your model approaches perfect certainty.

Notice: Entropy is not an objective feature of the physical world!! Entropy is a function of your model of reality. This is very very important.

So how exactly do we define the entropy function?

Say that a masked inquisitor tells you to close your eyes and hands you a string of twenty 0s and 1s. They then ask you what your uncertainty is about the exact value of the string.

If you don’t have any relevant background knowledge about this string, then you have no reason to suspect that any letter in the string is more likely to be a 0 than a 1 or vice versa. So perhaps your model places equal likelihood in every possible string. (This corresponds to a probability of ½ • ½ • … • ½ twenty times, or 1/2^20.)

The entropy of this model is 20.

Now your inquisitor allows you to peek at only the first number in the string, and you see that it is a 1.

By the same reasoning, your model is now an equal distribution of likelihoods over all strings that start with 1.

The entropy of this model? 19.

If now the masked inquisitor tells you that he has added five new numbers at the end of your string, the entropy of your new model will be 24.

The idea is that if you are processing information right, then every time you get a single bit of information, your entropy should decrease by exactly 1. And every time you “lose” a bit of information, your entropy should increase by exactly 1.

In addition, when you have perfect knowledge, your entropy should be zero. This means that the entropy of your model can be thought of as the number of pieces of binary information you would have to receive to have perfect knowledge.

How do we formalize this?

Well, your initial model (back when there were 20 numbers and you had no information about any of them) gave each outcome a probability of P = 1/2^20. How do we get a 20 out of this? Simple!

Entropy = S = log2(1/P)

(Yes, entropy is denoted by S. Why? Don’t ask me, I didn’t invent the notation! But you’ll get used to it.)

We can check if this formula still works out right when we get new information. When we learned that the first number was a 1, half of our previous possibilities disappeared. Given that the others are all still equally likely, our new probabilities for each should double from 1/2^20 to 1/2^19.

And S = log2(1/(1/2^19)) = log2(2^19) = 19. Perfect!

What if you now open your eyes and see the full string? Well now your probability distribution is 0 over all strings except the one you see, which has probability 1.

So S = log2(1/1) = log2(1) = 0. Zero entropy corresponds to perfect information.
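The string example can be checked directly. A small sketch of the uniform case (my own illustration, using only the standard library):

```python
import math

def uniform_entropy(num_possibilities):
    # S = log2(1/P), with P = 1/num_possibilities for a uniform model
    return math.log2(num_possibilities)

print(uniform_entropy(2**20))  # no peeking at the 20-digit string: 20.0
print(uniform_entropy(2**19))  # first digit known: 19.0
print(uniform_entropy(2**24))  # five digits appended: 24.0
print(uniform_entropy(1))      # full string seen: 0.0
```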

This is nice, but it’s a simple idealized case. What if we only get partial information? What if the masked stranger tells you that they chose the numbers by running a process that 80% of the time returns 0 and 20% of the time returns 1, and you’re fifty percent sure they’re lying?

In general, we want our entropy function to be able to handle models more sophisticated than just uniform distributions with equal probabilities for every event. Here’s how.

We can write out any arbitrary probability distribution over N binary events as follows:

(P1, P2, …, PN)

As we’ve seen, if they were all equal then we would just find the entropy according to the previous equation: S = log2(1/P).

But if they’re not equal, then we can just find the weighted average! In other words:

S = mean(log2(1/P)) = ∑ P_n log2(1/P_n)

We can put this into the standard form by noting that log(1/P) = -log(P).

And we have our general definition of entropy!

For discrete probabilities: S = – ∑ P_n log P_n
For continuous probabilities: S = – ∫ P(x) log P(x) dx

(Aside: Physicists generally use a natural logarithm instead of log2 when they define entropy. This is just a difference in convention: e pops up more in physics and 2 in information theory. It’s a little weird, because now when entropy drops by 1 this means you’ve excluded 1/e of the options, instead of ½. But it makes equations much nicer.)
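The general formula is a one-liner in code. A sketch using the base-2 convention from above, with the usual convention that zero-probability outcomes contribute nothing:

```python
import math

def entropy(probs):
    # S = -sum_n P_n log2(P_n); terms with P_n = 0 contribute 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25] * 4))   # uniform over 4 outcomes: 2.0
print(entropy([1/8] * 8))    # doubling the number of possibilities adds 1 bit: 3.0
print(entropy([0.5, 0.5]))   # a fair coin: 1.0
print(entropy([0.8, 0.2]))   # a biased coin carries less than 1 bit (about 0.72)
print(entropy([1.0]))        # certainty: zero entropy
```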

I’m going to spend a little more time talking about this, because it’s that important.

We’ve already seen that entropy is a measure of how much you know. When you have perfect and complete knowledge, your model has entropy zero. And the more uncertainty you have, the more entropy you have.

You can visualize entropy as a measure of the size of your probability distribution. Some examples you can calculate for yourself using the above equations:

Roughly, when you double the “size” of your probability distribution, you increase its entropy by 1.

But what does it mean to double the size of your probability distribution? It means that there are two times as many possibilities as you initially thought – which is equivalent to you losing one piece of binary information! This is exactly the connection between these two different ways of thinking about entropy.

Third: (I won’t name it yet so as to not ruin the surprise). This is so important that I should have put it earlier, but I couldn’t have because I needed to introduce entropy first.

So I’ve been sneakily slipping in an assumption throughout the last paragraphs. This is that when you don’t have any knowledge about the probability of a set of events, you should act as if all events are equally likely.

This might seem like a benign assumption, but it’s responsible for god-knows how many hours of heated academic debate. Here’s the problem: sure it seems intuitive to say that 0 and 1 are equally likely. But that itself is just one of many possibilities. Maybe 0 comes up 57% of the time, or maybe 34%. It’s not like you have any knowledge that tells you that 0 and 1 are objectively equally likely, so why should you favor that hypothesis?

Statistical mechanics answers this by just postulating a general principle: Look at the set of all possible probability distributions, calculate the entropy of each of them, and then choose the one with the largest entropy.

In cases where you have literally no information (like our earlier inquisitor-string example), this principle becomes the principle of indifference: spread your credences evenly among the possibilities. (Prove it yourself! It’s a fun proof.)

But as a matter of fact, this principle doesn’t only apply to cases where you have no information. If you have partial or incomplete information, you apply the exact same principle by looking at the set of probability distributions that are consistent with this information and maximizing entropy.

This principle of maximum entropy is the foundational assumption of statistical mechanics. And it is a purely epistemic assumption. It is a normative statement about how you should rationally divide up your credences in the absence of information.
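You can watch the principle of indifference fall out of maximum entropy numerically: among distributions over a fixed set of outcomes, nothing beats the uniform one. A quick sketch (a random search, not a proof):

```python
import math
import random

def entropy(probs):
    # S = -sum p log2 p, skipping zero-probability outcomes
    return -sum(p * math.log2(p) for p in probs if p > 0)

random.seed(0)
uniform = entropy([0.25] * 4)  # 2 bits

# Try many random distributions over the same 4 outcomes
best = 0.0
for _ in range(100_000):
    w = [random.random() for _ in range(4)]
    total = sum(w)
    best = max(best, entropy([x / total for x in w]))

print(uniform, best)  # the uniform distribution has the highest entropy
```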

Said another way, statistical mechanics prescribes an answer to the problem of the priors, the biggest problem haunting Bayesian epistemologists. If you want to treat your beliefs like probabilities and update them with evidence, you have to have started out with an initial level of belief before you had any evidence. And what should that prior probability be?

Statistical mechanics says: It should be the probability that maximizes your entropy. And statistical mechanics is one of the best-verified and most successful areas of science. Somehow this is not loudly shouted in the pages of every text on Bayesianism.

There’s much more to say about this, but I’ll set it aside for the moment.


So we have our setup for statistical mechanics.

  1. Coarse-grain your model of reality by constructing a probability distribution over all possible microstates of the world.
  2. Construct this probability distribution according to the principle of maximum entropy.

Okay! So going back to our world of N = 10^80 particles jostling each other around, we now know how to construct our probability distribution P(x1, …, xN). (I’ve made the universe one-dimensional for no good reason except to pretty things up – everything I say carries over exactly the same in 3D. I’ll also start writing the set of all N coordinates as X, again for prettiness.)

What probability distribution maximizes S = – ∫ P logP dX?

We can solve this with the method of Lagrange multipliers:

∂/∂P [ P logP + λP ] = 0,
where λ is chosen to satisfy: ∫ P dX = 1

This is such a nice equation and you should do yourself a favor and learn it, because I’m not going to explain it (if I explained everything, this post would become a textbook!).

But it essentially maximizes the value of S, subject to the constraint that the total probability is 1. When we solve it we find:

P(x1, …, xN) = 1/V^N, where V is the volume of the universe
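(For the curious, the omitted calculation is one line; here is a sketch, with the same functional derivative worked out explicitly for the Boltzmann case later on:)

```latex
\frac{\partial}{\partial P}\bigl[\,P\log P + \lambda P\,\bigr]
  = \log P + 1 + \lambda = 0
\;\Longrightarrow\; P = e^{-(1+\lambda)} \text{ (a constant)},
\qquad \int P\,dX = 1
\;\Longrightarrow\; P = \frac{1}{V^N}.
```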

Remember earlier when I said to just wait for the probability equation to get simple?

Okay, so this is simple, but it’s also not very useful. It tells us that every particle has an equal probability of being in any equally sized region of space. But we want to know more. Like, are the higher energy particles distributed differently than the lower energy?

The great thing about statistical mechanics is that if you want a better model, you can just feed in more information to your distribution.

So let’s say we want to find the probability distribution, given two pieces of information: (1) we know the energy of every possible configuration of particles, and (2) the average total energy of the universe is fixed.

That is, we have a function E(x_1, …, x_N) that tells us the energy of each configuration, and we know that the average total energy ⟨E⟩ = ∫ P(x_1, …, x_N)·E(x_1, …, x_N) dX is fixed.

So how do we find our new P? Using the same method as before:

∂/∂P [ P log P + λP + βEP ] = 0,
where λ is chosen to satisfy: ∫ P dX = 1
and β is chosen to satisfy: ∫ P·E dX = ⟨E⟩

This might look intimidating, but it’s really not. I’ll write out how to solve this:

∂/∂P [ P log P + λP + βEP ]
= log P + 1 + λ + βE = 0
So P = e^−(1+λ) · e^−βE
Renaming our first term, we get:
P(X) = (1/Z) · e^−βE(X)

This result is called the Boltzmann distribution, and it’s one of the incredibly important must-know equations of statistical mechanics. The amount of physics you can do with just this one equation is staggering. And we got it by just adding conservation of energy to the principle of maximum entropy.

Maybe you’re disturbed by the strange new symbols Z and β that have appeared in the equation. Don’t fear! Z is simply a normalization constant: it’s there to keep the probability of the total distribution at 1. We can calculate it explicitly:

Z = ∫ e^−βE dX
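To see the Boltzmann distribution in action, here is a discrete toy version (the energy levels are made up for illustration): each configuration gets probability e^(−βE)/Z, and Z is just whatever number makes the probabilities sum to 1.

```python
# Discrete Boltzmann distribution: P_i = exp(-beta * E_i) / Z,
# where Z = sum_i exp(-beta * E_i) normalizes the probabilities.
from math import exp

def boltzmann(energies, beta):
    weights = [exp(-beta * E) for E in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

energies = [0.0, 1.0, 2.0, 5.0]        # hypothetical allowed energies
p = boltzmann(energies, beta=1.0)

assert abs(sum(p) - 1.0) < 1e-12       # Z does its job
assert all(p[i] > p[i + 1] for i in range(len(p) - 1))  # higher E, lower P
```

Raising β (lowering the temperature) concentrates the probability on the lowest energy levels; lowering β flattens the distribution out.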

And β is really interesting. Notice that β came into our equations because we had to satisfy this extra constraint about a fixed total energy. Is there some nice physical significance to this quantity?

Yes, very much so. β is what we humans like to call ‘temperature’, or more precisely, inverse temperature.

β = 1/T

While avoiding the math, I can just say the following: Temperature is defined to be the change in the energy of a system when you change its entropy a little bit. (This definition is much more general than the special case definition of temperature as average kinetic energy)

And it turns out that when you manipulate the above equations a little bit, you see that ∂E/∂S = 1/β = T.
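We can check the identity ∂E/∂S = 1/β numerically on a toy two-level system (my own example, with made-up energies): compute the entropy and the average energy at two nearby values of β, and compare the ratio of the differences against 1/β.

```python
# Finite-difference check that dE/dS = 1/beta for a Boltzmann distribution
# over two energy levels (0 and 1).
from math import exp, log

energies = [0.0, 1.0]

def dist(beta):
    w = [exp(-beta * E) for E in energies]
    Z = sum(w)
    return [x / Z for x in w]

def avg_E(beta):
    return sum(p * E for p, E in zip(dist(beta), energies))

def entropy(beta):
    return -sum(p * log(p) for p in dist(beta))

beta, h = 0.7, 1e-6
dE = avg_E(beta + h) - avg_E(beta - h)
dS = entropy(beta + h) - entropy(beta - h)
assert abs(dE / dS - 1 / beta) < 1e-4   # dE/dS ≈ 1/β = T
```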

So we could rewrite our probability distribution as follows:

P(X) = (1/Z) · e^−E(X)/T

Feed in your fundamental laws of physics to the energy function, and you can see the distribution of particles across the universe!

Let’s just look at the basic properties of this equation. First of all, we can see that the larger E(X)/T becomes, the smaller the probability P(X) becomes: high-energy configurations are unlikely, and the colder the system, the more strongly they are suppressed. This is why particles tend to scatter away from high-energy regions.

And the smaller E(X)/T, the larger P(X): particles cluster in low-energy configurations, and as the temperature rises, the distribution flattens out, making high-energy configurations nearly as likely as low-energy ones.

There are too many other things I could say about this equation and others, and this post is already way too long. I want to close with a final note about the nature of entropy.

I said earlier that entropy is entirely a function of your model of reality. The universe doesn’t have an entropy. You have a model of the universe, and that model has an entropy. Regardless of what physical reality is like, if I hand you a model, you can tell me its entropy.
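Here’s that point made concrete (an illustration of my own, with made-up numbers): two agents model the same six-sided die, one with sharp information about the outcome and one with none. Same physical system, very different model entropies.

```python
from math import log

def entropy(p):
    return -sum(pi * log(pi) for pi in p if pi > 0)

# A well-informed model vs. a maximally ignorant model of one die roll.
confident = [0.97, 0.01, 0.005, 0.005, 0.005, 0.005]
ignorant  = [1 / 6] * 6

print(entropy(confident))  # small: the model nearly "knows" the outcome
print(entropy(ignorant))   # log(6) ≈ 1.79: maximum uncertainty
```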

But at the same time, models of reality are linked to the nature of the physical world. So for instance, a very simple and predictable universe lends itself to very precise and accurate models of reality, and thus to lower-entropy models. And a very complicated and chaotic universe lends itself to constant loss of information and low-accuracy models, and thus to higher entropy.

It is this second world that we live in. Due to the structure of the universe, information is constantly being lost to us at enormous rates. Systems that start out simple eventually spiral off into chaotic and unpredictable patterns, and order in the universe is only temporary.

It is in this sense that statements about entropy are statements about physical reality. And it is for this reason that entropy always increases.

In principle, an omnipotent and omniscient agent could track the positions of all particles at all times, and this agent’s model of the universe would always be perfectly accurate, with entropy zero. For this agent, the entropy of the universe would never rise.

And yet for us, as we look at the universe, we seem to constantly and only see entropy-increasing interactions.

This might seem counterintuitive or maybe even impossible to you. How could the entropy rise for one agent and stay constant for another?

Imagine an ice cube sitting out on a warm day. The ice cube is in a highly ordered and understandable state. We could sit down and write out a probability distribution, taking into account the crystalline structure of the water molecules and the shape of the cube, and have a fairly low-entropy and accurate description of the system.

But now the ice cube starts to melt. What happens? Well, our simple model starts to break down. We start losing track of where particles are going, and having trouble predicting what the growing puddle of water will look like. And by the end of the transition, when all that’s left is a wide spread-out wetness across the table, our best attempt at describing the system will inevitably be higher-entropy than the one we started with.

Our omniscient agent looks at the ice cube and sees all the particles exactly where they are. There is no mystery to him about what will happen next – he knows exactly how all the water molecules are interacting with one another, and can easily determine which will break their bonds first. What looked like an entropy-increasing process to us was an entropy-neutral process to him, because his model never lost any accuracy.

We saw the puddle as higher-entropy, because we started doing poorly at modeling it. And our models started performing poorly, because the system got too complex for our models.

In this sense, entropy is not just a physical quantity, it is an epistemic quantity. It is both a property of the world and a property of our model of the world. The statement that the entropy of the universe increases is really the statement that the universe becomes harder for our models to compute over time.

Which is a really substantive statement. To know that we live in the type of universe that constantly increases in entropy is to know a lot about the way that the universe operates.

More reading here if you’re interested!

Solution: How change arises in QM

Previously I pointed out that if you drew out the wave function of the entire universe by separating out its different energy components and shading each according to its amplitude, you would find that the universe appears completely static.

Energy superposition

This is correct according to standard quantum mechanics. If you looked at how much amplitude the universe had in any particular energy level, you would find that this amplitude was not changing in size.

The only change you would observe would be in the direction, or phase, of the amplitude in the complex plane. And directions of amplitudes in the complex plane are unphysical. Right?

No! While there is an important sense in which the direction of an amplitude is unphysical (the universe ultimately only computes magnitudes of amplitudes), there is a much much more important sense in which the direction of an amplitude contains loads of physical information.

This is because when the universe is in a superposition of different energy states, the amplitudes of these states can interfere.

It is here that we can find the answer to the question I posed in the previous post. Physical changes come from interference between the amplitudes of all the energy states that the universe is in superposition over.

One consequence of all of this is that if the universe did happen to be in a pure energy state, and not in a superposition of multiple energy levels, then change would be impossible.

From which we can conclude: The universe is in a superposition of energy levels, not in any clearly defined single energy level! (Proof: Look around and notice that stuff is happening)

This doesn’t mean, by the way, that the universe is actually in one of the energy levels and we just don’t know which. It also doesn’t mean that the universe is in some other distinct state found by averaging over all of the different energy states. “Superposition” is one of these funny words in quantum mechanics that doesn’t have an analogue in natural language. The best we can say is that the universe really truly is in all of the states in the superposition at once, and the degree to which it is in any particular state is the amplitude of that state.


Let’s imagine a simple toy universe with one dimension of space and one of time.

This universe is initially in an equal superposition of two pure energy states Φ_0(x) and Φ_1(x), each of which is a real function (no imaginary components). The first has zero energy, and we choose our units so that the second has an energy level equal to exactly 1.

So the wave function of our universe at time zero can be written Ψ = Φ_0 + Φ_1. (I’m ignoring normalization factors because they aren’t really crucial to the point here.)

And from this we can conclude that our probability density is:

P(x) = Ψ*·Ψ = Φ_0² + Φ_1² + 2·Φ_0·Φ_1

Now we advance forward in time. Applying the Schrödinger equation, we find:

Φ_0(x, t) = Φ_0(x)
Φ_1(x, t) = Φ_1(x) · e^(−it)

Notice that both of these energy states have a time-independent magnitude. The first one is obvious – it’s just completely static. The second one you can visualize as a function spinning in the complex plane, going from purely real and positive to purely imaginary to purely real and negative, et cetera. The magnitude of the function is just what you’d get by spinning it back to its positive real value.

From our two energy functions, we can find the total wave function of the universe:

Ψ(x, t) = Φ_0(x) + Φ_1(x) · e^(−it)

Already we can see that our time-dependent wave function is not a simple product of our time-independent wave function and a phase.

We can see the consequences of this by calculating the time-dependent probability density:

P(x, t) = Φ_0(x)² + Φ_1(x)² + Φ_0(x)·Φ_1(x)·(e^(−it) + e^(it))

P(x, t) = Φ_0(x)² + Φ_1(x)² + 2·Φ_0(x)·Φ_1(x)·cos(t)

And in our final result, we can see a clear time dependence of the spatial probability distribution over the universe. The last term will grow and shrink, oscillating over time and giving rise to dynamics.
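We can verify this algebra numerically (the specific Φ_0 and Φ_1 below are made up for illustration; any real functions work): |Φ_0 + Φ_1·e^(−it)|² should equal Φ_0² + Φ_1² + 2·Φ_0·Φ_1·cos(t) at every x and t.

```python
from math import sin, cos, pi
from cmath import exp as cexp

def phi0(x): return sin(pi * x)      # arbitrary real-valued energy states
def phi1(x): return sin(2 * pi * x)

for t in [0.0, 0.5, 1.3, 4.0]:
    for x in [0.1, 0.25, 0.7]:
        psi = phi0(x) + phi1(x) * cexp(-1j * t)
        lhs = abs(psi) ** 2
        rhs = phi0(x)**2 + phi1(x)**2 + 2 * phi0(x) * phi1(x) * cos(t)
        assert abs(lhs - rhs) < 1e-12
```

The cos(t) cross term is exactly the interference between the two energy states; set phi1 to zero and all time dependence disappears.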


We can visualize what’s going on here by looking at the time evolution of each pure energy state as if it’s spinning in the complex plane. For instance, if the universe was in a superposition of the lowest four energy levels we would see something like:


The length of the arrow represents the amplitude of that energy level – “how much” the universe is in that energy state. The arrows are spinning in the complex plane with a speed proportional to the energy level they represent.

The wave function of the universe is represented by the sum of all of these arrows, as if you stacked each on the head of the previous. And this sum is changing!

For instance, in the universe’s first moment, the superposition looks like this:

4-Rotating T=0

And later the universe looks like this:

4-Rotating T=1

If we plotted out the first two energy states scaled by their amplitudes, we might see the following spatial distributions, initially and finally:

Even though there have been no changes in the magnitudes of the arrows (the degree to which the universe exists in each energy level) we get a very different looking universe.

This is the basic idea that explains all change in the universe, from the rising and falling of civilizations to the births and deaths of black holes: they are results of the complex patterns of interference produced by spinning amplitudes.

Is quantum mechanics simpler than classical physics?

I want to make a few very fundamental comparisons between classical and quantum mechanics. I’ll be assuming a lot of background in this particular post to prevent it from getting uncontrollably long, but am planning on writing a series on quantum mechanics at some point.


Let’s assume that the universe consists of N simple point particles (where N is an ungodly large number), each interacting with the others in complicated ways according to their relative positions. These positions are written as x_1, x_2, …, x_N.

The classical description for this simple universe makes each position a function of time, and gives the following set of N equations of motion, one for each particle:

F_k(x_1, x_2, …, x_N) = m_k · ∂²x_k/∂t²

Each force function F_k will be a horribly messy nonlinear function of the positions of all the particles in the universe. These functions encode the details of all of the interactions taking place between the particles.

Analytically solving this system is completely hopeless – it’s a set of N separate equations, each one a highly nonlinear second-order differential equation. You couldn’t solve any one of them on its own, and on top of that, they are tightly coupled together, making it impossible to solve any one without also solving all the others.

So if you thought that Newton’s equation F = ma was simple, think again!

Compare this to how quantum mechanics describes our universe. The state of the universe is described by a function Ψ(x_1, x_2, …, x_N, t). This function changes over time according to the Schrödinger equation:

∂_t Ψ = −i·H[Ψ]

H is a differential operator that is a complicated function of all of the positions of all the particles in the universe. It encodes the information about particle interactions in the same way that the force functions did in classical mechanics.

I claim that Schrodinger’s equation is infinitely easier to solve than Newton’s equation. In fact, I will by the end of this post write out the exact solution to the wave function of the entire universe.

At first glance, you can notice a few features of the equation that make it look potentially simpler than the classical equation. For one, there’s only one single equation, instead of N entangled equations.

Also, the equation is only first order in time derivatives, while Newton’s equation is second order. This is extremely important. The move from a second-order differential equation to a first-order one is a huge deal. For one thing, there’s a simple general solution to every first-order linear differential equation, and nothing close for second-order ones.
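To make that concrete: the simplest first-order linear equation y′ = a·y is solved by y(t) = y(0)·e^(at), for any constant a (this small aside is my own illustration, not from the post). A quick finite-difference check:

```python
from math import exp

a, y0 = -0.5, 2.0

def y(t):
    return y0 * exp(a * t)   # claimed solution of y' = a*y

# Central finite difference: y'(t) should match a*y(t).
t, h = 1.3, 1e-6
derivative = (y(t + h) - y(t - h)) / (2 * h)
assert abs(derivative - a * y(t)) < 1e-6
```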

Unfortunately… Schrödinger’s equation, just like Newton’s, is forbiddingly complicated, because of the presence of H, which couples together the positions of every particle in the universe. If we can’t find a way to tame this immensely complex operator, then we’re probably stuck.

But quantum mechanics hands us exactly what we need: two magical facts about the universe that let us reduce Schrödinger’s equation to a collection of simple linear first-order differential equations, one for each energy level.

First: It guarantees us that there exists a set of functions φ_E(x_1, x_2, …, x_N) such that:

H[φ_E] = E · φ_E

E is an ordinary real number, and its physical meaning is the energy of the entire universe. The set of values of E is the set of allowed energies for the universe. And the functions φ_E(x_1, x_2, …, x_N) are the wave functions that correspond to each allowed energy.

Second: it tells us that no matter what complicated state our universe is in, we can express it as a weighted sum over these functions:

Ψ = ∑_E a_E · φ_E

With these two facts, we’re basically omniscient.

Since Ψ is a sum of all the different functions φ_E, if we want to know how Ψ changes with time, we can just see how each φ_E changes with time.

How does each φ_E change with time? We just use the Schrödinger equation:

∂_t φ_E = −i · H[φ_E]
= −iE · φ_E

And we end up with a first order linear differential equation. We can write down the solution right away:

φ_E(x_1, x_2, …, x_N, t) = φ_E(x_1, x_2, …, x_N) · e^(−iEt)

And just like that, we can write down the wave function of the entire universe:

Ψ(x_1, x_2, …, x_N, t) = ∑_E a_E · φ_E(x_1, x_2, …, x_N, t)
= ∑_E a_E · φ_E(x_1, x_2, …, x_N) · e^(−iEt)

Hand me the initial conditions of the universe, and I can hand you back its exact and complete future according to quantum mechanics.
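Here is the whole recipe run end-to-end on the smallest “universe” I can think of: a two-state system with a 2×2 Hermitian H (this toy example is entirely my own, not from the post). Its eigenvectors play the role of the φ_E, its eigenvalues are the allowed energies, and the solution Ψ(t) = ∑ a_E·φ_E·e^(−iEt) can be checked directly against the Schrödinger equation ∂_t Ψ = −i·H[Ψ]:

```python
from cmath import exp as cexp
from math import sqrt

# H = [[0, 1], [1, 0]] has eigenvalues ±1 with eigenvectors (1, ±1)/√2.
E_levels = [1.0, -1.0]
phis = [(1 / sqrt(2), 1 / sqrt(2)), (1 / sqrt(2), -1 / sqrt(2))]

psi0 = (1.0, 0.0)   # "initial condition of the universe"
a = [phi[0] * psi0[0] + phi[1] * psi0[1] for phi in phis]   # coefficients a_E

def psi(t):
    """Psi(t) = sum over E of a_E * phi_E * exp(-i*E*t)."""
    out = [0j, 0j]
    for aE, E, phi in zip(a, E_levels, phis):
        factor = aE * cexp(-1j * E * t)
        out[0] += factor * phi[0]
        out[1] += factor * phi[1]
    return out

def H_apply(v):     # H[v] for H = [[0, 1], [1, 0]]
    return [v[1], v[0]]

# Check d(Psi)/dt = -i * H[Psi] by central finite differences.
t, h = 0.8, 1e-6
lhs = [(u - v) / (2 * h) for u, v in zip(psi(t + h), psi(t - h))]
rhs = [-1j * c for c in H_apply(psi(t))]
assert all(abs(u - v) < 1e-6 for u, v in zip(lhs, rhs))

# And the "static" feature: the weight in each energy level never changes.
assert all(abs(abs(aE * cexp(-1j * E * 1000.0)) - abs(aE)) < 1e-12
           for aE, E in zip(a, E_levels))
```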


Okay, I cheated a little bit. You might have guessed that writing out the exact wave function of the entire universe is not actually doable in a short blog post. The problem can’t be that simple.

But at the same time, everything I said above is actually true, and the final equation I presented really is the correct wave function of the universe. So if the problem must be more complex, where is the complexity hidden away?

The answer is that the complexity is hidden away in the first “magical fact” about allowed energy states.

H[φ_E] = E · φ_E

This eigenvalue equation is, in general, a monstrously complicated second-order partial differential equation in the positions of all N particles. If we actually wanted to expand out Ψ in terms of the different functions φ_E, we’d have to solve it.

So there is no free lunch here. But what’s interesting is where the complexity moves when switching from classical mechanics to quantum mechanics.

In classical mechanics, virtually zero effort goes into formalizing the space of states, or talking about what configurations of the universe are allowable. All of the hardness of the problem of solving the laws of physics is packed into the dynamics. That is, it is easy to specify an initial condition of the universe. But describing how that initial condition evolves forward in time is virtually impossible.

By contrast, in quantum mechanics, solving the equation of motion is trivially easy, and all of the complexity has moved to defining the system. If somebody hands you the allowed energy levels of the universe and their corresponding wave functions, you can solve for the entire future of the universe immediately. But actually finding those allowed energy levels and wave functions is virtually impossible.


Let’s get to the strangest (and my favorite) part of this.

If quantum mechanics is an accurate description of the world, then the following must be true:

Ψ(x_1, x_2, …, x_N, 0) = ∑_E a_E · φ_E(x_1, x_2, …, x_N)
Ψ(x_1, x_2, …, x_N, t) = ∑_E a_E · φ_E(x_1, x_2, …, x_N) · e^(−iEt)

This solution has two especially interesting features. First, each term in the sum factors into a function of position times a function of time.

And second, the temporal component of each term is an imaginary exponential – a phase factor e^(−iEt).

Let me take a second to explain the significance of this.

In quantum mechanics, physical quantities are invariably found by taking the absolute square of complex quantities. This is why you can have a complex wave function and an equation of motion with an i in it, and still end up with a universe quite free of imaginary numbers.

But when you take the absolute square of e^(−iEt), you end up with e^(−iEt) · e^(iEt) = 1. What’s important here is that the time dependence seems to fall away.

A way to see this is to notice that y = e^(−ix), when graphed, is a point on the unit circle in the complex plane.


So e^(−iEt), when graphed, is just a point repeatedly spinning around the unit circle. The larger E is, the faster it spins.

Taking the absolute square of a complex number is the same as squaring its distance from the origin on the complex plane. And since e^(−iEt) always stays on the unit circle, its absolute square is always 1.
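This is easy to confirm directly (basic complex arithmetic, nothing specific to this post):

```python
from cmath import exp as cexp

# e^(-iEt) lies on the unit circle for every E and t,
# so its absolute square is always exactly 1.
for E in [0.0, 1.0, 3.7, 100.0]:
    for t in [0.0, 0.1, 2.5]:
        z = cexp(-1j * E * t)
        assert abs(abs(z) ** 2 - 1.0) < 1e-12
```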

So what this all means is that quantum mechanics tells us that there’s a sense in which our universe is remarkably static. The universe starts off as a superposition of a bunch of possible energy states, each with a particular weight. And it ends up as a sum over the same energy states, with weights of the exact same magnitude, just pointing different directions in the complex plane.

Imagine drawing the universe by drawing out all possible energy states in boxes, and shading these boxes according to how much amplitude is distributed in them. Now we advance time forward by one millisecond. What happens?

Absolutely nothing, according to quantum mechanics. The distribution of shading across the boxes stays the exact same, because the phase factor multiplication does not change the magnitude of the amplitude in each box.

Given this, we are faced with a bizarre question: if quantum mechanics tells us that the universe is static in this particular way, then why do we see so much change and motion and excitement all around us?

I’ll stop here for you to puzzle over, but I’ve posted an answer here.