Statistical mechanics is wonderful

The law that entropy always increases, holds, I think, the supreme position among the laws of Nature. If someone points out to you that your pet theory of the universe is in disagreement with Maxwell’s equations — then so much the worse for Maxwell’s equations. If it is found to be contradicted by observation — well, these experimentalists do bungle things sometimes. But if your theory is found to be against the second law of thermodynamics I can give you no hope; there is nothing for it but to collapse in deepest humiliation.

 – Eddington

My favorite part of physics is statistical mechanics.

This wasn’t the case when it was first presented to me – it seemed fairly ugly and complicated compared to the elegant and deep formulations of classical mechanics and quantum mechanics. There were too many disconnected rules and special cases messily bundled together to match empirical results. Unlike in the rest of physics, I failed to see deep principles motivating the equations we derived.

Since then I’ve realized that I was completely wrong. I’ve come to appreciate it as one of the deepest parts of physics I know, and mentally categorize it somewhere in the intersection of physics, math, and philosophy.

This post is an attempt to convey how statistical mechanics connects these fields, and to show concisely how some of the standard equations of statistical mechanics arise out of deep philosophical principles.

***

The fundamental goal of statistical mechanics is beautiful. It answers the question “How do we apply our knowledge of the universe on the tiniest scale to everyday life?”

In doing so, it bridges the divide between questions about the fundamental nature of reality (What is everything made of? What types of interactions link everything together?) and the types of questions that a ten-year-old might ask (Why is the sky blue? Why is the table hard? What is air made of? Why are some things hot and others cold?).

Statistical mechanics peeks at the realm of quarks and gluons and electrons, and then uses insights from this realm to understand the workings of the world on a scale a factor of 10^21 larger.

Wilfrid Sellars described philosophy as an attempt to reconcile the manifest image (the universe as it presents itself to us, as a world of people and objects and purposes and values), and the scientific image (the universe as revealed to us by scientific inquiry, empty of purpose, empty of meaning, and animated by simple exact mathematical laws that operate like clockwork). This is what I see as the fundamental goal of statistical mechanics.

What is incredible to me is how elegantly it manages to succeed at this. The universality and simplicity of the equations of statistical mechanics are astounding, given the type of problem we’re dealing with. Physicists would like to say that once they’ve figured out the fundamental equations of physics, then we understand the whole universe. Rutherford said that “all science is either physics or stamp collecting.” But you try to take some laws that tell you how two electrons interact, and then answer questions about how 10^23 electrons will behave when all crushed together.

The miracle is that we can do this, and not only can we do it, but we can do it with beautiful, simple equations that are loaded with physical insight.

There’s an even deeper connection to philosophy. Statistical mechanics is about epistemology. (There’s a sense in which all of science is part of epistemology. I don’t mean this. I mean that I think of statistical mechanics as deeply tied to the philosophical foundations of epistemology.)

Statistical mechanics doesn’t just tell us what the world should look like on the scale of balloons and oceans and people. Some of the most fundamental concepts in statistical mechanics are ultimately about our state of knowledge about the world. It contains precise laws telling us what we can know about the universe, what we should believe, how we should deal with uncertainty, and how this uncertainty is structured in the physical laws.

While the rest of physics searches for perfect objectivity (taking the “view from nowhere”, in Nagel’s great phrase), statistical mechanics has one foot firmly planted in the subjective. It is an epistemological framework, a theory of physics, and a piece of beautiful mathematics all in one.

***

Enough gushing.

I want to express some of these deep concepts I’ve been referring to.

First of all, statistical mechanics is fundamentally about probability.

It accepts that trying to keep track of the positions and velocities of 10^23 particles all interacting with each other is futile, regardless of how much you know about the equations guiding their motion.

And it offers a solution: Instead of trying to map out all of the particles, let’s coarse-grain our model of the universe and talk about the likelihood that a given particle is in a given position with a given velocity.

As soon as we do this, our theory is no longer just about the universe in itself, it is also about us, and our model of the universe. Equations in statistical mechanics are not only about external objective features of the world; they are also about properties of the map that we use to describe it.

This is fantastic and I think really under-appreciated. When we talk about the results of the theory, we must keep in mind that these results must be interpreted in this joint way. I’ve seen many misunderstandings arise from failures of exactly this kind, like when people think of entropy as a purely physical quantity and take the second law of thermodynamics to be solely a statement about the world.

But I’m getting ahead of myself.

Statistical mechanics is about probability. So if we have a universe consisting of N = 10^80 particles, then we will create a function P that assigns a probability to every possible position for each of these particles at a given moment:

P(x1, y1, z1, x2, y2, z2, …, xN, yN, zN)

P is a function of 3•10^80 values… this looks super complicated. Where’s all this elegance and simplicity I’ve been gushing about? Just wait.

The second fundamental concept in statistical mechanics is entropy. I’m going to spend way too much time on this, because it’s really misunderstood and really important.

Entropy is fundamentally a measure of uncertainty. It takes in a model of reality and returns a numerical value. The larger this value, the more coarse-grained your model of reality is. And as this value approaches zero, your model approaches perfect certainty.

Notice: Entropy is not an objective feature of the physical world!! Entropy is a function of your model of reality. This is very very important.

So how exactly do we define the entropy function?

Say that a masked inquisitor tells you to close your eyes and hands you a string of twenty 0s and 1s. They then ask you what your uncertainty is about the exact value of the string.

If you don’t have any relevant background knowledge about this string, then you have no reason to suspect that any digit in the string is more likely to be a 0 than a 1 or vice versa. So perhaps your model places equal likelihood in every possible string. (This corresponds to a probability of ½ • ½ • … • ½ twenty times, or 1/2^20, for each string).

The entropy of this model is 20.

Now your inquisitor allows you to peek at only the first number in the string, and you see that it is a 1.

By the same reasoning, your model is now an equal distribution of likelihoods over all strings that start with 1.

The entropy of this model? 19.

If now the masked inquisitor tells you that they have added five new numbers at the end of your string, the entropy of your new model will be 24.

The idea is that if you are processing information right, then every time you get a single bit of information, your entropy should decrease by exactly 1. And every time you “lose” a bit of information, your entropy should increase by exactly 1.

In addition, when you have perfect knowledge, your entropy should be zero. This means that the entropy of your model can be thought of as the number of pieces of binary information you would have to receive to have perfect knowledge.

How do we formalize this?

Well, your initial model (back when there were 20 numbers and you had no information about any of them) gave each outcome a probability of P = 1/2^20. How do we get a 20 out of this? Simple!

Entropy = S = log2(1/P)

(Yes, entropy is denoted by S. Why? Don’t ask me, I didn’t invent the notation! But you’ll get used to it.)

We can check if this formula still works out right when we get new information. When we learned that the first number was a 1, half of our previous possibilities disappeared. Given that the others are all still equally likely, our new probabilities for each should double from 1/2^20 to 1/2^19.

And S = log2(1/(1/2^19)) = log2(2^19) = 19. Perfect!

What if you now open your eyes and see the full string? Well now your probability distribution is 0 over all strings except the one you see, which has probability 1.

So S = log2(1/1) = log2(1) = 0. Zero entropy corresponds to perfect information.
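If you'd like to check the inquisitor numbers yourself, here is a minimal Python sketch. The function name `uniform_entropy_bits` is just something I made up for illustration; it assumes a model that spreads probability evenly over a given number of possible strings.

```python
import math

def uniform_entropy_bits(num_outcomes: int) -> float:
    """Entropy S = log2(1/P) of a uniform model with P = 1/num_outcomes."""
    p = 1.0 / num_outcomes
    return math.log2(1.0 / p)

print(uniform_entropy_bits(2**20))  # twenty unknown digits
print(uniform_entropy_bits(2**19))  # after peeking at the first digit
print(uniform_entropy_bits(2**24))  # after five unknown digits are appended
print(uniform_entropy_bits(1))      # after seeing the whole string
```

Each halving of the number of live possibilities removes one bit, and each doubling adds one.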

This is nice, but it’s a simple idealized case. What if we only get partial information? What if the masked stranger tells you that they chose the numbers by running a process that 80% of the time returns 0 and 20% of the time returns 1, and you’re fifty percent sure they’re lying?

In general, we want our entropy function to be able to handle models more sophisticated than just uniform distributions with equal probabilities for every event. Here’s how.

We can write out any arbitrary probability distribution over N possible outcomes as follows:

(P1, P2, …, PN)

As we’ve seen, if they were all equal then we would just find the entropy according to the previous equation: S = log2(1/P).

But if they’re not equal, then we can just find the weighted average! In other words:

S = mean(log2(1/P)) = ∑ Pn log2(1/Pn)

We can put this into the standard form by noting that log(1/P) = -log(P).

And we have our general definition of entropy!

For discrete probabilities: S = – ∑ Pn log Pn
For continuous probabilities: S = – ∫ P(x) log P(x) dx
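The discrete formula is short enough to try out directly. Here is a hedged sketch in Python, using base-2 logarithms to match the derivation above; `entropy_bits` is my own illustrative name, and the 80/20 distribution is the one from the masked stranger's process (taking them at their word, for simplicity).

```python
import math

def entropy_bits(probs):
    """Shannon entropy S = -sum(p * log2(p)) in bits.

    Outcomes with probability zero are skipped, since p*log(p) -> 0 as p -> 0.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_digit = [0.5, 0.5]
biased_digit = [0.8, 0.2]   # the 80% / 20% process, for a single digit

print(entropy_bits(fair_digit))    # one full bit of uncertainty
print(entropy_bits(biased_digit))  # less than one bit: you already know something
```

Knowing the process is biased toward 0 is itself information, which is why the biased digit carries less than one bit of uncertainty.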

(Aside: Physicists generally use a natural logarithm instead of log2 when they define entropy. This is just a difference in convention: e pops up more in physics and 2 in information theory. It’s a little weird, because now when entropy drops by 1 this means you’ve narrowed the options down to 1/e of what they were, instead of ½. But it makes equations much nicer.)

I’m going to spend a little more time talking about this, because it’s that important.

We’ve already seen that entropy is a measure of how much you know. When you have perfect and complete knowledge, your model has entropy zero. And the more uncertainty you have, the more entropy you have.

You can visualize entropy as a measure of the size of your probability distribution. You can calculate some examples for yourself using the above equations.

Roughly, when you double the “size” of your probability distribution, you increase its entropy by 1.

But what does it mean to double the size of your probability distribution? It means that there are two times as many possibilities as you initially thought – which is equivalent to you losing one piece of binary information! This is exactly the connection between these two different ways of thinking about entropy.
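Here is one way to see the "doubling adds one bit" claim numerically. This is only a sketch: `double_size` is a hypothetical helper that splits every outcome into two equally likely sub-outcomes, which is one reasonable reading of "doubling the size", and the starting distribution is made up.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def double_size(probs):
    """Split every outcome into two equally likely sub-outcomes."""
    return [p / 2 for p in probs for _ in range(2)]

model = [0.5, 0.25, 0.125, 0.125]   # hypothetical starting distribution
doubled = double_size(model)

print(entropy_bits(model))    # 1.75 bits
print(entropy_bits(doubled))  # 2.75 bits: exactly one bit more
```

Under this reading the distribution does not need to be uniform for the one-bit jump to hold.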

Third: (I won’t name it yet so as to not ruin the surprise). This is so important that I should have put it earlier, but I couldn’t have because I needed to introduce entropy first.

So I’ve been sneakily slipping in an assumption throughout the last paragraphs. This is that when you don’t have any knowledge about the probability of a set of events, you should act as if all events are equally likely.

This might seem like a benign assumption, but it’s responsible for god-knows how many hours of heated academic debate. Here’s the problem: sure it seems intuitive to say that 0 and 1 are equally likely. But that itself is just one of many possibilities. Maybe 0 comes up 57% of the time, or maybe 34%. It’s not like you have any knowledge that tells you that 0 and 1 are objectively equally likely, so why should you favor that hypothesis?

Statistical mechanics answers this by just postulating a general principle: Look at the set of all possible probability distributions, calculate the entropy of each of them, and then choose the one with the largest entropy.

In cases where you have literally no information (like our earlier inquisitor-string example), this principle becomes the principle of indifference: spread your credences evenly among the possibilities. (Prove it yourself! It’s a fun proof.)
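If you want some reassurance before attempting the proof, here is a quick numerical sanity check (not a proof!). It draws random distributions over six outcomes and confirms that none of them beats the uniform one. The sampling scheme, the seed, and the names are all my own choices for illustration.

```python
import math
import random

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def random_distribution(n, rng):
    """A random point on the probability simplex (normalized random weights)."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)          # fixed seed so the run is repeatable
n = 6
uniform_entropy = entropy_bits([1.0 / n] * n)   # equals log2(6)

samples = [entropy_bits(random_distribution(n, rng)) for _ in range(10_000)]
print(uniform_entropy)
print(max(samples))             # never exceeds the uniform value
```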

But as a matter of fact, this principle doesn’t only apply to cases where you have no information. If you have partial or incomplete information, you apply the exact same principle by looking at the set of probability distributions that are consistent with this information and maximizing entropy.

This principle of maximum entropy is the foundational assumption of statistical mechanics. And it is a purely epistemic assumption. It is a normative statement about how you should rationally divide up your credences in the absence of information.

Said another way, statistical mechanics prescribes an answer to the problem of the priors, the biggest problem haunting Bayesian epistemologists. If you want to treat your beliefs like probabilities and update them with evidence, you have to have started out with an initial level of belief before you had any evidence. And what should that prior probability be?

Statistical mechanics says: It should be the probability that maximizes your entropy. And statistical mechanics is one of the best-verified and most successful areas of science. Somehow this is not loudly shouted in the pages of every text on Bayesianism.

There’s much more to say about this, but I’ll set it aside for the moment.

***

So we have our setup for statistical mechanics.

  1. Coarse-grain your model of reality by constructing a probability distribution over all possible microstates of the world.
  2. Construct this probability distribution according to the principle of maximum entropy.

Okay! So going back to our world of N = 10^80 particles jostling each other around, we now know how to construct our probability distribution P(x1, …, xN). (I’ve made the universe one-dimensional for no good reason except to pretty it up – everything I say follows exactly the same if I left it in 3D. I’ll also start writing the set of all N coordinates as X, again for prettiness.)

What probability distribution maximizes S = – ∫ P logP dX?

We can solve this with the method of Lagrange multipliers:

∂/∂P [ P logP + λP ] = 0,
where λ is chosen to satisfy: ∫ P dX = 1

This is such a nice equation and you should do yourself a favor and learn it, because I’m not going to explain it (if I explained everything, this post would become a textbook!).

But it essentially maximizes the value of S, subject to the constraint that the total probability is 1. When we solve it we find:

P(x1, …, xN) = 1/V^N, where V is the volume of the universe
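For anyone who wants the intermediate steps, here is a sketch of the calculation in LaTeX. It assumes each of the N coordinates ranges over a region of size V, so that integrating a constant over all of X gives a factor of V^N.

```latex
\begin{aligned}
\frac{\partial}{\partial P}\left[ P \log P + \lambda P \right]
  &= \log P + 1 + \lambda = 0 \\
\Rightarrow \quad P &= e^{-(1+\lambda)}
  \quad \text{(a constant, independent of } X \text{)} \\
\int P \, dX = P \cdot V^{N} = 1
  \quad &\Rightarrow \quad P(x_1, \ldots, x_N) = \frac{1}{V^{N}}
\end{aligned}
```

The multiplier λ never needs to be computed on its own; normalization fixes the constant directly.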

Remember earlier when I said to just wait for the probability equation to get simple?

Okay, so this is simple, but it’s also not very useful. It tells us that every particle has an equal probability of being in any equally sized region of space. But we want to know more. Like, are the higher energy particles distributed differently than the lower energy?

The great thing about statistical mechanics is that if you want a better model, you can just feed in more information to your distribution.

So let’s say we want to find the probability distribution, given two pieces of information: (1) we know the energy of every possible configuration of particles, and (2) the average total energy of the universe is fixed.

That is, we have a function E(x1, …, xN) that tells us energies, and we know that the total energy E = ∫ P(x1, …, xN)•E(x1, …, xN) dX is fixed.

So how do we find our new P? Using the same method as before:

∂/∂P [ P logP + λP + βEP ] = 0,
where λ is chosen to satisfy: ∫ P dX = 1
and β is chosen to satisfy: ∫ P•E dX = E

This might look intimidating, but it’s really not. I’ll write out how to solve this:

∂/∂P [ P logP + λP + βEP ]
= logP + 1 + λ + βE = 0
So P = e^(-(1+λ)) • e^(-βE)
Renaming our first term, we get:
P(X) = 1/Z • e^(-βE(X))

This result is called the Boltzmann distribution, and it’s one of the incredibly important must-know equations of statistical mechanics. The amount of physics you can do with just this one equation is staggering. And we got it by just adding conservation of energy to the principle of maximum entropy.

Maybe you’re disturbed by the strange new symbols Z and β that have appeared in the equation. Don’t fear! Z is simply a normalization constant: it’s there to keep the probability of the total distribution at 1. We can calculate it explicitly:

Z = ∫ e^(-βE) dX
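To make Z and the distribution concrete, here is a small Python sketch. It swaps the integral over X for a sum over a handful of discrete states, which is an assumption I'm making purely for illustration, and the energy levels are made up.

```python
import math

def boltzmann(energies, beta):
    """Return (Z, probabilities) with P_i = exp(-beta * E_i) / Z."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)                      # discrete stand-in for Z = ∫ e^(-βE) dX
    return z, [w / z for w in weights]

energies = [0.0, 1.0, 2.0, 3.0]           # hypothetical energy levels, arbitrary units

for beta in (0.1, 1.0, 10.0):
    z, probs = boltzmann(energies, beta)
    print(beta, round(z, 4), [round(p, 4) for p in probs])
```

Small β spreads the probability almost evenly across the states, while large β piles nearly all of it onto the lowest-energy state.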

And β is really interesting. Notice that β came into our equations because we had to satisfy this extra constraint about a fixed total energy. Is there some nice physical significance to this quantity?

Yes, very much so. β is what we humans like to call ‘temperature’, or more precisely, inverse temperature.

β = 1/T

While avoiding the math, I can just say the following: Temperature is defined to be the change in the energy of a system when you change its entropy a little bit. (This definition is much more general than the special case definition of temperature as average kinetic energy)

And it turns out that when you manipulate the above equations a little bit, you see that ∂E/∂S = 1/β = T.
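You can check this relation numerically without doing the algebra. The sketch below again assumes a small discrete set of made-up energy levels, uses natural logarithms for the entropy, and estimates the derivative of average energy with respect to entropy by nudging β a little in each direction.

```python
import math

def energy_and_entropy(energies, beta):
    """Average energy and entropy (in nats) of the Boltzmann distribution."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    probs = [w / z for w in weights]
    mean_energy = sum(p * e for p, e in zip(probs, energies))
    entropy = -sum(p * math.log(p) for p in probs)
    return mean_energy, entropy

energies = [0.0, 1.0, 2.0, 3.0]   # hypothetical energy levels
beta, h = 0.7, 1e-5               # hypothetical beta and a small step

e_lo, s_lo = energy_and_entropy(energies, beta - h)
e_hi, s_hi = energy_and_entropy(energies, beta + h)

dE_dS = (e_hi - e_lo) / (s_hi - s_lo)   # finite-difference estimate of dE/dS
print(dE_dS, 1.0 / beta)                # the two numbers should agree closely
```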

So we could rewrite our probability distribution as follows:

P(X) = 1/Z • e^(-E(X)/T)

Feed in your fundamental laws of physics to the energy function, and you can see the distribution of particles across the universe!
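In practice you usually know the average energy and have to back out β (and so T) from the constraint. Here is a rough sketch of that step using bisection. The discrete energy levels, the target value, and the search bounds are all assumptions chosen for illustration.

```python
import math

def mean_energy(energies, beta):
    """Average energy under P_i = exp(-beta * E_i) / Z."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return sum(e * w for e, w in zip(energies, weights)) / z

def solve_beta(energies, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on beta; the mean energy decreases as beta increases."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_energy(energies, mid) > target:
            lo = mid      # mean too high, so beta must be larger
        else:
            hi = mid      # mean too low (or equal), so beta must be smaller
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

energies = [0.0, 1.0, 2.0, 3.0]            # hypothetical energy levels
beta = solve_beta(energies, target=1.0)    # hypothetical fixed average energy
print(beta, 1.0 / beta, mean_energy(energies, beta))
```

A target below the plain average of the levels (1.5 here) gives a positive β, meaning a positive temperature.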

Let’s just look at the basic properties of this equation. First of all, we can see that the larger E(X)/T becomes, the smaller the probability of a particle being in X becomes. This corresponds both to particles scattering away from high-energy regions and to less densely populated systems having lower temperatures.

And the smaller E(X)/T, the larger P(X). This corresponds to particles densely clustering in low-energy areas, and dense clusters of particles having high temperatures.

There are too many other things I could say about this equation and others, and this post is already way too long. I want to close with a final note about the nature of entropy.

I said earlier that entropy is entirely a function of your model of reality. The universe doesn’t have an entropy. You have a model of the universe, and that model has an entropy. Regardless of what physical reality is like, if I hand you a model, you can tell me its entropy.

But at the same time, models of reality are linked to the nature of the physical world. So for instance, a very simple and predictable universe lends itself to very precise and accurate models of reality, and thus to lower-entropy models. And a very complicated and chaotic universe lends itself to constant loss of information and low-accuracy models, and thus to higher entropy.

It is this second world that we live in. Due to the structure of the universe, information is constantly being lost to us at enormous rates. Systems that start out simple eventually spiral off into chaotic and unpredictable patterns, and order in the universe is only temporary.

It is in this sense that statements about entropy are statements about physical reality. And it is for this reason that entropy always increases.

In principle, an omnipotent and omniscient agent could track the positions of all particles at all times, and this agent’s model of the universe would be always perfectly accurate, with entropy zero. For this agent, the entropy of the universe would never rise.

And yet for us, as we look at the universe, we seem to constantly and only see entropy-increasing interactions.

This might seem counterintuitive or maybe even impossible to you. How could the entropy rise for one agent and stay constant for another?

Imagine an ice cube sitting out on a warm day. The ice cube is in a highly ordered and understandable state. We could sit down and write out a probability distribution, taking into account the crystalline structure of the water molecules and the shape of the cube, and have a fairly low-entropy and accurate description of the system.

But now the ice cube starts to melt. What happens? Well, our simple model starts to break down. We start losing track of where particles are going, and having trouble predicting what the growing puddle of water will look like. And by the end of the transition, when all that’s left is a wide spread-out wetness across the table, our best attempts to describe the system will inevitably remain higher-entropy than what we started with.

Our omniscient agent looks at the ice cube and sees all the particles exactly where they are. There is no mystery to him about what will happen next – he knows exactly how all the water molecules are interacting with one another, and can easily determine which will break their bonds first. What looked like an entropy-increasing process to us was an entropy-neutral process to him, because his model never lost any accuracy.

We saw the puddle as higher-entropy, because we started doing poorly at modeling it. And our models started performing poorly, because the system got too complex for our models.

In this sense, entropy is not just a physical quantity, it is an epistemic quantity. It is both a property of the world and a property of our model of the world. The statement that the entropy of the universe increases is really the statement that the universe becomes harder for our models to compute over time.

Which is a really substantive statement. To know that we live in the type of universe that constantly increases in entropy is to know a lot about the way that the universe operates.

More reading here if you’re interested!

Solution: How change arises in QM

Previously I pointed out that if you drew out the wave function of the entire universe by separating out its different energy components and shading each according to its amplitude, you would find that the universe appears completely static.

[Figure: Energy superposition]

This is correct according to standard quantum mechanics. If you looked at how much amplitude the universe had in any particular energy level, you would find that this amplitude was not changing in size.

The only change you would observe would be in the direction, or phase, of the amplitude in the complex plane. And directions of amplitudes in the complex plane are unphysical. Right?

No! While there is an important sense in which the direction of an amplitude is unphysical (the universe ultimately only computes magnitudes of amplitudes), there is a much much more important sense in which the direction of an amplitude contains loads of physical information.

This is because when the universe is in a superposition of different energy states, the amplitudes of these states can interfere.

It is here that we can find the answer to the question I posed in the previous post. Physical changes come from interference between the amplitudes of all the energy states that the universe is in superposition over.

One consequence of all of this is that if the universe did happen to be in a pure energy state, and not in a superposition of multiple energy levels, then change would be impossible.

From which we can conclude: The universe is in a superposition of energy levels, not in any clearly defined single energy level! (Proof: Look around and notice that stuff is happening)

This doesn’t mean, by the way, that the universe is actually in one of the energy levels and we just don’t know which. It also doesn’t mean that the universe is in some other distinct state found by averaging over all of the different energy states. “Superposition” is one of these funny words in quantum mechanics that doesn’t have an analogue in natural language. The best we can say is that the universe really truly is in all of the states in the superposition at once, and the degree to which it is in any particular state is the amplitude of that state.

***

Let’s imagine a simple toy universe with one dimension of space and one of time.

This universe is initially in an equal superposition of two pure energy states Φ0(x) and Φ1(x), each of which is a real function (no imaginary components). The first has zero energy, and we choose our units so that the second has an energy level equal to exactly 1.

So the wave function of our universe at time zero can be written Ψ = Φ0 + Φ1. (I’m ignoring normalization factors because they aren’t really crucial to the point here)

And from this we can conclude that our probability density is:

P(x) = Ψ*·Ψ = Φ0^2 + Φ1^2 + 2·Φ0·Φ1

Now we advance forward in time. Applying the Schrödinger equation, we find:

Φ0(x, t) = Φ0(x)
Φ1(x, t) = Φ1(x) · e^(-it)

Notice that both of these energy states have a time-independent magnitude. The first one is obvious – it’s just completely static. The second one you can visualize as a function spinning in the complex plane, going from purely real and positive to purely imaginary to purely real and negative, et cetera. The magnitude of the function is just what you’d get by spinning it back to its positive real value.

From our two energy functions, we can find the total wave function of the universe:

Ψ(x, t) = Φ0(x) + Φ1(x) · e^(-it)

Already we can see that our time-dependent wave function is not a simple product of our time-independent wave function and a phase.

We can see the consequences of this by calculating the time-dependent probability density:

P(x, t) = Φ0(x)^2 + Φ1(x)^2 + Φ0(x) · Φ1(x) · (e^(-it) + e^(it))

Or…

P(x, t) = |Φ0|^2 + |Φ1|^2 + 2 · Φ0(x) · Φ1(x) · cos(t)

And in our final result, we can see a clear time dependence of the spatial probability distribution over the universe. The last term will grow and shrink, oscillating over time and giving rise to dynamics.
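Here is a small Python sketch that checks the cosine formula against a direct calculation of |Ψ|^2. The particular shapes I picked for Φ0 and Φ1 (two sine waves on the interval from 0 to 1) are purely hypothetical stand-ins. All the derivation needs is that they are real, and like the text I'm ignoring normalization.

```python
import cmath
import math

def phi0(x):
    return math.sin(math.pi * x)        # hypothetical real state with energy 0

def phi1(x):
    return math.sin(2 * math.pi * x)    # hypothetical real state with energy 1

def density_direct(x, t):
    """|Psi|^2 computed straight from Psi = Phi0 + Phi1 * e^(-it)."""
    psi = phi0(x) + phi1(x) * cmath.exp(-1j * t)
    return abs(psi) ** 2

def density_formula(x, t):
    """The closed form with the cos(t) interference term."""
    return phi0(x) ** 2 + phi1(x) ** 2 + 2 * phi0(x) * phi1(x) * math.cos(t)

x = 0.25
for t in (0.0, math.pi / 2, math.pi):
    print(round(t, 3), density_direct(x, t), density_formula(x, t))
```

At this x the density swings from large to small as t runs from 0 to π, even though neither component changes in magnitude.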

***

We can visualize what’s going on here by looking at the time evolution of each pure energy state as if it’s spinning in the complex plane. For instance, if the universe was in a superposition of the lowest four energy levels we would see something like:

[Animation: 4-Rotating.gif]

The length of the arrow represents the amplitude of that energy level – “how much” the universe is in that energy state. The arrows are spinning in the complex plane with a speed proportional to the energy level they represent.

The wave function of the universe is represented by the sum of all of these arrows, as if you stacked each on the head of the previous. And this sum is changing!
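The arrow picture is easy to play with in code. In this sketch each arrow is a complex number, and the arrow lengths and energy levels are made-up values. Each arrow keeps its length as it spins, but the length of the sum does not stay fixed.

```python
import cmath

amplitudes = [1.0, 0.8, 0.6, 0.4]   # hypothetical arrow lengths
energies = [0.0, 1.0, 2.0, 3.0]     # hypothetical energy levels (rotation speeds)

def arrows(t):
    """Each arrow spins at a rate proportional to its energy."""
    return [a * cmath.exp(-1j * e * t) for a, e in zip(amplitudes, energies)]

for t in (0.0, 1.0):
    current = arrows(t)
    lengths = [round(abs(z), 6) for z in current]   # individual lengths never change
    total = sum(current)                            # arrows stacked head to tail
    print(t, lengths, round(abs(total), 4))
```

At t = 0 all the arrows line up, so the sum is as long as it can be. Later they point in different directions and partly cancel.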

For instance, in the universe’s first moment, the superposition looks like this:

[Figure: 4-Rotating, T=0]

And later the universe looks like this:

[Figure: 4-Rotating, T=1]

If we plotted out the first two energy states scaled by their amplitudes, we might see the following spatial distributions, initially and finally:

Even though there have been no changes in the magnitudes of the arrows (the degree to which the universe exists in each energy level) we get a very different looking universe.

This is the basic idea that explains all change in the universe, from the rising and falling of civilizations to the births and deaths of black holes: they are results of the complex patterns of interference produced by spinning amplitudes.

TIMN view of social evolution

(Papers here and here)

In the Neolithic era, societies are thought to have been mostly small groups bonded by kinship relations, with little social stratification. As technological advancement accommodated more complex social structures and larger groups of humans living together, problems of coordination became increasingly difficult. In response, more complex social structures arose, such as Chiefdoms, States and eventually Empires.

These structures solved coordination problems through a top-down command-and-control approach, enforced by strict hierarchical power structures. Historical exemplars of such structures include Ancient Egypt and the Roman Empire. These societies experienced immense growth, stretching out to dominate vast stretches of territory and millions of humans.

But as they grew, these societies began facing increasingly difficult problems of managing vast amounts of information involving complex exchanges and economic dynamics. Eventually, old mercantilist systems in which the state was in charge of economic transactions gave way to a grand new form of social structure: the market.

Societies that adopted market structures alongside the state became global leaders, dominating technological, social, and economic progress up until the present day. And just as previous forms of society had their distinctive failings, capitalistic societies face problems in the creation of social inequalities without the ability to address them.

Advances in technology that allow a revolutionary capacity for information exchange are resulting in the formation of a new form of social structure to address these problems. This structure is characterized by complicated heterarchical cooperation between massive networks of physically dispersed individuals, all coordinating on the basis of shared ideological aims. It is to them that the future belongs.

This is the view of history offered by political scientist David Ronfeldt, who framed the TIMN theory of social evolution.

If I were to summarize his entire theory in four sentences, I would say:

Societies through history can be explained through the interactions of four major forms of social structure: the Tribe, the Institution, the Market, and the Network. Each form defines a structure of governance and the way that individuals interact with one another, as well as cultural values and beliefs about the way society should be organized.

Each has its own strengths and weaknesses, and the progress of history has been a move towards adopting all four forms in a complicated balance. The future will belong to those societies that realize the potential of the network form and successfully incorporate it into their social structure.

There are a lot of parallels between this and previous things that I’ve read. I’ll go into that in a moment, but first will lay out more detailed definitions of his four primary structures.

The Tribe: Tribes are characterized by tight kinship relationships. Tribal social structures create strong senses of social identity and belonging, and define the culture of successive societies. They are small, egalitarian, and generally lack a strong leader. Their limitations are problems of administration and coordination as they grow, as well as nepotism and intertribal wars. Historical examples abound in the Neolithic era, and in modern times they exist in certain hot spots in Third World countries. In the First World, tribal patterns exist within families, urban gangs, civic clubs, and more abstractly in nationalism, racism and sports team mania.

The Institution: Institutions are characterized by authority figures, strict hierarchies, management structures, and administrative bureaucracies. Their strengths involve administration and solving coordination problems. They are afflicted with problems of corruption and abuse of power, as well as difficulty processing large amounts of information, leading to economic inefficiency. Examples include the great Empires, and they exist today in states, military organizations, religious organizations, and corporations.

The Market: Markets are characterized by competition and voluntary exchanges between self-interested individuals. They are uncentralized and nonhierarchical, and do well at handling enormous amounts of complex information and optimizing economic efficiency in exchanges of private goods. They lead to productive and innovative societies with thriving trade and commerce. Markets struggle to deal with externalities and lead to social inequality. Markets historically took off in the transition from mercantilism to capitalism in Europe, and are exemplified by the economies of the U.S. and the U.K. and more recently Chile, China, and Mexico.

The Network: Networks are characterized by cooperation between many autonomous individuals with no single central authority, where each individual is connected to all others. They are tied together not by blood or kinship relationships, but by ideology and common goals. Their strengths are yet to be seen, though Ronfeldt thinks that they could do well at promoting “group empowerment” and solving social issues. Same with their weaknesses, though he points vaguely in the direction of “information overload” and “deception”. Examples include social networks and transnational networks of NGOs.

Networks are the most poorly specified and speculative of the four forms. This is perhaps to be expected; after all, he thinks they have only begun to come into prominence at the advent of the Information Age.

They’re also the form that he stresses the most, making lots of breathless predictions about networked societies superseding the market-state societies that dominate the status quo. He urges states like the U.S. and the U.K. to become active participants in the ushering in of this great new era if they want to remain global leaders.

This part was less interesting to me. I’m not convinced that the problems of social inequality that he thinks only Networks can solve cannot be fixed in a Market/State paradigm. All the same, it was nice to see falsifiable predictions from an otherwise highly theoretical work.

What I enjoyed most was his view of history. He sees the four forms as additive. When a society incorporates a new form, it does not discard the old, but builds upon it. Both end up modifying and influencing each other, and the end product is a combined system that incorporates both.

So for instance, the culture of a Tribe bleeds into its later instantiations as a State-run society, and can remain generations after the more visible tribal structures have passed on. And the adoption of free-market economic systems forces a reshaping of the State towards political democracy. He quotes Charles Lindblom:

However poorly the market is harnessed to democratic purposes, only within market-oriented systems does political democracy arise. Not all market-oriented systems are democratic, but every democratic system is also a market-oriented system. Apparently, for reasons not wholly understood, political democracy has been unable to exist except when coupled with the market. An extraordinary proposition, it has so far held without exception.

Ronfeldt explains this as a result of the market form pushing social values towards personal freedom, individuality, representation, and governmental accountability.

***

First connection:

I was reminded of psychologist Jonathan Haidt’s categorization of the different basic types of moral intuitions in The Righteous Mind. These are:

  • Care/Harm: Includes feelings like empathy and compassion. These intuitions are most triggered by experiences of vulnerable children, intense suffering and need, and cruelty.
  • Fairness/Cheating: Includes feelings of reciprocity, injustice, and equality. Triggered by others displaying cooperation or selfishness towards us.
  • Loyalty/Betrayal: Includes feelings of tribalism, unity and kinship. Triggered by involvement in tight groups.
  • Authority/Subversion: Includes feelings of respect for parents, teachers, rulers, and religious leaders, as well as the feelings that this respect is owed. Involved in hierarchical thinking and perceptions of dominance relations.
  • Sanctity/Degradation: Includes feelings of disgust, purity, cleanliness, dirtiness, sacredness, and corruption.
  • Liberty/Oppression: Includes feelings of individualism, freedom, and resentment towards being dominated or oppressed.

Different political ideologies line up very well with different “moral foundations profiles”. Liberals tend to care primarily about the first two categories, Libertarians the last, and Conservatives a roughly equal mix of all six. You can take a questionnaire to see your personal moral profile here.

These categories look like they map really nicely onto the TIMN model as organizing principles for the different forms. Here’s my speculation on how the different social forms engage and capitalize on the different types of intuitions:

Tribes: Loyalty/Betrayal

Institutions: Authority/Subversion

Markets: Liberty/Oppression

Networks: Care/Harm?

The natural next question is what types of social forms would have as organizing principles the values of Fairness/Cheating or Sanctity/Degradation.

Second connection:

Sociologist Robert Nisbet attempted to categorize the different basic patterns of social interactions. He gave five categories: cooperation, conflict, exchange, coercion and conformity. For some reason this categorization seemed very deep to me when I first heard it, and it has stuck with me ever since.

Cooperation involves coordination between individuals that have a shared goal, while exchange involves coordination between individuals that are each motivated by their own self-interest.

Conflict occurs when individuals work against each other, competing for a larger share of rewards, for instance. Coercion is the forced cooperation between individuals with different goals. And conformity involves behavior that matches group expectations.

These categories nicely match the types of social interactions that characterize the different social forms in the TIMN model.

Tribes are a social form that is dominated by conformity interactions. Identity is tightly bound up with tribal culture, lineage, and adherence to social norms involving mutual defense and aid and who can have kids with whom.

The structure of Institutions is quite clearly analogous to coercion, and Markets to exchange and conflict. And by Ronfeldt’s description, Networks seem to be analogous to cooperative interactions.

Third connection:

Scott Alexander makes the point that democracies have several unique features that set them apart from previous forms of government.

These features all arise from the fact that democracies answer questions of leadership succession by handing them to the people. This is a big deal, for two main reasons:

First, democracies put an upper bound on how terrible a leader can be.

Why? The basic justification is that while the people don’t get to select the absolute best choice for leadership, they do get to select against the worst choices.

(FPTP is terrible enough that I actually don’t know if this is in general true. But this is in contrast to monarchical forms of government, which involve no feedback from the population, so the point stands.)

When the king of a hereditary monarchy dies and the throne passes to his oldest son, there is no formally recognized way to guard against the possibility that the kid is literally the next Hitler. At best, the population can just try to throw him out when they’ve had enough and let whoever wins out in the resulting scramble for power take over.

Second, democracy provides a great Schelling point for leadership succession.

(A Schelling point is a decision that would be arbitrary except that it is made on the basis of an expectation that everybody else will make the same decision. So if you’re supposed to meet a stranger in NYC, and you don’t know where, you’ll choose to go to Grand Central Terminal, and so will they. Not because of any psychic communication between the two of you, nor any sort of official designation of Grand Central Terminal as the One True Stranger Meeting Spot, but because you each expect the other to be there. Thus Grand Central Terminal is a geographical Schelling point for NYC.)

The Schelling point for leadership succession in a hereditary monarchy is royal blood. Which is to say that when the leader dies, everybody looks for the person (usually the man) with the most royal blood, and crowns them.

But who determines if somebody’s blood is truly royal? What do you do if some other family decides that they have the truly royal blood? What if two people have equally royal blood?

The Schelling point for leadership succession in a theocratic monarchy like Ancient Egypt is the Official Word Of God.

Who determines which individual God actually wants in charge? What if two people both claim that God chose them to rule?

The problem is that these legitimacy claims are founded on fictions. There is no quality of royal-ness to blood, and there is no God to choose rulers. In a democracy, the Schelling point for leadership succession is a real thing that is easily verifiable: the popular vote.

Everybody agrees who the correct leader is, because everybody can just look at the election results. And if somebody disagrees on who the correct leader is, then they have a clear action to take: mobilize voters to change their mind by the next election.

Thus democracy plays the dual role of ending succession squabbles and providing a natural pressure valve for those dissatisfied with the current leader.

These differences in structure seem really significant. I think that I would want to break apart Ronfeldt’s Institution category and replace it with two social forms: the Hierarchy and the Democracy.

A Hierarchy would be a social structure in which there is a strict top-down system of authority, and where the population at large does not have a formal role in determining who makes it at the top.

A Democracy also has a top-down system of power, but now also has a formal mechanism for feedback from the population to the top levels of power (e.g. an election). (I’d like a word for this that does not have as political a connotation, but failed to think of one.)

***

The TIMN framework naturally leads to a story of the gradual progress of humans in our joint project of perfecting civilization. At each stage in history, new social structures arise to fix the failings of the old, and in this way forward progress is made.

Overall, I think that the framework offers a potentially useful way of assessing different political and economic systems, by looking at the ways in which they utilize the strengths of these four structures and how they fall victim to the weaknesses.

Race, Ethnicity, and Labels

(This post is me becoming curious about the variety of different opinions on racial labels, spending far too many hours researching the topic, and writing up what I find.)

One thing that I find interesting is that basically every minority ethnic and racial group in the United States has constantly dealt with terminological disputes about their proper group name.

One possible explanation for this constant turn-over was given by disability rights activist Evan Kemp, who wrote:

As long as a group is ostracized or otherwise demeaned, whatever name is used to designate that group will eventually take on a demeaning flavor and have to be replaced. The designation will keep changing every generation or so until the group is integrated into society. Whatever name is in vogue at the point of social acceptance will be the lasting one.

If this is the right explanation, then maybe we’d be able to measure the relative degrees of discrimination faced by different groups on the basis of their ‘terminological velocity’ – how quick a turnover the name for their group has.

Regardless, looking into these issues revealed a bunch of interesting history and weird trivia. So here goes!

***

Native American vs American Indian

A 1995 Census Bureau survey of American Indians found that 49% preferred the term ‘American Indian’ and 37% preferred ‘Native American’. I couldn’t find any more recent polls on this question.

This may seem unusual if you don’t know much about American Indian culture and history. It’s a bit confusing to me; as somebody with a parent born in India, I’m pretty sure that I’m an American Indian.

Why is a term that derives from the geographical error of early European colonists the most favored of all available terms? And why not ‘Native American’? From an outside perspective, ‘Native American’ feels like a respectful term, one that pays homage to the history of American Indians as the original residents of the Americas.

It turns out the answer to these questions comes from a quick look at the history of these terms, which is super fascinating.

‘Native American’ was a term originally used by WASPs in the 1850s to differentiate themselves from Catholic Irish and German immigrants. The anti-immigrant Know-Nothing Party, whose supporters were known for violent riots in Catholic neighborhoods, burning down churches, and tarring and feathering of Catholic priests, was originally known as the Native American Party.

The term fell out of use for a century upon the rise of the anti-slavery movement and subsequent collapse of the Know-Nothings. This time gap probably indicates that the early usage of the term has little current relevance to associations with the term, but I included it anyway. I find it darkly amusing to imagine white anti-Catholic nativists running around calling themselves Native Americans.

The term ‘Native American’ was revived in the civil rights era by anthropologists eager for historical accuracy and disassociation from the negative stereotypes associated with ‘Indian’. This was adopted widely by government agencies, and apparently in doing so picked up a negative connotation.

Prominent Lakota activist Russell Means described the term as “a generic government term used to describe all the indigenous prisoners of the United States.” Some American Indians emphasize a sense of lack of ownership over the term, and feel that it was a “colonial term” given to them by outsiders.

‘American Indian’ is apparently more widely favored. Widespread acceptance of this term dates back to 1968 and the rise of the American Indian Movement (AIM). At a UN conference in 1977, AIM’s International Indian Treaty Council urged collective identification of American Indians with the term.

One argument made for the term is that while the names of other races in America have ‘American’ as their second word (e.g. ‘Asian American’, ‘Arab American’), ‘American Indian’ would have American as its first word, giving American Indians a special distinction. I’m serious, this was a real argument.

‘American Indian’ is etymologically close to ‘Indian’, which dates back to early European colonists that systematically drove American Indian populations out of their homes. Some note derogatory stereotypes from old Western movies associated with ‘cowboys and Indians’, and feel that the association carries over to ‘American Indian’.

Other American Indians say that they would prefer to be identified by their specific tribal nation, feeling that terms like ‘Native American’ and ‘American Indian’ lump all tribes together and ignore important differences in heritage. The problem with this is that there are 562 federally recognized distinct tribes, making this cognitively unfeasible. It’s also just useful to have a term to talk about these tribes in the aggregate.

Interestingly, when I was researching this, I found a Washington Post poll in 2016 that reported that 73% of American Indians felt that the word ‘Redskin’ was not disrespectful, and 80% would not be offended if referred to as a Redskin. A 2004 poll found similar results, with 90% of American Indians saying that the name of the Washington Redskins didn’t bother them. This is significantly more than the percentage of all Americans that don’t find the name offensive, which is around 68%.

I tried to find good arguments against these poll results, and could only find some groundless conspiracy theories suggesting the polls had been infiltrated by white people claiming to be American Indians. In the absence of alternative explanations, I really don’t know what to make of this, besides that it suggests a complete disconnect between American Indian activists and the general American Indian population.

Black vs African American

The 2010 United States Census included “Black, African Am., or Negro” as one of their racial identifications. In response to many complaints and black Americans refusing to select the term, they have now switched to the shorter ‘Black or African American’.

Something that caught my eye was their explanation of this choice, which was that apparently previous research had shown that if polls didn’t allow self-identification as ‘Negro’, a significant number of older African Americans would take the time to write it in under the ‘some other race’ category.

The term ‘Negro’ became popular in the 1920s as a polite term to replace ‘Colored’, which was in turn originally a polite alternative to ‘Nigger’ in the 1900s. An actual argument made for adopting ‘Negro’ was that it was easier to pluralize than ‘Colored’, which required the addition of another word (‘Negroes’ vs ‘Colored people’). Bizarre, but okay!

In 1890, the US Census used a four-way classification: ‘Black’ for those with at least ¾ black blood, ‘mulatto’ from 3/8 to 5/8, ‘quadroon’ for ¼, and ‘octoroon’ for 1/8. Unsurprisingly, this did not catch on.

‘Negro’ was simpler, and quickly became the politically correct and respectful term, used by black leaders like Booker T Washington, Marcus Garvey, W.E.B. Du Bois, and later Martin Luther King Jr. Many black organizations replaced ‘Colored’ in their title with ‘Negro’, with the notable exception of the NAACP.

During the civil rights era, radical and militant black organizations began to attack the term, claiming that it was associated with the history of slavery and racism. ‘Black’ became a term that identified you with radical progressive blacks (think of slogans like ‘Black Power’ and ‘Black is beautiful’), while ‘Negro’ was associated with the status quo and the old guard.

The last US president to use the term ‘Negro’ was Lyndon Johnson, and by 1980 there was a large majority of African Americans in favor of ‘Black’. And of course, in modern times the term ‘Negro’ is commonly perceived as a racial slur. Obama banned the term from usage in federal law in 2016.

Meanwhile ‘Black’ became the standard term employed in surveys and used by black organizations, and having gained popular acceptance, lost its radical connections.

(Quick aside: This looks to me like an instance of what’s called semantic bleaching, where a word weakens in meaning as it increases in usage. My favorite example of this is the phrase ‘God be with you’, which over the years lost its religious connotation and became… ‘goodbye’!)

This lasted until around 1990, when Jesse Jackson announced that ‘Black’ was a term disconnected from cultural heritage, and declared a switch to ‘African American’.

While some organizations changed their names and declared their support for ‘African American’, this didn’t gather the same level of universal acceptance as ‘Black’ had in the 1960s, or indeed ‘Negro’ in the 1900s. The 1995 Census found that 44% of Black Americans still preferred ‘Black’, and only 28% preferred ‘African American’. Some argued that modern African Americans have created a culture that is not tied to Africa, and indeed that there is no coherent concept of a ‘single African culture’.

One paper I read attributed Jackson’s lack of success in making ‘African American’ the universally used term to a missing confrontational intensity that existed in the Black Power movement. For instance, when Malcolm X and other radical black activists challenged the term ‘Negro’, they attacked it harshly and made its usage a social taboo.

Jackson may have lacked the political power to sufficiently mobilize Black Americans. A 2007 Gallup poll found that 61% of Black Americans didn’t care about what term they were described by, reflecting a high level of apathy towards his cause. A 2005 paper found that Black Americans were nearly equally divided between the two.

Currently there’s an uneasy shifting balance between these two terms, where both are acceptable, though sometimes one becomes more acceptable than the other. In my personal experience, I recall a several-year period where I perceived that the term “Black” was becoming increasingly politically incorrect. I later had (and currently have) a sense that this political incorrectness around the term had backed off, keeping it in public acceptance.

Hispanic vs Latino

Americans who trace their roots to Spanish-speaking countries were grouped together by the US government under the umbrella term ‘Hispanic’ in the 1970s. ‘Latino’ later became popular as well, and was first included in the 2000 Census. These terms are defined as synonyms by the U.S. Census Bureau.

Polls indicate that around half of Latinos don’t like either term, and prefer to be identified with their country of origin. When forced to choose, more than twice as many prefer ‘Hispanic’ over ‘Latino’. (Interestingly, Latino friends of mine tell me that they and their Latino friends and family overwhelmingly prefer ‘Latino’ over ‘Hispanic’, which points to some sort of selection bias around me that I don’t understand.)

The federal government officially defines ‘Latino’ not as a race, but an ethnicity. Latinos apparently disagree – 56% claim that it is both a race and an ethnicity and 11% that it is a race. Only 19% agree with the official definition!

Both terms ‘Latino’ and ‘Hispanic’ are fairly unique to the United States. Terms that arose from Latino social movements like ‘Chicano’ have never won out among Latinos. This might be in part because of the lack of a strong shared identity – about 70% of Latinos think that there is not a common culture between American Latinos, and instead see a loose group composed of many individual cultures. There’s also a relevant lack of widely-known Latino activists and clear representatives of Latino people to champion these terms.

An older term designed to de-gender the term ‘Latino’ is ‘Latin@’, starting in the 1990s. This was apparently not inclusive enough, as the ‘@’ represents only ‘o’ and ‘a’ and not those that identify with neither. More recently, social justice activists have tried to encourage the adoption of the term ‘Latinx’. This term breaks with the gendered nature of the Spanish language and hardly rolls off the tongue, but has become relatively popular with LGBT activists.

Asian American vs Oriental

The term ‘Oriental’ was prohibited in the same bill in which Obama prohibited the use of the term ‘Negro’ in federal documents. There is a fairly strong consensus at this point that ‘Asian American’ is the appropriate term (though there remains some academic debate about this term).

‘Oriental’ is an old old term, dating back to the late Roman Empire. Over its history, the geographical region it referred to shifted constantly eastward (ad orientalem), from Morocco (yes, at some point it might have been proper to refer to Moroccans as Oriental!) to Egypt and the Levant to India and finally to East and Southeast Asia by the mid-1900s.

The term picked up baggage in the U.S. during the racist campaigns against Asian Americans in the late 1800s and early 1900s, and by now is fairly universally considered a pejorative term.

It was replaced by the term ‘Asian American’, which began to enter into popular use in the 1960s. The US Census definition of ‘Asian American’ still includes Indians, which feels really really wrong to me. I tried and failed to find public opinion polls on how many people feel comfortable with the term ‘Asian’ being applied to Indians.

And others…

The terminological situation of the Roma people is uniquely terrible. They are mostly referred to by the pejorative term ‘Gypsy’, which is essentially synonymous with ‘dangerous thieving wanderer’. The term ‘gypped’, meaning cheated or swindled, also has its origins in this term. They are also commonly referred to by the term ‘Tigan’, another pejorative term that derives from the Greek word for ‘untouchable’.

In a 2013 BBC TV interview, former Romanian prime minister Victor Ponta took care to distinguish Romanians from the Roma, noting that Romanians want to distance themselves from the Roma due to the negative connotations of the similar term.

And in 2010, the Romanian government supported a constitutional amendment legally renaming the Roma to the pejorative ‘Tigan’. (This law was later rejected by the Romanian Senate.) Another such amendment was proposed in 2013, this time hoping to ban the self-identification of Roma in Romania as Romanians.

Jewish people are also in an unusual terminological situation. The term ‘Israelite’ was apparently commonly used until the 1948 formation of Israel. While ‘Jew’ is the only remaining commonly used term, there are problems with it. From The American Heritage Dictionary:

It is widely recognized that the attributive use of the word Jew, in phrases such as Jew lawyer or Jew ethics, is both vulgar and highly offensive. In such contexts Jewish is the only acceptable possibility. Some people, however, have become so wary of this construction that they have extended the stigma to any use of Jew as a noun, a practice that carries risks of its own. In a sentence such as There are several Jews on the council, which is unobjectionable, the substitution of a circumlocution like Jewish people or persons of Jewish background may in itself cause offense for seeming to imply that Jew has a negative connotation when used as a noun.

***

All in all, it looks like a really complicated mixture of factors ends up determining how this part of the language evolves.

On the one hand there are syntactic features (like ‘American Indian’ having ‘Indian’ on the right as opposed to the standard left, or ‘Colored’ having a complicated pluralization compared to ‘Negro’).

And on the other hand there are semantic features like the ancient and automatic negative associations with words like ‘dark’ and ‘black’, or the colonial associations tied to the term ‘Indian’.

There are contemporary factors like the existence of a strong shared racial/ethnic identity, the presence of a charismatic racial/ethnic leader, and whether or not the introducer of a new term for a group is an insider or outsider to the group.

Then there are phenomena like semantic bleaching, whereby terms that enter common use have their meaning diluted and weakened, and concept creep, whereby words change their meaning over long stretches of history by altered patterns of usage.

And finally there are longer-term historical effects like the gradual inundation of language with dark undertones over decades of racism and discriminatory treatment.

Is quantum mechanics simpler than classical physics?

I want to make a few very fundamental comparisons between classical and quantum mechanics. I’ll be assuming a lot of background in this particular post to prevent it from getting uncontrollably long, but am planning on writing a series on quantum mechanics at some point.

***

Let’s assume that the universe consists of N simple point particles (where N is an ungodly large number), each interacting with each other in complicated ways according to their relative positions. These positions are written as $x_1, x_2, \ldots, x_N$.

The classical description for this simple universe makes each position a function of time, and gives the following set of N equations of motion, one for each particle:

$$F_k(x_1, x_2, \ldots, x_N) = m_k \cdot \partial_t^2 x_k$$

Each force function $F_k$ will be a horribly messy nonlinear function of the positions of all the particles in the universe. These functions encode the details of all of the interactions taking place between the particles.

Analytically solving these equations is completely hopeless – it’s a set of N separate equations, each one a highly nonlinear second order differential equation. You couldn’t solve any of them on their own, and on top of that, they are tightly entangled together, making it impossible to solve any one without also solving all the others.

So if you thought that Newton’s equation F = ma was simple, think again!
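To get a feel for how tangled these equations are, here is a minimal sketch that integrates them numerically for just three particles on a line. The cubic pairwise force, the masses, the starting values, and all of the function names are assumptions I made up for illustration; none of it is meant as real physics.

```python
def forces(xs):
    """Toy force law (an assumption): every pair attracts with a cubic force.

    Each F_k depends on every position, which is what couples the equations.
    """
    return [sum(-(xk - xj) ** 3 for xj in xs) for xk in xs]


def step(xs, vs, ms, dt):
    """Advance m_k * x_k'' = F_k(x_1, ..., x_N) by one velocity-Verlet step."""
    f_old = forces(xs)
    xs_new = [x + v * dt + 0.5 * (f / m) * dt * dt
              for x, v, f, m in zip(xs, vs, f_old, ms)]
    f_new = forces(xs_new)
    vs_new = [v + 0.5 * (fo + fn) / m * dt
              for v, fo, fn, m in zip(vs, f_old, f_new, ms)]
    return xs_new, vs_new


def simulate(xs, vs, ms, dt=0.001, steps=1000):
    """Repeat the single step; there is no shortcut to the final state."""
    for _ in range(steps):
        xs, vs = step(xs, vs, ms, dt)
    return xs, vs


def total_momentum(vs, ms):
    return sum(m * v for m, v in zip(ms, vs))


# Hypothetical initial conditions for three particles.
xs0, vs0, ms = [0.0, 1.0, 2.5], [0.3, -0.1, 0.0], [1.0, 2.0, 1.5]
xs1, vs1 = simulate(xs0, vs0, ms)
```

Notice that `forces` has to look at every particle to update any one of them, so the only way forward is to grind through tiny time steps. Because the toy forces cancel in pairs, `total_momentum` should come out the same before and after, which is a handy sanity check on the integrator.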

Compare this to how quantum mechanics describes our universe. The state of the universe is described by a function $\Psi(x_1, x_2, \ldots, x_N, t)$. This function changes over time according to the Schrödinger equation:

$$\partial_t \Psi = -i \cdot H[\Psi]$$

H is a differential operator that is a complicated function of all of the positions of all the particles in the universe. It encodes the information about particle interactions in the same way that the force functions did in classical mechanics.

I claim that Schrödinger’s equation is infinitely easier to solve than Newton’s equation. In fact, I will by the end of this post write out the exact solution to the wave function of the entire universe.

At first glance, you can notice a few features of the equation that make it look potentially simpler than the classical equation. For one, there’s only one single equation, instead of N entangled equations.

Also, the equation is only first order in time derivatives, while Newton’s equation is second order in time derivatives. This is extremely important. The move from a first order differential equation to a second order differential equation is a huge deal. For one thing, there’s a simple general solution to all first order linear differential equations, and nothing close for second order linear differential equations.
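For reference, here is the standard integrating-factor solution that this claim leans on. The symbols y, p, q, P and C are just my own labels for a generic first order linear equation, not anything from the physics so far.

```latex
\frac{dy}{dt} + p(t)\,y = q(t)
\quad\Longrightarrow\quad
y(t) = e^{-P(t)} \left( \int e^{P(t)}\,q(t)\,dt + C \right),
\qquad P(t) = \int p(t)\,dt
```

In the simplest case, where p is a constant λ and q = 0, this collapses to y(t) = C · e^{-λt}.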

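Unfortunately… Schrödinger’s equation, just like Newton’s, is enormously complicated, because of the presence of H. Strictly speaking the equation is linear in Ψ, but H involves derivatives with respect to every particle’s position, together with interaction terms that are highly nonlinear functions of those positions. If we can’t find a way to simplify this immensely complex operator, then we’re probably stuck.

But quantum mechanics hands us exactly what we need: two magical facts about the universe that allow us to turn Schrödinger’s equation into a simple first-order differential equation in time.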

First: It guarantees us that there exists a set of functions $\varphi_E(x_1, x_2, \ldots, x_N)$ such that:

$$H[\varphi_E] = E \cdot \varphi_E$$

E is an ordinary real number, and its physical meaning is the energy of the entire universe. The set of values of E is the set of allowed energies for the universe. And the functions $\varphi_E(x_1, x_2, \ldots, x_N)$ are the wave functions that correspond to each allowed energy.

Second: it tells us that no matter what complicated state our universe is in, we can express it as a weighted sum over these functions:

$$\Psi = \sum_E a_E \cdot \varphi_E$$

With these two facts, we’re basically omniscient.

Since Ψ is a sum of all the different functions $\varphi_E$, if we want to know how Ψ changes with time, we can just see how each $\varphi_E$ changes with time.

How does each $\varphi_E$ change with time? We just use the Schrödinger equation:

$$\partial_t \varphi_E = -i \cdot H[\varphi_E] = -iE \cdot \varphi_E$$

And we end up with a first order linear differential equation. We can write down the solution right away:

$$\varphi_E(x_1, x_2, \ldots, x_N, t) = \varphi_E(x_1, x_2, \ldots, x_N) \cdot e^{-iEt}$$

And just like that, we can write down the wave function of the entire universe:

$$\Psi(x_1, x_2, \ldots, x_N, t) = \sum_E a_E \cdot \varphi_E(x_1, x_2, \ldots, x_N, t) = \sum_E a_E \cdot \varphi_E(x_1, x_2, \ldots, x_N) \cdot e^{-iEt}$$

Hand me the initial conditions of the universe, and I can hand you back its exact and complete future according to quantum mechanics.
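Here is a minimal sketch of that recipe for a case where the energy states are actually known: a single particle in a one-dimensional box of width 1, in units where ħ = m = 1. This is my own stand-in for the universe, chosen only because its eigenfunctions and energies are textbook results; the function names and the particular weights are illustrative assumptions.

```python
import cmath
import math


def phi(n, x):
    """n-th energy eigenfunction of a particle in a box of width 1."""
    return math.sqrt(2.0) * math.sin(n * math.pi * x)


def energy(n):
    """Energy of the n-th eigenstate, in units where hbar = m = 1."""
    return (n * math.pi) ** 2 / 2.0


def psi(x, t, weights):
    """Psi(x, t) = sum over n of a_n * phi_n(x) * exp(-i * E_n * t).

    `weights` maps each level n to its complex weight a_n.
    """
    return sum(a * phi(n, x) * cmath.exp(-1j * energy(n) * t)
               for n, a in weights.items())


def total_probability(t, weights, cells=2000):
    """Midpoint-rule integral of |Psi|^2 over the box."""
    dx = 1.0 / cells
    return sum(abs(psi((k + 0.5) * dx, t, weights)) ** 2
               for k in range(cells)) * dx


# Hypothetical initial condition: a mix of the two lowest levels.
# The squared magnitudes of the weights add up to 0.36 + 0.64 = 1.
weights = {1: 0.6, 2: 0.8j}
```

Once the weights are fixed, evaluating `psi` at any later time is a single sum, with no time stepping at all. Calling `total_probability` at a few different times is a reasonable sanity check: it should stay at 1.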

***

Okay, I cheated a little bit. You might have guessed that writing out the exact wave function of the entire universe is not actually doable in a short blog post. The problem can’t be that simple.

But at the same time, everything I said above is actually true, and the final equation I presented really is the correct wave function of the universe. So if the problem must be more complex, where is the complexity hidden away?

The answer is that the complexity is hidden away in the first “magical fact” about allowed energy states.

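$$H[\varphi_E] = E \cdot \varphi_E$$

This is in general a second-order partial differential equation in the positions of all N particles at once, and H depends on those positions in a highly nonlinear way. If we actually wanted to expand out Ψ in terms of the different functions $\varphi_E$, we’d have to solve this equation.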
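To make that concrete, here is what the equation typically looks like for N non-relativistic particles, in units where ħ = 1. I haven’t pinned down a specific H in this post, so the kinetic-plus-potential form and the symbols $\nabla_k$ and V are assumptions borrowed from the standard textbook setup.

```latex
\left[ -\sum_{k=1}^{N} \frac{1}{2 m_k} \nabla_k^2 + V(x_1, x_2, \ldots, x_N) \right]
\varphi_E(x_1, x_2, \ldots, x_N)
= E \cdot \varphi_E(x_1, x_2, \ldots, x_N)
```

The derivatives are second order, and V plays the role that the force functions played in the classical picture: it ties every position to every other.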

So there is no free lunch here. But what’s interesting is where the complexity moves when switching from classical mechanics to quantum mechanics.

In classical mechanics, virtually zero effort goes into formalizing the space of states, or talking about what configurations of the universe are allowable. All of the hardness of the problem of solving the laws of physics is packed into the dynamics. That is, it is easy to specify an initial condition of the universe. But describing how that initial condition evolves forward in time is virtually impossible.

By contrast, in quantum mechanics, solving the equation of motion is trivially easy. And all of the complexity has moved to defining the system. If somebody hands you the allowed energy levels and energy functions of the universe at a given moment of time, you can solve the future of the rest of the universe immediately. But actually finding the allowed energy levels and corresponding wave functions is virtually impossible.

***

Let’s get to the strangest (and my favorite) part of this.

If quantum mechanics is an accurate description of the world, then the following must be true:

Ψ(x1, x2, …, xN, 0) = ∑ aE · φE(x1, x2, …, xN)
implies
Ψ(x1, x2, …, xN, t) = ∑ aE · φE(x1, x2, …, xN) · e^(-iEt)

This equation has two especially interesting features. First, each term in the sum can be broken down separately into a function of position and a function of time.

And second, the temporal component of each term is an imaginary exponential – a phase factor e^(-iEt).

Let me take a second to explain the significance of this.

In quantum mechanics, physical quantities are invariably found by taking the absolute square of complex quantities. This is why you can have a complex wave function and an equation of motion with an i in it, and still end up with a universe quite free of imaginary numbers.

But when you take the absolute square of e^(-iEt), you end up with e^(-iEt) · e^(iEt) = 1. What’s important here is that the time dependence seems to fall away.

A way to see this is to notice that y = e^(-ix), when graphed, looks like a point on a unit circle in the complex plane.

Phase

So e^(-iEt), when graphed, is just a point repeatedly spinning around the unit circle. The larger E is, the faster it spins.

2-Interference

Taking the absolute square of a complex number is the same as finding its squared distance from the origin on the complex plane. And since e^(-iEt) always stays on the unit circle, its absolute square is always 1.

So what this all means is that quantum mechanics tells us that there’s a sense in which our universe is remarkably static. The universe starts off as a superposition of a bunch of possible energy states, each with a particular weight. And it ends up as a sum over the same energy states, with weights of the exact same magnitude, just pointing in different directions in the complex plane.

Imagine drawing the universe by drawing out all possible energy states in boxes, and shading these boxes according to how much amplitude is distributed in them. Now we advance time forward by one millisecond. What happens?

Absolutely nothing, according to quantum mechanics. The distribution of shading across the boxes stays the exact same, because the phase factor multiplication does not change the magnitude of the amplitude in each box.
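Here is a minimal sketch of the box picture in Python, assuming a toy "universe" with only three energy states. The energies and initial weights are made-up numbers chosen purely for illustration, and the function names are my own. Each weight gets multiplied by its phase factor, and the shading of each box (the absolute square of its weight) is compared before and after.

```python
import cmath

# Hypothetical toy system: three allowed energies and their initial weights.
energies = [1.0, 2.5, 4.0]
amplitudes = [0.6 + 0.0j, 0.0 + 0.48j, 0.64 + 0.0j]  # squared magnitudes sum to 1

def evolve(amps, energies, t):
    """Multiply each weight by its phase factor e^(-iEt)."""
    return [a * cmath.exp(-1j * E * t) for a, E in zip(amps, energies)]

def shading(amps):
    """The 'shading' of each box: the absolute square of its weight."""
    return [abs(a) ** 2 for a in amps]

before = shading(amplitudes)
after = shading(evolve(amplitudes, energies, t=0.001))

for E, b, a in zip(energies, before, after):
    print(f"E = {E}: shading before = {b:.4f}, after = {a:.4f}")
```

The weights themselves do change (they rotate in the complex plane), but the shading in every box comes out the same up to floating-point error, whatever value of t is used.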

Given this, we are faced with a bizarre question: if quantum mechanics tells us that the universe is static in this particular way, then why do we see so much change and motion and excitement all around us?

I’ll stop here for you to puzzle over, but I’ve posted an answer here.

Iterated Simpson’s Paradox

Previous: Simpson’s paradox

In the last post, we saw how statistical reasoning can go awry in Simpson’s paradox, and how causal reasoning can rescue us. In this post, we’ll be generalizing the idea behind the paradox and producing arbitrarily complex versions of it.

The main idea behind Simpson’s paradox is that conditioning on an extra variable can sometimes reverse dependencies.

In our example in the last post, we saw that one treatment for kidney stones worked better than another, until we conditioned on the kidney stone’s size. Upon conditioning, the sign of the dependence between treatment and recovery changed, so that the first treatment now looked like it was less effective than the other.

We explained this as a result of a spurious correlation, which we represented with ‘paths of dependence’ like so:

simpsons-paradox-paths1.png

But we can do better than just one reversal! With our understanding of causal models, we are able to generate new reversals by introducing appropriate new variables to condition upon.

Our toy model for this will be a population of sick people, some given a drug and some not (D), and some who recover and some who do not (R). If there are no spurious correlations between D and R, then our diagram is simply:

Iter Simpson's 0

Now suppose that we introduce a source of spurious correlation, wealth (W). Wealthy people are more likely to get the drug (let’s say that this occurs through a causal intermediary of education level E), and are more likely to recover (we’ll suppose that this occurs through a causal intermediary of diet nutrition level N).

Now we have the following diagram:

Iter Simpson's 1

Where there was previously only one path of dependency between D and R, there is now a second. This means that if we observe W, we break the spurious dependency between D and R, and retain the true causal dependence.

Iter Simpson's 1 all paths          Iter Simpson's 1 broken.png

This allows us one possible Simpson’s paradox: by conditioning upon W, we can change the direction of the dependence between D and R.

But we can do better! Suppose that your education level causally influences your nutrition. This means that we now have three paths of dependency between D and R. This allows us to cause two reversals in dependency: first by conditioning on W and second by conditioning on N.

Iter Simpson's 2 all paths.png  Iter Simpson's 2 broke 1  Iter Simpson's 2 broke 2

And we can keep going! Suppose that education does not cause nutrition, but both education and nutrition causally impact IQ. Now we have three possible reversals. First we condition on W, blocking the top path. Next we condition on I, creating a dependence between E and N (via explaining away). And finally, we condition on N, blocking the path we just opened. Now, to discern the true causal relationship between the drug and recovery, we have two choices: condition on W, or condition on all three W, I, and N.

Iter Simpson's 3 all paths  Iter Simpson's 3 cond W  Iter Simpson's 3 cond WI  Iter Simpson's 3 cond WIN

As might be becoming clear, we can do this arbitrarily many times. For example, here’s a five-step iterated Simpson paradox set-up:

Big iter simpson

The direction of dependence switches when you condition on, in this order: A, X, B’, X’, C’. You can trace out the different paths to see how this happens.

Part of the reason that I wanted to talk about the iterated Simpson’s paradox is to show off the power of causal modeling. Imagine that somebody hands you data that indicates that a drug is helpful in the whole population, harmful when you split the population up by wealth levels, helpful when you split it into wealth-IQ classes, and harmful when you split it into wealth-IQ-education classes.

How would you interpret this data? Causal modeling allows you to answer such questions by simply drawing a few diagrams!

Next we’ll move into one of the most significant parts of causal modeling – causal decision theory.

Next: Causal decision theory

Causal decision theory

Previous: Iterated Simpson’s Paradox

We’ll now move on into slightly new intellectual territory, that of decision theory.

While everything we’ve discussed so far had to do with the probabilities of events and causal relationships between variables, we will now discuss which decision is best to make in a given context.

***

Decision theory has two ingredients. The first is a probabilistic model of different possible events that allows an agent to answer questions like “What is the probability that A happens if I do B?” This is, roughly speaking, the agent’s beliefs about the world.

The second ingredient is a utility function U over possible states of the world. This function takes in propositions, and returns the value to a particular agent of that proposition being true. This represents the agent’s values.

So, for instance, if A = “I win a million dollars” and B = “Somebody cuts my ear off”, U(A) will be a large positive number, and U(B) will be a large negative number. For propositions that an agent feels neutral or apathetic about, the utility function assigns them a value of 0.

Different decision theories represent different ways of combining a utility function with a probability distribution over world states. Said more intuitively, decision theories are prescriptions for combining your beliefs and your values in order to yield decisions.

A proposition that all competing decision theories agree on is “You should act to maximize your expected utility.” The difference between these different theories, then, is how they think that expected utility should be calculated.

“But this is simple!” you might think. “Simply sum over the value of each consequence, and weight each by its likelihood given a particular action! This will be the expected utility of that action.”

This prescription can be written out as follows:

Evidential Decision Theory.png

Here A is an action, C is the index for the different possible world states that you could end up in, and K is the conjunction of all of your background knowledge.

***

While this is quite intuitive, it runs into problems. For instance, suppose that scientists discover a gene G that causes both a greater chance of smoking (S) and a greater chance of developing cancer (C). In addition, suppose that smoking is known to not cause cancer.

Smoking Lesion problem

The question is, if you slightly prefer to smoke, then should you do so?

The most common response is that yes, you should do so. Either you have the cancer-causing gene or you don’t. If you do have the gene, then you’re already likely to develop cancer, and smoking won’t do anything to increase that chance.

And if you don’t have the gene, then you already probably won’t develop cancer, and smoking again doesn’t make it any more likely. So regardless of if you have the gene or not, smoking does not affect your chances of getting cancer. All it does is give you the little utility boost of getting to smoke.

But our expected utility formula given above disagrees. It sees that you are almost certain to get cancer if you smoke, and almost certain not to if you don’t. And this means that the expected utility of smoking includes the utility of cancer, which we’ll suppose to be massively negative.

Let’s do the calculation explicitly:

EU(S) = U(C & S) * P(C | S) + U(~C & S) * P(~C | S)
≈ U(C & S) << 0
EU(~S) = U(~S & C) * P(C | ~S) + U(~S & ~C) * P(~C | ~S)
≈ U(~S & ~C) ~ 0

Therefore we find that EU(~S) >> EU(S), so our expected utility formula will tell us to avoid smoking.

The problem here is evidently that the expected utility function is taking into account not just the causal effects of your actions, but the spurious correlations as well.

The standard way that decision theory deals with this is to modify the expected utility function, switching from ordinary conditional probabilities to causal conditional probabilities.

Causal Decision Theory.png

You can calculate these causal conditional probabilities by intervening on S, which corresponds to removing all its incoming arrows.

Smoking Lesion problem mutilated

Now our expected utility function exactly mirrors our earlier argument – whether or not we smoke has no impact on our chance of getting cancer, so we might as well smoke.

Calculating this explicitly:

EU(S) = U(S & C) * P(C | do S) + U(S & ~C) * P(~C | do S)
= U(S & C) * P(C) + U(S & ~C) * P(~C)
EU(~S) = U(~S & C) * P(C | do ~S) + U(~S & ~C) * P(~C | do ~S)
= U(~S & C) * P(C) + U(~S & ~C) * P(~C)

Since you slightly prefer to smoke, U(S & C) > U(~S & C) and U(S & ~C) > U(~S & ~C). Comparing the two expressions term by term, we can see that EU(S) must be greater than EU(~S), regardless of the value of P(C).
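To see both calculations side by side, here is a rough Python sketch. All of the numbers are assumptions made up for illustration (the gene frequency, how strongly the gene drives smoking and cancer, and the utilities), as are the function names. The evidential version conditions on smoking in the ordinary way, so smoking acts as evidence about the gene. The causal version uses P(C | do S) = P(C).

```python
# Hypothetical numbers for the smoking-lesion setup: gene G raises the chance
# of both smoking (S) and cancer (C); smoking has no causal effect on cancer.
P_G = 0.5
P_S_GIVEN_G = {True: 0.95, False: 0.05}   # P(S | G)
P_C_GIVEN_G = {True: 0.95, False: 0.05}   # P(C | G)

U_CANCER = -1000.0   # assumed utility of getting cancer
U_SMOKE = 1.0        # assumed small utility boost from smoking

def utility(smoke, cancer):
    return (U_SMOKE if smoke else 0.0) + (U_CANCER if cancer else 0.0)

def p_gene(g):
    return P_G if g else 1 - P_G

def p_cancer_given_smoking(smoke):
    """Evidential: P(C | S), where the choice to smoke is evidence about G."""
    def p_s(g):
        return P_S_GIVEN_G[g] if smoke else 1 - P_S_GIVEN_G[g]
    norm = sum(p_gene(g) * p_s(g) for g in (True, False))
    return sum(p_gene(g) * p_s(g) * P_C_GIVEN_G[g] for g in (True, False)) / norm

def p_cancer_do_smoking(smoke):
    """Causal: P(C | do S) = P(C), since cutting G -> S leaves no path from S to C."""
    return sum(p_gene(g) * P_C_GIVEN_G[g] for g in (True, False))

def expected_utility(smoke, p_cancer):
    p = p_cancer(smoke)
    return utility(smoke, True) * p + utility(smoke, False) * (1 - p)

edt = {s: expected_utility(s, p_cancer_given_smoking) for s in (True, False)}
cdt = {s: expected_utility(s, p_cancer_do_smoking) for s in (True, False)}

print("EDT prefers smoking:", edt[True] > edt[False])
print("CDT prefers smoking:", cdt[True] > cdt[False])
```

With these made-up numbers the evidential calculation recommends not smoking, while the causal one recommends smoking, and the causal margin is exactly the small utility of smoking.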

***

The first expected utility formula that we wrote down represents the branch of decision theory called evidential decision theory. The second is what is called causal decision theory.

Roughly, the difference is that evidential decision theory evaluates the possible consequences of your decisions as if it were observing those decisions from the outside, while causal decision theory evaluates them as if it were determining your decisions.

EDT treats your decisions as just another event out in the world, while CDT treats your decisions like causal interventions.

Perhaps you think that the choice between these is obvious. But Newcomb’s problem is a famous thought experiment that splits people along these lines and challenges both theories. I’ve written about it here, but for now will leave decision theory for new topics.

Next: Causality for philosophers

Free will and decision theory

This post is about one of the things that I’ve been recently feeling confused about.

In a previous post, I described different decision theories as different algorithms for calculating expected utility. So for instance, the difference between an evidential decision theorist and a causal decision theorist can be expressed in the following way:

EDT vs CDT

What I am confused about is that each decision theory involves a choice to designate some variables in the universe as “actions”, and all the others as “consequences.” I’m having trouble making a principled rule that tells us why some things can be considered actions and others not, without resorting to free will talk.

So for example, consider the following setup:

There’s a gene G in some humans that causes them to have strong desires for candy (D). This gene also causes low blood sugar (B) via a separate mechanism. Eating lots of candy (E) causes increased blood sugar. And finally, people have self-control (S), which helps them not eat candy, even if they really desire it.

We can represent all of these relationships in the following diagram.

Free will.png

Now we can compare how EDT and CDT will decide on what to do.

If EDT looks at the expected utility of eating candy vs not eating candy, they’ll find both a negative dependence (eating candy makes a low blood sugar less likely), and a positive dependence (eating candy makes it more likely that you have the gene, which makes it more likely that you have a low blood sugar).

Let’s suppose that the positive dependence outweighs the negative dependence, so that EDT ends up seeing that eating candy makes it overall more likely that you have a low blood sugar.

P(B | E) > P(B)

What does CDT calculate? Well, they look at the causal conditional probability P(B | do E). In other words, they calculate their probabilities according to the following diagram.

Free will CDT

Now they’ll see only a single dependence between eating candy (E) and having a low blood sugar (B) – the direct causal dependence. Thus, they end up thinking that eating candy makes them less likely to have a low blood sugar.

P(B | do E) < P(B)

This difference in how they calculate probabilities may lead them to behave differently. So, for instance, if they both value having a low blood sugar much more than eating candy, then the evidential decision theorist will eat the candy, and the causal decision theorist will not.

Okay, fine. This all makes sense. The problem with this is, both of them decided to make their decision on the basis of what value of E maximizes expected utility. But this was not their only choice!

They could instead have said, “Look, whether or not I actually eat the candy is not under my direct control. That is, the actual movement of my hand to the candy bar and the subsequent chewing and swallowing. What I’m controlling in this process is my brain state before and as I decide to eat the candy. In other words, what I can directly vary is the value of S – whether or not the self-controlled part of my mind tells me to eat the candy or not. The value of E that ends up actually obtaining is then a result of my choice of the value of S.”

If they had thought this way, then instead of calculating EU(E) and EU(~E), they would calculate EU(S) and EU(~S), and go with whichever one maximizes expected utility.

But now we get a different answer than before!

In particular, CDT and EDT are now looking at the same diagram, because when the causal decision theorist intervenes on the value of S, there are no causal arrows for them to break. This means that they calculate the same probabilities.

P(B | S) = P(B | do S)

And thus get the same expected utility values, resulting in them behaving the same way.
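These comparisons can be checked by brute force. Below is a small Python sketch that enumerates every possible world in the diagram (G → D, G → B, D → E, S → E, E → B). The conditional probability tables are hypothetical values I picked so that the positive dependence outweighs the negative one. An intervention is modeled by dropping the intervened variable's own factor, which is the same as removing its incoming arrows.

```python
from itertools import product

# Hypothetical conditional probabilities (B means "low blood sugar").
P_G = 0.5                                            # P(G)
P_S = 0.5                                            # P(S)
P_D = {True: 0.9, False: 0.1}                        # P(D | G)
P_E = {(True, True): 0.3, (True, False): 0.9,
       (False, True): 0.05, (False, False): 0.2}     # P(E | D, S)
P_B = {(True, True): 0.7, (True, False): 0.9,
       (False, True): 0.05, (False, False): 0.2}     # P(B | G, E)

def bern(p, value):
    return p if value else 1 - p

def joint(w, do):
    """Probability of world w. Variables in `do` are set by intervention,
    so their own factor (their incoming arrows) is dropped."""
    if any(w[k] != v for k, v in do.items()):
        return 0.0
    p = bern(P_G, w["G"]) * bern(P_D[w["G"]], w["D"])
    if "S" not in do:
        p *= bern(P_S, w["S"])
    if "E" not in do:
        p *= bern(P_E[(w["D"], w["S"])], w["E"])
    return p * bern(P_B[(w["G"], w["E"])], w["B"])

def p_low_sugar(given=None, do=None):
    """P(B | given, do(...)) by summing over all 32 worlds."""
    given, do = given or {}, do or {}
    num = den = 0.0
    for values in product((True, False), repeat=5):
        w = dict(zip("GDSEB", values))
        if any(w[k] != v for k, v in given.items()):
            continue
        p = joint(w, do)
        den += p
        if w["B"]:
            num += p
    return num / den

print("P(B)        =", round(p_low_sugar(), 3))
print("P(B | E)    =", round(p_low_sugar(given={"E": True}), 3))
print("P(B | do E) =", round(p_low_sugar(do={"E": True}), 3))
print("P(B | S)    =", round(p_low_sugar(given={"S": True}), 3))
print("P(B | do S) =", round(p_low_sugar(do={"S": True}), 3))
```

Under these assumed tables, observing E raises the probability of B while intervening on E lowers it, and the two S quantities coincide because S has no incoming arrows to cut.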

Furthermore, somebody else might argue “No, don’t be silly. We don’t only have control over S, we have control over both S and E.” This corresponds to varying both S and E in our expected utility calculation, and choosing the optimal values. That is, they choose the actions that correspond to the max of the set { EU(S, E), EU(S, ~E), EU(~S, E), EU(~S, ~E) }.

Another person might say “Yes, I’m in control of S. But I’m also in control of D! That is, if I try really hard, I can make myself not desire things that I previously desired.” This person will vary S and D, and choose that which optimizes expected utility.

Another person will claim that they are in control of S, D, and E, and their algorithm will look at all eight combinations of these three values.

Somebody else might say that they have partial control over D. Another person might claim that they can mentally affect their blood sugar levels, so that B should be directly included in their set of “actions” that they use to calculate EU!

And all of these people will, in general, get different answers.

***

Some of these possible choices of the “set of actions” are clearly wrong. For instance, a person that says that they can by introspection change the value of G, editing out the gene in all of their cells, is deluded.

But I’m not sure how to make a principled judgment as to whether or not a person should calculate expected utilities varying S and D, varying just S, varying just E, and other plausible choices.

What’s worse, I’m not exactly sure how to rigorously justify why some variables are “plausible choices” for actions, and others not.

What’s even worse, when I try to make these types of principled judgments, my thinking naturally seems to end up relying on free-will-type ideas. So we want to say that we are actually in control of S, and in a sense we can’t really freely choose the value of D, because it is determined by our genes.

But if we extend this reasoning to its extreme conclusion, we end up saying that we can’t control any of the values of the variables, as they are all the determined results of factors that are out of our control.

If somebody hands me a causal diagram and tells me which variables they are “in control of”, I can tell them what CDT recommends them to do and what EDT recommends them to do.

But if I am just handed the causal diagram by itself, it seems that I am required to make some judgments about what variables are under the “free control” of the agent in question.

One potential way out of this is to say that variable X is under the control of agent A if, when they decide that they want to do X, then X happens. That is, X is an ‘action variable’ if you can always trace a direct link between the event in the brain of A of ‘deciding to do X’ and the actual occurrence of X.

Two problems that I see with this are (1) that this seems like it might be too strong of a requirement, and (2) that this seems to rely on a starting assumption that the event of ‘deciding to do X’ is an action variable.

On (1): we might want to say that I am “in control” of my desire for candy, even if my decision to diminish it is only sometimes effectual. Do we say that I am only in control of my desire for candy in those exact instances when I actually successfully determine its value? How about the cases when my decision to desire candy lines up with whether or not I desire candy, but purely by coincidence? For instance, somebody walking around constantly “deciding” to keep the moon in orbit around the Earth is not in “free control” of the moon’s orbit, but this way of thinking seems to imply that they are.

And on (2): Procedurally, this method involves introducing a new variable (“Decides X”), and seeing whether or not it empirically leads to X. After all, if the part of your brain that decides X is completely out of your control, then it makes as much sense to say that you can control X as to say that you can control the moon’s orbit. But then we have a new question, about how much this decision is under your control.  There’s a circularity here.

We can determine if “Decides X” is a proper action variable by imagining a new variable “Decides (Decides X)”, and seeing if it actually is successful at determining the value of “Decides X”. And then, if somebody asks us how we know that “Decides (Decides X)” is an action variable, we look for a variable “Decides (Decides (Decides X))”. Et cetera.

How can we figure our way out of this mess?

Simpson’s paradox

Previous: Screening off and explaining away

A look at admission statistics at a college reveals that women are less likely to be admitted to graduate programs than men. A closer investigation reveals that in fact when the data is broken down into individual department data, women are more likely to be admitted than men. Does this sound impossible to you? It happened at UC Berkeley in 1973.

When two treatments are tested on a group of patients with kidney stones, Treatment A turns out to lead to worse recovery rates than Treatment B. But when the patients are divided according to the size of their kidney stone, it turns out that no matter how large their kidney stone, Treatment A always does better than Treatment B. Is this a logical contradiction? Nope, it happened in 1986!

What’s going on here? How can we make sense of this apparently inconsistent data? And most importantly, what conclusions do we draw? Is Berkeley biased against women or men? Is Treatment A actually more effective or less effective than Treatment B?

In this post, we’ll apply what we’ve learned about causal modeling to be able to answer these questions.

***

Quine gave the following categorization of types of paradoxes: veridical paradoxes (those that seem wrong but are actually correct), falsidical paradoxes (those that seem wrong and actually are wrong), and antinomies (those that are premised on common forms of reasoning and end up deriving a contradiction).

Simpson’s paradox is in the first category. While it seems impossible, it actually is possible, and it happens all the time. Our first task is to explain away the apparent falsity of the paradox.

Let’s look at some actual data on the recovery rates for different treatments of kidney stones.

               Treatment A      Treatment B
All patients   78% (273/350)    83% (289/350)

The percentages represent the number of patients that recovered, out of all those that were given the particular treatment. So 273 patients recovered out of the 350 patients given Treatment A, giving us 78%. And 289 patients recovered out of the 350 patients given Treatment B, giving 83%.

At this point we’d be tempted to proclaim that B is the better treatment. But if we now break down the data and divide up the patients by kidney stone size, we see:

               Treatment A      Treatment B
Small stones   93% (81/87)      87% (234/270)
Large stones   73% (192/263)    69% (55/80)

And here the paradoxical conclusion falls out! If you have small stones, Treatment A looks better for you. And if you have large stones, Treatment A looks better for you. So no matter what size kidney stones you have, Treatment A is better!

And yet, amongst all patients, Treatment B has a higher recovery rate.

Small stones: A better than B
Large stones: A better than B
All sizes: B better than A

I encourage you to check out the numbers for yourself, in case you still don’t believe this.
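If you'd rather let a computer do the checking, here is a short Python sketch using the counts from the tables above. The dictionary layout and function names are just my own choices for illustration.

```python
from fractions import Fraction

# (recovered, treated) counts from the tables above
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(recovered, treated):
    """Recovery rate as an exact fraction."""
    return Fraction(recovered, treated)

def overall(treatment):
    """Recovery rate for a treatment, pooling both stone sizes."""
    recovered = sum(r for r, _ in counts[treatment].values())
    treated = sum(n for _, n in counts[treatment].values())
    return rate(recovered, treated)

for size in ("small", "large"):
    a = rate(*counts["A"][size])
    b = rate(*counts["B"][size])
    print(f"{size} stones: A {float(a):.1%} vs B {float(b):.1%}")

print(f"all patients: A {float(overall('A')):.1%} vs B {float(overall('B')):.1%}")
```

Treatment A wins within each stone size and loses in the pooled comparison, exactly as the tables claim.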

***

The simplest explanation for what’s going on here is that we are treating conditional probabilities like they are joint probabilities. Let’s look again at our table, and express the meaning of the different percentages more precisely.

               Treatment A                                 Treatment B
Small stones   P(Recovery | Small stones & Treatment A)    P(Recovery | Small stones & Treatment B)
Large stones   P(Recovery | Large stones & Treatment A)    P(Recovery | Large stones & Treatment B)
Everybody      P(Recovery | Treatment A)                   P(Recovery | Treatment B)

Our paradoxical result is the following:

P(Recovery | Small stones & Treatment A) > P(Recovery | Small stones & Treatment B)
P(Recovery | Large stones & Treatment A) > P(Recovery | Large stones & Treatment B)
P(Recovery | Treatment A) < P(Recovery | Treatment B)

But this is no paradox at all! There is no law of probability that tells us:

If P(A | B & C) > P(A | B & ~C)
and P(A | ~B & C) > P(A | ~B & ~C),
then P(A | C) > P(A | ~C)

There is, however, a law of probability that tells us:

If P(A & B | C) > P(A & B | ~C)
and P(A & ~B | C) > P(A & ~B | ~C),
then P(A | C) > P(A | ~C)

And if we represented the data in terms of these joint probabilities (probability of recovery AND small stones given Treatment A, for example) instead of conditional probabilities, we’d find that the probabilities add up nicely and the paradox vanishes.

               Treatment A      Treatment B
Small stones   23% (81/350)     67% (234/350)
Large stones   55% (192/350)    16% (55/350)
All patients   78% (273/350)    83% (289/350)

It is in this sense that the paradox arises from improper treatment of conditional probabilities as joint probabilities.

***

This tells us why we got a paradoxical result, but isn’t quite fully satisfying. We still want to know, for instance, whether we should give somebody with small kidney stones Treatment A or Treatment B.

The fully satisfying answer comes from causal modeling. The causal diagram we will draw will have three variables, A (which is true if you receive Treatment A and false if you receive Treatment B), S (which is true if you have small kidney stones and false if you have large), and R (which is true if you recovered).

Our causal diagram should express that there is some causal relationship between the treatment you receive (A) and whether you recover (R). It should also show a causal relationship between the size of your kidney stone (S) and your recovery, as the data indicates that larger kidney stones make recovery less likely.

And finally, it should show a causal arrow from the size of the kidney stone to the treatment that you receive. This final arrow comes from the fact that more people with large stones were given Treatment A than Treatment B, and more people with small stones were given Treatment B than Treatment A.

This gives us the following diagram:

Simpson's paradox

The values of P(S), P(A | S), and P(A | ~S) were calculated from the table we started with. For instance, the value of P(S) was calculated by adding up all the patients that had small kidney stones, and dividing by the total number of patients in the study: (87 + 270) / 700.

Now, we want to know if P(R | A) > P(R | ~A) (that is, if recovery is more likely given Treatment A than given Treatment B).

If we just look at the conditional probabilities given by our first table, then we are taking into account two sources of dependency between treatment type and recovery. The first is the direct causal relationship, which is what we want to know. The second is the spurious correlation between A and R as a result of the common cause S.

Simpson's paradox paths

Here the red arrows represent “paths of dependency” between A and R. For example, since those with small stones are more likely to get Treatment B, and are also more likely to recover, this will result in a spurious correlation between Treatment B and recovery.

So how do we determine the actual non-spurious causal dependency between A and R?

Easy!

If we observe the value of S, then we screen A off from R through S! This removes the spurious correlation, and leaves us with just the causal relationship that we want.

Simpson's paradox broken

What this means is that the true nature of the relationship between treatment type and recovery can be determined by breaking down the data in terms of kidney stone size. Looking back at our original data:

Recovery rate   Treatment A      Treatment B
Small stones    93% (81/87)      87% (234/270)
Large stones    73% (192/263)    69% (55/80)
All patients    78% (273/350)    83% (289/350)

This corresponds to looking at the data divided up by size of stones, and not the data on all patients. And since for each stone size category, Treatment A was more effective than Treatment B, this is the true causal relationship between A and R!

***

A nice feature of the framework of causal modeling is that there are often multiple ways to think about the same problem. So instead of thinking about this in terms of screening off the spurious correlation through observation of S, we could also think in terms of causal interventions.

In other words, to determine the true nature of the causal relationship between A and R, we want to intervene on A, and see what happens to R.

This corresponds to calculating if P(R | do A) > P(R | do ~A), rather than if P(R | A) > P(R | ~A).

Intervention on A gives us the new diagram:

Simpson's paradox intervene

With this diagram, we can calculate:

P(R | do A)
= P(R & S | do A) + P(R & ~S | do A)
= P(S) * P(R | A & S) + P(~S) * P(R | A & ~S)
= 51% * 93% + 49% * 73%
= 83.2%

And…

P(R | do ~A)
= P(R & S | do ~A) + P(R & ~S | do ~A)
= P(S) * P(R | ~A & S) + P(~S) * P(R | ~A & ~S)
= 51% * 87% + 49% * 69%
= 78.2%

Now not only do we see that Treatment A is better than Treatment B, but we can also compute the amount by which it is better: it improves recovery chances by about 5 percentage points!
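The same intervention calculation can be sketched in Python directly from the raw counts. The names here are my own, and because the code uses unrounded fractions rather than the rounded percentages above, the results differ slightly in the last digit.

```python
from fractions import Fraction

# (recovered, treated) counts from the kidney stone tables
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}
total = sum(n for by_size in counts.values() for _, n in by_size.values())

def p_size(size):
    """P(stone size), pooled over both treatments."""
    return Fraction(sum(counts[t][size][1] for t in counts), total)

def p_recover_do(treatment):
    """P(R | do treatment) = sum over sizes of P(size) * P(R | treatment, size)."""
    return sum(
        p_size(size) * Fraction(*counts[treatment][size])
        for size in ("small", "large")
    )

do_a = p_recover_do("A")
do_b = p_recover_do("B")
print(f"P(R | do A)  = {float(do_a):.1%}")
print(f"P(R | do ~A) = {float(do_b):.1%}")
print(f"difference   = {float(do_a - do_b):.1%}")
```

The key design point is that each stone-size group is weighted by P(S), how common that size is in the whole population, instead of by how often that size happened to receive each treatment.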

Next, we’re going to go kind of crazy with Simpson’s paradox and show how to construct an infinite chain of Simpson’s paradoxes.

Fantastic paper on all of this here.

Next: Iterated Simpson’s paradox

Screening off and explaining away

Previous: Correlation and causation

In this post, I’ll explain three of the most valuable tools for inference that arise naturally from causal modeling.

Screening off via causal intermediary
Screening off via common cause
Explaining away

First:

Suppose that the rain causes the sidewalk to get wet, and the sidewalk getting wet causes you to slip and break your elbow.

rain & slip & elbow.png

This means that if you know that it’s raining, then you know that a broken elbow is more likely. But if you also know that the sidewalk is wet, then learning whether or not it is raining no longer makes a broken elbow more likely. After all, the rain is only a useful piece of information for predicting broken elbows insofar as it allows you to infer sidewalk-wetness.

In other words, the information about sidewalk-wetness screens off the information about whether or not it is raining with respect to broken elbows. In particular, sidewalk-wetness screens off rain because it is a causal intermediary to broken elbows.
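Here is a small numerical sketch of this kind of screening off in Python. The probabilities are invented for illustration; the only structural assumption is the chain rain → wet sidewalk → slip.

```python
from itertools import product

# Hypothetical numbers for the chain rain -> wet sidewalk -> slip.
P_RAIN = 0.3
P_WET = {True: 0.9, False: 0.1}     # P(wet | rain)
P_SLIP = {True: 0.2, False: 0.01}   # P(slip | wet)

def bern(p, value):
    return p if value else 1 - p

def joint(rain, wet, slip):
    return bern(P_RAIN, rain) * bern(P_WET[rain], wet) * bern(P_SLIP[wet], slip)

def p_slip(**given):
    """P(slip | given), computed by summing over all eight worlds."""
    num = den = 0.0
    for rain, wet, slip in product((True, False), repeat=3):
        world = {"rain": rain, "wet": wet, "slip": slip}
        if any(world[k] != v for k, v in given.items()):
            continue
        p = joint(rain, wet, slip)
        den += p
        if slip:
            num += p
    return num / den

print("P(slip | rain)         =", round(p_slip(rain=True), 3))
print("P(slip | no rain)      =", round(p_slip(rain=False), 3))
print("P(slip | wet, rain)    =", round(p_slip(wet=True, rain=True), 3))
print("P(slip | wet, no rain) =", round(p_slip(wet=True, rain=False), 3))
```

Rain changes the chance of slipping on its own, but once the sidewalk is known to be wet the two rain cases give the same answer.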

Second:

Suppose that being wealthy causes you to eat more nutritious food, and being wealthy also causes you to own fancy cars.

common cause.png

This means that if you see somebody in a fancy car, you know it is more likely that they eat nutritious food. But if you already knew that they were wealthy, then knowing that their car is fancy tells you no more about the nutritiousness of their diet. After all, the fanciness of the car is only a useful piece of information for predicting nutritious diets insofar as it allows you to infer wealth.

In other words, wealth screens off ownership of fancy cars with respect to nutrition. In particular, wealth screens off ownership of fancy cars because it is a common cause of nutrition and fancy car owning.

Third:

Suppose that being really intelligent causes you to get on television, and being really attractive causes you to get on television, but attractiveness and intelligence are not directly causally related.

[Diagram: intelligence → television ← attractiveness]

This means that in the general population, you don’t learn anything about somebody’s intelligence by assessing their attractiveness. But if you know that they are on television, then you do learn something about their intelligence by assessing their attractiveness.

In particular, if you know that somebody is on television, and then you learn that they are attractive, then it becomes less likely that they are intelligent than it was before you learned this.

We say that in this scenario attractiveness explains away intelligence, given the knowledge that they are on television.
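Here is a small sketch of explaining away with made-up numbers. Everything numeric below is a hypothetical assumption; the only thing taken from the example is the structure, in which intelligence and attractiveness are independent causes of being on television.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration.
P_SMART = 0.1
P_ATTRACTIVE = 0.1
# P(on TV | smart, attractive): either trait helps, having both helps most.
P_TV = {
    (True, True): 0.5,
    (True, False): 0.3,
    (False, True): 0.3,
    (False, False): 0.01,
}


def bern(p, value):
    """Probability that a binary variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p


def joint(smart, attractive, tv):
    """Joint probability under smart -> tv <- attractive."""
    return (bern(P_SMART, smart)
            * bern(P_ATTRACTIVE, attractive)
            * bern(P_TV[(smart, attractive)], tv))


def prob_smart(**given):
    """P(smart = True | given), summing the joint over consistent worlds."""
    num = den = 0.0
    for smart, attractive, tv in product([True, False], repeat=3):
        world = {"attractive": attractive, "tv": tv}
        if any(world[name] != value for name, value in given.items()):
            continue
        p = joint(smart, attractive, tv)
        den += p
        if smart:
            num += p
    return num / den


# In the general population, attractiveness says nothing about intelligence.
print(prob_smart(), prob_smart(attractive=True))
# Among people on television, learning they are attractive lowers it.
print(prob_smart(tv=True), prob_smart(tv=True, attractive=True))
```

With these particular numbers, the probability of intelligence given television drops from roughly 0.48 to roughly 0.16 once attractiveness is also known, because attractiveness already accounts for the television appearance.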

***

I want to introduce some notation that will allow us to really compactly describe these types of effects and visualize them clearly.

We’ll depict an ‘observed variable’ in a causal diagram as follows:

[Diagram: A → B → C, with B marked as observed]

This diagram says that A causes B, B causes C, and the value of B is known.

In addition, we talked about the value of one variable telling you something about the value of another variable, given some information about other variables. For this we use the language of dependence.

To say, for example, that A and B are independent given C, we write:

(A ⫫ B) | C

And to say that A and B are dependent given C, we just write:

~(A ⫫ B) | C
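For reference, the standard probabilistic definition behind this notation can be written as follows. This is the usual textbook definition of conditional independence, not something specific to this post:

```latex
(A \perp\!\!\!\perp B) \mid C
\;\iff\;
P(A, B \mid C) = P(A \mid C)\, P(B \mid C)
\;\iff\;
P(A \mid B, C) = P(A \mid C)
```

The last form holds for all values with P(B, C) > 0, and it matches the informal reading used above: once you know C, learning B tells you nothing more about A.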

With this notation, we can summarize everything I said above with the following diagram:

[Diagram: Screening off and explaining away, one rule per row]

In words, the first row expresses dependent variables that become independent when conditioning on causal intermediaries. B screens off A from C as a causal intermediary.

The second expresses dependent variables that become independent when conditioning on common causes. B screens off A from C as a common cause.

And the third row expresses independent variables that become dependent when conditioning on common effects. A explains away C, given B.

***

Repeated application of these three rules allows you to determine dependencies in complicated causal diagrams. Let’s say that somebody gives you the following diagram:

[Diagram: Complex cause]

First they ask you if E and F are going to be correlated.

We can answer this just by tracing causal paths through the diagram. Take a path leading from E to F and look at each connected triple of variables along it. If there is dependence between the end variables of every triple on that path, then we know that E and F are dependent.

The path ECA is a causal chain, and C is not observed, so E and A are dependent along this path. Next, the path CAD is a common cause path, and the common cause (A) is not observed, thus retaining dependence again along the path. And finally, the path ADF is a causal chain with D unobserved, so A and F are dependent along the path.

So E and F are dependent.

Now your questioner tells you the value of D, and asks you again if E and F are dependent.

[Diagram: Complex cause, with D observed]

Now dependence still exists along the paths ECA and CAD, but the path ADF breaks the dependence. This follows from the rule in row 1: D is observed, so A is screened off from F. Since A is screened off, E is as well. This means that E and F are now independent.

Suppose they had instead asked you if E and B were dependent before telling you the value of D. In this case, the dependence travels along ECA and along CAD, but is broken along ADB because D is not observed. This follows from our rule in row 3: D is a common effect of A and B, and the causes of an unobserved common effect are independent.

And if they asked you if E and B were dependent after telling you the value of D, then you would respond that they are dependent. Now the last leg of the path (ADB) carries dependence, because D is an observed common effect, so A and B explain each other away.
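The path-tracing procedure above can be sketched in a few lines of code. Two caveats: the edge list below is inferred from the worked example (A → C → E, A → D → F, B → D) rather than read off the original figure, so treat it as an assumption, and the function names are hypothetical. The check is the standard d-separation test, which adds one detail the three rows don't mention: a common effect also opens a path when one of its descendants is observed.

```python
# Edges inferred from the worked example in the text (an assumption, since
# the figure itself is not reproduced here): A -> C -> E, A -> D -> F, B -> D.
EDGES = [("A", "C"), ("C", "E"), ("A", "D"), ("D", "F"), ("B", "D")]


def children(node):
    return {child for parent, child in EDGES if parent == node}


def descendants(node):
    """All nodes reachable from `node` by following arrows forward."""
    found, stack = set(), [node]
    while stack:
        for child in children(stack.pop()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def neighbors(node):
    """Nodes adjacent to `node`, ignoring arrow direction."""
    parents = {parent for parent, child in EDGES if child == node}
    return children(node) | parents


def simple_paths(start, end, path=None):
    """Yield every path from start to end that never revisits a node."""
    path = path or [start]
    if path[-1] == end:
        yield path
        return
    for nxt in sorted(neighbors(path[-1])):
        if nxt not in path:
            yield from simple_paths(start, end, path + [nxt])


def triple_open(a, b, c, observed):
    """Does dependence flow through the middle node b of the triple a-b-c?"""
    is_common_effect = (a, b) in EDGES and (c, b) in EDGES
    if is_common_effect:
        # Row 3: open only if b (or one of its descendants) is observed.
        return b in observed or bool(descendants(b) & observed)
    # Rows 1 and 2: chains and common causes are open unless b is observed.
    return b not in observed


def dependent(x, y, observed=frozenset()):
    """True if some path from x to y has every triple open."""
    observed = set(observed)
    for path in simple_paths(x, y):
        triples = zip(path, path[1:], path[2:])
        if all(triple_open(a, b, c, observed) for a, b, c in triples):
            return True
    return False


print(dependent("E", "F"))         # True
print(dependent("E", "F", {"D"}))  # False
print(dependent("E", "B"))         # False
print(dependent("E", "B", {"D"}))  # True
```

The four printed results match the four answers worked out by hand above.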

The general ability to look at a complicated causal diagram and read off which variables are dependent is a valuable tool, and we will come back to it in the future.

Next, I’ll talk about one of my current favorite applications of causal diagrams: Simpson’s paradox!

Next: Simpson’s paradox