TIMN view of social evolution

(Papers here and here)

In the Neolithic era, societies are thought to have been mostly small groups bonded by kinship relations, with little social stratification. As technological advancement accommodated more complex social structures and larger groups of humans living together, problems of coordination became increasingly difficult. In response, more complex social structures arose, such as Chiefdoms, States and eventually Empires.

These structures solved coordination problems through a top-down command-and-control approach, enforced by strict hierarchical power structures. Historical exemplars include Ancient Egypt and the Roman Empire. These societies experienced immense growth, expanding to dominate vast stretches of territory and millions of humans.

But as they grew, these societies began facing increasingly difficult problems of managing vast amounts of information involving complex exchanges and economic dynamics. Eventually, old mercantilist systems in which the state was in charge of economic transactions gave way to a grand new form of social structure: the market.

Societies that adopted market structures alongside the state became global leaders, dominating technological, social, and economic progress up until the present day. And just as previous forms of society had their distinctive failings, capitalistic societies face problems in the creation of social inequalities without the ability to address them.

Advances in technology that allow a revolutionary capacity for information exchange are resulting in the formation of a new form of social structure to address these problems. This structure is characterized by complicated heterarchical cooperation between massive networks of physically dispersed individuals, all coordinating on the basis of shared ideological aims. It is to these networked societies that the future belongs.

This is the view of history offered by political scientist David Ronfeldt, who framed the TIMN theory of social evolution.

If I were to summarize his entire theory in four sentences, I would say:

Societies through history can be explained through the interactions of four major forms of social structure: the Tribe, the Institution, the Market, and the Network. Each form defines a structure of governance and the way that individuals interact with one another, as well as cultural values and beliefs about the way society should be organized.

Each form has its own strengths and weaknesses, and the progress of history has been a move towards adopting all four forms in a complicated balance. The future will belong to those societies that realize the potential of the network form and successfully incorporate it into their social structure.

There are a lot of parallels between this and previous things that I’ve read. I’ll go into that in a moment, but first will lay out more detailed definitions of his four primary structures.

The Tribe: Tribes are characterized by tight kinship relationships. Tribal social structures create strong senses of social identity and belonging, and define the culture of successive societies. They are small, egalitarian, and generally lack a strong leader. Their limitations are problems of administration and coordination as they grow, as well as nepotism and intertribal wars. Historical examples abound in the Neolithic era, and in modern times they exist in certain hot spots in Third World countries. In the First World, tribal patterns exist within families, urban gangs, civic clubs, and more abstractly in nationalism, racism and sports team mania.

The Institution: Institutions are characterized by authority figures, strict hierarchies, management structures, and administrative bureaucracies. Their strengths involve administration and solving coordination problems. They are afflicted with problems of corruption and abuse of power, as well as difficulty processing large amounts of information, leading to economic inefficiency. Examples include the great Empires, and they exist today in states, military organizations, religious organizations, and corporations.

The Market: Markets are characterized by competition and voluntary exchanges between self-interested individuals. They are decentralized and nonhierarchical, and do well at handling enormous amounts of complex information and optimizing economic efficiency in exchanges of private goods. They lead to productive and innovative societies with thriving trade and commerce. Markets struggle to deal with externalities and lead to social inequality. Markets historically took off in the transition from mercantilism to capitalism in Europe, and are exemplified by the economies of the U.S. and the U.K. and more recently Chile, China, and Mexico.

The Network: Networks are characterized by cooperation between many autonomous individuals with no single central authority, where each individual is connected to all others. They are tied together not by blood or kinship relationships, but by ideology and common goals. Their strengths are yet to be seen, though Ronfeldt thinks that they could do well at promoting “group empowerment” and solving social issues. Same with their weaknesses, though he points vaguely in the direction of “information overload” and “deception”. Examples include social networks and transnational networks of NGOs.

Networks are the most poorly specified and speculative of the four forms. This is perhaps to be expected; after all, he thinks they have only begun to come into prominence at the advent of the Information Age.

They’re also the form that he stresses the most, making lots of breathless predictions about networked societies superseding the market-state societies that dominate the status quo. He urges states like the U.S. and the U.K. to become active participants in the ushering in of this great new era if they want to remain global leaders.

This part was less interesting to me. I’m not convinced that the problems of social inequality that he thinks Networks are necessary for cannot be fixed in a Market/State paradigm. All the same, it was nice to see falsifiable predictions from an otherwise highly theoretical work.

What I enjoyed most was his view of history. He sees the four forms as additive. When a society incorporates a new form, it does not discard the old, but builds upon it. Both end up modifying and influencing each other, and the end product is a combined system that incorporates both.

So for instance, the culture of a Tribe bleeds into its later instantiations as a State-run society, and can remain generations after the more visible tribal structures have passed on. And the adoption of free-market economic systems forces a reshaping of the State towards political democracy. He quotes Charles Lindblom:

However poorly the market is harnessed to democratic purposes, only within market-oriented systems does political democracy arise. Not all market-oriented systems are democratic, but every democratic system is also a market-oriented system. Apparently, for reasons not wholly understood, political democracy has been unable to exist except when coupled with the market. An extraordinary proposition, it has so far held without exception.

Ronfeldt explains this as a result of the market form pushing social values towards personal freedom, individuality, representation, and governmental accountability.


First connection:

I was reminded of psychologist Jonathan Haidt’s categorization of the different basic types of moral intuitions in The Righteous Mind. These are:

  • Care/Harm: Includes feelings like empathy and compassion. These intuitions are most triggered by experiences of vulnerable children, intense suffering and need, and cruelty.
  • Fairness/Cheating: Includes feelings of reciprocity, injustice, and equality. Triggered by others displaying cooperation or selfishness towards us.
  • Loyalty/Betrayal: Includes feelings of tribalism, unity and kinship. Triggered by involvement in tight groups.
  • Authority/Subversion: Includes feelings of respect for parents, teachers, rulers, and religious leaders, as well as the feelings that this respect is owed. Involved in hierarchical thinking and perceptions of dominance relations.
  • Sanctity/Degradation: Includes feelings of disgust, purity, cleanliness, dirtiness, sacredness, and corruption.
  • Liberty/Oppression: Includes feelings of individualism, freedom, and resentment towards being dominated or oppressed.

Different political ideologies line up very well with different “moral foundations profiles”. Liberals tend to care primarily about the first two categories, Libertarians the last, and Conservatives a roughly equal mix of all six. You can take a questionnaire to see your personal moral profile here.

These categories look like they map really nicely onto the TIMN model as organizing principles for the different forms. Here’s my speculation on how the different social forms engage and capitalize on the different types of intuitions:

Tribes: Loyalty/Betrayal

Institutions: Authority/Subversion

Markets: Liberty/Oppression

Networks: Care/Harm?

The natural next question is what types of social forms would have as organizing principles the values of Fairness/Cheating or Sanctity/Degradation.

Second connection:

Sociologist Robert Nisbet attempted to categorize the different basic patterns of social interactions. He gave five categories: cooperation, conflict, exchange, coercion and conformity. For some reason this categorization seemed very deep to me when I first heard it, and it has stuck with me ever since.

Cooperation involves coordination between individuals that have a shared goal, while exchange involves coordination between individuals that are each motivated by their own self-interest.

Conflict occurs when individuals work against each other, competing for a larger share of rewards, for instance. Coercion is the forced cooperation between individuals with different goals. And conformity involves behavior that matches group expectations.

These categories nicely match the types of social interactions that characterize the different social forms in the TIMN model.

Tribes are a social form dominated by conformity interactions. Identity is tightly bound up with tribal culture, lineage, and adherence to social norms involving mutual defense and aid and who can have kids with whom.

The structure of Institutions is quite clearly analogous to coercion, and Markets to exchange and conflict. And by Ronfeldt’s description, Networks seem to be analogous to cooperative interactions.

Third connection:

Scott Alexander makes the point that democracies have several unique features that set them apart from previous forms of government.

These features all arise from the fact that democracies answer questions of leadership succession by handing them to the people. This is a big deal, for two main reasons:

First, democracies put an upper bound on how terrible a leader can be.

Why? The basic justification is that while the people don’t get to select the absolute best choice for leadership, they do get to select against the worst choices.

(FPTP is terrible enough that I actually don’t know if this is in general true. But this is in contrast to monarchical forms of government, which involve no feedback from the population, so the point stands.)

When the king of a hereditary monarchy dies and the throne passes to his oldest son, there is no formally recognized way to guard against the possibility that the kid is literally the next Hitler. At best, the population can just try to throw him out when they’ve had enough and let whoever wins out in the resulting scramble for power take over.

Second, democracy provides a great Schelling point for leadership succession.

(A Schelling point is a decision that would be arbitrary except that it is made on the basis of an expectation that everybody else will make the same decision. So if you’re supposed to meet a stranger in NYC, and you don’t know where, you’ll choose to go to Grand Central Terminal, and so will they. Not because of any psychic communication between the two of you, nor any sort of official designation of Grand Central Terminal as the One True Stranger Meeting Spot, but because you each expect the other to be there. Thus Grand Central Terminal is a geographical Schelling point for NYC.)

The Schelling point for leadership succession in a hereditary monarchy is royal blood. Which is to say that when the leader dies, everybody looks for the person (usually the man) with the most royal blood, and installs them.

But who determines if somebody’s blood is truly royal? What do you do if some other family decides that they have the truly royal blood? What if two people have equally royal blood?

The Schelling point for leadership succession in a theocratic monarchy like Ancient Egypt is the Official Word Of God.

Who determines which individual God actually wants in charge? What if two people both claim that God chose them to rule?

The problem is that these legitimacy claims are founded on fictions. There is no quality of royal-ness to blood, and there is no God to choose rulers. In a democracy, by contrast, the Schelling point for succession is a real thing that is easily verifiable: the popular vote.

Everybody agrees who the correct leader is, because everybody can just look at the election results. And if somebody disagrees on who the correct leader is, then they have a clear action to take: mobilize voters to change their mind by the next election.

Thus democracy plays the dual role of ending succession squabbles and providing a natural pressure valve for those dissatisfied with the current leader.

These differences in structure seem really significant. I think that I would want to break apart Ronfeldt’s Institution category and replace it with two social forms: the Hierarchy and the Democracy.

A Hierarchy would be a social structure in which there is a strict top-down system of authority, and where the population at large does not have a formal role in determining who makes it at the top.

A Democracy also has a top-down system of power, but adds a formal mechanism for feedback from the population to the top levels of power (e.g. an election). (I’d like a word for this that does not have as political a connotation, but failed to think of one.)


The TIMN framework naturally leads to a story of the gradual progress of humans in our joint project of perfecting civilization. At each stage in history, new social structures arise to fix the failings of the old, and in this way forward-progress is made.

Overall, I think that the framework offers a potentially useful way of assessing different political and economic systems, by looking at the ways in which they utilize the strengths of these four structures and how they fall victim to the weaknesses.


Race, Ethnicity, and Labels

(This post is me becoming curious about the variety of different opinions on racial labels, spending far too many hours researching the topic, and writing up what I find.)

One thing that I find interesting is that basically every minority ethnic and racial group in the United States has constantly dealt with terminological disputes about their proper group name.

One possible explanation for this constant turn-over was given by disability rights activist Evan Kemp, who wrote:

As long as a group is ostracized or otherwise demeaned, whatever name is used to designate that group will eventually take on a demeaning flavor and have to be replaced. The designation will keep changing every generation or so until the group is integrated into society. Whatever name is in vogue at the point of social acceptance will be the lasting one.

If this is the right explanation, then maybe we’d be able to measure the relative degrees of discrimination faced by different groups on the basis of their ‘terminological velocity’ – how quick a turnover the name for their group has.

Regardless, looking into these issues revealed a bunch of interesting history and weird trivia. So here goes!


Native American vs American Indian

A 1995 Census Bureau survey of American Indians found that 49% preferred the term ‘American Indian’ and 37% preferred ‘Native American’. I couldn’t find any more recent polls on this question.

This may seem unusual if you don’t know much about American Indian culture and history. It’s a bit confusing to me; as somebody with a parent born in India, I’m pretty sure that I’m an American Indian.

Why is a term that derives from the geographical error of early European colonists the most favored of all available terms? And why not ‘Native American’? From an outside perspective, ‘Native American’ feels like a respectful term, one that pays homage to the history of American Indians as the original residents of the Americas.

It turns out the answer to these questions comes from a quick look at the history of these terms, which is super fascinating.

‘Native American’ was a term originally used by WASPs in the 1850s to differentiate themselves from Catholic Irish and German immigrants. The anti-immigrant Know-Nothing Party, whose supporters were known for violent riots in Catholic neighborhoods, burning down churches, and tarring and feathering of Catholic priests, was originally known as the Native American Party.

The term fell out of use for a century upon the rise of the anti-slavery movement and subsequent collapse of the Know-Nothings. This time gap probably indicates that the early usage of the term has little current relevance to associations with the term, but I included it anyway. I find it darkly amusing to imagine white anti-Catholic nativists running around calling themselves Native Americans.

The term ‘Native American’ was revived in the civil rights era by anthropologists eager for historical accuracy and disassociation from the negative stereotypes associated with ‘Indian’. This was adopted widely by government agencies, and apparently in doing so picked up a negative connotation.

Prominent Lakota activist Russell Means described the term as “a generic government term used to describe all the indigenous prisoners of the United States.” Some American Indians emphasize a sense of lack of ownership over the term, and feel that it was a “colonial term” given to them by outsiders.

‘American Indian’ is apparently more widely favored. Widespread acceptance of this term dates back to 1968 and the rise of the American Indian Movement (AIM). At a UN conference in 1977, AIM’s International Indian Treaty Council urged collective identification of American Indians with the term.

One argument made for the term is that while the names of other races in America have ‘American’ as their second word (e.g. ‘Asian American’, ‘Arab American’), ‘American Indian’ would have American as its first word, giving American Indians a special distinction. I’m serious, this was a real argument.

‘American Indian’ is etymologically close to ‘Indian’, which dates back to early European colonists that systematically drove American Indian populations out of their homes. Some note derogatory stereotypes from old Western movies associated with ‘cowboys and Indians’, and feel that the association carries over to ‘American Indian’.

Other American Indians say that they would prefer to be identified by their specific tribal nation, feeling that terms like ‘Native American’ and ‘American Indian’ lump all tribes together and ignore important differences in heritage. The problem with this is that there are 562 federally recognized distinct tribes, making this cognitively unfeasible. It’s also just useful to have a term to talk about these tribes in the aggregate.

Interestingly, when I was researching this, I found a Washington Post poll in 2016 that reported that 73% of American Indians felt that the word ‘Redskin’ was not disrespectful, and 80% would not be offended if referred to as a Redskin. A 2004 poll found similar results, with 90% of American Indians saying that the name of the Washington Redskins didn’t bother them. This is significantly more than the percentage of all Americans that don’t find the name offensive, which is around 68%.

I tried to find good arguments against these poll results, and could only find some groundless conspiracy theories suggesting the polls had been infiltrated by white people claiming to be American Indians. In the absence of alternative explanations, I really don’t know what to make of this, besides that it suggests a complete disconnect between American Indian activists and the general American Indian population.

Black vs African American

The 2010 United States Census included “Black, African Am., or Negro” as one of its racial identifications. In response to many complaints, and to black Americans refusing to select the term, the Census Bureau has since switched to the shorter ‘Black or African American’.

Something that caught my eye was their explanation of this choice, which was that apparently previous research had shown that if polls didn’t allow self-identification as ‘Negro’, a significant number of older African Americans would take the time to write it in under the ‘some other race’ category.

The term ‘Negro’ became popular in the 1920s as a polite term to replace ‘Colored’, which was in turn originally a polite alternative to ‘Nigger’ in the 1900s. An actual argument made for adopting ‘Negro’ was that it was easier to pluralize than ‘Colored’, which required the addition of another word (‘Negroes’ vs ‘Colored people’). Bizarre, but okay!

In 1890, the US Census used a four-way classification: ‘Black’ for those with at least 3/4 black blood, ‘mulatto’ for 3/8 to 5/8, ‘quadroon’ for 1/4, and ‘octoroon’ for 1/8. Unsurprisingly, this did not catch on.

‘Negro’ was simpler, and quickly became the politically correct and respectful term, used by black leaders like Booker T Washington, Marcus Garvey, W.E.B. Du Bois, and later Martin Luther King Jr. Many black organizations replaced ‘Colored’ in their title with ‘Negro’, with the notable exception of the NAACP.

During the civil rights era, radical and militant black organizations began to attack the term, claiming that it was associated with the history of slavery and racism. ‘Black’ became a term that identified you with radical progressive blacks (think of slogans like ‘Black Power’ and ‘Black is beautiful’), while ‘Negro’ was associated with the status quo and the old guard.

The last US president to use the term ‘Negro’ was Lyndon Johnson, and by 1980 there was a large majority of African Americans in favor of ‘Black’. And of course, in modern times the term ‘Negro’ is commonly perceived as a racial slur. Obama banned the term from usage in federal law in 2016.

Meanwhile ‘Black’ became the standard term employed in surveys and used by black organizations, and having gained popular acceptance, lost its radical connections.

(Quick aside: This looks to me like an instance of what’s called semantic bleaching, where a word weakens in meaning as it increases in usage. My favorite example of this is the phrase ‘God be with you’, which over the years lost its religious connotation and became… ‘goodbye’!)

This lasted until around 1990, when Jesse Jackson announced that ‘Black’ was a term disconnected from cultural heritage, and declared a switch to ‘African American’.

While some organizations changed their names and declared their support for ‘African American’, this didn’t gather the same level of universal acceptance as ‘Black’ had in the 1960s, or indeed ‘Negro’ in the 1900s. The 1995 Census found that 44% of Black Americans still preferred ‘Black’, and only 28% preferred ‘African American’. Some argued that modern African Americans have created a culture that is not tied to Africa, and indeed that there is no coherent concept of a ‘single African culture’.

One paper I read attributed Jackson’s lack of success in making ‘African American’ the universally used term to a missing confrontational intensity that existed in the Black Power movement. For instance, when Malcolm X and other radical black activists challenged the term ‘Negro’, they attacked it harshly and made its usage a social taboo.

Jackson may have lacked the political power to sufficiently mobilize Black Americans. A 2007 Gallup poll found that 61% of Black Americans didn’t care about what term they were described by, reflecting a high level of apathy towards his cause. A 2005 paper found that Black Americans were nearly equally divided between the two.

Currently there’s an uneasy shifting balance between these two terms, where both are acceptable, though sometimes one becomes more acceptable than the other. In my personal experience, I recall a several-year period where I perceived that the term “Black” was becoming increasingly politically incorrect. I later had (and currently have) a sense that this political incorrectness around the term had backed off, keeping it in public acceptance.

Hispanic vs Latino

Americans who trace their roots to Spanish-speaking countries were grouped together by the US government under the umbrella term ‘Hispanic’ in the 1970s. ‘Latino’ later became popular as well, and was first included in the 2000 Census. These terms are defined as synonyms by the U.S. Census Bureau.

Polls indicate that around half of Latinos don’t like either term, and prefer to be identified with their country of origin. When forced to choose, more than twice as many prefer ‘Hispanic’ over ‘Latino’. (Interestingly, Latino friends of mine tell me that they and their Latino friends and family overwhelmingly prefer ‘Latino’ over ‘Hispanic’, which points to some sort of selection bias around me that I don’t understand.)

The federal government officially defines ‘Latino’ not as a race, but as an ethnicity. Latinos apparently disagree – 56% claim that it is both a race and an ethnicity, and 11% that it is a race. Only 19% agree with the official definition!

Both terms ‘Latino’ and ‘Hispanic’ are fairly unique to the United States. Terms that arose from Latino social movements like ‘Chicano’ have never won out among Latinos. This might be in part because of the lack of a strong shared identity – about 70% of Latinos think that there is not a common culture between American Latinos, and instead see a loose group composed of many individual cultures. There’s also a relevant lack of widely-known Latino activists and clear representatives of Latino people to champion these terms.

An older attempt to de-gender the term ‘Latino’ is ‘Latin@’, which dates to the 1990s. This was apparently not inclusive enough, as the ‘@’ represents only ‘o’ and ‘a’ and not those who identify with neither. More recently, social justice activists have tried to encourage the adoption of the term ‘Latinx’. This term breaks with the gendered nature of the Spanish language and hardly rolls off the tongue, but has become relatively popular with LGBT activists.

Asian American vs Oriental

The term ‘Oriental’ was prohibited in the same bill in which Obama prohibited the use of the term ‘Negro’ in federal documents. There is a fairly strong consensus at this point that ‘Asian American’ is the appropriate term (though there remains some academic debate about this term).

‘Oriental’ is an old old term, dating back to the late Roman Empire. Over its history, the geographical region it referred to shifted constantly eastward (ad orientalem), from Morocco (yes, at some point it might have been proper to refer to Moroccans as Oriental!) to Egypt and the Levant to India and finally to East and Southeast Asia by the mid-1900s.

The term picked up baggage in the U.S. during the racist campaigns against Asian Americans in the late 1800s and early 1900s, and by now is fairly universally considered a pejorative term.

It was replaced by the term ‘Asian American’, which began to enter into popular use in the 1960s. The US Census definition of ‘Asian American’ still includes Indians, which feels really really wrong to me. I tried and failed to find public opinion polls on how many people feel comfortable with the term ‘Asian’ being applied to Indians.

And others…

The terminological situation of the Roma people is uniquely terrible. They are mostly referred to by the pejorative term ‘Gypsy’, which is essentially synonymous with ‘dangerous thieving wanderer’. The term ‘gypped’, meaning cheated or swindled, also has its origins in this term. They are also commonly referred to by the term ‘Tigan’, another pejorative term that derives from the Greek word for ‘untouchable’.

In a 2013 BBC TV interview, former Romanian prime minister Victor Ponta took care to distinguish Romanians from the Roma, noting that Romanians want to distance themselves from the Roma due to the negative connotations of the similar term.

And in 2010, the Romanian government supported a constitutional amendment legally renaming the Roma to the pejorative ‘Tigan’. (The law was later rejected by the Romanian Senate.) Another such amendment was proposed in 2013, this time hoping to ban the self-identification of Roma in Romania as Romanians.

Jewish people are also in an unusual terminological situation. The term ‘Israelite’ was apparently commonly used until the 1948 formation of Israel. While ‘Jew’ is the only remaining commonly used term, there are problems with it. From The American Heritage Dictionary:

It is widely recognized that the attributive use of the word Jew, in phrases such as Jew lawyer or Jew ethics, is both vulgar and highly offensive. In such contexts Jewish is the only acceptable possibility. Some people, however, have become so wary of this construction that they have extended the stigma to any use of Jew as a noun, a practice that carries risks of its own. In a sentence such as There are several Jews on the council, which is unobjectionable, the substitution of a circumlocution like Jewish people or persons of Jewish background may in itself cause offense for seeming to imply that Jew has a negative connotation when used as a noun.


All in all, it looks like a really complicated mixture of factors ends up determining how this part of the language evolves.

On the one hand there are syntactic features (like ‘American Indian’ having ‘Indian’ on the right as opposed to the standard left, or ‘Colored’ having a complicated pluralization compared to ‘Negro’).

And on the other hand there are semantic features like the ancient and automatic negative associations with words like ‘dark’ and ‘black’, or the colonial associations tied to the term ‘Indian’.

There are contemporary factors like the existence of a strong shared racial/ethnic identity, the presence of a charismatic racial/ethnic leader, and whether or not the introducer of a new term for a group is an insider or outsider to the group.

Then there are phenomena like semantic bleaching, whereby terms that enter common use have their meaning diluted and weakened, and concept creep, whereby words change their meaning over long stretches of history by altered patterns of usage.

And finally there are longer-term historical effects like the gradual inundation of language with dark undertones over decades of racism and discriminatory treatment.

Is quantum mechanics simpler than classical physics?

I want to make a few very fundamental comparisons between classical and quantum mechanics. I’ll be assuming a lot of background in this particular post to prevent it from getting uncontrollably long, but am planning on writing a series on quantum mechanics at some point.


Let’s assume that the universe consists of N simple point particles (where N is an ungodly large number), each interacting with each other in complicated ways according to their relative positions. These positions are written as x1, x2, …, xN.

The classical description for this simple universe makes each position a function of time, and gives the following set of N equations of motion, one for each particle:

Fk(x1, x2, …, xN) = mk · ∂t²xk

Each force function Fk will be a horribly messy nonlinear function of the positions of all the particles in the universe. These functions encode the details of all of the interactions taking place between the particles.

Analytically solving this equation is completely hopeless: it's a set of N coupled equations, each one a highly nonlinear second order differential equation. You couldn't solve any of them on its own, and on top of that, they are tightly coupled together, making it impossible to solve any one without also solving all the others.

So if you thought that Newton’s equation F = ma was simple, think again!

Compare this to how quantum mechanics describes our universe. The state of the universe is described by a function Ψ(x1, x2, …, xN, t). This function changes over time according to the Schrödinger equation:

∂tΨ = -i·H[Ψ]

H is a differential operator that is a complicated function of all of the positions of all the particles in the universe. It encodes the information about particle interactions in the same way that the force functions did in classical mechanics.

I claim that Schrodinger's equation is infinitely easier to solve than Newton's. In fact, by the end of this post I will write out the exact solution for the wave function of the entire universe.

At first glance, you can notice a few features of the equation that make it look potentially simpler than the classical equation. For one, there’s only one single equation, instead of N entangled equations.

Also, the equation is only first order in time derivatives, while Newton's equation is second order in time derivatives. This is extremely important: the drop from a second order differential equation to a first order one is a huge deal. For one thing, there's a simple general solution to every first order linear differential equation, and nothing comparable for second order ones.

Unfortunately… Schrodinger's equation, just like Newton's, looks hopelessly complicated, because of the presence of H. If we can't find a way to tame this immensely complex operator, then we're probably stuck.

But quantum mechanics hands us exactly what we need: two magical facts about the universe that allow us to turn Schrodinger’s equation into a linear first-order differential equation.

First: it guarantees us that there exists a set of functions φE(x1, x2, …, xN) such that:

H[φE] = E · φE

E is an ordinary real number, and its physical meaning is the energy of the entire universe. The set of values of E is the set of allowed energies for the universe. And the functions φE(x1, x2, …, xN) are the wave functions that correspond to each allowed energy.

Second: it tells us that no matter what complicated state our universe is in, we can express it as a weighted sum over these functions:

Ψ = ∑ aE · φE

With these two facts, we’re basically omniscient.

Since Ψ is a sum of all the different functions φE, if we want to know how Ψ changes with time, we can just see how each φE changes with time.

How does each φE change with time? We just use the Schrodinger equation:

∂tφE = -i · H[φE]
     = -iE · φE

And we end up with a first order linear differential equation. We can write down the solution right away:

φE(x1, x2, …, xN, t) = φE(x1, x2, …, xN) · e^(-iEt)

And just like that, we can write down the wave function of the entire universe:

Ψ(x1, x2, …, xN, t) = ∑ aE · φE(x1, x2, …, xN, t)
                    = ∑ aE · φE(x1, x2, …, xN) · e^(-iEt)

Hand me the initial conditions of the universe, and I can hand you back its exact and complete future according to quantum mechanics.
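
To see the recipe in action, here's a minimal sketch in Python for a hypothetical two-state "universe": the Hamiltonian, energies, and energy wave functions below are made-up stand-ins for the real H, E, and φE, but the three steps (find the φE, expand Ψ(0), attach phase factors) are exactly the ones described above.

```python
import cmath

# Toy two-state "universe" (a made-up stand-in for the real H):
# H = [[0, 1], [1, 0]], whose energies and energy eigenstates are
# known in closed form. They play the role of E and phi_E above.
E_plus, E_minus = 1.0, -1.0
phi_plus = [1 / 2**0.5, 1 / 2**0.5]
phi_minus = [1 / 2**0.5, -1 / 2**0.5]

# Initial condition Psi(0) = [1, 0], expanded in the energy basis.
# The eigenvectors are real, so a plain dot product suffices.
psi0 = [1.0, 0.0]
a_plus = sum(p * c for p, c in zip(phi_plus, psi0))
a_minus = sum(p * c for p, c in zip(phi_minus, psi0))

def psi(t):
    """Exact future: each energy component just picks up a phase e^(-iEt)."""
    return [a_plus * cmath.exp(-1j * E_plus * t) * phi_plus[k]
            + a_minus * cmath.exp(-1j * E_minus * t) * phi_minus[k]
            for k in range(2)]

# The magnitude of each energy weight never changes with time.
for t in (0.0, 1.0, 5.0):
    w = a_plus * cmath.exp(-1j * E_plus * t)
    assert abs(abs(w) - abs(a_plus)) < 1e-12
```

Only the first step (diagonalizing H) was easy here because the universe had two states; that is precisely the step that becomes impossible for the real H.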


Okay, I cheated a little bit. You might have guessed that writing out the exact wave function of the entire universe is not actually doable in a short blog post. The problem can’t be that simple.

But at the same time, everything I said above is actually true, and the final equation I presented really is the correct wave function of the universe. So if the problem must be more complex, where is the complexity hidden away?

The answer is that the complexity is hidden away in the first “magical fact” about allowed energy states.

H[φE] = E · φE

This is an eigenvalue problem for a monstrously complicated second-order differential operator in the positions of every particle in the universe. If we actually wanted to expand out Ψ in terms of the different functions φE, we'd have to solve this equation.

So there is no free lunch here. But what’s interesting is where the complexity moves when switching from classical mechanics to quantum mechanics.

In classical mechanics, virtually zero effort goes into formalizing the space of states, or talking about what configurations of the universe are allowable. All of the hardness of the problem of solving the laws of physics is packed into the dynamics. That is, it is easy to specify an initial condition of the universe. But describing how that initial condition evolves forward in time is virtually impossible.

By contrast, in quantum mechanics, solving the equation of motion is trivially easy. And all of the complexity has moved to defining the system. If somebody hands you the allowed energy levels and energy functions of the universe at a given moment of time, you can solve the future of the rest of the universe immediately. But actually finding the allowed energy levels and corresponding wave functions is virtually impossible.


Let’s get to the strangest (and my favorite) part of this.

If quantum mechanics is an accurate description of the world, then the following must be true:

Ψ(x1, x2, …, xN, 0) = ∑ aE · φE(x1, x2, …, xN)
Ψ(x1, x2, …, xN, t) = ∑ aE · φE(x1, x2, …, xN) · e^(-iEt)

This equation has two especially interesting features. First, each term in the sum factors into a function of position multiplied by a function of time.

And second, the temporal component of each term is a complex exponential – a phase factor e^(-iEt).

Let me take a second to explain the significance of this.

In quantum mechanics, physical quantities are invariably found by taking the absolute square of complex quantities. This is why you can have a complex wave function and an equation of motion with an i in it, and still end up with a universe quite free of imaginary numbers.

But when you take the absolute square of e^(-iEt), you end up with e^(-iEt) · e^(iEt) = 1. What's important here is that the time dependence seems to fall away.

A way to see this is to notice that y = e^(-ix), for any real x, is a point on the unit circle in the complex plane.


So e^(-iEt), as t advances, is just a point repeatedly spinning around the unit circle. The larger E is, the faster it spins.

Taking the absolute square of a complex number is the same as finding the square of its distance from the origin on the complex plane. And since e^(-iEt) always stays on the unit circle, its absolute square is always 1.
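
A quick numerical sanity check of this claim, in plain Python:

```python
import cmath

# |e^(-iEt)| = 1 for any real energy and time: the phase spins around
# the unit circle, but its distance from the origin never changes.
for E in (0.5, 3.0, 100.0):
    for t in (0.0, 1.7, 42.0):
        z = cmath.exp(-1j * E * t)
        assert abs(abs(z) - 1.0) < 1e-12
```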

So what this all means is that quantum mechanics tells us that there’s a sense in which our universe is remarkably static. The universe starts off as a superposition of a bunch of possible energy states, each with a particular weight. And it ends up as a sum over the same energy states, with weights of the exact same magnitude, just pointing different directions in the complex plane.

Imagine drawing the universe by drawing out all possible energy states in boxes, and shading these boxes according to how much amplitude is distributed in them. Now we advance time forward by one millisecond. What happens?

Absolutely nothing, according to quantum mechanics. The distribution of shading across the boxes stays the exact same, because the phase factor multiplication does not change the magnitude of the amplitude in each box.

Given this, we are faced with a bizarre question: if quantum mechanics tells us that the universe is static in this particular way, then why do we see so much change and motion and excitement all around us?

I’ll stop here for you to puzzle over, but I’ve posted an answer here.

Iterated Simpson’s Paradox

Previous: Simpson’s paradox

In the last post, we saw how statistical reasoning can go awry in Simpson’s paradox, and how causal reasoning can rescue us. In this post, we’ll be generalizing the idea behind the paradox and producing arbitrarily complex versions of it.

The main idea behind Simpson’s paradox is that conditioning on an extra variable can sometimes reverse dependencies.

In our example in the last post, we saw that one treatment for kidney stones worked better than another, until we conditioned on the kidney stone’s size. Upon conditioning, the sign of the dependence between treatment and recovery changed, so that the first treatment now looked like it was less effective than the other.

We explained this as a result of a spurious correlation, which we represented with ‘paths of dependence’ like so:


But we can do better than just one reversal! With our understanding of causal models, we are able to generate new reversals by introducing appropriate new variables to condition upon.

Our toy model for this will be a population of sick people, some given a drug and some not (D), and some who recover and some who do not (R). If there are no spurious correlations between D and R, then our diagram is simply:

[Diagram: D → R]

Now suppose that we introduce a spurious correlation via wealth (W). Wealthy people are more likely to get the drug (let's say that this occurs through a causal intermediary of education level E), and are more likely to recover (we'll suppose that this occurs through a causal intermediary of nutrition level of diet N).

Now we have the following diagram:

[Diagram: D → R, plus W → E → D and W → N → R]

Where there was only previously one path of dependency between D and R, there is now a second. This means that if we observe W, we break the spurious dependency between D and R, and retain the true causal dependence.

[Diagrams: both paths of dependency between D and R; the spurious path broken by conditioning on W]

This allows us one possible Simpson’s paradox: by conditioning upon W, we can change the direction of the dependence between D and R.
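
Here's a numerical sketch of that single reversal, with made-up probabilities, and with the intermediaries E and N collapsed away for brevity:

```python
# One Simpson reversal from a toy model (made-up numbers): wealth W raises
# both the chance of getting the drug D and the chance of recovery R,
# while the drug itself is causally harmful.
p_W = 0.5
p_D_given_W = {True: 0.9, False: 0.1}
p_R_given = {(True, True): 0.8, (True, False): 0.9,   # (W, D) -> P(R)
             (False, True): 0.2, (False, False): 0.3}

def p_recov_given(d):
    """P(R | D = d), marginalizing the confounder W away."""
    num = den = 0.0
    for w in (True, False):
        pw = p_W if w else 1 - p_W
        pd = p_D_given_W[w] if d else 1 - p_D_given_W[w]
        num += pw * pd * p_R_given[(w, d)]
        den += pw * pd
    return num / den

# Unconditionally, drug-takers recover more often (0.74 vs 0.36)...
assert p_recov_given(True) > p_recov_given(False)
# ...but conditional on each wealth level, the drug lowers recovery.
for w in (True, False):
    assert p_R_given[(w, True)] < p_R_given[(w, False)]
```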

But we can do better! Suppose that your education level causally influences your nutrition. This means that we now have three paths of dependency between D and R. This allows us to cause two reversals in dependency: first by conditioning on W and second by conditioning on N.

[Diagrams: all three paths between D and R; the top path broken by conditioning on W; the remaining spurious path broken by conditioning on N]

And we can keep going! Suppose that education does not cause nutrition, but both education and nutrition causally impact IQ. Now we have three possible reversals. First we condition on W, blocking the top path. Next we condition on I, creating a dependence between E and N (via explaining away). And finally, we condition on N, blocking the path we just opened. Now, to discern the true causal relationship between the drug and recovery, we have two choices: condition on W, or condition on all three W, I, and N.

[Diagrams: all paths between D and R; conditioned on W; conditioned on W and I; conditioned on W, I, and N]

As might be becoming clear, we can do this arbitrarily many times. For example, here’s a five-step iterated Simpson paradox set-up:

[Diagram: a five-step iterated Simpson's paradox set-up]

The direction of dependence switches when you condition on, in this order: A, X, B’, X’, C’. You can trace out the different paths to see how this happens.

Part of the reason that I wanted to talk about the iterated Simpson’s paradox is to show off the power of causal modeling. Imagine that somebody hands you data that indicates that a drug is helpful in the whole population, harmful when you split the population up by wealth levels, helpful when you split it into wealth-IQ classes, and harmful when you split it into wealth-IQ-education classes.

How would you interpret this data? Causal modeling allows you to answer such questions by simply drawing a few diagrams!

Next we’ll move into one of the most significant parts of causal modeling – causal decision theory.

Previous: Simpson’s paradox

Next: Causal decision theory

Causal decision theory

Previous: Iterated Simpson’s Paradox

We’ll now move on into slightly new intellectual territory, that of decision theory.

While what we’ve previously discussed all had to do with questions about the probabilities of events and causal relationships between variables, we will now discuss questions about what the best decision to make in a given context is.


Decision theory has two ingredients. The first is a probabilistic model of different possible events that allows an agent to answer questions like “What is the probability that A happens if I do B?” This is, roughly speaking, the agent’s beliefs about the world.

The second ingredient is a utility function U over possible states of the world. This function takes in propositions, and returns the value to a particular agent of that proposition being true. This represents the agent’s values.

So, for instance, if A = "I win a million dollars" and B = "Somebody cuts my ear off", U(A) will be a large positive number, and U(B) will be a large negative number. To propositions that an agent feels neutral or apathetic about, the utility function assigns a value of 0.

Different decision theories represent different ways of combining a utility function with a probability distribution over world states. Said more intuitively, decision theories are prescriptions for combining your beliefs and your values in order to yield decisions.

A proposition that all competing decision theories agree on is “You should act to maximize your expected utility.” The difference between these different theories, then, is how they think that expected utility should be calculated.

“But this is simple!” you might think. “Simply sum over the value of each consequence, and weight each by its likelihood given a particular action! This will be the expected utility of that action.”

This prescription can be written out as follows:

EU(A) = ∑C U(C) * P(C | A & K)

Here A is an action, C is the index for the different possible world states that you could end up in, and K is the conjunction of all of your background knowledge.


While this is quite intuitive, it runs into problems. For instance, suppose that scientists discover a gene G that causes both a greater chance of smoking (S) and a greater chance of developing cancer (C). In addition, suppose that smoking is known to not cause cancer.

[Diagram: G → S; G → C]

The question is, if you slightly prefer to smoke, then should you do so?

The most common response is that yes, you should do so. Either you have the cancer-causing gene or you don’t. If you do have the gene, then you’re already likely to develop cancer, and smoking won’t do anything to increase that chance.

And if you don’t have the gene, then you already probably won’t develop cancer, and smoking again doesn’t make it any more likely. So regardless of if you have the gene or not, smoking does not affect your chances of getting cancer. All it does is give you the little utility boost of getting to smoke.

But our expected utility formula given above disagrees. It sees that you are almost certain to get cancer if you smoke, and almost certain not to if you don’t. And this means that the expected utility of smoking includes the utility of cancer, which we’ll suppose to be massively negative.

Let’s do the calculation explicitly:

EU(S) = U(C & S) * P(C | S) + U(~C & S) * P(~C | S)
      ≈ U(C & S) << 0
EU(~S) = U(~S & C) * P(C | ~S) + U(~S & ~C) * P(~C | ~S)
      ≈ U(~S & ~C) ≈ 0

Therefore we find that EU(~S) >> EU(S), so our expected utility formula will tell us to avoid smoking.
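
To make this concrete, here's a sketch with hypothetical numbers: a rare gene that strongly promotes both smoking and cancer, with smoking itself causally inert, evaluated by the evidential formula.

```python
# Hypothetical numbers for the smoking-lesion world: the gene G is rare,
# but makes both smoking and cancer much more likely. Smoking itself
# has no causal effect on cancer.
p_G = 0.1
p_S_given_G = {True: 0.95, False: 0.05}   # P(smoke | G)
p_C_given_G = {True: 0.95, False: 0.05}   # P(cancer | G), independent of S

def P_C_given_S(smoke):
    """P(C | S): observing smoking is evidence about G, hence about cancer."""
    p_s = p_sc = 0.0
    for g in (True, False):
        pg = p_G if g else 1 - p_G
        ps = p_S_given_G[g] if smoke else 1 - p_S_given_G[g]
        p_s += pg * ps
        p_sc += pg * ps * p_C_given_G[g]
    return p_sc / p_s

u_smoke, u_cancer = 1.0, -1000.0
EU = {s: u_smoke * s + u_cancer * P_C_given_S(bool(s)) for s in (1, 0)}

# Evidentially, smoking "raises" cancer risk, so this formula says don't smoke.
assert P_C_given_S(True) > P_C_given_S(False)
assert EU[0] > EU[1]
```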

The problem here is evidently that the expected utility function is taking into account not just the causal effects of your actions, but the spurious correlations as well.

The standard way that decision theory deals with this is to modify the expected utility function, switching from ordinary conditional probabilities to causal conditional probabilities.

EU(A) = ∑C U(C) * P(C | do A & K)

You can calculate these causal conditional probabilities by intervening on S, which corresponds to removing all its incoming arrows.

[Diagram: G → C; S with its incoming arrow from G removed]

Now our expected utility function exactly mirrors our earlier argument – whether or not we smoke has no impact on our chance of getting cancer, so we might as well smoke.

Calculating this explicitly:

EU(S) = U(S & C) * P(C | do S) + U(S & ~C) * P(~C | do S)
      = U(S & C) * P(C) + U(S & ~C) * P(~C)
EU(~S) = U(~S & C) * P(C | do ~S) + U(~S & ~C) * P(~C | do ~S)
      = U(~S & C) * P(C) + U(~S & ~C) * P(~C)

Looking closely at these values, we can see that EU(S) must be greater than EU(~S), regardless of the value of P(C).
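
The same conclusion, checked numerically with made-up utilities (the exact values don't matter):

```python
# Toy utilities (hypothetical numbers): smoking adds a small bonus u_smoke,
# cancer costs u_cancer; the two contributions are additive.
u_smoke, u_cancer = 1.0, -1000.0

def U(smoke, cancer):
    return (u_smoke if smoke else 0.0) + (u_cancer if cancer else 0.0)

def EU_cdt(smoke, p_cancer):
    # P(C | do S) = P(C): intervening on smoking can't change cancer risk.
    return U(smoke, True) * p_cancer + U(smoke, False) * (1 - p_cancer)

# Smoking wins by exactly u_smoke, no matter the base rate of cancer.
for p in (0.01, 0.5, 0.99):
    assert abs(EU_cdt(True, p) - EU_cdt(False, p) - u_smoke) < 1e-12
```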


The first expected utility formula that we wrote down represents the branch of decision theory called evidential decision theory. The second is what is called causal decision theory.

We can roughly describe the difference between them like this: evidential decision theory evaluates your decisions as if observing them from the outside, as evidence about the world, while causal decision theory evaluates your decisions as interventions that you impose on the world.

EDT treats your decisions as just another event out in the world, while CDT treats your decisions like causal interventions.

Perhaps you think that the choice between these is obvious. But Newcomb's problem is a famous thought experiment that splits people along these lines and challenges both theories. I've written about it here, but for now will leave decision theory for new topics.

Previous: Iterated Simpson’s Paradox

Next: Causality for philosophers

Free will and decision theory

This post is about one of the things that I’ve been recently feeling confused about.

In a previous post, I described different decision theories as different algorithms for calculating expected utility. So for instance, the difference between an evidential decision theorist and a causal decision theorist can be expressed in the following way:


What I am confused about is that each decision theory involves a choice to designate some variables in the universe as “actions”, and all the others as “consequences.” I’m having trouble making a principled rule that tells us why some things can be considered actions and others not, without resorting to free will talk.

So for example, consider the following setup:

There’s a gene G in some humans that causes them to have strong desires for candy (D). This gene also causes low blood sugar (B) via a separate mechanism. Eating lots of candy (E) causes increased blood sugar. And finally, people have self-control (S), which help them not eat candy, even if they really desire it.

We can represent all of these relationships in the following diagram.

[Diagram: G → D; G → B; D → E; E → B; S → E]

Now we can compare how EDT and CDT will decide on what to do.

If an EDT agent compares the expected utility of eating candy with that of not eating it, they'll find both a negative dependence (eating candy makes a low blood sugar less likely) and a positive dependence (eating candy makes it more likely that you have the gene, which makes it more likely that you have a low blood sugar).

Let’s suppose that the positive dependence outweighs the negative one, so that EDT ends up seeing that eating candy makes it overall more likely that you have a low blood sugar.

P(B | E) > P(B)

What does the CDT agent calculate? They look at the causal conditional probability P(B | do E). In other words, they calculate their probabilities according to the following diagram.

[Diagram: G → D; G → B; E → B, with E's incoming arrows removed]

Now they’ll see only a single dependence between eating candy (E) and having a low blood sugar (B) – the direct causal dependence. Thus, they end up thinking that eating candy makes them less likely to have a low blood sugar.

P(B | do E) < P(B)

This difference in how they calculate probabilities may lead them to behave differently. So, for instance, if they both value having a low blood sugar much more than eating candy, then the evidential decision theorist will eat the candy, and the causal decision theorist will not.

Okay, fine. This all makes sense. The problem with this is, both of them decided to make their decision on the basis of what value of E maximizes expected utility. But this was not their only choice!

They could instead have said, “Look, whether or not I actually eat the candy is not under my direct control. That is, the actual movement of my hand to the candy bar and the subsequent chewing and swallowing. What I’m controlling in this process is my brain state before and as I decide to eat the candy. In other words, what I can directly vary is the value of S – whether or not the self-controlled part of my mind tells me to eat the candy or not. The value of E that ends up actually obtaining is then a result of my choice of the value of S.”

If they had thought this way, then instead of calculating EU(E) and EU(~E), they would calculate EU(S) and EU(~S), and go with whichever one maximizes expected utility.

But now we get a different answer than before!

In particular, CDT and EDT are now looking at the same diagram, because when the causal decision theorist intervenes on the value of S, there are no causal arrows for them to break. This means that they calculate the same probabilities.

P(B | S) = P(B | do S)

And thus get the same expected utility values, resulting in them behaving the same way.

Furthermore, somebody else might argue “No, don’t be silly. We don’t only have control over S, we have control over both S, and E.” This corresponds to varying both S and E in our expected utility calculation, and choosing the optimal values. That is, they choose the actions that correspond to the max of the set { EU(S, E), EU(S, ~E), EU(~S, E), EU(~S, ~E) }.

Another person might say “Yes, I’m in control of S. But I’m also in control of D! That is, if I try really hard, I can make myself not desire things that I previously desired.” This person will vary S and D, and choose that which optimizes expected utility.

Another person will claim that they are in control of S, D, and E, and their algorithm will look at all eight combinations of these three values.

Somebody else might say that they have partial control over D. Another person might claim that they can mentally affect their blood sugar levels, so that B should be directly included in their set of “actions” that they use to calculate EU!

And all of these people will, in general, get different answers.


Some of these possible choices of the “set of actions” are clearly wrong. For instance, a person that says that they can by introspection change the value of G, editing out the gene in all of their cells, is deluded.

But I’m not sure how to make a principled judgment as to whether or not a person should calculate expected utilities varying S and D, varying just S, varying just E, and other plausible choices.

What’s worse, I’m not exactly sure how to rigorously justify why some variables are “plausible choices” for actions, and others not.

What’s even worse, when I try to make these types of principled judgments, my thinking naturally seems to end up relying on free-will-type ideas. So we want to say that we are actually in control of S, and in a sense we can’t really freely choose the value of D, because it is determined by our genes.

But if we extend this reasoning to its extreme conclusion, we end up saying that we can’t control any of the values of the variables, as they are all the determined results of factors that are out of our control.

If somebody hands me a causal diagram and tells me which variables they are “in control of”, I can tell them what CDT recommends them to do and what EDT recommends them to do.

But if I am just handed the causal diagram by itself, it seems that I am required to make some judgments about what variables are under the “free control” of the agent in question.

One potential way out of this is to say that variable X is under the control of agent A if, when they decide that they want to do X, then X happens. That is, X is an ‘action variable’ if you can always trace a direct link between the event in the brain of A of ‘deciding to do X’ and the actual occurrence of X.

Two problems that I see with this are (1) that this seems like it might be too strong of a requirement, and (2) that this seems to rely on a starting assumption that the event of ‘deciding to do X’ is an action variable.

On (1): we might want to say that I am "in control" of my desire for candy, even if my decision to diminish it is only sometimes effectual. Do we say that I am only in control of my desire for candy in those exact instances when I actually succeed in determining its value? How about cases where my decision to desire candy lines up with whether or not I desire candy, but purely by coincidence? For instance, somebody walking around constantly "deciding" to keep the moon in orbit around the Earth is not in "free control" of the moon's orbit, but this way of thinking seems to imply that they are.

And on (2): Procedurally, this method involves introducing a new variable (“Decides X”), and seeing whether or not it empirically leads to X. After all, if the part of your brain that decides X is completely out of your control, then it makes as much sense to say that you can control X as to say that you can control the moon’s orbit. But then we have a new question, about how much this decision is under your control.  There’s a circularity here.

We can determine if “Decides X” is a proper action variable by imagining a new variable “Decides (Decides X)”, and seeing if it actually is successful at determining the value of “Decides X”. And then, if somebody asks us how we know that “Decides (Decides X)” is an action variable, we look for a variable “Decides (Decides (Decides X))”. Et cetera.

How can we figure our way out of this mess?

Simpson’s paradox

Previous: Screening off and explaining away

A look at admission statistics at a college reveals that women are less likely to be admitted to graduate programs than men. A closer investigation reveals that in fact when the data is broken down into individual department data, women are more likely to be admitted than men. Does this sound impossible to you? It happened at UC Berkeley in 1973.

When two treatments are tested on a group of patients with kidney stones, Treatment A turns out to lead to worse recovery rates than Treatment B. But when the patients are divided according to the size of their kidney stone, it turns out that no matter how large their kidney stone, Treatment A always does better than Treatment B. Is this a logical contradiction? Nope, it happened in 1986!

What’s going on here? How can we make sense of this apparently inconsistent data? And most importantly, what conclusions do we draw? Is Berkeley biased against women or men? Is Treatment A actually more effective or less effective than Treatment B?

In this post, we’ll apply what we’ve learned about causal modeling to be able to answer these questions.


Quine gave the following categorization of types of paradoxes: veridical paradoxes (those that seem wrong but are actually correct), falsidical paradoxes (those that seem wrong and actually are wrong), and antinomies (those that are premised on common forms of reasoning and end up deriving a contradiction).

Simpson’s paradox is in the first category. While it seems impossible, it actually is possible, and it happens all the time. Our first task is to explain away the apparent falsity of the paradox.

Let’s look at some actual data on the recovery rates for different treatments of kidney stones.

              Treatment A      Treatment B
All patients  78% (273/350)    83% (289/350)

The percentages represent the number of patients that recovered, out of all those that were given the particular treatment. So 273 patients recovered out of the 350 patients given Treatment A, giving us 78%. And 289 patients recovered out of the 350 patients given Treatment B, giving 83%.

At this point we’d be tempted to proclaim that B is the better treatment. But if we now break down the data and divide up the patients by kidney stone size, we see:

              Treatment A      Treatment B
Small stones  93% (81/87)      87% (234/270)
Large stones  73% (192/263)    69% (55/80)

And here the paradoxical conclusion falls out! If you have small stones, Treatment A looks better for you. And if you have large stones, Treatment A looks better for you. So no matter what size kidney stones you have, Treatment A is better!

And yet, amongst all patients, Treatment B has a higher recovery rate.

Small stones: A better than B
Large stones: A better than B
All sizes: B better than A

I encourage you to check out the numbers for yourself, in case you still don’t believe this.
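
Or let Python check them for you; the counts below are exactly those from the tables above:

```python
# Kidney-stone data from the tables: (recovered, total) for each group.
data = {
    ("A", "small"): (81, 87),   ("B", "small"): (234, 270),
    ("A", "large"): (192, 263), ("B", "large"): (55, 80),
}

def rate(treatment, size=None):
    """Recovery rate for a treatment, optionally within one stone-size stratum."""
    groups = [v for (t, s), v in data.items()
              if t == treatment and (size is None or s == size)]
    recovered = sum(r for r, n in groups)
    total = sum(n for r, n in groups)
    return recovered / total

# A wins within each stratum...
assert rate("A", "small") > rate("B", "small")   # 93% vs 87%
assert rate("A", "large") > rate("B", "large")   # 73% vs 69%
# ...but loses overall. Simpson's paradox, verified.
assert rate("A") < rate("B")                     # 78% vs 83%
```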


The simplest explanation for what’s going on here is that we are treating conditional probabilities like they are joint probabilities. Let’s look again at our table, and express the meaning of the different percentages more precisely.

              Treatment A                                 Treatment B
Small stones  P(Recovery | Small stones & Treatment A)    P(Recovery | Small stones & Treatment B)
Large stones  P(Recovery | Large stones & Treatment A)    P(Recovery | Large stones & Treatment B)
Everybody     P(Recovery | Treatment A)                   P(Recovery | Treatment B)

Our paradoxical result is the following:

P(Recovery | Small stones & Treatment A) > P(Recovery | Small stones & Treatment B)
P(Recovery | Large stones & Treatment A) > P(Recovery | Large stones & Treatment B)
P(Recovery | Treatment A) < P(Recovery | Treatment B)

But this is no paradox at all! There is no law of probability that tells us:

If P(A | B & C) > P(A | B & ~C)
and P(A | ~B & C) > P(A | ~B & ~C),
then P(A | C) > P(A | ~C)

There is, however, a law of probability that tells us:

If P(A & B | C) > P(A & B | ~C)
and P(A & ~B | C) > P(A & ~B | ~C),
then P(A | C) > P(A | ~C)

And if we represented the data in terms of these joint probabilities (probability of recovery AND small stones given Treatment A, for example) instead of conditional probabilities, we’d find that the probabilities add up nicely and the paradox vanishes.

              Treatment A      Treatment B
Small stones  23% (81/350)     67% (234/350)
Large stones  55% (192/350)    16% (55/350)
All patients  78% (273/350)    83% (289/350)

It is in this sense that the paradox arises from improper treatment of conditional probabilities as joint probabilities.


This tells us why we got a paradoxical result, but isn’t quite fully satisfying. We still want to know, for instance, whether we should give somebody with small kidney stones Treatment A or Treatment B.

The fully satisfying answer comes from causal modeling. The causal diagram we will draw will have three variables, A (which is true if you receive Treatment A and false if you receive Treatment B), S (which is true if you have small kidney stones and false if you have large), and R (which is true if you recovered).

Our causal diagram should express that there is some causal relationship between the treatment you receive (A) and whether you recover (R). It should also show a causal relationship between the size of your kidney stone (S) and your recovery, as the data indicates that larger kidney stones make recovery less likely.

And finally, it should show a causal arrow from the size of the kidney stone to the treatment that you receive. This final arrow comes from the fact that more people with large stones were given Treatment A than Treatment B, and more people with small stones were given Treatment B than Treatment A.

This gives us the following diagram:

[Diagram: S → A; S → R; A → R]

The values of P(S), P(A | S), and P(A | ~S) were calculated from the table we started with. For instance, the value of P(S) was calculated by adding up all the patients that had small kidney stones, and dividing by the total number of patients in the study: (87 + 270) / 700.

Now, we want to know if P(R | A) > P(R | ~A) (that is, if recovery is more likely given Treatment A than given Treatment B).

If we just look at the conditional probabilities given by our first table, then we are taking into account two sources of dependency between treatment type and recovery. The first is the direct causal relationship, which is what we want to know. The second is the spurious correlation between A and R as a result of the common cause S.

[Figure: the same diagram with the paths of dependency between A and R highlighted in red]

Here the red arrows represent “paths of dependency” between A and R. For example, since those with small stones are more likely to get Treatment B, and are also more likely to recover, this will result in a spurious correlation between Treatment B and recovery.

So how do we determine the actual non-spurious causal dependency between A and R?


If we observe the value of S, then we screen A off from R along the path through S! This removes the spurious correlation, and leaves us with just the direct causal relationship that we want.

[Figure: the diagram with S observed, breaking the path from A to R through S]

What this means is that the true nature of the relationship between treatment type and recovery can be determined by breaking down the data in terms of kidney stone size. Looking back at our original data:

| Recovery rate | Treatment A   | Treatment B   |
|---------------|---------------|---------------|
| Small stones  | 93% (81/87)   | 87% (234/270) |
| Large stones  | 73% (192/263) | 69% (55/80)   |
| All patients  | 78% (273/350) | 83% (289/350) |

This corresponds to looking at the data divided up by size of stones, and not the data on all patients. And since for each stone size category, Treatment A was more effective than Treatment B, this is the true causal relationship between A and R!
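The whole reversal can be verified directly from the raw counts; a minimal sketch:

```python
# (recovered, total) per treatment and stone size, from the original table.
data = {
    ("A", "small"): (81, 87),   ("A", "large"): (192, 263),
    ("B", "small"): (234, 270), ("B", "large"): (55, 80),
}

def rate(treatment, size):
    recovered, total = data[(treatment, size)]
    return recovered / total

# Within each stratum, Treatment A beats Treatment B...
assert rate("A", "small") > rate("B", "small")   # 93% > 87%
assert rate("A", "large") > rate("B", "large")   # 73% > 69%

# ...yet in the aggregate, B appears to beat A: Simpson's paradox.
agg_A = (81 + 192) / 350
agg_B = (234 + 55) / 350
assert agg_B > agg_A                             # 83% > 78%
```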


A nice feature of the framework of causal modeling is that there are often multiple ways to think about the same problem. So instead of thinking about this in terms of screening off the spurious correlation through observation of S, we could also think in terms of causal interventions.

In other words, to determine the true nature of the causal relationship between A and R, we want to intervene on A, and see what happens to R.

This corresponds to calculating if P(R | do A) > P(R | do ~A), rather than if P(R | A) > P(R | ~A).

Intervention on A gives us the new diagram:

[Figure: the diagram after intervening on A, with the arrow from S to A removed]

With this diagram, we can calculate:

P(R | do A)
= P(R & S | do A) + P(R & ~S | do A)
= P(S) * P(R | A & S) + P(~S) * P(R | A & ~S)
= 51% * 93% + 49% * 73%
= 83.2%


P(R | do ~A)
= P(R & S | do ~A) + P(R & ~S | do ~A)
= P(S) * P(R | ~A & S) + P(~S) * P(R | ~A & ~S)
= 51% * 87% + 49% * 69%
= 78.2%

Now not only do we see that Treatment A is better than Treatment B, but we can see the exact amount by which it is better – it improves recovery chances by about 5 percentage points!
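The same adjustment can be written as a short function (a sketch; it uses the exact counts rather than the rounded percentages, so the decimals come out slightly different from the hand calculation above):

```python
p_S = 357 / 700  # P(small stones) = 0.51

# P(R | treatment, stone size), from the original recovery-rate table.
p_R = {
    ("A", "small"): 81 / 87,   ("A", "large"): 192 / 263,
    ("B", "small"): 234 / 270, ("B", "large"): 55 / 80,
}

def p_R_do(treatment):
    # Weight each stratum's recovery rate by P(S), not by P(S | treatment):
    # intervening on A cuts the S -> A arrow, so S keeps its marginal distribution.
    return p_S * p_R[(treatment, "small")] + (1 - p_S) * p_R[(treatment, "large")]

print(round(p_R_do("A"), 2))  # about 0.83
print(round(p_R_do("B"), 2))  # about 0.78
```

The key design choice is in the weighting: using P(S | treatment) instead would reintroduce exactly the spurious dependency that the intervention is meant to remove.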

Next, we’re going to go kind of crazy with Simpson’s paradox and show how to construct an infinite chain of Simpson’s paradoxes.

Fantastic paper on all of this here.

Previous: Screening off and explaining away

Next: Iterated Simpson’s paradox