Bayes and beyond

You have lots of beliefs about the world. Each belief can be written as a propositional statement that is either true or false. But while each statement is either true or false, your beliefs are more complicated; they come in shades of gray, not black and white. Instead of beliefs being on-or-off, we have degrees of belief – some beliefs are much stronger than others, some have roughly the same strength, and so on. Your smallest degrees of belief are reserved for logical impossibilities – things that you can be absolutely certain are false. Your largest degrees of belief are for absolute certainties, the other side of the coin.

Now, answer for yourself the following series of questions:

  1. Can you quantify a degree of belief?

By quantify, I mean put a precise, numerical value on it. That is, can you in principle take any belief of yours and map it to a real number that represents how strongly you believe it? The “in principle” is doing a lot of work here; maybe you don’t think that you can do this in practice, but does it make conceptual sense to you to think about degrees of belief as quantities?

If so, then we can arbitrarily rescale your degrees of belief by translating them into what I’ll call for the moment credences. All of your credences are on a scale from 0 to 1, where 0 is total disbelief and 1 is totally certain belief. We can accomplish this rescaling by subtracting from each degree of belief your lowest degree of belief (that which you assign to logical impossibilities), and then dividing each result by the difference between your highest and lowest degrees of belief.
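This rescaling is just min-max normalization. A minimal sketch, where the raw numbers are made up purely for illustration:

```python
def rescale(degrees):
    """Min-max rescale raw degrees of belief onto [0, 1]:
    subtract the lowest degree, divide by the full range."""
    lo, hi = min(degrees), max(degrees)
    return [(d - lo) / (hi - lo) for d in degrees]

# Hypothetical raw degrees of belief on some arbitrary internal scale;
# -3.0 is a logical impossibility, 7.0 an absolute certainty.
raw = [-3.0, 0.5, 2.0, 7.0]
print(rescale(raw))  # impossibility maps to 0, certainty to 1
```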


  2. If beliefs B and B’ are mutually exclusive (i.e. it is impossible for both to be true), then do you agree that your credence that at least one of them is true should be the sum of your credences in each individually?

Said more formally: do you agree that if Cr(B & B’) = 0, then Cr(B or B’) = Cr(B) + Cr(B’)? (The equals sign here is a normative one. We are not asking whether you think this is descriptively true of your degrees of belief, but whether you think it should be true of them. This is the normativity of rationality, by the way, not of ethics.)

If so, then your credence function Cr is really a probability function (Cr(B) = P(B)). With just these two questions and the accompanying comments, we’ve pinned down the Kolmogorov axioms for a simple probability space. But we’re not done!


  3. Do you agree that your credence in two statements B and B’ both being true should be your credence in B’ given that B is true, multiplied by your credence in B?

Formally: do you agree that P(B & B’) = P(B’ | B) ∙ P(B)? If you haven’t seen this before, it might not seem immediately intuitive, but it can be made so quite easily. To find out how strongly you believe both B and B’, you can first imagine a world in which B is true and judge your credence in B’ in that world, and then judge your actual credence that B is true of the real world. The conditional probability is important here to make sure you are not ignoring ways that B and B’ could depend upon each other. If you want to know the chance that both of somebody’s eyes are brown, you need to know (1) how likely it is that their left eye is brown, and (2) how likely it is that their right eye is brown, given that their left eye is brown. Clearly, if we used an unconditional probability for (2), we would end up ignoring the dependency between the colors of the right and left eyes.

Still on board? Good! Number 3 is crucially important. You see, the world is constantly offering you up information, and your beliefs are (and should be) constantly shifting in response. We now have an easy way to incorporate these dynamics.

Say that you have some initial credence in a belief B that you will have experience E in the next few moments. A few moments pass, and you do experience E; that is, you discover that B is true. We can now set P(B) equal to 1 and adjust everything else accordingly:

For all beliefs B’, Pnew(B’) = P(B’ | B)

In other words, your new credences are just your old credences given the evidence you received. What if you weren’t totally sure that B is true? Maybe you want P(B) = .99 instead. Easy:

For all beliefs B’: Pnew(B’) = .99 ∙ P(B’ | B) + .01 ∙ P(B’ | ~B)

In other words, your new credence in B’ is just your credence that B is true, multiplied by the conditional credence of B’ given that B is true, added to your credence that B is false times the conditional credence of B’ given that B is false.
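This update rule can be sketched in a few lines of code. The conditional credences below are hypothetical numbers, chosen only for illustration:

```python
def update(credence_b, cond_given_b, cond_given_not_b):
    """New credence in B' after your credence in B moves to credence_b
    (e.g. 0.99 rather than full certainty)."""
    return credence_b * cond_given_b + (1 - credence_b) * cond_given_not_b

# Hypothetical numbers: P(B' | B) = 0.8, P(B' | ~B) = 0.3
print(update(0.99, 0.8, 0.3))
print(update(1.00, 0.8, 0.3))  # certainty in B recovers P(B' | B)
```

Setting the first argument to 1 recovers the strict conditionalization rule from the previous paragraph.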

We now have a fully specified general system of updating beliefs; that is, we have a mandated set of degrees of beliefs at any moment after some starting point. But what of this starting point? Is there a rationally mandated prior credence to have, before you’ve received any evidence at all? I.e., do we have some a priori favored set of prior degrees of belief?

Intuitively, yes. Some starting points are obviously less rational than others. If somebody starts off totally certain of one side of a contingent, a posteriori debate – one that cannot be settled as a matter of logic – before receiving any evidence for that side, then they are being irrational. So how best to capture this notion of normatively rational priors? This is the question of objective Bayesianism, and there are several candidate answers.

One candidate relies on the notions of surprise and information. Since we start with no information at all, we should start with priors that represent this state of knowledge. That is, we want priors that represent maximum uncertainty. Formalizing this notion gives us the principle of maximum entropy, which says that the proper starting point for beliefs is that which maximizes the entropy function ∑ -P logP.
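A quick numerical check of the idea, using the entropy function above over a four-outcome space (both distributions are arbitrary examples):

```python
import math

def entropy(ps):
    """Shannon entropy: sum of -p log p over the distribution."""
    return -sum(p * math.log(p) for p in ps if p > 0)

# Among distributions over four outcomes, the uniform one maximizes entropy:
uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy(uniform))  # log(4), the maximum possible for 4 outcomes
print(entropy(skewed))   # strictly smaller
```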

There are problems with this principle, however, and many complicated debates comparing it to other intuitively plausible principles. The question of objective Bayesianism is far from straightforward.

Putting aside the question of priors, we have a formalized system of rules that mandates the precise way that we should update our beliefs from moment to moment. Some of the mandates seem unintuitive. For instance, it tells us that if we get a positive result on a 99% accurate test for a disease with a 1% prevalence rate, then we have a 50% chance of having the disease, not 99%. There are many known cases where our intuitive judgments of likelihood differ from the judgments that probability theory tells us are rational.
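Here is the arithmetic behind that disease example, taking “99% accurate” to mean a 99% true-positive rate and a 99% true-negative rate (an assumption the 50% figure requires):

```python
prevalence = 0.01
sensitivity = 0.99   # P(positive | disease)
specificity = 0.99   # P(negative | no disease)

# Total probability of a positive result:
p_positive = (sensitivity * prevalence
              + (1 - specificity) * (1 - prevalence))

# Bayes' rule: P(disease | positive)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(p_disease_given_positive)  # 0.5, not 0.99
```

The true positives from the 1% who are sick are exactly matched by the false positives from the 99% who are healthy, which is why the posterior lands at one half.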

How do we respond to these cases? We only really have a few options. One, we could discard our formalization in favor of the intuitions. Two, we could discard our intuitions in favor of the formalization. Or three, we could accept both, and be fine with some inconsistency in our lives. Presuming that inconsistency is irrational, we have to make a judgment call between our intuitions and our formalization. Which do we discard?

Remember, our formalization is really just the necessary result of the set of intuitive principles we started with. So at the core of it, we’re really just comparing intuitions of differing strengths. If your intuitive agreement with the starting principles was stronger than your intuitive disagreement with the results of the formalization, then presumably you should stick with the formalization.

Another path to adjudicating these cases is to consider pragmatic arguments for our formalization, like Dutch Book arguments, which show that our way of assigning degrees of belief is the only one that cannot be exploited by a bookie to guarantee losses. You can also find reassurance in consistency and convergence theorems, which show the Bayesian’s beliefs converging to the truth in a wide variety of cases.

If you’re still with me, you are now a Bayesian. What does this mean? It means that you think that it is rational to treat your beliefs like probabilities, and that you should update your beliefs by conditioning upon the evidence you receive.


So what’s next? Are we done? Have all epistemological issues been solved? Unfortunately not. I think of Bayesianism as a first step into the realm of formal epistemology – a very good first step, but nonetheless still a first. Here’s a simple example of where Bayesianism will lead us into apparent irrationality.

Imagine we have two different beliefs about the world: B1 and B2. B2 is a respectable scientific theory: one that puts its neck out with precise predictions about the results of experiments, and tries to identify a general pattern in the underlying phenomenon. B1 is a “cheating” theory: it doesn’t have any clue what’s going to happen before an experiment, but after an experiment it peeks at the results and pretends that it had predicted them all along. We might think of B1 as the theory that perfectly fits all of the data, but only through over-fitting on the data. As such, B1 is unable to make any good predictions about future data.

What does Bayesianism say about these two theories? Well, consider any single data point. Let’s suppose that B2 does a good job predicting this data point: say, P(D | B2) = .99. And since B1 perfectly fits the data, P(D | B1) = 1. If our priors in B1 and B2 are written as P1 and P2, respectively, then our credences update as follows:

Pnew(B1) = P(B1 | D) = P1 / (P1 + .99 P2)
Pnew(B2) = P(B2 | D) = .99 P2 / (P1 + .99 P2)

For N similar data points, we get:

Pnew(B1) = P(B1 | D1, …, DN) = P1 / (P1 + .99^N P2)
Pnew(B2) = P(B2 | D1, …, DN) = .99^N P2 / (P1 + .99^N P2)

What happens to these two credences as N gets larger and larger?


As we can see, our credence in B1 approaches 100% exponentially quickly, and our credence in B2 drops to 0% exponentially quickly. Even if we start with an enormously low prior in B1, that prior will eventually be swamped as we gather more and more data.
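A quick sketch of these dynamics, plugging illustrative numbers into the formulas above:

```python
# Credence dynamics for the cheating theory B1, starting from a
# tiny prior (the numbers are illustrative, not special):
p1, p2 = 1e-6, 1.0 - 1e-6

for n in [0, 100, 1000, 2000]:
    denom = p1 + (0.99 ** n) * p2
    print(n, p1 / denom)  # credence in B1 climbs toward 1
```

Even with a prior of one in a million, a couple thousand data points are enough for B1 to dominate.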

It looks like in this example, the Bayesian is successfully hoodwinked by the cheating theory, B1. But this is not quite the end of the story for Bayes. The only single theory that perfectly predicts all of the data you receive in the infinite evidence limit is basically just the theory that “Everything that’s going to happen is what’s going to happen.” And, well, this is surely true. It’s just not very useful.

If instead we look at B1 as a sequence of theories, one for each new data point, then we have a way out by claiming that our priors drop as we go further in the sequence. This is an appeal to simplicity – a theory that exactly specifies 1000 different data points is more complex than a theory that exactly specifies 100 different data points. It also suggests a precise way to formalize simplicity, by encoding it into our priors.

While the problem of over-fitting is not an open-and-shut case against Bayesianism, it should still give us pause. The core of the issue is that there are more intuitive epistemic virtues than those that the Bayesian optimizes for. Bayesianism mandates a degree of belief as a function of two ingredients: the prior and the evidential update. The second of these, Bayesian updating, solely optimizes for accommodation of data. And setting of priors is typically done to optimize for some notion of simplicity. Since empirically distinguishable theories have their priors washed out in the limit of infinite evidence, Bayesianism becomes a primarily accommodating epistemology.

This is what creates the potential for problems of overfitting to arise. The Bayesian is only optimizing for accommodation and simplicity, but what we want is a framework that also optimizes for prediction. I’ll give two examples of ways to do this: cross validation and posterior predictive checking.

I’ve talked about cross validation previously. The basic idea is that you split a set of data into a training set and a testing set, optimize your model for best fit with the training set, and then see how it performs on the testing set. In doing so, you are in essence estimating how well your model will do on predictions of future data points.

This procedure is pretty commonsensical. Want to know how well your model does at predicting data? Well, just look at the predictions it makes and evaluate how accurate they were. It is also completely outside of standard Bayesianism, and solves the issues of overfitting. And since the first half of cross validation is training your model to fit the training set, it is optimizing for both accommodation and prediction.
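As a minimal sketch of the procedure, here is cross validation used to estimate the predictive error of a straight-line model on made-up data (everything below – the data, the split, the model – is hypothetical):

```python
import random

random.seed(0)
# Hypothetical data generated from y = 2x plus Gaussian noise:
data = [(x, 2 * x + random.gauss(0, 0.5)) for x in [i / 10 for i in range(50)]]

# Split into a training set and a testing set:
train, test = data[::2], data[1::2]

def fit_line(points):
    """Least-squares fit of y = m*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b

# Optimize the model on the training set only...
m, b = fit_line(train)

# ...then evaluate its predictions on the held-out testing set:
test_error = sum((y - (m * x + b)) ** 2 for x, y in test) / len(test)
print(m, test_error)  # held-out error estimates predictive accuracy
```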

Posterior predictive checks are also pretty commonsensical; you ask your model to make predictions for future data, and then see how these predictions line up with the data you receive.

More formally, if you have some set of observable variables X and some other set of parameters A that are not directly observable, but that influence the observables, you can express your prior knowledge (before receiving data) as a prior over A, P(A), and a likelihood function P(X | A). Upon receiving some data D about the values of X, you can update your prior over A as follows:

P(A) becomes P(A | D)
where P(A | D) = P(D | A) P(A) / P(D)

To make a prediction about how likely you think it is that the next data point will be X, given the data D, you must use the posterior predictive distribution:

P(X | D) = ∫ P(X | A) ∙ P(A | D) dA

This gives you a precise probability that you can use to evaluate the predictive accuracy of your model.
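As a minimal illustration, consider a coin with unknown bias A and a uniform prior over A. Conjugacy lets the posterior predictive integral be evaluated in closed form (the data counts below are made up):

```python
# Coin with unknown bias A, uniform Beta(1, 1) prior over A.
a, b = 1, 1

# Data D: 7 heads and 3 tails in 10 flips.
heads, tails = 7, 3

# Bayesian update: the posterior over A is Beta(1 + 7, 1 + 3).
a, b = a + heads, b + tails

# Posterior predictive P(next flip = heads | D) -- the integral
# of P(H | A) * P(A | D) dA -- is the posterior mean of A:
p_next_heads = a / (a + b)
print(p_next_heads)  # 8/12
```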

There’s another goal that we can aim towards, besides accommodation, simplicity, or prediction: distance from truth. You might think that this goal is fairly obvious, and that all the other methods are really only attempts to measure it. But in information theory, there is a precise way to specify the information gap between any given theory and reality. This quantity is called the Kullback-Leibler divergence (DKL), and I’ll refer to it as just information divergence.

DKL = ∫ Ptrue log(Ptrue / P) dx

This term, if parsed correctly, represents precisely how much information you gain if you go from your starting distribution P to the true distribution Ptrue.

For example, if you have a fair coin, then the true distribution is given by (Ptrue(H) = .5, Ptrue(T) = .5). You can calculate how far any other theory (P(H) = p, P(T) = 1 – p) is from the truth using DKL.

 DKL = .5 ∙ [ log(1 / 2p) + log(1 / 2(1-p)) ]

I’ve graphed DKL as a function of p here:

[Graph: information divergence DKL as a function of p]

As you can see, the information divergence is 0 for the correct theory that the coin is fair (p = 0.5), and goes to infinity as you get further away from this.
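The coin formula above is easy to check numerically:

```python
import math

def dkl_coin(p):
    """Information divergence from the theory (p, 1-p) to the fair coin,
    using the formula DKL = .5 [ log(1/2p) + log(1/2(1-p)) ]."""
    return 0.5 * (math.log(1 / (2 * p)) + math.log(1 / (2 * (1 - p))))

print(dkl_coin(0.5))   # 0 for the true theory
print(dkl_coin(0.9))   # positive
print(dkl_coin(0.99))  # larger still, blowing up as p nears 1
```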

This is all well and good, but how is this practically applicable? It’s easy to minimize the distance from the true distribution if you already know the true distribution, but the problem is exactly that we don’t know the truth and are trying to figure it out.

Since we don’t have direct access to Ptrue, we must resort to approximations of DKL. The most famous approximation is the Akaike information criterion (AIC). I won’t derive it here, but will present its form (up to the conventional factor of 2, which makes no difference to which model wins):

AIC = k – log(P(data | M))
where M = the model being evaluated
and k = number of parameters in M

The model that minimizes this quantity probably also minimizes the information distance from truth. Thus, “lower AIC value” serves as a good approximation to “closer to the truth”. Notice that AIC explicitly takes into account simplicity; the quantity k tells you how complex a model is. This is pretty interesting in its own right; it’s not obvious why a method solely focused on optimizing for truth should end up explicitly including a term that optimizes for simplicity.
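Here is a toy illustration of the complexity penalty at work, comparing a zero-parameter fair-coin model against a one-parameter fitted model on made-up flip data:

```python
import math

# Hypothetical data: 100 coin flips, 52 heads.
n, h = 100, 52

# Model 1: fair coin, k = 0 free parameters.
loglik_fair = h * math.log(0.5) + (n - h) * math.log(0.5)
aic_fair = 0 - loglik_fair

# Model 2: coin with one fitted bias parameter, k = 1.
p_hat = h / n
loglik_fit = h * math.log(p_hat) + (n - h) * math.log(1 - p_hat)
aic_fit = 1 - loglik_fit

print(aic_fair, aic_fit)  # lower AIC wins
```

The fitted model accommodates the data slightly better, but not by enough to pay for its extra parameter, so the simpler fair-coin model comes out with the lower AIC.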

Here’s a summary table describing the methods I’ve talked about here (as well as some others that I haven’t talked about), and what they’re optimizing for.


Which theory makes the data most likely?
– Maximum likelihood estimation (MLE)

Which theory is most likely, given the data?
– Bayesian information criterion (BIC)

Maximum uncertainty
– Relative entropy

Simplicity
– Minimum description length
– Solomonoff induction

Predictive accuracy
– Cross validation
– Posterior predictive checks

Distance from truth
– Information divergence (DKL)
– Akaike information criterion (AIC)


Racism and identity

I recently saw that a friend of a friend of mine was writing in a blog about her experience as a mixed race woman in America, and all of the ways in which she feels that she suffers from explicit and implicit discrimination. The impression she conveyed was that she walked around intensely aware of her skin color, and felt that others were equally aware. In her world, people looked at her as primarily a brown woman, a strange and exotic other. She talked about the emotional shock she has to go through when returning to the United States after visiting her family in Thailand, in dealing with the fact that Thai culture is so underrepresented here. There was a lot of anger, a feeling of not being accepted by the majority culture around her, and most of all, a sense of being disrespected and harmed on the basis of her ethnicity.

Whenever I hear people like her talking like this, I get really confused. I am a mixed-race person, living in the same city as her, surrounded by probably very similar people, and yet we seem to live in completely different worlds. I know that the idea of color-blindness is not in vogue, but I walk around literally entirely unaware of my skin color and feel fairly confident that almost everybody else I run into is similarly unaware of it.

I’m somebody who is fairly attuned to social signals – I feel like if I were being slighted on the basis of my ethnicity, I would notice it – and I’m also not somebody who could remotely pass for white. So I’m left wondering… what’s going on here? How can two people have such radically different experiences of living with their ethnicities, when it seems like so many of the variables are the same?

One answer is that some of the variables that appear to be the same actually aren’t. For instance, while we’re both mixed race, we are different mixes of races. While I could pass (and have passed) for Black, Hispanic, Middle Eastern, or Indian, I’ve never been identified as Southeast Asian. So perhaps while Black/Hispanic/Middle Eastern/Indian people face very little racism in my town, Southeast Asians are relentlessly oppressed. Hmm, somehow that seems wrong…

Maybe a relevant difference is the social circles we surround ourselves with. From what I know of this person, she surrounds herself with people who are very concerned with social justice issues. It seems fairly plausible that the types of people who are very concerned with social justice are also going to be very sensitive to racial and ethnic identities, and will be much more likely to see somebody as a mixed-race person (and treat this as an important aspect of their identity). Incidentally, the few people around whom I’ve actually felt conscious of my skin color or ethnicity have been exactly those who are most vocal and passionate about their anti-racism and social justice concerns. Also, anecdotally, the people I know who most strongly emphasize feelings of personal oppression happen to surround themselves with social justice types. Of course, this doesn’t indicate the direction of causation – it could be that those who feel oppressed seek out social justice types who will affirm their feelings of being wronged.

Another possibility to explain the difference in perceptions is that one of us is just wrong. Maybe the oppression and constant discrimination and other-ing is actually in my Southeast Asian friend-of-friend’s head. Or maybe I’m actually being horribly oppressed and discriminated against and just don’t know it. Maybe I’m just extremely lucky and have by chance avoided all the nasty racists in my town. (If one of us is wrong, I’m betting it’s her.)

But this isn’t the only time I’ve noticed this disconnect in experiences. I’m reminded of a debate I watched a while back about sexual harassment. The actual debate itself wasn’t too interesting, but I found the Q&A period fascinating. Many different women stood up and spoke about their personal experiences of sexual harassment in their daily lives, and what they said completely contradicted each other. Some women claimed that they felt sexually harassed, or at risk of sexual harassment, virtually always – even walking in the middle of the day in a public area or shopping for groceries. Other women claimed that they had never been catcalled, sexually harassed, or discriminated against because of their sex.

Keep in mind: this was a live debate, with a local audience. All of these women lived in the same area. There weren’t obvious differences in their appearances, ages, or mannerisms, although there were significant differences in their views on sexual harassment (for obvious reasons). Also keep in mind that some of the claims being made were literally just objectively verifiable factual statements. It’s not as if the disagreement was over whether others had objectifying thoughts about them because of their sex; the differences were about things like whether or not they are verbally catcalled while walking downtown. There’s got to be an actual fact about how likely the average woman is to be catcalled on a given street.

This is pretty hard to make sense of, and it seems like the exact same phenomenon as what’s going on with my friend’s friend and me. People who should be living in similar worlds mentally feel like they are living in completely different worlds.

One last example: I’ve had similar experiences with my sister. She is the same race as me (shocking, I know), with basically the same amount of exposure to the non-American side of our cultural heritage, has lived in the same city as me for most of our lives, and is not too different in age from me. But she talks about a strong sense of feeling discriminated against as a brown woman, and has described experiences of oppression that seem totally foreign to me.

Perhaps a component of all of this can be explained by incentives to exaggerate. This aligns with my sense that those that think they are oppressed hang around with social justice types. A lot of social justice culture seems to be devoted to jockeying for oppression points and finding ways to appear as unprivileged as possible. In a social circle in which one can gain social brownie points by being discriminated against, you would expect a general upwards pressure on the level of exaggeration that the average person uses in describing said discrimination.

I feel like I should stop here to emphasize that I’m not suggesting that there isn’t racial and sexual discrimination in the world. There obviously is. What I’m specifically wondering about is how it is that people in little liberal college towns like mine with fairly similar racial backgrounds can have such radically different perspectives on the factual matter of the actual oppression they face. It’s especially puzzling to me given that I’m a brown person who has, as far as I can tell, never faced significant drawbacks on the basis of it, and is most of the time unaware of my skin color.

I think that this unawareness of my skin color provides a hint for explaining what might be actually going on here. Not only am I generally not aware of my skin color, but I have always felt this way. I think that there is a spectrum of natural self-identification tendencies, and a bias towards attributing perceived affronts to the most salient aspects of your identity. Let me unpack this.

It’s not exactly that I’m unaware that I’m brown (I wouldn’t be surprised if somebody showed me a picture of myself and pointed out my skin color). It’s that my brownness is a nonexistent component of the way I think about myself. As far back as I can remember, the salient features of my sense of self have been things like my way of thinking and my personality. I’ve always identified myself as mostly a mind, not a body. I even remember a few bizarre experiences where I looked in a mirror and was momentarily struck by a surreal sense of disconnect, that I happen to exist within this body that seems so obviously distinct from me.

It is also the case that when I perceive that others dislike me and don’t have any sense of why this may be, I naturally tend to assume that their dislike relates to some aspect of my mental characteristics; maybe they don’t like my style of reasoning, or my sense of humor, or some other aspect of my personality. I will almost never attribute their dislike to some physical characteristics of mine.

I perceive myself as primarily a thinker occupying a body that I don’t strongly identify with. But other people identify much more closely with their physicality (skin color, facial features, body type, sex, et cetera). It seems plausible to me that just as I perceive affronts as having to do with properties of my mind, those for whom race is a salient component of their personal identity will perceive affronts as having to do with racism, those who identify with their sex will be more likely to see them as sexism, and so on.

This idea of a spectrum of self-identification tendencies is fairly satisfying to me as an explanation of this phenomenon of radically different perceptions of the world. Two people that appear to exist in very similar social environments can have radically different perceptions of their social environments, because of differences in how they conceive of themselves and the way that this affects their framing of their interactions with others. These differing tendencies are not restricted to body-versus-mind. Some people strongly identify themselves with a profession, a cultural heritage, or a nation. Others identify with an ideology or a religion. And in general, the parts of your identity that feel most salient to you are those that will prickle most readily at perceived affronts.

This relates to the notion in psychology of internal versus external loci of control. When you fail a job interview, you blame the morning traffic or the interviewer’s bias. If you had gotten the job, you would have happily praised your interviewing skills and charming smile. When your neighbor fails a job interview, you attribute it to their poor interviewing skills. That is, you place the locus of control over the outcome wherever it is convenient.

The first half of this pattern – attributing your own positive outcomes to features of your identity and your negative outcomes to the external world – is the self-serving bias. The second half, attributing others’ negative outcomes to their character rather than their circumstances, is the fundamental attribution error.

If you strongly identify as a mixed person, then you will see events in your world as being all about your mixed race. And if you identify as a mind floating about in a body, then things like your race or sex or attractiveness will seem mostly irrelevant to explaining the events in your life. This suggests a sort of self-perpetuating cycle whereby those that identify as X will perceive the world as centered around X, further entrenching the self-identification as X.

What is bias?

I find urns to be a fruitful source for metaphors regarding rationality. For example, here’s a question that I’ve recently been thinking about: What does it mean for somebody to be biased?

Imagine that there is an urn containing black and white balls that you don’t have direct access to. You want to know the ratio of white to black balls in the urn, and you know somebody that does have direct access to it. This person will remove some number of balls from the urn and show them to you, thus giving you some evidence as to the contents of the urn.

So, for instance, if this person shows you 100 black balls in a row, then this is strong evidence that there are many more black balls than white balls in the urn. Or is it?

In fact, this is only strong evidence if you have good reason to think that the person presenting you with the evidence is unbiased. We can exactly formulate what unbiased means in this example. The procedure your acquaintance is running has two steps: first they remove some balls from the urn, and second they show you some of the balls they removed. Thus there are two sources of bias. I’ll call the first type of bias knowledge bias and the second presentation bias.

Knowledge bias is what happens if the person is not randomly sampling balls from the urn. Maybe they are fishing through the urn until they find a black ball and then removing it. Or maybe for some complicated reason that they are unaware of, their sampling is unrepresentative of the true ratio in the urn. The first of these corresponds to things like motivated reasoning and confirmation bias. The second is more subtle; it corresponds to a bias in terms of the information that they are exposed to. This could come as a result of living in a culture in which certain views are taken for granted and never questioned, or as a result of the information that reaches them being subject to selection pressures that distort the ratio of information on one side to the other. Scott Alexander’s toxoplasma of rage seems like a good example of this.

In short, knowledge bias refers to a state of knowledge where the information that you have is not representative of the information you would get from a random sampling procedure.

Presentation bias is what happens when the balls you are being shown are not a representative sample of the balls that were removed. For example, somebody could have a totally random sampling procedure, and end up removing 10 black balls and 100 white balls, but then only show you the 10 black balls. On the more explicit side, this corresponds to explicitly omitting information or arguments that you know. On the less explicit side, this could correspond to doing a better job at presenting arguments with favorable conclusions than those with unfavorable conclusions. This is pretty hard to avoid in general; it is not easy to do just as good of a job at presenting arguments you dislike as it is for arguments you like.

In short, presentation bias is where the information that is being presented is unrepresentative of a random sampling of the information that the presenter has.
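Both kinds of bias can be sketched with a simulated urn (all the numbers here are arbitrary):

```python
import random

random.seed(1)
urn = ["black"] * 30 + ["white"] * 70  # true ratio: 30% black

# Unbiased: a random sample, all of which is shown to you.
fair_sample = random.sample(urn, 40)

# Knowledge bias: the drawer fishes for black balls while sampling,
# so the balls removed over-represent black.
biased_draw = sorted(urn, key=lambda b: (b != "black", random.random()))[:40]

# Presentation bias: the draw itself is fair, but only the black
# balls from it are shown to you.
shown = [b for b in fair_sample if b == "black"]

print(fair_sample.count("black") / 40)  # sample estimate of the true 0.3
print(biased_draw.count("black") / 40)  # 0.75: inflated by the fishing
print(shown.count("white"))             # 0: the white balls were hidden
```

In both biased cases, the balls you end up seeing tell you almost nothing about the true ratio, even though each individual ball really did come from the urn.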

What if all of the good arguments for one side are really complicated and all of the arguments on the other side are dead simple? If you’re talking to a dumb person, you’ll have a hard time conveying the relative strengths of the arguments on either side. In this case, the bias arises not through the information being presented, but through the information being received. This is not a presentation bias, but a knowledge bias on the part of the listener. Here, a good educator has the choice to either not present the complicated information that their student won’t understand anyway (a presentation bias), or present it and watch it not be understood (a knowledge bias).

Notice that intention is not emphasized in this way of thinking about bias. While intending to present biased information certainly makes it easier to be biased, it is not necessary. Somebody might be biased as a result of not being smart enough, or being surrounded by a biased culture, or being better at making the case for their side than the other.

Bias can get complicated really quickly. Person A, who gets all of their political information from Fox News, probably has a significant knowledge bias. This knowledge bias arises from a presentation bias on the part of Fox News. If Person A presents some arguments they heard on Fox to a friend of theirs, and this friend accepts and updates on those arguments, then they will have unwittingly attained a knowledge bias. This is the case even if there is no presentation bias on the part of Person A!

Basically, bias is contagious. Enter one Super Persuader, somebody who is a master presenter of biased arguments, and bias can propagate like mad throughout a society to the point that it is unclear who and what can be trusted. I’m not sure to what degree it makes sense to say that this is the state of our society today, but it certainly gives reason to be very careful about the way that information is attained and dispersed.

Value beyond ethics

There is a certain type of value in our existence that transcends ethical value. It is beautifully captured in this quote from Richard Feynman:

It is a great adventure to contemplate the universe, beyond man, to contemplate what it would be like without man, as it was in a great part of its long history and as it is in a great majority of places. When this objective view is finally attained, and the mystery and majesty of matter are fully appreciated, to then turn the objective eye back on man viewed as matter, to view life as part of this universal mystery of greatest depth, is to sense an experience which is very rare, and very exciting. It usually ends in laughter and a delight in the futility of trying to understand what this atom in the universe is, this thing—atoms with curiosity—that looks at itself and wonders why it wonders.

Well, these scientific views end in awe and mystery, lost at the edge in uncertainty, but they appear to be so deep and so impressive that the theory that it is all arranged as a stage for God to watch man’s struggle for good and evil seems inadequate.

The Meaning Of It All

Carl Sagan beautifully expressed the same sentiment.

We are the local embodiment of a Cosmos grown to self-awareness. We have begun to contemplate our origins: starstuff pondering the stars; organized assemblages of ten billion billion billion atoms considering the evolution of atoms; tracing the long journey by which, here at least, consciousness arose. Our loyalties are to the species and the planet. We speak for Earth. Our obligation to survive is owed not just to ourselves but also to that Cosmos, ancient and vast, from which we spring.


The ideas expressed in these quotes feel a thousand times deeper and more profound than anything offered in ethics. Trolley problems seem trivial by comparison. If somebody argued that the universe would be better off without us on the basis of, say, a utilitarian calculation of net happiness, I would feel that there is an entire dimension of value they are completely missing out on. This type of value, a raw aesthetic sense of the profound strangeness and beauty of reality, is tremendously subtle and easily slips out of grasp, but is crucially important. My blog header serves as a reminder: We are atoms contemplating atoms.

Taxonomy of infinity catastrophes for expected utility theory

Basics of expected utility theory

I’ve talked quite a bit in past posts about the problems that infinities raise for expected utility theory. In this post, I want to systematically go through and discuss the different categories of problems.

First of all, let’s define expected utility theory.

Given an action A, we have a utility function U over the possible consequences
U = { U1, U2, U3, … UN }
and a credence distribution P over the consequences
P = { P1, P2, P3, … PN }.
We define the expected utility of A to be EU(A) = P1U1 + P2U2 + … + PNUN

Expected Utility Theory:
The rational action is that which maximizes expected utility.

Just to give an example of how this works out, suppose that we can choose between two actions A1 and A2, defined as follows:

Action A1
U1 = { 20, -10 }
P1 = { 50%, 50% }

Action A2
U2 = { 10, -20 }
P2 = { 80%, 20% }

We can compare the expected utilities of these two actions by using the above formula.

EU(A1) = 20∙50% + -10∙50% = 5
EU(A2) = 10∙80% + -20∙20% = 4

Since EU(A1) is greater than EU(A2), expected utility theory mandates that A1 is the rational act for us to take.
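As a quick sanity check, here is the same calculation in code (a minimal sketch; the function name `expected_utility` is my own, not anything standard):

```python
# Expected utility: the credence-weighted sum of utilities over consequences.
def expected_utility(utilities, credences):
    return sum(u * p for u, p in zip(utilities, credences))

# The two actions from the example above.
eu_a1 = expected_utility([20, -10], [0.5, 0.5])  # 20*0.5 + (-10)*0.5
eu_a2 = expected_utility([10, -20], [0.8, 0.2])  # 10*0.8 + (-20)*0.2

print(eu_a1)  # 5.0
print(eu_a2)  # 4.0
```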

Expected utility theory seems to work out fine in the case of finite payouts, but becomes strange when we begin to introduce infinities. Before even talking about the different problems that arise, though, you might be tempted to brush off this issue, thinking that infinite payouts don’t really exist in the real world.

While this is a tenable position to hold, it is certainly not obviously correct. We can easily construct actually playable games that have an infinite expected payout. For instance, a friend of mine runs the following procedure whenever it is getting late and he is trying to decide whether or not he should head home: First, he flips a coin. If it lands heads, he heads home. If tails, he waits one minute and then re-flips the coin. If it lands heads this time, he heads home. If tails, then he waits two minutes and re-flips the coin. On the next flip, if it lands tails, he waits four minutes. Then eight. And so on. The danger of this procedure is that on average, he ends up staying out for an infinitely long period of time.

This is a more dangerous real-world application of the St. Petersburg Paradox (although you’ll be glad to know that he hasn’t yet been stuck hanging out with me for an infinite amount of time). We might object: Yes, in theory this has an infinite expected time. But we know that in practice, there will be some cap on the total possible time. Perhaps this cap corresponds to the limit of tolerance that my friend has before he gives up on the game. Or, more conclusively, there is certainly an upper limit in terms of his life span.
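To see why the expected wait diverges: stage n is reached with probability (1/2)^n and adds 2^(n-1) minutes of waiting, so every stage contributes exactly half a minute in expectation. A quick sketch (the `one_night` simulation and its 60-stage cap are hypothetical choices of mine, matching the "cap in practice" objection):

```python
import random

# Expected wait contributed by stage n: (1/2)**n * 2**(n-1) = 1/2 minute,
# so the partial sums grow without bound.
def expected_wait(max_stages):
    return sum((0.5 ** n) * 2 ** (n - 1) for n in range(1, max_stages + 1))

print(expected_wait(10))   # 5.0 minutes
print(expected_wait(100))  # 50.0 minutes -- diverges as stages increase

# A capped simulation of one night out (hypothetical 60-flip tolerance cap).
def one_night(cap=60):
    total, wait = 0, 1
    for _ in range(cap):
        if random.random() < 0.5:  # heads: he goes home
            return total
        total += wait  # tails: wait, then flip again with doubled wait
        wait *= 2
    return total
```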

Are there any real infinities out there that could translate into infinite utilities? Once again, plausibly no. But it doesn’t seem impossible that such infinities could arise. For instance, even if we wanted to map utilities onto positive-valence experiences and believed that there was a theoretical upper limit on the amount of positivity you could possibly experience in a finite amount of time, we could still appeal to the possibility of an eternity of happiness. If God appeared before you and offered you an eternity of existence in a Heaven, then you would presumably be considering an offer with a net utility of positive infinity. Maybe you think this is implausible (I certainly do), but it is at least a possibility that we could be confronted with real infinities in expected utility calculations.

Reassured that infinite utilities are probably not a priori ruled out, we can now ask: How does expected utility theory handle these scenarios?

The answer is: not well.

There are three general classes of failures:

  1. Failure of dominance arguments
  2. Undefined expected utilities
  3. Nonsensical expected utilities

Failure of dominance arguments

A dominance argument is an argument that says that one action should be preferred over another if it leads to an outcome at least as good no matter what is the case.

Here’s an example. Consider two lotteries: Lottery 1 and Lottery 2. Each one decides whether a player wins by looking at the same fixed random event (say, whether or not a radioactive atom decays within a fixed amount of time T), but the reward for winning differs. If the radioactive atom does decay within time T, then you get $100,000 from Lottery 1 and $200,000 from Lottery 2. If it does not, then you lose $200 from Lottery 1 and lose $100 from Lottery 2. Now imagine that you can choose only one of these two lotteries.

To summarize: If the atom decays, then Lottery 1 gives you $100,000 less than Lottery 2. And if the atom doesn’t decay, then Lottery 1 charges you $100 more than Lottery 2.

In other words, no matter what ends up happening, you are better off choosing Lottery 2 than Lottery 1. This means that Lottery 2 dominates Lottery 1 as a strategy. There is no possible configuration of the world in which you would have been better off choosing Lottery 1 than choosing Lottery 2, so this choice is essentially risk-free.

So we have the following general principle, which seems to follow nicely from a simple application of expected utility theory:

Dominance: If action A1 dominates action A2, then it is irrational to choose A2 over A1.

Amazingly, this straightforward and apparently obvious rule ends up failing us when we start to talk about infinite payoffs.

Consider the following setup:

Action 1
U = { ∞, 0 }
P = { .5, .5 }

Action 2
U = { ∞, 10 }
P = { .5, .5 }

Action 2 weakly dominates Action 1. This means that no matter what consequence ends up obtaining, we always end up at least as well off taking Action 2 as we would have taking Action 1. But when we calculate the expected utilities…

EU(Action 1) = .5 ∙ ∞ + .5 ∙ 0 = ∞
EU(Action 2) = .5 ∙ ∞ + .5 ∙ 10 = ∞

… we find that the two actions are apparently equal in utility, so we should have no preference between them.

This is pretty bizarre. Imagine the following scenario: God is about to appear in front of you and ship you off to Heaven for an eternity of happiness. In the few minutes before he arrives, you are able to enjoy a wonderfully delicious-looking Klondike bar if you so choose. Obviously the rational thing to do is to eat the Klondike bar, right? Apparently not, according to expected utility theory. The additional little burst of pleasure you get fades into irrelevance as soon as the infinities enter the calculation.
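The indifference falls right out if we naively run the numbers with IEEE floating-point infinities (a sketch of the arithmetic, not a claim about how utilities should really be represented):

```python
import math

inf = math.inf

# Naive expected utilities for the two actions above.
eu_action_1 = 0.5 * inf + 0.5 * 0    # half chance of Heaven, half of nothing
eu_action_2 = 0.5 * inf + 0.5 * 10   # half chance of Heaven, half of a Klondike bar

print(eu_action_1 == eu_action_2)  # True: the finite bonus vanishes entirely
```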

Not only do infinities make us indifferent between two actions, one of which dominates the other, but they can even make us end up choosing actions that are clearly dominated! My favorite example of this is one that I’ve talked about earlier, featuring a recently deceased Donald Trump sitting in Limbo negotiating with God.

To briefly rehash this thought experiment, every day Donald Trump is given an offer by God that he spend one day in Hell and in reward get two days in Heaven afterwards. Each day, the rational choice is for Trump to take the offer, spending one more day in Hell before being able to receive his reward. But since he accepts the offer every day, he ends up always delaying his payout in Heaven, and therefore spends all of eternity in Hell, thinking that he’s making a great deal.

We can think of Trump’s reason for accepting each day as a simple expected utility calculation: U(2 days in Heaven) + U(1 day in Hell) > 0. But iterating this decision an infinity of times ends up leaving Trump in the worst possible scenario – eternal torture.

Undefined expected utilities

Now suppose that you get the following deal from God: Either (Option 1) you die and stop existing (suppose this has utility 0 to you), or (Option 2) you die and continue existing in the afterlife forever. If you choose the afterlife, then your schedule will be arranged as follows: 1,000 days of pure bliss in heaven, then one day of misery in hell. Suppose that each day of bliss has finite positive value to you, and each day of misery has finite negative value to you, and that these two values perfectly cancel each other out (a day in Hell is as bad as a day in Heaven is good).

Which option should you take? It seems reasonable that Option 2 is preferable, as you get a thousand to one ratio of happiness to unhappiness for all of eternity.

Option 1: 💀, 💀, 💀, 💀, …
Option 2: 😇 x 1000, 😟, 😇 x 1000, 😟, …

Since U(💀) = 0, we can calculate the expected utility of Option 1 just fine. But what about Option 2? The answer we get depends on the order in which we add up the utilities of each day. If we take the days in chronological order, then we get a total infinite positive utility. If we alternate between Heaven days and Hell days, then we get a total expected utility of zero. And if we add up in the order (Hell, Hell, Heaven, Hell, Hell, Heaven, …), then we end up getting an infinite negative expected utility.
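We can watch this order-dependence directly by computing partial sums under each ordering, assigning each Heaven day +1 and each Hell day -1 (per the stipulation that they cancel exactly):

```python
# Partial sums of the Heaven/Hell series under three different orderings.

def chronological(n_blocks):
    # Schedule order: 1000 Heaven days, then 1 Hell day, repeated.
    # Each block nets +999, so partial sums diverge to +infinity.
    return sum(1000 * (+1) + 1 * (-1) for _ in range(n_blocks))

def alternating(n_pairs):
    # Reordered: one Heaven day, one Hell day, repeated. Sums stay at 0.
    return sum((+1) + (-1) for _ in range(n_pairs))

def hell_heavy(n_triples):
    # Reordered: Hell, Hell, Heaven, repeated. Sums fall by 1 per triple,
    # diverging to -infinity.
    return sum((-1) + (-1) + (+1) for _ in range(n_triples))

print(chronological(10))  # 9990
print(alternating(10))    # 0
print(hell_heavy(10))     # -10
```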

In other words, the expected utility of Option 2 is undefined, giving us no guidance as to which we should prefer. Intuitively, we would want a rational theory of preference to tell us that Option 2 is preferable.

A slightly different example of this: Consider the following three lotteries:

Lottery 1
U = { ∞, -∞ }
P = { .5, .5 }

Lottery 2
U = { ∞, -∞ }
P = { .01, .99 }

Lottery 3
U = { ∞, -∞ }
P = { .99, .01 }

Lottery 1 corresponds to flipping a fair coin to determine whether you go to Heaven forever or Hell forever. Lottery 2 corresponds to picking a number between 1 and 100 to decide. And Lottery 3 corresponds to getting to pick 99 numbers between 1 and 100 to decide. It should be obvious that if you were in this situation, then you should prefer Lottery 3 over Lottery 1, and Lottery 1 over Lottery 2. But here, again, expected utility theory fails us. None of these lotteries have defined expected utilities, because ∞ – ∞ is not well defined.
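Floating-point arithmetic agrees that the difference is undefined: ∞ – ∞ evaluates to NaN, not to any usable expected utility. A sketch:

```python
import math

inf = math.inf

# Lottery 1: a fair coin between Heaven forever and Hell forever.
eu_lottery_1 = 0.5 * inf + 0.5 * (-inf)

print(eu_lottery_1)              # nan
print(math.isnan(eu_lottery_1))  # True: no defined expected utility
```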

Nonsensical expected utilities

A stranger approaches you and demands twenty bucks, on pain of an eternity of torture. What should you do?

Expected utility theory tells us that as long as we have some non-zero credence in this person’s threat being credible, then we should hand over the twenty bucks. After all, a small but nonzero probability multiplied by -∞ is still just -∞.

Should we have a non-zero credence in the threat being credible? Plausibly so. To have a zero credence in the threat’s credibility is to hold that no possible evidence could make it any more likely. Is it really true that no experience you could have would make the threat any more credible? What if the stranger demonstrated incredible control over the universe?

In the end, we have an inconsistent triad.

  1. The rational thing to do is that which maximizes expected utility.
  2. There is a nonzero chance that the stranger threatening you with eternal torture is actually able to follow through on this threat.
  3. It is irrational to hand over the twenty dollars to the stranger.

This is a rephrasing of Pascal’s wager, but without the same problems as that thought experiment.

Two-factor markets for sex

A two-factor market is essentially just a market in which two parties must seek each other out to make trades. Say I want to make a new website-hosting platform to compete against WordPress. Well, just making the platform isn’t enough. Even if I create a platform that is objectively better than WordPress for both readers and creators, neither side will spontaneously start using the platform unless they think that the other side will as well.

A content creator has little incentive to move over from WordPress to my platform, because there are no readers there. And readers have little incentive to check out my platform, because there are no content creators using it. In other words, there exists a signaling equilibrium around WordPress as the place for finding and creating online content. Bloggers come to WordPress because they know that it is a good place to find lots of readers, and readers come to WordPress because they know it is a good place to find lots of blogs.

This is a natural result of a two-factor market, and can result in some unfortunate suboptimalities. For instance, I’ve already suggested that an objectively better website-hosting platform might never become widely utilized, because of the nature of this equilibrium. A company like WordPress can exploit this by not investing as heavily in the quality of their product as they would have if the market was perfectly competitive.

Sexual selection looks like it has some of these features. If female birds on average favor a certain streak of red on the head of the males of the species, then we should expect that both this streak of red and the favoring of this streak will increase over time. Once streak-of-red has become a dominant sexually-selected-for trait, it is much harder for streak-of-green to gain prominence in the population. For this to happen, it requires not just a male with a streak of green, but a female that finds this attractive; i.e. the market for sex is a two-factor market. In the end, this trait will only gain prominence if it can beat out the existing red-streak equilibrium.

This two-factor market is coupled to a feedback loop that can further entrench these resulting equilibria. This is reflected in the fact that the products of the “exchanges” in this market are more red-streaked birds and red-streak-favoring birds. This would be as if Craigslist exchanges spawned new human buyers and sellers that would flock to Craigslist. In general, males in a species are attracted to females in that species that have certain specific traits, and females seek out males with certain traits. This results in equilibria in a sexually dimorphic population where both sexes have distinctive stable traits that they find attractive in each other.

In addition, this equilibrium is made more stable by the feedback nature of the market – the fact that the children resulting from the pairing of individuals with given traits are more likely to have those traits. Since the population is stuck in this stable equilibrium, it may prove resistant to change, even when that change would be a net gain in average fitness for the individuals in that population. So, for instance, if there exists a strong enough equilibrium around courtship practices in a certain species of bird, then these courtship practices may exist long past the point where there is any resemblance between the practice and any credible signal of evolutionary fitness.

Some possible examples of this might be the enormous antlers of Irish elk and the majestic tails of peacocks. What sort of evolutionary explanation could justify such opulence and apparent squandering of metabolic resources? Costly signaling is a standard explanation, the idea being that the enormous apparent waste of resources is a way of providing a credible signal to mates of their survival fitness. It’s like saying “if I’m able to waste all of these resources and still be doing fine, then you know that I’m more fit than somebody that’s doing just as well without wasting resources.” Think about an expert chess player playing you without one of his knights, and still managing to beat you, versus an expert chess player that beats you without a handicap. If an organism is sufficiently high-fitness, then handicapping itself can be beneficial as a way of signaling its high fitness over other high fitness individuals.

Even in this explanation, the precise details of how the elk or peacock spend their excess resources are irrelevant. Why is the elk’s energy going to producing enormous antlers, as opposed to any other burdensome bodily structure? The right answer to this may be that there is no real answer – it’s just the result of the type of runaway feedback cycle I’ve described above. What’s surprising and interesting to me is the idea that explanations like costly signaling don’t seem to be needed to explain sexual selection of seemingly arbitrary and wasteful traits; if the argument above is correct, then this would be predicted to happen all on its own.

A simple explanation of Bell’s inequality

Everybody knows that quantum mechanics is weird. But there are plenty of weird things in the world. We’ve pretty much come to expect that as soon as we look out beyond our little corner of the universe, we’ll start seeing intuition-defying things everywhere. So why does quantum mechanics get the reputation of being especially weird?

Bell’s theorem is a good demonstration of how the weirdness of quantum mechanics is in a realm of its own. It’s a set of proposed (and later actually verified) experimental results that seem to defy all attempts at classical interpretation.

The Experimental Results

Here is the experimental setup:


In the center of the diagram, we have a black box that spits out two particles every few minutes. These two particles fly in different directions to two detectors. Each detector has three available settings (marked by 1, 2, and 3) and two bulbs, one red and the other green.

Shortly after a particle enters the detector, one of the two bulbs flashes. Our experiment is simply this: we record which bulb flashes on both the left and right detector, and we take note of the settings on both detectors at the time. We then try randomly varying the detector settings, and collect data for many such trials.

Quick comprehension test: Suppose that which bulb flashes is purely a function of some property of the particles entering the detector, and the settings don’t do anything. Then we should expect that changes in the settings will not have any impact on the frequency of flashing for each bulb. It turns out that we don’t see this in the experimental results.

One more: Suppose that the properties of the particles have nothing to do with which bulb flashes, and all that matters is the detector settings. What do we expect our results to be in this case?

Well, then we should expect that changing the detector settings will change which bulb flashes, but that the variance in the bulb flashes should be able to be fully accounted for by the detector settings. It turns out that this also doesn’t happen.

Okay, so what do we see in the experimental results?

The results are as follows:

(1) When the two detectors have the same settings:
The same color of bulb always flashes on the left and right.

(2) When the two detectors have different settings:
The same color bulb flashes on the left and right 25% of the time.
Different colored bulbs flash on the left and right 75% of the time.

In some sense, the paradox is already complete. It turns out that some very minimal assumptions about the nature of reality tell us that these results are impossible. There is a hidden inconsistency within these results, and the only remaining task is to draw it out and make it obvious.


We’ll start our analysis by detailing our basic assumptions about the nature of the process.

Assumption 1: Lawfulness
The probability of an event is a function of all other events in the universe.

This assumption is incredibly weak. It just says that if you know everything about the universe, then you are able to place a probability distribution over future events. This isn’t even as strong as determinism, as it’s only saying that the future is a probabilistic function of the past. Determinism would be the claim that all such probabilities are 1 or 0, that is, the facts about the past fix the facts about the future.

From Assumption 1 we conclude the following:

There exists a function P(R | everything else) that accurately reports the frequency of the red bulb flashing, given the rest of facts about the universe.

It’s hard to imagine what it would mean for this to be wrong. Even in a perfectly non-deterministic universe where the future is completely probabilistically independent of the past, we could still express what’s going to happen next probabilistically, just with all of the probabilities of events being independent. This is why even naming this assumption lawfulness is too strong – the “lawfulness” could be probabilistic, chaotic, and incredibly minimal.

The next assumption constrains this function a little more.

Assumption 2: Locality
The probability of an event only depends on events local to it.

This assumption is justified by virtually the entire history of physics. Over and over we find that particles influence each others’ behaviors through causal intermediaries. Einstein’s Special Theory of Relativity provides a precise limitation on causal influences; the absolute fastest that causal influences can propagate is the speed of light. The light cone of an event is defined as all the past events that could have causally influenced it, given the speed of light limit, and all future events that can be causally influenced by this event.

Combining Assumption 1 and Assumption 2, we get:

P(R | everything else) = P(R | local events)

So what are these local events? Given our experimental design, we have two candidates: the particle entering the detector, and the detector settings. Our experimental design explicitly rules out the effects of other causal influences, by holding them fixed. The only things that vary are the detector settings, which we the experimenters control, and the types of particles being produced by the central black box. All else is stipulated to be held constant.

Thus we get our third, and final assumption.

Assumption 3: Good experimental design
The only local events relevant to the bulb flashing are the particle that enters the detector and the detector setting.

Combining these three assumptions, we get the following:

P(R | everything else) = P(R | particle & detector setting)

We can think of this function a little differently, by asking about a particular particle with a fixed set of properties.

Pparticle(R | detector setting)

We haven’t changed anything but the notation – this is the same function as what we originally had, just carrying a different meaning. Now it tells us how likely a given particle is to cause the red bulb to flash, given a certain detector setting. This allows us to categorize all different types of particles by looking at all different settings.

Particle type is defined by the tuple
( Pparticle(R | Setting 1), Pparticle(R | Setting 2), Pparticle(R | Setting 3) )

This fully defines our particle type for the purposes of our experiment. The set of particle types is the set of three-tuples of probabilities.

So to summarize, here are the only three assumptions we need to generate the paradox.

Lawfulness: Events happen with probabilities that are determined by facts about the universe.
Locality: Causal influences propagate locally.
Good experimental design: Only the particle type and detector setting influence the experiment result.

Now, we generate a contradiction between these assumptions and the experimental results!


Recall our experimental results:

(1) When the two detectors have the same settings:
The same color of bulb always flashes on the left and right.

(2) When the two detectors have different settings:
The same color bulb flashes on the left and right 25% of the time.
Different colored bulbs flash on the left and right 75% of the time.

We are guaranteed by Assumptions 1 to 3 that there exists a function Pparticle(R | detector setting) that describes the frequencies we observe for a detector. We have two particles and two detectors, so we are really dealing with two functions for each experimental trial.

Left particle: Pleft(R | left setting)
Right particle: Pright(R | right setting)

From Result (1), we see that when left setting = right setting, the same color always flashes on both sides. This means two things: first, that the black box always produces two particles of the same type, and second, that the behavior observed in the experiment is deterministic.

Why must they be the same type? Well, if they were different, then we would expect different frequencies on the left and the right. Why determinism? If the results were at all probabilistic, then even if the probability functions for the left and right particles were the same, we’d expect to still see them sometimes give different results. Since they don’t, the results must be fully determined.

Pleft(R | setting 1) = Pright(R | setting 1) = 0 or 1
Pleft(R | setting 2) = Pright(R | setting 2) = 0 or 1
Pleft(R | setting 3) = Pright(R | setting 3) = 0 or 1

This means that we can fully express particle types by a function that takes in a setting (1, 2, or 3), and returns a value (0 or 1) corresponding to whether or not the red bulb will flash. How many different types of particles are there? Eight!

Abbreviation: Pn = P(R | setting n)
P1 = 1, P2 = 1, P3 = 1 : (RRR)
P1 = 1, P2 = 1, P3 = 0 : (RRG)
P1 = 1, P2 = 0, P3 = 1 : (RGR)
P1 = 1, P2 = 0, P3 = 0 : (RGG)
P1 = 0, P2 = 1, P3 = 1 : (GRR)
P1 = 0, P2 = 1, P3 = 0 : (GRG)
P1 = 0, P2 = 0, P3 = 1 : (GGR)
P1 = 0, P2 = 0, P3 = 0 : (GGG)

Each three-letter string like (RRG) is a short representation of which bulb will flash for each of the three detector settings.

Now we are ready to bring in experimental result (2). In 25% of the cases in which the settings are different, the same bulbs flash on either side. Is this possible given our results? No! Check out the following table, which describes what happens with RRR-type particles and RRG-type particles when the two detectors have different settings.

(Left setting, Right setting)   RRR-type     RRG-type
1, 2                            R, R         R, R
1, 3                            R, R         R, G
2, 1                            R, R         R, R
2, 3                            R, R         R, G
3, 1                            R, R         G, R
3, 2                            R, R         G, R
                                100% same    33% same

Obviously, if the particle always triggers a red flash, then any combination of detector settings will result in a red flash. So when the particles are the RRR-type, you will always see the same color flash on either side. And when the particles are the RRG-type, you end up seeing the same color bulb flash in only two of the six cases with different detector settings.

By symmetry, we can extend this to all of the other types.

Particle type   Same bulb flashes (for different detector settings)
RRR             100%
RRG             33%
RGR             33%
RGG             33%
GRR             33%
GRG             33%
GGR             33%
GGG             100%

Recall, in our original experimental results, we found that the same bulb flashes 25% of the time when the detectors are on different settings. Is this possible? Is there any distribution of particle types that could be produced by the central black box that would give us a 25% chance of seeing the same color?

No! How could there be? No matter how the black box produces particles, the best it can do is generate a distribution without RRRs and GGGs, in which case we would see 33% instead of 25%. In other words, the lowest that this value could possibly get is 33%!
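This enumeration is easy to verify mechanically. The sketch below (my own code, not from any cited source) checks all eight particle types and confirms that no mixture of them can get below 1/3:

```python
from itertools import product

# All eight deterministic particle types; e.g. ('R', 'R', 'G') means
# "red flashes on settings 1 and 2, green on setting 3".
types = list(product("RG", repeat=3))

def same_color_fraction(ptype):
    # The black box sends the same type to both sides, so with left setting a
    # and right setting b, the bulbs match exactly when ptype[a] == ptype[b].
    # We average over the six ordered pairs of *different* settings.
    pairs = [(a, b) for a in range(3) for b in range(3) if a != b]
    return sum(ptype[a] == ptype[b] for a, b in pairs) / len(pairs)

for t in types:
    print("".join(t), round(same_color_fraction(t), 2))

# RRR and GGG match 100% of the time; the six mixed types match 1/3 of the
# time. Any probability mixture over types is therefore at least 1/3 --
# the observed 25% is unreachable.
print(min(same_color_fraction(t) for t in types))
```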

This is the contradiction. Bell’s inequality points out a contradiction between theory and observation:

Theory: P(same color flash | different detector settings) ≥ 33%
Experiment: P(same color flash | different detector settings) = 25%


We have a contradiction between experimental results and a set of assumptions about reality. So one of our assumptions has to go. Which one?

Assumption 3: Experimental design. Good experimental design can be challenged, but this would require more detail on precisely how these experiments are done. The key feature of this is that you would have to propose a mechanism by which changes to the detector setting end up altering other relevant background factors that affect the experiment results. You’d also have to be able to do this for all the other subtly different variants of Bell’s experiment that give the same result. While this path is open, it doesn’t look promising.

Assumption 1: Lawfulness. Challenging the lawfulness of the universe looks really difficult. As I said before, I can barely imagine what a universe that doesn’t adhere to some version of Assumption 1 looks like. It’s almost tautological that some function will exist that can probabilistically describe the behavior of the universe. The universe must have some behavior, and why would we be unable to describe it probabilistically?

Assumption 2: Locality. This leaves us with locality. This is also really hard to deny! Modern physics has repeatedly confirmed that the speed of light acts as a speed limit on causal interactions, and that any influences must propagate locally. But perhaps quantum mechanics requires us to overthrow this old assumption and reveal it as a mere approximation to a deeper reality, as has been done many times before.

If we abandon number 2, we are allowing for the existence of statistical dependencies between variables that are entirely causally disconnected. Here’s Bell’s inequality in a causal diagram:

[Causal diagram of the Bell experiment, with the left and right particles connected by an “entanglement” link]

Since the detector settings on the left and the right are independent by assumption, we end up finding an unexplained dependence between the left particle and the right particle. Neither the common cause between them nor any sort of subjunctive dependence à la timeless decision theory is able to explain away this dependence. In quantum mechanics, this dependence is given a name: entanglement. But of course, naming it doesn’t make it any less mysterious. Whatever entanglement is, it is something completely new to physics, and it challenges our intuitions about the very structure of causality.

Cults, tribes, states, and markets

The general problem solved by Civilization is how to get a bunch of people with different goals, each partial to themselves, to live together in peace and build a happy society instead of all just killing each other. It’s easy to forget just how incredibly hard a problem this is. The lesson of game theory is that even two people whose interests are only partially misaligned can end up in shitty, suboptimal Nash equilibria where they’re both worse off, each behaving in a way that looks perfectly rational. Generalize this to twenty people, or a thousand people, or 300 million people, and you start to get a sense of how surprising it is that civilization exists on the scale that it does at all.
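The classic illustration of that lesson is the prisoner’s dilemma. Here’s a minimal sketch (my own toy example; the payoff numbers are arbitrary) showing that mutual defection is the unique Nash equilibrium, even though both players would prefer mutual cooperation:

```python
# Payoffs for a standard prisoner's dilemma (higher is better).
# Each entry maps (player 0's action, player 1's action) to their payoffs.
PAYOFFS = {
    ("cooperate", "cooperate"): (2, 2),
    ("cooperate", "defect"):    (0, 3),
    ("defect",    "cooperate"): (3, 0),
    ("defect",    "defect"):    (1, 1),
}
ACTIONS = ("cooperate", "defect")

def best_response(opponent_action, player):
    """The action maximizing this player's payoff against a fixed opponent."""
    if player == 0:
        return max(ACTIONS, key=lambda a: PAYOFFS[(a, opponent_action)][0])
    return max(ACTIONS, key=lambda a: PAYOFFS[(opponent_action, a)][1])

def is_nash(a0, a1):
    """A profile is a Nash equilibrium if neither player gains by deviating."""
    return best_response(a1, 0) == a0 and best_response(a0, 1) == a1

nash = [(a0, a1) for a0 in ACTIONS for a1 in ACTIONS if is_nash(a0, a1)]
print(nash)  # [('defect', 'defect')] -- worse for both than mutual cooperation
```

Each player’s defection is individually rational no matter what the other does, which is exactly why the cooperative outcome is unstable without something extra holding it in place.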

Yes, history tells many thousands of tales about large-scale defecting (civil wars, corruption, oppressive treatment of minority populations, outbreaks of violence and lawlessness, disputes over the line of succession) and the anarchic chaos that results, but it’s easy to imagine it being way, way worse. People are complex things with complex desires, and when you put that many people together, you should expect some serious failures. Hell, even a world of selfless altruists with shared goals would still have a tough time solving coordination problems of this size. And nobody thinks that the average person is a selfless altruist, so what gives?

Part of the explanation comes from psychologists like Jonathan Haidt and Joshua Greene, who detail the process by which humans evolved a moral sense that involved things like tit-for-tat emotional responses and tribalistic impulses. This baseline level of desire to form cooperative equilibria with friends helps push the balance away from chaos towards civilization, but it can’t be the whole explanation. After all, history does not reveal a constant base-rate of cooperative capacity between different humans, but instead tells a story of increasingly large-scale and complex civilizations. We went from thousands of small tribes scattered across Africa and Asia, to chiefdoms of tens of thousands of individuals all working together, to vast empires that were home to millions of humans, and finally to today’s complex balance of global forces that make up a cooperative web that we are all part of. And we did this in the space of some ten thousand years.

This is not the type of timescale over which we can reasonably expect that evolution drastically reshaped our brains. Our moral instincts (love of kin, loyalty to friends, deference to authority, altruistic tendencies) can help us explain the cooperation we saw in 6000 B.C.E. in a tribe of a few hundred individuals. But they aren’t as helpful when we’re talking about the global network of cooperation, in which lawfulness is ensured by groups of individuals thousands of miles away, in which virtually every product that we rely on in our day-to-day life is the result of a global supply chain that brings together thousands of individuals who have never even seen each other, and in which a large and growing proportion of the world’s population has safe access to hospitals and schools and other fruits of cooperation.

The explanation for this immense growth of humanity’s cooperative capacity is the development of institutions. As time passed, different bands of humans tried out different ways of structuring their social order. Some ways of structuring society worked better and lived on to the next generations of humans, who made further experiments in civilizational engineering. I think there is a lot to be learned by looking at the products of this millennia-long selection process for designing stable cooperative structures and seeing what happened to work best. In a previous post I described the TIMN theory of social evolution, which can be thought of as a categorization of the most successful organizational strategies that we’ve invented throughout history. The following categorization is inspired by this framing, but differs in many places.

The State: Cooperation is enforced by a central authority who can punish defectors. This central authority employs vast networks of hierarchically descending authority and systems of bureaucracy to be able to reach out across huge populations and keep individuals from defecting, even if they are nowhere near the actual people in charge. “State” is technically too narrow a term, as these types of structures are not limited to governments, but can include corporate governance by CEOs, religious organizations, and criminal organizations like the Medellin Cartel. Ronfeldt uses the term Institution for this instead, but that sounds too broad to me.

The Market: Cooperation is not enforced by anybody, but instead arises as a natural result of the self-interested behaviors of individuals that each stand to gain through an exchange of goods. Markets have some really nice properties that a structure like the State doesn’t have, such as natural tendencies for exchange rates to equilibrate towards those that maximize efficiency. They are also fantastically good at dealing with huge amounts of complex information that a single central authority would be unable to parse (for instance, a weather event occurs on one coast of the United States, affecting suppliers of certain products, who then adjust their prices to re-equilibrate, which then results in a cascade of changes in consumer behavior across other markets, which also then react, and eventually the “news” of the weather event has traveled to the other coast, adjusting prices so that the products are allocated efficiently). A beautiful feature of the Market structure is that you can get HUGE amounts of people to cooperate in order to produce incredibly innovative and valuable stuff, without this cooperation being explicitly enforced by threats of punishment for defecting. Of course, Markets also have numerous failings, and the nice properties I discussed only apply to certain types of goods (those that are excludable and rival). When the Market structure extends outside of this realm, you see catastrophic failures of organization, the scale of which pose genuine threats to the continued existence of human civilization.

The Tribe: Cooperation is achieved not through a central authority or through mutually beneficial exchange, but through strong kinship and friendship relations. Tribe-type structures spring up naturally all the time in extended families, groups of friends, or shared living situations. Strong loyalty intuitions and communitarian instincts can serve to functionally punish defectors through social exclusion from their tribe, giving it some immunity to invading defector strategies. But the primary mechanism through which cooperation is enforced is the part of our psychology that keeps us from lying to our friends or stealing from our partners, even when we think we can get away with it. The problem with this structure is that it scales really poorly. Our brains can only handle a few dozen real friendships at a time, and typically these relationships require regular contact to be maintained. Historically, this has meant that tribes can only survive for fairly small groups of people that are geographically close to each other, and this is pretty much the range of their effectiveness.

The Cult: The primary idea of this category is that cooperation does not arise from self-interested exchange or from punishment for defectors, but from shared sacred beliefs or values. These beliefs often shape their holders’ entire world-views and relate to intense feelings of meaning, purpose, reverence, and awe. They can be about political ideology, metaphysics, aesthetics, or anything else that carries with it sufficient value as to penetrate into and reshape a whole worldview. The world’s major religions are the most striking examples of this, having been one of the biggest shapers of human behavior throughout history. Different members of the same religion can pour countless hours into dedicated cooperative work, not because of any sense of kinship with one another, but because of a sense of shared purpose.

The Pope won’t throw you in jail if you stop going to church, and you don’t go to make an exchange of goods with your priest (except in some very metaphorical sense that I don’t find interesting). You go because you believe deeply in the importance of going. There are aspects of Science that remind me of the Cult structure, like the hours of unpaid and anonymous work that senior scientists put into reviewing the papers of their colleagues in the field in order to give guidance to journals, grant-funders, or the researchers themselves on the quality of the material. When I’ve asked scientists why they spend so much time doing this when they’re not getting paid or recognized for their work, the responses I’ve gotten make reference to the value of the peer-review process and the joy and importance of advancing the frontier of knowledge. This type of response clearly indicates the sense of Science as a Sacred Value that serves as a driving force in the behavior of many scientists.

A Cult is like a Tribe in many ways, but one that is not limited to small sizes. Cults can grow and become global behemoths, inspiring feelings of camaraderie between total strangers who have nothing in common besides a shared worldview. While the term ‘Cult’ is typically derogatory, I don’t mean it in that sense here. Cults are incredibly powerful ways to get huge numbers of people to work together, despite there being no obvious reason, to anybody outside their worldview, why they should do so. And not only do they inspire large-scale cooperative behavior, but they are powerful sources of meaning and purpose in our lives. This seems tremendously valuable and loaded with potential for developing a better future society. Think about the strength of something like Judaism, and how it persevered through thousands of years of repeated extermination attempts, diasporas, and religious factionalism, all the while maintaining a strong sense of Jewish identity and fervent religious belief. An alien visiting the planet might be baffled trying to understand why this set of beliefs didn’t die out long ago, and what constituted the glue holding the Jewish people together.

I think that the Cult structure is really undervalued in the circles I hang out in, which tend to focus on the irrationality that is often associated with a Cult. This irrationality seems natural enough; a Cult forms around a deeply held belief or set of beliefs, and strong identification with beliefs leads to dogmatism and denial of evidence. I wonder if you could have a “Cult of Rationality”, in which the “sacred beliefs” include explicit dedication to open-mindedness and non-dogmatic thinking, or if this would be in some sense self-defeating. There’s also the memetic aspect of this, which is that not just any idea is apt to become a sacred belief. It might be that the type of person that is deeply invested in rationality is exactly the type that would typically scoff at the idea of a Cult of Rationality, for instance.

Broad strokes: Tribes play on our loyalty and kinship intuitions. States play on our respect for authority. Markets play on our self-interest. And Cults play on our sense of reverence, awe, and sacredness.

Bayesian experimental design

We can use the concepts in information theory that I’ve been discussing recently to explore the idea of optimal experimental design. The main idea is that when deciding which experiment to run out of some set of possible experiments, you should choose the one that will generate the maximum expected information. Said another way, you want to choose experiments whose results you expect to be as surprising as possible, since surprising results provide the strongest evidence.

An example!

Suppose that you have a bunch of face-down cups in front of you. You know that there is a ping pong ball underneath one of the cups, and want to discover which one it is. You have some prior distribution of probabilities over the cups, and are allowed to check under exactly one cup. Which cup should you choose in order to get the highest expected information gain?

The answer to this question isn’t immediately obvious. You might think that you want to choose the cup that you think is most likely to hold the ball, because then you’ll be most likely to find the ball there and thus learn exactly where the ball is. But at the same time, the most likely ball location is also the one that gives you the least information if the ball actually turns up there. If you were already fairly sure that the ball was under that cup, then you don’t learn much by discovering that it was.

Maybe instead the better strategy is to go for a cup that you think is fairly unlikely to be hiding the ball. Then you’ll have a small chance of finding the ball, but in that case will gain a huge amount of evidence. Or perhaps the maximum expected information gain is somewhere in the middle.

The best way to answer this question is to actually do the calculation. So let’s do it!

First, we’ll label the different theories about the cup containing the ball:

{C1, C2, C3, … CN}

Ck corresponds to the theory that the ball is under the kth cup. Next, we’ll label the possible observations you could make:

{X1, X2, X3, … XN}

Xk corresponds to the observation that the ball is under the kth cup.

Now, our prior over the cups will contain all of our past information about the ball and the cups. Perhaps we thought we heard a rattle when somebody bumped one of the cups earlier, or we notice that the person who put the ball under one of the cups was closer to the cups on the right hand side. All of this information will be contained in the distribution P:

(P1, P2, P3, … PN)

Pk is shorthand for P(Ck) – the probability of Ck being true.

Good! Now we are ready to calculate the expected information gain from any particular observation. Let’s say that we decide to observe X3. There are two scenarios: either we find the ball there, or we don’t.

Scenario 1: You find the ball under cup 3. In this case, you previously had a credence of P3 in C3 being true, so you gain -log(P3) bits of information.

Scenario 2: You don’t find the ball under cup 3. In this case, you gain –log(1 – P3) bits of information.

With probability P3, you gain –log(P3) bits of information, and with probability (1 – P3) you gain –log(1 – P3) bits of information. So your expected information gain is just –P3 logP3 – (1 – P3) log(1 – P3).

In general, we see that if you have a prior credence of P in the cup containing the ball, then your expected information gain is:

-P logP – (1 – P) log(1 – P)

What does this function look like?

[Figure: expected information gain -P logP – (1 – P) log(1 – P) plotted as a function of P, peaking at P = 50%]

We see that it has a peak value at 50%. This means that you expect to gain the most information by looking at a cup that you are 50% sure contains the ball. If you are any more or less confident than this, then evidently you learn less than you would have if you were exactly agnostic about the cup.

Intuitively speaking, this means that we stand to learn the most by doing an experiment on a quantity that we are perfectly agnostic about. Practically speaking, however, the mandate that we run the experiment that maximizes information gain ends up telling us to always test the cup that we are most confident contains the ball. This is because no two of your credences can sum to more than 1, so the largest credence is always at least as close to 50% as any of the others.

Even if you are 99% confident that the fifteenth cup out of one hundred contains the ball, you will have just about .01% credence in each of the others containing the ball. Since 99% is closer to 50% than .01%, you will stand to gain the most information by testing the fifteenth cup (although you stand to gain very little information in a more absolute sense).
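Both claims are easy to check numerically. Here’s a quick sketch of the expected-information-gain function (the function name is my own, not anything from the post):

```python
import math

def info_gain(p):
    """Expected bits gained by checking a cup you assign probability p of
    containing the ball: -p*log2(p) - (1-p)*log2(1-p)."""
    if p in (0.0, 1.0):
        return 0.0  # certainty: looking teaches you nothing
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# The curve peaks at p = 0.5, where one look is worth a full bit:
peak = max(range(101), key=lambda k: info_gain(k / 100))
print(peak / 100, info_gain(0.5))  # 0.5 1.0

# And a 99% cup still beats a ~0.01% cup, since 0.99 is closer to 0.5:
print(info_gain(0.99) > info_gain(0.0001))  # True
```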

This generalizes nicely. Suppose that instead of trying to guess whether or not there is a ball under a cup, you are trying to guess whether there is a ball, a cube, or nothing. Now your expected information gain in testing a cup is a function of your prior over the cup containing a ball Pball, your prior over it containing a cube Pcube, and your prior over it containing nothing Pempty.

-Pball logPball – Pcube logPcube – Pempty logPempty

Subject to the constraint that these three priors must add up to 1, what set of (Pball, Pcube, Pempty) maximizes the information gain? It is just (⅓, ⅓, ⅓).

Optimal (Pball, Pcube, Pempty) = (⅓, ⅓, ⅓)

Imagine that you know that exactly one cup is empty, exactly one contains a cube, and exactly one contains a ball, and have the following distribution over the cups:

Cup 1: (⅓, ⅓, ⅓)
Cup 2: (⅔, ⅙, ⅙)
Cup 3: (0, ½, ½)

If you can only peek under a single cup, which one should you choose in order to learn the most possible? I take it that the answer to this question is not immediately obvious. But using these methods in information theory, we can answer this question unambiguously: Cup 1 is the best choice – the optimal experiment.

We can even numerically quantify how much more information you get by checking under Cup 1 than by checking under Cup 2:

Information gain(check cup 1) ≈ 1.58 bits
Information gain(check cup 2) ≈ 1.25 bits
Information gain(check cup 3) = 1 bit

Checking cup 1 is thus 0.33 bits better than checking cup 2, and 0.58 bits better than checking cup 3. Since receiving N bits of information corresponds to ruling out all but 1/2^N possibilities, we rule out 2^0.33 ≈ 1.26 times more possibilities by checking cup 1 than cup 2, and 2^0.58 ≈ 1.5 times more possibilities than cup 3.
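These numbers are straightforward to reproduce; here’s a minimal sketch (the function name is my own):

```python
import math

def expected_bits(dist):
    """Entropy of a distribution over what's under the cup: the expected
    information gained by looking."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

cups = {
    "cup 1": (1/3, 1/3, 1/3),
    "cup 2": (2/3, 1/6, 1/6),
    "cup 3": (0, 1/2, 1/2),
}
for name, dist in cups.items():
    print(name, round(expected_bits(dist), 2))
# cup 1 1.58
# cup 2 1.25
# cup 3 1.0

# Ratio of possibilities ruled out, cup 1 versus cup 2:
print(round(2 ** (expected_bits(cups["cup 1"]) - expected_bits(cups["cup 2"])), 2))  # 1.26
```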

Even more generally, we see that when we can test N mutually exclusive characteristics of an object at once, the test is most informative when our credences in the characteristics are smeared out evenly; P(k) = 1/N.

This makes a lot of sense. We learn the most by testing things about which we are very uncertain. The more smeared out our probabilities are over the possibilities, the less confident we are, and thus the more effective a test will be. Here we see a case in which information theory vindicates common sense!

Why relative entropy

Background for this post: Entropy is expected surprise, A survey of entropy and entropy variants, and Maximum Entropy and Bayes

Suppose you have some old distribution Pold, and you want to update it to a new distribution Pnew given some information.

You want to do this in such a way as to be as uncertain as possible, given your evidence. One strategy for achieving this is to maximize the difference in entropy between your new distribution and your old one.

Max (Snew – Sold) = ∑ -Pnew logPnew – ∑ -Pold logPold

Entropy is expected surprise. So this quantity is the new expected surprise minus the old expected surprise. Maximizing this corresponds to trying to be as much more surprised on average as possible than you expected to be previously.

But this is not quite right. We are comparing the degree of surprise you expect to have now to the degree of surprise you expected to have previously, based on your old distribution. But in general, your new distribution may contain important information as to how surprised you should have expected to be.

Think about it this way.

One minute ago, you had some set of beliefs about the world. This set of beliefs carried with it some degree of expected surprise. This expected surprise is not the same as the true average surprise, because you could be very wrong in your beliefs. That is, you might be very confident in your beliefs (i.e. have very low EXPECTED surprise), but turn out to be very wrong (i.e. have very high ACTUAL average surprise).

What we care about is not how surprised somebody with the distribution Pold would have expected to be, but how surprised you now expect somebody with the distribution Pold to be. That is, you care about the average value of surprise given your new distribution, which is your new best estimate of the actual distribution.

That is to say, instead of using the simple difference in entropies S(Pnew) – S(Pold), you should be using the relative entropy Srel(Pnew, Pold).

Max Srel = ∑ -Pnew logPnew – ∑ -Pnew logPold
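Here’s a small sketch of the distinction (the example distributions are my own invention): the naive quantity compares each distribution’s surprise by its own lights, while the relative entropy judges the old distribution’s surprise using the new distribution.

```python
import math

def entropy(p):
    """Expected surprise under p, as judged by p itself."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """How surprised someone holding q will be on average, as judged by p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def relative_entropy(p_new, p_old):
    """Srel(Pnew, Pold) = entropy minus cross entropy. This equals
    -D_KL(Pnew || Pold), so it is always <= 0, peaking at 0 when
    the two distributions agree."""
    return entropy(p_new) - cross_entropy(p_new, p_old)

p_old = (0.9, 0.05, 0.05)  # overconfident old beliefs
p_new = (0.1, 0.45, 0.45)  # updated beliefs

print(entropy(p_new) - entropy(p_old))  # naive entropy difference
print(relative_entropy(p_new, p_old))   # negative: by your new lights, Pold's holder is in for a surprise
```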

Here’s a diagram describing the three species of entropy: entropy, cross entropy, and relative entropy.

[Figure: the three species of entropy: entropy, cross entropy, and relative entropy]

As one more example of why this makes sense: imagine that one minute ago you were totally ignorant and knew absolutely nothing about the world, but were for some reason very irrationally confident about your beliefs. Now you are suddenly intervened upon by an omniscient Oracle that tells you with perfect accuracy exactly what is truly going on.

If your new beliefs are designed by maximizing the absolute gain in entropy, then you will be misled by your old irrational confidence; your old expected surprise will be much lower than it should have been. If you use relative entropy, then you will be using your best measure of the actual average surprise for your old beliefs, which might have been very large. So in this scenario, relative entropy is a much better measure of your actual change in average surprise than the absolute entropy difference, as it avoids being misled by previous irrationality.

A good way to put this is that relative entropy is better because it uses your current best information to estimate the difference in average surprise. While maximizing absolute entropy differences will give you the biggest change in expected surprise, maximizing relative entropy differences will do a better job at giving you the biggest difference in *actual* surprise. Relative entropy, in other words, allows you to correct for previous bad estimates of your average surprise, and substitute in the best estimate you currently have.

These two approaches, maximizing absolute entropy difference and maximizing relative entropy, can give very different answers for what you should believe. It so happens that the answers you get by maximizing relative entropy line up nicely with the answers you get from just ordinary Bayesian updating, while the answers you get by maximizing absolute entropy differences do not, which is why this difference is important.
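As a sanity check of that last claim, here is a toy numerical sketch (the prior (0.5, 0.4, 0.1) and the grid search are my own illustrative choices, not anything from the post). Suppose you learn that the first possibility is ruled out: maximizing relative entropy lands on the Bayesian conditional distribution, while maximizing the absolute entropy difference just flattens everything out.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def relative_entropy(p_new, p_old):
    # Srel(Pnew, Pold) = -D_KL(Pnew || Pold) = sum Pnew * log(Pold / Pnew)
    return sum(pn * math.log2(po / pn) for pn, po in zip(p_new, p_old) if pn > 0)

p_old = (0.5, 0.4, 0.1)

# Evidence: the first possibility is ruled out, so Pnew = (0, t, 1 - t).
candidates = [(0.0, t / 1000, 1 - t / 1000) for t in range(1, 1000)]

by_rel_entropy = max(candidates, key=lambda p: relative_entropy(p, p_old))
by_entropy_diff = max(candidates, key=lambda p: entropy(p) - entropy(p_old))

print(by_rel_entropy)   # ~(0, 0.8, 0.2): the Bayesian conditional (0.4/0.5, 0.1/0.5)
print(by_entropy_diff)  # (0, 0.5, 0.5): ignores the old distribution entirely
```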