Bayes and beyond

You have lots of beliefs about the world. Each belief can be written as a propositional statement that is either true or false. But while each statement is either true or false, your beliefs are more complicated; they come in shades of gray, not black and white. Instead of beliefs being on-or-off, we have degrees of belief – some beliefs are much stronger than others, some have roughly the same strength, and so on. Your smallest degrees of belief are for logical impossibilities – things that you can be absolutely certain are false. Your largest degrees of belief are for absolute certainties, the other side of the coin.

Now, answer for yourself the following series of questions:

  1. Can you quantify a degree of belief?

By quantify, I mean put a precise, numerical value on it. That is, can you in principle take any belief of yours and map it to a real number that represents how strongly you believe it? The “in principle” is doing a lot of work here; maybe you don’t think that you can do this in practice, but does it make conceptual sense to you to think about degrees of belief as quantities?

If so, then we can arbitrarily scale your degrees of belief by translating them into what I’ll call for the moment credences. All of your credences are on a scale from 0 to 1, where 0 is total disbelief and 1 is totally certain belief. We can accomplish this rescaling by subtracting your lowest degree of belief (the one you assign to logical impossibilities) from each degree of belief, and then dividing each result by the difference between your highest and lowest degrees of belief.
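Here is a minimal sketch of that rescaling; the raw numbers below are purely hypothetical, and any bounded numeric scale works the same way:

```python
# Rescale raw degrees of belief onto a 0-to-1 credence scale.
# The raw values are hypothetical; only the transformation matters.

def to_credence(degree, lowest, highest):
    """Subtract the lowest degree of belief, then divide by the full spread."""
    return (degree - lowest) / (highest - lowest)

raw = {
    "2 + 2 = 5": -10.0,                 # a logical impossibility: lowest degree
    "it will rain tomorrow": 2.0,
    "the sun will rise tomorrow": 9.5,
    "2 + 2 = 4": 10.0,                  # a logical certainty: highest degree
}

lo, hi = min(raw.values()), max(raw.values())
credences = {belief: to_credence(d, lo, hi) for belief, d in raw.items()}
print(credences)  # impossibilities map to 0.0, certainties to 1.0
```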

Now,

  2. If beliefs B and B’ are mutually exclusive (i.e. it is impossible for them both to be true), then do you agree that your credence that one or the other of them is true should be the sum of your credences in each individually?

Said more formally, do you agree that if Cr(B & B’) = 0, then Cr(B or B’) = Cr(B) + Cr(B’)? (The equal sign here should be a normative equals sign. We are not asking if you think this is descriptively true of your degrees of beliefs, but if you think that this should be true of your degrees of beliefs. This is the normativity of rationality, by the way, not ethics.)

If so, then your credence function Cr is really a probability function (Cr(B) = P(B)). With just these two questions and the accompanying comments, we’ve pinned down the Kolmogorov axioms for a simple probability space. But we’re not done!

Next,

  3. Do you agree that your credence in two statements B and B’ both being true should be your credence in B’ given that B is true, multiplied by your credence in B?

Formally: Do you agree that P(B & B’) = P(B’ | B) ∙ P(B)? If you haven’t seen this before, this might not seem immediately intuitively obvious, but it can be made so quite easily. To find out how strongly you believe both B and B’, you can first imagine a world in which B is true and judge your credence in B’ in that scenario, and then judge your actual credence that B is true. The conditional probability is important here in order to make sure you are not ignoring possible ways that B and B’ could depend upon each other. If you want to know the chance that both of somebody’s eyes are brown, you need to know (1) how likely it is that their left eye is brown, and (2) how likely it is that their right eye is brown, given that their left eye is brown. Clearly, if we used an unconditional probability for (2), we would end up ignoring the dependency between the colors of the right and left eye.
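A quick numeric sketch of why the conditional term matters – the eye-color frequencies here are made up purely for illustration:

```python
# Hypothetical eye-color frequencies, purely for illustration.
p_left_brown = 0.55               # P(left eye brown)
p_right_given_left = 0.97         # P(right eye brown | left eye brown): eyes usually match
p_right_brown = 0.55              # unconditional P(right eye brown)

both_using_conditional = p_left_brown * p_right_given_left   # ~0.53, respects the dependency
both_assuming_independence = p_left_brown * p_right_brown    # ~0.30, ignores it
print(both_using_conditional, both_assuming_independence)
```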

Still on board? Good! Number 3 is crucially important. You see, the world is constantly offering you up information, and your beliefs are (and should be) constantly shifting in response. We now have an easy way to incorporate these dynamics.

Say that you have some initial credence in a belief B about whether you will experience E in the next few moments. Now you see that after a few moments pass, you did experience E. That is, you discover that B is true. We can now set P(B) equal to 1, and adjust everything else accordingly:

For all beliefs B’, Pnew(B’) = P(B’ | B)

In other words, your new credences are just your old credences given the evidence you received. What if you weren’t totally sure that B is true? Maybe you want P(B) = .99 instead. Easy:

For all beliefs B’: Pnew(B’) = .99 ∙ P(B’ | B) + .01 ∙ P(B’ | ~B)

In other words, your new credence in B’ is just your credence that B is true, multiplied by the conditional credence of B’ given that B is true, added to your credence that B is false times the conditional credence of B’ given that B is false.
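Here is a small sketch of both update rules applied to a toy joint credence distribution over B and B’ (the numbers are hypothetical); the certain-evidence rule is just the special case of the second rule with the new credence in B set to 1:

```python
# Toy joint credence distribution over two propositions B and B'.
# Keys are (B, B') truth-value pairs; values are prior credences (hypothetical).
joint = {(True, True): 0.30, (True, False): 0.20,
         (False, True): 0.10, (False, False): 0.40}

def conditional(b_prime_value, b_value):
    """P(B' = b_prime_value | B = b_value), read off the joint table."""
    p_b = sum(p for (b, _), p in joint.items() if b == b_value)
    p_both = sum(p for (b, bp), p in joint.items() if b == b_value and bp == b_prime_value)
    return p_both / p_b

# Learn that B is true with certainty: P_new(B') = P(B' | B).
p_new_certain = conditional(True, True)

# Learn that B is true with 99% confidence: P_new(B') = .99 P(B'|B) + .01 P(B'|~B).
p_new_uncertain = 0.99 * conditional(True, True) + 0.01 * conditional(True, False)

print(p_new_certain, p_new_uncertain)   # 0.6 and 0.596 for this toy table
```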

We now have a fully specified general system of updating beliefs; that is, we have a mandated set of degrees of belief at any moment after some starting point. But what of this starting point? Is there a rationally mandated prior credence to have, before you’ve received any evidence at all? I.e., do we have some a priori favored set of prior degrees of belief?

Intuitively, yes. Some starting points are obviously less rational than others. If somebody starts off being totally certain in the truth of one side of an a posteriori contingent debate that cannot be settled as a matter of logical truth, before receiving any evidence for this side, then they are being irrational. So how best to capture this notion of normative rational priors? This is the question of objective Bayesianism, and there are several candidates for answers.

One candidate relies on the notions of surprise and information. Since we start with no information at all, we should start with priors that represent this state of knowledge. That is, we want priors that represent maximum uncertainty. Formalizing this notion gives us the principle of maximum entropy, which says that the proper starting point for beliefs is that which maximizes the entropy function -∑ P log P.
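As a quick numerical illustration (the candidate priors below are made up): over a three-outcome space, the uniform distribution is the one that maximizes -∑ P log P.

```python
import math

def entropy(dist):
    """Shannon entropy -sum(p * log p), using the convention 0 * log 0 = 0."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Candidate priors over three mutually exclusive possibilities (hypothetical).
candidates = {
    "dogmatic": [0.98, 0.01, 0.01],
    "lopsided": [0.70, 0.20, 0.10],
    "uniform":  [1/3, 1/3, 1/3],
}

for name, dist in candidates.items():
    print(name, round(entropy(dist), 4))
# The uniform prior scores highest: it encodes no information about which outcome obtains.
```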

There are problems with this principle, however, and many complicated debates comparing it to other intuitively plausible principles. The question of objective Bayesianism is far from straightforward.

Putting aside the question of priors, we have a formalized system of rules that mandates the precise way that we should update our beliefs from moment to moment. Some of the mandates seem unintuitive. For instance, it tells us that if we get a positive result on a test that is 99% accurate (it correctly classifies 99% of both the diseased and the healthy) for a disease with a 1% prevalence rate, then we have only a 50% chance of having the disease, not 99%. There are many known cases where our intuitive judgments of likelihood differ from the judgments that probability theory tells us are rational.
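The disease-test number is easy to verify with Bayes’ theorem; here is the computation, assuming “99% accurate” means 99% sensitivity and 99% specificity:

```python
prevalence = 0.01       # P(disease)
sensitivity = 0.99      # P(positive | disease)
specificity = 0.99      # P(negative | no disease)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(p_disease_given_positive)   # 0.5
```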

How do we respond to these cases? We only really have a few options. One, we could discard our formalization in favor of the intuitions. Two, we could discard our intuitions in favor of the formalization. Or three, we could accept both, and be fine with some inconsistency in our lives. Presuming that inconsistency is irrational, we have to make a judgment call between our intuitions and our formalization. Which do we discard?

Remember, our formalization is really just the necessary result of the set of intuitive principles we started with. So at the core of it, we’re really just comparing intuitions of differing strengths. If your intuitive agreement with the starting principles was stronger than your intuitive disagreement with the results of the formalization, then presumably you should stick with the formalization.

Another path to adjudicating these cases is to consider pragmatic arguments for our formalization, like Dutch Book arguments, which show that assigning degrees of belief in this way is the only way that cannot be exploited by a bookie to guarantee you a sure loss. You can also be reassured by looking at consistency and convergence theorems, which show the Bayesian’s beliefs converging to the truth in a wide variety of cases.

If you’re still with me, you are now a Bayesian. What does this mean? It means that you think that it is rational to treat your beliefs like probabilities, and that you should update your beliefs by conditioning upon the evidence you receive.

***

So what’s next? Are we done? Have all epistemological issues been solved? Unfortunately not. I think of Bayesianism as a first step into the realm of formal epistemology – a very good first step, but nonetheless still a first. Here’s a simple example of where Bayesianism will lead us into apparent irrationality.

Imagine we have two different beliefs about the world: B1 and B2. B2 is a respectable scientific theory: one that puts its neck out with precise predictions about the results of experiments, and tries to identify a general pattern in the underlying phenomenon. B1 is a “cheating” theory: it doesn’t have any clue what’s going to happen before an experiment, but after an experiment it peeks at the results and pretends that it had predicted it all along. We might think of B1 as the theory that perfectly fits all of the data, but only through over-fitting on the data. As such, B1 is unable to make any good predictions about future data.

What does Bayesianism say about these two theories? Well, consider any single data point. Let’s suppose that B2 does a good job predicting this data point, say, P(D | B2) = 99%. And since B1 perfectly fits the data, P(D | B1) = 1. If our priors in B1 and B2 are written as P1 and P2, respectively, then our credences update as follows:

Pnew(B1) = P(B1 | D) = P1 / (P1 + .99 P2)
Pnew(B2) = P(B2 | D) = .99 P2 / (P1 + .99 P2)

For n similar data points D1, …, Dn, we get:

Pnew(B1) = P(B1 | D1, …, Dn) = P1 / (P1 + .99ⁿ P2)
Pnew(B2) = P(B2 | D1, …, Dn) = .99ⁿ P2 / (P1 + .99ⁿ P2)

What happens to these two credences as n gets larger and larger?


As n gets large, our credence in B1 approaches 100% exponentially quickly, and our credence in B2 drops to 0% exponentially quickly. Even if we start with an enormously low prior in B1, that prior will eventually be swamped as we gather more and more data.
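Here is a quick sketch of this swamping, using an illustrative one-in-a-million prior on the cheating theory:

```python
# How quickly the cheating theory B1 swamps the honest theory B2,
# assuming an (illustrative) tiny prior on B1.
p1, p2 = 1e-6, 1 - 1e-6   # priors on B1 and B2 (hypothetical values)

for n in [10, 100, 1000, 2000]:
    likelihood_ratio = 0.99 ** n                   # P(n data points | B2) / P(n data points | B1)
    posterior_b1 = p1 / (p1 + likelihood_ratio * p2)
    print(n, posterior_b1)
# By n = 2000 the posterior on B1 exceeds 0.99, despite the one-in-a-million prior.
```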

It looks like in this example, the Bayesian is successfully hoodwinked by the cheating theory, B1. But this is not quite the end of the story for Bayes. The only single theory that perfectly predicts all of the data you receive in the infinite evidence limit is basically just the theory that “Everything that’s going to happen is what’s going to happen.” And, well, this is surely true. It’s just not very useful.

If instead we look at B1 as a sequence of theories, one for each new data point, then we have a way out by claiming that our priors drop as we go further in the sequence. This is an appeal to simplicity – a theory that exactly specifies 1000 different data points is more complex than a theory that exactly specifies 100 different data points. It also suggests a precise way to formalize simplicity, by encoding it into our priors.

While the problem of over-fitting is not an open-and-shut case against Bayesianism, it should still give us pause. The core of the issue is that there are more intuitive epistemic virtues than those that the Bayesian optimizes for. Bayesianism mandates a degree of belief as a function of two ingredients: the prior and the evidential update. The second of these, Bayesian updating, solely optimizes for accommodation of data. And setting of priors is typically done to optimize for some notion of simplicity. Since empirically distinguishable theories have their priors washed out in the limit of infinite evidence, Bayesianism becomes a primarily accommodating epistemology.

This is what creates the potential for problems of overfitting to arise. The Bayesian is only optimizing for accommodation and simplicity, but what we want is a framework that also optimizes for prediction. I’ll give two examples of ways to do this: cross validation and posterior predictive checking.

I’ve talked about cross validation previously. The basic idea is that you split a set of data into a training set and a testing set, optimize your model for best fit with the training set, and then see how it performs on the testing set. In doing so, you are in essence estimating how well your model will do on predictions of future data points.

This procedure is pretty commonsensical. Want to know how well your model does at predicting data? Well, just look at the predictions it makes and evaluate how accurate they were. It is also completely outside of standard Bayesianism, and it directly addresses the problem of overfitting. And since the first half of cross validation is training your model to fit the training set, it is optimizing for both accommodation and prediction.
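A minimal sketch of the idea with numpy, fitting polynomials of different degrees to synthetic noisy data and scoring them on a held-out testing set (the data, degrees, and split here are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a smooth underlying curve plus noise.
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)

# Randomly split into a training set and a testing set.
idx = rng.permutation(len(x))
train, test = idx[:30], idx[30:]

for degree in [1, 3, 9]:
    coeffs = np.polyfit(x[train], y[train], degree)              # fit on training data only
    train_err = np.mean((np.polyval(coeffs, x[train]) - y[train]) ** 2)
    test_err = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(degree, round(train_err, 3), round(test_err, 3))
# A flexible model can always drive the training error down; the test column
# estimates how well each model actually predicts unseen data.
```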

Posterior predictive checks are also pretty commonsensical; you ask your model to make predictions for future data, and then see how these predictions line up with the data you receive.

More formally, if you have some set of observable variables X and some other set of parameters A that are not directly observable, but that influence the observables, you can express your prior knowledge (before receiving data) as a prior over A, P(A), and a likelihood function P(X | A). Upon receiving some data D about the values of X, you can update your prior over A as follows:

P(A) becomes P(A | D)
where P(A | D) = P(D | A) P(A) / P(D)

To make a prediction about how likely you think it is that the next data point will be X, given the data D, you must use the posterior predictive distribution:

P(X | D) = ∫ P(X | A) ∙ P(A | D) dA

This gives you a precise probability that you can use to evaluate the predictive accuracy of your model.
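As a concrete sketch, consider the special case of a coin-flipping model where A is a single unobservable parameter (the coin’s bias) and the prior over A is a Beta distribution – my choice of model here, purely for illustration. In that case the posterior predictive integral has a simple closed form:

```python
# Posterior predictive for a Beta-Binomial model.
# A = the coin's unknown bias (not directly observable); X = the outcome of a flip.
# Prior: A ~ Beta(alpha, beta).  Data D: h heads out of n flips.

def posterior_predictive_heads(alpha, beta, h, n):
    """P(next flip is heads | D) = integral of P(heads | A) * P(A | D) dA."""
    # With a Beta prior and a Binomial likelihood, the posterior over A is
    # Beta(alpha + h, beta + n - h), and the integral above is just its mean.
    return (alpha + h) / (alpha + beta + n)

# Uniform prior (alpha = beta = 1), then observe 7 heads in 10 flips.
print(posterior_predictive_heads(1, 1, 7, 10))   # 8/12 ≈ 0.667
```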

There’s another goal that we can aim towards, besides accommodation, simplicity, or prediction. This is distance from truth. You might think that this is fairly obvious as a goal, and that all the other methods are really only attempts to measure this. But in information theory, there is a precise way in which you can specify the information gap between any given theory and reality. This quantity is called the Kullback-Leibler divergence (DKL), and I’ll refer to it as just information divergence.

DKL = ∫ Ptrue log(Ptrue / P) dx

This term, if parsed correctly, represents precisely how much information you gain if you go from your starting distribution P to the true distribution Ptrue.

For example, if you have a fair coin, then the true distribution is given by (Ptrue(H) = .5, Ptrue(T) = .5). You can calculate how far any other theory (P(H) = p, P(T) = 1 – p) is from the truth using DKL.

 DKL = .5 ∙ [ log(1 / 2p) + log(1 / 2(1-p)) ]

I’ve graphed DKL as a function of p here:

[Graph: information divergence DKL as a function of p]

As you can see, the information divergence is 0 for the correct theory that the coin is fair (p = 0.5), and goes to infinity as you get further away from this.
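The shape of that curve is easy to check numerically; here is a short sketch evaluating the coin-flip DKL at a few values of p:

```python
import math

def dkl_from_fair_coin(p):
    """Information divergence from the fair-coin truth to the theory (p, 1 - p)."""
    return 0.5 * (math.log(1 / (2 * p)) + math.log(1 / (2 * (1 - p))))

for p in [0.1, 0.3, 0.5, 0.7, 0.9, 0.99]:
    print(p, round(dkl_from_fair_coin(p), 4))
# The divergence is 0 at p = 0.5 and grows without bound as p approaches 0 or 1.
```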

This is all well and good, but how is this practically applicable? It’s easy to minimize the distance from the true distribution if you already know the true distribution, but the problem is exactly that we don’t know the truth and are trying to figure it out.

Since we don’t have direct access to Ptrue, we must resort to approximations of DKL. The most famous approximation is called the Akaike information criterion (AIC). I won’t derive the approximation here, but will present the form of this quantity.

AIC = 2k – 2 log(P(data | M))
where M = the model being evaluated,
k = the number of parameters in M,
and the likelihood P(data | M) is evaluated at M’s best-fit (maximum likelihood) parameters

The model that minimizes this quantity probably also minimizes the information distance from truth. Thus, “lower AIC value” serves as a good approximation to “closer to the truth”. Notice that AIC explicitly takes simplicity into account: the quantity k tells you how complex a model is. This is pretty interesting in its own right; it’s not obvious why a method solely focused on getting close to the truth should end up explicitly including a term that rewards simplicity.
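Here is a sketch of AIC-style model comparison on synthetic data, scoring polynomial models under a Gaussian noise assumption (the data and candidate models are illustrative, not anything prescribed above):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)   # synthetic data

def aic(degree):
    """AIC = 2k - 2 * (maximized log-likelihood) for a polynomial model with Gaussian noise."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = np.mean(residuals ** 2)            # maximum-likelihood estimate of the noise variance
    k = degree + 2                              # polynomial coefficients plus the noise variance
    log_likelihood = -0.5 * len(x) * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * k - 2 * log_likelihood

for degree in [1, 3, 9]:
    print(degree, round(aic(degree), 2))
# Raising the degree always shrinks the residuals, but the 2k penalty means the
# simplest model that fits adequately tends to get the lowest (best) AIC score.
```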

Here’s a summary table describing the methods I’ve talked about here (as well as some others that I haven’t talked about), and what they’re optimizing for.

Goal: Which theory makes the data most likely?
Method(s): Maximum likelihood estimation (MLE), p-testing

Goal: Which theory is most likely, given the data?
Method(s): Bayes, Bayesian information criterion (BIC)

Goal: Maximum uncertainty
Method(s): Entropy, relative entropy

Goal: Simplicity
Method(s): Minimum description length, Solomonoff induction

Goal: Predictive accuracy
Method(s): Cross validation, posterior predictive checks

Goal: Distance from truth
Method(s): Information divergence (DKL), Akaike information criterion (AIC)

Racism and identity

I recently saw that a friend of a friend of mine was writing in a blog about her experience as a mixed race woman in America, and all of the ways in which she feels that she suffers from explicit and implicit discrimination. The impression she conveyed was that she walked around intensely aware of her skin color, and felt that others were equally aware. In her world, people looked at her as primarily a brown woman, a strange and exotic other. She talked about the emotional shock she has to go through when returning to the United States after visiting her family in Thailand, in dealing with the fact that Thai culture is so underrepresented here. There was a lot of anger, a feeling of not being accepted by the majority culture around her, and most of all, a sense of being disrespected and harmed on the basis of her ethnicity.

Whenever I hear people like her talking like this, I get really confused. I am a mixed-race person, living in the same city as her, surrounded by probably very similar people, and yet we seem to live in completely different worlds. I know that the idea of color-blindness is not in vogue, but I walk around literally entirely unaware of my skin color and feel fairly confident that almost everybody else I run into is similarly unaware of it.

I’m somebody that’s fairly attuned to social signals – I feel like if I was being slighted on the basis of my ethnicity, I would notice it – and I’m also not somebody that could remotely pass for white. So I’m left wondering… what’s going on here? How can two people have such radically different experiences of living with their ethnicities, when it seems like so many of the variables are the same?

One answer is that some of the variables that appear to be the same actually aren’t. For instance, while we’re both mixed race, we are different mixes of races. While I could pass (and have passed) for Black, Hispanic, Middle Eastern, or Indian, I’ve never been identified as Southeast Asian. So perhaps while Black/Hispanic/Middle Eastern/Indian people face very little racism in my town, Southeast Asians are relentlessly oppressed. Hmm, somehow that seems wrong…

Maybe a relevant difference is the social circles we surround ourselves with. From what I know of this person, she surrounds herself with people that are very concerned with social justice issues. It seems fairly plausible to me that the types of people that are very concerned with social justice are also going to be very sensitive to racial and ethnic identities, and will be much more likely to see somebody as a mixed-race person (and treat this as an important aspect of their identity). Incidentally, the few people around whom I’ve actually felt conscious of my skin color or ethnicity have been exactly those who are most vocal and passionate about their anti-racism and social justice concerns. Also, anecdotally, the people I know who most strongly emphasize feelings of personal oppression happen to surround themselves with social justice types. Of course, this doesn’t indicate the direction of causation – it could be that those that feel oppressed seek out social justice types that will affirm their feelings of being wronged.

Another possibility to explain the difference in perceptions is that one of us is just wrong. Maybe the oppression and constant discrimination and other-ing is actually in my Southeast Asian friend-of-friend’s head. Or maybe I’m actually being horribly oppressed and discriminated against and just don’t know it. Maybe I’m just extremely lucky and have by chance avoided all the nasty racists in my town. (If one of us is wrong, I’m betting it’s her.)

But this isn’t the only time I’ve noticed this disconnect in experiences. I’m reminded of a debate I watched a while back about sexual harassment. The actual debate itself wasn’t too interesting, but I found the Q&A period fascinating. Many different women stood up and spoke about their personal experiences of sexual harassment in their daily life, and what they said completely contradicted each other. Some women claimed that they felt sexually harassed or at risk of sexual harassment virtually all the time, even while walking in a public area in the middle of the day or shopping for groceries. Other women claimed that they had never been catcalled, nor sexually harassed or discriminated against because of their sex.

Keep in mind: this was a live debate, with a local audience. All of these women lived in the same area. There weren’t obvious differences in their appearances, or ages, or mannerisms, although there were significant differences in their views on sexual harassment (for obvious reasons). Also keep in mind that some of the claims being made were literally just objectively verifiable factual statements. It’s not as if the disagreement was over whether others had objectifying thoughts about them because of their sex; the differences were about things like whether or not they are verbally catcalled while walking downtown. There’s got to be an actual fact about how likely the average woman is to be catcalled on a given street.

This is pretty hard to make sense of, and it seems like the exact same phenomenon as what’s going on with my friend’s friend and me. People who should be living in similar worlds mentally feel like they are living in completely different worlds.

One last example: I’ve had similar experiences with my sister. She is the same race as me (shocking, I know), with basically the same amount of exposure to the non-American side of our cultural heritage, has lived in the same city as me for most of our lives, and is not too different in age from me. But she talks about a strong sense of feeling discriminated against as a brown woman, and has described experiences of oppression that seem totally foreign to me.

Perhaps a component of all of this can be explained by incentives to exaggerate. This aligns with my sense that those that think they are oppressed hang around with social justice types. A lot of social justice culture seems to be devoted to jockeying for oppression points and finding ways to appear as unprivileged as possible. In a social circle in which one can gain social brownie points by being discriminated against, you would expect a general upwards pressure on the level of exaggeration that the average person uses in describing said discrimination.

I feel like I should stop here to emphasize that I’m not suggesting that there isn’t racial and sexual discrimination in the world. There obviously is. What I’m specifically wondering about is how it is that people in little liberal college towns like mine, with fairly similar racial backgrounds, can have such radically different perspectives on the factual matter of the actual oppression they face. It’s especially puzzling to me given that I’m a brown person who has, as far as I can tell, never faced significant drawbacks on the basis of my skin color, and who is unaware of it most of the time.

I think that this unawareness of my skin color provides a hint for explaining what might be actually going on here. Not only am I generally not aware of my skin color, but I have always felt this way. I think that there is a spectrum of natural self-identification tendencies, and a bias towards attributing perceived affronts to the most salient aspects of your identity. Let me unpack this.

It’s not exactly that I’m unaware that I’m brown (I wouldn’t be surprised if somebody showed me a picture of myself and pointed out my skin color). It’s that my brownness is a nonexistent component of the way I think about myself. As far back as I can remember, the salient features of my sense of self have been things like my way of thinking and my personality. I’ve always identified myself as mostly a mind, not a body. I even remember a few bizarre experiences where I looked in a mirror and was momentarily struck by a surreal sense of disconnect, that I happen to exist within this body that seems so obviously distinct from me.

It is also the case that when I perceive that others dislike me and don’t have any sense of why this may be, I naturally tend to assume that their dislike relates to some aspect of my mental characteristics; maybe they don’t like my style of reasoning, or my sense of humor, or some other aspect of my personality. I will almost never attribute their dislike to some physical characteristics of mine.

I perceive myself as primarily a thinker occupying a body that I don’t strongly identify with. But other people identify much more closely with their physicality (skin color, facial features, body type, sex, et cetera). It seems plausible to me that just as I perceive affronts as having to do with properties of my mind, those for whom race is a salient component of their personal identity will perceive affronts as having to do with racism, those who identify with their sex will be more likely to see them as sexism, and so on.

This idea of a spectrum of self-identification tendencies is fairly satisfying to me as an explanation of this phenomenon of radically different perceptions of the world. Two people that appear to exist in very similar social environments can have radically different perceptions of their social environments, because of differences in how they conceive of themselves and the way that this affects their framing of their interactions with others. These differing tendencies are not restricted to body-versus-mind. Some people strongly identify themselves with a profession, a cultural heritage, or a nation. Others identify with an ideology or a religion. And in general, the parts of your identity that feel most salient to you are those that will prickle most readily at perceived affronts.

This relates to the psychological notions of attribution and internal versus external loci of control. When you fail a job interview, you blame the morning traffic or the interviewer’s bias; if you had gotten the job, you would have happily credited your interviewing skills and charming smile. When your neighbor fails a job interview, you attribute it to their poor interviewing skills. That is, you place the locus of control over the outcome wherever it is convenient.

This pattern is closely related to the fundamental attribution error. With respect to themselves, people attribute positive outcomes to features of their own identity, and negative outcomes to features of the external world. With respect to others, they attribute positive outcomes to the external world and negative outcomes to the person’s character.

If you strongly identify as a mixed person, then you will see events in your world as being all about your mixed race. And if you identify as a mind floating about in a body, then things like your race or sex or attractiveness will seem mostly irrelevant to explaining the events in your life. This suggests a sort of self-perpetuating cycle whereby those that identify as X will perceive the world as centered around X, further entrenching the self-identification as X.