Simple induction

In front of you is a coin. You don’t know the bias of this coin, but you have some prior probability distribution over possible biases (between 0: always tails, and 1: always heads). This distribution has some statistical properties that characterize it, such as a standard deviation and a mean. And from this prior distribution, you can predict the outcome of the next coin toss.

Now the coin is flipped and lands heads. What is your prediction for the outcome of the next toss?

This is a dead simple example of a case where there is a correct answer to how to reason inductively. It is as correct as any deductive proof, and derives a precise and unambiguous result:

P(next toss lands H | first toss landed H) = (µ² + σ²) / µ = µ · (1 + (σ/µ)²)

where µ and σ are the mean and standard deviation of your prior distribution over the coin’s bias.

This is a law of rational thought, just as rules of logic are laws of rational thought. It’s interesting to me how the understanding of the structure of inductive reasoning begins to erode the apparent boundary between purely logical a priori reasoning and supposedly a posteriori inductive reasoning.

Anyway, here’s one simple conclusion that we can draw from the above equation: after the coin lands heads, it should become more likely that the coin will land heads next time. After all, the initial credence in heads was µ, and the final credence is µ multiplied by a factor that is greater than 1 whenever σ > 0.

You probably didn’t need to see an equation to guess that for each toss that lands H, future tosses landing H become more likely. But it’s nice to see the fundamental justification behind this intuition.

We can also examine some special cases. For instance, consider a uniform prior distribution (corresponding to maximum initial uncertainty about the coin bias). For this distribution (π = 1), µ = 1/2 and σ² = 1/12. Thus we arrive at the conclusion that after getting one heads, your credence in the next toss landing heads should be (1/4 + 1/12)/(1/2) = 2/3 (about 67%, up from 50%).
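
To make this concrete, here is a quick numerical check of the result, done on a discretized grid of possible biases. The grid-based setup and the variable names are just illustrative choices, not anything implied by the derivation above:

```python
import numpy as np

thetas = np.linspace(0, 1, 100_001)      # grid of possible coin biases
prior = np.ones_like(thetas)             # uniform prior density, pi(theta) = 1
prior /= prior.sum()                     # normalize into a discrete distribution

mu = (thetas * prior).sum()                      # prior mean, E[theta]
sigma2 = ((thetas - mu) ** 2 * prior).sum()      # prior variance, sigma^2

closed_form = (mu**2 + sigma2) / mu      # the formula above: (mu^2 + sigma^2) / mu

posterior = thetas * prior               # Bayes: weight each bias by P(H | theta) = theta
posterior /= posterior.sum()
direct = (thetas * posterior).sum()      # P(next toss lands H | first toss landed H)

print(round(mu, 4), round(sigma2, 4))             # 0.5 0.0833   (sigma^2 = 1/12)
print(round(closed_form, 4), round(direct, 4))    # 0.6667 0.6667  (= 2/3, up from 1/2)
```

The closed-form expression and the direct Bayesian update agree, which is all the check is meant to show.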

We can get a sense of the insufficiency of point estimates using this example. Two prior distributions with the same average value will respond very differently to evidence, and thus the final point estimate of the chance of H will differ. But what is interesting is that while the mean is insufficient, just the mean and standard deviation suffice for inferring the value of the next point estimate.

In general, the dynamics are controlled by the term σ/µ. As σ/µ goes to zero (which corresponds to a tiny standard deviation, or a very confident prior), our update goes to zero as well. And as σ/µ gets large (either through a weak prior or a low initial credence in the coin being H-biased), the observation of H causes a greater update. How large can this term possibly get? Obviously, the updated point estimate should never be able to exceed 1, but this is not obvious from the form of the equation we have (it looks like σ/µ can get arbitrarily large, forcing our final point estimate to infinity).

What we need to do is maximize the updated point estimate while taking into account the constraint relating σ and µ. Since every possible bias lies between 0 and 1, the expectation of the squared bias can never exceed the expectation of the bias itself; that is, µ² + σ² ≤ µ, or equivalently σ² ≤ µ(1 - µ). Plugging this in, the updated estimate (µ² + σ²)/µ is capped at exactly 1, as it should be.
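
To see the constraint in action, here is a small sketch that sweeps over Beta priors with a fixed mean and shrinking concentration. The choice of Beta priors and the particular concentration values are just convenient assumptions for the sweep:

```python
import numpy as np

mu = 0.5                                   # hold the prior mean fixed
for s in [100.0, 10.0, 1.0, 0.1, 0.01, 0.001]:       # s = a + b, the prior "concentration"
    sigma2 = mu * (1 - mu) / (s + 1)       # variance of a Beta prior with mean mu, concentration s
    update = (mu**2 + sigma2) / mu         # updated point estimate after one heads
    print(f"s = {s:>7}:  sigma/mu = {np.sqrt(sigma2) / mu:.3f},  updated estimate = {update:.4f}")

# The hard cap: any distribution on [0, 1] with mean mu has sigma^2 <= mu * (1 - mu),
# so the updated estimate (mu^2 + sigma^2) / mu can never exceed mu + (1 - mu) = 1.
print("upper bound:", (mu**2 + mu * (1 - mu)) / mu)   # 1.0
```

As the prior gets weaker the updated estimate creeps up toward 1, but the constraint on σ keeps it from ever crossing it.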

Priors in the supernatural

A friend of mine recently told me the following anecdote.

Years back, she and her then-boyfriend had visited an astrologer in India, who told her the following things: (1) she would end up marrying this boyfriend, (2) down the line they would have two kids, the first a girl and the second a boy, and (3) the exact dates of birth of both children.

Many years down the line, all of these predictions turned out to be true.

I trust this friend a great deal, and don’t have any reason to think that she misremembered the details or lied to me about them. But at the same time, I recognize that astrology is completely crazy.

Since that conversation, I’ve been thinking about the ways in which we can evaluate our de facto priors in supernatural events by consulting either real-world anecdotes or thought experiments. For instance, if we think that two of those predictions each gave us a likelihood ratio of 100:1 in favor of astrology being true, and if I ended up thinking that astrology was about as likely to be true as false, then I must have started with roughly 1:10,000 odds against astrology being true.
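
Spelling out the arithmetic in odds form (the 1:10,000 prior and the 100:1 per-prediction likelihood ratios are just the rough guesses from the paragraph above, not outputs of any careful model):

```python
# Odds-form Bayes: posterior odds = prior odds * product of likelihood ratios.
prior_odds = 1 / 10_000                  # odds in favor of astrology being true
likelihood_ratios = [100, 100]           # two predictions, each judged ~100:1 in favor

posterior_odds = prior_odds
for lr in likelihood_ratios:
    posterior_odds *= lr

posterior_prob = posterior_odds / (1 + posterior_odds)
print(round(posterior_odds, 6), round(posterior_prob, 6))   # 1.0 0.5 -- astrology ends up at even odds
```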

That’s not crazily low for a belief that contradicts much of our understanding of physics. I would have thought that my prior odds would be something much lower, like 1:10¹⁰ or something. But really put yourself in that situation.

Imagine that you go to an astrologer, who is able to predict an essentially unpredictable sequence of events years down the line, with incredible accuracy. Suppose that the astrologer tells you who you will marry, how many kids you’ll have, and the dates of birth of each. Would you really be totally unshaken by this experience? Would you really believe that it was more likely to have happened by coincidence?

Yes, yes, I know the official Bayesian response; I read it in Jaynes long ago. For beliefs like astrology that contradict our basic understanding of science and causality, we should always have reserved some amount of credence for alternate explanations, even if we can’t think of any on the spot. This reserve of credence insures us against jumping to 99% credence upon seeing a psychic repeatedly predict the numbers in our heads, preserving sanity and a nice simple secular worldview.

But that response is not sufficient to rule out all strong evidence for the supernatural.

Here’s one such category of strong evidence: evidence for which all alternative explanations are ruled out by the laws of physics as strongly as the supernatural hypothesis is ruled out by the laws of physics.

I think that the anecdote above is one such case. If it is accurate, then there is no good natural alternative explanation for it. The reason? The information about the dates of birth of my friend’s children did not exist in the world at the time of the prediction, in any form naturally attainable by any human being.

By contrast, imagine you go to a psychic who tells you to put up some fingers behind your back and then predicts over and over again how many fingers you have up. There are hundreds of alternative explanations for this besides “psychics are real and science has failed us.” The reason these alternative explanations exist is that the information predicted by the psychic existed in the world at the time of the prediction.

But in the case of my friend’s anecdote, the information predicted by the astrologer lay hidden far off in the chaotic dynamics of the future.

What this rules out is the possibility that the astrologer somehow obtained the information surreptitiously by any natural means. It doesn’t rule out a host of other explanations, such as that my friend’s perception at the time was mistaken, that her memory of the event is skewed, or that she is lying. I could even, as a last resort, consider the possibility that I hallucinated the entire conversation with her. (I’d like to give the formal title “unbelievable propositions” to the set of propositions that are so unlikely that we should sooner believe that we are hallucinating than accept evidence for them.)

But each of these sources of alternative explanations, with the possible exception of the last, can be made significantly less plausible.

Let me use a thought experiment to illustrate this.

Imagine that you are a nuclear physicist who, along with a group of colleagues, has decided to test the predictive powers of a fortune teller. You carefully design an experiment in which a source of true quantum randomness will produce a number between 1 and N. Before the number has been produced, when it still exists only as an unrealized possibility in the wave function, you ask the fortune teller to predict its value.

Suppose that they get it correct. For what value of N would you begin to take their fortune telling abilities seriously?

Here’s how I would react to the success, for different values of N.

N = 10: “Haha, that’s a funny coincidence.”

N = 100: “Hm, that’s pretty weird.”

N = 1000: “What…”

N = 10,000: “Wait, WHAT!?”

N = 100,000: “How on Earth?? This is crazy.”

N = 1,000,000: “Ok, I’m completely baffled.”

I think I’d start taking them seriously as early as N = 10,000. This indicates prior odds of roughly 1:10,000 against fortune-telling abilities (roughly the same as my prior odds against astrology, interestingly!). Once again, this seems disconcertingly low.
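
To spell out where the 1:10,000 figure comes from: if we grant the crude assumption that a genuine fortune teller would name the right number with probability close to 1, then a correct call is roughly N times more likely under “fortune telling works” than under chance. A quick sketch of the resulting posteriors:

```python
prior_odds = 1 / 10_000                  # prior odds in favor of genuine fortune telling
for n in [10, 100, 1_000, 10_000, 100_000, 1_000_000]:
    likelihood_ratio = n                 # correct call: probability ~1 if genuine vs. 1/n by chance
    post_odds = prior_odds * likelihood_ratio
    post_prob = post_odds / (1 + post_odds)
    print(f"N = {n:>9,}:  P(genuine | correct call) ≈ {post_prob:.4f}")
# The posterior crosses 50% right around N = 10,000, which is why starting to take
# the fortune teller seriously at that point suggests prior odds of roughly 1:10,000.
```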

But let’s try to imagine some alternative explanations.

As far as I can tell, there are only three potential failure points: (1) our understanding of physics, (2) our engineering of the experiment, (3) our perception of the fortune teller’s prediction.

First of all, if our understanding of quantum mechanics is correct, there is no possible way that any agent could do better than random at predicting the number.

Secondly, we stipulated that the experiment was designed meticulously so as to ensure that the information was truly random, and unavailable to the fortune-teller. I don’t think that such an experiment would actually be that hard to design. But let’s go even further and imagine that we’ve designed the experiment so that the fortune teller is not in causal contact with the quantum number-generator until after she has made her prediction.

And thirdly, we can suppose that the prediction is viewed by multiple different people, all of whom affirm that it was correct. We can even go further and imagine that video was taken and broadcast to millions of viewers, all of whom agree that the prediction was correct. Not all of them could be getting it wrong over and over again. The only remaining possibility is that we’re hallucinating not just the experimental result, but also the public reaction and consensus on the experimental result.

But the hypothesis of a hallucination now becomes inconsistent with our understanding of how the brain works! A hallucination wouldn’t have the effect of creating a perception of a completely coherent reality in which everybody behaves exactly as normal except that they saw the fortune teller make a correct prediction. We’d expect that if this were a hallucination, it would not be so self-consistent.

Pretty much all that’s left, as far as I can tell, is some sort of Cartesian evil demon that’s cleverly messing with our brains to create this bizarre false reality. If this is right, then we’re left weighing the credibility of the laws of physics against the credibility of radical skepticism. And in that battle, I think, the laws of physics lose out. (Consider that the invalidity of radical skepticism is a precondition for the development of laws of physics in the first place.)

The point of all of this is just to sketch an example where I think we’d have a good justification for ruling out all alternative explanations, at least with an equivalent degree of confidence that we have for affirming any of our scientific knowledge.

Let’s bring this all the way back to where we started, with astrology. The conclusion of this blog post is not that I’m now a believer in astrology. I think that there’s enough credence in the buckets of “my friend misremembered details”, “my friend misreported details”, and “I misunderstood details” so that the likelihood ratio I’m faced with is not actually 10,000 to 1. I’d guess it’s something more like 10 to 1.

But I am now that much less confident that astrology is wrong. And I can imagine circumstances under which my confidence would be drastically decreased. While I don’t expect such circumstances to occur, I do find it instructive (and fun!) to think about them. It’s a good test of your epistemology to wonder what it would take for your most deeply-held beliefs to be overturned.

Patterns of inductive inference

I’m currently reading through Judea Pearl’s wonderful book Probabilistic Reasoning in Intelligent Systems. It’s chock-full of valuable insights into the subtle patterns involved in inductive reasoning.

Here are some of the patterns of reasoning described in Chapter 1, ordered in terms of increasing unintuitiveness. Any good system of inductive inference should be able to accommodate all of the following.

Abduction:

If A implies B, then finding that B is true makes A more likely.

Example: If fire implies smoke, smoke suggests fire.

Asymmetry of inference:

There are two types of inference that function differently: predictive vs explanatory. Predictive inference reasons from causes to consequences, whereas explanatory inference reasons from consequences to causes.

Example: Seeing fire suggests that there is smoke (predictive). Seeing smoke suggests that there is a fire (explanatory).

Induced Dependency:

If you know A, then learning B can suggest C where it wouldn’t have if you hadn’t known A.

Example: Ordinarily, burglaries and earthquakes are unrelated. But if you know that your alarm is going off, then whether or not there was an earthquake is relevant to whether or not there was a burglary.

Correlated Evidence:

Upon discovering that multiple pieces of evidence have a common origin, the credibility they lend to the hypothesis should be decreased.

Example: You learn on a radio report, TV report, and newspaper report that thousands died. You then learn that all three reports got their information from the same source. This decreases the credibility that thousands died.

Explaining away:

Finding a second explanation for an item of data makes the first explanation less credible. If A and B both suggest C, and C is true, then finding that B is true makes A less credible.

Example: Finding that my light bulb emits red light makes it less credible that the red-hued object in my hand is truly red.

Rule of the hypothetical middle:

If two diametrically opposed assumptions impart two different degrees of belief onto a proposition Q, then the unconditional degree of belief should be somewhere between the two.

Example: The plausibility of an animal being able to fly is somewhere between the plausibility of a bird flying and the plausibility of a non-bird flying.

Defeaters or Suppressors:

Even if, as a general rule, B is more likely given A, this does not necessarily mean that learning A makes B more credible. There may be other elements in your knowledge base K that explain A away. In fact, learning A might cause B to become less likely (Simpson’s paradox). In other words, updating beliefs must involve searching your entire knowledge base for defeaters of general rules that are not directly inferentially connected to the evidence you receive.

Example 1: Learning that the ground is wet does not permit us to increase the certainty of “It rained”, because the knowledge base might contain “The sprinkler is on.”
Example 2: You have kidney stones and are seeking treatment. You additionally know that Treatment A makes you more likely to recover from kidney stones than Treatment B. But if you also have the background information that your kidney stones are large, then your recovery under Treatment A becomes less credible than under Treatment B.
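
Example 2 is an instance of Simpson’s paradox, and it is easy to see how the reversal can happen with a small table of counts. The numbers below are invented purely to exhibit the pattern; they are not from any real study:

```python
# Invented recovery counts (recovered, treated), chosen purely to exhibit
# Simpson's paradox: Treatment A looks better in aggregate, yet Treatment B
# does better within each stone size.
data = {
    ("A", "small"): (900, 1000),
    ("A", "large"): (30, 100),
    ("B", "small"): (93, 100),
    ("B", "large"): (350, 1000),
}

def rate(cells):
    recovered = sum(data[c][0] for c in cells)
    treated = sum(data[c][1] for c in cells)
    return recovered / treated

for size in ["small", "large"]:
    print(f"{size} stones:  A = {rate([('A', size)]):.2f},  B = {rate([('B', size)]):.2f}")
print(f"aggregate:     A = {rate([('A', 'small'), ('A', 'large')]):.2f},"
      f"  B = {rate([('B', 'small'), ('B', 'large')]):.2f}")
# small stones:  A = 0.90,  B = 0.93
# large stones:  A = 0.30,  B = 0.35
# aggregate:     A = 0.85,  B = 0.40   <- the aggregate "general rule" is misleading
```

In these made-up counts the aggregate favors Treatment A only because A is mostly given to the easy small-stone cases; within each stone size, B does better, which is why learning that your stones are large flips the comparison.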

Non-Transitivity:

Even if A suggests B and B suggests C, this does not necessarily mean that A suggests C.

Example 1: Your card being an ace suggests it is an ace of clubs. If your card is an ace of clubs, then it is a club. But if it is an ace, this does not suggest that it is a club.
Example 2: The sprinkler being on suggests that the ground is wet. The ground being wet suggests that it rained. But it’s not the case that the sprinkler being on suggests that it rained.

Non-detachment:

Just learning that a proposition has changed in credibility is not enough to analyze the effects of the change; the reason for the change in credibility is relevant.

Example: You get a phone call telling you that your alarm is going off. Worried about a burglar, you head towards your home. On the way, you hear a radio announcement of an earthquake near your home. This makes it more credible that your alarm really is going off, but less credible that there was a burglary. In other words, the alarm going off ended up decreasing the credibility of a burglary, because it apparently went off as a result of the earthquake, whereas typically an alarm going off would make a burglary more credible.
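
Several of these patterns (induced dependency, explaining away, non-detachment) can be seen at once in a toy burglary/earthquake/alarm model. The probabilities below are invented for illustration; only the qualitative behavior matters:

```python
from itertools import product

P_B = 0.01                       # P(burglary)
P_E = 0.02                       # P(earthquake)
P_A = {                          # P(alarm | burglary, earthquake)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}

def joint(b, e, a):
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    pa = P_A[(b, e)]
    return p * (pa if a else 1 - pa)

def prob(query, given=lambda b, e, a: True):
    worlds = list(product([True, False], repeat=3))
    den = sum(joint(*w) for w in worlds if given(*w))
    num = sum(joint(*w) for w in worlds if given(*w) and query(*w))
    return num / den

print(prob(lambda b, e, a: b))                            # P(burglary) = 0.01
print(prob(lambda b, e, a: b, lambda b, e, a: a))         # P(burglary | alarm) ~ 0.58
print(prob(lambda b, e, a: b, lambda b, e, a: a and e))   # P(burglary | alarm, earthquake) ~ 0.03
```

With these made-up numbers, hearing the alarm pushes the credibility of a burglary from 1% up to roughly 58%, and then learning about the earthquake knocks it back down to about 3%: the same piece of evidence gets largely explained away once its more probable cause shows up.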

✯✯✯

All of these patterns should make a lot of sense to you when you give them a bit of thought. It turns out, though, that accommodating them in a system of inference is no easy matter.

Pearl distinguishes between extensional and intensional systems, and talks about the challenges for each approach. Extensional systems (including fuzzy logic and non-monotonic logic) focus on extending the truth values of propositions from {0, 1} to a continuous range of uncertainty [0, 1], and then modifying the rules according to which propositions combine (for instance, the proposition A & B is assigned the value min(v(A), v(B)) in some extensional systems and v(A)·v(B) in others). The locality and simplicity of these combination rules turns out to be their primary failing; they lack the subtlety and nuance required to capture the complicated reasoning patterns above. Their syntactic simplicity makes them easy to work with, but curses them with semantic sloppiness.

On the other hand, intensional systems (like probability theory) assign degrees of plausibility to entire world-states rather than to individual propositions. This allows for the nuance required to capture all of the above patterns, but results in a huge blow-up in complexity. Truly perfect Bayesianism is ridiculously computationally infeasible, as the operation of belief updating blows up exponentially as the number of atomic propositions increases. Thus, intensional systems are semantically clear, but syntactically messy.

A good summary of this from Pearl (p 12):

We have seen that handling uncertainties is a rather tricky enterprise. It requires a fine balance between our desire to use the computational permissiveness of extensional systems and our ability to refrain from committing semantic sins. It is like crossing a minefield on a wild horse. You can choose a horse with good instincts, attach certainty weights to it and hope it will keep you out of trouble, but the danger is real, and highly skilled knowledge engineers are needed to prevent the fast ride from becoming a disaster. The other extreme is to work your way by foot with a semantically safe intensional system, such as probability theory, but then you can hardly move, since every step seems to require that you examine the entire field afresh.

The challenge for extensional systems is to accommodate the nuance of correct inductive reasoning.

The challenge for intensional systems is to maintain their semantic clarity while becoming computationally feasible.

Pearl solves the second challenge by supplementing Bayesian probability theory with causal networks that give information about the relevance of propositions to each other, drastically simplifying the tasks of inference and belief propagation.

One more insight from Chapter 1 of the book… Pearl describes four primitive qualitative relationships in everyday reasoning: likelihood, conditioning, relevance, and causation. I’ll give an example of each, and how they are symbolized in Pearl’s formulation.

1. Likelihood (“Tim is more likely to fly than to walk.”)
P(A)

2. Conditioning (“If Tim is sick, he can’t fly.”)
P(A | B)

3. Relevance (“Whether Tim flies depends on whether he is sick.”)
P(A | B) ≠ P(A)

4. Causation (“Being sick caused Tim’s inability to fly.”)
P(A | do(B))

The challenge is to find a formalism that fits all four of these, while remaining computationally feasible.
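
The gap between 2 (conditioning) and 4 (causation) is the one that most often trips people up, and it can be made concrete with the standard back-door adjustment idea. The confounder, the probabilities, and the variable names in this sketch are all invented for illustration:

```python
# A made-up confounder W (say, bad weather) makes Tim both more likely to be
# sick (B) and less likely to fly (A). All numbers are invented.
P_W = 0.3                                       # P(bad weather)
P_B_given_W = {True: 0.6, False: 0.1}           # P(sick | weather)
P_A_given_BW = {                                # P(flies | sick, weather)
    (True, True): 0.05, (True, False): 0.2,
    (False, True): 0.4, (False, False): 0.9,
}

def p_w(w): return P_W if w else 1 - P_W
def p_b(b, w): return P_B_given_W[w] if b else 1 - P_B_given_W[w]

# Conditioning: P(A | B = sick). Observing sickness also carries information about the weather.
num = sum(P_A_given_BW[(True, w)] * p_b(True, w) * p_w(w) for w in [True, False])
den = sum(p_b(True, w) * p_w(w) for w in [True, False])
print("P(flies | sick)     =", round(num / den, 3))      # ~0.092

# Intervening: P(A | do(B = sick)). Force sickness, leave the weather distribution alone
# (back-door adjustment over W).
do_sick = sum(P_A_given_BW[(True, w)] * p_w(w) for w in [True, False])
print("P(flies | do(sick)) =", round(do_sick, 3))        # ~0.155
```

The two numbers differ because observing sickness also raises the probability of bad weather, while intervening to make Tim sick leaves the weather alone.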

If all truths are knowable, then all truths are known

The title of this post is what’s called Fitch’s paradox of knowability.

It’s a weird result that arises from a few very intuitive assumptions about the notion of knowability. I’ll prove it here.

First, let’s list five assumptions. The first of these will be the only strong one – the others should all seem very obviously correct.

Assumptions

  1. All truths are knowable.
  2. If P & Q is known, then both P and Q are known.
  3. Knowledge entails truth.
  4. If P is possible and Q can be derived from P, then Q is possible.
  5. Contradictions are necessarily false.

Let’s put these assumptions in more formal language by using the following symbolization:

◇P means that P is possible
KP means that P is known by somebody at some time

Assumptions

  1. From P, derive ◇KP
  2. From K(P & Q), derive KP & KQ
  3. From KP, derive P
  4. From ◇P & (P → Q), derive ◇Q
  5. ¬◇(P & ¬P)

Now, the proof. First in English…

Proof

  1. Suppose that P is true and unknown.
  2. Then it is knowable that P is true and unknown.
  3. Thus it is possible that P is known and that it is known that P is unknown.
  4. So it is possible that P is both known and not known.
  5. Since 4 says that a contradiction is possible, which assumption 5 forbids, it is not the case that P is true and unknown.
  6. In other words, if P is true, then it is known.

Follow all of that? Essentially, we assume that there is some statement P that is both true and unknown. But if this last sentence is true, and if all truths are knowable, then it should be a knowable truth. I.e. it is knowable that P is both true and unknown. But of course this can’t be knowable, since to know that P is both true and unknown is to both know it and not know it. And thus it must be the case that if all truths are knowable, then all truths are known.

I’ll write out the proof more formally now.

Proof

  1. P & ¬KP                   Provisional assumption
  2. ◇K(P & ¬KP)            From 1, by Assumption 1
  3. ◇(KP & K¬KP)          From 2, by Assumptions 2 and 4
  4. ◇(KP & ¬KP)             From 3, by Assumptions 3 and 4
  5. ¬(P & ¬KP)                From 1 through 4 and Assumption 5, by reductio
  6. P → KP                     From 5
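
For what it’s worth, the argument is short enough to machine-check. Below is a sketch in Lean 4 that takes the five assumptions as axioms, with Kn and Ps as stipulated operators on propositions; modeling Assumption 4’s “can be derived from” as a plain implication is a simplification of this sketch:

```lean
-- "Kn P": P is known by somebody at some time.  "Ps P": P is possible.
axiom Kn : Prop → Prop
axiom Ps : Prop → Prop

axiom knowability   : ∀ P : Prop, P → Ps (Kn P)                 -- 1. all truths are knowable
axiom kn_and        : ∀ P Q : Prop, Kn (P ∧ Q) → Kn P ∧ Kn Q    -- 2. K distributes over ∧
axiom kn_truth      : ∀ P : Prop, Kn P → P                      -- 3. knowledge entails truth
axiom ps_mono       : ∀ P Q : Prop, (P → Q) → Ps P → Ps Q       -- 4. possibility respects entailment
axiom ps_consistent : ∀ P : Prop, ¬ Ps (P ∧ ¬ P)                -- 5. contradictions are impossible

-- If all truths are knowable, then all truths are known.
theorem fitch (P : Prop) (hP : P) : Kn P := by
  apply Classical.byContradiction
  intro hk                                   -- suppose P is true but unknown
  have h1 : Ps (Kn (P ∧ ¬ Kn P)) :=          -- then "P is true and unknown" is knowable
    knowability (P ∧ ¬ Kn P) ⟨hP, hk⟩
  have h2 : Ps (Kn P ∧ ¬ Kn P) :=            -- so "P is known and not known" is possible
    ps_mono (Kn (P ∧ ¬ Kn P)) (Kn P ∧ ¬ Kn P)
      (fun h => ⟨(kn_and P (¬ Kn P) h).1, kn_truth (¬ Kn P) (kn_and P (¬ Kn P) h).2⟩) h1
  exact ps_consistent (Kn P) h2              -- which assumption 5 forbids
```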

I love finding little examples like these where attempts to formalize our intuitions about basic concepts we use all the time lead us into disaster. You can’t simultaneously accept all of the following:

  • Not all truths are known.
  • All truths are knowable.
  • If P & Q is known, then both P and Q are known.
  • Knowledge entails truth.
  • If P is possible and P implies Q, then Q is possible.
  • Contradictions are necessarily false.

The value of the personal

I have been thinking about the value of powerful anecdotes. An easy argument for why we should be very cautious of personal experience and anecdotal evidence is that it has the potential to cause us to over-update. E.g. somebody who hears a few harrowing stories from friends about gun violence in Chicago is likely to end up with an inflated estimate of how dangerous Chicago is.

Maybe the best way to formulate our worldview is in a cold and impersonal manner, disregarding most anecdotes in favor of hard data. This is the type of thing I might have once said, but I now think that this approach is likely just as flawed.

First of all, I think it’s an unrealistic demand on most people’s cognitive systems that they toss out the personal and compelling in their worldview.

And second of all, just like personal experience and anecdotal evidence have the potential to cause over-updating, statistical data and dry studies have the potential to cause under-updating.

Reading some psychological studies about the seriousness of the psychological harms of extended periods of solitary confinement is no match for witnessing or personally experiencing the effects of being locked in a tiny room alone for years. There’s a real and important difference between abstractly comprehending a fact and really understanding the fact. Other terms for this second thing include internalizing the fact, embodying it, and making it salient to yourself.

This difference is not easy to capture on a one-dimensional model of epistemology where beliefs are represented as simple real numbers. I’m not even sure if there’d be any good reason to build this distinction into artificial intelligences we might eventually construct. But it is there in us, and has a powerful influence.

How do we know whether somebody has really internalized a belief or not? I’m not sure, but here’s a gesture in what I think is the right direction.

We can conceive of somebody’s view of the world as a massive web of beliefs, where the connections between beliefs indicate dependencies and logical relations. To have fully internalized a belief is to have a web that is fully consistent with the truth of this belief. On the other hand, if you notice that somebody verbally reports that they believe A, but then also seems to believe B, C, and D, where all of these are inconsistent with A, then they have not really internalized A.

The worry is that a cold and impersonal approach to forming your worldview is the type of thing that would result in this type of inconsistency and disconnectedness in your web of beliefs, through the failure to internalize important facts.

Such failures become most obvious when you have a good sense of somebody’s values, and can simply observe their behavior to see what it reveals about their beliefs. If somebody is a pure act utilitarian (I know that nobody actually has a value system as simple as this, but just play along for a moment), then they should be sending their money wherever it would be better spent maximizing utility. If they are not doing so, then this reveals an implicit belief that there is no better way to be maximizing utility than by keeping their own money.

This is sort of an attempt to uncover somebody’s revealed beliefs, to steal the concept of revealed preferences from economics. Conflicts between revealed beliefs and stated beliefs indicate a lack of internalization.

Confirmation bias, or different priors?

Consider two people.

Person A is a person of color who grew up in an upper middle class household in a wealthy and crime-free neighborhood. This person has gone through their life encountering mostly others similar to themselves – well-educated liberals who believe strongly in socially progressive ideals. They would deny ever having personally experienced or witnessed racism, despite their skin color. In addition, when their friends discuss the problem of racism in America, they feel a baseline level of skepticism about the actual extent of the problem, and suspect that the whole thing has been overblown by sensationalism in the media. Certainly there was racism in the past, they reason, but the problem seems largely solved in the present day. This suspicion of the mainstream narrative seems confirmed by the emphasis placed on shootings of young black men like Michael Brown, which were, upon closer reflection, not clear cases of racial discrimination at all.

Person B grew up in a lower middle class family, and developed friendships across a wide range of socioeconomic backgrounds. They began witnessing racist behavior towards black friends at a young age. As they got older, this racism became more pernicious, and several close friends described their frustration at experiences of racial profiling. Many of these friends ended up struggling with the law, and some went to jail. Studying history, Person B could see that racism is not a new phenomenon, but descends from a long and brutal history of segregation, Jim Crow, and discriminatory housing practices. To Person B, it is extremely obvious that racism is a deeply pervasive force in society today, and that it results in many injustices. These are the injustices that sparked the Black Lives Matter movement, of which they are an enthusiastic supporter. They are aware that BLM has made some mistakes in the past, but they see a torrent of evidence in favor of the primary message of the movement: that policing is racially biased.

Now both A and B are presented with the infamous Roland Fryer study, which found that when you carefully control for confounding factors, black people are no more likely to be shot by police officers than white people, and are in fact slightly less likely to be shot.

Person A is not super surprised by these results, and feels vindicated in their skepticism of the mainstream narrative. To them, this is clear-cut evidence supporting their preconception that a large part of the Black Lives Matter movement rests on an exaggeration of the seriousness of racial issues in the country.

On the other hand, Person B right away dismisses the results of this study. They know from their whole life experience that the results must be flawed, and their primary take-away is the fallibility of statistics in analyzing complex social issues.

They examine the study closely, trying to find the flaw. They come up with a few hypotheses: (1) The data is subject to one or more selection biases, having been provided by police departments that have an interest in seeming colorblind, and having come from the post-Ferguson period in which police officers became temporarily more careful not to shoot black suspects (dubbed the Ferguson effect). (2) The study looked at encounters between officers and black or white civilians, but didn’t take into account the effects of differential encounter rates. If police are more likely to pull over or arrest black people than white people for minor violations, then this would artificially lower the per-encounter rate at which black people are shot.

A and B now discuss the findings. Witnessing B’s response to the data, A perceives it as highly irrational. It appears to be a textbook example of confirmation bias. B immediately dismissed the data that contradicted their beliefs, and rationalized it by conjuring complicated explanations for why the results of the study were wrong. Sure, thinks Person A, these explanations were clever and suggested the need for more nuance in the statistical analysis, but clearly it was more likely that the straightforward interpretation of the data was correct than these complicated alternatives.

B finds A’s quick acceptance of the results troubling and suggestive of a lack of nuance. To B, it appears that A already had their mind made up, and eagerly jumped onto this study without being sufficiently cautious.

Now the two are presented with a series of in-depth case studies of shootings of young black men that show clear evidence of racial profiling.

These stories fit perfectly within B’s worldview, and they find themselves deeply moved by the injustices that these young men experienced. Their dedication to the cause of fighting police brutality is reinvigorated.

But now A is the skeptical one. After all, the plural of anecdote is not data, and the existence of some racist cops by no means indicts all of society as racist. And what about all the similar stories with white victims that don’t get reported? They also recall the pernicious effects of cognitive biases that could make a young black man who has been fed narratives of police racism more likely to see racism where there is none.

All of this gives B the impression that A is doing cartwheels to avoid acknowledging the simple fact that racism exists.

In the first case, was Person B falling prey to confirmation bias? Was A? Were they both?

How about in the second case… Is A thinking irrationally, as B believes?

✯✯✯

I think that in both cases, the right answer is most likely no. In each case we had two people that were rationally responding to the evidence that they received, just starting from very different presuppositions.

Said differently, A and B had vastly different priors upon encountering the same data, and this difference is sufficient to explain their differing reactions. Given these priors it makes sense and is perfectly rational for B to quickly dismiss the Fryer report and search for alternative explanations, and for A to regard stories of racial profiling as overblown. It makes sense for the same reason that it makes sense for a scientist encountering a psychic that seems eerily accurate to right away write it off as some complicated psychological illusion… strong priors are not easily budged by evidence, and there are almost always alternative explanations that are more likely.

This is all perfectly Bayesian, by the way. If two interpretations of a data set equally well predict or make sense of the data (i.e. P(data | interpretation 1) = P(data | interpretation 2)), then their posterior odds ratio P(interpretation 1 | data) / P(interpretation 2 | data) should be no different from their prior ratio P(interpretation 1) / P(interpretation 2). In other words, strong priors dictate strong posteriors when the evidence is weakly discriminatory between the hypotheses.

When the evidence does rule out some interpretations, probability mass is preferentially shifted towards the interpretations that started with stronger priors. For instance, suppose you have three theories with credences (P(T1), P(T2), P(T3)) = (10%, 5%, 85%), and some evidence E is received that rules out T1 but is equally likely under T2 and T3. Then your posterior probabilities will be (P(T1 | E), P(T2 | E), P(T3 | E)) = (0%, 5.6%, 94.4%).

T2 gains only 0.6% credence, while T3 gains 9.4%. In other words, while the relative odds between T2 and T3 stay the same, the freed-up probability mass has shifted disproportionately towards the theory favored in the prior.
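
Here is the same update written out as a minimal sketch, using the numbers above:

```python
import numpy as np

priors = np.array([0.10, 0.05, 0.85])        # P(T1), P(T2), P(T3)
likelihoods = np.array([0.0, 1.0, 1.0])      # E rules out T1, equally likely under T2 and T3

posteriors = priors * likelihoods
posteriors /= posteriors.sum()

print(posteriors.round(3))                   # [0.    0.056 0.944]
print((posteriors - priors).round(3))        # gains: [-0.1    0.006  0.094]
print(priors[1] / priors[2], posteriors[1] / posteriors[2])   # the T2:T3 odds are unchanged
```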

The moral of this story is to be careful when accusing others, internally or externally, of confirmation bias. What looks like mulish unwillingness to take seriously alternative hypotheses can actually be good rational behavior and the effect of different priors. Being able to tell the difference between confirmation bias and strong priors is a hard task – one that most people probably won’t undertake, opting instead to assume the worst of their ideological opponents.

Another moral is that privilege is an important epistemic concept. Privilege means, in part, having lived a life sufficiently insulated from injustice that it makes sense to wonder if it is there at all. Privilege is a set of priors tilted in favor of colorblindness and absolute moral progress. “Recognizing privilege” corresponds to doing anthropic reasoning to correct for selection biases in the things you have personally experienced, and adjusting your priors accordingly.

Objective Bayesianism and choices of concepts

Bayesians believe in treating belief probabilistically, and updating credences via Bayes’ rule. They face the problem of how to set priors – while probability theory gives a clear prescription for how to update beliefs, it doesn’t tell you what credences you should start with before getting any evidence.

Bayesians are thus split into two camps: objective Bayesians and subjective Bayesians. Subjective Bayesians think that there are no objectively correct priors. A corollary to this is that there are no correct answers to what somebody should believe, given their evidence.

Objective Bayesians disagree. Different variants specify different procedures for determining priors. For instance, the principle of indifference (POI) prescribes that the proper priors are those that are indifferent between all possibilities. If you have N possibilities, then according to the POI, you should distribute your prior credences evenly (1/N each). If you are considering a continuum of hypotheses (say, about the mass of an object), then the principle of indifference says that your probability density function should be uniform over all possible masses.

Now, here’s a problem for objective Bayesians.

You are going to be handed a cube, and all that you know about it is that it is smaller than 1 cm³. What should your prior distribution over possible cubes you might be handed look like?

Naively applying the POI, you might evenly distribute your credences across all volumes from 0 cm³ to 1 cm³ (so that there is a 50% chance that the cube has a volume less than 0.50 cm³ and a 50% chance that its volume is greater than 0.50 cm³).

But instead of choosing to be indifferent over possible volumes, we could equally well have chosen to be indifferent over possible side areas, or side lengths. The key point is that these are all different distributions. If we spread our credences evenly across possible side lengths from 0 cm to 1 cm, then we would have a distribution with a 50% chance that the cube has a volume less than 0.125 cm³ and a 50% chance that the volume is greater than this.

[Figure: the cube puzzle, illustrating how priors that are uniform over side length, side area, and volume spread their probability very differently over the same set of cubes.]
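
To see how different these distributions are, here is a quick Monte Carlo sketch comparing a prior that is uniform over volume with one that is uniform over side length (the sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000                                  # arbitrary sample size

vol_uniform = rng.uniform(0, 1, n)             # indifference over volume (cm^3)
len_uniform = rng.uniform(0, 1, n) ** 3        # indifference over side length, converted to volume

print((vol_uniform < 0.5).mean())    # ~0.50 : half the mass below 0.5 cm^3
print((len_uniform < 0.5).mean())    # ~0.79 : most of the mass is now below 0.5 cm^3
print((len_uniform < 0.125).mean())  # ~0.50 : the median volume has moved down to 0.125 cm^3
```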

In other words, our choice of concepts (edge length vs side area vs volume) ends up determining the shape of our prior. Insofar as there is no objectively correct choice of concepts to be using, there is no objectively correct prior distribution.

I’ve known about this thought experiment for a while, but only recently internalized how serious of a problem it presents. It essentially says that your choice of priors is hostage to your choice of concepts, which is a pretty unsavory idea. In some cases, which concept to choose is very non-obvious (e.g. length vs area vs volume). In others, there are strong intuitions about some concepts being better than others.

The most famous example of this is contained in Nelson Goodman’s “new riddle of induction.” He proposes a new concept grue, which is defined as the set of objects that are either observed before 2100 and green, or observed after 2100 and blue. So if you spot an emerald before 2100, it is grue. So is a blue ball that you spot after 2100. But if you see an emerald after 2100, it will not be grue.

To characterize objects like an emerald observed after 2100, Goodman also creates another concept, bleen, which is the inverse of grue. The set of bleen objects is composed of blue objects observed before 2100 and green objects observed after 2100.

Now, if we run ordinary induction using the concepts grue and bleen, we end up making bizarre predictions. For instance, say we observe many emeralds before 2100, and always find them to be green. By induction, we should infer that the next emerald we observe after 2100 is very likely going to be green as well. But if we thought in terms of the concepts grue and bleen, then we would say that all our observations of emeralds so far have provided inductive support for the claim “All emeralds are grue.” The implication of this is that the emeralds we observe after 2100 will very likely also be grue (and thus blue).

In other words, by simply choosing a different set of fundamental concepts to work with, we end up getting an entirely different prediction about the future.

Here’s one response that you’ve probably already thought of: “But grue and bleen are such weird artificial choices of concepts! Surely we can prefer green/blue over bleen/grue on the basis of the additional complexity required in specifying the transition time 2100?”

The problem with this is that we could equally well define green and blue in terms of grue and bleen:

Green = grue before 2100 or bleen after 2100
Blue = bleen before 2100 or grue after 2100

If for whatever reason somebody had grue and bleen as their primitive concepts, they would see green and blue as the concepts that required the additional complexity of the time specification.

“Okay, sure, but this is only if we pretend that color is something that doesn’t emerge from lower physical levels. If we tried specifying the set of grue objects in terms of properties of atoms, we’d have a lot harder time than if we tried specifying the set of green or blue objects in terms of properties of atoms.”

This is right, and I think it’s a good response to this particular problem. But it doesn’t work as a response to a more generic form of the dilemma. In particular, you can construct a grue/bleen-style set of concepts for whatever you think is the fundamental level of reality. If you think electrons and neutrinos are undecomposable into smaller components, then you can imagine “electrinos” (electrons if observed before 2100, neutrinos if observed after) and “neuctrons” (the reverse). And now we have the same issue as before… thinking in terms of electrinos would lead us to conclude that all electrons will suddenly transform into neutrinos in 2100.

The type of response I want to give is that concepts like “electron” and “neutrino” are preferable to concepts like “electrinos” and “neuctrons” because they mirror the structure of reality. Nature herself computes electrons, not electrinos.

But the problem is that we’re saying that in order to determine which concepts we should use, we need to first understand the broad structure of reality. After which we can run some formal inductive schema to, y’know, figure out the broad structure of reality.

Said differently, we can’t really appeal to “the structure of reality” to determine our choices of concepts, since our choices of concepts end up determining the results of our inductive algorithms, which are what we’re relying on to tell us the structure of reality in the first place!

This seems like a big problem to me, and I don’t know how to solve it.