The Rival-Expert Heuristic

I like to try to surround myself with people that are very intelligent and know a lot about subjects that I know very little about. As such, I am sometimes in the position that Scott Alexander refers to as epistemic learned helplessness. The basic idea bears some resemblance to ideas I explored in a previous post about reasoning in the presence of Super Persuaders.

When you’re talking to somebody who is much more knowledgeable than you about a particular subject and who is presenting to you very compelling arguments, it becomes unclear how strongly you should update on the arguments you are receiving. In particular, if the person you’re talking to is very plausibly presenting a biased sampling of the relevant arguments, then you should be very hesitant to update on these arguments as fully as you would otherwise.

One way of dealing with this is just to avoid people that know more than you and have strong opinions on matters that are disputed among experts. But that’s no fun.

A useful heuristic here is to do your best to imagine what it would be like if there was a rival expert in the room with you and your conversation partner. Creatively, I call this the Rival-Expert Heuristic.

For example, imagine that you’re in conversation with an expert sociologist who is making some very compelling-sounding arguments for why socialist economic systems are overall better than capitalistic systems. It might be that you can’t personally see any reason why the arguments they’re making would fail, and are unable to think of any original arguments for capitalism or against socialism.

In such a situation, it might be genuinely helpful to imagine that Milton Friedman is sitting in the room beside you, holding forth against the scholar. Even if you don’t personally know any counterarguments, you might have some sense that it is likely that such counterarguments exist and that Milton Friedman would know them.

If they say “Capitalism is a system that exploits workers and causes wealth to concentrate at the top!”, and you don’t know of any good responses to this, you should consider the chance that Milton Friedman has heard of this line of argument and has a crushingly good response to it. If you can’t think of arguments of your own to present, you should try to take into account the “empty space” in the conversation where these opposing arguments would be if Milton Friedman were in the room.

This can potentially help you with judging how strong the arguments you’re receiving actually are. The primary difficulty is obvious: it’s not easy to accurately imagine a rival expert for exactly the reason that you don’t personally know what arguments they would be making.

At the same time, it is probably much easier to simply consider the question: “How likely is it that a rival expert would have a compelling response to this?” than it is to try to construct such a response yourself. I also think that it can be more reliable in many cases. Imagine that somebody comes up to you with plans for a perpetual motion device, and begins to describe them in much greater detail than you are able to understand. Perhaps this person understands the underlying physics much better than you, and whenever you raise an objection to their design, they are able to easily respond with apparently logical arguments. This is a case where you can be extremely confident that there exist good reasons why they are wrong, even though you have no idea what those reasons might be.

More realistically, suppose that somebody presents you with an argument for why X is true, and you vividly remember hearing a fantastic argument just last week for the falsity of X by a very reputable expert on X-like matters. The trouble is, you can’t remember any of the details of this argument, just that it was a much stronger argument, from a more reputable source, than the one you are receiving now. This is a situation we are often in, but one that is not typically addressed in standard philosophical discussions of epistemology.

Are we justified in believing that what they’re saying is probably wrong, even though we can’t remember the details of the argument? Of course! Our confidence in the falsity of X is moved by an argument’s strength, and only indirectly by its content. If our memory of the argument’s strength is intact and reliable, then there is no reason to backtrack on the earlier credence bump.

But just feeling confident that the things you’re hearing are wrong is often not very salient to us, especially if the person saying them is very charismatic and persuasive. You’ll eventually be tempted to relent in your dogged agnosticism after repeatedly failing to see any flaws in their arguments.

This, I think, is the main strength of the rival-expert heuristic. Dogged adherence to uncertainty in the face of compelling evidence feels much more okay if you can vividly imagine a more balanced social dynamic, one in which compelling evidence is being presented on both sides of the issue.

A more general form of this heuristic is to not form strong opinions or take sides on issues that are controversial amongst those who know the most about them, unless you yourself are one of the top experts. I think that a world in which this practice was more common would be hugely improved. As it is, people generally hold far too many beliefs, far too strongly, on matters that are disputed among experts. Part of the problem is that beliefs are sticky – it’s easier to acquire them than to abandon them once they have become a part of your identity.

If you think that raising the minimum wage is obviously a fantastic idea, but also know that there is a great deal of complicated debate amongst professional economists on the matter, then you are implicitly assuming that you know better than all those economists that disagree with you.

More viscerally, you must come to terms with the fact that if you were faced with the boatloads of experts that disagree with you, your arguments would probably fall flat, and you would likely hear a bunch of compelling arguments for why you are wrong. If this is true, then you essentially are just hanging on to your beliefs because you have by chance happened to avoid these experts!

Ultimately, the Rival-Expert Heuristic is about updating on evidence that you don’t have, but which you have good reason to believe exists. Perhaps this feels weird, but to sum up, there are three basic motivations for doing so.

First, we are easily convinced by compelling-sounding arguments from biased sources.

Second, abstractly knowing of the existence of experts that disagree with compelling-sounding arguments is less likely to properly influence your epistemic habits than actually imagining those experts engaging with the arguments.

And third, beliefs are “sticky” and easier to take on than to back out of.

Inference as a balance of accommodation, prediction, and simplicity

(This post is a soft intro to some of the many interesting aspects of model selection. I will inevitably skim over many nuances and leave out important details, but hopefully the final product is worth reading as a dive into the topic. A lot of the general framing I present here is picked up from Malcolm Forster’s writings.)

What is the optimal algorithm for discovering the truth? There are many different candidates out there, and it’s not totally clear how to adjudicate between them. One issue is that it is not obvious exactly how to measure correspondence to truth. There are several different criteria that we can use, and in this post, I want to talk about three big ones: accommodation, prediction, and simplicity.
The basic idea of accommodation is that we want our theories to do a good job at explaining the data that we have observed. Prediction is about doing well at predicting future data. Simplicity is, well, just exactly what it sounds like. Its value has been recognized in the form of Occam’s razor, or the law of parsimony, although it is famously difficult to formalize.
Let’s say that we want to model the relationship between the number of times we toss a fair coin and the number of times that it lands H. We might get a data set that looks something like this:
[Figure: the observed data – number of tosses vs. number of heads]

Now, our goal is to fit a curve to this data. How best to do this?

Consider the following two potential curves:

[Figure: two candidate curves fit to the data]

Curve 1 is generated by Procedure 1: Find the lowest-order polynomial that perfectly matches the data.

Curve 2 is generated by Procedure 2: Find the straight line that best fits the data.

If we only cared about accommodation, then we’d prefer Curve 1 over Curve 2. After all, Curve 1 matches our data perfectly! Curve 2, on the other hand, is always close but never exactly right.

On the other hand, regardless of how well Curve 1 fits the data, it entirely misses the underlying pattern in the data captured by Curve 2! This demonstrates one of the failure modes of a single-minded focus on accommodation: the problem of overfitting.
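To make the two procedures concrete, here is a minimal sketch in Python. The data points are made up for illustration (they just mimic the sort of scatter in the figure), and I’m using numpy’s polynomial fitting for both procedures:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Illustrative stand-in for the figure's data: tosses vs. number of heads
tosses = np.array([10, 20, 30, 40, 50, 60, 70, 80])
heads = np.array([6, 9, 17, 19, 27, 28, 37, 39])

# Procedure 1: the lowest-order polynomial that perfectly matches the data.
# With 8 points, a degree-7 polynomial passes through every one of them.
curve1 = Polynomial.fit(tosses, heads, deg=len(tosses) - 1)

# Procedure 2: the straight line that best fits the data (least squares).
curve2 = Polynomial.fit(tosses, heads, deg=1)

print(curve1(tosses) - heads)  # residuals ~ 0: perfect accommodation
print(curve2(tosses) - heads)  # small but nonzero residuals
print(curve2.convert().coef)   # [intercept, slope]; the slope comes out near 0.5
```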

We might try to solve this problem by noting that while Curve 1 matches the data better, it does so in virtue of its enormous complexity. Curve 2, on the other hand, matches the data pretty well, but does so simply. A combined focus on accommodation + simplicity might, therefore, favor Curve 2. Of course, this requires us to precisely specify what we mean by ‘simplicity’, which has been the subject of a lot of debate. For instance, some have argued that an individual curve cannot be said to be more or less simple than a different curve, as just rephrasing the data in a new coordinate system can flip the apparent simplicity relationship. This is a general version of the grue-bleen problem, which is a fantastic problem that deserves talking about in a separate post.

Another way to solve this problem is by optimizing for accommodation + prediction. The over-fitted curve is likely to be very off if you ask for predictions about future data, while the straight line is likely going to do better. This makes sense – a straight line makes better forecasts about future data because it has gotten to the true nature of the underlying relationship.

What if we want to ensure that our model does a good job at predicting future data, but are unable to gather future data? For example, suppose that we lost the coin that we were using to generate the data, but still want to know which model would have done best at predicting future flips. Cross-validation is a wonderful technique that deals with exactly this problem.

How does it work? The idea is that we randomly split up the data we have into two sets, the training set and the testing set. Then we train our models on the training set (see which curve each model ends up choosing as its best fit, given the training data), and test it on the testing set. For instance, if our training set is just the data from the early coin flips, we find the following:

[Figure: cross-validation of the two curve-fitting procedures – Curve 1 and Curve 2 re-fit on the training set and evaluated on the testing set]

We can see that while the new Curve 2 does roughly as well as it did before, the new Curve 1 will do horribly on the testing set. We now do this for many different ways of splitting up our data set, and in the end accumulate a cross-validation “score”. This score represents the average success of the model at predicting points that it was not trained on.

We expect that in general, models that overfit will tend to do horribly badly when asked to predict the testing data, while models that actually get at the true relationship will tend to do much better. This is a beautiful method for avoiding overfitting by getting at the deep underlying relationships, and optimizing for the value of predictive accuracy.
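Here is a sketch of the procedure in code, continuing the coin example. The number of splits and the half-and-half train/test division are arbitrary choices for illustration:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
tosses = np.arange(10, 210, 10)      # 20 experiments of increasing length
heads = rng.binomial(tosses, 0.5)    # simulated head-counts from a fair coin

def cv_score(degree, n_splits=200):
    """Average squared error on held-out points, over many random train/test splits."""
    errors = []
    for _ in range(n_splits):
        idx = rng.permutation(len(tosses))
        train, test = idx[:10], idx[10:]
        model = Polynomial.fit(tosses[train], heads[train], deg=degree)
        errors.append(np.mean((model(tosses[test]) - heads[test]) ** 2))
    return float(np.mean(errors))

print(cv_score(degree=1))  # the straight line: low prediction error
print(cv_score(degree=9))  # an overfitted polynomial: error explodes on unseen points
```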

It seems like predictive accuracy and simplicity often go hand-in-hand. In our coin example, the simpler model (the straight line) was also the more predictively accurate one. And models that overfit tend to be both bad at making accurate predictions and enormously complicated. What is the explanation for this relationship?

One classic explanation says that simpler models tend to be more predictive because the universe just actually is relatively simple. For whatever reason, the actual relationships between different variables in the universe happens to be best modeled by simple equations, not complicated ones. Why? One reason that you could point to is the underlying simplicity of the laws of nature.

The Standard Model of particle physics, which gives rise to basically all of the complex behavior we see in the world, can be expressed in an equation that can be written on a t-shirt. In general, physicists have found that reality seems to obey very mathematically simple laws at its most fundamental level.

I think that this is somewhat of a non-explanation. It predicts simplicity in the results of particle physics experiments, but does not at all predict simple results for higher-level phenomena. In general, very complex phenomena can arise from very simple laws, and we get no guarantee that the world will obey simple laws when we’re talking about patterns involving 10^20 particles.

An explanation that I haven’t heard before references possible selection biases. The basic idea is that most variables out there that we could analyze are likely not connected by any simple relationships. Think of any random two variables, like the number of seals mating at any moment and the distance between Obama and Trump at that moment. Are these likely to be related by a simple equation? Of course!

(Kidding. Of course not.)

The only times when we do end up searching for patterns in variables is when we have already noticed that some pattern does plausibly seem to exist. And since we’re more likely to notice simpler patterns, we should expect a selection bias among those patterns we’re looking at. In other words, given that we’re looking for a pattern between two variables, it is fairly likely that there is a pattern that is simple enough for us to notice in the first place.

Regardless, it looks like an important general feature of inference systems to provide a good balance between accommodation and either prediction or simplicity. So what do actual systems of inference do?

I’ve already talked about cross validation as a tool for inference. It optimizes for accommodation (in the training set) + prediction (in the testing set), but not explicitly for simplicity.

Updating of beliefs via Bayes’ rule is purely an accommodation procedure. When you take your prior credence P(T) and update it with evidence E, you are ultimately just doing your best to accommodate the new information.

Bayes’ Rule: P(T | E) = P(T) ∙ P(E | T) / P(E)

The theory that receives the greatest credence bump is going to be the theory that maximizes P(E | T), or the likelihood of the evidence given the theory. This is all about accommodation, and entirely unrelated to the other virtues. Technically, the method of choosing the theory that maximizes the likelihood of your data is known as Maximum Likelihood Estimation (MLE).
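Concretely, for the coin example: if you observe n heads and m tails, the hypothesis “the coin lands heads with probability p” assigns the data a likelihood of p^n ∙ (1 – p)^m, and the value of p that maximizes this is just the observed frequency, p = n / (n + m). Nothing in that calculation cares about how simple the hypothesis is or how it will fare on future flips; it is accommodation through and through.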

On the other hand, the priors that you start with might be set in such a way as to favor simpler theories. Most frameworks for setting priors do this either explicitly or implicitly (principle of indifference, maximum entropy, minimum description length, Solomonoff induction).

Leaving Bayes, we can look to information theory as the foundation for another set of epistemological frameworks. These are focused mostly on minimizing the information gain from new evidence, which is equivalent to maximizing the relative entropy of your new distribution and your old distribution.

Two approximations of this procedure are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), each focusing on subtly different goals. Both of these explicitly take into account simplicity in their form, and are designed to optimize for both accommodation and prediction.
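In their standard forms, these criteria score a model by its maximized likelihood L̂ together with an explicit complexity penalty (smaller is better):

AIC = 2k – 2 ln(L̂)
BIC = k ln(n) – 2 ln(L̂)

Here k is the number of adjustable parameters and n is the number of data points. The –2 ln(L̂) term rewards accommodation, while the term proportional to k penalizes complexity; since BIC’s penalty grows with n, it favors simpler models more aggressively as the data accumulate.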

Here’s a table of these different procedures, as well as others I haven’t mentioned yet, and what they optimize for:

| Procedure | Accommodation? | Prediction? | Simplicity? |
| --- | --- | --- | --- |
| Maximum Likelihood Estimation | ✓ | | |
| Minimize Sum of Squares | ✓ | | |
| Bayesian Updating | ✓ | | |
| Principle of Indifference | | | ✓ |
| Maximum Entropy Priors | | | ✓ |
| Minimum Message Length | | | ✓ |
| Solomonoff Induction | | | ✓ |
| P-Testing | ✓ | | |
| Minimize Mallows’ Cp | ✓ | ✓ | ✓ |
| Maximize Relative Entropy | ✓ | ✓ | |
| Minimize Log Loss | ✓ | ✓ | |
| Cross Validation | ✓ | ✓ | |
| Minimize Akaike Information Criterion (AIC) | ✓ | ✓ | ✓ |
| Minimize Bayesian Information Criterion (BIC) | ✓ | ✓ | ✓ |

Some of the procedures I’ve included are closely related to others, and in some cases they are in fact approximations of others (e.g. minimize log loss ≈ maximize relative entropy, minimize AIC ≈ minimize log loss).

We can see in this table that Bayesianism (Bayesian updating + a prior-setting procedure) does not explicitly optimize for predictive value. It optimizes for simplicity through the prior-setting procedure, and in doing so also happens to pick up predictive value by association, but doesn’t get the benefits of procedures like cross-validation.

This is one reason why Bayesianism might be seen as suboptimal – prediction is the great goal of science, and it is entirely missing from the equations of Bayes’ rule.

On the other hand, procedures like cross validation and maximization of relative entropy look like good candidates for optimizing for accommodation and predictive value, and picking up simplicity along the way.

Pascal’s mugging

  • You should make decisions by evaluating the expected utilities of your various options and choosing the option with the highest expected utility.

This is a pretty standard and uncontroversial idea. There is room for controversy about how to fill in the details about how to evaluate expected utilities, but this basic premise is hard to argue against. So let’s argue against it!

Suppose that a stranger walks up to you in the street and says to you “I have been wired in from outside the simulation to give you the following message: If you don’t hand over five dollars to me right now, your simulator will teleport you to a dungeon and torture you for all eternity.” What should you do?

The obviously correct answer is that you should chuckle, continue on with your day, and laugh about the incident later on with your friends.

The answer you get from a simple application of decision theory is that as long as you aren’t absolutely, 100% sure that they are wrong, you should give them the five dollars. And you should definitely not be 100% sure. Why?

Suppose that the stranger says next: “I know that you’re probably skeptical about the whole simulation business, so here’s some evidence. Say any word that you please, and I will instantly reshape the clouds in the sky into that word.” You do so, and sure enough the clouds reshape themselves. Would this push your credences around a little? If so, then you didn’t start at 100%. Truly certain beliefs are those that can’t be budged by any evidence whatsoever. You can never update downwards on truly certain beliefs, by the definition of ‘truly certain’.

To go more extreme, just suppose that they demonstrate to you that they’re telling you the truth by teleporting you to a dungeon for five minutes of torture, and then bringing you back to your starting spot. If you would even slightly update your beliefs about their credibility in this scenario, then you had a non-zero credence in their credibility from the start.

And after all, this makes sense. You should only have complete confidence in the falsity of logical contradictions, and it’s not literally logically impossible that we are in a simulation, or that the simulator decides to mess with our heads in this bizarre way.

Okay, so you have a nonzero credence in their ability to do what they say they can do. And any nonzero credence, no matter how tiny, will result in the rational choice being to hand over the $5. After all, if expected utility is just calculated by summing up utilities weighted by probabilities, then you have something like the following:

EU(give $5) – EU(keep $5) = ε · |U(infinite torture)| – U(keep $5)
where ε = P(infinite torture | keep $5) – P(infinite torture | give $5)

As long as losing $5 isn’t infinitely bad to you, you should hand over the money. This seems like a problem, either for our intuitions or for decision theory.

***

So here are four propositions, and you must reject at least one of them:

  1. There is a nonzero chance of the stranger’s threat being credible.
  2. Infinite torture is infinitely worse than losing $5.
  3. The rational thing to do is that which maximizes expected utility.
  4. It is irrational to give the stranger $5.

I’ve already argued for (1), and (2) seems virtually definitional. So our choice is between (3) and (4). In other words, we either abandon the principle of maximizing expected utility as a guide to instrumental rationality, or we reject our intuitive confidence in the correctness of (4).

Maybe at this point you feel more willing to accept (4). After all, intuitions are just intuitions, and humans are known to be bad at reasoning about very small probabilities and very large numbers. Maybe it actually makes sense to hand over the $5.

But consider where this line of reasoning leads.

The exact same argument should lead you to give in to any demand that the stranger makes of you, as long as it doesn’t have a literal negative infinity utility value. So if the stranger tells you to hand over your car keys, to go dance around naked in a public square, or to commit heinous crimes… all of these behaviors would be apparently rationally mandated.

Maybe, maybe, you might be willing to bite the bullet and say that yes, these behaviors are all perfectly rational, because of the tiny chance that this stranger is telling the truth. I’d still be willing to bet that you wouldn’t actually behave in this self-professedly “rational” manner if I now made this threat to you.

Also, notice that this dilemma is almost identical to Pascal’s wager. If you buy the argument here, then you should also be doing all that you can to ensure that you stay out of Hell. If you’re queasy about the infinities and think decision theory shouldn’t be messing around with such things, then we can easily modify the problem.

Instead of “your simulator will teleport you to a dungeon and torture you for all eternity”, make it “your simulator will teleport you to a dungeon and torture you for 3↑↑↑↑3 years.” The negative utility of this is large enough to outweigh any reasonable credence you could place in the credibility of the threat. And if it isn’t, we can just make the number of years even larger.

Maybe the probability of a given payout scales inversely with the size of the payout? But this seems fairly arbitrary. Is it really the case that the ability to torture you for 3↑↑↑↑3 years is twice as likely as the ability to torture you for 2 ∙ 3↑↑↑↑3 years? I can’t imagine why. It seems like the probability of these are going to be roughly equal – essentially, once you buy into the prospect of a simulator that is able to torture you for 3↑↑↑↑3 years, you’ve already basically bought into the prospect that they are able to torture you for twice that amount of time.

All we’re left with is to throw our hands up and say “I can’t explain why this argument is wrong, and I don’t know how decision theory has gone wrong here, but I just know that it’s wrong. There is no way that the actually rational thing to do is to allow myself to get mugged by anybody that has heard of Pascal’s wager.”

In other words, it seems like the correct response to Pascal’s mugging is to reject (3) and deny the expected-utility-maximizing approach to decision theory. The natural next question is: If expected utility maximization has failed us, then what should replace it? And how would it deal with Pascal’s mugging scenarios? I would love to see suggestions in the comments, but I suspect that this is a question that we are simply not advanced enough to satisfactorily answer yet.

Why systematize epistemology?

A general pattern I’ve noticed in meta-level thinking is a spectrum of systemizing. I’ll explain what this means by a personal example.

When I was first exposed to the idea of ethics as a serious discipline, I found it fairly silly. I mean, clearly our ethical beliefs are not the types of things that we should expect objectivity from. They form from a highly subjective and complex mix of factors involving the peer group we surround ourselves with, the type of parents we had, our religious background, our inbuilt deep moral intuitions, our life experiences, and so on. What’s the point in thinking hard about your ethical beliefs – they just are what they are, right?

What I found funny was the idea that people thought it made sense to spend serious time and effort trying to analyze their ethical intuitions and creating general frameworks that capture as much of these intuitions as they could. I would say that I, for whatever reason, had an initially highly non-systematizing attitude towards ethics.

In college, I fell in with a crowd that liked spending long hours debating abstract ethical principles, and eventually grew fond of it myself. It became intuitive to me that of course it is desirable to have a simple, precisely formalized, and vastly generalizable ethical framework to guide your beliefs and actions. This remained the case even though I never lost the intuitive sense of the obviousness of moral non-objectivity.

Frameworks like utilitarianism appealed to me as incredibly simple general “laws of morality” that were able to capture most of my ethical intuitions. When they contradicted strong ethical intuitions, I felt okay with overriding those intuitions for the sake of the more valuable synthesis that was the framework as a whole.

These types of cognitive patterns – taking complex disparate phenomena, analyzing patterns in them, looking for precise and simple descriptions of these patterns and trying to generalize them as far as possible – are what I mean by systematizing. Some people are very strong systematizers when it comes to their aesthetic tastes – they will spend hours arguing about what beauty is and analyzing their basic aesthetic reactions in order to form simple general Theories of Everything Beautiful. Others think that this is stupid and a waste of time and cognitive resources.

Philosophers tend to be systematizers about literally everything – I’d say systematization comes close to a general definition of philosophy as an intellectual field. Scientists tend to be systematizers about the field that they work in, where they work obsessively to cleanly and neatly describe vast realms of natural phenomena. In our daily lives, systematizing tendencies come out in arguments about the quality of a certain movie or the tastiness of a meal or the attractiveness of a celebrity. Some people will want to dive into these debates with an attitude towards forming general principles of what makes a quality movie, or a tasty meal, or an attractive person, while others will dismiss the general principles, arguing instead from their gut-level reactions to the movie. Which is to say, some people will feel a desire to systematize their thoughts/ opinions/ desires/ tastes, and others will not.

Those that do not are perfectly content with a complicated and messy reality. They feel no inner urge pulling them towards de-cluttering their view of the world. From this perspective, it can be perplexing to see people working very hard to systematize their intuitions. Such efforts can seem fairly pointless, and downright absurd when the final product ends up contradicting some of the intuitions from which it was built.

About a lot of things, I am an extreme systematizer, relentlessly searching for concise, elegant, and powerful models to piece everything together. But there are plenty of other areas where I feel totally fine with messiness and complexity and am turned off by efforts to reduce or remove them. Aesthetics is one such area – I appreciate art on a gut level, and am weirded out by the prospect of trying to formulate a simple general theory of aesthetics.

One of the areas where I have the most extreme systematizing tendencies (as might be obvious from my writings on this blog) is formal epistemology. A single neat equation that summarizes the process of rational belief formation is just obviously desirable to me. This is not a desirability borne out of practical considerations. It is perhaps at its root a deeply aesthetic feeling about different structures of reasoning. I want to know not just what is practically useful for day-to-day reasoning, but also what is ultimately the best and most fundamental framework with which to describe my epistemological intuitions.

I choose the phrase ‘epistemological intuitions’ carefully and intentionally. We do not have any direct line to objective epistemic truth; we are not provided by Nature with a golden shining book in which the true nature of normative rational reasoning is laid out for us. What we do have, ultimately, is a set of deep intuitions about the way that good reasoning works. These intuitions are messy and complicated.

I say this all to make the point that strong enough systematizing intuitions can make the non-objective look objective, and I think it’s important to try to avoid that mistake. Maybe we think that if we extend our framework of reasoning enough, we can eventually find evolutionary justifications for why our patterns of reasoning should in general align with the truth. But this is simply an appeal to the value of reflective equilibrium – the criterion that multiple alternative perspectives on the same framework end up cohering and bolstering one another.

If we try to say something like “We can find out what framework works best by just seeing how they do at predicting future events,” then we are relying on the intuition that empiricism is an epistemic virtue. Similarly, if we appeal to Occam’s razor, we are relying on intuitions about simplicity. If we think that better frameworks take little for granted and are cautious about jumping to strong conclusions, then we are relying on intuitions about epistemic humility. Etc.

The best we can do, it seems to me, is to compile different arguments starting from our deepest intuitions and ending at a particular epistemic framework. Bayesianism has arguments like Cox’s theorem and Dutch Book arguments. The empirical case for Bayesianism can be made by convergence and consistency theorems, as well as case studies in which Bayesian methods result in great predictive power.

But I think that it’s important to keep in mind that these are not absolute proofs of the objective superiority of Bayesianism. Ultimately, arguments for any epistemic framework rest on some set of deep-seated epistemic intuitions, and are ineradicably tied to these intuitions.

A failure of Bayesian reasoning

Bayesianism does great when the true model of the world is included in the set of models that we are using. It can run into issues, however, when the true model starts with zero prior probability.

We’d hope that even in these cases, the Bayesian agent ends up doing as well as possible, given their limitations. But this paper presents lots of examples of how a Bayesian thinker can go horribly wrong as a result of accidentally excluding the true model from the set of models they are considering. I’ll present one such example here.

Here’s the setup: A fair coin is being repeatedly flipped, while being watched by a Bayesian agent that wants to predict the bias in the coin. This agent starts off with the correct credence distribution over outcomes: they have a 50% credence in it landing heads and a 50% chance of it landing tails.

However, this agent only has two theories available to them:

T1: The coin lands heads 80% of the time.
T2: The coin lands heads 20% of the time.

Even though the Bayesian doesn’t have access to the true model of reality, they are still able to correctly forecast a 50% chance of the coin landing heads by evenly splitting their credences in these two theories. Given this, we’d hope that they wouldn’t be too handicapped and in the long run would be able to do pretty well at predicting the next flip.

Here’s the punchline, before diving into the math: The Bayesian doesn’t do this. In fact, their behavior becomes more and more unreasonable the more evidence they get.

They end up spending almost all of their time being virtually certain that the coin is biased, and occasionally flip-flopping in their belief about the direction of the bias. As a result of this, their forecast will almost always be very wrong. Not only will it fail to converge to a realistic forecast, but in fact, it will get further and further away from the true value on average. And remember, this is the case even though convergence is possible!

Alright, so let’s see why this is true.

First of all, our agent starts out thinking that T1 and T2 are equally likely. This gives them an initially correct forecast:

P(T1) = 50%
P(T2) = 50%

P(H) = P(H | T1) · P(T1) + P(H | T2) · P(T2)
= 80% · 50% + 20% · 50% = 50%

So even though the Bayesian doesn’t have the correct model in their model set, they are able to distribute their credences in a way that will produce the correct forecast. If they’re smart enough, then they should just stay near this distribution of credences in the long run, and in the limit of infinite evidence converge to it. So do they?

Nope! If they observe n heads and m tails, then their likelihood ratios end up moving exponentially with n – m. This means that the credences will almost certainly end up very highly uneven.

In what follows, I’ll write the difference in the number of heads and the number of tails as z.

z = n – m

P(n, m | T1) = 0.8^n · 0.2^m
P(n, m | T2) = 0.2^n · 0.8^m

L(n, m | T1) = 4^z
L(n, m | T2) = 1/4^z

P(T1 | n, m) = 4^z / (4^z + 1)
P(T2 | n, m) = 1 / (4^z + 1)

Notice that the final credences only depend on z. It doesn’t matter if you’ve done 100 trials or 1 trillion, all that matters is how many more heads than tails there are.

Also notice that the final credences are exponential in z. This means that for positive z, P(T1 | n, m) goes to 100% exponentially quickly, and vice versa.

| z | Posterior P(T1) | Posterior P(T2) |
| --- | --- | --- |
| 0 | 50% | 50% |
| 1 | 80% | 20% |
| 2 | 94.12% | 5.88% |
| 3 | 98.46% | 1.54% |
| 4 | 99.61% | 0.39% |
| 5 | 99.90% | 0.10% |
| 6 | 99.97% | 0.03% |

The Bayesian agent is almost always virtually certain in the truth of one of their two theories. But which theory they think is true is constantly flip-flopping, resulting in a belief system that is vacillating helplessly between two suboptimal extremes. This is clearly really undesirable behavior for a supposed model of epistemic rationality.

In addition, as the number of coin tosses increases, it becomes less and less likely that z is exactly 0. After N tosses, the typical magnitude of z is on the order of √N. This means that the more evidence they receive, the further on average they will be from the ideal distribution.
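Here is a quick simulation sketch (my own, not from the paper) of the agent watching 10,000 flips of a fair coin:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
flips = rng.random(N) < 0.5               # True = heads, from a fair coin

z = 0                                     # running value of (heads - tails)
p_T1 = np.empty(N)                        # credence in T1: "heads 80% of the time"
for i, is_heads in enumerate(flips):
    z += 1 if is_heads else -1
    zc = max(min(z, 500), -500)           # clamp to avoid float overflow; harmless here
    p_T1[i] = 1.0 / (1.0 + 4.0 ** -zc)    # P(T1 | data) = 4^z / (4^z + 1)

forecast = 0.8 * p_T1 + 0.2 * (1 - p_T1)  # agent's predicted P(heads) for the next flip

print(np.mean((p_T1 > 0.99) | (p_T1 < 0.01)))  # fraction of time spent nearly certain
print(np.mean(np.abs(forecast - 0.5)))         # average forecast error (truth is 0.5)
```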

Sure, you can object that in this case, it would be dead obvious to just include a T3: “The coin lands heads 50% of the time.” But that misses the point.

The Bayesian agent had a way out – they could have noticed after a long time that their beliefs were constantly wavering from extreme confidence in T1 to extreme confidence in T2, and seemed to be doing the opposite of converging to reality. They could have noticed that an even distribution of credences would allow them to do much better at predicting the data. And if they had done so, they would have ended up always giving an accurate forecast of the next outcome.

But they didn’t, and they didn’t because the model that exactly fit reality was not in their model set. Their epistemic system didn’t allow them the flexibility needed to realize that they needed to learn from their failures and rethink their priors.

Reality is very messy and complicated and rarely adheres exactly to the nice simple models we construct. It doesn’t seem crazily implausible that we might end up accidentally excluding the true model from our set of possible models, and this example demonstrates a way that Bayesian reasoning can lead you astray in exactly these circumstances.

Bayesianism as natural selection of ideas

There’s a beautiful parallel between Bayesian updating of beliefs and evolutionary dynamics of a population that I want to present.

Let’s start by deriving some basic evolutionary game theory! We’ll describe a population as made up of N different genotypes:

(1, 2, 3, …, N)

Each of these genotypes is represented in some proportion of the population, which we’ll label with an X.

Distribution of genotypes in the population X =  (X1, X2, X3, …, XN)

Each of these fractions will in general change with time. For example, if some ecosystem change occurs that favors genotype 1 over the other genotypes, then we expect X1 to grow. So we’ll write:

Distribution of genotypes over time = (X1(t), X2(t), X3(t), …, XN(t))

Each genotype has a particular fitness that represents how well-adjusted it is to survive onto the next generation in a population.

Fitness of genotypes = (f1, f2, f3, …, fN)

Now, if Genotype 1 corresponds to a predator, and Genotype 2 to its prey, then the fitness of Genotype 2 very much depends on the population of Genotype 1 organisms as well as its own population. In general, the fitness function for a particular genotype is going to depend on the distribution of all the genotypes, not just that one. This means that we should write each fitness as a function of all the Xi:

Fitness of genotypes = (f1(X), f2(X), f3(X), …, fN(X))

Now, what is relevant to the change of any Xi is not the absolute value of the fitness function fi, but the comparison of fi to the average fitness of the entire population. This reflects the fact that natural selection is competitive. It’s not enough to just be fit, you need to be more fit than your neighbors to successfully pass on your genes.

We can find the average fitness of the population by the standard method of summing over each fitness weighted by the proportion of the population that has that fitness:

favg = X1 f1 + X2 f2 + … + XN fN

And since the fitness of a genotype is measured relative to the average fitness of the population, the change of Xi is proportional to the ratio fi / favg. In addition, the change of Xi at time t should be proportional to the size of Xi at time t (larger populations grow faster than small populations). Here is the simplest equation we could write with these properties:

Xi(t + 1) = Xi(t) · fi / favg

This is the discrete replicator equation. Each genotype either grows or shrinks over time according to the ratio of its fitness to the average population fitness. If the fitness of a given genotype is exactly the same as the average fitness, then the proportion of the population that has that genotype stays the same.

Now, how does this relate to Bayesian inference? Instead of a population composed of different genotypes, we have a population composed of beliefs in different theories. The fitness function for each theory corresponds to how well it predicts new evidence. And the evolution over time corresponds to the updating of these beliefs upon receiving new evidence.

Xi(t + 1) → P(Ti | E)
Xi(t) → P(Ti)
fi → P(E | Ti)

What does favg become?

favg = X1 f1 + … + XN fN
becomes
P(E) = P(T1) P(E | T1) + … + P(TN) P(E | TN)

But now our equation describing evolutionary dynamics just becomes identical to Bayes’ rule!

Xi(t + 1) = Xi(t) · fi / favg
becomes
P(Ti | E) = P(Ti) P(E | Ti) / P(E)

This is pretty fantastic. It means that we can quite literally think of Bayesian reasoning as a form of natural selection, where only the best ideas survive and all others are outcompeted. A Bayesian treats their beliefs as if they are organisms in an ecosystem that punishes those that fail to accurately predict what will happen next. It is evolution towards maximum predictive power.
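The correspondence is easy to check numerically. Here is a minimal sketch with made-up numbers, treating three theories about a coin as three genotypes and the observation “the coin landed heads” as the environment:

```python
import numpy as np

priors = np.array([0.5, 0.3, 0.2])       # X_i(t), read as prior credences P(T_i)
likelihoods = np.array([0.8, 0.5, 0.2])  # f_i, read as P(E | T_i) for E = "landed heads"

# One step of the discrete replicator equation: X_i(t+1) = X_i(t) * f_i / f_avg
f_avg = np.dot(priors, likelihoods)      # plays the role of P(E)
replicator_step = priors * likelihoods / f_avg

# Bayes' rule: P(T_i | E) = P(T_i) * P(E | T_i) / P(E)
bayes_posterior = priors * likelihoods / np.dot(priors, likelihoods)

print(replicator_step)   # identical to the Bayesian posterior
print(bayes_posterior)
```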

There are some intriguing hints here of further directions for study. For example, the Bayesian fitness function only depended on the particular theory whose fitness was being evaluated, but it could have more generally depended on all of the different theories as in the original replicator equation.

Plus, the discrete replicator equation is only one simple idealized model of patterns of evolutionary change in populations. There is a continuous replicator equation, where populations evolve smoothly as analytic functions of time. There are also generalizations that introduce mutation, allowing a population to spontaneously generate new genotypes and transition back and forth between similar genotypes. Evolutionary graph theory incorporates population structure into the model, allowing for subtleties regarding complex spatial population interactions.

What would an inference system based off of these more general evolutionary dynamics look like? How would it compare to Bayesianism?

More on random sampling from Nature’s Urn

In a previous post, I developed an analogy between patterns of reasoning and sampling procedures. I want to go a little further with two expansions on this idea.

Scientific laws and domains of validity

First, different sampling procedures can focus on sampling from different regions of the urn. This is analogous to how scientific theories have specific domains of validity that they were built to explain, and in general their conclusions do not spread beyond this domain.

Classical Newtonian mechanics is a great theory to explain slowly swinging pendulums and large gravitating bodies, but if you apply it to particles that are too small, or moving too fast, or too massive, then you’ll get bad results. In general, any scientific law will be known to work within a certain range of energies or sizes or speeds.

By analogy, the Super Persuader was not a good source of evidence, because its sampling procedure was to scour the urn for any black balls it could find, and ignore all white balls. Ideally, we want our truth-seeking enterprises to function like random sampling of balls from an urn. But of course, the way that scientists seek out evidence is not analogous to randomly sampling from the entire urn consisting of all pieces of evidence as to the structure of reality. Instead, a psychologist will focus on one region of the urn, a biologist another, and a physicist another.

In this way, a psychologist can say that the evidence they receive is representative of the general state of evidence in a certain region of the urn. The region of the urn being sampled by the scientist represents the domain of validity of the laws they develop.

Developing this line further, we might imagine that there is a general positioning of pieces of evidence or good arguments in terms of accessibility to humans. Some arguments or ideas or pieces of evidence about reality will lie near the top of the urn, and will be low-hanging fruits for any investigators. (Mixing metaphors!) Others will lie deeper down, requiring more serious thought and dedicated investigation to come across.

Advances in tech can allow scientists to dig deeper into Nature’s urn, expanding the domains of validity of their theories and becoming better acquainted with the structure of reality.

Cognitive biases and generalized distortions of reasoning

Second, a taxonomy of different ways in which reasoning can go wrong naturally arises from the metaphor. Some of these correspond nicely to well-known cognitive biases.

For instance, the sampling procedure used by the Super Persuader involved selectively choosing evidence to support a certain hypothesis. In general, this corresponds to selection biases. A special case of this is motivated reasoning. When we strongly desire a hypothesis to be true, we are more likely to find, remember, and fairly judge evidence in its favor than evidence against it. Selection biases are in general just non-random sampling procedures.

Another class of error is misjudgment, where we draw a black ball, but see it as a white ball. This would correspond to things like the backfire effect, where evidence against a proposition we favor serves to strengthen our belief in it, or just failure to understand an argument or a piece of evidence.

A third class of error is bad extrapolation, where we are sampling randomly from one region of the urn, but then act as if we are sampling from some other region. This would include hasty generalizations and all forms of irrational stereotyping.

Generalizing argument strength

Finally, a weakness of the urn analogy is that it treats all arguments as equally strong. We can fix this by imagining that some balls come clustered together as a single, stronger argument. Additionally, we could imagine argument strength as ball density, and suppose that we actually want to estimate the ratio of the mass of black balls to the mass of white balls. In this way, denser balls affect our judgment of the ratio more severely than less dense ones.

Does fine-tuning give evidence for God?

I used to think that the fine-tuning argument was the strongest argument out there for the existence of a creator deity. I was especially impressed by the apparent magnitude of the fine-tuning – Steven Weinberg has stated that the value of the cosmological constant was fine-tuned to one part in 10^120.

If one takes a naïve (and as we’ll see, incorrect) Bayesian approach to assessing this as evidence, then it looks like this should serve as an incredible amount of evidence for the existence of a God, enough to totally overwhelm all other possible considerations. Why? Because if there is a God, then we expect fine-tuning, while if not, then the fine-tuning looks incredibly unlikely. Given this, the God explanation should receive a credence bump proportional to 10^120 upon updating on the observation of fine-tuning.

As a quick aside before diving into the numbers, there is a lot of dispute about whether or not there even is fine-tuning in our universe. For the purposes of this post, I’m going to ignore all of these disputes, and pretend that there is a strong consensus on this matter. I’ll use Weinberg’s estimate of 10^-120 for the fine-tuning of the cosmological constant. I know that this is controversial, but the point I’m making will stand for even this insanely tiny value.

Okay, so let’s first present a formal version of the fine-tuning argument for God.

F = “The universe is fine-tuned for life.”
G = “A creator deity fine-tuned the universe for life.”

O(G | F) = L(F | G) · O(G)
L(F | G) = P(F | G) / P(F | ~G) ≈ 1 / 10^-120 = 10^120 (an evidence strength of 1200 dB)

So O(G | F) = 10^120 · O(G)

This uses the odds formulation of Bayes’ rule – look it up if you’re unfamiliar.

This argument says that your credences should be adjusted by a factor of 10^120 upon observing the fine-tuning of the universe. In other words, to not be virtually certain that there exists a creator deity that rules the universe after updating on fine-tuning, you’d have to have initially had a credence on the order of 10^-120.

Let me point out that 10^-120 is a really, really small number. It’s virtually impossible to imagine any good reason why you would be justified in having a prior credence on this order of magnitude. Nobody should be that sure about anything. For comparison, a noise a quadrillion times more intense than the threshold of human hearing comes to only 150 dB; 1200 dB of evidence is incomprehensibly stronger than that.

So what’s wrong with this argument? In fact, it fails at the first step. In calculating the strength of the evidence, we only considered two possible hypotheses: either God or, if not, then random coincidence. But there are many other options that we have to factor in as well, most famously the multiverse hypothesis.

But even if there are other hypotheses out there, shouldn’t they just all share the benefit of the credence boost? The existence of another hypothesis that made the same prediction shouldn’t count as a penalty, right?

Wrong! Probabilities have to add up to 1, and you can’t have multiple mutually exclusive competing hypotheses that you have virtual certainty about. Whatever happens when you add other hypotheses must be more subtle than that. So let’s calculate using Bayes’ rule!

O(T | E) = L(E | T) · O(T)
L(E | T) = P(E | T) / P(E | ~T)

For each theory T we consider, we have to take into account all other theories in the denominator of our likelihood function L. We’ll want to keep in mind the following identity:

P(B & C) = 0
implies
P(A | B or C) = [P(A | B) P(B) + P(A | C) P(C)] / [P(B) + P(C)]

So, for instance, let’s divide up our explanations of the fine-tuning F into three mutually exclusive categories: (1) random coincidence C, (2) a deistic God G, and (3) all other explanations, lumped together as O.

L(F | X) = P(F | X) / P(F | ~X)
= P(F | X) · (1 – P(X)) / ∑Y≠X P(F | Y) P(Y)

P(F | C) ≈ 10^-120
P(F | G) ≈ 1
P(F | O) ≈ 1

L(F | C) ≈ 10^-120
L(F | G) ≈ P(~G) / P(O)
L(F | O) ≈ P(~O) / P(G)

O(C | F) = 10^-120 · O(C)
O(G | F) = P(G) / P(O)
O(O | F) = P(O) / P(G)

In the end, what we find is that the “Coincidence” hypothesis has been down-voted completely out of existence, leaving only the “God” hypothesis and the “Other” hypothesis.

And importantly, our final credence in either of these hypotheses is not on the order of magnitude of 1 – 10^-120. The final balance depends entirely just on the ratio of prior credences in the two explanations.

Trial Run

Let’s look at two individuals updating on the observation of fine-tuning.

Atheist
P(G) = .01%
P(O) = 50%

Deist
P(G) = 99%
P(O) = 1%

(The exact details of these numbers aren’t that important, just that they’re somewhat qualitatively accurate.) Their final credences will be:

Atheist
P(G | F) = 0.02%
P(O | F) = 99.98%

Deist
P(G | F) = 99%
P(O | F) = 1%

And we see that nobody ends up significantly updating their religious beliefs on the evidence of fine-tuning. The deist held a worldview in which the random coincidence hypothesis was already ruled out, so the observation of fine-tuning doesn’t change anything for them. And the atheist started out roughly split between the fine-tuning being a coincidence and there being some other non-theistic explanation, but placed very little credence in God. As such, the observation of fine-tuning served as a minor increase in their belief in God (+.01%), while making them extremely confident that there must be some other explanation.
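For concreteness, here is a small sketch of the two updates. The prior numbers are the rough ones from above, and P(F | G) and P(F | O) are both taken to be 1:

```python
def posterior_after_fine_tuning(p_G, p_O, p_F_given_C=1e-120):
    """Update the three disjoint hypotheses C (coincidence), G (God), O (other)
    on the observation F of fine-tuning, assuming P(F | G) = P(F | O) = 1."""
    p_C = 1.0 - p_G - p_O
    unnormalized = {"C": p_C * p_F_given_C, "G": p_G, "O": p_O}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}

print(posterior_after_fine_tuning(p_G=0.0001, p_O=0.50))  # atheist: G ~ 0.02%, O ~ 99.98%
print(posterior_after_fine_tuning(p_G=0.99, p_O=0.01))    # deist:   G ~ 99%,   O ~ 1%
```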

Fine-tuning would only serve as strong evidence for you if you were initially very sure that there was a God, but agnostic about whether God would have designed the universe to accommodate human life or whether its features were just a random coincidence. Even in this case, the bump in credence you’d receive would be nothing like the massive update that seems apparent from a naïve (and wrong) application of Bayesian reasoning.

Noisy Evidence

Scope insensitivity is a cognitive bias that involves a failure to internalize the true scale of quantities. Some of the most striking and frankly depressing examples of this phenomenon involve altruistic behavior, where people care just as much about a cause regardless of how many lives are concerned. In some cases, increasing numbers of affected people result in decreasing willingness to pay.

This issue arises when quantitative metrics don’t line up with our intuitive metrics – 10 billion doesn’t feel 1000 times larger than 10 million. A solution that might be sometimes possible is to adjust the numerical scale you are dealing with to try to get the true scale to match the intuitive scale.

This is a large part of what I think is great about the notion of evidence as noise.

Humans have scope insensitivity with respect to very large and very small probabilities. 99.99% doesn’t feel that different to us from 99.9999%. But they are extremely different. The amount of evidence required to push you from 99.99% to 99.9999% is the same as the amount of evidence that would have pushed you from 9% to 91%. There is a big difference between 99.99% and 99.9999% in terms of the state of knowledge represented.

The problem is that as the probability approaches 100%, the number looks to us like it is barely budging. This can be fixed by making our scale logarithmic. We do this by first converting our probabilities to odds ratios (so 50% becomes 1:1 odds, 75% becomes 3:1 odds, etc), and then taking a logarithm. This is exactly analogous to the decibel scale for noise, so this is called the decibel (dB) scale for evidence.

Probability of A = P(A)
Odds of A = P(A) / P(~A)
Decibel strength of A = 10 · log10(P(A) / P(~A))
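Here is the conversion in code, as a tiny sketch:

```python
import math

def prob_to_db(p):
    """Decibel strength of a belief: 10 * log10 of its odds ratio."""
    return 10 * math.log10(p / (1 - p))

def db_to_prob(db):
    """Inverse conversion, from decibels back to a probability."""
    odds = 10 ** (db / 10)
    return odds / (1 + odds)

print(prob_to_db(0.5))     #  0.0 dB (1:1 odds)
print(prob_to_db(0.75))    # ~4.8 dB (3:1 odds)
print(prob_to_db(0.9999))  # ~40 dB
print(db_to_prob(60))      #  0.999999
```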

Very strong evidence is very noisy, and weak evidence is silent, barely affecting our beliefs. This is also nice because Bayes’ rule becomes additive:

Posterior Odds Ratio = Likelihood Ratio · Prior Odds Ratio
O(T | E) = L(E | T) · O(T)
becomes…
OdB(T | E) = LdB(E | T) + OdB(T)

If your evidence E is equally likely whether or not the theory T is true, then L(E | T) = 1 and LdB(E | T) = 0. Thus you add 0, and end up with the same odds as you started with.

Theories that are very high or very low in credence are very noisy, while those that are around 50% are silent.

Now what’s the difference between 99.99% and 99.9999%?

99.99% = 9999:1 = 40 dB
99.9999% = 999999:1 = 60 dB

A 20 dB difference in strength of belief is a lot easier to wrap your head around than a 0.0099% difference!

In addition, equally strong evidence always looks equally strong when expressed in dB, while it can look increasingly weak when expressed in probabilities.

For example, imagine that somebody comes up to you and claims to be able to read your mind. To test them, you decide to ask her to tell you which number from 1 to 10 is in your head right now. If she gets this right, then this counts as 10 decibels of evidence for her psychic abilities.

L(correct | psychic) = P(correct | psychic) ÷ P(correct | not psychic)
≈ 100% / 10% = 10

10 log₁₀(10) = 10 dB

So if your previous belief in her psychic abilities was at -50 decibels (100,000:1 odds against), then it should now be at -40 decibels (10,000:1 odds against).

The same calculation would tell you that another successful test would nudge you another +10 dB, from -40 to -30. Extrapolation seems to indicate that you should be pretty much agnostic as to whether or not she is psychic after three more such successful tests, and strong believers after only eight total tests.

Initial strength of belief = -50 dB
First test gives evidence of +10 dB
New strength of belief = -40 dB
Four more tests give total evidence of +40 dB
New strength of belief = 0 dB
Three more tests give total evidence of +30 dB
Final strength of belief = +30 dB (99.9%)

This example actually gets things wrong in a very important way. Eight tests like those that I described is probably not sufficient to establish psychic abilities. This is a little off topic, but is useful to go into as a demonstration of how naive usage of Bayes’ rule can lead you off the rails.

Where we went wrong was in the very first step, in calculating the decibel strength of the evidence.

L(correct | psychic) = P(correct | psychic) ÷ P(correct | not psychic)
≈ 100% / 10% = 10

The presumption behind this calculation is that if she were psychic, then she would almost definitely be able to get the number right (≈ 100%), but if not, then she would have a random shot (10%). But “psychic” and “random” are not the only two theories! For instance, maybe the apparent psychic has actually just figured out a masterful method for reading subtle facial movements to guess the number you are thinking of, rather than actually being able to look into your mind.

The face-reading hypothesis seems unlikely, but probably less so than true mind-reading abilities. Let’s give it a decibel score of -20 (corresponding to an initial credence of about 1%). This should barely factor into our initial calculation, so let’s suppose that +10 dB is the actual strength of evidence for psychic abilities.

Now PdB(psychic) goes from -50 dB to -40 dB, and PdB(face-reading) goes from -20 dB to -10 dB. They have both gotten more likely, because they both successfully predicted the outcome! And now for the second test, face-reading should have a bigger effect on the calculation! I’ll skip the algebra and just present the new strengths of evidence for the second test:

LdB(correct | psychic) = 7 dB
LdB(correct | face-reading) = 10 dB

Notice that the evidence is now weaker for the “psychic” hypothesis, because it has a more likely competing hypothesis. The evidence is still equally strong for face-reading, on the other hand, because its competing hypothesis (that she is psychic) is still very weak.

So we update again!

Psychic: -40 dB to -33 dB (.05%)
Face-reading: -10 dB to 0 dB (50%)

Now the face-reading hypothesis is 50% – apparently equally likely to be true and false! This will sway the strength of the evidence for the ‘psychic’ hypothesis even more on the third trial:

LdB(correct | psychic) = 3 dB
LdB(correct | face-reading) = 10 dB

Now with such a likely alternative explanation, the evidence is even weaker than previously for the psychic hypothesis. After our third trial, our beliefs will update as follows:

Psychic: -33 dB to -30 dB (.1%)
Face-reading: 0 dB to 10 dB (90%)

As you can see, the face-reading hypothesis takes off, while the psychic hypothesis ends up staying stuck around .1%.
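Here is a sketch that runs this three-hypothesis update end to end. The priors and hit probabilities are the rough values assumed above, and it reproduces the decibel trajectories from the last few paragraphs:

```python
import math

def to_db(p):
    return 10 * math.log10(p / (1 - p))

# Priors: psychic at -50 dB, face-reading at -20 dB, everything else is "no trick"
beliefs = {"psychic": 1e-5, "face-reading": 0.01, "no trick": 1 - 1e-5 - 0.01}

# Assumed probability of guessing the number correctly under each hypothesis
p_correct = {"psychic": 1.0, "face-reading": 1.0, "no trick": 0.1}

for trial in range(1, 4):
    p_evidence = sum(beliefs[h] * p_correct[h] for h in beliefs)   # P(correct guess)
    beliefs = {h: beliefs[h] * p_correct[h] / p_evidence for h in beliefs}
    print(trial,
          round(to_db(beliefs["psychic"])),       # -40, -33, -30
          round(to_db(beliefs["face-reading"])))  # -10,   0,  10
```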

I’ll talk more about this in a post tomorrow, in which I show how the exact same simple error in our first argument is being made in fine-tuning arguments for God!

Nature’s Urn and Ignoring the Super Persuader

This post is about one of my favorite little-known thought experiments.

Here’s the setup:

You are in conversation with a Super Persuader – an artificial intelligence that has access to an enormously larger pool of information than you, and that is excellent at synthesizing information to form powerful arguments. The Super Persuader is so super at persuading, in fact, that given any proposition, it will be able to construct the most powerful argument possible for that proposition, consisting of the strongest evidence it has access to.

The Super Persuader is going to try and persuade you either that a certain proposition A is true or that it is false. In doing so, you know that it cannot lie, but it can pick and choose the information that it presents to you, giving an incomplete picture.

Finally, you know that the Super Persuader is going to decide which side of the issue to argue based off of a random coin toss: 50% chance they will argue that A is true, and 50% chance they will argue that A is false.

Once the coin is tossed and the Persuader begins to present the evidence, how should you rationally respond? Should you be swayed by the arguments, ignore them, or something else?

Here’s a basic presentation of one response to this thought experiment:

Of course you should be swayed by their arguments! If not, then you end up receiving boatloads of crazily persuasive argumentation and pretending like you’ve heard none of it. This is the very definition of irrationality – closing your eyes to the evidence you have sitting right in front of you! There’s no reason to disregard all of the useful information that you’re getting, just because it’s coming from a source that is trying to persuade you. Regardless of the motives of the Super Persuader, it can only persuade you by giving you honest and genuinely convincing evidence. And a rational agent has no choice but to update their credences on this evidence.

I think that this is a bad argument. Here’s an analogy to help explain why.

Imagine the set of all possible pieces of evidence you could receive for a given proposition as a massive urn filled with marbles. Each marble is a single argument that could be made for the proposition. If the argument is in support of the proposition, then the marble representing it will be black. And if the argument is against the proposition, then the marble representing it will be white.

Now, the question as to whether the proposition is more likely to be true or false is roughly the same as the question of whether there are more black or white marbles in the urn. That is the exact same question if all of the arguments in question are equally strong, and we have no reason for starting out favoring one side over the other.

But now we can think about the actions of the Super Persuader as follows: the Super Persuader has direct access to the urn, and can select any marble it wants. If it wants to persuade you that the proposition is true, then it will just fish through the urn and present you with as many black marbles as it desires, ignoring all the white marbles.

Clearly this process gives you no information as to the true proportion of the marbles that are white versus the proportion that are black. The data you are receiving is contaminated by a ridiculously powerful selection bias. The evidence you see is no longer linked in any way to the truth of the proposition, because regardless of whether or not it is true, you still expect to receive large amounts of evidence for it.

In the end, all of the pieces of evidence you receive are useless, in the same way that a stacked deck is not a reliable source of information about the average card deck.

This has some really weird consequences. For one thing, after your conversation you still have all of that information hanging around in your head (as long as you have a good enough memory). So if anybody asks you what you think about the issue, you will be able to spout off incredibly powerful arguments for exactly one side of the issue. But you’ll also have to concede that you don’t actually strongly believe the conclusion of these arguments. And if you’re asked to present any evidence for not accepting the conclusion, you’ll likely draw a blank, or only be able to produce very unsatisfactory answers. You will certainly not come off as a very rational person!