Does fine-tuning give evidence for God?

I used to think that the fine-tuning argument was the strongest argument out there for the existence of a creator deity. I was especially impressed by the apparent magnitude of the fine-tuning – Steven Weinberg has stated that the value of the cosmological constant is fine-tuned to one part in 10^120.

If one takes a naïve (and as we’ll see, incorrect) Bayesian approach to assessing this as evidence, then it looks like this should serve as an incredible amount of evidence for the existence of a God, enough to totally overwhelm all other possible considerations. Why? Because if there is a God, then we expect fine-tuning, while if not, then the fine-tuning looks incredibly unlikely. Given this, the God explanation should receive a boost in odds by a factor of 10^120 upon updating on the observation of fine-tuning.

As a quick aside before diving into the numbers, there is a lot of dispute about whether there even is fine-tuning in our universe. For the purposes of this post, I’m going to ignore all of these disputes, and pretend that there is a strong consensus on the matter. I’ll use Weinberg’s estimate of 10^-120 for the fine-tuning of the cosmological constant. I know that this is controversial, but the point I’m making will stand even for this insanely tiny value.

Okay, so let’s first present a formal version of the fine-tuning argument for God.

F = “The universe is fine-tuned for life.”
G = “A creator deity fine-tuned the universe for life.”

O(G | F) = L(F | G) · O(G)
L(F | G) = P(F | G) / P(F | ~G) ≈ 1 / 10^-120 = 10^120 (i.e., 1200 dB)

So O(G | F) = 10^120 · O(G)

This uses the odds formulation of Bayes’ rule – look it up if you’re unfamiliar.

This argument says that your odds should be adjusted by a factor of 10^120 upon observing the fine-tuning of the universe. In other words, to not be virtually certain that there exists a creator deity that rules the universe after updating on fine-tuning, you’d have to have initially had a credence on the order of 10^-120.

Let me point out that 10^-120 is a really, really small number. It’s virtually impossible to imagine any good reason why you would be justified in having a prior credence on this order of magnitude. Nobody should be that sure about anything. And 1200 dB is an absurd amount of evidence: the decibel scale for sound starts at the threshold of human hearing at 0 dB, a jet engine comes in at around 140 dB, and 1200 dB would correspond to a sound 10^120 times more intense than that threshold.
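To make the absurdity vivid, here’s a minimal sketch of the naive update in Python (the 10^-120 figure is Weinberg’s estimate from above; the prior is an arbitrary illustration of my own):

```python
import math

# Naive two-hypothesis update: God vs. coincidence, nothing else considered.
likelihood_ratio = 1 / 1e-120             # L(F | G) = P(F | G) / P(F | ~G)
print(10 * math.log10(likelihood_ratio))  # 1200 dB of evidence

# Even an absurdly skeptical prior gets steamrolled by an update this strong:
prior_odds = 1e-100                       # prior odds in favor of G
posterior_odds = likelihood_ratio * prior_odds
print(posterior_odds / (1 + posterior_odds))  # posterior P(G | F) ≈ 1.0
```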

So what’s wrong with this argument? In fact, it fails at the very first step. In calculating the strength of the evidence, we only considered two possible hypotheses: God and random coincidence. But there are many other options that we have to factor in as well, most famously the multiverse hypothesis.

But even if there are other hypotheses out there, shouldn’t they just all share the benefit of the credence boost? The existence of another hypothesis that made the same prediction shouldn’t count as a penalty, right?

Wrong! Probabilities have to add up to 1, and you can’t have multiple mutually exclusive competing hypotheses that you have virtual certainty about. Whatever happens when you add other hypotheses must be more subtle than that. So let’s calculate using Bayes’ rule!

O(T | E) = L(E | T) · O(T)
L(E | T) = P(E | T) / P(E | ~T)

For each theory T we consider, we have to take into account all other theories in the denominator of our likelihood function L. We’ll want to keep in mind the following identity:

P(B & C) = 0
implies
P(A | B or C) = [P(A | B) P(B) + P(A | C) P(C)] / [P(B) + P(C)]

So, for instance, let’s divide up our explanations of the fine-tuning F into three mutually exclusive categories: (1) random coincidence, C, (2) a deistic God, G, and (3) all other explanations, lumped together as O.

L(F | X) = P(F | X) / P(F | ~X)
= P(F | X) (1 – P(X)) / ∑Y≠X P(F | Y) P(Y)

P(F | C) ≈ 10^-120
P(F | G) ≈ 1
P(F | O) ≈ 1

L(F | C) ≈ 10^-120
L(F | G) ≈ P(~G) / P(O)
L(F | O) ≈ P(~O) / P(G)

O(C | F) ≈ 10^-120 · O(C)
O(G | F) ≈ P(G) / P(O)
O(O | F) ≈ P(O) / P(G)

In the end, what we find is that the “Coincidence” hypothesis has been down-voted completely out of existence, leaving only the “God” hypothesis and the “Other” hypothesis.

And importantly, our final credence in either of these hypotheses is not on the order of 1 – 10^-120. The final balance depends entirely on the ratio of our prior credences in the two surviving explanations.

Trial Run

Let’s look at two individuals updating on the observation of fine-tuning.

Atheist
P(G) = .01%
P(O) = 50%

Deist
P(G) = 99%
P(O) = 1%

(The exact details of these numbers aren’t that important, just that they’re somewhat qualitatively accurate.) Their final credences will be:

Atheist
P(G | F) = 0.02%
P(O | F) = 99.98%

Deist
P(G | F) = 99%
P(O | F) = 1%

And we see that nobody ends up significantly updating their religious beliefs on the evidence of fine-tuning. The deist held a worldview in which the random coincidence hypothesis was already ruled out, so the observation of fine-tuning doesn’t change anything for them. And the atheist initially split their credence roughly evenly between random coincidence and non-theistic explanations, while giving almost none to God. As such, the observation of fine-tuning served as a minor increase in their belief in God (+.01%), while making them extremely confident that there must be some other explanation.
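Here’s a quick sketch of this calculation in Python, if you want to play with the numbers yourself (the function name and the priors are just my illustrative choices):

```python
def update_on_fine_tuning(p_G, p_O, p_F_given_C=1e-120):
    """Posterior over {Coincidence, God, Other} after observing fine-tuning F,
    assuming P(F | G) = P(F | O) ≈ 1 as in the text."""
    p_C = 1 - p_G - p_O
    # Unnormalized posteriors: P(X | F) ∝ P(F | X) · P(X)
    posterior = {"C": p_F_given_C * p_C, "G": p_G, "O": p_O}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

print(update_on_fine_tuning(p_G=0.0001, p_O=0.50))  # atheist: G ≈ 0.02%, O ≈ 99.98%
print(update_on_fine_tuning(p_G=0.99, p_O=0.01))    # deist:   G = 99%,   O = 1%
```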

Fine-tuning would only serve as strong evidence for you if you were initially very sure that there was a God, but agnostic about whether the universe’s suitability for life was the product of God’s design or of pure random coincidence. Even in this case, the bump in credence you’d receive would be nothing like the massive update that seems apparent from a naïve (and wrong) application of Bayesian reasoning.

Noisy Evidence

Scope insensitivity is a cognitive bias that involves a failure to internalize the true scale of quantities. Some of the most striking and frankly depressing examples of this phenomenon involve altruistic behavior, where people care just as much about a cause regardless of how many lives are at stake. In some cases, increasing the number of affected people actually results in decreasing willingness to pay.

This issue arises when quantitative metrics don’t line up with our intuitive metrics – 10 billion doesn’t feel 1000 times larger than 10 million. A solution that is sometimes possible is to adjust the numerical scale you are dealing with so that the true scale matches the intuitive scale.

This is a large part of what I think is great about the notion of evidence as noise.

Humans have scope insensitivity with respect to very large and very small probabilities. 99.99% doesn’t feel that different to us from 99.9999%. But in terms of the state of knowledge represented, they are extremely different: the amount of evidence required to push you from 99.99% to 99.9999% is about the same as the amount that would have pushed you from 9% to 91%.

The problem is that as the probability approaches 100%, the number looks to us like it is barely budging. This can be fixed by making our scale logarithmic. We do this by first converting our probabilities to odds ratios (so 50% becomes 1:1 odds, 75% becomes 3:1 odds, etc), and then taking a logarithm. This is exactly analogous to the decibel scale for noise, so this is called the decibel (dB) scale for evidence.

Probability of A = P(A)
Odds of A = P(A) / P(~A)
Decibel strength of A = 10 · log₁₀(P(A) / P(~A))
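If it helps, here’s the conversion spelled out as two small Python helpers (just a sketch of the definitions above):

```python
import math

def prob_to_db(p):
    """Decibel strength of belief: 10 · log10 of the odds."""
    return 10 * math.log10(p / (1 - p))

def db_to_prob(db):
    odds = 10 ** (db / 10)
    return odds / (1 + odds)

print(prob_to_db(0.75))      # ≈ 4.8 dB (3:1 odds)
print(prob_to_db(0.9999))    # ≈ 40 dB
print(prob_to_db(0.999999))  # ≈ 60 dB
print(db_to_prob(-50))       # ≈ 0.00001 (100,000:1 odds against)
```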

Very strong evidence is very noisy, and weak evidence is silent, barely affecting our beliefs. This is also nice because Bayes’ rule becomes additive:

Posterior Odds Ratio = Likelihood Ratio · Prior Odds Ratio
O(T | E) = L(E | T) · O(T)
becomes…
OdB(T | E) = LdB(E | T) + OdB(T)

If your evidence E is equally likely whether or not the theory T is true, then L(E | T) = 1 and LdB(E | T) = 0. Thus you add 0, and end up with the same odds as you started with.

Theories that are very high or very low in credence are very noisy, while those that are around 50% are silent.

Now what’s the difference between 99.99% and 99.9999%?

99.99% = 9999:1 = 40 dB
99.9999% = 999999:1 = 60 dB

A 20 dB difference in strength of belief is a lot easier to wrap your head around than a 0.0099% difference!

In addition, equally strong evidence always looks equally strong when expressed in dB, while it can look increasingly weak when expressed in probabilities.

For example, imagine that somebody comes up to you and claims to be able to read your mind. To test her, you decide to ask her to tell you which number from 1 to 10 is in your head right now. If she gets this right, then this counts as 10 decibels of evidence for her psychic abilities.

L(correct | psychic) = P(correct | psychic) ÷ P(correct | not psychic)
≈ 100% / 10% = 10

10 log₁₀(10) = 10 dB

So if your previous belief in her psychic abilities was at -50 decibels (100,000:1 odds against), then it should now be at -40 decibels (10,000:1 odds against).

The same calculation would tell you that another successful test would nudge you another +10 dB, from -40 to -30. Extrapolation seems to indicate that you should be pretty much agnostic as to whether or not she is psychic after three more such successful tests, and a strong believer after only eight total tests.

Initial strength of belief = -50 dB
First test gives evidence of +10 dB
New strength of belief = -40 dB
Four more tests give total evidence of +40 dB
New strength of belief = 0 dB
Three more tests give total evidence of +30 dB
Final strength of belief = +30 dB (99.9%)

This example actually gets things wrong in a very important way. Eight tests like those that I described are probably not sufficient to establish psychic abilities. This is a little off topic, but it’s useful to go into as a demonstration of how naïve usage of Bayes’ rule can lead you off the rails.

Where we went wrong was in the very first step, in calculating the decibel strength of the evidence.

L(correct | psychic) = P(correct | psychic) ÷ P(correct | not psychic)
≈ 100% / 10% = 10

The presumption behind this calculation is that if she were psychic, then she would almost certainly get the number right (≈ 100%), but if not, then she would have a random shot (10%). But “psychic” and “random” are not the only two theories! For instance, maybe the apparent psychic has actually just figured out a masterful method for reading subtle facial movements to infer the number you’re thinking of, rather than actually being able to look into your mind.

The face-reading hypothesis seems unlikely, but probably less so than true mind-reading abilities. Let’s give it a decibel score of -20 (corresponding to an initial credence of about 1%). This should barely factor into our initial calculation, so let’s suppose that +10 dB is the actual strength of evidence for psychic abilities.

Now OdB(psychic) goes from -50 dB to -40 dB, and OdB(face-reading) goes from -20 dB to -10 dB. They have both gotten more likely, because they both successfully predicted the outcome! And now, for the second test, face-reading will have a bigger effect on the calculation. I’ll skip the algebra and just present the new strengths of evidence for the second test:

L(correct | psychic) = 7 dB
L(correct | face-reading) = 10 dB

Notice that the evidence is now weaker for the “psychic” hypothesis, because it has a more likely competing hypothesis. The evidence is still equally strong for face-reading, on the other hand, because its competing hypothesis (that she is psychic) is still very weak.

So we update again!

Psychic: -40 dB to -33 dB (.05%)
Face-reading: -10 dB to 0 dB (50%)

Now the face-reading hypothesis is 50% – apparently equally likely to be true and false! This will sway the strength of the evidence for the ‘psychic’ hypothesis even more on the third trial:

L(correct | psychic) = 3 dB
L(correct | face-reading) = 10 dB

Now with such a likely alternative explanation, the evidence is even weaker than previously for the psychic hypothesis. After our third trial, our beliefs will update as follows:

Psychic: -33 dB to -30 dB (.1%)
Face-reading: 0 dB to 10 dB (90%)

As you can see, the face-reading hypothesis takes off, while the psychic hypothesis ends up staying stuck around .1%.
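If you’d like to check these numbers, here’s a sketch of the full three-hypothesis update loop in Python (the priors are the ones stipulated above; the rounding hides small differences like 9.6 dB vs. 10 dB):

```python
import math

def db(x):
    return 10 * math.log10(x)

# Priors: psychic at -50 dB, face-reading at -20 dB, the rest is random guessing.
p = {"psychic": 1e-5, "face-reading": 1e-2}
p["random"] = 1 - p["psychic"] - p["face-reading"]

# Probability of a correct guess under each hypothesis:
p_correct = {"psychic": 1.0, "face-reading": 1.0, "random": 0.1}

for trial in (1, 2, 3):
    # Strength of the evidence for each hypothesis, with all rivals in P(E | ~T):
    for h in ("psychic", "face-reading"):
        p_correct_given_not_h = sum(
            p_correct[k] * p[k] for k in p if k != h) / (1 - p[h])
        print(trial, h, round(db(p_correct[h] / p_correct_given_not_h)), "dB")
    # Bayesian update on the observed correct guess:
    total = sum(p_correct[k] * p[k] for k in p)
    p = {k: p_correct[k] * p[k] / total for k in p}

print({k: round(db(v / (1 - v))) for k, v in p.items()})  # final strengths in dB
```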

I’ll talk more about this in a post tomorrow, in which I’ll show how the exact same simple error we just made is at work in fine-tuning arguments for God!

Nature’s Urn and Ignoring the Super Persuader

This post is about one of my favorite little-known thought experiments.

Here’s the setup:

You are in conversation with a Super Persuader – an artificial intelligence that has access to an enormously larger pool of information than you, and that is excellent at synthesizing information to form powerful arguments. The Super Persuader is so super at persuading, in fact, that given any proposition, it will be able to construct the most powerful argument possible for that proposition, consisting of the strongest evidence it has access to.

The Super Persuader is going to try to persuade you either that a certain proposition A is true or that it is false. In doing so, you know that it cannot lie, but it can pick and choose the information that it presents to you, giving an incomplete picture.

Finally, you know that the Super Persuader is going to decide which side of the issue to argue based on a random coin toss: 50% chance it will argue that A is true, and 50% chance it will argue that A is false.

Once the coin is tossed and the Persuader begins to present the evidence, how should you rationally respond? Should you be swayed by the arguments, ignore them, or something else?

Here’s a basic presentation of one response to this thought experiment:

Of course you should be swayed by their arguments! If not, then you end up receiving boatloads of crazily persuasive argumentation and pretending like you’ve heard none of it. This is the very definition of irrationality – closing your eyes to the evidence you have sitting right in front of you! There’s no reason to disregard all of the useful information that you’re getting, just because it’s coming from a source that is trying to persuade you. Regardless of the motives of the Super Persuader, it can only persuade you by giving you honest and genuinely convincing evidence. And a rational agent has no choice but to update their credences on this evidence.

I think that this is a bad argument. Here’s an analogy to help explain why.

Imagine the set of all possible pieces of evidence you could receive for a given proposition as a massive urn filled with marbles. Each marble is a single argument that could be made for the proposition. If the argument is in support of the proposition, then the marble representing it will be black. And if the argument is against the proposition, then the marble representing it will be white.

Now, the question as to whether the proposition is more likely to be true or false is roughly the same as the question of whether there are more black or white marbles in the urn. It is exactly the same question if all of the arguments in question are equally strong, and we have no reason at the outset to favor one side over the other.

But now we can think about the actions of the Super Persuader as follows: the Super Persuader has direct access to the urn, and can select any marble it wants. If it wants to persuade you that the proposition is true, then it will just fish through the urn and present you with as many black marbles as it desires, ignoring all the white marbles.

Clearly this process gives you no information as to the true proportion of the marbles that are white versus the proportion that are black. The data you are receiving is contaminated by a ridiculously powerful selection bias. The evidence you see is no longer linked in any way to the truth of the proposition, because regardless of whether or not it is true, you still expect to receive large amounts of evidence for it.

In the end, all of the pieces of evidence you receive are useless, in the same way that a stacked deck is not a reliable source of information about the average card deck.
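Here’s a toy version of the urn in Python (my own illustration, with made-up numbers): an urn of 10 marbles, a uniform prior over how many are black, and a Persuader who shows you 3 black marbles.

```python
from fractions import Fraction

prior = {n_black: Fraction(1, 11) for n_black in range(11)}

# The Persuader chooses what to show: P(shown 3 black | n_black) = 1 for any
# composition with at least 3 black marbles. The likelihood is flat, so the
# posterior carries (almost) no information about the composition.
likelihood = {n: 1 if n >= 3 else 0 for n in prior}
total = sum(likelihood[n] * prior[n] for n in prior)
print({n: likelihood[n] * prior[n] / total for n in prior})

# Contrast: 3 black marbles drawn honestly at random (with replacement) would
# strongly favor black-heavy compositions.
honest = {n: Fraction(n, 10) ** 3 for n in prior}
total = sum(honest[n] * prior[n] for n in prior)
print({n: honest[n] * prior[n] / total for n in prior})
```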

This has some really weird consequences. For one thing, after your conversation you still have all of that information hanging around in your head (as long as you have a good enough memory). So if anybody asks you what you think about the issue, you will be able to spout off incredibly powerful arguments for exactly one side of the issue. But you’ll also have to concede that you don’t actually strongly believe the conclusion of these arguments. And if you’re asked to present any evidence for not accepting the conclusion, you’ll likely draw a blank, or only be able to produce very unsatisfactory answers. You will certainly not come off as a very rational person!

Entropy vs relative entropy

This post is about the relationship between entropy and relative entropy. This relationship is subtle but important – purely maximizing entropy (MaxEnt) is not equivalent to Bayesian conditionalization except in special cases, while maximizing relative entropy (ME) is. In addition, the justifications for MaxEnt are beautiful and grounded in fundamental principles of normative epistemology. Do these justifications carry over to maximizing relative entropy?

To a degree, yes. We’ll see that maximizing relative entropy is a more general procedure than maximizing entropy, and reduces to it in special cases. The cases where MaxEnt gives different results from ME can be understood in terms of an interesting distinction between commuting and non-commuting observations.

So let’s get started!

We’ll solve three problems: first, using MaxEnt to find an optimal distribution with a single constraint C1; second, using MaxEnt to find an optimal distribution with constraints C1 and C2; and third, using ME to find the optimal distribution with C2 given the starting distribution found in the first problem.

Part 1

Problem: Maximize – ∫ P logP dx with constraints
∫ P dx = 1
∫ C1[P] dx = 0

∂/∂P ( – P1 logP1 + (α + 1) P1 + βC1[P1] ) = 0
– logP1 + α + β C1’[P1] = 0

Part 2

Problem: Maximize – ∫ P logP dx with constraints
∫ P dx = 1
∫ C1[P] dx = 0
∫ C2[P] dx = 0

∂/∂P ( – P2 logP2 + (α’ + 1) P2 + β’C1[P2] + λ C2[P2] ) = 0
– logP2 + α’ + β’ C1’[P2] + λ C2’[P2] = 0

Part 3

Problem: Maximize – ∫ P log(P / P1) dx with constraints
∫ P dx = 1
∫ C2[P] dx = 0

∂/∂P ( – P3 logP3 + P3 logP1 + (α’’ + 1)P3 + λ’ C2[P3] ) = 0
– logP3 + α’’ + logP1 + λ’ C2’[P3] = 0
– logP3 + α’’ + α + β C1’[P1] + λ’ C2’[P3] = 0   (substituting logP1 = α + β C1’[P1], from Part 1)
– logP3 + α’’’ + β C1’[P1] + λ’ C2’[P3] = 0

We can now compare our answer from Part 2 to our answer from Part 3. These are two ways of incorporating the same two constraints – all at once with MaxEnt, or sequentially with ME. While they are clearly different solutions, they have interesting similarities.

MaxEnt
– logP2 + α’ + β’ C1’[P2] + λ C2’[P2] = 0
∫ P2 dx = 1
∫ C1[P2] dx = 0
∫ C2[P2] dx = 0

ME
– logP3 + α’’’ + β C1’[P1] + λ’ C2’[P3] = 0
∫ P3 dx = 1
∫ C1[P1] dx = 0
∫ C2[P3] dx = 0

The equations are almost identical. The only difference is in how they treat the old constraint. In MaxEnt, the old constraint is treated just like the new constraint – a condition that must be satisfied for the final distribution.

But in ME, the old constraint is no longer required to be satisfied by the final distribution! Instead, the requirement is that the old constraint be satisfied by your initial distribution!

That is, MaxEnt takes all previous information, and treats it as current information that must constrain your current probability distribution.

On the other hand, ME treats your previous information as constraints only on your starting distribution, and only ensures that your new distribution satisfy the new constraint!

When might this be useful?

Well, say that the first piece of information you received, C1, was the expected value of some measurable quantity. Maybe it was that x̄ = 5.

But if the second piece of information C2 was an observation of the exact value of x, then we clearly no longer want our new distribution to be constrained to have mean x̄. After all, the exact value of x will in general differ from the previously expected value.

E(x) vs x

Once we have found the exact value of x, all previous information relating to the value of x is screened off, and should no longer be taken as constraints on our distribution! And this is exactly what ME does, and MaxEnt fails to do.

What about a case where the old information stays relevant? For instance, an observation of the values of a certain variable is not ‘cancelled out’ by a later observation of another variable. Observations can’t be un-observed. Does ME respect these types of constraints?

Yes!

Observations of variables are represented by constraints that set the distribution over those variables to delta-functions. And when your old distribution contains a delta function, that delta function will still stick around in your new distribution, ensuring that the old constraint is still satisfied.

P_old(x) ∝ δ(x – x’)
implies
P_new(x) ∝ δ(x – x’)

The class of observations that are made obsolete by new observations is called the class of non-commuting observations. They are given this name because for such observations, the order in which you process the information is essential. Observations for which the order of processing doesn’t matter are called commuting observations.

In summation: maximizing relative entropy allows us to take into account subtle differences in the type of evidence we receive, such as whether or not old data is made obsolete by new data. And mathematically, maximizing relative entropy is equivalent to maximizing ordinary entropy with whatever new constraints were not included in your initial distribution, as well as an additional constraint relating to the value of your old distribution. While the old constraints are not guaranteed to be satisfied by your new distribution, the information about them is preserved in the form of the prior distribution that is a factor in the new distribution.
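A discrete sketch may help make this concrete. Below is the classic die version of the problem in Python (my own toy example): MaxEnt first fits a distribution with mean 4.5, and then ME conditionalizes on an exact observation, abandoning the old mean constraint.

```python
import math

faces = [1, 2, 3, 4, 5, 6]

def maxent_mean(beta):
    """Mean of the MaxEnt distribution p(x) ∝ exp(beta · x) over the faces."""
    w = [math.exp(beta * x) for x in faces]
    return sum(x * wi for x, wi in zip(faces, w)) / sum(w)

# Constraint C1: E[x] = 4.5. Solve for the Lagrange multiplier by bisection.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if maxent_mean(mid) < 4.5 else (lo, mid)
w = [math.exp(lo * x) for x in faces]
p1 = [wi / sum(w) for wi in w]
print([round(p, 3) for p in p1])   # MaxEnt distribution with mean 4.5

# Constraint C2: we observe the die actually landed on 2 (a delta function).
# ME: new distribution ∝ old distribution · delta at x = 2.
p3 = [pi if x == 2 else 0.0 for x, pi in zip(faces, p1)]
s = sum(p3)
p3 = [pi / s for pi in p3]
print(p3)  # all mass on x = 2: the old mean constraint is no longer satisfied
```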

Fun with Akaike

The Akaike information criterion is a metric for model quality that naturally arises from the principle of maximum entropy. It balances predictive accuracy against model complexity, encoding a formal version of Occam’s razor and solving problems of overfitting. I’m just now learning about how to use this metric, and will present a simple example that shows off a lot of the features of this framework.

Suppose that we have two coins. We can ask, for each coin, what the probability is that it lands heads. Call these probabilities p and q.

Two classes of models are (1) those that say that p = q and (2) those that say that p ≠ q. The first class of models is simpler in an important sense – its members can be characterized by a single parameter p. The second class, on the other hand, requires two separate parameters, one for each coin.

The number of parameters (k) used by our model is one way to measure model complexity. But of course, we can also test our models by getting experimental data. That is, we can flip each coin a bunch of times, record the results, and see how they favor one model over another.

One common way of quantifying the empirical success of a given model is by looking at the maximum value of its likelihood function L. This is the function that tells you how likely your data was, given a particular model. If Model 2’s best fit predicts the data better than Model 1’s best fit, then this should count in favor of Model 2.

So how do we combine these pieces of information – k (the number of parameters in a model) and L (the predictive success of the model)? Akaike’s criterion says to look at the following metric:

AIC = 2k – 2 lnL

The smaller the value of this quantity, the better your model is.

So let’s apply this to an arbitrary data set:

n1 = number of times coin 1 landed heads
n2 = number of times coin 1 landed tails
m1 = number of times coin 2 landed heads
m2 = number of times coin 2 landed tails

For convenience, we’ll also call the total flips of coin 1 N, and the total flips of coin 2 M.

First, let’s look at how Model 1 (the one that says that the two coins have an equal chance of landing heads) does on this data. This model says that every toss, of either coin, lands heads with probability p and tails with probability 1 – p.

L1 = C(N, n1) C(M, m1) p^(n1+m1) (1 – p)^(n2+m2)

The two choose functions C(N, n1) and C(M, m1) are there to ensure that this function is nicely normalized. Intuitively, they arise from the fact that any given number of heads could be obtained in many possible orderings, and all of these orderings must be summed up.

This function finds its peak value at the following value of p:

p = (n1 + m1) / (N + M)
L1,max = C(N, n1) C(M, m1) (n1 + m1)^(n1+m1) (n2 + m2)^(n2+m2) / (N + M)^(N+M)

By Stirling’s approximation, this becomes:

ln(L1,max) ≈ – ln(F) – ½ ln(G)
where F = (N + M)^(N+M) / (N^N M^M) · n1^n1 m1^m1 / (n1 + m1)^(n1+m1) · n2^n2 m2^m2 / (n2 + m2)^(n2+m2)
and G = n1 n2 m1 m2 / (N M)

With this, our Akaike information criterion for Model 1 tells us:

AIC1 = 2 + 2 ln(F) + ln(G)

Moving on to Model 2, we now have two different parameters p and q to vary. The likelihood of our data given these two parameters is:

L2 = C(N, n1) C(M, m1) p^n1 (1 – p)^n2 q^m1 (1 – q)^m2

The values of p and q that make this data most likely are:

p = n1 / N
q = m1 / M
So L2,max = C(N, n1) C(M, m1) n1^n1 m1^m1 n2^n2 m2^m2 / (N^N M^M)

And again, using Stirling’s approximation, we get:

ln(L2,max) ≈ – ½ ln(G)
So AIC2 = 4 + ln(G)

We now just need to compare these two AICs to see which model is preferable for a given set of data:

AIC2 = 4 + ln(G)
AIC1 = 2 + 2 ln(F) + ln(G)

AIC2 – AIC1 = 2 – 2 lnF
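Here’s a sketch of this comparison in Python, using exact log-likelihoods (via lgamma) instead of Stirling’s approximation; the helper names are mine:

```python
import math

def ln_choose(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def aic1(n1, n2, m1, m2):
    """Model 1: one shared heads-probability p (k = 1)."""
    N, M = n1 + n2, m1 + m2
    p = (n1 + m1) / (N + M)
    ln_L = (ln_choose(N, n1) + ln_choose(M, m1)
            + (n1 + m1) * math.log(p) + (n2 + m2) * math.log(1 - p))
    return 2 * 1 - 2 * ln_L

def aic2(n1, n2, m1, m2):
    """Model 2: separate heads-probabilities p and q (k = 2)."""
    N, M = n1 + n2, m1 + m2
    p, q = n1 / N, m1 / M
    ln_L = (ln_choose(N, n1) + ln_choose(M, m1)
            + n1 * math.log(p) + n2 * math.log(1 - p)
            + m1 * math.log(q) + m2 * math.log(1 - q))
    return 2 * 2 - 2 * ln_L

# Example: coin 1 lands heads twice as often as coin 2 (the pattern of Case 2 below):
print(aic2(4, 2, 2, 4) - aic1(4, 2, 2, 4))          # > 0: Model 1 preferred
print(aic2(40, 20, 20, 40) - aic1(40, 20, 20, 40))  # < 0: Model 2 preferred
```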

Let’s look at two cases that are telling. The first case will be that we find that both coin 1 and coin 2 end up landing heads an equal proportion of the time, and for simplicity, both coins are tossed the same number of times. Formally:

Case 1: N = M, n1 = m1, n2 = m2

In this case, F becomes 1, so lnF becomes 0.

AIC2 – AIC1 = 2 > 0
So Model 1 is preferable.

This makes sense! After all, if the two coins are equally likely to land heads, then our two models do equally well at predicting the data. But Model 1 is simpler, involving only a single parameter, so it is preferable. AIC gives us a precise numerical criterion by which to judge how preferable Model 1 is!

Okay, now let’s consider a case where coin 1 lands heads much more often than coin 2.

Case 2: N = M, n1 = 2m1, 2n2 = m2

Now if we go through the algebra, we find:

F = 4^N (4/27)^(m1+n2) ≈ 1.12^N
So lnF ≈ N ln(1.12) ≈ .11 N
So AIC2 – AIC1 ≈ 2 – .22 N

This quantity is larger than 0 when N is less than about 9, and becomes smaller than zero for all larger values.

Which means that for Case 2, small enough data sets still allow us to go with Model 1, but as we get more data, the predictive accuracy of Model 2 eventually wins out!

It’s worth pointing out here that the Akaike information criterion is an approximation to the technique of maximizing relative entropy, and this approximation assumes large sets of data. Given this, it’s not clear how reliable our critical value of N ≈ 9 is – that’s hardly a large data set.

Let’s do one last thing with our simple models.

As our two coins become more and more similar in how often they land heads, we expect Model 1 to last longer before Model 2 ultimately wins out. Let’s calculate a general relationship between the similarity of the two coins’ heads-to-tails ratios and the amount of data it takes for Model 1 to lose out.

Case 3: N = M, n1 = r·m1, r·n2 = m2

r here is our ratio p/q – the chance of heads on coin 1 over the chance of heads on coin 2. Skipping ahead through the algebra, we find:

lnF = 2N ln( 2 r^(r/(r+1)) / (r + 1) )

Model 2 becomes preferable to Model 1 when AIC2 becomes smaller than AIC1, so we can find the critical point by setting ∆AIC = 2 – 2 lnF = 0:

lnF = 2N ln( 2 r^(r/(r+1)) / (r + 1) ) = 1
N = 1 / ( 2 ln( 2 r^(r/(r+1)) / (r + 1) ) )

We can see that as r goes to 1, N goes to ∞. We can see how quickly this happens by doing some asymptotics:

r = 1 + ε
ln( 2 r^(r/(r+1)) / (r + 1) ) ≈ ε²/8 for small ε
So N ≈ 4/ε², or ε ≈ 2/√N

N goes to infinity quadratically as r approaches 1 linearly. This gives us a very rough idea of how similar our coins must be for us to treat them as essentially identical. We can use our earlier result that at r = 2 the critical data size is N ≈ 9 to start off a table:

r N
2 ≈ 9
1.1 ≈ 440
1.01 ≈ 40,000
1.001 ≈ 4,000,000
1.0001 ≈ 400,000,000
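The table is easy to recompute. Here’s a small Python check of the critical N as a function of r (a sketch based on the formulas above):

```python
import math

def ln_F_per_flip(r):
    """ln F / N for the Case 3 pattern, from F = [2 · r^(r/(r+1)) / (r+1)]^(2N)."""
    return 2 * math.log(2 * r ** (r / (r + 1)) / (r + 1))

def critical_N(r):
    """Data size at which ΔAIC = 2 - 2 ln F crosses zero."""
    return 1 / ln_F_per_flip(r)

for r in (2, 1.1, 1.01, 1.001, 1.0001):
    print(r, f"{critical_N(r):.3g}")  # ≈ 9, ≈ 440, ≈ 4e4, ≈ 4e6, ≈ 4e8
```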

Anthropic argument for common priors

(Idea from Robin Hanson and Tyler Cowen’s 2004 paper Are Disagreements Honest?)

One common argument relating to common priors is that two rational agents with all the same information (including no information at all) could have no possible grounds on which to disagree. Priors by definition refer to the state of knowledge before either agent had any evidence relevant to a given proposition. So there is no information that either agent could have that would allow a difference in priors.

A response to this is that some information that we have is inherently private and unique to us. For instance, you and I might have differences in intelligence, in ways of conceptualizing the world, or in the things we innately find intuitively plausible. All of these differences may count as important information in shaping our priors on a given subject, before we ever encounter a single piece of evidence relevant to the subject.

Here’s a really weird argument for why even these differences should not count. If we use anthropic reasoning, and treat our own existence and the details of our brain and body as just another thing to be conditioned on, then even these private intimate details are simply contingent facts about the world that are to be treated as evidence. Before you’ve conditioned on your own existence, you should be agnostic as to which set of brain/body/mind out of all the possible sets of observers “you” will end up being. You must imagine yourself behind Rawls’ veil of ignorance, a disembodied reasoner that is identical to all other such reasoners. So there is no conceivable reason why your prior should differ from anybody else’s – prior to anthropic conditioning, you must treat yourself as literally the same entity as them.

In less out-there terms, if you encounter somebody with an apparently different prior from you, then you should consider “Hmm, what if I were born as this person, instead of myself?” The answer to which is, of course, you would have had the same priors as them. Which means that your difference in “priors” is actually a difference of posteriors resulting from conditioning on the arbitrary choice of body/brain/experiences you ended up with.

In addition, by Aumann’s agreement theorem, any apparent differences in priors that become common knowledge should quickly go away, once they are realized to be merely differences in posteriors. Essentially, any differences in priors that last between two rational individuals are signs that they are arbitrarily favoring their own existence in considerations of what prior they should use.

Why you should be a Bayesian

In this post, I take Bayesianism to be the following normative epistemological claim: “You should treat your beliefs like probabilities, and reason according to the axioms of probability theory.”

Here are a few reasons why I support this claim:

I. Logic is not enough

Reasoning deductively from premises to conclusion is a wonderfully powerful tool when it can be applied. If you have absolute certainty in some set of premises, and these premises entail a new conclusion, then you can extend your certainty to the new conclusion. Alternatively, you can state clearly the conditions under which you would grant certain belief, in the form of a conditional argument (if you were to convince me that A and B are true, then I would believe that C is true).

This is great for mathematicians proving theorems about abstract logical entities. But in the real world, deductive inference is simply not enough to account for the types of problems we face. We are constantly reasoning in a condition of uncertainty, where we have multiple competing theories about what’s going on, and we seek evidence – partial evidence, not deductively complete evidence – as to which of these theories we should favor.

If you want to know how to form beliefs about the parts of reality that aren’t clear-cut and certain, then you need to go beyond pure logic.

II. Probability theory is a natural extension of logic

Cox’s theorem shows that any system of plausible reasoning – modifying and updating beliefs in the presence of uncertainty – that is consistent with logic and a few minimal assumptions about normative reasoning is necessarily isomorphic to probability theory.

The contrapositive of this is that any system of reasoning under uncertainty that isn’t ultimately functionally equivalent to Bayesianism is either logically inconsistent or in violation of other common-sense axioms of reasoning.

In other words, probability theory is the best candidate that we have for extending logic into the domain of the uncertain. It is about what is likely, not certain, to be true, and the way that we should update these assessments when receiving new information. In turn, probability theory contains ordinary logic as a special case, in the limit of absolute certainty.

III. Non-Bayesian systems of plausible reasoning result in inconsistencies and irrational behavior

Dutch-book arguments prove that any agent that is violating the axioms of probability theory can be exploited by cleverly capitalizing on logical inconsistencies in their beliefs. This combines a pragmatic argument (non-Bayesians are worse off in the long run) with an epistemic argument (non-Bayesians are vulnerable to logical inconsistencies in their preferences).

IV. You should be honest about your uncertainty

The principle of maximizing entropy mandates a unique way to set beliefs given your evidence, such that you make no presumptions about knowledge that you don’t have. This principle is fully consistent with standard Bayesian conditionalization (and, in its relative-entropy form, equivalent to it).

In other words, Bayesianism is about epistemic humility – it tells you to not pretend to know things that you don’t know.

V. Bayesianism provides the foundations for the scientific method

The scientific method, needless to say, is humanity’s crowning epistemic achievement. With it we have invented medicine, probed atoms, and gone to the stars. Its success can be attributed to the structure of its method of investigating truth claims: in short, science is about searching theories for testable consequences, and then running experiments to update our beliefs in these theories.

This is all contained in Bayes’ rule, the fundamental law of probabilistic inference:

Pr(theory | evidence) ∝ Pr(evidence | theory) · Pr(theory)

This rule tells you precisely how you should update your beliefs given your evidence, no more and no less. It contains the wisdom of empiricism that has revolutionized the world we live in.

VI. Bayesianism is practically useful

So maybe you’re convinced that Bayesianism is right in principle. There’s a separate question of whether Bayesianism is useful in practice. Maybe treating your beliefs like probabilities is like trying to do psychology starting from Schrödinger’s equation – possible in principle but practically infeasible, not to mention a waste of time.

But Bayesianism is practically useful.

Statistical mechanics, one of the most powerful branches of modern science, is built on a foundation of explicitly Bayesian principles. More generally, good statistical reasoning is incredibly useful across all domains of truth-seeking, and an essential skill for anybody that wants to understand the world.

And Bayesianism is not just useful for epistemic reasons. A fundamental ingredient of decision-making is the ability to produce accurate models of reality. If you want to effectively achieve your goals, whatever they are, you must be able to engage in careful probabilistic reasoning.

And finally, in my personal experience I have found Bayesian epistemology to be infinitely mineable for useful heuristics in thinking about philosophy, physics, altruism, psychology, politics, my personal life, and pretty much everything else.

Consistency and priors

The method of reasoning illustrated here is somewhat reminiscent of Laplace’s “principle of indifference.” However, we are concerned here with indifference between problems, rather than indifference between events. The distinction is essential, for indifference between events is a matter of intuitive judgment on which our intuition often fails even when there is some obvious geometrical symmetry (as Bertrand’s paradox shows).

E. T. Jaynes
Prior Probabilities

I’ve previously written praise of the principle of maximum entropy as a prior-setting method that is justified on the basis of a very minimal and highly intuitive set of epistemic features.

But there’s an even better technique for prior-setting, one that is justified on incredibly fundamental grounds. This technique can only be used in rare cases, but it is immensely powerful when it applies. It’s the principle of transformation groups.

Here is the single assumption from which the principle arises:

“In problems where we have the same prior information, we should assign the same prior probabilities.” (Jaynes’ wording)

This is simple to the point of seeming almost tautological. So what can we do with it?

We’ll start with one of the simplest applications of transformation groups. Suppose that somebody gives you the following information:

I = “This coin will land either tails or heads.”

Now you want to say what the following probabilities should be:

P(This coin will land tails | I) = p
P(This coin will land heads | I) = q

Intuitively, it seems obvious to us that absent any other information, we should assign equal probabilities to these. But why? Is there a principled reason for assuming that the coin is a fair coin? Or are we just importing into the problem our background knowledge about most coins being fair?

The method of transformation groups gives us a principled reason. It says to rephrase the problem as follows:

I’ = “This coin will land either heads or tails.”

Now, our initial problem has been changed into this new problem just by replacing every “heads” with “tails” and every “tails” with “heads”. Since our prior-setting procedure found that P(This coin will land tails | I) = p in the first problem, it must now find that P(This coin will land heads | I’) = p in this new one. This is required of any consistent prior-setting procedure! If the problem changes just by switching labels, then the priors should change in exactly the same way. This means that:

P(This coin will land heads | I’) = p
P(This coin will land tails | I’) = q

But clearly, I = I’; the logical operator “OR” is symmetric! Which means that:

P(This coin will land heads | I’) = P(This coin will land heads | I)

Since p + q = 1, this is only possible if p = q = ½!

This is simple, but beautiful. The principle tells us that the only logical way to set our priors in this case is evenly – anything else would be either logically inconsistent, or assuming extra information that breaks the symmetry between heads and tails. It goes from logical symmetry to probability symmetry!

Finding these symmetries is what the method of transformation groups is all about. More generally, one can represent a choice between N different possibilities as the statement:

I = “Possibility 1 or possibility 2 or … possibility N”

But this is symmetric with:

I’ = “Possibility 2 or possibility 1 or … possibility N”

As well as all other orderings.

By the exact same argument as above, your prior-setting procedure is required by logical consistency to evenly distribute credence across the N possibilities. So for each k from 1 to N, P(Possibility k | I) = 1/N.

The method of transformation groups can also be applied to continuous variables, where finding the right set of priors can be a lot less intuitive. You do so by noting different types of symmetries for different types of parameters.

For instance, a location parameter is one that serves to merely shift a probability distribution over an observable quantity, without reshaping the distribution. We can formally express this by saying that for a location parameter µ, the distribution over x depends only on the difference x – µ:

p(x | µ) = f(x – µ)

For such parameters, it must be the case that the prior distribution over them is similarly symmetric over translational shifts:

For all ∆, p(µ) = p(µ + ∆)
So p(µ) = c, for some constant c

Another common category of parameters are scale parameters. These are parameters that serve to rescale probability distributions without reshaping them. Formally:

p(x | σ) = 1/σ g(x / σ)

For this symmetry, the requirement for consistency is:

For all s, p(σ) = 1/s · p(σ / s)
So p(σ) ∝ 1/σ
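These invariance conditions are easy to check numerically. Here’s a tiny Python sanity check (a sketch, not a proof) that 1/σ satisfies the scale-invariance requirement:

```python
def p(sigma):
    return 1 / sigma   # the scale-invariant (Jeffreys) prior, up to a constant

# Check p(sigma) = (1/s) · p(sigma / s) for a few rescalings s:
for s in (0.5, 2.0, 10.0):
    for sigma in (0.1, 1.0, 7.3):
        assert abs(p(sigma) - (1 / s) * p(sigma / s)) < 1e-12

# A flat prior over a location parameter passes the analogous translation
# check trivially, since p(mu) = p(mu + delta) for any shift delta.
print("1/sigma is invariant under changes of scale")
```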

In summation, by carefully analyzing the symmetries of the background information you have, you can extract out requirements for how to set your prior distribution that are mandated on threat of logical inconsistency!

Dutch book arguments

These are the laws of probability, which we have proved to be necessarily true of any consistent set of degrees of belief. Any definite set of degrees of belief which broke them would be inconsistent in the sense that it violated the laws of preference between options … If anyone’s mental condition violated these laws, his choice would depend on the precise form in which the options were offered him, which would be absurd. He could have a book made against him by a cunning better and would then stand to lose in any event.

We find, therefore, that a precise account of the nature of partial belief reveals that the laws of probability are laws of consistency, an extension to partial beliefs of formal logic, the logic of consistency.

Frank Ramsey
The Foundations of Mathematics and Other Logical Essays, Volume 5

In this post, I’m going to describe one of the more famous arguments for Bayesianism.

These arguments are about how different types of epistemological frameworks will handle different series of wagers. Let me just lay out clearly what exactly we mean by a wager, so as to remove any ambiguity.

A wager on proposition A is a betting opportunity. It involves a payoff amount S and a buy-in cost. In general, the buy-in will be some fraction f of the payoff amount, so we’ll write it as fS. If you bet on A and it turns out true, then you get the payout S, but you’ve still paid the buy-in, for a net gain of S – fS. And if you bet on A and it turns out false, then you get no payout and lose the fS you already spent.

A        Net payout
True     S – fS
False    – fS

From this payout table, you can calculate that an agent will find the wager favorable exactly when their credence P(A) is greater than f: the expected net payout is P(A) · (S – fS) – (1 – P(A)) · fS = S · (P(A) – f). That is, the agent will want to take the bet whenever the chance of a payout is greater than the fraction of the payout required to buy into the bet.

Now, imagine that somebody has a credence of 52% in a proposition A and a credence of 52% in the proposition ~A. How will they evaluate the following set of bets?

B1: pays out $100 if A is true, buy-in of $51
B2: pays out $100 if ~A is true, buy-in of $51
B3: pays out a guaranteed $100, buy-in of $102

They will see both B1 and B2 as favorable bets, since the buy-in is a smaller fraction of the payout than the chance of payout. And they will see B3 as an unfavorable bet, since clearly the buy-in is a larger proportion of the payout than the chance of a payout.

But B3 is just the same as the combination of bets B1 and B2!

Why? Well, if you bet on both B1 and B2, then you are guaranteed to win exactly one of the two (since A and ~A cannot both be true, but one of the two must be). Then you will have paid in a net sum of $102, and gotten back only $100.
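Here’s the guaranteed loss spelled out in Python (a sketch of the bets above):

```python
# Net result of a wager: win S - f·S if the event occurs, lose f·S otherwise.
def net(S, f, event_occurs):
    return S - f * S if event_occurs else -f * S

for A_is_true in (True, False):
    b1 = net(100, 0.51, A_is_true)        # bet on A
    b2 = net(100, 0.51, not A_is_true)    # bet on ~A
    print(A_is_true, b1 + b2)             # -2 either way: a guaranteed loss
```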

A similar argument can be made for any levels of credence C(A) and C(~A) that don’t sum up to 100%. And all of the usual axioms of probability theory can be argued for in the same way. Such arguments are called Dutch book arguments.

Dutch book arguments are standardly presented as revealing that if one does not form beliefs according to the laws of probability theory, then they will be able to be juiced for money by clever bookies.

This is true; somebody with beliefs like those described above can be endlessly exploited for profit. But it is much less impressive than the real conclusion of the Dutch book argument.

Recall that our agent above was found to believe a logical contradiction as a result of not having their beliefs align with probability theory: they had to regard the very same option as simultaneously favorable (packaged as B1 plus B2) and unfavorable (packaged as B3).

Said another way, an agent not following the probability calculus may evaluate the same proposition differently when it is presented in a different form.

This is what Dutch book arguments really say: if you want your beliefs to be logically consistent, then you are required to reason according to probability theory!

Cox’s theorem

A very original and thoroughgoing development of the theory of probability, which does not depend on the concept of frequency in an ensemble, has been given by Keynes. In his view, the theory of probability is an extended logic, the logic of probable inference. Probability is a relation between a hypothesis and a conclusion, corresponding to the degree of rational belief and limited by the extreme relations of certainty and impossibility. Classical deductive logic, which deals with these limiting relations only, is a special case in this more general development.

R. T. Cox
Probability, Frequency, and Reasonable Expectation
(Yep, that Keynes! He was an influential early Bayesian thinker as well as a famous economist)

Cox’s famous theorem says that if your way of reasoning is not in some sense isomorphic to Bayesianism, then you are violating one of the following ten basic principles. I’ll derive the theorem in this post.

Logic

  1. ~~a = a
  2. ab = ba
  3. (ab)c = a(bc)
  4. aa = a
  5. ~(ab) = ~a or ~b
  6. a(a or b) = a

Reasoning under uncertainty

  7. Degrees of plausibility are represented by real numbers: {b | a}
  8. {bc | a} = F({b | a}, {c | ab})
  9. {~b | a} = G({b | a})
  10. F and G are monotonic.

The first six, I think, need no introduction. I write “a and b” as “ab” for aesthetics.

The next four extend logic beyond the realm of the perfectly certain, and relate to how we should reason in the presence of uncertainty – how we should reason about degrees of plausibility. For this step, we need a new notation: a way to represent not the truth of a proposition a, but its plausibility. We do this with the {} notation: {a | b} = the plausibility of a given that b is known to be true.

I’ll make some brief notes about each of 7 through 10.

Number 7 perhaps sounds strange – plausibilities are states of belief, not numbers. But you can make sense of this in two ways: first, we can consider this a simple model of plausibilities, in which we are merely mirroring the properties of plausibilities in the structure of the system we set up around how to manipulate numbers. And second, if one is to design a robot that reasons about the world, it isn’t crazy to think about programming it to represent beliefs about the plausibilities of propositions as numbers.

#7 also contains an assumption of continuity – that it’s not the case that there are discontinuous jumps in the plausibility of propositions. Said another way, if two different real numbers represent two different degrees of plausibility, then every possible value between them should also represent a degree of plausibility.

Number 8 is about relevance. It says that the plausibility that b and c are both true given some background information a, is only dependent on (1) the plausibility that b is true given a and (2) the plausibility that c is true given a and b.

If you want to know how likely it is that b and c are both true given a, you can break the process down into two steps. First you determine how likely it is that b is true, given your background information. And second, you determine how likely it is that c is true, given both your background information and the truth of b.

In other words, all that you need to know if you want to know {bc | a} are {b | a} and {c | ab}. For example, say that you want the plausibility of Usain Bolt winning the 100 meter dash and also running an extra lap around the track at the end of the dash. The only pieces of information you need for this are (1) the plausibility of Usain Bolt winning the 100 meter dash, and (2) the plausibility of him running an extra lap at the end of the race, given that he won the race. And of course, each of these plausibilities are also conditional on all of the background information you have about the situation.

We give the name F to the function that details precisely how we determine {bc | a}.

Number 9 says that all we need to determine the plausibility of a proposition being false is the plausibility of the proposition being true. The function that details how we determine one from the other is named G.

And finally, Number 10 says that the plausibility relationships described in 8 and 9 are in some sense simple. For instance, if b becomes less plausible and ~b becomes more plausible, then if b becomes even less plausible, ~b should not suddenly become less plausible. If a change in the plausibility of a proposition a makes the plausibility of another proposition b change in a certain direction, then a greater change in the plausibility of a in the same direction should not result in a reversal of the direction of change of the plausibility of b.

There is an additional implicit assumption that is almost too obvious to be stated: If two propositions are just different phrasings of the same state of knowledge, then you should have the same plausibility in both of them. This is the basic requirement of consistency. It says, for instance, that “a and b” is exactly as plausible as “b and a”.

***

Now we can derive Bayesianism!

First step: conditional probabilities. We use the associativity of “and”:

{bcd | a} = F( {b | a}, {cd | ab} ) = F( {b | a}, F( {c | ab}, {d | abc} ) )
{bcd | a} = F( {bc | a}, {d | abc} ) = F( F({b | a}, {c | ab}), {d | abc} )

So F(x, F(y, z)) = F(F(x, y), z)

Any monotonic function that satisfies this functional equation is isomorphic to ordinary multiplication on the interval [0, 1].

In other words, there exists a function W such that:

W(F(x, y)) = W(x) · W(y)

From which we can see:

W({bc | a}) = W({b | a}) · W({c | ab})

Which is exactly the form of conjunction in ordinary probability theory!

This means that any form of reasoning about the plausibility of conjunctions of propositions is either violating one of the 10 axioms, or is equivalent to probability theory up to isomorphism.

So if somebody walks up to you and presents to you an algorithm for computing the plausibility of the conjunction of two propositions, and you know that they are following the above ten rules, then you are guaranteed to be able to find some function that takes in their algorithm and translates into ordinary Bayesianism.

This translation function is W! Notice that although W looks a lot like ordinary probability, it is a different thing entirely. For instance, if somebody is already reasoning with ordinary probability theory, then {b | a} = P(b | a). To convert {b | a} into P(b | a), then, W need not do anything! So W({b | a}) = {b | a} = P(b | a), or W(x) = x, which is clearly different from the probability function.

Next, we reveal the nature of the function G.

The first step is easy:

{~~b | a} = G(G({b | a}))
So G(G(x)) = x

This means that G is an involution – a function that is its own inverse.

Next, we use the commutativity of “and”:

W({bc | a}) = W({b | a}) · W({c | ab})
= W({b | a}) · G( W({~c | ab}) )
= W({b | a}) · G( W({b~c | a}) / W({b | a}) )

So W({cb | a}) = W({c | a}) · G( W({~bc | a}) / W({c | a}) )

Now if we let c = ~(bd) for some new statement d, and rename W({b | a}) = x and W({c | a}) = y, we find:

x G( G(y) / x ) = y G( G(x) / y )

The only functions that satisfy this functional equation as well as the earlier involution equation are the following:

G(x) = (1 – x^m)^(1/m), for any m

With some algebraic manipulation, we see that this is equivalent to:

W({a | b})^m + W({~a | b})^m = 1

This equation almost perfectly represents the normalization rule: that the total probability must sum to 1! It seems mildly inconvenient that this holds true for the function W^m rather than W. If we knew that m had to equal 1, then we’d have a perfect translation function both for conditional probabilities and for normalization… But if we look back at our conditional probability finding, we notice that it can be equivalently represented in terms of W^m instead of W!

W({bc | a}) = W({b | a}) · W({c | ab})
W({bc | a})^m = W({b | a})^m · W({c | ab})^m

Now we just define one last function Q = W^m, and we’re done!

Q takes in any system of reasoning that obeys the starting ten rules, and reveals it to be equivalent to ordinary probability theory.
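To make this concrete, here’s a toy example of my own in Python: an agent whose “plausibilities” are square roots of ordinary probabilities. For this agent W is the identity, G is the m = 2 case of the formula above, and Q(x) = x² translates everything back into probability theory.

```python
import math

def plaus(prob):
    """A made-up plausibility calculus: {b | a} = sqrt(P(b | a))."""
    return math.sqrt(prob)

p_b, p_c_given_b = 0.3, 0.5
p_bc = p_b * p_c_given_b

# The product rule holds with F(x, y) = x · y, so W is just the identity:
assert math.isclose(plaus(p_bc), plaus(p_b) * plaus(p_c_given_b))

# Negation is G(x) = (1 - x^m)^(1/m) with m = 2:
G = lambda x: (1 - x ** 2) ** 0.5
assert math.isclose(G(plaus(p_b)), plaus(1 - p_b))

# Q = W^m = x^2 recovers ordinary probabilities from the agent's plausibilities:
Q = lambda x: x ** 2
assert math.isclose(Q(plaus(p_b)), p_b)
print("the sqrt-plausibility agent is isomorphic to probability theory")
```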

The rest of probability theory can be shown to result from these basic axioms (we’ll drop the brackets now, and will call Q({a | b}) by its proper name: P(a | b)).

P(certainty) = 1
P(impossibility) = 0
P(bc | a) + P(b~c | a) = P(b | a)
P(a or b | c) = P(a | c) + P(b | c) – P(ab | c)
And so on.

This derivation of probability theory is great because it isn’t limited by the ordinary frequency interpretations of probability theory, in which a probability is defined to be a limit of a ratio of an infinite number of experimental results. This is not only ugly theoretically, but leaves us puzzled as to how to talk about the probabilities of one-shot events or of hypotheses that can’t be directly translated into empirical results.

The definition of probability invoked here is infinitely deeper. Instead of frequencies, probabilities are defined according to fundamental principles of normative epistemology. Any agent that reasons consistently according to a few basic maxims will be reasoning in a way that is functionally identical to probability theory.

And probabilities here can be assigned to any propositions – not just those that refer to empirically measurable events that are repeatable ad infinitum. They represent a normative rational degree of certainty that one must possess if one is to reason consistently!