Paradoxes of infinite decision theory

January 23, 2018March 15, 2018 ~ squarishbracket ~ 5 Comments

(Two decision theory puzzles from this paper.)

Trumped

Donald Trump has just arrived in Purgatory. God visits him and offers him the following deal. If he spends tomorrow in Hell, Donald will be allowed to spend the next two days in Heaven, before returning to Purgatory forever. Otherwise he will spend forever in Purgatory. Since Heaven is as pleasant as Hell is unpleasant, Donald accepts the deal. The next evening, as he runs out of Hell, God offers Donald another deal: if Donald spends another day in Hell, he’ll earn an additional two days in Heaven, for a total of four days in Heaven (the two days he already owed him, plus two new ones) before his return to Purgatory. Donald accepts for the same reason as before. In fact, every time he drags himself out of Hell, God offers him the same deal, and he accepts. Donald spends all of eternity writhing in agony in Hell. Where did he go wrong?

Satan’s Apple

Satan has cut a delicious apple into infinitely many pieces, labeled by the natural numbers. Eve may take whichever pieces she chooses. If she takes merely finitely many of the pieces, then she suffers no penalty. But if she takes infinitely many of the pieces, then she is expelled from the Garden for her greed. Either way, she gets to eat whatever pieces she has taken.

Eve’s first priority is to stay in the Garden. Her second priority is to eat as much apple as possible. She is deciding what to do when Satan speaks. “Eve, you should make your decision one piece at a time. Consider just piece #1. No matter what other pieces you end up taking and rejecting, you do better to take piece #1 than to reject it. For if you take only finitely many other pieces, then taking piece #1 will get you more apple without incurring the greed penalty. On the other hand, if you take infinitely many other pieces, then you will be ejected from the Garden whether or not you take piece #1. So in that case you might as well take it, so that you can console yourself with some additional apple as you are escorted out.”

Eve finds this reasonable, and decides provisionally to take piece #1. Satan continues, “By the way, the same reasoning holds for piece #2.” Eve agrees. “And piece #3, and piece #4 and…” Eve takes every piece, and is ejected from the Garden. Where did she go wrong?

The second thought experiment is sufficiently similar to the first one that I won’t say much about it – just included it in here because I like it.

Analysis

Let’s assume that Trump is able to keep track of how many days he has been in hell, and can credibly pre-commit to strategies involving only accepting a fixed number of offers before rejecting. Now we can write out all possible strategies for sequences of responses that Trump could make:

Strategy 0 Accept none of the offers, and stay in Purgatory forever.

Strategy N Accept some finite number N of offers, after which you spend 2N days in Heaven and then infinity in Purgatory.

Strategy ∞ Accept all of the offers, and stay in Hell forever.

Assuming that a day in Hell is exactly as bad as a day in Heaven is good, Strategy 0 nets you 0 days in Heaven, Strategy N nets you N days in Heaven (2N days in Heaven minus N days in Hell), and Strategy ∞ nets you ∞ days in Hell.

Obviously Strategy ∞ is the worst option (it is infinitely worse than all other strategies). And for every N, Strategy N is better than Strategy 0. So we have ∞ < 0 < N.

So we should choose Strategy N for some N. But which N? Obviously, for any choice of N, there will be arbitrarily better choices that you could have made. The problem is that there is no optimal choice of N. Any reasonable decision theory, when asked to optimize N for utility, is going to just return an error. It’s like asking somebody to tell you the largest integer. This is perhaps something that is difficult to come to terms with, but it is not paradoxical – there is no law of decision theory that every problem has a best solution.

But we still want to answer what we would do if we were in Trump’s shoes. If we actually have to pick an N, what should we do? I think the right answer is that there is no right answer for what we should do. We can say “x is better than y” for different strategies, but cannot say definitively the best answer… because there is no best answer.

One technique that I thought of, however, is the following (inspired by the Saint Petersburg Paradox):

On the first day, Trump should flip a coin. If it lands heads, then he chooses Strategy 1. If it lands tails, then he flips the coin again.

If on the next flip the coin lands heads, then he chooses Strategy 2. And if it lands tails, again he flips the coin.

If on this third flip the coin lands heads, then he chooses Strategy 4. And if not, then he flips again.

Et cetera to infinity.

With this decision strategy, we can calculate the expected number N that Trump will choose. This number is:

E[N] = ½･1 + ¼･2 + ⅛･4 + … = ∞

But at the same time, the coin will certainly eventually land heads, and the process will terminate. The probability that the coin lands tails an infinite number of times is zero! So by leveraging infinities in his favor, Trump gets an infinite positive expected value for days spent in heaven, and is guaranteed to not spend all eternity in Hell.
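To make the two claims above concrete, here is a minimal sketch that truncates the process after a fixed number of flips. The function name `truncated_stats` is just an illustrative choice, not something from the paper. It shows the probability of having stopped approaching 1 while the partial sum for E[N] keeps growing by ½ per flip.

```python
from fractions import Fraction

def truncated_stats(max_flips):
    """Return (P(first heads within max_flips), partial sum of E[N]) for the doubling scheme."""
    p_stop = Fraction(0)
    partial_mean = Fraction(0)
    for k in range(1, max_flips + 1):
        p_first_heads = Fraction(1, 2**k)  # first heads appears on flip k
        n_chosen = 2 ** (k - 1)            # Strategy 1, 2, 4, ... as in the text
        p_stop += p_first_heads
        partial_mean += p_first_heads * n_chosen
    return p_stop, partial_mean

for flips in (10, 20, 40):
    p_stop, partial_mean = truncated_stats(flips)
    print(flips, float(p_stop), float(partial_mean))
```

Each term of the expectation contributes exactly ½, so the partial sum after k flips is k/2 and never levels off, while the chance of still flipping after k tosses is 2^-k.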

A weird question now arises: Why should Trump have started at Strategy 1? Or why multiply by 2 each time? Consider the following alternative decision process for the value of N:

On the first day, Trump should flip a coin. If it lands heads, then he chooses Strategy 1,000,000. If it lands tails, then he flips the coin again.

If on the next flip the coin lands heads, then he chooses Strategy 10,000,000. And if it lands tails, again he flips the coin.

If on this third flip the coin lands heads, then he chooses Strategy 100,000,000. And if not, then he flips again.

Et cetera to infinity.

This decision process seems obviously better than the previous one – the minimum number of days in heaven Trump nets is 1 million, which would only have previously happened if the coin had landed tails 20 times in a row. And the growth in number of net days in heaven per tail flip is 5x better than it was originally.

But now we have an analogous problem to the one we started with in choosing N. Any choice of starting strategy or growth rate seems suboptimal – there are always an infinity of arbitrarily better strategies.

At least here we have a way out: All such strategies are equivalent in that they net an infinite number of days. And none of these infinities are any larger than any others. So even if it intuitively seems like one decision process is better than another, on average both strategies will do equally well.

This is weird, and I’m not totally satisfied with it. But as far as I can tell, there isn’t a better alternative response.

Schelling fence

How could a strategy like Strategy N actually be instantiated? One potential way would be for Trump to set up a Schelling fence at a particular value of N. For example, Trump could pre-commit from the first day to only allowing himself to say yes 500 times, and after that saying no.

But there’s a problem with this – if Trump has any doubts about his ability to stick with his plan, and puts any credence in his breezing past Strategy N and staying in hell forever, then this will result in an infinite negative expected value of using a Schelling fence. In other words, use of a Schelling fence seems only advisable if you are 100% sure of your ability to credibly pre-commit.

Here’s an alternative strategy for instantiating Strategy N that smooths out this wrinkle: Each time Trump is given another offer by God, he accepts it with probability N/(N+1), and rejects it with probability 1/(N+1). By doing this, he will on average follow Strategy N, though on any given run he will end up following some other Strategy M (the number of acceptances is geometrically distributed, so M can differ substantially from N).
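As a quick check of the "on average" claim, the sketch below sums the series for the expected number of acceptances before the first rejection. The helper `mean_acceptances` and the truncation length are illustrative assumptions. The closed form for a geometric distribution with acceptance probability p is p/(1 − p), which equals N when p = N/(N+1).

```python
def mean_acceptances(n, terms=100_000):
    """Approximate E[number of accepted offers] when each offer is accepted with prob n/(n+1)."""
    p_accept = n / (n + 1)
    p_reject = 1 - p_accept
    total = 0.0
    p_k = 1.0  # running value of p_accept**k
    for k in range(terms):
        # exactly k acceptances followed by one rejection
        total += k * p_k * p_reject
        p_k *= p_accept
    return total

print(mean_acceptances(500))  # close to 500, the target N from the Schelling-fence example
```

The same distribution also puts probability 1/(N+1) on rejecting the very first offer, so the realized M is spread widely around N.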

A harder variant would be if Trump’s memory is wiped clean after every day he spends in Hell, so that each day when he receives God’s offer, it is as if it is the first time. Even if Trump knows that his memory will be wiped clean on subsequent days, he now has a problem: he has no way to remember his Schelling fence, or to even know if he has reached it yet. And if he tries the probabilistic acceptance approach, he has no way to remember the value of N that he decided on.

But there’s still a way for him to get the infinite positive expected utility! He can do so by running a Saint Petersburg Paradox like above not just the first day, but every day! Every day he chooses a value of N using a process with an infinite expected value but a guaranteed finite actual value, and then probabilistically accepts/rejects the offer using this N.

Quick proof that this still ensures finitude: Suppose that he stays in Hell forever, never rejecting the offer. Since there is a nonzero chance each day that he selects N = 1, this means that he will select N = 1 an infinite number of times. For each of these times, he has a ½ chance of rejecting and ½ chance of accepting. Since this happens an infinite number of times, he is guaranteed (with probability 1) to eventually reject an offer, contradicting the supposition.

Question: what’s the expected number of days in Heaven for this new process? Infinite, just as before! But guaranteed finite. (There should be a name for these types of guaranteed-finite-but-infinite-expected-value quantities.)

Anyway, the conclusion of all of this? Infinite decision theory is really weird.

Does fine-tuning give evidence for God?

January 23, 2018January 24, 2018 ~ squarishbracket ~ Leave a comment

I used to think that the fine-tuning argument was the strongest argument out there for the existence of a creator deity. I was especially impressed by the apparent magnitude of the fine-tuning – Steven Weinberg has stated that the value of the cosmological constant was fine-tuned to one part in 10¹²⁰.

If one takes a naïve (and as we’ll see, incorrect) Bayesian approach to assessing this as evidence, then it looks like this should serve as an incredible amount of evidence for the existence of a God, enough to totally overwhelm all other possible considerations. Why? Because if there is a God, then we expect fine-tuning, while if not, then the fine-tuning looks incredibly unlikely. Given this, the God explanation should receive a credence bump proportional to 10¹²⁰ upon updating on the observation of fine-tuning.

As a quick aside before diving into the numbers, there is a lot of dispute about whether or not there even is fine-tuning in our universe. For the purposes of this post, I’m going to ignore all of these disputes, and pretend that there is a strong consensus on this matter. I’ll use Weinberg’s estimate of 10^-120 for the fine-tuning of the cosmological constant. I know that this is controversial, but the point I’m making will stand for even this insanely tiny value.

Okay, so let’s first present a formal version of the fine-tuning argument for God.

F = “The universe is fine-tuned for life.”
G = “A creator deity fine-tuned the universe for life.”

O(G | F) = L(F | G) · O(G)
L(F | G) = P(F | G) / P(F | ~G) ≈ 1 / 10^-120 = 1200 dB

So O(G | F) = 10¹²⁰ · O(G)

This uses the odds formulation of Bayes’ rule – look it up if you’re unfamiliar.

This argument says that your credences should be adjusted by a factor of 10¹²⁰ upon observing the fine-tuning of the universe. In other words, to not be virtually certain that there exists a creator deity that rules the universe after updating on fine-tuning, you’d have to have initially had a credence on the order of 10^-120.

Let me point out that 10^-120 is a really really small number. It’s virtually impossible to imagine any good reason why you would be justified in having a prior credence on this order of magnitude. Nobody should be that sure about anything. Evidence of a strength of 1200 dB is analogous in strength to a noise that is 10¹²⁰ times more intense than the threshold of human hearing (for comparison, a quadrillion-fold increase in intensity is only 150 dB).

So what’s wrong with this argument? In fact, it fails at the first step. In calculating the strength of the evidence, we only considered two possible hypotheses: either God or, if not, then random coincidence. But there are many other options that we have to factor in as well, most famously the multiverse hypothesis.

But even if there are other hypotheses out there, shouldn’t they just all share the benefit of the credence boost? The existence of another hypothesis that made the same prediction shouldn’t count as a penalty, right?

Wrong! Probabilities have to add up to 1, and you can’t have multiple mutually exclusive competing hypotheses that you have virtual certainty about. Whatever happens when you add other hypotheses must be more subtle than that. So let’s calculate using Bayes’ rule!

O(T | E) = L(E | T) · O(T)
L(E | T) = P(E | T) / P(E | ~T)

For each theory T we consider, we have to take into account all other theories in the denominator of our likelihood function L. We’ll want to keep in mind the following identity:

P(B & C) = 0
implies
P(A | B or C) = [P(A | B) P(B) + P(A | C) P(C)] / [P(B) + P(C)]

So, for instance, let’s divide up our explanations of the fine-tuning F into three mutually exclusive categories: (1) random coincidence C, (2) a deistic God G, and (3) all other explanations O.

L(F | X) = P(F | X) / P(F | ~X)
= P(F | X) (1 – P(X)) / ∑_{Y≠X} P(F | Y) P(Y)

P(F | C) ≈ 10^-120
P(F | G) ≈ 1
P(F | O) ≈ 1

L(F | C) ≈ 10^-120
L(F | G) ≈ P(~G) / P(O)
L(F | O) ≈ P(~O) / P(G)

O(C | F) = 10^-120 · O(C)
O(G | F) = P(G) / P(O)
O(O | F) = P(O) / P(G)

In the end, what we find is that the “Coincidence” hypothesis has been down-voted completely out of existence, leaving only the “God” hypothesis and the “Other” hypothesis.

And importantly, our final credence in either of these hypotheses is not on the order of magnitude of 1 – 10^-120. The final balance depends entirely just on the ratio of prior credences in the two explanations.

Trial Run

Let’s look at two individuals updating on the observation of fine-tuning.

Atheist
P(G) = .01%
P(O) = 50%

Deist
P(G) = 99%
P(O) = 1%

(The exact details of these numbers aren’t that important, just that they’re somewhat qualitatively accurate.) Their final credences will be:

Atheist
P(G | F) = 0.02%
P(O | F) = 99.98%

Deist
P(G | F) = 99%
P(O | F) = 1%
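These figures can be reproduced with a direct three-hypothesis Bayes update. The sketch below is only illustrative: the function name `update_on_fine_tuning` is made up here, and it assumes (as above) that P(F | G) = P(F | O) = 1, that P(F | C) = 10^-120, and that the coincidence prior is whatever is left over after G and O.

```python
def update_on_fine_tuning(prior_g, prior_o, p_f_given_c=1e-120):
    """Posterior over {C, G, O} after observing fine-tuning F."""
    # Coincidence takes the remaining prior mass (clamped against float round-off).
    prior_c = max(0.0, 1.0 - prior_g - prior_o)
    joint = {
        "C": p_f_given_c * prior_c,  # P(F | C) P(C)
        "G": 1.0 * prior_g,          # P(F | G) P(G)
        "O": 1.0 * prior_o,          # P(F | O) P(O)
    }
    total = sum(joint.values())      # P(F)
    return {name: value / total for name, value in joint.items()}

atheist = update_on_fine_tuning(prior_g=0.0001, prior_o=0.5)
deist = update_on_fine_tuning(prior_g=0.99, prior_o=0.01)

for label, posterior in (("Atheist", atheist), ("Deist", deist)):
    print(label, {name: f"{100 * p:.2f}%" for name, p in posterior.items()})
```

The coincidence term is so small that the posterior is just the prior on G and O renormalized, which is why only the ratio P(G) / P(O) matters.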

And we see that nobody ends up significantly updating their religious beliefs on the evidence of fine-tuning. The deist held a worldview in which the random coincidence hypothesis was already ruled out, so the observation of fine-tuning doesn’t change anything for them. And the atheists were initially fairly agnostic about whether or not the universe was fine-tuned, but were very confident in the existence of other explanations besides God. As such, the observation of fine-tuning served as a minor increase in their belief in God (+.01%), while they become extremely confident that there must be some other explanation.

Fine-tuning would only serve as strong evidence for you if you were initially very sure that there was a God, but agnostic about if God would have designed the universe to accommodate human life, or if its design was purely random coincidence. Even in this case, the bump in credence you’d receive would be nothing like the massive update that seems apparent from a naïve (and wrong) application of Bayesian reasoning.

Noisy Evidence

January 21, 2018March 15, 2018 ~ squarishbracket ~ Leave a comment

Scope insensitivity is a cognitive bias that involves a failure to internalize the true scale of quantities. Some of the most striking and frankly depressing examples of this phenomenon involve altruistic behavior, where people care just as much about a cause regardless of how many lives are concerned. In some cases, increasing numbers of affected people result in decreasing willingness to pay.

This issue arises when quantitative metrics don’t line up with our intuitive metrics – 10 billion doesn’t feel 1000 times larger than 10 million. A solution that might be sometimes possible is to adjust the numerical scale you are dealing with to try to get the true scale to match the intuitive scale.

This is a large part of what I think is great about the notion of evidence as noise.

Humans have scope insensitivity with respect to very large and very small probabilities. 99.99% doesn’t feel that different to us from 99.9999%. But they are extremely different. The amount of evidence required to push you from 99.99% to 99.9999% is the same as the amount of evidence that would have pushed you from 9% to 91%. There is a big difference between 99.99% and 99.9999% in terms of the state of knowledge represented.

The problem is that as the probability approaches 100%, the number looks to us like it is barely budging. This can be fixed by making our scale logarithmic. We do this by first converting our probabilities to odds ratios (so 50% becomes 1:1 odds, 75% becomes 3:1 odds, etc), and then taking a logarithm. This is exactly analogous to the decibel scale for noise, so this is called the decibel (dB) scale for evidence.

Probability of A = P(A)
Odds of A = P(A) / P(~A)
Decibel strength of A = 10 · log₁₀(P(A) / P(~A))

Very strong evidence is very noisy, and weak evidence is silent, barely affecting our beliefs. This is also nice because Bayes’ rule becomes additive:

Posterior Odds Ratio = Likelihood Ratio · Prior Odds Ratio
O(T | E) = L(E | T) · O(T)
becomes…
O_dB(T | E) = L_dB(E | T) + O_dB(T)

If your evidence E is equally likely whether or not the theory T is true, then L(E | T) = 1 and L_dB(E | T) = 0. Thus you add 0, and end up with the same odds as you started with.

Theories that are very high or very low in credence are very noisy, while those that are around 50% are silent.

Now what’s the difference between 99.99% and 99.9999%?

99.99% = 9999:1 = 40 dB
99.9999% = 999999:1 = 60 dB

A 20 dB difference in strength of belief is a lot easier to wrap your head around than a 0.0099% difference!
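The conversion is easy to play with in code. This is a small sketch of the formulas above; the function names `prob_to_db` and `db_to_prob` are my own labels for them.

```python
import math

def prob_to_db(p):
    """Decibel strength of a belief: 10 * log10 of the odds P(A) / P(~A)."""
    return 10 * math.log10(p / (1 - p))

def db_to_prob(db):
    """Inverse conversion: decibels back to a probability."""
    odds = 10 ** (db / 10)
    return odds / (1 + odds)

for p in (0.09, 0.5, 0.91, 0.9999, 0.999999):
    print(f"{p} -> {prob_to_db(p):.1f} dB")
```

On this scale the step from 9% to 91% and the step from 99.99% to 99.9999% both come out to roughly 20 dB, matching the claim made earlier.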

In addition, equally strong evidence always looks equally strong when expressed in dB, while it can look increasingly weak when expressed in probabilities.

For example, imagine that somebody comes up to you and claims to be able to read your mind. To test her, you decide to ask her to tell you what number between 1 and 10 is in your head right now. If she gets this right, then this counts as 10 decibels of evidence for her psychic abilities.

L(correct | psychic) = P(correct | psychic) ÷ P(correct | not psychic)
≈ 100% / 10% = 10

10 log₁₀(10) = 10 dB

So if your previous belief in her psychic abilities was at -50 decibels (100,000:1 odds against), then it should now be at -40 decibels (10,000:1 odds against).

The same calculation would tell you that another successful test would nudge you another +10 dB, from -40 to -30. Extrapolation seems to indicate that you should be pretty much agnostic as to whether or not she is psychic after four more such successful tests, and a strong believer after only eight total tests.

Initial strength of belief = -50 dB
First test gives evidence of +10 dB
New strength of belief = -40 dB
Four more tests give total evidence of +40 dB
New strength of belief = 0 dB
Three more tests give total evidence of +30 dB
Final strength of belief = +30 dB (99.9%)

This example actually gets things wrong in a very important way. Eight tests like those that I described are probably not sufficient to establish psychic abilities. This is a little off topic, but is useful to go into as a demonstration of how naive usage of Bayes’ rule can lead you off the rails.

Where we went wrong was in the very first step, in calculating the decibel strength of the evidence.

L(correct | psychic) = P(correct | psychic) ÷ P(correct | not psychic)
≈ 100% / 10% = 10

The presumption behind this calculation is that if she were psychic, then she would almost definitely be able to get the number right (≈ 100%), but if not, then she would have a random shot (10%). But “psychic” and “random” are not the only two theories! For instance, maybe the apparent psychic has actually just figured out a masterful method for reading subtle facial movements to guess at the number being guessed, rather than actually being able to look into your mind.

The face-reading hypothesis seems unlikely, but probably less so than true mind-reading abilities. Let’s give it a decibel score of -20 (corresponding to an initial credence of about 1%). This should barely factor into our initial calculation, so let’s suppose that +10 dB is the actual strength of evidence for psychic abilities.

Now P_dB(psychic) goes from -50 dB to -40 dB, and P_dB(face-reading) goes from -20 dB to -10 dB. They have both gotten more likely, because they both successfully predicted the outcome! And now for the second test, face-reading should have a bigger effect on the calculation! I’ll skip the algebra and just present the new strengths of evidence for the second test:

L(correct | psychic) = 7 dB
L(correct | face-reading) = 10 dB

Notice that the evidence is now weaker for the “psychic” hypothesis, because it has a more likely competing hypothesis. The evidence is still equally strong for face-reading, on the other hand, because its competing hypothesis (that she is psychic) is still very weak.

So we update again!

Psychic: -40 dB to -33 dB (.05%)
Face-reading: -10 dB to 0 dB (50%)

Now the face-reading hypothesis is 50% – apparently equally likely to be true and false! This will sway the strength of the evidence for the ‘psychic’ hypothesis even more on the third trial:

L(correct | psychic) = 3 dB
L(correct | face-reading) = 10 dB

Now with such a likely alternative explanation, the evidence is even weaker than previously for the psychic hypothesis. After our third trial, our beliefs will update as follows:

Psychic: -33 dB to -30 dB (.1%)
Face-reading: 0 dB to 10 dB (90%)

As you can see, the face-reading hypothesis takes off, while the psychic hypothesis ends up staying stuck around .1%.
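For anyone who wants the skipped algebra, here is a sketch that runs the three-hypothesis update numerically. It assumes three mutually exclusive hypotheses (psychic, face-reading, and plain guessing), with the first two always getting the number right and guessing succeeding 10% of the time. The names `P_CORRECT`, `update`, and `evidence_log` are illustrative. Because it carries exact values forward rather than rounded ones, its output differs slightly from the rounded figures quoted above.

```python
import math

# Assumed chance of a correct answer under each hypothesis.
P_CORRECT = {"psychic": 1.0, "face-reading": 1.0, "guessing": 0.1}

def db_to_prob(db):
    odds = 10 ** (db / 10)
    return odds / (1 + odds)

def prob_to_db(p):
    return 10 * math.log10(p / (1 - p))

def update(priors):
    """One successful test: return (posteriors, evidence in dB for each hypothesis)."""
    p_correct = sum(P_CORRECT[h] * priors[h] for h in priors)
    posteriors = {h: P_CORRECT[h] * priors[h] / p_correct for h in priors}
    evidence_db = {}
    for h in priors:
        # P(correct | not h): average over the competing hypotheses only.
        p_correct_given_not_h = (p_correct - P_CORRECT[h] * priors[h]) / (1 - priors[h])
        evidence_db[h] = 10 * math.log10(P_CORRECT[h] / p_correct_given_not_h)
    return posteriors, evidence_db

psychic = db_to_prob(-50)
face = db_to_prob(-20)
beliefs = {"psychic": psychic, "face-reading": face, "guessing": 1 - psychic - face}

evidence_log = []
for trial in (1, 2, 3):
    beliefs, evidence_db = update(beliefs)
    evidence_log.append(evidence_db)
    print(f"trial {trial}: "
          f"psychic {evidence_db['psychic']:+.1f} dB -> {prob_to_db(beliefs['psychic']):.1f} dB, "
          f"face-reading {evidence_db['face-reading']:+.1f} dB -> "
          f"{prob_to_db(beliefs['face-reading']):.1f} dB")
```

The evidence for face-reading stays near +10 dB on every trial, while the evidence for the psychic hypothesis shrinks as face-reading absorbs more of the probability mass.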

I’ll talk more about this in a post tomorrow, in which I show how the exact same simple error in our first argument is being made in fine-tuning arguments for God!

Nature’s Urn and Ignoring the Super Persuader

January 19, 2018January 21, 2018 ~ squarishbracket ~ 3 Comments

This post is about one of my favorite little-known thought experiments.

Here’s the setup:

You are in conversation with a Super Persuader – an artificial intelligence that has access to an enormously larger pool of information than you, and that is excellent at synthesizing information to form powerful arguments. The Super Persuader is so super at persuading, in fact, that given any proposition, it will be able to construct the most powerful argument possible for that proposition, consisting of the strongest evidence it has access to.

The Super Persuader is going to try and persuade you either that a certain proposition A is true or that it is false. In doing so, you know that it cannot lie, but it can pick and choose the information that it presents to you, giving an incomplete picture.

Finally, you know that the Super Persuader is going to decide which side of the issue to argue based off of a random coin toss: 50% chance they will argue that A is true, and 50% chance they will argue that A is false.

Once the coin is tossed and the Persuader begins to present the evidence, how should you rationally respond? Should you be swayed by the arguments, ignore them, or something else?

Here’s a basic presentation of one response to this thought experiment:

Of course you should be swayed by their arguments! If not, then you end up receiving boatloads of crazily persuasive argumentation and pretending like you’ve heard none of it. This is the very definition of irrationality – closing your eyes to the evidence you have sitting right in front of you! There’s no reason to disregard all of the useful information that you’re getting, just because it’s coming from a source that is trying to persuade you. Regardless of the motives of the Super Persuader, it can only persuade you by giving you honest and genuinely convincing evidence. And a rational agent has no choice but to update their credences on this evidence.

I think that this is a bad argument. Here’s an analogy to help explain why.

Imagine the set of all possible pieces of evidence you could receive for a given proposition as a massive urn filled with marbles. Each marble is a single argument that could be made for the proposition. If the argument is in support of the proposition, then the marble representing it will be black. And if the argument is against the proposition, then the marble representing it will be white.

Now, the question as to whether the proposition is more likely to be true or false is roughly the same as the question of whether there are more black or white marbles in the urn. That is the exact same question if all of the arguments in question are equally strong, and we have no reason for starting out favoring one side over the other.

But now we can think about the actions of the Super Persuader as follows: the Super Persuader has direct access to the urn, and can select any marble it wants. If it wants to persuade you that the proposition is true, then it will just fish through the urn and present you with as many black marbles as it desires, ignoring all the white marbles.

Clearly this process gives you no information as to the true proportion of the marbles that are white versus the proportion that are black. The data you are receiving is contaminated by a ridiculously powerful selection bias. The evidence you see is no longer linked in any way to the truth of the proposition, because regardless of whether or not it is true, you still expect to receive large amounts of evidence for it.

In the end, all of the pieces of evidence you receive are useless, in the same way that a stacked deck is not a reliable source of information about the average card deck.
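The selection-bias point can be put in numbers. The following is a toy sketch under assumptions that are mine, not part of the thought experiment: the urn is 70% black if the proposition is true and 30% black if it is false, there are always enough black marbles to go around, and random draws are independent.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule for a binary proposition."""
    numerator = prior * likelihood_if_true
    return numerator / (numerator + (1 - prior) * likelihood_if_false)

K = 10                          # number of black marbles you are shown
FRACTION_BLACK_IF_TRUE = 0.7    # hypothetical urn composition if A is true
FRACTION_BLACK_IF_FALSE = 0.3   # hypothetical urn composition if A is false

# Case 1: marbles drawn at random, so seeing K black ones depends on the urn.
after_random = posterior(0.5,
                         FRACTION_BLACK_IF_TRUE ** K,
                         FRACTION_BLACK_IF_FALSE ** K)

# Case 2: the Super Persuader. A fair coin decides that it argues "true",
# and it can then always find K black marbles, whatever the urn contains.
after_persuader = posterior(0.5, 0.5, 0.5)

print(after_random, after_persuader)
```

With random draws the ten black marbles are strong evidence; with the Persuader the likelihood ratio is exactly 1, so the posterior stays at the prior.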

This has some really weird consequences. For one thing, after your conversation you still have all of that information hanging around in your head (as long as you have a good enough memory). So if anybody asks you what you think about the issue, you will be able to spout off incredibly powerful arguments for exactly one side of the issue. But you’ll also have to concede that you don’t actually strongly believe the conclusion of these arguments. And if you’re asked to present any evidence for not accepting the conclusion, you’ll likely draw a blank, or only be able to produce very unsatisfactory answers. You will certainly not come off as a very rational person!

Entropy vs relative entropy

January 19, 2018February 9, 2018 ~ squarishbracket ~ Leave a comment

This post is about the relationship between entropy and relative entropy. This relationship is subtle but important – purely maximizing entropy (MaxEnt) is not equivalent to Bayesian conditionalization except in special cases, while maximizing relative entropy (ME) is. In addition, the justifications for MaxEnt are beautiful and grounded in fundamental principles of normative epistemology. Do these justifications carry over to maximizing relative entropy?

To a degree, yes. We’ll see that maximizing relative entropy is a more general procedure than maximizing entropy, and reduces to it in special cases. The cases where MaxEnt gives different results from ME can be interpreted through the lens of MaxEnt, and relate to an interesting distinction between commuting and non-commuting observations.

So let’s get started!

We’ll solve three problems: first, using MaxEnt to find an optimal distribution with a single constraint C₁; second, using MaxEnt to find an optimal distribution with constraints C₁ and C₂; and third, using ME to find the optimal distribution with C₂ given the starting distribution found in the first problem.

Part 1

Problem: Maximize – ∫ P logP dx with constraints
∫ P dx = 1
∫ C₁[P] dx = 0

∂_P ( – P₁ logP₁ + (α + 1) P₁ + βC₁[P₁] ) = 0
– logP₁ + α + β C₁’[P₁] = 0

Part 2

Problem: Maximize – ∫ P logP dx with constraints
∫ P dx = 1
∫ C₁[P] dx = 0
∫ C₂[P] dx = 0

∂_P ( – P₂ logP₂ + (α’ + 1) P₂ + β’C₁[P₂] + λ C₂[P₂] ) = 0
– logP₂ + α’ + β’ C₁’[P₂] + λ C₂’[P₂] = 0

Part 3

Problem: Maximize – ∫ P log(P / P₁) dx with constraints
∫ P dx = 1
∫ C₂[P] dx = 0

∂_P ( – P₃ logP₃ + P₃ logP₁ + (α’’ + 1)P₃ + λ’ C₂[P₃] ) = 0
– logP₃ + α’’ + logP₁ + λ’ C₂’[P₃] = 0
– logP₃ + α’’ + α + β C₁’[P₁] + λ’ C₂’[P₃] = 0
– logP₃ + α’’’ + β C₁’[P₁] + λ’ C₂’[P₃] = 0
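It can help to solve the three stationarity conditions for the distributions explicitly. The block below is a sketch under one added assumption that is not in the derivation above: each constraint is linear in P, so that C_i[P] = f_i(x) P(x) and C_i'[P] = f_i(x) for some function f_i (an expectation-value constraint is of this form).

```latex
\begin{align*}
P_1(x) &= e^{\alpha + \beta f_1(x)} \\
P_2(x) &= e^{\alpha' + \beta' f_1(x) + \lambda f_2(x)} \\
P_3(x) &= e^{\alpha''' + \beta f_1(x) + \lambda' f_2(x)}
        = P_1(x)\, e^{\alpha'' + \lambda' f_2(x)}
\end{align*}
```

In P₂ the multiplier β’ is refit together with λ so that both constraints hold, whereas in P₃ the multiplier β is inherited unchanged from Part 1 and only α’’ and λ’ are adjusted.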

We can now compare our answers for Part 2 to Part 3. These are the same problem, solved with MaxEnt and ME. While they are clearly different solutions, they have interesting similarities.

MaxEnt
– logP₂ + α’ + β’ C₁’[P₂] + λ C₂’[P₂] = 0
∫ P₂ dx = 1
∫ C₁[P₂] dx = 0
∫ C₂[P₂] dx = 0

ME
– logP₃ + α’’’ + β C₁’[P₁] + λ’ C₂’[P₃] = 0
∫ P₃ dx = 1
∫ C₁[P₁] dx = 0
∫ C₂[P₃] dx = 0

The equations are almost identical. The only difference is in how they treat the old constraint. In MaxEnt, the old constraint is treated just like the new constraint – a condition that must be satisfied for the final distribution.

But in ME, the old constraint is no longer required to be satisfied by the final distribution! Instead, the requirement is that the old constraint be satisfied by your initial distribution!

That is, MaxEnt takes all previous information, and treats it as current information that must constrain your current probability distribution.

On the other hand, ME treats your previous information as constraints only on your starting distribution, and only ensures that your new distribution satisfy the new constraint!

When might this be useful?

Well, say that the first piece of information you received, C₁, was the expected value of some measurable quantity. Maybe it was that x̄ = 5.

But if the second piece of information C₂ was an observation of the exact value of x, then we clearly no longer want our new distribution to still have an expected value of 5. After all, it is common for the observed value of a variable to differ from its expected value.

E(x) vs x

Once we have found the exact value of x, all previous information relating to the value of x is screened off, and should no longer be taken as constraints on our distribution! And this is exactly what ME does, and MaxEnt fails to do.

What about a case where the old information stays relevant? For instance, an observation of the values of a certain variable is not ‘cancelled out’ by a later observation of another variable. Observations can’t be un-observed. Does ME respect these types of constraints?

Yes!

Observations of variables are represented by constraints that set the distribution over those variables to delta-functions. And when your old distribution contains a delta function, that delta function will still stick around in your new distribution, ensuring that the old constraint is still satisfied.

P_old ~ δ(x – x’)
implies
P_new ~ δ(x – x’)
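A discrete analogue makes this persistence visible. The sketch below assumes the new constraint is an expectation value, in which case the relative-entropy maximizer is the prior multiplied by an exponential factor (exponential tilting). The function `me_update`, the bisection bounds, and the toy numbers are all illustrative choices.

```python
import math

def me_update(prior, values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Maximize relative entropy to `prior` subject to E[x] = target_mean.

    The solution has the form new(x) proportional to prior(x) * exp(lam * x);
    lam is found by bisection, since the tilted mean increases with lam.
    """
    def tilted(lam):
        weights = [p * math.exp(lam * x) for p, x in zip(prior, values)]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(dist):
        return sum(p * x for p, x in zip(dist, values))

    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(tilted(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return tilted((lo + hi) / 2)

values = [0, 1, 2, 3, 4]
prior = [0.0, 0.25, 0.25, 0.25, 0.25]  # x = 0 was ruled out by an earlier observation
new = me_update(prior, values, target_mean=3.0)
new_mean = sum(p * x for p, x in zip(new, values))
print(new, new_mean)
```

Because the prior enters as a multiplicative factor, any value given zero probability by the old distribution still has zero probability afterwards, while the new mean constraint is satisfied.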

The class of observations that are made obsolete by new observations are called non-commuting observations. They are given this name because for such observations, the order in which you process the information is essential. Observations for which the order of processing doesn’t matter are called commuting observations.

In summary: maximizing relative entropy allows us to take into account subtle differences in the type of evidence we receive, such as whether or not old data is made obsolete by new data. And mathematically, maximizing relative entropy is equivalent to maximizing ordinary entropy with whatever new constraints were not included in your initial distribution, as well as an additional constraint relating to the value of your old distribution. While the old constraints are not guaranteed to be satisfied by your new distribution, the information about them is preserved in the form of the prior distribution that is a factor in the new distribution.

Fun with Akaike

January 19, 2018 / February 9, 2018 ~ squarishbracket

The Akaike information criterion is a metric for model quality that naturally arises from the principle of maximum entropy. It balances predictive accuracy against model complexity, encoding a formal version of Occam’s razor and solving problems of overfitting. I’m just now learning about how to use this metric, and will present a simple example that shows off a lot of the features of this framework.

Suppose that we have two coins. For each coin, we can ask what the probability is of that coin landing heads. Call these probabilities p and q.

Two classes of models are (1) those that say that p = q and (2) those that say that p ≠ q. The first class of models is simpler in an important sense – it can be characterized by a single parameter p. The second class, on the other hand, requires two separate parameters, one for each coin.

The number of parameters (k) used by our model is one way to measure model complexity. But of course, we can also test our models by getting experimental data. That is, we can flip each coin a bunch of times, record the results, and see how they favor one model over another.

One common way of quantifying the empirical success of a given model is by looking at the maximum value of its likelihood function L. This is the function that tells you how likely your data was, given a particular model. If the best-fitting version of Model 2 predicts the data better than the best-fitting version of Model 1, then this should count in favor of Model 2.

So how do we combine these pieces of information – k (the number of parameters in a model) and L (the predictive success of the model)? Akaike’s criterion says to look at the following metric:

AIC = 2k – 2 lnL

The smaller the value of this quantity, the better your model is.

So let’s apply this on an arbitrary data set:

n₁ = number of times coin 1 landed heads
n₂ = number of times coin 1 landed tails
m₁ = number of times coin 2 landed heads
m₂ = number of times coin 2 landed tails

For convenience, we’ll also call the total flips of coin 1 N, and the total flips of coin 2 M.

First, let’s look at how Model 1 (the one that says that the two coins have an equal chance of landing heads) does on this data. This model assigns probability p to each heads and probability 1 – p to each tails, on either coin.

L₁ = C(N, n₁) C(M, m₁) p^(n₁ + m₁) (1 – p)^(n₂ + m₂)

The two choose functions C(N, n₁) and C(M, m₁) are there to ensure that this function is nicely normalized. Intuitively, they arise from the fact that any given number of coin tosses that land heads could happen in many possible ways, and all of these ways must be summed up.

This function finds its peak value at the following value of p:

p = (n₁ + m₁) / (N + M)
L_1,max = C(N, n₁) C(M, m₁) (n₁ + m₁)^(n₁ + m₁) (n₂ + m₂)^(n₂ + m₂) / (N + M)^(N + M)

By Stirling’s approximation, this becomes:

ln(L_1,max) ~ –ln(F) – ½ ln(G)
where F = [(N + M)^(N + M) / (N^N M^M)] · [n₁^n₁ m₁^m₁ / (n₁ + m₁)^(n₁ + m₁)] · [n₂^n₂ m₂^m₂ / (n₂ + m₂)^(n₂ + m₂)]
and G = n₁n₂m₁m₂ / (NM)

With this, our Akaike information criterion for Model 1 tells us:

AIC₁ = 2 + 2ln(F) + ln(G)
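For anyone who wants the intermediate step, here is a sketch of how Stirling’s approximation produces this result. It assumes all four counts are large enough that ln n! ≈ n ln n – n + ½ ln(2πn) is accurate. The additive constant –ln 2π that appears is left out of the formulas above; it is the same for both models, so it drops out of any comparison between them.

```latex
\begin{align*}
\ln \binom{N}{n_1} &\approx N\ln N - n_1\ln n_1 - n_2\ln n_2
    + \tfrac12 \ln\frac{N}{2\pi\, n_1 n_2},
    \qquad n_1 + n_2 = N \\
\ln\!\left[\binom{N}{n_1}\binom{M}{m_1}\right] &\approx
    \ln\frac{N^N M^M}{n_1^{n_1} n_2^{n_2} m_1^{m_1} m_2^{m_2}}
    - \tfrac12 \ln G - \ln 2\pi,
    \qquad G = \frac{n_1 n_2 m_1 m_2}{NM} \\
\ln L_{1,\max} &\approx
    \ln\frac{N^N M^M}{n_1^{n_1} n_2^{n_2} m_1^{m_1} m_2^{m_2}}
    + \ln\frac{(n_1+m_1)^{n_1+m_1}\,(n_2+m_2)^{n_2+m_2}}{(N+M)^{N+M}}
    - \tfrac12\ln G - \ln 2\pi \\
&= -\ln F - \tfrac12 \ln G - \ln 2\pi
\end{align*}
```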

Moving on to Model 2, we now have two different parameters p and q to vary. The likelihood of our data given these two parameters is given by:

L₂ = C(N, n₁) C(M, m₁)p^n₁ (1 – p)^n₂ q^m₁ (1 – q)^m₂

The values of p and q that make this data most likely are:

p = n₁ / N
q = m₁ / M
So L_2,max = C(N, n₁) C(M, m₁) n₁^n₁ m₁^m₁ n₂^n₂ m₂^m₂ / (N^N M^M)

And again, using Stirling’s approximation, we get:

ln(L_2,max) ~ – ½ ln(G)
So AIC₂ = 4 + ln(G)

We now just need to compare these two AICs to see which model is preferable for a given set of data:

AIC₂ = 4 + ln(G)
AIC₁ = 2 + 2ln(F) + ln(G)

AIC₂ – AIC₁ = 2 – 2lnF
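The comparison is easy to run on actual counts. Below is a minimal Python sketch; the function names are my own, and it uses the exact binomial likelihood (via log-gamma) rather than Stirling’s approximation. Since the choose functions are identical in L₁ and L₂, they cancel in the difference, so AIC₂ – AIC₁ = 2 – 2lnF holds exactly and not just for large counts.

```python
import math


def log_binom_pmf(heads, tails, p):
    """Log-probability of `heads` heads and `tails` tails when P(heads) = p."""
    n = heads + tails
    log_choose = math.lgamma(n + 1) - math.lgamma(heads + 1) - math.lgamma(tails + 1)
    # Treat 0 * log(0) as 0 so that p = 0 or p = 1 does not raise an error.
    log_p = heads * math.log(p) if heads else 0.0
    log_q = tails * math.log(1 - p) if tails else 0.0
    return log_choose + log_p + log_q


def aic(k, max_log_likelihood):
    """Akaike information criterion: 2k - 2 ln(L_max)."""
    return 2 * k - 2 * max_log_likelihood


def compare_coin_models(n1, n2, m1, m2):
    """Return (AIC_1, AIC_2) for the one-parameter and two-parameter models."""
    N, M = n1 + n2, m1 + m2
    p_shared = (n1 + m1) / (N + M)   # best fit for Model 1
    p, q = n1 / N, m1 / M            # best fit for Model 2
    log_l1 = log_binom_pmf(n1, n2, p_shared) + log_binom_pmf(m1, m2, p_shared)
    log_l2 = log_binom_pmf(n1, n2, p) + log_binom_pmf(m1, m2, q)
    return aic(1, log_l1), aic(2, log_l2)


if __name__ == "__main__":
    # Counts are (n1, n2, m1, m2); these particular data sets are made up.
    for counts in [(6, 4, 6, 4), (4, 2, 2, 4), (20, 10, 10, 20)]:
        aic1, aic2 = compare_coin_models(*counts)
        better = "Model 1" if aic1 < aic2 else "Model 2"
        print(counts, f"AIC2 - AIC1 = {aic2 - aic1:+.3f}", "->", better)
```

The first data set has identical proportions on both coins, the other two have one coin at 2/3 heads and the other at 1/3, with different amounts of data.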

Let’s look at two cases that are telling. The first case will be that we find that both coin 1 and coin 2 end up landing heads an equal proportion of the time, and for simplicity, both coins are tossed the same number of times. Formally:

Case 1: N = M, n₁ = m₁, n₂ = m₂

In this case, F becomes 1, so lnF becomes 0.

AIC₂ – AIC₁ = 2 > 0
So Model 1 is preferable.

This makes sense! After all, if the two coins land heads in the same proportion of their flips, then our two models do equally well at predicting the data. But Model 1 is simpler, involving only a single parameter, so it is preferable. AIC gives us a precise numerical criterion by which to judge how preferable Model 1 is!

Okay, now let’s consider a case where coin 1 lands heads much more often than coin 2.

Case 2: N = M, n₁ = 2m₁, 2n₂ = m₂

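Now if we go through the algebra (the conditions force m₁ = n₂ = N/3, so coin 1 lands heads two thirds of the time and coin 2 one third of the time), we find:

F = 4^N (4/27)^(m₁ + n₂) ~ 1.12^N
So lnF ~ N ln(1.12) ~ .11 N
So AIC₂ – AIC₁ = 2 – .22N

This quantity is larger than 0 when N is less than roughly 10 (the zero crossing is at N ≈ 9), and is smaller than zero for all larger values.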

Which means that for Case 2, small enough data sets still allow us to go with Model 1, but as we get more data, the predictive accuracy of Model 2 eventually wins out!

It’s worth pointing out here that the Akaike information criterion is an approximation to the technique of maximizing relative entropy, and this approximation assumes large sets of data. Given this, it’s not clear how reliable our estimate of roughly 10 flips is for the point at which Model 2 takes over.

Let’s do one last thing with our simple models.

As our two coins become more and more similar in how often they land heads, we expect Model 1 to last longer before Model 2 ultimately wins out. Let’s calculate a general relationship between the similarity in the ratios of heads to tails in coin 1 and coin 2 and the amount of time it takes for Model 1 to lose out.

Case 3: N = M, n₁ = r·m₁, r·n₂ = m₂

r here is our ratio of p/q – the chance of heads in coin 1 over the chance of heads in coin 2. Skipping ahead through the algebra we find:

lnF = 2N ln( 2 r^{r/(r + 1)} / (r + 1) )

Model 2 becomes preferable to Model 1 when AIC₂ becomes smaller than AIC₁, so we can find the critical point by setting ∆AIC = 2 – 2 lnF = 0

lnF = 2N ln( 2 r^{r/(r + 1)} / (r + 1) ) = 1
N = 1 / ( 2 ln( 2 r^{r/(r + 1)} / (r + 1) ) )

As a check, r = 2 gives lnF ~ .11 N, matching Case 2. We can see that as r goes to 1, N goes to ∞. We can see how quickly this happens by doing some asymptotics:

r = 1 + ε
ln( 2 r^{r/(r + 1)} / (r + 1) ) ~ ε²/8
N ~ 4 / ε²
So ε ~ 2 / √N

N goes to infinity quadratically as r approaches 1 linearly: every time the gap ε shrinks by a factor of 10, the critical N grows by a factor of about 100. This gives us a very rough idea of how similar our coins must be for us to treat them as essentially identical. Evaluating the exact expression for N at a few values of r gives the following table (the r = 2 row is our earlier Case 2 result of roughly 10):

r	N
2	~9
1.1	~440
1.01	~40,000
1.001	~4,000,000
1.0001	~400,000,000
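As a numerical cross-check, here is a small Python sketch that finds the critical N straight from the definition of F given earlier, rather than from a closed form. The helper names are my own. It relies on two facts that follow from the Case 3 conditions: N = M forces m₁ = n₂, and lnF scales linearly with the amount of data, so it is enough to evaluate lnF for one flip of each coin.

```python
import math


def xlogx(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def ln_F(n1, n2, m1, m2):
    """ln F from the definition in the text; counts may be non-integer."""
    N, M = n1 + n2, m1 + m2
    return (xlogx(N + M) - xlogx(N) - xlogx(M)
            + xlogx(n1) + xlogx(m1) - xlogx(n1 + m1)
            + xlogx(n2) + xlogx(m2) - xlogx(n2 + m2))


def critical_flips(r):
    """Flips per coin at which AIC2 - AIC1 = 2 - 2 ln F crosses zero (r != 1)."""
    # Per flip of each coin, Case 3 gives n1 = m2 = r/(r+1) and n2 = m1 = 1/(r+1).
    heads_frac, tails_frac = r / (r + 1), 1 / (r + 1)
    ln_f_per_flip = ln_F(heads_frac, tails_frac, tails_frac, heads_frac)
    # ln F = N * ln_f_per_flip, so 2 - 2 * N * ln_f_per_flip = 0 gives:
    return 1 / ln_f_per_flip


if __name__ == "__main__":
    for r in (2, 1.1, 1.01, 1.001):
        print(f"r = {r}: critical N ~ {critical_flips(r):,.1f}")
```

For r = 2 this gives a critical N of about 8.8 flips per coin.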

Anthropic argument for common priors

January 18, 2018 / May 22, 2018 ~ squarishbracket

(Idea from Robin Hanson and Tyler Cowen’s 2004 paper Are Disagreements Honest?)

One common argument relating to common priors is that two rational agents with all the same information (including no information at all) could have no possible grounds on which to disagree. Priors by definition refer to the state of knowledge before either agent had any evidence relevant to a given proposition. So there is no information that either agent could have that would allow a difference in priors.

A response to this is that some information that we have is inherently private and unique to us. For instance, you and I might have differences in intelligence, in ways of conceptualizing the world, or in the things we innately find intuitively plausible. All of these differences may count as important information in shaping our priors on a given subject, before we ever encounter a single piece of evidence relevant to the subject.

Here’s a really weird argument for why even these differences should not count. If we use anthropic reasoning, and treat our own existence and the details of our brain and body as just another thing to be conditioned on, then even these private intimate details are simply contingent facts about the world that are to be treated as evidence. Before you’ve conditioned on your own existence, you should be agnostic as to which set of brain/body/mind out of all the possible sets of observers “you” will end up being. You must imagine yourself behind Rawls’ veil of ignorance, a disembodied reasoner that is identical to all other such reasoners. So there is no conceivable reason why your prior should differ from anybody else’s – you must treat yourself as literally the same entity as them prior to anthropic conditioning.

In less out-there terms, if you encounter somebody with an apparently different prior from you, then you should consider “Hmm, what if I were born as this person, instead of myself?” The answer to which is, of course, you would have had the same priors as them. Which means that your difference in “priors” is actually a difference of posteriors resulting from conditioning on the arbitrary choice of body/brain/experiences you ended up with.

In addition, by Aumann’s agreement theorem, any apparent differences in priors that become common knowledge should quickly go away, once they are realized to be merely differences in posteriors. Essentially, any differences in priors that last between two rational individuals are signs that they are arbitrarily favoring their own existence in considerations of what prior they should use.

Why you should be a Bayesian

January 17, 2018 / February 9, 2018 ~ squarishbracket

In this post, I take Bayesianism to be the following normative epistemological claim: “You should treat your beliefs like probabilities, and reason according to the axioms of probability theory.”

Here are a few reasons why I support this claim:

I. Logic is not enough

Reasoning deductively from premises to conclusion is a wonderfully powerful tool when it can be applied. If you have absolute certainty in some set of premises, and these premises entail a new conclusion, then you can extend your certainty to the new conclusion. Alternatively, you can state clearly the conditions under which you would be granted certain belief, in the form of a conditional argument (if you were to convince me that A and B are true, then I would believe that C is true).

This is great for mathematicians proving theorems about abstract logical entities. But in the real world, deductive inference is simply not enough to account for the types of problems we face. We are constantly reasoning in a condition of uncertainty, where we have multiple competing theories about what’s going on, and we seek evidence – partial evidence, not deductively complete evidence – as to which of these theories we should favor.

If you want to know how to form beliefs about the parts of reality that aren’t clear-cut and certain, then you need to go beyond pure logic.

II. Probability theory is a natural extension of logic

Cox’s theorem shows that any system of plausible reasoning – modifying and updating beliefs in the presence of uncertainty – that is consistent with logic and a few minimal assumptions about normative reasoning is necessarily isomorphic to probability theory.

The contrapositive of this is that any system of reasoning under uncertainty that isn’t ultimately functionally equivalent to Bayesianism is either logically inconsistent or violates other common-sense axioms of reasoning.

In other words, probability theory is the best candidate that we have for extending logic into the domain of the uncertain. It is about what is likely, not certain, to be true, and the way that we should update these assessments when receiving new information. In turn, probability theory contains ordinary logic as a special case when you take the limit of absolute certainty.

III. Non-Bayesian systems of plausible reasoning result in inconsistencies and irrational behavior

Dutch-book arguments prove that any agent that is violating the axioms of probability theory can be exploited by cleverly capitalizing on logical inconsistencies in their beliefs. This combines a pragmatic argument (non-Bayesians are worse off in the long run) with an epistemic argument (non-Bayesians are vulnerable to logical inconsistencies in their preferences).

IV. You should be honest about your uncertainty

The principle of maximizing entropy mandates a unique way to set beliefs given your evidence, such that you make no presumptions about knowledge that you don’t have. This principle is fully consistent with and equivalent to standard Bayesian conditionalization.

In other words, Bayesianism is about epistemic humility – it tells you to not pretend to know things that you don’t know.

V. Bayesianism provides the foundations for the scientific method

The scientific method, needless to say, is humanity’s crowning epistemic achievement. With it we have invented medicine, probed atoms, and gone to the stars. Its success can be attributed to the structure of its method of investigating truth claims: in short, science is about searching theories for testable consequences, and then running experiments to update our beliefs in these theories.

This is all contained in Bayes’ rule, the fundamental law of probabilistic inference:

Pr(theory | evidence) ~ Pr(evidence | theory) · Pr(theory)

This rule tells you precisely how you should update your beliefs given your evidence, no more and no less. It contains the wisdom of empiricism that has revolutionized the world we live in.
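To make the rule concrete, here is a minimal Python sketch of a single update. The “~” above means “proportional to”: the missing normalization constant is Pr(evidence), which is just the sum of the right-hand side over all theories under consideration. The function name and the numbers are made up for illustration.

```python
def bayes_update(priors, likelihoods):
    """Posterior over theories from Pr(theory) and Pr(evidence | theory)."""
    unnormalized = {t: likelihoods[t] * priors[t] for t in priors}
    total = sum(unnormalized.values())   # this is Pr(evidence)
    return {t: w / total for t, w in unnormalized.items()}


if __name__ == "__main__":
    # Hypothetical example: two theories about a coin, and we observe one heads.
    priors = {"fair": 0.5, "biased": 0.5}
    likelihoods = {"fair": 0.5, "biased": 0.9}   # Pr(heads | theory)
    print(bayes_update(priors, likelihoods))
```

Feeding the posterior back in as the next prior gives the update for a second observation, and so on.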

VI. Bayesianism is practically useful

So maybe you’re convinced that Bayesianism is right in principle. There’s a separate question of whether Bayesianism is useful in practice. Maybe treating your beliefs like probabilities is like trying to do psychology starting from Schrödinger’s equation – possible in principle but practically infeasible, not to mention a waste of time.

But Bayesianism is practically useful.

Statistical mechanics, one of the most powerful branches of modern science, is built on a foundation of explicitly Bayesian principles. More generally, good statistical reasoning is incredibly useful across all domains of truth-seeking, and an essential skill for anybody that wants to understand the world.

And Bayesianism is not just useful for epistemic reasons. A fundamental ingredient of decision-making is the ability to produce accurate models of reality. If you want to effectively achieve your goals, whatever they are, you must be able to engage in careful probabilistic reasoning.

And finally, in my personal experience I have found Bayesian epistemology to be infinitely mineable for useful heuristics in thinking about philosophy, physics, altruism, psychology, politics, my personal life, and pretty much everything else. I recommend anybody whose interest has been sparked to check out the following links:

Arbital guide to Bayes’ rule (if you’re only going to check out one of the links, make it this one)
E.T. Jaynes’ full-length textbook Probability Theory: The Logic of Science
Stanford Encyclopedia of Philosophy entry on Bayesianism
The blog SlateStarCodex, often a source of good applied Bayesian thinking – especially this and this

Consistency and priors

January 16, 2018 / January 18, 2018 ~ squarishbracket

The method of reasoning illustrated here is somewhat reminiscent of Laplace’s “principle of indifference.” However, we are concerned here with indifference between problems, rather than indifference between events. The distinction is essential, for indifference between events is a matter of intuitive judgment on which our intuition often fails even when there is some obvious geometrical symmetry (as Bertrand’s paradox shows).

E. T. Jaynes
Prior Probabilities

I’ve previously written praise of the principle of maximum entropy as a prior-setting method that is justified on the basis of a very minimal and highly intuitive set of epistemic features.

But there’s an even better technique for prior-setting, one that is justified on incredibly fundamental grounds. This technique can only be used in rare cases, but is immensely powerful when it applies. It’s the principle of transformation groups.

Here is the single assumption from which the principle arises:

“In problems where we have the same prior information, we should assign the same prior probabilities.” (Jaynes’ wording)

This is simple to the point of seeming almost tautological. So what can we do with it?

We’ll start with one of the simplest applications of transformation groups. Suppose that somebody gives you the following information:

I = “This coin will land either tails or heads.”

Now you want to say what the following probabilities should be:

P(This coin will land tails | I) = p
P(This coin will land heads | I) = q

Intuitively, it seems obvious to us that absent any other information, we should assign equal probabilities to these. But why? Is there a principled reason for assuming that the coin is a fair coin? Or is this just a presumption that is importing into the problem our background knowledge about most coins being fair?

The method of transformation groups gives us a principled reason. It says to rephrase the problem as follows:

I’ = “This coin will land either heads or tails.”

Now, our new problem differs from our initial problem only by replacing every “heads” with “tails” and every “tails” with “heads”. Since our prior-setting procedure found that P(This coin will land tails | I) = p in the first problem, it should now find P(This coin will land heads | I’) = p in this new one. This is required for any consistent prior-setting procedure! If the problem changes by just switching places of labels, then the priors should change in the exact same way. This means that:

P(This coin will land heads | I’) = p
P(This coin will land tails | I’) = q

But clearly, I = I’; the logical operator “OR” is symmetric! Which means that:

P(This coin will land heads | I’) = P(This coin will land heads | I)

That is, p = q. And since p + q = 1, this is only possible if p = q = ½!

This is simple, but beautiful. The principle tells us that the only logical way to set our priors in this case is evenly – anything else would be either logically inconsistent, or assuming extra information that breaks the symmetry between heads and tails. It goes from logical symmetry to probability symmetry!

Finding these symmetries is what the method of transformation groups is all about. More generally, one can represent a choice between N different possibilities as the statement:

I = “Possibility 1 or possibility 2 or … possibility N”

But this is symmetric with:

I’ = “Possibility 2 or possibility 1 or … possibility N”

As well as all other orderings.

By the exact same argument as above, your prior-setting procedure is required by logical consistency to evenly distribute credences across the N possibilities. So for each k from 1 to N, P(Possibility k | I) = 1/N.

The method of transformation groups can also be applied to continuous variables, where finding the right set of priors can be a lot less intuitive. You do so by noting different types of symmetries for different types of parameters.

For instance, a location parameter is one that serves to merely shift a probability distribution over an observable quantity, without reshaping the distribution. We can formally express this by saying that for a location parameter µ, the distribution over x depends only on the difference x – µ:

p(x | µ) = f(x – µ)

For such parameters, it must be the case that the prior distribution over them is similarly symmetric over translational shifts:

For all ∆, p(µ) = p(µ + ∆)
So p(µ) = c, for some constant c

Another common category is that of scale parameters. These are parameters that serve to rescale probability distributions without reshaping them. Formally:

p(x | σ) = (1/σ) g(x / σ)

For this symmetry, the requirement for consistency is:

For all s > 0, p(σ) = (1/s) · p(σ / s)
So, p(σ) ∝ 1/σ
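The last step can be made explicit. Here is a short sketch, assuming only that the consistency requirement above holds for every positive s. Since s is arbitrary, we are free to choose s = σ, and that fixes the form of the prior up to an overall constant:

```latex
\begin{align*}
p(\sigma) &= \frac{1}{s}\, p\!\left(\frac{\sigma}{s}\right)
    \quad \text{for all } s > 0 \\
\text{set } s = \sigma: \quad
p(\sigma) &= \frac{1}{\sigma}\, p(1) \;\propto\; \frac{1}{\sigma}
\end{align*}
```

Like the constant prior for a location parameter, this cannot be normalized over the whole range 0 < σ < ∞. It is the same thing as a uniform prior over ln σ.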

In summation, by carefully analyzing the symmetries of the background information you have, you can extract out requirements for how to set your prior distribution that are mandated on threat of logical inconsistency!

Dutch book arguments

January 16, 2018 / January 20, 2018 ~ squarishbracket

These are the laws of probability, which we have proved to be necessarily true of any consistent set of degrees of belief. Any definite set of degrees of belief which broke them would be inconsistent in the sense that it violated the laws of preference between options … If anyone’s mental condition violated these laws, his choice would depend on the precise form in which the options were offered him, which would be absurd. He could have a book made against him by a cunning better and would then stand to lose in any event.

We find, therefore, that a precise account of the nature of partial belief reveals that the laws of probability are laws of consistency, an extension to partial beliefs of formal logic, the logic of consistency.

Frank Ramsey
The Foundations of Mathematics and Other Logical Essays, Volume 5

In this post, I’m going to describe one of the more famous arguments for Bayesianism.

These arguments are about how different types of epistemological frameworks will handle different series of wagers. Let me just lay out clearly what exactly we mean by a wager, so as to remove any ambiguity.

A wager on proposition A is a betting opportunity. It involves a payoff amount S and a buy-in quantity. In general, the amount that the buy-in costs will be some fraction f of the payoff amount, so we’ll write it as fS. If you bet on A and it turns out true, then you get the payout S, but have still paid the initial buy-in. And if you bet on A and it turns out false, then you get no payout and lose the fS you already spent.

A	Net Payout
True	S – fS
False	-fS

From this payout table, you can calculate that an agent will find the wager to be favorable to them exactly in the case that P(A) is greater than f. That is, the agent will want to take the bet whenever the chance of a payout is greater than the proportion of the payout that is required to buy into the bet.
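The calculation behind this is short. It assumes that the agent judges a wager by its expected net payout, and that the payoff S is positive:

```latex
\begin{align*}
\mathbb{E}[\text{net payout}]
  &= P(A)\,(S - fS) + \bigl(1 - P(A)\bigr)\,(-fS) \\
  &= \bigl(P(A) - f\bigr)\,S
\end{align*}
```

This is positive exactly when P(A) > f.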

Now, imagine that somebody has a credence of 52% in a proposition A and a credence of 52% in the proposition ~A. How will they evaluate the following set of bets?

B₁: pays out $100 if A is true, buy-in of $51
B₂: pays out $100 if ~A is true, buy-in of $51
B₃: pays out a guaranteed $100, buy-in of $102

They will see both B₁ and B₂ as favorable bets, since the buy-in is a smaller fraction of the payout than the chance of payout. And they will see B₃ as an unfavorable bet, since clearly the buy-in is a larger proportion of the payout than the chance of a payout.

But B₃ is just the same as the combination of bets B₁ and B₂!

Why? Well, if you bet on both B₁ and B₂, then you are guaranteed to win exactly one of the two (since A and ~A cannot both be true, but one of the two must be). Then you will have paid in a net sum of $102, and gotten back only $100.
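The whole example fits in a few lines of Python. This is just a sketch with function names of my own choosing; it uses the acceptance rule from above (take a wager when your credence exceeds the buy-in fraction f) and then checks the combined result of B₁ and B₂ in both possible worlds.

```python
def net_payout(payout, buy_in, wins):
    """Net result of one wager: payout minus buy-in if it wins, else minus buy-in."""
    return (payout if wins else 0) - buy_in


def looks_favorable(credence, payout, buy_in):
    """An agent accepts a wager when its credence exceeds the buy-in fraction f."""
    return credence > buy_in / payout


if __name__ == "__main__":
    credence_a, credence_not_a = 0.52, 0.52   # incoherent: they sum to 1.04

    # The agent accepts B1 and B2 when each is evaluated on its own.
    print(looks_favorable(credence_a, 100, 51))       # B1
    print(looks_favorable(credence_not_a, 100, 51))   # B2

    # Yet holding both loses money whether A is true or false.
    for a_is_true in (True, False):
        total = net_payout(100, 51, a_is_true) + net_payout(100, 51, not a_is_true)
        print(f"A is {a_is_true}: net {total}")
```

In both worlds the net result is a loss of $2, which is exactly what B₃ offers.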

A similar argument can be made for any levels of credence C(A) and C(~A) that don’t sum up to 100%. And all of the usual axioms of probability theory can be argued for in the same way. Such arguments are called Dutch book arguments.

Dutch book arguments are standardly presented as revealing that an agent who does not form beliefs according to the laws of probability theory can be juiced for money by clever bookies.

This is true; somebody with beliefs like those described above can be endlessly exploited for profit. But it is much less impressive than the real conclusion of the Dutch book argument.

Recall that our agent above was found to be believing a logical contradiction as a result of not having their beliefs align with probability theory (they had to believe that a bet was simultaneously favorable to them and not favorable to them).

Said another way, an agent not following the probability calculus may evaluate the same proposition differently if presented in a different form.

This is what Dutch book arguments really say: if you want your beliefs to be logically consistent, then you are required to reason according to probability theory!