Nature’s Urn and Ignoring the Super Persuader

This post is about one of my favorite little-known thought experiments.

Here’s the setup:

You are in conversation with a Super Persuader – an artificial intelligence that has access to an enormously larger pool of information than you, and that is excellent at synthesizing information to form powerful arguments. The Super Persuader is so super at persuading, in fact, that given any proposition, it will be able to construct the most powerful argument possible for that proposition, consisting of the strongest evidence it has access to.

The Super Persuader is going to try and persuade you either that a certain proposition A is true or that it is false. In doing so, you know that it cannot lie, but it can pick and choose the information that it presents to you, giving an incomplete picture.

Finally, you know that the Super Persuader is going to decide which side of the issue to argue based on a random coin toss: a 50% chance that it will argue that A is true, and a 50% chance that it will argue that A is false.

Once the coin is tossed and the Persuader begins to present the evidence, how should you rationally respond? Should you be swayed by the arguments, ignore them, or something else?

Here’s a basic presentation of one response to this thought experiment:

Of course you should be swayed by their arguments! If not, then you end up receiving boatloads of crazily persuasive argumentation and pretending like you’ve heard none of it. This is the very definition of irrationality – closing your eyes to the evidence you have sitting right in front of you! There’s no reason to disregard all of the useful information that you’re getting, just because it’s coming from a source that is trying to persuade you. Regardless of the motives of the Super Persuader, it can only persuade you by giving you honest and genuinely convincing evidence. And a rational agent has no choice but to update their credences on this evidence.

I think that this is a bad argument. Here’s an analogy to help explain why.

Imagine the set of all possible pieces of evidence you could receive for a given proposition as a massive urn filled with marbles. Each marble is a single argument that could be made for the proposition. If the argument is in support of the proposition, then the marble representing it will be black. And if the argument is against the proposition, then the marble representing it will be white.

Now, the question as to whether the proposition is more likely to be true or false is roughly the same as the question of whether there are more black or white marbles in the urn. It is exactly the same question if all of the arguments in question are equally strong and we have no reason to start out favoring one side over the other.

But now we can think about the actions of the Super Persuader as follows: the Super Persuader has direct access to the urn, and can select any marble it wants. If it wants to persuade you that the proposition is true, then it will just fish through the urn and present you with as many black marbles as it desires, ignoring all the white marbles.

Clearly this process gives you no information as to the true proportion of the marbles that are white versus the proportion that are black. The data you are receiving is contaminated by a ridiculously powerful selection bias. The evidence you see is no longer linked in any way to the truth of the proposition, because regardless of whether or not it is true, you still expect to receive large amounts of evidence for it.

In the end, all of the pieces of evidence you receive are useless, in the same way that a stacked deck is not a reliable source of information about the average card deck.
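
To make the selection effect concrete, here is a minimal Python sketch comparing two ways of being shown ten black marbles. It assumes two hypothetical candidate urns (70% black versus 30% black), a 50/50 prior, random draws with replacement, and an urn large enough that the Persuader can always find ten marbles of whichever color it was assigned. None of these numbers come from the thought experiment itself.

```python
from fractions import Fraction

def posterior(prior, lik_if_true, lik_if_false):
    """Bayes' rule for a binary hypothesis."""
    joint_true = prior * lik_if_true
    joint_false = (1 - prior) * lik_if_false
    return joint_true / (joint_true + joint_false)

def random_draw_likelihood(frac_black, k):
    """Chance that k independent random draws all come up black."""
    return frac_black ** k

def persuader_likelihood(frac_black, k):
    """Chance that a Persuader assigned to argue 'black' shows k black marbles.
    Assumption: the urn is large, so it succeeds whenever any black marbles exist."""
    return Fraction(1) if frac_black > 0 else Fraction(0)

PRIOR = Fraction(1, 2)            # hypothetical prior that the urn is mostly black
MOSTLY_BLACK = Fraction(7, 10)    # hypothetical fraction of black marbles if "true"
MOSTLY_WHITE = Fraction(3, 10)    # hypothetical fraction of black marbles if "false"
K = 10                            # number of black marbles you are shown

after_random = posterior(
    PRIOR,
    random_draw_likelihood(MOSTLY_BLACK, K),
    random_draw_likelihood(MOSTLY_WHITE, K),
)
after_persuader = posterior(
    PRIOR,
    persuader_likelihood(MOSTLY_BLACK, K),
    persuader_likelihood(MOSTLY_WHITE, K),
)

print("posterior after random draws:  ", float(after_random))
print("posterior after the Persuader: ", float(after_persuader))
```

Because the side being argued is set by a coin toss that is independent of the truth, its probability cancels out of both likelihoods. The Persuader's ten black marbles are equally likely under either urn, so the posterior stays at the prior, while the same ten marbles drawn at random would be very strong evidence.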

This has some really weird consequences. For one thing, after your conversation you still have all of that information hanging around in your head (as long as you have a good enough memory). So if anybody asks you what you think about the issue, you will be able to spout off incredibly powerful arguments for exactly one side of the issue. But you’ll also have to concede that you don’t actually strongly believe the conclusion of these arguments. And if you’re asked to present any evidence for not accepting the conclusion, you’ll likely draw a blank, or only be able to produce very unsatisfactory answers. You will certainly not come off as a very rational person!

Entropy vs relative entropy

This post is about the relationship between entropy and relative entropy. This relationship is subtle but important – purely maximizing entropy (MaxEnt) is not equivalent to Bayesian conditionalization except in special cases, while maximizing relative entropy (ME) is. In addition, the justifications for MaxEnt are beautiful and grounded in fundamental principles of normative epistemology. Do these justifications carry over to maximizing relative entropy?

To a degree, yes. We’ll see that maximizing relative entropy is a more general procedure than maximizing entropy, and reduces to it in special cases. The cases where MaxEnt gives different results from ME can be interpreted through the lens of MaxEnt, and relate to an interesting distinction between commuting and non-commuting observations.

So let’s get started!

We’ll solve three problems: first, using MaxEnt to find an optimal distribution with a single constraint C1; second, using MaxEnt to find an optimal distribution with constraints C1 and C2; and third, using ME to find the optimal distribution with C2 given the starting distribution found in the first problem.

Part 1

Problem: Maximize – ∫ P logP dx with constraints
∫ P dx = 1
∫ C1[P] dx = 0

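Introducing Lagrange multipliers α and β for the two constraints and setting the derivative of the integrand with respect to P to zero at the optimum P1:

∂/∂P1 ( – P1 logP1 + (α + 1) P1 + β C1[P1] ) = 0
– logP1 + α + β C1’[P1] = 0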

Part 2

Problem: Maximize – ∫ P logP dx with constraints
∫ P dx = 1
∫ C1[P] dx = 0
∫ C2[P] dx = 0

∂/∂P2 ( – P2 logP2 + (α’ + 1) P2 + β’ C1[P2] + λ C2[P2] ) = 0
– logP2 + α’ + β’ C1’[P2] + λ C2’[P2] = 0

Part 3

Problem: Maximize – ∫ P log(P / P1) dx with constraints
∫ P dx = 1
∫ C2[P] dx = 0

∂/∂P3 ( – P3 logP3 + P3 logP1 + (α’’ + 1) P3 + λ’ C2[P3] ) = 0
– logP3 + α’’ + logP1 + λ’ C2’[P3] = 0

Substituting logP1 = α + β C1’[P1] from Part 1, and then defining α’’’ = α’’ + α:

– logP3 + α’’ + α + β C1’[P1] + λ’ C2’[P3] = 0
– logP3 + α’’’ + β C1’[P1] + λ’ C2’[P3] = 0

We can now compare our answers for Part 2 to Part 3. These are the same problem, solved with MaxEnt and ME. While they are clearly different solutions, they have interesting similarities.

MaxEnt
– logP2 + α’ + β’ C1’[P2] + λ C2’[P2] = 0
∫ P2 dx = 1
∫ C1[P2] dx = 0
∫ C2[P2] dx = 0

ME
– logP3 + α’’’ + β C1’[P1] + λ’ C2’[P3] = 0
∫ P3 dx = 1
∫ C1[P1] dx = 0
∫ C2[P3] dx = 0

The equations are almost identical. The only difference is in how they treat the old constraint. In MaxEnt, the old constraint is treated just like the new constraint – a condition that must be satisfied for the final distribution.

But in ME, the old constraint is no longer required to be satisfied by the final distribution! Instead, the requirement is that the old constraint be satisfied by your initial distribution!

That is, MaxEnt takes all previous information, and treats it as current information that must constrain your current probability distribution.

On the other hand, ME treats your previous information as constraints only on your starting distribution, and only ensures that your new distribution satisfy the new constraint!
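
Here is a small numerical sketch of that difference, under assumptions that are mine rather than part of the derivation above: the variable is a six-sided die, both constraints are expectation constraints (so each update just multiplies the starting distribution by exp(λ·g) and renormalizes), C1 says the mean is 5, and C2 says the probability of rolling a six is 0.1. The helper name `tilt` and the bisection search for λ are illustrative choices.

```python
import math

def tilt(prior, g, target, lo=-20.0, hi=20.0, steps=200):
    """Minimum-relative-entropy update of `prior` subject to E[g] = target.
    The solution has the form prior[i] * exp(lam * g[i]) / Z; because E[g]
    increases with lam, lam can be found by bisection."""
    def tilted(lam):
        weights = [p * math.exp(lam * gi) for p, gi in zip(prior, g)]
        z = sum(weights)
        return [w / z for w in weights]

    def expectation(dist):
        return sum(p * gi for p, gi in zip(dist, g))

    for _ in range(steps):
        mid = (lo + hi) / 2
        if expectation(tilted(mid)) < target:
            lo = mid
        else:
            hi = mid
    return tilted((lo + hi) / 2)

faces = [1, 2, 3, 4, 5, 6]
uniform = [1 / 6] * 6
is_six = [1 if x == 6 else 0 for x in faces]

def mean(dist):
    return sum(p * x for p, x in zip(dist, faces))

# Part 1 analogue: MaxEnt with C1 (mean = 5). On a finite set, MaxEnt is the
# same as maximizing relative entropy starting from the uniform distribution.
p1 = tilt(uniform, faces, 5.0)

# Part 3 analogue: ME starting from p1, imposing only C2 (P(six) = 0.1).
p3 = tilt(p1, is_six, 0.1)

print("mean under P1:   ", round(mean(p1), 4))
print("P(six) under P3: ", round(p3[5], 4))
print("mean under P3:   ", round(mean(p3), 4))
```

In this sketch the starting distribution satisfies C1 and the updated distribution satisfies C2, but the updated distribution's mean is no longer 5: the old constraint only shapes the result through the prior.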

When might this be useful?

Well, say that the first piece of information you received, C1, was the expected value of some measurable quantity. Maybe it was that x̄ = 5.

But if the second piece of information C2 was an observation of the exact value of x, then we clearly no longer want our new distribution to still have an expected value of 5. After all, the exact value of a variable commonly differs from its expected value.

E(x) vs x

Once we have found the exact value of x, all previous information relating to the value of x is screened off, and should no longer be taken as constraints on our distribution! And this is exactly what ME does, and MaxEnt fails to do.

What about a case where the old information stays relevant? For instance, an observation of the values of a certain variable is not ‘cancelled out’ by a later observation of another variable. Observations can’t be un-observed. Does ME respect these types of constraints?

Yes!

Observations of variables are represented by constraints that set the distribution over those variables to delta-functions. And when your old distribution contains a delta function, that delta function will still stick around in your new distribution, ensuring that the old constraint is still satisfied.

Pold ~ δ(x – x’)
implies
Pnew ~ δ(x – x’)
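
A quick way to see why the delta function survives, sketched in Python under two assumptions of mine: the new constraint is an expectation constraint, so the ME update multiplies the old distribution by exp(λ·g) and renormalizes, and a one-hot list stands in for the delta function on a discrete variable. The constraint values in `g` are arbitrary illustrative numbers.

```python
import math

def me_update(prior, g, lam):
    """ME update for an expectation constraint: prior * exp(lam * g), renormalized."""
    weights = [p * math.exp(lam * gi) for p, gi in zip(prior, g)]
    z = sum(weights)
    return [w / z for w in weights]

observed = [0.0, 0.0, 1.0, 0.0]   # discrete stand-in for delta(x - x')
g = [3.0, -1.0, 0.5, 2.0]         # arbitrary new constraint function (hypothetical)

for lam in (-2.0, 0.0, 5.0):
    print(lam, me_update(observed, g, lam))
```

Whatever multiplier the new constraint calls for, the points that had zero probability stay at zero, so all of the mass remains on the observed value.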

Observations that are made obsolete by new observations are called non-commuting observations. They are given this name because for such observations, the order in which you process the information is essential. Observations for which the order of processing doesn’t matter are called commuting observations.

In summation: maximizing relative entropy allows us to take into account subtle differences in the type of evidence we receive, such as whether or not old data is made obsolete by new data. And mathematically, maximizing relative entropy is equivalent to maximizing ordinary entropy with whatever new constraints were not included in your initial distribution, as well as an additional constraint relating to the value of your old distribution. While the old constraints are not guaranteed to be satisfied by your new distribution, the information about them is preserved in the form of the prior distribution that is a factor in the new distribution.

Fun with Akaike

The Akaike information criterion is a metric for model quality that naturally arises from the principle of maximum entropy. It balances predictive accuracy against model complexity, encoding a formal version of Occam’s razor and solving problems of overfitting. I’m just now learning about how to use this metric, and will present a simple example that shows off a lot of the features of this framework.

Suppose that we have two coins. For each coin, we can ask what the probability is of each coin landing heads. Call these probabilities p and q.

Two classes of models are (1) those that say that p = q and (2) those that say that p ≠ q. The first class of models are simpler in an important sense – they can be characterized by a single parameter p. The second class, on the other hand, require two separate parameters, one for each coin.

The number of parameters (k) used by our model is one way to measure model complexity. But of course, we can also test our models by getting experimental data. That is, we can flip each coin a bunch of times, record the results, and see how they favor one model over another.

One common way of quantifying the empirical success of a given model is by looking at the maximum value of its likelihood function L. This is the function that tells you how likely your data was, given a particular model. If the best-fitting version of Model 2 predicts the data better than the best-fitting version of Model 1, then this should count in favor of Model 2.

So how do we combine these pieces of information – k (the number of parameters in a model) and L (the predictive success of the model)? Akaike’s criterion says to look at the following metric:

AIC = 2k – 2 lnL

The smaller the value of this metric, the better your model is.

So let’s apply this on an arbitrary data set:

n1 = number of times coin 1 landed heads
n2 = number of times coin 1 landed tails
m1 = number of times coin 2 landed heads
m2 = number of times coin 2 landed tails

For convenience, we’ll also call the total flips of coin 1 N, and the total flips of coin 2 M.

First, let’s look at how Model 1 (the one that says that the two coins have an equal chance of landing heads) does on this data. This model assigns probability p to each heads and probability 1 – p to each tails, on either coin.

$L_1 = C(N, n_1)\, C(M, m_1)\, p^{n_1 + m_1} (1 - p)^{n_2 + m_2}$

The two choose functions C(N, n1) and C(M, m1) are there to ensure that this function is properly normalized. Intuitively, they arise from the fact that any given number of heads could occur in many possible orders, and all of these orders must be summed over.

This function finds its peak value at the following value of p:

$p = (n_1 + m_1) / (N + M)$
$L_{1,\max} = C(N, n_1)\, C(M, m_1)\, \dfrac{(n_1 + m_1)^{n_1 + m_1} (n_2 + m_2)^{n_2 + m_2}}{(N + M)^{N + M}}$

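By Stirling’s approximation, and dropping an additive constant that is the same for both models (so it cancels when we compare them), this becomes:

$\ln L_{1,\max} \sim -\ln F - \tfrac{1}{2} \ln G$
where $F = \dfrac{(N + M)^{N + M}}{N^N M^M} \cdot \dfrac{n_1^{n_1} m_1^{m_1}}{(n_1 + m_1)^{n_1 + m_1}} \cdot \dfrac{n_2^{n_2} m_2^{m_2}}{(n_2 + m_2)^{n_2 + m_2}}$
and $G = \dfrac{n_1 n_2 m_1 m_2}{N M}$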

With this, our Akaike information criterion for Model 1 tells us:

$\mathrm{AIC}_1 = 2 + 2\ln F + \ln G$

Moving on to Model 2, we now have two different parameters p and q to vary. The likelihood of our data given these two parameters is given by:

$L_2 = C(N, n_1)\, C(M, m_1)\, p^{n_1} (1 - p)^{n_2}\, q^{m_1} (1 - q)^{m_2}$

The values of p and q that make this data most likely are:

$p = n_1 / N$
$q = m_1 / M$
So $L_{2,\max} = C(N, n_1)\, C(M, m_1)\, \dfrac{n_1^{n_1} m_1^{m_1} n_2^{n_2} m_2^{m_2}}{N^N M^M}$

And again, using Stirling’s approximation, we get:

$\ln L_{2,\max} \sim -\tfrac{1}{2} \ln G$
So $\mathrm{AIC}_2 = 4 + \ln G$

We now just need to compare these two AICs to see which model is preferable for a given set of data:

$\mathrm{AIC}_2 = 4 + \ln G$
$\mathrm{AIC}_1 = 2 + 2\ln F + \ln G$

$\mathrm{AIC}_2 - \mathrm{AIC}_1 = 2 - 2\ln F$
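
The comparison can also be run directly on counts. The following Python sketch computes the maximized log-likelihoods exactly, using `math.lgamma` for the choose functions instead of Stirling's approximation. The function names and the example counts are illustrative assumptions. One thing it makes visible: the choose functions appear in both maximized likelihoods, so they cancel in the difference of the two AICs.

```python
import math

def xlogx(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def log_choose(n, k):
    """ln C(n, k), computed via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def max_loglik_model1(n1, n2, m1, m2):
    """Model 1 (p = q), evaluated at p = (n1 + m1) / (N + M)."""
    N, M = n1 + n2, m1 + m2
    heads, tails = n1 + m1, n2 + m2
    return (log_choose(N, n1) + log_choose(M, m1)
            + xlogx(heads) + xlogx(tails) - xlogx(N + M))

def max_loglik_model2(n1, n2, m1, m2):
    """Model 2 (separate p and q), evaluated at p = n1 / N and q = m1 / M."""
    N, M = n1 + n2, m1 + m2
    return (log_choose(N, n1) + log_choose(M, m1)
            + xlogx(n1) + xlogx(n2) - xlogx(N)
            + xlogx(m1) + xlogx(m2) - xlogx(M))

def aic(k, max_loglik):
    """Akaike information criterion for k parameters."""
    return 2 * k - 2 * max_loglik

def aic_difference(n1, n2, m1, m2):
    """AIC of Model 2 minus AIC of Model 1; positive values favor Model 1."""
    return (aic(2, max_loglik_model2(n1, n2, m1, m2))
            - aic(1, max_loglik_model1(n1, n2, m1, m2)))

# Hypothetical data: (heads, tails) for coin 1, then (heads, tails) for coin 2.
print(aic_difference(6, 4, 6, 4))      # both coins behave identically
print(aic_difference(20, 10, 10, 20))  # coin 1 lands heads twice as often
```

A positive difference means Model 1 has the lower AIC and is preferred; a negative difference favors Model 2.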

Let’s look at two cases that are telling. The first case will be that we find that both coin 1 and coin 2 end up landing heads an equal proportion of the time, and for simplicity, both coins are tossed the same number of times. Formally:

Case 1: N = M, n1 = m1, n2 = m2

In this case, F becomes 1, so lnF becomes 0.

$\mathrm{AIC}_2 - \mathrm{AIC}_1 = 2 > 0$
So Model 1 is preferable.

This makes sense! After all, if the two coins are equally likely to land heads, then our two models do equally well at predicting the data. But Model 1 is simpler, involving only a single parameter, so it is preferable. AIC gives us a precise numerical criterion by which to judge how preferable Model 1 is!

Okay, now let’s consider a case where coin 1 lands heads much more often than coin 2.

Case 2: N = M, n1 = 2m1, 2n2 = m2

Now if we go through the algebra, we find:

$F = 4^N (4/27)^{m_1 + n_2} \approx 1.12^N$
So $\ln F \approx N \ln(1.12) \approx 0.11\, N$
So $\mathrm{AIC}_2 - \mathrm{AIC}_1 \approx 2 - 0.22\, N$

This quantity is larger than 0 when N is less than about 9, and smaller than 0 for all larger values of N.

Which means that for Case 2, small enough data sets still allow us to go with Model 1, but as we get more data, the predictive accuracy of Model 2 eventually wins out!
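
The crossover point for Case 2 can be checked with a few lines of Python. This sketch uses the closed form for F quoted above together with one step not spelled out there: the Case 2 conditions (N = M, n1 = 2·m1, 2·n2 = m2) force m1 = n2 = N/3, so m1 + n2 = 2N/3. The names are illustrative.

```python
import math

# ln F per flip in Case 2, from F = 4^N * (4/27)^(2N/3)
LN_F_PER_FLIP = math.log(4) + (2 / 3) * math.log(4 / 27)

def delta_aic_case2(n_flips):
    """AIC of Model 2 minus AIC of Model 1 when each coin is flipped n_flips times."""
    return 2 - 2 * LN_F_PER_FLIP * n_flips

# Number of flips per coin at which the two models tie.
crossover = 1 / LN_F_PER_FLIP

print("ln F per flip:", LN_F_PER_FLIP)
print("crossover N:  ", crossover)
for n in (3, 6, 9, 12):
    print(n, delta_aic_case2(n))
```

The crossover comes out a little below 9 flips per coin, consistent with the rough threshold above.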

It’s worth pointing out here that the Akaike information criterion is an approximation to the technique of maximizing relative entropy, and this approximation assumes large sets of data. Given this, it’s not clear how reliable our estimate of roughly 10 flips is as the point where Model 2 overtakes Model 1.

Let’s do one last thing with our simple models.

As our two coins become more and more similar in how often they land heads, we expect Model 1 to last longer before Model 2 ultimately wins out. Let’s calculate a general relationship between the similarity in the ratios of heads to tails in coin 1 and coin 2 and the amount of time it takes for Model 1 to lose out.

Case 3: N = M, n1 = r·m1, r·n2 = m2

r here is our ratio of p/q – the chance of heads in coin 1 over the chance of heads in coin 2. Skipping ahead through the algebra we find:

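$\ln F = 2N \ln\left( \dfrac{2\, r^{r/(r+1)}}{r + 1} \right)$

(As a check, r = 2 gives ln F = 2N ln(1.058) ≈ 0.11 N, matching Case 2.)

Model 2 becomes preferable to Model 1 when AIC2 becomes smaller than AIC1, so we can find the critical point by setting ∆AIC = 2 – 2 lnF = 0:

$\ln F = 2N \ln\left( \dfrac{2\, r^{r/(r+1)}}{r + 1} \right) = 1$
$N = \dfrac{1}{2 \ln\left( 2\, r^{r/(r+1)} / (r + 1) \right)}$

We can see that as r goes to 1, N goes to ∞. We can see how quickly this happens by doing some asymptotics:

$r = 1 + \varepsilon$
$\ln\left( \dfrac{2\, r^{r/(r+1)}}{r + 1} \right) \approx \dfrac{\varepsilon^2}{8}$
So $N \approx 4 / \varepsilon^2$

N grows like the inverse square of the gap between r and 1: each time the coins become ten times more similar, it takes about a hundred times more flips for Model 2 to win out. This gives us a very rough idea of how similar our coins must be for us to treat them as essentially identical. Evaluating the exact expression for N at a few values of r gives:

r         N
2         ≈ 9
1.1       ≈ 440
1.01      ≈ 40,000
1.001     ≈ 4,000,000
1.0001    ≈ 400,000,000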

Anthropic argument for common priors

(Idea from Robin Hanson and Tyler Cowen’s 2004 paper Are Disagreements Honest?)

One common argument relating to common priors is that two rational agents with all the same information (including no information at all) could have no possible grounds on which to disagree. Priors by definition refer to the state of knowledge before either agent had any evidence relevant to a given proposition. So there is no information that either agent could have that would allow a difference in priors.

A response to this is that some information that we have is inherently private and unique to us. For instance, you and I might have differences in intelligence, in ways of conceptualizing the world, or in the things we innately find intuitively plausible. All of these differences may count as important information in shaping our priors on a given subject, before we ever encounter a single piece of evidence relevant to the subject.

Here’s a really weird argument for why even these differences should not count. If we use anthropic reasoning, and treat our own existence and the details of our brain and body as just another thing to be conditioned on, then even these private intimate details are simply contingent facts about the world that are to be treated as evidence. Before you’ve conditioned on your own existence, you should be agnostic as to which set of brain/body/mind out of all the possible sets of observers “you” will end up being. You must imagine yourself behind Rawls’ veil of ignorance, a disembodied reasoner that is identical to all other such reasoners. So there is no conceivable reason why your prior should differ from anybody else’s – you must treat yourself as literally the same entity as them pre anthropic conditioning.

In less out-there terms, if you encounter somebody with an apparently different prior from you, then you should consider “Hmm, what if I were born as this person, instead of myself?” The answer to which is, of course, you would have had the same priors as them. Which means that your difference in “priors” is actually a difference of posteriors resulting from conditioning on the arbitrary choice of body/brain/experiences you ended up with.

In addition, by Aumann’s agreement theorem, any apparent differences in priors that become common knowledge should quickly go away, once they are realized to be merely differences in posteriors. Essentially, any differences in priors that last between two rational individuals are signs that they are arbitrarily favoring their own existence in considerations of what prior they should use.

Why you should be a Bayesian

In this post, I take Bayesianism to be the following normative epistemological claim: “You should treat your beliefs like probabilities, and reason according to the axioms of probability theory.”

Here are a few reasons why I support this claim:

I. Logic is not enough

Reasoning deductively from premises to conclusion is a wonderfully powerful tool when it can be applied. If you have absolute certainty in some set of premises, and these premises entail a new conclusion, then you can extend your certainty to the new conclusion. Alternatively, you can state clearly the conditions under which you would be granted certain belief, in the form of a conditional argument (if you were to convince me that A and B are true, then I would believe that C is true).

This is great for mathematicians proving theorems about abstract logical entities. But in the real world, deductive inference is simply not enough to account for the types of problems we face. We are constantly reasoning in a condition of uncertainty, where we have multiple competing theories about what’s going on, and we seek evidence – partial evidence, not deductively complete evidence – as to which of these theories we should favor.

If you want to know how to form beliefs about the parts of reality that aren’t clear-cut and certain, then you need to go beyond pure logic.

II. Probability theory is a natural extension of logic

Cox’s theorem shows that any system of plausible reasoning – modifying and updating beliefs in the presence of uncertainty – that is consistent with logic and a few minimal assumptions about normative reasoning is necessarily isomorphic to probability theory.

The contrapositive of this is that any system of reasoning under uncertainty that isn’t ultimately functionally equivalent to Bayesianism is either logically inconsistent or violates other common-sense axioms of reasoning.

In other words, probability theory is the best candidate that we have for extending logic into the domain of the uncertain. It is about what is likely, not certain, to be true, and the way that we should update these assessments when receiving new information. In turn, probability theory contains ordinary logic as a special case when you take the limit of absolute certainty.

III. Non-Bayesian systems of plausible reasoning result in inconsistencies and irrational behavior

Dutch-book arguments prove that any agent that is violating the axioms of probability theory can be exploited by cleverly capitalizing on logical inconsistencies in their beliefs. This combines a pragmatic argument (non-Bayesians are worse off in the long run) with an epistemic argument (non-Bayesians are vulnerable to logical inconsistencies in their preferences).

IV. You should be honest about your uncertainty

The principle of maximizing entropy mandates a unique way to set beliefs given your evidence, such that you make no presumptions about knowledge that you don’t have. In its more general form, maximizing relative entropy, this principle is fully consistent with and equivalent to standard Bayesian conditionalization.

In other words, Bayesianism is about epistemic humility – it tells you to not pretend to know things that you don’t know.

V. Bayesianism provides the foundations for the scientific method

The scientific method, needless to say, is humanity’s crowning epistemic achievement. With it we have invented medicine, probed atoms, and gone to the stars. Its success can be attributed to the structure of its method of investigating truth claims: in short, science is about searching theories for testable consequences, and then running experiments to update our beliefs in these theories.

This is all contained in Bayes’ rule, the fundamental law of probabilistic inference:

Pr(theory | evidence) ∝ Pr(evidence | theory) · Pr(theory)

This rule tells you precisely how you should update your beliefs given your evidence, no more and no less. It contains the wisdom of empiricism that has revolutionized the world we live in.
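
As a minimal illustration of the rule in use, here is a Python sketch that updates credences over a set of competing theories. The theory names, priors, and likelihoods are made-up numbers for illustration only; the one step the rule as written leaves implicit is the final normalization, which makes the posteriors sum to 1.

```python
def bayes_update(priors, likelihoods):
    """Return posteriors proportional to likelihood * prior, normalized to sum to 1."""
    unnormalized = {theory: priors[theory] * likelihoods[theory] for theory in priors}
    total = sum(unnormalized.values())
    return {theory: weight / total for theory, weight in unnormalized.items()}

# Hypothetical example: two theories that start out equally credible.
priors = {"theory_A": 0.5, "theory_B": 0.5}
# Pr(evidence | theory): theory_A predicted the observed result more strongly.
likelihoods = {"theory_A": 0.8, "theory_B": 0.2}

posteriors = bayes_update(priors, likelihoods)
print(posteriors)
```

Running an experiment whose outcome a theory predicted strongly shifts credence toward that theory, in proportion to how much better it predicted the outcome than its rivals did.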

VI. Bayesianism is practically useful

So maybe you’re convinced that Bayesianism is right in principle. There’s a separate question of whether Bayesianism is useful in practice. Maybe treating your beliefs like probabilities is like trying to do psychology starting from Schrödinger’s equation – possible in principle but practically infeasible, not to mention a waste of time.

But Bayesianism is practically useful.

Statistical mechanics, one of the most powerful branches of modern science, is built on a foundation of explicitly Bayesian principles. More generally, good statistical reasoning is incredibly useful across all domains of truth-seeking, and an essential skill for anybody that wants to understand the world.

And Bayesianism is not just useful for epistemic reasons. A fundamental ingredient of decision-making is the ability to produce accurate models of reality. If you want to effectively achieve your goals, whatever they are, you must be able to engage in careful probabilistic reasoning.

And finally, in my personal experience I have found Bayesian epistemology to be infinitely mineable for useful heuristics in thinking about philosophy, physics, altruism, psychology, politics, my personal life, and pretty much everything else. I recommend anybody whose interest has been sparked to check out the following links:

 

Consistency and priors

The method of reasoning illustrated here is somewhat reminiscent of Laplace’s “principle of indifference.” However, we are concerned here with indifference between problems, rather than indifference between events. The distinction is essential, for indifference between events is a matter of intuitive judgment on which our intuition often fails even when there is some obvious geometrical symmetry (as Bertrand’s paradox shows).

E. T. Jaynes
Prior Probabilities

I’ve previously written praise of the principle of maximum entropy as a prior-setting method that is justified on the basis of a very minimal and highly intuitive set of epistemic features.

But there’s an even better technique for prior-setting, one that is justified on incredibly fundamental grounds. This technique can only be used in rare cases, but it is immensely powerful when it applies. It’s the principle of transformation groups.

Here is the single assumption from which the principle arises:

“In problems where we have the same prior information, we should assign the same prior probabilities.” (Jaynes’ wording)

This is simple to the point of seeming almost tautological. So what can we do with it?

We’ll start with one of the simplest applications of transformation groups. Suppose that somebody gives you the following information:

I = “This coin will land either tails or heads.”

Now you want to say what the following probabilities should be:

P(This coin will land tails | I) = p
P(This coin will land heads | I) = q

Intuitively, it seems obvious to us that absent any other information, we should assign equal probabilities to these. But why? Is there a principled reason for assuming that the coin is a fair coin? Or is this just a presumption that is importing into the problem our background knowledge about most coins being fair?

The method of transformation groups gives us a principled reason. It says to rephrase the problem as follows:

I’ = “This coin will land either heads or tails.”

Now, our new problem differs from our initial problem only by replacing every “heads” with “tails” and every “tails” with “heads”. Since our prior-setting procedure found that P(This coin will land tails | I) = p in the first problem, it should now find P(This coin will land heads | I’) = p in this new one. This is required for any consistent prior-setting procedure! If the problem changes by just switching places of labels, then the priors should change in the exact same way. This means that:

P(This coin will land heads | I’) = p
P(This coin will land tails | I’) = q

But clearly, I = I’; the logical operator “OR” is symmetric! Which means that:

P(This coin will land heads | I’) = P(This coin will land heads | I)

And this is only possible if p = q = ½!

This is simple, but beautiful. The principle tells us that the only logical way to set our priors in this case is evenly – anything else would be either logically inconsistent, or assuming extra information that breaks the symmetry between heads and tails. It goes from logical symmetry to probability symmetry!

Finding these symmetries is what the method of transformation groups is all about. More generally, one can represent a choice between N different possibilities as the statement:

I = “Possibility 1 or possibility 2 or … possibility N”

But this is symmetric with:

I’ = “Possibility 2 or possibility 1 or … possibility N”

As well as all other orderings.

By the exact same argument as above, your prior-setting procedure is required by logical consistency to evenly distribute credences across the N possibilities. So for each k from 1 to N, P(Possibility k | I) = 1/N.

The method of transformation groups can also be applied to continuous variables, where finding the right set of priors can be a lot less intuitive. You do so by noting different types of symmetries for different types of parameters.

For instance, a location parameter is one that serves to merely shift a probability distribution over an observable quantity, without reshaping the distribution. We can formally express this by saying that for a location parameter µ, the distribution over x depends only on the difference x – µ:

p(x | µ) = f(x – µ)

For such parameters, it must be the case that the prior distribution over them is similarly symmetric over translational shifts:

For all ∆, p(µ) = p(µ + ∆)
So p(µ) = c, for some constant c

Another common category of parameters is scale parameters. These are parameters that serve to rescale probability distributions without reshaping them. Formally:

p(x | σ) = (1/σ) g(x / σ)

For this symmetry, the requirement for consistency is:

For all s > 0, p(σ) = (1/s) · p(σ / s)
So p(σ) ∝ 1/σ
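
These two consistency requirements are easy to spot-check numerically. The Python sketch below tests candidate priors against each symmetry on a small grid of points; the grid values and function names are arbitrary illustrative choices, and passing on a grid is only a sanity check, not a proof.

```python
import math

def satisfies_shift_symmetry(prior, mus, shifts):
    """Check p(mu) == p(mu + delta) at the given test points."""
    return all(
        math.isclose(prior(mu), prior(mu + delta), rel_tol=1e-9)
        for mu in mus for delta in shifts
    )

def satisfies_scale_symmetry(prior, sigmas, factors):
    """Check p(sigma) == (1/s) * p(sigma / s) at the given test points."""
    return all(
        math.isclose(prior(sigma), prior(sigma / s) / s, rel_tol=1e-9)
        for sigma in sigmas for s in factors
    )

def flat(x):
    """Constant prior, the candidate for a location parameter."""
    return 1.0

def inverse(sigma):
    """1/sigma prior, the candidate for a scale parameter."""
    return 1.0 / sigma

points = [0.5, 1.0, 3.0, 10.0]   # arbitrary test points
changes = [0.1, 2.0, 7.5]        # arbitrary shifts or rescaling factors

print(satisfies_shift_symmetry(flat, points, changes))
print(satisfies_scale_symmetry(inverse, points, changes))
print(satisfies_scale_symmetry(flat, points, changes))
```

The flat prior passes the shift test and the 1/σ prior passes the scale test, while the flat prior fails the scale test: a constant is not the right "uninformative" choice for a scale parameter.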

In summation, by carefully analyzing the symmetries of the background information you have, you can extract out requirements for how to set your prior distribution that are mandated on threat of logical inconsistency!

Dutch book arguments

These are the laws of probability, which we have proved to be necessarily true of any consistent set of degrees of belief. Any definite set of degrees of belief which broke them would be inconsistent in the sense that it violated the laws of preference between options … If anyone’s mental condition violated these laws, his choice would depend on the precise form in which the options were offered him, which would be absurd. He could have a book made against him by a cunning better and would then stand to lose in any event.

We find, therefore, that a precise account of the nature of partial belief reveals that the laws of probability are laws of consistency, an extension to partial beliefs of formal logic, the logic of consistency.

Frank Ramsey
The Foundations of Mathematics and Other Logical Essays, Volume 5

 

In this post, I’m going to describe one of the more famous arguments for Bayesianism.

These arguments are about how different types of epistemological frameworks will handle different series of wagers. Let me just lay out clearly what exactly we mean by a wager, so as to remove any ambiguity.

A wager on proposition A is a betting opportunity. It involves a payoff amount S and a buy-in quantity. In general, the amount that the buy-in costs will be some fraction f of the payoff amount, so we’ll write it as fS. If you bet on A and it turns out true, then you get the payout S but have still paid the buy-in, for a net gain of S – fS. And if you bet on A and it turns out false, then you get no payout and lose the fS you already spent.

A        Net Payout
True     S – fS
False    -fS

From this payout table, you can calculate that an agent will find the wager to be favorable to them exactly in the case that P(A) is greater than f. That is, the agent will want to take the bet whenever the chance of a payout is greater than the proportion of the payout that is required to buy into the bet.
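
The calculation behind that claim is short enough to write out. In this Python sketch (the function name and the example numbers are illustrative), the expected net payout works out to S · (P(A) – f), which is positive exactly when P(A) is greater than f.

```python
from fractions import Fraction

def expected_net_payout(credence, payoff, buy_in_fraction):
    """Expected net payout of a wager, using the two rows of the payout table."""
    cost = buy_in_fraction * payoff
    net_if_true = payoff - cost
    net_if_false = -cost
    return credence * net_if_true + (1 - credence) * net_if_false

# Hypothetical agent with credence 0.52 in A, facing a $100 payoff.
credence = Fraction(52, 100)
print(expected_net_payout(credence, 100, Fraction(51, 100)))  # buy-in below credence
print(expected_net_payout(credence, 100, Fraction(60, 100)))  # buy-in above credence
```

Exact fractions are used here only to avoid floating-point noise in such small numbers.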

Now, imagine that somebody has a credence of 52% in a proposition A and a credence of 52% in the proposition ~A. How will they evaluate the following set of bets?

B1: pays out $100 if A is true, buy-in of $51
B2: pays out $100 if ~A is true, buy-in of $51
B3: pays out a guaranteed $100, buy-in of $102

They will see both B1 and B2 as favorable bets, since the buy-in is a smaller fraction of the payout than the chance of payout. And they will see B3 as an unfavorable bet, since clearly the buy-in is a larger proportion of the payout than the chance of a payout.

But B3 is just the same as the combination of bets B1 and B2!

Why? Well, if you bet on both B1 and B2, then you are guaranteed to win exactly one of the two (since A and ~A cannot both be true, but one of the two must be). Then you will have paid in a net sum of $102, and gotten back only $100.

A similar argument can be made for any levels of credence C(A) and C(~A) that don’t sum up to 100%. And all of the usual axioms of probability theory can be argued for in the same way. Such arguments are called Dutch book arguments.
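
The example above can be replayed in a few lines of Python. This sketch uses the same numbers as bets B1 and B2 (credence 52% in both A and ~A, $100 payoffs, $51 buy-ins); the function names are illustrative.

```python
from fractions import Fraction

PAYOFF = 100
BUY_IN = 51
credence_A = Fraction(52, 100)
credence_not_A = Fraction(52, 100)   # incoherent: the two credences sum to 104%

def sees_as_favorable(credence, payoff, buy_in):
    """An agent takes a bet when its expected payout exceeds the buy-in."""
    return credence * payoff > buy_in

def net_from_both_bets(a_is_true):
    """Net result of buying both B1 (pays if A) and B2 (pays if ~A)."""
    b1 = (PAYOFF if a_is_true else 0) - BUY_IN
    b2 = (0 if a_is_true else PAYOFF) - BUY_IN
    return b1 + b2

print(sees_as_favorable(credence_A, PAYOFF, BUY_IN))      # the agent accepts B1
print(sees_as_favorable(credence_not_A, PAYOFF, BUY_IN))  # the agent accepts B2
print(net_from_both_bets(True), net_from_both_bets(False))
```

The agent accepts each bet on its own, yet the pair loses $2 whether A turns out true or false.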

Dutch book arguments are standardly presented as revealing that if you do not form beliefs according to the laws of probability theory, then you can be juiced for money by clever bookies.

This is true; somebody with beliefs like those described above can be endlessly exploited for profit. But it is much less impressive than the real conclusion of the Dutch book argument.

Recall that our agent above was found to believe a logical contradiction as a result of not having their beliefs align with probability theory (they had to believe that a bet was simultaneously favorable to them and not favorable to them).

Said another way, an agent not following the probability calculus may evaluate the same proposition differently if presented in a different form.

This is what Dutch book arguments really say: if you want your beliefs to be logically consistent, then you are required to reason according to probability theory!

Cox’s theorem

A very original and thoroughgoing development of the theory of probability, which does not depend on the concept of frequency in an ensemble, has been given by Keynes. In his view, the theory of probability is an extended logic, the logic of probable inference. Probability is a relation between a hypothesis and a conclusion, corresponding to the degree of rational belief and limited by the extreme relations of certainty and impossibility. Classical deductive logic, which deals with these limiting relations only, is a special case in this more general development.

R. T. Cox
Probability, Frequency, and Reasonable Expectation
(Yep, that Keynes! He was an influential early Bayesian thinker as well as a famous economist)

Cox’s famous theorem says that if your way of reasoning is not in some sense isomorphic to Bayesianism, then you are violating one of the following ten basic principles. I’ll derive the theorem in this post.

Logic

  1. ~~a = a
  2. ab = ba
  3. (ab)c = a(bc)
  4. aa = a
  5. ~(ab) = ~a or ~b
  6. a(a or b) = a

Reasoning under uncertainty

  7. Degrees of plausibility are represented by real numbers: {b | a}
  8. {bc | a} = F({b | a}, {c | ab})
  9. {~b | a} = G({b | a})
  10. F and G are monotonic.

The first six, I think, need no introduction. I write “a and b” as “ab” for aesthetics.

The next four extend logic beyond the realm of the perfectly certain, and relate to how we should reason in the presence of uncertainty – how we should reason about degrees of plausibility. For this step, we need a new notation: a way to represent not the truth of a proposition a, but its plausibility. We do this with the {} notation: {a | b} = the plausibility of a given that b is known to be true.

I’ll make some brief notes about each of 7 through 10.

Number 7 perhaps sounds strange – plausibilities are states of belief, not numbers. But you can make sense of this in two ways: first, we can consider this a simple model of plausibilities, in which we are merely mirroring the properties of plausibilities in the structure of the system we set up around how to manipulate numbers. And second, if one is to design a robot that reasons about the world, it isn’t crazy to think about programming it to represent beliefs about the plausibilities of propositions as numbers.

#7 also contains an assumption of continuity – that it’s not the case that there are discontinuous jumps in the plausibility of propositions. Said another way, if two different real numbers represent two different degrees of plausibility, then every possible value between them should also represent a degree of plausibility.

Number 8 is about relevance. It says that the plausibility that b and c are both true given some background information a, is only dependent on (1) the plausibility that b is true given a and (2) the plausibility that c is true given a and b.

If you want to know how likely it is that b and c are both true given a, you can break the process down into two steps. First you determine how likely it is that b is true, given your background information. And second, you determine how likely it is that c is true, given both your background information and the truth of b.

In other words, all that you need to know if you want to know {bc | a} are {b | a} and {c | ab}. For example, say that you want the plausibility of Usain Bolt winning the 100 meter dash and also running an extra lap around the track at the end of the dash. The only pieces of information you need for this are (1) the plausibility of Usain Bolt winning the 100 meter dash, and (2) the plausibility of him running an extra lap at the end of the race, given that he won the race. And of course, each of these plausibilities is also conditional on all of the background information you have about the situation.

We give the name F to the function that details precisely how we determine {bc | a}.

Number 9 says that all we need to determine the plausibility of a proposition being false is the plausibility of the proposition being true. The function that details how we determine one from the other is named G.

And finally, Number 10 says that the plausibility relationships described in 8 and 9 are in some sense simple. For instance, if b becomes less plausible and ~b becomes more plausible, then if b becomes even less plausible, ~b should not suddenly become less plausible. If a change in the plausibility of a proposition a makes the plausibility of another proposition b change in a certain direction, then a greater change in the plausibility of a in the same direction should not result in a reversal of the direction of change of the plausibility of b.

There is an additional implicit assumption that is almost too obvious to be stated: If two propositions are just different phrasings of the same state of knowledge, then you should have the same plausibility in both of them. This is the basic requirement of consistency. It says, for instance, that “a and b” is exactly as plausible as “b and a”.

***

Now we can derive Bayesianism!

First step: conditional probabilities. We use the associativity of “and”:

{bcd | a} = F( {b | a}, {cd | ab} ) = F( {b | a}, F( {c | ab}, {d | abc} ) )
{bcd | a} = F( {bc | a}, {d | abc} ) = F( F( {b | a}, {c | ab} ), {d | abc} )

So F(x, F(y, z)) = F(F(x, y), z)

Any monotonic function that satisfies this functional equation is isomorphic to ordinary multiplication on the interval [0, 1].

In other words, there exists a function W such that:

W(F(x, y)) = W(x) · W(y)

From which we can see:

W({bc | a}) = W({b | a}) · W({c | ab})

Which is exactly the form of conjunction in ordinary probability theory!

This means that any form of reasoning about the plausibility of conjunctions of propositions is either violating one of the 10 axioms, or is equivalent to probability theory up to isomorphism.

So if somebody walks up to you and presents to you an algorithm for computing the plausibility of the conjunction of two propositions, and you know that they are following the above ten rules, then you are guaranteed to be able to find some function that takes in their algorithm and translates it into ordinary Bayesianism.

This translation function is W! Notice that although W looks a lot like ordinary probability, it is a different thing entirely. For instance, if somebody is already reasoning with ordinary probability theory, then {b | a} = P(b | a). To convert {b | a} into P(b | a), then, W need not do anything! So W({b | a}) = {b | a} = P(b | a), or W(x) = x, which is clearly different from the probability function.
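Here is a small sketch of a less trivial W. Suppose, purely as an illustration, that somebody stores plausibilities as log-probabilities and combines them by addition. Their F is associative, and the translation function W(x) = e^x turns it into ordinary multiplication. The choice of this particular agent is an assumption made for the example, not something taken from Cox's paper.

```python
import math


def F(x: float, y: float) -> float:
    """Hypothetical conjunction rule for an agent who stores log-probabilities."""
    return x + y


def W(x: float) -> float:
    """Translation function: maps a log-plausibility to a number in (0, 1]."""
    return math.exp(x)


# Three plausibilities in the agent's own (logarithmic) units.
x, y, z = math.log(0.5), math.log(0.4), math.log(0.9)

# F satisfies the associativity equation F(x, F(y, z)) = F(F(x, y), z) ...
assert math.isclose(F(x, F(y, z)), F(F(x, y), z))

# ... and W turns it into the ordinary product rule.
assert math.isclose(W(F(x, y)), W(x) * W(y))
```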

Next, we reveal the nature of the function G.

The first step is easy:

{~~b | a} = G(G({b | a}))
So G(G(x)) = x

This means that G is an involution – a function that is its own inverse.

Next, we use the commutativity of “and”:

W({bc | a}) = W({b | a}) · W({c | ab})
= W({b | a}) · G( W({~c | ab}) )
= W({b | a}) · G( W({b~c | a}) / W({b | a}) )

So W({cb | a}) = W({c | a}) · G( W({~bc | a}) / W({c | a}) )

Now if we let c = ~(bd) for some new statement d, and rename W({b | a}) = x and W({c | a}) = y, we find:

x G( G(y) / x ) = y G( G(x) / y )

The only functions that satisfy this functional equation as well as the earlier involution equation are the following:

G(x) = (1 – x^m)^(1/m), for any positive m

With some algebraic manipulation, we see that this is equivalent to:

W({a | b})^m + W({~a | b})^m = 1

This equation almost perfectly represents the normalization rule: that the total probability must sum to 1! It seems mildly inconvenient that this holds true for the function W^m rather than W. If we knew that m had to equal 1, then we’d have a perfect translation function both for conditional probabilities and for normalization… But if we look back at our conditional probability finding, we notice that it can be equivalently represented in terms of W^m instead of W!

W({bc | a}) = W({b | a}) · W({c | ab})
W({bc | a})^m = W({b | a})^m · W({c | ab})^m

Now we just define one last function Q = W^m, and we’re done!

Q takes in any system of reasoning that obeys the starting ten rules, and reveals it to be equivalent to ordinary probability theory.
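The claimed solution for G can be sanity-checked numerically. The sketch below assumes the family G(x) = (1 – x^m)^(1/m) with m > 0, and picks x and y so that every argument passed to G stays inside [0, 1]. It checks the involution property, the functional equation, and the fact that the m-th powers of W values for b and ~b sum to 1. It is a spot check at a few values, not a proof.

```python
import math


def G(x: float, m: float) -> float:
    """Candidate negation function G(x) = (1 - x^m)^(1/m)."""
    return (1 - x**m) ** (1 / m)


def check(x: float, y: float, m: float) -> bool:
    # G is its own inverse.
    involution = math.isclose(G(G(x, m), m), x)
    # The functional equation x G(G(y)/x) = y G(G(x)/y).
    lhs = x * G(G(y, m) / x, m)
    rhs = y * G(G(x, m) / y, m)
    # Normalization holds for the m-th powers.
    normalized = math.isclose(x**m + G(x, m) ** m, 1.0)
    return involution and math.isclose(lhs, rhs) and normalized


for m in (0.5, 1, 2, 3):
    print(m, check(0.8, 0.9, m))
```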

The rest of probability theory can be shown to result from these basic axioms (we’ll drop the brackets now, and will call Q({a | b}) by its proper name: P(a | b)).

P(certainty) = 1
P(impossibility) = 0
P(bc | a) + P(b~c | a) = P(b | a)
P(a or b | c) = P(a | c) + P(b | c) – P(ab | c)
And so on.
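As a quick illustration of these rules (not a derivation of them), here is a sketch that checks them on a made-up joint distribution over the truth values of b and c. The numbers are hypothetical, and the background information is left implicit.

```python
from fractions import Fraction

# Hypothetical joint distribution over the truth values of (b, c).
joint = {
    (True, True): Fraction(3, 10),
    (True, False): Fraction(1, 5),
    (False, True): Fraction(1, 10),
    (False, False): Fraction(2, 5),
}


def P(event) -> Fraction:
    """Probability of the set of worlds (b, c) in which `event` holds."""
    return sum(p for world, p in joint.items() if event(*world))


p_b = P(lambda b, c: b)
p_c = P(lambda b, c: c)
p_bc = P(lambda b, c: b and c)
p_b_not_c = P(lambda b, c: b and not c)
p_b_or_c = P(lambda b, c: b or c)

assert P(lambda b, c: True) == 1        # certainty
assert P(lambda b, c: False) == 0       # impossibility
assert p_bc + p_b_not_c == p_b          # P(bc) + P(b~c) = P(b)
assert p_b_or_c == p_b + p_c - p_bc     # P(b or c) = P(b) + P(c) - P(bc)
```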

This derivation of probability theory is great because it isn’t limited by the ordinary frequency interpretations of probability theory, in which a probability is defined to be a limit of a ratio of an infinite number of experimental results. This is not only ugly theoretically, but leaves us puzzled as to how to talk about the probabilities of one-shot events or of hypotheses that can’t be directly translated into empirical results.

The definition of probability invoked here is infinitely deeper. Instead of frequencies, probabilities are defined according to fundamental principles of normative epistemology. Any agent that reasons consistently according to a few basic maxims will be reasoning in a way that is functionally identical to probability theory.

And probabilities here can be assigned to any propositions – not just those that refer to empirically measurable events that are repeatable ad infinitum. They represent a normative rational degree of certainty that one must possess if one is to reason consistently!

Maximum Entropy and Bayes

The original method of Maximum Entropy, MaxEnt, was designed to assign probabilities on the basis of information in the form of constraints. It gradually evolved into a more general method, the method of Maximum relative Entropy (abbreviated ME), which allows one to update probabilities from arbitrary priors unlike the original MaxEnt which is restricted to updates from a uniform background measure.

The realization that ME includes not just MaxEnt but also Bayes’ rule as special cases is highly significant. First, it implies that ME is capable of reproducing every aspect of orthodox Bayesian inference and proves the complete compatibility of Bayesian and entropy methods. Second, it opens the door to tackling problems that could not be addressed by either the MaxEnt or orthodox Bayesian methods individually.

Giffin and Caticha
https://arxiv.org/pdf/0708.1593.pdf

I want to heap a little more praise on the principle of maximum entropy. Previously I’ve praised it as a solution to the problem of the priors – a way to decide what to believe when you are totally ignorant.

But it’s much much more than that. The procedure by which you maximize entropy is not only applicable in total ignorance, but is also applicable in the presence of partial information! So not only can we calculate the maximum entropy distribution given total ignorance, but we can also calculate the maximum entropy distribution given some set of evidence constraints.

That is, the principle of maximum entropy is not just a solution to the problem of the priors – it’s an entire epistemic framework in itself! It tells you what you should believe at any given moment, given any evidence that you have. And it’s better than Bayesianism in the sense that the question of priors never comes up – we maximize entropy when we don’t have any evidence just like we do when we do have evidence! There is no need for a special case study of the zero-evidence limit.

But a natural question arises – if the principle of maximum entropy and Bayes’ rule are both self-contained procedures for updating your beliefs in the face of evidence, are these two procedures consistent?

Anddd the answer is, yes! They’re perfectly consistent. Bayes’ rule leads you from one set of beliefs to the set of beliefs that are maximally uncertain under the new information you receive.

This post will be proving that Bayes’ rule arises naturally from maximizing entropy after you receive evidence.

But first, let me point out that we’re making a slight shift in our definition of entropy, as suggested in the quote I started this post with. Rather than maximizing the entropy S(P) = – ∫ P log(P) dx, we will maximize the relative entropy:

S_rel(P, P_old) = – ∫ P log(P / P_old) dx.

The relative entropy is much more general than the ordinary entropy – it serves as a way to compare entropies of distributions, and gives a simple way to talk about the change in uncertainty from a previous distribution to a new one. Intuitively, its magnitude is the additional information that is required to specify P, once you’ve already specified P_old. You can think of it in terms of surprisal: –S_rel(P, Q) is how much more surprised you will be, on average, if P is true and you believe Q than if P is true and you believe P.

You might be concerned that this function no longer has the nice properties of entropy that we discussed earlier – the only possible function for consistently representing uncertainty. But these worries aren’t warranted. If some set of initial constraints give P_old as the maximum-entropy distribution, then the function that maximizes relative entropy with just the new constraints will be the same as the function that maximizes entropy with the new constraints and the value of your prior distribution as an additional constraint.
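For a feel of how relative entropy behaves, here is a minimal discrete sketch in which the integral is replaced by a sum over a finite set of outcomes. The distributions are made up for illustration. The relative entropy is zero when the new distribution equals the old one, and negative otherwise.

```python
import math


def relative_entropy(p, p_old):
    """Discrete relative entropy: -sum of P * log(P / P_old)."""
    # Terms with P = 0 contribute nothing, by the usual 0 * log(0) = 0 convention.
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, p_old) if pi > 0)


p_old = [0.5, 0.3, 0.2]

unchanged = relative_entropy(p_old, p_old)        # zero: nothing was learned
moved = relative_entropy([0.7, 0.2, 0.1], p_old)  # negative: P moved away from P_old

print(abs(unchanged) < 1e-12, moved < 0)
```

Maximizing this quantity therefore means staying as close to the old distribution as the new constraints allow.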

Okay, so from now on whenever I talk about entropy, I’m talking about relative entropy. I’ll just denote it by S as usual, instead of writing out S_rel every time. We’ll now prove that the prescribed change in your beliefs upon receiving the results of an experiment is the same under Bayesian conditionalization as it is under maximum entropy.

Say that our probability distribution is over the possible values of some parameter A and the possible results of an experiment that will tell us the value of X. Thus our initial model of reality can be written as:

P_init(A = a, X = x), and
P_init(A = a) = ∫ dx P_init(A = a | X = x) P_init(X = x)

Which we’ll rewrite for ease of notation as:

P_init(a, x), and
P_init(a) = ∫ P_init(a | x) P_init(x) dx

Ordinary Bayesian conditionalization says that when we receive the information that the experiment returned the result X = x’, we update our probabilities as follows:

P_new(a) = P_init(a | x’)

What does the principle of maximum entropy say to do? It prescribes the following algorithm:

Maximize the value of S = – ∫ da dx P(a, x) log( P(a, x) / P_init(a, x) )
with the following constraints:
Constraint 1: ∫ da dx P(a, x) = 1
Constraint 2: P(x) = δ(x – x’)

Constraint 2 represents the experimental information that our new probability distribution over X is zero everywhere except for at X = x’, and that we are certain that the value of X is x’. Notice that it is actually an infinite number of constraints – one for each value of X.

We will rewrite Constraint 2 so that it is of the same form as the entropy function and the first constraint:

Constraint 2: ∫ da P(a, x) = δ(x – x’)

The method of Lagrange multipliers tells us how to solve this equation!

First, define a new quantity A as follows:

A = S + Constraint 1 + Constraint 2
= – ∫ da dx P log(P/P_init) + α · [ ∫ da dx P – 1 ] + ∫ dx β(x) · [ ∫ da P – δ(x – x’) ]

Now we solve!

∆A = 0
∂/∂P ∫ da dx [ – P log(P/P_init) + α P + β(x) P ] = 0
∂/∂P [ – P log P + P log P_init + α P + β(x) P ] = 0
–log P_new – 1 + log P_init + α + β(x) = 0
P_new(a, x) = P_init(a, x) · e^β(x) / Z

Z is our normalization constant, and we can find β(x) by applying Constraint 2:

Constraint 2: ∫ da P(a, x) = δ(x – x’)
∫ da P_init(a, x) · e^β(x) / Z = δ(x – x’)
P_init(x) · e^β(x) / Z = δ(x – x’)

And finally, we can plug in:

P_new(a, x) = P_init(a, x) · e^β(x) / Z
= P_init(a | x) · P_init(x) · e^β(x) / Z
= P_init(a | x) · δ(x – x’)
So P_new(a) = P_init(a | x’)

Exactly the same as Bayesian conditionalization!!
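The result can also be spot-checked numerically in a discrete setting. In the sketch below, the joint prior is made up, the delta function becomes "all probability mass on the observed outcome", and the Bayesian posterior is compared against 10,000 random distributions that satisfy the same constraint. This is a sanity check under those assumptions, not a replacement for the derivation.

```python
import math
import random

# Hypothetical joint prior P_init(a, x): three values of A, two outcomes of X.
p_init = {
    ("a0", "x0"): 0.10, ("a0", "x1"): 0.20,
    ("a1", "x0"): 0.30, ("a1", "x1"): 0.05,
    ("a2", "x0"): 0.15, ("a2", "x1"): 0.20,
}
A_VALUES = ["a0", "a1", "a2"]
observed = "x1"


def bayes_posterior(x_obs):
    """P_init(a | x_obs), computed by ordinary conditionalization."""
    p_x = sum(p_init[(a, x_obs)] for a in A_VALUES)
    return [p_init[(a, x_obs)] / p_x for a in A_VALUES]


def relative_entropy(q, x_obs):
    """S for a candidate with all its mass on X = x_obs and P(a, x_obs) = q[a]."""
    return -sum(qa * math.log(qa / p_init[(a, x_obs)])
                for a, qa in zip(A_VALUES, q) if qa > 0)


def random_distribution(rng, n):
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


posterior = bayes_posterior(observed)
best = relative_entropy(posterior, observed)

rng = random.Random(0)
challengers = [random_distribution(rng, len(A_VALUES)) for _ in range(10_000)]

# No challenger satisfying the constraint beats the Bayesian posterior.
assert all(relative_entropy(q, observed) <= best + 1e-12 for q in challengers)
```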

What’s so great about this is that the principle of maximum entropy is an entire theory of normative epistemology in its own right, and it’s equivalent to Bayesianism, AND it has no problem of the priors!

If you’re a Bayesian, then you know what to do when you encounter new evidence, as long as you already have a prior in hand. But when somebody asks you how you should choose the prior that you have… well then you’re stumped, or have to appeal to some other prior-setting principle outside of Bayes’ rule.

But if you ask a maximum-entropy theorist how they got their priors, they just answer: “The same way I got all of my other beliefs! I just maximize my uncertainty, subject to the information that I possess as constraints. I don’t need any special consideration for the situation in which I possess no information – I just maximize entropy with no constraints at all!”

I think this is wonderful. It’s also really aesthetic. The principle of maximum entropy says that you should be honest about your uncertainty. You should choose your beliefs in such a way as to ensure that you’re not pretending to know anything that you don’t know. And there is a single unique way to do this – by maximizing the function – ∫ P log P.

Any other distribution you might choose represents a decision to pretend that you know things that you don’t know – and maximum entropy says that you should never do this. It’s an epistemological framework built on the virtue of humility!

Advanced two-envelopes paradox

Yesterday I described the two-envelopes paradox and laid out its solution. Yay! Problem solved.

Except that it’s not. Because I said that the root of the problem was an improper prior, and when we instead use a proper prior, any proper prior, we get the right result. But we can propose a variant of the two envelopes problem that gives a proper prior, and still mandates infinite switching.

Here it is:

In front of you are two envelopes, each containing some unknown amount of money. You know that one of the envelopes has twice the amount of money of the other, but you’re not sure which one that is and can only take one of the two.

In addition, you know that the envelopes were stocked by a mad genius according to the following procedure: He randomly selects an integer n ≥ 0 with probability ⅓ (⅔)^n, then stocks the smaller envelope with $2^n and the larger with double this amount.

You have picked up one of the envelopes and are now considering if you should switch your choice.

Let’s verify quickly that the mad genius’s procedure for selecting the amount of money makes sense:

Total probability = ∑_n ⅓ (⅔)^n = ⅓ · 3 = 1

Okay, good. Now we can calculate the expected value.

You know that the envelope that you’re holding contains one of the following amounts of money: ($1, $2, $4, $8, …).

First let’s consider the case in which it contains $1. If so, then you know that your envelope must be the smaller of the two, since there is no $0.50 envelope. So if your envelope contains $1, then you are sure to gain $1 by switching.

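Now let’s consider every other case. If the amount you’re holding is $2^n with n ≥ 1, then either the genius drew n and you are holding the smaller envelope, which has prior weight ⅓ (⅔)^n, or he drew n – 1 and you are holding the larger one, which has prior weight ⅓ (⅔)^(n–1). (These weights are not normalized, and each carries the same factor of ½ for which envelope you happened to pick up, but neither fact affects the sign of the result.) You are $2^n better off if you have the smaller envelope and switch, and $2^(n–1) worse off if you initially had the larger envelope and switch.

So your change in expected value by switching instead of staying is proportional to:

∆EU = $ ⅓ (4/3)^n – $ ⅓ (4/3)^(n–1)
= $ ⅓ (4/3)^(n–1) (4/3 – 1)
= $ ⅑ (4/3)^(n–1) > 0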

So if you are holding $1, you are better off switching. And if you are holding more than $1, you are better off switching. In other words, switching is always better than staying, regardless of how much money you are holding.
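The switching argument can be checked with exact arithmetic. The sketch below normalizes the two cases into proper conditional probabilities, assuming that holding $2^n as the larger envelope corresponds to the genius having drawn n – 1. The function names are illustrative.

```python
from fractions import Fraction


def prior(n: int) -> Fraction:
    """Probability that the genius draws n, so the envelopes hold 2^n and 2^(n+1)."""
    return Fraction(1, 3) * Fraction(2, 3) ** n


def expected_gain_from_switching(n: int) -> Fraction:
    """Expected gain in dollars from switching, given that you hold 2^n."""
    if n == 0:
        return Fraction(1)  # $1 can only be the smaller envelope
    w_smaller = prior(n)     # genius drew n: you hold the smaller envelope
    w_larger = prior(n - 1)  # genius drew n - 1: you hold the larger envelope
    p_smaller = w_smaller / (w_smaller + w_larger)
    return p_smaller * 2**n - (1 - p_smaller) * 2 ** (n - 1)


for n in range(6):
    print(n, expected_gain_from_switching(n))
```

For every n ≥ 1 the chance of holding the smaller envelope comes out to 2/5, and the expected gain from switching is 2^(n–1) / 5 dollars, which is positive and grows with n.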

And yet this exact same argument applies once you’ve switched envelopes, so you are led to an infinite process of switching envelopes back and forth. Your decision theory tells you that as you’re doing this, your expected value is exponentially growing, so it’s worth it to you to keep on switching ad infinitum – it’s not often that you get a chance to generate exponentially large amounts of money!

The problem this time can’t be the prior – we are explicitly given the prior in the problem, and verified that it was normalized just in case.

So what’s going wrong?

***

 

 

(once again, recommend that you sit down and try to figure this out for yourself before reading on)

 

 

***

Recall that in my post yesterday, I claimed to have proven that no matter what your prior distribution over money amounts in your envelope, you will always have a net zero expected value. But apparently here we have a statement that contradicts that.

The reason is that my proof yesterday was only for continuous prior distributions over all real numbers, and didn’t apply to discrete distributions like the one in this variant. And apparently for discrete distributions, it is no longer the case that your expected value is zero.

The best solution to this problem that I’ve come across is the following: This problem involves comparing infinite utilities, and decision theory can’t handle infinities.

There’s a long and fascinating precedent for this claim, starting with problems like the Saint Petersburg paradox, where an infinite expected value leads you to bet arbitrarily large amounts of money on arbitrarily unlikely scenarios, and including weird issues in Boltzmann brain scenarios. Discussions of Pascal’s wager also end up confronting this difficulty – comparing different levels of infinite expected utility leads you into big trouble.

And in this variant of the problem, both your expected utility for switching and your expected utility for staying are infinite. Both involve a calculation of a sum of (⅔)^n (the probability) times 2^n, which diverges.

This is fairly unsatisfying to me, but perhaps it’s the same dissatisfaction that I feel when confronting problems like Pascal’s wager – a mistaken feeling that decision theory should be able to handle these problems, ultimately rooted in a failure to internalize the hidden infinities in the problem.
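The hidden infinity is easy to exhibit with partial sums. In this sketch (function names are illustrative), the total probability converges to 1 as more terms are included, while the expected amount in the smaller envelope keeps growing without bound.

```python
from fractions import Fraction


def partial_probability(n_terms: int) -> Fraction:
    """Sum of the prior over n < n_terms; equals 1 - (2/3)^n_terms."""
    return sum(Fraction(1, 3) * Fraction(2, 3) ** n for n in range(n_terms))


def partial_expected_smaller(n_terms: int) -> Fraction:
    """Sum over n < n_terms of prior(n) * 2^n; equals (4/3)^n_terms - 1."""
    return sum(Fraction(1, 3) * Fraction(2, 3) ** n * 2**n for n in range(n_terms))


for n_terms in (10, 20, 40, 80):
    print(n_terms,
          float(partial_probability(n_terms)),
          float(partial_expected_smaller(n_terms)))
```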