Nature’s Urn and Ignoring the Super Persuader

This post is about one of my favorite little-known thought experiments.

Here’s the setup:

You are in conversation with a Super Persuader – an artificial intelligence that has access to an enormously larger pool of information than you, and that is excellent at synthesizing information to form powerful arguments. The Super Persuader is so super at persuading, in fact, that given any proposition, it will be able to construct the most powerful argument possible for that proposition, consisting of the strongest evidence it has access to.

The Super Persuader is going to try and persuade you either that a certain proposition A is true or that it is false. In doing so, you know that it cannot lie, but it can pick and choose the information that it presents to you, giving an incomplete picture.

Finally, you know that the Super Persuader is going to decide which side of the issue to argue based off of a random coin toss: 50% chance they will argue that A is true, and 50% chance they will argue that A is false.

Once the coin is tossed and the Persuader begins to present the evidence, how should you rationally respond? Should you be swayed by the arguments, ignore them, or something else?

Here’s a basic presentation of one response to this thought experiment:

Of course you should be swayed by their arguments! If not, then you end up receiving boatloads of crazily persuasive argumentation and pretending like you’ve heard none of it. This is the very definition of irrationality – closing your eyes to the evidence you have sitting right in front of you! There’s no reason to disregard all of the useful information that you’re getting, just because it’s coming from a source that is trying to persuade you. Regardless of the motives of the Super Persuader, it can only persuade you by giving you honest and genuinely convincing evidence. And a rational agent has no choice but to update their credences on this evidence.

I think that this is a bad argument. Here’s an analogy to help explain why.

Imagine the set of all possible pieces of evidence you could receive for a given proposition as a massive urn filled with marbles. Each marble is a single argument that could be made for the proposition. If the argument is in support of the proposition, then the marble representing it will be black. And if the argument is against the proposition, then the marble representing it will be white.

Now, the question as to whether the proposition is more likely to be true or false is roughly the same as the question of whether there are more black or white marbles in the urn. That is the exact same question if all of the arguments in question are equally strong, and we have no reason for starting out favoring one side over the other.

But now we can think about the actions of the Super Persuader as follows: the Super Persuader has direct access to the urn, and can select any marble it wants. If it wants to persuade you that the proposition is true, then it will just fish through the urn and present you with as many black marbles as it desires, ignoring all the white marbles.

Clearly this process gives you no information as to the true proportion of the marbles that are white versus the proportion that are black. The data you are receiving is contaminated by a ridiculously powerful selection bias. The evidence you see is no longer linked in any way to the truth of the proposition, because regardless of whether or not it is true, you still expect to receive large amounts of evidence for it.

In the end, all of the pieces of evidence you receive are useless, in the same way that a stacked deck is not a reliable source of information about the average card deck.

This has some really weird consequences. For one thing, after your conversation you still have all of that information hanging around in your head (as long as you have a good enough memory). So if anybody asks you what you think about the issue, you will be able to spout off incredibly powerful arguments for exactly one side of the issue. But you’ll also have to concede that you don’t actually strongly believe the conclusion of these arguments. And if you’re asked to present any evidence for not accepting the conclusion, you’ll likely draw a blank, or only be able to produce very unsatisfactory answers. You will certainly not come off as a very rational person! Continue reading “Nature’s Urn and Ignoring the Super Persuader”

Entropy vs relative entropy

This post is about the relationship between entropy and relative entropy. This relationship is subtle but important – purely maximizing entropy (MaxEnt) is not equivalent to Bayesian conditionalization except in special cases, while maximizing relative entropy (ME) is. In addition, the justifications for MaxEnt are beautiful and grounded in fundamental principles of normative epistemology. Do these justifications carry over to maximizing relative entropy?

To a degree, yes. We’ll see that maximizing relative entropy is a more general procedure than maximizing entropy, and reduces to it in special cases. The cases where MaxEnt gives different results from ME can be interpreted through the lens of MaxEnt, and relate to an interesting distinction between commuting and non-commuting observations.

So let’s get started!

We’ll solve three problems: first, using MaxEnt to find an optimal distribution with a single constraint C1; second, using MaxEnt to find an optimal distribution with constraints C1 and C2; and third, using ME to find the optimal distribution with C2 given the starting distribution found in the first problem.

Part 1

Problem: Maximize – ∫ P logP dx with constraints
∫ P dx = 1
∫ C1[P] dx = 0

P ( – P1 logP1 + (α + 1) P1 + βC1[P1] ) = 0
– logP1 + α + β C1’[P1] = 0

Part 2

Problem: Maximize – ∫ P logP dx with constraints
∫ P dx = 1
∫ C1[P] dx = 0
∫ C2[P] dx = 0

P ( – P2 logP2 + (α’ + 1) P2 + β’C1[P2] + λ C2[P2] ) = 0
– logP2 + α’ + β’ C1’[P2] + λ C2’[P2] = 0

Part 3

Problem: Maximize – ∫ P log(P / P1) dx with constraints
∫ P dx = 1
∫ C2[P] dx = 0

P ( – P3 logP3 + P3 logP1 + (α’’ + 1)P3 + λ’ C2[P3] ) = 0
– logP3 + α’’ + logP1 + λ’ C2’[P3] = 0
– logP3 + α’’ + α + β C1’[P1] + λ’ C2’[P3] = 0
– logP3 + α’’’ + β C1’[P1] + λ’ C2’[P3] = 0

We can now compare our answers for Part 2 to Part 3. These are the same problem, solved with MaxEnt and ME. While they are clearly different solutions, they have interesting similarities.

MaxEnt
– logP2 + α’ + β’ C1’[P2] + λ C2’[P2] = 0
∫ P2 dx = 1
∫ C1[P2] dx = 0
∫ C2[P2] dx = 0

ME
– logP3 + α’’’ + β C1’[P1] + λ’ C2’[P3] = 0
∫ P3 dx = 1
∫ C1[P1] dx = 0
∫ C2[P3] dx = 0

The equations are almost identical. The only difference is in how they treat the old constraint. In MaxEnt, the old constraint is treated just like the new constraint – a condition that must be satisfied for the final distribution.

But in ME, the old constraint is no longer required to be satisfied by the final distribution! Instead, the requirement is that the old constraint be satisfied by your initial distribution!

That is, MaxEnt takes all previous information, and treats it as current information that must constrain your current probability distribution.

On the other hand, ME treats your previous information as constraints only on your starting distribution, and only ensures that your new distribution satisfy the new constraint!

When might this be useful?

Well, say that the first piece of information you received, C1, was the expected value of some measurable quantity. Maybe it was that x̄ = 5.

But if the second piece of information C2 was an observation of the exact value of x, then we clearly no longer want our new distribution to still have an expected value of x̄. After all, it is common for the expected value of a variable to differ from the exact value of x.

E(x) vs x

Once we have found the exact value of x, all previous information relating to the value of x is screened off, and should no longer be taken as constraints on our distribution! And this is exactly what ME does, and MaxEnt fails to do.

What about a case where the old information stays relevant? For instance, an observation of the values of a certain variable is not ‘cancelled out’ by a later observation of another variable. Observations can’t be un-observed. Does ME respect these types of constraints?

Yes!

Observations of variables are represented by constraints that set the distribution over those variables to delta-functions. And when your old distribution contains a delta function, that delta function will still stick around in your new distribution, ensuring that the old constraint is still satisfied.

Pold ~ δ(x – x’)
implies
Pnew ~ δ(x – x’)

The class of observations that are made obsolete by new observations are called non-commuting observations. They are given this name because for such observations, the order in which you process the information is essential. Observations for which the order of processing doesn’t matter are called commuting observations.

In summation: maximizing relative entropy allows us to take into account subtle differences in the type of evidence we receive, such as whether or not old data is made obsolete by new data. And mathematically, maximizing relative entropy is equivalent to maximizing ordinary entropy with whatever new constraints were not included in your initial distribution, as well as an additional constraint relating to the value of your old distribution. While the old constraints are not guaranteed to be satisfied by your new distribution, the information about them is preserved in the form of the prior distribution that is a factor in the new distribution.

Fun with Akaike

The Akaike information criterion is a metric for model quality that naturally arises from the principle of maximum entropy. It balances predictive accuracy against model complexity, encoding a formal version of Occam’s razor and solving problems of overfitting. I’m just now learning about how to use this metric, and will present a simple example that shows off a lot of the features of this framework.

Suppose that we have two coins. For each coin, we can ask what the probability is of each coin landing heads. Call these probabilities p and q.

Akaike.png

Two classes of models are (1) those that say that p = q and (2) those that say that p ≠ q. The first class of models are simpler in an important sense – they can be characterized by a single parameter p. The second class, on the other hand, require two separate parameters, one for each coin.

The number of parameters (k) used by our model is one way to measure model complexity. But of course, we can also test our models by getting experimental data. That is, we can flip each coin a bunch of times, record the results, and see how they favor one model over another.

One common way of quantifying the empirical success of a given model is by looking at the maximum value of its likelihood function L. This is the function that tells you how likely your data was, given a particular model. If Model 2 can at best do better at predicting the data than Model 1, then this should count in favor of Model 2.

So how do we combine these pieces of information – k (the number of parameters in a model) and L (the predictive success of the model)? Akaike’s criterion says to look at the following metric:

AIK = 2k – 2 lnL

The smaller the value of this parameter, the better your model is.

So let’s apply this on an arbitrary data set:

n1 = number of times coin 1 landed heads
n2 = number of times coin 1 landed tails
m1 =number of times coin 2 landed heads
m2 = number of times coin 2 landed tails

For convenience, we’ll also call the total flips of coin 1 N, and the total flips of coin 2 M.

First, let’s look at how Model 1 (the one that says that the two coins have an equal chance of landing heads) does on this data. This model predicts with probability p each heads, on either coin, and with probability 1 – p each tails on either coin.

L1 = C(N,n1) C(M,m1) pn1+m1 (1 – p)n2+m2

The two choose functions C(N, n1) and C(M, m1) are there to ensure that this function is nicely normalized. Intuitively, they arise from the fact that any given number of coin tosses that land heads could happen in many possible, ways, and all of these ways must be summed up.

This function finds its peak value at the following value of p:

p = (n1 + m1) / (N + M)
L1,max = C(N, n1) C(M, m1) (n1 + m1)n1+m1 (n2 + m2)n2+m2 / (N + M)N+M

By Stirling’s approximation, this becomes:

ln(L1,max) ~ -ln(F) – ½ ln(G)
where F = (N + M)N+M/NNMM · n1n1m1m1/(n1 + m1)n1+m1 · n2n2m2m2/(n2 + m2)n2+m2
and G = n1n2m1m/ NM

With this, our Akaike information criterion for Model 1 tells us:

AIC= 2 + 2ln(F) + ln(G)

Moving on to Model 2, we now have two different parameters p and q to vary. The likelihood of our data given these two parameters are given by:

L2 = C(N, n1) C(M, m1) pn1 (1 – p)n2 qm1 (1 – q)m2

The values of p and q that make this data most likely are:

p = n1 / N
q = m1 / M
So L2,max = C(N, n1) C(M, m1) n1n1m1m1n2n2m2m2 / NNMM

And again, using Stirling’s approximation, we get:

ln(L1,max) ~  – ½ ln(G)
So AIC= 4 + ln(G)

We now just need to compare these two AICs to see which model is preferable for a given set of data:

AIC= 4 + ln(G)
AIC= 2 + 2ln(F) + ln(G)

AIC– AIC= 2 – 2lnF

Let’s look at two cases that are telling. The first case will be that we find that both coin 1 and coin 2 end up landing heads an equal proportion of the time, and for simplicity, both coins are tossed the same number of times. Formally:

Case 1: N = M, n1 = m1, n2 = m2

In this case, F becomes 1, so lnF becomes 0.

AIC– AIC1 = 2 > 0
So Model 1 is preferable.

This makes sense! After all, if the two coins are equally likely to land heads, then our two models do equally well at predicting the data. But Model 1 is simpler, involving only a single parameter, so it is preferable. AIC gives us a precise numerical criterion by which to judge how preferable Model 1 is!

Okay, now let’s consider a case where coin 1 lands heads much more often than coin 2.

Case 2: N = M, n1 = 2m1, 2n2 = m2

Now if we go through the algebra, we find:

F = 4N (4/27)(m1+n2) ~ 1.12N
So lnF ~ N ln(1.12) ~ .11 N
So AIC– AIC1 = 2 – .22N

This quantity is larger than 0 when N is less than 10, but then becomes smaller than zero for all other values.

Which means that for Case 2, small enough data sets still allow us to go with Model 1, but as we get more data, the predictive accuracy of Model 2 eventually wins out!

It’s worth pointing out here that the Akaike information criterion is an approximation to the technique of maximizing relative entropy, and this approximation assumes large sets of data. Given this, it’s not clear how reliable our estimate of 10 is for the largest data set.

Let’s do one last thing with our simple models.

As our two coins become more and more similar in how often they land heads, we expect Model 1 to last longer before Model 2 ultimately wins out. Let’s calculate a general relationship between the similarity in the ratios of heads to tails in coin 1 and coin 2 and the amount of time it takes for Model 1 to lose out.

Case 3: N = M, n1 = r·m1, r·n2 = m2

r here is our ratio of p/q – the chance of heads in coin 1 over the chance of heads in coin 2. Skipping ahead through the algebra we find:

lnF = N ln( 2 rr/(r+1) / (r + 1) )

Model 2 becomes preferable to Model 1 when AICbecomes smaller than AIC1, so we can find the critical point by setting ∆AIC = 2 – 2 lnF = 0

lnF = N ln( 2 rr/(r + 1) / (r + 1) ) = 1
N = 1 / ln( 2 rr/(r + 1) / (r + 1) )

We can see that as r goes to 1, N goes to ∞. We can see how quickly this happens by doing some asymptotics:

r = 1 + ε
N ~ 1 / ln(1 + ε)
So ε ~ e1/N – 1

N goes to infinity at an exponential rate as r approaches 1 linearly. This gives us a very rough idea of how similar our coins must be for us to treat them as essentially identical. We can use our earlier result that at r = 2, N = 10 to construct a table:

r N
2 10
1.1 100
1.01 1000
1.001 10,000
1.0001 100,000