# Asymmetry between the infinite and finite

This is well-trod territory for this blog, but I want to give a little reminder of some of the subtleties involving infinity in logic. In particular, there’s a fun asymmetry between the finite and the infinite in terms of the expressive power of first order logic.

Consider the set of all first order sentences that are true in every infinite model. This includes any tautology (like P ∨ ¬ P), as well as some other sentences like ∃x ∃y (x ≠ y) (which says that there are at least two distinct things). Are there any finite models that satisfy this entire set of sentences?

The answer is no: for any finite n you can form a sentence that says “there are at least n things”. E.g. for n = 4 we say ∃x ∃y ∃z ∃w (x≠y ∧ x≠z ∧ x≠w ∧ y≠z ∧ y≠w ∧ z≠w). This basically says “there exists an x, y, z, and w that are all distinct”, i.e. there are at least four things. This can be easily generalized to any finite n. Sentences of this form all appear in our set, because they’re true in every infinite model. But of course, no finite model satisfies all of them. Any finite model contains only so many elements, say N, and the sentence which says “there are at least N+1 things” will be false in this model.
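The schema is easy to generate mechanically. Here’s a quick sketch in Python (the function name and the string encoding are mine):

```python
# Generate the first-order sentence "there are at least n things"
# as a plain unicode string, generalizing the n = 4 example above.
def at_least(n):
    xs = [f"x{i}" for i in range(1, n + 1)]
    prefix = " ".join(f"∃{x}" for x in xs)
    body = " ∧ ".join(f"{xs[i]} ≠ {xs[j]}"
                      for i in range(n) for j in range(i + 1, n))
    return f"{prefix} ({body})"

print(at_least(2))  # ∃x1 ∃x2 (x1 ≠ x2)
```

For n = 4 this produces the six pairwise inequalities from the example above, one for each pair of variables.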

What about the reverse question? If you consider the set of all first order sentences that are true in every finite model, are there any infinite models that satisfy this set?

The answer is yes! If there weren’t, then you’d have a collection of sentences that suffice to rule out all infinite models while keeping all the finite ones. And ruling out all and only the infinite models turns out to be a superpower that’s inconsistent with any sound and complete logic like first-order logic. The reasoning, in ultra-condensed form: finitary proofs + soundness + completeness imply compactness, and compactness gives you a quick proof that any theory with arbitrarily large finite models has infinite models (add an infinity of constant symbols to your theory and declare all of them unequal. Since every finite subset of these declarations is satisfied by some model, compactness says that the whole lot is satisfied by some model, which is an infinite model of the starting theory).
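In symbols: starting from a theory T with arbitrarily large finite models, add fresh constant symbols c₁, c₂, c₃, … and consider

```latex
\Gamma \;=\; T \,\cup\, \{\, c_i \neq c_j \;:\; i, j \in \mathbb{N},\ i < j \,\}
```

Any finite subset of Γ mentions only finitely many of the cᵢ, so a large enough finite model of T satisfies it; compactness then hands us a model of all of Γ at once, which must be infinite.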

This is an interesting asymmetry between the finite and the infinite. If you assert every sentence that’s true in all infinite models, you’ve automatically excluded all finite models. But if you assert every sentence that’s true in all finite models, you haven’t excluded all infinite models. A listener who patiently waited for you to finish saying all those infinitely many sentences might still take you to be talking about some infinite model. (And in fact, by the Löwenheim-Skolem theorem, they could understand you to be talking about a model of any infinite cardinality, which is pretty dramatic.)

TL;DR: There are no finite models that satisfy the set of all first-order sentences that are true in every infinite model. On the other hand, there are infinite models that satisfy the set of all first-order sentences that are true in every finite model.

# PA doesn’t trust its provability predicate

Here’s a fun thing: PA has a predicate Provable(n) which says, of a number n, that the sentence coded by n is provable from PA. And this predicate does its job, in the following sense:

PA ⊢ X iff PA ⊢ Provable(⟦X⟧)

Here we’re using two notational conveniences: ⟦X⟧ means “the number that is the Gödel code of the sentence X”, and PA ⊢ X means “PA proves X”.

Now, if PA can prove something, then it’s true in every model of PA. So perhaps we should expect that for every X, Provable(⟦X⟧) implies that X is true. Or more precisely, that PA proves Provable(⟦X⟧) → X. But PA can’t prove this! Löb’s theorem says that “Provable(⟦X⟧) → X” is provable only when X is provable. So if PA could prove that for every X, then it could also prove every X and would thus be inconsistent!

In other words, if PA is consistent then the following two things are both true:

(1) PA ⊢ X iff PA ⊢ Provable(⟦X⟧)
(2) PA ⊬ Provable(⟦X⟧) → X

By the deduction theorem for first order logic, this also implies that PA conjoined with the assumption Provable(⟦X⟧) is not in general sufficient to prove X. To put the weirdness in non-technical terms: (2) says that PA doesn’t trust its provability predicate to ensure truth, while (1) says that what PA can prove about its provability predicate exactly matches what’s actually provable.

What’s up with this? The explanation is kinda subtle: PA proves X if and only if PA proves Provable(⟦X⟧). But nonetheless “PA proves X” is not equivalent to Provable(⟦X⟧). “PA proves X” implies Provable(⟦X⟧), but it’s possible for Provable(⟦X⟧) to be true (just not provable!) despite PA not actually proving X.

To put it another way:

PA ⊢ X
iff
PA ⊢ Provable(⟦X⟧)
iff
Provable(⟦X⟧) is true in every model of PA

But it’s possible for Provable(⟦X⟧) to be true in some models of PA and false in others! “True in every model of PA” is much stronger than “true in at least one model of PA.”

The best way to think about this is probably in terms of nonstandard models. The statement Provable(⟦X⟧) is an existential statement: it says “there exists a number n that encodes a proof of X from PA”. In nonstandard models, there exist a bunch of nonstandard numbers, and some of these numbers trick our Gödel-encoding scheme, convincing the model that they are actually proofs. So it’s possible in a nonstandard model for Provable(⟦X⟧) to be true and X false. But for any sentence X like this, PA doesn’t actually prove that it’s provable! Why? Because PA can only prove things that are true in every model, and Provable(⟦X⟧) is false in the standard model.

This holds quite generally. Suppose you have any theory T that can express arithmetic. If T is consistent, then no provability predicate can have all four of the following properties:

1. If T ⊢ X then T ⊢ Prov(X)
2. T ⊢ (Prov(X) → Prov(Prov(X)))
3. T ⊢ Prov(A→B) → (Prov(A) → Prov(B))
4. T ⊢ Prov(X) → X

Now, (1) through (3) jointly entail Löb’s theorem. This means that by adding (4) we allow T to prove every X, contradicting its consistency. So no theory of arithmetic can have a provability predicate that it trusts to ensure truth, so long as that provability predicate satisfies the first three desiderata!
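For the curious, the derivation of Löb’s theorem from (1)–(3) is short enough to display. Write □X for Prov(X), and suppose for contradiction that T ⊢ □X → X. The diagonal lemma supplies a sentence L with T ⊢ L ↔ (□L → X), and then:

```latex
\begin{align*}
1.&\ T \vdash L \to (\Box L \to X) && \text{diagonal lemma}\\
2.&\ T \vdash \Box(L \to (\Box L \to X)) && \text{by (1), from 1}\\
3.&\ T \vdash \Box L \to \Box(\Box L \to X) && \text{by (3), from 2}\\
4.&\ T \vdash \Box L \to (\Box\Box L \to \Box X) && \text{by (3), from 3}\\
5.&\ T \vdash \Box L \to \Box\Box L && \text{by (2)}\\
6.&\ T \vdash \Box L \to \Box X && \text{from 4 and 5}\\
7.&\ T \vdash \Box L \to X && \text{from 6 and } T \vdash \Box X \to X\\
8.&\ T \vdash L && \text{from 7 and the diagonal lemma}\\
9.&\ T \vdash \Box L && \text{by (1), from 8}\\
10.&\ T \vdash X && \text{from 7 and 9}
\end{align*}
```

Since X was arbitrary, assuming (4) on top of (1)–(3) lets T prove everything.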

# No more than half of anti-sets are big

Background on anti-set theory here.

Suppose that somebody tells you that they have in their hands a model of anti-set theory containing 50 anti-sets. The axiom of anti-infinity gives us some nice immediate restrictions on how large these anti-sets can be. As a quick reminder, the axiom of anti-infinity tells us that every anti-set X must contain another anti-set Y that has a successor that’s not in X. The successor of X is defined to be X ⋃ {X} (an anti-set containing everything in X plus X itself). One big difference between successors in ZFC and successors in anti-set-theory is that anti-sets aren’t always guaranteed to have successors. Another is that anti-sets can be their own successors (this happens if the anti-set contains itself).

Ok, so anti-infinity immediately tells us that there can be no universal set (i.e. a set containing the entire universe). Since every set X must contain a set that has a successor outside X, no set can contain everything, or else it would contain all the successors!

So there are no anti-sets of size 50. How about anti-sets of size 49? It’s recently been proved that there can be at most 25 anti-sets of size 49. And in general, in a universe of size N, no more than half of the anti-sets can be of size N-1. Let’s prove it!

First, a naming convention: in a universe of size N, let’s call any anti-set of size N-1 a big set, and any other anti-set a small set. Big sets are as big as an anti-set can be. To satisfy anti-infinity, any set X must contain an element Y that has a successor that’s not in X. We’ll call Y the anti-infinity-satisfier for X.

Now, some observations. Suppose that a big set X has a successor. This successor is either size N-1 (if X contains itself) or size N (if not). But no set is size N! So any big set that has a successor, must be its own successor. But this means that big sets cannot be anti-infinity-satisfiers. If a big set X was an anti-infinity-satisfier for some other set Y, then its successor could not be in Y. But its successor is X, and X is in Y!

What this means is that every set must contain a small set as its anti-infinity-satisfier. Now, consider any big set X. X must contain some small set z that serves as an anti-infinity-satisfier for it. But for z to serve as an anti-infinity-satisfier, z must have a successor that’s not in X. What this means is that X must be missing z’s successor.

In other words, every big set must be missing at least one small set’s successor. But remember, big sets are size N-1. This means that they contain every set in the universe besides one. So now we know that the “missing set” that characterizes each big set is always a successor of a small set!

Let’s say that our universe contains k small sets. Then there are at most k successors of small sets. Any big set must be the set of all sets besides one of these successors. Thus by extensionality, there are at most k big sets. Therefore there can never be more big sets than small sets!

And that’s the result: In any finite universe, at most half of the anti-sets are big.

This resolves a conjecture that I posted earlier as an open question: in an N-element model, are there always at least three small sets? The answer is yes. If we had fewer than three small sets, then we’d have to also have fewer than three big sets. This means that our universe is at most size four (two small sets and two big sets). But no four-element models exist! QED.

# A Dating Puzzle, and Bayesian Experimental Design

A puzzle for you!

A group of six men (A, B, C, D, E, and F) and six women (a, b, c, d, e, and f) are chosen so that everybody has a perfect match of the opposite gender. They are given the following clues:

1. E and e are a match.
2. A and c are a match.
3. The pairing (A,b), (B,a), (C,c), (D,d), (E,e), (F,f) contains exactly 2 matches.
4. The pairing (A,c), (B,d), (C,a), (D,b), (E,e), (F,f) contains exactly 2 matches.
5. The pairing (A,c), (B,f), (C,b), (D,d), (E,e), (F,a) contains exactly 3 matches.

If they can figure out everybody’s perfect matches, the group will get a million dollars to share! Can they figure it out?

If you figure that one out, then here’s the next level, with ten pairs and twelve clues!

A group of ten men (ABCDEFGHIJ) and ten women (abcdefghij) are chosen so that everybody has a perfect match of the opposite gender. They are given the following twelve clues:

1. F and f are not a match
2. J and h are not a match
3. B and e are not a match
4. D and d are a match
5. H and c are a match
6. The pairing (Ai Bb Ca Dd Ee Fc Gj Hg If Jh) contains exactly 2 matches.
7. The pairing (Af Be Cg Dj Eh Fd Gi Hc Ia Jb) contains exactly 4 matches.
8. The pairing (Af Be Ca Dd Ej Fh Gi Hb Ig Jc) contains exactly 2 matches.
9. The pairing (Aa Bc Ci Dd Eg Fj Gf Hb Ih Je) contains exactly 2 matches.
10. The pairing (Af Bi Ce Dd Eh Fg Gj Hc Ia Jb) contains exactly 5 matches.
11. The pairing (Af Ba Cb Dd Eh Fi Gj Hc Ie Jg) contains exactly 5 matches.
12. The pairing (Af Bi Ch Dd Eb Fg Gj Hc Ia Je) contains exactly 7 matches.

Can you help them get their million dollars?
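If you’d rather outsource the logic, both puzzles are small enough to brute-force. Here’s a sketch in Python for the six-pair puzzle; the trial-pairing strings are just my compact encoding of clues 3–5 (position = man A through F), and it prints only a count so as not to spoil the answer:

```python
# Brute force for the six-pair puzzle: enumerate all 6! = 720 possible
# matchings and keep the ones consistent with every clue.
from itertools import permutations

MEN, WOMEN = "ABCDEF", "abcdef"

def agreements(assign, trial):
    # how many pairs of the trial pairing agree with this assignment?
    return sum(assign[m] == w for m, w in zip(MEN, trial))

solutions = []
for perm in permutations(WOMEN):
    a = dict(zip(MEN, perm))
    if (a["E"] == "e"                          # clue 1
            and a["A"] == "c"                  # clue 2
            and agreements(a, "bacdef") == 2   # clue 3
            and agreements(a, "cdabef") == 2   # clue 4
            and agreements(a, "cfbdea") == 3): # clue 5
        solutions.append(a)

print(len(solutions))  # 1 -- the five clues pin down a unique matching
```

The same filtering handles the ten-pair puzzle too: 10! = 3,628,800 candidate matchings is still only a few seconds of work.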

✯✯✯

Some background on the puzzle:

I didn’t actually come up with it out of nowhere; I encountered it in the wild! A few days ago I started watching a new Netflix reality show called “Are You The One?” The premise of the show: ten men and ten women are paired up via a matchmaking algorithm, and if they can all figure out their other halves after living together for a month, then they win one million dollars. It’s an extremely corny show, but there was one aspect of it which I found pretty interesting: it’s a perfect setting for Bayesian experimental design! Me being me, I spent the whole show thinking about the math of how contestants get information about their potential “perfect matches”.

Let’s start with a very basic look at the underlying math. Ten pairs, chosen from two groups of ten, gives ten factorial possible matchings. 10! is 3,628,800, which is a lot of possibilities. If you were to randomly choose the ten matches, with no background information, you’d have a .000028% chance of getting everything right. This is about as likely as you are to correctly guess the outcome of 22 tosses of an unbiased coin in a row!

Of course, the show isn’t so cruel as to force them to just guess with no information. For one thing, they have a whole month to get to know each other and gather lots of that subtle tricky-to-quantify social evidence. But more importantly, every episode they get two crucial pieces of information:

First, they get to choose one pair (a man and a woman) to go into the “Truth Booth”. The group is then informed whether these two are a genuine match or not.

And second, at the end of each episode the entire group gathers together, everybody pairs off with one another, and they are told how many pairs they got right (though not which ones). On the final episode, the matching they choose determines whether they get the million dollars or not.

I call the first type of test the individual pair test and the second the group test. And this is where the two types of clues in the above puzzles come from! The clues for the second puzzle are actually just a translation of the tests that the group decided to do in the first seven episodes. (So if you successfully solved it, then feel good about yourself for being the type of person that would have won the million.) Interestingly, it turns out that by the seventh episode they could have already figured out everybody’s perfect match, but it took them three more episodes to get to that point! Silly non-logically-omniscient humans.

Putting aside the puzzle we started with, the show setup naturally lends itself to some interesting questions. First of all, is there a strategy that guarantees that the group will get the million dollars by the tenth episode? In other words, are ten pair tests and nine group tests sufficient to go from 3,628,800 possibilities to 1?

I am not yet sure of the answer. The fact that the group in season one managed to narrow it down to a single possible world with only seven episodes seems like evidence that yes, those 19 tests do provide enough evidence. In addition, there have been eight seasons and in only one did the group fail to find the correct matches. (And in the season whose group failed, they started with eleven pairs – multiplying the number of possible worlds by 11 – and used two fewer pair tests than previous seasons.)

However, it’s worth keeping in mind that while the group of twenty individuals was far from logically omniscient, they did have a great wealth of additional social evidence, and that evidence may have allowed them to make choices for their tests that yielded much more expected information than the optimal test in the absence of social information. (I’m also suspicious that the producers might be a little more involved in the process than they seem. There are a few seasons where the group is consistently getting 2 or 3 matches until the last episode where they suddenly get everything right. This happens in season 3, and by my calculations there were still four possible worlds consistent with all their information by the time of the last matching ceremony!)

We can also ask the Bayesian experimental design question. What’s the optimal pair test/group test to perform, given the set of remaining possible worlds?

We can solve this fairly easily using a brute force approach. Consider the problem of finding the optimal pair test. First we actually generate the set of all possible worlds (where each world is a specification of all ten matches). Let N be the size of this set. Then we look through all possible pairs of individuals (i, j), and for each pair calculate the number of worlds in which this pair is a match. Call this quantity n_ij. Then the expected number of worlds remaining after performing a pair test on (i, j) is:

n_ij · Pr(i and j are a match) + (N – n_ij) · Pr(i and j are not a match) = n_ij²/N + (N – n_ij)²/N

So we simply search for the pair (i, j) that minimizes n_ij² + (N – n_ij)². This is equivalent to maximizing n_ij · (N – n_ij): the product of the number of worlds where (i, j) is a match and the number of worlds where (i, j) is not a match.
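Here’s a sketch of that search in Python (names and encodings are mine; a world is a tuple where worlds[k][i] = j means man i is matched with woman j in world k):

```python
# Brute-force search for the optimal pair test: pick the pair (i, j)
# maximizing n_ij * (N - n_ij) over the surviving worlds.
from itertools import permutations

def best_pair_test(worlds, n_people):
    N = len(worlds)
    best_pair, best_score = None, -1
    for i in range(n_people):
        for j in range(n_people):
            n_ij = sum(w[i] == j for w in worlds)  # worlds where (i, j) match
            score = n_ij * (N - n_ij)
            if score > best_score:
                best_pair, best_score = (i, j), score
    return best_pair, best_score

# With no clues yet, all pairs are symmetric: in 3! = 6 worlds each pair
# is a match in 2! = 2 of them, so every test scores 2 * (6 - 2) = 8.
worlds = list(permutations(range(3)))
pair, score = best_pair_test(worlds, 3)
print(pair, score)  # (0, 0) 8
```

Once any clues are in play, you first filter `worlds` down to the consistent ones and then run the same search.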

We do pretty much the same thing for the group test. But here’s where we run into a bit of trouble: though our algorithm is guaranteed to return an optimal test, the brute force approach has to search through 10!·10! possibilities (every candidate group test against every world), and this is just not feasible. The time taken grows as the number of candidate tests times the number of worlds, both of which blow up factorially with the number of individuals.

So we have another interesting question on our hands: Is there an efficient algorithm for calculating the optimal pair/group test? In the more general setting, where the tests are not restricted to just being pair tests or group tests, but can be a test of ANY boolean expression, this is the question of whether SAT can be solved efficiently. And given that SAT is known to be NP-complete (it was in fact the first problem to be proven NP-complete!), this more general question ends up being equivalent to whether P = NP!

# What precedent does the Trump Twitter ban set?

I don’t know if anybody needs me to tell them this, but Trump has recently been permanently banned from Twitter, following a violent attack on the capitol inspired by his conduct since his election loss. This decision by Twitter has inspired a lot of discussion about free speech and censorship, and I want to chime in with one small point about a common argument you hear in these discussions.

A lot of the anti-ban discussion that I’ve seen relates to fears about setting a bad precedent for future decisions by tech companies. I think that these are legitimate concerns, but that they are often stated over-confidently. When trying to judge the precedent that an action sets, there’s a type of scope ambiguity that arises. What exactly is the precedent that Twitter’s ban of Trump sets? Is it that any time social media authorities find somebody’s ideas repugnant enough they will ban them? That’s probably too broad, but somebody that thinks that this is the precedent will find it overly subjective and judge it to be overreach. Is the precedent that Twitter will ban the President of the United States if his conduct leads to an assault on the Capitol? That’s probably too narrow, but might be a standard we find acceptable.

Sometimes there’s less ambiguity, like when the action fits nicely within a predetermined paradigm for precedent-breaking (e.g., perhaps making a new amendment to the Constitution), and other times there’s more ambiguity. Sometimes actions come with a detailed description of why they’re being taken, and this can be some guide to what precedent we consider the action to be setting. But other times they don’t, and even when they do, we might have reasons to distrust the description, or to think the precedent being set is actually different from the description.

I guess we could say: there’s the intent in the mind of the Twitter authorities, and there’s their actual statement of the reasons for the action. There’s the explicit policy they are citing for their action (in other cases, they might be creating a new policy), and there’s the way that that action is understood among the general public. What we really care about is the likely future consequences of the act, and all of these things are relevant to judging this to one degree or another. Framing it this way, it’s a very complicated prediction about the world that one makes when they say that the Trump Twitter ban is setting a bad precedent: it’s saying that the future actions of Twitter and other tech companies are worse than they would have been in the counterfactual world where they had held off on the ban. This is a claim I have a lot of uncertainty about, and I think probably most other people should as well.

# Divisibility Tricks and Orderly Numbers

A fun exercise is to think about the divisibility tricks you learned in grade school and generalize them to different bases. In base 10 we have:

N is divisible by 2 if its last digit is divisible by 2
N is divisible by 3 if the sum of its digits is divisible by 3
N is divisible by 4 if its last two digits are divisible by 4
N is divisible by 5 if its last digit is 0 or 5
N is divisible by 6 if N is divisible by 2 and 3
N is divisible by 7 if N = Xy for a single digit y and X – 2y is divisible by 7
N is divisible by 8 if its last three digits are divisible by 8
N is divisible by 9 if the sum of its digits is divisible by 9
N is divisible by 10 if its last digit is 0
N is divisible by 11 if the alternating sum of its digits is divisible by 11

There are a few natural categories of divisibility rules here. Divisibility by 3 and 9 are both tested by taking a sum of the digits of N. Divisibility by 11 is checked with an alternating sum. Divisibility by 2, 4, 5, and 8 is tested by looking at the last few digits of N. Divisibility by 6 is a combination of two other divisibility rules. And divisibility by 7 is in its own category.

Sum Test
To test divisibility of N by k, check if the sum of digits of N is divisible by k.

In base B, this test can be used whenever k divides B – 1. For example, in base B = 10, we use the sum test only to test divisibility by numbers that divide 9, i.e. 3 and 9.

Alternating Sum Test
To test divisibility of N by k, check if the alternating sum of digits of N is divisible by k.

In base B, this test can be used only if k divides B+1. For example, in base B = 10, we use the alternating sum test only to test divisibility by numbers that divide 11, i.e. 11.

Ending Test
To test divisibility of N by k, check if the number formed by the last m digits of N is divisible by k.

In base B, this test can be used whenever k divides B^m for some m (equivalently, whenever every prime factor of k divides B). For example, in base B = 10, we use the ending test only to test divisibility by numbers that divide 10, i.e. 2, 5, and 10, and powers of those numbers, like 4, 8, 16, and 25.

Combination Tests
In any base B, to test divisibility of N by k = ij, where i and j are co-prime, it suffices to check divisibility by i and divisibility by j.

So for instance, divisibility by 6 can be checked by performing the tests for 2 and 3, but divisibility by 12 can’t be checked by performing the tests for 2 and 6.

Other Tests
The test for divisibility by 7 in base 10 doesn’t fit into any of the previous categories.

We’ll only be considering the first four categories.

There’s a nice way to see why these tests generalize in the way that they do. Let’s think about the sum test first.

Sum Test

We reason it through by induction. Suppose that we’re in base B, checking divisibility by K < B. Our base case: the digit sum of K is just K, which is divisible by K. For our inductive step, assume that for every number n less than N, n is divisible by K if the digits of n add up to something divisible by K. Now, N = (N-K) + K, and (N-K) < N, so (N-K) satisfies the inductive hypothesis. This means that we just need to show that if we start with a number whose digits add to a multiple of K, and add K to this number, the sum of the digits of the resulting number will also be a multiple of K.

Break this into a few cases. First, suppose that adding K affects only the last digit. Then the last digit increases by K, so the sum of digits increases by K, and is thus still divisible by K.

What if adding K pushes the last digit to B or beyond? Then we subtract B from the last digit and add 1 to the second-to-last digit. So the net change in the digit sum is +1 + (K – B) = K – (B – 1).

But adding 1 to the second-to-last digit might also have pushed it to B. In this case, we decrease the second-to-last digit by B and add 1 to the third-to-last digit. The net change here is -(B – 1).

If this causes the third-to-last digit to reach B, then the same thing happens again, and the net change in the digit sum picks up another -(B – 1).

In sum, this tells us that by adding K, the change in the digit sum is one of the following:

+K
+K – (B – 1)
+K – 2(B – 1)
+K – 3(B – 1)

In general, the change in the digit sum is +K – n(B – 1) for some n. If we now assume that (B-1) is a multiple of K, then the change in the digit-sum is also a multiple of K. And thus the resulting digit sum will be a multiple of K as well!

So for any K such that B – 1 is a multiple of K, if the digit sum of N is a multiple of K, then N is a multiple of K!

Alternating Sum Test

We can reason it through similarly for the alternating sum test. The possible changes to digits are:

+K
+1, +K – B
+1, +1 – B, +K – B
+1, +1 – B, +1 – B, +K – B
+1, +1 – B, +1 – B, +1 – B, +K – B

The alternating sum of this pattern, starting from the far right, gives:

K
(K – B) – 1
(K – B) – (1 – B) + 1
(K – B) – (1 – B) + (1 – B) – 1
(K – B) – (1 – B) + (1 – B) – (1 – B) + 1

The result of this sum is always either K or K – (B + 1). So as long as B+1 is a multiple of K, then the alternating sum will also change by a multiple of K!

Thus for any K such that B+1 is a multiple of K, if the alternating sum of the digits of N is a multiple of K, then N is a multiple of K.
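Both facts are easy to sanity-check numerically. The congruences doing all the work are N ≡ digit-sum(N) (mod B-1) and N ≡ alternating-sum(N) (mod B+1); here’s a quick Python check (function names are mine):

```python
# Numerical check of the facts behind the sum and alternating sum tests:
# in base B, N is congruent to its digit sum mod B-1, and to its
# alternating digit sum mod B+1.
def digits(n, base):
    ds = []
    while n:
        n, r = divmod(n, base)
        ds.append(r)
    return ds  # least significant digit first

def digit_sum(n, base):
    return sum(digits(n, base))

def alt_sum(n, base):
    # alternating sum, starting from the far right (least significant) digit
    return sum(d if i % 2 == 0 else -d for i, d in enumerate(digits(n, base)))

for base in range(3, 17):
    for n in range(1, 2000):
        assert (n - digit_sum(n, base)) % (base - 1) == 0
        assert (n - alt_sum(n, base)) % (base + 1) == 0
```

For instance, in base 10, 12345 has digit sum 15 and alternating sum 5 – 4 + 3 – 2 + 1 = 3, and indeed 12345 ≡ 15 (mod 9) and 12345 ≡ 3 (mod 11).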

If we disregard the “Other Tests” category, we can quite easily count, for any given base, how many numbers have sum, alternating sum, ending, or combination tests. People sometimes say that highly composite numbers would be ideal choices for numerical base systems, but those are really only good for ending tests, and not necessarily for sum or alternating sum tests.

I’ll define the orderliness of N as the number of integers n with 2 ≤ n < N that have a divisibility rule in base N, restricting divisibility rules to sum tests, alternating sum tests, ending tests, and combination tests.
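Here’s a sketch of that computation in Python, under my reading of the four rule categories (the details, especially the ending-test criterion, may differ slightly from the convention behind the plots):

```python
# Count how many k in [2, base) have a divisibility rule in the given base,
# using the four rule categories defined above.
from math import gcd

def has_rule(k, base):
    if (base - 1) % k == 0:        # sum test
        return True
    if (base + 1) % k == 0:        # alternating sum test
        return True
    if pow(base, k, k) == 0:       # ending test: k divides some power of base
        return True
    for i in range(2, k):          # combination test: k = i*j with i, j coprime
        if k % i == 0 and gcd(i, k // i) == 1:
            if has_rule(i, base) and has_rule(k // i, base):
                return True
    return False

def orderliness(base):
    return sum(has_rule(k, base) for k in range(2, base))

print(orderliness(10))  # 7: rules for 2, 3, 4, 5, 6, 8, and 9, but not 7
```

The ending-test check uses the fact that k divides some power of B exactly when k divides B^k, which `pow(base, k, k)` tests without computing the huge power.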

Here’s a plot of the orderliness of bases from 3 to 100:

And from 3 to 1000:

The jaggedness of this graph suggests that some numbers make much better bases than their neighbors. By analogy to the notion of highly composite numbers, I’ll define highly ordered numbers as those whose orderliness is larger than that of every smaller number. These are numbers that would be ideal choices for numerical bases. The first few highly ordered numbers are:

3, 4, 5, 6, 8, 9, 10, 11, 14, 19, 20, 21, 29, 34, 49, 50, 55, 56, 69, 76, 90, 91, 99

Notice that 12 isn’t on this list, despite being highly composite! You might be able to guess why: though 12 has lots of factors, its neighbors 11 and 13 do not. On the other hand, 11 doesn’t have a lot of factors, but its neighbors 10 and 12 do, so it lands a spot on the list.

We could also measure orderliness by what percentage of smaller numbers have rules for them. Here’s what that looks like:

# What’s the probability that an election winner leads throughout the entire vote?

Try to answer the question in the title for yourself. An election has n votes for candidate A, m votes for candidate B, and no third candidate. Candidate A wins, so n > m. Assuming that the votes were equally likely to be counted in any order, what’s the probability that candidate A was always ahead? (Note that “always ahead” precludes the vote being tied at any point.)

My friend showed me this puzzle a couple of days ago and told me that there’s an extremely simple way to solve it. After trying a bunch of complicated things (including some wild Pascal’s triangle variants), I eventually relented and asked for the simple solution. Not only is there a short proof of the answer, but the answer is itself incredibly simple and elegant. See if you can do better than I did!

(…)

(…)

(…)

(Read on only after you’ve attempted the problem!)

(…)

(…)

(…)

So, here’s the solution. We start by considering the opposite of A always being ahead, which is that A is either behind or that the vote is tied at some point. Since we know that A eventually wins, A being behind at some point implies that at a later point A and B are tied. So the opposite of A always being ahead is really just that there is a tie at some point.

Pr(A is always ahead | n, m)
= 1 – Pr(A is behind at some point or tied at some point | n, m)
= 1 – Pr(There’s a tie at some point | n, m)

Now, let’s consider the probability of a tie at some point. There are two ways for this to happen: either the first vote is for A or the first vote is for B. The first vote being for B entails that there must be a tie at some point, since A must eventually pull ahead to win. This allows us to say the following:

Pr(Tie at some point | n, m)
= Pr(Tie at some point & A is first vote | n, m) + Pr(Tie at some point & B is first vote | n, m)
= Pr(Tie at some point & A is first vote | n, m) + Pr(B is first vote | n, m)
= Pr(Tie at some point & A is first vote | n, m) + m/(n+m)

Now, the final task is to figure out Pr(Tie at some point & A is first vote | n, m). If you haven’t yet figured it out, I encourage you to pause for a minute and think it over.

(…)

(…)

Alright, so here’s the trick. Consider any counting order with a tie at some point, and look at the first tie: up to and including that point, there are an equal number of votes for A and B. Swapping the A-votes and B-votes up to that first tie gives a perfect one-to-one pairing between the tie-containing worlds in which A votes first and the tie-containing worlds in which B votes first. So there are the same number of each! And this means that we can say the following:

Pr(Tie at some point & A is first vote | n, m)
= Pr(Tie at some point & B is first vote | n, m)

And this is exactly the probability we’ve already solved!

Pr(Tie at some point & B is first vote | n, m)
= Pr(B is first vote | n, m)
= m/(n+m)

And now we’re basically done!

Pr(Tie at some point | n, m)
= m/(n+m) + m/(n+m)
= 2m/(n+m)

Pr(A is always ahead | n, m)
= 1 – Pr(Tie at some point | n, m)
= 1 – 2m/(n+m)
= (n – m) / (n + m)

And there we have it: The probability that A is always ahead is just the difference in votes over the total number of votes! Beautiful, right?
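The result is easy to verify by exhaustive enumeration for small elections. A quick Python check (names are mine):

```python
# Exact check of the ballot result: enumerate all equally-likely counting
# orders of n A-votes and m B-votes, and count the orders in which A
# leads strictly throughout.
from itertools import combinations
from fractions import Fraction

def prob_always_ahead(n, m):
    total = ahead = 0
    for a_pos in combinations(range(n + m), n):  # positions of the A-votes
        a_set = set(a_pos)
        total += 1
        lead = 0
        for t in range(n + m):
            lead += 1 if t in a_set else -1
            if lead <= 0:        # tied or behind: this order doesn't count
                break
        else:
            ahead += 1
    return Fraction(ahead, total)

print(prob_always_ahead(3, 2))  # 1/5, matching (3 - 2)/(3 + 2)
```

Running it over a few small (n, m) pairs confirms the (n – m)/(n + m) formula exactly.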

We can even more elegantly express this as simply the percent of people that voted for A minus the percent that voted for B.

(n – m) / (n + m)
= n/(n+m) – m/(n+m)
= %A – %B

This tells us that even in the case where candidate A gets 75% of the vote, there’s still a 50/50 chance that they are tied or behind at some point!

An example of this: the recent election had Biden with 81,283,485 votes and Trump with 74,223,744 votes. Imagining that there were no third candidates, this would mean that Biden had 52.27% of the popular vote and Trump had 47.73%. And if we now pretend that the votes were equally likely to be counted in any order, then this tells us that there would only be a 4.54% chance that Biden would be ahead the entire time! Taking into account the composition of mail-in ballots, which were counted later, this means that Trump having an early lead was in fact exactly what we should have expected. The chance that Biden would have fallen behind at some point was likely even higher than the baseline 95.46%!

# Quantum Chess

Recently I’ve been thinking about both quantum mechanics and chess, and came up with a chess variant that blends the two. The difference between ordinary chess and quantum chess is the ability to put your pieces in superpositions. Here’s how it works!

# Movement

There are five modes of movement you can choose between in quantum chess: Move, Split, Merge, Collapse, and Branchmove.

Modes 1 and 2 are for non-superposed pieces, and modes 3, 4, and 5 are for superposed pieces.

## Mode 1: Move

This mode allows you to move just as you would in an ordinary chess game.

## Mode 2: Split

In the split mode, you can choose a non-superposed piece and split it between two positions. You can’t choose a position that is occupied, even by a superposed piece, meaning that splitting moves can never be attacks.

One powerful strategy is to castle into a superposition, bringing out both rooks and forcing your opponent to gamble on which side of the board to stage an attack on.

## Mode 3: Merge

In mode 3, you can merge the two branches of one of your superposed pieces, recombining them onto a square that’s accessible from both branches.

You can’t merge to a position that’s only accessible from one of the two branches, and you can’t merge onto a square that’s occupied by one of your own superposed pieces, but merge moves can be attacks.

## Mode 4: Collapse

Mode 4 is the riskiest mode. In this mode, you choose one of your superposed pieces and collapse it. There are two possible outcomes: First, it might collapse to the position you clicked on. In this case, you now have a choice to either perform an ordinary move…

… or to split it into a new superposition.

But if you get unlucky, then it will collapse to the position you didn’t select. In this case, your turn is over and it goes to your opponent.

## Mode 5: Branch Move

Finally, in a branch move, you relocate just one branch of the superposition, without collapsing the wave-function or affecting the other branch.

# Piece Interactions

## Attacking a Superposed Piece

What happens if you attack a superposed piece? The result is that the superposed piece collapses. If the piece collapses onto the square you attacked, then you capture it.

But if it collapses onto the other branch of the superposition, then it is safe, and your piece moves harmlessly into the square you just attacked.

This means that attacking a superposed piece is risky! It’s possible for your attack to backfire, resulting in the attacker being captured next turn by the same piece it attacked.

It’s also possible for a pawn to move diagonally without taking a piece, in a failed attack.
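The attack-resolution rule described above amounts to a coin flip over the two branches. Here's a minimal sketch of that rule (the square names and return labels are invented for illustration; this isn't from any actual implementation):

```python
import random

def attack_superposed(target_square, other_square):
    """Resolve an attack on a piece superposed over two squares.
    The piece collapses to one branch uniformly at random; the
    attacker captures only if it collapses onto the attacked square."""
    collapsed = random.choice([target_square, other_square])
    if collapsed == target_square:
        return "capture"   # the attacker takes the piece
    return "miss"          # the attacker lands harmlessly on an empty square

# Over many attacks, captures and misses each occur about half the time:
results = [attack_superposed("e5", "c5") for _ in range(10_000)]
print(results.count("capture") / len(results))  # ≈ 0.5
```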

## Line of Sight

Superposed pieces block the lines of sight of both your pieces and your opponent’s pieces. This allows you to defend your pieces or block open files, without fully committing to defensive positions.

## Winning the Game

To win quantum chess, you must actually take the opponent’s king. Let’s see why it’s not enough to just get the opponent into a position that would ordinarily be checkmate:

It’s blue’s turn now, and things look pretty lost. But look what blue can do:

Now the red queen has to choose one of the two targets to attack, and there’s a 50% chance that she gets it wrong, in which case the blue king can freely take the red queen, turning a sure loss into a draw!

So how can red get a guaranteed win? It takes patience. Rather than trying to attack one of the two squares, the red queen can hang back and wait for a turn.

Now the blue king has two choices: leave superposition, after which he can be taken wherever he goes, or move one branch of the superposition, in which case any possible branch move gives the queen a safe shot at the king. This can be repeated until the king is taken.

And that’s quantum chess! I’ve played several games with friends, and each time have noticed interesting and surprising strategies arising. Let me leave you with a quantum chess puzzle. Here’s a position I found myself in, playing red:

What do you think was my best move here?

# Biblical inerrancy

A puzzling phenomenon is the existence of Biblical inerrantists. It seems to me to be impossible to have both (1) carefully read the Bible and (2) come to the conclusion that it’s inerrant. Bart Ehrman talks about a possible explanation for this phenomenon in the order in which people read the Gospels: if you read the Gospels all the way through, one at a time, rather than reading them simultaneously, side by side, then it’s much easier for you to fail to notice all of the discrepancies. And boy are there discrepancies!

Virtually no story that appears in multiple Gospels is identical in the different tellings. That’s not hyperbole; literally from the story of Jesus’s birth all the way to his crucifixion and resurrection, there are unambiguous contradictions to be found the entire way. I don’t think that these contradictions make it rationally impossible to be a Christian, but they certainly do make it rationally impossible to be a Christian of an inerrantist tradition. And more liberal Christians face a serious challenge: how to justify placing such enormous stock in the wording of a text that is known to be error-ridden.

There are just too many examples of blatant contradictions to go through them all. It’s a remarkable fact that many Christians that have read these stories throughout their lives are completely unaware that they disagree with one another! What I want to do here is just to pick one of the most well-known stories, the empty tomb. As I go through the story as each Gospel tells it, if at any moment you feel skeptical of what I’m saying, just go look at the verses being cited for yourself! The source of all the following quotes is the New International Version (NIV).

I’ve copied the entire “empty tomb” story as it’s told in each of the Gospels and highlighted the differences.

Now, let’s test your reading comprehension! In the story of the empty tomb, how many women come to the tomb: one, two, or three? Depends on who you read! According to Mark, it was three (the two Marys and Salome). According to Matthew, it’s two (the two Marys). According to John it’s just Mary Magdalene. And according to Luke it’s some unspecified number more than one.

How many entities do they encounter at the tomb, and are they ordinary men or angels? According to Mark, they see one young man already inside the tomb. Luke says that after they enter, two men suddenly appear beside them. Matthew describes a violent earthquake preceding the arrival of an angel from heaven, while the women are still outside the tomb. And John describes two angels inside the tomb (seen by Mary from outside the tomb). What’s more, in John and Matthew the woman/women see(s) and talk(s) to Jesus at the tomb! You have to agree that this is a pretty noteworthy thing for Mark and Luke to leave out.

Ok, how about the stone in front of the entrance? When the women/woman arrive(s), is the stone already rolled away from the tomb (as in Mark, Luke, and John), or is it moved later (as in Matthew)?

When the one/two men/angels speak to the woman/women, do they say that Jesus will meet the disciples in Galilee? In Mark and Matthew, yes. But in Luke and John there’s no mention of this! And in fact, in Luke the disciples don’t go to Galilee to meet Jesus at all! Jesus appears to the disciples in Jerusalem and tells them to stay there, which they do until the end of the gospel (Luke 24:51-53). He ascends right outside of Jerusalem (Acts 1:9,12).

When the woman/women leave the tomb, do they describe what they saw to the disciples or not? According to Luke, Matthew, and John, yes. But not according to Mark; in Mark, the women flee and in their fear “say nothing to anyone”!

Did Peter ever visit the tomb? Not in Matthew or Mark, but in Luke and John he does. Is he by himself or with another disciple? Luke says he’s by himself; John describes another disciple with him.

So much for the empty tomb! This level of contradiction is not special to this story. Think about Jesus’s death. When did he die? This is one of the most blatant contradictions in the Bible, because both John and Mark take great pains to explicitly lay out their chronology. According to Mark, Jesus and his disciples have their last supper on the evening of the Passover (Mark 14:12-17), and the following morning he is taken to be crucified (Mark 15:1). In John there is no last supper! John explicitly states in John 19:14 that Jesus is taken away for crucifixion on “the day of Preparation of the Passover”, that is, the day before Passover!

Let’s now quickly run through some more Biblical contradictions. Who was the father of Joseph, Jesus’s father? According to Matthew 1:16, “Jacob begat Joseph, the husband of Mary”, whereas according to Luke 3:23, Joseph is said to be “the son of Heli”. The genealogies presented in Matthew and Luke are virtually in complete disagreement starting two generations up from Jesus. Apologists will often argue that one of the two is presenting the maternal lineage rather than the paternal line, but this is far from obvious when you look at the wording, which is specifically about Joseph’s father, not Mary’s (plus the fact that in both genealogies, the entire rest of the list follows only the paternal line).

In Mark 5:21-24, Jairus comes to Jesus before his daughter dies and asks him to heal her (“My little daughter is dying. Please come and put your hands on her so that she will be healed and live”), but in Matthew 9:18-20 the daughter has already died by the time Jairus comes to Jesus (“My daughter has just died. But come and put your hand on her, and she will live”).

In Mark 15:37-39, the curtain that separates the holy of holies from the rest of the temple rips in two after Jesus dies. And in Luke 23:45-46, it rips before.

In Matthew 2:1-23, Joseph, Mary, and Jesus flee to Egypt (250 miles away) after Jesus’s birth, where they stay until King Herod dies, after which they resettle in Nazareth. But in Luke 2:1-40, Joseph and the fam do their rites of purification in Bethlehem after the birth and return to Nazareth directly, 41 days after Jesus’s birth.

Also in Luke 2, it is described that Joseph and Mary travel from Galilee to Bethlehem for a census decreed by Caesar Augustus to be “taken of the entire Roman world.” The problem with this is that we have good historical records of the reign of Caesar Augustus, and no such census took place!

One final one: in Mark 2:25-26, Jesus references an Old Testament passage about David unlawfully eating consecrated bread “in the days of Abiathar the high priest.” There’s a big problem with this: Jesus made a mistake! In the Old Testament passage (1 Samuel 21:1-7), Abiathar wasn’t the high priest! The high priest was Ahimelech, whose son Abiathar would only much later become high priest. So Christians have a choice to make between either Jesus not knowing his Old Testament or Mark not being an inerrant recording of Jesus’s sayings.

✯✯✯

All these contradictions are begging for an explanation. Is one or more of the authors lying? Not necessarily. Lying implies intention, and it’s worth keeping in mind the timeline of the Bible. Jesus is purported to have lived from roughly 0 to 30 AD. Scholars unanimously agree that the earliest of the Gospels is Mark, originally written around 70 AD. Next were Matthew and Luke, both around 80-85 AD, and finally came John, around 90-95 AD. That’s a gap of 40 to 65 years from the events being described! What’s more, the authors of Mark, Matthew, and John were almost certainly not the actual historical Mark, Matthew, and John (for a bunch of reasons I won’t get into now, most obviously that these texts were written in Greek by highly educated individuals, while all three of those disciples were uneducated and would not have known Greek). And of course, Luke wasn’t a disciple and never met Jesus personally.

So the first texts that were written are from non-eyewitnesses recording an oral tradition that had started forty to sixty-five years before! In a forty-year game of telephone, nobody needs to have lied in order for the original story to become warped beyond recognition. Anybody who doubts that stories can become so dramatically altered over time need only think about the many Trump supporters who to this day insist that Trump’s inauguration had more attendees than Obama’s, despite LITERAL TIME LAPSE FOOTAGE of the entire thing and photographs all throughout. In one poll, Trump and Clinton voters were handed two photos, one from Obama’s inauguration and the other from Trump’s.

And guess what? 15% of Trump voters said that the visibly sparser photo (Trump’s) had more people! Suffice it to say, in the presence of emotionally charged topics like religion and politics, human brains start to act funny. Put this in context: this is an event from four years ago that we have video records of. And it’s somehow supposed to be unimaginable that 40 years of word-of-mouth transmission by religious believers made any significant changes to the original stories?

It’s even worse than this. These first texts are not even what our modern Gospels are based on, simply because we don’t have any copies of them! The first copies of the texts that we possess come over A HUNDRED YEARS LATER, meaning that we have more than a hundred years of scribes making copies of copies of copies of the original texts. We know for a fact that these scribes were not perfect copyists, from the thousands of copies of the Gospels we possess, which abound in mistakes ranging from small spelling errors to entire missing stories or new stories that had never appeared before. I’m sure that you know the story of the adulteress whom Jesus defends, saying “he that is without sin among you, let him first cast a stone at her.” Did you know that this story appears in none of our earliest copies of the Gospels? Scholars unanimously agree that this story was added by scribes hundreds of years after the original writing, both because it literally doesn’t appear in copies earlier than that and also because the writing style is different from the rest of John.

So it seems to me like there really is no mystery here once you learn about the actual history of the texts. There are contradictions in the Bible because the Bible is an extremely imperfect copy of a copy of a copy of a … copy of a text written by non-eyewitnesses that heard stories told by people who had heard stories told by… people who had heard stories told by eyewitnesses to the events.

# Formal Semantics 1: Historical Prelude and Compositionality

English is really complicated. For a long time, logicians looking at natural languages believed that there could be no formal system detailing their grammar and semantics. They resigned themselves to extremely simple idealized fragments of English, like propositional logic (formalizing “and”, “not”, and “or”) and first-order logic (formalizing “every”, “some”, and “is”).

The slogan of the time was “ordinary language has no logic” (Bertrand Russell and Peter Strawson). Chomsky famously argued that the languages invented by logicians were too artificial and entirely unlike natural languages, and that therefore the methods of logicians simply couldn’t be applied to this more complex realm.

This attitude has changed over time. Perhaps the most important figure in the “logic of natural language” movement is Richard Montague, a student of the giant of logic Alfred Tarski. The first line of his paper English as a Formal Language reads “I reject the contention that an important theoretical difference exists between formal and natural languages”, and he follows this up by more or less single-handedly inventing formal semantics, now a thriving field. Hilariously, Montague apparently saw this work as child’s play, writing:

I (…) sat down one day and proceeded to do something that I previously regarded, and continue to regard, as both rather easy and not very important — that is, to analyze ordinary language.

(This had to hit hard for linguists of his time.)

Alright, enough prologue. In the next few posts I want to describe a naive first pass at formalizing a fairly substantial fragment of English, modeled off of Montague semantics. The key concept throughout will be the notion of compositionality, which I’ll briefly describe now.

# Compositionality

Compositionality is all about how to construct the meaning of phrases from their smaller components. Take a sentence like “The cat sat on the mat.” The meaning of this sentence clearly has something to do with the meanings of “the cat” and “sat on the mat”. Similarly, the meaning of “sat on the mat” must have something to do with the meanings of “sat”, “on”, “the”, and “mat”.

The compositionality thesis says that this is all that determines the meaning of “the cat sat on the mat.” In other words, the meaning of any phrase is a function of the meanings of the individual words within it. These meanings are composed together in some way to form the meaning of the sentence as a whole.

The natural question that arises now is, what is the nature of this composition? Take a very simple example: “Epstein died.” According to compositionality, the meaning of “Epstein died” depends only on the meanings of “Epstein” and “died”. That seems pretty reasonable. What about: “Epstein died suspiciously”? How do we compose the meanings of the individual words when there are three?

One proposal is to compose all three simultaneously. That’s possible, but a simpler framework would have us build up the meanings of our sentences iteratively, composing two units of meaning at a time until we’ve generated the entire sentence’s meaning.

Let me now introduce some notation that allows us to say this compactly. If X is some word, phrase, or sentence, we’ll denote the meaning of X as ⟦X⟧. Then the principle of binary compositionality is just that there’s some function F such that ⟦X Y⟧ = F(⟦X⟧, ⟦Y⟧).

There are two major questions that arise at this point.

First, in which order should we compose our units of meaning? Should we combine “Epstein” with “died” first, and then combine that with “suspiciously”? Or should it be “Epstein” and “suspiciously” first, then that with “died”? Or should we combine “Epstein” with the combination of “suspiciously” and “died”?

One might suggest here that the order actually doesn’t matter; no matter what order we combine the meanings in, we should still get the same meaning. The problem with this is that “The Clintons killed Epstein” has a different meaning than “Epstein killed the Clintons.” If order of composition didn’t matter, then we’d expect these to mean the same thing.

Second, how exactly does composing two meanings work? Is there a single rule for composition, or are there multiple different rules that apply in different contexts? It would be most elegant if we could find a single universal rule for generating meanings of complicated phrases from simple ones, but maybe that’s overambitious.

For instance, you might model the meaning of “died” as a set of objects, namely all those objects that died at some moment in the past, and the meaning of “Epstein” as one particular object in the universe. Then we might have our composition rule be the following: ⟦Epstein died⟧ will be a truth value, and it will be True if and only if the object denoted by “Epstein” is within the set of objects denoted by “died”. So in this framework, ⟦X Y⟧ = True if and only if ⟦X⟧ ∈ ⟦Y⟧.
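Here's that set-membership rule in code (the particular individuals and sets are invented for the example):

```python
# An extensional toy model: the meaning of "died" is the set of things
# that died, and the meaning of "Epstein" is a single individual.
died = {"epstein", "socrates", "caesar"}
epstein = "epstein"

def compose(x_meaning, y_meaning):
    """The composition rule ⟦X Y⟧ = True iff ⟦X⟧ ∈ ⟦Y⟧."""
    return x_meaning in y_meaning

print(compose(epstein, died))  # True: "Epstein died" comes out true
```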

This works nicely for “Epstein died”. But what about “Epstein died suspiciously”? Now we have two compositions to do, and the order of composition will matter. The problem is that no matter how we compose things, it seems not to work. Suppose that we combine “died” and “suspiciously” first, then combine “Epstein” with that. Using our model, ⟦died suspiciously⟧ will be True if and only if ⟦died⟧ ∈ ⟦suspiciously⟧, which is already a little bit weird. But even worse, ⟦Epstein died suspiciously⟧ will be True if and only if ⟦Epstein⟧ ∈ ⟦died suspiciously⟧. But what would it mean for the object denoted by “Epstein” to be an element of a truth value? It looks like in this framework, most three-word sentences end up becoming vacuously false.

Anyway, the last two paragraphs only show us that one particular attempt to formalize composition fails to be universal. It doesn’t show that it’s impossible in general. In fact, we’ll end up doing pretty well with a small set of composition rules centered around function application. The idea can be very simply phrased as: ⟦X Y⟧ = ⟦X⟧(⟦Y⟧). And in particular, the meaning of “Epstein died suspiciously” will be ⟦suspiciously⟧(⟦died⟧)(⟦Epstein⟧). And that’s enough warm-up! Next we’ll explore this idea further and dive into our Montague-style system.
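Here is a sketch of how composition by function application handles the three-word sentence (the sets of entities are invented for the example):

```python
# Meanings as functions, composed by application: ⟦X Y⟧ = ⟦X⟧(⟦Y⟧).
died_set = {"epstein", "socrates"}   # things that died
suspicious_set = {"epstein"}         # things whose deaths were suspicious

epstein = "epstein"
died = lambda x: x in died_set       # ⟦died⟧: entity -> truth value
suspiciously = lambda p: (lambda x: p(x) and x in suspicious_set)  # predicate modifier

# ⟦suspiciously⟧(⟦died⟧)(⟦Epstein⟧):
print(suspiciously(died)(epstein))      # True
print(suspiciously(died)("socrates"))   # False: died, but not suspiciously
```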