No more than half of anti-sets are big

Background on anti-set theory here.

Suppose that somebody tells you that they have in their hands a model of anti-set theory containing 50 anti-sets. The axiom of anti-infinity gives us some nice immediate restrictions on how large these anti-sets can be. As a quick reminder, the axiom of anti-infinity tells us that every anti-set X must contain another anti-set Y that has a successor that’s not in X. The successor of X is defined to be X ⋃ {X} (an anti-set containing everything in X plus X itself). One big difference between successors in ZFC and successors in anti-set-theory is that anti-sets aren’t always guaranteed to have successors. Another is that anti-sets can be their own successors (this happens if the anti-set contains itself).

Ok, so anti-infinity immediately tells us that there can be no universal set (i.e. a set containing the entire universe). Since every set X must contain a set that has a successor outside X, no set can contain everything, or else it would contain all the successors!

So there are no anti-sets of size 50. How about anti-sets of size 49? It’s recently been proved that there can only be at most 25 sets of size 49. And in general, in a universe of size N no more than half of the anti-sets can be size N-1. Let’s prove it!

First, a naming convention: in a universe of size N, let’s call any sets of size N-1 big sets, and any other sets small sets. Big sets are as big as an anti-set can be. To satisfy anti-infinity, any set X must contain an element Y that has a successor that’s not in X. We’ll call Y the anti-infinity-satisfier for X.

Now, some observations. Suppose that a big set X has a successor. This successor is either size N-1 (if X contains itself) or size N (if not). But no set is size N! So any big set that has a successor, must be its own successor. But this means that big sets cannot be anti-infinity-satisfiers. If a big set X was an anti-infinity-satisfier for some other set Y, then its successor could not be in Y. But its successor is X, and X is in Y!

What this means is that every set must contain a small set as its anti-infinity-satisfier. Now, consider any big set X. X must contain some small set z that serves as an anti-infinity-satisfier for it. But for z to serve as an anti-infinity-satisfier, z must have a successor that’s not in X. What this means is that X must be missing z’s successor.

In other words, every big set must be missing at least one small set’s successor. But remember, big sets are size N-1. This means that they contain every set in the universe besides one. So now we know that the “missing set” that characterizes each big set is always a successor of a small set!

Let’s say that our universe contains k small sets. Then there are at most k successors of small sets. Any big set must be the set of all sets besides one of these successors. Thus by extensionality, there are at most k big sets. Therefore there can never be more big sets than small sets!

And that’s the result: In any finite universe, at most half of the anti-sets are big.

This resolves a conjecture that I posted earlier as an open question: in an N-element model, are there always more than three small sets? The answer is yes. If we had fewer than three small sets, then we’d have to also have fewer than three big sets. This means that our universe is at most size four (two small sets and two big sets). But no four-element models exist! QED.

A Dating Puzzle, and Bayesian Experimental Design

A puzzle for you!

A group of six men (A, B, C, D, E, and F) and six women (a, b, c, d, e, and f) are chosen so that everybody has a perfect match of the opposite gender. They are given the following clues:

  1. E and e are a match.
  2. A and c are a match.
  3. The pairing (A,b), (B,a), (C,c), (D,d), (E,e), (F,f) contains exactly 2 matches.
  4. The pairing (A,c), (B,d), (C,a), (D,b), (E,e), (F,f) contains exactly 2 matches.
  5. The pairing (A,c), (B,f), (C,b), (D,d), (E,e), (F,a) contains exactly 3 matches.

If they can figure out everybody’s perfect matches, the group will get a million dollars to share! Can they figure it out?

If you figure that one out, then here’s the next level, with ten pairs and twelve clues!

A group of ten men (ABCDEFGHIJ) and ten women (abcdefghij) are chosen so that everybody has a perfect match of the opposite gender. They are given the following twelve clues:

  1. F and f are not a match
  2. J and h are not a match
  3. B and e are not a match
  4. D and d are a match
  5. H and c are a match
  6. The pairing (Ai Bb Ca Dd Ee Fc Gj Hg If Jh) contains exactly 2 matches.
  7. The pairing (Af Be Cg Dj Eh Fd Gi Hc Ia Jb) contains exactly 4 matches.
  8. The pairing (Af Be Ca Dd Ej Fh Gi Hb Ig Jc) contains exactly 2 matches.
  9. The pairing (Aa Bc Ci Dd Eg Fj Gf Hb Ih Je) contains exactly 2 matches.
  10. The pairing (Af Bi Ce Dd Eh Fg Gj Hc Ia Jb) contains exactly 5 matches.
  11. The pairing (Af Ba Cb Dd Eh Fi Gj Hc Ie Jg) contains exactly 5 matches.
  12. The pairing (Af Bi Ch Dd Eb Fg Gj Hc Ia Je) contains exactly 7 matches.

Can you help them get their million dollars?

✯✯✯

Some background on the puzzle:

I didn’t actually come up with it out of nowhere, I encountered in the wild! A few days ago I started watching a new Netflix reality show called “Are You The One?” The premise of the show: ten men and ten women are paired up via a matchmaking algorithm, and if they can all figure out their other halves after living together for a month, then they win one million dollars. It’s an extremely corny show, but there was one aspect of it which I found pretty interesting: it’s a perfect setting for Bayesian experimental design! Me being me, I spent the whole show thinking about the math of how contestants get information about their potential “perfect matches”.

Let’s start with a very basic look at the underlying math. Ten pairs, chosen from two groups of ten, gives ten factorial possible matchings. 10! is 3,628,800, which is a lot of possibilities. If you were to randomly choose the ten matches, with no background information, you’d have a .000028% chance of getting everything right. This is about as likely as you are to correctly guess the outcome of 22 tosses of an unbiased coin in a row!

Of course, the show isn’t so cruel as to force them to just guess with no information. For one thing, they have a whole month to get to know each other and gather lots of that subtle tricky-to-quantify social evidence. But more importantly, every episode they get two crucial pieces of information:

First, they get to choose one pair (a man and a woman) to go into the “Truth Booth”. The group is then informed whether these two are a genuine match or not.

And second, at the end of each episode the entire group gathers together, everybody pairs off with one another, and they are told how many pairs they got right (though not which ones). On the final episode, the matching they choose determines whether they get the million dollars or not.

I call the first type of test the individual pair test and the second the group test. And this is where the two types of clues in the above puzzles come from! The clues for the second puzzle are actually just a translation of the tests that the group decided to do in the first seven episodes. (So if you successfully solved it, then feel good about yourself for being the type of person that would have won the million.) Interestingly, it turns out that by the seventh episode they could had already figured out everybody’s perfect match, but it took them three more episodes to get to that point! Silly non-logically-omniscient humans.

Putting aside the puzzle we started with, the show setup naturally lends itself to some interesting questions. First of all, is there a strategy that guarantees that the group will get the million dollars by the tenth episode? In other words, is ten pair tests and nine group tests sufficient to go from 3,628,800 possibilities to 1?

I am not yet sure of the answer. The fact that the group in season one managed to narrow it down to a single possible world with only seven episodes seems like evidence that yes, those 19 tests do provide enough evidence. In addition, there have been eight seasons and in only one did the group fail to find the correct matches. (And in the season whose group failed, they started with eleven pairs – multiplying the number of possible worlds by 11 – and used two fewer pair tests than previous seasons.)

However, it’s worth keeping in mind that while the group of twenty individuals was far from logically omniscient, they did have a great wealth of additional social evidence, and that evidence may have allowed them to make choices for their tests that yielded much more expected information than the optimal test in the absence of social information. (I’m also suspicious that the producers might be a little more involved in the process than they seem. There are a few seasons where the group is consistently getting 2 or 3 matches until the last episode where they suddenly get everything right. This happens in season 3, and by my calculations there were still four possible worlds consistent with all their information by the time of the last matching ceremony!)

We can also ask the Bayesian experimental design question. What’s the optimal pair test/group test to perform, given the set of remaining possible worlds?

We can solve this fairly easily using a brute force approach. Consider the problem of finding the optimal pair test. First we actually generate the set of all possible worlds (where each world is a specification of all ten matches). Let N be the size of this set. Then we look through all possible pairs of individuals (i, j), and for each pair calculate the number of worlds in which this pair is a match. Call this quantity nij. Then the expected number of worlds remaining after performing a pair test on (i, j) is:

nij Pr(i and j are a match) + (N – nij) Pr(i and j are not a match) = nij2/N+ (N – nij)2/N

So we simply search for the pair (i, j) that minimizes nij2 + (N – nij)2. This is equivalent to maximizing nij (N – nij): the product of the number of worlds where (i, j) is a match and the number of worlds where (i, j) is not a match.

We do pretty much the same thing for the group test. But here’s where we run into a bit of trouble: though our algorithm is guaranteed to return an optimal test, the brute force approach has to search through 10!10! possibilities, and this is just not feasible. The time taken to solve the problem grows exponentially in the number of worlds, which grows exponentially in the number of individuals.

So we have another interesting question on our hands: Is there an efficient algorithm for calculating the optimal pair/group test? In the more general setting, where the tests are not restricted to just being pair tests or group tests, but can be a test of ANY boolean expression, this is the question of whether SAT can be solved efficiently. And given that SAT is known to be NP-complete (it was in fact the first problem to be proven NP-complete!), this more general question ends up being equivalent to whether P = NP!

Formal Semantics 1: Historical Prelude and Compositionality

English is really complicated. For a long time, logicians looking at natural languages believed that there could be no formal system detailing their grammar and semantics. They resigned themselves to extremely simple idealized fragments of English, like propositional logic (formalizing “and”, “not”, and “or”) and first-order logic (formalizing “every”, “some”, and “is”).

The slogan of the time was “ordinary language has no logic” (Bertrand Russell and Peter Strawson). Chomsky famously argued that the languages invented by logicians were too artificial and entirely unlike natural languages, and that therefore the methods of logicians simply couldn’t be applied to this more complex realm.

This attitude has changed over time. Perhaps the most important figure in the “logic of natural language” movement is Richard Montague, a student of the giant of logic Alfred Tarski. The first line of his paper English as a Formal Language reads “I reject the contention that an important theoretical difference exists between formal and natural languages”, and he follows this up by more or less single-handedly invented formal semantics, now a thriving field. Hilariously, Montague apparently saw this work as child’s play, writing:

I (…) sat down one day and proceeded to do something that I previously regarded, and continue to regard, as both rather easy and not very important — that is, to analyze ordinary language.

(This had to hit hard for linguists of his time.)

Alright, enough prologue. In the next few posts I want to describe a naive first pass at formalizing a fairly substantial fragment of English, modeled off of Montague semantics. The key concept throughout will be the notion of compositionality, which I’ll briefly describe now.

Compositionality

Compositionality is all about how to construct the meaning of phrases from their smaller components. Take a sentence like “The cat sat on the mat.” The meaning of this sentence clearly has something to do with the meanings of “the cat” and “sat on the mat”. Similarly, the meaning of “sat on the mat” must have something to do with the meanings of “sat”, “on”, “the”, and “mat”.

The compositionality thesis says that this is all that determines the meaning of “the cat sat on the mat.” In other words, the meaning of any phrase is a function of the meanings of the individual words within it. These meanings are composed together in some way to form the meaning of the sentence as a whole.

The natural question that arises now is, what is the nature of this composition? Take a very simple example: “Epstein died.” According to compositionality, the meaning of “Epstein died” depends only on the meanings of “Epstein” and “died”. That seems pretty reasonable. What about: “Epstein died suspiciously”? How do we compose the meanings of the individual words when there are three?

One proposal is to compose all three simultaneously. That’s possible, but a simpler framework would have us build up the meanings of our sentences iteratively, composing two units of meaning at a time until we’ve generated the entire sentence’s meaning.

Let me now introduce some notation that allows us to say this compactly. If X is some word, phrase, or sentence, we’ll denote the meaning of X as ⟦X⟧. Then the principle of binary compositionality is just that there’s some function F such that ⟦X Y⟧ = F(⟦X⟧, ⟦Y⟧).

There’s two major questions that arise at this point.

First, in which order should we compose our units of meaning? Should we combine “Epstein” with “died” first, and then combine that with “suspiciously”? Or should it be “Epstein” and “suspiciously” first, then that with “died”? Or should we combine “Epstein” with the combination of “suspiciously” and “died”?

One might suggest here that the order actually doesn’t matter; no matter what order we combine the meanings in, we should still get the same meaning. The problem with this is that “The Clintons killed Epstein” has a different meaning than “Epstein killed the Clintons.” If order of composition didn’t matter, then we’d expect these to mean the same thing.

Second, how exactly does composing two meanings work? Is there a single rule for composition, or are there multiple different rules that apply in different contexts? It would be most elegant if we could find a single universal rule for generating meanings of complicated phrases from simple ones, but maybe that’s overambitious.

For instance, you might model the meaning of “died” as a set of objects, namely all those objects that died at some moment in the past, and the meaning of “Epstein” as one particular object in the universe. Then we might have our composition rule be the following: ⟦Epstein died⟧ will be a truth value, and it will be True if and only if the object denoted by “Epstein” is within the set of objects denoted by “died”. So in this framework, ⟦X Y⟧ = True if and only if ⟦X⟧ ∈ ⟦Y⟧.

This works nicely for “Epstein died”. But what about “Epstein died suspiciously”? Now we have two compositions to do, and the order of composition will matter. The problem is that no matter how we compose things, it seems not to work. Suppose that we combine “died” and “suspiciously” first, then combine “Epstein” with that. Using our model, ⟦died suspiciously⟧ will be True if and only if ⟦died⟧ ∈ ⟦suspiciously⟧, which is already a little bit weird. But even worse, ⟦Epstein died suspiciously⟧ will be True if and only if ⟦Epstein⟧ ∈ ⟦died suspiciously⟧. But what would it mean for the object denoted by “Epstein” to be an element of a truth value? It looks like in this framework, most three-word sentences end up becoming vacuously false.

Anyway, the last two paragraphs only show us that one particular attempt to formalize composition fails to be universal. It doesn’t show that it’s impossible in general. In fact, we’ll end up doing pretty well with a small set of composition rules centered around function application. The idea can be very simply phrased as: ⟦X Y⟧ = ⟦X⟧(⟦Y⟧). And in particular, the meaning of “Epstein died suspiciously” will be ⟦suspiciously⟧(⟦died⟧)(⟦Epstein⟧). And that’s enough warm-up! Next we’ll explore this idea further and dive into our Montague-style system.

Two more short and sweet proofs of propositional compactness

Proof 1 (countable language)

Let S be a countable set of sentences in a propositional language with atomic sentences p0, p1, p2, ….  Assume that S is finitely satisfiable. We want to build a truth assignment V that satisfies S.

We’ll assign truth values to the atomic sentences one at a time. Vn will be the partial truth assignment after assigning the first n truth values. If W agrees with Vn on its domain, then call W an extension of Vn.

We’ll now prove by induction that for each n, every finite subset of S is satisfied by some extension of Vn.

First of all, V0 is just the empty set, and every function is an extension of the empty set. So the hypothesis follows trivially from the finite satisfiability of S. 

Now, assume that every finite subset of S is satisfied by some extension of Vn. We’ll show that the same holds of Vn+1.

Suppose not. Then both of the following must be true:
(1) Some finite subset S0 of S is not satisfied by any extension of Vn ⋃ {(pn, T)}
(2) Some finite subset S1 of S is not satisfied by any extension of Vn ⋃ {(pn, F)}

S0 ⋃ S1 is not satisfied by any extension of Vn ⋃ {(pn, T)}, or any extension of Vn ⋃ {(pn, F)}. So it’s not satisfied by any extension of Vn, contradicting the inductive hypothesis since S0 ⋃ S1 is a finite set. This proves that every finite subset of S is satisfied by some extension of Vn, for each n.

Define V = U {Vn : n ∈ ω}. We now show that V satisfies S. Consider any sentence φ ∈ S. φ contains a finite number of atomic sentences, so there’s some n large enough that Vn assigns truth values to all sentence letters in φ.  {φ} is a finite subset of S, so some extension of Vn satisfies it. Every extension of Vn agrees on the assignments to the atomic sentences that appear in φ.  So since some extension of Vn makes φ true, all extensions of Vn must make φ true.  In particular, V makes φ true.

Proof 2 (any language)

Suppose S is a finitely satisfiable set of sentences, and let L be the set of atomic sentences.

A partial function g: L → {T, F} is called good if each finite subset of S is satisfied by some extension of g.

The good partial functions are a poset under ⊆, and since S is finitely satisfiable, ∅ is good. Furthermore, the union of any chain of good functions is a good function. Thus every chain has an upper bound, namely its union.

By Zorn’s lemma, there’s a maximal good function g. Since g is maximal, the domain of g is all of L.

We now show that g satisfies every element of S.
… Suppose φ ∈ S.
… Since g is good, it has an extension satisfying φ.
… But g already has all of L as its domain, so g satisfies φ.

So S is satisfiable!

A Compact Proof of the Compactness Theorem

I’ve written up a proof of the compactness theorem before, but I recently looked it over and think that the proof can be expressed more, heh heh, compactly, than before.

So, what is the compactness theorem? It is the statement that if a set of statements is finitely satisfiable (each of its finite subsets has a model), then it’s satisfiable. The converse of this (a satisfiable set of statements is also finitely satisfiable) is trivial: a model of a set of sentences is also a model of every subset of that set. The following proof will be for propositional logic, but can be easily extended to first order logic.

The short version

Suppose that a set of sentences A is finitely satisfiable. We’ll extend A to a larger set B by giving A an “opinion” on every sentence. We can build up this extension in a series of stages: B0 is just A. Now, take any sentence a. If B0 ⋃ {a} is finitely satisfiable, define B1 to be B0 ⋃ {a}. Otherwise, define B1 to be B0 ⋃ {¬a}. Either way, B1 will be finitely satisfiable, because B0 cannot be inconsistent with both a and ¬a. When we’ve gone through every sentence, we take the union of all these extensions to form B. B is finitely satisfiable, since every finite subset of B is also a finite subset of one of the extensions that make it up. Now, we define a truth assignment V that assigns True to each propositional variable p if and only if p is in B. V satisfies B (which you can show by induction on sentences), and since A is a subset of B, V also satisfies A. So A is satisfiable.

The long(er) version

Suppose that a set of sentences A is finitely satisfiable. If a set of sentences A’ is finitely satisfiable, then for any sentence a, at least one of A’ ⋃ {a} and A’ ⋃ {¬a} is finitely satisfiable. If neither were, then we’d have two finite sets, one that entails ¬a and the other that entails a, and the union of these would be an unsatisfiable finite subset of A’.) So given any well-ordering of the sentences of the language, we can extend A one sentence at a time, maintaining finite satisfiability at each step. The union of all these extensions, call it B, is still finitely satisfiable, because any finite subset of B is also a finite subset of one of the extensions. B is also complete: for any sentence b either b ∈ B or ¬b ∈ B.

Now, define the truth assignment V as follows: for any atomic sentence b, V(b) = True iff b ∈ B. V satisfies B, which we prove by induction on sentences:

  • If b is an atomic sentence, then V(b) = True iff b ∈ B, by construction.
  • If b = ¬c for some c (for which V(c) = True iff c ∈ B), then V(b) = True iff V(c) = False iff c ∉ B iff b ∈ B.
  • If b = c ∧ d for some c and d that satisfy the induction hypothesis, then V(b) = V(c ∧ d) = True iff V(c) = V(d) = True iff c ∈ B and d ∈ B. If both c and d are in B, then ¬(c ∧ d) can’t be in B by finite satisfiability, so c ∧ d ∈ B by completeness of B. And if c ∧ d ∈ B, then neither ¬c nor ¬d can be in B by finite satisfiability. So c and d are both in B by completeness of B.

This proof by induction covers the connectives ¬ and ∧, with which you can express all other connectives, so this shows that V satisfies B. And since B is a superset of A, V satisfies A as well. This shows that any finitely satisfiable set of sentences is also satisfiable. Note that we used the axiom of choice implicitly with the instruction to well-order the language. As the language can be any size, this is equivalent to the well-ordering principle, which is equivalent in ZF to the axiom of choice. The same sort of assumption arises in the proof of the completeness theorems for first-order and propositional logics. If you reject choice, then you should also be skeptical that propositional logic and first-order logics are complete!

ZFC, and getting the right answer by accident

There’s something I’m confused about with regards to ZFC. We already know that no first order theory allows you to perfectly capture the semantics of concepts like the natural numbers or finitely many. The best we can do if restricted to first-order logic (which is where most of modern mathematics tries to stay) is to have theories with surrogates for those concepts.

For instance, there’s this object called ω that stands in for the natural numbers in ZFC, even though there are all these models of ZFC in which ω contains many more objects besides the natural numbers (in fact, uncountably many in some models). And “X is finite” is replaced by “there’s a bijection from X to an element of ω.”

In formalizing these surrogate concepts, we make sure to not include any false statements about the concepts, which gives us a type of soundness of our surrogate concept. I.e., ZFC isn’t going to prove things about ω that are false of the set of natural numbers, because in one model ω is the set of natural numbers.

But it doesn’t give us completeness. We aren’t going to be able to prove ALL the true first-order sentences about the natural numbers, or about the concept of finiteness. (This is of course just a product of the first-order nature of the theories in which we’re defining these surrogates.)

So in each of these cases there will be true statements involving the concept that our theory will be agnostic about. We can look for “really good” surrogates, surrogates which allow us to prove most of the important and interesting true statements about the concept, and only fail in very abstract and unusual settings. The degree to which we can make good surrogates is the degree to which a first-order thinker can make sense of and usefully apply non-first-order concepts. (A first-order thinker being one that has no built-in understanding of non-first-order concepts.)

So one general question is: how good is a given surrogate? And another is, how do we know based on the axioms how good of a surrogate we’ll be getting? This is the thing I’m confused about.

In ZFC, there’s this weird phenomenon of the theory “getting the right answers accidentally.” It’s a little tough to put into words, but here’s an example:

ZFC can prove that |P(X)| > |X| for all X. So for instance ZFC can prove that |P(ω)| > |ω|. Meaning that ZFC can prove that the power set of the naturals is uncountable. But ZFC has countable models! (As does every first-order theory.) In those models, the power set of the naturals is NOT uncountable.

First order logic is sound, so what’s going on ISN’T that ZFC is proving a sentence that’s false in some of its models. It’s that the sentence is false in that model, if interpreted to be about the actual concept, and true in that model if interpreted to be about the surrogate concept. The surrogate for “P(ω) is uncountable” in ZFC is “there’s no bijection from P(ω) to ω”. And in the countable models of ZFC, the bijection that would prove the equinumerosity of ω and P(ω) is missing! So in those models, it’s actually TRUE that “there’s no bijection from P(ω) to ω”, even though P(ω) and ω really do have the same number of elements.

This is the sort of subtle nonsense you get used to in mathematical logic. Two accidents cancel out: first ZFC makes the mistake of having models where P(ω) is countable, and second it makes the mistake of losing track of the bijection from P(ω) and ω. And as a result, ZFC is able to prove the correct thing.

This seems really weird to me, and it’s an unsettling way for a supposed foundations of mathematics to be getting the right answers. This type of thing happens a lot, and it feels like we keep getting “lucky” in that our unintended interpretations of a sentence interfere with each other and cancel out their problematic features. It makes me wonder why we should be confident that ZFC will continue giving us the right answers (as opposed to being agnostic on important basic questions). And in fact we do have some examples of important and basic questions that ZFC is agnostic on, most dramatically that if |X| < |Y|, then |P(X)| < |P(Y)|!

It’s not that I doubt that ZFC’s surrogates for non-first order concepts end up allowing us to prove an enormous amount of the true facts about these concepts, because they clearly do. But is there some principled reason we can give for WHY the axioms of ZFC lead us to such a robust framework for mathematics in which we can prove many true statements?

One thing that suggests this is the power of the axiom of replacement. It’s just an extremely strong and useful axiom that allows us to prove a whole lot. But that doesn’t seem to help explain the “right-by-accident” phenomenon. So what does?

Polish Notation and Garden-Path Sentences

Polish notation is a mathematical notation system that allows you to eliminate parentheses without ambiguity. It’s called “Polish” because the name of its Polish creator, Jan Łukasiewicz, was too difficult for people to pronounce.

A motivating example: Suppose somebody says “p and q implies r”. There are two possible interpretations of this: “(p and q) implies r” and “p and (q implies r)”. The usual way to disambiguate these two is to simply add in parentheses like I just did. Another way is to set an order-of-operations convention, like that “and” always applies before “implies”. This is what’s used in basic algebra, and what allows you to write 2 + 2 ⋅ 4 without any fear that you’ll be interpreted as meaning (2 + 2) ⋅ 4.

Łukasiewicz’s method was to make all binary connectives into prefixes. So “A and B” because “and A B”, “P implies Q” becomes “implies P Q”, and so on. In this system, “(p and q) implies r” translates to “implies and p q r”, and “p and (q implies r)” translates to “and p implies q r”. Since the two expressions are different, there’s no need for parentheses! And in general, no ambiguity ever arises from lack of parentheses when using Polish notation.

If this is your first time encountering Polish notation, your first reaction might be to groan and develop a slight headache. But there’s something delightfully puzzling about reading an expression written in Polish notation and trying to understand what it means. Try figuring out what this means: “implies and not p or q s r”. Algebra can be written in Polish notation just as easily, removing the need for both parentheses AND order-of-operations. “2 + 2 = 4” becomes “+ 2 2 = 4”, or even better, “= + 2 2 4”.

Other binary connectives can be treated in Polish notation as well, creating gems like: “If and you’re happy you know it clap your hands!” “When life is what happens you’re busy making plans.” “And keep calm carry on.” “Therefore I think, I am.” (This last one is by of the author the Meditations). Hopefully you agree with me that these sentences have a nice ring to them, though the meaning is somewhat obscured.

But putting connectives in front of the two things being connected is not unheard of. Some examples in English: “ever since”, “because”, “nonwithstanding”, “whenever”, “when”, “until”, “unless”. Each of these connects two sentences, and yet can appear in front of both. When we hear a sentence like “Whenever he cheated on a test the professor caught him”, we don’t have any trouble parsing it. (And presumably you had no trouble parsing that entire last sentence either!) One could imagine growing up in a society where “and” and “or” are treated the same way as “ever since” and “until”, and perhaps in this society Polish notation would seem much more natural!

Slightly related to sentential connectives are verbs, which connect subjects and objects. English places its verbs squarely between the subject and the object, as does Chinese, French, and Spanish. But in fact the most common ordering is subject-object-verb! 45% of languages, including Hindi, Japanese, Korean, Latin, and Ancient Greek, use this pattern. So for instance, instead of “She burned her hand”, one would say “she her hand burned”. This is potentially weirder to English-speakers than Polish notation; it’s reverse Polish notation!

9% of languages use Polish notation for verbs (the verb-subject-object pattern). These include Biblical Hebrew, Arabic, Irish, and Filipino. In such languages, it would be grammatical to say “Loves she him” but not “She loves him”. (3% of languages are VOS – loves him she – 1% are OVS – him loves she – and just a handful are OSV – him she loves).

Let’s return to English. Binary prepositions like “until” appear out front, but they also swap the order of the two things that they connect. For instance, “Until you do your homework, you cannot go outside” is the same as “You cannot go outside until you do your homework”, not “You do your homework until you cannot go outside”, which sounds a bit more sinister.

I came up with some examples of sentences with several layers of these binary prepositions to see if the same type of confusion as we get when examining Polish notation for “and” or “implies” sets in here, and oh boy does it.

Single connective
Since when the Americans dropped the bomb the war ended, some claimed it was justified.

Two connectives, unlayered
Since when the Americans dropped the bomb the war ended, when some claimed it was an atrocity others argued it was justified.

Still pretty readable, no? Now let’s layer the connectives.

One layer
Whenever he was late she would weep.
She would weep whenever he was late.

Two layers
Since whenever he was late she would weep, he hurried over.
He hurried over, since she would weep whenever he was late.

Three layers
Because since whenever he was late she would weep he hurried over, he left his wallet at home.
He left his wallet at home, because he hurried over since she would weep whenever he was late.

Four layers
Because because since whenever he was late she would weep he hurried over he left his wallet at home, when he was pulled over the officer didn’t give him a ticket.
The officer didn’t give him a ticket when he was pulled over, because he left his wallet at home because he hurried over since she would weep whenever he was late.

Five layers
When he heard because because since whenever he was late she would weep he hurried over he left his wallet at home, when he was pulled over the officer didn’t give the man a ticket, the mayor was outraged at the lawlessness.
The mayor was outraged at the lawlessness when he heard the officer didn’t give the man a ticket when he was pulled over because he left his wallet at home because he hurried over since she would weep whenever he was late.

Read that last one out loud to a friend and see if they believes you that it makes grammatical sense! With each new layer, things become more and more… Polish. That is, indecipherable. (Incidentally, Polish is SVO just like English). Part of the problem is that when we have multiple layers like this, phrases that are semantically connected can become more and more distant in the sentence. It reminds me of my favorite garden-path sentence pattern:

The mouse the cat the dog chased ate was digested.
(The mouse that (the cat that the dog chased) ate) was digested.
The mouse (that the cat (that the dog chased) ate) was digested.

The phrases that are meant to be connected, like “the mouse” and “was digested” are sandwiched on either side of the sentence, and can be made arbitrarily distant by the addition of more “that the X verbed” clauses.

Does anybody know of any languages where “and” comes before the two conjuncts? What about “or”? English does this with “if”, so it might not be too much of a stretch.

Defining uncountability

Most of the shocking results in mathematical logic have to do with how limited we are in terms of producing good deductive systems for mathematical concepts. (Good here means sound, complete, finitary, and computable.) No logic with a good deductive system can categorically define the natural numbers. Adding a “for finitely many” quantifier to first-order logic makes it impossible to have a good deductive system. Ditto for a “for all subsets of” quantifier. And so on.

Once you’ve accustomed yourself to the extreme logical limitations imposed on us, it comes as a bit of a shock when you realize that we still have the ability to describe ANY complicated mathematical concept. What I learned recently was that you can extend first-order logic by adding a “for uncountably many” quantifier, and still have a good deductive system for the logic! It says something that I was so impressed by this fairly moderate ability to distinguish between the countable and the uncountable. But it certainly does feel puzzling that while you can distinguish between the countable and the uncountable, you can’t distinguish between the finite and the infinite, which seems like a much more basic and fundamental division!

What’s even cooler is that the deductive system for this extended logic is extremely simple. It amounts to taking the axioms and inference rules of first-order logic, and then tacking on four additional axiom schemas. We’ll write “for uncountably many x, Φ(x)” as Ux Φ(x).

  1. ∀y∀z¬Ux (x = y ∨ x = z)
  2. ∀x (Φ(x) → Ψ(x)) → (Ux Φ(x) → Ux Ψ(x))
  3. Ux Φ(x) ↔ Uy Φ(y), where y isn’t free in Φ(x)
  4. Uy∃x Φ(x,y) → (∃xUy Φ(x,y) ∨ Ux∃y Φ(x,y))

Each of these axioms schemas is conceptually quite easy to make sense of. The first says that for any two elements y and z, there’s not uncountably many things that are each equal to one or the other. In other words, uncountable is bigger than two.

The second says that if the set of things that satisfy Φ is a subset of the set of things that satisfy Ψ, and uncountably many things satisfy Φ, then uncountably many things satisfy Ψ. In other words, if a set has an uncountable subset, then it must be uncountable itself.

The third is obvious enough that it doesn’t need explanation. The fourth, on the other hand, does. I find it helpful to think of Φ(x,y) as saying “x points at y.” Then Uy∃x Φ(x,y) becomes “uncountably many things are pointed at”, ∃xUy Φ(x,y) becomes “some person points at uncountably many things”, and Ux∃y Φ(x,y) becomes “there are uncountably many people pointing.”

So the fourth axiom just says that if uncountably many things are pointed at, then either somebody is pointing at uncountably many things, or there are uncountably many people pointing at things. Otherwise you’d only have countably many people pointing at countably many things each, and that’s not enough to get uncountably many things pointed at! So this fourth axiom can really be understood as saying that the union of countably many countable sets is countable.

✯✯✯

There’s a caveat to the above claims. This is that first-order logic extended by the “for uncountably many” quantifier only has a good deductive system IF the language is restricted to only allow countably many constant symbols. Here’s the reasoning:

First of all, if a logic has a sound, complete and finitary deductive system, then it is compact. Discussed here.

Second, if a logic L is compact and a set of sentences Σ in L has a countably infinite model, then Σ also has an uncountable model. Why? Simply append to L an uncountable set A of constant symbols, and add to Σ an uncountable set of sentences declaring that each of these constants refer to a distinct object. Now we have an uncountable model of Σ’s extension by compactness (the countably infinite model is a model of every finite subset of Σ’s extension), and this uncountable model must also satisfy Σ. (This is the only place where we make use of uncountably many constant symbols.)

Third, if a logic L is compact, and a set of sentences Σ in L has models of arbitrarily large finite size, then Σ has infinite models. The proof of this is virtually identical to the last paragraph; add infinitely many constant symbols and declare them all to refer to distinct objects. Every finite subset of this extension of Σ has a model, so by compactness the extension of Σ also has a model. This model is infinite by construction, and as a model of an extension of Σ must also be a model of Σ.

That sets up all the background model theory we need. Now, suppose that FOL+U had a good deductive system that applied even with uncountably many constant symbols. By soundness, completeness, and finiteness, FOL + U would be compact. Consider now the FOL+U theory T consisting of the single axiom ¬Ux (x = x). T cannot have uncountable models, by assumption that the quantifier Ux is really capturing the semantics of “for uncountably many x”.

But if T can’t have uncountable models, then it can’t have a countably infinite models either, by the second result above! So T doesn’t have any infinite models. And then by the third result, T cannot have models of arbitrarily large finite size! This means that asserting “there aren’t uncountably many objects”, we actually end up ruling out all countably infinite models, as well as all models larger than some finite size! This is inconsistent with the claim that the U quantifier is correctly capturing the semantics of “uncountably many”; after all, the negation of “for uncountably many” should be “for countably many”, not “for small enough finite”!

Summing up the argument:

  1. If a logic has a sound, complete, and finitary deductive system, then it is compact.
  2. If a logic L is compact, and a set of sentences Σ in L has a countable model, then Σ has an uncountable model.
  3. If a logic L is compact, and a set of sentences Σ in L has models of arbitrary finite size, then Σ has infinite models.
  4. FOL+U = First order logic + “for uncountably many” is sound, complete, and finitary.
  5. The FOL+U theory T = {¬Ux (x = x)} doesn’t have uncountable models.
  6. By 1 and 4, FOL+U is compact.
  7. By 2 and 6, if a set of sentences Σ in FOL+U has a countable model, then Σ has an uncountable model.
  8. Suppose T had countably infinite models. Then by 7, it would also have to have uncountable models, contradicting 5. So T doesn’t have countably infinite models.
  9. Suppose T had arbitrarily large finite models. Then by 3 and 6, T would have infinite models, contradicting 5 and 8. So T doesn’t have arbitrarily large finite models.
  10. Rephrasing 9, there’s some finite n such that T has no models of cardinality larger than n.

✯✯✯

What’s great about the logic FOL+U is that it allows you to talk about the property of uncountability, so long as your theories are all only countably large (which isn’t too bad a restriction). So for instance, consider the FOL+U theory consisting of the axioms of first-order Peano arithmetic plus one additional axiom: ¬Ux (x = x). This theory successfully rules out all uncountable models! This is something that you couldn’t do in ordinary first order logic, since it obeys the upward Lowenheim-Skolem property (a model of one infinite cardinality implies models of every larger infinite cardinality). The order type of all countable nonstandard models of PA is known (each is ω + ℤ⋅ℚ, a natural number line followed by countably many integer lines arranged like the rationals), and much of the difficulty of the model theory of PA involves the uncountable models. So this extension of PA is a big improvement over first-order PA!

Remember that ZFC has those pesky countable models, in which “the set of all subsets” turns out to be missing some subsets? We can get rid of all of those countable models with the FOL+U theory consisting of the axioms of ZFC + the additional axiom Ux (x = x). I’m pretty sure that this does not fix all the issues with correctly pinning down the notion of the power set — that would be too good to be true — but it is certainly an improvement!

Who would win in a fight, logic or computation?

Which paradigm is stronger? When I say logic here, I’ll be referring to standard formal theories like ZFC and Peano arithmetic. If one uses an expansive enough definition to include computability theory, then logic encompasses computation and wins by default.

If you read this post, you might immediately jump to the conclusion that logic is stronger. After all, we concluded that in first order Peano arithmetic one can unambiguously define not only the decidable sets, but the sets decidable with halting oracles, and those decidable with halting oracles for halting oracles, and so on forever. We saw that the Busy Beaver numbers, while undecidable, are definable by a Π1 sentence. And Peano arithmetic is a relatively weak theory; much more is definable in a theory like ZFC!

On the other hand, remember that there’s a difference between what’s absolutely definable and what’s definable-relative-to-ℕ. A set of numbers is absolutely definable if there’s a sentence that is true of just those numbers in every model of PA. It’s definable-relative-to-ℕ if there’s a sentence that’s true of just those numbers in the standard model of PA (the natural numbers, ℕ). Definable-relative-to-ℕ is an unfair standard for the strength of PA, given that PA, and in general no first order system, is able to categorically define the natural numbers. On the other hand, absolute definability is the standard that meets the criterion of “unambiguous definition”. With an absolute definition, there’s no room for confusion about which mathematical structure is being discussed.

All those highly undecidable sets we discussed earlier were only definable-relative-to-ℕ. In terms of what PA is able to define absolutely, it’s limited to only the finite sets. Compare this to what sets of numbers can be decided by a Turing machine. Every finite set is decidable, plus many infinite sets, including the natural numbers!

This is the first example of computers winning out over formal logic in their expressive power. While no first order theory (and in general no theory in a logic with a sound and complete proof system) can categorically define the natural numbers, it’s incredibly simple to write a program that defines that set using the language of PA. In regex, it’s just “S*0”. And here’s a Python script that does the job:

def check(s):
    if s == '':
        return False
    elif s == ‘0’:
        return True
    elif s[0] == ’s’:
        return check(s[1:])
    else:
        return False

We don’t even need a Turing machine for this; we can build a simple finite state machine:

In general, the sets decidable by computer can be much more complicated and interesting than those absolutely definable by a first-order theory. While first-order theories like ZFC have “surrogates” for infinite sets of numbers (ω and its definable subsets), these surrogate sets include nonstandard elements in some models. As a result, ZFC may fail to prove some statements about these sets which hold true of ℕ but fail for nonstandard models. For instance, it may be that the Collatz conjecture is true of the natural numbers but can’t be proven from ZFC. This would be because ZFC’s surrogate for the set of natural numbers (i.e. ω) includes nonstandard elements in some models, and the Collatz conjecture may fail for some such nonstandard elements.

On top of that, by taking just the first Turing jump into uncomputability, computation with a halting oracle, we are able to prove the consistency of ZFC and PA. Since ZFC is recursively axiomatizable, there’s a program that runs forever listing out theorems, such that every theorem is output at some finite point. We can use this to produce a program that looks to see if “0 = 1” is a theorem of ZFC. If ZFC is consistent, then the program will search forever and never find this theorem. But if ZFC is not consistent, then the program will eventually find the theorem “0 = 1” and will terminate at that point. Now we just apply our halting oracle to this program! If ZFC is consistent, then the halting oracle tells us that the program doesn’t halt, in which case we return “True”. If ZFC is not consistent, then the halting oracle tells us that the program does halt, in which case we return “False”. And just like that, we’ve decided the truth value of Con(ZFC)!

The same type of argument applies to Con(PA), and Con(T) for any T that is recursively axiomatizable. If you’re not a big fan of ZFC but have some other favored system for mathematics, then so long as this system’s axioms can be recognized by a computer, its consistency can be determined by a halting oracle.

Some summary remarks.

On the one hand, there’s a very simple explanation for why logic appears over and over again to be weaker than computation: we only choose to study formal theories that are recursively axiomatizable and have computable inference rules. Of course we can’t do better than a computer using a logical theory that can be built into a computer! If on the other hand, we chose true arithmetic, the set of all first-order sentences of PA that are true of ℕ, to be our de jure theory of mathematics, then the theorems of mathematics could not be recursively enumerated by any Turing machine, or indeed any oracle machine below 0(ω). So perhaps there’s nothing too mysterious here after all.

On the other hand, there is something very philosophically interesting about truths of mathematics that cannot be proven just using mathematics, but could be proven if it happened that our universe gave us access to an oracle for the halting problem. If a priori reasoning is thought to be captured by a formal system like ZFC, then it’s remarkable that there are facts about the a priori (like Con(ZFC)) that cannot possibly be established by a priori reasoning. Any consistency proof cannot be provided by a consistent system itself, and going outside to a stronger system whose consistency is more in doubt doesn’t help at all. The only possible way to learn the truth value of such statements is contingent; we can learn it if the universe contains some physical manifestation of a halting oracle!

Computing truth values of sentences of arithmetic, or: Math is hard

Previously I talked about the arithmetic hierarchy for sets, and how it relates to the decidability of sets. There’s also a parallel notion of the arithmetic hierarchy for sentences of Peano arithmetic, and it relates to the difficulty of deciding the truth value of those sentences.

Truth value here and everywhere else in this post refers to truth value in the standard model of arithmetic. Truth value in the sense of “being true in all models of PA” is a much simpler matter; PA is recursively axiomatizable and first order logic is sound and complete, so any sentence that’s true in all models of PA can be eventually proven by a program that enumerates all the theorems of PA. So if a sentence is true in all models of PA, then there’s an algorithm that will tell you that in a finite amount of time (though it will run forever on an input that’s false in some models).

Not so for truth in the standard model! As we’ll see, whether a sentence evaluates to true in the standard model of arithmetic turns out to be much more difficult to determine in general. Only for the simplest sentences can you decide their truth value using an ordinary Turing machine. And the set of all sentences is in some sense infinitely uncomputable (you’ll see in a bit in what sense exactly this is).

What we’ll discuss is a way to convert sentences of Peano arithmetic to computer programs. Before diving into that, though, one note of caution is necessary: the arithmetic hierarchy for sentences is sometimes talked about purely syntactically (just by looking at the sentence as a string of symbols) and other times is talked about semantically (by looking at logically equivalent sentences). Here I will be primarily interested in the entirely-syntactic version of the arithmetic hierarchy. If you’ve only been introduced to the semantic version of the hierarchy, what you see here might differ a bit from what you recognize.

Let’s begin!

The simplest types of sentences have no quantifiers at all. For instance…

0 = 0
2 ⋅ 2 < 7
(2 + 2 = 4) → (2 ⋅ 2 = 4)

Each of these sentences can be translated into a program quite easily, since +, ⋅, =, and < are computable. We can translate the → in the third sentence by converting it into a conjunction:

## (2 + 2 = 4) → (2 ⋅ 2 = 4)
not(2 + 2 == 4 and not 2 * 2 == 4)

Slightly less simple-looking are sentences with bounded quantifiers:

∀x < 10 (x + 0 = x)
∃x < 100 (x + x = x)
∀x < 5 ∃y < 7 (x > 1 → x⋅y = 12)
∃x < 5 ∀y < x ∀z < y (y⋅z ≠ x)

In each of these examples, the bounded quantifier could in principle be expanded out, leaving us with a finite quantifier-free sentence. This should suggest to us that adding bounded quantifiers doesn’t actually increase the computational difficulty.

We can translate sentences with bounded quantifiers into programs by converting each bounded quantifier to a for loop. The translation slightly differently depending on whether the quantifier is universal or existential:

def Aupto(n, phi):
    for x in range(n):
        if not phi(x):
            return False
    return True
def Elessthan(n, phi):
    for x in range(n):
        if phi(x):
            return True
    return False

Note that the second input needs to be a function; reflecting that it’s a sentence with free variables. Now we can quite easily translate each of the examples, using lambda notation to more conveniently define the necessary functions

## ∀x<10 (x + 0 = x)
Aupto(10, lambda x: x + 0 == x)

## ∃x<100 (x + x = x)
Elessthan(100, lambda x: x + x == x)

## ∀x<5 ∃y<7 ((x > 1) → (x*y = 12))
Aupto(5, lambda x: Elessthan(7, lambda y: not (x > 1 and x * y != 12)))

## ∃x<5 ∀y<x ∀z<y (y⋅z ≠ x)
Elessthan(5, lambda x: Aupto(x, lambda y: Aupto(y, lambda z: y * z != x)))

Each of these programs, when run, determines whether or not the sentence is true. Hopefully it’s clear how we can translate any sentence with bounded quantifiers into a program of this form. And when we run the program, it will determine the truth value of the sentence in a finite amount of time.

So far, we’ve only talked about the simplest kinds of sentences, with no unbounded quantifiers. There are two names that both refer to this class: Π0 and Σ0. So now you know how to write a program that determines the truth value of any Σ00 sentence!

We now move up a level in the hierarchy, by adding unbounded quantifiers. These quantifiers must all appear out front and be the same type of quantifier (all universal or all existential).

Σ1 sentences: ∃x1 ∃x2 … ∃xk Phi(x1, x2, …, xk), where Phi is Π0.
Π1 sentences: ∀x1 ∀x2 … ∀xk Phi(x1, x2, …, xk), where Phi is Σ0.

Some examples of Σ1 sentences:

∃x ∃y (x⋅x = y)
∃x (x⋅x = 5)
∃x ∀y < x (x+y > x⋅y)

And some Π1 sentences:

∀x (x + 0 = x)
∀x ∀y (x + y < 10)
∀x ∃y < 10 (y⋅y + y = x)

We can translate unbounded quantifiers as while loops:

def A(phi):
    x = 0
    while True:
        if not phi(x):
            return False
        x += 1

def E(phi):
    x = 0
    while True:
        if phi(x):
            return True
        x += 1

There’s a radical change here from the bounded case, which is that these functions are no longer guaranteed to terminate. A(Φ) never returns True, and E(Φ) never returns False. This reflects the nature of unbounded quantifiers. An unbounded universal quantifier is claiming something to be true of all numbers, and thus there are infinitely many cases to be checked. Of course, the moment you find a case that fails, you can return False. But if the universally quantified statement is true of all numbers, then the function will have to keep searching through the numbers forever, hoping to find a counterexample. With an unbounded existential quantifier, all one needs to do is find a single example where the statement is true and then return True. But if there is no such example (i.e. if the statement is always false), then the program will have to search forever.

I encourage you to think about these functions for a few minutes until you’re satisfied that not only do they capture the unbounded universal and existential quantifiers, but that there’s no better way to define them.

Now we can quite easily translate our example sentences as programs:

## ∃x ∃y (x⋅x = y)
E(lambda x: E(lambda y: x * x == y))

## ∃x (x⋅x = 5)
E(lambda x: x * x == 5)

## ∃x ∀y < x (x+y > x⋅y)
E(lambda x: Aupto(x, lambda y: x + y > x * y))

## ∀x (x + 0 = x)
A(lambda x: x + 0 == x)

## ∀x ∀y (x + y < 10)
A(lambda x: A(lambda y: x + y < 10))

## ∀x ∃y < 10 (y⋅y + y = x)
A(lambda x: Elessthan(10, y * y + y == x))

The first is a true Σ1 sentence, so it terminates and returns True. The second is a false Σ1 sentence, so it runs forever. See if you can figure out if the third ever halts, and then run the program for yourself to see!

The fourth is a true Π1 sentence, which means that it will never halt (it will keep looking for a counterexample and failing to find one forever). The fifth is a false Π1 sentence, so it does halt at the first moment it finds a value of x and y whose sum is 10. And figure out the sixth for yourself!

The next level of the hierarchy involves alternating quantifiers.

Σ2 sentences: ∃x1 ∃x2 … ∃xk Φ(x1, x2, …, xk), where Φ is Π1.
Π2 sentences: ∀x1 ∀x2 … ∀xk Φ(x1, x2, …, xk), where Φ is Σ1.

So now we’re allowed sentences with a block of one type of unbounded quantifier followed by a block of the other type of unbounded quantifier, and ending with a Σ0 sentence. You might guess that the Python functions we’ve defined already are strong enough to handle this case (and indeed, all higher levels of the hierarchy), and you’re right. At least, partially. Try running some examples of Σ2 or Π2 sentences and see what happens. For example:

## ∀x ∃y (x > y)
A(lambda x: E(lambda y: x > y))

It runs forever! If we were to look into the structure of this program, we’d see that A(Φ) only halts if it finds a counterexample to Φ, and E(Φ) only halts if it finds an example of Φ. In other words A(E(Φ)) only halts if A finds out that E(Φ) is false; but E(Φ) never halts if it’s false! The two programs’ goals are diametrically opposed, and as such, brought together like this they never halt on any input.

The same goes for a sentence like ∃x ∀y (x > y): for this program to halt, it would require that ∀y (x > y) is found to be true for some value of x, But ∀y (x > y) will never be found true, because universally quantified sentences can only be found false! This has nothing to do with the (x > y) being quantified over, it’s entirely about the structure of the quantifiers.

No Turing machine can decide the truth values of Σ2 and Π2 sentences. There’s a caveat here, related to the semantic version of the arithmetic hierarchy. It’s often possible to take a Π2 sentence like ∀x ∃y (y + y = x) and convert it to a logically equivalent but Π1 sentence like ∀x ∃y<x (y + y = x). This translation works, because y + y = x is only going to be true if y is less than or equal to x. Now we have a false Π1 sentence rather than a false Π2 sentence, and as such we can find a counterexample and halt.

We can talk about a sentence’s essential level on the arithmetic hierarchy, which is the lowest level of the logically equivalent sentence. It’s important to note here that “logically equivalent sentence” is a cross-model notion: A and B are logically equivalent if and only if they have the same truth values in every model of PA, not just the standard model. The soundness and completeness of first order logic, and the recursive nature of the axioms of PA, tells us that the set of sentences that are logically equivalent to a given sentence of PA is recursively enumerable. So we can generate these sentences by searching for PA proofs of equivalence and keeping track of the lowest level of the arithmetic hierarchy attained so far.

Even when we do this, we will still find sentences that have no logical equivalents below Σ2 or Π2. These sentences are essentially uncomputable; not just uncomputable in virtue of their form, but truly uncomputable in all of their logical equivalents. However, while they are uncomputable, they would become computable if we had a stronger Turing machine. Let’s take another look at the last example:

## ∀x ∃y (x > y)
A(lambda x: E(lambda y: x > y))

Recall that the problem was that A(E(Φ)) only halts if E(Φ) returns False, and E(Φ) can only return True. But if we had a TM equipped with an oracle for the truth value of E(Φ) sentences, then maybe we could evaluate A(E(Φ))!

Let’s think about that for a minute more. What would an oracle for the truth value of Σ1 sentences be like? One thing that would work is if we could run E(Φ) “to infinity” and see if it ever finds an example, and if not, then return False. So perhaps an infinite-time Turing machine would do the trick. Another way would be if we could simply ask whether E(Φ) ever halts! If it does, then ∃y (x > y) must be true, and if not, then it must be false.

So a halting oracle suffices to decide the truth values of Σ1 sentences! Same for Π1 sentences: we just ask if A(Φ) ever halts and return False if so, and True otherwise.

If we run the above program on a Turing machine equipped with a halting oracle, what will we get? Now we can evaluate the inner existential quantifier for any given value of x. So in particular, for x = 0, we will find that Ey (x > y) is false. We’ve found a counterexample, so our program will terminate and return False.

On the other hand, if our sentence was true, then we would be faced with the familiar feature of universal quantifiers: we’d run forever looking for a counterexample and never find one. So to determine that this sentence is true, we’d need an oracle for the halting problem for this new more powerful Turing machine!

Here’s a summary of what we have so far:

TM = Ordinary Turing Machine
TM2 = TM + oracle for TM
TM3 = TM + oracle for TM2

The table shows what type of machine suffices to decide the truth value of a sentence, depending on where on the arithmetic hierarchy the sentence falls and whether the sentence is true or false.

We’re now ready to generalize. In general, Σn sentences start with a block of existential quantifiers, and then alternate between blocks of existential and universal quantifiers n – 1 times before ending in a Σ0 sentence. Πn sentences start with a block of universal quantifiers, alternates quantifiers n – 1 times, and then ends in a Σ0 sentence. And as you move up the arithmetic hierarchy, it requires more and more powerful halting oracles to decide whether sentences are true:

(TM = ordinary Turing machine, TMn+1 = TM + oracle for TMn)

If we define Σω to be the union of all the Σ classes in the hierarchy, and Πω the union of the Π classes, then deciding the truth value of Σω ⋃ Πω (the set of all arithmetic sentences) would require a TMω – a Turing machine with an oracle for TM, TM2, TM3, and so on. Thus the theory of true arithmetic (the set of all first-order sentences that are true of ℕ), is not only undecidable, it’s undecidable with a TM2, TM3, and TMn for every n ∈ ℕ. At every level of the arithmetic hierarchy, we get new sentences that are essentially on that level (not just sentences that are superficially on that level in light of their syntactic form, but sentences which, in their simplest possible logically equivalent form, lie on that level).

This gives some sense of just how hard math is. Just understanding the first-order truths of arithmetic requires an infinity of halting oracles, each more powerful than the last. And that says nothing about the second-order truths of arithmetic! That would require even stronger Turing machines than TMω – Turing machines that have halting oracles for TMω, and then TMs with oracles for that, and so on to unimaginable heights (just how high we must go is not currently known).