Some simple probability puzzles

(Most of these are taken from Ian Hacking’s Introduction to Probability and Inductive Logic.)

  1. About as many boys as girls are born in hospitals. Many babies are born every week at City General. In Cornwall, a country town, there is a small hospital where only a few babies are born every week.

    Define a normal week as one where between 45% and 55% of babies are female. An unusual week is one where more than 55% or less than 45% are girls.

    Which of the following is true:
    (a) Unusual weeks occur equally often at City General and at Cornwall.
    (b) Unusual weeks are more common at City General than at Cornwall.
    (c) Unusual weeks are more common at Cornwall than at City General.

  2. Pia is 31 years old, single, outspoken, and smart. She was a philosophy major. When a student, she was an ardent supporter of Native American rights, and she picketed a department store that had no facilities for nursing mothers.

    Which of the following statements are most probable? Which are least probable?

    (a) Pia is an active feminist.
    (b) Pia is a bank teller.
    (c) Pia works in a small bookstore.
    (d) Pia is a bank teller and an active feminist.
    (e) Pia is a bank teller and an active feminist who takes yoga classes.
    (f) Pia works in a small bookstore and is an active feminist who takes yoga classes.

  3. You have been called to jury duty in a town with only green and blue taxis. Green taxis dominate the market, with 85% of the taxis on the road.

    On a misty winter night a taxi sideswiped another car and drove off. A witness said it was a blue cab. This witness is tested under similar conditions, and gets the color right 80% of the time.

    You conclude about the sideswiping taxi:
    (a) The probability that it is blue is 80%.
    (b) It is probably blue, but with a lower probability than 80%.
    (c) It is equally likely to be blue or green.
    (d) It is more likely than not to be green.

  4. You are a physician. You think that it’s quite likely that a patient of yours has strep throat. You take five swabs from the throat of this patient and send them to a lab for testing.

    If the patient has strep throat, the lab results are right 70% of the time. If not, then the lab is right 90% of the time.

    The test results come back: YES, NO, NO, YES, NO

    You conclude:
    (a) The results are worthless.
    (b) It is likely that the patient does not have strep throat.
    (c) It is slightly more likely than not that the patient does have strep throat.
    (d) It is very much more likely than not that the patient does have strep throat.

  5. In a country, all families want a boy. They keep having babies until a boy is born. What is the expected ratio of boys to girls in the country?

  6. Answer the following series of questions:

    If you flip a fair coin twice, do you have the same chance of getting HH as you have of getting HT?

    If you flip the coin repeatedly until you get HH, does this result in the same average number of flips as if you repeat until you get HT?

    If you flip it repeatedly until either HH emerges or HT emerges, are the two outcomes equally likely?

    You play a game with a friend in which you each choose a sequence of three possible flips (e.g. HHT and TTH). You then flip the coin repeatedly until one of the two patterns emerges, and whosever pattern it is wins the game. You get to see your friend’s choice of pattern before deciding yours. Are you ever able to bias the game in your favor?

    Are you always able to bias the game in your favor?


Solutions (and lessons)

  1. The correct answer is (c): Unusual weeks occur more often at Cornwall than at City General. Even though the chance of a boy is the same at Cornwall as it is at City General, the percentage of boys fluctuates more from week to week at the smaller hospital (for N births a week, the spread in the percentage of boys goes like 1/sqrt(N)). Indeed, if you think about an extreme case where Cornwall has only one birth a week, then every week will be an unusual week (100% boys or 0% boys).
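
To put numbers on the hospital puzzle, here is a minimal sketch that computes the exact chance of an unusual week for a few weekly birth counts. It assumes each birth is independently a girl with probability 1/2; the birth counts in the loop are made up for illustration and are not from Hacking's book.

```python
from math import comb


def unusual_week_probability(births: int) -> float:
    """Exact chance that the share of girls falls outside 45%-55%.

    Assumes each birth is independently a girl with probability 1/2.
    """
    unusual_outcomes = 0
    for girls in range(births + 1):
        # Integer comparison (girls / births < 0.45 or > 0.55) avoids rounding issues.
        if 100 * girls < 45 * births or 100 * girls > 55 * births:
            unusual_outcomes += comb(births, girls)
    return unusual_outcomes / 2 ** births


# Hypothetical weekly birth counts, from a tiny hospital to a large one.
for n in (1, 10, 100, 1000):
    print(n, round(unusual_week_probability(n), 4))
```

With one birth a week every week is unusual, and the probability falls steadily as the number of births grows.
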
  2. There is room to debate the exact answer, but whatever it is, it has to obey some constraints. Namely, the most probable statement cannot be (d), (e), or (f), and the least probable statement cannot be (a), (b), or (c). Why? Because of the conjunction rule of probability: each of (d), (e), and (f) is a conjunction that includes (a), (b), or (c), so it cannot be more likely than the statements it is built from. P(A & B) ≤ P(A).

    It turns out that most people violate this constraint. Many people answer that (f) is the most probable description, and (b) is the least probable. This result is commonly interpreted to reveal a cognitive bias known as the representativeness heuristic – essentially, that our judgements of likelihood are made by considering which descriptions most closely resemble the known facts. In this case, (f) sounds the most like the Pia we were told about, so it feels the most probable, even though it can be no more probable than (c) on its own.

    Another factor to consider is that prior to considering the evidence, your odds on a given person being a bank teller as opposed to working in a small bookstore should be heavily weighted towards her being a bank teller. There are just far more bank tellers than small bookstore workers (maybe a factor of around 20:1). This does not necessarily mean that (b) is more likely than (c), but it does mean that the evidence must discriminate strongly enough against her being a bank teller so as to overcome the prior odds.

    This leads us to another lesson, which is to not neglect the base rate. It is easy to ignore the prior odds when it feels like we have strong evidence (Pia’s age, her personality, her major, etc.). But the base rates of small bookstore workers and bank tellers are very relevant to the final judgement.

  3. The correct answer is (d) – it is more likely than not that the sideswiper was green. This is a basic case of base rate neglect – many people would see that the witness is right 80% of the time and conclude that the witness’s testimony has an 80% chance of being correct. But this is ignoring the prior odds on the content of the witness’s testimony.

    In this case, there were prior odds of 17:3 (85%:15%) in favor of the taxi being green. The evidence had a strength of 1:4 (20%:80%), resulting in the final odds being 17:12 in favor of the taxi being green. Translating from odds to probabilities, we get a roughly 59% chance of the taxi having been green.

    We could have concluded (d) very simply by just comparing the prior probability (85% for green) with the evidence (80% for blue), and noticing that the evidence would not be strong enough to make blue more likely than green (since 85% > 80%). Being able to very quickly translate between statistics and conclusions is a valuable skill to foster.
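
The taxi calculation can be checked in a few lines. This is just a sketch of Bayes' rule with the numbers from the puzzle, using exact fractions so that no rounding creeps in; the variable names are my own.

```python
from fractions import Fraction

prior_green = Fraction(85, 100)
prior_blue = 1 - prior_green
witness_accuracy = Fraction(80, 100)

# The witness said "blue": how likely is that report under each hypothesis?
says_blue_given_blue = witness_accuracy
says_blue_given_green = 1 - witness_accuracy

joint_blue = prior_blue * says_blue_given_blue
joint_green = prior_green * says_blue_given_green

posterior_green = joint_green / (joint_green + joint_blue)
print(posterior_green, float(posterior_green))  # 17/29, about 0.59
```
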

  4. The right answer is (d). We calculate this just like we did the last time:

    The results were YES, NO, NO, YES, NO.

    Each YES provides evidence with strength 7:1 (70%/10%) in favor of strep, and each NO provides evidence with strength 1:3 (30%/90%).

    So our strength of evidence is 7:1 ⋅ 1:3 ⋅ 1:3 ⋅ 7:1 ⋅ 1:3 = 49:27, or roughly 1.81:1 in favor of strep. This might be a little surprising… we got more NOs than YESs and the NO was correct 90% of the time for people without strep, compared to the YES being correct only 70% of the time in people with strep.

    Since the evidence is in favor of strep, and we started out already thinking that strep was quite likely, in the end we should be very convinced that they have strep. If our prior on the patient having strep was 75% (3:1 odds), then our probability after getting evidence will be 84% (49:9 odds).

    Again, surprising! The patient who sees these results and hears the doctor declaring that the test strengthens their belief that the patient has strep might feel that this is irrational and object to the conclusion. But the doctor would be right!
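
Here is a small sketch of the same odds calculation in code. It assumes the five swabs are independent given the patient's true condition (which the puzzle implies but does not state), and the function name is hypothetical.

```python
from fractions import Fraction


def posterior_probability(prior: Fraction, results: str) -> Fraction:
    """Posterior P(strep) after a string of Y/N lab results.

    Assumes each swab is an independent test given the true condition.
    """
    p_yes_given_strep = Fraction(7, 10)    # lab is right 70% of the time if strep
    p_yes_given_healthy = Fraction(1, 10)  # lab is right 90% of the time if no strep
    odds = prior / (1 - prior)
    for result in results:
        if result == "Y":
            odds *= p_yes_given_strep / p_yes_given_healthy              # 7:1
        else:
            odds *= (1 - p_yes_given_strep) / (1 - p_yes_given_healthy)  # 1:3
    return odds / (1 + odds)


print(posterior_probability(Fraction(3, 4), "YNNYN"))  # 49/58
```

Starting from even odds instead of 3:1, the same results give 49/76, which is still better than even.
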

  5. Supposing as before that the chance of any given birth being a boy is equal to the chance of it being a girl, we end up concluding…

    The expected ratio of boys and girls in the country is 1! That is, this strategy doesn’t allow you to “cheat” – it has no impact at all on the ratio. Why? I’ll leave this one for you to figure out. Here’s a diagram for a hint:


    This is important because it applies to the problem of p-hacking. Imagine that all researchers just repeatedly do studies until they get the results they like, and only publish these results. Now suppose that all the researchers in the world are required to publish every study that they do. Now, can they still get a bias in favor of results they like? No! Even though they always stop when getting the result they like, the aggregate of their studies is unbiased evidence. They can’t game the system!
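
If you would rather see it than prove it, here is a quick simulation sketch of the stopping rule. It assumes boys and girls are equally likely at every birth; the number of families and the seed are arbitrary choices.

```python
import random


def simulate_families(num_families: int, seed: int = 0) -> tuple[int, int]:
    """Every family keeps having children until its first boy. Returns (boys, girls)."""
    rng = random.Random(seed)
    boys = girls = 0
    for _ in range(num_families):
        # Each birth is a girl with probability 1/2; stop at the first boy.
        while rng.random() < 0.5:
            girls += 1
        boys += 1
    return boys, girls


boys, girls = simulate_families(100_000)
print(boys, girls, boys / girls)
```

Every family ends up with exactly one boy, and the total number of girls comes out very close to the total number of boys.
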

  6. Answers, in order:

    If you flip a fair coin twice, do you have the same chance of getting HH as you have of getting HT? (Yes)

    If you flip it repeatedly until you get HH, does this result in the same average number of flips as if you repeat until you get HT? (No)

    If you flip it repeatedly until either HH emerges or HT emerges, are the two outcomes equally likely? (Yes)

    You play a game with a friend in which you each choose a sequence of three coin flips (e.g. HHT and TTH). You then flip a coin repeatedly until one of the two patterns emerges, and whosever pattern it is wins the game. You get to see your friend’s choice of pattern before deciding yours. Are you ever able to bias the game in your favor? (Yes)

    Are you always able to bias the game in your favor? (Yes!)

    Here’s a wiki page with a good explanation of this: LINK. A table from that page illustrating a winning strategy for any choice your friend makes:

    1st player’s choice    2nd player’s choice    Odds in favour of 2nd player
    HHH                    THH                    7 to 1
    HHT                    THH                    3 to 1
    HTH                    HHT                    2 to 1
    HTT                    HHT                    2 to 1
    THH                    TTH                    2 to 1
    THT                    TTH                    2 to 1
    TTH                    HTT                    3 to 1
    TTT                    HTT                    7 to 1
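
The odds in the table can be checked by brute force. Below is a rough simulation sketch, not an exact calculation: it plays many games with a fair coin and reports how often the second player's pattern shows up first. The number of games and the seed are arbitrary.

```python
import random


def second_player_win_rate(first: str, second: str, games: int = 20_000, seed: int = 0) -> float:
    """Estimate how often `second` appears before `first` in a run of fair coin flips."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(games):
        recent = ""
        while True:
            # Keep only the last three flips; both patterns have length three.
            recent = (recent + rng.choice("HT"))[-3:]
            if recent == second:
                wins += 1
                break
            if recent == first:
                break
    return wins / games


# The second player's reply to each first-player choice, as in the table.
replies = {"HHH": "THH", "HHT": "THH", "HTH": "HHT", "HTT": "HHT",
           "THH": "TTH", "THT": "TTH", "TTH": "HTT", "TTT": "HTT"}
for first, second in replies.items():
    print(first, second, round(second_player_win_rate(first, second), 3))
```

Odds of 7 to 1, 3 to 1, and 2 to 1 correspond to win rates of about 0.875, 0.75, and 0.667.
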

Simple induction

In front of you is a coin. You don’t know the bias of this coin, but you have some prior probability distribution over possible biases (between 0: always tails, and 1: always heads). This distribution has some statistical properties that characterize it, such as a standard deviation and a mean. And from this prior distribution, you can predict the outcome of the next coin toss.

Now the coin is flipped and lands heads. What is your prediction for the outcome of the next toss?

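This is a dead simple example of a case where there is a correct answer to how to reason inductively. It is as correct as any deductive proof, and derives a precise and unambiguous result, where µ and σ are the mean and standard deviation of your prior over the bias:

P(next toss is H | first toss is H) = µ (1 + σ²/µ²)
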
This is a law of rational thought, just as rules of logic are laws of rational thought. It’s interesting to me how the understanding of the structure of inductive reasoning begins to erode the apparent boundary between purely logical a priori reasoning and supposedly a posteriori inductive reasoning.

Anyway, here’s one simple conclusion that we can draw from this result: After the coin lands heads, it should be more likely that the coin will land heads next time. After all, the initial credence was µ, and the final credence is µ multiplied by a value that is necessarily greater than 1 (as long as σ is not zero).

You probably didn’t need to see an equation to guess that for each toss that lands H, future tosses landing H become more likely. But it’s nice to see the fundamental justification behind this intuition.

We can also examine some special cases. For instance, consider a uniform prior distribution (corresponding to maximum initial uncertainty about the coin bias). For this distribution (π = 1), µ = 1/2 and σ² = 1/12 (so σ ≈ 0.29). Thus, we arrive at the conclusion that after getting one heads, your credence in the next toss landing heads should be 1/2 ⋅ (1 + (1/12)/(1/4)) = 2/3 (67%, up from 50%).

We can get a sense of the insufficiency of point estimates using this example. Two prior distributions with the same average value will respond very differently to evidence, and thus the final point estimate of the chance of H will differ. But what is interesting is that while the mean is insufficient, just the mean and standard deviation suffice for inferring the value of the next point estimate.
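
As a numerical sanity check, here is a sketch that integrates a prior over the bias on a fine grid and applies the update µ(1 + σ²/µ²), which is the same thing as E[p²]/E[p]. The function name and grid size are my own choices. The second prior, proportional to p⁹(1 – p)⁹, is a hypothetical example of a more confident prior with the same mean of 1/2.

```python
def predictive_after_heads(prior_density, grid_size: int = 100_000):
    """Return (mean, variance, P(next toss is H | one H observed)) for a prior over the bias."""
    step = 1.0 / grid_size
    points = [(i + 0.5) * step for i in range(grid_size)]  # midpoint rule on [0, 1]
    weights = [prior_density(p) for p in points]
    norm = sum(weights)  # lets us pass unnormalized densities
    mean = sum(w * p for w, p in zip(weights, points)) / norm
    second_moment = sum(w * p * p for w, p in zip(weights, points)) / norm
    variance = second_moment - mean ** 2
    # Bayes' rule gives E[p^2] / E[p], which equals mean * (1 + variance / mean^2).
    return mean, variance, mean * (1 + variance / mean ** 2)


uniform = predictive_after_heads(lambda p: 1.0)
confident = predictive_after_heads(lambda p: p ** 9 * (1 - p) ** 9)
print(uniform)
print(confident)
```

Both priors start at a credence of 1/2, but the uniform prior (variance 1/12) moves to 2/3 after one heads, while the narrow one barely moves, to about 0.52.
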

In general, the dynamics are controlled by the term σ/µ. As σ/µ goes to zero (which corresponds to a tiny standard deviation, or a very confident prior), our update goes to zero as well. And as σ/µ gets large (either by a weak prior or a low initial credence in the coin being H-biased), the observation of H causes a greater update.

How large can this term possibly get? Obviously, the updated point estimate should asymptote towards 1, but this is not obvious from the form of the equation we have (it looks like σ/µ can get arbitrarily large, forcing our final point estimate to infinity). What we need to do is optimize the updated point estimate, while taking into account the constraints implied by the relationship between σ and µ.
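
Here is a sketch of how that constraint works out, writing p for the coin's bias and using only the fact that p lies between 0 and 1. The notation E[·] for the prior expectation is mine.

```latex
\begin{align*}
0 \le p \le 1 \;&\Rightarrow\; p^2 \le p
  \;\Rightarrow\; \mathbb{E}[p^2] \le \mathbb{E}[p] = \mu \\
\sigma^2 &= \mathbb{E}[p^2] - \mu^2 \;\le\; \mu - \mu^2 = \mu(1 - \mu) \\
P(H_2 \mid H_1) &= \mu\left(1 + \frac{\sigma^2}{\mu^2}\right)
  = \mu + \frac{\sigma^2}{\mu}
  \;\le\; \mu + (1 - \mu) = 1
\end{align*}
```

So σ/µ cannot grow without limit for a fixed µ: σ² is capped at µ(1 – µ), and the updated estimate only reaches 1 when the prior puts all of its weight on p = 0 and p = 1.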

Variational Bayesian inference

Today I learned a cool trick for practical implementation of Bayesian inference.

Bayesians are interested in calculating posterior probability distributions of unobserved parameters X, given data (which consists of the values of observed parameters Y).

To do so, they need only know the form of the likelihood function (the probability of Y given X) and their own prior distribution over X. Then they can apply Bayes’ rule…

P(X | Y) = P(Y | X) P(X) / P(Y)

… and voila, Bayesian inference complete.

The trickiest part of this process is calculating the term in the denominator, the marginal likelihood P(Y). Calculating this term exactly is typically very computationally expensive – it involves a sum, over all possible values of the parameters X, of the likelihood multiplied by the prior. If X ranges over a continuum of possible values, then calculating the marginal likelihood amounts to solving a (typically completely intractable) integral.

P(Y) = ∫ P(Y | X) P(X) dX

Variational Bayesian inference is a procedure that solves this problem through a clever trick.

We start by searching for a posterior in a space of functions F that are easily integrable.

Our goal is not to find the exact form of the posterior, although if we do, that’s great. Instead, we want to find the function Q(X) within F that is as close to the posterior P(X | Y) as possible.

Distance between probability distributions is typically calculated by the information divergence D(Q, P) (also known as the Kullback-Leibler divergence), which is defined by…

D(Q, P) = ∫ Q(X) log(Q(X) / P(X|Y)) dX

To explicitly calculate and minimize this, we would need to know the form of the posterior P(X | Y) from the start. But let’s plug in the definition of conditional probability…

P(X | Y) = P(X, Y) / P(Y)

D(Q, P) = ∫ Q(X) log(Q(X) P(Y) / P(X, Y)) dX
= ∫ Q(X) log(Q(X) / P(X, Y)) dX  +  ∫ Q(X) log P(Y) dX

The second term is easily calculated. Since log(P(Y)) is not a function of X, the integral just becomes…

∫ Q(X) log P(Y) dX = log P(Y)

Rearranging, we get…

log P(Y) = D(Q, P)  –  ∫ Q(X) log(Q(X) / P(X, Y)) dX

The second term depends only on Q(X) and the joint probability P(X, Y), which we can calculate easily as the product of the likelihood P(Y | X) and the prior P(X). We name this term, minus sign included, the variational free energy, L(Q). (It is also commonly called the evidence lower bound, or ELBO.)

log P(Y) = D(Q, P) + L(Q)

Now, on the left-hand side we have the log of the marginal likelihood, and on the right we have the information divergence plus the variational free energy.

Notice that the left side is not a function of Q. This is really important! It tells us that if we’re trying to vary Q to minimize D(Q, P), then the right side will be a constant quantity.

In other words, any increase in L(Q) is necessarily a decrease in D(Q, P). What this means is that the Q that minimizes D(Q, P) is the same thing as the Q that maximizes L(Q)!

We can use this to minimize D(Q, P) without ever explicitly knowing P.

Recalling the definition of the variational free energy, we have…

L(Q) = – ∫ Q(X) log(Q(X) / P(X, Y)) dX
= ∫ Q(X) log P(X, Y) dX – ∫ Q(X) log Q(X) dX

Both of these integrals are computable insofar as we made a good choice for the function space F. Thus we can find Q*, the best approximation to P in F. Then, knowing Q*, we can calculate L(Q*), which serves as a lower bound on the log of the marginal likelihood P(Y), because the divergence D(Q, P) can never be negative.

log P(Y) = D(Q, P) + L(Q) and D(Q, P) ≥ 0
so log P(Y) ≥ L(Q*)

Summing up…

  1. Variational Bayesian inference approximates the posterior probability P(X | Y) with a function Q(X) in the function space F.
  2. We find the function Q* that is as similar as possible to P(X | Y) by maximizing L(Q).
  3. L(Q*) gives us a lower bound on the log of the marginal likelihood, log P(Y).
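
To see the identity log P(Y) = D(Q, P) + L(Q) in action, here is a toy sketch with a discrete X, so that the integrals become sums and everything can be computed exactly. The prior and likelihood numbers are made up for illustration.

```python
import math

# Hypothetical toy model: X takes three values, and one value of Y has been observed.
prior = [0.5, 0.3, 0.2]          # P(X)
likelihood = [0.10, 0.40, 0.70]  # P(Y | X) for the observed Y

joint = [p * l for p, l in zip(prior, likelihood)]  # P(X, Y)
evidence = sum(joint)                               # P(Y), easy here but intractable in general
posterior = [j / evidence for j in joint]           # P(X | Y)


def divergence(q, p):
    """D(Q, P) = sum over x of Q(x) log(Q(x) / P(x))."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def free_energy(q):
    """L(Q) = -sum over x of Q(x) log(Q(x) / P(x, Y)). Uses the joint only, never P(Y)."""
    return -sum(qi * math.log(qi / ji) for qi, ji in zip(q, joint) if qi > 0)


candidates = [[1 / 3, 1 / 3, 1 / 3], [0.2, 0.3, 0.5], posterior]
for q in candidates:
    print(round(divergence(q, posterior), 6), round(free_energy(q), 6), round(math.log(evidence), 6))
```

For every candidate Q the first two columns add up to the third, L(Q) never exceeds log P(Y), and the gap closes exactly when Q is the true posterior.
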

Regularization as approximately Bayesian inference

In an earlier post, I showed how the procedure of minimizing sum of squares falls out of regular old frequentist inference. This time I’ll do something similar, but with regularization and Bayesian inference.

Regularization is essentially a technique in which you evaluate models in terms of not just their fit to the data, but also the values of the parameters involved. For instance, say you are modeling some data with a second-order polynomial.

M = { f(x) = a + bx + cx² | a, b, c ∈ R }
D = { (x₁, y₁), …, (xɴ, yɴ) }

We can evaluate our model’s fit to the data with SOS:

SOS = ∑ (yₙ – f(xₙ))²

Minimizing SOS gives us the frequentist answer – the answer that best fits the data. But what if we suspect that the values of a, b, and c are probably small? In other words, what if we have an informative prior about the parameter values? Then we can explicitly add on a penalty term that increases the SOS, such as…

SOS with L1 regularization = k₁ |a| + k₂ |b| + k₃ |c| + ∑ (yₙ – f(xₙ))²

The constants k₁, k₂, and k₃ determine how much we will penalize each parameter a, b, and c. This is not the only form of regularization we could use. We could also use the L2 norm:

SOS with L2 regularization = k₁ a² + k₂ b² + k₃ c² + ∑ (yₙ – f(xₙ))²

You might, having heard of this procedure, already suspect it of having a Bayesian bent. The notion of penalizing large parameter values on the basis of a prior suspicion that the values should be small sounds a lot like what the Bayesian would call “low priors on high parameter values.”

We’ll now make the connection explicit.

Frequentist inference tries to select the theory that makes the data most likely. Bayesian inference tries to select the theory that is made most likely by the data. I.e. frequentists choose f to maximize P(D | f), and Bayesians choose f to maximize P(f | D).

Assessing P(f | D) requires us to have a prior over our set of functions f, which we’ll call π(f).

P(f | D) = P(D | f) π(f) / P(D)

We take a logarithm to make everything easier:

log P(f | D) = log P(D | f) + log π(f) – log P(D)

We already evaluated P(D | f) in the last post, so we’ll just plug it in right away.

log P(f | D) = – SOS/2σ² – N/2 log(2πσ²) + log π(f) – log P(D)

Since we are maximizing with respect to f, two of these terms will fall away.

log P(f | D) = – SOS/2σ² + log π(f) + constant

Now we just have to decide on the form of π(f). Since the functional form of f is determined by the values of the parameters {a, b, c}, π(f) = π(a, b, c). One plausible choice is an independent Gaussian centered at zero for each parameter, with standard deviations σ_a, σ_b, and σ_c:

π(f) = exp( -a² / 2σ_a² ) exp( -b² / 2σ_b² ) exp( -c² / 2σ_c² ) / √(8π³ σ_a² σ_b² σ_c²)
log π(f) = -a²/2σ_a² – b²/2σ_b² – c²/2σ_c² – ½ log(8π³ σ_a² σ_b² σ_c²)

Now, throwing out terms that don’t depend on the values of the parameters, we find:

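log P(f | D) = – SOS/2σ² – a²/2σ_a² – b²/2σ_b² – c²/2σ_c² + constant

Multiplying through by –2σ², maximizing this is the same as minimizing SOS + (σ²/σ_a²) a² + (σ²/σ_b²) b² + (σ²/σ_c²) c². This is exactly L2 regularization, with k₁ = σ²/σ_a², k₂ = σ²/σ_b², and k₃ = σ²/σ_c². In other words, L2 regularization is Bayesian inference (picking the most probable f) with Gaussian priors over the parameters!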
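
Here is a small numerical sketch of that correspondence. To keep it short it uses a one-parameter model f(x) = bx rather than the full quadratic, and the data, noise level σ, and prior width σ_b are all made-up values.

```python
# Hypothetical data and hyperparameters, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.3, 2.8, 4.2]
sigma = 0.5    # noise standard deviation
sigma_b = 2.0  # prior standard deviation of the slope b


def sos(b):
    """Sum of squares for the candidate function f(x) = b * x."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys))


def log_posterior(b):
    """log P(f | D) up to an additive constant, with a zero-centered Gaussian prior on b."""
    return -sos(b) / (2 * sigma ** 2) - b ** 2 / (2 * sigma_b ** 2)


def ridge_objective(b, k):
    """SOS with an L2 penalty on b."""
    return k * b ** 2 + sos(b)


k = sigma ** 2 / sigma_b ** 2
# Closed-form minimizer of the L2-regularized SOS for this one-parameter model.
b_ridge = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + k)

# Brute-force search for the most probable slope on a grid of step 0.001.
grid = [i / 1000 for i in range(0, 2001)]
b_map = max(grid, key=log_posterior)
print(b_ridge, b_map)
```

The most probable slope under the Gaussian prior lands on the grid point closest to the L2-regularized solution, with the penalty weight k = σ²/σ_b².
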

What priors does L1 regularization correspond to?

log π(f) = –(k₁ |a| + k₂ |b| + k₃ |c|) / 2σ² + constant
π(a, b, c) ∝ exp( –k₁|a| / 2σ² ) exp( –k₂|b| / 2σ² ) exp( –k₃|c| / 2σ² )

I.e. the L1 regularization prior is a Laplace distribution (a two-sided exponential) over each parameter.

This can be easily extended to any regularization technique. This is a way to get some insight into what your favorite regularization methods mean. They are ultimately to be cashed out in the form of your prior knowledge of the parameters!

Why minimizing sum of squares is equivalent to frequentist inference

(This will be the first in a short series of posts describing how various commonly used statistical methods are approximate versions of frequentist, Bayesian, and Akaike-ian inference)

Suppose that we have some data D = { (x₁, y₁), (x₂, y₂), … , (xɴ, yɴ) }, and a candidate function y = f(x).

Frequentist inference involves the assessment of the likelihood of the data given this candidate function: P(D | f).

Since D is composed of N independent data points, we can assess the probability of each data point separately, and multiply them all together.

P(D | f) = P(x₁, y₁ | f) P(x₂, y₂ | f) … P(xɴ, yɴ | f)

So now we just need to answer the question: What is P(x, y | f)?

f predicts that for the value x, the most likely y-value is f(x).

The other possible y-values will be normally distributed around f(x).


The equation for this distribution is a Gaussian:

P(x, y | f) = exp[ -(y – f(x))² / 2σ² ] / √(2πσ²)

Now that we know how to find P(x, y | f), we can easily calculate P(D | f)!

P(D | f) = exp[ -(y₁ – f(x₁))² / 2σ² ] / √(2πσ²) ・ exp[ -(y₂ – f(x₂))² / 2σ² ] / √(2πσ²) … exp[ -(yɴ – f(xɴ))² / 2σ² ] / √(2πσ²)
= exp[ -(y₁ – f(x₁))² / 2σ² ] ・ exp[ -(y₂ – f(x₂))² / 2σ² ] … exp[ -(yɴ – f(xɴ))² / 2σ² ] / (2πσ²)^(N/2)

Products are messy and logarithms are monotonic, so log(P(D | f)) is easier to work with: it turns the product into a sum.

log P(D | f) = log( exp[ -(y₁ – f(x₁))² / 2σ² ] … exp[ -(yɴ – f(xɴ))² / 2σ² ] / (2πσ²)^(N/2) )
= log( exp[ -(y₁ – f(x₁))² / 2σ² ] ) + … + log( exp[ -(yɴ – f(xɴ))² / 2σ² ] ) – N/2 log(2πσ²)
= -(y₁ – f(x₁))² / 2σ² – … – (yɴ – f(xɴ))² / 2σ² – N/2 log(2πσ²)
= -(1/2σ²) [ (y₁ – f(x₁))² + … + (yɴ – f(xɴ))² ] – N/2 log(2πσ²)

Now notice that the sum of squares just naturally pops out!

SOS = (y₁ – f(x₁))² + … + (yɴ – f(xɴ))²
log P(D | f) = -SOS/2σ² – N/2 log(2πσ²)

Frequentist inference chooses f to maximize P(D | f). We can now immediately see why this is equivalent to minimizing SOS!

argmax{ P(D | f) }
= argmax{ log P(D | f) }
= argmax{ – SOS/2σ² – N/2 log(2πσ²) }
= argmin{ SOS/2σ² + N/2 log(2πσ²) }
= argmin{ SOS/2σ² }
= argmin{ SOS }
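
A quick numerical sketch of the same argument: for a fixed noise level σ, ranking candidate functions by Gaussian log-likelihood and ranking them by SOS pick out the same winner. The data, the value of σ, and the family of candidate lines f(x) = s·x are all hypothetical.

```python
import math

# Hypothetical data and a fixed, assumed noise level sigma.
data = [(0.0, 0.2), (1.0, 1.9), (2.0, 4.1), (3.0, 5.8)]
sigma = 1.0


def sos(f):
    """Sum of squared residuals of f on the data."""
    return sum((y - f(x)) ** 2 for x, y in data)


def log_likelihood(f):
    """log P(D | f): one Gaussian log-density per data point, summed."""
    return sum(
        -((y - f(x)) ** 2) / (2 * sigma ** 2) - 0.5 * math.log(2 * math.pi * sigma ** 2)
        for x, y in data
    )


# Candidate functions f(x) = s * x for slopes 0.0, 0.1, ..., 4.0.
slopes = [s / 10 for s in range(0, 41)]
candidates = {s: (lambda x, s=s: s * x) for s in slopes}

best_by_likelihood = max(slopes, key=lambda s: log_likelihood(candidates[s]))
best_by_sos = min(slopes, key=lambda s: sos(candidates[s]))
print(best_by_likelihood, best_by_sos)
```

Both searches return the same slope, as the argmax/argmin chain says they must.
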

Next, we’ll go Bayesian…

Galileo and the Schelling point improbability principle

An alternative history interaction between Galileo and his famous statistician friend


In the year 1609, when Galileo Galilei finished the construction of his majestic artificial eye, the first place he turned his gaze was the glowing crescent moon. He reveled in the crevices and mountains he saw, knowing that he was the first man alive to see such a sight, and his mind expanded as he saw the folly of the science of his day and wondered what else we might be wrong about.

For days he was glued to his telescope, gazing at the Heavens. He saw the planets become colorful expressive spheres and reveal tiny orbiting companions, and observed the distant supernova which Kepler had seen blinking into existence only five years prior. He discovered that Venus had phases like the Moon, that some apparently single stars revealed themselves to be binaries when magnified, and that there were dense star clusters scattered through the sky. All this he recorded in frantic enthusiastic writing, putting out sentences filled with novel discoveries nearly every time he turned his telescope in a new direction. The universe had opened itself up to him, revealing all its secrets to be uncovered by his ravenous intellect.

It took him two weeks to pull himself away from his study room for long enough to notify his friend Bertolfo Eamadin of his breakthrough. Eamadin was a renowned scholar, having pioneered at age 15 his mathematical theory of uncertainty and created the science of probability. Galileo often sought him out to discuss puzzles of chance and randomness, and this time was no exception. He had noticed a remarkable confluence of three stars that were in perfect alignment, and needed the counsel of his friend to sort out his thoughts.

Eamadin arrived at the home of Galileo half-dressed and disheveled, obviously having leapt from his bed and rushed over immediately upon receiving Galileo’s correspondence. He practically shoved Galileo out from his viewing seat and took his place, eyes glued with fascination on the sky.

Galileo allowed his friend to observe unmolested for a half-hour, listening with growing impatience to the ‘oohs’ and ‘aahs’ being emitted as the telescope swung wildly from one part of the sky to another. Finally, he interrupted.

Galileo: “Look, friend, at the pattern I have called you here to discuss.”

Galileo swiveled the telescope carefully to the position he had marked out earlier.

Eamadin: “Yes, I see it, just as you said. The three stars form a seemingly perfect line, each of the two outer ones equidistant from the central star.”

Galileo: “Now tell me, Eamadin, what are the chances of observing such a coincidence? One in a million? A billion?”

Eamadin frowned and shook his head. “It’s certainly a beautiful pattern, Galileo, but I don’t see what good a statistician like myself can do for you. What is there to be explained? With so many stars in the sky, of course you would chance upon some patterns that look pretty.”

Galileo: “Perhaps it seems only an attractive configuration of stars spewed randomly across the sky. I thought the same myself. But the symmetry seemed too perfect. I decided to carefully measure the central angle, as well as the angular distance subtended by the paths from each outer star to the central one. Look.”

Galileo pulled out a sheet of paper that had been densely scribbled upon. “My calculations revealed the central angle to be precisely 180.000º, with an error of ± .003º. And similarly, I found the difference in the two angular distances to be .000º, with a margin of error of ± .002º.”

Eamadin: “Let me look at your notes.”

Galileo handed over the sheets to Eamadin. “I checked over my calculations a dozen times before writing you. I found the angular distances by approaching and retreating from this thin paper, which I placed between the three stars and me. I found the distance at which the thin paper just happened to cover both stars on one extreme simultaneously, and did the same for the two stars on the other extreme. The distance was precisely the same, leaving measurement error only for the thickness of the paper, my distance from it, and the resolution of my vision.”

Eamadin: “I see, I see. Yes, what you have found is a startlingly clear pattern. A similarity in distance and precision of angle this precise is quite unlikely to be the result of any natural phenomenon… ”

Galileo: “Exactly what I thought at first! But then I thought about the vast quantity of stars in the sky, and the vast number of ways of arranging them into groups of three, and wondered if perhaps in fact such coincidences might be expected. I tried to apply your method of uncertainty to the problem, and came to the conclusion that the chance of such a pattern having occurred through random chance is one in a thousand million! I must confess, however, that at several points in the calculation I found myself confronted with doubt about how to progress and wished for your counsel.”

Eamadin stared at Galileo’s notes, then pulled out a pad of his own and began scribbling intensely. Eventually, he spoke. “Yes, your calculations are correct. The chance of such a pattern having occurred to within the degree of measurement error you have specified by random forces is 10⁻⁹.”

Galileo: “Aha! Remarkable. So what does this mean? What strange forces have conspired to place the stars in such a pattern? And, most significantly, why?”

Eamadin: “Hold it there, Galileo. It is not reasonable to jump from the knowledge that the chance of an event is remarkably small to the conclusion that it demands a novel explanation.”

Galileo: “How so?”

Eamadin: “I’ll show you by means of a thought experiment. Suppose that we found that instead of the angle being 180.000º with an experimental error of .003º, it was 180.001º with the same error. The probability of this outcome would be the same as the outcome we found – one in a thousand million.”

Galileo: “That can’t be right. Surely it’s less likely to find a perfectly straight line than a merely nearly perfectly straight line.”

Eamadin: “While that is true, it is also true that the exact calculation you did for 180.000º ± .003º would apply for 180.001º ± .003º. And indeed, it is less likely to find the stars at this precise angle, than it is to find the stars merely near this angle. We must compare like with like, and when we do so we find that 180.000º is no more likely than any other angle!”

Galileo: “I see your reasoning, Eamadin, but you are missing something of importance. Surely there is something objectively more significant about finding an exactly straight line than about a nearly straight line, even if they have the same probability. Not all equiprobable events should be considered to be equally important. Think, for instance, of a sequence of twenty coin tosses. While it’s true that the outcome HHTHTTTTHTHHHTHHHTTH has the same probability as the outcome HHHHHHHHHHHHHHHHHHHH, the second is clearly more remarkable than the first.”

Eamadin: “But what is significance if disentangled from probability? I insist that the concept of significance only makes sense in the context of my theory of uncertainty. Significant results are those that either have a low probability or have a low conditional probability given a set of plausible hypotheses. It is this second class that we may utilize in analyzing your coin tossing example, Galileo. The two strings of tosses you mention are only significant to different degrees in that the second more naturally lends itself to a set of hypotheses in which the coin is heavily biased towards heads. In judging the second to be a more significant result than the first, you are really just saying that you use a natural hypothesis class in which probability judgments are only dependent on the ratios of heads and tails, not the particular sequence of heads and tails. Now, my question for you is: since 180.000º is just as likely as 180.001º, what set of hypotheses are you considering in which the first is much less likely than the second?”

Galileo: “I must confess, I have difficulty answering your question. For while there is a simple sense in which the number of heads and tails is a product of a coin’s bias, it is less clear what would be the analogous ‘bias’ in angles and distances between stars that should make straight lines and equal distances less likely than any others. I must say, Eamadin, that in calling you here, I find myself even more confused than when I began!”

Eamadin: “I apologize, my friend. But now let me attempt to disentangle this mess and provide a guiding light towards a solution to your problem.”

Galileo: “Please.”

Eamadin: “Perhaps we may find some objective sense in which a straight line or the equality of two quantities is a simpler mathematical pattern than a nearly straight line or two nearly equal quantities. But even if so, this will only be a help to us insofar as we have a presumption in favor of less simple patterns inhering in Nature.”

Galileo: “This is no help at all! For surely the principle of Ockham should push us towards favoring more simple patterns.”

Eamadin: “Precisely. So if we are not to look for an objective basis for the improbability of simple and elegant patterns, then we must look towards the subjective. Here we may find our answer. Suppose I were to scribble down on a sheet of paper a series of symbols and shapes, hidden from your view. Now imagine that I hand the images to you, and you go off to some unexplored land. You explore the region and draw up cartographic depictions of the land, having never seen my images. It would be quite a remarkable surprise were you to find upon looking at my images that they precisely matched your maps of the land.”

Galileo: “Indeed it would be. It would also quickly lend itself to a number of possible explanations. Firstly, it may be that you were previously aware of the layout of the land, and drew your pictures intentionally to capture the layout of the land – that is, that the layout directly caused the resemblance in your depictions. Secondly, it could be that there was a common cause between the resemblance and the layout; perhaps, for instance, the patterns that most naturally come to the mind are those that resemble common geographic features. And thirdly, included only for completeness, it could be that your images somehow caused the land to have the geographic features that it did.”

Eamadin: “Exactly! You catch on quickly. Now, this case of the curious coincidence of depiction and reality is exactly analogous to your problem of the straight line in the sky. The straight lines and equal distances are just like patterns on the slips of paper I handed to you. For whatever reason, we come pre-loaded with a set of sensitivities to certain visual patterns. And what’s remarkable about your observation of the three stars is that a feature of the natural world happens to precisely align with these patterns, where we would expect no such coincidence to occur!”

Galileo: “Yes, yes, I see. You are saying that the improbability doesn’t come from any objective unusual-ness of straight lines or equal distances. Instead, the improbability comes from the fact that the patterns in reality just happen to be the same as the patterns in my head!”

Eamadin: “Precisely. Now we can break down the suitable explanations, just as you did with my cartographic example. The first explanation is that the patterns in your mind were caused by the patterns in the sky. That is, for some reason the fact that these stars were aligned in this particular way caused you to be psychologically sensitive to straight lines and equal quantities.”

Galileo: “We may discard this explanation immediately, for such sensitivities are too universal and primitive to be the result of a configuration of stars that has only just now made itself apparent to me.”

Eamadin: “Agreed. Next we have a common cause explanation. For instance, perhaps our mind is naturally sensitive to visual patterns like straight lines because such patterns tend to commonly arise in Nature. This natural sensitivity is what feels to us on the inside as simplicity. In this case, you would expect it to be more likely for you to observe simple patterns than might be naively thought.”

Galileo: “We must deny this explanation as well, it seems to me. For the resemblance to a straight line goes much further than my visual resolution could even make out. The increased likelihood of observing a straight line could hardly be enough to outweigh our initial naïve calculation of the probability being 10⁻⁹. But thinking more about this line of reasoning, it strikes me that you have just provided an explanation of the apparent simplicity of the laws of Nature! We have developed to be especially sensitive to patterns that are common in Nature, we interpret such patterns as ‘simple’, and thus it is a tautology that we will observe Nature to be full of simple patterns.”

Eamadin: “Indeed, I have offered just such an explanation. But it is an unsatisfactory explanation, insofar as one is opposed to the notion of simplicity as a purely subjective feature. Most people, myself included, would strongly suggest that a straight line is inherently simpler than a curvy line.”

Galileo: “I feel the same temptation. Of course, justifying a measure of simplicity that does the job we want of it is easier said than done. Now, on to the third explanation: that my sensitivity to straight lines has caused the apparent resemblance to a straight line. There are two interpretations of this. The first is that the stars are not actually in a straight line, and I only think so because of my predisposition towards identifying straight lines. The second is that the stars aligned in a straight line because of these predispositions. I’m sure you agree that both can be reasonably excluded.”

Eamadin: “Indeed. Although it may look like we’ve excluded all possible explanations, notice that we only considered one possible form of the common cause explanation. The other two categories of explanations seem more thoroughly ruled out; your dispositions couldn’t be caused by the star alignment given that you have only just found out about it and the star alignment couldn’t be caused by your dispositions given the physical distance.”

Galileo: “Agreed. Here is another common cause explanation: God, who crafted the patterns we see in Nature, also created humans to have similar mental features to Himself. These mental features include aesthetic preferences for simple patterns. Thus God causes both the salience of the line pattern to humans and the existence of the line pattern in Nature.”

Eamadin: “The problem with this is that it explains too much. Based solely on this argument, we would expect that when looking up at the sky, we should see it entirely populated by simple and aesthetic arrangements of stars. Instead it looks mostly random and scattershot, with a few striking exceptions like those which you have pointed out.”

Galileo: “Your point is well taken. All I can imagine now is that there must be some sort of ethereal force that links some stars together, gradually pushing them so that they end up in nearly straight lines.”

Eamadin: “Perhaps that will be the final answer in the end. Or perhaps we will discover that it is the whim of a capricious Creator with an unusual habit for placing unsolvable mysteries in our paths. I sometimes feel this way myself.”

Galileo: “I confess, I have felt the same at times. Well, Eamadin, although we have failed to find a satisfactory explanation for the moment, I feel much less confused about this matter. I must say, I find this method of reasoning by noticing similarities between features of our mind and features of the world quite intriguing. Have you a name for it?”

Eamadin: “In fact, I just thought of it on the spot! I suppose that it is quite generalizable… We come pre-loaded with a set of very salient and intuitive concepts, be they geometric, temporal, or logical. We should be surprised to find these concepts instantiated in the world, unless we know of some causal connection between the patterns in our mind and the patterns in reality. And by Eamadin’s rule of probability-updating, when we notice these similarities, we should increase our strength of belief in these possible causal connections. In the spirit of anachrony, let us refer to this as the Schelling point improbability principle!”

Galileo: “Sounds good to me! Thank you for your assistance, my friend. And now I must return to my exploration of the Cosmos.”

Why “number of parameters” isn’t good enough

A friend of mine recently pointed out a curious fact. Any set of two-dimensional data whatsoever can be perfectly fit by a simple two-parameter sinusoidal model.

y(x) = A sin(Bx)

Sound wrong? Check it out:


[Figure: zoomed out (small-sine.png)]

[Figure: N = 10 points (sine-overfit.png)]

As you see, as the number of data points goes up, all you need to do to accommodate this is increase the frequency in your sine function, and adjust the amplitude as necessary. Ultimately, you can fit any data set with a ridiculously quickly oscillating and large-amplitude sine function.
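Here is a minimal sketch of that idea, not the method used to make the plots above. It assumes a fixed amplitude A larger than every |y|, three made-up data points whose x-values have irrational ratios (which is what lets a single frequency come close to every point at once), and a brute-force scan over B. The function name `fit_sine_by_scan` and all of the numbers are hypothetical choices for illustration.

```python
import math

def fit_sine_by_scan(xs, ys, amplitude, b_max, b_step=0.002):
    """Scan B over a grid; return (best_B, worst-case residual) for y = A sin(B x)."""
    best_b, best_err = None, math.inf
    n_steps = int(b_max / b_step)
    for k in range(1, n_steps + 1):
        b = k * b_step
        # Largest miss over all data points at this frequency.
        err = max(abs(amplitude * math.sin(b * x) - y) for x, y in zip(xs, ys))
        if err < best_err:
            best_b, best_err = b, err
    return best_b, best_err

# Hypothetical data: x-values with irrational ratios, |y| below the amplitude.
xs = [1.0, math.sqrt(2), math.sqrt(3)]
ys = [0.3, -0.5, 0.1]

# Allowing larger frequencies can only improve the best fit found so far.
for b_max in (5.0, 50.0, 500.0):
    b, err = fit_sine_by_scan(xs, ys, amplitude=1.0, b_max=b_max)
    print(f"B <= {b_max:6.1f}: best B = {b:9.3f}, max residual = {err:.4f}")
```

The worst-case residual never increases as the allowed frequency range grows, because each longer scan includes the shorter one. Strictly speaking, this gets the curve as close to the points as you like rather than exactly through them, and adding more points pushes the required B up quickly.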

Now, many common model selection methods rely explicitly on the parameter count to estimate a model’s potential to overfit. For example, if k is the number of parameters in a model, N is the number of data points, and L is the maximized log likelihood of the data given the model, we have (rescaling the usual definitions by a factor of –1/2, so that higher scores are better):

AIC = L – k
BIC = L – (k/2)·log(N)

The sine example is a spectacular failure of parameter count to do the job AIC and BIC ask of it. Evidently parameter count is too blunt an instrument, and we need something with more nuance.
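To make the penalty terms concrete, here is a small sketch of the two scores in the convention written above, where higher is better. The function names and the log likelihood values are made up for illustration. The point is only that the penalty depends on k and N and on nothing else about the model.

```python
import math

def aic_score(log_likelihood, k):
    """AIC in the convention above: log likelihood minus parameter count."""
    return log_likelihood - k

def bic_score(log_likelihood, k, n):
    """BIC in the convention above: log likelihood minus (k/2) * log(n)."""
    return log_likelihood - 0.5 * k * math.log(n)

# Hypothetical fits to the same N = 10 points; both models have k = 2.
n = 10
models = {
    "straight line": {"log_likelihood": -14.2, "k": 2},
    "A sin(Bx)": {"log_likelihood": 25.0, "k": 2},
}

for name, m in models.items():
    aic = aic_score(m["log_likelihood"], m["k"])
    bic = bic_score(m["log_likelihood"], m["k"], n)
    print(f"{name:14s} AIC = {aic:8.3f}  BIC = {bic:8.3f}")
```

Both models pay exactly the same penalty, so whichever one reaches the higher likelihood wins, however wildly it has to oscillate to get there.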

One more example.

For any set of data, if you can fit a curve perfectly through every data point, and if your measurement error σ is an adjustable parameter, then you can take σ to zero to get a fit with infinite accuracy. When we then evaluate the likelihood of the data, we find it running off to infinity! Thus our ‘fit to data’ term L goes to infinity, while the model complexity penalty stays a small finite number.
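The divergence is easy to see numerically. The sketch below assumes independent Gaussian measurement noise, which the text does not specify, and uses made-up residuals. With zero residuals the log likelihood grows without bound as σ shrinks, while even a small misfit is punished severely.

```python
import math

def gaussian_log_likelihood(residuals, sigma):
    """Log likelihood of residuals under i.i.d. Gaussian noise (assumed noise model).

    L = -n*log(sigma) - (n/2)*log(2*pi) - sum(r^2) / (2*sigma^2)
    """
    n = len(residuals)
    sum_sq = sum(r * r for r in residuals)
    return -n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi) - sum_sq / (2 * sigma**2)

perfect_fit = [0.0] * 10    # curve passes exactly through all 10 points
imperfect_fit = [0.1] * 10  # curve misses every point by 0.1

for sigma in (1.0, 1e-2, 1e-4, 1e-8):
    l_perfect = gaussian_log_likelihood(perfect_fit, sigma)
    l_imperfect = gaussian_log_likelihood(imperfect_fit, sigma)
    print(f"sigma = {sigma:8.0e}: perfect fit L = {l_perfect:10.2f}, imperfect fit L = {l_imperfect:.3e}")
```

For the perfect fit only the -n·log(σ) term changes as σ shrinks, so L climbs without limit, while a penalty that counts σ as one more parameter stays fixed.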

Once again, we see the same lack of nuance dragging us into trouble. The number of parameters might do well at estimating overfitting potential for some types of well-behaved parameters, but it clearly doesn’t do the job universally. What we want is some measure that is sensitive to the potential for some parameters to capture “more” of the space of all possible distributions than others.

And lo and behold, we have such a measure! This is the purpose of information geometry, which uses the volume of a model in the space defined by the Fisher information metric as the penalty for overfitting potential. You can learn more about it in a post I wrote here.