# A simple probability puzzle

In front of you is an urn containing some unknown quantity of balls. These balls are labeled 1, 2, 3, etc. They’ve been jumbled about so as to be in no particular order within the urn. You initially consider it equally likely that the urn contains 1 ball as that it contains 2 balls, 3 balls, and so on, up to 100 balls, which is the maximum capacity of the urn.

Now you reach in to draw out a ball and read the number on it: 34. What is the most likely theory for how many balls the urn contains?

(…)

(Think of an answer before reading on.)

(…)

The answer turns out to be 34!

Hopefully this is a little unintuitive. Specifically, what seems wrong is that you draw out a ball and then conclude that this is the ball with the largest value on it. Shouldn’t extreme results be unlikely? But remember, the balls were randomly jumbled about inside the urn. So whether or not the number on the ball you drew is at the beginning, middle, or end of the set of numbers is pretty much irrelevant.

What is relevant is the likelihood: Pr(There are N balls | I drew a ball numbered 34). And the value of this is simply 1/N.

In general, comparing the theory that there are N balls to the theory that there are M balls, we look at the likelihood ratio: Pr(There are N balls | I drew a ball numbered 34) / Pr(There are M balls | I drew a ball numbered 34). This is simply M/N.

Thus we see that our prior odds get updated by a factor that favors smaller values of N, as long as N ≥ 34. The likelihood is zero up to N = 33, maxes at 34, and then decreases steadily after it as N goes to infinity. Since our prior was evenly spread out between N = 1 and 100 and zero everywhere else, our posterior will be peaked at 34 and decline until 100, after which it will drop to zero.

One way to make this result seem more intuitive is to realize that while strictly speaking the most probable number of balls in the urn is 34, it’s not that much more probable than 35 or 36. The actual probability of 34 is still quite small, it just happens to be a little bit more probable than its larger neighbors. And indeed, for larger values of the maximum capacity of the urn, the relative difference between the posterior probability of 34 and that of 35 decreases.

# Ramanujan’s infinite radicals

Here’s one of my favorite puzzles.

The Problem

Solve the following equation: Background

This problem was initially posed by the most brilliant mathematician of the past century, an eccentric prodigy named Srinivasa Ramanujan. He submitted it to the Journal of Indian Mathematical Society without a solution, challenging his fellow mathematicians to solve it. Eventually, when nobody could do it, he gave in and revealed the answer along with a simple proof that probably put them all to shame.

Solution

Ok, so seriously, try this for yourself before reading on. Most people haven’t ever done anything with infinite nested square roots, so this is a good opportunity to be creative and come up with weird ways of solving the problem. Play around with it, give random things names, and see what you can do!

You’ll notice (as I did when I first tried working it out) that solving nested radicals can be really really hard. Things quickly get really ugly. Luckily, the answer is beautifully simple. It’s just 3!

Let’s see why. We start with a very simple identity: Taking the square root, we get If we just plug in a few integer values, we see the following: What happens if we plug in the value of 4 from the second line to the 4 in the first? We get Hm, already looking familiar! We’re not done yet – we now plug in the value of 5 from the third line to this new equation, obtaining: And once more: And etc off to infinity. And that’s the proof!

Generalizing

Now, Ramanujan actually proved a much more general result than this. I’ll prove a less general result than his, but still more general than what we just saw.

The generalization comes from noticing that our initial identity didn’t have to involve adding 1 in (x+1)2. We could have added any number to x, squared it, and then gotten a brand new general theorem about a whole class of continued root problems. or… Now we perform substitutions exactly as we did above: In the end, what we get is a solution for a whole class of initially very difficult seeming problems: And if we plug in m = 1 and n = 2, we get our initial problem!

# Quantum Computing in a Nutshell

You can think about classical bits as like little switches. They have two possible states, ON and OFF, and are in either one or the other. Then we have logic gates, which are like operations we perform on the switches. If we have one switch, we might flip it from whatever state it’s in to the opposite state. If we have two, we might flip the second switch conditional on the first switch being ON, and otherwise do nothing. And so on. Put together a bunch of these bits and gates and you get an algorithm. If you arrange them cleverly enough, you end up getting interesting computations like adding binary numbers and factoring primes and playing Skyrim.

Quantum computing is a fundamentally different beast. To start with, the fundamental units of quantum computation are non-deterministic. We call them qubits, for quantum bits. While a qubit can, like a classical bit, be in the state ON and the state OFF, it can also be in other more exotic states: Important! Just like a classical bit, a qubit will only ever be observed in two possible states, ON and OFF. We never observe a qubit being in the state “36% ON and 64% OFF.” But if we prepare a qubit in that particular state, and then measure it over and over again, we could find that 36% of the time it is ON and 64% of the time it is OFF. That is, the state of a qubit specifies the probability distribution over possible observed outcomes.

Let’s compare the complexity of bits to the complexity of qubits.

To specify the state of a classical bit, we only need to answer a single yes or no question: Is it ON?

To specify the state of a quantum bit, we need to answer a more complicated question: What is the probability that it is ON?

Since probabilities are real numbers, answering the second question requires infinitely more information than answering the first. (In general, specifying a real number to perfect precision requires answering an unbounded number of yes or no questions.) The implication of this is that in some sense, the state of a qubit contains infinitely more information than the state of a classical bit. The trick of quantum computing is to exploit this enhanced information capacity in order to build quantum algorithms that beat out the best classical algorithms.

Moving on! The set of all states that a qubit can be in is the set of all probability distributions over the outcomes ON and OFF. Now, we could describe a simple probabilistic extension to classical computation, where logic gates transform probability distributions into different probability distributions, and algorithms neatly combine these gates to get useful computations, and be done with it. But it turns out that things are a slight bit more complicated than this. Having described this probabilistic extension, we would still not have quantum computing.

In fact, the states of qubits are specified not by probability distributions over the observations ON and OFF, but an amplitude distribution over these observations.

What are these amplitudes? An amplitude is simply a thing that you square to get the probability of the observation. Amplitudes have a wider range of possible values than probabilities. While a probability has to be greater than zero, an amplitude can be negative (since the square of a negative number will still be positive). In fact, amplitudes can even be complex numbers, in which case the square of the amplitude is just the square of its distance from zero in the complex plane. Now, why does it matter that the set of quantum states be described by amplitudes rather than probabilities? After all, the amplitudes will always just be converted back to probabilities when we observe the qubits, so what difference does it make if the probability came from a negative amplitude or a positive one? The answer to this question is difficult, but it comes down to this:

For some reason, it appears that on the quantum level, the universe calculates the states of particles in terms of these complex numbers we call amplitudes, not probabilities. This was the discovery of experimentalists in the 1900s that looked at things like the double slit experiment, the Stern-Gerlach experiment, and so on. If you try to just analyze everything in terms of probabilities instead of amplitudes, you will get the wrong answer.

We’ll write the state of a qubit that has an amplitude alpha of being ON and an amplitude beta of being OFF as follows: It’s useful to have a few different ways of visualizing the state of a qubit. Now, quantum gates are transformations that take amplitude distributions to different amplitude distributions. The set of physically realizable quantum gates is the set of all transformations that take normalized states (|α|² + |β|² = 1) to other normalized states. Some of these gates are given special names like the Hadamard gate or the CNOT gate and crop up all the time. And just like with classical states, you can put together these gates in clever ways to construct algorithms that do interesting computations for you.

So, that’s quantum computing!

Now, the first thing to notice is that the set of all things that quantum computers can do must contain the set of all things that classical computers can do. Why? Because classical computers are just a special case of quantum computers where states are only allowed to be either ON or OFF. Every classical logic gate has a parallel quantum logic gate, and so every classical algorithm is a quantum algorithm is disguise.

But quantum computers can also do more than classical computers. In the next posts, I’ll give two examples of quantum algorithms we’ve discovered that solve problems at speeds that are classically impossible: the Deutsch-Jozsa algorithm and Grover’s algorithm. Stay tuned!

# Short and sweet proof of the f(xy) = f(x) + f(y) logarithmic property

If you want a continuous function f(x) from the reals to the reals that has the property that for all real x and y, f(xy) = f(x) + f(y), then this function must take the form f(x) = k log(x) for some real k.

A proof of this just popped into my head in the shower. (As always with shower-proofs, it was slightly wrong, but I worked it out and got it right after coming out).

I haven’t seen it anywhere before, and it’s a lot simpler than previous proofs that I’ve encountered.

Here goes:

f(xy) = f(x) + f(y)

differentiate w.r.t. x…
f'(xy) y = f'(x)

differentiate w.r.t. y…
f”(xy) xy + f'(xy) = 0

rearrange, and rename xy to z…
f”(z) = -f'(z)/z

solve for f'(z) with standard 1st order DE techniques…
df’/f’ = – dz/z
log(f’) = -log(z) + constant
f’ = constant/z

integrate to get f…
f(z) = k log(z) for some constant k

And that’s the whole proof!

As for why this is interesting to me… the equation f(xy) = f(x) + f(y) is very easy to arrive at in constructing functions with desirable features. In words, it means that you want the function’s outputs to be additive when the inputs are multiplicative.

One example of this, which I’ve written about before, is formally quantifying our intuitive notion of surprise. We formalize surprise by asking the question: How surprised should you be if you observe an event that you thought had a probability P? In other words, we treat surprise as a function that takes in a probability and returns a scalar value.

We can lay down a few intuitive desideratum for our formalization of surprise, and one such desideratum is that for independent events E and F, our surprise at them both happening should just be the sum of the surprise at each one individually. In other words, we want surprise to be additive for independent events E and F.

But if E and F are independent, then the joint probability P(E, F) is just the product of the individual probabilities: P(E, F) = P(E) P(F). In other words, we want our outputs to be additive, when our inputs are multiplicative!

This automatically gives us that the form of our surprise function must be k log(z). To spell it out explicitly…

Desideratum: Surprise(P(E, F)) = Surprise(P(E)) + Surprise(P(F))

But P(E,F) = P(E) P(F), so…
Surprise(P(E) P(F)) = Surprise(P(E)) + Surprise(P(F))

Renaming P(E) to x and P(F) to y…
Surprise(xy) = Surprise(x) + Surprise(y)

Thus, by the above proof…
Surprise(x) = k log(x) for some constant k

That’s a pretty strong constraint for some fairly weak inputs!

That’s basically why I find this interesting: it’s a strong constraint that comes out of an intuitively weak condition.

# Constructing the world

In this six and a half hour lecture series by David Chalmers, he describes the concept of a minimal set of statements from which all other truths are a priori “scrutable” (meaning, basically, in-principle knowable or derivable).

What are the types of statements in this minimal set required to construct the world? Chalmers offers up four categories, and abbreviates this theory PQIT.

# P

P is the set of physical facts (for instance, everything that would be accessible to a Laplacean demon). It can be thought of as essentially the initial conditions of the universe and the laws governing their changes over time.

# Q

Q is the set of facts about qualitative experience. We can see Chalmers’ rejection of physicalism here, as he doesn’t think that Q is eclipsed within P. Example of a type of statement that cannot be derived from P without Q: “There is a beige region in the bottom right of my visual field.”

# I

Here’s a true statement: “I’m in the United States.” Could this be derivable from P and Q? Presumably not; we need another set of indexical truths that allows us to have “self-locating” beliefs and to engage in anthropic reasoning.

# T

Suppose that P, Q, and I really are able to capture all the true statements there are to be captured. Well then, the statement “P, Q, and I really are able to capture all the true statements there are to be captured” is a true statement, and it is presumably not captured by P, Q, and I! In other words, we need some final negative statements that tell us that what we have is enough, and that there are no more truths out there. These “that’s all”-type statements are put into the set T.

⁂⁂⁂

So this is a basic sketch of Chalmer’s construction. I like that we can use these tags like PQIT or PT or QIT as a sort of philosophical zip-code indicating the core features of a person’s philosophical worldview. I also want to think about developing this further. What other possible types of statements are there out there that may not be captured in PQIT? Here is a suggestion for a more complete taxonomy:

p    microphysics
P    macrophysics (by which I mean all of science besides fundamental physics)
Q    consciousness
R    normative rationality
E
normative ethics
C    counterfactuals
L
mathematical / logical truths
I     indexicals
T    “that’s-all” statements

I’ve split P into big-P (macrophysics) and little-p (microphysics) to account for the disagreements about emergence and reductionism. Normativity here is broad enough to include both normative epistemic statements (e.g. “You should increase your credence in the next coin toss landing H after observing it land H one hundred times in a row”) and ethical statements. The others are fairly self-explanatory.

The most ontologically extravagant philosophical worldview would then be characterized as pPQRECLIT.

My philosophical address is pRLIT (with the addendum that I think C comes from p, and am really confused about Q). What’s yours?

# How to sort a list of numbers

It’s funny how some of the simplest questions you can think of actually have interesting and nontrivial answers.

Take this question, for instance. If I hand you a list of N numbers, how can you quickly sort this list from smallest to largest? Suppose that you are only capable of sorting two numbers at a time.

You might immediately think of some obvious ways to solve this problem. For instance, you could just start at the beginning of the list and repeatedly traverse it, swapping numbers if they’re in the wrong order. This will guarantee that each pass-through, the largest number out of those not yet sorted will “bubble up” to the top. This algorithm is called Bubble Sort.

The problem is that for a sizable list, this will take a long long time. On average, for a list of size N, Bubble Sort takes O(N^2) steps to finish sorting. Can you think of a faster algorithm?

It turns out that we can go much faster with some clever algorithms – from O(N^2) to O(N logN). If you have a little time to burn, here are some great basic videos describing and visualizing different sorting techniques:

## Visualizations

Beautiful visualization of Bubble, Shell, and Quick Sort

(I’d be interested to know why the visualization of Bubble Sort in the final frame gave a curve of unsorted values that looked roughly logarithmic)

Visualization of 9 sorting algorithms

Auditory presentation of 15 sorting algorithms

## Bonus clip…

Cool proof of the equivalence of Selection Sort and Insertion Sort

## Searching a list

By the way, how about if you have a sorted list and want to find a particular value in it? If we’re being dumb, we could just ignore the fact that our list came pre-sorted and search through every element on the list in order. We’d find our value in O(N). We can go much faster by just starting in the middle of our list, looking at the size of the middle value to determine which half of the list to look at next, and then repeating with the appropriate half of the list. This is called binary search and takes O(logN), a massive speedup.

Now, let’s look back at the problem of finding a particular value in an unsorted list. Can we think of any techniques to accomplish this more quickly than O(N)?

No. Try as you might, as long as the list has no structure for you to exploit, your optimal performance will be limited to O(N).

Or so everybody thought until quantum mechanics showed up and blew our minds.

It turns out that a quantum computer would be able to solve this exact problem – searching through an unsorted list – in only O(√N) steps!

This should seem impossible to you. If a friend hands you a random jumbled list of 10,000 names and asks you to locate one particular name on the list, you’re going to end up looking through 5,000 names on average. There’s no clever trick you can use to speed up the process. Except that quantum mechanics says you’re wrong! In fact, if you were a quantum computer, you’d only have to search through ~100 names to find your name.

This quantum algorithm is called Grover’s algorithm, and I want to write up a future post about it. But that’s not even the coolest part! It turns out that if a non-local hidden variable interpretation of quantum mechanics is correct, then a quantum computer could search an N-item database in O(∛N)! You can imagine the triumphal feeling of the hidden variables people if one day they managed to build a quantum computer that simultaneously proved them right and broke the record for search algorithm speed!

# More model selection visualizations

I added cross validation to my model selection program, and ooh boy do I now understand why people want to find more efficient alternatives.

Reliable CV calculations end up making the program run orders of magnitude slower, as they require re-fitting the curve for every partition of your data and this is the most resource-intensive part of the process. While it’s beautiful in theory, this is a major set-back in practice.

For a bunch of different true distributions I tried, I found that CV pretty much always lines up with all of the other methods, just with a more “jagged” curve and a steeper complexity penalty. This similarity looks like a score for the other methods, assuming that CV does a good job of measuring predictive power. (And for those interested in the technical details, the cross validation procedure implemented here is leave-one-out CV)

I also added an explicit calculation of DKL, which should help to give some idea of a benchmark against which we can measure all of these methods. Anyway, I have some more images!

True curve = e-x/3 – ex/3

N = 100 data points N = 100 data points, zoomed in a bit For smaller data sets, you can see that AICc tracks DKL much more closely than any other technique (which is of course the entire purpose of AICc).

N = 25 N = 25, zoomed N = 25, super zoom Interestingly, you start to see BIC really suffering relative to the other methods, beginning to overfit the data. This is counterintuitive; BIC is supposed to be the method that penalizes complexity excessively and underfits the data. Relevant is that in this program, I use the form of BIC that doesn’t approximate for large N.

BICsmall N = k log(N/2π) – 2L
BICordinary = k log(N) – 2L

When I use the approximation instead, the problem disappears. Of course, this is not too great a solution; why should the large N approximation be necessary for fixing BIC specifically when N is small?

(Edit: after many more runs, it’s looking more like it may have just been a random accident that BIC overfit in the first few runs)

Just for the heck of it, I also decided to torture my polynomials a little bit by making them try to fit the function 1/x. I got dismal results no matter how I tried to tweak the hyper-parameters, which is, well, pretty much what I expected (1/x has no Taylor expansion around 0, for one thing).

More surprisingly, I tried fitting a simple Gaussian curve and again got fairly bad results. The methods disagreed with one another a lot of the time (although weirdly, AICc and BIC seemed to generally be in agreement), and gave curves that were overfitting the data a bit. The part that seems hardest for a polynomial to nail down is the flattened ends of the Gaussian curve.

True curve = 40 exp(-x2/2), N = 100 data points And zoomed in… If the jaggedness of the cross validation score is not purely an artifact of random fluctuations in the data, I don’t really get it. Why should, for example, a 27-parameter model roughly equal a 25-parameter model in predictive power, but a 26-parameter model be significantly worse than both?