Ramanujan’s infinite radicals

Here’s one of my favorite puzzles.

The Problem

Solve the following equation:

Screen Shot 2018-07-28 at 7.00.13 AM


This problem was initially posed by the most brilliant mathematician of the past century, an eccentric prodigy named Srinivasa Ramanujan. He submitted it to the Journal of Indian Mathematical Society without a solution, challenging his fellow mathematicians to solve it. Eventually, when nobody could do it, he gave in and revealed the answer along with a simple proof that probably put them all to shame.


Ok, so seriously, try this for yourself before reading on. Most people haven’t ever done anything with infinite nested square roots, so this is a good opportunity to be creative and come up with weird ways of solving the problem. Play around with it, give random things names, and see what you can do!

You’ll notice (as I did when I first tried working it out) that solving nested radicals can be really really hard. Things quickly get really ugly. Luckily, the answer is beautifully simple. It’s just 3!

Let’s see why. We start with a very simple identity:

Screen Shot 2018-07-28 at 7.20.54 AM

Taking the square root, we get

Screen Shot 2018-07-28 at 7.21.02 AM

If we just plug in a few integer values, we see the following:

Screen Shot 2018-07-28 at 7.16.28 AM.png

What happens if we plug in the value of 4 from the second line to the 4 in the first? We get

Screen Shot 2018-07-28 at 7.18.10 AM.png

Hm, already looking familiar! We’re not done yet – we now plug in the value of 5 from the third line to this new equation, obtaining:

Screen Shot 2018-07-28 at 7.19.37 AM.png

And once more:

Screen Shot 2018-07-28 at 7.22.31 AM.png

And etc off to infinity.

Screen Shot 2018-07-28 at 7.42.25 AM.png

And that’s the proof!


Now, Ramanujan actually proved a much more general result than this. I’ll prove a less general result than his, but still more general than what we just saw.

The generalization comes from noticing that our initial identity didn’t have to involve adding 1 in (x+1)2. We could have added any number to x, squared it, and then gotten a brand new general theorem about a whole class of continued root problems.

Thus we start with the identity

Screen Shot 2018-07-28 at 7.30.03 AM


Screen Shot 2018-07-28 at 7.30.23 AM.png

Now we perform substitutions exactly as we did above:


In the end, what we get is a solution for a whole class of initially very difficult seeming problems:

Screen Shot 2018-07-28 at 7.39.58 AM.png

And if we plug in m = 1 and n = 2, we get our initial problem!

Quantum Computing in a Nutshell

You can think about classical bits as like little switches. They have two possible states, ON and OFF, and are in either one or the other.


Then we have logic gates, which are like operations we perform on the switches. If we have one switch, we might flip it from whatever state it’s in to the opposite state. If we have two, we might flip the second switch conditional on the first switch being ON, and otherwise do nothing. And so on.


Put together a bunch of these bits and gates and you get an algorithm. If you arrange them cleverly enough, you end up getting interesting computations like adding binary numbers and factoring primes and playing Skyrim.

Quantum computing is a fundamentally different beast. To start with, the fundamental units of quantum computation are non-deterministic. We call them qubits, for quantum bits. While a qubit can, like a classical bit, be in the state ON and the state OFF, it can also be in other more exotic states:


Important! Just like a classical bit, a qubit will only ever be observed in two possible states, ON and OFF. We never observe a qubit being in the state “36% ON and 64% OFF.” But if we prepare a qubit in that particular state, and then measure it over and over again, we could find that 36% of the time it is ON and 64% of the time it is OFF. That is, the state of a qubit specifies the probability distribution over possible observed outcomes.

Let’s compare the complexity of bits to the complexity of qubits.

To specify the state of a classical bit, we only need to answer a single yes or no question: Is it ON?

To specify the state of a quantum bit, we need to answer a more complicated question: What is the probability that it is ON?

Since probabilities are real numbers, answering the second question requires infinitely more information than answering the first. (In general, specifying a real number to perfect precision requires answering an unbounded number of yes or no questions.) The implication of this is that in some sense, the state of a qubit contains infinitely more information than the state of a classical bit. The trick of quantum computing is to exploit this enhanced information capacity in order to build quantum algorithms that beat out the best classical algorithms.

Moving on! The set of all states that a qubit can be in is the set of all probability distributions over the outcomes ON and OFF.


Now, we could describe a simple probabilistic extension to classical computation, where logic gates transform probability distributions into different probability distributions, and algorithms neatly combine these gates to get useful computations, and be done with it. But it turns out that things are a slight bit more complicated than this. Having described this probabilistic extension, we would still not have quantum computing.

In fact, the states of qubits are specified not by probability distributions over the observations ON and OFF, but an amplitude distribution over these observations.

What are these amplitudes? An amplitude is simply a thing that you square to get the probability of the observation. Amplitudes have a wider range of possible values than probabilities. While a probability has to be greater than zero, an amplitude can be negative (since the square of a negative number will still be positive). In fact, amplitudes can even be complex numbers, in which case the square of the amplitude is just the square of its distance from zero in the complex plane.


Now, why does it matter that the set of quantum states be described by amplitudes rather than probabilities? After all, the amplitudes will always just be converted back to probabilities when we observe the qubits, so what difference does it make if the probability came from a negative amplitude or a positive one? The answer to this question is difficult, but it comes down to this:

For some reason, it appears that on the quantum level, the universe calculates the states of particles in terms of these complex numbers we call amplitudes, not probabilities. This was the discovery of experimentalists in the 1900s that looked at things like the double slit experiment, the Stern-Gerlach experiment, and so on. If you try to just analyze everything in terms of probabilities instead of amplitudes, you will get the wrong answer.

We’ll write the state of a qubit that has an amplitude alpha of being ON and an amplitude beta of being OFF as follows:


It’s useful to have a few different ways of visualizing the state of a qubit.


Now, quantum gates are transformations that take amplitude distributions to different amplitude distributions.


The set of physically realizable quantum gates is the set of all transformations that take normalized states (|α|² + |β|² = 1) to other normalized states. Some of these gates are given special names like the Hadamard gate or the CNOT gate and crop up all the time. And just like with classical states, you can put together these gates in clever ways to construct algorithms that do interesting computations for you.

So, that’s quantum computing!

Now, the first thing to notice is that the set of all things that quantum computers can do must contain the set of all things that classical computers can do. Why? Because classical computers are just a special case of quantum computers where states are only allowed to be either ON or OFF. Every classical logic gate has a parallel quantum logic gate, and so every classical algorithm is a quantum algorithm is disguise.

But quantum computers can also do more than classical computers. In the next posts, I’ll give two examples of quantum algorithms we’ve discovered that solve problems at speeds that are classically impossible: the Deutsch-Jozsa algorithm and Grover’s algorithm. Stay tuned!

Short and sweet proof of the f(xy) = f(x) + f(y) logarithmic property

If you want a continuous function f(x) from the reals to the reals that has the property that for all real x and y, f(xy) = f(x) + f(y), then this function must take the form f(x) = k log(x) for some real k.

A proof of this just popped into my head in the shower. (As always with shower-proofs, it was slightly wrong, but I worked it out and got it right after coming out).

I haven’t seen it anywhere before, and it’s a lot simpler than previous proofs that I’ve encountered.

Here goes:

f(xy) = f(x) + f(y)

differentiate w.r.t. x…
f'(xy) y = f'(x)

differentiate w.r.t. y…
f”(xy) xy + f'(xy) = 0

rearrange, and rename xy to z…
f”(z) = -f'(z)/z

solve for f'(z) with standard 1st order DE techniques…
df’/f’ = – dz/z
log(f’) = -log(z) + constant
f’ = constant/z

integrate to get f…
f(z) = k log(z) for some constant k

And that’s the whole proof!

As for why this is interesting to me… the equation f(xy) = f(x) + f(y) is very easy to arrive at in constructing functions with desirable features. In words, it means that you want the function’s outputs to be additive when the inputs are multiplicative.

One example of this, which I’ve written about before, is formally quantifying our intuitive notion of surprise. We formalize surprise by asking the question: How surprised should you be if you observe an event that you thought had a probability P? In other words, we treat surprise as a function that takes in a probability and returns a scalar value.

We can lay down a few intuitive desideratum for our formalization of surprise, and one such desideratum is that for independent events E and F, our surprise at them both happening should just be the sum of the surprise at each one individually. In other words, we want surprise to be additive for independent events E and F.

But if E and F are independent, then the joint probability P(E, F) is just the product of the individual probabilities: P(E, F) = P(E) P(F). In other words, we want our outputs to be additive, when our inputs are multiplicative!

This automatically gives us that the form of our surprise function must be k log(z). To spell it out explicitly…

Desideratum: Surprise(P(E, F)) = Surprise(P(E)) + Surprise(P(F))

But P(E,F) = P(E) P(F), so…
Surprise(P(E) P(F)) = Surprise(P(E)) + Surprise(P(F))

Renaming P(E) to x and P(F) to y…
Surprise(xy) = Surprise(x) + Surprise(y)

Thus, by the above proof…
Surprise(x) = k log(x) for some constant k

That’s a pretty strong constraint for some fairly weak inputs!

That’s basically why I find this interesting: it’s a strong constraint that comes out of an intuitively weak condition.

Constructing the world

In this six and a half hour lecture series by David Chalmers, he describes the concept of a minimal set of statements from which all other truths are a priori “scrutable” (meaning, basically, in-principle knowable or derivable).

What are the types of statements in this minimal set required to construct the world? Chalmers offers up four categories, and abbreviates this theory PQIT.


P is the set of physical facts (for instance, everything that would be accessible to a Laplacean demon). It can be thought of as essentially the initial conditions of the universe and the laws governing their changes over time.


Q is the set of facts about qualitative experience. We can see Chalmers’ rejection of physicalism here, as he doesn’t think that Q is eclipsed within P. Example of a type of statement that cannot be derived from P without Q: “There is a beige region in the bottom right of my visual field.”


Here’s a true statement: “I’m in the United States.” Could this be derivable from P and Q? Presumably not; we need another set of indexical truths that allows us to have “self-locating” beliefs and to engage in anthropic reasoning.


Suppose that P, Q, and I really are able to capture all the true statements there are to be captured. Well then, the statement “P, Q, and I really are able to capture all the true statements there are to be captured” is a true statement, and it is presumably not captured by P, Q, and I! In other words, we need some final negative statements that tell us that what we have is enough, and that there are no more truths out there. These “that’s all”-type statements are put into the set T.


So this is a basic sketch of Chalmer’s construction. I like that we can use these tags like PQIT or PT or QIT as a sort of philosophical zip-code indicating the core features of a person’s philosophical worldview. I also want to think about developing this further. What other possible types of statements are there out there that may not be captured in PQIT? Here is a suggestion for a more complete taxonomy:

p    microphysics
P    macrophysics (by which I mean all of science besides fundamental physics)
Q    consciousness
R    normative rationality
normative ethics
C    counterfactuals
mathematical / logical truths
I     indexicals
T    “that’s-all” statements

I’ve split P into big-P (macrophysics) and little-p (microphysics) to account for the disagreements about emergence and reductionism. Normativity here is broad enough to include both normative epistemic statements (e.g. “You should increase your credence in the next coin toss landing H after observing it land H one hundred times in a row”) and ethical statements. The others are fairly self-explanatory.

The most ontologically extravagant philosophical worldview would then be characterized as pPQRECLIT.

My philosophical address is pRLIT (with the addendum that I think C comes from p, and am really confused about Q). What’s yours?

How to sort a list of numbers

It’s funny how some of the simplest questions you can think of actually have interesting and nontrivial answers.

Take this question, for instance. If I hand you a list of N numbers, how can you quickly sort this list from smallest to largest? Suppose that you are only capable of sorting two numbers at a time.

You might immediately think of some obvious ways to solve this problem. For instance, you could just start at the beginning of the list and repeatedly traverse it, swapping numbers if they’re in the wrong order. This will guarantee that each pass-through, the largest number out of those not yet sorted will “bubble up” to the top. This algorithm is called Bubble Sort.

The problem is that for a sizable list, this will take a long long time. On average, for a list of size N, Bubble Sort takes O(N^2) steps to finish sorting. Can you think of a faster algorithm?

It turns out that we can go much faster with some clever algorithms – from O(N^2) to O(N logN). If you have a little time to burn, here are some great basic videos describing and visualizing different sorting techniques:

Comparisons of Merge, Quick, Heap, Bubble and Insertion Sort


Beautiful visualization of Bubble, Shell, and Quick Sort

(I’d be interested to know why the visualization of Bubble Sort in the final frame gave a curve of unsorted values that looked roughly logarithmic)

Visualization of 9 sorting algorithms

Auditory presentation of 15 sorting algorithms

Bonus clip…

Cool proof of the equivalence of Selection Sort and Insertion Sort

Searching a list

By the way, how about if you have a sorted list and want to find a particular value in it? If we’re being dumb, we could just ignore the fact that our list came pre-sorted and search through every element on the list in order. We’d find our value in O(N). We can go much faster by just starting in the middle of our list, looking at the size of the middle value to determine which half of the list to look at next, and then repeating with the appropriate half of the list. This is called binary search and takes O(logN), a massive speedup.

Now, let’s look back at the problem of finding a particular value in an unsorted list. Can we think of any techniques to accomplish this more quickly than O(N)?

No. Try as you might, as long as the list has no structure for you to exploit, your optimal performance will be limited to O(N).

Or so everybody thought until quantum mechanics showed up and blew our minds.

It turns out that a quantum computer would be able to solve this exact problem – searching through an unsorted list – in only O(√N) steps! 

This should seem impossible to you. If a friend hands you a random jumbled list of 10,000 names and asks you to locate one particular name on the list, you’re going to end up looking through 5,000 names on average. There’s no clever trick you can use to speed up the process. Except that quantum mechanics says you’re wrong! In fact, if you were a quantum computer, you’d only have to search through ~100 names to find your name.

This quantum algorithm is called Grover’s algorithm, and I want to write up a future post about it. But that’s not even the coolest part! It turns out that if a non-local hidden variable interpretation of quantum mechanics is correct, then a quantum computer could search an N-item database in O(∛N)! You can imagine the triumphal feeling of the hidden variables people if one day they managed to build a quantum computer that simultaneously proved them right and broke the record for search algorithm speed!

More model selection visualizations

I added cross validation to my model selection program, and ooh boy do I now understand why people want to find more efficient alternatives.

Reliable CV calculations end up making the program run orders of magnitude slower, as they require re-fitting the curve for every partition of your data and this is the most resource-intensive part of the process. While it’s beautiful in theory, this is a major set-back in practice.

For a bunch of different true distributions I tried, I found that CV pretty much always lines up with all of the other methods, just with a more “jagged” curve and a steeper complexity penalty. This similarity looks like a score for the other methods, assuming that CV does a good job of measuring predictive power. (And for those interested in the technical details, the cross validation procedure implemented here is leave-one-out CV)

I also added an explicit calculation of DKL, which should help to give some idea of a benchmark against which we can measure all of these methods. Anyway, I have some more images!

True curve = e-x/3 – ex/3

N = 100 data points


N = 100 data points, zoomed in a bit


For smaller data sets, you can see that AICc tracks DKL much more closely than any other technique (which is of course the entire purpose of AICc).

N = 25


N = 25, zoomed


N = 25, super zoom


Interestingly, you start to see BIC really suffering relative to the other methods, beginning to overfit the data. This is counterintuitive; BIC is supposed to be the method that penalizes complexity excessively and underfits the data. Relevant is that in this program, I use the form of BIC that doesn’t approximate for large N.

BICsmall N = k log(N/2π) – 2L
BICordinary = k log(N) – 2L

When I use the approximation instead, the problem disappears. Of course, this is not too great a solution; why should the large N approximation be necessary for fixing BIC specifically when N is small?

 (Edit: after many more runs, it’s looking more like it may have just been a random accident that BIC overfit in the first few runs)

Just for the heck of it, I also decided to torture my polynomials a little bit by making them try to fit the function 1/x. I got dismal results no matter how I tried to tweak the hyper-parameters, which is, well, pretty much what I expected (1/x has no Taylor expansion around 0, for one thing).

More surprisingly, I tried fitting a simple Gaussian curve and again got fairly bad results. The methods disagreed with one another a lot of the time (although weirdly, AICc and BIC seemed to generally be in agreement), and gave curves that were overfitting the data a bit. The part that seems hardest for a polynomial to nail down is the flattened ends of the Gaussian curve.

True curve = 40 exp(-x2/2), N = 100 data points


And zoomed in…


If the jaggedness of the cross validation score is not purely an artifact of random fluctuations in the data, I don’t really get it. Why should, for example, a 27-parameter model roughly equal a 25-parameter model in predictive power, but a 26-parameter model be significantly worse than both?


Journey into abstraction

A mathematician is asked to design a table. He first designs a table with no legs. Then he designs a table with infinitely many legs. He spend the rest of his life generalizing the results for the table with N legs (where N is not necessarily a natural number).

A key feature of mathematics that makes it so variously fun or irritating (depending on who you are) is the tendency to abstract away from an initially practically useful question and end up millions of miles from where you started, talking about things that bear virtually no resemblance whatsoever to the topic you started with.

This gives rise to quotes like Feynman’s “Physics is to math what sex is to masturbation” and jokes like the one above.

I want to take you down one of these little rabbit holes of abstraction in this post, starting with factorials.

The definition of the factorial is the following:

N factorial = N! = N ∙ (N – 1) ∙ (N – 2) ∙ (…) ∙ 3 ∙ 2 ∙ 1

This function turns out to be mightily useful in combinatorics. The basic reason for this comes down to the fact that there are N! different ways of putting together N distinct objects into a sequence. The factorial also turns out to be extremely useful in probability theory and in calculus.

Now, the lover of abstraction looks at the factorial and notices that it is only defined for positive integers. We can say that 5! = 5 ∙ 4 ∙ 3 ∙ 2 ∙ 1 = 120, but what about something like ½!, or (-√2)!? Enter the Gamma function!

The Gamma function is one of my favorite functions. It’s weird and mysterious, and tremendously useful in a whole bunch of very different areas. Here’s the general definition:

𝚪(n) = ∫ xn e-x dx

(The integral should be taken from 0 to ∞, but I can’t figure out how to get WordPress to allow me to do this. Also, this is actually technically the Gamma function displaced by 1, but the difference won’t become important here.)

This function is the natural generalization of the factorial function! We can prove this in just a few lines:

𝚪(n) = ∫ xn e-x dx
= ∫ n xn – 1 e-x dx
= n 𝚪(n – 1)

𝚪(0) = ∫ e-x dx = 1

This is sufficient to prove that 𝚪(n) is equal to n! for all integer values of n, since these two statements uniquely determine the values of the factorials of all positive integers.

n! = n ∙ (n – 1)!
0! = 1

The Gamma function generalizes the factorial not just to all real numbers, but to complex numbers. Not only can you say what ½! and (-√2)! are, you can say what i! is! Here are some of the values of the Gamma function:

(-½)! = √π
½! = √π/2
√2 ! ≈ -3.676
i! ≈ .5 – .15 i

The proof of the first two of these identities is nice, so I’ll lay it out briefly:

(-½)! = ∫ e-x/√x dx
= 2 ∫ e-u∙u dx
= √π

(½)! = ½ (-½)! = √π/2

We can go further and deduce the values of the factorials of (3/2, 5/2, …) by just applying the definition of the factorial.

The function is also beautiful when plotted in the complex plane. Here’s a graph where the color corresponds to the complex value of the Gamma function.



At this point, one can feel that we are already lost in abstraction. Perhaps we had initially thought of the factorial as the quantity that tells you about the possible number of permutations of N distinct symbols. But what does it mean to say that there are about -3.676 ways of permuting √2 items, or (.5 – .15 i) ways of permuting an imaginary number of items?

To further frustrate our intuitions, the value of the Gamma function turns out to be undefined at every negative integer. (Some sense can be made of this by realizing that (-1)! is just 0!/0 = 1/0, which is undefined).


Often in times like these, it suits us to switch our intuitive understanding of the factorial to something that does generalize more nicely. Looking at the factorial in the context of probability theory can be helpful.

In statistical mechanics, it is common to describe distributions over events that individually have virtually zero probability of happening, but of which there are virtually infinite opportunities for them to happen, by a Poisson distribution.

An example of this might be a very unlikely radioactive decay. Suppose that the probability p of a single atom decaying is virtually zero, but the number N of atoms in the system you’re studying is virtually infinite ∞, and yet these balance in such a way that the product p∙N = λ is a finite and manageable number.

The Poisson distribution naturally arises as the answer to the question: What is the probability that N atoms decay in a period of time? The form of the distribution is:

P(n) = λn e / n!

We can now use this to imagine a generalization for a process that doesn’t have to be discrete.

Say we are studying the amount of energy emitted in a system where individual emissions have virtually 0 probability but there are a virtually infinite amount of ways for the energy to be emitted. If the energy can take on any real value, then our distribution requires us to talk about the value of n! for arbitrary real n.

Anyway, let’s put aside the attempt to intuitively ground the Gamma function and talk about an application of this to calculus.

Calculus allows us to ask about the derivative of a function, the integral, the second derivative, the 15th derivative, and etc. Lovers of abstraction naturally began to ask questions like: “What’s the ½th derivative of a function? Are there imaginary derivatives?”

This opened the field known as fractional calculus.

Here’s a brief example of how we might do this.

The first derivative D of xn is n xn – 1
The second derivative Dof xis n∙(n – 1) ∙ xn – 2
The kth derivative Dk of xn is n! / (n – k)! ∙ xn – k

But we know how to generalize this to any non-integer value, because we know how to generalize the factorial function!

So, for instance, the ½th derivative of x turns out to be:

D½[x] = 1! / ½! ∙ x½
= 2/√π ∙ √x

Now, what we’ve presented is an identity that only applies to functions that look like xn. The general definition for a fractional derivative (or integral) is more complicated, but also uses the Gamma function.

What could it mean to talk about the fractional-derivative of a function? Intuitively, derivatives line up with slopes of curves, second derivatives are slopes of slopes, et cetera. What is a half-derivative of a curve?

I have no way to make intuitive sense of this, although fractional derivatives do tend to line up with our intuitive expectations for what an interpolation between functions should look like.

File:Fractional Derivative of Basic Power Function (2014).gif

I’d be interested to know if there are types of functions whose fractional derivatives don’t have this property of smooth transitioning between the ordinary derivatives. Also, I could imagine a generalization of the Taylor approximation, where instead of aligning the integer derivatives of a function, we align all rational derivatives of the function, or all real-valued derivatives from 0 to 1.

The natural next step in this abstraction is to talk about functional calculus, where we can talk about not only arbitrary powers of derivatives, but arbitrary functions of derivatives. In this way we could talk about the logarithm or the sine of a derivative or an integral. But we’ll end the journey into abstraction here, as this ventures into territories that are foreign to me.