In front of you is an urn containing some unknown quantity of balls. These balls are labeled 1, 2, 3, etc. They’ve been jumbled about so as to be in no particular order within the urn. You initially consider it equally likely that the urn contains 1 ball as that it contains 2 balls, 3 balls, and so on, up to 100 balls, which is the maximum capacity of the urn.

Now you reach in to draw out a ball and read the number on it: 34. What is the most likely theory for how many balls the urn contains?

(…)

(Think of an answer before reading on.)

(…)

The answer turns out to be 34!

Hopefully this is a little unintuitive. Specifically, what seems wrong is that you draw out a ball and then conclude that this is the ball with the largest value on it. Shouldn’t extreme results be unlikely? But remember, the balls were randomly jumbled about inside the urn. So whether the number on the ball you drew sits at the beginning, middle, or end of the set of numbers is pretty much irrelevant.

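What *is* relevant is the likelihood: Pr(I drew a ball numbered 34 | There are N balls). For any N ≥ 34, the value of this is simply 1/N, since each of the N balls is equally likely to be drawn. For N < 34 it is zero, since the urn would contain no ball numbered 34.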

In general, comparing the theory that there are N balls to the theory that there are M balls (with both N and M at least 34), we look at the likelihood ratio: Pr(I drew a ball numbered 34 | There are N balls) / Pr(I drew a ball numbered 34 | There are M balls). This is (1/N) / (1/M), which is simply M/N.

Thus we see that our prior odds get updated by a factor that favors smaller values of N, as long as N ≥ 34. The likelihood is zero up to N = 33, maxes at 34, and then decreases steadily after it as N goes to infinity. Since our prior was evenly spread out between N = 1 and 100 and zero everywhere else, our posterior will be peaked at 34 and decline until 100, after which it will drop to zero.
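The update described above is easy to check numerically. Here is a minimal sketch, assuming the uniform prior over 1 to 100 and the 1/N likelihood from the text; the function name `urn_posterior` and its parameters are just illustrative choices, not anything standard.

```python
from fractions import Fraction

def urn_posterior(observed, capacity=100):
    """Posterior over the ball count N after drawing one ball labeled `observed`.

    Assumes a uniform prior on N = 1..capacity and a likelihood of 1/N
    for N >= observed (zero otherwise), as described in the text.
    """
    prior = Fraction(1, capacity)
    unnormalized = {
        n: prior * Fraction(1, n) if n >= observed else Fraction(0)
        for n in range(1, capacity + 1)
    }
    total = sum(unnormalized.values())
    return {n: weight / total for n, weight in unnormalized.items()}

posterior = urn_posterior(34)
mode = max(posterior, key=posterior.get)
print("most probable N:", mode)
print("posterior at 33, 34, 35:",
      float(posterior[33]), float(posterior[34]), float(posterior[35]))
```

Using exact fractions keeps the normalization from picking up rounding error, which makes it easy to confirm that the posterior sums to exactly 1 and is zero below 34.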

One way to make this result seem more intuitive is to realize that while *strictly speaking* the most probable number of balls in the urn is 34, it’s not that much more probable than 35 or 36. The actual probability of 34 is still quite small; it just happens to be a little bit more probable than its larger neighbors. And indeed, for larger values of the maximum capacity of the urn, every posterior probability shrinks, so the absolute gap between the posterior probability of 34 and that of 35 decreases, even though their ratio stays fixed at 35/34.
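To get a feel for how small these probabilities are, here is a rough sketch that prints the posterior at 34 and 35 for a few urn capacities. The helper `posterior_at` is a hypothetical name, and the capacities other than 100 are made up purely for illustration.

```python
def posterior_at(n, observed, capacity):
    """Posterior probability of N = n given one draw labeled `observed`,
    under a uniform prior on 1..capacity and a 1/N likelihood."""
    if not observed <= n <= capacity:
        return 0.0
    # The uniform prior cancels, leaving 1/n over the sum of 1/k for feasible k.
    norm = sum(1.0 / k for k in range(observed, capacity + 1))
    return (1.0 / n) / norm

for capacity in (100, 1000, 10000):
    p34 = posterior_at(34, 34, capacity)
    p35 = posterior_at(35, 34, capacity)
    print(f"capacity={capacity}: P(34)={p34:.4f}  P(35)={p35:.4f}  gap={p34 - p35:.5f}")
```

With a capacity of 100, the posterior probability of 34 comes out to a bit under 3%, and it only gets smaller as the capacity grows.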

Why describe a probability distribution in terms of its mode? The expected outcome isn’t always the most likely outcome, it’s the mean of the probability distribution.

Estimation should be about minimizing error, not about maximizing the probability of being right.

It sounds like you’re objecting to something I said, but I’m not sure exactly what. There’s definitely more to a probability distribution than the value at which it is maximized, I just thought it was interesting that in this case the value with the highest probability is one that intuitively seems like an “extreme” result. In the end, any weirdness regarding this is defused by considering that “highest probability” does not mean “high probability” and that yes, the value you expect is not in general the maximum probability value.
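To make the mode-versus-mean point concrete, here is a quick sketch comparing a few summaries of the same posterior, again assuming the uniform prior up to 100 and the 1/N likelihood. The function name `posterior_summary` is just something I made up for this example.

```python
def posterior_summary(observed=34, capacity=100):
    """Return (mode, mean, median) of the posterior over the ball count N."""
    ns = list(range(observed, capacity + 1))
    weights = [1.0 / n for n in ns]          # likelihood times a flat prior
    norm = sum(weights)
    probs = [w / norm for w in weights]

    mode = max(zip(ns, probs), key=lambda pair: pair[1])[0]
    mean = sum(n * p for n, p in zip(ns, probs))

    # Median: smallest N whose cumulative probability reaches one half.
    cumulative = 0.0
    median = None
    for n, p in zip(ns, probs):
        cumulative += p
        if cumulative >= 0.5:
            median = n
            break
    return mode, mean, median

mode, mean, median = posterior_summary()
print("mode:", mode, "mean:", round(mean, 1), "median:", median)
```

The mode is 34, but the mean lands at roughly 61, so which number counts as the "best" estimate depends on what kind of error you care about.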

As to estimation… that’s a whooole big topic. I have some upcoming posts about some of the different philosophies of estimation, but one common view is something along the lines of “try to maximize the probability that you’re right.” I’m pretty undecided on the issue personally.