Argument screens off identity

In his post Argument Screens Off Authority, Eliezer Yudkowsky argues that while a person’s credentials can give you evidence as to the quality of arguments you expect to hear from them, these conclusions becomes irrelevant to the truth of their conclusion once you actually hear and understand their arguments. A true conclusion tends to have good arguments for it, and good arguments tend to become known by well-credentialed persons. But since the causal dependency from “truth” to “credentials” is indirect, you can remove the dependency by simply conditioning on the intermediate (“argument quality”).

I think this is a special case of a more general principle: Argument screens off identity. Consider the following hypothetical exchange.

Person 1: I don’t support affirmative action. Here are some arguments why.

(…)

Person 1: So those are the main reasons for my position.

Person 2: Hmm, while I don’t see anything wrong with your arguments, I think you’re forgetting that you’re not a member of <insert protected identity group>. This invalidates any arguments you make.

Person 1: Um, I don’t think that’s how that works…

Person 2: No, seriously. If you’re not black, Hispanic, Native American, or any other minority group that is affected by affirmative action, then you have no right to be talking about it. You don’t have any credibility on the issue unless you’ve lived it.

Yes, this is a bit of a straw man, but it’s not actually so far from real arguments that I’ve heard before. I don’t know of an exact name for the logical fallacy Person 2 is making here (it’s not exactly an ad hominem), but I think that it is covered nicely by the slogan that is the title of this post.

A white person who has spent lots of time carefully researching and thinking through issues of racial inequality may well have more to say on these issues than a black person who has not done so. A man who has studied the nature of sexism in our society and the social norms surrounding sexual harassment may know more than a woman who has not done so.

More to the point, when somebody presents an argument, the response “Yeah, well you aren’t a member of this specific identity group that your argument refers to, so your argument is not credible” is virtually always a distraction and hinders progress.

More formally, my claim is that in most contexts:

Pr(conclusion | argument, identity) = Pr(conclusion | argument)

While your identity might inform the types of arguments you are able to make, predisposed to make, or willing to make, the arguments themselves are the only pieces of evidence we have in evaluating your conclusion.

Galileo and the Schelling point improbability principle

An alternative history interaction between Galileo and his famous statistician friend

***

In the year 1609, when Galileo Galilei finished the construction of his majestic artificial eye, the first place he turned his gaze was the glowing crescent moon. He reveled in the crevices and mountains he saw, knowing that he was the first man alive to see such a sight, and his mind expanded as he saw the folly of the science of his day and wondered what else we might be wrong about.

For days he was glued to his telescope, gazing at the Heavens. He saw the planets become colorful expressive spheres and reveal tiny orbiting companions, and observed the distant supernova which Kepler had seen blinking into existence only five years prior. He discovered that Venus had phases like the Moon, that some apparently single stars revealed themselves to be binaries when magnified, and that there were dense star clusters scattered through the sky. All this he recorded in frantic enthusiastic writing, putting out sentences filled with novel discoveries nearly every time he turned his telescope in a new direction. The universe had opened itself up to him, revealing all its secrets to be uncovered by his ravenous intellect.

It took him two weeks to pull himself away from his study room for long enough to notify his friend Bertolfo Eamadin of his breakthrough. Eamadin was a renowned scholar, having pioneered at age 15 his mathematical theory of uncertainty and created the science of probability. Galileo often sought him out to discuss puzzles of chance and randomness, and this time was no exception. He had noticed a remarkable confluence of three stars that were in perfect alignment, and needed the counsel of his friend to sort out his thoughts.

Eamadin arrived at the home of Galileo half-dressed and disheveled, obviously having leapt from his bed and rushed over immediately upon receiving Galileo’s correspondence. He practically shoved Galileo out from his viewing seat and took his place, eyes glued with fascination on the sky.

Galileo allowed his friend to observe unmolested for a half-hour, listening with growing impatience to the ‘oohs’ and ‘aahs’ being emitted as the telescope swung wildly from one part of the sky to another. Finally, he interrupted.

Galileo: “Look, friend, at the pattern I have called you here to discuss.”

Galileo swiveled the telescope carefully to the position he had marked out earlier.

Eamadin: “Yes, I see it, just as you said. The three stars form a seemingly perfect line, each of the two outer ones equidistant from the central star.”

Galileo: “Now tell me, Eamadin, what are the chances of observing such a coincidence? One in a million? A billion?”

Eamadin frowned and shook his head. “It’s certainly a beautiful pattern, Galileo, but I don’t see what good a statistician like myself can do for you. What is there to be explained? With so many stars in the sky, of course you would chance upon some patterns that look pretty.”

Galileo: “Perhaps it seems only an attractive configuration of stars spewed randomly across the sky. I thought the same myself. But the symmetry seemed too perfect. I decided to carefully measure the central angle, as well as the angular distance distended by the paths from each outer star to the central one. Look.”

Galileo pulled out a sheet of paper that had been densely scribbled upon. “My calculations revealed the central angle to be precisely 180.000º, with an error of ± .003º. And similarly, I found the difference in the two angular distances to be .000º, with a margin of error of ± .002º.”

Eamadin: “Let me look at your notes.”

Galileo handed over the sheets to Eamadin. “I checked over my calculations a dozen times before writing you. I found the angular distances by approaching and retreating from this thin paper, which I placed between the three stars and me. I found the distance at which the thin paper just happened to cover both stars on one extreme simultaneously, and did the same for the two stars on the other extreme. The distance was precisely the same, leaving measurement error only for the thickness of the paper, my distance from it, and the resolution of my vision.”

Eamadin: “I see, I see. Yes, what you have found is a startlingly clear pattern. A similarity in distance and precision of angle this precise is quite unlikely to be the result of any natural phenomenon… ”

Galileo: “Exactly what I thought at first! But then I thought about the vast quantity of stars in the sky, and the vast number of ways of arranging them into groups of three, and wondered if perhaps in fact such coincidences might be expected. I tried to apply your method of uncertainty to the problem, and came to the conclusion that the chance of such a pattern having occurred through random chance is one in a thousand million! I must confess, however, that at several points in the calculation I found myself confronted with doubt about how to progress and wished for your counsel.”

Eamadin stared at Galileo’s notes, then pulled out a pad of his own and began scribbling intensely. Eventually, he spoke. “Yes, your calculations are correct. The chance of such a pattern having occurred to within the degree of measurement error you have specified by random forces is 10-9.”

Galileo: “Aha! Remarkable. So what does this mean? What strange forces have conspired to place the stars in such a pattern? And, most significantly, why?”

Eamadin: “Hold it there, Galileo. It is not reasonable to jump from the knowledge that the chance of an event is remarkably small to the conclusion that it demands a novel explanation.”

Galileo: “How so?”

Eamadin: “I’ll show you by means of a thought experiment. Suppose that we found that instead of the angle being 180.000º with an experimental error of .003º, it was 180.001º with the same error. The probability of this outcome would be the same as the outcome we found – one in a thousand million.”

Galileo: “That can’t be right. Surely it’s less likely to find a perfectly straight line than a merely nearly perfectly straight line.”

Eamadin: “While that is true, it is also true that the exact calculation you did for 180.000º ± .003º would apply for 180.001º ± .003º. And indeed, it is less likely to find the stars at this precise angle, than it is to find the stars merely near this angle. We must compare like with like, and when we do so we find that 180.000º is no more likely than any other angle!”

Galileo: “I see your reasoning, Eamadin, but you are missing something of importance. Surely there is something objectively more significant about finding an exactly straight line than about a nearly straight line, even if they have the same probability. Not all equiprobable events should be considered to be equally important. Think, for instance, of a sequence of twenty coin tosses. While it’s true that the outcome HHTHTTTTHTHHHTHHHTTH has the same probability as the outcome HHHHHHHHHHHHHHHHHHHH, the second is clearly more remarkable than the first.”

Eamadin: “But what is significance if disentangled from probability? I insist that the concept of significance only makes sense in the context of my theory of uncertainty. Significant results are those that either have a low probability or have a low conditional probability given a set of plausible hypotheses. It is this second class that we may utilize in analyzing your coin tossing example, Galileo. The two strings of tosses you mention are only significant to different degrees in that the second more naturally lends itself to a set of hypotheses in which the coin is heavily biased towards heads. In judging the second to be a more significant result than the first, you are really just saying that you use a natural hypothesis class in which probability judgments are only dependent on the ratios of heads and tails, not the particular sequence of heads and tails. Now, my question for you is: since 180.000º is just as likely as 180.001º, what set of hypotheses are you considering in which the first is much less likely than the second?”

Galileo: “I must confess, I have difficulty answering your question. For while there is a simple sense in which the number of heads and tails is a product of a coin’s bias, it is less clear what would be the analogous ‘bias’ in angles and distances between stars that should make straight lines and equal distances less likely than any others. I must say, Eamadin, that in calling you here, I find myself even more confused than when I began!”

Eamadin: “I apologize, my friend. But now let me attempt to disentangle this mess and provide a guiding light towards a solution to your problem.”

Galileo: “Please.”

Eamadin: “Perhaps we may find some objective sense in which a straight line or the equality of two quantities is a simpler mathematical pattern than a nearly straight line or two nearly equal quantities. But even if so, this will only be a help to us insofar as we have a presumption in favor of less simple patterns inhering in Nature.”

Galileo: “This is no help at all! For surely the principle of Ockham should push us towards favoring more simple patterns.”

Eamadin: “Precisely. So if we are not to look for an objective basis for the improbability of simple and elegant patterns, then we must look towards the subjective. Here we may find our answer. Suppose I were to scribble down on a sheet of paper a series of symbols and shapes, hidden from your view. Now imagine that I hand the images to you, and you go off to some unexplored land. You explore the region and draw up cartographic depictions of the land, having never seen my images. It would be quite a remarkable surprise were you to find upon looking at my images that they precisely matched your maps of the land.”

Galileo: “Indeed it would be. It would also quickly lend itself to a number of possible explanations. Firstly, it may be that you were previously aware of the layout of the land, and drew your pictures intentionally to capture the layout of the land – that is, that the layout directly caused the resemblance in your depictions. Secondly, it could be that there was a common cause between the resemblance and the layout; perhaps, for instance, the patterns that most naturally come to the mind are those that resemble common geographic features. And thirdly, included only for completion, it could be that your images somehow caused the land to have the geographic features that it did.”

Eamadin: “Exactly! You catch on quickly. Now, this case of the curious coincidence of depiction and reality is exactly analogous to your problem of the straight line in the sky. The straight lines and equal distances are just like patterns on the slips of paper I handed to you. For whatever reason, we come pre-loaded with a set of sensitivities to certain visual patterns. And what’s remarkable about your observation of the three stars is that a feature of the natural world happens to precisely align with these patterns, where we would expect no such coincidence to occur!”

Galileo: “Yes, yes, I see. You are saying that the improbability doesn’t come from any objective unusual-ness of straight lines or equal distances. Instead, the improbability comes from the fact that the patterns in reality just happen to be the same as the patterns in my head!”

Eamadin: “Precisely. Now we can break down the suitable explanations, just as you did with my cartographic example. The first explanation is that the patterns in your mind were caused by the patterns in the sky. That is, for some reason the fact that these stars were aligned in this particular way caused you to by psychologically sensitive to straight lines and equal quantities.”

Galileo: “We may discard this explanation immediately, for such sensitivities are too universal and primitive to be the result of a configuration of stars that has only just now made itself apparent to me.”

Eamadin: “Agreed. Next we have a common cause explanation. For instance, perhaps our mind is naturally sensitive to visual patterns like straight lines because such patterns tend to commonly arise in Nature. This natural sensitivity is what feels to us on the inside as simplicity. In this case, you would expect it to be more likely for you to observe simple patterns than might be naively thought.”

Galileo: “We must deny this explanation as well, it seems to me. For the resemblance to a straight line goes much further than my visual resolution could even make out. The increased likelihood of observing a straight line could hardly be enough to outweigh our initial naïve calculation of the probability being 10-9. But thinking more about this line of reasoning, it strikes me that you have just provided an explanation the apparent simplicity of the laws of Nature! We have developed to be especially sensitive to patterns that are common in Nature, we interpret such patterns as ‘simple’, and thus it is a tautology that we will observe Nature to be full of simple patterns.”

Eamadin: “Indeed, I have offered just such an explanation. But it is an unsatisfactory explanation, insofar as one is opposed to the notion of simplicity as a purely subjective feature. Most people, myself included, would strongly suggest that a straight line is inherently simpler than a curvy line.”

Galileo: “I feel the same temptation. Of course, justifying a measure of simplicity that does the job we want of it is easier said than done. Now, on to the third explanation: that my sensitivity to straight lines has caused the apparent resemblance to a straight line. There are two interpretations of this. The first is that the stars are not actually in a straight line, and you only think this because of your predisposition towards identifying straight lines. The second is that the stars aligned in a straight line because of these predispositions. I’m sure you agree that both can be reasonably excluded.”

Eamadin: “Indeed. Although it may look like we’ve excluded all possible explanations, notice that we only considered one possible form of the common cause explanation. The other two categories of explanations seem more thoroughly ruled out; your dispositions couldn’t be caused by the star alignment given that you have only just found out about it and the star alignment couldn’t be caused by your dispositions given the physical distance.”

Galileo: “Agreed. Here is another common cause explanation: God, who crafted the patterns we see in Nature, also created humans to have similar mental features to Himself. These mental features include aesthetic preferences for simple patterns. Thus God causes both the salience of the line pattern to humans and the existence of the line pattern in Nature.”

Eamadin: “The problem with this is that it explains too much. Based solely on this argument, we would expect that when looking up at the sky, we should see it entirely populated by simple and aesthetic arrangements of stars. Instead it looks mostly random and scattershot, with a few striking exceptions like those which you have pointed out.”

Galileo: “Your point is well taken. All I can imagine now is that there must be some sort of ethereal force that links some stars together, gradually pushing them so that they end up in nearly straight lines.”

Eamadin: “Perhaps that will be the final answer in the end. Or perhaps we will discover that it is the whim of a capricious Creator with an unusual habit for placing unsolvable mysteries in our paths. I sometimes feel this way myself.”

Galileo: “I confess, I have felt the same at times. Well, Eamadin, although we have failed to find a satisfactory explanation for the moment, I feel much less confused about this matter. I must say, I find this method of reasoning by noticing similarities between features of our mind and features of the world quite intriguing. Have you a name for it?”

Eamadin: “In fact, I just thought of it on the spot! I suppose that it is quite generalizable… We come pre-loaded with a set of very salient and intuitive concepts, be they geometric, temporal, or logical. We should be surprised to find these concepts instantiated in the world, unless we know of some causal connection between the patterns in our mind and the patterns in reality. And by Eamadin’s rule of probability-updating, when we notice these similarities, we should increase our strength of belief in these possible causal connections. In the spirit of anachrony, let us refer to this as the Schelling point improbability principle!”

Galileo: “Sounds good to me! Thank you for your assistance, my friend. And now I must return to my exploration of the Cosmos.”

Why “number of parameters” isn’t good enough

A friend of mine recently pointed out a curious fact. Any set of two-dimensional data whatsoever can be perfectly fit by a simple two-parameter sinusoidal model.

y(x) = A sin(Bx)

Sound wrong? Check it out:

small-sine-zoom.png

Zoomed out:small-sine.png

N = 10 pointssine-overfit.png

As you see, as the number of data points goes up, all you need to do to accommodate this is increase the frequency in your sine function, and adjust the amplitude as necessary. Ultimately, you can fit any data set with a ridiculously quickly oscillating and large-amplitude sine function.

Now, most model selection methods explicitly rely on the parameter count to estimate the potential of a model to overfit. For example, if k is the number of parameters in a model, and L is the log likelihood of the data given the model, we have:

AIC = L – k
BIC = L – k/2・log(N)

This little example represents a fantastic failure of parameter count to successfully do the job AIC and BIC ask of it. Evidently parameter count is too blunt an instrument to do the job we require of it, and we need something with more nuance.

One more example.

For any set of data, if you can perfectly fit a curve to each data point, and if your measurement error σ is an adjustable parameter, then you can take the measurement error to zero to have a fit with infinite accuracy. Now when we evaluate, you find it running off to infinity! Thus our ‘fit to data’ term L goes to infinity, while the model complexity penalty stays a small finite number.

Once again, we see the same lack of nuance dragging us into trouble. The number of parameters might do well at estimating overfitting potential for some types of well-behaved parameters, but it clearly doesn’t do the job universally. What we want is some measure that is sensitive to the potential for some parameters to capture “more” of the space of all possible distributions than others.

And lo and behold, we have such a measure! This is the purpose of information geometry and the volume of a model in the space formed by the Fisher information metric as the penalty for overfitting potential. You can learn more about it in a post I wrote here.

Bayesian Occam’s Razor

A couple of days ago I posted a question that has been bugging me; namely, does Bayes’ overfit, and if not, why not?

Today I post the solution!

There are two parts: first, explaining where my initial argument against Bayes went wrong, and second, describing the Bayesian Occam’s Razor, the key to understanding how a Bayesian deals with overfitting.

Part 1: Why I was wrong

Here’s the argument I wrote initially:

  1. Overfitting arises from an excessive focus on accommodation. (If your only epistemic priority is accommodating the data you receive, then you will over-accommodate the data, by fitting the noise in the data instead of just the underlying trend.)
  2. We can deal with overfitting by optimizing for other epistemic virtues like simplicity, predictive accuracy, or some measure of distance to truth. (For example, minimum description length and maximum entropy optimize for simplicity, and cross validation optimizes for predictive accuracy).
  3. Bayesianism is an epistemological procedure that has two steps, setting of priors and updating those priors.
  4. Updating of priors is done via Bayes’ rule, which rewards theories according to how well they accommodate their data (creating the potential for overfitting).
  5. Bayesian priors can be set in ways that optimize for other epistemic virtues, like simplicity or humility.
  6. In the limit of infinite evidence, differences in priors between empirically distinguishable theories are washed away.
  7. Thus, in the limit, Bayesianism becomes a primarily accommodating procedure, as the strength of the evidential update swamps your initial differences in priors.

Here’s a more formal version of the argument:

  1. The relative probabilities of two model given data is calculated by Bayes’ rule:
    P(M | D) / P(M’ | D)  = P(M) / P(M’)・P(D | M) / P(D | M’)
  2. If M overfits the data and M’ does not, then as the size of the data set |D| goes to infinity, the likelihood factor P(D | M) / P(D | M’) goes to infinity.
  3. Thus the posterior probability P(M | D) should go to 1 for the model that most drastically overfits the data.

This argument is wrong for a couple of reasons. For one, the argument assumes that as the size of the data set grows, the model stays the same. But this is very much not going to be true in general. The task of overfitting gets harder and harder as the number of data points go up. It’s not that there’s no longer noise in the data; it’s that the signal becomes more and more powerful.

A perfect polynomial fit on 100 data points must have, at the worst, 100 parameters. On 1000 data points: 1000 parameters. Etc. In general, as you add more data points, a model that was initially overfitting (e.g. the 100-parameter distribution) will find that it is harder and harder to ignore the signal for the noise, and the next best overfitting model will have more parameters (e.g. the 1000-parameter distribution).

But now we have a very natural solution to the problem we started with! It is true that as the number of data points increases, the evidential support for the model that overfits the data will get larger and larger. It’s also true is that the number of parameters required to overfit the data will grow as well. So if your prior in a model is a decreasing function of the number of parameters in the model, then you can in principle find a perfect balance and avoid overfitting. This perfect balance would be characterized by the following: each time you increase the number of parameters, the prior should decrease by an amount proportional to how much more you get rewarded by overfitting the data with the extra parameters.

How do we find this prior in practice? Beats me… I’d be curious to know, myself.

But what’s most interesting to me is that to solve overfitting as a Bayesian, you don’t even need the priors; the solution comes from the evidential update! It turns out that in fact, the likelihood function for updating credences in a model given data automatically incorporates in model overparameterization. Which brings us to part 2!

Part 2: Bayesian Occam’s Razor

That last sentence bears repeating. In reality, although priors can play some role by manually penalizing models with high overfitting potential, the true source of the Bayesian Occam’s razor comes from the evidential update. What we’ll find by the end of this post is that models that overfit don’t actually get a stronger evidential update than models that don’t.

You might wonder how this is possible. Isn’t it practically the definition of overfitting that it is an enhancement of the strength of an evidential update through fitting to noise in the data?

Sort of. It is super important to keep in mind the distinction between a model and a distribution. A distribution is a single probability function over your possible observable data. A model is a set of distributions, characterized by a set of parameters. When we say that some models have the potential to overfit a set of data, what we are really saying is that some models contain distributions that overfit the data.

Why is this important? Because assessing the posterior probability of the model is not the same as assessing the posterior probability of the overfitting distribution within the model! Here’s Bayes’ rule, applied to the model and to the overfitting distribution:

(1) P(M | D) = P(M)・P(D | M) / P(D)

(2) P(theta hat | D) = P(theta hat)・P(D | theta hat) / P(D)

It’s clear how to evaluate equation (2). You have some prior probability assigned to theta hat, you know how to assess the likelihood function P(D | theta hat), and P(D) is an integral that is in principle do-able. In addition, equation (2) has the scary feature we’ve been talking about: the likelihood function P(D | theta hat) is really really large if our parameter theta hat overfits the data, potentially large enough to swamp the priors and screw up our Bayesian calculus.

But what we’re really interested in evaluating is not equation (2), but equation (1)! This is, after all, model selection; we are in the end trying to assess the quality of different models, not individual distributions.

So how do we evaluate (1)? The key term is P(D | M); your prior over the models and the data you receive are not too important for the moment. What is P(D | M)? This question does not actually have an obvious answer… M is a model, a set of distributions, not a single distribution. If we were looking at one distribution, it would be easy to assess the likelihood of the data given that distribution.

So what does P(D | M) really mean?

It represents the average probability of the data, given the model. It’s as if you were to draw a distribution at random from your model, and see how well it fits the data. More precisely, you draw a distribution from your model, according to your prior distribution over the distributions in the model.

That was a mouthful. But the basic idea is simple; a model is an infinite set of distributions, each corresponding to a particular set of values for the parameters that define the model. You have a prior distribution over these values for the parameters, and you use this prior distribution to “randomly” select a distribution in your model. You then assess the probability of the data given that distribution, and voila, you have your likelihood function.

In other words…

P(D | M) = ∫ P(D | θ) P(θ | M) dθ

Now, an overfitting model has a massive space of parameters, and in some small region of this space contains distributions that fit the data really well. On the other hand, a simple model that generalizes well has a small space of parameters, and a region of this space contains distributions that fit the data well (though not as well as the overfitter).

So on average, you are much less likely to select the optimal distribution in the overfitting model than in the generalizable model. Why? Because the space of parameters you must search through to find it is so much larger!

True, when you do select the optimal distribution in the overfitting model, you get rewarded with a better fit to the data than you could have gotten from the nice model. But the balance, in the end, pushes you towards simpler and more general models.

This is the Bayesian Occam’s Razor! Models that are underparameterized do poorly on average, because they just can’t fit the data at all. Models that are overparametrized do poorly on average, because the subset of the parameter space that fits the data well is so tiny compared to the volume of the parameter space as a whole. And the models that strike the perfect balance are those that have enough parameters to fit the data well, but not too many as to excessively bloat the parameter space.

Here are some lecture slides from these great notes that have some helpful visualizations:

Screen Shot 2018-03-16 at 2.33.34 AMScreen Shot 2018-03-16 at 2.33.50 AM

Recapping in a few sentences: Simpler models are promoted, simply because they do well on average. And evidential support for a model comes down to the performance on average, not optimal performance. The likelihood in question is not P(data | best distribution in model), it’s P(data | average distribution in model). So overfitting models actually don’t get as much evidential support from data when assessing the model quality as a whole!

Ain’t that cool??

A model selection puzzle: Why is BIC ≠ AIC?

Slide 19 from this lecture:

Screen Shot 2018-03-14 at 8.14.41 PM.png

This is a really important result. It says that Bayesian updating ultimately converges to the distribution in a model that minimizes DKL, even when the truth is not in your model.

But it is also confusing to me, for the following reason.

If Bayes converges to the minimum DKL solution, and BIC approximates Bayes, and if AIC approximately finds the minimum DKL solution… well, then how can they give different answers?

In other words, how can all three of the following statements be true?

  1. BIC approximates Bayes, which minimizes DKL.
  2. AIC approximates the minimum DKL solution.
  3. But BIC ≠ AIC.

Clearly we have a problem here.

It’s possible that the answer to this is just that the differences arise from the differences in approximations between AIC and BIC. But this seems like a inadequate explanation to account for such a huge difference, on the order of log(size of data set).

A friend of mine suggested that the reason is that the derivation of BIC assumes that the truth is in the set of candidate models, and this assumption is broken in the condition where Bayes’ optimizes for DKL.

I’m not sure how strongly ‘the truth is in your set of candidate models’ is actually assumed by BIC. I know that this is the standard thing people say about BIC, but is it really that the exact truth has to be in the model, or just that the model has a low overall bias? For instance, you can derive AIC by assuming that the truth is in your set of candidate models. But you don’t need this assumption; you can also derive AIC as an approximate measure of DKL when your set of candidate models contains models with low bias.

This question amounts to looking closely at the derivation of BIC to see what is absolutely necessary for the result. For now, I’m just pointing out the basic confusion, and will hopefully post a solution soon!

Bayes and overfitting

In a couple of previous posts, I’ve said some things about Bayesianism that I now think might not be right. Specifically, I claimed a few times that Bayesians will have trouble with overfitting data.  Having looked into it more and seen some complicated arguments on either side, I’m less sure about this. I’m currently just confused about it, so I’m writing up my confusion here.

The reasoning behind my initial claim was something like this:

  1. Overfitting arises from an excessive focus on accommodation. (If your only epistemic priority is accommodating the data you receive, then you will over-accommodate the data, by fitting the noise in the data instead of just the underlying trend.)
  2. We can deal with overfitting by optimizing for other epistemic virtues like simplicity, predictive accuracy, or some measure of distance to truth. (For example, minimum description length and maximum entropy optimize for simplicity, and cross validation optimizes for predictive accuracy).
  3. Bayesianism is an epistemological procedure that has two steps, setting of priors and updating those priors.
  4. Updating of priors is done via Bayes’ rule, which rewards theories according to how well they accommodate their data (creating the potential for overfitting).
  5. Bayesian priors can be set in ways that optimize for other epistemic virtues, like simplicity or humility.
  6. In the limit of infinite evidence, differences in priors between empirically distinguishable theories are washed away.
  7. Thus, in the limit, Bayesianism becomes a primarily accommodating procedure, as the strength of the evidential update swamps your initial differences in priors.

In other words, the model that is best supported by the data will be the one that fits it perfectly (i.e. overfitting). We get out of this by giving overfitting models low priors… but we should expect that even this won’t be sufficient if we get strong enough evidence.

Is this wrong? And why?

 

The basics of information divergence

I’ve talked quite a bit about DKL on this blog, but I think I’ve yet to give a simple introduction to the concept. That’s what this post is about; an introduction to DKL in all it’s wonder!

What is DKL?

Essentially, DKL is a measure of the information distance between a given model of reality and reality itself. Information distance, more precisely, is a quantification of how many bits of information you would need to receive in order to update your model of reality into perfect alignment with reality.

Said another way, it is how much information you would lose if you started with a perfectly aligned model of reality and ended with the model of reality that you currently have. And said one final way, it is how much more surprised you will be on average given your beliefs, than you would be if you had all true beliefs.

Here’s the functional form of DKL:

DKL(Ptrue, P) = ∫ Ptrue log(Ptrue / P)
= Etrue[ log(Ptrue / P) ]

The Etrue[-] on the second line refers to an expected value taken over the true distribution.

Why the log? I give some intuitive reasons here.

The problem, however, is that DKL cannot be directly calculated. Notice that one of the inputs to the function is the true probability distribution over outcomes. If you had access to this from the start, then you would have no need for any fancy methods of inference in the first place.

However, there are good ways to indirectly approximate DKL. How could we ever approximate a function that takes in an input that we don’t know? Through data!

Loosely speaking, data functions as a window that allows you to sneak peeks at reality. When you observe the outcomes of an experiment, the result you get will not always be aligned with your beliefs about reality. Instead, the outcomes of the experiment reflect the nature of reality itself, unfiltered by your beliefs.

(This is putting aside subtleties about good experimental design, but even those subtleties are unnecessary; technically the data you get is always a product of the nature of reality as it is,  it’s just that our interpretation of the data might be flawed.)

So if we have access to some set of data from well-designed experiments (that is, experiments whose results we are correctly interpreting), we can use it to form an approximation of the DKL of any given model of reality. This first approximation is called the log loss, and takes the following form:

Log Loss = – Edata[ log(P) ]

There is one more problem with this notion of using data to approximate DKL. The problem is that normally, we use data to update our beliefs. If we first update our beliefs with the data, and then approximate the DKL of our new distribution using the data, then we are biasing our approximation. It’s sort of like assessing intelligence by giving people a IQ test, but they were allowed to study by examining that exact IQ test and its answer key. If they do well on that test, it might not be because they are actually intelligent, but rather just that they’ve memorized all of the answers (overfit to the data of the IQ test).

So we have a few choices; first, we could refuse to update our beliefs on the data, and then have a nice unbiased estimate of the DKL of our un-updated distribution. Second, we could update our beliefs on the data, but give up hope of an unbiased estimate of DKL. Third, we could use some of the data for updating our beliefs, and the rest of it for evaluating DKL (this option is called cross validation). Or finally, we could try to find some way to approximate the amount of bias introduced by a given update of beliefs to our estimate of DKL.

Amazingly, we actually know precisely how to do this final option! This was the great contribution of the brilliant Japanese statistician Hirotogu Akaike. The equation he derived when trying to quantify the degree of bias is called the Akaike information criterion.

AIC(Model M) = Number of parameters in M – log P(data | M)

The best model in a set is the one with the lowest AIC score. It makes a lot of sense that models with more parameters are penalized; models with more tweakable parameters are like students that are better at memorizing an answer key to their test.

Can we do any better than AIC? Yes, in fact! For small data sets, a better measure is AICc, which adds a correction term that scales like 1/N.

So to summarize everything in a few sentences:

  1. DKL is a measure of how far your model of reality is from the truth.
  2. DKL cannot be calculated without prior knowledge of the truth.
  3. However, we can use data to approximate DKL, by calculating log loss.
  4. Unfortunately, if we are also using the data to update our beliefs, log loss is a biased estimator of DKL.
  5. We can approximate the bias and negate it using the Akaike information criterion, AIC.
  6. An even better approximation for small data sets is AICc.

More model selection visualizations

I added cross validation to my model selection program, and ooh boy do I now understand why people want to find more efficient alternatives.

Reliable CV calculations end up making the program run orders of magnitude slower, as they require re-fitting the curve for every partition of your data and this is the most resource-intensive part of the process. While it’s beautiful in theory, this is a major set-back in practice.

For a bunch of different true distributions I tried, I found that CV pretty much always lines up with all of the other methods, just with a more “jagged” curve and a steeper complexity penalty. This similarity looks like a score for the other methods, assuming that CV does a good job of measuring predictive power. (And for those interested in the technical details, the cross validation procedure implemented here is leave-one-out CV)

I also added an explicit calculation of DKL, which should help to give some idea of a benchmark against which we can measure all of these methods. Anyway, I have some more images!

True curve = e-x/3 – ex/3

N = 100 data points

exps-wcv.png

N = 100 data points, zoomed in a bit

exps-wcv-zoom.png

For smaller data sets, you can see that AICc tracks DKL much more closely than any other technique (which is of course the entire purpose of AICc).

N = 25

exps-small-n.png

N = 25, zoomed

exps-small-n-zoom.png

N = 25, super zoom

exps-small-n-bigzoom.png

Interestingly, you start to see BIC really suffering relative to the other methods, beginning to overfit the data. This is counterintuitive; BIC is supposed to be the method that penalizes complexity excessively and underfits the data. Relevant is that in this program, I use the form of BIC that doesn’t approximate for large N.

BICsmall N = k log(N/2π) – 2L
BICordinary = k log(N) – 2L

When I use the approximation instead, the problem disappears. Of course, this is not too great a solution; why should the large N approximation be necessary for fixing BIC specifically when N is small?

 (Edit: after many more runs, it’s looking more like it may have just been a random accident that BIC overfit in the first few runs)

Just for the heck of it, I also decided to torture my polynomials a little bit by making them try to fit the function 1/x. I got dismal results no matter how I tried to tweak the hyper-parameters, which is, well, pretty much what I expected (1/x has no Taylor expansion around 0, for one thing).

More surprisingly, I tried fitting a simple Gaussian curve and again got fairly bad results. The methods disagreed with one another a lot of the time (although weirdly, AICc and BIC seemed to generally be in agreement), and gave curves that were overfitting the data a bit. The part that seems hardest for a polynomial to nail down is the flattened ends of the Gaussian curve.

True curve = 40 exp(-x2/2), N = 100 data points

gaussian-best.png

And zoomed in…

gaussian-best-zoom.png

If the jaggedness of the cross validation score is not purely an artifact of random fluctuations in the data, I don’t really get it. Why should, for example, a 27-parameter model roughly equal a 25-parameter model in predictive power, but a 26-parameter model be significantly worse than both?

 

The Monty Hall non-paradox

I recently showed the famous Monty Hall problem to a friend. This friend solved the problem right away, and we realized quickly that the standard presentation of the problem is highly misleading.

Here’s the setup as it was originally described in the magazine column that made it famous:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

I encourage you to think through this problem for yourself and come to an answer. Will provide some blank space so that you don’t accidentally read ahead.

 

 

 

 

 

 

Now, the writer of the column was Marilyn vos Savant, famous for having an impossible IQ of 228 according to an interpretation of a test that violated “almost every rule imaginable concerning the meaning of IQs” (psychologist Alan Kaufman). In her response to the problem, she declared that switching gives you a 2/3 chance of winning the car, as opposed to a 1/3 chance for staying. She argued by analogy:

Yes; you should switch. The first door has a 1/3 chance of winning, but the second door has a 2/3 chance. Here’s a good way to visualize what happened. Suppose there are a million doors, and you pick door #1. Then the host, who knows what’s behind the doors and will always avoid the one with the prize, opens them all except door #777,777. You’d switch to that door pretty fast, wouldn’t you?

Notice that this answer contains a crucial detail that is not contained in the statement of the problem! Namely, the answer adds the stipulation that the host “knows what’s behind the doors and will always avoid the one with the prize.”

The original statement of the problem in no way implies this general statement about the host’s behavior. All you are justified to assume in an initial reading of the problem are the observational facts that (1) the host happened to open door No. 3, and (2) this door happened to contain a goat.

When nearly a thousand PhDs wrote in to the magazine explaining that her answer was wrong, she gave further arguments that failed to reference the crucial point; that her answer was only true given additional unstated assumptions.

My original answer is correct. But first, let me explain why your answer is wrong. The winning odds of 1/3 on the first choice can’t go up to 1/2 just because the host opens a losing door. To illustrate this, let’s say we play a shell game. You look away, and I put a pea under one of three shells. Then I ask you to put your finger on a shell. The odds that your choice contains a pea are 1/3, agreed? Then I simply lift up an empty shell from the remaining other two. As I can (and will) do this regardless of what you’ve chosen, we’ve learned nothing to allow us to revise the odds on the shell under your finger.

Notice that this argument is literally just a restatement of the original problem. If one didn’t buy the conclusion initially, restating it in terms of peas and shells is unlikely to do the trick!

This problem was made even more famous by this scene in the movie “21”, in which the protagonist demonstrates his brilliance by coming to the same conclusion as vos Savant. While the problem is stated slightly better in this scene, enough ambiguity still exists that the proper response should be that the problem is underspecified, or perhaps a set of different answers for different sets of auxiliary assumptions.

The wiki page on this ‘paradox’ describes it as a veridical paradox, “because the correct choice (that one should switch doors) is so counterintuitive it can seem absurd, but is nevertheless demonstrably true.”

Later on the page, we see the following:

In her book The Power of Logical Thinking, vos Savant (1996, p. 15) quotes cognitive psychologist Massimo Piattelli-Palmarini as saying that “no other statistical puzzle comes so close to fooling all the people all the time,” and “even Nobel physicists systematically give the wrong answer, and that they insist on it, and they are ready to berate in print those who propose the right answer.”

There’s something to be said about adequacy reasoning here; when thousands of PhDs and some of the most brilliant mathematicians in the world are making the same point, perhaps we are too quick to write it off as “Wow, look at the strength of this cognitive bias! Thank goodness I’m bright enough to see past it.”

In fact, the source of all of the confusion is fairly easy to understand, and I can demonstrate it in a few lines.

Solution to the problem as presented

Initially, all three doors are equally likely to contain the car.
So Pr(1) = Pr(2) = Pr(3) = ⅓

We are interested in how these probabilities update upon the observation that 3 does not contain the car.
Pr(1 | ~3) = Pr(1)・Pr(~3 | 1) / Pr(~3)
= (⅓ ・1) / ⅔ = ½

By the same argument,
Pr(2 | ~3) = ½

Voila. There’s the simple solution to the problem as it is presented, with no additional presumptions about the host’s behavior. Accepting this argument requires only accepting three premises:

(1) Initially all doors are equally likely to be hiding the car.

(2) Bayes’ rule.

(3) There is only one car.

(3) implies that Pr(the car is not behind a door | the car is behind a different door) = 100%, which we use when we replace Pr(~3 | 1) with 1.

The answer we get is perfectly obvious; in the end all you know is that the car is either in door 1 or door 2, and that you picked door 1 initially. Since which door you initially picked has nothing to do with which door the car was behind, and the host’s decision gives you no information favoring door 1 over door 2, the probabilities should be evenly split between the two.

It is also the answer that all the PhDs gave.

Now, why does taking into account the host’s decision process change things? Simply because the host’s decision is now contingent on your decision, as well as the actual location of the car. Given that you initially opened door 1, the host is guaranteed to not open door 1 for you, and is also guaranteed to not open up a door hiding the car.

Solution with specified host behavior

Initially, all three doors are equally likely to contain the car.
So Pr(1) = Pr(2) = Pr(3) = ⅓

We update these probabilities upon the observation that 3 does not contain the car, using the likelihood formulation of Bayes’ rule.

Pr(1 | open 3) / Pr(2 | open 3)
= Pr(1) / Pr(2)・Pr(open 3 | 1) / Pr(open 3 | 2)
= ⅓ / ⅓・½ / 1 = ½

So Pr(1 | open 3) = ⅓ and Pr(2 | open 3) = ⅔

Pr(open 3 | 2) = 1, because the host has no choice of which door to open if you have selected door 1 and the car is behind door 2.

Pr(open 3 | 1) = ½, because the host has a choice of either opening 2 or 3.

In fact, it’s worth pointing out that this requires another behavioral assumption about the host that is nowhere stated in the original post, or Savant’s solution. This is that if there is a choice about which of two doors to open, the host will pick randomly.

This assumption is again not obviously correct from the outset; perhaps the host chooses the larger of the two door numbers in such cases, or the one closer to themselves, or the one or the smaller number with 25% probability. There are an infinity of possible strategies the host could be using, and this particular strategy must be explicitly stipulated to get the answer that Wiki proclaims to be correct.

It’s also worth pointing out that once these additional assumptions are made explicit, the ⅓ answer is fairly obvious and not much of a paradox. If you know that the host is guaranteed to choose a door with a goat behind it, and not one with a car, then of course their decision about which door to open gives you information. It gives you information because it would have been less likely in the world where the car was under door 1 than in the world where the car was under door 2.

In terms of causal diagrams, the second formulation of the Monty Hall problem makes your initial choice of door and the location of the car dependent upon one another. There is a path of causal dependency that goes forwards from your decision to the host’s decision, which is conditioned upon, and then backward from the host’s decision to which door the car is behind.

Any unintuitiveness in this version of the Monty Hall problem is ultimately due to the unintuitiveness of the effects of conditioning on a common effect of two variables.

Monty Hall Causal

In summary, there is no paradox behind the Monty Hall problem, because there is no single Monty Hall problem. There are two different problems, each containing different assumptions, and each with different answers. The answers to each problem are fairly clear after a little thought, and the only appearance of a paradox comes from apparent disagreements between individuals that are actually just talking about different problems. There is no great surprise when ambiguous wording turns out multiple plausible solutions, it’s just surprising that so many people see something deeper than mere ambiguity here.

Akaike, epicycles, and falsifiability

I found a nice example of an application of model selection techniques in this paper.

The history of astronomy provides one of the earliest examples of the problem at hand. In Ptolemy’s geocentric astronomy, the relative motion of the earth and the sun is independently replicated within the model for each planet, thereby unnecessarily adding to the number of adjustable parameters in his system. Copernicus’s major innovation was to decompose the apparent motion of the planets into their individual motions around the sun together with a common sun-earth component, thereby reducing the number of adjustable parameters. At the end of the non-technical exposition of his programme in De Revolutionibus, Copernicus repeatedly traces the weakness of Ptolemy’s astronomy back to its failure to impose any principled constraints on the separate planetary models.

In a now famous passage, Kuhn claims that the unification or harmony of Copernicus’ system appeals to an aesthetic sense, and that alone. Many philosophers of science have resisted Kuhn’s analysis, but none has made a convincing reply. We present the maximization of estimated predictive accuracy as the rationale for accepting the Copernican model over its Ptolemaic rival. For example, if each additional epicycle is characterized by 4 adjustable parameters, then the likelihood of the best basic Ptolemaic model, with just twelve circles, would have to be e20 (or more than 485 million) times the likelihood of its Copernican counterpart with just seven circles for the evidence to favour the Ptolemaic proposal. Yet it is generally agreed that these basic models had about the same degree of fit with the data known at the time. The advantage of the Copernican model can hardly be characterized as merely aesthetic; it is observation, not a prioristic preference, that drives our choice of theory in this instance.

Forster
How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions

Looking into this a little, I found on Wiki that apparently more and more complicated epicycle models were developed after Ptolemy.

As a measure of complexity, the number of circles is given as 80 for Ptolemy, versus a mere 34 for Copernicus. The highest number appeared in the Encyclopædia Britannica on Astronomy during the 1960s, in a discussion of King Alfonso X of Castile’s interest in astronomy during the 13th century. (Alfonso is credited with commissioning the Alfonsine Tables.)

By this time each planet had been provided with from 40 to 60 epicycles to represent after a fashion its complex movement among the stars. Amazed at the difficulty of the project, Alfonso is credited with the remark that had he been present at the Creation he might have given excellent advice.

40 epicycles per planet, with five known planets in Ptolemy’s time, and four adjustable parameters per epicycle, gives 800 additional parameters.

Since AIC scores are given by (# of parameters) – (log of likelihood of evidence), we can write:

AICCopernicus = kCopernicus – LCopernicus
AICepicycles = (kCopernicus + 800) – Lepicycles

AICepicycles > AICCopernicus only if Lepicycles / LCopernicus > e800

For these two models to perform equally well according to AIC, the strength of the evidence for epicycles would have to be at least e800 times stronger than the strength of the evidence for Copernicus. This corresponds roughly to a 2 with 347 zeroes after it. This is a much clearer argument for the superiority of heliocentrism over geocentrism than a vague appeal to lower priors in the latter than the former.

I like this as a nice simple example of how AIC can be practically applied. It’s also interesting to see how the type of reasoning formalized by AIC is fairly intuitive, and that even scholars in the 1500s were thinking in terms of excessive model flexibility in terms of abundant parameters as an epistemic failing.

Another example given in the same paper is Newton’s notion of admitting only as many causes as are necessary to explain the data. This is nicely formalized in terms of AIC using causal diagrams; if a model of a variable references more causes of that variable, then that model involves more adjustable parameters. In addition, adding causal dependencies to a causal model adds parameters to the description of the system as a whole.

One way to think about all this is that AIC and other model selection techniques provide a protection against unfalsifiability. A theory with too many tweakable parameters can be adjusted to fit a very wide range of data points, and therefore is harder to find evidence against.

I recall a discussion between two physicists somewhere about whether Newton’s famous equation F = ma counts as an unfalsifiable theory. The idea is just that for basically any interaction between particles, you could find some function F that makes the equation true. This has the effect of making the statement fairly vacuous, and carrying little content.

What does AIC have to say about this? The family of functions represented by F = ma is:

= { F = ma : F any function of the coordinates of the system }

How many parameters does this model have? Well, the ‘tweakable parameter’ lives inside an infinite dimensional Hilbert space of functions, suggesting that the number of parameters is infinity! If this is right, then the overfitting penalty on Newton’s second law is infinitely large and should outweigh any amount of evidence that could support it. This is actually not too crazy; if a model can accommodate any data set, then the model is infinitely weak.

One possible response is that the equation F = ma is meant to be a definitional statement, rather than a claim about the laws of physics. This seems wrong to me for several reasons, the most important of which is that it is not the case that any set of laws of physics can be framed in terms of Newton’s equation.

Case in point: quantum mechanics. Try as you might, you won’t be able to express quantum happenings as the result of forces causing accelerations according to F = ma. This suggests that F = ma is at least somewhat of a contingent statement, one that is meant to model aspects of reality rather than simply define terms.