Facts about guns

I’ve recently come across some pretty surprising statistics regarding guns and violence, so I’ve decided to compile some of them here. I might update this if I run across more interesting things in the future.

  • Guns probably save many more lives than they end. Source (CDC and the National Research Council) and source (1995 criminology paper).
    • There are an estimated 500,000 to 3,000,000 defensive gun uses per year, and only about 300,000 violent gun crimes per year.
    • Defensive uses of guns in the US save around 162,000 lives per year (based on self-report), while non-suicide gun deaths total only about 11,000 per year. Estimates of lives saved don’t include any military service, police work, or work as a security guard.
    • Defensive gun use reliably reduces injury rates among gun-using crime victims.


  • The 1994 imposition of five-day waiting periods for firearm purchases didn’t reduce the overall suicide rate. Source (paper in AMA journal).


  • Homicides have been on the decline for years, and guns aren’t nearly as dangerous as we think. Source (Freakonomics podcast).
    • There have been an average of 2 mass shootings per year, causing an average of 16.5 fatalities a year (excluding gang shootings and armed robberies).
    • Any particular handgun in the US will kill somebody about once every 10,000 years.
    • A given swimming pool is 100 times more likely to lead to the death of a child than a given gun is.
    • Gun buyback programs are horribly ineffective – typically saving an estimated 0.0001 lives per gun bought back.


  • The “more likely to have your gun used against you” meme is super misleading; it refers to the increased chance of suicide in the home for men with guns, not intruders wielding your gun against you. Source for one of the original findings.


Side note: Upon reflection, I’m super suspicious of the 162,000 lives/year saved number. Obviously measuring the counterfactual “would you have died if not for X?” is hard, but the number seems impossibly large when you think about the current murder rate… it corresponds to almost an extra 50 per 100,000 where the current homicide rate is 4.9 per 100,000. The cited study looks at self-reported potential fatality, which seems quite plausibly skewed upwards (if people tend to exaggerate the lethality of their encounters).

“You don’t believe in the God you want to, and I won’t believe in the God I want to”

From my favorite book of all time:

“I’m probably just as good an atheist as you are,” she speculated boastfully. “But even I feel that we all have a great deal to be thankful for and that we shouldn’t be ashamed to show it.”

“Name one thing I’ve got to be thankful for,” Yossarian challenged her without interest.

“Well…” Lieutenant Scheisskopf’s wife mused and paused a moment to ponder dubiously. “Me.”

“Oh, come on,” he scoffed.

She arched her eyebrows in surprise. “Aren’t you thankful for me?” she asked. She frowned peevishly, her pride wounded. “I don’t have to shack up with you, you know,” she told him with cold dignity. “My husband has a whole squadron full of aviation cadets who would be only too happy to shack up with their commanding officer’s wife just for the added fillip it would give them.”

Yossarian decided to change the subject. “Now you’re changing the subject,” he pointed out diplomatically. “I’ll bet I can name two things to be miserable about for every one you can name to be thankful for.”

“Be thankful you’ve got me,” she insisted.

“I am, honey. But I’m also goddam good and miserable that I can’t have Dori Duz again, too. Or the hundreds of other girls and women I’ll see and want in my short lifetime and won’t be able to go to bed with even once.”

“Be thankful you’re healthy.”

“Be bitter you’re not going to stay that way.”

“Be glad you’re even alive.”

“Be furious you’re going to die.”

“Things could be much worse,” she cried.

“They could be one hell of a lot better,” he answered heatedly.

“You’re naming only one thing,” she protested. “You said you could name two.”

“And don’t tell me God works in mysterious ways,” Yossarian continued, hurtling on over her objection. “There’s nothing so mysterious about it. He’s not working at all. He’s playing. Or else He’s forgotten all about us. That’s the kind of God you people talk about–a country bumpkin, a clumsy, bungling, brainless, conceited, uncouth hayseed. Good God, how much reverence can you have for a Supreme Being who finds it necessary to include such phenomena as phlegm and tooth decay in His divine system of creation? What in the world was running through that warped, evil, scatological mind of His when He robbed old people of the power to control their bowel movements? Why in the world did He ever create pain?”

“Pain?” Lieutenant Scheisskopf’s wife pounced upon the word victoriously. “Pain is a useful symptom. Pain is a warning to us of bodily dangers.”

“And who created the dangers?” Yossarian demanded. He laughed caustically. “Oh, He was really being charitable to us when He gave us pain! Why couldn’t He have used a doorbell instead to notify us, or one of His celestial choirs? Or a system of blue-and-red neon tubes right in the middle of each person’s forehead. Any jukebox manufacturer worth his salt could have done that. Why couldn’t He?”

“People would certainly look silly walking around with red neon tubes in the middle of their foreheads.”

“They certainly look beautiful now writhing in agony or stupefied with morphine, don’t they? What a colossal, immortal blunderer! When you consider the opportunity and power He had to really do a job, and then look at the stupid, ugly little mess He made of it instead, His sheer incompetence is almost staggering. It’s obvious He never met a payroll. Why, no self-respecting businessman would hire a bungler like Him as even a shipping clerk!”

Lieutenant Scheisskopf’s wife had turned ashen in disbelief and was ogling him with alarm. “You’d better not talk that way about Him, honey,” she warned him reprovingly in a low and hostile voice. “He might punish you.”

“Isn’t He punishing me enough?” Yossarian snorted resentfully. “You know, we mustn’t let Him get away with it. Oh, no, we certainly mustn’t let Him get away scot free for all the sorrow He’s caused us. Someday I’m going to make Him pay. I know when. On the Judgment Day. Yes, That’s the day I’ll be close enough to reach out and grab that little yokel by His neck and–”

“Stop it! Stop it!” Lieutenant Scheisskopf’s wife screamed suddenly, and began beating him ineffectually about the head with both fists. “Stop it!”

Yossarian ducked behind his arm for protection while she slammed away at him in feminine fury for a few seconds, and then he caught her determinedly by the wrists and forced her gently back down on the bed. “What the hell are you getting so upset about?” he asked her bewilderedly in a tone of contrite amusement. “I thought you didn’t believe in God.”

“I don’t,” she sobbed, bursting violently into tears. “But the God I don’t believe in is a good God, a just God, a merciful God. He’s not the mean and stupid God you make Him out to be.”

Yossarian laughed and turned her arms loose. “Let’s have a little more religious freedom between us,” he proposed obligingly. “You don’t believe in the God you want to, and I won’t believe in the God I want to. Is that a deal?”

Joseph Heller, Catch-22

Galileo and the Schelling point improbability principle

An alternative history interaction between Galileo and his famous statistician friend

***

In the year 1609, when Galileo Galilei finished the construction of his majestic artificial eye, the first place he turned his gaze was the glowing crescent moon. He reveled in the crevices and mountains he saw, knowing that he was the first man alive to see such a sight, and his mind expanded as he saw the folly of the science of his day and wondered what else we might be wrong about.

For days he was glued to his telescope, gazing at the Heavens. He saw the planets become colorful expressive spheres and reveal tiny orbiting companions, and observed the distant supernova which Kepler had seen blinking into existence only five years prior. He discovered that Venus had phases like the Moon, that some apparently single stars revealed themselves to be binaries when magnified, and that there were dense star clusters scattered through the sky. All this he recorded in frantic enthusiastic writing, putting out sentences filled with novel discoveries nearly every time he turned his telescope in a new direction. The universe had opened itself up to him, revealing all its secrets to be uncovered by his ravenous intellect.

It took him two weeks to pull himself away from his study room for long enough to notify his friend Bertolfo Eamadin of his breakthrough. Eamadin was a renowned scholar, having pioneered at age 15 his mathematical theory of uncertainty and created the science of probability. Galileo often sought him out to discuss puzzles of chance and randomness, and this time was no exception. He had noticed a remarkable confluence of three stars that were in perfect alignment, and needed the counsel of his friend to sort out his thoughts.

Eamadin arrived at the home of Galileo half-dressed and disheveled, obviously having leapt from his bed and rushed over immediately upon receiving Galileo’s correspondence. He practically shoved Galileo out from his viewing seat and took his place, eyes glued with fascination on the sky.

Galileo allowed his friend to observe unmolested for a half-hour, listening with growing impatience to the ‘oohs’ and ‘aahs’ being emitted as the telescope swung wildly from one part of the sky to another. Finally, he interrupted.

Galileo: “Look, friend, at the pattern I have called you here to discuss.”

Galileo swiveled the telescope carefully to the position he had marked out earlier.

Eamadin: “Yes, I see it, just as you said. The three stars form a seemingly perfect line, each of the two outer ones equidistant from the central star.”

Galileo: “Now tell me, Eamadin, what are the chances of observing such a coincidence? One in a million? A billion?”

Eamadin frowned and shook his head. “It’s certainly a beautiful pattern, Galileo, but I don’t see what good a statistician like myself can do for you. What is there to be explained? With so many stars in the sky, of course you would chance upon some patterns that look pretty.”

Galileo: “Perhaps it seems only an attractive configuration of stars spewed randomly across the sky. I thought the same myself. But the symmetry seemed too perfect. I decided to carefully measure the central angle, as well as the angular distance subtended by the paths from each outer star to the central one. Look.”

Galileo pulled out a sheet of paper that had been densely scribbled upon. “My calculations revealed the central angle to be precisely 180.000º, with an error of ± .003º. And similarly, I found the difference in the two angular distances to be .000º, with a margin of error of ± .002º.”

Eamadin: “Let me look at your notes.”

Galileo handed over the sheets to Eamadin. “I checked over my calculations a dozen times before writing you. I found the angular distances by approaching and retreating from this thin paper, which I placed between the three stars and me. I found the distance at which the thin paper just happened to cover both stars on one extreme simultaneously, and did the same for the two stars on the other extreme. The distance was precisely the same, leaving measurement error only for the thickness of the paper, my distance from it, and the resolution of my vision.”

Eamadin: “I see, I see. Yes, what you have found is a startlingly clear pattern. A coincidence of distances and angles this precise is quite unlikely to be the result of any natural phenomenon… ”

Galileo: “Exactly what I thought at first! But then I thought about the vast quantity of stars in the sky, and the vast number of ways of arranging them into groups of three, and wondered if perhaps in fact such coincidences might be expected. I tried to apply your method of uncertainty to the problem, and came to the conclusion that the chance of such a pattern having occurred through random chance is one in a thousand million! I must confess, however, that at several points in the calculation I found myself confronted with doubt about how to progress and wished for your counsel.”

Eamadin stared at Galileo’s notes, then pulled out a pad of his own and began scribbling intensely. Eventually, he spoke. “Yes, your calculations are correct. The chance of such a pattern having occurred to within the degree of measurement error you have specified by random forces is 10⁻⁹.”

Galileo: “Aha! Remarkable. So what does this mean? What strange forces have conspired to place the stars in such a pattern? And, most significantly, why?”

Eamadin: “Hold it there, Galileo. It is not reasonable to jump from the knowledge that the chance of an event is remarkably small to the conclusion that it demands a novel explanation.”

Galileo: “How so?”

Eamadin: “I’ll show you by means of a thought experiment. Suppose that we found that instead of the angle being 180.000º with an experimental error of .003º, it was 180.001º with the same error. The probability of this outcome would be the same as the outcome we found – one in a thousand million.”

Galileo: “That can’t be right. Surely it’s less likely to find a perfectly straight line than a merely nearly perfectly straight line.”

Eamadin: “While that is true, it is also true that the exact calculation you did for 180.000º ± .003º would apply for 180.001º ± .003º. And indeed, it is less likely to find the stars at this precise angle, than it is to find the stars merely near this angle. We must compare like with like, and when we do so we find that 180.000º is no more likely than any other angle!”

Galileo: “I see your reasoning, Eamadin, but you are missing something of importance. Surely there is something objectively more significant about finding an exactly straight line than about a nearly straight line, even if they have the same probability. Not all equiprobable events should be considered to be equally important. Think, for instance, of a sequence of twenty coin tosses. While it’s true that the outcome HHTHTTTTHTHHHTHHHTTH has the same probability as the outcome HHHHHHHHHHHHHHHHHHHH, the second is clearly more remarkable than the first.”

Eamadin: “But what is significance if disentangled from probability? I insist that the concept of significance only makes sense in the context of my theory of uncertainty. Significant results are those that either have a low probability or have a low conditional probability given a set of plausible hypotheses. It is this second class that we may utilize in analyzing your coin tossing example, Galileo. The two strings of tosses you mention are only significant to different degrees in that the second more naturally lends itself to a set of hypotheses in which the coin is heavily biased towards heads. In judging the second to be a more significant result than the first, you are really just saying that you use a natural hypothesis class in which probability judgments are only dependent on the ratios of heads and tails, not the particular sequence of heads and tails. Now, my question for you is: since 180.000º is just as likely as 180.001º, what set of hypotheses are you considering in which the first is much less likely than the second?”

Galileo: “I must confess, I have difficulty answering your question. For while there is a simple sense in which the number of heads and tails is a product of a coin’s bias, it is less clear what would be the analogous ‘bias’ in angles and distances between stars that should make straight lines and equal distances less likely than any others. I must say, Eamadin, that in calling you here, I find myself even more confused than when I began!”

Eamadin: “I apologize, my friend. But now let me attempt to disentangle this mess and provide a guiding light towards a solution to your problem.”

Galileo: “Please.”

Eamadin: “Perhaps we may find some objective sense in which a straight line or the equality of two quantities is a simpler mathematical pattern than a nearly straight line or two nearly equal quantities. But even if so, this will only be a help to us insofar as we have a presumption in favor of less simple patterns inhering in Nature.”

Galileo: “This is no help at all! For surely the principle of Ockham should push us towards favoring more simple patterns.”

Eamadin: “Precisely. So if we are not to look for an objective basis for the improbability of simple and elegant patterns, then we must look towards the subjective. Here we may find our answer. Suppose I were to scribble down on a sheet of paper a series of symbols and shapes, hidden from your view. Now imagine that I hand the images to you, and you go off to some unexplored land. You explore the region and draw up cartographic depictions of the land, having never seen my images. It would be quite a remarkable surprise were you to find upon looking at my images that they precisely matched your maps of the land.”

Galileo: “Indeed it would be. It would also quickly lend itself to a number of possible explanations. Firstly, it may be that you were previously aware of the layout of the land, and drew your pictures intentionally to capture the layout of the land – that is, that the layout directly caused the resemblance in your depictions. Secondly, it could be that there was a common cause between the resemblance and the layout; perhaps, for instance, the patterns that most naturally come to the mind are those that resemble common geographic features. And thirdly, included only for completeness, it could be that your images somehow caused the land to have the geographic features that it did.”

Eamadin: “Exactly! You catch on quickly. Now, this case of the curious coincidence of depiction and reality is exactly analogous to your problem of the straight line in the sky. The straight lines and equal distances are just like patterns on the slips of paper I handed to you. For whatever reason, we come pre-loaded with a set of sensitivities to certain visual patterns. And what’s remarkable about your observation of the three stars is that a feature of the natural world happens to precisely align with these patterns, where we would expect no such coincidence to occur!”

Galileo: “Yes, yes, I see. You are saying that the improbability doesn’t come from any objective unusual-ness of straight lines or equal distances. Instead, the improbability comes from the fact that the patterns in reality just happen to be the same as the patterns in my head!”

Eamadin: “Precisely. Now we can break down the suitable explanations, just as you did with my cartographic example. The first explanation is that the patterns in your mind were caused by the patterns in the sky. That is, for some reason the fact that these stars were aligned in this particular way caused you to be psychologically sensitive to straight lines and equal quantities.”

Galileo: “We may discard this explanation immediately, for such sensitivities are too universal and primitive to be the result of a configuration of stars that has only just now made itself apparent to me.”

Eamadin: “Agreed. Next we have a common cause explanation. For instance, perhaps our mind is naturally sensitive to visual patterns like straight lines because such patterns tend to commonly arise in Nature. This natural sensitivity is what feels to us on the inside as simplicity. In this case, you would expect it to be more likely for you to observe simple patterns than might be naively thought.”

Galileo: “We must deny this explanation as well, it seems to me. For the resemblance to a straight line goes much further than my visual resolution could even make out. The increased likelihood of observing a straight line could hardly be enough to outweigh our initial naïve calculation of the probability being 10⁻⁹. But thinking more about this line of reasoning, it strikes me that you have just provided an explanation of the apparent simplicity of the laws of Nature! We have developed to be especially sensitive to patterns that are common in Nature, we interpret such patterns as ‘simple’, and thus it is a tautology that we will observe Nature to be full of simple patterns.”

Eamadin: “Indeed, I have offered just such an explanation. But it is an unsatisfactory explanation, insofar as one is opposed to the notion of simplicity as a purely subjective feature. Most people, myself included, would strongly suggest that a straight line is inherently simpler than a curvy line.”

Galileo: “I feel the same temptation. Of course, justifying a measure of simplicity that does the job we want of it is easier said than done. Now, on to the third explanation: that my sensitivity to straight lines has caused the apparent resemblance to a straight line. There are two interpretations of this. The first is that the stars are not actually in a straight line, and you only think this because of your predisposition towards identifying straight lines. The second is that the stars aligned in a straight line because of these predispositions. I’m sure you agree that both can be reasonably excluded.”

Eamadin: “Indeed. Although it may look like we’ve excluded all possible explanations, notice that we only considered one possible form of the common cause explanation. The other two categories of explanations seem more thoroughly ruled out; your dispositions couldn’t be caused by the star alignment given that you have only just found out about it and the star alignment couldn’t be caused by your dispositions given the physical distance.”

Galileo: “Agreed. Here is another common cause explanation: God, who crafted the patterns we see in Nature, also created humans to have similar mental features to Himself. These mental features include aesthetic preferences for simple patterns. Thus God causes both the salience of the line pattern to humans and the existence of the line pattern in Nature.”

Eamadin: “The problem with this is that it explains too much. Based solely on this argument, we would expect that when looking up at the sky, we should see it entirely populated by simple and aesthetic arrangements of stars. Instead it looks mostly random and scattershot, with a few striking exceptions like those which you have pointed out.”

Galileo: “Your point is well taken. All I can imagine now is that there must be some sort of ethereal force that links some stars together, gradually pushing them so that they end up in nearly straight lines.”

Eamadin: “Perhaps that will be the final answer in the end. Or perhaps we will discover that it is the whim of a capricious Creator with an unusual habit for placing unsolvable mysteries in our paths. I sometimes feel this way myself.”

Galileo: “I confess, I have felt the same at times. Well, Eamadin, although we have failed to find a satisfactory explanation for the moment, I feel much less confused about this matter. I must say, I find this method of reasoning by noticing similarities between features of our mind and features of the world quite intriguing. Have you a name for it?”

Eamadin: “In fact, I just thought of it on the spot! I suppose that it is quite generalizable… We come pre-loaded with a set of very salient and intuitive concepts, be they geometric, temporal, or logical. We should be surprised to find these concepts instantiated in the world, unless we know of some causal connection between the patterns in our mind and the patterns in reality. And by Eamadin’s rule of probability-updating, when we notice these similarities, we should increase our strength of belief in these possible causal connections. In the spirit of anachrony, let us refer to this as the Schelling point improbability principle!”

Galileo: “Sounds good to me! Thank you for your assistance, my friend. And now I must return to my exploration of the Cosmos.”

Why “number of parameters” isn’t good enough

A friend of mine recently pointed out a curious fact. Any set of two-dimensional data whatsoever can be perfectly fit by a simple two-parameter sinusoidal model.

y(x) = A sin(Bx)

Sound wrong? Check it out:

[Figure: y = A sin(Bx) passing through every data point, zoomed in]

[Figure: the same fit, zoomed out]

[Figure: a rapidly oscillating sine curve fitting N = 10 points]

As you see, as the number of data points goes up, all you need to do to accommodate this is increase the frequency in your sine function, and adjust the amplitude as necessary. Ultimately, you can fit any data set with a ridiculously quickly oscillating and large-amplitude sine function.
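
To see this numerically, here’s a minimal sketch (my own, not from the original post). For generic sample points, the vector (sin(B·x₁), …, sin(B·x_N)) passes arbitrarily close to any point of [−1, 1]^N as B sweeps the reals (a consequence of Weyl equidistribution), so a blind random search over ever-larger sets of frequencies drives the fitting error of y = A·sin(Bx) toward zero; all constants here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
x = np.sort(rng.uniform(0.1, 1.0, size=N))   # fixed "data" to be fit
y = rng.normal(0.0, 1.0, size=N)

def best_fit_sse(n_trials, B_max=1e8):
    """Try n_trials random frequencies; for each B the optimal amplitude
    has the closed form A = <y, s> / <s, s> with s = sin(B*x)."""
    Bs = rng.uniform(0.0, B_max, size=n_trials)
    S = np.sin(np.outer(Bs, x))                      # shape (n_trials, N)
    denom = np.einsum('ij,ij->i', S, S)
    A = (S @ y) / np.maximum(denom, 1e-12)
    resid = y[None, :] - A[:, None] * S
    return np.einsum('ij,ij->i', resid, resid).min()

best, total = np.inf, 0
for batch in [10**3, 10**4, 10**5, 10**6]:
    best = min(best, best_fit_sse(batch))
    total += batch
    print(f"after {total:>8} random frequencies: best SSE so far = {best:.4f}")
```

The best achievable error keeps falling as the search widens, with no floor in principle: two parameters, unbounded flexibility.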

Now, most model selection methods explicitly rely on the parameter count to estimate the potential of a model to overfit. For example, if k is the number of parameters in a model, and L is the log likelihood of the data given the model, we have:

AIC = L – k
BIC = L – (k/2)·log(N)

This little example represents a spectacular failure of parameter count to do the job that AIC and BIC ask of it. Evidently parameter count is too blunt an instrument, and we need something with more nuance.
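
For concreteness, here’s a hedged sketch (mine) of the scores above, in the same “higher is better” convention (the textbook forms are −2 times these), applied to polynomial fits on noisy linear data. For ordinary well-behaved parameters the penalty works as intended; the sine model slips past it only because its single frequency parameter carries vastly more flexibility than one parameter’s usual share:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
sigma = 0.3                                   # known noise scale
x = np.linspace(0.0, 1.0, N)
y = 2.0 * x + rng.normal(0.0, sigma, size=N)  # the truth is a straight line

for degree in [1, 3, 6, 9]:
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    # Gaussian log likelihood of the residuals, sigma known
    L = -0.5 * N * np.log(2 * np.pi * sigma**2) - 0.5 * (resid @ resid) / sigma**2
    k = degree + 1                            # number of fitted coefficients
    print(f"degree {degree}: L = {L:7.2f}, AIC = {L - k:7.2f}, "
          f"BIC = {L - 0.5 * k * np.log(N):7.2f}")
```

The extra likelihood the higher degrees squeeze out of the noise is typically smaller than the penalty they pay, so the linear model (usually) wins both scores.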

One more example.

For any set of data, if you can perfectly fit a curve to each data point, and if your measurement error σ is an adjustable parameter, then you can take the measurement error to zero to get a fit with arbitrarily high accuracy. When we then evaluate the log likelihood L, we find it running off to infinity! Thus our ‘fit to data’ term goes to infinity, while the model complexity penalty stays a small finite number.
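
Here is that divergence made explicit (a tiny sketch of my own): once the residuals are exactly zero, only the Gaussian normalization term is left, and it blows up as the fitted noise scale shrinks:

```python
import numpy as np

N = 10   # data points, all fit exactly, so every residual is zero
for sigma in [1.0, 1e-3, 1e-9]:
    L = -0.5 * N * np.log(2 * np.pi * sigma**2)   # residual term vanishes
    print(f"sigma = {sigma:g}: log likelihood L = {L:.1f}")
# L grows without bound as sigma -> 0, while the parameter-count
# penalty stays fixed.
```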

Once again, we see the same lack of nuance dragging us into trouble. The number of parameters might do well at estimating overfitting potential for some types of well-behaved parameters, but it clearly doesn’t do the job universally. What we want is some measure that is sensitive to the potential for some parameters to capture “more” of the space of all possible distributions than others.

And lo and behold, we have such a measure! This is the purpose of information geometry: use the volume of a model in the space of distributions, as measured by the Fisher information metric, as the penalty for overfitting potential. You can learn more about it in a post I wrote here.

Bayesian Occam’s Razor

A couple of days ago I posted a question that has been bugging me; namely, does Bayesian inference overfit, and if not, why not?

Today I post the solution!

There are two parts: first, explaining where my initial argument against Bayes went wrong, and second, describing the Bayesian Occam’s Razor, the key to understanding how a Bayesian deals with overfitting.

Part 1: Why I was wrong

Here’s the argument I wrote initially:

  1. Overfitting arises from an excessive focus on accommodation. (If your only epistemic priority is accommodating the data you receive, then you will over-accommodate the data, by fitting the noise in the data instead of just the underlying trend.)
  2. We can deal with overfitting by optimizing for other epistemic virtues like simplicity, predictive accuracy, or some measure of distance to truth. (For example, minimum description length and maximum entropy optimize for simplicity, and cross validation optimizes for predictive accuracy).
  3. Bayesianism is an epistemological procedure that has two steps, setting of priors and updating those priors.
  4. Updating of priors is done via Bayes’ rule, which rewards theories according to how well they accommodate their data (creating the potential for overfitting).
  5. Bayesian priors can be set in ways that optimize for other epistemic virtues, like simplicity or humility.
  6. In the limit of infinite evidence, differences in priors between empirically distinguishable theories are washed away.
  7. Thus, in the limit, Bayesianism becomes a primarily accommodating procedure, as the strength of the evidential update swamps your initial differences in priors.

Here’s a more formal version of the argument:

  1. The relative probabilities of two models given data are calculated by Bayes’ rule:
    P(M | D) / P(M’ | D) = [P(M) / P(M’)]・[P(D | M) / P(D | M’)]
  2. If M overfits the data and M’ does not, then as the size of the data set |D| goes to infinity, the likelihood factor P(D | M) / P(D | M’) goes to infinity.
  3. Thus the posterior probability P(M | D) should go to 1 for the model that most drastically overfits the data.

This argument is wrong for a couple of reasons. For one, the argument assumes that as the size of the data set grows, the model stays the same. But this is very much not going to be true in general. The task of overfitting gets harder and harder as the number of data points goes up. It’s not that there’s no longer noise in the data; it’s that the signal becomes more and more powerful.

A perfect polynomial fit on 100 data points must have, at worst, 100 parameters. On 1000 data points: 1000 parameters. Etc. In general, as you add more data points, a model that was initially overfitting (e.g. the 100-parameter model) will find it harder and harder to fit the noise rather than the signal, and the next best overfitting model will have more parameters (e.g. the 1000-parameter model).
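
A quick sanity check of that first claim (a sketch assuming distinct x-values): a polynomial with N coefficients interpolates N points exactly, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10
x = np.sort(rng.uniform(0.0, 1.0, size=N))
y = rng.normal(size=N)

coeffs = np.polyfit(x, y, deg=N - 1)   # N coefficients: an exact interpolant
max_err = np.max(np.abs(np.polyval(coeffs, x) - y))
print(max_err)                         # tiny: a "perfect" fit up to rounding
```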

But now we have a very natural solution to the problem we started with! It is true that as the number of data points increases, the evidential support for the model that overfits the data will get larger and larger. It’s also true that the number of parameters required to overfit the data will grow as well. So if your prior in a model is a decreasing function of the number of parameters in the model, then you can in principle find a perfect balance and avoid overfitting. This perfect balance would be characterized by the following: each time you increase the number of parameters, the prior should decrease by an amount proportional to how much more you get rewarded by overfitting the data with the extra parameters.

How do we find this prior in practice? Beats me… I’d be curious to know, myself.

But what’s most interesting to me is that to solve overfitting as a Bayesian, you don’t even need the priors; the solution comes from the evidential update! It turns out that the likelihood function for updating credences in a model given data automatically penalizes model overparameterization. Which brings us to part 2!

Part 2: Bayesian Occam’s Razor

That last sentence bears repeating. In reality, although priors can play some role by manually penalizing models with high overfitting potential, the true source of the Bayesian Occam’s razor comes from the evidential update. What we’ll find by the end of this post is that models that overfit don’t actually get a stronger evidential update than models that don’t.

You might wonder how this is possible. Isn’t it practically the definition of overfitting that it is an enhancement of the strength of an evidential update through fitting to noise in the data?

Sort of. It is super important to keep in mind the distinction between a model and a distribution. A distribution is a single probability function over your possible observable data. A model is a set of distributions, characterized by a set of parameters. When we say that some models have the potential to overfit a set of data, what we are really saying is that some models contain distributions that overfit the data.

Why is this important? Because assessing the posterior probability of the model is not the same as assessing the posterior probability of the overfitting distribution within the model! Here’s Bayes’ rule, applied to the model and to the overfitting distribution:

(1) P(M | D) = P(M)・P(D | M) / P(D)

(2) P(θ̂ | D) = P(θ̂)・P(D | θ̂) / P(D)

It’s clear how to evaluate equation (2). You have some prior probability assigned to θ̂, you know how to assess the likelihood function P(D | θ̂), and P(D) is an integral that is in principle do-able. In addition, equation (2) has the scary feature we’ve been talking about: the likelihood function P(D | θ̂) is really really large if our parameter θ̂ overfits the data, potentially large enough to swamp the priors and screw up our Bayesian calculus.

But what we’re really interested in evaluating is not equation (2), but equation (1)! This is, after all, model selection; we are in the end trying to assess the quality of different models, not individual distributions.

So how do we evaluate (1)? The key term is P(D | M); your prior over the models and the data you receive are not too important for the moment. What is P(D | M)? This question does not actually have an obvious answer… M is a model, a set of distributions, not a single distribution. If we were looking at one distribution, it would be easy to assess the likelihood of the data given that distribution.

So what does P(D | M) really mean?

It represents the average probability of the data, given the model. It’s as if you were to draw a distribution at random from your model, and see how well it fits the data. More precisely, you draw a distribution from your model, according to your prior distribution over the distributions in the model.

That was a mouthful. But the basic idea is simple; a model is an infinite set of distributions, each corresponding to a particular set of values for the parameters that define the model. You have a prior distribution over these values for the parameters, and you use this prior distribution to “randomly” select a distribution in your model. You then assess the probability of the data given that distribution, and voila, you have your likelihood function.

In other words…

P(D | M) = ∫ P(D | θ) P(θ | M) dθ

Now, an overfitting model has a massive space of parameters, and some small region of this space contains distributions that fit the data really well. On the other hand, a simple model that generalizes well has a small space of parameters, and a region of this space contains distributions that fit the data well (though not as well as the overfitter).

So on average, you are much less likely to select the optimal distribution in the overfitting model than in the generalizable model. Why? Because the space of parameters you must search through to find it is so much larger!

True, when you do select the optimal distribution in the overfitting model, you get rewarded with a better fit to the data than you could have gotten from the nice model. But the balance, in the end, pushes you towards simpler and more general models.

This is the Bayesian Occam’s Razor! Models that are underparameterized do poorly on average, because they just can’t fit the data at all. Models that are overparametrized do poorly on average, because the subset of the parameter space that fits the data well is so tiny compared to the volume of the parameter space as a whole. And the models that strike the perfect balance are those that have enough parameters to fit the data well, but not too many as to excessively bloat the parameter space.
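
Here’s the razor in a fully worked miniature (my example, not from the post): a “fair coin” model with zero parameters versus a “biased coin” model with one free parameter p under a uniform prior. The biased model always contains a distribution that fits better (p = h/n), yet its evidence, the average fit ∫ P(D | p) dp, can still lose:

```python
from math import comb, log

def log_lik_fair(h, n):
    """log P(D | fair coin) for a specific sequence with h heads."""
    return n * log(0.5)

def log_lik_best_biased(h, n):
    """log P(D | p) at the maximum-likelihood parameter p = h/n."""
    p = h / n
    return h * log(p) + (n - h) * log(1 - p)

def log_evidence_biased(h, n):
    """Beta integral: log of ∫ p^h (1-p)^(n-h) dp = -log((n+1) * C(n,h))."""
    return -log(n + 1) - log(comb(n, h))

n = 100
for h in [52, 70]:
    print(f"h = {h}:  fair evidence {log_lik_fair(h, n):8.3f}   "
          f"biased best fit {log_lik_best_biased(h, n):8.3f}   "
          f"biased evidence {log_evidence_biased(h, n):8.3f}")
```

At h = 52 the biased model’s best distribution beats the fair coin, but its evidence is lower: most of its parameter space fits the data poorly, and the average drags it down. At h = 70 the data genuinely favor bias, and the evidence flips accordingly.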

Here are some lecture slides from these great notes that have some helpful visualizations:

[Slides: visualizations of the Bayesian Occam’s razor]

Recapping in a few sentences: Simpler models are promoted, simply because they do well on average. And evidential support for a model comes down to the performance on average, not optimal performance. The likelihood in question is not P(data | best distribution in model), it’s P(data | average distribution in model). So overfitting models actually don’t get as much evidential support from data when assessing the model quality as a whole!

Ain’t that cool??

All about IQ

IQ is an increasingly controversial topic these days. I find that when it comes up, different people seem to be extremely confident in wildly different beliefs about the nature of IQ as a measure of intelligence.

Part of this has to do with education. This paper analyzed the top 29 most used introductory psychology textbooks and “found that 79.3% of textbooks contained inaccurate statements and 79.3% had logical fallacies in their sections about intelligence.” [1]

This is pretty insane, and sounds kinda like something you’d hear from an Alex Jones-style conspiracy theorist. But if you look at what the world’s experts on human intelligence say about public opinion on intelligence, they’re all in agreement: misinformation about IQ is everywhere. It’s gotten to the point where world-famous respected psychologists like Steven Pinker are being blasted as racists in articles in mainstream news outlets for citing basic points of consensus in the scientific literature.

The reasons for this are pretty clear… people are worried about nasty social and political implications of true facts about IQ. There are worthwhile points to be made about morally hazardous beliefs and the possibility that some truths should not be publicly known. At the same time, the quantification and study of human intelligence is absurdly important. The difference between us and the rest of the animal world, the types of possible futures that are open to us as a civilization, the ability to understand the structure of the universe and manipulate it to our ends; these are the types of things that the subject of human intelligence touches on. In short, intelligence is how we accomplish anything as a civilization, and the prospect of missing out on ways to reliably intervene and enhance it because we avoided or covered up research that revealed some inconvenient truths seems really bad to me.

Overall, I lean towards thinking that the misinformation is so great, and the truth so important, that it’s worthwhile to attempt to clear things up. So! The purpose of this post is just to sort through some of the mess and come up with a concise and referenced list of some of the most important things we know about IQ and intelligence.

IQ Basics

  • The most replicated finding in all of psychology is that good performance on virtually all cognitively demanding tasks is positively correlated. The name for whatever cognitive faculty causes this correlation is “general intelligence”, or g.
  • A definition of intelligence from 52 prominent intelligence researchers: [2]

Intelligence is a very general capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. It is not merely book learning, a narrow academic skill, or test‑taking smarts. Rather, it reflects a broader and deeper capability for comprehending our surroundings—‘catching on’, ‘making sense’ of things, or ‘figuring out’ what to do. Intelligence, so defined, can be measured, and intelligence tests measure it well.

  • IQ tests are among the most reliable and valid of all psychological tests and assessments. [3]
    • They are designed to test general intelligence, and not character or personality.
    • Modern IQ tests have a standard error of measurement of about 3 points.
  • The distribution of IQs in a population nicely fits a Bell curve.
    • IQ is defined in such a way as to make the population mean exactly 100, and the standard deviation 15.
  • People with high IQs tend to be healthier, wealthier, live longer, and have more successful careers. [4][5][6]
    • IQ is highly predictive of educational aptitude and job performance. [7][8][9][10][11]
    • Longitudinal studies have shown that IQ “is a causal influence on future achievement measures whereas achievement measures do not substantially influence future IQ scores.” [12]

Average adult combined IQs associated with real-life accomplishments by various tests

Accomplishment IQ
MDs, JDs, and PhDs 125
College Graduates 115
1–3 years of college 104
Clerical and sales workers 100–105
High school graduates, skilled workers (e.g., electricians, cabinetmakers) 97
1–3 years of high school (completed 9–11 years of school) 94
Semi-skilled workers (e.g. truck drivers, factory workers) 90–95
Elementary school graduates (completed eighth grade) 90
Elementary school dropouts (completed 0–7 years of school) 80–85
Have 50/50 chance of reaching high school 75

(table from Wiki)

 

Table 25.1 Relationship between intelligence and measures of success (Results from meta-analyses)
Measure of success r k N Source
Academic performance in primary education 0.58 4 1791 Poropat (2009)
Educational attainment 0.56 59 84828 Strenze (2007)
Job performance (supervisory rating) 0.53 425 32124 Hunter and Hunter (1984)
Occupational attainment 0.43 45 72290 Strenze (2007)
Job performance (work sample) 0.38 36 16480 Roth et al. (2005)
Skill acquisition in work training 0.38 17 6713 Colquitt et al. (2000)
Degree attainment speed in graduate school 0.35 5 1700 Kuncel et al. (2004)
Group leadership success (group productivity) 0.33 14 Judge et al. (2004)
Promotions at work 0.28 9 21290 Schmitt et al. (1984)
Interview success (interviewer rating of applicant) 0.27 40 11317 Berry et al. (2007)
Reading performance among problem children 0.26 8 944 Nelson et al. (2003)
Becoming a leader in group 0.25 65 Judge et al. (2004)
Academic performance in secondary education 0.24 17 12606 Poropat (2009)
Academic performance in tertiary education 0.23 26 17588 Poropat (2009)
Income 0.20 31 58758 Strenze (2007)
Having anorexia nervosa 0.20 16 484 Lopez et al. (2010)
Research productivity in graduate school 0.19 4 314 Kuncel et al. (2004)
Participation in group activities 0.18 36 Mann (1959)
Group leadership success (group member rating) 0.17 64 Judge et al. (2004)
Creativity 0.17 447 Kim (2005)
Popularity among group members 0.10 38 Mann (1959)
Happiness 0.05 19 2546 DeNeve & Cooper (1998)
Procrastination (needless delay of action) 0.03 14 2151 Steel (2007)
Changing jobs 0.01 7 6062 Griffeth et al. (2000)
Physical attractiveness -0.04 31 3497 Feingold (1992)
Recidivism (repeated criminal behavior) -0.07 32 21369 Gendreau et al. (1996)
Number of children -0.11 3 Lynn (1996)
Traffic accident involvement -0.12 10 1020 Arthur et al. (1991)
Conformity to persuasion -0.12 7 Rhodes and Wood (1992)
Communication anxiety -0.13 8 2548 Bourhis and Allen (1992)
Having schizophrenia -0.26 18 Woodberry et al. (2008)

(from Gwern)

Nature of g

  • IQ scores are very stable across lifetime. [13]
    • This doesn’t mean that 30-year-old you is no smarter than 10-year-old you. It means that if you test the IQ of a bunch of children, and then later test them as adults, the rank order will remain roughly the same. A smarter-than-average 10 year old becomes a smarter-than-average 30 year old.
  • After your mid-20s, crystallized intelligence plateaus and fluid intelligence starts declining. Obligatory terrifying graph: (source)

  • High IQ is correlated with more gray matter in the brain, larger frontal lobes, and a thicker cortex. [14][15]
    • “There is a constant cascade of information being processed in the entire brain, but intelligence seems related to an efficient use of relatively few structures, where the more gray matter the better.” [16]
  • “Estimates of how much of the total variance in general intelligence can be attributed to genetic influences range from 30 to 80%.” [17]
    • Twin studies show the same results; there are substantial genetic influences on human intelligence. [18]
    • The genetic component of IQ is highly polygenic, and no specific genes have been robustly associated with human intelligence. The best we’ve found so far is a single gene that accounts for 0.1% of the variance in IQ. [17]
  • Many genes have been weakly associated with IQ. “40% of the variation in crystallized-type intelligence and 51% of the variation in fluid-type intelligence between individuals” is accounted for by genetic differences. [19]
    • Scientists can predict your IQ by looking only at your genes (not perfectly, but significantly better than random). [19]
      • This study analyzed 549,692 SNPs and found a mean correlation of R = .11 between its predictions and the actual fluid intelligence of over 3500 unrelated adults. [19]

You might be wondering at this point what all the controversy regarding IQ is about. Why are so many people eager to dismiss IQ as a valid measure of intelligence? Well, we now dive straight into the heart of the controversy: intergroup variation in IQ.

It’s worth noting that, as Scott Alexander puts it: society is fixed, while biology is mutable. The fear that if biology factors into the underperformance of some groups, then such differences are intrinsically unalterable, makes little sense. We can do things to modify biology just as we can do things to modify society, and in fact the first is often easier and more effective than the latter.

Anyway, prelude aside, we dive into the controversy.

Group differences in IQ

  • Yes, there are racial differences in IQ, both globally and within the United States. This has been studied to death, and is a point of universal consensus; you won’t find a single paper in a reputable psychology journal denying the numerical differences. [20]
  • Within the United States, there is a long-standing 1 SD (15 to 18 point) IQ difference between African Americans and White Americans. [2]
    • The tests in which these differences are most pronounced are those that most closely correspond to g, like Raven’s Progressive Matrices. [6] This test is also free of culturally-loaded knowledge, and only requires being able to solve visual pattern-recognition puzzles like these:

      • Controlling for the way the tests are formulated and administered does not affect this difference. [2]
      • IQ scores predict success equally accurately regardless of race or social class. This provides some evidence that the test is not culturally biased as a predictor. [2] [19]
  • Internationally, the lowest average IQs are found in sub-Saharan Africa and the highest average IQs are found in East Asia. The variations span a range of three standard deviations (45 IQ points). [21]
    • Malawi has an estimated average IQ of 60.
    • Singapore and Hong Kong have estimated IQs around 108.

(image from here)

  • A large survey published in one of the top psychology journals polled over 250 experts on IQ and international intelligence differences. [21]
    • On possible causes of cross-national differences in cognitive ability: “Genes were rated as the most important cause (17%), followed by educational quality (11.44%), health (10.88%), and educational quantity (10.20%).”
    • “Around 90% of experts believed that genes had at least some influence on cross-national differences in cognitive ability.”
  • Men and women have equal average IQs.
    • But: “most IQ tests are constructed so that there are no overall score differences between females and males.” [6]
    • They do this by removing items that show significant sex differences. So, for instance, men have a 1 SD (15 point) advantage on visual-spatial tasks over women. Thus mental rotation tests have been removed, in order to reduce the perception of bias. [22]
    • Males also do better on proportional and mechanical reasoning and mathematics, while females do better on verbal tests. [22]
  • Hormones are thought to play a role in sex differences in cognitive abilities. [23]
    • Females that are exposed to male hormones in utero have higher spatiotemporal reasoning scores than females that are not. [24]
    • The same thing is seen with men that have higher testosterone levels, and older males given testosterone. [25]
  • There is also some evidence of men having a higher IQ variance than women, but this seems to be disputed. If true, it would indicate more men at the very bottom and the very top of the IQ scale (helping to explain sex disparities in high-IQ professions). [26]

IQ Trends

  • In the developed world, average IQ has been increasing by 2 to 3 points per decade since 1930. This is called the Flynn effect.
    • The average IQ in the US in 1932, as measured by a 1997 IQ test, would be around 80. People with IQ 80 and below correspond to the bottom 9% of the 1997 population. [27]
  • Some studies have found that the Flynn effect seems to be waning in the developed world, and beginning in the developing world. [28]
  • A large survey of experts found that most attribute the Flynn effect to “better health and nutrition, more and better education and rising standards of living.” [29]
  • The Flynn effect is not limited to IQ tests, but is also found in memory tests, object naming, and other commonly used neuropsychological tests. [30]
  • Many studies indicate that the black-white IQ gap in the United States is closing. [23]

Can IQ be increased?

  • There are no known interventions that reliably cause long-term increases in IQ (although decreasing it is easy).
    • Essentially, you can do a handful of things to ensure that your child’s IQ is not low (give them access to education, provide them good nutrition, prevent iodine deficiency, etc), but you can’t do much beyond these.
  • Educational intervention programs have fairly unanimously failed to show long-term increases in IQ in the developed world. [23]
    • The best prekindergarten programs have a substantial short-term effect on IQ, but this effect fades by late elementary school.

Random curiosities

  • Several large-scale longitudinal studies have found that children with higher IQ are more likely to have used illegal drugs by middle age. This association is stronger for women than men. [31][32]
    • This actually makes some sense, given that IQ is positively correlated with Openness (in the Big Five personality traits breakdown).
  • The average intelligence of Marine officers has been significantly declining since 1980. [33]
  • “The US military has minimum enlistment standards at about the IQ 85 level. There have been two experiments with lowering this to 80 but in both cases these men could not master soldiering well enough to justify their costs.” (from Wiki)
    • This is fairly terrifying when you consider that 10% of the US population has an IQ of 80 or below; evidently, this enormous segment of humanity has an extremely limited capacity to do useful work for society.
  • Researchers used to think that IQ declined significantly starting around age 20. Subsequently this was found to be mostly a product of the Flynn effect: as average IQ increases, the normed IQ value inflates, so a constant IQ looks like it decreases. (from Wiki)
  • The popular idea that listening to classical music increases IQ has not been borne out by research. (Wiki)
  • There’s evidence that intelligence is part of the explanation for differential health outcomes across socioeconomic class.
    • “…Health workers can diagnose and treat incubating problems, such as high blood pressure or diabetes, but only when people seek preventive screening and follow treatment regimens. Many do not. In fact, perhaps a third of all prescription medications are taken in a manner that jeopardizes the patient’s health. Non-adherence to prescribed treatment regimens doubles the risk of death among heart patients (Gallagher, Viscoli, & Horwitz, 1993). For better or worse, people are substantially their own primary health care providers.”

      “For instance, one study (Williams et al., 1995) found that, overall, 26% of the outpatients at two urban hospitals were unable to determine from an appointment slip when their next appointment was scheduled, and 42% did not understand directions for taking medicine on an empty stomach. The percentages specifically among outpatients with inadequate literacy were worse: 40% and 65%, respectively. In comparison, the percentages were 5% and 24% among outpatients with adequate literacy. In another study (Williams, Baker, Parker, & Nurss, 1998), many insulin-dependent diabetics did not understand fundamental facts for maintaining daily control of their disease: Among those classified as having inadequate literacy, about half did not know the signs of very low or very high blood sugar, and 60% did not know the corrective actions they needed to take if their blood sugar was too low or too high. Among diabetics, intelligence at time of diagnosis correlates significantly (.36) with diabetes knowledge measured 1 year later (Taylor, Frier, et al., 2003).” [34]
  • IQ differences might be able to account for a significant portion of global income inequality.
    • “… in a conventional Ramsey model, between one-fourth and one-half of income differences across countries can be explained by a single factor: The steady-state effect of large, persistent differences in national average IQ on worker productivity. These differences in cognitive ability – which are well-supported in the psychology literature – are likely to be malleable through better nutrition, better education, and better health care in the world’s poorest countries. A simple calibration exercise in the spirit of Bils and Klenow (AER, 2000) and Castro (Rev. Ec. Dyn., 2005) is conducted. According to the model, a move from the bottom decile of the global IQ distribution to the top decile will cause steady-state living standards to rise by between 75 and 350 percent. I provide evidence that little of IQ-productivity relationship is likely to be due to reverse causality.” [35]
  • Exposure to lead hampers cognitive development and lowers IQ. You can calculate the economic boost the US received as a result of the dramatic reduction in children’s exposure to lead since the 1970s and the resulting increase in IQs.
    • “The base-case estimate of $213 billion in economic benefit for each cohort is based on conservative assumptions about both the effect of IQ on earnings and the effect of lead on IQ.” [36]
    • Yes. $213 billion.
  • In a 113-country analysis, IQ has been found to positively affect all main measures of institutional quality.
    • “The results show that average IQ positively affects all the measures of institutional quality considered in our study, namely government efficiency, regulatory quality, rule of law, political stability and voice and accountability. The positive effect of intelligence is robust to controlling for other determinants of institutional quality.” [37]
  • High IQ people cooperate more in repeated prisoner’s dilemma experiments; 5% to 8% more cooperation per 100 point increase in SAT score (a 7 point IQ increase). [38][39]
    • The second paper also shows more patience and higher savings rates for higher IQ. [39]
  • Embryo selection is a possible way to enhance the IQ of future generations, and is already technologically feasible.
    • “Biomedical research into human stem cell-derived gametes may enable iterated embryo selection (IES) in vitro, compressing multiple generations of selection into a few years or less.” [40]
      Selection Average IQ gain
      1 in 2 4.2
      1 in 10 11.5
      1 in 100 18.8
      1 in 1000 24.3

Sources

There is a ridiculous amount of research out there on IQ, and you can easily reach any conclusion you want by just finding some studies that agree with you. I’ve tried to stick to relying on large meta-analyses, papers of historical significance, large surveys of experts, and summaries by experts of consensus views.

[1] Warne, R. T., Astle, M. C., & Hill, J. C. (2018). What Do Undergraduates Learn About Human Intelligence? An Analysis of Introductory Psychology Textbooks. Archives of Scientific Psychology, 6(1), 32-50.

[2] Gottfredson, L. S. (1997). Mainstream science on intelligence: An editorial with 52 signatories, history and bibliography. Intelligence, 24(1), 13-23.

[3] Colom, R. (2004). Intelligence Assessment. Encyclopedia of Applied Psychology, 2(2), 307–314.

[4] Batty, D. G., Deary, I. J., Gottfredson, L. S. (2007). Premorbid (early life) IQ and Later Mortality Risk: Systematic Review. Annals of Epidemiology, 17(4), 278–288.

[5] Gottfredson, L. S. (1997). Why g Matters: The Complexity of Everyday Life. Intelligence, 24(1), 79-132.

[6] Neisser, U., et al. (1996). Intelligence: Knowns and Unknowns. American Psychologist, 51(2), 77-101.

[7] Deary, I. J., et al. (2007). Intelligence and educational achievement. Intelligence, 35(1), 13-21.

[8] Dumfart, B., & Neubauer, A. C. (2016). Conscientiousness is the most powerful noncognitive predictor of school achievement in adolescents. Journal of Individual Differences, 37(1), 8-15.

[9] Kuncel, N. R., & Hezlett, S. A. (2010). Fact and Fiction in Cognitive Ability Testing for Admissions and Hiring Decisions. Current Directions in Psychological Science, 19(6), 339-345.

[10] Schmidt, F. L., Hunter, J. E. (1998). The Validity and Utility of Selection Methods in Personnel Psychology: Practical and Theoretical Implications of 85 Years of Research Findings. Psychological Bulletin, 124(2), 262-274.

[11] Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96(1), 72-98.

[12] Watkins, M. W., Lei, P., Canivez, G. L. (2007). Psychometric intelligence and achievement: A cross-lagged panel analysis. Intelligence, 35(1), 59-68.

[13] Deary, I. J., et al. (2000). The stability of individual differences in mental ability from childhood to old age: follow-up of the 1932 Scottish Mental Survey. Intelligence, 28(1), 49–55.

[14] Frangou, S., Chitins, X., Williams, S. C. R. (2004). Mapping IQ and gray matter density in healthy young people. NeuroImage, 23(3), 800-805.

[15] University of California – Irvine. “Human Intelligence Determined By Volume And Location Of Gray Matter Tissue In Brain.” ScienceDaily, 20 July 2004.

[16] Narr, K., et al. (2007). Relationships between IQ and Regional Cortical Gray Matter Thickness in Healthy Adults. Cerebral Cortex, 17(9), 2163–2171.

[17] Deary, I. J., Penke, L., Johnson, W. (2010). The neuroscience of human intelligence differences. Nature Reviews Neuroscience, 11(3), 201–211.

[18] Deary, I. J., Johnson, W., Houlihan, L. M. (2009). Genetic foundations of human intelligence. Human Genetics, 126(1), 215-232.

[19] Davies, G., et al. (2011). Genome-wide association studies establish that human intelligence is highly heritable and polygenic. Molecular Psychiatry, 16(10), 996–1005.

[20] Rushton, J. P., Jensen, A. R. (2005). Thirty Years of Research on Race Differences in Cognitive Ability. Psychology, Public Policy, and Law, 11(2), 235-294.

[21] Rindermann, H., Becker, D., Coyle, T. R. (2016). Survey of Expert Opinion on Intelligence: Causes of International Differences in Cognitive Ability Tests. Frontiers in Psychology, 7.

[22] Ellis, L., et al. (2008). Sex Differences: Summarizing More than a Century of Scientific Research. Psychology Press.

[23] Nisbett, R. E., et al. (2012). Intelligence: New Findings and Theoretical Developments. American Psychologist, 67(2), 129.

[24] Resnick, S. M., et al. (1986). Early hormonal influences on cognitive functioning in congenital adrenal hyperplasia. Developmental Psychology, 22(2), 191-198.

[25] Janowsky, J. S., Oviatt, S. K., Orwoll, E. S. (1994). Testosterone influences spatial cognition in older men. Behavioral Neuroscience, 108(2), 325-332.

[26] Lynn, R., Kanazawa, S. (2011). A longitudinal study of sex differences in intelligence at ages 7, 11 and 16 years. Personality and Individual Differences, 51(3), 321–324.

[27] Neisser, U. (1997). Rising Scores on Intelligence Tests. American Scientist, 85(5), 440-447.

[28] Pietschnig, J., Voracek, M. (2015). One Century of Global IQ Gains: A Formal Meta-Analysis of the Flynn Effect (1909-2013). Perspectives on Psychological Science, 10(3), 282-306.

[29] Rindermann, H., Becker, D., Coyle, T. R. (2017). Survey of expert opinion on intelligence: The Flynn effect and the future of intelligence. Personality and Individual Differences, 106, 242-247.

[30] Trahan, L. H., et al. (2014). The Flynn Effect: A Meta-analysis. Psychological Bulletin, 140(5), 1332-1360.

[31] White, J., Gale, C. R., Batty, D. G. (2012). Intelligence quotient in childhood and the risk of illegal drug use in middle-age: the 1958 National Child Development Survey. Annals of Epidemiology, 22(9), 654-657.

[32] White, J., Batty, D. G. (2011). Intelligence across childhood in relation to illegal drug use in adulthood: 1970 British Cohort Study. Journal of Epidemiology & Community Health, 66(9).

[33] Cancian, M. F., Klein, M. W. (2015). Military Officer Quality in the All-Volunteer Force. National Bureau of Economic Research, Working Paper 21372.

[34] Gottfredson, L. S. (2004). Intelligence: Is it the epidemiologists’ elusive fundamental cause of social class inequalities in health? Journal of Personality and Social Psychology, 86(1), 174-199.

[35] Jones, G. (2005). IQ in the Ramsey Model: A Naive Calibration. George Mason University.

[36] Grosse, S. D., et al. (2002). Economic Gains Resulting from the Reduction in Children’s Exposure to Lead in the United States. Environmental Health Perspectives, 110(6), 563-569.

[37] Kalonda-Kanyama, I. & Kodila-Tedika, O. (2012). Quality of Institutions: Does Intelligence Matter? Working Papers Series in Theoretical and Applied Economics 201206, University of Kansas, Department of Economics.

[38] Jones, G. (2008). Are Smarter Groups More Cooperative? Evidence from Prisoner’s Dilemma Experiments, 1959-2003. Journal of Economic Behavior & Organization, 68(3–4), 489-497.

[39] Jones, G. (2011). National IQ and National Productivity: The Hive Mind Across Asia. Asian Development Review, 28(1), 51-71.

[40] Shulman, C. & Bostrom, N. (2014). Embryo Selection for Cognitive Enhancement: Curiosity or Game-changer? Global Policy, 5(1), 85-92.

A model selection puzzle: Why is BIC ≠ AIC?

Slide 19 from this lecture:

[Slide 19: Bayesian updating converges to the distribution in the model that minimizes D_KL to the truth]

This is a really important result. It says that Bayesian updating ultimately converges to the distribution in a model that minimizes D_KL (the Kullback-Leibler divergence from the true distribution), even when the truth is not in your model.

But it is also confusing to me, for the following reason.

If Bayes converges to the minimum-D_KL solution, and BIC approximates Bayes, and if AIC approximately finds the minimum-D_KL solution… well, then how can they give different answers?

In other words, how can all three of the following statements be true?

  1. BIC approximates Bayes, which minimizes D_KL.
  2. AIC approximates the minimum-D_KL solution.
  3. But BIC ≠ AIC.

Clearly we have a problem here.

It’s possible that the answer to this is just that the differences arise from the differences in approximations between AIC and BIC. But this seems like an inadequate explanation to account for such a huge difference, on the order of log(size of data set).
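
For a sense of that scale, here’s a trivial sketch (mine) in the post’s convention: the AIC penalty gap between nested models differing by Δk parameters is constant in N, while BIC’s grows like log N, so whenever the extra parameters buy a log-likelihood gain between Δk and (Δk/2)·log N, the two criteria must disagree for large N:

```python
import numpy as np

dk = 5                                   # say, five extra parameters
for N in [10, 100, 10**4, 10**6]:
    aic_gap = dk                         # AIC penalty difference: constant
    bic_gap = 0.5 * dk * np.log(N)       # BIC penalty difference: ~ log N
    print(f"N = {N:>8}: AIC gap = {aic_gap}, BIC gap = {bic_gap:.1f}")
```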

A friend of mine suggested that the reason is that the derivation of BIC assumes that the truth is in the set of candidate models, and this assumption is broken in the condition where Bayes optimizes for D_KL.

I’m not sure how strongly ‘the truth is in your set of candidate models’ is actually assumed by BIC. I know that this is the standard thing people say about BIC, but is it really that the exact truth has to be in the model, or just that the model has a low overall bias? For instance, you can derive AIC by assuming that the truth is in your set of candidate models. But you don’t need this assumption; you can also derive AIC as an approximate measure of D_KL when your set of candidate models contains models with low bias.

This question amounts to looking closely at the derivation of BIC to see what is absolutely necessary for the result. For now, I’m just pointing out the basic confusion, and will hopefully post a solution soon!