Slide 19 from this lecture:

This is a really important result. It says that Bayesian updating ultimately concentrates on the distribution in your model that minimizes D_{KL} to the true data-generating distribution, even when the truth is not in your model.
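Here's a minimal simulation sketch of that claim (my own toy setup, not from the slide): the data come from an Exponential(1) distribution, but the model family is Normal(μ, 1), so the truth is not in the model. The KL-minimizing member of the family has μ = E[X] = 1, and a grid posterior over μ with a flat prior concentrates there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Truth: Exponential(1). Model family: Normal(mu, 1) -- the truth is NOT in the model.
# The KL-minimizing Normal(mu, 1) has mu = E[X] = 1.
data = rng.exponential(scale=1.0, size=5000)

# Grid posterior over mu with a flat prior on [0, 2].
mu_grid = np.linspace(0.0, 2.0, 401)

# Log-likelihood of the whole dataset under each mu
# (Normal with sigma = 1; mu-independent constants dropped).
loglik = np.array([-0.5 * np.sum((data - mu) ** 2) for mu in mu_grid])

# Normalize in log space for numerical stability, then exponentiate.
logpost = loglik - loglik.max()
post = np.exp(logpost)
post /= post.sum()

post_mean = np.sum(mu_grid * post)
print(post_mean)  # concentrates near 1, the KL minimizer -- not any "true" Normal
```

As the sample size grows, the posterior piles up ever more tightly around μ = 1, exactly as the slide's result predicts for a misspecified model.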

But it is also confusing to me, for the following reason.

If Bayes converges to the minimum D_{KL} solution, and BIC approximates Bayes, *and* if AIC approximately finds the minimum D_{KL} solution… well, then how can they give different answers?

In other words, how can all three of the following statements be true?

- BIC approximates Bayes, which minimizes D_{KL}.
- AIC approximates the minimum D_{KL} solution.
- But BIC ≠ AIC.

Clearly we have a problem here.

It’s possible that the answer is just that the differences arise from the different approximations behind AIC and BIC. But this seems like an inadequate explanation for such a huge difference: the penalties disagree by a term on the order of log(size of data set).
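To make that log(n) gap concrete, here is a small sketch using the standard formulas AIC = −2 ln L̂ + 2k and BIC = −2 ln L̂ + k ln n. The goodness-of-fit term is identical, so any disagreement comes entirely from the penalties, whose per-parameter gap is ln(n) − 2. The log-likelihood numbers below are made up purely for illustration: model B has one extra parameter and a fixed fit gain over model A.

```python
import math

def aic(loglik, k):
    # AIC = -2 ln L_hat + 2k
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    # BIC = -2 ln L_hat + k ln n
    return -2.0 * loglik + k * math.log(n)

# Hypothetical fits (illustrative numbers only): model B has one more
# parameter and a log-likelihood gain of 3 over model A.
n = 10_000
ll_a, k_a = -500.0, 3
ll_b, k_b = -497.0, 4

# Lower is better for both criteria.
print(aic(ll_a, k_a), aic(ll_b, k_b))     # AIC prefers B: fit gain 6 > extra penalty 2
print(bic(ll_a, k_a, n), bic(ll_b, k_b, n))  # BIC prefers A: extra penalty ln(10000) ~ 9.2 > 6
```

With these numbers AIC and BIC select different models, and the flip point moves with n: the larger the data set, the bigger a fit improvement BIC demands before accepting an extra parameter, while AIC's threshold stays fixed.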

A friend of mine suggested that the reason is that the derivation of BIC assumes the truth is in the set of candidate models, and this assumption is broken in exactly the regime where Bayes optimizes for D_{KL}.

I’m not sure how strongly ‘the truth is in your set of candidate models’ is *actually* assumed by BIC. I know that this is the standard thing people say about BIC, but is it really that the exact truth *has* to be in the model, or just that the model has a low overall bias? For instance, you can derive AIC by assuming that the truth is in your set of candidate models. But you don’t *need* this assumption; you can also derive AIC as an approximate measure of D_{KL} when your set of candidate models contains models with low bias.

This question amounts to looking closely at the derivation of BIC to see what is absolutely necessary for the result. For now, I’m just pointing out the basic confusion, and will hopefully post a solution soon!