Slide 19 from this lecture:
This is a really important result. It says that Bayesian updating ultimately concentrates on the distribution in your model class that minimizes DKL to the truth, even when the truth is not in the model class.
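A quick numerical sketch of the claim (my own toy setup, not from the lecture: the exponential truth, the Gaussian grid model, and all names here are illustrative choices). The data come from Exp(1), but every candidate is N(mu, 1), so the truth is outside the model class; the minimum-DKL member is the Gaussian matching the true mean, mu* = E[X] = 1, and the posterior piles up there:

```python
import numpy as np

rng = np.random.default_rng(0)

# Truth: Exp(1). Model class: N(mu, 1) over a grid of mu values,
# so the truth is NOT in the model class. The minimum-DKL member
# of this class is the one matching the true mean, mu* = E[X] = 1.
x = rng.exponential(1.0, size=5000)
mus = np.linspace(0.0, 2.0, 201)

# Log-likelihood of each candidate under a flat prior over the grid;
# additive constants cancel after normalization.
loglik = np.array([-0.5 * np.sum((x - m) ** 2) for m in mus])
post = np.exp(loglik - loglik.max())
post /= post.sum()

mu_map = mus[np.argmax(post)]
print(mu_map)  # should land very close to 1.0, the min-DKL Gaussian
```

With 5000 samples the posterior is essentially a point mass on the grid point nearest the sample mean, i.e. the minimum-DKL Gaussian, even though no Gaussian is the truth.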
But it is also confusing to me, for the following reason.
If Bayes converges to the minimum DKL solution, and BIC approximates Bayes, and if AIC approximately finds the minimum DKL solution… well, then how can they give different answers?
In other words, how can all three of the following statements be true?
- BIC approximates Bayes, which minimizes DKL.
- AIC approximates the minimum DKL solution.
- But BIC ≠ AIC.
Clearly we have a problem here.
It’s possible that the answer is just that the differences arise from the different approximations made by AIC and BIC. But that seems like an inadequate explanation for such a huge difference, on the order of log(size of the data set).
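To make the size of that difference concrete (k and the sample sizes below are arbitrary illustrative values): with k parameters, AIC penalizes 2k and BIC penalizes k·log(n), so for the same fitted model the two scores differ by k(log n − 2), which grows without bound in n:

```python
import numpy as np

# For the same fitted model with k parameters:
#   AIC = -2*loglik + 2*k
#   BIC = -2*loglik + k*log(n)
# so the two scores differ by k*(log(n) - 2): order log(dataset size).
k = 3  # illustrative parameter count
for n in [100, 10_000, 1_000_000]:
    gap = k * (np.log(n) - 2)
    print(n, gap)
```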
A friend of mine suggested that the reason is that the derivation of BIC assumes the truth is in the set of candidate models, and this assumption is broken in exactly the regime where Bayes optimizes for DKL.
I’m not sure how strongly ‘the truth is in your set of candidate models’ is actually assumed in the derivation of BIC. I know this is the standard thing people say about BIC, but does the exact truth have to be in the model set, or just a model with low overall bias? For instance, you can derive AIC by assuming that the truth is in your set of candidate models, but you don’t need that assumption: you can also derive AIC as an approximate measure of DKL whenever your set of candidate models contains models with low bias.
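One way to watch the two criteria pull apart in practice (again my own toy setup: a sinusoidal truth, polynomial candidates, and a Gaussian-noise likelihood are all illustrative choices). The truth is outside every candidate, and because BIC’s per-parameter penalty log(n) exceeds AIC’s 2, BIC can never select a higher-degree polynomial than AIC:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(3 * x) + rng.normal(0.0, 0.3, n)  # a sinusoid: outside every polynomial model

def scores(deg):
    # Least-squares polynomial fit; Gaussian log-likelihood at the MLE variance.
    resid = y - np.polyval(np.polyfit(x, y, deg), x)
    sigma2 = np.mean(resid ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = deg + 2  # deg+1 coefficients plus the noise variance
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

degs = range(1, 12)
aics, bics = zip(*(scores(d) for d in degs))
aic_deg = degs[int(np.argmin(aics))]
bic_deg = degs[int(np.argmin(bics))]
print("AIC picks degree", aic_deg)
print("BIC picks degree", bic_deg)
```

Because BIC’s penalty per parameter is strictly larger, any BIC minimizer has weakly lower degree than any AIC minimizer; whether they actually disagree on a given dataset depends on how each marginal fit improvement compares to the gap between 2 and log(n) per parameter.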
This question amounts to looking closely at the derivation of BIC to see what is absolutely necessary for the result. For now, I’m just pointing out the basic confusion, and will hopefully post a solution soon!
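For reference, the standard route to BIC is a Laplace approximation to the marginal likelihood; sketching it helps show which assumptions are load-bearing. With θ̂ the MLE (or posterior mode), k the parameter dimension, and Ĵ the average observed information:

```latex
\log p(x \mid M) = \log \int p(x \mid \theta)\, p(\theta)\, d\theta
\approx \log p(x \mid \hat\theta) - \frac{k}{2}\log n
  + \underbrace{\log p(\hat\theta) + \frac{k}{2}\log 2\pi - \frac{1}{2}\log |\hat J|}_{O(1)\ \text{in}\ n}
```

Dropping the O(1) terms gives BIC = −2 log p(x | θ̂) + k log n. As far as I can tell, the Laplace step itself only needs the log-likelihood to be sharply peaked with a nonsingular Hessian at θ̂, not that the truth is in the model; if ‘truth in the model’ matters, it presumably enters through the dropped O(1) terms or the interpretation of the result.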