As a quick reminder from previous posts, we can define the surprise in an occurrence of an event E with probability P(E) as:
Sur(P(E)) = log(1/P) = – log(P).
I’ve discussed why this definition makes sense here. Now, with this definition, we can talk about expected surprise; in general, the surprise that somebody with distribution Q would expect somebody with distribution P to have is:
EQ[Sur(P)] = ∫ – Q log(P) dE
This integral is taken over all possible events. A special case of it is entropy, which is somebody’s own expected surprise. This corresponds to the intuitive notion of uncertainty:
Entropy = EP[Sur(P)] = ∫ – P log(P) dE
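For discrete events the integrals above become sums, and the definitions are easy to check numerically. Here's a minimal sketch in Python (the function names and the coin example are mine, not standard notation):

```python
import math

def surprise(p):
    # Sur(P(E)) = -log(P(E)); infinite as p -> 0
    return -math.log(p)

def expected_surprise(q, p):
    # E_Q[Sur(P)]: the surprise someone with distribution Q
    # expects someone with distribution P to have
    return sum(q[e] * surprise(p[e]) for e in q)

# Entropy is your own expected surprise: E_P[Sur(P)]
coin = {"H": 0.5, "T": 0.5}
entropy = expected_surprise(coin, coin)
print(entropy)  # log(2), about 0.693
```

For a fair coin this gives log 2, the familiar "one bit" of uncertainty (in nats, since we used the natural log).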
The actual average surprise for somebody with distribution P is:
Actual average surprise = ∫ – Ptrue log(P) dE
Here we are using the idea of a true probability distribution, which corresponds to the distribution over possible events that best describes the frequencies of each event. And finally, the “gap” in average surprise between P and Q is:
∫ Ptrue log(P/Q) dE
(This is Q's average surprise minus P's; in the special case where Ptrue = P, it is the Kullback–Leibler divergence between P and Q.)
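A quick numerical check that the gap formula really is the difference of the two average surprises, again for discrete events (the particular distributions are arbitrary examples of mine):

```python
import math

def avg_surprise(p_true, p):
    # Actual average surprise of someone with distribution p,
    # when events occur with frequencies p_true
    return sum(p_true[e] * -math.log(p[e]) for e in p_true)

def gap(p_true, p, q):
    # integral of P_true * log(P/Q) over events
    return sum(p_true[e] * math.log(p[e] / q[e]) for e in p_true)

p_true = {"H": 0.7, "T": 0.3}
p = {"H": 0.6, "T": 0.4}
q = {"H": 0.5, "T": 0.5}

# The gap equals Q's average surprise minus P's
assert abs(gap(p_true, p, q)
           - (avg_surprise(p_true, q) - avg_surprise(p_true, p))) < 1e-12
```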
Gibbs’ inequality says the following:
For any two different probability distributions P and Q:
EP[Sur(P)] < EP[Sur(Q)]
This means that out of all possible ways of distributing your credences, you should always expect that your own distribution is the least surprising.
In other words, you should always expect to be less surprised than everybody else.
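The inequality is easy to test empirically: fix a distribution P and compare your own expected surprise against the expected surprise of many randomly generated distributions Q. A sketch (the three-outcome P is an arbitrary choice of mine):

```python
import math
import random

def cross_entropy(p, q):
    # E_P[Sur(Q)]: expected surprise of distribution q, under p
    return sum(pi * -math.log(qi) for pi, qi in zip(p, q))

random.seed(0)
p = [0.2, 0.3, 0.5]
own = cross_entropy(p, p)  # entropy: your own expected surprise

for _ in range(1000):
    w = [random.random() for _ in p]
    total = sum(w)
    q = [x / total for x in w]  # a random distribution Q
    # Gibbs' inequality: no Q should look less surprising than P itself
    assert cross_entropy(p, q) >= own - 1e-12

print("Gibbs' inequality held for all 1000 random distributions")
```

None of the thousand random distributions beats P at its own game, which is exactly the claim: by your own lights, your distribution minimizes expected surprise.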
This is really unintuitive, and I’m not sure how to make sense of it. Say that you think that a coin will either land heads or tails, with probability 50% for each. In addition, you are with somebody (whom we’ll call A) that you know has perfect information about how the coin will land.
Does it make sense to say that you expect them to be more surprised about the result of the coin flip than you will be? This hardly seems intuitive. One potential way out is that the statement “A knows exactly how the coin will land” has not actually been included in your probability distribution, so it isn’t fair to stipulate that you know this. One way to try to add in this information is to model their knowledge by something like “There’s a 50% chance that A’s distribution is 100% H, and a 50% chance that it is 100% T.”
The problem is that when you average over these distributions, you get a new distribution that is identical to your own. This is clearly not capturing the state of knowledge in question.
Another possibility is that we should not be thinking about the expected surprise of people, but solely of distributions. In other words, Gibbs’ inequality tells us that you will expect a higher average surprise for any distribution that you are handed than for your own distribution. This can only be translated into statements about people’s average surprise when their knowledge can be directly translated into a distribution.