The original method of Maximum Entropy, MaxEnt, was designed to assign probabilities on the basis of information in the form of constraints. It gradually evolved into a more general method, the method of Maximum relative Entropy (abbreviated ME), which allows one to update probabilities from arbitrary priors unlike the original MaxEnt which is restricted to updates from a uniform background measure.

The realization that ME includes not just MaxEnt but also Bayes’ rule as special cases is highly significant. First, it implies that ME is capable of reproducing every aspect of orthodox Bayesian inference and proves the complete compatibility of Bayesian and entropy methods. Second, it opens the door to tackling problems that could not be addressed by either the MaxEnt or orthodox Bayesian methods individually.

Giffin and Caticha

https://arxiv.org/pdf/0708.1593.pdf

I want to heap a little more praise on the principle of maximum entropy. Previously I’ve praised it as a solution to the problem of the priors – a way to decide what to believe when you are totally ignorant.

But it’s much much more than that. The procedure by which you maximize entropy is not only applicable in total ignorance, but is also applicable in the presence of partial information! So not only can we calculate the maximum entropy distribution given total ignorance, but we can also calculate the maximum entropy distribution given some set of evidence constraints.
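To make "maximum entropy given some evidence" concrete, here's a minimal sketch in Python. The setup is my own illustrative assumption rather than part of the argument below: a six-sided die where the only evidence is its average roll. Under a mean constraint the maximum entropy distribution takes the form p_i ∝ exp(λ · i), and the hypothetical helper `maxent_with_mean` finds λ by bisection.

```python
import math

def maxent_with_mean(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Maximum entropy distribution over `values` with a fixed mean.

    The solution has the exponential form p_i proportional to exp(lam * v_i).
    We find lam by bisection, since the mean is increasing in lam.
    """
    def dist(lam):
        # Subtract the largest exponent for numerical stability.
        exps = [lam * v for v in values]
        m = max(exps)
        w = [math.exp(e - m) for e in exps]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(p):
        return sum(pi * v for pi, v in zip(p, values))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(dist(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist(0.5 * (lo + hi))

def entropy(p):
    """Shannon entropy -sum p log p, skipping zero-probability terms."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

faces = [1, 2, 3, 4, 5, 6]
uniform = maxent_with_mean(faces, 3.5)  # mean of a fair die: no real information
skewed = maxent_with_mean(faces, 4.5)   # hypothetical evidence: average roll is 4.5
print([round(p, 3) for p in uniform])
print([round(p, 3) for p in skewed])
```

When the constraint carries no real information (a mean of 3.5), the procedure should hand back the uniform distribution, which is the total-ignorance case. With a mean of 4.5 it tilts toward the higher faces, but only as much as the evidence demands.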

That is, the principle of maximum entropy is not *just* a solution to the problem of the priors – it’s an entire epistemic framework in itself! It tells you what you should believe at any given moment, given any evidence that you have. And it’s *better* than Bayesianism in the sense that the question of priors never comes up – we maximize entropy when we don’t have any evidence just like we do when we *do* have evidence! There is no need for a special case study of the zero-evidence limit.

But a natural question arises – if the principle of maximum entropy and Bayes’ rule are both self-contained procedures for updating your beliefs in the face of evidence, are these two procedures consistent?

Anddd the answer is, yes! They’re perfectly consistent. Bayes’ rule leads you from one set of beliefs to the set of beliefs that are maximally uncertain under the new information you receive.

This post will be proving that Bayes’ rule arises naturally from maximizing entropy after you receive evidence.

But first, let me point out that we’re making a slight shift in our definition of entropy, as suggested in the quote I started this post with. Rather than maximizing the entropy S(P) = – ∫ P log(P) dx, we will maximize the *relative* entropy:

S_{rel}(P, P_{old}) = – ∫ P log(P / P_{old}) dx.

The relative entropy is much more general than the ordinary entropy – it measures a distribution against a reference distribution, and gives a simple way to talk about the change in beliefs from a previous distribution to a new one. Its negative, –S_{rel}(P, P_{old}), is the Kullback-Leibler divergence. Intuitively, that is the additional information that is required to specify P, once you’ve already specified P_{old}. You can think of it in terms of surprisal: –S_{rel}(P, Q) is how much more surprised you will be, on average, if P is true and you believe Q than if P is true and you believe P. So maximizing S_{rel} means moving as little as possible from P_{old} while still respecting the constraints.
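Here's a small discrete sketch of the relative entropy defined above, with the integral replaced by a sum. The function name and the example distributions are assumptions for illustration only. With the sign convention used here, the value is never positive and equals zero exactly when P = P_{old}.

```python
import math

def relative_entropy(p, p_old):
    """Discrete analogue of S_rel(P, P_old) = -sum_i P_i log(P_i / P_old_i).

    Terms with P_i = 0 contribute nothing (the limit of t*log(t) as t -> 0).
    Assumes P_old_i > 0 wherever P_i > 0.
    """
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, p_old) if pi > 0)

p_old = [0.5, 0.3, 0.2]    # hypothetical prior beliefs
p_same = [0.5, 0.3, 0.2]   # beliefs unchanged
p_shift = [0.1, 0.3, 0.6]  # beliefs moved away from the prior

print(relative_entropy(p_same, p_old))   # zero: nothing changed
print(relative_entropy(p_shift, p_old))  # negative: we moved away from the prior
```

If P_{old} is uniform over n outcomes, this reduces to the ordinary entropy minus log(n), so maximizing one is the same as maximizing the other.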

You might be concerned that this function no longer has the nice properties of entropy that we discussed earlier, the ones that made it the only possible function for consistently representing uncertainty. But these worries aren’t warranted. If some set of initial constraints give P_{old} as the maximum-entropy distribution, then the distribution that maximizes relative entropy with just the new constraints will be the same as the distribution that maximizes entropy with the new constraints *and* the value of your prior distribution as an additional constraint.

Okay, so from now on whenever I talk about entropy, I’m talking about *relative* entropy. I’ll just denote it by S as usual, instead of writing out S_{rel} every time. We’ll now prove that the prescribed change in your beliefs upon receiving the results of an experiment is the same under Bayesian conditionalization as it is under maximum entropy.

Say that our probability distribution is over the possible values of some parameter A and the possible results of an experiment that will tell us the value of X. Thus our initial model of reality can be written as:

P_{init}(A = a, X = x), and

P_{init}(A = a) = ∫ dx P_{init}(A = a | X = x) P_{init}(X = x)

Which we’ll rewrite for ease of notation as:

P_{init}(a, x), and

P_{init}(a) = ∫ P_{init}(a | x) P_{init}(x) dx

Ordinary Bayesian conditionalization says that when we receive the information that the experiment returned the result X = x’, we update our probabilities as follows:

P_{new}(a) = P_{init}(a | x’)
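For discrete A and X, this update is just "pick out the column for the observed x’ and renormalize". A minimal sketch, where the joint table and the function name are made up for illustration:

```python
def conditionalize(joint, x_obs):
    """Return P_new(a) = P_init(a | x_obs) from a joint table.

    `joint[a][x]` holds P_init(a, x) for discrete a and x.
    """
    # P_init(x_obs), obtained by marginalizing over a.
    p_x = sum(row[x_obs] for row in joint)
    if p_x == 0:
        raise ValueError("observed outcome has zero prior probability")
    return [row[x_obs] / p_x for row in joint]

# Hypothetical joint distribution: rows index a, columns index x.
joint = [
    [0.10, 0.30],
    [0.20, 0.05],
    [0.15, 0.20],
]
posterior = conditionalize(joint, x_obs=1)
print(posterior)
```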

What does the principle of maximum entropy say to do? It prescribes the following algorithm:

Maximize the value of S = – ∫ da dx P(a, x) log( P(a,x) / P_{init}(a, x) )

with the following constraints:

Constraint 1: ∫ da dx P(a, x) = 1

Constraint 2: P(x) = δ(x – x’)

Constraint 2 represents the experimental information: we are now certain that the value of X is x’, so our new probability distribution over X is zero everywhere except at X = x’. Notice that it is actually an infinite number of constraints – one for each value of X.

We will rewrite Constraint 2 so that it is of the same form as the entropy function and the first constraint:

Constraint 2: ∫ da P(a, x) = δ(x – x’)

The method of Lagrange multipliers tells us how to solve this constrained maximization problem!

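First, define a new quantity L (not to be confused with the parameter A), with one multiplier α for Constraint 1 and one multiplier β(x) for each value of x in Constraint 2:

L = S + α · (Constraint 1) + ∫ dx β(x) · (Constraint 2)

= – ∫ da dx P log(P/P_{init}) + α · [ ∫ da dx P – 1 ] + ∫ dx β(x) · [ ∫ da P – δ(x – x’) ]

Now we solve! We set the variation of L with respect to P to zero; the terms that don’t depend on P drop out:

∆L = 0

∆_{P} ∫ da dx [ – P log(P/P_{init}) + α P + β(x) P ] = 0

∂_{P} [ – P log P + P log P_{init} + α P + β(x) P ] = 0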

-log P_{new} – 1 + log P_{init} + α + β(x) = 0

P_{new}(a, x) = P_{init}(a, x) · e^{β(x)}/Z

Z = e^{1 – α} is our normalization constant, and we can find β(x) by applying Constraint 2:

Constraint 2: ∫ da P(a, x) = δ(x – x’)

∫ da P_{init}(a, x) · e^{β(x)}/Z = δ(x – x’)

P_{init}(x) · e^{β(x)}/Z = δ(x – x’)

And finally, we can plug in:

P_{new}(a, x) = P_{init}(a, x) · e^{β(x)} / Z

= P_{init}(a | x) · P_{init}(x) · e^{β(x)} / Z

= P_{init}(a | x) · δ(x – x’)

So, integrating over x: P_{new}(a) = ∫ dx P_{init}(a | x) · δ(x – x’) = P_{init}(a | x’)

Exactly the same as Bayesian conditionalization!!
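If you'd like a numerical sanity check, here's a rough sketch for the discrete case. It is a spot check, not a proof, and the joint table and helper names are assumptions for illustration. Any distribution satisfying the discrete version of Constraint 2 has the form P(a, x) = q(a) when x = x’ and zero otherwise, so we can compare the relative entropy of the Bayesian posterior against a pile of randomly sampled alternatives q.

```python
import math
import random

def constrained_entropy(q, joint, x_obs):
    """Relative entropy of P(a, x) = q(a) * [x == x_obs] against the prior joint.

    Only the column x = x_obs contributes, since P is zero elsewhere.
    """
    return -sum(qa * math.log(qa / row[x_obs])
                for qa, row in zip(q, joint) if qa > 0)

def bayes_posterior(joint, x_obs):
    """P_init(a | x_obs): the observed column, renormalized."""
    p_x = sum(row[x_obs] for row in joint)
    return [row[x_obs] / p_x for row in joint]

def random_distribution(n, rng):
    """A random point on the probability simplex (hypothetical sampler)."""
    w = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(w)
    return [wi / total for wi in w]

# Hypothetical prior joint P_init(a, x): rows index a, columns index x.
joint = [
    [0.10, 0.30],
    [0.20, 0.05],
    [0.15, 0.20],
]
x_obs = 1
rng = random.Random(0)

s_bayes = constrained_entropy(bayes_posterior(joint, x_obs), joint, x_obs)
s_best_random = max(
    constrained_entropy(random_distribution(len(joint), rng), joint, x_obs)
    for _ in range(10_000)
)
print(s_bayes, s_best_random)
```

The Bayesian posterior attains the value log P_{init}(x’), and none of the sampled alternatives should beat it, which is what the derivation above predicts.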

What’s so great about this is that the principle of maximum entropy is an entire theory of normative epistemology in its own right, *and* it’s equivalent to Bayesianism, AND it has no problem of the priors!

If you’re a Bayesian, then you know what to do when you encounter new evidence, as long as you already have a prior in hand. But when somebody asks you how you should choose the prior that you have… well then you’re stumped, or have to appeal to some other prior-setting principle outside of Bayes’ rule.

But if you ask a maximum-entropy theorist how they got their priors, they just answer: “The same way I got all of my other beliefs! I just maximize my uncertainty, subject to the information that I possess as constraints. I don’t need any special consideration for the situation in which I possess *no* information – I just maximize entropy with no constraints at all!”

I think this is wonderful. It’s also really aesthetic. The principle of maximum entropy says that you should be *honest about your uncertainty*. You should choose your beliefs in such a way as to ensure that you’re not pretending to know anything that you don’t know. And there is a single unique way to do this – by maximizing the function – ∫ P log P.

Any other distribution you might choose represents a decision to *pretend that you know things that you don’t know* – and maximum entropy says that you should never do this. It’s an epistemological framework built on the virtue of humility!
