Entropy vs relative entropy

This post is about the relationship between entropy and relative entropy. This relationship is subtle but important – purely maximizing entropy (MaxEnt) is not equivalent to Bayesian conditionalization except in special cases, while maximizing relative entropy (ME) is. In addition, the justifications for MaxEnt are beautiful and grounded in fundamental principles of normative epistemology. Do these justifications carry over to maximizing relative entropy?

To a degree, yes. We’ll see that maximizing relative entropy is a more general procedure than maximizing entropy, and reduces to it in special cases. The cases where MaxEnt gives different results from ME can be interpreted through the lens of MaxEnt, and relate to an interesting distinction between commuting and non-commuting observations.

So let’s get started!

We’ll solve three problems: first, using MaxEnt to find an optimal distribution with a single constraint C1; second, using MaxEnt to find an optimal distribution with constraints C1 and C2; and third, using ME to find the optimal distribution with C2 given the starting distribution found in the first problem.

Part 1

Problem: Maximize – ∫ P logP dx with constraints
∫ P dx = 1
∫ C1[P] dx = 0

P ( – P1 logP1 + (α + 1) P1 + βC1[P1] ) = 0
– logP1 + α + β C1’[P1] = 0

Part 2

Problem: Maximize – ∫ P logP dx with constraints
∫ P dx = 1
∫ C1[P] dx = 0
∫ C2[P] dx = 0

P ( – P2 logP2 + (α’ + 1) P2 + β’C1[P2] + λ C2[P2] ) = 0
– logP2 + α’ + β’ C1’[P2] + λ C2’[P2] = 0

Part 3

Problem: Maximize – ∫ P log(P / P1) dx with constraints
∫ P dx = 1
∫ C2[P] dx = 0

P ( – P3 logP3 + P3 logP1 + (α’’ + 1)P3 + λ’ C2[P3] ) = 0
– logP3 + α’’ + logP1 + λ’ C2’[P3] = 0
– logP3 + α’’ + α + β C1’[P1] + λ’ C2’[P3] = 0
– logP3 + α’’’ + β C1’[P1] + λ’ C2’[P3] = 0

We can now compare our answers for Part 2 to Part 3. These are the same problem, solved with MaxEnt and ME. While they are clearly different solutions, they have interesting similarities.

MaxEnt
– logP2 + α’ + β’ C1’[P2] + λ C2’[P2] = 0
∫ P2 dx = 1
∫ C1[P2] dx = 0
∫ C2[P2] dx = 0

ME
– logP3 + α’’’ + β C1’[P1] + λ’ C2’[P3] = 0
∫ P3 dx = 1
∫ C1[P1] dx = 0
∫ C2[P3] dx = 0

The equations are almost identical. The only difference is in how they treat the old constraint. In MaxEnt, the old constraint is treated just like the new constraint – a condition that must be satisfied for the final distribution.

But in ME, the old constraint is no longer required to be satisfied by the final distribution! Instead, the requirement is that the old constraint be satisfied by your initial distribution!

That is, MaxEnt takes all previous information, and treats it as current information that must constrain your current probability distribution.

On the other hand, ME treats your previous information as constraints only on your starting distribution, and only ensures that your new distribution satisfy the new constraint!

When might this be useful?

Well, say that the first piece of information you received, C1, was the expected value of some measurable quantity. Maybe it was that x̄ = 5.

But if the second piece of information C2 was an observation of the exact value of x, then we clearly no longer want our new distribution to still have an expected value of x̄. After all, it is common for the expected value of a variable to differ from the exact value of x.

E(x) vs x

Once we have found the exact value of x, all previous information relating to the value of x is screened off, and should no longer be taken as constraints on our distribution! And this is exactly what ME does, and MaxEnt fails to do.

What about a case where the old information stays relevant? For instance, an observation of the values of a certain variable is not ‘cancelled out’ by a later observation of another variable. Observations can’t be un-observed. Does ME respect these types of constraints?

Yes!

Observations of variables are represented by constraints that set the distribution over those variables to delta-functions. And when your old distribution contains a delta function, that delta function will still stick around in your new distribution, ensuring that the old constraint is still satisfied.

Pold ~ δ(x – x’)
implies
Pnew ~ δ(x – x’)

The class of observations that are made obsolete by new observations are called non-commuting observations. They are given this name because for such observations, the order in which you process the information is essential. Observations for which the order of processing doesn’t matter are called commuting observations.

In summation: maximizing relative entropy allows us to take into account subtle differences in the type of evidence we receive, such as whether or not old data is made obsolete by new data. And mathematically, maximizing relative entropy is equivalent to maximizing ordinary entropy with whatever new constraints were not included in your initial distribution, as well as an additional constraint relating to the value of your old distribution. While the old constraints are not guaranteed to be satisfied by your new distribution, the information about them is preserved in the form of the prior distribution that is a factor in the new distribution.

Anthropic argument for common priors

(Idea from Robin Hanson and Tyler Cowen’s 2004 paper Are Disagreements Honest?)

One common argument relating to common priors is that two rational agents with all the same information (including no information at all) could have no possible grounds on which to disagree. Priors by definition refer to the state of knowledge before either agent had any evidence relevant to a given proposition. So there is no information that either agent could have that would allow a difference in priors.

A response to this is that some information that we have is inherently private and unique to us. For instance, you and I might have differences in intelligence, in ways of conceptualizing the world, or in the things we innately find intuitively plausible. All of these differences may count as important information in shaping our priors on a given subject, before we ever encounter a single piece of evidence relevant to the subject.

Here’s a really weird argument for why even these differences should not count. If we use anthropic reasoning, and treat our own existence and the details of our brain and body as just another thing to be conditioned on, then even these private intimate details are simply contingent facts about the world that are to be treated as evidence. Before you’ve conditioned on your own existence, you should be agnostic as to which set of brain/body/mind out of all the possible sets of observers “you” will end up being. You must imagine yourself behind Rawls’ veil of ignorance, a disembodied reasoner that is identical to all other such reasoners. So there is no conceivable reason why your prior should differ from anybody else’s – you must treat yourself as literally the same entity as them pre anthropic conditioning.

In less out-there terms, if you encounter somebody with an apparently different prior from you, then you should consider “Hmm, what if I were born as this person, instead of myself?” The answer to which is, of course, you would have had the same priors as them. Which means that your difference in “priors” is actually a difference of posteriors resulting from conditioning on the arbitrary choice of body/brain/experiences you ended up with.

In addition, by Aumann’s agreement theorem, any apparent differences in priors that become common knowledge should quickly go away, once they are realized to be merely differences in posteriors. Essentially, any differences in priors that last between two rational individuals are signs that they are arbitrarily favoring their own existence in considerations of what prior they should use.

Why you should be a Bayesian

In this post, I take Bayesianism to be the following normative epistemological claim: “You should treat your beliefs like probabilities, and reason according to the axioms of probability theory.”

Here are a few reasons why I support this claim:

I. Logic is not enough

Reasoning deductively from premises to conclusion is a wonderfully powerful tool when it can be applied. If you have absolute certainty in some set of premises, and these premises entail a new conclusion, then you can extend your certainty to the new conclusion. Alternatively, you can state clearly the conditions under which you would be granted certain belief, in the form of a conditional argument (if you were to convince me that A and B are true, then I would believe that C is true).

This is great for mathematicians proving theorems about abstract logical entities. But in the real world, deductive inference is simply not enough to account for the types of problems we face. We are constantly reasoning in a condition of uncertainty, where we have multiple competing theories about what’s going on, and we seek evidence – partial evidence, not deductively complete evidence – as to which of these theories we should favor.

If you want to know how to form beliefs about the parts of reality that aren’t clear-cut and certain, then you need to go beyond pure logic.

II. Probability theory is a natural extension of logic

Cox’s theorem shows that any system of plausible reasoning – modifying and updating beliefs in the presence of uncertainty – that is consistent with logic and a few minimal assumptions about normative reasoning is necessarily isomorphic to probability theory.

The converse of this is that any system of reasoning under uncertainty that isn’t ultimately functionally equivalent to Bayesianism is either logically inconsistent or violates other common-sense axioms of reasoning.

In other words, probability theory is the best candidate that we have for extending logic into the domain of the uncertain. It is about what is likely, not certain, to be true, and the way that we should update these assessments when receiving new information. In turn, probability theory contains ordinary logic as a special case when you take the limit of absolutely certainty.

III. Non-Bayesian systems of plausible reasoning result in inconsistencies and irrational behavior

Dutch-book arguments prove that any agent that is violating the axioms of probability theory can be exploited by cleverly capitalizing on logical inconsistencies in their beliefs. This combines a pragmatic argument (non-Bayesians are worse off in the long run) with an epistemic argument (non-Bayesians are vulnerable to logical inconsistencies in their preferences).

IV. You should be honest about your uncertainty

The principle of maximizing entropy mandates a unique way to set beliefs given your evidence, such that you make no presumptions about knowledge that you don’t have. This principle is fully consistent with and equivalent to standard Bayesian conditionalization.

In other words, Bayesianism is about epistemic humility – it tells you to not pretend to know things that you don’t know.

V. Bayesianism provides the foundations for the scientific method

The scientific method, needless to say, is humanity’s crowning epistemic achievement. With it we have invented medicine, probed atoms, and gone to the stars. Its success can be attributed to the structure of its method of investigating truth claims: in short, science is about searching theories for testable consequences, and then running experiments to update our beliefs in these theories.

This is all contained in Bayes’ rule, the fundamental law of probabilistic inference:

Pr(theory | evidence) ~ Pr(evidence | theory) · Pr(theory)

This rule tells you precisely how you should update your beliefs given your evidence, no more and no less. It contains the wisdom of empiricism that has revolutionized the world we live in.

VI. Bayesianism is practically useful

So maybe you’re convinced that Bayesianism is right in principle. There’s a separate question of if Bayesianism is useful in practice. Maybe treating your beliefs like probabilities is like trying to do psychology starting from Schrödinger’s equation – possible in principle but practically infeasible, not to mention a waste of time.

But Bayesianism is practically useful.

Statistical mechanics, one of the most powerful branches of modern science, is built on a foundation of explicitly Bayesian principles. More generally, good statistical reasoning is incredibly useful across all domains of truth-seeking, and an essential skill for anybody that wants to understand the world.

And Bayesianism is not just useful for epistemic reasons. A fundamental ingredient of decision-making is the ability to produce accurate models of reality. If you want to effectively achieve your goals, whatever they are, you must be able to engage in careful probabilistic reasoning.

And finally, in my personal experience I have found Bayesian epistemology to be infinitely mineable for useful heuristics in thinking about philosophy, physics, altruism, psychology, politics, my personal life, and pretty much everything else. I recommend anybody whose interest has been sparked to check out the following links:

 

Consistency and priors

The method of reasoning illustrated here is somewhat reminiscent of Laplace’s “principle of indifference.” However, we are concerned here with indifference between problems, rather than indifference between events. The distinction is essential, for indifference between events is a matter of intuitive judgment on which our intuition often fails even when there is some obvious geometrical symmetry (as Bertrand’s paradox shows).

E. T. Jaynes
Prior Probabilities

I’ve previously written praise of the principle of maximum entropy as a prior-setting method that is justified on the basis of a very minimal and highly intuitive set of epistemic features.

But there’s an even better technique for prior-setting, one that is justified on incredibly fundamental grounds. This technique can only be used in rare times, and is immensely powerful when it is used. It’s the principle of transformation groups.

Here is the single assumption from which the principle arises:

“In problems where we have the same prior information, we should assign the same prior probabilities.” (Jaynes’ wording)

This is simple to the point of seeming almost tautological. So what can we do with it?

We’ll start with one of the simplest applications of transformation groups. Suppose that somebody gives you the following information:

I = “This coin will land either tails or heads.”

Now you want to say what the following probabilities should be:

P(This coin will land tails | I) = p
P(This coin will land heads | I) = q

Intuitively, it seems obvious to us that absent any other information, we should assign equal probabilities to these. But why? Is there a principled reason for assuming that the coin is a fair coin? Or is this just a presumption that is importing into the problem our background knowledge about most coins being fair?

The method of transformation groups gives us a principled reason. It says to rephrase the problem as follows:

I’ = “This coin will land either heads or tails.”

Now, our initial problem has only changed to our new problem by replacing every “heads” with “tails” and “tails” with “heads”. Since our prior-setting procedure found that P(This coin will land tails | I) = p in the first problem, it should now find P(This coin will land heads | I) = P in this new one. This is required for any consistent prior-setting procedure! If the problem changes by just switching places of labels, then the priors should change in the exact same way. This means that:

P(This coin will land heads | I’) = p
P(This coin will land tails | I’) = q

But clearly, I = I’; the logical operator “OR” is symmetric! Which means that:

P(This coin will land heads | I’) = P(This coin will land heads | I)

And this is only possible if p = q = ½!

This is simple, but beautiful. The principle tells us that the only logical way to set our priors in this case is evenly – anything else would be either logically inconsistent, or assuming extra information that breaks the symmetry between heads and tails. It goes from logical symmetry to probability symmetry!

Finding these symmetries is what the method of transformation groups is all about. More generally, one can represent a choice between N different possibilities as the statement:

I = “Possibility 1 or possibility 2 or … possibility N”

But this is symmetric with:

I’ = “Possibility 2 or possibility 1 or … possibility N”

As well as all other orderings.

By the exact same argument as above, your prior-setting procedure is required by logical consistency to evenly distribute credences across the N procedures. So for each n from 1 to N, P(Possibility k | I) = 1/N.

The method of transformation groups can also be applied to continuous variables, where finding the right set of priors can be a lot less intuitive. You do so by noting different types of symmetries for different types of parameters.

For instance, a location parameter is one that serves to merely shift a probability distribution over an observable quantity, without reshaping the distribution. We can formally express this by saying that for a location parameter µ, the distribution over x depends only on the difference x – µ:

p(x | µ) = f(x – µ)

For such parameters, it must be the case that the prior distribution over them is similarly symmetric over translational shifts:

For all ∆, p(µ) = p(µ + ∆)
So p(µ) = c, for some constant c

Another common category of parameters are scale parameters. These are parameters that serve to rescale probability distributions without reshaping them. Formally:

p(x | σ) = 1/σ g(x / σ)

For this symmetry, the requirement for consistency is:

For all s, p(σ) = 1/s · p(σ / s)
So, p(σ) = 1/σ

In summation, by carefully analyzing the symmetries of the background information you have, you can extract out requirements for how to set your prior distribution that are mandated on threat of logical inconsistency!

Dutch book arguments

These are the laws of probability, which we have proved to be necessarily true of any consistent set of degrees of belief. Any definite set of degrees of belief which broke them would be inconsistent in the sense that it violated the laws of preference between options … If anyone’s mental condition violated these laws, his choice would depend on the precise form in which the options were offered him, which would be absurd. He could have a book made against him by a cunning better and would then stand to lose in any event.

We find, therefore, that a precise account of the nature of partial belief reveals that the laws of probability are laws of consistency, an extension to partial beliefs of formal logic, the logic of consistency.

Frank Ramsey
The Foundations of Mathematics and Other Logical Essays, Volume 5

 

In this post, I’m going to describe one of the more famous arguments for Bayesianism.

These arguments are about how different types of epistemological frameworks will handle different series of wagers. Let me just lay out clearly what exactly we mean by a wager, so as to remove any ambiguity.

A wager on proposition A is a betting opportunity. It involves a payoff amount S and a buy-in quantity. In general, the amount that the buy-in costs will be some fraction f of the payoff amount, so we’ll write it as fS. If you bet on A and it turns out true, then you get the payout S, but still lost the initial buy-in. And if you bet on A and it turns out false, then you get no payout and lose the fS you already spent.

A Net Payout
True

S – fS

False

-fS

From this payout table, you can calculate that an agent will find the wager to be favorable to them exactly in the case that P(A) is greater than f. That is, the agent will want to take the bet whenever the chance of a payout is greater than the proportion of the payout that is required to buy into the bet.

Now, imagine that somebody has a credence of 52% in a proposition A and a credence of 52% in the proposition ~A. How will they evaluate the following set of bets?

B1: pays out $100 if A is true, buy-in of $51
B2: pays out $100 if ~A is true, buy-in of $51
B3: pays out a guaranteed $100, buy-in of $102

They will see both B1 and B2 as favorable bets, since the buy-in is a smaller fraction of the payout than the chance of payout. And they will see B3 as an unfavorable bet, since clearly the buy-in is a larger proportion of the payout than the chance of a payout.

But B3 is just the same as the combination of bets B1 and B2!

Why? Well, if you bet on both B1 and B2, then you are guaranteed to win exactly one of the two (since A and ~A cannot both be true, but one of the two must be). Then you will have paid in a net sum of $102, and gotten back only $100.

A similar argument can be made for any levels of credence C(A) and C(~A) that don’t sum up to 100%. And all of the usual axioms of probability theory can be argued for in the same way. Such arguments are called Dutch book arguments.

Dutch book arguments are standardly presented as revealing that if one does not form beliefs according to the laws of probability theory, then they will be able to be juiced for money by clever bookies.

This is true; somebody with beliefs like those described above can be endlessly exploited for profit. But it is much less impressive than the real conclusion of the Dutch book argument.

Recall that our agent above was found to believing a logical contradiction as a result of not having their beliefs align with probability theory (they had to believe that a bet was simultaneously favorable to them and not favorable to them)

Said another way, an agent not following the probability calculus may evaluate the same proposition differently if presented in a different form.

This is what Dutch book arguments really say: if you want your beliefs to be logically consistent, then you are required to reason according to probability theory!

Cox’s theorem

A very original and thoroughgoing development of the theory of probability, which does not depend on the concept of frequency in an ensemble, has been given by Keynes. In his view, the theory of probability is an extended logic, the logic of probable inference. Probability is a relation between a hypothesis and a conclusion, corresponding to the degree of rational belief and limited by the extreme relations of certainty and impossibility. Classical deductive logic, which deals with these limiting relations only, is a special case in this more general development.

R. T. Cox
Probability, Frequency, and Reasonable Expectation
(Yep, that Keynes! He was an influential early Bayesian thinker as well as a famous economist)

Cox’s famous theorem says that if your way of reasoning is not in some sense isomorphic to Bayesianism, then you are violating one of the following ten basic principles. I’ll derive the theorem in this post.

Logic

  1. ~~a = a
  2. ab = ba
  3. (ab)c = a(bc)
  4. aa = a
  5. ~(ab) = ~a or ~b
  6. a(a or b) = a

Reasoning under uncertainty

  1. Degrees of plausibility are represented by real numbers: {b | a}
  2. {bc | a} = F({b | a}, {c | ab})
  3. {~b | a} = G({b | a})
  4. F and G are monotonic.

The first six, I think, need no introduction. I write “a and b” as “ab” for aesthetics.

The next four extend logic beyond the realm of the perfectly certain, and relate to how we should reason in the presence of uncertainty – how we should reason about degrees of plausibility. For this step, we need a new notation: a way to represent not the truth of a proposition a, but its plausibility. We do this with the {} notation: {a | b} = the plausibility of a given that b is known to be true.

I’ll make some brief notes about each of 7 through 10.

Number 7 perhaps sounds strange – plausibilities are states of belief, not numbers. But you can make sense of this in two ways: first, we can consider this a simple model of plausibilities, in which we are merely mirroring the properties of plausibilities in the structure of the system we set up around how to manipulate numbers. And second, if one is to design a robot that reasons about the world, it isn’t crazy to think about programming it to represent beliefs about the plausibilities of propositions as numbers.

#7 also contains an assumption of continuity – that it’s not the case that there are discontinuous jumps in the plausibility of propositions. Said another way, if two different real numbers represent two different degrees of plausibility, then every possible value between them should also represent a degree of plausibility.

Number 8 is about relevance. It says that the plausibility that b and c are both true given some background information a, is only dependent on (1) the plausibility that b is true given a and (2) the plausibility that c is true given a and b.

If you want to know how likely it is that b and c are both true given a, you can break the process down into two steps. First you determine how likely it is that b is true, given your background information. And second, you determine how likely it is that a is true, given both your background information and the truth of b.

In other words, all that you need to know if you want to know {bc | a} are {b | a} and {c | ab}. For example, say that you want the plausibility of Usain Bolt winning the 100 meter dash and also running an extra lap around the track at the end of the dash. The only pieces of information you need for this are (1) the plausibility of Usain Bolt winning the 100 meter dash, and (2) the plausibility of him running an extra lap at the end of the race, given that he won the race. And of course, each of these plausibilities are also conditional on all of the background information you have about the situation.

We give the name F to the function that details precisely how we determine {bc | a}.

Number 9 says that all we need to determine the plausibility of a proposition being false is the plausibility of the proposition being true. The function that details how we determine one from the other is named G.

And finally, Number 10 says that the plausibility relationships described in 8 and 9 are in some sense simple. For instance, if b becomes less plausible and ~b becomes more plausible, then if b becomes even less plausible, ~b should not suddenly become less plausible. If a change in the plausibility of a proposition a makes the plausibility of another proposition b change in a certain direction, then a greater change in the plausibility of a in the same direction should not result in a reversal of the direction of change of the plausibility of b.

There is an additional implicit assumption that is almost too obvious to be stated: If two propositions are just different phrasings of the same state of knowledge, then you should have the same plausibility in both of them. This is the basic requirement of consistency. It says, for instance, that “a and b” is exactly as plausible as “b and a”.

***

Now we can derive Bayesianism!

First step: conditional probabilities. We use the associativity of “and”:

{bcd | a} = F( {b | a}, {cd | ab} ) = F( {b | a}, F( {c | ab}, {d | abc} ) )
{bcd | a} = F( {bc | a}, {d | abc} ) = F( F({b | a}, {c | ab}), {d | abc} ) )

So F(x, F(y, z)) = F(F(x, y), z)

Any monotonic function that satisfies this functional equation is isomorphic to ordinary multiplication on the interval [0, 1].

In other words, there exists a function W such that:

W(F(x, y)) = W(x) · W(y)

From which we can see:

W({bc | a}) = W({b | a}) · W({c | ab})

Which is exactly the form of conjunction in ordinary probability theory!

This means that any form of reasoning about the plausibility of conjunctions of propositions is either violating one of the 10 axioms, or is equivalent to probability theory up to isomorphism.

So if somebody walks up to you and presents to you an algorithm for computing the plausibility of the conjunction of two propositions, and you know that they are following the above ten rules, then you are guaranteed to be able to find some function that takes in their algorithm and translates into ordinary Bayesianism.

This translation function is W! Notice that although W looks a lot like ordinary probability, it is a different thing entirely. For instance, if somebody is already reasoning with ordinary probability theory, then {b | a} = P(b | a). To convert {b | a} into P(b | a), then, W need not do anything! So W({b | a}) = {b | a} = P(b | a), or W(x) = x, which is clearly different from the probability function.

Next, we reveal the nature of the function G.

The first step is easy:

{~~b | a} = G(G({b | a}))
So G(G(x)) = x

This means that G is an involution – a function that is its own inverse.

Next, we use the commutativity of “and”:

W({bc | a}) = W({b | a}) · W({c | ab})
= W({b | a}) · G( W({~c | ab}) )
= W({b | a}) · G( W({b~c | a}) / W({b | a}) )

So W({cb | a}) = W({c | a}) · G( W({~bc | a}) / W({c | a}) )

Now if we let c = ~(bd) for some new statement d, and rename W(b | a) = x and W(c | a) = y, we find:

x G( G(y) / x ) = y G( G(x) / y )

The only functions that satisfy this functional equation as well as the earlier involution equation are the following:

G(x) = (1 – xm)1/m, for any m

With some algebraic manipulation, we see that this is equivalent to:

W({a | b})m + W({~a | b})m = 1

This equation almost perfectly represents the normalization rule: that the total probability must sum to 1! It seems mildly inconvenient that this holds true for the function Wm rather than W. If we knew that m had to equal 1, then we’d have a perfect translation function both for conditional probabilities and for normalization… But if we look back at our conditional probability finding, we notice that it can be equivalently represented in terms of Wm instead of W!

W({bc | a}) = W({b | a}) · W({c | ab})
W({bc | a})m = W({b | a})m · W({c | ab})m

Now we just define one last function Q = Wm, and we’re done!

Q takes in any system of reasoning that obeys the starting ten rules, and reveals it to be equivalent to ordinary probability theory.

The rest of probability theory can be shown to result from these basic axioms (we’ll drop the brackets now, and will call Q({a | b}) its proper name: P(a | b).

P(certainty) = 1
P(impossibility) = 0
P(bc | a) + P(b~| a) = P(b | a)
P(a or b | c) = P(a | c) + P(b | c) – P(ab | c)
And so on.

This derivation of probability theory is great because it isn’t limited by the ordinary frequency interpretations of probability theory, in which a probability is defined to be a limit of a ratio of an infinite number of experimental results. This is not only ugly theoretically, but leaves us puzzled as to how to talk about the probabilities of one-shot events or of hypotheses that can’t be directly translated into empirical results.

The definition of probability invoked here is infinitely deeper. Instead of frequencies, probabilities are defined according to fundamental principles of normative epistemology. Any agent that reasons consistently according to a few basic maxims will be reasoning in a way that is functionally identical to probability theory.

And probabilities here can be assigned to any propositions – not just those that refer to empirically measurable events that are repeatable ad infinitum. They represent a normative rational degree of certainty that one must possess if one is to reason consistently!

Maximum Entropy and Bayes

The original method of Maximum Entropy, MaxEnt, was designed to assign probabilities on the basis of information in the form of constraints. It gradually evolved into a more general method, the method of Maximum relative Entropy (abbreviated ME), which allows one to update probabilities from arbitrary priors unlike the original MaxEnt which is restricted to updates from a uniform background measure.

The realization that ME includes not just MaxEnt but also Bayes’ rule as special cases is highly significant. First, it implies that ME is capable of reproducing every aspect of orthodox Bayesian inference and proves the complete compatibility of Bayesian and entropy methods. Second, it opens the door to tackling problems that could not be addressed by either the MaxEnt or orthodox Bayesian methods individually.

Giffin and Caticha
https://arxiv.org/pdf/0708.1593.pdf

I want to heap a little more praise on the principle of maximum entropy. Previously I’ve praised it as a solution to the problem of the priors – a way to decide what to believe when you are totally ignorant.

But it’s much much more than that. The procedure by which you maximize entropy is not only applicable in total ignorance, but is also applicable in the presence of partial information! So not only can we calculate the maximum entropy distribution given total ignorance, but we can also calculate the maximum entropy distribution given some set of evidence constraints.

That is, the principle of maximum entropy is not just a solution to the problem of the priors – it’s an entire epistemic framework in itself! It tells you what you should believe at any given moment, given any evidence that you have. And it’s better than Bayesianism in the sense that the question of priors never comes up – we maximize entropy when we don’t have any evidence just like we do when we do have evidence! There is no need for a special case study of the zero-evidence limit.

But a natural question arises – if the principle of maximum entropy and Bayes’ rule are both self-contained procedures for updating your beliefs in the face of evidence, are these two procedures consistent?

Anddd the answer is, yes! They’re perfectly consistent. Bayes’ rule leads you from one set of beliefs to the set of beliefs that are maximally uncertain under the new information you receive.

This post will be proving that Bayes’ rule arises naturally from maximizing entropy after you receive evidence.

But first, let me point out that we’re making a slight shift in our definition of entropy, as suggested in the quote I started this post with. Rather than maximizing the entropy S(P) = – ∫ P log(P) dx, we will maximize the relative entropy:

Srel(P, Pold) = – ∫ P log(P / Pold) dx.

The relative entropy is much more general than the ordinary entropy – it serves as a way to compare entropies of distributions, and gives a simple way to talk about the change in uncertainty from a previous distribution to a new one. Intuitively, it is the additional information that is required to specify P, once you’ve already specified Pold. You can think of it in terms of surprisal: Srel(P, Q) is how much more surprised you will be if P is true and you believe Q than if P is true and you believe P.

You might be concerned that this function no longer has the nice properties of entropy that we discussed earlier – the only possible function for consistently representing uncertainty. But these worries aren’t warranted. If some set of initial constraints give Pold as the maximum-entropy distribution, then the function that maximizes relative entropy with just the new constraints will be the same as the function that maximizes entropy with the new constraints and the value of your prior distribution as an additional constraint.

Okay, so from now on whenever I talk about entropy, I’m talking about relative entropy. I’ll just denote it by S as usual, instead of writing out Srel every time. We’ll now prove that the prescribed change in your beliefs upon receiving the results of an experiment is the same under Bayesian conditionalization as it is under maximum entropy.

Say that our probability distribution is over the possible values of some parameter A and the possible results of an experiment that will tell us the value of X. Thus our initial model of reality can be written as:

Pinit(A = a, X = x), and
Pinit(A = a) = ∫ dx Pinit(A = a | X = x) P(X = x)

Which we’ll rewrite for ease of notation as:

Pinit(a, x), and
Pinit(a) = ∫ Pinit(a | x) P(x) dx

Ordinary Bayesian conditionalization says that when we receive the information that the experiment returned the result X = x’, we update our probabilities as follows:

Pnew(a) = Pinit(a | x’)

What does the principle of maximum entropy say to do? It prescribes the following algorithm:

Maximize the value of S = – ∫ da dx P(a, x) log( P(a,x) / Pinit(a, x) )
with the following constraints:
Constraint 1: ∫ da dx P(a, x) = 1
Constraint 2: P(x) = δ(x – x’)

Constraint 2 represents the experimental information that our new probability distribution over X is zero everywhere except for at X = x’, and that we are certain that the value of X is x’. Notice that it is actually an infinite number of constraints – one for each value of X.

We will rewrite Constraint 2 so that it is of the same form as the entropy function and the first constraint:

Constraint 2: ∫ da P(a, x) = δ(x – x’)

The method of Lagrange multipliers tells us how to solve this equation!

First, define a new quantity A as follows:

A = S + Constraint 1 + Constraint 2
= – ∫ da dx P log(P/Pinit) + α · [ ∫ da dx P – 1 ] + ∫ dx β(x) · [ ∫ da P – δ(x – x’) ]

Now we solve!

∆A = 0
P ∫ da dx [ – P log(P/Pinit) + α P + β(x) P] = 0
P [ – P log P + P log Pinit + α P + β(x) P ] = 0
-log Pnew – 1 + log Pinit + α + β(x) = 0
Pnew(a, x) = Pinit(a, x) · eβ(x)/Z

Z is our normalization constant, and we can find β(x) by applying Constraint 2:

Constraint 2: ∫ da P(a, x) = δ(x – x’)
∫ da Pinit(a, x) · eβ(x)/Z = δ(x – x’)
Pinit(x) · eβ(x)/Z = δ(x – x’)

And finally, we can plug in:

Pnew(a, x) = Pinit(a, x) · eβ(x) / Z
= Pinit(a | x) · Pinit(x) · eβ(x) / Z
= Pinit(a | x) · δ(x – x’)
So Pnew(a) = Pinit(a | x’)

Exactly the same as Bayesian conditionalization!!

What’s so great about this is that the principle of maximum entropy is an entire theory of normative epistemology in its own right, and it’s equivalent to Bayesianism, AND it has no problem of the priors!

If you’re a Bayesian, then you know what to do when you encounter new evidence, as long as you already have a prior in hand. But when somebody asks you how you should choose the prior that you have… well then you’re stumped, or have to appeal to some other prior-setting principle outside of Bayes’ rule.

But if you ask a maximum-entropy theorist how they got their priors, they just answer: “The same way I got all of my other beliefs! I just maximize my uncertainty, subject to the information that I possess as constraints. I don’t need any special consideration for the situation in which I possess no information – I just maximize entropy with no constraints at all!”

I think this is wonderful. It’s also really aesthetic. The principle of maximum entropy says that you should be honest about your uncertainty. You should choose your beliefs in such a way as to ensure that you’re not pretending to know anything that you don’t know. And there is a single unique way to do this – by maximizing the function ∫ P log P.

Any other distribution you might choose represents a decision to pretend that you know things that you don’t know – and maximum entropy says that you should never do this. It’s an epistemological framework built on the virtue of humility!

Statistical mechanics is wonderful

The law that entropy always increases, holds, I think, the supreme position among the laws of Nature. If someone points out to you that your pet theory of the universe is in disagreement with Maxwell’s equations — then so much the worse for Maxwell’s equations. If it is found to be contradicted by observation — well, these experimentalists do bungle things sometimes. But if your theory is found to be against the second law of thermodynamics I can give you no hope; there is nothing for it but to collapse in deepest humiliation.

 – Eddington

My favorite part of physics is statistical mechanics.

This wasn’t the case when it was first presented to me – it seemed fairly ugly and complicated compared to the elegant and deep formulations of classical mechanics and quantum mechanics. There were too many disconnected rules and special cases messily bundled together to match empirical results. Unlike the rest of physics, I failed to see the same sorts of deep principles motivating the equations we derived.

Since then I’ve realized that I was completely wrong. I’ve come to appreciate it as one of the deepest parts of physics I know, and mentally categorize it somewhere in the intersection of physics, math, and philosophy.

This post is an attempt to convey how statistical mechanics connects these fields, and to show concisely how some of the standard equations of statistical mechanics arise out of deep philosophical principles.

***

The fundamental goal of statistical mechanics is beautiful. It answers the question “How do we apply our knowledge of the universe on the tiniest scale to everyday life?”

In doing so, it bridges the divide between questions about the fundamental nature of reality (What is everything made of? What types of interactions link everything together?) and the types of questions that a ten-year old might ask (Why is the sky blue? Why is the table hard? What is air made of? Why are some things hot and others cold?).

Statistical mechanics peeks at the realm of quarks and gluons and electrons, and then uses insights from this realm to understand the workings of the world on a scale a factor of 1021 larger.

Wilfrid Sellars described philosophy as an attempt to reconcile the manifest image (the universe as it presents itself to us, as a world of people and objects and purposes and values), and the scientific image (the universe as revealed to us by scientific inquiry, empty of purpose, empty of meaning, and animated by simple exact mathematical laws that operate like clockwork). This is what I see as the fundamental goal of statistical mechanics.

What is incredible to me is how elegantly it manages to succeed at this. The universality and simplicity of the equations of statistical mechanics are astounding, given the type of problem we’re dealing with. Physicists would like to say that once they’ve figured out the fundamental equations of physics, then we understand the whole universe. Rutherford said that “all science is either physics or stamp collecting.” But you try to take some laws that tell you how two electrons interact, and then answer questions about how 1023 electrons will behave when all crushed together.

The miracle is that we can do this, and not only can we do it, but we can do it with beautiful, simple equations that are loaded with physical insight.

There’s an even deeper connection to philosophy. Statistical mechanics is about epistemology. (There’s a sense in which all of science is part of epistemology. I don’t mean this. I mean that I think of statistical mechanics as deeply tied to the philosophical foundations of epistemology.)

Statistical mechanics doesn’t just tell us what the world should look like on the scale of balloons and oceans and people. Some of the most fundamental concepts in statistical mechanics are ultimately about our state of knowledge about the world. It contains precise laws telling us what we can know about the universe, what we should believe, how we should deal with uncertainty, and how this uncertainty is structured in the physical laws.

While the rest of physics searches for perfect objectivity (taking the “view from nowhere”, in Nagel’s great phrase), statistical mechanics has one foot firmly planted in the subjective. It is an epistemological framework, a theory of physics, and a piece of beautiful mathematics all in one.

***

Enough gushing.

I want to express some of these deep concepts I’ve been referring to.

First of all, statistical mechanics is fundamentally about probability.

It accepts that trying to keep track of the positions and velocities of 1023 particles all interacting with each other is futile, regardless of how much you know about the equations guiding their motion.

And it offers a solution: Instead of trying to map out all of the particles, let’s course-grain our model of the universe and talk about the likelihood that a given particle is in a given position with a given velocity.

As soon as we do this, our theory is no longer just about the universe in itself, it is also about us, and our model of the universe. Equations in statistical mechanics are not only about external objective features of the world; they are also about properties of the map that we use to describe it.

This is fantastic and I think really under-appreciated. When we talk about the results of the theory, we must keep in mind that these results must be interpreted in this joint way. I’ve seen many misunderstandings arise from failures of exactly this kind, like when people think of entropy as a purely physical quantity and take the second law of thermodynamics to be solely a statement about the world.

But I’m getting ahead of myself.

Statistical mechanics is about probability. So if we have a universe consisting of N = 1080 particles, then we will create a function P that assigns a probability to every possible position for each of these particles at a given moment:

P(x1, y1, z1, x2, y2, z2, …, xN, yN, zN)

P is a function of 3•1080 values… this looks super complicated. Where’s all this elegance and simplicity I’ve been gushing about? Just wait.

The second fundamental concept in statistical mechanics is entropy. I’m going to spend way too much time on this, because it’s really misunderstood and really important.

Entropy is fundamentally a measure of uncertainty. It takes in a model of reality and returns a numerical value. The larger this value, the more coarse-grained your model of reality is. And as this value approaches zero, your model approaches perfect certainty.

Notice: Entropy is not an objective feature of the physical world!! Entropy is a function of your model of reality. This is very very important.

So how exactly do we define the entropy function?

Say that a masked inquisitor tells you to close your eyes and hands you a string of twenty 0s and 1s. They then ask you what your uncertainty is about the exact value of the string.

If you don’t have any relevant background knowledge about this string, then you have no reason to suspect that any letter in the string is more likely to be a 0 than a 1 or vice versa. So perhaps your model places equal likelihood in every possible string. (This corresponds to a probability of ½ • ½ • … • ½ twenty times, or 1/220).

The entropy of this model is 20.

Now your inquisitor allows you to peek at only the first number in the string, and you see that it is a 1.

By the same reasoning, your model is now an equal distribution of likelihoods over all strings that start with 1.

The entropy of this model? 19.

If now the masked inquisitor tells you that he has added five new numbers at the end of your string, the entropy of your new model will be 24.

The idea is that if you are processing information right, then every time you get a single bit of information, your entropy should decrease by exactly 1. And every time you “lose” a bit of information, your entropy should increase by exactly 1.

In addition, when you have perfect knowledge, your entropy should be zero. This means that the entropy of your model can be thought of as the number of pieces of binary information you would have to receive to have perfect knowledge.

How do we formalize this?

Well, your initial model (back when there were 20 numbers and you had no information about any of them) gave each outcome a probability of P = 1/220. How do we get a 20 out of this? Simple!

Entropy = S = log2(1/P)

(Yes, entropy is denoted by S. Why? Don’t ask me, I didn’t invent the notation! But you’ll get used to it.)

We can check if this formula still works out right when we get new information. When we learned that the first number was a 1, half of our previous possibilities disappeared. Given that the others are all still equally likely, our new probabilities for each should double from 1/220 to 1/219.

And S = log2(1/(1/219)) = log2(219) = 19. Perfect!

What if you now open your eyes and see the full string? Well now your probability distribution is 0 over all strings except the one you see, which has probability 1.

So S = log2(1/1) = log2(1) = 0. Zero entropy corresponds to perfect information.

This is nice, but it’s a simple idealized case. What if we only get partial information? What if the masked stranger tells you that they chose the numbers by running a process that 80% of the time returns 0 and 20% of the time returns 1, and you’re fifty percent sure they’re lying?

In general, we want our entropy function to be able to handle models more sophisticated than just uniform distributions with equal probabilities for every event. Here’s how.

We can write out any arbitrary probability distribution over N binary events as follows:

(P1, P2, …, PN)

As we’ve seen, if they were all equal then we would just find the entropy according to previous equation: S = log2(1/P).

But if they’re not equal, then we can just find the weighted average! In other words:

S = mean(log2(1/P)) =∑ Pn log2(1/Pn)

We can put this into the standard form by noting that log(1/P) = -log(P).

And we have our general definition of entropy!

For discrete probabilities: S = – ∑ Plog Pn
For continuous probabilities: S = – ∫ P(x) log P(x) dx

(Aside: Physicists generally use a natural logarithm instead of log2 when they define entropy. This is just a difference in convention: e pops up more in physics and 2 in information theory. It’s a little weird, because now when entropy drops by 1 this means you’ve excluded 1/e of the options, instead of ½. But it makes equations much nicer.)

I’m going to spend a little more time talking about this, because it’s that important.

We’ve already seen that entropy is a measure of how much you know. When you have perfect and complete knowledge, your model has entropy zero. And the more uncertainty you have, the more entropy you have.

You can visualize entropy as a measure of the size of your probability distribution. Some examples you can calculate for yourself using the above equations:

Roughly, when you double the “size” of your probability distribution, you increase its entropy by 1.

But what does it mean to double the size of your probability distribution? It means that there are two times as many possibilities as you initially thought – which is equivalent to you losing one piece of binary information! This is exactly the connection between these two different ways of thinking about entropy.

Third: (I won’t name it yet so as to not ruin the surprise). This is so important that I should have put it earlier, but I couldn’t have because I needed to introduce entropy first.

So I’ve been sneakily slipping in an assumption throughout the last paragraphs. This is that when you don’t have any knowledge about the probability of a set of events, you should act as if all events are equally likely.

This might seem like a benign assumption, but it’s responsible for god-knows how many hours of heated academic debate. Here’s the problem: sure it seems intuitive to say that 0 and 1 are equally likely. But that itself is just one of many possibilities. Maybe 0 comes up 57% of the time, or maybe 34%. It’s not like you have any knowledge that tells you that 0 and 1 are objectively equally likely, so why should you favor that hypothesis?

Statistical mechanics answers this by just postulating a general principle: Look at the set of all possible probability distributions, calculate the entropy of each of them, and then choose the one with the largest entropy.

In cases where you have literally no information (like our earlier inquisitor-string example), this principle becomes the principle of indifference: spread your credences evenly among the possibilities. (Prove it yourself! It’s a fun proof.)

But as a matter of fact, this principle doesn’t only apply to cases where you have no information. If you have partial or incomplete information, you apply the exact same principle by looking at the set of probability distributions that are consistent with this information and maximizing entropy.

This principle of maximum entropy is the foundational assumption of statistical mechanics. And it is a purely epistemic assumption. It is a normative statement about how you should rationally divide up your credences in the absence of information.

Said another way, statistical mechanics prescribes an answer to the problem of the priors, the biggest problem haunting Bayesian epistemologists. If you want to treat your beliefs like probabilities and update them with evidence, you have to have started out with an initial level of belief before you had any evidence. And what should that prior probability be?

Statistical mechanics says: It should be the probability that maximizes your entropy. And statistical mechanics is one of the best-verified and most successful areas of science. Somehow this is not loudly shouted in the pages of every text on Bayesianism.

There’s much more to say about this, but I’ll set it aside for the moment.

***

So we have our setup for statistical mechanics.

  1. Coarse-grain your model of reality by constructing a probability distribution over all possible microstates of the world.
  2. Construct this probability distribution according to the principle of maximum entropy.

Okay! So going back to our world of N = 1080 particles jostling each other around, we now know how to construct our probability distribution P(x1, …, xN). (I’ve made the universe one-dimensional for no good reason except to pretty it up – everything I say follows exactly the same if I left it in 3D. I’ll also start writing the set of all N coordinates as X, again for prettiness.)

What probability distribution maximizes S = – ∫ P logP dX?

We can solve this with the method of Lagrange multipliers:

P [ P logP + λP ] = 0,
where λ is chosen to satisfy: ∫ P dX = 1

This is such a nice equation and you should do yourself a favor and learn it, because I’m not going to explain it (if I explained everything, this post would become a textbook!).

But it essentially maximizes the value of S, subject to the constraint that the total probability is 1. When we solve it we find:

P(x1, …, xN) = 1/VN, where V is the volume of the universe

Remember earlier when I said to just wait for the probability equation to get simple?

Okay, so this is simple, but it’s also not very useful. It tells us that every particle has an equal probability of being in any equally sized region of space. But we want to know more. Like, are the higher energy particles distributed differently than the lower energy?

The great thing about statistical mechanics is that if you want a better model, you can just feed in more information to your distribution.

So let’s say we want to find the probability distribution, given two pieces of information: (1) we know the energy of every possible configuration of particles, and (2) the average total energy of the universe is fixed.

That is, we have a function E(x1, …, xN) that tells us energies, and we know that the total energy E = ∫ P(x1, …, xN)•E(x1, …, xN) dX is fixed.

So how do we find our new P? Using the same method as before:

P [ P logP + λP + βEP ] = 0,
where λ is chosen to satisfy: ∫ P dX = 1
and β is chosen to satisfy: ∫ P•E dX = E

This might look intimidating, but it’s really not. I’ll write out how to solve this:

P [P logP + λP + βEP) ]
= logP + 1 + λ + βE = 0
So P = e-(1+λ) • e-βE
Renaming our first term, we get:
P(X) = 1/Z • e-βE(X)

This result is called the Boltzmann distribution, and it’s one of the incredibly important must-know equations of statistical mechanics. The amount of physics you can do with just this one equation is staggering. And we got it by just adding conservation of energy to the principle of maximum entropy.

Maybe you’re disturbed by the strange new symbols Z and β that have appeared in the equation. Don’t fear! Z is simply a normalization constant: it’s there to keep the probability of the total distribution at 1. We can calculate it explicitly:

Z = ∫ e-βE dX

And β is really interesting. Notice that β came into our equations because we had to satisfy this extra constraint about a fixed total energy. Is there some nice physical significance to this quantity?

Yes, very much so. β is what we humans like to call ‘temperature’, or more precisely, inverse temperature.

β = 1/T

While avoiding the math, I can just say the following: Temperature is defined to be the change in the energy of a system when you change its entropy a little bit. (This definition is much more general than the special case definition of temperature as average kinetic energy)

And it turns out that when you manipulate the above equations a little bit, you see that ∂SE = 1/β = T.

So we could rewrite our probability distribution as follows:

P(X) = 1/Z • e-E(X)/T

Feed in your fundamental laws of physics to the energy function, and you can see the distribution of particles across the universe!

Let’s just look at the basic properties of this equation. First of all, we can see that the larger E(X)/T becomes, the smaller the probability of a particle being in X becomes. This corresponds both to particles scattering away from high-energy regions and to less densely populated systems having lower temperatures.

And the smaller E(X)/T, the larger P(X). This corresponds to particles densely clustering in low-energy areas, and dense clusters of particles having high temperatures.

There are too many other things I could say about this equation and others, and this post is already way too long. I want to close with a final note about the nature of entropy.

I said earlier that entropy is entirely a function of your model of reality. The universe doesn’t have an entropy. You have a model of the universe, and that model has an entropy. Regardless of what physical reality is like, if I hand you a model, you can tell me its entropy.

But at the same time, models of reality are linked to the nature of the physical world. So for instance, a very simple and predictable universe lends itself to very precise and accurate models of reality, and thus to lower-entropy models. And a very complicated and chaotic universe lends itself to constant loss of information and low-accuracy models, and thus to higher entropy.

It is this second world that we live in. Due to the structure of the universe, information is constantly being lost to us at enormous rates. Systems that start out simple eventually spiral off into chaotic and unpredictable patterns, and order in the universe is only temporary.

It is in this sense that statements about entropy are statements about physical reality. And it is for this reason that entropy always increases.

In principle, an omnipotent and omniscient agent could track the positions of all particles at all times, and this agent’s model of the universe would be always perfectly accurate, with entropy zero. For this agent, the entropy of the universe would never rise.

And yet for us, as we look at the universe, we seem to constantly and only see entropy-increasing interactions.

This might seem counterintuitive or maybe even impossible to you. How could the entropy rise to one agent and stay constant for another?

Imagine an ice cube sitting out on a warm day. The ice cube is in a highly ordered and understandable state. We could sit down and write out a probability distribution, taking into account the crystalline structure of the water molecules and the shape of the cube, and have a fairly low-entropy and accurate description of the system.

But now the ice cube starts to melt. What happens? Well, our simple model starts to break down. We start losing track of where particles are going, and having trouble predicting what the growing puddle of water will look like. And by the end of the transition, when all that’s left is a wide spread-out wetness across the table, our best attempts to describe the system will inevitably remain higher-entropy than what we started with.

Our omniscient agent looks at the ice cube and sees all the particles exactly where they are. There is no mystery to him about what will happen next – he knows exactly how all the water molecules are interacting with one another, and can easily determine which will break their bonds first. What looked like an entropy-increasing process to us was an entropy-neutral process to him, because his model never lost any accuracy.

We saw the puddle as higher-entropy, because we started doing poorly at modeling it. And our models started performing poorly, because the system got too complex for our models.

In this sense, entropy is not just a physical quantity, it is an epistemic quantity. It is both a property of the world and a property of our model of the world. The statement that the entropy of the universe increases is really the statement that the universe becomes harder for our models to compute over time.

Which is a really substantive statement. To know that we live in the type of universe that constantly increases in entropy is to know a lot about the way that the universe operates.

More reading here if you’re interested!