The method of reasoning illustrated here is somewhat reminiscent of Laplace’s “principle of indifference.” However, we are concerned here with indifference between problems, rather than indifference between events. The distinction is essential, for indifference between events is a matter of intuitive judgment on which our intuition often fails even when there is some obvious geometrical symmetry (as Bertrand’s paradox shows).
E. T. Jaynes
I’ve previously written praise of the principle of maximum entropy as a prior-setting method that is justified on the basis of a very minimal and highly intuitive set of epistemic features.
But there’s an even better technique for prior-setting, one that is justified on incredibly fundamental grounds. This technique applies only in rare cases, but it is immensely powerful when it does. It’s the principle of transformation groups.
Here is the single assumption from which the principle arises:
“In problems where we have the same prior information, we should assign the same prior probabilities.” (Jaynes’ wording)
This is simple to the point of seeming almost tautological. So what can we do with it?
We’ll start with one of the simplest applications of transformation groups. Suppose that somebody gives you the following information:
I = “This coin will land either tails or heads.”
Now you want to say what the following probabilities should be:
P(This coin will land tails | I) = p
P(This coin will land heads | I) = q
Intuitively, it seems obvious to us that absent any other information, we should assign equal probabilities to these. But why? Is there a principled reason for assuming that the coin is a fair coin? Or is this just a presumption that smuggles into the problem our background knowledge that most coins are fair?
The method of transformation groups gives us a principled reason. It says to rephrase the problem as follows:
I’ = “This coin will land either heads or tails.”
Now, the only difference between our initial problem and our new problem is that every “heads” has been replaced with “tails” and every “tails” with “heads”. Since our prior-setting procedure found that P(This coin will land tails | I) = p in the first problem, it should now find P(This coin will land heads | I’) = p in this new one. This is required for any consistent prior-setting procedure! If the problem changes by just switching places of labels, then the priors should change in the exact same way. This means that:
P(This coin will land heads | I’) = p
P(This coin will land tails | I’) = q
But clearly, I = I’; the logical operator “OR” is symmetric! Which means that:
P(This coin will land heads | I’) = P(This coin will land heads | I)
That is, p = q. And since the coin must land one way or the other, p + q = 1, so this is only possible if p = q = ½!
This is simple, but beautiful. The principle tells us that the only logical way to set our priors in this case is evenly – anything else would be either logically inconsistent, or assuming extra information that breaks the symmetry between heads and tails. It goes from logical symmetry to probability symmetry!
Finding these symmetries is what the method of transformation groups is all about. More generally, one can represent a choice between N different possibilities as the statement:
I = “Possibility 1 or possibility 2 or … possibility N”
But this is symmetric with:
I’ = “Possibility 2 or possibility 1 or … possibility N”
As well as all other orderings.
By the exact same argument as above, your prior-setting procedure is required by logical consistency to evenly distribute credences across the N possibilities. So for each k from 1 to N, P(Possibility k | I) = 1/N.
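To make the relabeling argument concrete, here is a minimal Python sketch. It searches a coarse grid of candidate priors (probabilities that are multiples of 1/denom) and keeps only those that are unchanged by every permutation of the labels. The function names `is_invariant` and `invariant_priors`, and the grid itself, are illustrative assumptions rather than anything from Jaynes.

```python
from fractions import Fraction
from itertools import permutations, product


def is_invariant(prior):
    """True if the prior is unchanged by every relabeling of the possibilities."""
    n = len(prior)
    return all(
        tuple(prior[i] for i in perm) == tuple(prior)
        for perm in permutations(range(n))
    )


def invariant_priors(n, denom):
    """List every prior over n possibilities, with probabilities that are
    multiples of 1/denom, that survives all relabelings."""
    found = []
    for counts in product(range(denom + 1), repeat=n):
        if sum(counts) != denom:
            continue  # not normalized, so not a probability distribution
        prior = tuple(Fraction(k, denom) for k in counts)
        if is_invariant(prior):
            found.append(prior)
    return found


coin = invariant_priors(2, 10)    # the heads/tails case
three = invariant_priors(3, 12)   # three interchangeable possibilities
print(coin)
print(three)
```

On this grid the search leaves exactly one survivor in each case: the uniform assignment. The grid is only a finite illustration; the general result follows from the symmetry argument itself.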
The method of transformation groups can also be applied to continuous variables, where finding the right set of priors can be a lot less intuitive. You do so by noting different types of symmetries for different types of parameters.
For instance, a location parameter is one that serves to merely shift a probability distribution over an observable quantity, without reshaping the distribution. We can formally express this by saying that for a location parameter µ, the distribution over x depends only on the difference x – µ:
p(x | µ) = f(x – µ)
For such parameters, it must be the case that the prior distribution over them is similarly symmetric over translational shifts:
For all ∆, p(µ) = p(µ + ∆)
So p(µ) = c, for some constant c
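The step from the translation condition to a constant prior can be checked numerically. The sketch below is only a spot check under assumed names (`flat_prior`, `gaussian_prior`, `is_translation_invariant`) and an arbitrary handful of test points; it compares a constant prior with a standard normal prior, chosen here purely as a counterexample.

```python
import math


def flat_prior(mu, c=1.0):
    """Constant prior density: the candidate solution p(mu) = c."""
    return c


def gaussian_prior(mu):
    """Standard normal density, used only as a counterexample."""
    return math.exp(-0.5 * mu**2) / math.sqrt(2 * math.pi)


def is_translation_invariant(prior, mus, shifts, tol=1e-12):
    """Check p(mu) == p(mu + delta) at every listed point and shift."""
    return all(
        abs(prior(mu) - prior(mu + delta)) <= tol
        for mu in mus
        for delta in shifts
    )


mus = [-2.0, 0.0, 1.5]
shifts = [-3.0, 0.5, 10.0]
print(is_translation_invariant(flat_prior, mus, shifts))
print(is_translation_invariant(gaussian_prior, mus, shifts))
```

The constant prior passes and the Gaussian fails, as expected: any density with a preferred location breaks the symmetry. A finite check like this cannot prove the general claim, which follows from requiring the condition for every shift ∆.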
Another common category of parameters is scale parameters. These are parameters that serve to rescale probability distributions without reshaping them. Formally:
p(x | σ) = (1/σ) · g(x / σ)
For this symmetry, the requirement for consistency is:
For all s > 0, p(σ) = (1/s) · p(σ / s)
So, p(σ) = c/σ for some constant c; that is, p(σ) ∝ 1/σ
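The scale condition can be spot-checked the same way. This is a minimal sketch with assumed names (`reciprocal_prior`, `constant_prior`, `exponential_prior`, `is_scale_invariant`) and arbitrary test values; it tests whether p(σ) = (1/s) · p(σ / s) holds for a prior of the form c/σ and for two counterexamples picked only for illustration.

```python
import math


def reciprocal_prior(sigma, c=1.0):
    """Candidate solution: density proportional to 1/sigma."""
    return c / sigma


def constant_prior(sigma):
    """Flat density in sigma, used as a counterexample."""
    return 1.0


def exponential_prior(sigma):
    """Density exp(-sigma) on sigma > 0, used as a counterexample."""
    return math.exp(-sigma)


def is_scale_invariant(prior, sigmas, scales, rel_tol=1e-12):
    """Check p(sigma) == (1/s) * p(sigma / s) at every listed point and scale."""
    return all(
        math.isclose(prior(sigma), prior(sigma / s) / s, rel_tol=rel_tol)
        for sigma in sigmas
        for s in scales
    )


sigmas = [0.1, 1.0, 7.0]
scales = [0.5, 2.0, 10.0]
print(is_scale_invariant(reciprocal_prior, sigmas, scales))
print(is_scale_invariant(constant_prior, sigmas, scales))
print(is_scale_invariant(exponential_prior, sigmas, scales))
```

Only the reciprocal form passes. Note that a flat prior in σ fails here: the symmetry appropriate to a scale parameter is different from the symmetry appropriate to a location parameter, so the two kinds of parameter get different priors.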
In summary, by carefully analyzing the symmetries of the background information you have, you can extract requirements for how to set your prior distribution that are mandated on pain of logical inconsistency!