Cox’s theorem

A very original and thoroughgoing development of the theory of probability, which does not depend on the concept of frequency in an ensemble, has been given by Keynes. In his view, the theory of probability is an extended logic, the logic of probable inference. Probability is a relation between a hypothesis and a conclusion, corresponding to the degree of rational belief and limited by the extreme relations of certainty and impossibility. Classical deductive logic, which deals with these limiting relations only, is a special case in this more general development.

R. T. Cox
Probability, Frequency, and Reasonable Expectation
(Yep, that Keynes! He was an influential early Bayesian thinker as well as a famous economist)

Cox’s famous theorem says that if your way of reasoning is not in some sense isomorphic to Bayesianism, then you are violating one of the following ten basic principles. I’ll derive the theorem in this post.

Logic

  1. ~~a = a
  2. ab = ba
  3. (ab)c = a(bc)
  4. aa = a
  5. ~(ab) = ~a or ~b
  6. a(a or b) = a

Reasoning under uncertainty

  7. Degrees of plausibility are represented by real numbers: {b | a}
  8. {bc | a} = F({b | a}, {c | ab})
  9. {~b | a} = G({b | a})
  10. F and G are monotonic.

The first six, I think, need no introduction. I write “a and b” as “ab” for aesthetics.

The next four extend logic beyond the realm of the perfectly certain, and relate to how we should reason in the presence of uncertainty – how we should reason about degrees of plausibility. For this step, we need a new notation: a way to represent not the truth of a proposition a, but its plausibility. We do this with the {} notation: {a | b} = the plausibility of a given that b is known to be true.

I’ll make some brief notes about each of 7 through 10.

Number 7 perhaps sounds strange – plausibilities are states of belief, not numbers. But you can make sense of this in two ways: first, we can consider this a simple model of plausibilities, in which we are merely mirroring the properties of plausibilities in the structure of the system we set up around how to manipulate numbers. And second, if one is to design a robot that reasons about the world, it isn’t crazy to think about programming it to represent beliefs about the plausibilities of propositions as numbers.

#7 also contains an assumption of continuity – that it’s not the case that there are discontinuous jumps in the plausibility of propositions. Said another way, if two different real numbers represent two different degrees of plausibility, then every possible value between them should also represent a degree of plausibility.

Number 8 is about relevance. It says that the plausibility that b and c are both true given some background information a, is only dependent on (1) the plausibility that b is true given a and (2) the plausibility that c is true given a and b.

If you want to know how likely it is that b and c are both true given a, you can break the process down into two steps. First you determine how likely it is that b is true, given your background information. And second, you determine how likely it is that c is true, given both your background information and the truth of b.

In other words, all that you need to know if you want to know {bc | a} are {b | a} and {c | ab}. For example, say that you want the plausibility of Usain Bolt winning the 100 meter dash and also running an extra lap around the track at the end of the dash. The only pieces of information you need for this are (1) the plausibility of Usain Bolt winning the 100 meter dash, and (2) the plausibility of him running an extra lap at the end of the race, given that he won the race. And of course, each of these plausibilities is also conditional on all of the background information you have about the situation.

We give the name F to the function that details precisely how we determine {bc | a}.

Number 9 says that all we need to determine the plausibility of a proposition being false is the plausibility of the proposition being true. The function that details how we determine one from the other is named G.

And finally, Number 10 says that the plausibility relationships described in 8 and 9 are in some sense simple. For instance, if b becomes less plausible and ~b becomes more plausible, then if b becomes even less plausible, ~b should not suddenly become less plausible. If a change in the plausibility of a proposition a makes the plausibility of another proposition b change in a certain direction, then a greater change in the plausibility of a in the same direction should not result in a reversal of the direction of change of the plausibility of b.

There is an additional implicit assumption that is almost too obvious to be stated: If two propositions are just different phrasings of the same state of knowledge, then you should have the same plausibility in both of them. This is the basic requirement of consistency. It says, for instance, that “a and b” is exactly as plausible as “b and a”.

***

Now we can derive Bayesianism!

First step: conditional probabilities. We use the associativity of “and”:

{bcd | a} = F( {b | a}, {cd | ab} ) = F( {b | a}, F( {c | ab}, {d | abc} ) )
{bcd | a} = F( {bc | a}, {d | abc} ) = F( F( {b | a}, {c | ab} ), {d | abc} )

So F(x, F(y, z)) = F(F(x, y), z)

Any monotonic function that satisfies this functional equation is isomorphic to ordinary multiplication on the interval [0, 1].

In other words, there exists a function W such that:

W(F(x, y)) = W(x) · W(y)

From which we can see:

W({bc | a}) = W({b | a}) · W({c | ab})

Which is exactly the form of conjunction in ordinary probability theory!

This means that any form of reasoning about the plausibility of conjunctions of propositions is either violating one of the 10 axioms, or is equivalent to probability theory up to isomorphism.

So if somebody walks up to you and presents to you an algorithm for computing the plausibility of the conjunction of two propositions, and you know that they are following the above ten rules, then you are guaranteed to be able to find some function that takes in their algorithm and translates into ordinary Bayesianism.

This translation function is W! Notice that although W looks a lot like ordinary probability, it is a different thing entirely. For instance, if somebody is already reasoning with ordinary probability theory, then {b | a} = P(b | a). To convert {b | a} into P(b | a), then, W need not do anything! So W({b | a}) = {b | a} = P(b | a), or W(x) = x, which is clearly different from the probability function.
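To make the role of W a bit more concrete, here is a minimal Python sketch. It assumes a hypothetical agent who stores plausibilities on the scale u = 2^p – 1 instead of as probabilities p. The scale and the names `F_agent` and `W` are my own illustrative choices, not part of Cox's derivation.

```python
import math

def F_agent(u, v):
    """The hypothetical agent's conjunction rule, written on its own scale."""
    return 2.0 ** (math.log2(1.0 + u) * math.log2(1.0 + v)) - 1.0

def W(u):
    """Translation function: maps the agent's plausibility u into [0, 1]."""
    return math.log2(1.0 + u)

# Three arbitrary plausibilities on the agent's scale (which also runs from 0 to 1).
x, y, z = 0.9, 0.5, 0.2

# Associativity: F(x, F(y, z)) should equal F(F(x, y), z).
associativity_gap = abs(F_agent(x, F_agent(y, z)) - F_agent(F_agent(x, y), z))

# W should turn the agent's rule into ordinary multiplication.
product_gap = abs(W(F_agent(x, y)) - W(x) * W(y))
```

Both gaps come out at the level of floating-point rounding. The agent's rule looks nothing like multiplication on its own scale, but W reveals that it is multiplication in disguise.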

Next, we reveal the nature of the function G.

The first step is easy:

{~~b | a} = G(G({b | a}))
So G(G(x)) = x

This means that G is an involution – a function that is its own inverse.

Next, we use the commutativity of “and”:

W({bc | a}) = W({b | a}) · W({c | ab})
= W({b | a}) · G( W({~c | ab}) )
= W({b | a}) · G( W({b~c | a}) / W({b | a}) )

So W({cb | a}) = W({c | a}) · G( W({~bc | a}) / W({c | a}) )

Now if we let c = ~(bd) for some new statement d, and rename W({b | a}) = x and W({c | a}) = y, we find:

x G( G(y) / x ) = y G( G(x) / y )

The only functions that satisfy this functional equation as well as the earlier involution equation are the following:

G(x) = (1 – x^m)^(1/m), for any positive m

With some algebraic manipulation, we see that this is equivalent to:

W({a | b})^m + W({~a | b})^m = 1
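As a sanity check, here is a small Python sketch that tests the family G(x) = (1 – x^m)^(1/m) against both functional equations on a grid of points. The grid, the chosen values of m, and the helper name `max_violation` are assumptions made for illustration. This is a numerical spot check, not a proof of uniqueness.

```python
def G(x, m):
    """Candidate negation rule G(x) = (1 - x**m)**(1/m) on [0, 1]."""
    # Clamp at zero so rounding error never produces a negative base.
    return max(0.0, 1.0 - x ** m) ** (1.0 / m)

def max_violation(m, grid):
    """Largest violation of the two functional equations over the grid."""
    worst = 0.0
    for x in grid:
        # Involution: G(G(x)) = x
        worst = max(worst, abs(G(G(x, m), m) - x))
        for y in grid:
            # Both sides only make sense when G(y)/x and G(x)/y lie in [0, 1].
            if G(y, m) <= x and G(x, m) <= y:
                lhs = x * G(G(y, m) / x, m)
                rhs = y * G(G(x, m) / y, m)
                worst = max(worst, abs(lhs - rhs))
    return worst

grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
violations = {m: max_violation(m, grid) for m in (0.5, 1.0, 2.0, 3.0)}
```

Every entry of `violations` should be tiny. The reason is that both sides of the second equation reduce to (x^m + y^m – 1)^(1/m), which is symmetric in x and y.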

This equation almost perfectly represents the normalization rule: that the total probability must sum to 1! It seems mildly inconvenient that this holds true for the function W^m rather than W. If we knew that m had to equal 1, then we’d have a perfect translation function both for conditional probabilities and for normalization… But if we look back at our conditional probability finding, we notice that it can be equivalently represented in terms of W^m instead of W!

W({bc | a}) = W({b | a}) · W({c | ab})
W({bc | a})^m = W({b | a})^m · W({c | ab})^m

Now we just define one last function Q = W^m, and we’re done!

Q takes in any system of reasoning that obeys the starting ten rules, and reveals it to be equivalent to ordinary probability theory.

The rest of probability theory can be shown to result from these basic axioms (we’ll drop the brackets now, and will call Q({a | b}) by its proper name: P(a | b)).

P(certainty) = 1
P(impossibility) = 0
P(bc | a) + P(b~c | a) = P(b | a)
P(a or b | c) = P(a | c) + P(b | c) – P(ab | c)
And so on.
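Here is a quick Python sketch that checks two of these rules on a toy example. The joint weights over the truth values of a and b are made-up numbers, and everything is implicitly conditional on some background c.

```python
from fractions import Fraction

# Hypothetical weights for the four possible worlds (a, b), all conditional on c.
weights = {(True, True): 1, (True, False): 2, (False, True): 3, (False, False): 4}
total = sum(weights.values())
P = {world: Fraction(w, total) for world, w in weights.items()}

def prob(event):
    """Probability of the set of worlds where event(a, b) is true."""
    return sum(p for world, p in P.items() if event(*world))

p_a = prob(lambda a, b: a)
p_b = prob(lambda a, b: b)
p_ab = prob(lambda a, b: a and b)
p_a_not_b = prob(lambda a, b: a and not b)
p_a_or_b = prob(lambda a, b: a or b)
```

With these numbers, `p_ab + p_a_not_b` equals `p_a`, and `p_a_or_b` equals `p_a + p_b - p_ab`, exactly as the rules above require.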

This derivation of probability theory is great because it isn’t limited by the ordinary frequency interpretations of probability theory, in which a probability is defined to be a limit of a ratio of an infinite number of experimental results. This is not only ugly theoretically, but leaves us puzzled as to how to talk about the probabilities of one-shot events or of hypotheses that can’t be directly translated into empirical results.

The definition of probability invoked here is infinitely deeper. Instead of frequencies, probabilities are defined according to fundamental principles of normative epistemology. Any agent that reasons consistently according to a few basic maxims will be reasoning in a way that is functionally identical to probability theory.

And probabilities here can be assigned to any propositions – not just those that refer to empirically measurable events that are repeatable ad infinitum. They represent a normative rational degree of certainty that one must possess if one is to reason consistently!

Maximum Entropy and Bayes

The original method of Maximum Entropy, MaxEnt, was designed to assign probabilities on the basis of information in the form of constraints. It gradually evolved into a more general method, the method of Maximum relative Entropy (abbreviated ME), which allows one to update probabilities from arbitrary priors unlike the original MaxEnt which is restricted to updates from a uniform background measure.

The realization that ME includes not just MaxEnt but also Bayes’ rule as special cases is highly significant. First, it implies that ME is capable of reproducing every aspect of orthodox Bayesian inference and proves the complete compatibility of Bayesian and entropy methods. Second, it opens the door to tackling problems that could not be addressed by either the MaxEnt or orthodox Bayesian methods individually.

Giffin and Caticha
https://arxiv.org/pdf/0708.1593.pdf

I want to heap a little more praise on the principle of maximum entropy. Previously I’ve praised it as a solution to the problem of the priors – a way to decide what to believe when you are totally ignorant.

But it’s much much more than that. The procedure by which you maximize entropy is not only applicable in total ignorance, but is also applicable in the presence of partial information! So not only can we calculate the maximum entropy distribution given total ignorance, but we can also calculate the maximum entropy distribution given some set of evidence constraints.

That is, the principle of maximum entropy is not just a solution to the problem of the priors – it’s an entire epistemic framework in itself! It tells you what you should believe at any given moment, given any evidence that you have. And it’s better than Bayesianism in the sense that the question of priors never comes up – we maximize entropy when we don’t have any evidence just like we do when we do have evidence! There is no need for a special case study of the zero-evidence limit.

But a natural question arises – if the principle of maximum entropy and Bayes’ rule are both self-contained procedures for updating your beliefs in the face of evidence, are these two procedures consistent?

Anddd the answer is, yes! They’re perfectly consistent. Bayes’ rule leads you from one set of beliefs to the set of beliefs that are maximally uncertain under the new information you receive.

This post will be proving that Bayes’ rule arises naturally from maximizing entropy after you receive evidence.

But first, let me point out that we’re making a slight shift in our definition of entropy, as suggested in the quote I started this post with. Rather than maximizing the entropy S(P) = – ∫ P log(P) dx, we will maximize the relative entropy:

Srel(P, Pold) = – ∫ P log(P / Pold) dx.

The relative entropy is much more general than the ordinary entropy – it serves as a way to compare entropies of distributions, and gives a simple way to talk about the change in uncertainty from a previous distribution to a new one. Intuitively, its magnitude is the additional information that is required to specify P, once you’ve already specified Pold. You can think of it in terms of surprisal: –Srel(P, Q) is how much more surprised you will be, on average, if P is true and you believe Q than if P is true and you believe P.
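For a feel of how this quantity behaves, here is a minimal Python sketch of the discrete analogue, with sums in place of integrals. The three example distributions are arbitrary numbers chosen for illustration.

```python
import math

def relative_entropy(p, p_old):
    """Discrete S_rel(P, P_old) = -sum_i p_i * log(p_i / p_old_i)."""
    # Terms with p_i = 0 contribute nothing, by the usual convention.
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, p_old) if pi > 0)

def entropy(p):
    """Ordinary discrete entropy S(P) = -sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p_old = [0.5, 0.3, 0.2]
p_new = [0.7, 0.2, 0.1]
uniform = [1 / 3, 1 / 3, 1 / 3]

s_same = relative_entropy(p_old, p_old)     # beliefs unchanged
s_changed = relative_entropy(p_new, p_old)  # beliefs moved away from p_old

# Against a uniform reference, S_rel is the ordinary entropy shifted by log(N).
shift_gap = abs(relative_entropy(p_new, uniform) - (entropy(p_new) - math.log(3)))
```

Here `s_same` is zero and `s_changed` is negative, so the relative entropy is maximized by not moving at all. The last check shows that with a uniform reference distribution, maximizing relative entropy is the same as maximizing ordinary entropy.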

You might be concerned that this function no longer has the nice properties of entropy that we discussed earlier – the only possible function for consistently representing uncertainty. But these worries aren’t warranted. If some set of initial constraints gives Pold as the maximum-entropy distribution, then the function that maximizes relative entropy with just the new constraints will be the same as the function that maximizes entropy with the new constraints and the value of your prior distribution as an additional constraint.

Okay, so from now on whenever I talk about entropy, I’m talking about relative entropy. I’ll just denote it by S as usual, instead of writing out Srel every time. We’ll now prove that the prescribed change in your beliefs upon receiving the results of an experiment is the same under Bayesian conditionalization as it is under maximum entropy.

Say that our probability distribution is over the possible values of some parameter A and the possible results of an experiment that will tell us the value of X. Thus our initial model of reality can be written as:

Pinit(A = a, X = x), and
Pinit(A = a) = ∫ dx Pinit(A = a | X = x) Pinit(X = x)

Which we’ll rewrite for ease of notation as:

Pinit(a, x), and
Pinit(a) = ∫ Pinit(a | x) Pinit(x) dx

Ordinary Bayesian conditionalization says that when we receive the information that the experiment returned the result X = x’, we update our probabilities as follows:

Pnew(a) = Pinit(a | x’)

What does the principle of maximum entropy say to do? It prescribes the following algorithm:

Maximize the value of S = – ∫ da dx P(a, x) log( P(a,x) / Pinit(a, x) )
with the following constraints:
Constraint 1: ∫ da dx P(a, x) = 1
Constraint 2: P(x) = δ(x – x’)

Constraint 2 represents the experimental information that our new probability distribution over X is zero everywhere except for at X = x’, and that we are certain that the value of X is x’. Notice that it is actually an infinite number of constraints – one for each value of X.

We will rewrite Constraint 2 so that it is of the same form as the entropy function and the first constraint:

Constraint 2: ∫ da P(a, x) = δ(x – x’)

The method of Lagrange multipliers tells us how to solve this constrained maximization problem!

First, define a new quantity A as follows:

A = S + Constraint 1 + Constraint 2
= – ∫ da dx P log(P/Pinit) + α · [ ∫ da dx P – 1 ] + ∫ dx β(x) · [ ∫ da P – δ(x – x’) ]

Now we solve!

∆A = 0
(δ/δP) ∫ da dx [ – P log(P/Pinit) + α P + β(x) P ] = 0
(∂/∂P) [ – P log P + P log Pinit + α P + β(x) P ] = 0
–log Pnew – 1 + log Pinit + α + β(x) = 0
Pnew(a, x) = Pinit(a, x) · e^β(x) / Z

Z = e^(1 – α) is our normalization constant, and we can find β(x) by applying Constraint 2:

Constraint 2: ∫ da P(a, x) = δ(x – x’)
∫ da Pinit(a, x) · e^β(x) / Z = δ(x – x’)
Pinit(x) · e^β(x) / Z = δ(x – x’)

And finally, we can plug in:

Pnew(a, x) = Pinit(a, x) · e^β(x) / Z
= Pinit(a | x) · Pinit(x) · e^β(x) / Z
= Pinit(a | x) · δ(x – x’)
So Pnew(a) = Pinit(a | x’)

Exactly the same as Bayesian conditionalization!!
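If you want to see this happen numerically, here is a rough Python sketch of a discrete version of the argument. The 3×3 prior table and the observed index are made-up numbers. Instead of solving the Lagrange equations, the sketch compares the Bayesian posterior against randomly generated distributions that satisfy the same constraints.

```python
import math
import random

# Hypothetical discrete prior P_init(a, x): rows index a, columns index x.
p_init = [
    [0.10, 0.20, 0.05],
    [0.05, 0.10, 0.15],
    [0.15, 0.05, 0.15],
]
x_obs = 1  # index of the observed result x'

def relative_entropy_on_slice(q):
    """S for a joint with all of its mass on x = x', where P(a, x') = q[a]."""
    return -sum(qa * math.log(qa / row[x_obs]) for qa, row in zip(q, p_init) if qa > 0)

# Bayesian conditionalization: P_new(a) = P_init(a | x')
p_x = sum(row[x_obs] for row in p_init)
bayes_posterior = [row[x_obs] / p_x for row in p_init]
s_bayes = relative_entropy_on_slice(bayes_posterior)

# Random alternatives that also satisfy Constraint 1 and Constraint 2.
rng = random.Random(0)
best_alternative = -math.inf
for _ in range(10_000):
    raw = [rng.random() for _ in p_init]
    q = [r / sum(raw) for r in raw]
    best_alternative = max(best_alternative, relative_entropy_on_slice(q))
```

None of the random alternatives should beat `s_bayes`, which works out to log Pinit(x’). Random search is only a spot check, of course, while the Lagrange multiplier argument covers every distribution at once.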

What’s so great about this is that the principle of maximum entropy is an entire theory of normative epistemology in its own right, and it’s equivalent to Bayesianism, AND it has no problem of the priors!

If you’re a Bayesian, then you know what to do when you encounter new evidence, as long as you already have a prior in hand. But when somebody asks you how you should choose the prior that you have… well then you’re stumped, or have to appeal to some other prior-setting principle outside of Bayes’ rule.

But if you ask a maximum-entropy theorist how they got their priors, they just answer: “The same way I got all of my other beliefs! I just maximize my uncertainty, subject to the information that I possess as constraints. I don’t need any special consideration for the situation in which I possess no information – I just maximize entropy with no constraints at all!”

I think this is wonderful. It’s also really aesthetic. The principle of maximum entropy says that you should be honest about your uncertainty. You should choose your beliefs in such a way as to ensure that you’re not pretending to know anything that you don’t know. And there is a single unique way to do this – by maximizing the function S = – ∫ P log P.

Any other distribution you might choose represents a decision to pretend that you know things that you don’t know – and maximum entropy says that you should never do this. It’s an epistemological framework built on the virtue of humility!

Advanced two-envelopes paradox

Yesterday I described the two-envelopes paradox and laid out its solution. Yay! Problem solved.

Except that it’s not. Because I said that the root of the problem was an improper prior, and when we instead use a proper prior, any proper prior, we get the right result. But we can propose a variant of the two envelopes problem that gives a proper prior, and still mandates infinite switching.

Here it is:

In front of you are two envelopes, each containing some unknown amount of money. You know that one of the envelopes has twice the amount of money of the other, but you’re not sure which one that is and can only take one of the two.

In addition, you know that the envelopes were stocked by a mad genius according to the following procedure: He randomly selects an integer n ≥ 0 with probability ⅓ (⅔)^n, then stocks the smaller envelope with $2^n and the larger with double this amount.

You have picked up one of the envelopes and are now considering if you should switch your choice.

Let’s verify quickly that the mad genius’s procedure for selecting the amount of money makes sense:

Total probability = ∑_n ⅓ (⅔)^n = ⅓ · 3 = 1

Okay, good. Now we can calculate the expected value.

You know that the envelope that you’re holding contains one of the following amounts of money: ($1, $2, $4, $8, …).

First let’s consider the case in which it contains $1. If so, then you know that your envelope must be the smaller of the two, since there is no $0.50 envelope. So if your envelope contains $1, then you are sure to gain $1 by switching.

Now let’s consider every other case. If the amount you’re holding is $2^n for some n ≥ 1, then either the genius drew n and you picked the smaller envelope, or he drew n – 1 and you picked the larger one. Since you were equally likely to pick either envelope, the relative weights of these two cases are ⅓ (⅔)^n and ⅓ (⅔)^(n–1). You are $2^n better off if you have the smaller envelope and switch, and are $2^(n–1) worse off if you initially had the larger envelope and switch.

So your change in expected value by switching instead of staying is, up to a positive normalization factor:

∆EU = $ ⅓ (⅔)^n · 2^n – $ ⅓ (⅔)^(n–1) · 2^(n–1)
= $ ⅓ (4/3)^n – $ ⅓ (4/3)^(n–1)
= $ ⅓ (4/3)^n (1 – ¾) > 0
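Here is a short Python sketch that works this out with exact fractions, straight from the stocking procedure. The function names are my own. The key assumption is that holding $2^n as the larger envelope means the genius drew n – 1, while holding it as the smaller envelope means he drew n.

```python
from fractions import Fraction

def prior(n):
    """Probability that the genius draws n, so the smaller envelope holds 2**n dollars."""
    return Fraction(1, 3) * Fraction(2, 3) ** n

def switching_analysis(n):
    """Return P(smaller | holding 2**n) and the expected gain from switching, for n >= 1."""
    # You pick either envelope with probability 1/2, so that factor cancels out.
    weight_smaller = prior(n)      # genius drew n, you hold the smaller envelope
    weight_larger = prior(n - 1)   # genius drew n - 1, you hold the larger envelope
    p_smaller = weight_smaller / (weight_smaller + weight_larger)
    gain = p_smaller * 2 ** n - (1 - p_smaller) * Fraction(2 ** n, 2)
    return p_smaller, gain

results = {n: switching_analysis(n) for n in range(1, 6)}
```

For every n ≥ 1 the normalized chance of holding the smaller envelope comes out to 2/5, and the expected gain from switching is 2^n / 10 dollars. That is positive for every amount you could be holding, which is exactly what makes the paradox bite.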

So if you are holding $1, you are better off switching. And if you are holding more than $1, you are better off switching. In other words, switching is always better than staying, regardless of how much money you are holding.

And yet this exact same argument applies once you’ve switched envelopes, so you are led to an infinite process of switching envelopes back and forth. Your decision theory tells you that as you’re doing this, your expected value is exponentially growing, so it’s worth it to you to keep on switching ad infinitum – it’s not often that you get a chance to generate exponentially large amounts of money!

The problem this time can’t be the prior – we are explicitly given the prior in the problem, and verified that it was normalized just in case.

So what’s going wrong?

***

 

 

(once again, recommend that you sit down and try to figure this out for yourself before reading on)

 

 

***

Recall that in my post yesterday, I claimed to have proven that no matter what your prior distribution over money amounts in your envelope, you will always have a net zero expected value. But apparently here we have a statement that contradicts that.

The reason is that my proof yesterday was only for continuous prior distributions over all real numbers, and didn’t apply to discrete distributions like the one in this variant. And apparently for discrete distributions, it is no longer the case that your expected value is zero.

The best solution to this problem that I’ve come across is the following: This problem involves comparing infinite utilities, and decision theory can’t handle infinities.

There’s a long and fascinating precedent for this claim, starting with problems like the Saint Petersburg paradox, where an infinite expected value leads you to bet arbitrarily large amounts of money on arbitrarily unlikely scenarios, and including weird issues in Boltzmann brain scenarios. Discussions of Pascal’s wager also end up confronting this difficulty – comparing different levels of infinite expected utility leads you into big trouble.

And in this variant of the problem, both your expected utility for switching and your expected utility for staying are infinite. Both involve a calculation of a sum of (⅔)^n (the probability) times 2^n, which diverges.
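You can watch the divergence happen with a few lines of Python. This sketch computes partial sums of the total probability and of the expected amount in the smaller envelope. The cutoffs of 10, 20 and 40 terms are arbitrary choices.

```python
from fractions import Fraction

def partial_sums(terms):
    """Total probability and expected smaller-envelope amount over n < terms."""
    total_probability = Fraction(0)
    expected_amount = Fraction(0)
    for n in range(terms):
        p = Fraction(1, 3) * Fraction(2, 3) ** n
        total_probability += p
        expected_amount += p * 2 ** n
    return total_probability, expected_amount

for terms in (10, 20, 40):
    probability, amount = partial_sums(terms)
    print(terms, float(probability), float(amount))
```

The probability column settles down toward 1, while the expected amount grows like (4/3)^N and never settles. So the prior is perfectly proper, but the expectation it implies is not finite.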

This is fairly unsatisfying to me, but perhaps it’s the same dissatisfaction that I feel when confronting problems like Pascal’s wager – a mistaken feeling that decision theory should be able to handle these problems, ultimately rooted in a failure to internalize the hidden infinities in the problem.

Sam Harris and the is-ought distinction

Sam Harris puzzles me sometimes.

I recently saw a great video explaining Hume’s is-ought distinction and its relation to the orthogonality thesis in artificial intelligence. Even if you are familiar with these two, the video is a nice and clear exposition of the basic points, and I recommend it highly.

Quick summary: Statements about what one ought to do are logically distinct from ‘is’ statements – those that simply describe the world. Statements in the second category cannot be derived from statements in the first, and vice versa. Artificial intelligence researchers recognize this in the form of the orthogonality thesis – an AI can have any combination of intelligence level and terminal values, and learning more about the world or becoming more intelligent will not result in value convergence.

This is more than just pure theory. When you’re building an AI, you actually have to design its goals separately from its capacity to describe and model the world. And if you don’t do so, then you’ll have failed at the alignment task – ensuring that the goals of the AI are friendly to humanity (and most of the space of all possible goals looks fairly unfriendly).

Said more succinctly: if you think that a sufficiently intelligent being that spends enough time observing the world and figuring out all of its “is” statements would naturally start to converge towards some set of “ought” statements, you’re wrong. While this may seem very intuitively compelling, it just isn’t the case. Values and beliefs are orthogonal, and if you think otherwise, you’d make a bad AI designer.

I am very confused by the existence of apparently reasonable people that have spent any significant amount of time thinking about this and have concluded that “ought”s can be derived from “is”s, or that “ought”s are really just some special type of “is”s.

Case in point: Sam Harris’s “argument” for getting an ought from an is.

Getting from “Is” to “Ought”

1/ Let’s assume that there are no ought’s or should’s in this universe. There is only what *is*—the totality of actual (and possible) facts.

2/ Among the myriad things that exist are conscious minds, susceptible to a vast range of actual (and possible) experiences.

3/ Unfortunately, many experiences suck. And they don’t just suck as a matter of cultural convention or personal bias—they really and truly suck. (If you doubt this, place your hand on a hot stove and report back.)

4/ Conscious minds are natural phenomena. Consequently, if we were to learn everything there is to know about physics, chemistry, biology, psychology, economics, etc., we would know everything there is to know about making our corner of the universe suck less.

5/ If we *should* to do anything in this life, we should avoid what really and truly sucks. (If you consider this question-begging, consult your stove, as above.)

6/ Of course, we can be confused or mistaken about experience. Something can suck for a while, only to reveal new experiences which don’t suck at all. On these occasions we say, “At first that sucked, but it was worth it!”

7/ We can also be selfish and shortsighted. Many solutions to our problems are zero-sum (my gain will be your loss). But *better* solutions aren’t. (By what measure of “better”? Fewer things suck.)

8/ So what is morality? What *ought* sentient beings like ourselves do? Understand how the world works (facts), so that we can avoid what sucks (values).

This is clearly a bad argument. Depending on what he intended it to be, step 5 is either a non sequitur or a restatement of his conclusion. It’s especially surprising because the mistake is so clearly visible.

I don’t know what to make of this, and really have no charitable interpretation. My least uncharitable interpretation is that maybe the root of the problem is an inability to let go of the longing to ground your moral convictions in objectivity. I’ve certainly had times where I felt so completely confident about a moral conviction that I convinced myself that it just had to be objectively true, although I always eventually came down from those highs.

I’m not sure, but this is something that I am very confused about and want to understand better.

Two envelopes paradox

In front of you are two envelopes, each containing some unknown amount of money. You know that one of the envelopes has twice the amount of money of the other, but you’re not sure which one that is and can only take one of the two. You choose one at random, and start to head out, when a thought goes through your head:

You: “Hmm, let me think about this. I just took an introductory decision theory class, and they said that you should always make decisions that maximize your expected utility. So let’s see… Either I have the envelope with less money or not. If I do, then I stand to double my money by switching. And if I don’t, then I only lose half my money. Since the possible gain outweighs the possible loss, and the two are equally likely, I should switch!”

Excited by your good sense in deciding to consult your decision theory knowledge, you run back and take the envelope on the table instead. But now, as you’re walking towards the door, another thought pops into your head:

You: “Wait a minute. I currently have some amount of money in my hands, and I still don’t know whether I got the envelope with more or less money. I haven’t gotten any new information in the past few moments, so the same argument should apply… if I call the amount in my envelope Y, then I gain $Y by switching if I have the lesser envelope, and only lose $½Y by switching if I have the greater envelope. So… I should switch again, I guess!”

Slightly puzzled by your own apparently absurd behavior, but reassured by the memories of your decision theory professor’s impressive-sounding slogans about instrumental rationality and maximizing expected utility, you walk back to the table and grab the envelope you had initially chosen, and head for the door.

But a new argument pops into your head…

You see where this is going.

What’s going on here? It appears that by a simple application of decision theory, you are stuck switching envelopes ad infinitum, foolishly thinking that as you do so, your expected value is skyrocketing. Has decision theory gone crazy?

***

This is the wonderful two-envelopes paradox. It’s one of my favorite paradoxes of decision theory, because it starts from what appear to be incredibly minimal assumptions and produces obviously outlandish behavior.

If you’re not convinced yet that this is what standard decision theory tells you to do, let me formalize the argument and write out the exact calculations that lead to the decision to switch.

Call the envelope with less money “Envelope A”
Call the envelope with more money “Envelope B”
Call the envelope you are holding “Envelope E”
X = the amount of money in your envelope

First framing

P(E is A) = P(E is B) = ½
If E is A & you switch, then you get $2X
If E is B & you switch, then you get $½X
If E is A & you stay, then you get $X
If E is B & you stay, then you get $X

EU(switch) = P(E is A) · 2X + P(E is B) · ½X = 1¼ X
EU(stay) = P(E is A) · X + P(E is B) · X = X
So, EU(switch) > EU(stay)!

If you think that the conclusion is insane, then either there’s an error somewhere in this argument, or we’ve proven that decision theory is insane.

It’s easy to put forward additional arguments for why the expected utility should be the same for switching and staying, but this still leaves the nagging question of why this particular argument doesn’t work. The ultimate reason is wonderfully subtle and required several hours of agonizing for me to grasp.

I suggest you stop and analyze the argument a little bit before reading on – try to figure out for yourself what’s wrong.

Let me present the correct line of reasoning for comparison:

Call the envelope with less money “Envelope A”
Call the envelope with more money “Envelope B”
Call the envelope you are holding “Envelope E”
Label the amount of money in Envelope A = Y.
Then the amount of money in Envelope B = 2Y.

Second framing

P(E is A) = P(E is B) = ½
If E is A and you switch, then you get $2Y
If E is B and you switch, then you get $Y
If E is A and you stay, then you get $Y
If E is B and you stay, then you get $2Y

EU(switch) = P(E is A) · 2Y + P(E is B) · Y = 1½ Y
EU(stay) = P(E is A) · Y + P(E is B) · 2Y = 1½ Y
So EU(switch) = EU(stay)
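The second framing is easy to check by brute force. This Python sketch enumerates the two equally likely cases for a fixed Y, with $10 as an arbitrary example amount.

```python
from fractions import Fraction

def expected_payoffs(y):
    """Expected payoff of switching and of staying when the envelopes hold y and 2y."""
    half = Fraction(1, 2)
    # (amount you hold, amount in the other envelope), each with probability 1/2
    cases = [(y, 2 * y), (2 * y, y)]
    eu_stay = sum(half * held for held, other in cases)
    eu_switch = sum(half * other for held, other in cases)
    return eu_switch, eu_stay

eu_switch, eu_stay = expected_payoffs(Fraction(10))
```

Both values come out to 1½ Y, which is $15 here. Switching just swaps which case pays Y and which pays 2Y, so the average can’t change.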

This gives us the right answer, but the only apparent difference between this and what we did before is which quantity we give a name – X was the money in your envelope in the first argument, and Y is the money in the lesser envelope in this one. How could the answer depend on this apparently irrelevant difference?

***

Without further ado, let me diagnose the argument at the start of this post.

The fundamental mistake that this argument makes is that it treats the probability that you have the lesser envelope as if it is independent of the amount of money that you have in your hands. This is only the case if the amount of money in your envelope is irrelevant to whether you have the lesser envelope. But the amount of money in your hand is highly relevant information.

This may sound weird. After all, you chose the envelope at random, and shouldn’t the principle of maximum entropy prescribe that two equivalent envelopes are equally likely to be chosen? How could the unknown amount of money you’re holding have any sway over which one you were more likely to choose?

The answer is that the envelopes aren’t equivalent for any given amount of money in your hand. In general, given that you end up holding an envelope with $X, the chance that this is the lesser quantity is affected by the value of X.

Suppose, for example, that you know that the envelopes contain $1 and $2. Now in your mental model of the envelope in your hand, you see an equal chance of it containing $1 and $2. But now whether your envelope is the lesser or greater one is clearly not independent of the amount of money in your envelope. If you’re holding $1, then you know that you have the lesser envelope. And if you’re holding $2, then you know that you have the greater envelope.

In general, your prior probability over the possible amounts of money in your envelope will be relevant to the chance that you are holding the lesser or greater envelope.

If you think that the envelopes are much more likely to contain small numbers, then given that the amount of money in your hand is large, you are much more likely to be holding the envelope with more money. Or if you think that the person stuffing the envelopes had only a certain fixed upper amount of cash that he was willing to put into the envelopes, then for some possible amounts of money in your envelope, you will know with certainty that it is the larger envelope.

Regardless, we’ll see that for any distribution of probabilities over the money in the envelopes, a proper calculation of the expected gain from switching (the expected utility of switching minus that of staying) will inevitably come out to zero.

Here’s the sketch of the proof:

Stipulations:
Call the envelope with less money “Envelope A”
Call the envelope you are holding “Envelope E”
P(E is A) = P(E is B) = ½
If A has $x, then B has $2x
P(A has $x) = f(x) for some normalized function f(x)

The function f(x) represents your prior probability distribution over the possible amounts of money in A. We can infer your probability distribution over the possible amounts of money in B from the fact that B has double the money of A.

P(B has $x) = ½ P(A has $½ x) = ½ · f(½ x)

The ½ comes from the fact that we’ve stretched out our distribution by a factor of 2 and must renormalize it to keep our total probability equal to 1.

Now we’ll calculate the expected utility of switching, given that our envelope has some amount of money $x in it, and average over all possible values of x.

∆EU = < ∆EU given that E has $x >
= ∫ P(E has $x) · (∆EU given that E has $x) dx

Since this calculation will have several components, I’ll start color-coding them.

Next we’ll split up the calculation into the expected utility of switching (brown) and the expected utility of staying (blue).

∆EU = ∫ P(E has $x) · {EU(switch | E has $x) – EU(stay | E has $x)} dx

Our final subdivision of possible worlds will be regarding whether you’re holding the envelope with less or more money.

∆EU = ∫ P(E has $x) · { [ P(E is A | E has $x) · U(switch to B) + P(E is B | E has $x) · U(switch to A) ] – [ P(E is A | E has $x) · U(stay with A) + P(E is B | E has $x) · U(stay with B) ] } dx

We can rearrange the terms and color code them by whether they refer to the world in which you’re holding the lesser envelope (red) or the world in which you’re holding the greater envelope (green).

∆EU = ∫ { P(E has $x and E is A) · (U(switch to B) – U(stay with A)) + P(E has $x and E is B) · (U(switch to A) – U(stay with B)) } dx
= ∫ { P(A has $x) · P(E is A) · (2x – x) + P(B has $x) · P(E is B) · (½ x – x) } dx
= ∫ { f(x) · ½ x – ½ f(½ x) · ¼ x } dx
= ½ ∫ x f(x) dx – ½ ∫ (½ x) · f(½ x) · d(½ x)
= ½ ∫ x f(x) dx – ½ ∫ x’ f(x’) dx’

= 0

And we’re done!
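As a sanity check, here is a minimal Python sketch of the same calculation for a discrete prior, using exact fractions. The particular prior (`prior_A`) and all of the function names are assumptions made up for illustration; any normalized prior should give the same zero.

```python
from fractions import Fraction

# Hypothetical prior f over the amount of money in the lesser envelope A.
# (An assumption for illustration only; any normalized prior works.)
prior_A = {1: Fraction(1, 2), 2: Fraction(1, 4), 4: Fraction(1, 4)}


def joint_distribution(prior):
    """Return {(x, holding_lesser): probability} for the envelope E in your hand."""
    joint = {}
    for a, p in prior.items():
        # You pick A or B with probability 1/2 each; B holds twice as much as A.
        joint[(a, True)] = joint.get((a, True), 0) + p / 2
        joint[(2 * a, False)] = joint.get((2 * a, False), 0) + p / 2
    return joint


def expected_gain_from_switching(prior):
    """Average of (switch minus stay) over all possible worlds."""
    gain = Fraction(0)
    for (x, holding_lesser), p in joint_distribution(prior).items():
        # Holding the lesser envelope: switching gains x.
        # Holding the greater envelope: switching loses x/2.
        gain += p * (x if holding_lesser else -Fraction(x, 2))
    return gain


def prob_lesser_given_amount(prior, x):
    """P(E is A | E has $x), computed from the joint distribution."""
    joint = joint_distribution(prior)
    p_lesser = joint.get((x, True), 0)
    p_greater = joint.get((x, False), 0)
    return p_lesser / (p_lesser + p_greater)


print(expected_gain_from_switching(prior_A))
for amount in (1, 2, 4, 8):
    print(amount, prob_lesser_given_amount(prior_A, amount))
```

With this prior, the chance that you hold the lesser envelope depends on the amount in your hand: it is certain at $1 and impossible at $8. That is exactly the dependence the naive argument ignores.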

So if this is the right way to do the calculation we attempted at the beginning, then where did we go wrong the first time? The key is that we considered the unconditional probabilities P(E is A) and P(E is B) instead of the conditional probabilities P(E is A | E has $x) and P(E is B | E has $x).

This made the calculations more complicated, but was necessary. Why? Well, the assumption of independence of the value of your envelope and whether it is the lower or higher valued envelope is logically incoherent.

Proof in words: Suppose that your envelope’s value was independent of whether it is the lower or higher envelope. This means that for any value $X, it is equally likely that the other envelope contains $2X and that it contains $½X. We can write this condition as follows: P(other envelope has 2X) = P(other envelope has ½X) for all X. But there are no normalized distributions that satisfy this property! For any amount of probability mass in a given region [X, X+∆], there must also be at least as much probability mass in the region [4X, 4X+4∆]. Thus if any region has any finite probability mass, then that mass must be repeated an infinite number of times, meaning the distribution can’t be normalized! Proof by contradiction.
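To make this concrete, here is a small Python sketch. In the notation of the proof above, independence means P(E has $x and E is A) = P(E has $x and E is B), that is, ½ f(x) = ½ · ½ f(½ x), or f(x) = ½ f(½ x). The function f(x) = 1/x is one example that satisfies this condition. Choosing it is my own assumption for illustration, and the helper names are made up.

```python
import math


def f(x):
    """An example density shape satisfying f(x) = (1/2) * f(x/2). Not normalizable."""
    return 1.0 / x


def mass(lo, hi):
    """Integral of 1/x from lo to hi, in closed form."""
    return math.log(hi / lo)


# The independence condition holds at an arbitrary point.
print(f(3.0), 0.5 * f(1.5))

# Every doubling interval carries the same mass, so the total grows without bound.
for k in range(5):
    lo = 2.0 ** k
    print(f"[{lo}, {2 * lo}] has mass {mass(lo, 2 * lo):.4f}")
```

Each interval [X, 2X] carries the same mass (log 2) no matter how large X is, so summing over infinitely many of them can never give a total of 1.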

Even if we imagined some cap on the total value of a given envelope (say $1 million), we still don’t get away. Because now the value of your envelope is no longer independent of whether it is the lower or higher envelope! If the value of the envelope in your hands is $999,999, then you know for sure that you must have the larger of the two envelopes.

If the amount of money in your hands and the chance that you have the lesser envelope are independent, then you are imagining an unnormalizable prior. And if they are dependent, then the argument we started with must be amended to the colorful argument.

It’s not that at any point you get to look inside your envelope and see how much money is inside. It’s simply that you cannot talk about the probability of your envelope being the lesser of the two as if it is independent of the amount of money you’re holding. And our starting argument did exactly that – it assumed that you were equally likely to have the smaller and larger envelope, regardless of how much money you held.

So the problem with our starting argument is wonderfully subtle. By the very framing of the statement “It’s equally likely that the other envelope contains $2X and $½X if my envelope contains $X,” we are committing ourselves to an impossibility: a prior probability with infinite total probability!

Principle of Maximum Entropy

Previously, I talked about the principle of maximum entropy as the basis of statistical mechanics, and gave some intuitive justifications for it. In this post I want to present a more rigorous justification.

Our goal is to find a function that uniquely quantifies the amount of uncertainty that there is in our model of reality. I’m going to use very minimal assumptions, and will point them out as I use them.

***

Here’s the setup. There are N boxes, and you know that a ball is in one of them. We’ll label the possible locations of the ball as:

B1, B2, B3, … BN
where Bn = “The ball is in box n”

The full state of our knowledge about which box the ball is in will be represented by a probability distribution.

(P1, P2, P3, … PN)
where Pn = the probability that the ball is in box n

Our ultimate goal is to uniquely prescribe an uncertainty measure S that will take in the distribution P and return a numerical value.

S(P1, P2, P3, … PN)

Our first assumption is that this function is continuous. When you make arbitrarily small changes to your distribution, you don’t get discontinuous jumps in your entropy. We’ll use this in a few minutes.

We’ll start with a simpler case than the general distribution – a uniform distribution, where the ball is equally likely to be in any of the N boxes.

For all n, Pn = 1/N

uniform-boxes.png

We’ll give the entropy of a uniform distribution a special name, labeled U for ‘uniform’:

S(1/N, 1/N, …, 1/N) = U(N)

Our next and final assumption is going to relate to the way that we combine our knowledge. In words, it will be that the uncertainty of a given distribution should be the same, regardless of how we represent the distribution. We’ll lay this out more formally in a moment.

Before that, imagine enclosing our boxes in M different containers, like this:

containers.png

Now we can represent the same state of knowledge as our original distribution by specifying first the probability that the ball is in a given container, and then the probability that it is in a given box, given that it is in that container.

Qn = probability that the ball is in container n
Pm|n = probability that the ball is in box m, given that it’s in container n

almost-final.png

Notice that the value of each Qn is just the number of boxes in the container divided by the total number of boxes. In addition, the conditional probability that the ball is in box m, given that it’s in container n, is just one over the number of boxes in the container. We’ll write these relationships as

Qn = |Cn| / N
Pm|n = 1 / |Cn|

The point of all this is that we can now formalize our second assumption. Our initial state of knowledge was given by a single distribution Pn. Now it is given by two distributions: Qn and Pm|n.

Since these represent the same amount of knowledge about which box the ball is in, the entropy of each should be the same.

Final

And in general:

Initial entropy = S(1/N, 1/N, …, 1/N) = U(N)
Final entropy = S(Q1, Q2, …, QM) + ∑i Qi · S(P1|i, P2|i, …, PN|i)
= S(Q1, Q2, …, QM) + ∑i Qi · U(|Ci|)

The form of this final entropy is the substance of the uncertainty combination rule. First you compute the uncertainty of each individual distribution. Then you add them together, but weight each one by the probability that you encounter that uncertainty.

Why? Well, a conditional probability like Pm|n represents the probability that the ball is in box m, given that it’s in container n. You will only have to consider this probability if you discover that the ball is in container n, which happens with probability Qn.

With this, we’re almost finished.

First of all, notice that we have the following equality:

S(Q1, Q2, …, QM) = U(N) – ∑i [ Qi · U(|Ci|) ]

In other words, if we determine the general form of the function U, then we will have uniquely determined the entropy S for any arbitrary distribution!

And we can determine the general form of the function U by making a final simplification: assume that the containers all contain an equal number of boxes.

This means that Qn will be a uniform distribution over the M possible containers. And if there are M containers and N boxes, then this means that each container contains N/M boxes.

For all n, Qn = 1/M and |Cn| = N/M

If we plug this all in, we get that:

S(1/M, 1/M, …, 1/M) = U(N) – ∑i [1/M · U(N/M)]
U(M) = U(N) – U(N/M)
U(N/M) = U(N) – U(M)

There is only one continuous function that satisfies this equation, and it is the logarithm:

U(N) = K log(N) for some constant K

And we have uniquely determined the form of our entropy function, up to a constant factor K!

S(Q1, Q2, …, QM) = K log(N)  –  K ∑i Qi log(| Ci |)
= – K ∑i Qi log(| Ci |/N)
= – K ∑i Qi log(Qi)

If we add as a third assumption that U(N) should be monotonically increasing with N (that is, more boxes means more uncertainty, not less), then we can also specify that K should be a positive constant.
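Here is a minimal Python sketch that checks this result numerically, taking K = 1 and natural logarithms. The container sizes and the function names are assumptions chosen for illustration. It confirms that the two-stage entropy (container first, then box within the container) equals U(N), and that U(N) = K log(N) satisfies U(N/M) = U(N) – U(M).

```python
import math


def shannon_entropy(probs, K=1.0):
    """S(P) = -K * sum(p * log p), skipping zero-probability outcomes."""
    return -K * sum(p * math.log(p) for p in probs if p > 0)


def uniform_entropy(n, K=1.0):
    """U(N) = K log N, the entropy of a uniform distribution over n boxes."""
    return K * math.log(n)


def two_stage_entropy(container_sizes, K=1.0):
    """S(Q) plus the Q-weighted entropies of the uniform distributions inside each container."""
    total_boxes = sum(container_sizes)
    q = [size / total_boxes for size in container_sizes]
    within = sum(qi * uniform_entropy(size, K) for qi, size in zip(q, container_sizes))
    return shannon_entropy(q, K) + within


sizes = [1, 2, 3, 6]  # hypothetical container sizes |Ci|, so N = 12 boxes
n_boxes = sum(sizes)

print(two_stage_entropy(sizes), uniform_entropy(n_boxes))
print(uniform_entropy(12 / 4), uniform_entropy(12) - uniform_entropy(4))
```

Both printed pairs should agree up to floating-point rounding.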

The three basic assumptions from which we can find the form of the entropy:

  1. S(P) is a continuous function of P.
  2. S should assign the same uncertainty to different representations of the same information.
  3. The entropy of a wide uniform distribution is greater than the entropy of a thin uniform distribution.

Consciousness

Every now and then I go through a phase in which I find myself puzzling about consciousness. I typically leave these phases feeling like my thoughts are slightly more organized on the problem than when I started thinking about it, but still feeling overwhelmingly confused about the subject.

I’m currently in one of those phases!

It started when I was watching an episode of the recent Planet Earth II series (which I recommend to everybody – it’s beautiful). One scene contains a montage of grizzly bears that have just emerged from hibernation and are now passionately grinding their backs against trees to shed their excess fur.

Nobody with a soul would watch this video and not relate to the back-scratching bears through memories of the rush of pleasure and utter satisfaction of a great back scratching session.

The natural question this raises is: how do we know that the bears are actually feeling the same pleasure that we feel when we get our backs scratched? How could we know that they are feeling anything at all?

A modest answer is that it’s just intuition. Some things just look to us like they’re conscious, and we feel a strong intuitive conviction that they really are feeling what we think they are.

But this is unsatisfying. ‘Intuition’ is only a good answer to a question when we have a good reason to presume that our intuitions should be reliable in the context of the question. And why should we believe that our intuitions about a rock being unconscious and a bear being conscious have any connection to reality? How can we rationally justify such beliefs?

The only starting point we have for assessing any questions about consciousness is our own conscious experience – the only one that we have direct and undeniable introspective access to. If we’re to build up a theory of consciousness, we must start there.

So for instance, we notice that there are tight correlations between patterns of neural activation in our brains and our conscious experiences. We also notice that there are some physical details that seem irrelevant to the conscious experiences that we have.

This distinction between ‘the physical details that are relevant to what conscious experiences I have’ and ‘the physical details that are irrelevant to what conscious experiences I have’ allows us to make new inferences about conscious experiences that are not directly accessible to us.

We can say, for instance, that a perfect physical clone of mine that is in a different location than me probably has a similar range of conscious experiences. This is because the only difference between us is our location, which is largely irrelevant to the range of my conscious experiences (I experience colors and emotions and sounds the same way whether I’m on one side of the room or another).

And we can draw similar conclusions about a clone of mine if we also change their hairstyle or their height or their eye color. Each of these changes should only affect our view of their consciousness insofar as we notice changes in our consciousness upon changes in our height, hairstyle, or eye color.

This gives us rational grounds on which to draw conclusions like ‘Other human beings are conscious, and likely have similar types of conscious experiences to me.’ The differences between other human beings and me are not the types of things that seem able to make them have wildly different types of conscious experiences.

Once we notice that we tend to reliably produce accurate reports about our conscious experiences when there are no incentives for us to lie, we can start drawing conclusions about the nature of consciousness from the self-reports of other beings like us.

(Which is of course how we first get to the knowledge about the link between brain structure and conscious experience, and the similarity in structure between my brain and yours. We probably don’t actually personally notice this unless we have access to a personal MRI, but we can reasonably infer from the scientific literature.)

From this we can build up a theory of consciousness. A theory of consciousness examines a physical system and reports back on things like whether or not this system is conscious and what types of conscious experiences it is having.

***

Let me now make a conceptual separation between two types of theories of consciousness: epiphenomenal theories and causally active theories.

Epiphenomenal theories of consciousness are structured as follows: There are causal relationships leading from the physical world to conscious experiences, and no causal relationships leading back.

Causally active theories of consciousness have both causal arrows leading from the physical world to consciousness, and back from consciousness to the physical world. So physical stuff causes conscious experiences, and conscious experiences have observable behavioral consequences.

Let’s tackle the first class of theories first. How could a good Bayesian update on these theories? Well, the theories make predictions about what is being experienced, but make no predictions about any other empirically observable behaviors. So the only source of evidence for these theories is our personal experiences. If Theory X tells me that when I hit my finger with a hammer, I will feel nothing but a sense of mild boredom, then I can verify that Theory X is wrong only through introspection of my own experiences.

But even this is unusual.

The mental process by which I verify that Theory X is wrong is occurring in my brain, and on any epiphenomenal theory, such a process cannot be influenced by any actual conscious experiences that I’m having.

If suddenly all of my experiences of blue and red were inverted, then any reaction of mine, especially one which accurately reported what had happened, would have to be a wild coincidence. After all, the change in my conscious experience can’t have had any causal effects on my behavior.

In other words, there is no reason to expect on an epiphenomenal theory of consciousness that the beliefs I form or the self-reports I produce about my own experiences should align with my actual conscious experiences.

And yet they invariably do. Every time I notice that I have accurately reported a conscious experience, I have noticed something that is wildly unlikely to occur under any epiphenomenal theory of consciousness. And by Bayes’ rule, each time this happens, all epiphenomenal theories are drastically downgraded in credence.

So this entire class of theories is straightforwardly empirically wrong, and will quickly be eliminated from our model of reality through some introspection. The theories that are left involve causation going both from the physical world to consciousness and back from consciousness to the physical world.

In other words, they involve two mappings – one from a physical system to consciousness, and another from consciousness to predicted future behavior of the physical system.

But now we have a puzzle. The second mapping involves third-party observable physical effects that are caused by conscious experiences. But in our best understanding of the world, physical effects are always found to be the results of physical causes. For any behavior that my theory tells me is caused by a conscious experience, I can trace a chain of physical causes that uniquely determined this behavior.

What does this mean about the causal role of consciousness? How can it be true that conscious experiences are causal determinants of our behavior, and also that our behaviors are fully causally determined by physical causes?

The only way to make sense of this is by concluding that conscious experiences must be themselves purely physical causes. So if my best theory of consciousness tells me that experience E will cause behavior B, and my best theory of physics tells me that the cause of B is some set of physical events P, then E is equal to P, or some subset of P.

This is how we are naturally led to what’s called identity physicalism – the claim that conscious experiences are literally the same thing as some type of physical pattern or substance.

***

Let me move on to another weird aspect of consciousness. Imagine that I encounter an alien being that looks like an exact clone of myself, but made purely of silicon. What does our theory of consciousness say about this being?

It seems like this depends on whether the theory makes reference only to the patterns exhibited by the physical structure, or to the physical structure itself. So if my theory is about the types of conscious experiences that arise from complicated patterns of carbon, then it will tell me that this alien being is not conscious. But if it just references the complicated patterns, and doesn’t specify the lower-level physical substrate from which the pattern arises, then the alien being is conscious.

The problem is that it’s not clear to me which of these we should prefer. Both make the same third-party predictions, and first-party verifications could only be made through a process involving a transformation of our body from one substrate to the other. In the absence of such a process, both of the theories make the exact same predictions about what the world looks like, and thus will be boosted or shrunk in credence exactly the same way by any evidence we receive.

Perhaps the best we could do is say that the first theory contains all of the complicated details of the second, but also has additional details, and so should be penalized by the conjunction rule? So “carbon + pattern” will always be less likely than “pattern” by some amount. But these differences in priors can’t give us that much, as they should in principle be dwarfed in the infinite-evidence limit.

What this amounts to is an apparently un-leap-able inferential gap regarding the conscious experiences of beings that are qualitatively different from us.

Inequality and free markets

(This post is a summary of the main things I found while diving into the economics literature on income inequality. Will try to condense my findings as much as possible, but there’s a lot to talk about. TL;DR at the end for lazy folk)

First, a note on terminology

Before getting into the published research on this topic, I started by surveying articles from popular news sources. I was curious to ultimately compare the standard media presentation to what I’d find in the scientific literature.

A large portion of what I read consisted of debates about the meanings of terms – one person says that capitalism is a lightly regulated market with a social safety net, another says any social safety net is socialism and therefore not capitalism, another says that a free market with any form of government regulation is corporatism, not capitalism, and they all yell at each other about terms and don’t get anything done.

By contrast, the terminology used in the economics and public policy literature was consistent, straightforward, and clear. I’ll define the controversial terms right here at the start to avoid confusion. These definitions are in line with the way that the terms are used in the literature.

Economic freedom: A combination of factors including limited regulation of businesses, protected rights to own private property, trade freedom, and small government.

Free market: An economic system characterized by high degrees of economic freedom.

40 papers on inequality in one sentence each

(Preliminary post – am planning to write this all up more digestibly in a future post)

***

Free markets and income inequality

Positive relationship

Capital in the 21st Century (Piketty)
When the rate of return on capital is greater than the rate of economic growth (as tends to occur in a free market given time), this leads to a concentration of wealth.

Capitalism and inequality: The negative consequences for humanity (Stevenson)
Inequalities are the inevitable result of capitalism, and we should abolish private property.

Envisioning Real Utopias (Wright)
Capitalism causes inequality through profit-seeking behavior and encouragement of innovation in technology.

How Privatization Increases Inequality (In the Public Interest)
Privatization of public goods causes inequality and inefficiency.

Income inequality and privatisation: a multilevel analysis comparing prefectural size of private sectors in Western China (Bakkeli)
More privatization corresponds to higher income inequality and lower individual outcome.

Economic Freedom, Income Inequality and Life Satisfaction in OECD Countries (Graafland, Lous)
Economic freedom causes higher income per capita, higher inequality, and overall lower happiness.

Negative relationship

Testing Piketty’s Hypothesis on the Drivers of Income Inequality: Evidence from Panel VARs with Heterogeneous Dynamics (Góes)
Piketty’s r – g hypothesis is wrong; the effect of r > g is not wealth concentration, but in fact mild wealth dispersion.

Economic Freedom and the Trade-off between Inequality and Growth (Scully)
Economic freedom overall reduces inequality, even though economic growth increases inequality.

A Dynamic Analysis of Economic Freedom and Income Inequality in the 50 U.S. States: Empirical Evidence of a Parabolic Relationship (Bennett, Vedder)
Increases in economic freedom are associated with lower income inequality, and larger government is associated with greater inequality (with the exception of progressive taxation).

Economic freedom and equality: Friends or foes? (Berggren)
More economic freedom is associated with less inequality, because of trade liberalization and economic growth.

Economic Freedom And Income Inequality Revisited: Evidence From A Panel Error Correction Model (Apergis, Dincer, Payne)
Economic freedom reduces inequality in both the short and long run, and inequality causes less economic freedom.

Income inequality and economic freedom in the U.S. states (Ashby, Sobel)
More economic freedom causes larger per capita income, higher rates of economic growth, and less relative income inequality.

Other

On the ambiguous economic freedom–inequality relationship (Bennett, Nikolaev)
The relationship between economic freedom and inequality is ambiguous – it depends on how you choose your freedom and inequality measures.

***

Income inequality in the US

Trade

The Geography of Trade and Technology Shocks in the United States (Autor, Dorn, Hanson)
Two biggest causes of growing inequality are technology and trade, which are geographically separate in their effects.

Why are American Workers getting Poorer? China, Trade and Offshoring (Ebenstein, Harrison, McMillan)
Offshoring to China has led to US wage declines, but trade with China is much more important in explaining wage declines.

China Trade, Outsourcing and Jobs (Kimball, Scott)
China is a currency manipulator, encouraging a huge trade imbalance and hurting US workers.

The China Syndrome: Local Labor Market Effects of Import Competition in the United States (Autor, Dorn, Hanson)
Imports from China increase unemployment, lower labor force participation, and reduce wages in the US.

Technology

Rising Income Inequality: Technology, or Trade and Financial Globalization? (Jaumotte, Lall, Papageorgiou)
Technological progress is the primary cause of the rise of inequality in the last 2 decades, and increased globalization has had a minor impact.

World Economic Outlook, April 2017: Gaining Momentum? (International Monetary Fund)
Decrease in bottom wages in advanced economies is driven mostly by technology and also by globalization.

It’s the Market: The Broad-Based Rise in the Return to Top Talent (Kaplan, Rauh)
Incomes at the top are driven by technological change in information and communications that increase the relative productivity of talented individuals through audience magnification.

Policy

Wage Inequality: A Story of Policy Choices (Mishel, Schmitt, Shierholz)
Income inequality is the result of erosion of the minimum wage value, decreased union power, industrial deregulation, trade policy, failure to use fiscal spending to stimulate the economy, bad monetary policy by the Fed, and rent-seeking behaviors from CEOs.

Controversies about the Rise of American Inequality: A Survey (Gordon, Dew-Becker)
Rising inequality is due to a low minimum wage, the decline in unionization, audience magnification, generous stock options, and unregulated corporate wage practices, not imports, immigration, or a lower labor share of income.

The Top 1 Percent in International and Historical Perspective (Alvaredo, Atkinson, Piketty, Saez)
Tax policy, decreased labor bargaining ability, and increased capital income explain the growing income share at the very top, not technology.

Rent-seeking

Declining Labor and Capital Shares (Barkai)
Capital shares have declined faster than labor shares in the last 30 years, and the decline of labor shares is due entirely to an increase in markups, which decreases output and consumer welfare.

The Pay of Corporate Executives and Financial Professionals as Evidence of Rents in Top 1 Percent Incomes (Bivens, Mishel)
CEO pay is driven by rent-seeking behavior.

Evidence for the Effects of Mergers on Market Power and Efficiency (Blonigen, Pierce)
Mergers in manufacturing from 1997 to 2007 haven’t significantly increased productivity or efficiency, but have increased markups.

Skill premium

Skills, education, and the rise of earnings inequality among the “other 99 percent” (Autor)
Income inequality is mostly due to an increasing skill premium, but is also due to a decline in the minimum wage value, automation, international trade, de-unionization, and regressive taxation.

Other

The long-run determinants of inequality: What can we learn from top income data? (Roine, Vlachos, Waldenström)
High growth benefits top income earners, tax progressiveness reduces top income shares, and trade openness doesn’t really do anything.

Why Hasn’t Democracy Slowed Rising Inequality? (Bonica, McCarty, Poole, Rosenthal)
Democracy hasn’t slowed the rise in inequality because of a political acceptance of free-market capitalism, immigration and a low turnout of poor voters, rising real income and wealth making social insurance less attractive, money influencing politics, and distortion of democracy through gerrymandering.

Billionaire Bonanza (Collins, Hoxie)
The people at the top are crazy rich and we should tax them.

***

More on free markets

Economic Freedom, Institutional Quality, and Cross-Country Differences in Income and Growth (Gwartney, Holcombe, Lawson)
More economic freedom leads to more rapid growth and higher income levels.

Economic Freedom of the World: 2017 Annual Report (Gwartney, Lawson, Hall)
Economic freedom is strongly correlated with rapid growth, higher average income per capita, lower poverty rates, higher income amount/share for the poorest 10%, higher life expectancy, more civil liberties and political rights, more gender equality, greater happiness, and better access to electricity, gas, and water supplies.

Agent-Based Simulations of Subjective Well-Being (Baggio, Papyrakis)
Economic growth weakly correlates with happiness, and pro-middle and balanced growth correspond to much higher levels of long-term happiness than pro-rich growth.

***

Decline of income inequality

Latin America

Deconstructing the Decline in Inequality in Latin America (Lustig, López-Calva, Ortiz-Juarez)
Income inequality declined in Latin American countries because of a declining skill premium and government redistribution.

China

Global Inequality Dynamics: New Findings from WID.world (Alvaredo, Chancel, Piketty, Saez, Zucman)
China’s top 1% income share has risen since 1980 (partially due to privatization), peaked near 2006, and is stable/slightly declining.

The great Chinese inequality turnaround (Kanbur, Wang, Zhang)
Drop in Chinese inequality is due to tightening of rural labor markets from migration, government investment in infrastructure in the rural sector, minimum wage policies, and social programs.

***

Views on inequality

The Challenge of Shared Prosperity (Rivkin, Mills, Porter)
Business leaders care about inequality, and it’s in their perceived self-interest to reduce inequality.

How Much (More) Should CEOs Make? A Universal Desire for More Equal Pay. (Kiatpongsan, Norton)
Everybody across all socioeconomic classes wants less inequality.

***

Minimum wage

Minimum Wages and Employment (Neumark, Wascher)
Most studies show that minimum wages reduce employment of low-wage workers.

The Effects of a Minimum-Wage Increase on Employment and Family Income (Congressional Budget Office)
Increases in minimum wage would cause unemployment but have a net positive real income effect.

 

Gregory Watson (and how to change the world)

When you feel like cynically proclaiming the impossibility of ever making any real progress, think about the story of Gregory Watson.

In 1982, Gregory Watson was a UT Austin undergrad struggling to think of a topic to write about for his political science paper.

He came across an old failed attempt to amend the Constitution, first proposed nearly 200 years earlier in 1789. The amendment prevented congressional pay raises from taking place until after an election, the idea being that a Congressman shouldn’t be able to just give themselves pay-raises willy-nilly, without first having to wait to be re-elected.

Gregory had a wild idea: maybe the amendment could still be passed. He looked into it, and found that amazingly, yes, the amendment process was still live. Deadlines for amendment ratification were introduced in 1917, over a hundred years after the amendment was proposed. So this amendment had never gotten a deadline.

Excited, he wrote up his paper on this, suggesting that this amendment could and should be sent out for ratification nearly 200 years after its proposal. He turned in the paper to his teaching assistant, and… got a C.

Sure that there was a mistake, he appealed the grade to the professor, and… once more got a C.

His paper judged inadequate, Gregory began lobbying lawmakers, sending letters to members of Congress. Most responses were disappointingly negative – the amendment was 200 years old, and this sort of thing just wasn’t done. But he didn’t stop. He kept writing appeals to members of Congress for years, pushing them to bring this amendment to the floor of state legislatures.

Finally the tide shifted in the hail of Gregory’s determined appeals to state lawmakers.

Maine passed the amendment a year after his failed paper. Colorado approved the amendment 2 years later. Then five more states the next year. And 16 more states in the following four years.

By 1992, the 27th Amendment was ratified. And in 2017, his old professor signed a grade change form, changing Gregory’s grade to an A+.

An undergraduate political science student changed the United States’ Constitution, for the better. And was given a C for it.

The system can be exasperating, and cases like these are few and far between, but they do happen.