Patterns of inductive inference

I’m currently reading through Judea Pearl’s wonderful book Probabilistic Reasoning in Intelligent Systems. It’s chock-full of valuable insights into the subtle patterns involved in inductive reasoning.

Here are some of the patterns of reasoning described in Chapter 1, ordered in terms of increasing unintuitiveness. Any good system of inductive inference should be able to accommodate all of the following.

Abduction:

If A implies B, then finding that B is true makes A more likely.

Example: If fire implies smoke, smoke suggests fire.
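
A short derivation of why this holds, under the assumption (not spelled out above) that "A implies B" is read as P(B | A) = 1, and that P(B) > 0:

```latex
P(A \mid B) \;=\; \frac{P(B \mid A)\,P(A)}{P(B)} \;=\; \frac{P(A)}{P(B)} \;\ge\; P(A)
```

The inequality is strict whenever P(A) > 0 and P(B) < 1, i.e. whenever B was not already certain.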

Asymmetry of inference:

There are two types of inference that function differently: predictive vs explanatory. Predictive inference reasons from causes to consequences, whereas explanatory inference reasons from consequences to causes.

Example: Seeing fire suggests that there is smoke (predictive). Seeing smoke suggests that there is a fire (explanatory).

Induced Dependency:

If you know A, then learning B can suggest C where it wouldn’t have if you hadn’t known A.

Example: Ordinarily, burglaries and earthquakes are unrelated. But if you know that your alarm is going off, then whether or not there was an earthquake is relevant to whether or not there was a burglary.
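
Here is a minimal sketch of the burglary/earthquake/alarm example as a tiny probability model. All the numbers are hypothetical, chosen only to make the pattern visible, and the function names are my own. It sums over complete world-states to show that an earthquake tells you nothing about a burglary until you know the alarm is ringing.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration.
P_BURGLARY = 0.01
P_EARTHQUAKE = 0.02
# P(alarm rings | burglary, earthquake)
P_ALARM = {
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}

def joint(b, e, a):
    """Probability of one complete world-state (burglary, earthquake, alarm)."""
    pb = P_BURGLARY if b else 1 - P_BURGLARY
    pe = P_EARTHQUAKE if e else 1 - P_EARTHQUAKE
    pa = P_ALARM[(b, e)] if a else 1 - P_ALARM[(b, e)]
    return pb * pe * pa

def prob_burglary(**evidence):
    """P(burglary | evidence); evidence may fix 'e' (earthquake) and/or 'a' (alarm)."""
    num = den = 0.0
    for b, e, a in product([True, False], repeat=3):
        world = {"b": b, "e": e, "a": a}
        if any(world[k] != v for k, v in evidence.items()):
            continue  # this world-state contradicts the evidence
        p = joint(b, e, a)
        den += p
        if b:
            num += p
    return num / den

# Without the alarm, the earthquake is irrelevant to the burglary.
no_alarm_info = (prob_burglary(), prob_burglary(e=True))
# With the alarm ringing, the earthquake becomes highly relevant.
with_alarm = (prob_burglary(a=True), prob_burglary(a=True, e=True))

print("P(burglary), P(burglary | quake):", no_alarm_info)
print("P(burglary | alarm), P(burglary | alarm, quake):", with_alarm)
```

With these made-up numbers, the first pair is identical, while in the second pair the earthquake drops the probability of a burglary from roughly 0.58 to roughly 0.03.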

Correlated Evidence:

Upon discovering that multiple sources of evidence have a common origin, the credibility of the hypothesis should be decreased.

Example: You learn on a radio report, TV report, and newspaper report that thousands died. You then learn that all three reports got their information from the same source. This decreases the credibility that thousands died.

Explaining away:

Finding a second explanation for an item of data makes the first explanation less credible. If A and B both suggest C, and C is true, then finding that B is true makes A less credible.

Example: Finding that my light bulb emits red light makes it less credible that the red-hued object in my hand is truly red.

Rule of the hypothetical middle:

If two diametrically opposed assumptions impart two different degrees of belief onto a proposition Q, then the unconditional degree of belief should be somewhere between the two.

Example: The plausibility of an animal being able to fly is somewhere between the plausibility of a bird flying and the plausibility of a non-bird flying.
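
In probabilistic terms this is just the law of total probability. Writing A for one assumption and ¬A for its opposite (my notation, not Pearl's):

```latex
P(Q) \;=\; P(Q \mid A)\,P(A) \;+\; P(Q \mid \neg A)\,\bigl(1 - P(A)\bigr)
```

Since P(A) lies between 0 and 1, P(Q) is a weighted average of the two conditional beliefs, and so it cannot fall outside the interval between them.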

Defeaters or Suppressors:

Even if as a general rule B is more likely given A, this does not necessarily mean that learning A makes B more credible. There may be other elements in your knowledge base K that explain A away. In fact, given the right background information, learning A might make B less likely (Simpson’s paradox). In other words, updating beliefs must involve searching your entire knowledge base for defeaters of general rules that are not directly inferentially connected to the evidence you receive.

Example 1: Learning that the ground is wet does not permit us to increase the certainty of “It rained”, because the knowledge base might contain “The sprinkler is on.”
Example 2: You have kidney stones and are seeking treatment. You additionally know that Treatment A makes you more likely to recover from kidney stones than Treatment B. But if you also have the background information that your kidney stones are large, then your recovery under Treatment A becomes less credible than under Treatment B.

Non-Transitivity:

Even if A suggests B and B suggests C, this does not necessarily mean that A suggests C.

Example 1: Your card being an ace suggests it is an ace of clubs. If your card is an ace of clubs, then it is a club. But if it is an ace, this does not suggest that it is a club.
Example 2: If the sprinkler was on, then the ground is wet. If the ground is wet, then it rained. But it’s not the case that if the sprinkler was on, then it rained.
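
The card example can be checked by brute force. This is a small sketch that reads "suggests" as "raises the probability of", which is my assumption about the intended meaning; the helper names are mine too.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely cards

def prob(event, given=None):
    """P(event | given) over a uniformly drawn card; given=None means no condition."""
    pool = [card for card in DECK if given is None or given(card)]
    hits = sum(1 for card in pool if event(card))
    return Fraction(hits, len(pool))

def is_ace(card):
    return card[0] == "A"

def is_ace_of_clubs(card):
    return card == ("A", "clubs")

def is_club(card):
    return card[1] == "clubs"

# A (ace) raises B (ace of clubs): 1/52 -> 1/4
a_raises_b = (prob(is_ace_of_clubs), prob(is_ace_of_clubs, given=is_ace))
# B (ace of clubs) raises C (club): 1/4 -> 1
b_raises_c = (prob(is_club), prob(is_club, given=is_ace_of_clubs))
# But A (ace) leaves C (club) untouched: 1/4 -> 1/4
a_vs_c = (prob(is_club), prob(is_club, given=is_ace))

print(a_raises_b, b_raises_c, a_vs_c)
```

Each link in the chain raises a probability, but the end-to-end probability does not move at all.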

Non-detachment:

Just learning that a proposition has changed in credibility is not enough to analyze the effects of the change; the reason for the change in credibility is relevant.

Example: You get a phone call telling you that your alarm is going off. Worried about a burglar, you head towards your home. On the way, you hear a radio announcement of an earthquake near your home. This makes it more credible that your alarm really is going off, but less credible that there was a burglary. In other words, your alarm going off decreased the credibility of a burglary, because it happened as a result of the earthquake, whereas typically an alarm going off would make a burglary more credible.

✯✯✯

All of these patterns should make a lot of sense to you when you give them a bit of thought. It turns out, though, that accommodating them in a system of inference is no easy matter.

Pearl distinguishes between extensional and intensional systems, and talks about the challenges for each approach. Extensional systems (including fuzzy logic and non-monotonic logic) focus on extending the truth values of propositions from {0,1} to a continuous range of uncertainty [0, 1], and then modifying the rules according to which propositions combine (for instance, the proposition “A & B” has the truth value min{A, B} in some extensional systems and A*B in others). The locality and simplicity of these combination rules turn out to be their primary failing; they lack the subtlety and nuance required to capture the complicated reasoning patterns above. Their syntactic simplicity makes them easy to work with, but curses them with semantic sloppiness.
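
To make the "locality" complaint concrete, here is a toy sketch of the two combination rules mentioned above. The negation rule 1 - a is an assumption on my part (it is a common convention, but the text does not specify one), and the function names are hypothetical.

```python
def and_min(a, b):
    """Conjunction as the minimum of the two truth values."""
    return min(a, b)

def and_product(a, b):
    """Conjunction as the product of the two truth values."""
    return a * b

def negate(a):
    """Negation as 1 - a (an assumed, commonly used convention)."""
    return 1.0 - a

a = 0.5  # "A" is held with middling certainty

# "A and not A" can never be true, yet both local rules give it weight.
contradiction = {
    "min": and_min(a, negate(a)),
    "product": and_product(a, negate(a)),
}

# "A and A" says no more than "A", yet the product rule discounts it.
redundant = {
    "min": and_min(a, a),
    "product": and_product(a, a),
}

print(contradiction)  # {'min': 0.5, 'product': 0.25}
print(redundant)      # {'min': 0.5, 'product': 0.25}
```

Because each rule only looks at the two input numbers, it has no way of noticing that the two inputs are about the same proposition. A probability assignment over world-states would give "A and not A" a value of 0 and "A and A" a value of 0.5.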

On the other hand, intensional systems (like probability theory) involve assigning a function from entire world-states (rather than individual propositions) to degrees of plausibility. This allows for the nuance required to capture all of the above patterns, but results in a huge blow up in complexity. True perfect Bayesianism is ridiculously computationally infeasible, as the operation of belief updating blows up exponentially as the number of atomic propositions increases. Thus, intensional systems are semantically clear, but syntactically messy.

A good summary of this from Pearl (p 12):

We have seen that handling uncertainties is a rather tricky enterprise. It requires a fine balance between our desire to use the computational permissiveness of extensional systems and our ability to refrain from committing semantic sins. It is like crossing a minefield on a wild horse. You can choose a horse with good instincts, attach certainty weights to it and hope it will keep you out of trouble, but the danger is real, and highly skilled knowledge engineers are needed to prevent the fast ride from becoming a disaster. The other extreme is to work your way by foot with a semantically safe intensional system, such as probability theory, but then you can hardly move, since every step seems to require that you examine the entire field afresh.

The challenge for extensional systems is to accommodate the nuance of correct inductive reasoning.

The challenge for intensional systems is to maintain their semantic clarity while becoming computationally feasible.

Pearl solves the second challenge by supplementing Bayesian probability theory with causal networks that give information about the relevance of propositions to each other, drastically simplifying the tasks of inference and belief propagation.

One more insight from Chapter 1 of the book… Pearl describes four primitive qualitative relationships in everyday reasoning: likelihood, conditioning, relevance, and causation. I’ll give an example of each, and how they are symbolized in Pearl’s formulation.

1. Likelihood (“Tim is more likely to fly than to walk.”)
P(A)

2. Conditioning (“If Tim is sick, he can’t fly.”)
P(A | B)

3. Relevance (“Whether Tim flies depends on whether he is sick.”)
A B

4. Causation (“Being sick caused Tim’s inability to fly.”)
P(A | do B)

The challenge is to find a formalism that fits all four of these, while remaining computationally feasible.

If all truths are knowable, then all truths are known

The title of this post is what’s called Fitch’s paradox of knowability.

It’s a weird result that arises from a few very intuitive assumptions about the notion of knowability. I’ll prove it here.

First, let’s list five assumptions. The first of these will be the only strong one – the others should all seem very obviously correct.

Assumptions

  1. All truths are knowable.
  2. If P & Q is known, then both P and Q are known.
  3. Knowledge entails truth.
  4. If P is possible and Q can be derived from P, then Q is possible.
  5. Contradictions are necessarily false.

Let’s put these assumptions in more formal language by using the following symbolization:

◇P means that P is possible
KP means that P is known by somebody at some time

Assumptions

  1. From P, derive ◇KP
  2. From K(P & Q), derive KP & KQ
  3. From KP, derive P
  4. From ◇P & (P → Q), derive ◇Q
  5. –◇[P & –P]

Now, the proof. First in English…

Proof

  1. Suppose that P is true and unknown.
  2. Then it is knowable that P is true and unknown.
  3. Thus it is possible that P is known and that it is known that P is unknown.
  4. So it is possible that P is both known and not known.
  5. Since 4 is a contradiction, it is not the case that P is true and unknown.
  6. In other words, if P is true, then it is known.

Follow all of that? Essentially, we assume that there is some statement P that is both true and unknown. But if this last sentence is true, and if all truths are knowable, then it should be a knowable truth. I.e. it is knowable that P is both true and unknown. But of course this can’t be knowable, since to know that P is both true and unknown is to both know it and not know it. And thus it must be the case that if all truths are knowable, then all truths are known.

I’ll write out the proof more formally now.

Proof

  1. P & –KP                Provisional assumption
  2. ◇K(P & –KP)        Assumption 1
  3. ◇(KP & K–KP)     Assumptions 2 and 4
  4. ◇(KP & –KP)        Assumptions 3 and 4
  5. –(P & –KP)            Reductio ad absurdum of 1, by Assumption 5
  6. P → KP                 Standard tautology

I love finding little examples like these where attempts to formalize our intuitions about basic concepts we use all the time lead us into disaster. You can’t simultaneously accept all of the following:

  • Not all truths are known.
  • All truths are knowable.
  • If P & Q is known, then both P and Q are known.
  • Knowledge entails truth.
  • If P is possible and P implies Q, then Q is possible.
  • Contradictions are necessarily false.

The value of the personal

I have been thinking about the value of powerful anecdotes. An easy argument for why we should be very cautious of personal experience and anecdotal evidence is that it has the potential to cause us to over-update. E.g. somebody that hears a few harrowing stories from friends about gun violence in Chicago is more likely to have an overly high estimation of how dangerous Chicago is.

Maybe the best way to formulate our worldview is in a cold and impersonal manner, disregarding most anecdotes in favor of hard data. This is the type of thing I might have once said, but I now think that this approach is likely just as flawed.

First of all, I think it’s an unrealistic demand on most people’s cognitive systems that they toss out the personal and compelling in their worldview.

And second of all, just like personal experience and anecdotal evidence have the potential to cause over-updating, statistical data and dry studies have the potential to cause under-updating.

Reading some psychological studies about the seriousness of the psychological harms of extended periods of solitary confinement is no match for witnessing or personally experiencing the effects of being locked in a tiny room alone for years. There’s a real and important difference between abstractly comprehending a fact and really understanding the fact. Other terms for this second thing include internalizing the fact, embodying it, and making it salient to yourself.

This difference is not easy to capture on a one-dimensional model of epistemology where beliefs are represented as simple real numbers. I’m not even sure if there’d be any good reason to build this distinction into artificial intelligences we might eventually construct. But it is there in us, and has a powerful influence.

How do we know whether somebody has really internalized a belief or not? I’m not sure, but here’s a gesture in what I think is the right direction.

We can conceive of somebody’s view of the world as a massive web of beliefs, where the connections between beliefs indicate dependencies and logical relations. To have fully internalized a belief is to have a web that is fully consistent with the truth of this belief. On the other hand, if you notice that somebody verbally reports that they believe A, but then also seem to believe B, C, and D, where all of these are inconsistent with A, then they have not really internalized A.

The worry is that a cold and impersonal approach to forming your worldview is the type of thing that would result in this type of inconsistency and disconnectedness in your web of beliefs, through the failure to internalize important facts.

Such failures become most obvious when you have a good sense of somebody’s values, and can simply observe their behavior to see what it reveals about their beliefs. If somebody is a pure act utilitarian (I know that nobody actually has a value system as simple as this, but just play along for a moment), then they should be sending their money wherever it would be better spent maximizing utility. If they are not doing so, then this reveals an implicit belief that there is no better way to be maximizing utility than by keeping their own money.

This is sort of an attempt to uncover somebody’s revealed beliefs, to steal the concept of revealed preferences from economics. Conflicts between revealed beliefs and stated beliefs indicate a lack of internalization.

Confirmation bias, or different priors?

Consider two people.

Person A is a person of color who grew up in an upper middle class household in a wealthy and crime-free neighborhood. This person has gone through their life encountering mostly others similar to themselves – well-educated liberals that believe strongly in socially progressive ideals. They would deny ever having personally experienced or witnessed racism, despite their skin color. In addition, when their friends discuss the problem of racism in America, they feel a baseline level of skepticism about the actual extent of the problem, and suspect that the whole thing has been overblown by sensationalism in the media. Certainly there was racism in the past, they reason, but the problem seems largely solved in the present day. This suspicion of the mainstream narrative seems confirmed by the emphasis placed on shootings of young black men like Michael Brown, which were upon closer reflection not clear cases of racial discrimination at all.

Person B grew up in a lower middle class family, and developed friendships across a wide range of socioeconomic backgrounds. They began witnessing racist behavior towards black friends at a young age. As they got older, this racism became more pernicious, and several close friends described their frustration at experiences of racial profiling. Many ended up struggling with the law, and some ended up in jail. Studying history, they could see that racism is not a new phenomenon, and descends from a long and brutal history of segregation, Jim Crow, and discriminatory housing practices. To Person B, it is extremely obvious that racism is a deeply pervasive force in society today, and that it results in many injustices. These injustices are those that sparked the Black Lives Matter movement, which they are an enthusiastic supporter of. They are aware that BLM has made some mistakes in the past, but they see a torrent of evidence in favor of the primary message of the movement: that policing is racially biased.

Now both A and B are presented with the infamous Roland Fryer study finding that when you carefully control for confounding factors, black people are no more likely to be shot by police officers than whites, and are in fact slightly less likely to be shot than whites.

Person A is not super surprised by these results, and feels vindicated in their skepticism of the mainstream narrative. To them, this is clear-cut evidence supporting their preconception that a large part of the Black Lives Matter movement rests on an exaggeration of the seriousness of racial issues in the country.

On the other hand, Person B right away dismisses the results of this study. They know from their whole life experience that the results must be flawed, and their primary take-away is the fallibility of statistics in analyzing complex social issues.

They examine the study closely, trying to find the flaw. They come up with a few hypotheses: (1) The data is subject to one or more selection biases, having been provided by police departments that have an interest in seeming colorblind, and having come from the post-Ferguson period in which police officers became temporarily more careful to not shoot blacks (dubbed the Ferguson effect). (2) The study looked at encounters between officers and blacks/whites, but didn’t take into account the effects of differential encounter rates. If police are more likely to pull over or arrest black people than white people for minor violations, then this would inflate the number of low-risk encounters involving black people and artificially lower their rate of shootings per encounter.

A and B now discuss the findings. Witnessing B’s response to the data, A perceives it as highly irrational. It appears to be a textbook example of confirmation bias. B immediately dismissed the data that contradicted their beliefs, and rationalized it by conjuring complicated explanations for why the results of the study were wrong. Sure, (thinks Person A) these explanations were clever and suggested the need for more nuance in the statistical analysis, but clearly it was more likely that the straightforward interpretation of the data was correct than these complicated alternatives.

B finds A’s quick acceptance of the results troubling and suggestive of a lack of nuance. To B, it appears that A already had their mind made up, and eagerly jumped onto this study without being sufficiently cautious.

Now the two are presented with a series of in-depth case studies of shootings of young black men that show clear evidence of racial profiling.

These stories fit perfectly within B’s worldview, and they find themselves deeply moved by the injustices that these young men experienced. Their dedication to the cause of fighting police brutality is reinvigorated.

But now A is the skeptical one. After all, the plural of anecdote is not data, and the existence of some racist cops by no means indicts all of society as racist. And what about all the similar stories with white victims that don’t get reported? They also recall the pernicious effects of cognitive biases that could make a young black man fed narratives of police racism more likely to see racism where there is none.

To B, all of this gives them the impression that A is doing cartwheels to avoid acknowledging the simple fact that racism exists.

In the first case, was Person B falling prey to confirmation bias? Was A? Were they both?

How about in the second case… Is A thinking irrationally, as B believes?

✯✯✯

I think that in both cases, the right answer is most likely no. In each case we had two people that were rationally responding to the evidence that they received, just starting from very different presuppositions.

Said differently, A and B had vastly different priors upon encountering the same data, and this difference is sufficient to explain their differing reactions. Given these priors it makes sense and is perfectly rational for B to quickly dismiss the Fryer report and search for alternative explanations, and for A to regard stories of racial profiling as overblown. It makes sense for the same reason that it makes sense for a scientist encountering a psychic that seems eerily accurate to right away write it off as some complicated psychological illusion… strong priors are not easily budged by evidence, and there are almost always alternative explanations that are more likely.

This is all perfectly Bayesian, by the way. If two interpretations of a data set equally well predict or make sense of the data (i.e. P(data | interpretation 1) = P(data | interpretation 2)), then their posterior odds ratio P(interpretation 1 | data) / P(interpretation 2 | data) should be no different from their prior ratio P(interpretation 1) / P(interpretation 2). In other words, strong priors dictate strong posteriors when the evidence is weakly discriminatory between the hypotheses.
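
The step being used here is the odds form of Bayes’ rule. Writing I₁ and I₂ for the two interpretations and D for the data (my shorthand):

```latex
\frac{P(I_1 \mid D)}{P(I_2 \mid D)}
  \;=\;
  \frac{P(D \mid I_1)}{P(D \mid I_2)} \cdot \frac{P(I_1)}{P(I_2)}
```

When the likelihood ratio P(D | I₁) / P(D | I₂) equals 1, the posterior odds are exactly the prior odds.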

When the evidence does rule out some interpretations, probability mass is preferentially shifted towards the interpretations that started with stronger priors. For instance, suppose you have three theories with credences (P(T1), P(T2), P(T3)) = (5%, 5%, 90%), and some evidence E is received that rules out T1, but is equally likely under T2 and T3. Then your posterior probabilities will be (P(T1 | E), P(T2 | E), P(T3 | E)) = (0%, 5.3%, 94.7%), since renormalizing gives 5/95 and 90/95.

T2 gains only about .3% credence, while T3 gains about 4.7%. In other words, while the odds between T2 and T3 stay the same, the actual probability mass has shifted relatively more towards theories favored in the prior.
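
For anyone who wants to redo the arithmetic, here is a minimal sketch of the update. The likelihood value 0.5 is arbitrary (any equal nonzero value for T2 and T3 gives the same answer), and the function name is my own.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a list of mutually exclusive, exhaustive theories."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

priors = [0.05, 0.05, 0.90]     # P(T1), P(T2), P(T3)
likelihoods = [0.0, 0.5, 0.5]   # P(E | T): E rules out T1, equal under T2 and T3

post = posterior(priors, likelihoods)
gains = [q - p for p, q in zip(priors, post)]

print([round(q, 4) for q in post])   # [0.0, 0.0526, 0.9474]
print([round(g, 4) for g in gains])  # [-0.05, 0.0026, 0.0474]
```

The renormalization puts T2 at about 5.3% and T3 at about 94.7%: the ratio between them is still 18 to 1, but nearly all of T1’s lost mass lands on T3.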

The moral of this story is to be careful when accusing others, internally or externally, of confirmation bias. What looks like mulish unwillingness to take seriously alternative hypotheses can actually be good rational behavior and the effect of different priors. Being able to tell the difference between confirmation bias and strong priors is a hard task – one that most people probably won’t undertake, opting instead to assume the worst of their ideological opponents.

Another moral is that privilege is an important epistemic concept. Privilege means, in part, having lived a life sufficiently insulated from injustice that it makes sense to wonder if it is there at all. Privilege is a set of priors tilted in favor of colorblindness and absolute moral progress. “Recognizing privilege” corresponds to doing anthropic reasoning to correct for selection biases in the things you have personally experienced, and adjusting your priors accordingly.

Objective Bayesianism and choices of concepts

Bayesians believe in treating belief probabilistically, and updating credences via Bayes’ rule. They face the problem of how to set priors – while probability theory gives a clear prescription for how to update beliefs, it doesn’t tell you what credences you should start with before getting any evidence.

Bayesians are thus split into two camps: objective Bayesians and subjective Bayesians. Subjective Bayesians think that there are no objectively correct priors. A corollary to this is that there are no correct answers to what somebody should believe, given their evidence.

Objective Bayesians disagree. Different variants specify different procedures for determining priors. For instance, the principle of indifference (POI) prescribes that the proper priors are those that are indifferent between all possibilities. If you have N possibilities, then according to the POI, you should distribute your prior credences evenly (1/N each). If you are considering a continuum of hypotheses (say, about the mass of an object), then the principle of indifference says that your probability density function should be uniform over all possible masses.

Now, here’s a problem for objective Bayesians.

You are going to be handed a cube, and all that you know about it is that it is smaller than 1 cm³. What should your prior distribution over possible cubes you might be handed look like?

Naively applying the POI, you might evenly distribute your credences across all volumes from 0 cm³ to 1 cm³ (so that there is a 50% chance that the cube has a volume less than .50 cm³ and a 50% chance that its volume is greater than .50 cm³).

But instead of choosing to be indifferent over possible volumes, we could equally well have chosen to be indifferent over possible side areas, or side lengths. The key point is that these are all different distributions. If we spread our credences evenly across possible side lengths from 0 cm to 1 cm, then we would have a distribution with a 50% chance that the cube has a volume less than .125 cm³ and a 50% chance that the volume is greater than this.

[Figure: Cube puzzle]

In other words, our choice of concepts (edge length vs side area vs volume) ends up determining the shape of our prior. Insofar as there is no objectively correct choice of concepts to be using, there is no objectively correct prior distribution.
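
A quick sketch of how much the three "indifferent" priors disagree, using the median volume as a summary. The sampling check, its seed, and its sample size are arbitrary choices of mine; the exact values follow from the fact that volume is an increasing function of side length and of side area.

```python
import random
import statistics

# Exponent that turns each "indifference variable" into a volume,
# for a cube whose variable ranges over [0, 1] (cm, cm^2, or cm^3).
TO_VOLUME = {"side length": 3.0, "side area": 1.5, "volume": 1.0}

def exact_median_volume(exponent):
    """Median volume when x is uniform on [0, 1] and volume = x ** exponent.

    x ** exponent is increasing in x, so the median of x (0.5) maps
    straight to the median volume.
    """
    return 0.5 ** exponent

def sampled_median_volume(exponent, n=20001, seed=0):
    """Monte Carlo check of the same quantity (seed and n are arbitrary)."""
    rng = random.Random(seed)
    return statistics.median(rng.random() ** exponent for _ in range(n))

exact = {name: exact_median_volume(k) for name, k in TO_VOLUME.items()}
sampled = {name: sampled_median_volume(k) for name, k in TO_VOLUME.items()}

for name in TO_VOLUME:
    print(f"uniform over {name}: median volume ~ {exact[name]:.3f} cm^3 "
          f"(sampled {sampled[name]:.3f})")
```

The exact medians come out to 0.125 cm³ for side length, about 0.354 cm³ for side area, and 0.5 cm³ for volume, so the three priors disagree about something as basic as where the halfway point is.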

I’ve known about this thought experiment for a while, but only recently internalized how serious of a problem it presents. It essentially says that your choice of priors is hostage to your choice of concepts, which is a pretty unsavory idea. In some cases, which concept to choose is very non-obvious (e.g. length vs area vs volume). In others, there are strong intuitions about some concepts being better than others.

The most famous example of this is contained in Nelson Goodman’s “new riddle of induction.” He proposes a new concept grue, which is defined as the set of objects that are either observed before 2100 and green, or observed after 2100 and blue. So if you spot an emerald before 2100, it is grue. So is a blue ball that you spot after 2100. But if you see an emerald after 2100, it will not be grue.

To characterize objects like this emerald that is observed after 2100, Goodman also creates another concept bleen, which is the inverse of grue. The set of bleen objects is composed of blue objects observed before 2100 and green objects observed after 2100.

Now, if we run ordinary induction using the concepts grue and bleen, we end up making bizarre predictions. For instance, say we observe many emeralds before 2100, and always find them to be green. By induction, we should infer that the next emerald we observe after 2100 is very likely going to be green as well. But if we thought in terms of the concepts grue and bleen, then we would say that all our observations of emeralds so far have provided inductive support for the claim “All emeralds are grue.” The implication of this is that the emeralds we observe after 2100 will very likely also be grue (and thus blue).

In other words, by simply choosing a different set of fundamental concepts to work with, we end up getting an entirely different prediction about the future.

Here’s one response that you’ve probably already thought of: “But grue and bleen are such weird artificial choices of concepts! Surely we can prefer green/blue over bleen/grue on the basis of the additional complexity required in specifying the transition time 2100?”

The problem with this is that we could equally well define green and blue in terms of grue and bleen:

Green = grue before 2100 or bleen after 2100
Blue = bleen before 2100 or grue after 2100

If for whatever reason somebody had grue and bleen as their primitive concepts, they would see green and blue as the concepts that required the additional complexity of the time specification.
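
The symmetry is easy to verify mechanically. In this sketch the predicates are defined over a color and an observation year; treating the year 2100 itself as "after" is an arbitrary choice on my part, since the text only says "before" and "after".

```python
CUTOFF = 2100  # observations in this year or later count as "after" (assumed)

def is_grue(color, year):
    """Green if observed before the cutoff, blue if observed after."""
    return (color == "green" and year < CUTOFF) or (color == "blue" and year >= CUTOFF)

def is_bleen(color, year):
    """Blue if observed before the cutoff, green if observed after."""
    return (color == "blue" and year < CUTOFF) or (color == "green" and year >= CUTOFF)

def is_green_from_grue(color, year):
    """Green, defined with grue and bleen as the primitive concepts."""
    return (is_grue(color, year) and year < CUTOFF) or (is_bleen(color, year) and year >= CUTOFF)

def is_blue_from_grue(color, year):
    """Blue, defined with grue and bleen as the primitive concepts."""
    return (is_bleen(color, year) and year < CUTOFF) or (is_grue(color, year) and year >= CUTOFF)

# The round trip recovers the ordinary color concepts exactly.
round_trip_ok = all(
    is_green_from_grue(color, year) == (color == "green")
    and is_blue_from_grue(color, year) == (color == "blue")
    for color in ("green", "blue")
    for year in (2099, 2100, 2101)
)
print(round_trip_ok)  # True
```

Both pairs of definitions need exactly one reference to the cutoff, so counting time specifications does not favor either vocabulary.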

“Okay, sure, but this is only if we pretend that color is something that doesn’t emerge from lower physical levels. If we tried specifying the set of grue objects in terms of properties of atoms, we’d have a lot harder time than if we tried specifying the set of green or blue objects in terms of properties of atoms.”

This is right, and I think it’s a good response to this particular problem. But it doesn’t work as a response to a more generic form of the dilemma. In particular, you can construct a grue/bleen-style set of concepts for whatever you think is the fundamental level of reality. If you think electrons and neutrinos are undecomposable into smaller components, then you can imagine “electrinos” and “neuctrons.” And now we have the same issue as before… thinking in terms of electrinos would lead us to conclude that all electrons will suddenly transform into neutrinos in 2100.

The type of response I want to give is that concepts like “electron” and “neutrino” are preferable to concepts like “electrinos” and “neuctrons” because they mirror the structure of reality. Nature herself computes electrons, not electrinos.

But the problem is that we’re saying that in order to determine which concepts we should use, we need to first understand the broad structure of reality. After which we can run some formal inductive schema to, y’know, figure out the broad structure of reality.

Said differently, we can’t really appeal to “the structure of reality” to determine our choices of concepts, since our choices of concepts end up determining the results of our inductive algorithms, which are what we’re relying on to tell us the structure of reality in the first place!

This seems like a big problem to me, and I don’t know how to solve it.

Overemphasizing disagreement

I’ve noticed a tendency in myself and others during debate to only respond to the parts of what others say that I disagree with, taking the agreement for granted. This makes some sense; if you agree with 95% of an argument somebody is making, there is the most progress to be made by focusing on the 5% remaining difference. But I think this also causes the perception on both sides that there is a greater distance to be bridged than there is in reality. Constantly focusing on subtle points of disagreement can also be perceived as being unresponsive and indifferent towards a significant part of the arguments being made.

This is an optimistic take on phenomena like the backfire effect – a lot of the sense that you and your interlocutor are getting no closer during conversation might be the result of this form of miscommunication. I think a good policy for reducing misunderstanding is something like Rapoport’s rules – explicitly stating points of agreement before going into disagreement. This isn’t only good for reducing misunderstanding – I’ve noticed that stating points of agreement, especially things I’ve just been convinced of, has the effect of actually making it easier to change my mind.

Why minimizing sum of squares is equivalent to frequentist inference

(This will be the first in a short series of posts describing how various commonly used statistical methods are approximate versions of frequentist, Bayesian, and Akaike-ian inference)

Suppose that we have some data D = { (x₁, y₁), (x₂, y₂), … , (xɴ, yɴ) }, and a candidate function y = f(x).

Frequentist inference involves the assessment of the likelihood of the data given this candidate function: P(D | f).

Since D is composed of N independent data points, we can assess the probability of each data point separately, and multiply them all together.

P(D | f) = P(x₁, y₁ | f) P(x₂, y₂ | f) … P(xɴ, yɴ | f)

So now we just need to answer the question: What is P(x, y | f)?

f predicts that for the value x, the most likely y-value is f(x).

We assume the other possible y-values are normally distributed around f(x), with some fixed standard deviation σ.


The equation for this distribution is a Gaussian:

P(x, y | f) = exp[ -(y – f(x))² / 2σ² ] / √(2πσ²)

Now that we know how to find P(x, y | f), we can easily calculate P(D | f)!

P(D | f) = exp[ -(y₁ – f(x₁))² / 2σ² ] / √(2πσ²) ・ exp[ -(y₂ – f(x₂))² / 2σ² ] / √(2πσ²) … exp[ -(yɴ – f(xɴ))² / 2σ² ] / √(2πσ²)
= exp[ -(y₁ – f(x₁))² / 2σ² ] ・ exp[ -(y₂ – f(x₂))² / 2σ² ] … exp[ -(yɴ – f(xɴ))² / 2σ² ] / (2πσ²)^(N/2)

Products are messy and logarithms are monotonic, so log(P(D | f)) is easier to work with: it turns the product into a sum.

log P(D | f) = log( exp[ -(y₁ – f(x₁))² / 2σ² ] … exp[ -(yɴ – f(xɴ))² / 2σ² ] / (2πσ²)^(N/2) )
= log( exp[ -(y₁ – f(x₁))² / 2σ² ] ) + … + log( exp[ -(yɴ – f(xɴ))² / 2σ² ] ) – N/2 log(2πσ²)
= -(y₁ – f(x₁))² / 2σ² – … – (yɴ – f(xɴ))² / 2σ² – N/2 log(2πσ²)
= -(1/2σ²) [ (y₁ – f(x₁))² + … + (yɴ – f(xɴ))² ] – N/2 log(2πσ²)

Now notice that the sum of squares just naturally pops out!

SOS = (y₁ – f(x₁))² + … + (yɴ – f(xɴ))²
log P(D | f) = -SOS/2σ² – N/2 log(2πσ²)
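As a sanity check on the algebra, here is a minimal Python sketch that computes log P(D | f) two ways: directly, as the log of the product of Gaussian densities, and through the closed form in terms of SOS. The data points, the candidate function f(x) = 2x, and the value of σ are all made up for illustration; nothing about them comes from a real dataset.

```python
import math

def gaussian_density(y, mean, sigma):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def log_likelihood_direct(data, f, sigma):
    """log P(D | f), computed as the log of the product of per-point densities."""
    product = 1.0
    for x, y in data:
        product *= gaussian_density(y, f(x), sigma)
    return math.log(product)

def log_likelihood_closed_form(data, f, sigma):
    """log P(D | f) = -SOS / 2σ² - (N/2) log(2πσ²)."""
    sos = sum((y - f(x)) ** 2 for x, y in data)
    n = len(data)
    return -sos / (2 * sigma ** 2) - (n / 2) * math.log(2 * math.pi * sigma ** 2)

# Made-up data and candidate function (assumptions for illustration only).
data = [(0.0, 0.1), (1.0, 2.3), (2.0, 3.8), (3.0, 6.2)]
f = lambda x: 2 * x
sigma = 1.0

direct = log_likelihood_direct(data, f, sigma)
closed = log_likelihood_closed_form(data, f, sigma)
print(direct, closed)  # the two values should agree up to floating-point error
```

For large N the direct product underflows to zero, which is one more practical reason to work with the log form.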

Frequentist inference chooses f to maximize P(D | f). We can now immediately see why this is equivalent to minimizing SOS!

argmax{ P(D | f) }
= argmax{ log P(D | f) }
= argmax{ – SOS/2σ² – N/2 log(2πσ²) }
= argmin{ SOS/2σ² + N/2 log(2πσ²) }
= argmin{ SOS/2σ² }
= argmin{ SOS }
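To see the equivalence concretely, here is a small Python sketch that scores a family of candidate functions both ways and checks that the same candidate wins. The data and the family of candidates (lines through the origin, y = m·x, with slopes on a grid) are made up for illustration. Note the assumption doing the work: σ is the same for every candidate, so the N/2 log(2πσ²) term is a constant and drops out of the comparison.

```python
import math

def sum_of_squares(data, f):
    """SOS = sum over data points of (y - f(x))²."""
    return sum((y - f(x)) ** 2 for x, y in data)

def log_likelihood(data, f, sigma):
    """Gaussian log-likelihood: -SOS / 2σ² - (N/2) log(2πσ²)."""
    n = len(data)
    return -sum_of_squares(data, f) / (2 * sigma ** 2) - (n / 2) * math.log(2 * math.pi * sigma ** 2)

# Made-up data (an assumption for illustration only).
data = [(0.0, 0.1), (1.0, 2.3), (2.0, 3.8), (3.0, 6.2)]
sigma = 0.5  # shared by all candidates

# Hypothetical candidate family: lines through the origin with slopes 0.0, 0.1, ..., 4.0.
slopes = [k / 10 for k in range(0, 41)]
candidates = {m: (lambda x, m=m: m * x) for m in slopes}

best_by_likelihood = max(candidates, key=lambda m: log_likelihood(data, candidates[m], sigma))
best_by_sos = min(candidates, key=lambda m: sum_of_squares(data, candidates[m]))

print(best_by_likelihood, best_by_sos)  # both selections pick the same slope
```

Changing sigma to any other positive value rescales the log-likelihoods but leaves the winner unchanged.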

Next, we’ll go Bayesian…

Constructing the world

In this six and a half hour lecture series by David Chalmers, he describes the concept of a minimal set of statements from which all other truths are a priori “scrutable” (meaning, basically, in-principle knowable or derivable).

What are the types of statements in this minimal set required to construct the world? Chalmers offers up four categories, and abbreviates this theory PQIT.

P

P is the set of physical facts (for instance, everything that would be accessible to a Laplacean demon). It can be thought of as essentially the initial conditions of the universe and the laws governing their changes over time.

Q

Q is the set of facts about qualitative experience. We can see Chalmers’ rejection of physicalism here, as he doesn’t think that Q is eclipsed within P. Example of a type of statement that cannot be derived from P without Q: “There is a beige region in the bottom right of my visual field.”

I

Here’s a true statement: “I’m in the United States.” Could this be derivable from P and Q? Presumably not; we need another set of indexical truths that allows us to have “self-locating” beliefs and to engage in anthropic reasoning.

T

Suppose that P, Q, and I really are able to capture all the true statements there are to be captured. Well then, the statement “P, Q, and I really are able to capture all the true statements there are to be captured” is a true statement, and it is presumably not captured by P, Q, and I! In other words, we need some final negative statements that tell us that what we have is enough, and that there are no more truths out there. These “that’s all”-type statements are put into the set T.

⁂⁂⁂

So this is a basic sketch of Chalmers’ construction. I like that we can use these tags like PQIT or PT or QIT as a sort of philosophical zip-code indicating the core features of a person’s philosophical worldview. I also want to think about developing this further. What other possible types of statements are there out there that may not be captured in PQIT? Here is a suggestion for a more complete taxonomy:

p    microphysics
P    macrophysics (by which I mean all of science besides fundamental physics)
Q    consciousness
R    normative rationality
E    normative ethics
C    counterfactuals
L    mathematical / logical truths
I    indexicals
T    “that’s-all” statements

I’ve split P into big-P (macrophysics) and little-p (microphysics) to account for the disagreements about emergence and reductionism. Normativity here is broad enough to include both normative epistemic statements (e.g. “You should increase your credence in the next coin toss landing H after observing it land H one hundred times in a row”) and ethical statements. The others are fairly self-explanatory.

The most ontologically extravagant philosophical worldview would then be characterized as pPQRECLIT.

My philosophical address is pRLIT (with the addendum that I think C comes from p, and am really confused about Q). What’s yours?

Getting evidence for a theory of consciousness

I’ve been reading about the integrated information theory of consciousness lately, and wondering about the following question. In general, what are the sources of evidence we have for a theory of consciousness?

One way to think about this is to imagine yourself teleported hundreds of years into the future and talking to a scientist in this future world. This scientist tells you that in his time, consciousness is fully understood. What sort of experiments would you expect to be able to run to verify for yourself that the future’s theory of consciousness really is sufficient?

One thing you could do is point to a bunch of different physical systems, ask the scientist what his theory of consciousness says about them, and compare its answers to your intuitions. So, for instance, does the theory say that you are conscious? What about humans in general? What about people in deep sleep? How about dogs? Chickens? Frogs? Insects? Bacteria? Are Siri-style computer programs conscious? What about a rock? And so on.

The obvious problem with this is that it assumes the validity of your intuitions about consciousness. Sure, it seems obvious that a rock is not conscious, that humans generally are, and that dogs are conscious but less so than humans. But how do we know that these are trustworthy intuitions?

I think the validity of these intuitions is necessarily grounded in our phenomenology and our observations of how it correlates with our physical substance. So, for instance, I notice that when I fall asleep, my consciousness fades in and out. On the other hand, when I wiggle my big toe, this has an effect on the character of my conscious experience, but doesn’t shut it off entirely. This tells me that something about what happens to my body when I fall asleep is relevant to the maintenance of my consciousness, while the angle of my big toe is not.

In general, we make many observations like these and piece together a general theory of how consciousness relates to the physical world, not just in terms of the existence of consciousness, but also in terms of what specific conscious experiences we expect for a given change to our physical system. It tells us, for instance, that receiving a knock on the head or drinking too much alcohol is sometimes sufficient to temporarily suspend consciousness, while breaking a finger or cutting your hair is not.

Now, since we are able to intervene on our physical body at will and observe the results, our model is a causal model. An implication of this is that it should be able to handle counterfactuals. So, for instance, it can give us an answer to the question “Would I still be conscious if I cut my hair off, changed my skin color, shrunk several inches in height, and got a smaller nose?” This answer is presumably yes, because our theory distinguishes between physical features that are relevant to the existence of consciousness and those that are not.

Extending this further, we can ask if we would still be conscious if we gradually morphed into another human being, with a different brain and body. Again, the answer would appear to be yes, as long as nothing essential to the existence of consciousness is severed along the way. But now we are in a position to be able to make inferences about the existence of consciousness in bodies outside our own! For if I think that I would be conscious if I slowly morphed into my boyfriend, then I should also believe that my boyfriend is conscious himself. I could deny this by denying that the same physical states give rise to the same conscious states, but while this is logically possible, it seems quite implausible.

This gives rational grounds for our belief in the existence of consciousness in other humans, and allows us justified access to all of the work in neuroscience analyzing the connection between the brain and consciousness. It also allows us to have a baseline level of trust in the self-reports of other people about their conscious experiences, given the observation that we are generally reliable reporters of our conscious experience.

Bringing this back to our scientist from the future, I can think of some much more convincing tests I would do than the ‘tests of intuition’ that we did at first. Namely, suppose that the scientist was able to take any description of an experience, translate that into a brain state, and then stimulate your brain in such a way as to produce that experience for you. So over and over you submit requests – “Give me a new color experience that I’ve never had before, but that feels vaguely pinkish and bluish, with a high-pitched whine in the background”, “Produce in me an emotional state of exaltation, along with the sensation of warm wind rushing through my hair and a feeling of motion”, etc. – and over and over the scientist is able to match your request excellently. (Also, wow, imagine how damn cool it would be if we could actually do this.)

You can also run the inverse test: you tell the scientist the details of an experience you are having while your brain is being scanned (in such a way that the scientist cannot see it). Then the scientist runs some calculations using their theory of consciousness and makes some predictions about what they’ll see on the brain scan. Now you check the brain scan to see if their predictions have come true.

To me, repeated success in experiments of this kind would be supremely convincing. If a scientist of the future was able to produce at will any experience I asked for (presuming my requests weren’t so far out as to be physically impossible), and was able to accurately translate facts about my consciousness into facts about my brain, and could demonstrate this over and over again, I would be convinced that this scientist really does have a working theory of consciousness.

And note that since this is all rooted in phenomenology, it’s entirely uncoupled from our intuitive convictions about consciousness! It could turn out that the exact framework the scientist is using to calculate the connections between my physical body and my consciousness ends up necessarily entailing that rocks are conscious and that dolphins are not. And if the framework’s predictive success had been demonstrated with sufficient robustness before, I would just have to accept this conclusion as unintuitive but true. (Of course, it would be really hard to imagine how any good theory of consciousness could end up coming to this conclusion, but that’s beside the point.)

So one powerful source of evidence we have for testing a theory of consciousness is the correlations between our physical substance and our phenomenology. Is that all, or are there other sources of evidence out there?

We can straightforwardly adopt some principles from the philosophy of science, such as the importance of simplicity and avoiding overfitting in formulating our theories. So for instance, one theory of consciousness might just be an exhaustive list of every physical state of the brain and what conscious experience this corresponds to. In other words, we could imagine a theory in which all of the basic phenomenological facts of consciousness are taken as individual independent axioms. While this theory will be fantastically accurate, it will be totally worthless to us, and we’d have no reason to trust its predictive validity.

So far, we really just have three criteria for evidence:

  1. Correlations between phenomenology and physics
  2. Simplicity
  3. Avoiding overfitting

As far as I’m concerned, this is all that I’m really comfortable with counting as valid evidence. But these are very much not the only sources of evidence that get referenced in the philosophical literature. There are a lot of arguments that get thrown around concerning the nature of consciousness that I find really hard to classify neatly, although often these arguments feel very intuitively appealing. For instance, one of my favorite arguments for functionalism is David Chalmers’ ‘Fading Qualia’ argument. It goes something like this:

Imagine that scientists of the future are able to produce silicon chips that are functionally identical to neurons and can replicate all of their relevant biological activity. Now suppose that you undergo an operation in which gradually, every single part of your nervous system is substituted out for silicon. If the biological substrate implementing the functional relationships is essential to consciousness, then by the end of this procedure you will no longer be conscious.

But now we ask: when did the consciousness fade out? Was it a sudden or a gradual process? Both seem deeply implausible. Firstly, we shouldn’t expect a sudden drop-out of consciousness from the removal of a single neuron or cluster of neurons, as this would be a highly unusual level of discreteness. This would also imply the ability to switch on and off the entirety of your consciousness with seemingly insignificant changes to the biological structure of your nervous system.

And secondly, if it is a gradual process, then this implies the existence of “pseudo-conscious” states in the middle of the procedure, where your experiences are markedly distinct from those of the original being but you are pretty much always wrong about your own experiences. Why? Well, the functional relationships have stayed the same! So your beliefs about your conscious states, the memories you form, the emotional reactions you have, will all be exactly as if there has been no change to your conscious states. This seems totally bizarre and, in Chalmers’ words, “we have little reason to believe that consciousness is such an ill-behaved phenomenon.”

Now, this is a fairly convincing argument to me. But I have a hard time understanding why it should be. The argument’s convincingness seems to rely on some very high-level abstract intuitions about the types of conscious experiences we imagine organisms could be having, and I can’t think of a great reason for trusting these intuitions. Maybe we could chalk it up to simplicity, and argue that the notion of consciousness entailed by substrate-dependence must be extremely unparsimonious. But even this connection is not totally clear to me.

A lot of the philosophical argumentation about consciousness feels this way to me; convincing and interesting, but hard to make sense of as genuine evidence.

One final style of argument that I’m deeply skeptical of is arguments from pure phenomenology. This is, for instance, how Giulio Tononi likes to argue for his integrated information theory of consciousness. He starts from five supposedly self-evident truths about the character of conscious experience, then attempts to infer facts about the structure of the physical systems that could produce such experiences.

I’m not a big fan of Tononi’s observations about the character of consciousness. They seem really vaguely worded and hard enough to make sense of that I have no idea if they’re true, let alone self-evident. But it is his second move that I’m deeply skeptical of. The history of philosophers trying to move from “self-evident intuitive truths” to “objective facts about reality” is pretty bad. While we might be plenty good at detailing our conscious experiences, trying to make the inferential leap to the nature of the connection between physics and consciousness is not something you can do just by looking at phenomenology.

The problem with philosophy

(Epistemic status: I have a high credence that I’m going to disagree with large parts of this in the future, but it all seems right to me at present. I know that’s non-Bayesian, but it’s still true.)

Philosophy is great. Some of the clearest thinkers and most rational people I know come out of philosophy, and many of my biggest worldview-changing moments have come directly from philosophers. So why is it that so many scientists seem to feel contempt towards philosophers and condescension towards their intellectual domain? I can actually occasionally relate to the irritation, and I think I understand where some of it comes from.

Every so often, a domain of thought within philosophy breaks off from the rest of philosophy and enters the sciences. Usually when this occurs, the subfield (which had previously been stagnant and unsuccessful in its attempts to make progress) is swiftly revolutionized and most of the previous problems in the field are promptly solved.

Unfortunately, what also often happens is that the philosophers who were previously working in the field are unaware of or ignore the change, and end up wasting a lot of time and looking pretty silly. Sometimes they even explicitly challenge the scientists at the forefront of this revolution, like Henri Bergson did with Einstein after Einstein came out with his pesky new theory of time that swept away much of the past work of philosophers in one fell swoop.

Next you get a generation of philosophy students who are taught a bunch of obsolete theories, and they are later blindsided when they encounter scientists who inform them that the problems they’re working on were solved decades ago. And by this point the scientists have left the philosophers so far in the dust that the typical philosophy student is incapable of understanding the answers to their questions without learning a whole new area of math or something. Thus usually the philosophers just keep on their merry way, asking each other increasingly abstruse questions and working harder and harder to justify their own intellectual efforts. Meanwhile scientists move further and further beyond them, occasionally dropping in to laugh at their colleagues who are stuck back in the Middle Ages.

Part of why this happens is structural. Philosophy is the womb inside which develops the seeds of great revolutions of knowledge. It is where ideas germinate and turn from vague intuitions and hotbeds of conceptual confusion into precisely answerable questions. And once these questions are answerable, the scientists and mathematicians sweep in and make short work of them, finishing the job that philosophy started.

I think that one area in which this has happened is causality.

Statisticians now know how to model causal relationships, how to distinguish them from mere regularities, how to deal with common causes and causal pre-emption, how to assess counterfactuals and assign precise probabilities to these statements, and how to compare different causal models and determine which is most likely to be true.

(By the way, guess where I came to be aware of all of this? It wasn’t in the metaphysics class in which we spent over a month discussing the philosophy of causation. No, it was a statistician friend of mine who showed me a book by Judea Pearl and encouraged me to get up to date with modern methods of causal modeling.)

Causality as a subject has firmly and fully left the domain of philosophy. We now have a fully fleshed out framework of causal reasoning that is capable of answering all of the ancient philosophical questions and more. This is not to say that there is no more work to be done on understanding causality… just that this work is not going to be done by philosophers. It is going to be done by statisticians, computer scientists, and physicists.

Another area besides causality where I think this has happened is epistemology. Modern advances in epistemology are not coming out of the philosophy departments. They’re coming out of machine learning institutes and from artificial intelligence researchers, who are working on turning the question of “how do we optimally come to justified beliefs in a posteriori matters?” into precise code-able algorithms.

I’m thinking about doing a series of posts called “X for philosophers”, in which I take an area of inquiry that has historically been the domain of philosophy, and explain how modern scientific methods have solved or are solving the central questions in this area.

For instance, here’s a brief guide to how to translate all the standard types of causal statements philosophers have debated for centuries into simple algebra problems:

Causal model

An ordered triple of exogenous variables, endogenous variables, and structural equations for each endogenous variable

Causal diagram

A directed acyclic graph representing a causal model, whose nodes represent the endogenous variables and whose edges represent the structural equations

Causal relationship

A directed edge in a causal diagram

Causal intervention

A mutilated causal diagram in which the edges between the intervened node and all its parent nodes are removed

Probability of A if B

P(A | B)

Probability of A if we intervene on B

P(A | do B) = P(A_B)

(Here A_B denotes A evaluated in the mutilated model in which B has been set by intervention, and -B denotes not-B.)

Probability that A would have happened, had B happened

P(A_B | -B)

Probability that B is a necessary cause of A

P(-A_{-B} | A, B)

Probability that B is a sufficient cause of A

P(A_B | -A, -B)

Right there is the guide to understanding the nature of causal relationships, and assessing the precise probabilities of causal conditional statements, counterfactual statements, and statements of necessary and sufficient causation.
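To make these definitions less abstract, here is a minimal Python sketch of a toy causal model in which each of the quantities above can be computed exactly by enumerating the exogenous variables. Everything in it is a made-up assumption for illustration: the variable C (a common cause of B and A), the particular structural equations, and the probabilities attached to the exogenous variables. An intervention is implemented as described above, by overriding the structural equation of the intervened variable; a counterfactual re-solves the model under the intervention while keeping the same exogenous setting as the actual world.

```python
from fractions import Fraction
from itertools import product

# Hypothetical exogenous variables and their probabilities of being 1.
EXOGENOUS = {"u_c": Fraction(1, 2), "u_b": Fraction(1, 5), "u_a": Fraction(1, 10)}

# Hypothetical structural equations, in causal order: C -> B, C -> A, B -> A.
EQUATIONS = {
    "C": lambda v: v["u_c"],
    "B": lambda v: v["C"] or v["u_b"],
    "A": lambda v: v["C"] or (v["B"] and v["u_a"]),
}

def solve(u, do=None):
    """Evaluate the endogenous variables for exogenous setting u.
    `do` maps intervened variables to fixed values (the mutilated model)."""
    v = dict(u)
    for name, equation in EQUATIONS.items():
        v[name] = do[name] if do and name in do else equation(v)
    return v

def worlds():
    """Yield (exogenous setting, probability) for every possible setting."""
    names = list(EXOGENOUS)
    for bits in product([0, 1], repeat=len(names)):
        u = dict(zip(names, bits))
        p = Fraction(1)
        for n in names:
            p *= EXOGENOUS[n] if u[n] else 1 - EXOGENOUS[n]
        yield u, p

def prob(event, given=lambda actual: True):
    """P(event | given) by exact enumeration. `event(u, actual)` may re-solve
    the model under an intervention using the same exogenous setting u."""
    num = den = Fraction(0)
    for u, p in worlds():
        actual = solve(u)
        if given(actual):
            den += p
            if event(u, actual):
                num += p
    return num / den

# P(A | B): ordinary conditioning on the observed value of B.
p_a_given_b = prob(lambda u, w: w["A"] == 1, lambda w: w["B"] == 1)
# P(A | do B): A in the model where B is set to 1 by intervention.
p_a_do_b = prob(lambda u, w: solve(u, {"B": 1})["A"] == 1)
# Counterfactual: A would have happened had B happened, given B did not.
p_counterfactual = prob(lambda u, w: solve(u, {"B": 1})["A"] == 1, lambda w: w["B"] == 0)
# Necessity: without B, A would not have happened, given both happened.
p_necessary = prob(lambda u, w: solve(u, {"B": 0})["A"] == 0,
                   lambda w: w["A"] == 1 and w["B"] == 1)
# Sufficiency: with B, A would have happened, given neither happened.
p_sufficient = prob(lambda u, w: solve(u, {"B": 1})["A"] == 1,
                    lambda w: w["A"] == 0 and w["B"] == 0)

print(p_a_given_b, p_a_do_b, p_counterfactual, p_necessary, p_sufficient)
```

In this toy model P(A | B) comes out larger than P(A | do B), because observing B is also evidence for the common cause C, while setting B by intervention is not.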

To most philosophy students and professors, what I’ve written is probably chicken-scratch. But understanding it is crucial if they are to avoid becoming obsolete in their causal thinking.

There’s an unhealthy tendency amongst some philosophers, when presented with such chicken-scratch, to dismiss it as not being philosophical enough and then go back to reading David Lewis’s arguments for the existence of possible worlds. I think this accounts for a large part of scientists’ tendency to dismiss philosophers as outdated and intellectually behind the times. And it’s hard not to agree with them when you’ve seen both the crystal-clear beauty of formal causal modeling, and also the debates over things like how to evaluate the actual “distance” between possible worlds.

Artificial intelligence researcher extraordinaire Stuart Russell has said that he knew immediately upon reading Pearl’s book on causal modeling that it was going to change the world. Philosophy professors should either teach graph theory and Bayesian networks, or they should not make a pretense of teaching causality at all.