Simple induction

In front of you is a coin. You don’t know the bias of this coin, but you have some prior probability distribution over possible biases (between 0: always tails, and 1: always heads). This distribution has some statistical properties that characterize it, such as a standard deviation and a mean. And from this prior distribution, you can predict the outcome of the next coin toss.

Now the coin is flipped and lands heads. What is your prediction for the outcome of the next toss?

This is a dead simple example of a case where there is a correct answer to how to reason inductively. It is as correct as any deductive proof, and derives a precise and unambiguous result:

P(H on the next toss | H on this toss) = (µ² + σ²) / µ = µ · (1 + (σ/µ)²)

where µ and σ are the mean and standard deviation of your prior distribution over the bias.

This is a law of rational thought, just as rules of logic are laws of rational thought. It’s interesting to me how the understanding of the structure of inductive reasoning begins to erode the apparent boundary between purely logical a priori reasoning and supposedly a posteriori inductive reasoning.

Anyway, here’s one simple conclusion that we can draw from the equation above: after the coin lands heads, it should be more likely that the coin will land heads next time. After all, the initial credence was µ, and the final credence is µ multiplied by a factor, 1 + (σ/µ)², that is greater than 1 whenever σ > 0.

You probably didn’t need to see an equation to guess that for each toss that lands H, future tosses landing H become more likely. But it’s nice to see the fundamental justification behind this intuition.

We can also examine some special cases. For instance, consider a uniform prior distribution (corresponding to maximum initial uncertainty about the coin bias). For this distribution, µ = 1/2 and σ² = 1/12. Thus we arrive at the conclusion that after getting one heads, your credence in the next toss landing heads should be 1/2 + (1/12)/(1/2) = 2/3 (about 67%, up from 50%).

We can get a sense of the insufficiency of point estimates using this example. Two prior distributions with the same average value will respond very differently to evidence, and thus the final point estimate of the chance of H will differ. But what is interesting is that while the mean is insufficient, just the mean and standard deviation suffice for inferring the value of the next point estimate.
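As a quick numerical sanity check of both claims, here is a sketch (my own, not from the post) that takes two hypothetical Beta priors sharing a mean of 1/2 but differing in spread, computes the posterior predictive for the next toss by direct numerical integration, and compares it against the µ + σ²/µ rule reconstructed above:

import numpy as np
from scipy.stats import beta

def predictive_after_one_heads(prior_pdf, grid=np.linspace(0, 1, 100001)):
    # P(next toss is H | one H observed), by direct numerical integration over the bias
    w = prior_pdf(grid)
    w /= np.trapz(w, grid)                      # normalize the prior
    posterior = w * grid                        # prior × likelihood of one H
    posterior /= np.trapz(posterior, grid)
    return np.trapz(posterior * grid, grid)     # expected bias under the posterior

for a, b in [(2, 2), (20, 20)]:                 # same mean (1/2), very different σ
    mu, sigma = beta.mean(a, b), beta.std(a, b)
    numeric = predictive_after_one_heads(lambda p: beta.pdf(p, a, b))
    print(f"Beta({a},{b}): numeric {numeric:.4f}   µ + σ²/µ = {mu + sigma**2 / mu:.4f}")

The wide prior jumps from 0.5 to 0.6, while the narrow one barely moves (to about 0.512), even though both started with the same point estimate.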

In general, the dynamics are controlled by the term σ/µ. As σ/µ goes to zero (which corresponds to a tiny standard deviation, or a very confident prior), our update goes to zero as well. And as σ/µ gets large (either by a weak prior or a low initial credence in the coin being H-biased), the observation of H causes a greater update. How large can this term possibly get? Obviously, the updated point estimate should asymptote towards 1, but this is not obvious from the form of the equation we have (it looks like σ/µ can get arbitrarily large, forcing our final point estimate to infinity).

What we need to do is optimize the updated point estimate, while taking into account the constraints implied by the relationship between σ and µ.
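Here is that missing step, as I would fill it in, using only the fact that the bias x lies between 0 and 1 (so that x² ≤ x):

σ² = E[x²] − µ² ≤ E[x] − µ² = µ(1 − µ)

which gives (σ/µ)² ≤ (1 − µ)/µ, and therefore

µ · (1 + (σ/µ)²) ≤ µ + (1 − µ) = 1

So the updated point estimate can approach 1 but never exceed it; equality holds only for a prior whose mass sits entirely on the extreme biases 0 and 1.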

The North Korea problem isn’t solved

Donald Trump and Kim Jong Un just met and signed a deal committing North Korea to nuclear disarmament. Yay! Problem solved!

Except that there’s a long historical precedent of North Korea signing deals just like this one, only to immediately go back on them. Here’s a timeline for some relevant historical context.

1985: North Korea signs Nuclear Non-Proliferation Treaty
1992: North Korea signs historic agreement to halt nuclear program! (#1)
1993: North Korea is found to be cheating on its commitments under the NPT
1994: In exchange for US assistance in production of proliferation-free nuclear power plants, North Korea signs historic agreement to halt nuclear program! (#2)
1998: North Korea is suspected of having an underground nuclear facility
1998: North Korea launches missile tests over Japan
1999: North Korea signs historic agreement to end missile tests, in exchange for a partial lifting of economic sanctions by the US.
2000: North Korea signs historic agreement to reunify Korea! Nobel Peace Prize is awarded
2002-2003: North Korea admits to having a secret nuclear weapons program, and withdraws from the NPT
2004: North Korea allows an unofficial US delegation to visit its nuclear facilities to display a working nuclear weapon
2005: In exchange for economic and energy assistance, North Korea signs historic agreement to halt nuclear program and denuclearize! (#3)
2006: North Korea fires seven ballistic missiles and conducts an underground nuclear test
2006: North Korea declares support for denuclearization of Korean peninsula
2006: North Korea again supports denuclearization of Korean peninsula
2007: In exchange for energy aid from the US, North Korea signs historic agreement to halt nuclear program! (#4)
2007: N&S Korea sign agreement on reunification
2009: North Korea issues a statement outlining a plan to weaponize newly separated plutonium
2010: North Korea threatens war with South Korea
2010: North Korea again announces commitment to denuclearize
2011: North Korea announces plan to halt nuclear and missile tests
2012: North Korea announces halt to nuclear program
2013: North Korea announces intentions to conduct more nuclear tests
2014: North Korea test fires 30 short-range rockets, as well as two medium missiles into the Sea of Japan
2015: North Korea offers to halt nuclear tests
2016: North Korea announces that it has detonated a hydrogen bomb
2016: North Korea again announces support for denuclearization
2017: North Korea conducts its sixth nuclear test
2018: Kim Jong Un announces that North Korea will mass produce nuclear warheads and ballistic missiles for deployment
2018: In exchange for the cancellation of US-South Korea military exercises, North Korea, once again, commits to “work toward complete denuclearization on the Korean peninsula”

Maybe this time is really, truly different. But our priors should be informed by history, and history tells us that it’s almost certainly not.

Priors in the supernatural

A friend of mine recently told me the following anecdote.

Years back, she had visited an astrologer in India with her boyfriend. The astrologer told her the following things: (1) she would end up marrying her boyfriend at the time, (2) down the line they would have two kids, the first a girl and the second a boy, and (3) the exact dates of birth of both children.

Many years down the line, all of these predictions turned out to be true.

I trust this friend a great deal, and don’t have any reason to think that she misremembered the details or lied to me about them. But at the same time, I recognize that astrology is completely crazy.

Since that conversation, I’ve been thinking about the ways in which we can evaluate our de facto priors in supernatural events by consulting either real-world anecdotes or thought experiments. For instance, if we think that the second and third predictions each gave us a likelihood ratio of 100:1 in favor of astrology being true, and if I ended up thinking that astrology was about as likely to be true as false, then I must have started with roughly 1:10,000 odds against astrology being true.
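To spell out the arithmetic in odds form: posterior odds = prior odds × likelihood ratios, so (1 : 10,000) × 100 × 100 = 1 : 1. A one-in-ten-thousand prior combined with two independent 100:1 updates lands you at even odds.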

That’s not crazily low for a belief that contradicts much of our understanding of physics. I would have thought that my prior odds would be something much lower, like 1:10^10 or something. But really put yourself in that situation.

Imagine that you go to an astrologer, who is able to predict an essentially unpredictable sequence of events years down the line, with incredible accuracy. Suppose that the astrologer tells you who you will marry, how many kids you’ll have, and the dates of birth of each. Would you really be totally unshaken by this experience? Would you really believe that it was more likely to have happened by coincidence?

Yes, yes, I know the official Bayesian response; I read it in Jaynes long ago. For beliefs like astrology that contradict our basic understanding of science and causality, we should always have reserved some amount of credence for alternative explanations, even if we can’t think of any on the spot. This reserve of credence insures us against jumping to 99% credence upon seeing a psychic predict the numbers in your head over and over, preserving sanity and a nice simple secular worldview.

But that response is not sufficient to rule out all strong evidence for the supernatural.

Here’s one such category of strong evidence: evidence for which all alternative explanations are ruled out by the laws of physics as strongly as the supernatural hypothesis is ruled out by the laws of physics.

I think that my friend’s anecdote is one such case. If it is true, then there is no good natural alternative explanation for it. The reason? The information about the dates of birth of my friend’s children did not exist in the world at the time of the prediction, in any form naturally attainable by any human being.

By contrast, imagine you go to a psychic who tells you to put up some fingers behind your back and then predicts, over and over again, how many fingers you have up. There are hundreds of alternative explanations for this besides “psychics are real and science has failed us.” The reason these alternative explanations exist is that the information predicted by the psychic existed in the world at the time of the prediction.

But in the case of my friend’s anecdote, the information predicted by the astrologer lay far off in the chaotic dynamics of the future.

What this rules out is the possibility that the astrologer somehow obtained the information surreptitiously by any natural means. It doesn’t rule out a host of other explanations, such as that my friend’s perception at the time was mistaken, that her memory of the event is skewed, or that she is lying. I could even, as a last resort, consider the possibility that I hallucinated the entire conversation with her. (I’d like to give the formal title “unbelievable propositions” to the set of propositions that are so unlikely that we should sooner believe that we are hallucinating than accept evidence for them.)

But each of these sources of alternative explanations, with the possible exception of the last, can be made significantly less plausible.

Let me use a thought experiment to illustrate this.

Imagine that you are a nuclear physicist who, along with a group of colleagues, has decided to test the predictive powers of a fortune teller. You carefully design an experiment in which a source of true quantum randomness will produce a number between 1 and N. Before the number has been produced, when it still exists only as an unrealized possibility in the wave function, you ask the fortune teller to predict its value.

Suppose that they get it correct. For what value of N would you begin to take their fortune telling abilities seriously?

Here’s how I would react to the success, for different values of N.

N = 10: “Haha, that’s a funny coincidence.”

N = 100: “Hm, that’s pretty weird.”

N = 1000: “What…”

N = 10,000: “Wait, WHAT!?”

N = 100,000: “How on Earth?? This is crazy.”

N = 1,000,000: “Ok, I’m completely baffled.”

I think I’d start taking them seriously as early as N = 10,000. This indicates prior odds of roughly 1:10,000 against fortune-telling abilities (roughly the same as my prior odds against astrology, interestingly!). Once again, this seems disconcertingly low.
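Roughly, here is where that number comes from (my gloss): a correct guess has probability 1/N under pure chance and something close to 1 under genuine ability, so the likelihood ratio is about N:1 and posterior odds ≈ prior odds × N. If your posterior only reaches even odds around N = 10,000, your prior odds must have been about 1:10,000.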

But let’s try to imagine some alternative explanations.

As far as I can tell, there are only three potential failure points: (1) our understanding of physics, (2) our engineering of the experiment, (3) our perception of the fortune teller’s prediction.

First of all, if our understanding of quantum mechanics is correct, there is no possible way that any agent could do better than random at predicting the number.

Secondly, we stipulated that the experiment was designed meticulously so as to ensure that the information was truly random, and unavailable to the fortune-teller. I don’t think that such an experiment would actually be that hard to design. But let’s go even further and imagine that we’ve designed the experiment so that the fortune teller is not in causal contact with the quantum number-generator until after she has made her prediction.

And thirdly, we can suppose that the prediction is viewed by multiple different people, all of whom affirm that it was correct. We can even go further and imagine that video was taken, and broadcast to millions of viewers, all of whom agreed. Not all of them could just be getting it wrong over and over again. The only possibility is that we’re hallucinating not just the experimental result, but indeed also the public reaction and consensus on the experimental result.

But the hypothesis of a hallucination now becomes inconsistent with our understanding of how the brain works! A hallucination wouldn’t have the effect of creating a perception of a completely coherent reality in which everybody behaves exactly as normal except that they saw the fortune teller make a correct prediction. We’d expect that if this were a hallucination, it would not be so self-consistent.

Pretty much all that’s left, as far as I can tell, is some sort of Cartesian evil demon that’s cleverly messing with our brains to create this bizarre false reality. If this is right, then we’re left weighing the credibility of the laws of physics against the credibility of radical skepticism. And in that battle, I think, the laws of physics lose out. (Consider that the invalidity of radical skepticism is a precondition for the development of laws of physics in the first place.)

The point of all of this is just to sketch an example where I think we’d have a good justification for ruling out all alternative explanations, at least with an equivalent degree of confidence that we have for affirming any of our scientific knowledge.

Let’s bring this all the way back to where we started, with astrology. The conclusion of this blog post is not that I’m now a believer in astrology. I think that there’s enough credence in the buckets of “my friend misremembered details”, “my friend misreported details”, and “I misunderstood details” so that the likelihood ratio I’m faced with is not actually 10,000 to 1. I’d guess it’s something more like 10 to 1.

But I am now that much less confident that astrology is wrong. And I can imagine circumstances under which my confidence would be drastically decreased. While I don’t expect such circumstances to occur, I do find it instructive (and fun!) to think about them. It’s a good test of your epistemology to wonder what it would take for your most deeply-held beliefs to be overturned.

Patterns of inductive inference

I’m currently reading through Judea Pearl’s wonderful book Probabilistic Reasoning in Intelligent Systems. It’s chock-full of valuable insights into the subtle patterns involved in inductive reasoning.

Here are some of the patterns of reasoning described in Chapter 1, ordered in terms of increasing unintuitiveness. Any good system of inductive inference should be able to accommodate all of the following.

Abduction:

If A implies B, then finding that B is true makes A more likely.

Example: If fire implies smoke, smoke suggests fire.
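(In probabilistic terms — my gloss, not Pearl’s wording: if A entails B, then P(B | A) = 1, so Bayes’ rule gives P(A | B) = P(A) · P(B | A) / P(B) = P(A) / P(B) ≥ P(A), a strict increase whenever P(B) < 1 and P(A) > 0.)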

Asymmetry of inference:

There are two types of inference that function differently: predictive and diagnostic (explanatory). Predictive inference reasons from causes to consequences, whereas diagnostic inference reasons from consequences to causes.

Example: Seeing fire suggests that there is smoke (predictive). Seeing smoke suggests that there is a fire (diagnostic).

Induced Dependency:

If you know A, then learning B can suggest C where it wouldn’t have if you hadn’t known A.

Example: Ordinarily, burglaries and earthquakes are unrelated. But if you know that your alarm is going off, then whether or not there was an earthquake is relevant to whether or not there was a burglary.
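Here is a small numerical sketch of this (the probabilities are made up for illustration; the structure is the standard burglary/earthquake/alarm network):

from itertools import product

p_burglary, p_earthquake = 0.01, 0.02          # hypothetical, independent prior probabilities

def p_alarm(b, e):
    # Hypothetical chance the alarm goes off, given burglary b and earthquake e (each 0 or 1)
    return {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95}[(b, e)]

def joint(b, e, a):
    pb = p_burglary if b else 1 - p_burglary
    pe = p_earthquake if e else 1 - p_earthquake
    pa = p_alarm(b, e) if a else 1 - p_alarm(b, e)
    return pb * pe * pa

def p_burglary_given(**given):
    # P(burglary = 1 | given), by brute-force summation over the eight joint states
    num = den = 0.0
    for b, e, a in product((0, 1), repeat=3):
        state = {"b": b, "e": e, "a": a}
        if any(state[k] != v for k, v in given.items()):
            continue
        den += joint(b, e, a)
        if b == 1:
            num += joint(b, e, a)
    return num / den

print("P(burglary)                     =", round(p_burglary_given(), 4))
print("P(burglary | earthquake)        =", round(p_burglary_given(e=1), 4))        # unchanged: independent
print("P(burglary | alarm)             =", round(p_burglary_given(a=1), 4))        # jumps up
print("P(burglary | alarm, earthquake) =", round(p_burglary_given(a=1, e=1), 4))   # drops back down

Before conditioning on the alarm, the earthquake is irrelevant to the burglary; afterwards it matters a great deal. The last line also previews the “explaining away” pattern below.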

Correlated Evidence:

Upon discovering that multiple sources of evidence have a common origin, the credibility of the hypothesis they support should be decreased.

Example: You learn on a radio report, TV report, and newspaper report that thousands died. You then learn that all three reports got their information from the same source. This decreases the credibility that thousands died.

Explaining away:

Finding a second explanation for an item of data makes the first explanation less credible. If A and B both suggest C, and C is true, then finding that B is true makes A less credible.

Example: Finding that my light bulb emits red light makes it less credible that the red-hued object in my hand is truly red.

Rule of the hypothetical middle:

If two diametrically opposed assumptions impart two different degrees of belief onto a proposition Q, then the unconditional degree of belief should be somewhere between the two.

Example: The plausibility of an animal being able to fly is somewhere between the plausibility of a bird flying and the plausibility of a non-bird flying.
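(This is just the law of total probability, though the post doesn’t spell it out: P(Q) = P(Q | A) · P(A) + P(Q | not-A) · P(not-A), a weighted average of the two conditional degrees of belief, so it must lie between them.)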

Defeaters or Suppressors:

Even if, as a general rule, B is more likely given A, this does not necessarily mean that learning A makes B more credible. There may be other elements in your knowledge base K that explain A away. In fact, learning A might even cause B to become less likely (Simpson’s paradox). In other words, updating beliefs must involve searching your entire knowledge base for defeaters of general rules that are not directly inferentially connected to the evidence you receive.

Example 1: Learning that the ground is wet does not permit us to increase the certainty of “It rained”, because the knowledge base might contain “The sprinkler is on.”
Example 2: You have kidney stones and are seeking treatment. You additionally know that Treatment A makes you more likely to recover from kidney stones than Treatment B. But if you also have the background information that your kidney stones are large, then your recovery under Treatment A becomes less credible than under Treatment B.
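Here is a concrete (and entirely made-up) set of counts in the spirit of Example 2, showing how conditioning on stone size can flip which treatment looks better:

# Hypothetical counts, invented purely to illustrate Simpson's paradox (not the real study data)
recoveries = {
    ("A", "small"): (810, 900),   # (recovered, treated)
    ("A", "large"): (50, 100),
    ("B", "small"): (95, 100),
    ("B", "large"): (550, 1000),
}

def rate(pairs):
    recovered = sum(r for r, n in pairs)
    treated = sum(n for r, n in pairs)
    return recovered / treated

for t in ("A", "B"):
    overall = rate([recoveries[(t, "small")], recoveries[(t, "large")]])
    small = rate([recoveries[(t, "small")]])
    large = rate([recoveries[(t, "large")]])
    print(f"Treatment {t}: overall {overall:.0%}, small stones {small:.0%}, large stones {large:.0%}")

In these made-up numbers, Treatment A wins overall (86% vs. about 59%) only because it was mostly given to the easy, small-stone cases; within each stone-size group, B does better.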

Non-Transitivity:

Even if A suggests B and B suggests C, this does not necessarily mean that A suggests C.

Example 1: Your card being an ace suggests it is an ace of clubs. If your card is an ace of clubs, then it is a club. But if it is an ace, this does not suggest that it is a club.
Example 2: If the sprinkler was on, then the ground is wet. If the ground is wet, then it rained. But it’s not the case that if the sprinkler was on, then it rained.
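(A quick numerical check of Example 1, not in the post: P(club | ace) = 1/4 = P(club), so learning that the card is an ace tells you nothing about whether it is a club, even though being an ace raises the probability of “ace of clubs” from 1/52 to 1/4, and “ace of clubs” entails “club.”)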

Non-detachment:

Just learning that a proposition has changed in credibility is not enough to analyze the effects of the change; the reason for the change in credibility is relevant.

Example: You get a phone call telling you that your alarm is going off. Worried about a burglar, you head towards your home. On the way, you hear a radio announcement of an earthquake near your home. This makes it more credible that your alarm really is going off, but less credible that there was a burglary. In other words, the same rise in the credibility of “the alarm is going off” has different effects depending on why it happened: coming by way of the earthquake report, it lowers the credibility of a burglary, whereas ordinarily an alarm going off would make a burglary more credible.

✯✯✯

All of these patterns should make a lot of sense to you when you give them a bit of thought. It turns out, though, that accommodating them in a system of inference is no easy matter.

Pearl distinguishes between extensional and intensional systems, and talks about the challenges for each approach. Extensional systems (including fuzzy logic and non-monotonic logic) focus on extending the truth values of propositions from {0,1} to a continuous range of uncertainty [0, 1], and then modifying the rules according to which propositions combine (for instance, the proposition “A & B” has the truth value min{A, B} in some extensional systems and A*B in others). The locality and simplicity of these combination rules turns out to be their primary failing; they lack the subtlety and nuance required to capture the complicated reasoning patterns above. Their syntactic simplicity makes them easy to work with, but curses them with semantic sloppiness.

On the other hand, intensional systems (like probability theory) involve assigning a function from entire world-states (rather than individual propositions) to degrees of plausibility. This allows for the nuance required to capture all of the above patterns, but results in a huge blow up in complexity. True perfect Bayesianism is ridiculously computationally infeasible, as the operation of belief updating blows up exponentially as the number of atomic propositions increases. Thus, intensional systems are semantically clear, but syntactically messy.

A good summary of this from Pearl (p 12):

We have seen that handling uncertainties is a rather tricky enterprise. It requires a fine balance between our desire to use the computational permissiveness of extensional systems and our ability to refrain from committing semantic sins. It is like crossing a minefield on a wild horse. You can choose a horse with good instincts, attach certainty weights to it and hope it will keep you out of trouble, but the danger is real, and highly skilled knowledge engineers are needed to prevent the fast ride from becoming a disaster. The other extreme is to work your way by foot with a semantically safe intensional system, such as probability theory, but then you can hardly move, since every step seems to require that you examine the entire field afresh.

The challenge for extensional systems is to accommodate the nuance of correct inductive reasoning.

The challenge for intensional systems is to maintain their semantic clarity while becoming computationally feasible.

Pearl solves the second challenge by supplementing Bayesian probability theory with causal networks that give information about the relevance of propositions to each other, drastically simplifying the tasks of inference and belief propagation.
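For concreteness (this formula isn’t quoted in the excerpt above, but it is the standard factorization such networks license): a network over variables X1, …, Xn lets you write

P(X1, …, Xn) = Π_i P(Xi | parents of Xi)

so instead of one table of roughly 2^n joint probabilities over binary atomic propositions, you store a small conditional table per node, and updates propagate locally along the network’s links.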

One more insight from Chapter 1 of the book… Pearl describes four primitive qualitative relationships in everyday reasoning: likelihood, conditioning, relevance, and causation. I’ll give an example of each, and how they are symbolized in Pearl’s formulation.

1. Likelihood (“Tim is more likely to fly than to walk.”)
P(A)

2. Conditioning (“If Tim is sick, he can’t fly.”)
P(A | B)

3. Relevance (“Whether Tim flies depends on whether he is sick.”)
P(A | B) ≠ P(A)

4. Causation (“Being sick caused Tim’s inability to fly.”)
P(A | do(B))

The challenge is to find a formalism that fits all four of these, while remaining computationally feasible.

If all truths are knowable, then all truths are known

The title of this post is what’s called Fitch’s paradox of knowability.

It’s a weird result that arises from a few very intuitive assumptions about the notion of knowability. I’ll prove it here.

First, let’s list five assumptions. The first of these will be the only strong one – the others should all seem very obviously correct.

Assumptions

  1. All truths are knowable.
  2. If P & Q is known, then both P and Q are known.
  3. Knowledge entails truth.
  4. If P is possible and Q can be derived from P, then Q is possible.
  5. Contradictions are necessarily false.

Let’s put these assumptions in more formal language by using the following symbolization:

◇P means that P is possible
KP means that P is known by somebody at some time

Assumptions

  1. From P, derive ◇KP
  2. From K(P & Q), derive KP & KQ
  3. From KP, derive P
  4. From ◇P & (P → Q), derive ◇Q
  5. ¬◇[P & ¬P]

Now, the proof. First in English…

Proof

  1. Suppose that P is true and unknown.
  2. Then it is knowable that P is true and unknown.
  3. Thus it is possible that P is known and that it is known that P is unknown.
  4. So it is possible that P is both known and not known.
  5. Since 4 is a contradiction, it is not the case that P is true and unknown.
  6. In other words, if P is true, then it is known.

Follow all of that? Essentially, we assume that there is some statement P that is both true and unknown. But if this last sentence is true, and if all truths are knowable, then it should be a knowable truth. I.e. it is knowable that P is both true and unknown. But of course this can’t be knowable, since to know that P is both true and unknown is to both know it and not know it. And thus it must be the case that if all truths are knowable, then all truths are known.

I’ll write out the proof more formally now.

Proof

  1. P & ¬KP                 Provisional assumption
  2. ◇K(P & ¬KP)             From 1, by Assumption 1
  3. ◇(KP & K¬KP)            From 2, by Assumptions 2 and 4
  4. ◇(KP & ¬KP)             From 3, by Assumptions 3 and 4
  5. ¬(P & ¬KP)              From 4 and Assumption 5, by reductio of 1
  6. P → KP                  From 5, standard tautology
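For those who like to see this machine-checked, here is a small formalization of the same derivation in Lean 4 (my own sketch, not from the post), with the five assumptions taken as axioms:

-- Knowledge operator Kn and possibility operator Poss, axiomatized per assumptions 1–5
axiom Kn : Prop → Prop
axiom Poss : Prop → Prop
axiom knowability : ∀ P : Prop, P → Poss (Kn P)                    -- 1. all truths are knowable
axiom k_dist : ∀ P Q : Prop, Kn (P ∧ Q) → Kn P ∧ Kn Q              -- 2. knowledge distributes over ∧
axiom factivity : ∀ P : Prop, Kn P → P                             -- 3. knowledge entails truth
axiom poss_mono : ∀ P Q : Prop, (P → Q) → Poss P → Poss Q          -- 4. possibility respects entailment
axiom no_poss_contradiction : ∀ P : Prop, ¬ Poss (P ∧ ¬ P)         -- 5. contradictions are not possible

theorem fitch (P : Prop) (hP : P) : Kn P :=
  Classical.byContradiction fun hnk =>
    -- From "P is true and unknown", assumption 1 gives: possibly, that conjunction is known
    have h1 : Poss (Kn (P ∧ ¬ Kn P)) := knowability (P ∧ ¬ Kn P) ⟨hP, hnk⟩
    -- Distribute Kn over ∧ and apply factivity inside the possibility operator (assumptions 2, 3, 4)
    have h2 : Poss (Kn P ∧ ¬ Kn P) :=
      poss_mono (Kn (P ∧ ¬ Kn P)) (Kn P ∧ ¬ Kn P)
        (fun hk => ⟨(k_dist P (¬ Kn P) hk).1, factivity (¬ Kn P) (k_dist P (¬ Kn P) hk).2⟩)
        h1
    -- A contradiction cannot be possible (assumption 5)
    no_poss_contradiction (Kn P) h2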

I love finding little examples like these where attempts to formalize our intuitions about basic concepts we use all the time lead us into disaster. You can’t simultaneously accept all of the following:

  • Not all truths are known.
  • All truths are knowable.
  • If P & Q is known, then both P and Q are known.
  • Knowledge entails truth.
  • If P is possible and P implies Q, then Q is possible.
  • Contradictions are necessarily false.

Variational Bayesian inference

Today I learned a cool trick for practical implementation of Bayesian inference.

Bayesians are interested in calculating posterior probability distributions of unobserved parameters X, given data (which consists of the values of observed parameters Y).

To do so, they need only know the form of the likelihood function (the probability of Y given X) and their own prior distribution over X. Then they can apply Bayes’ rule…

P(X | Y) = P(Y | X) P(X) / P(Y)

… and voila, Bayesian inference complete.

The trickiest part of this process is calculating the term in the denominator, the marginal likelihood P(Y). Calculating this term analytically is typically very computationally expensive: it involves summing the likelihood multiplied by the prior over all possible values of X. If X ranges over a continuous infinity of possible values, then calculating the marginal likelihood amounts to solving a (typically completely intractable) integral.

P(Y) = ∫ P(Y | X) P(X) dX

Variational Bayesian inference is a procedure that solves this problem through a clever trick.

We start by searching for an approximation to the posterior within a space F of functions that are easy to work with (in particular, easy to integrate).

Our goal is not to find the exact form of the posterior, although if we do, that’s great. Instead, we want to find the function Q(X) within F that is as close to the posterior P(X | Y) as possible.

Distance between probability distributions is typically calculated by the information divergence D(Q, P), which is defined by…

D(Q, P) = ∫ Q(X) log(Q(X) / P(X|Y)) dX

To explicitly calculate and minimize this, we would need to know the form of the posterior P(X | Y) from the start. But let’s plug in the definition of conditional probability…

P(X | Y) = P(X, Y) / P(Y)

D(Q, P) = ∫ Q(X) log(Q(X) P(Y) / P(X, Y)) dX
= ∫ Q(X) log(Q(X) / P(X, Y)) dX  +  ∫ Q(X) log P(Y) dX

The second term is easily calculated. Since log P(Y) is not a function of X, and Q(X) integrates to 1, the integral just becomes…

∫ Q(X) log P(Y) dX = log P(Y)

Rearranging, we get…

log P(Y) = D(Q, P)  –  ∫ Q(X) log(Q(X) / P(X, Y)) dX

The second term depends on Q(X) and the joint probability P(X, Y), which we can calculate easily as the product of the likelihood P(Y | X) and the prior P(X). We name it the variational free energy, L(Q).

log P(Y) = D(Q, P) + L(Q)

Now, on the left-hand side we have the log of the marginal likelihood, and on the right we have the information distance plus the variational free energy.

Notice that the left side is not a function of Q. This is really important! It tells us that as we vary Q (say, in order to minimize D(Q, P)), the sum of the two terms on the right stays constant.

In other words, any increase in L(Q) is necessarily a decrease in D(Q, P). What this means is that the Q that minimizes D(Q, P) is the same thing as the Q that maximizes L(Q)!

We can use this to minimize D(Q, P) without ever explicitly knowing P.

Recalling the definition of the variational free energy, we have…

L(Q) = – ∫ Q(X) log(Q(X) / P(X, Y)) dX
= ∫ Q(X) log P(X, Y) dX – ∫ Q(X) log Q(X) dX

Both of these integrals are computable insofar as we made a good choice for the function space F. Thus we can exactly find Q*, the best approximation to P in F. Then, knowing Q*, we can calculate L(Q*), which serves as a lower bound on the log of the marginal likelihood P(Y).

log P(Y) = D(Q, P) + L(Q)
so log P(Y) ≥ L(Q*)

Summing up…

  1. Variational Bayesian inference approximates the posterior probability P(X | Y) with a function Q(X) in the function space F.
  2. We find the function Q* that is as similar as possible to P(X | Y) by maximizing L(Q).
  3. L(Q*) gives us a lower bound on the log of the marginal likelihood, log P(Y).
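To make the recipe concrete, here is a minimal sketch (my own toy example, not from any particular library): a model where the exact answer is known — prior X ~ N(0, 1), likelihood Y | X ~ N(X, 1), observed y = 2 — with a Gaussian variational family F, the variational free energy L(Q) estimated on a grid, and a crude grid search standing in for a proper optimizer.

import numpy as np

# Toy model (made up for illustration): prior X ~ N(0,1), likelihood Y|X ~ N(X,1), observe y = 2.
# Exact answers for comparison: the posterior is N(1, 0.5), and log P(y) = log N(y; 0, 2).
y = 2.0
xs = np.linspace(-8.0, 8.0, 4001)
dx = xs[1] - xs[0]

def log_gauss(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

log_joint = log_gauss(xs, 0.0, 1.0) + log_gauss(y, xs, 1.0)   # log P(X, Y=y) on the grid

def variational_free_energy(m, s):
    # L(Q) = ∫ Q log P(X,y) dX − ∫ Q log Q dX, for Q = N(m, s²), approximated on the grid
    log_q = log_gauss(xs, m, s ** 2)
    q = np.exp(log_q)
    return float(np.sum(q * (log_joint - log_q)) * dx)

# Maximize L(Q) over the variational family F = {Gaussians} by brute-force grid search
candidates = ((variational_free_energy(m, s), m, s)
              for m in np.linspace(0.0, 2.0, 81)
              for s in np.linspace(0.3, 1.5, 61))
L_star, m_star, s_star = max(candidates)

log_marginal = float(log_gauss(y, 0.0, 2.0))   # exact log P(y), available only because the model is a toy
print(f"Q* ≈ Normal(mean {m_star:.2f}, std {s_star:.3f})   (exact posterior: mean 1.00, std 0.707)")
print(f"L(Q*) = {L_star:.4f}  ≤  log P(y) = {log_marginal:.4f}")

Because the true posterior happens to lie inside F, the maximizer Q* essentially recovers it and L(Q*) comes out just under the exact log P(y); with a misspecified family, L(Q*) would sit strictly below the marginal likelihood.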

The value of the personal

I have been thinking about the value of powerful anecdotes. An easy argument for why we should be very cautious of personal experience and anecdotal evidence is that it has the potential to cause us to over-update. E.g. somebody that hears a few harrowing stories from friends about gun violence in Chicago is more likely to have an overly high estimation of how dangerous Chicago is.

Maybe the best way to formulate our worldview is in a cold and impersonal manner, disregarding most anecdotes in favor of hard data. This is the type of thing I might have once said, but I now think that this approach is likely just as flawed.

First of all, I think it’s an unrealistic demand on most people’s cognitive systems that they toss out the personal and compelling in their worldview.

And second of all, just like personal experience and anecdotal evidence have the potential to cause over-updating, statistical data and dry studies have the potential to cause under-updating.

Reading some psychological studies about the seriousness of the psychological harms of extended periods of solitary confinement is no match for witnessing or personally experiencing the effects of being locked in a tiny room alone for years. There’s a real and important difference between abstractly comprehending a fact and really understanding the fact. Other terms for this second thing include internalizing the fact, embodying it, and making it salient to yourself.

This difference is not easy to capture on a one-dimensional model of epistemology where beliefs are represented as simple real numbers. I’m not even sure if there’d be any good reason to build this distinction into artificial intelligences we might eventually construct. But it is there in us, and has a powerful influence.

How do we know whether somebody has really internalized a belief or not? I’m not sure, but here’s a gesture in what I think is the right direction.

We can conceive of somebody’s view of the world as a massive web of beliefs, where the connections between beliefs indicate dependencies and logical relations. To have fully internalized a belief is to have a web that is fully consistent with the truth of this belief. On the other hand, if you notice that somebody verbally reports that they believe A, but then also seems to believe B, C, and D, where all of these are inconsistent with A, then they have not really internalized A.

The worry is that a cold and impersonal approach to forming your worldview is the type of thing that would result in this type of inconsistency and disconnectedness in your web of beliefs, through the failure to internalize important facts.

Such failures become most obvious when you have a good sense of somebody’s values, and can simply observe their behavior to see what it reveals about their beliefs. If somebody is a pure act utilitarian (I know that nobody actually has a value system as simple as this, but just play along for a moment), then they should be sending their money wherever it would be better spent maximizing utility. If they are not doing so, then this reveals an implicit belief that there is no better way to be maximizing utility than by keeping their own money.

This is sort of an attempt to uncover somebody’s revealed beliefs, to steal the concept of revealed preferences from economics. Conflicts between revealed beliefs and stated beliefs indicate a lack of internalization.