The history of lighting technology

Behold, one of my favorite tables of all time:

Screen-Shot-2018-12-23-at-4.08.12-AM-e1545556587809.png
Screen-Shot-2018-12-23-at-4.08.19-AM-e1545556574319.png
(Source.)

There’s so much to absorb here. Let’s look at just the “Light Price in Terms of Labor” column. At 500,000 BC, our starting point, we have this handsome guy:

Peking Man

The Peking man was a Homo erectus found in a cave with evidence of tool use and basic fire technology. At this point, it would have taken him about 58 hours of work for every 1000 hours of light. Lighting a fire by hand or even with basic tools is hard.

Nothing much changes for hundreds of thousands of years, until the data picks up again with candles and oil lamps in the 1800s. After that, things slowly begin to accelerate, with gas lighting, incandescent lamps, and eventually fluorescent bulbs and LEDs…

Lights

Notice that this is a logarithmic plot! So a straight line corresponds to an exponential decrease in the amount of labor required to produce light. By the end we have less than 1 second to produce 1000 hours of light. And this doesn’t even include LED technologies!

Here’s a more detailed timeline of milestones in lighting technology up to the 1980s:

Screen Shot 2018-12-23 at 4.06.50 AM

And finally, a comparison of the efficiency of different lighting technologies over time.

Screen-Shot-2018-12-23-at-4.07.31-AM.png

 

Kant’s attempt to save metaphysics and causality from Hume

TL;DR

  • Hume sort of wrecked metaphysics. This inspired Kant to try and save it.
  • Hume thought that terms were only meaningful insofar as they were derived from experience.
  • We never actually experience necessary connections between events, we just see correlations. So Hume thought that the idea of causality as necessary connection is empty and confused, and that all our idea of causality really amounts to is correlation.
  • Kant didn’t like this. He wanted to PROTECT causality. But how??
  • Kant said that metaphysical knowledge was both a priori and substantive, and justified this by describing these things called pure intuitions and pure concepts.
  • Intuitions are representations of things (like sense perceptions). Pure intuitions are the necessary preconditions for us to represent things at all.
  • Concepts are classifications of representations (like “red”). Pure concepts are the necessary preconditions underlying all classifications of representations.
  • There are two pure intuitions (space and time) and twelve pure concepts (one of which is causality).
  • We get substantive a priori knowledge by referring to pure intuitions (mathematics) or pure concepts (laws of nature, metaphysics).
  • Yay! We saved metaphysics!

 

(Okay, now on to the actual essay. This was not originally written for this blog, which is why it’s significantly more formal than my usual fare.)

 

***

 

David Hume’s Enquiry Concerning Human Understanding stands out as a profound and original challenge to the validity of metaphysical knowledge. Part of the historical legacy of this work is its effect on Kant, who describes Hume as being responsible for “[interrupting] my dogmatic slumber and [giving] my investigations in the field of speculative philosophy a completely different direction.” Despite the great inspiration that Kant took from Hume’s writing, their thinking on many matters is diametrically opposed. A prime example of this is their views on causality.

Hume’s take on causation is famously unintuitive. He gives a deflationary account of the concept, arguing that the traditional conception lacks a solid epistemic foundation and must be replaced by mere correlation. To understand this conclusion, we need to back up and consider the goal and methodology of the Enquiry.

He starts with an appeal to the importance of careful and accurate reasoning in all areas of human life, and especially in philosophy. In a beautiful bit of prose, he warns against the danger of being overwhelmed by popular superstition and religious prejudice when casting one’s mind towards the especially difficult and abstruse questions of metaphysics.

But this obscurity in the profound and abstract philosophy is objected to, not only as painful and fatiguing, but as the inevitable source of uncertainty and error. Here indeed lies the most just and most plausible objection against a considerable part of metaphysics, that they are not properly a science, but arise either from the fruitless efforts of human vanity, which would penetrate into subjects utterly inaccessible to the understanding, or from the craft of popular superstitions, which, being unable to defend themselves on fair ground, raise these entangling brambles to cover and protect their weakness. Chased from the open country, these robbers fly into the forest, and lie in wait to break in upon every unguarded avenue of the mind, and overwhelm it with religious fears and prejudices. The stoutest antagonist, if he remit his watch a moment, is oppressed. And many, through cowardice and folly, open the gates to the enemies, and willingly receive them with reverence and submission, as their legal sovereigns.

In less poetic terms, Hume’s worry about metaphysics is that its difficulty and abstruseness makes its practitioners vulnerable to flawed reasoning. Even worse, the difficulty serves to make the subject all the more tempting for “each adventurous genius[, who] will still leap at the arduous prize and find himself stimulated, rather than discouraged by the failures of his predecessors, while he hopes that the glory of achieving so hard an adventure is reserved for him alone.”

Thus, says Hume, the only solution is “to free learning at once from these abstruse questions [by inquiring] seriously into the nature of human understanding and [showing], from an exact analysis of its powers and capacity, that it is by no means fitted for such remote and abstruse questions.”

Here we get the first major divergence between Kant and Hume. Kant doesn’t share Hume’s eagerness to banish metaphysics. His Prolegomena to Any Future Metaphysics and Critique of Pure Reason are attempts to find it a safe haven from Hume’s attacks. But while Kant doesn’t share Hume’s ambitions in this regard, he does take Hume’s methodology very seriously. He states in the preface to the Prolegomena that “since the origin of metaphysics as far as history reaches, nothing has ever happened which could have been more decisive to its fate than the attack made upon it by David Hume.” Many of the principles that Hume derives, Kant agrees with wholeheartedly, which makes the task of shielding metaphysics even harder for him.

With that understanding of Hume’s methodology in mind, let’s look at how he argues for his view of causality. We’ll start with a distinction that is central to Hume’s philosophy: that between ideas and impressions. The difference between the memory of a sensation and the sensation itself is a good example. While the memory may mimic or copy the sensation, it can never reach its full force and vivacity. In general, Hume suggests that our experiences fall into two distinct categories, separated by a qualitative gap in liveliness. The more lively category he calls impressions, which includes sensory perceptions like the smell of a rose or the taste of wine, as well as internal experiences like the feeling of love or anger. The less lively category he refers to as thoughts or ideas. These include memories of impressions as well as imagined scenes, concepts, and abstract thoughts. 

With this distinction in hand, Hume proposes his first limit on the human mind. He claims that no matter how creative or original you are, all of your thoughts are the product of “compounding, transposing, augmenting, or diminishing the materials afforded us by the senses and experiences.” This is the copy principle: all ideas are copies of impressions, or compositions of simpler ideas that are in turn copies of impressions.

Hume turns this observation about the nature of our mind into a powerful criterion of meaning. “When we entertain any suspicion that a philosophical term is employed without any meaning or idea (as is but too frequent), we need but enquire, From what impression is that supposed idea derived? And if it be impossible to assign any, this will serve to confirm our suspicion.”

This criterion turns out to be just the tool Hume needs in order to establish his conclusion. He examines the traditional conception of causation as a necessary connection between events, searches for the impressions that might correspond to this idea, and, failing to find anything satisfactory, declares that “we have no idea of connection or power at all and that these words are absolutely without any meaning when employed in either philosophical reasonings or common life.” His primary argument here is that all of our observations are of mere correlation, and that we can never actually observe a necessary connection.

Interestingly, at this point he refrains from recommending that we throw out the term causation. Instead he proposes a redefinition of the term, suggesting a more subtle interpretation of his criterion of meaning. Rather than eliminating the concept altogether upon discovering it to have no satisfactory basis in experience, he reconceives it in terms of the impressions from which it is actually formed. In particular, he argues that our idea of causation is really based on “the connection which we feel in the mind, this customary transition of the imagination from one object to its usual attendant.”

Here Hume is saying that humans have a rationally unjustifiable habit of thought where, when we repeatedly observe one type of event followed by another, we begin to call the first a cause and the second its effect, and we expect that future instances of the cause will be followed by future instances of the effect. Causation, then, is just this constant conjunction between events, and our mind’s habit of projecting the conjunction into the future. We can summarize all of this in a few lines:

Hume’s denial of the traditional concept of causation

  1. Ideas are always either copies of impressions or composites of simpler ideas that are copies of impressions.
  2. The traditional conception of causation is neither of these.
  3. So we have no idea of the traditional conception of causation.

Hume’s reconceptualization of causation

  1. An idea is the idea of the impression that it is a copy of.
  2. The idea of causation is copied from the impression of constant conjunction.
  3. So the idea of causation is just the idea of constant conjunction.

There we have Hume’s line of reasoning, which provoked Kant to examine the foundations of metaphysics anew. Kant wanted to resist Hume’s dismissal of the traditional conception of causation, while accepting that our sense perceptions reveal no necessary connections to us. Thus his strategy was to deny the Copy Principle and give an account of how we can have substantive knowledge that is not ultimately traceable to impressions. He does this by introducing the analytic/synthetic distinction and the notion of a priori synthetic knowledge.

Kant’s original definition of analytic judgments is that they “express nothing in the predicate but what has already been actually thought in the concept of the subject.” This suggests that the truth value of an analytic judgment is determined purely by the meanings of the concepts in use. A standard example is “All bachelors are unmarried.” The truth of this statement follows immediately just by understanding what it means, as the concept of bachelor already contains the predicate unmarried. Synthetic judgments, on the other hand, are not fixed in truth value merely by the meanings of the concepts in use. These judgments amplify our knowledge and bring us to genuinely new conclusions about our concepts. An example: “The President is ornery.” This certainly doesn’t follow by definition; you’d have to go out and watch the news to realize its truth.

We can now put the challenge to metaphysics slightly differently. Metaphysics purports to be discovering truths that are both necessary (and therefore a priori) as well as substantive (adding to our concepts and thus synthetic). But this category of synthetic a priori judgments seems a bit mysterious. Evidently, the truth values of such judgments can be determined without referring to experience, but can’t be determined by merely the meanings of the relevant concepts. So apparently something further is required besides the meanings of concepts in order to make a synthetic a priori judgment. What is this thing?

Kant’s answer is that the further requirement is pure intuition and pure concepts. These terms need explanation.

Pure Intuitions

For Kant, an intuition is a direct, immediate representation of an object. An obvious example of this is sense perception; looking at a cup gives you a direct and immediate representation of an object, namely, the cup. But pure intuitions must be independent of experience, or else judgments based on them would not be a priori. In other words, the only type of intuition that could possibly be a priori is one that is present in all possible perceptions, so that its existence is not contingent upon what perceptions are being had. Kant claims that this is only possible if pure intuitions represent the necessary preconditions for the possibility of perception.

What are these necessary preconditions? Kant famously claimed that the only two are space and time. This implies that all of our perceptions have spatiotemporal features, and indeed that perception is only possible in virtue of the existence of space and time. It also implies, according to Kant, that space and time don’t exist outside of our minds!  Consider that pure intuitions exist equally in all possible perceptions and thus are independent of the actual properties of external objects. This independence suggests that rather than being objective features of the external world, space and time are structural features of our minds that frame all our experiences.

This is why Kant’s philosophy is a species of idealism. Space and time get turned into features of the mind, and correspondingly appearances in space and time become internal as well. Kant forcefully argues that this view does not make space and time into illusions, saying that without his doctrine “it would be absolutely impossible to determine whether the intuitions of space and time, which we borrow from no experience, but which still lie in our representation a priori, are not mere phantasms of our brain.”

The pure intuitions of space and time play an important role in Kant’s philosophy of mathematics: they serve to justify the synthetic a priori status of geometry and arithmetic. When we judge that the sum of the interior angles of a triangle is 180º, for example, we do so not purely by examining the concepts triangle, sum, and angle. We also need to consult the pure intuition of space! And similarly, our affirmations of arithmetic truths rely upon the pure intuition of time for their validity.

Pure Concepts

Pure intuition is only one part of the story. We don’t just perceive the world, we also think about our perceptions. In Kant’s words, “Thoughts without content are empty; intuitions without concepts are blind. […] The understanding cannot intuit anything, and the senses cannot think anything. Only from their union can cognition arise.” As pure intuitions are to perceptions, pure concepts are to thought. Pure concepts are necessary for our empirical judgments, and without them we could not make sense of perception. It is this category in which causality falls.

Kant’s argument for this is that causality is a necessary condition for the judgment that events occur in a temporal order. He starts by observing that we don’t directly perceive time. For instance, we never have a perception of one event being before another, we just perceive one and, separately, the other. So to conclude that the first preceded the second requires something beyond perception, that is, a concept connecting them.

He argues that this connection must be necessary: “For this objective relation to be cognized as determinate, the relation between the two states must be thought as being such that it determines as necessary which of the states must be placed before and which after.” And as we’ve seen, the only way to get a necessary connection between perceptions is through a pure concept. The required pure concept is the relation of cause and effect: “the cause is what determines the effect in time, and determines it as the consequence.” So starting from the observation that we judge events to occur in a temporal order, Kant concludes that we must have a pure concept of cause and effect.

What about particular causal rules, like that striking a match produces a flame? Such rules are not derived solely from experience, but also from the pure concept of causality, on which their existence depends. It is the presence of the pure concept that allows the inference of these particular rules from experience, even though they postulate a necessary connection.

Now we can see how different Kant and Hume’s conceptions of causality are. While Hume thought that the traditional concept of causality as a necessary connection was beyond rescue and extraneous to our perceptions, Kant sees it as a bedrock principle of experience that is necessary for us to be able to make sense of our perceptions at all. Kant rejects Hume’s definition of cause in terms of constant conjunction on the grounds that it “cannot be reconciled with the scientific a priori cognitions that we actually have.”

Despite this great gulf between the two philosophers’ conceptions of causality, there are some similarities. As we saw above, Kant agrees wholeheartedly with Hume that perception alone is insufficient for concluding that there is a necessary connection between events. He also agrees that a purely analytic approach is insufficient. Since Kant sees pure intuitions and pure concepts as features of the mind, not the external world, both philosophers deny that causation is an objective relationship between things in themselves (as opposed to perceptions of things). Of course, Kant would deny that this makes causality an illusion, just as he denied that space and time are made illusory by his philosophy.

Of course, it’s impossible to know to what extent the two philosophers would have actually agreed, had Hume been able to read Kant’s responses to his works. Would he have been convinced that synthetic a priori judgments really exist? If so, would he accept Kant’s pure intuitions and pure concepts? I suspect that at the crux of their disagreement would be Kant’s claim that math is synthetic a priori. While Hume never explicitly proclaims math’s analyticity (he didn’t have the term, after all), it seems more in line with his views on algebra and arithmetic as purely concerning the way that ideas relate to one another. It is also more in line with the axiomatic approach to mathematics familiar to Hume, in which one defines a set of axioms from which all truths about the mathematical concepts involved necessarily follow.

If Hume did maintain math’s analyticity, then Kant’s arguments about the importance of synthetic a priori knowledge would probably hold much less sway for him, and would largely amount to an appeal to the validity of metaphysical knowledge, which Hume already doubted. Hume also would likely want to resist Kant’s idealism; in Section XII of the Enquiry he mocks philosophers that doubt the connection between the objects of our senses and external objects, saying that if you “deprive matter of all its intelligible qualities, both primary and secondary, you in a manner annihilate it and leave only a certain unknown, inexplicable something as the cause of our perceptions – a notion so imperfect that no skeptic will think it worthwhile to contend against it.”

Deriving the Lorentz transformation

My last few posts have been all about visualizing the Lorentz transformation, the coordinate transformation in special relativity. But where does this transformation come from? In this post, I’ll derive it from basic principles. I first saw this derivation probably a year ago, and have since tried, unsuccessfully, to re-find the source. It isn’t the algebraically simplest derivation I’ve seen, but it is the conceptually simplest. The principles we’ll use to derive the transformation should all seem extremely obvious to you.

So let’s dive straight in!

The Lorentz transformation in full generality is a 4D matrix that tells you how to transform spacetime coordinates in one inertial reference frame to spacetime coordinates in another inertial reference frame. It turns out that once you’ve found the Lorentz transformation for one spatial dimension, it’s quite simple to generalize it to three spatial dimensions, so for simplicity we’ll just stick to the 1D case. The Lorentz transformation also allows you to transform to a coordinate system that is both translated some distance and rotated some angle. Both of these are pretty straightforward, and work the way we intuitively think rotation and translation should work. So I’ll not consider them either. The interesting part of the Lorentz transformation is what happens when we transform between reference frames that are in relative motion (moving with respect to one another). Strictly speaking, this is called a Lorentz boost. That’s what I’ll be deriving for you: the 1D Lorentz boost.

So, we start by imagining some reference frame, in which an event is labeled by its temporal and spatial coordinates: t and x. Then we look at a new reference frame moving at velocity v with respect to the starting reference frame, and describe the temporal and spatial coordinates of the same event in this new coordinate system: t’ and x’. In general, these new coordinates can be any function whatsoever of the starting coordinates and the velocity v.

Screen Shot 2018-12-09 at 10.31.11 PM.png
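In symbols (my reconstruction of the relationship the image expresses, using the f and g referred to below):

t' = f(t, x, v), \qquad x' = g(t, x, v)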

To narrow down what these functions f and g might be, we need to postulate some general relationship between the primed and unprimed coordinate system.

So, our first postulate!

1. Straight lines stay straight.

Our first postulate is that all observers in inertial reference frames will agree about whether an object is moving at a constant velocity. Since objects moving at constant velocities trace out straight lines on diagrams of position vs. time, this is equivalent to saying that a straight path through spacetime in one reference frame is a straight path through spacetime in all reference frames.

More formally, if x is proportional to t, then x’ is proportional to t’ (though the constant of proportionality may differ).

Screen Shot 2018-12-09 at 10.41.03 PM.png

This postulate turns out to be immensely powerful. There is a special name for the types of transformations that keep straight lines straight: they are linear transformations. (Note, by the way, that the linearity is only in the coordinates t and x, since those are the things that retain straightness. There is no guarantee that the dependence on v will be linear, and in fact it will turn out not to be.)

 These transformations are extremely simple, and can be represented by a matrix. Let’s write out the matrix in full generality:

Screen Shot 2018-12-09 at 10.45.02 PM.png

We’ve gone from two functions (f and g) to four (A, B, C, and D). But in exchange, each of these four functions is now only a function of one variable: the velocity v. For ease of future reference, I’ve chosen to name the matrix T(v).
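As a hedged reconstruction of that matrix equation (the placement of A, B, C, and D is my assumption, chosen so that it matches the constraints derived below):

\begin{pmatrix} t' \\ x' \end{pmatrix} = T(v) \begin{pmatrix} t \\ x \end{pmatrix}, \qquad T(v) = \begin{pmatrix} A(v) & B(v) \\ C(v) & D(v) \end{pmatrix}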

So, our first postulate gives us linearity. On to the second!

2. An object at rest in the starting reference frame is moving with velocity -v in the moving reference frame

This is more or less definitional. If somebody tells you that they have a function that transforms coordinates from one reference frame to a moving reference frame, then the most basic check you can do to see if they’re telling the truth is to verify that objects at rest in the starting reference frame end up moving in the final reference frame. And again, it seems to follow from what it means for the reference frame to be moving right at 1 m/s that the initially stationary objects should end up moving left at 1 m/s.

Let’s consider an object sitting at rest at x = 0 in the starting frame of reference. Then we have:

Screen Shot 2018-12-09 at 10.52.06 PM.png

We can plug this into our matrix to get a constraint on the functions A and C:

Screen Shot 2018-12-09 at 10.54.59 PM.png

Great! We’ve gone from four functions to three!

Screen Shot 2018-12-09 at 10.56.02 PM.png
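Spelled out (again my reconstruction rather than the original image): the object at rest has x = 0 and must satisfy x' = -v t', so

t' = A(v)\, t, \qquad x' = C(v)\, t = -v\, t' \quad \Rightarrow \quad C(v) = -v\, A(v)

which is the promised reduction from four unknown functions to three.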

3. Moving to the left at velocity v and to the right at the same velocity is the same as not moving at all

More specifically: Start with any reference frame. Now consider a new reference frame that is moving at velocity v with respect to the starting reference frame. Now, from this new reference frame, consider a third reference frame that is moving at velocity -v. This third reference frame should be identical to the one we started with. Got it?

Formally, this is simply saying the following:

Screen Shot 2018-12-09 at 11.01.36 PM.png

(I is the identity matrix.)

To make this equation useful, we need to say more about T(-v). In particular, it would be best if we could express T(-v) in terms of our three functions A(v), B(v), and D(v). We do this with our next postulate:

4. Moving at velocity -v is the same as turning 180°, then moving at velocity v, then turning 180° again.

Again, this is quite self-explanatory. As a geometric fact, the reference frame you end up with by turning around, moving at velocity v, and then turning back has got to be the same as the reference frame you’d end up with by moving at velocity -v. All we need to formalize this postulate is the matrix corresponding to rotating 180°.

Screen Shot 2018-12-09 at 11.07.28 PM.png

There we go! Rotating by 180° is the same as taking every position in the starting reference frame and flipping its sign. Now we can write our postulate more precisely:

Screen Shot 2018-12-09 at 11.09.47 PM

Screen Shot 2018-12-09 at 11.10.44 PM.png
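In symbols (my reconstruction): the 180° rotation in one spatial dimension simply flips the sign of x, so

P = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \qquad T(-v) = P\, T(v)\, P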

Now we can finally use Postulate 3!

Screen Shot 2018-12-09 at 11.11.56 PM

Doing a little algebra, we get…

Screen Shot 2018-12-09 at 11.12.42 PM.png

(You might notice that we can only conclude that A = D if we reject the possibility that A = B = 0. We are allowed to do this because allowing A = B = 0 gives us a trivial result in which a moving reference frame experiences no time. Prove this for yourself!)

Now we have managed to express all four of our starting functions in terms of just one!

Screen Shot 2018-12-09 at 11.18.23 PM.png
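If I reconstruct the algebra correctly, everything can now be written in terms of the single unknown function A(v):

D(v) = A(v), \qquad C(v) = -v\, A(v), \qquad B(v) = \frac{1 - A(v)^2}{v\, A(v)}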

So far our assumptions have been grounded almost entirely in a priori considerations about what we mean by velocity. It’s pretty amazing how far we got with so little! But to progress, we need to include one final, a posteriori postulate, the one that motivated Einstein to develop special relativity in the first place: the invariance of the speed of light.

5. Light’s velocity is c in all reference frames.

The motivation for this postulate comes from mountains of empirical evidence, as well as good theoretical arguments from the nature of light as an electromagnetic phenomenon. We can write it quite simply as:

Screen Shot 2018-12-09 at 11.43.23 PM

Plugging in our transformation, we get:

Screen Shot 2018-12-09 at 11.43.28 PM

Multiplying the time coordinate by c must give us the space coordinate:

Screen Shot 2018-12-10 at 3.27.16 AM

And we’re done with the derivation!

Summarizing our five postulates:

Screen Shot 2018-12-10 at 12.37.23 AM.png

And our final result:

Screen Shot 2018-12-10 at 3.29.09 AM.png
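For reference, the standard form of the 1D Lorentz boost, which is what this derivation arrives at:

t' = \gamma \left( t - \frac{v x}{c^2} \right), \qquad x' = \gamma \left( x - v t \right), \qquad \gamma = \frac{1}{\sqrt{1 - v^2/c^2}}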

Swapping the past and future

There are a few more cool things you can visualize with the special relativity program from my last post.

First of all, a big theme of the last post was the ambiguity of temporal orderings. It’s easy to see the temporal ordering of events when there are only three, but it gets much harder when there are many events. Let’s actually display the temporal order on the visualization, so that we can see how it changes for different frames of reference.

Display Order Of Three Events

Order of Many Events.gif

Looking at this second GIF, you can see the immense ambiguity that there is in the temporal order of events.

Now, where things get even more interesting is when we consider the spacetime coordinates of events that are not in your future light cone. Check this out:

Outside the Light Cone.gif

Here’s a more detailed image of the paths traced out by events as you change your velocity:

Screen Shot 2018-12-06 at 10.22.20 PM.png

Instead of just looking at events in your future light cone, we’re now also looking at events outside of your light cone!

We chose to look at a bunch of events that are initially all in your future (in the frame of reference where v = 0). Notice now that as we vary the velocity, some of these events end up at earlier times than you! In other words, by changing your frame of reference, events that were in your future can end up in your past. And vice versa; events in the past of one frame of reference can be in the future in the other.

We can see this very clearly by considering just two events.

Future Past Swap.gif

In the v = 0 frame, Red and Green are simultaneous with you. But for v > 0, Green comes before Red, which comes before you, and for v < 0, Green comes after Red, which comes after you. The lesson is the following: when considering events outside of your light cone, there is no fact of the matter about which events are in your future and which are in your past.
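A quick worked example with made-up numbers (using the standard boost, in units where c = 1): take an event at t = 1, x = 5 in your frame, which is outside your light cone since x > ct. Boosting to v = 0.5 gives

t' = \gamma (t - v x) = \frac{1 - 0.5 \times 5}{\sqrt{1 - 0.5^2}} \approx -1.73

so the event that was in your future is now in your past, while boosting to v = -0.5 instead pushes it even further into your future.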

Now, notice that in the above GIFs we never see events that are in causal contact leave causal contact, or vice versa. This holds true in general. While things certainly do get weirder when considering events outside your light cone, it is still the case that all observers will agree on what events are in causal contact with one another. And just like before, the temporal ordering of events in causal contact does not depend on your frame of reference. In other words, basketballs are always tossed before they go through the net, even outside your light cone.

The same holds when considering interactions between a pair of events that straddle either side of your light cone:

Straddling No Cause.gif

Straddling With Cause

If A is in B’s light cone from one frame of reference, then A is in B’s light cone from all frames of reference. And if A is out of B’s light cone in one frame of reference, then it is out of B’s light cone in all frames of reference. Once again, we see that special relativity preserves as absolute our bedrock intuitions about causality, even when many of our intuitions about time’s objectivity fall away.

Now, all of the implications of special relativity that I’ve discussed so far have been related to time and causality. But there’s also some strange stuff that happens with space. For instance, let’s consider a series of events corresponding to an object sitting at rest some distance away from you. On our diagram this looks like the following:

Screen Shot 2018-12-08 at 11.12.10 PM.png

What does this look like if we are moving towards the object? Obviously the object should now be getting closer to us, so we expect the red line to tilt inwards towards the x = 0 point. Here’s what we see at 80% of the speed of light:

Screen Shot 2018-12-08 at 11.14.01 PM.png

As we expected, the object now rushes towards us from our frame of reference, and quickly passes us by and moves off to the left. But notice the spatial distortion in the image! At the present moment (t = 0), the object looks significantly closer than it was previously. (You can see this by starting from the center point and looking to the right to see how much distance you cover before intersecting with the object. This is the distance to the object at t = 0.)

This is extremely unusual! Remember, the moving frame of reference is at the exact same spatial position at t = 0 as the still frame of reference. So whether I am moving towards an object or standing still appears to change how far away the object presently is!

This is the famous phenomenon of length contraction. If we imagine placing two objects at different distances from the origin, each at rest with respect to the v = 0 frame, then moving towards them results in both of them getting closer to us as well as to each other, and thus the pair shrinks! Evidently, when we move, the universe shrinks!
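To put a number on this (the standard length contraction formula, not something read off the figures): an object with rest length L appears contracted to

L' = L \sqrt{1 - v^2/c^2}

so at the 80% of light speed used in the images above, distances along the direction of motion shrink by a factor of 0.6.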

Contraction

One last effect we can see in the diagram appears to be a little at odds with what I’ve just said. This is that the observed distance between yourself and the object increases as you move towards it (and as the actual distance shrinks). Why? Well, what you observe is dictated by the beams of light that make it to your eye. So at the moment t = 0, what you are observing is everything along the two diagonals in the bottom half of the images. And in the second image, where you are moving towards the object, the place where the object and diagonal intersect is much further away than it is in the first image! Evidently, moving towards an object makes it appear further away, even though in reality it is getting closer to you!

This holds as a general principle. The reason? When you observe an object, you are really observing it as it was some time in the past (however much time it took for light to reach your eye). And when you move towards an object, that past moment you are observing falls further into the past. (This is sort of the flip-side of time dilation.) Since you are moving towards the object, looking further into the past means looking at the object when it was further away from you. And so therefore the object ends up appearing more distant from you than before!

There’s a bunch more weird and fascinating effects that you can spot in these types of visualizations, but I’ll stop there for now.

Visualizing Special Relativity

I’ve been thinking a lot about special relativity recently, and wrote up a fun program for visualizing some of its stranger implications. Before going on to these visualizations, I want to recommend the Youtube channel MinutePhysics, which made a fantastic primer on the subject. I’ll link the first few of these here, as they might help with understanding the rest of the post. I highly recommend the entire series, even if you’re already pretty familiar with the subject.

Now, on to the pretty images! I’m still trying to determine whether it’s possible to embed applets in my posts, so that you can play with the program for yourself. Until I figure that out, GIFs will have to suffice.

lots of particles

Let me explain what’s going on in the image.

First of all, the vertical direction is time (up is the future, down is the past), and the horizontal direction is space (which is 1D for simplicity). What we’re looking at is the universe as described by an observer at a particular point in space and time. The point that this observer is at is right smack-dab in the center of the diagram, where the two black diagonal lines meet. These lines represent the observer’s light cone: the paths through spacetime that would be taken by beams of light emitted in either direction. And finally, the multicolored dots scattered in the upper quadrant represent other spacetime events in the observer’s future.

Now, what is being varied is the velocity of the observer. Again, keep in mind that the observer is not actually moving through time in this visualization. What is being shown is the way that other events would be arranged spatially and temporally if the observer had different velocities.
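To make this concrete, here’s a minimal sketch (plain Python, not the actual code behind these GIFs; the function names and sample events are my own) of the transformation applied to each event’s (t, x) coordinates as the observer’s velocity is varied:

import math

def lorentz_boost(t, x, v, c=1.0):
    """Coordinates of the event (t, x) as described by an observer moving at velocity v."""
    gamma = 1.0 / math.sqrt(1.0 - (v / c) ** 2)
    return gamma * (t - v * x / c**2), gamma * (x - v * t)

def galilean_boost(t, x, v):
    """The classical transformation, for comparison: time is left untouched."""
    return t, x - v * t

# A few events in the observer's future, re-described at half the speed of light:
events = [(1.0, 0.5), (2.0, -1.0), (3.0, 2.5)]
print([lorentz_boost(t, x, v=0.5) for t, x in events])

Each frame of the GIFs then amounts to re-plotting the events at a slightly different v.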

Take a second to reflect on how you would expect this diagram to look classically. Obviously the temporal positions of events would not depend upon your velocity. What about the spatial positions of events? Well, if you move to the right, events in your future and to the right of you should be nearer to you than they would be had you not been in motion. And similarly, events in your future left should be further to the left. We can easily visualize this by plugging in the classical Galilean transformation:

Classical Transformation.gif

Just as we expected, time positions stay constant and spatial positions shift according to your velocity! Positive velocity (moving to the right) moves future events to the left, and negative velocity moves them to the right. Now, technically this image is wrong. I’ve kept the light paths constant, but even these would shift under the classical transformation. In reality we’d get something like this:

Classical Corrected.gif

Of course, the empirical falsity of this prediction that the speed of light should vary according to your own velocity is what drove Einstein to formulate special relativity. Here’s what happens with just a few particles when we vary the velocity:

RGB Transform

What I love about this is how you can see so many effects in one short gif. First of all, the speed of light stays constant. That’s a good sign! A constant speed of light is pretty much the whole point of special relativity. Secondly, and incredibly bizarrely, the temporal positions of objects depend on your velocity!! Objects to your future right don’t just get further away spatially when you move away from them, they also get further away temporally!

Another thing that you can see in this visualization is the relativity of simultaneity. When the velocity is zero, Red and Blue are at the same moment of time. But if our velocity is greater than zero, Red falls behind Blue in temporal order. And if we travel at a negative velocity (to the left), then we would observe Red as occurring after Blue in time. In fact, you can find a velocity that makes any two of these three points simultaneous!

This leads to the next observation we can make: The temporal order of events is relative! The orderings of events that you can observe include Red-Green-Blue, Green-Red-Blue, Green-Blue-Red, and Blue-Green-Red. See if you can spot them all!

This is probably the most bonkers consequence of special relativity. In general, we cannot say without ambiguity that Event A occurred before or after Event B. The notion of an objective temporal ordering of events simply must be discarded if we are to hold onto the observation of a constant speed of light.

Are there any constraints on the possible temporal orderings of events? Or does special relativity commit us to having to say that from some valid frames of reference, the basketball going through the net preceded the throwing of the ball? Well, notice that above we didn’t get all possible orders… in particular we didn’t have Red-Blue-Green or Blue-Red-Green. It turns out that in general, there are some constraints we can place on temporal orderings.

Just for fun, we can add in the future light cones of each of the three events:

RGB with Light Cones.gif

Two things to notice: First, all three events are outside each others’ light cones. And second, no event ever crosses over into another event’s light cone. This makes some intuitive sense, and gives us an invariant that holds true in all reference frames: Events that are outside each others’ light cones from one perspective are outside each others’ light cones from all perspectives. Same thing for events that are inside each others’ light cones.

Conceptually, events being inside each others’ light cones corresponds to them being in causal contact. So another way we can say this is that all observers will agree on what the possible causal relationships in the universe are. (For the purposes of this post, I’m completely disregarding the craziness that comes up when we consider quantum entanglement and “spooky action at a distance.”) 

Now, is it ever possible for events in causal contact to switch temporal order upon a change in reference frame? Or, in other words, could effects precede their causes? Let’s look at a diagram in which one event is contained inside the light cone of another:

RGB Causal

Looking at this visualization, it becomes quite obvious that this is just not possible! Blue is fully contained inside the future light cone of Red, and no matter what frame of reference we choose, it cannot escape this. Even though we haven’t formally proved it, I think that the visualization gives the beginnings of an intuition about why this is so. Let’s postulate this as another absolute truth: If Event A is contained within the light cone of Event B, all observers will agree on the temporal order of the two events. Or, in plainer language, there can be no controversy over whether a cause precedes its effects.

I’ll leave you with some pretty visualizations of hundreds of colorful events transforming as you change reference frames:

Pretty Transforms LQ

And finally, let’s trace out the set of possible space-time locations of each event.

Hyperbolas

Screen Shot 2018-12-06 at 3.22.43 PM.png

Try to guess what geometric shape these paths are! (They’re not parabolas.) Hint.

 

Fractals and Epicycles

There is no bilaterally-symmetrical, nor eccentrically-periodic curve used in any branch of astrophysics or observational astronomy which could not be smoothly plotted as the resultant motion of a point turning within a constellation of epicycles, finite in number, revolving around a fixed deferent.

Norwood Russell Hanson, “The Mathematical Power of Epicyclical Astronomy”

 

A friend recently showed me this image…

hilbert_epicycle.gif

…and thus I was drawn into the world of epicycles and fractals.

Epicycles were first used by the Greeks to reconcile observational data of the motions of the planets with the theory that all bodies orbit the Earth in perfect circles. It was found that epicycles allowed astronomers to retain their belief in perfectly circular orbits, as well as the centrality of Earth. The cost of this, however, was a system with many adjustable parameters (as many parameters as there were epicycles).

There’s a somewhat common trope about adding on endless epicycles to a theory, the idea being that by being overly flexible and accommodating of data you lose epistemic credibility. This happens to fit perfectly with my most recent posts on model selection and overfitting! The epicycle view of the solar system is one that is able to explain virtually any observational data. (There’s a pretty cool reason for this that has to do with the properties of Fourier series, but I won’t go into it.) The cost of this is a massive model with many parameters. The heliocentric model of the solar system, coupled with the Newtonian theory of gravity, turns out to be able to match all the same data with far fewer adjustable parameters. So by all of the model selection criteria we went over, it makes sense to switch over from one to the other.

Of course, it is not the case that we should have been able to tell a priori that an epicycle model of the planets’ motions was a bad idea. “Every planet orbits Earth on at most one epicycle”, for instance, is a perfectly reasonable scientific hypothesis… it just so happened that it didn’t fit the data. And adding epicycles to improve the fit to data is also not bad scientific practice, so long as you aren’t ignoring other equally good models with fewer parameters.

Okay, enough blabbing. On to the pretty pictures! I was fascinated by the Hilbert curve drawn above, so I decided to write up a program of my own that generates custom fractal images from epicycles. Here are some gifs I created for your enjoyment:

Negative doubling of angular velocity

(Each arm rotates in the opposite direction of the previous arm, and at twice its angular velocity. The length of each arm is half that of the previous.)
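For the curious, here’s a rough sketch (plain Python rather than the Processing.py source used for the animations; the starting length, angular velocity, and sampling rate are arbitrary choices) of how the tip of such a chain of arms can be computed for the negative-doubling rule just described:

import cmath

def epicycle_tip(t, n_arms=20):
    """Position traced at time t by a chain of arms, where each arm is half as long
    as the previous one and rotates in the opposite direction at twice the angular velocity."""
    position = 0 + 0j
    length, omega = 1.0, 1.0
    for _ in range(n_arms):
        position += length * cmath.exp(1j * omega * t)
        length *= 0.5    # each arm is half the length of the previous
        omega *= -2.0    # opposite direction, twice the angular velocity
    return position

# Sampling the tip at many times traces out the fractal curve.
path = [epicycle_tip(0.001 * k) for k in range(10000)]

The other GIFs just swap in different rules for how length and angular velocity change from one arm to the next.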

negative_doubling

Trebling of angular velocity

trebling.gif

Negative trebling

negative_trebling

Here’s a still frame of the final product for N = 20 epicycles:

Screen Shot 2018-11-27 at 7.23.55 AM

Quadrupling

epicycles_frequency_quadrupling.gif

ω_n ~ (n+1)·2^n

(or, the Fractal Frog)

(n+1)*2^n.gif

ω_n ~ n, r_n ~ 1/n

radius ~ 1:n, frequency ~ n.gif

ω_n ~ n, constant r_n

singularity

ω_n ~ 2^n, r_n ~ 1/n^2

pincers

And here’s a still frame of N = 20:

high res pincers

(All animations were built using Processing.py, which I highly recommend for quick and easy construction of visualizations.)

How to Learn From Data, Part 2: Evaluating Models

I ended the last post by saying that the solution to the problem of overfitting all relied on the concept of a model. So what is a model? Quite simply, a model is just a set of functions, each of which is a candidate for the true distribution that generates the data.

Why use a model? If we think about a model as just a set of functions, this seems kinda abstract and strange. But I think it can be made intuitive, and that in fact we reason in terms of models all the time. Think about how physicists formulate their theories. Laws of physics have two parts: First they specify the types of functional relationships there are in the world. And second, they specify the value of particular parameters in those functions. This is exactly how a model works!

Take Newton’s theory of gravity. The fundamental claim of this theory is that there is a certain functional relationship between the quantities a (the acceleration of an object), M (the mass of a nearby object), and r (the distance between the two objects): a ~ M/r2. To make this relationship precise, we need to include some constant parameter G that says exactly what this proportionality is: a = GM/r2.

So we start off with a huge set of probability distributions over observations of the accelerations of particles (one for each value of G), and then we gather data to see which value of G is most likely to be right. Einstein’s theory of gravity was another different model, specifying a different functional relationship between a, M, and r and involving its own parameters. So to compare Einstein’s theory of gravity to Newton’s is to compare different models of the phenomenon of gravitation.
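As a toy sketch of the idea that a model is a parameterized family of functions (purely illustrative; the helper name newton_model is made up):

def newton_model(G):
    """One member of Newton's model: the functional form a = G*M/r^2, pinned down by a particular G."""
    return lambda M, r: G * M / r**2

# The model itself is the whole family {newton_model(G) for all G};
# fitting the model means using data to decide which member is most probable.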

When you start to think about it, the concept of a model is an enormous one. Models are ubiquitous, you can’t escape from them. Many fancy methods in machine learning are really just glorified model selection. For instance, a neural network is a model: the architecture of the network specifies a particular functional relationship between the input and the output, and the strengths of connections between the different neurons are the parameters of the model.

So. What are some methods for assessing predictive accuracy of models? The first one we’ll talk about is a super beautiful and clever technique called…

Cross Validation

The fundamental idea of cross validation is that you can approximate how well a model will do on the next data point by evaluating how the model would have done at predicting a subset of your data, if all it had access to was the rest of your data.

I’ll illustrate this with a simple example. Suppose you have a data set of four points, each of which is a tuple of two real numbers. Your two models are T1: that the relationship is linear, and T2: that the relationship is quadratic.

What we can do is go through each data point, selecting each in order, train the model on the non-selected data points, and then evaluate how well this trained model fits the selected data point. (By training the model, I mean something like “finding the curve within the model that best fits the non-selected data points.”) Here’s a sketch of what this looks like:

img_0045.jpeg

Notice that each different data point we select gives a different curve! This is important.

There are a bunch of different ways of doing cross validation. We just left out one data point at a time, but we could have left out two points at a time, or three, or any k less than the total number of points N. This is called leave-k-out cross validation. If we instead partition the data into n equal pieces and rotate which piece is held out for testing, we get what’s called n-fold cross validation (so each test set is a fraction 1/n of the data).

We also have some other choices besides how to partition the data. For instance, we can choose to train our model via a Likelihoodist procedure or a Bayesian procedure. And we can also choose to test our model in various ways, by using different metrics for evaluating the distance between the testing set and the trained model. In leave-one-out cross validation (LOOCV), a pretty popular method, both the training and testing procedures are Likelihoodist.
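Here’s a minimal sketch of LOOCV for the linear-versus-quadratic comparison above (the toy data set and the squared-error test metric are illustrative choices, not the example from the figures):

import numpy as np

data = np.array([(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.9)])  # four (x, y) points

def loocv_score(degree):
    """Average squared prediction error over all leave-one-out splits."""
    errors = []
    for i in range(len(data)):
        train = np.delete(data, i, axis=0)                     # train on the other three points
        test_x, test_y = data[i]                               # held-out point
        coeffs = np.polyfit(train[:, 0], train[:, 1], degree)  # best-fit curve within the model
        errors.append((np.polyval(coeffs, test_x) - test_y) ** 2)
    return float(np.mean(errors))

print("linear (T1):   ", loocv_score(1))
print("quadratic (T2):", loocv_score(2))  # lower score = better at predicting held-out data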

Now, there’s a practical problem with cross-validation, which is that it can take a long time to compute. If we do LOOCV with N data points and look at all ways of leaving out one data point, then we end up doing N optimization procedures (one for each time you train your model), each of which can be pretty slow. But putting aside practical issues, for thinking about theoretical rationality, this is a super beautiful technique.

Next we’ll move on to…

Bayesian Model Selection!

We had Bayesian procedures for evaluating individual functions. Now we can go full Bayes and apply it to models! For each model, we assess the posterior probability of the model given the data using Bayes’ rule as usual:

Pr(M | D) = \frac{Pr(D | M)}{Pr(D)} Pr(M)

Now… what is the prior Pr(M)? Again, it’s unspecified. Maybe there’s a good prior that solves overfitting, but it’s not immediately obvious what exactly it would be.

But there’s something really cool here. In this equation, there’s a solution to overfitting that pops up not out of the prior, but out of the LIKELIHOOD! I explain this here. The general idea is that models that are prone to overfitting get that way because their space of functions is very large. But if the space of functions is large, then the average prior on each function in the model must be small. So larger models have, on average, smaller values of the term Pr(f | M), and correspondingly (via Pr(D | M) = \sum_{f \in M} Pr(D | f, M) Pr(f | M) ) get a weaker update from evidence.

So even without saying anything about the prior, we can see that Bayesian model selection provides a potential antidote to overfitting. But just like before, we have the practical problem that computing Pr(M | D) is in general very hard. Usually evaluating Pr(D | M) involves calculating a really complicated many-dimensional integral, and calculating many-dimensional integrals can be very computationally expensive.

Might we be able to find some simple unbiased estimator of the posterior Pr(M | D)? It turns out that yes, we can. Take another look at the equation above.

Pr(M | D) = \frac{Pr(D | M)}{Pr(D)} Pr(M)

Since Pr(D) is a constant across models, we can ignore it when comparing models. So for our purposes, we can attempt to maximize the quantity:

Pr(D | M) Pr(M) = \sum_{f \in M} Pr(D | f, M) Pr(f | M) Pr(M)

If we assume that Pr(D | f) is twice differentiable in the model parameters and that Pr(f) behaves roughly linearly around the maximum likelihood function, we can approximate this as:

Pr(D | M) Pr(M) \approx \frac{Pr(D | f^*(D))}{N^{k/2}} Pr(M)

f* is the maximum likelihood function within the model, and k is the number of parameters of the model (e.g. k = 2 for a linear model and k = 3 for a quadratic model).

We can make this a little neater by taking a logarithm and defining a new quantity: the Bayesian Information Criterion.

argmax_M [ Pr(M | D) ] = argmax_M [ Pr(D | M) Pr(M) ]
\approx argmax_M [ \frac{Pr(D | f^*(D))}{N^{k/2}} Pr(M) ]
= argmax_M [ \log Pr(M) + \log Pr(D | f^*(D)) - \frac{k}{2} \log(N) ]
= argmin_M [ BIC - \log Pr(M) ]

BIC = \frac{k}{2} \log(N) - \log \left( Pr(D | f^*(D)) \right)

Thus, to maximize the posterior probability of a model M is roughly the same as to minimize the quantity BIC – log Pr(M).

If we assume that our prior over models is constant (that is, that all models are equally probable at the outset), then we just minimize BIC.

Bayesian Information Criterion

Notice how incredibly simple this expression is to evaluate! If all we’re doing is minimizing BIC, then we only need to find the maximum likelihood function f*(D), evaluate its log-likelihood log Pr(D | f*), and penalize it with the additive term (k/2) log(N)! This penalty scales in proportion to the model complexity, and thus helps us avoid overfitting.
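A sketch of the computation (following the definition of BIC used in this post, which differs from the more common textbook convention by an overall factor of 2):

import math

def bic(max_log_likelihood, k, n):
    """BIC as defined above: (k/2) * log(N) minus the log-likelihood at the maximum likelihood function f*."""
    return 0.5 * k * math.log(n) - max_log_likelihood

# Lower BIC is better. For example, compare
# bic(log_lik_linear, k=2, n=100) against bic(log_lik_quadratic, k=3, n=100).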

We can think about this as a way to make precise the claims above about Bayesian Model Selection penalizing overfitting in the likelihood. Remember that minimizing BIC made sense when we assumed a uniform prior over models (and therefore when our prior doesn’t penalize overfitting). So even when we don’t penalize overfitting in the prior, we still end up getting a penalty! This penalty must come from the likelihood.

Some more interesting facts about BIC:

  • It is approximately equal to the minimum description length criterion (which I haven’t discussed here)
  • It is only valid when N >> k. So it’s not good for small data sets or large models.
  • If the truth is contained within your set of models, then BIC will select the truth with probability 1 as N goes to infinity.

Okay, stepping back. We sort of have two distinct clusters of approaches to model selection. There’s Bayesian Model Selection and the Bayesian Information Criterion, and then there’s Cross Validation. Both sound really nice. But how do they compare? It turns out that in general they give qualitatively different answers. And in general, the answers you get using cross validation tend to be more predictively accurate than those that you get from BIC/BMS.

A natural question to ask at this point is: BIC was a nice simple approximation to BMS. Is there a corresponding nice simple approximation to cross validation? Well, first we must ask: which cross validation? Remember that there was a plethora of different forms of cross validation, each corresponding to a slightly different criterion for evaluating fits. We can’t assume that all these methods give the same answer.

Let’s choose leave-one-out cross validation. It turns out that yes, there is a nice approximation that is asymptotically equivalent to LOOCV! This is called the Akaike information criterion.

Akaike Information Criterion

First, let’s define AIC:

AIC = k - \log \left( Pr(D | f^*(D)) \right)

Like always, k is the number of parameters in M and f* is chosen by the Likelihoodist approach described in the last post.

What you get by minimizing this quantity is asymptotically equivalent to what you get by doing leave-one-out cross validation. Compare this to BIC:

BIC = \frac{k}{2} \log (N) - \log \left( Pr(D | f^*(D) ) \right)

There’s a qualitative difference in how the parameter penalty is weighted! BIC is going to have a WAY higher complexity penalty than AIC. This means that AIC should in general choose less simple models than BIC.
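To get a sense of the scale of that difference (my own numbers, using natural logs): with N = 1000 data points and a k = 3 parameter model, BIC’s complexity penalty is (3/2)·log(1000) ≈ 10.4, while AIC’s is just k = 3, and the gap only grows as N increases.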

Now, we’ve already seen one reason why AIC is good: it’s a super simple approximation of LOOCV and LOOCV is good. But wait, there’s more! AIC can be derived as an unbiased estimator of the Kullback-Leibler divergence DKL.

What is DKL? It’s a measure of the information theoretic distance of a model from truth. For instance, if the true generating distribution for a process is f, and our model of this process is g, then DKL tells us how much information we lose by using g to represent f. Formally, DKL is:

D_{KL}(f, g) = \int { f(x) \log \left(\frac{f(x)} {g(x)} \right) dx }

Notice that to actually calculate this quantity, you have to already know what the true distribution f is. So that’s unfeasible. BUT! AIC gives you an unbiased estimate of its value! (The proof of this is complicated, and makes the assumptions that N >> k, and that the model is not far from the truth.)

Thus, AIC gives you an unbiased estimate of which model is closest to the truth. Even if the truth is not actually contained within your set of models, what you’ll end up with is the closest model to it. And correspondingly, you’ll end up getting a theory of the process that is very similar to how the process actually works.

There’s another version of AIC that works well for smaller sample sizes and larger models called AICc.

AICc = AIC + \frac {k (k + 1)} {N - (k + 1)}

And interestingly, AIC can be derived as an approximation to Bayesian Model Selection with a particular prior over models. (Remember earlier when we showed that argmax_M \left( Pr(M | D) \right) \approx argmin_M \left( BIC - \log(Pr(M)) \right) ? Well, just set \log Pr(M) = BIC - AIC = k \left( \frac{1}{2} \log N - 1 \right) , and you get the desired result.) Interestingly, this prior ends up rewarding theories for having lots of parameters! This looks pretty bad… it seems like AIC is what you get when you take Bayesian Model Selection and then try to choose a prior that favors overfitting theories. But the practical use and theoretical virtues of AIC warrant, in my opinion, taking a closer look at what’s going on here. Perhaps what’s going on is that the likelihood term Pr(D | M) is actually doing too much to avoid overfitting, so in the end what we need in our prior is one that avoids underfitting! Regardless, we can think of AIC as specifying the unique good prior on models that optimizes for predictive accuracy.

There’s a lot more to be said from here. This is only a short wade into the waters of statistical inference and model selection. But I think this is a good place to stop, and I hope that I’ve given you a sense of the philosophical richness of the various different frameworks for inference I’ve presented here.