Armed with a clearer view of vector/matrix multiplication and homogeneous coordinates, it’s time to dissect the perspective projection.
Stated simply, the point of perspective projection is to make things that are farther away look smaller. Since the magnitude of z gets bigger as a point gets farther from the viewer, it follows that the projected x and y coordinates should vary inversely with z. However, as I quickly discovered when I wrote my first 3D program, merely dividing everything else by z is not quite enough.
In order to understand the mathematics required to do perspective projection properly, let’s back up and talk about the view frustum. I’ll use 2D illustrations, but the same math extends pretty easily to 3D. The view frustum:
The idea is to pretend that the computer display is a window looking “out” onto the 3D world. Mathematically, it’s exactly like looking out a real window. If you look out your window at a fixed point in the outside world (say, a mailbox across the street), you could imagine drawing a straight line from your eye to that point. The point where that line intersects the window is exactly where that mailbox will appear to you to be “on the window”. If you kept your head perfectly still, and painted exactly what you saw onto the surface of the window, it would look real. And that’s pretty much what we do when we render a 3D scene on a computer.
The diagram is in eye space – meaning the eye is at the origin. The near plane corresponds to the display – the window on the virtual world, which is a certain distance d from the eye. The far plane is the boundary that is as far as the eye can see. It’s necessary to have this limit because we need a unique number to represent every possible distance from the eye to any point in the view space, and computers don’t have infinite numeric precision. We can still put the far plane as far away as is practical, though. So, the view frustum neatly and precisely bounds everything the eye can see through the window.
In the 2D diagram, the point that we need to project is located at y. The goal is to figure out y’, the location on the near plane where the viewer will seem to see that point. One way to think about the math is in terms of similar triangles. Since the beige triangles are similar (they have the same angles), we know that the proportion of y’ to d is the same as the proportion of y to z:
$$\frac{y'}{d} = \frac{y}{z}$$
Multiplying both sides by d, we get:
$$y' = \frac{d \cdot y}{z}$$
There! Problem solved. Since y’ (and x’, in 3D) is what we were after, we have our projection formula. From here on, we’ll refer to x’ and y’ as proj_x and proj_y.
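Here's a minimal Python sketch of that formula. The function name `project` and the sample numbers are made up for illustration, and it assumes z is positive for points in front of the eye (the Direct3D-style convention); neither of those comes from the diagram itself.

```python
def project(coord: float, z: float, d: float) -> float:
    """Project one eye-space coordinate (x or y) onto the near plane.

    Uses the similar-triangles result: projected = d * coord / z.
    Assumes z > 0 means "in front of the eye" (an assumption for this sketch).
    """
    if z <= 0:
        raise ValueError("point must be in front of the eye (z > 0)")
    return d * coord / z


# A point 2 units up and 10 units away, seen through a window 1 unit from the eye.
proj_y = project(2.0, 10.0, 1.0)
print(proj_y)  # 0.2

# The same point twice as far away lands at half the height on the window.
print(project(2.0, 20.0, 1.0))  # 0.1
```

Doubling z halves the projected coordinate, which is the "farther away looks smaller" effect in its simplest form.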
Oh, but wait. There is one other thing. In real 3D systems, we generally need to map these coordinates into something called clip space, which is generally something like a cube running from -1 to +1 on every axis (OpenGL) or the half of that cube with z between 0 and +1 (Direct3D). For this, there is yet another way to look at the perspective projection, which will make the problem simpler:

See that? If we can just warp the whole view frustum into a rectangle, then all the perspective lines become parallel, and the y coordinate of the point out in space is now the same as the projected one! Also, if we take this just a little further and squash the rectangle into a cube, we’ll have our coordinates in clip space. So how is this warping of space accomplished? Hmm. Look at the result in this diagram – the warping turns y into y’, just like before, therefore we already have the formula that does this for x and y as a function of z:
$$x' = \frac{d \cdot x}{z}, \qquad y' = \frac{d \cdot y}{z}$$
That effectively makes the frustum rectangular, and all the perspective lines parallel. So much for the warp – on to the squash. Let’s start by thinking about extremities. If we can find formulas that work for the extremities, they’ll work for everything in between, too. To complete the transformation into a half-unit cube (I’m using Direct3D’s convention here), we need to map the following:
minimum and maximum x values at both near and far planes → -1 and +1
minimum and maximum y values at both near and far planes → -1 and +1
minimum and maximum z values → 0 and +1 (for OpenGL, we’d simply map to -1, +1 here)
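The list above quietly relies on the warp having already flattened the frustum: the top edge at the near plane and the top edge at the far plane have to end up at the same height before a single scale factor can send both to +1. Here is a small Python check of that, assuming the warp is y' = d * y / z. The names `warp` and `top_edge_y` and all the numbers are hypothetical, chosen only for the demonstration.

```python
def warp(y: float, z: float, d: float) -> float:
    """Apply the warp y' = d * y / z to one eye-space coordinate."""
    return d * y / z


def top_edge_y(z: float, d: float, half_height: float) -> float:
    """Height of the frustum's top edge at depth z.

    The edge is a straight line from the eye through the top of the
    near plane, so its height grows in proportion to z.
    """
    return half_height * z / d


d, z_far = 1.0, 100.0   # hypothetical near and far plane distances
half_height = 0.5       # hypothetical half-height of the near plane

# Sample the top edge at the near plane, mid-frustum, and the far plane.
warped = [warp(top_edge_y(z, d, half_height), z, d) for z in (d, 10.0, z_far)]
print(warped)  # every entry equals half_height
```

After the warp, the slanted top edge sits at the same height at every depth, so the remaining squash really is just a division by a constant.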
The x and y cases are really similar. The z case is a bit special, firstly because the view frustum’s z range is not symmetric about the origin (it runs from the near plane to the far plane, all on one side of the eye), and secondly because we eventually need to divide everything by z, which means we’re even going to divide the transformed z by z. I don’t expect that last bit to make sense just yet – it has to do with the matrix multiplication and homogeneous coordinates, which we’ll get to eventually.
I won’t beat around the bush anymore on these formulas. Let’s just get out a pencil and paper and work it out. For x and y, all we need to do is divide by half the width (or height) of the near plane. For example, suppose the near plane ranges from -100 to +100, so its width is 200. We want a point at 100 to map to +1. It’s easy to work out that dividing by half the width does the trick. A little algebra gives us (writing clip_x for the mapped coordinate):
$$\text{clip\_x} = \frac{\text{proj\_x}}{\text{width}/2} = \frac{2 \cdot \text{proj\_x}}{\text{width}}$$
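As a quick sanity check of the squash step, here is a small Python sketch that divides a near-plane coordinate by half the extent of the near plane. The function name `squash` and the parameter name `extent` are made up for this example; the -100 to +100 range is the one from the text.

```python
def squash(proj: float, extent: float) -> float:
    """Map a near-plane coordinate in [-extent/2, +extent/2] to [-1, +1].

    Dividing by half the extent is the same as multiplying by 2 / extent.
    """
    return proj / (extent / 2.0)


width = 200.0  # near plane spans -100 .. +100

print(squash(100.0, width))   # 1.0  (right edge)
print(squash(-100.0, width))  # -1.0 (left edge)
print(squash(50.0, width))    # 0.5  (halfway to the right edge)
```

The same function works for y by passing the height of the near plane instead of the width.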
Since we’re eventually going to compose a matrix to do all of these calculations in one go, it’s important to concatenate them all together first. Let’s combine the projection (warp) formula with the mapping (squash) formula:
$$\text{proj\_x} = \frac{x \cdot d}{z}$$
and
$$\text{clip\_x} = \frac{2 \cdot \text{proj\_x}}{\text{width}}$$
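From here on, we’re going to call d zNear, indicating that it’s the distance to the near plane along the z axis. A little more algebra, substituting
$$\text{proj\_x} = \frac{x \cdot \text{zNear}}{z}$$
into the second equation, gives:
$$\text{clip\_x} = \frac{2 \cdot \text{zNear} \cdot x}{\text{width} \cdot z}$$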
And for y:
$$\text{clip\_y} = \frac{2 \cdot \text{zNear} \cdot y}{\text{height} \cdot z}$$
I mentioned that the case for z is special. So special that I just finished working out the derivation for the first time just now. I’ll show it in the next post. Then we’ll have all the pieces we need to put the matrix together.
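While z waits for the next post, here is a hedged Python recap of the x and y results so far, assuming the combined formulas come out as clip_x = 2 * zNear * x / (width * z) and clip_y = 2 * zNear * y / (height * z). The function name `to_clip_xy`, the names `clip_x` and `clip_y`, and the frustum dimensions are all hypothetical, and z is assumed positive in front of the eye.

```python
def to_clip_xy(x: float, y: float, z: float,
               z_near: float, width: float, height: float) -> tuple[float, float]:
    """Warp and squash eye-space x and y in one step.

    width and height are the dimensions of the near plane.
    Assumes z > 0 in front of the eye (an assumption for this sketch).
    """
    if z <= 0:
        raise ValueError("point must be in front of the eye (z > 0)")
    clip_x = 2.0 * z_near * x / (width * z)
    clip_y = 2.0 * z_near * y / (height * z)
    return clip_x, clip_y


z_near, z_far = 1.0, 100.0   # hypothetical plane distances
width, height = 2.0, 1.5     # hypothetical near-plane size

# Top-right corner of the near plane.
print(to_clip_xy(width / 2, height / 2, z_near, z_near, width, height))

# Top-right corner of the far plane: the frustum is z_far / z_near times
# larger there, but the division by z brings it back to the same place.
scale = z_far / z_near
print(to_clip_xy(scale * width / 2, scale * height / 2, z_far,
                 z_near, width, height))
```

Both corners should land at (1.0, 1.0), which is the "extremities" requirement from the list earlier: the minimum and maximum x and y values at both the near and far planes map to -1 and +1.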