Let’s start with the basics. At minimum, a polygonal model consists of:

- a set of points in 3d space, called vertices
- relationships or connections between those points, which organize them into polygons, called faces

The faces are what the system actually renders. The vertices provide spatial information about the surface, and determine the ultimate locations and shapes of the faces. So far, so good. It seems clear that we need:

- a list of vertices. each vertex can be represented by a 3d vector that gives its position in space, relative to origin <0, 0, 0>
- a list of faces. each face should somehow indicate which vertices belong to it. for example, a face could be a list of indices into the list of vertices
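As a concrete sketch, here is what that minimal representation might look like in Python (the names and layout here are illustrative, not any standard format):

```python
# A minimal polygonal model: vertex positions plus faces that index into them.

# Four vertices of a unit square in the z=0 plane.
vertices = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (1.0, 1.0, 0.0),
    (0.0, 1.0, 0.0),
]

# One quad face, expressed as indices into the vertex list.
faces = [
    [0, 1, 2, 3],
]

# Recovering the positions for a face is a simple lookup.
face_positions = [vertices[i] for i in faces[0]]
```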

This does indeed give us enough information to render the model in some way. But as our rendering techniques become more sophisticated, we soon find that we need more information than this. Surface shading models generally depend on surface normals to calculate how light reflects or scatters in different directions. It is convenient to associate these normals with the vertices – either the lighting calculations will happen at the vertices, or else the normals themselves will be interpolated across the face so the lighting calculations can be performed at each pixel. A vertex normal is calculated by averaging the normals of the faces that share the vertex.
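That averaging step can be sketched in a few lines of naive Python (the helper names are mine): a vertex normal is the normalized sum of the normals of every face containing that vertex.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalized(v):
    length = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    return (v[0] / length, v[1] / length, v[2] / length)

def face_normal(positions):
    # Normal of a planar face, from its first three vertices
    # (assumes counter-clockwise winding).
    a, b, c = positions[:3]
    return normalized(cross(sub(b, a), sub(c, a)))

def vertex_normal(vertex_index, vertices, faces):
    # Sum the normals of every face that shares this vertex, then normalize.
    total = [0.0, 0.0, 0.0]
    for face in faces:
        if vertex_index in face:
            n = face_normal([vertices[i] for i in face])
            total = [t + c for t, c in zip(total, n)]
    return normalized(tuple(total))
```

For a vertex shared by a face in the z=0 plane (normal <0, 0, 1>) and a face in the x=0 plane (normal <1, 0, 0>), this gives roughly <0.707, 0, 0.707> – halfway between the two face normals, as you'd expect.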

As soon as we try to associate a normal with a vertex, and to figure out which face normals should contribute to its calculation, we immediately encounter an interesting problem of model representation, which I will call the hard-edge/soft-edge problem. Consider the following variations in shading on our model:

In the version on the left, the edges of these square faces are **hard**, or sharp. This gives the surface a faceted look – it is made of distinct, flat faces. Hard edges are useful for models of things like cubes and robots. In the version on the right, these faces are just part of a continuous, smooth surface. The edges are **soft** – in fact, we wish to downplay the presence of any edges altogether. Soft edges are good for models of things like spheres and human faces. Clearly these are very different effects. But what is the difference, exactly?

The difference is that the hard faces have all their normals pointing one way. The surface normals do not vary across the faces. In the smooth model, the surface normals do vary continuously across the faces. So what is the difference in terms of data representation? Well, given that we are (at least conceptually) interpolating the normals at the vertices across the faces, soft edges are actually the default effect. For hard edges, the only way the normals will *not* vary is if they are the *same at every vertex for a given face*. In particular, the normal at each vertex for that face should be the face normal itself. Look at the normals in the following illustration:

In the smooth surface, each vertex has just one normal, so each normal is shared between adjacent faces. In the hard-edged surface, it’s as if the normals are split at each vertex, so that each adjacent face gets its own normal. Unfortunately, this seems rather at odds with the notion of a *vertex normal*. These split normals are no longer really associated with a vertex. Rather, each normal is now associated with a vertex *and* a face – i.e. with a vertex-face pair.

How should we represent this in our model data? For the smooth case, it seems like we really do want to store the normals with the vertices. But for the hard-edge case, it seems like the normals belong with the faces. We could have two different representations, but there are lots of objects, like cylinders, that have both hard and smooth edges. For this reason, it would be much better if we could find a common representation that would accommodate both. We *could* actually make the more general case the standard – that is, always store the normals per-face. For the smooth case, this isn’t really necessary, but we can get the smooth effect by ensuring that adjacent faces have the same normals associated with shared vertices. That would look like this:

Here, the two cases have the same basic representation. And it isn’t awful… But there are a couple of things wrong with doing this in the smooth model. First, it duplicates data needlessly. Second, we’ve actually lost an interesting piece of information – the fact that those faces share normals at those vertices is no longer explicit in the model. The best we could do is compare the normals numerically in order to guess that, which is not very reliable.

What we need to do is to approach this like the relational data-modeling problem that it is. We need to get more precise about what the relationship between vertices, faces, and so-called vertex normals *really* is. One face has many vertices, and multiple faces may share the same vertex. For shared vertices, the normals may be shared as well, but they might just as well not be shared. That last part tells us we have to decouple the normals from the vertices, but still be able to relate faces to both. Another illustration should make this clearer:

The red dots are vertex positions. There are a straightforward, fixed number of vertex positions, regardless of how we share normals or connect faces. The purple dots represent the vertex normals. The blue lines are faces. For hard edges (left), normals are not shared among faces. For soft edges (right), they are. Both of the objections raised two paragraphs ago are solved. In the smooth model, normals are not duplicated. Even better, we have explicit information about which faces share which normals. This has at least one specific and important benefit: **each normal is associated with exactly the set of faces whose face normals contribute to the calculation of the vertex normal**. This makes calculation of the vertex normals from the model very convenient. Notice that the faces are no longer directly related to the vertex positions – instead, they are related to them via the vertex normals. This might seem a little funny at first, but it is just as convenient in practice as a direct relation, and it is a more accurate representation of what we are trying to model.

So far, our model consists of:

- vertex positions – 3d vectors representing points in 3d space
- vertex normals – not so much a value in itself, as a link that relates faces to vertices, while supporting hard and soft edges. It will contain a reference to the underlying vertex position, and a list of references to the faces that share it. The actual value, also a 3d vector, can be calculated easily from this face list.
- faces – a face is now a list of vertex normals. Indirectly, it can still be considered a list of vertex positions.
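Here's one way this might look in Python (a hypothetical sketch, not any standard API). The interesting part is that a hard edge and a soft edge differ only in whether two faces refer to the *same* VertexNormal object:

```python
class VertexNormal:
    # A link object: refers to an underlying vertex position, and keeps a
    # list of the faces that share it (whose face normals it averages).
    def __init__(self, position_index):
        self.position_index = position_index
        self.faces = []

class Face:
    # A face is a list of vertex normals; indirectly, a list of positions.
    def __init__(self, vertex_normals):
        self.vertex_normals = vertex_normals
        for vn in vertex_normals:
            vn.faces.append(self)

# Soft edge: two faces share one VertexNormal at the shared vertex.
shared = VertexNormal(position_index=0)
face_a = Face([shared])          # other vertex normals elided
face_b = Face([shared])

# Hard edge: each face gets its own VertexNormal for the same position.
left = VertexNormal(position_index=0)
right = VertexNormal(position_index=0)
face_c = Face([left])
face_d = Face([right])
```

Calculating the actual normal vector for `shared` means averaging the face normals of `shared.faces` – which is exactly faces a and b; for `left`, only face c contributes.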

That’s concise, and works beautifully. So we’re done, right?

Unfortunately, as soon as we introduce the next common piece of per-vertex data, we immediately run into a similar problem yet again. I’m talking about **UV coordinates**. UV coordinates are coordinates in texture space, which map a polygonal region of the texture to a face. Like normals, they are basically related to a face-vertex pair. UV coordinates may be shared by faces at a vertex, or each face might have its own. Note the similarity to the hard-edge/soft-edge problem:

The case on the left, like the hard-edge case for normals, requires each face to have its own set of UV coordinates. This is because the UV coordinates needed at the inside edges will be discontinuous in texture space. E.g. the U coordinate will go from zero at the left edge to one at the middle edge, and then from zero at the middle edge to one at the right edge. The case on the right, like the soft-edge case for normals, is better represented by sharing UV coordinates among adjacent faces.

Since the problem is so similar, can we just piggyback the UV coordinates onto the normals? Not quite. That would enable the correct relation between faces, UVs, and vertices, but not between UVs and normals. We can have shared normals where UVs are not shared, and shared UVs where normals are not shared. The cases are similar, but independent of one another. Therefore, we need another layer of indirection that works in the same way. Here’s what that looks like:

Whoa. What’s *that* all about? The green dots are a new level of indirection that I’ll call face vertices. They give us a way to group data that belongs to these face-vertex pairs that we’ve been touching on. When we were only considering vertex normals, we could afford to sort of lump the concepts of face vertices and vertex normals together. Now that we’ve introduced a new kind of *per-face, per-vertex data*, we need to make this concept formal so that we can make distinctions between kinds of data. The yellow dots are that new kind of data – the UV coordinates. The vertex normals can be shared, or not, as before. The UVs can also be shared, or not. Notice, though, that the face vertices themselves can only be shared if *all* the data associated with them is shared. That’s the bottom-right case in the illustration. In the bottom-left case, even though the UVs are shared, the face vertices cannot be shared because they must be able to refer uniquely to the vertex normals, which are not shared. We could remove this asymmetry by *never* sharing face vertices – and you might prefer that. Sharing them in the special case at the bottom-right is just a little efficiency.

With the addition of face vertices, any new kind of per-face, per-vertex data can be shared, or not. This begins to look like a general solution. Our model now consists of:

- vertex positions – 3d vectors representing points in 3d space
- vertex normals – links that relate face vertices to vertex positions, while supporting hard and soft edges. Each contains a reference to the underlying vertex position, and a list of references to the faces that share it. The actual value, also a 3d vector, can be calculated easily from this face list.
- UV coordinates – vectors representing points in texture space
- face vertices – links that relate faces to per-face, per-vertex data, such as vertex normals, UV coordinates, skin weights, tangents, and so on. Each contains references to this data, and those references can be to shared or separate referents.
- faces – each face is a list of face vertices. Indirectly, via the face vertices, it can also be considered a list of vertex positions, vertex normals, UV coordinates, and other per-face, per-vertex data. A face typically also contains a reference to the material to be applied to that face.
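As a Python sketch (the names are mine, and material handling is omitted), the full set of relationships might look like this. Note how the bottom-left case from the illustration falls out naturally: two FaceVertex objects can share a UV while holding distinct normals.

```python
class VertexNormal:
    def __init__(self, position_index):
        self.position_index = position_index
        self.faces = []

class UV:
    def __init__(self, u, v):
        self.u, self.v = u, v

class FaceVertex:
    # Groups all the per-face, per-vertex data for one corner of one face.
    def __init__(self, normal, uv):
        self.normal = normal
        self.uv = uv

class Face:
    def __init__(self, face_vertices):
        self.face_vertices = face_vertices
        for fv in face_vertices:
            fv.normal.faces.append(self)

# A hard edge with shared UVs: the UV is shared, but the normals (and
# therefore the face vertices) are not.
uv = UV(0.5, 0.0)
normal_a = VertexNormal(position_index=0)
normal_b = VertexNormal(position_index=0)
face_a = Face([FaceVertex(normal_a, uv)])   # other corners elided
face_b = Face([FaceVertex(normal_b, uv)])
```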

This gives us a general polygonal model representation, which can be extended by adding new kinds of per-face, per-vertex data. Because the data is organized by face, and then by face vertex, this representation can be readily converted to the vertex/index stream formats expected by 3d graphics hardware. The considerations that drive all the little details of that conversion would be a good topic for a follow-up article…

Let’s begin by recapping those formulas. **zNear** is the z value of the near plane (also called **d** in previous posts). **zFar** is the z value of the far plane. **nearWidth** and **nearHeight** are the width and height of the near plane:
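Written out, these are the formulas derived in the earlier posts of this series (using the Direct3D-style depth range):

```latex
\begin{aligned}
clip_x &= \frac{2 \cdot zNear}{nearWidth} \cdot \frac{x}{z} \\
clip_y &= \frac{2 \cdot zNear}{nearHeight} \cdot \frac{y}{z} \\
clip_z &= \left( \frac{zFar}{zFar - zNear} \cdot z \;-\; \frac{zNear \cdot zFar}{zFar - zNear} \right) \cdot \frac{1}{z}
\end{aligned}
```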

Notice that each one of these involves a division by **z**. Actually, the calculation of clip_z doesn’t *really* need a division by **z**, but as explained in the previous post, we have to use homogeneous coordinates to accomplish the division by **z** for **x** and **y**, and it’s an all-or-nothing deal. It’s amazing to me that all of this ends up working out. Thank goodness for mathematicians!

Since that division by **z** will *not* (and cannot) be carried out by the matrix multiplication itself, we can take it right out of all three equations. We just need to make sure the matrix puts the value of **z** into the resulting **w** coordinate. This leaves us with calculations that *can* be carried out by a single 4×4 matrix. All that’s left is to figure out where in the matrix to place the different parts.

Remember that each component of the result vector is determined by a single row of the matrix:

Row 1 determines the resulting **x** coordinate.

Row 2 determines the resulting **y** coordinate.

Row 3 determines the resulting **z** coordinate.

Row 4 determines the resulting **w** coordinate.

Let’s proceed one component at a time. We want the source **x** to be multiplied by 2·zNear/nearWidth. The column in row 1 of the matrix that gets multiplied by **x** is the first one, entry (1,1). That’s all there is to it. We put 2·zNear/nearWidth into (1,1) and zeros in all the other columns. So the first row of the matrix is:

Calculating **y** is almost the same, except it’s the second column that gets multiplied by the source **y**. So the second row is:

For **z**, remember that we need to add two terms together. One is multiplied by the source **z**, and the other is just a constant. So the part to be multiplied by **z** goes in the third column, and we conveniently hijack the **w** column to host the constant:

That leaves **w**, to which we simply assign the value of the source **z** by placing a one in the third column:

And we’re done! The final perspective projection matrix is:
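Assembled from the four rows above (column-vector convention, with the constant term riding in the w column of row 3):

```latex
\begin{bmatrix}
\frac{2 \cdot zNear}{nearWidth} & 0 & 0 & 0 \\
0 & \frac{2 \cdot zNear}{nearHeight} & 0 & 0 \\
0 & 0 & \frac{zFar}{zFar - zNear} & -\frac{zNear \cdot zFar}{zFar - zNear} \\
0 & 0 & 1 & 0
\end{bmatrix}
```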

Yeah, I just figured out that WordPress supports LaTeX. Awesome.

This concludes the perspective projection series. You should now be able to look at that matrix and understand exactly what it does to a 3D point.

The reason it is different is intrinsic in what *projection* is, by definition. Projection means dropping a dimension. In 3D graphics, we’re dropping the **z** dimension, so that we can render on a display that only has **x** and **y** dimensions. In perspective projection, **x** and **y** are projected as a function of **z** – they’re absorbing some of **z**’s information into themselves, which preserves some depth information, manifested as far-away things looking smaller.

The reason it is tricky has to do with the way we’re going to use homogeneous coordinates to accomplish the division by **z** required by the projection formulas we already have. To explain, I need to skip ahead a little to the matrix multiplication. Recall from the post on vector/matrix multiplication that while x, y, z, and w from the source vector can *each* be multiplied by a constant in the matrix, they can only combine with *each other* additively. But we need to divide **x** and **y** by **z** – that’s multiplicative, and that’s a problem.

The solution is apparently due to Möbius, whom you might remember from such hits as the Möbius strip. Recall that homogeneous coordinates give us a way to embed a scaling factor into a vector. If the **w** coordinate is not one, the real 3D coordinates are computed by dividing the whole vector by **w**. Aha! If we can somehow get the value of **z** into the **w** component (and we can), then **x** and **y** will effectively be divided by **z**! There’s just one problem – with that value in **w**, **z** itself will get divided by **z** too! We need to correct for this, because we are in fact going to need accurate depth information in the projected coordinates, for things like perspective-correct texture mapping and shadow mapping.

We now have some constraints for how this projection of the **z** coordinate needs to work. We know that it must map the range between the near and far planes into the range [0, 1]. And we know that the result must somehow have **z** itself already factored in, so that *subsequent division by z ends up giving the correct answer*. I know that last bit is weird. Just remember that we’re forced into it by the way homogeneous coordinates facilitate the divide-by-z needed by the x/y projection.

Time to break out the pencil and paper again, and see if we have enough information here to come up with a formula. Remember that the computation of z will be determined exclusively by the third row of the projection matrix. And it will consist of adding together bits of x, y, z, and w from the source vector. Well, right away we can rule out any contribution from x and y, because they have nothing to do with mapping z into a range. Really, the only input we need to accomplish that is z itself. So let’s start by seeing if there’s any constant that we can put in the (3,3) position in the matrix that will accomplish the desired mapping when multiplied by z.

We want the result to have z factored into it (to correct for the z-division problem) and we want to reach the result by multiplying z by some constant. Let’s call that hoped-for constant A:

Great. What else do we know? We know that when we plug in the z coordinate of the near plane, ultimately clip_z must equal zero. And when we plug in the z coordinate of the far plane, clip_z must equal one. Simple substitution into the equation above gives us:

And simplifying gives:

Well shucks. No solution. This means there’s no single constant A that we can multiply z by to get the result we’re after. Fortunately, there’s one more option: we might be able to use the last term in that matrix row. That’s the one that will get multiplied by w and added to the result. Since all of our input vectors are going to contain normalized homogeneous coordinates, w is always going to be one, which is innocuous. This will let us pass through one more constant (let’s call it B), to be added to the result. Which means we can also consider a formula with the form:

Let’s try plugging in our known values just like before:

This time it turns out we have a solvable system. Solving for A and B gives:

Plugging these back into the original formula and simplifying gives us:
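In full, the derivation runs:

```latex
\begin{aligned}
clip_z &= \frac{A \cdot z + B}{z} \\
\frac{A \cdot zNear + B}{zNear} &= 0 \qquad \frac{A \cdot zFar + B}{zFar} = 1 \\
A &= \frac{zFar}{zFar - zNear} \qquad B = -\frac{zNear \cdot zFar}{zFar - zNear} \\
clip_z &= \frac{zFar \cdot (z - zNear)}{z \cdot (zFar - zNear)}
\end{aligned}
```

As a sanity check: plugging z = zNear into the final formula gives zero, and z = zFar gives one, as required.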

Whew! We’re done with **z**. In the next post, we’ll put it all together into the projection matrix.

Stated simply, the point of perspective projection is to make things that are further away look smaller. Since the magnitude of **z** gets bigger as a point gets further from the viewer, it follows that the magnitude of other things should basically vary inversely with **z**. However, as I quickly discovered when I wrote my first 3D program, merely dividing everything else by **z** is not quite enough.

In order to understand the mathematics required to do perspective projection properly, let’s back up and talk about the view frustum. I’ll use 2D illustrations, but the same math extends pretty easily to 3D. The view frustum:

The idea is to pretend that the computer display is a window looking “out” onto the 3D world. Mathematically, it’s exactly like looking out a real window. If you look out your window at a fixed point in the outside world (say, a mailbox across the street), you could imagine drawing a straight line from your eye to that point. The point where that line intersects with the window is exactly where that mailbox will appear to you to be “on the window”. If you kept your head perfectly still, and painted exactly what you saw onto the surface of the window, it would look real. And that’s pretty much what we do when we render a 3D scene on a computer.

The diagram is in **eye space** – meaning the eye is at the origin. The **near plane** corresponds to the display – the window on the virtual world, which is a certain distance **d** from the eye. The **far plane** is the boundary that is as far as the eye can see. It’s necessary to have this limit because we need a unique number to represent every possible distance from the eye to any point in the view space, and computers don’t have infinite numeric precision. We can still put the far plane as far away as is practical, though. So, the view frustum neatly and precisely bounds everything the eye can see through the window.

In the 2D diagram, the point that we need to project is located at **y**. The goal is to figure out **y’**, the location on the near plane where the viewer will seem to see that point. One way to think about the math is in terms of similar triangles. Since the beige triangles are similar (they have the same angles), we know that the proportion of **y’** to **d** is the same as the proportion of **y** to **z**:

Multiplying both sides by d, we get:
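Written out, the proportion and the result of multiplying through by d are:

```latex
\frac{y'}{d} = \frac{y}{z} \qquad \Longrightarrow \qquad y' = \frac{d \cdot y}{z}
```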

There! Problem solved. Since y’ (and x’, in 3d) is what we were after, we have our projection formula. From here on, we’ll refer to x’ and y’ as proj_x and proj_y.

Oh, but wait. There is one other thing. In real 3D systems, we generally need to map these coordinates into something called **clip space**, which is typically a unit cube (OpenGL) or the half of a unit cube with positive z (Direct3D). For this, there is yet another way to look at the perspective projection, which will make the problem simpler:

See that? If we can just *warp *the whole view frustum into a rectangle, then all the perspective lines become parallel, and the y coordinate of the point out in space is now the same as the projected one! Also, if we take this just a little further and squash the rectangle into a cube, we’ll have our coordinates in clip space. So how is this warping of space accomplished? Hmm. Look at the result in this diagram – the warping turns y into y’, just like before, therefore we already have the formula that does this for x and y as a function of z:

That effectively makes the frustum rectangular, and all the perspective lines parallel. So much for the warp – on to the squash. Let’s start by thinking about extremities. If we can find formulas that work for the extremities, they’ll work for everything in between, too. To complete the transformation into a half-unit cube (I’m using Direct3D’s convention here), we need to map the following:

minimum and maximum x values at both near and far planes → -1 and +1

minimum and maximum y values at both near and far planes → -1 and +1

minimum and maximum z values → 0 and +1 (for OpenGL, we’d simply map to -1, +1 here)

The x and y cases are really similar. The z case is a bit special, firstly because the view frustum is not symmetric about the z axis, and secondly because we eventually need to divide everything by z, which means we’re even going to divide the transformed z by z. I don’t expect that last bit to make sense just yet – it has to do with the matrix multiplication and homogeneous coordinates, which we’ll get to eventually.

I won’t beat around the bush anymore on these formulas. Let’s just get out a pencil and paper and work it out. For x and y, all we need to do is divide by half the width (or height) of the near plane. For example, suppose the near plane ranges from -100 to +100. We want a point at 100 to map to one. It’s easy to work out that dividing by half the width does the trick. A little algebra gives us:

Since we’re eventually going to compose a matrix to do **all** of these calculations in one go, it’s important to concatenate them all together first. Let’s combine the projection (warp) formula with the mapping (squash) formula:

and

From here on, we’re going to call **d** *zNear*, indicating that it’s the distance to the near plane along the z axis. A little more algebra, substituting into the second equation, gives:

And for **y**:
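Carrying the substitution through for both axes gives the combined warp-and-squash formulas:

```latex
clip_x = \frac{proj_x}{nearWidth / 2} = \frac{2 \cdot zNear \cdot x}{nearWidth \cdot z}
\qquad
clip_y = \frac{proj_y}{nearHeight / 2} = \frac{2 \cdot zNear \cdot y}{nearHeight \cdot z}
```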

I mentioned that the case for z is special. So special that I worked out the derivation for the first time just now. I’ll show it in the next post. Then we’ll have all the pieces we need to put the matrix together.

I’m going to assume that you understand how plain old coordinates work. You’ve got x, y, and z values that describe a position in a Cartesian coordinate system. Usually this is represented by a 3-dimensional vector. Simple enough.

To get so-called homogeneous coordinates, you add a fourth component, conventionally called **w**. So now you have x, y, z, and w, but they still represent a point in **3** dimensions, **not** 4. Then what’s the **w** for? Isn’t it superfluous? Before we answer that, let’s look at a few examples.

Consider this 3d point:

<10, 10, 10>

The simplest way to make these into homogeneous coordinates is to just add **w=1**:

<10, 10, 10, 1>

There, now they’re homogeneous. Big deal! Well, what happens if we change **w**?

<10, 10, 10, **5**>

Uh oh. They’re still homogeneous coordinates, but now they’re misleading. Here’s the thing:

<10, 10, 10, 5> **!= **<10, 10, 10>. They are **not **the same point in space! In fact:

<10, 10, 10, 5> = <2, 2, 2>!

The rule is, <x, y, z, w> is the same point as <x, y, z> *only *when w = 1. If w != 1, you can normalize the whole thing by dividing every component by w (obviously, doing this will give you w = 1).

Now we can sort of start to see one part of what **w** is good for. It gives us a way to package a scaling factor into a vector. Notice that because of the rules, this scaling factor is upside down – the rest of the vector scales *inversely* with **w**. When w gets bigger, the rest of the vector gets smaller, and when w gets smaller, the rest of the vector gets bigger.
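The normalization rule fits in a couple of lines of Python:

```python
def normalize_homogeneous(v):
    # <x, y, z, w> represents the 3D point <x/w, y/w, z/w>.
    x, y, z, w = v
    return (x / w, y / w, z / w)

print(normalize_homogeneous((10, 10, 10, 1)))  # → (10.0, 10.0, 10.0)
print(normalize_homogeneous((10, 10, 10, 5)))  # → (2.0, 2.0, 2.0)
```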

Ok, so we can pack an inverse scaling factor in with our coordinates. Why is that useful? To understand that, you have to look at how these 4-component vectors interact with the 4×4 transformation matrices used everywhere in computer graphics. I’ll get into that in the next post, where I’ll examine the perspective projection matrix.

In order to understand how a projection matrix works, first you have to understand what a matrix actually does to a vector when you multiply them. It can seem a little mysterious if you don’t look too closely, but once you look, it’s almost intuitive.

First, a refresher on dot products. Remember that you get the dot product of two vectors by adding the component-wise products, like so:

You can look at vector/matrix multiplication as taking the dot products of the vector with each row of the matrix. Each dot product becomes a component of the result vector. Here’s how I like to think about this: *All *of the components of the original vector can potentially contribute to *each *component of the result vector. Each row of the matrix determines the contributions for one component of the result vector. So, in the result vector, **x **is determined by **row 1 **in the matrix, **y **is determined by **row 2**, and so on. Let’s see how this works for the simplest case: the identity matrix. After this, it should make sense why the identity matrix has ones on a diagonal.

Each component of the original vector can contribute additively to any component in the result vector, and the weightings of the contributions are determined by the corresponding row in the matrix. So, **P’x** is determined by **row 1**. You can see that in the identity matrix, row 1 has a one in the x position, and zeros in all other positions. So **Px** from the original vector is preserved intact by the one in that row, while **Py**, **Pz**, and **Pw** are suppressed by the zeros. In the second row, there is a one in the **y** position, and the rest are zeros. Since row 2 determines the contribution weightings for **P’y**, the contribution from **Py** is preserved, while **Px**, **Pz**, and **Pw** get zeroed.
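Here is that row-by-row view in a few lines of Python – each result component really is just a dot product of the source vector with one row of the matrix:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(matrix, v):
    # Each component of the result is the dot product of v with one row.
    return tuple(dot(row, v) for row in matrix)

identity = [
    (1, 0, 0, 0),
    (0, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 0, 0, 1),
]

p = (10, 20, 30, 1)
print(mat_vec(identity, p))  # → (10, 20, 30, 1)
```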

Hopefully that makes sense after going over it once or twice. The trick is to see each row in the matrix as *selecting *the way that the original vector’s components combine to produce each component of the result. The tools used to do this are multiplication (to scale or cancel a source component) and addition (to combine those components into a single result component).

Of course, most matrices are more interesting than the identity matrix. Try applying this to some of the transformation matrices. Notice how the rows of a scaling matrix only affect one vector component at a time, just like the identity matrix, since **x** scaled is always a function of just **x**. In contrast, notice how the rows of the rotation matrices affect multiple components at once, since **x** rotated can be a function of both **x and y**, or **x and z**, depending on the axis of rotation. Once you absorb this, you’ll be able to see intuitively what a given matrix is doing to a vector, and make up your own matrices to do interesting things to vectors.

I’ve been doing 3D graphics programming for about 17 years, ever since the day in 10th grade algebra when we covered rectangular and polar coordinates. Daydreaming during the lesson, I realized that you could use that math to rotate stuff. On the way home, I imagined using the formulas to rotate a cloud of points in 3D space. I went straight to my Tandy 1000 and fired up BASIC.

My first 3D program had a random cloud of pixels spinning in 3D space, projected head-on, orthographically. In other words, it looked like a bunch of pixels swaying from side to side on the screen – but I knew what it represented. Then two things happened. First, I realized that if I drew lines connecting the points instead, I could make something more like a geometric shape. I got some graph paper, plotted a triangular prism, and typed in the coordinates. Second, almost as an afterthought, I started thinking about how to make it look “more 3D” – i.e. how to add perspective.

As I remember it, it only took me a few minutes to work out that if I wanted far away things to be smaller, that meant I should divide everything by **z**. I tried this, and it sort of almost worked, but the result looked distorted. Without understanding what I was doing, I experimented with adding an extra term to the formula until it looked right. And just like that, I had a rotating 3D wireframe model, with perspective.

I didn’t fully appreciate at the time that I had just independently rediscovered the Perspective Projection, following in the footsteps of Florentine painters, mathematicians, and early computer graphics practitioners. Nor did I fully understand the details, such as the fact that my extra term was actually the distance from the eye to the view plane.

I find it a little funny that the concept that spurred me to write a 3D program was *rotation*, and that I tacked on perspective almost as an afterthought. It’s like I thought rotation was the main thing that would make an object seem real and solid.

Still, figuring out perspective was a pretty big deal. I have a much deeper understanding of it now than I did back then. And yet, when I went to reimplement a projection matrix today, I found that I’d forgotten (again) what a lot of the terms did, and how it all worked with the homogeneous coordinates and what not. So, my first little series will be a breakdown of perspective projection matrices and why they are the way they are.
