# Attack of the depth buffer

In these exciting modern times, people get a lot of mileage out of their depth buffers. Long gone are the days when we only used depth buffers for visibility and stenciling; we now use the depth buffer to reconstruct the world-space or view-space position of our geometry at any given pixel. This can be a powerful performance optimization, since the alternative is to output position into a “fat” floating-point buffer. However, it’s important to realize that using the depth buffer in such unconventional ways can impose new precision requirements, since complex functions like lighting attenuation and shadowing will depend on the accuracy of the value stored in your depth buffer. This is particularly important if you’re using a hardware depth buffer for reconstructing position, since the z/w value stored in it will be non-linear with respect to the view-space z value of the pixel. If you’re not familiar with any of this, there’s a good overview here by Steve Baker. The basic gist of it is that z/w increases very quickly as you move away from the near-clip plane of your perspective projection, and for much of the area viewable by your camera you will have values >= 0.9 in your depth buffer. Consequently you’ll end up with a lot of precision for geometry that’s close to your camera, and very little for geometry that’s way in the back. This article from codermind has some mostly-accurate graphs that visualize the problem.
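To make the non-linearity concrete, here’s a small Python sketch that evaluates z/w for a standard D3D-style perspective projection at a few view-space depths (the near/far values of 1 and 101 match the test scene described below; the function name is my own):

```python
# Post-projection depth (z/w) for a standard D3D perspective projection.
# near/far match the test scene described below: near = 1, far = 101.
def depth_zw(view_z, near=1.0, far=101.0):
    a = far / (far - near)            # projection matrix _33
    b = -far * near / (far - near)    # projection matrix _43
    return (a * view_z + b) / view_z  # clip-space z divided by w (w = view_z)

for z in [2.0, 5.0, 10.0, 25.0, 50.0, 100.0]:
    print(f"view z = {z:6.1f}  ->  z/w = {depth_zw(z):.6f}")
```

At a view-space z of 10, which is less than 10% of the visible range, z/w is already about 0.909, which is exactly the “most of your range sits above 0.9” behavior described above.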

Recently I’ve been doing some research into different formats for storing depth, in order to get a solid idea of the amount of error I can expect.  To do this I made a DirectX11 app where I rendered a series of objects at various depths, and compared the position reconstructed from the depth buffer with a position interpolated from the vertex shader.  This let me easily flip through different depth formats and visualize the associated error. Here’s a front view and a side view of the test scene:

The cylinders are placed at depths of 5, 20, 40, 60, 80, and 100. The near-clip plane was set to 1, and the far-clip was set to 101.

For an error metric, I calculated the difference between the reference position (the interpolated view-space vertex position) and the reconstructed position, and normalized it by dividing by the distance to the far clip plane. I also multiplied by 100, so that a fully red pixel represents a difference equivalent to 1% of the view distance. For the final output I put the shaded and lit scene in the top-left corner, the sampled depth in the top right, the error in the bottom left, and error * 100 in the bottom right.
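As a sketch, the metric reduces to the following (the function name is mine, not from the app):

```python
import math

def depth_error_pct(ref_pos, recon_pos, far_dist=101.0):
    # Length of the difference vector as a percentage of the far-clip
    # distance; 1.0 (a fully red pixel) means the reconstruction is off
    # by 1% of the view distance.
    return 100.0 * math.dist(ref_pos, recon_pos) / far_dist
```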

For all formats marked “Linear Z”, the depth was calculated by taking view-space Z and dividing by the distance to the far-clip plane.  Position was reconstructed using the method described here. For formats marked “Perspective Z/W”, the depth was calculated by interpolating the z and w components of the clip-space position and then dividing in the pixel shader.  Position was reconstructed by first reconstructing view-space Z from Z/W using values derived from the projection matrix.  For formats marked “1 – Perspective Z/W”, the near and far plane values were flipped when creating the perspective projection matrix. This effectively stores 1 – z/w in the depth buffer. More on that in #9.
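For reference, here’s a sketch of the two storage/reconstruction paths for the depth value itself; the position reconstruction then scales a view ray by the recovered view-space Z. Near/far are the test scene’s 1 and 101, and the function names are my own:

```python
near, far = 1.0, 101.0
a = far / (far - near)            # projection matrix _33
b = -far * near / (far - near)    # projection matrix _43

# "Linear Z": store view-space z divided by the far-clip distance
def store_linear(view_z):
    return view_z / far

def view_z_from_linear(depth):
    return depth * far

# "Perspective Z/W": store post-projection z/w, then invert it using
# the same two projection-matrix terms
def store_zw(view_z):
    return a + b / view_z

def view_z_from_zw(zw):
    return b / (zw - a)
```

Both paths round-trip exactly at full precision; all of the error in the images below comes from quantizing the stored value into the chosen format.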

So without further rambling, let’s look at some results:

1. Linear Z, 16-bit floating point

So things are not so good on our first try. We get significant errors along the entire visible range with this format,  with the error increasing as we get towards the far-clip plane. This makes sense, considering that a floating-point value has more precision closer to 0.0 than it does closer to 1.0.
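Python’s struct module can round-trip a value through IEEE half precision (the 'e' format), so we can sketch this error distribution directly (far plane of 101 assumed, as in the test scene):

```python
import struct

def to_f16(x):
    # Round to the nearest representable 16-bit float and back
    return struct.unpack('<e', struct.pack('<e', x))[0]

far = 101.0
for view_z in [5.0, 20.0, 50.0, 100.0]:
    depth = view_z / far
    err = abs(to_f16(depth) - depth) * far  # error back in view-space units
    print(f"view z = {view_z:5.1f}  error = {err:.6f}")
```

Because half-float spacing doubles with every power of two, the representable steps near depth = 1.0 are roughly 2^-11 apart, versus much finer steps near 0 — hence the error growing toward the far plane.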

2. Linear Z, 32-bit floating point

Ahh, much nicer. No visible error at all. It’s pretty clear that if you’re going to manually write depth to a render target, this is a good way to go. Storing into a 32-bit UINT would probably have even better results due to an even distribution of precision, but that format may not be available depending on your platform.  In D3D11 you’d also have to add a tiny bit of packing/unpacking code since there’s no UNORM format.

3. Linear Z, 16-bit UINT

For this image I output depth to a render target with the DXGI_FORMAT_R16_UNORM format. As you can see it still has errors, but they’re significantly smaller than with a 16-bit floating point. It seems to me that if you were going to restrict yourself to 16 bits for depth, this is the way to go.
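A 16-bit UNORM spreads its 65536 steps evenly across [0, 1], so for linear depth the worst-case error is a constant half-step everywhere, rather than growing toward the far plane. A sketch (far = 101 as in the test scene):

```python
far = 101.0

def quantize_unorm16(x):
    # Round to the nearest of 65536 evenly spaced values in [0, 1]
    return round(x * 65535) / 65535

# Worst-case view-space error: half a quantization step, scaled by far
worst_case = 0.5 * (1.0 / 65535) * far
print(f"worst-case error = {worst_case:.6f} view-space units")
```

That works out to roughly 0.00077 view-space units at any depth, which is why the error image looks so uniform compared to the floating-point formats.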

4. Perspective Z/W, 16-bit floating point

This is easily the worst format out of everything I tested.  You’re at a disadvantage right off the bat just from using 16 bits instead of 32, and that’s compounded by the non-linear distribution of precision that comes from storing perspective depth. Then on top of that, you’re encoding to floating point, which gives you even worse precision for geometry that’s far from the camera. The results are not pretty…don’t use this!

5. Perspective Z/W, 32-bit floating point

This one isn’t so bad compared to using a 16-bit float, but there’s still error at higher depth values.

6. Perspective Z/W, 16-bit UINT

I used a normal render target for this in my test app, but it should be mostly equivalent to sampling from a 16-bit depth buffer. As you’d expect, quite a bit of error once you move away from the near clip plane.

7. Perspective Z/W, 24-bit UINT

This is the most common depth buffer format, and in my sample app I actually sampled the hardware depth-stencil buffer created from the first rendering pass.  Compared to some of the alternatives this really isn’t terrible, and a lot of people have shipped awesome-looking games with this format. The maximum error towards the back is ~0.005%. If the distance to your far plane is very high, the error can be pretty significant.

8. Position, 16-bit floating point

For this format, I just output view-space position straight to a DXGI_FORMAT_R16G16B16A16_FLOAT render target. The only thing this format has going for it is convenience and speed of reconstruction…all you have to do is sample and you have position. In terms of accuracy, the amount of error is pretty close to what you get from storing linear depth in a 16-bit float. All in all…it’s a pretty bad choice.

9. 1 – Z/W, 16-bit floating point

This is where things get a bit interesting. Earlier I mentioned how floating-point values have more precision closer to 0.0 than they do closer to 1.0. It turns out that if you flip your near and far planes so that you store 1 – z/w in the depth buffer, the two precision distribution issues will mostly cancel each other out. As far as I know this was first proposed by Humus in this Beyond3D thread. He later posted this short article, where he elaborated on some of the issues brought up in that thread.  As you can see he’s quite right: flipping the clip planes gives significantly improved results. They’re still not great, but clearly we’re getting somewhere.
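The stdlib’s math.ulp shows the cancellation directly: for geometry near the far plane, standard z/w lands where floats are sparse, while 1 – z/w lands near 0 where they’re dense. (This sketch just computes 1 – z/w rather than building the flipped projection matrix, and uses float64 spacing, but the distribution argument is the same for 16- and 32-bit depth formats.)

```python
import math

near, far = 1.0, 101.0
a = far / (far - near)
b = -far * near / (far - near)

def zw(view_z):
    return a + b / view_z

view_z = 100.0              # geometry near the far plane
standard = zw(view_z)       # ~0.9999: floats are sparse here
flipped = 1.0 - standard    # ~0.0001: floats are dense here
print(math.ulp(standard), math.ulp(flipped))
```

The spacing between representable values at the flipped depth is several orders of magnitude finer, which is exactly the extra precision showing up in the error image.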

10. 1 – Z/W, 32-bit floating point

With a 32-bit float, flipping the planes gives us results similar to what we got when storing linear z. Not bad! In D3D10/D3D11 you can even use this format for a depth-stencil buffer…as long as you’re willing to either give up stencil or use 64 bits for depth.

The one format I would have liked to add to this list is a 24-bit float depth-stencil format.  This format is available on consoles, and is even exposed in D3D9 as D3DFMT_D24FS8. However according to the caps spreadsheet that comes with the DX SDK, only ATI 2000-series and up GPU’s actually support this format. In D3D10/D3D11 there doesn’t even appear to be an equivalent DXGI format, unless I’m missing something.

If there are any other formats or optimizations out there that you think are worthwhile, please let me know so that I can add them to the test app!  Also if you’d like to play around with the test app, I’ve uploaded the source and binaries here.  The project uses my new sample framework, which I still consider to be a work-in-progress.  However if you have any comments about the framework please let me know. I haven’t put in the time to make the components totally separable, but if people are interested then I will take some time to clean things up a bit.

EDIT: I also started a thread here on gamedev.net, to try to get some discussion going on the subject. Feel free to weigh in!

# D3D Performance and Debugging Tools Round-Up: PerfHUD

Officially, Nvidia’s PerfHUD is a performance-monitoring and debugging application for use with Nvidia GPU’s. Unofficially, it’s pure awesomeness for a graphics programmer.  While I personally find PIX to be a more useful tool when it comes to debugging, the fact that PerfHUD gives you hardware-specific details makes it infinitely more useful for profiling. At work I find myself using it every time there’s a performance issue on the PC. Here’s some of the things I like to do with it (warning, it’s a long list!):

1. View driver time and GPU idle time per frame

When doing graphics in Windows, something you always need to be wary of is spending too much time sitting around in the driver instead of giving the GPU enough work to do. With PerfHUD instead of looking at the number of Draw/SetRenderState/SetTexture calls and guessing their impact (although PerfHUD will show you these values if you want), you get a nice sweet graph that shows you your driver time, GPU idle time, and total frame time.

These graphs make it very obvious whether your app is CPU-bound or GPU-bound, and if you are CPU-bound you can tell whether it’s because you’re spending too much time in the driver.

2. View memory usage

Want to know how much memory your app is using? PerfHUD tells you.

3. View unified shader usage

The default PerfHUD layout includes a graph showing you what percentage of the unified shaders are being used for different shader types. This lets you know if you’re spending a lot of time in vertex, geometry, or pixel shaders.

4. View GPU pipeline usage states

The default layout also has a graph showing you how much you’re using the input assembler, shader units, texture units, and ROP’s:

5. Customize the graphs

Above I used the term “default layout” a few times. I say this because the initial layout you get isn’t fixed, and you can customize it. You can add new graphs, move them, remove them, and customize the data shown on a graph. This lets you pick and choose from the various performance counters available, and also change how the data is displayed. For instance you can select ranges, or switch between frame percentage and raw time.

6. Run instant global experiments

PerfHUD has hot-keys that let you toggle different experiments on and off. These include:

-Swap out all textures for 2×2 textures (removes texture bandwidth usage)
-Use a 1×1 viewport (removes pixel shading usage)
-Ignore all Draw calls (isolates the CPU)
-Eliminate geometry (removes vertex shader/input assembler usage)

You can also show wireframes, depth complexity (which shows your overdraw), and also highlight different pixel shader profiles.

7. View textures and render targets for a draw call

This is something you can do in PIX so it’s not that special, but I’m mentioning this because PerfHUD makes it easier to view all of the textures and render targets at the same time. In PIX you have to look at the debug state and open a new tab in the Details view for each texture/RT, which is kinda annoying. Also note that when you do this the current state of the backbuffer is shown on the screen, with the current Draw call highlighted in orange. You don’t see anything in my picture since the app doesn’t draw anything to the backbuffer until the final step.

8. View dependencies for a Draw call

This is actually a pretty neat debugging feature. PerfHUD basically gives you a list of all previous Draw calls whose results are used by the current Draw call. It will also show which future Draw calls use the results from the current Draw call.

9. View and modify states for a Draw call

PIX is really good at letting you see the current device state at a certain point in a frame, but PerfHUD takes this a step further by letting you modify them and instantly view the results.

10. Modify shaders and view the results

PerfHUD doesn’t let you debug shaders like PIX can, but it does let you modify a shader and see the live changes in your app. You can also load up any shader from file, compile it, and replace the original shader with it.

11. Replace textures with special debugging textures

In the Frame Debugger, you can replace a specific texture with one of the following:

• 2×2 texture
• Black, 25% gray, 50% gray, 75% gray, white textures
• Mipmap visualization texture

Here’s a screenshot showing the mipmap visualization texture applied as the diffuse albedo texture for all meshes:

12. View comprehensive performance statistics

The Frame Profiler is easily the coolest part about PerfHUD. It presents you with a whole slew of information about what’s going on with the GPU for different parts of a frame, and makes it easy to figure out which parts of your frame are the most expensive for different parts of the GPU pipeline. In fact, PerfHUD will automatically figure out the most expensive draw calls and indicate it for you. I’m not going to go through each feature in detail since I want this post to stay readable, but I’ll give you a list:

• View the bottleneck or utilization time per unit for a Draw call or state bucket (a state bucket is a group of Draw calls that have similar performance and bottleneck characteristics, meaning that if you reduce a particular bottleneck, all Draw calls in the state bucket are likely to get quicker)
• View a graph of bottleneck or utilization percentages for all Draw calls in a frame
• View a graph of CPU and GPU timing info
• View a graph of the number of pixels shaded per Draw call
• View a graph of the texture LOD level for all Draw calls
• View a graph of the number of primitives and screen coverage per Draw call

The following image shows the frame profiler in action. The graph is showing the bottleneck percentage per Draw call, and the selected Draw call is in the shadowmap generation pass. As you’d expect, the call is primarily bound by the input assembler stage since the shaders are so simple. You can also see that PerfHUD grouped all of the other shadow map generation Draw calls into the same state bucket.

Useful Tips:

1. PerfHUD itself is actually just a layer on top of Nvidia’s PerfKit, which is a library that lets you access hardware- and driver-specific performance counters. If you wanted, you could just use those API’s yourself and display the information on-screen, or integrate them into in-house profiling tools. In fact, Nvidia provides a PIX plugin, which lets PIX display and record them just like any other standard performance counter. However the catch is that a lot of the hardware counters aren’t updated every frame, which makes it difficult to use them to figure out bottlenecks. You also have the problem that it’s difficult to figure out bottlenecks for a specific portion of the frame, since you can’t query the counters multiple times per frame. The PerfHUD Frame Profiler makes this all easy by automatically running a frame multiple times, allowing it to gather sufficient information from the hardware performance counters.  You could of course do this yourself, but it’s a lot easier to just use PerfHUD.
2. PerfHUD is totally usable for XNA apps. In fact, all of those screenshots are from my InferredRendering sample. To run an app with PerfHUD you have to query for the PerfHUD adapter on startup, and use it if it’s available. The user guide gives sample code for doing this in DX9, and it’s even easier with XNA. In your constructor, add a handler for the GraphicsDeviceManager.PreparingDeviceSettings event:
```
graphics.PreparingDeviceSettings += graphics_PreparingDeviceSettings;
```

Then in your event handler, use this code:

```
foreach (GraphicsAdapter adapter in GraphicsAdapter.Adapters)
{
    // PerfHUD exposes itself as a separate adapter; if it's present,
    // select it along with the reference device type
    if (adapter.Description.Contains("PerfHUD"))
    {
        e.GraphicsDeviceInformation.Adapter = adapter;
        e.GraphicsDeviceInformation.DeviceType = DeviceType.Reference;
        break;
    }
}
```