Attack of the depth buffer

In these exciting modern times, people get a lot of mileage out of their depth buffers. Long gone are the days when we only used depth buffers for visibility and stenciling; we now also use the depth buffer to reconstruct the world-space or view-space position of our geometry at any given pixel. This can be a powerful performance optimization, since the alternative is to output position into a “fat” floating-point buffer. However, it’s important to realize that using the depth buffer in such unconventional ways can impose new precision requirements, since complex functions like lighting attenuation and shadowing will depend on the accuracy of the value stored in your depth buffer. This is particularly important if you’re using a hardware depth buffer for reconstructing position, since the z/w value stored in it is non-linear with respect to the view-space z value of the pixel. If you’re not familiar with any of this, there’s a good overview here by Steve Baker. The basic gist of it is that z/w increases very quickly as you move away from the near-clip plane of your perspective projection, and for much of the area viewable by your camera you will have values >= 0.9 in your depth buffer. Consequently you’ll end up with a lot of precision for geometry that’s close to your camera, and very little for geometry that’s way in the back. This article from codermind has some mostly-accurate graphs that visualize the problem.

Recently I’ve been doing some research into different formats for storing depth, in order to get a solid idea of the amount of error I can expect. To do this I made a DirectX 11 app where I rendered a series of objects at various depths, and compared the position reconstructed from the depth buffer with a position interpolated from the vertex shader. This let me easily flip through different depth formats and visualize the associated error. Here’s a front view and a side view of the test scene:

The cylinders are placed at depths of 5, 20, 40, 60, 80, and 100. The near-clip plane was set to 1, and the far-clip was set to 101.
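To put numbers on the non-linearity described at the top, here’s a small C++ sketch (not code from the test app; the function names are my own) that evaluates z/w and linear depth at the cylinder depths for this scene. It assumes a D3D-style left-handed perspective projection that maps the near plane to 0 and the far plane to 1, whose depth terms are _33 = f/(f-n) and _43 = -n*f/(f-n). With n = 1 and f = 101, the cylinder at depth 5 already lands at a z/w of about 0.81, and the one at depth 20 at about 0.96, while its linear depth is only about 0.2.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// z/w written to the depth buffer for a point at view-space depth viewZ, using a
// D3D-style left-handed projection (near plane -> 0, far plane -> 1).
double PerspectiveZOverW(double viewZ, double nearClip, double farClip)
{
    const double a = farClip / (farClip - nearClip);             // projection _33
    const double b = -nearClip * farClip / (farClip - nearClip); // projection _43
    // Clip-space z = a * viewZ + b, clip-space w = viewZ
    return (a * viewZ + b) / viewZ;
}

// Linear depth as used by the "Linear Z" formats: view-space z over the far-clip distance
double LinearDepth(double viewZ, double farClip)
{
    return viewZ / farClip;
}

// Prints both encodings for the cylinder depths in the test scene
void PrintTestSceneDepths()
{
    const double nearClip = 1.0, farClip = 101.0;
    const double cylinderDepths[] = { 5.0, 20.0, 40.0, 60.0, 80.0, 100.0 };
    for (double z : cylinderDepths)
        std::printf("z = %5.1f  linear = %.4f  z/w = %.4f\n", z,
                    LinearDepth(z, farClip), PerspectiveZOverW(z, nearClip, farClip));
}
```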

For an error metric, I calculated the difference between the reconstructed position and the reference position (the interpolated view-space vertex position), and normalized it by dividing by the distance to the far-clip plane. I also multiplied by 100, so that a fully red pixel represented a difference equivalent to 1% of the view distance. For the final output I put the shaded and lit scene in the top-left corner, the sampled depth in the top-right, the error in the bottom-left, and error * 100 in the bottom-right.
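As a sketch of that metric in C++, assuming the “difference” is the Euclidean distance between the two view-space positions (the struct and helper names are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Float3 { double x, y, z; };

// Error metric: distance between the reconstructed and reference view-space
// positions, divided by the far-clip distance and scaled by 100. A value of 1.0
// (a fully red pixel) therefore means the error is 1% of the view distance.
double ReconstructionError(const Float3& reconstructed, const Float3& reference, double farClip)
{
    const double dx = reconstructed.x - reference.x;
    const double dy = reconstructed.y - reference.y;
    const double dz = reconstructed.z - reference.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz) / farClip * 100.0;
}

// Red-channel values for the two error quadrants, saturated to [0, 1]
double ErrorColor(double error)          { return std::min(std::max(error, 0.0), 1.0); }
double AmplifiedErrorColor(double error) { return std::min(std::max(error * 100.0, 0.0), 1.0); }
```

With a far plane of 101, an error of 1.01 view-space units maps to 1.0, which is a fully red pixel.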

For all formats marked “Linear Z”, the depth was calculated by taking view-space Z and dividing by the distance to the far-clip plane. Position was reconstructed using the method described here. For formats marked “Perspective Z/W”, the depth was calculated by interpolating the z and w components of the clip-space position and then dividing in the pixel shader. Position was reconstructed by first reconstructing view-space Z from Z/W using values derived from the projection matrix. For formats marked “1 – Perspective Z/W”, the near and far plane values were flipped when creating the perspective projection matrix. This effectively stores 1 – z/w in the depth buffer. More on that in #9.
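Here’s a hedged C++ sketch of the two depth encodings and the Z/W-to-view-space-Z inversion, assuming the same D3D-style projection terms (_33 = f/(f-n), _43 = -n*f/(f-n)); all names are hypothetical. Since z/w = _33 + _43/z, view-space z is recovered as _43/(z/w - _33). Building the projection with the planes swapped gives z/w = 1 at the near plane and 0 at the far plane, which works out to exactly 1 – z/w of the unswapped projection, and the same inversion still applies. For linear Z, the sketch scales a per-pixel ray that ends on the far-clip plane, which is one common way to rebuild position from linear depth; whether it matches the linked method exactly is an assumption.

```cpp
#include <cassert>
#include <cmath>

struct Float3 { double x, y, z; };

// Depth terms of a D3D-style LH perspective projection. Passing the planes swapped
// (near = far distance, far = near distance) gives the "1 - Perspective Z/W" encoding.
struct DepthProjection
{
    double a; // _33 = f / (f - n)
    double b; // _43 = -n * f / (f - n)
};

DepthProjection MakeDepthProjection(double nearPlane, double farPlane)
{
    return { farPlane / (farPlane - nearPlane),
             -nearPlane * farPlane / (farPlane - nearPlane) };
}

// z/w = (a * viewZ + b) / viewZ = a + b / viewZ
double EncodeZOverW(const DepthProjection& p, double viewZ)
{
    return p.a + p.b / viewZ;
}

// Inverting z/w = a + b / viewZ gives viewZ = b / (z/w - a)
double ViewZFromZOverW(const DepthProjection& p, double zOverW)
{
    return p.b / (zOverW - p.a);
}

// Linear Z: depth = viewZ / farClip. Scaling a view-space ray that ends on the
// far-clip plane (ray.z == farClip) by the stored depth yields the full position.
Float3 ViewPositionFromLinearDepth(const Float3& farPlaneRay, double linearDepth)
{
    return { farPlaneRay.x * linearDepth, farPlaneRay.y * linearDepth,
             farPlaneRay.z * linearDepth };
}
```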

So without further rambling, let’s look at some results:

1. Linear Z, 16-bit floating point


So things are not so good on our first try. We get significant errors along the entire visible range with this format, with the error increasing as we get closer to the far-clip plane. This makes sense, considering that a floating-point value has more precision close to 0.0 than it does close to 1.0.
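A quick way to see this is to measure the spacing between adjacent 16-bit float values. The C++ sketch below (hypothetical helpers; it ignores denormals and round-to-nearest-even ties) rounds a value to the 11 significant bits of a half float. Near the cylinder at 100, linear depth is about 0.99 and adjacent half values are 2^-11 apart, roughly 0.05 view-space units with a far plane of 101. Near the cylinder at 5, the spacing is 2^-15, sixteen times finer.

```cpp
#include <cassert>
#include <cmath>

// Rounds a positive value to the nearest IEEE 754 half-precision value
// (11 significant bits). Denormals, overflow, and round-to-nearest-even ties
// are ignored to keep the sketch short.
double QuantizeToHalf(double value)
{
    if (value == 0.0) return 0.0;
    int exponent = 0;
    // value = mantissa * 2^exponent, with mantissa in [0.5, 1)
    const double mantissa = std::frexp(value, &exponent);
    return std::ldexp(std::round(std::ldexp(mantissa, 11)), exponent - 11);
}

// Distance between adjacent half-precision values in the binade containing value
double HalfStepSize(double value)
{
    int exponent = 0;
    std::frexp(value, &exponent);
    return std::ldexp(1.0, exponent - 11);
}
```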

2. Linear Z, 32-bit floating point


Ahh, much nicer. No visible error at all. It’s pretty clear that if you’re going to manually write depth to a render target, this is a good way to go. Storing into a 32-bit UINT would probably give even better results due to its even distribution of precision, but that format may not be available depending on your platform. In D3D11 you’d also have to add a tiny bit of packing/unpacking code, since there’s no 32-bit UNORM format.
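For the 32-bit UINT option, the packing amounts to mapping [0, 1] onto the full integer range by hand. Here’s a CPU-side C++ sketch of that mapping (function names are hypothetical). It uses double precision because a 32-bit float can’t represent every 32-bit integer; in a shader the depth you start from is itself a 32-bit float, which limits how much of the extra precision you’d actually gain.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Emulates a 32-bit UNORM on top of an R32_UINT target by mapping [0, 1]
// onto the full unsigned integer range, giving evenly spaced precision.
uint32_t PackUnorm32(double depth)
{
    const double clamped = std::min(std::max(depth, 0.0), 1.0);
    return static_cast<uint32_t>(clamped * 4294967295.0 + 0.5); // round to nearest
}

double UnpackUnorm32(uint32_t packed)
{
    return static_cast<double>(packed) / 4294967295.0;
}
```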

3. Linear Z, 16-bit UINT

For this image I output depth to a render target with the DXGI_FORMAT_R16_UNORM format. As you can see it still has errors, but they’re significantly smaller than with a 16-bit float. It seems to me that if you’re going to restrict yourself to 16 bits for depth, this is the way to go.
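A UNORM format spreads its precision evenly: assuming round-to-nearest on write, a stored value is always within 0.5/65535 of the original. The C++ sketch below (hypothetical helpers) shows the encode/decode and the resulting worst-case view-space error for linear depth, which is about 0.0008 units everywhere in the range for a far plane of 101, versus about 0.025 units (half of 2^-11 × 101) for a 16-bit float near the back.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// R16_UNORM-style encode/decode, assuming round-to-nearest on write
uint16_t EncodeUnorm16(double value)
{
    const double clamped = std::min(std::max(value, 0.0), 1.0);
    return static_cast<uint16_t>(std::lround(clamped * 65535.0));
}

double DecodeUnorm16(uint16_t stored)
{
    return stored / 65535.0;
}

// Worst-case view-space error for linear depth (viewZ / farClip) stored as UNORM16.
// Unlike a float, this bound is the same everywhere in the depth range.
double MaxViewSpaceErrorUnorm16(double farClip)
{
    return 0.5 / 65535.0 * farClip;
}
```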

4. Perspective Z/W, 16-bit floating point

This is easily the worst format out of everything I tested. You’re at a disadvantage right off the bat just from using 16 bits instead of 32, and you compound that with the non-linear distribution of precision that comes from storing perspective depth. Then on top of that, you’re encoding to floating point, which gives you even worse precision for geometry that’s far from the camera. The results are not pretty…don’t use this!

5. Perspective Z/W, 32-bit floating point

This one isn’t so bad compared to using a 16-bit float, but there’s still error at higher depth values.

6. Perspective Z/W, 16-bit UINT

I used a normal render target for this in my test app, but it should be mostly equivalent to sampling from a 16-bit depth buffer. As you’d expect, there’s quite a bit of error once you move away from the near clip plane.

7. Perspective Z/W, 24-bit UINT

This is the most common depth buffer format, and in my sample app I actually sampled the hardware depth-stencil buffer created from the first rendering pass.  Compared to some of the alternatives this really isn’t terrible, and a lot of people have shipped awesome-looking games with this format. The maximum error towards the back is ~0.005%. If the distance to your far plane is very high, the error can be pretty significant.

8. Position, 16-bit floating point

For this format, I just output view-space position straight to a DXGI_FORMAT_R16G16B16A16_FLOAT render target. The only thing this format has going for it is convenience and speed of reconstruction…all you have to do is sample and you have position. In terms of accuracy, the amount of error is pretty close to what you get from storing linear depth in a 16-bit float. All in all…it’s a pretty bad choice.

9. 1 – Z/W, 16-bit floating point

This is where things get a bit interesting. Earlier I mentioned how floating-point values have more precision closer to 0.0 than they do closer to 1.0. It turns out that if you flip your near and far planes so that you store 1 – z/w in the depth buffer, your two precision distribution issues will mostly cancel each other out. As far as I know this was first proposed by Humus in this Beyond3D thread. He later posted this short article, where he elaborated on some of the issues brought up in that thread. As you can see he’s quite right: flipping the clip planes gives significantly improved results. They’re still not great, but clearly we’re getting somewhere.
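The cancellation can be checked numerically. The C++ sketch below (helper names are hypothetical; it uses a simplified half-float rounding that ignores denormals and tie-breaking) stores z/w and the flipped 1 – z/w in a 16-bit float and reconstructs view-space z, assuming a D3D-style projection with _33 = f/(f-n) and _43 = -n*f/(f-n). For the test scene’s planes at a depth of 100, regular z/w (0.9999) rounds all the way to 1.0, so the reconstructed depth lands on the far plane about one unit away, while the flipped value (0.0001) sits where a half float has far more precision and the error stays well below 0.001 units.

```cpp
#include <cassert>
#include <cmath>

// Rounds a positive value to half precision (11 significant bits; denormals and
// round-to-nearest-even ties ignored)
double QuantizeToHalf(double value)
{
    if (value == 0.0) return 0.0;
    int exponent = 0;
    const double mantissa = std::frexp(value, &exponent);
    return std::ldexp(std::round(std::ldexp(mantissa, 11)), exponent - 11);
}

// D3D-style projection depth: z/w = a + b / viewZ with a = f/(f-n), b = -n*f/(f-n).
// Calling with n and f swapped produces the flipped 1 - z/w encoding.
double ZOverW(double viewZ, double n, double f)
{
    return f / (f - n) - n * f / ((f - n) * viewZ);
}

double ViewZFromZOverW(double zOverW, double n, double f)
{
    const double a = f / (f - n);
    const double b = -n * f / (f - n);
    return b / (zOverW - a);
}

// View-space error after storing depth in a 16-bit float and reconstructing it
double HalfDepthError(double viewZ, double n, double f)
{
    const double stored = QuantizeToHalf(ZOverW(viewZ, n, f));
    return std::fabs(ViewZFromZOverW(stored, n, f) - viewZ);
}
```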

10. 1 – Z/W, 32-bit floating point

With a 32-bit float, flipping the planes gives us results similar to what we got when storing linear z. Not bad! In D3D10/D3D11 you can even use this format for a depth-stencil buffer…as long as you’re willing to either give up stencil or use 64 bits for depth.

The one format I would have liked to add to this list is a 24-bit float depth-stencil format. This format is available on consoles, and is even exposed in D3D9 as D3DFMT_D24FS8. However, according to the caps spreadsheet that comes with the DX SDK, only ATI 2000-series and up GPU’s actually support this format. In D3D10/D3D11 there doesn’t even appear to be an equivalent DXGI format, unless I’m missing something.

If there are any other formats or optimizations out there that you think are worthwhile, please let me know so that I can add them to the test app! Also, if you’d like to play around with the test app, I’ve uploaded the source and binaries here. The project uses my new sample framework, which I still consider to be a work in progress. If you have any comments about the framework, please let me know. I haven’t put in the time to make the components totally separable, but if people are interested then I will take some time to clean things up a bit.

EDIT: I also started a thread here on gamedev.net, to try to get some discussion going on the subject. Feel free to weigh in!

D3D Performance and Debugging Tools Round-Up: PerfHUD

Officially, Nvidia’s PerfHUD is a performance-monitoring and debugging application for use with Nvidia GPU’s. Unofficially, it’s pure awesomeness for a graphics programmer. While I personally find PIX to be a more useful tool when it comes to debugging, the fact that PerfHUD gives you hardware-specific details makes it infinitely more useful for profiling. At work I find myself using it every time there’s a performance issue on the PC. Here are some of the things I like to do with it (warning, it’s a long list!):

1. View driver time and GPU idle time per frame

When doing graphics in Windows, something you always need to be wary of is spending too much time sitting around in the driver instead of giving the GPU enough work to do. With PerfHUD instead of looking at the number of Draw/SetRenderState/SetTexture calls and guessing their impact (although PerfHUD will show you these values if you want), you get a nice sweet graph that shows you your driver time, GPU idle time, and total frame time.

These graphs make it very obvious whether your app is CPU-bound or GPU-bound, and if you are CPU-bound you can tell whether it’s because you’re spending too much time in the driver.

2. View memory usage

Want to know how much memory your app is using? PerfHUD tells you.

3. View shader utilization stats

The default PerfHUD layout includes a graph showing you what percentage of the unified shader units are being used for each shader type. This lets you know if you’re spending a lot of time in vertex, geometry, or pixel shaders.

4. View GPU pipeline usage states

The default layout also has a graph showing you how much you’re using the input assembler, shader units, texture units, and ROP’s:

5. Customize the graphs

Above I used the term “default layout” a few times. I do this because the initial layout you get isn’t fixed, and you can customize it. You can add new graphs, move them, remove them, and customize the data shown on a graph. This lets you pick and choose from the various performance counters available, and also change how the data is displayed. For instance you can select ranges, or switch between frame percentage and raw time.

6. Run instant global experiments

PerfHUD has hot-keys that let you toggle different experiments on and off. These include:

-Swap out all textures for 2×2 textures (removes texture bandwidth usage)
-Use a 1×1 viewport (removes pixel shading usage)
-Ignore all Draw calls (isolates the CPU)
-Eliminate geometry (removes vertex shader/input assembler usage)

You can also show wireframes, depth complexity (which shows your overdraw), and also highlight different pixel shader profiles.

7. View textures and render targets for a draw call

This is something you can also do in PIX, so it’s nothing new, but I’m mentioning it because PerfHUD makes it easier to view all of the textures and render targets at the same time. In PIX you have to look at the debug state and open a new tab in the Details view for each texture/RT, which is kinda annoying. Also note that when you do this, the current state of the backbuffer is shown on the screen with the current Draw call highlighted in orange. You don’t see anything in my picture since the app doesn’t draw anything to the backbuffer until the final step.

8. View dependencies for a Draw call

This is actually a pretty neat debugging feature. PerfHUD basically gives you a list of all previous Draw calls whose results are used by the current Draw call. It will also show which future Draw calls use the results from the current Draw call.

9. View and modify states for a Draw call

PIX is really good at letting you see the device state at a certain point in a frame, but PerfHUD takes this a step further by letting you modify that state and instantly view the results.

10. View and modify shaders

PerfHUD doesn’t let you debug shaders like PIX can, but it does let you modify a shader and see the live changes in your app. You can also load up any shader from file, compile it, and replace that shader.

11. Replace textures with special debugging textures

In the Frame Debugger, you can replace a specific texture with one of the following:

  • 2×2 texture
  • Black, 25% gray, 50% gray, 75% gray, white textures
  • Vertical and horizontal gradients
  • Mipmap visualization texture

Here’s a screenshot showing the mipmap visualization texture applied as the diffuse albedo texture for all meshes:

12. View comprehensive performance statistics

The Frame Profiler is easily the coolest part of PerfHUD. It presents you with a whole slew of information about what’s going on with the GPU for different parts of a frame, and makes it easy to figure out which parts of your frame are the most expensive for different parts of the GPU pipeline. In fact, PerfHUD will automatically figure out the most expensive draw calls and point them out for you. I’m not going to go through each feature in detail since I want this post to stay readable, but here’s a list:

  • View the bottleneck or utilization time per unit for a Draw call or state bucket (a state bucket is a group of draw calls that have similar performance and bottleneck characteristics, meaning that if you reduce a particular bottleneck all Draw calls in the state bucket are likely to be quicker)
  • View a graph of bottleneck or utilization percentages for all Draw calls in a frame
  • View a graph of CPU and GPU timing info
  • View a graph of the number of pixels shaded per Draw call
  • View a graph of the texture LOD level for all Draw calls
  • View a graph of the number of primitives and screen coverage per Draw call

The following image shows the frame profiler in action. The graph is showing the bottleneck percentage per Draw call, and the selected Draw call is in the shadowmap generation pass. As you’d expect, the call is primarily bound by the input assembler stage since the shaders are so simple. You can also see that PerfHUD grouped all of the other shadow map generation Draw calls into the same state bucket.

Useful Tips:

  1. PerfHUD itself is actually just a layer on top of Nvidia’s PerfKit, which is a library that lets you access hardware and driver-specific performance counters. If you wanted, you could just use those API’s yourself and display the information on-screen, or integrate it into in-house profiling tools. In fact Nvidia provides a PIX plugin, which lets PIX display and record them just like any other standard performance counter. However, the catch is that a lot of the hardware counters aren’t updated every frame, which makes it difficult to use them to figure out bottlenecks. You also have the problem that it’s difficult to figure out bottlenecks for a specific portion of the frame, since you can’t query the counters multiple times per frame. The PerfHUD Frame Profiler makes this all easy by automatically running a frame multiple times, allowing it to gather sufficient information from the hardware performance counters. You could of course do this yourself, but it’s a lot easier to just use PerfHUD.
  2. PerfHUD is totally usable for XNA apps. In fact, all of those screenshots are from my InferredRendering sample. To run an app with PerfHUD you have to query for the PerfHUD adapter on startup, and use it if it’s available. The user guide gives sample code for doing this in DX9, and it’s even easier with XNA. In your constructor, add a handler for the GraphicsDeviceManager.PreparingDeviceSettings event:

    graphics.PreparingDeviceSettings += graphics_PreparingDeviceSettings;

    Then implement the event handler like this:

    void graphics_PreparingDeviceSettings(object sender, PreparingDeviceSettingsEventArgs e)
    {
        // Look for the PerfHUD adapter and use it if it's present
        foreach (GraphicsAdapter adapter in GraphicsAdapter.Adapters)
        {
            if (adapter.Description.Contains("PerfHUD"))
            {
                e.GraphicsDeviceInformation.Adapter = adapter;
                // PerfHUD requires the Reference device type when using its adapter
                e.GraphicsDeviceInformation.DeviceType = DeviceType.Reference;
                break;
            }
        }
    }

  3. Be careful using PerfHUD when your app is running in windowed mode. Closing the window can cause it to crash.

D3D Performance and Debugging Tools Round-Up: PIX

So at this point just about everybody knows about PIX. I mean, it comes with the DirectX SDK, for crying out loud. This handy little program started its life as the Performance Investigator for Xbox (original Xbox, that is), and today it’s a useful performance and debugging tool for both Windows and the Xbox 360. Since it’s a DirectX tool, most of the information you can gather from it is hardware-independent. For instance it can easily tell you how many DrawIndexedPrimitives calls you’re making, but it can’t tell you whether the GPU is bound by texture bandwidth. For that reason I find that PIX is much more useful for debugging than for performance investigations. However, it can be useful when your performance is held up by API calls (since it can tell you where you’re making them in a frame, and how many), and in Vista/Win7 it has access to GPU timing information that can tell you how much time per frame the GPU is working, idle, or waiting for resources. Another nice thing about PIX is that it now has full support for D3D11 as of the new February 2010 SDK, which unfortunately isn’t the case for NVPerfHUD.

If you’re an XNA programmer I’d recommend checking out my in-depth PIX With XNA article, especially if you’re new to D3D in general. For the rest of you, here’s a summary of what I think are the most useful things you can do with PIX:

1. View the results of a Draw call

With everyone and their mother using a deferred renderer these days, more often than not what’s displayed on the screen is the result of several passes. This means that when things go wrong, it’s hard to guess the problem since it could have occurred in multiple places. Fortunately PIX can help us by letting us pick any single Draw call and see exactly what it drew. All you have to do is capture a frame, find the Draw call in the Event view, and then click on the “Render” tab in the Details view. Here’s a screenshot I took showing what was drawn to the normal-specular buffer during the G-Buffer pass of my Inferred Rendering Sample:

2. View device state at any point in a frame

Ever have a problem where something wasn’t drawing, and it turned out you left alpha-testing enabled or something silly like that? I know it’s happened to me. If it happens again, you can help diagnose the problem by using PIX to view the state of your device at the time of a Draw call (or at any other point in the frame, for that matter).  To do it you capture a frame, find the Draw call or other Event you’re interested in, and then find the device in your Objects view (filtering by Type can help). Then you just right-click on the device, click “View Device”, and have a look at the tab that appears in the Details view.  It looks something like this:

3. View mesh data for a draw call

Doing this lets you see what your vertex data looks like before and after your vertex shader (also before and after your geometry shader, if you’re using D3D10 or D3D11). Just capture a frame, click on the Draw event, and click on the Mesh tab in the Details view.

4. Debug shaders

I don’t think I need to mention why this is useful. With PIX you can step through both the compiled assembly and the HLSL code for your shader. The easiest way to start debugging is to view the history of a pixel by right-clicking on it in the Render tab, and then clicking on the links displayed for a Draw event.

5. View textures, render targets, buffers, depth/stencil surfaces, and vertex declarations

You can view the contents of all of these things just by finding them in the Objects view and right-clicking. Like everything else in PIX, what’s displayed will reflect the current state of the object based on the event you’ve selected in the Event view.  For vertex buffers you’ll also need to specify the  vertex format using an HLSL-like syntax, which is really easy to do.

6. View a CPU/GPU timing graph (Vista/Win7 only)

If you select the “Statistics for each frame” option when starting your experiment, one of the things you’ll get is a timeline showing your CPU and GPU work for the frames captured. This lets you easily see whether the GPU is idling or hard at work (which makes it simple to determine if you’re CPU- or GPU-bound). It can also show you where GPU work is done in relation to a frame being submitted by the CPU, so you can tell if the CPU is working one or more frames ahead of the GPU.

7. Record performance counter values for a frame, and show them on a HUD

PIX has a number of counters available that let you keep track of things like the number of Draw calls or the number of SetTexture calls in a frame. When you create a new experiment and you select the “Statistics for each frame” option, it will let you pick from a selection of countersets provided by PIX. To see which counters are included in a counterset or to make your own, click the “More Options” button, click the “Set Counters” action, and then click on “Customize”. In this dialog you can pick through all of the D3D counters provided by PIX, or add in any of the standard Windows Performance Counters installed on your system. Also note that vendor-specific tools like NVPerfKit will install plugins for PIX that let you add in hardware-specific counters.

If you enable the HUD for your experiment, you’ll get something that looks like this when running your app:

Either way once you close your app and you’re viewing the experiment results, the Events view will display the value of all active counters for each frame.

Useful Tips:

  1. If you’re going to debug shaders and you don’t want to have to step through the assembly, make sure you compile them with the DEBUG flag. This embeds debugging info in the compiled bytecode, including a path to the HLSL source code file. You’ll also want to disable optimization if possible, otherwise you’ll find that the compiler aggressively reorders your code. XNA users: the Effect processor will enable the DEBUG flag when you perform Debug builds, and it will attempt to disable optimizations. If you’re using a vs_2_0 or ps_2_0 shader it’s possible that disabling optimizations will cause you to go over the instruction limit, in which case the processor will re-enable optimizations.
  2. Add markers to your app! PIX has a small set of functions that let you mark off portions of a frame, which causes PIX to collapse all of the events that occurred in the marked area. So for instance you could add a marker for “G-Buffer Pass”, which lets you easily find the draw calls made to build your G-Buffer. If you’re using XNA, my sample includes a handy “PIXHelper” class that has P/Invoke declarations for those functions, as well as extension methods for SpriteBatch and Model.
  3. Use the “D” buttons in the Events view to quickly iterate through your Draw calls
  4. If you want to use the HUD and you’re using D3D9, make sure you Present with the implicit swap chain created with the device
  5. With D3D9 disable multisampling if you’re going to capture a frame. PIX doesn’t like it.

New Series: D3D Performance and Debugging Tools Round-Up

Recently I’ve been spending a lot of time with the various performance and debugging utilities available for Direct3D, and I thought it might be useful to give a quick overview of what’s out there.  I’m sure most people who do a lot of Direct3D/XNA work are aware of these tools, but probably aren’t familiar with all of the really cool things you can do with them.

What I’m going to do is run through each tool one at a time, and share some of the common use cases and show some screenshots of features I think are neat. That way some people might learn about something they never knew about, and hopefully a few people can tell me about something I never knew about.

As of right now, here are the tools I’m planning to run through:

  1. PIX
  2. NVPerfHUD
  3. GPU PerfStudio
  4. NVShaderPerf
  5. GPU ShaderAnalyzer

Inferred Rendering

So like I said in my last post, I’ve been doing some research into Inferred Rendering. If you’re not familiar with the technique, Scott Kircher has the original paper and presentation materials hosted on his website. The main topic of the paper is what they call “Discontinuity Sensitive Filtering”, or “DSF” for short. Basically it’s standard 2×2 bilinear filtering, except in addition to sampling the texture you’re interested in you also sample what they call a “DSF buffer” containing depth, an instance ID (semi-unique for each instance rendered on-screen), and a normal ID (a semi-unique value identifying areas where the normals are continuous). By comparing the values sampled from the DSF buffer with the values supplied for the mesh being rendered (they apply the DSF filter during the final pass of a light-prepass renderer, where meshes are re-rendered and sample from the lighting buffer), they can bias the bilinear weights so that texels not “belonging” to the object being rendered are automatically rejected. They go through all of this effort so that they can do two things:

  1. They can use a lower-res G-Buffer and L-Buffer but still render their geometry at full res
  2. They can light transparent surfaces using a deferred approach, by applying a stipple pattern when rendering the transparents to the G-Buffer
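Here’s a rough C++ sketch of the DSF idea for a single 2×2 footprint. It’s my simplification rather than the paper’s exact weighting: each texel either belongs to the mesh (same instance ID, depth within a tolerance) or gets a weight of zero, and the remaining bilinear weights are renormalized. Normal IDs are left out, and the struct, function, and tolerance parameter are all hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// One texel of a hypothetical DSF buffer: depth plus an instance ID (no normal ID)
struct DsfTexel
{
    float depth;
    unsigned instanceId;
};

// Discontinuity-sensitive bilinear filter (sketch). 'values' are the 2x2 texels of
// the buffer being filtered (e.g. the L-Buffer) and 'dsf' the matching DSF texels,
// ordered top-left, top-right, bottom-left, bottom-right. Texels whose ID differs
// from the mesh, or whose depth is further than depthTolerance away, get weight 0.
float DsfFilter(const std::array<float, 4>& values, const std::array<DsfTexel, 4>& dsf,
                float fracX, float fracY, unsigned meshInstanceId, float meshDepth,
                float depthTolerance)
{
    const std::array<float, 4> bilinear = {
        (1 - fracX) * (1 - fracY), fracX * (1 - fracY),
        (1 - fracX) * fracY,       fracX * fracY };

    float weightedSum = 0.0f, totalWeight = 0.0f;
    for (int i = 0; i < 4; ++i)
    {
        const bool belongs = dsf[i].instanceId == meshInstanceId &&
                             std::fabs(dsf[i].depth - meshDepth) <= depthTolerance;
        const float w = belongs ? bilinear[i] : 0.0f;
        weightedSum += w * values[i];
        totalWeight += w;
    }

    // If no texel matched, fall back to plain bilinear instead of dividing by zero
    if (totalWeight <= 0.0f)
    {
        float plain = 0.0f;
        for (int i = 0; i < 4; ++i) plain += bilinear[i] * values[i];
        return plain;
    }
    return weightedSum / totalWeight;
}
```

Renormalizing keeps the result a weighted average of the accepted texels, so lighting from a neighboring object never bleeds in as long as at least one texel in the footprint belongs to the mesh.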

The second part is what’s interesting, so let’s talk about it. Basically what they do is break up the G-Buffer into 2×2 quads. Then for transparent objects, an output mask is applied so that only one pixel in the quad is actually written to. By rotating the mask, you can render up to 3 layers of transparency into the quad and still have opaques visible underneath. For a visual, this is what a quad would look like if only one transparent layer was rendered:

So “T1” would be from the transparent surface, and “O” would be from opaque objects below it. This is what it would look like if you had 3 transparent surfaces overlapping:

After laying out your G-Buffer, you then fill your L-Buffer (Lighting Buffer) just like you would with a standard Light Pre-pass renderer. After you’ve filled your L-Buffer, you re-render your opaque geometry and sample your L-Buffer using a DSF filter so that only the texels belonging to opaque geometry get sampled. Then you render your transparent geometry with blending enabled, each time adjusting your DSF sample positions so that the 4 nearest texels (according to the output mask you used when rendering it to the G-Buffer) are sampled.
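To make the stipple pattern and the shifted sampling concrete, here’s a C++ sketch with a hypothetical slot assignment: slot 0 of each quad is left to opaques, and transparent layer k (1 to 3) writes only slot k. Because a layer’s texels form a lattice with a spacing of 2, sampling that layer means finding the 2×2 neighborhood on that lattice and computing bilinear fractions in lattice units. Positions are in texel units with integer coordinates at texel centers, and edge clamping is left out; both are assumptions of the sketch.

```cpp
#include <cassert>
#include <cmath>

// Quad slot of a pixel: 0 top-left, 1 top-right, 2 bottom-left, 3 bottom-right
int QuadSlot(int x, int y) { return (x & 1) | ((y & 1) << 1); }

// Transparent layer k (1..3) writes only the pixels in slot k
bool LayerWritesPixel(int layer, int x, int y) { return QuadSlot(x, y) == layer; }

// Pixels per quad that still show opaque geometry with N transparent layers
int OpaquePixelsPerQuad(int transparentLayers) { return 4 - transparentLayers; }

// Top-left texel of the 2x2 neighborhood on a layer's lattice (neighbors are at
// +2 in x and y), plus the bilinear fractions measured in lattice units
struct LatticeSample { int x0, y0; float fracX, fracY; };

LatticeSample LayerSample(float px, float py, int layer)
{
    const int offsetX = layer & 1, offsetY = (layer >> 1) & 1;
    const float tx = (px - offsetX) * 0.5f, ty = (py - offsetY) * 0.5f;
    const float bx = std::floor(tx), by = std::floor(ty);
    return { 2 * static_cast<int>(bx) + offsetX, 2 * static_cast<int>(by) + offsetY,
             tx - bx, ty - by };
}
```

The fractions from LayerSample would then feed a DSF-style filter, so each transparent layer only ever reads back the texels it wrote.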

So you can light your transparents just like any other geometry, which is really cool stuff if you have a lot of dynamic lights and shadows (which you probably do if you’re doing deferred rendering in the first place). But now come the downsides:

  1. Transparents end up being lit at 1/4 resolution, and opaques underneath transparents will be lit at either 3/4, 2/4, or 1/4 resolution.  How bad this looks mainly depends on whether you have high-frequency normal maps, since the lighting itself is generally low-frequency.  You’re also helped a bit by the fact that your diffuse albedo texture will still be sampled at full rate.  Here’s a screenshot comparing forward-rendered transparents (left-side), with deferred transparents (right-side):

    You can see that aliasing artifacts become visible on the transparent layers, due to the normal maps. Even more noticeable is shadow map aliasing, which gets considerably worse on the transparent layers since the shadow map is only sampled at 1/4 rate. Here’s a screenshot showing the same comparison, this time with normal maps disabled:


    The aliasing becomes much less visible on the unshadowed areas with normal mapping disabled, since now the normals are much lower-frequency. However, you still have the same problem with shadow map aliasing.

  2. The DSF filtering is not cheap. Or at least, the way I implemented it wasn’t cheap. My code can probably be optimized a bit to reduce instructions, but unless I’m missing something fundamental I don’t think you could make any significant improvements. If someone does figure out anything, please let me know! Anyway, when compiling my opaque pixel shader with fxc.exe (from the August 2009 SDK) using ps_3_0, I get a nice 11 instructions (9 math, 2 texture) when no DSF filtering is used. When filtering is added in, it jumps up to a nasty 64 instructions (55 math, 9 texture)! For transparents the shader jumps up again (71 math, 9 texture), since some additional math is needed to adjust the filtering in order to sample according to the stipple pattern. Running the shaders through NVShaderPerf gives me the following:

    Here’s what I get with ATI’s GPU ShaderAnalyzer:

    So like I said, it’s definitely not free. In the paper they mention that they also use a half-sized G-Buffer + L-Buffer, which offsets the cost of the extra filtering. When running my test app on my GTX 275 with a half-res G-Buffer there’s almost no difference in framerate, and at quarter-res it’s actually faster to defer the transparents. Using a full-res G-Buffer/L-Buffer it’s quicker to forward-render the transparents, with 4 large point lights and 1 directional light + shadow. So I’d imagine for a full-res G-Buffer/L-Buffer you’d need quite a few dynamic lights for it to pay off when going deferred for transparents. But in my opinion, the decrease in quality when using a lower-res G-Buffer just isn’t worth it. Here’s a screenshot showing deferred transparents with a half-sized G-Buffer:

    Notice how bad the shadows look on the transparents, since now the shadow map is being sampled at 1/8th rate.  Even on the opaques you start to lose quite a bit of the normal map detail.

  3. You only get 3 layers of transparency.  However past 3 layers it would probably be really hard to notice that you’re missing anything, at least to the average player.
  4. Since you use instance ID’s to identify transparent layers, you’ll have problems with models that have multiple transparency levels (like a car, which has 4 windows)

Regardless, I think the technique is interesting enough to look into. Personally, when I read the paper I had major concerns about what shadows would look like on the transparents (especially with a lower-res L-Buffer), which is what led me to make a prototype with XNA so that I could evaluate some of the more pathological cases that could pop up. If you’re also interested, I’ve uploaded the binary here, and the source here. If you want to run the binary you’ll need the XNA 3.1 Redistributable, located here.

One thing you’ll notice about my implementation is that I didn’t factor in normals at all in the DSF filter, and instead I stored depth in a 16-bit component and instance ID in the other 16 bits. This gives you far more than the 256 instances that the original implementation is limited to, at the expense of some artifacts around areas where the normal changes drastically on the same mesh.
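If you pack both values into a single 32-bit integer, the layout might look like the C++ sketch below. The layout (depth in the high 16 bits, ID in the low 16 bits) and the function names are hypothetical; keeping them in two 16-bit channels of a render target works just as well. A 16-bit ID gives 65,536 distinct values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical DSF texel layout: 16-bit UNORM depth in the high half,
// 16-bit instance ID in the low half
uint32_t PackDepthAndId(float linearDepth, uint16_t instanceId)
{
    const float clamped = std::min(std::max(linearDepth, 0.0f), 1.0f);
    const uint32_t depthBits = static_cast<uint32_t>(std::lround(clamped * 65535.0f));
    return (depthBits << 16) | instanceId;
}

void UnpackDepthAndId(uint32_t packed, float& linearDepth, uint16_t& instanceId)
{
    linearDepth = static_cast<float>(packed >> 16) / 65535.0f;
    instanceId = static_cast<uint16_t>(packed & 0xFFFFu);
}
```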

Correcting XNA’s Gamma Correction

One thing I never used to pay attention to is gamma correction.  This is mainly because it rarely gets mentioned, and also because you can usually get pretty good results without ever even thinking about it.  However it only took a few days at my new job for me to realize just how essential it is if you want professional-quality results.

Lately I’ve been doing some research into inferred rendering (more on that later), and while working up a prototype renderer in XNA I decided that I would (for once) be gamma-correct throughout the pipeline. So I went looking through the XNA Framework documentation for the framework’s equivalent of the D3DSAMP_SRGBTEXTURE sampler state (which automatically converts from sRGB to linear in the texture unit) and the D3DRS_SRGBWRITEENABLE render state (which automatically converts from linear to sRGB in the ROP)…and I didn’t find them. The thought of these being left out struck me as odd, so I did a bit of searching on Google. After refining my search terms I found this post by framework developer Shawn Hargreaves, confirming that those states were not exposed in the framework due to inconsistencies between Windows and Xbox. After looking through some presentations again I concluded that he was talking about…

1.  The fact that the 360 uses a 4-segment piecewise linear approximation curve to perform conversion to and from sRGB, which gives quite different results compared to what you get with PC GPUs.

2.  The fact that blending behavior is different in DX9 and DX10-level GPUs, regardless of which API you use.  DX9 GPUs will perform framebuffer blending after conversion to sRGB (which is mathematically incorrect), while DX10 GPUs will do the blending in linear space and then convert the blended result to sRGB.  There is a cap to detect this behavior (D3DPMISCCAPS_POSTBLENDSRGBCONVERT), but it’s only available if you create an IDirect3D9Ex device.
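To see why the order matters, here’s a small C++ sketch that blends a single channel both ways. It uses a plain 2.2 power curve rather than the exact sRGB curve or the 360’s piecewise approximation, so the numbers are only illustrative, and the function names are made up.

```cpp
#include <cmath>

// Approximate gamma-2.2 transfer functions (the exact sRGB curve is piecewise).
float ToSRGB(float linear) { return std::pow(linear, 1.0f / 2.2f); }
float ToLinear(float srgb) { return std::pow(srgb, 2.2f); }

// DX10-style: blend in linear space, then convert the blended result to sRGB.
float BlendThenConvert(float srcLinear, float dstLinear, float alpha)
{
    return ToSRGB(srcLinear * alpha + dstLinear * (1.0f - alpha));
}

// DX9-style: the incoming color is converted to sRGB first, then blended
// with the sRGB value already stored in the framebuffer.
float ConvertThenBlend(float srcLinear, float dstLinear, float alpha)
{
    return ToSRGB(srcLinear) * alpha + ToSRGB(dstLinear) * (1.0f - alpha);
}
```

Blending full white over black at 50% alpha gives about 0.73 in the framebuffer when the blend happens in linear space, but exactly 0.5 when the sRGB values are blended directly, which is noticeably darker. With alpha at 0 or 1 both paths agree, so the difference only shows up on partially transparent pixels.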

So yeah, that’s annoying.  But like most limitations in the framework you can work around them if you’re determined enough, and fortunately this one is a piece of cake.  Well…on the PC, at least.  So let’s start with the first half, sampling sRGB textures.  Like I mentioned before, there’s a nice convenient sampler state in D3D9 that will do the sRGB->linear conversion automatically, but XNA’s SamplerState just doesn’t have it.  Fortunately that’s not the only way to set sampler states…we can also get the Effects framework to do it for us by defining a sampler_state in our effect files.  So I took a peek at the D3D9 Effect States documentation, and added the appropriate state declaration to my effect file.  And it worked!  For the lazy, all you have to do is this (the important line is SRGBTexture = true):

texture2D DiffuseMap;
sampler2D DiffuseSampler = sampler_state
{
   Texture = <DiffuseMap>;
   SRGBTexture = true;   // convert sRGB texels to linear when sampling
};

Okay, now for the other half: sRGB writes.  Once again D3D9 has a convenient render state that does all of the work for us, and the Effects framework can set render states for us if we include them in a pass declaration.  But unfortunately this time the Effect States documentation didn’t have anything for SRGBWRITEENABLE.  Too determined to give up, I followed the standard convention of effect states and chopped the “D3DRS_” prefix off.  And hey, it worked!

technique Transparent
{
    pass Pass1
    {
       VertexShader = compile vs_3_0 TransparentVS();
       PixelShader = compile ps_3_0 TransparentPS();

       SRGBWriteEnable = true;   // convert linear output to sRGB on write
    }
}

So we’ve solved our gamma problems…at least if you’re only targeting the PC and you’re using Effects.  If you’re not using Effects, then I don’t know of any way to toggle those states.  It’s probably possible with some sort of interop/reflection voodoo, but I don’t know enough about these things to recommend it.

There’s also the Xbox 360 problem, which is actually two problems in one.  The first problem is that the Xbox 360 doesn’t use sampler and render states to control sRGB reads and writes.  It instead uses the D3D10 convention of having special surface formats for textures and render targets that control whether conversion takes place.  I don’t have access to my Xbox 360 at the moment so I can’t verify for sure, but I strongly suspect that the effect states won’t work.  And even if they did work you’d still have the second problem, which is that the Xbox uses that piecewise approximation curve (this presentation by Valve shows some of the nastiness that can occur with it).

Fortunately we can bypass those problems by doing the conversion ourselves in the shader.  The good news is that the code is a piece of cake…the bad news is that it’s not super cheap since it involves raising your RGB color value to a non-integral power. Here’s the code:

// Converts from linear RGB space to sRGB, approximating
// the sRGB curve with a simple 2.2 gamma.
float3 LinearToSRGB(in float3 color)
{
    return pow(color, 1.0f / 2.2f);
}

// Converts from sRGB space to linear RGB, using the
// same 2.2 gamma approximation.
float3 SRGBToLinear(in float3 color)
{
    return pow(color, 2.2f);
}
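Keep in mind that a plain 2.2 power is itself an approximation: the actual sRGB curve defined in IEC 61966-2-1 has a short linear segment near black, followed by an offset 2.4 power curve. Here’s a C++ sketch of the exact curve next to the approximation, in case you want to compare them (the function names are my own).

```cpp
#include <cmath>

// Exact sRGB -> linear (IEC 61966-2-1): linear segment near black,
// then an offset 2.4 power curve.
float SRGBToLinearExact(float c)
{
    return (c <= 0.04045f) ? c / 12.92f
                           : std::pow((c + 0.055f) / 1.055f, 2.4f);
}

// Exact linear -> sRGB, the inverse of the function above.
float LinearToSRGBExact(float c)
{
    return (c <= 0.0031308f) ? c * 12.92f
                             : 1.055f * std::pow(c, 1.0f / 2.4f) - 0.055f;
}

// The 2.2 gamma approximation used by the shader functions above.
float SRGBToLinearApprox(float c) { return std::pow(c, 2.2f); }
```

For mid-gray the two agree to within a few thousandths, but near black the 2.2 curve produces values many times smaller than the exact one, so that’s where you’re most likely to see a difference.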

Unfortunately with these you also have the problem that filtering and blending will be performed in sRGB space, and there’s not much you can do about that (aside from doing the filtering and blending yourself, but that would be way too expensive).

If you want to make these conversions a little cheaper, you can use a trick that my coworker showed me: round down the 2.2 to 2.0.  This gives you a simple square operation for conversion to linear (you can just multiply the value by itself, component-wise), and a sqrt operation for conversion to sRGB.
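Here’s that 2.0 trick as a C++ sketch on a hypothetical Float3 type. In HLSL the to-linear direction is just color * color; it needs to be a per-component multiply, since dot(color, color) would collapse the color into a single scalar.

```cpp
#include <cmath>

struct Float3 { float r, g, b; };

// Gamma 2.0 approximation: squaring each channel converts sRGB -> linear.
Float3 SRGBToLinearFast(Float3 c)
{
    return { c.r * c.r, c.g * c.g, c.b * c.b };
}

// ...and a per-channel square root converts linear -> sRGB.
Float3 LinearToSRGBFast(Float3 c)
{
    return { std::sqrt(c.r), std::sqrt(c.g), std::sqrt(c.b) };
}
```

The trade-off is accuracy: mid-gray (0.5) comes out as 0.25 in linear space instead of roughly 0.22 with the 2.2 curve.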

Two Samples For The Price Of One

Today I have two XNA samples fresh out of the oven: a Motion Blur Sample, and a Depth Of Field Sample.  I figure all of the kids these days wanna add fancy post-processing tricks to their games, right?  The motion blur sample shows you how to do camera motion blur using a depth buffer, or full object motion blur using a velocity buffer.  The depth of field sample shows you how to do a standard blur-based DOF, a slightly-smarter blur-based DOF that doesn’t blur across edges, and the somewhat more physically accurate disc blur approach.
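For the camera motion blur case, a common approach is to reconstruct each pixel’s position from depth, project it with last frame’s view-projection matrix, and use the difference in screen position as the blur direction. Here’s a rough C++ sketch of that per-pixel math; a real implementation would run it in a pixel shader, and the matrix layout (row-major, multiplying column vectors), type names, and function names are assumptions, not code from the sample.

```cpp
#include <array>

using Mat4 = std::array<float, 16>;  // row-major; v' = M * v (column vectors)
struct Vec4 { float x, y, z, w; };
struct Vec2 { float x, y; };

Vec4 Transform(const Mat4& m, const Vec4& v)
{
    return { m[0]  * v.x + m[1]  * v.y + m[2]  * v.z + m[3]  * v.w,
             m[4]  * v.x + m[5]  * v.y + m[6]  * v.z + m[7]  * v.w,
             m[8]  * v.x + m[9]  * v.y + m[10] * v.z + m[11] * v.w,
             m[12] * v.x + m[13] * v.y + m[14] * v.z + m[15] * v.w };
}

// ndcX, ndcY: the pixel's position in normalized device coordinates.
// depth: the z/w value read from the depth buffer.
// invViewProj: inverse of this frame's view-projection matrix.
// prevViewProj: last frame's view-projection matrix.
Vec2 CameraVelocity(float ndcX, float ndcY, float depth,
                    const Mat4& invViewProj, const Mat4& prevViewProj)
{
    // Reconstruct the world-space position from the depth buffer.
    Vec4 world = Transform(invViewProj, { ndcX, ndcY, depth, 1.0f });
    world = { world.x / world.w, world.y / world.w, world.z / world.w, 1.0f };

    // Project it with last frame's camera to see where it was on screen.
    Vec4 prev = Transform(prevViewProj, world);
    float prevX = prev.x / prev.w;
    float prevY = prev.y / prev.w;

    // Screen-space velocity in NDC units; the blur samples along this vector.
    return { ndcX - prevX, ndcY - prevY };
}
```

Since only the camera’s matrices are involved, objects that move on their own won’t be blurred correctly, which is what the velocity buffer path is for.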

Get ‘em while they’re hot!

New Tutorial: Using PIX With XNA

Ladies and gentlemen, I present you with the most epic of tutorials: Using PIX With XNA.  This 37-page monster teaches PIX for the XNA programmer, and includes an in-depth explanation of the XNA/D3D9 relationship as well as 6 exercises that show you how to solve common problems (full source code and XNA 3.1 projects included).  I sure hope somebody finds it useful…it took me forever to write.

I originally intended to have this tutorial hosted on Ziggyware…in fact I finished it over a month ago and submitted it to Ziggy.  However as you may or may not know, Ziggy has become the unfortunate target of scumbag hackers who have repeatedly hijacked his site in order to deploy malware.  The whole thing absolutely sucks…I really wish that those assholes had decided to hijack a site that wasn’t the most comprehensive collection of community-created XNA resources.  I hope Ziggy figures out a way to shake them and get the site up and running again…but it looks doubtful.  Honestly I don’t think I’d want to keep dealing with the kinds of problems he’s gone through.

Scintillating Snippets: Storing Normals Using Spherical Coordinates

Update:  n00body posted this link in the comments, which is way more in-depth than my post.  Check it out!

If you’ve ever implemented a deferred renderer, you know that one of the important points is keeping your G-Buffer small enough to be reasonable in terms of bandwidth and your number of render targets.  Thanks to that constant struggle between good and evil, people have come up with some reasonably clever approaches towards packing necessary attributes in your G-Buffer.  One of the more popular approaches is that whole storing depth and reconstructing position thing, and another is packing normals so that you only need 2 components instead of 3.

One of the simpler and more common approaches is to only store the X and Y components of your view-space normals and then assume Z is positive (or negative, depending on whether you’re using right-handed or left-handed coordinates).  As far as I know, this was first proposed here by Guerrilla Games.  However there’s a problem with this approach, which is that you can’t always assume the sign of your Z component when you’re using a perspective projection!  This might seem weird at first (heck, it took a while for someone to demonstrate to me why this is the case), but I assure you it’s true.  Insomniac has some good pictures here demonstrating the errors that occur.  So this means that if we want to use this technique and avoid errors, we have to pack the sign of Z somewhere in our two values.  This is a little nasty, and takes away a bit of precision from one of your other values.
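Here’s a small C++ sketch of the X/Y approach and how it goes wrong. It assumes a right-handed view space like XNA’s, with the camera at the origin looking down -Z, so normals facing the camera usually have positive Z. The types and function names are hypothetical.

```cpp
#include <cmath>

struct Float3 { float x, y, z; };

// Rebuilds Z from the stored X and Y, given an assumed sign for Z.
Float3 ReconstructFromXY(float x, float y, float zSign)
{
    float z = std::sqrt(std::fmax(0.0f, 1.0f - x * x - y * y));
    return { x, y, zSign * z };
}

// With the eye at the view-space origin, a surface at viewPos faces the
// camera when its normal points back toward the eye: dot(n, viewPos) < 0.
bool IsFrontFacing(Float3 n, Float3 viewPos)
{
    return n.x * viewPos.x + n.y * viewPos.y + n.z * viewPos.z < 0.0f;
}
```

For example, a surface way off to the side at view-space position (10, 0, -1) with normal (-0.6, 0, -0.8) is still front-facing, yet its normal has negative Z, so reconstructing it with an assumed positive Z gives (-0.6, 0, 0.8) instead.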

An alternative approach suggested to me a long time ago is to store the normal as a spherical coordinate.  Since a normal is always a unit vector with length = 1, you can (safely) assume that Rho = 1 and just store Theta and Phi.  Piece of cake!  All you have to do is implement the equations on the wiki page, take out the Rho’s, and you’ve got a two-component normal with excellent precision.
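As a baseline, here’s the straightforward version as a C++ sketch: standard spherical coordinates with Rho fixed at 1, using Theta for the angle around Z and Phi for the angle from +Z (the same naming as the HLSL snippet further down). The types and function names are placeholders.

```cpp
#include <cmath>

struct Float3 { float x, y, z; };
struct Spherical { float theta, phi; };

// Rho = 1: theta = atan2(y, x) in (-pi, pi], phi = acos(z) in [0, pi].
Spherical NormalToSpherical(Float3 n)
{
    float z = std::fmax(-1.0f, std::fmin(1.0f, n.z));  // guard acos's domain
    return { std::atan2(n.y, n.x), std::acos(z) };
}

Float3 SphericalToNormal(Spherical s)
{
    return { std::cos(s.theta) * std::sin(s.phi),
             std::sin(s.theta) * std::sin(s.phi),
             std::cos(s.phi) };
}
```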

But wait, there’s more!  It turns out if you use some trig-fu, you can further optimize the conversions when Rho is equal to 1.  I was never actually good at simplifying equations with trig functions (I can do everything else, promise!) so I defer to the noble Pat Wilson, who gave a quick rundown over in this thread.  Make sure you check out his set of screenshots that demonstrate the errors that occur from different normal storage options, so you can pick which method is right for you.
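The snippet below relies on a simplification that’s easy to see once you write out the equations with Rho = 1. Since Phi only ranges over [0, π], its sine is never negative, so you can store z = cos Phi directly and recover sin Phi with a square root, instead of calling acos when encoding and sin/cos of Phi when decoding:

```latex
\begin{aligned}
x &= \cos\theta\,\sin\phi, \qquad y = \sin\theta\,\sin\phi, \qquad z = \cos\phi \\
\phi \in [0, \pi] &\;\Rightarrow\; \sin\phi = \sqrt{1 - \cos^2\phi} = \sqrt{1 - z^2} \\
\text{encode:}\quad (\theta,\, z) &= \left(\operatorname{atan2}(y, x),\; z\right) \\
\text{decode:}\quad (x, y, z) &= \left(\cos\theta\,\sqrt{1 - z^2},\; \sin\theta\,\sqrt{1 - z^2},\; z\right)
\end{aligned}
```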

Also, since this is Scintillating Snippets and it wouldn’t be much fun without a snippet, I’ll post the HLSL functions I use for encoding and decoding my normals.  Just remember, all of the credit goes to Mr. Wilson.  I just did the pilfering!

// Converts a normalized cartesian direction vector
// to spherical coordinates.
float2 CartesianToSpherical(float3 cartesian)
{
  float2 spherical;

  spherical.x = atan2(cartesian.y, cartesian.x) / 3.14159f;
  spherical.y = cartesian.z;

  return spherical * 0.5f + 0.5f;
}

// Converts a spherical coordinate to a normalized
// cartesian direction vector.
float3 SphericalToCartesian(float2 spherical)
{
  float2 sinCosTheta, sinCosPhi;

  spherical = spherical * 2.0f - 1.0f;
  sincos(spherical.x * 3.14159f, sinCosTheta.x, sinCosTheta.y);
  sinCosPhi = float2(sqrt(1.0 - spherical.y * spherical.y), spherical.y);

  return float3(sinCosTheta.y * sinCosPhi.x, sinCosTheta.x * sinCosPhi.x, sinCosPhi.y);    
}

Also keep in mind that these functions remap the values to the range [0,1], so that you can store them in a regular fixed-point texture.  If you’re using a floating-point texture you can remove the division by PI if you wish (and the corresponding multiply by PI in the decode), as well as the “multiply by 0.5, add 0.5” (and the corresponding “multiply by 2, subtract 1” in the decode).
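If you want to sanity-check the precision you’ll get from an 8-bit-per-channel target, here’s a C++ mirror of the two HLSL functions plus a helper that simulates UNORM8 storage. The helper names and the 8-bit assumption are mine; a 16-bit format would just use 65535 instead of 255.

```cpp
#include <cmath>

struct Float2 { float x, y; };
struct Float3 { float x, y, z; };

const float kPi = 3.14159f;  // same constant as the HLSL snippet

// CPU mirror of CartesianToSpherical: both outputs land in [0,1].
Float2 Encode(Float3 n)
{
    return { (std::atan2(n.y, n.x) / kPi) * 0.5f + 0.5f,
             n.z * 0.5f + 0.5f };
}

// CPU mirror of SphericalToCartesian.
Float3 Decode(Float2 e)
{
    float theta = (e.x * 2.0f - 1.0f) * kPi;
    float z = e.y * 2.0f - 1.0f;
    float sinPhi = std::sqrt(std::fmax(0.0f, 1.0f - z * z));
    return { std::cos(theta) * sinPhi, std::sin(theta) * sinPhi, z };
}

// Simulates writing a [0,1] value to an 8-bit UNORM channel and reading it back.
float QuantizeUnorm8(float v)
{
    return std::round(std::fmin(std::fmax(v, 0.0f), 1.0f) * 255.0f) / 255.0f;
}
```

Running a few normals through Encode, QuantizeUnorm8, and Decode and comparing the result against the original (for example with a dot product) gives a quick feel for how much angular error the storage format introduces.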

What’s good on the menu, waiter?

I remember reading someone say on gamedev.net that at some point everyone tries to write their own UI system, and usually gets it wrong.  Apparently he’s right (or at least about the first part), because I’ve gone ahead and written a menu/UI system.  While it initially started out as part of the engine/framework I’ve been working on for my game, as I worked on it I decided it might be better off if I decoupled it from the rest of the engine components and made it a standalone library/editor package so that other people could make use of it.

While designing and implementing I had these goals in mind:

  • Keep it simple!  Make menu elements useful by default, but don’t cram in tons of functionality with limited use.  Just let them be flexible enough so that they can be customized for unusual cases.
  • Cross-platform, with a focus on Xbox 360.  Should look identical on both, and expose the same functionality regardless of input method.
  • Page-based layout. A few of the other GUI packages out there seem to be aimed at recreating WinForms using XNA…and I think that’s silly.  You don’t want resizable windows for a game (or at least not most games), you want menus that are logically divided up into pages that you can switch between.
  • A PC-only editor application that lets you visually design your menus.   The core library should be aware of the fact that it can run in a designer, and provide support for this.
  • Free and open-source!

What I ended up with is the CPX Menu System.  It actually came out better than I expected…the editor is very stable and works pretty nicely.  It could use some more fancy features (like tools for lining up menu items), but it definitely WORKS and I’m happy about that.  As for the menu item types included in the library itself…it’s pretty bare-bones, but you can still do a lot with them.  I mean, personally for my game I wouldn’t really need a whole lot more than what I put in the sample app.

Probably its biggest weakness is that working with content is a bit awkward.  Early on I struggled a lot with trying to come up with a good way to handle it…and I don’t feel like I ever really came up with a killer solution.  As of right now, the editor app itself does not build any content at runtime.  This isn’t so nice, since you have to have your content compiled ahead of time before you run the app.  The upside is that the editor doesn’t depend on the content pipeline assemblies at all, so you can run it on a PC that doesn’t have the full XNA GS install.  Probably the easiest way to manage content is to just add all of your menu content to the CPXMenu project’s Content project.  If you do that, then you will always have the content available for the editor and your game (assuming you’re always building the editor in VS and running it that way).  Otherwise you can tell the editor to look for content in a specific path whenever it loads a project.  This is what I did for the sample app: it has its own Content project with some custom textures, so I set the editor to look in the output folder for that project.

I guess that’s it for now…at some point I suppose I’ll announce it on Ziggyware.  Maybe after I add some documentation explaining how to use the damned thing.  In the meantime, here are some screenshots of the sample app and the editor:
