A quick note on shader compilers

This morning I was wrestling with a particularly complicated compute shader, which was taking just shy of 10 minutes to compile using D3DCompiler_43 from the June 2010 DirectX SDK. After a few failed attempts to speed it up by rearranging the code, I figured I’d try it out with the new version of the compiler that comes with the Windows 8 SDK. I wasn’t expecting any miracles, but to my surprise it compiled my shader in about 45 seconds! I figured I would pass along the knowledge, in case anyone else is dealing with a similar problem.

Light Indexed Deferred Rendering

There’s been a bit of a stir on the Internet lately due to AMD’s Leo demo, which was recently revealed to be using a modern twist on Light Indexed Deferred Rendering. The idea of light indexed deferred has always been pretty appealing, since it gives you some of the advantages of deferred rendering (namely using the GPU to decide which lights affect each pixel) while still letting you use forward rendering to actually apply the lighting to each surface. While there’s little doubt at this point that deferred rendering has proven itself as an effective and practical technique, I’m sure that plenty of programmers currently maintaining such a renderer have dreamed of a day when they don’t have to figure out how to cram every attribute into their G-Buffer using as few bits as possible, or consume hundreds of megabytes for MSAA G-Buffer textures.

While the benefits of light indexed deferred were pretty obvious to me, I was pretty sure that the performance wouldn’t hold up when compared to the state of the art in traditional deferred rendering. So I decided to make a simple test app where I could toggle between the two techniques for the same scene. For the deferred renderer, I based my implementation very closely on Andrew Lauritzen’s work, since he had done quite a bit of work in terms of optimizing it for modern GPU architectures. The only differences were that I used a different G-Buffer layout (normals, specular albedo + roughness, diffuse albedo, and ambient lighting, all 32bpp) and I used an oversized texture instead of a structured buffer for writing out the individual MSAA subsamples from the compute shader.

For the light indexed deferred renderer implementation I used a depth-only prepass to fill the depth buffer, which was then used by a compute shader to compute the list of intersecting lights per-tile. This list was stored in either an R8_UINT or R16_UINT typed buffer (8-bit for < 255 lights, 16-bit otherwise), with enough space pre-allocated in the buffer to store a full light list for each tile. So no bitfields or linked lists or anything fancy like that, just a simple per-tile list terminated by a sentinel value. I found that this worked best for the forward lighting pass, since it resulted in the least amount of overhead for reading the list, although there might be better ways to do it. The forward rendering pass then figures out which tile each pixel is in, and applies the list of lights one by one.
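
To make the buffer layout concrete, here is a minimal CPU-side sketch of a per-tile, sentinel-terminated light list. It is not the code from my sample: the names (TileLightLists, SetTile, ForEachLight, TileIndex), the 0xFF sentinel, and the square-tile mapping are all assumptions made for illustration, and a std::vector stands in for the R8_UINT typed buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed terminator for the 8-bit index case (which is why it only works for < 255 lights).
constexpr uint8_t kSentinel = 0xFF;

// Hypothetical CPU-side model of the per-tile light index buffer.
// Each tile owns (maxLights + 1) slots, so a full list plus its terminator always fits.
struct TileLightLists {
    uint32_t numTiles;
    uint32_t maxLights;
    std::vector<uint8_t> indices; // flat buffer standing in for the R8_UINT typed buffer

    TileLightLists(uint32_t tiles, uint32_t lights)
        : numTiles(tiles), maxLights(lights),
          indices(static_cast<size_t>(tiles) * (lights + 1), kSentinel) {
        assert(lights < kSentinel); // light indices must never collide with the sentinel
    }

    // Writes one tile's list (the job of the light culling compute shader).
    void SetTile(uint32_t tile, const std::vector<uint8_t>& lights) {
        assert(tile < numTiles && lights.size() <= maxLights);
        const size_t base = static_cast<size_t>(tile) * (maxLights + 1);
        for (size_t i = 0; i < lights.size(); ++i) indices[base + i] = lights[i];
        indices[base + lights.size()] = kSentinel;
    }

    // Walks one tile's list until the sentinel (the job of the forward pixel shader).
    template <typename Fn>
    void ForEachLight(uint32_t tile, Fn&& applyLight) const {
        const size_t base = static_cast<size_t>(tile) * (maxLights + 1);
        for (size_t i = 0; indices[base + i] != kSentinel; ++i) applyLight(indices[base + i]);
    }
};

// Maps a pixel coordinate to its tile index, assuming square tiles of tileSize pixels.
inline uint32_t TileIndex(uint32_t x, uint32_t y, uint32_t tileSize, uint32_t tilesPerRow) {
    return (y / tileSize) * tilesPerRow + (x / tileSize);
}
```

The appeal of this layout is that the reader needs no count and no pointer chasing: it computes one base offset per pixel and then does linear reads until it hits the terminator.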

In both cases I used normalized Blinn-Phong with a Fresnel approximation for the lights, so nothing fancy there. I did use a terrible linear falloff for the point lights just so that I could artificially restrict the radius, so please don’t judge me for that. I also used the depth-only prepass for both implementations, since it actually resulted in a speedup of around 0.5ms for the G-Buffer pass. For a test scene, I used the ol’ Sponza atrium.

I gathered some performance numbers for the hardware I have access to, which is an AMD 6970 and an Nvidia GTX 570. For both GPUs I ran at 1920×1080 resolution with VSYNC disabled, and the timings represent total frame time. The Nvidia numbers were pretty much in line with my expectations:

Nvidia GTX 570
Lights  MSAA Level  Light Indexed Deferred  Tile-Based Deferred
128     No MSAA     6.94ms                  6.41ms
128     2x MSAA     7.81ms                  7.51ms
128     4x MSAA     8.47ms                  9.17ms
256     No MSAA     11.67ms                 9.43ms
256     2x MSAA     12.987ms                10.75ms
256     4x MSAA     13.88ms                 12.34ms
512     No MSAA     18.18ms                 14.084ms
512     2x MSAA     20.00ms                 15.63ms
512     4x MSAA     21.27ms                 17.24ms
1024    No MSAA     -                       27.03ms
1024    2x MSAA     -                       29.41ms
1024    4x MSAA     -                       31.25ms

Tile-based deferred rendering wins out in nearly every case, and it only gets worse as you add in more lights. Light indexed seems to scale a bit better with MSAA, but even with that it’s only enough to overcome the overall disadvantage for the 128 light case. For 1024 lights it seemed as though the Nvidia driver or hardware couldn’t handle the large buffer I was using for storing the light indices, as I was getting very strange artifacts on the lower half of the screen. However I can only imagine the trend would continue, and it would lag further behind the tile-based deferred renderer.

For the AMD 6970, the results were much more interesting:

AMD 6970
Lights  MSAA Level  Light Indexed Deferred  Tile-Based Deferred
128     No MSAA     5.26ms                  5.71ms
128     2x MSAA     5.98ms                  9.43ms
128     4x MSAA     6.49ms                  10.75ms
256     No MSAA     7.87ms                  7.87ms
256     2x MSAA     8.77ms                  11.11ms
256     4x MSAA     9.43ms                  13.15ms
512     No MSAA     11.76ms                 11.36ms
512     2x MSAA     12.98ms                 14.93ms
512     4x MSAA     13.89ms                 16.94ms
1024    No MSAA     22.22ms                 20ms
1024    2x MSAA     24.39ms                 25.64ms
1024    4x MSAA     25.64ms                 33.33ms

These results really surprised me. The light indexed renderer actually starts out faster than the deferred renderer, and doesn’t really start to fall behind until you hit 1024 lights. However with either 2xMSAA or 4xMSAA the light indexed renderer absolutely blows away the competition. I actually suspected that I did something wrong in my MSAA implementation, until I verified that I got similar results from the Intel sample. Perhaps there’s a better way to handle MSAA in a compute shader for AMD hardware? I didn’t spend a lot of time experimenting, so perhaps someone else has a few bright ideas. Either way it’s clear that forward rendering scales really well with MSAA on this hardware. Even the G-Buffer pass fares pretty well, as it goes from 1ms to 1.2ms to 1.3ms as the MSAA level increases (1.5ms to 1.9ms to 2.1ms without a z prepass).

So, where does this leave us? Even with these numbers we really don’t have a complete picture. Really we need some tests run with…

1. Different scenes, preferably some with even higher poly counts and/or some tessellation
2. More realistic material variety, including different texture configurations, layer blending, decals
3. A variety of complex BRDFs
4. A few different ambient/bounce lighting configurations
5. More lighting types, with different shadowing configurations
6. More hardware to test on

These things have some big implications for what you store in the G-Buffer, forward shading efficiency, and the cost of a z prepass. That last one is important, since it’s mandatory for light indexed deferred but optional for traditional deferred. While it can still be cheaper overall to have a z prepass before your G-Buffer pass (as it was in my case), that could change depending on your vertex processing costs.

So for now, my conclusion is that Light Indexed Deferred is at least in the realm of practical for most cases. Personally I consider even 256 to be a LOT of lights, so I’m not too worried about scaling up to thousands of lights anytime soon. But if anyone has access to different GPUs, I would love to get some more numbers so that I can post them here. So if you happen to have a 7970 or GTX 680 lying around, feel free to download my sample and take down some numbers. Originally the number of lights was hard-coded to 128 in the binary, but I uploaded a new version that lets you toggle through the number of lights that I used for my test runs.

You can find the code and binary on CodePlex: http://mjp.codeplex.com/releases/view/85279#DownloadId=363173

Here are a few numbers for a GTX 680 contributed by Sander van Rossen:

Nvidia GTX 680
Lights  MSAA Level  Light Indexed Deferred  Tile-Based Deferred
128     No MSAA     2.3ms                   2.6ms
128     2x MSAA     2.62ms                  3.86ms
128     4x MSAA     2.85ms                  4.95ms

And some more numbers for the AMD 7970 courtesy of phantom, gathered at 1280×720:

Lights  MSAA Level  Light Indexed Deferred  Tile-Based Deferred
128     No MSAA     1.8ms                   1.9ms
128     2x MSAA     2.0ms                   2.82ms
128     4x MSAA     2.3ms                   3.6ms
256     No MSAA     2.5ms                   2.3ms
256     2x MSAA     2.7ms                   3.3ms
256     4x MSAA     3.0ms                   4.2ms
512     No MSAA     3.3ms                   2.9ms
512     2x MSAA     3.8ms                   4.2ms
512     4x MSAA     4.2ms                   5.2ms
1024    No MSAA     5.9ms                   4.5ms
1024    2x MSAA     6.7ms                   6.4ms
1024    4x MSAA     7.4ms                   7.8ms

Nvidia GTX 580, from Nathan Reed

Lights  MSAA Level  Light Indexed Deferred  Tile-Based Deferred
128     No MSAA     3.77ms                  3.17ms
128     2x MSAA     4.14ms                  3.58ms
128     4x MSAA     4.39ms                  4.17ms
256     No MSAA     5.68ms                  4.37ms
256     2x MSAA     6.33ms                  4.95ms
256     4x MSAA     6.80ms                  5.52ms
512     No MSAA     8.54ms                  6.10ms
512     2x MSAA     9.62ms                  6.80ms
512     4x MSAA     10.42ms                 7.35ms
1024    No MSAA     16.67ms                 10.99ms
1024    2x MSAA     18.87ms                 12.05ms
1024    4x MSAA     20.41ms                 12.82ms

AMD 5870, from Ethatron

Lights  MSAA Level  Light Indexed Deferred  Tile-Based Deferred
128     No MSAA     3.59ms                  3.48ms
128     2x MSAA     4.03ms                  4.60ms
128     4x MSAA     4.44ms                  5.49ms
256     No MSAA     5.18ms                  4.42ms
256     2x MSAA     5.78ms                  5.95ms
256     4x MSAA     6.32ms                  6.89ms
512     No MSAA     7.51ms                  5.95ms
512     2x MSAA     8.40ms                  7.63ms
512     4x MSAA     9.09ms                  8.69ms
1024    No MSAA     14.28ms                 10.20ms
1024    2x MSAA     15.87ms                 12.98ms
1024    4x MSAA     17.24ms                 14.28ms

Radeon 7970 @ 1920×1080, from 3dcgi:

Lights  MSAA Level  Light Indexed Deferred  Tile-Based Deferred
128     No MSAA     3.03ms                  3.34ms
128     2x MSAA     3.52ms                  5.12ms
128     4x MSAA     3.96ms                  6.84ms
256     No MSAA     4.18ms                  4.20ms
256     2x MSAA     4.76ms                  6.25ms
256     4x MSAA     5.32ms                  8.13ms
512     No MSAA     5.85ms                  5.46ms
512     2x MSAA     6.62ms                  8.00ms
512     4x MSAA     7.19ms                  10.00ms
1024    No MSAA     10.42ms                 8.92ms
1024    2x MSAA     11.63ms                 12.66ms
1024    4x MSAA     12.82ms                 15.63ms

GPU Profiling in DX11 with Queries

For profiling GPU performance on the PC, there aren’t too many options. AMD’s GPU PerfStudio and Nvidia’s Parallel Nsight can be pretty handy due to their ability to query hardware performance counters and display the data, but they only work on each vendor’s respective hardware. You also might want to integrate some GPU performance numbers into your own internal profiling systems, in which case those tools aren’t going to be of much use.

To get around this, it’s possible to use D3D11 timestamp queries to get coarse-grained timing info for different parts of the frame. It’s a ways off from the kind of info you get from the vendor-specific tools, but it’s a lot better than nothing. It’s also pretty easy to implement. To profile a portion of your frame, you need a trio of ID3D11Query objects. Two of them need to have the type D3D11_QUERY_TIMESTAMP, and are used to get the GPU timestamp at the start and end of the block you want to profile. The third needs to have the type D3D11_QUERY_TIMESTAMP_DISJOINT, and it tells you whether your timestamps are invalid as well as the frequency used for converting from ticks to seconds. In practice it goes like this:

When starting a profiling block:

  • Call ID3D11DeviceContext::Begin and pass the DISJOINT query
  • Call ID3D11DeviceContext::End and pass the start TIMESTAMP query

When ending a profiling block:

  • Call ID3D11DeviceContext::End and pass the end TIMESTAMP query
  • Call ID3D11DeviceContext::End and pass the DISJOINT query

After waiting a sufficient amount of time for the queries to be ready:

  • Call ID3D11DeviceContext::GetData on all 3 queries
  • Compute the delta in ticks using the timestamps from both TIMESTAMP queries
  • Use the frequency from the DISJOINT query to convert the delta to a time in seconds
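
The last two steps boil down to a few lines of arithmetic. Here is a small sketch of just that part, written as plain C++ so it doesn’t depend on a device. DisjointData is a hypothetical struct that mirrors the Frequency and Disjoint fields of D3D11_QUERY_DATA_TIMESTAMP_DISJOINT, ElapsedMs is a made-up helper name, and the two tick values are whatever GetData returned for the TIMESTAMP queries.

```cpp
#include <cstdint>
#include <optional>

// Mirrors the two fields of D3D11_QUERY_DATA_TIMESTAMP_DISJOINT that matter here.
struct DisjointData {
    uint64_t frequency; // GPU ticks per second
    bool disjoint;      // true if timestamps taken in this interval are unreliable
};

// Returns the elapsed time in milliseconds, or nullopt when the measurement
// should be thrown away (disjoint interval or nonsensical inputs).
inline std::optional<double> ElapsedMs(uint64_t startTicks, uint64_t endTicks,
                                       const DisjointData& dj) {
    if (dj.disjoint || dj.frequency == 0 || endTicks < startTicks) return std::nullopt;
    const uint64_t delta = endTicks - startTicks;
    return static_cast<double>(delta) * 1000.0 / static_cast<double>(dj.frequency);
}
```

Returning an empty optional for a disjoint interval makes it easy for the caller to skip that frame’s sample instead of displaying a bogus number.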

Like any query, you need to wait for the GPU to actually execute all of the commands you submitted for the data to be ready. In my sample app, I handle this by keeping an array of queries for each profile block and moving to the next one each frame. Then at the end of the frame, I get the data from the oldest query and use that for outputting the timing data to the screen. So the actual timing data lags behind by a few frames, but that’s okay for real-time profiling. For automated benchmarks or performance snapshots you could either use the data from N frames later, or you could just stall at the end of the frame and wait for the query to be ready.
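
The bookkeeping for that array of queries can be sketched without any D3D calls. This is only an illustration of the indexing, not the code from the sample: QuerySlot stands in for the trio of ID3D11Query objects, and kQueryLatency = 5 is an assumed ring size.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

constexpr size_t kQueryLatency = 5; // assumed number of query sets kept per profile block

// Hypothetical stand-in for one set of 3 queries (start, end, disjoint).
struct QuerySlot {
    bool issued = false;
    uint64_t frameIssued = 0;
};

struct ProfileBlock {
    std::array<QuerySlot, kQueryLatency> slots;
    uint64_t currentFrame = 0;

    // During the frame: claim this frame's slot. Real code would call
    // Begin/End on the queries stored in the slot here.
    void IssueForFrame() {
        QuerySlot& slot = slots[currentFrame % kQueryLatency];
        slot.issued = true;
        slot.frameIssued = currentFrame;
    }

    // End of frame: returns which frame's results should be read back now
    // (the oldest slot in the ring), then advances to the next frame.
    std::optional<uint64_t> OldestIssuedFrame() {
        QuerySlot& slot = slots[(currentFrame + 1) % kQueryLatency];
        ++currentFrame;
        if (!slot.issued) return std::nullopt; // ring not full yet
        slot.issued = false;
        return slot.frameIssued;
    }
};
```

With a ring of 5 the readback trails the current frame by 4 frames, which is the lag mentioned above.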

Sample code and binaries are available on CodePlex: http://mjp.codeplex.com/releases/view/74987#DownloadId=292437

Average luminance calculation using a compute shader

A common part of most HDR rendering pipelines is some form of average luminance calculation. Typically it’s used to implement Reinhard’s method of image calibration, which is to map the geometric mean of luminance (log average) to some “key value”. This, combined with some time-based adaptation, allows for a reasonable approximation of auto-exposure or human eye adaptation.
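
For reference, the math is small enough to write out on the CPU. This is a rough sketch rather than the shader code from the sample: the function names, the 0.18 default key value, the epsilon, and the exponential adaptation model are all assumptions on my part.

```cpp
#include <cmath>
#include <vector>

// Geometric mean (log average) of luminance. The epsilon avoids log(0) on black pixels.
inline float LogAverageLuminance(const std::vector<float>& luminance, float epsilon = 1e-4f) {
    if (luminance.empty()) return 0.0f;
    double sumLog = 0.0;
    for (float l : luminance) sumLog += std::log(static_cast<double>(l) + epsilon);
    return static_cast<float>(std::exp(sumLog / static_cast<double>(luminance.size())));
}

// Scales a pixel's luminance so that the log average maps to the chosen key value.
inline float CalibratedLuminance(float pixelLum, float avgLum, float keyValue = 0.18f) {
    return pixelLum * keyValue / std::fmax(avgLum, 1e-4f);
}

// Assumed exponential adaptation: drift toward the current average at rate tau (1/seconds).
inline float AdaptLuminance(float previous, float current, float dtSeconds, float tau) {
    return previous + (current - previous) * (1.0f - std::exp(-dtSeconds * tau));
}
```

Averaging in log space is what keeps a handful of extremely bright pixels from dragging the whole exposure down, which an arithmetic mean would do.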

In the old days of DX9, the average luminance calculation was usually done by repeatedly downscaling a luminance texture as if generating mipmaps. Technically DX9 had support for automatic mipmap generation on the GPU, but support wasn’t guaranteed, so to be safe you had to do it yourself. DX10 brought guaranteed support for mipmap generation of a variety of texture formats, making it viable for use with average luminance calculations. It’s obviously still there in DX11, and it’s still a very easy and pretty quick way to do it. On my AMD 6950 it only takes about 0.2ms to generate the full mip chain for a 1024×1024 luminance map, which is pretty quick. But with DX11 it’s not sexy unless you’re doing it with compute shaders, which means we need to ditch that one-line call to GenerateMips and replace it with some parallel reductions. Technically a parallel reduction should have far fewer memory reads/writes compared to generating successive mip levels, so there’s also some actual sound reasoning behind exploring that approach.

The DirectX SDK actually comes with a sample that implements average luminance calculation with a compute shader parallel reduction (HDRToneMappingCS11), but unfortunately their CS implementation actually performs a fair bit worse than their pixel shader implementation. A few people on gamedev.net had asked about this and I had said that it should definitely be possible to beat successive downscaling with a compute shader if you did it right, and used cs_5_0 rather than cs_4_0 like the sample. When it came up again today I decided to put my money where my mouth is and make a working example.

The implementation is really simple: render the scene in HDR, render log(luminance) to a 1024×1024 texture, downscale to 1×1 using either GenerateMips or a compute shader reduction, apply adaptation, then tone map (and add bloom). My first try was to do the reduction in 32×32 thread groups (giving the maximum of 1024 threads per group), where each thread sampled a single float from the input texture and stored it in shared memory. Then the reduction is done in shared memory using the techniques outlined in Nvidia’s CUDA parallel reduction whitepaper, which helps avoid shared memory conflicts. The first pass used a 32×32 dispatch which resulted in a 32×32 output texture, which was then reduced to 1×1 with one more 1×1 dispatch. Unfortunately this approach took about 0.3ms to complete, which was slower than the 0.2ms taken for generating mips.

For my second try, I decided to explicitly vectorize so that I could take better advantage of the vector ALUs in my AMD GPU. I reconfigured the reduction compute shader to use 16×16 thread groups, and had each thread take 4 samples (forming a 2×2 grid) from the input texture and store them in a float4 in shared memory. Then the float4s were summed in a parallel reduction, with the last step being to sum the 4 components of the final result. This approach required only 0.08ms for the reduction, meaning I hit my goal of beating out the mipmap generation. After all that work saving 0.1ms doesn’t seem like a whole lot, but it’s worth it for the cool factor. The performance differential may also become more pronounced at higher resolution, or on hardware with less bandwidth available. I’m not sure how the compute shader will fare on Nvidia hardware since their GPUs don’t use vector ALUs, so it should be interesting to get some numbers. I’d suspect that shared memory access patterns are going to dominate over ALU cost anyway, so it could go either way.
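
If the shape of that reduction is hard to picture, here is a CPU simulation of what a single 16×16 thread group does under my reading of the description: one 2×2 load per thread into a float4, a tree reduction over the 256 float4s, then a final sum of the 4 components. The names and the row-major std::vector texture are stand-ins for illustration, and the loops run sequentially where the GPU would run threads in parallel with a group barrier between steps.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Float4 { float x, y, z, w; };

inline Float4 Add(const Float4& a, const Float4& b) {
    return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}

constexpr size_t kGroupDim = 16;
constexpr size_t kThreads = kGroupDim * kGroupDim; // 256 threads per group

// Simulates one thread group summing the 32x32 texel block it owns.
// tex is a row-major texture that is `width` texels wide.
inline float ReduceGroup(const std::vector<float>& tex, size_t width,
                         size_t groupX, size_t groupY) {
    std::array<Float4, kThreads> shared{}; // stands in for groupshared memory

    // Load phase: each "thread" fetches a 2x2 quad into one float4.
    for (size_t ty = 0; ty < kGroupDim; ++ty) {
        for (size_t tx = 0; tx < kGroupDim; ++tx) {
            const size_t px = (groupX * kGroupDim + tx) * 2;
            const size_t py = (groupY * kGroupDim + ty) * 2;
            shared[ty * kGroupDim + tx] = {
                tex[py * width + px],       tex[py * width + px + 1],
                tex[(py + 1) * width + px], tex[(py + 1) * width + px + 1]};
        }
    }

    // Reduction phase: each step halves the number of active threads.
    // Thread t only reads slot t + stride, which no thread writes in the same step.
    for (size_t stride = kThreads / 2; stride > 0; stride >>= 1) {
        for (size_t t = 0; t < stride; ++t) shared[t] = Add(shared[t], shared[t + stride]);
    }

    // Final step: collapse the 4 components of the surviving float4.
    const Float4& r = shared[0];
    return r.x + r.y + r.z + r.w;
}
```

For a 1024×1024 input this gives a 32×32 grid of groups, so the second pass still has a 32×32 texture to reduce down to 1×1.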

The sample project is up for download here: http://mjp.codeplex.com/releases/view/71518#DownloadId=268633

I am officially a published author

I recently collaborated with fellow DX MVPs Jason Zink and Jack Hoxley to write a D3D11-focused book entitled Practical Rendering and Computation with Direct3D 11, which just came up for sale on Amazon today. I wrote the HLSL and Deferred Rendering chapters in particular. All of the code samples are up on the Hieroglyph 3 CodePlex site, if you want to get an idea of the content. Or you can just take my word for it that it’s awesome. :P

Bokeh II: The Sequel

After I finished the bokeh sample, there were a few remaining issues that I wanted to tackle before I was ready to call it “totally awesome” and move on with my life.

Good blur – in the last sample I used either a 2-pass blur on a Poisson disc performed at full resolution, or a bilateral Gaussian blur performed at 1/4 resolution (both done in a pixel shader). The former is nice because it gives you variable filter width per-pixel, but you get some ugly noise-like artifacts due to insufficient sampling. Performance can really take a nose dive with too many samples, especially if your filter radius is very large. Doing two passes helps a lot, but gives you artifacts like the one pictured below:

The traditional “gaussian blur at 1/4 res” approach sucks even worse. This is because the lower resolution screws up the bilateral filtering, and performance also isn’t so great in a pixel shader due to the high amount of texture samples required. But worst of all, it just plain doesn’t look good due to using a lerp to blend between blurred and non-blurred pixels to simulate “in-focus” and “out-of-focus”. It ends up looking more like soft focus, rather than an image that’s gradually coming in or out of focus. On top of that you get aliasing artifacts from working at a lower res, which causes shimmering and swimming.

My solution to this problem was to implement a monstrous 21-tap separable bilateral blur in a compute shader. Wide, separable blur kernels are a pretty nice fit for compute shaders because you can store the texture samples in shared memory, which allows each thread to take a single texture sample rather than N. Shared memory isn’t that quick, but for larger kernels (10 pixels or so) the savings start to win out over a pixel shader implementation. With such a wide blur kernel, I could perform the blurring at full resolution in order to provide nice sharp edges when the foreground is in-focus and the background is out-of-focus. Here’s a similar image to the one above, this time artifact-free:

I ended up just using a box filter rather than a Gaussian, which normally gives you ugly box-shaped highlights on the bright spots. Fortunately the bokeh does a great job of extracting those points and replacing them with the bokeh shape. To avoid having to lerp between blurred and un-blurred versions of the image, I had the filter kernel reject samples outside of the CoC size of the pixel being operated on. This effectively gives you a variable-sized blur kernel per-pixel, which gives you much nicer transitions for objects moving in or out of focus. It doesn’t look quite as good as the transitions for the disc-based blur, but I think it’s a fair trade off. The image below shows what the kernel looks like as it transitions from in-focus to out-of-focus:
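
Here is a stripped-down, single-channel sketch of one pass of that idea, run on the CPU: a box filter whose reach is limited per pixel by that pixel’s CoC. It leaves out the bilateral depth weighting and the shared memory caching, and the names and signature are made up for illustration.

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

// One 1D pass of a separable box blur. Samples farther away than the center
// pixel's CoC (in pixels) are rejected, so the kernel width varies per pixel.
// radius = 10 gives the 21-tap kernel.
inline std::vector<float> BlurRow(const std::vector<float>& color,
                                  const std::vector<float>& coc, int radius = 10) {
    const int n = static_cast<int>(color.size());
    std::vector<float> out(color.size());
    for (int i = 0; i < n; ++i) {
        const float reach = std::fmax(coc[i], 0.0f);
        float sum = 0.0f;
        int count = 0;
        for (int offset = -radius; offset <= radius; ++offset) {
            const int j = i + offset;
            if (j < 0 || j >= n) continue;                                // off the edge of the row
            if (static_cast<float>(std::abs(offset)) > reach) continue;  // outside this pixel's CoC
            sum += color[j];
            ++count;
        }
        out[i] = sum / static_cast<float>(count); // offset 0 always passes, so count >= 1
    }
    return out;
}
```

A pixel with a CoC of 0 only ever accepts its own sample and comes out untouched, which is what removes the need for a lerp against the un-blurred image.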


Out-of-focus foreground objects – As in most DOF implementations, foreground objects that were out-of-focus were blurred, but still had hard edges. This can look pretty bad, as the object blurring into the background is a key part of simulating the look. A simple way to rectify this issue is to store your CoC size or blurriness factor in a texture, then blur it in screen space. This is essentially the approach Infinity Ward has used for its past few Call of Duty games. I went with something similar, except like with the DOF blur I used a compute shader to do a 21-tap separable blur of the CoC texture. To make this look good you have to make sure you only gather samples that come from a lower depth and that have a larger CoC than the pixel being operated on by that thread. Here’s an out-of-focus foreground object with and without the CoC spreading:

CoC Blur Off:

CoC Blur On:


No depth occlusion for bokeh sprites – I didn’t do any sort of occlusion testing for bokeh sprites in the sample, and waved it off by saying it would be simple to use the depth stencil buffer and enable depth testing. While that’s true, it turns out that doing it that way doesn’t really give you great results. The problem is that if a bokeh sprite is covering an area that’s totally out of focus, you don’t want those out-of-focus pixels occluding the bokeh sprites. Otherwise the bokeh just won’t blend in at all with the pixels that are blurred by the DOF blur pass. So instead I implemented my own depth occlusion function that attenuates based on depth, but removes the attenuation if the pixel is out-of-focus. This let me keep the same overall look, while removing cases where I was getting “halos” due to the bokeh sprites rendering on top of in-focus objects. And since I was doing it in the shader anyway, I threw in a “soft” occlusion function rather than a binary comparison.

Without occlusion:

With occlusion:
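
To give an idea of what such an occlusion function can look like, here is one possible formulation written as plain C++. This is a guess at a reasonable shape rather than the exact function from the sample: the linear depth ramp, the softness parameter, and the use of a 0 to 1 blur factor are all assumptions.

```cpp
#include <cmath>

inline float Saturate(float v) { return std::fmin(std::fmax(v, 0.0f), 1.0f); }

// spriteDepth, sceneDepth: linear depth of the bokeh sprite and of the scene pixel under it.
// sceneBlur: 0 when the scene pixel is in focus, 1 when it is fully out of focus.
// softness: depth range over which the occlusion fades in (hypothetical tuning value).
inline float BokehVisibility(float spriteDepth, float sceneDepth,
                             float sceneBlur, float softness) {
    // 1 while the sprite is in front of the scene pixel, fading to 0 once the
    // sprite is `softness` units behind it (the "soft" comparison).
    const float depthVisibility = Saturate(1.0f - (spriteDepth - sceneDepth) / softness);
    // Out-of-focus scene pixels shouldn't occlude, so blend back toward fully visible.
    return depthVisibility + (1.0f - depthVisibility) * Saturate(sceneBlur);
}
```

The result would be multiplied into the sprite color before additive blending, so a sprite behind a sharp, in-focus object fades out while one behind a blurry object is left alone.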

Code and binaries are available here: http://mjp.codeplex.com/releases/view/64806

4/22/2011 – Fixed a texture sampling bug on Nvidia hardware

How To Fake Bokeh (And Make It Look Pretty Good)

Before I bought a decent DSLR camera and started putting it in manual mode, I never really noticed bokeh that much. I always just equated out-of-focus with blur, and that was that. But now that I’ve started noticing, I can’t stop seeing it everywhere. And now every time I see depth of field effects in a game that doesn’t have bokeh, it just looks wrong. A disc blur or even Gaussian blur is fine for approximating the look of out-of-focus areas that are mostly low-frequency, but the hot spots just don’t look right at all (especially if you don’t do it in HDR).

So what are our options for getting a decent bokeh look in real time? Here’s a short (and by no means complete) list:

1. Render the scene multiple times and accumulate samples using the aperture shape – this is obviously a no-go for real-time rendering.

2. Stochastic rasterization, using scatter – this is becoming more feasible now that we can implement scatter in pixel or compute shaders, but requires a lot of samples to not look like crap (or maybe something like this) and would likely have performance problems due to the non-coherent memory writes.

3. Do scatter-as-gather as a post process – this is doable and could be implemented in a compute shader or even a pixel shader, but it’s expensive since you need a huge number of samples for large CoC sizes. Plus it can be tricky to implement, since you really need to be careful about energy conservation. Gamedev user FBMachine actually implemented this approach and documented some of the issues here…as you can see the results can be nice but the performance isn’t so great.

4. Render each pixel as a point sprite, using the CoC-size + aperture shape – this approach initially sounds completely unrealistic, since you’re talking about huge bandwidth usage from the blending and massive overdraw. However the guys who make 3DMark actually implemented some optimizations to make it usable for their “Deep Sea” demo in 3DMark11. They go into a little bit of detail in their white paper, but the basic gist of it is that they extract pixels with a CoC above a given threshold and append the point into an append/consume buffer (basically a stack that your shaders can push onto or pull from), then render the points as point sprites to one of several render targets. The render targets are successively smaller like mip levels, and they render the larger points to smaller render targets. By doing this they help curb the massive overdraw/blending cost of large points. They also do the point extraction several times, each time from a progressively downsampled version of the input image and with a different CoC threshold. I presume they do this to avoid extracting huge amounts of points.

The guys at Capcom also did something similar to this for the DX10 version of Lost Planet, although I’m not too familiar with the details. There are some pictures + descriptions here and here.

5. Pick out the bright spots, render those as point sprites using the aperture shape, and do everything else with a “traditional” blur-based DOF technique – ideally with this approach we get the nice bokeh effects for the parts where it’s really noticeable (the highlights), and use something cheap for everything else. Gamedev.net poster Hodgman took a crack at implementing this approach using point sprites and vertex textures, and documented his results in this thread. His main problems were due to flickering + instability, since he had to downscale many times in order to render the grid of point sprites.

For my own implementation, I decided to go for #5. I spent a lot of time staring at out-of-focus images, and decided that it really wasn’t necessary to do a more accurate bokeh simulation for most of the image. For instance, take a look at this picture:

At least 90% of that image doesn’t have a distinctive bokeh pattern, and looks very similar to either a box blur or disc blur with a wide radius. It’s only those bright spots that really need the full bokeh treatment for it to look convincing.

With that in mind, I came up with the following approach:

  1. Render the scene in HDR
  2. Do a full-res depth + blur generation pass where we sample the depth buffer, and write out linear depth + circle of confusion size to an R16G16_FLOAT buffer
  3. Do a full-res bokeh point extraction pass. For each pixel, compute the average brightness of the 5×5 block surrounding the pixel and compare it with the brightness of the current pixel. If the pixel brightness minus the average brightness is above a certain threshold and the CoC size is above a certain threshold, push the pixel position + color + CoC size onto an append buffer and output a value of 0.0 as the pixel color. If it doesn’t pass the threshold, output the input color.
  4. Do a regular DOF pass. I implemented two versions: one that does a full-res disc-based blur in two passes using 16 samples on a Poisson disc with radius == CoC size, and one that does ye olde 1/4-res Gaussian blur (with edge bleeding reduction) and a full-screen lerp between the un-blurred and blurred version.
  5. Copy the embedded count from the append buffer to another buffer, and use the second buffer as an indirect arguments buffer for a DrawInstancedIndirect call. Basically this lets us draw however many points are in the buffer, without having to copy anything back to the CPU. The vertex shader for each point then samples the position/color/CoC size from the append buffer and passes it to the geometry shader, which expands the point into a quad with size equal to the CoC size. The pixel shader then samples from a texture containing the aperture shape, and additively blends the result into an empty render target. The render target can be either full res, or 1/4 res to save on bandwidth.
  6. Combine the results of the bokeh pass with the results of the DOF pass by summing them together in a pixel shader.
  7. Pass the result to bloom + tone mapping, and output the image.
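
The extraction test in step 3 is the heart of the technique, so here is a CPU sketch of it operating on a single-channel brightness image. The real pass works on color and pushes to an append buffer; in this sketch a std::vector plays the role of the append buffer, and the names, the clamped borders, and the scalar brightness are simplifications of mine.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct BokehPoint {
    int x, y;
    float brightness;
    float coc;
};

// Pulls out pixels that are much brighter than their 5x5 neighborhood and have a
// large enough CoC. Extracted pixels are zeroed in `brightness` so that the
// regular DOF blur doesn't count them a second time.
inline std::vector<BokehPoint> ExtractBokeh(std::vector<float>& brightness,
                                            const std::vector<float>& coc,
                                            int width, int height,
                                            float brightnessThreshold, float cocThreshold) {
    const std::vector<float> input = brightness; // read from a copy, like a separate input texture
    std::vector<BokehPoint> points;              // stands in for the append buffer
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int dy = -2; dy <= 2; ++dy) {
                for (int dx = -2; dx <= 2; ++dx) {
                    const int sx = std::clamp(x + dx, 0, width - 1);
                    const int sy = std::clamp(y + dy, 0, height - 1);
                    sum += input[static_cast<size_t>(sy) * width + sx];
                }
            }
            const float average = sum / 25.0f;
            const size_t i = static_cast<size_t>(y) * width + x;
            if (input[i] - average > brightnessThreshold && coc[i] > cocThreshold) {
                points.push_back({x, y, input[i], coc[i]});
                brightness[i] = 0.0f;
            }
        }
    }
    return points;
}
```

Comparing against the local average rather than an absolute brightness means large uniformly bright areas stay with the cheap blur, and only isolated hot spots become sprites.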

I actually implemented everything with pixel shaders, since I find they’re still quicker for rapid prototyping compared to compute shaders. The bokeh generation step and Gaussian blurs probably would have benefited from using shared memory to avoid redundant texture samples, but not so much that the penalty is huge. The disc-based blur isn’t all that great of a fit either, since I used a very large sampling radius (usually at least 16 pixels). For the disc blur I also did it in two passes with 16 samples each, in order to avoid some of the nasty banding artifacts that come from using a large sampling radius. This leads to some artifacts around edges, but it’s too bad. Either way the DOF part isn’t really important, and you could swap it out with whatever cool new technique you want. I also didn’t end up using proper lens-based CoC-size calculations, since I found it was a pain in the ass to work with. So I reverted to a very simple linear interpolation between “out-of-focus” and “in-focus” distances, and then multiplied the value by a tweakable maximum CoC size.
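
That simplified CoC calculation might look something like the sketch below. The parameter names and the separate near and far ramps are my own assumptions for illustration; the only part taken from the description above is the linear interpolation scaled by a maximum CoC size.

```cpp
#include <cmath>

inline float Saturate(float v) { return std::fmin(std::fmax(v, 0.0f), 1.0f); }

// Linear ramp: 0 at inFocusDist, 1 at outOfFocusDist, clamped outside that range.
// Also works for the near side, where outOfFocusDist < inFocusDist.
inline float BlurFactor(float depth, float inFocusDist, float outOfFocusDist) {
    return Saturate((depth - inFocusDist) / (outOfFocusDist - inFocusDist));
}

// Hypothetical four-distance setup: fully blurred at nearOut, sharp between
// nearIn and farIn, fully blurred again at farOut.
inline float CocSize(float depth, float nearOut, float nearIn,
                     float farIn, float farOut, float maxCocSize) {
    const float blur = depth < nearIn ? BlurFactor(depth, nearIn, nearOut)
                                      : BlurFactor(depth, farIn, farOut);
    return blur * maxCocSize;
}
```

The appeal is that the distances are direct artistic controls, with no focal length or aperture math to reason about.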

As for the bokeh itself, it looks pretty good since it’s using a texture and can have whatever shape you want. It’s also pretty stable since the extraction is done at full resolution, and so you don’t get much flickering or jumping around. I didn’t use depth testing when rendering the bokeh sprites…I had intended on doing it, but then decided it wasn’t really necessary. However I’d imagine it would probably be desirable if you wanted to render really large bokeh points, in which case it would be trivial to implement.

Now for some results. This is with the foreground in focus, and the background out of focus:

The bokeh isn’t too distinctive here since most of the image is in focus, but you can definitely see the hexagon pattern on some of the background geometry.

This one has the whole scene out of focus, and so you can see a lot more of the bokeh effect:

Now you can really see the bokeh! Here’s another with a circle-shaped bokeh:

This one is with the bokeh sprites rendered to a 1/4-resolution texture:


Finally, this one is with the brightness threshold set to 0. Basically this means that every out-of-focus pixel gets drawn as a point sprite, which I used as a sort of “reference” image.

I think the earlier shots hold up pretty well in comparison! The biggest issue that I notice is that it can look a bit weird if your DOF blur radius and your bokeh radius don’t match up. It starts to become pretty obvious if you crank up the maximum bokeh size, but still use a small radius for blurring everything else. This is because you don’t want to be able to clearly discern what’s “underneath” the bokeh sprites…you want it to pretty much look like a solid color. To help with this I added a parameter to tweak the falloff used for conserving energy as the bokeh point sprites get larger. Basically it does a pow on the falloff, which is computed by comparing the area of a circle with radius == CoC against that of a circle with the radius of a single pixel. So by setting the falloff tweak to a lower value, the points are artificially brightened and appear more opaque.
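
In code form, a falloff along those lines could be written as below. This is my reconstruction from the description rather than a copy of the shader: the pixelRadius default of 1 and the clamp that stops small sprites from getting brighter than the source pixel are both assumptions.

```cpp
#include <algorithm>
#include <cmath>

// Energy-conserving brightness scale for a bokeh sprite. Spreading one pixel's
// energy over a circle of radius cocRadius dims it by the ratio of the two areas;
// raising that ratio to falloffTweak < 1 artificially brightens large sprites.
inline float BokehFalloff(float cocRadius, float falloffTweak, float pixelRadius = 1.0f) {
    const float r = std::max(cocRadius, pixelRadius);             // never brighter than the source
    const float areaRatio = (pixelRadius * pixelRadius) / (r * r); // pi cancels out of the ratio
    return std::pow(areaRatio, falloffTweak);
}
```

With a tweak of 1 a sprite of radius 4 gets 1/16 of the pixel’s brightness, while a tweak of 0.5 raises that to 1/4.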

If you want to check it out yourself, you can download the source code + binaries here: http://cid-538e432ea49f5bde.office.live.com/self.aspx/Public/Samples%20And%20Tutorials/DX11/Bokeh.zip

Updated (3/27/2011): Changed the shadow filtering shader code so that it doesn’t cause crashes on Nvidia hardware

Radiosity, DX11 Style

Radiosity isn’t exactly new. According to Wikipedia it’s been used for rendering since the early 80s, and this page looks like it may have been the first web page on the Internet. The basic premise is dead simple: for each point where you want to bake lighting (typically either a texel in a lightmap, or a vertex in a mesh), render the rest of the scene and any exterior light sources (skydome, area lights, sun, whatever) in all directions within a hemisphere surrounding the surface normal at that point. As described in the article I linked, this is typically done by rendering 5 faces of a hemicube so as to play nice with a traditional perspective projection. It’s possible to use a parabolic projection for this (as in dual-paraboloid shadow mapping), but there are problems you can run into which are outlined here. Once you’ve fully rendered your hemisphere, you then integrate your computed radiance about the hemisphere with a cosine kernel to compute irradiance, which you can use to determine the diffuse reflectance for that point. You can then store this diffuse lighting value in your vertex or lightmap texel, and you’re ready to render your geometry fully lit by direct lighting. Typically you repeat the process many times, rendering the scene geometry as lit by the results of your previous iteration. This effectively allows you to compute lighting for each successive bounce off the scene geometry, adding an indirect lighting term. With enough iterations, you eventually begin to converge on a full global illumination solution for diffuse lighting.

The nice part about the technique is that it’s pretty simple to implement…if you can rasterize a mesh you’re already most of the way there, and you can even co-opt an existing real-time rendering engine to do it. Taking the latter approach has the added benefit that any material or lighting feature you add to your engine benefits your radiosity baker by default…so for instance if you have some complex pixel shader that blends multiple maps to determine the final material albedo, you don’t need to implement an equivalent solution in a ray tracer or photon mapper. You can even take advantage of your lighting and shadowing pipeline, if you do it right. The major downside is that it’s usually pretty slow, even if you implement it all on the GPU. This is because you typically have to serialize rendering of the scene for each vertex or lightmap texel, rather than baking many vertices/texels in parallel (which is possible with ray tracing implementations, particularly if you do it on the GPU using Cuda/Optix).

Recently when I was trying to get myself familiar with GI techniques, I decided to implement my own radiosity baker (with a lot of help from my coworker Dave). However to make it cool and hip for the DX11 age, I deviated from the “standard” radiosity process in 3 ways:

  1. Rather than producing a single diffuse color per sample point, I baked down to a set of 1st-order H-basis coefficients representing irradiance about the hemisphere. This lets you use normal mapping with your baked values, which adds higher fidelity to your precomputed lighting. This is similar to what Valve, Bungie, and Epic do for their lightmaps, except I’m using a different basis. If you’re not familiar with H-basis, they’re similar to spherical harmonics except that they’re only defined on the upper hemisphere. This allows you to get better quality with fewer coefficients, for situations where you only need to store information about a hemisphere rather than a full sphere.
  2. Instead of baking direct lighting for all light sources, I bake direct + indirect lighting for a skydome and indirect lighting only for the sun. This is similar to what Naughty Dog does in Uncharted. The advantage is that you can add in the direct sun lighting at runtime using a directional light, and you get nice high-frequency visibility from your shadow maps. This lets you avoid having to use an area light or environment map for representing your sun, which can be difficult to tune if you’re used to traditional analytical directional light sources. Plus you can light your dynamic geometry the same way and have the lighting match, and also have their dynamic shadows only remove the direct lighting term. Another advantage is that your baked lighting term generally only contains low-frequency information, since it doesn’t need to represent high-frequency shadow visibility from the direct term. So if your scene is decently tessellated you can get away with computing it per-vertex, which is what I did.
  3. I used compute shaders for integrating the radiosity hemicube down to H-basis coefficients. This not only made the integration really really fast, but it let me keep everything on the GPU and avoid messing with CPU-GPU memory transfers.

Setup

To prepare for baking, the scene and all of its individual meshes are loaded and processed. As part of the processing, I calculate a per-vertex tangent frame using a modified version of this approach. The tangents are needed for normal mapping, but they’re also used as a frame of reference for baking each vertex. This is because I store H-basis irradiance in tangent space. Tangent space provides a natural frame of reference for the hemisphere about the normal, and is also consistent across the vertices of a triangle. This lets me interpolate the coefficients across a triangle, which wouldn’t be possible if each vertex used a different frame of reference during integration. It also allows for a simple irradiance lookup with the tangent space normal sampled from a normal map, or (0, 0, 1) if normal mapping isn’t used.

Baking

The baking loop looks something like this:

for each Iteration
    for each Mesh
        Extract vertex data
        for each Vertex
            for each Hemicube Face
                if 1st iteration
                    Render the scene, depth only
                    Render the skydome
                else if 2nd iteration
                    Render the scene with baked lighting + shadowed diffuse from the sun
                else
                    Render the scene with baked lighting
            Integrate
    Sum the result of current iteration with result of previous iteration

Basically we do N + 1 iterations, where N is the number of indirect bounces we want to factor in. For the first iteration we add in all direct light sources (the skydome), for the second we add in the bounce lighting from the first pass plus the indirect-only lighting (the sun), and in all subsequent passes we only render the scene geometry with baked lighting.

Vertex Baking

For each vertex, we need to determine the radiance emitted from each surface visible around the hemisphere. This hemisphere of radiance values is known as the field-radiance function. Determining the radiance value for any surface is simple: we just render the corresponding mesh and evaluate the BRDF in the pixel shader, which in our case means sampling the albedo texture and using it to modulate the diffuse lighting. Since we’re going to do it using rasterization, we’ll render to a hemicube for the reasons mentioned previously. To represent my hemicube, I used 5 R16G16B16A16_FLOAT render target textures storing HDR color values. To keep things simple I made them all equal-sized, and rendered each face as if it were a full cube map rather than a hemicube. However I used the scissor test to scissor out the half of the cube face that would not be needed, for all faces other than the first. Initially I used 256×256 textures for the render targets, but eventually lowered it to 64×64. Increasing the resolution does increase the quality slightly, but the returns diminish very quickly past 64×64. This is because the irradiance integration filters out the high-frequency components, so any small details missed due to the small render target size have very little effect on the end result.

For the first pass, the scene is rendered with color writes disabled. This is because the mesh surfaces do not yet have incident lighting, and thus do not emit any radiance. Conceptually you can imagine this as though all light sources just began emitting light, and the light has yet to reach the mesh surfaces. So essentially we just render the mesh geometry to the depth buffer, in order to “block out” the sky and determine the overall visibility for that vertex. Once we’ve done this we render the skydome with a depth value of 1.0, so that any rendered geometry occludes it. Thus we “fill in” the rest of the hemicube texels with radiance values emitted by the skydome. For the skydome I used the CIE Clear Sky model, which is simple and easy to implement in a pixel shader. The final result in the hemicube textures looks like this:


For the second pass, we use the results of the first pass as the incident lighting for each surface pixel. This effectively causes the skydome lighting to “bounce” off the surface, adding indirect lighting. We also evaluate the diffuse contribution from the sun for each pixel, so that we get an indirect contribution from the sun as well. This contribution is calculated using a simple N (dot) L with the interpolated vertex normal and the sun direction. A shadow visibility term is also added using a shadow map, which is rendered as a low-resolution cascaded shadow map. Then the sum of the baked lighting and the sun light is modulated with the diffuse albedo, which is sampled from a texture. So the final exit radiance value for a pixel is computed like this:

radiance = (bakedLighting + sunLight * sunVisibility) * diffuseAlbedo

After the scene is rendered, the skydome is omitted since its contribution was already handled in the first pass. Thus the final hemicube looks like this:


For all subsequent passes, only the baked vertex irradiance is used for computing the exit radiance of each pixel. This is because the contributions from both of our light sources have already been added in previous passes, and we only need to further compute indirect lighting terms.

Integration

Once we’ve rendered the scene to all 5 sides of the hemicube, we have a full field-radiance function for the hemisphere stored in a texture map. At this point we could now compute a full irradiance distribution function for the hemisphere, which would provide us with an irradiance value for any possible surface normal. Such a function would be computed by convolving our field radiance with a cosine kernel, which is done by evaluating the following integral:

I(p,N_{p})=\int_{\Omega}L(p,\omega_{i})(N_{p}\cdot\omega_{i})d\omega_{i}

Unfortunately, a full irradiance distribution function in a texture map isn’t all that useful since it’s too much data to store per-vertex. So instead we’ll represent the irradiance map using 2nd-order spherical harmonics, using the method outlined in the paper “An Efficient Representation for Irradiance Environment Maps”. The basic procedure is to first convert the radiance map to a spherical harmonic representation by integrating against the spherical harmonic basis functions, and then convolve the result with a cosine kernel to compute irradiance. The following integral is used for projecting onto SH:

L_{lm}=\int_{\theta=0}^{\pi}\int_{\phi=0}^{2\pi}L(\theta,\phi)Y_{lm}(\theta,\phi)\sin{\theta}d{\theta}d{\phi}

For radiance stored in a texture map, we can implement this integration by using the method described in Peter-Pike Sloan’s Stupid Spherical Harmonics Tricks. For our purposes we’ll modify the algorithm by first converting each texel’s SH radiance to irradiance by convolving with a cosine kernel, and then converting the SH coefficients to 1st-order H-basis representation. This allows us to sum 12 values per texel, rather than the 27 required for 2nd-order SH. The algorithm looks something like this:

for each Hemicube Face
    for each Texel
        Sample radiance
        Calculate direction vector for the texel
        Project the direction onto SH and convolve with cosine kernel
        Multiply SH coefficients by sampled radiance
        Convert from SH to H-basis
        Weight the coefficients by the differential solid angle for the texel
        Add the coefficients to a running sum
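The “differential solid angle” step is the part that’s easiest to get wrong, so here’s a small C++ sketch that computes it for every hemicube texel and sanity-checks the totals. It’s a CPU illustration only: the function names are made up, and the weight formula is the usual cube map texel weight (texel area divided by the cube of the distance to the texel) rather than code taken from the sample.

```cpp
#include <cmath>

// Differential solid angle of texel (x, y) on an n x n cube face that sits
// at distance 1 from the vertex, with face coordinates u, v in [-1, 1].
double TexelSolidAngle(int x, int y, int n)
{
    double u = 2.0 * (x + 0.5) / n - 1.0;
    double v = 2.0 * (y + 0.5) / n - 1.0;
    double d2 = 1.0 + u * u + v * v;          // squared distance to the texel
    double texelArea = (2.0 / n) * (2.0 / n);
    return texelArea / (d2 * std::sqrt(d2));
}

struct HemicubeSums { double solidAngle; double cosineWeighted; };

// Sums the weights over the top face plus the visible half of the 4 side faces.
HemicubeSums SumHemicube(int n)
{
    HemicubeSums sums = { 0.0, 0.0 };
    for (int y = 0; y < n; ++y)
    {
        for (int x = 0; x < n; ++x)
        {
            double u = 2.0 * (x + 0.5) / n - 1.0;
            double v = 2.0 * (y + 0.5) / n - 1.0;
            double invLen = 1.0 / std::sqrt(1.0 + u * u + v * v);
            double w = TexelSolidAngle(x, y, n);

            // Top face: direction is normalize(u, v, 1), so cos(theta) = invLen.
            sums.solidAngle += w;
            sums.cosineWeighted += w * invLen;

            // Side faces: direction is e.g. normalize(1, u, v). Only the half
            // with v > 0 is above the surface; all 4 sides are symmetric.
            if (v > 0.0)
            {
                sums.solidAngle += 4.0 * w;
                sums.cosineWeighted += 4.0 * w * v * invLen;
            }
        }
    }
    return sums;
}
```

For a 64×64 hemicube the plain sum should land very close to 2π (the solid angle of a hemisphere), and the cosine-weighted sum close to π, which is the irradiance you’d expect from a uniform radiance of 1.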

What this essentially boils down to is a bunch of per-texel math, followed by a sum of all results. Sounds like a job for compute shaders! The first part is simple, since the per-texel math operations are completely independent of one another. The second part is a bit tougher, since it requires a parallel reduction to be efficient. Essentially we need to efficiently share results between different threads in order to avoid heavy bandwidth usage, while properly exploiting the GPU’s massively parallel architecture by sharing the workload across multiple minimally divergent threads and thread groups. Basically it’s pretty simple to implement naively, and tricky to do it with good performance. Fortunately Nvidia has a bunch of data-parallel algorithms that are part of their CUDA SDK, and one of them happens to be a parallel reduction. I won’t go into the details, but their whitepaper outlines the basic process as well as a series of improvements that can be made to the naive algorithm in order to improve performance. These improvements are a mix of algorithmic and hardware-specific optimizations, and pretty much all of them are easily applicable to compute shaders.
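If you haven’t seen a parallel reduction before, the basic shape is easy to model on the CPU. The C++ sketch below simulates one thread group summing its shared memory with a halving stride. It’s a simplified illustration of the general idea, not the shader from the sample, and the function name is made up.

```cpp
#include <cstddef>
#include <vector>

// CPU model of a shared-memory tree reduction for a single thread group.
// Assumes the group size (shared.size()) is a power of two.
float ReduceGroup(std::vector<float> shared)
{
    if (shared.empty())
        return 0.0f;

    for (std::size_t stride = shared.size() / 2; stride > 0; stride /= 2)
    {
        // On the GPU every "tid" in this inner loop is a separate thread
        // running in parallel, and the threads synchronize before the next
        // stride (GroupMemoryBarrierWithGroupSync in HLSL).
        for (std::size_t tid = 0; tid < stride; ++tid)
            shared[tid] += shared[tid + stride];
    }

    // Thread 0 ends up holding the sum for the whole group.
    return shared[0];
}
```

A 64-thread group finishes in 6 steps instead of 63 serial adds, and each step keeps the active threads contiguous, which is what keeps divergence low.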

My  implementation ended up being 2 passes: the first performing the conversion to H-basis irradiance and reducing each row of each face texture to a single set of RGB coefficients, and the second reducing to only 1 set of RGB coefficients. In the first pass, the threads are dispatched in 1x64x5 thread groups, with each group containing 64x1x1 threads. The following diagram shows how the threads are distributed relative to the hemicube textures for the first 2 faces:

The projection onto SH and cosine kernel convolution can be implemented pretty easily in HLSL, using values taken from the irradiance environment mapping paper. My HLSL code looks like this:

void ProjectOntoSH(in float3 n, in float3 color, out float3 sh[9])
{
    // Cosine kernel
    const float A0 = 3.141593f;
    const float A1 = 2.094395f;
    const float A2 = 0.785398f;

    // Band 0
    sh[0] = 0.282095f * A0 * color;

    // Band 1
    sh[1] = 0.488603f * n.y * A1 * color;
    sh[2] = 0.488603f * n.z * A1 * color;
    sh[3] = 0.488603f * n.x * A1 * color;

    // Band 2
    sh[4] = 1.092548f * n.x * n.y * A2 * color;
    sh[5] = 1.092548f * n.y * n.z * A2 * color;
    sh[6] = 0.315392f * (3.0f * n.z * n.z - 1.0f) * A2 * color;
    sh[7] = 1.092548f * n.x * n.z * A2 * color;
    sh[8] = 0.546274f * (n.x * n.x - n.y * n.y) * A2 * color;
}

Converting that to H-basis is also simple, and is expressed as a matrix multiplication. The values for the transformation matrix are given in the source paper. This is the shader code that I used:

void ConvertToHBasis(in float3 sh[9], out float3 hBasis[4])
{
    const float rt2 = sqrt(2.0f);
    const float rt32 = sqrt(3.0f / 2.0f);
    const float rt52 = sqrt(5.0f / 2.0f);
    const float rt152 = sqrt(15.0f / 2.0f);
    const float convMatrix[4][9] =
    {
        { 1.0f / rt2, 0, 0.5f * rt32, 0, 0, 0, 0, 0, 0 },
        { 0, 1.0f / rt2, 0, 0, 0, (3.0f / 8.0f) * rt52, 0, 0, 0 },
        { 0, 0, 1.0f / (2.0f * rt2), 0, 0, 0, 0.25f * rt152, 0, 0 },
        { 0, 0, 0, 1.0f / rt2, 0, 0, 0, (3.0f / 8.0f) * rt52, 0 }
    };

    [unroll(4)]
    for(uint row = 0; row < 4; ++row)
    {
        hBasis[row] = 0.0f;

        [unroll(9)]
        for(uint col = 0; col < 9; ++col)
            hBasis[row] += convMatrix[row][col] * sh[col];
    }
}

After the first pass, we’re left with a single buffer containing 64x5x3 float4 values, where each consecutive set of 3 float4 values represents the sum of all RGB H-basis coefficients for that row. To reduce to a single set of coefficients, we dispatch a reduction pass containing 1x3x1 groups of 64x1x5 threads. With this setup each group sums all 64 of the R, G, or B coefficients for a particular hemicube face and stores the result in shared memory. Once this has completed, the first thread of each group sums the 5 values for each hemicube to produce a single set of H-basis coefficients. This last step is somewhat sub-optimal since only a single thread performs the work, however for summing only 5 values I didn’t think it was worth it to try anything fancy or split the reduction into another pass. The following diagram shows the thread layout:
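Since the diagram doesn’t say much about memory layout, here’s a C++ model of what the second pass computes. The indexing scheme in RowIndex is an assumption (face-major, then row, then channel), as are all of the names, so treat it as a sketch of the data flow rather than the actual shader.

```cpp
#include <array>
#include <vector>

using Float4 = std::array<float, 4>;   // 4 H-basis coefficients

constexpr int kRows = 64;      // rows per hemicube face
constexpr int kFaces = 5;      // hemicube faces
constexpr int kChannels = 3;   // R, G, B

// Assumed buffer layout: 3 consecutive Float4 values (R, G, B) per row.
inline int RowIndex(int face, int row, int channel)
{
    return (face * kRows + row) * kChannels + channel;
}

// Reduces the per-row sums from the first pass to one Float4 per channel.
std::array<Float4, kChannels> ReduceRows(const std::vector<Float4>& rowSums)
{
    std::array<Float4, kChannels> result{};
    for (int c = 0; c < kChannels; ++c)           // one thread group per channel
    {
        Float4 faceSums[kFaces] = {};             // stands in for shared memory
        for (int f = 0; f < kFaces; ++f)
            for (int r = 0; r < kRows; ++r)       // 64 rows summed per face
                for (int i = 0; i < 4; ++i)
                    faceSums[f][i] += rowSums[RowIndex(f, r, c)][i];

        for (int f = 0; f < kFaces; ++f)          // final 5-value sum
            for (int i = 0; i < 4; ++i)
                result[c][i] += faceSums[f][i];
    }
    return result;
}
```

The last loop is the part that a single thread handles on the GPU, and it only ever touches 5 values per coefficient.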

The final result of this process is 3 sets of 4 H-basis coefficients (1 for each RGB channel) representing the irradiance across the hemisphere around the vertex normal, oriented in tangent space. After vertices are baked in this manner, I sum the vertex coefficients for each mesh with the results from the previous iteration in order to sum the bounces (which I do with a really simple compute shader). After the desired number of iterations, the coefficients are ready to be used at runtime and combined with direct sun lighting. Evaluating the H-basis coefficients to compute irradiance for a normal is pretty simple. I use the following code in my pixel shader, which takes a tangent space normal and the interpolated coefficients from the vertices:

float3 GetHBasisIrradiance(in float3 n, in float3 H0, in float3 H1, in float3 H2, in float3 H3)
{
    float3 color = 0.0f;

    // Band 0
    color += H0 * (1.0f / sqrt(2.0f * 3.14159f));

    // Band 1
    color += H1 * -sqrt(1.5f / 3.14159f) * n.y;
    color += H2 * sqrt(1.5f / 3.14159f) * (2 * n.z - 1.0f);
    color += H3 * -sqrt(1.5f / 3.14159f) * n.x;

    return color;
}
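A handy way to convince yourself that the three functions fit together is to run the whole chain on the CPU for a case with a known answer. The C++ sketch below ports the math to doubles, uses a scalar radiance instead of RGB, and swaps the hemicube for a simple latitude/longitude quadrature over the upper hemisphere. All of those are simplifications made for this example, and BakeUniformHemisphere is a made-up helper.

```cpp
#include <array>
#include <cmath>

struct Vec3 { double x, y, z; };
const double kPi = 3.14159265358979323846;

// Cosine-convolved SH projection of one radiance sample (scalar for brevity).
std::array<double, 9> ProjectOntoSH(const Vec3& n, double radiance)
{
    const double A0 = kPi, A1 = 2.0 * kPi / 3.0, A2 = kPi / 4.0;
    return {
        0.282095 * A0 * radiance,
        0.488603 * n.y * A1 * radiance,
        0.488603 * n.z * A1 * radiance,
        0.488603 * n.x * A1 * radiance,
        1.092548 * n.x * n.y * A2 * radiance,
        1.092548 * n.y * n.z * A2 * radiance,
        0.315392 * (3.0 * n.z * n.z - 1.0) * A2 * radiance,
        1.092548 * n.x * n.z * A2 * radiance,
        0.546274 * (n.x * n.x - n.y * n.y) * A2 * radiance
    };
}

// Same matrix as the HLSL version, with the zero entries skipped.
std::array<double, 4> ConvertToHBasis(const std::array<double, 9>& sh)
{
    const double rt2 = std::sqrt(2.0), rt32 = std::sqrt(1.5);
    const double rt52 = std::sqrt(2.5), rt152 = std::sqrt(7.5);
    return {
        sh[0] / rt2 + 0.5 * rt32 * sh[2],
        sh[1] / rt2 + (3.0 / 8.0) * rt52 * sh[5],
        sh[2] / (2.0 * rt2) + 0.25 * rt152 * sh[6],
        sh[3] / rt2 + (3.0 / 8.0) * rt52 * sh[7]
    };
}

double EvalHBasis(const Vec3& n, const std::array<double, 4>& h)
{
    const double k = std::sqrt(1.5 / kPi);
    return h[0] / std::sqrt(2.0 * kPi) - h[1] * k * n.y
         + h[2] * k * (2.0 * n.z - 1.0) - h[3] * k * n.x;
}

// Bakes uniform radiance over the upper hemisphere with midpoint quadrature.
std::array<double, 4> BakeUniformHemisphere(double radiance, int thetaSteps, int phiSteps)
{
    std::array<double, 4> sum{};
    const double dTheta = (kPi / 2.0) / thetaSteps;
    const double dPhi = (2.0 * kPi) / phiSteps;
    for (int t = 0; t < thetaSteps; ++t)
    {
        for (int p = 0; p < phiSteps; ++p)
        {
            double theta = (t + 0.5) * dTheta, phi = (p + 0.5) * dPhi;
            Vec3 dir = { std::sin(theta) * std::cos(phi),
                         std::sin(theta) * std::sin(phi),
                         std::cos(theta) };
            double solidAngle = std::sin(theta) * dTheta * dPhi;
            std::array<double, 4> h = ConvertToHBasis(ProjectOntoSH(dir, radiance));
            for (int i = 0; i < 4; ++i)
                sum[i] += h[i] * solidAngle;
        }
    }
    return sum;
}
```

With a uniform radiance of 1, evaluating the baked coefficients at the tangent space normal (0, 0, 1) should give a value close to π, and evaluating at a normal lying in the surface plane such as (1, 0, 0) should give roughly π/2, since only half of its hemisphere sees any light.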

Performance

To enable profiling with quick iteration, I made a very simple test scene containing a single mesh and 12,400 vertices. My initial implementation was pretty slow, baking only 545 vertices per second for the 1st pass, 292 vps for the 2nd pass, and close to 545 for all subsequent passes. For the first pass, I determined that the integration step was slowing things down considerably. Initially I had implemented integration using pixel shaders, which converted to H-basis and then reduced each hemicube face by a 1/4 each pass. This resulted in lots of unnecessary render target reads and writes, degrading performance. Moving to my current compute shader implementation brought the first pass to 1600 vps, and the second pass to 325 vps.

When I analyzed the second pass, GPU PerfStudio revealed that I was spending a significant amount of time in the geometry shader during the main rendering and shadow map rendering phases. I had used a geometry shader so that I could create the 5 hemicube faces as a texture array (or 4 shadow map cascades for the shadow map), and use SV_RenderTargetArrayIndex to specify the output array slice without having to switch render targets multiple times. I had known that this sort of geometry shader amplification performed poorly on DX10 hardware and had been hoping that it wouldn’t be so bad on my 5830, but unfortunately this was not the case. Ditching the geometry shader and setting the render target slices one by one brought me up to 1760 vps for the first pass and 480 vps for the second pass. Further performance was gained by switching the cascaded shadow map implementation to use an old-school texture atlas rather than a texture array, which brought me to 625 vps for the second pass. This was disappointing, since texture arrays are a totally natural and convenient way to implement cascaded shadow maps. Texture atlases are so DX9. Even after that the shadow map rendering was still really slowing down the 2nd pass, so I cut it down to 2 cascades (from 4) and reduced the resolution from 2048×2048 per cascade to 512×512. This got me to 850 vps for the test scene, about 600 vps for the broken tank scene from the SDK, and about 180 vps for the powerplant scene from the SDK. In its current state, the GPU spends a portion of each vertex bake idling due to processing so many commands and having multiple render target switches. It could definitely benefit from some reduction in the overall amount of API commands and state changes, and batching during the shadow map rendering. It would also probably benefit from using an approach similar to Ignacio’s, where the shadow map is only rendered once for a group of vertices.

Now for some pictures! These were all taken with only a single bounce, because I’m impatient.

Test scene: baked lighting only, baked lighting with normal maps, baked lighting + direct sunlight, baked light + direct with normal maps, final

Tank scene: baked only, baked with normal mapping, baked + direct, final, alternate final, another alternate final

Powerplant scene: direct only, baked only, baked + direct, baked w/o normal mapping, baked w/ normal mapping, alternate baked + direct, final

Source code and binaries are available here:

Part 1

Part 2

Updated (3/27/2011): Changed the shadow filtering shader code so that it doesn’t cause crashes on Nvidia hardware

Conservative Depth Output (and Other Lesser-Known D3D11 Features)

D3D11 came with a whole bunch of new big-ticket features that received plenty of attention and publicity. Things like tessellation, compute shaders, and multithreaded command submission have been the subject of many presentations, discussions, and sample apps. However D3D11 also came with a few other features that allow more “traditional” rendering approaches to benefit from the increased programmability of graphics hardware. Unfortunately most of them have gone relatively unnoticed, which isn’t surprising when you consider that most of them have little or no documentation (much like some of the cool stuff that came in D3D10.1). Not too long ago one of these neat tricks came to my attention by way of John Hable’s blog, which inspired me to dig around a bit and try out some of the other neat tricks I was missing out on. Quite a few are briefly described in this presentation from GDC. Here are a few of my favorites, in no particular order:

1. Conservative depth output: this is something you use for pixel shaders that manually output a depth value. Basically rather than using SV_Depth, you use a variant that also specifies an inequality. For instance SV_DepthGreaterEqual, or SV_DepthLessEqual. The depth you output from the shader must then satisfy the inequality relative to the interpolated depth of the rasterized triangle (if it doesn’t, the depth value is clamped for you). This allows the GPU to still use early-z cull, since it can still trivially reject pixels for cases where the depth test will always fail for the specified upper/lower bound. So for instance if you render a quad and output SV_DepthGreaterEqual, the GPU can cull pixels where the quad’s depth is greater than the depth buffer value. Don’t bother looking for this one in the documentation…it’s not in there.

2. SV_Coverage as an  input: D3D10.1 added the feature to let you output to SV_Coverage in order to manually specify the MSAA coverage mask (which controls how the pixel shader output gets written to the subsamples). In D3D11 you also can take it as an input to your pixel shader to know which of the sample points passed the triangle coverage test. This is really handy for deferred rendering, since you’ll want to mark off edge pixels as those are the only pixels that require you to sample all of the subsamples in the G-Buffer. In D3D10 you could do this with the centroid sampling trick, but it’s much nicer to just skip the intermediate step and get coverage directly. Plus the rules for centroid sampling are a little loosely defined, so I don’t really like relying on it.

3. Programmable interpolation: D3D10/D3D10.1 already had some modifiers you could use for pixel shader attributes that controlled how they were interpolated. For instance you had linear, noperspective, and centroid. In D3D11 you still have those, but you also have a series of EvaluateAttributeAt* intrinsics that allow you to evaluate an attribute using a specified interpolation mode. Probably the most useful of the bunch is EvaluateAttributeAtSample, which interpolates the attribute to the MSAA sample point for the specified index. Probably the most obvious use case is for selective supersampling…using that intrinsic you could evaluate your BRDF at each subsample location. You can also sample alpha-tested textures multiple times, effectively antialiasing the edges. I whipped up a little test case where I rendered a billboarded quad to an MSAA target, and in the pixel shader I did a simple ray-cast into a sphere located at the quad center. I took SV_Coverage as an input to determine if the pixel was an edge pixel (not all of the sample points were covered), and in that case I did a ray-cast per-sample using EvaluateAttributeAtSample to snap the interpolated view-space position to each sample point. This basically gives you selective super-sampling, so that you get anti-aliased edges without relying on rasterization. Cool stuff!

4. Read-only depth-stencil views: D3D10 let you bind depth-stencil buffers as shader resource views so that you could sample them in the pixel shader, but came with the restriction that you couldn’t have them bound to the pipeline as both views simultaneously. That’s fine for SSAO or depth of field, but not so fine if you want to do deferred rendering and use depth-stencil culling. D3D10.1 added the ability to let you copy depth buffers asynchronously using the GPU, but that’s still not optimal. D3D11 finally makes everything right in the world by letting you create depth-stencil views with a read-only flag. So basically you can just create two DSV’s for your depth buffer, and switch to the read-only one when you want to do depth readback.

5. Unordered access views for pixel shaders: UAV’s are essentially buffers or textures that give you both random read access *and* random write access. They’re usually mentioned in the context of compute shaders, but they’re actually usable for pixel shaders too. I haven’t really dug into this use case, but it seems as though you could implement scatter or fully programmable blending.
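To illustrate why conservative depth output (item 1) keeps early-z culling possible, here’s the logic modeled in plain C++. It assumes a “less than” depth test and a shader that promises a greater-or-equal output. Both function names are made up, and this models the reasoning rather than any real hardware behavior.

```cpp
#include <algorithm>

// Depth that actually gets written when the shader declares a
// "greater or equal" conservative output: values below the interpolated
// depth are clamped up to it.
float ClampConservativeGreaterEqual(float interpolatedDepth, float shaderDepth)
{
    return std::max(shaderDepth, interpolatedDepth);
}

// Early-z decision for a LESS depth test. The final depth is guaranteed to
// be >= interpolatedDepth, so if the rasterized depth already fails the
// test, whatever the shader outputs will fail it too and the pixel can be
// culled before the shader runs.
bool CanEarlyReject(float interpolatedDepth, float depthBufferValue)
{
    return interpolatedDepth >= depthBufferValue;
}
```

With a plain SV_Depth output there’s no such bound, so the GPU has to run the pixel shader before it can depth test at all.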

After doing some research, I came up with a quick sample app so that I could try out conservative depth output and see the performance results. I ended up basing it off the SoftParticles sample from the SDK, since depth sprites are probably the most obvious use-case for that particular feature. Here are some numbers I got running on my machine (Radeon HD 5830) at 1280×720 resolution, with the particles covering most of the viewport:

Basic billboarding: 8.7ms

Depth output enabled: 11.76ms

Conservative depth enabled: 9.52ms

Soft particles w/depth output: 23.25ms

Soft particles w/conservative depth: 18.18ms

So overall it looks like it gets you about halfway back to the performance you get with no depth output, which is pretty nice (especially considering how easy it is to use). In addition to that, I also used a read-only depth-stencil buffer for the soft particles so that I could keep depth testing active.

If you want to run it yourself or check out the code, I uploaded the code + binaries here: http://cid-538e432ea49f5bde.office.live.com/self.aspx/Public/Samples%20And%20Tutorials/DX11/DepthSprites.zip

I’ll also leave you off with a picture of my totally sweet fire/smoke effect. I should quit my job and become an effects artist.

Deferred MSAA

A long while ago I was looking into morphological antialiasing (MLAA) to see if I could somehow make it practical for a GPU that isn’t the latest monster from Nvidia or ATI. With MLAA most people talk about how nicely it cleans up edges (which it certainly does), but for me the really cool part is how it’s completely orthogonal to the technique used to render the image.  It could have been rasterized and forward rendered, it could be the product of a deferred rendering, or it could even be ray-traced: in all cases the algorithm works the same.  This seems pretty important to me, since in real-time rendering we’re getting further and further away from the old days of forward rendering to the backbuffer (AKA, the case where MSAA actually works properly).

Anyway I never really got anywhere (although someone took a stab at it in the meantime), but what DID happen is that I spent some time thinking about the downsides of MLAA and how they could be avoided. In particular, I’m talking about the inability of MLAA to handle temporal changes to polygon edges due to no sub-pixel information being available. MSAA doesn’t have the same problem, since it handles depth testing and triangle coverage at a sub-pixel level. So I thought to myself: “wouldn’t it be nice to use MSAA to get coverage info, and then use that to do some smart filtering in a post process?” Turns out it’s not too difficult to pull it off, and the results aren’t too bad. I’m calling my sample “Deferred MSAA” even though that’s a terrible name for it, but I can’t really think of anything better. Here’s the basic premise:

  1. Render all scene geometry in a depth-only prepass, with 4xMSAA enabled
  2. Render the scene normally with no MSAA to an HDR render target
  3. Apply tone mapping and other post processes
  4. Apply a single “deferred AA” pass that samples the MSAA depth buffer from step #1, uses it to calculate pixel coverage, and filters based on the coverage

To “calculate pixel coverage”, all I do is read all 4 depth subsamples and compare them (with a small tolerance) with the depth value from the non-MSAA pass. The idea is to try to figure out which depth subsamples came from rasterizing the same triangle that was rasterized in the non-MSAA pass. Then, using knowledge of the points in the sample pattern, I come up with a sample location where it’s likely that we’ll find a color on the other side of the triangle edge. The method I use to do this is really simple: if a subsample fails the equivalence test, add a vector from the pixel center to the sample location to a summed offset vector. Then after all subsamples are compared, average the offsets and take a single bilinear sample at that offset. Check out the diagram below for a visual representation:
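The offset calculation is small enough to sketch in a few lines of C++. This is a CPU illustration with made-up names, and it bakes in two assumptions: the sample positions follow the standard D3D 4x MSAA pattern, and “average” means dividing by the number of subsamples that failed the comparison.

```cpp
#include <cmath>

struct Float2 { float x, y; };

// Assumed: standard D3D 4x MSAA sample positions, in pixels relative to
// the pixel center.
static const Float2 kSamplePositions[4] = {
    { -0.125f, -0.375f }, {  0.375f, -0.125f },
    { -0.375f,  0.125f }, {  0.125f,  0.375f }
};

// Returns the offset (in pixels) at which to take the bilinear color sample.
Float2 ComputeFilterOffset(const float subsampleDepths[4], float pixelDepth,
                           float tolerance)
{
    Float2 sum = { 0.0f, 0.0f };
    int mismatched = 0;
    for (int i = 0; i < 4; ++i)
    {
        // A subsample that doesn't match the non-MSAA depth is assumed to
        // belong to a different triangle.
        if (std::fabs(subsampleDepths[i] - pixelDepth) > tolerance)
        {
            sum.x += kSamplePositions[i].x;
            sum.y += kSamplePositions[i].y;
            ++mismatched;
        }
    }

    if (mismatched == 0)
        return { 0.0f, 0.0f };   // interior pixel: sample the center as-is

    return { sum.x / mismatched, sum.y / mismatched };
}
```

For a pixel where only the first two subsamples differ, the result is (0.125, -0.25), which nudges the bilinear fetch up and to the right, toward the triangle that covers those subsamples.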

As I see it, this approach has these things going for it:

  1. Like with MLAA, AA no longer has anything to do with how you shade a pixel. So if you’re going deferred, there’s no “lighting the subsamples” nonsense. It also means you don’t necessarily need to render your G-Buffer with MSAA, which can save a lot in terms of memory usage and bandwidth.
  2. Also like MLAA, AA is applied after tone mapping. So you don’t have to tone map each subsample to avoid aliasing where there’s high contrast.
  3. Unlike MLAA, it’s simple and straightforward to implement on a GPU in a pixel shader.
  4. Unlike MLAA, it uses sub-pixel information which helps it handle temporal changes a lot better.

For comparison, here’s a screenshot from my sample app (click for full size):

That particular shot shows off the weakness of both standard MSAA and my approach. If you look at the high-contrast area where the tank overlaps with the bright sky, you can see how standard MSAA totally fails to do anything about the edge aliasing. This is because the MSAA resolve is done on the HDR pixel values, and tone mapping is a non-linear operation. However in the darker areas towards the bottom MSAA does just fine. In contrast (pun not intended), deferred MSAA holds up really well in the silhouettes. The AA quality is surprisingly decent for an edge filtering-based approach, and since it’s done after tone mapping there’s no non-linearity to worry about. However if you look carefully down in the tank, you’ll spot edges that don’t get the AA treatment at all. This is because the depth values are similar enough to pass the comparison test, despite coming from different triangles. You can see it most clearly in the grate on the front of the tank. The sample app has a tweakable threshold value for the depth comparison, and setting it lower causes more edges to be properly distinguished. However it also starts to cause false positives in triangles nearly parallel to the camera. The image below shows the same scene with a default threshold, and with a low threshold:

Even with a low threshold, some edges are still missed since only the normal or material properties will change between two triangles and not the depth.  For those cases more information would be required, such as the scene normals or an ID buffer.  But you could probably make the case that edges with depth discontinuities are more likely to have high contrast, and the high-contrast areas are where AA makes the biggest difference.

Aside from the quality issues, the other obvious concern with this approach is performance. Even with depth only, an MSAA prepass isn’t going to be cheap on a lot of hardware. It makes more sense if the depth prepass isn’t wasted by being used only for the post process…for instance it’s certainly possible to downscale the MSAA depth buffer with a MAX operation and use it to pre-populate the non-MSAA depth buffer. However it may be hard to get this to play nice with early z-cull, depending on the hardware. You could also possibly lay out your G-Buffer for light prepass with MSAA enabled, but then you might have trouble using depth values to reconstruct position.

You can get the sample code and binaries here: http://cid-538e432ea49f5bde.office.live.com/self.aspx/Public/Samples%20And%20Tutorials/DX11/DeferredMSAA.zip.  This one requires a D3D_FEATURE_LEVEL_11_0-capable device.
