10 Things That Need To Die For Next-Gen

Lately I’ve been thinking about things in graphics that have long worn out their welcome, and I started a list of techniques that I hope will be nowhere in sight once everyone moves on to next-gen console hardware (or starts truly exploiting high-end PC hardware). Here they are, in no particular order:

  1. Phong/Blinn-Phong – we need more expressive BRDF’s for our materials, and these guys are getting in the way. Phong is just plain bad, as it doesn’t even produce realistic streaks at glancing angles (due to using the reflection vector and not the halfway vector like with microfacet-based BRDF’s). Energy-conserving Blinn-Phong with a proper fresnel factor is a huge step in the right direction, but we can still do better. Personally I’m a big fan of Cook-Torrance for isotropic materials. It requires quite a bit more math compared to Blinn-Phong, but if there’s one thing modern GPU’s are good at it’s crunching through some ALU-heavy shader code. Anisotropic BRDF’s are also really important for a lot of materials, and I think we need to start getting them working side-by-side with Cook-Torrance in our deferred renderers. Another important hurdle to overcome is making sure our pre-rendered specular environment maps match our BRDF. Blurring the mip levels is a good approximation for Phong, but not so much for Blinn-Phong or Cook-Torrance. For anisotropic BRDF’s, it’s not even close.
  2. Specular Aliasing – during this generation we got pretty good at making things bumpy and shiny. What we didn’t get good at was making sure all of that bumpy shiny stuff didn’t turn into aliasing hell. Stephen Hill gave a great summary of the current lay of the land when it comes to specular antialiasing techniques, as well as a new technique for pre-computing them into gloss maps. Unfortunately those techniques are either formulated in terms of Blinn-Phong (in the case of Toksvig AA), or aim to replace Blinn-Phong (in the case of LEAN/CLEAN). This means that we still have a bit more work to do if we want to move on to Cook-Torrance. However I think these techniques have given us a great starting point, and I’m fairly confident that with some information about the variance of a normal map we can tackle the problem for other BRDF’s. Selective supersampling is another (expensive) possibility for problematic scenarios, which can even be seamlessly integrated into MSAA on DX11 hardware.
  3. SSAO – I don’t think too many would agree with me on this one, but I’ve just never been a big fan of SSAO. It was a tremendously clever idea when it came out, and it certainly has improved quite a bit since its original inception. However I don’t think I’ll ever get over the idea that the technique is fundamentally handicapped in terms of the information it has to work with. I’d really like to aggressively pursue alternative techniques based on primitive shapes and/or low-resolution representations of scene meshes, with the hopes of getting AO that’s more stable and captures larger scale occlusion. But I’m sure we’ll end up still using SSAO to fill in the cracks (pun intended).
  4. DXT5 Normal Maps – this one is a no-brainer, we just need to ditch current consoles and embrace BC5.
  5. Geometry Aliasing – we’ve been fighting this one for a long time now, but it still lingers. In fact you could almost argue that this problem has gotten worse due to the widespread use of deferred rendering and HDR rendering. Screen-space techniques like MLAA and FXAA have given us a great big band-aid to throw over the problem, but they are exactly that: a band-aid. They’re never going to completely solve the problem on their own, which means that we need to find better ways to make use of MSAA if we really want to solve some of the tougher cases. For deferred rendering this means being smart about which subsamples we shade, as well as how we shade them. Andrew Lauritzen’s work has given us a great starting point, but I’d imagine we’ll need to specifically tailor our approach for whatever target hardware we’re working with. For dealing with HDR, we need to be aware of the problems caused by tone mapping and make sure that we effectively work around them. Humus’s approach of individually tone mapping each subsample produces the desired result, however it also means keeping subsamples around until you perform tone mapping (which is the last stop in your post processing pipeline if you’re doing everything in HDR). Andrew Lauritzen suggested a clever idea over on the beyond3d forums, which was to apply tone mapping, resolve, then apply the inverse tone mapping operator to get an HDR value back. I tried it and it does work…at least as long as you’re using a tone mapping operator that’s easily invertible. The inverse operator can get really nasty in some cases, and won’t exist at all if you end up clamping to 1.0 as part of your tone mapping. Speaking of tone mapping…
  6. Crappy Tone Mapping – no more linear, no more Reinhard. Filmic tone mapping is awesome, and easy to integrate if you use the variant proposed by John Hable.
  7. No GI – I don’t think that we all need to go for real-time GI, since most games don’t need completely dynamic lighting or geometry. However it’s time to stop faking GI with ambient lights/fill lights/bounce lights/whatever. Having a good GI bake produces smoother, more realistic lighting that’s easier for artists to author. And if people do go for real-time GI techniques, I hope that they can do so without taking a severe drop in quality compared to offline solutions.
  8. Crappy Depth of Field and Motion Blur – both of these guys have a lot of room for improvement. Depth of field has gotten some recent attention from the heavy hitters like Crytek, DICE, and Epic, which has mostly focused on reproducing iris-shaped blur patterns for realistic bokeh effects. This has resulted in some promising research, but I think the jury is still out on the best way to approach the problem. There’s also the issue of foreground blur with proper transparency, which is still something of an elephant in the room. Motion blur, on the other hand, isn’t getting quite as much attention. Working only in screen space is extremely limiting for motion blur; in fact I would say it’s even more limiting than it is for depth of field. There have been attempts at supplementing screen-space approaches with fins or geometry stretching, but personally I’m still not satisfied with the results of my own experiments. What I’d really love to do is some sort of cheap, scaled-down version of stochastic rasterization combined with screen-space blurring to help remove the noise. Nvidia has some recent research in this area, so I’m holding out hope.
  9. Sprite-based Lens Flares – these looked cheesy when they first showed up in games, and they still do. Using FFT to perform convolutions in frequency space lets you convolve highlights with arbitrary (low-resolution) kernels, which is still limiting but in many ways looks a lot better than sprites. But what we really need is for someone to get this working in a game, so that we can all have awesome physically-based flares. :D
  10. Simple Fog – we’ve got shaders now…we don’t need to do the same old depth-based fog. The time has come to upgrade to a physically-based scattering model, such as this one.
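For item 5 above, the "tone map, resolve, then invert" idea can be sketched on the CPU as follows. This is only a minimal illustration: it assumes a simple x / (1 + x) operator because its inverse is trivial, and the function names are made up for this example rather than taken from any sample code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Simple operator chosen only because it is easy to invert.
// Maps HDR values in [0, inf) to [0, 1).
inline float ToneMap(float hdr) { return hdr / (1.0f + hdr); }

// Exact inverse of ToneMap for inputs in [0, 1).
inline float InverseToneMap(float ldr) {
    assert(ldr >= 0.0f && ldr < 1.0f);
    return ldr / (1.0f - ldr);
}

// Resolves the subsamples of one pixel: tone map each subsample,
// average the results, then map the average back to an HDR value.
inline float ResolveToneMapped(const std::vector<float>& subsamples) {
    assert(!subsamples.empty());
    float sum = 0.0f;
    for (float s : subsamples) sum += ToneMap(s);
    const float average = sum / static_cast<float>(subsamples.size());
    return InverseToneMap(average);
}
```

With one very bright subsample next to three dark ones, a plain average is dominated by the bright value, while this resolve stays close to what you would get by tone mapping each subsample and averaging afterwards.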

GPU Profiling in DX11 with Queries

For profiling GPU performance on the PC, there aren’t too many options. AMD’s GPU PerfStudio and Nvidia’s Parallel Nsight can be pretty handy due to their ability to query hardware performance counters and display the data, but they only work on each vendor’s respective hardware. You also might want to integrate some GPU performance numbers into your own internal profiling systems, in which case those tools aren’t going to be of much use.

To get around this, it’s possible to use D3D11 timestamp queries to get coarse-grained timing info for different parts of the frame. It’s a ways off from the kind of info you get from the vendor-specific tools, but it’s a lot better than nothing. It’s also pretty easy to implement. To profile a portion of your frame, you need a trio of ID3D11Query objects. Two of them need to have the type D3D11_QUERY_TIMESTAMP, and are used to get the GPU timestamp at the start and end of the block you want to profile. The third needs to have the type D3D11_QUERY_TIMESTAMP_DISJOINT, and it tells you whether your timestamps are invalid as well as the frequency used for converting from ticks to seconds. In practice it goes like this:

When starting a profiling block:

  • Call ID3D11DeviceContext::Begin and pass the DISJOINT query
  • Call ID3D11DeviceContext::End and pass the start TIMESTAMP query

When ending a profiling block:

  • Call ID3D11DeviceContext::End and pass the end TIMESTAMP query
  • Call ID3D11DeviceContext::End and pass the DISJOINT query

After waiting a sufficient amount of time for the queries to be ready:

  • Call ID3D11DeviceContext::GetData on all 3 queries
  • Compute the delta in ticks using the timestamps from both TIMESTAMP queries
  • Use the frequency from the DISJOINT query to convert the delta to a time in seconds
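The last three steps boil down to a little arithmetic once GetData has returned. Here is a minimal, portable sketch of that conversion. The `DisjointData` struct is a stand-in that mirrors the Frequency and Disjoint fields of D3D11_QUERY_DATA_TIMESTAMP_DISJOINT, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Stand-in for the data returned by the DISJOINT query (assumed layout for
// this example; the real struct is D3D11_QUERY_DATA_TIMESTAMP_DISJOINT).
struct DisjointData {
    std::uint64_t frequency; // GPU timestamp ticks per second
    bool disjoint;           // true means the timestamps cannot be trusted
};

// Converts a pair of timestamps into milliseconds.
// Returns false when the measurement should be thrown away.
inline bool ElapsedMilliseconds(std::uint64_t startTicks,
                                std::uint64_t endTicks,
                                const DisjointData& data,
                                double& outMs) {
    if (data.disjoint || data.frequency == 0 || endTicks < startTicks)
        return false;
    const std::uint64_t delta = endTicks - startTicks;
    outMs = static_cast<double>(delta) /
            static_cast<double>(data.frequency) * 1000.0;
    return true;
}
```

In a real profiler the two timestamps would come from GetData on the TIMESTAMP queries, and a `false` result simply means skipping that frame's sample.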

Like any query, you need to wait for the GPU to actually execute all of the commands you submitted for the data to be ready. In my sample app, I handle this by keeping an array of queries for each profile block and moving to the next one each frame. Then at the end of the frame, I get the data from the oldest query and use that for outputting the timing data to the screen. So the actual timing data lags behind by a few frames, but that’s okay for real-time profiling. For automated benchmarks or performance snapshots you could either use the data from N frames later, or you could just stall at the end of the frame and wait for the query to be ready.
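One way to organize the "array of queries, read the oldest one" bookkeeping is a small ring index. The class below is a sketch under my own assumptions (a fixed latency in frames and latency + 1 slots); it is not lifted from the sample app, and the names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Tracks which query slot to record into this frame and which slot is old
// enough to read back without stalling.
class QueryRing {
public:
    explicit QueryRing(std::size_t latencyFrames)
        : latency_(latencyFrames), frame_(0) {
        assert(latencyFrames > 0);
    }

    // One extra slot so the read slot never aliases the write slot.
    std::size_t SlotCount() const { return latency_ + 1; }

    // Slot whose queries are issued during the current frame.
    std::size_t WriteSlot() const { return frame_ % SlotCount(); }

    // False for the first few frames, before any slot is old enough.
    bool HasReadSlot() const { return frame_ >= latency_; }

    // Slot that was written latency_ frames ago (the oldest one in flight).
    std::size_t ReadSlot() const {
        assert(HasReadSlot());
        return (frame_ - latency_) % SlotCount();
    }

    void EndFrame() { ++frame_; }

private:
    std::size_t latency_;
    std::size_t frame_;
};
```

Each slot would hold the trio of query objects for a profile block; the displayed timing then lags by `latencyFrames` frames.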

Sample code and binaries are available on CodePlex: http://mjp.codeplex.com/releases/view/74987#DownloadId=292437

Average luminance calculation using a compute shader

A common part of most HDR rendering pipelines is some form of average luminance calculation. Typically it’s used to implement Reinhard’s method of image calibration, which is to map the geometric mean of luminance (log average) to some “key value”. This, combined with some time-based adaptation, allows for a reasonable approximation of auto-exposure or human eye adaptation.
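For reference, here is a small CPU-side sketch of the math involved: the log average, the key-value exposure scale, and a simple exponential adaptation. The small `delta` bias and the adaptation formula are common choices that I am assuming here, and all names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Geometric mean of luminance: exp(mean(log(delta + L))).
// delta is a small bias that keeps log() finite for black pixels.
inline float LogAverageLuminance(const std::vector<float>& luminance,
                                 float delta = 1e-4f) {
    assert(!luminance.empty());
    double sumLog = 0.0;
    for (float l : luminance) {
        const double biased = static_cast<double>(delta) + l;
        assert(biased > 0.0);
        sumLog += std::log(biased);
    }
    return static_cast<float>(
        std::exp(sumLog / static_cast<double>(luminance.size())));
}

// Scale factor that maps the average luminance to the chosen key value.
inline float ExposureScale(float averageLuminance, float keyValue) {
    assert(averageLuminance > 0.0f);
    return keyValue / averageLuminance;
}

// Moves the adapted luminance towards the target over time.
// tau is the adaptation time constant in seconds (assumed parameter).
inline float AdaptLuminance(float current, float target, float dt, float tau) {
    assert(tau > 0.0f);
    return current + (target - current) * (1.0f - std::exp(-dt / tau));
}
```

Multiplying each pixel's luminance by `ExposureScale(adapted, key)` before tone mapping gives the calibrated value.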

In the old days of DX9, the average luminance calculation was usually done by repeatedly downscaling a luminance texture as if generating mipmaps. Technically DX9 had support for automatic mipmap generation on the GPU, but support wasn’t guaranteed so to be safe you had to do it yourself. DX10 brought guaranteed support for mipmap generation of a variety of texture formats, making it viable for use with average luminance calculations. It’s obviously still there in DX11, and it’s still a very easy and pretty quick way to do it. On my AMD 6950 it only takes about 0.2ms to generate the full mip chain for a 1024×1024 luminance map. But with DX11 it’s not sexy unless you’re doing it with compute shaders, which means we need to ditch that one-line call to GenerateMips and replace it with some parallel reductions. Technically a parallel reduction should have far fewer memory reads/writes compared to generating successive mip levels, so there’s also some actual sound reasoning behind exploring that approach.

The DirectX SDK actually comes with a sample that implements average luminance calculation with a compute shader parallel reduction (HDRToneMappingCS11), but unfortunately their CS implementation actually performs a fair bit worse than their pixel shader implementation. A few people on gamedev.net had asked about this and I had said that it should definitely be possible to beat successive downscaling with a compute shader if you did it right, and used cs_5_0 rather than cs_4_0 like the sample. When it came up again today I decided to put my money where my mouth is and make a working example.

The implementation is really simple: render the scene in HDR, render log(luminance) to a 1024×1024 texture, downscale to 1×1 using either GenerateMips or a compute shader reduction, apply adaptation, then tone map (and add bloom). My first try was to do the reduction in 32×32 thread groups (giving the max of 1024 threads per thread group), where each thread sampled a single float from the input texture and stored it in shared memory. Then the reduction is done in shared memory using the techniques outlined in Nvidia’s CUDA parallel reduction whitepaper, which helps avoid shared memory bank conflicts. The first pass used a 32×32 dispatch which resulted in a 32×32 output texture, which was then reduced to 1×1 with one more 1×1 dispatch. Unfortunately this approach took about 0.3ms to complete, which was slower than the 0.2ms taken for generating mips.
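To make the shared-memory reduction concrete, here is a CPU emulation of what a single thread group does. It uses the sequential-addressing pattern (halving the stride each step), which is one of the variants described in the CUDA whitepaper; the function name is hypothetical and this is not the shader from the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates the reduction of one thread group's shared memory array.
// The array is taken by value because the reduction overwrites it in place.
inline float ReduceGroup(std::vector<float> shared) {
    const std::size_t n = shared.size();
    assert(n > 0 && (n & (n - 1)) == 0); // group size must be a power of two

    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        // On the GPU, each iteration of this inner loop is a separate thread,
        // and a group barrier sits between successive stride steps.
        for (std::size_t tid = 0; tid < stride; ++tid)
            shared[tid] += shared[tid + stride];
    }
    return shared[0]; // thread 0 writes this value to the output texture
}
```

For the 1024×1024 case, the first pass would run this once per 32×32 tile and the second pass would run it once over the 1024 partial sums; dividing by the texel count and taking exp() then gives the log average.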

For my second try, I decided to explicitly vectorize so that I could take better advantage of the vector ALU’s in my AMD GPU. I reconfigured the reduction compute shader to use 16×16 thread groups, and had each thread take 4 samples (forming a 2×2 grid) from the input texture and store them in a float4 in shared memory. Then the float4’s were summed in a parallel reduction, with the last step being to sum the 4 components of the final result. This approach required only 0.08ms for the reduction, meaning I hit my goal of beating out the mipmap generation. After all that work saving about 0.1ms doesn’t seem like a whole lot, but it’s worth it for the cool factor. The performance differential may also become more pronounced at higher resolution, or on hardware with less bandwidth available. I’m not sure how the compute shader will fare on Nvidia hardware since their GPU’s don’t use vector ALU’s, so it should be interesting to get some numbers. I’d suspect that shared memory access patterns are going to dominate over ALU cost anyway, so it could go either way.

The sample project is up for download here: http://mjp.codeplex.com/releases/view/71518#DownloadId=268633

I am officially a published author

I recently collaborated with fellow DX MVP’s Jason Zink and Jack Hoxley to write a D3D11-focused book entitled Practical Rendering and Computation with Direct3D 11, which just came up for sale on Amazon today. I wrote the HLSL and Deferred Rendering chapters in particular. All of the code samples are up on the Hieroglyph 3 CodePlex site, if you want to get an idea of the content. Or you can just take my word for it that it’s awesome. :P

Anamorphic lens flares: the lens flare of the 2010’s?

Since the dawn of time, the greatest struggle of the graphics programmer is to ensure that bright stuff looks really damn bright. We’re stuck with displays that have a limited displayable range, which means it’s fallen upon us to come up with new hacks and tricks to make sure the player at least feels like they’re blinded by the sun, even if we can’t really cause physical damage to their eyes (if only!).

In the early 2000’s, all we had to work with was the lens flare. It was cheesy, and it was everywhere. You probably still remember being bombarded with giant flare sprites whenever you looked up at the sun cresting over the mountains. If not, this should jog your memory:

A few years later, Tron 2.0 came out and brought with it some neat glow effects. Now all that glowy Tron stuff looked like it was glowing! But pretty soon we got crafty and realized that we didn’t just have to blur a glow value stored in the alpha channel, but we could do it to everything. And thus, bloom was born. With bloom not only could we make bright stuff look bright, but we could give the same treatment to the mid tones! And like the lens flare, it turned up everywhere. It got so bad, GameSpot gave it an award for being the most annoyingly overused effect. It wasn’t really unwarranted since games looked like this in 2004:

Fast forward to 2007, and Crysis bestows upon us the glory of god rays. It was at this point that we realized that merely blinding people with bloom was not enough…we needed to combine it with beams of light shining directly into their brains. Because every game should look like you’re walking through a thick, soupy haze!

Now in 2011, lens flares are making a startling comeback! I went to E3 on Tuesday, and the obvious effect du jour was anamorphic lens flares. Anamorphic lens flares are great, because they take a tiny little bright spot and stretch it across the whole damn screen. I mean it’s go big or go home, right? Now the player can’t possibly miss that tiny spot light in the corner, since we’re going to cover their whole damn screen with a giant-ass blue flare. High five!

On the subject of E3, I decided that I’m not above pimping our game on my blog. Rest assured, there is plenty of bloom. :P

Bokeh II: The Sequel

After I finished the bokeh sample, there were a few remaining issues that I wanted to tackle before I was ready to call it “totally awesome” and move on with my life.

Good blur – in the last sample I used either a 2-pass blur on a Poisson disc performed at full resolution, or a bilateral Gaussian blur performed at 1/4 resolution (both done in a pixel shader). The former is nice because it gives you variable filter width per-pixel, but you get some ugly noise-like artifacts due to insufficient sampling. Performance can really take a nose dive with too many samples, especially if your filter radius is very large. Doing two passes helps a lot, but gives you artifacts like the one pictured below:

The traditional “gaussian blur at 1/4 res” approach sucks even worse. This is because the lower resolution screws up the bilateral filtering, and performance also isn’t so great in a pixel shader due to the high amount of texture samples required. But worst of all, it just plain doesn’t look good due to using a lerp to blend between blurred and non-blurred pixels to simulate “in-focus” and “out-of-focus”. It ends up looking more like soft focus, rather than an image that’s gradually coming in or out of focus. On top of that you get aliasing artifacts from working at a lower res, which causes shimmering and swimming.

My solution to this problem was to implement a monstrous 21-tap separable bilateral blur in a compute shader. Wide, separable blur kernels are a pretty nice fit for compute shaders because you can store the texture samples in shared memory, which allows each thread to take a single texture sample rather than N. Shared memory isn’t that quick, but for larger kernels (10 pixels or so) the savings start to win out over a pixel shader implementation. With such a wide blur kernel, I could perform the blurring at full resolution in order to provide nice sharp edges when the foreground is in-focus and the background is out-of-focus. Here’s a similar image to the one above, this time artifact-free:

I ended up just using a box filter rather than a Gaussian, which normally gives you ugly box-shaped highlights on the bright spots. Fortunately the bokeh does a great job of extracting those points and replacing them with the bokeh shape. To avoid having to lerp between blurred and un-blurred versions of the image, I had the filter kernel reject samples outside of the CoC size of the pixel being operated on. This effectively gives you a variable-sized blur kernel per-pixel, which gives you much nicer transitions for objects moving in or out of focus. It doesn’t look quite as good as the transitions for the disc-based blur, but I think it’s a fair trade off. The image below shows what the kernel looks like as it transitions from in-focus to out-of-focus:
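As a rough CPU-side illustration of the rejection rule described above, here is one horizontal pass of a box blur where each output pixel only accepts taps that fall within its own CoC. This is a simplified sketch: it leaves out the bilateral depth test, works on a single float channel, and the function name and parameters are my own.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

// One pass of a separable box blur over a row of pixels.
// coc[i] is the CoC size (in pixels) of pixel i; radius 10 gives 21 taps.
inline std::vector<float> BlurRowWithCoC(const std::vector<float>& color,
                                         const std::vector<float>& coc,
                                         int radius) {
    assert(color.size() == coc.size());
    assert(radius >= 0);
    const int n = static_cast<int>(color.size());
    std::vector<float> out(color.size());
    for (int i = 0; i < n; ++i) {
        assert(coc[i] >= 0.0f);
        float sum = 0.0f;
        int count = 0;
        for (int o = -radius; o <= radius; ++o) {
            const int j = i + o;
            if (j < 0 || j >= n) continue;                     // off the row
            if (static_cast<float>(std::abs(o)) > coc[i]) continue; // outside this pixel's CoC
            sum += color[j];
            ++count;
        }
        // The center tap always passes, so count is at least 1.
        out[i] = sum / static_cast<float>(count);
    }
    return out;
}
```

A pixel with a CoC of 0 keeps only its center tap and stays sharp, while the kernel widens smoothly as the CoC grows, which is what replaces the lerp between blurred and un-blurred images.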


Out-of-focus foreground objects – as in most DOF implementations, foreground objects that were out-of-focus were blurred, but still had hard edges. This can look pretty bad, as the object blurring into the background is a key part of simulating the look. A simple way to rectify this issue is to store your CoC size or blurriness factor in a texture, then blur it in screen space. This is essentially the approach used by Infinity Ward for the past few Call of Duty games. I went with something similar, except like with the DOF blur I used a compute shader to do a 21-tap separable blur of the CoC texture. To make this look good you have to make sure you only gather samples that come from a lower depth and have a larger CoC than the pixel being operated on by that thread. Here’s an out-of-focus foreground object with and without the CoC spreading:

CoC Blur Off:

CoC Blur On:
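The gather rule for the CoC blur can be sketched like this on the CPU. How rejected taps are handled is my own assumption (they fall back to the center pixel's CoC so the average never shrinks), and the names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One pass of a separable CoC blur. A neighbor only contributes its CoC when
// it is closer to the camera AND blurrier than the pixel being processed.
inline std::vector<float> SpreadCoC(const std::vector<float>& coc,
                                    const std::vector<float>& depth,
                                    int radius) {
    assert(coc.size() == depth.size());
    assert(radius >= 0);
    const int n = static_cast<int>(coc.size());
    std::vector<float> out(coc.size());
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        int taps = 0;
        for (int o = -radius; o <= radius; ++o) {
            const int j = i + o;
            if (j < 0 || j >= n) continue;
            const bool spreads = depth[j] < depth[i] && coc[j] > coc[i];
            // Assumption: rejected taps reuse the center CoC.
            sum += spreads ? coc[j] : coc[i];
            ++taps;
        }
        out[i] = sum / static_cast<float>(taps);
    }
    return out;
}
```

With this rule a blurry foreground bleeds its CoC onto the sharp background behind it, but a blurry background never softens an in-focus foreground edge.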


No depth occlusion for bokeh sprites – I didn’t do any sort of occlusion testing for bokeh sprites in the sample, and waved it off by saying it would be simple to use the depth stencil buffer and enable depth testing. While that’s true, it turns out that doing it that way doesn’t really give you great results. The problem is that if a bokeh sprite is covering an area that’s totally out of focus, you don’t want those out-of-focus pixels occluding the bokeh sprites. Otherwise the bokeh just won’t blend in at all with the pixels that are blurred by the DOF blur pass. So instead I implemented my own depth occlusion function that attenuates based on depth, but removes the attenuation if the pixel is out-of-focus. This let me keep the same overall look, while removing cases where I was getting “halos” due to the bokeh sprites rendering on top of in-focus objects. And since I was doing it in the shader anyway, I threw in a “soft” occlusion function rather than a binary comparison.

Without occlusion:

With occlusion:
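The exact occlusion function isn’t spelled out here, so the following is only one plausible shape for it: a linear soft depth test whose attenuation is lerped away as the underlying pixel goes out of focus. The `softness` range and the 0-to-1 `sceneBlur` factor are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <cassert>

inline float Saturate(float v) { return std::min(std::max(v, 0.0f), 1.0f); }

// Returns a visibility factor in [0, 1] for a bokeh sprite over one pixel.
//   spriteDepth: depth of the point that spawned the sprite
//   sceneDepth:  depth of the scene pixel the sprite is drawn over
//   sceneBlur:   0 = scene pixel is in focus, 1 = fully out of focus
//   softness:    depth range over which the occlusion fades in
inline float BokehOcclusion(float spriteDepth, float sceneDepth,
                            float sceneBlur, float softness) {
    assert(softness > 0.0f);
    // 1 while the sprite is in front of the scene pixel, fading to 0 once it
    // is `softness` units behind it.
    const float visible =
        Saturate(1.0f - (spriteDepth - sceneDepth) / softness);
    // Out-of-focus scene pixels should not occlude, so blend towards 1.
    return visible + (1.0f - visible) * Saturate(sceneBlur);
}
```

The sprite color would be multiplied by this factor in the pixel shader before additive blending.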

Code and binaries are available here: http://mjp.codeplex.com/releases/view/64806

4/22/2011 – Fixed a texture sampling bug on Nvidia hardware

Crashes on Nvidia hardware

A few people have told me that my past two samples (Bokeh and RadiosityDX11) were crashing on Nvidia GPU’s, which I verified myself on my coworker’s GTX 470. The crash appears to be a driver bug, since it happens deep in the Nvidia runtime DLL on a worker thread and also because it works fine on AMD hardware and the REF device. This morning we managed to narrow it down to the shadow map filtering shader code (shader code can crash drivers apparently, who knew?), and I suspect that it’s the fact that the shader makes use of a SampleCmp with an integer offset. Commenting out the filtering and replacing it with a single SampleCmp seems to work, but I think using a regular texture coordinate offset might work as well. Anyone want to try putting this into “SampleShadowCascade” in Mesh.hlsl, and let me know if it works?

 

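[unroll(NumSamples)]
for (int y = -Radius; y <= Radius; y++)
{
    [unroll(NumSamples)]
    for (int x = -Radius; x <= Radius; x++)
    {
        // Offset the texture coordinate by whole texels, instead of passing
        // an integer offset to SampleCmp
        float2 offset = float2(x, y) * (1.0f / ShadowMapSize);
        float2 sampleCoord = shadowTexCoord + offset;
        // Hardware depth comparison against shadowDepth at the offset coordinate
        float sample = ShadowMap.SampleCmp(ShadowSampler, sampleCoord, shadowDepth).x;
        ...
    }
}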

How To Fake Bokeh (And Make It Look Pretty Good)

Before I bought a decent DSLR camera and started putting it in manual mode, I never really noticed bokeh that much. I always just equated out-of-focus with blur, and that was that. But now that I’ve started noticing, I can’t stop seeing it everywhere. And now every time I see depth of field effects in a game that doesn’t have bokeh, it just looks wrong. A disc blur or even Gaussian blur is fine for approximating the look of out-of-focus areas that are mostly low-frequency, but the hot spots just don’t look right at all (especially if you don’t do it in HDR).

So what are our options for getting a decent bokeh look in real time? Here’s a short (and by no means complete) list:

1. Render the scene multiple times and accumulate samples using the aperture shape – this is obviously a no-go for real-time rendering.

2. Stochastic rasterization, using scatter – this is becoming more feasible now that we can implement scatter in pixel or compute shaders, but requires a lot of samples to not look like crap (or maybe something like this) and would likely have performance problems due to the non-coherent memory writes.

3. Do scatter-as-gather as a post process – this is doable and could be implemented in a compute shader or even a pixel shader, but it’s expensive since you need a huge number of samples for large CoC sizes. Plus it can be tricky to implement, since you really need to be careful about energy conservation. Gamedev user FBMachine actually implemented this approach and documented some of the issues here…as you can see the results can be nice but the performance isn’t so great.

4. Render each pixel as a point sprite, using the CoC-size + aperture shape – this approach initially sounds completely unrealistic, since you’re talking about huge bandwidth usage from the blending and massive overdraw. However the guys who make 3DMark actually implemented some optimizations to make it usable for their “Deep Sea” demo in 3DMark11. They go into a little bit of detail in their white paper, but the basic gist of it is that they extract pixels with a CoC above a given threshold and append the point into an append/consume buffer (basically a stack that your shaders can push onto or pull from), then render the points as point sprites to one of several render targets. The render targets are successively smaller like mip levels, and they render the larger points to smaller render targets. By doing this they help curb the massive overdraw/blending cost of large points. They also do the point extraction several times, each time from a progressively downsampled version of the input image and with a different CoC threshold. I presume they do this to avoid extracting huge amounts of points.

The guys at Capcom also did something similar to this for the DX10 version of Lost Planet, although I’m not too familiar with the details. There are some pictures + descriptions here and here.

5. Pick out the bright spots, render those as point sprites using the aperture shape, and do everything else with a “traditional” blur-based DOF technique – ideally with this approach we get the nice bokeh effects for the parts where it’s really noticeable (the highlights), and use something cheap for everything else. Gamedev.net poster Hodgman took a crack at implementing this approach using point sprites and vertex textures, and documented his results in this thread. His main problems were due to flickering + instability, since he had to downscale many times in order to render the grid of point sprites.

For my own implementation, I decided to go for #5. I spent a lot of time staring at out-of-focus images, and decided that it really wasn’t necessary to do a more accurate bokeh simulation for most of the image. For instance, take a look at this picture:

At least 90% of that image doesn’t have a distinctive bokeh pattern, and looks very similar to either a box blur or disc blur with a wide radius. It’s only those bright spots that really need the full bokeh treatment for it to look convincing.

With that in mind, I came up with the following approach:

  1. Render the scene in HDR
  2. Do a full-res depth + blur generation pass where we sample the depth buffer, and write out linear depth + circle of confusion size to an R16G16_FLOAT buffer
  3. Do a full-res bokeh point extraction pass. For each pixel, compute the average brightness of the 5×5 block surrounding the pixel and compare it with the brightness of the current pixel. If the pixel brightness minus the average brightness is above a certain threshold and the CoC size is above a certain threshold, push the pixel position + color + CoC size onto an append buffer and output a value of 0.0 as the pixel color. If it doesn’t pass the threshold, output the input color.
  4. Do a regular DOF pass. I implemented two versions: one that does a full-res disc-based blur in two passes using 16 samples on a Poisson disc with radius == CoC size, and one that does ye olde 1/4-res Gaussian blur (with edge bleeding reduction) and a full-screen lerp between the un-blurred and blurred version.
  5. Copy the embedded count from the append buffer to another buffer, and use the second buffer as an indirect arguments buffer for a DrawInstancedIndirect call. Basically this lets us draw however many points are in the buffer, without having to copy anything back to the CPU. The vertex shader for each point then samples the position/color/CoC size from the append buffer and passes it to the geometry shader, which expands the point into a quad with size equal to the CoC size. The pixel shader then samples from a texture containing the aperture shape, and additively blends the result into an empty render target. The render target can be either full res, or 1/4 res to save on bandwidth.
  6. Combine the results of the bokeh pass with the results of the DOF pass by summing them together in a pixel shader.
  7. Pass the result to bloom + tone mapping, and output the image.
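Step 3 is the heart of the approach, so here is a minimal CPU sketch of the extraction test. It works on a single brightness channel, clamps the 5×5 window at the image edges, and uses a plain vector in place of the GPU append buffer; all of those choices, and the names, are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct BokehPoint {
    int x, y;
    float brightness;
    float coc;
};

// Extracts bokeh points from a width*height brightness image.
// outImage receives the input with every extracted pixel set to 0.
inline std::vector<BokehPoint> ExtractBokehPoints(
    const std::vector<float>& brightness, const std::vector<float>& coc,
    int width, int height, float brightnessThreshold, float cocThreshold,
    std::vector<float>& outImage) {
    assert(width > 0 && height > 0);
    assert(brightness.size() == static_cast<std::size_t>(width) * height);
    assert(coc.size() == brightness.size());

    outImage = brightness;
    std::vector<BokehPoint> points; // stands in for the append buffer
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Average brightness of the surrounding 5x5 block.
            float sum = 0.0f;
            for (int dy = -2; dy <= 2; ++dy) {
                for (int dx = -2; dx <= 2; ++dx) {
                    const int sx = std::min(std::max(x + dx, 0), width - 1);
                    const int sy = std::min(std::max(y + dy, 0), height - 1);
                    sum += brightness[sy * width + sx];
                }
            }
            const float average = sum / 25.0f;
            const std::size_t idx = static_cast<std::size_t>(y) * width + x;
            if (brightness[idx] - average > brightnessThreshold &&
                coc[idx] > cocThreshold) {
                points.push_back({x, y, brightness[idx], coc[idx]});
                outImage[idx] = 0.0f; // the sprite replaces this pixel
            }
        }
    }
    return points;
}
```

Setting the brightness threshold to 0 makes nearly every out-of-focus pixel above the CoC threshold a sprite, which is useful as a slow reference.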

I actually implemented everything with pixel shaders, since I find they’re still quicker for rapid prototyping compared to compute shaders. The bokeh generation step and Gaussian blurs probably would have benefited from using shared memory to avoid redundant texture samples, but not so much that the penalty is huge. The disc-based blur isn’t all that great of a fit either, since I used a very large sampling radius (usually at least 16 pixels). For the disc blur I also did it in two passes with 16 samples each, in order to avoid some of the nasty banding artifacts that come from using a large sampling radius. This leads to some artifacts around edges, but it’s too bad. Either way the DOF part isn’t really important, and you could swap it out with whatever cool new technique you want. I also didn’t end up using proper lens-based CoC-size calculations, since I found it was a pain in the ass to work with. So I reverted to a very simple linear interpolation between “out-of-focus” and “in-focus” distances, and then multiplied the value by a tweakable maximum CoC size.
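The simple linear CoC calculation might look something like the sketch below. It handles one focus range at a time (either the near or the far side), and the parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

inline float Saturate(float v) { return std::min(std::max(v, 0.0f), 1.0f); }

// Linear ramp from 0 at inFocusDist to maxCoC at outOfFocusDist.
// Works for the far range (outOfFocusDist > inFocusDist) as well as the
// near range (outOfFocusDist < inFocusDist).
inline float LinearCoC(float depth, float inFocusDist, float outOfFocusDist,
                       float maxCoC) {
    assert(inFocusDist != outOfFocusDist);
    const float blur =
        Saturate((depth - inFocusDist) / (outOfFocusDist - inFocusDist));
    return blur * maxCoC;
}
```

A full implementation would evaluate this once for the near range and once for the far range and keep the larger result.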

As for the bokeh itself, it looks pretty good since it’s using a texture and can have whatever shape you want. It’s also pretty stable since the extraction is done at full resolution, and so you don’t get much flickering or jumping around. I didn’t use depth testing when rendering the bokeh sprites…I had intended on doing it, but then decided it wasn’t really necessary. However I’d imagine it would probably be desirable if you wanted to render really large bokeh points, in which case it would be trivial to implement.

Now for some results. This is with the foreground in focus, and the background out of focus:

The bokeh isn’t too distinctive here since most of the image is in focus, but you can definitely see the hexagon pattern on some of the background geometry.

This one has the whole scene out of focus, and so you can see a lot more of the bokeh effect:

Now you can really see the bokeh! Here’s another with a circle-shaped bokeh:

This one is with the bokeh sprites rendered to a 1/4-resolution texture:


Finally, this one is with the brightness threshold set to 0. Basically this means that every out-of-focus pixel gets drawn as a point sprite, which I used as a sort of “reference” image.

I think the earlier shots hold up pretty well in comparison! The biggest issue that I notice is that it can look a bit weird if your DOF blur radius and your bokeh radius don’t match up. It starts to become pretty obvious if you crank up the maximum bokeh size, but still use a small radius for blurring everything else. This is because you don’t want to be able to clearly discern what’s “underneath” the bokeh sprites…you want it to pretty much look like a solid color. To help with this I added a parameter to tweak the falloff used for conserving energy as the bokeh point sprites get larger. Basically it does a pow on the falloff, which is computed by calculating the ratio of area of a circle with radius == CoC and comparing it with the radius of a single pixel. So by setting the falloff tweak to a lower value, the points are artificially brightened and appear more opaque.
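Here is one way the falloff tweak could be written down. I am assuming the reference is a circle with a radius of one pixel, so the area ratio reduces to 1 / CoC² (the π terms cancel); the real sample may define it differently, and the function name is made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Brightness scale for a bokeh sprite with the given CoC radius in pixels.
// falloffTweak == 1 conserves energy under the assumption above; lower values
// artificially brighten large sprites.
inline float BokehFalloff(float cocRadius, float falloffTweak) {
    assert(cocRadius > 0.0f && falloffTweak >= 0.0f);
    const float r = std::max(cocRadius, 1.0f); // never brighten tiny sprites
    const float areaRatio = 1.0f / (r * r);    // (pi * 1^2) / (pi * r^2)
    return std::pow(areaRatio, falloffTweak);
}
```

For a CoC radius of 2 pixels this gives 0.25 with a tweak of 1 and 0.5 with a tweak of 0.5, so the lower tweak makes the sprite twice as bright.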

If you want to check it out yourself, you can download the source code + binaries here: http://cid-538e432ea49f5bde.office.live.com/self.aspx/Public/Samples%20And%20Tutorials/DX11/Bokeh.zip

Updated (3/27/2011): Changed the shadow filtering shader code so that it doesn’t cause crashes on Nvidia hardware

Radiosity, DX11 Style

Radiosity isn’t exactly new. According to Wikipedia it’s been used for rendering since the early 80’s, and this page looks like it may have been the first web page on the Internet. The basic premise is dead simple: for each point where you want to bake lighting (typically either a texel in a lightmap, or a vertex in a mesh), render the rest of the scene and any exterior light sources (skydome, area lights, sun, whatever) in all directions within a hemisphere surrounding the surface normal at that point. As described in the article I linked, this is typically done by rendering 5 faces of a hemicube so as to play nice with a traditional perspective projection. It’s possible to use a parabolic projection for this (as in dual-paraboloid shadow mapping), but there are problems you can run into which are outlined here. Once you’ve fully rendered your hemisphere, you then integrate your computed radiance about the hemisphere with a cosine kernel to compute irradiance, which you can use to determine the diffuse reflectance for that point. You can then store this diffuse lighting value in your vertex or lightmap texel, and you’re ready to render your geometry fully lit by direct lighting. Typically you repeat the process many times, rendering the scene geometry as lit by the results of your previous iteration. This effectively allows you to compute lighting for each successive bounce off the scene geometry, adding an indirect lighting term. With enough iterations, you eventually begin to converge on a full global illumination solution for diffuse lighting.

The nice part about the technique is that it’s pretty simple to implement…if you can rasterize a mesh you’re already most of the way there, and you can even co-opt an existing real-time rendering engine to do it. Taking the latter approach has the added benefit that any material or lighting feature you add to your engine benefits your radiosity baker by default…so for instance if you have some complex pixel shader that blends multiple maps to determine the final material albedo, you don’t need to implement an equivalent solution in a ray tracer or photon mapper. You can even take advantage of your lighting and shadowing pipeline, if you do it right. The major downside is that it’s usually pretty slow, even if you implement it all on the GPU. This is because you typically have to serialize rendering of the scene for each vertex or lightmap texel, rather than baking many vertices/texels in parallel (which is possible with ray tracing implementations, particularly if you do it on the GPU using CUDA/OptiX).

Recently when I was trying to get myself familiar with GI techniques, I decided to implement my own radiosity baker (with a lot of help from my coworker Dave). However to make it cool and hip for the DX11 age, I deviated from the “standard” radiosity process in 3 ways:

  1. Rather than producing a single diffuse color per sample point, I baked down to a set of 1st-order H-basis coefficients representing irradiance about the hemisphere. This lets you use normal mapping with your baked values, which adds higher fidelity to your precomputed lighting. This is similar to what Valve, Bungie, and Epic do for their lightmaps, except I’m using a different basis. If you’re not familiar with H-basis, they’re similar to spherical harmonics except that they’re only defined on the upper hemisphere. This allows you to get better quality with fewer coefficients, for situations where you only need to store information about a hemisphere rather than a full sphere.
  2. Instead of baking direct lighting for all light sources, I bake direct + indirect lighting for a skydome and indirect lighting only for the sun. This is similar to what Naughty Dog does in Uncharted. The advantage is that you can add in the direct sun lighting at runtime using a directional light, and you get nice high-frequency visibility from your shadow maps. This lets you avoid having to use an area light or environment map for representing your sun, which can be difficult to tune if you’re used to traditional analytical directional light sources. Plus you can light your dynamic geometry the same way and have the lighting match, and also have their dynamic shadows only remove the direct lighting term. Another advantage is that your baked lighting term generally only contains low-frequency information, since it doesn’t need to represent high-frequency shadow visibility from the direct term. So if your scene is decently tessellated you can get away with computing it per-vertex, which is what I did.
  3. I used compute shaders for integrating the radiosity hemicube down to H-basis coefficients. This not only made the integration really really fast, but it let me keep everything on the GPU and avoid messing with CPU-GPU memory transfers.

Setup

To prepare for baking, the scene and all of its individual meshes are loaded and processed. As part of the processing, I calculate a per-vertex tangent frame using a modified version of this approach. The tangents are needed for normal mapping, but they’re also used as a frame of reference for baking each vertex. This is because I store H-basis irradiance in tangent space. Tangent space provides a natural frame of reference for the hemisphere about the normal, and is also consistent across the vertices of a triangle. This lets me interpolate the coefficients across a triangle, which wouldn’t be possible if each vertex used a different frame of reference during integration. It also allows for a simple irradiance lookup with the tangent space normal sampled from a normal map, or (0, 0, 1) if normal mapping isn’t used.

Baking

The baking loop looks something like this:

for each Iteration
    for each Mesh
        Extract vertex data
        for each Vertex
            for each Hemicube Face
                if 1st iteration
                    Render the scene, depth only
                    Render the skydome
                else if 2nd iteration
                    Render the scene with baked lighting + shadowed diffuse from the sun
                else
                    Render the scene with baked lighting
            Integrate
    Sum the result of current iteration with result of previous iteration

Basically we do N + 1 iterations, where N is the number of indirect bounces we want to factor in. For the first iteration we add in all direct light sources (the skydome), for the second we add in the bounce lighting from the first pass plus the indirect-only lighting (the sun), and in all subsequent passes we only render the scene geometry with baked lighting.

Vertex Baking

For each vertex, we need to determine the radiance emitted from each surface visible around the hemisphere. This hemisphere of radiance values is known as the field-radiance function. Determining the radiance value for any surface is simple: we just render the corresponding mesh and evaluate the BRDF in the pixel shader, which in our case means sampling the albedo texture and using it to modulate the diffuse lighting. Since we’re going to do it using rasterization, we’ll render to a hemicube for the reasons mentioned previously. To represent my hemicube, I used 5 R16G16B16A16_FLOAT render target textures storing HDR color values. To keep things simple I made them all equal-sized, and rendered each face as if it were a full cube map rather than a hemicube. However I used the scissor test to scissor out the half of the cube face that would not be needed, for all faces other than the first. Initially I used 256×256 textures for the render targets, but eventually lowered it to 64×64. Increasing the resolution does increase the quality slightly, but gains become diminishing very quickly past 64×64. This is because the irradiance integration filters out the high-frequency components, so any small details missed due to the small render target size have very little effect on the end result.
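To make the face setup a bit more concrete, here is a small Python sketch of how the 5 views and their scissor rectangles could be derived from a vertex’s tangent frame. This is only an illustration under a few assumptions that are not from the sample: the face ordering, the choice of up vectors, a top-left pixel origin, and the `hemicube_faces` name itself.

```python
def hemicube_faces(tangent, bitangent, normal, size=64):
    """Return (forward, up, scissor) for the 5 hemicube faces.

    scissor is (left, top, right, bottom) in pixels with a top-left origin.
    Face order and up vectors are assumptions made for this sketch.
    """
    def neg(v):
        return tuple(-c for c in v)

    full = (0, 0, size, size)
    # With up == normal, directions above the surface land in the top half
    # of the image, so that is the half a side face keeps.
    top_half = (0, 0, size, size // 2)
    return [
        (normal, bitangent, full),          # first face looks along the normal
        (tangent, normal, top_half),
        (neg(tangent), normal, top_half),
        (bitangent, normal, top_half),
        (neg(bitangent), normal, top_half),
    ]
```

Only the first face keeps all of its texels. The other four keep half each, so a 64×64 hemicube ends up with 3 full faces’ worth of useful texels.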

For the first pass, the scene is rendered with color writes disabled. This is because the mesh surfaces do not yet have incident lighting, and thus do not emit any radiance. Conceptually you can imagine this as though all light sources just began emitting light, and the light has yet to reach the mesh surfaces. So essentially we just render the mesh geometry to the depth buffer, in order to “block out” the sky and determine the overall visibility for that vertex. Once we’ve done this we render the skydome with a depth value of 1.0, so that any rendered geometry occludes it. Thus we “fill in” the rest of the hemicube texels with radiance values emitted by the skydome. For the skydome I used the CIE Clear Sky model, which is simple and easy to implement in a pixel shader. The final result in the hemicube textures looks like this:


For the second pass, we use the results of the first pass as the incident lighting for each surface pixel. This effectively causes the skydome lighting to “bounce” off the surface, adding indirect lighting. We also evaluate the diffuse contribution from the sun for each pixel, so that we get an indirect contribution from the sun as well. This contribution is calculated using a simple N (dot) L with the interpolated vertex normal and the sun direction. A shadow visibility term is also added using a shadow map, which is rendered as a low-resolution cascaded shadow map. Then the sum of the baked lighting and the sun light is modulated with the diffuse albedo, which is sampled from a texture. So the final exit radiance value for a pixel is computed like this:

radiance = (bakedLighting + sunLight * sunVisibility) * diffuseAlbedo
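Spelled out per channel, that formula could look like the following Python sketch. It assumes `sunLight` is the sun color scaled by a clamped N (dot) L; the function and parameter names are invented for the example.

```python
def exit_radiance(baked_lighting, sun_color, n_dot_l, sun_visibility, diffuse_albedo):
    """Exit radiance for one pixel in the second pass, using RGB tuples."""
    # Assumed form of the sun term: color scaled by a clamped N.L.
    sun_light = tuple(c * max(n_dot_l, 0.0) for c in sun_color)
    return tuple(
        (baked + sun * sun_visibility) * albedo
        for baked, sun, albedo in zip(baked_lighting, sun_light, diffuse_albedo)
    )
```

A fully shadowed pixel (`sun_visibility` of 0) only reflects the baked skydome lighting, which is the behavior you want from the shadow term here.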

After the scene is rendered, the skydome is omitted since its contribution was already handled in the first pass. Thus the final hemicube looks like this:


For all subsequent passes, only the baked vertex irradiance is used for computing the exit radiance of each pixel. This is because the contributions from both of our light sources have already been added in previous passes, and we only need to further compute indirect lighting terms.

Integration

Once we’ve rendered the scene to all 5 sides of the hemicube, we have a full field-radiance function for the hemisphere stored in a texture map. At this point we could now compute a full irradiance distribution function for the hemisphere, which would provide us with an irradiance value for any possible surface normal. Such a function would be computed by convolving our field radiance with a cosine kernel, which is done by evaluating the following integral:

I(p,N_{p})=\int_{\Omega}L(p,\omega_{i})(N_{p}\cdot\omega_{i})\,d\omega_{i}
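As a quick sanity check on that integral, here is a brute-force Python version that integrates over the hemisphere in spherical coordinates with the midpoint rule. It is just a numerical illustration (the `irradiance` helper is made up), with theta measured from the normal so that the dot product becomes cos(theta).

```python
import math

def irradiance(radiance, steps=128):
    """Midpoint-rule estimate of the cosine-weighted hemisphere integral.

    radiance(theta, phi) returns incoming radiance; theta is measured from
    the surface normal, so N . omega = cos(theta).
    """
    d_theta = (math.pi / 2) / steps
    d_phi = (2 * math.pi) / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * d_theta
        for j in range(steps):
            phi = (j + 0.5) * d_phi
            # d_omega = sin(theta) * d_theta * d_phi
            d_omega = math.sin(theta) * d_theta * d_phi
            total += radiance(theta, phi) * math.cos(theta) * d_omega
    return total

uniform_sky = irradiance(lambda theta, phi: 1.0)
```

For a uniform sky with radiance 1 the result converges to pi, which is the usual reason a factor of pi shows up when converting between radiance and irradiance for diffuse surfaces.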

Unfortunately, a full irradiance distribution function in a texture map isn’t all that useful since it’s too much data to store per-vertex. So instead we’ll represent the irradiance map using 2nd-order spherical harmonics, using the method outlined in the paper “An Efficient Representation for Irradiance Environment Maps”. The basic procedure is to first convert the radiance map to a spherical harmonic representation by integrating against the spherical harmonic basis functions, and then convolve the result with a cosine kernel to compute irradiance. The following integral is used for projecting onto SH:

L_{lm}=\int_{\theta=0}^{\pi}\int_{\phi=0}^{2\pi}L(\theta,\phi)Y_{lm}(\theta,\phi)\sin{\theta}\,d{\theta}\,d{\phi}

For radiance stored in a texture map, we can implement this integration by using the method described in Peter-Pike Sloan’s Stupid Spherical Harmonics Tricks. For our purposes we’ll modify the algorithm by first converting each texel’s SH radiance to irradiance by convolving with a cosine kernel, and then converting the SH coefficients to 1st-order H-basis representation. This allows us to sum 12 values per texel, rather than the 27 required for 2nd-order SH. The algorithm looks something like this:

for each Hemicube Face
    for each Texel
        Sample radiance
        Calculate direction vector for the texel
        Project the direction onto SH and convolve with cosine kernel
        Multiply SH coefficients by sampled radiance
        Convert from SH to H-basis
        Weight the coefficients by the differential solid angle for the texel
        Add the coefficients to a running sum
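The direction and solid-angle steps in the loop above can be sketched on the CPU like this. It is a Python illustration rather than the compute shader: the face ordering and the +Z-is-the-normal convention are assumptions, and the weight is the standard differential solid angle for a texel on a cube face.

```python
import math

# Unnormalized direction for face-local (u, v) in [-1, 1].
# Face 0 points along +Z (the normal); the rest are the four side faces.
FACE_DIRECTIONS = [
    lambda u, v: (u, v, 1.0),
    lambda u, v: (1.0, u, v),
    lambda u, v: (-1.0, u, v),
    lambda u, v: (u, 1.0, v),
    lambda u, v: (u, -1.0, v),
]

def hemicube_texels(size=64):
    """Yield (direction, solid_angle) for every texel kept in the hemicube."""
    texel_area = (2.0 / size) ** 2              # texel area on the unit cube face
    for to_direction in FACE_DIRECTIONS:
        for y in range(size):
            for x in range(size):
                u = (x + 0.5) / size * 2.0 - 1.0
                v = (y + 0.5) / size * 2.0 - 1.0
                dx, dy, dz = to_direction(u, v)
                if dz <= 0.0:
                    continue                    # the scissored-out half of a side face
                length_sq = dx * dx + dy * dy + dz * dz
                length = math.sqrt(length_sq)
                direction = (dx / length, dy / length, dz / length)
                # d_omega = texel area / distance^3
                yield direction, texel_area / (length_sq * length)

total_solid_angle = sum(weight for _, weight in hemicube_texels())
cosine_weighted = sum(d[2] * weight for d, weight in hemicube_texels())
```

The weights should add up to roughly 2 pi (the solid angle of a hemisphere), and the cosine-weighted sum to roughly pi, which makes for a handy check that none of the faces are being double counted.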

What this essentially boils down to is a bunch of per-texel math, followed by a sum of all results. Sounds like a job for compute shaders! The first part is simple, since the per-texel math operations are completely independent of one another. The second part is a bit tougher, since it requires a parallel reduction to be efficient. Essentially we need to efficiently share results between different threads in order to avoid heavy bandwidth usage, while properly exploiting the GPU’s massively parallel architecture by sharing the workload across multiple minimally divergent threads and thread groups. Basically it’s pretty simple to implement naively, and tricky to do it with good performance. Fortunately Nvidia has a bunch of data-parallel algorithms that are part of their CUDA SDK, and one of them happens to be a parallel reduction. I won’t go into the details, but their whitepaper outlines the basic process as well as a series of improvements that can be made to the naive algorithm in order to improve performance. These improvements are a mix of algorithmic and hardware-specific optimizations, and pretty much all of them are easily applicable to compute shaders.
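For anyone who hasn’t seen a parallel reduction before, here is the basic shape of it simulated sequentially in Python. This is a sketch of the stride-halving idea only, not my compute shader: the list stands in for group shared memory, and each pass of the outer loop stands in for one barrier-separated step.

```python
def tree_reduce(values):
    """Sum a power-of-two-length list the way a thread group would.

    In each step, "thread" i adds the element `stride` slots away into its
    own slot, halving the number of active threads every time.
    """
    shared = list(values)                 # stand-in for group shared memory
    count = len(shared)
    assert count > 0 and count & (count - 1) == 0, "length must be a power of two"
    stride = count // 2
    while stride > 0:
        for i in range(stride):           # these adds run in parallel on the GPU
            shared[i] += shared[i + stride]
        stride //= 2                      # a group barrier would go here
    return shared[0]
```

A row of 64 values takes 6 steps instead of 63 sequential adds, which is where the win comes from once the adds within a step actually run in parallel.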

My implementation ended up being 2 passes: the first performing the conversion to H-basis irradiance and reducing each row of each face texture to a single set of RGB coefficients, and the second reducing to only 1 set of RGB coefficients. In the first pass, the threads are dispatched in 1x64x5 thread groups, with each group containing 64x1x1 threads. The following diagram shows how the threads are distributed relative to the hemicube textures for the first 2 faces:

The projection onto SH and cosine kernel convolution can be implemented pretty easily in HLSL, using values taken from the irradiance environment mapping paper. My HLSL code looks like this:

void ProjectOntoSH(in float3 n, in float3 color, out float3 sh[9])
{
    // Cosine kernel (zonal convolution factors for bands 0, 1 and 2)
    const float A0 = 3.141593f;    // pi
    const float A1 = 2.094395f;    // 2 * pi / 3
    const float A2 = 0.785398f;    // pi / 4

    // Band 0
    sh[0] = 0.282095f * A0 * color;

    // Band 1
    sh[1] = 0.488603f * n.y * A1 * color;
    sh[2] = 0.488603f * n.z * A1 * color;
    sh[3] = 0.488603f * n.x * A1 * color;

    // Band 2
    sh[4] = 1.092548f * n.x * n.y * A2 * color;
    sh[5] = 1.092548f * n.y * n.z * A2 * color;
    sh[6] = 0.315392f * (3.0f * n.z * n.z - 1.0f) * A2 * color;
    sh[7] = 1.092548f * n.x * n.z * A2 * color;
    sh[8] = 0.546274f * (n.x * n.x - n.y * n.y) * A2 * color;
}
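The cosine kernel constants have closed forms (pi, 2 pi / 3 and pi / 4 for bands 0 through 2), and they can be reproduced numerically if you want to double-check them. The Python sketch below does this by integrating each band’s Legendre polynomial against the clamped cosine; the `cosine_kernel` helper is invented for the example.

```python
import math

def cosine_kernel(band, steps=10000):
    """Estimate 2 * pi * integral over [0, 1] of P_band(c) * c dc (midpoint rule).

    c is cos(theta), and P_band is the Legendre polynomial for that SH band.
    """
    legendre = [
        lambda c: 1.0,
        lambda c: c,
        lambda c: 0.5 * (3.0 * c * c - 1.0),
    ][band]
    dc = 1.0 / steps
    total = 0.0
    for i in range(steps):
        c = (i + 0.5) * dc
        total += legendre(c) * c * dc
    return 2.0 * math.pi * total

kernel = [cosine_kernel(band) for band in range(3)]
```

Rounded to six decimals these come out as 3.141593, 2.094395 and 0.785398.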

Converting that to H-basis is also simple, and is expressed as a matrix multiplication. The values for the transformation matrix are given in the source paper. This is the shader code that I used:

void ConvertToHBasis(in float3 sh[9], out float3 hBasis[4])
{
    const float rt2 = sqrt(2.0f);
    const float rt32 = sqrt(3.0f / 2.0f);
    const float rt52 = sqrt(5.0f / 2.0f);
    const float rt152 = sqrt(15.0f / 2.0f);
    const float convMatrix[4][9] =
    {
        { 1.0f / rt2, 0, 0.5f * rt32, 0, 0, 0, 0, 0, 0 },
        { 0, 1.0f / rt2, 0, 0, 0, (3.0f / 8.0f) * rt52, 0, 0, 0 },
        { 0, 0, 1.0f / (2.0f * rt2), 0, 0, 0, 0.25f * rt152, 0, 0 },
        { 0, 0, 0, 1.0f / rt2, 0, 0, 0, (3.0f / 8.0f) * rt52, 0 }
    };

    [unroll(4)]
    for(uint row = 0; row < 4; ++row)
    {
        hBasis[row] = 0.0f;

        [unroll(9)]
        for(uint col = 0; col < 9; ++col)
            hBasis[row] += convMatrix[row][col] * sh[col];
    }
}

After the first pass, we’re left with a single buffer containing 64x5x3 float4 values, where each consecutive set of 3 float4 values represents the sum of all RGB H-basis coefficients for that row. To reduce to a single set of coefficients, we dispatch a reduction pass containing 1x3x1 groups of 64x1x5 threads. With this setup each group sums all 64 of the R, G, or B coefficients for a particular hemicube face and stores the result in shared memory. Once this has completed, the first thread of each group sums the 5 values for each hemicube to produce a single set of H-basis coefficients. This last step is somewhat sub-optimal since only a single thread performs the work, however for summing only 5 values I didn’t think it was worth it to try anything fancy or split the reduction into another pass. The following diagram shows the thread layout:

The final result of this process is 3 sets of 4 H-basis coefficients (1 for each RGB channel) representing the irradiance across the hemisphere around the vertex normal, oriented in tangent space. After vertices are baked in this manner, I sum the vertex coefficients for each mesh with the results from the previous iteration in order to sum the bounces (which I do with a really simple compute shader). After the desired number of iterations, the coefficients are ready to be used at runtime and combined with direct sun lighting. Evaluating the H-basis coefficients to compute irradiance for a normal is pretty simple. I use the following code in my pixel shader, which takes a tangent space normal and the interpolated coefficients from the vertices:

float3 GetHBasisIrradiance(in float3 n, in float3 H0, in float3 H1, in float3 H2, in float3 H3)
{
    float3 color = 0.0f;

    // Band 0
    color += H0 * (1.0f / sqrt(2.0f * 3.14159f));

    // Band 1
    color += H1 * -sqrt(1.5f / 3.14159f) * n.y;
    color += H2 * sqrt(1.5f / 3.14159f) * (2 * n.z - 1.0f);
    color += H3 * -sqrt(1.5f / 3.14159f) * n.x;

    return color;
}
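One way to convince yourself that the three shader functions fit together is to run the whole chain on the CPU for a case with a known answer. The Python sketch below ports them for a single color channel and bakes a uniform sky of radiance 1 over the upper hemisphere. It makes a few assumptions that differ from the actual baker: it integrates in spherical coordinates instead of over a hemicube, it writes the cosine kernel in closed form, and `bake_constant_sky` is a made-up helper.

```python
import math

def project_onto_sh(n, radiance):
    """Scalar port of ProjectOntoSH: cosine-convolved SH weights times radiance."""
    a0, a1, a2 = math.pi, 2.0 * math.pi / 3.0, math.pi / 4.0
    x, y, z = n
    weights = (
        0.282095 * a0,
        0.488603 * y * a1, 0.488603 * z * a1, 0.488603 * x * a1,
        1.092548 * x * y * a2, 1.092548 * y * z * a2,
        0.315392 * (3.0 * z * z - 1.0) * a2,
        1.092548 * x * z * a2, 0.546274 * (x * x - y * y) * a2,
    )
    return [radiance * w for w in weights]

def convert_to_h_basis(sh):
    """Port of ConvertToHBasis, keeping only the non-zero matrix entries."""
    rt2 = math.sqrt(2.0)
    return [
        sh[0] / rt2 + 0.5 * math.sqrt(1.5) * sh[2],
        sh[1] / rt2 + 0.375 * math.sqrt(2.5) * sh[5],
        sh[2] / (2.0 * rt2) + 0.25 * math.sqrt(7.5) * sh[6],
        sh[3] / rt2 + 0.375 * math.sqrt(2.5) * sh[7],
    ]

def h_basis_irradiance(n, h):
    """Port of GetHBasisIrradiance for one color channel."""
    scale = math.sqrt(1.5 / math.pi)
    return (h[0] / math.sqrt(2.0 * math.pi)
            - h[1] * scale * n[1]
            + h[2] * scale * (2.0 * n[2] - 1.0)
            - h[3] * scale * n[0])

def bake_constant_sky(steps=128):
    """Integrate unit radiance over the upper hemisphere into H-basis."""
    d_theta = (math.pi / 2) / steps
    d_phi = (2 * math.pi) / steps
    h = [0.0] * 4
    for i in range(steps):
        theta = (i + 0.5) * d_theta
        for j in range(steps):
            phi = (j + 0.5) * d_phi
            n = (math.sin(theta) * math.cos(phi),
                 math.sin(theta) * math.sin(phi),
                 math.cos(theta))
            solid_angle = math.sin(theta) * d_theta * d_phi
            coefficients = convert_to_h_basis(project_onto_sh(n, 1.0))
            for index, value in enumerate(coefficients):
                h[index] += value * solid_angle
    return h

baked = bake_constant_sky()
straight_up = h_basis_irradiance((0.0, 0.0, 1.0), baked)
```

For this sky the exact irradiance for a normal n is pi * (1 + n.z) / 2, so a normal pointing straight up should give a value close to pi, and a tilted normal like (0.6, 0, 0.8) should give close to 0.9 * pi.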

Performance

To enable profiling with quick iteration, I made a very simple test scene containing a single mesh and 12,400 vertices. My initial implementation was pretty slow, baking only 545 vertices per second for the 1st pass, 292 vps for the 2nd pass, and close to 545 for all subsequent passes. For the first pass, I determined that the integration step was slowing things down considerably. Initially I had implemented integration using pixel shaders, which converted to H-basis and then reduced each hemicube face by a 1/4 each pass. This resulted in lots of unnecessary render target reads and writes, degrading performance. Moving to my current compute shader implementation brought the first pass to 1600 vps, and the second pass to 325 vps.

When I analyzed the second pass, GPU PerfStudio revealed that I was spending a significant amount of time in the geometry shader during the main rendering and shadow map rendering phases. I had used a geometry shader so that I could create the 5 hemicube faces as a texture array (or 4 shadow map cascades for the shadow map), and use SV_RenderTargetArrayIndex to specify the output array slice without having to switch render targets multiple times. I had known that this sort of geometry shader amplification performed poorly on DX10 hardware and had been hoping that it wouldn’t be so bad on my 5830, but unfortunately this was not the case. Ditching the geometry shader and setting the render target slices one by one brought me up to 1760 vps for the first pass and 480 vps for the second pass. Further performance was gained by switching the cascaded shadow map implementation to use an old-school texture atlas rather than a texture array, which brought me to 625 vps for the second pass. This was disappointing, since texture arrays are a totally natural and convenient way to implement cascaded shadow maps. Texture atlases are so DX9. Even after that the shadow map rendering was still really slowing down the 2nd pass, so I cut it down to 2 cascades (from 4) and reduced the resolution from 2048×2048 per cascade to 512×512. This got me to 850 vps for the test scene, about 600 vps for the broken tank scene from the SDK, and about 180 vps for the powerplant scene from the SDK. In its current state, the GPU spends a portion of each vertex bake idling due to processing so many commands and having multiple render target switches. It could definitely benefit from some reduction in the overall amount of API commands and state changes, and batching during the shadow map rendering. It would also probably benefit from using an approach similar to Ignacio’s, where the shadow map is only rendered once for a group of vertices.

Now for some pictures! These were all taken with only a single bounce, because I’m impatient.

Test scene: baked lighting only, baked lighting with normal maps, baked lighting + direct sunlight, baked light + direct with normal maps, final

Tank scene: baked only, baked with normal mapping, baked + direct, final, alternate final, another alternate final

Powerplant scene: direct only, baked only, baked + direct, baked w/o normal mapping, baked w/ normal mapping, alternate baked + direct, final

Source code and binaries are available here:

Part 1

Part 2

Updated (3/27/2011): Changed the shadow filtering shader code so that it doesn’t cause crashes on Nvidia hardware

Position From Depth in GLSL

Commenter “Me” was kind enough to share his GLSL implementation of a deferred point light shader, which makes use of one of the methods I previously posted for reconstructing position from depth. So I figured I’d post it here, for all of you unfortunate enough to be stuck with writing shaders in GLSL. :P

// deferred shading VERTEX (GEOMETRY)
varying vec3 normalv, posv;

void main( void ) {
    normalv = ( gl_NormalMatrix * gl_Normal ).xyz;
    posv = ( gl_ModelViewMatrix * gl_Vertex ).xyz;
    gl_Position = ftransform();
}

// deferred shading FRAGMENT (GEOMETRY)
varying vec3  normalv, posv;
uniform vec2  nfv;

void main( void ){
    gl_FragData[0] = vec4( normalize( normalv ) * 0.5 + 0.5, -posv.z / nfv.y );
}

// deferred shading VERTEX (LIGHTING: POINT)
varying vec3 posv;

void main( void ){
    posv = ( gl_ModelViewMatrix * gl_Vertex ).xyz;
    gl_Position = ftransform();
    gl_FrontColor = gl_Color;
}

// deferred shading FRAGMENT (LIGHTING: POINT)
varying vec3 posv;
uniform float lradius;
uniform vec3 lcenter;
uniform vec2 nfv, sic;
uniform sampler2D geotexture;

void main( void ){
    vec2 tcoord = gl_FragCoord.xy * sic;
    vec4 geometry = texture2D( geotexture, tcoord );
    vec3 viewray = vec3( posv.xy * ( -nfv.y / posv.z ), -nfv.y );
    vec3 vscoord = viewray * geometry.a;
    float dlight = length( lcenter - vscoord );
    float factor = 1.0 - dlight/lradius;
    if( dlight > lradius ) discard;
    gl_FragData[0] = vec4( gl_Color.rgb, factor );
}
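
If the reconstruction in the point light fragment shader looks like magic, here is the same math mirrored in Python with a round trip to show it recovers the original view-space position. It assumes `nfv.y` holds the far plane distance and that the G-buffer alpha stores -z / far, which is how the geometry pass above writes it; the function name is made up.

```python
def reconstruct_view_position(posv, stored_depth, far):
    """Mirror of the reconstruction in the point light fragment shader.

    posv: view-space position of the light volume fragment (z < 0 in front
    of the camera). stored_depth: -z / far, read from the G-buffer alpha.
    """
    scale = -far / posv[2]
    # Extend the ray through the fragment until it hits the far plane.
    viewray = (posv[0] * scale, posv[1] * scale, -far)
    return tuple(c * stored_depth for c in viewray)

far = 100.0
scene_point = (1.0, 2.0, -10.0)                 # what the geometry pass shaded
stored_depth = -scene_point[2] / far            # value written to alpha
light_fragment = (0.5, 1.0, -5.0)               # same view ray, nearer to the camera
recovered = reconstruct_view_position(light_fragment, stored_depth, far)
```

Any fragment of the light volume that lies on the same view ray as the scene point gives back the scene point, which is why the light geometry’s own depth doesn’t matter.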