Stairway To (Programmable Sample Point) Heaven

What Is and Should Never Be

Historically, the sub-pixel location of MSAA sample points was totally out of your control as a programmer. Typically the hardware used a fixed rotated-grid pattern that was the same for every pixel in your render target. For FEATURE_LEVEL_10_1, D3D added the concept of standard sample patterns that were required to be supported by the hardware. These were nice, in that you could specify the appropriate quality level and know exactly where the samples would be located. Without that, you either had to use shader intrinsics or resort to wackier methods for figuring out the sample positions. In theory this also gave you a very limited ability to control where sample points are located, but in practice the GPU vendors just made the standard multisample patterns the only patterns that they supported. So if you specified quality level 0, you got the same pattern as if you specified D3D11_STANDARD_MULTISAMPLE_PATTERN.

Fairly recently, we’ve finally seen the idea of programmable sample points getting some major attention. This has primarily come from Nvidia, who just added this functionality to their second-generation Maxwell architecture. You’ve probably seen some of the articles about their new “Multi-Frame Sampled Anti-Aliasing” (MFAA), which exploits the new hardware capabilities to jitter MSAA sample positions across alternating frames¹. The idea is that you can achieve a doubled effective sampling rate, as long as your resolve intelligently retrieves the subsamples from your previous frame. They also incorporate ideas from interleaved sampling, by varying their sample locations across a 2×2 quad instead of using the same pattern for all pixels. While the idea of yet another control panel AA setting probably won’t do more than elicit some collective groans from the graphics community, it should at least get us thinking about what we might do if provided with full access to this new functionality for ourselves. And now that Nvidia has added an OpenGL extension as well as a corresponding D3D11 extension to their proprietary NVAPI, we can finally try out our own ideas (unless you work on consoles, in which case you may have been experimenting with them already!).

As for AMD, they’ve actually supported some form of programmable sample points since as far back as R600, at least if the command buffer documentation is accurate (look for PA_SC_AA_SAMPLE_LOCS). Either way, Southern Islands certainly has support for varying sample locations across a 2×2 quad of pixels, which puts it on par with the functionality present in Maxwell 2.0. It’s a little strange to me that AMD hasn’t done much to promote this functionality in the way that Nvidia has, considering they’ve had it for so long. Currently there’s no way to access this feature through D3D, but they have had an OpenGL extension for a long time now (thanks to Matthäus Chajdas and Graham Sellers for pointing out the extension!).

I’m not particularly knowledgeable about Intel GPUs, and some quick searching didn’t return anything to indicate that they might be able to specify custom sample points. If anyone knows otherwise, then please let me know!

Update: Andrew Lauritzen has informed me via Twitter that Intel GPUs do in fact support custom sample locations! Thank you again, Andrew!

How Does It Work?

Before we get into use cases, let’s quickly go over how to actually work with programmable sample points. Since I usually only use D3D when working on PCs, I’m going to focus on the extensions for Maxwell GPUs that are exposed through NVAPI. If you look in the NVAPI headers, you’ll find a function for creating an extended rasterizer state, with a corresponding description structure that has some new members:

NvU32 ForcedSampleCount; 
bool ProgrammableSamplePositionsEnable;
bool InterleavedSamplingEnable; 
NvU8 SampleCount; 
NvU8 SamplePositionsX[16]; 
NvU8 SamplePositionsY[16]; 
bool ConservativeRasterEnable; 
bool PostZCoverageEnable; 
bool CoverageToColorEnable; 
NvU8 CoverageToColorRTIndex; 
NvU32 reserved[16];

There’s a few other goodies in there (like conservative rasterization, and Post-Z coverage!), but the members that we’re concerned with are ProgrammableSamplePositionsEnable, InterleavedSamplingEnable, SampleCount, SamplePositionsX, and SamplePositionsY. ProgrammableSamplePositionsEnable is self-explanatory: it enables the functionality. SampleCount is also pretty obvious: it’s the MSAA sample count that we’re using for rendering. SamplePositionsX and SamplePositionsY are pretty clear in terms of what they’re used for: they specify the X and Y coordinates of our MSAA sample points. What’s not clear at all is how the API interprets those values. My initial guess was that they should contain 8-bit fixed-point numbers where (0,0) is the top left of the pixel, and (255,255) is the bottom right. This was close, but not quite: they’re actually 4-bit fixed-point values that correspond to points in the D3D Sample Coordinate System. If you’re not familiar with this particular coordinate system (and you’re probably not), there’s a nice diagram in the documentation for the D3D11_STANDARD_MULTISAMPLE_QUALITY_LEVELS enumeration²:

D3D Sample Coordinate System

There’s also a bit more information in the documentation for EvaluateAttributeSnapped:

Only the least significant 4 bits of the first two components (U, V) of the pixel offset are used. 
The conversion from the 4-bit fixed point to float is as follows (MSB...LSB), 
where the MSB is both a part of the fraction and determines the sign:

1000 = -0.5f (-8 / 16)
1001 = -0.4375f (-7 / 16)
1010 = -0.375f (-6 / 16)
1011 = -0.3125f (-5 / 16)
1100 = -0.25f (-4 / 16)
1101 = -0.1875f (-3 / 16)
1110 = -0.125f (-2 / 16)
1111 = -0.0625f (-1 / 16)
0000 = 0.0f ( 0 / 16)
0001 = 0.0625f ( 1 / 16)
0010 = 0.125f ( 2 / 16)
0011 = 0.1875f ( 3 / 16)
0100 = 0.25f ( 4 / 16)
0101 = 0.3125f ( 5 / 16)
0110 = 0.375f ( 6 / 16)
0111 = 0.4375f ( 7 / 16)

So basically we have 16 buckets to work with in X and Y, with the ability to have sample points sit on the top or left edges, but not on the bottom or right edges. Now as for the NVAPI extension, it uses this same coordinate system, except that all values are positive. This means that there is no sign bit, unlike the values passed to EvaluateAttributeSnapped. Instead you can picture the coordinate system as ranging from 0 to 15, with 8 (0.5) corresponding to the center of the pixel:

NVAPI Sample Grid

Personally I like this coordinate system better, since having the pixel center at 0.5 is consistent with the coordinate system used for pixel shader positions and UV coordinates. It also means that going from a [0, 1] float to the fixed point representation is pretty simple:

rsDesc.SamplePositionsX[i] = uint8(Clamp(SamplePositions1x[i].x * 16.0f, 0.0f, 15.0f));

Now, what about that InterleavedSamplingEnable member that’s also part of the struct? This is a somewhat poorly-named parameter that controls whether you’re specifying the same sample positions for all 4 pixels in a 2×2 quad, or whether you’re specifying separate sample positions for each quad pixel. The API works such that the first N points correspond to the top left pixel, the next N correspond to the top right pixel, then the bottom left, followed by the bottom right. This means that for 2xMSAA you need to specify 8 coordinates, and for 4xMSAA you need to specify 16 coordinates. For 8xMSAA we would need to specify 32 coordinates, but unfortunately NVAPI only lets us specify 16 values. In this case, the 16 values correspond to the sample points in the top left and bottom left pixels of the quad.
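
To make that layout concrete, here’s a rough sketch of how you might fill out the extended rasterizer description for 4xMSAA with per-quad-pixel positions. This reuses the rsDesc from earlier, and quadSamplePositions is an assumed array of 16 [0, 1] positions (4 per quad pixel), not something provided by NVAPI:

// Sketch: 4xMSAA with different sample positions for each pixel of the 2x2 quad.
// quadSamplePositions is assumed to hold 16 positions in [0, 1] pixel space,
// 4 per quad pixel, ordered top left, top right, bottom left, bottom right.
rsDesc.ProgrammableSamplePositionsEnable = true;
rsDesc.InterleavedSamplingEnable = true;
rsDesc.SampleCount = 4;
for(uint32 quadPixel = 0; quadPixel < 4; ++quadPixel)
{
    for(uint32 sampleIdx = 0; sampleIdx < 4; ++sampleIdx)
    {
        const uint32 i = quadPixel * 4 + sampleIdx;
        rsDesc.SamplePositionsX[i] = uint8(Clamp(quadSamplePositions[i].x * 16.0f, 0.0f, 15.0f));
        rsDesc.SamplePositionsY[i] = uint8(Clamp(quadSamplePositions[i].y * 16.0f, 0.0f, 15.0f));
    }
}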

Case Study: Low-Resolution Rendering

So now that we have programmable sample points, what exactly should we do with them? MFAA-style temporal jittering is a pretty obvious use case, but surely there’s other ways to make use of this functionality. One idea I’ve been kicking around has been to use MSAA as a way to implement half-resolution rendering. Rendering particles at half resolution is really common in games, since blending many layers of particles can result in high costs from pixel shading and framebuffer operations. It can be an easy win in many cases, as long as the content you’re rendering won’t suffer too much from being shaded at half-rate and upsampled to full resolution. The main issue of course is that you can’t just naively upsample the result, since the half-resolution rasterization and depth testing will cause noticeably incorrect results around areas where the transparent geometry was occluded by opaque geometry. Here’s a comparison between half-resolution particle rendering with a naive bilinear upscale (left), and full-resolution rendering (right):


Notice how in some areas the low-resolution buffer has false occlusion, causing no particles to be visible for those pixels, while in other places the particles bleed onto the foreground geometry even though they should have been occluded. Perhaps the most obvious and common way to reduce these artifacts is to use some variation of a bilateral filter during the upscale, where the filtering weights are adjusted for pixels based on the difference between the low-resolution depth buffer and the full-resolution depth buffer. The idea behind doing this is that you’re going to have under- or over-occlusion artifacts in places where the low-resolution depth is a poor representation of your full-resolution depth, and so you should reject low-resolution samples where the depth values are divergent. For The Order: 1886, we used a variant of this approach called nearest-depth upsampling, which was originally presented in an Nvidia SDK sample called OpacityMapping. This particular algorithm requires that you take 4 low-resolution depth samples from a 2×2 bilinear footprint, and compare each one with the full-resolution depth value. If the depth values are all within a user-defined threshold, the low-resolution color target is bilinearly upsampled with no special weighting. However if any of the sample comparisons are greater than the threshold value, then the algorithm falls back to using only the low-resolution sample where the depth value was closest to the full-resolution depth value (hence the name “nearest-depth”).

Overall the technique works pretty well, and in most cases it can avoid the usual occlusion artifacts that are typical of low-resolution rendering. It can even be used in conjunction with MSAA, as long as you’re willing to run the filter for every subsample in the depth buffer. However it does have a few issues that we ran into over the course of development, and most of them stem from using depth buffer comparisons as a heuristic for occlusion changes in the transparent rendering. Before I list them, here’s an image that you can use to follow along (click for full resolution):


The first issue, illustrated in the top left corner, occurs when there are rapid changes in the depth buffer but no actual changes in the particle occlusion. In that image all of the particles are actually in front of the opaque geometry, and yet the nearest-depth algorithm still switches to point sampling due to depth buffer changes. It’s very noticeable in the pixels that overlap with the blue wall to the left. In this area there’s not even a discontinuity in the depth buffer, it’s just that the depth is changing quickly due to the oblique angle at which the wall is being viewed. The second issue, shown in the top right corner, is that the depth buffer contains no information about the actual edges of the triangles that were rasterized into the low-resolution render target. In this image I turned off the particle texture and disabled billboarding the quad towards the camera, and as a result you can clearly see a blurry, half-resolution stair-step pattern at the quad edges. The third issue, shown in the bottom left, occurs when alpha-to-coverage is used to get sub-pixel transparency. In this image 2xMSAA is used, and the grid mesh (the grey parallel lines) is only being rendered to the first subsample, which was done by outputting a custom coverage mask from the pixel shader. This interacts poorly with the nearest-depth upsampling, since 1 of the 2 high-res samples won’t have the depth from the grid mesh. This causes blocky artifacts to occur for the second subsample. Finally, in the bottom right we have what was the most noticeable issue for me personally: if the geometry is too small to be captured by the low-resolution depth buffer, then the upsample just fails completely. In practice this manifests as entire portions of the thin mesh “disappearing”, since they don’t correctly occlude the low-resolution transparents.
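
Before moving on, here’s roughly what the core of a nearest-depth upsample looks like in shader form. This is just a sketch with made-up resource names rather than the exact shader that we shipped, and it glosses over matching the 2×2 footprint to the exact bilinear footprint:

// Assumed resources for this sketch
Texture2D<float> FullResDepth;
Texture2D<float> LowResDepth;
Texture2D<float4> LowResColor;
SamplerState LinearSampler;

float4 NearestDepthUpsample(in uint2 fullResPos, in float2 lowResUV, in float depthThreshold)
{
    // The 2x2 footprint here should really match the bilinear footprint used by
    // SampleLevel below; the exact offsets are omitted to keep this short.
    static const int2 footprint[4] = { int2(0, 0), int2(1, 0), int2(0, 1), int2(1, 1) };

    float fullResDepth = FullResDepth[fullResPos];
    uint2 lowResPos = fullResPos / 2;

    float minDiff = 1.0e30f;
    uint nearestIdx = 0;
    bool withinThreshold = true;
    for(uint i = 0; i < 4; ++i)
    {
        float lowResDepth = LowResDepth[lowResPos + footprint[i]];
        float diff = abs(lowResDepth - fullResDepth);
        withinThreshold = withinThreshold && (diff <= depthThreshold);
        if(diff < minDiff)
        {
            minDiff = diff;
            nearestIdx = i;
        }
    }

    // If all of the low-res depths agree with the full-res depth, plain bilinear is fine
    float4 result = LowResColor.SampleLevel(LinearSampler, lowResUV, 0.0f);

    // Otherwise, fall back to the single low-res texel whose depth was nearest
    if(withinThreshold == false)
        result = LowResColor[lowResPos + footprint[nearestIdx]];

    return result;
}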

Ramble On

By now I hope you get the point: bilateral upsampling has issues for low-resolution rendering. So how can we use MSAA and custom sample points to do a better job? If you think about it, MSAA is actually perfect for what we want to accomplish: it runs rasterization and depth testing at a higher resolution than the render target, but the pixel shader still executes once per pixel. So if we use a half-resolution render target combined with 4x MSAA, then we should get exactly what we want: full-resolution coverage and depth testing, but half-resolution shading! This actually isn’t a new idea: Capcom tried a variant of this approach for the original Lost Planet on Xbox 360. For their implementation, which is outlined here in Japanese (English translation here), they were actually able to just alias their full-resolution render targets as half-resolution 4xMSAA render targets while they were still in eDRAM. This worked for them, although at the time it was specific to the Xbox 360 hardware and also had no filtering applied to the low-resolution result (due to there being no explicit upscaling step). However we now have much more flexible GPUs that give us generalized access to MSAA data, and my hunch was that we could improve upon this by using a more traditional “render in low-res and then upscale” approach. So I went ahead and implemented a sample program that demonstrates the technique, and so far it seems to actually work! I did have to work my way past a few issues, and so I’d like to go over the details in this article.

Here’s how the rendering steps were laid out every frame:

  1. Render Shadows
  2. Render Opaques (full res)
  3. Run pixel shader to convert from full-res depth to 4xMSAA depth
  4. Render particles at low-res with 4xMSAA, using ordered grid sample points
  5. Resolve low-res MSAA render target
  6. Upscale and composite low-res render target onto the full-res render target

The first two steps are completely normal, and don’t need explaining. For step #3, I used a full-screen pixel shader that only output SV_Depth, and executed at per-sample frequency. Its job was to look at the current subsample being shaded (provided by SV_SampleIndex), and use that to figure out which full-res depth pixel to sample from. The idea here is that 4 full-res depth samples need to be crammed into the 4 subsamples of the low-resolution depth buffer, so that all of the full-resolution depth values are fully represented in the low-res depth buffer:


If you look at the image, you’ll notice that the MSAA sample positions in the low-resolution render target are spaced out in a uniform grid, instead of the typical rotated grid pattern. This is where programmable sample points come in: by manually specifying the sample points, we can make sure that the low-resolution MSAA samples correspond to the exact same locations that they would have in the full-resolution render target. This is important if you want your low-resolution triangle edges to look the same as they would if rasterized at full resolution, and it also ensures correct depth testing at intersections with opaque geometry.
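
Putting step #3 together with that ordered-grid layout, the depth-transfer shader can be as simple as something like this. It’s a sketch using my own names, and it assumes subsample 0 maps to the top left full-res pixel, 1 to the top right, 2 to the bottom left, and 3 to the bottom right:

// Assumed full-res depth buffer for this sketch
Texture2D<float> FullResDepth;

// Runs at per-sample frequency on the half-res 4xMSAA depth target (forced by the
// SV_SampleIndex input), and copies the matching full-res depth texel into each subsample
float DepthTransferPS(in float4 screenPos : SV_Position,
                      in uint sampleIdx : SV_SampleIndex) : SV_Depth
{
    uint2 lowResPos = uint2(screenPos.xy);
    uint2 fullResPos = lowResPos * 2 + uint2(sampleIdx % 2, sampleIdx / 2);
    return FullResDepth[fullResPos];
}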

For step #4, we now render our transparent geometry to the half-resolution 4xMSAA render target. This is fairly straightforward, and in my implementation I just render a bunch of lit, billboarded particles with alpha blending. This is also where I apply the NVAPI rasterizer state containing the custom sample points, since this step is where rasterization and depth testing occurs. In my sample app you can toggle the programmable sample points on and off to see the effect (or rather, you can if you’re running on a GPU that supports it), although you probably wouldn’t notice anything with the default rendering settings. The issues are most noticeable with albedo maps and billboarding disabled, which lets you see the triangle edges very clearly. If you look at the pair of images below, the image on the left shows what happens when rasterizing with the 4x rotated grid pattern. The image on the right shows what it looks like when using our custom sample points, which are in a sub-pixel ordered grid pattern.
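
For reference, that ordered-grid pattern just puts each subsample where the corresponding full-res pixel center lands within the half-res pixel, which in the NVAPI 0-15 grid works out to roughly this (again reusing the rsDesc from earlier):

// Ordered grid for the half-res 4xMSAA target: 0.25 and 0.75 in [0, 1] pixel space
// map to 4 and 12 in the 4-bit grid. InterleavedSamplingEnable stays false here,
// since every pixel uses the same pattern.
const uint8 gridPositionsX[4] = { 4, 12, 4, 12 };
const uint8 gridPositionsY[4] = { 4, 4, 12, 12 };
rsDesc.ProgrammableSamplePositionsEnable = true;
rsDesc.InterleavedSamplingEnable = false;
rsDesc.SampleCount = 4;
for(uint32 i = 0; i < 4; ++i)
{
    rsDesc.SamplePositionsX[i] = gridPositionsX[i];
    rsDesc.SamplePositionsY[i] = gridPositionsY[i];
}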


For step #5, I use a custom resolve shader instead of the standard hardware resolve. I did this so that I can look at the 4 subsamples, and find cases where the subsamples are not all the same. When that happens, it means that there was a sub-pixel edge during rasterization, caused either by a transparent triangle edge or by the depth test failing. For these pixels I output a special sentinel value of -FP16Max, so that the upscale/composite step has knowledge of the sub-pixel edge.
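
In shader terms the resolve boils down to something like this sketch (the names here are my own, not the exact ones from the sample):

// Assumed MSAA render target for this sketch
Texture2DMS<float4> LowResTextureMSAA;
static const float FP16Max = 65504.0f;

// Average the 4 subsamples, but output a sentinel when they diverge so that the
// composite pass knows there was a sub-pixel edge for this pixel
float4 ResolvePS(in float4 screenPos : SV_Position) : SV_Target0
{
    uint2 pixelPos = uint2(screenPos.xy);

    float4 firstSample = LowResTextureMSAA.Load(pixelPos, 0);
    float4 sum = 0.0f;
    bool subPixelEdge = false;
    for(uint i = 0; i < 4; ++i)
    {
        float4 subSample = LowResTextureMSAA.Load(pixelPos, i);
        sum += subSample;
        if(any(subSample != firstSample))
            subPixelEdge = true;
    }

    float4 result = sum * 0.25f;
    if(subPixelEdge)
        result = -FP16Max;

    return result;
}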

In the last step, I run another full-screen pixel shader that samples the low-resolution render target and blends it on top of the full-resolution render target. The important part of this step is choosing how exactly to filter when upsampling. If we just use plain bilinear filtering, the results will be smooth but the transparents will incorrectly bleed over onto occluding opaque pixels. If we instead use the MSAA render target and just load the single subsample that corresponds to the pixel being shaded, the results will look “blocky” due to the point sampling. So we must choose between filtering and point sampling, just as we did with nearest-depth upsampling. In the resolve step we already detected sub-pixel edges, and so we can use that to determine when to switch to point sampling. This gets us most of the way there, but not all of the way. For bilinear filtering we’re going to sample a 2×2 neighborhood from the low-resolution texture, but our resolve step only detected edges within a single pixel. This means that if there’s an edge between two adjacent low-resolution pixels, then we’ll get a bleeding artifact if we sample across that edge. To show you what I mean, here’s an image showing low-resolution MSAA on the left, and full-resolution rendering on the right:


To detect this particular case, I decided to use an alpha comparison as a heuristic for changes in occlusion:

float4 msaaResult = LowResTextureMSAA.Load(lowResPos, lowResSampleIdx);
float4 filteredResult = LowResTexture.SampleLevel(LinearSampler, UV, 0.0f);

// The alpha difference acts as an occlusion-change heuristic. It also catches the
// -FP16Max sentinel from the resolve: if any texel in the bilinear footprint was
// flagged as a sub-pixel edge, the difference blows way past the threshold.
float4 result = filteredResult;
if(msaaResult.a - filteredResult.a > CompositeSubPixelThreshold)
    result = msaaResult;

This worked out pretty nicely, since it essentially let me roll the two tests into one: if the resolve sub-pixel test failed then it would output -FP16Max, which automatically results in a very large difference during the edge test in the composite step. Here’s an image where the left side shows low-resolution MSAA rendering with the improved edge detection, and the right side shows the “edge” pixels by coloring them bright green:


Before I show some results and comparisons, I’d like to touch on integration with MSAA for full-resolution rendering. With nearest-depth upsampling, MSAA is typically handled by running the upsample algorithm for every MSAA sub-sample. This is pretty trivial to implement by running your pixel shader at per-sample frequency, which automatically happens if you use SV_SampleIndex as an input to your pixel shader. This works well for most cases, although it still doesn’t help situations where sub-pixel features are completely missing in the low-resolution depth buffer. For the low-resolution MSAA technique that I’m proposing it’s also fairly trivial to integrate with 2x MSAA: you simply need to use 8xMSAA for your low-resolution rendering, and adjust your sample points so that you have 4 sets of 2x rotated grid patterns. Then during the upsample you can execute the pixel shader at per-sample frequency, and run the alpha comparison for each full-resolution subsample. Unfortunately we can’t handle 4x so easily, since 16xMSAA is not available on any hardware that I know of. I haven’t given it a lot of thought yet, but I think it should be doable with some quality loss by performing a bit of bilateral upsampling during the compositing step. On consoles it should also be possible to use EQAA with 16x coverage samples to get a little more information during the upsampling.


So now that we know that our MSAA trick can work for low-resolution rendering, how does it stack up against other techniques? In my sample app I also implemented full-resolution rendering as well as half-resolution with nearest-depth upsampling, and so we’ll use those as a basis for comparison. Full resolution is useful as a “ground truth” for quality comparisons, while nearest-depth upsampling is a good baseline for performance. So without further ado, here are some links to comparison images:

Normal: Full Res | Low Res – Nearest Depth | Low Res – MSAA
Sub-Pixel Geo: Full Res | Low Res – Nearest Depth | Low Res – MSAA
Sub-Pixel Geo – 2x MSAA: Full Res | Low Res – Nearest Depth | Low Res – MSAA
Triangle Edges: Full Res | Low Res – Nearest Depth | Low Res – MSAA
High Depth Gradient: Full Res | Low Res – Nearest Depth | Low Res – MSAA
Alpha To Coverage: Full Res | Low Res – Nearest Depth | Low Res – MSAA

In my opinion, the quality is a consistent improvement over standard low-res rendering with a nearest-depth upscale. The low-resolution MSAA technique holds up in all of the failure cases that I mentioned earlier, and is still capable of providing filtered results in areas where the particles aren’t being occluded (unlike the Lost Planet approach, which essentially upsampled with point filtering).

To evaluate performance, I gathered some GPU timings for each rendering step on my GeForce GTX 970. The GPU time was measured using timestamp queries, and I recorded the maximum time interval from the past 64 frames. These were all gathered while rendering at 1920×1080 resolution (which means 960×540 for half-resolution rendering) with 16 * 1024 particles, and no MSAA:

Render Mode | Depth Downscale | Particle Rendering | Resolve | Composite | Total
Full Res | N/A | 4.59ms | N/A | N/A | 4.59ms
Half Res – Nearest Depth | 0.04ms | 1.37ms | N/A | 0.31ms | 1.71ms
Half Res – MSAA | 0.13ms | 1.48ms | 0.23ms | 0.08ms | 1.92ms

On my hardware, the low-resolution MSAA technique holds up fairly well against the baseline of half-res with nearest-depth upsampling. The resolve adds a bit of time, although that particular timestamp was rather erratic and so I’m not 100% certain of its accuracy. One nice thing is that the composite step is now cheaper, since the shader is quite a bit simpler. In the interest of making sure that all of my cards are on the table, one thing that I should mention is that the particles in this sample don’t sample the depth buffer in the pixel shader. Sampling the depth buffer is common for implementing soft particles, as well as for computing volumetric effects by marching along the view ray. In those cases the pixel shader performance would likely suffer if it had to point sample an MSAA depth buffer, and so it would probably make sense to prepare a 1x depth buffer during the downscale phase. This would add some additional cost to the low-resolution MSAA technique, and so I should probably consider adding it to the sample at some point.

These numbers were gathered with 2xMSAA used for full-resolution rendering:

Render Mode | Depth Downscale | Particle Rendering | Resolve | Composite | Total
Full Res | N/A | 4.69ms | N/A | N/A | 4.69ms
Half Res – Nearest Depth | 0.06ms | 1.4ms | N/A | 0.38ms | 1.81ms
Half Res – MSAA | 0.25ms | 2.11ms | 0.24ms | 0.17ms | 2.74ms

Unfortunately the low-resolution MSAA technique doesn’t hold up quite as well in this case. The particle rendering gets quite a bit more expensive, and the cost of downscale as well as the composite both increase. It does seem odd for the rendering cost to increase so much, and I’m not sure that I can properly explain it. Perhaps there is a performance cliff when going from 4x to 8x MSAA on my particular GPU, or maybe there’s a bug in my implementation.

Bring It On Home

So what conclusions can we draw from all of this? At the very least, I would say that programmable sample points can certainly be useful, and that low-resolution MSAA is a potentially viable use case for the functionality. In hindsight it seems that my particular implementation isn’t necessarily the best way to show off the improvement that you get from using programmable sample points, since the alpha-blended particles don’t have any noticeable triangle edges by default. However it’s definitely relevant if you want to consider more general transparent geometry, or perhaps even rendering opaque geometry at half resolution. Either way, I would really like to see broader support for this functionality exposed in PC APIs. Unfortunately it’s not part of any Direct3D 12 feature set, so we’re currently stuck with Nvidia’s D3D11 or OpenGL extensions. It would be really great to see it get added to a future D3D12 revision, and have it be supported across Nvidia, Intel, and AMD hardware. Until that happens I suspect it will mostly get used by console developers.

If you would like to check out the code for yourself, I’ve posted it all on GitHub. If you’d just like to download and run the pre-compiled executable, I’ve also made a binary release available for download. If you look through the build instructions, you’ll notice that I didn’t include the NVAPI headers and libs in the repository. I made this choice because of the ominous-sounding license terms that are included in the header files, as well as the agreement that you need to accept before downloading it. This means that the code won’t compile by default unless you download NVAPI yourself, and place it in the appropriate directory. However if you just want to build it right away, there’s a preprocessor macro called “UseNVAPI_” at the top of LowResRendering.cpp that you can set to 0. If you do that then it won’t link against NVAPI, and the app will build without it. Of course if you do this, then you won’t be able to use the programmable sample point feature in the sample app. The feature will also be disabled if you aren’t running on a Maxwell 2.0 (or newer) Nvidia GPU, although everything else should work correctly on any hardware that supports FEATURE_LEVEL_11_0. Unfortunately I don’t have any non-Nvidia hardware to test on, so please let me know if you run into any issues.


[1] MSDN documentation for D3D11_STANDARD_MULTISAMPLE_QUALITY_LEVELS enumeration
[2] MSDN documentation for EvaluateAttributeSnapped
[3] Multi-Frame Sampled Anti-Aliasing (MFAA)
[4] Interleaved Sampling
[5] GL_NV_sample_locations
[7] Hybrid Reconstruction Anti-Aliasing
[8] GPU-Driven Rendering Pipelines
[9] AMD R600 3D Registers
[10] AMD Southern Islands 3D Registers
[11] AMD_sample_positions
[12] OpacityMapping White Paper
[13] High-Speed, Off-Screen Particles (GPU Gems 3)
[14] Lost Planet Tech Overview (english translation)
[15] Mixed Resolution Rendering
[16] Destiny: From Mythic Science Fiction to Rendering in Real-Time


1. Some of you might remember ATI’s old “Temporal AA” feature from their Catalyst Control Panel, which had a similar approach to MFAA in that it alternated sample patterns every frame. However unlike MFAA, it just relied on display persistence to combine samples from adjacent frames, instead of explicitly combining them in a shader. Return to text

2. If you’ve been working with D3D11 for a long time and you don’t recognize these diagrams, you’re not going crazy. The original version of the D3D11 docs were completely missing all of this information, which actually made the “standard” MSAA patterns somewhat useless. This is what the docs for D3D11_STANDARD_MULTISAMPLE_QUALITY_LEVELS look like in the old DirectX SDK documentation, and this is what EvaluateAttributeSnapped looked like. Return to text

SIGGRAPH Follow-Up: 2015 Edition

SIGGRAPH 2015 wrapped up just a few days ago, and it really was fantastic this year! There was tons of great content, and I got a chance to meet up with some of the best graphics programmers in the industry. I wanted to thank anyone that came to my talk at Advances in Real-Time Rendering, as well as anyone who came to our talk at the Physically Based Shading course. It’s always awesome to see so many people interested in the latest rendering technology, and the other presenters really knocked it out of the park in both courses. I’d also like to thank Natalya Tatarchuk for organizing the Advances course year after year. It’s really amazing when you look back on the 10 years’ worth of high-quality material that she’s assembled, and this year in particular was full of some really inspiring presentations. And of course I should thank Stephen Hill and Stephen McAuley as well, who also do a phenomenal job of cultivating top-notch material from both games and film.

For my Advances talk, you can find the slides on the company website. They should also be up on the Advances course site in the near future. If you haven’t seen it yet, the talk is primarily focused on the antialiasing of The Order, with a section about shadows at the end. There’s also some bonus slides about the decal system that I made for The Order, which uses deferred techniques to accumulate a few special-case decal types onto our forward-rendered geometry. For the antialiasing, the tech and techniques presented aren’t really anything I’d consider to be particularly novel or groundbreaking. However I wanted to give an overview of the problem space as well as our particular approach for handling aliasing, and I hope that came across in the presentation. One thing I really wanted to touch on more was that I firmly believe that we need to go much deeper if we really want to fix aliasing in games. Things like temporal AA and SMAA are fantastic in that they really do make things look a whole lot better, but they’re still fundamentally limited in several ways. On the other hand, just brute-forcing the problem by increasing sampling rates isn’t really a scalable solution in the long term. In some cases we’re also undersampling so badly that just having a 2x or 4x increase in our sampling rates isn’t going to even come close to fixing the issue. What I’d really like to see is more work into figuring out how to have smarter sampling patterns (no more screen-space uniform grids!), and also how to properly prefilter content so that we’re not always undersampling. This was actually something brought up by Marco Salvi in his excellent AA talk that was a part of the Open Problems in Real-Time Rendering course, which I was very happy to see. It was also really inspiring to see Alex Evans describe how he strived for filterable scene representations in his talk from the Advances course.

In case you missed it, I uploaded a full antialiasing code sample to GitHub to accompany the talk. The code is using my usual sample framework and coding style, which means you can grab it and build it with VS 2013 with no external dependencies. There’s also pre-compiled bins in the releases section, in case you would just like to view the app or play around with the shaders. The sample is essentially a successor to the MSAAFilter sample that I put out nearly 3 years ago, which accompanied a blog post where I shared some of my research on using higher-order filtering with MSAA resolves. The AA work in The Order is in many ways the natural conclusion of that work, and the new sample reflects that. If you load up the sample, you’ll notice that the default scene is a really terrible case for geometric and specular aliasing: it’s a rather high-polygon mesh, with lighting from both a directional light as well as from the environment. I like to evaluate flickering reduction by setting “Model Rotation Speed” to 1.0, which causes the scene to automatically rotate around its Y axis. The default settings are also fairly close to what we shipped with in The Order, although not exactly the same due to some PS4-specific tweaks. The demo also defaults to a 2x jitter pattern, which we didn’t use in The Order. One possible avenue that I never really explored was to experiment with more variation in the MSAA subsample patterns. This is something that you can do on PS4 (as demonstrated in Michal Drobot’s talk about HRAA in Far Cry 4), and you can also do it on recent Nvidia hardware using their proprietary NVAPI. A really simple thing to try would be to implement interleaved sampling, although it could potentially make the resolve shader more expensive.

As for the talk that Dave and I gave at Physically Based Shading, I hope that the images spoke for themselves in terms of showing how much better things looked once we made the switch from SH/H-basis to Spherical Gaussians. It was a very late and risky change for the project, but fortunately it paid off for us by substantially improving the overall visual quality. The nice thing is it’s pretty easy to understand why it looks better once we switched. Previously, we partitioned lighting response into diffuse and specular. We then took advantage of the response characteristics to store the input for both responses in two separate ways: for diffuse, we used high spatial resolution but with low angular resolution (SH lightmaps), while for specular we used low spatial resolution but with high angular resolution (sparse cubemap probes). By splitting specular into both low-frequency (high roughness) and high-frequency (low roughness) categories, we were able to use spatially dense sample points for a much broader range of surfaces. These surfaces with rougher specular then benefit from improved visibility/occlusion, which is usually the biggest issue with sparse cubemap probes. This obviously isn’t a new idea, in fact Halo 3 was doing similar things all the way back in 2008! The main difference of course is that we were able to use SG instead of SH, which gave us more flexibility in how we represented per-texel incoming radiance.

SGs can be a really useful tool for all kinds of things in graphics, and I think it would be great if we all added them to our toolbox. To aid with that, Dave Neubelt and Brian Karis are planning on putting out a helpful paper that can hopefully be to SGs what Stupid SH Tricks was to spherical harmonics. Dave and I have also been working on a code sample to release, which lets you switch between various methods for pre-computing both diffuse and specular lighting as well as a path-traced ground truth render. I’m hoping to finish this soon, since I’m sure it would be very helpful to have working code examples for the various SG operations.

Mitsuba Quick-Start Guide

Angelo Pesce’s recent blog post brought up a great point towards the end of the article: having a “ground truth” for comparison can be extremely important for evaluating your real-time techniques. For approximations like pre-integrated environment maps it can help visualize what kind of effect your approximation errors will have on a final rendered image, and in many other cases it can aid you in tracking down bugs in your implementation. On Twitter I advocated writing your own path tracer for such purposes, primarily because doing so can be an extremely educational experience. However not everyone has time to write their own path tracer, and then populate it with all of the required functionality. And even if you do have the time, it’s helpful to have a “ground truth for your ground truth”, so that you can make sure that you’re not making any subtle mistakes (which is quite easy to do with a path tracer!). To help in both of these situations, it’s really handy to have a battle-tested renderer at your disposal. For me, that renderer is Mitsuba.

Mitsuba is a free, open source (GPL), physically based renderer that implements a variety of material and volumetric scattering models, and also implements some of the latest and greatest integration techniques (bidirectional path tracing, photon mapping, Metropolis light transport, etc.). Since it’s primarily an academic research project, it doesn’t have all of the functionality and user-friendliness that you might get out of a production renderer like Arnold. That said, it can certainly handle many (if not all) of the cases that you would want to verify for real-time rendering, and any programmer shouldn’t have too much trouble figuring out the interface and file formats once he or she spends some time reading the documentation. It also features a plugin system for integrating new functionality, which could potentially be useful if you wanted to try out a custom material model but still make use of Mitsuba’s ray-tracing/sampling/integration framework.

To help get people up and running quickly, I’ve come up with a “quick-start” guide that can show you the basics of setting up a simple scene and viewing it with the Mitsuba renderer. It’s primarily aimed at fellow real-time graphics engineers who have never used Mitsuba before, so if you belong in that category then hopefully you’ll find it helpful! The guide will walk you through how to import a scene from .obj format into Mitsuba’s internal format, and then directly manipulate Mitsuba’s XML format to modify the scene properties. Editing XML by hand is obviously not an experience that makes anyone jump for joy, but I think it’s a decent way to familiarize yourself with their format. Once you’re familiar with how Mitsuba works, you can always write your own exporter that converts from your own format.

1. Getting Mitsuba

The latest version of Mitsuba is available on this page. If you’re running a 64-bit version of Windows like me, then you can go ahead and grab the 64-bit version which contains pre-compiled binaries. There are also Mac and Linux versions if either of those is your platform of choice, however I will be using the Windows version for this guide.

Once you’ve downloaded the zip file, go ahead and extract it to a folder of your choice. Inside of the folder you should have mtsgui.exe, which is the simple GUI version of the renderer that we’ll be using for this guide. There’s also a command-line version called mitsuba.exe, should you ever have a need for that.

While you’re on the Mitsuba website, I would also recommend downloading the PDF documentation into the same folder where you extracted Mitsuba. The docs contain the full specification for Mitsuba’s XML file format, general usage information, and documentation for the plugin API.

2. Importing a Simple Scene

Now that we have Mitsuba, we can get to work on importing a simple scene into Mitsuba’s format so that we can render it. The GUI front-end is capable of importing scenes from both COLLADA (*.dae) and Wavefront OBJ (*.obj) file formats, and for this guide we’re going to import a very simple scene from an OBJ file that was authored in Maya. If you’d like to follow along on your own, then you can grab the “TestScene.obj” file from the zip file that I’ve uploaded here. Our scene looks like this in Maya:


As you can see, it’s a very simple scene composed of a few primitive shapes arranged in a box-like setup. To keep things really simple with the export/import process, all of the meshes have their default shader assigned to them.

To import the scene into Mitsuba, we can now run mtsgui.exe and select File->Import from the menu bar. This will give you the following dialog:


Go ahead and click the top-left button to browse for the .obj file that you’d like to import. Once you’ve done this, it will automatically fill in paths for the target directory and target file that will contain the Mitsuba scene definition. Feel free to change those if you’d like to create the files elsewhere. There’s also an option that specifies whether any material colors and textures should be treated as being in sRGB or linear color space.

Once you hit “OK” to import the scene, you should now see our scene being rendered in the viewport:


What you’re seeing right now is the OpenGL realtime preview. The preview uses the GPU to render your scene with VPL approximations for GI, so that it can give you a rough idea of what your scene will look like once it’s actually rendered. Whenever you first open a scene you will get the preview mode, and you’ll also revert back to the preview mode whenever you move the camera.

Speaking of the camera, it uses a basic “arcball” system that’s pretty similar to what Maya uses. Hold the left mouse button and drag the pointer to rotate the camera around the focus point, hold the middle mouse button to pan the camera left/right/up/down, and hold the right mouse button to move the camera along its local Z axis (you can also use the mouse wheel for this).

3. Configuring and Rendering

Now that we have our scene imported, let’s try doing an actual render. First, click the button in the toolbar with the gear icon. It should bring up the following dialog, which lets you configure your rendering settings:


That first setting specifies which integrator you want to use for rendering. If you’re not familiar with the terminology being used here, an “integrator” is basically the overall rendering technique used for computing how much light is reflected back towards the camera for every pixel. If you’re not sure which technique to use, the path tracer is a good default choice. It makes use of unbiased Monte Carlo techniques to compute diffuse and specular reflectance from both direct and indirect light sources, which essentially means that if you increase the number of samples it will always converge on the “correct” result. The main downside is that it can generate noisy results for scenes where a majority of surfaces don’t have direct visibility of emissive light sources, since the paths are always traced starting at the camera. The bidirectional path tracer aims to improve on this by also tracing additional paths starting from the light sources. The regular path tracer also won’t handle volumetrics, and so you will need to switch to the volumetric path tracer if you ever want to experiment with that.

For a path tracer, the primary quality setting is the “Samples per pixel” option. This dictates how many samples to take for every pixel in the output image, and so you can effectively think of it as the amount of supersampling. Increasing it will reduce aliasing from the primary rays, and also reduce the variance in the results of computing reflectance off of the surfaces. Using more samples will of course increase the rendering time as well, so use it carefully. The “Sampler” option dictates the strategy used for generating the random samples that are used for Monte Carlo integration, which can also have a pretty large effect on the resulting variance. I would suggest reading through Physically Based Rendering if you’d like to learn more about the various strategies, but if you’re not sure then the “low discrepancy sampler” is a good default choice. Another important option is the “maximum depth” setting, which essentially lets you limit the renderer to using a fixed number of bounces. Setting it to 1 only gives you emissive surfaces and lights (light -> camera), setting it to 2 gives you emissive + direct lighting on all surfaces (light -> surface -> camera), setting it to 3 gives you emissive + direct lighting + 1 bounce of indirect lighting (light -> surface -> surface -> camera), and so on. The default value of -1 essentially causes the renderer to keep extending paths until they hit a light source, or until the transmittance back to the camera is below a particular threshold.

Once you’ve configured everything the way I did in the picture above, go ahead and hit OK to close the dialog. After that, press the big green “play” button in the toolbar to start the renderer. Once it starts, you’ll see an interactive view of the renderer completing the image one tile at a time. If you have a decent CPU it shouldn’t take more than 10-15 seconds to finish, at which point you should see this:


Congratulations, you now have a path-traced rendering of the scene!

4. The Scene File Format

Now that we have a basic scene rendering, it’s time to dig into the XML file for the scene and start customizing it. Go ahead and open up “TestScene.xml” in your favorite text editor, and have a look around. It should look like this:


If you scroll around a bit, you’ll see declarations for various elements of the scene. Probably what you’ll notice first is a bunch of “shape” declarations: these are the various meshes that make up the scene. Since we imported from .obj, Mitsuba automatically generated a binary file called “TestScene.serialized” from our .obj file containing the actual vertex and index data for our meshes, which is then referenced by the shapes. Mitsuba can also directly reference .obj or .ply files in a shape, which is convenient if you don’t want to go through Mitsuba’s import process. It also supports hair meshes, heightfields from an image file, and various primitive shapes (sphere, box, cylinder, rectangle, and disk). Note that shapes support transform properties as well as transform hierarchies, which you can use to position your meshes within the scene as you see fit. See section 8.1 of the documentation for a full description of all of the shape types, and their various properties.
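
A typical imported shape declaration looks something like this (the id and shape index will vary from mesh to mesh, and I’ve omitted the transform here):

<shape type="serialized" id="pCube1-mesh_0">
    <string name="filename" value="TestScene.serialized"/>
    <integer name="shapeIndex" value="0"/>

    <bsdf type="diffuse"/>
</shape>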

For each shape, you can see a “bsdf” property that specifies the BSDF type to use for shading the mesh. Currently all of the shapes are specifying that they should use the “diffuse” BSDF type, and that the BSDF should use default parameters. You might also notice there’s a separate bsdf declaration towards the top of the file, with an ID of “initialShadingGroup_material”. This comes from the default shader that Maya applies to all meshes, which is also reflected in the .mtl file that was generated along with the .obj file. This BSDF is not actually being used by any of the shapes in the scene, since they are all currently specifying that they just want the default “diffuse” BSDF. In the next section I’ll go over how we can create and modify materials, and then assign them to meshes.

If you scroll way down to the bottom, you’ll see the camera and sensor properties, which look like this:


You should immediately recognize some of your standard camera properties, such as the clip planes and FOV. Here you can also see an example of a transform property, which is using the “lookAt” method for specifying the transform. Mitsuba also supports specifying transforms as translation + rotation + scale, or directly specifying the transformation matrix. See section 6.1.6 of the documentation for more details.
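
For reference, a perspective sensor declaration looks roughly like this (all of the values here are just placeholders, not the ones generated for our scene):

<sensor type="perspective">
    <float name="fov" value="45"/>
    <float name="nearClip" value="0.1"/>
    <float name="farClip" value="1000"/>
    <transform name="toWorld">
        <lookat origin="0, 2, -10" target="0, 1, 0" up="0, 1, 0"/>
    </transform>

    <sampler type="ldsampler">
        <integer name="sampleCount" value="64"/>
    </sampler>

    <film type="hdrfilm">
        <integer name="width" value="1280"/>
        <integer name="height" value="720"/>
    </film>
</sensor>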

If you decide to manually update any of the properties in the scene, you can tell the GUI to re-load the scene from disk by clicking the button with the blue, circular arrow on it in the toolbar. Just be aware that if you save the file from the GUI app, it may overwrite some of your changes. So if you decide to set up a nice camera position in the XML file, make sure that you don’t move the camera in the app and then save over it!

5. Specifying Materials

Now let’s assign some materials to our meshes, so that we can start making our scene look interesting. As we saw previously, any particular shape can specify which BSDF model it should use as well as various properties of that BSDF. Currently, all of our meshes are using the “diffuse” BSDF, which implements a simple Lambertian diffuse model. There are many BSDF types available in Mitsuba, which you can read about in section 8.2 of the documentation. To start off, we’re going to use the “roughplastic” model for a few of our meshes. This model gives you a classic diffuse + specular combination, where the diffuse is Lambertian and the specular can use one of several microfacet models. It’s a good default choice for non-metals, and thus can work well for a wide variety of opaque materials. Let’s go down to about line 36 of our scene file, and make the following changes:


As you can see, we’ve added BSDF properties for 4 of our meshes. They’re all configured to use the “roughplastic” BSDF with a GGX distribution, a roughness of 0.1, and an IOR of 1.49. Unfortunately Mitsuba does not support specifying the F0 reflectance value for specular, and so we must specify the interior and exterior IOR instead (exterior IOR defaults to “air”, and so we can leave it at its default value). You can also see that I specified diffuse reflectance values for each shape, with a different color for each. For this I used the “srgb” property, which specifies that the color is in sRGB color space. You can also use the “rgb” property to specify linear values, or the “spectrum” property for spectral rendering.
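
In XML terms, each of those shapes picks up a BSDF block along these lines, nested inside its shape declaration (the reflectance color here is just one example; pick whatever colors you like):

<bsdf type="roughplastic">
    <string name="distribution" value="ggx"/>
    <float name="alpha" value="0.1"/>
    <float name="intIOR" value="1.49"/>

    <srgb name="diffuseReflectance" value="#b03a2e"/>
</bsdf>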

After making these changes, go ahead and click the “reload” button in Mitsuba followed by the “start” button to re-render the image. We should now get the following result:


Nice! Our results are noisier on the right side due to specular reflections from the sun, but we can now clearly see indirect specular in addition to indirect diffuse.

To simplify setting up materials and other properties, Mitsuba supports using references instead of directly specifying shape properties. To see how that works, let’s delete the “initialShadingGroup_material” BSDF declaration at line 11 and replace it with a new one that we will reference from the cylinder and torus meshes:
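
Something along these lines works (the id, roughness, and texture filename are just placeholders that I picked for this guide):

<!-- Declared near the top of the file, in place of initialShadingGroup_material -->
<bsdf type="roughplastic" id="SharedMaterial">
    <string name="distribution" value="ggx"/>
    <float name="alpha" value="0.2"/>

    <texture name="diffuseReflectance" type="bitmap">
        <string name="filename" value="Pattern.png"/>
    </texture>
</bsdf>

<!-- Inside the cylinder and torus shape declarations, in place of their inline bsdf -->
<ref id="SharedMaterial"/>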



If you look closely, you’ll see that for this new material I’m also using a texture for the diffuse reflectance. When setting the “texture” property to the “bitmap” type, you can tell Mitsuba to load an image file off disk. Note that Mitsuba also supports a few built-in procedural textures that you can use, such as a checkerboard and a grid. See section 8.3 for more details.

After refreshing, our render should now look like this:



To finish up with materials, let’s assign a more interesting material to the sphere in the back:


If we now re-render with a sample count of 256 to reduce the variance, we get this result:


6. Adding Emitters

Up until this point, we’ve been using Mitsuba’s default lighting environment for rendering. Mitsuba supports a variety of emitters that mostly fall into one of 3 categories: punctual lights, area lights, or environment emitters. Punctual lights are your typical point, spot, and directional lights that are considered to have an infinitesimally small area. Area lights are arbitrary meshes that uniformly emit light from their surface, and therefore must be used with a corresponding “shape” property. Environment emitters are infinitely distant sources that surround the entire scene, and can either use an HDR environment map, a procedural sun and sky model, or a constant value. For a full listing of all emitter types and their properties, consult section 8.8 of the documentation.

Now, let’s try adding an area light to our scene. Like I mentioned earlier, an area light emitter needs to be parented to a “shape” property that determines the actual 3D representation of the light source. While this shape could be an arbitrary triangle mesh if you’d like, it’s a lot easier to just use Mitsuba’s built-in primitive types instead. For our light source, we’ll use the “sphere” shape type so that we get a spherical area light source:
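
The declaration ends up looking something like this (the position, radius, and radiance values here are just examples):

<shape type="sphere">
    <point name="center" x="0" y="4" z="0"/>
    <float name="radius" value="0.5"/>

    <emitter type="area">
        <spectrum name="radiance" value="100"/>
    </emitter>
</shape>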


After refreshing, our scene now looks like this:


Notice how the sky and sun are now gone, since we now have an emitter defined in the scene. To replace the sky, let’s now try adding our own environment emitter that uses an environment map:
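
The declaration itself is quite simple, something like this:

<emitter type="envmap">
    <string name="filename" value="uffizi.exr"/>
    <float name="scale" value="1.0"/>
    <!-- An optional toWorld transform can be added here to rotate the environment -->
</emitter>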


The “uffizi.exr” file used here is an HDR light probe from USC’s high-resolution image probe gallery. Note that this emitter does not support cubemaps, and instead expects a 2D image that uses equirectangular mapping. Here’s what it looks like rendered with the path tracer, using a higher sample count of 256 samples per pixel:



7. Further Reading

At this point, you should hopefully understand the basics of how to use Mitsuba, and how to set up scenes in its XML file format. There’s obviously quite a bit of functionality that I didn’t cover, which you can read about in the documentation. If you’d like to know more about how Mitsuba works, I would very strongly recommend reading through Physically Based Rendering. Mitsuba is heavily based on pbrt (which is the open-source renderer described in the book), and the book does a fantastic job of explaining all of the relevant concepts. It’s also a must-have resource if you’d like to write your own path tracer, which is something that I would highly recommend to anybody working in real-time graphics.

Oh and just in case you missed it, here’s the link to the zip file containing the example Mitsuba scene:

Some Special Thanks

About a month ago, a little game called The Order: 1886 finally hit store shelves. Its release marks the culmination of my past 4 years at Ready At Dawn, which were largely devoted to developing the core rendering and engine technology that was ultimately used for the game. It’s also a major milestone for me personally, as it’s the first project that I’ve worked on full-time from start to finish. I’m of course immensely proud of what our team has managed to accomplish, and I feel tremendously lucky to go to work every day with such a talented and dedicated group of people.

My work wouldn’t have been possible without the help of many individuals both internal and external to our studio, but unfortunately there’s just too many people to list in a “special thanks” section of the credits. So instead of that, I’ve made my own special thanks section! But before you read it, please know that this list (like everything else on my blog) only represents my own personal feelings and experiences, and does not represent the rest of the team or my company as a whole. And now, without further ado…

From SCE:

  • Tobias Berghoff, Steven Tovey, Matteo Scapuzzi, Benjamin Segovia, Vince Diesi, Chris Carty, Nicolas Serres, and everyone else at the Advanced Technology Group. These fine folks have developed a fantastic suite of debugging and performance tools for the PS4, and thus played a huge part in making the PS4 the incredible development platform that it is.
  • Dave Simpson, Cort Stratton, Cedric Lallian, and everyone else at the ICE Team for making the best graphics API that I’ve ever used.
  • Geoff Audy, Elizabeth Baumel, Peter Young, and everyone else at Foster City and elsewhere that provided valuable tools and developer support.

From the graphics community:

  • Andrew Lauritzen for providing all of his valuable research on shadows and deferred rendering. His work with VSM and SDSM formed the basis of our shadowing tech in The Order, and his amazing presentation + demo on deferred rendering back in 2010 helped set the standard for state-of-the-art in rendering tech for games. Also I’ll go ahead and thank him preemptively for (hopefully) forgiving me for thanking him here, even though I promised him last year at GDC that I would thank him in the credits.
  • Stephen Hill, whose excellent articles and presentations have always inspired me to strive for the next level of quality when sharing my own work.
  • Steve McAuley, who along with Mr. Hill is responsible for cultivating what is arguably the best collection of material year after year: the physically based shading course at SIGGRAPH. I’m very thankful to them for inviting us to participate in 2013, and then helping myself and Dave to deliver our presentation and course notes.
  • Naty Hoffman, Timothy Lottes, Adrian Bentley, Sébastien Legarde, Johan Andersson, James McLaren, Brian Karis, Peter-Pike Sloan, Nathan Reed, Christer Ericson, Angelo Pesce, Michał Iwanicki, Christian Gyrling, Aras Pranckevičius, Michal Valient, Bart Wronski, Jasmin Patry, Michal Drobot, Jorge Jimenez, Padraic Hennessy, and anyone else who was kind enough to talk shop over a drink at GDC or SIGGRAPH.
  • Anybody who has ever given presentations, written articles, or otherwise contributed to the vast wealth of public knowledge concerning computer graphics. I’m really proud of the culture of openness and sharing among the graphics community, and also very grateful for it. We often stood on the shoulders of giants when creating the tech for our game, and we couldn’t have achieved what we did without drawing from the insights and techniques that were generously shared by other developers and researchers.

From Ready at Dawn:

  • Everyone that I worked with on the Tools and Engine team: Nick Blasingame, Joe Ferfecki, Gabriel Sassone, Sean Flynn, Simone Kulczycki, Scott Murray, Robin Green, Brett Dixon, Joe Schutte, David Neubelt, Alex Clark, Garret Foster, Tom Plunket, Aaron Halstead, Jeremy Nikolai, and Jamie Hayes. If you appreciate anything at all about the tech of The Order, then please think of these people! Also I need to give Garret an extra-special thanks for letting myself and Dave take all of the credit for his work on the material pipeline.
  • Our art director Nathan Phail-Liff, who not only directs a phenomenal group of artists but also had a major hand in shaping the direction of the tech that was developed for the project.
  • Anthony Vitale, who leads the best damn environment art team in the business. If you haven’t seen it yet, go check out their art dump on the polycount forums!
  • Ru Weerasuriya, Andrea Pessino, and Didier Malenfant for starting this amazing company, and for giving me an opportunity to work there.
  • Everyone else at Ready At Dawn that worked on the project with me. I don’t know if I can ever get used to the overwhelming amount of sheer talent and dedication that you can find at a game studio, and the people that I worked with had that in spades. I look forward to making many more beautiful things with them in the future!

If you read through all of this, I appreciate you taking the time to do so. It means a lot to me that the people listed above get their due credit, and so I hope that my humble little blog post has enough reach to give them some of the recognition that they rightfully deserve.

Shadow Sample Update

This past week a paper entitled “Moment Shadow Mapping” was released in advance of its planned publication at I3D in San Francisco. If you haven’t seen it yet, it presents a new method for achieving filterable shadow maps, with a promised improvement in quality compared to Variance Shadow Maps. Myself and many others were naturally intrigued, as filterable shadow maps are highly desirable for reducing various forms of aliasing. The paper primarily suggests two variants of the technique: one that directly stores the 1st, 2nd, 3rd, and 4th moments in a RGBA32_FLOAT texture, and another that uses an optimized quantization scheme (which essentially boils down to a 4×4 matrix transform) in order to use an RGBA16_UNORM texture instead. The first variant most likely isn’t immediately interesting for people working on games, since 128 bits per texel requires quite a bit of memory storage and bandwidth. It’s also the same storage cost as the highest-quality variant of EVSM (VSM with an exponential warp), which already provides high-quality filterable shadows with minimal light bleeding. So that really leaves us with the quantized 16-bit variant. Using 16-bit storage for EVSM results in more artifacts and increased light bleeding compared to the 32-bit version, so if MSM can provide better results, then it could potentially be useful for games.

I was eager to see the results myself, so I downloaded the sample app that the authors were kind enough to provide. Unfortunately their sample didn’t implement EVSM, and so I wasn’t able to perform any comparisons. However the implementation of MSM is very straightforward, and so I decided to just integrate it into my old shadows sample. I updated the corresponding blog post and re-uploaded the binary + source, so if you’d like to check it out for yourself then feel free to download it:

The MSM techniques can be found under the “Shadow Mode” setting. I implemented both the Hamburger and Hausdorff methods, which are available as two separate shadow modes. If you change the VSM/MSM format from 32-bit to 16-bit, then the optimized quantization scheme will be used when converting from a depth map to a moment shadow map.

Unfortunately, my initial findings are rather mixed. The 32-bit variant of MSM seems to provide quality that’s pretty close to the 32-bit variant of EVSM, with slightly worse performance. Both techniques are mostly free of light bleeding, but still exhibit bleeding artifacts for the more extreme cases. As for the 16-bit variant, it again has quality that’s very similar to EVSM with equivalent bit depth. Both techniques require increasing the bias when using 16-bit storage in order to reduce precision artifacts, which in turn leads to increased bleeding artifacts. Overall MSM seems to behave a little bit better with regards to light leaking, which does make some sense considering that with EVSM you’re forced to use a lower exponent scale in order to prevent overflow. For either technique you can reduce the leaking using the standard VSM bleeding reduction technique, which essentially just remaps the output range of the shadow visibility term. Doing this can remove or reduce the bleeding artifacts, but will also result in over-darkening. Of course it’s also possible that I made a mistake when implementing MSM in my sample app, however the bleeding artifacts are also quite noticeable in the sample app provided by the authors.
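For reference, that bleeding reduction step amounts to a simple remap of the visibility term, along the lines of this sketch (the function and parameter names here are mine, not taken from the sample):

// Remaps the lower range of the visibility term to zero in order to hide light
// bleeding. Values below the reduction amount are treated as fully shadowed,
// which is why this fix also causes over-darkening in legitimate penumbrae.
float ReduceLightBleeding(float visibility, float reductionAmount)
{
    return saturate((visibility - reductionAmount) / (1.0f - reductionAmount));
}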

To finish up, here are some screenshots that I took that show an example of light bleeding. The EVSM images all use the 4-component variant, and the MSM images all use the 4-moment Hamburger variant. For the images with the bleeding “fix”, they use a reduction factor of 0.25. In all cases the shadow map resolution is 2048×2048, with 4xMSAA, 16x anisotropic filtering, and mipmaps enabled for EVSM and MSM.

MSM Comparison Grid

Finally, here’s a few more images from an area with an even worse case for light bleeding:

MSM Comparison Grid 2


Come see me talk at GDC 2014

Myself and fellow lead graphics programmer David Neubelt will be at GDC next week, talking about the rendering technology behind The Order: 1886. Unfortunately the talk came together a bit late, and so it initially started from the talk that we gave back at SIGGRAPH last year (which is why it has the same title). However we don’t want to just rehash the same material, so we’ve added tons of new slides and revamped the old ones. The resulting presentation has much more of an engineering focus, and will cover a lot of the nuts and bolts behind our engine and the material pipeline. Some of the new things we’ll be covering include:

  • Dynamic lighting
  • Baked diffuse and specular lighting
  • Baked and real-time ambient occlusion
  • Our shader system, and how it interacts with our material pipeline
  • Details of how we perform compositing in our build pipeline
  • Hair shading
  • Practical implementation issues with our shading model
  • Performance and memory statistics
  • Several breakdowns showing how various rendering techniques combine to form the final image
  • At least one new funny picture

We want the presentation to be fresh and informative even if you saw our SIGGRAPH presentation, and I’m pretty sure that we have enough new material to ensure that it will be. So if you’re interested, come by at 5:00 PM on Wednesday. If you can’t make it, I’m going to try to make sure that we have the slides up for download as soon as possible.

Weighted Blended Order-Independent Transparency

Back in December, Morgan McGuire and Louis Bavoil published a paper called Weighted Blended Order-Independent Transparency. In case you haven’t read it yet (you really should!), it proposes an OIT scheme that uses a weighted blend of all surfaces that overlap a given pixel. In other words finalColor = w0 * c0 + w1 * c1 + w2 * c2…etc. With a weighted blend the order of rendering no longer matters, which frees you from the never-ending nightmare of sorting. You can actually achieve results that are very close to a traditional sorted alpha blend, as long as your per-surface weights are carefully chosen. Obviously it’s that last part that makes it tricky; consequently, McGuire and Bavoil’s primary contribution is proposing a weighting function that’s based on the view-space depth of a given surface. The reasoning behind using a depth-based weighting function is intuitive: closer surfaces obscure the surfaces behind them, so the closer surfaces should be weighted higher when blending. In practice the implementation is really simple: in the pixel shader you compute color, opacity, and a weight value based on both depth and opacity. You then output float4(color * opacity, opacity) * weight to one render target, while also outputting the weight alone to a second render target (the first RT needs to be fp16 RGBA for HDR, but the second can just be R8_UNORM or R16_UNORM). Both render targets require special blending modes, however they can both be expressed with the standard fixed-function blending available in GL/DX. After rendering all of your transparents, you then perform a full-screen “resolve” pass where you normalize the weights and then blend with the opaque surfaces underneath the transparents. Obviously this is really appealing since you completely remove any dependency on the ordering of draw calls, and you don’t need to build per-pixel lists or anything like that (which is nice for us mortals who don’t have pixel sync). The downside is that you’re at the mercy of your weighting function, and you potentially open yourself up to new kinds of artifacts depending on what sort of weighting function is used.
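To make the accumulation step concrete, here’s a rough sketch of what the transparency pixel shader might look like. The weight function below is just one illustrative depth-based falloff (the paper proposes several specific equations), and the input/output names are assumptions rather than anything from the paper:

// Outputs for the two render targets described above. Both use additive
// (ONE, ONE) fixed-function blending.
struct OITOutput
{
    float4 Accum  : SV_Target0;   // fp16 RGBA: weighted premultiplied color + weighted opacity
    float  Weight : SV_Target1;   // accumulated weight, used for normalization in the resolve
};

OITOutput TransparencyPS(in float4 surfaceColor : COLOR0, in float viewSpaceDepth : VIEWDEPTH)
{
    // Illustrative weight: scaled by opacity, falling off with view-space depth
    // so that closer surfaces dominate the blend
    float weight = surfaceColor.a * saturate(100.0f / (1e-5f + viewSpaceDepth * viewSpaceDepth));

    OITOutput output;
    output.Accum  = float4(surfaceColor.rgb * surfaceColor.a, surfaceColor.a) * weight;
    output.Weight = weight;
    return output;
}

The full-screen resolve then divides the accumulated color by the accumulated weight and composites the normalized result over the opaque layer.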

When the paper came out I read it and was naturally interested, so I quickly hacked together a sample project using another project as a base. Unfortunately over the past 2 months there’s been holidays, the flu, and several weeks of long hours at work so that we could finish up a major milestone. So while I’ve had time to visit my family and optimize our engine for PS4, I haven’t really had any time to come up with a proper sample app that really lets you explore the BOIT technique in a variety of scenarios. However I really hate not having source code and a working sample app to go with papers, so I’m going to release it so that others at least have something they can use for evaluating the proposed algorithm. Hopefully it’s useful, despite how bad the test scene is. Basically it’s just a simple Cornell box-like scene made up of a few walls, a sphere, a cylinder, a torus, and a sky (I normally use it for GI testing), but I added the ability to toggle through 2 alternative albedo maps: a smoke texture, and a tree texture. It doesn’t look great, but it’s enough to get a few layers of transparency with varying lighting conditions:


The sample is based on another project I’ve been working on for quite some time with my fellow graphics programmer David Neubelt, where we’ve been exploring new techniques for baking GI into lightmaps. For that project I had written a simple multithreaded ray-tracer using Embree 2.0 (which is an awesome library, and I highly recommend it), so I re-purposed it into a ground-truth renderer for this sample. You can toggle it on and off to see what the scene would look like with perfect sorting, which is useful for evaluating the “correctness” of the BOIT algorithm. It’s very fast on my mighty 3.6GHz Core i7, but it might chug a bit for those of you running on mobile CPU’s. If that’s the case, I apologize; however I made sure that all of the UI and controls are decoupled from the ray-tracing framerate so that the app remains responsive.

I’d love to do a more thorough write-up that really goes into depth on the advantages and disadvantages in multiple scenarios, but I’m afraid I just don’t have the time for it at the moment. So instead I’ll just share some quick thoughts and screenshots:

It’s pretty good for surfaces with low to medium opacity – with the smoke texture applied, it actually achieves decent results. The biggest issues are where there’s a large difference in the lighting intensity between two overlapping surfaces, which makes sense since this also applies to improperly sorted surfaces rendered with traditional alpha blending. Top image is with Blended OIT, bottom image is ground truth:


If you look at the area where the closer, brighter surface overlaps the darker surface on the cylinder you can see an example of where the results differ from the ground-truth render. Fortunately the depth weighting produces results that don’t look immediately “wrong”, which is certainly a big step up from unsorted alpha blending. Here’s another image of the test scene with default albedo maps, with an overall opacity of 0.25:


The technique fails for surfaces with high opacity – one case that the algorithm has trouble with is surfaces with opacity = 1.0. Since it uses a weighted blend, the weight of the closest surface has to be incredibly high relative to any other surfaces in order for it to appear opaque. Here’s the test scene with all surfaces using an opacity of 1.0:


You’ll notice in the image that the algorithm does actually work correctly with opacity = 1 if there’s no overlap of transparent surfaces, so it does hold up in that particular case. However in general this problem makes it unsuitable for materials like foliage, where large portions of the surface need to be fully opaque. Here’s the test scene using a tree texture, which illustrates the same problem:


Like I said earlier, you really need to make the closest surface have an extremely high weight relative to the surfaces behind it if you want it to appear opaque. One simple thing you could do is to keep track of the depth of the closest surface (say in a depth buffer), and then artificially boost the weight of surfaces if their depth matches the value in the depth buffer. If you do this (and also scale your “boost” factor by opacity) you get something like this:


This result looks quite a bit better, although messing with the weights changes the alpha gradients which gives it a different look. This approach obviously has a lot of failure cases. Since you’re relying on depth, you could easily create discontinuities at geometry edges. You can also get situations like this, where a surface visible through a transparent portion of the closest surface doesn’t get the weight boost and remains translucent in appearance:



Notice how the second tree trunk appears to have a low opacity since it’s behind the closest surface. The other major downside is that you need to render your transparents in a depth prepass, which costs performance as well as the memory for an extra depth buffer. However you may already be doing that in order to optimize tiled forward rendering of transparents. Regardless I doubt it would be useful except in certain special-case scenarios, and it’s probably easier (and cheaper) to just stick to alpha-test or A2C for those cases.
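For what it’s worth, the weight boost I described above boils down to something like this sketch (the epsilon, boost factor, and names are all illustrative guesses at tuning, not values from the sample):

// If this surface is the closest one at this pixel (i.e. it matches the depth
// from the transparency depth prepass), boost its weight so that it reads as
// opaque. Scaling the boost by opacity keeps translucent closest surfaces from
// being forced to full opacity.
float BoostClosestSurfaceWeight(float weight, float surfaceDepth, float prepassDepth, float opacity)
{
    const float DepthEpsilon = 0.0001f;    // assumed comparison tolerance
    const float BoostFactor = 10000.0f;    // assumed boost magnitude
    if(abs(surfaceDepth - prepassDepth) <= DepthEpsilon)
        weight *= 1.0f + BoostFactor * opacity;
    return weight;
}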

Is it usable? – I’m not sure yet. I feel like it would take a lot of testing with the wide range of transparents in our game before knowing if it’s going to work out. It’s too bad that it has failure cases, but if we’re going to be honest the bar is pretty damn low when it comes to transparents in games. In our engine we make an attempt to sort by depth, but our artists frequently have to resort to manually setting “sort priorities” in order to prevent temporal issues from meshes constantly switching their draw order. The Blended OIT algorithm on the other hand may produce incorrect results, but those results are stable over time. However I feel the much bigger issue with traditional transparent rendering is that ordering constraints are fundamentally at odds with rendering performance. Good performance requires using instancing, reducing state changes, and rendering to low-resolution render targets. All 3 of those are incompatible with rendering based on Z order, which means living with lots of sorting issues if you want optimal performance. With that in mind it really feels like it’s hard to do worse than the current status quo.

That’s about all I have for now. Feel free to download the demo and play around with it. If you missed it, the download link is at the top of the page. Also, please let me know if you have any thoughts or ideas regarding the practicality of the technique, since I would definitely be interested in discussing it further.

Sample Framework Updates

You may have noticed that my latest sample now has a proper UI instead of the homegrown sliders and keyboard toggles that I was using in my older samples. What you might not have noticed is that there’s a whole bunch of behind-the-scenes changes to go with that new UI! Before I ramble on, here’s a quick bullet-point list of the new features:

  • Switched to VS2012 and adopted a few C++11 features
  • New UI back-end provided by AntTweakBar
  • C#-based data-definition format for auto-generating UI
  • Shader hot-swapping
  • Better shader caching, and compressed cache files

It occurred to me a little while ago that I could try to develop my framework into something that enables rapid prototyping, instead of just being some random bits of cobbled-together code. I’m not sure if anybody else will use it for anything, but so far it’s working out pretty well for me.

VS 2012

A little while ago I switched to working in VS 2012 with the Windows 8 SDK for my samples, but continued using the VS 2010 project format and toolset in case anybody was stuck on 2010 and wanted to compile my code. For this latest sample I decided that legacy support wasn’t worth it, and made the full switch to the newer compiler. I haven’t done any really crazy C++11 things yet, but now that I’ve started using enum class and nullptr I never want to go back. Ever since learning C# I’ve always wanted C++ enums to use a similar syntax, and C++11 finally fulfilled my wish. I suspect that I’ll feel the same way when I start using non-static data member initializers (once I switch to VS 2013, of course).

New UI

When I found out about AntTweakBar and how easy it is to integrate, I felt a little silly for ever trying to write my own ghetto UI widgets. If you haven’t seen it before, it basically shows up as a window containing a list of tweakable app settings, which is exactly what I need in my sample apps. It also has a neat alternative to sliders where you sort of rotate a virtual lever, and the speed increases depending on how far away you move the mouse from the center point. Here’s what it looks like integrated into my shadows sample:


If you’re thinking of using AntTweakBar in your code, you might want to grab the files TwHelper.h/cpp from my sample framework. They provide a bunch of wrapper functions over TwSetParam that add type safety, which makes the library a lot easier to work with. I also have a much more comprehensive set of Settings classes that further wrap those functions, but those are a lot more tightly coupled to the rest of the framework.

Shader Hot-Swapping

This one is a no-brainer, and I’m not sure why I waited so long to do it. I decided not to implement it using the Win32 file-watching API. We’ve used that at work, and it quickly became the most hated part of our codebase. Instead I took the simple and painless route of checking the timestamp of a single file every N frames, which works great as long as you don’t have thousands of files to go through (usually I only have a dozen or so).

UI Data Definition

For a long time I’ve been unhappy with my old way of defining application settings, and setting up the UI for them. Previously I did everything in C++ code by declaring a global variable for each setting. This was nice because I got type safety and VisualAssist auto-complete whenever I wanted to access a setting, but it was also a bit cumbersome because I always had to touch code in multiple places whenever I wanted to add a new setting. This was especially annoying when I wanted to access a setting in a shader, since I had to manually handle adding it to a constant buffer and setting the value before binding it to a shader. After trying and failing multiple times to come up with something better using just C++, I thought it might be fun to try something a bit less…conventional. Ultimately I drew inspiration from game studios that define their data in a non-C++ language and then use that to auto-generate C++ code for use at runtime. If you can do it this way you get the best of both worlds: you define data in a simple language that can express it elegantly, and then you still get all of your nice type-safety and other C++ benefits. You can even add runtime reflection support by generating code that fills out data structures containing the info that you need to reflect on a type. It sounded crazy to do it for a sample framework, but I thought it would be fun to get out of my comfort zone a bit and try something new.

Ultimately I ended up using C# as my data-definition language. It not only has the required feature set, but I also used to be somewhat proficient in it several years ago. In particular I really like the Attribute functionality in C#,  and thought it would be perfect for defining metadata to go along with a setting. Here’s an example of how I ended up using them for the “Bias” setting from my Shadows sample:

BiasSetting

For enum settings, I just declare an enum in C# and use that new type when defining the setting. I also used an attribute to specify a custom string to associate with each enum value:

EnumSetting

To finish it up, I added simple C# proxies for the Color, Orientation, and Direction setting types supported by the UI system.

Here’s how it all ties together: I define all of the settings in a file called AppSettings.cs, which includes classes for splitting the settings into groups. This file is added to my Visual Studio C++ project, and set to use a custom build step that runs before the C++ compilation step. This build step passes the file to SettingsCompiler.exe, which is a command-line C# app created by a C# project in the same solution. This app basically takes the C# settings file, and invokes the C# compiler (don’t you love languages that can compile themselves?) so that it can be compiled as an in-memory assembly. That assembly is then reflected to determine the settings that are declared in the file, and also to extract the various metadata from the attributes. Since the custom attribute classes need to be referenced by both the SettingsCompiler exe as well as the settings code being compiled, I had to put all of them in a separate DLL project called SettingsCompilerAttributes. Once all of the appropriate data is gathered, the C# project then generates and outputs AppSettings.h and AppSettings.cpp. These files contain global definitions of the various settings using the appropriate C++ UI types, and also contain code for initializing and registering those settings with the UI system. These files are added to the C++ project, so that they can be compiled and linked just like any other C++ code. On top of that, the settings compiler also spits out an HLSL file that declares a constant buffer containing all of the relevant settings (a setting can opt out of the constant buffer if it wants by using an attribute). The C++ files then have code generated for creating a matching constant buffer resource, filling it out with the setting values once a frame, and binding it to all shader stages at the appropriate slot. This means that all a shader needs to do is #include the file, and it can use the setting. Here’s a diagram that shows the basic setup for the whole thing:


This actually works out really nicely in Visual Studio: you just hit F7 and everything builds in the right order. The settings compiler will gather errors from the C# compiler and output them to stdout, which means that if you have a syntax error it gets reported to the VS output window just as if you were compiling it normally through a C# project. MSBuild will even track the timestamp on AppSettings.cs, and won’t run the settings compiler unless it’s newer than the timestamp on AppSettings.h, AppSettings.cpp or AppSettings.hlsl. Sure it’s a really complicated and over-engineered way of handling my problem, but it works and I had fun doing it.

Future Work

I think that the next thing that I’ll improve will be model loading. It’s pretty limiting working with .sdkmesh, and I’d like to be able to work with a wider variety of test scenes. Perhaps I’ll integrate Assimp, or use it to make a simple converter to a custom format. I’d also like to flesh out my SH math and shader code a bit, and add some more useful functionality.

A Sampling of Shadow Techniques

A little over a year ago I was looking to overhaul our shadow rendering at work in order to improve overall quality, as well as simplify the workflow for the lighting artists (tweaking biases all day isn’t fun for anybody). After doing yet another round of research into modern shadow mapping techniques, I decided to do what I usually do and started working on a sample project that I could use as a platform for experimentation and comparison. And so I did just that. I had always intended to put it on the blog, since I thought it was likely that other people would be evaluating some of the same techniques as they upgrade their engines for next-gen consoles. But I was a bit lazy about cleaning it up (there was a lot of code!), and it wasn’t the most exciting thing that I was working on, so it sort-of fell by the wayside. Fast forward to a year later, and I found myself looking for a testbed for some major changes that I was planning for the settings and UI portion of my sample framework. My shadows sample came to mind since it was chock-full of settings, and that led me to finally clean it up and get it ready for sharing. So here it is! Note that I’ve fully switched over to VS 2012 now, so you’ll need it to open the project and compile the code. If you don’t have it installed, then you may need to download and install the VC++ 2012 Redistributable in order to run the pre-compiled binary.

Update 9/12/2013
 – Fixed a bug with cascade stabilization behaving incorrectly for very small partitions
 – Restore previous min/max cascade depth when disabling Auto-Compute Depth Bounds

Thanks to Stephen Hill for pointing out the issues!

Update 9/18/2013
 – Ignacio Castaño was kind enough to share the PCF technique being used in The Witness, which he integrated into the sample under the name “OptimizedPCF”. Thanks Ignacio!

Update 11/3/2013
– Fixed SettingsCompiler string formatting issues for non-US cultures

Update 2/17/2015
– Fixed VSM/EVSM filtering shader that was broken with latest Nvidia drivers
– Added Moment Shadow Mapping
– Added notes about biasing VSM, EVSM, and MSM

The sample project is set up as a typical “cascaded shadow map from a directional light” scenario, in the same vein as the CascadedShadowMaps11 sample from the old DX SDK or the more recent Sample Distribution Shadow Maps sample from Intel. In fact I even use the same Powerplant scene from both samples, although I also added in a human-sized model so that you can get a rough idea of how well the shadowing looks on characters (which can be fairly important if you want your shadows to look good in cutscenes without having to go crazy with character-specific lights and shadows). The basic cascade rendering and setup is pretty similar to the Intel sample: a single “global” shadow matrix is created every frame based on the light direction and current camera position/orientation, using an orthographic projection with width, height, and depth equal to 1.0. Then for each cascade a projection is created that’s fit to the cascade, which is used for rendering geometry to the shadow map. For sampling the cascade, the projection is described as a 3D scale and offset that’s applied to the UV space of the “global” shadow projection. That way you just use one matrix in the shader to calculate shadow map UV coordinates + depth, and then apply the scale and offset to compute the UV coordinates + depth for the cascade that you want to sample. Unlike the Intel sample I didn’t use any kind of deferred rendering, so I decided to just fix the number of cascades at 4 instead of making it tweakable at runtime.
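To make the scale/offset approach concrete, the per-pixel cascade sampling boils down to something like this sketch (the names and the exact offset/scale parameterization are assumptions; the important part is that a single matrix plus a per-cascade scale and offset gets you the final UV + depth):

// Computes shadow map UV + depth for a particular cascade, starting from the
// single "global" shadow matrix that covers the whole directional shadow volume.
float3 GetCascadeShadowCoord(in float3 positionWS, in float4x4 globalShadowMatrix,
                             in float3 cascadeOffset, in float3 cascadeScale)
{
    // Project into the global [0, 1] shadow space
    float3 shadowCoord = mul(float4(positionWS, 1.0f), globalShadowMatrix).xyz;

    // Apply the cascade's offset and scale to get UV + depth for that cascade
    return (shadowCoord + cascadeOffset) * cascadeScale;
}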

Cascade Optimization

Optimizing how you fit your cascades to your scene and current viewpoint is pretty crucial for reducing perspective aliasing and the artifacts that it causes. The old-school way of doing this for CSM is to chop up your entire visible depth range into N slices using some sort of partitioning scheme (logarithmic, linear, mixed, manual, etc.). Then for each slice of the view frustum, you’d tightly fit an AABB to the slice and use that as the parameters for your orthographic projection for that slice. This gives you the most optimal effective resolution for a given cascade partition, however with 2006-era shadow map sizes you still generally end up with an effective resolution that’s pretty low relative to the screen resolution. Combine this with 2006-era filtering (2×2 PCF in a lot of cases), and you end up with quite a bit of aliasing. This aliasing was exceptionally bad, due to the fact that your cascade projections will translate and scale as your camera moves and rotates, which results in rasterization sample points changing from frame to frame. The end result was crawling shadow edges, even from static geometry. The most popular solution for this problem was to trade some effective shadow map resolution for stable sample points that don’t move from frame to frame. This was first proposed (to my knowledge) by Michal Valient in his ShaderX6 article entitled “Stable Cascaded Shadow Maps”. The basic idea is that instead of tightly mapping your orthographic projection to your cascade split, you map it in such a way that the projection won’t change as the camera rotates. The way that Michal did it was to fit a sphere to the entire 3D frustum split and then fit the projection to that sphere, but you could do it any way that gives you a consistent projection size. To handle the translation issue, the projections are “snapped” to texel-sized increments so that you don’t get sub-texel sample movement. This ends up working really well, provided that your cascade partitions never change (which means that changing your FOV or near/far clip planes can cause issues). In general the stability ends up being a net win despite the reduced effective shadow map resolution, however small features and dynamic objects end up suffering.
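The snapping part is simple once the projection has a fixed size. Here’s a minimal sketch, assuming the projection has been fit to a bounding sphere of known diameter as described above (names are mine):

// Snaps the shadow-space origin of a cascade's orthographic projection to
// texel-sized increments, so that rasterization sample points don't crawl as
// the camera translates. Only the shadow-space XY plane is snapped.
float3 SnapCascadeOrigin(in float3 shadowSpaceOrigin, in float cascadeDiameter,
                         in float shadowMapResolution)
{
    float texelSize = cascadeDiameter / shadowMapResolution;
    float3 snapped = floor(shadowSpaceOrigin / texelSize) * texelSize;
    snapped.z = shadowSpaceOrigin.z;
    return snapped;
}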

2 years ago Andrew Lauritzen gave a talk on what he called “Sample Distribution Shadow Maps”, and released the sample that I mentioned earlier. He proposed that instead of stabilizing the cascades, we could instead focus on reducing wasted resolution in the shadow map to a point where effective resolution is high enough to give us sub-pixel resolution when sampling the shadow map. If you can do that then you don’t really need to worry about your projection changing frame to frame, provided that you use decent filtering when sampling the shadow map. The way that he proposed to do this was to analyze the depth buffer generated for the current viewpoint, and use it to automatically generate an optimal partitioning for the cascades. He tried a few complicated ways of achieving this that involved generating a depth histogram on the GPU, but also proposed a much more practical scheme that simply computed the min and max depth values. Once you have the min and max depth, you can very easily use that to clamp your cascades to the subset of your depth range that actually contains visible pixels. This might not sound like a big deal, but in practice it can give you huge improvements. The min Z in particular allows you to have a much more optimal partitioning, which you can get by using an “ideal” logarithmic partitioning scheme. The main downside is that you need to use the depth buffer, which puts you in the awful position of having the CPU dependent on results from the GPU if you want to do your shadow setup and scene submission on the CPU. In the sample code they simply stall and read back the reduction results, which isn’t really optimal at all in terms of performance. When you do something like this the CPU ends up waiting around for the driver to kick off commands to the GPU and for the GPU to finish processing them, and then you get a stall on the GPU while it sits around and waits for the CPU to start kicking off commands again. You can potentially get around this by doing what you normally do for queries and such, and deferring the readback for one or more frames. That way the results are (hopefully) already ready for readback, and so you don’t get any stalls. But this can cause you problems, since the cascades will trail a frame behind what’s actually on screen. So for instance if the camera moves towards an object, the min Z may not be low enough to fit all of the pixels in the first cascade and you’ll get artifacts. One potential workaround is to use the previous frame’s camera parameters to try to predict what the depth of a pixel will be for the next frame based on the instantaneous linear velocity of the camera, so that when you retrieve your min/max depth a frame late it actually contains the correct results. I’ve actually done this in the past (but not for this sample, I got lazy) and it works as long as the camera motion stays constant. However it won’t handle moving objects unless you store the per-pixel velocity with regards to depth and factor that into your calculations. The ultimate solution is to do all of the setup and submission on the GPU, but I’ll talk about that in detail later on.
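For reference, the practical min/max variant is just a parallel reduction over the depth buffer. Here’s a rough sketch of a first reduction pass (the tile size, resource names, and sky-pixel rejection are my own assumptions; additional passes would reduce the per-tile results down to a single min/max pair):

// First pass of a min/max depth reduction: each 16x16 thread group reduces one
// tile of the depth buffer down to a single min/max pair.
Texture2D<float> DepthBuffer : register(t0);
RWTexture2D<float2> ReductionOutput : register(u0);

groupshared float2 SharedMinMax[256];

[numthreads(16, 16, 1)]
void DepthReductionCS(in uint3 GroupID : SV_GroupID, in uint3 DispatchID : SV_DispatchThreadID,
                      in uint ThreadIndex : SV_GroupIndex)
{
    uint2 textureSize;
    DepthBuffer.GetDimensions(textureSize.x, textureSize.y);

    // Out-of-bounds threads and sky/far-plane pixels contribute identity values
    float2 minMax = float2(1.0f, 0.0f);
    if(DispatchID.x < textureSize.x && DispatchID.y < textureSize.y)
    {
        float depth = DepthBuffer[DispatchID.xy];
        if(depth < 1.0f)
            minMax = float2(depth, depth);
    }

    SharedMinMax[ThreadIndex] = minMax;
    GroupMemoryBarrierWithGroupSync();

    // Standard parallel reduction in shared memory
    [unroll]
    for(uint s = 128; s > 0; s >>= 1)
    {
        if(ThreadIndex < s)
        {
            SharedMinMax[ThreadIndex].x = min(SharedMinMax[ThreadIndex].x, SharedMinMax[ThreadIndex + s].x);
            SharedMinMax[ThreadIndex].y = max(SharedMinMax[ThreadIndex].y, SharedMinMax[ThreadIndex + s].y);
        }
        GroupMemoryBarrierWithGroupSync();
    }

    if(ThreadIndex == 0)
        ReductionOutput[GroupID.xy] = SharedMinMax[0];
}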

These are the options implemented in the sample that affect cascade optimization:

  • Stabilize Cascades – enables cascade stabilization using the method from ShaderX6.
  • Auto-Compute Depth Bounds – computes the min and max Z on the GPU, and uses it to tightly fit the cascades as in SDSM.
  • Readback Latency – number of frames to wait for the depth reduction results
  • Min Cascade Distance/Max Cascade Distance – manual min and max Z when Auto-Compute Depth Bounds isn’t being used
  • Partition Mode – can be Logarithmic, PSSM (mix between linear and logarithmic), or manual
  • Split Distance 0/Split Distance 1/Split Distance 2/Split Distance 3 – manual partition depths
  • PSSM Lambda – mix between linear and logarithmic partitioning when PartitionMode == PSSM
  • Visualize Cascades – colors pixels based on which cascade they select

Shadow Filtering

Shadow map filtering is the other important aspect of reducing artifacts. Generally it’s needed to reduce aliasing due to undersampling the geometry being rasterized into the shadow map, but it’s also useful for cases where the shadow map itself is being undersampled by the pixel shader. The most common technique for a long time has been Percentage Closer Filtering (PCF), which basically amounts to sampling a normal shadow map, performing the shadow comparison, and then filtering the result of that comparison. Nvidia hardware has been able to do a 2×2 bilinear PCF kernel in hardware since…forever, and it’s required of all DX10-capable hardware. Just about every PS3 game takes advantage of this feature, and Xbox 360 games would too if the GPU supported it. In general you see lots of 2×2 or 3×3 grid-shaped PCF kernels with either a box filter or a triangle filter. A few games (notably the Crysis games) use a randomized sample pattern with sample points located on a disc, which trades regular aliasing for noise. With DX11 hardware there’s support for GatherCmp, which essentially gives you the results of 4 shadow comparisons performed for the relevant 2×2 group of texels. With this you can efficiently implement large (7×7 or even 9×9) filter kernels with minimal fetch instructions, and still use arbitrary filter kernels. In fact there was an article in GPU Pro called “Fast Conventional Shadow Filtering” by Holger Gruen that did exactly this, and even provided source code. It can be stupidly fast…in my sample app going from 2×2 PCF to 7×7 PCF only adds about 0.4ms when rendering at 1920×1080 on my AMD 7950. For comparison, a normal grid sampling approach adds about 2-3ms in my sample app for the maximum level of filtering. The big disadvantage to the fixed kernel modes is that they rely on the compiler to unroll the loops, which makes for some sloooowwww compile times. The sample app uses a compilation cache so you won’t notice it if you just start it up, but without the cache you’ll see that it takes quite a while due to the many shader permutations being used. For that reason I decided to stick with a single kernel shape (disc) rather than using all of the shapes from the GPU Pro article, since compilation times were already bad enough.
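If you haven’t used it before, GatherCmp is the core building block here: one fetch returns four comparison results for a 2×2 footprint, which the optimized kernels then weight and sum. A minimal (non-optimized) usage looks something like this:

Texture2D<float> ShadowMap : register(t0);
SamplerComparisonState ShadowSampler : register(s0);   // comparison func = LESS_EQUAL

// Simple 2x2 box-filtered PCF: GatherCmp performs the depth comparison for the
// four texels around the sample location and returns the four 0/1 results.
float SimplePCF2x2(in float2 shadowUV, in float receiverDepth)
{
    float4 comparisons = ShadowMap.GatherCmp(ShadowSampler, shadowUV, receiverDepth);
    return (comparisons.x + comparisons.y + comparisons.z + comparisons.w) * 0.25f;
}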

So far the only real competitor to gain any traction in games is Variance Shadow Maps (VSM). I won’t go deep into the specifics since the original paper and GPU Gems article do a great job of explaining it. But the basic gist is that you work in terms of the mean and variance of a distribution of depth values at a certain texel, and then use that distribution to estimate the probability of a pixel being in shadow. The end result is that you gain the ability to filter the shadow map without having to perform a comparison, which means that you can use hardware filtering (including mipmaps, anisotropy, or even MSAA) and that you can pre-filter the shadow map with a standard “full-screen” blur pass. Another important aspect is that you generally don’t suffer from the same biasing issues that you do with PCF. There are some issues of performance and memory, since you now need to store 2 high-precision values in your shadow map instead of just 1. But in general the biggest problem is the light bleeding, which occurs when there’s a large depth range between occluders and receivers. Lauritzen attempted to address this a few years later by applying an exponential warp to the depth values stored in the shadow map, and performing the filtering in log space. It’s generally quite effective, but it requires high-precision floating point storage to accommodate the warping. For maximum quality he also proposed storing a negative term, which requires an extra 2 components in the shadow map. In total that makes for 4x FP32 components per texel, which is definitely not light in terms of bandwidth! However you could say that it arguably produces the highest-quality results, and it does so without having to muck around with biasing. This is especially true when pre-filtering, MSAA, anisotropic filtering, and mipmaps are all used, although each of those brings about additional cost. To provide some real numbers, using EVSM4 with 2048×2048 cascades, 4xMSAA, mipmaps, 8xAF, and highest-level filtering (9 texels wide for the first cascade) adds about 11.5ms relative to a 7×7 fixed PCF kernel. A more reasonable approach would be to go with 1024×1024 shadow maps with 4xMSAA, which is around 3ms slower than the PCF version.
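The core math for VSM and EVSM is small enough to show here. This is a sketch of the Chebyshev upper bound and the exponential warp, using the standard formulations (the variable names are mine):

// VSM: uses the first two moments of the depth distribution at a texel to put
// an upper bound on the probability that the receiver is lit (Chebyshev's inequality).
float ChebyshevUpperBound(float2 moments, float receiverDepth, float minVariance)
{
    float variance = max(moments.y - moments.x * moments.x, minVariance);
    float d = receiverDepth - moments.x;
    float pMax = variance / (variance + d * d);

    // Fully lit if the receiver is in front of the mean occluder depth
    return (receiverDepth <= moments.x) ? 1.0f : pMax;
}

// EVSM: warps depth through exponentials before computing moments, which spreads
// occluder/receiver depth differences apart and greatly reduces light bleeding.
// The negative component is what requires the extra two channels for EVSM4.
float2 WarpDepth(float depth, float positiveExponent, float negativeExponent)
{
    depth = 2.0f * depth - 1.0f;    // rescale [0, 1] depth to [-1, 1]
    float pos =  exp( positiveExponent * depth);
    float neg = -exp(-negativeExponent * depth);
    return float2(pos, neg);
}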

These are the shadow filtering modes that are implemented:

  • FixedSizePCF – optimized GatherCmp PCF with disc-shaped kernel
  • GridPCF – manual grid-based PCF sampling using NxN samples
  • RandomDiscPCF  – randomized samples on a Poisson disc, with optional per-pixel randomized rotation
  • OptimizedPCF – similar to FixedSizePCF, but uses bilinear PCF samples to implement a uniform filter kernel.
  • VSM – variance shadow maps
  • EVSM2 – exponential variance shadow maps, positive term only
  • EVSM4 – exponential variance shadow maps, both positive and negative terms
  • MSM Hamburger – moment shadow mapping, using the “Hamburger 4MSM” technique from the paper
  • MSM Hausdorff – moment shadow mapping, using the “Hausdorff 4MSM” technique from the paper

Here’s the options available in my sample related to filtering:

  • Shadow Mode – the shadow filtering mode, can be one of the above values
  • Fixed Filter Size – the kernel width in texels for the FixedSizePCF mode, can be 2×2 through 9×9
  • Filter Size – the kernel width in fractional world space units for all other filtering modes. For the VSM modes, it’s used in a pre-filtering pass.
  • Filter Across Cascades – blends between two cascade results at cascade boundaries to hide the transition
  • Num Disc Samples – number of samples to use for RandomDiscPCF mode
  • Randomize Disc Offsets – if enabled, applies a per-pixel randomized rotation to the disc samples
  • Shadow MSAA – MSAA level to use for VSM, EVSM, and MSM modes
  • VSM/MSM Format – the precision to use for VSM, EVSM, and MSM shadow maps. Can be 16-bit or 32-bit. For VSM the textures will use a UNORM format, for EVSM they will be FLOAT. For the MSM 16-bit version, the optimized quantization scheme from the paper is used to store the data in a UNORM texture.
  • Shadow Anisotropy – anisotropic filtering level to use for VSM, EVSM, and MSM
  • Enable Shadow Mips – enables mipmap generation and sampling for VSM, EVSM, and MSM
  • Positive Exponent/Negative Exponent – the exponential warping factors for the positive and negative components of EVSM
  • Light Bleeding Reduction – reduces light bleeding for VSM/EVSM/MSM, but results in over-darkening

In general I try to keep the filtering kernel fixed in world space across each cascade by adjusting the kernel width based on the cascade size. The one exception is the FixedSizePCF mode, which uses the same size kernel for all cascades. I did this because I didn’t think that branching over the fixed kernels would be a great idea. Matching the filter kernel for each cascade is nice because it helps hide the seams at cascade transitions, which means you don’t have to try to filter across adjacent cascades in order to hide them. It also means that you don’t have to use wider kernels on more distant pixels, although this can sometimes lead to visible aliasing on distant surfaces.

I didn’t put a whole lot of effort into the “RandomDiscPCF” mode, so it doesn’t produce optimal results. The randomization is done per-pixel, which isn’t great since you can clearly see the random pattern tiling over the screen as the camera moves. For a better comparison you would probably want to do something similar to what Crytek does, and tile a (stable) pattern over each cascade in shadowmap space.


When using conventional PCF filtering, biasing is essential for avoiding “shadow acne” artifacts. Unfortunately, it’s usually pretty tricky to get it right across different scenes and lighting scenarios. Just a bit too much bias will end up killing shadows entirely for small features, which can cause characters to look very bad. My sample app exposes 3 kinds of biasing: manual depth offset, manual normal-based offset, and automatic depth offset based on receiver slope. The manual depth offset, called simply Bias in the UI, subtracts a value from the pixel depth used to compare against the shadow map depth. Since the shadow map depth is a [0, 1] value that’s normalized to the depth range of the cascade, the bias value represents a variable size in world space for each cascade. The normal-based offset, called Offset Scale in the UI, is based on “Normal Offset Shadows”, which was a poster presented at GDC 2011. The basic idea is that you actually create a new virtual shadow receiver position by offsetting from the actual pixel position in the direction of the normal. The trick is that you offset more when the surface normal is more perpendicular to the light direction. Angelo Pesce has a hand-drawn diagram explaining the same basic premise on his blog, if you’re having trouble picturing it. This technique can actually produce decent results, especially given how cheap it is. However since you’re offsetting the receiver position, you actually “move” the shadows a bit, which is a bit weird to see as you tweak the value. Since the offset is a world-space distance, in my sample I scale it based on the depth range of the cascade in order to make it consistent across cascade boundaries. If you want to try using it, I recommend starting with a small manual bias of 0.0015 or so and then slowly increasing the Offset Scale to about 1.0 or 2.0. Finally we have the automatic solution, where we attempt to compute the “perfect” bias amount based on the slope of the receiver. This setting is called Use Receiver Plane Depth Bias in the sample. To determine the slope, screen-space derivatives are used in the pixel shader. When it works, it’s fantastic. However it will still run into degenerate cases where it can produce unpredictable results, which is something that often happens when working with screen-space derivatives.
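Going back to the normal-based offset for a moment, one simple way to implement it looks like the sketch below (how the offset gets scaled per cascade is up to you, and the names here are assumptions). The offset position is then what gets transformed by the shadow matrix, which is why the shadows appear to shift slightly as you increase the scale.

// "Normal offset shadows": push the shadow receiver position out along the
// surface normal, offsetting more as the normal becomes perpendicular to the
// light direction (which is where acne is worst).
float3 ApplyNormalOffset(in float3 positionWS, in float3 normalWS, in float3 lightDirWS,
                         in float offsetScale)
{
    float nDotL = saturate(dot(normalWS, lightDirWS));
    float offsetAmount = offsetScale * (1.0f - nDotL);
    return positionWS + normalWS * offsetAmount;
}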

There are also separate biasing parameters for the VSM and MSM techniques. The “VSM Bias” affects VSM and EVSM, while “MSM Depth Bias” and “MSM Moment Bias” are used for MSM. For VSM and 32-bit EVSM, a bias value of 0.01 (which corresponds to an actual value of 0.0001) is sufficient. However for 16-bit EVSM a bias of up to 0.5 (0.005) is required to alleviate precision issues. For MSM, only the moment bias is particularly relevant. This value needs to be at least 0.001-0.003 (0.000001-0.000003) for the 32-bit modes, while the quantized 16-bit mode requires a higher bias of around 0.03 (0.00003). Note that for both EVSM and MSM increasing the bias can also increase the amount of visible light bleeding.

GPU-Driven Cascade Setup and Scene Submission

This is a topic that’s both fun and really frustrating. It’s fun because you can really start to exploit the flexibility of DX11-class GPU’s, and begin to break free of the old “CPU spoon-feeds the GPU” model that we’ve been using for so long now. It’s also frustrating because the API’s still hold us back quite a bit in terms of letting the GPU generate its own commands. Either way it’s something I see people talk about, but not something that a lot of people are actually doing, so I thought I’d give it a try for this sample. There are actually 2 reasons to try something like this. The first is that if you can offload enough work to the GPU, you can avoid the heavy CPU overhead of frustum culling and drawing and/or batching lots and lots of meshes. The second is that it lets you do SDSM-style cascade optimization based on the depth buffer without having to read back values from the GPU to the CPU, which is always a painful way of doing things.

The obvious path to implementing GPU scene submission would be to make use of DrawInstancedIndirect/DrawIndexedInstancedIndirect. These API’s are fairly simple to use: you simply write the parameters to a buffer (with the parameters being the same ones that you would normally pass on the CPU for the non-indirect version), and then on the CPU you specify that buffer when calling the appropriate function. Since instance count is one of the parameters, you can implement culling on the GPU by setting the instance count to 0 when a mesh shouldn’t be rendered. However it turns out that this isn’t really useful, since you still need to go through the overhead of submitting draw calls and setting associated state on the CPU. In fact the situation is worse than doing it all on the CPU, since you have to submit each draw call even if it will be frustum culled.
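For completeness, here’s what that zero-the-instance-count approach might look like as a compute shader. This is a sketch with assumed structures and a simple plane-based sphere test; the argument layout matches D3D11’s DrawIndexedInstancedIndirect:

struct DrawData
{
    uint NumIndices;
    uint StartIndex;
    int BaseVertex;
    float4 BoundingSphere;     // xyz = center, w = radius
};

StructuredBuffer<DrawData> Draws : register(t0);
RWByteAddressBuffer IndirectArgs : register(u0);

cbuffer CullConstants : register(b0)
{
    float4 FrustumPlanes[6];   // plane normals assumed to point inward
    uint NumDraws;
}

bool SphereVisible(in float4 sphere)
{
    [unroll]
    for(uint i = 0; i < 6; ++i)
    {
        if(dot(float4(sphere.xyz, 1.0f), FrustumPlanes[i]) < -sphere.w)
            return false;
    }
    return true;
}

[numthreads(64, 1, 1)]
void CullDrawsCS(in uint3 DispatchID : SV_DispatchThreadID)
{
    const uint drawIdx = DispatchID.x;
    if(drawIdx >= NumDraws)
        return;

    DrawData draw = Draws[drawIdx];
    uint instanceCount = SphereVisible(draw.BoundingSphere) ? 1 : 0;

    // 5 uints per draw: IndexCountPerInstance, InstanceCount, StartIndexLocation,
    // BaseVertexLocation, StartInstanceLocation
    const uint argsOffset = drawIdx * 5 * 4;
    IndirectArgs.Store(argsOffset + 0, draw.NumIndices);
    IndirectArgs.Store(argsOffset + 4, instanceCount);
    IndirectArgs.Store(argsOffset + 8, draw.StartIndex);
    IndirectArgs.Store(argsOffset + 12, asuint(draw.BaseVertex));
    IndirectArgs.Store(argsOffset + 16, 0);
}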

Instead of using indirect draws, I decided to make a simple GPU-based batching system. Shadow map rendering is inherently more batchable than normal rendering, since you typically don’t need to worry about having specialized shaders or binding lots of textures when you’re only rendering depth. In my sample I take advantage of this by using a compute shader to generate one great big index buffer based on the results of a frustum culling pass, which can then be rendered in a single draw call. It’s really very simple: during initialization I generate a buffer containing all vertex positions, a buffer containing all indices (offset to reflect the vertex position in the combined position buffer), and a structured buffer containing the parameters for every draw call (index start, num indices, and a bounding sphere). When it’s time to batch, I run a compute shader with one thread per draw call that culls the bounding sphere against the view frustum. If it passes, the thread then “allocates” some space in the output index buffer by performing an atomic add on a value in a RWByteAddressBuffer containing the total number of culled indices that will be present in the output index buffer (note that on my 7950 you should be able to just do the atomic in GDS like you would for an append buffer, but unfortunately in HLSL you can only increment by 1 using IncrementCounter()). I also append the draw call data to an append buffer for use in the next pass:

const uint drawIdx = TGSize * GroupID.x + ThreadIndex;
if(drawIdx >= NumDrawCalls)
    return;

DrawCall drawCall = DrawCalls[drawIdx];

// Frustum test against the draw call's bounding sphere (helper omitted here)
if(SphereVisible(drawCall.BoundingSphere))
{
    CulledDraw culledDraw;
    culledDraw.SrcIndexStart = drawCall.StartIndex;
    culledDraw.NumIndices = drawCall.NumIndices;

    // "Allocate" space in the output index buffer by atomically adding to the
    // running total of culled indices; the previous total is this draw's offset
    DrawArgsBuffer.InterlockedAdd(0, drawCall.NumIndices, culledDraw.DstIndexStart);

    CulledDraws.Append(culledDraw);
}

Once that’s completed, I then launch another compute shader that uses 1 thread group per draw call to copy all of the indices from the source index buffer to the final output index buffer that will be used for rendering. This shader is also very simple: it simply looks up the draw call info from the append buffer that tells it which indices to copy and where they should be copied, and then loops enough times for all indices to be copied in parallel by the different threads inside the thread group. The final result is an index buffer containing all of the culled draw calls, ready to be drawn in a single draw call:

const uint drawIdx = GroupID.x;
CulledDraw drawCall = CulledDraws[drawIdx];

// Each thread in the group copies a strided subset of this draw's indices
for(uint currIdx = ThreadIndex; currIdx < drawCall.NumIndices; currIdx += TGSize)
{
    const uint srcIdx = currIdx + drawCall.SrcIndexStart;
    const uint dstIdx = currIdx + drawCall.DstIndexStart;
    CulledIndices[dstIdx] = Indices[srcIdx];
}

In order to avoid the min/max depth readback, I also had to port my cascade setup code to a compute shader so that the entire process could remain on the GPU. I was surprised to find that this was actually quite a bit more difficult than writing the batching system. Porting arbitrary C++ code to HLSL can be somewhat tedious, due to the various limitations of the language. I also ran into a rather ugly bug in the HLSL compiler where it kept trying to keep matrices in column-major order no matter what code I wrote and what declarations I used, which I suppose it tries to do as an optimization. However this really messed me up when I tried to write the matrix into a structured buffer and expected it to be row-major. My advice for the future: if you need to write a matrix from a compute shader, just use a StructuredBuffer<float4> and write out the rows manually. Ultimately after much hair-pulling I got it to work, and finally achieved my goal of 0-latency depth reduction with no CPU readback!
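Concretely, the workaround looks something like this (a sketch; names are mine):

// Store each cascade's shadow matrix as four explicit float4 rows instead of
// writing a float4x4 directly into a structured buffer, sidestepping the
// compiler's matrix layout behavior.
RWStructuredBuffer<float4> CascadeMatrixRows : register(u0);

void WriteCascadeMatrix(in uint cascadeIdx, in float4x4 shadowMatrix)
{
    const uint baseRow = cascadeIdx * 4;
    CascadeMatrixRows[baseRow + 0] = shadowMatrix[0];
    CascadeMatrixRows[baseRow + 1] = shadowMatrix[1];
    CascadeMatrixRows[baseRow + 2] = shadowMatrix[2];
    CascadeMatrixRows[baseRow + 3] = shadowMatrix[3];
}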

In terms of performance, the GPU-driven path comes in about 0.8ms slower than the CPU version when the CPU uses 1 frame of latency for reading back the min/max depth results. I’m not entirely sure why it’s that much slower, although I haven’t spent much time trying to profile the batching shaders or the cascade setup shader. I also wouldn’t be surprised if the actual mesh rendering ends up being a bit slower, since I use 32-bit indices when batching on the GPU as opposed to 16-bit when submitting with the CPU. However when using 0 frames of latency for the CPU readback, the GPU version turns it around and comes in a full millisecond faster on my PC. Obviously it makes sense that any additional GPU overhead would make up for the stall that occurs on the CPU and GPU when having to immediately read back data. One thing that I’ll point out is that the GPU version is quite a bit slower than the CPU version when VSM’s are used and pre-filtering is enabled. This is because the CPU path chooses an optimized blur shader based on the filter width for a cascade, and early-outs of the blur process if no filtering is required. For the GPU path I got lazy and just used one blur shader that uses a dynamic loop, and it always runs the blur passes even if no filtering is requested. There’s no technical reason why you couldn’t do it the same way as the CPU path with enough motivation and DrawIndirect’s.

DX11.2 Tiled Resources

Tiled resources seem to be the big-ticket item for the upcoming DX11.2 update. While the online documentation has some information about the new functions added to the API, there’s currently no information about the two tiers of tiled resource functionality being offered. Fortunately there is a sample app available that provides some clues. After poking around a bit last night, these were the differences that I noticed:

  • TIER2 supports MIN and MAX texture sampling modes that return the min or max of 4 neighboring texels. In the sample they use this when sampling a residency texture that tells the shader the highest-resolution mip level that can be used when sampling a particular tile. For TIER1 they emulate it with a Gather.
  • TIER1 doesn’t support sampling from unmapped tiles, so you have to either avoid it in your shader or map all unloaded tiles to dummy tile data (the sample does the latter)
  • TIER1 doesn’t support packed mips for texture arrays. From what I can gather, packed mips refers to packing multiple mips into a single tile.
  • TIER2 supports a new version of Texture2D.Sample that lets you clamp the mip level to a certain value. They use this to force the shader to sample from lower-resolution mip levels if the higher-resolution mip isn’t currently resident in memory. For TIER1 they emulate this by computing what mip level would normally be used, comparing it with the mip level available in memory, and then falling back to SampleLevel if the mip level needs to be clamped. There’s also another overload for Sample that returns a status variable that you can pass to a new “CheckAccessFullyMapped” intrinsic that tells you if the sample operation would access unmapped tiles. The docs don’t say that these functions are restricted to TIER2, but I would assume that to be the case.

Based on this information it appears that TIER1 offers all of the core functionality, while TIER2 has a few extras that bring it up to par with AMD’s sparse texture extension.