A little over a year ago I was looking to overhaul our shadow rendering at work in order to improve overall quality, as well as simplify the workflow for the lighting artists (tweaking biases all day isn’t fun for anybody). After doing yet another round of research into modern shadow mapping techniques, I decided to do what I usually do and starting working on sample project that I could use as a platform for experimentation and comparison. And so I did just that. I had always intended to put it on the blog, since I thought it was likely that other people would be evaluating some of the same techniques as they upgrade their engines for next-gen consoles. But I was a bit lazy about cleaning it up (there was a lot of code!), and it wasn’t the most exciting thing that I was working on, so it sort-of fell by the wayside. Fast forward to a year later, and I found myself looking for a testbed for some major changes that I was planning for the settings and UI portion of my sample framework. My shadows sample came to mind since it was chock-full of settings, and that lead me to finally clean it up and get it ready for sharing. So here it is! Note that I’ve fully switched over to VS 2012 now, so you’ll need it to open the project and compile the code. If you don’t have it installed, then you may need to download and install the VC++ 2012 Redistributable in order to run the pre-compiled binary.
– Fixed a bug with cascade stabilization behaving incorrectly for very small partitions
– Restore previous min/max cascade depth when disabling Auto-Compute Depth Bounds
Thanks to Stephen Hill for pointing out the issues!
– Ignacio Castaño was kind enough to share the PCF technique being used in The Witness, which he integrated into the sample under the name “OptimizedPCF”. Thanks Ignacio!
- Fixed SettingsCompiler string formatting issues for non-US cultures
- Fixed VSM/EVSM filtering shader that was broken with latest Nvidia drivers
- Added Moment Shadow Mapping
– Added notes about biasing VSM, EVSM, and MSM
The sample project is set up as a typical “cascaded shadow map from a directional light” scenario, in the same vein as the CascadedShadowMaps11 sample from the old DX SDK or the more recent Sample Distribution Shadow Maps sample from Intel. In fact I even use the same Powerplant scene from both samples, although I also added in human-sized model so that you can get a rough idea of how well the shadowing looks on characters (which can be fairly important if you want your shadows to look good in cutscenes without having to go crazy with character-specific lights and shadows). The basic cascade rendering and setup is pretty similar to the Intel sample: a single “global” shadow matrix is created every frame based on the light direction and current camera position/orientation, using an orthographic projection with width, height, and depth equal to 1.0. Then for each cascade a projection is created that’s fit to the cascade, which is used for rendering geometry to the shadow map. For sampling the cascade, the projection is described as a 3D scale and offset that’s applied to the UV space of the “global” shadow projection. That way you just use one matrix in the shader to calculate shadow map UV coordinates + depth, and then apply the scale and offset to compute the UV coordinates + depth for the cascade that you want to sample. Unlike the Intel sample I didn’t use any kind of deferred rendering, so I decided to just fix the number of cascades at 4 instead of making it tweakable at runtime.
Optimizing how you fit your cascades to your scene and current viewpoint is pretty crucial for reducing perspective aliasing and the artifacts that it causes. The old-school way of doing this for CSM is to chop up your entire visible depth range into N slices using some sort of partitioning scheme (logarithmic, linear, mixed, manual, etc.). Then for each slice of the view frustum, you’d tightly fit an AABB to the slice and use that as the parameters for your orthographic projection for that slice. This gives you the most optimal effective resolution for a given cascade partition, however with 2006-era shadow map sizes you still generally end up with an effective resolution that’s pretty low relative to the screen resolution. Combine this with 2006-era filtering (2×2 PCF in a lot of cases), and you end up with quite a bit of aliasing. This aliasing was exceptionally bad, due to the fact that your cascade projections will translate and scale as your camera moves and rotates, which results in rasterization sample points changing from frame to frame as the camera moves. The end result was crawling shadow edges, even from static geometry. The most popular solution for this problem was to trade some effective shadow map resolution for stable sample points that don’t move from frame to frame. This was first proposed (to my knowledge) by Michal Valient in his ShaderX6 article entitled “Stable Cascaded Shadow Maps”. The basic idea is that instead of tightly mapping your orthographic projection to your cascade split, you map it in such a way that the projection won’t change as the camera rotates. The way that Michal did it was to fit a sphere to the entire 3D frustum split and then fit the projection to that sphere, but you could do it any way that gives you a consistent projection size. To handle the translation issue, the projections are “snapped” to texel-sized increments so that you don’t get sub-texel sample movement. This ends up working really well, provided that your cascade partitions never change (which means that changing your FOV or near/car clip planes can cause issues). In general the stability ends up being a net win despite the reduced effective shadow map resolution, however small features and dynamic objects end up suffering.
2 years ago Andrew Lauritzen gave a talk on what he called “Sample Distribution Shadow Maps”, and released the sample that I mentioned earlier. He proposed that instead of stabilizing the cascades, we could instead focus on reducing wasted resolution in the shadow map to a point where effective resolution is high enough to give us sub-pixel resolution when sampling the shadow map. If you can do that then you don’t really need to worry about your projection changing frame to frame, provided that you use decent filtering when sampling the shadow map. The way that he proposed to do this was to analyze the depth buffer generated for the current viewpoint, and use it to automatically generate an optimal partitioning for the cascades. He tried a few complicated ways of achieving this that involved generating a depth histogram on the GPU, but also proposed a much more practical scheme that simply computed the min and max depth values. Once you have the min and max depth, you can very easily use that to clamp your cascades to the subset of your depth range that actually contains visible pixels. This might not sound like a big deal, but in practice it can give you huge improvements. The min Z in particular allows you to have a much more optimal partitioning, which you can get by using an “ideal” logarithmic partitioning scheme. The main downside is that you need to use the depth buffer, which puts you in the awful position of having the CPU dependent on results from the GPU if you want to do your shadow setup and scene submission on the CPU. In the sample code they simply stall and readback the reduction results, which isn’t really optimal at all in terms of performance. When you do something like this the CPU ends up waiting around for the driver to kick off commands to the GPU and for the GPU to finish processing them, and then you get a stall on the GPU while it sits around and waits for the CPU to start kicking off commands again. You can potentially get around this by doing what you normally do for queries and such, and deferring the readback for one or more frames. That way the results are (hopefully) already ready for readback, and so you don’t get any stalls. But this can cause you problems, since the cascades will trail a frame behind what’s actually on screen. So for instance if the camera moves towards an object, the min Z may not be low enough to fit all of the pixels in the first cascade and you’ll get artifacts. One potential workaround is to use the previous frame’s camera parameters to try to predict what the depth of a pixel will be for the next frame based on the instantaneous linear velocity of the camera, so that when you retrieve your min/max depth a frame late it actually contains the correct results. I’ve actually done this in the past (but not for this sample, I got lazy) and it works as long as long as the camera motion stays constant. However it won’t handle moving objects unless you store the per-pixel velocity with regards to depth and factor that into your calculations. The ultimate solution is to do all of the setup and submission on the GPU, but I’ll talk about that in detail later on.
These are the options implemented in the sample that affect cascade optimization:
- Stabilize Cascades – enables cascade stabilization using the method from ShaderX6.
- Auto-Compute Depth Bounds – computes the min and max Z on the GPU, and uses it to tightly fit the cascades as in SDSM.
- Readback Latency – number of frames to wait for the depth reduction results
- Min Cascade Distance/Max Cascade Distance – manual min and max Z when Auto-Compute Depth Bounds isn’t being used
- Partition Mode – can be Logarithmic, PSSM (mix between linear and logarithmic), or manual
- Split Distance 0/Split Distance 1/Split Distance 2/Split Distance 3 – manual partition depths
- PSSM Lambda – mix between linear and logarithmic partitioning when PartitionMode == PSSM
- Visualize Cascades – colors pixels based on which cascade they select
Shadow map filtering is the other important aspect of reducing artifacts. Generally it’s needed to reduce aliasing due to undersampling the geometry being rasterized into the shadow map, but it’s also useful for cases where the shadow map itself is being undersampled by the pixel shader. The most common technique for a long time has been Percentage Closer Filtering (PCF), which basically amounts to sampling a normal shadow map, performing the shadow comparison, and then filtering the result of that comparison. Nvidia hardware has been able to do a 2×2 bilinear PCF kernel in hardware since…forever, and it’s required of all DX10-capable hardware. Just about every PS3 game takes advantage of this feature, and Xbox 360 games would too if the GPU supported it. In general you see lots of 2×2 or 3×3 grid-shaped PCF kernels with either a box filter or a triangle filter. A few games (notably the Crysis games) use a randomized sample pattern with sample points located on a disc, which trades regular aliasing for noise. With DX11 hardware there’s support for GatherCmp, which essentially gives you the results of 4 shadow comparisons performed for the relevant 2×2 group of texels. With this you can efficiently implement large (7×7 or even 9×9) filter kernels with minimal fetch instructions, and still use arbitrary filter kernels. In fact there was an article in GPU Pro called “Fast Conventional Shadow Filtering” by Holger Gruen, that did exactly this, and even provided source code. It can be stupidly fast…in my sample app going from 2×2 PCF to 7×7 PCF only adds about 0.4ms when rendering at 1920×1080 on my AMD 7950. For comparison, a normal grid sampling approach adds about 2-3ms in my sample app for the maximum level of filtering. The big disadvantage to the fixed kernel modes is that they rely on the compiler to unroll the loops, which make for some sloooowwww compile times. The sample app uses a compilation cache so you won’t notice it if you just start it up, but without the cache you’ll see it takes quite while due to the many shader permutations being used. For that reason I decided to stick with a single kernel shape (disc) rather than using all of the shapes from the GPU Pro article, since compilation times were already bad enough.
So far the only real competitor to gain any traction in games is Variance Shadow Maps (VSM). I won’t go deep into the specifics since the original paper and GPU Gems article do a great job of explaining it. But the basic gist is that you work in terms of the mean and variance of a distribution of depth values at a certain texel, and then use use that distribution to estimate the probability of a pixel being in shadow. The end result is that you gain the ability to filter the shadow map without having to perform a comparison, which means that you can use hardware filtering (including mipmaps, anistropy or even MSAA) and that you can pre-filter the shadow map with standard “full-screen” blur pass. Another important aspect is that you generally don’t suffer from the same biasing issues that you do with PCF. There’s some issues of performance and memory, since you now need to store 2 high-precision values in your shadow map instead of just 1. But in general the biggest problem is the light bleeding, which occurs when there’s a large depth range between occluders and receivers. Lauritzen attempted to address this a few years later by applying an exponential warp to the depth values stored in the shadow map, and performing the filtering in log space. It’s generally quite effective, but it requires high-precision floating point storage to accommodate the warping. For maximum quality he also proposed storing a negative term, which requires an extra 2 components in the shadow map. In total that makes for 4x FP32 components per texel, which is definitely not light in terms of bandwidth! However you could say that it arguably produces the highest-quality results, and it does so without having to muck around with biasing. This is especially true when pre-filtering, MSAA, anisotropic filtering, and mipmaps are all used, although each of those brings about additional cost. To provide some real numbers, using EVSM4 with 2048×2048 cascades, 4xMSAA, mipmaps, 8xAF, and highest-level filtering (9 texels wide for the first cascade), adds about 11.5ms relative to a 7×7 fixed PCF kernel. A more reasonable approach would be to go with 1024×1024 shadow maps with 4xMSAA, which is around 3ms slower than the PCF version.
These are the shadow filtering modes that are implemented:
- FixedSizePCF – optimized GatherCmp PCF with disc-shaped kernel
- GridPCF – manual grid-based PCF sampling using NxN samples
- RandomDiscPCF – randomized samples on a Poisson disc, with optional per-pixel randomized rotation
- OptimizedPCF – similar to FixedSizePCF, but uses bilinear PCF samples to implement a uniform filter kernel.
- VSM – variance shadow maps
- EVSM2 – exponential variance shadow maps, positive term only
- EVSM4 – exponential variance shadow maps, both positive and negative terms
- MSM Hamburger – moment shadow mapping, using the “Hamburger 4MSM” technique from the paper.
- MSM Hausdorff – moment shadow mapping, using the “Hausdorff 4MSM” technique from the paper
Here’s the options available in my sample related to filtering:
- Shadow Mode – the shadow filtering mode, can be one of the above values
- Fixed Filter Size – the kernel width in texels for the FixedSizePCF mode, can be 2×2 though 9×9
- Filter Size – the kernel width in fractional world space units for all other filtering modes. For the VSM modes, it’s used in a pre-filtering pass.
- Filter Across Cascades – blends between two cascade results at cascade boundaries to hide the transition
- Num Disc Samples – number of samples to use for RandomDiscPCF mode
- Randomize Disc Offsets – if enabled, applies a per-pixel randomized rotation to the disc samples
- Shadow MSAA – MSAA level to use for VSM, EVSM, and MSM modes
- VSM/MSM Format – the precision to use for VSM, EVSM, and MSM shadow maps. Can be 16bit or 32bit . For VSM the textures will use a UNORM format, for EVSM they will be FLOAT. For the MSM 16-bit version, the optimized quantization scheme from the paper is used to store the data in UNORM texture.
- Shadow Anisotropy – anisotropic filtering level to use for VSM, EVSM, and MSM
- Enable Shadow Mips – enables mipmap generation and sampling for VSM, EVSM, and MSM
- Positive Exponent/Negative Exponent – the exponential warping factors for the positive and negative components of EVSM
- Light Bleeding Reduction – reduces light bleeding for VSM/EVSM/MSM, but results in over-darkening
In general I try to keep the filtering kernel fixed in world space across each cascade by adjusting the kernel width based on the cascade size. The one exception is the FixedSizePCF mode, which uses the same size kernel for all cascades. I did this because I didn’t think that branching over the fixed kernels would be a great idea. Matching the filter kernel for each cascade is nice because it helps hide the seams at cascade transitions, which means you don’t have to try to filter across adjacent cascades in order to hide them. It also means that you don’t have to use wider kernels on more distant pixels, although this can sometimes lead to visible aliasing on distant surfaces.
I didn’t put a whole lot of effort into the “RandomDiscPCF” mode, so it doesn’t produce optimal results. The randomization is done per-pixel, which isn’t great since you can clearly see the random pattern tiling over the screen as the camera moves. For a better comparison you would probably want to do something similar to what Crytek does, and tile a (stable) pattern over each cascade in shadowmap space.
When using conventional PCF filtering, biasing is essential for avoiding “shadow acne” artifacts. Unfortunately, it’s usually pretty tricky to get it right across different scenes and lighting scenarios. Just a bit too much bias will end up killing shadows entirely for small features, which can cause characters too look very bad. My sample app exposes 3 kinds of biasing: manual depth offset, manual normal-based offset, and automatic depth offset based on receiver slope. The manual depth offset, called simply Bias in the UI, simply subtracts a value from the pixel depth used to compare against the shadow map depth. Since the shadow map depth is a [0, 1] value that’s normalized to the depth range of the cascade, the bias value represents a variable size in world space for each cascade. The normal-based offset, called Offset Scale in the UI, is based on “Normal Offset Shadows”, which was a poster presented at GDC 2011. The basic idea is you actually create a new virtual shadow receiver position by offsetting from the actual pixel position in the direction of the normal. The trick is that you offset more when the surface normal is more perpendicular to the light direction. Angelo Pesce has a hand-drawn diagram explaining the same basic premise on his blog, if you’re having trouble picturing it.. This technique can actually produce decent results, especially given how cheap it is. However since you’re offsetting the receiver position, you actually “move” the shadows a bit which is a bit weird to see as you tweak the value. Since the offset is a world-space distance, in my sample I scale it based on the depth range of the cascade in order to make it consistent across cascade boundaries. If you want to try using it, I recommend starting with a small manual bias of 0.0015 or so and then slowly increasing the Offset Scale to about 1.0 or 2.0. Finally we have the automatic solution, where we attempt to compute the “perfect” bias amount based on the slope of the receiver. This setting is called Use Receiver Plane Depth Bias in the sample. To determine the slope, screen-space derivatives are used in the pixel shader. When it works, it’s fantastic. However it will still run into degenerate cases where it can produce unpredictable results, which is something that often happens when working with screen-space derivatives.
There are also separate biasing parameters for VSM and MSM techniques. The “VSM Bias” affects VSM and EVSM, while “MSM Depth Bias” and “MSM Moment Bias” are used for MSM. For VSM and 32-bit EVSM, a bias value of 0.01 (which corresponds to an actual value of 0.0001) is sufficent. However for 16-bit EVSM a bias of up to 0.5 (0.005) is required to alleviate precision issues. For only the moment bias is particularly relevent. This value needs to be at least 0.001-0.003 (0.000001-0.000003) for 32-bit modes, while the quantized 16-bit mode requires a higher bias of around 0.03 (0.00003). Note that for both EVSM and MSM increasing the bias can also increase the amount of visible light bleeding.
GPU-Driven Cascade Setup and Scene Submission
This is a topic that’s both fun, and really frustrating. It’s fun because you can really start to exploit the flexibility of DX11-class GPU’s, and begin to break free of the old “CPU spoon-feeds the GPU” model that we’ve been using for so long now. It’s also frustrating because the API’s still hold us back a quite a bit in terms of letting the GPU generate its own commands. Either way it’s something I see people talk about but not a lot of people actually doing it, so I thought I’d give it a try for this sample. There’s actually 2 reasons to try something like this. The first is that if you can offload enough work to the GPU, you can avoid the heavy CPU overhead of frustum culling and drawing and/or batching lots and lots of meshes. The second is that it lets you do SDSM-style cascade optimization based on the depth buffer without having to read back values from the GPU to the CPU, which is always a painful way of doing things.
The obvious path to implementing GPU scene submission would be to make use of DrawInstancedIndirect/DrawIndexedInstancedIndirect. These API’s are fairly simple to use: you simply write the parameters to a buffer (with the parameters being the same ones that you would normally pass on the CPU for the non-indirect version), and then on the CPU you specify that buffer when calling the appropriate function. Since instance count is one of the parameters, you can implement culling on the GPU my setting the instance count to 0 when a mesh shouldn’t be rendered. However it turns out that this isn’t really useful, since you still need to go through the overhead of submitting draw calls and setting associated state on the CPU. In fact the situation is worse than doing it all on the CPU, since you have to submit each draw call even if it will be frustum culled.
Instead of using indirect draws, I decided to make a simple GPU-based batching system. Shadow map rendering is inherently more batchable than normal rendering, since you typically don’t need to worry about having specialized shaders or binding lots of textures when you’re only rendering depth. In my sample I take advantage of this by using a compute shader to generate one great big index buffer based on the results of a frustum culling pass, which can then be rendered in a single draw call. It’s really very simple: during initialization I generate a buffer containing all vertex positions, a buffer containing all indices (offset to reflect the vertex position in the combined position buffer), and a structured buffer containing the parameters for every draw call (index start, num indices, and a bounding sphere). When it’s time to batch, I run a compute shader with one thread per draw call that culls the bounding sphere against the view frustum. If it passes, the thread then “allocates” some space in the output index buffer by performing an atomic add on a value in a RWByteAddressBuffer containing the total number of culled indices that will be present in output index buffer (note that on my 7950 you should be able to just do the atomic in GDS like you would for an append buffer, but unfortunately in HLSL you can only increment by 1 using IncrementCounter()). I also append the draw call data to an append buffer for use in the next pass:
const uint drawIdx = TGSize * GroupID.x + ThreadIndex;
if(drawIdx >= NumDrawCalls)
DrawCall drawCall = DrawCalls[drawIdx];
culledDraw.SrcIndexStart = drawCall.StartIndex;
culledDraw.NumIndices = drawCall.NumIndices;
Once that’s completed, I then launch another compute shader that uses 1 thread group per draw call to copy all of the indices from the source index buffer to the final output index buffer that will be used for rendering. This shader is also very simple: it simply looks up the draw call info from the append buffer that tells it which indices to copy and where they should be copied, and then loops enough times for all indices to be copied in parallel by the different threads inside the thread group. The final result is an index buffer containing all of the culled draw calls, ready to be drawn in a single draw call:
const uint drawIdx = GroupID.x;
CulledDraw drawCall = CulledDraws[drawIdx];
for(uint currIdx = ThreadIndex; currIdx < drawCall.NumIndices; currIdx += TGSize)
const uint srcIdx = currIdx + drawCall.SrcIndexStart;
const uint dstIdx = currIdx + drawCall.DstIndexStart;
CulledIndices[dstIdx] = Indices[srcIdx];
In order to avoid the min/max depth readback, I also had to port my cascade setup code to a compute shader so that the entire process could remain on the GPU. I was surprised to find that this was actually quite a bit more difficult than writing the batching system. Porting arbitrary C++ code to HLSL can be somewhat tedious, due to the various limitations of the language. I also ran into a rather ugly bug in the HLSL compiler where it kept trying to keep matrices in column-major order no matter what code I wrote and what declarations I used, which I suppose it tries to do as an optimization. However this really messed me up when I tried to write the matrix into a structured buffer, and expected it to be row-major. My advice for the future: if you need to write a matrix from a compute shader, just use a StructuredBuffer<float4> and write out the rows manually. Ultimately after much hair-pulling I got it to work, and finally achieved my goal of 0-latency depth reduction with no CPU readback!
In terms of performance, the GPU-driven path comes in about 0.8ms slower than the CPU version when the CPU uses 1 frame of latency for reading back the min/max depth results. I’m not entirely sure why it’s that much slower, although I haven’t spend much time trying to profile the batching shaders or the cascade setup shader. I also wouldn’t be surprised if the actual mesh rendering ends up being a bit slower, since I use 32-bit indices when batching on the GPU as opposed to 16-bit when submitting with the CPU. However when using the 0 frames of latency for the CPU readback, the GPU version turns it around and comes in a full millsecond faster on my PC. Obviously it makes sense that any additional GPU overhead would make up for the stall that occurs on the CPU and GPU when having to immediately read back data. One thing that I’ll point out is that the GPU version is quite a bit slower than the CPU version when VSM’s are used and pre-filtering is enabled. This is because the CPU path chooses an optimized blur shader based on the filter width for a cascade, and early outs of the blur process if no filtering is required. For the GPU path I got lazy and just used one blur shader that uses a dynamic loop, and it always runs the blur passes even if no filtering is requested. There’s no technical reason why you couldn’t do it the same way as the CPU path with enough motivation and DrawIndirect’s.