Come see me talk at GDC 2014

My fellow lead graphics programmer David Neubelt and I will be at GDC next week, talking about the rendering technology behind The Order: 1886. Unfortunately the talk came together a bit late, and so it initially started from the talk that we gave at SIGGRAPH last year (which is why it has the same title). However we don’t want to just rehash the same material, so we’ve added tons of new slides and revamped the old ones. The resulting presentation has much more of an engineering focus, and will cover a lot of the nuts and bolts behind our engine and the material pipeline. Some of the new things we’ll be covering include:

  • Dynamic lighting
  • Baked diffuse and specular lighting
  • Baked and real-time ambient occlusion
  • Our shader system, and how it interacts with our material pipeline
  • Details of how we perform compositing in our build pipeline
  • Hair shading
  • Practical implementation issues with our shading model
  • Performance and memory statistics
  • Several breakdowns showing how various rendering techniques combine to form the final image
  • At least one new funny picture

We want the presentation to be fresh and informative even if you saw our SIGGRAPH presentation, and I’m pretty sure that we have enough new material to ensure that it will be. So if you’re interested, come by at 5:00 PM on Wednesday. If you can’t make it, I’m going to try to make sure that we have the slides up for download as soon as possible.

Weighted Blended Order-Independent Transparency

http://mynameismjp.files.wordpress.com/2014/02/blendedoit.zip

Back in December, Morgan McGuire and Louis Bavoil published a paper called Weighted Blended Order-Independent Transparency. In case you haven’t read it yet (you really should!), it proposes an OIT scheme that uses a weighted blend of all surfaces that overlap a given pixel. In other words, finalColor = w0 * c0 + w1 * c1 + w2 * c2…etc. With a weighted blend the order of rendering no longer matters, which frees you from the never-ending nightmare of sorting. You can actually achieve results that are very close to a traditional sorted alpha blend, as long as your per-surface weights are carefully chosen. Obviously it’s that last part that makes it tricky, and consequently McGuire and Bavoil’s primary contribution is proposing a weighting function that’s based on the view-space depth of a given surface. The reasoning behind using a depth-based weighting function is intuitive: closer surfaces obscure the surfaces behind them, so the closer surfaces should be weighted higher when blending.

In practice the implementation is really simple: in the pixel shader you compute color, opacity, and a weight value based on both depth and opacity. You then output float4(color * opacity, opacity) * weight to one render target, while also outputting weight alone to a second render target (the first RT needs to be fp16 RGBA for HDR, but the second can just be R8_UNORM or R16_UNORM). Both render targets require special blending modes, but they can be represented by the standard fixed-function blending available in GL/DX. After rendering all of your transparents, you then perform a full-screen “resolve” pass where you normalize the weights and then blend with the opaque surfaces underneath the transparents. Obviously this is really appealing since you completely remove any dependency on the ordering of draw calls, and you don’t need to build per-pixel lists or anything like that (which is nice for us mortals who don’t have pixel sync). The downside is that you’re at the mercy of your weighting function, and you potentially open yourself up to new kinds of artifacts depending on what sort of weighting function is used.
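
To make that concrete, here’s a minimal sketch of the two-target output in HLSL. The weighting function below is purely illustrative (the paper proposes several carefully tuned depth-based functions), and the helper parameters are assumptions rather than anything from the paper or my sample:

// Sketch of the two-render-target output described above. The weighting
// function is a stand-in; the paper proposes several tuned depth-based
// functions, so treat this one as illustrative only.
struct OITOutput
{
    float4 Accum  : SV_Target0;     // fp16 RGBA target
    float  Weight : SV_Target1;     // R8_UNORM or R16_UNORM target
};

OITOutput ComputeOITOutput(float3 surfaceColor, float opacity, float viewSpaceZ)
{
    // Closer surfaces get (much) larger weights, scaled by opacity
    float weight = opacity * clamp(10.0f / (1e-5f + pow(viewSpaceZ * 0.1f, 3.0f)),
                                   0.01f, 3000.0f);

    OITOutput output;
    output.Accum  = float4(surfaceColor * opacity, opacity) * weight;
    output.Weight = weight;     // used by the resolve pass to normalize
    return output;
}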

When the paper came out I read it and was naturally interested, so I quickly hacked together a sample project using another project as a base. Unfortunately over the past 2 months there’s been holidays, the flu, and several weeks of long hours at work so that we could finish up a major milestone. So while I’ve had time to visit my family and optimize our engine for PS4, I haven’t really had any time to come up with a proper sample app that really lets you explore the BOIT technique in a variety of scenarios. However I really hate not having source code and a working sample app to go with papers, so I’m going to release it so that others at least have something they can use for evaluating the proposed algorithm. Hopefully it’s useful, despite how bad the test scene is. Basically it’s just a simple Cornell-box-like scene made up of a few walls, a sphere, a cylinder, a torus, and a sky (I normally use it for GI testing), but I added the ability to toggle through 2 alternative albedo maps: a smoke texture and a tree texture. It doesn’t look great, but it’s enough to get a few layers of transparency with varying lighting conditions:

Scene_Normal

The sample is based on another project I’ve been working on for quite some time with my fellow graphics programmer David Neubelt, where we’ve been exploring new techniques for baking GI into lightmaps. For that project I had written a simple multithreaded ray-tracer using Embree 2.0 (which is an awesome library, and I highly recommend it), so I re-purposed it into a ground-truth renderer for this sample. You can toggle it on and off  to see what the scene would look like with perfect sorting, which is useful for evaluating the “correctness” of the BOIT algorithm. It’s very fast on my mighty 3.6GHz Core i7, but it might chug a bit for those of you running on mobile CPU’s. If that’s true I apologize, however I made sure that all of the UI and controls are decoupled from the ray-tracing framerate so that the app remains responsive.

I’d love to do a more thorough write-up that really goes into depth on the advantages and disadvantages in multiple scenarios, but I’m afraid I just don’t have the time for it at the moment. So instead I’ll just share some quick thoughts and screenshots:

It’s pretty good for surfaces with low to medium opacity  – with the smoke texture applied, it actually achieves decent results. The biggest issues are where there’s a large difference in the lighting intensity between two overlapping surfaces, which makes sense since this also applies to improperly sorted surfaces rendered with traditional alpha blending. Top image is with Blended OIT, bottom image is ground truth:

Smoke_BOIT
Smoke_Ref

If you look at the area where the closer, brighter surface overlaps the darker surface on the cylinder you can see an example of where the results differ from the ground-truth render. Fortunately the depth weighting produces results that don’t look immediately “wrong”, which is certainly a big step up from unsorted alpha blending. Here’s another image of the test scene with default albedo maps, with an overall opacity of 0.25:

LowOpacity_BOIT
LowOpacity_Ref

The technique fails for surfaces with high opacity – one case that the algorithm has trouble with is surfaces with opacity = 1.0. Since it uses a weighted blend, the weight of the closest surface has to be incredibly high relative to any other surfaces in order for it to appear opaque. Here’s the test scene with all surfaces using an opacity of 1.0:

HiOpacity_BOIT
HiOpacity_Ref

You’ll notice in the image that the algorithm does actually work correctly with opacity = 1 if there’s no overlap of transparent surfaces, so it does hold up in that particular case. However in general this problem makes it unsuitable for materials like foliage, where large portions of the surface need to be fully opaque. Here’s the test scene using a tree texture, which illustrates the same problem:

Tree_BOIT
Tree_Ref

Like I said earlier, you really need to make the closest surface have an extremely high weight relative to the surfaces behind it if you want it to appear opaque. One simple thing you could do is to keep track of the depth of the closest surface (say in a depth buffer), and then artificially boost the weight of surfaces whose depth matches the depth buffer value. If you do this (and also scale your “boost” factor by opacity) you get something like this:

Tree_DBWeight

This result looks quite a bit better, although messing with the weights changes the alpha gradients which gives it a different look. This approach obviously has a lot of failure cases. Since you’re relying on depth, you could easily create discontinuities at geometry edges. You can also get situations like this, where a surface visible through a transparent portion of the closest surface doesn’t get the weight boost and remains translucent in appearance:

DBWeightFail_BOIT

DBWeight_Ref

Notice how the second tree trunk appears to have a low opacity since it’s behind the closest surface. The other major downside is that you need to render your transparents in a depth prepass, which costs performance as well as the memory for an extra depth buffer. However you may already be doing that in order to optimize tiled forward rendering of transparents. Regardless, I doubt it would be useful except in certain special-case scenarios, and it’s probably easier (and cheaper) to just stick to alpha-test or A2C for those cases.
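
For what it’s worth, the weight-boosting hack described above looks something like this in shader terms. The depth epsilon, boost factor, and buffer name are all assumptions on my part:

// Hypothetical weight boost for the front-most surface: if this pixel's depth
// matches the transparent depth prepass, crank up its blend weight (scaled by
// opacity so low-opacity surfaces don't suddenly become solid).
Texture2D<float> TransparentDepthBuffer : register(t0);

float BoostClosestSurfaceWeight(float weight, float opacity, float pixelDepth, uint2 pixelPos)
{
    const float DepthEpsilon = 0.0001f;     // tolerance for "same surface as the prepass"
    const float MaxBoost = 1000.0f;         // assumed boost factor

    float closestDepth = TransparentDepthBuffer[pixelPos];
    if(abs(pixelDepth - closestDepth) <= DepthEpsilon)
        weight *= lerp(1.0f, MaxBoost, opacity);

    return weight;
}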

Is it usable? - I’m not sure yet. I feel like it would take a lot of testing with the wide range of transparents in our game before knowing whether it’s going to work out. It’s too bad that it has failure cases, but if we’re going to be honest, the bar is pretty damn low when it comes to transparents in games. In our engine we make an attempt to sort by depth, but our artists frequently have to resort to manually setting “sort priorities” in order to prevent temporal issues from meshes constantly switching their draw order. The Blended OIT algorithm, on the other hand, may produce incorrect results, but those results are stable over time. However I feel the much bigger issue with traditional transparent rendering is that ordering constraints are fundamentally at odds with rendering performance. Good performance requires using instancing, reducing state changes, and rendering to low-resolution render targets. All 3 of those are incompatible with rendering based on Z order, which means living with lots of sorting issues if you want optimal performance. With that in mind it really feels like it’s hard to do worse than the current status quo.

That’s about all I have for now. Feel free to download the demo and play around with it. If you missed it, the download link is at the top of the page. Also, please let me know if you have any thoughts or ideas regarding the practicality of the technique, since I would definitely be interested in discussing it further.

Sample Framework Updates

You may have noticed that my latest sample now has a proper UI instead of the homegrown sliders and keyboard toggles that I was using in my older samples. What you might not have noticed is that there’s a whole bunch of behind-the-scenes changes to go with that new UI! Before I ramble on, here’s a quick bullet-point list of the new features:

  • Switched to VS2012 and adopted a few C++11 features
  • New UI back-end provided by AntTweakBar
  • C#-based data-definition format for auto-generating UI
  • Shader hot-swapping
  • Better shader caching, and compressed cache files

It occurred to me a little while ago that I could try to develop my framework into something that enables rapid prototyping, instead of just being some random bits of cobbled-together code. I’m not sure if anybody else will use it for anything, but so far it’s working out pretty well for me.

VS 2012

A little while ago I switched to working in VS 2012 with the Windows 8 SDK for my samples, but continued using the VS 2010 project format and toolset in case anybody was stuck on 2010 and wanted to compile my code. For this latest sample I decided that legacy support wasn’t worth it, and made the full switch to the newer compiler. I haven’t done any really crazy C++11 things yet, but now that I’ve started using enum class and nullptr I never want to go back. Ever since learning C# I’ve always wanted C++ enums to use a similar syntax, and C++11 finally fulfilled my wish. I suspect that I’ll feel the same way when I start using non-static data member initializers (once I switch to VS 2013, of course).

New UI

When I found out about AntTweakBar and how easy it is to integrate, I felt a little silly for ever trying to write my own ghetto UI widgets. If you haven’t seen it before, it basically shows up as a window containing a list of tweakable app settings, which is exactly what I need in my sample apps. It also has a neat alternative to sliders where you sort of rotate a virtual lever, and the speed increases depending on how far away you move the mouse from the center point. Here’s what it looks like integrated into my shadows sample:

Ant_UI

If you’re thinking of using AntTweakBar in your code, you might want to grab the files TwHelper.h/cpp from my sample framework. They provide a bunch of wrapper functions over TwSetParam that add type safety, which makes the library a lot easier to work with. I also have a much more comprehensive set of Settings classes that further wrap those functions, but those are a lot more tightly coupled to the rest of the framework.

Shader Hot-Swapping

This one is a no-brainer, and I’m not sure why I waited so long to do it. I decided not to implement it using the Win32 file-watching API. We’ve used that at work, and it quickly became the most hated part of our codebase. Instead I took the simple and painless route of checking the timestamp of a single file every N frames, which works great as long as you don’t have thousands of files to go through (usually I only have a dozen or so).

UI Data Definition

For a long time I’ve been unhappy with my old way of defining application settings and setting up the UI for them. Previously I did everything in C++ code by declaring a global variable for each setting. This was nice because I got type safety and Visual Assist auto-complete whenever I wanted to access a setting, but it was also a bit cumbersome because I always had to touch code in multiple places whenever I wanted to add a new setting. This was especially annoying when I wanted to access a setting in a shader, since I had to manually handle adding it to a constant buffer and setting the value before binding it to a shader. After trying and failing multiple times to come up with something better using just C++, I thought it might be fun to try something a bit less…conventional. Ultimately I drew inspiration from game studios that define their data in a non-C++ language and then use that to auto-generate C++ code for use at runtime. If you can do it this way you get the best of both worlds: you define data in a simple language that can express it elegantly, and you still get all of your nice type safety and other C++ benefits. You can even add runtime reflection support by generating code that fills out data structures containing the info that you need to reflect on a type. It sounded crazy to do it for a sample framework, but I thought it would be fun to get out of my comfort zone a bit and try something new.

Ultimately I ended up using C# as my data-definition language. It not only has the required feature set, but I also used to be somewhat proficient in it several years ago. In particular I really like the Attribute functionality in C#,  and thought it would be perfect for defining metadata to go along with a setting. Here’s an example of how I ended up using them for the “Bias” setting from my Shadows sample:

BiasSetting

For enum settings, I just declare an enum in C# and use that new type when defining the setting. I also used an attribute to specify a custom string to associate with each enum value:

EnumSetting

To finish it up, I added simple C# proxies for the Color, Orientation, and Direction setting types supported by the UI system.

Here’s how it all ties together: I define all of the settings in a file called AppSettings.cs, which includes classes for splitting the settings into groups. This file is added to my Visual Studio C++ project, and set to use a custom build step that runs before the C++ compilation step. This build step passes the file to SettingsCompiler.exe, which is a command-line C# app created by a C# project in the same solution. This app takes the C# settings file and invokes the C# compiler (don’t you love languages that can compile themselves?) so that it can be compiled as an in-memory assembly. That assembly is then reflected to determine the settings that are declared in the file, and also to extract the various metadata from the attributes. Since the custom attribute classes need to be referenced by both the SettingsCompiler exe as well as the settings code being compiled, I had to put all of them in a separate DLL project called SettingsCompilerAttributes. Once all of the appropriate data is gathered, the settings compiler then generates and outputs AppSettings.h and AppSettings.cpp. These files contain global definitions of the various settings using the appropriate C++ UI types, and also contain code for initializing and registering those settings with the UI system. These files are added to the C++ project, so that they can be compiled and linked just like any other C++ code. On top of that, the settings compiler also spits out an HLSL file that declares a constant buffer containing all of the relevant settings (a setting can opt out of the constant buffer if it wants by using an attribute). The C++ files then have code generated for creating a matching constant buffer resource, filling it out with the setting values once a frame, and binding it to all shader stages at the appropriate slot. This means that all a shader needs to do is #include the file, and it can use the setting. Here’s a diagram that shows the basic setup for the whole thing:

SettingsCompiler

This actually works out really nicely in Visual Studio: you just hit F7 and everything builds in the right order. The settings compiler will gather errors from the C# compiler and output them to stdout, which means that if you have a syntax error it gets reported to the VS output window just as if you were compiling it normally through a C# project. MSBuild will even track the timestamp on AppSettings.cs, and won’t run the settings compiler unless it’s newer than the timestamp on AppSettings.h, AppSettings.cpp or AppSettings.hlsl. Sure it’s a really complicated and over-engineered way of handling my problem, but it works and I had fun doing it.

Future Work

I think that the next thing that I’ll improve will be model loading. It’s pretty limiting working with .sdkmesh, and I’d like to be able to work with a wider variety of test scenes. Perhaps I’ll integrate Assimp, or use it to make a simple converter to a custom format. I’d also like to flesh out my SH math and shader code a bit, and add some more useful functionality.

A Sampling of Shadow Techniques

A little over a year ago I was looking to overhaul our shadow rendering at work in order to improve overall quality, as well as simplify the workflow for the lighting artists (tweaking biases all day isn’t fun for anybody). After doing yet another round of research into modern shadow mapping techniques, I decided to do what I usually do and started working on a sample project that I could use as a platform for experimentation and comparison. And so I did just that. I had always intended to put it on the blog, since I thought it was likely that other people would be evaluating some of the same techniques as they upgrade their engines for next-gen consoles. But I was a bit lazy about cleaning it up (there was a lot of code!), and it wasn’t the most exciting thing that I was working on, so it sort of fell by the wayside. Fast forward to a year later, and I found myself looking for a testbed for some major changes that I was planning for the settings and UI portion of my sample framework. My shadows sample came to mind since it was chock-full of settings, and that led me to finally clean it up and get it ready for sharing. So here it is! Note that I’ve fully switched over to VS 2012 now, so you’ll need it to open the project and compile the code. If you don’t have it installed, then you may need to download and install the VC++ 2012 Redistributable in order to run the pre-compiled binary.

http://mynameismjp.files.wordpress.com/2013/11/shadows.zip

Update 9/12/2013
 - Fixed a bug with cascade stabilization behaving incorrectly for very small partitions
 - Restore previous min/max cascade depth when disabling Auto-Compute Depth Bounds

Thanks to Stephen Hill for pointing out the issues!

Update 9/18/2013
 - Ignacio Castaño was kind enough to share the PCF technique being used in The Witness, which he integrated into the sample under the name “OptimizedPCF”. Thanks Ignacio!
Update 11/3/2013
- Fixed SettingsCompiler string formatting issues for non-US cultures

The sample project is set up as a typical “cascaded shadow map from a directional light” scenario, in the same vein as the CascadedShadowMaps11 sample from the old DX SDK or the more recent Sample Distribution Shadow Maps sample from Intel. In fact I even use the same Powerplant scene from both samples, although I also added in a human-sized model so that you can get a rough idea of how the shadowing looks on characters (which can be fairly important if you want your shadows to look good in cutscenes without having to go crazy with character-specific lights and shadows). The basic cascade rendering and setup is pretty similar to the Intel sample: a single “global” shadow matrix is created every frame based on the light direction and current camera position/orientation, using an orthographic projection with width, height, and depth equal to 1.0. Then for each cascade a projection is created that’s fit to the cascade, which is used for rendering geometry to the shadow map. For sampling, each cascade’s projection is described as a 3D scale and offset that’s applied to the UV space of the “global” shadow projection. That way you just use one matrix in the shader to calculate shadow map UV coordinates + depth, and then apply the scale and offset to compute the UV coordinates + depth for the cascade that you want to sample. Unlike the Intel sample I didn’t use any kind of deferred rendering, so I decided to just fix the number of cascades at 4 instead of making it tweakable at runtime.
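
In shader terms, the sampling scheme I just described boils down to something like this (the constant buffer layout and names are assumptions, not necessarily what the sample uses):

// One "global" shadow matrix plus a per-cascade scale/offset, as described above.
cbuffer ShadowConstants : register(b0)
{
    float4x4 ShadowMatrix;          // world -> global shadow UV + depth
    float4 CascadeOffsets[4];       // per-cascade offset applied in that space
    float4 CascadeScales[4];        // per-cascade scale applied in that space
}

float3 GetCascadeShadowPos(in float3 positionWS, in uint cascadeIdx)
{
    // Project into the shared shadow space (orthographic, so no perspective divide needed)
    float3 shadowPos = mul(float4(positionWS, 1.0f), ShadowMatrix).xyz;

    // Remap into the selected cascade's UV + depth range
    return shadowPos * CascadeScales[cascadeIdx].xyz + CascadeOffsets[cascadeIdx].xyz;
}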

Cascade Optimization

Optimizing how you fit your cascades to your scene and current viewpoint is pretty crucial for reducing perspective aliasing and the artifacts that it causes. The old-school way of doing this for CSM is to chop up your entire visible depth range into N slices using some sort of partitioning scheme (logarithmic, linear, mixed, manual, etc.). Then for each slice of the view frustum, you’d tightly fit an AABB to the slice and use that as the parameters for your orthographic projection for that slice. This gives you the best effective resolution for a given cascade partition, however with 2006-era shadow map sizes you still generally end up with an effective resolution that’s pretty low relative to the screen resolution. Combine this with 2006-era filtering (2×2 PCF in a lot of cases), and you end up with quite a bit of aliasing. This aliasing was exceptionally bad because your cascade projections translate and scale as your camera moves and rotates, which results in rasterization sample points changing from frame to frame. The end result was crawling shadow edges, even from static geometry. The most popular solution for this problem was to trade some effective shadow map resolution for stable sample points that don’t move from frame to frame. This was first proposed (to my knowledge) by Michal Valient in his ShaderX6 article entitled “Stable Cascaded Shadow Maps”. The basic idea is that instead of tightly mapping your orthographic projection to your cascade split, you map it in such a way that the projection won’t change as the camera rotates. The way that Michal did it was to fit a sphere to the entire 3D frustum split and then fit the projection to that sphere, but you could do it any way that gives you a consistent projection size. To handle the translation issue, the projections are “snapped” to texel-sized increments so that you don’t get sub-texel sample movement. This ends up working really well, provided that your cascade partitions never change (which means that changing your FOV or near/far clip planes can cause issues). In general the stability ends up being a net win despite the reduced effective shadow map resolution, however small features and dynamic objects end up suffering.
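
To make the snapping step concrete, here’s a rough sketch assuming a sphere-fit cascade and a square shadow map (not lifted verbatim from the article or my sample):

// Snap the light-space cascade center to texel-sized increments so that
// rasterization sample points stay fixed as the camera translates.
float3 StabilizeCascadeCenter(float3 cascadeCenterLS, float sphereRadius, float shadowMapSize)
{
    // The ortho projection is sized to the bounding sphere, so one texel
    // covers this much distance in light space
    float texelSize = (2.0f * sphereRadius) / shadowMapSize;

    cascadeCenterLS.xy = floor(cascadeCenterLS.xy / texelSize) * texelSize;
    return cascadeCenterLS;
}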

2 years ago Andrew Lauritzen gave a talk on what he called “Sample Distribution Shadow Maps”, and released the sample that I mentioned earlier. He proposed that instead of stabilizing the cascades, we could instead focus on reducing wasted resolution in the shadow map to a point where effective resolution is high enough to give us sub-pixel resolution when sampling the shadow map. If you can do that then you don’t really need to worry about your projection changing from frame to frame, provided that you use decent filtering when sampling the shadow map. The way that he proposed to do this was to analyze the depth buffer generated for the current viewpoint, and use it to automatically generate an optimal partitioning for the cascades. He tried a few complicated ways of achieving this that involved generating a depth histogram on the GPU, but also proposed a much more practical scheme that simply computes the min and max depth values. Once you have the min and max depth, you can very easily use that to clamp your cascades to the subset of your depth range that actually contains visible pixels. This might not sound like a big deal, but in practice it can give you huge improvements. The min Z in particular allows you to have a much more optimal partitioning, which you can get by using an “ideal” logarithmic partitioning scheme. The main downside is that you need to use the depth buffer, which puts you in the awful position of having the CPU dependent on results from the GPU if you want to do your shadow setup and scene submission on the CPU. In the sample code they simply stall and read back the reduction results, which isn’t really optimal at all in terms of performance. When you do something like this the CPU ends up waiting around for the driver to kick off commands to the GPU and for the GPU to finish processing them, and then you get a stall on the GPU while it sits around and waits for the CPU to start kicking off commands again. You can potentially get around this by doing what you normally do for queries and such, and deferring the readback for one or more frames. That way the results are (hopefully) already ready for readback, and so you don’t get any stalls. But this can cause you problems, since the cascades will trail a frame behind what’s actually on screen. So for instance if the camera moves towards an object, the min Z may not be low enough to fit all of the pixels in the first cascade and you’ll get artifacts. One potential workaround is to use the previous frame’s camera parameters to try to predict what the depth of a pixel will be for the next frame based on the instantaneous linear velocity of the camera, so that when you retrieve your min/max depth a frame late it actually contains the correct results. I’ve actually done this in the past (but not for this sample, I got lazy) and it works as long as the camera motion stays constant. However it won’t handle moving objects unless you store the per-pixel velocity with respect to depth and factor that into your calculations. The ultimate solution is to do all of the setup and submission on the GPU, but I’ll talk about that in detail later on.
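
For reference, the “ideal” logarithmic partitioning applied to the reduced [minZ, maxZ] range is just a one-liner (sketch, with hypothetical names):

// Logarithmic split scheme applied to the [minZ, maxZ] range computed by the
// GPU depth reduction, as described above.
float GetLogSplitDepth(uint splitIdx, uint numCascades, float minZ, float maxZ)
{
    float t = (splitIdx + 1) / float(numCascades);
    return minZ * pow(maxZ / minZ, t);
}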

These are the options implemented in the sample that affect cascade optimization:

  • Stabilize Cascades – enables cascade stabilization using the method from ShaderX6.
  • Auto-Compute Depth Bounds – computes the min and max Z on the GPU, and uses it to tightly fit the cascades as in SDSM.
  • Readback Latency – number of frames to wait for the depth reduction results
  • Min Cascade Distance/Max Cascade Distance – manual min and max Z when Auto-Compute Depth Bounds isn’t being used
  • Partition Mode – can be Logarithmic, PSSM (mix between linear and logarithmic), or manual
  • Split Distance 0/Split Distance 1/Split Distance 2/Split Distance 3 – manual partition depths
  • PSSM Lambda – mix between linear and logarithmic partitioning when PartitionMode == PSSM
  • Visualize Cascades – colors pixels based on which cascade they select

Shadow Filtering

Shadow map filtering is the other important aspect of reducing artifacts. Generally it’s needed to reduce aliasing due to undersampling the geometry being rasterized into the shadow map, but it’s also useful for cases where the shadow map itself is being undersampled by the pixel shader. The most common technique for a long time has been Percentage Closer Filtering (PCF), which basically amounts to sampling a normal shadow map, performing the shadow comparison, and then filtering the result of that comparison. Nvidia hardware has been able to do a 2×2 bilinear PCF kernel in hardware since…forever, and it’s required of all DX10-capable hardware. Just about every PS3 game takes advantage of this feature, and Xbox 360 games would too if the GPU supported it. In general you see lots of 2×2 or 3×3 grid-shaped PCF kernels with either a box filter or a triangle filter. A few games (notably the Crysis games) use a randomized sample pattern with sample points located on a disc, which trades regular aliasing for noise. With DX11 hardware there’s support for GatherCmp, which essentially gives you the results of 4 shadow comparisons performed for the relevant 2×2 group of texels. With this you can efficiently implement large (7×7 or even 9×9) filter kernels with minimal fetch instructions, and still use arbitrary filter kernels. In fact there was an article in GPU Pro called “Fast Conventional Shadow Filtering” by Holger Gruen that did exactly this, and even provided source code. It can be stupidly fast…in my sample app going from 2×2 PCF to 7×7 PCF only adds about 0.4ms when rendering at 1920×1080 on my AMD 7950. For comparison, a normal grid sampling approach adds about 2-3ms in my sample app for the maximum level of filtering. The big disadvantage to the fixed kernel modes is that they rely on the compiler to unroll the loops, which makes for some sloooowwww compile times. The sample app uses a compilation cache so you won’t notice it if you just start it up, but without the cache you’ll see that it takes quite a while due to the many shader permutations being used. For that reason I decided to stick with a single kernel shape (disc) rather than using all of the shapes from the GPU Pro article, since compilation times were already bad enough.
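
If you haven’t used GatherCmp before, here’s a trivial (and deliberately naive) example of the basic building block; the optimized kernels from the GPU Pro article weight and combine many of these fetches:

// Minimal GatherCmp usage: one fetch performs four depth comparisons for the
// 2x2 texel footprint around 'uv', which larger kernels build upon.
Texture2D<float> ShadowMap : register(t0);
SamplerComparisonState ShadowSampler : register(s0);

float SampleShadow2x2(float2 uv, float receiverDepth)
{
    float4 results = ShadowMap.GatherCmp(ShadowSampler, uv, receiverDepth);
    return dot(results, float4(0.25f, 0.25f, 0.25f, 0.25f));    // simple box filter
}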

So far the only real competitor to gain any traction in games is Variance Shadow Maps (VSM). I won’t go deep into the specifics since the original paper and GPU Gems article do a great job of explaining it. But the basic gist is that you work in terms of the mean and variance of a distribution of depth values at a certain texel, and then use that distribution to estimate the probability of a pixel being in shadow. The end result is that you gain the ability to filter the shadow map without having to perform a comparison, which means that you can use hardware filtering (including mipmaps, anisotropy, or even MSAA) and that you can pre-filter the shadow map with a standard “full-screen” blur pass. Another important aspect is that you generally don’t suffer from the same biasing issues that you do with PCF. There are some issues of performance and memory, since you now need to store 2 high-precision values in your shadow map instead of just 1. But in general the biggest problem is light bleeding, which occurs when there’s a large depth range between occluders and receivers. Lauritzen attempted to address this a few years later by applying an exponential warp to the depth values stored in the shadow map, and performing the filtering in log space. It’s generally quite effective, but it requires high-precision floating point storage to accommodate the warping. For maximum quality he also proposed storing a negative term, which requires an extra 2 components in the shadow map. In total that makes for 4x FP32 components per texel, which is definitely not light in terms of bandwidth! However you could say that it arguably produces the highest-quality results, and it does so without having to muck around with biasing. This is especially true when pre-filtering, MSAA, anisotropic filtering, and mipmaps are all used, although each of those brings additional cost. To provide some real numbers: using EVSM4 with 2048×2048 cascades, 4xMSAA, mipmaps, 8xAF, and the highest-level filtering (9 texels wide for the first cascade) adds about 11.5ms relative to a 7×7 fixed PCF kernel. A more reasonable approach would be to go with 1024×1024 shadow maps with 4xMSAA, which is around 3ms slower than the PCF version.
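
For anyone who hasn’t implemented VSM before, the core really is only a few lines. Here’s a simplified, positive-term-only sketch (the actual sample also handles the negative warp, light bleeding reduction, and other details):

// Exponential warp applied when rendering the shadow map (EVSM, positive term only)
float2 WarpDepthEVSM(float depth, float positiveExponent)
{
    float warped = exp(positiveExponent * depth);
    return float2(warped, warped * warped);     // E[x] and E[x^2] of the warped depth
}

// Chebyshev upper bound evaluated when sampling the (filtered) moments
float ShadowProbability(float2 moments, float warpedReceiverDepth, float minVariance)
{
    float variance = max(moments.y - moments.x * moments.x, minVariance);
    float d = warpedReceiverDepth - moments.x;
    float pMax = variance / (variance + d * d);
    return (warpedReceiverDepth <= moments.x) ? 1.0f : pMax;
}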

These are the shadow filtering modes that I implemented:

  • FixedSizePCF – optimized GatherCmp PCF with disc-shaped kernel
  • GridPCF – manual grid-based PCF sampling using NxN samples
  • RandomDiscPCF  – randomized samples on a Poisson disc, with optional per-pixel randomized rotation
  • OptimizedPCF – similar to FixedSizePCF, but uses bilinear PCF samples to implement a uniform filter kernel.
  • VSM – variance shadow maps
  • EVSM2 – exponential variance shadow maps, positive term only
  • EVSM4 – exponential variance shadow maps, both positive and negative terms

Here are the options available in my sample related to filtering:

  • Shadow Mode – the shadow filtering mode, can be one of the above values
  • Fixed Filter Size – the kernel width in texels for the FixedSizePCF mode, can be 2×2 through 9×9
  • Filter Size – the kernel width in fractional world space units for all other filtering modes. For the VSM modes, it’s used in a pre-filtering pass.
  • Filter Across Cascades – blends between two cascade results at cascade boundaries to hide the transition
  • Num Disc Samples – number of samples to use for RandomDiscPCF mode
  • Randomize Disc Offsets – if enabled, applies a per-pixel randomized rotation to the disc samples
  • Shadow MSAA – MSAA level to use for VSM and EVSM modes
  • VSM Format – the precision to use for VSM and EVSM shadow maps, can be 16bit or 32bit (for VSM the textures will use a UNORM format, for EVSM they will be FLOAT)
  • Shadow Anisotropy – anisotropic filtering level to use for VSM and EVSM
  • Enable Shadow Mips – enables mipmap generation and sampling for VSM and EVSM
  • Positive Exponent/Negative Exponent – the exponential warping factors for the positive and negative components of EVSM
  • Light Bleeding Reduction – reduces light bleeding for VSM and EVSM, but results in over-darkening

In general I try to keep the filtering kernel fixed in world space across each cascade by adjusting the kernel width based on the cascade size. The one exception is the FixedSizePCF mode, which uses the same size kernel for all cascades. I did this because I didn’t think that branching over the fixed kernels would be a great idea. Matching the filter kernel for each cascade is nice because it helps hide the seams at cascade transitions, which means you don’t have to try to filter across adjacent cascades in order to hide them. It also means that you don’t have to use wider kernels on more distant pixels, although this can sometimes lead to visible aliasing on distant surfaces.

I didn’t put a whole lot of effort into the “RandomDiscPCF” mode, so it doesn’t produce optimal results. The randomization is done per-pixel, which isn’t great since you can clearly see the random pattern tiling over the screen as the camera moves. For a better comparison you would probably want to do something similar to what Crytek does, and tile a (stable) pattern over each cascade in shadowmap space.

Biasing

When using conventional PCF filtering, biasing is essential for avoiding “shadow acne” artifacts. Unfortunately, it’s usually pretty tricky to get it right across different scenes and lighting scenarios. Just a bit too much bias will end up killing shadows entirely for small features, which can cause characters to look very bad. My sample app exposes 3 kinds of biasing: manual depth offset, manual normal-based offset, and automatic depth offset based on receiver slope. The manual depth offset, called simply Bias in the UI, subtracts a value from the pixel depth used to compare against the shadow map depth. Since the shadow map depth is a [0, 1] value that’s normalized to the depth range of the cascade, the bias value represents a variable size in world space for each cascade. The normal-based offset, called Offset Scale in the UI, is based on “Normal Offset Shadows”, which was a poster presented at GDC 2011. The basic idea is that you create a new virtual shadow receiver position by offsetting from the actual pixel position in the direction of the normal. The trick is that you offset more when the surface normal is more perpendicular to the light direction. Angelo Pesce has a hand-drawn diagram explaining the same basic premise on his blog, if you’re having trouble picturing it. This technique can actually produce decent results, especially given how cheap it is. However since you’re offsetting the receiver position, you actually “move” the shadows a bit, which is a bit weird to see as you tweak the value. Since the offset is a world-space distance, in my sample I scale it based on the depth range of the cascade in order to make it consistent across cascade boundaries. If you want to try using it, I recommend starting with a small manual bias of 0.0015 or so and then slowly increasing the Offset Scale to about 1.0 or 2.0. Finally we have the automatic solution, where we attempt to compute the “perfect” bias amount based on the slope of the receiver. This setting is called Use Receiver Plane Depth Bias in the sample. To determine the slope, screen-space derivatives are used in the pixel shader. When it works, it’s fantastic. However it will still run into degenerate cases where it can produce unpredictable results, which is something that often happens when working with screen-space derivatives.
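
Here’s roughly what the normal-based offset looks like; this is one common formulation, and the exact scaling in the sample may differ:

// "Normal offset" biasing: move the shadow receiver position along the
// surface normal, offsetting more as the normal becomes perpendicular to the
// light, then use the offset position for the shadow map lookup.
float3 ApplyNormalOffset(float3 positionWS, float3 normalWS, float3 lightDirWS,
                         float offsetScale, float shadowTexelSizeWS)
{
    float nDotL = saturate(dot(normalWS, lightDirWS));
    float offsetAmount = offsetScale * shadowTexelSizeWS * (1.0f - nDotL);
    return positionWS + normalWS * offsetAmount;
}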

GPU-Driven Cascade Setup and Scene Submission

This is a topic that’s both fun and really frustrating. It’s fun because you can really start to exploit the flexibility of DX11-class GPU’s, and begin to break free of the old “CPU spoon-feeds the GPU” model that we’ve been using for so long now. It’s also frustrating because the API’s still hold us back quite a bit in terms of letting the GPU generate its own commands. Either way it’s something I see people talk about but not a lot of people actually doing, so I thought I’d give it a try for this sample. There are actually 2 reasons to try something like this. The first is that if you can offload enough work to the GPU, you can avoid the heavy CPU overhead of frustum culling and drawing and/or batching lots and lots of meshes. The second is that it lets you do SDSM-style cascade optimization based on the depth buffer without having to read back values from the GPU to the CPU, which is always a painful way of doing things.

The obvious path to implementing GPU scene submission would be to make use of DrawInstancedIndirect/DrawIndexedInstancedIndirect. These API’s are fairly simple to use: you simply write the parameters to a buffer (with the parameters being the same ones that you would normally pass on the CPU for the non-indirect version), and then on the CPU you specify that buffer when calling the appropriate function. Since instance count is one of the parameters, you can implement culling on the GPU by setting the instance count to 0 when a mesh shouldn’t be rendered. However it turns out that this isn’t really useful, since you still need to go through the overhead of submitting draw calls and setting associated state on the CPU. In fact the situation is worse than doing it all on the CPU, since you have to submit each draw call even if it will be frustum culled.
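
For reference, the indirect argument layout mirrors the parameters of DrawIndexedInstanced, so GPU-side culling with this approach boils down to zeroing the instance count. A sketch (the buffer name is made up):

// DrawIndexedInstancedIndirect reads five 32-bit values per draw:
// IndexCountPerInstance, InstanceCount, StartIndexLocation,
// BaseVertexLocation, StartInstanceLocation.
RWByteAddressBuffer IndirectArgs : register(u0);

static const uint DrawArgsSize = 5 * 4;

void SetDrawVisibility(uint drawIdx, bool visible)
{
    // Writing 0 to InstanceCount (the second argument) skips the draw on the GPU,
    // but the CPU still has to submit the DrawIndexedInstancedIndirect call itself.
    IndirectArgs.Store(drawIdx * DrawArgsSize + 4, visible ? 1u : 0u);
}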

Instead of using indirect draws, I decided to make a simple GPU-based batching system. Shadow map rendering is inherently more batchable than normal rendering, since you typically don’t need to worry about having specialized shaders or binding lots of textures when you’re only rendering depth. In my sample I take advantage of this by using a compute shader to generate one great big index buffer based on the results of a frustum culling pass, which can then be rendered in a single draw call. It’s really very simple: during initialization I generate a buffer containing all vertex positions, a buffer containing all indices (offset to reflect the vertex position in the combined position buffer), and a structured buffer containing the parameters for every draw call (index start, num indices, and a bounding sphere). When it’s time to batch, I run a compute shader with one thread per draw call that culls the bounding sphere against the view frustum. If it passes, the thread then “allocates” some space in the output index buffer by performing an atomic add on a value in a RWByteAddressBuffer containing the total number of culled indices that will be present in output index buffer (note that on my 7950 you should be able to just do the atomic in GDS like you would for an append buffer, but unfortunately in HLSL you can only increment by 1 using IncrementCounter()). I also append the draw call data to an append buffer for use in the next pass:

const uint drawIdx = TGSize * GroupID.x + ThreadIndex;
if(drawIdx >= NumDrawCalls)
    return;

DrawCall drawCall = DrawCalls[drawIdx];

if(FrustumCull(drawCall))
{
    CulledDraw culledDraw;
    culledDraw.SrcIndexStart = drawCall.StartIndex;
    culledDraw.NumIndices = drawCall.NumIndices;
    DrawArgsBuffer.InterlockedAdd(0, drawCall.NumIndices, 
                                  culledDraw.DstIndexStart);

    CulledDrawsOutput.Append(culledDraw);
}

Once that’s completed, I then launch another compute shader that uses 1 thread group per draw call to copy all of the indices from the source index buffer to the final output index buffer that will be used for rendering. This shader is also very simple: it simply looks up the draw call info from the append buffer that tells it which indices to copy and where they should be copied, and then loops enough times for all indices to be copied in parallel by the different threads inside the thread group. The final result is an index buffer containing all of the culled draw calls, ready to be drawn in a single draw call:

const uint drawIdx = GroupID.x;
CulledDraw drawCall = CulledDraws[drawIdx];

for(uint currIdx = ThreadIndex; currIdx < drawCall.NumIndices; currIdx += TGSize)
{
    const uint srcIdx = currIdx + drawCall.SrcIndexStart;
    const uint dstIdx = currIdx + drawCall.DstIndexStart;
    CulledIndices[dstIdx] = Indices[srcIdx];
}

In order to avoid the min/max depth readback, I also had to port my cascade setup code to a compute shader so that the entire process could remain on the GPU. I was surprised to find that this was actually quite a bit more difficult than writing the batching system. Porting arbitrary C++ code to HLSL can be somewhat tedious, due to the various limitations of the language. I also ran into a rather ugly bug in the HLSL compiler where it kept trying to keep matrices in column-major order no matter what code I wrote and what declarations I used, which I suppose it does as an optimization. However this really messed me up when I tried to write a matrix into a structured buffer and expected it to be row-major. My advice for the future: if you need to write a matrix from a compute shader, just use a StructuredBuffer<float4> and write out the rows manually. Ultimately after much hair-pulling I got it to work, and finally achieved my goal of 0-latency depth reduction with no CPU readback!
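
In case it saves someone else the hair-pulling, the workaround amounts to something like this (names are hypothetical):

// Write a matrix out one row at a time so the compiler's column-major
// packing can't silently change the layout seen by other shaders or the CPU.
RWStructuredBuffer<float4> CascadeMatrixRows : register(u0);

void WriteCascadeMatrix(uint cascadeIdx, float4x4 cascadeMatrix)
{
    uint baseRow = cascadeIdx * 4;

    [unroll]
    for(uint i = 0; i < 4; ++i)
        CascadeMatrixRows[baseRow + i] = cascadeMatrix[i];
}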

In terms of performance, the GPU-driven path comes in about 0.8ms slower than the CPU version when the CPU uses 1 frame of latency for reading back the min/max depth results. I’m not entirely sure why it’s that much slower, although I haven’t spent much time trying to profile the batching shaders or the cascade setup shader. I also wouldn’t be surprised if the actual mesh rendering ends up being a bit slower, since I use 32-bit indices when batching on the GPU as opposed to 16-bit when submitting with the CPU. However when using 0 frames of latency for the CPU readback, the GPU version turns it around and comes in a full millisecond faster on my PC. Obviously it makes sense that any additional GPU overhead would make up for the stall that occurs on the CPU and GPU when having to immediately read back data. One thing that I’ll point out is that the GPU version is quite a bit slower than the CPU version when VSM’s are used and pre-filtering is enabled. This is because the CPU path chooses an optimized blur shader based on the filter width for a cascade, and early-outs of the blur process if no filtering is required. For the GPU path I got lazy and just used one blur shader with a dynamic loop, and it always runs the blur passes even if no filtering is requested. There’s no technical reason why you couldn’t do it the same way as the CPU path with enough motivation and DrawIndirect’s.

DX11.2 Tiled Resources

Tiled resources seems to be the big-ticket item for the upcoming DX11.2 update. While the online documentation has some information about the new functions added to the API, there’s currently no information about the two tiers of tiled resource functionality being offered. Fortunately there is a sample app available that provides some clues. After poking around a bit last night, these were the differences that I noticed:

  • TIER2 supports MIN and MAX texture sampling modes that return the min or max of 4 neighboring texels. In the sample they use this when sampling a residency texture that tells the shader the highest-resolution mip level that can be used when sampling a particular tile. For TIER1 they emulate it with a Gather.
  • TIER1 doesn’t support sampling from unmapped tiles, so you have to either avoid it in your shader or map all unloaded tiles to dummy tile data (the sample does the latter)
  • TIER1 doesn’t support packed mips for texture arrays. From what I can gather, packed mips refers to packing multiple mips into a single tile.
  • TIER2 supports a new version of Texture2D.Sample that lets you clamp the mip level to a certain value. They use this to force the shader to sample from lower-resolution mip levels if the higher-resolution mip isn’t currently resident in memory. For TIER1 they emulate this by computing what mip level would normally be used, comparing it with the mip level available in memory, and then falling back to SampleLevel if the mip level needs to be clamped. There’s also another overload for Sample that returns a status variable that you can pass to a new “CheckAccessFullyMapped” intrinsic, which tells you if the sample operation would access unmapped tiles. The docs don’t say that these functions are restricted to TIER2, but I would assume that to be the case (see the sketch after this list).
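
Pieced together from the new documentation, here’s roughly what the TIER2 sampling path might look like in HLSL. Treat the overloads and the fallback color as my reading of the docs rather than tested code:

// Rough sketch of the TIER2 path: clamp the mip level to what's resident and
// check whether the fetch touched any unmapped tiles.
Texture2D ColorTexture : register(t0);
SamplerState LinearSampler : register(s0);

float4 SampleTiled(float2 uv, float residentMipClamp)
{
    uint status = 0;
    float4 color = ColorTexture.Sample(LinearSampler, uv, int2(0, 0),
                                       residentMipClamp, status);

    if(!CheckAccessFullyMapped(status))
        color = float4(1.0f, 0.0f, 1.0f, 1.0f);     // flag or fall back for unmapped tiles

    return color;
}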

Based on this information it appears that TIER1 offers all of the core functionality, while TIER2 has a few extras that bring it up to par with AMD’s sparse texture extension.

SIGGRAPH Follow-Up

So I’m hoping that if you’re reading this, you’ve already attended or read the slides from my presentation about The Order: 1886 that was part of the Physically Based Shading Course at SIGGRAPH last week. If not, go grab them and get started! If you haven’t read through the course notes already there’s a lot of good info there, in fact there’s almost 30 pages worth! The highlights include:

  • Full description of our Cook-Torrance and Cloth BRDF’s, including a handy optimization for the GGX Smith geometry term (for which credit belongs to Steve McAuley)
  • Analysis of our specular antialiasing solution
  • Plenty of details regarding the material scanning process
  • HLSL sample code for the Cook-Torrance BRDF’s as well as the specular AA roughness modification
  • Lots of beautiful LaTeX equations

If you did attend, I really appreciate you coming and I hope that you found it interesting. It was my first time presenting in front of such a large audience, so please forgive me if I seemed nervous. Also I’d like to give another thank you to anyone that came out for drinks later that night, I had a really great time talking shop with some of the industry’s best graphics programmers. And of course the biggest thanks goes out to Stephen Hill and Stephen McAuley for giving us the opportunity to speak at the best course at SIGGRAPH.

Anyhow…now that SIGGRAPH is finished and I have some time to sit down and take a breath, I wanted to follow up with some additional remarks about the topics that we presented. I also thought that if I post blogs more frequently, it might inspire a few other people to do the same.

Physically Based BRDF’s

I didn’t even mention anything about the benefits of physically based BRDF’s in our presentation because I feel like it’s no longer an issue that’s worth debating. We’ve been using some form of Cook-Torrance specular for about 2 years now, and it’s made everything in our game look better. Everything works more like it should work, and requires less fiddling and fudging on the part of the artists. We definitely had some issues to work through when we first switched (I can’t tell you how many times they asked me for direct control over the Fresnel curve), but there’s no question as to whether it was worth it in the long run. Next-gen consoles and modern PC GPU’s have lots of ALU to throw around, and sophisticated BRDF’s are a great way to utilize it.

Compositing

First of all, I wanted to reiterate that our compositing system (or “The Smasher” as it’s referred to in-house) is a completely offline process that happens in our content build system. When a level is built we request a build of all materials referenced in that level, which then triggers a build of all materials used in the compositing stack of each of those materials. Once all of the source materials are built, the compositing system kicks in and generates the blended parameter maps. The compositing process itself is very quick since we do it in a simple pixel shader run on the GPU, but it can take some time to build all of the source materials since doing that requires processing the textures referenced by each material. We also support using composited materials as a source material in a composite stack, which means that a single runtime material can potentially depend on an arbitrarily complex tree of source materials. To alleviate this we aggressively cache intermediate and output assets on local servers in our office, which makes build times pretty quick once the cache has been primed. We also used to compile shaders for every material, which caused long build times when changing code used in material shaders. This forced us to change things so that we only compile shaders directly required by a mesh inside of a level.

We also support runtime parameter blending, which we refer to as layer blending. Since it’s at runtime we limit it to 4 layers to prevent the shader from becoming too expensive, unlike the compositing system where artists are free to composite in as many materials as they’d like. It’s mostly used for environment geo as a way to add in some break-up and variety to tiling materials via vertex colors, as opposed to blending in small bits of a layer using blend maps. One notable exception is that we use the layer blending to add a “detail layer” to our cloth materials. This lets us keep the overall low-frequency material properties in lower-resolution texture maps, and then blend in tiling high-frequency data (such as the textile patterns acquired from scanning).

One more thing that I really wanted to bring up is that the artists absolutely love it. The way it interacts with our template library allows us to keep low-level material parameter authoring in the hands of a few technical people, and enables everyone else to just think in terms of combining high-level “building blocks” that form an object’s final appearance. I have no idea how we would make a game without it.

Maya Viewer

A few people noticed that we have our engine running in a Maya viewport, and came up to ask me about it. I had thought that this was fairly common and thus uninteresting, but I guess I was wrong! We do this by creating a custom Viewport 2.0 render override (via MRenderOverride) that uses our engine to render the scene using Maya’s DX11 device (Maya 2013.5 and up support a DX11 mode for Viewport 2.0; for 2013 you have to use OpenGL). With their render pass API you can basically specify what things you want Maya to draw, so we render first (so that we can fill the depth buffer) and then have Maya draw things like HUD text and wireframe overlays for UI. The VP 2.0 API actually supports lots of things for letting you hook into their renderer and data structures, which basically lets you specify things like vertex data and shader code while allowing Maya to handle the actual rendering. We don’t do that…we basically just give Maya’s device to our renderer and let our engine draw things the same way that it does in-game. To do this, we have a bunch of Maya plugin code that tracks objects in the scene (meshes, lights, plugin nodes, etc.) and handles exporting data off of them and converting it to data that can be consumed by our renderer. Maintaining this code has been a lot of work, since it basically amounts to an alternate path for scene data that’s in some ways very different from what we normally do in the game. However it’s a huge workflow enhancement for all of our artists, so we put up with it. I definitely never thought I would know so much about Maya and its API!

Some other cool things that we support inside of Maya:

  • We can embed our asset editors (for editing materials, particle systems, lens flares, etc.) inside of Maya, which allows for real-time editing of those assets while they’re being viewed
  • GI bakes are initiated from Maya and then run on the GPU, which allows the lighting artists to keep iterating inside of a single tool and get quick feedback
  • Our pre-computed visibility can also be baked inside of Maya, which allows artists to check for vis dropouts and other bugs without running the game

One issue that I had to work around in our Maya viewer was using the debug DX11 device. Since Maya creates the device we can’t control the creation flags, which means no helpful DEBUG mode to tell us when we mess something up. To work around this, I had to make our renderer create its own device and use that for rendering. Then when presenting the back buffer and depth buffer to Maya, we have to use DXGI synchronization to copy texture data from our device’s render targets to Maya’s render targets. It’s not terribly hard, but it requires reading through a lot of obscure DXGI documentation.

Sample Code

You may not have noticed, but there’s a sample app to go with the presentation! Dave and I always say that it’s a little lame to give a talk about something without providing sample code, so we had to put our money where our mouth is. It’s essentially a working implementation of our specular AA solution as well as our Cook-Torrance BRDF’s that uses my DX11 sample framework. Seeing for yourself with the sample app is a whole lot better than looking at pictures, since the primary concern is avoiding temporal aliasing as the camera or object changes position. These are all of the specular AA techniques available in the sample for comparison:

  • LEAN
  • CLEAN
  • Toksvig
  • Pre-computed Toksvig
  • In-shader vMF evaluation of Frequency Domain Normal Map Filtering
  • Pre-computed vMF evaluation (what we use at RAD)

For a ground-truth reference, I also implemented in-shader supersampling and texture-space lighting. The shader supersampling works by interpolating vertex attributes to random sample points surrounding the pixel center, computing the lighting, and then applying a bicubic filter. Texture-space lighting works exactly like you think it would: the lighting result is rendered to an FP16 texture that’s 2x the width and height of the normal map, mipmaps are generated, and the geometry is rendered by sampling from the lighting texture with anisotropic filtering. Since linear filtering is used both for sampling the lighting map and for generating the mipmaps, the overall results don’t always look very good (especially under magnification). However the results are completely stable, since the sample positions never move relative to the surface being shaded. The pixel shader supersampling technique doesn’t share that property, so it still suffers from some temporal flickering, although it’s obviously significantly reduced with higher sample counts.
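
As a reference point for the techniques listed above, the classic Toksvig adjustment (the non-pre-computed variant) is only a few lines. This is the textbook formulation, not necessarily the exact code in the sample:

// Classic Toksvig: use the length of the filtered (mipmapped) normal to
// shorten the specular power, which widens the highlight where the normal
// map is rough and suppresses highlight aliasing.
float ToksvigSpecPower(float3 filteredNormal, float specPower)
{
    float len = max(length(filteredNormal), 1e-4f);
    float toksvig = len / (len + specPower * (1.0f - len));
    return toksvig * specPower;
}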

Dave and I had also intended to implement solving for multiple vMF lobes, but unfortunately we ran out of time and we weren’t able to include it. I’d like to revisit it at some point and release an updated sample that has it implemented. I don’t think it would actually be worth it for a game to store so much additional texture data, but I think it would be useful as a reference. It might also be interesting to see if the data could be used to drive a more sophisticated pre-computation step that bakes the anisotropy into the lower-resolution mipmaps.

Like I mentioned before, the sample also has implementations of the GGX and Beckmann-based specular BRDF’s described in our course notes. We also implemented our GGX and Cloth BRDF’s as .brdf files for Disney’s BRDF Explorer, but we didn’t really have a place to put them. For now I have them hosted on my Google drive (one day I’ll move to real web hosting!), so you can grab them here:

GGX BRDF

Cloth BRDF

What I’ve been working on for the past 2 years

The announce trailer for Ready At Dawn’s latest project was shown during Sony’s E3 press conference yesterday, but if you missed it you can watch it here. There are also a few full-res screenshots available here, with less compress-o-vision. It feels really good to finally be able to tell people what game I’ve been working on, and that we’re making a PS4 title. I’m also insanely proud of the trailer itself, as well as the in-house tech we’ve developed that made it possible. I have no doubts that I work with some of the most dedicated and talented people in the industry, and being able to collaborate with them makes it worth getting up in the morning. I’m really looking forward to sharing some of the cool things we’ve come up with, both on this blog as well as in more formal settings.

I was also seriously impressed by all of the other games and demos that were shown off yesterday. It’s really exciting to see everyone gearing up for next gen, and pushing their engines to new levels. These were some stand-outs for me:

  • Infamous: Second Son – characters look awesome! Lots of details in the skin and fabrics, plus some really nice facial animation
  • Watch Dogs – we’ve seen it before, but it looks better and better every time
  • The Division – lighting looks great, and the contextual animations are a really nice touch. I really dig the bullet decals and glass destruction.
  • Battlefield 4 – these guys are just out of control when it comes to destruction and mayhem! Also, really psyched to see that commander mode is making a comeback.
  • Mirror’s Edge 2 – the first game has a special place in my heart, so I would love a sequel even if it didn’t look awesome (but it does). It’s really cool seeing how good GI can look in a game with a clean art style.
  • Destiny – I’m not usually one for persistent-world multiplayer games, but this game might be the exception. Looks like a lot of fun, and has great graphics to boot.
  • Titan Fall – any game where I can pilot a mech and run on walls has my money
  • The Dark Sorcerer – very high quality assets, and great lighting/materials. The performance capture is really impressive as well.

I’m looking forward to seeing more of these games and comparing notes on what techniques make the most out of next-gen hardware. By my count we’ve crossed off around 8 or so of the things on my list, and hopefully the entire industry will collectively figure out how to make all of them extinct. And then we can come up with new things to add to the list!

HLSL User Defined Language for Notepad++

When it comes to writing shaders, Notepad++ is currently my editor of choice. The most recent release of Notepad++ added version 2.0 of their User Defined Language (UDL) system, which adds quite a few improvements. I’ve been using an HLSL UDL file that I downloaded from somewhere else for a while now, and I decided to upgrade it to the 2.0 format and also make it work better for SM5.0 profiles. I added all of the operators, keywords, types, attributes, system-value semantics, intrinsics, and methods, so they all get syntax highlighting now. I also stripped out all of the old pre-SM4.0 intrinsics and semantics, as well as the effect-specific keywords. I’ve exported it as an XML file and uploaded it to my Google Drive so that others can make use of it as well. To use it, you can either import the XML file from the UDL dialog (Language->Define your language), or you can replace your userDefineLang.xml file in the AppData\Notepad++ folder. Enjoy!

Experimenting with Reconstruction Filters for MSAA Resolve

Previous article in the series: A Quick Overview of MSAA

Despite having the flexibility to implement a custom resolve for MSAA, the “standard” box filter resolve is still commonly used in games. While the box filter works well enough, it has some characteristics that can be considered undesirable for a reconstruction filter. In particular, the box function has a discontinuity at its edges, which gives it a frequency response with infinite extent (the frequency-domain equivalent of a box function is the sinc function). This causes it to introduce postaliasing when used as a reconstruction filter, since the filter is unable to isolate the original copy of a signal’s spectrum. The primary advantage offered by such a resolve is that it’s cheap from a performance point of view, since only subsamples within a single pixel need to be considered when computing a resolved pixel value.

The question we now want to answer is “can we do better?” Offline renderers such as Pixar’s PRMan support a variety of filter types for antialiasing, so it stands to reason that we should at least explore the possibilities for real-time graphics. If we decide to forego the “standard” resolve offered by ResolveSubresource and instead perform our own resolve using a pixel or compute shader that accesses the raw multisampled texture data, we are pretty much free to implement whatever reconstruction filter we’d like. So there is certainly no concern over lack of flexibility. Performance, however, is still an issue. Every GPU that I’ve run my code on will perform worse with a custom resolve, even when using a simple box filter with results that exactly match a standard resolve. Currently the performance delta seems to be worse on AMD hardware than on Nvidia hardware. On top of that, there are additional costs for the increased texture samples required by wider filter kernels. Separable filtering can be used to reduce the number of samples required for wide filters, however the rotated-grid sample patterns used by MSAA require special consideration. Unfortunately I haven’t solved these problems yet, so for this sample I’m just going to focus on quality without too much regard for performance. Hopefully in a future article I can revisit this, and address the performance issues.
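
To make the idea of a custom resolve concrete, here’s a minimal sketch of a resolve pixel shader for the 4x case, assuming the MSAA render target is bound as a Texture2DMS. The names here are illustrative rather than taken from the sample, and with a 1-pixel-wide box filter like this the output should match ResolveSubresource:

    Texture2DMS<float4, 4> MSAATexture : register(t0);

    // Minimal custom resolve: average all subsamples within the output pixel,
    // which is equivalent to the "standard" 1-pixel-wide box filter resolve
    float4 ResolvePS(in float4 position : SV_Position) : SV_Target0
    {
        uint2 pixelPos = uint2(position.xy);

        float4 sum = 0.0f;
        [unroll]
        for(uint i = 0; i < 4; ++i)
            sum += MSAATexture.Load(pixelPos, i);

        return sum / 4.0f;
    }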

At this point I feel that I should bring up TXAA. If you’re not familiar, TXAA is a library-supported antialiasing technique introduced for recent Nvidia Kepler-based GPU’s. There’s no public documentation as to exactly how it works, but Timothy Lottes has mentioned a few details here and there on his blog. From the info he’s given, it seems safe to assume that the MSAA resolve used by TXAA is something other than a box filter, and is using a filter width wider than a pixel. Based on these assumptions, you should be able to produce similar results with the framework that I’ve set up.

Implementation

The sample application renders my trusty tank scene in HDR to an fp16 render target with either 1x, 2x, 4x, or 8x MSAA enabled. Once the scene is rendered, a pixel shader is used to resolve the MSAA render target to a non-MSAA render target with one of ten available reconstruction filters. I implemented the following filters:

  • Box
  • Triangle
  • Gaussian
  • Blackman-Harris
  • Smoothstep (Hermite spline)
  • B-spline
  • Catmull-Rom
  • Mitchell
  • Generalized Cubic
  • Sinc

I started out by implementing some of the filters supported by PRMan, and using similar parameters for controlling the filtering. However I ended up deviating from the PRMan setup in order to make things more intuitive (in my opinion, at least). All filters except for the sinc filter were implemented such that their “natural range” of non-zero values was [-0.5, 0.5]. This deviates from the canonical filter widths for several of these filters, notably the cubic filters (which are normally defined for the [-2, 2] range). I then used a “filter width” parameter to inversely scale the inputs to the filtering functions. So for a filter width of 1.0, the filters all have a width equal to the size of a single resolved pixel. The one exception is the sinc filter, where I used the filter width to window the function rather than scaling the input value. I should also note that I implemented all of the filters as radial filters, where the input is the screen-space distance from the output pixel center to the sample position. Typically filters for image scaling are used in separable passes where 1D filters are passed the X or Y sample distance. Because of this my “Box” filter is actually disc-shaped, but it produces very similar results. In fact for a filter width of 1.0 the results are identical to a “standard” box filter resolve. The “Triangle” filter uses a standard triangle function, which can be considered a “cone” function when used as a radial filter. “Gaussian” uses a standard Gaussian function with a configurable sigma parameter, with the result windowed to [-0.5, 0.5]. The “Smoothstep” filter simply uses the smoothstep intrinsic available in HLSL, which implements a cubic Hermite spline. The “Generalized Cubic” filter is an implementation of the cubic function suggested by Mitchell and Netravali in their paper, with the B and C parameters being tweakable by the user. The “B-spline”, “Mitchell” and “Catmull-Rom” filters use this same function except with fixed values for B and C. “Sinc” is the standard sinc function, windowed to [-FilterWidth, FilterWidth] as mentioned previously.
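
For reference, here’s what the generalized cubic looks like in HLSL, written over its canonical [-2, 2] domain (recall that the sample rescales filter inputs by the filter width instead). This is a sketch of the standard Mitchell-Netravali formulation rather than a copy of the sample’s code:

    // Mitchell-Netravali generalized cubic, defined for |x| <= 2.
    // (B, C) = (1, 0) gives the B-spline, (1/3, 1/3) gives Mitchell,
    // and (0, 0.5) gives Catmull-Rom.
    float GeneralizedCubic(in float x, in float B, in float C)
    {
        float ax = abs(x);
        float ax2 = ax * ax;
        float ax3 = ax2 * ax;

        if(ax < 1.0f)
            return ((12.0f - 9.0f * B - 6.0f * C) * ax3
                  + (-18.0f + 12.0f * B + 6.0f * C) * ax2
                  + (6.0f - 2.0f * B)) / 6.0f;
        else if(ax < 2.0f)
            return ((-B - 6.0f * C) * ax3
                  + (6.0f * B + 30.0f * C) * ax2
                  + (-12.0f * B - 48.0f * C) * ax
                  + (8.0f * B + 24.0f * C)) / 6.0f;
        else
            return 0.0f;
    }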

To visualize the filtering function, I added a real-time 1D plot of the currently-selected filter function using the current filter width. I also added a plot of the 1D Fourier transform of the filter function (calculated with the help of the awesomely easy-to-integrate Kiss FFT library), so that you can also visualize the frequency response of the selected filter type. This can be useful for estimating the amount of postaliasing produced by a filter, as well as the attenuation of frequencies below the Nyquist rate (which results in blurring).

After the resolve is performed, the result is fed into a standard post-processing chain. This phase includes average luminance calculation for auto-exposure, bloom, and HDR tone mapping. I added an option to tone map subsamples in a manner similar to Humus’s sample, so that the results can be compared with resolving prior to tone mapping. When this option is activated, the bloom and auto-exposure passes work with non-resolved MSAA textures, since the output of the resolve no longer contains linear HDR values. Note that the resolve is still performed prior to post-processing, since I wanted to keep the resolve separate from the post-processing phase so that it was more visible. In production it would most likely be done after all post-processing, however you would still need the same considerations regarding working with non-resolved MSAA data.

Here’s a full list of all options that I implemented:

  • MSAA Mode – the number of MSAA samples to use for the primary render target (1x, 2x, 4x, or 8x)
  • Filter Type – the filtering function to use in the resolve step (supports all of the filters listed above)
  • Use Standard Resolve – when enabled, a “standard” box filter resolve is performed using ResolveSubresource
  • Tone Map Subsamples – when enabled, tone mapping is applied before the subsamples are resolved
  • Enable FXAA – enables or disables FXAA with high-quality PC settings
  • Render Triangle – renders a plain red triangle in the center of the screen

  • Bloom Exposure – an exposure (in log2 space) applied to HDR values in order to create the bloom source values
  • Bloom Magnitude – a multiplier for the bloom value that’s combined with the tone mapped result
  • Auto-Exposure Key Value – key value for controlling auto-exposure
  • Adaptation Rate – rate at which exposure is adapted over time
  • Roughness – roughness used for material specular calculations
  • Filter Size – the radius of the filter kernel (in pixels) used during the resolve step
  • Gaussian Sigma – the sigma parameter for the Gaussian function, used by the Gaussian filter mode
  • Cubic B – the “B” parameter to Mitchell’s generalized cubic function, used by the Generalized Cubic filter mode
  • Cubic C – the “C” parameter to Mitchell’s generalized cubic function, used by the Generalized Cubic filter mode
  • Magnification – magnification level for the final output (magnification is performed with point filtering)
  • Triangle Rotation Speed – the speed at which the red triangle (enabled by Render Triangle) is rotated

Results

The following table contains links to 1280×720 screenshots from my sample application using various filter types and filter widths. All screenshots use 4xMSAA, and perform the resolve in linear HDR space (bloom and tone mapping are performed after):

Box 1.0 2.0 3.0 4.0 5.0 6.0
Triangle 1.0 2.0 3.0 4.0 5.0 6.0
Gaussian 1.0 2.0 3.0 4.0 5.0 6.0
Blackman-Harris 1.0 2.0 3.0 4.0 5.0 6.0
Smoothstep 1.0 2.0 3.0 4.0 5.0 6.0
B-spline 1.0 2.0 3.0 4.0 5.0 6.0
Catmull-Rom 1.0 2.0 3.0 4.0 5.0 6.0
Mitchell 1.0 2.0 3.0 4.0 5.0 6.0
Sinc 1.0 2.0 3.0 4.0 5.0 6.0

This table contains similar screenshots, except that the tone mapping is performed prior to the resolve:

Box 1.0 2.0 3.0 4.0 5.0 6.0
Triangle 1.0 2.0 3.0 4.0 5.0 6.0
Gaussian 1.0 2.0 3.0 4.0 5.0 6.0
Blackman-Harris 1.0 2.0 3.0 4.0 5.0 6.0
Smoothstep 1.0 2.0 3.0 4.0 5.0 6.0
B-spline 1.0 2.0 3.0 4.0 5.0 6.0
Catmull-Rom 1.0 2.0 3.0 4.0 5.0 6.0
Mitchell 1.0 2.0 3.0 4.0 5.0 6.0
Sinc 1.0 2.0 3.0 4.0 5.0 6.0

As we take a close look at the images, the results shouldn’t be too surprising. For the most part, wider filter kernels tend to reduce aliasing while smaller filters preserve more high-frequency detail. Personally I find that the cubic spline filters with no negative lobes (smoothstep and B-spline) produce the best results, with the best balance between aliasing and blurring occurring with filter widths in the 2.0-3.0 range. Here is a magnified image showing the results of 4xMSAA with a standard 1-pixel-wide box filter, followed by the same image with a 3-pixel-wide B-spline filter:

4xMSAA with a “standard” 1-pixel-wide box filter, followed by 4xMSAA with a 3-pixel-wide B-spline filter

The aliasing is pretty significantly reduced on geometry edges with the B-spline filter, particularly edges with higher contrast. Here’s another pair of images that are magnified even further, so that you can see the edge quality:

Highly magnified images showing 4xMSAA with a 1-pixel-wide box filter, followed by a 3-pixel-wide B-spline filter

Here’s another set of images showing the results of a wider filter kernel on high-frequency details from normal maps and specular lighting:

4xMSAA with a 1-pixel-wide box filter, with a 2-pixel-wide B-spline filter, and with a 3-pixel-wide B-spline filter

As you can see in the images, a 2-pixel-wide B-spline filter is actually pretty good in terms of not attenuating details that are close to a pixel in size. A wider filter reduces aliasing even further, but I feel that a filter width of 2.0 is still an improvement over the quality offered by a “standard” resolve. So it’s probably a pretty good place to start if you want better quality, but you prefer a sharper output image. The other cubic filters with negative lobes (such as Catmull-Rom and Mitchell) will also produce a sharper result, however the negative lobes can produce undesirable artifacts if they’re too strong. This is especially true when filtering HDR values with high intensity, since they can have a strong effect on neighboring pixels. For this reason I think that Mitchell is a better option over Catmull-Rom, since Mitchell’s negative lobes are a bit less pronounced. The sinc filter is almost totally unappealing for an MSAA resolve, since the ringing artifacts that it produces are very prominent. Here are three images comparing a 4-pixel-wide Catmull-Rom filter, a 4-pixel-wide Mitchell filter, and a 6-pixel-wide sinc filter:

All of the above images used 4xMSAA, but a wider filter kernel can also work well for 2xMSAA. Here are some close-ups showing “normal” 2xMSAA vs. 2xMSAA with a B-spline filter vs. 4xMSAA with a B-spline filter:

2xMSAA with standard resolve, 2xMSAA with a 3-pixel-wide B-spline filter, and 4xMSAA with a 3-pixel-wide B-spline filter

To wrap things up, here are two close-ups showing the results of a wide B-spline filter applied before tone mapping, and the same image with tone mapping applied prior to filtering:

4xMSAA with a 3-pixel wide B-spline filter applied before tone mapping, and after tone mapping

These results are a little interesting since they illustrate the differences between the two approaches. In the first image the filtering is performed with HDR values, so you get similar effects to applying DOF or motion blur in HDR, where bright values can dominate their local neighborhood. The second image shows quite a different result, where the darker geometry actually ends up looking “thicker” against the bright blue sky. In general I don’t find that it produces a substantial improvement when you’re already using a wider filter kernel, or at least not enough to justify the extra effort and performance required to make it work with an HDR post-processing pipeline. However it does tend to play nicer with cubic filters that have negative lobes, since you’re not filtering HDR values with arbitrary intensity.

Conclusions

There are clearly a lot of options available to you if you choose to implement a custom MSAA resolve. I think there are some good opportunities here to do an even better job of reducing aliasing in games, and personally I’m of the opinion that it’s worth reducing the appearance of tiny pixel-wide details if it results in an overall cleaner image. Either way I don’t think that a box filter is the best choice no matter what your tastes are.

If you want to download the sample app with source code, you can grab it here. Feel free to download it and perform your own experiments, although I’d appreciate if you’d share your results and opinions!

Some quick notes about the sample code: I decided to use the newer DirectX headers and libraries from the Windows 8 SDK, so you’ll need to have it installed if you want to compile the project. I haven’t fully migrated to VS 2012 yet, so I’ve left the project in VS 2010 format so that it can be opened by either. I also overhauled a lot of my sample framework code, which includes a shader caching system that uses the totally-awesome MurmurHash to detect when a shader needs to be re-compiled. Using the new SDK also entailed ditching D3DX, so I’ve replaced the texture and mesh loading functionality that I was using with some of my own code combined with some code lifted from DirectXTK. One major downside I should mention is that I’m only supporting x64 for now, due to some annoyances with the SDK and redistributing D3DCompiler DLL’s. If anybody is still stuck on 32-bit Windows and really wants to run the sample, let me know and I’ll try to find some time to get it working.

A Quick Overview of MSAA

Previous article in the series: Applying Sampling Theory to Real-Time Graphics

MSAA can be a bit complicated, due to the fact that it affects nearly the entire rasterization pipeline used in GPU’s. It’s also complicated because really understanding why it works requires at least a basic understanding of signal processing and image resampling. With that in mind I wanted to provide a quick overview of how MSAA works on a GPU, in order to provide some background material for the following article where we’ll experiment with MSAA resolves. Like the previous article on signal processing, feel free to skip it if you’re already an expert. Or better yet, read through it and correct my mistakes!

Rasterization Basics

A modern D3D11-capable GPU features hardware-supported rendering of point, line, and triangle primitives through rasterization. The rasterization pipeline on a GPU takes as input the vertices of the primitive being rendered, with vertex positions provided in the homogeneous clip space produced by transformation by some projection matrix.  These positions are used to determine the set of pixels in the current render target where the triangle will be visible. This visible set is determined from two things: coverage, and occlusion. Coverage is determined by performing some test to determine if the primitive overlaps a given pixel. In GPU’s, coverage is calculated by testing if the primitive overlaps a single sample point located in the exact center of each pixel 1. The following image demonstrates this process for a single triangle:

Coverage being calculated for a rasterized triangle. The blue circles represent a grid of sample points, each located at the center of a pixel. The red circles represent sample points covered by the triangle.
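
To give a rough idea of what that coverage test looks like mathematically, here’s a conceptual sketch based on edge functions. Real hardware uses fixed-point edge equations along with fill rules for samples that land exactly on an edge, so treat this purely as an illustration:

    // Conceptual coverage test using edge functions. For a counter-clockwise
    // triangle (v0, v1, v2), a sample point is covered when it lies on the
    // interior side of all three edges.
    float EdgeFunction(in float2 a, in float2 b, in float2 p)
    {
        return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
    }

    bool CoversSample(in float2 v0, in float2 v1, in float2 v2, in float2 samplePos)
    {
        return EdgeFunction(v0, v1, samplePos) >= 0.0f &&
               EdgeFunction(v1, v2, samplePos) >= 0.0f &&
               EdgeFunction(v2, v0, samplePos) >= 0.0f;
    }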

Occlusion tells us whether a pixel covered by a primitive is also covered by any other triangles, and is handled by z-buffering in GPU’s. A z-buffer, or depth buffer, stores the depth of the closest primitive relative to the camera at each pixel location. When a primitive is rasterized, its interpolated depth is compared against the value in the depth buffer to determine whether or not the pixel is occluded. If the depth test succeeds, the appropriate pixel in the depth buffer is updated with the new closest depth. One thing to note about the depth test is that while it is often shown as occurring after pixel shading, almost all modern hardware can execute some form of the depth test before shading occurs. This is done as an optimization, so that occluded pixels can skip pixel shading. GPU’s still support performing the depth test after pixel shading in order to handle certain cases where an early depth test would produce incorrect results. One such case is where the pixel shader manually specifies a depth value, since the depth of the primitive isn’t known until the pixel shader runs.

Together, coverage and occlusion tell us the visibility of a primitive. Since visibility can be defined as a 2D function of X and Y, we can treat it as a signal and define its behavior in terms of concepts from signal processing. For instance, since coverage and depth testing are performed at each pixel location in the render target, the visibility sampling rate is determined by the X and Y resolution of that render target. We should also note that triangles and lines will inherently have discontinuities, which means that the signal is not bandlimited and thus no sampling rate will be adequate to avoid aliasing in the general case.

Oversampling and Supersampling

While it’s generally impossible to completely avoid aliasing of an arbitrary signal with infinite frequency, we can still reduce the appearance of aliasing artifacts through a process known as oversampling. Oversampling is the process of sampling a signal at some rate that’s higher than our intended final output, and then reconstructing and resampling the signal again at the output sample rate. As you’ll recall from the first article, sampling at a higher rate causes the  clones of a signal’s spectrum to be further apart. This results in less of the higher-frequency components leaking into the reconstructed version of the signal, which in the case of an image means a reduction in the appearance of aliasing artifacts.

When applied to graphics and 2D images we call this supersampling, often abbreviated as SSAA. Implementing it in a 3D rasterizer is trivial: render to some resolution higher than the screen, and then downsample to screen resolution using a reconstruction filter. The following image shows the results of various supersampling patterns applied to a rasterized triangle:

Supersampling applied to a rasterized triangle, using various sub-pixel patterns. Notice how aliasing is reduced as the sample rate increases, even though the number of pixels is the same in all cases. Image from Real-Time Rendering, 3rd Edition, A K Peters 2008

The simplicity and effectiveness of supersampling resulted in it being offered as a driver option for many early GPU’s. The problem, however, is performance. When the resolution of the render target is increased, the sampling rate of visibility increases. However since the execution of the pixel shader is also tied to the resolution of the pixels, the pixel shading rate would also increase. This meant that any work performed in the pixel shader, such as lighting or texture fetches, would be performed at a higher rate and thus consume more resources. The same goes for bandwidth used when writing the results of the pixel shader to the render target, since the write (and blending, if enabled) is performed for each pixel. Memory consumption is also increased, since the render target and corresponding z buffer must be larger in size. Because of these adverse performance characteristics, supersampling was mostly relegated to a high-end feature for GPU’s with spare cycles to burn.

Supersampling Evolves into MSAA

So we’ve established that supersampling works in principle for reducing aliasing in 3D graphics, but that it’s also prohibitively expensive. In order to keep most of the benefit of supersampling without breaking the bank in terms of performance, we can observe that aliasing of the triangle visibility function (AKA geometric aliasing) only occurs at the edges of rasterized triangles. If we hopped into a time machine and traveled back to 2001, we would also observe that pixel shading mostly consists of texture fetches and thus doesn’t suffer from aliasing (due to mipmaps). These observations would lead us to conclude that geometric aliasing is the primary form of aliasing for games, and should be our main focus. This conclusion is what led to the birth of MSAA.

In terms of rasterization, MSAA works in a similar manner to supersampling. The coverage and occlusion tests are both performed at higher-than-normal resolution, which is typically 2x through 8x. For coverage, the hardware implements this by having N sample points within a pixel, where N is the multisample rate. These samples are known as subsamples, since they are sub-pixel samples. The following image shows the subsample placement for a typical 4x MSAA rotated grid pattern:

Typical MSAA 4x Sample Pattern

The triangle is tested for coverage at each of the N sample points, essentially building a bitwise coverage mask representing the portion of the pixel covered by a triangle 2. For occlusion testing, the triangle depth is interpolated at each covered sample point and tested against the depth value in the z buffer. Since the depth test is performed for each subsample and not for each pixel, the size of the depth buffer must be augmented to store the additional depth values. In practice this means that the depth buffer will be N times the size of the non-MSAA case. So for 2xMSAA the depth buffer will be twice the size, for 4x it will be four times the size, and so on.

Where MSAA begins to differ from supersampling is when the pixel shader is executed. In the standard MSAA case, the pixel shader is not executed for each subsample. Instead, the pixel shader is executed only once for each pixel where the triangle covers at least one subsample. Or in other words, it is executed once for each pixel where the coverage mask is non-zero. At this point pixel shading occurs in the same manner as non-MSAA rendering: the vertex attributes are interpolated to the center of the pixel and used by the pixel shader to fetch textures and perform lighting calculations. This means that the pixel shader cost does not increase substantially when MSAA is enabled, which is the primary benefit of MSAA over supersampling.

Although we only execute the pixel shader once per covered pixel, it is not sufficient to store only one output value per pixel in the render target. We need the render target to support storing multiple samples, so that we can store the results from multiple triangles that may have partially covered a single pixel. Therefore an MSAA render target will have enough memory to store N subsamples for each pixel. This is conceptually similar to an MSAA z buffer, which also has enough memory to store N subsamples. Each subsample in the render target is mapped to one of the subsample points used during rasterization to determine coverage. When a pixel shader outputs its value, the value is only written to subsamples where both the coverage test and the depth test passed for that pixel. So if a triangle covers half the sample points in a 4x sample pattern, then half of the subsamples in the render target receive the pixel shader output value. Or if all of the sample points are covered, then all of the subsamples receive the output value. The following image demonstrates this concept:

Results from non-MSAA and 4x MSAA rendering when a triangle partially covers a pixel. Image from Real-Time Rendering, 3rd Edition

By using the coverage mask to determine which subsamples should be updated, the end result is that a single pixel can end up storing the output from N different triangles that partially cover the same pixel. This effectively gives us the result we want, which is an oversampled form of triangle visibility. The following image, taken from the Direct3D 10 documentation[1], visually summarizes the rasterization process for the case of 4xMSAA:

An image from the D3D10 documentation detailing the results of rasterizing various primitives with 4xMSAA.

MSAA Resolve

As with supersampling, the oversampled signal must be resampled down to the output resolution before we can display it. With MSAA, this process is referred to as resolving the render target. In its earliest incarnations, the resolve process was carried out in fixed-function hardware on the GPU. The filter commonly used was a 1-pixel-wide box filter, which essentially equates to averaging all subsamples within a given pixel. Such a filter produces results such that fully-covered pixels end up with the same result as non-MSAA rendering, which could be considered either good or bad depending on how you look at it (good because you won’t unintentionally reduce details through blurring, bad because a box filter will introduce postaliasing). For pixels with triangle edges, you get a trademark gradient of color values with a number of steps equal to the number of sub-pixel samples. Take a look at the following image to see what this gradient looks like for various MSAA modes:

Trademark MSAA edge gradients resulting from reconstruction using box filtering

One notable exception to box filtering was Nvidia’s “Quincunx” AA, which was available as a driver option on their DX8 and DX9-era hardware (which includes the RSX used by the PS3). When enabled, it would use a 2-pixel-wide triangle filter centered on one of the samples in a 2x MSAA pattern. The “quincunx” name comes from the fact that the resolve process ends up using 5 subsamples that are arranged in the cross-shaped quincunx pattern. Since the quincunx resolve uses a wider reconstruction filter, aliasing is reduced compared to the standard box filter resolve. However, using a wider filter can also result in unwanted attenuation of higher frequencies. This can lead to a “blurred” look that appears to lack details, which is a complaint sometimes levied against PS3 games that have used the feature. AMD later added a similar feature to their 3000 and 4000 series GPU’s called “Wide Tent” that also made use of a triangle filter with a width greater than a pixel.

As GPU’s became more programmable and the API’s evolved to match them, we eventually gained the ability to perform the MSAA resolve in a custom shader instead of having to rely on an API function to do that. This is an ability we’re going to explore in the following article.

Compression

As we saw earlier, MSAA doesn’t actually improve on supersampling in terms of rasterization complexity or memory usage. At first glance we might conclude that the only advantage of MSAA is that pixel shader costs are reduced. However this isn’t actually true, since it’s also possible to improve bandwidth usage. Recall that the pixel shader is only executed once per pixel with MSAA. As a result, the same value is often written to all N subsamples of an MSAA render target. GPU hardware is able to exploit this by sending the pixel shader value coupled with another value indicating which subsamples should be written, which acts as a form of lossless compression. With such a compression scheme the bandwidth required to fill an MSAA render target can be significantly less than it would be for the supersampling case.

CSAA and EQAA

Since its introduction, the fundamentals of MSAA have not seen significant changes as graphics hardware has evolved. We already discussed the special resolve modes supported by the drivers for certain Nvidia and ATI/AMD hardware as well as the ability to arbitrarily access subsample data in an MSAA render target, which are two notable exceptions. A third exception has been Nvidia’s Coverage Sampling Antialiasing (CSAA)[2] modes supported by their DX10 and DX11 GPU’s. These modes seek to improve the quality/performance ratio of MSAA by decoupling the coverage of triangles within a pixel from the subsamples storing the value output by the pixel shader. The idea is that while subsamples have high storage cost since they store pixel shader outputs, the coverage can be stored as a compact bitmask. This is exploited by rasterizing at a certain subsample rate and storing coverage at that rate, but then storing the actual subsample values at a lower rate. As an example, the “8x” CSAA mode stores 8 coverage samples and 4 pixel shader output values. When performing the resolve, the coverage data is used to augment the quality of the results. Unfortunately Nvidia does not provide public documentation of this step, and so the specifics will not be discussed here. They also do not provide programmatic access to the coverage data in shaders, and thus the data will only be used when performing a standard resolve through D3D or OpenGL functions.

AMD has introduced a very similar feature in their 6900 series GPU’s, which they’ve named EQAA[3]. Like CSAA, it can be enabled through driver options or special MSAA quality modes, but it cannot be used in custom resolves performed via shaders.

Working with HDR and Tone Mapping

Before HDR became popular in real-time graphics, we essentially rendered display-ready color values to our MSAA render target with only simple post-processing passes applied after the resolve. This meant that after resolving with a box filter, the resulting gradients along triangle edges would be perceptually smooth between neighboring pixels3. However when HDR, exposure, and tone mapping are thrown into the mix there is no longer anything close to a linear relationship between the color rendered at each pixel and the perceived color displayed on the screen. As a result, you are no longer guaranteed to get the smooth gradient you would get when using a box filter to resolve LDR MSAA samples. This can seriously affect the output of the resolve, since it can end up appearing as if no MSAA is being used at all if there is extreme contrast on a geometry edge.

This strange phenomenon was first pointed out (to my knowledge) by Humus (Emil Persson), who created a sample[4] demonstrating it as well as a corresponding ShaderX6 article. In this same sample he also demonstrated an alternative approach to MSAA resolves, where he used a custom resolve to apply tone mapping to each subsample individually before filtering. His results were pretty striking, as you can see from these images (the first uses a typical resolve, while the second applies tone mapping before the resolve):

HDR rendering with MSAA. The top image applies tone mapping after a standard MSAA resolve, while the bottom image applies tone mapping before the MSAA resolve.

It’s important to think about what it actually means to apply tone mapping before the resolve. Before tone mapping, we can actually consider ourselves to be working with values representing physical quantities of light within our simulation. Primarily, we’re dealing with the radiance of light reflecting off of a surface towards the eye. During the tone mapping phase, we attempt to convert from a physical quantity of light to a new value representing the color that should be displayed on the screen. What this means is that by changing where the resolve takes place, we’re actually oversampling a different signal! When resolving before tone mapping we’re oversampling the signal representing physical light being reflected towards the camera, and when resolving after tone mapping we’re oversampling the signal representing colors displayed on the screen. Therefore an important consideration we have to make is which signal we actually want to oversample. This directly ties into post-processing, since a modern game will typically have several post-processing effects needing to work with HDR radiance values rather than display colors. Thus we want to perform tone mapping as the last step in our post-processing chain. This presents a potential difficulty with the approach of tone mapping prior to resolve, since it means that all previous post-processing steps must work with a non-resolved MSAA buffer as an input and also produce an MSAA buffer as an output. This can obviously have serious memory and performance implications, depending on how the passes are implemented.
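
To make the distinction concrete, here’s a minimal sketch of a resolve that tone maps each subsample before filtering, in the spirit of the approach described above. The operator and names here are placeholders (a simple Reinhard curve for illustration), and a real implementation would also need to fold in exposure:

    Texture2DMS<float4, 4> MSAATexture : register(t0);

    // Placeholder operator for illustration; the real pipeline's tone mapping
    // and exposure would be applied here instead
    float3 ToneMap(in float3 hdrColor)
    {
        return hdrColor / (1.0f + hdrColor);
    }

    // Resolve in display-color space: tone map each subsample, then box filter
    float4 ToneMappedResolvePS(in float4 position : SV_Position) : SV_Target0
    {
        uint2 pixelPos = uint2(position.xy);

        float3 sum = 0.0f;
        [unroll]
        for(uint i = 0; i < 4; ++i)
            sum += ToneMap(MSAATexture.Load(pixelPos, i).rgb);

        return float4(sum / 4.0f, 1.0f);
    }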

MLAA and Other Post-Process AA Techniques

Morphological Anti-Aliasing is an anti-aliasing technique originally developed by Intel[5] that initiated a wave of performance-oriented AA solutions commonly referred to as post-process anti-aliasing. This name is due to the fact that they do not fundamentally alter the rendering/rasterization pipeline like MSAA does. Instead, they work with only a non-MSAA render target to produce their results. In this way these techniques are rather interesting, in that they do not actually rely on increasing the sampling rate in order to reduce aliasing. Instead, they use what could be considered an advanced reconstruction filter in order to approximate the results that you would get from oversampling. In the case of MLAA in particular, this reconstruction filter uses pattern-matching in an attempt to detect the edges of triangles. The pattern-matching relies on the fact that for a fixed sample pattern, common patterns of pixels will be produced by the rasterizer for a triangle edge. By examining the color of the local neighborhood of pixels, the algorithm is able to estimate where a triangle edge is located and also the orientation of the line making up that particular edge. The edge and color information is then enough to estimate an analytical description of that particular edge, which can be used to calculate the exact fraction of the pixel that will be covered by the triangle. This is very powerful if the edge was calculated correctly, since it eliminates the need for multiple sub-pixel coverage samples. In fact if the coverage amount is used to blend the triangle color with the color behind that triangle, the results will match the output of standard MSAA rendering with infinite subsamples! The following image shows some of the patterns used for edge detection, and the result after blending:

MLAA edge detection using pattern recognition (from MLAA: Efficiently Moving Antialiasing from the GPU to the CPU)

The major problems with MLAA and similar techniques occur when the algorithm does not accurately estimate the triangle edges. Looking at only a single frame, the resulting artifacts would be difficult or impossible to discern. However in a video stream the problems become apparent due to sub-pixel rotations of triangles that occur as the triangle or the camera move in world space. Take a look at the following image:

Two different triangle edge orientations resulting in the same rasterization pattern

In this image, the blue line represents a triangle edge during one frame and the green line represents the same triangle edge in the following frame. The orientation of the edge relative to the pixels has changed, however in both cases only the leftmost pixel is marked as being “covered” by the rasterizer. Consequently the same pixel pattern (marked by the blue squares in the image) is produced by the rasterizer for both frames, and the MLAA algorithm detects the same edge pattern (denoted by the thick red line in the image). As the edge continues rotating, eventually it will cover the top-middle pixel’s sample point and that pixel will “turn on”. In the resulting video stream that pixel will appear to “pop on”, rather than smoothly transitioning from a non-covered state to a covered state. This is a trademark temporal artifact of geometric aliasing, and MLAA is incapable of reducing it. The artifact can be even more objectionable for thin or otherwise small geometry, where entire portions of the triangle will appear and disappear from frame to frame causing a “flickering” effect. MSAA and supersampling are able to reduce such artifacts due to the increased sampling rate used by the rasterizer, which results in several “intermediate” steps in the case of sub-pixel movement rather than pixels “popping” on and off. The following animated GIFs demonstrate this effect on a single rotating triangle4 (click on the images if they’re not animating for you):

Two animations of a rotating triangle. The top image has FXAA enabled, which uses techniques similar to MLAA to reconstruct edges. The bottom image uses 4x MSAA, which supersamples the visibility test at the edges of the triangle. Notice how in the MSAA image pixels will transition through intermediate values as the triangle moves across the sub-pixel sample points. The FXAA image lacks this characteristic, despite producing smoother gradients along the edges.

Another potential issue with MLAA and similar algorithms is that they may fail to detect edges or detect “false” edges if only color information is used. In such cases the accuracy of the edge detection can be augmented by using a depth buffer and/or a normal buffer. Another potential issue is that the algorithm uses the color adjacent to a triangle as a proxy for the color behind the triangle, which could actually be different. However this tends to be non-objectionable in practice.

Footnotes

1. The rasterization process on a modern GPU can actually be quite a bit more complicated than this, but those details aren’t particularly relevant to the scope of this article.
2. This mask is directly available to ps_5_0 pixel shaders in DX11 via the SV_Coverage system value.
3. The gamma-space rendering commonly used in the days before HDR would actually produce gradients that weren’t completely smooth, although later GPU’s supported performing the resolve in linear space. Either way the results were pretty close to being perceptually smooth, at least compared to the results that can occur with HDR rendering.
4. These animations were captured from the sample application that I’m going to discuss in the next article. So if you’d like to see live results without compression, you can download the sample app from that article.

References

[1]http://msdn.microsoft.com/en-us/library/windows/desktop/cc627092%28v=vs.85%29.aspx
[2]http://www.nvidia.com/object/coverage-sampled-aa.html
[3]http://developer.amd.com/Resources/archive/ArchivedTools/gpu/radeon/assets/EQAA Modes for AMD HD 690 Series Cards.pdf
[4]http://www.humus.name/index.php?page=3D&ID=77
[5]MLAA: Efficiently Moving Antialiasing from the GPU to the CPU

Next article in the series: Experimenting with Reconstruction Filters for MSAA Resolve