SIGGRAPH Follow-Up: 2015 Edition

SIGGRAPH 2015 wrapped up just a few days ago, and it really was fantastic this year! There were tons of great content, and I got a chance to meet up with some of the best graphics programmers in the industry. I wanted to thank everyone who came to my talk at Advances in Real-Time Rendering, as well as everyone who came to our talk at the Physically Based Shading course. It’s always awesome to see so many people interested in the latest rendering technology, and the other presenters really knocked it out of the park in both courses. I’d also like to thank Natalya Tatarchuk for organizing the Advances course year after year. It’s really amazing when you look back on the 10 years’ worth of high-quality material that she’s assembled, and this year in particular was full of some really inspiring presentations. And of course I should thank Stephen Hill and Stephen McAuley as well, who also do a phenomenal job of cultivating top-notch material from both games and film.

For my Advances talk, you can find the slides on the company website. They should also be up on the Advances course site in the near future. If you haven’t seen it yet, the talk is primarily focused on the antialiasing of The Order, with a section about shadows at the end. There are also some bonus slides about the decal system that I made for The Order, which uses deferred techniques to accumulate a few special-case decal types onto our forward-rendered geometry. For the antialiasing, the tech and techniques presented aren’t really anything I’d consider to be particularly novel or groundbreaking. However I wanted to give an overview of the problem space as well as our particular approach for handling aliasing, and I hope that came across in the presentation. One thing I really wanted to touch on more was that I firmly believe that we need to go much deeper if we really want to fix aliasing in games. Things like temporal AA and SMAA are fantastic in that they really do make things look a whole lot better, but they’re still fundamentally limited in several ways. On the other hand, just brute-forcing the problem by increasing sampling rates isn’t really a scalable solution in the long term. In some cases we’re also undersampling so badly that a 2x or 4x increase in our sampling rates isn’t going to even come close to fixing the issue. What I’d really like to see is more work on figuring out how to use smarter sampling patterns (no more screen-space uniform grids!), and also how to properly prefilter content so that we’re not always undersampling. This was actually something brought up by Marco Salvi in his excellent AA talk that was a part of the Open Problems in Real-Time Rendering course, which I was very happy to see. It was also really inspiring to see Alex Evans describe how he strived for filterable scene representations in his talk from the Advances course.

In case you missed it, I uploaded a full antialiasing code sample to GitHub to accompany the talk. The code uses my usual sample framework and coding style, which means you can grab it and build it with VS 2013 with no external dependencies. There are also pre-compiled binaries in the releases section, in case you would just like to view the app or play around with the shaders. The sample is essentially a successor to the MSAAFilter sample that I put out nearly 3 years ago, which accompanied a blog post where I shared some of my research on using higher-order filtering with MSAA resolves. The AA work in The Order is in many ways the natural conclusion of that work, and the new sample reflects that. If you load up the sample, you’ll notice that the default scene is a really terrible case for geometric and specular aliasing: it’s a rather high-polygon mesh, with lighting from both a directional light as well as from the environment. I like to evaluate flickering reduction by setting “Model Rotation Speed” to 1.0, which causes the scene to automatically rotate around its Y axis. The default settings are also fairly close to what we shipped with in The Order, although not exactly the same due to some PS4-specific tweaks. The demo also defaults to a 2x jitter pattern, which we didn’t use in The Order. One possible avenue that I never really explored was to experiment with more variation in the MSAA subsample patterns. This is something that you can do on PS4 (as demonstrated in Michal Drobot’s talk about HRAA in Far Cry 4), and you can also do it on recent Nvidia hardware using their proprietary NVAPI. A really simple thing to try would be to implement interleaved sampling, although it could potentially make the resolve shader more expensive.
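If you haven’t dug into the MSAAFilter sample before, the core idea behind the “higher-order filtering” mentioned above is to replace the standard box resolve with a custom resolve that weights subsamples from a small pixel neighborhood using a reconstruction filter. Here’s a minimal sketch of that idea, written from scratch rather than pulled from the sample, so the hard-coded 4x sample count and the simple cone kernel are just illustrative placeholder choices:

```hlsl
// Filtered MSAA resolve sketch: gather subsamples from a 3x3 pixel neighborhood and
// weight them with a reconstruction filter based on their distance from the output
// pixel center. Edge-of-screen bounds handling is omitted for brevity.
Texture2DMS<float4, 4> MSAATexture : register(t0);

float FilterWeight(in float2 offset)
{
    // Placeholder kernel: any radially symmetric filter (B-spline, Gaussian, etc.) works
    float r = length(offset);
    return saturate(1.0f - r / 1.5f);   // simple cone filter with a 1.5 pixel radius
}

float4 ResolvePS(in float4 pos : SV_Position) : SV_Target
{
    const int SampleRadius = 1;
    float4 sum = 0.0f;
    float totalWeight = 0.0f;

    for(int y = -SampleRadius; y <= SampleRadius; ++y)
    {
        for(int x = -SampleRadius; x <= SampleRadius; ++x)
        {
            int2 samplePos = int2(pos.xy) + int2(x, y);
            for(uint subSample = 0; subSample < 4; ++subSample)
            {
                // Subsample offset relative to the output pixel center, in pixel units
                float2 offset = float2(x, y) + MSAATexture.GetSamplePosition(subSample);
                float weight = FilterWeight(offset);
                sum += MSAATexture.Load(samplePos, subSample) * weight;
                totalWeight += weight;
            }
        }
    }

    return sum / max(totalWeight, 0.0001f);
}
```

The resolve shader in the actual sample is quite a bit more involved, but this is the basic shape of it.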

As for the talk that Dave and I gave at Physically Based Shading, I hope that the images spoke for themselves in terms of showing how much better things looked once we made the switch from SH/H-basis to Spherical Gaussians. It was a very late and risky change for the project, but fortunately it paid off for us by substantially improving the overall visual quality. The nice thing is that it’s pretty easy to understand why the switch improved things. Previously, we partitioned lighting response into diffuse and specular. We then took advantage of the response characteristics to store the input for both responses in two separate ways: for diffuse, we used high spatial resolution but with low angular resolution (SH lightmaps), while for specular we used low spatial resolution but with high angular resolution (sparse cubemap probes). By splitting specular into both low-frequency (high roughness) and high-frequency (low roughness) categories, we were able to use spatially dense sample points for a much broader range of surfaces. These surfaces with rougher specular then benefit from improved visibility/occlusion, which is usually the biggest issue with sparse cubemap probes. This obviously isn’t a new idea; in fact, Halo 3 was doing similar things all the way back in 2008! The main difference of course is that we were able to use SGs instead of SH, which gave us more flexibility in how we represented per-texel incoming radiance.

SGs can be a really useful tool for all kinds of things in graphics, and I think it would be great if we all added them to our toolbox. To aid with that, Dave Neubelt and Brian Karis are planning on putting out a helpful paper that can hopefully be to SGs what Stupid SH Tricks was to spherical harmonics. Dave and I have also been working on a code sample to release, which lets you switch between various methods for pre-computing both diffuse and specular lighting, as well as view a path-traced ground-truth render. I’m hoping to finish this soon, since I’m sure it would be very helpful to have working code examples for the various SG operations.
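In the meantime, here’s a minimal sketch of what an SG looks like in shader code, using the standard parameterization (an amplitude, a lobe axis, and a sharpness). The struct and function names here are just my own placeholders, and aren’t taken from the upcoming paper or sample:

```hlsl
// A Spherical Gaussian: G(v) = Amplitude * exp(Sharpness * (dot(Axis, v) - 1))
struct SG
{
    float3 Amplitude;   // value of the lobe along its axis
    float3 Axis;        // unit vector the lobe is centered on
    float Sharpness;    // lambda; higher values give a narrower lobe
};

// Evaluate the SG in direction 'dir' (assumed to be normalized)
float3 EvaluateSG(in SG sg, in float3 dir)
{
    return sg.Amplitude * exp(sg.Sharpness * (dot(sg.Axis, dir) - 1.0f));
}

// Integral of the SG over the entire sphere, which has a simple closed form
float3 SGIntegral(in SG sg)
{
    const float Pi = 3.14159265f;
    float expTerm = 1.0f - exp(-2.0f * sg.Sharpness);
    return 2.0f * Pi * (sg.Amplitude / sg.Sharpness) * expTerm;
}
```

A lot of useful operations (products of two SGs, integrals, inner products) have simple closed forms like the one above, which is a big part of what makes SGs so convenient as a basis for representing per-texel incoming radiance.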

Mitsuba Quick-Start Guide

Angelo Pesce’s recent blog post brought up a great point towards the end of the article: having a “ground truth” for comparison can be extremely important for evaluating your real-time techniques. For approximations like pre-integrated environment maps it can help visualize what kind of effect your approximation errors will have on a final rendered image, and in many other cases it can aid you in tracking down bugs in your implementation. On Twitter I advocated writing your own path tracer for such purposes, primarily because doing so can be an extremely educational experience. However not everyone has time to write their own path tracer, and then populate it with all of the required functionality. And even if you do have the time, it’s helpful to have a “ground truth for your ground truth”, so that you can make sure that you’re not making any subtle mistakes (which is quite easy to do with a path tracer!). To help in both of these situations, it’s really handy to have a battle-tested renderer at your disposal. For me, that renderer is Mitsuba.

Mitsuba is a free, open source (GPL), physically based renderer that implements a variety of material and volumetric scattering models, and also implements some of the latest and greatest integration techniques (bidirectional path tracing, photon mapping, Metropolis light transport, etc.). Since it’s primarily an academic research project, it doesn’t have all of the functionality and user-friendliness that you might get out of a production renderer like Arnold. That said, it can certainly handle many (if not all) of the cases that you would want to verify for real-time rendering, and any programmer shouldn’t have too much trouble figuring out the interface and file formats after spending some time reading the documentation. It also features a plugin system for integrating new functionality, which could potentially be useful if you wanted to try out a custom material model but still make use of Mitsuba’s ray-tracing/sampling/integration framework.

To help get people up and running quickly, I’ve come up with a “quick-start” guide that can show you the basics of setting up a simple scene and viewing it with the Mitsuba renderer. It’s primarily aimed at fellow real-time graphics engineers who have never used Mitsuba before, so if you belong in that category then hopefully you’ll find it helpful! The guide will walk you through how to import a scene from .obj format into Mitsuba’s internal format, and then directly manipulate Mitsuba’s XML format to modify the scene properties. Editing XML by hand is obviously not an experience that makes anyone jump for joy, but I think it’s a decent way to familiarize yourself with their format. Once you’re familiar with how Mitsuba works, you can always write your own exporter that converts from your own format.

1. Getting Mitsuba

The latest version of Mitsuba is available on this page. If you’re running a 64-bit version of Windows like me, then you can go ahead and grab the 64-bit version which contains pre-compiled binaries. There are also Mac and Linux versions if either of those is your platform of choice, however I will be using the Windows version for this guide.

Once you’ve downloaded the zip file, go ahead and extract it to a folder of your choice. Inside of the folder you should have mtsgui.exe, which is the simple GUI version of the renderer that we’ll be using for this guide. There’s also a command-line version called mitsuba.exe, should you ever have a need for that.

While you’re on the Mitsuba website, I would also recommend downloading the PDF documentation into the same folder where you extracted Mitsuba. The docs contain the full specification for Mitsuba’s XML file format, general usage information, and documentation for the plugin API.

2. Importing a Simple Scene

Now that we have Mitsuba, we can get to work on importing a simple scene into Mitsuba’s format so that we can render it. The GUI front-end is capable of importing scenes from both COLLADA (*.dae) and Wavefront OBJ (*.obj) file formats, and for this guide we’re going to import a very simple scene from an OBJ file that was authored in Maya. If you’d like to follow along on your own, then you can grab the “TestScene.obj” file from the zip file that I’ve uploaded here: https://mynameismjp.files.wordpress.com/2015/04/testscene.zip. Our scene looks like this in Maya:

Scene_Maya

As you can see, it’s a very simple scene composed of a few primitive shapes arranged in a box-like setup. To keep things really simple with the export/import process, all of the meshes have their default shader assigned to them.

To import the scene into Mitsuba, we can now run mtsgui.exe and select File->Import from the menu bar.  This will give you the following dialog:

Importer_Dialog

Go ahead and click the top-left button to browse for the .obj file that you’d like to import. Once you’ve done this, it will automatically fill in paths for the target directory and target file that will contain the Mitsuba scene definition. Feel free to change those if you’d like to create the files elsewhere. There’s also an option that specifies whether material colors and textures should be treated as being in sRGB or linear color space.

Once you hit “OK” to import the scene, you should now see our scene being rendered in the viewport:

Initial_Preview

What you’re seeing right now is the OpenGL realtime preview. The preview uses the GPU to render your scene with VPL approximations for GI, so that it can give you a rough idea of what your scene will look like once it’s actually rendered. Whenever you first open a scene you will get the preview mode, and you’ll also revert back to the preview mode whenever you move the camera.

Speaking of the camera, it uses a basic “arcball” system that’s pretty similar to what Maya uses. Hold the left mouse button and drag the pointer to rotate the camera around the focus point, hold the middle mouse button to pan the camera left/right/up/down, and hold the right mouse button to move the camera along its local Z axis (you can also use the mouse wheel for this).

3. Configuring and Rendering

Now that we have our scene imported, let’s try doing an actual render. First, click the button in the toolbar with the gear icon. It should bring up the following dialog, which lets you configure your rendering settings:

PT_Settings

That first setting specifies which integrator you want to use for rendering. If you’re not familiar with the terminology being used here, an “integrator” is basically the overall rendering technique used for computing how much light is reflected back towards the camera for every pixel. If you’re not sure which technique to use, the path tracer is a good default choice. It makes use of unbiased Monte Carlo techniques to compute diffuse and specular reflectance from both direct and indirect light sources, which essentially means that if you increase the number of samples it will always converge on the “correct” result. The main downside is that it can generate noisy results for scenes where a majority of surfaces don’t have direct visibility of emissive light sources, since the paths are always traced starting at the camera. The bidirectional path tracer aims to improve on this by also tracing additional paths starting from the light sources. The regular path tracer also won’t handle volumetrics, and so you will need to switch to the volumetric path tracer if you ever want to experiment with that.

For a path tracer, the primary quality setting is the “Samples per pixel” option. This dictates how many samples to take for every pixel in the output image, and so you can effectively think of it as the amount of supersampling. Increasing it will reduce aliasing from the primary rays, and also reduce the variance in the results of computing reflectance off of the surfaces. Using more samples will of course increase the rendering time as well, so use it carefully. The “Sampler” option dictates the strategy used for generating the random samples that are used for monte carlo integration, which can also have a pretty large effect on the resulting variance. I would suggest reading through Physically Based Rendering if you’d like to learn more about the various strategies, but if you’re not sure then the “low discrepancy sampler” is a good default choice. Another important option is the “maximum depth” setting, which essentially lets you limit the renderer to using a fixed number of bounces. Setting it to 1 only gives you emissive surfaces and lights (light -> camera), setting it to 2 gives you emissive + direct lighting on all surfaces (light -> surface -> camera), setting it to 3 gives you emissive + direct lighting + 1 bounce of indirect lighting (light -> surface -> surface -> camera), and so on. The default value of -1 essentially causes the renderer to keep picking paths until it hits a light source,  or the transmittance back to the camera is below a particular threshold.

Once you’ve configured everything the way I did in the picture above, go ahead and hit OK to close the dialog. After that, press the big green “play” button in the toolbar to start the renderer. Once it starts, you’ll see an interactive view of the renderer completing the image one tile at a time. If you have a decent CPU it shouldn’t take more than 10-15 seconds to finish, at which point you should see this:

Initial_PT

Congratulations, you now have a path-traced rendering of the scene!

4. The Scene File Format

Now that we have a basic scene rendering, it’s time to dig into the XML file for the scene and start customizing it. Go ahead and open up “TestScene.xml” in your favorite text editor, and have a look around. It should look like this:

Initial_XML

If you scroll around a bit, you’ll see declarations for various elements of the scene. Probably what you’ll notice first is a bunch of “shape” declarations: these are the various meshes that make up the scene. Since we imported from .obj, Mitsuba automatically generated a binary file called “TestScene.serialized” from our .obj file containing the actual vertex and index data for our meshes, which is then referenced by the shapes. Mitsuba can also directly reference .obj or .ply files in a shape, which is convenient if you don’t want to go through Mitsuba’s import process. It also supports hair meshes, heightfields from an image file, and various primitive shapes (sphere, box, cylinder, rectangle, and disk). Note that shapes support transform properties as well as transform hierarchies, which you can use to position your meshes within the scene as you see fit. See section 8.1 of the documentation for a full description of all of the shape types, and their various properties.

For each shape,  you can see a “bsdf” property that specifies the BSDF type to use for shading the mesh. Currently all of the shapes are specifying that they should use the “diffuse” BSDF type, and that the BSDF should use default parameters. You might also notice there’s a separate bsdf declaration towards the top of the file, with an ID of “initialShadingGroup_material”. This comes from the default shader that Maya applies to all meshes, which is also reflected in the .mtl file that was generated along with the .obj file. This BSDF is not actually being used by any of the shapes in the scene, since they all are currently specifying that they just want the default “diffuse” BSDF. In the next section I’ll go over how we can create and modify materials, and then assign them to meshes.

If you scroll way down to the bottom, you’ll see the camera and sensor properties, which looks like this:

Initial_Camera_XML

You should immediately recognize some of your standard camera properties, such as the clip planes and FOV. Here you can also see an example of a transform property, which is using the “lookAt” method for specifying the transform. Mitsuba also supports specifying transforms as translation + rotation + scale, or directly specifying the transformation matrix. See section 6.1.6 of the documentation for more details.

If you decide to manually update any of the properties in the scene, you can tell the GUI to re-load the scene from disk by clicking the button with the blue circular arrow on it in the toolbar. Just be aware that if you save the file from the GUI app, it may overwrite some of your changes. So if you decide to set up a nice camera position in the XML file, make sure that you don’t move the camera in the app and then save over it!

5. Specifying Materials

Now let’s assign some materials to our meshes, so that we can start making our scene look interesting. As we saw previously, any particular shape can specify which BSDF model it should use as well as various properties of that BSDF. Currently, all of our meshes are using the “diffuse” BSDF, which implements a simple Lambertian diffuse model. There are many BSDF types available in Mitsuba, which you can read about in section 8.2 of the documentation. To start off, we’re going to use the “roughplastic” model for a few of our meshes. This model gives you a classic diffuse + specular combination, where the diffuse is Lambertian and the specular can use one of several microfacet models. It’s a good default choice for non-metals, and thus can work well for a wide variety of opaque materials. Let’s go down to about line 36 of our scene file, and make the following changes:

MaterialChanges_XML

As you can see, we’ve added BSDF properties for 4 of our meshes. They’re all configured to use the “roughplastic” BSDF with a GGX distribution, a roughness of 0.1, and an IOR of 1.49. Unfortunately Mitsuba does not support specifying the F0 reflectance value for specular, and so we must specify the interior and exterior IOR instead (exterior IOR defaults to “air”, and so we can leave it at its default value). You can also see that I specified diffuse reflectance values for each shape, with a different color for each. For this I used the “srgb” property, which specifies that the color is in sRGB color space. You can also use the “rgb” property to specify linear values, or the “spectrum” property for spectral rendering.
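If you’re more used to thinking in terms of an F0 reflectance value, the standard relation F0 = ((n_interior - n_exterior) / (n_interior + n_exterior))^2 makes it easy to sanity-check the numbers: an interior IOR of 1.49 against air (IOR of 1.0) works out to (0.49 / 2.49)^2 ≈ 0.039, or roughly the familiar 4% reflectance of common dielectrics.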

After making these changes, go ahead and click the “reload” button in Mitsuba followed by the “start” button to re-render the image. We should now get the following result:

Material_PT

Nice! Our results are noisier on the right side due to specular reflections from the sun, but we can now clearly see indirect specular in addition to indirect diffuse.

To simplify setting up materials and other properties, Mitsuba supports using references instead of directly specifying shape properties. To see how that works, let’s delete the “initialShadingGroup_material” BSDF declaration at line 11 and replace it with a new one that we will reference from the cylinder and torus meshes:

MaterialChanges2_XML

If you look closely, you’ll see that for this new material I’m also using a texture for the diffuse reflectance. When setting the “texture” property to the “bitmap” type, you can tell Mitsuba to load an image file off disk. Note that Mitsuba also supports a few built-in procedural textures that you can use, such as a checkerboard and a grid. See section 8.3 for more details.

After refreshing, our render should now look like this:

Material2_PT

To finish up with materials, let’s assign a more interesting material to the sphere in the back:

MaterialChanges3_XML

If we now re-render with a sample count of 256 to reduce the variance, we get this result:

Material3_PT

6. Adding Emitters

Up until this point, we’ve been using Mitsuba’s default lighting environment for rendering. Mitsuba supports a variety of emitters that mostly fall into one of 3 categories: punctual lights, area lights, or environment emitters. Punctual lights are your typical point, spot, and directional lights that are considered to have an infinitesimally small area. Area lights are arbitrary meshes that uniformly emit light from their surface, and therefore must be used with a corresponding “shape” property. Environment emitters are infinitely distant sources that surround the entire scene, and can either use an HDR environment map, a procedural sun and sky model, or a constant value. For a full listing of all emitter types and their properties, consult section 8.8 of the documentation.

Now, let’s try adding an area light to our scene. Like I mentioned earlier, an area light emitter needs to be parented to a “shape” property that determines the actual 3D representation of the light source. While this shape could be an arbitrary triangle mesh if you’d like, it’s a lot easier to just use Mitsuba’s built-in primitive types instead. For our light source, we’ll use the “sphere” shape type so that we get a spherical area light source:

AreaLight_XML

After refreshing, our scene now looks like this:

AreaLight_PT

Notice how the sky and sun are now gone, since we now have an emitter defined in the scene. To replace the sky, let’s now try adding our own environment emitter that uses an environment map:

EnvMap_XML

The “uffizi.exr” file used here is an HDR light probe from USC’s high-resolution light probe image gallery. Note that this emitter does not support cubemaps, and instead expects a 2D image that uses equirectangular mapping. Here’s what it looks like rendered with the path tracer, using a higher sample count of 256 samples per pixel:

EnvMap_PT

7. Further Reading

At this point, you should hopefully understand the basics of how to use Mitsuba, and how to set up scenes in its XML file format. There’s obviously quite a bit of functionality that I didn’t cover, which you can read about in the documentation. If you’d like to know more about how Mitsuba works, I would very strongly recommend reading through Physically Based Rendering. Mitsuba is heavily based on pbrt (which is the open-source renderer described in the book), and the book does a fantastic job of explaining all of the relevant concepts. It’s also a must-have resource if you’d like to write your own path tracer, which is something that I would highly recommend to anybody working in real-time graphics.

Oh and just in case you missed it, here’s the link to the zip file containing the example Mitsuba scene: https://mynameismjp.files.wordpress.com/2015/04/testscene.zip

Some Special Thanks

About a month ago, a little game called The Order: 1886 finally hit store shelves. Its release marks the culmination of my past 4 years at Ready At Dawn, which were largely devoted to developing the core rendering and engine technology that was ultimately used for the game. It’s also a major milestone for me personally, as it’s the first project that I’ve worked on full-time from start to finish. I’m of course immensely proud of what our team has managed to accomplish, and I feel tremendously lucky to go to work every day with such a talented and dedicated group of people.

My work wouldn’t have been possible without the help of many individuals both internal and external to our studio, but unfortunately there are just too many people to list in a “special thanks” section of the credits. So instead of that, I’ve made my own special thanks section! But before you read it, please know that this list (like everything else on my blog) only represents my own personal feelings and experiences, and does not represent the rest of the team or my company as a whole. And now, without further ado…

From SCE:

  • Tobias Berghoff, Steven Tovey, Matteo Scapuzzi, Benjamin Segovia, Vince Diesi, Chris Carty, Nicolas Serres, and everyone else at the Advanced Technology Group. These fine folks have developed a fantastic suite of debugging and performance tools for the PS4, and thus played a huge part in making the PS4 the incredible development platform that it is.
  • Dave Simpson, Cort Stratton, Cedric Lallian, and everyone else at the ICE Team for making the best graphics API that I’ve ever used.
  • Geoff Audy, Elizabeth Baumel, Peter Young, and everyone else at Foster City and elsewhere that provided valuable tools and developer support.

From the graphics community:

  • Andrew Lauritzen for providing all of his valuable research on shadows and deferred rendering. His work with VSM and SDSM formed the basis of our shadowing tech in The Order, and his amazing presentation + demo on deferred rendering back in 2010 helped set the standard for the state of the art in rendering tech for games. Also I’ll go ahead and thank him preemptively for (hopefully) forgiving me for thanking him here, even though I promised him last year at GDC that I would thank him in the credits.
  • Stephen Hill, whose excellent articles and presentations have always inspired me to strive for the next level of quality when sharing my own work.
  • Steve McAuley, who along with Mr. Hill is responsible for cultivating what is arguably the best collection of material year after year: the physically based shading course at SIGGRAPH. I’m very thankful to them for inviting us to participate in 2013, and then helping myself and Dave to deliver our presentation and course notes.
  • Naty Hoffman, Timothy Lottes, Adrian Bentley, Sébastien Lagarde, Johan Andersson, James McLaren, Brian Karis, Peter-Pike Sloan, Nathan Reed, Christer Ericson, Angelo Pesce, Michał Iwanicki, Christian Gyrling, Aras Pranckevičius, Michal Valient, Bart Wronski, Jasmin Patry, Michal Drobot, Jorge Jimenez, Padraic Hennessy, and anyone else who was kind enough to talk shop over a drink at GDC or SIGGRAPH.
  • Anybody who has ever given presentations, written articles, or otherwise contributed to the vast wealth of public knowledge concerning computer graphics. I’m really proud of the culture of openness and sharing among the graphics community, and also very grateful for it. We often stood on the shoulders of giants when creating the tech for our game, and we couldn’t have achieved what we did without drawing from the insights and techniques that were generously shared by other developers and researchers.

From Ready at Dawn:

  • Everyone that I worked with on the Tools and Engine team: Nick Blasingame, Joe Ferfecki, Gabriel Sassone, Sean Flynn, Simone Kulczycki, Scott Murray, Robin Green, Brett Dixon, Joe Schutte, David Neubelt, Alex Clark, Garret Foster, Tom Plunket, Aaron Halstead, Jeremy Nikolai, and Jamie Hayes. If you appreciate anything at all about the tech of The Order, then please think of these people! Also I need to give Garret an extra-special thanks for letting myself and Dave take all of the credit for his work on the material pipeline.
  • Our art director Nathan Phail-Liff, who not only directs a phenomenal group of artists but also had a major hand in shaping the direction of the tech that was developed for the project.
  • Anthony Vitale, who leads the best damn environment art team in the business. If you haven’t seen it yet, go check out their art dump on the polycount forums!
  • Ru Weerasuriya, Andrea Pessino, and Didier Malenfant for starting this amazing company, and for giving me an opportunity to work there.
  • Everyone else at Ready At Dawn that worked on the project with me. I don’t know if I can ever get used to the overwhelming amount of sheer talent and dedication that you can have at a game studio, and the people that I worked with had that in spades. I look forward to making many more beautiful things with them in the future!

If you read through all of this, I appreciate you taking the time to do so. It means a lot to me that the people listed above get their due credit, and so I hope that my humble little blog post has enough reach to give them some of the recognition that they rightfully deserve.

Shadow Sample Update

This past week a paper entitled “Moment Shadow Mapping” was released in advance of its planned publication at I3D in San Francisco. If you haven’t seen it yet, it presents a new method for achieving filterable shadow maps, with a promised improvement in quality compared to Variance Shadow Maps. Myself and many others were naturally intrigued, as filterable shadow maps are highly desirable for reducing various forms of aliasing. The paper primarily suggests two variants of the technique: one that directly stores the 1st, 2nd, 3rd, and 4th moments in an RGBA32_FLOAT texture, and another that uses an optimized quantization scheme (which essentially boils down to a 4×4 matrix transform) in order to use an RGBA16_UNORM texture instead. The first variant most likely isn’t immediately interesting for people working on games, since 128 bits per texel requires quite a bit of memory storage and bandwidth. It’s also the same storage cost as the highest-quality variant of EVSM (VSM with an exponential warp), which already provides high-quality filterable shadows with minimal light bleeding. So that really leaves us with the quantized 16-bit variant. Using 16-bit storage for EVSM results in more artifacts and increased light bleeding compared to the 32-bit version, so if MSM can provide better results, then it could potentially be useful for games.
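To make the two variants a bit more concrete, here’s what the moment generation boils down to for the 32-bit version, as a rough sketch in HLSL. The 16-bit version additionally runs these values through the paper’s 4×4 quantization transform, which I’ve omitted here:

```hlsl
// Convert a shadow map depth value into the four moments stored by MSM.
// For the RGBA32_FLOAT variant these are written out directly; the RGBA16_UNORM
// variant applies the optimized quantization transform from the paper afterwards.
float4 ComputeMSMMoments(in float depth)
{
    float z = depth;
    float z2 = z * z;
    return float4(z, z2, z2 * z, z2 * z2);
}
```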

I was eager to see the results myself, so I downloaded the sample app that the authors were kind enough to provide. Unfortunately their sample didn’t implement EVSM, and so I wasn’t able to perform any comparisons. However the implementation of MSM is very straightforward, and so I decided to just integrate it into my old shadows sample. I updated the corresponding blog post and re-uploaded the binary + source, so if you’d like to check it out for yourself then feel free to download it:

https://mynameismjp.files.wordpress.com/2013/09/shadows_msm.zip

The MSM techniques can be found under the “Shadow Mode” setting. I implemented both the Hamburger and Hausdorff methods, which are available as two separate shadow modes. If you change the VSM/MSM format from 32-bit to 16-bit, then the optimized quantization scheme will be used when converting from a depth map to a moment shadow map.

Unfortunately, my initial findings are rather mixed. The 32-bit variant of MSM seems to provide quality that’s pretty close to the 32-bit variant of EVSM, with slightly worse performance. Both techniques are mostly free of light bleeding, but still exhibit bleeding artifacts for the more extreme cases. As for the 16-bit variant, it again has quality that’s very similar to EVSM with equivalent bit depth. Both techniques require increasing the bias when using 16-bit storage in order to reduce precision artifacts, which in turn leads to increased bleeding artifacts. Overall MSM seems to behave a little bit better with regard to light leaking, which does make some sense considering that with EVSM you’re forced to use a lower exponent scale in order to prevent overflow. For either technique you can reduce the leaking using the standard VSM bleeding reduction technique, which essentially just remaps the output range of the shadow visibility term. Doing this can remove or reduce the bleeding artifacts, but will also result in over-darkening. Of course it’s also possible that I made a mistake when implementing MSM in my sample app, however the bleeding artifacts are also quite noticeable in the sample app provided by the authors.
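For reference, the “standard VSM bleeding reduction technique” I’m referring to is just a remap of the visibility term, along these lines (a sketch, with the function names being my own):

```hlsl
float Linstep(in float a, in float b, in float v)
{
    return saturate((v - a) / (b - a));
}

// Remap the [0, amount] portion of the visibility term to 0 and renormalize the rest.
// Larger values of 'amount' remove more bleeding, at the cost of over-darkening.
float ReduceLightBleeding(in float visibility, in float amount)
{
    return Linstep(amount, 1.0f, visibility);
}
```

The 0.25 reduction factor used for the screenshots below corresponds to the ‘amount’ parameter here.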

To finish up, here are some screenshots that I took that show an example of light bleeding. The EVSM images all use the 4-component variant, and the MSM images all use the 4-moment Hamburger variant. For the images with the bleeding “fix”, a reduction factor of 0.25 is used. In all cases the shadow map resolution is 2048×2048, with 4xMSAA, 16x anisotropic filtering, and mipmaps enabled for EVSM and MSM.

MSM Comparison Grid

Finally, here’s a few more images from an area with an even worse case for light bleeding:

MSM Comparison Grid 2

Come see me talk at GDC 2014

Myself and fellow lead graphics programmer David Neubelt will be at GDC next week, talking about the rendering technology behind The Order: 1886. Unfortunately the talk came together a bit late, and so it initially started from the talk that we gave back at SIGGRAPH last year (which is why it has the same title). However we don’t want to just rehash the same material, so we’ve added tons of new slides and revamped the old ones. The resulting presentation has much more of an engineering focus, and will cover a lot of the nuts and bolts behind our engine and the material pipeline. Some of the new things we’ll be covering include:

  • Dynamic lighting
  • Baked diffuse and specular lighting
  • Baked and real-time ambient occlusion
  • Our shader system, and how it interacts with our material pipeline
  • Details of how we perform compositing in our build pipeline
  • Hair shading
  • Practical implementation issues with our shading model
  • Performance and memory statistics
  • Several breakdowns showing how various rendering techniques combine to form the final image
  • At least one new funny picture

We want the presentation to be fresh and informative even if you saw our SIGGRAPH presentation, and I’m pretty sure that we have enough new material to ensure that it will be. So if you’re interested, come by at 5:00 PM on Wednesday. If you can’t make it, I’m going to try to make sure that we have the slides up for download as soon as possible.

Weighted Blended Order-Independent Transparency

https://mynameismjp.files.wordpress.com/2014/02/blendedoit.zip

Back in December, Morgan McGuire and Louis Bavoil published a paper called Weighted Blended Order-Independent Transparency. In case you haven’t read it yet (you really should!), it proposes an OIT scheme that uses a weighted blend of all surfaces that overlap a given pixel. In other words finalColor = w0 * c0 + w1 * c1 + w2 * c2…etc. With a weighted blend the order of rendering no longer matters, which frees you from the never-ending nightmare of sorting. You can actually achieve results that are very close to a traditional sorted alpha blend, as long as your per-surface weights are carefully chosen. Obviously it’s that last part that makes it tricky; consequently, McGuire and Bavoil’s primary contribution is proposing a weighting function that’s based on the view-space depth of a given surface. The reasoning behind using a depth-based weighting function is intuitive: closer surfaces obscure the surfaces behind them, so the closer surfaces should be weighted higher when blending. In practice the implementation is really simple: in the pixel shader you compute color, opacity, and a weight value based on both depth and opacity. You then output float4(color * opacity, opacity) * weight to one render target, while also outputting weight alone to a second render target (the first RT needs to be fp16 RGBA for HDR, but the second can just be R8_UNORM or R16_UNORM). For both render targets special blending modes are required, however they both can be represented by standard fixed-function blending available in GL/DX. After rendering all of your transparents, you then perform a full-screen “resolve” pass where you normalize the weights and then blend with the opaque surfaces underneath the transparents. Obviously this is really appealing since you completely remove any dependency on the ordering of draw calls, and you don’t need to build per-pixel lists or anything like that (which is nice for us mortals who don’t have pixel sync). The downside is that you’re at the mercy of your weighting function, and you potentially open yourself up to new kinds of artifacts depending on what sort of weighting function is used.
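To make the data flow concrete, here’s a rough HLSL sketch of the two passes as I described them above. This is not the exact math from the paper or from my sample: the weighting function in particular is just an illustrative stand-in, and the real technique is more careful about estimating total coverage (the paper tracks a “revealage” term as a product of (1 - alpha)):

```hlsl
// Guts of the transparent pixel shader: RT0 (fp16 RGBA, additive blending) accumulates
// weight * (premultiplied color, opacity), while RT1 accumulates the weights themselves
// so that the resolve pass can normalize. Names and constants here are placeholders.
struct OITOutput
{
    float4 Accum  : SV_Target0;
    float  Weight : SV_Target1;
};

float DepthWeight(in float viewSpaceZ, in float opacity)
{
    // Illustrative monotonically decreasing function of depth, scaled by opacity;
    // see the paper for the weighting functions that the authors actually propose
    return opacity * clamp(10.0f / (1.0f + viewSpaceZ * viewSpaceZ), 0.01f, 100.0f);
}

OITOutput AccumulateTransparent(in float3 color, in float opacity, in float viewSpaceZ)
{
    float w = DepthWeight(viewSpaceZ, opacity);

    OITOutput output;
    output.Accum = float4(color * opacity, opacity) * w;
    output.Weight = w;
    return output;
}

// Guts of the full-screen resolve: normalize by the accumulated values, then blend the
// averaged transparent "layer" over the opaque background
float3 ResolveTransparents(in float4 accum, in float weightSum, in float3 background)
{
    float3 avgColor = accum.rgb / max(accum.a, 0.0001f);
    float avgAlpha = accum.a / max(weightSum, 0.0001f);
    return lerp(background, avgColor, avgAlpha);
}
```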

When the paper came out I read it and I was naturally interested, so I quickly hacked together a sample project using another project as a base. Unfortunately over the past 2 months there have been holidays, the flu, and several weeks of long hours at work so that we could finish up a major milestone. So while I’ve had time to visit my family and optimize our engine for PS4, I haven’t really had any time to come up with a proper sample app that really lets you explore the BOIT technique in a variety of scenarios. However I really hate not having source code and a working sample app to go with papers, so I’m going to release it so that others at least have something they can use for evaluating the proposed algorithm. Hopefully it’s useful, despite how bad the test scene is. Basically it’s just a simple Cornell-box-like scene made up of a few walls, a sphere, a cylinder, a torus, and a sky (I normally use it for GI testing), but I added the ability to toggle through 2 alternative albedo maps: a smoke texture, and a tree texture. It doesn’t look great, but it’s enough to get a few layers of transparency with varying lighting conditions:

Scene_Normal

The sample is based on another project I’ve been working on for quite some time with my fellow graphics programmer David Neubelt, where we’ve been exploring new techniques for baking GI into lightmaps. For that project I had written a simple multithreaded ray-tracer using Embree 2.0 (which is an awesome library, and I highly recommend it), so I re-purposed it into a ground-truth renderer for this sample. You can toggle it on and off to see what the scene would look like with perfect sorting, which is useful for evaluating the “correctness” of the BOIT algorithm. It’s very fast on my mighty 3.6GHz Core i7, but it might chug a bit for those of you running on mobile CPUs. If that’s true I apologize, however I made sure that all of the UI and controls are decoupled from the ray-tracing framerate so that the app remains responsive.

I’d love to do a more thorough write-up that really goes into depth on the advantages and disadvantages in multiple scenarios, but I’m afraid I just don’t have the time for it at the moment. So instead I’ll just share some quick thoughts and screenshots:

It’s pretty good for surfaces with low to medium opacity  – with the smoke texture applied, it actually achieves decent results. The biggest issues are where there’s a large difference in the lighting intensity between two overlapping surfaces, which makes sense since this also applies to improperly sorted surfaces rendered with traditional alpha blending. Top image is with Blended OIT, bottom image is ground truth:

Smoke_BOIT
Smoke_Ref

If you look at the area where the closer, brighter surface overlaps the darker surface on the cylinder you can see an example of where the results differ from the ground-truth render. Fortunately the depth weighting produces results that don’t look immediately “wrong”, which is certainly a big step up from unsorted alpha blending. Here’s another image of the test scene with default albedo maps, with an overall opacity of 0.25:

LowOpacity_BOIT
LowOpacity_Ref

The technique fails for surfaces with high opacity – one case that the algorithm has trouble with is surfaces with opacity = 1.0. Since it uses a weighted blend, the weight of the closest surface has to be incredibly high relative to any other surfaces in order for it to appear opaque. Here’s the test scene with all surfaces using an opacity of 1.0:

HiOpacity_BOIT
HiOpacity_Ref

You’ll notice in the image that the algorithm does actually work correctly with opacity = 1 if there’s no overlap of transparent surfaces, so it does hold up in that particular case. However in general this problem makes it unsuitable for materials like foliage, where large portions of the surface need to be fully opaque. Here’s the test scene using a tree texture, which illustrates the same problem:

Tree_BOIT
Tree_Ref

Like I said earlier, you really need to make the closest surface have an extremely high weight relative to the surfaces behind it if you want it to appear opaque. One simple thing you could do is to keep track of the depth of the closest surface (say in a depth buffer), and then artificially boost the weight of surfaces whose depth matches the value in the depth buffer. If you do this (and also scale your “boost” factor by opacity) you get something like this:

Tree_DBWeight

This result looks quite a bit better, although messing with the weights changes the alpha gradients which gives it a different look. This approach obviously has a lot of failure cases. Since you’re relying on depth, you could easily create discontinuities at geometry edges. You can also get situations like this, where a surface visible through a transparent portion of the closest surface doesn’t get the weight boost and remains translucent in appearance:

DBWeightFail_BOIT

DBWeight_Ref

Notice how the second tree trunk appears to have a low opacity since it’s behind the closest surface. The other major downside is that you need to render your transparents in a depth prepass, which costs performance as well as the memory for an extra depth buffer. However you may already be doing that in order to optimize tiled forward rendering of transparents. Regardless I doubt it would be useful except in certain special-case scenarios, and it’s probably easier (and cheaper) to just stick to alpha-test or A2C for those cases.
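For what it’s worth, the weight-boosting hack from a couple of paragraphs back doesn’t amount to much code. Here’s roughly what I mean, with the boost factor and depth comparison epsilon being arbitrary numbers I made up:

```hlsl
// 'nearestDepth' comes from a depth prepass over the transparent geometry.
// If this fragment is the closest transparent surface at the pixel, inflate its weight
// so that it dominates the weighted average, scaled by opacity so translucent texels
// still blend normally.
float BoostNearestSurfaceWeight(in float baseWeight, in float opacity,
                                in float fragmentDepth, in float nearestDepth)
{
    const float BoostFactor = 1000.0f;
    const float DepthEpsilon = 0.0001f;

    if(abs(fragmentDepth - nearestDepth) < DepthEpsilon)
        return baseWeight * (1.0f + BoostFactor * opacity);

    return baseWeight;
}
```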

Is it usable? – I’m not sure yet. I feel like it would take a lot of testing with the wide range of transparents in our game before knowing if it’s going to work out. It’s too bad that it has failure cases, but if we’re going to be honest the bar is pretty damn low when it comes to transparents in games. In our engine we make an attempt to sort by depth, but our artists frequently have to resort to manually setting “sort priorities” in order to prevent temporal issues from meshes constantly switching their draw order. The Blended OIT algorithm on the other hand may produce incorrect results, but those results are stable over time. However I feel the much bigger issue with traditional transparent rendering is that ordering constraints are fundamentally at odds with rendering performance. Good performance requires using instancing, reducing state changes, and rendering to low-resolution render targets. All 3 of these are incompatible with rendering based on Z order, which means living with lots of sorting issues if you want optimal performance. With that in mind it really feels like it’s hard to do worse than the current status quo.

That’s about all I have for now. Feel free to download the demo and play around with it. If you missed it, the download link is at the top of the page. Also, please let me know if you have any thoughts or ideas regarding the practicality of the technique, since I would definitely be interested in discussing it further.

Sample Framework Updates

You may have noticed that my latest sample now has a proper UI instead of the homegrown sliders and keyboard toggles that I was using in my older samples. What you might not have noticed is that there’s a whole bunch of behind-the-scenes changes to go with that new UI! Before I ramble on, here’s a quick bullet-point list of the new features:

  • Switched to VS2012 and adopted a few C++11 features
  • New UI back-end provided by AntTweakBar
  • C#-based data-definition format for auto-generating UI
  • Shader hot-swapping
  • Better shader caching, and compressed cache files

It occurred to me a little while ago that I could try to develop my framework into something that enables rapid prototyping, instead of just being some random bits of cobbled-together code. I’m not sure if anybody else will use it for anything, but so far it’s working out pretty well for me.

VS 2012

A little while ago I switched to working in VS 2012 with the Windows 8 SDK for my samples, but continued using the VS 2010 project format and toolset in case anybody was stuck on 2010 and wanted to compile my code. For this latest sample I decided that legacy support wasn’t worth it, and made the full switch to the newer compiler. I haven’t done any really crazy C++11 things yet, but now that I’ve started using enum class and nullptr I never want to go back. Ever since learning C# I’ve always wanted C++ enums to use a similar syntax, and C++11 finally fulfilled my wish. I suspect that I’ll feel the same way when I start using non-static data member initializers (once I switch to VS 2013, of course).

New UI

When I found out about AntTweakBar and how easy it is to integrate, I felt a little silly for ever trying to write my own ghetto UI widgets. If you haven’t seen it before, it basically shows up as a window containing a list of tweakable app settings, which is exactly what I need in my sample apps. It also has a neat alternative to sliders where you sort of rotate a virtual lever, and the speed increases depending on how far away you move the mouse from the center point. Here’s what it looks like integrated into my shadows sample:

Ant_UI

If you’re thinking of using AntTweakBar in your code, you might want to grab the files TwHelper.h/cpp from my sample framework. They provide a bunch of wrapper functions over TwSetParam that add type safety, which makes the library a lot easier to work with. I also have a much more comprehensive set of Settings classes that further wrap those functions, but those are a lot more tightly coupled to the rest of the framework.

Shader Hot-Swapping

This one is a no-brainer, and I’m not sure why I waited so long to do it. I decided not to implement it using the Win32 file-watching API. We’ve used that at work, and it quickly became the most hated part of our codebase. Instead I took the simple and painless route of checking the timestamp of a single file every N frames, which works great as long as you don’t have thousands of files to go through (usually I only have a dozen or so).

UI Data Definition

For a long time I’ve been unhappy with my old way of defining application settings, and setting up the UI for them. Previously I did everything in C++ code by declaring a global variable for each setting. This was nice because I got type safety and VisualAssist auto-complete whenever I wanted to access a setting, but it was also a bit cumbersome because I always had to touch code in multiple places whenever I wanted to add a new setting. This was especially annoying when I wanted to access a setting in a shader, since I had to manually handle adding it to a constant buffer and setting the value before binding it to a shader. After trying and failing multiple times to come up with something better using just C++, I thought it might be fun to try something a bit less…conventional. Ultimately I drew inspiration from game studios that define their data in a non-C++ language and then use that to auto-generate C++ code for use at runtime. If you can do it this way you get the best of both worlds: you define data in a simple language that can express it elegantly, and then you still get all of your nice type-safety and other C++ benefits. You can even add runtime reflection support by generating code that fills out data structures containing the info that you need to reflect on a type. It sounded crazy to do it for a sample framework, but I thought it would be fun to get out of my comfort zone a bit and try something new.

Ultimately I ended up using C# as my data-definition language. It not only has the required feature set, but I also used to be somewhat proficient in it several years ago. In particular I really like the Attribute functionality in C#,  and thought it would be perfect for defining metadata to go along with a setting. Here’s an example of how I ended up using them for the “Bias” setting from my Shadows sample:

BiasSetting

For enum settings, I just declare an enum in C# and use that new type when defining the setting. I also used an attribute to specify a custom string to associate with each enum value:

EnumSetting

To finish it up, I added simple C# proxies for the Color, Orientation, and Direction setting types supported by the UI system.

Here’s how it all ties together: I define all of the settings in a file called AppSettings.cs, which includes classes for splitting the settings into groups. This file is added to my Visual Studio C++ project, and set to use a custom build step that runs before the C++ compilation step. This build step passes the file to SettingsCompiler.exe, which is a command-line C# app created by a C# project in the same solution. This app basically takes the C# settings file, and invokes the C# compiler (don’t you love languages that can compile themselves?) so that it can be compiled as an in-memory assembly. That assembly is then reflected to determine the settings that are declared in the file, and also to extract the various metadata from the attributes. Since the custom attribute classes need to be referenced by both the SettingsCompiler exe as well as the settings code being compiled, I had to put all of them in a separate DLL project called SettingsCompilerAttributes. Once all of the appropriate data is gathered, the C# project then generates and outputs AppSettings.h and AppSettings.cpp. These files contain global definitions of the various settings using the appropriate C++ UI types, and also contain code for initializing and registering those settings with the UI system. These files are added to the C++ project, so that they can be compiled and linked just like any other C++ code. On top of that, the settings compiler also spits out an HLSL file that declares a constant buffer containing all of the relevant settings (a setting can opt out of the constant buffer if it wants by using an attribute). The C++ files then have code generated for creating a matching constant buffer resource, filling it out with the setting values once a frame, and binding it to all shader stages at the appropriate slot. This means that all a shader needs to do is #include the file, and it can use the setting. Here’s a diagram that shows the basic setup for the whole thing:

SettingsCompiler

This actually works out really nicely in Visual Studio: you just hit F7 and everything builds in the right order. The settings compiler will gather errors from the C# compiler and output them to stdout, which means that if you have a syntax error it gets reported to the VS output window just as if you were compiling it normally through a C# project. MSBuild will even track the timestamp on AppSettings.cs, and won’t run the settings compiler unless it’s newer than the timestamp on AppSettings.h, AppSettings.cpp or AppSettings.hlsl. Sure it’s a really complicated and over-engineered way of handling my problem, but it works and I had fun doing it.

Future Work

I think that the next thing that I’ll improve will be model loading. It’s pretty limiting working with .sdkmesh, and I’d like to be able to work with a wider variety of test scenes. Perhaps I’ll integrate Assimp, or use it to make a simple converter to a custom format. I’d also like to flesh out my SH math and shader code a bit, and add some more useful functionality.

A Sampling of Shadow Techniques

A little over a year ago I was looking to overhaul our shadow rendering at work in order to improve overall quality, as well as simplify the workflow for the lighting artists (tweaking biases all day isn’t fun for anybody). After doing yet another round of research into modern shadow mapping techniques, I decided to do what I usually do and start working on a sample project that I could use as a platform for experimentation and comparison. And so I did just that. I had always intended to put it on the blog, since I thought it was likely that other people would be evaluating some of the same techniques as they upgrade their engines for next-gen consoles. But I was a bit lazy about cleaning it up (there was a lot of code!), and it wasn’t the most exciting thing that I was working on, so it sort-of fell by the wayside. Fast forward to a year later, and I found myself looking for a testbed for some major changes that I was planning for the settings and UI portion of my sample framework. My shadows sample came to mind since it was chock-full of settings, and that led me to finally clean it up and get it ready for sharing. So here it is! Note that I’ve fully switched over to VS 2012 now, so you’ll need it to open the project and compile the code. If you don’t have it installed, then you may need to download and install the VC++ 2012 Redistributable in order to run the pre-compiled binary.

https://mynameismjp.files.wordpress.com/2013/09/shadows_msm.zip

Update 9/12/2013
 – Fixed a bug with cascade stabilization behaving incorrectly for very small partitions
 – Restore previous min/max cascade depth when disabling Auto-Compute Depth Bounds

Thanks to Stephen Hill for pointing out the issues!

Update 9/18/2013
 – Ignacio Castaño was kind enough to share the PCF technique being used in The Witness, which he integrated into the sample under the name “OptimizedPCF”. Thanks Ignacio!

Update 11/3/2013
– Fixed SettingsCompiler string formatting issues for non-US cultures

Update 2/17/2015
– Fixed VSM/EVSM filtering shader that was broken with latest Nvidia drivers
– Added Moment Shadow Mapping
– Added notes about biasing VSM, EVSM, and MSM

The sample project is set up as a typical “cascaded shadow map from a directional light” scenario, in the same vein as the CascadedShadowMaps11 sample from the old DX SDK or the more recent Sample Distribution Shadow Maps sample from Intel. In fact I even use the same Powerplant scene from both samples, although I also added in a human-sized model so that you can get a rough idea of how well the shadowing looks on characters (which can be fairly important if you want your shadows to look good in cutscenes without having to go crazy with character-specific lights and shadows). The basic cascade rendering and setup is pretty similar to the Intel sample: a single “global” shadow matrix is created every frame based on the light direction and current camera position/orientation, using an orthographic projection with width, height, and depth equal to 1.0. Then for each cascade a projection is created that’s fit to the cascade, which is used for rendering geometry to the shadow map. For sampling the cascade, the projection is described as a 3D scale and offset that’s applied to the UV space of the “global” shadow projection. That way you just use one matrix in the shader to calculate shadow map UV coordinates + depth, and then apply the scale and offset to compute the UV coordinates + depth for the cascade that you want to sample. Unlike the Intel sample I didn’t use any kind of deferred rendering, so I decided to just fix the number of cascades at 4 instead of making it tweakable at runtime.
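In shader terms, the sampling setup described above boils down to something like this (a sketch with my own names; whether you apply the scale before or after the offset just depends on how those per-cascade values were derived on the CPU):

```hlsl
// One "global" shadow matrix shared by all cascades, plus a per-cascade scale and
// offset applied in the global projection's UV + depth space.
float3 GetCascadeShadowPosition(in float3 positionWS, in uint cascadeIdx,
                                in float4x4 globalShadowMatrix,
                                in float3 cascadeScales[4], in float3 cascadeOffsets[4])
{
    // UV coordinates + depth in the global shadow projection
    float3 shadowPos = mul(float4(positionWS, 1.0f), globalShadowMatrix).xyz;

    // Remap into the selected cascade's UV + depth space
    return shadowPos * cascadeScales[cascadeIdx] + cascadeOffsets[cascadeIdx];
}
```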

Cascade Optimization

Optimizing how you fit your cascades to your scene and current viewpoint is pretty crucial for reducing perspective aliasing and the artifacts that it causes. The old-school way of doing this for CSM is to chop up your entire visible depth range into N slices using some sort of partitioning scheme (logarithmic, linear, mixed, manual, etc.). Then for each slice of the view frustum, you’d tightly fit an AABB to the slice and use that as the parameters for your orthographic projection for that slice. This gives you the most optimal effective resolution for a given cascade partition, however with 2006-era shadow map sizes you still generally end up with an effective resolution that’s pretty low relative to the screen resolution. Combine this with 2006-era filtering (2×2 PCF in a lot of cases), and you end up with quite a bit of aliasing. This aliasing was exceptionally bad because your cascade projections translate and scale as the camera moves and rotates, which causes the rasterization sample points to change from frame to frame. The end result was crawling shadow edges, even from static geometry. The most popular solution for this problem was to trade some effective shadow map resolution for stable sample points that don’t move from frame to frame. This was first proposed (to my knowledge) by Michal Valient in his ShaderX6 article entitled “Stable Cascaded Shadow Maps”. The basic idea is that instead of tightly mapping your orthographic projection to your cascade split, you map it in such a way that the projection won’t change as the camera rotates. The way that Michal did it was to fit a sphere to the entire 3D frustum split and then fit the projection to that sphere, but you could do it any way that gives you a consistent projection size. To handle the translation issue, the projections are “snapped” to texel-sized increments so that you don’t get sub-texel sample movement. This ends up working really well, provided that your cascade partitions never change (which means that changing your FOV or near/far clip planes can cause issues). In general the stability ends up being a net win despite the reduced effective shadow map resolution, however small features and dynamic objects end up suffering.
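
As a rough illustration of the snapping idea (this isn’t Michal’s exact math, and the names are made up), you can keep the projection size fixed and snap the cascade center to texel-sized increments in light space before building the projection:

// Snap a cascade center so that camera translation only moves the projection
// in whole-texel steps. "lightBasis" is a hypothetical orthonormal light-space basis.
float3 SnapCascadeCenter(in float3 cascadeCenter, in float cascadeDiameter,
                         in float shadowMapSize, in float3x3 lightBasis)
{
    // With a fixed projection size, one shadow map texel covers a fixed world-space distance
    const float texelSize = cascadeDiameter / shadowMapSize;

    // Move into light space, snap X and Y, then move back (transpose == inverse here)
    float3 centerLS = mul(lightBasis, cascadeCenter);
    centerLS.xy = floor(centerLS.xy / texelSize) * texelSize;
    return mul(transpose(lightBasis), centerLS);
}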

2 years ago Andrew Lauritzen gave a talk on what he called “Sample Distribution Shadow Maps”, and released the sample that I mentioned earlier. He proposed that instead of stabilizing the cascades, we could instead focus on reducing wasted resolution in the shadow map to a point where effective resolution is high enough to give us sub-pixel resolution when sampling the shadow map. If you can do that then you don’t really need to worry about your projection changing frame to frame, provided that you use decent filtering when sampling the shadow map. The way that he proposed to do this was to analyze the depth buffer generated for the current viewpoint, and use it to automatically generate an optimal partitioning for the cascades. He tried a few complicated ways of achieving this that involved generating a depth histogram on the GPU, but also proposed a much more practical scheme that simply computed the min and max depth values. Once you have the min and max depth, you can very easily use that to clamp your cascades to the subset of your depth range that actually contains visible pixels. This might not sound like a big deal, but in practice it can give you huge improvements. The min Z in particular allows you to have a much more optimal partitioning, which you can get by using an “ideal” logarithmic partitioning scheme. The main downside is that you need to use the depth buffer, which puts you in the awful position of having the CPU dependent on results from the GPU if you want to do your shadow setup and scene submission on the CPU. In the sample code they simply stall and read back the reduction results, which isn’t really optimal at all in terms of performance. When you do something like this the CPU ends up waiting around for the driver to kick off commands to the GPU and for the GPU to finish processing them, and then you get a stall on the GPU while it sits around and waits for the CPU to start kicking off commands again. You can potentially get around this by doing what you normally do for queries and such, and deferring the readback for one or more frames. That way the results are (hopefully) already ready for readback, and so you don’t get any stalls. But this can cause you problems, since the cascades will trail a frame behind what’s actually on screen. So for instance if the camera moves towards an object, the min Z may not be low enough to fit all of the pixels in the first cascade and you’ll get artifacts. One potential workaround is to use the previous frame’s camera parameters to predict what the depth of a pixel will be for the next frame based on the instantaneous linear velocity of the camera, so that when you retrieve your min/max depth a frame late it actually contains the correct results. I’ve actually done this in the past (but not for this sample, I got lazy) and it works as long as the camera motion stays constant. However it won’t handle moving objects unless you store the per-pixel velocity with regards to depth and factor that into your calculations. The ultimate solution is to do all of the setup and submission on the GPU, but I’ll talk about that in detail later on.
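
If you want to picture what the min/max reduction looks like on the GPU, here’s a simplified sketch of a first reduction pass that outputs one min/max pair per 16×16 tile (a second pass would reduce those down to a single value). The names are hypothetical, and the sample’s actual reduction differs in the details:

Texture2D<float> DepthMap : register(t0);
RWStructuredBuffer<float2> ReductionOutput : register(u0);  // x = min, y = max per tile

groupshared float2 SharedMinMax[256];

[numthreads(16, 16, 1)]
void DepthReductionCS(in uint3 GroupID : SV_GroupID, in uint3 DispatchID : SV_DispatchThreadID,
                      in uint ThreadIndex : SV_GroupIndex)
{
    uint2 dims;
    DepthMap.GetDimensions(dims.x, dims.y);
    const uint2 samplePos = min(DispatchID.xy, dims - 1);

    // Ignore sky pixels (depth == 1.0) so that they don't drag the max out to the far plane
    const float depth = DepthMap[samplePos];
    const bool validPixel = depth < 1.0f;
    SharedMinMax[ThreadIndex] = float2(validPixel ? depth : 1.0f, validPixel ? depth : 0.0f);
    GroupMemoryBarrierWithGroupSync();

    // Standard parallel reduction within the thread group
    for(uint s = 128; s > 0; s >>= 1)
    {
        if(ThreadIndex < s)
        {
            SharedMinMax[ThreadIndex].x = min(SharedMinMax[ThreadIndex].x, SharedMinMax[ThreadIndex + s].x);
            SharedMinMax[ThreadIndex].y = max(SharedMinMax[ThreadIndex].y, SharedMinMax[ThreadIndex + s].y);
        }
        GroupMemoryBarrierWithGroupSync();
    }

    // One result per thread group
    if(ThreadIndex == 0)
    {
        const uint numGroupsX = (dims.x + 15) / 16;
        ReductionOutput[GroupID.y * numGroupsX + GroupID.x] = SharedMinMax[0];
    }
}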

These are the options implemented in the sample that affect cascade optimization:

  • Stabilize Cascades – enables cascade stabilization using the method from ShaderX6.
  • Auto-Compute Depth Bounds – computes the min and max Z on the GPU, and uses it to tightly fit the cascades as in SDSM.
  • Readback Latency – number of frames to wait for the depth reduction results
  • Min Cascade Distance/Max Cascade Distance – manual min and max Z when Auto-Compute Depth Bounds isn’t being used
  • Partition Mode – can be Logarithmic, PSSM (mix between linear and logarithmic), or manual
  • Split Distance 0/Split Distance 1/Split Distance 2/Split Distance 3 – manual partition depths
  • PSSM Lambda – mix between linear and logarithmic partitioning when PartitionMode == PSSM
  • Visualize Cascades – colors pixels based on which cascade they select

Shadow Filtering

Shadow map filtering is the other important aspect of reducing artifacts. Generally it’s needed to reduce aliasing due to undersampling the geometry being rasterized into the shadow map, but it’s also useful for cases where the shadow map itself is being undersampled by the pixel shader. The most common technique for a long time has been Percentage Closer Filtering (PCF), which basically amounts to sampling a normal shadow map, performing the shadow comparison, and then filtering the result of that comparison. Nvidia hardware has been able to do a 2×2 bilinear PCF kernel in hardware since…forever, and it’s required of all DX10-capable hardware. Just about every PS3 game takes advantage of this feature, and Xbox 360 games would too if the GPU supported it. In general you see lots of 2×2 or 3×3 grid-shaped PCF kernels with either a box filter or a triangle filter. A few games (notably the Crysis games) use a randomized sample pattern with sample points located on a disc, which trades regular aliasing for noise. With DX11 hardware there’s support for GatherCmp, which essentially gives you the results of 4 shadow comparisons performed for the relevant 2×2 group of texels. With this you can efficiently implement large (7×7 or even 9×9) filter kernels with minimal fetch instructions, and still use arbitrary filter kernels. In fact there was an article in GPU Pro called “Fast Conventional Shadow Filtering” by Holger Gruen that did exactly this, and even provided source code. It can be stupidly fast…in my sample app going from 2×2 PCF to 7×7 PCF only adds about 0.4ms when rendering at 1920×1080 on my AMD 7950. For comparison, a normal grid sampling approach adds about 2-3ms in my sample app for the maximum level of filtering. The big disadvantage to the fixed kernel modes is that they rely on the compiler to unroll the loops, which makes for some sloooowwww compile times. The sample app uses a compilation cache so you won’t notice it if you just start it up, but without the cache you’ll see that it takes quite a while due to the many shader permutations being used. For that reason I decided to stick with a single kernel shape (disc) rather than using all of the shapes from the GPU Pro article, since compilation times were already bad enough.
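
To give a feel for why GatherCmp helps so much, here’s a simplified 4×4 box PCF kernel built from just 4 fetches. This isn’t the kernel from the sample or the GPU Pro article (which carefully snaps to texel boundaries and weights the edge texels), just a sketch with made-up resource names:

Texture2D<float> ShadowMap : register(t0);
SamplerComparisonState ShadowSampler : register(s0);  // comparison sampler, LESS_EQUAL

float PCF4x4(in float2 shadowUV, in float receiverDepth, in float2 shadowMapSize)
{
    const float2 texelSize = 1.0f / shadowMapSize;
    float sum = 0.0f;

    // Each GatherCmp returns the comparison results for a 2x2 block of texels,
    // so a 4x4 box kernel only needs 4 fetches instead of 16
    [unroll]
    for(int y = -1; y <= 1; y += 2)
    {
        [unroll]
        for(int x = -1; x <= 1; x += 2)
        {
            const float4 cmp = ShadowMap.GatherCmp(ShadowSampler, shadowUV + float2(x, y) * texelSize, receiverDepth);
            sum += cmp.x + cmp.y + cmp.z + cmp.w;
        }
    }

    return sum / 16.0f;
}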

So far the only real competitor to gain any traction in games is Variance Shadow Maps (VSM). I won’t go deep into the specifics since the original paper and GPU Gems article do a great job of explaining it. But the basic gist is that you work in terms of the mean and variance of a distribution of depth values at a certain texel, and then use that distribution to estimate the probability of a pixel being in shadow. The end result is that you gain the ability to filter the shadow map without having to perform a comparison, which means that you can use hardware filtering (including mipmaps, anisotropy, or even MSAA) and that you can pre-filter the shadow map with a standard “full-screen” blur pass. Another important aspect is that you generally don’t suffer from the same biasing issues that you do with PCF. There are some issues of performance and memory, since you now need to store 2 high-precision values in your shadow map instead of just 1. But in general the biggest problem is light bleeding, which occurs when there’s a large depth range between occluders and receivers. Lauritzen attempted to address this a few years later by applying an exponential warp to the depth values stored in the shadow map, and performing the filtering in log space. It’s generally quite effective, but it requires high-precision floating point storage to accommodate the warping. For maximum quality he also proposed storing a negative term, which requires an extra 2 components in the shadow map. In total that makes for 4x FP32 components per texel, which is definitely not light in terms of bandwidth! However it arguably produces the highest-quality results, and it does so without having to muck around with biasing. This is especially true when pre-filtering, MSAA, anisotropic filtering, and mipmaps are all used, although each of those brings about additional cost. To provide some real numbers, using EVSM4 with 2048×2048 cascades, 4xMSAA, mipmaps, 8xAF, and highest-level filtering (9 texels wide for the first cascade) adds about 11.5ms relative to a 7×7 fixed PCF kernel. A more reasonable approach would be to go with 1024×1024 shadow maps with 4xMSAA, which is around 3ms slower than the PCF version.
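
For reference, the core of the VSM lookup is tiny: it’s just the Chebyshev upper bound, plus the usual remapping hack to cut down on light bleeding. This is a generic sketch with hypothetical names rather than the sample’s exact code:

Texture2D<float2> VSMMap : register(t0);   // stores (depth, depth^2)
SamplerState VSMSampler : register(s0);

float SampleVSM(in float2 shadowUV, in float receiverDepth, in float lightBleedingReduction)
{
    // The filtered moments give us the mean and variance of the occluder depth distribution
    const float2 moments = VSMMap.Sample(VSMSampler, shadowUV);
    const float mean = moments.x;
    const float variance = max(moments.y - mean * mean, 0.000001f);

    // Chebyshev's inequality gives an upper bound on the probability that the receiver is lit
    const float d = receiverDepth - mean;
    float pMax = variance / (variance + d * d);

    // Remap the lower part of pMax to reduce light bleeding (at the cost of some over-darkening)
    pMax = saturate((pMax - lightBleedingReduction) / (1.0f - lightBleedingReduction));

    return receiverDepth <= mean ? 1.0f : pMax;
}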

These are the shadow filtering modes that are implemented:

  • FixedSizePCF – optimized GatherCmp PCF with disc-shaped kernel
  • GridPCF – manual grid-based PCF sampling using NxN samples
  • RandomDiscPCF  – randomized samples on a Poisson disc, with optional per-pixel randomized rotation
  • OptimizedPCF – similar to FixedSizePCF, but uses bilinear PCF samples to implement a uniform filter kernel.
  • VSM – variance shadow maps
  • EVSM2 – exponential variance shadow maps, positive term only
  • EVSM4 – exponential variance shadow maps, both positive and negative terms
  • MSM Hamburger – moment shadow mapping, using the “Hamburger 4MSM” technique from the paper
  • MSM Hausdorff – moment shadow mapping, using the “Hausdorff 4MSM” technique from the paper

Here’s the options available in my sample related to filtering:

  • Shadow Mode – the shadow filtering mode, can be one of the above values
  • Fixed Filter Size – the kernel width in texels for the FixedSizePCF mode, can be 2×2 through 9×9
  • Filter Size – the kernel width in fractional world space units for all other filtering modes. For the VSM modes, it’s used in a pre-filtering pass.
  • Filter Across Cascades – blends between two cascade results at cascade boundaries to hide the transition
  • Num Disc Samples – number of samples to use for RandomDiscPCF mode
  • Randomize Disc Offsets – if enabled, applies a per-pixel randomized rotation to the disc samples
  • Shadow MSAA – MSAA level to use for VSM, EVSM, and MSM modes
  • VSM/MSM Format – the precision to use for VSM, EVSM, and MSM shadow maps. Can be 16-bit or 32-bit. For VSM the textures will use a UNORM format, for EVSM they will be FLOAT. For the MSM 16-bit version, the optimized quantization scheme from the paper is used to store the data in a UNORM texture.
  • Shadow Anisotropy – anisotropic filtering level to use for VSM, EVSM, and MSM
  • Enable Shadow Mips – enables mipmap generation and sampling for VSM, EVSM, and MSM
  • Positive Exponent/Negative Exponent – the exponential warping factors for the positive and negative components of EVSM
  • Light Bleeding Reduction – reduces light bleeding for VSM/EVSM/MSM, but results in over-darkening

In general I try to keep the filtering kernel fixed in world space across each cascade by adjusting the kernel width based on the cascade size. The one exception is the FixedSizePCF mode, which uses the same size kernel for all cascades. I did this because I didn’t think that branching over the fixed kernels would be a great idea. Matching the filter kernel for each cascade is nice because it helps hide the seams at cascade transitions, which means you don’t have to try to filter across adjacent cascades in order to hide them. It also means that you don’t have to use wider kernels on more distant pixels, although this can sometimes lead to visible aliasing on distant surfaces.

I didn’t put a whole lot of effort into the “RandomDiscPCF” mode, so it doesn’t produce optimal results. The randomization is done per-pixel, which isn’t great since you can clearly see the random pattern tiling over the screen as the camera moves. For a better comparison you would probably want to do something similar to what Crytek does, and tile a (stable) pattern over each cascade in shadowmap space.

Biasing

When using conventional PCF filtering, biasing is essential for avoiding “shadow acne” artifacts. Unfortunately, it’s usually pretty tricky to get it right across different scenes and lighting scenarios. Just a bit too much bias will end up killing shadows entirely for small features, which can cause characters to look very bad. My sample app exposes 3 kinds of biasing: manual depth offset, manual normal-based offset, and automatic depth offset based on receiver slope. The manual depth offset, called simply Bias in the UI, subtracts a value from the pixel depth used to compare against the shadow map depth. Since the shadow map depth is a [0, 1] value that’s normalized to the depth range of the cascade, the bias value represents a variable size in world space for each cascade. The normal-based offset, called Offset Scale in the UI, is based on “Normal Offset Shadows”, which was a poster presented at GDC 2011. The basic idea is that you create a new virtual shadow receiver position by offsetting from the actual pixel position in the direction of the normal. The trick is that you offset more when the surface normal is more perpendicular to the light direction. Angelo Pesce has a hand-drawn diagram explaining the same basic premise on his blog, if you’re having trouble picturing it. This technique can actually produce decent results, especially given how cheap it is. However since you’re offsetting the receiver position, you actually “move” the shadows a bit, which can look a little weird as you tweak the value. Since the offset is a world-space distance, in my sample I scale it based on the depth range of the cascade in order to make it consistent across cascade boundaries. If you want to try using it, I recommend starting with a small manual bias of 0.0015 or so and then slowly increasing the Offset Scale to about 1.0 or 2.0. Finally we have the automatic solution, where we attempt to compute the “perfect” bias amount based on the slope of the receiver. This setting is called Use Receiver Plane Depth Bias in the sample. To determine the slope, screen-space derivatives are used in the pixel shader. When it works, it’s fantastic. However it will still run into degenerate cases where it can produce unpredictable results, which is something that often happens when working with screen-space derivatives.
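
Here’s a rough sketch of the normal offset idea, using a common variant that scales the offset by one minus N·L (the names are made up, and the sample’s implementation differs in the details). The offset position is only used when computing the shadow map coordinates; lighting is still evaluated at the real surface position:

// Push the shadow receiver position along the surface normal, offsetting more as the
// surface becomes perpendicular to the light. "lightDir" points towards the light.
float3 ApplyNormalOffset(in float3 positionWS, in float3 normalWS,
                         in float3 lightDir, in float offsetScale)
{
    const float nDotL = saturate(dot(normalWS, lightDir));
    return positionWS + normalWS * (offsetScale * (1.0f - nDotL));
}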

There are also separate biasing parameters for the VSM and MSM techniques. The “VSM Bias” affects VSM and EVSM, while “MSM Depth Bias” and “MSM Moment Bias” are used for MSM. For VSM and 32-bit EVSM, a bias value of 0.01 (which corresponds to an actual value of 0.0001) is sufficient. However for 16-bit EVSM a bias of up to 0.5 (0.005) is required to alleviate precision issues. For MSM, only the moment bias is particularly relevant. This value needs to be at least 0.001-0.003 (0.000001-0.000003) for the 32-bit modes, while the quantized 16-bit mode requires a higher bias of around 0.03 (0.00003). Note that for both EVSM and MSM increasing the bias can also increase the amount of visible light bleeding.

GPU-Driven Cascade Setup and Scene Submission

This is a topic that’s both fun and really frustrating. It’s fun because you can really start to exploit the flexibility of DX11-class GPU’s, and begin to break free of the old “CPU spoon-feeds the GPU” model that we’ve been using for so long now. It’s also frustrating because the API’s still hold us back quite a bit in terms of letting the GPU generate its own commands. Either way it’s something that I see people talk about but don’t see a lot of people actually doing, so I thought I’d give it a try for this sample. There are actually 2 reasons to try something like this. The first is that if you can offload enough work to the GPU, you can avoid the heavy CPU overhead of frustum culling and drawing and/or batching lots and lots of meshes. The second is that it lets you do SDSM-style cascade optimization based on the depth buffer without having to read back values from the GPU to the CPU, which is always a painful way of doing things.

The obvious path to implementing GPU scene submission would be to make use of DrawInstancedIndirect/DrawIndexedInstancedIndirect. These API’s are fairly simple to use: you write the parameters to a buffer (with the parameters being the same ones that you would normally pass on the CPU for the non-indirect version), and then on the CPU you specify that buffer when calling the appropriate function. Since instance count is one of the parameters, you can implement culling on the GPU by setting the instance count to 0 when a mesh shouldn’t be rendered. However it turns out that this isn’t really useful, since you still need to go through the overhead of submitting draw calls and setting associated state on the CPU. In fact the situation is worse than doing it all on the CPU, since you have to submit each draw call even if it will be frustum culled.
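
For illustration, this is roughly what that would look like: a compute shader that culls each draw and writes the 5-uint argument block that DrawIndexedInstancedIndirect expects, zeroing the instance count for culled draws. It re-uses the same hypothetical DrawCalls/FrustumCull/NumDrawCalls helpers that show up in the snippets below:

RWByteAddressBuffer IndirectArgs : register(u0);

[numthreads(64, 1, 1)]
void CullToIndirectArgs(in uint3 DispatchID : SV_DispatchThreadID)
{
    const uint drawIdx = DispatchID.x;
    if(drawIdx >= NumDrawCalls)
        return;

    DrawCall drawCall = DrawCalls[drawIdx];
    const uint instanceCount = FrustumCull(drawCall) ? 1 : 0;

    // Each draw owns a slot of 5 uints in the indirect arguments buffer
    const uint argsOffset = drawIdx * 5 * 4;
    IndirectArgs.Store(argsOffset + 0, drawCall.NumIndices);   // IndexCountPerInstance
    IndirectArgs.Store(argsOffset + 4, instanceCount);         // InstanceCount (0 == culled)
    IndirectArgs.Store(argsOffset + 8, drawCall.StartIndex);   // StartIndexLocation
    IndirectArgs.Store(argsOffset + 12, 0);                    // BaseVertexLocation
    IndirectArgs.Store(argsOffset + 16, 0);                    // StartInstanceLocation
}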

Instead of using indirect draws, I decided to make a simple GPU-based batching system. Shadow map rendering is inherently more batchable than normal rendering, since you typically don’t need to worry about having specialized shaders or binding lots of textures when you’re only rendering depth. In my sample I take advantage of this by using a compute shader to generate one great big index buffer based on the results of a frustum culling pass, which can then be rendered in a single draw call. It’s really very simple: during initialization I generate a buffer containing all vertex positions, a buffer containing all indices (offset to reflect the vertex position in the combined position buffer), and a structured buffer containing the parameters for every draw call (index start, num indices, and a bounding sphere). When it’s time to batch, I run a compute shader with one thread per draw call that culls the bounding sphere against the view frustum. If it passes, the thread then “allocates” some space in the output index buffer by performing an atomic add on a value in a RWByteAddressBuffer containing the total number of culled indices that will be present in the output index buffer (note that on my 7950 you should be able to just do the atomic in GDS like you would for an append buffer, but unfortunately in HLSL you can only increment by 1 using IncrementCounter()). I also append the draw call data to an append buffer for use in the next pass:

const uint drawIdx = TGSize * GroupID.x + ThreadIndex;
if(drawIdx >= NumDrawCalls)
    return;

DrawCall drawCall = DrawCalls[drawIdx];

if(FrustumCull(drawCall))
{
    CulledDraw culledDraw;
    culledDraw.SrcIndexStart = drawCall.StartIndex;
    culledDraw.NumIndices = drawCall.NumIndices;
    DrawArgsBuffer.InterlockedAdd(0, drawCall.NumIndices, 
                                  culledDraw.DstIndexStart);

    CulledDrawsOutput.Append(culledDraw);
}

Once that’s completed, I then launch another compute shader that uses 1 thread group per draw call to copy all of the indices from the source index buffer to the final output index buffer that will be used for rendering. This shader is also very simple: it simply looks up the draw call info from the append buffer that tells it which indices to copy and where they should be copied, and then loops enough times for all indices to be copied in parallel by the different threads inside the thread group. The final result is an index buffer containing all of the culled draw calls, ready to be drawn in a single draw call:

const uint drawIdx = GroupID.x;
CulledDraw drawCall = CulledDraws[drawIdx];

for(uint currIdx = ThreadIndex; currIdx < drawCall.NumIndices; currIdx += TGSize)
{
    const uint srcIdx = currIdx + drawCall.SrcIndexStart;
    const uint dstIdx = currIdx + drawCall.DstIndexStart;
    CulledIndices[dstIdx] = Indices[srcIdx];
}

In order to avoid the min/max depth readback, I also had to port my cascade setup code to a compute shader so that the entire process could remain on the GPU. I was surprised to find that this was actually quite a bit more difficult than writing the batching system. Porting arbitrary C++ code to HLSL can be somewhat tedious, due to the various limitations of the language. I also ran into a rather ugly bug in the HLSL compiler where it kept trying to keep matrices in column-major order no matter what code I wrote and what declarations I used, which I suppose it tries to do as an optimization. However this really messed me up when I tried to write the matrix into a structured buffer and expected it to be row-major. My advice for the future: if you need to write a matrix from a compute shader, just use a StructuredBuffer<float4> and write out the rows manually. Ultimately after much hair-pulling I got it to work, and finally achieved my goal of 0-latency depth reduction with no CPU readback!
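
In other words, something like this (with hypothetical names), rather than writing a float4x4 member directly:

// Write the matrix out one row at a time so there's no ambiguity about whether
// the compiler packed it row-major or column-major.
RWStructuredBuffer<float4> CascadeMatrices : register(u0);

void WriteCascadeMatrix(in float4x4 shadowMatrix, in uint cascadeIdx)
{
    [unroll]
    for(uint i = 0; i < 4; ++i)
        CascadeMatrices[cascadeIdx * 4 + i] = shadowMatrix[i];  // row i of the matrix
}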

In terms of performance, the GPU-driven path comes in about 0.8ms slower than the CPU version when the CPU uses 1 frame of latency for reading back the min/max depth results. I’m not entirely sure why it’s that much slower, although I haven’t spent much time trying to profile the batching shaders or the cascade setup shader. I also wouldn’t be surprised if the actual mesh rendering ends up being a bit slower, since I use 32-bit indices when batching on the GPU as opposed to 16-bit when submitting with the CPU. However when using 0 frames of latency for the CPU readback, the GPU version turns it around and comes in a full millisecond faster on my PC. Obviously it makes sense that any additional GPU overhead would be more than made up for by avoiding the stall that occurs on the CPU and GPU when having to immediately read back data. One thing that I’ll point out is that the GPU version is quite a bit slower than the CPU version when VSM’s are used and pre-filtering is enabled. This is because the CPU path chooses an optimized blur shader based on the filter width for a cascade, and early-outs of the blur process if no filtering is required. For the GPU path I got lazy and just used one blur shader that uses a dynamic loop, and it always runs the blur passes even if no filtering is requested. There’s no technical reason why you couldn’t do it the same way as the CPU path with enough motivation and DrawIndirect’s.

DX11.2 Tiled Resources

Tiled resources seem to be the big-ticket item for the upcoming DX11.2 update. While the online documentation has some information about the new functions added to the API, there’s currently no information about the two tiers of tiled resource functionality being offered. Fortunately there is a sample app available that provides some clues. After poking around a bit last night, these were the differences that I noticed:

  • TIER2 supports MIN and MAX texture sampling modes that return the min or max of 4 neighboring texels. In the sample they use this when sampling a residency texture that tells the shader the highest-resolution mip level that can be used when sampling a particular tile. For TIER1 they emulate it with a Gather.
  • TIER1 doesn’t support sampling from unmapped tiles, so you have to either avoid it in your shader or map all unloaded tiles to dummy tile data (the sample does the latter)
  • TIER1 doesn’t support packed mips for texture arrays. From what I can gather, packed mips refers to packing multiple mips into a single tile.
  • TIER2 supports a new version of Texture2D.Sample that lets you clamp the mip level to a certain value. They use this to force the shader to sample from lower-resolution mip levels if the higher-resolution mip isn’t currently resident in memory. For TIER1 they emulate this by computing what mip level would normally be used, comparing it with the mip level available in memory, and then falling back to SampleLevel if the mip level needs to be clamped. There’s also another overload for Sample that returns a status variable that you can pass to a new “CheckAccessFullyMapped” intrinsic that tells you if the sample operation would access unmapped tiles. The docs don’t say that these functions are restricted to TIER2, but I would assume that to be the case.

Based on this information it appears that TIER1 offers all of the core functionality, while TIER2 has a few extras that bring it up to par with AMD’s sparse texture extension.
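
To make the TIER2 shader side a bit more concrete, here’s a rough sketch of what sampling a tiled texture with a mip clamp and residency feedback might look like. The resource names are made up, and this isn’t the code from the SDK sample:

Texture2D ColorMap : register(t0);
Texture2D<float> MinMipMap : register(t1);  // coarsest mip level resident for each tile region
SamplerState AnisoSampler : register(s0);
SamplerState MinMipSampler : register(s1);  // would be created with the new MIN filter mode on TIER2

float4 SampleTiled(in float2 uv)
{
    // Clamp the mip calculation so we never touch mip levels that aren't resident
    const float mipClamp = MinMipMap.Sample(MinMipSampler, uv);

    uint status = 0;
    float4 color = ColorMap.Sample(AnisoSampler, uv, int2(0, 0), mipClamp, status);

    // CheckAccessFullyMapped tells us whether the sample touched any unmapped tiles
    if(!CheckAccessFullyMapped(status))
        color = float4(1.0f, 0.0f, 1.0f, 1.0f);  // flag missing data for debugging

    return color;
}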

SIGGRAPH Follow-Up

So I’m hoping that if you’re reading this, you’ve already attended or read the slides from my presentation about The Order: 1886 that was part of the Physically Based Shading Course at SIGGRAPH last week. If not, go grab them and get started! If you haven’t read through the course notes already, there’s a lot of good info there; in fact, there’s almost 30 pages’ worth! The highlights include:

  • Full description of our Cook-Torrance and Cloth BRDF’s, including a handy optimization for the GGX Smith geometry term (for which credit belongs to Steve McAuley)
  • Analysis of our specular antialiasing solution
  • Plenty of details regarding the material scanning process
  • HLSL sample code for the Cook-Torrance BRDF’s as well as the specular AA roughness modification
  • Lots of beautiful LaTeX equations

If you did attend, I really appreciate you coming and I hope that you found it interesting. It was my first time presenting in front of such a large audience, so please forgive me if I seemed nervous. Also I’d like to give another thank you to anyone that came out for drinks later that night, I had a really great time talking shop with some of the industry’s best graphics programmers. And of course the biggest thanks goes out to Stephen Hill and Stephen McAuley for giving us the opportunity to speak at the best course at SIGGRAPH.

Anyhow…now that SIGGRAPH is finished and I have some time to sit down and take a breath, I wanted to follow up with some additional remarks about the topics that we presented. I also thought that if I post blogs more frequently, it might inspire a few other people to do the same.

Physically Based BRDF’s

I didn’t even mention anything about the benefits of physically based BRDF’s in our presentation because I feel like it’s no longer an issue that’s worth debating. We’ve been using some form of Cook-Torrance specular for about 2 years now, and it’s made everything in our game look better. Everything works more like it should work, and requires less fiddling and fudging on the part of the artists. We definitely had some issues to work through when we first switched (I can’t tell you how many times they asked me for direct control over the Fresnel curve), but there’s no question as to whether it was worth it in the long run. Next-gen consoles and modern PC GPU’s have lots of ALU to throw around, and sophisticated BRDF’s are a great way to utilize it.

Compositing

First of all, I wanted to reiterate that our compositing system (or “The Smasher” as it’s referred to in-house) is a completely offline process that happens in our content build system. When a level is built we request a build of all materials referenced in that level, which then triggers a build of all materials used in the compositing stack of each of those materials. Once all of the source materials are built, the compositing system kicks in and generates the blended parameter maps. The compositing process itself is very quick since we do it in a simple pixel shader run on the GPU, but it can take some time to build all of the source materials since doing that requires processing the textures referenced by each material. We also support using composited materials as a source material in a composite stack, which means that a single runtime material can potentially depend on an arbitrarily complex tree of source materials. To alleviate this we aggressively cache intermediate and output assets on local servers in our office, which makes build times pretty quick once the cache has been primed. We also used to compile shaders for every material, which caused long build times when changing code used in material shaders. This forced us to change things so that we only compile shaders directly required by a mesh inside of a level.

We also support runtime parameter blending, which we refer to as layer blending. Since it’s at runtime we limit it to 4 layers to prevent the shader from becoming too expensive, unlike the compositing system where artists are free to composite in as many materials as they’d like. It’s mostly used for environment geo as a way to add in some break-up and variety to tiling materials via vertex colors, as opposed to blending in small bits of a layer using blend maps. One notable exception is that we use the layer blending to add a “detail layer” to our cloth materials. This lets us keep the overall low-frequency material properties in lower-resolution texture maps, and then blend in tiling high-frequency data (such as the textile patterns acquired from scanning).
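
As a simple illustration of what 4-layer runtime blending might look like in a shader (this is a made-up sketch, not our actual material code):

Texture2D LayerAlbedoMaps[4] : register(t0);
Texture2D LayerRoughnessMaps[4] : register(t4);
SamplerState AnisoSampler : register(s0);

void BlendLayers(in float2 uv, in float4 vertexColor,
                 out float3 albedo, out float roughness)
{
    // Normalize the vertex color weights so the 4 layers always sum to one
    const float4 weights = vertexColor / max(dot(vertexColor, float4(1.0f, 1.0f, 1.0f, 1.0f)), 0.0001f);

    albedo = 0.0f;
    roughness = 0.0f;

    [unroll]
    for(uint i = 0; i < 4; ++i)
    {
        albedo += LayerAlbedoMaps[i].Sample(AnisoSampler, uv).rgb * weights[i];
        roughness += LayerRoughnessMaps[i].Sample(AnisoSampler, uv).r * weights[i];
    }
}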

One more thing that I really wanted to bring up is that the artists absolutely love it. The way it interacts with our template library allows us to keep low-level material parameter authoring in the hands of a few technical people, and enables everyone else to just think in terms of combining high-level “building blocks” that form an object’s final appearance. I have no idea how we would make a game without it.

Maya Viewer

A few people noticed that we have our engine running in a Maya viewport, and came up to ask me about it. I had thought that this was fairly common and thus uninteresting, but I guess I was wrong! We do this by creating a custom Viewport 2.0 render override (via MRenderOverride) that uses our engine to render the scene using Maya’s DX11 device (Maya 2013.5 and up support a DX11 mode for Viewport 2.0; for 2013 you have to use OpenGL). With their render pass API you can basically specify what things you want Maya to draw, so we render first (so that we can fill the depth buffer) and then have Maya draw things like HUD text and wireframe overlays for UI. The VP 2.0 API actually supports letting you hook into their renderer and data structures, which basically lets you specify things like vertex data and shader code while allowing Maya to handle the actual rendering. We don’t do that…we basically just give Maya’s device to our renderer and let our engine draw things the same way that it does in-game. To do this, we have a bunch of Maya plugin code that tracks objects in the scene (meshes, lights, plugin nodes, etc.) and handles exporting data off of them and converting it to data that can be consumed by our renderer. Maintaining this code has been a lot of work, since it basically amounts to an alternate path for scene data that’s in some ways very different from what we normally do in the game. However it’s a huge workflow enhancement for all of our artists, so we put up with it. I definitely never thought I would know so much about Maya and its API!

Some other cool things that we support inside of Maya:

  • We can embed our asset editors (for editing materials, particle systems, lens flares, etc.) inside of Maya, which allows for real-time editing of those assets while they’re being viewed
  • GI bakes are initiated from Maya and then run on the GPU, which allows the lighting artists to keep iterating inside of a single tool and get quick feedback
  • Our pre-computed visibility can also be baked inside of Maya, which allows artists to check for vis dropouts and other bugs without running the game

One issue that I had to work around in our Maya viewer was using the debug DX11 device. Since Maya creates the device we can’t control the creation flags, which means no helpful DEBUG mode to tell us when we mess something up. To work around this, I had to make our renderer create its own device and use that for rendering. Then when presenting the back buffer and depth buffer to Maya, we have to use DXGI synchronization to copy texture data from our device’s render targets to Maya’s render targets. It’s not terribly hard, but it requires reading through a lot of obscure DXGI documentation.

Sample Code

You may not have noticed, but there’s a sample app (alternate download link here) to go with the presentation! Dave and I always say that it’s a little lame to give a talk about something without providing sample code, so we had to put our money where our mouth is. It’s essentially a working implementation of our specular AA solution as well as our Cook-Torrance BRDF’s that uses my DX11 sample framework. Seeing for yourself with the sample app is a whole lot better than looking at pictures, since the primary concern is avoiding temporal aliasing as the camera or object changes position. These are all of the specular AA techniques available in the sample for comparison:

  • LEAN
  • CLEAN
  • Toksvig
  • Pre-computed Toksvig
  • In-shader vMF evaluation of Frequency Domain Normal Map Filtering
  • Pre-computed vMF evaluation (what we use at RAD)

For a ground-truth reference, I also implemented in-shader supersampling and texture-space lighting. The shader supersampling works by interpolating vertex attributes to random sample points surrounding the pixel center, computing the lighting, and then applying a bicubic filter. Texture-space lighting works exactly like you think it would: the lighting result is rendered to a FP16 texture that’s 2x the width and height of the normal map, mipmaps are generated, and the geometry is rendered by sampling from the lighting texture with anisotropic filtering. Since linear filtering is used both for sampling the lighting map and generating the mipmaps, the overall results don’t always look very good (especially under magnification). However the results are completely stable, since the sample positions never move relative to the surface being shaded. The pixel shader supersampling technique, on the other hand, still suffers from some temporal flickering since its sample positions do move relative to the surface, although it’s obviously significantly reduced with higher sample counts.

Dave and I had also intended to implement solving for multiple vMF lobes, but unfortunately we ran out of time and we weren’t able to include it. I’d like to revisit it at some point and release an updated sample that has it implemented. I don’t think it would actually be worth it for a game to store so much additional texture data, however I think it would be useful as a reference. It might also be interesting to see if the data could be used to drive a more sophisticated pre-computation step that bakes the anisotropy into the lower-resolution mipmaps.

Like I mentioned before, the sample also has implementations of the GGX and Beckmann-based specular BRDF’s described in our course notes. We also implemented our GGX and Cloth BRDF’s as .brdf files for Disney’s BRDF Explorer, which you can download here.