Some Special Thanks

About a month ago, a little game called The Order: 1886 finally hit store shelves. Its release marks the culmination of my past 4 years at Ready At Dawn, which were largely devoted to developing the core rendering and engine technology that was ultimately used for the game. It’s also a major milestone for me personally, as it’s the first project that I’ve worked on full-time from start to finish. I’m of course immensely proud of what our team has managed to accomplish, and I feel tremendously lucky to go to work every day with such a talented and dedicated group of people.

My work wouldn’t have been possible without the help of many individuals both internal and external to our studio, but unfortunately there are just too many people to list in a “special thanks” section of the credits. So instead of that, I’ve made my own special thanks section! But before you read it, please know that this list (like everything else on my blog) only represents my own personal feelings and experiences, and does not represent the rest of the team or my company as a whole. And now, without further ado…

From SCE:

  • Tobias Berghoff, Steven Tovey, Matteo Scapuzzi, Benjamin Segovia, Vince Diesi, Chris Carty, Nicolas Serres, and everyone else at the Advanced Technology Group. These fine folks have developed a fantastic suite of debugging and performance tools for the PS4, and thus played a huge part in making the PS4 the incredible development platform that it is.
  • Dave Simpson, Cort Stratton, Cedric Lallian, and everyone else at the ICE Team for making the best graphics API that I’ve ever used.
  • Geoff Audy, Elizabeth Baumel, Peter Young, and everyone else at Foster City and elsewhere that provided valuable tools and developer support.

From the graphics community:

  • Andrew Lauritzen for providing all of his valuable research on shadows and deferred rendering. His work with VSM and SDSM formed the basis of our shadowing tech in The Order, and his amazing presentation + demo on deferred rendering back in 2010 helped set the standard for the state of the art in rendering tech for games. Also I’ll go ahead and thank him preemptively for (hopefully) forgiving me for thanking him here, even though I promised him last year at GDC that I would thank him in the credits.
  • Stephen Hill, whose excellent articles and presentations have always inspired me to strive for the next level of quality when sharing my own work.
  • Steve McAuley, who along with Mr. Hill is responsible for cultivating what is arguably the best collection of material year after year: the physically based shading course at SIGGRAPH. I’m very thankful to them for inviting us to participate in 2013, and then helping Dave and me deliver our presentation and course notes.
  • Naty Hoffman, Timothy Lottes, Adrian Bentley, Sébastien Lagarde, Johan Andersson, James McLaren, Brian Karis, Peter-Pike Sloan, Nathan Reed, Christer Ericson, Angelo Pesce, Michał Iwanicki, Christian Gyrling, Aras Pranckevičius, Michal Valient, Bart Wronski, Jasmin Patry, Michal Drobot, Jorge Jimenez, Padraic Hennessy, and anyone else who was kind enough to talk shop over a drink at GDC or SIGGRAPH.
  • Anybody who has ever given presentations, written articles, or otherwise contributed to the vast wealth of public knowledge concerning computer graphics. I’m really proud of the culture of openness and sharing among the graphics community, and also very grateful for it. We often stood on the shoulders of giants when creating the tech for our game, and we couldn’t have achieved what we did without drawing from the insights and techniques that were generously shared by other developers and researchers.

From Ready at Dawn:

  • Everyone that I worked with on the Tools and Engine team: Nick Blasingame, Joe Ferfecki, Gabriel Sassone, Sean Flynn, Simone Kulczycki, Scott Murray, Robin Green, Brett Dixon, Joe Schutte, David Neubelt, Alex Clark, Garret Foster, Tom Plunket, Aaron Halstead, Jeremy Nikolai, and Jamie Hayes. If you appreciate anything at all about the tech of The Order, then please think of these people! Also I need to give Garret an extra-special thanks for letting Dave and me take all of the credit for his work on the material pipeline.
  • Our art director Nathan Phail-Liff, who not only directs a phenomenal group of artists but also had a major hand in shaping the direction of the tech that was developed for the project.
  • Anthony Vitale, who leads the best damn environment art team in the business. If you haven’t seen it yet, go check out their art dump on the polycount forums!
  • Ru Weerasuriya, Andrea Pessino, and Didier Malenfant for starting this amazing company, and for giving me an opportunity to work there.
  • Everyone else at Ready At Dawn that worked on the project with me. I don’t know if I can ever get used to the overwhelming amount of sheer talent and dedication that you can have at a game studio, and the people that I worked with had that in spades. I look forward to making many more beautiful things with them in the future!

If you read through all of this, I appreciate you taking the time to do so. It means a lot to me that the people listed above get their due credit, and so I hope that my humble little blog post has enough reach to give them some of the recognition that they rightfully deserve.

Shadow Sample Update

This past week a paper entitled “Moment Shadow Mapping” was released in advance of its planned publication at I3D in San Francisco. If you haven’t seen it yet, it presents a new method for achieving filterable shadow maps, with a promised improvement in quality compared to Variance Shadow Maps. Myself and many others were naturally intrigued, as filterable shadow maps are highly desirable for reducing various forms of aliasing. The paper primarily suggests two variants of the technique: one that directly stores the 1st, 2nd, 3rd, and 4th moments in an RGBA32_FLOAT texture, and another that uses an optimized quantization scheme (which essentially boils down to a 4×4 matrix transform) in order to use an RGBA16_UNORM texture instead. The first variant most likely isn’t immediately interesting for people working on games, since 128 bits per texel requires quite a bit of memory storage and bandwidth. It’s also the same storage cost as the highest-quality variant of EVSM (VSM with an exponential warp), which already provides high-quality filterable shadows with minimal light bleeding. So that really leaves us with the quantized 16-bit variant. Using 16-bit storage for EVSM results in more artifacts and increased light bleeding compared to the 32-bit version, so if MSM can provide better results, then it could potentially be useful for games.
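
As background for the comparisons below, everything in the VSM family (VSM, EVSM, and now MSM) stores per-texel depth statistics that can be filtered, and then uses those statistics to bound the fraction of the filter footprint that occludes the receiver. For plain VSM that bound is the one-tailed Chebyshev inequality. Here's a minimal sketch of that core step (function and parameter names are mine, not taken from any of the samples):

```cpp
#include <algorithm>

// One-tailed Chebyshev upper bound used by VSM-style shadows:
// given the mean and mean-of-squares of depth over the filter
// footprint, bound the fraction of the footprint that does NOT
// occlude a receiver at depth d (1.0 = fully lit).
float ChebyshevUpperBound(float mean, float meanSqr, float d, float minVariance)
{
    // Clamp the variance to avoid numerical issues when the
    // footprint covers a single planar occluder.
    float variance = std::max(meanSqr - mean * mean, minVariance);

    // Receivers at or in front of the mean occluder depth are fully lit.
    if (d <= mean)
        return 1.0f;

    float delta = d - mean;
    return variance / (variance + delta * delta);
}
```

Because this is only an upper bound on visibility, it overestimates lighting when the depth distribution is complex, which is exactly where the light bleeding discussed below comes from.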

I was eager to see the results myself, so I downloaded the sample app that the authors were kind enough to provide. Unfortunately their sample didn’t implement EVSM, and so I wasn’t able to perform any comparisons. However the implementation of MSM is very straightforward, and so I decided to just integrate it into my old shadows sample. I updated the corresponding blog post and re-uploaded the binary + source, so if you’d like to check it out for yourself then feel free to download it:

The MSM techniques can be found under the “Shadow Mode” setting. I implemented both the Hamburger and Hausdorff methods, which are available as two separate shadow modes. If you change the VSM/MSM format from 32-bit to 16-bit, then the optimized quantization scheme will be used when converting from a depth map to a moment shadow map.

Unfortunately, my initial findings are rather mixed. The 32-bit variant of MSM seems to provide quality that’s pretty close to the 32-bit variant of EVSM, with slightly worse performance. Both techniques are mostly free of light bleeding, but still exhibit bleeding artifacts for the more extreme cases. As for the 16-bit variant, it again has quality that’s very similar to EVSM with equivalent bit depth. Both techniques require increasing the bias when using 16-bit storage in order to reduce precision artifacts, which in turn leads to increased bleeding artifacts. Overall MSM seems to behave a little bit better with regards to light leaking, which does make some sense considering that with EVSM you’re forced to use a lower exponent scale in order to prevent overflow. For either technique you can reduce the leaking using the standard VSM bleeding reduction technique, which essentially just remaps the output range of the shadow visibility term. Doing this can remove or reduce the bleeding artifacts, but will also result in over-darkening. Of course it’s also possible that I made a mistake when implementing MSM into my sample app, however the bleeding artifacts are also quite noticeable in the sample app provided by the authors.
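
The bleeding reduction remap mentioned above is usually just a linstep applied to the visibility term. A quick sketch (the function names are mine):

```cpp
#include <algorithm>

// Remap [a, b] to [0, 1], clamping values outside the range.
float Linstep(float a, float b, float v)
{
    return std::clamp((v - a) / (b - a), 0.0f, 1.0f);
}

// Standard VSM light-bleeding reduction: crush small visibility
// values to zero, at the cost of over-darkening the penumbra.
// `amount` is the reduction factor.
float ReduceLightBleeding(float pMax, float amount)
{
    return Linstep(amount, 1.0f, pMax);
}
```

Any visibility value below the reduction factor becomes fully shadowed, which is why the remap trades bleeding for over-darkening.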

To finish up, here are some screenshots that I took that show an example of light bleeding. The EVSM images all use the 4-component variant, and the MSM images all use the 4-moment Hamburger variant. For the images with the bleeding “fix”, they use a reduction factor of 0.25. In all cases the shadow map resolution is 2048×2048, with 4xMSAA, 16x anisotropic filtering, and mipmaps enabled for EVSM and MSM.

MSM Comparison Grid

Finally, here’s a few more images from an area with an even worse case for light bleeding:

MSM Comparison Grid 2


Come see me talk at GDC 2014

Myself and fellow lead graphics programmer David Neubelt will be at GDC next week, talking about the rendering technology behind The Order: 1886. Unfortunately the talk came together a bit late, and so it initially started from the talk that we gave at SIGGRAPH last year (which is why it has the same title). However we don’t want to just rehash the same material, so we’ve added tons of new slides and revamped the old ones. The resulting presentation has much more of an engineering focus, and will cover a lot of the nuts and bolts behind our engine and the material pipeline. Some of the new things we’ll be covering include:

  • Dynamic lighting
  • Baked diffuse and specular lighting
  • Baked and real-time ambient occlusion
  • Our shader system, and how it interacts with our material pipeline
  • Details of how we perform compositing in our build pipeline
  • Hair shading
  • Practical implementation issues with our shading model
  • Performance and memory statistics
  • Several breakdowns showing how various rendering techniques combine to form the final image
  • At least one new funny picture

We want the presentation to be fresh and informative even if you saw our SIGGRAPH presentation, and I’m pretty sure that we have enough new material to ensure that it will be. So if you’re interested, come by at 5:00 PM on Wednesday. If you can’t make it, I’m going to try to make sure that we have the slides up for download as soon as possible.

Weighted Blended Order-Independent Transparency

Back in December, Morgan McGuire and Louis Bavoil published a paper called Weighted Blended Order-Independent Transparency. In case you haven’t read it yet (you really should!), it proposes an OIT scheme that uses a weighted blend of all surfaces that overlap a given pixel. In other words finalColor = w0 * c0 + w1 * c1 + w2 * c2…etc. With a weighted blend the order of rendering no longer matters, which frees you from the never-ending nightmare of sorting. You can actually achieve results that are very close to a traditional sorted alpha blend, as long as your per-surface weights are carefully chosen. Obviously it’s that last part that makes it tricky; consequently, McGuire and Bavoil’s primary contribution is proposing a weighting function that’s based on the view-space depth of a given surface. The reasoning behind using a depth-based weighting function is intuitive: closer surfaces obscure the surfaces behind them, so the closer surfaces should be weighted higher when blending. In practice the implementation is really simple: in the pixel shader you compute color, opacity, and a weight value based on both depth and opacity. You then output float4(color * opacity, opacity) * weight to one render target, while also outputting weight alone to a second render target (the first RT needs to be fp16 RGBA for HDR, but the second can just be R8_UNORM or R16_UNORM). For both render targets special blending modes are required, however they both can be represented by standard fixed-function blending available in GL/DX. After rendering all of your transparents, you then perform a full-screen “resolve” pass where you normalize the weights and then blend with the opaque surfaces underneath the transparents. Obviously this is really appealing since you completely remove any dependency on the ordering of draw calls, and you don’t need to build per-pixel lists or anything like that (which is nice for us mortals who don’t have pixel sync).
The downside is that you’re at the mercy of your weighting function, and you potentially open yourself up to new kinds of artifacts depending on what sort of weighting function is used.
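
To make the math concrete, here's a small CPU-side sketch of the whole accumulate-and-resolve sequence for a single (scalar) pixel. The weighting function is one of the depth-based ones suggested in the paper, so treat the exact constants as assumptions, and note that on the GPU the sums below are realized with fixed-function additive/multiplicative blending rather than a loop:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct TranslucentSample { float color; float alpha; float viewDepth; };

// Depth-based weighting in the spirit of McGuire and Bavoil:
// closer surfaces get (much) larger weights.
float DepthWeight(float z, float alpha)
{
    float w = 10.0f / (1e-5f + std::pow(z / 5.0f, 2.0f) + std::pow(z / 200.0f, 6.0f));
    return alpha * std::clamp(w, 1e-2f, 3e3f);
}

// Weighted blended OIT composite: accumulate weighted premultiplied
// color and weight, track total revealage, then normalize and blend
// over the opaque background. The result is identical for any
// ordering of the surfaces.
float CompositeOIT(const std::vector<TranslucentSample>& surfaces, float background)
{
    float accumColor = 0.0f;   // sum of color * alpha * weight
    float accumWeight = 0.0f;  // sum of alpha * weight
    float revealage = 1.0f;    // product of (1 - alpha)
    for (const TranslucentSample& s : surfaces)
    {
        float w = DepthWeight(s.viewDepth, s.alpha);
        accumColor += s.color * s.alpha * w;
        accumWeight += s.alpha * w;
        revealage *= (1.0f - s.alpha);
    }
    float avgColor = accumColor / std::max(accumWeight, 1e-5f);
    return avgColor * (1.0f - revealage) + background * revealage;
}
```

Since every surface only contributes through commutative sums and products, shuffling the draw order cannot change the result, which is the entire point of the technique.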

When the paper came out I read it and I was naturally interested, so I quickly hacked together a sample project using another project as a base. Unfortunately over the past 2 months there’s been holidays, the flu, and several weeks of long hours at work so that we could finish up a major milestone. So while I’ve had time to visit my family and optimize our engine for PS4, I haven’t really had any time to come up with a proper sample app that really lets you explore the BOIT technique in a variety of scenarios. However I really hate not having source code and a working sample app to go with papers, so I’m going to release it so that others at least have something they can use for evaluating the proposed algorithm. Hopefully it’s useful, despite how bad the test scene is. Basically it’s just a simple Cornell-box-like scene made up of a few walls, a sphere, a cylinder, a torus, and a sky (I normally use it for GI testing), but I added the ability to toggle through 2 alternative albedo maps: a smoke texture, and a tree texture. It doesn’t look great, but it’s enough to get a few layers of transparency with varying lighting conditions:


The sample is based on another project I’ve been working on for quite some time with my fellow graphics programmer David Neubelt, where we’ve been exploring new techniques for baking GI into lightmaps. For that project I had written a simple multithreaded ray-tracer using Embree 2.0 (which is an awesome library, and I highly recommend it), so I re-purposed it into a ground-truth renderer for this sample. You can toggle it on and off to see what the scene would look like with perfect sorting, which is useful for evaluating the “correctness” of the BOIT algorithm. It’s very fast on my mighty 3.6GHz Core i7, but it might chug a bit for those of you running on mobile CPUs. If that’s true I apologize, however I made sure that all of the UI and controls are decoupled from the ray-tracing framerate so that the app remains responsive.

I’d love to do a more thorough write-up that really goes into depth on the advantages and disadvantages in multiple scenarios, but I’m afraid I just don’t have the time for it at the moment. So instead I’ll just share some quick thoughts and screenshots:

It’s pretty good for surfaces with low to medium opacity – with the smoke texture applied, it actually achieves decent results. The biggest issues are where there’s a large difference in the lighting intensity between two overlapping surfaces, which makes sense since this also applies to improperly sorted surfaces rendered with traditional alpha blending. Top image is with Blended OIT, bottom image is ground truth:


If you look at the area where the closer, brighter surface overlaps the darker surface on the cylinder you can see an example of where the results differ from the ground-truth render. Fortunately the depth weighting produces results that don’t look immediately “wrong”, which is certainly a big step up from unsorted alpha blending. Here’s another image of the test scene with default albedo maps, with an overall opacity of 0.25:


The technique fails for surfaces with high opacity – one case that the algorithm has trouble with is surfaces with opacity = 1.0. Since it uses a weighted blend, the weight of the closest surface has to be incredibly high relative to any other surfaces in order for it to appear opaque. Here’s the test scene with all surfaces using an opacity of 1.0:


You’ll notice in the image that the algorithm does actually work correctly with opacity = 1 if there’s no overlap of transparent surfaces, so it does hold up in that particular case. However in general this problem makes it unsuitable for materials like foliage, where large portions of the surface need to be fully opaque. Here’s the test scene using a tree texture, which illustrates the same problem:


Like I said earlier, you really need to make the closest surface have an extremely high weight relative to the surfaces behind it if you want it to appear opaque. One simple thing you could do is to keep track of the depth of the closest surface (say in a depth buffer), and then artificially boost the weight of surfaces whose depth matches the depth buffer value. If you do this (and also scale your “boost” factor by opacity) you get something like this:


This result looks quite a bit better, although messing with the weights changes the alpha gradients which gives it a different look. This approach obviously has a lot of failure cases. Since you’re relying on depth, you could easily create discontinuities at geometry edges. You can also get situations like this, where a surface visible through a transparent portion of the closest surface doesn’t get the weight boost and remains translucent in appearance:



Notice how the second tree trunk appears to have a low opacity since it’s behind the closest surface. The other major downside is that you need to render your transparents in a depth prepass, which costs performance as well as the memory for an extra depth buffer. However you may already be doing that in order to optimize tiled forward rendering of transparents. Regardless I doubt it would be useful except in certain special-case scenarios, and it’s probably easier (and cheaper) to just stick to alpha-test or A2C for those cases.
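
For completeness, the depth-prepass weight boost experiment could look something like the sketch below. The boost factor and depth epsilon are made-up illustration values, not tuned numbers from the actual sample:

```cpp
#include <cmath>

// Hypothetical tweak: boost the weight of the front-most surface
// (identified by comparing against a depth prepass of the
// transparents) so that high-opacity surfaces read as opaque.
float BoostedWeight(float baseWeight, float alpha,
                    float surfaceDepth, float prepassDepth)
{
    const float boostFactor = 25.0f;   // made-up tuning value
    const float depthEpsilon = 1e-4f;  // tolerance for the depth compare
    bool isClosest = std::fabs(surfaceDepth - prepassDepth) < depthEpsilon;
    // Scale the boost by opacity so that translucent front surfaces
    // don't completely stomp the surfaces behind them.
    return isClosest ? baseWeight * (1.0f + boostFactor * alpha) : baseWeight;
}
```

Surfaces that fail the depth compare keep their original weight, which is exactly why anything visible through a transparent part of the closest surface still ends up looking translucent.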

Is it usable? – I’m not sure yet. I feel like it would take a lot of testing across the wide range of transparents in our game before knowing if it’s going to work out. It’s too bad that it has failure cases, but if we’re going to be honest the bar is pretty damn low when it comes to transparents in games. In our engine we make an attempt to sort by depth, but our artists frequently have to resort to manually setting “sort priorities” in order to prevent temporal issues from meshes constantly switching their draw order. The Blended OIT algorithm on the other hand may produce incorrect results, but those results are stable over time. However I feel the much bigger issue with traditional transparent rendering is that ordering constraints are fundamentally at odds with rendering performance. Good performance requires using instancing, reducing state changes, and rendering to low-resolution render targets. All three of those are incompatible with rendering based on Z order, which means living with lots of sorting issues if you want optimal performance. With that in mind it really feels like it’s hard to do worse than the current status quo.

That’s about all I have for now. Feel free to download the demo and play around with it. If you missed it, the download link is at the top of the page. Also, please let me know if you have any thoughts or ideas regarding the practicality of the technique, since I would definitely be interested in discussing it further.

Sample Framework Updates

You may have noticed that my latest sample now has a proper UI instead of the homegrown sliders and keyboard toggles that I was using in my older samples. What you might not have noticed is that there’s a whole bunch of behind-the-scenes changes to go with that new UI! Before I ramble on, here’s a quick bullet-point list of the new features:

  • Switched to VS2012 and adopted a few C++11 features
  • New UI back-end provided by AntTweakBar
  • C#-based data-definition format for auto-generating UI
  • Shader hot-swapping
  • Better shader caching, and compressed cache files

It occurred to me a little while ago that I could try to develop my framework into something that enables rapid prototyping, instead of just being some random bits of cobbled-together code. I’m not sure if anybody else will use it for anything, but so far it’s working out pretty well for me.

VS 2012

A little while ago I switched to working in VS 2012 with the Windows 8 SDK for my samples, but continued using the VS 2010 project format and toolset in case anybody was stuck on 2010 and wanted to compile my code. For this latest sample I decided that legacy support wasn’t worth it, and made the full switch to the newer compiler. I haven’t done any really crazy C++11 things yet, but now that I’ve started using enum class and nullptr I never want to go back. Ever since learning C# I’ve always wanted C++ enums to use a similar syntax, and C++11 finally fulfilled my wish. I suspect that I’ll feel the same way when I start using non-static data member initializers (once I switch to VS 2013, of course).
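
For anybody who hasn't made the jump yet, here's a quick illustration of what enum class buys you (the enumerator names are just examples, not from the actual framework):

```cpp
// Old-style enums leak their enumerators into the enclosing scope
// and convert implicitly to int; enum class does neither, so you
// get C#-style scoping (ShadowMode::VSM) and real type safety.
enum class ShadowMode { FixedSizePCF, OptimizedPCF, VSM, EVSM };

const char* ToString(ShadowMode mode)
{
    switch (mode)
    {
        case ShadowMode::FixedSizePCF: return "FixedSizePCF";
        case ShadowMode::OptimizedPCF: return "OptimizedPCF";
        case ShadowMode::VSM:          return "VSM";
        case ShadowMode::EVSM:         return "EVSM";
    }
    return nullptr; // unreachable, keeps some compilers quiet
}

// int x = ShadowMode::VSM;     // error: no implicit conversion
// ShadowMode m = 2;            // error: no implicit conversion
```

Both commented-out lines compile just fine with a plain C-style enum, which is exactly the class of silent bug that enum class eliminates.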

New UI

When I found out about AntTweakBar and how easy it is to integrate, I felt a little silly for ever trying to write my own ghetto UI widgets. If you haven’t seen it before, it basically shows up as a window containing a list of tweakable app settings, which is exactly what I need in my sample apps. It also has a neat alternative to sliders where you sort of rotate a virtual lever, and the speed increases depending on how far away you move the mouse from the center point. Here’s what it looks like integrated into my shadows sample:


If you’re thinking of using AntTweakBar in your code, you might want to grab the files TwHelper.h/cpp from my sample framework. They provide a bunch of wrapper functions over TwSetParam that add type safety, which makes the library a lot easier to work with. I also have a much more comprehensive set of Settings classes that further wrap those functions, but those are a lot more tightly coupled to the rest of the framework.

Shader Hot-Swapping

This one is a no-brainer, and I’m not sure why I waited so long to do it. I decided not to implement it using the Win32 file-watching API. We’ve used that at work, and it quickly became the most hated part of our codebase. Instead I took the simple and painless route of checking the timestamp of a single file every N frames, which works great as long as you don’t have thousands of files to go through (usually I only have a dozen or so).
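
With std::filesystem the polling approach is only a handful of lines. The framework itself predates C++17, so treat this as a modernized sketch rather than the actual implementation:

```cpp
#include <filesystem>
#include <fstream>
#include <utility>

namespace fs = std::filesystem;

// Polling-based hot-swap check: remember the last write time we
// compiled against, and report true once whenever the file changes.
// Call CheckForChanges() every N frames rather than every frame.
class ShaderFileWatcher
{
public:
    explicit ShaderFileWatcher(fs::path path)
        : path_(std::move(path)), lastWrite_(fs::last_write_time(path_)) {}

    bool CheckForChanges()
    {
        fs::file_time_type current = fs::last_write_time(path_);
        if (current != lastWrite_)
        {
            lastWrite_ = current;
            return true; // caller recompiles the shader
        }
        return false;
    }

private:
    fs::path path_;
    fs::file_time_type lastWrite_;
};
```

The nice property compared to the Win32 file-watching API is that all of the state lives in your own code: there are no callbacks, no extra threads, and nothing to tear down.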

UI Data Definition

For a long time I’ve been unhappy with my old way of defining application settings, and setting up the UI for them. Previously I did everything in C++ code by declaring a global variable for each setting. This was nice because I got type safety and VisualAssist auto-complete whenever I wanted to access a setting, but it was also a bit cumbersome because I always had to touch code in multiple places whenever I wanted to add a new setting. This was especially annoying when I wanted to access a setting in a shader, since I had to manually handle adding it to a constant buffer and setting the value before binding it to a shader. After trying and failing multiple times to come up with something better using just C++, I thought it might be fun to try something a bit less…conventional. Ultimately I drew inspiration from game studios that define their data in a non-C++ language and then use that to auto-generate C++ code for use at runtime. If you can do it this way you get the best of both worlds: you define data in a simple language that can express it elegantly, and then you still get all of your nice type-safety and other C++ benefits. You can even add runtime reflection support by generating code that fills out data structures containing the info that you need to reflect on a type. It sounded crazy to do it for a sample framework, but I thought it would be fun to get out of my comfort zone a bit and try something new.

Ultimately I ended up using C# as my data-definition language. It not only has the required feature set, but I also used to be somewhat proficient in it several years ago. In particular I really like the Attribute functionality in C#, and thought it would be perfect for defining metadata to go along with a setting. Here’s an example of how I ended up using them for the “Bias” setting from my Shadows sample:

BiasSetting

For enum settings, I just declare an enum in C# and use that new type when defining the setting. I also used an attribute to specify a custom string to associate with each enum value:

EnumSetting

To finish it up, I added simple C# proxies for the Color, Orientation, and Direction setting types supported by the UI system.

Here’s how it all ties together: I define all of the settings in a file called AppSettings.cs, which includes classes for splitting the settings into groups. This file is added to my Visual Studio C++ project, and set to use a custom build step that runs before the C++ compilation step. This build step passes the file to SettingsCompiler.exe, which is a command-line C# app created by a C# project in the same solution. This app basically takes the C# settings file, and invokes the C# compiler (don’t you love languages that can compile themselves?) so that it can be compiled as an in-memory assembly. That assembly is then reflected to determine the settings that are declared in the file, and also to extract the various metadata from the attributes. Since the custom attribute classes need to be referenced by both the SettingsCompiler exe as well as the settings code being compiled, I had to put all of them in a separate DLL project called SettingsCompilerAttributes. Once all of the appropriate data is gathered, the C# project then generates and outputs AppSettings.h and AppSettings.cpp. These files contain global definitions of the various settings using the appropriate C++ UI types, and also contain code for initializing and registering those settings with the UI system. These files are added to the C++ project, so that they can be compiled and linked just like any other C++ code. On top of that, the settings compiler also spits out an HLSL file that declares a constant buffer containing all of the relevant settings (a setting can opt out of the constant buffer if it wants by using an attribute). The C++ files then have code generated for creating a matching constant buffer resource, filling it out with the setting values once a frame, and binding it to all shader stages at the appropriate slot. This means that all a shader needs to do is #include the file, and it can use the setting. Here’s a diagram that shows the basic setup for the whole thing:
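
To give a feel for the output side of the pipeline, the generated C++ for a couple of settings might look roughly like this. This is a hand-written approximation with invented names and defaults, not the actual generated code:

```cpp
#include <string>

// Hypothetical approximation of what the settings compiler might
// emit into AppSettings.h/.cpp for one float and one enum setting.
enum class ShadowMSAA { MSAANone, MSAA2x, MSAA4x };

struct FloatSetting
{
    std::string name;
    float value;
    float minValue;
    float maxValue;
};

struct ShadowMSAASetting
{
    std::string name;
    ShadowMSAA value;
};

namespace AppSettings
{
    // Defaults and ranges come from the attribute metadata
    // on the C# definitions.
    FloatSetting Bias = { "Bias", 0.005f, 0.0f, 0.01f };
    ShadowMSAASetting MSAAMode = { "MSAAMode", ShadowMSAA::MSAA4x };
}
```

The point is that consuming code just writes `AppSettings::Bias.value` with full type safety and auto-complete, while the registration boilerplate lives entirely in generated code.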


This actually works out really nicely in Visual Studio: you just hit F7 and everything builds in the right order. The settings compiler will gather errors from the C# compiler and output them to stdout, which means that if you have a syntax error it gets reported to the VS output window just as if you were compiling it normally through a C# project. MSBuild will even track the timestamp on AppSettings.cs, and won’t run the settings compiler unless it’s newer than the timestamp on AppSettings.h, AppSettings.cpp or AppSettings.hlsl. Sure it’s a really complicated and over-engineered way of handling my problem, but it works and I had fun doing it.

Future Work

I think that the next thing that I’ll improve will be model loading. It’s pretty limiting working with .sdkmesh, and I’d like to be able to work with a wider variety of test scenes. Perhaps I’ll integrate Assimp, or use it to make a simple converter to a custom format. I’d also like to flesh out my SH math and shader code a bit, and add some more useful functionality.

A Sampling of Shadow Techniques

A little over a year ago I was looking to overhaul our shadow rendering at work in order to improve overall quality, as well as simplify the workflow for the lighting artists (tweaking biases all day isn’t fun for anybody). After doing yet another round of research into modern shadow mapping techniques, I decided to do what I usually do and start working on a sample project that I could use as a platform for experimentation and comparison. And so I did just that. I had always intended to put it on the blog, since I thought it was likely that other people would be evaluating some of the same techniques as they upgrade their engines for next-gen consoles. But I was a bit lazy about cleaning it up (there was a lot of code!), and it wasn’t the most exciting thing that I was working on, so it sort-of fell by the wayside. Fast forward to a year later, and I found myself looking for a testbed for some major changes that I was planning for the settings and UI portion of my sample framework. My shadows sample came to mind since it was chock-full of settings, and that led me to finally clean it up and get it ready for sharing. So here it is! Note that I’ve fully switched over to VS 2012 now, so you’ll need it to open the project and compile the code. If you don’t have it installed, then you may need to download and install the VC++ 2012 Redistributable in order to run the pre-compiled binary.

Update 9/12/2013
 – Fixed a bug with cascade stabilization behaving incorrectly for very small partitions
 – Restore previous min/max cascade depth when disabling Auto-Compute Depth Bounds

Thanks to Stephen Hill for pointing out the issues!

Update 9/18/2013
 – Ignacio Castaño was kind enough to share the PCF technique being used in The Witness, which he integrated into the sample under the name “OptimizedPCF”. Thanks Ignacio!

Update 11/3/2013
 – Fixed SettingsCompiler string formatting issues for non-US cultures

Update 2/17/2015
 – Fixed VSM/EVSM filtering shader that was broken with latest Nvidia drivers
 – Added Moment Shadow Mapping
– Added notes about biasing VSM, EVSM, and MSM

The sample project is set up as a typical “cascaded shadow map from a directional light” scenario, in the same vein as the CascadedShadowMaps11 sample from the old DX SDK or the more recent Sample Distribution Shadow Maps sample from Intel. In fact I even use the same Powerplant scene from both samples, although I also added in a human-sized model so that you can get a rough idea of how well the shadowing looks on characters (which can be fairly important if you want your shadows to look good in cutscenes without having to go crazy with character-specific lights and shadows). The basic cascade rendering and setup is pretty similar to the Intel sample: a single “global” shadow matrix is created every frame based on the light direction and current camera position/orientation, using an orthographic projection with width, height, and depth equal to 1.0. Then for each cascade a projection is created that’s fit to the cascade, which is used for rendering geometry to the shadow map. For sampling the cascade, the projection is described as a 3D scale and offset that’s applied to the UV space of the “global” shadow projection. That way you just use one matrix in the shader to calculate shadow map UV coordinates + depth, and then apply the scale and offset to compute the UV coordinates + depth for the cascade that you want to sample. Unlike the Intel sample I didn’t use any kind of deferred rendering, so I decided to just fix the number of cascades at 4 instead of making it tweakable at runtime.

Cascade Optimization

Optimizing how you fit your cascades to your scene and current viewpoint is pretty crucial for reducing perspective aliasing and the artifacts that it causes. The old-school way of doing this for CSM is to chop up your entire visible depth range into N slices using some sort of partitioning scheme (logarithmic, linear, mixed, manual, etc.). Then for each slice of the view frustum, you’d tightly fit an AABB to the slice  and use that as the parameters for your orthographic projection for that slice. This gives you the most optimal effective resolution for a given cascade partition, however with 2006-era shadow map sizes you still generally end up with an effective resolution that’s pretty low relative to the screen resolution. Combine this with 2006-era filtering (2×2 PCF in a lot of cases), and you end up with quite a bit of aliasing. This aliasing was exceptionally bad, due to the fact that your cascade projections will translate and scale as your camera moves and rotates, which results in rasterization sample points changing from frame to frame as the camera moves. The end result was crawling shadow edges, even from static geometry. The most popular solution for this problem was to trade some effective shadow map resolution for stable sample points that  don’t move from frame to frame. This was first proposed (to my knowledge) by Michal Valient in his ShaderX6 article entitled “Stable Cascaded Shadow Maps”. The basic idea is that instead of tightly mapping your orthographic projection to your cascade split, you map it in such a way that the projection won’t change as the camera rotates. The way that Michal did it was to fit a sphere to the entire 3D frustum split and then fit the projection to that sphere, but you could do it any way that gives you a consistent projection size. To handle the translation issue, the projections are “snapped” to texel-sized increments so that you don’t get sub-texel sample movement. 
This ends up working really well, provided that your cascade partitions never change (which means that changing your FOV or near/far clip planes can cause issues). In general the stability ends up being a net win despite the reduced effective shadow map resolution, however small features and dynamic objects end up suffering.
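A rough sketch of the stabilized fitting described above, assuming the bounding-sphere variant (all names here are illustrative):

```hlsl
// Fit the ortho projection to a bounding sphere so its size doesn't change
// as the camera rotates, then snap the projection to texel-sized increments.
// sphereCenterLS is the split's bounding sphere center in light space.
float diameter = sphereRadius * 2.0f;
float texelsPerUnit = ShadowMapSize / diameter;

// Snap the light-space center so rasterization sample points don't move
// by sub-texel amounts from frame to frame
float3 snappedCenter = sphereCenterLS;
snappedCenter.xy = floor(snappedCenter.xy * texelsPerUnit) / texelsPerUnit;

// Build the ortho projection bounds from the snapped, fixed-size sphere
float3 minExtents = snappedCenter - sphereRadius;
float3 maxExtents = snappedCenter + sphereRadius;
```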

2 years ago Andrew Lauritzen gave a talk on what he called “Sample Distribution Shadow Maps”, and released the sample that I mentioned earlier. He proposed that instead of stabilizing the cascades, we could instead focus on reducing wasted resolution in the shadow map to the point where effective resolution is high enough to give us sub-pixel resolution when sampling the shadow map. If you can do that then you don’t really need to worry about your projection changing frame to frame, provided that you use decent filtering when sampling the shadow map. The way that he proposed to do this was to analyze the depth buffer generated for the current viewpoint, and use it to automatically generate an optimal partitioning for the cascades. He tried a few complicated ways of achieving this that involved generating a depth histogram on the GPU, but also proposed a much more practical scheme that simply computed the min and max depth values. Once you have the min and max depth, you can very easily use that to clamp your cascades to the subset of your depth range that actually contains visible pixels. This might not sound like a big deal, but in practice it can give you huge improvements. The min Z in particular allows you to have a much more optimal partitioning, which you can get by using an “ideal” logarithmic partitioning scheme. The main downside is that you need to use the depth buffer, which puts you in the awful position of having the CPU dependent on results from the GPU if you want to do your shadow setup and scene submission on the CPU. In the sample code they simply stall and read back the reduction results, which isn’t optimal at all in terms of performance. When you do something like this the CPU ends up waiting around for the driver to kick off commands to the GPU and for the GPU to finish processing them, and then you get a stall on the GPU while it sits around and waits for the CPU to start kicking off commands again.
You can potentially get around this by doing what you normally do for queries and such, and deferring the readback for one or more frames. That way the results are (hopefully) already ready for readback, and so you don’t get any stalls. But this can cause you problems, since the cascades will trail a frame behind what’s actually on screen. So for instance if the camera moves towards an object, the min Z may not be low enough to fit all of the pixels in the first cascade and you’ll get artifacts. One potential workaround is to use the previous frame’s camera parameters to try to predict what the depth of a pixel will be for the next frame based on the instantaneous linear velocity of the camera, so that when you retrieve your min/max depth a frame late it actually contains the correct results. I’ve actually done this in the past (but not for this sample, I got lazy) and it works as long as the camera motion stays constant. However it won’t handle moving objects unless you store the per-pixel velocity with respect to depth and factor that into your calculations. The ultimate solution is to do all of the setup and submission on the GPU, but I’ll talk about that in detail later on.
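The min/max clamping combined with an “ideal” logarithmic partitioning amounts to something like this (names are illustrative; MinDepth/MaxDepth would come from the depth reduction):

```hlsl
// Clamp the cascade range to the visible depth range, then distribute the
// splits logarithmically over it, as in SDSM's simplified min/max scheme.
float minZ = NearClip + MinDepth * (FarClip - NearClip);
float maxZ = NearClip + MaxDepth * (FarClip - NearClip);

for(uint i = 0; i < NumCascades; ++i)
{
    // "Ideal" logarithmic distribution over [minZ, maxZ]
    float p = (i + 1.0f) / NumCascades;
    CascadeSplits[i] = minZ * pow(maxZ / minZ, p);
}
```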

These are the options implemented in the sample that affect cascade optimization:

  • Stabilize Cascades – enables cascade stabilization using the method from ShaderX6.
  • Auto-Compute Depth Bounds – computes the min and max Z on the GPU, and uses it to tightly fit the cascades as in SDSM.
  • Readback Latency – number of frames to wait for the depth reduction results
  • Min Cascade Distance/Max Cascade Distance – manual min and max Z when Auto-Compute Depth Bounds isn’t being used
  • Partition Mode – can be Logarithmic, PSSM (mix between linear and logarithmic), or manual
  • Split Distance 0/Split Distance 1/Split Distance 2/Split Distance 3 – manual partition depths
  • PSSM Lambda – mix between linear and logarithmic partitioning when PartitionMode == PSSM
  • Visualize Cascades – colors pixels based on which cascade they select

Shadow Filtering

Shadow map filtering is the other important aspect of reducing artifacts. Generally it’s needed to reduce aliasing due to undersampling the geometry being rasterized into the shadow map, but it’s also useful for cases where the shadow map itself is being undersampled by the pixel shader. The most common technique for a long time has been Percentage Closer Filtering (PCF), which basically amounts to sampling a normal shadow map, performing the shadow comparison, and then filtering the result of that comparison. Nvidia hardware has been able to do a 2×2 bilinear PCF kernel in hardware since…forever, and it’s required of all DX10-capable hardware. Just about every PS3 game takes advantage of this feature, and Xbox 360 games would too if the GPU supported it. In general you see lots of 2×2 or 3×3 grid-shaped PCF kernels with either a box filter or a triangle filter. A few games (notably the Crysis games) use a randomized sample pattern with sample points located on a disc, which trades regular aliasing for noise. With DX11 hardware there’s support for GatherCmp, which essentially gives you the results of 4 shadow comparisons performed for the relevant 2×2 group of texels. With this you can efficiently implement large (7×7 or even 9×9) filter kernels with minimal fetch instructions, and still use arbitrary filter kernels. In fact there was an article in GPU Pro called “Fast Conventional Shadow Filtering” by Holger Gruen that did exactly this, and even provided source code. It can be stupidly fast…in my sample app going from 2×2 PCF to 7×7 PCF only adds about 0.4ms when rendering at 1920×1080 on my AMD 7950. For comparison, a normal grid sampling approach adds about 2-3ms in my sample app for the maximum level of filtering. The big disadvantage to the fixed kernel modes is that they rely on the compiler to unroll the loops, which makes for some sloooowwww compile times. The sample app uses a compilation cache so you won’t notice it if you just start it up, but without the cache you’ll see that it takes quite a while due to the many shader permutations being used. For that reason I decided to stick with a single kernel shape (disc) rather than using all of the shapes from the GPU Pro article, since compilation times were already bad enough.
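At its core, the GatherCmp approach boils down to one fetch returning 4 comparison results at once, which a larger kernel then weights and accumulates. A minimal sketch (just the core fetch, not the optimized kernels from the GPU Pro article):

```hlsl
// One GatherCmp returns 4 depth comparison results (1.0 = lit, 0.0 = in
// shadow) for the 2x2 group of texels covering the sample location.
// ShadowMap and ShadowSamplerCmp are illustrative names.
float SampleShadowGather(in float2 shadowUV, in float compareDepth)
{
    float4 results = ShadowMap.GatherCmp(ShadowSamplerCmp, shadowUV,
                                         compareDepth);

    // A box filter over the 2x2 footprint; a real kernel would apply
    // per-texel weights and accumulate several of these fetches
    return (results.x + results.y + results.z + results.w) * 0.25f;
}
```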

So far the only real competitor to gain any traction in games is Variance Shadow Maps (VSM). I won’t go deep into the specifics since the original paper and GPU Gems article do a great job of explaining it. But the basic gist is that you work in terms of the mean and variance of a distribution of depth values at a certain texel, and then use that distribution to estimate the probability of a pixel being in shadow. The end result is that you gain the ability to filter the shadow map without having to perform a comparison, which means that you can use hardware filtering (including mipmaps, anisotropy, or even MSAA) and that you can pre-filter the shadow map with a standard “full-screen” blur pass. Another important aspect is that you generally don’t suffer from the same biasing issues that you do with PCF. There are some issues with performance and memory, since you now need to store 2 high-precision values in your shadow map instead of just 1. But in general the biggest problem is light bleeding, which occurs when there’s a large depth range between occluders and receivers. Lauritzen attempted to address this a few years later by applying an exponential warp to the depth values stored in the shadow map, and performing the filtering in log space. It’s generally quite effective, but it requires high-precision floating point storage to accommodate the warping. For maximum quality he also proposed storing a negative term, which requires an extra 2 components in the shadow map. In total that makes for 4x FP32 components per texel, which is definitely not light in terms of bandwidth! However it arguably produces the highest-quality results, and it does so without having to muck around with biasing. This is especially true when pre-filtering, MSAA, anisotropic filtering, and mipmaps are all used, although each of those brings about additional cost.
To provide some real numbers: using EVSM4 with 2048×2048 cascades, 4xMSAA, mipmaps, 8xAF, and the highest level of filtering (9 texels wide for the first cascade) adds about 11.5ms relative to a 7×7 fixed PCF kernel. A more reasonable approach would be to go with 1024×1024 shadow maps with 4xMSAA, which is around 3ms slower than the PCF version.
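The core of VSM sampling is Chebyshev’s inequality applied to the filtered depth moments, straight from the original paper (variable names here are illustrative):

```hlsl
// moments.x = filtered mean depth, moments.y = filtered mean of depth^2.
// Returns an upper bound on the probability that the receiver is lit.
float ChebyshevUpperBound(in float2 moments, in float receiverDepth,
                          in float minVariance)
{
    float variance = moments.y - (moments.x * moments.x);
    variance = max(variance, minVariance);  // clamp to fight precision acne

    float d = receiverDepth - moments.x;
    float pMax = variance / (variance + d * d);

    // One-tailed version: fully lit if the receiver is in front of the mean
    return (receiverDepth <= moments.x) ? 1.0f : pMax;
}
```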

These are the shadow filtering modes that are implemented:

  • FixedSizePCF – optimized GatherCmp PCF with disc-shaped kernel
  • GridPCF – manual grid-based PCF sampling using NxN samples
  • RandomDiscPCF  – randomized samples on a Poisson disc, with optional per-pixel randomized rotation
  • OptimizedPCF – similar to FixedSizePCF, but uses bilinear PCF samples to implement a uniform filter kernel.
  • VSM – variance shadow maps
  • EVSM2 – exponential variance shadow maps, positive term only
  • EVSM4 – exponential variance shadow maps, both positive and negative terms
  • MSM Hamburger – moment shadow mapping, using the “Hamburger 4MSM” technique from the paper
  • MSM Hausdorff – moment shadow mapping, using the “Hausdorff 4MSM” technique from the paper

Here’s the options available in my sample related to filtering:

  • Shadow Mode – the shadow filtering mode, can be one of the above values
  • Fixed Filter Size – the kernel width in texels for the FixedSizePCF mode, can be 2×2 though 9×9
  • Filter Size – the kernel width in fractional world space units for all other filtering modes. For the VSM modes, it’s used in a pre-filtering pass.
  • Filter Across Cascades – blends between two cascade results at cascade boundaries to hide the transition
  • Num Disc Samples – number of samples to use for RandomDiscPCF mode
  • Randomize Disc Offsets – if enabled, applies a per-pixel randomized rotation to the disc samples
  • Shadow MSAA – MSAA level to use for VSM, EVSM, and MSM modes
  • VSM/MSM Format – the precision to use for VSM, EVSM, and MSM shadow maps. Can be 16-bit or 32-bit. For VSM the textures will use a UNORM format, for EVSM they will be FLOAT. For the MSM 16-bit version, the optimized quantization scheme from the paper is used to store the data in a UNORM texture.
  • Shadow Anisotropy – anisotropic filtering level to use for VSM, EVSM, and MSM
  • Enable Shadow Mips – enables mipmap generation and sampling for VSM, EVSM, and MSM
  • Positive Exponent/Negative Exponent – the exponential warping factors for the positive and negative components of EVSM
  • Light Bleeding Reduction – reduces light bleeding for VSM/EVSM/MSM, but results in over-darkening

In general I try to keep the filtering kernel fixed in world space across each cascade by adjusting the kernel width based on the cascade size. The one exception is the FixedSizePCF mode, which uses the same size kernel for all cascades. I did this because I didn’t think that branching over the fixed kernels would be a great idea. Matching the filter kernel for each cascade is nice because it helps hide the seams at cascade transitions, which means you don’t have to try to filter across adjacent cascades in order to hide them. It also means that you don’t have to use wider kernels on more distant pixels, although this can sometimes lead to visible aliasing on distant surfaces.

I didn’t put a whole lot of effort into the “RandomDiscPCF” mode, so it doesn’t produce optimal results. The randomization is done per-pixel, which isn’t great since you can clearly see the random pattern tiling over the screen as the camera moves. For a better comparison you would probably want to do something similar to what Crytek does, and tile a (stable) pattern over each cascade in shadowmap space.


When using conventional PCF filtering, biasing is essential for avoiding “shadow acne” artifacts. Unfortunately, it’s usually pretty tricky to get it right across different scenes and lighting scenarios. Just a bit too much bias will end up killing shadows entirely for small features, which can cause characters to look very bad. My sample app exposes 3 kinds of biasing: manual depth offset, manual normal-based offset, and automatic depth offset based on receiver slope. The manual depth offset, called simply Bias in the UI, subtracts a value from the pixel depth used to compare against the shadow map depth. Since the shadow map depth is a [0, 1] value that’s normalized to the depth range of the cascade, the bias value represents a variable size in world space for each cascade. The normal-based offset, called Offset Scale in the UI, is based on “Normal Offset Shadows”, which was a poster presented at GDC 2011. The basic idea is that you create a new virtual shadow receiver position by offsetting from the actual pixel position in the direction of the normal. The trick is that you offset more when the surface normal is more perpendicular to the light direction. Angelo Pesce has a hand-drawn diagram explaining the same basic premise on his blog, if you’re having trouble picturing it. This technique can actually produce decent results, especially given how cheap it is. However since you’re offsetting the receiver position, you actually “move” the shadows a bit, which looks a bit weird as you tweak the value. Since the offset is a world-space distance, in my sample I scale it based on the depth range of the cascade in order to make it consistent across cascade boundaries. If you want to try using it, I recommend starting with a small manual bias of 0.0015 or so and then slowly increasing the Offset Scale to about 1.0 or 2.0. Finally we have the automatic solution, where we attempt to compute the “perfect” bias amount based on the slope of the receiver.
This setting is called Use Receiver Plane Depth Bias in the sample. To determine the slope, screen-space derivatives are used in the pixel shader. When it works, it’s fantastic. However it will still run into degenerate cases where it can produce unpredictable results, which is something that often happens when working with screen-space derivatives.
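The normal offset technique described above can be sketched in a few lines. Note that the exact falloff used in the poster may differ; scaling by one minus N·L is one common variant, and the cascade depth-range scaling mentioned in the text is omitted here for clarity:

```hlsl
// Offset the shadow receiver position along the surface normal, pushing
// harder as the normal becomes more perpendicular to the light direction.
// OffsetScale corresponds to the "Offset Scale" UI setting.
float nDotL = saturate(dot(normalWS, lightDirWS));
float offsetAmount = OffsetScale * (1.0f - nDotL);
float3 shadowReceiverPosWS = positionWS + normalWS * offsetAmount;
```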

There are also separate biasing parameters for the VSM and MSM techniques. The “VSM Bias” affects VSM and EVSM, while “MSM Depth Bias” and “MSM Moment Bias” are used for MSM. For VSM and 32-bit EVSM, a bias value of 0.01 (which corresponds to an actual value of 0.0001) is sufficient. However for 16-bit EVSM a bias of up to 0.5 (0.005) is required to alleviate precision issues. For MSM, only the moment bias is particularly relevant. This value needs to be at least 0.001-0.003 (0.000001-0.000003) for the 32-bit modes, while the quantized 16-bit mode requires a higher bias of around 0.03 (0.00003). Note that for both EVSM and MSM increasing the bias can also increase the amount of visible light bleeding.

GPU-Driven Cascade Setup and Scene Submission

This is a topic that’s both fun and really frustrating. It’s fun because you can really start to exploit the flexibility of DX11-class GPU’s, and begin to break free of the old “CPU spoon-feeds the GPU” model that we’ve been using for so long now. It’s also frustrating because the API’s still hold us back quite a bit in terms of letting the GPU generate its own commands. Either way it’s something that I see people talk about without many people actually doing it, so I thought I’d give it a try for this sample. There are actually 2 reasons to try something like this. The first is that if you can offload enough work to the GPU, you can avoid the heavy CPU overhead of frustum culling and drawing and/or batching lots and lots of meshes. The second is that it lets you do SDSM-style cascade optimization based on the depth buffer without having to read back values from the GPU to the CPU, which is always a painful way of doing things.

The obvious path to implementing GPU scene submission would be to make use of DrawInstancedIndirect/DrawIndexedInstancedIndirect. These API’s are fairly simple to use: you write the parameters to a buffer (with the parameters being the same ones that you would normally pass on the CPU for the non-indirect version), and then on the CPU you specify that buffer when calling the appropriate function. Since instance count is one of the parameters, you can implement culling on the GPU by setting the instance count to 0 when a mesh shouldn’t be rendered. However it turns out that this isn’t really useful, since you still need to go through the overhead of submitting draw calls and setting associated state on the CPU. In fact the situation is worse than doing it all on the CPU, since you have to submit every draw call even if it will end up being frustum culled.
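For reference, writing the indirect arguments from a culling shader would look something like this. The five-uint argument layout matches what DrawIndexedInstancedIndirect expects; the buffer and struct names are hypothetical:

```hlsl
// Each draw's arguments are 5 uints: IndexCountPerInstance, InstanceCount,
// StartIndexLocation, BaseVertexLocation, StartInstanceLocation. Zeroing
// the instance count "culls" the draw, but the CPU still has to submit it.
RWByteAddressBuffer ArgsBuffer : register(u0);

void WriteDrawArgs(in uint drawIdx, in DrawCall drawCall, in bool visible)
{
    const uint argsOffset = drawIdx * 5 * 4;   // 5 uint args per draw
    ArgsBuffer.Store(argsOffset + 0,  drawCall.NumIndices);
    ArgsBuffer.Store(argsOffset + 4,  visible ? 1 : 0);  // InstanceCount
    ArgsBuffer.Store(argsOffset + 8,  drawCall.StartIndex);
    ArgsBuffer.Store(argsOffset + 12, 0);                // BaseVertexLocation
    ArgsBuffer.Store(argsOffset + 16, 0);                // StartInstanceLocation
}
```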

Instead of using indirect draws, I decided to make a simple GPU-based batching system. Shadow map rendering is inherently more batchable than normal rendering, since you typically don’t need to worry about having specialized shaders or binding lots of textures when you’re only rendering depth. In my sample I take advantage of this by using a compute shader to generate one great big index buffer based on the results of a frustum culling pass, which can then be rendered in a single draw call. It’s really very simple: during initialization I generate a buffer containing all vertex positions, a buffer containing all indices (offset to reflect the vertex position in the combined position buffer), and a structured buffer containing the parameters for every draw call (index start, num indices, and a bounding sphere). When it’s time to batch, I run a compute shader with one thread per draw call that culls the bounding sphere against the view frustum. If it passes, the thread then “allocates” some space in the output index buffer by performing an atomic add on a value in a RWByteAddressBuffer containing the total number of culled indices that will be present in output index buffer (note that on my 7950 you should be able to just do the atomic in GDS like you would for an append buffer, but unfortunately in HLSL you can only increment by 1 using IncrementCounter()). I also append the draw call data to an append buffer for use in the next pass:

const uint drawIdx = TGSize * GroupID.x + ThreadIndex;
if(drawIdx >= NumDrawCalls)
    return;

DrawCall drawCall = DrawCalls[drawIdx];

// Cull the draw call's bounding sphere against the view frustum
// (the helper name here is illustrative)
if(SphereIntersectsFrustum(drawCall.SphereCenter, drawCall.SphereRadius))
{
    CulledDraw culledDraw;
    culledDraw.SrcIndexStart = drawCall.StartIndex;
    culledDraw.NumIndices = drawCall.NumIndices;

    // "Allocate" space in the output index buffer by atomically adding
    // this draw's index count to the running total
    DrawArgsBuffer.InterlockedAdd(0, drawCall.NumIndices,
                                  culledDraw.DstIndexStart);

    // Queue the culled draw for the index-copy pass
    CulledDraws.Append(culledDraw);
}

Once that’s completed, I then launch another compute shader that uses 1 thread group per draw call to copy all of the indices from the source index buffer to the final output index buffer that will be used for rendering. This shader is also very simple: it simply looks up the draw call info from the append buffer that tells it which indices to copy and where they should be copied, and then loops enough times for all indices to be copied in parallel by the different threads inside the thread group. The final result is an index buffer containing all of the culled draw calls, ready to be drawn in a single draw call:

const uint drawIdx = GroupID.x;
CulledDraw drawCall = CulledDraws[drawIdx];

// Each thread group copies one draw call's indices, with the threads in
// the group striding through the index range in parallel
for(uint currIdx = ThreadIndex; currIdx < drawCall.NumIndices; currIdx += TGSize)
{
    const uint srcIdx = currIdx + drawCall.SrcIndexStart;
    const uint dstIdx = currIdx + drawCall.DstIndexStart;
    CulledIndices[dstIdx] = Indices[srcIdx];
}

In order to avoid the min/max depth readback, I also had to port my cascade setup code to a compute shader so that the entire process could remain on the GPU. I was surprised to find that this was actually quite a bit more difficult than writing the batching system. Porting arbitrary C++ code to HLSL can be somewhat tedious, due to the various limitations of the language. I also ran into a rather ugly bug in the HLSL compiler where it kept trying to keep matrices in column-major order no matter what code I wrote and what declarations I used, which I suppose it does as an optimization. However this really messed me up when I tried to write a matrix into a structured buffer and expected it to be row-major. My advice for the future: if you need to write a matrix from a compute shader, just use a RWStructuredBuffer<float4> and write out the rows manually. Ultimately after much hair-pulling I got it to work, and finally achieved my goal of 0-latency depth reduction with no CPU readback!
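The row-by-row workaround amounts to this (buffer and function names are illustrative):

```hlsl
// Sidestep the compiler's matrix packing by writing the four rows of the
// matrix out explicitly as float4's, so the memory layout is unambiguous.
RWStructuredBuffer<float4> MatrixRows : register(u0);

void WriteMatrixRows(in uint matrixIdx, in float4x4 m)
{
    const uint baseIdx = matrixIdx * 4;

    [unroll]
    for(uint row = 0; row < 4; ++row)
        MatrixRows[baseIdx + row] = m[row];
}
```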

In terms of performance, the GPU-driven path comes in about 0.8ms slower than the CPU version when the CPU uses 1 frame of latency for reading back the min/max depth results. I’m not entirely sure why it’s that much slower, although I haven’t spent much time trying to profile the batching shaders or the cascade setup shader. I also wouldn’t be surprised if the actual mesh rendering ends up being a bit slower, since I use 32-bit indices when batching on the GPU as opposed to 16-bit when submitting with the CPU. However when using 0 frames of latency for the CPU readback, the GPU version turns it around and comes in a full millisecond faster on my PC. Obviously it makes sense that any additional GPU overhead would be made up for by avoiding the stall that occurs on the CPU and GPU when having to immediately read back data. One thing that I’ll point out is that the GPU version is quite a bit slower than the CPU version when VSM’s are used and pre-filtering is enabled. This is because the CPU path chooses an optimized blur shader based on the filter width for a cascade, and early-outs of the blur process if no filtering is required. For the GPU path I got lazy and just used one blur shader that uses a dynamic loop, and it always runs the blur passes even if no filtering is requested. There’s no technical reason why you couldn’t do it the same way as the CPU path with enough motivation and DrawIndirect’s.

DX11.2 Tiled Resources

Tiled resources seems to be the big-ticket item for the upcoming DX11.2 update. While the online documentation has some information about the new functions added to the API, there’s currently no information about the two tiers of tiled resource functionality being offered. Fortunately there is a sample app available that provides some clues. After poking around a bit last night, these were the differences that I noticed:

  • TIER2 supports MIN and MAX texture sampling modes that return the min or max of 4 neighboring texels. In the sample they use this when sampling a residency texture that tells the shader the highest-resolution mip level that can be used when sampling a particular tile. For TIER1 they emulate it with a Gather.
  • TIER1 doesn’t support sampling from unmapped tiles, so you have to either avoid it in your shader or map all unloaded tiles to dummy tile data (the sample does the latter)
  • TIER1 doesn’t support packed mips for texture arrays. From what I can gather, packed mips refers to packing multiple mips into a single tile.
  • TIER2 supports a new version of Texture2D.Sample that lets you clamp the mip level to a certain value. They use this to force the shader to sample from lower-resolution mip levels if the higher-resolution mip isn’t currently resident in memory. For TIER1 they emulate this by computing what mip level would normally be used, comparing it with the mip level available in memory, and then falling back to SampleLevel if the mip level needs to be clamped. There’s also another overload for Sample that returns a status variable that you can pass to a new “CheckAccessFullyMapped” intrinsic that tells you if the sample operation would access unmapped tiles. The docs don’t say that these functions are restricted to TIER2, but I would assume that to be the case.
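The TIER1 emulation of the MIN sampling mode mentioned in the first bullet might look something like this (texture and sampler names are hypothetical):

```hlsl
// Emulate a MIN reduction sampling mode by gathering the 4 neighboring
// texels of the residency texture and taking their component-wise minimum.
float4 mips = ResidencyTex.Gather(LinearSampler, uv);
float minResidentMip = min(min(mips.x, mips.y), min(mips.z, mips.w));
```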

Based on this information it appears that TIER1 offers all of the core functionality, while TIER2 has a few extras that bring it up to par with AMD’s sparse texture extension.


So I’m hoping that if you’re reading this, you’ve already attended or read the slides from my presentation about The Order: 1886 that was part of the Physically Based Shading Course at SIGGRAPH last week. If not, go grab them and get started! If you haven’t read through the course notes already there’s a lot of good info there, in fact there’s almost 30 pages worth! The highlights include:

  • Full description of our Cook-Torrance and Cloth BRDF’s, including a handy optimization for the GGX Smith geometry term (for which credit belongs to Steve McAuley)
  • Analysis of our specular antialiasing solution
  • Plenty of details regarding the material scanning process
  • HLSL sample code for the Cook-Torrance BRDF’s as well as the specular AA roughness modification
  • Lots of beautiful LaTeX equations

If you did attend, I really appreciate you coming and I hope that you found it interesting. It was my first time presenting in front of such a large audience, so please forgive me if I seemed nervous. Also I’d like to give another thank you to anyone that came out for drinks later that night, I had a really great time talking shop with some of the industry’s best graphics programmers. And of course the biggest thanks goes out to Stephen Hill and Stephen McAuley for giving us the opportunity to speak at the best course at SIGGRAPH.

Anyhow…now that SIGGRAPH is finished and I have some time to sit down and take a breath, I wanted to follow up with some additional remarks about the topics that we presented. I also thought that if I post blogs more frequently, it might inspire a few other people to do the same.

Physically Based BRDF’s

I didn’t even mention anything about the benefits of physically based BRDF’s in our presentation because I feel like it’s no longer an issue that’s worth debating. We’ve been using some form of Cook-Torrance specular for about 2 years now, and it’s made everything in our game look better. Everything works more like it should work, and requires less fiddling and fudging on the part of the artists. We definitely had some issues to work through when we first switched (I can’t tell you how many times they asked me for direct control over the Fresnel curve), but there’s no question as to whether it was worth it in the long run. Next-gen consoles and modern PC GPU’s have lots of ALU to throw around, and sophisticated BRDF’s are a great way to utilize it.


Material Compositing

First of all, I wanted to reiterate that our compositing system (or “The Smasher” as it’s referred to in-house) is a completely offline process that happens in our content build system. When a level is built we request a build of all materials referenced in that level, which then triggers a build of all materials used in the compositing stack of each material. Once all of the source materials are built, the compositing system kicks in and generates the blended parameter maps. The compositing process itself is very quick since we do it in a simple pixel shader run on the GPU, but it can take some time to build all of the source materials since doing that requires processing the textures referenced by each material. We also support using composited materials as a source material in a composite stack, which means that a single runtime material can potentially depend on an arbitrarily complex tree of source materials. To alleviate this we aggressively cache intermediate and output assets on local servers in our office, which makes build times pretty quick once the cache has been primed. We also used to compile shaders for every material, which caused long build times when changing code used in material shaders. This forced us to change things so that we only compile shaders directly required by a mesh inside of a level.

We also support runtime parameter blending, which we refer to as layer blending. Since it’s at runtime we limit it to 4 layers to prevent the shader from becoming too expensive, unlike the compositing system where artists are free to composite in as many materials as they’d like. It’s mostly used for environment geo as a way to add in some break-up and variety to tiling materials via vertex colors, as opposed to blending in small bits of a layer using blend maps. One notable exception is that we use the layer blending to add a “detail layer” to our cloth materials. This lets us keep the overall low-frequency material properties in lower-resolution texture maps, and then blend in tiling high-frequency data (such as the textile patterns acquired from scanning).
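As a very rough sketch of what vertex-color-driven layer blending looks like, here’s the weighting scheme for a single parameter. The real shader of course blends full parameter sets, and the names here are purely illustrative:

```hlsl
// Blend up to 4 layers of a material parameter using per-vertex weights.
// Normalizing the weights keeps the result well-defined when artists
// paint weights that don't sum to 1.
float4 weights = input.VertexColor;
weights /= max(weights.x + weights.y + weights.z + weights.w, 0.0001f);

float3 albedo = Layer0Albedo * weights.x + Layer1Albedo * weights.y +
                Layer2Albedo * weights.z + Layer3Albedo * weights.w;
```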

One more thing that I really wanted to bring up is that the artists absolutely love it. The way it interacts with our template library allows us to keep low-level material parameter authoring in the hands of a few technical people, and enables everyone else to just think in terms of combining high-level “building blocks” that form an object’s final appearance. I have no idea how we would make a game without it.

Maya Viewer

A few people noticed that we have our engine running in a Maya viewport, and came up to ask me about it. I had thought that this was fairly common and thus uninteresting, but I guess I was wrong! We do this by creating a custom Viewport 2.0 render override (via MRenderOverride) that uses our engine to render the scene using Maya’s DX11 device (Maya 2013.5 and up support a DX11 mode for Viewport 2.0, for 2013 you have to use OpenGL). With their render pass API you can basically specify what things you want Maya to draw, so we render first (so that we can fill the depth buffer) and then have Maya draw things like HUD text and wireframe overlays for UI. The VP 2.0 API actually supports letting you hook into their renderer and data structures, which basically lets you specify things like vertex data and shader code while allowing Maya to handle the actual rendering. We don’t do that…we basically just give Maya’s device to our renderer and let our engine draw things the same way that it does in-game. To do this, we have a bunch of Maya plugin code that tracks objects in the scene (meshes, lights, plugin nodes, etc.) and handles exporting data off of them and converting it to data that can be consumed by our renderer. Maintaining this code has been a lot of work, since it basically amounts to an alternate path for scene data that’s in some ways very different from what we normally do in the game. However it’s a huge workflow enhancement for all of our artists, so we put up with it. I definitely never thought I would know so much about Maya and its API!

Some other cool things that we support inside of Maya:

  • We can embed our asset editors (for editing materials, particle systems, lens flares, etc.) inside of Maya, which allows for real-time editing of those assets while they’re being viewed
  • GI bakes are initiated from Maya and then run on the GPU, which allows the lighting artists to keep iterating inside of a single tool and get quick feedback
  • Our pre-computed visibility can also be baked inside of Maya, which allows artists to check for vis dropouts and other bugs without running the game

One issue that I had to work around in our Maya viewer was using the debug DX11 device. Since Maya creates the device we can’t control the creation flags, which means no helpful DEBUG mode to tell us when we mess something up. To work around this, I had to make our renderer create its own device and use that for rendering. Then when presenting the back buffer and depth buffer to Maya, we have to use DXGI synchronization to copy texture data from our device’s render targets to Maya’s render targets. It’s not terribly hard, but it requires reading through a lot of obscure DXGI documentation.

Sample Code

You may not have noticed, but there’s a sample app (alternate download link here) to go with the presentation! Dave and I always say that it’s a little lame to give a talk about something without providing sample code, so we had to put our money where our mouth is. It’s essentially a working implementation of our specular AA solution as well as our Cook-Torrance BRDF’s that uses my DX11 sample framework. Seeing for yourself with the sample app is a whole lot better than looking at pictures, since the primary concern is avoiding temporal aliasing as the camera or object changes position. These are all of the specular AA techniques available in the sample for comparison:

  • LEAN
  • Toksvig
  • Pre-computed Toksvig
  • In-shader vMF evaluation of Frequency Domain Normal Map Filtering
  • Pre-computed vMF evaluation (what we use at RAD)

For a ground-truth reference, I also implemented in-shader supersampling and texture-space lighting. The shader supersampling works by interpolating vertex attributes to random sample points surrounding the pixel center, computing the lighting, and then applying a bicubic filter. Texture-space lighting works exactly like you think it would: the lighting result is rendered to a FP16 texture that’s 2x the width and height of the normal map, mipmaps are generated, and the geometry is rendered by sampling from the lighting texture with anisotropic filtering. Since linear filtering is used both for sampling the lighting map and generating the mipmaps, the overall results don’t always look very good (especially under magnification). However the results are completely stable, since the sample positions never move relative to the surface being shaded. The pixel shader supersampling technique, on the other hand, still suffers from some temporal flickering, since its sample positions do move relative to the surface, although the flickering is obviously significantly reduced with higher sample counts.

Dave and I had also intended to implement solving for multiple vMF lobes, but unfortunately we ran out of time and we weren’t able to include it. I’d like to revisit it at some point and release an updated sample that has it implemented. I don’t think it would actually be worth it for a game to store so much additional texture data; however, I think it would be useful as a reference. It might also be interesting to see if the data could be used to drive a more sophisticated pre-computation step that bakes the anisotropy into the lower-resolution mipmaps.

As I mentioned before, the sample also has implementations of the GGX and Beckmann-based specular BRDFs described in our course notes. We also implemented our GGX and Cloth BRDFs as .brdf files for Disney’s BRDF Explorer, which you can download here.

What I’ve been working on for the past 2 years

The announce trailer for Ready At Dawn’s latest project was shown during Sony’s E3 press conference yesterday, but if you missed it you can watch it here. There are also a few full-res screenshots available here, with less compress-o-vision. It feels really good to finally be able to tell people what game I’ve been working on, and that we’re making a PS4 title. I’m also insanely proud of the trailer itself, as well as the in-house tech we’ve developed that made it possible. I have no doubts that I work with some of the most dedicated and talented people in the industry, and being able to collaborate with them makes it worth getting up in the morning. I’m really looking forward to sharing some of the cool things we’ve come up with, both on this blog as well as in more formal settings.

I was also seriously impressed by all of the other games and demos that were shown off yesterday. It’s really exciting to see everyone gearing up for next gen, and pushing their engines to new levels. These were some stand-outs for me:

  • Infamous: Second Son – characters look awesome! Lots of details in the skin and fabrics, plus some really nice facial animation
  • Watch Dogs – we’ve seen it before, but it looks better and better every time
  • The Division – lighting looks great, and the contextual animations are a really nice touch. I really dig the bullet decals and glass destruction.
  • Battlefield 4 – these guys are just out of control when it comes to destruction and mayhem! Also, really psyched to see that commander mode is making a comeback.
  • Mirror’s Edge 2 – the first game has a special place in my heart, so I would love a sequel even if it didn’t look awesome (but it does). It’s really cool seeing how good GI can look in a game with a clean art style.
  • Destiny – I’m not usually one for persistent-world multiplayer games, but this game might be the exception. Looks like a lot of fun, and has great graphics to boot.
  • Titanfall – any game where I can pilot a mech and run on walls has my money
  • The Dark Sorcerer – very high quality assets, and great lighting/materials. The performance capture is really impressive as well.

I’m looking forward to seeing more of these games and comparing notes on what techniques make the most out of next-gen hardware. By my count we’ve crossed off around 8 or so of the things on my list, and hopefully the entire industry will collectively figure out how to make all of them extinct. And then we can come up with new things to add to the list!

HLSL User Defined Language for Notepad++

When it comes to writing shaders, Notepad++ is currently my editor of choice. The most recent release of Notepad++ added version 2.0 of their User Defined Language (UDL) system, which adds quite a few improvements. I’ve been using an HLSL UDL file that I downloaded from somewhere else for a while now, and I decided to upgrade it to the 2.0 format and also make it work better for SM5.0 profiles. I added all of the operators, keywords, types, attributes, system-value semantics, intrinsics, and methods, so they all get syntax highlighting now. I also stripped out all of the old pre-SM4.0 intrinsics and semantics, as well as the effect-specific keywords. I’ve exported it as an XML file and uploaded it to my Google Drive so that others can make use of it as well. To use it, you can either import the XML file from the UDL dialog (Language->Define your language), or you can replace your userDefineLang.xml file in the AppData\Notepad++ folder. Enjoy!