*This is part 6 of a series on Spherical Gaussians and their applications for pre-computed lighting. You can find the other articles here:*

Part 1 – A Brief (and Incomplete) History of Baked Lighting Representations

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

Get the code on GitHub: https://github.com/TheRealMJP/BakingLab (pre-compiled binaries available here)

Back in early 2014, myself and David Neubelt started doing serious research into using Spherical Gaussians as a compact representation for our pre-computed lighting probes. One of the first things I did back then was to create a testbed application that we could use to compare various lightmap representations (SH, H-basis, SG, etc.) and quickly experiment with new ideas. As part of that application I implemented my first path tracer, which was directly integrated into the app for A/B comparisons. This turned out to be extremely useful, since having quick feedback was really helpful for evaluating quality and also for finding and fixing bugs. Eventually we used this app to finalize the exact approach that we would use when integrating SG’s into The Order: 1886.

A year later in 2015, Dave and I created another test application for experimenting with improvements that we were planning for future projects. This included things like a physically based exposure model utilizing real-world camera parameters, using the ACES[1] RRT/ODT for tone mapping, and using real-world units[2] for specifying lighting intensities. At some point I integrated an improved version of SG baking into this app that would progressively compute results in the background while the app remained responsive, allowing for quick “preview-quality” feedback after adjusting the lighting parameters. Once we started working on our SIGGRAPH presentation[3] from the 2015 physically based shading course, it occurred to us that we should really package up this new testbed and release it alongside the presentation to serve as a working implementation of the concepts we were going to cover. But unfortunately this slipped through the cracks: the new testbed required a lot of work in order to make it useful, and both Dave and I were really pressed for time due to multiple new projects ramping up at the office.

Now, more than a year after our SIGGRAPH presentation, I’m happy to announce that we’ve finally produced and published a working code sample that demonstrates baking of Spherical Gaussian lightmaps! This new app, which I call “The Baking Lab”, is essentially a combination of the two testbed applications that we created. It includes all of the fun features that we were researching in 2015, but also includes real-time progressive baking of 2D lightmaps in various formats. It also allows switching to a progressive path tracer at any time, which serves as the “ground truth” for evaluating lightmap quality and accuracy. Since it’s an amalgamation of two older apps, it uses D3D11 and the older version of my sample framework. So there’s no D3D12 fanciness, but it will run on Windows 7. If you’re just interested in looking at the code or running the app, then go ahead and head over to GitHub: https://github.com/TheRealMJP/BakingLab. If you’re interested in the details of what’s implemented in the app, then keep reading.

The primary feature of The Baking Lab is lightmap baking. Each of the test scenes includes a secondary UV set that contains non-overlapping UV’s used for mapping the lightmap onto the scene. Whenever the app starts or a new scene is selected, the baker uses the GPU to rasterize the scene into lightmap UV space. The pixel shader outputs interpolated vertex components like position, tangent frame, and UV’s to several render targets, which use MSAA to simulate conservative rasterization. Once the rasterization is completed, the results are copied back into CPU-accessible memory. The CPU then scans the render targets, and extracts “bake points” from all texels covered by the scene geometry. Each of these bake points represents the location of a single hemispherical probe to be baked.

Once all bake points are extracted, the baker begins running using a set of background threads on the CPU. Each thread continuously grabs a new work unit consisting of a group of contiguous bake points, and then loops over the bake points to compute the result for that probe. Each probe is computed by invoking a path tracer, which uses Embree[4] to allow for arbitrary ray tracing through the scene on the CPU. The path tracer returns the incoming radiance for a direction and starting point, where the radiance is the result of indirect lighting from various light sources as well as the direct lighting from the sky. The path tracer itself is a very simple unidirectional path tracer, using a few standard techniques like importance sampling, correlated multi-jittered sampling[5], and russian roulette to increase performance and/or convergence rates. The following baking modes are supported:

**Diffuse**– a single RGB value containing the result of applying a standard diffuse BRDF to the incoming lighting, with an albedo of 1.0**Half-Life 2**– directional irradiance projected onto the Half-Life 2 basis[6], making for a total of 3 sets of RGB coefficients (9 floats total)**L1 SH**– radiance projected onto the first two orders of spherical harmonics, making for a total of 4 sets of RGB coefficients (12 floats total). Supports environment specular via a 3D lookup texture.**L2 SH**– radiance projected on the first three orders of spherical harmonics, making for a total of 9 sets of RGB coefficients (27 floats total). Supports environment specular via a 3D lookup texture.**L1 H-basis**– irradiance projected onto the first two orders of H-basis[7], making for a total of 4 sets of RGB coefficients (12 floats total).**L2 H-basis**– irradiance projected onto the first three orders of H-basis, making for a total of 6 sets of RGB coefficients (18 floats total).**SG5**– radiance represented by the sum of 5 SG lobes with fixed directions and sharpness, making for a total of 5 sets of RGB coefficients (15 floats total). Supports environment specular via an approximate evaluation of per-lobe specular contribution.**SG6**– radiance represented by the sum of 6 SG lobes with fixed directions and sharpness, making for a total of 6 sets of RGB coefficients (18 floats total). Supports environment specular via an approximate evaluation of per-lobe specular contribution.**SG9**– radiance represented by the sum of 9 SG lobes with fixed directions and sharpness, making for a total of 9 sets of RGB coefficients (27 floats total). Supports environment specular via an approximate evaluation of per-lobe specular contribution.**SG12**– radiance represented by the sum of 12 SG lobes with fixed directions and sharpness, making for a total of 12 sets of RGB coefficients (36 floats total). Supports environment specular via an approximate evaluation of per-lobe specular contribution.

For SH, H-basis, and HL2 basis baking modes the path tracer is evaluated for random rays distributed about the hemisphere so that Monte Carlo integration can be used to integrate the radiance samples onto the corresponding basis functions. This allows for true progressive integration, where the baker makes N passes over each bake point, each time adding a new sample with the appropriate weighting. It looks pretty cool in action:

The same approach is used for the “Diffuse” baking mode, except that sampling rays are evaluated using a cosine-weighted hemispherical sampling scheme[8]. For SG baking, things get a little bit trickier. If the ad-hoc projection mode is selected, the result can be progressively evaluated in the same manner as the non-SG bake modes. However if either the Least Squares or Non-Negative Least Squares mode are active, we can’t run the solve unless we have all of the hemispherical radiance samples available to feed to the solver. In this case we switch to a different baking scheme where each thread fully computes the final value for every bake point that it operates on. However the thread only does this for a single bake point from each work group, and afterwards it fills in the rest of the neighboring bake points (which are arranged in a 8×8 group of texels) with the results it just computed. Each pass of of baker then fills in the next bake point in the work group, gradually computing the final result for all texels in the group. So instead of seeing the quality slowly improve across the light map, you see extrapolated results being filled in. It ends up looking like this:

While it’s not as great as a true progressive bake, it’s still better than having no preview at all.

The app supports a few settings that control some of the bake parameters, such as the number of samples evaluated per-texel and the overall lightmap resolution. The “Scene” group in the UI also has a few settings that allow toggling different components of the final render, such as the direct or indirect lighting or the diffuse/specular components. Under the “Debug” setting you can also toggle a neat visualizer that shows a visual representation of the raw data stored in the lightmap. It looks like this:

The integrated path tracer is primarily there so that you can see how close or far off you are when computing environment diffuse or specular from a light map. It was also a lot of fun to write – I recommend doing it sometime if you haven’t already! Just be careful: it may make you depressed to see how poorly your real-time approximation holds up when compared with a proper offline render.

The ground truth renderer works in a similar vein to the lightmap baker: it kicks off multiple background threads that each grab work groups of 16×16 pixels that are contiguous in screen space. The renderer makes N passes over each pixel, where each pass adds an additional sample that’s weighted and summed with the previous results. This gives you a true progressive render, where the result starts out noisy and (very) gradually converges towards a noise-free image:

The ground truth renderer is activated by checking the “Show Ground Truth” setting under the “Ground Truth” group. There’s a few more parameters in that group to control the behavior of the renderer, such as the number of samples used per-pixel and the scheme used for generating random samples.

There’s 3 different light sources supported in the app: a sun, a sky, and a spherical area light. For real-time rendering, the sun is handled as a directional light with an intensity computed automatically using the Hosek-Wilkie solar radiance model[9]. So as you change the position of the sun in the sky, you’ll see the color and intensity of the sunlight automatically change. To improve the real-time appearance, I used the disk area light approximation from the 2014 Frostbite presentation. The path tracer evaluates the sun as an infinitely-distant spherical area light with the appropriate angular radius, with uniform intensity and color also computed from the solar radiance model. Since the path tracer handles the sun as a true area light source, it produces correct specular reflections and soft shadows. In both cases the sun defaults to correct real-world intensities using actual photometric units. There is a parameter for adjusting the sun size, which will result in the sun being too bright or too dark if manipulated. However there’s another setting called “Normalize Sun Intensity” which will attempt to maintain roughly the same illumination regardless of the size, which allows for changing the sun appearance or shadow softness without changing the overall scene lighting.

The default sky mode (called “Procedural”) uses the Hosek-Wilkie sky model to compute a procedural sky from a few input parameters. These include turbidity, ground albedo, and the current sun position. Whenever the parameters are changed, the model is cached to a cubemap that’ s used for real-time rendering on the GPU. For CPU path tracing, the the sky model is directly evaluated for a direction using the sample code provided by the authors. When combined with the procedural sun model, the two light sources form a simple outdoor lighting environment that corresponds to real-world intensities. Several other sky modes are also supported for convenience. The “Simple”mode takes just a color and intensity as input parameter, and flood-fills the entire sky with a value equal to color * intensity. The “Ennis”, “Grace Cathedral”, and “Uffizi Cross” modes use corresponding HDR environment maps to fill the sky instead of a procedural model.

For local lighting, the app supports enabling a single spherical area light using the “Enable Area Light” setting. The area light can be positioned using the Position X/Position Y/Position Z settings, and its radius can be specified with the “Size” setting. There are a 4 different modes for specifying the intensity of the light:

**Luminance**– the intensity corresponds to the amount of light being emitted from the light source along an infinitesimally small ray towards the viewer or receiving surface. Uses units of cd/m^{2}. Changing the size of the light source will change the overall illumination the scene.**Illuminance**– specifies the amount of light incident on a surface at a set distance, which is specified using the “Illuminance Distance” setting. So instead of saying “how much light is coming out of the light source” like you do with the “Luminance” mode, you’re saying “how much diffuse light is being reflected from a perpendicular surface N units away”. Uses units of lux, which are equivalent to lm/m^{2}. Changing the size of the light source will*not*change the overall illumination the scene.**Luminous Power**– specifies the total amount of light being emitted from the light source in all directions. Uses units of lumens. Changing the size of the light source will*not*change the overall illumination the scene.**EV100**– this is an alternative way of specifying the luminance of the light source, using the exposure value[10] system originally suggested by Nathan Reed[11]. The base-2 logarithmic scale for this mode is really nice, since incrementing by 1 means doubling the perceived brightness. Changing the size of the light source will change the overall illumination the scene.

The ground truth renderer will evaluate the area light as a true spherical light source, using importance sampling to reduce variance. The real-time renderer approximates the light source as a single SG, and generates very simple hard shadows using an array of 6 shadow maps. By default only indirect lighting from the area light will be baked into the lightmap, with the direct lighting evaluated on the GPU. However if the “Bake Direct Area Light” setting is enabled, then the direct contribution from the area light will be baked into the lightmap.

Note that all light sources in the app are always scaled down by a factor of 2^{-10} before being using in rendering, as suggested by Nathan Reed in his blog post[11]. Doing this effectively shifts the window of values that can be represented in a 16-bit floating point value, which is necessary in order to represent specular reflections from the sun. However the UI always will always show the unshifted values, as will the debug luminance picker that shows the final color and intensity of any pixel on the screen.

As I mentioned earlier, the app implements a physically based exposure system that attempts to models the behavior and parameters of a real-world camera. Much of the implementation was based on the code from Padraic Hennessy’s excellent series of articles[12], which was in turn inspired by Sébastien Lagarde and Charles de Rousiers’s SIGGRAPH presentation from 2014[2]. When the “Exposure Mode” setting is set to the “Manual (SBS)” or “Manual (SOS)” modes, the final exposure value applied before tone mapping will be computed based on the combination of aperture size, ISO rating, and shutter speed. There is also a “Manual (Simple)” mode available where a single value on a log2 scale can be used instead of the 3 camera parameters.

Mostly for fun, I integrated a post-process depth of field effect that uses the same camera parameters (along with focal length and film size) to compute per-pixel circle of confusion sizes. The effect is off by default, and can be toggled on using the “Enable DOF” setting. Polygonal and circular bokeh shapes are supported using the technique suggested by Tiago Sousa in his 2013 SIGGRAPH presentation[13]. Depth of field is also implemented in the ground truth renderer, which is capable of achieving true multi-layer effects by virtue of using a ray tracer.

Several tone mapping operators are available for experimentation:

**Linear**– no tone mapping, just a clamp to [0, 1]**Film Stock**– Jim Hejl and Richard Burgess-Dawson’s polyomial approximation of Haarm-Peter Duiker‘s filmic curve, which was created by scanning actual film stock. Based on the implementation provided by John Hable[14].**Hable (Uncharted2)**– John Hable‘s adjustable filmic curve from his GDC 2010 presentation[15]**Hejl 2015**– Jim Hejl’s filmic curve that he posted on Twitter[16], which is a refinement of Duiker’s curve**ACES sRGB Monitor**– a fitted polynomial version of the ACES[17] reference rendering transform (RRT) combined with the sRGB monitor output display transform (ODT), generously provided by Stephen Hill.

At the bottom of the settings UI are a group of debug options that can be selected. I already mentioned the bake data visualizer previously, but it’s worth mentioning again because it’s really cool. There’s also a “luminance picker”, which will enable a text output showing you the luminance and illuminance of the surface under the mouse cursor. This was handy for validating the physically based sun and sky model, since I could use the picker to make sure that the lighting values matched what you would expect from real-world conditions. The “View Indirect Specular” option causes both the real-time renderer and the ground truth renderer to only show the indirect specular component, which can be useful for gauging the accuracy of specular computed from the lightmap. After that there’s a pair of buttons for saving or loading light settings. This will serialize the settings that control the lighting environment (sun direction, sky mode, area light position, etc.) to a file, which can be loaded in whenever you like. The “Save EXR Screenshot” is fairly self-explanatory: it lets you save a screenshot to an EXR file that retains the HDR data. Finally there’s an option to show the current sun intensity that’s used for the real-time directional light.

[1] Academy Color Encoding System – https://en.wikipedia.org/wiki/Academy_Color_Encoding_System

[2] Moving Frostbite to PBR (course notes) – http://www.frostbite.com/wp-content/uploads/2014/11/course_notes_moving_frostbite_to_pbr_v2.pdf

[3] Advanced Lighting R&D at Ready At Dawn Studios – http://blog.selfshadow.com/publications/s2015-shading-course/rad/s2015_pbs_rad_slides.pdf

[4] Embree: High Performance Ray Tracing Kernels – https://embree.github.io/

[5] Correlated Multi-Jittered Sampling – http://graphics.pixar.com/library/MultiJitteredSampling/paper.pdf

[6] Shading in Valve’s Source Engine – http://www.valvesoftware.com/publications/2006/SIGGRAPH06_Course_ShadingInValvesSourceEngine.pdf

[7] Efficient Irradiance Normal Mapping – https://www.cg.tuwien.ac.at/research/publications/2010/Habel-2010-EIN/

[8] Better Sampling – http://www.rorydriscoll.com/2009/01/07/better-sampling/

[9] Adding a Solar Radiance Function to the Hosek Skylight Model – http://cgg.mff.cuni.cz/projects/SkylightModelling/

[10] Exposure value – https://en.wikipedia.org/wiki/Exposure_value

[11] Artist-Friendly HDR With Exposure Values – http://www.reedbeta.com/blog/2014/06/04/artist-friendly-hdr-with-exposure-values/

[12] Implementing a Physically Based Camera: Understanding Exposure – https://placeholderart.wordpress.com/2014/11/16/implementing-a-physically-based-camera-understanding-exposure/

[13]CryENGINE 3 Graphics Gems – http://advances.realtimerendering.com/s2013/Sousa_Graphics_Gems_CryENGINE3.pptx

[14] Filmic Tonemapping Operators – http://filmicgames.com/archives/75

[15] Uncharted 2: HDR Lighting – http://www.gdcvault.com/play/1012351/Uncharted-2-HDR

[16] Jim Hejl on Twitter – https://twitter.com/jimhejl/status/633777619998130176

[17] Academy Color Encoding System Developer Resources – https://github.com/ampas/aces-dev

]]>

Part 1 – A Brief (and Incomplete) History of Baked Lighting Representations

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

In the two previous articles I showed workable approaches for approximating the diffuse and specular result from a Spherical Gaussian light source. On its own these techniques might seem a bit silly, since it’s not immediately why the heck it would be useful to light a scene with an SG light source. But then you might remember that these articles started off by discussing methods for storing pre-computed radiance or irradiance in lightmaps or probe grids, which is a subject we’ll finally return to.

A common process in mathematics (particularly statistics) is to take a set of data points and attempt to figure out some sort of analytical curve that can represent the data. This process is known as curve fitting[1], since the goal is to find a curve that is a good fit for the data points. There’s various reasons to do this (such as regression analysis[2]), but I find it can helpful to think of it as a form of lossy compression: a few hundred data points might require kilobytes of data, but if you can approximate that data with a curve the coefficients might only need a few bytes of storage. Here’s a simple example from Wikipedia:

*Fitting various polynomials to data generated by a sine wave. Red is first degree, green is second degree, orange is third degree, blue is forth degree.
By Krishnavedala (Own work) [CC0], via Wikimedia Commons*

In the image there’s a bunch of black dots, which represent a set of data points that we want to fit. In this case the data points come from a sine wave, but in practice the data could take any form. The various colored curves represents attempts at fitting the data using polynomials of varying degrees:

By looking at graphs and the forms of the polynomials it should be obvious that higher degrees allow for more complex curves, but require more coefficients. More coefficients means more data to store, and may also mean that the fitting process is more difficult and/or more expensive. One of the most common techniques used for fitting is least squares[3], which works by minimizing the sum of all differences between the fit curve and the original data.

The other observation we can make it that the resulting fit is essentially a linear combination of basis functions, where the basis functions are , , , and so on. There are many other basis functions we could use here instead of polynomials, such as our old friend the Gaussian! Just like polynomials, a sum of Gaussians can represent more complex functions with a handful of coefficients. As an example, let’s take a set of data points and use least squares to fit varying numbers of Gaussians:

*Fitting Gaussians to a data set using least squares. The left graph shows a fit with a single Gaussian, the middle graph shows a fit with two Gaussians, and the right graph shows a fit with three Gaussians.*

For this example I used curve_fit[4] from the scipy optimization library, which uses a non-linear least squares algorithm. Notice how as I added more Gaussians, the resulting sum became a better approximation of the raw data.

So far we’ve been fitting 1D data sets, but the techniques we’re using also work in multiple dimensions. So for instance, let’s say we had a bunch of scattered samples in random directions on a sphere defined by a 2D spherical coordinate system. And let’s say that these samples represent something like…oh I don’t know…the amount of incoming lighting along an infinitesimally narrow ray oriented in that direction. If we take all of these data points and throw some least squares at it, we can end up with a series of N Spherical Gaussians whose sum can serve as an approximation for radiance in any direction on the sphere! We just need our fitting algorithm to spit out the axis, amplitude, and sharpness of each Gaussian, or if we want can fix one of more of the SG parameters ahead of time and only fit the remaining parameters. It should be immediately obvious why this is useful, since a set of SG coefficients can be stored very compactly compared to a gigantic set of radiance samples. Of course if we only use a few Gaussians the resulting approximation will probably lose details from the original radiance function, but this is no different from spherical harmonics or other common techniques for storing approximate representations of radiance or irradiance. Let’s take a look at a fit actually looks like using an HDR environment map as input data:

*Approximating radiance on a sphere using a sum of Spherical Gaussians. The left image shows the original source radiance function taken from an HDR environment map. The middle image shows a least squares fit of 12 SG’s, and right image shows a fit of 24 SG’s.*

These images were generated Yuriy O’Donnell‘s Probulator[5], which is an excellent tool for comparing various ways of approximating radiance and irradiance on a sphere. One important thing to note here is that the fit was only performed on the amplitude of the SG’s: the axis and sharpness are pre-determined based on the number of SG’s. Probulator generates the lobe axis directions using Vogel’s method[6], but any technique for distributing points on a sphere would also work. Fitting only the lobe amplitude significantly simplifies the solve, since there are less parameters to optimize. Solving for a single parameter also allows us to use linear least squares[7], while fitting all of the parameters would require use of complex and expensive non-linear least squares[8] algorithms. Solving for less parameters also decreases the storage costs, since only the amplitudes need to be stored per-probe while the directions and sharpness can be global constants. Either way it’s good to keep in mind when examining the results. In particular it helps explain why the white lobe in the middle image doesn’t quite line up with the bright windows in the source environment. Aside from that, the results are probably what you would expect: doubling the number of lobes increases the possible complexity and sharpness of the resulting approximation, which in this case allows it to provide a better representation of some of the high-frequency details in the source image.

One odd thing you might notice in the SG approximation is the overly dark areas towards the bottom-right of the sphere. They look somewhat similar to the darkening that can show up in SH approximations, which happens due to negative coefficients being used for the SH polynomials. It turns out that something very similar is happening with our SG fit: the least squares optimizations is returning negative coefficients for some of our lobes in an attempt to minimize the error of the resulting fit.If you’re having trouble understanding why this would happen, let’s go back to 1D for a quick example. For the last 1D example I cheated a bit: the data we were fitting our Gaussians to was actually just some random noise applied to the sum of 3 Gaussians. This is why our Gaussian fit so closely resembled the source data. This time, we’ll fitting some lobes to a more complex data set:

*A data set with a discontinuity near 0, making it more difficult to fit curves to.*

This time the data has a bunch of values near the center that have a value of zero. With such a data set it’s now less obvious how a sum of Gaussians could approximate the values. If we throw least squares at the problem and have it fit two lobes, we get the following:

*The result of using least squares to fit 2 Gaussian lobes to the above data set. The left graph shows the first lobe (red), the middle graph shows the second lobe (green), and the right graph shows the sum of the two lobes (blue) overlaid onto the original data set.*

This time around the optimization resulted in a positive amplitude for the first lobe, but a *negative* amplitude for the second lobe. Looking at the sum overlaid onto to the data makes it clear why this happened: the positive lobe takes care of the all of the positive data points to the left and right, while the negative lobe brings the sum closer to zero in the middle of the graph. Upon closer inspection the negative actually causes the approximation to dip *below* zero into the negatives. We can assume that having this dip results in lower overall error for the approximation, since that’s how least squares works.

In practice, having negative coefficients and negative values from our approximation can undesirable. In fact when approximating radiance or irradiance negative values really just don’t make sense, since they’re physically impossible. In our experience we also found that the visual result of lighting a scene with negative lobes can be quite displeasing, since it tends to look very unnatural to have surfaces that are completely dark. Perhaps you remember this image from the first article showing what L2 SH looks like with a bright area light in the environment:

*A sphere with a Lambertian diffuse BRDF being lit by a lighting environment with a strong area light source. The left image shows the ground-truth result of using monte-carlo integration. The middle image shows the result of projecting radiance onto L2 SH, and then computing irradiance. The right image shows the result of applying a windowing function to the L2 SH coefficients before computing irradiance.*

We found that character faces in particular tended to look really bad when our SH light probes had strong negative lobes, and this turned out to be one of our motivations for investigating alternative approximations. We also ran into some trouble when attempting to compress signed floating point values in BC6H textures: some compressors didn’t even support compressing to that format, and those that did had noticeably worse quality.

With that in mind, it would be nice to constrain a least squares solver in such a way that it only gave us positive coefficients. Fortunately for us such a technique exists, and it’s known as non-negative least squares[9] (or NNLS for short). If we use that technique for fitting SG’s to our original radiance function instead of standard least squares, we get this result instead:

*Fitting SG’s to a radiance function using a non-negative least squares solver. The left image shows the original source radiance function taken from an HDR environment map. The middle image shows an NNLS fit of 12 SG’s, and right image shows a fit of 24 SG’s.*

This time we don’t have the dark areas in the bottom right, since the fit only uses positive lobes. But unfortunately there’s no free lunch here, since the resulting approximation is also a bit “blurrier” compared to the normal least squares fit.

Now that we’ve covered how to generate an SG approximation of radiance from source data, we can take a look at how well it stacks up against other options for a simple use case. Probably the most obvious application is computing irradiance from the radiance approximation, which can be directly used to compute standard Lambertian diffuse lighting. The following images were captured using Probulator, and they show the Stanford Bunny being lit using a few common techniques for approximating irradiance:

*A bunny model being lit by various irradiance approximations generated from the “Pisa” HDR environment map*

With the exception of Valve’s Ambient Cube, all of the approximations hold up very well when compared with the ground truth. The non-negative least squares fit is just a bit more washed out than the least squares fit, but both seem to produce perfectly acceptable results. The SH result is also very good, with no noticeable ringing artifacts. However this particular environment is a somewhat easier case, as the range of intensities isn’t as large as you might find in some realistic lighting scenarios. For a more extreme case, let’s now look at a comparison using the “Ennis” environment map:

*A bunny model being lit by various irradiance approximations generated from the “Ennis” HDR environment map*

This time there’s a much more noticeable difference between the various techniques. This is because the source environment map has a very bright window to the left, which effectively serves as a large area light source. With this particular environment the SG results start to compare pretty favorably to the SH or ambient cube approximations. The results from L2 SH have severe ringing artifacts, which manifests as areas that are either too dark or too bright on the side of the bunny facing to the right. Meanwhile , the windowed version of L2 SH blurs the lighting too much, making it appear as if the environment is more uniform than it really is. The ambient cube probe doesn’t suffer from ringing, but it does have problems with the bright lighting from the left bleeding onto the top and side of the bunny. Looking at the least squares solve for 12 SG’s, the result is pretty nice but there is a bit of ringing evident on the upper-right side of the bunny model. This ringing isn’t present in the non-negative least squares solve, since all of the coefficients end up being positive.

As I mentioned earlier, these Probulator comparisons use fixed lobe directions and sharpness. Consequently we only need to store amplitude, meaning that the storage cost of the 12 SG lobes is equivalent to 12 sets of floating-point RGB coefficients (36 floats total). L2 SH requires 9 sets of RGB coefficients, which adds up to 27 floats. The ambient cube requires only 6 sets of RGB coefficents, which is half that of the SG solve. So for this particular comparison the SG representation of radiance requires the most storage, however this highlights one of the nice points using SG as your basis: you can solve for any number of lobes you’d like, allowing you to choose easily trade off quality vs. storage and performance cost. Valve’s ambient cube is only defined for 6 lobes, and that number can’t be increased since the lobes must remain orthogonal to each other.

For full lighting probes where the sampling surface can have any orientation, storing the radiance or irradiance on a sphere makes perfect sense. However it makes less sense if we would like to store baked lighting in 2D textures where all of the sample points lie on the surface of a mesh. For that case storing data for a full sphere is wasteful, since half of the data will point “into” the surface and therefore won’t be useful. With SG’s this is fairly trivial to correct: we can just choose to solve for lobes that only lie on the upper hemisphere surrounding the surface’s normal direction.

*The left side shows a configuration where 9 SG lobes are distributed about a sphere, forming a full spherical probe. The right side shows 5 SG lobes located on a hemisphere surrounding the normal of a surface (the blue arrow). *

To fully generate the compact radiance representation for an entire 2D lightmap, we need to gather radiance samples at every texel location, and then perform a solve to fit the samples to a set of SG lobes. It’s really no different from the spherical probe case we used as a testbed in Probulator, except now we’re generating many probes. The other main differences is that for lightmap generating the appropriate radiance requires sampling a full 3D scene, as opposed to using an environment map as we did with Probulator. This sort of problem is best solved with a ray tracer, using an algorithm such as path tracing to compute the incoming radiance for a particular ray. The following image shows a visualization of what the lightmap result looks like for a simple scene:

*Hemispherical radiance probes generated at the texel locations of a 2D lightmap applied to a simple scene. Each probe uses 9 SG lobes oriented about the surface normal of the underlying geometry.*

In the previous article we covered techniques that can be used to compute a specular term from SG light sources. If we apply them to a set of lobes that approximate the incoming radiance, then we can compute an approximation of the full environment specular response. For small lobe counts this is only going to be practical for materials with a relatively high roughness. This is because our SG approximation of the incoming radiance won’t be able to capture high-frequency details from the environment, and it would it be very obvious if those details were missing from the reflections on smooth surfaces. However for rougher surfaces where the BRDF itself starts to act as a low-pass filter, an SG approximation won’t have as much noticeable error. As an example, here’s what SG specular looks like for a test scene with a GGX roughness value of 0.25:

*Comparison of indirect specular approximation for a test scene with a GGX roughness of 0.25. The top-left image is a path-traced rendering of the final scene with full indirect and direct lighting. The top-right image shows the indirect environment specular term from SG9 lightmaps, with the exposure increased by 8x. The bottom left image shows the indirect specular term from L2 SH. The bottom right image shows the indirect specular term from a path-traced render of the scene.*

Compared to the ground truth, the SG approximation does pretty well in some places and not-so-well in others. In general it captures a lot of the overall specular response, but suffers from some of the higher-frequency detail being absent in the probes. This results in certain areas looking a bit “washed-out”, such as the right-most wall of the scene. You can also see that the reflections of the cylinder, sphere, and torus are not properly represented in the SG version for the same reason. On the positive side, lightmap samples are pretty dense in terms of their spatial distribution. They’re far more dense than what you typically achieve with sparse cubemap probes placed throughout the scene, which typically suffer from all kinds of occlusion and parallax artifacts. The SG specular also compares pretty favorably to the L2 SH result (despite having the same storage cost), which looks even more washed-out than the SG result. The SH implementation used a 3D lookup texture to store pre-computed SH coefficients, and you can see some interpolation artifacts from this method if you look at the far wall perpendicular to the camera.

In the first part of our presentation[10] at last year’s Physically Based Shading Course[11], Dave covered some of these details and also shared some information about how we implemented SG lighting probes into The Order: 1886. Much of the implementation was very similar to what I’ve described in this series of articles: we stored 5-12 SG lobes (the count was chosen per-level chunk) in our 2D lightmaps with fixed axis directions and sharpnesses, and we evaluated diffuse and specular lighting using the approximations that I outlined earlier. For dynamic meshes, we baked uniform 3D grids of spherical probes containing 9 SG lobes that were stored in 3D textures. The grids were defined by OBB’s that were hand-placed in the scene by our lighting artists, along with density parameters. In both cases we made use of hardware texture filtering to interpolate between neighboring probes before computing per-pixel lighting.

Much of our implementation closely followed the work of Wang[12] and Xu[13], at least in terms of the techniques used for approximating diffuse and specular lighting from a set of SG lobes. Where our work diverged quite a bit was in the choice to use fixed lobe directions and sharpness values. Both Wang and Xu generated their set of SG lobes by performing a solve on a single environment map, which produced the necessary axis, sharpness, and amplitude parameters. In our case, we always knew that we were going need many probes in order to maintain high-fidelity pre-computed lighting for our scenes. At the time (early 2014) we were already employing 2D lightmaps containing L1 H-basis hemispherical probes (4 coefficients) and 3D grids containing L2 spherical harmonics probes. Both could be quite dense in spatial terms, which allowed capturing important shadowing detail.

To make SG’s work with for these requirements, we had to carefully consider our available trade-offs. After getting a simple test-bed up and running where we could bake 2D lightmaps for a scene, it became quickly apparent that varying the axis directions per-texel wasn’t necessarily the best choice for us. Aside from the obvious issue of requiring more storage space and making the solve more complex and expensive, we also ran into issues resulting from interpolating the axis direction over a surface. The problem is most readily apparent at shadow boundaries: one texel might have visibility of a bright light source which causes a lobe to point in that direction, while its neighboring pixel might have no visibility and thus could end up with a lobe pointing in a completely different direction. The axis would then interpolate between the two directions for pixels between the two texels, which can cause noticeable specular shifting. This isn’t necessarily an unsolvable problem (the Frequency Domain Normal Map Filtering paper[14] extended their EM solver with a term that attempts to align neighboring lobes for coherency), but considering our time and memory constraints it made sense to just sidestep the problem altogether. Ultimately we ended up using fixed lobe directions, using the UV tangent frame as the local coordinate space for lightmap probes. Tangent space is natural for this purpose since it’s Z axis is the surface normal of the mesh, and also because it tends to be continuous over a mesh (you’ll have discontinuities wherever you have UV seams, but artists tend to hide those as best they can anyway). For the 3D probe grids, the directions were in world space for simplicity.

After deciding to fix the lobe directions, we also ultimately decided to go with a fixed sharpness value as well. This of course has the same obvious benefits as fixing the axis direction (less storage, simpler solve), which were definitely appealing. However another motivating factor came from the way were doing our solve. Or rather, our lack of a proper solve. Our early testbed performed all lightmap baking on the CPU, which allowed us to easily integrate packages like Eigen[15] so that we could use a battle-tested least squares solver. However our actual production baking farm at Ready at Dawn uses a Cuda baker that leverages Nvidia’s OptiX library to perform arbitrary ray tracing on a cluster of GPU’s.While Cuda does have optimization libraries that could have achieved what we wanted, we faced a bigger problem: memory. Our baker worked by baking many sample points in parallel on the GPU, with a kernel program that would generate and trace the many rays required for monte carlo integration. When we previously used SH and H-basis this approach worked well: both SH and H-basis are built upon orthogonal basis functions, which allows for continuous integration by projecting each sample onto those basis functions. Gathering thousands of samples per-texel is feasible with this setup, since those samples don’t need to be explicitly stored in memory. Instead, each new sample is projected onto the in-progress result for the texel and then discarded. This is not the case when performing a solve: the solver needs access to *all* of the samples, which means keeping them all around in memory. This is a big problem when you have many texels in flight simultaneously, and only a limited amount of GPU memory. Like the intepolation issue it’s probably not unsolveable, but we really looking for a less risky approach that would be more of a drop-in replacement for the SH integration.

Ultimately we ended up saying “screw it”, and projected on the SG lobes as they formed an orthogonal basis (even though they didn’t). Since the basis functions weren’t orthogonal the results ended up rather blurry compared to a least squares solve, which muddied some of the detail in the source environment for a probe. Here’s a comparison to show you what I mean:

*A comparison of different techniques for computing a set of SG lobes that approximate the radiance from an environment map*

Of the three techniques presented here, the naive projection is the worst at capturing the details from the source map and also the blurriest. As we saw earlier the least squares solve is the best at capturing sharp changes, but achieves this by over-darkening certain areas. NNLS is the “just right” fit for this particular case, doing a better job of capturing details compared to the projection but without using any negative lobes.

Before we switched to baking SG probes for our grids and lightmaps, we had a fairly standard system for applying pre-convolved cubemap probes that were hand-placed throughout our scenes. Previously these were our only source of environment specular lighting, but once we had SG’s working we began to to use the SG specular approximation to compute environment specular directly from lightmaps and probe grids. Obviously our low number of SG’s was not sufficient for accurately approximating environment specular for smooth and mirror-like surfaces, so our cubemap system remained relevent. We ended up coming up with a simple scheme to choose between cubemap and lightmap specular per-pixel based on the surface roughness, with a small zone in between where we would blend between the two specular sources. The following images taken from our SIGGRAPH slides use a color-coding to showing the specular source chosen for each pixel from one of our scenes:

*The top image shows a scene from The Order: 1886. The bottom image shows the same scene with a color coding applied to show the environment specular source for each pixel.*

To finish off, here’s some more comparison images taken from our SIGGRAPH slides:

*Several comparison images from The Order: 1886 showing scenes with and without environment specular from SG lightmaps*

We were quite happy with the improvements that come from the SG baking pipeline we introduced for The Order, but we also feel like we’ve barely scratched the surface. Our decision to use fixed lobe directions and sharpness values was probably the right one, but it also limits what we can do. When you take SG’s and compare them to a fixed set of basis functions like SH, perhaps the biggest advantage is the fact that you can use an arbitrary combination of lobes to represent a mix of high and low-frequency features. So for instance you can represent a sun and a sky by having one wide lobe with the sky color, and one very narrow and bright lobe oriented towards the sun. We gave up that flexibility when we decided to go with our simpler ad-hoc projection, and it’s something we’d like to explore further in the future. But until then we can at least enjoy the benefits of having a representation that allows for an environment specular approximation and also avoids ringing artifacts when approximating diffuse lighting.

Aside from the solve, I also think it would be worth taking the time to investigate better approximations for the specular BRDF. In particular I would like to try using something better than just evaluating the cosine, Fresnel, and shadow-masking terms at the center of the warped BRDF lobe. The assumption that those terms are constant over the lobe break down the most when the roughness is high, and in our case we’re only ever using SG specular for rough materials! Therefore I think it would be worth the effort to come up with a more accurate representation of those terms.

[1] Curve fitting – https://en.wikipedia.org/wiki/Curve_fitting

[2] Regression analysis – https://en.wikipedia.org/wiki/Regression_analysis

[3] Least squares – https://en.wikipedia.org/wiki/Least_squares

[4] scipy.optimize.curve_fit – http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html#scipy-optimize-curve-fit

[5] Probulator – https://github.com/kayru/Probulator

[6] Spreading points on a disc and on a sphere – http://blog.marmakoide.org/?p=1

[7] Linear least squares – https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)

[8] Non-linear least squares – https://en.wikipedia.org/wiki/Non-linear_least_squares

[9] Non-negative least squares – https://en.wikipedia.org/wiki/Non-negative_least_squares

[10] Advanced Lighting R&D at Ready At Dawn Studios – http://blog.selfshadow.com/publications/s2015-shading-course/rad/s2015_pbs_rad_slides.pptx

[11] SIGGRAPH 2015 Course: Physically Based Shading in Theory and Practice – http://blog.selfshadow.com/publications/s2015-shading-course/

[12] All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance – http://renpr.org/project/sg_ssdf.htm

[13] Anisotropic Spherical Gaussians – http://cg.cs.tsinghua.edu.cn/people/~kun/asg/

[14] Frequency Domain Normal Map Filtering – http://www.cs.columbia.edu/cg/normalmap/

[15] Eigen – http://eigen.tuxfamily.org/index.php?title=Main_Page

]]>

Part 1 – A Brief (and Incomplete) History of Baked Lighting Representations

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

In the previous article, we explored a few ways to compute the contribution of an SG light source when using a diffuse BRDF. While this is already useful, it would be nice to be able work with more complex view-dependent BRDF’s so that we can also compute a specular contribution. In this article I’ll explain a possible approach for approximating the response of a microfacet specular BRDF when applied to an SG light, and also introduce the concept of Anisotropic Spherical Gaussians.

You probably recall from the past 5 years of physically based rendering presentations[1] that a standard microfacet BRDF takes the following structure:

Recall that *D*(h) is the *distribution* term, which tells us the percentage of active microfacets for a particular combination of view and light vectors. It’s also commonly known as the *normal distribution function*, or “NDF” for short. It’s generally parameterized on a *roughness* parameter, which essentially describes the “bumpiness” of the underlying microgeometry. Lower roughness values lead to sharp mirror-like reflections with a very narrow and intense specular lobe, while higher roughness values lead to more broad reflections with a wider specular lobe. Most modern games (including The Order: 1886) use the GGX (AKA Trowbridge-Reitz) distribution for this purpose.

*The top image shows a GGX distribution term with a roughness parameter of 0.5. The bottom image shows the same GGX distribution term with a roughness parameter of 0.1. For both graphs, the X axis represents the angle between the surface normal and the half vector. Click on either image for an interactive graph.*

Next we have G(i, o, h) which is referred to as the *geometry **term* or the *masking-shadow function*. Eric Heitz’s paper on the masking-shadow function[2] has a fantastic explanation of how these terms work and why they’re important, so I would strongly recommend reading through it if you haven’t already. As Eric explains, the geometry term actually accounts for two different phenomena. The first is the local occlusion of reflections by other neighboring microfacets. Depending on the angle of the incoming lighting and the bumpiness of the microsurface, a certain amount of the lighting will be occluded by the surface itself, and this term attempts to model that. The other phenomenon handled by this term is visibility of the surface from the viewer. A surface that isn’t visible naturally can’t reflect light towards the viewer, and so the geometry term models this masking effect using the view direction and the roughness of the microsurface.

*The Smith visibility term for GGX as a function of the angle between the surface normal and the light direction. The roughness used is 0.25, and the angle between the normal and the view direction is 0. Click on the image for an interactive graph.*

Finally we have F(o, h), which is the *Fresnel* term. This familiar term determines the amount of light that is reflected vs. the amount that is refracted or absorbed, which varies depending on the index of refraction for a particular interface as well as the angle of incidence. For a microfacet BRDF we compute the Fresnel term using the angle between the active microfacet direction (the half vector) and the light or view direction (it is equivalent to use either). This generally causes the reflected intensity to increase when both the viewing direction and the light direction are grazing with respect to the surface normal. In real-time graphics Schlick’s approximation is typically used to represent the Fresnel term, since it’s a bit cheaper than evaluating the actual Fresnel equations. It is also common to parameterize the Fresnel term on the reflectance at zero incidence (referred to as “F0”) instead of directly working with an index of refraction.

*Schlick’s approximation of the Fresnel term as a function of the angle between the half vector and the light direction. The graph uses a value of 0.04 for F0. Click on the image for an interactive graph.*

Let’s now return to our example of a Spherical Gaussian light source. In the previous article we explored how to approximate the purely diffuse contribution from such a light source, but how do we handle the specular response? If we go down this road, we ideally want to do it in a way that uses a microfacet specular model so that we can remain consistent with our lighting from punctual light sources and IBL probes. If we start out by just plugging our BRDF and SG light into the rendering equation we get this monstrosity:

Unlike diffuse we now have multiple terms inside the integral, many of which are view-dependent. This suggests that we will need to combine several aggressive optimizations in order to get anywhere close to our desired result.

Let’s start out by taking the distribution term on its own, since it’s arguably the most important part of the specular BRDF. The distribution determines the overall shape and intensity of the resulting specular highlight, and has a widely-varying frequency response over the domain of possible roughness values. If we look at the graph of the GGX distribution from the earlier section, the shape does resemble a Gaussian. Unfortunately it’s not an exact match: the GGX distribution has the characteristic narrow highlight and long tails which we can’t match using a single SG. We could get a closer fit by summing multiple Gaussians, but for this article we’ll keep things simple by sticking with a single lobe. If we go back to Wang et al.‘s paper[3] that we referenced earlier, we can see that they suggest a very simple fit of an SG to a Cook-Torrance distribution term:

If we look closely, we can see that they’re actually fitting for the Gaussian model mentioned in the original Cook-Torrance paper[4], which is similar to the one used in the Torrance-Sparrow model[5]. Note that this should not be confused the with Beckmann distribution that’s also mentioned in the Cook-Torrance paper, which is actually a 2D Gaussian in the slope domain (AKA parallel plane domain). However the shape isn’t even the biggest problem here, as the variant they’re using isn’t normalized. This means the peak of the Gaussian will always have a height of 1.0, rather than shrinking when the roughness increases and growing when the roughness decreases. Fortunately this is really easy to fix, since we have a simple analytical formula for computing the integral of an SG. Therefore if we set the amplitude to 1 over the integral, we end up with a normalized distribution:

SG DistributionTermSG(in float3 direction, in float roughness) { SG distribution; distribution.Axis = direction; float m2 = roughness * roughness; distribution.Sharpness = 2 / m2; distribution.Amplitude = 1.0f / (Pi * m2); return distribution; }

Let’s take a look at a graph of our distribution term, and see how far off it is from our target:

*Top graph shows a comparison between GGX, Beckmann, normalized Gaussian, and SG distribution terms with a roughness of 0.25. The bottom shows the same comparison with a roughness of 0.5. Click on either image for an interactive graph.*

It should come as no surprise that our SG distribution is almost an exact match for a normalized version of a Gaussian distribution. When the roughness is lower it’s also a fairly close match for Beckmann, but for higher roughness the difference gets to be quite large. In both cases our distribution isn’t a perfect fit for GGX, but it’s a workable approximation.

So we now have a normal distribution function in SG format, but unfortunately we’re not quite ready to use it as-is. The problem is that we’ve defined our distribution in the half-vector domain: the axis of the SG points in the direction of the surface normal, and we use the half-vector as our sampling direction. If we want to use an SG product to compute the result of multiplying our distribution with an SG light source, then we need to ensure that the distribution lobe is in the same domain as our light source. Another way to think about this is that center of our distribution shifts depending on viewing angle, since the half-vector also shifts as the camera moves.

In order to make sure that our distribution lobe is in the correct domain, we need “warp” our distribution so that it lines up with the current BRDF slice for our viewing direction. If you’re wondering what a BRDF slice is, it basically tells you “if I pick a particular view direction. what is the value of my BRDF for a given light direction?”. So if you had a mirror BRDF, the slice would just be a single ray pointing in the direction of the view ray reflected off the surface normal. For microfacet specular BRDF’s, you get a lobe that’s roughly centered around the reflected view direction. Here’s what a polar graph of a GGX BRDF slice looks like if we only consider the distribution term:

*Polar graph of the GGX distribution term from two different viewing angles. The blue line is the view direction, the green line is the surface normal, and the pink line is the reflected view direction. The top image shows the resulting slice when the viewing angle is 0 degrees, and the bottom images shows the resulting slice when the viewing angle is 45 degrees.*

Wang et al. proposed a simple spherical warp operator that would orient the distribution lobe about the reflected view direction, while also modifying the sharpness to take into account the differential area at the original center of the lobe:

SG WarpDistributionSG(in SG ndf, in float3 view) { SG warp; warp.Axis = reflect(-view, ndf.Axis); warp.Amplitude = ndf.Amplitude; warp.Sharpness = ndf.Sharpness; warp.Sharpness /= (4.0f * max(dot(ndf.Axis, view), 0.0001f)); return warp; }

Let’s now look at the result of that warp, and compare it to what the actual GGX distribution looks like:

*Result of applying a spherical warp to the SG distribution term (green) compared with the actual GGX distribution (red). The top graph shows a viewing angle of 0 degrees, and the bottom graph shows a viewing angle of 45 degrees.*

A quick look at the graph shows us that the shape is a bit off, but our warped lobe is ending up in approximately in the right spot. We’ll revisit the lobe shape later, but for now let’s try combining our distribution with the rest of our BRDF.

In the previous section we figured out how to obtain an SG approximation of our distribution term, and also warp it so that it’s in the correct domain for multiplication with our SG light source. Using our SG product operator would allow to us to represent the result of multiplying the distribution with the light source as another SG, which we could then multiply with other terms using SG operators…or at least we could if we were to represent the remaining terms as SG’s. Unfortunately this turns out to be a problem: the geometry and Fresnel terms are nothing at all like a Gaussian, which rules out approximating them as an SG. Wang et al. sidestep this issue by making the somewhat-weak assumption that the values of these terms will be constant across the entire BRDF lobe, which allows them to pull the terms out of the integral and evaluate them only for the axis direction of the BRDF lobe. This allows the resulting BRDF to still capture some of the glancing angle effects, with similar performance cost to evaluating those terms for a punctual light source. The downside is that the error of these terms will increase as the BRDF lobe becomes wider (increasing roughness), since the value of the geometry and Fresnel terms will vary more the further they are from the lobe center. Putting it all together gives the following specular BRDF:

The last thing we need to account for is the cosine term that needs to be multiplied with the BRDF inside of the hemispherical integral. Wang et al. suggest using an SG product to compute an SG representing the result of multiplying the distribution term SG and their SG approximation of a clamped cosine lobe, which can then be multiplied with the lighting lobe using an SG inner product. In order to avoid another expensive SG operation, we will instead use the same approach that we used for geometry and Fresnel terms and evaluate the cosine term using the BRDF lobe axis direction. Implementing it in shader code gives us the following:

float GGX_V1(in float m2, in float nDotX) { return 1.0f / (nDotX + sqrt(m2 + (1 - m2) * nDotX * nDotX)); } float3 SpecularTermSGWarp(in SG light, in float3 normal, in float roughness, in float3 view, in float3 specAlbedo) { // Create an SG that approximates the NDF. SG ndf = DistributionTermSG(normal, roughness); // Warp the distribution so that our lobe is in the same // domain as the lighting lobe SG warpedNDF = WarpDistributionSG(ndf, view); // Convolve the NDF with the SG light float3 output = SGInnerProduct(warpedNDF, light); // Parameters needed for the visibility float3 warpDir = warpedNDF.Axis; float m2 = roughness * roughness; float nDotL = saturate(dot(normal, warpDir)); float nDotV = saturate(dot(normal, view)); float3 h = normalize(warpedNDF.Axis + view); // Visibility term evaluated at the center of // our warped BRDF lobe output *= GGX_V1(m2, nDotL) * GGX_V1(m2, nDotV); // Fresnel evaluated at the center of our warped BRDF lobe float powTerm = pow((1.0f - saturate(dot(warpDir, h))), 5); output *= specAlbedo + (1.0f - specAlbedo) * powTerm; // Cosine term evaluated at the center of the BRDF lobe output *= nDotL; return max(output, 0.0f); }

Let’s now (finally) take a look at what our specular approximation looks like for a scene being lit by an SG light source:

*A scene being lit by an SG light source using our diffuse and specular approximations. The scene is using a uniform roughness of 0.128.*

So if we look at the specular highlights in the above image, you may notice that while the highlights on the red and green walls look pretty good, the highlight on floor seems a bit off. The highlight is rather wide and rounded, and our intuition tells that a highlight viewed at a grazing angle should appear vertically stretched across the floor. To determine why the look is so off, we need to revisit our warp of the distribution term. Previously when we looked at the polar graph of the distribution I noted that the shape of the resulting lobe was a bit off, even though it was oriented in approximately the right direction to line up with the BRDF slice. To get a better idea of what’s going on, let’s now take a look at a 3D graph of the GGX distribution term:

*3D graph of the GGX distribution term. The left image shows the distribution when the viewing angle is very low, while the right image shows the distribution when the viewing angle is very steep.*

Looking at the left image where the angle between the view direction and the surface normal are very low, the distribution is radially symmetrical just like an SG lobe. However as the viewing angle increases the lobe begins to stretch, looking more and more like the non-symmetrical lobe that we see in the right image. The stretching of the lobe is what causes the stretched highlights that occur when applying the BRDF, and our warped SG is unable to properly represent it since it must remain radially symmetric about its axis.

Luckily for us, there is a better way. In 2013 Xu et al.released a paper titled Anisotropic Spherical Gaussians[6], where they explain how they extended SG’s to support anisotropic lobe width/sharpness. They’re defined like this:

Instead of having a single axis direction, an ASG now has , , and , which are three orthogonal vectors forming a complete basis. It’s very similar to a tangent frame, where the normal, tangent, and bitangent together make up an orthonormal basis. With an ASG you also now have two separate sharpness parameters, and , which control the sharpness with respect to and . So for example setting to 16 and to 64 will result in stretched Gaussian lobe that’s skinnier along the direction, and with its center located at . Visualizing such an ASG on the surface of a sphere gives you this:

*An Anisotropic Spherical Gaussian visualized on the surface of a sphere. has a value of 16, and has a value of 64.*

Like SG’s, the equations lend themselves to simple HLSL implementations:

struct ASG { float3 Amplitude; float3 BasisZ; float3 BasisX; float3 BasisY; float SharpnessX; float SharpnessY; }; float3 EvaluateASG(in ASG asg, in float3 dir) { float sTerm = saturate(dot(asg.BasisZ, dir)); float lambdaTerm = asg.SharpnessX * dot(dir, asg.BasisX) * dot(dir, asg.BasisX); float muTerm = asg.SharpnessY * dot(dir, asg.BasisY) * dot(dir, asg.BasisY); return asg.Amplitude * sTerm* exp(-lambdaTerm - muTerm); }

The ASG paper provides us with formulas for two operations that we can use to improve the quality of our specular approximation for SG light sources. The first is a new warping operator that can take an NDF represented as an isotropic SG, and stretch it along the view direction to produce an ASG that better represents the actual BRDF. The other helpful forumula it provides is for convolving an ASG with an SG, which we can use to convolve a anisotropically warped NDF lobe with an SG lighting lobe. Let’s take a look at how their improved warp looks when graphing the NDF for a large viewing angle:

*The left image is a 3D graph of the distribution term when using a spherical warp. The middle image is the resulting distribution term when using an anisotropic warp. The right image is the actual GGX distribution term*.

The anisotropic distribution looks much closer to the actual GGX NDF, since it now has the vertical stretching that we were missing. Let’s now implement their formulas in HLSL so we can try the new warp in our test scene:

float3 ConvolveASG_SG(in ASG asg, in SG sg) { // The ASG paper specifes an isotropic SG as // exp(2 * nu * (dot(v, axis) - 1)), // so we must divide our SG sharpness by 2 in order // to get the nup parameter expected by the ASG formula float nu = sg.Sharpness * 0.5f; ASG convolveASG; convolveASG.BasisX = asg.BasisX; convolveASG.BasisY = asg.BasisY; convolveASG.BasisZ = asg.BasisZ; convolveASG.SharpnessX = (nu * asg.SharpnessX) / (nu + asg.SharpnessX); convolveASG.SharpnessY = (nu * asg.SharpnessY) / (nu + asg.SharpnessY); convolveASG.Amplitude = Pi / sqrt((nu + asg.SharpnessX) * (nu + asg.SharpnessY)); float3 asgResult = EvaluateASG(convolveASG, sg.Axis); return asgResult * sg.Amplitude * asg.Amplitude; } ASG WarpDistributionASG(in SG ndf, in float3 view) { ASG warp; // Generate any orthonormal basis with Z pointing in the // direction of the reflected view vector warp.BasisZ = reflect(-view, ndf.Axis); warp.BasisX = normalize(cross(ndf.Axis, warp.BasisZ)); warp.BasisY = normalize(cross(warp.BasisZ, warp.BasisX)); float dotDirO = max(dot(view, ndf.Axis), 0.0001f); // Second derivative of the sharpness with respect to how // far we are from basis Axis direction warp.SharpnessX = ndf.Sharpness / (8.0f * dotDirO * dotDirO); warp.SharpnessY = ndf.Sharpness / 8.0f; warp.Amplitude = ndf.Amplitude; return warp; } float3 SpecularTermASGWarp(in SG light, in float3 normal, in float roughness, in float3 view, in float3 specAlbedo) { // Create an SG that approximates the NDF SG ndf = DistributionTermSG(normal, roughness); // Apply a warpring operation that will bring the SG from // the half-angle domain the the the lighting domain. ASG warpedNDF = WarpDistributionASG(ndf, view); // Convolve the NDF with the light float3 output = ConvolveASG_SG(warpedNDF, light); // Parameters needed for evaluating the visibility term float3 warpDir = warpedNDF.BasisZ; float m2 = roughness * roughness; float nDotL = saturate(dot(normal, warpDir)); float nDotV = saturate(dot(normal, view)); float3 h = normalize(warpDir + view); // Visibility term output *= GGX_V1(m2, nDotL) * GGX_V1(m2, nDotV); // Fresnel float powTerm = pow((1.0f - saturate(dot(warpDir, h))), 5); output *= specAlbedo + (1.0f - specAlbedo) * powTerm; // Cosine term output *= nDotL; return max(output, 0.0f); }

If we swap out our old specular approximation for one that uses an anisotropic warp, our test scene now looks much better!

*Diffuse and specular approximations applied to an SG light source, using an anisotropic warp to approximate the NDF as an ASG.*

Many of the images that I used for visually comparing the SG BRDF approximations were generated using Disney’s BRDF Explorer. When we were doing our initial research into SG’s and figuring out how to implement them, BRDF Explorer was extremely valuable both for understanding the concepts and for experimenting with different variations. If you’d like to this yourself, there’s a very easy way to do that courtesy of Nick Brancaccio. Nick was kind of enough to create his own awesome WebGL version of BRDF Explorer, and it comes pre-loaded with options for comparing an approximate SG specular BRDF with the GGX BRDF. I would recommend checking it out if you’d like to play around with the BRDF’s and make some pretty 3D graphs!

[1] Background: Physics and Math of Shading (SIGGRAPH 2013 Course: Physically Based Shading in Theory and Practice) – http://blog.selfshadow.com/publications/s2013-shading-course/hoffman/s2013_pbs_physics_math_notes.pdf

[2] Understanding the Masking-Shadowing Function in Microfacet-Based BRDFs – http://jcgt.org/published/0003/02/03/

[3] All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance – http://research.microsoft.com/en-us/um/people/johnsny/papers/sg.pdf

[4] A Reflectance Model for Computer Graphics – http://inst.cs.berkeley.edu/~cs294-13/fa09/lectures/cookpaper.pdf

[5] Theory for Off-Specular Reflection From Roughened Surfaces – http://www.graphics.cornell.edu/~westin/pubs/TorranceSparrowJOSA1967.pdf

[6] Anisotropic Spherical Gaussians – http://cg.cs.tsinghua.edu.cn/people/~kun/asg/

[7] BRDF Explorer – https://github.com/wdas/brdf

[8] WebGL BRDF Explorer – https://depot.floored.com/brdf_explorer

]]>

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

In the previous post we covered a few of the universal properties of SG’s. Now that we have a few tools on our utility belt, let’s discuss an example of how we can actually use those properties to our advantage in a rendering scenario. Let’s say we have a surface point **x** being lit by a light source **L**, with the light source being represented by an SG named **G _{L}**. Recall from the previous article that the equation for computing the outgoing radiance towards the eye for a surface with a Lambertian diffuse BRDF looks like the following:

For punctual light sources that are essentially a scaled delta function, computing this is as easy as N dot L. But we’re in trouble if we have an area light source, since we typically don’t have a closed form solution to the integral. But let’s suppose that we have some strange Gaussian light source, whose angular falloff can be exactly represented by an SG (normally area light sources are considered to have uniform emission over their surface, but let’s imagine we have case where the emission is non-uniform). If we can treat the light as an SG, then we can start to consider some of the handy Gaussian tools that we laid out earlier. In particular the inner product starts to seem really useful: it gives us the result of integrating the product of two SG’s, which is basically what we’re trying to accomplish in our diffuse lighting equation. The big catch is that we’re not integrating the product of two SG’s, we’re instead integrating the product of an SG with a clamped cosine lobe. Obviously a Gaussian lobe has a different shape compared to a clamped cosine lobe, but perhaps if we squint our eyes from a distance you could substitute one for another. This approach was taken by Wang et al.[1], who suggested fitting a cosine lobe to a single SG with **λ**=2.133 and **a**=1.17. If we follow in their footsteps, the diffuse calculation is straightforward:

SG CosineLobeSG(in float3 direction) { SG cosineLobe; cosineLobe.Axis = direction; cosineLobe.Sharpness = 2.133f; cosineLobe.Amplitude = 1.17f; return cosineLobe; } float3 SGIrradianceInnerProduct(in SG lightingLobe, in float3 normal) { SG cosineLobe = CosineLobeSG(normal); return max(SGInnerProduct(lightingLobe, cosineLobe), 0.0f); } float3 SGDiffuseInnerProduct(in SG lightingLobe, in float3 normal, in float3 albedo) { float3 brdf = albedo / Pi; return SGIrradianceInnerProduct(lightingLobe, normal) * brdf; }

Not too bad, eh? Of course it’s worth taking a closer look at our cosine lobe approximation, since that’s definitely going to introduce some error. Perhaps the best way to do this is to look at the graphs of a real cosine lobe and our SG approximation side-by-side:

*Comparison of a clamped cosine cosine lobe (red) with an SG approximation (blue)*

Just from looking at the graph it’s fairly obvious that an SG isn’t necessarily a great fit for a cosine lobe. First of all, the amplitude actually goes above 1, which might seem a bit weird at first glance. However it’s necessary to ensure that the area under the curve remains somewhat consistent with the cosine lobe, since there would otherwise be a loss of energy. The other weirdness stems from the fact that an SG never actually hits 0 anywhere on the sphere, hence the long “tail” on the graph of the SG. This essentially means that if the SG were integrated against a punctual light source, the lighting would “wrap” around the sphere past the point where N dot L is equal to 0. The situation actually isn’t all that different from an SH representation of a cosine lobe, which also extends past π/2:

*L1 (green) and L2 (purple) SH approximation of a clamped cosine lobe compared with an SG approximation (blue) and the actual clamped cosine (red).*

In the SH case the approximation actually goes negative, which is arguably worse than the long tail of the SG approximation. The L1 approximation is particularly bad in this regard. If at this point you’re trying to imagine what these approximations look like on a sphere, let me save you the trouble by providing an image:

*From left to right: actual clamped cosine lobe, SG cosine approximation, L2 SH cosine approximation*

Now that we’ve finished analyzing you approximation of a cosine lobe, we need to take a look at the actual results of computing diffuse lighting from an SG light source. Let’s start off by graphing the results of computing irradiance using an SG inner product, and compare it against what we get by using brute-force numerical integration to compute the result of multiplying the SG with an actual clamped cosine (*not* the approximate SG cosine lobe that we use for the inner product):

*The resulting irradiance from an SG light source (with sharpness of 4.0) as a function of the angle between the light source and the surface normal. The red graph is the result of using numerical integration to compute the integral of the SG light source multiplied with a clamped cosine, while the blue graph was computed using an SG inner product of the light source with a cosine lobe approximated as an SG.*

As you might expect, the inner product approximation has some error when compared with the “ground truth” provided by numerical integration. It’s worth pointing out that this error is purely a consequence of approximating the clamped cosine lobe as an SG: the inner product provides the exact result of the integral, and thus shouldn’t introduce any error on its own. Despite this error, the resulting irradiance isn’t hugely far off from our ground truth. The biggest difference is for the angles facing away from the light, where the SG inner product version has a stronger tail. Visualizing the resulting diffuse on a sphere gives us the following:

*The left sphere shows the resulting diffuse lighting from an SG light source with a sharpness of 4.0, where the irradiance was computed using monte carlo importance sampling. The right sphere shows the resulting diffuse lighting from computing irradiance using an SG inner product with an approximation of a cosine lobe.*

As an alternative to representing the cosine lobe with an SG and computing the inner product, we can consider a cheaper approximation. One advantage of working with SG’s is that each lobe is always symmetrical about its axis, which is also where its value is the highest. We also discussed earlier how we can compute the integral of an SG over the sphere, which gives us its total energy. This suggests that if we want to be frugal with our shader cycles, we can pull terms out of the integral over the sphere/hemisphere and only evaluate them for the SG axis direction. This obviously introduces error, but that error may be acceptable if the term we pull out is relatively “smooth”. If we apply this approximation to computing irradiance and diffuse lighting, we get this:

Translating to HLSL, we get the following functions:

float3 SGIrradiancePunctual(in SG lightingLobe, in float3 normal) { float cosineTerm = saturate(dot(lightingLobe.Axis, normal)); return cosineTerm * 2.0f * Pi * (lightingLobe.Amplitude) / lightingLobe.Sharpness; } float3 SGDiffusePunctual(in SG lightingLobe, in float3 normal, in float3 albedo) { float3 brdf = albedo / Pi; return SGIrradiancePunctual(lightingLobe, normal) * brdf; }

If we overlay the graph of our super-cheap irradiance approximation on the graph we were looking at earlier, we get this:

*The resulting irradiance from an SG light source (with sharpness of 4.0) as function of the angle between the light source and the surface normal. The red graph was computed using numerical integration, while the blue graph was computed using an SG inner product of the light source with a cosine lobe approximated as an SG. The green graph was computed by pulling the cosine term out of the integral, and multiplying it with the result of integrating the SG light about the sphere.*

The result shouldn’t be a surprise: it’s just a scaled version of the standard clamped cosine.It’s pretty obvious just by looking that this particular optimization will introduce quite a bit of error, particularly where theta is greater than π/2. But it is cheap, since we’ve effectively turned an SG into a point light. This is makes it useful tool for cases where we may want to approximate the convolution of an SG light source with a BRDF or some other function that isn’t easily represented as an SG.

So it’s nice to have a cheap option, but what if we want more accuracy than our inner product approximation? Fortunately for us, Stephen Hill was able to formulate another alternative approximation that directly fits a curve to the integral of a cosine lobe with an SG. His implementation is actually formulated for a normalized SG (where the integral about the sphere is equal to 1.0), but we can easily account for this by computing the integral and scaling the result by that value:

float3 SGIrradianceFitted(in SG lightingLobe, in float3 normal) { const float muDotN = dot(lightingLobe.Axis, normal); const float lambda = lightingLobe.Sharpness; const float c0 = 0.36f; const float c1 = 1.0f / (4.0f * c0); float eml = exp(-lambda); float em2l = eml * eml; float rl = rcp(lambda); float scale = 1.0f + 2.0f * em2l - rl; float bias = (eml - em2l) * rl - em2l; float x = sqrt(1.0f - scale); float x0 = c0 * muDotN; float x1 = c1 * x; float n = x0 + x1; float y = saturate(muDotN); if(abs(x0) <= x1) y = n * n / x; float result = scale * y + bias; return result * ApproximateSGIntegral(lightingLobe); }

The result is very close to the ground truth, which is very cool considering that it might actually be cheaper than our inner product approximation!

*The resulting irradiance from an SG light source (with sharpness of 4.0) as function of the angle between the light source and the surface normal. The red graph was computed using numerical integration, while the blue graph was computed using an SG inner product of the light source with a cosine lobe approximated as an SG. The orange graph was computed using Stephen Hill’s fitted curve approximation.*

If we once again visualize the result on the sphere and compare with our previous results, we get the following:

*The left sphere shows the resulting diffuse lighting from an SG light source with a sharpness of 4.0, where the irradiance was computed using an SG inner product with an approximation of a cosine lobe. The middle sphere shows the resulting diffuse lighting from computing irradiance using monte carlo importance sampling. The right sphere shows the resulting diffuse lighting from Stephen Hill’s fitted approximation.*

[1] All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance – https://www.microsoft.com/en-us/research/wp-content/uploads/2009/12/sg.pdf

]]>

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

In the previous article, I gave a quick rundown of some of the available techniques for representing a pre-computed distribution of radiance or irradiance for each lightmap texel or probe location. In this article, I’m going cover the basics of Spherical Gaussians, which are a type of spherical radial basis function (SRBF for short). The concepts introduced here will serve as the core set of tools for working with Spherical Gaussians, and in later articles I’ll demonstrate how you can use those tools to form an alternative for approximating incoming radiance in pre-computed lightmaps or probes.

I should point out that this article is still going to be somewhat high-level, in that it won’t provide full derivations and background details for all formulas and operations. However it is my hope that the material here will be sufficient to gain a basic understanding of SG’s, and also use them in practical scenarios.

A Spherical Gaussian, or “SG” for short, is essentially a Gaussian function[1] that’s defined on the surface of a sphere. If you’re reading this, then you’re probably already familar with how a Gaussian function works in 1D: you compute the distance from the center of the Gaussian, and use this distance as part of a base-e exponential. This produces the characteristic “hump” that you see when you graph it:

*A Gaussian in 1D centered at x=0, with a height of 3*

You’re probably also familiar with how it looks in 2D, since it’s very commonly used in image processing as a filter kernel. It ends up looking like what you would get if you took the above graph and revolved it around its axis

*A Gaussian filter applied to a 2D image of a white dot, showing that the impulse response is effectively a Gaussian function in 2D*

A Spherical Gaussian still works the same way, except that it now lives on the surface of a sphere instead of on a line or a flat plane. If you’re having trouble visualizing that, imagine if you took the above image and wrapped it around a sphere like wrapping paper. It ends up looking like this:

*A Spherical Gaussian visualized on the surface of a sphere*

Since an SG is defined on a sphere rather than a line or plane, it’s parameterized differently than a normal Gaussian. A 1D Gaussian function always has the following form:

The part that we need to change in order to define the function on a sphere is the “(x – b)” term. This part of the function essentially makes the Gaussian a function of the cartesian distance between a given point and the center of the Gaussian, which can be trivially extended into 2D using the standard distance formula. To make this work on a sphere, we must instead make our Gaussian a function of the angle between two unit direction vectors. In practice we do this by making an SG a function of the *cosine* of the angle between two vectors, which can be efficiently computed using a dot product like so:

Just like a normal Gaussian, we have a few parameters that control the shape and location of the resulting lobe. First we have μ, which is the *axis*, or *direction* of the lobe. It effectively controls where the lobe is located on the sphere, and always points towards the exact center of the lobe. Next we have λ, which is the *sharpness* of the lobe. As this value increases, the lobe will get “skinnier”, meaning that the result will fall off more quickly as you get further from the lobe axis. Finally we have *a*, which is the *amplitude* or *intensity* of the lobe. If you were to look at a polar graph of an SG, it would correspond to the height of the lobe at its peak. The amplitude can be a scalar value, or for graphics applications we may choose to make it an RGB triplet in order to support varying intensities for different color channels. This all lends itself to a simple HLSL code definition:

struct SG { float3 Amplitude; float3 Axis; float Sharpness; };

Evaluating an SG is also easily expressible in HLSL. All we need is a normalized direction vector representing the point on the sphere where we’d like to compute the value of the SG:

float3 EvaluateSG(in SG sg, in float3 dir) { float cosAngle = dot(dir, sg.Axis); return sg.Amplitude * exp(sg.Sharpness * (cosAngle - 1.0f)); }

Now that we know what a Spherical Gaussian is, what’s so useful about them anyway? One pontential benefit is that they’re fairly intuitive: it’s not terribly hard to understand how the 3 parameters work, and how each parameter affects the resulting lobe. The other main draw is that they inherit a lot of useful properties of “regular” Gaussians, which makes them useful for graphics and other related applications. These properties have been explored and utilized in several research papers that were primarily aimed at achieving pre-computed radiance transfer (PRT) with both diffuse and specular material response. In particular, the paper entitled “All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance[2]” by Wang et al. was our main inspiration for pursuing SG’s at RAD.

So what are these useful Gaussian properties that we can exploit? For starters, taking the product of 2 Gaussians functions produces another Gaussian. For an SG, this is equivalent to visiting every point on the sphere, evaluating 2 different SG’s, and multiplying the two results. Since it’s an operation that takes 2 SG’s and produces another SG, it is sometimes referred to as a “vector” product. It’s defined as the following:

In HLSL code, it looks like this:

SG SGProduct(in SG x, in SG y) { float3 um = (x.Sharpness * x.Axis + y.Sharpness * y.Axis) / (x.Sharpness + y.Sharpness); float umLength = length(um); float lm = x.Sharpness + y.Sharpness; SG res; res.Axis = um * (1.0f / umLength); res.Sharpness = lm * umLength; res.Amplitude = x.Amplitude * y.Amplitude * exp(lm * (umLength - 1.0f)); return res; }

Gaussians have another really nice property in that their integrals have a closed-form solution, which is known as the error function[3]. The property also extends to SG’s, where we can compute the integral of an SG over the entire sphere:

Computing an integral will essentially tell us the total “energy” of an SG, which can be useful for lighting calculations. It can also be useful for *normalizing* an SG, which produces an SG that integrates to 1. Such a normalized SG is suitable for representing a probability distribution, such as an NDF. In fact, a normalized SG is actually equivalent to a von Mises-Fisher distribution[4] in 3D!

An SG integral is actually very cheap to compute…or at least it would be if we removed the exponential term. It turns out that the term actually approaches 1 very quickly as the SG’s sharpness increases, which means we can potentially drop it with little error as long as we know that the sharpness is high enough. Here’s what a graph of looks like for increasing sharpness:

*A graph of the exponential term in computing the integral of an SG over a sphere, which approaches 1 as the sharpness increases. The X-axis is sharpness, and the Y-axis is the value of .*

This all lends itself naturally to HLSL implementations for accurate and approximate versions of an SG integral:

float3 SGIntegral(in SG sg) { float expTerm = 1.0f - exp(-2.0f * sg.Sharpness); return 2 * Pi * (sg.Amplitude / sg.Sharpness) * expTerm; } float3 ApproximateSGIntegral(in SG sg) { return 2 * Pi * (sg.Amplitude / sg.Sharpness); }

If we were to use our SG integral formula to compute the integral of the product of two SG’s, we can compute what’s known as the *inner product*, or *dot product* of those SG’s. The operation is usually defined like this:

However we can avoid numerical precision issues by using an alternate arrangement:

…which looks like this in HLSL:

float3 SGInnerProduct(in SG x, in SG y) { float umLength = length(x.Sharpness * x.Axis + y.Sharpness * y.Axis); float3 expo = exp(umLength - x.Sharpness - y.Sharpness) * x.Amplitude * y.Amplitude; float other = 1.0f - exp(-2.0f * umLength); return (2.0f * Pi * expo * other) / umLength; }

SG’s have what’s known as “compact-ε” support, which means that it’s possible to determine an angle θ such that all points within θ radians of the SG’s axis will have a value greater than ε. This property is potentially more useful if we flip it around so that we calculate a sharpness λ that results in a given θ for a particular value of ε:

float SGSharpnessFromThreshold(in float amplitude, in float epsilon, in float cosTheta) { return (log(epsilon) - log(amplitude)) / (cosTheta - 1.0f); }

One last operation I’ll discuss is rotation. Rotating an SG is trivial: all you need to do is apply your rotation transform to the SG’s axis vector and you have a rotated SG! You can apply the transform using a matrix, a quaternion, or any other means you might have for rotating a vector. This is a welcome change from SH, which requires a very complex transform once you go above L1.

[1] Gaussian Function – https://en.wikipedia.org/wiki/Gaussian_function

[2] All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance – http://research.microsoft.com/en-us/um/people/johnsny/papers/sg.pdf

[3] Error Function – https://en.wikipedia.org/wiki/Error_function

[4] von-Mises Fisher Distribtion – https://en.wikipedia.org/wiki/Von_Mises%E2%80%93Fisher_distribution

]]>

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

For part 1 of this series, I’m going to provide some background material for our research into Spherical Gaussians. The main purpose is cover some of the alternatives to the approach we used for The Order: 1886, and also to help you understand why we decided to persue Spherical Gaussians. The main empahasis is going to be on discussing what exactly we store in pre-baked lightmaps and probes, and how that data is used to compute diffuse or specular lighting. If you’re already familiar with the concepts of pre-computing radiance or irradiance and approximating them using basis functions like the HL2 basis or Spherical Harmonics, then you will probably want to skip to the next article.

Before we get started, here’s a quick glossary of the terms I use the formulas:

- – the outgoing radiance (lighting) towards the viewer
- – the incoming radiance (lighting) hitting the surface
- – the direction pointing towards the viewer (often denoted as “V” in shader code dealing with lighting)
- – the direction pointing towards the incoming radiance hitting the surface (often denoted as “L” in shader code dealing with lighting)
- – the direction of the surface normal
- – the 3D location of the surface point
- – integral about the hemisphere
- – the angle between the surface normal and the incoming radiance direction
- – the angle between the surface normal and the outgoing direction towards the viewer
- – the BRDF of the surface

Games have used pre-computed lightmaps for almost as long as they have been using shaded 3D graphics, and they’re still quite popular in 2016. The idea is simple: pre-compute a lighting value for every texel, then sample those lighting values at runtime to determine the final appearance of a surface. It’s a simple concept to grasp, but there are some details you might not think about if you’re just learning how they work. For instance, what exactly does it mean to store “lighting” in a texture? What exact value are we computing, anyway? In the early days the value fetched from the lightmap was simply multiplied with the material’s diffuse albedo color (typically done with fixed-function texture stages), and then directly output to the screen. Ignoring the issue of gamma correction and sRGB transfer functions for the moment, we can work backwards from this simple description to describe this old-school approach in terms of the rendering equation. This might seem like a bit of a pointless exercise, but I think it helps build a solid base that we can use to discuss more advanced techniques.

So we know that our lightmap contains a single fixed color per-texel, and we apply it the same way regardless of the viewing direction for a given pixel. This implies that we’re using a simple Lambertian diffuse BRDF, since it lacks any sort of view-dependence. Recall that we compute the outgoing radiance for a single point using the following integral:

If we substitute the standard diffuse BRDF of for our BRDF (where C_{diffuse} is the diffuse albedo of the surface), then we get the following:

On the right side we see that we can pull the constant terms out the integral (the constant term is actually the entire BRDF!), and what we’re left with lines up nicely with how we handle lightmaps: the expensive integral part is pre-computed per-texel, and then the constant term is applied at runtime per-pixel. The “integral part” is actually computing the incident irradiance, which lets us finally identify the quantity being stored in the lightmap: it’s irradiance! In practice however most games would not apply the 1 / π term at runtime, since it would have been impractical to do so. Instead, let’s assume that the 1 / π was “baked” into the lightmap, since it’s constant for all surfaces (unlike the diffuse albedo, which we consider to be *spatially varying*). In that case, we’re actually storing a reflectance value that takes the BRDF into account. So if we wanted to be precise, we would say that it contains “the diffuse reflectance of a surface with C_{diffuse} = 1.0″, AKA the maximum possible outgoing radiance for a surface with a diffuse BRDF.

One of the key concepts of lightmapping is the idea of reconstructing the final surface appearance using data that’s stored at different rates in the spatial domain. Or in simpler words, we store lightmaps using one texel density while combining it with albedo maps that have a different (usually higher) density. This lets us retain the appearance of high-frequency details without actually computing irradiance integrals per-pixel. But what if we want to take this concept a step further? What it we also want the irradiance itself to vary in response to texture maps, and not just the diffuse albedo? By the early 2000’s normal maps were starting to see common use for this purpose, however they were generally only used when computing the contribution from punctual light sources. Normal maps were no help with light maps that only stored a single (scaled) irradiance value, which meant that that pure ambient lighting would look very flat compared to areas using dynamic lighting:

*Areas in direct lighting (on the right) have a varying appearance due to a normal map, but areas in shadow (on the left) have no variation due to being lit by a baked lightmap containing only a single irradiance value.*

To make lightmaps work with normal mapping, we need to stop storing a single value and instead somehow store a *distribution* of irradiance values for every texel. Normal maps contain a range of normal directions, where those directions are generally restricted to the hemisphere around a point’s surface normal. So if we want our lightmap to store irradiance values for all possible normal map values, then it must contain a distribution of irradiance that’s defined for that same hemisphere. One of the earliest and simplest examples of such a distribution was used by Half-Life 2[1], and was referred to as Radiosity Normal Mapping[2]:

*Image from “Shading in Valve’s Source Engine “, SIGGRAPH 2006*

Valve essentially modified their lightmap baker to compute 3 values instead of 1, with each value computed by projecting the irradiance signal onto one of the corresponding orthogonal basis vectors in the above image. At runtime, the irradiance value used for shading would be computed by blending the 3 lightmap values based on the cosine of the angle between the normal map direction and the 3 basis directions (which is cheaply computed using a dot product). This allowed them to effectively vary the irradiance based on the normal map direction, thus avoiding the “flat ambient” problem described above.

While this worked for their static geometry, there still remained the issue of applying pre-computed lighting to dynamic objects and characters. Some early games (such as the original Quake) used tricks like sampling the lightmap value at a character’s feet, and using that value to compute ambient lighting for the entire mesh. Other games didn’t even do that much, and would just apply dynamic lights combined with a global ambient term. Valve decided to take a more sophisticated approach that extended their hemispherical lightmap basis into a full spherical basis formed by 6 orthogonal basis vectors:

*Image from “Shading in Valve’s Source Engine “, SIGGRAPH 2006*

The basis vectors coincided with the 6 face directions of a unit cube, which led Valve to call this basis the “Ambient Cube”. By projecting irradiance in all directions around a point in space (instead of a hemisphere surrounding a surface normal) onto their basis functions, a dynamic mesh could sample irradiance for any normal direction and use it to compute diffuse lighting. This type of representation is often referred to as a *lighting probe*, or often just “probe” for short.

With Valve’s basis we can combine normal maps and light maps to get diffuse lighting that can vary in response to high-frequency normal maps. So what’s next? For added realism we would ideally like to support more complex BRDF’s, including view-dependent specular BRDF’s. Half-Life 2 handled environment specular by pre-generating cubemaps at hand-placed probe locations, which is still a common approach used by modern games (albeit with the addition of pre-filtering[3] used to approximate the response from a microfacet BRDF). However the large memory footprint of cubemaps limits the practical density of specular probes, which can naturally lead to issues caused by incorrect parallax or disocclusion.

*A combination of incorrect parallax and disocclusion when using a pre-filtered environment as a source for environment specular. Notice the bright edges on the sphere, which are actually caused by the sphere reflecting itself!*

With that in mind it would nice to be able to get some sort of specular response out of our lightmaps, even if only for a subset of materials. But if that is our goal, then our approach of storing an irradiance distribution starts to become a hinderance. Recall from earlier that with a diffuse BRDF we were able to completely pull the BRDF out of the irradiance integral, since the Lambertian diffuse BRDF is just a constant term. This is no longer the case even with a simple specular BRDF, whose value varies depending on both the viewing direction as well as the incident lighting direction.

If you’re working with the Half-Life 2 basis (or something similar), a tempting option might be to compute a specular term as if the 3 basis directions were directional lights. If you think about what this means, it’s basically what you get if you decide to say “screw it” and pull the specular BRDF out of the irradiance integral. So instead of Integrate(BRDF * Lighting * cos(theta)), you’re doing BRDF * Integrate(Lighting * cos(theta)). This will definitely give you *something,* and it’s perhaps a lot better than nothing. But you’ll also effectively lose out on a ton of your specular response, since you’ll only get specular when your viewing direction appropriately lines up with your basis directions according the the BRDF slice. To show you what I mean by this, here’s a comparison:

*The top image shows a path-traced rendering of a green wall being lit by direct sun lighting. The middle image shows the indirect specular component of the top image, with exposure increased by 4x. The bottom image shows the resulting specular from treating the HL2 basis directions as directional lights.*

Hopefully these images clearly show the problem that I’m describing. In the bottom image, you get specular reflections that look just like they came from a few point lights, since that’s effectively what you’re simulating. Meanwhile in the middle image with proper environment reflections, you can see that the the entire green wall effectively acts as an area light, and you get a very broad specular reflections across the entire floor. In general the problem tends to be less noticeable though as roughness increases, since higher roughness naturally results in broader, less-defined reflections that are harder to notice.

If we want to do better, we must instead find a way to store a radiance distribution and then efficiently integrate it against our BRDF. It’s at this point that we turn to spherical harmonics. Spherical harmonics (SH for short) have become a popular tool for real-time graphics, typically as a way to store an approximation of indirect lighting at discrete probe locations. I’m not going to go into the full specifics of SH since that could easily fill an entire article[4] on its own. If you have no experience with SH, the key thing to know about them is that they basically let you approximate a function defined on a sphere using a handful of coefficients (typically either 4 or 9 floats per RGB channel). It’s sort-of as if you had a compact cubemap, where you can take a direction vector and get back a value associated with that direction. The big catch is that you can only represent very low-frequency (fuzzy) signals with lower-order SH, which can limit what sort of things you can do with it. You can project detailed, high-frequency signals onto SH if you want to, but the resulting projection will be very blurry. Here’s an example showing what an HDR environment map looks like projected onto L2 SH, which requires 27 coefficients for RGB:

*The top image is an HDR environment map containing incoming radiance values about a sphere, while the bottom image shows the result of projecting that environment onto L2 spherical harmonics.*

In the case of irradiance, SH can work pretty well since it’s naturally low-frequency. The integration of incoming radiance against the cosine term effectively acts as a low-pass filter, which makes it a suitable candidate for approximation with SH. So if we project irradiance onto SH for every probe location or lightmap texel, we can now do an SH “lookup” (which is basically a few computations followed by a dot product with the coefficients) to get the irradiance in any direction on the sphere. This means we can get spatial variation from albedo and normal maps just like with the HL2 basis!

It also turns out that SH is pretty useful for *computing* irradiance from input radiance, since we can do it really cheaply. In fact it can do it so cheaply, it can be done at runtime by folding it into the SH lookup process. The reason it’s so cheap is because SH is effectively a frequency-domain representation of the signal, and when you’re in the frequency domain convolutions can be done with simple multiplication. In the spatial domain, convolution with a cubemap is an N^2 operation involving many samples from an input radiance cubemap. If you’re interested in the full details, the process was described in Ravi Ramamoorthi’s seminal paper[5] from 2001, with derivations provided in another article[6].

*The Stanford Bunny model being lit with diffuse lighting from an L2 spherical harmonics probe*

So we’ve established that SH works for approximating irradiance, and that we can convert from radiance to irradiance at runtime. But what does this have to do with specular? By storing an approximation of radiance instead of irradiance in our probes or lightmaps (albeit, a very blurry version of radiance), we now have the signal that we need to integrate our specular BRDF against in order to produce specular reflections. All we need is an SH representation of our BRDF, and we’re a dot product away from environment specular! The only problem we have to solve is how to actually *get* an SH representation of our BRDF.

Unfortunately a microfacet specular BRDF is quite a bit more complicated than a Lambertian diffuse BRDF, which makes our lives more difficult. For diffuse lighting we only needed to worry about the cosine lobe, which has the same shape regardless of the material or viewing direction. However a specular lobe will vary in shape and intensity depending on the viewing direction, material roughness, and the fresnel term at zero incidence (AKA F_{0}). If all else fails, we can always use monte-carlo techniques to pre-compute the coefficients and store the result in a lookup texture. At first it may seem like we need at parameterize our lookup table on 4 terms, since the viewing direction is two-dimensional. However we can drop a dimension if we follow in the footsteps[7] of the intrepid engineers at Bungie, who used a neat trick for their SH specular implementation in Halo 3[8]. The key insight that they shared was that the specular lobe shape doesn’t actually change as the viewer rotates around the local Z axis of the shading point (AKA the surface normal). It actually only changes based on the *viewing angle*, which is the angle between the view vector and the local Z axis of the surface. If we exploit this knowledge, we can pre-compute the coefficients for the set of possible viewing directions that are aligned with the local X axis. Then at runtime, we can rotate the coefficients so that the resulting lobe lines up with the actual viewing direction. Here’s an image to show you what I mean:

*Rotating a specular lobe from the X axis to its actual location based on the viewing direction, which is helpful for pre-computing the SH coefficients into a lookup texture*

So in this image the checkerboard is the surface being shaded, and the red, green and blue arrows are the local X, Y, and Z axes of the surface. The transparent lobe represents the specular lobe that we precomputed for a viewpoint that’s aligned with the X axis, but has the same viewing angle. The blue arrow shows how we can rotate the specular lobe from its original position to the actual position of the lobe based on the current viewing position, giving us the desired specular response. Here’s a comparison showing what it looks like it in action:

*The top image is a scene rendered with a path tracer. The middle image shows the indirect specular as rendered by a path tracer, with exposure increased 4x. The bottom image shows the indirect specular term computing an L2 SH lightmap, also with exposure increased by 4x. *

Not too bad, eh? Or at least…not too bad as long as we’re willing to store 27 coefficients per lightmap texel, and we’re only concerned with rough materials. The comparison image used a GGX α parameter of 0.39, which is fairly rough.

One common issue with issue with SH is a phenomenon known as “ringing”, which is described in Peter-Pike Sloan’s Stupid Spherical Harmonics Tricks[9]. Ringing artifacts tends to show up when you have a very intense light source one side of the sphere. When this happens, the SH projection will naturally result in negative lobes on the opposite side of the sphere, which can result very low (or even negative!) values when evaluated. It’s generally not too much of an issue for 2D lightmaps, since lightmaps are only concerned with the incoming radiance for a hemisphere surrounding the surface normal. However they often show up in probes, which store radiance or irradiance about the entire sphere. The solution suggested by Peter-Pike Sloan is to apply a windowing function to the SH coefficients, which will filter out the ringing artifacts. However the windowing will also introduce additional blurring, which may remove high-frequency components from the original signal being projected. The following image shows how ringing artifacts manifest when using SH to compute irradiance from an environment with a bright area light, and also shows how windowing affects the final result:

*A sphere with a Lambertian diffuse BRDF being lit by a lighting environment with a strong area light source. The left image shows the ground-truth result of using monte-carlo integration. The middle image shows the result of projecting radiance onto L2 SH, and then computing irradiance. The right image shows the result of applying a windowing function to the L2 SH coefficients before computing irradiance.*

[1] Shading in Valve’s Source Engine (SIGGRAPH 2006) – http://www.valvesoftware.com/publications/2006/SIGGRAPH06_Course_ShadingInValvesSourceEngine.pdf

[2] Half Life 2 / Valve Source Shading – http://www2.ati.com/developer/gdc/D3DTutorial10_Half-Life2_Shading.pdf

[3] Real Shading in Unreal Engine 4 – http://blog.selfshadow.com/publications/s2013-shading-course/karis/s2013_pbs_epic_notes_v2.pdf

[4] Spherical Harmonic Lighting: The Gritty Details – http://www.research.scea.com/gdc2003/spherical-harmonic-lighting.pdf

[5] An Efficient Representation for Irradiance Environment Maps – https://cseweb.ucsd.edu/~ravir/papers/envmap/

[6] On the Relationship between Radiance and Irradiance: Determining the illumination from images of a convex Lambertian object – https://cseweb.ucsd.edu/~ravir/papers/invlamb/

[7] The Lighting and Material of Halo 3 (Slides) – https://developer.amd.com/wordpress/media/2012/10/S2008-Chen-Lighting_and_Material_of_Halo3.pdf

[8] The Lighting and Material of Halo 3 (Course Notes) – http://developer.amd.com/wordpress/media/2013/01/Chapter01-Chen-Lighting_and_Material_of_Halo3.pdf

[9] Stupid Spherical Harmonics Tricks – http://www.ppsloan.org/publications/StupidSH36.pdf

]]>

So nearly year and a half ago myself and Dave Neubelt gave a presentation at SIGGRAPH where we described the approach that we developed for approximating incoming radiance using Spherical Gaussians in both our lightmaps and 3D probe grids. We had planned on releasing a source code demo as well as course notes that would serve as a full set of implementation details, but unfortunately those efforts were sidetracked by other responsibilities. We had actually gotten pretty far on a demo, and it started to seem pretty silly to let a whole year go by without releasing it. So I’ve cleaned it up, and put it on GitHub for all to learn from and/or make fun of. So if you’re just interested in seeing the code or running the demo, then go ahead and download it:

https://github.com/TheRealMJP/BakingLab

For those looking for some in-depth written explanation, I’ve also decided to write a series of blog posts that should hopefully shed some light on the basics of using SG’s in rendering. The first post provides background material by explaining common approaches to storing pre-computing lighting data in lightmaps and/or probes. The second post focuses on explaining the basics of Spherical Gaussians, and demonstrating some of their more useful properties. The third post explains how the various SG properties can be used to compute diffuse lighting from an SG light source. The fourth post goes even deeper and covers methods for approximating the specular contribution from an SG light source. The fifth post explores some approaches for using SG’s to create a compact approximation of a lighting environment, and compares the results with spherical harmonics. Finally, the sixth posts discusses features present in the the lightmap baking demo that we’ve released on GitHub.

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

Enjoy, and please leave a comment if you have questions, concerns, or corrections!

]]>

https://github.com/TheRealMJP/DeferredTexturing

https://github.com/TheRealMJP/DeferredTexturing/releases (Precompiled Binaries)

Unless you’ve been in a coma for the past year, you’ve probably noticed that there’s a lot of buzz and excitement around the new graphics API’s that are available for PC and mobile. One of the biggest changes brought by both D3D12 and Vulkan is that they’ve ditched the old slot-based system for binding resources that’s been in use since…forever. In place of the old system, both API’s have a adopted a new model[1] based around placing opaque resource descriptors in contiguous ranges of GPU-accessible memory. The new model has the potential to be more efficient, since a lot of hardware (most notably AMD’s GCN-based GPU’s) can read their descriptors straight from memory instead of having to keep them in physical registers. So instead of having the driver take a slot-based model and do behind-the-scenes gymnastics to put the appropriate descriptors into tables that the shader can use, the app can just put the descriptors in a layout that works right from the start.

The new style of providing resources to the GPU is often referred to as “bindless”, since you’re no longer restricted to explicitly binding textures or buffers through dedicated API functions. The term “bindless” originally comes from Nvidia, who were the first to introduce the concept[2] through their NV_bindless_texture[3] extension for OpenGL. Their material shows some serious reductions in CPU overhead by skipping standard resource binding, and instead letting the app place 64-bit descriptor handles (most likely they’re actually pointers to descriptors) inside of uniform buffers. One major difference between Nvidia bindless and D3D12/Vulkan bindless is that the new APIs don’t allow you to simply put descriptor handles inside of constant/uniform buffers. Instead, they require you to manually specify (through a root signature) how you’ll organize your tables of descriptors for a shader. It might seem more complicated the Nvidia extension, but doing it this way has a big advantage: it lets D3D12 still support hardware that has no support (or limited support) for pulling descriptors from memory. It also still allows you to go full-on Nvidia-style bindless via support for unbounded texture arrays in HLSL. With unbounded arrays you can potentially put all of your descriptors in one giant table, and then index into that array using values from root constants/constant buffers/structured buffers/etc. This basically lets you treat an integer the same as a “handle” in Nvidia’s approach, with the added benefit that you don’t need to actually store a full 64-bit integer. Not only can this be really efficient, but it also opens the door to new rendering techniques that use GPU-generated values to determine which textures to fetch from.

One such use case for bindless textures is deferred texturing. The concept is pretty straightforward: instead of writing out a G-Buffer containing all of the material parameters required for shading, you instead write out your interpolated UV’s as well as a material ID. Then during your deferred pass you can use your material ID to figure out which textures you need to sample, and use the UV’s from your G-Buffer to actually sample them. The main benefit is that you can ensure that your textures are only sampled for visible pixels, without worrying about overdraw or quad packing. Depending on your approach, you may also be able to save on some G-Buffer space by virtue of not having to cram every material parameter in there. In practice you actually need more than just UV and material ID. For normal mapping you need your full tangent frame, which at minimum requires a quaternion. For mipmaps and anisotropic filtering we also need the screen-space derivatives of our UV’s. These can computed in the pixel shader and then explicitly stored in the G-Buffer, or you can compute them from G-Buffer as long as you’re willing to live with occasional artifacts. Nathan Reed has a nice write-up[4] on his blog discussing the various possibilities for G-Buffer layouts, so I would suggest reading through his article for some more details.

The place where bindless helps is in the actual sampling of the material textures during the deferred lighting phase. By putting all of your texture descriptors into one big array, the lighting pass can index into that array in order to sample the textures for any given material. All you need is a simple mapping of material ID -> texture indices, which you can do by indexing into a structured buffer. Of course it’s not really *required* to have bindless in order to pull this off. If you’re willing to stuff all of your textures into a big texture array, then you could achieve the same thing on D3D10-level hardware. Or if you use virtual mapping, then it’s pretty trivial to implement since everything’s already coming from big texture atlases. In fact, the virtual mapping approach has already been used in a shipping game, and was described at SIGGRAPH last year[5][6]. That said, the bindless approach is probably the easiest to get running and also places the least constraints on existing pipelines and assets.

As a way to test out the brand-new D3D12 version of my sample framework, I decided to have a go at writing a simple implementation of a deferred texturing renderer. It seemed like a good way to get familiar with some of the new features offered by D3D12, while also making sure that the new version of my sample framework was ready for prime-time. I also hoped that I could gain a better understanding of some of the practical issues involved in implementing deferred texturing, and use the experience to judge whether or not it might be a appealing choice for future projects.

Here’s a quick breakdown of how I ultimately set up my renderer:

- Bin lights and decals into clusters
- 16×16 tiles aligned to screen space, 16 linearly-partitioned depth tiles

- Render sun and spotlight shadows
- 4 2048×2048 cascaded shadow maps for the sun
- 1024×1024 standard shadow maps for each spotlight

- Render scene to the G-Buffer
- Depth (32bpp)
- Tangent frame as a packed quaternion[7] (32bpp)
- Texture coordinates (32bpp)
- Depth gradients (32bpp)
- (Optional) UV gradients (64bpp)
- Material ID (8bpp)
- 7-bit Material index
- 1-bit for tangent frame handedness

- If MSAA is not enabled:
- Run a deferred compute shader that perfoms deferred shading for all pixels
- Read attributes from G-Buffer

- Use material ID to get texture indices, and use those indices to index into descriptor tables
- Blend decals into material properties
- Apply sunlight and spotlights

- Render the sky to the output texture, using the depth buffer for occlusion

- Run a deferred compute shader that perfoms deferred shading for all pixels
- Otherwise if MSAA is enabled
- Detect “edge” pixels requiring per-sample shading using Material ID
- Classify 8×8 tiles as having either having edges or having no edges
- Render the sky to an MSAA render target texture, using the depth buffer for occlusion
- Run a deferred compute shader for non-edge tiles, shading only 1 subsample per pixel
- Run a deferred compute shader for edge tiles, shading 1 subsample for non-edge pixels and all subsamples for edge pixels
- Resolve MSAA subsamples to a non-MSAA texture

- Post-processing
- Present

In order to have another reference point for evaluating quality and performance, I also decided to implement a clustered forward[8] path side-by-side with the deferred renderer:

- Bin lights into clusters (16×16 tiles aligned to screen space, 16 linearly-partitioned depth tiles)
- Render sun and spotlight shadows
- (optional) Render scene depth prepass
- Render scene with full shading
- Read from material textures
- Blend decals into material properties
- Apply sunlight and spotlights

- if MSAA is enabled
- Resolve MSAA subsamples to a non-MSAA texture

- Post-processing
- Present

As you may have already noticed, I used the same clustered approach to binning lights for both the deferred and forward paths. I did this because it was simpler than having two different approaches to light selection, and it also seemed more fair to take that aspect out of the equation when comparing performance. However you could obviously use whatever approach you’d like for binning lights when using deferred texturing, such as classic light bounds rasterization/blending or tile-based subfrustum culling[9].

To implement the actual binning, I used a similar setup to what was described in Emil Persson’s excellent presentation[8] from SIGGRAPH 2013. If you’re not familiar, the basic idea is that you chop up your view frustum into a bunch subfrusta, both in screen-space XY as well as along the Z axis. This essentially looks like a voxel grid, except warped to fit inside the frustum shape of a perspective projection. This is actually rather similar to the approach used in tiled deferred or Forward+[10] rendering, except that you also bucket along Z instead of fitting each subfrustum to the depth buffer. This can be really nice for forward rendering, since it lets you avoid over-including lights for tiles with a large depth range.

In the Avalanche engine they decided to perform their light binning on the CPU, which is feasible since the binning doesn’t rely on a depth buffer. Binning on the CPU can make a lot of sense, since a typical GPU approach to culling will often have many redundant calculations across threads and thread groups.It’s also possible make it pretty fast through CPU parallelization, as demonstrated by Intel’s Clustered Forward Shading sample[11].

For my own implementation I decided that I wanted to stick with the GPU, and so I went with a different approach. Having shipped a game that used a Forward+-style renderer, I’m pretty disappointed with the results of using a typical subfrustum plane-based culling scheme in a compute shader. The dirty secret[12] of using plane/volume tests for frustum culling is that they’re actually quite prone to false positives. The “standard” test can only exclude when your bounding volume is completely on the wrong side of one or more of your frustum planes. Unfortunately this means that it will fail for cases where the bounding volume intersects multiple planes at points outside the frustum. Even more unfortunate is that this particular case becomes more likely as your bounding volumes become large relative to your frustum, which is typically the case for testing lights against subfrusta. Spotlights are especially bad in this regard, since the wide portion will often intersect the planes of subfrusta that are actually just outside the narrower tip:

The top-right image shows a 3D perspective view, where you can clearly see that the cone in this situation doesn’t actually intersect with the frustum. However if you look at the orthographic views, you can also see that the cone manages to be on both sides of the frustum’s right plane. In practice you end up getting results like the following (tiles colored in green are “intersecting” with the spotlight):

As you can see, you can end up with a ton of false positives. In fact towards the right we’re basically just filling in the screen-aligned AABB of the bounding cone, which turns into a whole lot of wasted work for those pixels.

This drove me nuts on The Order, since our lighting artists liked to pack our levels full of large, shadow-casting spotlights. On the CPU side I would resort to expensive frustum-frustum tests for shadow-casting spotlights, which makes sense when you consider that you can save a lot of CPU and GPU time by culling out an entire shadow map. Unfortunately frustum/frustum wasn’t a realistic option for the tiled subfrustum culling that was performed on the GPU, and so for “important” lights I augmented the intersection test with a 2D mask generated via rasterization of the bounding cone. The results were quite a bit better:

For this demo, I decided to take what I had done on The Order and move it into 3D by binning into Z buckets as well. Binning in Z is a bit tricky, since you essentially want the equivalent of solid voxelization except in the projected space of your the frustum. Working in projected space rules out some of the common voxelization tricks, and so I ended up going with a simple 2-pass approach. The first pass renders the backfaces of a light’s bounding geometry, and marks the light’s bit (each cluster stores a bitfield of active lights) in the furthest Z bucket intersected by the current triangle within a given XY bucket. To conservatively estimate the furthest Z bucket, I use pixel shader derivatives to get the depth gradients, and then compute the maximum depth at the corner of the pixel. This generally works OK, but when the depth gradient is large it’s possible to extrapolate off the triangle. To minimize the damage in these cases, I compute a view-space AABB of the light on the CPU, and clamp the extrapolated depth to this AABB. After the backfacing pass, the frontfaces are then rendered. This time, the pixel shader computes the minimum depth bucket, and then walks forward along view ray until encountering the bucket that was marked by the backface pass. Here’s a visualization of the binning for a single light in my demo:

The pixels marked in red are the ones that belong to a cluster where the light is active. The triangle-shaped UI in the bottom right is a visualization that shows the active clusters for the XZ plane located at y=0. This helps you to see how well the clustering is working in Z, which is where the most errors occur in my implementation. Here’s another image showing the scene with all (20) lights enabled:

To do this robustly, you really want to use conservative rasterization[13]. I have this option in my demo, but unfortunately there are still no AMD GPU’s that support the feature. As a fallback, I also support forcing 4x or 8x MSAA modes to reduce the chance that the pixel shader won’t be executed for a covered tile. For The Order I used 8x MSAA, and it was never an issue in practice. It would really only be an issue if the light was *very* small on-screen, in which case you could probably just rasterize a bounding box instead. I should also point out that in my implementation the depth buffer is not used to accelerate the binning process, or to produce more optimally-distributed Z buckets. I implemented it this way so that there would not be additional performance differences when choosing whether or not to enable a Z prepass for the forward rendering path.

For rendering we’re going to need a G-Buffer in which we can store whatever information comes from our vertex data, as well as a material ID that we can use to look up the appropriate textures during the deferred pass. In terms of vertex information, we need both the tangent frame and the UV’s in order to sample textures and perform normal mapping. If we assume our tangent frame is an orthonormal basis, we can store it pretty compactly by using a quaternion. R16G16B16A16_SNORM is perfect for this use case, since it covers the expected range and provides great precision. However we can crunch it down[7] to 4 bytes per texel if we really want to keep it small (and we do!). The UV’s are stored in a 16-bit UNORM format, which gives us plenty of precision as long as we store frac(UV) to keep things between 0 and 1. In the same texture as the UV’s I also store screen-space depth gradients in the Z and W components. After that is an optional 64bpp texture for storing the screen-space UV gradients, which I’ll discuss in the next section. Finally, the G-Buffer also has an 8-bit texture for storing a material ID. The MSB of each textel is the handedness bit for the tangent frame, which is used to flip the direction of the bitangent once it’s reconstructed from a quaternion. This brings us to a total of 25 bytes per sample when storing UV gradients, and 17 when computing them instead.

One issue with deferred texturing is that you can’t rely on automatic mip selection via screen-space UV gradients when sampling the material textures. The gradients computed in a pixel shader will be wrong for quads that span multiple triangles, and in a compute shader they’re not available at all. The simplest way to solve this is to obtain the gradients in the pixel shader (using ddx/ddy) when rendering the scene to the G-Buffer, and then store those gradients in a texture. Unfortunately this means storing 4 separate values, which requires an additional 8 bytes of data per pixel when using 16-bit precision. It also doesn’t help you at all if you require positional gradients during your deferred pass, which can be useful for things like gobos, decals, filterable shadows, or receiver plane shadow bias factors. Storing the full gradients of world or view-space position would be silly, but fortunately we can store depth gradients and use those to reconstruct position gradients. Depth gradients only need 2 values instead of 6, and we can use a 16-bit fixed-point format instead of floating-point. They also have the nice property of being constant across the surface of a plane, which makes them useful for detecting triangle edges.

Both the UV and depth gradients can be computed in the deferred pass by sampling values from neighboring pixels, but in practice I’ve found it’s actually somewhat tricky to get right. You have to be careful not to “walk off the triangle”, otherwise you might end up reading UV’s from a totally unrelated mesh. Unless of course your triangle is so small that none of your neighbors came from the same triangle, in which case walking off might be your only option. You also have to take care around UV seams, including any you might have created yourself by using frac()!

In my implementation, I decided to always store depth gradients (where “depth” in this case is the post-projection z/w value stored in the depth buffer) while supporting an option to either store or compute UV gradients. Doing it this way allowed me to utilize the depth gradients when trying to find suitable neighbor pixels for computing UV gradients, and also ensured that I always had high-quality positional gradients. The material ID was also useful here: by checking that a neighboring pixel used the same material and was also had the same depth gradients, I could be relatively certain that the neighbor was either from the same triangle, or from a coplanar triangle. The shader code for this step can be found here, if you’re interested.

To assess the quality, let’s look at a visualization showing what the UV gradients look like when using ddx/ddy during forward rendering:

And here’s an image showing what computed UV gradients look like:

As you can see there’s quite a few places where my algorithm fails to detect neighbors (these pixels are dark), and other places where a neighboring pixel shouldn’t have been used at all (these pixels are bright). I’m sure the results could be improved with more time and cleverness, but you need to be careful that the amount of work done is enough to offset the cost of just writing out the gradients in the first place. On my GTX 970 it’s actually faster to store the gradients than it is to compute them when MSAA is disabled, but then it switches to the other way around once MSAA is turned on.

It’s worth noting that Sebastian’s presentation[5] mentions that they reconstruct UV gradients in their implementation (see page 45), although you can definitely see some artifacts around triangle edges in their comparison image. They also mention that they use “UV distance” to detect neighbors, which makes sense considering that they have unique UVs for their virtual texturing.

For non-MSAA rendering, the deferred pass is pretty straightforward. First, the G-Buffer attributes are read from their textures and used to compute the original pixel position and gradients. UV gradients are then read from the G-Buffer if present, or otherwise computed from neighboring pixels. The material ID from the G-Buffer is then used to index into a structured buffer that contains one element per material, where each element contains the descriptor indices for the material textures (albedo, normal, roughness, and metallic). These indices are used to index into a large descriptor table containing descriptors for every texture in the scene, so that the appropriate textures can be sampled using the pixel’s UV coordinates and derivatives.

Once all of the surface and material parameters are read, they are passed into a function that performs the actual shading. This function will loop over all lights that were binned into the pixel’s XYZ cluster, and compute the reflectance for each light source. This requires evaluating the surface BRDF, and also applying a visibility term that comes from the light’s 1024×1024 shadow map. The shadow map is sampled with a 7×7 PCF kernel that’s implemented as an optimized, unrolled loop that makes use of GatherCmp instructions. This helps to make the actual workload somewhat representative of what you might have in actual game, instead of biasing things too much towards ALU work. My scene also has a directional light from the sun, which uses 4 2048×2048 cascaded shadow maps for visibility. Finally, an ambient term is applied by means of a set SH coefficients representing the radiance from the skydome.

Just like any other kind of deferred rendering, MSAA needs special care. In particular we need to determine which pixels contain multiple unique subsamples, so that we can shade each subsample individually during the deferred pass. The key issue here is scheduling: the “edge” pixels that require subsample shading will typically be rather sparse in screen space, which makes dynamic branching a poor fit. The “old-school” way of scheduling is to create a stencil mask from the edge pixels, and then render in two pixel shader passes: one pass for per-pixel shading, another for per-sample shading that runs the shader at per-sample frequency. This can work better than a branch, but the hardware still may not be able to schedule it particularly well due to the sparseness of the edge pixels. It will also still need to make sure that the shader runs with 2×2 quads, which can result in a lot of needless helper executions.

The “newer” way to schedule edge pixels (and by “newer”, I mean 6 years old) is to use a compute shader that re-schedules threads within a thread group. Basically you detect edge pixels, append their location to a list in thread group shared memory, and then have the entire group iterate over that list once it finishes shading the first subsample. This effectively compacts the sparse list of edge pixels within a tile, allowing for coherent looping and branching. The downside is that you need to use shared memory, which can decrease your maximum occupancy if you use too much of it.

In my sample, I use the compute shader approach but with a new twist. Instead of performing the edge detection in the deferred pass, I run an earlier pass that checks for edges and builds a mask. This pass uses append buffers to build a list of 8×8 tiles that contain edge pixels, as well as a separate list containing tiles that have no edges at all. The append counts are used as indirect arguments for ExecuteIndirect, so that the edge and non-edge tiles can processed with two separate dispatches using two different shader permutations. This helps minimize overhead from shared memory usage, since the non-edge version of the compute shader doesn’t touch shared memory at all.

As for the actual edge detection, my sample supports two different approaches. The first approach only checks the material ID, and flags pixels that contain multiple material ID values. This is a very conservative approach, since it will only flag pixels where meshes with different materials overlap. The second approach is more aggressive, and additionally flags pixels with varying depth gradients. A varying depth gradient means that we have multiple triangles that are not coplanar, which means that we avoid tagging edges for the case of a tessellated flat plane. Here’s what the edge detection looks like using only the material ID:

…and with the the more aggressive depth gradient check:

One of the big advantages of traditional deferred shading is that you can modify your G-Buffer before computing your lighting. Lots of games take advantage of this by rendering deferred decals[14] into the scene for things like graffiti, blood splatter, debris, and posters. It’s much nicer than traditional forward-rendered decals, since you only need to light once per pixel even when accumulating multiple decals. The typical approach is to apply these decals in a deferred pass prior to lighting, where bounding volumes are rasterized representing the area of influence for each decal. In the pixel shader, the depth buffer is used to compute a surface position, which is then projected into 2D in order to compute a UV value. The projected UV can then be used to sample textures containing the decal’s surface properties, which are then written out to be blended into the G-Buffer. The end result is cheap decals that can work on complex scene geometry.

To use a typical deferred decal system you need two things: a depth buffer, and a G-Buffer. The depth buffer is needed for the projection part, and the G-Buffer is needed so that you can blend in your properties before shading. For forward rendering you can get a depth buffer through a depth prepass, but you’re out of luck for the G-Buffer part. We’re in a similar position with deferred texturing: we have depth, but our G-Buffer lacks the parameters that we’d typically want to modify from a decal. For The Order I worked around this by making a very specialized decal system. We would essentially accumulate values into a render target, with each channel of the texture corresponding to specific hard-coded decal type: bullet damage/cratering, scorching, and blood. The forward shaders would then read in those values, and would use them to modify material parameters before performing the lighting phase of the shader. It worked, but it obviously was super-specific to our game and wasn’t at all generic. Although we did get two nice things from this approach: materials could customized how they reacted to decals (materials could actually allocate a fully-composited layer for the damaged areas inside of bullet decals), and decals would properly accumulate on top of one another.

The good news is that deferred projectors is definitely not the only way to do decals. You can actually remove the need for a both a depth buffer *and* a G-Buffer by switching to the same clustered approach that you can use for lights, which is an idea I’ve been kicking around for a year or two now. You just need to build a per-cluster list of decals, iterate over the list in your shading pass, and apply the decal according to its projection. The catch is that our shading pass now needs access to the textures for every possible decal in the scene, and it needs to be able to access the appropriate texture based on a decal index. In D3D11 this would have meant using texture arrays or atlases, but with D3D12 we can potentially avoid these headaches thanks to power of bindless.

So does a clustered approach to decals actually work? Why, yes it does! I implemented them in my app, and got them to work with both the clustered forward and deferred texturing rendering paths. I even made a picker so that you can splat decals wherever you want in the scene, which is lots of fun! The decals themselves are just a set of sci-fi color and normal maps, and were generously provided for free by Nobiax from DeviantArt[15]. They end up looking like this:

In my demo the decal color and normal are just blended with the surface parameters based on the alpha channel from the color texture. However one advantage of applying decals in this way is that you’re not restricted to framebuffer blending operations. So for instance, you can accumulate surface normals by reorienting the decal normals[19] to make them relative to the normals of the underlying surface.

As for their performance, I unfortunately don’t have another decal implementation to compare against. However if I set up test case where most of the screen is covered in around 11 decals, the cost of the deferred pass (no MSAA) goes from 1.50ms to 2.00ms on my GTX 970. If I branch over the decals entirely (including reading from the cluster bitfield buffer), then the cost drops to 1.33ms. For the forward path it costs about 1.55ms with no decals, 2.20ms with decals, and 1.38 branching over the entire decal step.

To measure performance, I captured GPU timing information via timestamp queries. All measurements were taken from the default camera position, at 1920×1080 resolution. The test scene is the CryTek Sponza (we really need a new test scene!) with 20 hand-placed spotlights, each casting a 1024×1024 shadow map. There’s also a directional light for the sun that uses 4 2048×2048 shadow cascades. The scene uses normal, albedo, roughness, and metallic maps courtesy of Alexandre Pestana[16]. Here’s what the frame breakdown looks like for the deferred texturing path, with no MSAA:

Render Total: 7.40ms (7.47ms max) Cluster Update: 0.08ms (0.08ms max) Sun Shadow Map Rendering: 1.30ms (1.34ms max) Spot Light Shadow Map Rendering: 1.04ms (1.04ms max) G-Buffer Rendering: 0.67ms (0.67ms max) Deferred Rendering: 3.54ms (3.58ms max) Post Processing: 0.59ms (0.60ms max)

…and here’s what it looks like with 4x MSAA:

Render Total: 9.86ms (9.97ms max) Cluster Update: 0.08ms (0.08ms max) Sun Shadow Map Rendering: 1.30ms (1.34ms max) Spot Light Shadow Map Rendering: 1.04ms (1.05ms max) G-Buffer Rendering: 1.48ms (1.49ms max) Deferred Rendering: 4.64ms (4.67ms max) Post Processing: 0.56ms (0.57ms max) MSAA Mask: 0.16ms (0.16ms max) MSAA Resolve: 0.21ms (0.21ms max)

The first timing number is an average over the past 64 frames, while the second number is the maximum time from the past 64 frames.

The following chart shows how some of the various configurations scale with increasing MSAA levels:

So there’s a few observations that we can make from this data. First of all, the deferred path generally does very well compared to the forward path. The forward path without a depth prepass is basically useless, which means we’re clearly suffering from overdraw problems. I do actually sort by depth when rendering the main forward pass, but my test scene doesn’t have sufficient granularity to achieve good front-to-back ordering. Enabling the depth prepass improves things considerably, but not enough to match the deferred performance. Once we enable MSAA things go a little differently, as the forward paths scale up better compared to the deferred rendering path. At 4x forward and deferred are nearly tied, but that is only when the deferred path uses the conservative material ID check for detecting edge pixels. The conservative path skips many edges in the test scene, and so the AA quality is inferior to the forward path. Using the aggressive depth gradient edge test brings the quality more in line with the forward path, but it’s also quite a bit more expensive. However I would also expect the forward path to scale more poorly with scene complexity, since pixel shader efficiency will only decrease as the triangle count increases. One other interesting observation we can make is that writing out the UV gradients doesn’t seem to be an issue for our test scene when running on my 970. With no MSAA it’s actually slightly faster (7.47ms vs 7.50ms) to just write out the gradients instead of computing them, but that changes by the time we get to 4x MSAA (9.97ms vs. 9.83ms).

I should point out that all of these timings were captured while using a heavy-duty, “thrash your cache” 7×7 PCF kernel that’s implemented as an unrolled loop using GatherCmp. It undoubtedly causes a large increase in memory traffic, and it probably causes an increase in register pressure as well. I would imagine that this is especially bad for the forward path, since everything is done in one pixel shader. As an alternative, I also have an option to revert back to a simple 2×2 PCF kernel that only uses a single GatherCmp (you can toggle it yourself by changing the “UseGatherPCF_” flag at the top of Shading.hlsl). This path is probably a better representation for games that use filterable shadow maps, and possibly games that use aggressive PCF sampling optimizations. Here’s what the data looks like:

Some of these results are quite different compared to the 7×7 case. The forward paths do much better than they did previously, especially at 4xMSAA. The deferred paths scale the same as they did before, with the aggressive edge detection again causing longer frame times.

At home I only have access to a GTX 970, and so that’s what I used for almost all of my testing, profiling, and optimization. However I was able to verify that the demo works on an an AMD R9 280, as well as a GTX 980. I’ve posted a summary of all of the performance data in table form below (all timings in milliseconds):

If you’re interested in seeing the raw performance data that includes the per-frame breakdowns, I’ve uploaded them here: https://mynameismjp.files.wordpress.com/2016/03/deferred-texturing-timings.zip. You can also access the timing data in the app by clicking on the “Timings” button in the top-left corner of the screen.

So what kind of conclusions can we draw from this experiment? Personally I’m wary of extrapolating too much from such an artificial testing scenario with only one basis for comparison, but I think it’s safe to say that the bindless deferred texturing approach doesn’t have any sort of inherent performance issue that would render it useless for games. Or at least, it doesn’t on the limited hardware that I’ve used for testing. I was initially worried that the combination of SampleGrad and divergent descriptor indices would lead to to suboptimal performance, but in the end it didn’t seem to be an issue for my particular setup. Although to be completely fair, the texture sampling ended up being a relatively small part of the shading process. It’s certainly possible that the situation could change if the number of textures were to increase, or if the indexing became more divergent due to increased material density in screen-space. But at the same time those situations could also lead to reduced performance during a forward pass or during a traditional G-Buffer pass, so it might end up being a wash anyway.

At least in my demo, the biggest performance issue for the deferred path seems to be MSAA. This shouldn’t really be a surprise, considering how hard it is to implement affordable MSAA with any deferred renderer. My hope would be that a deferred texturing approach does a bit better than a traditional deferred renderer with a very large G-Buffer, but unfortunately I don’t have data to prove that out. Ultimately it probably doesn’t even matter, since hardly anyone even bothers with MSAA these days.

What about limitations? Multiple UV sets are at best a headache, and at worst a non-option. I think you’d have to really like your artists to store another UV set in your G-Buffer, and also eat the cost of storing or computing another set of UV gradients. Not having a full G-Buffer might be an issue for certain screen-space techniques, like SSAO or screen-space reflections. I’ve shown that it’s possible to have decals even without a full G-Buffer, but it’s more complex and possibly more expensive than traditional deferred decals. But on the upside, it’s really nice to have a cheap geometry pass that doesn’t need to sample any textures! It’s also very friendly to GPU-driven batching techniques, which was demonstrated in the RedLynx presentation from SIGGRAPH.

There is one final reason why you might not want to do this (or at least not right now): it was a real pain in the ass to get this demo working in DX12. Driver bugs, shader compilation bugs, long compile times, validation bugs, driver crashes, blue screens of death: if it was annoying, I ran into it. Dynamic indexing support seems to be rather immature in both the shader compiler and the drivers, so tread carefully. The final code has a few work-arounds implemented, but I’ve noted them with a comment.

If I were to spend more time on this, I think it would be interesting to explore some of the more extreme variants of deferred texturing. In particular there’s Intel’s paper on Visibility Buffers[17], where the authors completely forego storing any surface data except for triangle ID and instance ID. All surface data is instead reconstructed by transforming vertices during the deferred pass, and performing a ray-triangle intersection to compute barycentrics for interpolation. There’s also Tomasz Stachowiak’s presentation about deferred material rendering[18], where barycentric coordinates are stored in the G-Buffer instead of being reconstructed (which he does by tricking the driver into accepting his hand-written GCN assembly!!!). He has some neat ideas about using tile classification to execute different shader paths based on the material, which is something that could be integrated with the MSAA tile classification that’s performed in my demo. Finally in the RedLynx presentation they use a neat trick where they render with MSAA at half resolution, and then reconstruct full-resolution surface samples during the deferred pass. It makes the deferred shader more complicated, but reduces the pixel shader cost of rasterizing the G-Buffer. These are all things I would love to implement in my demo if I had infinite time, but at some point I actually need to sleep.

If you’ve made it this far, thank you for hanging in there! This one might be my longest so far! I considering making it a series of articles, but I didn’t want it to turn into one of those blog series where the author just never finishes it.

If you want to look at the code or run the sample, everything is available on GitHub:

https://github.com/TheRealMJP/DeferredTexturing

https://github.com/TheRealMJP/DeferredTexturing/releases (Precompiled Binaries)

A word of warning : the shader compilation times are *incredibly* long for the deferred shader. A lot of it is from unrolling the loops in the monster 7×7 PCF kernel, so if you want to run the code in debug (the zipped up binaries on GitHub include cached Release shaders) or change the shaders, I would strongly recommend setting “UseGatherPCF_” to

“0” on line 19 of Shading.hlsl.

If you find any bugs or have any suggestions, please let me know via comments, email, or twitter!

[1] Introduction To Resource Binding In D3D12 (Intel)

[2]OpenGL Bindless Extensions (Nvidia)

[3]GL_NV_bindless_texture (OpenGL Extension Registry)

[4]Deferred Texturing (Nathan Reed)

[5]GPU-Driven Rendering Pipelines (Haar and Aaltonen, SIGGRAPH 2015)

[6]Modern Textureless Deferred Rendering Techniques (Beyond3D)

[7]The Bitsquid Low-Level Animation System

[8]Practical Clustered Shading (Emil Persson)

[9]Deferred Rendering for Current and Future Rendering Pipelines (Intel, Andrew Lauritzen)

[10]Forward+: Bringing Deferred Lighting To The Next Level (AMD, Takahiro Harada)

[11]Forward Clustered Shading (Intel)

[12]Correct Frustum Culling (Íñigo Quílez)

[13]Conservative Rasterization (MSDN)

[14]Screen Space Decals in Warhammer 40K: Space Marine (Pope Kim)

[15]Free Sci-Fi Decals 2 (Nobiax, DeviantArt)

[16]Base color, Roughness and Metallic textures for Sponza (Alexandre Pestana)

[17]The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading (Burns and Hunt, JCGT)

[18]A Deferred Material Rendering System (Tomasz Stachowiak)

[19]Blending In Detail (Barré-Brisebois and Hill)

]]>

To help make things right, I’ve uploaded a new version of the sample app to GitHub that correctly clamps the exponential warp factor to 5.54. The releases page has a zip file containing code and a compiled executable. I’ve also updated the original blog post in question with new comparison images, as well as new commentary on those images. You’ll notice that my thoughts on MSM have become quite a bit more positive now that I’ve taken a second look, and performed a more fair analysis. In fact, it now seems that MSM16 is hands-down the better option if you’re looking to stay in an 64bpp footprint for the shadow map.

Finally, I’d like to apologize to Christoph and Reinhard for my unfair comparison. The work they’ve presented (including some new research being presented at I3D 2016) is very promising, and it’s great to see research that’s immediately usable and practical for games. I also need to call out that they release sample code alongside their work, which is extremely helpful for evaluating and understanding their techniques. Hopefully we see more great things from them in the future!

]]>

Historically, the sub-pixel location of MSAA sample points was totally out of your control as a programmer. Typically the hardware used rotated grid patterns such as this one, which were fixed for every pixel in your render target. For FEATURE_LEVEL_10_1, D3D added the concept of standard sample patterns that were required to be supported by the hardware. These were nice, in that you could specify the appropriate quality level and know exactly where the samples would be located. Without that, you either had to use shader intrinsics or resort to wackier methods for figuring out the sample positions. In theory it also gives you a very limited ability to control where sample points are located, but in practice the GPU vendors just made the standard multisample patterns the only patterns that they supported. So if you specified quality level 0, you got the same pattern as if you specified D3D11_STANDARD_MULTISAMPLE_PATTERN.

Fairly recently, we’ve finally seen the idea of programmable sample points getting some major attention. This has primarily came from Nvidia, who just added this functionality to their second-generation Maxwell architecture. You’ve probably seen some of the articles about their new “Multi-Frame Sampled Anti-Aliasing” (MFAA), which exploits the new hardware capabilities to jitter MSAA sample positions across alternating frames^{1}. The idea is that you can achieve a doubled effective sampling rate, as long as your resolve intelligently retrieves the subsamples from your previous frame. They also incorporate ideas from interleaved sampling, by varying their sample locations across a 2×2 quad instead of using the same pattern for all pixels. While the idea of yet-another control panel AA setting probably won’t do more than elicit some collective groans from the graphics community, it should at least get us thinking about what we might do if provided with full access to this new functionality for ourselves. And now that Nvidia has added an OpenGL extension as well as a corresponding D3D11 extension to their proprietary NVAPI, we can finally try out our own ideas (unless you work on consoles, in which case you may have been experimenting with them already!).

As for AMD, they’ve actually supported some form of programmable sample points since as far back as R600, at least if the command buffer documentation is accurate (look, for PA_SC_AA_SAMPLE_LOCS). Either way, Southern Islands certainly has support for varying sample locations across a 2×2 quad of pixels, which puts it on par with the functionality present in Maxwell 2.0. It’s a little strange to that AMD hasn’t done much to promote this functionality in the way that Nvidia has, considering they’ve had it for so long. Currently there’s no way to access this feature through D3D, but they have had an OpenGL extension for a long time now (thanks to Matthäus Chajdas and Graham Sellers for pointing out the extension!).

I’m not particularly knowledgeable about Intel GPU’s, and some quick searching didn’t return anything to indicate that they might be able to specify custom sample points. If anyone knows otherwise, then please let me know!

*Update: Andrew Lauritzen has informed me via Twitter that Intel GPU’s do in fact support custom sample locations! Thank you again, Andrew!*

Before we get into use cases, let’s quickly go over how to actually work with programmable sample points. Since I usually only use D3D when working on PC’s, I’m going to focus on the extensions for Maxwell GPU’s that were exposed in NVAPI. If you look in the NVAPI headers, you’ll find a function for creating an extended rasterizer state, with a corresponding description structure that has new members:

NvU32 ForcedSampleCount; bool ProgrammableSamplePositionsEnable; bool InterleavedSamplingEnable; NvU8 SampleCount; NvU8 SamplePositionsX[16]; NvU8 SamplePositionsY[16]; bool ConservativeRasterEnable; NVAPI_QUAD_FILLMODE QuadFillMode; bool PostZCoverageEnable; bool CoverageToColorEnable; NvU8 CoverageToColorRTIndex; NvU32 reserved[16];

There’s a few other goodies in there (like conservative rasterization, and Post-Z coverage!), but the members that we’re concerned with are **ProgrammableSamplePositionsEnable**, **InterleavedSamplingEnable**, **SampleCount**, **SamplePositionsX**, and **SamplePositionsY**. **ProgrammableSamplePositionsEnable** is self-explanatory: it enables the functionality. **SampleCount** is also pretty obvious: it’s the MSAA sample count that we’re using for rendering. **SamplePositionsX** and **SamplePositionsY** are pretty clear in terms of what they’re used for: they’re for specifying the X and Y coordinates of our MSAA sample points. What’s not clear at all is how the API interprets those values. My initial guess was that they should contain 8-bit fixed point numbers where (0,0) is the top left of the pixel, and (255,255) is the bottom right. This was close, but not quite: they’re actually 4-bit fixed-point values that correspond to points in the D3D Sample Coordinate System. If you’re not familiar with this particular coordinate system (and you’re probably not), there’s a nice diagram in the documentation for the D3D11_STANDARD_MULTISAMPLE_QUALITY_LEVELS enumeration^{2}:

There’s also a bit more information in the documentation for EvaluateAttributeSnapped:

Only the least significant 4 bits of the first two components (U, V) of the pixel offset are used. The conversion from the 4-bit fixed point to float is as follows (MSB...LSB), where the MSB is both a part of the fraction and determines the sign: 1000 = -0.5f (-8 / 16) 1001 = -0.4375f (-7 / 16) 1010 = -0.375f (-6 / 16) 1011 = -0.3125f (-5 / 16) 1100 = -0.25f (-4 / 16) 1101 = -0.1875f (-3 / 16) 1110 = -0.125f (-2 / 16) 1111 = -0.0625f (-1 / 16) 0000 = 0.0f ( 0 / 16) 0001 = 0.0625f ( 1 / 16) 0010 = 0.125f ( 2 / 16) 0011 = 0.1875f ( 3 / 16) 0100 = 0.25f ( 4 / 16) 0101 = 0.3125f ( 5 / 16) 0110 = 0.375f ( 6 / 16) 0111 = 0.4375f ( 7 / 16)

So basically we have 16 buckets work with in X and Y, with the ability to have sample points sit on the top or left edges, but not on the bottom or right edges. Now as for the NVAPI extension, it uses this same coordinate system, except that all values or positive. This means that there is no sign bit, as in the values passed to EvaluateAttributeSnapped. Instead you can picture the coordinate system as ranging from 0 to 15, with 8 (0.5) corresponding to the center of the pixel:

Personally I like this coordinate system better, since having the pixel center at 0.5 is consistent with the coordinate system used for pixel shader positions and UV coordinates. It also means that going from a [0, 1] float to the fixed point representation is pretty simple:

rsDesc.SamplePositionsX[i] = uint8(Clamp(SamplePositions1x[i].x * 16.0f, 0.0f, 15.0f));

Now, what about that **InterleavedSamplingEnable** member that’s also part of the struct? This is a somewhat poorly-named parameter that controls whether you’re specifying the same sample positions for all 4 pixels in a 2×2 quad, or whether you’re specifying separate sample positions for each quad pixel. The API works such that the first N points correspond to the top left pixel, the next N correspond to the top right pixel, then the bottom left, followed by the bottom right. This means that for 2xMSAA you need to specify 8 coordinates, and for 4xMSAA you need to specify 16 coordinates. For 8xMSAA we would need to specify 32 coordinates, but unfortunately NVAPI only lets us specify 16 values. In this case, the 16 values correspond to the sample points in the top left and bottom left pixels of the quad.

So now that we have programmable sample points, what exactly should we do with them? MFAA-style temporal jittering is a pretty obvious use case, but surely there’s other ways to make use of this functionality. One idea I’ve been kicking around has been to use MSAA as way to implement half-resolution rendering. Rendering particles at half resolution is really common in games, since blending many layers of particles can result in high costs from pixel shading and framebuffer operations. It can be an easy win in many cases, as long as the content you’re rendering won’t suffer too much from being shaded at half-rate and upsampled to full resolution. The main issue of course is that you can’t just naively upsample the result, since the half-resolution rasterization and depth testing will cause noticeably incorrect results around areas where the transparent geometry was occluded by opaque geometry. Here’s a comparison between half-resolution particle rendering with a naive bilinear upscale (left), and full-resolution rendering (right):

Notice how in some areas the low-resolution buffer has false occlusion, causing no particles to be visible for those pixels, while in other places the particles bleed onto the foreground geometry even though they should have been occluded. Perhaps the most obvious and common way to reduce these artifacts is to use some variation of a bilateral filter during the upscale, where the filtering weights are adjusted for pixels based on the difference between the low-resolution depth buffer and the full-resolution depth buffer. The idea behind doing this is that you’re going to have under or over-occlusion artifacts in places where the low-resolution depth is a poor representation of your full-resolution depth, and so you should reject low-resolution samples where the depth values are divergent. For The Order: 1886, we used a variant of this approach called nearest-depth upsampling, which was originally presented in an Nvidia SDK sample called OpacityMapping. This particular algorithm requires that you sample 4 low-resolution depth samples from a 2×2 bilinear footprint, and compare each one with the full-resolution depth value. If the depth values are all within a user-defined threshold, the low-resolution color target is bilinearly upsampled with no special weighting. However if any of the sample comparisons are greater than the treshold value, then the algorithm falls back to using only the low-resolution sample where the depth value was closest to the full-resolution depth value (hence the name “nearest-depth”). Overall the technique works pretty well, and in most cases it can avoid usual occlusion artifacts that are typical of low-resolution rendering. It can even be used in conjunction with MSAA, as long as you’re willing to run the filter for every subsample in the depth buffer. However it does have a few issues that we ran into over the course of development, and most of them stem from using depth buffer comparisons as a heuristic for occlusion changes in the transparent rendering. Before I list them, here’s an image that you can use to follow along (click for full resolution):

The first issue, illustrated in the top left corner, occurs when there are rapid changes in the depth buffer but no actual changes in the particle occlusion. In that image all of the particles are actually in front of the opaque geometry, and yet the nearest-depth algorithm still switches to point sampling due to depth buffer changes. It’s very noticeable in the pixels that overlap with the blue wall to the left. In this area there’s not even a discontinuity in the depth buffer, it’s just that the depth is changing quickly due to the oblique angle at which the wall is being viewed. The second issue, shown in the top right corner, is that the depth buffer contains no information about the actual edges of the triangles that were rasterized into the low-resolution render target. In this image I turned off the particle texture and disabled billboarding the quad towards the camera, and as a result you can clearly see the a blurry, half-resolution stair-step pattern at the quad edges. The third issue, shown in the bottom left, occurs when alpha-to-coverage is used to get sub-pixel transparency. In this image 2xMSAA is used, and the grid mesh (the grey parallel lines) is only being rendered to the first subsample, which was done by outputting a custom coverage mask from the pixel shader. This interacts poorly with the nearest-depth upsampling, since 1 of the 2 high-res samples won’t have the depth from the grid mesh. This causes blocky artifacts to occur for the second subsample. Finally, in the bottom right we have what was the most noticable issue for me personally: if the geometry is too small to be captured by the low-resolution depth buffer, then the upsample just fails completely. In practice this manifests as entire portions of the thin mesh “disappearing”, since they don’t correctly occlude the low-resolution transparents.

By now I hope you get the point: bilateral upsampling has issues for low resolution rendering. So how can we use MSAA and custom sampling points to do a better job? If you think about it, MSAA is actually perfect for what we want to accomplish: it causes the rasterizer and the depth test to run at a higher resolution than the render target(s), but the pixel shader still executes once per pixel. So if we use a half-resolution render target combined with 4x MSAA, then we should get exactly what we want: full-resolution coverage and depth testing, but half-resolution shading! This actually isn’t a new idea: Capcom tried a variant of this approach for the original Lost Planet on Xbox 360. For their implementation, which is outlined here in Japanese (English translation here), they were actually able to just alias their full-resolution render targets as a half-resolution 4xMSAA render targets while they were still in eDRAM. This worked for them, although at the time it was specific to the Xbox 360 hardware and also had no filtering applied to the low-resolution result (due to there being no explicit upscaling step). However we now have much more flexible GPU’s that give us generalized access to MSAA data, and my hunch was that we could improve upon this by using a more traditional “render in low-res and then upscale” approach. So I went ahead and implemented a sample program that demonstrates the technique, and so far it seems to actually work! I did have to work my way past a few issues, and so I’d like to go over the details in this article.

Here’s how the rendering steps were laid out every frame:

- Render Shadows
- Render Opaques (full res)
- Run pixel shader to convert from full-res depth to 4xMSAA depth
- Render particles at low-res with 4xMSAA, using ordered grid sample points
- Resolve low-res MSAA render target
- Upscale and composite low-res render target onto the full-res render target

The first two steps are completely normal, and don’t need explaining. For step #3, I used a full-screen pixel shader that only output SV_Depth, and executed at per-sample frequency. Its job was to look at the current subsample being shaded (provided by SV_SampleIndex), and use that to figure out which full-res depth pixel to sample from. The idea here is that 4 full-res depth samples need to be crammed into the 4 subsamples of the low-resolution depth buffer, so that all of the full-resolution depth values are fully represented in the low-res depth buffer:

If you look at the image, you’ll notice that the MSAA sample positions in the low-resolution render target are spaced out in a uniform grid, instead of the typical rotated grid pattern. This is where programmable sample points come in: by manually specifying the sample points, we can make sure that the low-resolution MSAA samples correspond to the exact same location as they would be in the full-resolution render target. This is important if you want your low-resolution triangle edges to look the same as if they would if rasterized at full resolution, and also ensures correct depth testing at intersections with opaque geometry.

For step #4, we now render our transparent geometry to the half-resolution 4xMSAA render target. This is fairly straightforward, and in my implementation I just render a bunch of lit, billboarded particles with alpha blending. This is also where I apply the NVAPI rasterizer state containing the custom sample points, since this step is where rasterization and depth testing occurs. In my sample app you can toggle the programmable sample points on and off to see the effect (or rather, you can if you’re running on a GPU that supports it), although you probably wouldn’t notice anything with the default rendering settings. The issues are most noticable with albedo maps and billboarding disabled, which lets you see the triangle edges very clearly. If you look at pair of the images below, the image to left shows what happens when rasterizing with the 4x rotated grid pattern. The image on the right shows what it looks like when using our custom sample points, which are in a sub-pixel ordered grid pattern.

For step #5, I use a custom resolve shader instead of the standard hardware resolve. I did this so that I can look at the 4 subsamples, and find cases where the subsamples are not all the same. When this happens, this means that there was a sub-pixel edge during rasterization, that was either caused by a transparent triangle edge or by the depth test failing. For these pixels I output a special sentinel value of -FP16Max, so that the upscale/composite step has knowledge of the sub-pixel edge.

In the last step, I run another full-screen pixel shader that samples the low-resolution render target and blends it on top of the full-resolution render target. The important part of this step is choosing how exactly to filter when upsampling. If we just use plain bilinear filtering, the results will be smooth but the transparents will incorrectly bleed over onto occluding opaque pixels. If we instead use the MSAA render target and just load the single subsample that corresponds to the pixel being shaded, the results will look “blocky” due to the point sampling. So we must choose between filtering and point sampling, just as we did with nearest-depth upsampling. In the resolve step we already detected sub-pixel edges, and so we can already use that to determine when to switch to point sampling. This gets us most of the way there, but not all of the way. For bilinear filtering we’re going to sample a 2×2 neighborhood from the low-resolution texture, but our resolve step only detected edges within a single pixel. This means that if there’s an edge between two adjacent low-resolution pixels, then we’ll get a bleeding artifact if we sample across that edge. To show you what I mean, here’s an image showing low-resolution MSAA on the left, and full-resolution rendering on the right:

To detect this particular case, I decided to use an alpha comparison as a heuristic for changes in occlusion:

float4 msaaResult = LowResTextureMSAA.Load(lowResPos, lowResSampleIdx); float4 filteredResult = LowResTexture.SampleLevel(LinearSampler, UV, 0.0f); float4 result = filteredResult; if(msaaResult.a - filteredResult.a > CompositeSubPixelThreshold) result = msaaResult;

This worked out pretty nicely, since it essentially let me roll the two tests into one: if the resolve sub-pixel test failed then it would output -FP16Max, which automatically results in a very large difference during the edge test in the composite step. Here’s an image where the left side shows low-resolution MSAA rendering with the improved edge detection, and the right side shows the “edge” pixels by coloring them bright green:

Before I show some results and comparisons, I’d like to touch on integration with MSAA for full-resolution rendering. With nearest-depth upsampling, MSAA is typically handled by running the upsample algorithm for every MSAA sub-sample. This is pretty trivial to implement by running your pixel shader at per-sample frequency, which automatically happens if you use SV_SampleIndex as an input to your pixel shader. This works well for most cases, although it still doesn’t help situations where sub-pixel features are completely missing in the low-resolution depth buffer. For the low-resolution MSAA technique that I’m proposing it’s also fairly trivial to integrate with 2x MSAA: you simply need to use 8xMSAA for your low-resolution rendering, and adjust your sample points so that you have 4 sets of 2x rotated grid patterns. Then during the upsample you can execute the pixel shader at per-sample frequency, and run the alpha comparison for each full-resolution subsample. Unfortunately we can’t handle 4x so easily, since 16xMSAA is not available on any hardware that I know of. I haven’t given it a lot of thought yet, but I think it should be doable with some quality loss by performing a bit of bilateral upsampling during the compositing step. On consoles it should also be possible to use EQAA with 16x coverage samples to get a little more information during the upsampling.

So now that we know that our MSAA trick can work for low-resolution rendering, how does it stack up against against techniques? In my sample app I also implemented full-resolution rendering as well as half-resolution with nearest-depth upsampling, and so we’ll use those as a basis for comparison. Full resolution is useful as a “ground truth” for quality comparisons, while nearest-depth upsampling is a good baseline for performance. So without further adieu, here are some links to comparison images:

Normal | Full Res | Low Res – Nearest Depth | Low Res – MSAA |

Sub-Pixel Geo | Full Res | Low Res – Nearest Depth | Low Res – MSAA |

Sub-Pixel Geo – 2x MSAA | Full Res | Low Res – Nearest Depth | Low Res – MSAA |

Triangle Edges | Full Res | Low Res – Nearest Depth | Low Res – MSAA |

High Depth Gradient | Full Res | Low Res – Nearest Depth | Low Res – MSAA |

Alpha To Coverage | Full Res | Low Res – Nearest Depth | Low Res – MSAA |

In my opinion, the quality is a consistent improvement over standard low-res rendering with a nearest-depth upscale. The low-resolution MSAA technique holds up in all of the failure cases that I mentioned earlier, and is still capable of providing filtered results in areas where the particles aren’t being occluded (unlike the Lost Planet approach, which essentially upsampled with point filtering).

To evaluate performance, I gathered some GPU timings for each rendering step on my GeForce GTX 970. The GPU time was measured using timestamp queries, and I recorded the maximum time interval from the past 64 frames. These were all gathered while rendering at 1920×1080 resolution (which means 960×540 for half-resolution rendering) with 16 * 1024 particles, and no MSAA:

Render Mode |
Depth Downscale |
Particle Rendering |
Resolve |
Composite |
Total |

Full Res | N/A | 4.59ms | N/A | N/A | 4.59ms |

Half Res – Nearest Depth | 0.04ms | 1.37ms | N/A | 0.31ms | 1.71ms |

Half Res – MSAA | 0.13ms | 1.48ms | 0.23ms | 0.08ms | 1.92ms |

On my hardware, the low-resolution MSAA technique holds up fairly well against the baseline of half-res with nearest depth upsampling. The resolve adds a bit of time, although that particular timestamp was rather erratic and so I’m not 100% certain of its accuracy. One nice thing is that the composite step is now cheaper, since the shader is quite a bit simpler. In the interest of making sure that all of my cards are on the table, one thing that I should mention is that the particles in this sample don’t sample the depth buffer in the pixel shader. This is common for implementing soft particles as well as for computing volumetric effects by marching along the view ray. In these cases the pixel shader performance would likely suffer if it had to point sample an MSAA depth buffer, and so it would probably make sense to prepare a 1x depth buffer during the downscale phase. This would add some additional cost to the low-resolution MSAA technique, and so I should probably consider adding it to the sample at some point.

These numbers were gathered with 2xMSAA used for full-resolution rendering:

Render Mode |
Depth Downscale |
Particle Rendering |
Resolve |
Composite |
Total |

Full Res | N/A | 4.69ms | N/A | N/A | 4.69ms |

Half Res – Nearest Depth | 0.06ms | 1.4ms | N/A | 0.38ms | 1.81ms |

Half Res – MSAA | 0.25ms | 2.11ms | 0.24ms | 0.17ms | 2.74ms |

Unfortunately the low-resolution MSAA technique doesn’t hold up quite as well in this case. The particle rendering gets quite a bit more expensive, and the cost of downscale as well as the composite both increase. It does seem odd for the rendering cost to increase so much, and I’m not sure that I can properly explain it. Perhaps there is a performance cliff when going from 4x to 8x MSAA on my particular GPU, or maybe there’s a bug in my implementation.

So what conclusions can we draw from all of this? At the very least, I would say that programmable sample points can certainly be useful, and that low-resolution MSAA is a potentially viable use case for the functionality. In hindsight it seems that my particular implementation isn’t necessarily the best way to show off the improvement that you get from using programmable sample points, since the alpha-blended particles don’t have any noticeable triangle edges by default. However it’s definitely relevant if you want to consider more general transparent geometry, or perhaps even rendering opaque geometry at half resolution. Either way, I would really like to see broader support for the functionality being exposed in PC API’s. Unfortunately it’s not part of any Direct3D 12 feature set, so we’re currently stuck with Nvidia’s D3D11 or OpenGL extensions. It would be really great to see it get added to a future D3D12 revision, and have it be supported across Nvidia, Intel, and AMD hardware. Until that happens I suspect it will mostly get used by console developers.

*Update: programmable sample point functionality was added to D3D12!*

If you would like to check out the code for yourself, I’ve posted it all on GitHub. If you’d just like to download and run the pre-compiled executable, I’ve also made a binary release available for download. If you look through the build instructions, you’ll notice that I didn’t include the NVAPI headers and libs in the repository. I made this choice because of the ominous-sounding license terms that are included in the header files, as well as the agreement that you need to accept before downloading it. This means that the code won’t compile by default unless you download NVAPI yourself, and place it in the appropriate directory. However if you just want to build it right away, there’s a preprocessor macro called “UseNVAPI_” at the top of LowResRendering.cpp that you can set to 0. If you do that then it won’t link against NVAPI, and the app will build without it. Of course if you do this, then you won’t be able to use the programmable sample point feature in the sample app. The feature will also be disabled if you aren’t running on a Maxwell 2.0 (or newer) Nvidia GPU, although everything else should work correctly on any hardware that supports FEATURE_LEVEL_11_0. Unfortunately I don’t have any non-Nvidia hardware to test on, so please let me know if you run into any issues.

[1] MSDN documentation for D3D11_STANDARD_MULTISAMPLE_QUALITY_LEVELS enumeration

[2] MSDN documentatoin for EvaluateAttributeSnapped

[3] Multi-Frame Samples Anti-Aliasing (MFAA)

[4] Interleaved Sampling

[5] GL_NV_sample_locations

[6] NVAPI

[7] Hybrid Reconstruction Anti-Aliasing

[8] GPU-Driven Rendering Pipelines

[9] AMD R600 3D Registers

[10] AMD Southern Islands 3D Registers

[11] AMD_sample_positions

[12] OpacityMapping White Paper

[13] High-Speed, Off-Screen Particles (GPU Gems 3)

[14] Lost Planet Tech Overview (english translation)

[15] Mixed Resolution Rendering

[16] Destiny: From Mythic Science Fiction to Rendering in Real-Time

1. Some of you might remember ATI’s old “Temporal AA” feature from their Catalyst Control Panel, which had a similar approach to MFAA in that it alternated sample patterns every frame. However unlike MFAA, it just relied on display persistance to combine samples from adjacent frames, instead of explicitly combining them in a shader. Return to text

2. If you’ve been working with D3D11 for a long time and you don’t recognize these diagrams, you’re not going crazy. The original version of the D3D11 docs were completely missing all of this information, which actually made the “standard” MSAA patterns somewhat useless. This is what the docs for D3D11_STANDARD_MULTISAMPLE_QUALITY_LEVELS look like in the old DirectX SDK documentation, and this is what EvaluateAttributeSnapped looked like. Return to text

]]>