To explain the basics of GPU thread synchronization, I’m going to walk through some examples using a completely fictional architecture: the MJP-3000. This made-up GPU is much simpler than real graphics hardware, which will (hopefully) make it easier to demonstrate high-level concepts without getting lost in the weeds. I also don’t want to give the impression that what I describe is *exactly* how real GPU’s do things, especially since many of those details aren’t publicly available. However the commands and behavior are still loosely based on real-world GPU’s, since otherwise the example wouldn’t be very useful!

With the prologue out of the way, let’s have a look at the amazing feat of engineering that is the MJP-3000:

The interesting parts here are the **command processor** on the left, and the **shader cores** in the middle. The command processor is the brains of the operation, and its job is to read commands (the green blocks) from a **command buffer** and coordinate the **shader cores**. The command processor reads commands one at a time from the command buffer, always in the exact order they’re submitted. When the command processor encounters the appropriate commands, it can add a group of threads to the **thread queue** immediately to the right of the command processor. The 16 shader cores pull threads from this queue in a first-in first-out (FIFO) scheme, after which the shader program for that thread is actually executed on the shader core. The cores are all identical, and completely independent of each other. This means that together they can simultaneously run 16 threads of the same shader program, or they can each run a thread from a completely different program. The shader cores can also read or write to arbitrary locations in device memory, which is on the right. Since the cores are independent and can all access memory, you can think of the array like a 16-core CPU. The major difference is that unlike a CPU they can’t tell themselves what to do, since they instead rely on the command processor to enqueue work for them. The **Current Cycle Count** in the top-left corner shows how many GPU cycles have executed for a particular example, which will help us keep track of how long it took for a particular example to complete execution.

For some reason, the designers of the MJP-3000 decided that their hardware could only run compute shaders. I suppose they felt that it would make things a lot simpler to only focus on the one shader stage that doesn’t rely on a complicated rasterization pipeline. Because of that, the command processor only has 1 command that actually kicks off threads to run on the shader cores: **DISPATCH**. The DISPATCH command specifies two things: how many threads need to run, and what shader program should be executed. When a DISPATCH command is encountered by the command processor, the threads from that dispatch are immediately placed in the thread queue, where they are grabbed by waiting shader cores. Since there are 16 cores, only 16 threads can be executing at any given time. Any threads that aren’t running on the shader cores stay in the thread queue until a core finishes a different thread and pulls the waiting thread out of the queue. The command processor can parse a DISPATCH command and enqueue its threads in 1 cycle, and the shader cores can dequeue a thread from the thread queue in 1 cycle.

Let’s now try a simple example where we dispatch 32 threads that each write something to a separate element of a buffer located in device memory. This dispatch will run shader program “A”, which takes 100 cycles to complete. So with 16 cores we would expect the whole dispatch to take around 200 cycles from start to end. Let’s go through the steps:

In the first step, the command processor encounters a DISPATCH command in the command buffer that requests 32 threads of program A. 1 cycle later, the command processor has enqueued the 32 requested threads in the thread queue. 1 cycle after that, the 16 shader cores have each picked up a thread of program A and have started executing them. Meanwhile, 16 threads are left in the queue. 100 cycles later the first batch of threads have completed, and their result is in memory. 1 cycle after that we’re at the 103 cycle count, and the second batch of 16 threads are pulled from the now-empty queue to start executing on the shader cores. Finally after a total of 203 cycles, the threads are all finished and their results are in memory.
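
The cycle counting in that walkthrough can be written down as a tiny formula. Here’s a rough sketch of it, assuming every thread in the dispatch runs for the same number of cycles; `DispatchCycles` and its parameters are names I made up for illustration, not anything the MJP-3000 actually exposes:

```cpp
#include <cassert>

// Cycle count for a single DISPATCH on the fictional MJP-3000:
// 1 cycle to enqueue the threads, then for every batch of up to
// coreCount threads, 1 cycle to dequeue plus the shader's run time.
int DispatchCycles(int threadCount, int shaderCycles, int coreCount = 16)
{
    assert(threadCount > 0 && shaderCycles > 0 && coreCount > 0);
    const int enqueueCycles = 1;                                   // command processor
    const int dequeueCycles = 1;                                   // paid once per batch
    const int batches = (threadCount + coreCount - 1) / coreCount; // round up
    return enqueueCycles + batches * (dequeueCycles + shaderCycles);
}
```

For the example above, `DispatchCycles(32, 100)` works out to 1 + 2 × 101 = 203 cycles.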

Now that we understand the basics of how this GPU works, let’s introduce some synchronization. As we already know from the previous article, synchronization implies that we’re going to somehow wait for all of our threads to hit a certain point before continuing. On a GPU where you’re constantly spinning up lots of new threads, this actually translates into something more like “wait for all of the threads from one group to finish before the threads from a second group start executing”. The common case where we’ll need to do this is where one dispatch needs to read the results that were written out by another dispatch. So for instance, say we run 24 threads of program A that collectively write their results to 24 elements of a buffer. After program A completes, we want to run 24 threads of program B, which will then read those 24 elements from the original output buffer and use them to compute new results written into a different buffer. If we were to try to do this by simply putting two DISPATCH commands in our command buffer, it would go something like this (program A is red, and program B is green):

Take a look at the third step: since dispatch A wasn’t a multiple of 16, the bottom 8 shader cores pulled from dispatch B to keep the cores from going idle. This caused the two dispatches to *overlap*, meaning that the end of dispatch A was still executing while the start of dispatch B was simultaneously executing. This is actually really bad for our case, because we now have a race condition: the threads of dispatch B might read from dispatch A’s output buffer before the threads of dispatch A have finished! Without knowing the specifics of which memory is accessed by programs A and B and how exactly the threads execute on the GPU, we have no choice but to insert a sync point between the two dispatches. This sync point will need to cause the command processor to wait until all threads of dispatch A run to completion before processing dispatch B. So let’s now introduce a FLUSH command that will do exactly that: when the command processor hits the flush, it waits for all shader cores to become idle before processing any further commands. The term “flush” is common for this sort of operation because it implies that it will “flush out” all pending work that’s waiting to execute. Let’s now try the same scenario again, this time using a flush to synchronize:

Notice how the command processor hits the FLUSH command, and then stops reading commands until dispatch A is completely finished and the thread queue is empty. This ensures that dispatch B never overlaps with dispatch A, which means it will be safe for any thread in dispatch B to access any result that was output by dispatch A. This is pretty much exactly what I was talking about in part 1 when I mentioned the need for barriers to prevent dependent Draw/Dispatch calls from overlapping. In fact, you can usually expect something like a FLUSH to happen on current GPU’s if you issued dispatch A, issued a barrier to transition the output buffer from a write state to a read state, and then issued dispatch B (it’s also similar to what you would get in response to issuing a D3D12_RESOURCE_UAV_BARRIER in D3D12, since that also implies waiting for all pending writes to finish). Hopefully this example makes it even more clear as to why a barrier is necessary for this sort of data dependency, and why results could be wrong if the barrier is omitted.
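
If you’d rather poke at this in code than in slideshows, here’s a minimal cycle-stepped sketch of the MJP-3000 with just DISPATCH and FLUSH. The `Command` struct and `Simulate` function are hypothetical names of my own, and the timing rules (1 cycle per command, 1 cycle to dequeue a thread, a FLUSH that retires on the cycle everything goes idle) are my reading of the walkthroughs above rather than a spec:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct Command
{
    enum Type { Dispatch, Flush } type;
    int threadCount;   // Dispatch only
    int shaderCycles;  // Dispatch only
};

// Returns the cycle on which all commands have retired and all threads are done.
int Simulate(const std::vector<Command>& commands, int coreCount = 16)
{
    std::deque<int> threadQueue;                 // each entry: cycles that thread runs
    std::vector<int> coreCyclesLeft(coreCount, 0); // 0 means the core is idle
    std::size_t next = 0;                        // next command to process

    for (int cycle = 1; ; ++cycle)
    {
        // Shader cores: keep executing, or spend this cycle dequeuing a thread.
        bool allIdle = true;
        for (int& cyclesLeft : coreCyclesLeft)
        {
            if (cyclesLeft > 0)
                --cyclesLeft;
            else if (!threadQueue.empty())
            {
                cyclesLeft = threadQueue.front();
                threadQueue.pop_front();
            }
            allIdle = allIdle && cyclesLeft == 0;
        }

        // Command processor: one command per cycle, strictly in order.
        if (next < commands.size())
        {
            const Command& cmd = commands[next];
            if (cmd.type == Command::Dispatch)
            {
                assert(cmd.threadCount > 0 && cmd.shaderCycles > 0);
                for (int i = 0; i < cmd.threadCount; ++i)
                    threadQueue.push_back(cmd.shaderCycles);
                ++next;
            }
            else if (allIdle && threadQueue.empty())
                ++next;   // FLUSH retires only once the queue and cores are empty
        }

        if (next == commands.size() && allIdle && threadQueue.empty())
            return cycle;
    }
}
```

Feeding it two 24-thread, 100-cycle dispatches back-to-back finishes on cycle 304, while putting a FLUSH between them pushes that out to cycle 406.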

It’s also very important to note that in this case the flush/barrier was not free from a performance point of view: our total processing time for both dispatches went from 304 cycles to 406 cycles. That’s about a third longer, with the extra ~100 cycles making up about 25% of the new total! The reason for this should be intuitive: with the flush between dispatches, we now have more idle shader cores during the tail end of both dispatches. In fact the share of time lost to the flush is exactly the same as the increase in the amount of idle time: without the flush we had about 0% idle cores over both dispatches, but *with* the flush our cores were idle about 25% of the time on average. This leads us to a simple conclusion: **the performance cost of a flush is directly tied to the decrease in utilization**. This ultimately means that the relative cost of introducing a thread synchronization barrier will vary depending on the number of threads, how long those threads execute, and how well the threads can fully saturate the available shader cores. We can confirm this with a simple thought experiment: imagine we ran dispatch A and dispatch B with 40 threads each instead of 24. The process would go almost exactly as it did before, except both dispatches would have another “phase” of 100 cycles where all 16 cores were in-use. Without our barrier the whole process would take about (40 + 40) / 16 × 100 = 500 cycles, while with the barrier it would take about 600 cycles. Therefore the relative cost of the barrier is about 16.7% of the total time, as opposed to the 25% cost when our thread counts were lower.
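
The thought experiment boils down to counting 100-cycle “phases”. Here’s a small sketch of that arithmetic, assuming both dispatches run the same shader and ignoring the 1-cycle enqueue/dequeue overheads; the function names are just ones I picked for illustration:

```cpp
#include <cassert>

// Number of batches needed to push `threads` through `cores` shader cores.
int Batches(int threads, int cores)
{
    assert(threads > 0 && cores > 0);
    return (threads + cores - 1) / cores;   // round up
}

// Two dispatches allowed to overlap: they behave like one big dispatch.
int CyclesOverlapped(int threadsA, int threadsB, int shaderCycles = 100, int cores = 16)
{
    return Batches(threadsA + threadsB, cores) * shaderCycles;
}

// Two dispatches separated by a flush: each one drains completely on its own.
int CyclesWithFlush(int threadsA, int threadsB, int shaderCycles = 100, int cores = 16)
{
    return (Batches(threadsA, cores) + Batches(threadsB, cores)) * shaderCycles;
}

// Fraction of the flushed run that is pure barrier overhead.
double FlushOverhead(int threadsA, int threadsB)
{
    const int flushed = CyclesWithFlush(threadsA, threadsB);
    return double(flushed - CyclesOverlapped(threadsA, threadsB)) / flushed;
}
```

With 24 threads per dispatch this gives 300 vs. 400 cycles (a quarter of the flushed run is overhead), and with 40 threads per dispatch it gives 500 vs. 600 cycles (one sixth).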

The other way to look at this is that **removing an unnecessary flush can result in a performance increase that’s relative to the amount of idle shader cores**. So if we’re syncing between two dispatches and they have no dependency between them, it’s most likely a good idea to remove the barrier and let them overlap with each other. For larger dispatches (in terms of thread count) that can saturate the GPU on their own there won’t be much benefit, since there won’t be much idle time to exploit. However for very small dispatches the difference can be significant. This time let’s imagine that dispatch A and B both have 8 threads each. With a flush in between the total time will be about 200 cycles, but with no flush they can perfectly overlap and finish in only 100 cycles! Or as another example, imagine we had another completely independent workload of 8 threads that we’ll call dispatch C (and color its threads blue). If we were to overlap it with dispatch A, we could essentially get it for “free” by utilizing the idle cores:

If you squint a bit and look at our GPU as if it were a CPU executing instructions instead of a GPU executing commands, then this kind of overlapping of work could be considered a kind of **Instruction Level Parallelism**. In this case the parallel operations are being explicitly specified in our command stream, making it somewhat similar to how VLIW architectures work.

In the previous example, we were able to basically hide dispatch C in the idle time left by the barrier between dispatch A and dispatch B. But what if dispatch C was very complicated, and took much longer than 100 cycles to complete? Let’s re-do the example, except this time dispatch C will execute for 400 cycles instead of 100:

Things didn’t go as well this time around. We still got a bit of overlap between A and C, but that was immediately followed by 300 cycles where half of our shader cores were idle. This happened because our FLUSH command ends up waiting for dispatch C to finish, since the flush works by waiting for the thread queue to become completely empty. We could re-arrange things so that dispatch C gets kicked off *after* the flush, but this is not ideal either because there would still be a bit of idle time during the tail end of dispatch A, and also a long period of half-idle cores when dispatch C is running.

Lucky for us, there’s a new driver update for the MJP-3000 that should be able to help us out. MJP xPerience 3D Nocturnal Driver v5.444.198754 adds support for two new commands that can be parsed and executed by the command processor. The first one is called SIGNAL_POST_SHADER, and the other is called WAIT_SIGNAL. The first command is pretty fancy: it tells the command processor to write a signal value to an address in memory (often called a *fence* or *label*) once all shaders have completed. The cool part is that it’s a “deferred” write: the write is actually performed by the thread queue once it determines that all previously-queued threads have run to completion. This allows the command processor to move on to other commands while previous dispatches are still executing. The other command, WAIT_SIGNAL, tells the command processor to stall and wait for a memory address to be signaled. This can be used in conjunction with SIGNAL_POST_SHADER to wait for a particular dispatch to complete, but with the added bonus that the command processor can kick off more work in between those steps. To help visualize this process, let’s update the GPU diagram with a new component:

Once a SIGNAL_POST_SHADER command is executed, any pending labels will show up as a colored block in a new area under the thread queue. The number on the block shows the current status of the label: “0” means it hasn’t been signaled yet, and “1” means that it’s in the signaled state and any dependent waits will be released.

Let’s now try out our new commands with the previous example:

Very nice! By removing the long stall on dispatch C, we’ve effectively eliminated all of the idle time and kept the GPU busy for the entire duration of the 3 dispatches. As a result we’ve increased our overall throughput: previously the process took about 700 cycles, but now it’s down to about 500 cycles. Unfortunately this is still more time than it took to complete when we only had dispatch A and B to worry about, which means the *latency* for the A->B job increased by about 100 cycles. But at the same time the latency for dispatch C is lower than it would be if it weren’t overlapped, since it would otherwise need to wait for either A or B to finish before it could start processing.
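
Here’s a rough sketch of how SIGNAL_POST_SHADER and WAIT_SIGNAL could be modeled in a cycle-stepped simulation. Everything in it is an assumption on my part: the `Command`, `Thread` and `Simulate` names, the single label, and the timing rules (1 cycle per command, 1 cycle to dequeue a thread, a wait that releases on the cycle the label flips to 1):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct Command
{
    enum Type { Dispatch, SignalPostShader, WaitSignal } type;
    int threadCount;   // Dispatch only
    int shaderCycles;  // Dispatch only
};

struct Thread { int id; int cyclesLeft; };

// Toy MJP-3000 with a single label. Returns the cycle on which everything has drained.
int Simulate(const std::vector<Command>& commands, int coreCount = 16)
{
    std::deque<Thread> threadQueue;
    std::vector<Thread> cores(coreCount, Thread{-1, 0}); // cyclesLeft == 0 means idle
    int enqueued = 0, completed = 0;
    int labelThreshold = 0;   // threads with id < threshold were queued before the signal
    int labelPending = 0;     // how many of those threads are still unfinished
    bool labelArmed = false;  // has SIGNAL_POST_SHADER been processed yet?
    std::size_t next = 0;

    for (int cycle = 1; ; ++cycle)
    {
        bool allIdle = true;
        for (Thread& core : cores)
        {
            if (core.cyclesLeft > 0)
            {
                if (--core.cyclesLeft == 0)      // thread finished on this cycle
                {
                    ++completed;
                    if (core.id < labelThreshold)
                        --labelPending;          // the deferred write gets closer
                }
            }
            else if (!threadQueue.empty())
            {
                core = threadQueue.front();      // dequeuing costs this cycle
                threadQueue.pop_front();
            }
            allIdle = allIdle && core.cyclesLeft == 0;
        }

        if (next < commands.size())
        {
            const Command& cmd = commands[next];
            if (cmd.type == Command::Dispatch)
            {
                assert(cmd.threadCount > 0 && cmd.shaderCycles > 0);
                for (int i = 0; i < cmd.threadCount; ++i)
                    threadQueue.push_back(Thread{enqueued++, cmd.shaderCycles});
                ++next;
            }
            else if (cmd.type == Command::SignalPostShader)
            {
                labelThreshold = enqueued;
                labelPending = enqueued - completed;
                labelArmed = true;
                ++next;                          // no stall: move straight on
            }
            else if (labelArmed && labelPending == 0)
                ++next;                          // WAIT_SIGNAL releases once the label is 1
        }

        if (next == commands.size() && allIdle && threadQueue.empty())
            return cycle;
    }
}
```

In this model, issuing A (24 threads, 100 cycles), the signal, C (8 threads, 400 cycles), the wait, and then B (24 threads, 100 cycles) finishes on cycle 507. Moving C in front of the signal makes the signal/wait pair act like the old FLUSH, and the same work takes until cycle 706.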

If the MJP-3000 were being programmed via D3D12 or Vulkan, then this signal/wait behavior is probably what you would hope to see when issuing a split barrier (vkCmdSetEvent + vkCmdWaitEvents in Vulkan-ese). Split barriers let you effectively specify 2 different points in a resource’s lifetime: the point where you’re done using it in its current state (read, write, etc.), and the point where you actually need the resource to be in its new state. By doing this and issuing some work between the begin and end of the barrier, the driver (potentially) has enough information to know that it can overlap the in-between work while it’s waiting for the pre-barrier work to finish. So for the example I outlined above, the D3D12 commands might go something like this:

- Issue Dispatch A which writes to Buffer A
- Begin Transition Buffer A from writable -> readable
- Issue Dispatch C which writes to Buffer C
- End Transition Buffer A from writable -> readable
- Issue Dispatch B which reads Buffer A and writes to Buffer B

For real-world GPU’s the benefits of split barriers can possibly be even greater than the sync point removal that I demonstrated with my imaginary GPU. As I mentioned in part 1, barriers on GPU’s are also responsible for handling things like cache flushes and decompression steps. These things increase the relative cost of a barrier past the simple “idle shader core tax” that we saw on our imaginary GPU, which gives us even more incentive to try to overlap the barrier with some non-dependent work. However, the ability to overlap barrier operations with Draws and Dispatches is totally dependent on the specifics of the GPU architecture.

Before we wrap up, I’d like to point out that our GPU is still rather limited in terms of how it can overlap different dispatches, even with the new label/wait functionality that we just added. You can only do so much when the command processor is completely tied up every time that you need to wait for a previous dispatch to finish, which really starts to hurt you if you have more complex dependency chains. Later on in part 3 we’ll revisit this topic, and look at how some hardware changes can help us get around these limitations.

In part 3, I’m going to discuss why explicit API’s expose multiple queues for submitting command buffers. I’ll also show how multiple queues could work on the fictional GPU architecture we’ve been using as an example, and discuss some implementations in real-world GPU’s.

So what gives? Why the heck do we even need barriers in the first place, and why do things go so wrong if we misuse them? If you’ve done significant console programming or are already familiar with the lower-level details of modern GPU’s, then you probably know the answer to these questions, in which case this article isn’t really for you. But if you don’t have the benefit of that experience, then I’m going to do my best to give you a better understanding of what’s going on behind the scenes when you issue a barrier.

Like almost everything else in programming and computers, the term “barrier” is already a bit overloaded. In some contexts, a “barrier” is a synchronization point where a bunch of threads all have to stop once they reach a particular point in the code that they’re running. In this case you can think of the barrier as an immovable wall: the threads are all running, but stop dead in their tracks when they “hit” the barrier:

    void ThreadFunction()
    {
        DoStuff();

        // Wait for all threads to hit the barrier
        barrier.Wait();

        // We now know that all threads called DoStuff()
    }

This sort of thing is helpful when you want to know when a bunch of threads have all finished executing their tasks (the “join” in the fork-join model), or when you have threads that need to read each other’s results. As a programmer you can implement a thread barrier by “spinning” (looping until a condition is met) on a variable updated via atomic operations, or by using semaphores and condition variables when you want your threads to go to sleep while they’re waiting.
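
To make the “spinning” option a bit more concrete, here’s a minimal sketch of a reusable barrier built on C++11 atomics. `SpinBarrier` and `CompletedPhases` are names I made up for this example, and a real implementation would want to think harder about back-off and power usage:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Minimal reusable spinning barrier (a sketch, not production code).
class SpinBarrier
{
public:
    explicit SpinBarrier(int threadCount)
        : threadCount_(threadCount), arrived_(0), generation_(0)
    {
        assert(threadCount > 0);
    }

    void Wait()
    {
        const unsigned generation = generation_.load(std::memory_order_acquire);
        if (arrived_.fetch_add(1, std::memory_order_acq_rel) + 1 == threadCount_)
        {
            // Last thread to arrive: reset the counter, then release everyone.
            arrived_.store(0, std::memory_order_relaxed);
            generation_.fetch_add(1, std::memory_order_release);
        }
        else
        {
            // Spin until the last thread bumps the generation.
            while (generation_.load(std::memory_order_acquire) == generation)
                std::this_thread::yield();
        }
    }

    // How many times all threads have met at the barrier so far.
    unsigned CompletedPhases() const { return generation_.load(std::memory_order_acquire); }

private:
    const int threadCount_;
    std::atomic<int> arrived_;
    std::atomic<unsigned> generation_;
};
```

Each thread running `ThreadFunction` would share one `SpinBarrier` constructed with the number of participating threads. The generation counter is what makes the barrier safe to reuse for a second `Wait()` without a fast thread slipping through early.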

In other contexts the term “barrier” will refer to a memory barrier (also known as a “fence”), particularly if you’ve somehow fallen down the rabbit hole of lock-free programming. In these scenarios you’re usually dealing with reordering of memory operations that’s done by the compiler and/or the processor itself, which can really throw a wrench in the works when you have multiple processors communicating through shared memory. Memory barriers help you out by letting you force memory operations to complete either before or after the barrier, effectively keeping them on one “side” of the fence. In C++ you can insert these into your code using platform-specific macros like MemoryBarrier in the Windows API, or through the cross-platform std::atomic_thread_fence. A common use case might look like this:

    // DataIsReady and Data are written to
    // by a different thread
    if(DataIsReady)
    {
        // Make sure that reading Data happens
        // *after* reading from DataIsReady
        MemoryBarrier();
        DoSomething(Data);
    }
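
For comparison, here’s a sketch of the same pattern using the cross-platform `std::atomic_thread_fence`, this time showing both the writing and the reading side. The `Publish` and `TryConsume` helpers are hypothetical names of mine, and I’m assuming the data only gets published once:

```cpp
#include <atomic>
#include <cassert>

int Data = 0;                            // plain, non-atomic payload
std::atomic<bool> DataIsReady{false};    // flag shared between the two threads

// Producer side: write the payload, then raise the flag.
void Publish(int value)
{
    assert(!DataIsReady.load(std::memory_order_relaxed)); // this sketch publishes only once
    Data = value;
    // Keep the write to Data from being reordered after the flag write.
    std::atomic_thread_fence(std::memory_order_release);
    DataIsReady.store(true, std::memory_order_relaxed);
}

// Consumer side: only touch Data once the flag has been observed.
bool TryConsume(int& out)
{
    if (!DataIsReady.load(std::memory_order_relaxed))
        return false;
    // Keep the read of Data from being reordered before the flag read.
    std::atomic_thread_fence(std::memory_order_acquire);
    out = Data;
    return true;
}
```

The release fence on the producer pairs with the acquire fence on the consumer, which is what guarantees that a consumer seeing the flag also sees the value written to `Data`.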

These two meanings of the term “barrier” have different specifics, but they also have something in common: they’re mostly used when one thing is producing a result and another thing needs to read that result. Another way of saying that is that one task has a **dependency** on a different task. Dependencies happen all of the time when writing code: you might have one line of code that adds two numbers to compute an offset, and the very next line of code will use that offset to read from an array. However you often don’t need to really be aware of this, because the compiler can **track** those dependencies for you and make sure that it produces code to give you the right results. Manually inserting barriers usually only becomes necessary when the compiler can’t see, at compile-time, how the data is going to be written to and read from. This commonly happens due to multiple threads accessing the same data, but it can also happen in other weird cases (like when another piece of hardware writes to memory). Either way, using the appropriate barrier will make sure that the results will be **visible** to dependent steps, ensuring that they don’t end up reading the wrong data.

Since compilers can’t handle dependencies for you automatically when you’re doing multithreaded CPU programming, you’ll often spend a lot of time figuring out how to express and resolve dependencies between your multithreaded tasks. In these situations it’s common to build a dependency graph indicating which tasks depend on the results of other tasks. That graph can help you decide what order to execute your tasks, and when you need to stick a sync point (barrier) between two tasks (or groups of tasks) so that the earlier task completely finishes before the second task starts executing. You’ll often see these graphs drawn out as a tree-like diagram, like in this easy-to-understand example from Intel’s documentation for Threading Building Blocks:

Even if you’ve never done task-oriented multithreaded programming, this diagram makes the concept of dependencies pretty clear: you can’t put the peanut butter on the bread before you’ve gotten the bread! At a high level this determines the order of your tasks (bread before peanut butter), but it also subtly implies something that would be obvious to you if you were doing this in real life: you can’t start applying your peanut butter until you’ve gotten out your slices of bread from the cabinet. If you were doing this in real life by yourself, you wouldn’t even think about this. There’s only 1 of you, so you would just go through each step one at a time. But we were originally discussing this in the context of multithreading, which means that we’re talking about trying to run different tasks on different cores in parallel. Without properly waiting you could end up with the peanut butter task running at the same time as the bread step, and that’s obviously not good!
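
One simple way to turn a dependency graph like this into an execution plan is to group tasks into “waves”, with a sync point between each wave. The sketch below is my own illustration (it has nothing to do with how TBB works internally), and the `Task` struct and `BuildWaves` function are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Task
{
    std::string name;
    std::vector<int> dependsOn;   // indices of tasks that must finish first
};

// Groups tasks into waves: every task in a wave only depends on tasks from
// earlier waves, so a wave can run in parallel with a sync point after it.
std::vector<std::vector<int>> BuildWaves(const std::vector<Task>& tasks)
{
    std::vector<int> waveOf(tasks.size(), -1);   // -1 means not scheduled yet
    std::vector<std::vector<int>> waves;
    std::size_t scheduled = 0;

    while (scheduled < tasks.size())
    {
        std::vector<int> wave;
        for (std::size_t i = 0; i < tasks.size(); ++i)
        {
            if (waveOf[i] != -1)
                continue;
            bool ready = true;
            for (int dep : tasks[i].dependsOn)
                ready = ready && waveOf[dep] != -1;   // dependency is in an earlier wave
            if (ready)
                wave.push_back(int(i));
        }

        // An empty wave means the graph has a cycle and can never be scheduled.
        assert(!wave.empty());
        for (int i : wave)
            waveOf[i] = int(waves.size());
        scheduled += wave.size();
        waves.push_back(wave);
    }
    return waves;
}
```

For a sandwich graph where the two “spread” tasks depend on the three “get” tasks, and the final assembly depends on both spreads, this produces three waves: fetch everything, spread everything, then assemble.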

To avoid these kinds of issues, task schedulers like TBB give you mechanisms that force a task (or group of tasks) to wait until a prior task (or group of tasks) completely finishes executing. Like I mentioned earlier, you could call this mechanism a barrier, or a sync point:

This sort of thing is pretty easy to implement on modern PC CPU’s, since you have a lot of flexibility as well as some powerful tools at your disposal: atomic operations, synchronization primitives, OS-supplied condition variables, and so on.

So we’ve covered the basics of what a barrier is, but I still haven’t explained why we have them in API’s designed for talking to the GPU. After all, issuing Draw and Dispatch calls isn’t really the same as scheduling a bunch of parallel tasks to execute on separate cores, right? I mean, if you look at a D3D11 program’s sequence of API calls it looks pretty damn serial:

If you’re used to dealing with GPU’s through an API like this, you’d be forgiven for thinking that the GPU just goes through each command one at a time, in the order you submit them. And while this may have been true a long time ago, the reality is actually quite a bit more complicated on modern GPU’s. To show you what I’m talking about, let’s take a look at what my Deferred Texturing sample looks like when I take a capture with AMD’s awesome profiling tool, Radeon GPU Profiler:

This snippet is showing just a portion of a frame, specifically the part where all of the scene geometry is rasterized into the G-Buffer. The left-hand side shows the draw call, while the blue bars to the right show when the draw call actually starts and stops executing. And what do you know, there’s a whole lot of overlapping going on there! You can see the same thing shown a bit differently if you fire up PIX for Windows:

This is a snippet from PIX’s timeline view, which is also showing the execution time for the same sequence of draw calls (this time captured on my GTX 1070, whereas the earlier RGP capture was done on my significantly less beefy RX 460). You can see the same pattern: the draws start executing roughly in submission order, but they overlap all over the place. In some cases, draws will start and finish before an earlier draw completes!

If you know even a little bit about GPU’s this shouldn’t be *completely* surprising. After all, everyone knows that a GPU is mostly made up of hundreds or thousands of what the IHV’s like to call “**shader cores**”, and those shader cores all work together to solve “embarrassingly parallel” problems. These days the bulk of work done to process a draw (and pretty much all of the work done to process a dispatch) is performed on these guys, which run the shader programs compiled from our HLSL/GLSL/MetalSL code. Surely it makes sense to have the shader cores process the several thousand vertices from a single draw call in parallel, and to do the same with the thousands or millions of pixels that result from rasterizing the triangles. But does it really make sense to let multiple draw calls or dispatches bleed over into one another so that your actual high-level commands are also executing in parallel?

The correct answer is, “yes, absolutely!” In fact, hardware designers have put in quite a bit of effort over the years to make sure that their GPU’s can do this even if there are some state changes in between the draws. Desktop GPU’s have even engineered their ROP’s (the units that are responsible for taking the output of a pixel shader and actually writing it to memory) so that they can resolve blending operations even if the pixel shaders didn’t finish in draw order! Doing it this way helps avoid having idle shader cores, which in turn gives you better throughput. Don’t worry if this doesn’t completely make sense right now, as I’m going to walk through some examples in a future post that explain why this is the case. But for now, just take my word for it that allowing draws and dispatches to overlap generally leads to higher throughput.

If a GPU’s threads from a draw/dispatch can overlap with each other, that means that the GPU needs a way to *prevent* that from happening in cases where there’s a data dependency between two tasks. When this happens, it makes sense to do what we do on a CPU, and insert something roughly similar to a thread barrier in order to let us know when a group of threads have all finished their work. In practice GPU’s tend to do this in a very coarse manner, such as waiting for all outstanding compute shader threads to finish before starting up the next dispatch. This can be called a “flush”, or a “wait for idle”, since the GPU will wait for all threads to “drain” before moving on. But we’ll get into that in more detail in the next article.

Hopefully by now it’s clear that there’s at least one reason for barriers on GPU’s: to keep shader threads from overlapping when there’s a data dependency. This is really the same scenario with the peanut butter and the bread that we laid out earlier when talking about CPU threads, except with the core count cranked up to the thousands. But unfortunately things get a bit more complicated when we’re talking about GPU’s as opposed to CPU’s.

Let’s say that you start up a group of threads running on a PC CPU that write a bunch of data to individual buffers, insert a thread barrier to wait until those threads are finished, and then kick off a second group of threads that reads the output data of the first group of threads. As long as you make sure that you have the right memory/compiler barriers in place to ensure that the second task’s read operations don’t happen too early (and often you get this by default from using OS synchronization primitives or atomic operations), you don’t need to care about getting correct results in the presence of a cache hierarchy. This is because the caches on an x86 core (usually each core has its own individual L1 cache, with a shared L2 and possibly L3 cache) are coherent, which means that they stay “up to date” with each other as they access different memory addresses. The details of how they achieve this miraculous feat are quite complicated, but as programmers we’re usually allowed to remain blissfully ignorant of the internal gymnastics being performed by the hardware.

Things are not so simple for the poor folks that write drivers for a GPU. For various reasons, some of them dating back to their legacy as devices that weren’t used for general-purpose computing like they are now, GPU’s tend to have a bunch of caches that aren’t always organized into a strict hierarchy. The details aren’t always public, but AMD tends to have quite a bit of public information available about their GPU’s that we can learn from. Here’s a diagram from slide 50 of this presentation:

This looks quite different from a CPU’s cache hierarchy! We have two things with L1 that go through L2 the way that you expect they would, but then there’s color and depth caches that bypass L2 and go right to memory! And there’s a DMA engine that doesn’t go through cache at all! The diagram here is also a bit misleading, since in reality there’s an L1 texture cache on every compute unit (CU), and there can be dozens of compute units on the larger video cards. There’s also multiple instruction cache and scalar data L1’s, with one of these shared between up to 4 CU’s. There’s lots of details in this GCN whitepaper, which explains how the various caches work, and also how direct memory writes from shaders (AKA writes to UAV’s) go through their local L1 and *eventually* propagate to L2.

As a consequence of having all of these caches without a strict hierarchy, the caches can sometimes get out of sync with each other. As the GCN whitepaper describes, the L1 caches on the shader units aren’t coherent with each other until the writes reach L2. This means that if one dispatch writes to a buffer and another reads from it, the CU L1 cache may need to be **flushed** in between those dispatches to make sure that all of the writes at least made it to L2 (a cache flush refers to the operation of taking modified/dirty cache lines and writing them out to the next cache level, or actual memory if applied to the last cache level). And as slide 52 describes, it’s even worse when a texture goes from being used as a render target to being used as a readable texture. For that case the writes to the render target could be sitting in the color buffer L1 cache that’s attached to the ROP’s, which means that cache has to be flushed in addition to flushing the other L1’s and L2 cache (note that AMD’s new Vega architecture has a more unified cache hierarchy where the ROP’s are also clients of the L2).

One cool thing about AMD hardware is that their tools actually show you when this happens! Here’s a snippet from an RGP capture showing the caches being flushed (and shader threads being synchronized!) on my RX 460 after a large dispatch finishes writing to a texture:

Now the point of this isn’t just to explain how caches are hard and complicated, but to illustrate another way in which GPU’s require barriers. Ensuring that your threads don’t overlap isn’t sufficient for resolving read-after-write dependencies when you also have multiple caches that can have stale data in them. You’ve also got to invalidate or flush those caches to make the results visible to subsequent tasks that need to read the data.
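
To see why both a flush *and* an invalidate can be needed, here’s a deliberately tiny toy model of two non-coherent L1 caches sitting in front of a shared L2. It’s purely illustrative: the `ToyL1` type is something I made up, and real GPU caches (including the GCN ones described above) are organized quite differently:

```cpp
#include <cassert>
#include <map>

using Address = int;
using Memory = std::map<Address, int>;

// Toy write-back L1 in front of a shared L2. Not a model of any real hardware.
struct ToyL1
{
    Memory lines;   // locally cached copies (may be stale)
    Memory dirty;   // modified lines that haven't been written to L2 yet
    Memory* l2;     // the shared next level

    explicit ToyL1(Memory& sharedL2) : l2(&sharedL2) {}

    void Write(Address address, int value)
    {
        lines[address] = value;
        dirty[address] = value;
    }

    int Read(Address address)
    {
        auto hit = lines.find(address);
        if (hit != lines.end())
            return hit->second;            // cache hit, even if L2 has newer data
        const int value = (*l2)[address];  // cache miss: fetch from L2
        lines[address] = value;
        return value;
    }

    // Flush: write dirty lines out to the next level.
    void Flush()
    {
        for (const auto& line : dirty)
            (*l2)[line.first] = line.second;
        dirty.clear();
    }

    // Invalidate: drop local copies so the next read goes back to L2.
    void Invalidate()
    {
        assert(dirty.empty());   // this toy requires a flush before invalidating
        lines.clear();
    }
};
```

If a “writer” cache writes a value, a “reader” cache that already holds that address keeps returning the old value until the writer is flushed and the reader is invalidated. That combination is roughly the kind of work that ends up hidden behind a barrier.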

GPU’s have gotten more and more focused on compute as time goes on, but they’re still heavily optimized for rasterizing triangles into a grid of pixels. Doing this job means that the ROP’s can end up having to touch a *ton* of memory every frame. Games now have to render at up to 4k resolutions, which works out to 8294400 pixels if you write every one with no overdraw. Multiply that by 8 bytes per-pixel for 16-bit floating point texture formats, or maybe up to 30 or 40 bytes per-pixel for fat G-Buffers, and you’re looking at a lot of bandwidth consumption just to touch all of that memory once (and typically many texels will be touched more than once)! It only gets worse if you add MSAA into the mix, which will double or quadruple the memory and bandwidth requirements in the naive case.
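
Here’s that back-of-the-envelope math as a quick sketch, assuming 4k means 3840x2160 and that every sample gets written exactly once. `BytesPerPass` is just a helper name I made up:

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to write every pixel (and every MSAA sample) of a render
// target exactly once, with no overdraw and no compression.
std::uint64_t BytesPerPass(std::uint64_t width, std::uint64_t height,
                           std::uint64_t bytesPerPixel, std::uint64_t msaaSamples = 1)
{
    assert(width > 0 && height > 0 && bytesPerPixel > 0 && msaaSamples > 0);
    return width * height * bytesPerPixel * msaaSamples;
}
```

`BytesPerPass(3840, 2160, 8)` comes out to 66,355,200 bytes (about 63 MiB) for a single 16-bit floating point target, a 40-byte G-Buffer needs 331,776,000 bytes, and 4x MSAA on the 8-byte target pushes it to 265,420,800 bytes.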

To help keep that bandwidth usage from becoming a bottleneck, GPU designers have put quite a bit of effort into building lossless compression techniques into their hardware. Typically this sort of thing is implemented as part of the ROP’s, and is therefore used when writing to render targets and depth buffers. There’s been a lot of specific techniques used over the years, and the exact details haven’t been made available to the public. However AMD and Nvidia have provided at least a bit of information about their particular implementations of delta color compression in their latest architectures. The basic gist of both techniques is that they aim to exploit the similarity in neighboring pixels in order to avoid storing a unique value for every texel of the render target. Instead, the hardware recognizes patterns in blocks of pixels, and stores each pixel’s difference (or delta) from an anchor value. Nvidia’s block modes give them anywhere from 2:1 to 8:1 compression ratios, which potentially results in huge bandwidth savings!
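Since the real hardware schemes aren’t public, the following Python snippet is only a toy illustration of the anchor-plus-delta idea for a single 8-bit channel. The block size, the 4-bit delta width, and all of the function names are assumptions made up for this sketch, and it doesn’t correspond to any actual AMD or Nvidia block mode.

```python
def compress_block(pixels, delta_bits=4):
    """Try to store a block as one anchor plus small signed deltas.

    Returns (anchor, deltas) if every delta fits in `delta_bits`,
    otherwise None, meaning the block must be stored uncompressed.
    """
    anchor = pixels[0]
    lo, hi = -(1 << (delta_bits - 1)), (1 << (delta_bits - 1)) - 1
    deltas = [p - anchor for p in pixels]
    if all(lo <= d <= hi for d in deltas):
        return anchor, deltas
    return None


def decompress_block(anchor, deltas):
    # Lossless: adding each delta back to the anchor restores the pixels.
    return [anchor + d for d in deltas]


def compressed_bits(num_pixels, delta_bits=4, channel_bits=8):
    # One full-precision anchor plus a small delta per remaining pixel.
    return channel_bits + (num_pixels - 1) * delta_bits


smooth = [120, 121, 121, 122, 119, 120, 123, 122]   # gentle gradient
noisy = [120, 10, 250, 33, 90, 200, 5, 180]         # no coherence

packed = compress_block(smooth)
ratio = (len(smooth) * 8) / compressed_bits(len(smooth))
incompressible = compress_block(noisy) is None
```

The smooth block compresses because its neighbors are similar, while the noisy block falls back to uncompressed storage. That fallback is what keeps the scheme lossless, and it’s also why the savings depend entirely on the image content.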

So what exactly does this have to do with barriers? The problem with these fancy compression modes is that while the ROP’s may understand how to deal with the compressed data, the same is not necessarily true when shaders need to randomly-access the data through their texture units. This means that depending on the hardware and how the texture is used, a decompression step might be necessary before the texture contents are readable by a dependent task (or writable through a means other than ROP’s). Once again, this is something that falls under the umbrella of “barriers” when we’re talking about GPU’s and the new explicit API’s for talking to them.

After reading through my ramblings about thread synchronization, cache coherency, and GPU compression, you hopefully have at least a very basic grasp of 3 potential reasons that typical GPU’s require barriers to do normal things that we expect of them. But if you look at the actual barrier API’s in D3D12 or Vulkan, you’ll probably notice that they don’t really seem to directly correspond with what we just talked about. After all, it’s not like there’s a “WaitForDispatchedThreadsToFinish” or “FlushTextureCaches” function on ID3D12GraphicsCommandList. And if you think about it, it makes sense that they don’t do this. The fact that most GPU’s have lots of shader cores where tasks can overlap is a pretty specific implementation detail, and you could say the same about GPU’s that have weird incoherent cache hierarchies. Even for an explicit API like D3D12 it wouldn’t make sense to leak that kind of detail across its abstraction, since it’s totally possible that one day D3D12 could be used to talk to a GPU that doesn’t behave the way that I just described (it may have already happened!).

When you think of things from that perspective, it starts to make sense that D3D12/Vulkan barriers are more high-level, and instead are mostly aimed at describing the flow of data from one pipeline stage to another. Another way to describe them is to say that the barriers tell the driver about changes in the *visibility* of data with regards to various tasks and/or functional units, which as we pointed out earlier is really the essence of barriers. So in D3D12 you don’t say “make sure that this draw call finishes before this other dispatch reads it”, you say “this texture is transitioning from a ‘render target’ state to a ‘shader readable’ state so that a shader program can read from it”. Essentially you’re giving the driver a bit of information about the past and future life of a resource, which may be necessary for making decisions about which caches to flush and whether or not to decompress a texture. Thread synchronization is then implied by the state transition rather than explicit dependencies between draws or dispatches, which isn’t a perfect system but it gets the job done.

If you’re wondering why we didn’t need to manually issue barriers in D3D11, the answer to that question is “because the driver did it for us!”. Remember how earlier I said that a compiler can analyze your code to determine dependencies, and generate the appropriate assembly automatically? This is basically what drivers do in D3D11, except they’re doing it at runtime! The driver needs to look at all the resources that you bind as inputs and outputs, figure out when there’s visibility changes (for instance, going from a render target to a shader input), and insert the necessary sync points, cache flushes, and decompression steps. While it’s nice that you automatically get correct results, it’s also bad for a few reasons:

- Automatically tracking resources and draw/dispatch calls is expensive, which is not great when you want to squeeze your rendering code into a few milliseconds per frame.
- It’s really bad for generating command buffers in parallel. If you can set a texture as a render target in one thread and then bind it as an input in another thread, the driver can’t figure out the full resource lifetime without somehow serializing the results of the two threads.
- It relies on an explicit resource binding model, where the context always knows the full set of inputs and outputs for every draw or dispatch. This can prevent you from doing awesome things with bindless resource access.
- In some cases the driver might issue unnecessary barriers due to not having knowledge of how the shaders access their data. For example, two dispatches that increment the same atomic counter won’t necessarily need a barrier between them, even though they access the same resource.

The thinking behind D3D12 and Vulkan is that you can eliminate those disadvantages by having the app provide the driver with the necessary visibility changes. This keeps the driver simpler, and lets the app figure out the barriers in any manner that it wants. If your rendering setup is fairly fixed, you can just hard-code your barriers and have essentially 0 CPU cost. Or you can set up your engine to build its own dependency graph, and use that to determine which barriers you’ll need.
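As a rough sketch of what app-side tracking could look like, here’s a tiny Python state tracker that records a transition whenever a resource’s usage changes. The `BarrierTracker` class and the state names are hypothetical; they loosely echo D3D12-style resource states but aren’t the real API enums. The sketch also ignores cases that need a barrier without a state change, such as back-to-back unordered access writes.

```python
from dataclasses import dataclass, field


@dataclass
class BarrierTracker:
    """Tracks the last known state of each resource and records transitions."""
    states: dict = field(default_factory=dict)
    barriers: list = field(default_factory=list)

    def use(self, resource, new_state):
        old_state = self.states.get(resource)
        # Only emit a barrier when the usage actually changes.
        if old_state is not None and old_state != new_state:
            self.barriers.append((resource, old_state, new_state))
        self.states[resource] = new_state


tracker = BarrierTracker()
tracker.use("shadow_map", "depth_write")      # shadow pass renders depth
tracker.use("gbuffer", "render_target")       # G-Buffer pass
tracker.use("shadow_map", "shader_read")      # lighting samples the shadow map
tracker.use("gbuffer", "shader_read")         # lighting samples the G-Buffer
tracker.use("gbuffer", "shader_read")         # second read: no new barrier

for resource, before, after in tracker.barriers:
    print(f"transition {resource}: {before} -> {after}")
```

A D3D11-style driver has to do this kind of bookkeeping for every draw and dispatch at runtime. An engine with a fixed frame layout can instead run it once offline (or skip it entirely) and hard-code the resulting transitions.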

In the next article, I’m going to dive a bit deeper into the topic of thread-level synchronization and how it’s typically implemented on GPU’s. Stay tuned!

*This is part 6 of a series on Spherical Gaussians and their applications for pre-computed lighting. You can find the other articles here:*

Part 1 – A Brief (and Incomplete) History of Baked Lighting Representations

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

Get the code on GitHub: https://github.com/TheRealMJP/BakingLab (pre-compiled binaries available here)

Back in early 2014, David Neubelt and I started doing serious research into using Spherical Gaussians as a compact representation for our pre-computed lighting probes. One of the first things I did back then was to create a testbed application that we could use to compare various lightmap representations (SH, H-basis, SG, etc.) and quickly experiment with new ideas. As part of that application I implemented my first path tracer, which was directly integrated into the app for A/B comparisons. This turned out to be extremely useful, since having quick feedback was really helpful for evaluating quality and also for finding and fixing bugs. Eventually we used this app to finalize the exact approach that we would use when integrating SG’s into The Order: 1886.

A year later in 2015, Dave and I created another test application for experimenting with improvements that we were planning for future projects. This included things like a physically based exposure model utilizing real-world camera parameters, using the ACES[1] RRT/ODT for tone mapping, and using real-world units[2] for specifying lighting intensities. At some point I integrated an improved version of SG baking into this app that would progressively compute results in the background while the app remained responsive, allowing for quick “preview-quality” feedback after adjusting the lighting parameters. Once we started working on our SIGGRAPH presentation[3] from the 2015 physically based shading course, it occurred to us that we should really package up this new testbed and release it alongside the presentation to serve as a working implementation of the concepts we were going to cover. But unfortunately this slipped through the cracks: the new testbed required a lot of work in order to make it useful, and both Dave and I were really pressed for time due to multiple new projects ramping up at the office.

Now, more than a year after our SIGGRAPH presentation, I’m happy to announce that we’ve finally produced and published a working code sample that demonstrates baking of Spherical Gaussian lightmaps! This new app, which I call “The Baking Lab”, is essentially a combination of the two testbed applications that we created. It includes all of the fun features that we were researching in 2015, but also includes real-time progressive baking of 2D lightmaps in various formats. It also allows switching to a progressive path tracer at any time, which serves as the “ground truth” for evaluating lightmap quality and accuracy. Since it’s an amalgamation of two older apps, it uses D3D11 and the older version of my sample framework. So there’s no D3D12 fanciness, but it will run on Windows 7. If you’re just interested in looking at the code or running the app, then go ahead and head over to GitHub: https://github.com/TheRealMJP/BakingLab. If you’re interested in the details of what’s implemented in the app, then keep reading.

The primary feature of The Baking Lab is lightmap baking. Each of the test scenes includes a secondary UV set that contains non-overlapping UV’s used for mapping the lightmap onto the scene. Whenever the app starts or a new scene is selected, the baker uses the GPU to rasterize the scene into lightmap UV space. The pixel shader outputs interpolated vertex components like position, tangent frame, and UV’s to several render targets, which use MSAA to simulate conservative rasterization. Once the rasterization is completed, the results are copied back into CPU-accessible memory. The CPU then scans the render targets, and extracts “bake points” from all texels covered by the scene geometry. Each of these bake points represents the location of a single hemispherical probe to be baked.

Once all bake points are extracted, the baker begins running using a set of background threads on the CPU. Each thread continuously grabs a new work unit consisting of a group of contiguous bake points, and then loops over the bake points to compute the result for that probe. Each probe is computed by invoking a path tracer, which uses Embree[4] to allow for arbitrary ray tracing through the scene on the CPU. The path tracer returns the incoming radiance for a direction and starting point, where the radiance is the result of indirect lighting from various light sources as well as the direct lighting from the sky. The path tracer itself is a very simple unidirectional path tracer, using a few standard techniques like importance sampling, correlated multi-jittered sampling[5], and Russian roulette to increase performance and/or convergence rates. The following baking modes are supported:

- **Diffuse** – a single RGB value containing the result of applying a standard diffuse BRDF to the incoming lighting, with an albedo of 1.0
- **Half-Life 2** – directional irradiance projected onto the Half-Life 2 basis[6], making for a total of 3 sets of RGB coefficients (9 floats total)
- **L1 SH** – radiance projected onto the first two orders of spherical harmonics, making for a total of 4 sets of RGB coefficients (12 floats total). Supports environment specular via a 3D lookup texture.
- **L2 SH** – radiance projected on the first three orders of spherical harmonics, making for a total of 9 sets of RGB coefficients (27 floats total). Supports environment specular via a 3D lookup texture.
- **L1 H-basis** – irradiance projected onto the first two orders of H-basis[7], making for a total of 4 sets of RGB coefficients (12 floats total).
- **L2 H-basis** – irradiance projected onto the first three orders of H-basis, making for a total of 6 sets of RGB coefficients (18 floats total).
- **SG5** – radiance represented by the sum of 5 SG lobes with fixed directions and sharpness, making for a total of 5 sets of RGB coefficients (15 floats total). Supports environment specular via an approximate evaluation of per-lobe specular contribution.
- **SG6** – radiance represented by the sum of 6 SG lobes with fixed directions and sharpness, making for a total of 6 sets of RGB coefficients (18 floats total). Supports environment specular via an approximate evaluation of per-lobe specular contribution.
- **SG9** – radiance represented by the sum of 9 SG lobes with fixed directions and sharpness, making for a total of 9 sets of RGB coefficients (27 floats total). Supports environment specular via an approximate evaluation of per-lobe specular contribution.
- **SG12** – radiance represented by the sum of 12 SG lobes with fixed directions and sharpness, making for a total of 12 sets of RGB coefficients (36 floats total). Supports environment specular via an approximate evaluation of per-lobe specular contribution.

For SH, H-basis, and HL2 basis baking modes the path tracer is evaluated for random rays distributed about the hemisphere so that Monte Carlo integration can be used to integrate the radiance samples onto the corresponding basis functions. This allows for true progressive integration, where the baker makes N passes over each bake point, each time adding a new sample with the appropriate weighting. It looks pretty cool in action:

The same approach is used for the “Diffuse” baking mode, except that sampling rays are evaluated using a cosine-weighted hemispherical sampling scheme[8]. For SG baking, things get a little bit trickier. If the ad-hoc projection mode is selected, the result can be progressively evaluated in the same manner as the non-SG bake modes. However if either the Least Squares or Non-Negative Least Squares mode is active, we can’t run the solve unless we have all of the hemispherical radiance samples available to feed to the solver. In this case we switch to a different baking scheme where each thread fully computes the final value for every bake point that it operates on. However the thread only does this for a single bake point from each work group, and afterwards it fills in the rest of the neighboring bake points (which are arranged in an 8×8 group of texels) with the results it just computed. Each pass of the baker then fills in the next bake point in the work group, gradually computing the final result for all texels in the group. So instead of seeing the quality slowly improve across the light map, you see extrapolated results being filled in. It ends up looking like this:

While it’s not as great as a true progressive bake, it’s still better than having no preview at all.
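For reference, the true progressive scheme used by the SH, H-basis, and HL2 modes can be sketched in a few lines of Python. This is a simplified, single-channel sketch rather than the app’s actual code: it projects onto L1 SH only, uses uniform hemisphere sampling, and takes a hypothetical `radiance` callback in place of the path tracer.

```python
import math
import random

Y00 = 0.282095          # L0 SH basis constant
Y1 = 0.488603           # L1 SH basis scale (multiplies y, z, x)


def sample_hemisphere(rng):
    """Uniformly sample a direction on the +Z hemisphere."""
    z = rng.random()
    r = math.sqrt(max(0.0, 1.0 - z * z))
    phi = 2.0 * math.pi * rng.random()
    return (r * math.cos(phi), r * math.sin(phi), z)


def sh_basis(d):
    x, y, z = d
    return (Y00, Y1 * y, Y1 * z, Y1 * x)


def progressive_bake(radiance, num_passes, seed=0):
    """Accumulate an L1 SH projection, adding one sample per pass."""
    rng = random.Random(seed)
    pdf = 1.0 / (2.0 * math.pi)          # uniform hemisphere pdf
    coeffs = [0.0] * 4
    for n in range(1, num_passes + 1):
        d = sample_hemisphere(rng)
        value = radiance(d) / pdf
        for i, basis in enumerate(sh_basis(d)):
            # Running mean: equivalent to summing and dividing by n,
            # but a usable preview is available after every pass.
            coeffs[i] += (value * basis - coeffs[i]) / n
    return coeffs


# Constant white radiance stands in for the path tracer here.
coeffs = progressive_bake(lambda d: 1.0, num_passes=20000)
```

Because each pass only nudges a running mean, the baker can stop or display results at any point, which is exactly what the least squares modes can’t do.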

The app supports a few settings that control some of the bake parameters, such as the number of samples evaluated per-texel and the overall lightmap resolution. The “Scene” group in the UI also has a few settings that allow toggling different components of the final render, such as the direct or indirect lighting or the diffuse/specular components. Under the “Debug” setting you can also toggle a neat visualizer that shows a visual representation of the raw data stored in the lightmap. It looks like this:

The integrated path tracer is primarily there so that you can see how close or far off you are when computing environment diffuse or specular from a light map. It was also a lot of fun to write – I recommend doing it sometime if you haven’t already! Just be careful: it may make you depressed to see how poorly your real-time approximation holds up when compared with a proper offline render.

The ground truth renderer works in a similar vein to the lightmap baker: it kicks off multiple background threads that each grab work groups of 16×16 pixels that are contiguous in screen space. The renderer makes N passes over each pixel, where each pass adds an additional sample that’s weighted and summed with the previous results. This gives you a true progressive render, where the result starts out noisy and (very) gradually converges towards a noise-free image:

The ground truth renderer is activated by checking the “Show Ground Truth” setting under the “Ground Truth” group. There’s a few more parameters in that group to control the behavior of the renderer, such as the number of samples used per-pixel and the scheme used for generating random samples.

There’s 3 different light sources supported in the app: a sun, a sky, and a spherical area light. For real-time rendering, the sun is handled as a directional light with an intensity computed automatically using the Hosek-Wilkie solar radiance model[9]. So as you change the position of the sun in the sky, you’ll see the color and intensity of the sunlight automatically change. To improve the real-time appearance, I used the disk area light approximation from the 2014 Frostbite presentation. The path tracer evaluates the sun as an infinitely-distant spherical area light with the appropriate angular radius, with uniform intensity and color also computed from the solar radiance model. Since the path tracer handles the sun as a true area light source, it produces correct specular reflections and soft shadows. In both cases the sun defaults to correct real-world intensities using actual photometric units. There is a parameter for adjusting the sun size, which will result in the sun being too bright or too dark if manipulated. However there’s another setting called “Normalize Sun Intensity” which will attempt to maintain roughly the same illumination regardless of the size, which allows for changing the sun appearance or shadow softness without changing the overall scene lighting.

The default sky mode (called “Procedural”) uses the Hosek-Wilkie sky model to compute a procedural sky from a few input parameters. These include turbidity, ground albedo, and the current sun position. Whenever the parameters are changed, the model is cached to a cubemap that’s used for real-time rendering on the GPU. For CPU path tracing, the sky model is directly evaluated for a direction using the sample code provided by the authors. When combined with the procedural sun model, the two light sources form a simple outdoor lighting environment that corresponds to real-world intensities. Several other sky modes are also supported for convenience. The “Simple” mode takes just a color and intensity as input parameters, and flood-fills the entire sky with a value equal to color * intensity. The “Ennis”, “Grace Cathedral”, and “Uffizi Cross” modes use corresponding HDR environment maps to fill the sky instead of a procedural model.

For local lighting, the app supports enabling a single spherical area light using the “Enable Area Light” setting. The area light can be positioned using the Position X/Position Y/Position Z settings, and its radius can be specified with the “Size” setting. There are 4 different modes for specifying the intensity of the light:

- **Luminance** – the intensity corresponds to the amount of light being emitted from the light source along an infinitesimally small ray towards the viewer or receiving surface. Uses units of cd/m^{2}. Changing the size of the light source will change the overall illumination of the scene.
- **Illuminance** – specifies the amount of light incident on a surface at a set distance, which is specified using the “Illuminance Distance” setting. So instead of saying “how much light is coming out of the light source” like you do with the “Luminance” mode, you’re saying “how much diffuse light is being reflected from a perpendicular surface N units away”. Uses units of lux, which are equivalent to lm/m^{2}. Changing the size of the light source will *not* change the overall illumination of the scene.
- **Luminous Power** – specifies the total amount of light being emitted from the light source in all directions. Uses units of lumens. Changing the size of the light source will *not* change the overall illumination of the scene.
- **EV100** – this is an alternative way of specifying the luminance of the light source, using the exposure value[10] system originally suggested by Nathan Reed[11]. The base-2 logarithmic scale for this mode is really nice, since incrementing by 1 means doubling the perceived brightness. Changing the size of the light source will change the overall illumination of the scene.
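To show how these modes relate to each other, here’s a small Python sketch that converts each one to luminance for a uniformly emitting sphere. The formulas are the standard photometric relationships for a diffuse spherical emitter, and the EV100 conversion assumes the common calibration constant K = 12.5. The function names are made up for this sketch, and the app’s own conversion code may differ in its details.

```python
import math


def luminance_from_power(lumens, radius):
    """Luminance (cd/m^2) of a uniformly emitting sphere.

    Power = pi * L * area, with area = 4 * pi * r^2.
    """
    return lumens / (4.0 * math.pi ** 2 * radius ** 2)


def luminance_from_illuminance(lux, distance, radius):
    """Luminance that produces `lux` on a surface facing the light.

    `distance` is measured from the sphere's center and must exceed
    `radius`. Uses E = pi * L * (r / d)^2.
    """
    return lux * distance ** 2 / (math.pi * radius ** 2)


def luminance_from_ev100(ev100):
    # Assumes the calibration constant K = 12.5 at ISO 100,
    # which gives L = 2^EV100 * 12.5 / 100 = 2^(EV100 - 3).
    return 2.0 ** (ev100 - 3.0)


# The same 1000 lumen light at two different sizes.
small = luminance_from_power(1000.0, radius=0.25)
large = luminance_from_power(1000.0, radius=0.5)
```

Doubling the radius cuts the luminance to a quarter, which is why the overall illumination stays the same in the “Luminous Power” and “Illuminance” modes but changes with size in the “Luminance” and “EV100” modes.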

The ground truth renderer will evaluate the area light as a true spherical light source, using importance sampling to reduce variance. The real-time renderer approximates the light source as a single SG, and generates very simple hard shadows using an array of 6 shadow maps. By default only indirect lighting from the area light will be baked into the lightmap, with the direct lighting evaluated on the GPU. However if the “Bake Direct Area Light” setting is enabled, then the direct contribution from the area light will be baked into the lightmap.

Note that all light sources in the app are always scaled by a factor of 2^{-10} before being used in rendering, as suggested by Nathan Reed in his blog post[11]. Doing this effectively shifts the window of values that can be represented in a 16-bit floating point value, which is necessary in order to represent specular reflections from the sun. However the UI will always show the unshifted values, as will the debug luminance picker that shows the final color and intensity of any pixel on the screen.

As I mentioned earlier, the app implements a physically based exposure system that attempts to model the behavior and parameters of a real-world camera. Much of the implementation was based on the code from Padraic Hennessy’s excellent series of articles[12], which was in turn inspired by Sébastien Lagarde and Charles de Rousiers’s SIGGRAPH presentation from 2014[2]. When the “Exposure Mode” setting is set to the “Manual (SBS)” or “Manual (SOS)” modes, the final exposure value applied before tone mapping will be computed based on the combination of aperture size, ISO rating, and shutter speed. There is also a “Manual (Simple)” mode available where a single value on a log2 scale can be used instead of the 3 camera parameters.
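As a rough illustration of the saturation-based path, here’s a Python sketch based on the formulas from the Frostbite course notes[2]. The function names are my own for this example, and the app’s actual implementation (which follows Padraic Hennessy’s articles) may differ in its details.

```python
import math


def ev100_from_camera(aperture, shutter_time, iso):
    """EV100 from f-number N, shutter time t (seconds), and ISO S.

    EV100 = log2(N^2 / t * 100 / S)
    """
    return math.log2((aperture ** 2 / shutter_time) * (100.0 / iso))


def saturation_based_exposure(ev100):
    """Scale factor that maps the sensor-saturating luminance to 1.0.

    The 1.2 factor follows the Frostbite course notes.
    """
    max_luminance = 1.2 * (2.0 ** ev100)
    return 1.0 / max_luminance


# "Sunny 16" rule: f/16 at 1/100 s and ISO 100 for a sunlit scene.
ev = ev100_from_camera(aperture=16.0, shutter_time=1.0 / 100.0, iso=100.0)
exposure = saturation_based_exposure(ev)
print(f"EV100 = {ev:.2f}, exposure scale = {exposure:.3e}")
```

The resulting scale factor is what gets multiplied with the scene luminance before tone mapping, so opening the aperture by one stop or doubling the ISO doubles the final image brightness.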

Mostly for fun, I integrated a post-process depth of field effect that uses the same camera parameters (along with focal length and film size) to compute per-pixel circle of confusion sizes. The effect is off by default, and can be toggled on using the “Enable DOF” setting. Polygonal and circular bokeh shapes are supported using the technique suggested by Tiago Sousa in his 2013 SIGGRAPH presentation[13]. Depth of field is also implemented in the ground truth renderer, which is capable of achieving true multi-layer effects by virtue of using a ray tracer.
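The circle of confusion computation comes straight from the thin lens model, and can be sketched in Python as follows. This is the textbook formula rather than the app’s shader code, and the function and parameter names are assumptions for this example.

```python
def circle_of_confusion(depth, focus_distance, focal_length, f_number):
    """Thin-lens circle of confusion diameter on the film plane.

    All distances use the same unit (meters here). Requires
    focus_distance > focal_length and depth > 0.
    """
    aperture_diameter = focal_length / f_number
    magnification = focal_length / (focus_distance - focal_length)
    return aperture_diameter * magnification * abs(depth - focus_distance) / depth


def coc_in_pixels(coc, film_width, image_width):
    # Convert a film-plane size into a blur size in screen pixels.
    return coc / film_width * image_width


# 50mm lens at f/2 focused at 2 meters, looking at a surface 4 meters away.
coc = circle_of_confusion(depth=4.0, focus_distance=2.0,
                          focal_length=0.05, f_number=2.0)
blur = coc_in_pixels(coc, film_width=0.036, image_width=1920)
```

Anything exactly at the focus distance gets a circle of confusion of zero, and the blur size grows as the aperture opens up or the focal length increases.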

Several tone mapping operators are available for experimentation:

- **Linear** – no tone mapping, just a clamp to [0, 1]
- **Film Stock** – Jim Hejl and Richard Burgess-Dawson’s polynomial approximation of Haarm-Peter Duiker’s filmic curve, which was created by scanning actual film stock. Based on the implementation provided by John Hable[14].
- **Hable (Uncharted2)** – John Hable’s adjustable filmic curve from his GDC 2010 presentation[15]
- **Hejl 2015** – Jim Hejl’s filmic curve that he posted on Twitter[16], which is a refinement of Duiker’s curve
- **ACES sRGB Monitor** – a fitted polynomial version of the ACES[17] reference rendering transform (RRT) combined with the sRGB monitor output display transform (ODT), generously provided by Stephen Hill.
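To give a feel for what these operators look like in code, here’s a Python sketch of the “Linear” and “Hable (Uncharted2)” operators. The curve constants are the default values from John Hable’s blog post[14]; the app exposes its own adjustable parameters, so treat this as an approximation of the idea rather than the app’s exact shader code.

```python
def hable_partial(x):
    """John Hable's filmic curve with the default constants from his post."""
    A, B, C, D, E, F = 0.15, 0.50, 0.10, 0.20, 0.02, 0.30
    return ((x * (A * x + C * B) + D * E) / (x * (A * x + B) + D * F)) - E / F


def tonemap_linear(x):
    # No tone mapping at all, just a clamp to [0, 1].
    return min(max(x, 0.0), 1.0)


def tonemap_hable(x, white_point=11.2):
    # Normalize so that the chosen linear white point maps to 1.0.
    return hable_partial(x) / hable_partial(white_point)


for value in (0.0, 0.18, 1.0, 4.0, 11.2):
    print(f"{value:5.2f} -> linear {tonemap_linear(value):.3f}, "
          f"hable {tonemap_hable(value):.3f}")
```

The linear operator clips everything above 1.0, while the filmic curve rolls off gradually towards the white point, which is what preserves detail in bright highlights.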

At the bottom of the settings UI are a group of debug options that can be selected. I already mentioned the bake data visualizer previously, but it’s worth mentioning again because it’s really cool. There’s also a “luminance picker”, which will enable a text output showing you the luminance and illuminance of the surface under the mouse cursor. This was handy for validating the physically based sun and sky model, since I could use the picker to make sure that the lighting values matched what you would expect from real-world conditions. The “View Indirect Specular” option causes both the real-time renderer and the ground truth renderer to only show the indirect specular component, which can be useful for gauging the accuracy of specular computed from the lightmap. After that there’s a pair of buttons for saving or loading light settings. This will serialize the settings that control the lighting environment (sun direction, sky mode, area light position, etc.) to a file, which can be loaded in whenever you like. The “Save EXR Screenshot” is fairly self-explanatory: it lets you save a screenshot to an EXR file that retains the HDR data. Finally there’s an option to show the current sun intensity that’s used for the real-time directional light.

[1] Academy Color Encoding System – https://en.wikipedia.org/wiki/Academy_Color_Encoding_System

[2] Moving Frostbite to PBR (course notes) – http://www.frostbite.com/wp-content/uploads/2014/11/course_notes_moving_frostbite_to_pbr_v2.pdf

[3] Advanced Lighting R&D at Ready At Dawn Studios – http://blog.selfshadow.com/publications/s2015-shading-course/rad/s2015_pbs_rad_slides.pdf

[4] Embree: High Performance Ray Tracing Kernels – https://embree.github.io/

[5] Correlated Multi-Jittered Sampling – http://graphics.pixar.com/library/MultiJitteredSampling/paper.pdf

[6] Shading in Valve’s Source Engine – http://www.valvesoftware.com/publications/2006/SIGGRAPH06_Course_ShadingInValvesSourceEngine.pdf

[7] Efficient Irradiance Normal Mapping – https://www.cg.tuwien.ac.at/research/publications/2010/Habel-2010-EIN/

[8] Better Sampling – http://www.rorydriscoll.com/2009/01/07/better-sampling/

[9] Adding a Solar Radiance Function to the Hosek Skylight Model – http://cgg.mff.cuni.cz/projects/SkylightModelling/

[10] Exposure value – https://en.wikipedia.org/wiki/Exposure_value

[11] Artist-Friendly HDR With Exposure Values – http://www.reedbeta.com/blog/2014/06/04/artist-friendly-hdr-with-exposure-values/

[12] Implementing a Physically Based Camera: Understanding Exposure – https://placeholderart.wordpress.com/2014/11/16/implementing-a-physically-based-camera-understanding-exposure/

[13] CryENGINE 3 Graphics Gems – http://advances.realtimerendering.com/s2013/Sousa_Graphics_Gems_CryENGINE3.pptx

[14] Filmic Tonemapping Operators – http://filmicgames.com/archives/75

[15] Uncharted 2: HDR Lighting – http://www.gdcvault.com/play/1012351/Uncharted-2-HDR

[16] Jim Hejl on Twitter – https://twitter.com/jimhejl/status/633777619998130176

[17] Academy Color Encoding System Developer Resources – https://github.com/ampas/aces-dev

In the two previous articles I showed workable approaches for approximating the diffuse and specular result from a Spherical Gaussian light source. On their own these techniques might seem a bit silly, since it’s not immediately obvious why the heck it would be useful to light a scene with an SG light source. But then you might remember that these articles started off by discussing methods for storing pre-computed radiance or irradiance in lightmaps or probe grids, which is a subject we’ll finally return to.

A common process in mathematics (particularly statistics) is to take a set of data points and attempt to figure out some sort of analytical curve that can represent the data. This process is known as curve fitting[1], since the goal is to find a curve that is a good fit for the data points. There’s various reasons to do this (such as regression analysis[2]), but I find it can be helpful to think of it as a form of lossy compression: a few hundred data points might require kilobytes of data, but if you can approximate that data with a curve the coefficients might only need a few bytes of storage. Here’s a simple example from Wikipedia:

*Fitting various polynomials to data generated by a sine wave. Red is first degree, green is second degree, orange is third degree, blue is fourth degree.
By Krishnavedala (Own work) [CC0], via Wikimedia Commons*

In the image there’s a bunch of black dots, which represent a set of data points that we want to fit. In this case the data points come from a sine wave, but in practice the data could take any form. The various colored curves represent attempts at fitting the data using polynomials of varying degrees: a first-degree polynomial has the form y = ax + b, a second-degree polynomial has the form y = ax² + bx + c, and so on up to the fourth degree.

By looking at the graphs and the forms of the polynomials it should be obvious that higher degrees allow for more complex curves, but require more coefficients. More coefficients means more data to store, and may also mean that the fitting process is more difficult and/or more expensive. One of the most common techniques used for fitting is least squares[3], which works by minimizing the sum of the squared differences between the fit curve and the original data.
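Here’s a minimal pure-Python sketch of a least squares polynomial fit, which solves the normal equations with a small Gaussian elimination routine. All of the function names are made up for this example, and in practice you would reach for a numerical library instead of a hand-rolled solver. It fits polynomials of degree 1 through 4 to samples of a sine wave, much like the Wikipedia image.

```python
import math


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_polynomial(xs, ys, degree):
    """Minimize sum((p(x_i) - y_i)^2) by solving the normal equations."""
    n = degree + 1
    basis = [[x ** k for k in range(n)] for x in xs]
    ata = [[sum(row[i] * row[j] for row in basis) for j in range(n)]
           for i in range(n)]
    atb = [sum(row[i] * y for row, y in zip(basis, ys)) for i in range(n)]
    return solve(ata, atb)   # coefficients, lowest power first


def squared_error(coeffs, xs, ys):
    return sum((sum(c * x ** k for k, c in enumerate(coeffs)) - y) ** 2
               for x, y in zip(xs, ys))


xs = [-3.0 + 6.0 * i / 49 for i in range(50)]
ys = [math.sin(x) for x in xs]
errors = {d: squared_error(fit_polynomial(xs, ys, d), xs, ys)
          for d in (1, 2, 3, 4)}
```

The summed squared error never gets worse as the degree goes up, since every lower-degree polynomial is also a valid higher-degree polynomial with some coefficients set to zero.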

The other observation we can make is that the resulting fit is essentially a linear combination of basis functions, where the basis functions are x, x², x³, and so on. There are many other basis functions we could use here instead of polynomials, such as our old friend the Gaussian! Just like polynomials, a sum of Gaussians can represent more complex functions with a handful of coefficients. As an example, let’s take a set of data points and use least squares to fit varying numbers of Gaussians:

*Fitting Gaussians to a data set using least squares. The left graph shows a fit with a single Gaussian, the middle graph shows a fit with two Gaussians, and the right graph shows a fit with three Gaussians.*

For this example I used curve_fit[4] from the scipy optimization library, which uses a non-linear least squares algorithm. Notice how as I added more Gaussians, the resulting sum became a better approximation of the raw data.

So far we’ve been fitting 1D data sets, but the techniques we’re using also work in multiple dimensions. So for instance, let’s say we had a bunch of scattered samples in random directions on a sphere defined by a 2D spherical coordinate system. And let’s say that these samples represent something like…oh I don’t know…the amount of incoming lighting along an infinitesimally narrow ray oriented in that direction. If we take all of these data points and throw some least squares at it, we can end up with a series of N Spherical Gaussians whose sum can serve as an approximation for radiance in any direction on the sphere! We just need our fitting algorithm to spit out the axis, amplitude, and sharpness of each Gaussian, or if we want we can fix one or more of the SG parameters ahead of time and only fit the remaining parameters. It should be immediately obvious why this is useful, since a set of SG coefficients can be stored very compactly compared to a gigantic set of radiance samples. Of course if we only use a few Gaussians the resulting approximation will probably lose details from the original radiance function, but this is no different from spherical harmonics or other common techniques for storing approximate representations of radiance or irradiance. Let’s take a look at what a fit actually looks like using an HDR environment map as input data:

*Approximating radiance on a sphere using a sum of Spherical Gaussians. The left image shows the original source radiance function taken from an HDR environment map. The middle image shows a least squares fit of 12 SG’s, and right image shows a fit of 24 SG’s.*

These images were generated using Yuriy O’Donnell’s Probulator[5], which is an excellent tool for comparing various ways of approximating radiance and irradiance on a sphere. One important thing to note here is that the fit was only performed on the amplitude of the SG’s: the axis and sharpness are pre-determined based on the number of SG’s. Probulator generates the lobe axis directions using Vogel’s method[6], but any technique for distributing points on a sphere would also work. Fitting only the lobe amplitude significantly simplifies the solve, since there are fewer parameters to optimize. Solving for a single parameter also allows us to use linear least squares[7], while fitting all of the parameters would require use of complex and expensive non-linear least squares[8] algorithms. Solving for fewer parameters also decreases the storage costs, since only the amplitudes need to be stored per-probe while the directions and sharpness can be global constants. Either way it’s good to keep in mind when examining the results. In particular it helps explain why the white lobe in the middle image doesn’t quite line up with the bright windows in the source environment. Aside from that, the results are probably what you would expect: doubling the number of lobes increases the possible complexity and sharpness of the resulting approximation, which in this case allows it to provide a better representation of some of the high-frequency details in the source image.
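To illustrate the amplitude-only solve, here’s a small pure-Python sketch of linear least squares with fixed SG axes and sharpness. It’s a simplified stand-in and not Probulator’s code: it uses 6 axis-aligned lobes instead of Vogel’s method, a single color channel, synthetic radiance samples generated from known amplitudes, and a hand-rolled solver for the normal equations. All of the function names are assumptions made for this example.

```python
import math
import random


def sg_basis(axis, sharpness, direction):
    """Unit-amplitude SG: exp(sharpness * (dot(axis, direction) - 1))."""
    cos_angle = sum(a * d for a, d in zip(axis, direction))
    return math.exp(sharpness * (cos_angle - 1.0))


def random_direction(rng):
    """Uniformly sample a direction on the unit sphere."""
    z = 2.0 * rng.random() - 1.0
    r = math.sqrt(max(0.0, 1.0 - z * z))
    phi = 2.0 * math.pi * rng.random()
    return (r * math.cos(phi), r * math.sin(phi), z)


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_sg_amplitudes(axes, sharpness, directions, radiance):
    """Linear least squares for the amplitudes; axes and sharpness are fixed."""
    n = len(axes)
    rows = [[sg_basis(ax, sharpness, d) for ax in axes] for d in directions]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * y for r, y in zip(rows, radiance)) for i in range(n)]
    return solve(ata, atb)


axes = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
sharpness = 4.0
true_amplitudes = [2.0, 0.5, 1.0, 0.0, 3.0, 0.25]

rng = random.Random(1)
directions = [random_direction(rng) for _ in range(500)]
radiance = [sum(a * sg_basis(ax, sharpness, d)
                for a, ax in zip(true_amplitudes, axes))
            for d in directions]
fitted = fit_sg_amplitudes(axes, sharpness, directions, radiance)
```

Since the synthetic samples were generated from the same set of lobes, the solve recovers the original amplitudes almost exactly. With real radiance samples the fit would only be an approximation, and for RGB data you would run the same solve once per channel.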

One odd thing you might notice in the SG approximation is the overly dark areas towards the bottom-right of the sphere. They look somewhat similar to the darkening that can show up in SH approximations, which happens due to negative coefficients being used for the SH polynomials. It turns out that something very similar is happening with our SG fit: the least squares optimization is returning negative coefficients for some of our lobes in an attempt to minimize the error of the resulting fit.

If you’re having trouble understanding why this would happen, let’s go back to 1D for a quick example. For the last 1D example I cheated a bit: the data we were fitting our Gaussians to was actually just some random noise applied to the sum of 3 Gaussians. This is why our Gaussian fit so closely resembled the source data. This time, we’ll be fitting some lobes to a more complex data set:

*A data set with a discontinuity near 0, making it more difficult to fit curves to.*

This time the data has a bunch of values near the center that have a value of zero. With such a data set it’s now less obvious how a sum of Gaussians could approximate the values. If we throw least squares at the problem and have it fit two lobes, we get the following:

*The result of using least squares to fit 2 Gaussian lobes to the above data set. The left graph shows the first lobe (red), the middle graph shows the second lobe (green), and the right graph shows the sum of the two lobes (blue) overlaid onto the original data set.*

This time around the optimization resulted in a positive amplitude for the first lobe, but a *negative* amplitude for the second lobe. Looking at the sum overlaid onto the data makes it clear why this happened: the positive lobe takes care of all of the positive data points to the left and right, while the negative lobe brings the sum closer to zero in the middle of the graph. Upon closer inspection the negative lobe actually causes the approximation to dip *below* zero into the negatives. We can assume that having this dip results in lower overall error for the approximation, since that’s how least squares works.

In practice, having negative coefficients and negative values from our approximation can be undesirable. In fact when approximating radiance or irradiance negative values really just don’t make sense, since they’re physically impossible. In our experience we also found that the visual result of lighting a scene with negative lobes can be quite displeasing, since it tends to look very unnatural to have surfaces that are completely dark. Perhaps you remember this image from the first article showing what L2 SH looks like with a bright area light in the environment:

*A sphere with a Lambertian diffuse BRDF being lit by a lighting environment with a strong area light source. The left image shows the ground-truth result of using monte-carlo integration. The middle image shows the result of projecting radiance onto L2 SH, and then computing irradiance. The right image shows the result of applying a windowing function to the L2 SH coefficients before computing irradiance.*

We found that character faces in particular tended to look really bad when our SH light probes had strong negative lobes, and this turned out to be one of our motivations for investigating alternative approximations. We also ran into some trouble when attempting to compress signed floating point values in BC6H textures: some compressors didn’t even support compressing to that format, and those that did had noticeably worse quality.

With that in mind, it would be nice to constrain a least squares solver in such a way that it only gave us positive coefficients. Fortunately for us such a technique exists, and it’s known as non-negative least squares[9] (or NNLS for short). If we use that technique for fitting SG’s to our original radiance function instead of standard least squares, we get this result instead:

*Fitting SG’s to a radiance function using a non-negative least squares solver. The left image shows the original source radiance function taken from an HDR environment map. The middle image shows an NNLS fit of 12 SG’s, and the right image shows a fit of 24 SG’s.*

This time we don’t have the dark areas in the bottom right, since the fit only uses positive lobes. But unfortunately there’s no free lunch here, since the resulting approximation is also a bit “blurrier” compared to the normal least squares fit.
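If you want to see the difference between the two solvers on something small, here’s a hedged 1D sketch loosely modeled on the two-lobe example from earlier. The data set, the two lobe sharpness values, and the function names are all assumptions of mine. The solver is a simple projected coordinate descent on the normal equations, which is not the classic active-set NNLS algorithm; in practice you would more likely reach for a library routine such as `scipy.optimize.nnls`.

```python
import math

def gaussian(x, sharpness):
    """Unit-amplitude 1D Gaussian lobe centered at zero."""
    return math.exp(-sharpness * x * x)

def fit_amplitudes(rows, values, nonnegative, sweeps=500):
    """Coordinate descent on (A^T A) x = A^T b, optionally clamping x >= 0."""
    n = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    proj = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(n)]
    amps = [0.0] * n
    for _ in range(sweeps):
        for j in range(n):
            others = sum(gram[j][k] * amps[k] for k in range(n) if k != j)
            amps[j] = (proj[j] - others) / gram[j][j]
            if nonnegative:
                amps[j] = max(0.0, amps[j])
    return amps

def squared_error(rows, values, amps):
    """Sum of squared residuals for a given set of amplitudes."""
    return sum((sum(a * b for a, b in zip(amps, r)) - v) ** 2
               for r, v in zip(rows, values))

# Data that is 1 on the outside and drops to 0 near the center.
indices = range(-20, 21)
xs = [i / 5.0 for i in indices]
values = [0.0 if abs(i) <= 5 else 1.0 for i in indices]

# One wide lobe and one narrow lobe, both centered at zero (assumed sharpness).
rows = [[gaussian(x, 0.05), gaussian(x, 1.0)] for x in xs]

least_squares = fit_amplitudes(rows, values, nonnegative=False)
non_negative = fit_amplitudes(rows, values, nonnegative=True)
print("least squares:", least_squares, squared_error(rows, values, least_squares))
print("non-negative: ", non_negative, squared_error(rows, values, non_negative))
```

The unconstrained fit carves out the middle of the wide lobe with a negative narrow lobe, while the constrained fit has to drop the narrow lobe entirely and accept a higher error. That’s the same trade-off as the “blurrier” NNLS result above.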

Now that we’ve covered how to generate an SG approximation of radiance from source data, we can take a look at how well it stacks up against other options for a simple use case. Probably the most obvious application is computing irradiance from the radiance approximation, which can be directly used to compute standard Lambertian diffuse lighting. The following images were captured using Probulator, and they show the Stanford Bunny being lit using a few common techniques for approximating irradiance:

*A bunny model being lit by various irradiance approximations generated from the “Pisa” HDR environment map*

With the exception of Valve’s Ambient Cube, all of the approximations hold up very well when compared with the ground truth. The non-negative least squares fit is just a bit more washed out than the least squares fit, but both seem to produce perfectly acceptable results. The SH result is also very good, with no noticeable ringing artifacts. However this particular environment is a somewhat easier case, as the range of intensities isn’t as large as you might find in some realistic lighting scenarios. For a more extreme case, let’s now look at a comparison using the “Ennis” environment map:

*A bunny model being lit by various irradiance approximations generated from the “Ennis” HDR environment map*

This time there’s a much more noticeable difference between the various techniques. This is because the source environment map has a very bright window to the left, which effectively serves as a large area light source. With this particular environment the SG results start to compare pretty favorably to the SH or ambient cube approximations. The results from L2 SH have severe ringing artifacts, which manifest as areas that are either too dark or too bright on the side of the bunny facing to the right. Meanwhile, the windowed version of L2 SH blurs the lighting too much, making it appear as if the environment is more uniform than it really is. The ambient cube probe doesn’t suffer from ringing, but it does have problems with the bright lighting from the left bleeding onto the top and side of the bunny. Looking at the least squares solve for 12 SG’s, the result is pretty nice but there is a bit of ringing evident on the upper-right side of the bunny model. This ringing isn’t present in the non-negative least squares solve, since all of the coefficients end up being positive.

As I mentioned earlier, these Probulator comparisons use fixed lobe directions and sharpness. Consequently we only need to store amplitude, meaning that the storage cost of the 12 SG lobes is equivalent to 12 sets of floating-point RGB coefficients (36 floats total). L2 SH requires 9 sets of RGB coefficients, which adds up to 27 floats. The ambient cube requires only 6 sets of RGB coefficients, which is half that of the SG solve. So for this particular comparison the SG representation of radiance requires the most storage. However, this highlights one of the nice points of using SG’s as your basis: you can solve for any number of lobes you’d like, allowing you to easily trade off quality vs. storage and performance cost. Valve’s ambient cube is only defined for 6 lobes, and that number can’t be increased since the lobes must remain orthogonal to each other.
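The storage arithmetic above is simple enough to do in your head, but here it is spelled out as a tiny Python snippet in case you want to plug in other lobe counts. The helper name is hypothetical, and it assumes one float per color channel per basis function with nothing else stored per probe.

```python
RGB_CHANNELS = 3

def floats_per_probe(coefficient_sets, channels=RGB_CHANNELS):
    """Floats needed when each basis function stores one coefficient per channel."""
    return coefficient_sets * channels

storage = {
    "SG, 12 lobes (amplitude only)": floats_per_probe(12),
    "L2 SH": floats_per_probe(9),
    "Ambient cube": floats_per_probe(6),
}

for name, count in storage.items():
    print(f"{name}: {count} floats")
```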

For full lighting probes where the sampling surface can have any orientation, storing the radiance or irradiance on a sphere makes perfect sense. However it makes less sense if we would like to store baked lighting in 2D textures where all of the sample points lie on the surface of a mesh. For that case storing data for a full sphere is wasteful, since half of the data will point “into” the surface and therefore won’t be useful. With SG’s this is fairly trivial to correct: we can just choose to solve for lobes that only lie on the upper hemisphere surrounding the surface’s normal direction.

*The left side shows a configuration where 9 SG lobes are distributed about a sphere, forming a full spherical probe. The right side shows 5 SG lobes located on a hemisphere surrounding the normal of a surface (the blue arrow).*
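As a rough illustration of restricting the lobes to the upper hemisphere, here’s a small Python sketch. The layout (a golden-angle spiral over the hemisphere) is purely an assumption for the sake of the example and not necessarily the configuration shown in the figure, and the function names are hypothetical. The lobes are generated in a local space where +Z is the surface normal and then rotated into world space using a tangent frame.

```python
import math

GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))

def hemisphere_axes(count):
    """Local-space lobe axes with z > 0, where +Z is the surface normal."""
    axes = []
    for k in range(count):
        z = 1.0 - (k + 0.5) / count            # heights in (0, 1)
        r = math.sqrt(1.0 - z * z)
        phi = k * GOLDEN_ANGLE
        axes.append((r * math.cos(phi), r * math.sin(phi), z))
    return axes

def to_world(local, tangent, bitangent, normal):
    """Rotate a local-space direction into world space using a tangent frame."""
    return tuple(local[0] * t + local[1] * b + local[2] * n
                 for t, b, n in zip(tangent, bitangent, normal))

# Example frame for a surface whose normal points along +Y.
tangent, bitangent, normal = (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
world_axes = [to_world(a, tangent, bitangent, normal) for a in hemisphere_axes(5)]
for axis in world_axes:
    print(tuple(round(c, 3) for c in axis))
```

Every axis ends up on the same side of the surface as the normal, so none of the lobes are wasted on directions pointing “into” the mesh.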

To fully generate the compact radiance representation for an entire 2D lightmap, we need to gather radiance samples at every texel location, and then perform a solve to fit the samples to a set of SG lobes. It’s really no different from the spherical probe case we used as a testbed in Probulator, except now we’re generating many probes. The other main difference is that for lightmaps, generating the appropriate radiance samples requires sampling a full 3D scene, as opposed to using an environment map as we did with Probulator. This sort of problem is best solved with a ray tracer, using an algorithm such as path tracing to compute the incoming radiance for a particular ray. The following image shows a visualization of what the lightmap result looks like for a simple scene:

*Hemispherical radiance probes generated at the texel locations of a 2D lightmap applied to a simple scene. Each probe uses 9 SG lobes oriented about the surface normal of the underlying geometry.*

In the previous article we covered techniques that can be used to compute a specular term from SG light sources. If we apply them to a set of lobes that approximate the incoming radiance, then we can compute an approximation of the full environment specular response. For small lobe counts this is only going to be practical for materials with a relatively high roughness. This is because our SG approximation of the incoming radiance won’t be able to capture high-frequency details from the environment, and it would be very obvious if those details were missing from the reflections on smooth surfaces. However for rougher surfaces where the BRDF itself starts to act as a low-pass filter, an SG approximation won’t have as much noticeable error. As an example, here’s what SG specular looks like for a test scene with a GGX roughness value of 0.25:

*Comparison of indirect specular approximation for a test scene with a GGX roughness of 0.25. The top-left image is a path-traced rendering of the final scene with full indirect and direct lighting. The top-right image shows the indirect environment specular term from SG9 lightmaps, with the exposure increased by 8x. The bottom left image shows the indirect specular term from L2 SH. The bottom right image shows the indirect specular term from a path-traced render of the scene.*

Compared to the ground truth, the SG approximation does pretty well in some places and not-so-well in others. In general it captures a lot of the overall specular response, but suffers from some of the higher-frequency detail being absent in the probes. This results in certain areas looking a bit “washed-out”, such as the right-most wall of the scene. You can also see that the reflections of the cylinder, sphere, and torus are not properly represented in the SG version for the same reason. On the positive side, lightmap samples are pretty dense in terms of their spatial distribution. They’re far more dense than what you typically achieve with sparse cubemap probes placed throughout the scene, which typically suffer from all kinds of occlusion and parallax artifacts. The SG specular also compares pretty favorably to the L2 SH result (despite having the same storage cost), which looks even more washed-out than the SG result. The SH implementation used a 3D lookup texture to store pre-computed SH coefficients, and you can see some interpolation artifacts from this method if you look at the far wall perpendicular to the camera.

In the first part of our presentation[10] at last year’s Physically Based Shading Course[11], Dave covered some of these details and also shared some information about how we implemented SG lighting probes into The Order: 1886. Much of the implementation was very similar to what I’ve described in this series of articles: we stored 5-12 SG lobes (the count was chosen per-level chunk) in our 2D lightmaps with fixed axis directions and sharpnesses, and we evaluated diffuse and specular lighting using the approximations that I outlined earlier. For dynamic meshes, we baked uniform 3D grids of spherical probes containing 9 SG lobes that were stored in 3D textures. The grids were defined by OBB’s that were hand-placed in the scene by our lighting artists, along with density parameters. In both cases we made use of hardware texture filtering to interpolate between neighboring probes before computing per-pixel lighting.

Much of our implementation closely followed the work of Wang[12] and Xu[13], at least in terms of the techniques used for approximating diffuse and specular lighting from a set of SG lobes. Where our work diverged quite a bit was in the choice to use fixed lobe directions and sharpness values. Both Wang and Xu generated their set of SG lobes by performing a solve on a single environment map, which produced the necessary axis, sharpness, and amplitude parameters. In our case, we always knew that we were going to need many probes in order to maintain high-fidelity pre-computed lighting for our scenes. At the time (early 2014) we were already employing 2D lightmaps containing L1 H-basis hemispherical probes (4 coefficients) and 3D grids containing L2 spherical harmonics probes. Both could be quite dense in spatial terms, which allowed capturing important shadowing detail.

To make SG’s work for these requirements, we had to carefully consider our available trade-offs. After getting a simple test-bed up and running where we could bake 2D lightmaps for a scene, it quickly became apparent that varying the axis directions per-texel wasn’t necessarily the best choice for us. Aside from the obvious issue of requiring more storage space and making the solve more complex and expensive, we also ran into issues resulting from interpolating the axis direction over a surface. The problem is most readily apparent at shadow boundaries: one texel might have visibility of a bright light source which causes a lobe to point in that direction, while its neighboring texel might have no visibility and thus could end up with a lobe pointing in a completely different direction. The axis would then interpolate between the two directions for pixels between the two texels, which can cause noticeable specular shifting. This isn’t necessarily an unsolvable problem (the Frequency Domain Normal Map Filtering paper[14] extended their EM solver with a term that attempts to align neighboring lobes for coherency), but considering our time and memory constraints it made sense to just sidestep the problem altogether. Ultimately we ended up using fixed lobe directions, using the UV tangent frame as the local coordinate space for lightmap probes. Tangent space is natural for this purpose since its Z axis is the surface normal of the mesh, and also because it tends to be continuous over a mesh (you’ll have discontinuities wherever you have UV seams, but artists tend to hide those as best they can anyway). For the 3D probe grids, the directions were in world space for simplicity.

After deciding to fix the lobe directions, we also ultimately decided to go with a fixed sharpness value as well. This of course has the same obvious benefits as fixing the axis direction (less storage, simpler solve), which were definitely appealing. However another motivating factor came from the way we were doing our solve. Or rather, our lack of a proper solve. Our early testbed performed all lightmap baking on the CPU, which allowed us to easily integrate packages like Eigen[15] so that we could use a battle-tested least squares solver. However our actual production baking farm at Ready at Dawn uses a Cuda baker that leverages Nvidia’s OptiX library to perform arbitrary ray tracing on a cluster of GPU’s.

While Cuda does have optimization libraries that could have achieved what we wanted, we faced a bigger problem: memory. Our baker worked by baking many sample points in parallel on the GPU, with a kernel program that would generate and trace the many rays required for monte carlo integration. When we previously used SH and H-basis this approach worked well: both SH and H-basis are built upon orthogonal basis functions, which allows for continuous integration by projecting each sample onto those basis functions. Gathering thousands of samples per-texel is feasible with this setup, since those samples don’t need to be explicitly stored in memory. Instead, each new sample is projected onto the in-progress result for the texel and then discarded.

This is not the case when performing a solve: the solver needs access to *all* of the samples, which means keeping them all around in memory. This is a big problem when you have many texels in flight simultaneously, and only a limited amount of GPU memory. Like the interpolation issue it’s probably not unsolvable, but we were really looking for a less risky approach that would be more of a drop-in replacement for the SH integration.
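To show why the orthogonal-basis case is so memory friendly, here’s a minimal Python sketch of that kind of running projection. This is not our baker’s code: it uses L1 SH with uniform sphere sampling, a made-up radiance function standing in for the ray tracer, and hypothetical names like `RunningProjection`. The point is only that each sample updates a handful of coefficients and is then thrown away.

```python
import math
import random

Y0 = 0.5 * math.sqrt(1.0 / math.pi)        # constant L0 SH basis function
Y1 = math.sqrt(3.0 / (4.0 * math.pi))      # scale of the three L1 basis functions

def sh_l1_basis(direction):
    """The 4 real SH basis functions up to L1, evaluated for a unit direction."""
    x, y, z = direction
    return (Y0, Y1 * y, Y1 * z, Y1 * x)

def uniform_sphere_direction(rng):
    """Uniformly distributed direction on the unit sphere (pdf = 1 / (4 * pi))."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(1.0 - z * z)
    return (r * math.cos(phi), r * math.sin(phi), z)

class RunningProjection:
    """Keeps only the in-progress coefficients, never the samples themselves."""

    def __init__(self, basis_count):
        self.coefficients = [0.0] * basis_count
        self.sample_count = 0

    def add_sample(self, radiance, basis_values):
        self.sample_count += 1
        inv_pdf = 4.0 * math.pi
        for k, basis in enumerate(basis_values):
            estimate = radiance * basis * inv_pdf
            # Incremental mean: fold the new estimate into the running average.
            self.coefficients[k] += (estimate - self.coefficients[k]) / self.sample_count

def fake_radiance(direction):
    """Stand-in for tracing a ray: brighter towards +Z (assumed test function)."""
    return 1.0 + 0.5 * direction[2]

rng = random.Random(1234)
projection = RunningProjection(4)
for _ in range(20000):
    d = uniform_sphere_direction(rng)
    projection.add_sample(fake_radiance(d), sh_l1_basis(d))

print([round(c, 3) for c in projection.coefficients])
```

No matter how many rays get traced, the per-texel state stays at 4 floats per channel. A least squares solve has no equivalent of `add_sample` in this form, since it needs the whole sample set (or at least an accumulated matrix) before it can produce an answer.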

Ultimately we ended up saying “screw it”, and projected onto the SG lobes as if they formed an orthogonal basis (even though they didn’t). Since the basis functions weren’t orthogonal the results ended up rather blurry compared to a least squares solve, which muddied some of the detail in the source environment for a probe. Here’s a comparison to show you what I mean:

*A comparison of different techniques for computing a set of SG lobes that approximate the radiance from an environment map*

Of the three techniques presented here, the naive projection is the worst at capturing the details from the source map and also the blurriest. As we saw earlier the least squares solve is the best at capturing sharp changes, but achieves this by over-darkening certain areas. NNLS is the “just right” fit for this particular case, doing a better job of capturing details compared to the projection but without using any negative lobes.

Before we switched to baking SG probes for our grids and lightmaps, we had a fairly standard system for applying pre-convolved cubemap probes that were hand-placed throughout our scenes. Previously these were our only source of environment specular lighting, but once we had SG’s working we began to use the SG specular approximation to compute environment specular directly from lightmaps and probe grids. Obviously our low number of SG’s was not sufficient for accurately approximating environment specular for smooth and mirror-like surfaces, so our cubemap system remained relevant. We ended up coming up with a simple scheme to choose between cubemap and lightmap specular per-pixel based on the surface roughness, with a small zone in between where we would blend between the two specular sources. The following images taken from our SIGGRAPH slides use color coding to show the specular source chosen for each pixel from one of our scenes:

*The top image shows a scene from The Order: 1886. The bottom image shows the same scene with a color coding applied to show the environment specular source for each pixel.*
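The roughness-based selection described above can be sketched in a few lines of Python. The threshold values and function names below are hypothetical placeholders (the actual values we shipped with aren’t listed here), and the linear ramp is just one plausible choice for the blend zone.

```python
def sg_specular_weight(roughness, blend_start=0.4, blend_end=0.5):
    """0 means cubemap only, 1 means SG specular only (thresholds are made up)."""
    t = (roughness - blend_start) / (blend_end - blend_start)
    return min(1.0, max(0.0, t))

def environment_specular(roughness, cubemap_specular, sg_specular):
    """Blend the two RGB specular sources based on surface roughness."""
    w = sg_specular_weight(roughness)
    return tuple(c * (1.0 - w) + s * w
                 for c, s in zip(cubemap_specular, sg_specular))

cubemap = (1.0, 0.0, 0.0)
lightmap = (0.0, 1.0, 0.0)
print(environment_specular(0.10, cubemap, lightmap))  # smooth: cubemap only
print(environment_specular(0.45, cubemap, lightmap))  # inside the blend zone
print(environment_specular(0.90, cubemap, lightmap))  # rough: SG specular only
```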

To finish off, here are some more comparison images taken from our SIGGRAPH slides:

*Several comparison images from The Order: 1886 showing scenes with and without environment specular from SG lightmaps*

We were quite happy with the improvements that come from the SG baking pipeline we introduced for The Order, but we also feel like we’ve barely scratched the surface. Our decision to use fixed lobe directions and sharpness values was probably the right one, but it also limits what we can do. When you take SG’s and compare them to a fixed set of basis functions like SH, perhaps the biggest advantage is the fact that you can use an arbitrary combination of lobes to represent a mix of high and low-frequency features. So for instance you can represent a sun and a sky by having one wide lobe with the sky color, and one very narrow and bright lobe oriented towards the sun. We gave up that flexibility when we decided to go with our simpler ad-hoc projection, and it’s something we’d like to explore further in the future. But until then we can at least enjoy the benefits of having a representation that allows for an environment specular approximation and also avoids ringing artifacts when approximating diffuse lighting.

Aside from the solve, I also think it would be worth taking the time to investigate better approximations for the specular BRDF. In particular I would like to try using something better than just evaluating the cosine, Fresnel, and shadow-masking terms at the center of the warped BRDF lobe. The assumption that those terms are constant over the lobe breaks down the most when the roughness is high, and in our case we’re only ever using SG specular for rough materials! Therefore I think it would be worth the effort to come up with a more accurate representation of those terms.

[1] Curve fitting – https://en.wikipedia.org/wiki/Curve_fitting

[2] Regression analysis – https://en.wikipedia.org/wiki/Regression_analysis

[3] Least squares – https://en.wikipedia.org/wiki/Least_squares

[4] scipy.optimize.curve_fit – http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html#scipy-optimize-curve-fit

[5] Probulator – https://github.com/kayru/Probulator

[6] Spreading points on a disc and on a sphere – http://blog.marmakoide.org/?p=1

[7] Linear least squares – https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)

[8] Non-linear least squares – https://en.wikipedia.org/wiki/Non-linear_least_squares

[9] Non-negative least squares – https://en.wikipedia.org/wiki/Non-negative_least_squares

[10] Advanced Lighting R&D at Ready At Dawn Studios – http://blog.selfshadow.com/publications/s2015-shading-course/rad/s2015_pbs_rad_slides.pptx

[11] SIGGRAPH 2015 Course: Physically Based Shading in Theory and Practice – http://blog.selfshadow.com/publications/s2015-shading-course/

[12] All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance – http://renpr.org/project/sg_ssdf.htm

[13] Anisotropic Spherical Gaussians – http://cg.cs.tsinghua.edu.cn/people/~kun/asg/

[14] Frequency Domain Normal Map Filtering – http://www.cs.columbia.edu/cg/normalmap/

[15] Eigen – http://eigen.tuxfamily.org/index.php?title=Main_Page

Part 1 – A Brief (and Incomplete) History of Baked Lighting Representations

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

In the previous article, we explored a few ways to compute the contribution of an SG light source when using a diffuse BRDF. While this is already useful, it would be nice to be able to work with more complex view-dependent BRDF’s so that we can also compute a specular contribution. In this article I’ll explain a possible approach for approximating the response of a microfacet specular BRDF when applied to an SG light, and also introduce the concept of Anisotropic Spherical Gaussians.

You probably recall from the past 5 years of physically based rendering presentations[1] that a standard microfacet BRDF takes the following structure:
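Written out with **i** as the light direction, **o** as the view direction, **h** as the half vector, and **n** as the surface normal (this particular notation is an assumption on my part, chosen to match the term names used below), the usual form is:

```latex
f(\mathbf{i}, \mathbf{o}) =
  \frac{F(\mathbf{o}, \mathbf{h}) \, G(\mathbf{i}, \mathbf{o}, \mathbf{h}) \, D(\mathbf{h})}
       {4 \, |\mathbf{n} \cdot \mathbf{i}| \, |\mathbf{n} \cdot \mathbf{o}|}
```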

Recall that *D*(h) is the *distribution* term, which tells us the percentage of active microfacets for a particular combination of view and light vectors. It’s also commonly known as the *normal distribution function*, or “NDF” for short. It’s generally parameterized on a *roughness* parameter, which essentially describes the “bumpiness” of the underlying microgeometry. Lower roughness values lead to sharp mirror-like reflections with a very narrow and intense specular lobe, while higher roughness values lead to more broad reflections with a wider specular lobe. Most modern games (including The Order: 1886) use the GGX (AKA Trowbridge-Reitz) distribution for this purpose.

*The top image shows a GGX distribution term with a roughness parameter of 0.5. The bottom image shows the same GGX distribution term with a roughness parameter of 0.1. For both graphs, the X axis represents the angle between the surface normal and the half vector.*
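For reference, here’s a small Python sketch of the GGX distribution term. It assumes the roughness value is used directly as the GGX alpha parameter (so `m2` is roughness squared), which is only one of several conventions in use, and the function names are my own. The numeric integral is a quick sanity check that the distribution is normalized over projected area.

```python
import math

def ggx_distribution(n_dot_h, roughness):
    """GGX / Trowbridge-Reitz NDF, treating roughness as the alpha parameter."""
    m2 = roughness * roughness
    denom = n_dot_h * n_dot_h * (m2 - 1.0) + 1.0
    return m2 / (math.pi * denom * denom)

def projected_area_integral(roughness, steps=20000):
    """Midpoint-rule estimate of the integral of D(h) * (n . h) over the hemisphere."""
    d_theta = 0.5 * math.pi / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * d_theta
        cos_theta = math.cos(theta)
        total += ggx_distribution(cos_theta, roughness) * cos_theta * math.sin(theta) * d_theta
    return 2.0 * math.pi * total

for roughness in (0.5, 0.1):
    peak = ggx_distribution(1.0, roughness)
    print(roughness, round(peak, 3), round(projected_area_integral(roughness), 4))
```

The peak at the normal grows as the roughness shrinks, while the integral stays at 1. That’s the narrow-and-intense vs. broad-and-dim behavior described above.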

Next we have G(i, o, h) which is referred to as the *geometry term* or the *masking-shadow function*. Eric Heitz’s paper on the masking-shadow function[2] has a fantastic explanation of how these terms work and why they’re important, so I would strongly recommend reading through it if you haven’t already. As Eric explains, the geometry term actually accounts for two different phenomena. The first is the local occlusion of reflections by other neighboring microfacets. Depending on the angle of the incoming lighting and the bumpiness of the microsurface, a certain amount of the lighting will be occluded by the surface itself, and this term attempts to model that. The other phenomenon handled by this term is visibility of the surface from the viewer. A surface that isn’t visible naturally can’t reflect light towards the viewer, and so the geometry term models this masking effect using the view direction and the roughness of the microsurface.

*The Smith visibility term for GGX as a function of the angle between the surface normal and the light direction. The roughness used is 0.25, and the angle between the normal and the view direction is 0.*
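Here’s a hedged Python sketch of the Smith term for GGX, using the separable form where the shadowing (light) and masking (view) factors are simply multiplied together. As with the distribution sketch, it assumes roughness is used directly as alpha, and the function names are hypothetical. The “visibility” variant folds the BRDF’s `4 * (n.l) * (n.v)` denominator into the term, which is a common optimization.

```python
import math

def smith_ggx_g1(n_dot_x, roughness):
    """Smith masking function for GGX in a single direction (light or view)."""
    m2 = roughness * roughness
    return 2.0 * n_dot_x / (n_dot_x + math.sqrt(m2 + (1.0 - m2) * n_dot_x * n_dot_x))

def smith_ggx_geometry(n_dot_l, n_dot_v, roughness):
    """Separable shadowing-masking term: G = G1(light) * G1(view)."""
    return smith_ggx_g1(n_dot_l, roughness) * smith_ggx_g1(n_dot_v, roughness)

def smith_ggx_visibility(n_dot_l, n_dot_v, roughness):
    """G / (4 * n.l * n.v), with the division cancelled analytically."""
    m2 = roughness * roughness
    def v1(n_dot_x):
        return 1.0 / (n_dot_x + math.sqrt(m2 + (1.0 - m2) * n_dot_x * n_dot_x))
    return v1(n_dot_l) * v1(n_dot_v)

roughness = 0.25
for n_dot_l in (1.0, 0.5, 0.1):
    print(n_dot_l, round(smith_ggx_geometry(n_dot_l, 1.0, roughness), 4))
```

With the view direction along the normal, the term starts at 1 when the light is also along the normal and falls off as the light approaches grazing angles.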

Finally we have F(o, h), which is the *Fresnel* term. This familiar term determines the amount of light that is reflected vs. the amount that is refracted or absorbed, which varies depending on the index of refraction for a particular interface as well as the angle of incidence. For a microfacet BRDF we compute the Fresnel term using the angle between the active microfacet direction (the half vector) and the light or view direction (it is equivalent to use either). This generally causes the reflected intensity to increase when both the viewing direction and the light direction are grazing with respect to the surface normal. In real-time graphics Schlick’s approximation is typically used to represent the Fresnel term, since it’s a bit cheaper than evaluating the actual Fresnel equations. It is also common to parameterize the Fresnel term on the reflectance at zero incidence (referred to as “F0”) instead of directly working with an index of refraction.

*Schlick’s approximation of the Fresnel term as a function of the angle between the half vector and the light direction. The graph uses a value of 0.04 for F0.*
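Schlick’s approximation is short enough to show in full. This Python sketch uses hypothetical function names, and the helper for converting an index of refraction to F0 assumes the surface is in air (or vacuum).

```python
def schlick_fresnel(f0, cos_theta):
    """Schlick's approximation, where cos_theta = dot(half_vector, light_dir)."""
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

def f0_from_ior(ior):
    """Reflectance at zero incidence for an interface between air and `ior`."""
    return ((ior - 1.0) / (ior + 1.0)) ** 2

f0 = f0_from_ior(1.5)   # roughly 0.04, typical for common dielectrics
for cos_theta in (1.0, 0.5, 0.0):
    print(cos_theta, round(schlick_fresnel(f0, cos_theta), 4))
```

At normal incidence you get F0 back, and at a fully grazing angle the reflectance goes to 1 regardless of F0.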

Let’s now return to our example of a Spherical Gaussian light source. In the previous article we explored how to approximate the purely diffuse contribution from such a light source, but how do we handle the specular response? If we go down this road, we ideally want to do it in a way that uses a microfacet specular model so that we can remain consistent with our lighting from punctual light sources and IBL probes. If we start out by just plugging our BRDF and SG light into the rendering equation we get this monstrosity:
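Roughly speaking, if we write the SG light with amplitude *a*, sharpness *λ*, and axis **μ** (these symbol names are my own assumption here, mirroring the Amplitude, Sharpness, and Axis fields used in the code), it looks something like this:

```latex
L_o(\mathbf{o}) = \int_{\Omega}
  \frac{F(\mathbf{o}, \mathbf{h}) \, G(\mathbf{i}, \mathbf{o}, \mathbf{h}) \, D(\mathbf{h})}
       {4 \, (\mathbf{n} \cdot \mathbf{i}) (\mathbf{n} \cdot \mathbf{o})}
  \; a \, e^{\lambda (\boldsymbol{\mu} \cdot \mathbf{i} - 1)}
  \; (\mathbf{n} \cdot \mathbf{i}) \; d\mathbf{i}
```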

Unlike diffuse we now have multiple terms inside the integral, many of which are view-dependent. This suggests that we will need to combine several aggressive optimizations in order to get anywhere close to our desired result.

Let’s start out by taking the distribution term on its own, since it’s arguably the most important part of the specular BRDF. The distribution determines the overall shape and intensity of the resulting specular highlight, and has a widely-varying frequency response over the domain of possible roughness values. If we look at the graph of the GGX distribution from the earlier section, the shape does resemble a Gaussian. Unfortunately it’s not an exact match: the GGX distribution has the characteristic narrow highlight and long tails which we can’t match using a single SG. We could get a closer fit by summing multiple Gaussians, but for this article we’ll keep things simple by sticking with a single lobe. If we go back to Wang et al.’s paper[3] that we referenced earlier, we can see that they suggest a very simple fit of an SG to a Cook-Torrance distribution term:
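In the notation I’m using here (roughness *m*, and an SG written as G(direction; axis, sharpness, amplitude), both of which are naming assumptions on my part rather than a quote from the paper), the fit amounts to:

```latex
D(\mathbf{h}) = e^{-\left( \arccos(\mathbf{h} \cdot \mathbf{n}) / m \right)^2}
  \approx G\!\left(\mathbf{h}; \mathbf{n}, \tfrac{2}{m^2}, 1\right)
  = e^{\frac{2}{m^2} (\mathbf{h} \cdot \mathbf{n} - 1)}
```

The sharpness of 2 / m² falls out of the small-angle approximation cos θ − 1 ≈ −θ² / 2, which turns the SG exponent into −θ² / m² near the center of the lobe.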

If we look closely, we can see that they’re actually fitting for the Gaussian model mentioned in the original Cook-Torrance paper[4], which is similar to the one used in the Torrance-Sparrow model[5]. Note that this should not be confused with the Beckmann distribution that’s also mentioned in the Cook-Torrance paper, which is actually a 2D Gaussian in the slope domain (AKA parallel plane domain). However the shape isn’t even the biggest problem here, as the variant they’re using isn’t normalized. This means the peak of the Gaussian will always have a height of 1.0, rather than shrinking when the roughness increases and growing when the roughness decreases. Fortunately this is really easy to fix, since we have a simple analytical formula for computing the integral of an SG. Therefore if we set the amplitude to 1 over the integral, we end up with a normalized distribution:

    SG DistributionTermSG(in float3 direction, in float roughness)
    {
        SG distribution;
        distribution.Axis = direction;

        // Sharpness from the single-lobe fit: 2 / m^2
        float m2 = roughness * roughness;
        distribution.Sharpness = 2 / m2;

        // Amplitude of 1 / integral, so that the lobe is normalized
        distribution.Amplitude = 1.0f / (Pi * m2);

        return distribution;
    }
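As a quick sanity check on that amplitude, here’s a small Python sketch that plugs the values into the analytical SG integral, 2π · a / λ · (1 − e^(−2λ)). The function names are hypothetical. Note that this is the integral of the lobe over the whole sphere without any cosine factor, and that the amplitude of 1 / (π · m²) ignores the exponential term, which is why the result is only approximately 1 for very high roughness.

```python
import math

def sg_integral(amplitude, sharpness):
    """Analytical integral of an SG over the entire sphere."""
    return 2.0 * math.pi * amplitude / sharpness * (1.0 - math.exp(-2.0 * sharpness))

def distribution_term_sg(roughness):
    """Amplitude and sharpness matching the HLSL DistributionTermSG above."""
    m2 = roughness * roughness
    return 1.0 / (math.pi * m2), 2.0 / m2

for roughness in (0.25, 0.5, 1.0):
    amplitude, sharpness = distribution_term_sg(roughness)
    print(roughness, sg_integral(amplitude, sharpness))
```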

Let’s take a look at a graph of our distribution term, and see how far off it is from our target:

*Top graph shows a comparison between GGX, Beckmann, normalized Gaussian, and SG distribution terms with a roughness of 0.25. The bottom shows the same comparison with a roughness of 0.5.*

It should come as no surprise that our SG distribution is almost an exact match for a normalized version of a Gaussian distribution. When the roughness is lower it’s also a fairly close match for Beckmann, but for higher roughness the difference gets to be quite large. In both cases our distribution isn’t a perfect fit for GGX, but it’s a workable approximation.

So we now have a normal distribution function in SG format, but unfortunately we’re not quite ready to use it as-is. The problem is that we’ve defined our distribution in the half-vector domain: the axis of the SG points in the direction of the surface normal, and we use the half-vector as our sampling direction. If we want to use an SG product to compute the result of multiplying our distribution with an SG light source, then we need to ensure that the distribution lobe is in the same domain as our light source. Another way to think about this is that the center of our distribution shifts depending on viewing angle, since the half-vector also shifts as the camera moves.

In order to make sure that our distribution lobe is in the correct domain, we need to “warp” our distribution so that it lines up with the current BRDF slice for our viewing direction. If you’re wondering what a BRDF slice is, it basically tells you “if I pick a particular view direction, what is the value of my BRDF for a given light direction?”. So if you had a mirror BRDF, the slice would just be a single ray pointing in the direction of the view ray reflected off the surface normal. For microfacet specular BRDF’s, you get a lobe that’s roughly centered around the reflected view direction. Here’s what a polar graph of a GGX BRDF slice looks like if we only consider the distribution term:

*Polar graph of the GGX distribution term from two different viewing angles. The blue line is the view direction, the green line is the surface normal, and the pink line is the reflected view direction. The top image shows the resulting slice when the viewing angle is 0 degrees, and the bottom image shows the resulting slice when the viewing angle is 45 degrees.*

Wang et al. proposed a simple spherical warp operator that would orient the distribution lobe about the reflected view direction, while also modifying the sharpness to take into account the differential area at the original center of the lobe:

    SG WarpDistributionSG(in SG ndf, in float3 view)
    {
        SG warp;

        warp.Axis = reflect(-view, ndf.Axis);
        warp.Amplitude = ndf.Amplitude;
        warp.Sharpness = ndf.Sharpness;
        warp.Sharpness /= (4.0f * max(dot(ndf.Axis, view), 0.0001f));

        return warp;
    }
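To get a feel for what the warp does to the lobe, here is a minimal Python port of the function above. It is only a sketch: Python is used so it can run outside a shader, the tuple layout `(axis, sharpness, amplitude)` and the helper names are my own, and `reflect` follows the HLSL convention `i - 2 * n * dot(i, n)`.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reflect(i, n):
    # Same convention as the HLSL intrinsic: i - 2 * n * dot(i, n)
    d = dot(i, n)
    return tuple(x - 2.0 * y * d for x, y in zip(i, n))

def warp_distribution_sg(axis, sharpness, amplitude, view):
    """Port of WarpDistributionSG with the SG passed as separate values."""
    neg_view = tuple(-v for v in view)
    warped_axis = reflect(neg_view, axis)
    warped_sharpness = sharpness / (4.0 * max(dot(axis, view), 0.0001))
    return warped_axis, warped_sharpness, amplitude

normal = (0.0, 0.0, 1.0)
# View direction tilted 45 degrees away from the normal, in the XZ plane
view = (math.sin(math.radians(45.0)), 0.0, math.cos(math.radians(45.0)))

head_on = warp_distribution_sg(normal, 100.0, 1.0, normal)
tilted = warp_distribution_sg(normal, 100.0, 1.0, view)
print("head-on:", head_on)
print("45 degrees:", tilted)
```

Looking straight down the normal leaves the axis where it was and divides the sharpness by 4. At 45 degrees the axis becomes the view direction mirrored about the normal, and the divisor shrinks to `4 * cos(45°)`, so the warped lobe is sharper than in the head-on case.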

Let’s now look at the result of that warp, and compare it to what the actual GGX distribution looks like:

*Result of applying a spherical warp to the SG distribution term (green) compared with the actual GGX distribution (red). The top graph shows a viewing angle of 0 degrees, and the bottom graph shows a viewing angle of 45 degrees.*

A quick look at the graph shows us that the shape is a bit off, but our warped lobe is ending up in approximately the right spot. We’ll revisit the lobe shape later, but for now let’s try combining our distribution with the rest of our BRDF.

In the previous section we figured out how to obtain an SG approximation of our distribution term, and also warp it so that it’s in the correct domain for multiplication with our SG light source. Using our SG product operator would allow us to represent the result of multiplying the distribution with the light source as another SG, which we could then multiply with other terms using SG operators…or at least we could if we were to represent the remaining terms as SG’s. Unfortunately this turns out to be a problem: the geometry and Fresnel terms are nothing at all like a Gaussian, which rules out approximating them as an SG. Wang et al. sidestep this issue by making the somewhat-weak assumption that the values of these terms will be constant across the entire BRDF lobe, which allows them to pull the terms out of the integral and evaluate them only for the axis direction of the BRDF lobe. This allows the resulting BRDF to still capture some of the glancing angle effects, with similar performance cost to evaluating those terms for a punctual light source. The downside is that the error of these terms will increase as the BRDF lobe becomes wider (increasing roughness), since the value of the geometry and Fresnel terms will vary more the further they are from the lobe center. Putting it all together gives the following specular BRDF:

The last thing we need to account for is the cosine term that needs to be multiplied with the BRDF inside of the hemispherical integral. Wang et al. suggest using an SG product to compute an SG representing the result of multiplying the distribution term SG and their SG approximation of a clamped cosine lobe, which can then be multiplied with the lighting lobe using an SG inner product. In order to avoid another expensive SG operation, we will instead use the same approach that we used for geometry and Fresnel terms and evaluate the cosine term using the BRDF lobe axis direction. Implementing it in shader code gives us the following:

    float GGX_V1(in float m2, in float nDotX)
    {
        return 1.0f / (nDotX + sqrt(m2 + (1 - m2) * nDotX * nDotX));
    }

    float3 SpecularTermSGWarp(in SG light, in float3 normal,
                              in float roughness, in float3 view,
                              in float3 specAlbedo)
    {
        // Create an SG that approximates the NDF.
        SG ndf = DistributionTermSG(normal, roughness);

        // Warp the distribution so that our lobe is in the same
        // domain as the lighting lobe
        SG warpedNDF = WarpDistributionSG(ndf, view);

        // Convolve the NDF with the SG light
        float3 output = SGInnerProduct(warpedNDF, light);

        // Parameters needed for the visibility
        float3 warpDir = warpedNDF.Axis;
        float m2 = roughness * roughness;
        float nDotL = saturate(dot(normal, warpDir));
        float nDotV = saturate(dot(normal, view));
        float3 h = normalize(warpedNDF.Axis + view);

        // Visibility term evaluated at the center of
        // our warped BRDF lobe
        output *= GGX_V1(m2, nDotL) * GGX_V1(m2, nDotV);

        // Fresnel evaluated at the center of our warped BRDF lobe
        float powTerm = pow((1.0f - saturate(dot(warpDir, h))), 5);
        output *= specAlbedo + (1.0f - specAlbedo) * powTerm;

        // Cosine term evaluated at the center of the BRDF lobe
        output *= nDotL;

        return max(output, 0.0f);
    }
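Read as math, the shader above evaluates something like the following. This is my own transcription of the code, so the labels are assumptions: **r** is the warped lobe axis (the reflected view direction), **o** is the view direction, **h** is the normalized sum of **r** and **o**, and the integral is the SG inner product of the warped distribution lobe with the light.

$$
L_o \approx V_1(\mathbf{n}\cdot\mathbf{r})\, V_1(\mathbf{n}\cdot\mathbf{o})\, F(\mathbf{r}\cdot\mathbf{h})\, (\mathbf{n}\cdot\mathbf{r}) \int_{S^2} G_D^{w}(\mathbf{i})\, G_L(\mathbf{i})\, d\mathbf{i}
$$

$$
V_1(x) = \frac{1}{x + \sqrt{m^2 + (1 - m^2)\,x^2}}, \qquad F(x) = c_{spec} + (1 - c_{spec})(1 - x)^5
$$

Everything outside the integral is evaluated once, at the center of the warped lobe, which is exactly the "constant across the lobe" assumption described earlier. Here $m^2$ stands for the `m2` variable in the code.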

Let’s now (finally) take a look at what our specular approximation looks like for a scene being lit by an SG light source:

*A scene being lit by an SG light source using our diffuse and specular approximations. The scene is using a uniform roughness of 0.128.*

So if we look at the specular highlights in the above image, you may notice that while the highlights on the red and green walls look pretty good, the highlight on the floor seems a bit off. The highlight is rather wide and rounded, and our intuition tells us that a highlight viewed at a grazing angle should appear vertically stretched across the floor. To determine why the look is so off, we need to revisit our warp of the distribution term. Previously when we looked at the polar graph of the distribution I noted that the shape of the resulting lobe was a bit off, even though it was oriented in approximately the right direction to line up with the BRDF slice. To get a better idea of what’s going on, let’s now take a look at a 3D graph of the GGX distribution term:

*3D graph of the GGX distribution term. The left image shows the distribution when the viewing angle is very low, while the right image shows the distribution when the viewing angle is very steep.*

Looking at the left image where the angle between the view direction and the surface normal is very low, the distribution is radially symmetrical just like an SG lobe. However as the viewing angle increases the lobe begins to stretch, looking more and more like the non-symmetrical lobe that we see in the right image. The stretching of the lobe is what causes the stretched highlights that occur when applying the BRDF, and our warped SG is unable to properly represent it since it must remain radially symmetric about its axis.

Luckily for us, there is a better way. In 2013 Xu et al. released a paper titled Anisotropic Spherical Gaussians[6], where they explain how they extended SG’s to support anisotropic lobe width/sharpness. They’re defined like this:
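Written out to match the `EvaluateASG` function shown further down, the definition takes roughly this form (the symbol names are my own labelling, so check them against the paper):

$$
G(\mathbf{v};\, [\mathbf{x}, \mathbf{y}, \mathbf{z}],\, [\lambda, \mu],\, c) = c \cdot \max(\mathbf{v}\cdot\mathbf{z},\, 0) \cdot e^{-\lambda(\mathbf{v}\cdot\mathbf{x})^2 - \mu(\mathbf{v}\cdot\mathbf{y})^2}
$$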

Instead of having a single axis direction, an ASG now has **x**, **y**, and **z**, which are three orthogonal vectors forming a complete basis. It’s very similar to a tangent frame, where the normal, tangent, and bitangent together make up an orthonormal basis. With an ASG you also now have two separate sharpness parameters, λ and μ, which control the sharpness with respect to **x** and **y**. So for example setting λ to 16 and μ to 64 will result in a stretched Gaussian lobe that’s skinnier along the **y** direction, and with its center located at **z**. Visualizing such an ASG on the surface of a sphere gives you this:

*An Anisotropic Spherical Gaussian visualized on the surface of a sphere. λ has a value of 16, and μ has a value of 64.*

Like SG’s, the equations lend themselves to simple HLSL implementations:

    struct ASG
    {
        float3 Amplitude;
        float3 BasisZ;
        float3 BasisX;
        float3 BasisY;
        float SharpnessX;
        float SharpnessY;
    };

    float3 EvaluateASG(in ASG asg, in float3 dir)
    {
        float sTerm = saturate(dot(asg.BasisZ, dir));
        float lambdaTerm = asg.SharpnessX * dot(dir, asg.BasisX) * dot(dir, asg.BasisX);
        float muTerm = asg.SharpnessY * dot(dir, asg.BasisY) * dot(dir, asg.BasisY);
        return asg.Amplitude * sTerm * exp(-lambdaTerm - muTerm);
    }
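As a quick sanity check of how the two sharpness values shape the lobe, here is a small Python port of `EvaluateASG` with a scalar amplitude. It is a sketch for experimentation only; the function signature and the test directions are my own choices, not part of the paper or the shader.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def evaluate_asg(amplitude, basis_x, basis_y, basis_z,
                 sharpness_x, sharpness_y, direction):
    """Scalar-amplitude port of EvaluateASG."""
    s_term = min(max(dot(basis_z, direction), 0.0), 1.0)
    lambda_term = sharpness_x * dot(direction, basis_x) ** 2
    mu_term = sharpness_y * dot(direction, basis_y) ** 2
    return amplitude * s_term * math.exp(-lambda_term - mu_term)

basis_x, basis_y, basis_z = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)

# Two directions tilted 15 degrees off the lobe center, one toward X, one toward Y
angle = math.radians(15.0)
toward_x = (math.sin(angle), 0.0, math.cos(angle))
toward_y = (0.0, math.sin(angle), math.cos(angle))

at_center = evaluate_asg(1.0, basis_x, basis_y, basis_z, 16.0, 64.0, basis_z)
along_x = evaluate_asg(1.0, basis_x, basis_y, basis_z, 16.0, 64.0, toward_x)
along_y = evaluate_asg(1.0, basis_x, basis_y, basis_z, 16.0, 64.0, toward_y)
print(at_center, along_x, along_y)
```

With sharpness 16 along X and 64 along Y, the value drops off much faster when tilting toward Y than toward X, which is the stretched shape described above.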

The ASG paper provides us with formulas for two operations that we can use to improve the quality of our specular approximation for SG light sources. The first is a new warping operator that can take an NDF represented as an isotropic SG, and stretch it along the view direction to produce an ASG that better represents the actual BRDF. The other helpful formula it provides is for convolving an ASG with an SG, which we can use to convolve an anisotropically warped NDF lobe with an SG lighting lobe. Let’s take a look at how their improved warp looks when graphing the NDF for a large viewing angle:

*The left image is a 3D graph of the distribution term when using a spherical warp. The middle image is the resulting distribution term when using an anisotropic warp. The right image is the actual GGX distribution term*.

The anisotropic distribution looks much closer to the actual GGX NDF, since it now has the vertical stretching that we were missing. Let’s now implement their formulas in HLSL so we can try the new warp in our test scene:

    float3 ConvolveASG_SG(in ASG asg, in SG sg)
    {
        // The ASG paper specifies an isotropic SG as
        // exp(2 * nu * (dot(v, axis) - 1)),
        // so we must divide our SG sharpness by 2 in order
        // to get the nu parameter expected by the ASG formula
        float nu = sg.Sharpness * 0.5f;

        ASG convolveASG;
        convolveASG.BasisX = asg.BasisX;
        convolveASG.BasisY = asg.BasisY;
        convolveASG.BasisZ = asg.BasisZ;

        convolveASG.SharpnessX = (nu * asg.SharpnessX) / (nu + asg.SharpnessX);
        convolveASG.SharpnessY = (nu * asg.SharpnessY) / (nu + asg.SharpnessY);

        convolveASG.Amplitude = Pi / sqrt((nu + asg.SharpnessX) * (nu + asg.SharpnessY));

        float3 asgResult = EvaluateASG(convolveASG, sg.Axis);
        return asgResult * sg.Amplitude * asg.Amplitude;
    }

    ASG WarpDistributionASG(in SG ndf, in float3 view)
    {
        ASG warp;

        // Generate any orthonormal basis with Z pointing in the
        // direction of the reflected view vector
        warp.BasisZ = reflect(-view, ndf.Axis);
        warp.BasisX = normalize(cross(ndf.Axis, warp.BasisZ));
        warp.BasisY = normalize(cross(warp.BasisZ, warp.BasisX));

        float dotDirO = max(dot(view, ndf.Axis), 0.0001f);

        // Second derivative of the sharpness with respect to how
        // far we are from basis Axis direction
        warp.SharpnessX = ndf.Sharpness / (8.0f * dotDirO * dotDirO);
        warp.SharpnessY = ndf.Sharpness / 8.0f;

        warp.Amplitude = ndf.Amplitude;

        return warp;
    }

    float3 SpecularTermASGWarp(in SG light, in float3 normal,
                               in float roughness, in float3 view,
                               in float3 specAlbedo)
    {
        // Create an SG that approximates the NDF
        SG ndf = DistributionTermSG(normal, roughness);

        // Apply a warping operation that will bring the SG from
        // the half-angle domain to the lighting domain.
        ASG warpedNDF = WarpDistributionASG(ndf, view);

        // Convolve the NDF with the light
        float3 output = ConvolveASG_SG(warpedNDF, light);

        // Parameters needed for evaluating the visibility term
        float3 warpDir = warpedNDF.BasisZ;
        float m2 = roughness * roughness;
        float nDotL = saturate(dot(normal, warpDir));
        float nDotV = saturate(dot(normal, view));
        float3 h = normalize(warpDir + view);

        // Visibility term
        output *= GGX_V1(m2, nDotL) * GGX_V1(m2, nDotV);

        // Fresnel
        float powTerm = pow((1.0f - saturate(dot(warpDir, h))), 5);
        output *= specAlbedo + (1.0f - specAlbedo) * powTerm;

        // Cosine term
        output *= nDotL;

        return max(output, 0.0f);
    }
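For reference, this is what the two helper functions compute when transcribed into math. The notation is my own reading of the HLSL rather than a quote from the paper: $\lambda_D$ and $a_D$ are the sharpness and amplitude of the SG NDF, **n** is its axis, **o** is the view direction, and $\boldsymbol{\mu}_L$, $\lambda_L$, $a_L$ describe the SG light.

$$
\lambda = \frac{\lambda_D}{8\,(\mathbf{o}\cdot\mathbf{n})^2}, \qquad \mu = \frac{\lambda_D}{8}, \qquad c = a_D
$$

$$
\nu = \frac{\lambda_L}{2}, \qquad
\int G_{ASG}\, G_L \approx a_L\, c\, \frac{\pi}{\sqrt{(\nu + \lambda)(\nu + \mu)}}\;
G\!\left(\boldsymbol{\mu}_L;\, [\mathbf{x}, \mathbf{y}, \mathbf{z}],\, \left[\tfrac{\nu\lambda}{\nu + \lambda}, \tfrac{\nu\mu}{\nu + \mu}\right],\, 1\right)
$$

As the view direction approaches grazing, $(\mathbf{o}\cdot\mathbf{n})^2$ shrinks and λ grows relative to μ, so the two sharpness values diverge and the lobe becomes increasingly anisotropic.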

If we swap out our old specular approximation for one that uses an anisotropic warp, our test scene now looks much better!

*Diffuse and specular approximations applied to an SG light source, using an anisotropic warp to approximate the NDF as an ASG.*

Many of the images that I used for visually comparing the SG BRDF approximations were generated using Disney’s BRDF Explorer. When we were doing our initial research into SG’s and figuring out how to implement them, BRDF Explorer was extremely valuable both for understanding the concepts and for experimenting with different variations. If you’d like to try this yourself, there’s a very easy way to do that courtesy of Nick Brancaccio. Nick was kind enough to create his own awesome WebGL version of BRDF Explorer, and it comes pre-loaded with options for comparing an approximate SG specular BRDF with the GGX BRDF. I would recommend checking it out if you’d like to play around with the BRDF’s and make some pretty 3D graphs!

[1] Background: Physics and Math of Shading (SIGGRAPH 2013 Course: Physically Based Shading in Theory and Practice) – http://blog.selfshadow.com/publications/s2013-shading-course/hoffman/s2013_pbs_physics_math_notes.pdf

[2] Understanding the Masking-Shadowing Function in Microfacet-Based BRDFs – http://jcgt.org/published/0003/02/03/

[3] All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance – http://research.microsoft.com/en-us/um/people/johnsny/papers/sg.pdf

[4] A Reflectance Model for Computer Graphics – http://inst.cs.berkeley.edu/~cs294-13/fa09/lectures/cookpaper.pdf

[5] Theory for Off-Specular Reflection From Roughened Surfaces – http://www.graphics.cornell.edu/~westin/pubs/TorranceSparrowJOSA1967.pdf

[6] Anisotropic Spherical Gaussians – http://cg.cs.tsinghua.edu.cn/people/~kun/asg/

[7] BRDF Explorer – https://github.com/wdas/brdf

[8] WebGL BRDF Explorer – https://depot.floored.com/brdf_explorer

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

In the previous post we covered a few of the universal properties of SG’s. Now that we have a few tools on our utility belt, let’s discuss an example of how we can actually use those properties to our advantage in a rendering scenario. Let’s say we have a surface point **x** being lit by a light source **L**, with the light source being represented by an SG named **G_L**. Recall from the previous article that the equation for computing the outgoing radiance towards the eye for a surface with a Lambertian diffuse BRDF looks like the following:
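In its standard form (the symbol names here are the usual textbook ones, which I am assuming match the earlier article), that equation is:

$$
L_o(\mathbf{o}, \mathbf{x}) = \frac{c_{diffuse}}{\pi} \int_{\Omega} L_i(\mathbf{i}, \mathbf{x}) \cos(\theta_i)\, d\Omega
$$

where $c_{diffuse}$ is the diffuse albedo, $L_i$ is the incoming radiance from direction **i**, and $\theta_i$ is the angle between **i** and the surface normal.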

For punctual light sources that are essentially a scaled delta function, computing this is as easy as N dot L. But we’re in trouble if we have an area light source, since we typically don’t have a closed form solution to the integral. But let’s suppose that we have some strange Gaussian light source, whose angular falloff can be exactly represented by an SG (normally area light sources are considered to have uniform emission over their surface, but let’s imagine we have a case where the emission is non-uniform). If we can treat the light as an SG, then we can start to consider some of the handy Gaussian tools that we laid out earlier. In particular the inner product starts to seem really useful: it gives us the result of integrating the product of two SG’s, which is basically what we’re trying to accomplish in our diffuse lighting equation. The big catch is that we’re not integrating the product of two SG’s, we’re instead integrating the product of an SG with a clamped cosine lobe. Obviously a Gaussian lobe has a different shape compared to a clamped cosine lobe, but perhaps if we squint our eyes from a distance we could substitute one for the other. This approach was taken by Wang et al.[1], who suggested fitting a cosine lobe to a single SG with **λ**=2.133 and **a**=1.17. If we follow in their footsteps, the diffuse calculation is straightforward:

    SG CosineLobeSG(in float3 direction)
    {
        SG cosineLobe;
        cosineLobe.Axis = direction;
        cosineLobe.Sharpness = 2.133f;
        cosineLobe.Amplitude = 1.17f;

        return cosineLobe;
    }

    float3 SGIrradianceInnerProduct(in SG lightingLobe, in float3 normal)
    {
        SG cosineLobe = CosineLobeSG(normal);
        return max(SGInnerProduct(lightingLobe, cosineLobe), 0.0f);
    }

    float3 SGDiffuseInnerProduct(in SG lightingLobe, in float3 normal, in float3 albedo)
    {
        float3 brdf = albedo / Pi;
        return SGIrradianceInnerProduct(lightingLobe, normal) * brdf;
    }

Not too bad, eh? Of course it’s worth taking a closer look at our cosine lobe approximation, since that’s definitely going to introduce some error. Perhaps the best way to do this is to look at the graphs of a real cosine lobe and our SG approximation side-by-side:

*Comparison of a clamped cosine lobe (red) with an SG approximation (blue)*

Just from looking at the graph it’s fairly obvious that an SG isn’t necessarily a great fit for a cosine lobe. First of all, the amplitude actually goes above 1, which might seem a bit weird at first glance. However it’s necessary to ensure that the area under the curve remains somewhat consistent with the cosine lobe, since there would otherwise be a loss of energy. The other weirdness stems from the fact that an SG never actually hits 0 anywhere on the sphere, hence the long “tail” on the graph of the SG. This essentially means that if the SG were integrated against a punctual light source, the lighting would “wrap” around the sphere past the point where N dot L is equal to 0. The situation actually isn’t all that different from an SH representation of a cosine lobe, which also extends past π/2:

*L1 (green) and L2 (purple) SH approximation of a clamped cosine lobe compared with an SG approximation (blue) and the actual clamped cosine (red).*

In the SH case the approximation actually goes negative, which is arguably worse than the long tail of the SG approximation. The L1 approximation is particularly bad in this regard. If at this point you’re trying to imagine what these approximations look like on a sphere, let me save you the trouble by providing an image:

*From left to right: actual clamped cosine lobe, SG cosine approximation, L2 SH cosine approximation*

Now that we’ve finished analyzing our approximation of a cosine lobe, we need to take a look at the actual results of computing diffuse lighting from an SG light source. Let’s start off by graphing the results of computing irradiance using an SG inner product, and compare it against what we get by using brute-force numerical integration to compute the result of multiplying the SG with an actual clamped cosine (*not* the approximate SG cosine lobe that we use for the inner product):

*The resulting irradiance from an SG light source (with sharpness of 4.0) as a function of the angle between the light source and the surface normal. The red graph is the result of using numerical integration to compute the integral of the SG light source multiplied with a clamped cosine, while the blue graph was computed using an SG inner product of the light source with a cosine lobe approximated as an SG.*

As you might expect, the inner product approximation has some error when compared with the “ground truth” provided by numerical integration. It’s worth pointing out that this error is purely a consequence of approximating the clamped cosine lobe as an SG: the inner product provides the exact result of the integral, and thus shouldn’t introduce any error on its own. Despite this error, the resulting irradiance isn’t hugely far off from our ground truth. The biggest difference is for the angles facing away from the light, where the SG inner product version has a stronger tail. Visualizing the resulting diffuse on a sphere gives us the following:

*The left sphere shows the resulting diffuse lighting from an SG light source with a sharpness of 4.0, where the irradiance was computed using Monte Carlo importance sampling. The right sphere shows the resulting diffuse lighting from computing irradiance using an SG inner product with an approximation of a cosine lobe.*

As an alternative to representing the cosine lobe with an SG and computing the inner product, we can consider a cheaper approximation. One advantage of working with SG’s is that each lobe is always symmetrical about its axis, which is also where its value is the highest. We also discussed earlier how we can compute the integral of an SG over the sphere, which gives us its total energy. This suggests that if we want to be frugal with our shader cycles, we can pull terms out of the integral over the sphere/hemisphere and only evaluate them for the SG axis direction. This obviously introduces error, but that error may be acceptable if the term we pull out is relatively “smooth”. If we apply this approximation to computing irradiance and diffuse lighting, we get this:
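In equation form, the approximation looks roughly like this. The notation is my own: $\boldsymbol{\mu}$, $\lambda$ and $a$ are the axis, sharpness and amplitude of the light SG $G_L$, and **n** is the surface normal.

$$
E(\mathbf{n}) = \int_{\Omega} G_L(\mathbf{i}) \max(\mathbf{n}\cdot\mathbf{i},\, 0)\, d\mathbf{i}
\;\approx\; \max(\mathbf{n}\cdot\boldsymbol{\mu},\, 0) \int_{S^2} G_L(\mathbf{i})\, d\mathbf{i}
\;\approx\; \max(\mathbf{n}\cdot\boldsymbol{\mu},\, 0)\, \frac{2\pi a}{\lambda}
$$

$$
L_o \approx \frac{c_{diffuse}}{\pi}\, E(\mathbf{n})
$$

The last step uses the approximate SG integral, which drops the $(1 - e^{-2\lambda})$ factor of the exact integral.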

Translating to HLSL, we get the following functions:

    float3 SGIrradiancePunctual(in SG lightingLobe, in float3 normal)
    {
        float cosineTerm = saturate(dot(lightingLobe.Axis, normal));
        return cosineTerm * 2.0f * Pi * (lightingLobe.Amplitude) / lightingLobe.Sharpness;
    }

    float3 SGDiffusePunctual(in SG lightingLobe, in float3 normal, in float3 albedo)
    {
        float3 brdf = albedo / Pi;
        return SGIrradiancePunctual(lightingLobe, normal) * brdf;
    }

If we overlay the graph of our super-cheap irradiance approximation on the graph we were looking at earlier, we get this:

*The resulting irradiance from an SG light source (with sharpness of 4.0) as a function of the angle between the light source and the surface normal. The red graph was computed using numerical integration, while the blue graph was computed using an SG inner product of the light source with a cosine lobe approximated as an SG. The green graph was computed by pulling the cosine term out of the integral, and multiplying it with the result of integrating the SG light about the sphere.*

The result shouldn’t be a surprise: it’s just a scaled version of the standard clamped cosine. It’s pretty obvious just by looking that this particular optimization will introduce quite a bit of error, particularly where theta is greater than π/2. But it is cheap, since we’ve effectively turned an SG into a point light. This makes it a useful tool for cases where we may want to approximate the convolution of an SG light source with a BRDF or some other function that isn’t easily represented as an SG.

So it’s nice to have a cheap option, but what if we want more accuracy than our inner product approximation? Fortunately for us, Stephen Hill was able to formulate another alternative approximation that directly fits a curve to the integral of a cosine lobe with an SG. His implementation is actually formulated for a normalized SG (where the integral about the sphere is equal to 1.0), but we can easily account for this by computing the integral and scaling the result by that value:

    float3 SGIrradianceFitted(in SG lightingLobe, in float3 normal)
    {
        const float muDotN = dot(lightingLobe.Axis, normal);
        const float lambda = lightingLobe.Sharpness;

        const float c0 = 0.36f;
        const float c1 = 1.0f / (4.0f * c0);

        float eml = exp(-lambda);
        float em2l = eml * eml;
        float rl = rcp(lambda);

        float scale = 1.0f + 2.0f * em2l - rl;
        float bias = (eml - em2l) * rl - em2l;

        float x = sqrt(1.0f - scale);
        float x0 = c0 * muDotN;
        float x1 = c1 * x;

        float n = x0 + x1;

        float y = saturate(muDotN);
        if(abs(x0) <= x1)
            y = n * n / x;

        float result = scale * y + bias;

        return result * ApproximateSGIntegral(lightingLobe);
    }

The result is very close to the ground truth, which is very cool considering that it might actually be cheaper than our inner product approximation!

*The resulting irradiance from an SG light source (with sharpness of 4.0) as a function of the angle between the light source and the surface normal. The red graph was computed using numerical integration, while the blue graph was computed using an SG inner product of the light source with a cosine lobe approximated as an SG. The orange graph was computed using Stephen Hill’s fitted curve approximation.*
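To put rough numbers on the three approximations, here is a Python sketch that ports them with a scalar amplitude and compares them at the one angle where the exact answer is easy to write down: the light axis aligned with the normal, where the irradiance reduces to a 1D integral with a closed form. Python is used only so the check can run standalone, and all function names are my own.

```python
import math

def saturate(value):
    return min(max(value, 0.0), 1.0)

def sg_inner_product(sharp_x, amp_x, sharp_y, amp_y, cos_between):
    # Length of (sharp_x * axis_x + sharp_y * axis_y) for unit axes
    # whose dot product is cos_between
    um_length = math.sqrt(sharp_x ** 2 + sharp_y ** 2
                          + 2.0 * sharp_x * sharp_y * cos_between)
    expo = math.exp(um_length - sharp_x - sharp_y) * amp_x * amp_y
    other = 1.0 - math.exp(-2.0 * um_length)
    return 2.0 * math.pi * expo * other / um_length

def irradiance_inner_product(sharpness, amplitude, mu_dot_n):
    # Cosine lobe approximated as an SG with sharpness 2.133, amplitude 1.17
    return max(sg_inner_product(sharpness, amplitude, 2.133, 1.17, mu_dot_n), 0.0)

def irradiance_punctual(sharpness, amplitude, mu_dot_n):
    return saturate(mu_dot_n) * 2.0 * math.pi * amplitude / sharpness

def irradiance_fitted(sharpness, amplitude, mu_dot_n):
    c0 = 0.36
    c1 = 1.0 / (4.0 * c0)
    eml = math.exp(-sharpness)
    em2l = eml * eml
    rl = 1.0 / sharpness
    scale = 1.0 + 2.0 * em2l - rl
    bias = (eml - em2l) * rl - em2l
    x = math.sqrt(1.0 - scale)
    x0 = c0 * mu_dot_n
    x1 = c1 * x
    n = x0 + x1
    y = saturate(mu_dot_n)
    if abs(x0) <= x1:
        y = n * n / x
    # Scale by the approximate SG integral, as the HLSL version does
    return (scale * y + bias) * 2.0 * math.pi * amplitude / sharpness

def irradiance_exact_aligned(sharpness, amplitude):
    # Closed form of 2*pi*a * integral over t in [0, 1] of t * exp(sharpness * (t - 1))
    inv = 1.0 / sharpness
    return 2.0 * math.pi * amplitude * (inv - inv * inv + math.exp(-sharpness) * inv * inv)

sharpness, amplitude = 4.0, 1.0
exact = irradiance_exact_aligned(sharpness, amplitude)
errors = {
    "inner product": abs(irradiance_inner_product(sharpness, amplitude, 1.0) - exact),
    "punctual": abs(irradiance_punctual(sharpness, amplitude, 1.0) - exact),
    "fitted": abs(irradiance_fitted(sharpness, amplitude, 1.0) - exact),
}
for name, error in errors.items():
    print(f"{name}: absolute error {error:.5f}")
```

At sharpness 4 with the lobe pointing along the normal, the fitted curve lands closest to the exact value, the inner product comes next, and the punctual version is furthest off. This only samples a single angle, so it is a spot check rather than a replacement for the graphs.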

If we once again visualize the result on the sphere and compare with our previous results, we get the following:

*The left sphere shows the resulting diffuse lighting from an SG light source with a sharpness of 4.0, where the irradiance was computed using an SG inner product with an approximation of a cosine lobe. The middle sphere shows the resulting diffuse lighting from computing irradiance using Monte Carlo importance sampling. The right sphere shows the resulting diffuse lighting from Stephen Hill’s fitted approximation.*

[1] All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance – https://www.microsoft.com/en-us/research/wp-content/uploads/2009/12/sg.pdf


In the previous article, I gave a quick rundown of some of the available techniques for representing a pre-computed distribution of radiance or irradiance for each lightmap texel or probe location. In this article, I’m going to cover the basics of Spherical Gaussians, which are a type of spherical radial basis function (SRBF for short). The concepts introduced here will serve as the core set of tools for working with Spherical Gaussians, and in later articles I’ll demonstrate how you can use those tools to form an alternative for approximating incoming radiance in pre-computed lightmaps or probes.

I should point out that this article is still going to be somewhat high-level, in that it won’t provide full derivations and background details for all formulas and operations. However it is my hope that the material here will be sufficient to gain a basic understanding of SG’s, and also use them in practical scenarios.

A Spherical Gaussian, or “SG” for short, is essentially a Gaussian function[1] that’s defined on the surface of a sphere. If you’re reading this, then you’re probably already familiar with how a Gaussian function works in 1D: you compute the distance from the center of the Gaussian, and use this distance as part of a base-e exponential. This produces the characteristic “hump” that you see when you graph it:

*A Gaussian in 1D centered at x=0, with a height of 3*

You’re probably also familiar with how it looks in 2D, since it’s very commonly used in image processing as a filter kernel. It ends up looking like what you would get if you took the above graph and revolved it around its axis:

*A Gaussian filter applied to a 2D image of a white dot, showing that the impulse response is effectively a Gaussian function in 2D*

A Spherical Gaussian still works the same way, except that it now lives on the surface of a sphere instead of on a line or a flat plane. If you’re having trouble visualizing that, imagine if you took the above image and wrapped it around a sphere like wrapping paper. It ends up looking like this:

*A Spherical Gaussian visualized on the surface of a sphere*

Since an SG is defined on a sphere rather than a line or plane, it’s parameterized differently than a normal Gaussian. A 1D Gaussian function always has the following form:
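For reference, the standard 1D form is shown below, where $a$ is the height of the peak, $b$ is the position of its center, and $c$ controls the width:

$$
g(x) = a\, e^{-\frac{(x - b)^2}{2c^2}}
$$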

The part that we need to change in order to define the function on a sphere is the “(x – b)” term. This part of the function essentially makes the Gaussian a function of the cartesian distance between a given point and the center of the Gaussian, which can be trivially extended into 2D using the standard distance formula. To make this work on a sphere, we must instead make our Gaussian a function of the angle between two unit direction vectors. In practice we do this by making an SG a function of the *cosine* of the angle between two vectors, which can be efficiently computed using a dot product like so:
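Written to match the `EvaluateSG` function shown a little further down, the definition is (with **v** as the direction being evaluated):

$$
G(\mathbf{v};\, \boldsymbol{\mu}, \lambda, a) = a\, e^{\lambda(\boldsymbol{\mu}\cdot\mathbf{v} - 1)}
$$

When **v** lines up with the axis the dot product is 1, the exponent is 0, and the result is just the amplitude. As **v** moves away from the axis the exponent goes negative and the value falls off.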

Just like a normal Gaussian, we have a few parameters that control the shape and location of the resulting lobe. First we have μ, which is the *axis*, or *direction* of the lobe. It effectively controls where the lobe is located on the sphere, and always points towards the exact center of the lobe. Next we have λ, which is the *sharpness* of the lobe. As this value increases, the lobe will get “skinnier”, meaning that the result will fall off more quickly as you get further from the lobe axis. Finally we have *a*, which is the *amplitude* or *intensity* of the lobe. If you were to look at a polar graph of an SG, it would correspond to the height of the lobe at its peak. The amplitude can be a scalar value, or for graphics applications we may choose to make it an RGB triplet in order to support varying intensities for different color channels. This all lends itself to a simple HLSL code definition:

    struct SG
    {
        float3 Amplitude;
        float3 Axis;
        float Sharpness;
    };

Evaluating an SG is also easily expressible in HLSL. All we need is a normalized direction vector representing the point on the sphere where we’d like to compute the value of the SG:

    float3 EvaluateSG(in SG sg, in float3 dir)
    {
        float cosAngle = dot(dir, sg.Axis);
        return sg.Amplitude * exp(sg.Sharpness * (cosAngle - 1.0f));
    }

Now that we know what a Spherical Gaussian is, what’s so useful about them anyway? One potential benefit is that they’re fairly intuitive: it’s not terribly hard to understand how the 3 parameters work, and how each parameter affects the resulting lobe. The other main draw is that they inherit a lot of useful properties of “regular” Gaussians, which makes them useful for graphics and other related applications. These properties have been explored and utilized in several research papers that were primarily aimed at achieving pre-computed radiance transfer (PRT) with both diffuse and specular material response. In particular, the paper entitled “All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance[2]” by Wang et al. was our main inspiration for pursuing SG’s at RAD.

So what are these useful Gaussian properties that we can exploit? For starters, taking the product of 2 Gaussian functions produces another Gaussian. For an SG, this is equivalent to visiting every point on the sphere, evaluating 2 different SG’s, and multiplying the two results. Since it’s an operation that takes 2 SG’s and produces another SG, it is sometimes referred to as a “vector” product. It’s defined as the following:
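Transcribed from the `SGProduct` function that follows (the subscript notation is my own), the product of two SG’s is another SG with these parameters:

$$
G_1(\mathbf{v})\, G_2(\mathbf{v}) = G\!\left(\mathbf{v};\; \frac{\boldsymbol{\mu}_m}{\lVert \boldsymbol{\mu}_m \rVert},\; \lambda_m \lVert \boldsymbol{\mu}_m \rVert,\; a_1 a_2\, e^{\lambda_m (\lVert \boldsymbol{\mu}_m \rVert - 1)}\right)
$$

$$
\lambda_m = \lambda_1 + \lambda_2, \qquad \boldsymbol{\mu}_m = \frac{\lambda_1 \boldsymbol{\mu}_1 + \lambda_2 \boldsymbol{\mu}_2}{\lambda_1 + \lambda_2}
$$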

In HLSL code, it looks like this:

    SG SGProduct(in SG x, in SG y)
    {
        float3 um = (x.Sharpness * x.Axis + y.Sharpness * y.Axis) /
                    (x.Sharpness + y.Sharpness);
        float umLength = length(um);
        float lm = x.Sharpness + y.Sharpness;

        SG res;
        res.Axis = um * (1.0f / umLength);
        res.Sharpness = lm * umLength;
        res.Amplitude = x.Amplitude * y.Amplitude * exp(lm * (umLength - 1.0f));

        return res;
    }
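A nice thing about this operation is that it is exact, which is easy to confirm numerically. The Python sketch below (my own port with scalar amplitudes; the tuple layout `(axis, sharpness, amplitude)` and the sample directions are arbitrary choices) evaluates two SG’s and their product SG in a handful of directions and records the largest relative difference.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)

def evaluate_sg(sg, direction):
    axis, sharpness, amplitude = sg
    return amplitude * math.exp(sharpness * (dot(direction, axis) - 1.0))

def sg_product(x, y):
    (x_axis, x_sharp, x_amp), (y_axis, y_sharp, y_amp) = x, y
    lm = x_sharp + y_sharp
    um = tuple((x_sharp * a + y_sharp * b) / lm for a, b in zip(x_axis, y_axis))
    um_length = math.sqrt(dot(um, um))
    axis = tuple(c / um_length for c in um)
    return axis, lm * um_length, x_amp * y_amp * math.exp(lm * (um_length - 1.0))

sg_a = (normalize((0.0, 0.0, 1.0)), 8.0, 2.0)
sg_b = (normalize((1.0, 0.0, 1.0)), 3.0, 0.5)
combined = sg_product(sg_a, sg_b)

max_relative_error = 0.0
for raw in [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.3, -0.5, 0.8), (-1.0, 1.0, -1.0)]:
    direction = normalize(raw)
    pointwise = evaluate_sg(sg_a, direction) * evaluate_sg(sg_b, direction)
    from_product = evaluate_sg(combined, direction)
    max_relative_error = max(max_relative_error,
                             abs(pointwise - from_product) / pointwise)

print("largest relative difference:", max_relative_error)
```

The difference should come out at the level of floating-point rounding, since the exponents of the two SG’s simply add.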

Gaussians have another really nice property in that their integrals have a closed-form solution, which is known as the error function[3]. The property also extends to SG’s, where we can compute the integral of an SG over the entire sphere:
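Consistent with the `SGIntegral` function shown further down, the integral over the sphere works out to:

$$
\int_{S^2} G(\mathbf{v})\, d\mathbf{v} = 2\pi \frac{a}{\lambda} \left(1 - e^{-2\lambda}\right)
$$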

Computing an integral will essentially tell us the total “energy” of an SG, which can be useful for lighting calculations. It can also be useful for *normalizing* an SG, which produces an SG that integrates to 1. Such a normalized SG is suitable for representing a probability distribution, such as an NDF. In fact, a normalized SG is actually equivalent to a von Mises-Fisher distribution[4] in 3D!

An SG integral is actually very cheap to compute…or at least it would be if we removed the exponential term. It turns out that the 1 − e^(−2λ) term actually approaches 1 very quickly as the SG’s sharpness increases, which means we can potentially drop it with little error as long as we know that the sharpness is high enough. Here’s what a graph of 1 − e^(−2λ) looks like for increasing sharpness:

*A graph of the exponential term in computing the integral of an SG over a sphere, which approaches 1 as the sharpness increases. The X-axis is sharpness, and the Y-axis is the value of 1 − e^(−2λ).*

This all lends itself naturally to HLSL implementations for accurate and approximate versions of an SG integral:

    float3 SGIntegral(in SG sg)
    {
        float expTerm = 1.0f - exp(-2.0f * sg.Sharpness);
        return 2 * Pi * (sg.Amplitude / sg.Sharpness) * expTerm;
    }

    float3 ApproximateSGIntegral(in SG sg)
    {
        return 2 * Pi * (sg.Amplitude / sg.Sharpness);
    }
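If you want to convince yourself that the closed form is right, and see how quickly the approximate version converges to it, a brute-force check is easy to write. The sketch below is Python with scalar amplitudes and my own function names. Because an SG is symmetric about its axis, the numerical integral only needs to run over the polar angle.

```python
import math

def sg_integral(sharpness, amplitude):
    exp_term = 1.0 - math.exp(-2.0 * sharpness)
    return 2.0 * math.pi * (amplitude / sharpness) * exp_term

def approximate_sg_integral(sharpness, amplitude):
    return 2.0 * math.pi * (amplitude / sharpness)

def numerical_sg_integral(sharpness, amplitude, steps=20000):
    # Midpoint rule over the polar angle; the azimuth contributes a factor of 2*pi
    d_theta = math.pi / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * d_theta
        value = amplitude * math.exp(sharpness * (math.cos(theta) - 1.0))
        total += value * math.sin(theta)
    return 2.0 * math.pi * total * d_theta

for sharpness in (0.5, 4.0, 32.0):
    exact = sg_integral(sharpness, 1.0)
    approx = approximate_sg_integral(sharpness, 1.0)
    numeric = numerical_sg_integral(sharpness, 1.0)
    print(f"sharpness {sharpness}: exact {exact:.6f}, "
          f"numeric {numeric:.6f}, approximate {approx:.6f}")
```

At low sharpness the approximate integral is visibly too large, while at sharpness 4 and above it is already very close to the exact value.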

If we were to use our SG integral formula to compute the integral of the product of two SG’s, we can compute what’s known as the *inner product*, or *dot product* of those SG’s. The operation is usually defined like this:
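In my own notation, with $d = \lVert \lambda_1 \boldsymbol{\mu}_1 + \lambda_2 \boldsymbol{\mu}_2 \rVert$, the usual form is:

$$
\int_{S^2} G_1(\mathbf{v})\, G_2(\mathbf{v})\, d\mathbf{v} = \frac{4\pi\, a_1 a_2}{e^{\lambda_1 + \lambda_2}} \cdot \frac{\sinh(d)}{d}
$$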

However we can avoid numerical precision issues by using an alternate arrangement:
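Using $d = \lVert \lambda_1 \boldsymbol{\mu}_1 + \lambda_2 \boldsymbol{\mu}_2 \rVert$ (my notation) and expanding $\sinh(d) = (e^{d} - e^{-d})/2$, the rearranged form that matches the HLSL is:

$$
\int_{S^2} G_1(\mathbf{v})\, G_2(\mathbf{v})\, d\mathbf{v} = 2\pi\, a_1 a_2\, \frac{e^{\,d - \lambda_1 - \lambda_2} \left(1 - e^{-2d}\right)}{d}
$$

Since $d \le \lambda_1 + \lambda_2$, every exponent here is zero or negative, so nothing can overflow.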

…which looks like this in HLSL:

    float3 SGInnerProduct(in SG x, in SG y)
    {
        float umLength = length(x.Sharpness * x.Axis + y.Sharpness * y.Axis);
        float3 expo = exp(umLength - x.Sharpness - y.Sharpness) * x.Amplitude * y.Amplitude;
        float other = 1.0f - exp(-2.0f * umLength);
        return (2.0f * Pi * expo * other) / umLength;
    }
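To see why the rearranged form matters, here is a small Python sketch comparing it against the textbook `sinh` form. The function names and the scalar-amplitude signature are mine, and I am assuming the "usual" definition is the `sinh` one. Python floats are 64-bit, so the overflow shows up at a much higher sharpness than it would with 32-bit floats in a shader, but the failure mode is the same.

```python
import math

def um_length(sharp_x, sharp_y, cos_between):
    # Length of (sharp_x * axis_x + sharp_y * axis_y) for unit axes
    return math.sqrt(sharp_x ** 2 + sharp_y ** 2
                     + 2.0 * sharp_x * sharp_y * cos_between)

def sg_inner_product_naive(sharp_x, amp_x, sharp_y, amp_y, cos_between):
    d = um_length(sharp_x, sharp_y, cos_between)
    return 4.0 * math.pi * amp_x * amp_y * math.sinh(d) / (math.exp(sharp_x + sharp_y) * d)

def sg_inner_product_stable(sharp_x, amp_x, sharp_y, amp_y, cos_between):
    d = um_length(sharp_x, sharp_y, cos_between)
    expo = math.exp(d - sharp_x - sharp_y) * amp_x * amp_y
    other = 1.0 - math.exp(-2.0 * d)
    return 2.0 * math.pi * expo * other / d

# Moderate sharpness: both arrangements agree
small_naive = sg_inner_product_naive(4.0, 1.0, 2.133, 1.17, 0.5)
small_stable = sg_inner_product_stable(4.0, 1.0, 2.133, 1.17, 0.5)
print(small_naive, small_stable)

# Very sharp, aligned lobes: sinh() and exp() blow up in the naive form
try:
    sg_inner_product_naive(500.0, 1.0, 500.0, 1.0, 1.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
large_stable = sg_inner_product_stable(500.0, 1.0, 500.0, 1.0, 1.0)
print(naive_overflowed, large_stable)
```

The stable version never evaluates an exponential with a positive argument, which is what keeps it finite for very sharp lobes.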

SG’s have what’s known as “compact-ε” support, which means that it’s possible to determine an angle θ such that all points within θ radians of the SG’s axis will have a value greater than ε. This property is potentially more useful if we flip it around so that we calculate a sharpness λ that results in a given θ for a particular value of ε:

    float SGSharpnessFromThreshold(in float amplitude, in float epsilon, in float cosTheta)
    {
        return (log(epsilon) - log(amplitude)) / (cosTheta - 1.0f);
    }
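To make the threshold calculation concrete, here is a worked example with illustrative numbers that I picked arbitrarily: an amplitude of 1, a threshold of 0.01, and an angle of 60 degrees. It assumes the SG has the form $a e^{\lambda(\cos\theta - 1)}$ at an angle θ from its axis, which is what the function above inverts.

```latex
\begin{aligned}
a &= 1, \quad \epsilon = 0.01, \quad \theta = 60^{\circ} \;\Rightarrow\; \cos\theta = 0.5 \\
\lambda &= \frac{\ln(0.01) - \ln(1)}{0.5 - 1} = \frac{-4.605}{-0.5} \approx 9.21 \\
\text{check: } & a \, e^{\lambda (\cos\theta - 1)} = e^{9.21 \times (-0.5)} = e^{-4.605} \approx 0.01
\end{aligned}
```

So a sharpness of roughly 9.21 gives an SG that has fallen to 1% of its peak value at 60 degrees from its axis.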

One last operation I’ll discuss is rotation. Rotating an SG is trivial: all you need to do is apply your rotation transform to the SG’s axis vector and you have a rotated SG! You can apply the transform using a matrix, a quaternion, or any other means you might have for rotating a vector. This is a welcome change from SH, which requires a very complex transform once you go above L1.
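A minimal sketch of SG rotation with a matrix might look like the following. The `SG` struct layout is assumed from the fields used by the earlier functions, `RotateSG` is a hypothetical name, and the `mul` argument order assumes the matrix is meant to transform column vectors; swap the arguments if your convention is the opposite.

```hlsl
// Assumed layout, mirroring the fields used by the earlier SG functions
struct SG
{
    float3 Amplitude;
    float3 Axis;
    float Sharpness;
};

// Rotates an SG by applying a rotation matrix to its axis.
// Amplitude and sharpness are unaffected by the rotation.
SG RotateSG(in SG sg, in float3x3 rotation)
{
    SG rotated = sg;

    // Re-normalize to guard against drift if the matrix isn't perfectly orthonormal
    rotated.Axis = normalize(mul(rotation, sg.Axis));
    return rotated;
}
```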

[1] Gaussian Function – https://en.wikipedia.org/wiki/Gaussian_function

[2] All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance – http://research.microsoft.com/en-us/um/people/johnsny/papers/sg.pdf

[3] Error Function – https://en.wikipedia.org/wiki/Error_function

[4] von Mises-Fisher Distribution – https://en.wikipedia.org/wiki/Von_Mises%E2%80%93Fisher_distribution

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

For part 1 of this series, I’m going to provide some background material for our research into Spherical Gaussians. The main purpose is to cover some of the alternatives to the approach we used for The Order: 1886, and also to help you understand why we decided to pursue Spherical Gaussians. The main emphasis is going to be on discussing what exactly we store in pre-baked lightmaps and probes, and how that data is used to compute diffuse or specular lighting. If you’re already familiar with the concepts of pre-computing radiance or irradiance and approximating them using basis functions like the HL2 basis or Spherical Harmonics, then you will probably want to skip to the next article.

Before we get started, here’s a quick glossary of the terms I use in the formulas:

- $L_{o}$ – the outgoing radiance (lighting) towards the viewer
- $L_{i}$ – the incoming radiance (lighting) hitting the surface
- $\mathbf{o}$ – the direction pointing towards the viewer (often denoted as “V” in shader code dealing with lighting)
- $\mathbf{i}$ – the direction pointing towards the incoming radiance hitting the surface (often denoted as “L” in shader code dealing with lighting)
- $\mathbf{n}$ – the direction of the surface normal
- $\mathbf{x}$ – the 3D location of the surface point
- $\int_{\Omega}$ – integral about the hemisphere
- $\theta_{i}$ – the angle between the surface normal and the incoming radiance direction
- $\theta_{o}$ – the angle between the surface normal and the outgoing direction towards the viewer
- $f()$ – the BRDF of the surface

Games have used pre-computed lightmaps for almost as long as they have been using shaded 3D graphics, and they’re still quite popular in 2016. The idea is simple: pre-compute a lighting value for every texel, then sample those lighting values at runtime to determine the final appearance of a surface. It’s a simple concept to grasp, but there are some details you might not think about if you’re just learning how they work. For instance, what exactly does it mean to store “lighting” in a texture? What exact value are we computing, anyway? In the early days the value fetched from the lightmap was simply multiplied with the material’s diffuse albedo color (typically done with fixed-function texture stages), and then directly output to the screen. Ignoring the issue of gamma correction and sRGB transfer functions for the moment, we can work backwards from this simple description to describe this old-school approach in terms of the rendering equation. This might seem like a bit of a pointless exercise, but I think it helps build a solid base that we can use to discuss more advanced techniques.

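So we know that our lightmap contains a single fixed color per-texel, and we apply it the same way regardless of the viewing direction for a given pixel. This implies that we’re using a simple Lambertian diffuse BRDF, since it lacks any sort of view-dependence. Recall that we compute the outgoing radiance for a single point using the following integral:

$$ L_{o}(\mathbf{o}, \mathbf{x}) = \int_{\Omega} f(\mathbf{i}, \mathbf{o}, \mathbf{x}) \, L_{i}(\mathbf{i}, \mathbf{x}) \cos(\theta_{i}) \, d\Omega $$

If we substitute the standard diffuse BRDF of $\frac{C_{diffuse}}{\pi}$ for our BRDF (where $C_{diffuse}$ is the diffuse albedo of the surface), then we get the following:

$$ L_{o}(\mathbf{o}, \mathbf{x}) = \int_{\Omega} \frac{C_{diffuse}}{\pi} L_{i}(\mathbf{i}, \mathbf{x}) \cos(\theta_{i}) \, d\Omega = \frac{C_{diffuse}}{\pi} \int_{\Omega} L_{i}(\mathbf{i}, \mathbf{x}) \cos(\theta_{i}) \, d\Omega $$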

On the right side we see that we can pull the constant terms out of the integral (the constant term is actually the entire BRDF!), and what we’re left with lines up nicely with how we handle lightmaps: the expensive integral part is pre-computed per-texel, and then the constant term is applied at runtime per-pixel. The “integral part” is actually computing the incident irradiance, which lets us finally identify the quantity being stored in the lightmap: it’s irradiance! In practice however most games would not apply the 1 / π term at runtime, since it would have been impractical to do so. Instead, let’s assume that the 1 / π was “baked” into the lightmap, since it’s constant for all surfaces (unlike the diffuse albedo, which we consider to be *spatially varying*). In that case, we’re actually storing a reflectance value that takes the BRDF into account. So if we wanted to be precise, we would say that it contains “the diffuse reflectance of a surface with $C_{diffuse}$ = 1.0”, AKA the maximum possible outgoing radiance for a surface with a diffuse BRDF.
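To tie this back to shader code, here is a minimal sketch of the two equivalent ways of applying a diffuse lightmap at runtime. The function and parameter names are hypothetical and chosen only for illustration.

```hlsl
static const float Pi = 3.14159265f;

// "Old-school" lightmapping: the lightmap texel is assumed to already contain
// irradiance * (1 / Pi), so the runtime only needs to multiply by the albedo.
float3 ShadeBakedLightmap(in float3 diffuseAlbedo, in float3 bakedLightmap)
{
    return diffuseAlbedo * bakedLightmap;
}

// Same result if the lightmap stores raw irradiance instead, with the
// full Lambertian BRDF (albedo / Pi) applied at runtime.
float3 ShadeIrradianceLightmap(in float3 diffuseAlbedo, in float3 irradiance)
{
    return (diffuseAlbedo / Pi) * irradiance;
}
```

Either way the per-pixel cost is a single multiply, which is why the split between a pre-computed integral and a runtime constant works so well.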

One of the key concepts of lightmapping is the idea of reconstructing the final surface appearance using data that’s stored at different rates in the spatial domain. Or in simpler words, we store lightmaps using one texel density while combining it with albedo maps that have a different (usually higher) density. This lets us retain the appearance of high-frequency details without actually computing irradiance integrals per-pixel. But what if we want to take this concept a step further? What if we also want the irradiance itself to vary in response to texture maps, and not just the diffuse albedo? By the early 2000’s normal maps were starting to see common use for this purpose, however they were generally only used when computing the contribution from punctual light sources. Normal maps were no help with light maps that only stored a single (scaled) irradiance value, which meant that pure ambient lighting would look very flat compared to areas using dynamic lighting:

*Areas in direct lighting (on the right) have a varying appearance due to a normal map, but areas in shadow (on the left) have no variation due to being lit by a baked lightmap containing only a single irradiance value.*

To make lightmaps work with normal mapping, we need to stop storing a single value and instead somehow store a *distribution* of irradiance values for every texel. Normal maps contain a range of normal directions, where those directions are generally restricted to the hemisphere around a point’s surface normal. So if we want our lightmap to store irradiance values for all possible normal map values, then it must contain a distribution of irradiance that’s defined for that same hemisphere. One of the earliest and simplest examples of such a distribution was used by Half-Life 2[1], and was referred to as Radiosity Normal Mapping[2]:

*Image from “Shading in Valve’s Source Engine”, SIGGRAPH 2006*

Valve essentially modified their lightmap baker to compute 3 values instead of 1, with each value computed by projecting the irradiance signal onto one of the corresponding orthogonal basis vectors in the above image. At runtime, the irradiance value used for shading would be computed by blending the 3 lightmap values based on the cosine of the angle between the normal map direction and the 3 basis directions (which is cheaply computed using a dot product). This allowed them to effectively vary the irradiance based on the normal map direction, thus avoiding the “flat ambient” problem described above.
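The runtime blend could be sketched roughly like this. The basis directions are the three orthogonal tangent-space vectors from Valve’s presentation; the function name and the simple clamped-cosine weighting are my own assumptions for illustration, and the exact weighting that Valve shipped may differ.

```hlsl
// The three HL2 basis directions in tangent space
static const float3 HL2Basis[3] =
{
    float3(-0.40824829f, -0.70710678f, 0.57735027f),   // (-1/sqrt(6), -1/sqrt(2), 1/sqrt(3))
    float3(-0.40824829f,  0.70710678f, 0.57735027f),   // (-1/sqrt(6),  1/sqrt(2), 1/sqrt(3))
    float3( 0.81649658f,  0.0f,        0.57735027f)    // ( sqrt(2/3),  0,         1/sqrt(3))
};

// Blends the 3 lightmap values using the clamped cosine between the
// tangent-space normal map direction and each basis direction.
float3 EvaluateHL2Lightmap(in float3 tangentNormal, in float3 lightmap[3])
{
    float3 weights;
    weights.x = saturate(dot(tangentNormal, HL2Basis[0]));
    weights.y = saturate(dot(tangentNormal, HL2Basis[1]));
    weights.z = saturate(dot(tangentNormal, HL2Basis[2]));

    return weights.x * lightmap[0] + weights.y * lightmap[1] + weights.z * lightmap[2];
}
```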

While this worked for their static geometry, there still remained the issue of applying pre-computed lighting to dynamic objects and characters. Some early games (such as the original Quake) used tricks like sampling the lightmap value at a character’s feet, and using that value to compute ambient lighting for the entire mesh. Other games didn’t even do that much, and would just apply dynamic lights combined with a global ambient term. Valve decided to take a more sophisticated approach that extended their hemispherical lightmap basis into a full spherical basis formed by 6 orthogonal basis vectors:

*Image from “Shading in Valve’s Source Engine”, SIGGRAPH 2006*

The basis vectors coincided with the 6 face directions of a unit cube, which led Valve to call this basis the “Ambient Cube”. By projecting irradiance in all directions around a point in space (instead of a hemisphere surrounding a surface normal) onto their basis functions, a dynamic mesh could sample irradiance for any normal direction and use it to compute diffuse lighting. This type of representation is often referred to as a *lighting probe*, or often just “probe” for short.
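Evaluating an ambient cube for a given normal can be sketched as follows, in the spirit of the evaluation shown in Valve’s course notes[1]. The face ordering of the `cube` array and the function name are assumptions made for this example.

```hlsl
// Evaluates an ambient cube for a normalized world-space normal.
// Assumed layout: cube[0] = +X, cube[1] = -X, cube[2] = +Y,
//                 cube[3] = -Y, cube[4] = +Z, cube[5] = -Z
float3 EvaluateAmbientCube(in float3 normal, in float3 cube[6])
{
    // The squared components of a unit vector sum to 1, so they act as blend weights
    float3 nSquared = normal * normal;

    // 0 selects the positive face and 1 selects the negative face for each axis
    uint xFace = normal.x < 0.0f ? 1 : 0;
    uint yFace = normal.y < 0.0f ? 1 : 0;
    uint zFace = normal.z < 0.0f ? 1 : 0;

    return nSquared.x * cube[xFace] +
           nSquared.y * cube[yFace + 2] +
           nSquared.z * cube[zFace + 4];
}
```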

With Valve’s basis we can combine normal maps and light maps to get diffuse lighting that can vary in response to high-frequency normal maps. So what’s next? For added realism we would ideally like to support more complex BRDF’s, including view-dependent specular BRDF’s. Half-Life 2 handled environment specular by pre-generating cubemaps at hand-placed probe locations, which is still a common approach used by modern games (albeit with the addition of pre-filtering[3] used to approximate the response from a microfacet BRDF). However the large memory footprint of cubemaps limits the practical density of specular probes, which can naturally lead to issues caused by incorrect parallax or disocclusion.

*A combination of incorrect parallax and disocclusion when using a pre-filtered environment as a source for environment specular. Notice the bright edges on the sphere, which are actually caused by the sphere reflecting itself!*

With that in mind it would be nice to be able to get some sort of specular response out of our lightmaps, even if only for a subset of materials. But if that is our goal, then our approach of storing an irradiance distribution starts to become a hindrance. Recall from earlier that with a diffuse BRDF we were able to completely pull the BRDF out of the irradiance integral, since the Lambertian diffuse BRDF is just a constant term. This is no longer the case even with a simple specular BRDF, whose value varies depending on both the viewing direction as well as the incident lighting direction.

If you’re working with the Half-Life 2 basis (or something similar), a tempting option might be to compute a specular term as if the 3 basis directions were directional lights. If you think about what this means, it’s basically what you get if you decide to say “screw it” and pull the specular BRDF out of the irradiance integral. So instead of Integrate(BRDF * Lighting * cos(theta)), you’re doing BRDF * Integrate(Lighting * cos(theta)). This will definitely give you *something,* and it’s perhaps a lot better than nothing. But you’ll also effectively lose out on a ton of your specular response, since you’ll only get specular when your viewing direction appropriately lines up with your basis directions according to the BRDF slice. To show you what I mean by this, here’s a comparison:

*The top image shows a path-traced rendering of a green wall being lit by direct sun lighting. The middle image shows the indirect specular component of the top image, with exposure increased by 4x. The bottom image shows the resulting specular from treating the HL2 basis directions as directional lights.*

Hopefully these images clearly show the problem that I’m describing. In the bottom image, you get specular reflections that look just like they came from a few point lights, since that’s effectively what you’re simulating. Meanwhile in the middle image with proper environment reflections, you can see that the entire green wall effectively acts as an area light, and you get very broad specular reflections across the entire floor. In general the problem tends to be less noticeable as roughness increases, since higher roughness naturally results in broader, less-defined reflections that are harder to notice.

If we want to do better, we must instead find a way to store a radiance distribution and then efficiently integrate it against our BRDF. It’s at this point that we turn to spherical harmonics. Spherical harmonics (SH for short) have become a popular tool for real-time graphics, typically as a way to store an approximation of indirect lighting at discrete probe locations. I’m not going to go into the full specifics of SH since that could easily fill an entire article[4] on its own. If you have no experience with SH, the key thing to know about them is that they basically let you approximate a function defined on a sphere using a handful of coefficients (typically either 4 or 9 floats per RGB channel). It’s sort-of as if you had a compact cubemap, where you can take a direction vector and get back a value associated with that direction. The big catch is that you can only represent very low-frequency (fuzzy) signals with lower-order SH, which can limit what sort of things you can do with it. You can project detailed, high-frequency signals onto SH if you want to, but the resulting projection will be very blurry. Here’s an example showing what an HDR environment map looks like projected onto L2 SH, which requires 27 coefficients for RGB:

*The top image is an HDR environment map containing incoming radiance values about a sphere, while the bottom image shows the result of projecting that environment onto L2 spherical harmonics.*

In the case of irradiance, SH can work pretty well since it’s naturally low-frequency. The integration of incoming radiance against the cosine term effectively acts as a low-pass filter, which makes it a suitable candidate for approximation with SH. So if we project irradiance onto SH for every probe location or lightmap texel, we can now do an SH “lookup” (which is basically a few computations followed by a dot product with the coefficients) to get the irradiance in any direction on the sphere. This means we can get spatial variation from albedo and normal maps just like with the HL2 basis!
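To show what such a “lookup” involves, here is a minimal sketch for L1 SH (4 coefficients per color channel), which keeps the example short. The coefficient ordering and the sign convention of the basis functions are assumptions here; they need to match whatever convention was used when projecting onto SH.

```hlsl
// Evaluates L1 SH (4 coefficients per color channel) in a given normalized direction.
// Basis constants: 0.282095 = sqrt(1 / (4 * Pi)), 0.488603 = sqrt(3 / (4 * Pi))
float3 EvaluateSH4(in float3 dir, in float3 coefficients[4])
{
    // A few computations to get the basis function values for this direction...
    float basis[4];
    basis[0] = 0.282095f;
    basis[1] = 0.488603f * dir.y;
    basis[2] = 0.488603f * dir.z;
    basis[3] = 0.488603f * dir.x;

    // ...followed by a dot product with the stored coefficients
    float3 result = 0.0f;
    for(uint i = 0; i < 4; ++i)
        result += basis[i] * coefficients[i];

    return result;
}
```

L2 SH works the same way with 5 additional quadratic basis functions, for a total of 9 coefficients per channel.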

It also turns out that SH is pretty useful for *computing* irradiance from input radiance, since we can do it really cheaply. In fact we can do it so cheaply that it can be done at runtime by folding it into the SH lookup process. The reason it’s so cheap is because SH is effectively a frequency-domain representation of the signal, and when you’re in the frequency domain convolutions can be done with simple multiplication. In the spatial domain, convolution with a cubemap is an N^2 operation involving many samples from an input radiance cubemap. If you’re interested in the full details, the process was described in Ravi Ramamoorthi’s seminal paper[5] from 2001, with derivations provided in another article[6].
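As a rough sketch of how simple that multiplication is, the following scales L1 SH radiance coefficients by the per-band cosine lobe factors from Ramamoorthi’s paper[5] (π for band 0 and 2π/3 for band 1; band 2 would use π/4). The function name and the 4-coefficient layout are assumptions made to keep the example short.

```hlsl
static const float Pi = 3.14159265f;

// Converts L1 SH radiance coefficients into irradiance coefficients
// by scaling every coefficient with the cosine lobe factor for its band.
void ConvolveWithCosineLobe(inout float3 coefficients[4])
{
    const float A0 = Pi;                  // band 0
    const float A1 = 2.0f * Pi / 3.0f;    // band 1

    coefficients[0] *= A0;
    coefficients[1] *= A1;
    coefficients[2] *= A1;
    coefficients[3] *= A1;
}
```

Since the factors are constants, they can also be folded directly into the basis constants used during the lookup so that no separate pass is needed.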

*The Stanford Bunny model being lit with diffuse lighting from an L2 spherical harmonics probe*

So we’ve established that SH works for approximating irradiance, and that we can convert from radiance to irradiance at runtime. But what does this have to do with specular? By storing an approximation of radiance instead of irradiance in our probes or lightmaps (albeit, a very blurry version of radiance), we now have the signal that we need to integrate our specular BRDF against in order to produce specular reflections. All we need is an SH representation of our BRDF, and we’re a dot product away from environment specular! The only problem we have to solve is how to actually *get* an SH representation of our BRDF.

Unfortunately a microfacet specular BRDF is quite a bit more complicated than a Lambertian diffuse BRDF, which makes our lives more difficult. For diffuse lighting we only needed to worry about the cosine lobe, which has the same shape regardless of the material or viewing direction. However a specular lobe will vary in shape and intensity depending on the viewing direction, material roughness, and the Fresnel term at zero incidence (AKA F_{0}). If all else fails, we can always use Monte Carlo techniques to pre-compute the coefficients and store the result in a lookup texture. At first it may seem like we need to parameterize our lookup table on 4 terms, since the viewing direction is two-dimensional. However we can drop a dimension if we follow in the footsteps[7] of the intrepid engineers at Bungie, who used a neat trick for their SH specular implementation in Halo 3[8]. The key insight that they shared was that the specular lobe shape doesn’t actually change as the viewer rotates around the local Z axis of the shading point (AKA the surface normal). It actually only changes based on the *viewing angle*, which is the angle between the view vector and the local Z axis of the surface. If we exploit this knowledge, we can pre-compute the coefficients for the set of possible viewing directions that are aligned with the local X axis. Then at runtime, we can rotate the coefficients so that the resulting lobe lines up with the actual viewing direction. Here’s an image to show you what I mean:

*Rotating a specular lobe from the X axis to its actual location based on the viewing direction, which is helpful for pre-computing the SH coefficients into a lookup texture*

So in this image the checkerboard is the surface being shaded, and the red, green and blue arrows are the local X, Y, and Z axes of the surface. The transparent lobe represents the specular lobe that we precomputed for a viewpoint that’s aligned with the X axis, but has the same viewing angle. The blue arrow shows how we can rotate the specular lobe from its original position to the actual position of the lobe based on the current viewing position, giving us the desired specular response. Here’s a comparison showing what it looks like in action:

*The top image is a scene rendered with a path tracer. The middle image shows the indirect specular as rendered by a path tracer, with exposure increased 4x. The bottom image shows the indirect specular term computed from an L2 SH lightmap, also with exposure increased by 4x.*

Not too bad, eh? Or at least…not too bad as long as we’re willing to store 27 coefficients per lightmap texel, and we’re only concerned with rough materials. The comparison image used a GGX α parameter of 0.39, which is fairly rough.

One common issue with SH is a phenomenon known as “ringing”, which is described in Peter-Pike Sloan’s Stupid Spherical Harmonics Tricks[9]. Ringing artifacts tend to show up when you have a very intense light source on one side of the sphere. When this happens, the SH projection will naturally result in negative lobes on the opposite side of the sphere, which can result in very low (or even negative!) values when evaluated. It’s generally not too much of an issue for 2D lightmaps, since lightmaps are only concerned with the incoming radiance for a hemisphere surrounding the surface normal. However these artifacts often show up in probes, which store radiance or irradiance about the entire sphere. The solution suggested by Peter-Pike Sloan is to apply a windowing function to the SH coefficients, which will filter out the ringing artifacts. However the windowing will also introduce additional blurring, which may remove high-frequency components from the original signal being projected. The following image shows how ringing artifacts manifest when using SH to compute irradiance from an environment with a bright area light, and also shows how windowing affects the final result:
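For illustration, here is a minimal sketch of windowing L2 SH coefficients with a Hanning-style window, which is one possible choice of window. The function name and the `windowSize` tuning parameter are assumptions for this example, and the coefficient ordering is assumed to go band by band (1 coefficient for band 0, 3 for band 1, 5 for band 2).

```hlsl
static const float Pi = 3.14159265f;

// Attenuates each SH band of an L2 projection (9 coefficients) using a
// Hanning-style window. Smaller window sizes filter more aggressively,
// which reduces ringing but also adds blurring. windowSize must be > 0.
void ApplyHanningWindow(inout float3 coefficients[9], in float windowSize)
{
    // Band index (l) for each of the 9 coefficients
    const uint bands[9] = { 0, 1, 1, 1, 2, 2, 2, 2, 2 };

    for(uint i = 0; i < 9; ++i)
    {
        float l = float(bands[i]);

        // Band 0 always keeps a weight of 1, higher bands are scaled down
        float weight = (l < windowSize) ? 0.5f * (1.0f + cos(Pi * l / windowSize)) : 0.0f;
        coefficients[i] *= weight;
    }
}
```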

*A sphere with a Lambertian diffuse BRDF being lit by a lighting environment with a strong area light source. The left image shows the ground-truth result of using Monte Carlo integration. The middle image shows the result of projecting radiance onto L2 SH, and then computing irradiance. The right image shows the result of applying a windowing function to the L2 SH coefficients before computing irradiance.*

[1] Shading in Valve’s Source Engine (SIGGRAPH 2006) – http://www.valvesoftware.com/publications/2006/SIGGRAPH06_Course_ShadingInValvesSourceEngine.pdf

[2] Half Life 2 / Valve Source Shading – http://www2.ati.com/developer/gdc/D3DTutorial10_Half-Life2_Shading.pdf

[3] Real Shading in Unreal Engine 4 – http://blog.selfshadow.com/publications/s2013-shading-course/karis/s2013_pbs_epic_notes_v2.pdf

[4] Spherical Harmonic Lighting: The Gritty Details – http://www.research.scea.com/gdc2003/spherical-harmonic-lighting.pdf

[5] An Efficient Representation for Irradiance Environment Maps – https://cseweb.ucsd.edu/~ravir/papers/envmap/

[6] On the Relationship between Radiance and Irradiance: Determining the illumination from images of a convex Lambertian object – https://cseweb.ucsd.edu/~ravir/papers/invlamb/

[7] The Lighting and Material of Halo 3 (Slides) – https://developer.amd.com/wordpress/media/2012/10/S2008-Chen-Lighting_and_Material_of_Halo3.pdf

[8] The Lighting and Material of Halo 3 (Course Notes) – http://developer.amd.com/wordpress/media/2013/01/Chapter01-Chen-Lighting_and_Material_of_Halo3.pdf

[9] Stupid Spherical Harmonics Tricks – http://www.ppsloan.org/publications/StupidSH36.pdf

So nearly a year and a half ago Dave Neubelt and I gave a presentation at SIGGRAPH where we described the approach that we developed for approximating incoming radiance using Spherical Gaussians in both our lightmaps and 3D probe grids. We had planned on releasing a source code demo as well as course notes that would serve as a full set of implementation details, but unfortunately those efforts were sidetracked by other responsibilities. We had actually gotten pretty far on a demo, and it started to seem pretty silly to let a whole year go by without releasing it. So I’ve cleaned it up, and put it on GitHub for all to learn from and/or make fun of. So if you’re just interested in seeing the code or running the demo, then go ahead and download it:

https://github.com/TheRealMJP/BakingLab

For those looking for some in-depth written explanation, I’ve also decided to write a series of blog posts that should hopefully shed some light on the basics of using SG’s in rendering. The first post provides background material by explaining common approaches to storing pre-computed lighting data in lightmaps and/or probes. The second post focuses on explaining the basics of Spherical Gaussians, and demonstrating some of their more useful properties. The third post explains how the various SG properties can be used to compute diffuse lighting from an SG light source. The fourth post goes even deeper and covers methods for approximating the specular contribution from an SG light source. The fifth post explores some approaches for using SG’s to create a compact approximation of a lighting environment, and compares the results with spherical harmonics. Finally, the sixth post discusses features present in the lightmap baking demo that we’ve released on GitHub.

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

Enjoy, and please leave a comment if you have questions, concerns, or corrections!

https://github.com/TheRealMJP/DeferredTexturing

https://github.com/TheRealMJP/DeferredTexturing/releases (Precompiled Binaries)

Unless you’ve been in a coma for the past year, you’ve probably noticed that there’s a lot of buzz and excitement around the new graphics API’s that are available for PC and mobile. One of the biggest changes brought by both D3D12 and Vulkan is that they’ve ditched the old slot-based system for binding resources that’s been in use since…forever. In place of the old system, both API’s have adopted a new model[1] based around placing opaque resource descriptors in contiguous ranges of GPU-accessible memory. The new model has the potential to be more efficient, since a lot of hardware (most notably AMD’s GCN-based GPU’s) can read their descriptors straight from memory instead of having to keep them in physical registers. So instead of having the driver take a slot-based model and do behind-the-scenes gymnastics to put the appropriate descriptors into tables that the shader can use, the app can just put the descriptors in a layout that works right from the start.

The new style of providing resources to the GPU is often referred to as “bindless”, since you’re no longer restricted to explicitly binding textures or buffers through dedicated API functions. The term “bindless” originally comes from Nvidia, who were the first to introduce the concept[2] through their NV_bindless_texture[3] extension for OpenGL. Their material shows some serious reductions in CPU overhead by skipping standard resource binding, and instead letting the app place 64-bit descriptor handles (most likely they’re actually pointers to descriptors) inside of uniform buffers. One major difference between Nvidia bindless and D3D12/Vulkan bindless is that the new APIs don’t allow you to simply put descriptor handles inside of constant/uniform buffers. Instead, they require you to manually specify (through a root signature) how you’ll organize your tables of descriptors for a shader. It might seem more complicated than the Nvidia extension, but doing it this way has a big advantage: it lets D3D12 still support hardware that has no support (or limited support) for pulling descriptors from memory. It also still allows you to go full-on Nvidia-style bindless via support for unbounded texture arrays in HLSL. With unbounded arrays you can potentially put all of your descriptors in one giant table, and then index into that array using values from root constants/constant buffers/structured buffers/etc. This basically lets you treat an integer the same as a “handle” in Nvidia’s approach, with the added benefit that you don’t need to actually store a full 64-bit integer. Not only can this be really efficient, but it also opens the door to new rendering techniques that use GPU-generated values to determine which textures to fetch from.
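A minimal sketch of an unbounded texture array indexed by an integer “handle” might look like this in Shader Model 5.1 HLSL. The resource names, register assignments, and constant buffer layout are all hypothetical, and the root signature on the app side would need to declare a matching unbounded descriptor range.

```hlsl
// Unbounded array of texture descriptors, placed in its own register space
// so that it doesn't overlap with other SRV bindings
Texture2D MaterialTextures[] : register(t0, space1);
SamplerState AnisoSampler : register(s0);

cbuffer DrawConstants : register(b0)
{
    uint AlbedoTextureIndex;    // an integer acting as a "handle" into MaterialTextures
}

float4 SampleAlbedo(in float2 uv)
{
    // The index comes from a constant buffer, so it's uniform across the draw
    Texture2D albedoMap = MaterialTextures[AlbedoTextureIndex];
    return albedoMap.Sample(AnisoSampler, uv);
}
```

`SampleAlbedo` relies on implicit derivatives, so as written it is intended to be called from a pixel shader.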

One such use case for bindless textures is deferred texturing. The concept is pretty straightforward: instead of writing out a G-Buffer containing all of the material parameters required for shading, you instead write out your interpolated UV’s as well as a material ID. Then during your deferred pass you can use your material ID to figure out which textures you need to sample, and use the UV’s from your G-Buffer to actually sample them. The main benefit is that you can ensure that your textures are only sampled for visible pixels, without worrying about overdraw or quad packing. Depending on your approach, you may also be able to save on some G-Buffer space by virtue of not having to cram every material parameter in there. In practice you actually need more than just UV and material ID. For normal mapping you need your full tangent frame, which at minimum requires a quaternion. For mipmaps and anisotropic filtering we also need the screen-space derivatives of our UV’s. These can be computed in the pixel shader and then explicitly stored in the G-Buffer, or you can compute them from the G-Buffer as long as you’re willing to live with occasional artifacts. Nathan Reed has a nice write-up[4] on his blog discussing the various possibilities for G-Buffer layouts, so I would suggest reading through his article for some more details.

The place where bindless helps is in the actual sampling of the material textures during the deferred lighting phase. By putting all of your texture descriptors into one big array, the lighting pass can index into that array in order to sample the textures for any given material. All you need is a simple mapping of material ID -> texture indices, which you can do by indexing into a structured buffer. Of course it’s not really *required* to have bindless in order to pull this off. If you’re willing to stuff all of your textures into a big texture array, then you could achieve the same thing on D3D10-level hardware. Or if you use virtual mapping, then it’s pretty trivial to implement since everything’s already coming from big texture atlases. In fact, the virtual mapping approach has already been used in a shipping game, and was described at SIGGRAPH last year[5][6]. That said, the bindless approach is probably the easiest to get running and also places the least constraints on existing pipelines and assets.
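The material ID to texture index mapping described above could be sketched like this. The struct layout, resource names, and register assignments are hypothetical; the key points are the structured buffer lookup, the non-uniform index into the descriptor array, and the use of explicit UV gradients.

```hlsl
// Hypothetical per-material table of indices into the big texture array
struct MaterialTextureIndices
{
    uint Albedo;
    uint Normal;
    uint Roughness;
    uint Metallic;
};

StructuredBuffer<MaterialTextureIndices> MaterialIndexBuffer : register(t0);
Texture2D MaterialTextures[] : register(t0, space1);
SamplerState AnisoSampler : register(s0);

float3 SampleMaterialAlbedo(in uint materialID, in float2 uv, in float2 uvDX, in float2 uvDY)
{
    MaterialTextureIndices indices = MaterialIndexBuffer[materialID];

    // The material can differ from pixel to pixel, so mark the index as non-uniform
    Texture2D albedoMap = MaterialTextures[NonUniformResourceIndex(indices.Albedo)];

    // Explicit gradients (stored in or reconstructed from the G-Buffer) are used,
    // since the deferred pass can't rely on implicit derivatives of the original geometry
    return albedoMap.SampleGrad(AnisoSampler, uv, uvDX, uvDY).xyz;
}
```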

As a way to test out the brand-new D3D12 version of my sample framework, I decided to have a go at writing a simple implementation of a deferred texturing renderer. It seemed like a good way to get familiar with some of the new features offered by D3D12, while also making sure that the new version of my sample framework was ready for prime-time. I also hoped that I could gain a better understanding of some of the practical issues involved in implementing deferred texturing, and use the experience to judge whether or not it might be an appealing choice for future projects.

Here’s a quick breakdown of how I ultimately set up my renderer:

- Bin lights and decals into clusters
  - 16×16 tiles aligned to screen space, 16 linearly-partitioned depth tiles
- Render sun and spotlight shadows
  - 4 2048×2048 cascaded shadow maps for the sun
  - 1024×1024 standard shadow maps for each spotlight
- Render scene to the G-Buffer
  - Depth (32bpp)
  - Tangent frame as a packed quaternion[7] (32bpp)
  - Texture coordinates (32bpp)
  - Depth gradients (32bpp)
  - (Optional) UV gradients (64bpp)
  - Material ID (8bpp)
    - 7-bit Material index
    - 1-bit for tangent frame handedness
- If MSAA is not enabled:
  - Run a deferred compute shader that performs deferred shading for all pixels
    - Read attributes from G-Buffer
    - Use material ID to get texture indices, and use those indices to index into descriptor tables
    - Blend decals into material properties
    - Apply sunlight and spotlights
  - Render the sky to the output texture, using the depth buffer for occlusion
- Otherwise if MSAA is enabled:
  - Detect “edge” pixels requiring per-sample shading using Material ID
  - Classify 8×8 tiles as either having edges or having no edges
  - Render the sky to an MSAA render target texture, using the depth buffer for occlusion
  - Run a deferred compute shader for non-edge tiles, shading only 1 subsample per pixel
  - Run a deferred compute shader for edge tiles, shading 1 subsample for non-edge pixels and all subsamples for edge pixels
  - Resolve MSAA subsamples to a non-MSAA texture
- Post-processing
- Present
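The 8-bit Material ID from the G-Buffer layout above could be packed and unpacked along these lines. The bit assignment (low 7 bits for the material index, high bit for handedness) and the function names are assumptions made for this sketch, not necessarily what the sample does.

```hlsl
// Packs a 7-bit material index and a 1-bit handedness flag into 8 bits.
// Assumed layout: bits 0-6 = material index, bit 7 = tangent frame handedness.
uint PackMaterialID(in uint materialIndex, in bool flipHandedness)
{
    return (materialIndex & 0x7Fu) | (flipHandedness ? 0x80u : 0u);
}

// Recovers the material index and a +1/-1 handedness sign for the tangent frame
void UnpackMaterialID(in uint packedID, out uint materialIndex, out float handedness)
{
    materialIndex = packedID & 0x7Fu;
    handedness = ((packedID & 0x80u) != 0u) ? -1.0f : 1.0f;
}
```

With 7 bits for the index, this layout can address at most 128 unique materials.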

In order to have another reference point for evaluating quality and performance, I also decided to implement a clustered forward[8] path side-by-side with the deferred renderer:

- Bin lights into clusters (16×16 tiles aligned to screen space, 16 linearly-partitioned depth tiles)
- Render sun and spotlight shadows
- (Optional) Render scene depth prepass
- Render scene with full shading
  - Read from material textures
  - Blend decals into material properties
  - Apply sunlight and spotlights
- If MSAA is enabled
  - Resolve MSAA subsamples to a non-MSAA texture
- Post-processing
- Present

As you may have already noticed, I used the same clustered approach to binning lights for both the deferred and forward paths. I did this because it was simpler than having two different approaches to light selection, and it also seemed more fair to take that aspect out of the equation when comparing performance. However you could obviously use whatever approach you’d like for binning lights when using deferred texturing, such as classic light bounds rasterization/blending or tile-based subfrustum culling[9].

To implement the actual binning, I used a similar setup to what was described in Emil Persson’s excellent presentation[8] from SIGGRAPH 2013. If you’re not familiar, the basic idea is that you chop up your view frustum into a bunch of subfrusta, both in screen-space XY as well as along the Z axis. This essentially looks like a voxel grid, except warped to fit inside the frustum shape of a perspective projection. This is actually rather similar to the approach used in tiled deferred or Forward+[10] rendering, except that you also bucket along Z instead of fitting each subfrustum to the depth buffer. This can be really nice for forward rendering, since it lets you avoid over-including lights for tiles with a large depth range.
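At shading time, finding the cluster for a pixel boils down to some simple index math. Here is a minimal sketch assuming the 16×16 screen tiles and 16 linear depth partitions listed earlier; the function name, parameter names, and the flattened index ordering are hypothetical.

```hlsl
static const uint ClusterTileSize = 16;    // pixels per tile in X and Y
static const uint NumZTiles = 16;          // number of linear depth partitions

// Returns the flattened index of the cluster containing a pixel.
// numXTiles and numYTiles are the screen dimensions divided by ClusterTileSize, rounded up.
uint ComputeClusterIndex(in uint2 pixelPos, in float viewSpaceZ, in float nearClip, in float farClip,
                         in uint numXTiles, in uint numYTiles)
{
    uint2 tileXY = pixelPos / ClusterTileSize;

    // Linear partitioning of [nearClip, farClip] into NumZTiles slices
    float normalizedZ = saturate((viewSpaceZ - nearClip) / (farClip - nearClip));
    uint tileZ = min(uint(normalizedZ * NumZTiles), NumZTiles - 1);

    return (tileZ * numXTiles * numYTiles) + (tileXY.y * numXTiles) + tileXY.x;
}
```

The returned index would then be used to fetch that cluster’s light list from whatever buffer the binning pass wrote.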

In the Avalanche engine they decided to perform their light binning on the CPU, which is feasible since the binning doesn’t rely on a depth buffer. Binning on the CPU can make a lot of sense, since a typical GPU approach to culling will often have many redundant calculations across threads and thread groups. It’s also possible to make it pretty fast through CPU parallelization, as demonstrated by Intel’s Clustered Forward Shading sample[11].

For my own implementation I decided that I wanted to stick with the GPU, and so I went with a different approach. Having shipped a game that used a Forward+-style renderer, I’m pretty disappointed with the results of using a typical subfrustum plane-based culling scheme in a compute shader. The dirty secret[12] of using plane/volume tests for frustum culling is that they’re actually quite prone to false positives. The “standard” test can only exclude when your bounding volume is completely on the wrong side of one or more of your frustum planes. Unfortunately this means that it will fail for cases where the bounding volume intersects multiple planes at points outside the frustum. Even more unfortunate is that this particular case becomes more likely as your bounding volumes become large relative to your frustum, which is typically the case for testing lights against subfrusta. Spotlights are especially bad in this regard, since the wide portion will often intersect the planes of subfrusta that are actually just outside the narrower tip:

The top-right image shows a 3D perspective view, where you can clearly see that the cone in this situation doesn’t actually intersect with the frustum. However if you look at the orthographic views, you can also see that the cone manages to be on both sides of the frustum’s right plane. In practice you end up getting results like the following (tiles colored in green are “intersecting” with the spotlight):

As you can see, you can end up with a ton of false positives. In fact towards the right we’re basically just filling in the screen-aligned AABB of the bounding cone, which turns into a whole lot of wasted work for those pixels.
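The false positive case is easy to reproduce in a toy setting. The sketch below is a 2D simplification that I'm assuming for illustration: a unit square stands in for the subfrustum, and a circle stands in for the light's bounding volume. The plane test can only reject a circle that is fully behind a single plane, so a large circle sitting diagonally off a corner is reported as intersecting even though it never touches the square.

```python
import math

# 2D stand-in for a subfrustum: the unit square, described by four planes.
# Each plane is (nx, ny, d); points inside satisfy nx*x + ny*y + d >= 0.
UNIT_SQUARE_PLANES = [(1, 0, 0), (-1, 0, 1), (0, 1, 0), (0, -1, 1)]


def plane_test(center, radius, planes=UNIT_SQUARE_PLANES):
    """Standard test: reject only if the circle is fully behind one plane."""
    for nx, ny, d in planes:
        if nx * center[0] + ny * center[1] + d < -radius:
            return False
    return True


def exact_test(center, radius):
    """Exact circle/square test using the closest point on the square."""
    closest_x = min(max(center[0], 0.0), 1.0)
    closest_y = min(max(center[1], 0.0), 1.0)
    return math.hypot(center[0] - closest_x, center[1] - closest_y) <= radius
```

For a circle centered at (2, 2) with radius 1.2, `plane_test` returns `True` while `exact_test` returns `False`: the circle straddles the right and top planes, but only at points outside the square.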

This drove me nuts on The Order, since our lighting artists liked to pack our levels full of large, shadow-casting spotlights. On the CPU side I would resort to expensive frustum-frustum tests for shadow-casting spotlights, which makes sense when you consider that you can save a lot of CPU and GPU time by culling out an entire shadow map. Unfortunately frustum/frustum wasn’t a realistic option for the tiled subfrustum culling that was performed on the GPU, and so for “important” lights I augmented the intersection test with a 2D mask generated via rasterization of the bounding cone. The results were quite a bit better:

For this demo, I decided to take what I had done on The Order and move it into 3D by binning into Z buckets as well. Binning in Z is a bit tricky, since you essentially want the equivalent of solid voxelization except in the projected space of your frustum. Working in projected space rules out some of the common voxelization tricks, and so I ended up going with a simple 2-pass approach. The first pass renders the backfaces of a light’s bounding geometry, and marks the light’s bit (each cluster stores a bitfield of active lights) in the furthest Z bucket intersected by the current triangle within a given XY bucket. To conservatively estimate the furthest Z bucket, I use pixel shader derivatives to get the depth gradients, and then compute the maximum depth at the corner of the pixel. This generally works OK, but when the depth gradient is large it’s possible to extrapolate off the triangle. To minimize the damage in these cases, I compute a view-space AABB of the light on the CPU, and clamp the extrapolated depth to this AABB. After the backfacing pass, the frontfaces are then rendered. This time, the pixel shader computes the minimum depth bucket, and then walks forward along the view ray until encountering the bucket that was marked by the backface pass. Here’s a visualization of the binning for a single light in my demo:
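The two passes described above can be sketched in Python for a single XY bucket. This is only an illustration under my own assumptions: `cluster_bits` is a hypothetical list holding one light bitfield per Z bucket, the depth extrapolation assumes depth varies linearly across the pixel, and the front-face walk simply stops at the last bucket if no marked bucket is found.

```python
def conservative_max_depth(center_depth, ddx_depth, ddy_depth, aabb_min_z, aabb_max_z):
    """Extrapolate depth to the farthest pixel corner, clamped to the light's AABB."""
    # With a linear depth function, the maximum over the pixel lies at a corner,
    # half a pixel away from the center along each axis.
    max_depth = center_depth + 0.5 * (abs(ddx_depth) + abs(ddy_depth))
    return min(max(max_depth, aabb_min_z), aabb_max_z)


def mark_backface(cluster_bits, far_bucket, light_index):
    """Pass 1: set the light's bit in the furthest Z bucket it reaches."""
    cluster_bits[far_bucket] |= 1 << light_index


def fill_from_frontface(cluster_bits, near_bucket, light_index):
    """Pass 2: walk forward from the nearest bucket until the marked one is hit."""
    bit = 1 << light_index
    z = near_bucket
    while z < len(cluster_bits) and not (cluster_bits[z] & bit):
        cluster_bits[z] |= bit
        z += 1
```

After marking bucket 5 for a light and filling from bucket 2, buckets 2 through 5 have that light's bit set and every other bucket is untouched.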

The pixels marked in red are the ones that belong to a cluster where the light is active. The triangle-shaped UI in the bottom right is a visualization that shows the active clusters for the XZ plane located at y=0. This helps you to see how well the clustering is working in Z, which is where the most errors occur in my implementation. Here’s another image showing the scene with all (20) lights enabled:

To do this robustly, you really want to use conservative rasterization[13]. I have this option in my demo, but unfortunately there are still no AMD GPU’s that support the feature. As a fallback, I also support forcing 4x or 8x MSAA modes to reduce the chance that the pixel shader won’t be executed for a covered tile. For The Order I used 8x MSAA, and it was never an issue in practice. It would really only be an issue if the light was *very* small on-screen, in which case you could probably just rasterize a bounding box instead. I should also point out that in my implementation the depth buffer is not used to accelerate the binning process, or to produce more optimally-distributed Z buckets. I implemented it this way so that there would not be additional performance differences when choosing whether or not to enable a Z prepass for the forward rendering path.

For rendering we’re going to need a G-Buffer in which we can store whatever information comes from our vertex data, as well as a material ID that we can use to look up the appropriate textures during the deferred pass. In terms of vertex information, we need both the tangent frame and the UV’s in order to sample textures and perform normal mapping. If we assume our tangent frame is an orthonormal basis, we can store it pretty compactly by using a quaternion. R16G16B16A16_SNORM is perfect for this use case, since it covers the expected range and provides great precision. However we can crunch it down[7] to 4 bytes per texel if we really want to keep it small (and we do!). The UV’s are stored in a 16-bit UNORM format, which gives us plenty of precision as long as we store frac(UV) to keep things between 0 and 1. In the same texture as the UV’s I also store screen-space depth gradients in the Z and W components. After that is an optional 64bpp texture for storing the screen-space UV gradients, which I’ll discuss in the next section. Finally, the G-Buffer also has an 8-bit texture for storing a material ID. The MSB of each texel is the handedness bit for the tangent frame, which is used to flip the direction of the bitangent once it’s reconstructed from a quaternion. This brings us to a total of 25 bytes per sample when storing UV gradients, and 17 when computing them instead.
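Two of the packing steps mentioned above are simple enough to sketch in Python: storing frac(UV) as a 16-bit UNORM value, and sharing the 8-bit material ID texel with the handedness bit. The function names are hypothetical, and the choice that a set bit means negative handedness is an assumption made for this example.

```python
import math

HANDEDNESS_BIT = 0x80  # MSB of the 8-bit material ID texel


def encode_uv_unorm16(u):
    """Store frac(u) as a 16-bit UNORM integer."""
    frac = u - math.floor(u)
    return min(int(frac * 65535.0 + 0.5), 65535)


def pack_material_id(material_id, handedness):
    """Pack a 7-bit material ID with the tangent frame handedness in the MSB."""
    assert 0 <= material_id < 128, "only 7 bits remain for the ID"
    # Assumption: the bit is set when the handedness is negative.
    return material_id | (HANDEDNESS_BIT if handedness < 0.0 else 0)


def unpack_material_id(texel):
    """Return (material_id, handedness) from a packed 8-bit texel."""
    handedness = -1.0 if texel & HANDEDNESS_BIT else 1.0
    return texel & 0x7F, handedness
```

One consequence of sharing the texel this way is that only 7 bits are left for the ID itself, which caps the scene at 128 materials.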

One issue with deferred texturing is that you can’t rely on automatic mip selection via screen-space UV gradients when sampling the material textures. The gradients computed in a pixel shader will be wrong for quads that span multiple triangles, and in a compute shader they’re not available at all. The simplest way to solve this is to obtain the gradients in the pixel shader (using ddx/ddy) when rendering the scene to the G-Buffer, and then store those gradients in a texture. Unfortunately this means storing 4 separate values, which requires an additional 8 bytes of data per pixel when using 16-bit precision. It also doesn’t help you at all if you require positional gradients during your deferred pass, which can be useful for things like gobos, decals, filterable shadows, or receiver plane shadow bias factors. Storing the full gradients of world or view-space position would be silly, but fortunately we can store depth gradients and use those to reconstruct position gradients. Depth gradients only need 2 values instead of 6, and we can use a 16-bit fixed-point format instead of floating-point. They also have the nice property of being constant across the surface of a plane, which makes them useful for detecting triangle edges.
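To show why the UV gradients matter, here is a rough Python sketch of how a mip level can be derived from them. It follows the common textbook approximation (log2 of the longest gradient measured in texels). Real hardware is free to do something more elaborate, so treat this as an illustration rather than a description of any particular GPU. The function name and the square-texture assumption are mine.

```python
import math


def mip_level(duv_dx, duv_dy, texture_size):
    """Approximate mip level from screen-space UV gradients (square texture assumed)."""
    # Scale the gradients from UV units into texel units.
    len_x = math.hypot(duv_dx[0] * texture_size, duv_dx[1] * texture_size)
    len_y = math.hypot(duv_dy[0] * texture_size, duv_dy[1] * texture_size)
    footprint = max(len_x, len_y)
    if footprint <= 0.0:
        return 0.0
    # Magnification (footprint below one texel) clamps to the top mip.
    return max(math.log2(footprint), 0.0)
```

If a neighboring pixel from an unrelated triangle sneaks into the gradient calculation, the footprint blows up and the sampler picks a mip that is far too blurry, which is exactly the artifact you get along triangle edges.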

Both the UV and depth gradients can be computed in the deferred pass by sampling values from neighboring pixels, but in practice I’ve found it’s actually somewhat tricky to get right. You have to be careful not to “walk off the triangle”, otherwise you might end up reading UV’s from a totally unrelated mesh. Unless of course your triangle is so small that none of your neighbors came from the same triangle, in which case walking off might be your only option. You also have to take care around UV seams, including any you might have created yourself by using frac()!

In my implementation, I decided to always store depth gradients (where “depth” in this case is the post-projection z/w value stored in the depth buffer) while supporting an option to either store or compute UV gradients. Doing it this way allowed me to utilize the depth gradients when trying to find suitable neighbor pixels for computing UV gradients, and also ensured that I always had high-quality positional gradients. The material ID was also useful here: by checking that a neighboring pixel used the same material and also had the same depth gradients, I could be relatively certain that the neighbor was either from the same triangle, or from a coplanar triangle. The shader code for this step can be found here, if you’re interested.
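The neighbor check can be sketched in a few lines of Python. This is a simplified stand-in rather than the actual shader: each pixel is a hypothetical dict with `uv`, `material_id`, and `depth_grad` entries, only the horizontal gradient is computed, and the frac() seam handling mentioned earlier is left out entirely.

```python
def uv_gradient_x(pixels, x, y, tolerance=1e-4):
    """Estimate dUV/dx from a horizontal neighbor on the same plane, or return None."""
    center = pixels[y][x]
    for step in (1, -1):  # try the right neighbor first, then the left
        nx = x + step
        if not 0 <= nx < len(pixels[y]):
            continue
        neighbor = pixels[y][nx]
        same_material = neighbor["material_id"] == center["material_id"]
        same_plane = all(
            abs(a - b) <= tolerance
            for a, b in zip(neighbor["depth_grad"], center["depth_grad"])
        )
        if same_material and same_plane:
            # Multiplying by step flips the sign when the left neighbor is used.
            return tuple((n - c) * step for n, c in zip(neighbor["uv"], center["uv"]))
    return None  # no suitable neighbor was found
```

Returning `None` corresponds to the failure case where no neighbor passes the test, and the caller then has to decide on a fallback.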

To assess the quality, let’s look at a visualization showing what the UV gradients look like when using ddx/ddy during forward rendering:

And here’s an image showing what computed UV gradients look like:

As you can see there’s quite a few places where my algorithm fails to detect neighbors (these pixels are dark), and other places where a neighboring pixel shouldn’t have been used at all (these pixels are bright). I’m sure the results could be improved with more time and cleverness, but you need to be careful that the amount of work done is enough to offset the cost of just writing out the gradients in the first place. On my GTX 970 it’s actually faster to store the gradients than it is to compute them when MSAA is disabled, but then it switches to the other way around once MSAA is turned on.

It’s worth noting that Sebastian’s presentation[5] mentions that they reconstruct UV gradients in their implementation (see page 45), although you can definitely see some artifacts around triangle edges in their comparison image. They also mention that they use “UV distance” to detect neighbors, which makes sense considering that they have unique UVs for their virtual texturing.

For non-MSAA rendering, the deferred pass is pretty straightforward. First, the G-Buffer attributes are read from their textures and used to compute the original pixel position and gradients. UV gradients are then read from the G-Buffer if present, or otherwise computed from neighboring pixels. The material ID from the G-Buffer is then used to index into a structured buffer that contains one element per material, where each element contains the descriptor indices for the material textures (albedo, normal, roughness, and metallic). These indices are used to index into a large descriptor table containing descriptors for every texture in the scene, so that the appropriate textures can be sampled using the pixel’s UV coordinates and derivatives.
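The double indirection from material ID to texture is easier to see in code. Below is a small Python sketch where plain lists stand in for the structured buffer and the descriptor table. The class and function names are made up for this example and do not come from the demo.

```python
from dataclasses import dataclass


@dataclass
class MaterialTextureIndices:
    """One structured buffer element: descriptor indices for a material's textures."""
    albedo: int
    normal: int
    roughness: int
    metallic: int


def fetch_material_textures(material_id, material_buffer, descriptor_table):
    """Resolve a G-Buffer material ID to the textures it references."""
    indices = material_buffer[material_id]
    return {
        "albedo": descriptor_table[indices.albedo],
        "normal": descriptor_table[indices.normal],
        "roughness": descriptor_table[indices.roughness],
        "metallic": descriptor_table[indices.metallic],
    }
```

Since the material ID can differ between neighboring pixels, the descriptor indices can diverge within a single group of threads, which is where the dynamic indexing support comes in.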

Once all of the surface and material parameters are read, they are passed into a function that performs the actual shading. This function will loop over all lights that were binned into the pixel’s XYZ cluster, and compute the reflectance for each light source. This requires evaluating the surface BRDF, and also applying a visibility term that comes from the light’s 1024×1024 shadow map. The shadow map is sampled with a 7×7 PCF kernel that’s implemented as an optimized, unrolled loop that makes use of GatherCmp instructions. This helps to make the actual workload somewhat representative of what you might have in an actual game, instead of biasing things too much towards ALU work. My scene also has a directional light from the sun, which uses 4 2048×2048 cascaded shadow maps for visibility. Finally, an ambient term is applied by means of a set of SH coefficients representing the radiance from the skydome.
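The per-cluster light loop boils down to iterating over the set bits of the cluster's bitfield. Here is a minimal Python sketch, assuming the bitfield fits in a single integer and using hypothetical `brdf` and `visibility` callables in place of the real BRDF evaluation and shadow map lookup.

```python
def shade_pixel(cluster_light_bits, lights, brdf, visibility):
    """Accumulate the contribution of every light whose bit is set for this cluster."""
    total = 0.0
    bits = cluster_light_bits
    while bits:
        # Isolate the lowest set bit and convert it to a light index.
        light_index = (bits & -bits).bit_length() - 1
        bits &= bits - 1  # clear that bit
        light = lights[light_index]
        total += brdf(light) * visibility(light)
    return total
```

Lights that were not binned into the cluster never reach the loop body, so the cost scales with the number of set bits rather than with the total light count.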

Just like any other kind of deferred rendering, MSAA needs special care. In particular we need to determine which pixels contain multiple unique subsamples, so that we can shade each subsample individually during the deferred pass. The key issue here is scheduling: the “edge” pixels that require subsample shading will typically be rather sparse in screen space, which makes dynamic branching a poor fit. The “old-school” way of scheduling is to create a stencil mask from the edge pixels, and then render in two pixel shader passes: one pass for per-pixel shading, another for per-sample shading that runs the shader at per-sample frequency. This can work better than a branch, but the hardware still may not be able to schedule it particularly well due to the sparseness of the edge pixels. It will also still need to make sure that the shader runs with 2×2 quads, which can result in a lot of needless helper executions.

The “newer” way to schedule edge pixels (and by “newer”, I mean 6 years old) is to use a compute shader that re-schedules threads within a thread group. Basically you detect edge pixels, append their location to a list in thread group shared memory, and then have the entire group iterate over that list once it finishes shading the first subsample. This effectively compacts the sparse list of edge pixels within a tile, allowing for coherent looping and branching. The downside is that you need to use shared memory, which can decrease your maximum occupancy if you use too much of it.
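Here is a serial Python sketch of that re-scheduling idea. A plain list stands in for the list in thread group shared memory, each element of `tile_pixels` represents one thread's pixel, and `is_edge` and `shade` are hypothetical callables supplied by the caller. A real compute shader would run the phases in parallel with barriers in between.

```python
def shade_tile(tile_pixels, num_subsamples, is_edge, shade):
    """Shade subsample 0 everywhere, then compact the extra edge work into batches."""
    results = {}
    edge_list = []  # stands in for a list kept in thread group shared memory

    # Phase 1: each "thread" shades subsample 0 of its pixel and records edge pixels.
    for pixel in tile_pixels:
        results[(pixel, 0)] = shade(pixel, 0)
        if is_edge(pixel):
            edge_list.append(pixel)

    # Phase 2: the remaining subsamples of edge pixels form a dense work list,
    # which the whole group consumes one group-sized batch at a time.
    work = [(pixel, s) for pixel in edge_list for s in range(1, num_subsamples)]
    group_size = max(len(tile_pixels), 1)
    num_batches = 0
    for start in range(0, len(work), group_size):
        for pixel, sample in work[start:start + group_size]:
            results[(pixel, sample)] = shade(pixel, sample)
        num_batches += 1
    return results, num_batches
```

With 64 threads, 4x MSAA, and 10 edge pixels, the 30 extra subsamples fit into a single batch where every lane except the idle tail is doing useful work, instead of 10 scattered threads each looping 3 times.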

In my sample, I use the compute shader approach but with a new twist. Instead of performing the edge detection in the deferred pass, I run an earlier pass that checks for edges and builds a mask. This pass uses append buffers to build a list of 8×8 tiles that contain edge pixels, as well as a separate list containing tiles that have no edges at all. The append counts are used as indirect arguments for ExecuteIndirect, so that the edge and non-edge tiles can be processed with two separate dispatches using two different shader permutations. This helps minimize overhead from shared memory usage, since the non-edge version of the compute shader doesn’t touch shared memory at all.
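The tile classification step can be sketched on the CPU in Python. Two ordinary lists play the role of the append buffers, and `edge_mask` is a hypothetical 2D array of booleans with one entry per pixel. The lengths of the two lists correspond to the counts that would feed the two indirect dispatches.

```python
TILE = 8  # tiles are 8x8 pixels


def classify_tiles(edge_mask):
    """Split the screen into 8x8 tiles, sorted into edge and non-edge lists."""
    height = len(edge_mask)
    width = len(edge_mask[0])
    edge_tiles, non_edge_tiles = [], []
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            has_edge = any(
                edge_mask[y][x]
                for y in range(ty, min(ty + TILE, height))
                for x in range(tx, min(tx + TILE, width))
            )
            target = edge_tiles if has_edge else non_edge_tiles
            target.append((tx // TILE, ty // TILE))
    return edge_tiles, non_edge_tiles
```

A single edge pixel is enough to send its whole tile down the more expensive path, so the benefit depends on how clustered the edges are in screen space.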

As for the actual edge detection, my sample supports two different approaches. The first approach only checks the material ID, and flags pixels that contain multiple material ID values. This is a very conservative approach, since it will only flag pixels where meshes with different materials overlap. The second approach is more aggressive, and additionally flags pixels with varying depth gradients. A varying depth gradient means that we have multiple triangles that are not coplanar, which means that we avoid tagging edges for the case of a tessellated flat plane. Here’s what the edge detection looks like using only the material ID:

…and with the more aggressive depth gradient check:
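For reference, the two edge tests described above (material ID only, or material ID plus depth gradients) could be sketched in Python as follows. The input layout, the tolerance value, and the function name are assumptions made for this example.

```python
def is_edge_pixel(subsamples, use_depth_gradients, tolerance=1e-4):
    """Return True if the pixel's subsamples need to be shaded individually.

    subsamples is a list of (material_id, (dz_dx, dz_dy)) tuples, one per subsample.
    """
    first_id, first_grad = subsamples[0]
    for material_id, grad in subsamples[1:]:
        if material_id != first_id:
            return True  # conservative test: different materials overlap
        if use_depth_gradients and any(
            abs(a - b) > tolerance for a, b in zip(grad, first_grad)
        ):
            return True  # aggressive test: subsamples lie on different planes
    return False
```

Two subsamples from coplanar triangles with the same material pass both tests, which is why a tessellated flat plane doesn't get flagged.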

One of the big advantages of traditional deferred shading is that you can modify your G-Buffer before computing your lighting. Lots of games take advantage of this by rendering deferred decals[14] into the scene for things like graffiti, blood splatter, debris, and posters. It’s much nicer than traditional forward-rendered decals, since you only need to light once per pixel even when accumulating multiple decals. The typical approach is to apply these decals in a deferred pass prior to lighting, where bounding volumes are rasterized representing the area of influence for each decal. In the pixel shader, the depth buffer is used to compute a surface position, which is then projected into 2D in order to compute a UV value. The projected UV can then be used to sample textures containing the decal’s surface properties, which are then written out to be blended into the G-Buffer. The end result is cheap decals that can work on complex scene geometry.

To use a typical deferred decal system you need two things: a depth buffer, and a G-Buffer. The depth buffer is needed for the projection part, and the G-Buffer is needed so that you can blend in your properties before shading. For forward rendering you can get a depth buffer through a depth prepass, but you’re out of luck for the G-Buffer part. We’re in a similar position with deferred texturing: we have depth, but our G-Buffer lacks the parameters that we’d typically want to modify from a decal. For The Order I worked around this by making a very specialized decal system. We would essentially accumulate values into a render target, with each channel of the texture corresponding to a specific hard-coded decal type: bullet damage/cratering, scorching, and blood. The forward shaders would then read in those values, and would use them to modify material parameters before performing the lighting phase of the shader. It worked, but it obviously was super-specific to our game and wasn’t at all generic. Although we did get two nice things from this approach: materials could customize how they reacted to decals (materials could actually allocate a fully-composited layer for the damaged areas inside of bullet decals), and decals would properly accumulate on top of one another.

The good news is that deferred projectors are definitely not the only way to do decals. You can actually remove the need for both a depth buffer *and* a G-Buffer by switching to the same clustered approach that you can use for lights, which is an idea I’ve been kicking around for a year or two now. You just need to build a per-cluster list of decals, iterate over the list in your shading pass, and apply the decal according to its projection. The catch is that our shading pass now needs access to the textures for every possible decal in the scene, and it needs to be able to access the appropriate texture based on a decal index. In D3D11 this would have meant using texture arrays or atlases, but with D3D12 we can potentially avoid these headaches thanks to the power of bindless.

So does a clustered approach to decals actually work? Why, yes it does! I implemented them in my app, and got them to work with both the clustered forward and deferred texturing rendering paths. I even made a picker so that you can splat decals wherever you want in the scene, which is lots of fun! The decals themselves are just a set of sci-fi color and normal maps, and were generously provided for free by Nobiax from DeviantArt[15]. They end up looking like this:

In my demo the decal color and normal are just blended with the surface parameters based on the alpha channel from the color texture. However one advantage of applying decals in this way is that you’re not restricted to framebuffer blending operations. So for instance, you can accumulate surface normals by reorienting the decal normals[19] to make them relative to the normals of the underlying surface.
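A stripped-down version of the clustered decal loop might look like the Python sketch below. It makes several simplifying assumptions: decals are axis-aligned boxes described by a `center` and `half_size`, the projection just maps the local XY position to a UV, each decal exposes a hypothetical `sample` callable that returns RGBA, and only the color is blended.

```python
def apply_decals(surface_color, position, cluster_decal_bits, decals):
    """Blend every decal binned into this cluster onto the surface color."""
    color = list(surface_color)
    bits = cluster_decal_bits
    while bits:
        decal_index = (bits & -bits).bit_length() - 1  # lowest set bit
        bits &= bits - 1
        decal = decals[decal_index]
        # Transform into the decal's local space, where the box spans [-1, 1].
        local = [
            (p - c) / s
            for p, c, s in zip(position, decal["center"], decal["half_size"])
        ]
        if any(abs(v) > 1.0 for v in local):
            continue  # the surface point is outside this decal's volume
        uv = (local[0] * 0.5 + 0.5, local[1] * 0.5 + 0.5)
        r, g, b, alpha = decal["sample"](uv)
        # Alpha-blend the decal color over the accumulated surface color.
        color = [c + (d - c) * alpha for c, d in zip(color, (r, g, b))]
    return tuple(color)
```

Since the blend happens in regular shader code instead of the output merger, the last line could be swapped for any accumulation rule you like.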

As for their performance, I unfortunately don’t have another decal implementation to compare against. However if I set up a test case where most of the screen is covered in around 11 decals, the cost of the deferred pass (no MSAA) goes from 1.50ms to 2.00ms on my GTX 970. If I branch over the decals entirely (including reading from the cluster bitfield buffer), then the cost drops to 1.33ms. For the forward path it costs about 1.55ms with no decals, 2.20ms with decals, and 1.38ms when branching over the entire decal step.

To measure performance, I captured GPU timing information via timestamp queries. All measurements were taken from the default camera position, at 1920×1080 resolution. The test scene is the CryTek Sponza (we really need a new test scene!) with 20 hand-placed spotlights, each casting a 1024×1024 shadow map. There’s also a directional light for the sun that uses 4 2048×2048 shadow cascades. The scene uses normal, albedo, roughness, and metallic maps courtesy of Alexandre Pestana[16]. Here’s what the frame breakdown looks like for the deferred texturing path, with no MSAA:

- Render Total: 7.40ms (7.47ms max)
- Cluster Update: 0.08ms (0.08ms max)
- Sun Shadow Map Rendering: 1.30ms (1.34ms max)
- Spot Light Shadow Map Rendering: 1.04ms (1.04ms max)
- G-Buffer Rendering: 0.67ms (0.67ms max)
- Deferred Rendering: 3.54ms (3.58ms max)
- Post Processing: 0.59ms (0.60ms max)

…and here’s what it looks like with 4x MSAA:

- Render Total: 9.86ms (9.97ms max)
- Cluster Update: 0.08ms (0.08ms max)
- Sun Shadow Map Rendering: 1.30ms (1.34ms max)
- Spot Light Shadow Map Rendering: 1.04ms (1.05ms max)
- G-Buffer Rendering: 1.48ms (1.49ms max)
- Deferred Rendering: 4.64ms (4.67ms max)
- Post Processing: 0.56ms (0.57ms max)
- MSAA Mask: 0.16ms (0.16ms max)
- MSAA Resolve: 0.21ms (0.21ms max)

The first timing number is an average over the past 64 frames, while the second number is the maximum time from the past 64 frames.

The following chart shows how some of the various configurations scale with increasing MSAA levels:

So there’s a few observations that we can make from this data. First of all, the deferred path generally does very well compared to the forward path. The forward path without a depth prepass is basically useless, which means we’re clearly suffering from overdraw problems. I do actually sort by depth when rendering the main forward pass, but my test scene doesn’t have sufficient granularity to achieve good front-to-back ordering. Enabling the depth prepass improves things considerably, but not enough to match the deferred performance. Once we enable MSAA things go a little differently, as the forward paths scale up better compared to the deferred rendering path. At 4x forward and deferred are nearly tied, but that is only when the deferred path uses the conservative material ID check for detecting edge pixels. The conservative path skips many edges in the test scene, and so the AA quality is inferior to the forward path. Using the aggressive depth gradient edge test brings the quality more in line with the forward path, but it’s also quite a bit more expensive. However I would also expect the forward path to scale more poorly with scene complexity, since pixel shader efficiency will only decrease as the triangle count increases. One other interesting observation we can make is that writing out the UV gradients doesn’t seem to be an issue for our test scene when running on my 970. With no MSAA it’s actually slightly faster (7.47ms vs 7.50ms) to just write out the gradients instead of computing them, but that changes by the time we get to 4x MSAA (9.97ms vs. 9.83ms).

I should point out that all of these timings were captured while using a heavy-duty, “thrash your cache” 7×7 PCF kernel that’s implemented as an unrolled loop using GatherCmp. It undoubtedly causes a large increase in memory traffic, and it probably causes an increase in register pressure as well. I would imagine that this is especially bad for the forward path, since everything is done in one pixel shader. As an alternative, I also have an option to revert back to a simple 2×2 PCF kernel that only uses a single GatherCmp (you can toggle it yourself by changing the “UseGatherPCF_” flag at the top of Shading.hlsl). This path is probably a better representation for games that use filterable shadow maps, and possibly games that use aggressive PCF sampling optimizations. Here’s what the data looks like:

Some of these results are quite different compared to the 7×7 case. The forward paths do much better than they did previously, especially at 4x MSAA. The deferred paths scale the same as they did before, with the aggressive edge detection again causing longer frame times.

At home I only have access to a GTX 970, and so that’s what I used for almost all of my testing, profiling, and optimization. However I was able to verify that the demo works on an AMD R9 280, as well as a GTX 980. I’ve posted a summary of all of the performance data in table form below (all timings in milliseconds):

If you’re interested in seeing the raw performance data that includes the per-frame breakdowns, I’ve uploaded them here: https://mynameismjp.files.wordpress.com/2016/03/deferred-texturing-timings.zip. You can also access the timing data in the app by clicking on the “Timings” button in the top-left corner of the screen.

So what kind of conclusions can we draw from this experiment? Personally I’m wary of extrapolating too much from such an artificial testing scenario with only one basis for comparison, but I think it’s safe to say that the bindless deferred texturing approach doesn’t have any sort of inherent performance issue that would render it useless for games. Or at least, it doesn’t on the limited hardware that I’ve used for testing. I was initially worried that the combination of SampleGrad and divergent descriptor indices would lead to suboptimal performance, but in the end it didn’t seem to be an issue for my particular setup. Although to be completely fair, the texture sampling ended up being a relatively small part of the shading process. It’s certainly possible that the situation could change if the number of textures were to increase, or if the indexing became more divergent due to increased material density in screen-space. But at the same time those situations could also lead to reduced performance during a forward pass or during a traditional G-Buffer pass, so it might end up being a wash anyway.

At least in my demo, the biggest performance issue for the deferred path seems to be MSAA. This shouldn’t really be a surprise, considering how hard it is to implement affordable MSAA with any deferred renderer. My hope would be that a deferred texturing approach does a bit better than a traditional deferred renderer with a very large G-Buffer, but unfortunately I don’t have data to prove that out. Ultimately it probably doesn’t even matter, since hardly anyone even bothers with MSAA these days.

What about limitations? Multiple UV sets are at best a headache, and at worst a non-option. I think you’d have to really like your artists to store another UV set in your G-Buffer, and also eat the cost of storing or computing another set of UV gradients. Not having a full G-Buffer might be an issue for certain screen-space techniques, like SSAO or screen-space reflections. I’ve shown that it’s possible to have decals even without a full G-Buffer, but it’s more complex and possibly more expensive than traditional deferred decals. But on the upside, it’s really nice to have a cheap geometry pass that doesn’t need to sample any textures! It’s also very friendly to GPU-driven batching techniques, which was demonstrated in the RedLynx presentation from SIGGRAPH.

There is one final reason why you might not want to do this (or at least not right now): it was a real pain in the ass to get this demo working in DX12. Driver bugs, shader compilation bugs, long compile times, validation bugs, driver crashes, blue screens of death: if it was annoying, I ran into it. Dynamic indexing support seems to be rather immature in both the shader compiler and the drivers, so tread carefully. The final code has a few work-arounds implemented, but I’ve noted them with a comment.

If I were to spend more time on this, I think it would be interesting to explore some of the more extreme variants of deferred texturing. In particular there’s Intel’s paper on Visibility Buffers[17], where the authors completely forego storing any surface data except for triangle ID and instance ID. All surface data is instead reconstructed by transforming vertices during the deferred pass, and performing a ray-triangle intersection to compute barycentrics for interpolation. There’s also Tomasz Stachowiak’s presentation about deferred material rendering[18], where barycentric coordinates are stored in the G-Buffer instead of being reconstructed (which he does by tricking the driver into accepting his hand-written GCN assembly!!!). He has some neat ideas about using tile classification to execute different shader paths based on the material, which is something that could be integrated with the MSAA tile classification that’s performed in my demo. Finally in the RedLynx presentation they use a neat trick where they render with MSAA at half resolution, and then reconstruct full-resolution surface samples during the deferred pass. It makes the deferred shader more complicated, but reduces the pixel shader cost of rasterizing the G-Buffer. These are all things I would love to implement in my demo if I had infinite time, but at some point I actually need to sleep.

If you’ve made it this far, thank you for hanging in there! This one might be my longest so far! I considered making it a series of articles, but I didn’t want it to turn into one of those blog series where the author just never finishes it.

If you want to look at the code or run the sample, everything is available on GitHub:

https://github.com/TheRealMJP/DeferredTexturing

https://github.com/TheRealMJP/DeferredTexturing/releases (Precompiled Binaries)

A word of warning: the shader compilation times are *incredibly* long for the deferred shader. A lot of it is from unrolling the loops in the monster 7×7 PCF kernel, so if you want to run the code in debug (the zipped up binaries on GitHub include cached Release shaders) or change the shaders, I would strongly recommend setting “UseGatherPCF_” to “0” on line 19 of Shading.hlsl.

If you find any bugs or have any suggestions, please let me know via comments, email, or twitter!

[1] Introduction To Resource Binding In D3D12 (Intel)

[2] OpenGL Bindless Extensions (Nvidia)

[3] GL_NV_bindless_texture (OpenGL Extension Registry)

[4] Deferred Texturing (Nathan Reed)

[5] GPU-Driven Rendering Pipelines (Haar and Aaltonen, SIGGRAPH 2015)

[6] Modern Textureless Deferred Rendering Techniques (Beyond3D)

[7] The Bitsquid Low-Level Animation System

[8] Practical Clustered Shading (Emil Persson)

[9] Deferred Rendering for Current and Future Rendering Pipelines (Intel, Andrew Lauritzen)

[10] Forward+: Bringing Deferred Lighting To The Next Level (AMD, Takahiro Harada)

[11] Forward Clustered Shading (Intel)

[12] Correct Frustum Culling (Íñigo Quílez)

[13] Conservative Rasterization (MSDN)

[14] Screen Space Decals in Warhammer 40K: Space Marine (Pope Kim)

[15] Free Sci-Fi Decals 2 (Nobiax, DeviantArt)

[16] Base color, Roughness and Metallic textures for Sponza (Alexandre Pestana)

[17] The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading (Burns and Hunt, JCGT)

[18] A Deferred Material Rendering System (Tomasz Stachowiak)
