Looking back on our fictional GPU that we discussed in Part 2, it’s fair to say to its synchronization capabilities are sufficient to obtain correct results from a sequence of compute jobs with coarse-grained dependencies between those jobs. However, the tool that we used to resolve those dependencies (a global GPU-wide barrier that forces all pending threads to finish executing, also known as a “flush”) is a *very* blunt instrument. It’s bad enough that we’re unable to represent dependencies at the sub-dispatch level (in other words, only some of threads from one dispatch depend on only some of threads from another dispatch), but it can also force us into situations where non-dependent dispatches still can’t overlap with other. This can be really bad when you’re talking about GPU’s that require thousands of threads to fill all of their executions units, particularly if your dispatches tend towards low thread counts and high execution times. If you recall from Part 2, we discovered that the cost of each GPU-wide barrier was tied to the corresponding decrease in utilization, and long-running threads can result in cores sitting idle for a long period of time.

In Part 2 we did discuss one way that the MJP-3000 could extract some extra overlap from a sequence of dispatches. Specifically, we talked about a case with two dependent dispatches, and a third dispatch that was independent of the first two. By splitting the barrier into two steps ( SIGNAL_POST_SHADER and WAIT_SIGNAL), we could effectively sync only a single dispatch’s threads instead of waiting for all executing threads to finish. It worked in that case, but it’s still very limited tool. In that example we relied on our advance knowledge of dispatch C’s thread count and execution time to set up the command buffer in a way that would maximize overlap. But this may not be practical in the real-world, where thread count and execution time might vary from frame-to-frame. That technique also doesn’t scale up to more complex dependency graphs, since we’re always going to be limited by the fact that there’s only 1 command processor that can wait for something to finish, and then kick off more threads.

To give you a better idea of what I’m talking about, let’s switch to an example that’s more grounded in reality. In real engines performing graphics work on GPU’s, it’s very common to have **chains** of draws or dispatches where each one depends on the previous step. For instance, imagine a bloom pass that requires several downsampling steps, followed by several blur passes, followed by several upscale passes. So you might end up with around 6-8 different dispatches, with a barrier between each one:

Now, let’s say we would like to overlap this with a large dispatch that takes about an equal amount of time. Overlapping it with a single step is pretty simple: we just need to make sure that a bloom dispatch and the large dispatch end up on the same side of a barrier, and they will overlap with each other. So for instance we could dispatch the large shader, then dispatch the first downscale, and then issue a barrier. However, this might not be ideal if the GPU has a setup similar to our fictional GPU, where our barriers are implemented in terms of commands that require waiting for all previously-launched shaders to finish executing. If that’s the case, then the first barrier after the long dispatch might cause a long period with no overlap:

This is better than nothing, but not great. Really we want the long dispatch to fully overlap one of the barriers, since the associated cache flushes and other actions are periods where the shader units would otherwise be completely idle. We can achieve this with split barriers, but we could only do this once if our SIGNAL_POST_SHADER instruction is limited to signaling after all previously-queued threads have completed:

Things could get even more complicated if we wanted to overlap our bloom with a *different* chain of dispatches. For example, we might have a depth of field technique that needs 4-5 dispatches that are all dependent on the previous dispatch, but are otherwise independent from the bloom dispatches. By interleaving the dispatches from both techniques it would be possible to achieve some overlap, but it would be complex and difficult to balance. In both cases we’re really being held back by the limitations of our barrier implementation, and our lives would be much easier if we could submit each pass as a set of completely independent commands that aren’t affected by each other’s sync operations.

After much, much complaining from the programmers writing software for the MJP-3000 about how much performance they were losing to sync points, the hardware engineers at MJP Industries finally had enough. When designing the all-new MJP-4000, they decided that the simplest way to handle multiple independent command streams would be to go ahead and copy/paste the front-end of the GPU! The logical hardware layout now looks like this:

As you can see, there’s now two command processors attached to two separate thread queues. This naturally gives the hardware the ability to handle two command buffers at once, with two different processors available for performing the flush/wait operations required for synchronization. The MJP-4000 still has the same number of shader cores as the 3000 (16), which means we haven’t actually increased the theoretical maximum throughput of the GPU. However the idea is that by using dual command processors we can reduce the amount of time that those shader cores sit idle, therefore improving the overall throughput of the jobs that it runs.

Looking at the diagram, you might (rightfully) wonder “how exactly do the two front-ends share the shader cores?” On real GPU’s there are several possible ways to handle that, which will touch on a bit in a later article. But for now, let’s assume that the dual thread queues use a very simple scheme for sharing the cores:

- If only one queue has pending threads and there are any empty shader cores, the queue will fill up those cores with work until there are no cores left
- If both queues have work to do and there are empty cores, those cores are split up evenly and assigned threads from both thread queues (if there’s an odd number of cores available, the top queue gets the extra core)
- Threads always run to completion once assigned to a shader core, which means pending threads can’t interrupt them

Let’s now look at how all of this all would work out in practice. Let’s imagine we have two independent chains of dispatches to run on the GPU at roughly the same time. Chain 1 consists of two dependent dispatches, which we’ll call Dispatch A (red) and Dispatch B (green). Dispatch A consists of 68 threads that each take 100 cycles to complete. Dispatch B then reads the results of Dispatch A, which means Chain 1 will need a FLUSH command between Dispatch A and Dispatch B. Once the flush completes, Dispatch B then launches another 8 threads that take 400 cycles to complete. Chain 2 also consists of two dependent dispatches, which we’ll call Dispatch C (blue) and Dispatch D (purple). Dispatch C will launch 80 threads that take 100 cycles to complete, followed by another 80 threads that take 100 cycles to complete. Here’s how it plays out:

Click to view slideshow.This sequence is easily the most complex example that we’ve looked at, so let’s break it down step by step:

- At cycle 0, we see that Chain 1’s command buffer has been submitted to the top-most command processor, which is ready to execute the Dispatch A command.
- Next we fast-forward to 50 cycles later, where we see that Dispatch A’s threads have been pushed to the top thread queue, and 16 threads are already running on the shader cores. It’s at this point that Chain 2’s command buffer is submitted to the bottom command processor, with the first command to be processed is Dispatch C.
- 1 cycle later at cycle 51, Dispatch C’s threads are sitting in the thread queue, but none of them are actually executing. This is because our all of the shader cores are already busy, and we’ve already stated that our GPU doesn’t interrupt running threads.
- The 80 threads from Dispatch C need to wait until cycle 102, which is when Dispatch A has written its results to memory and the shader cores are finally freed up.
- 1 cycle later in cycle 103 the two thread queues will evenly split the available cores, and 8 threads from both A and C get assigned to the shader cores. Once we’ve hit this point, the two dispatches continue to evenly share the available cores (since every batch of 8 threads starts and finishes at the same time as the other dispatch’s batch), which continues for about another 500 cycles.
- At cycle 608, something interesting happens. At this point only 4 threads from Dispatch A were running, since that’s all that was left after executing the previous 64 threads. The top-most command processor is running a FLUSH command, which means it’s waiting for those 4 threads to complete before it moves on and executes the next DISPATCH command. This means that if we
*didn’t*have the second command processor + thread queue launching threads from Dispatch C, then 12 of those 16 shader cores would have been idle due to the FLUSH! However by utilizing our dual command buffers, we’ve ensured that those idle cores stay busy with work from another chain! - Moving on to cycle 710, Dispatch A finally finishes writing its results, and Dispatch B gets its 8 threads enqueued in the top-most thread queue. Unfortunately those 8 threads have to wait about 100 cycles for a batch of Dispatch C’s threads to finish, since the bottom thread queue managed to get in there and hog all of the shader cores while the top-most command processor was handing the FLUSH and DISPATCH commands.
- At cycle 810 things return back to what we had before: both thread queues sharing the shader cores evenly. It should be noted that this is another point where we would normally have idle cores if we didn’t have dual front-ends, since Dispatch B only launched 8 threads! This case is actually even worse than what we had earlier with idle cores being caused by a FLUSH, since threads from Dispatch B take 400 cycles to complete. So it’s a good thing we have that extra front-end!
- At cycle 911 Dispatch C finally starts to wind down, which cause a period of 100 cycles where 4 cores are idle. But of course 100 cycles at 75% utilization is still a
*lot*better than 400 cycles with 50% utilization, so that’s really not so bad. - Around 100 cycles later we hit cycle 1012, where Dispatch D rolls in to once again fill up those remaining 8 shader cores.
- The sharing continues for about another 200 cycles until Dispatch B finishes at cycle 1214, at which point Dispatch D is free to use all of the cores it wants.
- Finally, at cycle 1618 the last batch from Dispatch D finishes, and all of our results are in memory.

Whew, we made it! The details were a bit complex, but we can distill this down to something much simpler: **by executing two command buffers at once, we got an overall higher utilization compared to if we ran them one at a time**. Pretty neat, right? However, this higher utilization came at the cost of higher **latency** for each individual chain. This is similar to what we saw in Part 2 when we overlapped multiple dispatches: it took longer for each separate chain to finish, but the combined execution time of Chain 1 + Chain 2 is lower when they’re interleaved compared to running them one at a time. Another way that you could say this is that we improved the **overall throughput** of the GPU. This is pretty impressive when you consider that we did this without adding additional shader cores to the GPU, and without changing the shader programs themselves! Keeping the shader core count at 16 means that we still have the same theoretical peak throughput that we had with the MJP-3000, and no amount of extra queues can get us past that peak. All we really did is make sure that the shader cores have less idle time, which can still be a big win when that percentage of idle time is high.

For real-world GPU’s that actual do graphics work (and not just dispatches like our imaginary GPU), there can be even more opportunities for filling idle shader cores with work from another command stream. As I mentioned in part 1, real GPU’s often have idle time due to flush caches and decompression steps. There’s also certain operations that you can do on GPU that barely even use the shader cores at all! Probably the most common of these is depth-only rasterization, which is typically used for depth prepasses or for generating shadow maps. For depth-only rendering the shader cores will be used to transform vertices in the vertex shader (and possibly also in the Hull/Domain/Geometry shader stages), but no pixel shader is run. Vertex counts tend to be much lower than pixel counts, which makes it very easy for these passes to end up getting bottlenecked by fixed-function triangle setup, rasterization, and/or depth buffer processing. When those fixed-function units become the bottleneck, the vertex shaders will only use a fraction of the available shader cores, leaving many idle cores available for other command streams to utilize. ^{1}

If we were to try to draw an analogy to the CPU world, the closest match to our multi-queue setup would probably be Simultaneous Multithreading, or **SMT** for short. You may also know this as Hyper-Threading, which is the Intel marketing name for this feature. The basic idea is pretty similar: it allows CPU’s to issue instructions from two different threads simultaneously, but they share most of the resources required for actually executing those instructions. The end goal of SMT is also to improve overall utilization by reducing idle time, although in the case of CPU’s the idle time that they’re avoiding mainly comes from stalls that occur during cache misses or page faults. ^{2} If we squint like we did in Part 2 and look at our command buffer streams as instruction streams, then the analogy holds fairly well. From this point of view it’s possible to think of a GPU as a kind of meta-processor, where the GPU executes various high-level instructions that in turn launch threads consisting of lower-level instructions!

In our hypothetical scenario, Chain 1 and Chain 2 were considered to be completely independent: they didn’t have to know or care about each other at all. Those chains may have been from different sub-systems in a game, or from two completely different programs sharing a single GPU. However it’s fairly common for a single application to have two different chains of dispatches that start from a single point, and whose final results are needed by another single dispatch. As a more concrete example, let’s return to the post-processing chain that we discussed at the beginning of the article. Let’s say that after we finish rendering our main pass to a render target, we have two different post-processing operations that need to happen: bloom and depth of field (DOF). We already discussed how the bloom pass consists of scaling and blur passes that are all dependent on the previous step, and the DOF pass works in a very similar way. After both of these passes continue, we want to have a single shader that samples the bloom and DOF output, composites them onto the screen, and then tone maps from HDR->SDR to get the final result. If we were to build a dependency graph for these operations, it would look something like this:

From the graph it’s clear that the bloom/DOF chains are independent while they’re executing, but they need to be synchronized at the start and end of the chains to make sure that everything works correctly. This implies that if we were to run the two chains as two separate command buffers submitted to the dual front-ends of the MJP-4000, we would need some kind of cross-queue synchronization in order to ensure correct results. Conceptually this is similar to our need for a barrier to ensure that all threads of a dispatch complete executing a dependent dispatch, except in this case we need to ensure that multiple *chains* have completed. If we were to use such a barrier to synchronize our post-processing chain, it would look something like this:

There are a few ways that this kind of cross-queue barrier, but the MJP-4000 takes a simple approach by re-purposing the SIGNAL_POST_SHADER and WAIT_SIGNAL instructions. Both command processors can access the same set of labels (indicated by the labels area being shared on the logical layout diagram), which gives them the basic functionality needed to wait for each other when necessary. For our bloom/DOF case we could have the DOF chain write a label when its finished, and the main command stream could wait on that label before continuing on to the tone mapping dispatch.

In Part 4, we’ll discuss how having multiple command processors can be useful for scenarios where multiple applications are sharing the same GPU.

- This sort of scenario is usually what game developers are referring to when they say that they were able to make something “free” by using “async compute”: they’re basically saying that there was enough idle execution units to absorb an entire dispatch or sequence of dispatches. But we’ll talk about this more in a future article.
- GPU’s also have their own ways of dealing with the high latency of memory access, one of which basically involves over-committing threads to shader cores and then cycling through them fairly quickly.

To explain the basics of GPU thread synchronization, I’m going to walk through some examples using a completely fictional architecture: the MJP-3000. This made-up GPU is much simpler than real graphics hardware, which will (hopefully) make it easier to demonstrate high-level concepts without getting lost in the weeds. I also don’t want to give the impression that what I describe is *exactly* how real GPU’s do things, especially since many of those details aren’t publicly available. However the commands and behavior are still loosely based on real-world GPU’s, since otherwise the example wouldn’t be very useful!

With the prologue out of the way, let’s have a look at the amazing feat of engineering that is the MJP-3000:

The interesting parts here are the **command processor** on the left, and the **shader cores** in the middle. The command processor is the brains of the operation, and its job is to read commands (the green blocks) from a **command buffer** and coordinate the **shader cores**. The command processor reads commands one at time from the command buffer, always in the exact order they’re submitted. When the command processor encounters the appropriate commands, it can add a group of threads to the **thread queue** immediately to the right of the command processor. The 16 shader cores pull threads from this queue in a first-in first-out (FIFO) scheme, after which the shader program for that thread is actually executed on the shader core. The cores are all identical, and completely independent of each other. This means that together they can simultaneously run 16 threads of the same shader program, or they can each run a thread from a completely different program. The shader cores can also read or write to arbitrary locations in device memory, which is on the right. Since the cores are independent and can all access memory, you can think of the array like a 16-core CPU. The major difference is that unlike a CPU they can’t tell themselves what to do, since they instead rely on the command processor to enqueue work for them. The **Current Cycle Count** in the top-left corner shows how many GPU cycles have executed for a particular example, which will help us keep track of how long it took for a particular example to complete execution.

For some reason, the designers of the MJP-3000 decided that their hardware could only run compute shaders. I suppose they felt that it would make things a lot simpler to only focus on the one shader stage that doesn’t rely on a complicated rasterization pipeline. Because of that, the command processor only has 1 command that actually kicks off threads to run on the shader cores: **DISPATCH**. The DISPATCH command specifies two things: how many threads need to run, and what shader program should be executed. When a DISPATCH command is encountered by the command processor, the threads from that dispatch are immediately placed in the thread queue, where they are grabbed by waiting shader cores. Since there are 16 cores, only 16 threads can be executing at any given time. Any threads that aren’t running on the shader cores stay in the thread queue until a core finishes a different thread and pulls the waiting thread out of the queue. The command processor can parse a DISPATCH command and enqueue its threads in 1 cycle, and the shader cores can dequeue a thread from the thread queue in 1 cycle.

Let’s now try a simple example where we dispatch 32 threads that each write something to a separate element of a buffer located in device memory. This dispatch will run shader program “A”, which takes 100 cycles to complete. So with 16 cores we would expect the whole dispatch to take around 200 cycles from start to end. Let’s go through the steps:

Click to view slideshow.In the first step, the command processor encounters a DISPATCH command in the command buffer that requests 32 threads of program A. 1 cycle later, the command processor has enqueued the 32 requested threads in the thread queue. 1 cycle after that, the 16 shader cores have each picked up a thread of program A and have started executing them. Meanwhile, 16 threads are left in the queue. 100 cycles later the first batch of threads have completed, and their result is in memory. 1 cycle after that we’re at the 103 cycle count, and the second batch of 16 threads are pulled from the now-empty queue to start executing on the shader cores. Finally after a total of 203 cycles, the threads are all finished and their results are in memory.

Now that we understand the basics of how this GPU works, let’s introduce some synchronization. As we already know from the previous article, synchronization implies that we’re going to somehow wait for all of our threads to hit a certain point before continuing. On a GPU where you’re constantly spinning up lots of new threads, this actually translates into something more like “wait for all of the threads from one group to finish before the threads from a second group start executing”. The common case where we’ll need to do this is where one dispatch needs to read the results that were written out by another dispatch. So for instance, say we run 24 threads of program A that collectively write their results to 24 elements of a buffer. After program A completes that we want to run 24 threads of program B, which will then read those 24 elements from the original output buffer and use them to compute new results written into a different buffer. If we were to try to do this by simply putting two DISPATCH commands in our command buffer, it would go something like this (program A is red, and program B is green):

Click to view slideshow.Take a look at the the third step: since dispatch A wasn’t a multiple of 16, the bottom 8 shader cores pulled from dispatch B to keep the cores from going idle. This caused the two dispatches to *overlap*, meaning that the end of dispatch A was still executing while the start of dispatch B was simultaneously executing. This is actually really bad for our case, because we now have a race condition: the threads of dispatch B might read from dispatch A’s output buffer before the threads of dispatch A have finished! Without knowing the specifics of which memory is accessed by programs A and B and how exactly the threads execute on the GPU, we have no choice but to insert a sync point between the two dispatches. This sync point will need to cause the command processor to wait until all threads of dispatch A run to completion before processing dispatch B. So let’s now introduce a FLUSH command that will do exactly that: when the command processor hits the flush, it waits for all shader cores to become idle before processing any further commands. The term “flush” is common for this sort of operation because it implies that it will “flush out” all pending work that’s waiting to execute. Let’s now try the same scenario again, this time using a flush to synchronize:

Notice how the command processor hits the FLUSH command, and then stops reading commands until dispatch A is completely finished and the thread queue is empty. This ensures that dispatch B never overlaps with dispatch A, which means it will be safe for any thread in dispatch B to access any result that that was output by dispatch A. This is pretty much exactly what I was talking about in part 1 when I mentioned the need for barriers to prevent dependent Draw/Dispatch calls from overlapping. In fact, you can usually expect something like a FLUSH to happen on current GPU’s if you issued dispatch A, issued a barrier to transition the output buffer from a write state to a read state, and then issued dispatch B (it’s also similar to what you would get in response to issuing a D3D12_RESOURCE_UAV_BARRIER in D3D12, since that also implies waiting for all pending writes to finish). Hopefully this example makes it even more clear as to why a barrier is necessary for this sort of data dependency, and why results could be wrong if the barrier is omitted.

It’s also very important to note that in this case the flush/barrier was not free from a performance point of view: our total processing time for both dispatches went from 304 cycles to 406 cycles. That’s a 25% increase! The reason for this should be intuitive: with the flush between dispatches, we now have more idle shader cores during the tail end of both dispatches. In fact the increase in processing time is exactly the same as the increase in the amount of idle time: without the flush we had about 0% idle cores over both dispatches, but *with* the flush our cores were idle about 25% of the time on average. This leads us to a simple conclusion: **the performance cost of a flush is directly tied to the decrease in utilization**. This ultimately means that the relative cost of introducing a thread synchronization barrier will vary depending on the number of threads, how long those threads execute, and how well the threads can fully saturate the available shader cores. We can confirm this with a simple thought experiment: imagine we ran dispatch A and dispatch B with 40 threads each instead of 24. The process would go almost exactly as it did before, except both dispatches would have another “phase” of 100 cycles where all 16 cores were in-use. Without our barrier the whole process would take about (40 + 40) / 16 = 500 cycles, while with the barrier it would take about 600 cycles. Therefore the relative cost of the barrier is about 16.5% as opposed to the 25% cost when our thread counts were lower.

The other way to look at this is that **removing an unnecessary flush can result in a performance increase that’s relative to the amount of idle shader cores**. So if we’re syncing between two dispatches and they have no dependency between them, it’s most likely a good idea to remove the barrier and let them overlap with each other. For larger dispatches (in terms of thread count) that can saturate the GPU on their own there won’t be much benefit, since there won’t be much idle time to exploit. However for very small dispatches the difference can be significant. This time let’s imagine that dispatch A and B both have 8 threads each. With a flush in between the total time will be about 200 cycles, but with no flush they can perfectly overlap and finish in only 100 cycles! Or as another example, imagine we had another completely independent workload of 8 threads that we’ll call dispatch C (and color its threads blue). If we were to overlap it with dispatch A, we could essentially get it for “free” by utilizing the idle cores:

If you squint a bit and look at our GPU as if it were a CPU executing instructions instead of a GPU executing commands, then this kind of overlapping of work could be considered a kind of **Instruction Level Parallelism**. In this case the parallel operations are being explicitly specified in our command stream, making it somewhat similar to how VLIW architectures work.

In the previous example, we were able to basically hide dispatch C in the idle time left by the barrier between dispatch A and dispatch B. But what if dispatch C was very complicated, and took much longer than 100 cycles to complete? Let’s re-do the example, except this time dispatch C will execute for 400 cycles instead of 100:

Click to view slideshow.Things didn’t go as well this time around. We still got a bit of overlap between A and C, but that was immediately followed by 300 cycles where half of our shader cores were idle. This happened because our FLUSH command ends up waiting for dispatch C to finish, since the flush works by waiting for the thread queue to become completely empty. We could re-arrange things so that dispatch C gets kicked off *after* the flush, but this is not ideal either because there would still be a bit of idle time during the tail end of dispatch A, and also a long period of half-idle cores when dispatch C is running.

Lucky for us, there’s a new driver update for the MJP-3000 that should be able to help us out. MJP xPerience 3D Nocturnal Driver v5.444.198754 adds support for two new commands that can be parsed and executed by the command processor. The first one is called SIGNAL_POST_SHADER, and the other is called WAIT_SIGNAL. The first command is pretty fancy: it tells the command processor to write a signal value to an address in memory (often called a *fence* or *label*) once all shaders have completed. The cool part is that it’s a “deferred” write: the write is actually performed by the thread queue once it determines that all previously-queued threads have run to completion. This allows the command processor to move on to other commands while previous dispatches are still executing. The other command, WAIT_SIGNAL, tells the command processor to stall and wait for a memory address to be signaled. This can be used in conjunction with SIGNAL_POST_SHADER to wait for a particular dispatch to complete, but with the added bonus that the command processor can kick off more work in between those steps. To help visualize this process, let’s update the GPU diagram with a new component:

Once a SIGNAL_POST_SHADER command is executed, any pending labels will show up as a colored block in a new area under the thread queue. The number on the block shows the current status of the label: “0” means it hasn’t been signaled yet, and “1” means that it’s in the signaled state and any dependent waits will be released.

Let’s now try out our new commands with the previous example:

Click to view slideshow.Very nice! By removing the long stall on dispatch C, we’ve effectively eliminated all of the idle time and kept the GPU busy for the entire duration of the 3 dispatches. As a result we’ve increased our overall throughput: previously the process took about 700 cycles, but now it’s down to about 500 cycles. Unfortunately this is still more time than it took to complete when we only had dispatch A and B to worry about, which means the *latency* for the A->B job increased by about 100 cycles. But at the same time the latency for dispatch C is is lower than it would be if it weren’t overlapped, since it would otherwise need to wait for either A or B to finish before it could start processing.

If the MJP-3000 were being programmed via D3D12 or Vulkan, then this signal/wait behavior is probably what you would hope to see when issuing a split barrier (vkCmdSetEvent + vkCmdWaitEvents in Vulkan-ese). Split barriers let you effectively specify 2 different points in a resource’s lifetime: the point where you’re done using it in its current state (read, write, etc.), and the point where you actually need the resource to be in its new state. By doing this and issuing some work between the begin and end of the barrier, the driver (potentially) has enough information to know that it can overlap the in-between work while it’s waiting for the pre-barrier work to finish. So for the example I outlined above, the D3D12 commands might go something like this:

- Issue Dispatch A which writes to Buffer A
- Begin Transition Buffer A from writable -> readable
- Issue Dispatch C which writes to Buffer C
- End Transition Buffer A from writable -> readable
- Issue Dispatch B which writes to Buffer B

For real-world GPU’s the benefits of split barriers can possibly be even greater than the sync point removal that I demonstrated with my imaginary GPU. As I mentioned in part 1, barriers on GPU’s are also responsible for handing things like cache flushes and decompression steps. These things increase the relative cost of a barrier past the simple “idle shader core tax” that we saw on our imaginary GPU, which gives us even more incentive to try to overlap the barrier with with some non-dependent work. However, the ability to overlap barrier operations with Draws and Dispatches is totally dependent on the specifics of the GPU architecture.

Before we wrap up, I’d like to point out that our GPU is still rather limited in terms of how it can overlap different dispatches, even with the new label/wait functionality that we just added. You can only do so much when the command processor is completely tied up every time that you need to wait for a previous dispatch to finish, which really starts to hurt you if you have more complex dependency chains. Later on in part 3 we’ll revisit this topic, and look at at how some hardware changes can help us get around these limitations.

In part 3, I’m going to discuss why explicit API’s expose multiple queues for submitting command buffers. I’ll also show how multiple queues could work on the fictional GPU architecture we’ve been using as an example, and discuss some implementations in real-world GPU’s.

]]>So what gives? Why the heck do we even need barriers in the first place, and why do things go so wrong if we misuse them? If you’ve done significant console programming or are already familiar with the lower-level details of modern GPU’s, then you probably know the answer to these questions, in which case this article isn’t really for you. But if you don’t have the benefit of that experience, then I’m going to do my best to give you a better understanding of what’s going on behind the scenes when you issue a barrier.

Like almost everything else in programming and computers, the term “barrier” is already a bit overloaded. In some contexts, a “barrier” is a synchronization point where a bunch of threads all have to stop once they reach a particular point in the code that they’re running. In this case you can think of the barrier as an immovable wall: the threads are all running, but stop dead in their tracks when they “hit” the barrier:

void ThreadFunction() { DoStuff(); // Wait for all threads to hit the barrier barrier.Wait(); // We now know that all threads called DoStuff() }

This sort of thing is helpful when you want to know when a bunch of threads have all finished executing their tasks (the “join” in the fork-join model), or when you have threads that need to read other’s results. As a programmer you can implement a thread barrier by “spinning” (looping until a conditions met) on a variable updated via atomic operations, or by using semaphores and condition variables when you want your threads to go to sleep while they’re waiting.

In other contexts the term “barrier” will refer to a memory barrier (also known as a “fence”), particularly if you’ve somehow fallen down the rabbit hole of lock-free programming. In these scenarios you’re usually dealing with reordering of memory operations that’s done by the compiler and/or the processor itself, which can really throw a wrench in the works when you have multiple processors communicating through shared memory. Memory barriers help you out by letting you force memory operations to complete either before or after the barrier , effectively keeping them on one “side” of the fence. In C++ you can insert these into your code using platform-specific macros like MemoryBarrier in the Windows API, or through the cross-platform std::atomic_thread_fence. A common use case might look like this:

// DataIsReady and Data are written to // by a different thread if(DataIsReady) { // Make sure that reading Data happens // *after* reading from DataIsReady MemoryBarrier(); DoSomething(Data); }

These two meanings of the term “barrier” have different specifics, but they also have something in common: they’re mostly used when one thing is producing a result and another thing needs to read that result. Another way of saying that is that one task has a **dependency** on a different task. Dependencies happen all of the time when writing code: you might have one line of code that adds two numbers to compute an offset, and the very next line of code will use that offset to read from an array. However you often don’t need to really be aware of this, because the compiler can **track** those dependencies for you and make sure that it produces code to give you the right results. Manually inserting barriers usually doesn’t come in until you do things in a way that the compiler can’t see how the data is going to be written to and read from at compile-time. This commonly happens due to multiple threads accessing the same data, but it can also happen in other weird cases (like when another piece of hardware writes to memory). Either way, using the appropriate barrier will make sure that the results will be **visible** to dependent steps, ensuring that they don’t end up reading the wrong data.

Since compilers can’t handle dependencies for you automatically when you’re doing multithreaded CPU programming, you’ll often spend a lot of time figuring how to express and resolve dependencies between your multithreaded tasks. In these situations it’s common to build a dependency graph indicating which tasks depend on the results of other tasks. That graph can help you decide what order to execute your tasks, and when you need to stick a sync point (barrier) between two tasks (or groups of tasks) so that the earlier task completely finishes before the second task starts executing. You’ll often see these graphs drawn out as a tree-like diagram, like in this easy-to-understand example from Intel’s documentation for Thread Building Blocks:

Even if you’ve never done task-oriented multithreaded programming, this diagram makes the concept of dependencies pretty clear: you can’t put the peanut better on the bread before you’ve gotten the bread! At a high level this determines the order of your tasks (bread before peanut butter), but it also subtly implies something that would be obvious to you if you were doing this in real life: you can’t start applying your peanut butter until you’ve gotten out your slices of bread from the cabinet. If you were doing this in real life by yourself, you wouldn’t even think about this. There’s only 1 of you, so you would just go through each step one at a time. But we were originally discussing this in the context of multithreading, which means that we’re talking about trying to run different tasks on different cores in parallel. Without properly waiting you could end up with the peanut butter task running at the same time as the bread step, and that’s obviously not good!

To avoid these kinds of issues, task schedulers like TBB give you mechanisms that force a task (or group of tasks) to wait until a prior task (or group of tasks) completely finishes executing. Like I mentioned earlier, you could call this mechanism a barrier, or a sync point:

This sort of thing is pretty easy to implement on modern PC CPU’s, since you have a lot of flexibility as well as some powerful tools at your disposal: atomic operations, synchronization primitives, OS-supplied condition variables, and so on.

So we’ve covered the basics of what a barrier is, but I still haven’t explained why we have them in API’s designed for talking to the GPU. After all, issuing Draw and Dispatch calls isn’t really the same as scheduling a bunch of parallel tasks to execute on separate cores, right? I mean, if you look at a D3D11 program’s sequence of API calls it looks pretty damn serial:

If you’re used to dealing with GPU’s through an API like this, you’d be forgiven for thinking that the GPU just goes through each command one at a time, in the order you submit them. And while this may have been true a long time ago, the reality is actually quite a bit more complicated on modern GPU’s. To show you what I’m talking about, let’s take a look at what my Deferred Texturing sample looks like when I take a capture with AMD’s awesome profiling tool, Radeon GPU Profiler:

This snippet is showing just a portion of a frame, specifically the part where all of the scene geometry is rasterized into the G-Buffer. The left-hand side shows the draw call, while the blue bars to the right show when the draw call actually starts and stops executing. And what do you know, there’s a whole lot of overlapping going on there! You can see the same thing shown a bit differently if you fire up PIX for Windows:

This is a snippet from PIX’s timeline view, which is also showing the execution time for the same sequence of draw calls (this time captured on my GTX 1070, whereas the earlier RGP capture was done on my significantly less beefy RX 460). You can see the same pattern: the draws start executing roughly in submission order, but they overlap all over the place. In some cases, draws will start and finish before an earlier draw completes!

If you know even a little bit about GPU’s this shouldn’t be *completely* surprising. After all, everyone knows that a GPU is mostly made up of hundreds or thousands of what the IHV’s like to call “**shader cores**“, and those shader cores all work together to solve “embarrassingly parallel” problems. These days the bulk of work done to process a draw (and pretty much all of the work done to process a dispatch) is performed on these guys, which run the shader programs compiled from our HLSL/GLSL/MetalSL code. Surely it makes sense to have the shader cores process the several thousand vertices from a single draw call in parallel, and to do the same with the thousands or millions of pixels that result from rasterizing the triangles. But does it really make sense to let multiple draw calls or dispatches bleed over into one another so that your actual high-level commands are also executing in parallel?

The correct answer is, “yes, absolutely!” In fact, hardware designers have put in quite a bit of effort over the years to make sure that their GPU’s can do this even if there are some state changes in between the draws. Desktop GPU’s have even engineered their ROP’s (the units that are responsible for taking the output of a pixel shader and actually writing it to memory) so that they can resolve blending operations even if the pixel shaders didn’t finish in draw order! Doing it this way helps avoid having idle shader cores, which in turn gives you better throughput. Don’t worry if this doesn’t completely make sense right now, as I’m going to walk through some examples in a future post that explain why this is the case. But for now, just take my word for it that allowing draws and dispatches to overlap generally leads to higher throughput.

If a GPU’s threads from a draw/dispatch can overlap with other, that means that the GPU needs a way to *prevent* that from happening in cases where there’s a data dependency between two tasks. When this happens, it makes sense do what we do on a CPU, and insert something roughly similar to a thread barrier in order to let us know when a group of threads have all finished their work. In practice GPU’s tend to do this in a very coarse manner, such as waiting for all outstanding compute shader threads to finish before starting up the next dispatch. This can be called a “flush”, or a “wait for idle”, since the GPU will wait for all threads to “drain” before moving on. But we’ll get into that in more detail in the next article.

Hopefully by now it’s clear that there’s at least one reason for barriers on GPU’s: to keep shader threads from overlapping when there’s a data dependency. This is really the same scenario with the peanut butter and the bread that we laid out earlier when talking about CPU threads, except with the core count cranked up to the thousands. But unfortunately things get a bit more complicated when we’re talking about GPU’s as opposed to CPU’s.

Let’s say that you start up a group of threads running on a PC CPU that write a bunch of data to individual buffers, insert a thread barrier to wait until those threads are finished, and then kick off a second group of threads that reads the output data of the first group of threads. As long as you make sure that you have the right memory/compiler barriers in place to ensure that the second tasks’s read operations don’t happen too early (and often you get this by default from using OS synchronization primitives or atomic operations), you don’t need to care about getting correct results in the presence of a cache hierarchy. This is because the caches on an x86 core (usually each core has its own individual L1 cache, with a shared L2 and possibly L3 cache) are coherent, which means that they stay “up to date” with each other as they access different memory addresses. The details of how they achieve this miraculous feat are quite complicated, but as programmers we’re usually allowed to remain blissfully ignorant of the internal gymnastics being performed by the hardware.

Things are not so simple for the poor folks that write drivers for a GPU. For various reasons, some of them dating back to their legacy as devices that weren’t used for general-purpose computing like they are now, GPU’s tend to have a bunch of caches that aren’t always organized into a strict hierarchy. The details aren’t always public, but AMD tends to has quite a bit of public information available about their GPU’s that we can learn from. Here’s a diagram from slide 50 of this presentation:

This looks quite different from a CPU’s cache hiearchy! We have two things with L1 that go through L2 the way that you expect they would, but then there’s color and depth caches that bypass L2 and go right to memory! And there’s a DMA engine that doesn’t go through cache at all! The diagram here is also a bit misleading, since in reality there’s an L1 texture cache on every compute unit (CU), and there can be dozens of compute units on the larger video cards. There’s also multiple instruction cache and scalar data L1’s, with one of these shared between up to 4 CU’s. There’s lots of details in this GCN whitepaper, which explains how the various caches work, and also how direct memory writes from shaders (AKA writes to UAV’s) go through their local L1 and *eventually* propagate to L2.

As a consequence of having all of these caches without a strict hierarchy, the caches can sometimes get out of sync with each other. As the GCN whitepaper describes, the L1 caches on the shader units aren’t coherent with each other until the writes reach L2. This means that if one dispatch writes to a buffer and another reads from it, the CU L1 cache may need to be **flushed** in between those dispatches to make sure that all of the writes at least made it to L2 (a cache flush refers to the operation of taking modified/dirty cache lines and writing them out to the next cache level, or actual memory if applied to the last cache level). And as slide 52 describes, it’s even worse when a texture goes from being used as a render target to being used as a readable texture. For that case the writes to the render target could be sitting in the color buffer L1 cache that’s attached to the ROP’s, which means that cache has to be flushed in addition to flushing the other L1’s and L2 cache. (note that AMD’s new Vega architecture has more unified cache hierarchy where the ROP’s are also clients of the L2).

One cool thing about AMD hardware is that their tools actually show you when this happens! Here’s a snippet from an RGP capture showing the caches being flushed (and shader threads being synchronized!) on my RX 460 after large dispatch finishes writing to a texture:

Now the point of this isn’t just to explain how caches are hard and complicated, but to illustrate another way in which GPU’s require barriers. Ensuring that your threads don’t overlap isn’t sufficient for resolving read-after-write dependencies when you also have multiple caches that can have stale data in them. You’ve also got to invalidate or flush those caches to make the results visible to subsequent tasks that need to read the data.

GPU’s have gotten more and more focused on compute as time goes on, but they’re still heavily optimized for rasterizing triangles into a grid of pixels. Doing this job means that the ROP’s can end up having touch a *ton* of memory every frame. Games now have to render at up to 4k resolutions, which works out to 8294400 pixels if you write every one with no overdraw. Multiply that by 8 bytes per-pixel for 16-bit floating point texture formats, or maybe up to 30 or 40 bytes per-pixel for fat G-Buffers, and you’re looking at a lot bandwidth consumption just to touch all of that memory once (and typically many texels will be touched more than once)! It only gets worse if you add MSAA into the mix, which will double or quadruple the memory and bandwidth requirements in the naive case.

To help keep that bandwidth usage from becoming a bottleneck, GPU designers have put quite a bit of effort into building lossless compression techniques into their hardware. Typically this sort of thing is implemented as part of the ROP’s, and is therefore used when writing to render targets and depth buffers. There’s been a lot of specific techniques used over the years, and the exact details haven’t been made available to the public. However AMD and Nvidia have provided at least a bit of information about their particular implementations of delta color compression in their latest architectures. The basic gist of both techniques is that they aim to exploit the similarity in neighboring pixels in order to avoid storing a unique value for every texel of the render target. Instead, the hardware recognizes patterns in blocks of pixels, and stores each pixel’s difference (or delta) from an anchor value. Nvidia’s block modes give them anywhere from 2:1 to 8:1 compression ratios, which potentially results in huge bandwidth savings!

So what exactly does this have to do with barriers? The problem with these fancy compression modes is that while the ROP’s may understand how to deal with the compressed data, the same is not necessarily true when shaders need to randomly-access the data through their texture units. This means that depending on the hardware and how the texture is used, a decompression step might be necessary before the texture contents are readable by a dependent task (or writable through a means other than ROP’s). Once again, this is something that falls under the umbrella of “barriers” when we’re talking about GPU’s and the new explicit API’s for talking to them.

After reading through my ramblings about thread synchronization, cache coherency, and GPU compression, you hopefully have at least a very basic grasp of 3 potential reasons that typical GPU’s require barriers to do normal things that we expect of them. But if you look at the actual barrier API’s in D3D12 or Vulkan, you’ll probably notice that they don’t really seem to directly correspond with what we just talked about. After all, it’s not like there’s a “WaitForDispatchedThreadsToFinish” or “FlushTextureCaches” function on ID3D12GraphicsCommandList. And if you think about it, it makes sense that they don’t do this. The fact that most GPU’s have lots of shader cores where tasks can overlap is a pretty specific implementation detail, and you could say the same about GPU’s that have weird incoherent cache hierarchies. Even for an explicit API like D3D12 it wouldn’t make sense to leak that kind of detail across its abstraction, since it’s totally possible that one day D3D12 could be used to talk to a GPU that doesn’t behave the way that I just described (it may have already happened!).

When you think of things from that perspective, it starts to make sense that D3D12/Vulkan barriers are more high-level, and instead are mostly aimed at describing the flow of data from one pipeline stage to another. Another way to describe them is to say that the barriers tell the driver about changes in the *visibility* of data with regards to various tasks and/or functional units, which as we pointed out earlier is really the essence of barriers. So in D3D12 you don’t say “make sure that this draw call finishes before this other dispatch reads it”, you say “this texture is transitioning from a ‘render target’ state to a ‘shader readable’ state so that a shader program can read from it”. Essentially you’re giving the driver a bit of information about the past and future life of a resource, which may be necessary for making decisions about which caches to flush and whether or not to decompress a texture. Thread synchronization is then implied by the state transition rather than explicit dependencies between draws or dispatches, which isn’t a perfect system but it gets the job done.

If you’re wondering why we didn’t need to manually issue barriers in D3D11, the answer to that question is “because the driver did it for us!”. Remember how earlier I said that a compiler can analyze your code to determine dependencies, and generate the appropriate assembly automatically? This is basically what drivers do in D3D11, except they’re doing it at runtime! The driver needs to look at all the resources that you bind as inputs and outputs, figure out when there’s visibility changes (for instance, going from a render target to a shader input), and insert the necessary sync points, cache flushes, and decompression steps. While it’s nice that you automatically get correct results, it’s also bad for a few reasons:

- Automatically tracking resources and draw/dispatch calls is expensive, which is not great when you want to squeeze your rendering code into a few milliseconds per frame.
- It’s really bad for generating command buffers in parallel. If you can set a texture as a render target in one thread and then bind it as an input in another thread, the driver can’t figure out the full resource lifetime without somehow serializing the results of the two threads.
- It relies on an explicit resource binding model, where the context always knows the full set of inputs and outputs for every draw or dispatch. This can prevent you from doing awesome things with bindless resource access.
- In some cases the driver might issue unnecessary barriers due to not having knowledge of how the shaders access their data. For example, two dispatches that increment the same atomic counter won’t necessarily need a barrier between them, even though they access the same resource.

The thinking behind D3D12 and Vulkan is that you can eliminate those disadvantages by having the app provide the driver with the necessary visibility changes. This keeps the driver simpler, and lets the app figure out the barriers in any manner that it wants. If your rendering setup is fairly fixed, you can just hard-code your barriers and have essentially 0 CPU cost. Or you can setup your engine to build its own dependency graph, and use that to determine which barriers you’ll need.

In the next article, I’m going to dive a bit deeper into the topic of thread-level synchronization and how it’s typically implemented on GPU’s. Stay tuned!

]]>

*This is part 6 of a series on Spherical Gaussians and their applications for pre-computed lighting. You can find the other articles here:*

Part 1 – A Brief (and Incomplete) History of Baked Lighting Representations

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

Get the code on GitHub: https://github.com/TheRealMJP/BakingLab (pre-compiled binaries available here)

Back in early 2014, myself and David Neubelt started doing serious research into using Spherical Gaussians as a compact representation for our pre-computed lighting probes. One of the first things I did back then was to create a testbed application that we could use to compare various lightmap representations (SH, H-basis, SG, etc.) and quickly experiment with new ideas. As part of that application I implemented my first path tracer, which was directly integrated into the app for A/B comparisons. This turned out to be extremely useful, since having quick feedback was really helpful for evaluating quality and also for finding and fixing bugs. Eventually we used this app to finalize the exact approach that we would use when integrating SG’s into The Order: 1886.

A year later in 2015, Dave and I created another test application for experimenting with improvements that we were planning for future projects. This included things like a physically based exposure model utilizing real-world camera parameters, using the ACES[1] RRT/ODT for tone mapping, and using real-world units[2] for specifying lighting intensities. At some point I integrated an improved version of SG baking into this app that would progressively compute results in the background while the app remained responsive, allowing for quick “preview-quality” feedback after adjusting the lighting parameters. Once we started working on our SIGGRAPH presentation[3] from the 2015 physically based shading course, it occurred to us that we should really package up this new testbed and release it alongside the presentation to serve as a working implementation of the concepts we were going to cover. But unfortunately this slipped through the cracks: the new testbed required a lot of work in order to make it useful, and both Dave and I were really pressed for time due to multiple new projects ramping up at the office.

Now, more than a year after our SIGGRAPH presentation, I’m happy to announce that we’ve finally produced and published a working code sample that demonstrates baking of Spherical Gaussian lightmaps! This new app, which I call “The Baking Lab”, is essentially a combination of the two testbed applications that we created. It includes all of the fun features that we were researching in 2015, but also includes real-time progressive baking of 2D lightmaps in various formats. It also allows switching to a progressive path tracer at any time, which serves as the “ground truth” for evaluating lightmap quality and accuracy. Since it’s an amalgamation of two older apps, it uses D3D11 and the older version of my sample framework. So there’s no D3D12 fanciness, but it will run on Windows 7. If you’re just interested in looking at the code or running the app, then go ahead and head over to GitHub: https://github.com/TheRealMJP/BakingLab. If you’re interested in the details of what’s implemented in the app, then keep reading.

The primary feature of The Baking Lab is lightmap baking. Each of the test scenes includes a secondary UV set that contains non-overlapping UV’s used for mapping the lightmap onto the scene. Whenever the app starts or a new scene is selected, the baker uses the GPU to rasterize the scene into lightmap UV space. The pixel shader outputs interpolated vertex components like position, tangent frame, and UV’s to several render targets, which use MSAA to simulate conservative rasterization. Once the rasterization is completed, the results are copied back into CPU-accessible memory. The CPU then scans the render targets, and extracts “bake points” from all texels covered by the scene geometry. Each of these bake points represents the location of a single hemispherical probe to be baked.

Once all bake points are extracted, the baker begins running using a set of background threads on the CPU. Each thread continuously grabs a new work unit consisting of a group of contiguous bake points, and then loops over the bake points to compute the result for that probe. Each probe is computed by invoking a path tracer, which uses Embree[4] to allow for arbitrary ray tracing through the scene on the CPU. The path tracer returns the incoming radiance for a direction and starting point, where the radiance is the result of indirect lighting from various light sources as well as the direct lighting from the sky. The path tracer itself is a very simple unidirectional path tracer, using a few standard techniques like importance sampling, correlated multi-jittered sampling[5], and russian roulette to increase performance and/or convergence rates. The following baking modes are supported:

**Diffuse**– a single RGB value containing the result of applying a standard diffuse BRDF to the incoming lighting, with an albedo of 1.0**Half-Life 2**– directional irradiance projected onto the Half-Life 2 basis[6], making for a total of 3 sets of RGB coefficients (9 floats total)**L1 SH**– radiance projected onto the first two orders of spherical harmonics, making for a total of 4 sets of RGB coefficients (12 floats total). Supports environment specular via a 3D lookup texture.**L2 SH**– radiance projected on the first three orders of spherical harmonics, making for a total of 9 sets of RGB coefficients (27 floats total). Supports environment specular via a 3D lookup texture.**L1 H-basis**– irradiance projected onto the first two orders of H-basis[7], making for a total of 4 sets of RGB coefficients (12 floats total).**L2 H-basis**– irradiance projected onto the first three orders of H-basis, making for a total of 6 sets of RGB coefficients (18 floats total).**SG5**– radiance represented by the sum of 5 SG lobes with fixed directions and sharpness, making for a total of 5 sets of RGB coefficients (15 floats total). Supports environment specular via an approximate evaluation of per-lobe specular contribution.**SG6**– radiance represented by the sum of 6 SG lobes with fixed directions and sharpness, making for a total of 6 sets of RGB coefficients (18 floats total). Supports environment specular via an approximate evaluation of per-lobe specular contribution.**SG9**– radiance represented by the sum of 9 SG lobes with fixed directions and sharpness, making for a total of 9 sets of RGB coefficients (27 floats total). Supports environment specular via an approximate evaluation of per-lobe specular contribution.**SG12**– radiance represented by the sum of 12 SG lobes with fixed directions and sharpness, making for a total of 12 sets of RGB coefficients (36 floats total). Supports environment specular via an approximate evaluation of per-lobe specular contribution.

For SH, H-basis, and HL2 basis baking modes the path tracer is evaluated for random rays distributed about the hemisphere so that Monte Carlo integration can be used to integrate the radiance samples onto the corresponding basis functions. This allows for true progressive integration, where the baker makes N passes over each bake point, each time adding a new sample with the appropriate weighting. It looks pretty cool in action:

The same approach is used for the “Diffuse” baking mode, except that sampling rays are evaluated using a cosine-weighted hemispherical sampling scheme[8]. For SG baking, things get a little bit trickier. If the ad-hoc projection mode is selected, the result can be progressively evaluated in the same manner as the non-SG bake modes. However if either the Least Squares or Non-Negative Least Squares mode are active, we can’t run the solve unless we have all of the hemispherical radiance samples available to feed to the solver. In this case we switch to a different baking scheme where each thread fully computes the final value for every bake point that it operates on. However the thread only does this for a single bake point from each work group, and afterwards it fills in the rest of the neighboring bake points (which are arranged in a 8×8 group of texels) with the results it just computed. Each pass of of baker then fills in the next bake point in the work group, gradually computing the final result for all texels in the group. So instead of seeing the quality slowly improve across the light map, you see extrapolated results being filled in. It ends up looking like this:

While it’s not as great as a true progressive bake, it’s still better than having no preview at all.

The app supports a few settings that control some of the bake parameters, such as the number of samples evaluated per-texel and the overall lightmap resolution. The “Scene” group in the UI also has a few settings that allow toggling different components of the final render, such as the direct or indirect lighting or the diffuse/specular components. Under the “Debug” setting you can also toggle a neat visualizer that shows a visual representation of the raw data stored in the lightmap. It looks like this:

The integrated path tracer is primarily there so that you can see how close or far off you are when computing environment diffuse or specular from a light map. It was also a lot of fun to write – I recommend doing it sometime if you haven’t already! Just be careful: it may make you depressed to see how poorly your real-time approximation holds up when compared with a proper offline render.

The ground truth renderer works in a similar vein to the lightmap baker: it kicks off multiple background threads that each grab work groups of 16×16 pixels that are contiguous in screen space. The renderer makes N passes over each pixel, where each pass adds an additional sample that’s weighted and summed with the previous results. This gives you a true progressive render, where the result starts out noisy and (very) gradually converges towards a noise-free image:

The ground truth renderer is activated by checking the “Show Ground Truth” setting under the “Ground Truth” group. There’s a few more parameters in that group to control the behavior of the renderer, such as the number of samples used per-pixel and the scheme used for generating random samples.

There’s 3 different light sources supported in the app: a sun, a sky, and a spherical area light. For real-time rendering, the sun is handled as a directional light with an intensity computed automatically using the Hosek-Wilkie solar radiance model[9]. So as you change the position of the sun in the sky, you’ll see the color and intensity of the sunlight automatically change. To improve the real-time appearance, I used the disk area light approximation from the 2014 Frostbite presentation. The path tracer evaluates the sun as an infinitely-distant spherical area light with the appropriate angular radius, with uniform intensity and color also computed from the solar radiance model. Since the path tracer handles the sun as a true area light source, it produces correct specular reflections and soft shadows. In both cases the sun defaults to correct real-world intensities using actual photometric units. There is a parameter for adjusting the sun size, which will result in the sun being too bright or too dark if manipulated. However there’s another setting called “Normalize Sun Intensity” which will attempt to maintain roughly the same illumination regardless of the size, which allows for changing the sun appearance or shadow softness without changing the overall scene lighting.

The default sky mode (called “Procedural”) uses the Hosek-Wilkie sky model to compute a procedural sky from a few input parameters. These include turbidity, ground albedo, and the current sun position. Whenever the parameters are changed, the model is cached to a cubemap that’ s used for real-time rendering on the GPU. For CPU path tracing, the the sky model is directly evaluated for a direction using the sample code provided by the authors. When combined with the procedural sun model, the two light sources form a simple outdoor lighting environment that corresponds to real-world intensities. Several other sky modes are also supported for convenience. The “Simple”mode takes just a color and intensity as input parameter, and flood-fills the entire sky with a value equal to color * intensity. The “Ennis”, “Grace Cathedral”, and “Uffizi Cross” modes use corresponding HDR environment maps to fill the sky instead of a procedural model.

For local lighting, the app supports enabling a single spherical area light using the “Enable Area Light” setting. The area light can be positioned using the Position X/Position Y/Position Z settings, and its radius can be specified with the “Size” setting. There are a 4 different modes for specifying the intensity of the light:

**Luminance**– the intensity corresponds to the amount of light being emitted from the light source along an infinitesimally small ray towards the viewer or receiving surface. Uses units of cd/m^{2}. Changing the size of the light source will change the overall illumination the scene.**Illuminance**– specifies the amount of light incident on a surface at a set distance, which is specified using the “Illuminance Distance” setting. So instead of saying “how much light is coming out of the light source” like you do with the “Luminance” mode, you’re saying “how much diffuse light is being reflected from a perpendicular surface N units away”. Uses units of lux, which are equivalent to lm/m^{2}. Changing the size of the light source will*not*change the overall illumination the scene.**Luminous Power**– specifies the total amount of light being emitted from the light source in all directions. Uses units of lumens. Changing the size of the light source will*not*change the overall illumination the scene.**EV100**– this is an alternative way of specifying the luminance of the light source, using the exposure value[10] system originally suggested by Nathan Reed[11]. The base-2 logarithmic scale for this mode is really nice, since incrementing by 1 means doubling the perceived brightness. Changing the size of the light source will change the overall illumination the scene.

The ground truth renderer will evaluate the area light as a true spherical light source, using importance sampling to reduce variance. The real-time renderer approximates the light source as a single SG, and generates very simple hard shadows using an array of 6 shadow maps. By default only indirect lighting from the area light will be baked into the lightmap, with the direct lighting evaluated on the GPU. However if the “Bake Direct Area Light” setting is enabled, then the direct contribution from the area light will be baked into the lightmap.

Note that all light sources in the app are always scaled down by a factor of 2^{-10} before being using in rendering, as suggested by Nathan Reed in his blog post[11]. Doing this effectively shifts the window of values that can be represented in a 16-bit floating point value, which is necessary in order to represent specular reflections from the sun. However the UI always will always show the unshifted values, as will the debug luminance picker that shows the final color and intensity of any pixel on the screen.

As I mentioned earlier, the app implements a physically based exposure system that attempts to models the behavior and parameters of a real-world camera. Much of the implementation was based on the code from Padraic Hennessy’s excellent series of articles[12], which was in turn inspired by Sébastien Lagarde and Charles de Rousiers’s SIGGRAPH presentation from 2014[2]. When the “Exposure Mode” setting is set to the “Manual (SBS)” or “Manual (SOS)” modes, the final exposure value applied before tone mapping will be computed based on the combination of aperture size, ISO rating, and shutter speed. There is also a “Manual (Simple)” mode available where a single value on a log2 scale can be used instead of the 3 camera parameters.

Mostly for fun, I integrated a post-process depth of field effect that uses the same camera parameters (along with focal length and film size) to compute per-pixel circle of confusion sizes. The effect is off by default, and can be toggled on using the “Enable DOF” setting. Polygonal and circular bokeh shapes are supported using the technique suggested by Tiago Sousa in his 2013 SIGGRAPH presentation[13]. Depth of field is also implemented in the ground truth renderer, which is capable of achieving true multi-layer effects by virtue of using a ray tracer.

Several tone mapping operators are available for experimentation:

**Linear**– no tone mapping, just a clamp to [0, 1]**Film Stock**– Jim Hejl and Richard Burgess-Dawson’s polyomial approximation of Haarm-Peter Duiker‘s filmic curve, which was created by scanning actual film stock. Based on the implementation provided by John Hable[14].**Hable (Uncharted2)**– John Hable‘s adjustable filmic curve from his GDC 2010 presentation[15]**Hejl 2015**– Jim Hejl’s filmic curve that he posted on Twitter[16], which is a refinement of Duiker’s curve**ACES sRGB Monitor**– a fitted polynomial version of the ACES[17] reference rendering transform (RRT) combined with the sRGB monitor output display transform (ODT), generously provided by Stephen Hill.

At the bottom of the settings UI are a group of debug options that can be selected. I already mentioned the bake data visualizer previously, but it’s worth mentioning again because it’s really cool. There’s also a “luminance picker”, which will enable a text output showing you the luminance and illuminance of the surface under the mouse cursor. This was handy for validating the physically based sun and sky model, since I could use the picker to make sure that the lighting values matched what you would expect from real-world conditions. The “View Indirect Specular” option causes both the real-time renderer and the ground truth renderer to only show the indirect specular component, which can be useful for gauging the accuracy of specular computed from the lightmap. After that there’s a pair of buttons for saving or loading light settings. This will serialize the settings that control the lighting environment (sun direction, sky mode, area light position, etc.) to a file, which can be loaded in whenever you like. The “Save EXR Screenshot” is fairly self-explanatory: it lets you save a screenshot to an EXR file that retains the HDR data. Finally there’s an option to show the current sun intensity that’s used for the real-time directional light.

[1] Academy Color Encoding System – https://en.wikipedia.org/wiki/Academy_Color_Encoding_System

[2] Moving Frostbite to PBR (course notes) – http://www.frostbite.com/wp-content/uploads/2014/11/course_notes_moving_frostbite_to_pbr_v2.pdf

[3] Advanced Lighting R&D at Ready At Dawn Studios – http://blog.selfshadow.com/publications/s2015-shading-course/rad/s2015_pbs_rad_slides.pdf

[4] Embree: High Performance Ray Tracing Kernels – https://embree.github.io/

[5] Correlated Multi-Jittered Sampling – http://graphics.pixar.com/library/MultiJitteredSampling/paper.pdf

[6] Shading in Valve’s Source Engine – http://www.valvesoftware.com/publications/2006/SIGGRAPH06_Course_ShadingInValvesSourceEngine.pdf

[7] Efficient Irradiance Normal Mapping – https://www.cg.tuwien.ac.at/research/publications/2010/Habel-2010-EIN/

[8] Better Sampling – http://www.rorydriscoll.com/2009/01/07/better-sampling/

[9] Adding a Solar Radiance Function to the Hosek Skylight Model – http://cgg.mff.cuni.cz/projects/SkylightModelling/

[10] Exposure value – https://en.wikipedia.org/wiki/Exposure_value

[11] Artist-Friendly HDR With Exposure Values – http://www.reedbeta.com/blog/2014/06/04/artist-friendly-hdr-with-exposure-values/

[12] Implementing a Physically Based Camera: Understanding Exposure – https://placeholderart.wordpress.com/2014/11/16/implementing-a-physically-based-camera-understanding-exposure/

[13]CryENGINE 3 Graphics Gems – http://advances.realtimerendering.com/s2013/Sousa_Graphics_Gems_CryENGINE3.pptx

[14] Filmic Tonemapping Operators – http://filmicgames.com/archives/75

[15] Uncharted 2: HDR Lighting – http://www.gdcvault.com/play/1012351/Uncharted-2-HDR

[16] Jim Hejl on Twitter – https://twitter.com/jimhejl/status/633777619998130176

[17] Academy Color Encoding System Developer Resources – https://github.com/ampas/aces-dev

Part 1 – A Brief (and Incomplete) History of Baked Lighting Representations

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

In the two previous articles I showed workable approaches for approximating the diffuse and specular result from a Spherical Gaussian light source. On its own these techniques might seem a bit silly, since it’s not immediately why the heck it would be useful to light a scene with an SG light source. But then you might remember that these articles started off by discussing methods for storing pre-computed radiance or irradiance in lightmaps or probe grids, which is a subject we’ll finally return to.

A common process in mathematics (particularly statistics) is to take a set of data points and attempt to figure out some sort of analytical curve that can represent the data. This process is known as curve fitting[1], since the goal is to find a curve that is a good fit for the data points. There’s various reasons to do this (such as regression analysis[2]), but I find it can helpful to think of it as a form of lossy compression: a few hundred data points might require kilobytes of data, but if you can approximate that data with a curve the coefficients might only need a few bytes of storage. Here’s a simple example from Wikipedia:

*Fitting various polynomials to data generated by a sine wave. Red is first degree, green is second degree, orange is third degree, blue is forth degree.
By Krishnavedala (Own work) [CC0], via Wikimedia Commons*

In the image there’s a bunch of black dots, which represent a set of data points that we want to fit. In this case the data points come from a sine wave, but in practice the data could take any form. The various colored curves represents attempts at fitting the data using polynomials of varying degrees:

By looking at graphs and the forms of the polynomials it should be obvious that higher degrees allow for more complex curves, but require more coefficients. More coefficients means more data to store, and may also mean that the fitting process is more difficult and/or more expensive. One of the most common techniques used for fitting is least squares[3], which works by minimizing the sum of all differences between the fit curve and the original data.

The other observation we can make it that the resulting fit is essentially a linear combination of basis functions, where the basis functions are , , , and so on. There are many other basis functions we could use here instead of polynomials, such as our old friend the Gaussian! Just like polynomials, a sum of Gaussians can represent more complex functions with a handful of coefficients. As an example, let’s take a set of data points and use least squares to fit varying numbers of Gaussians:

*Fitting Gaussians to a data set using least squares. The left graph shows a fit with a single Gaussian, the middle graph shows a fit with two Gaussians, and the right graph shows a fit with three Gaussians.*

For this example I used curve_fit[4] from the scipy optimization library, which uses a non-linear least squares algorithm. Notice how as I added more Gaussians, the resulting sum became a better approximation of the raw data.

So far we’ve been fitting 1D data sets, but the techniques we’re using also work in multiple dimensions. So for instance, let’s say we had a bunch of scattered samples in random directions on a sphere defined by a 2D spherical coordinate system. And let’s say that these samples represent something like…oh I don’t know…the amount of incoming lighting along an infinitesimally narrow ray oriented in that direction. If we take all of these data points and throw some least squares at it, we can end up with a series of N Spherical Gaussians whose sum can serve as an approximation for radiance in any direction on the sphere! We just need our fitting algorithm to spit out the axis, amplitude, and sharpness of each Gaussian, or if we want can fix one of more of the SG parameters ahead of time and only fit the remaining parameters. It should be immediately obvious why this is useful, since a set of SG coefficients can be stored very compactly compared to a gigantic set of radiance samples. Of course if we only use a few Gaussians the resulting approximation will probably lose details from the original radiance function, but this is no different from spherical harmonics or other common techniques for storing approximate representations of radiance or irradiance. Let’s take a look at a fit actually looks like using an HDR environment map as input data:

*Approximating radiance on a sphere using a sum of Spherical Gaussians. The left image shows the original source radiance function taken from an HDR environment map. The middle image shows a least squares fit of 12 SG’s, and right image shows a fit of 24 SG’s.*

These images were generated Yuriy O’Donnell‘s Probulator[5], which is an excellent tool for comparing various ways of approximating radiance and irradiance on a sphere. One important thing to note here is that the fit was only performed on the amplitude of the SG’s: the axis and sharpness are pre-determined based on the number of SG’s. Probulator generates the lobe axis directions using Vogel’s method[6], but any technique for distributing points on a sphere would also work. Fitting only the lobe amplitude significantly simplifies the solve, since there are less parameters to optimize. Solving for a single parameter also allows us to use linear least squares[7], while fitting all of the parameters would require use of complex and expensive non-linear least squares[8] algorithms. Solving for less parameters also decreases the storage costs, since only the amplitudes need to be stored per-probe while the directions and sharpness can be global constants. Either way it’s good to keep in mind when examining the results. In particular it helps explain why the white lobe in the middle image doesn’t quite line up with the bright windows in the source environment. Aside from that, the results are probably what you would expect: doubling the number of lobes increases the possible complexity and sharpness of the resulting approximation, which in this case allows it to provide a better representation of some of the high-frequency details in the source image.

One odd thing you might notice in the SG approximation is the overly dark areas towards the bottom-right of the sphere. They look somewhat similar to the darkening that can show up in SH approximations, which happens due to negative coefficients being used for the SH polynomials. It turns out that something very similar is happening with our SG fit: the least squares optimizations is returning negative coefficients for some of our lobes in an attempt to minimize the error of the resulting fit.If you’re having trouble understanding why this would happen, let’s go back to 1D for a quick example. For the last 1D example I cheated a bit: the data we were fitting our Gaussians to was actually just some random noise applied to the sum of 3 Gaussians. This is why our Gaussian fit so closely resembled the source data. This time, we’ll fitting some lobes to a more complex data set:

*A data set with a discontinuity near 0, making it more difficult to fit curves to.*

This time the data has a bunch of values near the center that have a value of zero. With such a data set it’s now less obvious how a sum of Gaussians could approximate the values. If we throw least squares at the problem and have it fit two lobes, we get the following:

*The result of using least squares to fit 2 Gaussian lobes to the above data set. The left graph shows the first lobe (red), the middle graph shows the second lobe (green), and the right graph shows the sum of the two lobes (blue) overlaid onto the original data set.*

This time around the optimization resulted in a positive amplitude for the first lobe, but a *negative* amplitude for the second lobe. Looking at the sum overlaid onto to the data makes it clear why this happened: the positive lobe takes care of the all of the positive data points to the left and right, while the negative lobe brings the sum closer to zero in the middle of the graph. Upon closer inspection the negative actually causes the approximation to dip *below* zero into the negatives. We can assume that having this dip results in lower overall error for the approximation, since that’s how least squares works.

In practice, having negative coefficients and negative values from our approximation can undesirable. In fact when approximating radiance or irradiance negative values really just don’t make sense, since they’re physically impossible. In our experience we also found that the visual result of lighting a scene with negative lobes can be quite displeasing, since it tends to look very unnatural to have surfaces that are completely dark. Perhaps you remember this image from the first article showing what L2 SH looks like with a bright area light in the environment:

*A sphere with a Lambertian diffuse BRDF being lit by a lighting environment with a strong area light source. The left image shows the ground-truth result of using monte-carlo integration. The middle image shows the result of projecting radiance onto L2 SH, and then computing irradiance. The right image shows the result of applying a windowing function to the L2 SH coefficients before computing irradiance.*

We found that character faces in particular tended to look really bad when our SH light probes had strong negative lobes, and this turned out to be one of our motivations for investigating alternative approximations. We also ran into some trouble when attempting to compress signed floating point values in BC6H textures: some compressors didn’t even support compressing to that format, and those that did had noticeably worse quality.

With that in mind, it would be nice to constrain a least squares solver in such a way that it only gave us positive coefficients. Fortunately for us such a technique exists, and it’s known as non-negative least squares[9] (or NNLS for short). If we use that technique for fitting SG’s to our original radiance function instead of standard least squares, we get this result instead:

*Fitting SG’s to a radiance function using a non-negative least squares solver. The left image shows the original source radiance function taken from an HDR environment map. The middle image shows an NNLS fit of 12 SG’s, and right image shows a fit of 24 SG’s.*

This time we don’t have the dark areas in the bottom right, since the fit only uses positive lobes. But unfortunately there’s no free lunch here, since the resulting approximation is also a bit “blurrier” compared to the normal least squares fit.

Now that we’ve covered how to generate an SG approximation of radiance from source data, we can take a look at how well it stacks up against other options for a simple use case. Probably the most obvious application is computing irradiance from the radiance approximation, which can be directly used to compute standard Lambertian diffuse lighting. The following images were captured using Probulator, and they show the Stanford Bunny being lit using a few common techniques for approximating irradiance:

*A bunny model being lit by various irradiance approximations generated from the “Pisa” HDR environment map*

With the exception of Valve’s Ambient Cube, all of the approximations hold up very well when compared with the ground truth. The non-negative least squares fit is just a bit more washed out than the least squares fit, but both seem to produce perfectly acceptable results. The SH result is also very good, with no noticeable ringing artifacts. However this particular environment is a somewhat easier case, as the range of intensities isn’t as large as you might find in some realistic lighting scenarios. For a more extreme case, let’s now look at a comparison using the “Ennis” environment map:

*A bunny model being lit by various irradiance approximations generated from the “Ennis” HDR environment map*

This time there’s a much more noticeable difference between the various techniques. This is because the source environment map has a very bright window to the left, which effectively serves as a large area light source. With this particular environment the SG results start to compare pretty favorably to the SH or ambient cube approximations. The results from L2 SH have severe ringing artifacts, which manifests as areas that are either too dark or too bright on the side of the bunny facing to the right. Meanwhile , the windowed version of L2 SH blurs the lighting too much, making it appear as if the environment is more uniform than it really is. The ambient cube probe doesn’t suffer from ringing, but it does have problems with the bright lighting from the left bleeding onto the top and side of the bunny. Looking at the least squares solve for 12 SG’s, the result is pretty nice but there is a bit of ringing evident on the upper-right side of the bunny model. This ringing isn’t present in the non-negative least squares solve, since all of the coefficients end up being positive.

As I mentioned earlier, these Probulator comparisons use fixed lobe directions and sharpness. Consequently we only need to store amplitude, meaning that the storage cost of the 12 SG lobes is equivalent to 12 sets of floating-point RGB coefficients (36 floats total). L2 SH requires 9 sets of RGB coefficients, which adds up to 27 floats. The ambient cube requires only 6 sets of RGB coefficents, which is half that of the SG solve. So for this particular comparison the SG representation of radiance requires the most storage, however this highlights one of the nice points using SG as your basis: you can solve for any number of lobes you’d like, allowing you to choose easily trade off quality vs. storage and performance cost. Valve’s ambient cube is only defined for 6 lobes, and that number can’t be increased since the lobes must remain orthogonal to each other.

For full lighting probes where the sampling surface can have any orientation, storing the radiance or irradiance on a sphere makes perfect sense. However it makes less sense if we would like to store baked lighting in 2D textures where all of the sample points lie on the surface of a mesh. For that case storing data for a full sphere is wasteful, since half of the data will point “into” the surface and therefore won’t be useful. With SG’s this is fairly trivial to correct: we can just choose to solve for lobes that only lie on the upper hemisphere surrounding the surface’s normal direction.

*The left side shows a configuration where 9 SG lobes are distributed about a sphere, forming a full spherical probe. The right side shows 5 SG lobes located on a hemisphere surrounding the normal of a surface (the blue arrow). *

To fully generate the compact radiance representation for an entire 2D lightmap, we need to gather radiance samples at every texel location, and then perform a solve to fit the samples to a set of SG lobes. It’s really no different from the spherical probe case we used as a testbed in Probulator, except now we’re generating many probes. The other main differences is that for lightmap generating the appropriate radiance requires sampling a full 3D scene, as opposed to using an environment map as we did with Probulator. This sort of problem is best solved with a ray tracer, using an algorithm such as path tracing to compute the incoming radiance for a particular ray. The following image shows a visualization of what the lightmap result looks like for a simple scene:

*Hemispherical radiance probes generated at the texel locations of a 2D lightmap applied to a simple scene. Each probe uses 9 SG lobes oriented about the surface normal of the underlying geometry.*

In the previous article we covered techniques that can be used to compute a specular term from SG light sources. If we apply them to a set of lobes that approximate the incoming radiance, then we can compute an approximation of the full environment specular response. For small lobe counts this is only going to be practical for materials with a relatively high roughness. This is because our SG approximation of the incoming radiance won’t be able to capture high-frequency details from the environment, and it would it be very obvious if those details were missing from the reflections on smooth surfaces. However for rougher surfaces where the BRDF itself starts to act as a low-pass filter, an SG approximation won’t have as much noticeable error. As an example, here’s what SG specular looks like for a test scene with a GGX roughness value of 0.25:

*Comparison of indirect specular approximation for a test scene with a GGX roughness of 0.25. The top-left image is a path-traced rendering of the final scene with full indirect and direct lighting. The top-right image shows the indirect environment specular term from SG9 lightmaps, with the exposure increased by 8x. The bottom left image shows the indirect specular term from L2 SH. The bottom right image shows the indirect specular term from a path-traced render of the scene.*

Compared to the ground truth, the SG approximation does pretty well in some places and not-so-well in others. In general it captures a lot of the overall specular response, but suffers from some of the higher-frequency detail being absent in the probes. This results in certain areas looking a bit “washed-out”, such as the right-most wall of the scene. You can also see that the reflections of the cylinder, sphere, and torus are not properly represented in the SG version for the same reason. On the positive side, lightmap samples are pretty dense in terms of their spatial distribution. They’re far more dense than what you typically achieve with sparse cubemap probes placed throughout the scene, which typically suffer from all kinds of occlusion and parallax artifacts. The SG specular also compares pretty favorably to the L2 SH result (despite having the same storage cost), which looks even more washed-out than the SG result. The SH implementation used a 3D lookup texture to store pre-computed SH coefficients, and you can see some interpolation artifacts from this method if you look at the far wall perpendicular to the camera.

In the first part of our presentation[10] at last year’s Physically Based Shading Course[11], Dave covered some of these details and also shared some information about how we implemented SG lighting probes into The Order: 1886. Much of the implementation was very similar to what I’ve described in this series of articles: we stored 5-12 SG lobes (the count was chosen per-level chunk) in our 2D lightmaps with fixed axis directions and sharpnesses, and we evaluated diffuse and specular lighting using the approximations that I outlined earlier. For dynamic meshes, we baked uniform 3D grids of spherical probes containing 9 SG lobes that were stored in 3D textures. The grids were defined by OBB’s that were hand-placed in the scene by our lighting artists, along with density parameters. In both cases we made use of hardware texture filtering to interpolate between neighboring probes before computing per-pixel lighting.

Much of our implementation closely followed the work of Wang[12] and Xu[13], at least in terms of the techniques used for approximating diffuse and specular lighting from a set of SG lobes. Where our work diverged quite a bit was in the choice to use fixed lobe directions and sharpness values. Both Wang and Xu generated their set of SG lobes by performing a solve on a single environment map, which produced the necessary axis, sharpness, and amplitude parameters. In our case, we always knew that we were going need many probes in order to maintain high-fidelity pre-computed lighting for our scenes. At the time (early 2014) we were already employing 2D lightmaps containing L1 H-basis hemispherical probes (4 coefficients) and 3D grids containing L2 spherical harmonics probes. Both could be quite dense in spatial terms, which allowed capturing important shadowing detail.

To make SG’s work with for these requirements, we had to carefully consider our available trade-offs. After getting a simple test-bed up and running where we could bake 2D lightmaps for a scene, it became quickly apparent that varying the axis directions per-texel wasn’t necessarily the best choice for us. Aside from the obvious issue of requiring more storage space and making the solve more complex and expensive, we also ran into issues resulting from interpolating the axis direction over a surface. The problem is most readily apparent at shadow boundaries: one texel might have visibility of a bright light source which causes a lobe to point in that direction, while its neighboring pixel might have no visibility and thus could end up with a lobe pointing in a completely different direction. The axis would then interpolate between the two directions for pixels between the two texels, which can cause noticeable specular shifting. This isn’t necessarily an unsolvable problem (the Frequency Domain Normal Map Filtering paper[14] extended their EM solver with a term that attempts to align neighboring lobes for coherency), but considering our time and memory constraints it made sense to just sidestep the problem altogether. Ultimately we ended up using fixed lobe directions, using the UV tangent frame as the local coordinate space for lightmap probes. Tangent space is natural for this purpose since it’s Z axis is the surface normal of the mesh, and also because it tends to be continuous over a mesh (you’ll have discontinuities wherever you have UV seams, but artists tend to hide those as best they can anyway). For the 3D probe grids, the directions were in world space for simplicity.

After deciding to fix the lobe directions, we also ultimately decided to go with a fixed sharpness value as well. This of course has the same obvious benefits as fixing the axis direction (less storage, simpler solve), which were definitely appealing. However another motivating factor came from the way were doing our solve. Or rather, our lack of a proper solve. Our early testbed performed all lightmap baking on the CPU, which allowed us to easily integrate packages like Eigen[15] so that we could use a battle-tested least squares solver. However our actual production baking farm at Ready at Dawn uses a Cuda baker that leverages Nvidia’s OptiX library to perform arbitrary ray tracing on a cluster of GPU’s.While Cuda does have optimization libraries that could have achieved what we wanted, we faced a bigger problem: memory. Our baker worked by baking many sample points in parallel on the GPU, with a kernel program that would generate and trace the many rays required for monte carlo integration. When we previously used SH and H-basis this approach worked well: both SH and H-basis are built upon orthogonal basis functions, which allows for continuous integration by projecting each sample onto those basis functions. Gathering thousands of samples per-texel is feasible with this setup, since those samples don’t need to be explicitly stored in memory. Instead, each new sample is projected onto the in-progress result for the texel and then discarded. This is not the case when performing a solve: the solver needs access to *all* of the samples, which means keeping them all around in memory. This is a big problem when you have many texels in flight simultaneously, and only a limited amount of GPU memory. Like the intepolation issue it’s probably not unsolveable, but we really looking for a less risky approach that would be more of a drop-in replacement for the SH integration.

Ultimately we ended up saying “screw it”, and projected on the SG lobes as they formed an orthogonal basis (even though they didn’t). Since the basis functions weren’t orthogonal the results ended up rather blurry compared to a least squares solve, which muddied some of the detail in the source environment for a probe. Here’s a comparison to show you what I mean:

*A comparison of different techniques for computing a set of SG lobes that approximate the radiance from an environment map*

Of the three techniques presented here, the naive projection is the worst at capturing the details from the source map and also the blurriest. As we saw earlier the least squares solve is the best at capturing sharp changes, but achieves this by over-darkening certain areas. NNLS is the “just right” fit for this particular case, doing a better job of capturing details compared to the projection but without using any negative lobes.

Before we switched to baking SG probes for our grids and lightmaps, we had a fairly standard system for applying pre-convolved cubemap probes that were hand-placed throughout our scenes. Previously these were our only source of environment specular lighting, but once we had SG’s working we began to to use the SG specular approximation to compute environment specular directly from lightmaps and probe grids. Obviously our low number of SG’s was not sufficient for accurately approximating environment specular for smooth and mirror-like surfaces, so our cubemap system remained relevent. We ended up coming up with a simple scheme to choose between cubemap and lightmap specular per-pixel based on the surface roughness, with a small zone in between where we would blend between the two specular sources. The following images taken from our SIGGRAPH slides use a color-coding to showing the specular source chosen for each pixel from one of our scenes:

*The top image shows a scene from The Order: 1886. The bottom image shows the same scene with a color coding applied to show the environment specular source for each pixel.*

To finish off, here’s some more comparison images taken from our SIGGRAPH slides:

*Several comparison images from The Order: 1886 showing scenes with and without environment specular from SG lightmaps*

We were quite happy with the improvements that come from the SG baking pipeline we introduced for The Order, but we also feel like we’ve barely scratched the surface. Our decision to use fixed lobe directions and sharpness values was probably the right one, but it also limits what we can do. When you take SG’s and compare them to a fixed set of basis functions like SH, perhaps the biggest advantage is the fact that you can use an arbitrary combination of lobes to represent a mix of high and low-frequency features. So for instance you can represent a sun and a sky by having one wide lobe with the sky color, and one very narrow and bright lobe oriented towards the sun. We gave up that flexibility when we decided to go with our simpler ad-hoc projection, and it’s something we’d like to explore further in the future. But until then we can at least enjoy the benefits of having a representation that allows for an environment specular approximation and also avoids ringing artifacts when approximating diffuse lighting.

Aside from the solve, I also think it would be worth taking the time to investigate better approximations for the specular BRDF. In particular I would like to try using something better than just evaluating the cosine, Fresnel, and shadow-masking terms at the center of the warped BRDF lobe. The assumption that those terms are constant over the lobe break down the most when the roughness is high, and in our case we’re only ever using SG specular for rough materials! Therefore I think it would be worth the effort to come up with a more accurate representation of those terms.

[1] Curve fitting – https://en.wikipedia.org/wiki/Curve_fitting

[2] Regression analysis – https://en.wikipedia.org/wiki/Regression_analysis

[3] Least squares – https://en.wikipedia.org/wiki/Least_squares

[4] scipy.optimize.curve_fit – http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html#scipy-optimize-curve-fit

[5] Probulator – https://github.com/kayru/Probulator

[6] Spreading points on a disc and on a sphere – http://blog.marmakoide.org/?p=1

[7] Linear least squares – https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)

[8] Non-linear least squares – https://en.wikipedia.org/wiki/Non-linear_least_squares

[9] Non-negative least squares – https://en.wikipedia.org/wiki/Non-negative_least_squares

[10] Advanced Lighting R&D at Ready At Dawn Studios – http://blog.selfshadow.com/publications/s2015-shading-course/rad/s2015_pbs_rad_slides.pptx

[11] SIGGRAPH 2015 Course: Physically Based Shading in Theory and Practice – http://blog.selfshadow.com/publications/s2015-shading-course/

[12] All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance – http://renpr.org/project/sg_ssdf.htm

[13] Anisotropic Spherical Gaussians – http://cg.cs.tsinghua.edu.cn/people/~kun/asg/

[14] Frequency Domain Normal Map Filtering – http://www.cs.columbia.edu/cg/normalmap/

[15] Eigen – http://eigen.tuxfamily.org/index.php?title=Main_Page

Part 1 – A Brief (and Incomplete) History of Baked Lighting Representations

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

In the previous article, we explored a few ways to compute the contribution of an SG light source when using a diffuse BRDF. While this is already useful, it would be nice to be able work with more complex view-dependent BRDF’s so that we can also compute a specular contribution. In this article I’ll explain a possible approach for approximating the response of a microfacet specular BRDF when applied to an SG light, and also introduce the concept of Anisotropic Spherical Gaussians.

You probably recall from the past 5 years of physically based rendering presentations[1] that a standard microfacet BRDF takes the following structure:

Recall that *D*(h) is the *distribution* term, which tells us the percentage of active microfacets for a particular combination of view and light vectors. It’s also commonly known as the *normal distribution function*, or “NDF” for short. It’s generally parameterized on a *roughness* parameter, which essentially describes the “bumpiness” of the underlying microgeometry. Lower roughness values lead to sharp mirror-like reflections with a very narrow and intense specular lobe, while higher roughness values lead to more broad reflections with a wider specular lobe. Most modern games (including The Order: 1886) use the GGX (AKA Trowbridge-Reitz) distribution for this purpose.

*The top image shows a GGX distribution term with a roughness parameter of 0.5. The bottom image shows the same GGX distribution term with a roughness parameter of 0.1. For both graphs, the X axis represents the angle between the surface normal and the half vector. Click on either image for an interactive graph.*

Next we have G(i, o, h) which is referred to as the *geometry **term* or the *masking-shadow function*. Eric Heitz’s paper on the masking-shadow function[2] has a fantastic explanation of how these terms work and why they’re important, so I would strongly recommend reading through it if you haven’t already. As Eric explains, the geometry term actually accounts for two different phenomena. The first is the local occlusion of reflections by other neighboring microfacets. Depending on the angle of the incoming lighting and the bumpiness of the microsurface, a certain amount of the lighting will be occluded by the surface itself, and this term attempts to model that. The other phenomenon handled by this term is visibility of the surface from the viewer. A surface that isn’t visible naturally can’t reflect light towards the viewer, and so the geometry term models this masking effect using the view direction and the roughness of the microsurface.

*The Smith visibility term for GGX as a function of the angle between the surface normal and the light direction. The roughness used is 0.25, and the angle between the normal and the view direction is 0. Click on the image for an interactive graph.*

Finally we have F(o, h), which is the *Fresnel* term. This familiar term determines the amount of light that is reflected vs. the amount that is refracted or absorbed, which varies depending on the index of refraction for a particular interface as well as the angle of incidence. For a microfacet BRDF we compute the Fresnel term using the angle between the active microfacet direction (the half vector) and the light or view direction (it is equivalent to use either). This generally causes the reflected intensity to increase when both the viewing direction and the light direction are grazing with respect to the surface normal. In real-time graphics Schlick’s approximation is typically used to represent the Fresnel term, since it’s a bit cheaper than evaluating the actual Fresnel equations. It is also common to parameterize the Fresnel term on the reflectance at zero incidence (referred to as “F0”) instead of directly working with an index of refraction.

*Schlick’s approximation of the Fresnel term as a function of the angle between the half vector and the light direction. The graph uses a value of 0.04 for F0. Click on the image for an interactive graph.*

Let’s now return to our example of a Spherical Gaussian light source. In the previous article we explored how to approximate the purely diffuse contribution from such a light source, but how do we handle the specular response? If we go down this road, we ideally want to do it in a way that uses a microfacet specular model so that we can remain consistent with our lighting from punctual light sources and IBL probes. If we start out by just plugging our BRDF and SG light into the rendering equation we get this monstrosity:

Unlike diffuse we now have multiple terms inside the integral, many of which are view-dependent. This suggests that we will need to combine several aggressive optimizations in order to get anywhere close to our desired result.

Let’s start out by taking the distribution term on its own, since it’s arguably the most important part of the specular BRDF. The distribution determines the overall shape and intensity of the resulting specular highlight, and has a widely-varying frequency response over the domain of possible roughness values. If we look at the graph of the GGX distribution from the earlier section, the shape does resemble a Gaussian. Unfortunately it’s not an exact match: the GGX distribution has the characteristic narrow highlight and long tails which we can’t match using a single SG. We could get a closer fit by summing multiple Gaussians, but for this article we’ll keep things simple by sticking with a single lobe. If we go back to Wang et al.‘s paper[3] that we referenced earlier, we can see that they suggest a very simple fit of an SG to a Cook-Torrance distribution term:

If we look closely, we can see that they’re actually fitting for the Gaussian model mentioned in the original Cook-Torrance paper[4], which is similar to the one used in the Torrance-Sparrow model[5]. Note that this should not be confused the with Beckmann distribution that’s also mentioned in the Cook-Torrance paper, which is actually a 2D Gaussian in the slope domain (AKA parallel plane domain). However the shape isn’t even the biggest problem here, as the variant they’re using isn’t normalized. This means the peak of the Gaussian will always have a height of 1.0, rather than shrinking when the roughness increases and growing when the roughness decreases. Fortunately this is really easy to fix, since we have a simple analytical formula for computing the integral of an SG. Therefore if we set the amplitude to 1 over the integral, we end up with a normalized distribution:

SG DistributionTermSG(in float3 direction, in float roughness) { SG distribution; distribution.Axis = direction; float m2 = roughness * roughness; distribution.Sharpness = 2 / m2; distribution.Amplitude = 1.0f / (Pi * m2); return distribution; }

Let’s take a look at a graph of our distribution term, and see how far off it is from our target:

*Top graph shows a comparison between GGX, Beckmann, normalized Gaussian, and SG distribution terms with a roughness of 0.25. The bottom shows the same comparison with a roughness of 0.5. Click on either image for an interactive graph.*

It should come as no surprise that our SG distribution is almost an exact match for a normalized version of a Gaussian distribution. When the roughness is lower it’s also a fairly close match for Beckmann, but for higher roughness the difference gets to be quite large. In both cases our distribution isn’t a perfect fit for GGX, but it’s a workable approximation.

So we now have a normal distribution function in SG format, but unfortunately we’re not quite ready to use it as-is. The problem is that we’ve defined our distribution in the half-vector domain: the axis of the SG points in the direction of the surface normal, and we use the half-vector as our sampling direction. If we want to use an SG product to compute the result of multiplying our distribution with an SG light source, then we need to ensure that the distribution lobe is in the same domain as our light source. Another way to think about this is that center of our distribution shifts depending on viewing angle, since the half-vector also shifts as the camera moves.

In order to make sure that our distribution lobe is in the correct domain, we need “warp” our distribution so that it lines up with the current BRDF slice for our viewing direction. If you’re wondering what a BRDF slice is, it basically tells you “if I pick a particular view direction. what is the value of my BRDF for a given light direction?”. So if you had a mirror BRDF, the slice would just be a single ray pointing in the direction of the view ray reflected off the surface normal. For microfacet specular BRDF’s, you get a lobe that’s roughly centered around the reflected view direction. Here’s what a polar graph of a GGX BRDF slice looks like if we only consider the distribution term:

*Polar graph of the GGX distribution term from two different viewing angles. The blue line is the view direction, the green line is the surface normal, and the pink line is the reflected view direction. The top image shows the resulting slice when the viewing angle is 0 degrees, and the bottom images shows the resulting slice when the viewing angle is 45 degrees.*

Wang et al. proposed a simple spherical warp operator that would orient the distribution lobe about the reflected view direction, while also modifying the sharpness to take into account the differential area at the original center of the lobe:

SG WarpDistributionSG(in SG ndf, in float3 view) { SG warp; warp.Axis = reflect(-view, ndf.Axis); warp.Amplitude = ndf.Amplitude; warp.Sharpness = ndf.Sharpness; warp.Sharpness /= (4.0f * max(dot(ndf.Axis, view), 0.0001f)); return warp; }

Let’s now look at the result of that warp, and compare it to what the actual GGX distribution looks like:

*Result of applying a spherical warp to the SG distribution term (green) compared with the actual GGX distribution (red). The top graph shows a viewing angle of 0 degrees, and the bottom graph shows a viewing angle of 45 degrees.*

A quick look at the graph shows us that the shape is a bit off, but our warped lobe is ending up in approximately in the right spot. We’ll revisit the lobe shape later, but for now let’s try combining our distribution with the rest of our BRDF.

In the previous section we figured out how to obtain an SG approximation of our distribution term, and also warp it so that it’s in the correct domain for multiplication with our SG light source. Using our SG product operator would allow to us to represent the result of multiplying the distribution with the light source as another SG, which we could then multiply with other terms using SG operators…or at least we could if we were to represent the remaining terms as SG’s. Unfortunately this turns out to be a problem: the geometry and Fresnel terms are nothing at all like a Gaussian, which rules out approximating them as an SG. Wang et al. sidestep this issue by making the somewhat-weak assumption that the values of these terms will be constant across the entire BRDF lobe, which allows them to pull the terms out of the integral and evaluate them only for the axis direction of the BRDF lobe. This allows the resulting BRDF to still capture some of the glancing angle effects, with similar performance cost to evaluating those terms for a punctual light source. The downside is that the error of these terms will increase as the BRDF lobe becomes wider (increasing roughness), since the value of the geometry and Fresnel terms will vary more the further they are from the lobe center. Putting it all together gives the following specular BRDF:

The last thing we need to account for is the cosine term that needs to be multiplied with the BRDF inside of the hemispherical integral. Wang et al. suggest using an SG product to compute an SG representing the result of multiplying the distribution term SG and their SG approximation of a clamped cosine lobe, which can then be multiplied with the lighting lobe using an SG inner product. In order to avoid another expensive SG operation, we will instead use the same approach that we used for geometry and Fresnel terms and evaluate the cosine term using the BRDF lobe axis direction. Implementing it in shader code gives us the following:

float GGX_V1(in float m2, in float nDotX) { return 1.0f / (nDotX + sqrt(m2 + (1 - m2) * nDotX * nDotX)); } float3 SpecularTermSGWarp(in SG light, in float3 normal, in float roughness, in float3 view, in float3 specAlbedo) { // Create an SG that approximates the NDF. SG ndf = DistributionTermSG(normal, roughness); // Warp the distribution so that our lobe is in the same // domain as the lighting lobe SG warpedNDF = WarpDistributionSG(ndf, view); // Convolve the NDF with the SG light float3 output = SGInnerProduct(warpedNDF, light); // Parameters needed for the visibility float3 warpDir = warpedNDF.Axis; float m2 = roughness * roughness; float nDotL = saturate(dot(normal, warpDir)); float nDotV = saturate(dot(normal, view)); float3 h = normalize(warpedNDF.Axis + view); // Visibility term evaluated at the center of // our warped BRDF lobe output *= GGX_V1(m2, nDotL) * GGX_V1(m2, nDotV); // Fresnel evaluated at the center of our warped BRDF lobe float powTerm = pow((1.0f - saturate(dot(warpDir, h))), 5); output *= specAlbedo + (1.0f - specAlbedo) * powTerm; // Cosine term evaluated at the center of the BRDF lobe output *= nDotL; return max(output, 0.0f); }

Let’s now (finally) take a look at what our specular approximation looks like for a scene being lit by an SG light source:

*A scene being lit by an SG light source using our diffuse and specular approximations. The scene is using a uniform roughness of 0.128.*

So if we look at the specular highlights in the above image, you may notice that while the highlights on the red and green walls look pretty good, the highlight on floor seems a bit off. The highlight is rather wide and rounded, and our intuition tells that a highlight viewed at a grazing angle should appear vertically stretched across the floor. To determine why the look is so off, we need to revisit our warp of the distribution term. Previously when we looked at the polar graph of the distribution I noted that the shape of the resulting lobe was a bit off, even though it was oriented in approximately the right direction to line up with the BRDF slice. To get a better idea of what’s going on, let’s now take a look at a 3D graph of the GGX distribution term:

*3D graph of the GGX distribution term. The left image shows the distribution when the viewing angle is very low, while the right image shows the distribution when the viewing angle is very steep.*

Looking at the left image where the angle between the view direction and the surface normal are very low, the distribution is radially symmetrical just like an SG lobe. However as the viewing angle increases the lobe begins to stretch, looking more and more like the non-symmetrical lobe that we see in the right image. The stretching of the lobe is what causes the stretched highlights that occur when applying the BRDF, and our warped SG is unable to properly represent it since it must remain radially symmetric about its axis.

Luckily for us, there is a better way. In 2013 Xu et al.released a paper titled Anisotropic Spherical Gaussians[6], where they explain how they extended SG’s to support anisotropic lobe width/sharpness. They’re defined like this:

Instead of having a single axis direction, an ASG now has , , and , which are three orthogonal vectors forming a complete basis. It’s very similar to a tangent frame, where the normal, tangent, and bitangent together make up an orthonormal basis. With an ASG you also now have two separate sharpness parameters, and , which control the sharpness with respect to and . So for example setting to 16 and to 64 will result in stretched Gaussian lobe that’s skinnier along the direction, and with its center located at . Visualizing such an ASG on the surface of a sphere gives you this:

*An Anisotropic Spherical Gaussian visualized on the surface of a sphere. has a value of 16, and has a value of 64.*

Like SG’s, the equations lend themselves to simple HLSL implementations:

struct ASG { float3 Amplitude; float3 BasisZ; float3 BasisX; float3 BasisY; float SharpnessX; float SharpnessY; }; float3 EvaluateASG(in ASG asg, in float3 dir) { float sTerm = saturate(dot(asg.BasisZ, dir)); float lambdaTerm = asg.SharpnessX * dot(dir, asg.BasisX) * dot(dir, asg.BasisX); float muTerm = asg.SharpnessY * dot(dir, asg.BasisY) * dot(dir, asg.BasisY); return asg.Amplitude * sTerm* exp(-lambdaTerm - muTerm); }

The ASG paper provides us with formulas for two operations that we can use to improve the quality of our specular approximation for SG light sources. The first is a new warping operator that can take an NDF represented as an isotropic SG, and stretch it along the view direction to produce an ASG that better represents the actual BRDF. The other helpful forumula it provides is for convolving an ASG with an SG, which we can use to convolve a anisotropically warped NDF lobe with an SG lighting lobe. Let’s take a look at how their improved warp looks when graphing the NDF for a large viewing angle:

*The left image is a 3D graph of the distribution term when using a spherical warp. The middle image is the resulting distribution term when using an anisotropic warp. The right image is the actual GGX distribution term*.

The anisotropic distribution looks much closer to the actual GGX NDF, since it now has the vertical stretching that we were missing. Let’s now implement their formulas in HLSL so we can try the new warp in our test scene:

float3 ConvolveASG_SG(in ASG asg, in SG sg) { // The ASG paper specifes an isotropic SG as // exp(2 * nu * (dot(v, axis) - 1)), // so we must divide our SG sharpness by 2 in order // to get the nup parameter expected by the ASG formula float nu = sg.Sharpness * 0.5f; ASG convolveASG; convolveASG.BasisX = asg.BasisX; convolveASG.BasisY = asg.BasisY; convolveASG.BasisZ = asg.BasisZ; convolveASG.SharpnessX = (nu * asg.SharpnessX) / (nu + asg.SharpnessX); convolveASG.SharpnessY = (nu * asg.SharpnessY) / (nu + asg.SharpnessY); convolveASG.Amplitude = Pi / sqrt((nu + asg.SharpnessX) * (nu + asg.SharpnessY)); float3 asgResult = EvaluateASG(convolveASG, sg.Axis); return asgResult * sg.Amplitude * asg.Amplitude; } ASG WarpDistributionASG(in SG ndf, in float3 view) { ASG warp; // Generate any orthonormal basis with Z pointing in the // direction of the reflected view vector warp.BasisZ = reflect(-view, ndf.Axis); warp.BasisX = normalize(cross(ndf.Axis, warp.BasisZ)); warp.BasisY = normalize(cross(warp.BasisZ, warp.BasisX)); float dotDirO = max(dot(view, ndf.Axis), 0.0001f); // Second derivative of the sharpness with respect to how // far we are from basis Axis direction warp.SharpnessX = ndf.Sharpness / (8.0f * dotDirO * dotDirO); warp.SharpnessY = ndf.Sharpness / 8.0f; warp.Amplitude = ndf.Amplitude; return warp; } float3 SpecularTermASGWarp(in SG light, in float3 normal, in float roughness, in float3 view, in float3 specAlbedo) { // Create an SG that approximates the NDF SG ndf = DistributionTermSG(normal, roughness); // Apply a warpring operation that will bring the SG from // the half-angle domain the the the lighting domain. ASG warpedNDF = WarpDistributionASG(ndf, view); // Convolve the NDF with the light float3 output = ConvolveASG_SG(warpedNDF, light); // Parameters needed for evaluating the visibility term float3 warpDir = warpedNDF.BasisZ; float m2 = roughness * roughness; float nDotL = saturate(dot(normal, warpDir)); float nDotV = saturate(dot(normal, view)); float3 h = normalize(warpDir + view); // Visibility term output *= GGX_V1(m2, nDotL) * GGX_V1(m2, nDotV); // Fresnel float powTerm = pow((1.0f - saturate(dot(warpDir, h))), 5); output *= specAlbedo + (1.0f - specAlbedo) * powTerm; // Cosine term output *= nDotL; return max(output, 0.0f); }

If we swap out our old specular approximation for one that uses an anisotropic warp, our test scene now looks much better!

*Diffuse and specular approximations applied to an SG light source, using an anisotropic warp to approximate the NDF as an ASG.*

Many of the images that I used for visually comparing the SG BRDF approximations were generated using Disney’s BRDF Explorer. When we were doing our initial research into SG’s and figuring out how to implement them, BRDF Explorer was extremely valuable both for understanding the concepts and for experimenting with different variations. If you’d like to this yourself, there’s a very easy way to do that courtesy of Nick Brancaccio. Nick was kind of enough to create his own awesome WebGL version of BRDF Explorer, and it comes pre-loaded with options for comparing an approximate SG specular BRDF with the GGX BRDF. I would recommend checking it out if you’d like to play around with the BRDF’s and make some pretty 3D graphs!

[1] Background: Physics and Math of Shading (SIGGRAPH 2013 Course: Physically Based Shading in Theory and Practice) – http://blog.selfshadow.com/publications/s2013-shading-course/hoffman/s2013_pbs_physics_math_notes.pdf

[2] Understanding the Masking-Shadowing Function in Microfacet-Based BRDFs – http://jcgt.org/published/0003/02/03/

[3] All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance – http://research.microsoft.com/en-us/um/people/johnsny/papers/sg.pdf

[4] A Reflectance Model for Computer Graphics – http://inst.cs.berkeley.edu/~cs294-13/fa09/lectures/cookpaper.pdf

[5] Theory for Off-Specular Reflection From Roughened Surfaces – http://www.graphics.cornell.edu/~westin/pubs/TorranceSparrowJOSA1967.pdf

[6] Anisotropic Spherical Gaussians – http://cg.cs.tsinghua.edu.cn/people/~kun/asg/

[7] BRDF Explorer – https://github.com/wdas/brdf

[8] WebGL BRDF Explorer – https://depot.floored.com/brdf_explorer

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

In the previous post we covered a few of the universal properties of SG’s. Now that we have a few tools on our utility belt, let’s discuss an example of how we can actually use those properties to our advantage in a rendering scenario. Let’s say we have a surface point **x** being lit by a light source **L**, with the light source being represented by an SG named **G _{L}**. Recall from the previous article that the equation for computing the outgoing radiance towards the eye for a surface with a Lambertian diffuse BRDF looks like the following:

For punctual light sources that are essentially a scaled delta function, computing this is as easy as N dot L. But we’re in trouble if we have an area light source, since we typically don’t have a closed form solution to the integral. But let’s suppose that we have some strange Gaussian light source, whose angular falloff can be exactly represented by an SG (normally area light sources are considered to have uniform emission over their surface, but let’s imagine we have case where the emission is non-uniform). If we can treat the light as an SG, then we can start to consider some of the handy Gaussian tools that we laid out earlier. In particular the inner product starts to seem really useful: it gives us the result of integrating the product of two SG’s, which is basically what we’re trying to accomplish in our diffuse lighting equation. The big catch is that we’re not integrating the product of two SG’s, we’re instead integrating the product of an SG with a clamped cosine lobe. Obviously a Gaussian lobe has a different shape compared to a clamped cosine lobe, but perhaps if we squint our eyes from a distance you could substitute one for another. This approach was taken by Wang et al.[1], who suggested fitting a cosine lobe to a single SG with **λ**=2.133 and **a**=1.17. If we follow in their footsteps, the diffuse calculation is straightforward:

SG CosineLobeSG(in float3 direction) { SG cosineLobe; cosineLobe.Axis = direction; cosineLobe.Sharpness = 2.133f; cosineLobe.Amplitude = 1.17f; return cosineLobe; } float3 SGIrradianceInnerProduct(in SG lightingLobe, in float3 normal) { SG cosineLobe = CosineLobeSG(normal); return max(SGInnerProduct(lightingLobe, cosineLobe), 0.0f); } float3 SGDiffuseInnerProduct(in SG lightingLobe, in float3 normal, in float3 albedo) { float3 brdf = albedo / Pi; return SGIrradianceInnerProduct(lightingLobe, normal) * brdf; }

Not too bad, eh? Of course it’s worth taking a closer look at our cosine lobe approximation, since that’s definitely going to introduce some error. Perhaps the best way to do this is to look at the graphs of a real cosine lobe and our SG approximation side-by-side:

*Comparison of a clamped cosine cosine lobe (red) with an SG approximation (blue)*

Just from looking at the graph it’s fairly obvious that an SG isn’t necessarily a great fit for a cosine lobe. First of all, the amplitude actually goes above 1, which might seem a bit weird at first glance. However it’s necessary to ensure that the area under the curve remains somewhat consistent with the cosine lobe, since there would otherwise be a loss of energy. The other weirdness stems from the fact that an SG never actually hits 0 anywhere on the sphere, hence the long “tail” on the graph of the SG. This essentially means that if the SG were integrated against a punctual light source, the lighting would “wrap” around the sphere past the point where N dot L is equal to 0. The situation actually isn’t all that different from an SH representation of a cosine lobe, which also extends past π/2:

*L1 (green) and L2 (purple) SH approximation of a clamped cosine lobe compared with an SG approximation (blue) and the actual clamped cosine (red).*

In the SH case the approximation actually goes negative, which is arguably worse than the long tail of the SG approximation. The L1 approximation is particularly bad in this regard. If at this point you’re trying to imagine what these approximations look like on a sphere, let me save you the trouble by providing an image:

*From left to right: actual clamped cosine lobe, SG cosine approximation, L2 SH cosine approximation*

Now that we’ve finished analyzing you approximation of a cosine lobe, we need to take a look at the actual results of computing diffuse lighting from an SG light source. Let’s start off by graphing the results of computing irradiance using an SG inner product, and compare it against what we get by using brute-force numerical integration to compute the result of multiplying the SG with an actual clamped cosine (*not* the approximate SG cosine lobe that we use for the inner product):

*The resulting irradiance from an SG light source (with sharpness of 4.0) as a function of the angle between the light source and the surface normal. The red graph is the result of using numerical integration to compute the integral of the SG light source multiplied with a clamped cosine, while the blue graph was computed using an SG inner product of the light source with a cosine lobe approximated as an SG.*

As you might expect, the inner product approximation has some error when compared with the “ground truth” provided by numerical integration. It’s worth pointing out that this error is purely a consequence of approximating the clamped cosine lobe as an SG: the inner product provides the exact result of the integral, and thus shouldn’t introduce any error on its own. Despite this error, the resulting irradiance isn’t hugely far off from our ground truth. The biggest difference is for the angles facing away from the light, where the SG inner product version has a stronger tail. Visualizing the resulting diffuse on a sphere gives us the following:

*The left sphere shows the resulting diffuse lighting from an SG light source with a sharpness of 4.0, where the irradiance was computed using monte carlo importance sampling. The right sphere shows the resulting diffuse lighting from computing irradiance using an SG inner product with an approximation of a cosine lobe.*

As an alternative to representing the cosine lobe with an SG and computing the inner product, we can consider a cheaper approximation. One advantage of working with SG’s is that each lobe is always symmetrical about its axis, which is also where its value is the highest. We also discussed earlier how we can compute the integral of an SG over the sphere, which gives us its total energy. This suggests that if we want to be frugal with our shader cycles, we can pull terms out of the integral over the sphere/hemisphere and only evaluate them for the SG axis direction. This obviously introduces error, but that error may be acceptable if the term we pull out is relatively “smooth”. If we apply this approximation to computing irradiance and diffuse lighting, we get this:

Translating to HLSL, we get the following functions:

float3 SGIrradiancePunctual(in SG lightingLobe, in float3 normal) { float cosineTerm = saturate(dot(lightingLobe.Axis, normal)); return cosineTerm * 2.0f * Pi * (lightingLobe.Amplitude) / lightingLobe.Sharpness; } float3 SGDiffusePunctual(in SG lightingLobe, in float3 normal, in float3 albedo) { float3 brdf = albedo / Pi; return SGIrradiancePunctual(lightingLobe, normal) * brdf; }

If we overlay the graph of our super-cheap irradiance approximation on the graph we were looking at earlier, we get this:

*The resulting irradiance from an SG light source (with sharpness of 4.0) as function of the angle between the light source and the surface normal. The red graph was computed using numerical integration, while the blue graph was computed using an SG inner product of the light source with a cosine lobe approximated as an SG. The green graph was computed by pulling the cosine term out of the integral, and multiplying it with the result of integrating the SG light about the sphere.*

The result shouldn’t be a surprise: it’s just a scaled version of the standard clamped cosine.It’s pretty obvious just by looking that this particular optimization will introduce quite a bit of error, particularly where theta is greater than π/2. But it is cheap, since we’ve effectively turned an SG into a point light. This is makes it useful tool for cases where we may want to approximate the convolution of an SG light source with a BRDF or some other function that isn’t easily represented as an SG.

So it’s nice to have a cheap option, but what if we want more accuracy than our inner product approximation? Fortunately for us, Stephen Hill was able to formulate another alternative approximation that directly fits a curve to the integral of a cosine lobe with an SG. His implementation is actually formulated for a normalized SG (where the integral about the sphere is equal to 1.0), but we can easily account for this by computing the integral and scaling the result by that value:

float3 SGIrradianceFitted(in SG lightingLobe, in float3 normal) { const float muDotN = dot(lightingLobe.Axis, normal); const float lambda = lightingLobe.Sharpness; const float c0 = 0.36f; const float c1 = 1.0f / (4.0f * c0); float eml = exp(-lambda); float em2l = eml * eml; float rl = rcp(lambda); float scale = 1.0f + 2.0f * em2l - rl; float bias = (eml - em2l) * rl - em2l; float x = sqrt(1.0f - scale); float x0 = c0 * muDotN; float x1 = c1 * x; float n = x0 + x1; float y = saturate(muDotN); if(abs(x0) <= x1) y = n * n / x; float result = scale * y + bias; return result * ApproximateSGIntegral(lightingLobe); }

The result is very close to the ground truth, which is very cool considering that it might actually be cheaper than our inner product approximation!

*The resulting irradiance from an SG light source (with sharpness of 4.0) as function of the angle between the light source and the surface normal. The red graph was computed using numerical integration, while the blue graph was computed using an SG inner product of the light source with a cosine lobe approximated as an SG. The orange graph was computed using Stephen Hill’s fitted curve approximation.*

If we once again visualize the result on the sphere and compare with our previous results, we get the following:

*The left sphere shows the resulting diffuse lighting from an SG light source with a sharpness of 4.0, where the irradiance was computed using an SG inner product with an approximation of a cosine lobe. The middle sphere shows the resulting diffuse lighting from computing irradiance using monte carlo importance sampling. The right sphere shows the resulting diffuse lighting from Stephen Hill’s fitted approximation.*

[1] All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance – https://www.microsoft.com/en-us/research/wp-content/uploads/2009/12/sg.pdf

]]>

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

In the previous article, I gave a quick rundown of some of the available techniques for representing a pre-computed distribution of radiance or irradiance for each lightmap texel or probe location. In this article, I’m going cover the basics of Spherical Gaussians, which are a type of spherical radial basis function (SRBF for short). The concepts introduced here will serve as the core set of tools for working with Spherical Gaussians, and in later articles I’ll demonstrate how you can use those tools to form an alternative for approximating incoming radiance in pre-computed lightmaps or probes.

I should point out that this article is still going to be somewhat high-level, in that it won’t provide full derivations and background details for all formulas and operations. However it is my hope that the material here will be sufficient to gain a basic understanding of SG’s, and also use them in practical scenarios.

A Spherical Gaussian, or “SG” for short, is essentially a Gaussian function[1] that’s defined on the surface of a sphere. If you’re reading this, then you’re probably already familar with how a Gaussian function works in 1D: you compute the distance from the center of the Gaussian, and use this distance as part of a base-e exponential. This produces the characteristic “hump” that you see when you graph it:

*A Gaussian in 1D centered at x=0, with a height of 3*

You’re probably also familiar with how it looks in 2D, since it’s very commonly used in image processing as a filter kernel. It ends up looking like what you would get if you took the above graph and revolved it around its axis

*A Gaussian filter applied to a 2D image of a white dot, showing that the impulse response is effectively a Gaussian function in 2D*

A Spherical Gaussian still works the same way, except that it now lives on the surface of a sphere instead of on a line or a flat plane. If you’re having trouble visualizing that, imagine if you took the above image and wrapped it around a sphere like wrapping paper. It ends up looking like this:

*A Spherical Gaussian visualized on the surface of a sphere*

Since an SG is defined on a sphere rather than a line or plane, it’s parameterized differently than a normal Gaussian. A 1D Gaussian function always has the following form:

The part that we need to change in order to define the function on a sphere is the “(x – b)” term. This part of the function essentially makes the Gaussian a function of the cartesian distance between a given point and the center of the Gaussian, which can be trivially extended into 2D using the standard distance formula. To make this work on a sphere, we must instead make our Gaussian a function of the angle between two unit direction vectors. In practice we do this by making an SG a function of the *cosine* of the angle between two vectors, which can be efficiently computed using a dot product like so:

Just like a normal Gaussian, we have a few parameters that control the shape and location of the resulting lobe. First we have μ, which is the *axis*, or *direction* of the lobe. It effectively controls where the lobe is located on the sphere, and always points towards the exact center of the lobe. Next we have λ, which is the *sharpness* of the lobe. As this value increases, the lobe will get “skinnier”, meaning that the result will fall off more quickly as you get further from the lobe axis. Finally we have *a*, which is the *amplitude* or *intensity* of the lobe. If you were to look at a polar graph of an SG, it would correspond to the height of the lobe at its peak. The amplitude can be a scalar value, or for graphics applications we may choose to make it an RGB triplet in order to support varying intensities for different color channels. This all lends itself to a simple HLSL code definition:

struct SG { float3 Amplitude; float3 Axis; float Sharpness; };

Evaluating an SG is also easily expressible in HLSL. All we need is a normalized direction vector representing the point on the sphere where we’d like to compute the value of the SG:

float3 EvaluateSG(in SG sg, in float3 dir) { float cosAngle = dot(dir, sg.Axis); return sg.Amplitude * exp(sg.Sharpness * (cosAngle - 1.0f)); }

Now that we know what a Spherical Gaussian is, what’s so useful about them anyway? One pontential benefit is that they’re fairly intuitive: it’s not terribly hard to understand how the 3 parameters work, and how each parameter affects the resulting lobe. The other main draw is that they inherit a lot of useful properties of “regular” Gaussians, which makes them useful for graphics and other related applications. These properties have been explored and utilized in several research papers that were primarily aimed at achieving pre-computed radiance transfer (PRT) with both diffuse and specular material response. In particular, the paper entitled “All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance[2]” by Wang et al. was our main inspiration for pursuing SG’s at RAD.

So what are these useful Gaussian properties that we can exploit? For starters, taking the product of 2 Gaussians functions produces another Gaussian. For an SG, this is equivalent to visiting every point on the sphere, evaluating 2 different SG’s, and multiplying the two results. Since it’s an operation that takes 2 SG’s and produces another SG, it is sometimes referred to as a “vector” product. It’s defined as the following:

In HLSL code, it looks like this:

SG SGProduct(in SG x, in SG y) { float3 um = (x.Sharpness * x.Axis + y.Sharpness * y.Axis) / (x.Sharpness + y.Sharpness); float umLength = length(um); float lm = x.Sharpness + y.Sharpness; SG res; res.Axis = um * (1.0f / umLength); res.Sharpness = lm * umLength; res.Amplitude = x.Amplitude * y.Amplitude * exp(lm * (umLength - 1.0f)); return res; }

Gaussians have another really nice property in that their integrals have a closed-form solution, which is known as the error function[3]. The property also extends to SG’s, where we can compute the integral of an SG over the entire sphere:

Computing an integral will essentially tell us the total “energy” of an SG, which can be useful for lighting calculations. It can also be useful for *normalizing* an SG, which produces an SG that integrates to 1. Such a normalized SG is suitable for representing a probability distribution, such as an NDF. In fact, a normalized SG is actually equivalent to a von Mises-Fisher distribution[4] in 3D!

An SG integral is actually very cheap to compute…or at least it would be if we removed the exponential term. It turns out that the term actually approaches 1 very quickly as the SG’s sharpness increases, which means we can potentially drop it with little error as long as we know that the sharpness is high enough. Here’s what a graph of looks like for increasing sharpness:

*A graph of the exponential term in computing the integral of an SG over a sphere, which approaches 1 as the sharpness increases. The X-axis is sharpness, and the Y-axis is the value of .*

This all lends itself naturally to HLSL implementations for accurate and approximate versions of an SG integral:

float3 SGIntegral(in SG sg) { float expTerm = 1.0f - exp(-2.0f * sg.Sharpness); return 2 * Pi * (sg.Amplitude / sg.Sharpness) * expTerm; } float3 ApproximateSGIntegral(in SG sg) { return 2 * Pi * (sg.Amplitude / sg.Sharpness); }

If we were to use our SG integral formula to compute the integral of the product of two SG’s, we can compute what’s known as the *inner product*, or *dot product* of those SG’s. The operation is usually defined like this:

However we can avoid numerical precision issues by using an alternate arrangement:

…which looks like this in HLSL:

float3 SGInnerProduct(in SG x, in SG y) { float umLength = length(x.Sharpness * x.Axis + y.Sharpness * y.Axis); float3 expo = exp(umLength - x.Sharpness - y.Sharpness) * x.Amplitude * y.Amplitude; float other = 1.0f - exp(-2.0f * umLength); return (2.0f * Pi * expo * other) / umLength; }

SG’s have what’s known as “compact-ε” support, which means that it’s possible to determine an angle θ such that all points within θ radians of the SG’s axis will have a value greater than ε. This property is potentially more useful if we flip it around so that we calculate a sharpness λ that results in a given θ for a particular value of ε:

float SGSharpnessFromThreshold(in float amplitude, in float epsilon, in float cosTheta) { return (log(epsilon) - log(amplitude)) / (cosTheta - 1.0f); }

One last operation I’ll discuss is rotation. Rotating an SG is trivial: all you need to do is apply your rotation transform to the SG’s axis vector and you have a rotated SG! You can apply the transform using a matrix, a quaternion, or any other means you might have for rotating a vector. This is a welcome change from SH, which requires a very complex transform once you go above L1.

[1] Gaussian Function – https://en.wikipedia.org/wiki/Gaussian_function

[2] All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance – http://research.microsoft.com/en-us/um/people/johnsny/papers/sg.pdf

[3] Error Function – https://en.wikipedia.org/wiki/Error_function

[4] von-Mises Fisher Distribtion – https://en.wikipedia.org/wiki/Von_Mises%E2%80%93Fisher_distribution

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

For part 1 of this series, I’m going to provide some background material for our research into Spherical Gaussians. The main purpose is cover some of the alternatives to the approach we used for The Order: 1886, and also to help you understand why we decided to persue Spherical Gaussians. The main empahasis is going to be on discussing what exactly we store in pre-baked lightmaps and probes, and how that data is used to compute diffuse or specular lighting. If you’re already familiar with the concepts of pre-computing radiance or irradiance and approximating them using basis functions like the HL2 basis or Spherical Harmonics, then you will probably want to skip to the next article.

Before we get started, here’s a quick glossary of the terms I use the formulas:

- – the outgoing radiance (lighting) towards the viewer
- – the incoming radiance (lighting) hitting the surface
- – the direction pointing towards the viewer (often denoted as “V” in shader code dealing with lighting)
- – the direction pointing towards the incoming radiance hitting the surface (often denoted as “L” in shader code dealing with lighting)
- – the direction of the surface normal
- – the 3D location of the surface point
- – integral about the hemisphere
- – the angle between the surface normal and the incoming radiance direction
- – the angle between the surface normal and the outgoing direction towards the viewer
- – the BRDF of the surface

Games have used pre-computed lightmaps for almost as long as they have been using shaded 3D graphics, and they’re still quite popular in 2016. The idea is simple: pre-compute a lighting value for every texel, then sample those lighting values at runtime to determine the final appearance of a surface. It’s a simple concept to grasp, but there are some details you might not think about if you’re just learning how they work. For instance, what exactly does it mean to store “lighting” in a texture? What exact value are we computing, anyway? In the early days the value fetched from the lightmap was simply multiplied with the material’s diffuse albedo color (typically done with fixed-function texture stages), and then directly output to the screen. Ignoring the issue of gamma correction and sRGB transfer functions for the moment, we can work backwards from this simple description to describe this old-school approach in terms of the rendering equation. This might seem like a bit of a pointless exercise, but I think it helps build a solid base that we can use to discuss more advanced techniques.

So we know that our lightmap contains a single fixed color per-texel, and we apply it the same way regardless of the viewing direction for a given pixel. This implies that we’re using a simple Lambertian diffuse BRDF, since it lacks any sort of view-dependence. Recall that we compute the outgoing radiance for a single point using the following integral:

If we substitute the standard diffuse BRDF of for our BRDF (where C_{diffuse} is the diffuse albedo of the surface), then we get the following:

On the right side we see that we can pull the constant terms out the integral (the constant term is actually the entire BRDF!), and what we’re left with lines up nicely with how we handle lightmaps: the expensive integral part is pre-computed per-texel, and then the constant term is applied at runtime per-pixel. The “integral part” is actually computing the incident irradiance, which lets us finally identify the quantity being stored in the lightmap: it’s irradiance! In practice however most games would not apply the 1 / π term at runtime, since it would have been impractical to do so. Instead, let’s assume that the 1 / π was “baked” into the lightmap, since it’s constant for all surfaces (unlike the diffuse albedo, which we consider to be *spatially varying*). In that case, we’re actually storing a reflectance value that takes the BRDF into account. So if we wanted to be precise, we would say that it contains “the diffuse reflectance of a surface with C_{diffuse} = 1.0″, AKA the maximum possible outgoing radiance for a surface with a diffuse BRDF.

One of the key concepts of lightmapping is the idea of reconstructing the final surface appearance using data that’s stored at different rates in the spatial domain. Or in simpler words, we store lightmaps using one texel density while combining it with albedo maps that have a different (usually higher) density. This lets us retain the appearance of high-frequency details without actually computing irradiance integrals per-pixel. But what if we want to take this concept a step further? What it we also want the irradiance itself to vary in response to texture maps, and not just the diffuse albedo? By the early 2000’s normal maps were starting to see common use for this purpose, however they were generally only used when computing the contribution from punctual light sources. Normal maps were no help with light maps that only stored a single (scaled) irradiance value, which meant that that pure ambient lighting would look very flat compared to areas using dynamic lighting:

*Areas in direct lighting (on the right) have a varying appearance due to a normal map, but areas in shadow (on the left) have no variation due to being lit by a baked lightmap containing only a single irradiance value.*

To make lightmaps work with normal mapping, we need to stop storing a single value and instead somehow store a *distribution* of irradiance values for every texel. Normal maps contain a range of normal directions, where those directions are generally restricted to the hemisphere around a point’s surface normal. So if we want our lightmap to store irradiance values for all possible normal map values, then it must contain a distribution of irradiance that’s defined for that same hemisphere. One of the earliest and simplest examples of such a distribution was used by Half-Life 2[1], and was referred to as Radiosity Normal Mapping[2]:

*Image from “Shading in Valve’s Source Engine “, SIGGRAPH 2006*

Valve essentially modified their lightmap baker to compute 3 values instead of 1, with each value computed by projecting the irradiance signal onto one of the corresponding orthogonal basis vectors in the above image. At runtime, the irradiance value used for shading would be computed by blending the 3 lightmap values based on the cosine of the angle between the normal map direction and the 3 basis directions (which is cheaply computed using a dot product). This allowed them to effectively vary the irradiance based on the normal map direction, thus avoiding the “flat ambient” problem described above.

While this worked for their static geometry, there still remained the issue of applying pre-computed lighting to dynamic objects and characters. Some early games (such as the original Quake) used tricks like sampling the lightmap value at a character’s feet, and using that value to compute ambient lighting for the entire mesh. Other games didn’t even do that much, and would just apply dynamic lights combined with a global ambient term. Valve decided to take a more sophisticated approach that extended their hemispherical lightmap basis into a full spherical basis formed by 6 orthogonal basis vectors:

*Image from “Shading in Valve’s Source Engine “, SIGGRAPH 2006*

The basis vectors coincided with the 6 face directions of a unit cube, which led Valve to call this basis the “Ambient Cube”. By projecting irradiance in all directions around a point in space (instead of a hemisphere surrounding a surface normal) onto their basis functions, a dynamic mesh could sample irradiance for any normal direction and use it to compute diffuse lighting. This type of representation is often referred to as a *lighting probe*, or often just “probe” for short.

With Valve’s basis we can combine normal maps and light maps to get diffuse lighting that can vary in response to high-frequency normal maps. So what’s next? For added realism we would ideally like to support more complex BRDF’s, including view-dependent specular BRDF’s. Half-Life 2 handled environment specular by pre-generating cubemaps at hand-placed probe locations, which is still a common approach used by modern games (albeit with the addition of pre-filtering[3] used to approximate the response from a microfacet BRDF). However the large memory footprint of cubemaps limits the practical density of specular probes, which can naturally lead to issues caused by incorrect parallax or disocclusion.

*A combination of incorrect parallax and disocclusion when using a pre-filtered environment as a source for environment specular. Notice the bright edges on the sphere, which are actually caused by the sphere reflecting itself!*

With that in mind it would nice to be able to get some sort of specular response out of our lightmaps, even if only for a subset of materials. But if that is our goal, then our approach of storing an irradiance distribution starts to become a hinderance. Recall from earlier that with a diffuse BRDF we were able to completely pull the BRDF out of the irradiance integral, since the Lambertian diffuse BRDF is just a constant term. This is no longer the case even with a simple specular BRDF, whose value varies depending on both the viewing direction as well as the incident lighting direction.

If you’re working with the Half-Life 2 basis (or something similar), a tempting option might be to compute a specular term as if the 3 basis directions were directional lights. If you think about what this means, it’s basically what you get if you decide to say “screw it” and pull the specular BRDF out of the irradiance integral. So instead of Integrate(BRDF * Lighting * cos(theta)), you’re doing BRDF * Integrate(Lighting * cos(theta)). This will definitely give you *something,* and it’s perhaps a lot better than nothing. But you’ll also effectively lose out on a ton of your specular response, since you’ll only get specular when your viewing direction appropriately lines up with your basis directions according the the BRDF slice. To show you what I mean by this, here’s a comparison:

*The top image shows a path-traced rendering of a green wall being lit by direct sun lighting. The middle image shows the indirect specular component of the top image, with exposure increased by 4x. The bottom image shows the resulting specular from treating the HL2 basis directions as directional lights.*

Hopefully these images clearly show the problem that I’m describing. In the bottom image, you get specular reflections that look just like they came from a few point lights, since that’s effectively what you’re simulating. Meanwhile in the middle image with proper environment reflections, you can see that the the entire green wall effectively acts as an area light, and you get a very broad specular reflections across the entire floor. In general the problem tends to be less noticeable though as roughness increases, since higher roughness naturally results in broader, less-defined reflections that are harder to notice.

If we want to do better, we must instead find a way to store a radiance distribution and then efficiently integrate it against our BRDF. It’s at this point that we turn to spherical harmonics. Spherical harmonics (SH for short) have become a popular tool for real-time graphics, typically as a way to store an approximation of indirect lighting at discrete probe locations. I’m not going to go into the full specifics of SH since that could easily fill an entire article[4] on its own. If you have no experience with SH, the key thing to know about them is that they basically let you approximate a function defined on a sphere using a handful of coefficients (typically either 4 or 9 floats per RGB channel). It’s sort-of as if you had a compact cubemap, where you can take a direction vector and get back a value associated with that direction. The big catch is that you can only represent very low-frequency (fuzzy) signals with lower-order SH, which can limit what sort of things you can do with it. You can project detailed, high-frequency signals onto SH if you want to, but the resulting projection will be very blurry. Here’s an example showing what an HDR environment map looks like projected onto L2 SH, which requires 27 coefficients for RGB:

*The top image is an HDR environment map containing incoming radiance values about a sphere, while the bottom image shows the result of projecting that environment onto L2 spherical harmonics.*

In the case of irradiance, SH can work pretty well since it’s naturally low-frequency. The integration of incoming radiance against the cosine term effectively acts as a low-pass filter, which makes it a suitable candidate for approximation with SH. So if we project irradiance onto SH for every probe location or lightmap texel, we can now do an SH “lookup” (which is basically a few computations followed by a dot product with the coefficients) to get the irradiance in any direction on the sphere. This means we can get spatial variation from albedo and normal maps just like with the HL2 basis!

It also turns out that SH is pretty useful for *computing* irradiance from input radiance, since we can do it really cheaply. In fact it can do it so cheaply, it can be done at runtime by folding it into the SH lookup process. The reason it’s so cheap is because SH is effectively a frequency-domain representation of the signal, and when you’re in the frequency domain convolutions can be done with simple multiplication. In the spatial domain, convolution with a cubemap is an N^2 operation involving many samples from an input radiance cubemap. If you’re interested in the full details, the process was described in Ravi Ramamoorthi’s seminal paper[5] from 2001, with derivations provided in another article[6].

*The Stanford Bunny model being lit with diffuse lighting from an L2 spherical harmonics probe*

So we’ve established that SH works for approximating irradiance, and that we can convert from radiance to irradiance at runtime. But what does this have to do with specular? By storing an approximation of radiance instead of irradiance in our probes or lightmaps (albeit, a very blurry version of radiance), we now have the signal that we need to integrate our specular BRDF against in order to produce specular reflections. All we need is an SH representation of our BRDF, and we’re a dot product away from environment specular! The only problem we have to solve is how to actually *get* an SH representation of our BRDF.

Unfortunately a microfacet specular BRDF is quite a bit more complicated than a Lambertian diffuse BRDF, which makes our lives more difficult. For diffuse lighting we only needed to worry about the cosine lobe, which has the same shape regardless of the material or viewing direction. However a specular lobe will vary in shape and intensity depending on the viewing direction, material roughness, and the fresnel term at zero incidence (AKA F_{0}). If all else fails, we can always use monte-carlo techniques to pre-compute the coefficients and store the result in a lookup texture. At first it may seem like we need at parameterize our lookup table on 4 terms, since the viewing direction is two-dimensional. However we can drop a dimension if we follow in the footsteps[7] of the intrepid engineers at Bungie, who used a neat trick for their SH specular implementation in Halo 3[8]. The key insight that they shared was that the specular lobe shape doesn’t actually change as the viewer rotates around the local Z axis of the shading point (AKA the surface normal). It actually only changes based on the *viewing angle*, which is the angle between the view vector and the local Z axis of the surface. If we exploit this knowledge, we can pre-compute the coefficients for the set of possible viewing directions that are aligned with the local X axis. Then at runtime, we can rotate the coefficients so that the resulting lobe lines up with the actual viewing direction. Here’s an image to show you what I mean:

*Rotating a specular lobe from the X axis to its actual location based on the viewing direction, which is helpful for pre-computing the SH coefficients into a lookup texture*

So in this image the checkerboard is the surface being shaded, and the red, green and blue arrows are the local X, Y, and Z axes of the surface. The transparent lobe represents the specular lobe that we precomputed for a viewpoint that’s aligned with the X axis, but has the same viewing angle. The blue arrow shows how we can rotate the specular lobe from its original position to the actual position of the lobe based on the current viewing position, giving us the desired specular response. Here’s a comparison showing what it looks like it in action:

*The top image is a scene rendered with a path tracer. The middle image shows the indirect specular as rendered by a path tracer, with exposure increased 4x. The bottom image shows the indirect specular term computing an L2 SH lightmap, also with exposure increased by 4x. *

Not too bad, eh? Or at least…not too bad as long as we’re willing to store 27 coefficients per lightmap texel, and we’re only concerned with rough materials. The comparison image used a GGX α parameter of 0.39, which is fairly rough.

One common issue with issue with SH is a phenomenon known as “ringing”, which is described in Peter-Pike Sloan’s Stupid Spherical Harmonics Tricks[9]. Ringing artifacts tends to show up when you have a very intense light source one side of the sphere. When this happens, the SH projection will naturally result in negative lobes on the opposite side of the sphere, which can result very low (or even negative!) values when evaluated. It’s generally not too much of an issue for 2D lightmaps, since lightmaps are only concerned with the incoming radiance for a hemisphere surrounding the surface normal. However they often show up in probes, which store radiance or irradiance about the entire sphere. The solution suggested by Peter-Pike Sloan is to apply a windowing function to the SH coefficients, which will filter out the ringing artifacts. However the windowing will also introduce additional blurring, which may remove high-frequency components from the original signal being projected. The following image shows how ringing artifacts manifest when using SH to compute irradiance from an environment with a bright area light, and also shows how windowing affects the final result:

*A sphere with a Lambertian diffuse BRDF being lit by a lighting environment with a strong area light source. The left image shows the ground-truth result of using monte-carlo integration. The middle image shows the result of projecting radiance onto L2 SH, and then computing irradiance. The right image shows the result of applying a windowing function to the L2 SH coefficients before computing irradiance.*

[1] Shading in Valve’s Source Engine (SIGGRAPH 2006) – http://www.valvesoftware.com/publications/2006/SIGGRAPH06_Course_ShadingInValvesSourceEngine.pdf

[2] Half Life 2 / Valve Source Shading – http://www2.ati.com/developer/gdc/D3DTutorial10_Half-Life2_Shading.pdf

[3] Real Shading in Unreal Engine 4 – http://blog.selfshadow.com/publications/s2013-shading-course/karis/s2013_pbs_epic_notes_v2.pdf

[4] Spherical Harmonic Lighting: The Gritty Details – http://www.research.scea.com/gdc2003/spherical-harmonic-lighting.pdf

[5] An Efficient Representation for Irradiance Environment Maps – https://cseweb.ucsd.edu/~ravir/papers/envmap/

[6] On the Relationship between Radiance and Irradiance: Determining the illumination from images of a convex Lambertian object – https://cseweb.ucsd.edu/~ravir/papers/invlamb/

[7] The Lighting and Material of Halo 3 (Slides) – https://developer.amd.com/wordpress/media/2012/10/S2008-Chen-Lighting_and_Material_of_Halo3.pdf

[8] The Lighting and Material of Halo 3 (Course Notes) – http://developer.amd.com/wordpress/media/2013/01/Chapter01-Chen-Lighting_and_Material_of_Halo3.pdf

[9] Stupid Spherical Harmonics Tricks – http://www.ppsloan.org/publications/StupidSH36.pdf

So nearly year and a half ago myself and Dave Neubelt gave a presentation at SIGGRAPH where we described the approach that we developed for approximating incoming radiance using Spherical Gaussians in both our lightmaps and 3D probe grids. We had planned on releasing a source code demo as well as course notes that would serve as a full set of implementation details, but unfortunately those efforts were sidetracked by other responsibilities. We had actually gotten pretty far on a demo, and it started to seem pretty silly to let a whole year go by without releasing it. So I’ve cleaned it up, and put it on GitHub for all to learn from and/or make fun of. So if you’re just interested in seeing the code or running the demo, then go ahead and download it:

https://github.com/TheRealMJP/BakingLab

For those looking for some in-depth written explanation, I’ve also decided to write a series of blog posts that should hopefully shed some light on the basics of using SG’s in rendering. The first post provides background material by explaining common approaches to storing pre-computing lighting data in lightmaps and/or probes. The second post focuses on explaining the basics of Spherical Gaussians, and demonstrating some of their more useful properties. The third post explains how the various SG properties can be used to compute diffuse lighting from an SG light source. The fourth post goes even deeper and covers methods for approximating the specular contribution from an SG light source. The fifth post explores some approaches for using SG’s to create a compact approximation of a lighting environment, and compares the results with spherical harmonics. Finally, the sixth posts discusses features present in the the lightmap baking demo that we’ve released on GitHub.

Part 2 – Spherical Gaussians 101

Part 3 – Diffuse Lighting From an SG Light Source

Part 4 – Specular Lighting From an SG Light Source

Part 5 – Approximating Radiance and Irradiance With SG’s

Part 6 – Step Into The Baking Lab

Enjoy, and please leave a comment if you have questions, concerns, or corrections!

]]>