Average luminance calculation using a compute shader

A common part of most HDR rendering pipelines is some form of average luminance calculation. Typically it’s used to implement Reinhard’s method of image calibration, which is to map the geometric mean of luminance (log average) to some “key value”. This, combined with some time-based adaptation, allows for a reasonable approximation of auto-exposure or human eye adaptation.
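
To make the “key value” mapping concrete, the exposure step that eventually consumes the average boils down to a single scale. Here’s a minimal HLSL sketch (the function and parameter names are mine, not taken from the sample), assuming the 1×1 reduction result has already been passed through exp() to recover a linear average luminance:

    float3 CalcExposedColor(float3 color, float avgLuminance, float keyValue)
    {
        // Guard against a zero/denormal average, then scale so that the
        // scene's log-average luminance maps to the chosen key value
        avgLuminance = max(avgLuminance, 0.0001f);
        float linearExposure = keyValue / avgLuminance;
        return color * linearExposure;
    }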

In the old days of DX9, the average luminance calculation was usually done by repeatedly downscaling a luminance texture as if generating mipmaps. Technically DX9 had support for automatic mipmap generation on the GPU, but support wasn’t guaranteed, so to be safe you had to do it yourself. DX10 brought guaranteed support for mipmap generation for a variety of texture formats, making it viable for use with average luminance calculations. It’s obviously still there in DX11, and it’s still a very easy and pretty quick way to do it. On my AMD 6950 it only takes about 0.2ms to generate the full mip chain for a 1024×1024 luminance map. But with DX11 it’s not sexy unless you’re doing it with compute shaders, which means we need to ditch that one-line call to GenerateMips and replace it with some parallel reductions. Technically a parallel reduction should require far fewer memory reads/writes than generating successive mip levels, so there’s also some sound reasoning behind exploring that approach.
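
As a rough back-of-envelope for that claim (my own accounting, not something measured from the sample): generating the full mip chain for an N×N source reads about 4·(N²/4 + N²/16 + …) ≈ (4/3)·N² texels and writes about (1/3)·N², all of it bouncing through off-chip memory one level at a time, whereas a two-pass shared-memory reduction reads N² + 32² texels and writes only 32² + 1.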

The DirectX SDK actually comes with a sample that implements average luminance calculation with a compute shader parallel reduction (HDRToneMappingCS11), but unfortunately their CS implementation performs a fair bit worse than their pixel shader implementation. A few people on gamedev.net had asked about this, and I had said that it should definitely be possible to beat successive downscaling with a compute shader if you did it right and used cs_5_0 rather than cs_4_0 like the sample does. When it came up again today, I decided to put my money where my mouth is and make a working example.

The implementation is really simple: render the scene in HDR, render log(luminance) to a 1024×1024 texture, downscale to 1×1 using either GenerateMips or a compute shader reduction, apply adaptation, then tone map (and add bloom). My first try was to do the reduction in 32×32 thread groups (giving the max of 1024 threads per group), where each thread sampled a single float from the input texture and stored it in shared memory. The reduction was then done in shared memory using the techniques outlined in Nvidia’s CUDA parallel reduction whitepaper, which helps avoid shared memory bank conflicts. The first pass used a 32×32 dispatch that produced a 32×32 output texture, which was then reduced to 1×1 with one more 1×1 dispatch, as in the sketch below. Unfortunately this approach took about 0.3ms to complete, which was slower than the 0.2ms taken for generating mips.
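
For reference, here’s a minimal HLSL sketch of what that first scalar pass looks like, assuming a 1024×1024 log(luminance) input and a 32×32 output UAV; the resource and entry point names are placeholders rather than the ones used in the sample, and the second pass just runs the same shader again on the 32×32 result:

    // Scalar first pass: 32x32 threads per group, one texel per thread,
    // sequential-addressing reduction in shared memory (per the CUDA whitepaper).
    Texture2D<float> LuminanceMap : register(t0);
    RWTexture2D<float> OutputMap  : register(u0);

    groupshared float SharedMem[1024];

    [numthreads(32, 32, 1)]
    void LuminanceReductionCS(uint3 groupID : SV_GroupID,
                              uint3 dispatchID : SV_DispatchThreadID,
                              uint groupIndex : SV_GroupIndex)
    {
        // Each thread loads one log(luminance) sample into shared memory
        SharedMem[groupIndex] = LuminanceMap[dispatchID.xy];
        GroupMemoryBarrierWithGroupSync();

        // Halve the number of active threads each iteration, with consecutive
        // threads touching consecutive shared memory addresses
        for (uint s = 512; s > 0; s >>= 1)
        {
            if (groupIndex < s)
                SharedMem[groupIndex] += SharedMem[groupIndex + s];
            GroupMemoryBarrierWithGroupSync();
        }

        // Thread 0 writes this group's average to one texel of the output
        if (groupIndex == 0)
            OutputMap[groupID.xy] = SharedMem[0] / 1024.0f;
    }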

For my second try, I decided to explicitly vectorize so that I could take better advantage of the vector ALUs in my AMD GPU. I reconfigured the reduction compute shader to use 16×16 thread groups, and had each thread take 4 samples (forming a 2×2 grid) from the input texture and store them in a float4 in shared memory. Then the float4s were summed in a parallel reduction, with the last step being to sum the 4 components of the final result. This approach required only 0.08ms for the reduction, meaning I hit my goal of beating out the mipmap generation. After all that work, saving about 0.1ms doesn’t seem like a whole lot, but it’s worth it for the cool factor. The performance differential may also become more pronounced at higher resolutions, or on hardware with less bandwidth available. I’m not sure how the compute shader will fare on Nvidia hardware, since they don’t use a vectorized GPU architecture, so it should be interesting to get some numbers. I’d suspect that shared memory access patterns are going to dominate over ALU cost anyway, so it could go either way.
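
And here’s a corresponding sketch of the vectorized version (again with placeholder names, and again just an illustration of the approach rather than the exact code from the sample):

    // Vectorized pass: 16x16 threads per group, each thread loads a 2x2 block
    // into a float4, the float4s are reduced in shared memory, and the four
    // components are summed at the very end.
    Texture2D<float> LuminanceMap : register(t0);
    RWTexture2D<float> OutputMap  : register(u0);

    groupshared float4 SharedMem[256];

    [numthreads(16, 16, 1)]
    void LuminanceReductionVecCS(uint3 groupID : SV_GroupID,
                                 uint3 dispatchID : SV_DispatchThreadID,
                                 uint groupIndex : SV_GroupIndex)
    {
        // Each thread gathers a 2x2 footprint into one float4
        uint2 coord = dispatchID.xy * 2;
        float4 samples;
        samples.x = LuminanceMap[coord + uint2(0, 0)];
        samples.y = LuminanceMap[coord + uint2(1, 0)];
        samples.z = LuminanceMap[coord + uint2(0, 1)];
        samples.w = LuminanceMap[coord + uint2(1, 1)];
        SharedMem[groupIndex] = samples;
        GroupMemoryBarrierWithGroupSync();

        // Same reduction as before, but every operation moves a full float4
        for (uint s = 128; s > 0; s >>= 1)
        {
            if (groupIndex < s)
                SharedMem[groupIndex] += SharedMem[groupIndex + s];
            GroupMemoryBarrierWithGroupSync();
        }

        // Sum the four lanes and write the average of this group's 32x32 tile
        if (groupIndex == 0)
        {
            float4 total = SharedMem[0];
            OutputMap[groupID.xy] = (total.x + total.y + total.z + total.w) / 1024.0f;
        }
    }

Each 16×16 group still covers a 32×32 footprint, so the dispatch dimensions and the second pass stay the same as before; the only difference is that every shared memory load and store now moves a float4 instead of a single float.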

The sample project is up for download here: http://mjp.codeplex.com/releases/view/71518#DownloadId=268633

16 thoughts on “Average luminance calculation using a compute shader”

  1. Mykhailo: indeed it might! But as I said earlier I suspect that shared memory access is the primary bottleneck, so I wouldn’t expect too much of a gain.

    Daan: the formula for geometric mean is exp(avg(log(luminance))). So if you render out log(luminance) and generate mips, the last mip level is equal to avg(log(luminance)). Then you just take the exp() of the value you get when sampling the lowest mip level and you have the geometric mean.

  2. OK, I get it now, thanks for the clarification! That’s actually pretty clever, I guess it makes very bright pixels have less of an impact on the average, as well as giving better temporal stability.

  3. I ran this on my personal computer, and got about 312 FPS at the most (single GTX 460, i7 920 @ 4.1GHz, 3GB of RAM).
    Not bad, but it stays around 4.2ms. Sadly it doesn’t change much between the methods; in fact it goes up with your implementation, by about 0.2ms.

  4. Wouldn’t the float4s cause bank conflicts on NVidia hardware? Each subsequent 32-bit value is stored in the next memory bank, so the 4 floats of a float4 would be stored in 4 different banks. So when the hardware wants to load all gsm[i].x values, only 4 of them can be accessed directly by the half warp, which would cause a 4-way bank conflict. I don’t think the hardware coalesces the memory automatically.

  5. Yes, it would definitely have bank conflicts on both Nvidia and AMD hardware. The first (scalar) version of the reduction that I wrote avoided bank conflicts, since it used the techniques illustrated in the CUDA parallel reduction whitepaper. But then it appeared that on my AMD hardware the benefit of vectorization outweighed the cost of bank conflicts, so I just stopped there. In hindsight it probably would have been best to leave in the scalar version as an option, since I suspect it would perform better on Nvidia hardware and the new GCN-based AMD hardware.

  6. You can use GroupMemoryBarrier() inside the parallel reduction’s for loop in ReductionCS(). I see 2-35% speedups on different AMD architectures with that, for a slightly different implementation with no explicit vectorization. Thanks for the articles and samples, much appreciated!

  7. Actually, I take it back. You do need the GroupMemoryBarrierWithGroupSync() to avoid potential R/W hazards among different warps.

  8. I recently implemented both the pixel shader and compute shader versions for calculating luminance. For the pixel shader version, I use the downscale-by-2*2 method with a single bilinear tap all the way down to a single pixel (GenerateMips might be doing the same thing). For the compute shader, I use the parallel sum method with 2 passes and then take the average using the resolution. If I use a resolution that is not a power of 2, the pixel shader version gives me an incorrect average (when compared to the compute shader). I have verified that the compute shader version returns the right average, and the pixel shader version starts to introduce errors when downscaling from an odd height or odd width (e.g. downscaling from 5*3, 5*2, or 4*3). Can you suggest a method to get the right average from the pixel shader version?
