A common part of most HDR rendering pipelines is some form of average luminance calculation. Typically it’s used to implement Reinhard’s method of image calibration, which is to map the geometric mean of luminance (log average) to some “key value”. This, combined with some time-based adaptation, allows for a reasonable approximation of auto-exposure or human eye adaptation.
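To make that concrete, here's a minimal sketch of the key-value mapping step in HLSL. The function name and parameters are illustrative, not from any particular sample: the average luminance is assumed to come from exponentiating the reduced log(luminance) value, i.e. avgLum = exp(mean(log(delta + L(x, y)))).

```hlsl
// Hedged sketch: Reinhard-style calibration. Scale each pixel so that the
// geometric mean of scene luminance maps to the chosen key value.
float3 CalcExposedColor(float3 color, float avgLuminance, float keyValue)
{
    // Guard against a zero/denormal average on fully black frames
    float linearExposure = keyValue / max(avgLuminance, 0.0001f);
    return color * linearExposure;
}
```

The time-based adaptation mentioned above would typically blend avgLuminance toward the current frame's value over several frames before it's passed in here.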
In the old days of DX9, the average luminance calculation was usually done by repeatedly downscaling a luminance texture, as if generating mipmaps. Technically DX9 had support for automatic mipmap generation on the GPU, but support wasn’t guaranteed, so to be safe you had to do it yourself. DX10 brought guaranteed support for mipmap generation on a variety of texture formats, making it viable for average luminance calculations. It’s obviously still there in DX11, and it’s still a very easy and pretty quick way to do it. On my AMD 6950 it only takes about 0.2ms to generate the full mip chain for a 1024×1024 luminance map, which is pretty quick. But with DX11 it’s not sexy unless you’re doing it with compute shaders, which means we need to ditch that one-line call to GenerateMips and replace it with some parallel reductions. Technically a parallel reduction should require far fewer memory reads/writes than generating successive mip levels, so there’s also some sound reasoning behind exploring that approach.
The DirectX SDK actually comes with a sample that implements average luminance calculation with a compute shader parallel reduction (HDRToneMappingCS11), but unfortunately their CS implementation performs a fair bit worse than their pixel shader implementation. A few people on gamedev.net had asked about this, and I had said that it should definitely be possible to beat successive downscaling with a compute shader if you did it right and used cs_5_0 rather than cs_4_0 like the sample. When it came up again today I decided to put my money where my mouth is and make a working example.
The implementation is really simple: render the scene in HDR, render log(luminance) to a 1024×1024 texture, downscale to 1×1 using either GenerateMips or a compute shader reduction, apply adaptation, then tone map (and add bloom). My first try was to do the reduction in 32×32 thread groups (giving the max of 1024 threads per group), where each thread sampled a single float from the input texture and stored it in shared memory. Then the reduction was done in shared memory using the techniques outlined in Nvidia’s CUDA parallel reduction whitepaper, which help avoid shared memory bank conflicts. The first pass used a 32×32 dispatch, which produced a 32×32 output texture that was then reduced to 1×1 with one more 1×1 dispatch. Unfortunately this approach took about 0.3ms to complete, which was slower than the 0.2ms taken for generating mips.
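That first, scalar version looks roughly like the sketch below. The resource names and semantics are illustrative rather than lifted from the sample; the reduction uses the sequential-addressing pattern from the Nvidia whitepaper, and each pass writes the group's average so repeated passes keep computing a mean.

```hlsl
// Hedged sketch of the scalar approach: 32x32 thread groups, each thread
// loads one texel into shared memory, then a tree reduction sums the group.
Texture2D<float> InputTex   : register(t0);
RWTexture2D<float> OutputTex : register(u0);

groupshared float SharedMem[1024];

[numthreads(32, 32, 1)]
void ReductionCS(uint3 GroupID : SV_GroupID,
                 uint GroupIndex : SV_GroupIndex,
                 uint3 DispatchID : SV_DispatchThreadID)
{
    // Each thread reads a single sample from the input texture
    SharedMem[GroupIndex] = InputTex[DispatchID.xy];
    GroupMemoryBarrierWithGroupSync();

    // Sequential-addressing tree reduction: halve the active threads
    // each step, so accesses stay contiguous and bank-conflict-free
    for (uint s = 512; s > 0; s >>= 1)
    {
        if (GroupIndex < s)
            SharedMem[GroupIndex] += SharedMem[GroupIndex + s];
        GroupMemoryBarrierWithGroupSync();
    }

    // Thread 0 writes the group's average (sum / 1024 threads)
    if (GroupIndex == 0)
        OutputTex[GroupID.xy] = SharedMem[0] / 1024.0f;
}
```

Dispatching this with 32×32 groups over the 1024×1024 texture yields the 32×32 intermediate, and one more 1×1 dispatch of the same shader collapses that to a single value.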
For my second try, I decided to explicitly vectorize so that I could take better advantage of the vector ALUs in my AMD GPU. I reconfigured the reduction compute shader to use 16×16 thread groups, and had each thread take 4 samples (forming a 2×2 grid) from the input texture and store them in a float4 in shared memory. Then the float4s were summed in a parallel reduction, with the last step being to sum the 4 components of the final result. This approach required only 0.08ms for the reduction, meaning I hit my goal of beating out the mipmap generation. After all that work saving 0.1ms doesn’t seem like a whole lot, but it’s worth it for the cool factor. The performance differential may also become more pronounced at higher resolutions, or on hardware with less bandwidth available. I’m not sure how the compute shader will fare on Nvidia hardware, since their GPUs don’t use a vectorized architecture, so it should be interesting to get some numbers. I’d suspect that shared memory access patterns are going to dominate over ALU cost anyway, so it could go either way.
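The vectorized variant can be sketched as follows, again with illustrative names. Each 16×16 group still covers a 32×32 block of texels, so the dispatch dimensions and output size stay the same as the scalar version; the difference is that every shared-memory add now sums four values at once.

```hlsl
// Hedged sketch of the vectorized approach: 16x16 groups, each thread
// loads a 2x2 quad into a float4 so the reduction runs on full vectors.
Texture2D<float> InputTex   : register(t0);
RWTexture2D<float> OutputTex : register(u0);

groupshared float4 SharedMem[256];

[numthreads(16, 16, 1)]
void ReductionVecCS(uint3 GroupID : SV_GroupID,
                    uint GroupIndex : SV_GroupIndex,
                    uint3 DispatchID : SV_DispatchThreadID)
{
    // Each thread gathers a 2x2 footprint from the input texture
    uint2 base = DispatchID.xy * 2;
    float4 samples;
    samples.x = InputTex[base + uint2(0, 0)];
    samples.y = InputTex[base + uint2(1, 0)];
    samples.z = InputTex[base + uint2(0, 1)];
    samples.w = InputTex[base + uint2(1, 1)];
    SharedMem[GroupIndex] = samples;
    GroupMemoryBarrierWithGroupSync();

    // float4 tree reduction: four sums per add on vector hardware
    for (uint s = 128; s > 0; s >>= 1)
    {
        if (GroupIndex < s)
            SharedMem[GroupIndex] += SharedMem[GroupIndex + s];
        GroupMemoryBarrierWithGroupSync();
    }

    // Collapse the final float4 and write the group's average
    // (256 float4 entries * 4 samples = 1024 texels per group)
    if (GroupIndex == 0)
    {
        float4 total = SharedMem[0];
        OutputTex[GroupID.xy] = dot(total, 1.0f) / 1024.0f;
    }
}
```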
The sample project is up for download here: http://mjp.codeplex.com/releases/view/71518#DownloadId=268633