Light Indexed Deferred Rendering

There’s been a bit of a stir on the Internet lately due to AMD’s Leo demo, which was recently revealed to be using a modern twist on Light Indexed Deferred Rendering. The idea of light indexed deferred has always been pretty appealing, since it gives you some of the advantages of deferred rendering (namely, using the GPU to decide which lights affect each pixel) while still letting you use forward rendering to actually apply the lighting to each surface. While there’s little doubt at this point that deferred rendering has proven itself as an effective and practical technique, I’m sure plenty of programmers currently maintaining such a renderer have dreamed of a day when they don’t have to figure out how to cram every attribute into their G-Buffer using as few bits as possible, or consume hundreds of megabytes for MSAA G-Buffer textures.

While the benefits of light indexed deferred were pretty obvious to me, I was pretty sure that the performance wouldn’t hold up when compared to the state of the art in traditional deferred rendering. So I decided to make a simple test app where I could toggle between the two techniques for the same scene. For the deferred renderer, I based my implementation very closely on Andrew Lauritzen’s work, since he had done quite a bit of work optimizing it for modern GPU architectures. The only differences were that I used a different G-Buffer layout (normals, specular albedo + roughness, diffuse albedo, and ambient lighting, all 32bpp) and that I used an oversized texture instead of a structured buffer for writing out the individual MSAA subsamples from the compute shader.

For the light indexed deferred implementation I used a depth-only prepass to fill the depth buffer, which was then used by a compute shader to compute the list of intersecting lights per tile. This list was stored in either an R8_UINT or R16_UINT typed buffer (8-bit for < 255 lights, 16-bit otherwise), with enough space pre-allocated in the buffer to store a full light list for each tile. So no bitfields or linked lists or anything fancy like that, just a simple per-tile list terminated by a sentinel value. I found that this worked best, since it resulted in the least amount of overhead for reading the list in the forward rendering pass, although there might be better ways to do it. The forward rendering pass then figures out which tile each pixel is in, and applies the lights from that tile’s list one by one.
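To make the buffer layout concrete, here’s a small Python sketch of the scheme, as a stand-in for the actual HLSL compute and pixel shader code. The function names, tile stride, and the 16-bit sentinel value are my own illustrative assumptions, not the sample’s actual code:

```python
# Per-tile light index lists in one flat buffer: each tile gets a fixed-size
# slot big enough for a full light list, terminated by a sentinel value.

SENTINEL = 0xFFFF  # would be 0xFF for the R8_UINT variant

def build_tile_lists(num_tiles, max_lights, intersections):
    """intersections[t] holds the indices of lights touching tile t.
    Returns a flat buffer with max_lights + 1 slots reserved per tile."""
    stride = max_lights + 1  # room for a full list plus the terminator
    buf = [SENTINEL] * (num_tiles * stride)
    for t, lights in enumerate(intersections):
        base = t * stride
        for i, light in enumerate(lights):
            buf[base + i] = light
        # the slot after the last index is already SENTINEL
    return buf

def lights_for_tile(buf, max_lights, tile):
    """Forward pass: walk the tile's slot until the sentinel is hit."""
    stride = max_lights + 1
    base = tile * stride
    out = []
    for i in range(stride):
        idx = buf[base + i]
        if idx == SENTINEL:
            break
        out.append(idx)
    return out
```

Reading the list is then a tight loop with one fetch per light index, which is what keeps the overhead in the forward pass low.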

In both cases I used normalized Blinn-Phong with a Fresnel approximation for the lights, so nothing fancy there. I did use a terrible linear falloff for the point lights just so that I could artificially restrict the radius, so please don’t judge me for that. I also used the depth-only prepass for both implementations, since it actually resulted in a speedup of around 0.5ms for the G-Buffer pass. For a test scene, I used the ol’ Sponza atrium.
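For reference, that shading model boils down to something like the following scalar sketch (a Python stand-in for the HLSL; the names are mine, the Fresnel term is omitted for brevity, and the linear falloff is exactly the crude radius clamp I just apologized for):

```python
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def point_light(n, l, v, diffuse_albedo, spec_albedo, spec_power,
                light_dist, light_radius):
    """One point light's contribution: Lambert diffuse plus normalized
    Blinn-Phong specular, scaled by a linear radius falloff. n, l, v are
    unit vectors (surface normal, direction to light, direction to eye)."""
    h = normalize([a + b for a, b in zip(l, v)])  # half vector
    n_dot_l = max(dot(n, l), 0.0)
    n_dot_h = max(dot(n, h), 0.0)
    # The (m + 8) / (8 * pi) factor keeps the specular lobe roughly
    # energy-conserving as the specular power m varies.
    specular = ((spec_power + 8.0) / (8.0 * math.pi)) * spec_albedo \
               * (n_dot_h ** spec_power)
    falloff = max(1.0 - light_dist / light_radius, 0.0)  # linear radius clamp
    return (diffuse_albedo / math.pi + specular) * n_dot_l * falloff
```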

I gathered some performance numbers on the hardware I have access to, which is an AMD 6970 and an Nvidia GTX 570. For both GPUs I ran at 1920×1080 resolution with VSYNC disabled, and the timings represent total frame time. The Nvidia numbers were pretty much in line with my expectations:

Nvidia GTX 570
128 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      6.94ms                   6.41ms
2x MSAA      7.81ms                   7.51ms
4x MSAA      8.47ms                   9.17ms

256 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      11.67ms                  9.43ms
2x MSAA      12.987ms                 10.75ms
4x MSAA      13.88ms                  12.34ms

512 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      18.18ms                  14.084ms
2x MSAA      20.00ms                  15.63ms
4x MSAA      21.27ms                  17.24ms

1024 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      n/a (artifacts)          27.03ms
2x MSAA      n/a (artifacts)          29.41ms
4x MSAA      n/a (artifacts)          31.25ms

Tile-based deferred rendering wins out in nearly every case, and it only gets worse as you add more lights. Light indexed seems to scale a bit better with MSAA, but even that is only enough to overcome the overall disadvantage in the 128 light case. For 1024 lights it seemed as though the Nvidia driver or hardware couldn’t handle the large buffer I was using for storing the light indices, as I was getting very strange artifacts on the lower half of the screen. However, I can only imagine the trend would continue, and it would lag further behind the tile-based deferred renderer.

For the AMD 6970, the results were much more interesting:

AMD 6970
128 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      5.26ms                   5.71ms
2x MSAA      5.98ms                   9.43ms
4x MSAA      6.49ms                   10.75ms

256 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      7.87ms                   7.87ms
2x MSAA      8.77ms                   11.11ms
4x MSAA      9.43ms                   13.15ms

512 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      11.76ms                  11.36ms
2x MSAA      12.98ms                  14.93ms
4x MSAA      13.89ms                  16.94ms

1024 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      22.22ms                  20ms
2x MSAA      24.39ms                  25.64ms
4x MSAA      25.64ms                  33.33ms

These results really surprised me. The light indexed renderer actually starts out faster than the deferred renderer, and doesn’t really fall behind until you hit 1024 lights. With either 2x or 4x MSAA, however, the light indexed renderer absolutely blows away the competition. I actually suspected that I had done something wrong in my MSAA implementation, until I verified that I got similar results from the Intel sample. Perhaps there’s a better way to handle MSAA in a compute shader on AMD hardware? I didn’t spend a lot of time experimenting, so maybe someone else has a few bright ideas. Either way it’s clear that forward rendering scales really well with MSAA on this hardware. Even the G-Buffer pass fares pretty well, going from 1ms to 1.2ms to 1.3ms as the MSAA level increases (1.5ms to 1.9ms to 2.1ms without a z prepass).

So, where does this leave us? Even with these numbers we don’t have a complete picture. Really, we need some tests run with…

1. Different scenes, preferably some with even higher poly counts and/or some tessellation
2. More realistic material variety, including different texture configurations, layer blending, decals
3. A variety of complex BRDFs
4. A few different ambient/bounce lighting configurations
5. More lighting types, with different shadowing configurations
6. More hardware to test on

These things all have big implications for what you store in the G-Buffer, forward shading efficiency, and the cost of a z prepass. That last one is important, since a prepass is mandatory for light indexed deferred but optional for traditional deferred. While it can still be cheaper overall to have a z prepass before your G-Buffer pass (as it was in my case), that could change depending on your vertex processing costs.

So for now, my conclusion is that Light Indexed Deferred is at least in the realm of practical for most cases. Personally I consider even 256 to be a LOT of lights, so I’m not too worried about scaling up to thousands of lights anytime soon. But if anyone has access to different GPUs, I would love to get some more numbers so that I can post them here. So if you happen to have a 7970 or GTX 680 lying around, feel free to download my sample and take down some numbers. Originally the number of lights was hard-coded to 128 in the binary, but I’ve uploaded a new version that lets you toggle through the light counts I used for my test runs.

You can find the code and binary on CodePlex: http://mjp.codeplex.com/releases/view/85279#DownloadId=363173

Here are a few numbers for a GTX 680 contributed by Sander van Rossen:

Nvidia GTX 680
128 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      2.3ms                    2.6ms
2x MSAA      2.62ms                   3.86ms
4x MSAA      2.85ms                   4.95ms

And some more numbers for the AMD 7970, courtesy of phantom, gathered at 1280×720:

128 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      1.8ms                    1.9ms
2x MSAA      2.0ms                    2.82ms
4x MSAA      2.3ms                    3.6ms

256 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      2.5ms                    2.3ms
2x MSAA      2.7ms                    3.3ms
4x MSAA      3.0ms                    4.2ms

512 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      3.3ms                    2.9ms
2x MSAA      3.8ms                    4.2ms
4x MSAA      4.2ms                    5.2ms

1024 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      5.9ms                    4.5ms
2x MSAA      6.7ms                    6.4ms
4x MSAA      7.4ms                    7.8ms

Nvidia GTX 580, from Nathan Reed

128 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      3.77ms                   3.17ms
2x MSAA      4.14ms                   3.58ms
4x MSAA      4.39ms                   4.17ms

256 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      5.68ms                   4.37ms
2x MSAA      6.33ms                   4.95ms
4x MSAA      6.80ms                   5.52ms

512 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      8.54ms                   6.10ms
2x MSAA      9.62ms                   6.80ms
4x MSAA      10.42ms                  7.35ms

1024 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      16.67ms                  10.99ms
2x MSAA      18.87ms                  12.05ms
4x MSAA      20.41ms                  12.82ms

AMD 5870, from Ethatron

128 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      3.59ms                   3.48ms
2x MSAA      4.03ms                   4.60ms
4x MSAA      4.44ms                   5.49ms

256 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      5.18ms                   4.42ms
2x MSAA      5.78ms                   5.95ms
4x MSAA      6.32ms                   6.89ms

512 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      7.51ms                   5.95ms
2x MSAA      8.40ms                   7.63ms
4x MSAA      9.09ms                   8.69ms

1024 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      14.28ms                  10.20ms
2x MSAA      15.87ms                  12.98ms
4x MSAA      17.24ms                  14.28ms

Radeon 7970 @ 1920×1080, from 3dcgi:

128 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      3.03ms                   3.34ms
2x MSAA      3.52ms                   5.12ms
4x MSAA      3.96ms                   6.84ms

256 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      4.18ms                   4.20ms
2x MSAA      4.76ms                   6.25ms
4x MSAA      5.32ms                   8.13ms

512 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      5.85ms                   5.46ms
2x MSAA      6.62ms                   8.00ms
4x MSAA      7.19ms                   10.00ms

1024 Lights
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      10.42ms                  8.92ms
2x MSAA      11.63ms                  12.66ms
4x MSAA      12.82ms                  15.63ms

38 comments

  1. I couldn’t see how many lights your binary was displaying, but these are the results for my ridiculously overpowered dual GTX 680s:

    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      2.3ms                    2.6ms
    2x MSAA      2.62ms                   3.86ms
    4x MSAA      2.85ms                   4.95ms

  2. Thanks for posting those, Sander! The original binary was running at 128 lights. I just uploaded a new one that lets you switch the number of lights. I would assume that you ran at the default resolution of 1280×720?

  3. On my GTX 580 at 1280×720:

    128 lights
    1x: LI 3.77 TD 3.17
    2x: LI 4.14 TD 3.58
    4x: LI 4.39 TD 4.17

    256 lights
    1x: LI 5.68 TD 4.37
    2x: LI 6.33 TD 4.95
    4x: LI 6.80 TD 5.52

    512 lights
    1x: LI 8.54 TD 6.10
    2x: LI 9.62 TD 6.80
    4x: LI 10.42 TD 7.35

    1024 lights
    1x: LI 16.67 TD 10.99
    2x: LI 18.87 TD 12.05
    4x: LI 20.41 TD 12.82

    Similar pattern to your GTX 570. I should also note I disabled the Z prepass for the tiled deferred cases since it was slowing it down a bit.

    By the way, in multisampling mode with light-indexed are you running lighting per MSAA sample or just per pixel? And have you looked at detecting edges and running the per-sample lighting only for the tiles (or pixels) containing edges? That can be a big optimization for tiled-deferred, maybe less so for light-indexed deferred as it seems you’d have to branch in the pixel shader to implement it.

    Finally, I wonder how CSAA (NVIDIA) or EQAA (AMD) would affect things. I’m not sure how you actually turn those on in D3D, though.

  4. AMD 5870

    128 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      3.48ms                   3.59ms
    2x MSAA      4.03ms                   4.60ms
    4x MSAA      4.44ms                   5.49ms

  5. Continued… (in the table above the “No MSAA” numbers are swapped, sorry. And yes, on the 5870 the ranking flips between “No MSAA” and the MSAA cases; at 512 lights indexed can’t manage to compete.)

    256 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      5.18ms                   4.42ms
    2x MSAA      5.78ms                   5.95ms
    4x MSAA      6.32ms                   6.89ms

    512 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      7.51ms                   5.95ms
    2x MSAA      8.40ms                   7.63ms
    4x MSAA      9.09ms                   8.69ms

    1024 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      14.28ms                  10.20ms
    2x MSAA      15.87ms                  12.98ms
    4x MSAA      17.24ms                  14.28ms

    All default resolution.

  6. LI seems ALU-bound and TB seems at least partly memory-bound on the 5870; overclocking the core yields different speedups for LI and TB respectively (looking at the two extremes):

    LI speeds up more or less linearly across the board: going from 850 to 990MHz takes it from 3.59 to 3.19ms (linear would give 3.08), and from 17.24 to 14.92ms (linear would give 14.80).

    TB stalls in the “No MSAA” case at 990MHz and gets no faster: 850 to 990MHz takes it from 3.48 to 3.14ms (linear would give 2.98), and from 14.28 to 12.65ms (linear would give 12.26).

    At the slow extreme it’s a 13.5% vs. 11.5% speedup from a 14.2% overclock, i.e. TB gains about 85% of what LI gains.

    To me it seems that on the GK chips TB is only faster because of the large sustainable memory bandwidth. And it’s apparent that if I could clock my 5870 at, say, 2GHz, TB would never win. LI vs. TB seems to be an ALU vs. memory tradeoff, and not that relevant if the architecture sits somewhere in the middle. But since memory speeds are unlikely to rise much further (and are often below our GDDR5 speeds on mid-range cards), while core clocks keep rising even on mid-range cards, I’d say LI has the rosier prognosis.

  7. @Nathan,

    For Light Indexed Deferred it’s really just forward lighting, so I just turn on MSAA for the render target and let the hardware do its thing. This means that you only shade multiple times per pixel along triangle edges, where a triangle doesn’t fully cover all subsamples of a pixel. You could certainly use CSAA if you wanted; you just turn it on by using a different quality level. I’m not sure about EQAA.

    @ethatron

    Thank you for sharing such a detailed analysis! Your findings make sense though, since light indexed deferred tends to be VERY heavy on ALU in the pixel shader.

  8. Radeon 7970 at stock clocks full screen on a 1080p monitor with the taskbar hidden so the rendering window is a title bar short of 1080p.

    128 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      3.03ms                   3.34ms
    2x MSAA      3.52ms                   5.12ms
    4x MSAA      3.96ms                   6.84ms

    256 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      4.18ms                   4.20ms
    2x MSAA      4.76ms                   6.25ms
    4x MSAA      5.32ms                   8.13ms

    512 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      5.85ms                   5.46ms
    2x MSAA      6.62ms                   8.00ms
    4x MSAA      7.19ms                   10.00ms

    1024 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      10.42ms                  8.92ms
    2x MSAA      11.63ms                  12.66ms
    4x MSAA      12.82ms                  15.63ms

  9. @mjp Here: “No MSAA 3.48ms 3.59ms” I accidentally flipped the number, it should be “No MSAA 3.59ms 3.48ms”. :^)

  10. It was at the default resolution, I don’t know if that’s 1280×720..

    128 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      2.3ms                    2.6ms
    2x MSAA      2.62ms                   3.86ms
    4x MSAA      2.85ms                   4.95ms

    256 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      3.5ms                    3.95ms
    2x MSAA      3.95ms                   4.76ms
    4x MSAA      4.3ms                    6.28ms

    512 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      5.26ms                   5.71ms
    2x MSAA      5.95ms                   7.87ms
    4x MSAA      6.45ms                   9.61ms

    1024 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      10.2ms                   12.6ms
    2x MSAA      11.62ms                  15.15ms
    4x MSAA      12.65ms                  16.39ms

    And it’s a dual GTX 680 setup (2x with SLI, not a single card).
    These results make me wonder if SLI is configured correctly… or if something in the app makes it impossible for the driver to use SLI effectively. It’s just hard to believe that dual GTX 680s can be beaten so easily, heh.

  11. Yeah, the default resolution is 1280×720. SLI won’t kick in unless the driver has a profile for the app (or you use NVAPI to manually select a profile), so I’m sure that it’s just running on 1 GPU.

  12. I don’t have time to perform a full run, but here are a few 1280×720 numbers for comparison with a stock Radeon 7970.

    1024 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      6.02ms                   4.63ms
    2x MSAA      6.85ms                   6.58ms
    4x MSAA      7.52ms                   8.00ms

    One thing I noticed from the other reported numbers is that the GTX 580 is faster than the 680 at tiled deferred, yet the situation reverses for indexed deferred. I’m surprised at how much faster the Radeon 7970 is than the GTX 680, at least with 1024 lights.

  13. @MJP Hah, using the hardware to do what it’s designed for – who does that? :) But anyway, it seems that this is a bit of an unfair comparison because light-indexed is mostly shading per-pixel while tiled-deferred is (I presume) shading per sample in all cases. Tiled-deferred with MSAA edge detection could turn things around on the AMD cards. (Of course, the fact that MSAA ‘just works’ with light-indexed is itself an argument in its favor…)

  14. @Nathan

    The tile-based deferred renderer does use edge detection. It compares the normal + depth of all subsamples in a pixel, and appends the coordinates of the pixels that need per-sample shading to a list in shared memory. Then all of the subsamples from those pixels are distributed evenly among the threads in the thread group so they can be shaded. The comparison is actually pretty conservative in my sample, so you end up doing per-sample shading on significantly fewer pixels than in the forward-rendered case. But even with that optimization the AMD cards take a huge hit from MSAA, which is a bit puzzling to me.
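    As a rough Python stand-in for that classification step (the thresholds and names here are illustrative assumptions, not the sample’s actual values):

```python
def needs_per_sample_shading(samples, depth_tol=0.01, normal_tol=0.9):
    """samples holds one (depth, normal) pair per MSAA subsample of a pixel.
    The pixel is flagged for per-sample shading when any subsample's depth
    or normal diverges from the first sample's; otherwise it can be shaded
    once per pixel. Thresholds are illustrative."""
    d0, n0 = samples[0]
    for d, n in samples[1:]:
        # relative depth difference check
        if abs(d - d0) > depth_tol * max(abs(d0), 1e-6):
            return True
        # normals compared via dot product (both assumed unit length)
        if sum(a * b for a, b in zip(n, n0)) < normal_tol:
            return True
    return False
```

    Pixels that pass this test get their coordinates appended to the shared-memory list, and their subsamples are then shaded by the whole thread group.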

  15. MJP:
    this is all speculation, but one problem I’ve found with tile-based deferred and splitting the samples across threads is the amount of shared memory. I’ve had real problems with this, especially on GeForce; it seems really sensitive to shared memory usage (and its effect on occupancy).

    The other thing is, with the deferred version (with quite large threadgroups running 16×16 tiles = 256 threads – I actually went for 8×8 here) you’re making quite a big statement about your own scheduling / work balancing – the light indexed version is running a pixel shader to do the lighting so the hardware is scheduling the work in its own, probably smart, way. Wonder if that’s a part of the difference.

    Nice comparison though! Very useful to see.

  16. MJP – great read. Just wondering, why didn’t you try using a per-pixel list (like the AMD order-independent-transparency demo) for the light indices? Do you think that would have worse performance? Thanks.

  17. @directtovideo

    I suspected the same thing regarding shared memory pressure, so I ran a few experiments where I varied the thread group size, but I wasn’t able to improve the performance. And you’re absolutely right about the scheduling…it’s really the key difference between the two techniques. Ultimately it comes down to whether the flexibility you get from shading in a compute shader wins out over the efficiency of hardware scheduling, taking into account having to render out a G-Buffer (for tiled deferred) or requiring geometry to be rasterized twice (for indexed deferred).

    @Anonymous

    For a large number of lights in a scene having a per-pixel list doesn’t seem very compelling to me. The problems to me are:

    A. You’d have to compute light intersections per-list rather than per-tile, which means that you can’t compute the intersections for many lights in parallel like you can with per-tile lists. You could rasterize the light volumes and append the index to the per-tile linked lists (like in the AMD demo, as you suggested) but I’d imagine that would still be much slower.
    B. Your granularity during the forward lighting phase is limited by branching coherency, so it doesn’t seem worth it to do fine-grained light intersection.
    C. You’ll consume a lot more memory with per-pixel lists
    D. If you use a linked list, just reading the light indices in the forward lighting phase is going to be slower. One thing I discovered early on was that just reading indices can be a serious performance drain, so I tried to make it as cheap as possible.

    For a smaller number of lights it might make sense though, especially if going fine-grained allows you to do a better job culling non-spherical light sources.

  18. Not sure why you had issues with the GTX 570 at 1920×1080, it worked fine with mine. Here’s my results:

    Windows 7, Intel Q6600, 2.40Ghz
    NVIDIA GeForce 570 GTX, 296.10 drivers

    1280×720:

    LIDR TBDR

    128 lights:
    No MSAA 4.34 4.03
    2x MSAA 4.78 4.54
    4x MSAA 5.18 5.05

    256 lights:
    No MSAA 6.53 5.40
    2x MSAA 7.35 6.05
    4x MSAA 7.87 6.66

    512 lights:
    No MSAA 9.90 7.46
    2x MSAA 11.23 8.19
    4x MSAA 12.04 8.84

    1024 lights:
    No MSAA 19.23 13.15
    2x MSAA 21.73 14.49
    4x MSAA 23.80 15.38

    1920×1080:

    LIDR TBDR

    128 lights:
    No MSAA 7.24 6.80
    2x MSAA 8.00 7.69
    4x MSAA 8.47 9.17

    256 lights:
    No MSAA 11.62 9.90
    2x MSAA 12.98 10.98
    4x MSAA 13.88 12.19

    512 lights:
    No MSAA 18.51 14.49
    2x MSAA 20.40 15.87
    4x MSAA 21.73 17.24

    1024 lights:
    No MSAA 37.03 27.02
    2x MSAA 40.00 28.57
    4x MSAA 43.47 30.30

    I uploaded a graph of results from this page, substituting my GTX 570 results: http://img543.imageshack.us/img543/4589/lidrvstidr.png

    Do you have any idea about the performance difference if using arbitrary light volumes? Or frustum volumes for spot lights? Thanks.

  19. Thanks for those cool graphs Cyrus! I realized a few days ago that I was running an older driver on the machine I did the 570 test on, so it was probably just a driver bug that was resolved at some point.

    Spot lights are pretty tricky. I’ve been meaning to dedicate some time to investigating efficient ways to cull them per-tile, but haven’t gotten around to it yet. A full frustum-frustum test with SAT seems too heavyweight to be done in a single thread (IIRC it’s something like 6*8 + 6*8 + 6*6*8 dot products for the full test), so I’m thinking a cheaper approximation might be the way to go. I’ve been kicking around something I came up with based on plane/cone intersection tests that’s a lot cheaper, but gives false positives in a few cases. Rasterizing the volume might be another viable option for expensive lights. I can let you know how it goes once I get some time to work on it more.

  20. The big hit on AMD cards with MSAA seems to be in the G-buffer rendering phase in my brief testing. Never tracked down why, as the cards have plenty of bandwidth available. Perhaps a ROP throughput bottleneck, I’m not sure. Ideally if MSAA compression was “perfect”, it should be about the same overhead as MSAA with forward as it (roughly) is on NVIDIA.

  21. Hi Andrew,

    According to my profiling (performed via queries), filling the G-Buffer on my AMD 6970 only accounts for 2.09ms at 1920×1080 with 4xMSAA and no z prepass. With a z prepass it takes 1.31ms, with 0.4ms for the z prepass. The increase in frame time mostly comes from the lighting compute shader, which goes from 2.7ms with no MSAA to 6.8ms with 4xMSAA in the 128 light case.

    Your demo could be doing something different than mine of course, but if you hit “F8” in mine you can disable G-buffer rendering/updating, and last I checked that was the biggest part of the bottleneck on ATI.

    Of course it would make more sense if it was something in the significantly-more-complex light/shading pass, but that wasn’t what I experienced at least in the past :)

  23. I will note too that I typically prefer to “disable parts of the rendering”, etc. rather than use queries. Queries are a bit finicky in that they don’t necessarily interact with the pipelining in the GPU in a natural way (i.e. are you measuring end-to-end latency of a submitted command? stalling between each command instead? None is a good solution). Of course there’s no perfect solution but I find that a somewhat more consistent and predictable way to profile than queries.

    You definitely have a good point regarding the queries…I try to only use them to get a rough idea of timings, but even then they can be quite far off from the delta you get in overall frame time. I put in a setting to disable G-Buffer rendering, and that shows a delta of about 4-4.5ms with 4xMSAA, which is definitely significant and more in line with your findings. I can actually get a similar result from my queries if I force a sync point with a simple compute shader that reads from the MSAA G-Buffer textures. I would suspect that there’s something else expensive going on here, perhaps an expensive decompression step to allow the shader to sample the MSAA textures. Thank you for your input!

  25. Just got my first results after switching from light-prepass to tiled deferred. :) I’m quite happy with it. Have you had time to investigate arbitrary volume or spot light implementations yet? I haven’t had time since I’m dealing with cascaded shadow map performance. :( Off topic, but from the latest GDC papers it seems DICE is using VS instancing and the GS to select a render target array index, to avoid multiple draws per cascade. But you mention texture arrays being lower performance than an atlas in your tests. Any idea why? Also, I have not heard anyone talk about using an atlas with VS instancing and selecting a viewport index. That would eliminate the array but keep the single draw for all cascades.

  26. Radeon 7970 at 1200MHz clocks, 1280×720, default settings:
    128 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      1.70ms                   1.80ms
    2x MSAA      1.91ms                   2.55ms
    4x MSAA      6.25ms                   3.27ms

    256 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      2.15ms                   2.08ms
    2x MSAA      2.42ms                   2.94ms
    4x MSAA      7.35ms                   3.74ms

    512 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      2.84ms                   2.51ms
    2x MSAA      3.21ms                   3.57ms
    4x MSAA      8.771ms                  4.46ms

    1024 Lights
    MSAA Level   Light Indexed Deferred   Tile-Based Deferred
    No MSAA      4.78ms                   3.74ms
    2x MSAA      5.43ms                   5.26ms
    4x MSAA      11.9ms                   6.53ms

  27. GTX 680, 1280×720

    (Numbers are light prepass, tile deferred.)

    256 Lights
    1x 3.58 4.15
    2x 4.07 5.32
    4x 4.4 6.25

    512 Lights
    1x 5.46 5.6
    2x 6.2 7.75
    4x 6.7 9.4

    1024 Lights
    1x 10.6 12.2
    2x 12.04 14.7
    4x 13.1 16.1

  28. At least on a AMD Radeon HD 7770, there are artifacts when using 1024 lights (all MSAA settings) with Forward rendering (G-Buffer works fine)

    Here’s an image highlighting the artifacts:

    IT ONLY APPEARS WHEN LOOKING FROM THAT ANGLE

    Although it looks very small, it’s actually *very* noticeable because it flickers in blocks (tiles) across ALL the roof border; even when the camera is completely still. It works fine with 512 lights.

    My theory from a quick glance is that those tiles have more lights than what the card allows to hold in the linked list buffer (is there a hard limit? or maybe there’s a hard limit in the forward rendering loop…?) and race conditions cause different lights to be dropped each frame; therefore the tile never has a light list that holds all the needed lights, and the list is different every frame.

    Each tile flickers going lighter & darker. I don’t have an NVIDIA DX11 card to compare with, unfortunately.

  29. Hi Matias,

    I haven’t seen any similar artifacts myself, but I’m not terribly surprised. It certainly wouldn’t be the first time that I encountered quirky behavior with compute shaders that use atomics on shared memory variables. There’s actually no linked list, each tile has enough room in a buffer to store indices for N lights (where N is the maximum number of lights in the scene). So there *should* be enough room in the buffer to store 1024 lights, as well as in shared memory.

  30. Hi, thanks for the answer.
    Yeah, when I was referring to the linked list buffer, I was thinking you probably just used a big per-tile array.
    This isn’t surprising to me either; compute shaders put more responsibility on developers than pixel shaders, and it’s new enough tech that driver, compiler, and even hardware bugs can’t be ruled out yet.

    So, either the driver clamps the buffer size, the HW’s atomic operation is malfunctioning, there’s a rare race condition somewhere, the tile is somehow overflowing, or the pixel shader in the forward pass is just refusing to read the entire buffer and just parsing it partially. So many possibilities….

    I just wanted to know if someone else was able to reproduce the artifacts (only shows when lightcount = 1024; while looking from that particular angle as in the screenshot)

  31. @matiasgoldberg I got the same problem with my Radeon HD 7950 – it flickers at that angle with 1024 lights activated. However it works fine with fewer lights.

  32. Hi Matt,

    I’ve been using a simple frustum/cone test that checks to ensure that some part of the cone is on the positive side of all 6 frustum planes (using the cone/plane test from Real-Time Collision Detection). It certainly works when the cone is entirely on the negative side of one of the 6 planes, but for cones that are large relative to the frustum you can get cases where the cone is on the positive side of all 6 planes but still doesn’t intersect the actual frustum (you can actually get the same problem with a sphere/frustum test if you do it the same way). Constructing additional planes to test against helps, but doesn’t solve the problem entirely. If you’re not running into the same issues, then perhaps you’re doing something a bit more sophisticated?
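    For the curious, the per-plane check amounts to something like this Python sketch of the Real-Time Collision Detection cone/plane test (the names are mine; as described above, passing all six planes can still be a false positive):

```python
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def scale(v, s): return [x * s for x in v]
def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def cone_touches_positive_side(plane_n, plane_d, tip, axis, height, half_angle):
    """True if some part of the cone lies on the positive side of the plane
    dot(n, x) + d >= 0. plane_n and axis are assumed to be unit length."""
    # Trivial accept: the tip itself is on the positive side.
    if dot(plane_n, tip) + plane_d >= 0.0:
        return True
    # Otherwise test the point on the base rim farthest along the plane
    # normal: step to the base center, then out along the projection of
    # the normal onto the base plane.
    radius = height * math.tan(half_angle)
    q = add(tip, scale(axis, height))
    perp = sub(plane_n, scale(axis, dot(plane_n, axis)))
    perp_len_sq = dot(perp, perp)
    if perp_len_sq > 1e-12:  # axis not parallel to the plane normal
        q = add(q, scale(perp, radius / math.sqrt(perp_len_sq)))
    return dot(plane_n, q) + plane_d >= 0.0
```

    Per-tile culling then ANDs this test across the tile frustum’s six planes, which is where the large-cone false positives creep in.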
