For profiling GPU performance on the PC, there aren’t too many options. AMD’s GPU PerfStudio and Nvidia’s Parallel Nsight can be pretty handy due to their ability to query hardware performance counters and display the data, but they only work on each vendor’s respective hardware. You also might want to integrate some GPU performance numbers into your own internal profiling systems, in which case those tools aren’t going to be of much use.
To get around this, it’s possible to use D3D11 timestamp queries to get coarse-grained timing info for different parts of the frame. It’s a ways off from the kind of info you get from the vendor-specific tools, but it’s a lot better than nothing. It’s also pretty easy to implement.
To profile a portion of your frame, you need a trio of ID3D11Query objects. Two of them need to have the type D3D11_QUERY_TIMESTAMP, and are used to get the GPU timestamp at the start and end of the block you want to profile. The third needs to have the type D3D11_QUERY_TIMESTAMP_DISJOINT. It tells you whether the timestamps taken between its Begin and End are unreliable and should be thrown away, and it also gives you the frequency used for converting from ticks to seconds. In practice it goes like this:
When starting a profiling block:
- Call ID3D11DeviceContext::Begin and pass the DISJOINT query
- Call ID3D11DeviceContext::End and pass the start TIMESTAMP query (timestamp queries are issued with End only; there is no matching Begin)
When ending a profiling block:
- Call ID3D11DeviceContext::End and pass the end TIMESTAMP query
- Call ID3D11DeviceContext::End and pass the DISJOINT query
After waiting a sufficient amount of time for the queries to be ready:
- Call ID3D11DeviceContext::GetData on all 3 queries
- Compute the delta in ticks using the timestamps from both TIMESTAMP queries
- Use the frequency from the DISJOINT query to convert the delta to a time in seconds
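The steps above could be sketched roughly as follows. This is a minimal sketch, not the code from the sample app: GpuTimerQueries, CreateQueries, BeginBlock, EndBlock, TryResolve and TicksToMilliseconds are hypothetical names made up for illustration, error handling is minimal, and the query objects are never released. The D3D11 part is wrapped in #ifdef _WIN32 so that the conversion helper can also be compiled on its own.

```cpp
#include <cstdint>

// Converts a pair of GPU timestamps to milliseconds.
// Returns false (leaving outMs untouched) if the interval was flagged as
// disjoint, the frequency is zero, or the end tick precedes the start tick.
inline bool TicksToMilliseconds(std::uint64_t startTicks, std::uint64_t endTicks,
                                std::uint64_t frequency, bool disjoint,
                                double& outMs)
{
    if (disjoint || frequency == 0 || endTicks < startTicks)
        return false;
    const std::uint64_t delta = endTicks - startTicks;
    // frequency is in ticks per second, so scale by 1000 to get milliseconds.
    outMs = static_cast<double>(delta) * 1000.0 / static_cast<double>(frequency);
    return true;
}

#ifdef _WIN32
#include <d3d11.h>

// Hypothetical helper type: the three queries needed for one profiled block.
struct GpuTimerQueries
{
    ID3D11Query* disjoint = nullptr;
    ID3D11Query* start = nullptr;
    ID3D11Query* end = nullptr;
};

inline bool CreateQueries(ID3D11Device* device, GpuTimerQueries& q)
{
    D3D11_QUERY_DESC desc = {};
    desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
    if (FAILED(device->CreateQuery(&desc, &q.disjoint))) return false;
    desc.Query = D3D11_QUERY_TIMESTAMP;
    if (FAILED(device->CreateQuery(&desc, &q.start))) return false;
    return SUCCEEDED(device->CreateQuery(&desc, &q.end));
}

inline void BeginBlock(ID3D11DeviceContext* context, const GpuTimerQueries& q)
{
    context->Begin(q.disjoint);
    context->End(q.start);   // timestamp queries are issued with End only
}

inline void EndBlock(ID3D11DeviceContext* context, const GpuTimerQueries& q)
{
    context->End(q.end);
    context->End(q.disjoint);
}

// Call some frames after EndBlock. Returns false if any query is not ready
// yet, or if the DISJOINT query says the timestamps can't be trusted.
inline bool TryResolve(ID3D11DeviceContext* context, const GpuTimerQueries& q,
                       double& outMs)
{
    UINT64 startTicks = 0;
    UINT64 endTicks = 0;
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjointData = {};

    // GetData returns S_OK once the result is available, S_FALSE before that.
    if (context->GetData(q.start, &startTicks, sizeof(startTicks), 0) != S_OK)
        return false;
    if (context->GetData(q.end, &endTicks, sizeof(endTicks), 0) != S_OK)
        return false;
    if (context->GetData(q.disjoint, &disjointData, sizeof(disjointData), 0) != S_OK)
        return false;

    return TicksToMilliseconds(startTicks, endTicks, disjointData.Frequency,
                               disjointData.Disjoint != FALSE, outMs);
}
#endif
```

Note that TryResolve returns false both when the data isn’t ready yet and when the DISJOINT query reports an unreliable interval, so a real profiler would probably want to tell those two cases apart.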
As with any query, you need to wait for the GPU to actually execute all of the commands you submitted before the data is ready. In my sample app, I handle this by keeping an array of queries for each profile block and moving to the next one each frame. Then at the end of the frame, I get the data from the oldest query and use that for outputting the timing data to the screen. So the actual timing data lags behind by a few frames, but that’s okay for real-time profiling. For automated benchmarks or performance snapshots you could either use the data from N frames later, or you could just stall at the end of the frame and wait for the query to be ready.
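One way to do the per-frame rotation is to treat the array of queries as a small ring and track which slot is written this frame and which is old enough to read. The sketch below only shows the index bookkeeping. FrameSlotCursor and kQueryLatency are hypothetical names, and the latency of 5 frames is an arbitrary assumption, not necessarily what the sample app uses.

```cpp
#include <cstddef>

// Hypothetical constant: number of query sets kept per profile block,
// which is also how many frames pass before a slot is reused.
constexpr std::size_t kQueryLatency = 5;

struct FrameSlotCursor
{
    std::size_t current = 0;       // slot that this frame's queries go into
    std::size_t framesIssued = 0;  // total frames that have issued queries

    // Slot to use for this frame's Begin/End calls.
    std::size_t WriteSlot() const { return current; }

    // Call once at the end of each frame. Once every slot has been used,
    // returns true and sets readSlot to the oldest slot, which is the one
    // that will be overwritten on the next frame.
    bool Advance(std::size_t& readSlot)
    {
        ++framesIssued;
        current = (current + 1) % kQueryLatency;
        if (framesIssued < kQueryLatency)
            return false;          // ring not full yet, nothing to read
        readSlot = current;
        return true;
    }
};
```

With this layout, each frame issues its queries on WriteSlot(), then calls Advance() and, if it returns true, calls GetData on the queries in readSlot before that slot gets reused.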
Sample code and binaries are available on CodePlex: http://mjp.codeplex.com/releases/view/74987#DownloadId=292437