A trip through the Graphics Pipeline 2011_13 Compute Shaders, UAV, atomic, structured buffer
Welcome back to what’s going to be the last “official” part of this series – I’ll do more GPU-related posts in the future, but this series is long enough already. We’ve been touring all the regular parts of the graphics pipeline, down to different levels of detail. Which leaves one major new feature introduced in DX11 out: Compute Shaders. So that’s gonna be my topic this time around.
For this series, the emphasis has been on overall dataflow at the architectural level, not shader execution (which is explained well elsewhere). For the stages so far, that meant focusing on the input piped into and output produced by each stage; the way the internals work was usually dictated by the shape of the data. Compute shaders are different – they’re running by themselves, not as part of the graphics pipeline, so the surface area of their interface is much smaller.
In fact, on the input side, there aren’t really any buffers for input data at all. The only input Compute Shaders get, aside from API state such as the bound Constant Buffers and resources, is their thread index. There’s a tremendous potential for confusion here, so here’s the most important thing to keep in mind: a “thread” is the atomic unit of dispatch in the CS environment, and it’s a substantially different beast from the threads provided by the OS that you probably associate with the term. CS threads have their own identity and registers, but they don’t have their own Program Counter (Instruction Pointer) or stack, nor are they scheduled individually.
That covers the “thread” and “warp” levels: individual threads are bundled into Warps, and the Warp is what actually gets scheduled. Above that is the “thread group” level, which deals with – who would’ve thought? – groups of threads. The size of a thread group is specified during shader compilation. In DX11, a thread group can contain anywhere between 1 and 1024 threads, and the thread group size is specified not as a single number but as a 3-tuple giving the group’s extent in x, y, and z. This numbering scheme is mostly for the convenience of shader code that addresses 2D or 3D resources, though it also allows for traversal optimizations. At the macro level, CS execution is dispatched in multiples of thread groups; thread groups in D3D11 are again identified by 3D group IDs, same as threads, and for pretty much the same reasons.
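To make the ID hierarchy concrete, here is a small Python sketch of how the per-thread IDs relate to each other. The function names are made up for illustration; the arithmetic follows the documented meaning of the HLSL system values `SV_GroupID`, `SV_GroupThreadID`, `SV_DispatchThreadID` and `SV_GroupIndex`. The size check only covers the 1 to 1024 total mentioned above, not any per-axis limits.

```python
from typing import Tuple

Vec3 = Tuple[int, int, int]


def validate_group_size(group_size: Vec3) -> int:
    """Check the total thread count against the 1..1024 range from the text."""
    sx, sy, sz = group_size
    total = sx * sy * sz
    if not 1 <= total <= 1024:
        raise ValueError(f"thread group of {total} threads is out of range")
    return total


def dispatch_thread_id(group_id: Vec3, group_size: Vec3,
                       group_thread_id: Vec3) -> Vec3:
    """Global thread coordinate: group_id * group_size + group_thread_id per axis."""
    gx, gy, gz = group_id
    sx, sy, sz = group_size
    tx, ty, tz = group_thread_id
    return (gx * sx + tx, gy * sy + ty, gz * sz + tz)


def group_index(group_size: Vec3, group_thread_id: Vec3) -> int:
    """Flattened index of a thread inside its group (x varies fastest)."""
    sx, sy, _ = group_size
    tx, ty, tz = group_thread_id
    return tz * sx * sy + ty * sx + tx
```

With an 8x8x1 group, thread (3, 5, 0) of group (2, 1, 0) ends up at global coordinate (19, 13, 0), which is the texel it would naturally address in a 2D resource.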
The above description makes it sound like thread groups are a fairly arbitrary middle level in this hierarchy. However, there’s one important bit missing that makes thread groups very special indeed: Thread Group Shared Memory (TGSM). On DX11 level hardware, compute shaders have access to 32 KB of TGSM, which is basically a scratchpad for communication between threads in the same group. This is the primary (and fastest) way by which different CS threads can communicate.
Group Synchronization. A Group Synchronization Barrier forces all threads inside the current group to reach the barrier before any of them may proceed past it. Once a Warp reaches such a barrier, it will be flagged as non-runnable, same as if it were waiting for a memory or texture access to complete. Once the last Warp reaches the barrier, the other Warps will be reactivated. This all happens at the Warp scheduling level; it adds additional scheduling constraints, which may cause stalls, but there’s no need for atomic memory transactions or anything like that; other than lost utilization at the micro level, this is a reasonably cheap operation.
Group Memory Barriers. Since all threads within a group run on the same shader unit, this basically amounts to a pipeline flush, to ensure that all pending shared memory operations are completed. There’s no need to synchronize with resources external to the current shader unit, which means it’s again reasonably cheap.
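DX11 offers different types of barriers that combine several of the above components into one atomic unit. In HLSL these are the GroupMemoryBarrier, DeviceMemoryBarrier and AllMemoryBarrier intrinsics, each of which also comes in a WithGroupSync variant; the semantics follow directly from the components they combine.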
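The write, barrier, read pattern that these barriers exist for can be sketched in Python with `threading.Barrier`. This is only an analogy: Python threads are OS threads, not CS threads, and the `shared` list merely stands in for TGSM. The helper name `run_group` is invented for this example.

```python
import threading


def run_group(group_size: int) -> list:
    """Each thread writes its own slot, waits at the barrier, then reads a neighbor's slot."""
    shared = [0] * group_size   # stand-in for thread group shared memory
    result = [0] * group_size   # stand-in for per-thread output
    barrier = threading.Barrier(group_size)

    def thread_body(tid: int) -> None:
        # Phase 1: every thread writes only its own slot.
        shared[tid] = tid * tid
        # No thread continues until all threads in the group have arrived.
        barrier.wait()
        # Phase 2: it is now safe to read a slot written by another thread.
        neighbor = (tid + 1) % group_size
        result[tid] = shared[neighbor]

    threads = [threading.Thread(target=thread_body, args=(tid,))
               for tid in range(group_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Without the `barrier.wait()` call, a thread could read its neighbor's slot before the neighbor has written it, which is exactly the hazard a group sync barrier rules out.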
We’ve now dealt with CS input and learned a bit about CS execution. But where do we put our output data? The answer has the unwieldy name “unordered access views”, or UAVs for short. A UAV seems somewhat similar to render targets in Pixel Shaders (and UAVs can in fact be used in addition to render targets in Pixel Shaders), but there are some very important semantic differences. Most importantly, as the name says, access to UAVs is unordered: the API does not guarantee that accesses from different threads become visible in any particular order.
In current CPUs, most of the magic for shared memory processing is handled by the memory hierarchy (i.e. caches). To write to a piece of memory, the active core must first assert exclusive ownership of the corresponding cache line. This is accomplished using what’s called a “cache coherency protocol”, usually MESI and descendants. The details are tangential to this article; what matters is that because writing to memory entails acquiring exclusive ownership, there’s never a risk of two cores simultaneously trying to write to the same location. In such a model, atomic operations can be implemented by holding exclusive ownership for the duration of the operation; if we had exclusive ownership for the whole time, there’s no chance that someone else was trying to write to the same location while we were performing the atomic operation. Again, the actual details of this get hairy pretty fast (especially as soon as things like paging, interrupts and exceptions get involved), but the 30,000-foot view will suffice for the purposes of this article.
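The “hold exclusive ownership for the whole read-modify-write” idea can be illustrated with a toy model. In this Python sketch a `threading.Lock` stands in for exclusive ownership of a cache line; real hardware does this in the coherency protocol, not with a software lock, and the class and function names here are made up for illustration.

```python
import threading


class OwnedCell:
    """A memory location whose updates require exclusive ownership."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        # Stand-in for exclusive cache-line ownership (an assumption of this sketch).
        self._owner = threading.Lock()

    def atomic_add(self, delta: int) -> int:
        """Read-modify-write while holding ownership; returns the old value."""
        with self._owner:
            old = self._value
            self._value = old + delta
            return old

    def load(self) -> int:
        with self._owner:
            return self._value


def hammer(cell: OwnedCell, num_threads: int, adds_per_thread: int) -> None:
    """Have several threads increment the same cell concurrently."""
    def body() -> None:
        for _ in range(adds_per_thread):
            cell.atomic_add(1)

    threads = [threading.Thread(target=body) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because nobody else can touch the value between the read and the write, no increment is ever lost, regardless of how the threads interleave.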
One final remark is that, of course, outstanding atomic operations count as “device memory” accesses, same as memory/texture reads and UAV writes; shader units need to keep track of their outstanding atomic operations and make sure they’re finished when they hit device memory access barriers.
Unless I missed something, structured buffers and append/consume buffers are the last CS-related features I haven’t talked about yet. And, well, from a hardware perspective, there’s not that much to talk about, really. Structured buffers are more of a hint to the driver-internal shader compiler than anything else; they tell the driver how the buffer is going to be used – namely, that it consists of elements with a fixed stride that are likely going to be accessed together – but they still compile down to regular memory accesses in the end. The structured buffer part may bias the driver’s decision of their position and layout in memory, but it does not add any fundamentally new functionality to the model.
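“Elements with a fixed stride” boils down to plain address arithmetic: element i lives at byte offset i times the stride. A minimal Python sketch using the standard `struct` module, with a hypothetical 16-byte element (three floats plus an unsigned int) chosen purely for illustration:

```python
import struct

# Hypothetical element layout: float3 position + uint id, little-endian, no padding.
PARTICLE = struct.Struct("<3fI")


def make_buffer(count: int) -> bytearray:
    """Allocate a zero-filled buffer holding `count` fixed-stride elements."""
    return bytearray(PARTICLE.size * count)


def store(buf: bytearray, index: int, x: float, y: float, z: float, ident: int) -> None:
    """Write element `index`; its byte offset is simply index * stride."""
    PARTICLE.pack_into(buf, index * PARTICLE.size, x, y, z, ident)


def load(buf: bytearray, index: int) -> tuple:
    """Read element `index` back as (x, y, z, id)."""
    return PARTICLE.unpack_from(buf, index * PARTICLE.size)
```

Nothing here goes beyond ordinary loads and stores, which is the point: the structure declaration mainly tells the driver which bytes belong together.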
Append/consume buffers are similar; they could be implemented using the existing atomic instructions. In fact, they kind of are, except that the append/consume pointers don’t live at an explicit location in the resource; they’re side-band data outside the resource, accessed using special atomic instructions. (And similarly to structured buffers, declaring their usage as append/consume allows the driver to pick their location in memory appropriately.)
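As a rough model of that description, here is a Python sketch of an append/consume buffer whose element counter lives outside the element storage and is only touched through atomic increment and decrement. The class is invented for illustration; the lock stands in for the special atomic instructions, and the bounds checks are an addition of this sketch rather than something the text describes.

```python
import threading


class AppendConsumeBuffer:
    """Element storage plus a side-band counter updated atomically."""

    def __init__(self, capacity: int) -> None:
        self._slots = [None] * capacity
        self._count = 0                         # side-band data, not part of _slots
        self._counter_lock = threading.Lock()   # stand-in for the atomic instruction

    @property
    def count(self) -> int:
        with self._counter_lock:
            return self._count

    def _atomic_increment(self) -> int:
        """Reserve a slot; returns the counter value before the increment."""
        with self._counter_lock:
            old = self._count
            if old >= len(self._slots):
                raise IndexError("append to full buffer")
            self._count = old + 1
            return old

    def _atomic_decrement(self) -> int:
        """Release a slot; returns the counter value after the decrement."""
        with self._counter_lock:
            if self._count == 0:
                raise IndexError("consume from empty buffer")
            self._count -= 1
            return self._count

    def append(self, item) -> None:
        self._slots[self._atomic_increment()] = item

    def consume(self):
        return self._slots[self._atomic_decrement()]


def fill_concurrently(buf: AppendConsumeBuffer, num_threads: int, per_thread: int) -> None:
    """Several threads append disjoint ranges of integers at the same time."""
    def body(base: int) -> None:
        for i in range(per_thread):
            buf.append(base + i)

    threads = [threading.Thread(target=body, args=(t * per_thread,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In this sketch appends and consumes are meant to run in separate phases; interleaving them could hand out a slot whose contents have not been written yet. The order in which items land in the buffer depends on thread scheduling, so only the set of stored items is predictable.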
And… that’s it. No more previews for the next part, this series is done :), though that doesn’t mean I’m done with it. I have some restructuring and partial rewriting to do – these blog posts are raw and unproofed, and I intend to go over them and turn them into a single document. In the meantime, I’ll be writing about other stuff here. I’ll try to incorporate the feedback I got so far – if there are any other questions, corrections or comments, now’s the time to tell me! I don’t want to nail down the ETA for the final cleaned-up version of this series, but I’ll try to get it done well before the end of the year. We’ll see. Until then, thanks for reading!