A trip through the Graphics Pipeline 2011_09_Pixel processing – “join phase”
Welcome back!
At the bottom of the pipeline (in what D3D calls the “Output Merger” stage), we have late Z/stencil processing and blending. These two operations are both relatively simple computationally, and they both update the render target(s) / depth buffer respectively. “Update” operation here means they’re of the read-modify-write variety. Because all of this happens for every quad that makes it this far through the pipeline, it’s also bandwidth-intensive. Finally, it’s order-sensitive (both blending and Z processing need to happen in API order), so we need to make sure to sort processed quads into order first.
I’ve already explained Z-processing, and blending is one of these things that work pretty much as you’d expect; it’s a fixed-function block that performs a multiply, a multiply-add and maybe some subtractions first, per render target. This block is kept deliberately simple; it’s separate from the shader units so it needs its own ALU, and we’d really prefer for it to be as small as possible: we want to spend our chip area (and power budget) on ALUs in the shader units, where they benefit every code that runs on the GPU, not on a fixed-function unit that’s only used at the end of the pixel pipeline. Also, we need it to have a short, predictable latency: this part of the pipeline needs to process data in-order to be correct. This limits our options as far as trading throughput for latency is concerned; we can still process quads that don’t overlap in parallel, but if we e.g. draw lots of small triangles, we’ll have multiple quads coming in for every screen location, and we’d better be able to write them out as quickly as they come, or else all our massively parallel pixel processing was for nought.
ROPs are the hardware units that handle this part of the pipeline (as you can tell by the plural, there’s more than one). The acronym, depending on who you asks, stands for “Render OutPut unit”, “Raster Operations Pipeline”, or “Raster Operations Processor”. The actual name is fairly archaic – it derives from the days of pure 2D hardware acceleration, with hardware whose main purpose was to do fast Bit blits. The classic 2D ROP design has three inputs – the current (destination) pixel value in the frame buffer, the source data, and a mask input – then computes some function of the 3 values and writes the results back to the frame buffer. Note this is before true color displays: the image data was usually in bit plane format and the function was some binary logic function. Then at some point bit planes died out (in favor of “chunky” representations that keep the bits for a pixel together), true color became the norm, the on-off mask was replaced with an alpha channel and the bitwise operations with blends, but the name stuck. So even now in 2011, when about the last remnant of that original architecture is the “logic op” in OpenGL, we still call them ROPs.
So what do we need to do, in hardware, for blend/late Z? A simple plan:
Read original render target/depth buffer contents from memory – memory access, long latency. Might also involve depth buffer and render target decompression! (I’ll explain render target compression later)
So, build the late-Z/blending unit, add some compression logic, wire it up to memory on one side and do some buffering of shaded quads on the other side and we’re done, right?
Well, in theory anyway.
Except we need to cover the long latencies somehow. And all this happens for every single pixel (well, quad, actually). So we need to worry about memory bandwidth too… memory bandwidth? Wasn’t there something about memory bandwidth? Watch closely now as I pull a bunny out of a hat after I put it there way back in part 2(uh oh, that was more than a week ago – hope that critter is still OK in there…).
In part 2, I described the 2D layout of DRAM, and how it’s faster to stay within a single row because changing the active row takes time – so for ideal bandwidth you want to stay in the same row between accesses. Well, the thing is, single DRAM rows are kinda large. Individual DRAM chips go up into the Gigabit range in size these days, and while they’re not necessarily square (in fact a 2:1 aspect ratio seems to be preferred), you can still do a rough calculation of how many rows and columns there would be; for 512 Megabit (=64MB), we’d expect something like 16384×32768, i.e. a single row is about 32k bits or 4k bytes (or maybe 2k, or 8k, but somewhere in that ballpark – you get the idea). That’s a rather inconvenient size to be making memory transactions in.
Hence, a compromise: the page. A DRAM page is some more conveniently sized slice of a row (by now, usually 256 or 512 bits) that’s commonly transferred in a single burst. Let’s take 512 bits (64 bytes) for now. At 32 bits per pixel – the standard for depth buffers and still fairly common for render targets although rendering workloads are definitely shifting towards 64 bit/pixel formats – that’s enough memory to fit data for 16 pixels in. Hey, that’s funny – we’re usually shading pixels in groups of 16 to 64! (NV is a bit closer to the smaller end, AMD favors the larger counts).
That gives us yet another reason to shade pixels in groups, and also yet another reason to do a two-level traversal. But can we milk this some more? You bet we can: we still have the memory latency to cover. Usual disclaimer: This is one of the places where I don’t have detailed information on what GPUs actually do, so what I’m describing here is a guess, not a fact. Anyway, as soon as we’ve rasterized a tile, we know whether it generates any pixels or not.
All of this is early enough to avoid latency stalls for all but the fastest pixel shaders (which are usually memory bandwidth-bound anyway). There’s also the issue of pixel shaders that output to multiple render targets, but that depends on how exactly that feature is implemented. You could run the shader multiple times (not efficient but easiest if you have fixed-size output buffers), or you could run all the render targets through the same ROP (but up to 8 rendertargets with up to 128 bits/pixels – that’s a lot of buffer space we’re talking), or you could allocate one ROP per output render target.
An of course, if we have these buffers in the ROPs anyway, we might as well treat them as a small cache (i.e. keep them around for a while). This would help if you’re drawing lots of small triangles – as long as they’re spatially localized, anyway. Again, I’m not sure if GPUs actually do this, but it seems like a reasonable thing to do (you’d probably want to flush these buffers something like once per batch or so though, to avoid the synchronization/coherency issues that full write-back caches bring).
Okay, that explains the memory side of things, and the computational part we’ve already covered. Next up: Compression!
I already explained the basic workings of this in part 7 while talking about Z; in fact, I don’t have much to add about depth buffer compression here. But all the bandwidth issues I mentioned there exist for color values too; it’s not so bad for regular rendering (unless the Pixel Shaders output pixels fast enough to hit memory bandwidth limits), but it is a serious issue for MSAA, where we suddenly store somewhere between 2 and 8 samples per pixel. Like Z, we want some lossless compression scheme to save bandwidth in common cases. Unlike Z, plane equations per tile are not a good fit to textured pixel data.
However, that’s no problem, because actually, MSAA pixel data is even easier to optimize for: Remember that pixel shaders only run once per pixel, not per sample – unless you’re using sample-frequency shading anyway, but that’s a D3D11 feature and not commonly used (yet?). Hence, for all pixels that are fully covered by a single primitive, the 2-8 samples stored will usually be the same. And that’s the idea behind the common color buffer compression schemes: Write a flag bit (either per pixel, or per quad, or on an even larger granularity) that denotes whether for all the pixels in a compression block, all the per-sample colors are in fact the same. And if that’s the case, we only need to store the color once per pixel after all.
On the subject of clears and compression, there’s another thing to mention: Some GPUs have “hierarchical Z”-like mechanisms that store, for a large block of pixels (a rasterizer tile, maybe even larger) that the block was recently cleared. Then you only need to store one color value for the whole tile (or larger block) in memory. This gives you very fast color clears for some buffers (again, you need some tag bits for this!). However, as soon as any pixel with non-clear color is written to the tile (or larger block), the “this was just cleared” flag needs to be… well, cleared. But we do save a lot of memory bandwidth on the clear itself and the first time a tile is read from memory.
And that’s it for our first rendering data path: just Vertex and Pixel Shaders (the most common path). In the next part, I’ll talk about Geometry Shaders and how that pipeline looks. But before I conclude this post, I have a small bonus topic that fits into this section.
Everyone who writes rendering code wonders about this at some point – the regular blend pipeline a serious pain to work with sometimes. So why can’t we get fully programmable blend? We have fully programmable shading, after all! Well, we now have the necessary framework to look into this properly. There’s two main proposals for this that I’ve seen – let’s look at the both in turn:
Blend in Pixel Shader – i.e. Pixel Shader reads framebuffer, computes blend equation, writes new output value.
Programmable Blend Unit – “Blend Shaders”, with subset of full shader instruction set if necessary. Happen in separate stage after PS.
1. Blend in Pixel Shader
This seems like a no-brainer: after all, we have loads and texture samples in shaders already, right? So why not just allow a read to the current render target? Turns out that unconstrained reads are a really bad idea, because it means that every pixel being shaded could (potentially) influence every other pixel being shaded. So what if I reference a pixel in the quad over to the left? Well, a shader for that quad could be running this instant. Or I could be sampling half of my current quad and half of another quads that’s currently active – what do I do now? What exactly would be the correct results in that regard, never mind that we’d probably have to shade all quads sequentially to reliably get them? No, that’s a can of worms. Unconstrained reads from the frame buffer in Pixel Shaders are out. But what if we get a special render target read instruction that samples one of the active render targets at the current location? Now, that’s a lot better – now we only need to worry about writes to the location of the current quad, which is a way more tractable problem.
However, it still introduces ordering constraints; we have to check all quads generated by the rasterizer vs. the quads currently being pixel-shaded. If a quad just generated by the rasterizer wants to write to a sample that’ll be written by one of the Pixel Shaders that are currently in flight, we need to wait until that PS is completed before we can dispatch the new quad. This doesn’t sound too bad, but how do we track this? We could just have a “this sample is currently being shaded” bit flag… so how many of these bits do we need? At 1920×1080 with 8x MSAA, about 2MB worth of them (that’s bytes not bits) – and that memory is global, shared and determines the rate at which we can issue new quads (since we need to mark a quad as busy before we can issue it). Worse, with the hierarchical Z etc. tag bits, they were just a hint; if we ran out of them, we could still render, albeit more slowly. But this memory is not optional. We can’t guarantee correctness unless we’re really tracking every sample! What if we just tracked the “busy” state per pixel (or even quad), and any write to a pixel would block all other such writes? That would work, but it would massively harm our MSAA performance: If we track per sample, we can shade adjacent, non-overlapping triangles in parallel, no problem. But if we track per pixel (or at lower granularity), we effectively serialize all the edge quads. And what happens to our fill rate for e.g. particle systems with lots of overdraw? With the pipeline I described, these render (more or less) as fast as the ROPs can merge the incoming pixels into the store buffers. But if we need to avoid conflicts, we really end up shading the individual overlapping particles in order. This isn’t good news for our shader units that are designed to trade latency for throughput, not at all.
Okay, so this whole tracking thing is a problem. What if we just force shading to execute in order? That is, keep the whole thing pipelined and all shaders running in lockstep; now we don’t need tracking because pixels will finish in the same order we put them into the pipeline! But the problem here is that we need to make sure the shaders in a batch actually always take the exact same time, which has unfortunate consequences: You always have to wait the worst-case delay time for every texture sample, need to always execute both sides of every branch (someone might at some point need the then/else branches, and we need everything to take the same time!), always runs all loops through for the same number of iterations, can’t stop shading on discard… no, that doesn’t sound like a winner either.
Okay, time to face the music: Pixel Shader blend in the architecture I’ve described comes with a bunch of seriously tricky problems. So what about the second approach?
I’ll say it right now: This can be made to work, but…
Let’s just say it has its own problems. For once, we now need another full ALU + instruction decoder/sequencer etc. in the ROPs. This is not a small change – not in design effort, nor in area, nor in power.
Point being, the serial execution here really constrains us to designs that are still relatively low-level; nowhere near the fully programmable shader units we’ve come to love. A nicer blend unit with some extra blend modes, you can definitely get; a more open register combiner-style design, possibly, though neither the API guys nor the hardware guys will like it much (the API because it’s a fixed function block, the hardware guys because it’s big and needs a big ALU+control logic where they’d rather not have it). Fully programmable, with branches, loops, etc. – not going to happen. At that point you might as well bite the bullet and do what it takes to get the “Blend in Pixel Shader” scenario to work properly.
…and that’s it for this post! See you next time.
A trip through the Graphics Pipeline 2011_09_Pixel processing – “join phase”的更多相关文章
- A trip through the Graphics Pipeline 2011_08_Pixel processing – “fork phase”
In this part, I’ll be dealing with the first half of pixel processing: dispatch and actual pixel sha ...
- A trip through the Graphics Pipeline 2011_13 Compute Shaders, UAV, atomic, structured buffer
Welcome back to what’s going to be the last “official” part of this series – I’ll do more GPU-relate ...
- A trip through the Graphics Pipeline 2011_10_Geometry Shaders
Welcome back. Last time, we dove into bottom end of the pixel pipeline. This time, we’ll switch ...
- A trip through the Graphics Pipeline 2011_12 Tessellation
Welcome back! This time, we’ll look into what is perhaps the “poster boy” feature introduced with th ...
- A trip through the Graphics Pipeline 2011_07_Z/Stencil processing, 3 different ways
In this installment, I’ll be talking about the (early) Z pipeline and how it interacts with rasteriz ...
- A trip through the Graphics Pipeline 2011_05
After the last post about texture samplers, we’re now back in the 3D frontend. We’re done with verte ...
- A trip through the Graphics Pipeline 2011_04
Welcome back. Last part was about vertex shaders, with some coverage of GPU shader units in general. ...
- A trip through the Graphics Pipeline 2011_03
At this point, we’ve sent draw calls down from our app all the way through various driver layers and ...
- A trip through the Graphics Pipeline 2011_02
Welcome back. Last part was about vertex shaders, with some coverage of GPU shader units in general. ...
随机推荐
- BFS HDOJ 2102 A计划
题目传送门 题意:中文题面 分析:双层BFS,之前写过类似的题.总结坑点: 1.步数小于等于T都是YES 2. 传送门的另一侧还是传送门或者墙都会死 3. 走到传送门也需要一步 #include &l ...
- MapReduce 作业调试
1. 最经典的方法通过打印语句来调试程序 System.err.println("Bad Data"+value.toString()); 这些输出错误都会记录到一个标准错误中,可 ...
- eclipse下Android无法自动生成apk文件怎么办?
eclipse下Android无法自动生成apk文件怎么办? 现象:创建android工程后,通过手动build/clean或自动build均无法在bin文件夹下生成.apk文件 解决方法:进入win ...
- Codeforces 682B New Skateboard(DP)
题目大概说给一个数字组成的字符串问有几个子串其代表的数字(可以有前导0)能被4整除. dp[i][m]表示字符串0...i中mod 4为m的后缀的个数 通过在i-1添加str[i]字符转移,或者以st ...
- EF框架step by step(6)—处理实体complex属性
上一篇的中介绍过了对于EF4.1框架中,实体的简单属性的处理 这一篇介绍一下Code First方法中,实体Complex属性的处理.Complex属性是将一个对象做为另一个对象的属性.映射到数据库中 ...
- oracle增量备份
在进行数据库维护的过程中经常会遇到数据库备份的问题.先介绍一种常用的数据备份操作系统执行计划+批处理命令:在win的系统中存在 任务计划程序 选项:新建任务选中你写好的程序,设定好时间,就可以按照设定 ...
- mysql建表建索引
建表: CREATE TABLE `sj_projects` ( `id` int(11) NOT NULL AUTO_INCREMENT, `title` varchar(255) NOT NULL ...
- 当编译AFNetworking 2.0时出现了Undefined symbols for architecture i386
当将AFNetworking添加到工程后编译时出现 Undefined symbols for architecture i386: "_SecCertificateCopyData&quo ...
- ACM: Long Live the Queen - 树上的DP
Long Live the Queen Time Limit:250MS Memory Limit:4096KB 64bit IO Format:%I64d & %I64u D ...
- [深入浅出WP8.1(Runtime)]应用文件的URI方案
6.2.4 应用文件的URI方案 在上文我们获取文件的方式都是通过应用程序的三个跟目录的文件夹对象来获取文件夹对象和文件对象,那么我们这一小节来讲解一种新的获取文件对象的方式,这种方式就是通过Uri地 ...