Introduction to 3D Game Programming with DirectX 12 Study Notes: Chapter 13, The Compute Shader

Original post: Introduction to 3D Game Programming with DirectX 12 Study Notes: Chapter 13, The Compute Shader

Code repository:

https://github.com/jiabaodan/Direct12BookReadingNotes

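GPUs are optimized to process large amounts of memory from a single location or from sequential locations (streaming operations); this contrasts with a CPU, which is designed for random memory access. Because vertices and pixels can be processed independently, GPUs are architected for massive parallelism; for example, NVIDIA's "Fermi" architecture supports up to 16 streaming multiprocessors with 32 CUDA cores each, for a total of 512 CUDA cores.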

Using the GPU for non-graphical applications is called general purpose GPU (GPGPU) programming.

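Learning objectives

Learn how to write compute shader programs;
Gain a basic, high-level understanding of how the hardware processes thread groups and the threads within them;
Learn which D3D resources can be used as compute shader inputs and which as outputs;
Understand the thread ID values and what they are used for;
Learn about shared memory and how it can be used to optimize performance;
Find out where to get more detailed information about GPGPU programming.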

1 Threads and Thread Groups

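In GPU programming, the threads to be executed are divided into a grid of thread groups, and each thread group runs on a single multiprocessor. So if your GPU has 16 multiprocessors, you should split the problem into at least 16 thread groups so that every multiprocessor has work to do. For better performance, give each multiprocessor at least two thread groups, so that it can switch to another group when one stalls ([Fung10]).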

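Each thread group gets its own shared memory, which every thread in the group can access; a thread cannot access the shared memory of another thread group.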

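A thread group consists of n threads. The hardware divides these threads into warps (32 threads per warp), and the multiprocessor processes each warp in SIMD32 fashion. Each CUDA core processes one thread; recall that a "Fermi" multiprocessor has 32 CUDA cores. In D3D you can specify a thread group size that is not a multiple of 32, but for performance reasons it is best to use a multiple of the warp size ([Fung10]).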

A thread group size of 256 seems to be a good starting point across different hardware; from there you can experiment with other sizes.

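NVIDIA hardware uses a warp size of 32 threads; ATI hardware uses a wavefront size of 64 threads and recommends that the thread group size always be a multiple of the wavefront size. Of course, warp and wavefront sizes may change in future hardware.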

In D3D, thread groups are launched with the following function:

void ID3D12GraphicsCommandList::Dispatch(

	UINT ThreadGroupCountX,

	UINT ThreadGroupCountY,

	UINT ThreadGroupCountZ);

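A dispatch can describe a 3D grid of thread groups, but the book only uses 2D grids. For example, dispatching three thread groups in the x direction and two in the y direction launches 3 × 2 = 6 thread groups in total.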
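The group counts passed to Dispatch are usually derived from the problem size. The following is a minimal C++ sketch of that arithmetic; the helper names GroupCount and TotalGroups are hypothetical and not part of D3D12.

```cpp
#include <cstdint>

// Hypothetical helper (not a D3D12 API): number of thread groups needed to
// cover `elements` items when each group handles `groupSize` of them.
// Rounds up so that a partially filled last group is still dispatched.
constexpr std::uint32_t GroupCount(std::uint32_t elements, std::uint32_t groupSize)
{
    return (elements + groupSize - 1) / groupSize;
}

// Total number of thread groups launched by Dispatch(x, y, z).
constexpr std::uint32_t TotalGroups(std::uint32_t x, std::uint32_t y, std::uint32_t z)
{
    return x * y * z;
}
```

With these helpers, TotalGroups(3, 2, 1) gives the six groups of the example, and GroupCount(800, 256) gives the four groups needed to cover an 800-pixel row with 256 threads per group.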

2 A Simple Compute Shader

Below is a simple compute shader that sums two textures of the same size:

cbuffer cbSettings

{

	// Compute shader can access values in constant buffers.

};

// Data sources and outputs.

Texture2D gInputA;

Texture2D gInputB;

RWTexture2D<float4> gOutput;

// The number of threads in the thread group. The threads in a group can

// be arranged in a 1D, 2D, or 3D grid layout.

[numthreads(16, 16, 1)]

void CS(int3 dispatchThreadID : SV_DispatchThreadID) // Thread ID

{

	// Sum the xyth texels and store the result in the xyth texel of

	// gOutput.

	gOutput[dispatchThreadID.xy] = gInputA[dispatchThreadID.xy] + gInputB[dispatchThreadID.xy];

}

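A compute shader consists of the following components:

Global variables accessed through constant buffers;
Input and output resources, described in the next section;
The [numthreads(X, Y, Z)] attribute, which specifies the number of threads in the thread group as a 3D grid of threads;
The shader body, containing the instructions executed by each thread;
Thread identification system value parameters (see Section 4).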

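As the code above shows, the threads in a group can be arranged in different topologies, and the choice depends on the problem you are solving. The total number of threads per group should be a multiple of the wavefront size (64); it is then automatically a multiple of the warp size (32) as well, so it suits both vendors' cards.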

2.1 Compute PSO

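To enable a compute shader, we use a special "compute pipeline state description". It has far fewer fields than D3D12_GRAPHICS_PIPELINE_STATE_DESC, because the compute shader sits outside the graphics pipeline and none of the graphics pipeline state applies to it. The following is an example of creating a compute pipeline state object: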

D3D12_COMPUTE_PIPELINE_STATE_DESC wavesUpdatePSO = {};

wavesUpdatePSO.pRootSignature = mWavesRootSignature.Get();

wavesUpdatePSO.CS =

{

	reinterpret_cast<BYTE*> (mShaders["wavesUpdateCS"]->GetBufferPointer()),

		mShaders["wavesUpdateCS"]->GetBufferSize()

};

wavesUpdatePSO.Flags = D3D12_PIPELINE_STATE_FLAG_NONE;

ThrowIfFailed(md3dDevice->CreateComputePipelineState(

	&wavesUpdatePSO,

	IID_PPV_ARGS(&mPSOs["wavesUpdate"])));

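The root signature defines which parameters the shader expects as input. The following is an example of compiling the compute shader code: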

mShaders["wavesUpdateCS"] = d3dUtil::CompileShader(

	L"Shaders\\WaveSim.hlsl", nullptr,

	"UpdateWavesCS", "cs_5_0");

3 Input and Output Resources

Two kinds of resources can be bound to a compute shader: buffers and textures.

3.1 Texture Inputs

The compute shader in the previous section defined two input textures:

Texture2D gInputA;

Texture2D gInputB;

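They are bound as inputs by creating shader resource views (SRVs) for the textures and passing them as arguments to root parameters: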

cmdList->SetComputeRootDescriptorTable(1, mSrvA);

cmdList->SetComputeRootDescriptorTable(2, mSrvB);

This is the same way SRVs are bound for pixel shaders (SRVs are read-only).

3.2 Texture Outputs and Unordered Access Views (UAVs)

The earlier shader also defined an output resource:

RWTexture2D<float4> gOutput;

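Output resources are treated specially and carry the prefix "RW", which stands for read-write. In contrast, gInputA and gInputB are read-only. The element type and dimensions must also be specified. For example, if the output were a 2D integer texture with format DXGI_FORMAT_R8G8_SINT, we would write: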

RWTexture2D<int2> gOutput;

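Binding an output resource to the compute shader requires a new view type, the unordered access view (UAV). In code it is represented by a descriptor handle and described by the D3D12_UNORDERED_ACCESS_VIEW_DESC structure. It is created much like an SRV. The following example creates a texture together with an SRV and a UAV to it: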

D3D12_RESOURCE_DESC texDesc;

ZeroMemory(&texDesc, sizeof(D3D12_RESOURCE_DESC));

texDesc.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;

texDesc.Alignment = 0;

texDesc.Width = mWidth;

texDesc.Height = mHeight;

texDesc.DepthOrArraySize = 1;

texDesc.MipLevels = 1;

texDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;

texDesc.SampleDesc.Count = 1;

texDesc.SampleDesc.Quality = 0;

texDesc.Layout = D3D12_TEXTURE_LAYOUT_UNKNOWN;

texDesc.Flags = D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS;

ThrowIfFailed(md3dDevice->CreateCommittedResource(

	&CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT),

	D3D12_HEAP_FLAG_NONE,

	&texDesc,

	D3D12_RESOURCE_STATE_COMMON,

	nullptr,

	IID_PPV_ARGS(&mBlurMap0)));

D3D12_SHADER_RESOURCE_VIEW_DESC srvDesc = {};

srvDesc.Shader4ComponentMapping = D3D12_DEFAULT_SHADER_4_COMPONENT_MAPPING;

srvDesc.Format = mFormat;

srvDesc.ViewDimension = D3D12_SRV_DIMENSION_TEXTURE2D;

srvDesc.Texture2D.MostDetailedMip = 0;

srvDesc.Texture2D.MipLevels = 1;

D3D12_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};

uavDesc.Format = mFormat;

uavDesc.ViewDimension = D3D12_UAV_DIMENSION_TEXTURE2D;

uavDesc.Texture2D.MipSlice = 0;

md3dDevice->CreateShaderResourceView(mBlurMap0.Get(),

	&srvDesc, mBlur0CpuSrv);

md3dDevice->CreateUnorderedAccessView(mBlurMap0.Get(),

	nullptr, &uavDesc, mBlur0CpuUav);

If a texture is going to be bound as a UAV, it must be created with the D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS flag.

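Recall that a descriptor heap of type D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV can mix CBVs, SRVs, and UAVs in the same heap. Once the descriptors are in the heap, we bind resources to the pipeline for a dispatch call simply by passing descriptor handles as arguments to root parameters. The following is the root signature code for the compute shader: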

void BlurApp::BuildPostProcessRootSignature()

{

	CD3DX12_DESCRIPTOR_RANGE srvTable;

	srvTable.Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 1, 0);

	CD3DX12_DESCRIPTOR_RANGE uavTable;

	uavTable.Init(D3D12_DESCRIPTOR_RANGE_TYPE_UAV, 1, 0);

	// Root parameter can be a table, root descriptor or root constants.

	CD3DX12_ROOT_PARAMETER slotRootParameter[3];

	// Performance TIP: Order from most frequent to least frequent.

	slotRootParameter[0].InitAsConstants(12, 0);

	slotRootParameter[1].InitAsDescriptorTable(1, &srvTable);

	slotRootParameter[2].InitAsDescriptorTable(1, &uavTable);

	// A root signature is an array of root parameters.

	CD3DX12_ROOT_SIGNATURE_DESC rootSigDesc(3,

		slotRootParameter,

		0, nullptr,

		D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT);

	// create a root signature with a single slot which points to a

	// descriptor range consisting of a single constant buffer

	ComPtr<ID3DBlob> serializedRootSig = nullptr;

	ComPtr<ID3DBlob> errorBlob = nullptr;

	HRESULT hr = D3D12SerializeRootSignature(&rootSigDesc,

		D3D_ROOT_SIGNATURE_VERSION_1,

		serializedRootSig.GetAddressOf(),

		errorBlob.GetAddressOf());

	if(errorBlob != nullptr)

	{

		::OutputDebugStringA((char*)errorBlob->GetBufferPointer());

	}

	ThrowIfFailed(hr);

	ThrowIfFailed(md3dDevice->CreateRootSignature(

		0,

		serializedRootSig->GetBufferPointer(),

		serializedRootSig->GetBufferSize(),

		IID_PPV_ARGS(mPostProcessRootSignature.GetAddressOf())));

}

Before the dispatch call, we bind the constants and descriptors:

cmdList->SetComputeRootSignature(rootSig);

cmdList->SetComputeRoot32BitConstants(0, 1, &blurRadius, 0);

cmdList->SetComputeRoot32BitConstants(0, (UINT)weights.size(), weights.data(), 1);

cmdList->SetComputeRootDescriptorTable(1, mBlur0GpuSrv);

cmdList->SetComputeRootDescriptorTable(2, mBlur1GpuUav);

UINT numGroupsX = (UINT)ceilf(mWidth / 256.0f);

cmdList->Dispatch(numGroupsX, mHeight, 1);

3.3 Indexing and Sampling Textures

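Texture elements are accessed with 2D indices. Here the index is based on the dispatch thread ID (thread IDs are covered in Section 4); every thread has a unique dispatch ID: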

[numthreads(16, 16, 1)]

void CS(int3 dispatchThreadID : SV_DispatchThreadID)

{

	// Sum the xyth texels and store the result in the xyth texel of

	// gOutput.

	gOutput[dispatchThreadID.xy] =

		gInputA[dispatchThreadID.xy] +

		gInputB[dispatchThreadID.xy];

}

Assuming we dispatch enough thread groups to cover the texture, this code sums the two textures and stores the result in gOutput.

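Because the compute shader runs on the GPU, it has access to the usual GPU tools; in particular, we can sample textures with texture filtering. There are two issues, however. First, we cannot use the Sample method and must use SampleLevel instead. SampleLevel takes an extra parameter that specifies the mip level: 0 is the top (most detailed) level, and fractional values interpolate between two mip levels when linear mip filtering is enabled. The compute shader is not used directly for rendering, so the mip level cannot be selected automatically and must be given explicitly. Second, when sampling we use normalized texture coordinates in [0, 1]^2 instead of integer indices. The texture size (width, height) can be stored in a constant buffer variable, and the normalized coordinates are then derived from the integer indices (x, y) as u = x / width and v = y / height.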

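The following code shows a compute shader that uses integer indices, followed by an equivalent version that uses texture coordinates and SampleLevel (it assumes a 512 × 512 texture and uses only the top mip level):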

//

// VERSION 1: Using integer indices.

//

cbuffer cbUpdateSettings

{

	float gWaveConstant0;

	float gWaveConstant1;

	float gWaveConstant2;

	float gDisturbMag;

	int2 gDisturbIndex;

};

RWTexture2D<float> gPrevSolInput : register(u0);

RWTexture2D<float> gCurrSolInput : register(u1);

RWTexture2D<float> gNextSolOutput : register(u2);

[numthreads(16, 16, 1)]

void CS(int3 dispatchThreadID : SV_DispatchThreadID)

{

int x = dispatchThreadID.x;

int y = dispatchThreadID.y;

gNextSolOutput[int2(x,y)] =

	gWaveConstant0*gPrevSolInput[int2(x,y)].r +

	gWaveConstant1*gCurrSolInput[int2(x,y)].r +

	gWaveConstant2*(

		gCurrSolInput[int2(x,y+1)].r +

		gCurrSolInput[int2(x,y-1)].r +

		gCurrSolInput[int2(x+1,y)].r +

		gCurrSolInput[int2(x-1,y)].r);

}

//

// VERSION 2: Using SampleLevel and texture coordinates.

//

cbuffer cbUpdateSettings

{

	float gWaveConstant0;

	float gWaveConstant1;

	float gWaveConstant2;

	float gDisturbMag;

	int2 gDisturbIndex;

};

SamplerState samPoint : register(s0);

// SampleLevel() needs read-only textures bound as SRVs; RW textures cannot be sampled.

Texture2D gPrevSolInput : register(t0);

Texture2D gCurrSolInput : register(t1);

RWTexture2D<float> gNextSolOutput : register(u0);

[numthreads(16, 16, 1)]

void CS(int3 dispatchThreadID : SV_DispatchThreadID)

{

// Equivalently using SampleLevel() instead of operator [].

int x = dispatchThreadID.x;

int y = dispatchThreadID.y;

float2 c = float2(x,y)/512.0f;

float2 t = float2(x,y-1)/512.0;

float2 b = float2(x,y+1)/512.0;

float2 l = float2(x-1,y)/512.0;

float2 r = float2(x+1,y)/512.0;

gNextSolOutput[int2(x,y)] =

	gWaveConstant0*gPrevSolInput.SampleLevel(samPoint, c, 0.0f).r +

	gWaveConstant1*gCurrSolInput.SampleLevel(samPoint, c, 0.0f).r +

	gWaveConstant2*(

		gCurrSolInput.SampleLevel(samPoint, b, 0.0f).r +

		gCurrSolInput.SampleLevel(samPoint, t, 0.0f).r +

		gCurrSolInput.SampleLevel(samPoint, r, 0.0f).r +

		gCurrSolInput.SampleLevel(samPoint, l, 0.0f).r);

}
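Version 2 turns integer indices into normalized coordinates by dividing by the texture size. The conversion can be sketched in C++ as follows. The helper names are hypothetical, and the texel-center variant (adding 0.5 before dividing) is a common convention assumed here, not something the listing above does.

```cpp
// Normalized texture coordinates in [0, 1]^2.
struct Float2
{
    float u;
    float v;
};

// Hypothetical helper: divides the integer texel index by the texture size,
// as Version 2 does with float2(x, y) / 512.0f. The result addresses the
// top-left corner of the texel.
constexpr Float2 TexelCorner(int x, int y, int width, int height)
{
    return { static_cast<float>(x) / width, static_cast<float>(y) / height };
}

// Assumed variant: offsets by half a texel so the coordinate addresses the
// texel center, which avoids ambiguity at texel boundaries.
constexpr Float2 TexelCenter(int x, int y, int width, int height)
{
    return { (x + 0.5f) / width, (y + 0.5f) / height };
}
```

For a 512 × 512 texture, TexelCorner(256, 128, 512, 512) yields (0.5, 0.25), while TexelCenter keeps every coordinate strictly inside the open interval (0, 1).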

3.4 Structured Buffer Resources

The following code shows how structured buffers are declared in HLSL:

struct Data

{

	float3 v1;

	float2 v2;

};

StructuredBuffer<Data> gInputA : register(t0);

StructuredBuffer<Data> gInputB : register(t1);

RWStructuredBuffer<Data> gOutput : register(u0);

A structured buffer is simply a buffer holding an array of elements of one type, and that type can be a structure defined by the user in HLSL.

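A structured buffer can be used as an SRV or as a UAV, and the two are created in much the same way; the UAV buffer additionally needs the D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS flag: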

struct Data

{

	XMFLOAT3 v1;

	XMFLOAT2 v2;

};

// Generate some data to fill the SRV buffers with.

std::vector<Data> dataA(NumDataElements);

std::vector<Data> dataB(NumDataElements);

for(int i = 0; i < NumDataElements; ++i)

{

	dataA[i].v1 = XMFLOAT3(i, i, i);

	dataA[i].v2 = XMFLOAT2(i, 0);

	dataB[i].v1 = XMFLOAT3(-i, i, 0.0f);

	dataB[i].v2 = XMFLOAT2(0, -i);

}

UINT64 byteSize = dataA.size()*sizeof(Data);

// Create some buffers to be used as SRVs.

mInputBufferA = d3dUtil::CreateDefaultBuffer(

	md3dDevice.Get(),

	mCommandList.Get(),

	dataA.data(),

	byteSize,

	mInputUploadBufferA);

mInputBufferB = d3dUtil::CreateDefaultBuffer(

	md3dDevice.Get(),

	mCommandList.Get(),

	dataB.data(),

	byteSize,

	mInputUploadBufferB);

// Create the buffer that will be a UAV.

ThrowIfFailed(md3dDevice->CreateCommittedResource(

	&CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT),

	D3D12_HEAP_FLAG_NONE,

	&CD3DX12_RESOURCE_DESC::Buffer(byteSize,

		D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS),

	D3D12_RESOURCE_STATE_UNORDERED_ACCESS,

	nullptr,

	IID_PPV_ARGS(&mOutputBuffer)));

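Structured buffers are bound to the pipeline much like textures: we create SRV or UAV descriptors for them and pass these as arguments to root parameters that take descriptor tables. Alternatively, we can define the root signature to take root descriptors, so that the virtual address of a resource is bound directly as the root argument without going through a descriptor heap (this works only for SRVs and UAVs to buffers, not for textures). Consider the following root signature description: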

// Root parameter can be a table, root descriptor or root constants.

CD3DX12_ROOT_PARAMETER slotRootParameter[3];

// Performance TIP: Order from most frequent to least frequent.

slotRootParameter[0].InitAsShaderResourceView(0);

slotRootParameter[1].InitAsShaderResourceView(1);

slotRootParameter[2].InitAsUnorderedAccessView(0);

// A root signature is an array of root parameters.

CD3DX12_ROOT_SIGNATURE_DESC rootSigDesc(3,

	slotRootParameter,

	0, nullptr,

	D3D12_ROOT_SIGNATURE_FLAG_NONE);

We then bind our buffers for use by a dispatch call:

mCommandList->SetComputeRootSignature(mRootSignature.Get());

mCommandList->SetComputeRootShaderResourceView(0,

	mInputBufferA->GetGPUVirtualAddress());

mCommandList->SetComputeRootShaderResourceView(1,

	mInputBufferB->GetGPUVirtualAddress());

mCommandList->SetComputeRootUnorderedAccessView(2,

	mOutputBuffer->GetGPUVirtualAddress());

mCommandList->Dispatch(1, 1, 1);

3.5 Copying CS Results to System Memory

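To read compute shader results on the CPU, we create a system memory buffer with heap properties D3D12_HEAP_TYPE_READBACK. We can then use the ID3D12GraphicsCommandList::CopyResource method to copy the GPU resource to the system memory resource. The system memory resource must have the same size and format as the source. Finally, we map the system memory buffer with the mapping API and read it on the CPU.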

The structured buffer demo, called "VecAdd", simply sums the corresponding vectors stored in two structured buffers:

struct Data

{

	float3 v1;

	float2 v2;

};

StructuredBuffer<Data> gInputA : register(t0);

StructuredBuffer<Data> gInputB : register(t1);

RWStructuredBuffer<Data> gOutput : register(u0);

[numthreads(32, 1, 1)]

void CS(int3 dtid : SV_DispatchThreadID)

{

	gOutput[dtid.x].v1 = gInputA[dtid.x].v1 + gInputB[dtid.x].v1;

	gOutput[dtid.x].v2 = gInputA[dtid.x].v2 + gInputB[dtid.x].v2;

}

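For simplicity, the structured buffers contain only 32 elements, so we dispatch a single thread group (one thread group processes 32 elements). Once the compute shader has finished, we copy the results to system memory and save them to a file. The following code shows how to copy the results to system memory: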

// Create a system memory version of the buffer to read the

// results back from.

ThrowIfFailed(md3dDevice->CreateCommittedResource(

	&CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_READBACK),

	D3D12_HEAP_FLAG_NONE,

	&CD3DX12_RESOURCE_DESC::Buffer(byteSize),

	D3D12_RESOURCE_STATE_COPY_DEST,

	nullptr,

	IID_PPV_ARGS(&mReadBackBuffer)));

// …

//

// Compute shader finished!

struct Data

{

	XMFLOAT3 v1;

	XMFLOAT2 v2;

};

// Schedule to copy the data to the default buffer to the readback buffer.

mCommandList->ResourceBarrier(1,

	&CD3DX12_RESOURCE_BARRIER::Transition(

		mOutputBuffer.Get(),

		D3D12_RESOURCE_STATE_COMMON,

		D3D12_RESOURCE_STATE_COPY_SOURCE));

mCommandList->CopyResource(mReadBackBuffer.Get(), mOutputBuffer.Get());

mCommandList->ResourceBarrier(1,

	&CD3DX12_RESOURCE_BARRIER::Transition(

		mOutputBuffer.Get(),

		D3D12_RESOURCE_STATE_COPY_SOURCE,

		D3D12_RESOURCE_STATE_COMMON));

// Done recording commands.

ThrowIfFailed(mCommandList->Close());

// Add the command list to the queue for execution.

ID3D12CommandList* cmdsLists[] = { mCommandList.Get() };

mCommandQueue->ExecuteCommandLists(_countof(cmdsLists), cmdsLists);

// Wait for the work to finish.

FlushCommandQueue();

// Map the data so we can read it on CPU.

Data* mappedData = nullptr;

ThrowIfFailed(mReadBackBuffer->Map(0, nullptr, reinterpret_cast<void**>(&mappedData)));

std::ofstream fout("results.txt");

for(int i = 0; i < NumDataElements; ++i)

{

	fout << "(" << mappedData[i].v1.x << ", " <<

	mappedData[i].v1.y << ", " <<

	mappedData[i].v1.z << ", " <<

	mappedData[i].v2.x << ", " <<

	mappedData[i].v2.y << ")" << std::endl;

}

mReadBackBuffer->Unmap(0, nullptr);

In the demo, we fill the two input buffers with the following initial data:

std::vector<Data> dataA(NumDataElements);

std::vector<Data> dataB(NumDataElements);

for(int i = 0; i < NumDataElements; ++i)

{

	dataA[i].v1 = XMFLOAT3(i, i, i);

	dataA[i].v2 = XMFLOAT2(i, 0);

	dataB[i].v1 = XMFLOAT3(-i, i, 0.0f);

	dataB[i].v2 = XMFLOAT2(0, -i);

}

The results written to the file are:

(0, 0, 0, 0, 0)

(0, 2, 1, 1, -1)

(0, 4, 2, 2, -2)

(0, 6, 3, 3, -3)

(0, 8, 4, 4, -4)

(0, 10, 5, 5, -5)

(0, 12, 6, 6, -6)

(0, 14, 7, 7, -7)

(0, 16, 8, 8, -8)

(0, 18, 9, 9, -9)

(0, 20, 10, 10, -10)

(0, 22, 11, 11, -11)

(0, 24, 12, 12, -12)

(0, 26, 13, 13, -13)

(0, 28, 14, 14, -14)

(0, 30, 15, 15, -15)

(0, 32, 16, 16, -16)

(0, 34, 17, 17, -17)

(0, 36, 18, 18, -18)

(0, 38, 19, 19, -19)

(0, 40, 20, 20, -20)

(0, 42, 21, 21, -21)

(0, 44, 22, 22, -22)

(0, 46, 23, 23, -23)

(0, 48, 24, 24, -24)

(0, 50, 25, 25, -25)

(0, 52, 26, 26, -26)

(0, 54, 27, 27, -27)

(0, 56, 28, 28, -28)

(0, 58, 29, 29, -29)

(0, 60, 30, 30, -30)

(0, 62, 31, 31, -31)

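As the accompanying figure shows, copying data between CPU and GPU memory is the slowest kind of transfer. For graphics, we should not do this every frame, because it kills performance. For GPGPU programming, getting results back to the CPU is usually required, but it is not much of a problem there, because the copy happens far less often than once per frame.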

4 Thread Identification System Values

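Consider the thread T marked in the figure, where each thread group has 8 × 8 threads. T has thread group ID (1, 1, 0) and group thread ID (2, 5, 0). Its dispatch thread ID is (1, 1, 0) ⊗ (8, 8, 0) + (2, 5, 0) = (10, 13, 0), and its group index is 5·8 + 2 = 42.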

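Each thread group is assigned an ID by the system, the group ID, which has the system value semantic SV_GroupID;
Within a thread group, each thread has a unique ID relative to the group, the group thread ID, with the semantic SV_GroupThreadID;
A dispatch call dispatches a grid of thread groups. The dispatch thread ID uniquely identifies a thread among all the threads generated by that dispatch call. Let ThreadGroupSize = (X, Y, Z) be the thread group size; the dispatch thread ID can then be derived from the group ID and the group thread ID: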

dispatchThreadID.xyz = groupID.xyz * ThreadGroupSize.xyz + groupThreadID.xyz;

The dispatch thread ID has the system value semantic SV_DispatchThreadID.

A linear-index version of the group thread ID is available through the SV_GroupIndex system value; it is computed as:

groupIndex = groupThreadID.z*ThreadGroupSize.x*ThreadGroupSize.y +

	groupThreadID.y*ThreadGroupSize.x +

	groupThreadID.x;

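Regarding coordinate order: the first coordinate is the x position (column) and the second is the y position (row). This is the opposite of conventional matrix notation, where the row index comes first.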
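The two formulas above can be checked on the CPU with a short C++ sketch. The type UInt3 and the function names are hypothetical stand-ins for the HLSL system values; the group size uses Z = 1 rather than the 0 written in the figure caption, which does not change the result for a 2D group.

```cpp
#include <cstdint>

struct UInt3
{
    std::uint32_t x;
    std::uint32_t y;
    std::uint32_t z;
};

// dispatchThreadID = groupID * ThreadGroupSize + groupThreadID (component-wise).
constexpr UInt3 DispatchThreadId(UInt3 groupId, UInt3 groupSize, UInt3 groupThreadId)
{
    return { groupId.x * groupSize.x + groupThreadId.x,
             groupId.y * groupSize.y + groupThreadId.y,
             groupId.z * groupSize.z + groupThreadId.z };
}

// Flattened index of a thread inside its group (the value SV_GroupIndex provides).
constexpr std::uint32_t GroupIndex(UInt3 groupThreadId, UInt3 groupSize)
{
    return groupThreadId.z * groupSize.x * groupSize.y +
           groupThreadId.y * groupSize.x +
           groupThreadId.x;
}
```

For thread T with group ID (1, 1, 0), group size (8, 8, 1), and group thread ID (2, 5, 0), DispatchThreadId returns (10, 13, 0) and GroupIndex returns 42.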

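Why do we need these IDs? A compute shader generally reads from some input data structure and writes to some output data structure, and the thread IDs serve as indices into those structures: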

Texture2D gInputA;

Texture2D gInputB;

RWTexture2D<float4> gOutput;

[numthreads(16, 16, 1)]

void CS(int3 dispatchThreadID : SV_DispatchThreadID)

{

	// Use dispatch thread ID to index into output and input textures.

	gOutput[dispatchThreadID.xy] = gInputA[dispatchThreadID.xy] + gInputB[dispatchThreadID.xy];

}

SV_GroupThreadID is useful for indexing into thread-local storage (group shared memory).

5 Append and Consume Buffers

Suppose we have a buffer of particles defined by the following structure:

struct Particle

{

	float3 Position;

	float3 Velocity;

	float3 Acceleration;

};

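We want the compute shader to update each particle's position from its constant acceleration and its velocity. Suppose, further, that we do not care about the order in which particles are updated or the order in which they are written to the output buffer. Consume and append structured buffers fit this scenario, and they spare us from dealing with indices: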

struct Particle

{

	float3 Position;

	float3 Velocity;

	float3 Acceleration;

};

float TimeStep = 1.0f / 60.0f;

ConsumeStructuredBuffer<Particle> gInput;

AppendStructuredBuffer<Particle> gOutput;

[numthreads(16, 16, 1)]

void CS()

{

	// Consume a data element from the input buffer.

	Particle p = gInput.Consume();

	p.Velocity += p.Acceleration*TimeStep;

	p.Position += p.Velocity*TimeStep;

	// Append the updated particle to the output buffer.

	gOutput.Append( p );

}

Once a data element has been consumed, it cannot be consumed again by another thread.

An append structured buffer does not grow dynamically; it must always be large enough to hold everything you append to it.
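The consume/append behavior can be imitated on the CPU with atomic counters. The sketch below is only a loose analogy under stated assumptions: the classes ConsumeBuffer and AppendBuffer and the function RunDemo are hypothetical, the particle is one-dimensional for brevity, the "threads" run sequentially, and the way the GPU actually implements its hidden counters is not described in the text.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

// One-dimensional particle to keep the sketch short (the HLSL struct uses float3).
struct Particle
{
    float Position;
    float Velocity;
    float Acceleration;
};

// CPU stand-in for a consume buffer: a counter starts at the element count,
// and each successful Consume() decrements it and hands out one element.
class ConsumeBuffer
{
public:
    explicit ConsumeBuffer(std::vector<Particle> data)
        : mData(std::move(data)), mCount(mData.size()) {}

    // Returns false once every element has been handed out.
    bool Consume(Particle& out)
    {
        std::size_t n = mCount.load();
        while (n > 0)
        {
            // On failure, n is refreshed with the current counter value.
            if (mCount.compare_exchange_weak(n, n - 1))
            {
                out = mData[n - 1];
                return true;
            }
        }
        return false;
    }

private:
    std::vector<Particle> mData;
    std::atomic<std::size_t> mCount;
};

// CPU stand-in for an append buffer with a fixed capacity.
class AppendBuffer
{
public:
    explicit AppendBuffer(std::size_t capacity) : mData(capacity), mCount(0) {}

    // The buffer never grows, so an append past the end is dropped.
    bool Append(const Particle& p)
    {
        const std::size_t i = mCount.fetch_add(1);
        if (i >= mData.size()) return false;
        mData[i] = p;
        return true;
    }

    std::size_t Size() const { return std::min(mCount.load(), mData.size()); }

private:
    std::vector<Particle> mData;
    std::atomic<std::size_t> mCount;
};

// Work done by one "thread": consume, integrate, append.
void UpdateOne(ConsumeBuffer& input, AppendBuffer& output, float timeStep)
{
    Particle p;
    if (!input.Consume(p)) return;
    p.Velocity += p.Acceleration * timeStep;
    p.Position += p.Velocity * timeStep;
    output.Append(p);
}

// Runs `threads` sequential "threads" over `count` particles and returns
// how many particles ended up in the output buffer.
std::size_t RunDemo(std::size_t count, std::size_t threads)
{
    std::vector<Particle> particles(count, Particle{0.0f, 1.0f, 0.0f});
    ConsumeBuffer input(std::move(particles));
    AppendBuffer output(count);
    for (std::size_t t = 0; t < threads; ++t)
        UpdateOne(input, output, 1.0f / 60.0f);
    return output.Size();
}
```

Running more "threads" than there are particles leaves the output at the particle count, which mirrors the rule that an element can be consumed only once.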

6 Shared Memory and Synchronization

In compute shader code, shared memory is declared like this:

groupshared float4 gCache[256];

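The array can be any size, up to a maximum of 32 KB. Because shared memory is local to a thread group, it is indexed with SV_GroupThreadID; for example, you could give each thread in the group access to one slot of the shared memory.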

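Using too much shared memory can cause performance problems ([Fung10]). Suppose a multiprocessor supports 32 KB of shared memory and your compute shader needs 20 KB. Then only one thread group fits on the multiprocessor, since two groups would need 40 KB. This limits the multiprocessor's parallelism, because it cannot switch between thread groups to hide latency (recall from Section 1 that at least two thread groups per multiprocessor is recommended). Keeping shared memory usage small therefore helps performance.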
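The arithmetic behind this limit is a single integer division. The following C++ sketch assumes shared memory is the only limiting resource, which is a simplification; the helper name is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: how many thread groups can be resident on one
// multiprocessor if shared memory were the only constraint.
// Precondition: bytesPerGroup > 0.
constexpr std::size_t GroupsPerMultiprocessor(std::size_t bytesPerMultiprocessor,
                                              std::size_t bytesPerGroup)
{
    return bytesPerMultiprocessor / bytesPerGroup;
}

// Size in bytes of a groupshared float4 array with `count` elements
// (a float4 is four 4-byte floats).
constexpr std::size_t Float4ArrayBytes(std::size_t count)
{
    return count * 4 * sizeof(float);
}
```

With 32 KB per multiprocessor, a shader using 20 KB allows one resident group and a shader using 16 KB allows two; the float4 gCache[256] array from this section takes 4 KB.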

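A common use of shared memory is to store texture values. Some algorithms, such as blurs, fetch the same texel many times. Texture sampling is one of the slower GPU operations, because memory bandwidth and memory latency have not improved as much as raw compute power ([Möller08]). A thread group can avoid redundant texture fetches by preloading the texture samples it needs into a shared memory array, which can improve performance considerably.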

Suppose we implement this strategy with the following, incorrect, code:

Texture2D gInput;

RWTexture2D<float4> gOutput;

groupshared float4 gCache[256];

[numthreads(256, 1, 1)]

void CS(int3 groupThreadID : SV_GroupThreadID, int3 dispatchThreadID : SV_DispatchThreadID)

{

	// Each thread samples the texture and stores the

	// value in shared memory.

	gCache[groupThreadID.x] = gInput[dispatchThreadID.xy];

	// Do computation work: Access elements in shared memory

	// that other threads stored:

	// BAD!!! Left and right neighbor threads might not have

	// finished sampling the texture and storing it in shared memory.

	float4 left = gCache[groupThreadID.x - 1];

	float4 right = gCache[groupThreadID.x + 1];

	…

}

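The error arises because nothing guarantees that all threads in the group finish at the same time. A thread may therefore read a shared memory element that a neighboring thread has not yet initialized. To fix this, the compute shader must wait until every thread in the group has finished loading its texel into shared memory before it continues. This is done with a synchronization command: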

Texture2D gInput;

RWTexture2D<float4> gOutput;

groupshared float4 gCache[256];

[numthreads(256, 1, 1)]

void CS(int3 groupThreadID : SV_GroupThreadID, int3 dispatchThreadID : SV_DispatchThreadID)

{

	// Each thread samples the texture and stores the

	// value in shared memory.

	gCache[groupThreadID.x] = gInput[dispatchThreadID.xy];

	// Wait for all threads in group to finish.

	GroupMemoryBarrierWithGroupSync();

	// Safe now to read any element in the shared memory

	//and do computation work.

	float4 left = gCache[groupThreadID.x - 1];

	float4 right = gCache[groupThreadID.x + 1];

	…

}

7 Blur Demo

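This section explains how to implement a blur demo with the compute shader. We start with the mathematical theory of blurring, then introduce render-to-texture, which produces the source texture to blur, and finally go through the compute shader implementation of the blur.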

7.1 Blur Theory

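The blur algorithm used in this demo works as follows: for each pixel P at position ij, compute the weighted average of the m × n matrix of pixels centered on P. With m = 2a + 1 and n = 2b + 1, this is Blur(P_ij) = Σ_{r=-a..a} Σ_{c=-b..b} w_rc · P_{i+r, j+c}.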

The weights must sum to 1. If they sum to more than 1 the image gets brighter, and if they sum to less than 1 it gets darker.

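There are many ways to compute weights that sum to 1. The most common is the Gaussian blur, which takes its weights from the Gaussian function G(x) = exp(-x^2 / (2σ^2)) and then normalizes them.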

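The Gaussian blur is separable: it can be split into a horizontal 1D blur followed by a vertical 1D blur, and the result equals that of the full 2D blur.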

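For a 9 × 9 blur kernel, the 2D blur needs 81 samples per pixel. Separated into two 1D blurs, it needs only 9 + 9 = 18 samples. Since we are blurring textures and texture fetches are expensive, reducing the number of samples by separating the blur improves performance.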
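Normalized Gaussian weights and the sample counts above can be sketched in C++ as follows. This is an illustration under assumptions: the function GaussWeights and its explicit radius parameter are hypothetical and are not the demo's own weight routine, and the weights come from exp(-x^2 / (2σ^2)) evaluated at integer offsets.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: returns 2 * radius + 1 Gaussian weights for the
// offsets -radius..radius, normalized so that they sum to 1.
std::vector<float> GaussWeights(float sigma, int radius)
{
    std::vector<float> weights(2 * radius + 1);
    const float twoSigma2 = 2.0f * sigma * sigma;
    float sum = 0.0f;
    for (int i = -radius; i <= radius; ++i)
    {
        const float x = static_cast<float>(i);
        weights[i + radius] = std::exp(-x * x / twoSigma2);
        sum += weights[i + radius];
    }
    // Normalize so the blur neither brightens nor darkens the image.
    for (float& w : weights)
        w /= sum;
    return weights;
}

// Sum of all weights; used to confirm the normalization.
float Sum(const std::vector<float>& values)
{
    float total = 0.0f;
    for (float v : values)
        total += v;
    return total;
}

// Samples per pixel for a full m x n blur and for its separated form.
constexpr int Samples2D(int m, int n) { return m * n; }
constexpr int SamplesSeparable(int m, int n) { return m + n; }
```

GaussWeights(2.5f, 5) produces 11 symmetric weights that sum to 1 within floating-point tolerance, and Samples2D(9, 9) versus SamplesSeparable(9, 9) reproduces the 81 versus 18 comparison.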

7.2 Render-to-Texture

So far our programs have rendered only to the back buffer. But the back buffer is itself just a texture in the swap chain:

Microsoft::WRL::ComPtr<ID3D12Resource> mSwapChainBuffer[SwapChainBufferCount];

CD3DX12_CPU_DESCRIPTOR_HANDLE rtvHeapHandle(mRtvHeap->GetCPUDescriptorHandleForHeapStart());

for (UINT i = 0; i < SwapChainBufferCount; i++)

{

	ThrowIfFailed(mSwapChain->GetBuffer(i, IID_PPV_ARGS(&mSwapChainBuffer[i])));

	md3dDevice->CreateRenderTargetView(

		mSwapChainBuffer[i].Get(), nullptr,

		rtvHeapHandle);

	rtvHeapHandle.Offset(1, mRtvDescriptorSize);

}

We instruct D3D to render to the back buffer by binding the back buffer's RTV to the OM stage:

// Specify the buffers we are going to render to.

mCommandList->OMSetRenderTargets(1,

	&CurrentBackBufferView(),

	true, &DepthStencilView());

The contents of the back buffer are eventually displayed on the screen by the IDXGISwapChain::Present method.

A texture that will be used as a render target must be created with the D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET flag.

So we can substitute another texture for the back buffer and render into it instead; this technique is called render-to-off-screen-texture, or simply render-to-texture. It is mainly used for:

Shadow mapping;
Screen space ambient occlusion;
Dynamic reflections with cube maps.

Our blur demo is implemented in the following steps:

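Draw the scene as usual to a texture;
Blur the texture with the compute shader;
Map the blurred texture onto a full-screen quad and draw it to the back buffer.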

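Rendering to a texture works for this, but there is an alternative: provided the back buffer has the same format and size as our texture, we can render to the back buffer as usual and then use CopyResource to copy it into the texture: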

// Copy the input (back-buffer in this example) to BlurMap0.

cmdList->CopyResource(mBlurMap0.Get(), input);

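The steps above require us to draw with the rendering pipeline, switch to the compute shader for the compute work, and then switch back to the rendering pipeline. Such switches have overhead ([NVIDIA10]) and should be avoided where possible.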

7.3 Blur Implementation Overview

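We assume the blur is separable, that is, two 1D blurs. We need two textures, which we call A and B, each of which can be bound as an input through an SRV and as an output through a UAV. The blur algorithm is then: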

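Bind the SRV to A as the compute shader input;
Bind the UAV to B as the compute shader output;
Dispatch the horizontal blur; B now holds the horizontally blurred texture;
Bind the SRV to B as the compute shader input;
Bind the UAV to A as the compute shader output;
Dispatch the vertical blur; A now holds the final blurred result.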
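The ping-pong scheme in these steps can be mirrored on the CPU with two plain arrays. The following C++ sketch is a reference illustration only: the function names are hypothetical, images are single-channel row-major float arrays, and out-of-range taps are clamped to the border.

```cpp
#include <algorithm>
#include <vector>

// One 1D blur pass over a width x height image stored row by row.
// `horizontal` selects the blur direction; taps outside the image are
// clamped to the nearest border pixel.
void BlurPass(const std::vector<float>& src, std::vector<float>& dst,
              int width, int height,
              const std::vector<float>& weights, bool horizontal)
{
    const int radius = static_cast<int>(weights.size()) / 2;
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            float sum = 0.0f;
            for (int i = -radius; i <= radius; ++i)
            {
                const int sx = horizontal ? std::max(0, std::min(x + i, width - 1)) : x;
                const int sy = horizontal ? y : std::max(0, std::min(y + i, height - 1));
                sum += weights[i + radius] * src[sy * width + sx];
            }
            dst[y * width + x] = sum;
        }
    }
}

// Ping-pong between two buffers: A -> B (horizontal), then B -> A (vertical).
// After each iteration the blurred image is back in A.
void SeparableBlur(std::vector<float>& a, std::vector<float>& b,
                   int width, int height,
                   const std::vector<float>& weights, int blurCount)
{
    for (int i = 0; i < blurCount; ++i)
    {
        BlurPass(a, b, width, height, weights, true);
        BlurPass(b, a, width, height, weights, false);
    }
}

// Demo: blur a 3 x 3 image whose center pixel is 1 and return the center
// value after one horizontal and one vertical pass.
float ImpulseDemo()
{
    std::vector<float> a(9, 0.0f);
    std::vector<float> b(9, 0.0f);
    a[4] = 1.0f;
    SeparableBlur(a, b, 3, 3, {0.25f, 0.5f, 0.25f}, 1);
    return a[4];
}
```

With weights (0.25, 0.5, 0.25), the horizontal pass turns the center row into (0.25, 0.5, 0.25) and the vertical pass then leaves 0.5 · 0.5 = 0.25 at the center.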

Because the texture we render to has the same dimensions as the window, the blur textures must be recreated in the OnResize function:

void BlurApp::OnResize()

{

	D3DApp::OnResize();

	// The window resized, so update the aspect ratio and

	// recompute the projection matrix.

	XMMATRIX P = XMMatrixPerspectiveFovLH(

		0.25f*MathHelper::Pi, AspectRatio(),

		1.0f, 1000.0f);

	XMStoreFloat4x4(&mProj, P);

	if(mBlurFilter != nullptr)

	{

		mBlurFilter->OnResize(mClientWidth, mClientHeight);

	}

}

void BlurFilter::OnResize(UINT newWidth, UINT newHeight)

{

	if((mWidth != newWidth) || (mHeight != newHeight))

	{

		mWidth = newWidth;

		mHeight = newHeight;

		// Rebuild the off-screen texture resource with new dimensions.

		BuildResources();

		// New resources, so we need new descriptors to that resource.

		BuildDescriptors();

	}

}

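The mBlurFilter variable is an instance of the BlurFilter helper class we wrote. The class encapsulates textures A and B and their SRVs and UAVs, and provides a method that kicks off the blur on the compute shader.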

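The BlurFilter class encapsulates texture resources. To bind those resources to the pipeline for draw/dispatch commands, we need descriptors to them, which means allocating extra room in the D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV descriptor heap. BlurFilter::BuildDescriptors takes descriptor handles to the location in the heap where its descriptors start, and saves a handle for every descriptor it needs. The handles are saved so that the descriptors can be recreated when the resources are rebuilt after the window size changes: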

void BlurFilter::BuildDescriptors(

	CD3DX12_CPU_DESCRIPTOR_HANDLE hCpuDescriptor,

	CD3DX12_GPU_DESCRIPTOR_HANDLE hGpuDescriptor,

	UINT descriptorSize)

{

	// Save references to the descriptors.

	mBlur0CpuSrv = hCpuDescriptor;

	mBlur0CpuUav = hCpuDescriptor.Offset(1, descriptorSize);

	mBlur1CpuSrv = hCpuDescriptor.Offset(1, descriptorSize);

	mBlur1CpuUav = hCpuDescriptor.Offset(1, descriptorSize);

	mBlur0GpuSrv = hGpuDescriptor;

	mBlur0GpuUav = hGpuDescriptor.Offset(1, descriptorSize);

	mBlur1GpuSrv = hGpuDescriptor.Offset(1, descriptorSize);

	mBlur1GpuUav = hGpuDescriptor.Offset(1, descriptorSize);

	BuildDescriptors();

}

void BlurFilter::BuildDescriptors()

{

	D3D12_SHADER_RESOURCE_VIEW_DESC srvDesc = {};

	srvDesc.Shader4ComponentMapping = D3D12_DEFAULT_SHADER_4_COMPONENT_MAPPING;

	srvDesc.Format = mFormat;

	srvDesc.ViewDimension = D3D12_SRV_DIMENSION_TEXTURE2D;

	srvDesc.Texture2D.MostDetailedMip = 0;

	srvDesc.Texture2D.MipLevels = 1;

	D3D12_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};

	uavDesc.Format = mFormat;

	uavDesc.ViewDimension = D3D12_UAV_DIMENSION_TEXTURE2D;

	uavDesc.Texture2D.MipSlice = 0;

	md3dDevice->CreateShaderResourceView(mBlurMap0.Get(),

		&srvDesc, mBlur0CpuSrv);

	md3dDevice->CreateUnorderedAccessView(mBlurMap0.Get(),

		nullptr, &uavDesc, mBlur0CpuUav);

	md3dDevice->CreateShaderResourceView(mBlurMap1.Get(),

		&srvDesc, mBlur1CpuSrv);

	md3dDevice->CreateUnorderedAccessView(mBlurMap1.Get(),

		nullptr, &uavDesc, mBlur1CpuUav);

}

// In BlurApp.cpp…Offset to location in heap to

// store descriptors for BlurFilter

mBlurFilter->BuildDescriptors(

	CD3DX12_CPU_DESCRIPTOR_HANDLE(

	mCbvSrvUavDescriptorHeap->GetCPUDescriptorHandleForHeapStart(),

		3, mCbvSrvUavDescriptorSize),

	CD3DX12_GPU_DESCRIPTOR_HANDLE(

	mCbvSrvUavDescriptorHeap->GetGPUDescriptorHandleForHeapStart(),

		3, mCbvSrvUavDescriptorSize),

	mCbvSrvUavDescriptorSize);

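Blurring is an expensive operation, and its cost depends mainly on the size of the texture. When rendering to a texture, we can usually render to one that is smaller than the back buffer. This speeds up drawing into the texture, and because there are fewer pixels, it speeds up the blur as well. When the texture is finally drawn to the back buffer, the magnification filter adds a little extra blur.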

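Suppose our texture has width w and height h. As the next section shows, for the horizontal 1D blur a thread group is a horizontal line of 256 threads, so we need to dispatch ceil(w / 256) thread groups in the x direction and h in the y direction. If w is not a multiple of 256, the last thread group in each row has extraneous threads. Nothing can be done about that, because the thread group size is fixed; we handle the out-of-bounds cases with clamping checks in the shader. The vertical blur is handled analogously.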

The following code works out how many thread groups to dispatch and kicks off the actual blur on the compute shader:

void BlurFilter::Execute(ID3D12GraphicsCommandList* cmdList,

	ID3D12RootSignature* rootSig,

	ID3D12PipelineState* horzBlurPSO,

	ID3D12PipelineState* vertBlurPSO,

	ID3D12Resource* input,

	int blurCount)

{

	auto weights = CalcGaussWeights(2.5f);

	int blurRadius = (int)weights.size() / 2;

	cmdList->SetComputeRootSignature(rootSig);

	cmdList->SetComputeRoot32BitConstants(0, 1, &blurRadius, 0);

	cmdList->SetComputeRoot32BitConstants(0, (UINT)weights.size(), weights.data(), 1);

	cmdList->ResourceBarrier(1,

		&CD3DX12_RESOURCE_BARRIER::Transition(input,

		D3D12_RESOURCE_STATE_RENDER_TARGET,

		D3D12_RESOURCE_STATE_COPY_SOURCE));

	cmdList->ResourceBarrier(1,

		&CD3DX12_RESOURCE_BARRIER::Transition(mBlurMap0.Get(),

		D3D12_RESOURCE_STATE_COMMON,

		D3D12_RESOURCE_STATE_COPY_DEST));

	// Copy the input (back-buffer in this example) to BlurMap0.

	cmdList->CopyResource(mBlurMap0.Get(), input);

	cmdList->ResourceBarrier(1,

		&CD3DX12_RESOURCE_BARRIER::Transition(mBlurMap0.Get(),

		D3D12_RESOURCE_STATE_COPY_DEST,

		D3D12_RESOURCE_STATE_GENERIC_READ));

	cmdList->ResourceBarrier(1,

		&CD3DX12_RESOURCE_BARRIER::Transition(mBlurMap1.Get(),

		D3D12_RESOURCE_STATE_COMMON,

		D3D12_RESOURCE_STATE_UNORDERED_ACCESS));

	for(int i = 0; i < blurCount; ++i)

	{

		//

		// Horizontal Blur pass.

		//

		cmdList->SetPipelineState(horzBlurPSO);

		cmdList->SetComputeRootDescriptorTable(1, mBlur0GpuSrv);

		cmdList->SetComputeRootDescriptorTable(2, mBlur1GpuUav);

		// How many groups do we need to dispatch to cover a row of pixels, where

		// each group covers 256 pixels (the 256 is defined in the ComputeShader).

		UINT numGroupsX = (UINT)ceilf(mWidth / 256.0f);

		cmdList->Dispatch(numGroupsX, mHeight, 1);

		cmdList->ResourceBarrier(1,

			&CD3DX12_RESOURCE_BARRIER::Transition(

			mBlurMap0.Get(),

			D3D12_RESOURCE_STATE_GENERIC_READ,

			D3D12_RESOURCE_STATE_UNORDERED_ACCESS));

		cmdList->ResourceBarrier(1,

			&CD3DX12_RESOURCE_BARRIER::Transition(

			mBlurMap1.Get(),

			D3D12_RESOURCE_STATE_UNORDERED_ACCESS,

			D3D12_RESOURCE_STATE_GENERIC_READ));

		//

		// Vertical Blur pass.

		//

		cmdList->SetPipelineState(vertBlurPSO);

		cmdList->SetComputeRootDescriptorTable(1, mBlur1GpuSrv);

		cmdList->SetComputeRootDescriptorTable(2, mBlur0GpuUav);

		// How many groups do we need to dispatch to cover a column of pixels,

		// where each group covers 256 pixels (the 256 is defined in the

		// ComputeShader).

		UINT numGroupsY = (UINT)ceilf(mHeight / 256.0f);

		cmdList->Dispatch(mWidth, numGroupsY, 1);

		cmdList->ResourceBarrier(1,

			&CD3DX12_RESOURCE_BARRIER::Transition(

			mBlurMap0.Get(),

			D3D12_RESOURCE_STATE_UNORDERED_ACCESS,

			D3D12_RESOURCE_STATE_GENERIC_READ));

		cmdList->ResourceBarrier(1,

			&CD3DX12_RESOURCE_BARRIER::Transition(

			mBlurMap1.Get(),

			D3D12_RESOURCE_STATE_GENERIC_READ,

			D3D12_RESOURCE_STATE_UNORDERED_ACCESS));

	}

}

7.4 Compute Shader Program

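As described in the previous section, a thread group is a horizontal line of 256 threads, and each thread blurs one pixel. A naive approach implements the blur directly, with every thread fetching all the texels in its neighborhood. The problem is that the same texels are fetched many times, which wastes performance.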

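We can optimize this with shared memory. Each thread reads a texel and stores it in shared memory, and once all threads have finished reading, they perform the blur from shared memory. Because of the blur radius, a thread group of n = 256 threads needs n + 2R texels to do the blur, where R is the blur radius.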

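The solution is simple: we allocate n + 2R elements of shared memory and have 2R of the threads fetch two texel values each. The only tricky part is the extra bookkeeping when indexing shared memory: the i-th group thread ID no longer corresponds to the i-th element, because thread i stores its own texel at element i + R. The accompanying figure shows the shared memory layout for R = 4.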

The last issue to discuss is that the left-most and right-most thread groups can index outside the bounds of the input texture.

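An out-of-bounds read normally returns 0, but in this demo 0 means black, which would darken the image borders. Instead we want the nearest border value, similar to the clamp address mode. This is achieved by clamping the indices: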

// Clamp out of bound samples that occur at left image borders.

int x = max(dispatchThreadID.x - gBlurRadius, 0);

gCache[groupThreadID.x] = gInput[int2(x, dispatchThreadID.y)];

// Clamp out of bound samples that occur at right image borders.

int x = min(dispatchThreadID.x + gBlurRadius, gInput.Length.x-1);

gCache[groupThreadID.x+2*gBlurRadius] = gInput[int2(x, dispatchThreadID.y)];

// Clamp out of bound samples that occur at image borders.

gCache[groupThreadID.x+gBlurRadius] = gInput[min(dispatchThreadID.xy, gInput.Length.xy- 1)];
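The cache fill and its index bookkeeping can be emulated on the CPU for a single image row. In this C++ sketch the names FillGroupCache, BlurFromCache, DemoRow, and DemoWeights are hypothetical, the loop variable t plays the role of groupThreadID.x, and every index is clamped on both sides, which is slightly stricter than the shader fragments above.

```cpp
#include <algorithm>
#include <vector>

// Clamps a texel index to the valid range [0, width - 1].
int ClampIndex(int i, int width)
{
    return std::max(0, std::min(i, width - 1));
}

// Emulates one thread group of n threads filling its n + 2 * radius cache
// from one image row. groupId selects which block of n pixels it covers.
std::vector<float> FillGroupCache(const std::vector<float>& row,
                                  int groupId, int n, int radius)
{
    const int width = static_cast<int>(row.size());
    std::vector<float> cache(n + 2 * radius);
    for (int t = 0; t < n; ++t)               // t stands in for groupThreadID.x
    {
        const int x = groupId * n + t;        // stands in for dispatchThreadID.x
        if (t < radius)                       // first R threads also load the left edge
            cache[t] = row[ClampIndex(x - radius, width)];
        if (t >= n - radius)                  // last R threads also load the right edge
            cache[t + 2 * radius] = row[ClampIndex(x + radius, width)];
        cache[t + radius] = row[ClampIndex(x, width)];   // every thread loads its own pixel
    }
    return cache;
}

// Blurs the pixel owned by thread t using only the cache.
// weights must hold 2 * radius + 1 entries.
float BlurFromCache(const std::vector<float>& cache, int t, int radius,
                    const std::vector<float>& weights)
{
    float sum = 0.0f;
    for (int i = -radius; i <= radius; ++i)
        sum += weights[i + radius] * cache[t + radius + i];
    return sum;
}

// Demo data: a row of ten pixels with values 0..9 and five weights for radius 2.
std::vector<float> DemoRow()
{
    return {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f};
}

std::vector<float> DemoWeights()
{
    return {0.0f, 0.25f, 0.5f, 0.25f, 0.0f};
}
```

With n = 4 and R = 2, group 0 caches (0, 0, 0, 1, 2, 3, 4, 5): the two left entries repeat the border pixel. Group 2 runs past the ten-pixel row and caches (6, 7, 8, 9, 9, 9, 9, 9).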

The complete shader code is as follows:

//====================================================================

// Performs a separable Gaussian blur with a blur radius up to 5 pixels.

//====================================================================

cbuffer cbSettings : register(b0)

{

	// We cannot have an array entry in a constant buffer that gets mapped onto

	// root constants, so list each element.

	int gBlurRadius;

	// Support up to 11 blur weights.

	float w0;

	float w1;

	float w2;

	float w3;

	float w4;

	float w5;

	float w6;

	float w7;

	float w8;

	float w9;

	float w10;

};

static const int gMaxBlurRadius = 5;

Texture2D gInput : register(t0);

RWTexture2D<float4> gOutput : register(u0);

#define N 256

#define CacheSize (N + 2*gMaxBlurRadius)

groupshared float4 gCache[CacheSize];

[numthreads(N, 1, 1)]

void HorzBlurCS(int3 groupThreadID : SV_GroupThreadID,

	int3 dispatchThreadID : SV_DispatchThreadID)

{

	// Put in an array for easy indexing.

	float weights[11] = { w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10 };

	//

	// Fill local thread storage to reduce bandwidth. To blur

	// N pixels, we will need to load N + 2*BlurRadius pixels

	// due to the blur radius.

	//

	// This thread group runs N threads. To get the extra 2*BlurRadius

	// pixels, have 2*BlurRadius threads sample an extra pixel.

	if(groupThreadID.x < gBlurRadius)

	{

		// Clamp out of bound samples that occur at image borders.

		int x = max(dispatchThreadID.x - gBlurRadius, 0);

		gCache[groupThreadID.x] = gInput[int2(x, dispatchThreadID.y)];

	}

	if(groupThreadID.x >= N-gBlurRadius)

	{

		// Clamp out of bound samples that occur at image borders.

		int x = min(dispatchThreadID.x + gBlurRadius, gInput.Length.x-1);

		gCache[groupThreadID.x+2*gBlurRadius] = gInput[int2(x, dispatchThreadID.y)];

	}

	// Clamp out of bound samples that occur at image borders.

	gCache[groupThreadID.x+gBlurRadius] = gInput[min(dispatchThreadID.xy, gInput.Length.xy-1)];

	// Wait for all threads to finish.

	GroupMemoryBarrierWithGroupSync();

	//

	// Now blur each pixel.

	//

	float4 blurColor = float4(0, 0, 0, 0);

	for(int i = -gBlurRadius; i <= gBlurRadius; ++i)

	{

		int k = groupThreadID.x + gBlurRadius + i;

		blurColor +=

		weights[i+gBlurRadius]*gCache[k];

	}

	gOutput[dispatchThreadID.xy] = blurColor;

}

[numthreads(1, N, 1)]

void VertBlurCS(int3 groupThreadID : SV_GroupThreadID,

	int3 dispatchThreadID : SV_DispatchThreadID)

{

	// Put in an array for easy indexing.

	float weights[11] = { w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10 };

	//

	// Fill local thread storage to reduce bandwidth. To blur

	// N pixels, we will need to load N + 2*BlurRadius pixels

	// due to the blur radius.

	//

	// This thread group runs N threads. To get the extra 2*BlurRadius

	// pixels, have 2*BlurRadius threads sample an extra pixel.

	if(groupThreadID.y < gBlurRadius)

	{

		// Clamp out of bound samples that occur at image borders.

		int y = max(dispatchThreadID.y - gBlurRadius, 0);

		gCache[groupThreadID.y] = gInput[int2(dispatchThreadID.x, y)];

	}

	if(groupThreadID.y >= N-gBlurRadius)

	{

		// Clamp out of bound samples that occur at image borders.

		int y = min(dispatchThreadID.y + gBlurRadius, gInput.Length.y-1);

		gCache[groupThreadID.y+2*gBlurRadius] = gInput[int2(dispatchThreadID.x, y)];

	}

	// Clamp out of bound samples that occur at image borders.

	gCache[groupThreadID.y+gBlurRadius] = gInput[min(dispatchThreadID.xy, gInput.Length.xy-1)];

	// Wait for all threads to finish.

	GroupMemoryBarrierWithGroupSync();

	//

	// Now blur each pixel.

	//

	float4 blurColor = float4(0, 0, 0, 0);

	for(int i = -gBlurRadius; i <= gBlurRadius; ++i)

	{

		int k = groupThreadID.y + gBlurRadius + i;

		blurColor += weights[i+gBlurRadius]*gCache[k];

	}

	gOutput[dispatchThreadID.xy] = blurColor;

}

Regarding the last line:

gOutput[dispatchThreadID.xy] = blurColor;

dispatchThreadID.xy may be out of bounds here, but we do not need to worry about it, because an out-of-bounds write is a no-op.

8 Further Resources

Compute shader programming is a subject in its own right, and there are several books on using GPUs for compute programs:

Programming Massively Parallel Processors: A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu.
OpenCL Programming Guide by Aaftab Munshi, Benedict R. Gaster, Timothy G. Mattson, James Fung, and Dan Ginsburg.

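Technologies such as CUDA and OpenCL are simply different APIs for accessing the GPU to write compute programs. Best practices for CUDA and OpenCL are also best practices for DirectCompute, because the programs all run on the same hardware. This chapter has shown most of the DirectCompute syntax, so porting between DirectCompute and CUDA or OpenCL should not be much of a problem.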

Chuck Walbourn has posted a blog entry with links to many DirectCompute presentations:

http://blogs.msdn.com/b/chuckw/archive/2010/07/14/directcompute.aspx

Microsoft's Channel 9 also has a series of lecture videos on DirectCompute programming:

http://channel9.msdn.com/tags/DirectCompute-Lecture-Series/

Finally, NVIDIA offers extensive CUDA training material:

http://developer.nvidia.com/cuda-training

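In addition, the University of Illinois has a full course on CUDA programming, which we strongly recommend. After learning CUDA you will have a better understanding of how GPU hardware works, which helps you write better optimized code.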

9 Summary

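The ID3D12GraphicsCommandList::Dispatch method dispatches a grid of thread groups. Each thread group is a 3D grid of threads whose dimensions are given by [numthreads(x,y,z)]. For performance reasons, the total number of threads should be a multiple of the warp size (32 on NVIDIA hardware) or the wavefront size (64 on ATI hardware);
To ensure parallelism, each multiprocessor should be given at least two thread groups. Newer hardware may have more multiprocessors, so the number of thread groups should ideally scale with the multiprocessor count of the hardware;
Once a thread group is assigned to a multiprocessor, its threads are divided into warps of 32 threads each, and the multiprocessor executes each warp in SIMD fashion. If a warp stalls, for example on a texture fetch, the multiprocessor quickly switches to another waiting warp and executes its instructions, which keeps the multiprocessor busy. This is why the thread group size should be a multiple of the warp size; otherwise some threads in a partially filled warp have nothing to do;
Texture resources can be used as compute shader inputs through SRVs, and read-write textures (RWTexture2D) can be used as outputs through UAVs. Texture elements can be accessed by index or by sampling (texture coordinates plus a sampler state, using the SampleLevel function);
A structured buffer is an array of elements of the same type, and the type can be user-defined. A read-only structured buffer is declared as: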

StructuredBuffer<DataType> gInputA;

A read-write structured buffer is declared as:

RWStructuredBuffer<DataType> gOutput;

A read-only buffer is bound as an input through an SRV; a read-write buffer is bound through a UAV.

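Thread ID values are passed to the compute shader through system values; they are typically used to index into resources and shared memory;
Consume and append structured buffers are declared in HLSL as follows: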

ConsumeStructuredBuffer<DataType> gInput;

AppendStructuredBuffer<DataType> gOutput;

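They are useful when you do not care about the order in which elements are processed and written to the output, and they let you avoid index bookkeeping. An append buffer does not grow dynamically; it must be large enough to hold all the data appended to it.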

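Thread groups have shared memory, which is about as fast to access as a hardware cache; it can be used for optimizations or to implement certain algorithms. In a compute shader it is declared as: groupshared float4 gCache[N]; The array can be any size as long as it does not exceed 32 KB. For performance reasons it should stay at or below 16 KB, otherwise two thread groups cannot be assigned to the same multiprocessor;
Avoid switching between compute work and rendering where possible, because the switch has overhead. If possible, do all compute work first in each frame and then all rendering work.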

10 Exercises

I have no use for this chapter's material yet, so I am leaving the exercises for later.