CUDA 7 Streams Simplify Concurrency


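When you execute an asynchronous CUDA command without specifying a stream, the runtime uses the default stream. Before CUDA 7, the default stream was a special stream that implicitly synchronized with all other streams on the device.

CUDA 7 introduces a large number of powerful new features, including a new option to use an independent default stream for every host thread, which avoids the serialization caused by the legacy default stream. This post shows how this option simplifies achieving concurrency between kernels and data copies in CUDA programs.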

Asynchronous Commands in CUDA

As described by the CUDA C Programming Guide, asynchronous commands return control to the calling host thread before the device has finished the requested task (they are non-blocking). These commands are:

  • Kernel launches;
  • Memory copies between two addresses to the same device memory;
  • Memory copies from host to device of a memory block of 64 KB or less;
  • Memory copies performed by functions with the Async suffix;
  • Memory set function calls.

Specifying a stream for a kernel launch or host-device memory copy is optional; you can invoke CUDA commands without specifying a stream (or by setting the stream parameter to zero). The following two lines of code both launch a kernel on the default stream.


kernel<<< blocks, threads, bytes >>>();    // default stream

kernel<<< blocks, threads, bytes, 0 >>>(); // stream 0
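To make the non-blocking behavior concrete, here is a minimal sketch (not part of the original post) that launches a kernel on the default stream, lets the host continue, and then waits explicitly before reading the result. The kernel name `fill`, the helper `fill_on_default_stream`, and the launch dimensions are hypothetical and chosen only for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: writes `value` into every element of x (grid-stride loop).
__global__ void fill(float *x, int n, float value)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = value;
    }
}

// Hypothetical helper: returns true if all n elements come back equal to value.
bool fill_on_default_stream(int n, float value)
{
    float *d_x = NULL;
    if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) return false;

    // No stream argument: the launch goes to the default stream, and
    // control returns to the host before the kernel has finished.
    fill<<<4, 64>>>(d_x, n, value);

    // The host is free to do unrelated work here.

    // Wait for the default stream (stream 0) before using the result.
    cudaError_t err = cudaStreamSynchronize(0);

    std::vector<float> h_x(n);
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_x.data(), d_x, n * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_x);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; i++) {
        if (h_x[i] != value) return false;
    }
    return true;
}
```

Calling `fill_on_default_stream(1 << 10, 2.5f)` should return true on a machine with a working CUDA device. The explicit synchronization is what makes it safe to inspect the data afterwards.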

The Default Stream

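The default stream is useful where concurrency is not crucial to performance. Before CUDA 7, each device had a single default stream shared by all host threads, which causes implicit synchronization. As the "Implicit Synchronization" section of the CUDA C Programming Guide explains, two commands from different streams cannot run concurrently if the host thread issues any CUDA command to the default stream between them.

CUDA 7 introduces a new option, the per-thread default stream, which has two effects. First, it gives each host thread its own default stream. This means that commands issued to the default stream by different host threads can run concurrently. Second, these default streams are regular streams. This means that commands in the default stream can run concurrently with commands in non-default streams.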

To enable per-thread default streams in CUDA 7 and later, you can either compile with the nvcc command-line option --default-stream per-thread, or #define the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro before including CUDA headers (cuda.h or cuda_runtime.h). It is important to note: you cannot use #define CUDA_API_PER_THREAD_DEFAULT_STREAM to enable this behavior in a .cu file when the code is compiled by nvcc because nvcc implicitly includes cuda_runtime.h at the top of the translation unit.
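As a rough illustration of the macro route, the sketch below assumes a host-only source file (for example a .cpp file built by the host compiler and linked against the CUDA runtime), where nothing includes cuda_runtime.h before the macro is defined. The helper name `clear_on_thread_default_stream` is hypothetical and not from the original post.

```cuda
// Must appear before any CUDA header is included.
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>
#include <stddef.h>

// Hypothetical helper: zeroes a device buffer of `bytes` bytes.
// With the macro in effect, stream 0 below refers to the calling host
// thread's own default stream rather than the legacy default stream.
cudaError_t clear_on_thread_default_stream(void *d_buf, size_t bytes)
{
    cudaError_t err = cudaMemsetAsync(d_buf, 0, bytes, 0);
    if (err != cudaSuccess) return err;

    // Waits only for the stream the memset was issued to.
    return cudaStreamSynchronize(0);
}
```

If the same code is placed in a .cu file and built with nvcc, the macro comes too late to change the behavior, so the --default-stream per-thread option is the one to use there.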

A Multi-Stream Example

Let’s look at a trivial example. The following code simply launches eight copies of a simple kernel on eight streams. We launch only a single thread block for each grid so there are plenty of resources to run multiple of them concurrently. As an example of how the legacy default stream causes serialization, we add dummy kernel launches on the default stream that do no work. Here’s the code.


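const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // grid-stride loop: each thread handles every (blockDim.x * gridDim.x)-th element
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159,i));
    }
}

int main()
{
    const int num_streams = 8;

    cudaStream_t streams[num_streams];
    float *data[num_streams];

    for (int i = 0; i < num_streams; i++) {
        // each stream must be created before it is used in a launch
        cudaStreamCreate(&streams[i]);

        cudaMalloc(&data[i], N * sizeof(float));

        // launch one worker kernel per stream
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

        // launch a dummy kernel on the default stream
        kernel<<<1, 1>>>(0, 0);
    }

    // wait for outstanding work and flush profiling data before exiting
    cudaDeviceReset();

    return 0;
}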

First, compile with no options to get the legacy default stream behavior.

nvcc ./ -o stream_legacy

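You can run the program in the NVIDIA Visual Profiler (nvvp) to get a timeline showing all streams and kernel launches. Figure 1 shows the resulting kernel timeline on a MacBook Pro with an NVIDIA GeForce GT 750M (a Kepler GPU). You can see the very small bars for the dummy kernels on the default stream, and how they cause all of the other streams to serialize.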



Next, compile the same code with the per-thread default stream option.

nvcc --default-stream per-thread ./ -o stream_per-thread



A Multi-threading Example

Let’s look at another example, designed to demonstrate how the new default stream behavior makes it easier to achieve execution concurrency in multi-threaded applications. The following example creates eight POSIX threads, and each thread calls our kernel on the default stream and then synchronizes the default stream. (We need the synchronization in this example to make sure the profiler gets the kernel start and end timestamps before the program exits.)


#include <pthread.h>
#include <stdio.h>

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // grid-stride loop over the whole array
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159,i));
    }
}

void *launch_kernel(void *dummy)
{
    float *data;
    cudaMalloc(&data, N * sizeof(float));

    // launch on the default stream
    kernel<<<1, 64>>>(data, N);

    // synchronize only the default stream, not the whole device
    cudaStreamSynchronize(0);

    return NULL;
}

int main()
{
    const int num_threads = 8;

    pthread_t threads[num_threads];

    for (int i = 0; i < num_threads; i++) {
        if (pthread_create(&threads[i], NULL, launch_kernel, 0)) {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        if (pthread_join(threads[i], NULL)) {
            fprintf(stderr, "Error joining thread\n");
            return 2;
        }
    }

    // flush profiling data before exiting
    cudaDeviceReset();

    return 0;
}

As before, first compile with no options to use the legacy default stream.

nvcc ./ -o pthreads_legacy




Then compile with the per-thread default stream option.

nvcc --default-stream per-thread ./ -o pthreads_per_thread
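For readers who prefer the C++ standard library to POSIX threads, the same experiment can be sketched with std::thread. This variant is not from the original post: the names `set_ones`, `worker`, and `run_workers` are hypothetical, and it assumes a host compiler with C++11 support (on some systems the pthread library still has to be linked explicitly).

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Hypothetical kernel: writes 1.0f to every element of x (grid-stride loop).
__global__ void set_ones(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = 1.0f;
    }
}

// Each host thread allocates its own buffer, launches on the default
// stream, and synchronizes that stream. *ok is set to 1 on success.
void worker(int n, int *ok)
{
    float *data = NULL;
    bool good = cudaMalloc(&data, n * sizeof(float)) == cudaSuccess;
    if (good) {
        set_ones<<<1, 64>>>(data, n);
        good = cudaStreamSynchronize(0) == cudaSuccess;
        cudaFree(data);
    }
    *ok = good ? 1 : 0;
}

// Starts num_threads workers and returns how many of them succeeded.
int run_workers(int num_threads, int n)
{
    std::vector<int> ok(num_threads, 0);   // one slot per thread, no sharing
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; i++) {
        threads.emplace_back(worker, n, &ok[i]);
    }
    for (auto &t : threads) {
        t.join();
    }
    int count = 0;
    for (int v : ok) {
        count += v;
    }
    return count;
}
```

Built with --default-stream per-thread, each worker's launch goes to its own default stream. Built without the option, all workers share the legacy default stream. The function returns the same count either way; only the timeline in the profiler differs.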





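  • Remember: with per-thread default streams, the default stream in each thread behaves the same as a regular stream as far as synchronization and concurrency are concerned. This is not true of the legacy default stream.
  • The --default-stream option is applied per compilation unit, so make sure to apply it to every nvcc command line that needs it.
  • cudaDeviceSynchronize() continues to synchronize everything on the device, even with the new per-thread default stream option. If you want to synchronize only a single stream, use cudaStreamSynchronize(cudaStream_t stream), as in the second example.
  • Starting with CUDA 7, you can also explicitly access the per-thread default stream using the handle cudaStreamPerThread, and you can access the legacy default stream using the handle cudaStreamLegacy. Note that cudaStreamLegacy still synchronizes implicitly with the per-thread default streams if you happen to mix them in a program.
  • You can create non-blocking streams, which do not synchronize with the legacy default stream, by passing the cudaStreamNonBlocking flag to cudaStreamCreateWithFlags().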
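The last two tips can be sketched together. The helper below (hypothetical names `touch` and `launch_on_explicit_streams`, not from the original post) launches one kernel on the per-thread default stream through the cudaStreamPerThread handle and another on a stream created with the cudaStreamNonBlocking flag, then synchronizes each stream individually instead of the whole device.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: marks a flag so the host can confirm that it ran.
__global__ void touch(int *flag)
{
    *flag = 1;
}

// Hypothetical helper: returns true if both kernels ran to completion.
bool launch_on_explicit_streams()
{
    int *d_flags = NULL;
    if (cudaMalloc(&d_flags, 2 * sizeof(int)) != cudaSuccess) return false;

    // A non-blocking stream does not synchronize with the legacy default stream.
    cudaStream_t nb;
    if (cudaStreamCreateWithFlags(&nb, cudaStreamNonBlocking) != cudaSuccess) {
        cudaFree(d_flags);
        return false;
    }

    // Clear each flag in the same stream as the kernel that sets it, so
    // in-stream ordering guarantees the clear happens before the write.
    cudaMemsetAsync(d_flags, 0, sizeof(int), cudaStreamPerThread);
    touch<<<1, 1, 0, cudaStreamPerThread>>>(d_flags);

    cudaMemsetAsync(d_flags + 1, 0, sizeof(int), nb);
    touch<<<1, 1, 0, nb>>>(d_flags + 1);

    // Synchronize the two streams individually rather than the whole device.
    bool ok = cudaStreamSynchronize(cudaStreamPerThread) == cudaSuccess;
    ok = (cudaStreamSynchronize(nb) == cudaSuccess) && ok;

    int h_flags[2] = {0, 0};
    ok = (cudaMemcpy(h_flags, d_flags, sizeof(h_flags),
                     cudaMemcpyDeviceToHost) == cudaSuccess) && ok;

    cudaStreamDestroy(nb);
    cudaFree(d_flags);
    return ok && h_flags[0] == 1 && h_flags[1] == 1;
}
```

Because the cudaStreamPerThread handle is used explicitly, this sketch does not depend on the --default-stream compile option.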

