What is a stream

NVIDIA's definition is:
A sequence of operations that execute in issue-order on the GPU. In other words, a stream is a sequence of operations executed on the GPU, such as the following:

cudaMemcpy()
kernel launch
device sync
cudaMemcpy()

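Operations in different streams may be interleaved, and they may also execute concurrently.

Stream-related API (here, creating an event and recording it into stream 0):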

cudaEvent_t start;
cudaEventCreate(&start);
cudaEventRecord( start, 0 );
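
The three calls above only record a single event. A minimal sketch of how a pair of events could be used to time work in the default stream follows; the kernel scale and the helper timeScaleKernel are hypothetical names introduced only for this illustration, and error handling is kept deliberately thin.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to give the events something to time.
__global__ void scale(float *x, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= f;
}

// Hypothetical helper: returns the elapsed GPU time in milliseconds,
// or -1.0f if a CUDA call fails.
float timeScaleKernel(int n) {
    float *d_x = nullptr;
    if (cudaMalloc((void **)&d_x, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                      // enqueue "start" in stream 0
    scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);  // work issued to stream 0
    cudaEventRecord(stop, 0);                       // enqueue "stop" after the kernel
    cudaEventSynchronize(stop);                     // host waits until "stop" is reached

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return err == cudaSuccess ? ms : -1.0f;
}

int main() {
    float ms = timeScaleKernel(1 << 20);
    printf("kernel took %.3f ms\n", ms);
    return ms >= 0.0f ? 0 : 1;
}
```

Because events are ordered with the other operations in the stream they are recorded into, the measured interval covers only the work issued between the two records.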

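We can call the overall arrangement of streams across an application its pipeline. Optimizing a program from the stream point of view means optimizing that pipeline.

CUDA overlap

A CUDA GPU that supports device overlap can copy memory between device and host while a kernel is executing. The following code checks whether a device supports overlap: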

int dev_count;
cudaDeviceProp prop;

cudaGetDeviceCount( &dev_count);

for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&prop, i);
    // deviceOverlap is non-zero when copies can overlap kernel execution
    if (prop.deviceOverlap) ...
}

cudaMemcpyAsync()

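cudaMemcpy(), like memcpy(), executes synchronously: when the function returns, the copy has already completed. cudaMemcpyAsync(), by contrast, is asynchronous. It only places a request to perform a memory copy in a stream, and that stream is chosen through the stream parameter. When the call returns there is no guarantee that the copy has finished; what is guaranteed is that the copy completes before the next operation placed in the same stream begins. For the copy to be truly asynchronous, any host pointer passed to cudaMemcpyAsync() should refer to memory allocated with cudaHostAlloc(); that is, only page-locked memory can be copied asynchronously. With pageable host memory the call is still accepted, but the copy may not overlap with other work.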
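
The behaviour described above can be sketched with a round trip through a single stream. This is a minimal illustration, not code from the original post: the helper name roundTripAsync and the buffer names are assumptions made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: sends n floats host -> device -> host through one
// stream and reports whether they came back unchanged.
bool roundTripAsync(int n) {
    size_t bytes = (size_t)n * sizeof(float);
    float *h_src = nullptr, *h_dst = nullptr, *d_buf = nullptr;
    cudaStream_t stream = nullptr;
    bool ok = false;

    // cudaHostAlloc returns page-locked (pinned) host memory.
    if (cudaHostAlloc((void **)&h_src, bytes, cudaHostAllocDefault) == cudaSuccess &&
        cudaHostAlloc((void **)&h_dst, bytes, cudaHostAllocDefault) == cudaSuccess &&
        cudaMalloc((void **)&d_buf, bytes) == cudaSuccess &&
        cudaStreamCreate(&stream) == cudaSuccess) {
        for (int i = 0; i < n; i++) { h_src[i] = (float)i; h_dst[i] = -1.0f; }

        // Each call only enqueues a copy in `stream`.
        cudaMemcpyAsync(d_buf, h_src, bytes, cudaMemcpyHostToDevice, stream);
        // Same stream, so this copy starts only after the previous one is done.
        cudaMemcpyAsync(h_dst, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

        // h_dst may be read only after the stream has drained.
        ok = (cudaStreamSynchronize(stream) == cudaSuccess);
        for (int i = 0; ok && i < n; i++) ok = (h_dst[i] == h_src[i]);
    }

    if (stream) cudaStreamDestroy(stream);
    if (d_buf) cudaFree(d_buf);
    if (h_src) cudaFreeHost(h_src);
    if (h_dst) cudaFreeHost(h_dst);
    return ok;
}

int main() {
    bool ok = roundTripAsync(1000);
    printf("round trip %s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

The second copy needs no explicit wait on the first because both sit in the same stream; only the host read of h_dst needs cudaStreamSynchronize().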

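Vector stream add

To optimize this pipeline, the ideal case is one in which three tasks run at the same moment: launching a kernel, copying host to device, and copying device back to host. There are 2 streams, one for copies and one for executing kernels.

In practice, optimizing the pipeline is not that simple. Consider the following host code first: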

    for (int i=0; i<n; i+=SegSize*2) {
        // stream0: copy in, compute, copy out for the first segment
        cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float),..., stream0);
        cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float),..., stream0);
        vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0,...);
        cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float),..., stream0);
        // stream1: the same steps for the second segment
        cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float),...,
                        stream1);
        cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float) ,...,
                        stream1);
        vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, ...);
        // destination first: results go from d_C1 back to h_C
        cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float),...,
                        stream1);
    }

In the pipeline produced by this code, the kernel computation runs at the same time as the host-to-device copy of the next segment.

Now look at this version:

    for (int i=0; i<n; i+=SegSize*2) {
        // issue all input copies first ...
        cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float),..., stream0);
        cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float),..., stream0);
        cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float),...,
                        stream1);
        cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float),...,
                        stream1);
        // ... then both kernels ...
        vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0, ...);
        vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, ...);
        // ... then both output copies
        cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float),..., stream0);
        cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float),...,
                        stream1);
    }

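The difference from the previous version is that the copies of the A and B elements now run in parallel with the kernel. Picture the lower row of the pipeline shifted to the left: the pipeline as a whole becomes shorter.

Stream synchronization API

cudaStreamSynchronize(stream_id): waits until all tasks in the given stream have completed.

cudaDeviceSynchronize(): takes no arguments and waits until the tasks in all streams on the device have completed.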
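
The difference between the two calls can be sketched with two streams that each fill a buffer and copy it back. The kernel fill and the helper syncDemo are hypothetical names used only in this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: writes `value` into every element of p.
__global__ void fill(int *p, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = value;
}

// Hypothetical helper: checks each host buffer right after the
// synchronization call that makes it safe to read. Requires n >= 1.
bool syncDemo(int n) {
    size_t bytes = (size_t)n * sizeof(int);
    int *h_a = nullptr, *h_b = nullptr, *d_a = nullptr, *d_b = nullptr;
    cudaStream_t s0 = nullptr, s1 = nullptr;
    bool ok = false;

    if (cudaHostAlloc((void **)&h_a, bytes, cudaHostAllocDefault) == cudaSuccess &&
        cudaHostAlloc((void **)&h_b, bytes, cudaHostAllocDefault) == cudaSuccess &&
        cudaMalloc((void **)&d_a, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_b, bytes) == cudaSuccess &&
        cudaStreamCreate(&s0) == cudaSuccess &&
        cudaStreamCreate(&s1) == cudaSuccess) {
        int blocks = (n + 255) / 256;
        fill<<<blocks, 256, 0, s0>>>(d_a, n, 1);
        cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s0);
        fill<<<blocks, 256, 0, s1>>>(d_b, n, 2);
        cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s1);

        // Waits for s0 only: h_a is complete, h_b may still be in flight.
        ok = (cudaStreamSynchronize(s0) == cudaSuccess) && h_a[0] == 1 && h_a[n - 1] == 1;
        // Waits for every stream on the device: now h_b is complete too.
        ok = (cudaDeviceSynchronize() == cudaSuccess) && ok && h_b[0] == 2 && h_b[n - 1] == 2;
    }

    if (s0) cudaStreamDestroy(s0);
    if (s1) cudaStreamDestroy(s1);
    if (d_a) cudaFree(d_a);
    if (d_b) cudaFree(d_b);
    if (h_a) cudaFreeHost(h_a);
    if (h_b) cudaFreeHost(h_b);
    return ok;
}

int main() {
    bool ok = syncDemo(1000);
    printf("sync demo %s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Waiting on a single stream lets the host start consuming one result while other streams are still busy; cudaDeviceSynchronize() is the simpler choice when everything must be finished anyway.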

Vector-stream-add Code

First, a version that uses 2 streams:

#include    <wb.h>

#define wbCheck(stmt) do {                                                    \
        cudaError_t err = stmt;                                               \
        if (err != cudaSuccess) {                                             \
            wbLog(ERROR, "Failed to run stmt ", #stmt);                       \
            wbLog(ERROR, "Got CUDA error ...  ", cudaGetErrorString(err));    \
            return -1;                                                        \
        }                                                                     \
    } while(0)

#define SegSize 256
#define StreamNum 2

__global__ void vecAdd(float * in1, float * in2, float * out, int len) {
    // One thread per element; the guard covers a partially filled block.
    int gidx = blockIdx.x*blockDim.x + threadIdx.x;
    if(gidx< len)
    {
        out[gidx]= in1[gidx]+in2[gidx];
    }
}

int main(int argc, char ** argv) {
    wbArg_t args;
    int inputLength;
    float * hostInput1;
    float * hostInput2;
    float * hostOutput;
    float *h_A, *h_B, *h_C;   // page-locked copies of the host data (the transfers below still use the pageable buffers)
    float *d_A0, *d_B0, *d_C0;// device memory for stream 0
    float *d_A1, *d_B1, *d_C1;// device memory for stream 1

    args = wbArg_read(argc, argv);
    int Csize = SegSize*sizeof(float);

    wbTime_start(Generic, "Importing data and creating memory on host");
    hostInput1 = (float *) wbImport(wbArg_getInputFile(args, 0), &inputLength);
    hostInput2 = (float *) wbImport(wbArg_getInputFile(args, 1), &inputLength);
    hostOutput = (float *) malloc(inputLength * sizeof(float));
    printf("inputLength ==%d, SegSize =%d\n", inputLength, SegSize);
    wbTime_stop(Generic, "Importing data and creating memory on host");

    cudaHostAlloc((void**)&h_A, inputLength*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_B, inputLength*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_C, inputLength*sizeof(float), cudaHostAllocDefault);
    memcpy(h_A, hostInput1,inputLength*sizeof(float));
    memcpy(h_B, hostInput2,inputLength*sizeof(float));

    wbCheck(cudaMalloc((void **)&d_A0, Csize));
    wbCheck(cudaMalloc((void **)&d_A1, Csize));
    wbCheck(cudaMalloc((void **)&d_B0, Csize));
    wbCheck(cudaMalloc((void **)&d_B1, Csize));
    wbCheck(cudaMalloc((void **)&d_C0, Csize));
    wbCheck(cudaMalloc((void **)&d_C1, Csize));

    cudaStream_t *streams = (cudaStream_t*) malloc(StreamNum * sizeof(cudaStream_t));
    for(int i = 0; i < StreamNum; i++)
        cudaStreamCreate(&(streams[i]));

    int rounds = inputLength/(SegSize*StreamNum);  // full rounds of StreamNum segments
    int left = inputLength%(SegSize*StreamNum);    // elements left over for the tail
    printf("main =%d, left=%d\n", rounds, left);

    int i = 0; // keep the increasing length
    // Full rounds only; looping while i < inputLength would also sweep over the tail.
    for(; i < rounds*SegSize*StreamNum; i+=SegSize*StreamNum)
    {
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        // block size is 256
        vecAdd<<<SegSize/256, SegSize, 0, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        vecAdd<<<SegSize/256, SegSize, 0, streams[1]>>>(d_A1, d_B1, d_C1, SegSize);
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, Csize, cudaMemcpyDeviceToHost, streams[1]);
    }

    // Process the remaining elements
    if(SegSize < left)
    {
        printf("tail: one full segment plus %d elements\n", left-SegSize);
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, (left-SegSize)*sizeof(float), cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, (left-SegSize)*sizeof(float), cudaMemcpyHostToDevice, streams[1]);
        // block size is 256
        vecAdd<<<1, SegSize, 0, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        // stream 1 must use its own buffers (d_A1, d_B1, d_C1), not those of stream 0
        vecAdd<<<1, left-SegSize, 0, streams[1]>>>(d_A1, d_B1, d_C1, left-SegSize);
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize,cudaMemcpyDeviceToHost, streams[0]);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, (left-SegSize)*sizeof(float),cudaMemcpyDeviceToHost, streams[1]);
    }
    else if(left > 0)
    {
        printf("tail: %d elements\n", left);
        // keep the copies in the same stream as the kernel that uses them
        cudaMemcpyAsync(d_A0, hostInput1+i, left*sizeof(float), cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, left*sizeof(float), cudaMemcpyHostToDevice, streams[0]);
        vecAdd<<<1, left, 0, streams[0]>>>(d_A0, d_B0, d_C0, left);
        cudaMemcpyAsync(hostOutput+i, d_C0, left*sizeof(float), cudaMemcpyDeviceToHost, streams[0]);
    }

    cudaDeviceSynchronize();

    wbSolution(args, hostOutput, inputLength);

    free(hostInput1);
    free(hostInput2);
    free(hostOutput);

    for(int i = 0; i < StreamNum; i++)
        cudaStreamDestroy(streams[i]);
    free(streams);

    cudaFree(d_A0);
    cudaFree(d_A1);
    cudaFree(d_B0);
    cudaFree(d_B1);
    cudaFree(d_C0);
    cudaFree(d_C1);
    cudaFreeHost(h_A);
    cudaFreeHost(h_B);
    cudaFreeHost(h_C);

    return 0;
}

Next, a version that uses 4 streams:

#include    <wb.h>

#define wbCheck(stmt) do {                                                    \
        cudaError_t err = stmt;                                               \
        if (err != cudaSuccess) {                                             \
            wbLog(ERROR, "Failed to run stmt ", #stmt);                       \
            wbLog(ERROR, "Got CUDA error ...  ", cudaGetErrorString(err));    \
            return -1;                                                        \
        }                                                                     \
    } while(0)

#define SegSize 256
#define StreamNum 4

__global__ void vecAdd(float * in1, float * in2, float * out, int len) {
    // One thread per element; the guard covers a partially filled block.
    int gidx = blockIdx.x*blockDim.x + threadIdx.x;
    if(gidx< len)
    {
        out[gidx]= in1[gidx]+in2[gidx];
    }
}

int main(int argc, char ** argv) {
    wbArg_t args;
    int inputLength, i;
    float * hostInput1;
    float * hostInput2;
    float * hostOutput;
    float *h_A, *h_B, *h_C;   // page-locked copies of the host data (the transfers below still use the pageable buffers)
    float *d_A0, *d_B0, *d_C0;// device memory for stream 0
    float *d_A1, *d_B1, *d_C1;// device memory for stream 1
    float *d_A2, *d_B2, *d_C2;// device memory for stream 2
    float *d_A3, *d_B3, *d_C3;// device memory for stream 3

    args = wbArg_read(argc, argv);
    int Csize = SegSize*sizeof(float);

    wbTime_start(Generic, "Importing data and creating memory on host");
    hostInput1 = (float *) wbImport(wbArg_getInputFile(args, 0), &inputLength);
    hostInput2 = (float *) wbImport(wbArg_getInputFile(args, 1), &inputLength);
    hostOutput = (float *) malloc(inputLength * sizeof(float));
    printf("inputLength ==%d, SegSize =%d\n", inputLength, SegSize);
    wbTime_stop(Generic, "Importing data and creating memory on host");

    cudaHostAlloc((void**)&h_A, inputLength*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_B, inputLength*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_C, inputLength*sizeof(float), cudaHostAllocDefault);
    memcpy(h_A, hostInput1,inputLength*sizeof(float));
    memcpy(h_B, hostInput2,inputLength*sizeof(float));

    wbCheck(cudaMalloc((void **)&d_A0, Csize));
    wbCheck(cudaMalloc((void **)&d_A1, Csize));
    wbCheck(cudaMalloc((void **)&d_B0, Csize));
    wbCheck(cudaMalloc((void **)&d_B1, Csize));
    wbCheck(cudaMalloc((void **)&d_C0, Csize));
    wbCheck(cudaMalloc((void **)&d_C1, Csize));
    wbCheck(cudaMalloc((void **)&d_A2, Csize));
    wbCheck(cudaMalloc((void **)&d_A3, Csize));
    wbCheck(cudaMalloc((void **)&d_B2, Csize));
    wbCheck(cudaMalloc((void **)&d_B3, Csize));
    wbCheck(cudaMalloc((void **)&d_C2, Csize));
    wbCheck(cudaMalloc((void **)&d_C3, Csize));

    cudaStream_t *streams = (cudaStream_t*) malloc(StreamNum * sizeof(cudaStream_t));
    for(int i = 0; i < StreamNum; i++)
        cudaStreamCreate(&(streams[i]));

    int rounds = inputLength/(SegSize*StreamNum);  // full rounds of StreamNum segments
    int left = inputLength%(SegSize*StreamNum);    // elements left over for the tail
    printf("main =%d, left=%d\n", rounds, left);

    // Full rounds only; looping while i < inputLength would also sweep over the tail.
    for(i=0; i < rounds*SegSize*StreamNum; i+=SegSize*StreamNum)
    {
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_A2, hostInput1+i+SegSize*2, Csize, cudaMemcpyHostToDevice, streams[2]);
        cudaMemcpyAsync(d_B2, hostInput2+i+SegSize*2, Csize, cudaMemcpyHostToDevice, streams[2]);
        cudaMemcpyAsync(d_A3, hostInput1+i+SegSize*3, Csize, cudaMemcpyHostToDevice, streams[3]);
        cudaMemcpyAsync(d_B3, hostInput2+i+SegSize*3, Csize, cudaMemcpyHostToDevice, streams[3]);
        // block size is 256
        vecAdd<<<SegSize/256, SegSize, 0, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        vecAdd<<<SegSize/256, SegSize, 0, streams[1]>>>(d_A1, d_B1, d_C1, SegSize);
        vecAdd<<<SegSize/256, SegSize, 0, streams[2]>>>(d_A2, d_B2, d_C2, SegSize);
        vecAdd<<<SegSize/256, SegSize, 0, streams[3]>>>(d_A3, d_B3, d_C3, SegSize);
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, Csize, cudaMemcpyDeviceToHost, streams[1]);
        cudaMemcpyAsync(hostOutput+i+SegSize*2, d_C2, Csize, cudaMemcpyDeviceToHost, streams[2]);
        cudaMemcpyAsync(hostOutput+i+SegSize*3, d_C3, Csize, cudaMemcpyDeviceToHost, streams[3]);
    }

    // Process the remaining elements
    if(SegSize*3 < left){
        printf("tail: three full segments plus %d elements\n", left-SegSize*3);
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_A2, hostInput1+i+SegSize*2, Csize, cudaMemcpyHostToDevice, streams[2]);
        cudaMemcpyAsync(d_B2, hostInput2+i+SegSize*2, Csize, cudaMemcpyHostToDevice, streams[2]);
        cudaMemcpyAsync(d_A3, hostInput1+i+SegSize*3, (left-SegSize*3)*sizeof(float), cudaMemcpyHostToDevice, streams[3]);
        cudaMemcpyAsync(d_B3, hostInput2+i+SegSize*3, (left-SegSize*3)*sizeof(float), cudaMemcpyHostToDevice, streams[3]);
        // block size is 256
        vecAdd<<<1, SegSize, 0, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        vecAdd<<<1, SegSize, 0, streams[1]>>>(d_A1, d_B1, d_C1, SegSize);
        vecAdd<<<1, SegSize, 0, streams[2]>>>(d_A2, d_B2, d_C2, SegSize);
        vecAdd<<<1, (left-SegSize*3), 0, streams[3]>>>(d_A3, d_B3, d_C3, (left-SegSize*3));
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, Csize, cudaMemcpyDeviceToHost, streams[1]);
        cudaMemcpyAsync(hostOutput+i+SegSize*2, d_C2, Csize, cudaMemcpyDeviceToHost, streams[2]);
        cudaMemcpyAsync(hostOutput+i+SegSize*3, d_C3, (left-SegSize*3)*sizeof(float), cudaMemcpyDeviceToHost, streams[3]);
    }
    else if(SegSize*2 < left){
        printf("tail: two full segments plus %d elements\n", left-SegSize*2);
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_A2, hostInput1+i+SegSize*2, (left-SegSize*2)*sizeof(float), cudaMemcpyHostToDevice, streams[2]);
        cudaMemcpyAsync(d_B2, hostInput2+i+SegSize*2, (left-SegSize*2)*sizeof(float), cudaMemcpyHostToDevice, streams[2]);
        // block size is 256
        vecAdd<<<1, SegSize, 0, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        vecAdd<<<1, SegSize, 0, streams[1]>>>(d_A1, d_B1, d_C1, SegSize);
        vecAdd<<<1, left-SegSize*2, 0, streams[2]>>>(d_A2, d_B2, d_C2, (left-SegSize*2));
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, Csize, cudaMemcpyDeviceToHost, streams[1]);
        cudaMemcpyAsync(hostOutput+i+SegSize*2, d_C2, (left-SegSize*2)*sizeof(float), cudaMemcpyDeviceToHost, streams[2]);
    }
    else if(SegSize < left)
    {
        printf("tail: one full segment plus %d elements\n", left-SegSize);
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, (left-SegSize)*sizeof(float), cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, (left-SegSize)*sizeof(float), cudaMemcpyHostToDevice, streams[1]);
        // block size is 256
        vecAdd<<<1, SegSize, 0, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        // stream 1 must use its own buffers (d_A1, d_B1, d_C1), not those of stream 0
        vecAdd<<<1, left-SegSize, 0, streams[1]>>>(d_A1, d_B1, d_C1, left-SegSize);
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize,cudaMemcpyDeviceToHost, streams[0]);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, (left-SegSize)*sizeof(float),cudaMemcpyDeviceToHost, streams[1]);
    }
    else if(left > 0)
    {
        printf("tail: %d elements\n", left);
        // keep the copies in the same stream as the kernel that uses them
        cudaMemcpyAsync(d_A0, hostInput1+i, left*sizeof(float), cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, left*sizeof(float), cudaMemcpyHostToDevice, streams[0]);
        vecAdd<<<1, left, 0, streams[0]>>>(d_A0, d_B0, d_C0, left);
        cudaMemcpyAsync(hostOutput+i, d_C0, left*sizeof(float), cudaMemcpyDeviceToHost, streams[0]);
    }

    cudaDeviceSynchronize();

    wbSolution(args, hostOutput, inputLength);

    free(hostInput1);
    free(hostInput2);
    free(hostOutput);

    for(int i = 0; i < StreamNum; i++)
        cudaStreamDestroy(streams[i]);
    free(streams);

    cudaFree(d_A0);
    cudaFree(d_A1);
    cudaFree(d_B0);
    cudaFree(d_B1);
    cudaFree(d_C0);
    cudaFree(d_C1);
    cudaFree(d_A2);
    cudaFree(d_A3);
    cudaFree(d_B2);
    cudaFree(d_B3);
    cudaFree(d_C2);
    cudaFree(d_C3);
    cudaFreeHost(h_A);
    cudaFreeHost(h_B);
    cudaFreeHost(h_C);

    return 0;
}

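This runs successfully, but one question remains. When I change the memory-copy code to:

cudaMemcpyAsync(d_A0, h_A+i, Csize, cudaMemcpyHostToDevice, streams[0]);

that is, when I copy from page-locked (pinned) memory, the results are wrong, and I do not understand why.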
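
For comparison, here is a minimal sketch of a pinned-memory variant. It does not diagnose the failure described above. It only shows one arrangement that should be free of cross-stream hazards: every stream owns a private set of device buffers, each segment is issued entirely into one stream, and the segment length is clamped so the tail needs no separate branch. The names SEG_SIZE, NUM_STREAMS and streamedVecAdd are assumptions made for this example, and the inputs are generated rather than read through wb.h.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SEG_SIZE 256     // elements per segment, matching SegSize above
#define NUM_STREAMS 4    // matching StreamNum in the second listing

__global__ void vecAdd(const float *in1, const float *in2, float *out, int len) {
    int gidx = blockIdx.x * blockDim.x + threadIdx.x;
    if (gidx < len) out[gidx] = in1[gidx] + in2[gidx];
}

// Hypothetical helper: adds two generated vectors of length n segment by
// segment and returns true when every element matches the CPU result.
bool streamedVecAdd(int n) {
    size_t bytes = (size_t)n * sizeof(float);
    float *h_A = nullptr, *h_B = nullptr, *h_C = nullptr;
    float *d_A[NUM_STREAMS] = {}, *d_B[NUM_STREAMS] = {}, *d_C[NUM_STREAMS] = {};
    cudaStream_t streams[NUM_STREAMS] = {};

    // Pinned host buffers plus one private set of device buffers per stream.
    cudaError_t err = cudaHostAlloc((void **)&h_A, bytes, cudaHostAllocDefault);
    if (err == cudaSuccess) err = cudaHostAlloc((void **)&h_B, bytes, cudaHostAllocDefault);
    if (err == cudaSuccess) err = cudaHostAlloc((void **)&h_C, bytes, cudaHostAllocDefault);
    for (int s = 0; s < NUM_STREAMS && err == cudaSuccess; s++) {
        err = cudaStreamCreate(&streams[s]);
        if (err == cudaSuccess) err = cudaMalloc((void **)&d_A[s], SEG_SIZE * sizeof(float));
        if (err == cudaSuccess) err = cudaMalloc((void **)&d_B[s], SEG_SIZE * sizeof(float));
        if (err == cudaSuccess) err = cudaMalloc((void **)&d_C[s], SEG_SIZE * sizeof(float));
    }

    if (err == cudaSuccess) {
        for (int i = 0; i < n; i++) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

        int seg = 0;
        for (int off = 0; off < n; off += SEG_SIZE, seg++) {
            int s = seg % NUM_STREAMS;                              // round-robin over streams
            int len = (n - off < SEG_SIZE) ? (n - off) : SEG_SIZE;  // last segment may be short
            size_t segBytes = (size_t)len * sizeof(float);
            cudaMemcpyAsync(d_A[s], h_A + off, segBytes, cudaMemcpyHostToDevice, streams[s]);
            cudaMemcpyAsync(d_B[s], h_B + off, segBytes, cudaMemcpyHostToDevice, streams[s]);
            vecAdd<<<(len + 255) / 256, 256, 0, streams[s]>>>(d_A[s], d_B[s], d_C[s], len);
            cudaMemcpyAsync(h_C + off, d_C[s], segBytes, cudaMemcpyDeviceToHost, streams[s]);
        }
        err = cudaDeviceSynchronize();  // h_C is valid only after all streams finish
    }

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; i++) ok = (h_C[i] == h_A[i] + h_B[i]);

    for (int s = 0; s < NUM_STREAMS; s++) {
        if (streams[s]) cudaStreamDestroy(streams[s]);
        if (d_A[s]) cudaFree(d_A[s]);
        if (d_B[s]) cudaFree(d_B[s]);
        if (d_C[s]) cudaFree(d_C[s]);
    }
    if (h_A) cudaFreeHost(h_A);
    if (h_B) cudaFreeHost(h_B);
    if (h_C) cudaFreeHost(h_C);
    return ok;
}

int main() {
    bool ok = streamedVecAdd(1000);  // 1000 is not a multiple of SEG_SIZE
    printf("streamed vecAdd %s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Reusing d_A[s], d_B[s] and d_C[s] for a later segment is safe here because the next copy into them is queued in the same stream, behind the kernel and the output copy that still need the old contents.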


6.2 CUDA streams