What is a stream

NVIDIA's explanation is:
A sequence of operations that execute in issue-order on the GPU. In other words, a stream can be understood as a sequence of operations executed on the GPU, for example the following actions.

cudaMemcpy()
kernel launch
device sync
cudaMemcpy()

Operations from different streams may be interleaved, or may execute at the same time.

Stream-related API (here an event is recorded into stream 0):

cudaEvent_t start;
cudaEventCreate(&start);
cudaEventRecord( start, 0 );
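
For reference, here is a minimal sketch of the stream-management API itself; the sizes, buffers and the kernel myKernel are made up for illustration and are not from the original notes:

const int N = 1 << 20;                     // example size (assumption)
size_t bytes = N * sizeof(float);
float *h_in, *h_out, *d_in, *d_out;
cudaHostAlloc((void**)&h_in,  bytes, cudaHostAllocDefault);   // pinned host buffers
cudaHostAlloc((void**)&h_out, bytes, cudaHostAllocDefault);
cudaMalloc((void**)&d_in,  bytes);
cudaMalloc((void**)&d_out, bytes);

cudaStream_t stream;
cudaStreamCreate(&stream);                 // create a stream

// operations issued into the same stream execute in issue-order
cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
myKernel<<<N/256, 256, 0, stream>>>(d_in, d_out, N);   // hypothetical kernel
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);             // wait for everything queued in this stream
cudaStreamDestroy(stream);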

We can call the overall stream behaviour of an application its pipeline. Optimizing a program from the stream point of view means optimizing this pipeline.

CUDA overlap

A CUDA GPU that supports device overlap can execute a kernel while simultaneously performing a memory copy between host and device. You can check whether a device supports overlap with the following code:

int dev_count;
cudaDeviceProp prop;
cudaGetDeviceCount(&dev_count);
for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&prop, i);
    if (prop.deviceOverlap) ...
}

cudaMemcpyAsync()

cudaMemcpy() executes synchronously: when the function returns, the copy has already completed. cudaMemcpyAsync() is the asynchronous version: it only enqueues a request for a memory copy to be performed in the stream given by its stream argument. When the function returns, there is no guarantee that the copy has finished; what is guaranteed is that the copy will complete before the next operation placed into the same stream starts. Any host pointer passed to cudaMemcpyAsync() must have been allocated with cudaHostAlloc(); in other words, asynchronous copies are only allowed on page-locked (pinned) host memory.
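
A minimal sketch of that constraint (the size N and the buffer names are assumptions, not from the original code):

const int N = 1 << 20;
size_t bytes = N * sizeof(float);
float *h_buf, *d_buf;

cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);  // page-locked host memory
cudaMalloc((void**)&d_buf, bytes);

cudaStream_t s;
cudaStreamCreate(&s);

// returns immediately; the copy is only guaranteed to be finished
// before the next operation enqueued in stream s begins
cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, s);

cudaStreamSynchronize(s);   // block the host until the copy is done
cudaStreamDestroy(s);
cudaFreeHost(h_buf);        // pinned memory is released with cudaFreeHost
cudaFree(d_buf);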

Vector stream add

When optimizing this pipeline, the ideal pipeline looks like this:

At the same moment in time, the kernel launch, the host-to-device copy, and the device-to-host copy are all executing. There are two streams: one for the copies and one for running the kernel.

In practice, optimizing the pipeline is not that simple or easy. First look at the following piece of host code:

for (int i = 0; i < n; i += SegSize*2) {
    cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float), ..., stream0);
    cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float), ..., stream0);
    vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0, ...);
    cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float), ..., stream0);
    cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float), ..., stream1);
    cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float), ..., stream1);
    vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, ...);
    cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float), ..., stream1);
}

In this code's pipeline, the kernel computation for one chunk and the host-to-device copy of the next chunk proceed at the same time.

Now look at this code:

for (int i = 0; i < n; i += SegSize*2) {
    cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float), ..., stream0);
    cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float), ..., stream0);
    cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float), ..., stream1);
    cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float), ..., stream1);
    vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0, ...);
    vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, ...);
    cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float), ..., stream0);
    cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float), ..., stream1);
}

The difference from the previous version is that the copies of A and B for one stream now run in parallel with the other stream's kernel. You can picture each row of the pipeline diagram shifting one step to the left, so the overall pipeline becomes shorter.

Stream synchronization API

cudaStreamSynchronize(stream_id): waits until all tasks in the given stream have completed.

cudaDeviceSynchronize(): takes no stream argument and waits until the tasks in all streams on the device have completed.
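
A small sketch of the difference between the two calls (the stream handles stream0/stream1 are hypothetical):

// wait only for the work previously issued into stream0;
// stream1 may still have operations in flight afterwards
cudaStreamSynchronize(stream0);

// wait for all outstanding work on the device, in every stream
cudaDeviceSynchronize();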

Vector-stream-add Code

First, the version using 2 streams:

#include <wb.h>

#define wbCheck(stmt) do {                                                  \
        cudaError_t err = stmt;                                             \
        if (err != cudaSuccess) {                                           \
            wbLog(ERROR, "Failed to run stmt ", #stmt);                     \
            wbLog(ERROR, "Got CUDA error ... ", cudaGetErrorString(err));   \
            return -1;                                                      \
        }                                                                   \
    } while(0)

#define SegSize 256
#define StreamNum 2

__global__ void vecAdd(float * in1, float * in2, float * out, int len) {
    //@@ Insert code to implement vector addition here
    int gidx = blockIdx.x*blockDim.x + threadIdx.x;
    if (gidx < len)
    {
        out[gidx] = in1[gidx] + in2[gidx];
    }
}

int main(int argc, char ** argv) {
    wbArg_t args;
    int inputLength;
    float * hostInput1;
    float * hostInput2;
    float * hostOutput;
    // float * deviceInput1;
    // float * deviceInput2;
    // float * deviceOutput;
    float *h_A, *h_B, *h_C;
    //cudaStream_t stream0, stream1;
    //cudaStreamCreate(&stream0);
    //cudaStreamCreate(&stream1);
    float *d_A0, *d_B0, *d_C0; // device memory for stream 0
    float *d_A1, *d_B1, *d_C1; // device memory for stream 1

    args = wbArg_read(argc, argv);
    int Csize = SegSize*sizeof(float);

    wbTime_start(Generic, "Importing data and creating memory on host");
    hostInput1 = (float *) wbImport(wbArg_getInputFile(args, 0), &inputLength);
    hostInput2 = (float *) wbImport(wbArg_getInputFile(args, 1), &inputLength);
    hostOutput = (float *) malloc(inputLength * sizeof(float));
    printf("inputLength ==%d, SegSize =%d\n", inputLength, SegSize);
    wbTime_stop(Generic, "Importing data and creating memory on host");

    cudaHostAlloc((void**)&h_A, inputLength*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_B, inputLength*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_C, inputLength*sizeof(float), cudaHostAllocDefault);
    memcpy(h_A, hostInput1, inputLength*sizeof(float));
    memcpy(h_B, hostInput2, inputLength*sizeof(float));

    wbCheck(cudaMalloc((void **)&d_A0, Csize));
    wbCheck(cudaMalloc((void **)&d_A1, Csize));
    wbCheck(cudaMalloc((void **)&d_B0, Csize));
    wbCheck(cudaMalloc((void **)&d_B1, Csize));
    wbCheck(cudaMalloc((void **)&d_C0, Csize));
    wbCheck(cudaMalloc((void **)&d_C1, Csize));

    cudaStream_t *streams = (cudaStream_t*) malloc(StreamNum * sizeof(cudaStream_t));
    for (int i = 0; i < StreamNum; i++)
        cudaStreamCreate(&(streams[i]));

    int main = inputLength/(SegSize*StreamNum);   // number of full SegSize*StreamNum chunks
    int left = inputLength%(SegSize*StreamNum);   // remaining elements
    printf("main =%d, left=%d\n", main, left);

    int i = 0; // keeps the increasing offset into the input
    // process only the full SegSize*StreamNum chunks; the tail is handled below
    for (; i + SegSize*StreamNum <= inputLength; i += SegSize*StreamNum)
    {
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        // block size is 256
        vecAdd<<<SegSize/256, SegSize, 0, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        vecAdd<<<SegSize/256, SegSize, 0, streams[1]>>>(d_A1, d_B1, d_C1, SegSize);
        // cudaStreamSynchronize(yiming_stream0);
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
        //cudaStreamSynchronize(yiming_stream1);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, Csize, cudaMemcpyDeviceToHost, streams[1]);
    }

    // Process the remaining elements
    if (SegSize < left)
    {
        printf("AAAAAAA, left - SegSize ==%d\n", left-SegSize);
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, (left-SegSize)*sizeof(float), cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, (left-SegSize)*sizeof(float), cudaMemcpyHostToDevice, streams[1]);
        // block size is 256
        vecAdd<<<1, SegSize, 0, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        vecAdd<<<1, left-SegSize, 0, streams[1]>>>(d_A1, d_B1, d_C1, left-SegSize);  // stream 1 uses its own buffers
        // cudaStreamSynchronize(streams[0]);
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, (left-SegSize)*sizeof(float), cudaMemcpyDeviceToHost, streams[1]);
        // i += SegSize;
        // left = left - SegSize;
    }
    else if (left > 0)
    {
        printf("BBBBBBB\n");
        cudaMemcpyAsync(d_A0, hostInput1+i, left*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpyAsync(d_B0, hostInput2+i, left*sizeof(float), cudaMemcpyHostToDevice);
        vecAdd<<<1, left, 0, streams[0]>>>(d_A0, d_B0, d_C0, left);
        //cudaDeviceSynchronize();
        cudaMemcpyAsync(hostOutput+i, d_C0, left*sizeof(float), cudaMemcpyDeviceToHost);
    }

    cudaDeviceSynchronize();
    wbSolution(args, hostOutput, inputLength);

    free(hostInput1);
    free(hostInput2);
    free(hostOutput);

    for (int i = 0; i < StreamNum; i++)
        cudaStreamDestroy(streams[i]);

    cudaFree(d_A0);
    cudaFree(d_A1);
    cudaFree(d_B0);
    cudaFree(d_B1);
    cudaFree(d_C0);
    cudaFree(d_C1);

    return 0;
}

Next, the version using 4 streams; the code is as follows:

#include <wb.h>

#define wbCheck(stmt) do {                                                  \
        cudaError_t err = stmt;                                             \
        if (err != cudaSuccess) {                                           \
            wbLog(ERROR, "Failed to run stmt ", #stmt);                     \
            wbLog(ERROR, "Got CUDA error ... ", cudaGetErrorString(err));   \
            return -1;                                                      \
        }                                                                   \
    } while(0)

#define SegSize 256
#define StreamNum 4

__global__ void vecAdd(float * in1, float * in2, float * out, int len) {
    //@@ Insert code to implement vector addition here
    int gidx = blockIdx.x*blockDim.x + threadIdx.x;
    if (gidx < len)
    {
        out[gidx] = in1[gidx] + in2[gidx];
    }
}

int main(int argc, char ** argv) {
    wbArg_t args;
    int inputLength, i;
    float * hostInput1;
    float * hostInput2;
    float * hostOutput;
    // float * deviceInput1;
    // float * deviceInput2;
    // float * deviceOutput;
    float *h_A, *h_B, *h_C;
    //cudaStream_t stream0, stream1;
    //cudaStreamCreate(&stream0);
    //cudaStreamCreate(&stream1);
    float *d_A0, *d_B0, *d_C0; // device memory for stream 0
    float *d_A1, *d_B1, *d_C1; // device memory for stream 1
    float *d_A2, *d_B2, *d_C2; // device memory for stream 2
    float *d_A3, *d_B3, *d_C3; // device memory for stream 3

    args = wbArg_read(argc, argv);
    int Csize = SegSize*sizeof(float);

    wbTime_start(Generic, "Importing data and creating memory on host");
    hostInput1 = (float *) wbImport(wbArg_getInputFile(args, 0), &inputLength);
    hostInput2 = (float *) wbImport(wbArg_getInputFile(args, 1), &inputLength);
    hostOutput = (float *) malloc(inputLength * sizeof(float));
    printf("inputLength ==%d, SegSize =%d\n", inputLength, SegSize);
    wbTime_stop(Generic, "Importing data and creating memory on host");

    cudaHostAlloc((void**)&h_A, inputLength*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_B, inputLength*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_C, inputLength*sizeof(float), cudaHostAllocDefault);
    memcpy(h_A, hostInput1, inputLength*sizeof(float));
    memcpy(h_B, hostInput2, inputLength*sizeof(float));

    wbCheck(cudaMalloc((void **)&d_A0, Csize));
    wbCheck(cudaMalloc((void **)&d_A1, Csize));
    wbCheck(cudaMalloc((void **)&d_B0, Csize));
    wbCheck(cudaMalloc((void **)&d_B1, Csize));
    wbCheck(cudaMalloc((void **)&d_C0, Csize));
    wbCheck(cudaMalloc((void **)&d_C1, Csize));
    wbCheck(cudaMalloc((void **)&d_A2, Csize));
    wbCheck(cudaMalloc((void **)&d_A3, Csize));
    wbCheck(cudaMalloc((void **)&d_B2, Csize));
    wbCheck(cudaMalloc((void **)&d_B3, Csize));
    wbCheck(cudaMalloc((void **)&d_C2, Csize));
    wbCheck(cudaMalloc((void **)&d_C3, Csize));

    cudaStream_t *streams = (cudaStream_t*) malloc(StreamNum * sizeof(cudaStream_t));
    for (int i = 0; i < StreamNum; i++)
        cudaStreamCreate(&(streams[i]));

    int main = inputLength/(SegSize*StreamNum);   // number of full SegSize*StreamNum chunks
    int left = inputLength%(SegSize*StreamNum);   // remaining elements
    printf("main =%d, left=%d\n", main, left);

    // process only the full SegSize*StreamNum chunks; the tail is handled below
    for (i = 0; i + SegSize*StreamNum <= inputLength; i += SegSize*StreamNum)
    {
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_A2, hostInput1+i+SegSize*2, Csize, cudaMemcpyHostToDevice, streams[2]);
        cudaMemcpyAsync(d_B2, hostInput2+i+SegSize*2, Csize, cudaMemcpyHostToDevice, streams[2]);
        cudaMemcpyAsync(d_A3, hostInput1+i+SegSize*3, Csize, cudaMemcpyHostToDevice, streams[3]);
        cudaMemcpyAsync(d_B3, hostInput2+i+SegSize*3, Csize, cudaMemcpyHostToDevice, streams[3]);
        // block size is 256
        vecAdd<<<SegSize/256, SegSize, 0, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        vecAdd<<<SegSize/256, SegSize, 0, streams[1]>>>(d_A1, d_B1, d_C1, SegSize);
        vecAdd<<<SegSize/256, SegSize, 0, streams[2]>>>(d_A2, d_B2, d_C2, SegSize);
        vecAdd<<<SegSize/256, SegSize, 0, streams[3]>>>(d_A3, d_B3, d_C3, SegSize);
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
        //cudaStreamSynchronize(yiming_stream1);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, Csize, cudaMemcpyDeviceToHost, streams[1]);
        cudaMemcpyAsync(hostOutput+i+SegSize*2, d_C2, Csize, cudaMemcpyDeviceToHost, streams[2]);
        cudaMemcpyAsync(hostOutput+i+SegSize*3, d_C3, Csize, cudaMemcpyDeviceToHost, streams[3]);
    }

    // Process the remaining elements
    if (SegSize*3 < left) {
        printf("DDDDDDDD\n");
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_A2, hostInput1+i+SegSize*2, Csize, cudaMemcpyHostToDevice, streams[2]);
        cudaMemcpyAsync(d_B2, hostInput2+i+SegSize*2, Csize, cudaMemcpyHostToDevice, streams[2]);
        cudaMemcpyAsync(d_A3, hostInput1+i+SegSize*3, (left-SegSize*3)*sizeof(float), cudaMemcpyHostToDevice, streams[3]);
        cudaMemcpyAsync(d_B3, hostInput2+i+SegSize*3, (left-SegSize*3)*sizeof(float), cudaMemcpyHostToDevice, streams[3]);
        // block size is 256
        vecAdd<<<1, SegSize, 0, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        vecAdd<<<1, SegSize, 0, streams[1]>>>(d_A1, d_B1, d_C1, SegSize);
        vecAdd<<<1, SegSize, 0, streams[2]>>>(d_A2, d_B2, d_C2, SegSize);
        vecAdd<<<1, (left-SegSize*3), 0, streams[3]>>>(d_A3, d_B3, d_C3, (left-SegSize*3));
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
        //cudaStreamSynchronize(yiming_stream1);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, Csize, cudaMemcpyDeviceToHost, streams[1]);
        cudaMemcpyAsync(hostOutput+i+SegSize*2, d_C2, Csize, cudaMemcpyDeviceToHost, streams[2]);
        cudaMemcpyAsync(hostOutput+i+SegSize*3, d_C3, (left-SegSize*3)*sizeof(float), cudaMemcpyDeviceToHost, streams[3]);
    }
    else if (SegSize*2 < left) {
        printf("CCCCCCCC\n");
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_A2, hostInput1+i+SegSize*2, (left-SegSize*2)*sizeof(float), cudaMemcpyHostToDevice, streams[2]);
        cudaMemcpyAsync(d_B2, hostInput2+i+SegSize*2, (left-SegSize*2)*sizeof(float), cudaMemcpyHostToDevice, streams[2]);
        // block size is 256
        vecAdd<<<1, SegSize, 0, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        vecAdd<<<1, SegSize, 0, streams[1]>>>(d_A1, d_B1, d_C1, SegSize);
        vecAdd<<<1, left-SegSize*2, 0, streams[2]>>>(d_A2, d_B2, d_C2, (left-SegSize*2));
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
        //cudaStreamSynchronize(yiming_stream1);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, Csize, cudaMemcpyDeviceToHost, streams[1]);
        cudaMemcpyAsync(hostOutput+i+SegSize*2, d_C2, (left-SegSize*2)*sizeof(float), cudaMemcpyDeviceToHost, streams[2]);
    }
    else if (SegSize < left)
    {
        printf("AAAAAAA, left - SegSize ==%d\n", left-SegSize);
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, (left-SegSize)*sizeof(float), cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, (left-SegSize)*sizeof(float), cudaMemcpyHostToDevice, streams[1]);
        // block size is 256
        vecAdd<<<1, SegSize, 0, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        vecAdd<<<1, left-SegSize, 0, streams[1]>>>(d_A1, d_B1, d_C1, left-SegSize);  // stream 1 uses its own buffers
        // cudaStreamSynchronize(streams[0]);
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, (left-SegSize)*sizeof(float), cudaMemcpyDeviceToHost, streams[1]);
        // i += SegSize;
        // left = left - SegSize;
    }
    else if (left > 0)
    {
        printf("BBBBBBB\n");
        cudaMemcpyAsync(d_A0, hostInput1+i, left*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpyAsync(d_B0, hostInput2+i, left*sizeof(float), cudaMemcpyHostToDevice);
        vecAdd<<<1, left, 0, streams[0]>>>(d_A0, d_B0, d_C0, left);
        //cudaDeviceSynchronize();
        cudaMemcpyAsync(hostOutput+i, d_C0, left*sizeof(float), cudaMemcpyDeviceToHost);
    }

    cudaDeviceSynchronize();
    wbSolution(args, hostOutput, inputLength);

    free(hostInput1);
    free(hostInput2);
    free(hostOutput);

    for (int i = 0; i < StreamNum; i++)
        cudaStreamDestroy(streams[i]);

    cudaFree(d_A0);
    cudaFree(d_A1);
    cudaFree(d_B0);
    cudaFree(d_B1);
    cudaFree(d_C0);
    cudaFree(d_C1);
    cudaFree(d_A2);
    cudaFree(d_A3);
    cudaFree(d_B2);
    cudaFree(d_B3);
    cudaFree(d_C2);
    cudaFree(d_C3);

    return 0;
}

This runs correctly, but one question remains. When I change the memory-copy code to:

cudaMemcpyAsync(d_A0, h_A+i, Csize, cudaMemcpyHostToDevice, streams[0]);  that is, copying from the page-locked host buffer instead, the result becomes wrong, and I don't understand why.
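
I have not tracked this down yet. One way to narrow it down (a debugging sketch, not a fix) would be to wrap the asynchronous calls in wbCheck and synchronize every stream before wbSolution reads hostOutput:

for (int s = 0; s < StreamNum; s++)
    wbCheck(cudaStreamSynchronize(streams[s]));  // flush all pending work in each stream
wbCheck(cudaGetLastError());                     // surface any asynchronous launch error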
