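1. Introduction

1.1 GPU vs. CPU

A GPU devotes more of its transistors to data processing than to data caching and flow control, which lets it deliver highly parallel computation.
A GPU hides memory access latency with computation, rather than relying on large data caches and complex flow control to avoid long memory access latencies.

1.2 CUDA

CUDA is a general-purpose parallel computing platform and programming model introduced by NVIDIA. It uses the parallel compute engine in NVIDIA GPUs.

2. Programming Model

2.1 Kernel

A kernel is a C++ function, but unlike an ordinary C++ function, a kernel call is executed in parallel by N different CUDA threads, i.e. it runs N times in total.
A kernel is marked with the __global__ specifier.
The number of CUDA threads that execute a kernel is specified inside <<<...>>>: kernelName<<<blocksPerGrid, threadsPerBlock>>>(params...)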

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

int main(void)
{
    // ...
    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    // ...
}
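
The grid-size expression in the launch code is an integer ceiling division. As a minimal host-only sketch (plain C++, no GPU needed), the helpers below show how many blocks that expression yields and how many threads in the last block are filtered out by the `i < numElements` guard. The names `blocksFor` and `idleThreads` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threadsPerBlock >= numElements.
// This is the same ceiling-division expression used in the launch code.
int blocksFor(int numElements, int threadsPerBlock) {
    assert(numElements >= 0 && threadsPerBlock > 0);
    return (numElements + threadsPerBlock - 1) / threadsPerBlock;
}

// Number of launched threads whose index is >= numElements.
// These threads fail the bounds check in the kernel and do no work.
int idleThreads(int numElements, int threadsPerBlock) {
    return blocksFor(numElements, threadsPerBlock) * threadsPerBlock - numElements;
}
```

For 50000 elements and 256 threads per block this gives 196 blocks, of which the last one has 176 idle threads. Without the bounds check in the kernel, those threads would read and write past the end of the arrays.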

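2.2 Thread Hierarchy

A grid contains multiple blocks, and a block contains multiple threads.
The built-in variables blockIdx (the index of a block within the grid) and threadIdx (the index of a thread within its block) are both three-component variables. This means the blocks of a grid can be organized in up to three dimensions, and so can the threads of a block. For example, blockIdx.x is the x coordinate of the block within the grid (its position along the x axis), and threadIdx.x is the x coordinate of the thread within its block.
The size of a block in each dimension is available through the built-in variable blockDim; for example, blockDim.x is the number of threads in a block along the x axis. (The number of blocks in the grid along the x axis is gridDim.x.)
Within a block, the thread index maps to the thread ID as follows:

Block shape	Block size	Thread index	Thread ID
1D	(D_x)	(x)	x
2D	(D_x, D_y)	(x, y)	x + y * D_x
3D	(D_x, D_y, D_z)	(x, y, z)	x + y * D_x + z * D_x * D_y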
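
The mapping in the table can be written as one small function. The sketch below is host-only C++; `Vec3` is a hypothetical stand-in for CUDA's `dim3`/`uint3`, and `threadId` is a name chosen for this illustration, not a CUDA API.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 / uint3, for host-side illustration.
struct Vec3 { unsigned x, y, z; };

// Linear thread ID of the thread with index `idx` in a block of size `dim`:
//   x + y * D_x + z * D_x * D_y
// The 1D and 2D rows of the table are the special cases D_y = D_z = 1 and D_z = 1.
unsigned threadId(Vec3 idx, Vec3 dim) {
    assert(idx.x < dim.x && idx.y < dim.y && idx.z < dim.z);
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}
```

For a block of size (8, 4, 2), the thread with index (3, 2, 1) has ID 3 + 2 * 8 + 1 * 8 * 4 = 51, and the last thread (7, 3, 1) has ID 63.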

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    // blockIdx.x * blockDim.x is the grid-wide x coordinate of the first thread of this block,
    // so blockIdx.x * blockDim.x + threadIdx.x is the grid-wide x coordinate of this thread.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // blockIdx.y * blockDim.y is the grid-wide y coordinate of the first thread of this block,
    // so blockIdx.y * blockDim.y + threadIdx.y is the grid-wide y coordinate of this thread.
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // Threads that fall outside the matrix do nothing
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    dim3 threadsPerBlock(16, 16);
    // Round up so the grid still covers the matrix when N is not a multiple of 16
    dim3 blocksPerGrid((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    MatAdd<<<blocksPerGrid, threadsPerBlock>>>(A, B, C);
    ...
}
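
The 2D index arithmetic can be checked on the host by looping over every (block, thread) pair and applying the same formula and bounds check as the kernel. The sketch below is plain C++ with a hypothetical helper name, `coversOnce`. It compares a grid size computed with truncating division against one that is rounded up.

```cpp
#include <cassert>
#include <vector>

// Emulates a 2D launch over an n x n matrix on the host.
// Visits every (block, thread) pair, computes (i, j) as the kernel does,
// applies the same bounds check, and counts visits per element.
// Returns true if every element is visited exactly once.
bool coversOnce(int n, int blockX, int blockY, bool roundUp) {
    assert(n > 0 && blockX > 0 && blockY > 0);
    int gridX = roundUp ? (n + blockX - 1) / blockX : n / blockX;
    int gridY = roundUp ? (n + blockY - 1) / blockY : n / blockY;
    std::vector<int> hits(n * n, 0);
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < blockY; ++ty)
                for (int tx = 0; tx < blockX; ++tx) {
                    int i = bx * blockX + tx;  // grid-wide x coordinate
                    int j = by * blockY + ty;  // grid-wide y coordinate
                    if (i < n && j < n) ++hits[i * n + j];
                }
    for (int h : hits)
        if (h != 1) return false;
    return true;
}
```

With 16 x 16 blocks, a 64 x 64 matrix is fully covered either way. A 70 x 70 matrix is only covered when the grid size is rounded up; truncating division leaves rows and columns 64 to 69 untouched.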

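2.3 Memory Hierarchy

Per-thread local memory: each thread has its own local memory, which only that thread can access;
Per-block shared memory: each block has a shared memory, which only the threads of that block can access;
Global memory: all threads can access global memory;
Constant memory and texture memory: read-only and accessible by all threads.

The global, constant, and texture memory spaces are persistent across kernel launches by the same application.

2.4 Heterogeneous Programming

Kernels (device code) execute on the GPU (the device); the rest of the program (host code) executes on the CPU (the host).

3. Programming Interface

3.1 Global Memory

3.1.1 One-Dimensional

Allocate memory: cudaMalloc()
Free memory: cudaFree()
Copy memory between host and device: cudaMemcpy()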

int N = 1024;
size_t size = N * sizeof(float);

// Allocate host memory
float* h_A = (float*)malloc(size);
float* h_B = (float*)malloc(size);
float* h_C = (float*)malloc(size);

// Initialize the inputs
...

// Allocate device memory
float* d_A;
cudaMalloc(&d_A, size);
float* d_B;
cudaMalloc(&d_B, size);
float* d_C;
cudaMalloc(&d_C, size);

// Copy memory: host -> device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

// Launch the kernel (vectorAdd from section 2.1)
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

// Fetch the result: copy memory, device -> host
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

// Free device memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);

// Free host memory
free(h_A);
free(h_B);
free(h_C);

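cudaMalloc() and cudaMemcpy() also work for 2D and 3D arrays, but the functions below pad the allocation appropriately so that each row is properly aligned.

3.1.2 Two-Dimensional

Allocate memory: cudaMallocPitch()
Copy memory between host and device: cudaMemcpy2D()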

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;

cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);

MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
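
The row addressing used in the kernel can be tried on the host. In the sketch below, `paddedPitch` is a hypothetical stand-in that rounds the row width up to a chosen alignment; the pitch that `cudaMallocPitch()` actually returns is decided by the runtime and should always be taken from its output parameter. `elementAt` mirrors the pointer arithmetic in the kernel.

```cpp
#include <cassert>
#include <cstddef>

// Rounds a row width in bytes up to the next multiple of `alignment`.
// Illustration only: the real pitch is chosen by cudaMallocPitch().
std::size_t paddedPitch(std::size_t widthBytes, std::size_t alignment) {
    assert(alignment > 0);
    return (widthBytes + alignment - 1) / alignment * alignment;
}

// Address of element (r, c) in a pitched 2D allocation of floats.
// Row r starts r * pitch bytes after the base pointer, so the offset
// must be applied to a byte pointer before casting back to float*.
float* elementAt(void* base, std::size_t pitch, int r, int c) {
    assert(r >= 0 && c >= 0);
    char* rowBytes = static_cast<char*>(base) + static_cast<std::size_t>(r) * pitch;
    return reinterpret_cast<float*>(rowBytes) + c;
}
```

The cast to a byte pointer is the important step: the pitch is measured in bytes, so adding `r * pitch` directly to a `float*` would skip four times too far on platforms where `float` is 4 bytes.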

3.1.3 Three-Dimensional

Allocate memory: cudaMalloc3D()
Copy memory between host and device: cudaMemcpy3D()

// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float), height, depth);

cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);

MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);

// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr, int width, int height, int depth)
{
    // ptr is a void*, so C++ needs an explicit cast
    char* devPtr = (char*)devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    // Bytes between two consecutive 2D slices
    size_t slicePitch = pitch * height;
    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];
            }
        }
    }
}
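
The nested loops in the kernel amount to one byte-offset formula. As a host-only sketch (the function name `offset3D` is hypothetical), the offset of element (x, y, z) in a pitched 3D allocation of floats is:

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of element (x, y, z) from the base pointer of a pitched
// 3D allocation of floats. Rows are `pitch` bytes apart and slices are
// `pitch * height` bytes apart, matching slicePitch in the kernel.
std::size_t offset3D(std::size_t pitch, std::size_t height,
                     std::size_t x, std::size_t y, std::size_t z) {
    assert(y < height && x * sizeof(float) < pitch);
    std::size_t slicePitch = pitch * height;
    return z * slicePitch + y * pitch + x * sizeof(float);
}
```

Assuming a pitch of 512 bytes and a height of 64 rows, one slice spans 32768 bytes, so element (0, 0, 1) starts at byte 32768.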

3.2 Shared Memory

Shared memory is declared with the __shared__ specifier.

// Thread block size
#define BLOCK_SIZE 16

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;          // width of the matrix or sub-matrix
    int height;         // height of the matrix or sub-matrix
    int stride;         // meaningful for a sub-matrix: width of the enclosing matrix
    float* elements;    // start address of the data
} Matrix;

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col,
                           float value)
{
    A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col];
    return Asub;
}

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);

        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}
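
The tiling scheme of the kernel can be followed on the host without a GPU. The sketch below is plain C++ for square n x n matrices; the function names are hypothetical, and the small `As`/`Bs` buffers only stand in for shared memory. It runs the same load-then-multiply loop per tile and can be compared against a naive triple loop.

```cpp
#include <cassert>
#include <vector>

// Naive reference: C = A * B for n x n row-major matrices.
std::vector<float> matMulNaive(const std::vector<float>& A,
                               const std::vector<float>& B, int n) {
    std::vector<float> C(n * n, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}

// Tiled version that follows the kernel's loop structure. (br, bc) plays
// the role of blockIdx, (r, c) the role of threadIdx, and m walks over
// the tile pairs of A and B needed for one output tile.
std::vector<float> matMulTiled(const std::vector<float>& A,
                               const std::vector<float>& B, int n, int tile) {
    assert(tile > 0 && n % tile == 0);  // dimensions must be multiples of the tile size
    std::vector<float> C(n * n, 0.0f);
    std::vector<float> As(tile * tile), Bs(tile * tile);
    for (int br = 0; br < n / tile; ++br)
        for (int bc = 0; bc < n / tile; ++bc)
            for (int m = 0; m < n / tile; ++m) {
                // Load phase: stage one tile of A and one tile of B.
                for (int r = 0; r < tile; ++r)
                    for (int c = 0; c < tile; ++c) {
                        As[r * tile + c] = A[(br * tile + r) * n + m * tile + c];
                        Bs[r * tile + c] = B[(m * tile + r) * n + bc * tile + c];
                    }
                // Compute phase: accumulate the partial products of this tile pair.
                for (int r = 0; r < tile; ++r)
                    for (int c = 0; c < tile; ++c)
                        for (int e = 0; e < tile; ++e)
                            C[(br * tile + r) * n + bc * tile + c] +=
                                As[r * tile + e] * Bs[e * tile + c];
            }
    return C;
}
```

On the GPU the two phases are separated by __syncthreads() because the threads of a block run concurrently; in this sequential sketch the ordering of the loops provides the same guarantee. Each element of a staged tile is read `tile` times from the local buffer, which is the reuse that shared memory makes cheap.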

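4. C++ Extensions

4.1 Function Specifiers

__global__

Executes on the device;
Called from the host;
Marks a function as a kernel;
Asynchronous: the call returns before the device has finished the computation;
Must return void and cannot be a class member function;

__device__

Executes on the device;
Callable only from the device;

__host__

Executes on the host;
Callable only from the host;

4.2 Variable Specifiers

__device__

Resides on the device;
Located in global memory (when not combined with another specifier);
Has the same lifetime as the CUDA context;
Can be combined with the specifiers below;

__constant__

Located in constant memory;
Has the same lifetime as the CUDA context;

__shared__

Located in shared memory;
Has the same lifetime as the block;
Accessible only by the threads of the block;

__managed__

Accessible from both the device and the host;
Has the same lifetime as the application;

4.3 Built-in Variables

gridDim: dimensions of the grid, of type dim3;
blockDim: dimensions of the block, of type dim3;
blockIdx: index of the block within the grid, of type uint3;
threadIdx: index of the thread within the block, of type uint3;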

5. Example

Note: this example is a simplified version of one of the samples shipped with CUDA.

vectorAdd.cu:

#include <stdio.h>
#include <stdlib.h>   // malloc, free, rand, exit
#include <math.h>     // fabs
#include <cuda_runtime.h>
#include <helper_cuda.h>

// C = A + B
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}

int main(void) {
    cudaError_t err = cudaSuccess;
    int numElements = 50000;
    size_t size = numElements * sizeof(float);

    // Allocate host memory
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    if (h_A == NULL || h_B == NULL || h_C == NULL) {
        fprintf(stderr, "Failed to allocate host vectors!\n");
        exit(EXIT_FAILURE);
    }

    // Initialize the input vectors
    for (int i = 0; i < numElements; ++i) {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

    // Allocate device memory
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy memory: host -> device
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Fetch the result: device -> host
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Verify the result
    for (int i = 0; i < numElements; ++i) {
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }
    printf("Test PASSED\n");

    // Free device memory
    err = cudaFree(d_A);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaFree(d_B);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaFree(d_C);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    printf("Done\n");
    return 0;
}

Makefile:

CUDA_PATH ?= "/usr/local/cuda-10.0"

# Architecture
HOST_ARCH   := $(shell uname -m)
TARGET_ARCH ?= $(HOST_ARCH)
TARGET_SIZE := 64

# Operating system
HOST_OS   := $(shell uname -s 2>/dev/null | tr "[:upper:]" "[:lower:]")
TARGET_OS ?= $(HOST_OS)

# Compilers
HOST_COMPILER ?= g++
NVCC          := $(CUDA_PATH)/bin/nvcc -ccbin $(HOST_COMPILER)

NVCCFLAGS   := -m${TARGET_SIZE}
CCFLAGS     :=
LDFLAGS     :=

# Debug build: -g for host code, -G for device code
NVCCFLAGS += -g -G
BUILD_TYPE := debug

ALL_CCFLAGS :=
ALL_CCFLAGS += $(NVCCFLAGS)
ALL_CCFLAGS += $(EXTRA_NVCCFLAGS)
ALL_CCFLAGS += $(addprefix -Xcompiler ,$(CCFLAGS))
ALL_CCFLAGS += $(addprefix -Xcompiler ,$(EXTRA_CCFLAGS))

ALL_LDFLAGS :=
ALL_LDFLAGS += $(ALL_CCFLAGS)
ALL_LDFLAGS += $(addprefix -Xlinker ,$(LDFLAGS))
ALL_LDFLAGS += $(addprefix -Xlinker ,$(EXTRA_LDFLAGS))

INCLUDES  := -I../../common/inc
LIBRARIES :=

SMS ?= 30 35 37 50 52 60 61 70 75
$(foreach sm,$(SMS),$(eval GENCODE_FLAGS += -gencode arch=compute_$(sm),code=sm_$(sm)))
HIGHEST_SM := $(lastword $(sort $(SMS)))
GENCODE_FLAGS += -gencode arch=compute_$(HIGHEST_SM),code=compute_$(HIGHEST_SM)

# Recipe lines below must start with a tab character, not spaces
all: build

build: vectorAdd

vectorAdd.o: vectorAdd.cu
	$(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -c $<

vectorAdd: vectorAdd.o
	$(NVCC) $(ALL_LDFLAGS) $(GENCODE_FLAGS) -o $@ $+ $(LIBRARIES)
	mkdir -p ../../bin/$(TARGET_ARCH)/$(TARGET_OS)/$(BUILD_TYPE)
	cp $@ ../../bin/$(TARGET_ARCH)/$(TARGET_OS)/$(BUILD_TYPE)

run: build
	./vectorAdd

clean:
	rm -f vectorAdd vectorAdd.o
	rm -rf ../../bin/$(TARGET_ARCH)/$(TARGET_OS)/$(BUILD_TYPE)/vectorAdd

vectorAdd$ make
"/usr/local/cuda-10.0"/bin/nvcc -ccbin g++ -I../../common/inc  -m64 -g -G    -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o vectorAdd.o -c vectorAdd.cu
"/usr/local/cuda-10.0"/bin/nvcc -ccbin g++   -m64 -g -G      -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o vectorAdd vectorAdd.o
mkdir -p ../../bin/x86_64/linux/debug
cp vectorAdd ../../bin/x86_64/linux/debug

References

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
