Professional CUDA C Programming: 1.3 - A Review of CUDA Fundamentals

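This post collects the notes I started writing many years ago (2013), when I first learned CUDA. I have not written CUDA code for a long time, but LLM inference now requires me to relearn CUDA programming. What goes around comes around.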

1. CUDA arrays

Explanation: CUDA arrays are allocated with cudaMallocArray() or cudaMalloc3DArray() and released with cudaFreeArray().
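As a rough illustration of the allocate/free pairing above, the sketch below copies a small image into a CUDA array and back again. The helper name cuda_array_round_trip and the 8x4 size are my own assumptions for illustration, and the copies use cudaMemcpy2DToArray() and cudaMemcpy2DFromArray(), which these notes do not otherwise cover.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Copies a small 2D float image into a CUDA array and back, then checks
// that the data survived the round trip. Returns true on success.
bool cuda_array_round_trip()
{
    const size_t width = 8, height = 4;              // illustrative sizes
    std::vector<float> src(width * height), dst(width * height, 0.0f);
    for (size_t i = 0; i < src.size(); ++i) src[i] = static_cast<float>(i);

    // One 32-bit float channel per element.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t array = nullptr;
    if (cudaMallocArray(&array, &desc, width, height) != cudaSuccess) return false;

    const size_t pitch = width * sizeof(float);      // host rows are tightly packed
    bool ok =
        cudaMemcpy2DToArray(array, 0, 0, src.data(), pitch, pitch, height,
                            cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy2DFromArray(dst.data(), pitch, array, 0, 0, pitch, height,
                              cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFreeArray(array);                            // CUDA arrays are not released with cudaFree()
    return ok && src == dst;
}

int main()
{
    printf("round trip %s\n", cuda_array_round_trip() ? "succeeded" : "failed");
    return 0;
}
```

The layout of a CUDA array is opaque, which is why the data goes in and out through copy functions rather than through a raw pointer.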

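2. OpenGL/DirectX Interoperability

Explanation: OpenGL frame buffers and DirectX vertex buffers can be mapped into the address space that CUDA operates on, so that CUDA can read and write the data in them. OpenGL interoperability mainly consists of registering and unregistering buffer objects, and mapping and unmapping them, as shown below:

(1) cudaGLRegisterBufferObject(): registers a buffer object.

(2) cudaGLUnregisterBufferObject(): unregisters a buffer object.

(3) cudaGLMapBufferObject(): maps a buffer object.

(4) cudaGLUnmapBufferObject(): unmaps a buffer object.

After a buffer object has been mapped with cudaGLMapBufferObject(), CUDA can read and write it through the device memory address that the call returns.

Direct3D interoperability mainly consists of setting the Direct3D device, registering resources, mapping them, querying information about the mapped resources, unmapping them, and unregistering them. Taking Direct3D 9 as an example: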

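(1) cudaD3D9SetDirect3DDevice(): sets the Direct3D device.

(2) cudaD3D9RegisterResource(): registers a resource.

(3) cudaD3D9MapResources(): maps resources.

(4) cudaD3D9ResourceGetMappedPointer(): gets the CUDA device memory address of a mapped resource.

(5) cudaD3D9ResourceGetMappedSize(): gets the size of a mapped resource.

(6) cudaD3D9ResourceGetMappedPitch(): gets the pitch of a mapped resource.

(7) cudaD3D9UnmapResources(): unmaps resources.

(8) cudaD3D9UnregisterResource(): unregisters a resource.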

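3. The CUDA software environment

Explanation:

(1) NVIDIA Jetson TK1: a GPU-based embedded development board from NVIDIA.

(2) NVRTC (NVIDIA Runtime Compilation): a runtime compilation library for CUDA C++.

(3) cuSolver: a high-level package built on the cuBLAS and cuSPARSE libraries.

(4) ptxas: the PTX assembler.

(5) cuobjdump: a dump tool for CUDA object files.

(6) nvidia-smi: the NVIDIA System Management Interface.

(7) CUDA Binary Utilities: cuobjdump, nvdisasm, and nvprune.

(8) CUDA-MEMCHECK: a standalone memory-checking utility shipped with the CUDA Toolkit.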

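4. cudaGetLastError and cudaGetErrorString

Explanation:

(1) cudaError_t cudaGetLastError(void): returns the most recent error produced by a runtime call in the same host thread and resets the error state to cudaSuccess.

(2) const char* cudaGetErrorString(cudaError_t error): returns the description string for the given error code.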
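To make the reset behaviour concrete, here is a minimal sketch that provokes a launch error on purpose. The kernel noop_kernel and the helper launch_with_bad_config are hypothetical names of mine. The oversized block is only there to force a failure, and the exact error reported depends on the system.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop_kernel() {}

// Launches a kernel with an invalid configuration and returns the error
// that the runtime recorded for that launch.
cudaError_t launch_with_bad_config()
{
    noop_kernel<<<1, 1 << 20>>>();   // far more threads per block than a GPU allows
    return cudaGetLastError();       // fetches the error and resets the error state
}

int main()
{
    cudaError_t err = launch_with_bad_config();
    printf("launch:      %s\n", cudaGetErrorString(err));

    // The previous cudaGetLastError() call cleared this launch error,
    // so a second query reports the state after the reset.
    printf("after reset: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```

Kernel launches return no status of their own, so calling cudaGetLastError() right after the launch is the usual way to catch configuration mistakes.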

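5. Zero-copy memory

Explanation: A CUDA kernel can access host memory directly, without first copying it to the GPU, as shown below:

(1) Allocate mapped host memory: cudaHostAlloc((void**)&host_data_to_device, size_in_bytes, cudaHostAllocMapped);

(2) Get the device-side pointer: cudaHostGetDevicePointer(&dev_host_data_to_device, host_data_to_device, 0);

Note: Zero-copy memory suits compute-intensive programs that read and write the data only a few times.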
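The two calls above can be put together as in the following sketch. The kernel add_one, the helper zero_copy_increment and the buffer size are assumptions for illustration. The cudaSetDeviceFlags(cudaDeviceMapHost) call is included because older setups need it before mapped allocations, and its return value is ignored on purpose.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;         // reads and writes host memory directly
}

// Returns true if the kernel incremented every element of the host buffer.
bool zero_copy_increment()
{
    const int n = 1024;
    int *h_data = nullptr;           // host pointer to the mapped allocation
    int *d_alias = nullptr;          // device-side pointer to the same memory

    cudaSetDeviceFlags(cudaDeviceMapHost);   // needed on older setups, harmless otherwise
    if (cudaHostAlloc((void **)&h_data, n * sizeof(int), cudaHostAllocMapped) != cudaSuccess)
        return false;
    for (int i = 0; i < n; ++i) h_data[i] = i;

    bool ok = cudaHostGetDevicePointer((void **)&d_alias, h_data, 0) == cudaSuccess;
    if (ok) {
        add_one<<<(n + 255) / 256, 256>>>(d_alias, n);
        // No cudaMemcpy is involved, but the host must wait for the kernel.
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    for (int i = 0; i < n && ok; ++i) ok = (h_data[i] == i + 1);

    cudaFreeHost(h_data);
    return ok;
}

int main()
{
    printf("zero-copy increment %s\n", zero_copy_increment() ? "succeeded" : "failed");
    return 0;
}
```

Every access in the kernel goes to host memory rather than device memory, which is why the note above recommends this technique only when the data is touched a few times.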

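6. static functions

Explanation: A static function (internal function) is visible only in the file where it is declared and cannot be called from other files.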

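7. cudaError_t and checkCudaErrors

Explanation: cudaError_t is the error code type returned by CUDA Runtime API calls. checkCudaErrors is a helper from the CUDA samples (defined in helper_cuda.h) that checks this return value. Using it requires the sample headers:

(1) #include <helper_cuda.h>

(2) #include <helper_functions.h>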
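When the sample headers are not available, a small macro can play a similar role. The sketch below is my own stand-in, not the samples' implementation: CUDA_CHECK and checked_alloc_and_free are hypothetical names, and the macro simply prints the failing call and exits.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical macro, similar in spirit to checkCudaErrors from the samples.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__,  \
                    #call, cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Runs a few checked runtime calls; returns 0 if all of them succeeded
// (on failure the macro terminates the program instead of returning).
int checked_alloc_and_free()
{
    float *d_buf = nullptr;
    CUDA_CHECK(cudaMalloc((void **)&d_buf, 256 * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_buf, 0, 256 * sizeof(float)));
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}

int main()
{
    checked_alloc_and_free();
    printf("all CUDA calls succeeded\n");
    return 0;
}
```

Wrapping the body in do { ... } while (0) lets the macro be used like an ordinary statement, including inside an if/else without braces.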

8. texture<type, dimension, readtype> texreference; [4]

Explanation:

(1) type: the element type, such as int, uchar, or float.

(2) dimension: 1, 2, or 3.

(3) readtype: cudaReadModeNormalizedFloat (normalized) or cudaReadModeElementType (the default).

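9. Texture memory binding [5]

Explanation: A texture can be bound in two ways. One is to bind it to a one-dimensional array in linear memory that was allocated with cudaMalloc() and filled with cudaMemcpy(). The other is to bind it to a two- or three-dimensional CUDA array that was allocated with cudaMallocArray() and filled with cudaMemcpyToArray().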

10. cudaCreateChannelDesc

Explanation: __host__ cudaChannelFormatDesc cudaCreateChannelDesc(int x, int y, int z, int w, cudaChannelFormatKind f): returns a channel descriptor using the specified format.

Note: cudaChannelFormatKind is one of cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, or cudaChannelFormatKindFloat.

11. cudaBindTextureToArray

Explanation: __host__ cudaError_t cudaBindTextureToArray(const textureReference* texref, cudaArray_const_t array, const cudaChannelFormatDesc* desc): binds an array to a texture.

Note: texref is the texture to bind, array is the memory array on the device, and desc is the channel format.

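12. CUDA synchronization functions

Explanation:

(1) cudaDeviceSynchronize(): blocks the calling CPU thread until the GPU has finished all outstanding CUDA work, including kernels and data copies.

(2) cudaThreadSynchronize(): essentially the same as cudaDeviceSynchronize(); it is the older, deprecated name.

(3) cudaStreamSynchronize(): takes a stream handle and blocks the CPU until the GPU has finished the CUDA work in that stream, regardless of whether work in other streams has finished.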

13. cudaGetLastError

Explanation: __host__ __device__ cudaError_t cudaGetLastError(void): returns the last error from a runtime call.

14. The PGM image format

Explanation: PGM stands for Portable Gray Map. It is one of the simplest format standards for grayscale images.

15. CUDA architectures

Explanation: Tesla, Fermi, Kepler, Maxwell, Pascal, and Volta.

16. tex2D

Explanation: Reading texture memory from inside a kernel is called a texture fetch. Data in texture memory is read with tex2D().

17. A walkthrough of simpleTexture.cu

Explanation:

#include <iostream>
#include <cuda_runtime.h>
#include <helper_image.h>
using namespace std;
 
const char *imagePath = "./data/lena_bw.pgm";
const char *outputFilename = "./data/lena_bw_out.pgm";
const float angle = 0.5f;
// Texture reference for 2D float texture
texture<float, 2, cudaReadModeElementType> tex;
 
// @param outputData  output data in global memory
__global__ void transformKernel(float *outputData, int width, int height, float theta)
{   
 // calculate normalized texture coordinates
 unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
 unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    
 float u = (float)x - (float)width / 2;
 float v = (float)y - (float)height / 2;
 
 u /= (float)width;
 v /= (float)height;
 
 float tu = u*cosf(theta) - v*sinf(theta);
 float tv = v*cosf(theta) + u*sinf(theta);
    
 // read from texture and write to global memory
 outputData[y * width + x] = tex2D(tex, tu + 0.5f, tv + 0.5f);
}

int main(int argc, char **argv)
{   
 // Load the image first so that width and height hold its real dimensions
 float *hData = NULL;
 unsigned int width = 0;
 unsigned int height = 0;
 sdkLoadPGM(imagePath, &hData, &width, &height);
 unsigned int size = width * height * sizeof(float);

 // Allocate device memory for result
 float *dData = NULL;
 cudaMalloc((void **)&dData, size);
 
 // Allocate array and copy image data
 cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
 cudaArray *cuArray;
 cudaMallocArray(&cuArray, &channelDesc, width, height);
 cudaMemcpyToArray(cuArray, 0, 0, hData, size, cudaMemcpyHostToDevice);
 
 // Set texture parameters
 tex.addressMode[0] = cudaAddressModeWrap;
 tex.addressMode[1] = cudaAddressModeWrap;
 tex.filterMode = cudaFilterModeLinear;
 tex.normalized = true;
 
 // Bind the array to the texture
 cudaBindTextureToArray(tex, cuArray, channelDesc);
 
 dim3 dimBlock(8, 8, 1);
 dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
 
 // Execute the kernel
 transformKernel<<<dimGrid, dimBlock, 0>>>(dData, width, height, angle);
    
 // Allocate mem for the result on host side
 float *hOutputData = (float *)malloc(size);
 // copy result from device to host
 cudaMemcpy(hOutputData, dData, size, cudaMemcpyDeviceToHost);
 sdkSavePGM(outputFilename, hOutputData, width, height);
 
 // Release device and host resources
 cudaFree(dData);
 cudaFreeArray(cuArray);
 free(hOutputData);
 free(hData);
 return 0;
}

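18. texture<Type, Dim, ReadMode> texRef;

Explanation:

(1) Type: the data type.

(2) Dim: the dimensionality of the texture reference.

(3) ReadMode: cudaReadModeNormalizedFloat or cudaReadModeElementType.

Note: A texture reference can only be declared as a static global variable and cannot be passed as a function argument. Texture references are a legacy interface: CUDA 12 and later no longer provide them, and texture objects are used instead.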

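19. struct cudaChannelFormatDesc { int x, y, z, w; enum cudaChannelFormatKind f; };

Explanation:

(1) The members x, y, z, and w give the number of bits in each component of a texture element. For example, a texture whose elements are a single float has x set to 32 and the other members set to 0.

(2) The member f, of type enum cudaChannelFormatKind, gives the type of the data: signed integer (cudaChannelFormatKindSigned), unsigned integer (cudaChannelFormatKindUnsigned), or floating point (cudaChannelFormatKindFloat).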

20. struct cudaChannelFormatDesc cudaCreateChannelDesc(int x, int y, int z, int w, enum cudaChannelFormatKind f);

Explanation: The function cudaCreateChannelDesc creates a cudaChannelFormatDesc structure.

21. cudaError_t cudaMallocArray(struct cudaArray** array, const struct cudaChannelFormatDesc* desc, size_t width, size_t height);

Explanation: Allocates a CUDA array according to the cudaChannelFormatDesc structure desc and returns a handle to the new CUDA array in *array.

22. cudaError_t cudaMemcpyToArray(struct cudaArray* dstArray, size_t dstX, size_t dstY, const void* src, size_t count, enum cudaMemcpyKind kind);

Explanation: Copies count bytes from the memory area pointed to by src to the CUDA array dstArray, starting at the upper-left corner (dstX, dstY). kind gives the direction of the copy and is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice.

23. struct textureReference { int normalized; enum cudaTextureFilterMode filterMode; enum cudaTextureAddressMode addressMode[2]; struct cudaChannelFormatDesc channelDesc; } [1]

Explanation:

(1) normalized: whether texture coordinates are normalized.

(2) filterMode: the filtering mode, cudaFilterModePoint or cudaFilterModeLinear.

(3) addressMode: the addressing mode, cudaAddressModeClamp or cudaAddressModeWrap.

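24. cudaError_t cudaBindTextureToArray(const struct textureReference* texRef, const struct cudaArray* array, const struct cudaChannelFormatDesc* desc);

Explanation: Binds the CUDA array array to the texture reference texRef. desc describes how the memory is interpreted during texture fetches. Any memory previously bound to texRef is unbound.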

25. Texture fetches from CUDA arrays

Explanation:

(1) template<class Type, enum cudaTextureReadMode readMode> Type tex1D(texture<Type, 1, readMode> texRef, float x);

(2) template<class Type, enum cudaTextureReadMode readMode> Type tex2D(texture<Type, 2, readMode> texRef, float x, float y);

Note: tex1D and tex2D use the texture coordinates x and y to fetch from the region of the CUDA array bound to the texture reference texRef.
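The declarations above belong to the legacy texture reference API. For comparison, here is a hedged sketch of the same kind of fetch done through a texture object, which is the interface current toolkits provide. The names copy_from_texture and texture_object_copy, the 16x16 size, and the choice of point filtering with unnormalized coordinates are all assumptions for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread fetches one texel through a texture object.
__global__ void copy_from_texture(cudaTextureObject_t tex, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);  // sample texel centers
}

// Returns true if the texture fetches reproduce the source image exactly.
bool texture_object_copy()
{
    const int width = 16, height = 16;
    const size_t row_bytes = width * sizeof(float);
    std::vector<float> src(width * height), dst(width * height, -1.0f);
    for (int i = 0; i < width * height; ++i) src[i] = (float)i;

    // Back the texture with a CUDA array holding one float per element.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t array = nullptr;
    if (cudaMallocArray(&array, &desc, width, height) != cudaSuccess) return false;
    cudaMemcpy2DToArray(array, 0, 0, src.data(), row_bytes, row_bytes, height,
                        cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = array;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;        // no interpolation
    td.readMode = cudaReadModeElementType;      // return the stored float as is
    td.normalizedCoords = 0;                    // coordinates are in texels

    cudaTextureObject_t tex = 0;
    bool ok = cudaCreateTextureObject(&tex, &res, &td, nullptr) == cudaSuccess;

    float *d_out = nullptr;
    ok = ok && cudaMalloc((void **)&d_out, row_bytes * height) == cudaSuccess;
    if (ok) {
        dim3 block(8, 8), grid(width / block.x, height / block.y);
        copy_from_texture<<<grid, block>>>(tex, d_out, width, height);
        ok = cudaMemcpy(dst.data(), d_out, row_bytes * height,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaDestroyTextureObject(tex);
    cudaFreeArray(array);
    cudaFree(d_out);
    return ok && src == dst;
}

int main()
{
    printf("texture object copy %s\n", texture_object_copy() ? "succeeded" : "failed");
    return 0;
}
```

Unlike a texture reference, the texture object is an ordinary value, so it can be created at run time and passed to the kernel as an argument.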

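26. CPU/GPU concurrency

Explanation: CPU/GPU concurrency is the ability of the CPU to keep working after it has sent requests to the GPU. Its most important use is hiding the overhead of the work requested from the GPU.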

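27. cudaError_t cudaThreadSynchronize(void);

Explanation: Blocks until all requested work on the device has completed. cudaThreadSynchronize() returns an error if one of the tasks failed.

28. cudaError_t cudaThreadExit(void);

Explanation: Releases all runtime-related resources associated with the calling host thread. Any subsequent API call reinitializes the runtime.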

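29. cudaError_t cudaStreamCreate(cudaStream_t* stream);

Explanation: Creates a stream.

30. cudaError_t cudaStreamQuery(cudaStream_t stream);

Explanation: Returns cudaSuccess if all operations in the stream have completed, and cudaErrorNotReady otherwise.

31. cudaError_t cudaStreamSynchronize(cudaStream_t stream);

Explanation: Blocks until the device has completed all operations in the stream.

32. cudaError_t cudaStreamDestroy(cudaStream_t stream);

Explanation: Destroys a stream.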

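33. cudaError_t cudaEventCreate(cudaEvent_t* event);

Explanation: Creates an event.

34. cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream);

Explanation: Records an event. If stream is nonzero, the event is recorded once all preceding operations in that stream have completed; otherwise it is recorded once all preceding operations in the CUDA context have completed. Because the call is asynchronous, cudaEventQuery() and/or cudaEventSynchronize() must be used to determine when the event has actually been recorded.

35. cudaError_t cudaEventQuery(cudaEvent_t event);

Explanation: Returns cudaSuccess if the event has actually been recorded, and cudaErrorNotReady otherwise.

36. cudaError_t cudaEventSynchronize(cudaEvent_t event);

Explanation: Blocks until the event has actually been recorded. In the CUDA version these notes were written against, the function returns cudaErrorInvalidValue if cudaEventRecord() has not been called on the event.

37. cudaError_t cudaEventDestroy(cudaEvent_t event);

Explanation: Destroys an event.

38. cudaError_t cudaEventElapsedTime(float* time, cudaEvent_t start, cudaEvent_t end);

Explanation: Computes the time elapsed between two events, in milliseconds. If either event has not been recorded, the function returns cudaErrorInvalidValue.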

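39. cudaError_t cudaGetDeviceCount(int* count);

Explanation: Returns in *count the number of devices with compute capability greater than or equal to 1.0.

40. cudaError_t cudaSetDevice(int dev);

Explanation: Sets dev as the device on which the calling host thread executes device code.

41. cudaError_t cudaGetDevice(int* dev);

Explanation: Returns in *dev the device on which the calling host thread executes device code.

42. cudaError_t cudaGetDeviceProperties(struct cudaDeviceProp* prop, int dev);

Explanation: Returns the properties of device dev in *prop.

43. cudaError_t cudaChooseDevice(int* dev, const struct cudaDeviceProp* prop);

Explanation: Returns in *dev the device whose properties best match *prop.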
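The device management calls in items 39 to 43 are typically used together. The following is a minimal sketch of that pattern. The helper name list_devices and the "compute capability 3.0" requirement passed to cudaChooseDevice() are assumptions for illustration only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints one line per CUDA device and returns the number of devices found.
int list_devices()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("device %d: %s, compute capability %d.%d, %zu MiB global memory\n",
               dev, prop.name, prop.major, prop.minor, prop.totalGlobalMem >> 20);
    }
    return count;
}

int main()
{
    int count = list_devices();
    if (count == 0) {
        printf("no CUDA device found\n");
        return 0;
    }

    // Ask the runtime for the device closest to a (hypothetical) requirement.
    cudaDeviceProp wanted = {};
    wanted.major = 3;
    wanted.minor = 0;
    int best = 0;
    if (cudaChooseDevice(&best, &wanted) != cudaSuccess) best = 0;

    // Make it the device used by this host thread and read the setting back.
    cudaSetDevice(best);
    int current = -1;
    cudaGetDevice(&current);
    printf("using device %d of %d\n", current, count);
    return 0;
}
```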

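44. CUDA Runtime API

Explanation:

(1) The low-level API (cuda_runtime_api.h) is a C-style interface and does not need to be compiled with nvcc.

(2) The high-level API (cuda_runtime.h) is a C++-style interface built on top of the low-level API. It can be used directly from C++ code and compiled by any C++ compiler. The high-level API also contains some CUDA-specific wrappers, which do need to be compiled with nvcc.

Note: The CUDA Runtime API and the CUDA Driver API provide functions for device management, thread management, stream management, event management, memory management, texture reference management, execution control, OpenGL interoperability, Direct3D interoperability, error handling, and so on.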

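45. __noinline__

Explanation: By default the compiler is free to inline __device__ functions (on the earliest GPUs they were always inlined). The __noinline__ qualifier is a hint asking the compiler not to inline a function.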

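46. #pragma unroll

Explanation: By default the compiler unrolls small loops with a known trip count. #pragma unroll specifies how many times a given loop is unrolled.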
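Both hints can be seen in one small kernel. The sketch below is only an illustration: multiply, dot4 and dot4_matches_cpu are hypothetical names. Whether the compiler honours either hint is up to the compiler, so the host code checks only the numerical result.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// __noinline__ asks the compiler to keep this as a real function call.
__device__ __noinline__ float multiply(float a, float b)
{
    return a * b;
}

// Each thread computes the dot product of one group of 4 elements.
__global__ void dot4(const float *a, const float *b, float *out, int groups)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= groups) return;

    float sum = 0.0f;
    #pragma unroll 4                 // request 4-way unrolling of the loop below
    for (int k = 0; k < 4; ++k)
        sum += multiply(a[4 * g + k], b[4 * g + k]);
    out[g] = sum;
}

// Returns true if the GPU results equal a CPU reference.
bool dot4_matches_cpu()
{
    const int groups = 256, n = groups * 4;
    std::vector<float> a(n), b(n), out(groups, 0.0f);
    for (int i = 0; i < n; ++i) { a[i] = (float)(i % 7); b[i] = (float)(i % 5); }

    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));
    cudaMalloc((void **)&d_out, groups * sizeof(float));
    cudaMemcpy(d_a, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dot4<<<(groups + 127) / 128, 128>>>(d_a, d_b, d_out, groups);
    bool ok = cudaMemcpy(out.data(), d_out, groups * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);

    // The inputs are small integers, so the float sums are exact.
    for (int g = 0; g < groups && ok; ++g) {
        float ref = 0.0f;
        for (int k = 0; k < 4; ++k) ref += a[4 * g + k] * b[4 * g + k];
        ok = (out[g] == ref);
    }
    return ok;
}

int main()
{
    printf("dot4 %s\n", dot4_matches_cpu() ? "matches the CPU result" : "failed");
    return 0;
}
```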

47. CUDA built-in vector types

Explanation: char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4.
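As a small illustration of how these types are used, the sketch below scales an array of float4 values. The components are accessed as .x, .y, .z and .w, and values are built with the make_float4() constructor function. The names scale_float4 and scale_float4_works are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Multiplies all four components of each float4 by s.
__global__ void scale_float4(float4 *v, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 p = v[i];             // load the whole 16-byte element
        p.x *= s; p.y *= s; p.z *= s; p.w *= s;
        v[i] = p;
    }
}

// Returns true if every component was doubled.
bool scale_float4_works()
{
    const int n = 512;
    std::vector<float4> h(n);
    for (int i = 0; i < n; ++i)
        h[i] = make_float4((float)i, i + 1.0f, i + 2.0f, i + 3.0f);

    float4 *d_v = nullptr;
    if (cudaMalloc((void **)&d_v, n * sizeof(float4)) != cudaSuccess) return false;
    cudaMemcpy(d_v, h.data(), n * sizeof(float4), cudaMemcpyHostToDevice);
    scale_float4<<<(n + 127) / 128, 128>>>(d_v, 2.0f, n);
    bool ok = cudaMemcpy(h.data(), d_v, n * sizeof(float4),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_v);

    for (int i = 0; i < n && ok; ++i)
        ok = h[i].x == 2.0f * i && h[i].y == 2.0f * (i + 1) &&
             h[i].z == 2.0f * (i + 2) && h[i].w == 2.0f * (i + 3);
    return ok;
}

int main()
{
    printf("float4 scaling %s\n", scale_float4_works() ? "succeeded" : "failed");
    return 0;
}
```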

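48. CUDA type conversion functions

Explanation:

(1) int __float2int_[rn,rz,ru,rd](float);: converts a float argument to an integer using the specified rounding mode.

(2) unsigned int __float2uint_[rn,rz,ru,rd](float);: converts a float argument to an unsigned integer using the specified rounding mode.

(3) float __int2float_[rn,rz,ru,rd](int);: converts an integer argument to a float using the specified rounding mode.

(4) float __uint2float_[rn,rz,ru,rd](unsigned int);: converts an unsigned integer argument to a float using the specified rounding mode.

Note: rn rounds to the nearest value, with ties going to the even value; rz rounds toward zero; ru rounds up (toward positive infinity); rd rounds down (toward negative infinity).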
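The four rounding modes are easiest to tell apart on a value that lies exactly halfway between two integers. The sketch below tries 2.5 and -2.5. The kernel and helper names are hypothetical, and the expected values in the comments follow directly from the definitions in the note above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Converts x to int with each of the four rounding modes.
__global__ void float_to_int_all_modes(float x, int *out)
{
    out[0] = __float2int_rn(x);      // nearest, ties to even
    out[1] = __float2int_rz(x);      // toward zero
    out[2] = __float2int_ru(x);      // toward positive infinity
    out[3] = __float2int_rd(x);      // toward negative infinity
}

// Runs the kernel for one input and copies the four results to the host.
static bool convert_on_gpu(float x, int *d_out, int h[4])
{
    float_to_int_all_modes<<<1, 1>>>(x, d_out);
    return cudaMemcpy(h, d_out, 4 * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
}

// Returns true if 2.5 and -2.5 round as the definitions predict.
bool rounding_modes_behave_as_expected()
{
    int *d_out = nullptr;
    int h[4] = {0, 0, 0, 0};
    if (cudaMalloc((void **)&d_out, 4 * sizeof(int)) != cudaSuccess) return false;

    // 2.5: rn = 2 (tie goes to the even value), rz = 2, ru = 3, rd = 2
    bool pos = convert_on_gpu(2.5f, d_out, h) &&
               h[0] == 2 && h[1] == 2 && h[2] == 3 && h[3] == 2;

    // -2.5: rn = -2, rz = -2, ru = -2, rd = -3
    bool neg = convert_on_gpu(-2.5f, d_out, h) &&
               h[0] == -2 && h[1] == -2 && h[2] == -2 && h[3] == -3;

    cudaFree(d_out);
    return pos && neg;
}

int main()
{
    printf("rounding modes %s\n",
           rounding_modes_behave_as_expected() ? "behave as expected" : "differ");
    return 0;
}
```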

49. A walkthrough of asyncAPI.cu

Explanation:

#include <stdio.h>
#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <helper_functions.h> 
 
__global__ void increment_kernel(int *g_data, int inc_value)
{
 int idx = blockIdx.x * blockDim.x + threadIdx.x;
 g_data[idx] = g_data[idx] + inc_value;
}

int main(int argc, char *argv[])
{   
 int n = 16 * 1024 * 1024;
 int nbytes = n * sizeof(int);
 int value = 26;
 
 // allocate host memory
 int *a = 0;
 cudaMallocHost((void **)&a, nbytes);
 memset(a, 0, nbytes);
 
 // allocate device memory
 int *d_a = 0;
 cudaMalloc((void **)&d_a, nbytes);
 cudaMemset(d_a, 255, nbytes);
 
 // set kernel launch configuration
 dim3 threads = dim3(512, 1);
 dim3 blocks = dim3(n / threads.x, 1);
 
 // create cuda event handles
 cudaEvent_t start, stop;
 cudaEventCreate(&start);
 cudaEventCreate(&stop);
 
 // define time
 StopWatchInterface *timer = NULL;
 sdkCreateTimer(&timer);
 sdkResetTimer(&timer);
 
 cudaDeviceSynchronize();
 float gpu_time = 0.0f;
 
 // asynchronously issue work to the GPU (all to stream 0)
 sdkStartTimer(&timer);
 cudaEventRecord(start, 0);
 cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
 increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);
 cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
 cudaEventRecord(stop, 0);
 sdkStopTimer(&timer);
 
 // have CPU do some work while waiting for stage 1 to finish
 unsigned long int counter = 0;
 while (cudaEventQuery(stop) == cudaErrorNotReady)
 {
  counter++;
 }
 
 cudaEventElapsedTime(&gpu_time, start, stop);
 
 // print the cpu and gpu times
 printf("time spent executing by the GPU: %.2f\n", gpu_time);
 printf("time spent by CPU in CUDA calls: %.2f\n", sdkGetTimerValue(&timer));
 printf("CPU executed %lu iterations while waiting for GPU to finish\n", counter);
 
 // release resources
 cudaEventDestroy(start);
 cudaEventDestroy(stop);
 cudaFreeHost(a);
 cudaFree(d_a);
 
 cudaDeviceReset();
}

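Explanation:

(1) The event management API is mainly used to record GPU state, so that the CPU can query a CUDA event to find out whether the GPU has finished.

(2) Functions that commonly execute asynchronously between host and device include kernel launches; memory copy functions with the Async suffix; device-to-device memory copies; and memory initialization functions such as cudaMemset(), cudaMemset2D(), and cudaMemset3D().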

50. Creating and initializing streams

Explanation:

cudaStream_t *streams = (cudaStream_t *)malloc(nstreams * sizeof(cudaStream_t));
for (int i = 0; i < nstreams; i++)
{
 checkCudaErrors(cudaStreamCreate(&(streams[i])));
}

51. A walkthrough of simpleStreams.cu

Explanation:

#include <stdio.h>
#include <stdlib.h>   // malloc, free
#include <string.h>   // memset
#include <cuda_runtime.h>

__global__ void init_array(int *g_data, int *factor, int num_iterations)
{
 int idx = blockIdx.x * blockDim.x + threadIdx.x;
 for (int i = 0; i<num_iterations; i++) { g_data[idx] += *factor;}
}
 
int main(int argc, char **argv)
{
 int nstreams = 4;               
 int nreps = 10;                 
 int n = 16 * 1024 * 1024;       
 int nbytes = n * sizeof(int);   
 dim3 threads, blocks;           
 float elapsed_time, time_memcpy, time_kernel;   
 int niterations = 5;    
 
 // allocate host memory
 int c = 5;                      
 int *h_a = 0;                  
 cudaMallocHost((void**)&h_a, nbytes);
 
 // allocate device memory
 int *d_a = 0, *d_c = 0;            
 cudaMalloc((void **)&d_a, nbytes);
 cudaMalloc((void **)&d_c, sizeof(int));
 cudaMemcpy(d_c, &c, sizeof(int), cudaMemcpyHostToDevice);
 
 // allocate and initialize an array of stream handles
 cudaStream_t *streams = (cudaStream_t *)malloc(nstreams * sizeof(cudaStream_t));
 for (int i = 0; i < nstreams; i++)
 { cudaStreamCreate(&(streams[i]));}
 
 // create CUDA event handles (default flags)
 cudaEvent_t start_event, stop_event;
 cudaEventCreate(&start_event);
 cudaEventCreate(&stop_event);
 
 // time memcopy from device
 cudaEventRecord(start_event, 0);
 cudaMemcpyAsync(h_a, d_a, nbytes, cudaMemcpyDeviceToHost, streams[0]);
 cudaEventRecord(stop_event, 0);
 cudaEventSynchronize(stop_event);   
 cudaEventElapsedTime(&time_memcpy, start_event, stop_event);
 printf("memcopy:\t%.2f\n", time_memcpy);
 
 // time kernel
 threads = dim3(512, 1);
 blocks = dim3(n / threads.x, 1);
 cudaEventRecord(start_event, 0);
 init_array<<<blocks, threads, 0, streams[0]>>>(d_a, d_c, niterations);
 cudaEventRecord(stop_event, 0);
 cudaEventSynchronize(stop_event);
 cudaEventElapsedTime(&time_kernel, start_event, stop_event);
 printf("kernel:\t\t%.2f\n", time_kernel);
    
 // time non-streamed execution for reference
 threads = dim3(512, 1);
 blocks = dim3(n / threads.x, 1);
 cudaEventRecord(start_event, 0);
 for (int k = 0; k < nreps; k++)
 {
  init_array<<<blocks, threads>>>(d_a, d_c, niterations);
  cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);
 }
 cudaEventRecord(stop_event, 0);
 cudaEventSynchronize(stop_event);
 cudaEventElapsedTime(&elapsed_time, start_event, stop_event);
 printf("non-streamed:\t%.2f\n", elapsed_time / nreps);
 
 // time execution with nstreams streams
 threads = dim3(512, 1);
 blocks = dim3(n / (nstreams*threads.x), 1);
 memset(h_a, 255, nbytes);     
 cudaMemset(d_a, 0, nbytes); 
 cudaEventRecord(start_event, 0);
 for (int k = 0; k < nreps; k++)
 {   // asynchronously launch nstreams kernels, one per stream
  for (int i = 0; i < nstreams; i++)
  {
   init_array<<<blocks, threads, 0, streams[i]>>>(d_a + i * n / nstreams, d_c, niterations);
  }
  // asynchronously issue nstreams memcopies, one per stream
  for (int i = 0; i < nstreams; i++)
  {
   cudaMemcpyAsync(h_a + i * n / nstreams, d_a + i * n / nstreams, nbytes / nstreams, cudaMemcpyDeviceToHost, streams[i]);
  }
 }
 cudaEventRecord(stop_event, 0);
 cudaEventSynchronize(stop_event);
 cudaEventElapsedTime(&elapsed_time, start_event, stop_event);
 printf("%d streams:\t%.2f\n", nstreams, elapsed_time / nreps);
    
 // release resources
 for (int i = 0; i < nstreams; i++) { cudaStreamDestroy(streams[i]); }
 free(streams);
 cudaEventDestroy(start_event);
 cudaEventDestroy(stop_event);
 cudaFreeHost(h_a);   // h_a came from cudaMallocHost, so cudaFree would be wrong here
 cudaFree(d_a);
 cudaFree(d_c);
 
 cudaDeviceReset();
}

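Explanation: simpleStreams.cu creates the streams and events, then times four things in turn: a memory copy, a kernel launched in a stream, kernel execution plus copy without streams, and the overall run with nstreams streams. To make the timing more accurate, the last two measurements are repeated nreps times and averaged.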

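52. Streams in CUDA

Explanation: Within a given stream, operations execute in order. Operations in different streams have no ordering guarantee with respect to each other and may also execute in parallel. A stream is defined by creating a cudaStream_t object and passing it as an argument to kernel launches and memory copies. Operations given the same stream argument belong to the same stream, and operations given different arguments belong to different streams.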

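53. Tegra

Explanation: Tegra is a brand of ARM-based general-purpose processors introduced by NVIDIA (that is, CPUs; NVIDIA calls it a "Computer on a chip"). It gives portable devices high performance at low power consumption.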

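54. Installing CUDA 10.1 on Ubuntu 16.04

Explanation:

(1) Disable the nouveau driver

lsmod | grep nouveau                      # check whether nouveau is loaded
sudo vim /etc/modprobe.d/blacklist.conf   # append the following two lines to this file
blacklist nouveau
options nouveau modeset=0
sudo update-initramfs -u
sudo reboot
lsmod | grep nouveau                      # should print nothing after the reboot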

(2) Install CUDA from the runfile in text (console) mode

sudo service lightdm stop
sudo sh cuda_10.1.168_418.67_linux.run --no-opengl-libs
ls /dev/nvidia*                                  # check that the device nodes exist

To uninstall later, if needed:

sudo /usr/local/cuda-10.1/bin/cuda-uninstaller
sudo /usr/bin/nvidia-uninstall

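(3) Set the environment variables in /etc/profile

sudo vim /etc/profile                     # append the two export lines below
export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
cat /proc/driver/nvidia/version           # check the driver version
nvcc -V                                   # check the compiler version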

(4) Build the samples shipped with CUDA

cd /home/xxx/NVIDIA_CUDA-10.1_Samples
make
cd /home/xxx/NVIDIA_CUDA-10.1_Samples/bin/x86_64/linux/release
./deviceQuery
./bandwidthTest

(5) Install cuDNN

sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/

(6) Check that cuDNN was installed successfully

~/Downloads$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 4
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
 
#include "driver_types.h"

Note: At this point CUDA and cuDNN are both installed.

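55. CMakeLists.txt

Explanation:

[1] MESSAGE(STATUS "Project: SERVER"): prints a message.
[2] SET(CMAKE_BUILD_TYPE DEBUG): sets the build type, Debug or Release.
[3] SET(CMAKE_C_FLAGS_DEBUG "-g -Wall"): sets the C compiler flags for Debug builds.
[4] CMAKE_C_FLAGS_DEBUG: the C compiler flags used for Debug builds.
[5] CMAKE_CXX_FLAGS_DEBUG: the C++ compiler flags used for Debug builds.
[6] -g: tells the compiler to generate debugging information at compile time.
[7] -Wall: enables the commonly used set of warnings.
[8] ADD_SUBDIRECTORY(): adds a subdirectory to the build.
[9] SET(*.cpp): sets a variable, here one that stands for all source files.
[10] INCLUDE_DIRECTORIES: the directories of the required header files.
[11] LINK_DIRECTORIES: the directories of the required library files.
[12] ADD_LIBRARY: builds a library (a static library unless SHARED is requested).
[13] TARGET_LINK_LIBRARIES: the libraries that a target depends on.
[14] SET_TARGET_PROPERTIES: sets target properties, for example the output path of the generated executable.
[15] add_executable: defines the executable target to build.
[16] find_package(): locates a dependency package.
[17] FILE(GLOB EXTENDED doc/**): collects the files under the doc folder.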

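56. CUDA API

Explanation: There are two APIs, the Runtime API and the Driver API, and each has its own range of uses.

57. PTX code

Explanation: Parallel Thread eXecution (PTX) code is an intermediate form of compiled GPU code. It can be compiled again into native GPU microcode.

58. Clang

Explanation: Clang is a lightweight compiler for C, C++, and Objective-C. Its source code was released under a BSD-style license. Clang supports C++14 features such as generic lambda expressions, simplified return type handling (return type deduction), and better handling of the constexpr keyword.

59. clang-tidy

Explanation: clang-tidy is a Clang-based static code analysis framework that supports C++, C, and Objective-C.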

References:

[1] CUDA programming: http://www.cnblogs.com/stewart/archive/2013/01/05/2846860.html

[2] Classic reference books for learning DirectX: http://blog.csdn.net/kuangfengwu/article/details/7344009

[3] Advanced Applications of Digital Image Processing: Implementations Based on MATLAB and CUDA

[4] Texture Memory in CUDA: http://blog.csdn.net/qq_25716575/article/details/52444686

[5] Using CUDA textures: http://preston2006.blog.sohu.com/253531751.html

[6] Characteristics and use of CUDA texture memory: http://blog.csdn.net/darkstorm2111203/article/details/4294012

[7] CUDA compute capability lookup table: https://blog.csdn.net/allyli0022/article/details/54628987
