GPUs can handle thousands of concurrent threads.

The pieces of code running on the GPU are called kernels.

A kernel is executed by a set of threads.

All threads execute the same code (SPMD).

Each thread has an index that it uses to calculate the memory addresses it will access.

Threads are grouped into blocks.

Blocks are grouped into a grid.

A kernel is executed as a grid of blocks of threads.

Built-in variables: threadIdx, blockIdx, blockDim, gridDim.
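The built-in variables are usually combined into a single global index, so that one kernel can use many blocks of many threads at once. The sketch below shows one minimal way to do that. The kernel name addKernelGrid, the helper countMismatches, the block size of 256, and the use of managed memory through cudaMallocManaged are assumptions made for illustration; none of them come from the original material.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread handles one element, located by its global index.
__global__ void addKernelGrid(int *c, const int *a, const int *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // block offset + position inside the block
    if (i < n) {                                    // the last block may have spare threads
        c[i] = a[i] + b[i];
    }
}

// Hypothetical helper: runs the kernel on n elements and returns the number of
// wrong results, or -1 if a CUDA call fails.
int countMismatches(int n, int threadsPerBlock)
{
    int *a = nullptr, *b = nullptr, *c = nullptr;
    size_t bytes = n * sizeof(int);

    // Managed memory (an assumption to keep the sketch short) is visible to host and device.
    if (cudaMallocManaged(&a, bytes) != cudaSuccess ||
        cudaMallocManaged(&b, bytes) != cudaSuccess ||
        cudaMallocManaged(&c, bytes) != cudaSuccess) {
        cudaFree(a); cudaFree(b); cudaFree(c);
        return -1;
    }
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up to cover all n
    addKernelGrid<<<blocks, threadsPerBlock>>>(c, a, b, n);

    int mismatches = -1;
    if (cudaDeviceSynchronize() == cudaSuccess) {   // wait for the kernel before reading c
        mismatches = 0;
        for (int i = 0; i < n; i++) {
            if (c[i] != 3 * i) mismatches++;
        }
    }
    cudaFree(a); cudaFree(b); cudaFree(c);
    return mismatches;
}

int main()
{
    int bad = countMismatches(1000, 256);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

The bounds check matters because the grid is rounded up to whole blocks: with 1000 elements and 256 threads per block, 4 blocks are launched and the last 24 threads have no element to process.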

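CUDA organizes threads in a Grid-Block-Thread hierarchy. A set of threads that run in parallel can be organized into a block, and a set of blocks that run in parallel can be organized into a grid. The program below demonstrates thread parallelism and block parallelism in turn; thread parallelism is fine-grained, while block parallelism is coarse-grained. The thread-parallel launch, for example, is addKernelThread<<<1, size>>>(dev_c, dev_a, dev_b);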

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

#define MAX 255
#define MIN 0

cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size, int type, float *etime);

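// Thread parallelism: one block, each thread adds the element at its thread index.
__global__ void addKernelThread(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

// Block parallelism: one thread per block, each block adds the element at its block index.
__global__ void addKernelBlock(int *c, const int *a, const int *b)
{
    int i = blockIdx.x;
    c[i] = a[i] + b[i];
}
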
int main()
{
    const int arraySize = 5;
    int a[arraySize] = { 1, 2, 3, 4, 5 };
    int b[arraySize] = { 10, 20, 30, 40, 50 };
    // Overwrite the inputs with random values in [MIN, MAX].
    for (int i = 0; i < arraySize; i++) {
        a[i] = rand() % (MAX + 1 - MIN) + MIN;
        b[i] = rand() % (MAX + 1 - MIN) + MIN;
    }
    int c[arraySize] = { 0 };
    // Add vectors in parallel.
    cudaError_t cudaStatus;
    int num = 0;
    float time = 0.0f;
    cudaDeviceProp prop;
    cudaStatus = cudaGetDeviceCount(&num);
    for (int i = 0; i < num; i++)
    {
        cudaGetDeviceProperties(&prop, i);
    }
    // type 0: thread parallelism (one block of arraySize threads).
    cudaStatus = addWithCuda(c, a, b, arraySize, 0, &time);
    printf("Elasped time of thread is : %f \n", time);
    printf("{%d,%d,%d,%d,%d} + {%d,%d,%d,%d,%d} = {%d,%d,%d,%d,%d}\n", a[0], a[1], a[2], a[3], a[4], b[0], b[1], b[2], b[3], b[4], c[0], c[1], c[2], c[3], c[4]);
    // type 1: block parallelism (arraySize blocks of one thread each).
    cudaStatus = addWithCuda(c, a, b, arraySize, 1, &time);
    printf("Elasped time of block is : %f \n", time);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }
    printf("{%d,%d,%d,%d,%d} + {%d,%d,%d,%d,%d} = {%d,%d,%d,%d,%d}\n", a[0], a[1], a[2], a[3], a[4], b[0], b[1], b[2], b[3], b[4], c[0], c[1], c[2], c[3], c[4]);
    // cudaThreadExit must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    // (cudaThreadExit is deprecated in newer CUDA releases in favor of cudaDeviceReset.)
    cudaStatus = cudaThreadExit();
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaThreadExit failed!");
        return 1;
    }
    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
// type == 0 launches addKernelThread; any other value launches addKernelBlock.
cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size, int type, float *etime)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    clock_t start, stop;
    float time;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        goto Error;
    }
    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    // Launch a kernel on the GPU with one thread for each element.
    if (type == 0) {
        start = clock();
        addKernelThread<<<1, size>>>(dev_c, dev_a, dev_b);   // 1 block of size threads
    }
    else {
        start = clock();
        addKernelBlock<<<size, 1>>>(dev_c, dev_a, dev_b);    // size blocks of 1 thread
    }
    // Note: kernel launches are asynchronous, so this interval covers the launch
    // call only, not the kernel's execution on the GPU.
    stop = clock();
    time = (float)(stop - start) / CLOCKS_PER_SEC;
    *etime = time;
    // cudaThreadSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    // (cudaThreadSynchronize is deprecated in newer CUDA releases in favor of cudaDeviceSynchronize.)
    cudaStatus = cudaThreadSynchronize();
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }
    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    return cudaStatus;
}
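
The program calls cudaGetDeviceProperties but never reads the result. The limits it returns are relevant here: a launch such as addKernelThread<<<1, size>>> only works while size does not exceed the device's maximum number of threads per block. As a rough sketch, the limits could be printed like this; the helper name printLaunchLimits is hypothetical, and the fields are members of cudaDeviceProp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the launch limits of every CUDA device.
// Returns the number of devices, or -1 if the device count cannot be read.
int printLaunchLimits()
{
    int count = 0;
    cudaError_t status = cudaGetDeviceCount(&count);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(status));
        return -1;
    }
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            continue;  // skip devices whose properties cannot be queried
        }
        printf("Device %d: %s\n", dev, prop.name);
        printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("  max block dimensions : %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  max grid dimensions  : %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    }
    return count;
}

int main()
{
    return printLaunchLimits() < 0 ? 1 : 0;
}
```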

The output of the run is:

Elasped time of thread is : 0.000010
{103,105,81,74,41} + {198,115,255,236,205} = {301,220,336,310,246}
Elasped time of block is : 0.000005
{103,105,81,74,41} + {198,115,255,236,205} = {301,220,336,310,246}
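
These times come from clock() calls placed around the kernel launch, and a launch returns before the kernel has finished, so they mostly reflect launch overhead. One common alternative is to time the kernel with CUDA events, which are recorded on the GPU's own timeline. The following is a minimal sketch under that assumption; the helpers ok and timeAddKernelThread are hypothetical names, and the inputs are simply zero-filled.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addKernelThread(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

// Hypothetical helper: reports a failed CUDA call and returns false.
static bool ok(cudaError_t status, const char *what)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
        return false;
    }
    return true;
}

// Hypothetical helper: times one launch of addKernelThread<<<1, size>>> with
// CUDA events. Returns the elapsed time in milliseconds, or -1 on error.
float timeAddKernelThread(int size)
{
    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    cudaEvent_t start = nullptr, stop = nullptr;
    float ms = -1.0f;
    size_t bytes = size * sizeof(int);

    bool good = ok(cudaMalloc((void**)&dev_a, bytes), "cudaMalloc")
             && ok(cudaMalloc((void**)&dev_b, bytes), "cudaMalloc")
             && ok(cudaMalloc((void**)&dev_c, bytes), "cudaMalloc")
             && ok(cudaMemset(dev_a, 0, bytes), "cudaMemset")
             && ok(cudaMemset(dev_b, 0, bytes), "cudaMemset")
             && ok(cudaEventCreate(&start), "cudaEventCreate")
             && ok(cudaEventCreate(&stop), "cudaEventCreate");
    if (good) {
        cudaEventRecord(start);                          // marker before the kernel
        addKernelThread<<<1, size>>>(dev_c, dev_a, dev_b);
        cudaEventRecord(stop);                           // marker after the kernel
        good = ok(cudaGetLastError(), "kernel launch")
            && ok(cudaEventSynchronize(stop), "cudaEventSynchronize")  // wait for stop
            && ok(cudaEventElapsedTime(&ms, start, stop), "cudaEventElapsedTime");
        if (!good) ms = -1.0f;
    }
    if (start) cudaEventDestroy(start);
    if (stop) cudaEventDestroy(stop);
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return ms;
}

int main()
{
    float ms = timeAddKernelThread(5);
    if (ms < 0.0f) return 1;
    printf("Kernel time: %f ms\n", ms);
    return 0;
}
```

A single measurement of a five-element kernel is still dominated by fixed costs, and the first launch in a process may include one-time setup, so repeating the launch and averaging gives a steadier figure.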
