OpenACC Asynchronous Computation
▶ Following the example in the book, use the async clause to overlap computation on the host and the device.
● Code. To get the non-asynchronous version, simply remove the async clauses and the #pragma acc wait directive (marked in the code below).
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

#define N     10240000
#define COUNT 200                           // repeat the computation to pad the run time

int main()
{
    int *a = (int *)malloc(sizeof(int) * N);
    int *b = (int *)malloc(sizeof(int) * N);
    int *c = (int *)malloc(sizeof(int) * N);

#pragma acc enter data create(a[0:N]) async

    for (int i = 0; i < COUNT; i++)         // fill a on the device
    {
#pragma acc parallel loop async
        for (int j = 0; j < N; j++)
            a[j] = (i + j) * 2;             // assumed constant; the original multiplier is missing
    }

    for (int i = 0; i < COUNT; i++)         // fill b on the host
    {
        for (int j = 0; j < N; j++)
            b[j] = (i + j) * 2;             // assumed constant; the original multiplier is missing
    }

#pragma acc update host(a[0:N]) async       // with async, a must be updated here, otherwise it joins the computation of c before it has been synchronized
#pragma acc wait                            // remove this line for the non-async version

    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

#pragma acc update device(a[0:N]) async     // serves no real purpose, just adds run time
#pragma acc exit data delete(a[0:N])

    printf("\nc[1] = %d\n", c[1]);

    free(a);
    free(b);
    free(c);
    //getchar();
    return 0;
}
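For reference, the same overlap can also be driven from the OpenACC runtime API instead of the wait directive, which lets the host poll rather than block. A minimal sketch, assuming the standard acc_async_test_all / acc_wait_all routines; the array and the host-side counter are illustrative only, not part of the book's example:

#include <stdio.h>
#include <openacc.h>

#define M 1000000

int main(void)
{
    static float x[M];

#pragma acc parallel loop async          // enqueue device work; the host continues
    for (int i = 0; i < M; i++)
        x[i] = i * 0.5f;

    int chunks = 0;
    while (!acc_async_test_all())        // nonzero once all async activity is done
        chunks++;                        // stand-in for useful host-side work

    acc_wait_all();                      // same effect as "#pragma acc wait"
    printf("x[1] = %f after %d host iterations\n", x[1], chunks);
    return 0;
}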
● Output (async vs. non-async differ only in the reported line numbers and timings)
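The "launch CUDA kernel" lines below are produced by PGI's PGI_ACC_NOTIFY environment variable, and the "Accelerator Kernel Timing data" summary by PGI_ACC_TIME.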
//+----------------------------------------------------------------------------- non-asynchronous
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
main:
, Generating enter data create(a[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(a[:])
, Generating update self(a[:])
, Generating update device(a[:])
Generating exit data delete(a[:])

D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=
... // omitted
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=

c[1] =
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

Accelerator Kernel Timing data
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
main NVIDIA devicenum=
time(us): ,
: data region reached time
: compute region reached times
: kernel launched times
grid: [] block: []
elapsed time(us): total=, max= min= avg=
: data region reached times
: update directive reached time
: data copyout transfers:
device time(us): total=, max=, min= avg=,
: update directive reached time
: data copyin transfers:
device time(us): total=, max=, min= avg=,
: data region reached time

//------------------------------------------------------------------------------ asynchronous
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
main:
, Generating enter data create(a[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(a[:])
, Generating update self(a[:])
, Generating update device(a[:])
Generating exit data delete(a[:])

D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=
... // omitted
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=

c[1] =
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to 1 to disable async() clauses
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
main NVIDIA devicenum=
time(us): ,
: data region reached time
: compute region reached times
: kernel launched times
grid: [] block: []
elapsed time(us): total=, max= min= avg=
: data region reached times
: update directive reached time
: data copyout transfers:
device time(us): total=, max=, min= avg=,
: update directive reached time
: data copyin transfers:
device time(us): total=, max=, min= avg=,
: data region reached time
● As for the nvvp results, I honestly could not see much of a difference; perhaps the example is not a good one.
● Using two command queues on one device at the same time
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

#define N     10240000
#define COUNT 200

int main()
{
    int *a = (int *)malloc(sizeof(int) * N);
    int *b = (int *)malloc(sizeof(int) * N);
    int *c = (int *)malloc(sizeof(int) * N);

#pragma acc enter data create(a[0:N]) async(1)
    for (int i = 0; i < COUNT; i++)         // fill a on queue 1
    {
#pragma acc parallel loop async(1)
        for (int j = 0; j < N; j++)
            a[j] = (i + j) * 2;             // assumed constant; the original multiplier is missing
    }

#pragma acc enter data create(b[0:N]) async(2)
    for (int i = 0; i < COUNT; i++)         // fill b on queue 2
    {
#pragma acc parallel loop async(2)
        for (int j = 0; j < N; j++)
            b[j] = (i + j) * 2;             // assumed constant; the original multiplier is missing
    }

#pragma acc enter data create(c[0:N]) async(2)
#pragma acc wait(1) async(2)                // queue 2 waits for queue 1 without blocking the host

#pragma acc parallel loop async(2)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

#pragma acc update host(c[0:N]) async(2)
#pragma acc wait(2)                         // make sure the update of c has finished before it is read on the host
#pragma acc exit data delete(a[0:N], b[0:N], c[0:N])

    printf("\nc[1] = %d\n", c[1]);

    free(a);
    free(b);
    free(c);
    //getchar();
    return 0;
}
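The wait(1) async(2) directive is what chains the two queues: the wait is enqueued into queue 2 instead of stalling the host thread. The runtime API exposes the same operation; a minimal sketch, assuming the standard acc_wait_async routine and the queue numbers used above:

#include <openacc.h>

// Equivalent of "#pragma acc wait(1) async(2)": queue 2 will not start any
// work enqueued after this call until everything already on queue 1 is done.
// The host thread returns immediately.
void chain_queue2_after_queue1(void)
{
    acc_wait_async(1, 2);
}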
● Output
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
main:
, Generating enter data create(a[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(a[:])
, Generating enter data create(b[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(b[:])
, Generating enter data create(c[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(c[:])
Generating implicit copyin(b[:],a[:])
, Generating update self(c[:])
Generating exit data delete(c[:],b[:],a[:])

D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe

c[1] =
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to 1 to disable async() clauses
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
main NVIDIA devicenum=
time(us): ,
: data region reached time
: compute region reached times
: kernel launched times
grid: [] block: []
elapsed time(us): total=, max= min= avg=
: data region reached times
: data region reached time
: compute region reached times
: kernel launched times
grid: [] block: []
elapsed time(us): total=, max= min= avg=
: data region reached times
: data region reached time
: compute region reached time
: kernel launched time
grid: [] block: []
device time(us): total= max= min= avg=
: data region reached times
: update directive reached time
: data copyout transfers:
device time(us): total=, max=, min= avg=,
: data region reached time
● In nvvp, the two command queues can be seen executing in an interleaved fashion.
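This matches the PGI implementation, which maps each OpenACC async queue onto its own CUDA stream, so the kernels and transfers of queue 1 and queue 2 appear as separate, overlapping rows on the timeline.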
● Use the pgaccelinfo command in the PGI command line to inspect the device
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgaccelinfo

CUDA Driver Version:
Device Number:
Device Name: GeForce GTX
Device Revision Number: 6.1
Global Memory Size:
Number of Multiprocessors:
Concurrent Copy and Execution: Yes
Total Constant Memory:
Total Shared Memory per Block:
Registers per Block:
Warp Size:
Maximum Threads per Block:
Maximum Block Dimensions: , ,
Maximum Grid Dimensions: x x
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: MHz
Memory Bus Width: bits
L2 Cache Size: bytes
Max Threads Per SMP:
Async Engines: 2               // two async engines, so two command queues can run in parallel
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Compiler Option: -ta=tesla:cc60
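The "Async Engines" entry can also be queried programmatically. A minimal sketch using the CUDA runtime API (cudaGetDeviceProperties and its asyncEngineCount field; device number 0 is an assumption):

#include <stdio.h>
#include <cuda_runtime.h>

// Query the engine counts that pgaccelinfo reports. asyncEngineCount == 2
// means two copies can overlap with kernel execution, which is what lets the
// two OpenACC command queues above run concurrently.
int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                      // device 0 assumed
    printf("Async engines:      %d\n", prop.asyncEngineCount);
    printf("Concurrent kernels: %d\n", prop.concurrentKernels);
    return 0;
}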