OpenACC 异步计算
▶ 按照书上的例子,使用 async 导语实现主机与设备端的异步计算
● 代码,非异步的代码只要将其中的 async 以及第 29 行删除即可
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h> #define N 10240000
#define COUNT 200 // 多算几次,增加耗时 int main()
{
int *a = (int *)malloc(sizeof(int)*N);
int *b = (int *)malloc(sizeof(int)*N);
int *c = (int *)malloc(sizeof(int)*N); #pragma acc enter data create(a[0:N]) async // 在设备上赋值 a
for (int i = ; i < COUNT; i++)
{
#pragma acc parallel loop async
for (int j = ; j < N; j++)
a[j] = (i + j) * ;
} for (int i = ; i < COUNT; i++) // 在主机上赋值 b
{
for (int j = ; j < N; j++)
b[j] = (i + j) * ;
} #pragma acc update host(a[0:N]) async // 异步必须 update a,否则还没同步就参与 c 的运算
#pragma acc wait // 非异步时去掉该行 for (int i = ; i < N; i++)
c[i] = a[i] + b[i]; #pragma acc update device(a[0:N]) async // 没啥用,增加耗时
#pragma acc exit data delete(a[0:N]) printf("\nc[1] = %d\n", c[]);
free(a);
free(b);
free(c);
//getchar();
return ;
}
● 输出结果(是否异步,差异仅在行号、耗时上)
//+-----------------------------------------------------------------------------非异步
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
main:
, Generating enter data create(a[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(a[:])
, Generating update self(a[:])
, Generating update device(a[:])
Generating exit data delete(a[:]) D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block= ... // 省略 launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block= c[] =
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete. Accelerator Kernel Timing data
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
main NVIDIA devicenum=
time(us): ,
: data region reached time
: compute region reached times
: kernel launched times
grid: [] block: []
elapsed time(us): total=, max= min= avg=
: data region reached times
: update directive reached time
: data copyout transfers:
device time(us): total=, max=, min= avg=,
: update directive reached time
: data copyin transfers:
device time(us): total=, max=, min= avg=,
: data region reached time //------------------------------------------------------------------------------有异步
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
main:
, Generating enter data create(a[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(a[:])
, Generating update self(a[:])
, Generating update device(a[:])
Generating exit data delete(a[:]) D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block= ... // 省略 launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block= c[] =
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete. Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to to disable async() clauses
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
main NVIDIA devicenum=
time(us): ,
: data region reached time
: compute region reached times
: kernel launched times
grid: [] block: []
elapsed time(us): total=, max= min= avg=
: data region reached times
: update directive reached time
: data copyout transfers:
device time(us): total=, max=, min= avg=,
: update directive reached time
: data copyin transfers:
device time(us): total=, max=, min= avg=,
: data region reached time
● Nvvp 的结果,我是真没看出来有较大的差别,可能例子举得不够好

● 在一个设备上同时使用两个命令队列
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h> #define N 10240000
#define COUNT 200 int main()
{
int *a = (int *)malloc(sizeof(int)*N);
int *b = (int *)malloc(sizeof(int)*N);
int *c = (int *)malloc(sizeof(int)*N); #pragma acc enter data create(a[0:N]) async(1)
for (int i = ; i < COUNT; i++)
{
#pragma acc parallel loop async(1)
for (int j = ; j < N; j++)
a[j] = (i + j) * ;
} #pragma acc enter data create(b[0:N]) async(2)
for (int i = ; i < COUNT; i++)
{
#pragma acc parallel loop async(2)
for (int j = ; j < N; j++)
b[j] = (i + j) * ;
} #pragma acc enter data create(c[0:N]) async(2)
#pragma acc wait(1) async(2) #pragma acc parallel loop async(2)
for (int i = ; i < N; i++)
c[i] = a[i] + b[i]; #pragma acc update host(c[0:N]) async(2)
#pragma acc exit data delete(a[0:N], b[0:N], c[0:N]) printf("\nc[1] = %d\n", c[]);
free(a);
free(b);
free(c);
//getchar();
return ;
}
● 输出结果
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
main:
, Generating enter data create(a[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(a[:])
, Generating enter data create(b[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(b[:])
, Generating enter data create(c[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(c[:])
Generating implicit copyin(b[:],a[:])
, Generating update self(c[:])
Generating exit data delete(c[:],b[:],a[:]) D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe c[] =
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete. Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to to disable async() clauses
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
main NVIDIA devicenum=
time(us): ,
: data region reached time
: compute region reached times
: kernel launched times
grid: [] block: []
elapsed time(us): total=, max= min= avg=
: data region reached times
: data region reached time
: compute region reached times
: kernel launched times
grid: [] block: []
elapsed time(us): total=, max= min= avg=
: data region reached times
: data region reached time
: compute region reached time
: kernel launched time
grid: [] block: []
device time(us): total= max= min= avg=
: data region reached times
: update directive reached time
: data copyout transfers:
device time(us): total=, max=, min= avg=,
: data region reached time
● Nvvp 中,可以看到两个命令队列交替执行

● 在 PGI 命令行中使用命令 pgaccelinfo 查看设备信息
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgaccelinfo CUDA Driver Version: Device Number:
Device Name: GeForce GTX
Device Revision Number: 6.1
Global Memory Size:
Number of Multiprocessors:
Concurrent Copy and Execution: Yes
Total Constant Memory:
Total Shared Memory per Block:
Registers per Block:
Warp Size:
Maximum Threads per Block:
Maximum Block Dimensions: , ,
Maximum Grid Dimensions: x x
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: MHz
Memory Bus Width: bits
L2 Cache Size: bytes
Max Threads Per SMP:
Async Engines: // 有两个异步引擎,支持两个命令队列并行
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Compiler Option: -ta=tesla:cc60
OpenACC 异步计算的更多相关文章
- Task:取消异步计算限制操作 & 捕获任务中的异常
Why:ThreadPool没有内建机制标记当前线程在什么时候完成,也没有机制在操作完成时获得返回值,因而推出了Task,更精确的管理异步线程. How:通过构造方法的参数TaskCreationOp ...
- 13.FutureTask异步计算
FutureTask 1.可取消的异步计算,FutureTask实现了Future的基本方法,提供了start.cancel 操作,可以查询计算是否完成,并且可以获取计算 的结果.结果 ...
- 怎样给ExecutorService异步计算设置超时
ExecutorService接口使用submit方法会返回一个Future<V>对象.Future表示异步计算的结果.它提供了检查计算是否完毕的方法,以等待计算的完毕,并获取计算的结果. ...
- java异步计算Future的使用(转)
从jdk1.5开始我们可以利用Future来跟踪异步计算的结果.在此之前主线程要想获得工作线程(异步计算线程)的结果是比较麻烦的事情,需要我们进行特殊的程序结构设计,比较繁琐而且容易出错.有了Futu ...
- 使用QFuture类监控异步计算的结果
版权声明:本文为博主原创文章,未经博主允许不得转载. https://blog.csdn.net/Amnes1a/article/details/65630701在Qt中,为我们提供了好几种使用线程的 ...
- gearman(异步计算)学习
Gearman是什么? 它是分布式的程序调用框架,可完成跨语言的相互调 用,适合在后台运行工作任务.最初是2005年perl版本,2008年发布C/C++版本.目前大部分源码都是(Gearmand服务 ...
- OpenACC 绘制曼德勃罗集
▶ 书上第四章,用一系列步骤优化曼德勃罗集的计算过程. ● 代码 // constants.h ; ; ; ; const double xmin=-1.7; ; const double ymin= ...
- OpenACC Julia 图形
▶ 书上的代码,逐步优化绘制 Julia 图形的代码 ● 无并行优化(手动优化了变量等) #include <stdio.h> #include <stdlib.h> #inc ...
- 如何解救在异步Java代码中已检测的异常
Java语言通过已检测异常语法所提供的静态异常检测功能非常实用,通过它程序开发人员可以用很便捷的方式表达复杂的程序流程. 实际上,如果某个函数预期将返回某种类型的数据,通过已检测异常,很容易就可以扩展 ...
随机推荐
- ES6必知必会 (二)—— 字符串和函数的拓展
字符串的拓展 1.ES6为字符串添加了遍历器接口,因此可以使用for...of循环遍历字符串 2.字符串新增的 includes().startsWith().endsWidth() 三个方法用于判断 ...
- index.do
- 基于TLS(线程局部存储)的高效timelog实现
什么是timelog? 我们在分析程序性能的时候,会加入的一些logging信息记录每一部分的时间信息 timelog模块的功能就是提供统一的接口来允许添加和保存logging 我们正在用的timel ...
- Why my setting does not work?
lab mypc server7000 -> 5900 1080 -> 10800 10800 -> inter ...
- js 逻辑的短路运算
&& 与运算 同时为true,才为true: 表达式1为false,不用看表达式2: || 或运算 有一个为true,就为true: 表达式1为true,不用看表达式2: && ...
- JavaScriptSerializer类 对象序列化为JSON,JSON反序列化为对象 。
JavaScriptSerializer 类由异步通信层内部使用,用于序列化和反序列化在浏览器和 Web 服务器之间传递的数据.说白了就是能够直接将一个C#对象传送到前台页面成为javascript对 ...
- 【linux】Linux 进程状态
linux上进程有5种状态: 1. 运行(正在运行或在运行队列中等待) 2. 中断(休眠中, 受阻, 在等待某个条件的形成或接受到信号) 3. 不可中断(收到信号不唤醒和不可运行, 进程必须等待直到有 ...
- SqlBulkCopy(批量复制)使用方法
SqlBulkCopy提供了一种将数据复制到Sql Server数据库表中高性能的方法.SqlBulkCopy 包含一个方法 WriteToServer,它用来从数据的源复制数据到数据的目的地. Wr ...
- 【FusionCharts学习-3】显示中国地图
概述 使用FusionCharts显示中国地图 资源获取 地图下载地址:http://www.fusioncharts.com/download/maps/definition/ 将下载的地图拷贝 ...
- Bezier画线算法
编译器:VS2013 描述:Bezier画线是利用导数相同拼接曲线,使曲线十分光滑,而不是随意拼接观赏性很差 主函数段 #include "stdafx.h" #include&l ...