▶ Following the example in the book, use the async clause to implement asynchronous computation between the host and the device

● Code; for the non-async version, simply delete every async clause and the #pragma acc wait line

#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

#define N     10240000
#define COUNT 200                           // repeat the computation to increase the runtime

int main()
{
    int *a = (int *)malloc(sizeof(int) * N);
    int *b = (int *)malloc(sizeof(int) * N);
    int *c = (int *)malloc(sizeof(int) * N);

    #pragma acc enter data create(a[0:N]) async

    for (int i = 0; i < COUNT; i++)         // assign a on the device
    {
        #pragma acc parallel loop async
        for (int j = 0; j < N; j++)
            a[j] = (i + j) * 2;             // the constant was lost from the original listing; 2 is a stand-in
    }

    for (int i = 0; i < COUNT; i++)         // assign b on the host
    {
        for (int j = 0; j < N; j++)
            b[j] = (i + j) * 2;             // stand-in constant, as above
    }

    #pragma acc update host(a[0:N]) async   // with async, a must be updated here, otherwise it enters the computation of c before being synchronized
    #pragma acc wait                        // remove this line for the non-async version

    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    #pragma acc update device(a[0:N]) async // serves no purpose, just adds runtime
    #pragma acc exit data delete(a[0:N])

    printf("\nc[1] = %d\n", c[1]);
    free(a);
    free(b);
    free(c);
    //getchar();
    return 0;
}
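
● The same waiting and polling can also be expressed through the OpenACC runtime API instead of directives. Below is a minimal sketch (my addition, not from the book): acc_async_test() polls a queue without blocking, and acc_wait() is the blocking equivalent of #pragma acc wait; both are standard OpenACC 2.0 runtime routines.

#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

#define N 10240000

int main()
{
    int *a = (int *)malloc(sizeof(int) * N);

    #pragma acc enter data create(a[0:N]) async(1)
    #pragma acc parallel loop async(1)
    for (int j = 0; j < N; j++)
        a[j] = j * 2;

    int polls = 0;                  // stand-in for useful host-side work
    while (!acc_async_test(1))      // non-blocking: returns nonzero once queue 1 is drained
        polls++;

    acc_wait(1);                    // blocking, same effect as "#pragma acc wait(1)"

    #pragma acc update host(a[0:N])
    #pragma acc exit data delete(a[0:N])

    printf("a[1] = %d, polled %d times\n", a[1], polls);
    free(a);
    return 0;
}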

● Output (with or without async, the differences lie only in the reported line numbers and in the timing)

//------------------------------------------------------------------------------ without async
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
main:
    Generating enter data create(a[:])
    Accelerator kernel generated
    Generating Tesla code
    #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    Generating implicit copyout(a[:])
    Generating update self(a[:])
    Generating update device(a[:])
    Generating exit data delete(a[:])

D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
    line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
    line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=
... // omitted
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
    line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=

c[1] =
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

Accelerator Kernel Timing data
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
  main  NVIDIA  devicenum=
    time(us):
    data region reached 1 time
    compute region reached times
        kernel launched times
            grid: []  block: []
            elapsed time(us): total= max= min= avg=
    data region reached times
    update directive reached 1 time
        data copyout transfers:
            device time(us): total= max= min= avg=
    update directive reached 1 time
        data copyin transfers:
            device time(us): total= max= min= avg=
    data region reached 1 time

//------------------------------------------------------------------------------ with async
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
main:
    Generating enter data create(a[:])
    Accelerator kernel generated
    Generating Tesla code
    #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    Generating implicit copyout(a[:])
    Generating update self(a[:])
    Generating update device(a[:])
    Generating exit data delete(a[:])

D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
    line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
    line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=
... // omitted
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
    line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=

c[1] =
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to 1 to disable async() clauses
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
  main  NVIDIA  devicenum=
    time(us):
    data region reached 1 time
    compute region reached times
        kernel launched times
            grid: []  block: []
            elapsed time(us): total= max= min= avg=
    data region reached times
    update directive reached 1 time
        data copyout transfers:
            device time(us): total= max= min= avg=
    update directive reached 1 time
        data copyin transfers:
            device time(us): total= max= min= avg=
    data region reached 1 time

● The Nvvp results: I honestly could not see any significant difference; perhaps the example is not well chosen.
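
● One way to make the overlap measurable without Nvvp is a plain wall-clock timer around the overlapped section. A minimal sketch (my addition), assuming the Windows C runtime where clock() tracks wall time — on Linux it measures CPU time, so omp_get_wtime() would be the better choice there. With async, the measured span should approach max(device loop, host loop) rather than their sum.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10240000

int main()
{
    int *a = (int *)malloc(sizeof(int) * N);
    int *b = (int *)malloc(sizeof(int) * N);

    clock_t t0 = clock();

    #pragma acc enter data create(a[0:N]) async
    #pragma acc parallel loop async         // the device fills a ...
    for (int j = 0; j < N; j++)
        a[j] = j * 2;

    for (int j = 0; j < N; j++)             // ... while the host fills b
        b[j] = j * 2;

    #pragma acc update host(a[0:N]) async
    #pragma acc wait                        // both sides must be done before we stop the clock

    clock_t t1 = clock();
    printf("overlapped span: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    #pragma acc exit data delete(a[0:N])
    free(a);
    free(b);
    return 0;
}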

● Using two command queues on one device at the same time

#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

#define N     10240000
#define COUNT 200

int main()
{
    int *a = (int *)malloc(sizeof(int) * N);
    int *b = (int *)malloc(sizeof(int) * N);
    int *c = (int *)malloc(sizeof(int) * N);

    #pragma acc enter data create(a[0:N]) async(1)
    for (int i = 0; i < COUNT; i++)
    {
        #pragma acc parallel loop async(1)  // queue 1 fills a
        for (int j = 0; j < N; j++)
            a[j] = (i + j) * 2;             // stand-in constant; the original value was lost
    }

    #pragma acc enter data create(b[0:N]) async(2)
    for (int i = 0; i < COUNT; i++)
    {
        #pragma acc parallel loop async(2)  // queue 2 fills b concurrently
        for (int j = 0; j < N; j++)
            b[j] = (i + j) * 2;
    }

    #pragma acc enter data create(c[0:N]) async(2)
    #pragma acc wait(1) async(2)            // queue 2 waits for queue 1 without blocking the host

    #pragma acc parallel loop async(2)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    #pragma acc update host(c[0:N]) async(2)
    #pragma acc wait(2)                     // not in the original listing: ensure the async update of c completes before the host reads it
    #pragma acc exit data delete(a[0:N], b[0:N], c[0:N])

    printf("\nc[1] = %d\n", c[1]);
    free(a);
    free(b);
    free(c);
    //getchar();
    return 0;
}
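
● The cross-queue dependency #pragma acc wait(1) async(2) also has a runtime-API counterpart, acc_wait_async(wait_arg, async_arg), which enqueues the wait on queue 2 without stalling the host thread. A minimal sketch of the same two-queue pattern (my addition, not from the book):

#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

#define N 10240000

int main()
{
    int *a = (int *)malloc(sizeof(int) * N);
    int *b = (int *)malloc(sizeof(int) * N);
    int *c = (int *)malloc(sizeof(int) * N);

    #pragma acc enter data create(a[0:N], b[0:N], c[0:N])

    #pragma acc parallel loop async(1)      // queue 1 fills a
    for (int j = 0; j < N; j++)
        a[j] = j;

    #pragma acc parallel loop async(2)      // queue 2 fills b concurrently
    for (int j = 0; j < N; j++)
        b[j] = j;

    acc_wait_async(1, 2);                   // queue 2 may not start the sum before queue 1 is done with a

    #pragma acc parallel loop async(2)
    for (int j = 0; j < N; j++)
        c[j] = a[j] + b[j];

    #pragma acc update host(c[0:N]) async(2)
    acc_wait(2);                            // the host needs c now, so block here

    #pragma acc exit data delete(a[0:N], b[0:N], c[0:N])

    printf("c[1] = %d\n", c[1]);
    free(a);
    free(b);
    free(c);
    return 0;
}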

● Output

D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
main:
    Generating enter data create(a[:])
    Accelerator kernel generated
    Generating Tesla code
    #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    Generating implicit copyout(a[:])
    Generating enter data create(b[:])
    Accelerator kernel generated
    Generating Tesla code
    #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    Generating implicit copyout(b[:])
    Generating enter data create(c[:])
    Accelerator kernel generated
    Generating Tesla code
    #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    Generating implicit copyout(c[:])
    Generating implicit copyin(b[:],a[:])
    Generating update self(c[:])
    Generating exit data delete(c[:],b[:],a[:])

D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe

c[1] =
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to 1 to disable async() clauses
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
  main  NVIDIA  devicenum=
    time(us):
    data region reached 1 time
    compute region reached times
        kernel launched times
            grid: []  block: []
            elapsed time(us): total= max= min= avg=
    data region reached times
    data region reached 1 time
    compute region reached times
        kernel launched times
            grid: []  block: []
            elapsed time(us): total= max= min= avg=
    data region reached times
    data region reached 1 time
    compute region reached 1 time
        kernel launched 1 time
            grid: []  block: []
            device time(us): total= max= min= avg=
    data region reached times
    update directive reached 1 time
        data copyout transfers:
            device time(us): total= max= min= avg=
    data region reached 1 time

● In Nvvp, the two command queues can be seen executing in an interleaved fashion
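
● If the Nvvp timeline is hard to read, the command-line profiler exposes the same queue-to-stream mapping. Something like the run below should list the kernels launched from the two async queues under two different CUDA streams (--print-gpu-trace is a standard nvprof option; pgprof is PGI's rebadged nvprof, so it should accept it as well):

D:\Code\OpenACC\OpenACCProject\OpenACCProject>nvprof --print-gpu-trace main_acc.exe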

● Use the pgaccelinfo command at the PGI command line to view device information

D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgaccelinfo

CUDA Driver Version:           

Device Number:
Device Name: GeForce GTX
Device Revision Number: 6.1
Global Memory Size:
Number of Multiprocessors:
Concurrent Copy and Execution: Yes
Total Constant Memory:
Total Shared Memory per Block:
Registers per Block:
Warp Size:
Maximum Threads per Block:
Maximum Block Dimensions: , ,
Maximum Grid Dimensions: x x
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: MHz
Memory Bus Width: bits
L2 Cache Size: bytes
Max Threads Per SMP:
Async Engines: 2 // two async engines, supporting two command queues running in parallel
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Compiler Option: -ta=tesla:cc60
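
● The same properties can also be queried from code via the CUDA runtime API; asyncEngineCount is the field behind the "Async Engines" line above. A minimal sketch (my addition), assuming the CUDA toolkit headers are available, e.g. compiling with pgcc -Mcuda or nvcc:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
    {
        printf("no CUDA device found\n");
        return 1;
    }
    printf("Device Name:        %s\n", prop.name);
    printf("Compute Capability: %d.%d\n", prop.major, prop.minor);
    printf("Async Engines:      %d\n", prop.asyncEngineCount);  // 2 means copies in both directions can overlap with kernels
    printf("Concurrent Kernels: %s\n", prop.concurrentKernels ? "Yes" : "No");
    return 0;
}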
