OpenACC Data Management Directives

▶ Code and notes for the data management part of Chapter 4 of the book

● Code for copy, copyin, copyout, and create

 #include <stdio.h>

 #include <openacc.h>

 int main()

 {

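     const int length = 1024; // value missing in the source; 1024 is an assumption

     int a[length], b[length], c[length], d[length];

     for (int i = 0; i < length; i++) // initial value missing in the source; 1 is an assumption

         a[i] = b[i] = c[i] = 1;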

     {

 #pragma acc kernels create(d)

         for (int i = 0; i < length; i++)

         {

             a[i] ++;

             c[i] = a[i] + b[i];

             d[i] = 0; // value missing in the source; 0 is an assumption

         }

     }

     for (int i = 0; i < 10; i++) // upper bound missing in the source; 10 is an assumption

         printf("a[%d] = %d, c[%d] = %d\n", i, a[i], i, c[i]);

     getchar();

     return 0;

 }

● Output: the intermediate variable d is created explicitly, while a, b, and c are created implicitly, each with a different copy attribute

D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc -acc -Minfo main.c -o main_acc.exe

main:

     , Generating create(d[:])

         Generating implicit copyout(c[:])

         Generating implicit copyin(b[:])

         Generating implicit copy(a[:])

     , Loop is parallelizable

         Accelerator kernel generated

         Generating Tesla code

         , #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
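The implicit clauses reported by the compiler can also be written out by hand. The following is a minimal sketch, not the book's code: the function name update_arrays and the malloc'd scratch array are assumptions made for illustration. Each clause matches how the array is used inside the loop.

```c
#include <stdlib.h>

/* Hypothetical helper (not from the book): the same loop with every data
   clause spelled out instead of being left to the compiler. */
int update_arrays(int n, int *restrict a, const int *restrict b, int *restrict c)
{
    /* Scratch array; with an accelerator its host contents are never used. */
    int *d = (int *)malloc((size_t)n * sizeof(int));
    if (d == NULL)
        return -1;

#pragma acc kernels copy(a[0:n]) copyin(b[0:n]) copyout(c[0:n]) create(d[0:n])
    for (int i = 0; i < n; i++)
    {
        a[i]++;              /* a is read and written: copy            */
        c[i] = a[i] + b[i];  /* b is only read: copyin; c only written: copyout */
        d[i] = c[i];         /* d never leaves the device: create      */
    }

    free(d);
    return 0;
}
```

Without OpenACC enabled the pragma is ignored and the loop simply runs on the host, so the function can be checked on any machine.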

● Using copyout on its own inside kernels triggers a warning: PGC-W-0996-The directive #pragma acc copyout is deprecated; use #pragma acc declare copyout instead (main.c: XX)
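One way to stay clear of that warning is to attach copyout as a clause to a compute construct or to a structured data construct instead of writing it as a stand-alone directive. A minimal sketch, assuming a hypothetical function fill_doubled that is not part of the book's code:

```c
/* Hypothetical example: c is allocated on the device at region entry and
   copied back to the host when the data region ends. */
void fill_doubled(int n, int *c)
{
#pragma acc data copyout(c[0:n])
    {
#pragma acc parallel loop
        for (int i = 0; i < n; i++)
            c[i] = 2 * i;
    }
}
```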

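● enter data and exit data define unstructured data lifetimes, which is what C++ needs when device data is created in a constructor and released in a destructor.

■ First, PGI on Windows does not support C++ compilation: there is only pgcc.exe and no pgc++*.exe, so this part has to be written on Linux.

■ The code in the book has a problem. In short, an OpenACC copy is a shallow copy: for data structures that contain pointers (such as vector or a class), the objects the pointers refer to are not copied along with them. There are two ways around this. One is to de-structure the data, gathering the data held in the class into plain arrays that can be copied; the other is to use Managed memory, in which case explicit copies are no longer an issue.【https://stackoverflow.com/questions/53860467/how-to-copy-on-gpu-a-vector-of-vector-pointer-memory-allocated-in-openacc】

■ The book's code uses neither solution and fails with "call to cuStreamSynchronize returned error 700: Illegal address during kernel execution". This problem is fairly common【https://stackoverflow.com/search?q=call+to+returned+error+700%3A+Illegal+address+during+kernel+execution】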
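The shallow-copy behaviour described above can be illustrated with a struct that holds a pointer. The sketch below is an assumption-laden illustration, not the book's code: the Buffer type and sum_buffer function are hypothetical. It copies the struct first and then the array it points to, which is the usual manual deep-copy order; copying only the struct would leave a host address in the device copy and lead to the illegal-address error.

```c
#include <stdlib.h>

/* Hypothetical struct containing a pointer member. */
typedef struct
{
    int n;
    int *data;
} Buffer;

long sum_buffer(Buffer *buf)
{
    long sum = 0;
    int n = buf->n;

    /* Step 1: shallow copy of the struct itself (n and the pointer value). */
#pragma acc enter data copyin(buf[0:1])
    /* Step 2: copy the pointee; the runtime attaches it to the device struct. */
#pragma acc enter data copyin(buf->data[0:n])

#pragma acc parallel loop reduction(+:sum) present(buf[0:1])
    for (int i = 0; i < n; i++)
        sum += buf->data[i];

    /* Release in reverse order: member array first, then the struct. */
#pragma acc exit data delete(buf->data[0:n])
#pragma acc exit data delete(buf[0:1])
    return sum;
}
```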

● Using arrays through de-structuring

 #include <iostream>

 #include <vector>

 #include <cstdint>

 using namespace std;

 int main()

 {

     const int vectorCount = 1024, vectorLength = 20;

     long sum = 0;

     vector<int32_t> *vectorTable = new vector<int32_t>[vectorCount]; // 1024 vectors, each filled with 20 elements

     for (int i = 0; i < vectorCount; i++)

     {

         for (int j = 0; j < vectorLength; j++)

             vectorTable[i].push_back(i);

     }

     int32_t **arrayTable = new int32_t *[vectorCount]; // array holding only the vectors' data pointers, parallel to vectorTable

     int *vectorSize = new int[vectorCount];            // size of each vector

 #pragma acc enter data create(arrayTable [0:vectorCount] [0:0]) // create arrayTable on the device; note the dimensions

     for (int i = 0; i < vectorCount; i++)

     {

         int sze = vectorTable[i].size();

         vectorSize[i] = sze;

         arrayTable[i] = vectorTable[i].data();        // store each vector's data pointer in arrayTable

 #pragma acc enter data copyin(arrayTable [i:1][:sze]) // copy each vector's data to the device

     }

 #pragma acc enter data copyin(vectorSize[:vectorCount]) // the vector sizes go to the device as well

 #pragma acc parallel loop gang vector reduction(+: sum) present(arrayTable, vectorSize) // reduction

     for (int i = 0; i < vectorCount; i++)

     {

         for (int j = 0; j < vectorSize[i]; ++j)

             sum += arrayTable[i][j];

     }

     cout << "Sum: " << sum << endl;

 #pragma acc exit data delete (vectorSize)

 #pragma acc exit data delete (arrayTable)

     delete[] vectorSize;

     delete[] arrayTable;

     delete[] vectorTable;

     return 0;

 }

● Output

cuan@CUAN:~$ pgc++ main.cpp -o main.exe --c++ -ta=tesla -Minfo -acc

main:

     , Generating enter data create(arrayTable[:][:])

     , Generating enter data copyin(arrayTable[i][:sze+],vectorSize[:])

         Generating implicit copy(sum)

         Generating present(vectorSize[:])

         Generating Tesla code

         , Generating reduction(+:sum)

         , #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

         , #pragma acc loop seq

     , Generating present(arrayTable[:][:])

     , Loop is parallelizable

     , Generating exit data delete(vectorSize[:],arrayTable[:][:])

cuan@CUAN:~$ ./main.exe

Sum:
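A variant of the de-structuring idea is to pack all rows into one contiguous array plus an offsets array, so that only two flat arrays have to be copied. This is a sketch under assumptions, not the code used above: the function name sumFlattened and the use of a vector of vectors as input are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: flatten the rows, then reduce over the flat data.
long sumFlattened(const std::vector<std::vector<int32_t>> &table)
{
    std::vector<int32_t> flat;        // all elements, row after row
    std::vector<int> offsets(1, 0);   // offsets[i] .. offsets[i + 1] delimit row i
    for (const auto &row : table)
    {
        flat.insert(flat.end(), row.begin(), row.end());
        offsets.push_back(static_cast<int>(flat.size()));
    }

    const int rows = static_cast<int>(table.size());
    const int total = static_cast<int>(flat.size());
    const int32_t *data = flat.data(); // raw pointers: no pointer chasing on the device
    const int *off = offsets.data();
    long sum = 0;

#pragma acc parallel loop gang reduction(+:sum) copyin(data[0:total], off[0:rows + 1])
    for (int i = 0; i < rows; i++)
    {
#pragma acc loop vector reduction(+:sum)
        for (int j = off[i]; j < off[i + 1]; j++)
            sum += data[j];
    }
    return sum;
}
```

For 1024 rows of 20 elements where row i holds the value i, the sum is 20 * (0 + 1 + ... + 1023) = 10475520.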

● Using Managed memory

 #include <iostream>

 using namespace std;

 class ivector

 {

 public:

     int len;

     int *arr;

     ivector(int length)

     {

         len = length;

         arr = new int[len];

 #pragma acc enter data copyin(this)

 #pragma acc enter data create(arr [0:len])

 #pragma acc parallel loop present(arr [0:len])

         for (int iend = len, i = 0; i < iend; i++)              // the temporary iend keeps the compiler from assuming len may change inside the loop and refusing to parallelize

             arr[i] = i;

     }

     ivector(const ivector &s)

     {

         len = s.len;

         arr = new int[len];

 #pragma acc enter data copyin(this)

 #pragma acc enter data create(arr [0:len])

 #pragma acc parallel loop present(arr [0:len], s.arr [0:len])   // s is already on the device as well

         for (int iend = len, i = 0; i < iend; i++)

             arr[i] = s.arr[i];

     }

     ~ivector()

     {

 #pragma acc exit data delete (arr)                              // on destruction, delete arr and then this from the device, in that order

 #pragma acc exit data delete (this)

         cout << "deconstruction!" << endl;

         delete[] arr;

         len = 0;

     }

     int &operator[](int i)

     {

         if (i < 0 || i >= this->len)

             return arr[0];

         return arr[i];

     }

     void add(int c)

     {

 #pragma acc kernels loop present(arr [0:len])                   // every operation that modifies arr has to declare it present

         for (int iend = len, i = 0; i < iend; i++)

             arr[i] += c;

     }

     void updateHost()                                           // manually update the host copy

     {

 #pragma acc update host(arr [0:len])

     }

 };

 int main()

 {

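     ivector s1(1024); // length missing in the source; 1024 is an assumption

     s1.add(10);       // increment missing in the source; 10 is an assumption

     s1.updateHost();

     cout << "s1[1] = " << s1[1] << endl;

     ivector s2(s1);

     s2.updateHost();

     cout << "s2[1] = " << s2[1] << endl;

     return 0;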

 }

● Output; compiling without -ta=tesla:managed produces an error【TODO: still to be investigated】

cuan@CUAN:~$ pgc++ main.cpp -o main.exe --c++ -ta=tesla:managed -Minfo -acc

ivector::ivector(int):

     , Generating enter data copyin(this[:])

         Generating enter data create(arr[:len])

         Generating Tesla code

         , #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

     , Generating implicit copy(this[:])

         Generating present(arr[:len])

ivector::ivector(const ivector&):

     , Generating enter data create(arr[:len])

         Generating enter data copyin(this[:])

         Generating present(arr[:len])

         Generating Tesla code

         , #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

     , Generating implicit copyin(s[:])

         Generating implicit copy(this[:])

         Generating present(s->arr[:len])

ivector::~ivector():

     , Generating exit data delete(this[:],arr[:])

ivector::add(int):

      , Generating Tesla code

     , Accelerator serial kernel generated

         Generating implicit copy(this[:])

         Generating present(arr[:len])

     , Loop is parallelizable

         Generating Tesla code

         , #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

ivector::updateHost():

     , Generating update self(arr[:len])

cuan@CUAN:~$ ./main.exe

launch CUDA kernel  file=/home/cuan/main.cpp function=_ZN7ivectorC1Ei line= device= threadid= num_gangs= num_workers= vector_length= grid= block=

launch CUDA kernel  file=/home/cuan/main.cpp function=_ZN7ivector3addEi line= device= threadid= num_gangs= num_workers= vector_length= grid= block=

launch CUDA kernel  file=/home/cuan/main.cpp function=_ZN7ivector3addEi line= device= threadid= num_gangs= num_workers= vector_length= grid= block=

s1[] =

launch CUDA kernel  file=/home/cuan/main.cpp function=_ZN7ivectorC1ERKS_ line= device= threadid= num_gangs= num_workers= vector_length= grid= block=

s2[] =

deconstruction!

deconstruction!
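With managed memory, heap allocations are visible to both host and device, so a loop can in principle run without any enter data, exit data, or update directives. The following is only a sketch under that assumption (compilation with -ta=tesla:managed, as in the command above); the function makeRamp is hypothetical and does not appear in the original code.

```cpp
#include <cstddef>

// Hypothetical sketch: relies on managed memory so that the array returned by
// new[] can be touched by the device without explicit data directives.
int *makeRamp(int len, int offset)
{
    int *arr = new int[len];

#pragma acc parallel loop
    for (int i = 0; i < len; i++)
        arr[i] = i + offset;

    return arr; // the caller owns the array and must delete[] it
}
```

The caller reads the result directly on the host, for example `int *p = makeRamp(8, 10);` followed by `delete[] p;` when done.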

● I found a way to use OpenACC in C++ in this book【https://www.elsevier.com/books/parallel-programming-with-openacc/farber/978-0-12-410397-9】; the code is accList.double.cpp under【https://github.com/rmfarber/ParallelProgrammingWithOpenACC/tree/master/Chapter05】

 // accList.h

 #ifndef ACC_LIST_H

 #define ACC_LIST_H

 #include <cstdlib>

 #include <cassert>

 #ifdef _OPENACC

 #include <openacc.h>

 #endif

 template<typename T>

 class accList

 {

 public:

     explicit accList() {}

     explicit accList(size_t size)           // the constructor copies *this to the device, then allocates memory

     {

 #pragma acc enter data copyin(this)

         allocate(size);

     }

     ~accList()                              // the destructor releases the memory, then deletes *this from the device

     {

         release();

 #pragma acc exit data delete(this)

     }

 #pragma acc routine seq

     T& operator[](size_t idx)

     {

         return _A[idx];

     }

 #pragma acc routine seq

     const T& operator[](size_t idx) const

     {

         return _A[idx];

     }

     size_t size() const

     {

         return _size;

     }

     accList& operator=(const accList& B)

     {

         allocate(B.size());

         for (size_t j = 0; j < _size; ++j)

         {

             _A[j] = B[j];

         }

         accUpdateDevice();

         return *this;

     }

     void insert(size_t idx, const T& val)

     {

         _A[idx] = val;

     }

     void insert(size_t idx, const T* val)

     {

         _A[idx] = *val;

     }

     void accUpdateSelf()

     {

         accUpdateSelfT(_A, 0);

     }

     void accUpdateDevice()

     {

         accUpdateDeviceT(_A, 0);

     }

 private:

     T * _A{ nullptr };                      // the only data members are the pointer and the length

     size_t _size{ 0 };

     void release()

     {

         if (_size > 0)

         {

 #pragma acc exit data delete(_A[0:_size])   // delete the device memory when releasing

             delete[] _A;

             _A = nullptr;

             _size = 0;

         }

     }

     void allocate(size_t size)

     {

         if (_size != size)                  // reallocate when the requested size differs from the current size

         {

             release();

             _size = size;

 #pragma acc update device(_size)

             if (_size > 0)

             {

                 _A = new T[_size];

 #ifdef _OPENACC                             // with OpenACC, check that _A is not already present on the device

                 assert(!acc_is_present(&_A[0], sizeof(T)));

 #endif

 #pragma acc enter data create(_A[0:_size])  // allocate the new memory on the device

             }

         }

     }

     template<typename U>

     void accUpdateSelfT(U *p, long)

     {

 #pragma acc update self(p[0:_size])

     }

     template<typename U>

     auto accUpdateSelfT(U *p, int) -> decltype(p->accUpdateSelf())

     {

         for (size_t j = 0; j < _size; ++j)

         {

             p[j].accUpdateSelf();

         }

     }

     template<typename U>

     void accUpdateDeviceT(U *p, long)

     {

 #pragma acc update device(p[0:_size])

     }

     template<typename U>

     auto accUpdateDeviceT(U *p, int) -> decltype(p->accUpdateDevice())

     {

         for (size_t j = 0; j < _size; ++j)

         {

             p[j].accUpdateDevice();

         }

     }

 };

 #endif
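The two overloads of accUpdateSelfT and accUpdateDeviceT in the header rely on overload ranking plus SFINAE: the int overload is preferred for an int argument, but it only takes part in overload resolution when the element type has the corresponding member function; otherwise the long overload is picked. The sketch below isolates that mechanism. The types Plain and Nested and the function updateSelfT are hypothetical, and the return values exist only to show which overload was selected.

```cpp
#include <cstddef>

struct Plain {};                       // no accUpdateSelf() member

struct Nested                          // has accUpdateSelf(), like a nested accList
{
    int calls = 0;
    void accUpdateSelf() { ++calls; }
};

// Fallback: passing an int where long is expected needs a conversion, so this ranks lower.
template <typename U>
int updateSelfT(U *, std::size_t, long)
{
    return 0;                          // the header would issue "update self(p[0:_size])" here
}

// Preferred: exact match for an int argument, but removed from the candidate
// set by SFINAE when p->accUpdateSelf() does not compile.
template <typename U>
auto updateSelfT(U *p, std::size_t n, int) -> decltype(p->accUpdateSelf(), 1)
{
    for (std::size_t j = 0; j < n; ++j)
        p[j].accUpdateSelf();          // recurse into each element
    return 1;
}
```

Calling `updateSelfT(ptr, n, 0)` therefore walks the elements when they are containers themselves and falls back to a flat update for plain types such as double.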

 // main.cpp

 #include <iostream>

 #include <cstdlib>

 #include <cstdint>

 #include "accList.h"

 using namespace std;

 #ifndef N

 #define N 1024

 #endif

 int main()

 {

     accList<double> A(N), B(N);

     for (int i = 0; i < B.size(); ++i)

         B[i] = 2.5;

     B.accUpdateDevice();                                // manually update the device memory

 #pragma acc parallel loop gang vector present(A,B)

     for (int i = 0; i < A.size(); ++i)

         A[i] = B[i] + i;

     A.accUpdateSelf();                                  // manually update the host memory

     for (int i = 0; i < 10; ++i)

         cout << "A[" << i << "]: " << A[i] << endl;

     cout << "......" << endl;

     for (int i = N - 10; i < N; ++i)

         cout << "A[" << i << "]: " << A[i] << endl;

     return 0;

 }

● Output

cuan@CUAN:~/acc$ pgc++ main.cpp -o main.exe -Minfo -acc

main:

     , Generating present(B,A)

         Generating Tesla code

         , #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

accList<double>::accList(unsigned long):

      , include "accList.h"

          , Generating enter data copyin(this[:])

accList<double>::~accList():

      , include "accList.h"

          , Generating exit data delete(this[:])

accList<double>::operator [](unsigned long):

      , include "accList.h"

          , Generating acc routine seq

              Generating Tesla code

accList<double>::size() const:

      , include "accList.h"

          , Generating implicit acc routine seq

              Generating acc routine seq

              Generating Tesla code

accList<double>::release():

      , include "accList.h"

          , Generating exit data delete(_A[:_size])

accList<double>::allocate(unsigned long):

      , include "accList.h"

          , Generating update device(_size)

         , Generating enter data create(_A[:_size])

void accList<double>::accUpdateSelfT<double>(T1 *, long):

      , include "accList.h"

         , Generating update self(p[:_size])

void accList<double>::accUpdateDeviceT<double>(T1 *, long):

      , include "accList.h"

         , Generating update device(p[:_size])

cuan@CUAN:~/acc$ ./main.exe

launch CUDA kernel  file=/home/cuan/acc/main.cpp function=main line= device= threadid= num_gangs= num_workers= vector_length= grid= block=

A[0]: 2.5

A[1]: 3.5

A[2]: 4.5

A[3]: 5.5

A[4]: 6.5

A[5]: 7.5

A[6]: 8.5

A[7]: 9.5

A[8]: 10.5

A[9]: 11.5

......

A[1014]: 1016.5

A[1015]: 1017.5

A[1016]: 1018.5

A[1017]: 1019.5

A[1018]: 1020.5

A[1019]: 1021.5

A[1020]: 1022.5

A[1021]: 1023.5

A[1022]: 1024.5

A[1023]: 1025.5

Accelerator Kernel Timing data

/home/cuan/acc/main.cpp

  main  NVIDIA  devicenum=

    time(us):

    : compute region reached  time

        : kernel launched  time

            grid: []  block: []

             device time(us): total= max= min= avg=

            elapsed time(us): total= max= min= avg=

    : data region reached  times

/home/cuan/acc/main.cpp

  _ZN7accListIdEC1Em  NVIDIA  devicenum=

    time(us):

    : data region reached  times

        : data copyin transfers:

             device time(us): total= max= min= avg=

/home/cuan/acc/main.cpp

  _ZN7accListIdED1Ev  NVIDIA  devicenum=

    time(us):

    : data region reached  times

/home/cuan/acc/main.cpp

  _ZN7accListIdE7releaseEv  NVIDIA  devicenum=

    time(us):

    : data region reached  times

        : data copyin transfers:

             device time(us): total= max= min= avg=

/home/cuan/acc/main.cpp

  _ZN7accListIdE8allocateEm  NVIDIA  devicenum=

    time(us):

    : update directive reached  times

        : data copyin transfers:

             device time(us): total= max= min= avg=

    : data region reached  times

        : data copyin transfers:

             device time(us): total= max= min= avg=

/home/cuan/acc/main.cpp

  _ZN7accListIdE14accUpdateSelfTIdEEvPT_l  NVIDIA  devicenum=

    time(us):

    : update directive reached  time

        : data copyout transfers:

             device time(us): total= max= min= avg=

/home/cuan/acc/main.cpp

  _ZN7accListIdE16accUpdateDeviceTIdEEvPT_l  NVIDIA  devicenum=

    time(us):

    : update directive reached  time

        : data copyin transfers:

             device time(us): total= max= min= avg=
