OpenCL学习笔记（三）：OpenCL安装，编程简介与helloworld

欢迎转载，转载请注明：本文出自Bin的专栏blog.csdn.net/xbinworld。技术交流QQ群：433250724，欢迎对算法、技术、应用感兴趣的同学加入。

OpenCL安装

安装我不打算花篇幅写，原因是OpenCL实在是可以太多的平台+环境下实现了，包括GPU和FPGA，以及不同的器件支持，在这里我主要把网上可以找到比较不错的经验贴列一下，方便大家，我主要关注了FPGA的，其他GPU的大家网上搜搜吧：

altera opencl sdk下载：

https://www.altera.com.cn/products/design-software/embedded-software-developers/opencl/overview.html

alter的安装指南，《Altera SDK for OpenCL

Getting Started Guide》

理论上看上面两个就够了，你需要做的事情包括：

下载opencl SDK，或者quatuars II软件(含SDK)，下载相应开发板的支持（altera上面有一些，但是其他的可能就需要你从相应的供应商那边找了）；还需要opencl的license，不然是不能编译的。

中文的一些经验贴可以看：

《Altera OpenCL入门(beta版)》http://wenku.baidu.com/link?url=bkIyo01jXeWfdGsrA_M0J1zomx6f0lYk0NPf-9-MNaC0OkWRmukDwY5yFz0I3Wrctqi5qD3jC8BhQQzjoqw1HXpUgIM68_blz5Cr3vxpaZC

【Altera SoC体验之旅】+ 正式开启OpenCL模式

http://home.eeworld.com.cn/my/space-uid-169743-blogid-247647.html

OpenCL编程简介

下面的图简单说明了OpenCL的编程框架，图是用的GPU，其他类似；

从图中可以看出（参考《OpenCL 编程入门》）：

1. 异构计算设备，可以是CPU或GPU。现在也有支持OpenCL的FPGA设备和至强融核协处理设备（MIC）。

2. OpenCL的API通过Context（环境上下文）联系在一起。

3. 运行设备端的程序，经过了编译->设置参数->运行等步骤。

名词的概念：

Platform (平台)：主机加上OpenCL框架管理下的若干设备构成了这个平台，通过这个平台，应用程序可以与设备共享资源并在设备上执行kernel。实际使用中基本上一个厂商对应一个Platform，比如Intel, AMD都是这样。

Device（设备）：官方的解释是计算单元（Compute Units）的集合。举例来说，GPU是典型的device。Intel和AMD的多核CPU也提供OpenCL接口，所以也可以作为Device。

Context（上下文）：OpenCL的Platform上共享和使用资源的环境，包括kernel、device、memory objects、command queue等。使用中一般一个Platform对应一个Context。

Program：OpenCL程序，由kernel函数、其他函数和声明等组成。

Kernel（核函数）：可以从主机端调用，运行在设备端的函数。

Memory Object（内存对象）：在主机和设备之间传递数据的对象，一般映射到OpenCL程序中的global memory。有两种具体的类型：Buffer Object（缓存对象）和Image Object（图像对象）。

Command Queue（指令队列）：在指定设备上管理多个指令（Command）。队列里指令执行可以顺序也可以乱序。一个设备可以对应多个指令队列。

NDRange：主机端运行设备端kernel函数的主要接口。实际上还有其他的，NDRange是非常常见的，用于分组运算，以后具体用到的时候就知道区别了。

Host端来看，OpenCL的组要执行流程是这样的：

其实基本上大部分简单的程序HOST部分都是差不多的，不用改很多，具体下面看一个例子就知道了。

第一个程序

这里贴一个altera官方的vector add的实例code，基本就是helloworld级别了，不过它的host写的很通用（考虑到对多个device统一编程），可以过一遍看看是不是和上面的图对的上。其实看过这个基本其他的也就差不多了。

Host部分：（Kernel在最后）

// Copyright (C) 2013-2014 Altera Corporation, San Jose, California, USA. All rights reserved.

// Permission is hereby granted, free of charge, to any person obtaining a copy of this

// software and associated documentation files (the "Software"), to deal in the Software

// without restriction, including without limitation the rights to use, copy, modify, merge,

// publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to

// whom the Software is furnished to do so, subject to the following conditions:

// The above copyright notice and this permission notice shall be included in all copies or

// substantial portions of the Software.

//

// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,

// EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES

// OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND

// NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT

// HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,

// WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING

// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR

// OTHER DEALINGS IN THE SOFTWARE.

//

// This agreement shall be governed in all respects by the laws of the State of California and

// by the laws of the United States of America. 

///////////////////////////////////////////////////////////////////////////////////

// This host program executes a vector addition kernel to perform:

//  C = A + B

// where A, B and C are vectors with N elements.

//

// This host program supports partitioning the problem across multiple OpenCL

// devices if available. If there are M available devices, the problem is

// divided so that each device operates on N/M points. The host program

// assumes that all devices are of the same type (that is, the same binary can

// be used), but the code can be generalized to support different device types

// easily.

//

// Verification is performed against the same computation on the host CPU.

///////////////////////////////////////////////////////////////////////////////////

#include <stdio.h>

#include <stdlib.h>

#include <math.h>

#include "CL/opencl.h"

#include "AOCL_Utils.h"

using namespace aocl_utils;

// OpenCL runtime configuration

cl_platform_id platform = NULL;

unsigned num_devices = 0;

scoped_array<cl_device_id> device; // num_devices elements

cl_context context = NULL;

scoped_array<cl_command_queue> queue; // num_devices elements

cl_program program = NULL;

scoped_array<cl_kernel> kernel; // num_devices elements

scoped_array<cl_mem> input_a_buf; // num_devices elements

scoped_array<cl_mem> input_b_buf; // num_devices elements

scoped_array<cl_mem> output_buf; // num_devices elements

// Problem data.

const unsigned N = 1000000; // problem size

scoped_array<scoped_aligned_ptr<float> > input_a, input_b; // num_devices elements

scoped_array<scoped_aligned_ptr<float> > output; // num_devices elements

scoped_array<scoped_array<float> > ref_output; // num_devices elements

scoped_array<unsigned> n_per_device; // num_devices elements

// Function prototypes

float rand_float();

bool init_opencl();

void init_problem();

void run();

void cleanup();

// Entry point.

int main() {

  // Initialize OpenCL.

  if(!init_opencl()) {

    return -1;

  }

  // Initialize the problem data.

  // Requires the number of devices to be known.

  init_problem();

  // Run the kernel.

  run();

  // Free the resources allocated

  cleanup();

  return 0;

}

/////// HELPER FUNCTIONS ///////

// Randomly generate a floating-point number between -10 and 10.

float rand_float() {

  return float(rand()) / float(RAND_MAX) * 20.0f - 10.0f;

}

// Initializes the OpenCL objects.

bool init_opencl() {

  cl_int status;

  printf("Initializing OpenCL\n");

  if(!setCwdToExeDir()) {

    return false;

  }

  // Get the OpenCL platform.

  platform = findPlatform("Altera");

  if(platform == NULL) {

    printf("ERROR: Unable to find Altera OpenCL platform.\n");

    return false;

  }

  // Query the available OpenCL device.

  device.reset(getDevices(platform, CL_DEVICE_TYPE_ALL, &num_devices));

  printf("Platform: %s\n", getPlatformName(platform).c_str());

  printf("Using %d device(s)\n", num_devices);

  for(unsigned i = 0; i < num_devices; ++i) {

    printf("  %s\n", getDeviceName(device[i]).c_str());

  }

  // Create the context.

  context = clCreateContext(NULL, num_devices, device, NULL, NULL, &status);

  checkError(status, "Failed to create context");

  // Create the program for all device. Use the first device as the

  // representative device (assuming all device are of the same type).

  std::string binary_file = getBoardBinaryFile("vectorAdd", device[0]);

  printf("Using AOCX: %s\n", binary_file.c_str());

  program = createProgramFromBinary(context, binary_file.c_str(), device, num_devices);

  // Build the program that was just created.

  status = clBuildProgram(program, 0, NULL, "", NULL, NULL);

  checkError(status, "Failed to build program");

  // Create per-device objects.

  queue.reset(num_devices);

  kernel.reset(num_devices);

  n_per_device.reset(num_devices);

  input_a_buf.reset(num_devices);

  input_b_buf.reset(num_devices);

  output_buf.reset(num_devices);

  for(unsigned i = 0; i < num_devices; ++i) {

    // Command queue.

    queue[i] = clCreateCommandQueue(context, device[i], CL_QUEUE_PROFILING_ENABLE, &status);

    checkError(status, "Failed to create command queue");

    // Kernel.

    const char *kernel_name = "vectorAdd";

    kernel[i] = clCreateKernel(program, kernel_name, &status);

    checkError(status, "Failed to create kernel");

    // Determine the number of elements processed by this device.

    n_per_device[i] = N / num_devices; // number of elements handled by this device

    // Spread out the remainder of the elements over the first

    // N % num_devices.

    if(i < (N % num_devices)) {

      n_per_device[i]++;

    }

    // Input buffers.

    input_a_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY,

        n_per_device[i] * sizeof(float), NULL, &status);

    checkError(status, "Failed to create buffer for input A");

    input_b_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY,

        n_per_device[i] * sizeof(float), NULL, &status);

    checkError(status, "Failed to create buffer for input B");

    // Output buffer.

    output_buf[i] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,

        n_per_device[i] * sizeof(float), NULL, &status);

    checkError(status, "Failed to create buffer for output");

  }

  return true;

}

// Initialize the data for the problem. Requires num_devices to be known.

void init_problem() {

  if(num_devices == 0) {

    checkError(-1, "No devices");

  }

  input_a.reset(num_devices);

  input_b.reset(num_devices);

  output.reset(num_devices);

  ref_output.reset(num_devices);

  // Generate input vectors A and B and the reference output consisting

  // of a total of N elements.

  // We create separate arrays for each device so that each device has an

  // aligned buffer.

  for(unsigned i = 0; i < num_devices; ++i) {

    input_a[i].reset(n_per_device[i]);

    input_b[i].reset(n_per_device[i]);

    output[i].reset(n_per_device[i]);

    ref_output[i].reset(n_per_device[i]);

    for(unsigned j = 0; j < n_per_device[i]; ++j) {

      input_a[i][j] = rand_float();

      input_b[i][j] = rand_float();

      ref_output[i][j] = input_a[i][j] + input_b[i][j];

    }

  }

}

void run() {

  cl_int status;

  const double start_time = getCurrentTimestamp();

  // Launch the problem for each device.

  scoped_array<cl_event> kernel_event(num_devices);

  scoped_array<cl_event> finish_event(num_devices);

  for(unsigned i = 0; i < num_devices; ++i) {

    // Transfer inputs to each device. Each of the host buffers supplied to

    // clEnqueueWriteBuffer here is already aligned to ensure that DMA is used

    // for the host-to-device transfer.

    cl_event write_event[2];

    status = clEnqueueWriteBuffer(queue[i], input_a_buf[i], CL_FALSE,

        0, n_per_device[i] * sizeof(float), input_a[i], 0, NULL, &write_event[0]);

    checkError(status, "Failed to transfer input A");

    status = clEnqueueWriteBuffer(queue[i], input_b_buf[i], CL_FALSE,

        0, n_per_device[i] * sizeof(float), input_b[i], 0, NULL, &write_event[1]);

    checkError(status, "Failed to transfer input B");

    // Set kernel arguments.

    unsigned argi = 0;

    status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_a_buf[i]);

    checkError(status, "Failed to set argument %d", argi - 1);

    status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_b_buf[i]);

    checkError(status, "Failed to set argument %d", argi - 1);

    status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &output_buf[i]);

    checkError(status, "Failed to set argument %d", argi - 1);

    // Enqueue kernel.

    // Use a global work size corresponding to the number of elements to add

    // for this device.

    //

    // We don't specify a local work size and let the runtime choose

    // (it'll choose to use one work-group with the same size as the global

    // work-size).

    //

    // Events are used to ensure that the kernel is not launched until

    // the writes to the input buffers have completed.

    const size_t global_work_size = n_per_device[i];

    printf("Launching for device %d (%d elements)\n", i, global_work_size);

    status = clEnqueueNDRangeKernel(queue[i], kernel[i], 1, NULL,

        &global_work_size, NULL, 2, write_event, &kernel_event[i]);

    checkError(status, "Failed to launch kernel");

    // Read the result. This the final operation.

    status = clEnqueueReadBuffer(queue[i], output_buf[i], CL_FALSE,

        0, n_per_device[i] * sizeof(float), output[i], 1, &kernel_event[i], &finish_event[i]);

    // Release local events.

    clReleaseEvent(write_event[0]);

    clReleaseEvent(write_event[1]);

  }

  // Wait for all devices to finish.

  clWaitForEvents(num_devices, finish_event);

  const double end_time = getCurrentTimestamp();

  // Wall-clock time taken.

  printf("\nTime: %0.3f ms\n", (end_time - start_time) * 1e3);

  // Get kernel times using the OpenCL event profiling API.

  for(unsigned i = 0; i < num_devices; ++i) {

    cl_ulong time_ns = getStartEndTime(kernel_event[i]);

    printf("Kernel time (device %d): %0.3f ms\n", i, double(time_ns) * 1e-6);

  }

  // Release all events.

  for(unsigned i = 0; i < num_devices; ++i) {

    clReleaseEvent(kernel_event[i]);

    clReleaseEvent(finish_event[i]);

  }

  // Verify results.

  bool pass = true;

  for(unsigned i = 0; i < num_devices && pass; ++i) {

    for(unsigned j = 0; j < n_per_device[i] && pass; ++j) {

      if(fabsf(output[i][j] - ref_output[i][j]) > 1.0e-5f) {

        printf("Failed verification @ device %d, index %d\nOutput: %f\nReference: %f\n",

            i, j, output[i][j], ref_output[i][j]);

        pass = false;

      }

    }

  }

  printf("\nVerification: %s\n", pass ? "PASS" : "FAIL");

}

// Free the resources allocated during initialization

void cleanup() {

  for(unsigned i = 0; i < num_devices; ++i) {

    if(kernel && kernel[i]) {

      clReleaseKernel(kernel[i]);

    }

    if(queue && queue[i]) {

      clReleaseCommandQueue(queue[i]);

    }

    if(input_a_buf && input_a_buf[i]) {

      clReleaseMemObject(input_a_buf[i]);

    }

    if(input_b_buf && input_b_buf[i]) {

      clReleaseMemObject(input_b_buf[i]);

    }

    if(output_buf && output_buf[i]) {

      clReleaseMemObject(output_buf[i]);

    }

  }

  if(program) {

    clReleaseProgram(program);

  }

  if(context) {

    clReleaseContext(context);

  }

}

Kernel部分：

// ACL kernel for adding two input vectors

__kernel void vectorAdd(__global const float *x,

                        __global const float *y,

                        __global float *restrict z)

{

    // get index of the work item

    int index = get_global_id(0);

    // add the vector elements

    z[index] = x[index] + y[index];

}

kernel部分代码就这几行，__global是一个限定符，表示用外部存储（比如DDR）来存储，其他语法和标准C语言是一样的，就不多说了。

代码中最重要的就是get_global_id，这个是在多work-item工作模式下的常用手段，通过id确定work-item然后进行操作，所有的item都是一样的，因此就add的函数里面就没有习惯的for()的写法了。可以对kernel的设置进行定制，包括compute unit，SIMD模式等，这样来控制程序的并行性，更大的并行往往性能高，但是更耗资源。

具体的Kernel函数的内容可以参考OpenCL的《The OpenCL Specification 1.0》以及altera的opencl编程指南，后面的笔记我会具体写一下。

内存模型

最后写一下Opencl的内存模型，看下面的示意图：

用核函数中的内存变量来简单地解释：用clCreateBuffer 创建、用clSetKernelArg 传递的数据在global memory 和constant memory中；核函数中的寄存器变量在private memory 中；核函数的内部变量、缓存等，在local memory 中。图例中可以看到Device 并不直接访问global memory，而是通过Cache 来访问。可以想象当同时运行的work-item，使用的内存都在同一块cache 中，则内存吞吐的效率最高。对应到work group 中，就是在程序设计上尽量使同一个work group 中的work item 操作连续的内存，以提高访存效率。

本篇就到这里。