R – GPU Programming for All with ‘gpuR’
INTRODUCTION
GPUs (Graphics Processing Units) have become much more popular in recent years for computationally intensive calculations. Despite these gains, their use has been very limited in the R programming language. Although possible, programming in either OpenCL or CUDA is difficult for many programmers unaccustomed to working with such a low-level interface. Creating bindings for R's high-level programming that abstract away the complex GPU code would make GPUs far more accessible to R users. This is the core idea behind the gpuR package. There are three novel aspects of gpuR:
- Applicable on ‘ALL’ GPUs
- Abstracts away CUDA/OpenCL code so it is easy to incorporate into existing R algorithms
- Separates copy/compute functions to allow objects to persist on GPU
BROAD APPLICATION:
The ‘gpuR’ package was created to bring the power of GPU computing to any R user with a GPU device. Although there are a handful of packages that provide some GPU capability (e.g. gputools, cudaBayesreg, HiPLARM, HiPLARb, and gmatrix), all are strictly limited to NVIDIA GPUs. As such, a backend based upon OpenCL would allow all users to benefit from GPU hardware. The ‘gpuR’ package therefore utilizes the ViennaCL linear algebra library, which contains auto-tuned OpenCL kernels (among others) that can be leveraged for GPUs. The headers have been conveniently repackaged in the RViennaCL package. ViennaCL also allows for a CUDA backend for those with NVIDIA GPUs who may see further improved performance (contained within the companion gpuRcuda package, not yet formally released).
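As a quick sanity check that a device is visible to the package, gpuR ships helpers for enumerating hardware. A minimal sketch (the exact fields reported will vary by driver and device):

```r
library(gpuR)

# Number of valid GPU devices detected through OpenCL
detectGPUs()

# Details (name, memory, compute units, etc.) for the current device
gpuInfo()
```

If `detectGPUs()` returns 0, the OpenCL runtime for your hardware is likely missing or not on the library path.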
ABSTRACT AWAY GPU CODE:
The gpuR package uses the S4 object-oriented system to provide explicit classes and methods that allow the user to simply cast their matrix or vector and continue programming in R as normal. For example:
```r
ORDER = 1024

A = matrix(rnorm(ORDER^2), nrow=ORDER)
B = matrix(rnorm(ORDER^2), nrow=ORDER)
gpuA = gpuMatrix(A, type="double")
gpuB = gpuMatrix(B, type="double")

C = A %*% B
gpuC = gpuA %*% gpuB

all(C == gpuC[])
[1] TRUE
```
The gpuMatrix object points to a matrix in RAM which is then computed on by the GPU when a desired function is called. This avoids R's habit of copying the memory of objects. For example:
```r
library(pryr)

# Initially points to same object
x = matrix(rnorm(16), 4)
y = x

address(x)
[1] "0x16177f28"
address(y)
[1] "0x16177f28"

# But once you modify the second object it creates a copy
y[1,1] = 0

address(x)
[1] "0x16177f28"
address(y)
[1] "0x15fbb1d8"
```
In contrast, the same syntax for a gpuMatrix will modify the original object in-place without any copy.
```r
library(pryr)
library(gpuR)

# Initially points to same object
x = gpuMatrix(rnorm(16), 4, 4)
y = x

x@address
[1] <pointer: 0x6baa040>
y@address
[1] <pointer: 0x6baa040>

# Modification affects both objects without copy
y[1,1] = 0

x@address
[1] <pointer: 0x6baa040>
y@address
[1] <pointer: 0x6baa040>
```
Each new variable assigned to this object will only copy the pointer, thereby making the program more memory efficient. However, the gpuMatrix class does still require allocating GPU memory and copying data to the device for each function call. The most commonly used methods have been overloaded, such as %*%, +, -, *, /, crossprod, tcrossprod, and trig functions among others. In this way, an R user can create these objects and leverage GPU resources without needing to learn a host of new functions that would break existing algorithms.
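As an illustrative sketch of these overloads (assuming a working GPU device), existing linear-algebra code runs unchanged once the inputs are cast:

```r
library(gpuR)

A    <- matrix(rnorm(64), nrow = 8)
gpuA <- gpuMatrix(A, type = "double")

# Same syntax as base R; each call allocates on the GPU,
# computes, and copies the result back to the host object
cpA   <- crossprod(gpuA)   # t(A) %*% A on the GPU
sumA  <- gpuA + gpuA       # element-wise addition
trigA <- sin(gpuA)         # overloaded trig function

# Bracket indexing pulls the data back into an ordinary R matrix
all.equal(cpA[], crossprod(A))
```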
DISTINCT COPY/COMPUTE FUNCTIONALITY:
For the gpuMatrix and gpuVector classes there are companion vclMatrix and vclVector classes that point to objects that persist in GPU RAM. In this way, the user explicitly decides when data needs to be moved back to the host. By avoiding unnecessary data transfers between host and device, performance can improve significantly. For example:
```r
vclA = vclMatrix(rnorm(10000), nrow = 100)
vclB = vclMatrix(rnorm(10000), nrow = 100)
vclC = vclMatrix(rnorm(10000), nrow = 100)

# GEMM
vclD = vclA %*% vclB

# Element-wise addition
vclD = vclD + vclC
```
In this code, the three initial matrices already exist in GPU memory, so no data transfer takes place in the GEMM call. Furthermore, the returned matrix ‘vclD’ also remains in GPU memory, so the element-wise addition call likewise happens directly on the GPU with no data transfers. It is also worth noting that the user can still modify elements, rows, or columns with the exact same syntax as a normal R matrix.
```r
vclD[1,1] = 42
vclD[,2] = rep(12, 100)
vclD[3,] = rep(23, 100)
```
These operations simply copy the new elements to the GPU and modify the object in-place within the GPU memory. The ‘vclD’ object is never copied to the host.
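When the result is finally needed on the host, the empty-bracket operator performs the single device-to-host transfer. A short sketch continuing with the ‘vclD’ object from above:

```r
# One explicit copy from GPU memory back to an ordinary R matrix
D <- vclD[]
class(D)   # an ordinary R "matrix"

# Individual elements can also be read back without
# transferring the whole object
vclD[1, 1]
```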
BENCHMARKS:
With all that in mind, how does gpuR perform? Here are some general benchmarks of the popular GEMM operation. I currently only have access to a single NVIDIA GeForce GTX 970 for these simulations. Users should expect to see differences with high-performance GPUs (e.g. AMD FirePro, NVIDIA Tesla, etc.). Speedup relative to the CPU will also vary depending upon user hardware.
(1) DEFAULT DGEMM VS BASE R
R supports only two numeric types (integer and double); single-precision floats are not available by default. Figure 1 shows the fold speedup achieved by using the gpuMatrix and vclMatrix classes with double-precision data. Since base R is not known for speed, an implementation with an OpenBLAS backend is included as well for reference, using a 4-core Intel i5-2500 CPU @ 3.30GHz. As can be seen, there is a dramatic speedup from just using OpenBLAS or the gpuMatrix class (the two are essentially equivalent). Of interest is the impact of the host-device-host transfer time that is typical of many GPU implementations. This cost is eliminated by using the vclMatrix class, which continues to scale with matrix size.
Figure 1 – Fold speedup achieved using openblas (CPU) as well as the gpuMatrix/vclMatrix (GPU) classes provided in gpuR.
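The comparison above can be reproduced in spirit with a simple timing sketch (the numbers will depend entirely on your BLAS and GPU; this assumes a working OpenCL device):

```r
library(gpuR)

ORDER <- 2048
A <- matrix(rnorm(ORDER^2), nrow = ORDER)
B <- matrix(rnorm(ORDER^2), nrow = ORDER)

# CPU: whatever BLAS R is linked against (e.g. OpenBLAS)
cpu_time <- system.time(A %*% B)["elapsed"]

# gpuMatrix: includes host-device-host transfer on each call
gA <- gpuMatrix(A, type = "double")
gB <- gpuMatrix(B, type = "double")
gpu_time <- system.time(gA %*% gB)["elapsed"]

# vclMatrix: data already resides in GPU memory
vA <- vclMatrix(A, type = "double")
vB <- vclMatrix(B, type = "double")
vcl_time <- system.time(vA %*% vB)["elapsed"]

c(cpu = cpu_time, gpuMatrix = gpu_time, vclMatrix = vcl_time)
```

One caveat: if kernels are queued asynchronously, a naive `system.time()` around a vclMatrix call may undercount; forcing a read of the result is a crude way to ensure the computation has completed.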
(2) SGEMM VS BASE R
Many GPU benchmarks also measure float operations. As noted above, R does not provide a float type by default. One way around this is to use the RcppArmadillo package and explicitly cast R objects as float types. The Armadillo library will also default to the BLAS backend provided (i.e. OpenBLAS). Figure 2 shows the impact of using float data types. OpenBLAS continues to provide a noticeable speedup, but gpuMatrix begins to outperform it once the matrix order exceeds 1500. The vclMatrix class continues to demonstrate the value of retaining objects in GPU memory and avoiding memory transfers.
Figure 2 – Float type GEMM comparisons. Fold speedup achieved using openblas (via RcppArmadillo) as well as the gpuMatrix/vclMatrix (GPU) classes provided in gpuR.
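Within gpuR itself, single precision is simply a matter of the `type` argument (assuming your device supports it). A brief sketch:

```r
library(gpuR)

ORDER <- 2048
A <- matrix(rnorm(ORDER^2), nrow = ORDER)
B <- matrix(rnorm(ORDER^2), nrow = ORDER)

# Single-precision objects: half the memory, and typically much
# higher throughput on consumer GPUs than double precision
gpuA <- gpuMatrix(A, type = "float")
gpuB <- gpuMatrix(B, type = "float")

gpuC <- gpuA %*% gpuB

# The result only carries float precision, so compare against the
# double-precision CPU result with a loose tolerance
all.equal(gpuC[], A %*% B, tolerance = 1e-5)
```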
An additional view of the performance achieved by gpuMatrix and vclMatrix is to compare directly against OpenBLAS. The gpuMatrix class reaches a ~2-3 fold speedup over OpenBLAS, whereas vclMatrix scales to over a 100-fold speedup! It is curious why vclMatrix is so much faster, given that the two differ only in host-device-host transfers. Further optimization of gpuMatrix will need to be explored (fresh eyes are welcome), accepting the limitations of the bus transfer speed. Performance will certainly improve with hardware advances such as NVIDIA's NVLink.
Figure 3 – Fold speedup achieved over openblas (via RcppArmadillo) float type GEMM comparisons vs the gpuMatrix/vclMatrix (GPU) classes provided in gpuR.
CONCLUSION
The gpuR package has been created to bring GPU computing to as many R users as possible. The intention is to use gpuR to more easily supplement current and future algorithms that could benefit from GPU acceleration. The gpuR package is currently available on CRAN. The development version can be found on my github, in addition to existing issues and wiki pages (assisting primarily with installation). Future developments include solvers (e.g. QR, SVD, Cholesky, etc.), scaling across multiple GPUs, ‘sparse’ class objects, and custom OpenCL kernels.
As noted above, this package is intended to be used with a multitude of hardware and operating systems (it has been tested on Windows, Mac, and multiple Linux flavors). I only have access to a limited set of hardware (I can't access every GPU, let alone the most expensive). As such, the development of gpuR depends upon the R user community. Volunteers who possess different hardware are always welcome and encouraged to submit issues regarding any discovered bugs. I have begun a gitter account for users to report successful usage with alternate hardware. Suggestions and general conversation about gpuR are welcome.
Source: http://www.parallelr.com/r-gpu-programming-for-all-with-gpur/