Technical Background

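Mathematics and physics are full of continuous function models. When we tackle such problems on a modern computer, however, a continuous model cannot be handled directly; in the vast majority of cases it must first be converted into a discrete model before any numerical calculation, for example when evaluating a numerical integral or numerical second derivatives (the Hessian matrix). The gridding algorithm introduced here is a typical discretization method. Discretizing space in this way can greatly reduce the amount of computation. Take the neighbor list in molecular dynamics simulations: without gridding, we have to search over every atom in the whole space, compute each distance, and then decide whether the pair are neighbors. With gridding, we first traverse the atoms once and assign each of them to a grid cell; when the neighbor list is built afterwards, we only need to check whether the atoms in the 27 adjacent cells of three-dimensional space (the atom's own cell plus its 26 neighbors) satisfy the neighbor condition, provided the cell edge is no smaller than the cutoff distance. In this article we mainly discuss how to implement the gridding algorithm on a GPU.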
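As a small illustration of the 27-cell neighborhood, the sketch below enumerates the cells that would be searched around one given cell. The helper name `neighbor_cells` is hypothetical and not part of the original article, and boundary handling (indices below zero or beyond the grid, or periodic wrapping) is deliberately left out.

```python
from itertools import product

def neighbor_cells(cell):
    """Return the 27 cells (the cell itself plus its 26 neighbors) around `cell`.

    Hypothetical helper: no boundary or periodic-image handling is done here.
    """
    nx, ny, nz = cell
    return [(nx + dx, ny + dy, nz + dz)
            for dx, dy, dz in product((-1, 0, 1), repeat=3)]

cells = neighbor_cells((2, 5, 0))
print(len(cells))  # 27
```

A neighbor search then only compares an atom against the atoms stored in these cells instead of against every atom in the system.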

Implementing the Gridding Algorithm

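Let us first use an example to show what gridding means. For a system in which all atomic coordinates are given, that is, $[x,y,z]$ is known, what we want is the grid cell $[n_x,n_y,n_z]$ in which each atom sits. We first look at an implementation on the CPU, which is a single-pass algorithm: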

# cuda_grid.py
from numba import jit
from numba import cuda
import numpy as np

def grid_by_cpu(crd, rxyz, atoms, grids):
    """Transform coordinates [x,y,z] into grids [nx,ny,nz].
    Args:
        crd(list): The 3-D coordinates of atoms.
        rxyz(list): The list includes xmin,ymin,zmin,grid_size.
        atoms(int): The total number of atoms.
        grids(list): The transformed grids matrix.
    """
    # One pass over the atoms: shift by the minimum corner, divide by the cell size.
    for i in range(atoms):
        grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3])
        grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3])
        grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3])
    return grids

if __name__=='__main__':
    np.random.seed(1)
    atoms = 4
    grid_size = 0.1
    crd = np.random.random((atoms,3)).astype(np.float32)
    xmin = min(crd[:,0])
    ymin = min(crd[:,1])
    zmin = min(crd[:,2])
    xmax = max(crd[:,0])
    ymax = max(crd[:,1])
    zmax = max(crd[:,2])
    # Number of cells needed along each axis to cover all atoms.
    xgrids = int((xmax-xmin)/grid_size)+1
    ygrids = int((ymax-ymin)/grid_size)+1
    zgrids = int((zmax-zmin)/grid_size)+1
    rxyz = np.array([xmin,ymin,zmin,grid_size], dtype=np.float32)
    grids = np.ones_like(crd)*(-1)
    grids = grids.astype(np.float32)
    grids_cpu = grid_by_cpu(crd, rxyz, atoms, grids)
    print(crd)
    print(grids_cpu)

    # Plot the x-y projection of the atoms together with the grid lines.
    import matplotlib.pyplot as plt
    plt.figure()
    plt.plot(crd[:,0], crd[:,1], 'o', color='red')
    for grid in range(ygrids+1):
        plt.plot([xmin,xmin+grid_size*xgrids], [ymin+grid_size*grid,ymin+grid_size*grid], color='black')
    for grid in range(xgrids+1):
        plt.plot([xmin+grid_size*grid,xmin+grid_size*grid], [ymin,ymin+grid_size*ygrids], color='black')
    plt.savefig('Atom_Grids.png')

The output is as follows:

$ python3 cuda_grid.py
[[4.17021990e-01 7.20324516e-01 1.14374816e-04]
 [3.02332580e-01 1.46755889e-01 9.23385918e-02]
 [1.86260208e-01 3.45560730e-01 3.96767467e-01]
 [5.38816750e-01 4.19194520e-01 6.85219526e-01]]
[[2. 5. 0.]
 [1. 0. 0.]
 [0. 1. 3.]
 [3. 2. 6.]]

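The two printed arrays above correspond to $[x,y,z]$ and $[n_x,n_y,n_z]$ respectively; for instance, the first atom is placed in the cell numbered $[2,5,0]$. To make gridding easier to understand, we take the first two dimensions of this three-dimensional system and of the resulting cell indices and visualize them; the script saves the plot as Atom_Grids.png.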

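In that plot, the red points are the atom positions and the black grid lines mark the cells. When the number of atoms is large, a single cell may contain many atoms, so how to grid the space and how to choose the cell size are empirical parameters that depend on the scenario and have to be explored case by case.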
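To make the mapping and the effect of the cell size concrete, here is a minimal pure-Python sketch that reuses the four coordinates printed above. The helpers `to_cell` and `cell_occupancy` and the coarser cell size 0.5 are assumptions introduced for illustration; they are not part of the original script.

```python
from collections import Counter

# Coordinates printed by the script above (4 atoms, np.random.seed(1)).
crd = [
    (4.17021990e-01, 7.20324516e-01, 1.14374816e-04),
    (3.02332580e-01, 1.46755889e-01, 9.23385918e-02),
    (1.86260208e-01, 3.45560730e-01, 3.96767467e-01),
    (5.38816750e-01, 4.19194520e-01, 6.85219526e-01),
]

def to_cell(atom, origin, grid_size):
    """Map one [x, y, z] coordinate to its cell index (nx, ny, nz)."""
    return tuple(int((atom[k] - origin[k]) / grid_size) for k in range(3))

def cell_occupancy(crd, origin, grid_size):
    """Count how many atoms fall into each occupied cell."""
    return Counter(to_cell(atom, origin, grid_size) for atom in crd)

# The origin is the minimum corner [xmin, ymin, zmin], as in the article.
origin = tuple(min(atom[k] for atom in crd) for k in range(3))

cells = [to_cell(atom, origin, 0.1) for atom in crd]
print(cells)  # [(2, 5, 0), (1, 0, 0), (0, 1, 3), (3, 2, 6)]

fine = cell_occupancy(crd, origin, 0.1)    # cell size used in the article
coarse = cell_occupancy(crd, origin, 0.5)  # hypothetical coarser cell size
print(max(fine.values()), max(coarse.values()))  # 1 2
```

With the cell size 0.1 every atom has a cell to itself, while at 0.5 two atoms share the cell (0, 0, 0). Because every coordinate is at least as large as the origin, truncation with `int` gives the same result as rounding down.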

Accelerating the Gridding Algorithm

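The implementation above relies mainly on a single for loop, which naturally brings to mind the just-in-time (JIT) compilation that numba supports, as well as GPU hardware acceleration. Here we first compare the results computed by the three implementations: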

# cuda_grid.py
from numba import jit
from numba import cuda
import numpy as np

def grid_by_cpu(crd, rxyz, atoms, grids):
    """Transform coordinates [x,y,z] into grids [nx,ny,nz].
    Args:
        crd(list): The 3-D coordinates of atoms.
        rxyz(list): The list includes xmin,ymin,zmin,grid_size.
        atoms(int): The total number of atoms.
        grids(list): The transformed grids matrix.
    """
    for i in range(atoms):
        grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3])
        grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3])
        grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3])
    return grids

@jit
def grid_by_jit(crd, rxyz, atoms, grids):
    """Transform coordinates [x,y,z] into grids [nx,ny,nz].
    Args:
        crd(list): The 3-D coordinates of atoms.
        rxyz(list): The list includes xmin,ymin,zmin,grid_size.
        atoms(int): The total number of atoms.
        grids(list): The transformed grids matrix.
    """
    # Same loop as grid_by_cpu, compiled by numba on the first call.
    for i in range(atoms):
        grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3])
        grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3])
        grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3])
    return grids

@cuda.jit
def grid_by_gpu(crd, rxyz, grids):
    """Transform coordinates [x,y,z] into grids [nx,ny,nz].
    Args:
        crd(list): The 3-D coordinates of atoms.
        rxyz(list): The list includes xmin,ymin,zmin,grid_size.
        grids(list): The transformed grids matrix.
    """
    # One thread per (atom, axis) element; the launch grid below
    # matches the array shape exactly, so no bounds check is used.
    i,j = cuda.grid(2)
    grids[i][j] = int((crd[i][j]-rxyz[j])/rxyz[3])

if __name__=='__main__':
    np.random.seed(1)
    atoms = 4
    grid_size = 0.1
    crd = np.random.random((atoms,3)).astype(np.float32)
    xmin = min(crd[:,0])
    ymin = min(crd[:,1])
    zmin = min(crd[:,2])
    xmax = max(crd[:,0])
    ymax = max(crd[:,1])
    zmax = max(crd[:,2])
    xgrids = int((xmax-xmin)/grid_size)+1
    ygrids = int((ymax-ymin)/grid_size)+1
    zgrids = int((zmax-zmin)/grid_size)+1
    rxyz = np.array([xmin,ymin,zmin,grid_size], dtype=np.float32)
    # Copy the inputs to the GPU once.
    crd_cuda = cuda.to_device(crd)
    rxyz_cuda = cuda.to_device(rxyz)

    grids = np.ones_like(crd)*(-1)
    grids = grids.astype(np.float32)
    grids_cpu = grid_by_cpu(crd, rxyz, atoms, grids)

    grids = np.ones_like(crd)*(-1)
    grids_jit = grid_by_jit(crd, rxyz, atoms, grids)

    grids = np.ones_like(crd)*(-1)
    grids_cuda = cuda.to_device(grids)
    # Launch (atoms,3) blocks with a single thread each.
    grid_by_gpu[(atoms,3),(1,1)](crd_cuda,
                                 rxyz_cuda,
                                 grids_cuda)

    print(crd)
    print(grids_cpu)
    print(grids_jit)
    print(grids_cuda.copy_to_host())

The output is as follows:

$ python3 cuda_grid.py
/home/dechin/anaconda3/lib/python3.8/site-packages/numba/cuda/compiler.py:865: NumbaPerformanceWarning: Grid size (12) < 2 * SM count (72) will likely result in GPU under utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))
[[4.17021990e-01 7.20324516e-01 1.14374816e-04]
 [3.02332580e-01 1.46755889e-01 9.23385918e-02]
 [1.86260208e-01 3.45560730e-01 3.96767467e-01]
 [5.38816750e-01 4.19194520e-01 6.85219526e-01]]
[[2. 5. 0.]
 [1. 0. 0.]
 [0. 1. 3.]
 [3. 2. 6.]]
[[2. 5. 0.]
 [1. 0. 0.]
 [0. 1. 3.]
 [3. 2. 6.]]
[[2. 5. 0.]
 [1. 0. 0.]
 [0. 1. 3.]
 [3. 2. 6.]]

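Look first at the warning message. GPU hardware acceleration only shows a clear benefit above a certain density of computation. If we merely add two numbers, there is no need for a GPU at all; if we add two very large arrays, however, a GPU can be extremely valuable. Because our example has only 4 atoms (12 blocks in total), the warning tells us that no GPU speedup can be expected here. For now we only care about the results: the three implementations produce identical cell indices, so next we can compare their speed.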
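The "Grid size (12)" in the warning is simply the number of blocks launched: the article uses `[(atoms,3),(1,1)]`, that is, 4 x 3 blocks with one thread each. The sketch below shows the usual ceiling-division arithmetic for choosing a launch configuration. The helpers `launch_config` and `total_blocks` and the 64 x 3 block shape are assumptions for illustration only, not something the original article uses.

```python
def launch_config(shape, threads_per_block):
    """Return (blocks_per_grid, threads_per_block) covering a 2-D array.

    Ceiling division guarantees blocks * threads >= array size on each axis.
    """
    blocks_per_grid = tuple(
        (n + t - 1) // t for n, t in zip(shape, threads_per_block)
    )
    return blocks_per_grid, tuple(threads_per_block)

def total_blocks(blocks_per_grid):
    """Number of thread blocks launched for a given grid shape."""
    total = 1
    for b in blocks_per_grid:
        total *= b
    return total

# The configuration used in this article: one thread per block.
blocks, threads = launch_config((4, 3), (1, 1))
print(blocks, total_blocks(blocks))  # (4, 3) 12

# A hypothetical alternative for 100000 atoms: 64 x 3 threads per block.
blocks, threads = launch_config((100000, 3), (64, 3))
print(blocks, total_blocks(blocks))  # (1563, 1) 1563
```

With more than one thread per block, the launched threads can overshoot the array (here 1563 x 64 = 100032 rows for 100000 atoms), so a kernel launched this way would need a guard such as `if i < grids.shape[0] and j < grids.shape[1]` before writing.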

# cuda_grid.py
from numba import jit
from numba import cuda
import numpy as np

def grid_by_cpu(crd, rxyz, atoms, grids):
    """Transform coordinates [x,y,z] into grids [nx,ny,nz].
    Args:
        crd(list): The 3-D coordinates of atoms.
        rxyz(list): The list includes xmin,ymin,zmin,grid_size.
        atoms(int): The total number of atoms.
        grids(list): The transformed grids matrix.
    """
    for i in range(atoms):
        grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3])
        grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3])
        grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3])
    return grids

@jit
def grid_by_jit(crd, rxyz, atoms, grids):
    """Transform coordinates [x,y,z] into grids [nx,ny,nz].
    Args:
        crd(list): The 3-D coordinates of atoms.
        rxyz(list): The list includes xmin,ymin,zmin,grid_size.
        atoms(int): The total number of atoms.
        grids(list): The transformed grids matrix.
    """
    for i in range(atoms):
        grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3])
        grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3])
        grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3])
    return grids

@cuda.jit
def grid_by_gpu(crd, rxyz, grids):
    """Transform coordinates [x,y,z] into grids [nx,ny,nz].
    Args:
        crd(list): The 3-D coordinates of atoms.
        rxyz(list): The list includes xmin,ymin,zmin,grid_size.
        grids(list): The transformed grids matrix.
    """
    i,j = cuda.grid(2)
    grids[i][j] = int((crd[i][j]-rxyz[j])/rxyz[3])

if __name__=='__main__':
    import time
    from tqdm import trange

    np.random.seed(1)
    atoms = 100000
    grid_size = 0.1
    crd = np.random.random((atoms,3)).astype(np.float32)
    xmin = min(crd[:,0])
    ymin = min(crd[:,1])
    zmin = min(crd[:,2])
    xmax = max(crd[:,0])
    ymax = max(crd[:,1])
    zmax = max(crd[:,2])
    xgrids = int((xmax-xmin)/grid_size)+1
    ygrids = int((ymax-ymin)/grid_size)+1
    zgrids = int((zmax-zmin)/grid_size)+1
    rxyz = np.array([xmin,ymin,zmin,grid_size], dtype=np.float32)
    crd_cuda = cuda.to_device(crd)
    rxyz_cuda = cuda.to_device(rxyz)

    cpu_time = 0
    jit_time = 0
    gpu_time = 0

    for i in trange(100):
        # Plain Python loop.
        grids = np.ones_like(crd)*(-1)
        grids = grids.astype(np.float32)
        time0 = time.time()
        grids_cpu = grid_by_cpu(crd, rxyz, atoms, grids)
        time1 = time.time()

        # numba JIT-compiled loop.
        grids = np.ones_like(crd)*(-1)
        time2 = time.time()
        grids_jit = grid_by_jit(crd, rxyz, atoms, grids)
        time3 = time.time()

        # GPU kernel; the host-to-device copy is outside the timed region.
        grids = np.ones_like(crd)*(-1)
        grids_cuda = cuda.to_device(grids)
        time4 = time.time()
        grid_by_gpu[(atoms,3),(1,1)](crd_cuda,
                                    rxyz_cuda,
                                    grids_cuda)
        # Kernel launches are asynchronous and no cuda.synchronize() is
        # called here, so time5 is taken right after the launch returns.
        time5 = time.time()

        # Skip the first iteration, which includes JIT/CUDA compilation.
        if i != 0:
            cpu_time += time1 - time0
            jit_time += time3 - time2
            gpu_time += time5 - time4

    print('The time cost of CPU calculation is: {}s'.format(cpu_time))
    print('The time cost of JIT calculation is: {}s'.format(jit_time))
    print('The time cost of GPU calculation is: {}s'.format(gpu_time))

The output is as follows:

$ python3 cuda_grid.py
100%|███████████████████████████| 100/100 [00:23<00:00,  4.18it/s]
The time cost of CPU calculation is: 23.01943016052246s
The time cost of JIT calculation is: 0.04810166358947754s
The time cost of GPU calculation is: 0.01806473731994629s

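At a system size of 100,000 atoms, the plain for loop is very inefficient: it takes 23 s in total over the 99 timed iterations (the first iteration is excluded). With numba JIT compilation the time drops to 0.048 s, and on the GPU it falls further to 0.018 s, roughly 2.7 times faster than the JIT version without GPU acceleration. This is still far from the upper limit of GPU acceleration, so the next test uses a much larger system.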
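The benchmark skips iteration 0 because the first call to a numba-compiled function also pays for compilation. The sketch below isolates that timing pattern in pure Python. The helper `time_after_warmup` and the stand-in `fake_jit_function` are hypothetical, and `time.perf_counter` is used here as an assumption in place of the article's `time.time`.

```python
import time

def time_after_warmup(func, repeats, *args):
    """Call func(*args) `repeats` times; sum the time of all but the first call."""
    total = 0.0
    for i in range(repeats):
        start = time.perf_counter()
        func(*args)
        elapsed = time.perf_counter() - start
        if i != 0:  # the first call is treated as warm-up and not counted
            total += elapsed
    return total

def fake_jit_function(state):
    """Stand-in for a JIT-compiled function: slow on the first call only."""
    if not state["compiled"]:
        time.sleep(0.02)  # pretend to compile
        state["compiled"] = True
    state["calls"] += 1

state = {"compiled": False, "calls": 0}
total = time_after_warmup(fake_jit_function, 5, state)
print(state["calls"])  # 5
print(total)           # time of calls 2..5 only, without the simulated compile
```

Passing `grid_by_jit` and its arguments instead of the stand-in would reproduce the JIT part of the measurement in the article.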

# cuda_grid.py
from numba import jit
from numba import cuda
import numpy as np

def grid_by_cpu(crd, rxyz, atoms, grids):
    """Transform coordinates [x,y,z] into grids [nx,ny,nz].
    Args:
        crd(list): The 3-D coordinates of atoms.
        rxyz(list): The list includes xmin,ymin,zmin,grid_size.
        atoms(int): The total number of atoms.
        grids(list): The transformed grids matrix.
    """
    for i in range(atoms):
        grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3])
        grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3])
        grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3])
    return grids

@jit
def grid_by_jit(crd, rxyz, atoms, grids):
    """Transform coordinates [x,y,z] into grids [nx,ny,nz].
    Args:
        crd(list): The 3-D coordinates of atoms.
        rxyz(list): The list includes xmin,ymin,zmin,grid_size.
        atoms(int): The total number of atoms.
        grids(list): The transformed grids matrix.
    """
    for i in range(atoms):
        grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3])
        grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3])
        grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3])
    return grids

@cuda.jit
def grid_by_gpu(crd, rxyz, grids):
    """Transform coordinates [x,y,z] into grids [nx,ny,nz].
    Args:
        crd(list): The 3-D coordinates of atoms.
        rxyz(list): The list includes xmin,ymin,zmin,grid_size.
        grids(list): The transformed grids matrix.
    """
    i,j = cuda.grid(2)
    grids[i][j] = int((crd[i][j]-rxyz[j])/rxyz[3])

if __name__=='__main__':
    import time
    from tqdm import trange

    np.random.seed(1)
    atoms = 5000000
    grid_size = 0.1
    crd = np.random.random((atoms,3)).astype(np.float32)
    xmin = min(crd[:,0])
    ymin = min(crd[:,1])
    zmin = min(crd[:,2])
    xmax = max(crd[:,0])
    ymax = max(crd[:,1])
    zmax = max(crd[:,2])
    xgrids = int((xmax-xmin)/grid_size)+1
    ygrids = int((ymax-ymin)/grid_size)+1
    zgrids = int((zmax-zmin)/grid_size)+1
    rxyz = np.array([xmin,ymin,zmin,grid_size], dtype=np.float32)
    crd_cuda = cuda.to_device(crd)
    rxyz_cuda = cuda.to_device(rxyz)

    jit_time = 0
    gpu_time = 0

    for i in trange(100):
        # numba JIT-compiled loop.
        grids = np.ones_like(crd)*(-1)
        time2 = time.time()
        grids_jit = grid_by_jit(crd, rxyz, atoms, grids)
        time3 = time.time()

        # GPU kernel; the host-to-device copy is outside the timed region.
        grids = np.ones_like(crd)*(-1)
        grids_cuda = cuda.to_device(grids)
        time4 = time.time()
        grid_by_gpu[(atoms,3),(1,1)](crd_cuda,
                                     rxyz_cuda,
                                     grids_cuda)
        # As before, no cuda.synchronize() is called before time5.
        time5 = time.time()

        # Skip the first iteration, which includes JIT/CUDA compilation.
        if i != 0:
            jit_time += time3 - time2
            gpu_time += time5 - time4

    print('The time cost of JIT calculation is: {}s'.format(jit_time))
    print('The time cost of GPU calculation is: {}s'.format(gpu_time))

In this case with 5,000,000 atoms the plain for loop is simply too slow to run, so we no longer time that part at all. The final output is as follows:

$ python3 cuda_grid.py
100%|███████████████████████████| 100/100 [00:09<00:00, 10.15it/s]
The time cost of JIT calculation is: 2.3743042945861816s
The time cost of GPU calculation is: 0.022843599319458008s

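At this scale the measured GPU time is about 100 times smaller than the JIT time (2.37 s versus 0.023 s), and the CPU baseline in this comparison already uses numba JIT compilation, which can itself be regarded as a heavily optimized implementation. One caveat: numba launches CUDA kernels asynchronously and the timed region contains no `cuda.synchronize()`, so the GPU figure may not include the full kernel execution time.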
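The speedup factors quoted in this article can be checked directly from the printed timings. The short sketch below does that arithmetic; the dictionary layout and the helper `speedup` are just an illustrative arrangement of the reported numbers.

```python
# Timings reported in this article, in seconds (summed over 99 timed iterations).
timings_100k = {
    "cpu": 23.01943016052246,
    "jit": 0.04810166358947754,
    "gpu": 0.01806473731994629,
}
timings_5m = {
    "jit": 2.3743042945861816,
    "gpu": 0.022843599319458008,
}

def speedup(baseline, accelerated):
    """Ratio of baseline time to accelerated time."""
    return baseline / accelerated

# 100,000 atoms: JIT versus GPU, roughly 2.7x.
print(speedup(timings_100k["jit"], timings_100k["gpu"]))

# 5,000,000 atoms: JIT versus GPU, roughly 104x.
print(speedup(timings_5m["jit"], timings_5m["gpu"]))
```

These ratios describe the timed regions as measured by the scripts; they do not by themselves account for data transfer or for any kernel work still pending when the timer stops.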

Summary

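In this article we introduced the value of the gridding algorithm in molecular dynamics simulations, together with several ways of implementing it. The plain for-loop implementation is slow, even though in terms of algorithmic complexity a single pass over the atoms is already optimal. numba JIT compilation on the CPU optimizes the computation very substantially. The same example also shows markedly different speedups on different hardware: with a GPU, the measured time was more than 100 times smaller than with the JIT version. It is also a typical example of implementing a GPU-accelerated algorithm in Python.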

Copyright Notice

This article was first published at: https://www.cnblogs.com/dechinphy/p/cuda-grid.html

Author ID: DechinPhy

More original articles: https://www.cnblogs.com/dechinphy/

Tip link: https://www.cnblogs.com/dechinphy/gallery/image/379634.html

Also published in the Tencent Cloud column: https://cloud.tencent.com/developer/column/91958
