2iSome years ago I started work on my first CUDA implementation of the Multiparticle Collision Dynamics (MPC) algorithm, a particle-in-cell code used to simulate hydrodynamic interactions between solvents and solutes. As part of this algorithm, a number of particle parameters are summed to calculate certain cell parameters. This was in the days of the Tesla GPU architecture (such as GT200 GPUs, Compute Capability 1.x), which had poor atomic operation performance. A linked list approach I developed worked well on Tesla and Fermi as an alternative to atomic adds but performed poorly on Kepler GPUs. However, atomic operations are much faster on the Kepler and Maxwell architectures, so it makes sense to use atomic adds.

These types of summations are not limited to MPC or particle-in-cell codes, but, to some extent, occur whenever data elements are aggregated by key. For data elements sorted and combined by key with a large number of possible values, pre-combining elements with the same key at warp level can lead to a significant speed-up. In this post, I will describe algorithms for speeding up your summations (or similar aggregations) for problems with a large number of keys where there is a reasonable correlation between the thread index and the key. This is usually the case for elements that are at least partially sorted. Unfortunately, this argument works in both directions: these algorithms are not for you if your number of keys is small or your distribution of keys is random.  To clarify: by a “large” number of keys I mean more than could be handled if all bins were put into shared memory.

Note that this technique is related to a previously posted technique called warp-aggregated atomics by Andrey Adinetz, and also to the post Fast Histograms Using Shared Atomics on Maxwell by Nikolay Sakharnykh. The main difference here is that we are aggregating many groups, each designated by a key (to compute a histogram, for example). So you could consider this technique “warp-aggregated atomic reduction by key”.

Double Precision Atomics and Warp Divergence

To achieve sufficient numerical precision, the natively provided single-precision atomicAdd is often inadequate. Unfortunately, using the atomicCAS loop to implement double precision atomic operations (as suggested in the CUDA C Programming guide) introduces warp divergence, especially when the order of the data elements correlates with their keys. However, there is a way to remove this warp divergence (and a number of atomic operations): pre-combine all the elements in each warp that share the same key and use only one atomic operation per key and warp.

Why Work With Warps?

Using a “per warp” approach has two practical benefits. First, thread in a warp can communicate efficiently using warp vote and __shfl (shuffle) operations. Second, there is no need for synchronization because threads in a warp work synchronously.

Applying the concepts shown later at the thread block level is significantly slower because of the slower communication through shared memory and the necessary synchronization.

Finding Groups of Keys Within a Warp

Various algorithms can be used to find the elements in a warp that share the same key (“peers”). A simple way is to loop over all different keys, but there are also hash-value based approaches using shared memory and probably many others. The performance and sometimes also the suitability of each algorithm depends on the type and distribution of the keys and the architecture of the target GPU.

The code shown in example 1 should work with keys of all types on Kepler and later GPUs (Compute Capability 3.0 and later). It takes the key of lane 0 as reference, distributes it using a __shfl() operation to all other lanes and uses a __ballot() operation to determine which lanes share the same key. These are removed from the pool of lanes and the procedure is repeated with the key of the first remaining lane until all keys in the warp are checked. For each warp, this loop obviously has as many operations as there are different keys. The function returns a bit pattern for each thread with the bits of its peer elements set. You can see a step-by-step example of how this works starting on slide 9 of my GTC 2015 talk.

template<typename G> __device__ __inline__ uint get_peers(G key) { uint peers=0; bool is_peer; // in the beginning, all lanes are available uint unclaimed=0xffffffff; do { // fetch key of first unclaimed lane and compare with this key         is_peer = (key == __shfl(key, __ffs(unclaimed) - 1)); // determine which lanes had a match         peers = __ballot(is_peer); // remove lanes with matching keys from the pool         unclaimed ^= peers; // quit if we had a match } while (!is_peer); return peers; }

Pre-Combining Peers

Ideally, the peer elements found can be added or otherwise combined in parallel in log2(max_n_peer) iterations if we use parallel tree-like reductions on all groups of peers. But with a number of interleaved trees traversed in parallel and only represented by bit patterns, it is not obvious which element should be added. There are at least two solutions to this problem:

  1. make it obvious by switching the positions of all elements back and forth for the calculation; or
  2. interpret the bit pattern to determine which element to add next.

The latter approach had a slight edge in performance in my implementation.

Note that Mark Harris taught us a long time ago to start reductions with the “far away” elements to avoid bank conflicts. This is still true when using shared memory, but with the __shfl() operations used here, we do not have bank conflicts, so we can start with our neighboring peers. (See the post “Faster Reductions on Kepler using SHFL”).

Parallel Reductions

Let’s start with some observations on the usual parallel reductions. Of the two elements combined in each operation, the one with the higher index is used and, therefore, obsolete. At each iteration, every second remaining element is “reduced away”. In other words, in iteration i (0 always being the first here), all elements with an index not divisible by 2i+1 are “reduced away”, which is equivalent to saying that in iteration i, elements are reduced away if the least-significant bit of their relative index is 2i.

Replacing thread lane index by a relative index among the thread’s peers, we can perform the parallel reductions over the peer groups as follows.

  • Calculate each thread’s relative index within its peer group.
  • Ignore all peer elements in the warp with a relative index lower or equal to this lane’s index.
  • If the least-significant bit of the relative index of the thread is not 2i (with i being the iteration), and there are peers with higher relative indices left, acquire and use data from the next highest peer.
  • If the least-significant bit of the relative index of the thread is 2i, remove the bit for the lane of this thread from the bit set of remaining peers.
  • Continue until there are no more remaining peers in the warp.

The last step may sound ineffective, but no thread in this warp will continue before the last thread is done. Furthermore, we need a more complex loop exit condition, so there’s no gain in a more efficient-looking implementation.

You can see an actual implementation in Example 2 and a step-by-step walkthrough starting on slide 19 of my GTC 2015 presentation.

template <typename F> __device__ __inline__ F reduce_peers(uint peers, F &x) { int lane = TX&31; // find the peer with lowest lane index int first = __ffs(peers)-1; // calculate own relative position among peers int rel_pos = __popc(peers << (32 - lane)); // ignore peers with lower (or same) lane index     peers &= (0xfffffffe << lane); while(__any(peers)) { // find next-highest remaining peer int next = __ffs(peers); // __shfl() only works if both threads participate, so we always do.         F t = __shfl(x, next - 1); // only add if there was anything to add if (next) x += t; // all lanes with their least significant index bit set are done uint done = rel_pos & 1; // remove all peers that are already done         peers & = ~__ballot(done); // abuse relative position as iteration counter         rel_pos >>= 1; } // distribute final result to all peers (optional)     F res = __shfl(x, first); return res; }

Performance Tests

To test the performance I set up an example with a simulation box of size 100x100x100 cells with 10 particles per cell and the polynomial cell index being the key element. This leads to one million different keys and is therefore unsuitable for direct binning in shared memory. I compared my algorithm (“warp reduction” in Figure 1) against an unoptimized atomic implementation and a binning algorithm using shared memory atomics with a hashing function to calculate bin indices per block on the fly.

For comparison I used three different distributions of keys:

  • completely random;
  • completely ordered by key;
  • ordered, then shifted with a 50% probability to the next cell in each direction. This is a test pattern resembling a common distribution during the MPC algorithm.

Tests were performed using single- and double-precision values. For single-precision numbers, the simple atomic implementation is always either faster or at least not significantly slower than other implementations. This is also true for double-precision numbers with randomly distributed keys, where any attempt to reduce work occasionally results in huge overhead. For these reasons single-precision numbers and completely randomly distributed keys are excluded in the comparison. Also excluded is the time used to calculate the peer bit masks, because it varies from an insignificant few percent (shared memory hashing for positive integer keys on Maxwell) to about the same order of magnitude as a reduction step (external loop as described above for arbitrary types, also on Maxwell). But even in the latter case we can achieve a net gain if it is possible to reuse the bit patterns at least once.


Figure 1: Performance comparison of the warp-aggregated atomic summation by key vs. simple atomic and shared atomic / shared hashed approaches on Kepler and Maxwell GPUs, for double-precision values.

Conclusion

For data elements sorted and combined by key with a large number of possible values, pre-combining same-key elements at warp level can lead to a significant speed-up. The actual gain depends on many factors like GPU architecture, number and type of data elements and keys, and re-usability. The changes in code are minor, so you should try it if you have a task fitting the criteria.

For related techniques, be sure to check out the posts Warp-Aggregated Atomics by Andrey Adinetz, and Fast Histograms Using Shared Atomics on Maxwell by Nikolay Sakharnykh.

Voting and Shuffling to Optimize Atomic Operations的更多相关文章

  1. 什么是Java中的原子操作( atomic operations)

    1.啥是java的原子性 原子性:即一个操作或者多个操作 要么全部执行并且执行的过程不会被任何因素打断,要么就都不执行. 一个很经典的例子就是银行账户转账问题: 比如从账户A向账户B转1000元,那么 ...

  2. Atomic operations on the x86 processors

    On the Intel type of x86 processors including AMD, increasingly there are more CPU cores or processo ...

  3. Adaptively handling remote atomic execution based upon contention prediction

    In one embodiment, a method includes receiving an instruction for decoding in a processor core and d ...

  4. Method and apparatus for an atomic operation in a parallel computing environment

    A method and apparatus for a atomic operation is described. A method comprises receiving a first pro ...

  5. A trip through the Graphics Pipeline 2011_13 Compute Shaders, UAV, atomic, structured buffer

    Welcome back to what’s going to be the last “official” part of this series – I’ll do more GPU-relate ...

  6. 原子操作(atomic operation)

    深入分析Volatile的实现原理 引言 在多线程并发编程中synchronized和Volatile都扮演着重要的角色,Volatile是轻量级的synchronized,它在多处理器开发中保证了共 ...

  7. C++11开发中的Atomic原子操作

    C++11开发中的Atomic原子操作 Nicol的博客铭 原文  https://taozj.org/2016/09/C-11%E5%BC%80%E5%8F%91%E4%B8%AD%E7%9A%84 ...

  8. boost atomic

    文档: http://www.boost.org/doc/libs/1_53_0/doc/html/atomic.html Presenting Boost.Atomic Boost.Atomic i ...

  9. centos 7.0 nginx 1.7.9 安装过程

    系统用的是centos 7.0最小化安装 我现在安装完了 写一下步骤 还没完全搞懂 首先安装GCC [root@localhost ~]# yum install -y gcc gcc-c++ 已加载 ...

随机推荐

  1. 【P3355】骑士共存问题(最大流+黑白染色,洛谷)

    这个题刚看上去就让人不禁想到一道叫做方格取数问题的题目,事实上也就是这么做,对棋盘黑白染色,然后黑格子连源点,白的连汇点,点权为1.然后判断一下黑格子能影响到的白格子,边权为inf,跑一遍最大流就可以 ...

  2. 【bzoj4806~bzoj4808】炮车马后——象棋四连击

    bzoj4806——炮 题目传送门:bzoj4806 这种题一看就是dp...我们可以设$ f[i][j][k] $表示处理到第$ i $行,有$ j $列没放炮,$ k $列只放了一个炮.接着分情况 ...

  3. 关于ENABLE_BITCODE

    pod 'TSVoiceConverter' 如果,设置了工程target的ENABLE_BITCODE为NO.但是,在真机上运行时,仍然提示类似于如下错误: URGENT: all bitcode ...

  4. Rotate List ,反转链表的右k个元素

    问题描述: Given a list, rotate the list to the right by k places, where k is non-negative. For example:G ...

  5. 修改Tomcat默认端口号,避免与IDEA冲突

    修改Tomcat默认端口号,避免与IDEA冲突 APT安装默认位置如下 /var/lib/tomcat8/conf 修改server.xml中的8080端口为8088或其他. 重启服务,试试看效果. ...

  6. 用Heartbeat实现HA集群

    HA即高可用(high avaliable),又被叫做双机热备,用于关键性业务,简单理解就是,有两台机器A和B,正常是A提供服务,B待机闲置,当A宕机或服务宕掉,会切换到B机器继续提供服务.常用实现高 ...

  7. LeetCode第[17]题(Java):Letter Combinations of a Phone Number

    题目:最长公共前缀 难度:EASY 题目内容: Given a string containing digits from 2-9 inclusive, return all possible let ...

  8. codeforces 820A. Mister B and Book Reading 解题报告

    题目链接:http://codeforces.com/problemset/problem/820/A 坑爹题目,坑爹题目,坑爹题目....汗 = =!  后台还110个 test 有个地方需要注意下 ...

  9. jquery二维码生成插件jquery.qrcode.js

    插件描述:jquery.qrcode.js 是一个能够在客户端生成矩阵二维码QRCode 的jquery插件 ,使用它可以很方便的在页面上生成二维条码. 转载于:http://www.jq22.com ...

  10. 阿里云上如何利用yum安装jenkins

    一. 安装jdk 确保安装jenkins前jdk已经安装,如何安装见<如何在阿里云上部署war包到tomcat服务器> 二. 安装jenkins 使用以下命令安装jenkins: wget ...