In this post, I’ll introduce warp-aggregated atomics, a useful technique to improve performance when many threads atomically add to a single counter. In warp aggregation, the threads of a warp first compute a total increment among themselves, and then elect a single thread to atomically add the increment to a global counter. This aggregation reduces the number of atomics performed by up to the number of threads in a warp (up to 32x on current GPUs), and can dramatically improve performance. Moreover, in many typical cases, you can implement warp aggregation as a drop-in replacement for standard atomic operations, so it is useful as a simple way to improve performance of complex applications.

Problem: Filtering by a Predicate

Let's consider the following problem, which we call filtering: given a source array, src, containing n elements, and a predicate, we need to copy all elements of src satisfying the predicate into the destination array, dst. For the sake of simplicity, assume that dst has a length of at least n and that the order of elements in the dst array does not matter. For our example, we assume that the array elements are integers, and the predicate is true if and only if the element is positive. Here is a sample CPU implementation of filtering.

int filter(int *dst, const int *src, int n) {
  int nres = 0;
  for (int i = 0; i < n; i++)
    if (src[i] > 0)
      dst[nres++] = src[i];
  // return the number of elements copied
  return nres;
}

Filtering, also known as stream compaction, is a common operation, and it is a part of the standard libraries of many programming languages, where it goes under a variety of names, including grep, copy_if, select, FindAll and so on. It is also very often implemented simply as a loop, as it may be very tightly integrated with the surrounding code.
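For reference, here is a minimal C++ sketch (an illustration, not code from the original post) of the same filter written with the standard library's copy_if, one of the names listed above; the lambda predicate keeps positive elements.

#include <algorithm>

int filter_std(int *dst, const int *src, int n) {
  // copy_if writes every element satisfying the predicate to dst and
  // returns a pointer one past the last element written
  int *end = std::copy_if(src, src + n, dst, [](int x) { return x > 0; });
  return (int)(end - dst);
}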

Solutions with Global and Shared Memory

Now, what if we want to implement filtering on a GPU, and process the elements of the array src in parallel? A straightforward approach is to use a single global counter and atomically increment it for each new element written into the dst array. A GPU implementation of this may look as follows.

__global__
void filter_k(int *dst, int *nres, const int *src, int n) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if(i < n && src[i] > 0)
    dst[atomicAdd(nres, 1)] = src[i];
}
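Before launching this kernel, the global counter must be zeroed, since every thread atomically adds to it. A minimal host-side sketch follows (the launch configuration and function name are illustrative assumptions, not code from the original post).

void run_filter(int *d_dst, int *d_nres, const int *d_src, int n) {
  cudaMemset(d_nres, 0, sizeof(int));            // reset the global counter
  int block = 256;
  int grid = (n + block - 1) / block;            // one thread per element
  filter_k<<<grid, block>>>(d_dst, d_nres, d_src, n);
  int nres = 0;
  cudaMemcpy(&nres, d_nres, sizeof(int), cudaMemcpyDeviceToHost);
  // nres now holds the number of elements copied to d_dst
}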

The main problem with this implementation is that all threads in the grid that read positive elements from src increment a single counter, nres. Depending on the number of positive elements, this may be a very large number of threads. Therefore, the degree of collisions for atomicAdd() is high, which limits performance. You can see this in Figure 1, which plots the kernel bandwidth (counting both reads and writes, but not atomics) achieved on a Kepler K40 GPU when processing 16 Mi (2^24) elements.

Figure 1: Performance of filtering with global atomics on Kepler K40 GPU.

The bandwidth is inversely proportional to the number of atomics executed, which in turn grows with the fraction of positive elements in the array. While performance is acceptable (40 GiB/s) for a 5% fraction, it drops drastically when more elements pass the filter, to just 7 GiB/s for a 50% fraction. Atomic operations are clearly a bottleneck, and need to be removed or reduced to increase application performance.

One way to improve filtering performance is to use shared memory atomics. The counter is then shared only between the threads of a single block, which reduces the degree of collisions, and only one global atomicAdd() is needed per thread block. Here is a kernel implemented with this approach.

__global__
void filter_shared_k(int *dst, int *nres, const int* src, int n) {
  __shared__ int l_n;
  int i = blockIdx.x * (NPER_THREAD * BS) + threadIdx.x;

  for (int iter = 0; iter < NPER_THREAD; iter++) {
    // zero the counter
    if (threadIdx.x == 0)
      l_n = 0;
    __syncthreads();

    // get the value, evaluate the predicate, and
    // increment the counter if needed
    int d, pos;
    if(i < n) {
      d = src[i];
      if(d > 0)
        pos = atomicAdd(&l_n, 1);
    }
    __syncthreads();

    // leader increments the global counter
    if(threadIdx.x == 0)
      l_n = atomicAdd(nres, l_n);
    __syncthreads();

    // threads with true predicates write their elements
    if(i < n && d > 0) {
      pos += l_n; // increment local pos by global counter
      dst[pos] = d;
    }
    __syncthreads();

    i += BS;
  }
}
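The kernel above assumes two compile-time constants that the post does not spell out: BS, the block size, and NPER_THREAD, the number of elements each thread processes. A possible host-side setup, given as a sketch under those assumptions, looks like this.

#define BS 256
#define NPER_THREAD 4

void run_filter_shared(int *d_dst, int *d_nres, const int *d_src, int n) {
  cudaMemset(d_nres, 0, sizeof(int));                      // reset the global counter
  int elems_per_block = NPER_THREAD * BS;                  // elements covered by one block
  int grid = (n + elems_per_block - 1) / elems_per_block;
  filter_shared_k<<<grid, BS>>>(d_dst, d_nres, d_src, n);
}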

Another approach is to first use a parallel prefix sum to compute the output index of each element. Thrust's copy_if() function uses an optimized version of this approach. Performance of both approaches on the Kepler K40 is presented in Figure 2. Though shared memory atomics improve filtering performance, the gain stays within 2-3x of the original approach; atomics remain a bottleneck, because their total number hasn't changed, they have simply moved to shared memory. Thrust is better than both approaches for high filtering fractions, but incurs large upfront costs that are not amortized for small filtering fractions.

It is important to note that the comparison to Thrust is not apples-to-apples, because Thrust implements a stable filter: it preserves the relative order of the input elements in the output. This stability follows from its prefix-sum implementation, but it also makes the operation more expensive. If we don't need a stable filter, a purely atomic approach is simpler and performs less work.
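For completeness, here is a minimal Thrust sketch of such a stable filter using copy_if (an illustration of the API, not the benchmark code behind Figure 2).

#include <thrust/copy.h>
#include <thrust/execution_policy.h>

struct is_positive {
  __host__ __device__ bool operator()(int x) const { return x > 0; }
};

int thrust_filter(int *d_dst, const int *d_src, int n) {
  // copy_if preserves the input order of the selected elements (stable filter)
  int *end = thrust::copy_if(thrust::device, d_src, d_src + n, d_dst, is_positive());
  return (int)(end - d_dst);
}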

Figure 2: Performance of filtering with shared memory atomics on Kepler K40 GPU.

Warp-Aggregated Atomics

Warp aggregation is the process of combining atomic operations from multiple threads in a warp into a single atomic. This approach is orthogonal to using shared memory: the type of the atomics remains the same, but we use fewer of them. With warp aggregation, we replace atomic operations with the following steps.

  1. Threads in the warp elect a leader thread.
  2. Threads in the warp compute the total atomic increment for the warp.
  3. The leader thread performs an atomic add to compute the offset for the warp.
  4. The leader thread broadcasts the offset to all other threads in the warp.
  5. Each thread adds its own index within the warp to the warp offset to get its position in the output array.

After that, each thread proceeds as in the original code, and writes its value to its position in the dst array. Let’s now consider each of the steps in detail.

Step 1: Leader Election

First, we need a way to identify threads, or lanes in a warp. For the sake of simplicity, we assume one-dimensional thread blocks (or at least that all threads in a warp have the same threadIdx.y and threadIdx.z). The lane identifier can then be computed as follows.

#define WARP_SZ 32
__device__
inline int lane_id(void) { return threadIdx.x % WARP_SZ; }
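Alternatively (an assumption on my part, not something the post uses), the lane index can be read directly from the %laneid special register with inline PTX, which avoids relying on the block shape.

__device__ inline int lane_id_ptx(void) {
  unsigned int lid;
  // %laneid is a read-only PTX special register holding the lane index (0-31)
  asm("mov.u32 %0, %%laneid;" : "=r"(lid));
  return (int)lid;
}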

The simplest approach is to always elect lane 0. In our example, this requires code reorganization so that lane 0 is always active during the warp-aggregated atomic. While for filtering code this reorganization is simple, in other applications, where the atomic counter increment may be performed inside a deep nest of conditionals, this can be much more complicated. To create a drop-in replacement for atomicAdd(), we need an alternative approach to leader election. Specifically, we elect the lowest active lane. This can be done using two CUDA intrinsics. __ballot(int v), available starting from Compute Capability 2.0 (sm_20), returns a mask in which each bit indicates whether the predicate v is true for the corresponding lane in the warp (the bit is unset for inactive lanes). __ffs(int v) (“find first set bit”) returns the 1-based position of the lowest bit that is set in v, or 0 if no bit is set. The code for leader selection thus looks as follows.

int mask = __ballot(1);  // mask of active lanes
int leader = __ffs(mask) - 1; // -1 for 0-based indexing

Step 2: Computing the Total Increment

For our filtering example, each thread with a true predicate increments the counter by 1. The total increment for the warp is equal to the number of active lanes which hold true predicates, which we can compute by counting the number of bits set in the mask returned by __ballot(1). For this, we use the __popc(int v) intrinsic, which returns the number of bits set in the binary representation of integer v. (We don’t consider here the case of increments that vary across lanes). The following code computes the total increment.

int change = __popc(mask);
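As a concrete example, if lanes 0, 3, and 7 are the only lanes of the warp that reach the aggregated atomic (that is, the only lanes with positive elements), then __ballot(1) returns the mask 0x89 (binary 10001001), __ffs(0x89) - 1 selects lane 0 as the leader, and __popc(0x89) = 3 is the single increment the leader will add to the counter.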

Step 3: Performing the Atomic

Only the leader lane performs the atomic operation, using the following code.

int res;
if(lane_id() == leader)
  res = atomicAdd(ctr, change); // ctr is the pointer to the counter

Step 4: Broadcasting the Result

In this step, the leader thread broadcasts the result of the atomicAdd() to the other lanes in the warp. On Kepler and later GPUs (CC 3.0 and above), we can implement the broadcast efficiently using the warp shuffle intrinsic. The code looks as follows.

__device__ int warp_bcast(int v, int leader) { return __shfl(v, leader); }

We use warp_bcast() to share the atomic result with all threads in the warp.

res = warp_bcast(res, leader);
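On pre-Kepler devices (CC 2.x), which have __ballot() but no shuffle, the broadcast could instead go through a per-warp slot in shared memory. The following is a sketch under that assumption (it is not part of the original post); it relies on warp-synchronous execution and volatile shared memory, which is valid on those architectures but would additionally need __syncwarp() on Volta and later.

__device__ int warp_bcast_smem(int v, int leader) {
  __shared__ volatile int slot[32];        // one slot per warp; at most 32 warps per block
  int warp = threadIdx.x / WARP_SZ;
  if (lane_id() == leader)
    slot[warp] = v;                        // leader publishes its value
  return slot[warp];                       // all lanes of the warp read it back
}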

Step 5: Computing the Index for Each Lane

Finally, we must compute the output index for each lane by adding the broadcast counter value for the warp to the lane's intra-warp offset. We define the intra-warp offset as the number of active lanes with IDs lower than the current lane's ID, which is the number of bits set in the active-lane mask at positions below the current lane. We can compute it by masking out the higher lane positions and applying the __popc() intrinsic to the result. In code, that looks like the following.

res + __popc(mask & ((1 << lane_id()) - 1))

We can now join the pieces of the code defined above to get the full warp-aggregated version of the increment function.

// warp-aggregated atomic increment
__device__
int atomicAggInc(int *ctr) {
  int mask = __ballot(1);
  // select the leader
  int leader = __ffs(mask) - 1;
  // leader does the update
  int res;
  if(lane_id() == leader)
    res = atomicAdd(ctr, __popc(mask));
  // broadcast result
  res = warp_bcast(res, leader);
  // each thread computes its own value
  return res + __popc(mask & ((1 << lane_id()) - 1));
} // atomicAggInc
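On CUDA 9.0 and later, the __ballot() and __shfl() intrinsics used above are deprecated in favor of their *_sync counterparts, which take an explicit mask of participating lanes. A sketch of the same function in that style (an update I am assuming, not part of the original post, which targets CUDA 6.5) looks like this.

__device__ int atomicAggInc_sync(int *ctr) {
  unsigned int active = __activemask();            // lanes currently active in the warp
  int leader = __ffs(active) - 1;                  // lowest active lane leads
  int change = __popc(active);                     // total increment for the warp
  int res;
  if (lane_id() == leader)
    res = atomicAdd(ctr, change);                  // one atomic per warp
  res = __shfl_sync(active, res, leader);          // broadcast the leader's result
  return res + __popc(active & ((1u << lane_id()) - 1));
}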

Performance Comparison

The warp-aggregated atomic increment function is a drop-in replacement for atomicAdd(ctr, 1) where ctr is the same across all threads of a warp. Therefore, we can rewrite GPU filtering using atomicAggInc() as follows.

__global__ void filter_k(int *dst, int *nres, const int *src, int n) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if(i >= n)
    return;
  if(src[i] > 0)
    dst[atomicAggInc(nres)] = src[i];
}

Note that though we defined warp aggregation with global atomics in mind, nothing precludes doing the same for shared memory atomics. In fact, the atomicAggInc(int *ctr) function defined above works if ctr is a pointer to shared memory. Warp aggregation can thus also be used to accelerate filtering with shared memory. Figure 3 shows a performance comparison of different variants of filtering with and without warp aggregation for a Kepler GPU, and Figure 4 shows the results for a Maxwell GPU.
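As a concrete illustration of that point, the shared-memory kernel from earlier can replace its per-element atomicAdd(&l_n, 1) with atomicAggInc(&l_n). The following sketch is an assumption on my part and corresponds to the shared-memory-with-warp-aggregation variant measured in Figures 3 and 4, though the exact benchmark code may differ.

__global__ void filter_shared_agg_k(int *dst, int *nres, const int *src, int n) {
  __shared__ int l_n;
  int i = blockIdx.x * (NPER_THREAD * BS) + threadIdx.x;

  for (int iter = 0; iter < NPER_THREAD; iter++) {
    if (threadIdx.x == 0)
      l_n = 0;                       // zero the per-block counter
    __syncthreads();

    int d, pos;
    if (i < n) {
      d = src[i];
      if (d > 0)
        pos = atomicAggInc(&l_n);    // warp-aggregated shared memory atomic
    }
    __syncthreads();

    if (threadIdx.x == 0)
      l_n = atomicAdd(nres, l_n);    // one global atomic per block per iteration
    __syncthreads();

    if (i < n && d > 0) {
      dst[pos + l_n] = d;            // offset local position by the block's global offset
    }
    __syncthreads();

    i += BS;
  }
}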

Figure 3: Performance of different filtering variants on Tesla K40 (Kepler) GPU (CUDA 6.5, driver version 340.29).

Figure 4: Performance of different filtering variants on GeForce GTX 750 Ti (Maxwell) GPU (CUDA 6.5, driver version 340.29).

For Kepler GPUs, the version with warp-aggregated global atomics is the clear winner. It always provides more than 60 GiB/s bandwidth, and the bandwidth actually increases with the fraction of elements that pass the filter, which indicates that atomics are no longer a significant bottleneck. Compared to plain global atomics, performance improves by up to 20x. Since a simple copy operation on the same GPU achieves 180 GiB/s, the performance of filtering with warp-aggregated atomics is comparable (in fact, within a factor of 2x) to that of a simple copy, which means filtering can now be used in performance-critical portions of code. Also note that shared memory atomics with warp aggregation are actually slower than warp-aggregated global atomics. This indicates that warp aggregation already does a very good job, and adding shared memory brings no benefit, only extra overhead.

For Maxwell, global aggregated atomics also provide the best performance. However, we have only tested on a small Maxwell GPU so far, with just 5 SMs. Having more SMs could increase the performance of the shared memory versions. We hope to test on a GM204 GPU soon. Another thing to note about Maxwell is that it is the first GPU architecture to implement fast shared memory atomics; Kepler and earlier GPUs used an expensive lock/update/unlock sequence for shared memory atomics. On Maxwell, you can see that for a low collision fraction, performance of the non-aggregated shared memory version is good, but it decreases with increasing collision rate. Therefore, warp aggregation is still valuable for our example.

For both GPU generations, warp aggregation improves the performance of both global and shared memory atomics, and the warp-aggregated global variant achieves the best overall performance. Moreover, with warp aggregation the achieved bandwidth actually increases with the fraction of filtered elements (since more elements are written), whereas without it the bandwidth decreases.

Conclusion

Warp aggregation of atomics is a useful technique to improve the performance of applications that perform many atomic operations on a small number of counters. In this post we applied warp aggregation to filtering and obtained more than an order-of-magnitude performance improvement. An important point is that the warp-aggregated atomic function, atomicAggInc(), is logically separate from the rest of the application, and can be used as a drop-in replacement for a simple atomic addition whenever all threads in a warp increment the same counter by the same value.

Warp-aggregated atomics are by no means limited to filtering; you can apply the technique to many other applications that make heavy use of atomic operations.
