C++11: Multi-core Programming – PPL Parallel Aggregation Explained
https://katyscode.wordpress.com/2013/08/17/c11-multi-core-programming-ppl-parallel-aggregation-explained/
In the first part of this series we looked at general multi-threading and multi-core programming concepts without getting into the meat of any real problems, and in the second part we looked at the theory and application of the parallel aggregation pattern, applying classical parallel programming techniques with either C++11’s standard thread library or Boost. In this part we shall solve exactly the same summation problem from part 2 (adding all the numbers from 0 to 1 billion) but use the new Parallel Patterns Library (PPL) to do so. We will first look at a direct port using classical parallel programming techniques, then find out how to leverage new language features in C++11 which are employed by PPL to simplify the code and write parallel algorithms using an entirely new paradigm. We shall also look at some fundamental differences between PPL and other multi-threading/multi-core libraries, the pros and cons, and some gotchas to be aware of. This tutorial will assume you have at least read the non-coding sections of part 2 of this series, so that you understand the problem we are trying to solve, the underlying algorithm and some of the problems that can crop up. Once again, we will use a Core i7-2600K CPU with 6GB of DDR-1600 RAM when benchmarking performance. This article is primarily focused on Windows, but a related library called PPLx exists which provides very similar functionality for Linux, so any code below which is not Windows-specific will apply equally to PPLx.
The Concurrency Runtime and PPL
New versions of Windows ship with a sub-system known as the Concurrency Runtime, or ConcRT for short. ConcRT is a thin layer that sits between the operating system and applications and provides services specifically relating to multi-core programming. Various tools and libraries are provided with new versions of Visual Studio allowing us to take advantage of ConcRT; here we will look at just one, the Parallel Patterns Library or PPL for short. PPL is an abstraction layer which sits on top of ConcRT and leverages the new syntax of C++11 to hide much of the underlying complexity. While the feature set of ConcRT and PPL is vast, there are a few key advantages worth mentioning:
- PPL’s programming paradigm does away with synchronization objects such as critical sections and mutexes almost entirely
- PPL allows you to write parallel code without having to manage the creation and tear-down of threads yourself
- PPL allows (in many cases) sequential algorithms to be spread across multiple cores without having to re-design the algorithm significantly
ConcRT differs from the traditional paradigm, in which the application is left to distribute specific-sized parcels of work across a specific number of threads, by using something called a work-stealing algorithm. This is essentially a form of load balancing: consider the case where you allocate 5 units of work to thread A and 2 units to thread B. Traditionally, if we assume thread B finishes first, we end up waiting on thread A. ConcRT will automatically detect when a thread finishes its work and transfer (steal) work from another thread with a remaining backlog (in this case, thread B will steal work from thread A) to minimize the amount of waiting and thread idling, and therefore minimize the total execution time. In addition, this frees the application from having to compute or estimate the optimum number of threads or cores to use: the load balancing in ConcRT will take care of it automatically. If you do need finer-grained control, you can use partitioning strategies which are discussed below. PPL also provides objects which can be used to store and combine partial results from a parallel aggregation task in a thread-safe manner. More on this below.
Using Task Groups to perform Parallel Aggregation the old-fashioned way
As we explored in part 2 of this series, the classical way to code for parallel aggregation of a single large computational task is:
- determine how many cores to use
- calculate the size of sub-task to allocate to each core (splitting the main task up into several smaller sub-tasks)
- start each sub-task on a new core/thread
- wait for all the threads to finish
- combine the partial results returned from each thread into the final result
You can accomplish roughly the same feat using task groups in PPL, with the proviso that ConcRT will select the number of threads and cores for you. The basic code is very similar to the C++11 example in part 2 and looks like this:
#include <ppl.h>       // for PPL (concurrency::task_group)
#include <chrono>      // for std::chrono::high_resolution_clock
#include <cstdint>     // for uint64_t
#include <iostream>
#include <vector>      // for std::vector
#include <algorithm>   // for std::for_each
#include <cassert>     // for assert
#include <concrtrm.h>  // for GetProcessorCount()

using namespace concurrency;

std::vector<uint64_t *> part_sums;
int cores_to_use;
const int max_sum_item = 1000000000;

void per_core_sum(uint64_t *final, int start_val, int sums_to_do)
{
    uint64_t sub_result = 0;

    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;

    *final = sub_result;
}

int main()
{
    for (unsigned int cores_to_use = 1; cores_to_use <= GetProcessorCount(); cores_to_use++)
    {
        task_group tasks;

        part_sums.clear();
        for (unsigned int i = 0; i < cores_to_use; i++)
            part_sums.push_back(new uint64_t(0));

        auto start = std::chrono::high_resolution_clock::now();

        int sums_per_thread = max_sum_item / cores_to_use;

        for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
        {
            // Lump extra bits onto the last thread if the work items are not equally divisible by the number of threads
            int sums_to_do = sums_per_thread;

            if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
                sums_to_do = max_sum_item - start_val;

            tasks.run([start_val, sums_to_do, i] { per_core_sum(part_sums[i], start_val, sums_to_do); });

            if (sums_to_do != sums_per_thread)
                break;
        }

        tasks.wait();

        auto end = std::chrono::high_resolution_clock::now();

        std::cout << "Time taken with " << cores_to_use << " core" << (cores_to_use != 1 ? "s" : "") << ": "
                  << (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num
                                              / std::chrono::high_resolution_clock::period::den)
                  << std::endl;

        uint64_t result = 0;
        std::for_each(part_sums.begin(), part_sums.end(), [&result] (uint64_t *subtotal) { result += *subtotal; });

        assert(result == uint64_t(499999999500000000));

        for (unsigned int i = 0; i < cores_to_use; i++)
            delete part_sums[i];
    }
}
Figure 1. Comparison between PPL task groups, C++ thread library and a single-threaded implementation
Just 6 lines of code differ from the C++ standard thread library version to make this application work using PPL. Let’s walk through it. We start by including ppl.h to get access to PPL, and place using namespace concurrency at the top of our code to save on typing a lot of concurrency:: scope resolution operators later. per_core_sum is identical to do_partial_sum in the C++11 example in part 2. We’re not using the C++ standard thread library here, so instead of using std::thread::hardware_concurrency() to fetch the number of cores, we use GetProcessorCount(), which is defined in concrtrm.h. The only other changes are PPL-specific: we get rid of the vector of std::thread pointers and replace it with a PPL task_group object which we have called tasks. The key line which starts new threads is replaced as follows:
tasks.run([start_val, sums_to_do, i] { per_core_sum(part_sums[i], start_val, sums_to_do); });
The run() method of task_group takes a single parameter-less function with no return value to start on a new thread. By using a C++11 lambda expression, we can capture the values we want to pass as parameters to per_core_sum and cause it to be executed with these parameters when the thread starts. The thread details are stored as a task object in the task group. Notice that when you group tasks into a task group in this way, the work-stealing algorithm mentioned earlier comes into effect, so if one thread finishes its tasks while other threads still have tasks remaining, work will be stolen from one of those threads to keep all threads busy and minimize the total execution time. This is all done automatically for you. The join operation (which waits for each thread to finish) is replaced with:
tasks.wait();
In the version using the C++ thread library, we iterate over all the std::threads and call join() on them to wait for them to finish one at a time. Since a PPL task group manages all the threads involved, there is no need for a for loop; wait() will block until all of the tasks in the task group have finished, then return. The actual algorithm used to apportion the work into equal chunks (besides possibly the last chunk) and the method of aggregating the partial results into a final result is identical in both concept and code to the C++11 standard thread library example in part 2 – refer to that article for the full details.
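For reference, the loop that tasks.wait() replaces looks roughly like this in the thread library version. This is only a sketch, not the exact code from part 2: it reuses per_core_sum, part_sums, cores_to_use and sums_per_thread from the listing above, requires the <thread> header, and assumes the work divides evenly among the cores.

// Approximate std::thread equivalent of the task group (sketch only)
std::vector<std::thread *> threads;

for (unsigned int i = 0; i < cores_to_use; i++)
    threads.push_back(new std::thread(per_core_sum, part_sums[i], (int)(i * sums_per_thread), sums_per_thread));

// Each thread must be joined individually; PPL's tasks.wait() replaces this entire loop
for (auto thread : threads)
{
    thread->join();
    delete thread;
}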
Performance-wise, changing the use of std::thread to PPL tasks results in comparable execution times (see figure 1). If the work unit sizes were more uneven, the work-stealing algorithm in ConcRT would make the task group version slightly faster, but in this case where the work unit sizes are the same, performance is more or less identical to managing the threads directly.
Implementing Parallel Aggregation with parallel_for
PPL provides a number of functions such as parallel_invoke, parallel_for, parallel_for_each, parallel_reduce and so on to enable common constructs and aggregation operations to be spread over multiple cores automatically. The simplest solution to our summation problem using parallel_for looks like this:
#include <iostream>
#include <cstdint>
#include <functional>  // for std::plus
#include <ppl.h>

using namespace Concurrency;

const int max_sum_item = 1000000000;

int main()
{
    combinable<uint64_t> part_sums([] { return 0; });

    parallel_for(0, max_sum_item, [&part_sums] (int i)
    {
        part_sums.local() += i;
    });

    uint64_t result = part_sums.combine(std::plus<uint64_t>());

    if (result != uint64_t(499999999500000000))
        throw;
}
Pretty cool, huh? We have written the program in a similar way to how we would code it if we were writing a sequential (single-core/single-thread) algorithm, and it’s really short! We have done away with our old std::vector<uint64_t *> part_sums which stored the partial results from each thread, and replaced it with an instance of PPL’s combinable template class. combinable is great for several reasons:
- we don’t need to know in advance how many threads will be used
- by using the local() method we can read and write a thread-private or thread-local instance of the object’s template type parameter (uint64_t in this case). We don’t need to worry about synchronization objects, allocating/releasing memory for partial results, or providing each thread function with a pointer to its own partial result storage. We can just call local() to access our result storage location.
- when a new thread starts, a new instance of the object’s template type parameter is initialized using the function passed to combinable‘s constructor
- when all the threads are finished, you can use combine() (or combine_each()) with any signature-compatible function to aggregate the final result
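As an aside, here is a minimal, self-contained sketch of combine_each(), which visits each thread-local value in turn rather than folding them with a binary function. The small range and the variable names are purely illustrative and not part of the summation example:

#include <ppl.h>
#include <cstdint>
#include <iostream>

using namespace concurrency;

int main()
{
    combinable<uint64_t> sums([] { return 0; });  // each thread's value starts at 0

    parallel_for(0, 1000, [&sums] (int i) { sums.local() += i; });

    // combine_each() visits every thread-local value; here we fold them into a total ourselves
    uint64_t total = 0;
    sums.combine_each([&total] (uint64_t partial) { total += partial; });

    std::cout << "total = " << total << std::endl;  // 499500 for 0..999
}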
In the summation code above, we initialize each new partial result element with 0 by using a C++11 lambda function which returns 0, and combine the final results using std::plus, which adds together (sums) its arguments. Let’s take a closer look at the main calculation code:
parallel_for(0, max_sum_item, [&part_sums] (int i)
{
    part_sums.local() += i;
});
Here is how the equivalent sequential code would look (a standard for loop):
uint64_t total = 0;

for (int i = 0; i < max_sum_item; i++)
{
    total += i;
}
parallel_for iterates from the value passed in the first argument (0) to the value passed in the second argument exclusive (max_sum_item), calling the function passed in the third argument on each iteration with the current value, just like a standard for loop. The difference is that parallel_for splits the work over multiple threads/cores, and it is here we can see where the beauty lies: the algorithm inside the loop is exactly the same in both cases, and can be written in-line thanks to C++11 lambda expressions. The only difference is that in the parallel_for loop we add each value to our thread-private partial result rather than to a single shared total. If we did update a shared total directly, we would have to lock it with a critical section or mutex prior to each update; the cost of synchronization is so high compared to the complexity of the addition that the ultimate performance would be significantly worse than a single-threaded implementation, as the following code shows:
CRITICAL_SECTION cs_total;

...

InitializeCriticalSection(&cs_total);

uint64_t total = 0;

parallel_for(0, max_sum_item, [&total] (int i)
{
    EnterCriticalSection(&cs_total);
    total += i;
    LeaveCriticalSection(&cs_total);
});

DeleteCriticalSection(&cs_total);
Figure 2. The bottleneck caused by using a critical section to access a global variable from a parallel_for construct is massive
On the benchmark machine, the version using the critical section runs approximately 15 times slower than the version without – see figure 2. Watching CPU resource use during execution also shows that the CPU is only under an approximate 20% load in total, meaning that it is spending 80% of its time stalling and waiting for critical sections to be exited in other threads, whereas the basic parallel_for loop without critical sections puts the CPU under the desired 100% load, using all available processing resources.
Unfortunately, even without using a critical section, the performance of the parallel_for loop in the example above is very slow compared to our expectations, as figure 2 again shows. While performance on the benchmark machine using the C++11 standard thread library and PPL task groups is comparable (0.08 seconds), and a single-threaded sequential for loop is slower as expected (0.21 seconds), using the parallel_for loop takes a whopping 2.4 seconds – over 10 times slower than the single-threaded for loop! What is going on here? Read on to find out.
Small loop bodies, large management overheads
The problem with our code is that the loop body is so short and simple and does so little that the vast majority of the processing time is taken up by overheads:
- function call cost overhead
- using the thread-private storage location (part_sums.local())
- worker thread creation and tear-down
Function call cost overhead
The first problem can be solved by making each call to the thread function (the function we pass to parallel_for) do more work. Rather than only adding one number to the total, we can parcel up blocks of work similar to the C++11 standard thread library and PPL task group examples and make each function call do a block of additions. Each thread will then store its own partial result in a combinable object and we’ll sum the partial results when all the threads have finished, just like before. Note that this somewhat defeats the point of using parallel_for to simplify parallel programming, because we are still having to re-design the algorithm, but it is only a problem with small loop bodies; more typical, complex loop bodies will work just fine and see a good performance benefit using the basic technique in the previous section, without further modification. However, the issue with small loop bodies is a common gotcha so I thought it worth exploring further.
Our first stab at the solution might look like this:
combinable<uint64_t> part_sums([] { return 0; });

int sums_to_do = max_sum_item / GetProcessorCount();

parallel_for(0, max_sum_item, sums_to_do, [&part_sums, sums_to_do] (int start_val)
{
    for (int i = start_val; i < start_val + sums_to_do; i++)
        part_sums.local() += i;
});

uint64_t result = part_sums.combine(std::plus<uint64_t>());
We have essentially replaced the single addition with a standard for loop that performs a series of additions (I have called this kind of construct a ‘chunked parallel_for loop’ in figure 3, as we are splitting the work into chunks). Here we have split the work equally across a number of threads equal to the number of processor cores (as returned by GetProcessorCount()), but you can use any number of threads you choose.
Note that we have now lost the flexibility of not having to worry about which thread does how much work: if all of the work unit sizes are not the same (for example if GetProcessorCount() does not divide exactly into the problem size, or if you replace the call with some arbitrary number of threads which does not divide exactly into the problem size), you will get an incorrect result; one possible fix is sketched below.
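If you do need to handle a problem size that does not divide evenly, one option (a sketch only, not code from the original article) is to clamp the end of each chunk to max_sum_item, so the final, shorter chunk only performs the iterations that actually exist:

int sums_to_do = max_sum_item / GetProcessorCount();    // may leave a remainder

parallel_for(0, max_sum_item, sums_to_do, [&part_sums, sums_to_do] (int start_val)
{
    // Clamp this chunk's end so a short final chunk is still summed correctly
    int end_val = start_val + sums_to_do;
    if (end_val > max_sum_item)
        end_val = max_sum_item;

    for (int i = start_val; i < end_val; i++)
        part_sums.local() += i;
});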
Performance-wise, figure 3 shows an average time of about 1.88 seconds for this version – still much slower than the single-threaded version, but about 25% faster than our first attempt at parallel_for. It’s a start.
Thread-private storage
Every time we use local() a function call must be made to get the reference to the thread-private storage in the combinable object. We can improve performance by fetching the reference once and storing it in a variable before our for loop begins:
parallel_for(0, max_sum_item, sums_to_do, [&part_sums, sums_to_do] (int start_val)
{
    uint64_t &part_result = part_sums.local();

    for (int i = start_val; i < start_val + sums_to_do; i++)
        part_result += i;
});
We call local() only once, before the for loop, save the reference in part_result and use this as the storage target when doing the additions.
Figure 3 shows the dramatic performance gain achieved: at around 0.4 seconds, this is over 4 times faster than the previous version, yet still twice as slow as the single-threaded version!
The problem is that accessing any variable not local to the thread function still takes time – more time than if it were local. The solution, therefore, is not to use part_sums.local() at all until the partial result is ready, and to accumulate a temporary result in a local variable:
parallel_for(0, max_sum_item, sums_to_do, [&part_sums, sums_to_do] (int start_val)
{
    uint64_t part_result = 0;

    for (int i = start_val; i < start_val + sums_to_do; i++)
        part_result += i;

    part_sums.local() += part_result;
});
Figure 3. Using chunked parallel_for loops to improve performance when the loop body is small
Figure 3 shows the performance: at 0.085 seconds, it is comparable with the standard thread library and PPL task groups, and therefore runs at the expected speed.
NOTE: It is very important that you add your temporary partial results to part_sums.local() and not assign them directly. This is because the work-stealing algorithm may add extra work (in the form of another call to this function) to a thread after the first run of the for loop completes, which means part_sums.local() will already hold a value which is part of the total result. If you just assign to part_sums.local() directly, you overwrite any previous results calculated on the same thread and may finish with an incorrect total result.
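Spelled out against the loop above (the same code as before, with the incorrect variant left in as a comment for contrast):

parallel_for(0, max_sum_item, sums_to_do, [&part_sums, sums_to_do] (int start_val)
{
    uint64_t part_result = 0;

    for (int i = start_val; i < start_val + sums_to_do; i++)
        part_result += i;

    // part_sums.local() = part_result;   // WRONG: overwrites earlier chunks run on this thread
    part_sums.local() += part_result;     // RIGHT: accumulates across all chunks this thread runs
});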
Partitioning strategy
Partitioning is the mechanism ConcRT uses to decide how much work of a parallel_for or other construct to give to each thread. When you use the basic (non-chunked) parallel_for loop for example, ConcRT’s partitioning strategy decides how many iterations of the function will run on how many threads, and this is the reason you don’t need to specify a workload size for each thread individually.
The default strategy is called auto_partitioner; it splits the work into ranges, and these ranges into sub-ranges, then employs the work-stealing algorithm to move the sub-ranges around to the least backlogged threads, essentially performing load balancing as discussed earlier.
For our chunked version of the code, this has a down-side: we have already calculated the ranges (workload sizes per thread) ourselves and we don’t want or need load balancing. This extra management performed by auto_partitioner then just becomes an unnecessary overhead. Even with the non-chunked version, since our loop iterations are so simple, the overhead of load balancing may end up costing more time than it saves.
Fortunately, when you call parallel_for you can specify as a final optional argument an alternative partitioning strategy to use instead of auto_partitioner. For our purposes, the static_partitioner strategy is a good match: it splits the work into ranges and assigns one range to each thread. No work-stealing/load balancing is used, and the ranges are not divided into sub-ranges for this purpose. The documentation states: “Use this partitioner type when each iteration of a parallel loop performs a fixed and uniform amount of work…”.
If you want to ensure that each range has a certain minimum size, you can use simple_partitioner. simple_partitioner takes a single argument which specifies the minimum number of iterations of the parallel_for which should run on each thread (ie. the minimum range size). Work-stealing is still performed but only once the specified minimum number of iterations has completed, and the ranges are not split into sub-ranges. Essentially then, work-stealing can only occur if an entire range completes before another one on another thread has started.
WARNING: In previous versions of PPL there was no modifiable partitioning strategy. Users were required to install the Concurrency Runtime Sample Pack, include ConcRTExtras/ppl_extras.h and use samples::parallel_for_fixed to have the same effect as calling parallel_for with static_partitioner. The Microsoft documentation is still out-of-date in this regard and you will find many references to parallel_for_fixed and related functions on MSDN. The current version of PPL implements partitioning strategy as the optional final argument as a replacement to these functions. Do not use the Concurrency Runtime Sample Pack for partitioning.
Looking at the non-chunked case first:
parallel_for(0, max_sum_item, [&part_sums] (int i)
{
    part_sums.local() += i;
}, static_partitioner());
By simply adding static_partitioner() as the final argument to parallel_for, we change the partitioning strategy in use.
For the chunked case:
parallel_for(0, max_sum_item, sums_to_do, [&part_sums, sums_to_do] (int start_val)
{
    uint64_t part_result = 0;

    for (int i = start_val; i < start_val + sums_to_do; i++)
        part_result += i;

    part_sums.local() += part_result;
}, static_partitioner());
Figure 4. Partitioning strategy comparison for a non-chunked parallel_for loop
Figure 5. Partitioning strategy comparison for a chunked parallel_for loop
Figures 4 and 5 show the relative performance of the mentioned partitioner types. In the standard (non-chunked) case, we set the range size in simple_partitioner so that the problem size splits into a number of ranges equal to the number of processor cores available, such that each core is used and given an initially equal amount of work.
In the chunked case, we set the range size to 1 when we use simple_partitioner, because we have already divided the work into a number of chunks equal to the number of processor cores and each iteration of the parallel_for is meant to run exactly once on each core, in contrast to the standard parallel_for use.
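Based on that description, the simple_partitioner calls would look roughly like this. This is a sketch: the article does not reproduce the exact source used for figures 4 and 5, so the arguments shown are assumptions derived from the range sizes described above.

// Non-chunked case: one range per core, each of max_sum_item / GetProcessorCount() iterations
parallel_for(0, max_sum_item, [&part_sums] (int i)
{
    part_sums.local() += i;
}, simple_partitioner(max_sum_item / GetProcessorCount()));

// Chunked case: the work is already split into per-core chunks, so the minimum range size is 1
parallel_for(0, max_sum_item, sums_to_do, [&part_sums, sums_to_do] (int start_val)
{
    uint64_t part_result = 0;

    for (int i = start_val; i < start_val + sums_to_do; i++)
        part_result += i;

    part_sums.local() += part_result;
}, simple_partitioner(1));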
As you can see, in our summation problem the simple_partitioner wins out in all cases, although both simple_partitioner and static_partitioner bring performance benefits over the default auto_partitioner, particularly in the standard (non-chunked) case.
Final results
Figure 6. Performance comparison between the single-threaded implementation (red), the C++11 thread library, PPL task groups and the optimized chunked cases of parallel_for using different partitioning strategies
Figure 6 shows all the PPL solutions we have tried which improve performance over the single-threaded implementation. As you can see, with small loop bodies, the chunking approach discussed above (without using local() until the entire chunk of work is finished) is the optimal solution, with performance comparable to the thread library and task groups. parallel_for is preferred over task groups (whether chunked or not) for this kind of problem, where the tasks use identical code with different data. Task groups should be used when each task requires different code (ie. several unrelated tasks). Using the simple or static partitioning strategies can lead to additional performance gains, but check during development as your mileage may vary depending on the computations being performed.
Standard (non-chunked) parallel_for is preferred where the tasks have more complexity than a single line of code, and in those cases performance should again be comparable to the thread library and task groups.
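To illustrate the task group recommendation above, here is a hedged sketch of two unrelated pieces of work running concurrently in a task group. The functions load_textures and decode_audio are invented purely for the example:

#include <ppl.h>
#include <iostream>

using namespace concurrency;

// Invented, unrelated pieces of work - purely illustrative
void load_textures() { std::cout << "loading textures...\n"; }
void decode_audio()  { std::cout << "decoding audio...\n"; }

int main()
{
    task_group tasks;

    // Each task runs different code, so parallel_for does not fit; a task group
    // (or parallel_invoke(load_textures, decode_audio)) is the natural choice here
    tasks.run([] { load_textures(); });
    tasks.run([] { decode_audio(); });

    tasks.wait();
}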
…and across the line
This concludes our exploration of parallel aggregation in PPL. I hope you found it useful!
References
MSDN – Overview of the Concurrency Runtime
Microsoft Patterns & Practices – Parallel Loops (including Special Handling of Small Loop Bodies)
Microsoft Patterns & Practices – Parallel Tasks
MSDN – Parallel Algorithms -> Partitioning Work