readerwriterqueue: a fast lock-free queue implemented in C++
https://www.oschina.net/translate/a-fast-lock-free-queue-for-cpp?cmp&p=2
A single-producer, single-consumer lock-free queue for C++
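What good is a design without a reliable (tested) implementation? :-)
I've released my implementation on GitHub. Feel free to fork it! It consists of two headers: one for the queue itself, and one it depends on that contains some helpers.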
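It has several nice features:
- C++11-compatible (supports moving objects instead of copying them)
- Fully generic (templated container of any type): just like std::queue, you never need to allocate memory for elements yourself (which saves you the hassle of writing a lock-free memory manager to hold the elements you're queueing)
- Allocates memory up front, in contiguous blocks
- Provides a try_enqueue method which is guaranteed never to allocate memory (the queue starts with an initial capacity)
- Also provides an enqueue method which can dynamically grow the size of the queue as needed
- Doesn't use a compare-and-swap loop; this means enqueue and dequeue are O(1) (not counting memory allocation)
- On x86, the memory barriers compile down to no-ops, which means enqueue and dequeue are just a simple sequence of loads and stores (and branches)
- Compiles under MSVC2010+ and GCC 4.7+ (and should work with any C++11-compliant compiler)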
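Several of the properties above (no compare-and-swap loop, O(1) enqueue and dequeue, a try_enqueue that never allocates, barriers that vanish on x86) can all be seen in a bounded single-producer/single-consumer ring buffer. The sketch below is not the library's implementation. `SpscRing` is a hypothetical name, and the sketch assumes `T` is default-constructible and movable. It also leaves out the block chaining that lets `enqueue` grow the queue. Each side writes only its own index, so an acquire load paired with a release store replaces any CAS. On x86-64, mainstream compilers emit ordinary `mov` instructions for those operations.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Bounded single-producer/single-consumer ring buffer (illustrative sketch,
// not readerwriterqueue's code). Assumes T is default-constructible and movable.
template <typename T>
class SpscRing {
public:
    // One slot always stays empty so "full" and "empty" can be told apart,
    // hence capacity + 1 slots, all allocated up front in one contiguous block.
    explicit SpscRing(std::size_t capacity)
        : slots_(capacity + 1), head_(0), tail_(0) {
        assert(capacity > 0 && "a zero-capacity ring can never hold an element");
    }

    // Producer thread only. Never allocates; returns false if the ring is full.
    bool try_enqueue(T value) {
        // Only the producer writes tail_, so a relaxed load of it is enough.
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % slots_.size();
        // Acquire pairs with the consumer's release store of head_, so the
        // consumer has finished reading the slot we are about to overwrite.
        if (next == head_.load(std::memory_order_acquire))
            return false;
        slots_[tail] = std::move(value);
        // Release publishes the element before the new tail becomes visible.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer thread only. Returns false if the ring is empty.
    bool try_dequeue(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Acquire pairs with the producer's release store of tail_.
        if (head == tail_.load(std::memory_order_acquire))
            return false;
        out = std::move(slots_[head]);
        // Release hands the slot back to the producer.
        head_.store((head + 1) % slots_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T> slots_;
    std::atomic<std::size_t> head_;  // next slot to read; written only by the consumer
    std::atomic<std::size_t> tail_;  // next slot to write; written only by the producer
};
```

Only one producer thread may call `try_enqueue`, and only one consumer thread may call `try_dequeue`. The CAS-free design depends on each index having a single writer. A production version would also pad `head_` and `tail_` onto separate cache lines to avoid false sharing.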
I'm releasing the code and algorithm under the terms of the simplified BSD license. Use it at your own risk; in particular, lock-free programming is a patent minefield, and this code may very well violate a pending patent (I haven't looked). It's worth noting that I came up with the algorithm and implementation from scratch, independent of any existing lock-free queues.
Performance and correctness
In addition to agonizing over the design for quite some time, I tested the algorithm with several billion randomized operations in a simple stability test (on x86). This, of course, helps inspire confidence, but proves nothing about correctness. To check correctness more rigorously, I also tested with Relacy, which ran every possible interleaving of a simple test and turned up no errors. That simple test turned out not to be comprehensive, however: I eventually found a bug later using a different set of randomized runs (which I then fixed).
I've only tested this queue on x86-64, which is rather forgiving as memory ordering goes. If somebody is willing to test this code on another architecture, let me know! The quick stability test I whipped up is available here.
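A randomized stability test of the kind described above can be sketched as follows. The producer enqueues the integers 0..count-1 and the consumer dequeues them, each in random-sized bursts separated by random pauses. The consumer checks that every value comes out in order. `stability_test` and `LockedQueue` are hypothetical names. `LockedQueue` is a mutex-based stand-in that exists only to keep the snippet self-contained, so you would point the template at the lock-free queue under test instead. As the text notes, passing such a test builds confidence but proves nothing. This harness also never terminates if the queue loses an element, so a real one would add a timeout.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <random>
#include <thread>
#include <utility>

// Hypothetical lock-based stand-in so the snippet compiles on its own; replace
// it with the lock-free queue under test (same try_enqueue/try_dequeue shape).
template <typename T>
class LockedQueue {
public:
    bool try_enqueue(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.size() >= 1024) return false;  // simulate a bounded queue
        items_.push_back(std::move(value));
        return true;
    }
    bool try_dequeue(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::deque<T> items_;
};

// Producer enqueues 0..count-1 and consumer dequeues, each in random-sized
// bursts with random pauses in between to vary the interleaving.
// Returns true if every value arrived exactly in order.
template <typename Queue>
bool stability_test(Queue& q, int count, unsigned seed) {
    assert(count >= 0);
    bool ok = true;  // written only by the consumer, read after join()

    std::thread producer([&] {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<int> burst(1, 64), pause(0, 3);
        int next = 0;
        while (next < count) {
            for (int n = burst(rng); n > 0 && next < count; --n) {
                if (!q.try_enqueue(next)) break;  // full: end this burst early
                ++next;
            }
            if (pause(rng) == 0) std::this_thread::yield();  // random gap
        }
    });

    std::thread consumer([&] {
        std::mt19937 rng(seed + 1);
        std::uniform_int_distribution<int> burst(1, 64), pause(0, 3);
        int expected = 0;
        while (expected < count) {
            for (int n = burst(rng); n > 0 && expected < count; --n) {
                int value = 0;
                if (!q.try_dequeue(value)) break;  // empty: end this burst early
                if (value != expected) ok = false;  // reordered or duplicated
                ++expected;
            }
            if (pause(rng) == 0) std::this_thread::yield();  // random gap
        }
    });

    producer.join();
    consumer.join();
    return ok;
}
```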
In terms of performance, it's fast. Really fast. In my tests, I was able to get up to about 12+ million concurrent enqueue/dequeue pairs per second! (The dequeue thread had to wait for the enqueue thread to catch up if there was nothing in the queue.) After I had implemented my queue, though, I found another single-consumer, single-producer templated queue (written by the author of Relacy) published on Intel's website; his queue is roughly twice as fast, though it doesn't have all the features that mine does, and his only works on x86 (and, at this scale, "twice as fast" means the difference in enqueue/dequeue time is in the nanosecond range).
Update
I spent some time properly benchmarking, profiling, and optimizing the code, using Dmitry's single-producer, single-consumer lock-free queue (published on Intel's website) as a baseline for comparison. Mine's now faster in general, particularly when it comes to enqueueing many elements (mine uses a contiguous block instead of separate linked elements). Note that different compilers give different results, and even the same compiler on different hardware yields significant speed variations. The 64-bit version is generally faster than the 32-bit one, and for some reason my queue is much faster under GCC on a Linode. Here are the benchmark results in full:
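The advantage when enqueueing many elements is attributed above to using a contiguous block instead of separate linked elements. For contrast, here is a simplified node-based SPSC queue using the classic dummy-node layout. `LinkedSpscQueue` is a hypothetical name. This is not Dmitry's code: any node caching his queue may do to reduce allocations is deliberately left out. It assumes `T` is default-constructible.

```cpp
#include <atomic>
#include <cassert>
#include <utility>

// Simplified node-based SPSC queue (sketch): every enqueue performs a heap
// allocation, which is the per-element cost a contiguous-block design avoids.
template <typename T>
class LinkedSpscQueue {
    struct Node {
        T value;
        std::atomic<Node*> next;
        Node() : value(), next(nullptr) {}
    };

public:
    LinkedSpscQueue() : head_(new Node), tail_(head_) {}  // start with a dummy node
    ~LinkedSpscQueue() {
        while (head_) {
            Node* n = head_->next.load(std::memory_order_relaxed);
            delete head_;
            head_ = n;
        }
    }
    LinkedSpscQueue(const LinkedSpscQueue&) = delete;
    LinkedSpscQueue& operator=(const LinkedSpscQueue&) = delete;

    // Producer thread only: one `new` per element.
    void enqueue(T value) {
        Node* n = new Node;
        n->value = std::move(value);
        tail_->next.store(n, std::memory_order_release);  // publish the node
        tail_ = n;  // producer never touches the old tail node again
    }

    // Consumer thread only. Returns false if the queue is empty.
    bool try_dequeue(T& out) {
        Node* next = head_->next.load(std::memory_order_acquire);
        if (!next) return false;
        out = std::move(next->value);
        delete head_;  // old dummy: the producer is already done with it
        head_ = next;  // the dequeued node becomes the new dummy
        return true;
    }

private:
    Node* head_;  // consumer-owned: dummy node in front of the first element
    Node* tail_;  // producer-owned: last node in the list
};
```

Each element costs a `new` on the producer side and a `delete` on the consumer side, and consecutive elements can end up scattered in memory. Allocating contiguous blocks up front sidesteps both costs.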
32-bit, MSVC2010, on AMD C-50 @ 1GHz
------------------------------------
                  |        Min        |        Max        |        Avg        |
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0039s | 0.0268s | 0.0040s | 0.0271s | 0.0040s | 0.0270s | 6.8x
Raw remove        | 0.0015s | 0.0017s | 0.0015s | 0.0018s | 0.0015s | 0.0017s | 1.2x
Raw empty remove  | 0.0048s | 0.0027s | 0.0049s | 0.0027s | 0.0048s | 0.0027s | 0.6x
Single-threaded   | 0.0181s | 0.0172s | 0.0183s | 0.0173s | 0.0182s | 0.0173s | 0.9x
Mostly add        | 0.0243s | 0.0326s | 0.0245s | 0.0329s | 0.0244s | 0.0327s | 1.3x
Mostly remove     | 0.0240s | 0.0274s | 0.0242s | 0.0277s | 0.0241s | 0.0276s | 1.1x
Heavy concurrent  | 0.0164s | 0.0309s | 0.0349s | 0.0352s | 0.0236s | 0.0334s | 1.4x
Random concurrent | 0.1488s | 0.1509s | 0.1500s | 0.1522s | 0.1496s | 0.1517s | 1.0x

Average ops/s:
    ReaderWriterQueue: 23.45 million
    SPSC queue:        28.10 million

64-bit, MSVC2010, on AMD C-50 @ 1GHz
------------------------------------
                  |        Min        |        Max        |        Avg        |
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0022s | 0.0210s | 0.0022s | 0.0211s | 0.0022s | 0.0211s | 9.6x
Raw remove        | 0.0011s | 0.0022s | 0.0011s | 0.0023s | 0.0011s | 0.0022s | 2.0x
Raw empty remove  | 0.0039s | 0.0024s | 0.0039s | 0.0024s | 0.0039s | 0.0024s | 0.6x
Single-threaded   | 0.0060s | 0.0054s | 0.0061s | 0.0054s | 0.0061s | 0.0054s | 0.9x
Mostly add        | 0.0080s | 0.0259s | 0.0081s | 0.0263s | 0.0080s | 0.0261s | 3.3x
Mostly remove     | 0.0092s | 0.0109s | 0.0093s | 0.0110s | 0.0093s | 0.0109s | 1.2x
Heavy concurrent  | 0.0150s | 0.0175s | 0.0181s | 0.0200s | 0.0165s | 0.0190s | 1.2x
Random concurrent | 0.0367s | 0.0349s | 0.0369s | 0.0352s | 0.0368s | 0.0350s | 1.0x

Average ops/s:
    ReaderWriterQueue: 34.90 million
    SPSC queue:        32.50 million

32-bit, MSVC2010, on Intel Core 2 Duo T6500 @ 2.1GHz
----------------------------------------------------
                  |        Min        |        Max        |        Avg        |
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0011s | 0.0097s | 0.0011s | 0.0099s | 0.0011s | 0.0098s | 9.2x
Raw remove        | 0.0005s | 0.0006s | 0.0005s | 0.0006s | 0.0005s | 0.0006s | 1.1x
Raw empty remove  | 0.0018s | 0.0011s | 0.0019s | 0.0011s | 0.0018s | 0.0011s | 0.6x
Single-threaded   | 0.0047s | 0.0040s | 0.0047s | 0.0040s | 0.0047s | 0.0040s | 0.9x
Mostly add        | 0.0052s | 0.0114s | 0.0053s | 0.0116s | 0.0053s | 0.0115s | 2.2x
Mostly remove     | 0.0055s | 0.0067s | 0.0056s | 0.0068s | 0.0055s | 0.0068s | 1.2x
Heavy concurrent  | 0.0044s | 0.0089s | 0.0075s | 0.0128s | 0.0066s | 0.0107s | 1.6x
Random concurrent | 0.0294s | 0.0306s | 0.0295s | 0.0312s | 0.0294s | 0.0310s | 1.1x

Average ops/s:
    ReaderWriterQueue: 71.18 million
    SPSC queue:        61.02 million

64-bit, MSVC2010, on Intel Core 2 Duo T6500 @ 2.1GHz
----------------------------------------------------
                  |        Min        |        Max        |        Avg        |
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0007s | 0.0097s | 0.0007s | 0.0100s | 0.0007s | 0.0099s | 13.6x
Raw remove        | 0.0004s | 0.0015s | 0.0004s | 0.0020s | 0.0004s | 0.0018s | 4.6x
Raw empty remove  | 0.0014s | 0.0010s | 0.0014s | 0.0010s | 0.0014s | 0.0010s | 0.7x
Single-threaded   | 0.0024s | 0.0022s | 0.0024s | 0.0022s | 0.0024s | 0.0022s | 0.9x
Mostly add        | 0.0031s | 0.0112s | 0.0031s | 0.0115s | 0.0031s | 0.0114s | 3.7x
Mostly remove     | 0.0033s | 0.0041s | 0.0033s | 0.0041s | 0.0033s | 0.0041s | 1.2x
Heavy concurrent  | 0.0042s | 0.0035s | 0.0067s | 0.0039s | 0.0054s | 0.0038s | 0.7x
Random concurrent | 0.0142s | 0.0141s | 0.0145s | 0.0144s | 0.0143s | 0.0142s | 1.0x

Average ops/s:
    ReaderWriterQueue: 101.21 million
    SPSC queue:        71.42 million

32-bit, Intel ICC 13, on Intel Core 2 Duo T6500 @ 2.1GHz
--------------------------------------------------------
                  |        Min        |        Max        |        Avg        |
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0014s | 0.0095s | 0.0014s | 0.0097s | 0.0014s | 0.0096s | 6.8x
Raw remove        | 0.0007s | 0.0006s | 0.0007s | 0.0007s | 0.0007s | 0.0006s | 0.9x
Raw empty remove  | 0.0028s | 0.0013s | 0.0028s | 0.0018s | 0.0028s | 0.0015s | 0.5x
Single-threaded   | 0.0039s | 0.0033s | 0.0039s | 0.0033s | 0.0039s | 0.0033s | 0.8x
Mostly add        | 0.0049s | 0.0113s | 0.0050s | 0.0116s | 0.0050s | 0.0115s | 2.3x
Mostly remove     | 0.0051s | 0.0061s | 0.0051s | 0.0062s | 0.0051s | 0.0061s | 1.2x
Heavy concurrent  | 0.0066s | 0.0036s | 0.0084s | 0.0039s | 0.0076s | 0.0038s | 0.5x
Random concurrent | 0.0291s | 0.0282s | 0.0294s | 0.0287s | 0.0292s | 0.0286s | 1.0x

Average ops/s:
    ReaderWriterQueue: 55.65 million
    SPSC queue:        63.72 million

64-bit, Intel ICC 13, on Intel Core 2 Duo T6500 @ 2.1GHz
--------------------------------------------------------
                  |        Min        |        Max        |        Avg        |
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0010s | 0.0099s | 0.0010s | 0.0100s | 0.0010s | 0.0099s | 9.8x
Raw remove        | 0.0006s | 0.0015s | 0.0006s | 0.0018s | 0.0006s | 0.0017s | 2.7x
Raw empty remove  | 0.0024s | 0.0016s | 0.0024s | 0.0016s | 0.0024s | 0.0016s | 0.7x
Single-threaded   | 0.0026s | 0.0023s | 0.0026s | 0.0023s | 0.0026s | 0.0023s | 0.9x
Mostly add        | 0.0032s | 0.0114s | 0.0032s | 0.0118s | 0.0032s | 0.0116s | 3.6x
Mostly remove     | 0.0037s | 0.0042s | 0.0037s | 0.0044s | 0.0037s | 0.0044s | 1.2x
Heavy concurrent  | 0.0060s | 0.0092s | 0.0088s | 0.0096s | 0.0077s | 0.0095s | 1.2x
Random concurrent | 0.0168s | 0.0166s | 0.0168s | 0.0168s | 0.0168s | 0.0167s | 1.0x

Average ops/s:
    ReaderWriterQueue: 68.45 million
    SPSC queue:        50.75 million

64-bit, GCC 4.7.2, on Linode 1GB virtual machine (Intel Xeon L5520 @ 2.27GHz)
-----------------------------------------------------------------------------
                  |        Min        |        Max        |        Avg        |
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0004s | 0.0055s | 0.0005s | 0.0055s | 0.0005s | 0.0055s | 12.1x
Raw remove        | 0.0004s | 0.0030s | 0.0004s | 0.0030s | 0.0004s | 0.0030s | 8.4x
Raw empty remove  | 0.0009s | 0.0060s | 0.0010s | 0.0061s | 0.0009s | 0.0060s | 6.4x
Single-threaded   | 0.0034s | 0.0052s | 0.0034s | 0.0052s | 0.0034s | 0.0052s | 1.5x
Mostly add        | 0.0042s | 0.0096s | 0.0042s | 0.0106s | 0.0042s | 0.0103s | 2.5x
Mostly remove     | 0.0042s | 0.0057s | 0.0042s | 0.0058s | 0.0042s | 0.0058s | 1.4x
Heavy concurrent  | 0.0030s | 0.0164s | 0.0036s | 0.0216s | 0.0032s | 0.0188s | 5.8x
Random concurrent | 0.0256s | 0.0282s | 0.0257s | 0.0290s | 0.0257s | 0.0287s | 1.1x

Average ops/s:
    ReaderWriterQueue: 137.88 million
    SPSC queue:        24.34 million
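The tables do not define the Mult column. Checking the rows against each other, the values are consistent with the SPSC average time divided by the RWQ average time. Some rows differ slightly, presumably because the displayed times are rounded. Read this way, values above 1.0x mean ReaderWriterQueue was faster on that benchmark. This interpretation is inferred from the numbers, not stated in the source.

```latex
\text{Mult} = \frac{\bar{t}_{\text{SPSC}}}{\bar{t}_{\text{RWQ}}},
\qquad \text{e.g. Raw add (32-bit, MSVC2010, AMD C-50):}\quad
\frac{0.0270\,\text{s}}{0.0040\,\text{s}} = 6.75 \;(\text{table: } 6.8\text{x})
```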
In short, my queue is blazingly fast, and any real work you do with the queued elements will eclipse the overhead of the data structure itself.
The benchmarking code is available here (compile and run with full optimizations).