[Translation] Memory Reordering Caught in the Act

Original: http://preshing.com/20120515/memory-reordering-caught-in-the-act/

When writing lock-free code in C or C++, you must take great care to enforce correct memory ordering. Otherwise, surprising things can happen.

Intel lists several such surprises in Volume 3, §8.2.3 of its x86/64 Architecture Specification.

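Here is a small example. Two variables, X and Y, are both initialized to 0, and two processors execute a short piece of assembly code: processor 1 stores 1 to X and then loads Y into r1, while processor 2 stores 1 to Y and then loads X into r2.

Assembly is used here to make the CPU ordering explicit. Each processor stores 1 to one variable and then loads the other variable; r1 and r2 are registers.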

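Intuitively, whichever processor runs first, the only possible results are:

r1 == 0, r2 == 1

r1 == 1, r2 == 0

r1 == 1, r2 == 1

and the one impossible result is:

r1 == 0, r2 == 0

According to the Intel specification, however, this "impossible" result is in fact allowed. That is the surprise this article is about. At the very least, it is counterintuitive.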
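To see why r1 == 0, r2 == 0 looks impossible, it helps to enumerate every interleaving in which each processor keeps its own program order. The following is only an illustrative sketch: the function name and the single-letter encoding of the four operations are assumptions made for this example and are not part of the original sample.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <utility>

// Operations: 'a' = X = 1 (processor 1), 'b' = r1 = Y (processor 1),
//             'c' = Y = 1 (processor 2), 'd' = r2 = X (processor 2).
// Collects the (r1, r2) results of every interleaving that keeps each
// processor's store ahead of its load.
std::set<std::pair<int, int>> sequentially_consistent_outcomes()
{
    std::set<std::pair<int, int>> outcomes;
    std::string order = "abcd";  // sorted, so next_permutation visits all 24 orders
    do {
        // Skip orders that break program order on either processor.
        if (order.find('a') > order.find('b')) continue;
        if (order.find('c') > order.find('d')) continue;

        int X = 0, Y = 0, r1 = -1, r2 = -1;
        for (char op : order) {
            switch (op) {
                case 'a': X = 1;  break;
                case 'b': r1 = Y; break;
                case 'c': Y = 1;  break;
                case 'd': r2 = X; break;
            }
        }
        outcomes.insert(std::make_pair(r1, r2));
    } while (std::next_permutation(order.begin(), order.end()));
    return outcomes;
}
```

Calling sequentially_consistent_outcomes() yields exactly the three results listed above. The pair (0, 0) never appears, because it would require each load to run before the other processor's store while each store still runs before its own processor's load.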

Memory Reordering

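Like most processor families, Intel x86/x64 processors are allowed to reorder memory operations as long as doing so never changes the result of a single thread's execution. In particular, a processor may delay the effect of a store past any later load, provided the store and the load access different memory locations. As a result, the assembly you wrote may effectively execute as if each processor's load came before its store.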
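One common way to picture a delayed store is a per-processor store buffer. The toy model below is an assumption made purely for illustration (ToyCpu and run_with_delayed_stores are hypothetical names, and real hardware is far more elaborate), but it shows how both loads can read 0 even though each processor issued its store first.

```cpp
#include <utility>

// Toy processor with a one-entry store buffer.
struct ToyCpu {
    int* pending_addr = nullptr;  // location of the buffered store, if any
    int  pending_value = 0;

    // The store is buffered; other processors cannot see it yet.
    void store(int* addr, int value) { pending_addr = addr; pending_value = value; }

    int load(const int* addr) const {
        // A processor always sees its own buffered store to the same location.
        if (pending_addr == addr) return pending_value;
        return *addr;  // otherwise read shared memory
    }

    // The buffered store finally becomes visible in shared memory.
    void flush() {
        if (pending_addr) { *pending_addr = pending_value; pending_addr = nullptr; }
    }
};

std::pair<int, int> run_with_delayed_stores()
{
    int X = 0, Y = 0;
    ToyCpu cpu1, cpu2;

    cpu1.store(&X, 1);       // X = 1 waits in cpu1's store buffer
    cpu2.store(&Y, 1);       // Y = 1 waits in cpu2's store buffer
    int r1 = cpu1.load(&Y);  // shared memory still holds Y == 0
    int r2 = cpu2.load(&X);  // shared memory still holds X == 0
    cpu1.flush();
    cpu2.flush();
    return std::make_pair(r1, r2);
}
```

In this model run_with_delayed_stores() returns r1 == 0 and r2 == 0. Each processor on its own still behaves as if its instructions ran in order, which is exactly the condition the reordering rule has to respect.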

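Let's Try It!

"Fine, you say this can happen, but I have never seen it. Why should I believe it?"

Well, let's try it. The source code is linked from the original post.

The code comes in Win32 and POSIX versions. Two worker threads repeatedly execute the transaction described above, and the main thread checks the results.

The source of the first worker thread is shown below. X, Y, r1 and r2 are global variables, and POSIX semaphores coordinate the worker with the main thread: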

 sem_t beginSema1;
 sem_t endSema;

 int X, Y;
 int r1, r2;

 void *thread1Func(void *param)
 {
     MersenneTwister random(1);                // Initialize random number generator
     for (;;)                                  // Loop indefinitely
     {
         sem_wait(&beginSema1);                // Wait for signal from main thread
         while (random.integer() % 8 != 0) {}  // Add a short, random delay

         // ----- THE TRANSACTION! -----
         X = 1;
         asm volatile("" ::: "memory");        // Prevent compiler reordering
         r1 = Y;

         sem_post(&endSema);                   // Notify transaction complete
     }
     return NULL;  // Never returns
 }

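A short random delay is inserted before each transaction to stagger the timing of the threads. There are two worker threads, and the goal is to make their transactions overlap as often as possible (translator's note: this may not be necessary). The random delay uses the same MersenneTwister implementation as the author's earlier posts on measuring lock contention and validating that the recursive Benaphore worked.

Don't be put off by the asm volatile line. It only tells GCC not to reorder the store and the load when generating machine code, in case the optimizer gets any funny ideas.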
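For readers who prefer standard C++11 over GCC inline assembly, std::atomic_signal_fence is commonly used as a compiler-only barrier. The sketch below swaps it in for the asm volatile line; this is a substitution, not what the original sample does. Strictly speaking, the standard specifies this fence in terms of atomic objects and signal handlers, so treating it as a barrier for plain int variables relies on how GCC and Clang behave in practice. The function name is hypothetical.

```cpp
#include <atomic>

int X = 0, Y = 0;
int r1 = 0;

// Same transaction as in thread1Func, with the GCC-specific
// asm volatile("" ::: "memory") replaced by the standard C++11 call.
void transaction_with_compiler_barrier()
{
    X = 1;
    // Restrains the compiler only; no CPU barrier instruction is emitted,
    // so the processor may still reorder the store and the load.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    r1 = Y;
}
```

Like the asm volatile version, this constrains only the compiler. The hardware reordering that the experiment is designed to catch is unaffected.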

Here is the assembly that GCC generates for the transaction in thread1Func:

$ gcc -O2 -c -S -masm=intel ordering.cpp
$ cat ordering.s
    ...
    mov    DWORD PTR _X, 1
    mov    eax, DWORD PTR _Y
    mov    DWORD PTR _r1, eax
    ...

The store and the load appear in the expected order: X = 1 executes first, followed by r1 = Y.

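Below is the main thread. After initialization it enters an infinite loop: it resets X and Y to 0, signals both worker threads through the semaphores, waits for both to finish, and checks the results.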

Pay particular attention to the way all writes to shared memory occur before sem_post, and all reads from shared memory occur after sem_wait. The same rules are followed in the worker threads when communicating with the main thread. Semaphores give us acquire and release semantics on every platform. That means we are guaranteed that the initial values of X = 0 and Y = 0 will propagate completely to the worker threads, and that the resulting values of r1 and r2 will propagate fully back here. In other words, the semaphores prevent memory reordering issues in the framework, allowing us to focus entirely on the experiment itself! (Translator's note: this paragraph is quoted from the original as-is.)

 int main()
 {
     // Initialize the semaphores
     sem_init(&beginSema1, 0, 0);
     sem_init(&beginSema2, 0, 0);
     sem_init(&endSema, 0, 0);

     // Spawn the threads
     pthread_t thread1, thread2;
     pthread_create(&thread1, NULL, thread1Func, NULL);
     pthread_create(&thread2, NULL, thread2Func, NULL);

     // Repeat the experiment ad infinitum
     int detected = 0;
     for (int iterations = 1; ; iterations++)
     {
         // Reset X and Y
         X = 0;
         Y = 0;
         // Signal both threads
         sem_post(&beginSema1);
         sem_post(&beginSema2);
         // Wait for both threads
         sem_wait(&endSema);
         sem_wait(&endSema);
         // Check if there was a simultaneous reorder
         if (r1 == 0 && r2 == 0)
         {
             detected++;
             printf("%d reorders detected after %d iterations\n", detected, iterations);
         }
     }
     return 0;  // Never returns
 }

The moment of truth has arrived. I ran the program under Cygwin on an Intel Xeon W3520.

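Now you believe it! In this run, a memory reordering was detected roughly once every 6600 iterations. When I tested on a Core 2 Duo E6300 under Ubuntu, reorderings were even rarer. You can begin to see how subtle timing bugs creep into lock-free code undetected. Now you may be thinking: "I don't want any of this reordering." Fine, there are at least two ways to get rid of it.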

One is to pin both threads to the same CPU core. Pthreads offers no portable way to do this, but on Linux it can be done as follows:

    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(0, &cpus);
    pthread_setaffinity_np(thread1, sizeof(cpu_set_t), &cpus);
    pthread_setaffinity_np(thread2, sizeof(cpu_set_t), &cpus);

With this change, the reordering disappears. That's because a single processor never sees its own operations out of order, even when threads are pre-empted and rescheduled at arbitrary times. Of course, pinning both threads to one core leaves the other cores unused, so this is not a great solution.
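On some systems (containers in particular) CPU 0 may not be among the CPUs a process is allowed to use, in which case a hard-coded affinity call fails. The sketch below queries the allowed set first and pins the calling thread to the first CPU in it. It uses the Linux sched_getaffinity and sched_setaffinity calls, where a pid of 0 means the calling thread, instead of pthread_setaffinity_np; the two function names are hypothetical.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros (g++ usually defines it already)
#endif
#include <sched.h>   // sched_getaffinity, sched_setaffinity, cpu_set_t (Linux, glibc)

// Pins the calling thread to the first CPU it is currently allowed to run on.
// Returns that CPU's index, or -1 on failure.
int pin_calling_thread_to_one_cpu()
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(cpu_set_t), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            return sched_setaffinity(0, sizeof(cpu_set_t), &one) == 0 ? cpu : -1;
        }
    }
    return -1;  // the allowed set was empty
}

// Number of CPUs the calling thread may currently run on, or -1 on failure.
int allowed_cpu_count()
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(cpu_set_t), &allowed) != 0)
        return -1;
    return CPU_COUNT(&allowed);
}
```

In the experiment, each worker thread would call pin_calling_thread_to_one_cpu() at the start of its thread function. After a successful call, allowed_cpu_count() should report 1.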

I also compiled and ran the sample on a PlayStation 3, and no memory reordering was detected. This suggests (but doesn’t confirm) that the two hardware threads inside the PPU may effectively act as a single processor, with very fine-grained hardware scheduling.

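Preventing Memory Reordering With a StoreLoad Barrier

Another way to prevent the memory reordering is to introduce a CPU barrier between the two instructions. Here we want to keep a store from being reordered with the load that follows it, so the barrier we need is commonly called a StoreLoad barrier.

x86/x64 processors have no instruction that acts only as a StoreLoad barrier, but several instructions do that and more. The mfence instruction is a full memory barrier, which prevents memory reordering of any kind. In GCC it can be used as follows: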

     for (;;)                                  // Loop indefinitely
     {
         sem_wait(&beginSema1);                // Wait for signal from main thread
         while (random.integer() % 8 != 0) {}  // Add a short, random delay

         // ----- THE TRANSACTION! -----
         X = 1;
         asm volatile("mfence" ::: "memory");  // Prevent memory reordering
         r1 = Y;

         sem_post(&endSema);                   // Notify transaction complete
     }

The compiled assembly confirms the effect:

    ...
    mov    DWORD PTR _X, 1
    mfence
    mov    eax, DWORD PTR _Y
    mov    DWORD PTR _r1, eax
    ...

With this change, the memory reordering disappears, and the two threads are still free to run on different CPU cores.
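The same transaction can also be written without inline assembly by using the C++11 fence function. This is a sketch rather than the original sample's code: it assumes X and Y are changed to std::atomic<int>, and the function name is hypothetical. On x86/x64, compilers typically implement this fence with mfence or a locked instruction.

```cpp
#include <atomic>

std::atomic<int> X(0), Y(0);
int r1 = 0;

// Thread 1's transaction with a standard full fence between the store and the load.
void transaction_with_fence()
{
    X.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // acts as the StoreLoad barrier
    r1 = Y.load(std::memory_order_relaxed);
}
```

If the second thread mirrors this with Y.store, the same fence and X.load, the C++11 memory model rules out the r1 == 0, r2 == 0 result on any conforming platform.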

Similar Instructions and Different Platforms

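Interestingly, mfence is not the only full memory barrier on x86/x64. On these processors, any locked instruction, such as xchg, also acts as a full memory barrier, provided you do not use SSE instructions or write-combined memory. In fact, the Microsoft C++ compiler generates xchg when you use the MemoryBarrier intrinsic (at least in Visual Studio 2008).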
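With GCC or Clang, one way to get a locked instruction without writing assembly is the __atomic_exchange_n builtin, which on x86/x64 is expected to compile to xchg with a memory operand. The sketch below is an illustration under that assumption, not part of the original sample, and the function name is hypothetical.

```cpp
int X = 0, Y = 0;
int r1 = 0;

// Thread 1's transaction with the store performed by an atomic exchange.
void transaction_with_exchange()
{
    // The previous value of X is returned and deliberately ignored here.
    __atomic_exchange_n(&X, 1, __ATOMIC_SEQ_CST);
    r1 = __atomic_load_n(&Y, __ATOMIC_SEQ_CST);
}
```

Inspecting the output of gcc -O2 -S is the way to confirm which instruction a given compiler version actually emits.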

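The mfence instruction is specific to x86/x64. To write more portable code, you can hide the barrier behind a preprocessor macro. The Linux kernel wraps it in a set of macros, smp_mb, smp_rmb and smp_wmb, with alternate implementations on different architectures. On PowerPC, for example, smp_mb is implemented as sync.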
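A macro of this kind might look like the sketch below. FULL_MEMORY_BARRIER is a hypothetical name chosen for this example (it is not the kernel's smp_mb), the architecture checks rely on GCC and Clang predefined macros, and the fallback uses the __sync_synchronize builtin.

```cpp
// Select a full memory barrier for the target architecture.
#if defined(__x86_64__) || defined(__i386__)
    #define FULL_MEMORY_BARRIER() __asm__ __volatile__("mfence" ::: "memory")
#elif defined(__powerpc__) || defined(__powerpc64__)
    #define FULL_MEMORY_BARRIER() __asm__ __volatile__("sync" ::: "memory")
#else
    #define FULL_MEMORY_BARRIER() __sync_synchronize()  // GCC/Clang builtin full barrier
#endif

int X = 0, Y = 0;
int r1 = 0;

// Thread 1's transaction, now independent of the CPU family.
void transaction_with_portable_barrier()
{
    X = 1;
    FULL_MEMORY_BARRIER();
    r1 = Y;
}
```

The transaction code no longer names a specific instruction, so supporting another architecture only means adding a branch to the macro definition.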

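Every CPU family has its own instructions for enforcing memory ordering, every compiler exposes them in its own way, and every cross-platform project wraps them in its own abstraction layer, none of which makes lock-free programming any simpler. This is part of the reason the C++11 atomic library was introduced: it standardizes all of this and makes it easier to write portable lock-free code.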
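As a rough illustration of what the C++11 library offers, the experiment can be rewritten with std::atomic<int> and std::thread. This is a minimal sketch and not a port of the original sample: the function name is hypothetical, and the threads are created afresh on every iteration instead of being driven by semaphores. Stores and loads on std::atomic default to memory_order_seq_cst, under which the r1 == 0, r2 == 0 result is not allowed.

```cpp
#include <atomic>
#include <thread>

// Runs the two-thread experiment `iterations` times and returns how often
// both r1 and r2 ended up as 0.
int count_simultaneous_zeros(int iterations)
{
    int detected = 0;
    for (int i = 0; i < iterations; i++) {
        std::atomic<int> X(0), Y(0);
        int r1 = -1, r2 = -1;

        // Default store() and load() use memory_order_seq_cst.
        std::thread t1([&] { X.store(1); r1 = Y.load(); });
        std::thread t2([&] { Y.store(1); r2 = X.load(); });
        t1.join();  // join() synchronizes, so r1 and r2 are safe to read afterwards
        t2.join();

        if (r1 == 0 && r2 == 0)
            detected++;
    }
    return detected;
}
```

Creating threads on every iteration keeps the sketch short, but it also makes the two transactions overlap far less often than in the semaphore-driven harness, so this version is not a good detector if the memory order is relaxed. On Linux it may need to be built with -pthread.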

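Translator's notes:

1. Memory reordering is best understood as the reordering of machine instructions that access memory.

2. Loosely speaking, the processor tries to execute together those memory instructions that it considers similar or that access the same memory.

3. The rule behind memory reordering is that the behavior of a single thread is the same before and after the reordering. In the example above, each thread run on its own produces the same result either way; this can also be described as preserving order within a thread.

4. Most multithreaded programs are not lock-free; they protect shared data with mutexes, semaphores and similar primitives. As this article describes for semaphores, such primitives provide acquire and release semantics, so code that is correctly protected by them does not run into memory reordering problems.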
