Pthreads Parallel Programming: A Performance Comparison of Spin Locks and Mutexes (Repost)

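POSIX threads (Pthreads for short) is a widely used API for parallel programming on multicore platforms. Thread synchronization is an essential means of coordination between the threads of a parallel program, and its most typical use is protecting a critical section shared by several threads with one of the locks that Pthreads provides (barriers are another common synchronization mechanism).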

Pthreads provides several locking and synchronization mechanisms:
(1) Mutex: pthread_mutex_***
(2) Spin lock: pthread_spin_***
(3) Condition variable: pthread_cond_***
(4) Read/write lock: pthread_rwlock_***

The main Pthreads APIs for mutex operations are:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
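
As a minimal sketch of how these three calls fit together, the fragment below guards a shared counter with a mutex. The names counter_mutex, counter, increment_blocking and increment_if_free are hypothetical and chosen only for illustration.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical shared state, used only for this illustration. */
pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;
long counter = 0;

/* Blocking acquire: the caller sleeps while the mutex is held elsewhere. */
long increment_blocking(void)
{
    long value;

    pthread_mutex_lock(&counter_mutex);
    value = ++counter;                 /* critical section */
    pthread_mutex_unlock(&counter_mutex);
    return value;
}

/* Non-blocking acquire: returns 0 on success, or the error code from
 * pthread_mutex_trylock (EBUSY when the mutex is already locked). */
int increment_if_free(void)
{
    int rc = pthread_mutex_trylock(&counter_mutex);

    if (rc != 0)
        return rc;
    ++counter;                         /* critical section */
    pthread_mutex_unlock(&counter_mutex);
    return 0;
}
```

pthread_mutex_trylock is the call to reach for when a thread has useful work to do instead of waiting; on EBUSY the caller can simply come back later.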

The main Pthreads APIs for spin lock operations are:
int pthread_spin_lock(pthread_spinlock_t *lock);
int pthread_spin_trylock(pthread_spinlock_t *lock);
int pthread_spin_unlock(pthread_spinlock_t *lock);
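
The spin lock calls follow the same pattern, with one difference worth showing: a pthread_spinlock_t has no static initializer, so it must be set up with pthread_spin_init and released with pthread_spin_destroy. The sketch below assumes a hypothetical hit counter; stats_lock, hits and the four functions are illustrative names only.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical shared state, used only for this illustration. */
pthread_spinlock_t stats_lock;
long hits = 0;

/* Must be called once before any other function here. */
int stats_init(void)
{
    /* PTHREAD_PROCESS_PRIVATE: the lock is shared only by threads
     * of the calling process. */
    return pthread_spin_init(&stats_lock, PTHREAD_PROCESS_PRIVATE);
}

/* Busy-waits until the lock is free, then updates the counter. */
void record_hit(void)
{
    pthread_spin_lock(&stats_lock);
    ++hits;                            /* keep this section very short */
    pthread_spin_unlock(&stats_lock);
}

/* Single attempt: returns 0 on success, EBUSY if the lock is held. */
int try_record_hit(void)
{
    int rc = pthread_spin_trylock(&stats_lock);

    if (rc != 0)
        return rc;
    ++hits;
    pthread_spin_unlock(&stats_lock);
    return 0;
}

int stats_destroy(void)
{
    return pthread_spin_destroy(&stats_lock);
}
```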

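In terms of implementation, a mutex is a sleep-waiting lock. Take a dual-core machine with two threads, A and B, running on Core0 and Core1 respectively. Suppose thread A calls pthread_mutex_lock to acquire the lock for a critical section while thread B is holding it. Thread A blocks: Core0 performs a context switch, puts thread A on a wait queue, and is then free to run other work (say, another thread C) instead of busy-waiting. A spin lock is different: it is a busy-waiting lock. If thread A requests the lock with pthread_spin_lock, it stays on Core0, busy-waiting and retrying the lock request until it finally gets the lock.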

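If you read the source of NPTL (Native POSIX Thread Library), the pthreads implementation in Linux glibc (the command "getconf GNU_LIBPTHREAD_VERSION" prints the NPTL version on your system), you will find that when pthread_mutex_lock() fails to take the lock, it makes a system call to wait and the current thread is added to the mutex's wait queue. (Current NPTL builds its mutexes on futexes: the uncontended path runs entirely in user space and the kernel is entered only under contention, so system calls are far less frequent and performance is much improved.) A spin lock, by contrast, can be thought of as a lock operation implemented with inline assembly inside a while(1) loop (I recall reading a paper saying that in the Linux kernel taking a spin lock needs only two CPU instructions and releasing it only one). Interested readers can also look at the pthreads API implementation in another microkernel, sanos: mutex.c and spinlock.c. The code differs from NPTL's, but it is short and easy to follow, which makes it helpful for understanding how spin locks and mutexes behave.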
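
To make the "lock operation inside a while(1) loop" idea concrete, here is a toy test-and-set spin lock written with C11 atomics rather than inline assembly. It is a sketch of the busy-waiting technique only, not the NPTL or Linux kernel implementation; toy_spinlock and the toy_spin_* functions are hypothetical names.

```c
#include <stdatomic.h>

/* Hypothetical toy lock: one atomic flag, set while the lock is held. */
typedef struct {
    atomic_flag held;
} toy_spinlock;

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Busy-wait: atomically set the flag and look at its previous value.
 * If it was already set, someone else holds the lock, so try again. */
void toy_spin_lock(toy_spinlock *lock)
{
    while (atomic_flag_test_and_set_explicit(&lock->held,
                                             memory_order_acquire))
        ;   /* spin: the thread stays on its core the whole time */
}

/* Single attempt: returns 1 if the lock was acquired, 0 otherwise. */
int toy_spin_trylock(toy_spinlock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->held,
                                              memory_order_acquire);
}

/* Releasing is a single atomic store that clears the flag. */
void toy_spin_unlock(toy_spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

The lock path never enters the kernel, which is why an uncontended spin lock is so cheap, and also why a waiting thread burns CPU for as long as the holder keeps the lock. Production implementations usually add a CPU pause hint or back-off inside the loop, which this sketch omits.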

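So which performs better in practice, the mutex or the spin lock? Spin locks are used very widely in the Linux kernel; does that mean spin locks are faster? Let's test it with real code (make sure a recent g++ is installed on your system).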

 // Name: spinlockvsmutex1.cc
 // Source: http://www.alexonlinux.com/pthread-mutex-vs-pthread-spinlock
 // Compile (spin lock version): g++ -o spin_version -DUSE_SPINLOCK spinlockvsmutex1.cc -lpthread
 // Compile (mutex version): g++ -o mutex_version spinlockvsmutex1.cc -lpthread

 #include <stdio.h>
 #include <unistd.h>
 #include <sys/syscall.h>
 #include <errno.h>
 #include <sys/time.h>
 #include <list>
 #include <pthread.h>

 #define LOOPS 50000000

 using namespace std;

 // The shared list that both consumer threads pop from.
 list<int> the_list;

 #ifdef USE_SPINLOCK
 pthread_spinlock_t spinlock;
 #else
 pthread_mutex_t mutex;
 #endif

 // Get the kernel thread ID. Named current_tid() rather than gettid()
 // because newer glibc versions declare their own gettid().
 pid_t current_tid() { return syscall( __NR_gettid ); }

 void *consumer(void *ptr)
 {
     int i;

     printf("Consumer TID %lu\n", (unsigned long)current_tid());

     while (1)
     {
 #ifdef USE_SPINLOCK
         pthread_spin_lock(&spinlock);
 #else
         pthread_mutex_lock(&mutex);
 #endif

         // Stop once the list is drained; release the lock before leaving.
         if (the_list.empty())
         {
 #ifdef USE_SPINLOCK
             pthread_spin_unlock(&spinlock);
 #else
             pthread_mutex_unlock(&mutex);
 #endif
             break;
         }

         i = the_list.front();
         the_list.pop_front();

 #ifdef USE_SPINLOCK
         pthread_spin_unlock(&spinlock);
 #else
         pthread_mutex_unlock(&mutex);
 #endif
     }

     return NULL;
 }

 int main()
 {
     int i;
     pthread_t thr1, thr2;
     struct timeval tv1, tv2;

 #ifdef USE_SPINLOCK
     pthread_spin_init(&spinlock, PTHREAD_PROCESS_PRIVATE);
 #else
     pthread_mutex_init(&mutex, NULL);
 #endif

     // Creating the list content...
     for (i = 0; i < LOOPS; i++)
         the_list.push_back(i);

     // Measuring time before starting the threads...
     gettimeofday(&tv1, NULL);

     pthread_create(&thr1, NULL, consumer, NULL);
     pthread_create(&thr2, NULL, consumer, NULL);

     pthread_join(thr1, NULL);
     pthread_join(thr2, NULL);

     // Measuring time after threads finished...
     gettimeofday(&tv2, NULL);

     // Borrow one second if the microsecond part would go negative.
     if (tv1.tv_usec > tv2.tv_usec)
     {
         tv2.tv_sec--;
         tv2.tv_usec += 1000000;
     }

     // %06ld keeps the leading zeros of the microsecond part.
     printf("Result - %ld.%06ld\n", (long)(tv2.tv_sec - tv1.tv_sec),
         (long)(tv2.tv_usec - tv1.tv_usec));

 #ifdef USE_SPINLOCK
     pthread_spin_destroy(&spinlock);
 #else
     pthread_mutex_destroy(&mutex);
 #endif

     return 0;
 }

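The program works as follows: the main thread first builds a list and inserts LOOPS entries into it, then creates two new threads that both run consumer(). The two threads pop from the list concurrently. The main thread measures the time from creating the two threads until both have finished and prints it as the "Result" line in the output below.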
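
The listing above times the run with gettimeofday and a manual borrow of one second. As an alternative sketch, not what the original program does, the same interval can be measured with clock_gettime(CLOCK_MONOTONIC), which is not affected by wall-clock adjustments. The helper names now_monotonic and elapsed_seconds are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Read the monotonic clock. */
struct timespec now_monotonic(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts;
}

/* end - start in seconds. Doing the subtraction in floating point
 * removes the need for an explicit borrow when the nanosecond part
 * of end is smaller than that of start. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

In the benchmark, one would call now_monotonic() just before the first pthread_create and again after the second pthread_join, then print elapsed_seconds of the two readings.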

Test machine:
Ubuntu 9.04 X86_64
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
4.0 GB Memory

The test results are as follows:

gchen@gchen-desktop:~/Workspace/mutex$ g++ -o spin_version -DUSE_SPINLOCK spinvsmutex1.cc -lpthread
gchen@gchen-desktop:~/Workspace/mutex$ g++ -o mutex_version spinvsmutex1.cc -lpthread
gchen@gchen-desktop:~/Workspace/mutex$ time ./spin_version
Consumer TID
Consumer TID
Result - 5.888750

real    0m10.918s
user    0m15.601s
sys    0m0.804s

gchen@gchen-desktop:~/Workspace/mutex$ time ./mutex_version
Consumer TID
Consumer TID
Result - 9.116376

real    0m14.031s
user    0m12.245s
sys    0m4.368s

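As the numbers show, the spin lock version performs better in this program. The sys time is also worth noting: the mutex version spends far more time in system calls (4.368s versus 0.804s), because under contention a mutex enters the kernel to put the waiting thread to sleep.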
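
The user and sys figures above come from the shell's time command. As a rough sketch, the same split can also be read from inside the program with getrusage, which additionally reports voluntary context switches, a count that grows when threads go to sleep. The cpu_usage struct and the function names are hypothetical.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical container for the figures of interest. */
typedef struct {
    double user_seconds;        /* CPU time spent in user mode */
    double sys_seconds;         /* CPU time spent in the kernel */
    long   voluntary_switches;  /* times the process gave up the CPU */
} cpu_usage;

double timeval_to_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Fills *out with the usage of the whole process (all threads).
 * Returns 0 on success, -1 if getrusage fails. */
int read_cpu_usage(cpu_usage *out)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    out->user_seconds = timeval_to_seconds(ru.ru_utime);
    out->sys_seconds = timeval_to_seconds(ru.ru_stime);
    out->voluntary_switches = ru.ru_nvcsw;
    return 0;
}
```

Taking one reading before the threads are created and one after they are joined, and subtracting, isolates the cost of the contended phase from the cost of building the list.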

But does that mean a spin lock is always the better choice? Let's look at another example program, one with very heavy lock contention:

 //Name: svm2.c
 //Source: http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Locks
 //Compile(spin lock version): gcc -o spin -DUSE_SPINLOCK svm2.c -lpthread
 //Compile(mutex version): gcc -o mutex svm2.c -lpthread

 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
 #include <unistd.h>
 #include <pthread.h>
 #include <sys/syscall.h>

 #define        THREAD_NUM     2

 /* ASSUMED VALUES: the loop bounds and the "winning" count did not
    survive in this copy of the listing. The three constants below are
    placeholders that keep the program runnable, not the original figures. */
 #define        OUTER_LOOPS    10000
 #define        INNER_LOOPS    100000
 #define        WINNING_COUNT  123456789

 pthread_t g_thread[THREAD_NUM];

 #ifdef USE_SPINLOCK
 pthread_spinlock_t g_spin;
 #else
 pthread_mutex_t g_mutex;
 #endif

 uint64_t g_count;

 /* Kernel thread ID; named current_tid() to avoid clashing with the
    gettid() declared by newer glibc versions. */
 pid_t current_tid(void)
 {
         return syscall(SYS_gettid);
 }

 void *run_amuck(void *arg)
 {
         int i, j;

         printf("Thread %lu started.\n", (unsigned long)current_tid());

         for (i = 0; i < OUTER_LOOPS; i++) {
 #ifdef USE_SPINLOCK
                 pthread_spin_lock(&g_spin);
 #else
                 pthread_mutex_lock(&g_mutex);
 #endif
                 /* Deliberately long critical section. */
                 for (j = 0; j < INNER_LOOPS; j++) {
                         if (g_count++ == WINNING_COUNT)
                                 printf("Thread %lu wins!\n", (unsigned long)current_tid());
                 }
 #ifdef USE_SPINLOCK
                 pthread_spin_unlock(&g_spin);
 #else
                 pthread_mutex_unlock(&g_mutex);
 #endif
         }

         printf("Thread %lu finished!\n", (unsigned long)current_tid());

         return (NULL);
 }

 int main(int argc, char *argv[])
 {
         int i, threads = THREAD_NUM;

         printf("Creating %d threads...\n", threads);

 #ifdef USE_SPINLOCK
         pthread_spin_init(&g_spin, PTHREAD_PROCESS_PRIVATE);
 #else
         pthread_mutex_init(&g_mutex, NULL);
 #endif

         for (i = 0; i < threads; i++)
                 pthread_create(&g_thread[i], NULL, run_amuck, (void *)(intptr_t)i);

         for (i = 0; i < threads; i++)
                 pthread_join(g_thread[i], NULL);

         printf("Done.\n");

         return (0);
 }

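What characterizes this program is a very large critical section, so the two threads contend for the lock very heavily. This is of course an extreme case; in real applications critical sections are not this large and contention is not this intense. The results show that the mutex version performs better: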

gchen@gchen-desktop:~/Workspace/mutex$ time ./spin
Creating 2 threads...
Thread  started.
Thread  started.
Thread  wins!
Thread  finished!
Thread  finished!
Done.

real    0m5.748s
user    0m10.257s
sys    0m0.004s

gchen@gchen-desktop:~/Workspace/mutex$ time ./mutex
Creating 2 threads...
Thread  started.
Thread  started.
Thread  wins!
Thread  finished!
Thread  finished!
Done.

real    0m4.823s
user    0m4.772s
sys    0m0.032s

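Another detail worth noting is that the spin lock version consumes far more user time (10.257s versus 4.772s). The two threads run on separate cores, and most of the time only one of them can hold the lock, so the other busy-waits on its core with CPU usage pinned at 100%. With a mutex, a failed lock request triggers a context switch, which frees a core for other computation. (This context switching also costs the thread holding the lock something: when it releases the lock it has to ask the operating system to wake the blocked threads, which is extra overhead.)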

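Summary
(1) A mutex suits scenarios with very frequent lock operations and adapts better to varying conditions. It costs more than a spin lock (mainly in context switches), but it copes with the complex situations found in real development and offers more flexibility while still delivering reasonable performance.

(2) A spin lock has faster lock/unlock operations (fewer CPU instructions), but it only suits critical sections that run for a very short time. In real software development, unless the programmer understands the program's locking behavior very well, using spin locks is not a good idea (a multithreaded program typically performs tens of thousands of lock operations, and if too many of them are contended lock requests, a lot of time is wasted spinning).

(3) A safer approach is probably to start (conservatively) with a mutex and, if more performance is needed, try tuning with spin locks. After all, our programs do not have performance requirements as strict as the Linux kernel's (the locks most commonly used in the Linux kernel are spin locks and rw locks).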
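
One middle ground between the two locks, offered here only as a sketch and not something the tests above measured, is to spin for a bounded number of attempts and then fall back to a blocking wait. The function name lock_spin_then_block and the max_spins parameter are hypothetical; a sensible value for max_spins would have to be found by measurement.

```c
#include <pthread.h>

/* Try to take the mutex without sleeping up to max_spins times, then
 * give up spinning and block. Returns 0 if the lock was obtained while
 * spinning, 1 if the blocking path was used. */
int lock_spin_then_block(pthread_mutex_t *mutex, int max_spins)
{
    int i;

    for (i = 0; i < max_spins; i++) {
        if (pthread_mutex_trylock(mutex) == 0)
            return 0;          /* acquired without sleeping */
    }

    pthread_mutex_lock(mutex); /* sleep until the holder releases it */
    return 1;
}
```

A short critical section is then usually entered without a context switch, while a long one no longer pins a core at 100%. glibc also offers a non-portable mutex type, PTHREAD_MUTEX_ADAPTIVE_NP, built on a similar spin-then-sleep idea.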

Addendum, March 3, 2010: Oracle's documentation supports this view:

During configuration, Berkeley DB selects a mutex implementation for the architecture. Berkeley DB normally prefers blocking-mutex implementations over non-blocking ones. For example, Berkeley DB will select POSIX pthread mutex interfaces rather than assembly-code test-and-set spin mutexes because pthread mutexes are usually more efficient and less likely to waste CPU cycles spinning without getting any work accomplished.

P.S. Both syscall(SYS_gettid) and syscall(__NR_gettid) return the ID of the current thread. :)

Reposted from: www.parallellabs.com
