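Getting to the Bottom of It (1): the causes of spurious wakeups, and why pthread_cond_wait and pthread_cond_signal must be used with a while loop
Getting to the Bottom of Spurious Wakeups
1. Overview
This article covers the following:
- What a spurious wakeup is
- What causes spurious wakeups (two causes)
- Why the system does not fix this "bug" at the root (three reasons)
- How developers deal with spurious wakeups (re-check the condition in a while loop)
2. What is a spurious wakeup
When a single call to pthread_cond_signal causes more than one thread to return from pthread_cond_wait, the effect is called a "spurious wakeup". (Normally, one call to pthread_cond_signal lets one thread return.)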
The effect is that more than one thread can return from its call to pthread_cond_wait() or pthread_cond_timedwait() as a result of one call to pthread_cond_signal(). This effect is called "spurious wakeup".
Source: https://pubs.opengroup.org/onlinepubs/009604599/functions/pthread_cond_signal.html
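3. What causes spurious wakeups
Broadly, I have come across two explanations:
- Unavoidable system-level events such as signal interruptions. This is the official explanation, and it matches the definition of a spurious wakeup.
- Application-level design problems. Strictly speaking this is not a spurious wakeup, but it makes a good example for the while loop discussed later, so I treat it as a spurious wakeup in a broad sense. (This is my own definition of a "broad spurious wakeup": a thread is woken up when the developer did not intend it to be.)
3.1 System interruptions
On Linux, pthread_cond_wait is implemented with the futex system call. When a process receives a signal, a blocking system call (wait, read, recv and the like) can return immediately with errno set to EINTR. The futex call underneath pthread_cond_wait can be interrupted the same way, so the wait may return even though nobody called pthread_cond_signal. (pthread_cond_wait itself never reports EINTR to its caller; the thread simply wakes up.)
In addition, like any other code, the thread scheduler may suffer a temporary blackout when something abnormal happens in the underlying hardware or software.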
The pthread_cond_wait() function in Linux is implemented using the futex system call. Each blocking system call on Linux returns abruptly with EINTR when the process receives a signal.
Source: http://en.wikipedia.org/w/index.php?title=Spurious_wakeup&oldid=289803065
like any code, thread scheduler may experience temporary blackout due to something abnormal happening in underlying hardware / software.
Source: https://softwareengineering.stackexchange.com/questions/186842/spurious-wakeups-explanation-sounds-like-a-bug-that-just-isnt-worth-fixing-is
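The EINTR behaviour described in 3.1 can be observed directly on a plain blocking call. The following is a minimal sketch, not part of the quoted sources: it blocks in read() on an empty pipe and lets a SIGALRM timer interrupt it. The function name errno_after_interrupted_read and the 10 ms timer period are assumptions made for illustration, and the sketch shows read(), not pthread_cond_wait, which does not surface EINTR to its caller.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* The handler does nothing; its only purpose is to interrupt read(). */
static void on_alarm(int signo) { (void)signo; }

/* Hypothetical helper: blocks in read() on an empty pipe until SIGALRM
 * arrives, then returns the errno that read() reported (0 if read did
 * not fail, -1 if the pipe could not be created). */
int errno_after_interrupted_read(void) {
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;   /* sa_flags stays 0: no SA_RESTART */
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGALRM, &sa, NULL);
    assert(rc == 0);

    /* Repeating timer, so a tick that fires before read() blocks
     * is followed by another one. */
    struct itimerval tick = { .it_interval = {0, 10000},
                              .it_value    = {0, 10000} };
    setitimer(ITIMER_REAL, &tick, NULL);

    char byte;
    ssize_t n = read(fds[0], &byte, 1);   /* nothing is ever written */
    int saved = (n < 0) ? errno : 0;

    struct itimerval off = { .it_interval = {0, 0}, .it_value = {0, 0} };
    setitimer(ITIMER_REAL, &off, NULL);   /* stop the timer */
    close(fds[0]);
    close(fds[1]);
    return saved;
}
```

Because the handler is installed without SA_RESTART, the interrupted read() is not restarted automatically, which is the case the quoted text describes.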
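3.2 Application-level problems
Note: I am not sure this case counts as a spurious wakeup in the strict sense, because it does not match the definition above: no single signal causes several waits to return. I treat it here as a spurious wakeup in the broad sense.
Consider a producer-consumer queue with three threads (thread1 and thread2 are consumers, thread3 is the producer):
thread1: locks, dequeues directly, then unlocks. Pseudocode:
    lock
    dequeue // consume
    unlock
thread2: compared with thread1, it adds a check and waits if the queue is empty:
    lock
    if (queue is empty) pthread_cond_wait
    dequeue // consume; the bug surfaces here, and the if above must become a while (analysis below)
    unlock
thread3: similar to thread1, but it is the producer: it locks, enqueues, then unlocks:
    lock
    enqueue // produce
    if (queue is not empty) pthread_cond_signal
    unlock
The steps are:
- Assume the queue is initially empty and thread2 is blocked in wait.
- thread3 produces one item, so the queue is no longer empty; thread3 sends the signal and unlocks.
- thread2's wait has been notified, but wait needs two things before it can return: (1) receiving the signal and (2) re-acquiring the lock. thread2 and thread1 are still competing for the lock.
- Suppose thread1 gets the lock first, consumes the item so the queue is empty again, and unlocks.
- Now suppose thread2 gets the lock. Its wait finally returns, it leaves the if statement and goes on to consume, but the queue has long been empty, so the program may fail. The intent was that thread2 consumes only when the queue is non-empty, so thread2 has been spuriously woken in the broad sense.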
Consider a producer consumer queue, and three threads.
Thread 1 has just dequeued an element and released the mutex, and the queue is now empty. The thread is doing whatever it does with the element it acquired on some CPU.
Thread 2 attempts to dequeue an element, but finds the queue to be empty when checked under the mutex, calls pthread_cond_wait, and blocks in the call awaiting signal/broadcast.
Thread 3 obtains the mutex, inserts a new element into the queue, notifies the condition variable, and releases the lock.
In response to the notification from thread 3, thread 2, which was waiting on the condition, is scheduled to run.
However before thread 2 manages to get on the CPU and grab the queue lock, thread 1 completes its current task, and returns to the queue for more work. It obtains the queue lock, checks the predicate, and finds that there is work in the queue. It proceeds to dequeue the item that thread 3 inserted, releases the lock, and does whatever it does with the item that thread 3 enqueued.
Thread 2 now gets on a CPU and obtains the lock, but when it checks the predicate, it finds that the queue is empty. Thread 1 'stole' the item, so the wakeup appears to be spurious. Thread 2 needs to wait on the condition again.
So since you already always need to check the predicate under a loop, it makes no difference if the underlying condition variables can have other sorts of spurious wakeups.
Source: https://stackoverflow.com/questions/8594591/why-does-pthread-cond-wait-have-spurious-wakeups
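The interleaving in 3.2 can be replayed deterministically. The sketch below is an illustration under stated assumptions, not code from the quoted answer: the queue is reduced to an item counter, and the main thread plays both thread3 (enqueue and signal) and thread1 (stealing the item) inside one critical section, which models thread1 winning the lock before thread2. The names demo_t, buggy_consumer and replay_stolen_wakeup are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
    int  count;          /* number of queued items */
    bool waiting;        /* set by the consumer just before it waits */
    bool woke_to_empty;  /* result: queue was empty when wait returned */
} demo_t;

/* thread2 from the text, deliberately using `if` (the buggy form). */
static void *buggy_consumer(void *arg) {
    demo_t *d = arg;
    pthread_mutex_lock(&d->lock);
    if (d->count == 0) {
        d->waiting = true;
        pthread_cond_wait(&d->not_empty, &d->lock);
    }
    /* A real consumer would dequeue here; we only record the predicate. */
    d->woke_to_empty = (d->count == 0);
    pthread_mutex_unlock(&d->lock);
    return NULL;
}

/* Returns true if thread2 resumed while the queue was empty. */
bool replay_stolen_wakeup(void) {
    demo_t d = { .count = 0, .waiting = false, .woke_to_empty = false };
    pthread_mutex_init(&d.lock, NULL);
    pthread_cond_init(&d.not_empty, NULL);

    pthread_t t2;
    int rc = pthread_create(&t2, NULL, buggy_consumer, &d);
    assert(rc == 0);

    /* Poll until thread2 has set `waiting`; it holds the lock until
     * pthread_cond_wait releases it, so seeing the flag under the lock
     * means thread2 is already inside the wait. */
    for (;;) {
        pthread_mutex_lock(&d.lock);
        bool ready = d.waiting;
        pthread_mutex_unlock(&d.lock);
        if (ready)
            break;
        sched_yield();
    }

    pthread_mutex_lock(&d.lock);
    d.count++;                          /* thread3: enqueue */
    pthread_cond_signal(&d.not_empty);  /* thread3: signal  */
    d.count--;                          /* thread1: takes the item first */
    pthread_mutex_unlock(&d.lock);

    pthread_join(t2, NULL);
    pthread_cond_destroy(&d.not_empty);
    pthread_mutex_destroy(&d.lock);
    return d.woke_to_empty;
}
```

In this replay thread2 always resumes with an empty queue, so a real dequeue at that point would operate on a queue that has nothing in it.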
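4. Why spurious wakeups are not fixed at the root
Roughly, the reasons are:
- Practical reasons: the problem is hard to fix, and nobody wants to fix it.
- Performance: eliminating spurious wakeups could substantially reduce the efficiency of condition variables (a somewhat mysterious explanation; see David R. Butenhof, "Programming with POSIX Threads", p. 80).
- The "Ah Q spirit" (making a virtue of necessity): spurious wakeups force us to think about and handle these system-level failures, which ends up "helping" us write more robust programs.
4.1 Practical reasons
The problem is hard to fix, and nobody wants to fix it.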
The first reason is that nobody wants to fix it.
The second reason is that fixing this is supposed to be hard.
Source: http://blog.vladimirprus.com/2005/07/spurious-wakeups.html
4.2 Performance
In short, it is not worth sacrificing overall efficiency to fix a problem that occurs rarely and is easy to handle by other means.
Spurious wakeups may sound strange, but on some multiprocessor systems, making condition wakeup completely predictable might substantially slow all condition variable operations.
Source: David R. Butenhof, "Programming with POSIX Threads" (p. 80)
While this problem could be resolved, the loss of efficiency for a fringe condition that occurs only rarely is unacceptable, especially given that one has to check the predicate associated with a condition variable anyway. Correcting this problem would unnecessarily reduce the degree of concurrency in this basic building block for all higher-level synchronization operations.
Source: https://pubs.opengroup.org/onlinepubs/009604599/functions/pthread_cond_signal.html
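4.3 The "Ah Q spirit"
Such failures should of course be made as rare as possible. But because no software is 100% robust, it is reasonable to assume they will happen and to have the scheduler recover gracefully when it detects one (for example, by noticing missing heartbeats).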
Of course, care should be taken for this to happen as rare as possible, but since there's no such thing as 100% robust software it is reasonable to assume this can happen and take care on the graceful recovery in case if scheduler detects this (eg by observing missing heartbeats).
Source: https://softwareengineering.stackexchange.com/questions/186842/spurious-wakeups-explanation-sounds-like-a-bug-that-just-isnt-worth-fixing-is
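5. How to deal with the bugs caused by spurious wakeups
First, why one tempting approach does not work, and why a while loop is both workable and necessary:
- Simply calling wait again does not work: if the signal is sent before the second wait starts, that wait never receives it and the thread stays suspended.
- A while loop is workable and necessary.
5.1 Why simply calling wait again does not work
When glibc calls a blocking function such as read, it does so in a loop: if the call fails with errno set to EINTR, it calls read again. Can the same trick handle spurious wakeups? No. Pseudocode:
    // read model
    readagain:
    ret = read(fd, buf, len);
    if (ret < 0 && errno == EINTR) // interrupted, so read again
        goto readagain;
    // wait model
    wait // first wait; after it returns, suppose we somehow detect a spurious wakeup and decide to wait again
    // any signal sent in this gap is missed
    wait // second wait; the signal was missed, so this waits forever
This works for read because read simply takes data from the receive buffer: the data stays there no matter how many times, or for how long, the call is interrupted. A signal on a condition variable is not stored anywhere. If wait is interrupted and then called again, a signal sent in the gap is lost, and the thread may wait forever.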
... when glibc calls any blocking function, like 'read', it does it in a loop, and if 'read' returns EINTR, calls 'read' again.
Can the same trick be used to conditions? No, because the moment we return from 'futex' call, another thread can send us notification. And since we're not waiting inside 'futex', we'll miss the notification. So, we need to return to the caller, and have it reevaluate the predicate. If another thread indeed set it to true, we'll break out of the loop.
Source: http://blog.vladimirprus.com/2005/07/spurious-wakeups.html
Now, how could scheduler recover, taking into account that during blackout it could miss some signals intended to notify waiting threads? If scheduler does nothing, mentioned "unlucky" threads will just hang, waiting forever - to avoid this, scheduler would simply send a signal to all the waiting threads.
Source: https://softwareengineering.stackexchange.com/questions/186842/spurious-wakeups-explanation-sounds-like-a-bug-that-just-isnt-worth-fixing-is
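For comparison, the read model from 5.1 can be written out in C as follows. This is a minimal sketch of the retry idea, not glibc's actual code; the helper name read_retry is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: call read() again whenever it was interrupted by
 * a signal. Any other result (data, end of file, a real error) is
 * returned to the caller unchanged. */
ssize_t read_retry(int fd, void *buf, size_t len) {
    assert(buf != NULL);
    ssize_t ret;
    do {
        ret = read(fd, buf, len);
    } while (ret < 0 && errno == EINTR);   /* interrupted: read again */
    return ret;
}
```

Retrying is safe here because unread data stays in the kernel buffer between attempts. A condition variable has no such buffer for signals, which is why the same loop around wait alone would not be enough.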
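5.2 Why a while loop is workable and necessary
In short, the whole point of wait and signal is that the programmer wants some condition to hold. Since spurious wakeups exist, we simply check that condition directly after every wakeup. And because a thread can be woken spuriously more than once, the check goes in a while loop.
Here is the example from 3.2 again, with the if changed to a while:
thread1: locks, dequeues directly, then unlocks. Pseudocode:
    lock
    dequeue
    unlock
thread2: compared with thread1, it adds a check and waits if the queue is empty:
    lock
    while (queue is empty) pthread_cond_wait // changed to while
    dequeue
    unlock
thread3: similar to thread1, but it is the producer: it locks, enqueues, then unlocks:
    lock
    enqueue
    if (queue is not empty) pthread_cond_signal
    unlock
The steps are:
- Assume the queue is initially empty and thread2 is blocked in wait.
- thread3 produces one item, so the queue is no longer empty; thread3 sends the signal and unlocks.
- thread2's wait has been notified, but thread2 and thread1 are still competing for the lock.
- Suppose thread1 gets the lock first, consumes the item so the queue is empty again, and unlocks.
- Now suppose thread2 gets the lock and its wait finally returns. Because of the while loop, thread2 checks again whether the queue is empty and finds that it still is (thread1 has already stolen and consumed the item). It therefore cannot leave the loop and waits again, which is the correct behaviour.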
Assumption of spurious wakeups forces thread to be conservative in what it does: set condition when notifying other threads, and liberal in what it accepts: check the condition upon any return from wait and repeat wait if it's not there yet.
Source: https://softwareengineering.stackexchange.com/users/31260/gnat
So, we need to return to the caller, and have it reevaluate the predicate. If another thread indeed set it to true, we'll break out of the loop.
Source: http://blog.vladimirprus.com/2005/07/spurious-wakeups.html
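The while-based pseudocode from 5.2 could look like this with real pthread calls. It is a minimal sketch under stated assumptions: the queue is a fixed-size ring buffer of int, and the names queue_t, queue_init, queue_push, queue_pop and QUEUE_CAP are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 16   /* hypothetical fixed capacity for this sketch */

typedef struct {
    int    items[QUEUE_CAP];
    size_t head;    /* index of the oldest item */
    size_t count;   /* number of queued items */
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
} queue_t;

void queue_init(queue_t *q) {
    q->head = 0;
    q->count = 0;
    int rc = pthread_mutex_init(&q->lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&q->not_empty, NULL);
    assert(rc == 0);
}

/* Producer (thread3): enqueue under the lock, then signal.
 * Returns 0 on success, -1 if the queue is full. */
int queue_push(queue_t *q, int value) {
    pthread_mutex_lock(&q->lock);
    if (q->count == QUEUE_CAP) {
        pthread_mutex_unlock(&q->lock);
        return -1;
    }
    q->items[(q->head + q->count) % QUEUE_CAP] = value;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Consumer (thread2): re-check the predicate after every wakeup. */
int queue_pop(queue_t *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)   /* while, not if */
        pthread_cond_wait(&q->not_empty, &q->lock);
    int value = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return value;
}
```

With this shape it no longer matters why pthread_cond_wait returned: an item stolen by another consumer and a wakeup without any signal are both handled by the same re-check of q->count.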
6. References
- A user-level cause of spurious wakeups: https://stackoverflow.com/questions/8594591/why-does-pthread-cond-wait-have-spurious-wakeups
- The system-level cause of spurious wakeups: https://softwareengineering.stackexchange.com/questions/186842/spurious-wakeups-explanation-sounds-like-a-bug-that-just-isnt-worth-fixing-is
- Why spurious wakeups are deliberately left unfixed: https://pubs.opengroup.org/onlinepubs/009604599/functions/pthread_cond_signal.html