Getting to the Bottom of Spurious Wakeups

1. Overview

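This article covers the following topics:

What a spurious wakeup is
What causes spurious wakeups (two causes)
Why the kernel does not fix the spurious wakeup "bug" at the root (three reasons)
How developers deal with spurious wakeups (re-check the condition in a while loop)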

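2. What is a spurious wakeup

When a single call to pthread_cond_signal causes more than one thread to return from pthread_cond_wait, the effect is called a "spurious wakeup". (The intended behavior is that one pthread_cond_signal call wakes one thread.)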

The effect is that more than one thread can return from its call to pthread_cond_wait() or pthread_cond_timedwait() as a result of one call to pthread_cond_signal(). This effect is called "spurious wakeup".

Source: https://pubs.opengroup.org/onlinepubs/009604599/functions/pthread_cond_signal.html

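3. What causes spurious wakeups

Broadly, I have come across two explanations:

Unavoidable system-level problems such as interrupts. This is the official explanation, and it matches the definition of a spurious wakeup.
Design problems in application code. Strictly speaking this is not a spurious wakeup, but it makes a good example for the later discussion of while, so for now I treat it as a spurious wakeup in the broad sense. (This is my own definition of a "broad spurious wakeup": a thread is woken up when the developer did not intend it to be.)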

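3.1 System interrupts and similar events

On Linux, pthread_cond_wait is implemented with the futex system call. When a process receives a signal, a blocking system call (such as wait, read, or recv) can return immediately with errno set to EINTR. In other words, wait can return even though nobody called pthread_cond_signal.

Like any other code, the thread scheduler may suffer a temporary blackout when something abnormal happens in the underlying hardware or software.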

The pthread_cond_wait() function in Linux is implemented using the futex system call. Each blocking system call on Linux returns abruptly with EINTR when the process receives a signal.

Source: http://en.wikipedia.org/w/index.php?title=Spurious_wakeup&oldid=289803065

like any code, thread scheduler may experience temporary blackout due to something abnormal happening in underlying hardware / software.

Source: https://softwareengineering.stackexchange.com/questions/186842/spurious-wakeups-explanation-sounds-like-a-bug-that-just-isnt-worth-fixing-is
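To make the EINTR behavior described above concrete, here is a minimal C sketch. It does not call futex directly; it blocks in read() on an empty pipe and lets a timer signal interrupt the call, which is enough to show a blocking system call returning early. The function name blocking_read_errno, the 20 ms timer period, and the choice of SIGALRM are my own illustrative assumptions.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Empty handler: it only makes SIGALRM a caught signal instead of a fatal one. */
static void on_alarm(int signo) { (void)signo; }

/*
 * Hypothetical helper: blocks in read() on an empty pipe while a repeating
 * timer delivers SIGALRM. Returns the errno seen when read() fails, or 0 if
 * read() did not fail.
 */
int blocking_read_errno(void)
{
    int fds[2];
    int rc = pipe(fds);
    assert(rc == 0);
    (void)rc;

    struct sigaction sa, old_sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                      /* deliberately no SA_RESTART */
    sigaction(SIGALRM, &sa, &old_sa);

    /* Repeat every 20 ms, so a signal that arrives before read() blocks is harmless. */
    struct itimerval tv;
    memset(&tv, 0, sizeof tv);
    tv.it_interval.tv_usec = 20000;
    tv.it_value.tv_usec = 20000;
    setitimer(ITIMER_REAL, &tv, NULL);

    char byte;
    ssize_t n = read(fds[0], &byte, 1);   /* nobody writes: blocks until a signal arrives */
    int saved = (n < 0) ? errno : 0;

    /* Stop the timer and restore the previous handler. */
    struct itimerval off;
    memset(&off, 0, sizeof off);
    setitimer(ITIMER_REAL, &off, NULL);
    sigaction(SIGALRM, &old_sa, NULL);
    close(fds[0]);
    close(fds[1]);
    return saved;
}
```

Because the handler is installed without SA_RESTART, the interrupted read() is not restarted automatically, so the caller sees the failure and the EINTR error code itself.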

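3.2 Application-level problems

Note: I am not sure this case counts as a spurious wakeup in the strict sense, because it does not match the definition above: a single signal does not cause more than one wait to return. So for now I treat it as a spurious wakeup in the broad sense.

Consider a producer-consumer queue with three threads (thread1 and thread2 are consumers, thread3 is the producer):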

thread1: locks, dequeues right away, then unlocks. Pseudocode:
```
lock

dequeue // consume

unlock
```

thread2: compared with thread1, it adds a check: if the queue is empty, it waits:

lock

if (queue is empty)  pthread_cond_wait

dequeue // consume; this is the bug: the if must be changed to while, as analyzed below

unlock

thread3: similar to thread1, but it is the producer: it locks, enqueues, then unlocks

lock

enqueue // produce

if (queue is not empty) pthread_cond_signal

unlock

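The sequence of events:

Assume the queue is initially empty and thread2 is blocked in wait.
thread3 produces one item, so the queue is no longer empty; thread3 sends the signal and unlocks.
thread2's wait has been notified, but wait can return only when two conditions hold: (1) the signal has been received, and (2) the lock has been reacquired. thread2 is still competing with thread1 for the lock.
Suppose thread1 gets the lock first, consumes the item so that the queue is empty again, and unlocks.
Now thread2 gets the lock and its wait finally returns. It leaves the if statement and tries to consume, but the queue has long been empty, so the program may fail. The intent was that thread2 consumes only when the queue is not empty, so thread2 has been spuriously woken up in the broad sense.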

Consider a producer consumer queue, and three threads.

Thread 1 has just dequeued an element and released the mutex, and the queue is now empty. The thread is doing whatever it does with the element it acquired on some CPU.

Thread 2 attempts to dequeue an element, but finds the queue to be empty when checked under the mutex, calls pthread_cond_wait, and blocks in the call awaiting signal/broadcast.

Thread 3 obtains the mutex, inserts a new element into the queue, notifies the condition variable, and releases the lock.

In response to the notification from thread 3, thread 2, which was waiting on the condition, is scheduled to run.

However before thread 2 manages to get on the CPU and grab the queue lock, thread 1 completes its current task, and returns to the queue for more work. It obtains the queue lock, checks the predicate, and finds that there is work in the queue. It proceeds to dequeue the item that thread 3 inserted, releases the lock, and does whatever it does with the item that thread 3 enqueued.

Thread 2 now gets on a CPU and obtains the lock, but when it checks the predicate, it finds that the queue is empty. Thread 1 'stole' the item, so the wakeup appears to be spurious. Thread 2 needs to wait on the condition again.

So since you already always need to check the predicate under a loop, it makes no difference if the underlying condition variables can have other sorts of spurious wakeups.

Source: https://stackoverflow.com/questions/8594591/why-does-pthread-cond-wait-have-spurious-wakeups
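The scenario above can be sketched in C with one producer and two consumers. This is only a sketch under my own assumptions: the queue is reduced to an item counter, both consumers wait on the condition variable (unlike thread1 in the pseudocode), and the names run_queue_demo, empty_wakeups, and ITEMS are hypothetical. A wakeup that finds the queue empty is counted and the thread waits again, instead of dequeuing from an empty queue.

```c
#include <assert.h>
#include <pthread.h>

enum { ITEMS = 10000 };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
static int queued;         /* items currently in the queue */
static int produced_all;   /* set once the producer has finished */
static int consumed;       /* total items dequeued by both consumers */
static int empty_wakeups;  /* wait returned, but the queue was empty */

static void *consumer(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (queued == 0 && !produced_all) {
            pthread_cond_wait(&nonempty, &lock);
            if (queued == 0 && !produced_all)
                empty_wakeups++;   /* an `if`-based consumer would dequeue here */
        }
        if (queued == 0) {         /* producer finished and nothing is left */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        queued--;                  /* dequeue: the predicate holds under the lock */
        consumed++;
        pthread_mutex_unlock(&lock);
    }
}

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITEMS; i++) {
        pthread_mutex_lock(&lock);
        queued++;                  /* enqueue */
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    produced_all = 1;              /* tell both consumers to drain and exit */
    pthread_cond_broadcast(&nonempty);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Runs the three threads and returns the number of items consumed. */
int run_queue_demo(void)
{
    pthread_t c1, c2, p;
    int rc;
    queued = produced_all = consumed = empty_wakeups = 0;
    rc = pthread_create(&c1, NULL, consumer, NULL); assert(rc == 0);
    rc = pthread_create(&c2, NULL, consumer, NULL); assert(rc == 0);
    rc = pthread_create(&p, NULL, producer, NULL);  assert(rc == 0);
    (void)rc;
    pthread_join(p, NULL);
    pthread_join(c1, NULL);
    pthread_join(c2, NULL);
    return consumed;
}
```

The value of empty_wakeups depends on scheduling and may well be zero on a given run, but the number of consumed items always equals the number produced, because no consumer ever dequeues without seeing a non-empty queue under the lock.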

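4. Why spurious wakeups are not fixed at the root

Roughly, the reasons are:

Practical reasons: the bug is hard to fix and nobody wants to fix it
Performance: ruling out spurious wakeups could substantially slow down condition variables (a rather mysterious explanation; see David R. Butenhof, "Programming with POSIX Threads", p. 80)
Making a virtue of necessity (the "Ah Q spirit"): spurious wakeups force us to think about and handle these system-level failures, which in turn "helps" us make our programs more robust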

4.1 Practical reasons

The bug is hard to fix, and nobody wants to fix it.

The first reason is that nobody wants to fix it.

The second reason is that fixing this is supposed to be hard.

Source: http://blog.vladimirprus.com/2005/07/spurious-wakeups.html

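4.2 Performance considerations

In short, it is not worth sacrificing overall performance to fix a bug that occurs rarely and is easy to work around by other means.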

Spurious wakeups may sound strange, but on some multiprocessor systems, making condition wakeup completely predictable might substantially slow all condition variable operations.

Source: David R. Butenhof, "Programming with POSIX Threads" (p. 80)

While this problem could be resolved, the loss of efficiency for a fringe condition that occurs only rarely is unacceptable, especially given that one has to check the predicate associated with a condition variable anyway. Correcting this problem would unnecessarily reduce the degree of concurrency in this basic building block for all higher-level synchronization operations.

Source: https://pubs.opengroup.org/onlinepubs/009604599/functions/pthread_cond_signal.html

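4.3 Making a virtue of necessity

Of course, such failures should be kept as rare as possible. But since no software is 100% robust, it is reasonable to assume they will happen, and to recover gracefully when the scheduler detects one (for example, by noticing missed heartbeats).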

Of course, care should be taken for this to happen as rare as possible, but since there's no such thing as 100% robust software it is reasonable to assume this can happen and take care on the graceful recovery in case if scheduler detects this (eg by observing missing heartbeats).

Source: https://softwareengineering.stackexchange.com/questions/186842/spurious-wakeups-explanation-sounds-like-a-bug-that-just-isnt-worth-fixing-is

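5. How to deal with the bugs caused by spurious wakeups

First, why one tempting approach does not work, and why a while loop is both workable and necessary:

Simply calling wait again does not work: if the signal has already been sent, the second wait never receives it and the thread stays suspended.
A while loop is workable and necessary.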

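5.1 Why simply calling wait again does not work

When glibc calls a blocking function such as read, it does so in a loop: if the call fails with errno set to EINTR, it calls read again. Can the same trick handle spurious wakeups? No. Pseudocode: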

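// read model

readagain:

	ret = read(fd);

	if (ret < 0 && errno == EINTR) // interrupted: read again

		goto readagain;

// wait model

wait // first wait; after it returns, suppose we somehow detect a spurious wakeup and are about to wait again

// a signal sent during this gap is missed

wait // second wait; the signal was missed, so this waits forever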

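Retrying works for read because read just takes data from the receive buffer: no matter how many times or for how long it is interrupted, the data is still there. With wait, if we return and then wait again, a signal sent during the gap is missed, and the thread may wait forever.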
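The read model above can be written as runnable C. This is a sketch of the retry idiom only, not glibc's actual code; read_retry and read_retry_demo are hypothetical names, and the pipe is just a convenient way to give read() something to return.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Calls read() again whenever it fails because a signal interrupted it. */
ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}

/* Usage: write 5 bytes into a pipe, then read them back through read_retry(). */
int read_retry_demo(void)
{
    int fds[2];
    char buf[8];
    int rc = pipe(fds);
    assert(rc == 0);
    (void)rc;

    if (write(fds[1], "hello", 5) != 5)
        return -1;
    /* The data stays in the pipe buffer until it is read, however often we retry. */
    ssize_t n = read_retry(fds[0], buf, sizeof buf);
    close(fds[0]);
    close(fds[1]);
    return (int)n;
}
```

The loop is safe because the bytes wait in the kernel buffer between attempts. A condition variable signal has no such buffer, which is why the same loop around wait alone is not enough.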

... when glibc calls any blocking function, like 'read', it does it in a loop, and if 'read' returns EINTR, calls 'read' again.

Can the same trick be used to conditions? No, because the moment we return from 'futex' call, another thread can send us notification. And since we're not waiting inside 'futex', we'll miss the notification. So, we need to return to the caller, and have it reevaluate the predicate. If another thread indeed set it to true, we'll break out of the loop.

Source: http://blog.vladimirprus.com/2005/07/spurious-wakeups.html

Now, how could scheduler recover, taking into account that during blackout it could miss some signals intended to notify waiting threads? If scheduler does nothing, mentioned "unlucky" threads will just hang, waiting forever - to avoid this, scheduler would simply send a signal to all the waiting threads.

Source: https://softwareengineering.stackexchange.com/questions/186842/spurious-wakeups-explanation-sounds-like-a-bug-that-just-isnt-worth-fixing-is

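5.2 Why a while loop is workable and necessary

In short, the whole point of wait and signal is that the programmer wants some condition to hold. Since spurious wakeups exist, we simply check that condition directly after wait returns. And because a spurious wakeup can happen more than once, we check it in a while loop.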
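As a minimal sketch of this idea, assume the condition is a single flag named ready (the names ready, set_ready, wait_ready, wait_calls, and predicate_demo are all hypothetical). The waiter trusts only the flag, never the wakeup itself, so an extra wakeup sends it back to wait, and a signal that was sent before the wait began is not lost because the flag already records it.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int ready;       /* the predicate; read and written only under lock */
static int wait_calls;  /* how many times pthread_cond_wait was entered */

/* Signaller side: change the predicate first, then notify. */
void set_ready(void)
{
    pthread_mutex_lock(&lock);
    ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Waiter side: loop until the predicate holds, whatever the reason for waking. */
void wait_ready(void)
{
    pthread_mutex_lock(&lock);
    while (!ready) {
        wait_calls++;
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}

static void *signaller(void *arg)
{
    (void)arg;
    set_ready();
    return NULL;
}

/* Returns how many waits were needed when the signal came before the wait. */
int predicate_demo(void)
{
    /* Case 1: the signal is sent first. The flag is already set, so no wait happens. */
    ready = 0;
    wait_calls = 0;
    set_ready();
    wait_ready();
    int waits_when_signalled_first = wait_calls;

    /* Case 2: another thread sets the flag at some unknown point in time. */
    pthread_mutex_lock(&lock);
    ready = 0;
    pthread_mutex_unlock(&lock);
    pthread_t t;
    int rc = pthread_create(&t, NULL, signaller, NULL);
    assert(rc == 0);
    (void)rc;
    wait_ready();
    pthread_join(t, NULL);
    return waits_when_signalled_first;
}
```

In case 2 the waiter may or may not enter pthread_cond_wait, depending on which thread runs first, but it returns correctly either way because the decision is always made by checking ready under the lock.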

Here is the example from 3.2 again (with if changed to while):

thread1: locks, dequeues right away, then unlocks. Pseudocode:
```
lock

dequeue

unlock
```