How long does it take to make a context switch?

FROM：　　　http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html

That's a interesting question I'm willing to spend some of my time on. Someone at StumbleUpon emitted the hypothesis that with all the improvements in the Nehalem architecture (marketed as Intel i7), context switching would be much faster. How would you devise a test to empirically find an answer to this question? How expensive are context switches anyway? (tl;dr answer: very expensive)

The lineup

April 21, 2011 update: I added an "extreme" Nehalem and a low-voltage Westmere.
April 1, 2013 update: Added an Intel Sandy Bridge E5-2620.
I've put 4 different generations of CPUs to test:

A dual Intel 5150 (Woodcrest, based on the old "Core" architecture, 2.67GHz). The 5150 is a dual-core, and so in total the machine has 4 cores available. Kernel: 2.6.28-19-server x86_64.
A dual Intel E5440 (Harpertown, based on the Penrynn architecture, 2.83GHz). The E5440 is a quad-core so the machine has a total of 8 cores. Kernel: 2.6.24-26-server x86_64.
A dual Intel E5520 (Gainestown, based on the Nehalem architecture, aka i7, 2.27GHz). The E5520 is a quad-core, and has HyperThreading enabled, so the machine has a total of 8 cores or 16 "hardware threads". Kernel: 2.6.28-18-generic x86_64.
A dual Intel X5550 (Gainestown, based on the Nehalem architecture, aka i7, 2.67GHz). The X5550 is a quad-core, and has HyperThreading enabled, so the machine has a total of 8 cores or 16 "hardware threads". Note: the X5550 is in the "server" product line. This CPU is 3x more expensive than the previous one. Kernel: 2.6.28-15-server x86_64.
A dual Intel L5630 (Gulftown, based on the Westmere architecture, aka i7, 2.13GHz). The L5630 is a quad-core, and has HyperThreading enabled, so the machine has a total of 8 cores or 16 "hardware threads". Note: the L5630 is a "low-voltage" CPU. At equal price, this CPU is in theory 16% less powerful than a non-low-voltage CPU. Kernel: 2.6.32-29-server x86_64.
A dual Intel E5-2620 (Sandy Bridge-EP, based on the Sandy Bridge architecture, aka E5, 2Ghz). The E5-2620 is a hexa-core, has HyperThreading, so the machine has a total of 12 cores, or 24 "hardware threads". Kernel: 3.4.24 x86_64.

As far as I can say, all CPUs are set to a constant clock rate (no Turbo Boost or anything fancy). All the Linux kernels are those built and distributed by Ubuntu.

First idea: with syscalls (fail)

My first idea was to make a cheap system call many times in a row, time how long it took, and compute the average time spent per syscall. The cheapest system call on Linux these days seems to be gettid. Turns out, this was a naive approach since system calls don't actually cause a full context switch anymore nowadays, the kernel can get away with a "mode switch" (go from user mode to kernel mode, then back to user mode). That's why when I ran my first test program, vmstat wouldn't show a noticeable increase in number of context switches. But this test is interesting too, although it's not what I wanted originally.

Source code: timesyscall.c Results:

Intel 5150: 105ns/syscall
Intel E5440: 87ns/syscall
Intel E5520: 58ns/syscall
Intel X5550: 52ns/syscall
Intel L5630: 58ns/syscall
Intel E5-2620: 67ns/syscall

Now that's nice, more expensive CPUs perform noticeably better (note however the slight increase in cost on Sandy Bridge). But that's not really what we wanted to know. So to test the cost of a context switch, we need to force the kernel to de-schedule the current process and schedule another one instead. And to benchmark the CPU, we need to get the kernel to do nothing but this in a tight loop. How would you do this?

Second idea: with `futex`

The way I did it was to abuse futex (RTFM). futex is the low level Linux-specific primitive used by most threading libraries to implement blocking operations such as waiting on a contended mutexes, semaphores that run out of permits, condition variables and friends. If you would like to know more, go read Futexes Are Tricky by Ulrich Drepper. Anyways, with a futex, it's easy to suspend and resume processes. What my test does is that it forks off a child process, and the parent and the child take turn waiting on the futex. When the parent waits, the child wakes it up and goes on to wait on the futex, until the parent wakes it and goes on to wait again. Some kind of a ping-pong "I wake you up, you wake me up...".

Source code: timectxsw.c Results:

Intel 5150: ~4300ns/context switch
Intel E5440: ~3600ns/context switch
Intel E5520: ~4500ns/context switch
Intel X5550: ~3000ns/context switch
Intel L5630: ~3000ns/context switch
Intel E5-2620: ~3000ns/context switch

Note: those results include the overhead of the futex system calls.

Now you must take those results with a grain of salt. The micro-benchmark does nothing but context switching. In practice context switching is expensive because it screws up the CPU caches (L1, L2, L3 if you have one, and the TLB – don't forget the TLB!).

CPU affinity

Things are harder to predict in an SMP environment, because the performance can vary wildly depending on whether a task is migrated from one core to another (especially if the migration is across physical CPUs). I ran the benchmarks again but this time I pinned the processes/threads on a single core (or "hardware thread"). The performance speedup is dramatic.

Source code: cpubench.sh Results:

Intel 5150: ~1900ns/process context switch, ~1700ns/thread context switch
Intel E5440: ~1300ns/process context switch, ~1100ns/thread context switch
Intel E5520: ~1400ns/process context switch, ~1300ns/thread context switch
Intel X5550: ~1300ns/process context switch, ~1100ns/thread context switch
Intel L5630: ~1600ns/process context switch, ~1400ns/thread context switch
Intel E5-2620: ~1600ns/process context switch, ~1300ns/thread context siwtch

Performance boost: 5150: 66%, E5440: 65-70%, E5520: 50-54%, X5550: 55%, L5630: 45%, E5-2620: 45%.

The performance gap between thread switches and process switches seems to increase with newer CPU generations (5150: 7-8%, E5440: 5-15%, E5520: 11-20%, X5550: 15%, L5630: 13%, E5-2620: 19%). Overall the penalty of switching from one task to another remains very high. Bear in mind that those artificial tests do absolutely zero computation, so they probably have 100% cache hit in L1d and L1i. In the real world, switching between two tasks (threads or processes) typically incurs significantly higher penalties due to cache pollution. But we'll get back to this later.

Threads vs. processes

After producing the numbers above, I quickly criticized Java applications, because it's fairly common to create shitloads of threads in Java, and the cost of context switching becomes high in such applications. Someone retorted that, yes, Java uses lots of threads but threads have become significantly faster and cheaper with the NPTL in Linux 2.6. They said that normally there's no need to do a TLB flush when switching between two threads of the same process. That's true, you can go check the source code of the Linux kernel (switch_mm in mmu_context.h):

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
                             struct task_struct *tsk)
{
       unsigned cpu = smp_processor_id();
 
       if (likely(prev != next)) {
               [...]
               load_cr3(next->pgd);
       } else {
               [don't typically reload cr3]
       }
}

In this code, the kernel expects to be switching between tasks that have different memory structures, in which cases it updates CR3, the register that holds a pointer to the page table. Writing to CR3 automatically causes a TLB flush on x86.

In practice though, with the default kernel scheduler and a busy server-type workload, it's fairly infrequent to go through the code path that skips the call to load_cr3. Plus, different threads tend to have different working sets, so even if you skip this step, you still end up polluting the L1/L2/L3/TLB caches. I re-ran the benchmark above with 2 threads instead of 2 processes (source: timetctxsw.c) but the results aren't significantly different (this varies a lot depending on scheduling and luck, but on average on many runs it's typically only 100ns faster to switch between threads if you don't set a custom CPU affinity).

Indirect costs in context switches: cache pollution

The results above are in line with a paper published a bunch of guys from University of Rochester: Quantifying The Cost of Context Switch. On an unspecified Intel Xeon (the paper was written in 2007, so the CPU was probably not too old), they end up with an average time of 3800ns. They use another method I thought of, which involves writing / reading 1 byte to / from a pipe to block / unblock a couple of processes. I thought that (ab)using futex would be better since futex is essentially exposing some scheduling interface to userland.

The paper goes on to explain the indirect costs involved in context switching, which are due to cache interference. Beyond a certain working set size (about half the size of the L2 cache in their benchmarks), the cost of context switching increases dramatically (by 2 orders of magnitude).

I think this is a more realistic expectation. Not sharing data between threads leads to optimal performance, but it also means that every thread has its own working set and that when a thread is migrated from one core to another (or worse, across physical CPUs), the cache pollution is going to be costly. Unfortunately, when an application has many more active threads than hardware threads, this is happening all the time. That's why not creating more active threads than there are hardware threads available is so important, because in this case it's easier for the Linux scheduler to keep re-scheduling the same threads on the core they last used ("weak affinity").

Having said that, these days, our CPUs have much larger caches, and can even have an L3 cache.

5150: L1i & L1d = 32K each, L2 = 4M
E5440: L1i & L1d = 32K each, L2 = 6M
E5520: L1i & L1d = 32K each, L2 = 256K/core, L3 = 8M (same for the X5550)
L5630: L1i & L1d = 32K each, L2 = 256K/core, L3 = 12M
E5-2620: L1i & L1d = 64K each, L2 = 256K/core, L3 = 15M

Note that in the case of the E5520/X5550/L5630 (the ones marketed as "i7") as well as the Sandy Bridge E5-2520, the L2 cache is tiny but there's one L2 cache per core (with HT enabled, this gives us 128K per hardware thread). The L3 cache is shared for all cores that are on each physical CPU.

Having more cores is great, but it also increases the chance that your task be rescheduled onto a different core. The cores have to "migrate" cache lines around, which is expensive. I recommend reading What Every Programmer Should Know About Main Memory by Ulrich Drepper (yes, him again!) to understand more about how this works and the performance penalties involved.

So how does the cost of context switching increase with the size of the working set? This time we'll use another micro-benchmark, timectxswws.c that takes in argument the number of pages to use as a working set. This benchmark is exactly the same as the one used earlier to test the cost of context switching between two processes except that now each process does a memset on the working set, which is shared across both processes. Before starting, the benchmark times how long it takes to write over all the pages in the working set size requested. This time is then discounted from the total time taken by the test. This attempts to estimate the overhead of overwriting pages across context switches.

Here are the results for the 5150:As we can see, the time needed to write a 4K page more than doubles once our working set is bigger than what we can fit in the L1d (32K). The time per context switch keeps going up and up as the working set size increases, but beyond a certain point the benchmark becomes dominated by memory accesses and is no longer actually testing the overhead of a context switch, it's simply testing the performance of the memory subsystem.

Same test, but this time with CPU affinity (both processes pinned on the same core):Oh wow, watch this! It's an order of magnitude faster when pinning both processes on the same core! Because the working set is shared, the working set fits entirely in the 4M L2 cache and cache lines simply need to be transfered from L2 to L1d, instead of being transfered from core to core (potentially across 2 physical CPUs, which is far more expensive than within the same CPU).

Now the results for the i7 processor:Note that this time I covered larger working set sizes, hence the log scale on the X axis.

So yes, context switching on i7 is faster, but only for so long. Real applications (especially Java applications) tend to have large working sets so typically pay the highest price when undergoing a context switch. Other observations about the Nehalem architecture used in the i7:

Going from L1 to L2 is almost unnoticeable. It takes about 130ns to write a page with a working set that fits in L1d (32K) and only 180ns when it fits in L2 (256K). In this respect, the L2 on Nehalem is more of a "L1.5", since its latency is simply not comparable to that of the L2 of previous CPU generations.
As soon as the working set increases beyond 1024K, the time needed to write a page jumps to 750ns. My theory here is that 1024K = 256 pages = half of the TLB of the core, which is shared by the two HyperThreads. Because now both HyperThreads are fighting for TLB entries, the CPU core is constantly doing page table lookups.

Speaking of TLB, the Nehalem has an interesting architecture. Each core has a 64 entry "L1d TLB" (there's no "L1i TLB") and a unified 512 entry "L2TLB". Both are dynamically allocated between both HyperThreads.

Virtualization

I was wondering how much overhead there is when using virtualization. I repeated the benchmarks for the dual E5440, once in a normal Linux install, once while running the same install inside VMware ESX Server. The result is that, on average, it's 2.5x to 3x more expensive to do a context switch when using virtualization. My guess is that this is due to the fact that the guest OS can't update the page table itself, so when it attempts to change it, the hypervisor intervenes, which causes an extra 2 context switches (one to get inside the hypervisor, one to get out, back to the guest OS).

This probably explains why Intel added the EPT (Extended Page Table) on the Nehalem, since it enables the guest OS to modify its own page table without help of the hypervisor, and the CPU is able to do the end-to-end memory address translation on its own, entirely in hardware (virtual address to "guest-physical" address to physical address).

Parting words

Context switching is expensive. My rule of thumb is that it'll cost you about 30µs of CPU overhead. This seems to be a good worst-case approximation. Applications that create too many threads that are constantly fighting for CPU time (such as Apache's HTTPd or many Java applications) can waste considerable amounts of CPU cycles just to switch back and forth between different threads. I think the sweet spot for optimal CPU use is to have the same number of worker threads as there are hardware threads, and write code in an asynchronous / non-blocking fashion. Asynchronous code tends to be CPU bound, because anything that would block is simply deferred to later, until the blocking operation completes. This means that threads in asynchronous / non-blocking applications are much more likely to use their full time quantum before the kernel scheduler preempts them. And if there's the same number of runnable threads as there are hardware threads, the kernel is very likely to reschedule threads on the same core, which significantly helps performance.

Another hidden cost that severely impacts server-type workloads is that after being switched out, even if your process becomes runnable, it'll have to wait in the kernel's run queue until a CPU core is available for it. Linux kernels are often compiled with HZ=100, which entails that processes are given time slices of 10ms. If your thread has been switched out but becomes runnable almost immediately, and there are 2 other threads before it in the run queue waiting for CPU time, your thread may have to wait up to 20ms in the worst scenario to get CPU time. So depending on the average length of the run queue (which is reflected in load average), and how long your threads typically run before getting switched out again, this can considerably impact performance.

It is illusory to imagine that NPTL or the Nehalem architecture made context switching cheaper in real-world server-type workloads. Default Linux kernels don't do a good job at keeping CPU affinity, even on idle machines. You must explore alternative schedulers or use taskset or cpuset to control affinity yourself. If you're running multiple different CPU-intensive applications on the same server, manually partitioning cores across applications can help you achieve very significant performance gains.

Posted by Benoit Sigoure at 11/14/2010 08:53:00 PM

[转帖]How long does it take to make a context switch?的更多相关文章

聊一下Python的线程 & GIL
再来聊一下Python的线程参考这篇文章 https://www.zhihu.com/question/23474039/answer/24695447 简单地说就是作为可能是仅有的支持多线程的解释 ...
nginx负载均衡基于ip_hash的session粘帖
nginx负载均衡基于ip_hash的session粘帖 nginx可以根据客户端IP进行负载均衡,在upstream里设置ip_hash,就可以针对同一个C类地址段中的客户端选择同一个后端服务器,除 ...
[转帖]网络协议封封封之Panabit配置文档
原帖地址:http://myhat.blog.51cto.com/391263/322378
[转帖]零投入用panabit享受万元流控设备——搭建篇
原帖地址:http://net.it168.com/a2009/0505/274/000000274918.shtml 你想合理高效的管理内网流量吗?你想针对各个非法网络应用与服务进行合理限制吗?你是 ...
3d数学总结帖
3d数学总结帖,以下是对3d学习过程中数学知识的简单总结角度值和弧度制的互转 Deg2Rad 角度A1转弧度A2 => A2=A1*PI/180 Rad2Deg 弧度A2转换角度A1 => ...
[转帖]The Lambda Calculus for Absolute Dummies (like myself)
Monday, May 7, 2012 The Lambda Calculus for Absolute Dummies (like myself) If there is one highly ...
[转帖]FPGA开发工具汇总
原帖:http://blog.chinaaet.com/yocan/p/5100017074 ----------------------------------------------------- ...
[Android分享] 【转帖】Android ListView的A-Z字母排序和过滤搜索功能
感谢eoe社区的分享最近看关于Android实现ListView的功能问题,一直都是小伙伴们关心探讨的Android开发问题之一,今天看到有关ListView实现A-Z字母排序和过滤搜索功能 ...
AxureRP7.0各类交互效果汇总帖（转）
了便于大家参考,我把这段时间发布分享的所有关于AxureRP7.0的原型做了整理. 以下资源均有对应的RP源文件可以下载. 当然 ,其中有部分是需要通过完成解密游戏[攻略]才能得到下载地址或者下载密码 ...

随机推荐

English--音标拼读
English|音标拼读音标拼读主要内容是,如何使用音标进行单词的拼读,并且会有相应的语音现象,最关键的还是自己多加练习,多听~ 前言目前所有的文章思想格式都是:知识+情感. 知识:对于所有的知识 ...
Ivanti的垃圾软件landesk
landesk是Ivanti公司推出的终端管理工具,这个工具垃圾就垃圾在无法卸载,进程杀不死.文件删不掉,奉劝大家千万不要安装这个软件.前些天公司的IT部门一直在催促员工安装这个软件,我一时糊涂安装了 ...
设计模式之（七）适配器模式（Adapter）
作为一个码农,天天都要面对电脑.知道电脑一直在不停的升级换代.电脑的很多零件接口也不断的变化.如果你曾经花巨资采购的一台电脑在使用一段时间后,发现硬盘空间不够使用,需要加一块硬盘,在加的时候才发现新硬 ...
服务刚启动就 Old GC，要闹哪样？
1.背景最近有个同学说他的服务刚启动就收到两次 Full GC 告警, 按道理来说刚启动,对象应该不会太多,为啥会触发 Full GC 呢? 带着疑问,我们还是先看看日志吧,毕竟日志的信息更多. 2 ...
Nginx配置文件 nginx.conf 和default.conf 讲解
nginx.conf /etc/nginx/nginx.conf ######Nginx配置文件nginx.conf中文详解##### #定义Nginx运行的用户和用户组 user www www; ...
EFK日志搭建
安装java 安装java1.8以上的版本并验证 [root@localhost ~]# yum install java [root@localhost ~]# java -version open ...
MessagePack详解
版权声明:分享是一种品质,开源是一种精神. https://blog.csdn.net/wangmx1993328/article/details/84477073 MessagePack Intro ...
块 /宏块(MB)/片(Slice/片组/图像(picture) 对应关系
根据包含关系从大到小顺序排列序列(GOP)-> 帧(I/IDR/P/B)-> 片组 -> 片(slice)-> 宏块(Block)-> 块(Macro Block ...
mysql常用操作（测试必备）
现在互联网的主流关系型数据库是mysql,掌握其基本的增.删.改.查是每一个测试人员必备的技能. sql语言分类 1.DDL语句(数据库定义语言): 数据库.表.视图.索引.存储过程,例如:CREAT ...
【洛谷P3835】【模板】可持久化平衡树
可持久化非旋转treap,真的是又好写又好调 ~ code: #include <cstdio> #include <cstdlib> #include <algorit ...

[转帖]How long does it take to make a context switch?