[Repost] KVM scalability and consolidation ratio: cache none vs cache writeback
Over the last ten years, full-virtualization technologies have gained much traction. While this has sometimes led to an excessive proliferation of virtual machines, the key concept is very appealing: as CPU performance and memory capacity relentlessly grow over time, why not use this ever-increasing power to consolidate multiple operating system instances on one single, powerful server?
If done correctly (i.e. without an unnecessary growth in the total number of OS instances), this consolidation process brings considerably lower operating costs, from both the electricity and the maintenance/administration standpoints.
However, in order to extract good performance from virtual machines, it is imperative to correctly size the virtualization host: the CPU, disk, memory and network subsystems should all be capable of sustaining the expected average workload, with some headroom for the inevitable usage peaks.
Usually, the most stressed component in a virtualized environment is the I/O subsystem, especially considering the very slow random read/write speed offered by mechanical disks. As covered in previous articles, KVM gives you the choice of enabling host OS caching on the image file or LVM volume backing a VM's virtual disk.
As recent Qemu versions honor write barrier requests and pass them down to the host stack, using a write-back strategy is a real option. Sure, it is not a silver bullet: there are some cases where a write-back cache can be a problem, for example in scenarios involving live migration (currently, libvirt/qemu advise you not to use a writeback cache during live migration, or data corruption may happen). However, in many cases it is an appropriate choice.
So, let's go straight to the point: how does the KVM cache setting affect VM performance, host resource usage and consolidation ratio?
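For reference, libvirt selects the cache mode per virtual disk through the "cache" attribute of the disk's driver element in the domain XML. The following is a minimal sketch, not the configuration used for the tests: the helper name disk_element and the volume path /dev/vg_guests/guest1 are hypothetical, and only the two modes compared in this article are accepted (libvirt itself knows more).

```python
import xml.etree.ElementTree as ET

# Only the two modes compared in this article; libvirt accepts other values too.
COMPARED_MODES = ("none", "writeback")


def disk_element(lv_path, target_dev, cache_mode):
    """Build a libvirt <disk> element for a block device such as an LVM volume."""
    if cache_mode not in COMPARED_MODES:
        raise ValueError("unsupported cache mode for this sketch: %r" % cache_mode)
    disk = ET.Element("disk", {"type": "block", "device": "disk"})
    # The cache mode lives on the <driver> element of the disk.
    ET.SubElement(disk, "driver",
                  {"name": "qemu", "type": "raw", "cache": cache_mode})
    ET.SubElement(disk, "source", {"dev": lv_path})
    ET.SubElement(disk, "target", {"dev": target_dev, "bus": "virtio"})
    return disk


if __name__ == "__main__":
    # Hypothetical volume path, used for illustration only.
    for mode in COMPARED_MODES:
        element = disk_element("/dev/vg_guests/guest1", "vda", mode)
        print(ET.tostring(element, encoding="unicode"))
```

The two printed fragments differ only in the cache attribute, which is the single setting being compared in the rest of the article.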
What we are looking for, and when using a cache is not appropriate
In this article we are going to evaluate whether, and by how much, different caching settings influence KVM performance, to the point where the consolidation ratio is affected. To answer this question, we will collect performance data from both the guests and the host machine.
In a previous article, I explained why using a write-back cache is quite safe now. Basically, Qemu/KVM honors any flush operation issued by the guest, so if a guest writes sensitive data and issues a flush, it can be certain that the data has hit the physical disk platters.
However, let me state very clearly that in some circumstances you should not use a write-back cache.
The three most common reasons not to use a writeback cache are:
- one or more guests don't support write barriers (which are used by the host to decide when to flush its cache)
- you need to live-migrate VMs between multiple hosts (currently libvirt warns you not to use live migration together with caching, or data corruption may happen)
- your workload is so cache-unfriendly that the nocache option is the better-performing configuration
Point n.1 can be verified simply by looking at your guests: in a modern operating system write barriers are surely supported, if not already enabled by default. For example, Win2000 and later automatically issue a cache flush operation each second, while EXT3-based Linux distributions often need barriers to be explicitly enabled using the “barrier=1” mount option.
Regarding live migration, it is a matter of thinking about your requirements; in most environments, it is not used.
Point n.3 can be verified only after extensive testing; however, caching is often beneficial to a wide range of workloads, so it is generally safe to assume that it will increase performance rather than decrease it.
I want to stress that I am not advocating always using caching. As stated above, there are reasonable use cases where you should not use the OS cache. However, if caching brings consistent and noticeable performance improvements in the general case, it may be worth using.
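Point n.1 can be checked from inside a Linux guest by looking at the mount options of its ext3/ext4 filesystems. The sketch below assumes /proc/mounts-style input (device, mount point, filesystem type, options) and only classifies the explicit barrier options; when none is present it reports "default", because the effective behavior then depends on the guest kernel and filesystem. The function names are my own.

```python
def barrier_state(mount_options):
    """Classify a comma-separated mount option string.

    Returns "enabled", "disabled", or "default" when no barrier option
    is given (the kernel/filesystem default then applies).
    """
    state = "default"
    for option in mount_options.split(","):
        if option in ("barrier", "barrier=1"):
            state = "enabled"
        elif option in ("nobarrier", "barrier=0"):
            state = "disabled"
    return state


def ext_mounts(mounts_text):
    """Return (mount_point, fstype, barrier_state) for each ext3/ext4 mount."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] in ("ext3", "ext4"):
            result.append((fields[1], fields[2], barrier_state(fields[3])))
    return result


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as mounts:
            for mount_point, fstype, state in ext_mounts(mounts.read()):
                print("%-20s %-5s barriers: %s" % (mount_point, fstype, state))
    except OSError:
        print("/proc/mounts is not available on this system")
```

A "default" result is not a verdict either way: it only means the guest's own documentation has to be consulted for that kernel and filesystem.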
A small detour: buffering vs caching
While this page is not strictly needed to understand the article, I feel it is important for clarifying the terminology.
As you probably know, modern operating systems tend to cache data aggressively, using as much unused memory as they have at their disposal.
Indeed, issuing the “free” command in a Linux terminal shows something similar to this:
             total       used       free     shared    buffers     cached
Mem:       7801072     210632    7590440          0       2296      23004
-/+ buffers/cache:     185332    7615740
Swap:      8388600          0    8388600
Notice the “buffers” and “cached” columns: what do they mean?
First we should note that, in Linux, every time you write to a file you are using the pagecache, while when you write to a block device (e.g. a logical volume), bypassing the filesystem, you are using buffering.
Unsurprisingly, the “cached” column represents the filesystem blocks cached for later reuse (it is the so-called “pagecache”). When you read something from a file, its content ends up not only in your application, but also in the pagecache. When you write something, your writes generally land first in the pagecache and hit the disks only after some seconds.
How about the “buffers” column? Buffers are closely related to caching, but serve a somewhat different purpose: while a cache explicitly retains data until it is stale or forcibly flushed, a buffer retains data only for the smallest amount of time needed to efficiently transfer it to/from the backing device. In other words, buffers are a necessity due to hardware constraints: for example, as small disk transfers are quite inefficient, a buffer accumulates smaller writes until the backing block device is closed. At this point, it flushes the data to the backing device and releases the memory allocated for buffering. By contrast, a cache accumulates writes up to a much higher threshold, basically ignoring the close syscall, and, in order to improve future reads, it keeps a local copy of the written data even after it has been flushed to the backing device.
As my KVM setup is using LVM-based virtual disks, you may wonder why, in the rest of the article, I speak about “cache” and not “buffer”. The point is that buffers can be effectively used as a long-term cache. Remember what I wrote above? Buffers are flushed when the underlying device is closed via a close() syscall. This means that if we don't close the device, the buffers remain in place and data can be written to and read from them directly, rather than from the device. This is precisely what happens with Qemu/KVM on top of an LVM-based disk: while the qemu process is running, the buffers retain both read and written blocks, acting as a true cache. It is worth noting that a simple VM reboot will not drain the buffers, as the qemu process is still running. In order to completely discard the accumulated buffers, you have to shut down the VM (or kill the qemu process running it). This is the only significant difference between a buffer-backed LVM virtual disk and a real pagecache-backed file-based virtual disk: in the latter case, a shutdown will not drain the cache, so a subsequent VM start benefits from the old, still valid, data in cache.
In short: while I am using LVM-based virtual disks that are not strictly pagecache-backed, the current Linux buffers implementation is, in this case, very similar to a classical cache. But if they are the same, why spend an entire page arguing about the terminology? The answer is simple: historically, the pagecache was somewhat more CPU-hungry than buffering. The difference is very small, to the point that with a modern CPU it is negligible, but I am quite picky when describing benchmark results :)
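As a small worked example, the “-/+ buffers/cache” line of the “free” output shown earlier can be recomputed from the “Mem:” line by treating buffers and cache as reclaimable memory. The function name below is my own; the numbers are the ones from that output.

```python
def buffers_cache_line(used_kb, free_kb, buffers_kb, cached_kb):
    """Recompute the "-/+ buffers/cache" line of the classic free output.

    Buffers and cache are counted as reclaimable: they are subtracted
    from "used" and added to "free".
    """
    reclaimable_kb = buffers_kb + cached_kb
    return used_kb - reclaimable_kb, free_kb + reclaimable_kb


if __name__ == "__main__":
    # Values taken from the "Mem:" line of the free output in this article.
    used, free = buffers_cache_line(used_kb=210632, free_kb=7590440,
                                    buffers_kb=2296, cached_kb=23004)
    print("-/+ buffers/cache: %10d %10d" % (used, free))
```

With the values above this gives 185332 KB used and 7615740 KB free, matching the second line printed by “free”.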
Testbed and methods
Benchmarks were performed on a system equipped with:
- PhenomII 940 CPU (4 cores @ 3.0 GHz, 1.8 GHz Northbridge and 6 MB L3 cache)
- 8 GB DDR2-800 DRAM (in unganged mode)
- Asus M4A78 Pro motherboard (AMD 780G + SB700 chipset)
- 4x 500 GB hard disks (1x WD Green, 3x Seagate Barracuda) in AHCI mode, configured in software RAID10 "near" layout
- CentOS 6.5 x64 operating system
The operating system was installed with the “basic server” profile, and then I selectively installed the other software required (libvirtd, qemu, etc). Key system software packages were:
- kernel-2.6.32-431.1.2.0.1.el6.x86_64
- qemu-kvm-0.12.1.2-2.415.el6_5.3.x86_64
- libvirt-0.10.2-29.el6_5.2.x86_64
To test KVM in a true multi-guest environment, I created a basic “tile” of four guests, each with VirtIO drivers in place (for both disk and network devices). A tile is composed of:
- two (#1 and #2) Windows Server 2012 R2 64 bit guests, each with 1 GB RAM and 32 GB disk
- two CentOS 6.5 x86_64 guests (#3 and #4), each with 512 MB RAM and 8 GB disk.
All virtual machines use LVM-based disks, carved out of a dedicated volume group.
Inside the tile, each VM has the following role:
- the two Windows Server 2012 R2 64 bit guests (#1 and #2) act as fileservers. The first Win2012 guest copies, via SMB/CIFS, a ~670 MB directory (with over 44500 files) to the second Win2012 guest. After 30 seconds of idling, it copies the directory back from the peer Win2012 machine.
- the first CentOS 6.5 x86_64 guest (#3) acts as a dynamic web and email server (using apache, mysql and postfix). This machine serves a Joomla 3.2.5 site. At the same time, it runs a postgresql benchmark (sysbench) against the fourth guest, issuing a total of 10,000 transactions with a concurrency level of 4.
- the second CentOS 6.5 x86_64 guest (#4) acts as a pure database server (using postgresql). This guest is benchmarked by the previous CentOS VM, and at the same time it runs AB (apache benchmark) against that VM, issuing 2,000 requests with a concurrency level of 4. Moreover, it runs a shell script generating 100 batches of 5 emails, each of ~56 KB, with a one-second wait between batches. Finally, it sends 100 of those emails in a single, big burst. In total, it moves about 34 MB of data.
In short, those VMs benchmark each other. This not only ensures that the test is self-contained, without externally-induced variables, but also stresses the internal virtual network switch created by Qemu.
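The “about 34 MB” figure for the email test follows from the numbers given above: 100 batches of 5 emails plus a final burst of 100, each of roughly 56 KB. A quick check, assuming decimal units (1 MB = 1000 KB) and treating the ~56 KB size as exact:

```python
def email_volume(batches=100, emails_per_batch=5, burst=100, email_kb=56):
    """Return (number of emails, total volume in KB) for the email test."""
    emails = batches * emails_per_batch + burst
    return emails, emails * email_kb


if __name__ == "__main__":
    emails, total_kb = email_volume()
    print("%d emails, %.1f MB" % (emails, total_kb / 1000.0))
```

That is 600 emails and 33,600 KB, i.e. 33.6 MB, which rounds to the quoted value.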
I fully understand that I am testing very specific scenarios, so let me know what you, the reader, think about that. Do you want a more web-specific test case? Is your focus on database performance? Or is fileserver speed all that matters to you?
Let me know your ideas!
Total benchmark run time
This article focuses on how well the host machine manages an ever-increasing number of virtual machines. In order to present realistic results, I ran the benchmark using 1, 2 or 3 tiles (4, 8 or 12 VMs).
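As a sizing aid, the number of guests and the RAM assigned to them for each tile count follow directly from the tile definition (two 1 GB Windows guests plus two 512 MB CentOS guests). The sketch below counts only the RAM assigned to guests, not any per-process overhead of qemu itself; the names are my own.

```python
# RAM assigned to each guest of one tile, in MB (from the tile definition).
TILE_GUESTS_MB = (1024, 1024, 512, 512)
# Physical RAM of the test host, in MB.
HOST_RAM_MB = 8 * 1024


def tile_budget(tiles):
    """Return (number of VMs, total RAM assigned to guests in MB)."""
    return tiles * len(TILE_GUESTS_MB), tiles * sum(TILE_GUESTS_MB)


if __name__ == "__main__":
    for tiles in (1, 2, 3):
        vms, ram_mb = tile_budget(tiles)
        note = "exceeds host RAM" if ram_mb > HOST_RAM_MB else "fits in host RAM"
        print("%d tile(s): %2d VMs, %5d MB assigned (%s)"
              % (tiles, vms, ram_mb, note))
```

Each tile assigns 3 GB, so the 3-tile run assigns 9 GB of guest RAM on an 8 GB host.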
The first graph depicts total wall-clock run time, i.e. how long a complete benchmark run takes:
This first result is eloquent: enabling the write-back cache translates into much lower execution time, to the point that a 3-tile setup (12 virtual machines) performs better than a 2-tile setup (8 virtual machines) without caching.
But where does the wb-enabled case gain the most?
As you can see, it is the filecopy benchmark that speeds up the most. This was expected: apache benchmark is basically CPU-bound, while sysbench's complex test is fsync-write bound, a situation where a writeback cache is of little help. Still, the increased filecopy speed is a very nice bonus.
Did you notice how emails seem to take basically no time? It depends on how the SMTP protocol works: even when overloaded by other activities, postfix tries hard to queue all incoming emails for later delivery. This delayed delivery phase is not directly timed, but it is another source of fsync-heavy writes.
Host scaling: CPU
How does the host respond to the ever-increasing load? Let's start with CPU scaling:
At first, it seems that the write-back cache takes a noticeable toll on CPU performance: even accounting for the increased speed, total CPU load is quite high.
However, a deeper analysis shows that the increased CPU load is really due to increased WAIT time, which is the time the CPU spends waiting for the I/O subsystem to catch up.
Hey, wait a moment (no pun intended): why does enabling the wb cache actually result in increased disk WAIT time? The fact is that the writeback cache enables much more parallelism in the I/O stack, resulting in the disk working much harder, and more threads are concurrently marked as “executable” by the scheduler.
While the CPU is waiting for I/O to complete, the process requesting the I/O operation is blocked, but the CPU is free to execute another thread. This means that the real CPU load should be obtained by summing the USER and SYS loads, and in this case we see only a very mild increase in CPU load between the nocache and wb-cache scenarios (perfectly justified by the increased performance).
In short: enabling the write-back cache is a non-issue from a CPU standpoint, at least when using buffered I/O (as is the case with direct access to LVM-based disks).
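The USER + SYS reasoning above can be reproduced on any Linux host from two samples of the aggregate “cpu” line of /proc/stat. This is a minimal sketch, not the tool used for the charts: the grouping of counters (nice time counted as USER, irq and softirq time counted as SYS) is my own assumption, as are the function names.

```python
import time

# First eight counters of the aggregate "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate "cpu" line of /proc/stat into tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu" or len(parts) < 1 + len(FIELDS):
        raise ValueError("not an aggregate /proc/stat cpu line: %r" % line)
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def load_breakdown(before, after):
    """Return USER, SYS, WAIT and real (USER + SYS) load as percentages."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    # Assumption: nice time counts as USER, irq/softirq time counts as SYS.
    user = 100.0 * (delta["user"] + delta["nice"]) / total
    system = 100.0 * (delta["system"] + delta["irq"] + delta["softirq"]) / total
    wait = 100.0 * delta["iowait"] / total
    # WAIT is reported separately and deliberately left out of "real".
    return {"user": user, "sys": system, "wait": wait, "real": user + system}


def sample(path="/proc/stat"):
    """Read the aggregate cpu line, which is the first line of /proc/stat."""
    with open(path) as stat:
        return parse_cpu_line(stat.readline())


if __name__ == "__main__":
    first = sample()
    time.sleep(1)
    print(load_breakdown(first, sample()))
```

A host with a high "wait" value but a low "real" value is in the situation described above: the disks are busy, while the CPU still has spare capacity.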
Host scalability: disk load
We previously stated that enabling the write-back cache leads to increased disk performance. The following chart supports that claim:
The wb-enabled tiles have similar disk utilization to the no-cache ones, but they provide superior speed: if we normalize for performance, the write-back cache provides much higher efficiency.
Let's spend a few words on the increased average access time (await). What happens here? The answer is simple: as the write-back cache enables more I/O threads to be concurrently active, total throughput is higher, but at the same time the average access time of a single request grows.
Don't let await scare you: when the cache is disabled, the running threads enjoy lower access time, but you have a lower total number of active I/O threads. If your application depends on multiple, concurrent I/O write operations, it will be condemned to executing many of them serially, leading to the perception of a slower system. Enabling the wb-cache gives your application a real chance to execute multiple concurrent I/O writes, and the host system can even coalesce some of them.
So, while in some specific, controlled, low-latency workloads the nocache configuration can be the better choice, generally the write-back cache is the preferred one.
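The trade-off between await and throughput can be illustrated with Little's law, which ties the average number of requests in flight to throughput times average latency. The numbers below are made up for illustration and are not measurements from these tests; the function name is my own.

```python
def requests_per_second(outstanding_requests, await_ms):
    """Little's law: throughput = requests in flight / average latency."""
    return outstanding_requests * 1000.0 / await_ms


if __name__ == "__main__":
    # Hypothetical figures: one request in flight with a short await versus
    # eight requests in flight with a four times longer await.
    for label, in_flight, await_ms in (("serialized I/O", 1, 10.0),
                                       ("concurrent I/O", 8, 40.0)):
        print("%-15s await %5.1f ms -> %6.1f requests/s"
              % (label, await_ms, requests_per_second(in_flight, await_ms)))
```

In this made-up case each request waits four times longer, yet the system as a whole completes twice as many requests per second, which is the effect described above.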
Detailed disk perf data:
The wb-enabled cases show much higher read and write speeds, indeed.
Host scalability: memory utilization
Does using some RAM for caching lower total free memory? Let's check:
This chart really needs an explanation. The “MEM (w/cache)” line shows total system memory usage, while the “MEM (w/out cache)” line shows real memory utilization. The key concept here is that for Linux (and other modern OSes behave the same way) the memory used for caching and/or buffering is not really marked as “used memory”, as it can readily be freed at any time.
So, real memory utilization is depicted by the red line. Looking at this line, you can see that the wb-cache configuration is no more memory-hungry than the nocache one.
In other words: Linux only uses otherwise unused memory for caching and, when an application requires more memory, it immediately releases cache to give the application what it requested.
The careful reader should have a question now: how is it possible for an 8 GB host machine to happily run as many as 12 virtual machines, for a total estimated guest memory usage of over 9 GB, without heavy swapping? The answer lies in a very useful Linux feature called KSM (kernel samepage merging). KSM enables the host system to periodically check for duplicate memory chunks, and to deduplicate them when found. In short, if two memory locations have the same content, KSM marks the first location as a CoW (copy-on-write) one and frees the second location. If an application wants to modify the shared location, the system first re-duplicates it and then modifies the newly-created copy.
In practice KSM works surprisingly well, especially for short-lived Windows machines: as Windows has the habit of zeroing all free memory at startup, KSM has plenty of opportunities to coalesce these zeroed pages.
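On a host with KSM enabled, the kernel exposes its counters under /sys/kernel/mm/ksm, and the pages_sharing counter gives a rough idea of how much memory deduplication is saving. The sketch below is only an estimate under that assumption (pages_sharing multiplied by the page size); the function names are my own, and on hosts without KSM it simply reports that the counters are unavailable.

```python
import os

KSM_DIR = "/sys/kernel/mm/ksm"


def ksm_saved_bytes(pages_sharing, page_size):
    """Rough estimate of memory saved by KSM: deduplicated pages * page size."""
    return pages_sharing * page_size


def read_ksm_counter(name, ksm_dir=KSM_DIR):
    """Read one integer counter from the KSM sysfs directory, or return None."""
    try:
        with open(os.path.join(ksm_dir, name)) as counter:
            return int(counter.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    pages_sharing = read_ksm_counter("pages_sharing")
    if pages_sharing is None:
        print("KSM counters are not available on this host")
    else:
        saved = ksm_saved_bytes(pages_sharing, os.sysconf("SC_PAGE_SIZE"))
        print("KSM is saving roughly %.1f MB" % (saved / (1024.0 * 1024.0)))
```

Running it before and after starting a batch of freshly booted guests shows how much of the gap between assigned guest RAM and physical RAM is being absorbed by page merging.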
Conclusions
It is clear that enabling caching/buffering is very beneficial to the specific workload tested in this article. The cache led to much more efficient use of the disks and, as the disk subsystem is often the weak link of a server, this means a higher potential consolidation ratio.
The story doesn't end here, obviously: sometimes RAM capacity plays an even bigger role in defining the maximum consolidation ratio. And Linux is very well equipped in this area, thanks to KSM.