Understanding Virtual Memory
by Norm Murray and Neil Horman
- Introduction
- Definitions
- The Life of a Page
- Tuning the VM
- Example Scenarios
- Further Reading
- About the Author
Introduction
One of the most important aspects of an operating system is the Virtual Memory Management system. Virtual Memory (VM) allows an operating system to perform many of its advanced functions, such as process isolation, file caching, and swapping. As such, it
is imperative that an administrator understand the functions and tunable parameters of an operating system's Virtual Memory Manager so that optimal performance for a given workload may be achieved. After reading this article, the reader should have a rudimentary
understanding of the data the Red Hat Enterprise Linux (RHEL3) VM controls and the algorithms it uses. Further, the reader should have a fairly good understanding of general Linux VM tuning techniques. It is important to note that Linux as an operating system
has a proud legacy of overhaul. Items which no longer serve useful purposes or which have better implementations as technology advances are phased out. This implies that the tuning parameters described in this article may be out of date if you are using a
newer or older kernel. Fear not however! With a well grounded understanding of the general mechanics of a VM, it is fairly easy to convert knowledge of VM tuning to another VM. The same general principles apply, and documentation for a given kernel (including
its specific tunable parameters) can be found in the corresponding kernel source tree in the file Documentation/sysctl/vm.txt.
Definitions
To properly understand how a Virtual Memory Manager does its job, it helps to understand what components comprise a VM. While the low-level view of a VM is overwhelming for most, a high-level view is necessary to understand how a VM works and how it can
be optimized for workloads.
What Comprises a VM
The inner workings of the Linux virtual memory subsystem are quite complex, but it can be defined at a high level with the following components:
MMU
The Memory Management Unit (MMU) is the hardware base that makes a VM system possible. The MMU allows software to reference physical memory by aliased addresses, quite often more than one. It accomplishes this through the use of pages and page tables. The
MMU uses a section of memory to translate virtual addresses into physical addresses via a series of table lookups.
Zoned Buddy Allocator
The Zoned Buddy Allocator is responsible for the management of page allocations to the entire system. This code manages lists of physically contiguous pages and maps them into the MMU page tables, so as to provide other kernel subsystems with valid physical
address ranges when the kernel requests them (Physical to Virtual Address mapping is handled by a higher layer of the VM). The name Buddy Allocator is derived from the algorithm this subsystem uses to maintain its free page lists. All physical pages in RAM
are cataloged by the Buddy Allocator and grouped into lists. Each list represents clusters of 2^n pages, where n is incremented in each list. If no entries exist on the requested list, an entry from the next list up is broken into two separate clusters; one
is returned to the caller while the other is added to the next list down. When an allocation is returned to the buddy allocator, the reverse process happens. Note that the Buddy Allocator also manages memory zones, which define pools of memory which have different
purposes. Currently there are three memory pools which the Buddy Allocator manages accesses for:
DMA — This zone consists of the first 16 MB of RAM, from which legacy devices allocate to perform direct memory operations.
NORMAL — This zone encompasses memory addresses from 16 MB to 1 GB and is used by the kernel for internal data structures as well as other system and user space allocations.
HIGHMEM — This zone includes all memory above 1 GB and is used exclusively for system allocations (file system buffers, user space allocations, etc).
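The splitting and merging described above can be sketched in a few lines of Python. This is a toy model, not the kernel's code: the class name, the use of page numbers instead of real page structures, and the choice to always hand out the lowest free cluster are assumptions made for illustration, and zones are left out entirely.

```python
class BuddyAllocator:
    """Toy buddy allocator over page numbers (illustrative only)."""

    def __init__(self, max_order):
        self.max_order = max_order
        # free_lists[n] holds the first page number of each free 2**n-page cluster.
        self.free_lists = [set() for _ in range(max_order + 1)]
        self.free_lists[max_order].add(0)  # start with one maximal cluster

    def alloc(self, order):
        """Return the first page of a free 2**order cluster, or None."""
        n = order
        while n <= self.max_order and not self.free_lists[n]:
            n += 1  # nothing on this list, look on the next list up
        if n > self.max_order:
            return None
        start = min(self.free_lists[n])
        self.free_lists[n].remove(start)
        while n > order:
            n -= 1
            # Split the cluster: keep the lower half, and put the
            # upper half (its buddy) on the next list down.
            self.free_lists[n].add(start + (1 << n))
        return start

    def free(self, start, order):
        """Return a cluster and merge it with its buddy while possible."""
        while order < self.max_order:
            buddy = start ^ (1 << order)
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free_lists[order].add(start)
```

Allocating a single page from an 8-page allocator (`BuddyAllocator(3).alloc(0)`) splits the one large cluster three times, leaving one free cluster on each smaller list; freeing everything merges the buddies back into the original cluster.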
Slab Allocator
The Slab Allocator provides a more usable front end to the Buddy Allocator for those sections of the kernel which require memory in sizes that are more flexible than the standard 4 KB page. The Slab Allocator allows other kernel components to create caches
of memory objects of a given size. The Slab Allocator is responsible for placing as many of the cache's objects on a page as possible and monitoring which objects are free and which are allocated. When allocations are requested and no more are available, the
Slab Allocator requests more pages from the Buddy Allocator to satisfy the request. This allows kernel components to use memory in a much simpler way. This way components which make use of many small portions of memory are not required to individually implement
memory management code so that too many pages are not wasted. The Slab Allocator may only allocate from the DMA and NORMAL zones.
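As a rough illustration of the idea, the following Python sketch models a cache of fixed-size objects that asks for one more page whenever it runs out of free objects. The class and method names are hypothetical, pages are represented only by a counter, and the 4 KB page size is taken from the text; the real Slab Allocator is considerably more involved.

```python
class SlabCache:
    """Toy cache of fixed-size objects packed onto pages (illustrative only)."""

    def __init__(self, object_size, page_size=4096):
        if not 0 < object_size <= page_size:
            raise ValueError("object must fit on a single page")
        # Place as many objects on each page as will fit.
        self.per_page = page_size // object_size
        self.pages = 0   # pages obtained so far (stand-in for Buddy Allocator requests)
        self.free = []   # (page, slot) pairs that are not currently handed out

    def alloc(self):
        if not self.free:
            # No free objects left: obtain one more page and carve it up.
            page = self.pages
            self.pages += 1
            self.free.extend((page, slot) for slot in range(self.per_page))
        return self.free.pop()

    def release(self, obj):
        # The object becomes available again; the page stays with the cache.
        self.free.append(obj)
```

With 1 KB objects, four allocations fit on the first page and the fifth triggers a request for a second page.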
Kernel Threads
The last component in the VM subsystem is the set of kernel threads: kscand, kswapd, kupdated, and bdflush. These tasks are responsible for the recovery and management of in-use memory. All pages of memory have an associated state (for more information on the memory state machine, refer to the section called “The Life of a Page”). In general, the active tasks in the kernel related to VM usage are responsible for attempting to move pages out of RAM. Periodically they examine RAM, trying to identify and free inactive memory so that it can be put to other uses in the system.
The Life of a Page
All of the memory managed by the VM is labeled by a state. These states let the VM know what to do with a given page under various circumstances. Depending on the current needs of the system, the VM may transfer pages from one state to the next, according
to the state machine in Figure 2. “VM Page State Machine”. Using these states, the VM can determine what is being done with a page by the system at a given time and what actions the VM may take on the page. The states that have particular meanings are as follows:
FREE — All pages available for allocation begin in this state. This indicates to the VM that the page is not being used for any purpose and is available for allocation.
ACTIVE — Pages which have been allocated from the Buddy Allocator enter this state. It indicates to the VM that the page has been allocated and is actively in use by the kernel or a user process.
INACTIVE DIRTY — This state indicates that the page has fallen into disuse by the entity which allocated it and thus is a candidate for removal from main memory. The kscand task periodically sweeps through all the pages in memory, taking note of the amount of time the page has been in memory since it was last accessed. If kscand finds that a page has been accessed since it last visited the page, it increments the page's age counter; otherwise, it decrements that counter. If kscand finds a page with its age counter at zero, it moves the page to the inactive dirty state. Pages in the inactive dirty state are kept in a list of pages to be laundered.
INACTIVE LAUNDERED — This is an interim state which pages selected for removal from main memory enter while their contents are being moved to disk. Only pages which were in the inactive dirty state can enter this state. When the disk I/O operation is complete, the page is moved to the inactive clean state, where it may be deallocated or overwritten for another purpose. If, during the disk operation, the page is accessed, the page is moved back into the active state.
INACTIVE CLEAN — Pages in this state have been laundered. This means that the contents of the page are in sync with the backed up data on disk. Thus, they may be deallocated by the VM or overwritten for other purposes.
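The transitions described above can be modeled as a small state machine. The Python sketch below is only an illustration under stated assumptions: the function names, the starting age of 2, and the choice to check for age zero in the same scan that decrements the counter are all hypothetical and are not taken from the kernel.

```python
from enum import Enum


class PageState(Enum):
    FREE = "free"
    ACTIVE = "active"
    INACTIVE_DIRTY = "inactive dirty"
    INACTIVE_LAUNDERED = "inactive laundered"
    INACTIVE_CLEAN = "inactive clean"


class Page:
    def __init__(self, age=2):
        self.state = PageState.ACTIVE  # a freshly allocated page
        self.age = age                 # hypothetical starting age
        self.accessed = False


def scan(page):
    """One kscand-style visit to an active page."""
    if page.state is not PageState.ACTIVE:
        return
    if page.accessed:
        page.age += 1
    else:
        page.age = max(0, page.age - 1)
    page.accessed = False  # remember that this access has been seen
    if page.age == 0:
        page.state = PageState.INACTIVE_DIRTY


def start_laundering(page):
    """Begin writing an inactive dirty page out to disk."""
    if page.state is PageState.INACTIVE_DIRTY:
        page.state = PageState.INACTIVE_LAUNDERED


def finish_laundering(page):
    """Disk I/O finished: the page is now in sync with disk."""
    if page.state is PageState.INACTIVE_LAUNDERED:
        page.state = PageState.INACTIVE_CLEAN


def touch(page):
    """An access; during laundering it moves the page back to active."""
    page.accessed = True
    if page.state is PageState.INACTIVE_LAUNDERED:
        page.state = PageState.ACTIVE
```

A page that is never touched ages down to zero after two scans in this model, is laundered, and ends up inactive clean; a page touched while it is being laundered returns to the active state.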
Tuning the VM
Now that the picture of the VM mechanism is sufficiently illustrated, how is it adjusted to fit certain workloads? There are two methods for changing tunable parameters in the Linux VM. The first is the sysctl interface. The sysctl interface is a programming
oriented interface, which allows software programs to modify various tunable parameters directly. It is exported to system administrators via the sysctl utility, which allows an administrator to specify a value for any of the tunable VM parameters via the
command line. For example:
sysctl -w vm.max_map_count=65535
The sysctl utility also supports the use of a configuration file (/etc/sysctl.conf), in which all the desirable changes to a VM can be recorded for a system and restored after a restart of the operating system, making this access
method suitable for long term changes to a system VM. The file is straightforward in its layout, using simple key-value pairs with comments for clarity. For example:
#Adjust the min and max read-ahead for files
vm.max-readahead=64
vm.min-readahead=32
#turn on memory over-commit
vm.overcommit_memory=2
#bump up the percentage of memory in use to activate bdflush
vm.bdflush="40 500 0 0 500 3000 60 20 0"
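To make the key-value layout concrete, here is a minimal Python sketch that reads text in this format into a dictionary. The function name is hypothetical, and the handling is deliberately simplified (comment lines starting with `#` or `;`, one `key=value` pair per line, surrounding double quotes stripped); it is not a reimplementation of the sysctl utility.

```python
def parse_sysctl_conf(text):
    """Parse sysctl.conf-style text into a {key: value} dictionary."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        settings[key.strip()] = value.strip().strip('"')
    return settings


example = """
#Adjust the min and max read-ahead for files
vm.max-readahead=64
vm.min-readahead=32
#turn on memory over-commit
vm.overcommit_memory=2
#bump up the percentage of memory in use to activate bdflush
vm.bdflush="40 500 0 0 500 3000 60 20 0"
"""
settings = parse_sysctl_conf(example)
```

All values come back as strings, so a multi-value tunable such as `vm.bdflush` can be split on whitespace by the caller.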
The second method of modifying VM tunable parameters is via the proc file system. This method exports every group of VM tunables as a virtual file, accessible via all the common Linux utilities used for modifying file contents. The VM tunables are available
in the directory /proc/sys/vm/ and are most commonly read and modified using the cat and echo commands. For example, use the following command to view the current value of the kswapd tunable:
cat /proc/sys/vm/kswapd
The output should be similar to:
512 32 8
Then, use the following command to modify the value of the tunable:
echo 511 31 7 > /proc/sys/vm/kswapd
Use the cat /proc/sys/vm/kswapd command again to verify that the value was modified. The output should be:
511 31 7
The proc file system interface is a convenient method for making adjustments to the VM while attempting to isolate the peak performance of a system. For convenience, the following sections list the VM tunable parameters as the filenames they are exported
to in the /proc/sys/vm/ directory. Unless otherwise noted, these tunables apply to the RHEL3 2.4.21-4 kernel.
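The same read-modify-verify cycle can be scripted. The Python sketch below mirrors the cat and echo commands with ordinary file I/O; the helper names and the `base` parameter are assumptions introduced here so the code can be tried against a scratch directory. Writing under the real /proc/sys/vm/ normally requires root privileges, and the set of files present depends on the kernel in use.

```python
import os


def tunable_path(name, base="/proc/sys/vm"):
    """Build the path of a VM tunable file."""
    return os.path.join(base, name)


def read_tunable(name, base="/proc/sys/vm"):
    """Return the tunable's values as a list of strings (like cat)."""
    with open(tunable_path(name, base)) as f:
        return f.read().split()


def write_tunable(name, values, base="/proc/sys/vm"):
    """Write space-separated values to the tunable (like echo ... >)."""
    with open(tunable_path(name, base), "w") as f:
        f.write(" ".join(str(v) for v in values) + "\n")
```

For example, `write_tunable("kswapd", [511, 31, 7])` followed by `read_tunable("kswapd")` corresponds to the echo and cat commands shown earlier.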
bdflush
The bdflush file contains 9 parameters, of which 6 are tunable. These parameters affect the rate at which pages in the buffer cache (the subset of pagecache which stores files in memory) are freed and returned to disk. By adjusting the various values in this file, a system can be tuned to achieve better performance in environments where large amounts of file I/O are performed.
Table 1. “bdflush Parameters” defines the parameters for bdflush in the order they appear in the file.
Parameter | Description |
---|---|
nfract | The percentage of dirty pages in the buffer cache required to activate the bdflush task |
ndirty | The maximum number of dirty pages in the buffer cache to write to disk in each execution |
reserved1 | Reserved for future use |
reserved2 | Reserved for future use |
interval | The number of jiffies (10ms periods) to delay between bdflush iterations |
age_buffer | The time for a normal buffer to age before it is considered for flushing back to disk |
nfract_sync | The percentage of dirty pages in the buffer cache required to cause the tasks which are writing pages of memory to begin writing those pages to disk instead |
nfract_stop_bdflush | The percentage of dirty pages in buffer cache required to allow bdflush to return to idle state |
reserved3 | Reserved for future use |
Table 1. bdflush Parameters
Generally, systems that require more free memory for application allocation want to set the bdflush values higher (except for the age_buffer, which would be moved lower), so that file data is sent to disk more frequently and in greater volume, thus freeing up pages of RAM for application use. This, of course, comes at the expense of CPU cycles because the system processor spends more time moving data to disk and less time running applications. Conversely, systems which are required to perform large amounts of I/O would want to do the opposite to these values, allowing more RAM to be used to cache disk files so that file access is faster.
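Because the nine values are positional, it is easy to lose track of which number is which. The Python sketch below attaches the parameter names from the bdflush table to a value string such as the one used in the sysctl.conf example; the function names are hypothetical, and the jiffy length of 10 ms is the figure given in the table.

```python
BDFLUSH_FIELDS = (
    "nfract", "ndirty", "reserved1", "reserved2", "interval",
    "age_buffer", "nfract_sync", "nfract_stop_bdflush", "reserved3",
)


def parse_bdflush(text):
    """Map the nine space-separated bdflush values to their names."""
    values = [int(v) for v in text.split()]
    if len(values) != len(BDFLUSH_FIELDS):
        raise ValueError("expected %d values, got %d"
                         % (len(BDFLUSH_FIELDS), len(values)))
    return dict(zip(BDFLUSH_FIELDS, values))


def interval_seconds(params, jiffy_ms=10):
    """Convert the interval from jiffies to seconds."""
    return params["interval"] * jiffy_ms / 1000


params = parse_bdflush("40 500 0 0 500 3000 60 20 0")
```

With these values, `params["nfract"]` is 40 and the delay between bdflush iterations works out to 5 seconds.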
dcache_priority
This file controls the bias of the priority for caching directory contents. When the system is under stress, it selectively reduces the size of various file system caches in an effort to reclaim memory. By increasing this value, memory reclamation bias is shifted away from the dirent cache. By reducing this amount, the bias is shifted towards reclaiming dirent memory. This is not a particularly useful tuning parameter, but it can be helpful in maintaining the interactive response time on an otherwise heavily loaded system.
If you experience intolerable delays in communicating with your system when it is busy performing other work, increasing this parameter may help.
hugetlb_pool
The hugetlb_pool file is responsible for recording the number of megabytes used for huge pages. Huge pages are just like regular pages in the VM, only they are an order of magnitude larger. Note also that huge pages are not swappable.
Huge pages are both beneficial and detrimental to a system. They are helpful in that each huge page takes only one set of entries in the VM page tables, which allows for a higher degree of virtual address caching in the TLB (Translation Look-aside Buffer: a device which caches virtual address translations for faster lookups) and a requisite performance improvement. On the downside, they are very large and can be wasteful of memory resources for those applications which do not need large amounts of memory. Some applications, however, do require large amounts of memory and can make good use of huge pages if they are written to be aware of them. If a system is running applications which require large amounts of memory and are aware of this feature, then it is advantageous to increase this value to an amount satisfactory to that application or set of applications.
inactive_clean_percent
This control specifies the minimum percentage of pages in each page zone that must be in the clean or laundered state. If any zone drops below this threshold, and the system is under pressure for more memory, then that zone will begin having its inactive dirty pages laundered. Note that this control is only available on the 2.4.21-5EL kernels forward. Raising the value for the corresponding zone which is memory starved causes pages to be paged out more quickly, eliminating memory starvation at the expense of CPU clock cycles. Lowering this number allows more data to remain in RAM, increasing the system performance but at the risk of memory starvation.
kswapd
While this set of parameters previously defined how frequently and in what volume a system moved non-buffer cache pages to disk, in Red Hat Enterprise Linux 3, these controls are unused.
max_map_count
The max_map_count file allows for the restriction of the number of VMAs (Virtual Memory Areas) that a particular process can own. A Virtual Memory Area is a contiguous area of virtual address space. These areas are created during the life of the process when the program attempts to memory map a file, links to a shared memory segment, or allocates heap space. Tuning this value limits the number of these VMAs that a process can own. Limiting the number of VMAs a process can own can lead to problematic application behavior, because the system will return out of memory errors when a process reaches its VMA limit, but it can free up lowmem for other kernel uses. If your system is running low on memory in the NORMAL zone, then lowering this value will help free up memory for kernel use.
max-readahead
The max-readahead tunable affects how early the Linux VFS (Virtual File System) fetches the next block of a file from disk. File readahead values are determined on a per file basis in the VFS and are adjusted based on the behavior of the application accessing the file. Anytime the current position being read in a file plus the current readahead value results in the file pointer pointing to the next block in the file, that block is fetched from disk. By raising this value, the Linux kernel allows the readahead value to grow larger, resulting in more blocks being prefetched from disk for workloads which predictably access files in a uniform linear fashion. This can result in performance improvements but can also result in excess (and often unnecessary) memory usage. Lowering this value has the opposite effect. By forcing readaheads to be less aggressive, memory may be conserved at a potential performance impact.
min-readahead
Like max-readahead, min-readahead places a floor on the readahead value. Raising this number forces a file's readahead value to be unconditionally higher, which can bring about performance improvements provided that all file access in the system is predictably linear from the start to the end of a file. This, of course, results in higher memory usage from the pagecache. Conversely, lowering this value allows the kernel to conserve pagecache memory at a potential performance cost.
overcommit_memory
overcommit_memory is a value which sets the general kernel policy toward granting memory allocations. If the value is 0, then the kernel checks to determine if there is enough memory free to grant a memory request to a malloc call from an application. If there is enough memory, then the request is granted. Otherwise, it is denied and an error code is returned to the application. If the value is set to 1, the kernel allows all memory allocations, regardless of the current memory allocation state. If the value is set to 2, then the kernel grants allocations above the amount of physical RAM and swap in the system as defined by the overcommit_ratio value. Enabling this feature can be somewhat helpful in environments which allocate large amounts of memory expecting worst case scenarios but do not use it all.
overcommit_ratio
The overcommit_ratio tunable defines the amount by which the kernel overextends its memory resources in the event that overcommit_memory is set to the value of 2. The value in this file represents the percentage of the actual RAM in a system that is counted, in addition to swap, when considering whether to grant a particular memory request. For instance, if this value is set to 50, then the kernel would treat a system with 1 GB of RAM and 1 GB of swap as a system with 1.5 GB of allocatable memory when considering whether to grant a malloc request from an application. The general formula for this tunable is:
allocatable memory = swap size + (RAM size * overcommit_ratio / 100)
Use these previous two parameters with caution. Enabling overcommit_memory can create significant performance gains at little cost but only if your applications are suited to its use. If your applications use all of the memory they allocate, memory overcommit can lead to short performance gains followed by long latencies as your applications are swapped out to disk frequently when they must compete for oversubscribed RAM. Also, ensure that you have at least enough swap space to cover the overallocation of RAM (meaning that your swap space should be at least big enough to handle the percentage of overcommit in addition to the regular 50 percent of RAM that is normally recommended).
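As a worked example of the formula given for overcommit_ratio, the following Python sketch computes the allocatable memory for a given machine. It assumes the ratio is read as a percentage of RAM and that sizes are given in megabytes; the function name is hypothetical.

```python
def allocatable_memory_mb(ram_mb, swap_mb, overcommit_ratio):
    """Swap size plus overcommit_ratio percent of RAM, in megabytes."""
    return swap_mb + ram_mb * overcommit_ratio / 100


# 1 GB of RAM, 1 GB of swap, overcommit_ratio of 50
limit = allocatable_memory_mb(1024, 1024, 50)
```

Trying a few ratios with your own RAM and swap sizes is a quick way to see how far a given setting lets allocations run ahead of the memory actually installed.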
pagecache
The pagecache file adjusts the amount of RAM which can be used by the page cache. The page cache holds various pieces of data, such as open files from disk, memory mapped files, and pages of executable programs. Modifying the values in this file dictates how much of memory is used for this purpose. Table 2. “pagecache Parameters” defines the parameters for pagecache in the order they appear in the file.
Parameter | Description |
---|---|
min | The minimum amount of memory to reserve for pagecache use. |
borrow | The percentage of pagecache pages kswapd uses to balance the reclaiming of pagecache pages and process memory. |
max | If more memory than this percentage is used by pagecache, kswapd only evicts pages from the pagecache. Once the amount of memory in pagecache is below this threshold, kswapd begins moving process pages to swap again. |
Table 2. pagecache Parameters
Increasing these values allows more programs and cached files to stay in memory longer, thereby allowing applications to execute more quickly. On memory starved systems, however, this may lead to application delays as processes must wait for memory to become available. Moving these values downward swaps processes and other disk-backed data out more quickly, allowing for other processes to obtain memory more easily and increasing execution speed. For most workloads the automatic tuning is sufficient. However, if your workload suffers from excessive swapping and a large cache, you may want to reduce the values until the swapping problem goes away.
page-cluster
The kernel attempts to read multiple pages from disk on a page fault to avoid excessive seeks on the hard drive. This parameter defines the number of pages the kernel tries to read from disk during each page fault. The value is interpreted as 2^page-cluster pages for each page fault. A page fault is encountered every time a virtual memory address is accessed for which there is not yet a corresponding physical page assigned or for which the corresponding physical page has been swapped to disk. If the memory address has been requested in a valid way (for example, the application contains the address in its virtual memory map), then the kernel associates a page of RAM with the address or retrieves the page from disk and places it back in RAM. Then the kernel restarts the application from where it left off. By increasing the page-cluster value, pages subsequent to the requested page are also retrieved, meaning that if the workload of a particular system accesses data in RAM in a linear fashion, increasing this parameter can provide significant performance gains (much like the file readahead parameters described earlier). Of course if your workload accesses data discretely in many separate areas of memory, then this can just as easily cause performance degradation.
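Because the setting is an exponent rather than a page count, small changes have a large effect. The Python sketch below shows the arithmetic, assuming the value is read as a power of two and using the 4 KB page size mentioned earlier in the article; the function names are hypothetical.

```python
def pages_per_fault(page_cluster):
    """Number of pages read per page fault: 2 raised to page_cluster."""
    return 1 << page_cluster


def bytes_per_fault(page_cluster, page_size=4096):
    """Amount of data read per page fault, in bytes."""
    return pages_per_fault(page_cluster) * page_size


# Each step up in page-cluster doubles the amount read per fault.
table = {n: bytes_per_fault(n) // 1024 for n in range(5)}  # KB per fault
```

A value of 3, for instance, means 8 pages (32 KB with 4 KB pages) are read on each fault, while a value of 4 doubles that to 16 pages.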
Example Scenarios
Now that we have covered the details of kernel tuning, let us look at some example workloads and the various tuning parameters that may improve system performance.
File (IMAP, Web, etc.) Server
This workload is geared towards performing a large amount of I/O to and from the local disk, thus benefiting from an adjustment allowing more files to be maintained in RAM. This speeds up I/O by caching more files in RAM and eliminating the need to wait for disk I/O to complete. A simple change to sysctl.conf as follows usually benefits this workload:
#increase the amount of RAM pagecache is allowed to use
#before we start moving it back to disk
vm.pagecache="10 40 100"
General Compute Server With Many Active Users
This workload is a very general type of configuration. It involves many active users who likely run many processes, all of which may or may not be CPU intensive or I/O intensive or a combination thereof. As the default VM configuration attempts to find a balance between I/O and process memory usage, it may be best to leave most configuration settings alone in this case. However, this environment likely contains many small processes which, regardless of workload, consume memory resources, particularly lowmem. It may help, therefore, to tune the VM to conserve low memory resources when possible:
#lower the pagecache max to keep from eating all memory up with cache
vm.pagecache="10 25 50"
#lower max-readahead to reduce the amount of unneeded IO
vm.max-readahead=16
Non interactive (Batch) Computing Server
A batch computing server is usually the exact opposite of a file server. Applications run without human interaction, and they commonly perform with little I/O. The number of processes running on the system is controlled. Consequently this system should allow maximum throughput:
#Reduce the amount of pagecache normally allowed
vm.pagecache="1 10 100"
#do not worry about conserving lowmem, not that many processes
vm.max_map_count=128000
#crank up overcommit, processes can sleep as they are not interactive
vm.overcommit_memory=2
vm.overcommit_ratio=75
Further Reading
Understanding the Linux Kernel by Daniel Bovet and Marco Cesati (O'Reilly & Associates)
Virtual Memory Behavior in Red Hat Enterprise Linux AS 2.1 by Bob Matthews and Norm Murray
Towards an O(1) VM by Rik Van Riel
The Linux Kernel Source Tree, versions 2.4.21-4EL & 2.4.21-5EL
About the Author
Neil Horman is a software engineer at Red Hat. He lives in Raleigh, NC with his wife and 1 year old son. He has a BS and MS in computer engineering from North Carolina State University. When not enjoying family time he enjoys developing, repairing, and writing
about software.
Norm Murray has been working at Red Hat for the last 3 years. Coming to programming after dissatisfaction with the state of genetic engineering, he is now an information and learning junkie.