14.3.2.2. Avoiding request queue congestion

Each request queue has a maximum number of allowed pending requests. By default, a queue has at most 128 pending read requests and 128 pending write requests. (Note: with CFQ the value is not 128.)
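On a running system the limit can be read back from sysfs. The helper below is a minimal sketch: the function name queue_limit is made up for illustration, and it assumes the queue directory exposes an nr_requests file, as /sys/block/sda/queue does.

```shell
# Print the pending-request limit of a block device queue.
# $1: queue directory (defaults to /sys/block/sda/queue)
# queue_limit is a hypothetical helper name, not a standard tool.
queue_limit() {
    dir=${1:-/sys/block/sda/queue}
    if [ -r "$dir/nr_requests" ]; then
        cat "$dir/nr_requests"
    else
        echo "nr_requests not readable in $dir" >&2
        return 1
    fi
}
```

Calling queue_limit with no argument reads the value for sda; pass another queue directory to inspect a different disk.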

14.3.4.2. The "CFQ" elevator

The main goal of the "Complete Fairness Queueing" elevator is ensuring a fair allocation of the disk I/O bandwidth among all the processes that trigger the I/O requests. To achieve this result, the elevator makes use of a large number of sorted queues (by default, 64) that store the requests coming from the different processes.

14.3.4.3. The "Deadline" elevator

By default, the expire time of read requests is 500 milliseconds, while the expire time for write requests is 5 seconds. Read requests are privileged over write requests because they usually block the processes that issued them.

14.3.4.4. The "Anticipatory" elevator

The default expire time for read requests is 125 milliseconds, while the default expire time for write requests is 250 milliseconds.

On CentOS 6.5, the following parameters are present by default under /sys/block/sda/queue:

add_random
  This file allows the disk entropy contribution to be turned off. The default value of this file is '1' (on).
discard_granularity
  This shows the size of internal allocation of the device in bytes, if reported by the device. A value of '0' means the device does not support the discard functionality.
discard_max_bytes (RW)
  While discard_max_hw_bytes is the hardware limit for the device, this setting is the software limit. Some devices exhibit large latencies when large discards are issued; setting this value lower will make Linux issue smaller discards and potentially help reduce latencies induced by large discard operations.

hw_sector_size (RO)
  This is the hardware sector size of the device, in bytes.

iostats (RW)
  This file is used to control (on/off) the iostats accounting of the disk.

logical_block_size
  To list the physical and logical sizes of each device: lsblk -o NAME,PHY-SEC,LOG-SEC
  The logical block size is usually different from the physical block size. The physical block size is the size of the smallest block that the disk controller can read or write. The UFS logical block size is 8192 bytes.
max_hw_sectors_kb
  The maximum size, in kilobytes, that a single request can handle (hardware limit).
max_sectors_kb
  The maximum request size allowed for the device.
max_segments
  The maximum number of segments the device allows.

max_segment_size (RO)
  Maximum segment size of the device.
minimum_io_size (RO)
  This is the smallest preferred IO size reported by the device.
nomerges (RW)
  This enables the user to disable the lookup logic involved with IO merging requests in the block layer. By default (0) all merges are enabled. When set to 1 only simple one-hit merges will be tried. When set to 2 no merge algorithms will be tried (including one-hit or more complex tree/hash lookups).

nr_requests
  The nr_requests value represents the I/O queue size.
read_ahead_kb
  The read_ahead_kb value represents the amount of additional data that should be read after fulfilling a given read request. For example, if there is a read request for 4 KB and read_ahead_kb is set to 64, then an additional 64 KB will be read into the cache after the base 4 KB request has been met. Why read this additional data? It counteracts the rotational latency inherent in spinning hard disk drives. A 7200 RPM hard drive rotates 120 times per second, or roughly once every 8 ms. That may sound fast, but take an example where you are reading records from a database and gathering only 4 KB with each I/O read request (one read per rotation). Done serially, that would produce a throughput of a mere 480 KB/s. This is why read-ahead and request queue depth are so important to getting good performance from spinning disks. With SSDs there is no mechanical rotational latency, so the SSD profile uses a small 4 KB read-ahead.
  See https://wiki.osnexus.com/index.php?title=IO_Performance_Tuning#QuantaStor_System_Network_Setup
  All of the parameters above are documented at https://www.mjmwired.net/kernel/Documentation/block/queue-sysfs.txt
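The rotational-latency arithmetic above can be reproduced with shell integer math. This is a rough sketch, not a disk model: serial_kb_per_sec is a made-up helper, and the second call assumes, as a simplification, that the 64 KB of read-ahead arrives during the same rotation as the 4 KB request.

```shell
# Serial throughput when every read costs one full platter rotation.
# $1: drive speed in RPM, $2: kilobytes transferred per rotation
# serial_kb_per_sec is a hypothetical helper name.
serial_kb_per_sec() {
    rpm=$1
    kb_per_rotation=$2
    rotations_per_sec=$((rpm / 60))        # 7200 RPM -> 120 rotations/s
    echo $((rotations_per_sec * kb_per_rotation))
}

serial_kb_per_sec 7200 4            # 4 KB per rotation: prints 480
serial_kb_per_sec 7200 $((4 + 64))  # with 64 KB read-ahead: prints 8160
```

Under that simplification, the same number of rotations delivers 17 times as much data, which is the effect read_ahead_kb is meant to produce for sequential reads.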

The following commands make sda use a different I/O scheduler; the contents of /sys/block/sda/queue/iosched change accordingly.

echo deadline >/sys/block/sda/queue/scheduler
echo anticipatory >/sys/block/sda/queue/scheduler
echo cfq >/sys/block/sda/queue/scheduler
echo noop >/sys/block/sda/queue/scheduler
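To confirm which scheduler is active after such a write, read the same file back: the kernel lists the available schedulers and marks the active one with square brackets, for example "noop anticipatory deadline [cfq]". The helper below is a small sketch; active_scheduler is a name chosen here for illustration.

```shell
# Print the active scheduler from a sysfs "scheduler" file.
# $1: path to the file (defaults to /sys/block/sda/queue/scheduler)
# active_scheduler is a hypothetical helper name.
active_scheduler() {
    file=${1:-/sys/block/sda/queue/scheduler}
    # Keep only the text between the square brackets.
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$file"
}
```

Running active_scheduler right after one of the echo commands above should print the name that was just written, provided the kernel has that scheduler available.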

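The parameter values in /sys/block/sda/queue/iosched under each scheduler are listed below:

anticipatory also has a parameter, est_time, which is not listed in the table above (its value is too long to display neatly):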

est_time    "0 % exit probability
% probability of exiting without a cooperating process submitting IO
ms new thinktime
sectors new seek distance"

Tuning The CFQ Scheduler

Remember that this is for mostly to entirely non-interactive work where latency is of lower concern. You care some about latency, but your main concern is throughput.

Attribute
  Meaning and suggested tuning
fifo_expire_async
  Number of milliseconds an asynchronous request (buffered write) can remain unserviced.
  If lower buffered write latency is needed, either decrease this from the default of 250 ms or consider switching to the deadline scheduler.
fifo_expire_sync
  Number of milliseconds a synchronous request (read, or O_DIRECT unbuffered write) can remain unserviced.
  If lower read latency is needed, either decrease this from the default of 125 ms or consider switching to the deadline scheduler.
low_latency
  0 = disabled: latency is ignored and each process gets a full time slice.
  1 = enabled: favor fairness over throughput, and enforce a maximum wait time of 300 milliseconds for each process issuing I/O requests for a device.
  Select this if using CFQ with applications requiring it, such as real-time media streaming.
quantum
  Number of I/O requests sent to a device at one time, limiting the queue depth.
  Increase this to improve throughput on storage hardware with its own deep I/O buffer, such as SAN and RAID, at the cost of increased latency.
slice_idle
  Length of time in milliseconds that cfq will idle while waiting for further requests.
  Set to 0 for solid-state drives or for external RAID with its own cache. Leave at the default of 8 milliseconds for internal non-RAID storage to reduce seek operations.
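As an illustration of applying one of these suggestions, the sketch below writes a single tunable and reads it back. The helper name set_tunable is an assumption made for this example; on a real system the writes need root, and the attribute only exists while the matching scheduler is active.

```shell
# Write one iosched tunable and print the value read back.
# $1: iosched directory, $2: attribute name, $3: new value
# set_tunable is a hypothetical helper name.
set_tunable() {
    dir=$1
    name=$2
    value=$3
    if [ ! -w "$dir/$name" ]; then
        echo "cannot write $dir/$name (not root, or another scheduler is active?)" >&2
        return 1
    fi
    echo "$value" > "$dir/$name"
    echo "$name=$(cat "$dir/$name")"
}
```

For an SSD using CFQ, the slice_idle advice above would be applied as: set_tunable /sys/block/sda/queue/iosched slice_idle 0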

Tuning The Deadline Scheduler

Remember that this is for interactive work where latency above about 100 milliseconds will really bother your users. Throughput would be nice, but we must keep the latency down.

Attribute
  Meaning and suggested tuning
fifo_batch
  Number of read or write operations to issue in one batch.
  Lower values may further reduce latency.
  Higher values can increase throughput on rotating mechanical disks, but at the cost of worse latency.
  You selected the deadline scheduler to limit latency, so you probably don't want to increase this, at least not by very much.
read_expire
  Number of milliseconds within which a read request should be served.
  Reduce this from the default of 500 to 100 on a system with interactive users.
write_expire
  Number of milliseconds within which a write request should be served.
  Leave at the default of 5000 and let write operations be done asynchronously in the background, unless your specialized application uses many synchronous writes.
writes_starved
  Number of read batches that can be processed before handling a write batch.
  Increase this from the default of 2 to give higher priority to read operations.
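Several deadline settings are usually changed together, so a small loop over name=value pairs can help. This is only a sketch: apply_profile is a made-up name, and the writes_starved value of 4 in the usage note is an illustrative choice, not a recommendation from the source.

```shell
# Apply a list of name=value pairs to an iosched directory.
# $1: iosched directory; remaining arguments: name=value pairs
# apply_profile is a hypothetical helper name.
apply_profile() {
    dir=$1
    shift
    for pair in "$@"; do
        name=${pair%%=*}     # text before the first '='
        value=${pair#*=}     # text after the first '='
        if [ ! -w "$dir/$name" ]; then
            echo "skipping $name: $dir/$name is not writable" >&2
            return 1
        fi
        echo "$value" > "$dir/$name"
    done
}
```

With the deadline scheduler active on sda, an interactive profile along the lines described above could be applied as root with: apply_profile /sys/block/sda/queue/iosched read_expire=100 writes_starved=4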

Tuning The NOOP Scheduler

Remember that this is for entirely non-interactive work where throughput is all that matters: data mining, high-performance computing and rendering, and CPU-bound systems with fast storage.

The whole point is that NOOP isn't a scheduler; I/O requests are handled strictly first come, first served. All we can tune are some block layer parameters in /sys/block/sd*/queue/*, which could also be tuned for other schedulers, so...

Tuning General Block I/O Parameters

These are in /sys/block/sd*/queue/.

Attribute
  Meaning and suggested tuning
max_sectors_kb
  Maximum allowed size of an I/O request in kilobytes, which must be within these bounds:
  Min value = max(1, logical_block_size/1024)
  Max value = max_hw_sectors_kb
nr_requests
  Maximum number of read and write requests that can be queued at one time before the next process requesting a read or write is put to sleep. The default value of 128 means 128 read requests and 128 write requests can be queued at once.
  Larger values may increase throughput for workloads writing many small files; smaller values increase throughput with larger I/O operations.
  You could decrease this if you are using latency-sensitive applications, but then you shouldn't be using NOOP if latency matters!
optimal_io_size
  If non-zero, the storage device has reported its own optimal I/O size.
  If you are developing your own applications, make their I/O requests in multiples of this size if possible.
read_ahead_kb
  Number of kilobytes the kernel will read ahead during a sequential read operation. The default is 128 KB; if the disk is used with LVM, the device mapper may benefit from a higher value.
  If your workload does a lot of large streaming reads, larger values may improve performance.
rotational
  Should be 0 (no) for solid-state disks, but some do not correctly report their status to the kernel.
  If incorrectly set to 1 for an SSD, set it to 0 to disable unneeded scheduler logic meant to reduce the number of seeks.
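The bounds given for max_sectors_kb can be checked before writing a value. The function below is a sketch of that rule only; clamp_max_sectors_kb is a hypothetical name, and the three inputs are assumed to come from the desired value, logical_block_size, and max_hw_sectors_kb respectively.

```shell
# Clamp a requested max_sectors_kb to the documented bounds:
#   min = max(1, logical_block_size/1024), max = max_hw_sectors_kb
# $1: requested value (KB), $2: logical_block_size (bytes), $3: max_hw_sectors_kb (KB)
# clamp_max_sectors_kb is a hypothetical helper name.
clamp_max_sectors_kb() {
    want=$1
    min=$(($2 / 1024))
    if [ "$min" -lt 1 ]; then
        min=1
    fi
    max=$3
    if [ "$want" -lt "$min" ]; then
        echo "$min"
    elif [ "$want" -gt "$max" ]; then
        echo "$max"
    else
        echo "$want"
    fi
}

clamp_max_sectors_kb 4096 512 1280   # above the hardware limit: prints 1280
```

A value inside the range is returned unchanged, so the result can be written straight to max_sectors_kb.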

For descriptions of the I/O scheduler parameters, see https://cromwell-intl.com/open-source/performance-tuning/disks.html

https://access.redhat.com/documentation/zh-cn/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-storage_and_file_systems-configuration_tools#sect-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Configuration_tools-Tuning_the_cfq_scheduler
