14.3.2.2. Avoiding request queue congestion

Each request queue has a maximum number of allowed pending requests. By default, a queue has at most 128 pending read requests and 128 pending write requests (with CFQ the value is not 128).
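As a quick check, the limit can be read back from sysfs. The sketch below assumes the device is sda and that the limit is exposed as the nr_requests file under /sys/block/sda/queue; the function name show_queue_limit is made up for illustration.

```shell
# Print the per-direction request limit of a block device queue.
# $1 is the queue directory; it defaults to /sys/block/sda/queue (sda is an assumption).
show_queue_limit() {
    queue_dir="${1:-/sys/block/sda/queue}"
    if [ -r "$queue_dir/nr_requests" ]; then
        limit=$(cat "$queue_dir/nr_requests")
        echo "up to $limit pending reads and $limit pending writes"
    else
        echo "nr_requests not found under $queue_dir" >&2
        return 1
    fi
}

# Only query the real device when it exists on this machine.
if [ -d /sys/block/sda/queue ]; then
    show_queue_limit || true
fi
```

Passing a different queue directory as the first argument checks another disk.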

14.3.4.2. The "CFQ" elevator

The main goal of the "Complete Fairness Queueing" elevator is ensuring a fair allocation of the disk I/O bandwidth among all the processes that trigger the I/O requests. To achieve this result, the elevator makes use of a large number of sorted queues (by default, 64) that store the requests coming from the different processes.

14.3.4.3. The "Deadline" elevator

By default, the expire time of read requests is 500 milliseconds, while the expire time for write requests is 5 seconds; read requests are privileged over write requests because they usually block the processes that issued them.

14.3.4.4. The "Anticipatory" elevator

The default expire time for read requests is 125 milliseconds, while the default expire time for write requests is 250 milliseconds.

On CentOS 6.5, the following parameters are present by default under /sys/block/sda/queue:

add_random
  This file allows turning off the disk entropy contribution. The default value of this file is '1' (on).
discard_granularity
  This shows the size of internal allocation of the device in bytes, if reported by the device. A value of '0' means device does not support the discard functionality.
discard_max_bytes (RW)
  While discard_max_hw_bytes is the hardware limit for the device, this setting is the software limit. Some devices exhibit large latencies when large discards are issued; setting this value lower will make Linux issue smaller discards and potentially help reduce latencies induced by large discard operations.

hw_sector_size (RO)
  This is the hardware sector size of the device, in bytes.

iostats (RW)
  This file is used to control (on/off) the iostats accounting of the disk.

logical_block_size
  The logical block size is usually different from the physical block size. The physical block size, in bytes, is the size of the smallest block that the disk controller can read or write. The UFS logical_block_size is 8192 bytes. Both sizes can be listed with:
  lsblk -o NAME,PHY-SEC,LOG-SEC
max_hw_sectors_kb
  The maximum size, in KB, that a single request can handle (hardware limit).
max_sectors_kb
  The maximum request size allowed for the device.
max_segments
  The maximum number of segments the device allows.

max_segment_size (RO)
  Maximum segment size of the device.
minimum_io_size (RO)
  This is the smallest preferred IO size reported by the device.
nomerges (RW)
  This enables the user to disable the lookup logic involved with IO merging requests in the block layer. By default (0) all merges are enabled. When set to 1 only simple one-hit merges will be tried. When set to 2 no merge algorithms will be tried (including one-hit or more complex tree/hash lookups).

nr_requests
  The size of the I/O request queue.
read_ahead_kb
  The amount of additional data that should be read after fulfilling a given read request. For example, if there is a read request for 4 KB and read_ahead_kb is set to 64, then an additional 64 KB will be read into the cache after the base 4 KB request has been met. Why read this additional data? It counteracts the rotational latency inherent in spinning hard disk drives. A 7200 RPM hard drive rotates 120 times per second, or roughly once every 8 ms. That may sound fast, but take an example where you are reading records from a database and gathering only 4 KB with each I/O read request (one read per rotation). Done serially, that would produce a throughput of a mere 480 KB/s. This is why read-ahead and request queue depth are so important to getting good performance from spinning disks. SSDs have no mechanical rotational latency, so the SSD profile uses a small 4 KB read-ahead.

See https://wiki.osnexus.com/index.php?title=IO_Performance_Tuning#QuantaStor_System_Network_Setup for the read-ahead discussion. All of the parameters above are described in https://www.mjmwired.net/kernel/Documentation/block/queue-sysfs.txt
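The arithmetic behind the 7200 RPM example can be reproduced with shell integer math. This is only a sketch of the numbers quoted in the read_ahead_kb description (one 4 KB read per rotation); the variable names are chosen for illustration.

```shell
rpm=7200          # rotational speed from the example
read_kb=4         # data gathered per rotation, in KB

rotations_per_sec=$((rpm / 60))                 # rotations per second
ms_per_rotation=$((1000 / rotations_per_sec))   # integer division, so the result is rounded down
throughput_kb=$((rotations_per_sec * read_kb))  # KB read per second, one read per rotation

echo "rotations per second: $rotations_per_sec"
echo "ms per rotation:      $ms_per_rotation"
echo "serial throughput:    $throughput_kb KB/s"
```

With a 64 KB read-ahead added to every 4 KB request, the same calculation with read_kb=68 shows why read-ahead helps so much on spinning disks.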

The following commands make sda use a different I/O scheduler; the contents of /sys/block/sda/queue/iosched change accordingly.

echo deadline >/sys/block/sda/queue/scheduler
echo anticipatory >/sys/block/sda/queue/scheduler
echo cfq >/sys/block/sda/queue/scheduler
echo noop >/sys/block/sda/queue/scheduler
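To confirm which scheduler is in effect after one of these commands, the scheduler file can be read back: the kernel lists the available schedulers and marks the active one with square brackets. The helper below is a small sketch; the name active_scheduler is made up for illustration.

```shell
# Print the scheduler name shown in brackets,
# e.g. "cfq" from "noop anticipatory deadline [cfq]".
# $1 is the path of a scheduler file.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Only query the real device when the file exists on this machine.
sched_file=/sys/block/sda/queue/scheduler
if [ -r "$sched_file" ]; then
    echo "active scheduler on sda: $(active_scheduler "$sched_file")"
fi
```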

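The parameter values found in /sys/block/sda/queue/iosched under each scheduler are listed below:

anticipatory also has a parameter, est_time, that is not listed in the table above (its value is too long to display neatly):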

est_time    "0 % exit probability
% probability of exiting without a cooperating process submitting IO
ms new thinktime
sectors new seek distance"
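The tunables of whichever scheduler is active can also be listed directly from sysfs. The loop below is a sketch, assuming the iosched directory of sda; the function name dump_iosched is made up for illustration, and only the first line of each file is printed because values such as est_time span several lines.

```shell
# Print "name = value" for every tunable in an iosched directory.
# $1 defaults to /sys/block/sda/queue/iosched (sda is an assumption).
dump_iosched() {
    iosched_dir="${1:-/sys/block/sda/queue/iosched}"
    for f in "$iosched_dir"/*; do
        [ -f "$f" ] || continue     # skip when the directory is empty or missing
        printf '%s = %s\n' "$(basename "$f")" "$(head -n 1 "$f")"
    done
}

# Only query the real device when the directory exists on this machine.
if [ -d /sys/block/sda/queue/iosched ]; then
    dump_iosched
fi
```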

Tuning The CFQ Scheduler

Remember that this is for mostly to entirely non-interactive work where latency is of lower concern. You care some about latency, but your main concern is throughput.

Attribute Meaning and suggested tuning
fifo_expire_async
  Number of milliseconds an asynchronous request (buffered write) can remain unserviced.
  If lower buffered write latency is needed, either decrease this from the default of 250 msec or consider switching to the deadline scheduler.
fifo_expire_sync
  Number of milliseconds a synchronous request (read, or O_DIRECT unbuffered write) can remain unserviced.
  If lower read latency is needed, either decrease this from the default of 125 msec or consider switching to the deadline scheduler.
low_latency
  0 = disabled: latency is ignored, and each process gets a full time slice.
  1 = enabled: favor fairness over throughput, enforcing a maximum wait time of 300 milliseconds for each process issuing I/O requests for a device.
  Select this if using CFQ with applications requiring it, such as real-time media streaming.
quantum
  Number of I/O requests sent to a device at one time, limiting the queue depth.
  Increase this to improve throughput on storage hardware with its own deep I/O buffer, such as SAN and RAID, at the cost of increased latency.
slice_idle
  Length of time in milliseconds that CFQ will idle while waiting for further requests.
  Set to 0 for solid-state drives or for external RAID with its own cache. Leave at the default of 8 milliseconds for internal non-RAID storage to reduce seek operations.
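The slice_idle advice can be applied conditionally. The sketch below sets slice_idle to 0 only when the queue's rotational attribute reads 0; it assumes CFQ is the active scheduler, that the caller has write access (normally root), and it does not cover the external RAID case. The function name cfq_disable_idle_on_ssd is made up for illustration.

```shell
# Set slice_idle to 0 when the device reports itself as non-rotational.
# $1 is the queue directory, e.g. /sys/block/sda/queue.
cfq_disable_idle_on_ssd() {
    queue_dir="$1"
    # Fail if the tunable is missing (CFQ not active) or not writable.
    [ -w "$queue_dir/iosched/slice_idle" ] || return 1
    if [ "$(cat "$queue_dir/rotational")" = "0" ]; then
        echo 0 > "$queue_dir/iosched/slice_idle"
    fi
}

# Example call on a real system:
#   cfq_disable_idle_on_ssd /sys/block/sda/queue
```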

Tuning The Deadline Scheduler

Remember that this is for interactive work where latency above about 100 milliseconds will really bother your users. Throughput would be nice, but we must keep the latency down.

Attribute Meaning and suggested tuning
fifo_batch
  Number of read or write operations to issue in one batch.
  Lower values may further reduce latency.
  Higher values can increase throughput on rotating mechanical disks, but at the cost of worse latency.
  You selected the deadline scheduler to limit latency, so you probably don't want to increase this, at least not by very much.
read_expire
  Number of milliseconds within which a read request should be served.
  Reduce this from the default of 500 to 100 on a system with interactive users.
write_expire
  Number of milliseconds within which a write request should be served.
  Leave at the default of 5000 and let write operations be done asynchronously in the background, unless your specialized application uses many synchronous writes.
writes_starved
  Number of read batches that can be processed before handling a write batch.
  Increase this from the default of 2 to give higher priority to read operations.
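A suggestion such as lowering read_expire to 100 comes down to writing one value into the iosched directory. The helper below is a sketch that writes a tunable and prints the value read back; the name set_iosched_param is made up for illustration, and on a real system it needs root and the matching scheduler to be active.

```shell
# Write one scheduler tunable and print the value read back.
# $1 = iosched directory, $2 = attribute name, $3 = new value.
set_iosched_param() {
    target="$1/$2"
    if [ ! -w "$target" ]; then
        echo "cannot write $target (wrong scheduler or not root?)" >&2
        return 1
    fi
    echo "$3" > "$target"
    echo "$2 = $(cat "$target")"
}

# Example call on a real system with the deadline scheduler on sda:
#   set_iosched_param /sys/block/sda/queue/iosched read_expire 100
```

Values written this way do not survive a reboot, so they are usually reapplied from a boot-time script.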

Tuning The NOOP Scheduler

Remember that this is for entirely non-interactive work where throughput is all that matters: data mining, high-performance computing and rendering, and CPU-bound systems with fast storage.

The whole point is that NOOP isn't a scheduler; I/O requests are handled strictly first come, first served. All we can tune are some block layer parameters in /sys/block/sd*/queue/*, which could also be tuned for other schedulers, so...

Tuning General Block I/O Parameters

These are in /sys/block/sd*/queue/.

Attribute Meaning and suggested tuning
max_sectors_kb
  Maximum allowed size of an I/O request in kilobytes, which must be within these bounds:
  Min value = max(1, logical_block_size/1024)
  Max value = max_hw_sectors_kb
nr_requests
  Maximum number of read and write requests that can be queued at one time before the next process requesting a read or write is put to sleep. The default value of 128 means 128 read requests and 128 write requests can be queued at once.
  Larger values may increase throughput for workloads writing many small files; smaller values increase throughput with larger I/O operations.
  You could decrease this if you are using latency-sensitive applications, but then you shouldn't be using NOOP if latency matters!
optimal_io_size
  If non-zero, the storage device has reported its own optimal I/O size.
  If you are developing your own applications, make their I/O requests in multiples of this size if possible.
read_ahead_kb
  Number of kilobytes the kernel will read ahead during a sequential read operation. The default is 128 KB; if the disk is used with LVM, the device mapper may benefit from a higher value.
  If your workload does a lot of large streaming reads, larger values may improve performance.
rotational
  Should be 0 (no) for solid-state disks, but some do not correctly report their status to the kernel.
  If incorrectly set to 1 for an SSD, set it to 0 to disable unneeded scheduler logic meant to reduce the number of seeks.
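The bounds given for max_sectors_kb can be checked before writing a new value. The sketch below applies the two formulas from the table to a candidate value; the function name valid_max_sectors_kb is made up for illustration, and it assumes the queue directory exposes logical_block_size and max_hw_sectors_kb as described in these notes.

```shell
# Succeed if $2 (a candidate max_sectors_kb) lies within the allowed bounds
# for the queue directory $1.
valid_max_sectors_kb() {
    queue_dir="$1"
    candidate="$2"
    lbs=$(cat "$queue_dir/logical_block_size")
    hw_max=$(cat "$queue_dir/max_hw_sectors_kb")
    min=$((lbs / 1024))
    [ "$min" -ge 1 ] || min=1      # min = max(1, logical_block_size/1024)
    [ "$candidate" -ge "$min" ] && [ "$candidate" -le "$hw_max" ]
}

# Example use on a real system (writing needs root):
#   if valid_max_sectors_kb /sys/block/sda/queue 256; then
#       echo 256 > /sys/block/sda/queue/max_sectors_kb
#   fi
```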

For descriptions of the I/O scheduler parameters, see https://cromwell-intl.com/open-source/performance-tuning/disks.html

https://access.redhat.com/documentation/zh-cn/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-storage_and_file_systems-configuration_tools#sect-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Configuration_tools-Tuning_the_cfq_scheduler
