What is O_DIRECT

Starting with kernel 2.4, Linux allows an application to bypass the buffer cache when performing disk I/O, thus transferring data directly from user space to a file or disk device. This is sometimes termed direct I/O or raw I/O.

Direct I/O is sometimes misunderstood as being a means of obtaining fast I/O performance. However, for most applications, using direct I/O can considerably degrade performance. This is because the kernel applies a number of optimiza- tions to improve the performance of I/O done via the buffer cache, including per- forming sequential read-ahead, performing I/O in clusters of disk blocks, and allowing processes accessing the same file to share buffers in the cache. All of these optimizations are lost when we use direct I/O. Direct I/O is intended only for applications with specialized I/O requirements. For example, database systems that perform their own caching and I/O optimizations don’t need the kernel to consume CPU time and memory performing the same tasks.

We can perform direct I/O either on an individual file or on a block device (e.g., a disk). To do this, we specify the O_DIRECT flag when opening the file or device with open().

The O_DIRECT flag is effective since kernel 2.4.10. Not all Linux file systems and kernel versions support the use of this flag. Most native file systems support O_DIRECT, but many non-UNIX file systems (e.g., VFAT) do not. It may be necessary to test the file system concerned (if a file system doesn’t support O_DIRECT, then open() fails with the error EINVAL) or read the kernel source code to check for this support.

If a file is opened with O_DIRECT by one process, and opened normally (i.e., so that the buffer cache is used) by another process, then there is no coherency between the contents of the buffer cache and the data read or written via direct I/O. Such scenarios should be avoided.

The raw(8) manual page describes an older (now deprecated) technique for obtaining raw access to a disk device.

Alignment restrictions for direct I/O

Because direct I/O (on both disk devices and files) involves direct access to the disk, we must observe a number of restrictions when performing I/O:

  • The data buffer being transferred must be aligned on a memory boundary that is a multiple of the block size.

  • The file or device offset at which data transfer commences must be a multiple of the block size.

  • The length of the data to be transferred must be a multiple of the block size.

Failure to observe any of these restrictions results in the error EINVAL. In the above list, block size means the physical block size of the device (typically 512 bytes).

Notes of O_DIRECT

1,数据对齐

从2.6.0内核开始,O_DIRECT要求的对齐基本单位是底层块设备的逻辑块大小(Logical Block Size),此处一定要注意,是底层块设备的逻辑块大小,而不是文件系统(如ext4)的block size。底层块设备的块大小也叫Sector Size,可以用两种方式获取它:

一种是使用ioctl系统调用的BLKSSZGET指令来获取:

#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/mount.h> void getblocksize(char *dev)
{
std::cout << "Params for " << dev << std::endl;
int fd;
fd = open(dev, O_RDWR);
if (fd == -1) {
std::cout << "open error " << errno << std::endl;
return;
} long size = 0;
if (ioctl(fd, BLKSSZGET, &size) >= 0)
std::cout << "BLKSSZGET: " << size << std::endl;
else
std::cout << "error BLKSSZGET " << errno << std::endl;
close(fd);
}

另一种是脚本:

blockdev --getss

值得一提的是,blockdev命令有两个block size:

#blockdev --help

Usage:
blockdev -V
blockdev --report [devices]
blockdev [-v|-q] commands devices Available commands:
--getsz get size in 512-byte sectors
--setro set read-only
--setrw set read-write
--getro get read-only
--getdiscardzeroes get discard zeroes support status
--getss get logical block (sector) size
--getpbsz get physical block (sector) size
--getiomin get minimum I/O size
--getioopt get optimal I/O size
--getalignoff get alignment offset in bytes
--getmaxsect get max sectors per request
--getbsz get blocksize
--setbsz <bytes> set blocksize on file descriptor opening the block device
--getsize get 32-bit sector count (deprecated, use --getsz)
--getsize64 get size in bytes
--setra <sectors> set readahead
--getra get readahead
--setfra <sectors> set filesystem readahead
--getfra get filesystem readahead
--flushbufs flush buffers
--rereadpt reread partition table

 --getss和--getpbsz是指获取设备的逻辑块大小和物理块大小,sector的含义:

Mass storage devices (hard disks, CD-ROMs, tapes) operate on chunks of data, usually called sectors. The size of these device sectors varies, but is fixed for any one device. Hard disks and floppies usually use 512 bytes, while data CDs and DVDs use 2048 bytes. Today, it is customary to number all sectors sequentially and leave the details to the device.

--getbsz是指文件系统的逻辑块大小,block的含义:

File systems also operate on chunks at a time, but they don't need to be the same size as the device's sectors. The chunks used by the file system are usually called blocks, but clusterallocation block, and allocation unit are also common.

O_DIRECT对齐的单位就是上面的sector size。而不是操作系统的block size。这个地方大家往往混淆为操作系统的blocksize(尽管这样做O_DIRECT不会报错)。

2,并行操作

Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the filesystem correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files.

3,O_DIRECT和fork

O_DIRECT I/Os should never be run concurrently with the fork(2)
system call, if the memory buffer is a private mapping (i.e., any
mapping created with the mmap(2) MAP_PRIVATE flag; this includes
memory allocated on the heap and statically allocated buffers). Any
such I/Os, whether submitted via an asynchronous I/O interface or
from another thread in the process, should be completed before
fork(2) is called. Failure to do so can result in data corruption
and undefined behavior in parent and child processes. This
restriction does not apply when the memory buffer for the O_DIRECT
I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag.
Nor does this restriction apply when the memory buffer has been
advised as MADV_DONTFORK with madvise(2), ensuring that it will not
be available to the child after fork(2).

Notes of O_DIRECT flag的更多相关文章

  1. 【转】open参数O_DIRECT的学习

    open参数O_DIRECT的学习 使用 O_DIRECT 需要注意的地方 posix_memalign详细解释 free:这里好几个方法我都没测试成功,最后还是用posix_memalign 对齐的 ...

  2. InnoDB O_DIRECT选项漫谈(一)【转】

    本文来自:http://insidemysql.blog.163.com/blog/static/2028340422013671186977/   最近和文件系统内核开发人员做技术交流,对O_DIR ...

  3. RAC的QA

    RAC: Frequently Asked Questions [ID 220970.1]   修改时间 13-JAN-2011     类型 FAQ     状态 PUBLISHED   Appli ...

  4. 马哥教育视频笔记:01(Linux常用命令)

    1.查看缓存中使用的命令和命令路径 [wskwskwsk@localhost /]$ hash 命中 命令 /usr/bin/printenv /usr/bin/ls /usr/bin/clear 2 ...

  5. 2.Linux文件IO编程

    2.1Linux文件IO概述 2.1.0POSIX规范 POSIX:(Portable Operating System Interface)可移植操作系统接口规范. 由IEEE制定,是为了提高UNI ...

  6. percona 5.6升级到5.7相关error及解决方法

    今早,把开发环境的mysql升级到了5.7.15,5.6数据导入后,启动一切正常,检查.err日志,发现有如下异常: 2016-10-31T00:29:33.187073Z 0 [Warning] S ...

  7. Linux下使用iostat 监视I/O状态

    我们可以使用 sar(1), pidstat(1), mpstat(1), vmstat(8) 来监控 一.安装 yum install sysstat 二.参数解释 FILES /proc/stat ...

  8. FastDFS配置文件(storage.conf)

    # 该配置文件是否生效 # false:生效 # true:无效 disabled=false # 本storage server所属组名 group_name=group1 # 绑定IP # 后面为 ...

  9. nginx指令

    Directives(指令) Syntax(语法): aio on | off | threads[=pool]; Default: aio off; Context: http, server, l ...

随机推荐

  1. LeetCode 754. Reach a Number到达终点数字

    题目 在一根无限长的数轴上,你站在0的位置.终点在target的位置. 每次你可以选择向左或向右移动.第 n 次移动(从 1 开始),可以走 n 步. 返回到达终点需要的最小移动次数. 示例 1: 输 ...

  2. Neo4j资料 Neo4j教程 Neo4j视频教程 Neo4j 图数据库视频教程

    课程发布地址 地址: 腾讯课堂<Neo4j 图数据库视频教程> https://ke.qq.com/course/327374?tuin=442d3e14 作者 庞国明,<Neo4j ...

  3. 【大数据】Spark-Hadoop-架构对比

    Spark-Hadoop-架构对比 spark executor - zyc920716的博客 - CSDN博客 董的博客 » Apache Spark探秘:多进程模型还是多线程模型? Apache ...

  4. 【Spark】SparkStreaming-foreachrdd foreachpartition

    SparkStreaming-foreachrdd foreachpartition foreachrdd foreachpartition_百度搜索 SparkStreaming之foreachRD ...

  5. bash shell seq的用法

    seq 1 3 100 , 表示起始值为1, 间隔为3,终点值为100 #!/bin/bash aa=(1 2 3 17) #for i in 1 2 3 13 for i in ${aa[*]} d ...

  6. RAMPS1.4 3d打印控制板接线与测试5

    切片软件是生产打印机主控板可以识别的代码(Gcode)的工具,没有这个软件的帮忙,打印机不能识别3d模型文件.这里暂时只介绍Slic3r这个切片软件.简单好用功能强大. 1.打开expert模式 Sl ...

  7. cookie相关的函数

    浏览器中,使用JavaScript操作cookie的两个工具函数. 设置cookie值, 必须的參数是name和value,可选參数是过期天数和域名. // 设置cookie值(key,value,过 ...

  8. 从C++到java

    C++和java都号称是面向对象的语言,虽然C++不完全算是.学习过C++如何快速对java有个大体的掌握,可以通过对比来进行了解. 首先还是来高大上一下,看看他们的使命: · C++ 被设计成主要用 ...

  9. file: /SourceCache/IOKitUser_Sim/IOKitUser-920.1.11/hid.subproj/IOHIDEventQueue.c, line: 512

    //修改main.m 文件. typedef int (*PYStdWriter)(void *, const char *, int); static PYStdWriter _oldStdWrit ...

  10. 求两个有序数组的中位数或者第k小元素

    问题:两个已经排好序的数组,找出两个数组合并后的中位数(如果两个数组的元素数目是偶数,返回上中位数). 设两个数组分别是vec1和vec2,元素数目分别是n1.n2. 算法1:最简单的办法就是把两个数 ...