https://glandium.org/blog/?p=1476

Attempting to track I/O with systemtap

There are several ways a program can hit the disk, and it can be hard to know exactly what’s going on, especially when you want to take into account the kernel caches. This includes, but is not limited to:

  • any access to a file, which may lead to I/O to read its parent directories if they are not already in the inode or dentry caches
  • enumerating a directory with readdir(), which may lead to I/O on the directory for the same reason
  • read()/write() on a file, which may lead to I/O on the file if it is not in the page cache
  • accesses in a mmap()ed area of memory, which may lead to I/O on the underlying file if it is not in the page cache
  • etc.

There are various ways to track system calls (e.g. strace) allowing to track what a program is doing with files and directories, but that doesn’t tell you if you’re actually hitting the disk or not. There are also various ways to track block I/O (e.g. blktrace) allowing to track actual I/O on the block devices, but it is then hard to back-track what part of which files or directories these I/O relate to. To the best of my knowledge, there are unfortunately no tools to do such tracking easily.

Systemtap, however, allows to access the kernel’s internals and to gather almost any kind of information from any place in a running kernel. The downside is that it means you need to know how the kernel works internally to gather the data you need ; that will limit the focus of this post.

I had been playing, in the past, with Taras’ script, which he used a while ago to track I/O during startup. Unfortunately, it became clear something was missing in the picture, so I had to investigate what’s going on in the kernel.

Setting up systemtap

On Debian systems, you need to install the following packages:

  • systemtap
  • linux-image-2.6.x-y-$arch-dbg (where x, y, and $arch correspond to the kernel package you are using)
  • linux-headers-2.6.x-y-$arch (likewise)
  • make

That should be enough to pull all the required dependencies. You may want to add yourself to the stapdev andstapusr groups, if you don’t want to run systemtap as root. If, like in my case, you don’t have enough space left for all the files in /usr/lib/debug, you can trick dpkg into not unpacking files you don’t need:

# echo path-exclude /usr/lib/debug/lib/modules/*/kernel/drivers/* > /etc/dpkg/dpkg.cfg.d/kernel-dbg

The file in /etc/dpkg/dpkg.cfg.d can obviously be named as you like, and you can adjust the path-exclude pattern to what you (don’t) want. In the above case, kernel drivers debugging symbols will be ignored. Please note that this feature requires dpkg version 1.15.8 or greater.

Small digression

One of the first problems I had with Taras’ script is that systemtap would complain that it doesn’t know thekernel.function("ext4_get_block") probe. It is due to a very unfortunate misfeature of systemtap, where the kernel.probes refer to whatever is in the vmlinux image. Modules probes have a separate namespace, namely module("name")..

So for the ext4_get_block() function, this means you need to set a probe for either kernel.function("ext4_get_block") ormodule("ext4").function("ext4_get_block"), depending how your kernel was compiled. And you can’t even use both in your script, because systemtap will complain about either being an unknown probe…

Tracking the right thing

I was very recently pointed to a Red Hat document containing 4 useful systemtap scripts ported from dtrace, which gives me a good occasion to explain the issue at hand with the first of these scripts.

This script attempts to track I/O by following read() and write() system calls. Which is not tracking I/O, it is merely tracking some system calls (Taras’ script had the same kind of problem with read()/write() induced I/O). You could just do the very same with existing tools like strace, and that wouldn’t even require some system privileges.

To demonstrate the script doesn’t actually track the use of storage devices as the document claims, consider the following source code:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h> /* Ensure the resulting binary is decently sized */
const char dummy[1024 * 1024] = "a"; int main(int argc, char *argv[]) {
struct stat st;
char buf[65536];
int i, fd = open(argv[0], O_RDONLY);
for (i = 0; i < 100000; i++) {
read(fd, buf, 65536);
lseek(fd, 0, SEEK_SET);
read(fd, buf, 65536);
lseek(fd, 512 * 1024, SEEK_SET);
}
close(fd);
return 0;
}

All it does is reading some parts of the executable a lot of times (notice the trick to make the executable size at least 1MB). Not a lot of programs will actually do something as bold as reading the same data again and again (though we could probably be surprised), but this easily points the problem. Here is what the output of the systemtap script looks like for this program (stripping other unrelevant processes):

         process     read   KB tot    write   KB tot
test 400002 25600001 0 0

Now, do you really believe 25MB have actually been read from the disk? (Note that the read() count seems odd as well, as there should only be around 200000 calls)

Read-ahead, page cache and mmap()

What the kernel actually does is that a read() on a file is going to check the page cache first. If there is nothing corresponding to the read() request in the page cache, then it goes down to the disk, and fills page cache. But as loading only a few bytes or kilobytes from the disk could be wasteful, the kernel also reads a few more blocks ahead, apparently with some heuristic.

But read() and write() aren’t the sole way a program may hit the disk. On UNIX systems, a file can be mapped in memory with the mmap() system call, and all accesses in the memory range corresponding to this mapping will be reflected on the file. There are exceptions, depending on how the mapping is established, but let’s keep it simple. There is a lot of litterature on the subject if you want to document yourself on mmap().

The way the kernel will read from the file, however, is quite similar to that of read(), and uses page cache and read ahead. The systemtap script debunked above doesn’t track these.

I’ll skip write accesses, because for now, they haven’t been in my scope.

Tracking some I/O with systemtap

What I’ve been trying to track so far has been limited to disk reads, which happen to be the only accesses occurring on shared library files. Programs and shared libraries are first being read() from, so that the dynamic linker gets the ELF headers and knows what to load, and then are mmap()ed following PT_LOAD entries in these headers.

As far as my investigation in the Linux kernel code goes, fortunately, both accesses, before they end up actually hitting the disk, go through the __do_page_cache_readahead kernel function (this function was tracked in Taras’ script). Unfortunately, while it is called with an offset and a number of pages to read for a given file, it turns out the last pages in that number are not necessarily being read from the disk. I don’t know for sure, because the latter had an effect on my observations, but some might even already be in the page cache.

Going further down, we reach the VFS layer, which ends up being file-system specific. But fortunately, a bunch of (common) file-systems actually share page mapping code, and commonly use the mpage_readpage and mpage_readpagesfunctions, both calling do_mpage_readpage to do the actual work. And this function seems to be properly called only once for each page not in the page cache already.

If my reading of the kernel source is right, this however is not really where it ends, and do_mpage_readpagedoesn’t actually hit the disk. It seems to only gather some information (basically, a mapping between storage blocks and memory) that is going to be submitted to the block I/O layer, which itself may do some fancy stuff with it, such as reordering the I/Os depending on other requests it got from other processes, the position of the disk heads, etc.

And when I say do_mpage_readpage doesn’t actually hit the disk, I’m again simplifying the issue, because it actually might, as there might be a need to read some metadata from the disk to know where some blocks are located. But tracking metadata reads is much harder, and I haven’t investigated it.

Anyways, skipping metadata, going further down after do_mpage_readpage is hard because it’s difficult to back-track which block I/O is related to which read-ahead, corresponding to what read at what position in which file.do_mpage_readpage already has part of this problem because it is not called with any reference to the corresponding file. But __do_page_cache_readahead is.

So knowing all the above, here is my script, the one I used to get the most recent Firefox startup data you can find in my last posts:

global targetpid;
global file_path; probe begin {
targetpid = target();
} probe kernel.function("__do_page_cache_readahead") {
if (targetpid == pid())
file_path[tid()] = d_path(&$filp->f_path);
} probe kernel.function("do_mpage_readpage") {
if (targetpid == pid() && (tid() in file_path)) {
now = gettimeofday_us();
printf("%d %s %d\n", now, file_path[tid()], $page->index*4096);
}
} probe kernel.function("__do_page_cache_readahead").return {
if (targetpid == pid())
delete file_path[tid()];
}

This script needs to be used with a command given to systemtap, with the -c option, such as in the following command line:

# stap readpage.stp -c firefox

Each line of output represents a page (i.e. 4,096 bytes) being read, and contains a timestamp, the name of the file being read, and the offset in the file. As discussed above, do_mpage_readpage is not really the place the I/O actually occurs, so the timestamps are not entirely accurate, and the actual read order from disk might be slightly different, but it still is a quite reliable view in that the result should be reproducible with the same files even when they’re not located on the same blocks on disk, and provided their page cache status is the same when starting the program.

This systemtap script ignores writes, as well as metadata accesses (including but not limited to inodes, dentries, bitmap blocks and indirect blocks). It also doesn’t account accesses to files opened with the O_DIRECTflag or similar constructs (raw devices, etc.)

Read-ahead in action

Back to the small example program, my systemtap script records 102 page accesses, that is, 417,792 bytes, much less than the actual binary size on my system (1,055,747 bytes). We are still far from the 25MB figure from the other systemtap script. But we are also far from the 128KiB the program actively reads (64KiB twice, leaving a hole between the two blocks).

At this point, it is important to note that the ELF headers and program code all fit within a single page (4 KiB), and following the 1MiB section corresponding to the dummy variable, there are only 5,219 bytes of other ELF sections, including the .dynamic section. So even counting everything the dynamic linker needs to read, and the program code itself, we’re still far from what my systemtap script records.

Grouping consecutive blocks with consecutive timestamps, here is what can be observed:

Offset Length
0 16,384
983,040 73,728
16,384 262,144
524,288 65,536

(By now, you should have guessed why I wanted that big hole between the read()s ; if you want to reproduce at home, I suggest you also use my page cache flush helper)

As earlier investigations showed, the first accesses by the dynamic loader are to read the ELF headers and.dynamic section. As mentioned above, these are really small. Yet, the kernel actually reads much more: 16KiB at the beginning of the file when reading the ELF headers, and 72KiB at the end when reading the .dynamic section. Subsequent accesses from the dynamic loader are obviously already covered by these accesses.

Next accesses are those due to the program itself. The program actively reads 64KiB at the beginning of the file, then 64KiB starting at offset 524,288. For the first read, the kernel already had 16KiB in the page cache, so it didn’t read them again, but instead of reading the remainder, it reads 256KiB. For the second read, however, it only reads the requested 64KiB.

As you can see, this is far from being something like “you wanted n KiB, I’ll read that fixed amount now”.

Further testing with different read patterns (e.g. changing the hole size, read size, or reading from the dummyvariable directly instead of read()ing the binary) is left as an exercise to the reader.

2011-01-12 20:55:56+0900

p.d.op.m.o

You can leave a response, or trackback from your own site.

6 Responses to “Attempting to track I/O with systemtap”

    1. Rahul Saxena Says: 
      2011-01-13 04:56:30+0900

      Interesting analysis, — In the “test” output, everything seems to have been doubled, 400K reads instead of 200K, 25GB read instead of 12.5, very strange.

      BTW, You mentioned that current file I/O analysis is ignoring write() syscalls. — Do we know what is the proportion of read to write syscalls on start-up? (strace -c * should show the writes). If ff is doing a interleaved read/writes (on data files) then that patten can also slow-down successive reads, no? Any data on that?

    2. glandium Says: 
      2011-01-13 08:25:35+0900

      Rahul: I wanted but forgot to mention in this article that I don’t expect writes to be that much of a problem, because they don’t hit the disk immediately thanks to write-back… except when fsync() is used, which sqlite unfortunately does. There is ongoing work to prevent sqlite from doing these fsync()s.

    3. Josh Stone Says: 
      2011-01-13 21:14:12+0900

      Hi – on the kernel/module digression, stap actually can handle that with extra punctuation on the probe. Marking with “?” means a probe point is optional, so it won’t error if not found. Marking with “!” means a probe point is both optional and sufficient, so if found, further alternative will not be attempted. So, you could use:

      probe module(“ext4”).function(“ext4_get_block”)!, kernel.function(“ext4_get_block”) { […] }

    4. Frank Ch. Eigler Says: 
      2011-01-13 21:14:29+0900

      Re. kernel. vs module(“foo”)., yes, it’s a hassle that reflects the nature of kernel build perhaps with unnecessary fidelity. However, you can say

      probe module(“foo”).function(“bar”) !, module(“anotherfoo”).function(“bar”) !, kernel.function(“bar”) { }

      if you suspect “bar” may be in a module or another, or else in the kernel. The magic is in the ! (or ?) flag.

    5. Linix Tip of the Day: SystemTap « roguelazer's website Says: 
      2011-02-16 18:08:43+0900

      […] you can do with SystemTap. For example, Debian Developer/Mozilla Contributor Mike Hommey wrote up a good summary on tracking disk I/O with SystemTap. And SystemTap provides a good base of example scripts for your tracing […]

    6. Jeff Says: 
      2011-08-16 19:03:06+0900

      I’ve been using SystemTap myself for a bit and I love it. Slight problem is I’ve been having inconsistencies while running it on an SSD so beware.

Attempting to track I/O with systemtap的更多相关文章

  1. ASIHTTPREQUEST 文档

    http://blog.csdn.net/ysysbaobei/article/details/17026577 Please note that I am no longer working on ...

  2. ASIHTTPRequest学习(一)

    Creating a synchronous request 可以创建同步和异步两种方式的请求,一般情况下应当使用异步请求.使用同步请求主应用线程会锁住直至解锁为止. 创建异步请求,会在后台执行 - ...

  3. SystemTap Beginners Guide

    SystemTap 3.0 SystemTap Beginners Guide Introduction to SystemTap Edition 3.0   Red Hat, Inc. Don Do ...

  4. systemtap 2.8 news

    * What's new in version 2.8, 2015-06-17 - SystemTap has improved support for probing golang programs ...

  5. [systemtap手记]debian体系安装过程

    Debian体系,本人测试用机 Ubuntu 11.10 uname -r 查看原本的内核版本为 3.0.0-12-generic 第一步: 安装systemtap包 $ sudo apt-get i ...

  6. How to accept Track changes in Microsoft Word 2010?

    "Track changes" is wonderful and remarkable tool of Microsoft Word 2010. The feature allow ...

  7. Encountered an unexpected error when attempting to resolve tag helper directive '@addTagHelper' with value '"*, Microsoft.AspNet.Mvc.TagHelpers"'

    project.json 配置: { "version": "1.0.0-*", "compilationOptions": { " ...

  8. VS2015突然报错————Encountered an unexpected error when attempting to resolve tag helper directive '@addTagHelper' with value 'Microsoft.AspNet.Mvc.Razor.TagHelpers.UrlResolutionTagHelper

    Encountered an unexpected error when attempting to resolve tag helper directive '@addTagHelper' with ...

  9. IEEE/ACM ASONAM 2014 Industry Track Call for Papers

    IEEE/ACM International Conference on Advances in Social Network Analysis and Mining (ASONAM) 2014 In ...

随机推荐

  1. ACM_递推题目系列之一涂色问题(递推dp)

    递推题目系列之一涂色问题 Time Limit: 2000/1000ms (Java/Others) Problem Description: 有排成一行的n个方格,用红(Red).粉(Pink).绿 ...

  2. 向Linus学习,让代码具有good taste

    在最近关于 Linus Torvalds 的一个采访中,这位 Linux 的创始人,在采访过程中大约 14:20 的时候,提及了关于代码的 “good taste”.good taste?采访者请他展 ...

  3. C#基础 结构体 枚举类型

    结构体:就是一个自定义的集合,里面可以放各种类型的元素,用法大体跟集合一样. 一.定义的例子: struct student { public int nianling; public int fen ...

  4. 专题九:实现类似QQ的即时通信程序

    引言: 前面专题中介绍了UDP.TCP和P2P编程,并且通过一些小的示例来让大家更好的理解它们的工作原理以及怎样.Net类库去实现它们的.为了让大家更好的理解我们平常中常见的软件QQ的工作原理,所以在 ...

  5. 网上商城 Incorrect datetime value: '' for column 'ordertime' at row 1

    今天在做商城项目的[提交订单]功能的时候,向数据库插入数据报错:Incorrect datetime value: '' for column 'ordertime' at row 1 public ...

  6. 【hive】hive表很大的时候查询报错问题

    线上hive使用环境出现了一个奇怪的问题,跑一段时间就报如下错误: FAILED: SemanticException MetaException(message:Exception thrown w ...

  7. Laravel5.1学习笔记22 Eloquent 调整修改

    Eloquent: Mutators Introduction Accessors & Mutators Date Mutators Attribute Casting Introductio ...

  8. Python爬虫+颜值打分,5000+图片找到你的Mrs. Right

        一见钟情钟的不是情,是脸 日久生情生的不是脸,是情 项目简介 本项目利用Python爬虫和百度人脸识别API,针对简书交友专栏,爬取用户照片(侵删),并进行打分. 本项目包括以下内容: 图片爬 ...

  9. 跨站脚本攻击XXS(Cross Site Scripting)修复方案

    今天突然发现,网站被主页莫名奇妙的出现了陌生的广告. 通过排查发现是跨站脚本攻击XXS(Cross Site Scripting).以下为解决方案. 漏洞类型: Cross Site Scriptin ...

  10. jboss之启动加载过程详解

    今天看了看jboss的boot.log和server.log日志,结合自己的理解和其他的资料,现对jboss的启动和加载过程做出如下总结: boot.xml是服务器的启动过程的日志,不涉及后续的操作过 ...