Linux Delay Accounting
https://andrestc.com/post/linux-delay-accounting/
Ever wondered how long your program spends waiting for I/O to finish? Or whether it spends a lot of time waiting for its turn to run on one of the CPUs? Linux provides delay accounting information that may help answer these and other questions. Delay information is available for several types of resources:
- waiting for a CPU (while being runnable)
- completion of synchronous block I/O initiated by the task
- swapping in pages
- memory reclaim
This information is available in nanoseconds, on a per-PID/TID basis, and is pretty useful for finding out whether your system resources are saturated by the number of concurrent tasks running on the machine. You can then either reduce the amount of work being done on the machine by removing unnecessary processes, or adjust the priority (CPU priority, I/O priority and RSS limit) of the important tasks.
Accessing delay accounting information
This information is exposed to userspace programs through the Netlink interface, which user-space programs on Linux use to communicate with the kernel. Netlink is used for a bunch of things: managing network interfaces, setting IP addresses and routes, and so on.
Linux ships with a source code example, getdelays, showing how to build tools that consume this information [2]. By running ./getdelays -d -p <PID> we can visualize the delays experienced by a process while consuming different kinds of resources.
Side note: since this commit, Linux requires a process to run as root to be able to fetch delay accounting information. I plan to check whether this could be changed so that a user may read delay information for any process they own.
getdelays states that “It is recommended that commercial grade applications use libnl or libnetlink and use the interfaces provided by the library”, so I decided to rewrite part of getdelays using a higher-level library, instead of having to handle parsing and other intricacies of the netlink protocol myself.
Re-implementing getdelays using libnl
I found libnl to be quite a flexible library and was able to write this example in a couple of hours (without any prior experience with netlink). Their documentation on the Netlink protocol had everything I needed to understand the protocol.
The source code for my implementation is available on my GitHub and uses libnl to “talk” netlink. In the following sections I'll highlight the most important parts of the implementation.
1. Setup
/* Allocate a netlink socket. */
sk = nl_socket_alloc();
if (sk == NULL) {
    fprintf(stderr, "Error allocating netlink socket");
    exit_code = 1;
    goto teardown;
}

/* Connect the socket to the generic netlink protocol. */
if ((err = nl_connect(sk, NETLINK_GENERIC)) < 0) {
    fprintf(stderr, "Error connecting: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardown;
}

/* Resolve the numeric family id of the taskstats interface.
   genl_ctrl_resolve() returns a negative error code on failure. */
if ((family = genl_ctrl_resolve(sk, TASKSTATS_GENL_NAME)) < 0) {
    fprintf(stderr, "Error retrieving family id: %s\n", nl_geterror(family));
    exit_code = 1;
    goto teardown;
}
The setup is pretty straightforward:
- we start by calling nl_socket_alloc() to allocate a netlink socket, required for the communication over the netlink interface
- the call to nl_connect connects our socket to the NETLINK_GENERIC protocol (depending on our needs, we could use other protocols such as NETLINK_ROUTE for routing operations)
- genl_ctrl_resolve is used to obtain the family id of the taskstats interface. This is the “postal code” of the delay information holder
After the setup we are ready to prepare our netlink message.
2. Preparing our message
if ((err = nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM, callback_message, NULL)) < 0) {
    fprintf(stderr, "Error setting socket cb: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardown;
}

if (!(msg = nlmsg_alloc())) {
    fprintf(stderr, "Failed to alloc message\n");
    exit_code = 1;
    goto teardown;
}

if (!(hdr = genlmsg_put(msg, NL_AUTO_PID, NL_AUTO_SEQ, family, 0,
                        NLM_F_REQUEST, TASKSTATS_CMD_GET, TASKSTATS_VERSION))) {
    fprintf(stderr, "Error setting message header\n");
    exit_code = 1;
    goto teardownMsg;
}

if ((err = nla_put_u32(msg, TASKSTATS_CMD_ATTR_PID, pid)) < 0) {
    fprintf(stderr, "Error setting attribute: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardownMsg;
}
- Libnl offers a bunch of callback hooks that can be used to handle different kinds of events. Using nl_socket_modify_cb we register a custom callback (NL_CB_CUSTOM), callback_message, that will be called for every valid message received from the kernel (NL_CB_VALID)
- nlmsg_alloc allocates a struct to hold the message that will be sent
- genlmsg_put sets the message header: NL_AUTO_PID and NL_AUTO_SEQ tell libnl to fill in the message sequence and pid number, required by the protocol; family is the taskstats family id; NLM_F_REQUEST indicates that this message is a request; TASKSTATS_CMD_GET is the command that we are sending to the taskstats interface, meaning that we want to get some information; and TASKSTATS_VERSION is used by the kernel to handle different versions of this interface
- nla_put_u32 sets the attribute TASKSTATS_CMD_ATTR_PID, which indicates that we are asking for the taskstats information of a particular pid, provided as the attribute value (a thread-group variant is sketched right after this list)
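As an aside, and not part of the snippet above: the taskstats interface also accepts a TGID attribute, which asks for stats aggregated over a whole thread group; the reply then arrives under TASKSTATS_TYPE_AGGR_TGID instead of TASKSTATS_TYPE_AGGR_PID. The request side would simply mirror the PID case:

/* Hypothetical variation, not in the code above: ask for stats aggregated
   over a whole thread group (tgid) instead of a single task. */
if ((err = nla_put_u32(msg, TASKSTATS_CMD_ATTR_TGID, tgid)) < 0) {
    fprintf(stderr, "Error setting attribute: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardownMsg;
}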
3. Sending the message
if ((err = nl_send_sync(sk, msg)) < 0) {
    fprintf(stderr, "Error sending message: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardownMsg;
}

if ((err = nl_recvmsgs_default(sk)) < 0) {
    fprintf(stderr, "Error receiving message: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardownMsg;
}
- nl_send_sync sends a message using the socket and waits for an ack or an error message
- nl_recvmsgs_default waits for a message; this will block until the message is parsed by our callback
4. Receiving the response
Handling of the response is done by the callback_message function:
int callback_message(struct nl_msg *nlmsg, void *arg) {
    struct nlmsghdr *nlhdr;
    struct nlattr *nlattrs[TASKSTATS_TYPE_MAX + 1];
    struct nlattr *nlattr;
    struct taskstats *stats;
    int rem, answer;

    nlhdr = nlmsg_hdr(nlmsg);

    if ((answer = genlmsg_parse(nlhdr, 0, nlattrs, TASKSTATS_TYPE_MAX, NULL)) < 0) {
        fprintf(stderr, "error parsing msg\n");
        return -1;
    }

    if ((nlattr = nlattrs[TASKSTATS_TYPE_AGGR_PID]) || (nlattr = nlattrs[TASKSTATS_TYPE_NULL])) {
        stats = nla_data(nla_next(nla_data(nlattr), &rem));
        print_delayacct(stats);
    } else {
        fprintf(stderr, "unknown attribute format received\n");
        return -1;
    }

    return 0;
}
- nlmsg_hdr returns the actual message header from nlmsg
- genlmsg_parse parses a generic netlink message and stores the attributes in nlattrs
- we retrieve the attribute we are interested in: TASKSTATS_TYPE_AGGR_PID
- nla_data returns a pointer to the payload of the message; we need nla_next because the taskstats data is actually returned in the second attribute (the first one is used just to indicate that a pid/tid will be followed by some stats)
- print_delayacct is used to finally print the data; this function is the same one used by the Linux example (a rough sketch of what such a function looks like follows below)
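For reference, here is a minimal sketch of what a print_delayacct-style helper can look like. It is not the kernel's version (which prints more fields and a different layout); it only shows how the delay counters from struct taskstats in linux/taskstats.h map to the columns seen in the outputs below. Delay totals are reported in nanoseconds.

#include <stdio.h>
#include <linux/taskstats.h>

/* Sketch only: convert a total delay in nanoseconds plus an operation count
   into a per-operation average in milliseconds. */
static double avg_ms(unsigned long long total_ns, unsigned long long count)
{
    return count ? (double)total_ns / count / 1e6 : 0.0;
}

void print_delayacct(struct taskstats *t)
{
    printf("CPU     count %llu  delay total %llu  delay average %.3fms\n",
           (unsigned long long)t->cpu_count,
           (unsigned long long)t->cpu_delay_total,
           avg_ms(t->cpu_delay_total, t->cpu_count));
    printf("IO      count %llu  delay total %llu  delay average %.3fms\n",
           (unsigned long long)t->blkio_count,
           (unsigned long long)t->blkio_delay_total,
           avg_ms(t->blkio_delay_total, t->blkio_count));
    printf("SWAP    count %llu  delay total %llu  delay average %.3fms\n",
           (unsigned long long)t->swapin_count,
           (unsigned long long)t->swapin_delay_total,
           avg_ms(t->swapin_delay_total, t->swapin_count));
    printf("RECLAIM count %llu  delay total %llu  delay average %.3fms\n",
           (unsigned long long)t->freepages_count,
           (unsigned long long)t->freepages_delay_total,
           avg_ms(t->freepages_delay_total, t->freepages_count));
}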
Delay examples
Let’s try to visualize some of the delay types by crafting a few examples and running getdelays.
CPU scheduling delay
In this example I’m going to use the stress utility to generate some workload on a VM that has 2 cores. Using the -c <N> flag, stress creates <N> workers (forks) running sqrt() to generate CPU load. Since this VM has two cores, I will spin up two instances of stress with 2 workers each. Using the nice command, I’ll set the niceness of the first instance to 19, meaning it will have a lower scheduling priority:
$ sudo nice -n 19 stress -c 2 & sudo stress -c 2
stress: info: [15718] dispatching hogs: 2 cpu, 0 io, 0 vm, 0 hdd
stress: info: [15719] dispatching hogs: 2 cpu, 0 io, 0 vm, 0 hdd
We can check with ps that we now have 6 processes running stress: the two parents and their four forks:
root 15718 0.0 0.0 7480 864 pts/2 SN 14:24 0:00 stress -c 2
root 15719 0.0 0.0 7480 940 pts/2 S+ 14:24 0:00 stress -c 2
root 15720 1.4 0.0 7480 92 pts/2 RN 14:24 0:01 stress -c 2
root 15721 1.4 0.0 7480 92 pts/2 RN 14:24 0:01 stress -c 2
root 15722 96.3 0.0 7480 92 pts/2 R+ 14:24 2:00 stress -c 2
root 15723 99.0 0.0 7480 92 pts/2 R+ 14:24 2:03 stress -c 2
With getdelays we can check their CPU delays (output truncated):
$ ./getdelays -d -p 15722
PID 15722
CPU count real total virtual total delay total delay average
3386 130464000000 132726743949 4190941076 1.238ms
$ ./getdelays -d -p 15723
PID 15723
CPU count real total virtual total delay total delay average
3298 136240000000 138605044896 550886724 0.167ms
$ ./getdelays -d -p 15720
PID 15720
CPU count real total virtual total delay total delay average
533 2060000000 2084325118 142398167037 267.164ms
$ ./getdelays -d -p 15721
PID 15721
CPU count real total virtual total delay total delay average
564 2160000000 2178262982 148843119281 263.906ms
Clearly, the workers with the high niceness value experience much higher delays (their average delay is around 200x higher; the delay average is just the delay total divided by the count, e.g. 142398167037 ns / 533 ≈ 267 ms). If we run both instances of stress with the same niceness, we see roughly the same average delay across them.
Block I/O delay
Let’s try to experience some I/O delays when running a task. We can leverage docker to limit the I/O bandwidth of our process using the --device-write-bps flag on docker run. First, let’s run dd without any limits:
docker run --name dd --rm ubuntu /bin/dd if=/dev/zero of=test.out bs=1M count=8096 oflag=direct
The following output shows the result of running getdelays on the dd process:
root@ubuntu-xenial:/home/ubuntu/github/linux/tools/accounting# ./getdelays -d -p 2904
print delayacct stats ON
PID 2904
CPU count real total virtual total delay total delay average
6255 1068000000 1879315354 22782428 0.004ms
IO count delay total delay average
5988 13072387639 2ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
0 0 0ms
We can see that we are getting an average of 2ms delays for I/O.
Now, let’s use --device-write-bps to limit I/O to 1MB/s:
docker run --name dd --device-write-bps /dev/sda:1mb --rm ubuntu /bin/dd if=/dev/zero of=test.out bs=1M count=8096 oflag=direct
The following output shows the result of running getdelays on the process:
root@ubuntu-xenial:/home/ubuntu/github/linux/tools/accounting# ./getdelays -d -p 2705
print delayacct stats ON
listen forever
PID 2705
CPU count real total virtual total delay total delay average
71 28000000 32436630 600096 0.008ms
IO count delay total delay average
15 40163017300 2677ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
0 0 0ms
Since I/O is limited, dd takes much longer to write its output; we can see that the I/O delay average is about 1000 times higher than before.
Side note: the --device-write-<bps,iops> docker flags use Linux cgroups v1, which can only limit I/O on files opened with the O_DIRECT, O_SYNC or O_DSYNC flags, but that deserves a blog post of its own (a rough sketch of such an open follows below).
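To make the side note a bit more concrete, this is roughly what dd’s oflag=direct amounts to. The snippet is a simplified sketch (file name and sizes are made up for illustration): it opens a file with O_DIRECT, which bypasses the page cache and therefore makes each write visible to the cgroup v1 throttler; O_DIRECT also requires block-aligned buffers, hence posix_memalign.

#define _GNU_SOURCE           /* needed for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Open the output file for direct I/O, bypassing the page cache. */
    int fd = open("test.out", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    /* O_DIRECT requires aligned buffers (and usually aligned sizes/offsets). */
    void *buf;
    if (posix_memalign(&buf, 4096, 1 << 20))
        return 1;
    memset(buf, 0, 1 << 20);

    /* One 1 MiB direct write; with --device-write-bps set, this synchronous
       write is throttled and the wait shows up as block I/O delay. */
    ssize_t n = write(fd, buf, 1 << 20);

    free(buf);
    close(fd);
    return n == (1 << 20) ? 0 : 1;
}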
Memory reclaim delay
In this example we can use, once more, the stress utility, this time with the --vm <N> flag, which launches N workers running malloc/free to generate some memory allocation workload. Once again, this VM has 2 cores.
Using the default --vm-bytes, which is 256M, I was able to observe some memory reclaim delay by running more than 2 workers (e.g. --vm 3). The delay average stayed fairly small, though, below 1ms:
PID 15888
CPU count real total virtual total delay total delay average
2799 38948000000 39507647880 19772492888 7.064ms
RECLAIM count delay total delay average
11 278304 0ms
PID 15889
CPU count real total virtual total delay total delay average
3009 38412000000 38904584951 20402080112 6.780ms
RECLAIM count delay total delay average
22 16641801 0ms
PID 15890
CPU count real total virtual total delay total delay average
2954 39172000000 39772710066 19571509440 6.625ms
RECLAIM count delay total delay average
39 9505559 0ms
Since the 3 tasks are competing for a 2-core CPU, the CPU delays were much higher. Running with lower --vm-bytes values produced even lower memory reclaim delays (in some cases, no delay at all).
Linux delays on higher level tools
Not many tools expose Linux delay information to the end user, but it is available in cpustat. I’m currently working on a PR to get it into htop.