https://andrestc.com/post/linux-delay-accounting/

Ever wondered how much time your program spends waiting for I/O to finish? Or whether it spends a lot of time waiting for its turn to run on one of the CPUs? Linux provides delay accounting information that may help answer these and other questions. Delay information is available for several types of resources:

waiting for a CPU (while being runnable)
completion of synchronous block I/O initiated by the task
swapping in pages
memory reclaim

This information is reported in nanoseconds, on a per-PID/TID basis, and is quite useful for finding out whether your system resources are saturated by the number of concurrent tasks running on the machine. If they are, you can either reduce the amount of work being done on the machine by removing unnecessary processes, or adjust the priority (CPU priority, I/O priority and RSS limit) of important tasks.

Accessing delay accounting information

This information is available to userspace programs through the Netlink interface, which user-space programs on Linux use to communicate with the kernel. Netlink serves many purposes: managing network interfaces, setting IP addresses and routes, and so on.

Linux ships with a source code example, getdelays, that shows how to build tools that consume this information [2]. By running ./getdelays -d -p <PID> we can see the delays a process experienced while waiting for different kinds of resources.

Side note: since this commit, Linux requires a process to run as root to be able to fetch delay accounting information. I plan to check whether this could be changed so that a user can check delay information for any process they own.

getdelays states that “It is recommended that commercial grade applications use libnl or libnetlink and use the interfaces provided by the library”, so I decided to rewrite part of getdelays using a higher level library, instead of handling parsing and other intricacies of the netlink protocol myself.

Re-implementing getdelays using libnl

I found libnl to be a quite flexible library and was able to write this example in a couple of hours (and I didn’t have any prior experience with netlink). Their documentation on the Netlink protocol had everything I needed to understand the protocol.

The source code for my implementation is available on my github and uses libnl to “talk” netlink. In the following sections I'll highlight the most important parts of the implementation.

1. Setup

sk = nl_socket_alloc();
if (sk == NULL) {
    fprintf(stderr, "Error allocating netlink socket\n");
    exit_code = 1;
    goto teardown;
}

if ((err = nl_connect(sk, NETLINK_GENERIC)) < 0) {
    fprintf(stderr, "Error connecting: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardown;
}

/* genl_ctrl_resolve returns the family id, or a negative error code */
if ((family = genl_ctrl_resolve(sk, TASKSTATS_GENL_NAME)) < 0) {
    fprintf(stderr, "Error retrieving family id: %s\n", nl_geterror(family));
    exit_code = 1;
    goto teardown;
}

The setup is pretty straightforward:

we start by calling nl_socket_alloc() to allocate a netlink socket, required for the communication with the netlink interface
the call to nl_connect connects our socket to the NETLINK_GENERIC protocol (depending on our needs, we can use other protocols like NETLINK_ROUTE for routing operations)
genl_ctrl_resolve is used to obtain the family id of taskstats. This is the “postal code” of the delay information holder

After the setup we are ready to prepare our netlink message.

2. Preparing our message

if ((err = nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM, callback_message, NULL)) < 0) {
    fprintf(stderr, "Error setting socket cb: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardown;
}

if (!(msg = nlmsg_alloc())) {
    /* nlmsg_alloc returns NULL on failure; there is no error code to print */
    fprintf(stderr, "Failed to alloc message\n");
    exit_code = 1;
    goto teardown;
}

if (!(hdr = genlmsg_put(msg, NL_AUTO_PID, NL_AUTO_SEQ, family, 0,
                        NLM_F_REQUEST, TASKSTATS_CMD_GET, TASKSTATS_VERSION))) {
    fprintf(stderr, "Error setting message header\n");
    exit_code = 1;
    goto teardownMsg;
}

if ((err = nla_put_u32(msg, TASKSTATS_CMD_ATTR_PID, pid)) < 0) {
    fprintf(stderr, "Error setting attribute: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardownMsg;
}

Libnl offers a bunch of callback hooks that can be used to handle different kinds of events. Using nl_socket_modify_cb we register a custom callback (NL_CB_CUSTOM), callback_message, that will be called for all valid messages received from the kernel (NL_CB_VALID)
nlmsg_alloc allocates a struct to hold the message that will be sent
genlmsg_put sets the message header: NL_AUTO_PID and NL_AUTO_SEQ tell libnl to fill in the pid (port) and sequence numbers required by the protocol; family is the taskstats family id; NLM_F_REQUEST indicates that this message is a request; TASKSTATS_CMD_GET is the command that we are sending to the taskstats interface, meaning that we want to get some information; and TASKSTATS_VERSION is used by the kernel to be able to handle different versions of this interface
nla_put_u32 sets an attribute, TASKSTATS_CMD_ATTR_PID, which indicates that we are asking for the taskstats information of a particular pid, provided as the attribute value
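
To make the role of genlmsg_put and nla_put_u32 a bit more concrete, here is a rough sketch of the bytes they end up assembling for this request. It is not how libnl does it internally: the three structs are local copies that mirror the kernel's nlmsghdr, genlmsghdr and nlattr layouts, build_taskstats_request is a hypothetical helper, and the family id, command, version and attribute type are passed in by the caller rather than taken from the real headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the kernel wire structures (assumption: same layout as
 * nlmsghdr, genlmsghdr and nlattr in the Linux UAPI headers). */
struct my_nlmsghdr {
    uint32_t nlmsg_len;    /* total message length, header included */
    uint16_t nlmsg_type;   /* for generic netlink: the family id */
    uint16_t nlmsg_flags;  /* e.g. NLM_F_REQUEST */
    uint32_t nlmsg_seq;    /* sequence number */
    uint32_t nlmsg_pid;    /* sender port id */
};

struct my_genlmsghdr {
    uint8_t  cmd;          /* e.g. TASKSTATS_CMD_GET */
    uint8_t  version;      /* e.g. TASKSTATS_VERSION */
    uint16_t reserved;
};

struct my_nlattr {
    uint16_t nla_len;      /* attribute length, header included */
    uint16_t nla_type;     /* e.g. TASKSTATS_CMD_ATTR_PID */
};

#define MY_NLM_F_REQUEST 0x01  /* same value as NLM_F_REQUEST */

/* Hypothetical helper: writes header + generic header + one u32 attribute
 * into buf. Returns the number of bytes written, or 0 if buf is too small. */
size_t build_taskstats_request(unsigned char *buf, size_t cap,
                               uint16_t family, uint8_t cmd, uint8_t version,
                               uint16_t attr_type, uint32_t pid,
                               uint32_t seq, uint32_t port)
{
    struct my_nlmsghdr nlh;
    struct my_genlmsghdr gh;
    struct my_nlattr attr;
    size_t total = sizeof(nlh) + sizeof(gh) + sizeof(attr) + sizeof(pid);
    size_t off = 0;

    if (cap < total)
        return 0;

    nlh.nlmsg_len = (uint32_t)total;
    nlh.nlmsg_type = family;
    nlh.nlmsg_flags = MY_NLM_F_REQUEST;
    nlh.nlmsg_seq = seq;
    nlh.nlmsg_pid = port;

    gh.cmd = cmd;
    gh.version = version;
    gh.reserved = 0;

    attr.nla_len = (uint16_t)(sizeof(attr) + sizeof(pid));
    attr.nla_type = attr_type;

    /* Every piece here is already a multiple of 4 bytes, so no padding. */
    memcpy(buf + off, &nlh, sizeof(nlh));   off += sizeof(nlh);
    memcpy(buf + off, &gh, sizeof(gh));     off += sizeof(gh);
    memcpy(buf + off, &attr, sizeof(attr)); off += sizeof(attr);
    memcpy(buf + off, &pid, sizeof(pid));   off += sizeof(pid);
    return off;
}
```

The whole request is just 28 bytes: a 16-byte netlink header, a 4-byte generic netlink header and one 8-byte attribute carrying the pid. With libnl we never touch these offsets directly, which is the point of using the library.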

3. Sending the message

if ((err = nl_send_sync(sk, msg)) < 0) {
    fprintf(stderr, "Error sending message: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardownMsg;
}

if ((err = nl_recvmsgs_default(sk)) < 0) {
    fprintf(stderr, "Error receiving message: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardownMsg;
}

nl_send_sync sends a message using the socket and waits for an ack or an error message
nl_recvmsgs_default waits for a message; this will block until the message is parsed by our callback
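
All the snippets so far jump to teardown or teardownMsg on failure, but those labels are not shown in this post. The following is only a sketch of that goto-based cleanup pattern, with malloc'd buffers standing in for the netlink socket and message; run_request and its fail_at parameter are hypothetical. In the libnl program the two cleanups would presumably be nlmsg_free and nl_socket_free.

```c
#include <stdio.h>
#include <stdlib.h>

/* Counters so the cleanup order can be observed; illustrative only. */
static int msgs_freed = 0;
static int socks_freed = 0;

/*
 * Hypothetical stand-in for the request flow above.
 * fail_at: 0 = everything succeeds,
 *          1 = a step fails after the socket exists (think nl_connect),
 *          2 = a step fails after the message exists (think nla_put_u32).
 * Returns the exit code: 0 on success, 1 on failure.
 */
int run_request(int fail_at)
{
    int exit_code = 0;
    void *sk = NULL;   /* stand-in for struct nl_sock * */
    void *msg = NULL;  /* stand-in for struct nl_msg *  */

    sk = malloc(16);
    if (sk == NULL) {
        fprintf(stderr, "Error allocating socket stand-in\n");
        return 1;                 /* nothing to release yet */
    }
    if (fail_at == 1) {
        exit_code = 1;
        goto teardown;            /* only the socket exists */
    }

    msg = malloc(16);
    if (msg == NULL) {
        exit_code = 1;
        goto teardown;
    }
    if (fail_at == 2) {
        exit_code = 1;
        goto teardownMsg;         /* message and socket both exist */
    }

    /* Success falls through both labels, releasing in reverse order. */
teardownMsg:
    free(msg);
    msgs_freed++;
teardown:
    free(sk);
    socks_freed++;
    return exit_code;
}
```

The labels are ordered so that a later failure releases everything acquired before it, while an early failure skips the resources that were never created.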

4. Receiving the response

Handling of the response is done by the callback_message function:

int callback_message(struct nl_msg *nlmsg, void *arg) {
    struct nlmsghdr *nlhdr;
    struct nlattr *nlattrs[TASKSTATS_TYPE_MAX + 1];
    struct nlattr *nlattr;
    struct taskstats *stats;
    int rem, answer;

    nlhdr = nlmsg_hdr(nlmsg);
    if ((answer = genlmsg_parse(nlhdr, 0, nlattrs, TASKSTATS_TYPE_MAX, NULL)) < 0) {
        fprintf(stderr, "error parsing msg\n");
        return -1;
    }

    if ((nlattr = nlattrs[TASKSTATS_TYPE_AGGR_PID]) || (nlattr = nlattrs[TASKSTATS_TYPE_NULL])) {
        /* the nested payload holds the pid attribute first, then the stats */
        rem = nla_len(nlattr); /* nla_next decrements this, so initialize it */
        stats = nla_data(nla_next(nla_data(nlattr), &rem));
        print_delayacct(stats);
    } else {
        fprintf(stderr, "unknown attribute format received\n");
        return -1;
    }

    return 0;
}

nlmsg_hdr returns the actual message header from nlmsg
genlmsg_parse parses a generic netlink message and stores the attributes in nlattrs
we retrieve the attribute we are interested in: TASKSTATS_TYPE_AGGR_PID
nla_data returns a pointer to the payload of an attribute; we need to use nla_next because the taskstats data is actually returned in the second nested attribute (the first one just indicates the pid/tid that the stats belong to)
print_delayacct is used to finally print the data; this function is the same one used by the Linux example.
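
The nla_next step may be easier to follow with a hand-rolled walk over a nested payload. This is a sketch under assumptions, not libnl code: my_nlattr is a local stand-in mirroring the kernel's nlattr header, MY_NLA_ALIGN reproduces the 4-byte attribute alignment, find_attr is a hypothetical helper, and the attribute type numbers a caller passes in are up to them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in mirroring the kernel's struct nlattr header. */
struct my_nlattr {
    uint16_t nla_len;   /* attribute length, header included, padding excluded */
    uint16_t nla_type;
};

#define MY_NLA_HDRLEN ((size_t)sizeof(struct my_nlattr))
/* Attributes start on 4-byte boundaries, so lengths are rounded up. */
#define MY_NLA_ALIGN(len) (((size_t)(len) + 3u) & ~(size_t)3u)

/*
 * Hypothetical helper: scans a buffer of back-to-back attributes and returns
 * a pointer to the payload of the first one whose type is `want`, storing
 * its payload length in *payload_len. Returns NULL if it is not found or if
 * an attribute header is malformed.
 */
const unsigned char *find_attr(const unsigned char *buf, size_t len,
                               uint16_t want, size_t *payload_len)
{
    size_t off = 0;

    while (off < len && len - off >= MY_NLA_HDRLEN) {
        struct my_nlattr a;

        memcpy(&a, buf + off, sizeof(a));
        if (a.nla_len < MY_NLA_HDRLEN || a.nla_len > len - off)
            return NULL;                    /* malformed attribute */
        if (a.nla_type == want) {
            *payload_len = a.nla_len - MY_NLA_HDRLEN;
            return buf + off + MY_NLA_HDRLEN;
        }
        off += MY_NLA_ALIGN(a.nla_len);     /* this is what nla_next does */
    }
    return NULL;
}
```

In the callback, the payload of the aggregate attribute is itself such a buffer: skipping the first attribute (the pid) lands on the one whose payload is the taskstats struct.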

Delay examples

Let’s try to visualize some of the delay types by crafting some examples and running getdelays.

CPU scheduling delay

In this example I’m going to use the stress utility to generate some workload on a VM that has 2 cores. Using the -c <N> flag, stress creates <N> workers (forks) running sqrt() to generate some CPU load. Since this VM has two cores, I will spin up two instances of stress with 2 workers each. By using the nice command, I’ll configure the niceness of the first instance to be 19, meaning that it will have a lower scheduling priority:

$ sudo nice -n 19 stress -c 2 & sudo stress -c 2
stress: info: [15718] dispatching hogs: 2 cpu, 0 io, 0 vm, 0 hdd
stress: info: [15719] dispatching hogs: 2 cpu, 0 io, 0 vm, 0 hdd

We can check with ps that we now have 6 processes running stress: the two parents and the two workers each of them forked:

root     15718  0.0  0.0   7480   864 pts/2    SN   14:24   0:00 stress -c 2
root     15719  0.0  0.0   7480   940 pts/2    S+   14:24   0:00 stress -c 2
root     15720  1.4  0.0   7480    92 pts/2    RN   14:24   0:01 stress -c 2
root     15721  1.4  0.0   7480    92 pts/2    RN   14:24   0:01 stress -c 2
root     15722 96.3  0.0   7480    92 pts/2    R+   14:24   2:00 stress -c 2
root     15723 99.0  0.0   7480    92 pts/2    R+   14:24   2:03 stress -c 2

With getdelays we can check their CPU delays (output truncated):

$ ./getdelays -d -p 15722
PID	15722
CPU             count     real total  virtual total    delay total  delay average
                 3386   130464000000   132726743949     4190941076          1.238ms

$ ./getdelays -d -p 15723
PID	15723
CPU             count     real total  virtual total    delay total  delay average
                 3298   136240000000   138605044896      550886724          0.167ms

$ ./getdelays -d -p 15720
PID	15720
CPU             count     real total  virtual total    delay total  delay average
                  533     2060000000     2084325118   142398167037        267.164ms

$ ./getdelays -d -p 15721
PID	15721
CPU             count     real total  virtual total    delay total  delay average
                  564     2160000000     2178262982   148843119281        263.906ms

Clearly, the workers with the high niceness value experience much higher delays (the average delay is at least around 200x higher). If we ran both instances of stress with the same niceness, we would expect roughly the same average delay across them.
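
The "delay average" column is simply the delay total divided by the count, converted from nanoseconds to milliseconds. A minimal sketch of that arithmetic, where average_delay_ms is a hypothetical helper name and not part of getdelays:

```c
#include <stdint.h>

/*
 * Average delay per occurrence, in milliseconds.
 * delay_total_ns: accumulated delay in nanoseconds (the "delay total" column)
 * count:          number of times the task was delayed (the "count" column)
 * Returns 0 when the task was never delayed, to avoid dividing by zero.
 */
double average_delay_ms(uint64_t delay_total_ns, uint64_t count)
{
    if (count == 0)
        return 0.0;
    return (double)delay_total_ns / (double)count / 1e6;
}
```

Feeding it the numbers above, 4190941076 ns over 3386 occurrences for PID 15722 gives about 1.238 ms, and 142398167037 ns over 533 occurrences for PID 15720 gives about 267.164 ms, which is where the roughly 200x difference comes from.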

Block I/O delay

Let’s try to experience some I/O delays while running a task. We can leverage docker to limit the I/O bps for our process using the --device-write-bps flag on docker run. First, let’s run dd without any limits:

docker run --name dd --rm ubuntu /bin/dd if=/dev/zero of=test.out bs=1M count=8096 oflag=direct

The following output shows the result obtained by running getdelays on the dd process:

root@ubuntu-xenial:/home/ubuntu/github/linux/tools/accounting# ./getdelays -d -p 2904
print delayacct stats ON
PID	2904
CPU             count     real total  virtual total    delay total  delay average
                 6255     1068000000     1879315354       22782428          0.004ms
IO              count    delay total  delay average
                 5988    13072387639              2ms
SWAP            count    delay total  delay average
                    0              0              0ms
RECLAIM         count    delay total  delay average
                    0              0              0ms

We can see that we are getting an average of 2ms delays for I/O.

Now, let’s use --device-write-bps to limit write I/O to 1 MB/s:

docker run --name dd --device-write-bps /dev/sda:1mb --rm ubuntu /bin/dd if=/dev/zero of=test.out bs=1M count=8096 oflag=direct

The following output shows the result of running getdelays on the process:

root@ubuntu-xenial:/home/ubuntu/github/linux/tools/accounting# ./getdelays -d -p 2705
print delayacct stats ON
listen forever
PID	2705
CPU             count     real total  virtual total    delay total  delay average
                   71       28000000       32436630         600096          0.008ms
IO              count    delay total  delay average
                   15    40163017300           2677ms
SWAP            count    delay total  delay average
                    0              0              0ms
RECLAIM         count    delay total  delay average
                    0              0              0ms

Since I/O is limited, dd takes much more time to write its output. We can see that our average I/O delay is more than 1000 times higher than before.

Side note: the --device-write-<bps,iops> docker flags use Linux cgroups v1, and those are only able to limit the amount of I/O if we open the files with the O_DIRECT, O_SYNC or O_DSYNC flags, but this deserves a blog post of its own.

Memory reclaim delay

In this example we can use, once more, the stress utility, this time with the --vm <N> flag to launch N workers running malloc/free to generate some memory allocation workload. Once again, this VM has 2 cores.

Using the default --vm-bytes, which is 256M, I was able to experience some delay on memory reclaim by running more than 2 workers. But the delay average was kept fairly small, below 1ms:

PID	15888
CPU             count     real total  virtual total    delay total  delay average
                 2799    38948000000    39507647880    19772492888          7.064ms
RECLAIM         count    delay total  delay average
                   11         278304              0ms

PID	15889
CPU             count     real total  virtual total    delay total  delay average
                 3009    38412000000    38904584951    20402080112          6.780ms
RECLAIM         count    delay total  delay average
                   22       16641801              0ms

PID	15890
CPU             count     real total  virtual total    delay total  delay average
                 2954    39172000000    39772710066    19571509440          6.625ms
RECLAIM         count    delay total  delay average
                   39        9505559              0ms

Since the 3 tasks are competing on a 2 core CPU, the CPU delays were much higher. Running with --vm-bytes with lower values produced even lower memory reclaim delays (in some cases, no delay is experienced).

Linux delays on higher level tools

Not many tools expose Linux delays to the end user, but they are available in cpustat. I’m currently working on a PR to get them into htop.
