Tracing on Linux
The Linux tracing APIs are a relatively new addition to the kernel and one of the most powerful new features its gained in a long time. Unfortunately the plethora of terms and names for the system can be confusing, so in this follow-up to my previous post on the proc connector and socket filter, I’ll take a look at achieving the same result using tracing and hopefully unravel a little of the mystery along the way.
Rather than write a program along the way, I’ll be referring to sample code found in the kernel tree itself so you’ll want a checkout. If you’re doing any work that touches the kernel further than standard POSIX APIs, I highly recommend this anyway; it’s quite readable and once you find your way around, is the quickest way to answer questions.
Grab your checkout with git:
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
# cd linux-2.6
Tracepoints
One of the reasons there are so many terms and names is that, like most kernel systems, there are many layers and each of those layers is exposed as different developers have different requirements. An important lower layer is that of tracepoints, also known as static tracepoints. For these we’ll be looking at the code in the samples/tracepoints directory of the kernel source; kernelese documentation of the API can be found inDocumentation/trace/tracepoints.txt
A tracepoint is a placeholder function call in kernel code that the developer of that subsystem has deemed a useful point for debugging code to be able to hook into. Static refers to the fact they are fixed in point by the original developer. You can think of them as the kind of code you’d tend to guard with #if DEBUG in traditional C development, and like those statements they’re nearly free when they’re not in use except that you can turn these on and off at runtime.
The samples/tracepoints/tracepoint-sample.c file is a kernel module that creates a/proc/tracepoint-sample file, and has a couple of tracepoints coded into it by the developer. First it includes the samples/tracepoints/tp-samples-trace.h which actually declares the tracepoints.
DECLARE_TRACE(subsys_event,
TP_PROTO(struct inode *inode, struct file *file),
TP_ARGS(inode, file));
DECLARE_TRACE_NOARGS(subsys_eventb);
You can think of these as declaring the function prototypes, one trace function has two arguments: an inode and a file; the other has no arguments. And if they’re function prototypes, we need to define a function; this is done back in the main tracepoint-sample.cfile.
DEFINE_TRACE(subsys_event);
DEFINE_TRACE(subsys_eventb);
These tracepoints can now be called from the kernel code, passing the arguments that may need to be traced; remember that these have no side-effects unless enabled. The code that calls out to the tracepoints is in the my_open() function.
trace_subsys_event(inode, file);
for (i = 0; i < 10; i++)
trace_subsys_eventb();
Simple, huh? Don’t worry about the rest, this primer is simply so you can recognise tracepoints in the kernel source when you see them. I don’t expect you to go leaping around the kernel adding tracepoints and rebuilding it, unless you want to, of course.
So how do you hook into tracepoints? The answer is from other kernel code, usually in the form of a loadable module such as that defined by samples/tracepoints/tracepoint-probe-sample.c; this includes the same header file as before to get the prototypes.
#include "tp-samples-trace.h"
In the module __init function it registers two functions of its own as hooks into the tracepoint, this activates the tracepoint and turns the code in the previous module from a near no-op to a function call that will call these functions.
ret = register_trace_subsys_event(probe_subsys_event, NULL);
WARN_ON(ret);
ret = register_trace_subsys_eventb(probe_subsys_eventb, NULL);
WARN_ON(ret);
And obviously in the module __exit function we have to unregister these, otherwise we leave dangling things.
unregister_trace_subsys_eventb(probe_subsys_eventb, NULL);
unregister_trace_subsys_event(probe_subsys_event, NULL);
tracepoint_synchronize_unregister();
As to those functions, they take an argument which is a pointer to the same data as the second argument to the register call, and then otherwise take the arguments defined inDECLARE_TRACE. You can do pretty much what you want here, the example simply extracts the filename and outputs it with a a printk()
static void probe_subsys_event(void *ignore,
struct inode *inode, struct file *file){
path_get(&file->f_path);
dget(file->f_path.dentry);
printk(KERN_INFO "Event is encountered with filename %s\n",
file->f_path.dentry->d_name.name);
dput(file->f_path.dentry);
path_put(&file->f_path);
}
So that’s tracepoints; they’re a low-level method for a kernel developer to pick places in their code that may be useful for debugging and a method for loadable kernel code such as modules to hook into those places.
Trace Events (Kernel API)
So you know about tracepoints, and you’ve almost certainly heard about Trace Events, but what’s the difference? Well firstly trace events are actually built on tracepoints, you can think of them as a higher level API – and that’s why I covered tracepoints first. Secondly trace events are usable from userspace! we don’t need to write kernel modules to be able to hook into them, but obviously we can only read data this way.
In fact, since they’re tracepoints with extra benefits, you wouldn’t think anyone would use the basic tracepoints at all, and you’d be right! A git grep DECLARE_TRACE in a current kernel tree will show you that the only user of the raw tracepoint macros is actually the trace events system.
Since everyone just defines trace events, a primer on the kernel-side will be useful, so we’ll be looking at the code in samples/trace_events and if you want to read the userspace API documentation, it’s in Documentation/trace/events.txt
Just one source file and header file this time, first we’ll look at the headersamples/trace_events/trace-events-sample.h; this seems pretty complicated at first, but almost all of this is boiler-plate code that gets copied into every trace events header. The important bit is the TRACE_EVENT macro:
TRACE_EVENT(foo_bar,
TP_PROTO(char *foo, int bar),
TP_ARGS(foo, bar),
TP_STRUCT__entry(
__array( char, foo, 10 )
__field( int, bar )
),
TP_fast_assign(
strncpy(__entry->foo, foo, 10);
__entry->bar = bar;
),
TP_printk("foo %s %d", __entry->foo, __entry->bar)
);
The first part of this looks just like DECLARE_TRACE, and that’s no accident, we’re still declaring a tracepoint too so this will give us a function with the prototype declared in TP_PROTO and argument names in TP_ARGS.
The TP_STRUCT__entry and TP_fast_assign bits are new though. As well as declaring a tracepoint, trace events come with the equivalent “loadable module” code that copies data from the arguments of the function into a struct that can be examined from userspace.TP_STRUCT__entry defines that structure, and TP_fast_assign is C code that should quickly copy data into that structure.
So we’ve declared a tracepoint, we’ve defined a structure containing an array of 10 char and an int, and we’ve written C code to copy from the tracepoint arguments into that structure. The last bit of the trace event is TP_printk, which does exactly what you’d expect. Since the most common (at least, first) use of a trace event is going to be to output something, this macro defines a format string for that printk() call.
Back in the samples/trace_events/trace-events-sample.c file, we include this header but first set a special define. This is only set once in the entire kernel source, and this results in all of the functions being defined; i.e. TRACE_EVENT becomes DEFINE_TRACE rather than DECLARE_TRACE.
#define CREATE_TRACE_POINTS
#include "trace-events-sample.h"
All other users of this header simply include the header.
From here on in the source, the trace event is just a tracepoint and is called in the same way: as a function call.
trace_foo_bar("hello", cnt);
That’s a kernel-side primer, you should be able to git grep through the source and find trace events. But now it’s time to get into the fun bit and look at the userspace API for dealing with them; remember if you want anything more complicated, they’re just tracepoints so you can write kernel modules and hook into them as before.
Trace Events (Userspace API)
We’re in userspace now, so you can leave the kernel source directory, but you do need to be root and you may need to mount a filesystem. This is because some distributions (like Ubuntu) have an allergy to debugging (seriously, they even disable things like gdb -p).
Try and change into the /sys/kernel/debug/tracing directory.
# cd /sys/kernel/debug/tracing
If this fails, you’ll need to mount the debugfs filesystem and try again.
# mount -t debugfs none /sys/kernel/debug
# cd /sys/kernel/debug/tracing
With that done, we should make sure tracing is enabled.
# cat tracing_enabled
1
If that’s 0, enable it:
# echo 1 > tracing_enabled
So we’ve enabled tracing, but what can we trace? Trace events are exposed in the events sub-directory in two levels, the first is the subsystem and the second are the trace events themselves. Since in my last blog post we were looking at tracing forks, it would be great if there were trace events for doing just that. This is where it helps to be able to git greparound the kernel source and recognise trace events, so you at least know the right subsystem name; and it turns out that the sched subsystem has exactly the events we wanted.
deathspank tracing# ls events/sched
enable sched_process_exit/ sched_stat_sleep/
filter sched_process_fork/ sched_stat_wait/
sched_kthread_stop/ sched_process_free/ sched_switch/
sched_kthread_stop_ret/ sched_process_wait/ sched_wait_task/
sched_migrate_task/ sched_stat_iowait/ sched_wakeup/
sched_pi_setprio/ sched_stat_runtime/ sched_wakeup_new/
sched_process_fork sounds exactly right, if you look at it, it’s a directory that contains four files:enable, filter, format and id. I bet you can guess how to enable fork tracing, but if not:
# cat events/sched/sched_process_fork/enable
0
# echo 1 > events/sched/sched_process_fork/enable
Pretty painless, so go ahead and run a few things, and turn the tracing off again when you’re done.
# echo 0 > events/sched/sched_process_fork/enable
Now let’s look at the result of our trace; recall that every trace event comes with a freeprintk() of formatted output? We can find the output from those in the top-level trace file.
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
zsh-2667 [001] 6658.716936: sched_process_fork: comm=zsh pid=2667 child_comm=zsh child_pid=2748
So for each process fork, we get the parent and child process ids along with the process name. Pretty much exactly what we want!
There’s plenty to play around with using this API, as you’ve probably noticed you can enable entire subsystems or all events using the enable files at the subsystem and events-levels; there’s also a set_event file at the tracing level which can be used to make batch changes to tracing, see the kernel documentation for more details.
You’re probably wondering though what happened to the rest of the struct, especially if there fields that aren’t included in the default printk(). You can examine the struct format by reading the format file of a trace event, and you can use this with the filter file to exclude events you’re not interested in. Again anything I write here would be just duplicating the kernel documentation, so go read Documentation/trace/events.txt
Perf
After a little bit of playing you’ll realise that not only is tracing not limited to your current process or shell, you’ll get events for processes you’re not intersted in, but also events for subsystems you’re not interested in if other processes are doing traces of their own. There’s also only one global filter for the entire trace events system, so other users or processes doing tracing, could override yours.
There’s an even higher-level that we can use to work around those problems, the perf tool. Originally designed as a userspace component to the performance counters system, it’s grown a wide variety of extra features one of which is the ability to work with kernel tracepoints as an input source.
Since trace events are tracepoints, these count!
So let’s say we want to record the forks made by a process we run, without fear of contamination from other processes on the system or other users performing tracing. Using perf we can simply run
# perf record -e sched:sched_process_fork record bash
And run as many commands as we like in that shell. When the shell exits, perf will write the results of the tracing to a perf.data file for analysis.
# exit
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (~735 samples) ]
We can analyse this later using various perf sub-commands, the simplest of which is an argument-less perf script which outputs the equivalent of reading the trace file.
# perf script
bash-3141 [003] 10201.049939: sched_process_fork: comm=bash pid=3141 child_comm=bash child_pid=3142
:3142-3142 [001] 10201.050391: sched_process_fork: comm=bash pid=3142 child_comm=bash child_pid=3143
Conclusion
As an administrator debugging their system, or a developer trying to understand the performance or events timeline of their work, perf is perfect. It’s a very well documented tool with all of the bells and whistles you need for tracing a wide variety of events.
Unfortuantely the API between perf and the kernel is a private one; the perf tool source is shipped as part of the kernel source, and they are version-mated with each other.
Recall that the topic of the previous blog post was to write a program to follow forks, rather than doing it as a system administrator.
If we want to write software to do it, the lower (but still high) level trace events API seems a better bet. There are a wide range of applications of this API, for example the ureadaheadprogram in an Ubuntu system uses it to trace the open() and exec() syscalls the system performs during boot so it knows which files to cache for faster boot times. But it’s easy for another process, or a user, to interfere with the results of this tracing so it’s not ideal for our purpose either.
Finally the tracepoints API is too low-level, writing a kernel module and building and maintaining it for each kernel version is just not on the cards.
So it would appear we’re at a dead-end for using tracing to do what we want. That’s not the end of the story though; there are other tracing tools such as kprobes and ftrace that I haven’t covered yet. Unfortunately this blog post has gotten a little too long, and the coverage of tracepoints, trace events and perf was worthwhile in of itself, so we’ll have to pick those up next time!
Tracing on Linux的更多相关文章
- Linux BPF/bcc for Oracle Tracing
Luca Canali on 26 May 2016 Topic: In this post you will find a short discussion and pointers to the ...
- linux申请strace ,lstrace, ptrace, dtrace
ltrace命令是用来跟踪进程调用库函数的情况. ltrace -hUsage: ltrace [option ...] [command [arg ...]]Trace library calls ...
- linux工具大全
Linux Performance hi-res: observability + static + perf-tools/bcc (svg)slides: observabilityslides: ...
- 使用 ftrace 调试 Linux 内核【转】
转自:http://blog.csdn.net/adaptiver/article/details/7930646 使用 ftrace 调试 Linux 内核,第 1 部分 http://blog.c ...
- 使用 ftrace 调试 Linux 内核,第 3 部分
内核头文件 include/linux/kernel.h 中描述了 ftrace 提供的工具函数的原型,这些函数包括 trace_printk.tracing_on/tracing_off 等.本文通 ...
- 使用 ftrace 调试 Linux 内核,第 2 部分
ftrace 操作概述 使用 ftrace 提供的跟踪器来调试或者分析内核时需要如下操作: 切换到目录 /sys/kernel/debug/tracing/ 下 查看 available_tracer ...
- Linux 驱动开发
linux驱动开发总结(一) 基础性总结 1, linux驱动一般分为3大类: * 字符设备 * 块设备 * 网络设备 2, 开发环境构建: * 交叉工具链构建 * NFS和tftp服务器安装 3, ...
- 利用ftrace跟踪内核static tracepoint——实例writeback event
摘要:和很多linux内核子系统一样,static tracepoint有很多层次,其中某个层次都展示给不同层次的开发者来满足他们的不同需求.关于linux tracepoint的详细信息,我们可以在 ...
- 容器中的诊断与分析3——live diagnosis——lldb
windows下,我们对于.net程序发生Crash,资源泄露,死锁等问题的分析,有神器windbg .net core程序运行在linux上时,该怎么进行对对Core Dump文件进行分析呢?今天介 ...
随机推荐
- [golang note] 匿名组合
匿名组合 golang也提供了继承机制,但采用组合的文法,因此称为匿名组合.与其他语言不同, golang很清晰地展示出类的内存布局是怎样的. • 非指针方式组合 ▶ 基本语法 // 基类 type ...
- sublime2常用插件
Package Control 1. 安装方法 • 下载地址:https://sublime.wbond.net/Package%20Control.sublime-package • 将下载下来的P ...
- Charles 抓包工具的使用
抓包工具有很多,目前用过的有Charles, Fiddler, burpsuite.下面主要是Charles 的应用实例. 一. 用Charles抓包 1. PC 抓包 打开Charles, 确保“录 ...
- Linux系统——sed命令
sed命令精讲 cat工作原理 cat命令只想文件,把文件打开后,将文件中所有内容一次性读到内存中,从内存里一次性输出到屏幕上,此时可能存在内存装不下的情况,因此cat命令只能查看小文件内容,不能读取 ...
- 一个简单的ssm项目
准备说明jdk.tomcat.idea.mave配置请看我前两篇,这里说下mysql以及我的mysql图像化工具 数据库 项目概览 项目构建--------搭建一个简单的mave的web项目,构建步骤 ...
- 【android】 如何把gif图片下载到本地
以上图片大家可以看到,虽然是个jpg格式的文件,但是本质上是个动图. 但是发现在咱的图片模块下,本地存储的图片只有一帧,问题出在哪里呢? http获取到的byte[]数据是没问题的 断点跟踪了下,发现 ...
- linux下多线程之pthread_detach(pthread_self())
写个碰到的问题,记录下自己的技术之路点滴pthread_detach(pthread_self())linux线程执行和windows不同,pthread有两种状态joinable状态和unjoina ...
- iOS开发之plist文件操作
之前在想用代码去实现很多界面上能用interface builder可以拖进去的功能,现在想想真是够无知的啊.不仅效率低下,而且对于代码的维护带来不少麻烦,这段时间一直在反思看了design+code ...
- C++文件操作:打开文件和写入文件
如果程序的运行结果仅仅显示在屏幕上,当要再次查看结果时,必须将程序重新运行一遍:而且,这个结果也不能被保留. 如果希望程序的运行结果能够永久保留下来,供随时查阅或取用,则需要将其保存在文件中. 文件分 ...
- MLlib1.6指南笔记
MLlib1.6指南笔记 http://spark.apache.org/docs/latest/mllib-guide.html spark.mllib RDD之上的原始API spark.ml M ...