Linux Debugging: SystemTap

Installation and Configuration

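On Ubuntu, SystemTap installed directly with apt-get install does not work properly: it complains about missing debug information, or fails while compiling the probe code.

1. Follow the fix described on the official website.

2. Rebuild the kernel yourself, then build SystemTap by hand. After that it works normally.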

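The SystemTap build instructions say little beyond where to download the source. Pick a version; I chose 2.7, the latest at the time.

After downloading, unpack the archive and run: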

./configure

一般来说会提示缺少组件。Systemtap最先应该是redhat开发的，所以需要的包名称ubuntu不能直接用来apt-get

列出几个自己碰到的依赖问题：

configure: error: missing gnu /usr/bin/msgfmt

apt-get install gettext

configure: error: missing elfutils development headers/libraries (install elfutils-devel, libebl-dev, libdw-dev and/or libebl-devel)

This can be fixed with apt-get install libdw-dev

configure: error: in `/root/systemtap-2.7':

configure: error: C++ preprocessor "/lib/cpp" fails sanity check

Fix: apt-get install g++

Usage Examples

Basic Usage

See the official SystemTap tutorial for details. These are my notes.

hello world

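SystemTap translates a script into a kernel module, loads it, and runs the predefined actions, which are triggered by events. The user specifies which actions run on which events. Below is a SystemTap hello world; its begin probe fires once when the module is loaded, at the start of the session.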

root@userver:~# stap hello-world.stp

hello world

root@userver:~# cat hello-world.stp

probe begin

{

    print ("hello world\n")

    exit ()

}

With the -v option, the individual steps are shown:

root@userver:~# stap -v hello-world.stp

Pass 1: parsed user script and  library script(s) using 66544virt/37432res/4324shr/33908data kb, in 120usr/10sys/127real ms.

Pass 2: analyzed script:  probe(s),  function(s),  embed(s),  global(s) using 67204virt/38136res/4512shr/34568data kb, in 0usr/0sys/4real ms.

Pass 3: translated to C into "/tmp/stapyZxhXI/stap_847497c1de7927412685a2282f37c57d_881_src.c" using 67204virt/39028res/5232shr/34568data kb, in 0usr/0sys/0real ms.

Pass 4: compiled C into "stap_847497c1de7927412685a2282f37c57d_881.ko" in 1000usr/590sys/1582real ms.

Pass 5: starting run.

hello world

Pass 5: run completed in 10usr/20sys/472real ms.

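Running the same script again is much faster, because SystemTap reuses the cached compiled module.

A script can also be made to run for a fixed length of time: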

root@userver:~# stap strace-open.stp

cat() open ("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC)

cat() open ("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC)

cat() open ("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC)

cat() open ("iw.c", O_RDONLY)

root@userver:~# cat strace-open.stp

probe syscall.open

{

    printf("%s(%d) open (%s)\n", execname(), pid(), argstr);

}

probe timer.ms(4000) # after 4 seconds, the value used in the official tutorial

{

    exit()

}

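This is the result of running a cat command while the script was active; the script records which processes invoke the open system call.

How to Trace

Choosing Probe Points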

`begin`	The startup of the systemtap session.
`end`	The end of the systemtap session.
`kernel.function("sys_open")`	The entry to the function named `sys_open` in the kernel.
`syscall.close.return`	The return from the `close` system call.
`module("ext3").statement(0xdeadbeef)`	The addressed instruction in the `ext3` filesystem driver.
`timer.ms(200)`	A timer that fires every 200 milliseconds.
`timer.profile`	A timer that fires periodically on every CPU.
`perf.hw.cache_misses`	A particular number of CPU cache misses have occurred.
`procfs("status").read`	A process trying to read a synthetic file.
`process("a.out").statement("*@main.c:200")`	Line 200 of the `a.out` program.

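Global: probe begin {} and probe end {} fire at the start and at the end of the tracing session.

Functions: kernel.function("sys_open") {} runs the handler when the named kernel function is entered. sys_open can be replaced by any other function, such as ext4_release_file (which runs when a file is closed).

System calls: syscall.close fires when the close system call executes, and the other system calls work the same way. Using the syscall.* probes is simpler than probing the kernel function directly, because the system call functions are defined through macros.

Modifiers:

Inline: kernel.function("xx").inline {} fires only at inlined instances of the function.

Call: kernel.function("xx").call {} fires when the function is actually called (inlined instances are excluded).

Return: kernel.function("xx").return {} fires when the function returns.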
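The .call and .return modifiers are often used as a pair. The following is a minimal sketch, not taken from the tutorial: it assumes kernel debuginfo is available and picks vfs_read (used later in these notes) purely as an illustration. The 5-second run time is arbitrary.

```systemtap
# Count calls to vfs_read per process and report failed returns.
global calls

probe kernel.function("vfs_read").call {
    calls[execname()]++
}

probe kernel.function("vfs_read").return {
    # $return holds the function's return value inside a .return probe
    if ($return < 0)
        printf("%s(%d): %s failed with %d\n",
               execname(), pid(), ppfunc(), $return)
}

probe timer.s(5) {    # stop after 5 seconds (arbitrary)
    exit()
}

probe end {
    # busiest processes first
    foreach (name in calls-)
        printf("%-16s %d\n", name, calls[name])
}
```

Run it as root with stap; the per-process counts are printed when the session ends.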

Formatted Output

In printf, %s formats a string and %d formats a number.

`tid()`	The id of the current thread.
`pid()`	The process (task group) id of the current thread.
`uid()`	The id of the current user.
`execname()`	The name of the current process.
`cpu()`	The current cpu number.
`gettimeofday_s()`	Number of seconds since epoch.
`get_cycles()`	Snapshot of hardware cycle counter.
`pp()`	A string describing the probe point being currently handled.
`ppfunc()`	If known, the function name in which this probe was placed.
`$$vars`	If available, a pretty-printed listing of all local variables in scope.
`print_backtrace()`	If possible, print a kernel backtrace.
`print_ubacktrace()`	If possible, print a user-space backtrace.

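1. print_backtrace is particularly handy: it prints the kernel call stack.

2. gettimeofday_s returns the time in seconds; gettimeofday_ms and gettimeofday_us return it in milliseconds and microseconds.

3. thread_indent produces per-thread indentation for output. It behaves like a thread-local counter: the argument is a delta applied to that counter, positive when entering a function and negative when leaving it, which yields an indented call tree. Below is an example similar to the one in the tutorial: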

probe kernel.function("*@fs/open.c").call {

    printf("%s -> %s(%s)\n", thread_indent(4), ppfunc(), $$parms);

}

probe kernel.function("*@fs/open.c").return {

    printf("%s <- %s\n", thread_indent(-4), ppfunc());

}

Partial output:

 prltoolsd():    -> SyS_open(filename=0x40fa78 flags=0x0 mode=0x6118a0)

      prltoolsd():        -> do_sys_open(dfd=0xffffffffffffff9c filename=0x40fa78 flags=0x8000 mode=0x18a0)

     prltoolsd():            -> finish_open(file=0xffff88002ff1a000 dentry=0xffff88009a978840 open=0x0 opened=0xffff88002f89fdec)

     prltoolsd():                -> do_dentry_open(f=0xffff88002ff1a000 open=0x0 cred=0xffff880140efe300)

     prltoolsd():                    -> generic_file_open(inode=0xffff88002f71a820 filp=0xffff88002ff1a000)

     prltoolsd():                    <- generic_file_open

     prltoolsd():                <- do_dentry_open

     prltoolsd():            <- finish_open

     prltoolsd():            -> open_check_o_direct(f=0xffff88002ff1a000)

     prltoolsd():            <- open_check_o_direct

     prltoolsd():        <- do_sys_open

     prltoolsd():    <- SyS_open

      prltoolsd():    -> SyS_close(fd=0x5)

      prltoolsd():        -> filp_close(filp=0xffff88002ff1a000 id=0xffff880148cb5c80)

     prltoolsd():        <- filp_close

     prltoolsd():    <- SyS_close
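The gettimeofday_* functions listed above are commonly combined with a per-thread array to measure how long a call takes. The script below is a sketch under my own assumptions: the read system call and the 3-second run time are illustrative choices, and t_start is a name introduced here.

```systemtap
# Measure the duration of each read() system call in microseconds.
global t_start

probe syscall.read {
    # remember when this thread entered read()
    t_start[tid()] = gettimeofday_us()
}

probe syscall.read.return {
    t0 = t_start[tid()]
    if (t0) {    # skip returns whose entry was not seen
        printf("%s(%d) read took %d us\n",
               execname(), pid(), gettimeofday_us() - t0)
        delete t_start[tid()]
    }
}

probe timer.s(3) {    # stop after 3 seconds (arbitrary)
    exit()
}
```

Keying the array by tid() keeps concurrent threads from overwriting each other's start time, and deleting the entry on return keeps the array small.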

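More Practical Examples

Analysis

Variables are local by default, so variables inside one probe handler are not shared with another. To use a global variable, declare it at the top with the global keyword. Variables need no type declaration: the type (string or integer) is inferred, and conversions between the two are possible but must be written explicitly. Strings are concatenated with ., as in PHP and Perl. Control-flow statements are essentially the same as in C. Below is an example from the tutorial: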

global count_jiffies, count_ms;

probe timer.jiffies(100) {

    count_jiffies++;

}

probe timer.ms(100) {

    count_ms++;

}

probe timer.ms(12345) {

    # both counters tick once per 100 units, so the ratio estimates jiffies per second

    hz = (1000 * count_jiffies) / count_ms;

    printf("jiffies:ms ratio: %d:%d = %d\n", count_jiffies, count_ms, hz);

}
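The rules about locals, globals, explicit conversion and string concatenation can be tried without touching the kernel. This is a small sketch of my own (the variable names and values are arbitrary); it uses only a begin probe, so it needs no debug information.

```systemtap
global total    # shared between probes because it is declared global

probe begin {
    n = 42                              # local, inferred as integer
    label = "n=" . sprintf("%d", n)     # integer to string must be explicit
    m = strtol("17", 10)                # string to integer, base 10
    total = n + m

    if (total > 50)                     # C-like control flow
        println(label . " total=" . sprintf("%d", total))

    exit()
}
```

Writing "n=" . n directly would be rejected as a type mismatch, which is why sprintf is used for the conversion.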

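Target Variables

These variables are read from the context in which the probe handler runs, so the handler can directly use the parameters and other variables of the traced function. Here is an example: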

probe kernel.function("filp_close") {

    printf("%s %d: %s(%s:%d)\n",

        execname(),

        pid(),

        ppfunc(),

        kernel_string($filp->f_path->dentry->d_iname),

        $filp->f_path->dentry->d_inode->i_ino);

}

The output looks like this:

bash : filp_close(:)

bash : filp_close(:)

bash : filp_close(:)

bash : filp_close(:)

a.out : filp_close(:)

a.out : filp_close(ld.so.cache:)

a.out : filp_close(libc-2.19.so:)

a.out : filp_close(data.out:)

a.out : filp_close(:)

a.out : filp_close(:)

a.out : filp_close(:)
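Target variables depend on the kernel version and on what the debug information exposes, so a script can break when a parameter is renamed or optimized away. One way to guard against that, sketched here under the assumption that vfs_read has a count parameter on the running kernel, is the @defined() test with $$parms as a fallback. The 2-second run time is arbitrary.

```systemtap
probe kernel.function("vfs_read") {
    if (@defined($count))
        # the parameter exists on this kernel: print it directly
        printf("%s(%d) vfs_read count=%d\n", execname(), pid(), $count)
    else
        # otherwise fall back to the generic parameter dump
        printf("%s(%d) vfs_read %s\n", execname(), pid(), $$parms)
}

probe timer.s(2) {    # stop after 2 seconds (arbitrary)
    exit()
}
```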

Functions

A function is defined as function name(arg1, arg2) { return something }, much like in JavaScript.
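As a quick illustration, here are two hypothetical helper functions (add and greet are names made up for this sketch). Types are normally inferred, but they can also be written out with the name:type notation.

```systemtap
# types inferred from use: both arguments and the result are integers
function add(a, b) {
    return a + b
}

# explicit types on the return value and the parameter
function greet:string(name:string) {
    return "hello, " . name
}

probe begin {
    printf("%d %s\n", add(2, 3), greet("stap"))
    exit()
}
```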

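Arrays

An array in SystemTap is really a hash map, and multi-dimensional keys are supported (hashmap[key1, key2...] = value). Arrays must be global and have a fixed capacity: it can be declared up front, otherwise a default limit applies. Adding elements beyond the capacity raises an error: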

# sizes, keys and loop bounds in this script are illustrative values

global hashmap[3]

global multimap[8]

global countmap[10]

probe begin {

    hashmap[1] = "a";

    hashmap[2] = "c";

    hashmap[3] = "last";

    # A fourth element would exceed the declared size and fail at run time:

    # ERROR: Array overflow, check size limit (3) near identifier 'hashmap'

    # hashmap[4] = "exceed."

    multimap[1, "init"] = "important"

    multimap[2, "swap"] = "more import"

    for (i = 0; i < 10; i++) {

        countmap[i] = i * 2;

    }

}

probe timer.ms(100) {

    exit();

}

probe end {

    printf("-----------------------------\n")

    printf("exist: %s, %s, %s\n", hashmap[1], hashmap[2], hashmap[3]);

    printf("!exist: %s\n", hashmap[4]);

    printf("-----------------------------\n")

    printf("exist: %s\n", multimap[1, "init"]);

    printf("!exist: %s\n", multimap[1, "haha"]);

    printf("--------------sorted by key asc-------------\n")

    foreach([a+] in countmap) {

        printf("countmap[%d] = %d\n", a, countmap[a]);

    }

    printf("--------------sorted by key desc-------------\n")

    foreach([a-] in countmap) {

        printf("countmap[%d] = %d\n", a, countmap[a]);

    }

    printf("--------------sorted by value desc-------------\n")

    foreach([a] in countmap-) {

        printf("countmap[%d] = %d\n", a, countmap[a]);

    }

}

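Without a sort suffix, foreach walks the array in no guaranteed order. Adding + or - after a key sorts by that key in ascending or descending order; adding + or - after the array name sorts by value instead. When there is a single key, the [] in foreach can be omitted.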
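A few more array operations are worth knowing alongside foreach: the in membership test, delete, and the limit clause. The script below is a minimal sketch with made-up values; it needs only a begin probe.

```systemtap
global countmap[8]    # capacity is an arbitrary illustrative value

probe begin {
    for (i = 0; i < 5; i++)
        countmap[i] = i * 10

    if (3 in countmap)        # membership test on the key
        delete countmap[3]    # remove a single element

    # largest values first, at most two entries
    foreach (k in countmap- limit 2)
        printf("countmap[%d] = %d\n", k, countmap[k])

    exit()
}
```

With these values the loop visits keys 4 and 2, since key 3 has been deleted.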

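Statistics (Aggregates)

Aggregate variables are updated with the <<< operator, which adds a sample. According to the tutorial, the samples are accumulated in per-CPU storage, which reduces contention. The accumulated data cannot be read directly; it is extracted with functions such as @avg (mean of the samples), @sum (total of the samples) and @count (number of samples).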

global hitcount[];

probe kernel.function("__schedule") {

    hitcount[execname()] <<< 1;

}

probe timer.ms(1000) {    # run time is an illustrative value

    exit();

}

probe end {

    foreach (prog in hitcount) {

        printf("%15s : %-6d\n", prog, @count(hitcount[prog]));

    }

}

Result:

root@userver:~/stp# stap schedule-stat.stp

      swapper/ :

  rs:main Q:Reg :

      rcu_sched :

   kworker/:1H :

    kworker/: :

     watchdog/ :

        rcuos/ :

    kworker/: :

  systemd-udevd :

      swapper/ :

        rcuos/ :

  kworker/u64: :

    jbd2/sda1- :

    migration/ :

     khugepaged :

      prltoolsd :

     watchdog/ :

      in:imklog :

         stapio :

    ksoftirqd/ :

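When every sample is 1, the extractor functions have little to show. They become very useful when the samples differ from call to call. Here is another example, which measures the amount of data requested through vfs_read: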

global data_count[]

probe begin {

    print("start profiling.");

}

probe kernel.function("vfs_read") {

    data_count[execname()] <<< $count;

}

probe timer.ms(500) {    # progress dot; interval is an illustrative value

    print(".");

}

probe timer.ms(10000) { # 10 seconds (illustrative value)

    exit();

}

probe end {

    print("\n");

    foreach (prog in data_count) {

        printf("%15s : avg:%-8d cnt:%-12d sum:%-12d\n",

            prog,

            @avg(data_count[prog]),

            @count(data_count[prog]),

            @sum(data_count[prog]));

    }

}

Output:

root@userver:~/stp# stap vfs-read-stat.stp

start profiling.....................

            top : avg:     cnt:         sum:

      in:imklog : avg:     cnt:         sum:

           sshd : avg:    cnt:           sum:

         stapio : avg:   cnt:          sum:

          acpid : avg:       cnt:           sum:

           bash : avg:       cnt:            sum:

  systemd-udevd : avg:      cnt:            sum:
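Besides @avg, @count and @sum, SystemTap also provides @min, @max and histogram extractors such as @hist_log. The following sketch reuses the vfs_read probe and its $count parameter from the example above; the variable name sizes and the 5-second run time are my own choices.

```systemtap
global sizes

probe kernel.function("vfs_read") {
    sizes <<< $count    # one sample per call: the requested byte count
}

probe timer.s(5) {      # stop after 5 seconds (arbitrary)
    exit()
}

probe end {
    # the extractors raise an error on an empty aggregate, so check first
    if (@count(sizes)) {
        printf("min=%d max=%d avg=%d\n",
               @min(sizes), @max(sizes), @avg(sizes))
        print(@hist_log(sizes))    # power-of-two histogram
    }
}
```

The histogram gives a quick picture of the distribution that a single average would hide.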

Tapset

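A tapset is a collection of SystemTap script files, located in /usr/share/systemtap/tapset.

Symbol Resolution

When a script references a symbol that is not defined, SystemTap searches the tapset directory. That directory also contains subdirectories named after kernel architectures or kernel versions. The search goes from specific to general, much like route selection for IP addresses: the most specific match wins; failing that, the definitions in the general scripts are used; if nothing is found, an error is reported. For reasons I have not figured out, following the tutorial's steps still gave me a "not found" error.

Probe Point Aliases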

global groups

probe syscallgroup.io =

    syscall.open, syscall.close, syscall.read, syscall.write

{

    groupname = "io";

}

probe syscallgroup.process =

    syscall.fork, syscall.execve

{

    groupname = "process"

}

probe syscallgroup.* {

    groups[pid(), execname() . "/" . groupname]++;

}

probe end {

    foreach ([id, eg+] in groups) {

        printf("%5d %-20s %d\n", id, eg, groups[id, eg])

    }

}
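In the script above, probe name = points { ... } defines an alias: the block runs before the handler of any probe that uses the alias, which is how groupname gets its value. The same mechanism can be tried with a smaller, self-contained sketch; mysession is a hypothetical alias name, and the aliases wrap only begin and end so no debug information is needed.

```systemtap
# each alias sets a variable that the shared handler can read
probe mysession.start = begin {
    phase = "start"
}

probe mysession.stop = end {
    phase = "stop"
}

# the wildcard expands to both aliases defined above
probe mysession.* {
    printf("phase: %s\n", phase)
}

probe timer.ms(100) {    # end the session shortly after it starts
    exit()
}
```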

Embedded C

A script that embeds C code must be run with the -g option (guru mode). The tutorial gives these safety guidelines for embedded C:

Do not dereference pointers that are not known or testable valid.
Do not call any kernel routine that may cause a sleep or fault.
Consider possible undesirable recursion, where your embedded C function calls a routine that may be the subject of a probe. If that probe handler calls your embedded C function, you may suffer infinite regress. Similar problems may arise with respect to non-reentrant locks.
If locking of a data structure is necessary, use a trylock type call to attempt to take the lock. If that fails, give up, do not block.

Header files can be included at the top of the script inside a %{ %} block.

function get_msg:string (id:long) %{

    snprintf(STAP_RETVALUE, MAXSTRINGLEN, "helloworld %ld\n(%d)\n", (long)STAP_ARG_id, MAXSTRINGLEN);

%}

probe begin {

    print(get_msg(1));    # the argument value is illustrative

}

Output:

# stap -g embedded-c.stp

helloworld

()
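To show the header block and an embedded function together, here is a sketch of my own rather than something from the tutorial. current_comm is a hypothetical function name; it reads the command name of the current task, a pointer that is always valid in this context, which keeps it within the guidelines above.

```systemtap
%{
#include <linux/sched.h>    /* struct task_struct and the current macro */
%}

function current_comm:string () %{ /* pure */
    /* copy the current task's command name into the return buffer */
    snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", current->comm);
%}

probe begin {
    println(current_comm())
    exit()
}
```

Like the example above, it has to be run with stap -g. The /* pure */ annotation tells the translator that the function has no side effects.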