Linux Debugging: SystemTap

Installation and Configuration

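On Ubuntu, SystemTap installed directly with apt-get install does not work out of the box: it complains about missing debug information, or fails while compiling the probe code. There are two ways around this:

1. Follow the fix described on the official website.

2. Rebuild the kernel yourself, then build SystemTap from source by hand. After that it works normally.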

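The SystemTap build instructions say little beyond where to download the source. Pick a version; I chose 2.7, the latest at the time.

After downloading and unpacking the source, run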

./configure

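This usually fails with complaints about missing components. SystemTap appears to have originated at Red Hat, so the package names suggested in the error messages cannot always be passed to apt-get as-is on Ubuntu.

Here are a few dependency problems I ran into: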

configure: error: missing gnu /usr/bin/msgfmt

Fix: apt-get install gettext

configure: error: missing elfutils development headers/libraries (install elfutils-devel, libebl-dev, libdw-dev and/or libebl-devel)

Fix: apt-get install libdw-dev

configure: error: in `/root/systemtap-2.7':
configure: error: C++ preprocessor "/lib/cpp" fails sanity check

Fix: apt-get install g++

Usage Examples

Basic Usage

See the official SystemTap tutorial for details. These are my notes.

hello world

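SystemTap translates a script into a kernel module, loads it, and runs the actions the script defines. Actions are triggered by events, and the script specifies which action runs on which event. Below is a SystemTap hello world; the begin probe fires once, when the module is loaded at the start of the session.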

root@userver:~# stap hello-world.stp 
hello world
root@userver:~# cat hello-world.stp 
probe begin
{
    print ("hello world\n")
    exit ()
}

With the -v option you can see each step in detail:

root@userver:~# stap -v hello-world.stp 
Pass 1: parsed user script and 106 library script(s) using 66544virt/37432res/4324shr/33908data kb, in 120usr/10sys/127real ms.
Pass 2: analyzed script: 1 probe(s), 1 function(s), 0 embed(s), 0 global(s) using 67204virt/38136res/4512shr/34568data kb, in 0usr/0sys/4real ms.
Pass 3: translated to C into "/tmp/stapyZxhXI/stap_847497c1de7927412685a2282f37c57d_881_src.c" using 67204virt/39028res/5232shr/34568data kb, in 0usr/0sys/0real ms.
Pass 4: compiled C into "stap_847497c1de7927412685a2282f37c57d_881.ko" in 1000usr/590sys/1582real ms.
Pass 5: starting run.
hello world
Pass 5: run completed in 10usr/20sys/472real ms.

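Running the same script again is much faster, because SystemTap reuses the cached, already compiled module.

A script can also be made to run for a fixed period of time by exiting from a timer probe: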

root@userver:~# stap strace-open.stp 
cat(24307) open ("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC)
cat(24307) open ("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC)
cat(24307) open ("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC)
cat(24307) open ("iw.c", O_RDONLY)
root@userver:~# cat strace-open.stp 
probe syscall.open 
{
    printf("%s(%d) open (%s)\n", execname(), pid(), argstr);
}
probe timer.ms(4000) # after 4 seconds
{
    exit()
}

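This output was produced by running a cat command while the script was active; the script records which processes made the open system call.

How to Trace

Choosing Probe Points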

`begin`	The startup of the systemtap session.
`end`	The end of the systemtap session.
`kernel.function("sys_open")`	The entry to the function named `sys_open` in the kernel.
`syscall.close.return`	The return from the `close` system call.
`module("ext3").statement(0xdeadbeef)`	The addressed instruction in the `ext3` filesystem driver.
`timer.ms(200)`	A timer that fires every 200 milliseconds.
`timer.profile`	A timer that fires periodically on every CPU.
`perf.hw.cache_misses`	A particular number of CPU cache misses have occurred.
`procfs("status").read`	A process trying to read a synthetic file.
`process("a.out").statement("*@main.c:200")`	Line 200 of the `a.out` program.

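Global: probe begin {} and probe end {} run at the start and at the end of the whole tracing session.

Function: kernel.function("sys_open") {} runs the handler when the named kernel function is entered. sys_open can be replaced with any other function, such as ext4_release_file (which runs when a file is closed).

System call: syscall.close fires when the close system call is made; other system calls work the same way. Because the kernel defines its system call entry functions through macros, their actual symbol names are not always what you would expect (the trace further down shows SyS_open, for example), so the syscall.* probes are the more convenient way to hook system calls.

Modifiers:

Inline: kernel.function("xx").inline {} fires only at inlined instances of the function.

Call: kernel.function("xx").call {} fires when the function is entered through a real call (inlined instances are excluded).

Return: kernel.function("xx").return {} fires when the function returns.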
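
As a rough sketch of how these modifiers combine, the script below pairs a .call probe with a .return probe on the same function and prints the return value through $return. vfs_read is picked here only for illustration, the five-second timer is an arbitrary choice, and the script assumes kernel debug information is available.

```systemtap
# Fires on every real (non-inlined) entry to vfs_read.
probe kernel.function("vfs_read").call {
    printf("%s(%d) -> %s\n", execname(), pid(), ppfunc())
}

# Fires when vfs_read returns; $return holds its return value.
probe kernel.function("vfs_read").return {
    printf("%s(%d) <- %s = %d\n", execname(), pid(), ppfunc(), $return)
}

# Stop the session after 5 seconds.
probe timer.s(5) {
    exit()
}
```

On a busy machine this prints a lot, so it is probably best run briefly or with the output redirected to a file.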

Formatted Output

printf: %s formats a string and %d formats a number.

`tid()`	The id of the current thread.
`pid()`	The process (task group) id of the current thread.
`uid()`	The id of the current user.
`execname()`	The name of the current process.
`cpu()`	The current cpu number.
`gettimeofday_s()`	Number of seconds since epoch.
`get_cycles()`	Snapshot of hardware cycle counter.
`pp()`	A string describing the probe point being currently handled.
`ppfunc()`	If known, the function name in which this probe was placed.
`$$vars`	If available, a pretty-printed listing of all local variables in scope.
`print_backtrace()`	If possible, print a kernel backtrace.
`print_ubacktrace()`	If possible, print a user-space backtrace.

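1. print_backtrace is quite handy: it prints the kernel call stack.

2. gettimeofday_s returns the time in seconds; gettimeofday_ms and gettimeofday_us return it in milliseconds and microseconds.

3. thread_indent produces indentation for per-process/per-thread output. It behaves like a thread-local counter, and its argument is a delta applied to that counter: pass a positive value on function entry and a negative one on exit, and the output is indented to follow the call nesting. Here is an example similar to the one in the tutorial: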

probe kernel.function("*@fs/open.c").call {
    printf("%s -> %s(%s)\n", thread_indent(4), ppfunc(), $$parms);
}

probe kernel.function("*@fs/open.c").return {
    printf("%s <- %s\n", thread_indent(-4), ppfunc());
}

Partial output:

     0 prltoolsd(1230):    -> SyS_open(filename=0x40fa78 flags=0x0 mode=0x6118a0)
     7 prltoolsd(1230):        -> do_sys_open(dfd=0xffffffffffffff9c filename=0x40fa78 flags=0x8000 mode=0x18a0)
    20 prltoolsd(1230):            -> finish_open(file=0xffff88002ff1a000 dentry=0xffff88009a978840 open=0x0 opened=0xffff88002f89fdec)
    24 prltoolsd(1230):                -> do_dentry_open(f=0xffff88002ff1a000 open=0x0 cred=0xffff880140efe300)
    29 prltoolsd(1230):                    -> generic_file_open(inode=0xffff88002f71a820 filp=0xffff88002ff1a000)
    31 prltoolsd(1230):                    <- generic_file_open
    33 prltoolsd(1230):                <- do_dentry_open
    34 prltoolsd(1230):            <- finish_open
    36 prltoolsd(1230):            -> open_check_o_direct(f=0xffff88002ff1a000)
    38 prltoolsd(1230):            <- open_check_o_direct
    41 prltoolsd(1230):        <- do_sys_open
    42 prltoolsd(1230):    <- SyS_open
     0 prltoolsd(1230):    -> SyS_close(fd=0x5)
     6 prltoolsd(1230):        -> filp_close(filp=0xffff88002ff1a000 id=0xffff880148cb5c80)
    15 prltoolsd(1230):        <- filp_close
    17 prltoolsd(1230):    <- SyS_close
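
The time functions from item 2 are mostly useful for measuring durations. The following is a minimal sketch, assuming the syscall.read probe pair is available on your kernel; the array name start_us and the five-second run time are made up for this illustration.

```systemtap
global start_us    # entry timestamp per thread, in microseconds

probe syscall.read {
    start_us[tid()] = gettimeofday_us()
}

probe syscall.read.return {
    t = start_us[tid()]
    if (t) {   # skip returns whose entry happened before the script started
        printf("%s(%d) read took %d us\n",
               execname(), pid(), gettimeofday_us() - t)
        delete start_us[tid()]
    }
}

probe timer.s(5) {
    exit()
}
```

Keying the timestamps by tid() keeps concurrent reads in different threads from overwriting each other.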

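More Practical Examples

Analysis

Variables are local by default, so each probe handler has its own variables and nothing is shared between handlers. To share data, declare the variable up front with the global keyword. Variable types are inferred rather than declared, and a value is either a string or an integer; conversions between the two must be written out explicitly. Strings are concatenated with the . operator, as in PHP and Perl. Control-flow statements are essentially the same as in C. Here is an example from the tutorial: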

global count_jiffies, count_ms;

probe timer.jiffies(100) {
    count_jiffies++;
}

probe timer.ms(100) {
    count_ms++;
}

probe timer.ms(10000) {
    hz = (1000 * count_jiffies) / count_ms;
    printf("jiffies:ms ratio: %d:%d = %d\n", count_jiffies, count_ms, hz);
}
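
The string handling and explicit conversions mentioned above can be sketched as follows. sprintf and strtol are the tapset functions I would reach for here; the variable names are only illustrative.

```systemtap
probe begin {
    n = 42
    # integer -> string must be explicit; "." concatenates strings
    s = "answer: " . sprintf("%d", n)
    printf("%s\n", s)

    # string -> integer, parsed in base 10
    m = strtol("17", 10) + 1
    printf("%d\n", m)

    # C-style control flow
    for (i = 0; i < 3; i++) {
        if (i % 2 == 0)
            printf("%d is even\n", i)
        else
            printf("%d is odd\n", i)
    }
    exit()
}
```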

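Target Variables

Target variables are read from the context in which the probe handler runs, so a handler can directly use things like the parameters of the traced function. An example: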

probe kernel.function("filp_close") {
    printf("%s %d: %s(%s:%d)\n", 
        execname(), 
        pid(), 
        ppfunc(), 
        kernel_string($filp->f_path->dentry->d_iname),
        $filp->f_path->dentry->d_inode->i_ino);    
}

Output:

bash 1724: filp_close(:24831)
bash 1724: filp_close(:24831)
bash 31781: filp_close(:24831)
bash 31781: filp_close(:24831)
a.out 31781: filp_close(1:4)
a.out 31781: filp_close(ld.so.cache:788003)
a.out 31781: filp_close(libc-2.19.so:3936539)
a.out 31781: filp_close(data.out:1460185)
a.out 31781: filp_close(1:4)
a.out 31781: filp_close(1:4)
a.out 31781: filp_close(1:4)

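Functions

Functions are defined as function name(arg1, arg2) { return something }, much like in JavaScript.

Arrays

An array in SystemTap is really a hash map, and multi-dimensional keys are supported (hashmap[key1, key2...] = value). Arrays must be declared global. A capacity can be given in the declaration (without one, the default limit MAXMAPENTRIES applies), and an error is raised once the number of elements exceeds it: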
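
A small sketch of a user-defined function follows. The helper human_size is hypothetical, invented for this note; the optional :string and :long annotations declare the return and argument types, which SystemTap would otherwise infer.

```systemtap
# Hypothetical helper: format a byte count with a coarse unit.
function human_size:string (bytes:long) {
    if (bytes >= 1024 * 1024)
        return sprintf("%d MiB", bytes / (1024 * 1024))
    if (bytes >= 1024)
        return sprintf("%d KiB", bytes / 1024)
    return sprintf("%d B", bytes)
}

probe begin {
    printf("%s %s %s\n",
           human_size(512), human_size(4096), human_size(3 * 1024 * 1024))
    exit()
}
```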

global hashmap[3]
global multimap[3]

global countmap[5]

probe begin {
    hashmap[1] = "a";
    hashmap[3] = "c";
    hashmap[100] = "last";

    #
    # ERROR: Array overflow, check size limit (3) near identifier 'hashmap' at array-demo.stp:8:2
    # hashmap[222] = "exceed."
    #

    multimap[1,"init"] = "important"
    multimap[0, "swap"] = "more import"


    for (i = 0; i<5; i++) {
        countmap[i] = i * 10;
    }
}

probe timer.ms(1000) {
    exit();
}

probe end {

    printf("-----------------------------\n")
    printf("exist: %s, %s, %s\n", hashmap[1], hashmap[3], hashmap[100]);
    printf("!exist: %s\n", hashmap[121]);
    printf("-----------------------------\n")
    printf("exist: %s\n", multimap[1, "init"]);
    printf("!exist: %s\n", multimap[1, "haha"]);
    printf("--------------sorted by key asc-------------\n")
    foreach([a+] in countmap) {
        printf("countmap[%d] = %d\n", a, countmap[a]);
    }
    printf("--------------sorted by key desc-------------\n")
    foreach([a-] in countmap) {
        printf("countmap[%d] = %d\n", a, countmap[a]);
    }
    printf("--------------sorted by value desc-------------\n")
    foreach([a] in countmap-) {
        printf("countmap[%d] = %d\n", a, countmap[a]);
    }
    
}

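foreach iterates over the keys of the hash map. To get a sorted iteration, append + (ascending) or - (descending) to the key; to sort by value instead, append the sign to the array name. Without a suffix the order should not be relied on. With a single key, the [] in foreach can be omitted.

Statistics (Aggregates)

Aggregate variables are updated with the <<< operator, which adds a value to the aggregate. According to the tutorial, the data is kept in per-CPU storage, which reduces contention. The aggregate is then read through extractor functions such as @avg (mean of the added values), @sum (their total), and @count (how many values were added); it cannot be accessed directly.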
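
A few more array operations are worth a sketch: membership tests with in, delete, and limit in foreach. The array name seen is illustrative, and the script assumes the size in the global declaration may be left out. Reading a missing key just yields an empty string or 0, so in is the way to tell "absent" apart from "zero".

```systemtap
global seen

probe begin {
    seen["a"] = 3
    seen["b"] = 1
    seen["c"] = 2

    # membership test instead of reading a possibly missing key
    if ("a" in seen)
        printf("a is present\n")
    if (!("z" in seen))
        printf("z is absent\n")

    # remove a single element
    delete seen["b"]

    # sort by value, descending, and stop after the first element
    foreach (k in seen- limit 1)
        printf("largest: %s = %d\n", k, seen[k])

    exit()
}
```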

global hitcount[10000];

probe kernel.function("__schedule") {
    hitcount[execname()] <<< 1;
}

probe timer.ms(10000) {
    exit();
}

probe end {
    foreach (prog in hitcount) {
        printf("%15s : %-6d\n", prog, @count(hitcount[prog]));
    }
}

Result:

root@userver:~/stp# stap schedule-stat.stp 
      swapper/0 : 2     
  rs:main Q:Reg : 40    
      rcu_sched : 7     
   kworker/0:1H : 6     
    kworker/0:2 : 10    
     watchdog/0 : 2     
        rcuos/1 : 3     
    kworker/1:1 : 10    
  systemd-udevd : 3     
      swapper/1 : 579   
        rcuos/0 : 2     
  kworker/u64:0 : 12    
    jbd2/sda1-8 : 7     
    migration/1 : 1     
     khugepaged : 1     
      prltoolsd : 40    
     watchdog/1 : 2     
      in:imklog : 446   
         stapio : 51    
    ksoftirqd/1 : 27

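Adding 1 every time does not show what the aggregate functions can do; they become really useful when each added value is different. The next example tallies the amount of data requested in calls to vfs_read ($count is the function's size argument):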

global data_count[10000]

probe begin {
    print("start profiling.");
}

probe kernel.function("vfs_read") {
    data_count[execname()] <<< $count;
}

probe timer.ms(1000) {
    print(".");
}

probe timer.ms(20000) { # 20 seconds
    exit();
}

probe end {
    print("\n");
    foreach (prog in data_count) {
        printf("%15s : avg:%-8d cnt:%-12d sum:%-12d\n", 
            prog,
            @avg(data_count[prog]),
            @count(data_count[prog]),
            @sum(data_count[prog]));
    }
}

Output:

root@userver:~/stp# stap vfs-read-stat.stp
start profiling.....................
            top : avg:1050     cnt:1034         sum:1086553     
      in:imklog : avg:8095     cnt:3398         sum:27510134    
           sshd : avg:16384    cnt:20           sum:327680      
         stapio : avg:129057   cnt:122          sum:15745032    
          acpid : avg:24       cnt:46           sum:1104        
           bash : avg:88       cnt:9            sum:793         
  systemd-udevd : avg:128      cnt:2            sum:256
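
Besides @avg, @count and @sum, aggregates also offer @min, @max and histogram extractors. The sketch below uses a single scalar aggregate (named sizes here for illustration) and prints a logarithmic histogram of the requested read sizes; the ten-second run time is arbitrary.

```systemtap
global sizes    # one aggregate for the whole system

probe kernel.function("vfs_read") {
    sizes <<< $count
}

probe timer.s(10) {
    exit()
}

probe end {
    # extractors fail on an empty aggregate, so check first
    if (@count(sizes)) {
        printf("min=%d max=%d avg=%d\n",
               @min(sizes), @max(sizes), @avg(sizes))
        print(@hist_log(sizes))
    }
}
```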

Tapset

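A tapset is a collection of SystemTap script files, located in /usr/share/systemtap/tapset.

Symbol Resolution

When a script references a symbol that it does not define, SystemTap searches the tapset directory. That directory also contains subdirectories named after kernel architectures or kernel versions. The search goes from specific to general, much like route selection for IP: a definition in the most specific matching directory wins, otherwise the definition from the general scripts is used, and if nothing is found an error is reported. For reasons I have not figured out, following the tutorial still gave me a "not found" error.

Probe Point Aliases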

A probe alias gives a new name to one or more existing probe points, and its body runs as a prologue before the handler of any probe that uses the alias. The script below groups system calls under two aliases and counts hits per process and group.

global groups

probe syscallgroup.io = 
    syscall.open, syscall.close, syscall.read, syscall.write 
{
    groupname = "io";    
}

probe syscallgroup.process = 
    syscall.fork, syscall.execve
{
    groupname = "process"
}

probe syscallgroup.* {
    groups[pid(), execname() . "/" . groupname]++;
}

probe end {
    foreach ([id, eg+] in groups) {
        printf("%5d %-20s %d\n", id, eg, groups[id, eg])
    }
}

Embedded C

A script that embeds C code must be run with the -g option. The tutorial gives these safety rules:

Do not dereference pointers that are not known or testable valid.
Do not call any kernel routine that may cause a sleep or fault.
Consider possible undesirable recursion, where your embedded C function calls a routine that may be the subject of a probe. If that probe handler calls your embedded C function, you may suffer infinite regress. Similar problems may arise with respect to non-reentrant locks.
If locking of a data structure is necessary, use a trylock type call to attempt to take the lock. If that fails, give up, do not block.

Header files can be included at the top of the script inside a %{ %} block.

function get_msg:string (id:long) %{
    snprintf(STAP_RETVALUE, MAXSTRINGLEN, "helloworld %ld\n(%d)\n", (long)STAP_ARG_id, MAXSTRINGLEN);
%}

probe begin {
    print(get_msg(123));
}

Output:

# stap -g embedded-c.stp 
helloworld 123
(512)
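
The example above needs no extra headers. As a sketch of the %{ %} header block in use, the script below includes linux/sched.h and reads current->comm from embedded C. The function name current_comm is made up for this note, and the script assumes the default kernel-module backend; like any embedded-C script it must be run with -g.

```systemtap
%{
#include <linux/sched.h>   /* struct task_struct and the current macro */
%}

/* Returns the command name of the task the probe fired in. */
function current_comm:string () %{ /* pure */
    snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", current->comm);
%}

probe begin {
    printf("%s\n", current_comm())
    exit()
}
```

This only reads a field of the current task, which is always valid in probe context, so it stays within the safety rules listed earlier.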

posted @ 2015-05-18 16:39 卖程序的小歪