随笔-性能分析-perf off-cpu 原理

off-cpu:

view
# test in centos7

echo 1 | sudo tee /proc/sys/kernel/sched_schedstats
perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -p $(cat /var/run/cs.pid) -g -o perf.data.raw sleep 10
perf inject -v -s -i perf.data.raw -o perf.data # The step is merging sched_start and sched_switch events

perf script -F comm,pid,tid,cpu,time,period,event,ip,sym,dso,trace | awk '
    NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
    NF > 1 && NF <= 4 && period_ms > 0 { print $2 }
    NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }' | \
    /opt/FlameGraph/stackcollapse.pl | \
    /opt/FlameGraph/flamegraph.pl --countname=ms --title="Off-CPU Time Flame Graph" --colors=io > offcpu.svg

原理:

1、Application stack traces don't change while off-CPU.
This means we only need to measure the stack trace once, either at the beginning or end of the off-CPU period. The end is usually easier, since you're recording the time interval then anyway. Here is tracing pseudocode for measuring off-CPU time with stack traces:

on context switch finish:
    sleeptime[prev_thread_id] = timestamp
    if !sleeptime[thread_id]
        return
    delta = timestamp - sleeptime[thread_id]
    totaltime[pid, execname, user stack, kernel stack] += delta
    sleeptime[thread_id] = 0

on tracer exit:
    for each key in totaltime:
        print key
        print totaltime[key]

Q: sched_switch事件似乎已经够用了,为什么还要sched_stat_sleep呢? 因为sched_switch虽然能知道花了多少时间,但是并不清楚时间都花在哪里,而且时间包含了involuntary context switches,而我们感兴趣的由于block event而发生的voluntary context switches。怎么获取? -> sched_stat_sleep, 因为block event发生时,程序会相应的进入TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE状态,获取伪代码如下

update_stats_dequeue
    if (tsk->state & TASK_INTERRUPTIBLE)
        __schedstat_set(se->statistics.sleep_start, rq_clock(rq_of(cfs_rq)));
    if (tsk->state & TASK_UNINTERRUPTIBLE)
        __schedstat_set(se->statistics.block_start, rq_clock(rq_of(cfs_rq)));

update_stats_enqueue
    update_stats_enqueue_sleeper(cfs_rq, se); if (flags & ENQUEUE_WAKEUP)
        if (sleep_start)
            trace_sched_stat_sleep(tsk, delta);
        if (block_start)
            trace_sched_stat_iowait(tsk, delta); if (tsk->in_iowait)
            trace_sched_stat_blocked(tsk, delta);

综上:

perf inject:

通过关联以下2个tracepoint, 我们可以得到任务sleep的时长及其对应的callchain

  • sched_iowait/sleep/blocked. 获得sleep的时长
  • sched_switch. 获得调用栈
commit 26a031e136f4f8dc82c64df48cca0eb3b5d3eb4f
Author: Andrew Vagin <avagin@openvz.org>
Date:   Tue Aug 7 16:56:04 2012 +0400

    perf inject: Merge sched_stat_* and sched_switch events

    You may want to know where and how long a task is sleeping. A callchain
    may be found in sched_switch and a time slice in stat_iowait, so I add
    handler in perf inject for merging this events.

    My code saves sched_switch event for each process and when it meets
    stat_iowait, it reports the sched_switch event, because this event
    contains a correct callchain. By another words it replaces all
    stat_iowait events on proper sched_switch events.

参考:

  1. Linux perf_events Off-CPU Time Flame Graph
  2. https://perf.wiki.kernel.org/index.php/Tutorial
  3. PingCAP 工欲性能调优,必先利其器(2)- 火焰图
  4. 透过Tracepoint理解内核 - 调度器框架和性能 - J.FW的文章 - 知乎
posted @ 2024-08-31 23:48  LiYanbin  阅读(63)  评论(0编辑  收藏  举报