Debugging the kernel using Ftrace - part 3
Probably the most powerful tracer derived from Ftrace is the function tracer. It has the ability to trace practically every function in the kernel. It can be run not just for debugging or analyzing, but also to learn and observe the flow of the Linux kernel.
Two previous articles, Debugging the Linux Kernel Using Ftraceparts I and II, explain some of the basic features of Ftrace and the function tracer; this article is written with the assumption that the reader has already read them. As with the previous articles, the examples in this article expect that the user has already changed to the debug file system tracing directory. The kernel configuration options that are need to be enabled to follow the examples in this article are:
- CONFIG_FUNCTION_TRACER
- CONFIG_DYNAMIC_FTRACE
- CONFIG_FUNCTION_GRAPH_TRACER
Note, the CONFIG_HAVE_FUNCTION_TRACER, CONFIG_HAVE_DYNAMIC_FTRACE, and CONFIG_HAVE_FUNCTION_GRAPH_TRACER options are enabled when the architecture supports the corresponding feature. Do not confuse them with the listed options. The features are only enabled when the listed configuration options are enabled and not when only the _HAVE_ options are.
As shown in the previous articles, here is a quick example of how to enable the function tracer.
[tracing]# echo function > current_tracer [tracing]# cat trace <idle>-0 [000] 1726568.996435: hrtimer_get_next_event <-get_next_timer_interrupt <idle>-0 [000] 1726568.996436: _spin_lock_irqsave <-hrtimer_get_next_event <idle>-0 [000] 1726568.996436: _spin_unlock_irqrestore <-hrtimer_get_next_event <idle>-0 [000] 1726568.996437: rcu_needs_cpu <-tick_nohz_stop_sched_tick <idle>-0 [000] 1726568.996438: enter_idle <-cpu_idle ...
The above shows you the process name (<idle>), PID (0) the CPU that the trace executed on ([000]), a time-stamp in seconds with the decimal places down to microseconds (1726568.996435) the function being traced (hrtimer_get_next_event) and the parent that called that function (get_next_timer_interrupt).
Function filtering
Running the function tracer can be overwhelming. The amount of data may be vast, and very hard to get a hold of by the human brain. Ftrace provides a way to limit what functions you see. Two files exist that let you limit what functions are traced:
set_ftrace_filter set_ftrace_notrace
These filtering features depend on the CONFIG_DYNAMIC_FTRACEoption. As explained in the previous articles, when this configuration is enabled all of the mcount caller locations are stored and at boot time are converted into NOPs. These locations are saved and used to enable tracing when the function tracer is activated. But this also has a nice side effect: not all functions must be enabled. The above files will determine which functions gets enabled and which do not.
When any function is listed in the set_ftrace_filter, only those functions will be traced. This will help the performance of the system when the trace is active. Tracing every function incurs a large overhead, but when using the set_ftrace_filter, only those functions listed in that file will have the NOPs changed to call the tracer. Depending on which functions are being traced, just having a couple of hundred functions enabled is hardly noticeable.
The set_ftrace_notrace file is the opposite of set_ftrace_filter. Instead of limiting the trace to a set of functions, functions listed in set_ftrace_notrace will not be traced. Some functions show up quite often and not only does tracing these functions slow down the system, they can fill up the trace buffer and make it harder to analyze the functions you care about. Functions such as rcu_read_lock() and spin_lock()fall into this category.
The process to add functions to these files typically uses bash redirection. Using the symbol '>' will remove all existing functions in the file and add what is being echoed into the file. Appending to the file using '>>' will keep the existing functions and add new ones.
[tracing]# echo sys_read > set_ftrace_filter [tracing]# cat set_ftrace_filter sys_read [tracing]# echo sys_write >> set_ftrace_filter [tracing]# cat set_ftrace_filter sys_write sys_read [tracing]# echo sys_open > set_ftrace_filter [tracing]# cat set_ftrace_filter sys_open
To remove all functions just echo a blank line into the filter file.
[tracing]# echo sys_read sys_open sys_write > set_ftrace_notrace [tracing]# cat set_ftrace_notrace sys_open sys_write sys_read [tracing]# echo > set_ftrace_notrace [tracing]# cat set_ftrace_notrace [tracing]#
The functions listed in these files can also be set on the kernel command line. The options ftrace_notrace and ftrace_filter will preset these files by listing a comma delimited set of functions.
ftrace_notrace=rcu_read_lock,rcu_read_unlock,spin_lock,spin_unlock ftrace_filter=kfree,kmalloc,schedule,vmalloc_fault,spurious_fault
Functions added by the kernel command line set what will be in the corresponding filter files. These options only pre-load the files, functions can still be removed or added using the bash redirection as explained above.
The functions listed in set_ftrace_notrace take precedence. That is, if a function is listed in both set_ftrace_notrace and set_ftrace_filter, that function will not be traced.
Wildcard filters
A list of functions that can be added to the filter files is shown in the available_filter_functions file. This list of functions was derived from the list of stored mcount callers previously mentioned.
[tracing]# cat available_filter_functions | head -8 _stext do_one_initcall run_init_process init_post name_to_dev_t create_dev T.627 set_personality_64bit
You can grep this file and redirect the result into one of the filter files:
[tracing]# grep sched available_filter_functions > set_ftrace_filter [tracing]# cat set_ftrace_filter | head -8 save_stack_address_nosched mce_schedule_work smp_reschedule_interrupt native_smp_send_reschedule sys32_sched_rr_get_interval sched_avg_update proc_sched_set_task sys_sched_get_priority_max
Unfortunately, adding lots of functions to the filtering files is slow and you will notice that the above grep took several seconds to execute. This is because each function name written into the filter file will be processed individually. The above grep produces over 300 function names. Each of those 300 names will be compared (using strcmp()) against every function name in the kernel, which is quite a lot.
[tracing]# wc -l available_filter_functions 24331 available_filter_functions
So the grep above caused set_ftrace_filter to generate over 300 * 24331 (7,299,300) comparisons!
Fortunately, these files also take wildcards; the following glob expressions are valid:
- value* - Select all functions that begin with value.
- *value* - Select all functions that contain the text value.
- *value - Select all functions that end with value.
The kernel contains a rather simple parser, and will not processvalue*value in the expected way. It will ignore the secondvalue and select all functions that start with value regardless of what it ends with. Wildcards passed to the filter files are processed directly for each available function, which is much faster than passing in individual functions in a list.
Because the star (*) is also used by bash, it is best to wrap the input with quotes:
[tracing]# echo set* > set_ftrace_filter [tracing]# cat set_ftrace_filter #### all functions enabled #### [tracing]# echo 'set*' > set_ftrace_filter [tracing]# cat set_ftrace_filter | head -5 set_personality_64bit set_intr_gate_ist set_intr_gate set_intr_gate set_tsc_mode
The filters can also select only those functions that belong to a specific module by using the 'mod' command in the input to the filter file:
[tracing]# echo ':mod:tg3' > set_ftrace_filter [tracing]# cat set_ftrace_filter |head -8 tg3_write32 tg3_read32 tg3_write_flush_reg32 tw32_mailbox_flush tg3_write32_tx_mbox tg3_read32_mbox_5906 tg3_write32_mbox_5906 tg3_disable_ints
This is very useful if you are debugging a single module, and only want to see the functions that belong to that module in the trace.
In the earlier articles, enabling and disabling recording to the ring buffer was done using the tracing_on file and the tracing_on() andtracing_off() kernel functions. But if you do not want to recompile the kernel, and you want to stop the tracing at a particular function,set_ftrace_filter has a method to do so.
The format of the command to have the function trace enable or disable the ring buffer is as follows:
function:command[:count]
This will execute the command at the start of thefunction. The command is eithertraceon or traceoff, and an optional count can be added to have the command only execute a given number of times. If the count is left off (including the leading colon) then the command will be executed every time the function is called.
A while back, I was debugging a change to the kernel I made that was causing a segmentation fault to some programs. I was having a hard time catching the trace, because by the time I was able to stop the trace after seeing the segmentation fault, the data had already been overwritten. But the backtrace on the console showed that the function __bad_area_nosemaphore was being called. I was then able to stop the tracer with the following command:
[tracing]# echo '__bad_area_nosemaphore:traceoff' > set_ftrace_filter [tracing]# cat set_ftrace_filter #### all functions enabled #### __bad_area_nosemaphore:traceoff:unlimited [tracing]# echo function > current_tracer
Notice that functions with commands do not affect the general filters. Even though a command has been added to __bad_area_nosemaphore, the filter still allowed all functions to be traced. Commands and filter functions are separate and do not affect each other. With the above command attached to the function__bad_area_nosemaphore, the next time the segmentation fault occurred, the trace stopped and contained the data I needed to debug the situation.
Removing functions from the filters
As stated earlier, echoing in nothing with '>' will clear the filter file. But what if you only want to remove a few functions from the filter?
[tracing]# cat set_ftrace_filter > /tmp/filter [tracing]# grep -v lock /tmp/filter > set_ftrace_filter
The above works, but as mentioned, it may take a while to complete if there were several functions already in set_ftrace_filter. The following does the same thing but is much faster:
[tracing]# echo '!*lock*' >> set_ftrace_filter
The '!' symbol will remove functions listed in the filter file. As shown above, the '!' works with wildcards, but could also be used with a single function. Since '!' has special meaning in bash it must be wrapped with single quotes or bash will try to execute what follows it. Also note the '>>' is used. If you make the mistake of using '>' you will end up with no functions in the filter file.
Because the commands and filters do not interfere with each other, clearing the set_ftrace_filter will not clear the commands. The commands must be cleared with the '!' symbol.
[tracing]# echo 'sched*' > set_ftrace_filter [tracing]# echo 'schedule:traceoff' >> set_ftrace_filter [tracing]# cat trace | tail -5 schedule_console_callback schedule_bh schedule_iso_resource schedule_reallocations schedule:traceoff:unlimited [tracing]# echo > set_ftrace_filter [tracing]# cat set_ftrace_filter #### all functions enabled #### schedule:traceoff:unlimited [tracing]# echo '!schedule:traceoff' >> set_ftrace_filter [tracing]# cat set_ftrace_filter #### all functions enabled #### [tracing]#
This may seem awkward, but having the '>' and '>>' only affect the functions to be traced and not the function commands, actually simplifies the control between filtering functions and adding and removing commands.
Tracing a specific process
Perhaps you only need to trace a specific process, or set of processes. The file set_ftrace_pid lets you specify specific processes that you want to trace. To just trace the current thread you can do the following:
[tracing]# echo $$ > set_ftrace_pid
The above will set the function tracer to only trace the bash shell that executed the echo command. If you want to trace a specific process, you can create a shell script wrapper program.
[tracing]# cat ~/bin/ftrace-me #!/bin/sh DEBUGFS=`grep debugfs /proc/mounts | awk '{ print $2; }'` echo $$ > $DEBUGFS/tracing/set_ftrace_pid echo function > $DEBUGFS/tracing/current_tracer exec $* [tracing]# ~/bin/ftrace-me ls -ltr
Note, you must clear the set_ftrace_pid file if you want to go back to generic function tracing after performing the above.
[tracing]# echo -1 > set_ftrace_pid
What calls a specific function?
Sometimes it is useful to know what is calling a particular function. The immediate predecessor is helpful, but an entire backtrace is even better. The function tracer contains an option that will create a backtrace in the ring buffer for every function that is called by the tracer. Since creating a backtrace for every function has a large overhead, which could live lock the system, care must be taken when using this feature. Imagine the timer interrupt on a slower system where it is run at 1000 HZ. It is quite possible that having every function that the timer interrupt calls produce a backtrace could take 1 millisecond to complete. By the time the timer interrupt returns, a new one will be triggered before any other work can be done, which leads to a live lock.
To use the function tracer backtrace feature, it is imperative that the functions being called are limited by the function filters. The option to enable the function backtracing is unique to the function tracer and activating it can only be done when the function tracer is enabled. This means you must first enable the function tracer before you have access to the option:
[tracing]# echo kfree > set_ftrace_filter [tracing]# cat set_ftrace_filter kfree [tracing]# echo function > current_tracer [tracing]# echo 1 > options/func_stack_trace [tracing]# cat trace | tail -8 => sys32_execve => ia32_ptregs_common cat-6829 [000] 1867248.965100: kfree <-free_bprm cat-6829 [000] 1867248.965100: <stack trace> => free_bprm => compat_do_execve => sys32_execve => ia32_ptregs_common [tracing]# echo 0 > options/func_stack_trace [tracing]# echo > set_ftrace_filter
Notice that I was careful to cat the set_ftrace_filter before enabling the func_stack_trace option to ensure that the filter was enabled. At the end, I disabled the options/func_stack_trace before disabling the filter. Also note that the option is non-volatile, that is, even if you enable another tracer plugin in current_tracer, the option will still be enabled if you re-enable the function tracer.
The function_graph tracer
The function tracer is very powerful, but it may be difficult to understand the linear format that it produces. Frederic Weisbecker has extended the function tracer into the function_graph tracer. The function_graph tracer piggy-backs off of most of the code created by the function tracer, but adds its own hook in the mcount call. Because it still uses the mcount calling methods most of the function filtering explained above also applies to the function_graph tracer, with the exception of the traceon/traceoff commands and set_ftrace_pid (although the latter may change in the future).
The function_graph tracer was also explained in the previous articles, but the set_graph_function file was not described. The func_stack_trace used in the previous section can see what might call a function, but set_graph_function can be used to see what a function calls:
[tracing]# echo kfree > set_graph_function [tracing]# echo function_graph > current_tracer [tracing]# cat trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 0) | kfree() { 0) | virt_to_cache() { 0) | virt_to_head_page() { 0) 0.955 us | __phys_addr(); 0) 2.643 us | } 0) 4.299 us | } 0) 0.855 us | __cache_free(); 0) ==========> | 0) | smp_apic_timer_interrupt() { 0) | apic_write() { 0) 0.849 us | native_apic_mem_write(); 0) 2.853 us | } [tracing]# echo > set_graph_function
This displays the call graph performed only by kfree(). The "==========>" shows that an interrupt happened during the call. The trace records all functions within the kfree()block, even those functions called by an interrupt that triggered while in the scope of kfree().
The function_graph tracer shows the time a function took in the duration field. In the previous articles, it was mentioned that only the leaf functions, the ones that do not call other functions, have an accurate duration, since the duration of parent functions also includes the overhead of the function_graph tracer calling the child functions. By using the set_ftrace_filter file, you can force any function into becoming a leaf function in the function_graph tracer, and this will allow you to see an accurate duration of that function.
[tracing]# echo smp_apic_timer_interrupt > set_ftrace_filter [tracing]# echo function_graph > current_tracer [tracing]# cat trace | head # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) ==========> | 1) + 16.433 us | smp_apic_timer_interrupt(); 1) ==========> | 1) + 25.897 us | smp_apic_timer_interrupt(); 1) ==========> | 1) + 24.764 us | smp_apic_timer_interrupt();
The above shows that the timer interrupt takes between 16 and 26 microseconds to complete.
Function profiling
oprofile and perf are very powerful profiling tools that take periodic samples of the system and can show where most of the time is spent. With the function profiler, it is possible to take a good look at the actual function execution and not just samples. If CONFIG_FUNCTION_GRAPH_TRACER is configured in the kernel, the function profiler will use the function graph infrastructure to record how long a function has been executing. If just CONFIG_FUNCTION_TRACER is configured, the function profiler will just count the functions being called.
[tracing]# echo nop > current_tracer [tracing]# echo 1 > function_profile_enabled [tracing]# cat trace_stat/function 0 |head Function Hit Time Avg -------- --- ---- --- schedule 22943 1994458706 us 86931.03 us poll_schedule_timeout 8683 1429165515 us 164593.5 us schedule_hrtimeout_range 8638 1429155793 us 165449.8 us sys_poll 12366 875206110 us 70775.19 us do_sys_poll 12367 875136511 us 70763.84 us compat_sys_select 3395 527531945 us 155384.9 us compat_core_sys_select 3395 527503300 us 155376.5 us do_select 3395 527477553 us 155368.9 us
The above also includes the times that a function has been preempted orschedule() was called and the task was swapped out. This may seem useless, but it does give an idea of what functions get preempted often. Ftrace also includes options that allow you to have the function graph tracer ignore the time the task was scheduled out.
[tracing]# echo 0 > options/sleep-time [tracing]# echo 0 > function_profile_enabled [tracing]# echo 1 > function_profile_enabled [tracing]# cat trace_stat/function0 | head Function Hit Time Avg -------- --- ---- --- default_idle 2493 6763414 us 2712.962 us native_safe_halt 2492 6760641 us 2712.938 us sys_poll 4723 714243.6 us 151.226 us do_sys_poll 4723 692887.4 us 146.704 us sys_read 9211 460896.3 us 50.037 us vfs_read 9243 434521.2 us 47.010 us smp_apic_timer_interrupt 3940 275747.4 us 69.986 us sock_poll 80613 268743.2 us 3.333 us
Note that the sleep-time option contains a "-" and is notsleep_time.
Disabling the function profiler and then re-enabling it causes the numbers to reset. The list is sorted by the Avg times, but using scripts you can easily sort by any of the numbers. The trace_stat/function0 only represents CPU 0. There exists atrace_stat/function# for each CPU on the system. All functions that are traced and have been hit are in this file.
[tracing]# cat trace_stat/function0 | wc -l 2978
Functions that were not hit are not listed. The above shows that 2978 functions were hit since I started the profiling.
Another option that affects profiling is graph-time (again with a "-"). By default it is enabled. When enabled, the times for a function include the times of all the functions that were called within the function. As you can see from the output in the above examples, several system calls are listed with the highest average. When disabled, the times only include the execution of the function itself, and do not contain the times of functions that are called from the function:
[tracing]# echo 0 > options/graph-time [tracing]# echo 0 > function_profile_enabled [tracing]# echo 1 > function_profile_enabled [tracing]# cat trace_stat/function0 | head Function Hit Time Avg -------- --- ---- --- mwait_idle 10132 246835458 us 24361.96 us tg_shares_up 154467 389883.5 us 2.524 us _raw_spin_lock_irqsave 343012 263504.3 us 0.768 us _raw_spin_unlock_irqrestore 351269 175205.6 us 0.498 us walk_tg_tree 14087 126078.4 us 8.949 us __set_se_shares 274937 88436.65 us 0.321 us _raw_spin_lock 100715 82692.61 us 0.821 us kstat_irqs_cpu 257500 80124.96 us 0.311 us
Note that both sleep-time and graph-time also affect the duration times displayed by the function_graph tracer.
Conclusion
The function tracer is very powerful with lots of different options. It is already available in mainline Linux, and hopefully will be enabled by default in most distributions. It allows you to see into the depths of the kernel and with its arsenal of features, gives you a good idea of why things are happening the way they do. Start using the function tracer to open up the black box that we call the kernel. Have fun!