Extracting kernel stack function arguments from Linux x86-64 kernel crash dumps
Introduction
It's common, when analysing a kernel crash dump, to look at kernel tasks' stack backtraces in order to see what the tasks are doing, e.g. what nested function calls led to the current position; this is easily displayed by the crash utility. We often want also to know the arguments to those function calls; unfortunately these are not so easily displayed.
This blog will illustrate some techniques for extracting kernel function call arguments, where possible, from the crash dump. Several worked examples are given. The examples are from the Oracle UEK kernel, but the techniques are applicable to any Linux kernel.
Note: The Python-Crash API toolkit pykdump includes the command fregs, which automates some of this process. However, it is useful to study how to do it manually, in order to understand what's going on, and to be able to do it when pykdump may not be available, or if fregs fails to produce the desired result.
Basics
This section gives the minimum detail needed to use the techniques. Background explanatory detail will be given in a subsequent section.
This note follows the GAS/AT&T x86
assembly language syntax style, as used by the gdb and crash utilities. Some other documents may follow the NASM/Intel
syntax, which is different; in particular the order of source & destination operands is often reversed. See References for an article that compares the two syntax styles.
You need to know a little bit of the x86 instruction set, but not much. You can get started knowing just that mov %r12,%rdi
places the contents of cpu register %r12
into register %rdi
, and that mov 0x8(%r14),%rcx
takes the value of register %r14
, adds 8 to it, takes the result as the address of a memory location, reads the value at that memory location and puts it into register %rcx
. The GAS
syntax 0x8(%r14)
uses base indirect addressing mode; the NASM
syntax for this is perhaps a little clearer: [%r14+0x8]
. Both denote the content of memory at address location %r14+0x8
.
For the Linux kernel running on the x86-64 architecture, kernel function arguments are normally passed using 64-bit cpu registers. The first six arguments to a function are passed in the following cpu registers, respectively: %rdi
, %rsi
, %rdx
, %rcx
, %r8
, %r9
.
To slightly complicate matters, these 64-bit register contents may be accessed via shorter subsets under different names (for compatibility with previous 32/16/8-bit instruction sets), as shown in the following table:
arg | 1st | 2nd | 3rd | 4th | 5th | 6th |
---|---|---|---|---|---|---|
64-bit | %rdi | %rsi | %rdx | %rcx | %r8 | %r9 |
32-bit | %edi | %esi | %edx | %ecx | %r8d | %r9d |
16-bit | %di | %si | %dx | %cx | %r8w | %r9w |
8-bit | %dil | %sil | %dh/%dl | %ch/%cl | %r8b | %r9b |
The contents of these registers are not preserved (for every kernel function call) in the crash dump. However, it is often possible to extract the values that were placed into these registers, from the kernel stack, or from memory.
In order to find out the arguments supplied to a given function, we need to look at the disassembled code for either the calling function, the called function, or both.
If we disassemble the caller's code, we can see where it obtained the values to place into the registers used to pass the arguments to the called function. We can also disassemble the called function, to see if it stored the values it was passed in those registers, to its stack or to memory.
We may then be able to extract those values from the same place, if that is either on the kernel stack, or in kernel memory (if it has not been subsequently changed), both of which are normally included in the crash dump.
The techniques shown cover different ways that the compiler might have chosen to store the values that are passed in the registers:
- The calling function might retrieve the values from:
- Memory
- The calling function's stack, via an offset from the (fixed) base of the stack
- Another register, which got its value from one of these above in turn
- The called function might save the values it received in the registers, to:
- Memory
- The called function's stack, via push
- Another register, which might itself then be saved in turn
Therefore the technique we use is to:
- Disassemble either or both of:
- The calling function, leading up to the callq instruction, to see from where it obtains the values it passes to the called function in the registers
- The called function, to see whether it puts the values from the registers onto its stack, or into memory
- Inspect those areas (caller/callee stack and/or memory) to see if the values may be extracted
In some cases, it might not be possible to use any of the above methods to find the arguments passed. If this is the case, consider looking at another level of the stack: it's quite common for the same value to be passed down from one function to the next. Thus although you might be unsuccessful trying to recover that argument's value using the above methods for your function of interest, that same value might be passed down to further functions (or itself been passed down from earlier functions), and you might have more luck finding it looking at one of those other functions, using the methods above. The value might also be itself contained in another structure, which may be passed in a function argument. A knowledge of the code in question obviously helps in this case.
Finding a function's stack frame base pointer
For some of the methods noted above, we will need to know how to find the (fixed) base of a kernel function's stack. Whilst the function is executing, this is stored in the %rbp register, the function's stack frame base pointer. We may find it in two places in the kernel task's stack backtrace.
To show a kernel task's stack backtrace, use bt:
crash7latest> bt -sx
PID: 157232 TASK: ffff881a6d6f6200 CPU: 39 COMMAND: "cssdagent"
#0 [ffff88180dc37ba8] __schedule+0x21b at ffffffff816ab68b
#1 [ffff88180dc37c00] schedule+0x37 at ffffffff816abca7
#2 [ffff88180dc37c20] schedule_preempt_disabled+0xe at ffffffff816ac02e
#3 [ffff88180dc37c30] __mutex_lock_slowpath+0x95 at ffffffff816addd5
#4 [ffff88180dc37c90] mutex_lock+0x23 at ffffffff816ade73
#5 [ffff88180dc37cb0] do_last+0x385 at ffffffff812140b5
#6 [ffff88180dc37d60] path_openat+0x80 at ffffffff81214fc0
#7 [ffff88180dc37e00] do_filp_open+0x3a at ffffffff81216ada
#8 [ffff88180dc37ed0] do_sys_open+0x12c at ffffffff8120354c
#9 [ffff88180dc37f40] sys_open+0x1e at ffffffff8120365e
#10 [ffff88180dc37f50] system_call_fastpath+0x18 at ffffffff816b031a
Note: -sx
tells bt
to try to show addresses as as symbol name plus a hex offset.
The above lists the functions calls found in the stack; we also want to see the actual stack content, i.e. the content of the stack frames, for each function call. Let's say we are interested in the arguments passed to the mutex_lock call; we may therefore need to look at its caller, which is do_last, so let's concentrate on its stack frame:
crash7latest> bt -FFsx
...
ffff88180dc37ca8: ***ffff88180dc37d58*** do_last+901
#5 [ffff88180dc37cb0] do_last+0x385 at ffffffff812140b5
ffff88180dc37cb8: 0000000000000000 0000000000000000
ffff88180dc37cc8: [ffff883f64c7cda0:mnt_cache] [ffff881d4601f740:dentry]
ffff88180dc37cd8: [ffff881d4603e600:ext3_inode_cache] kmem_cache_alloc+546
ffff88180dc37ce8: [ffff881d4603e600:ext3_inode_cache] [ffff881dfc120000:kmalloc-4096]
ffff88180dc37cf8: 000001000000002a [ffff8817ec6b7600:kmalloc-256]
ffff88180dc37d08: ffff88180dc37da4 [ffff881d4601f740:dentry]
ffff88180dc37d18: ffff88180dc37d58 path_init+185
ffff88180dc37d28: ffff88180dc37d58 [ffff8817ec6b7600:kmalloc-256]
ffff88180dc37d38: ffff88180dc37e08 ffff88180dc37ef4
ffff88180dc37d48: [ffff881dfc120000:kmalloc-4096] 0000000000000041
***ffff88180dc37d58***: ffff88180dc37df8 path_openat+128
#6 [ffff88180dc37d60] path_openat+0x80 at ffffffff81214fc0
...
Note: -FF
tells bt
to show all the data for a stack frame, symbolically, with its slab cache name, if appropriate.
The stack frame of the calling function do_last
is shown above. Its stack frame base pointer 0xffff88180dc37d58
appears in two locations, shown highlighted with ***.
The stack frame base pointer, for a function, may be found:
- As the second-last value in the stack frame above the function (i.e. above in the bt output)
- As the location of the second-last value in the stack frame for the function
For now, just use the above to find the value of the stack frame base pointer, for a function, if you need it. The structure of the stack frame will be explained in the following section.
Summary of steps
- Note which registers you need, corresponding to the position of the called function's arguments you need
- Refer to the register-naming table above, in case the quantities passed are smaller than 64-bit, e.g. integers, other non-pointer types. The 1st argument will be passed in
%rdi
,%edi
,%di
or%dil
. Note that all the names contain "di
".
- Refer to the register-naming table above, in case the quantities passed are smaller than 64-bit, e.g. integers, other non-pointer types. The 1st argument will be passed in
- Disassemble the calling function, and inspect the instructions leading up to where it calls the function you're interested in. Note from where the compiler gets the values it places in those registers
- If from the stack, find the caller's stack frame base pointer, and from there find the value in the stack frame
- If from memory, can you calculate the memory address used? If so, read the value from memory
- If from another register, from where was that register's contents obtained? And see
case 3.3
below.
- Disassemble the first part of the called function. Note where it stores the values passed in the registers you need
- If onto the stack, find the called function's stack frame base pointer, and find the value in the stack frame
- If from memory, can you calculate the memory address used? If so, read the value from memory
- If the calling function obtained the value from another register (case 2.3 above) does the called function save that register to stack/memory?
- If none of the above gave a usable result, see if the values you need are passed to another function call further up or down the stack, or may be derived from a different value.
- For example the structure you want is referenced from another structure that is passed to a function elsewhere in the stack trace
- Once you've obtained answers, perform a sanity check
- Is the value obtained on a slab cache? If so, is the cache of the expected type?
- Is the value, or what it points to, of the expected type?
- If the value is a pointer to a structure, does the structure content look correct? e.g. pointers where pointers are expected, function op pointers pointing to real functions, etc
- Read the
Caveats
section, to understand whether you can rely on the answer you've found
At this point, you may either skip directly to the Worked Examples , or read on for more detail.
In more depth
This section gives more background. If you're in a hurry, skip directly to the Worked Examples, and come back and read this later; it may help in understanding what's going on, and in identifying edge cases and other apparently odd behaviour.
In Linux on x86-64, the kernel stack grows down, from larger towards smaller memory addresses. i.e. a particular function's stack frame grows downwards from its fixed stack frame base pointer %rbp
, with new elements being added via the push
instruction, which first decrements the current stack pointer %rsp
(which points to the item on the "top" (lowest memory address) of the stack) by the size of the element pushed onto the stack, then copies its argument to the stack location now pointed at by %rsp
, which is left pointing at the new item on top of the stack.
However, the bt
command shows the stack in ascending
memory order, as you read down the page. Therefore we may imagine the bt display of a stack frame as like a pile of magazines stacked up on the table. The top line shown is the top of the stack, which is stored in the stack pointer register %rsp
, and is where new items are pushed onto the stack (magazines added to the pile). The bottom line is the stack frame base (the table), which is fixed, stored in the stack frame base pointer %rbp
(yes, I'm neglecting the function's return address here).
Kernel function stack frame layout
The stack consists of multiple frames, one per function. The layout of a kernel function's stack frame is as follows (in the ascending memory location order as shown by bt
):
# lower memory addresses
[%rsp] top of the stack; new items are pushed on top of this
...
...
[%rbp] base of the stack; fixed. Contains the address of the calling function's stack frame base.
<return address=""> contains the address of where this function should return to, in the calling function
# higher memory addresses</return>
This is built-up in stages, as follows. When a function is called the callq
instruction does two things:
- Pushes the address of the instruction following the
callq
instruction onto the stack (still the caller's stack frame, at this point). This will be the return address for the called function. - Jumps to the first instruction in the called function.
At this point, what will become the stack frame for the called function now looks like this:
# lower memory addresses
[%rsp]<return address=""> contains the address of where this function should return to, in the calling function
# higher memory addresses</return>
The compiler has inserted the following preamble instructions before the start of the code for most called functions:
<fn>: push %rbp
<fn+1>: mov %rsp,%rbp</fn+1></fn>
The push
puts the caller's stack frame base pointer on top of the stack frame, which now looks like this:
# lower memory addresses
[%rsp] caller's stack frame base pointer; the base of the calling function's stack
<return address=""> contains the address of where this function should return to, in the calling function
# higher memory addresses</return>
The mov
changes the stack frame base pointer register %rbp
to also point to the top of the stack
, which now looks like this:
# lower memory addresses
[%rbp][%rsp] base of the stack; fixed. Contains the address of the calling function's stack frame base.
<return address=""> contains the address of where this function should return to, in the calling function
# higher memory addresses</return>
From now on, %rsp
gets decremented as we add things to the top of the stack, but %rbp
remains fixed, denoting the base of the stack. Below that, at the very bottom, is the return address, which records the location of the instruction after the callq
instruction, in the calling function (this is the instruction to which this called function will return, via the retq instruction). From now on, the called function's stack
frame looks like this:
# lower memory addresses
[%rsp] top of the stack; new items are pushed on top of this
...
...
[%rbp] base of the stack; fixed. Contains the address of the calling function's stack frame base.
<return address=""> contains the address of where this function should return to, in the calling function
# higher memory addresses</return>
Caveats
- The description here applies to arguments of simple type, e.g. integer, long, pointer, etc. Things are not quite the same for more complex types, e.g. struct (as a struct, not a pointer to a struct), float, etc. For more detail, refer to the References.
- Remember that the crash dump contains data from when the system crashed, not from the point of execution of the instruction you may be looking at. For example:
- Memory content will be that of when the system crashed, which may be many function calls deeper in the stack below where you are looking, some of which may have overwritten that area of memory
- Stack frame content will be that of when the function (whose stack frame you're looking at) called the next-deeper function. If the function you're looking at went on to modify that stack location, before calling the next-deeper function, that is what you will see when you look at the stack frame
- Remember that there may be more than one code path branch within a function, leading to a callq instruction. The different paths may populate the function-call registers from different sources, and/or with different values. Your linear reading of the instructions leading up to the callq may not be the path that the code took in every instance.
FAQ
- What about other architectures? 32-bit?
- This blog refers exclusively to x86-64; it does not apply to 32-bit x86, which passes arguments on the stack
- See IA-32/cdecl / https://www.wikiwand.com/en/X86_calling_conventions#/List_of_x86_calling_conventions
- I thought I knew x86 instructions, but what is e.g. XYZABC ?
References
- Harvard CS 61; Assembly 2: Calling convention - https://cs61.seas.harvard.edu/site/2018/Asm2/
- x86-64_calling_conventions - https://www.wikiwand.com/en/X86_calling_conventions#/x86-64_calling_conventions
- x86 and amd64 instruction reference - http://www.felixcloutier.com/x86/
- Linux assemblers: A comparison of GAS and NASM - https://web.archive.org/web/20131003180256/http://www.ibm.com/developerworks/linux/library/l-gas-nasm/index.html
Worked examples
Example 1
In this example, we have a hanging task, stuck trying to fsync
a file. We want to obtain the struct file
pointer, and fl_owner
for the file in question.
crash7latest> bt -sx
PID: 22938 TASK: ffff8806aa8d8000 CPU: 2 COMMAND: "httpd.worker"
#0 [ffff8807b1f1f828] __schedule+0x21b at ffffffff816ef65b
#1 [ffff8807b1f1f880] schedule+0x37 at ffffffff816efc87
#2 [ffff8807b1f1f8a0] schedule_timeout+0x24c at ffffffff816f34cc
#3 [ffff8807b1f1f950] io_schedule_timeout+0xa4 at ffffffff816ef3d4
#4 [ffff8807b1f1f990] bit_wait_io+0x1f at ffffffff816f05ff
#5 [ffff8807b1f1f9b0] __wait_on_bit+0x6a at ffffffff816f017a
#6 [ffff8807b1f1fa00] wait_on_page_bit+0xc8 at ffffffff8118e268
#7 [ffff8807b1f1fa50] filemap_fdatawait_range+0xe1 at ffffffff8118e351
#8 [ffff8807b1f1fb30] filemap_write_and_wait_range+0x3f at ffffffff8119039f
#9 [ffff8807b1f1fb60] __dta_nfs4_file_fsync_820+0x93 at ffffffffc034cf83 [nfsv4]
#10 [ffff8807b1f1fbb0] vfs_fsync_range+0x40 at ffffffff812411b0
#11 [ffff8807b1f1fbf0] vfs_fsync+0x1c at ffffffff8124123c
#12 [ffff8807b1f1fc00] nfs_file_flush+0x80 at ffffffffc02d2630 [nfs]
#13 [ffff8807b1f1fc20] filp_close+0x36 at ffffffff8120b1e6
#14 [ffff8807b1f1fc50] put_files_struct+0x91 at ffffffff8122d5f1
#15 [ffff8807b1f1fca0] exit_files+0x4b at ffffffff8122d70b
#16 [ffff8807b1f1fcd0] do_exit+0x38f at ffffffff81089ecf
#17 [ffff8807b1f1fd60] do_group_exit+0x45 at ffffffff8108a665
#18 [ffff8807b1f1fd90] get_signal+0x2ee at ffffffff81096d9e
#19 [ffff8807b1f1fe20] do_signal+0x28 at ffffffff810164d8
#20 [ffff8807b1f1ff30] do_notify_resume+0x5f at ffffffff81016c5f
#21 [ffff8807b1f1ff50] int_signal+0x12 at ffffffff816f5587
Note: -sx
tells bt to try to show addresses as a symbol name plus a hex offset.
Let's start by trying to find it via the arguments to filp_close
.
crash7latest> bt -lsx
...
#12 [ffff8807b1f1fc00] nfs_file_flush+0x80 at ffffffffc02d2630 [nfs]
#13 [ffff8807b1f1fc20] filp_close+0x36 at ffffffff8120b1e6
/usr/src/debug/kernel-4.1.12/linux-4.1.12-124.28.6.el6uek/fs/open.c: 1070
#14 [ffff8807b1f1fc50] put_files_struct+0x91 at ffffffff8122d5f1
/usr/src/debug/kernel-4.1.12/linux-4.1.12-124.28.6.el6uek/fs/file.c: 370
...
Note: bt -l
shows file and line number of each stack trace function call.
int filp_close(struct file *filp, fl_owner_t id)
typedef void *fl_owner_t;
The compiler will use registers %rdi
& %rsi
, respectively, to pass the two arguments to filp_close
.
Let's look at the full stack frame for filp_close
:
crash7latest> bt -FFsx
#13 [ffff8807b1f1fc20] filp_close+0x36 at ffffffff8120b1e6
ffff8807b1f1fc28: ffffffffffffffff 000000000000ffff
ffff8807b1f1fc38: 0000000000000000 [ffff8800c0b21b80:files_cache]
ffff8807b1f1fc48: ffff8807b1f1fc98 put_files_struct+145
Note: -FF
tells bt to show all the data for a stack frame, symbolically, with its slab cache name, if appropriate.
Let's disassemble the calling function put_files_struct
, to see where the compiler obtains the values it will pass in registers to filp_close
:
crash7latest> dis -rx put_files_struct+0x91 | tail
0xffffffff8122d5d4 <put_files_struct+0x74>: je 0xffffffff8122d5c8 put_files_struct+0x68
0xffffffff8122d5d6 <put_files_struct+0x76>: movslq %r15d,%rax
0xffffffff8122d5d9 <put_files_struct+0x79>: mov 0x8(%r14),%rcx
0xffffffff8122d5dd <put_files_struct+0x7d>: mov %r12,%rdi
0xffffffff8122d5e0 <put_files_struct+0x80>: xchg %rdi,(%rcx,%rax,8)
0xffffffff8122d5e4 <put_files_struct+0x84>: test %rdi,%rdi
0xffffffff8122d5e7 <put_files_struct+0x87>: je 0xffffffff8122d5c8 put_files_struct+0x68
0xffffffff8122d5e9 <put_files_struct+0x89>: mov %r13,%rsi
0xffffffff8122d5ec <put_files_struct+0x8c>: callq 0xffffffff8120b1b0 filp_close
0xffffffff8122d5f1 <put_files_struct+0x91>: callq 0xffffffff816eff10 _cond_resched
dis -r
(reverse) displays all instructions from the start of the routine up to and including the designated address.
dis -x
overrides the default output format with hexadecimal format.
The first argument, the struct file pointer will be passed in register %rdi
. The compiler fills that register in this way:
# memory location content -> %rdi
# The location is given by the effective address computation denoted by (%rcx,%rax,8),
# which is of the form: base+(index*offset), i.e. an array lookup.
0xffffffff8122d5e0 <put_files_struct+0x80>: xchg %rdi,(%rcx,%rax,8)
We can't easily retrieve the first argument using this method, since we don't know the values of %rcx
or %rax
.
So how about the second argument? That is passed in register %rsi
, which is populated from another register %r13
:
0xffffffff8122d5e9 <put_files_struct+0x89>: mov %r13,%rsi</put_files_struct+0x89>
That's not immediately helpful, until we notice what the called function filp_close
does, immediately after being called:
crash7latest> dis -x filp_close | head
0xffffffff8120b1b0 <filp_close>: push %rbp
0xffffffff8120b1b1 <filp_close+0x1>: mov %rsp,%rbp
0xffffffff8120b1b4 <filp_close+0x4>: push %r13
Notice that filp_close
pushes %r13
onto its stack. This is the first push instruction that is done by filp_close
after its initial push of %rbp
. Let's look again at the stack frame for filp_close
:
#13 [ffff8807b1f1fc20] filp_close+0x36 at ffffffff8120b1e6
ffff8807b1f1fc28: ffffffffffffffff 000000000000ffff
ffff8807b1f1fc38: 0000000000000000 [ffff8800c0b21b80:files_cache]
ffff8807b1f1fc48: ffff8807b1f1fc98 put_files_struct+145
Referring back to the Basics section, we can identify the stack frame base pointer %rbp
for filp_close as 0xffff8807b1f1fc48
.
Referring back to the stack frame layout section, we can see that the stack for filp_close
starts at the very bottom with its return address put_files_struct+145
. The next address "up" (in the bt display) is location 0xffff8807b1f1fc48
, which is filp_close
's stack frame base pointer %rbp
. It contains a pointer to the parent (put_files_struct
) stack frame base pointer 0xffff8807b1f1fc98
. From then on "up" are the normal stack pushes done by filp_close
. Since the push of %r13
is the first push (following the preamble push of %rbp
), we find it next: 0xffff8800c0b21b80
, which is the value of fl_owner_t
id.
To find on the stack the content of push number n (following the preamble push of %rbp
) we calculate the address: %rbp - (n * 8)
In this case, n == 1
, the first push, so:
crash7latest> px (0xffff8807b1f1fc48 - 1*8)
$1 = 0xffff8807b1f1fc40
Note: px
print expression in hex and read its contents:
crash7latest> rd 0xffff8807b1f1fc40
ffff8807b1f1fc40: ffff8800c0b21b80
Note:: rd
reads memory.
Thus we find the value of fl_owner_t id == 0xffff8800c0b21b80
.
We could, of course, simply have walked "up" the stack frame visually, counting pushes, rather than manually calculating the address.
We still need to find the first argument, the struct file
, but we may find that elsewhere, in another function on the stack… it is also the first argument of:
int vfs_fsync(struct file *file, int datasync)
and so will be passed in register %rdi
.
Here's the relevant extract from the stack backtrace:
#11 [ffff8807b1f1fbf0] vfs_fsync+0x1c at ffffffff8124123c
#12 [ffff8807b1f1fc00] nfs_file_flush+0x80 at ffffffffc02d2630 [nfs]
Let's disassemble the caller, leading up to the call:
crash7latest> dis -rx nfs_file_flush+0x80 | tail
0xffffffffc02d260c <nfs_file_flush+0x5c>: mov (%rax),%rax
0xffffffffc02d260f <nfs_file_flush+0x5f>: mov 0xe0(%rax),%rax
0xffffffffc02d2616 <nfs_file_flush+0x66>: mov 0x168(%rax),%rax
0xffffffffc02d261d <nfs_file_flush+0x6d>: callq 0xffffffff816ff2c0 __x86_indirect_thunk_rax
0xffffffffc02d2622 <nfs_file_flush+0x72>: test %eax,%eax
0xffffffffc02d2624 <nfs_file_flush+0x74>: jne 0xffffffffc02d2638 nfs_file_flush+0x88
0xffffffffc02d2626 <nfs_file_flush+0x76>: mov %rbx,%rdi
0xffffffffc02d2629 <nfs_file_flush+0x79>: xor %esi,%esi
0xffffffffc02d262b <nfs_file_flush+0x7b>: callq 0xffffffff81241220 vfs_fsync
0xffffffffc02d2630 <nfs_file_flush+0x80>: pop %rbx
On line 8
we see that register %rdi
— the first argument to vfs_fsync
— is populated from register %rbx
. (Whilst we're here, note that the second argument is passed in register %esi
, which is the 32-bit subset of the 64-bit register %rsi
, since the second argument is an integer: int datasync
)
Now disassemble the called function:
crash7latest> dis -x vfs_fsync | head
0xffffffff81241220 <vfs_fsync>: push %rbp
0xffffffff81241221 <vfs_fsync+0x1: mov="">: nopl 0x0(%rax,%rax,1)
0xffffffff81241229 <vfs_fsync+0x9>: movabs $0x7fffffffffffffff,%rdx
0xffffffff81241233 <vfs_fsync+0x13>: mov %esi,%ecx
0xffffffff81241235 <vfs_fsync+0x15>: xor %esi,%esi
0xffffffff81241237 <vfs_fsync+0x17>: callq 0xffffffff81241170 vfs_fsync_range
We see that vfs_fsync
does not save %rbx
on its stack, but nor does it alter it before calling vfs_fsync_range
. Now disassemble the latter:
crash7latest> dis -x vfs_fsync_range | head
0xffffffff81241170 <vfs_fsync_range>: push %rbp
0xffffffff81241171 <vfs_fsync_range+0x1>: mov %rsp,%rbp
0xffffffff81241174 <vfs_fsync_range+0x4>: push %r14
0xffffffff81241176 <vfs_fsync_range+0x6>: push %r13
0xffffffff81241178 <vfs_fsync_range+0x8>: push %r12
0xffffffff8124117a <vfs_fsync_range+0xa>: push %rbx
We see that vfs_fsync_range
saves %rbx
to its stack. It's the fourth push (after the preamble).
Find vfs_fsync_range
's stack frame base pointer: 0xffff8807b1f1fbe8
. Use the method shown in the Basics section.
Find the value four 8-byte values (four pushes) up from the stack frame base:
crash7latest> px (0xffff8807b1f1fbe8 - 4*8)
$4 = 0xffff8807b1f1fbc8
crash7latest> rd 0xffff8807b1f1fbc8
ffff8807b1f1fbc8: ffff8807eb507b00
We have found our value:
struct file *file = 0xffff8807eb507b00
Perform a sanity check; let's check that the file structure's ops pointers point to an NFS function:
crash7latest> struct -p file.f_op ffff8807eb507b00 | grep llseek
llseek = 0xffffffffc034d2c0,
crash7latest> dis 0xffffffffc034d2c0 1
0xffffffffc034d2c0 nfs4_file_llseek: push %rbp
Example 2
Here's a UEK4 dump, where a process is hung, blocked waiting for a mutex
.
crash7latest> set 157232
PID: 157232
COMMAND: "cssdagent"
TASK: ffff881a6d6f6200 [THREAD_INFO: ffff88180dc34000]
CPU: 39
STATE: TASK_UNINTERRUPTIBLE
crash7latest> bt -sx
PID: 157232 TASK: ffff881a6d6f6200 CPU: 39 COMMAND: "cssdagent"
#0 [ffff88180dc37ba8] __schedule+0x21b at ffffffff816ab68b
#1 [ffff88180dc37c00] schedule+0x37 at ffffffff816abca7
#2 [ffff88180dc37c20] schedule_preempt_disabled+0xe at ffffffff816ac02e
#3 [ffff88180dc37c30] __mutex_lock_slowpath+0x95 at ffffffff816addd5
#4 [ffff88180dc37c90] mutex_lock+0x23 at ffffffff816ade73
#5 [ffff88180dc37cb0] do_last+0x385 at ffffffff812140b5
We want to find the mutex
on which it is waiting. Let's see how do_last
calls mutex_lock
:
crash7latest> dis -rx do_last+0x385 | tail
0xffffffff8121408a <do_last+0x35a>: xor %r14d,%r14d
0xffffffff8121408d <do_last+0x35d>: testl $0x243,(%rbx)
0xffffffff81214093 <do_last+0x363>: movb $0x0,-0x5b(%rbp)
0xffffffff81214097 <do_last+0x367>: jne 0xffffffff812145f0 do_last+0x8c0
0xffffffff8121409d <do_last+0x36d>: mov -0x48(%rbp),%rax
0xffffffff812140a1 <do_last+0x371>: mov 0x30(%rax),%rax
0xffffffff812140a5 <do_last+0x375>: lea 0xa8(%rax),%rdi
0xffffffff812140ac <do_last+0x37c>: mov %rax,-0x70(%rbp)
0xffffffff812140b0 <do_last+0x380>: callq 0xffffffff816ade50 mutex_lock
dis -r
(reverse) displays all instructions from the start of the routine up to and including the designated address.
dis -x
overrides the default output format with hexadecimal format.
The first arg to mutex_lock is passed in %rdi
, which is populated like this:
0xffffffff8121409d <do_last+0x36d>: mov -0x48(%rbp),%rax
0xffffffff812140a1 <do_last+0x371>: mov 0x30(%rax),%rax
0xffffffff812140a5 <do_last+0x375>: lea 0xa8(%rax),%rdi
We need to start with do_last
's stack frame (base) pointer %rbp
.
That may be found here:
crash7latest> bt -FFsx
...
ffff88180dc37ca8: ***ffff88180dc37d58*** do_last+901
#5 [ffff88180dc37cb0] do_last+0x385 at ffffffff812140b5
The stack frame (base) pointer for do_last
- the value of %rbp
- is ffff88180dc37ca8
, and the value at that location — denoted by (%rbp
) — is ffff88180dc37d58
Then we can emulate the effects of the mov/lea
instructions, to arrive at the value that do_last
put into %rdi
:
do_last's (%rbp): ffff88180dc37d58
0xffffffff8121409d <do_last+0x36d>: mov -0x48(%rbp),%rax
crash7latest> px (0xffff88180dc37d58-0x48)
$1 = 0xffff88180dc37d10
0xffffffff812140a1 <do_last+0x371>: mov 0x30(%rax),%rax
crash7latest> rd -x 0xffff88180dc37d10
ffff88180dc37d10: ffff881d4601f740
%rax
now contains 0xffff881d4601f740
We can also note that since we've offset from the stack frame pointer %rbp
, this value is on the stack, and bt will tell us more about it, specifically whether it's part of a slab cache and, if so, which one:
crash7latest> bt -FFsx
ffff88180dc37d08: ffff88180dc37da4 [ffff881d4601f740:dentry]
Address 0xffff88180dc37d10
contains a pointer (0xffff881d4601f740
) to something from the dentry slab cache, i.e. a dentry
.
At this point, we have the dentry pointer in %rax
. The next instruction offsets 0x30
from the dentry
:
0xffffffff812140a1 <do_last+0x371>: mov 0x30(%rax),%rax
crash7latest> px (0xffff881d4601f740+0x30)
$2 = 0xffff881d4601f770
crash7latest> rd -x 0xffff881d4601f770
ffff881d4601f770: ffff881d4603e600
crash7latest> struct -ox dentry | grep 0x30
[0x30] struct inode *d_inode;</do_last+0x371>
%rax
now contains 0xffff881d4603e600
struct -o
shows member offsets when displaying structure definitions; if used with an address or symbol argument, each member will be preceded by its virtual address.
struct -x
overrides default output format with hexadecimal format.
So the above location, in %rax
, is the inode.
The next instruction offsets 0xa8
from the inode:
0xffffffff812140a5 <do_last+0x375>: lea 0xa8(%rax),%rdi
crash7latest> px (0xffff881d4603e600+0xa8)
$3 = 0xffff881d4603e6a8
crash7latest> struct -ox inode | grep 0xa8
[0xa8] struct mutex i_mutex;</do_last+0x375>
(note that the instruciton is lea
(load effective address), not mov
(move
) so we don't dereference the value further)
So the above is the location of the mutex, and this (0xffff881d4603e6a8
) is what ends up in %rdi
, which becomes the first arg to mutex_lock
, as expected:
*void __sched mutex_lock(struct mutex *lock)*
Having found the mutex
, we would likely want to find its owner
:
crash7latest> mutex.owner 0xffff881d4603e6a8
owner = 0xffff8802ad32c600
crash7latest> task 0xffff8802ad32c600 | head -1
PID: 7006 TASK: ffff8802ad32c600 CPU: 39 COMMAND: "updatedb"
Example 3
In this example, we look at a UEK3 crash dump, from a system where processes were spending a lot of time in 'D'
state waiting for an NFS server to respond. The system crashed since hung_task_panic
was set (which is not a good idea on a production system, and should never be set on an NFS client or server).
PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
PID: 173
COMMAND: "khungtaskd"
TASK: ffff9c5176add140 [THREAD_INFO: ffff9c7fff454000]
CPU: 13
STATE: TASK_RUNNING (PANIC)
Looking at the hung task:
crash7latest> set 1421
PID: 1421
COMMAND: "ls"
TASK: ffff9c924a7d1040 [THREAD_INFO: ffff9c88b79bc000]
CPU: 10
STATE: TASK_UNINTERRUPTIBLE
crash7latest> bt -sx
PID: 1421 TASK: ffff9c924a7d1040 CPU: 10 COMMAND: "ls"
#0 [ffff9c88b79bfcd8] __schedule+0x3e2 at ffffffff9bd68922
#1 [ffff9c88b79bfd68] schedule_preempt_disabled+0x29 at ffffffff9bd69ce9
#2 [ffff9c88b79bfd78] __mutex_lock_slowpath+0xc7 at ffffffff9bd67c67
#3 [ffff9c88b79bfdd8] mutex_lock+0x1f at ffffffff9bd6704f
#4 [ffff9c88b79bfdf0] nfs_getattr+0x1ae at ffffffffc09e5dce [nfs]
#5 [ffff9c88b79bfe38] vfs_getattr+0x49 at ffffffff9b847299
#6 [ffff9c88b79bfe68] vfs_fstatat+0x75 at ffffffff9b8473c5
#7 [ffff9c88b79bfeb8] SYSC_newlstat+0x31 at ffffffff9b8477d1
#8 [ffff9c88b79bff40] sys_newlstat+0xe at ffffffff9b847c3e
#9 [ffff9c88b79bff50] system_call_fastpath+0x22 at ffffffff9bd75ddb
RIP: 00007f41d13c9ab5 RSP: 00007ffe91e04fa8 RFLAGS: 00010207
RAX: 0000000000000006 RBX: 0000000000624643 RCX: 000000000062c5e0
RDX: 000000000061fc60 RSI: 000000000061fc60 RDI: 00007ffe91e04fa0
RBP: 00007ffe91e05370 R8: 0000000000624550 R9: 000000000061fa10
R10: 0000000000000002 R11: 0000000000000246 R12: 00007ffe91e04fa0
R13: 0000000000000005 R14: 000000000061fc50 R15: 000000000000002e
ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b
From the hung task traceback, we can see that nfs_getattr
is stuck waiting on a mutex
. It wants to write-back dirty pages, before performing the getattr
call, and it grabs the inode mutex
to keep other apps out from writing whilst we're trying to write-back. So, we need to find out who's got that inode mutex
.
Let's see how nfs_getattr calls mutex_lock
:
crash7latest> dis -rx nfs_getattr+0x1ae | tail
0xffffffffc09e5da8 <nfs_getattr+0x188>: jne 0xffffffffc09e5d90 nfs_getattr+0x170
0xffffffffc09e5daa <nfs_getattr+0x18a>: movzwl (%rbx),%eax
0xffffffffc09e5dad <nfs_getattr+0x18d>: and $0xf000,%ax
0xffffffffc09e5db1 <nfs_getattr+0x191>: cmp $0x8000,%ax
0xffffffffc09e5db5 <nfs_getattr+0x195>: jne 0xffffffffc09e5c64 nfs_getattr+0x44
0xffffffffc09e5dbb <nfs_getattr+0x19b>: lea 0xa8(%rbx),%rdx
0xffffffffc09e5dc2 <nfs_getattr+0x1a2>: mov %rdx,%rdi
0xffffffffc09e5dc5 <nfs_getattr+0x1a5>: mov %rdx,-0x38(%rbp)
0xffffffffc09e5dc9 <nfs_getattr+0x1a9>: callq 0xffffffff9bd67030 mutex_lock
0xffffffffc09e5dce <nfs_getattr+0x1ae>: mov %rbx,%rdi
The first arg to mutex_lock
is the mutex, which is passed in %rdi
.
We can see that nfs_getattr
fills %rdi
from %rdx
, before calling mutex_lock
, but it also stores %rdx
at an offset from its stack frame base pointer %rbp
:
*0xffffffffc09e5dc5 <nfs_getattr+0x1a5>: mov %rdx,-0x38(%rbp)*</nfs_getattr+0x1a5>
We can get nfs_getattr
's %rbp
here:
crash7latest> bt -FFsx
...
#3 [ffff9c88b79bfdd8] mutex_lock+0x1f at ffffffff9bd6704f
ffff9c88b79bfde0: [ffff9c8ed6bb4760:nfs_inode_cache] **ffff9c88b79bfe30**
ffff9c88b79bfdf0: nfs_getattr+430
#4 [ffff9c88b79bfdf0] nfs_getattr+0x1ae at ffffffffc09e5dce [nfs]
So let's calculate that offset:
crash7latest> px (0xffff9c88b79bfe30-0x38)
$1 = 0xffff9c88b79bfdf8
and read the value there, which will be the value of %rdx
, i.e. the address of the mutex
:
crash7latest> rd -x 0xffff9c88b79bfdf8
ffff9c88b79bfdf8: ffff9c8ed6bb4808
Note: rd
reads memory
Let's see who owns
it:
crash7latest> mutex.owner 0xffff9c8ed6bb4808
owner = 0xffff9c5f0ab25140
crash7latest> task 0xffff9c5f0ab25140 | head
PID: 30365 TASK: ffff9c5f0ab25140 CPU: 5 COMMAND: "ls"
and what is that ls task doing?
crash7latest> bt -sx 30365
PID: 30365 TASK: ffff9c5f0ab25140 CPU: 5 COMMAND: "ls"
#0 [ffff9c56c687fa68] __schedule+0x3e2 at ffffffff9bd68922
#1 [ffff9c56c687faf8] schedule+0x29 at ffffffff9bd68dc9
#2 [ffff9c56c687fb08] schedule_timeout+0x221 at ffffffff9bd668d1
#3 [ffff9c56c687fbb0] io_schedule_timeout+0xad at ffffffff9bd6849d
#4 [ffff9c56c687fbe0] io_schedule+0x18 at ffffffff9bd68538
#5 [ffff9c56c687fbf0] bit_wait_io+0x11 at ffffffff9bd66f21
#6 [ffff9c56c687fc08] __wait_on_bit+0x67 at ffffffff9bd66a47
#7 [ffff9c56c687fc48] wait_on_page_bit+0x81 at ffffffff9b7b5ab1
#8 [ffff9c56c687fca0] __filemap_fdatawait_range+0x111 at ffffffff9b7b5be1
#9 [ffff9c56c687fd78] filemap_fdatawait_range+0x14 at ffffffff9b7b5c74
#10 [ffff9c56c687fd98] filemap_fdatawait+0x27 at ffffffff9b7b5cb7
#11 [ffff9c56c687fda8] filemap_write_and_wait+0x4c at ffffffff9b7b800c
#12 [ffff9c56c687fdc8] nfs_wb_all+0x20 at ffffffffc09f2cb0 [nfs]
#13 [ffff9c56c687fdf0] nfs_getattr+0x1be at ffffffffc09e5dde [nfs]
#14 [ffff9c56c687fe38] vfs_getattr+0x49 at ffffffff9b847299
#15 [ffff9c56c687fe68] vfs_fstatat+0x75 at ffffffff9b8473c5
#16 [ffff9c56c687feb8] SYSC_newlstat+0x31 at ffffffff9b8477d1
#17 [ffff9c56c687ff40] sys_newlstat+0xe at ffffffff9b847c3e
#18 [ffff9c56c687ff50] system_call_fastpath+0x22 at ffffffff9bd75ddb
So, the mutex
is held by another NFS getattr
task, that is in the process of performing the write-back
. This is likely just part of the normal NFS writeback, blocked by congestion
, a slow server, or some other interruption.
As mentioned already in general it is advised never to set hung_task_panic
on a production NFS system (client or server).
本文来自博客园,作者:dolinux,未经同意,禁止转载