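Understanding Solaris x64 System Calls in Depth

The key to understanding system calls is that the system call number is the link between user mode and kernel mode. On Solaris x64, the system call number is passed from user mode to kernel mode in register RAX. Once in kernel mode, the kernel entry routine sys_syscall uses the number in RAX to look up the corresponding handler in the kernel's system call table (sysent) and calls it. A system call takes at most 6 arguments, which the handler receives in registers RDI, RSI, RDX, RCX, R8 and R9, in that order. (Because the syscall instruction overwrites RCX, the libc wrapper carries the fourth argument across the boundary in R10, and sys_syscall puts it back into RCX before calling the handler.) The switch from user mode into kernel mode is done by the syscall instruction, and the return from kernel mode to user mode by the sysret instruction.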

 

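1 System Call Overview

1.1 What is a system call
In a modern operating system, the mechanism through which user applications access the services provided by the kernel is called a system call (syscall).

1.2 Why system calls are needed
First, system calls give user space a uniform interface to hardware resources, so user programs do not have to deal with hardware-specific operations. When reading or writing a file, for example, the user has no need to care which kind of disk the file is stored on, or which file system it lives in.
Second, system calls protect the operating system and keep it stable and secure. They define exactly how a user process may enter the kernel. In other words, the paths by which a user process reaches the kernel are fixed in advance: it can enter only at the designated entry points and cannot jump to arbitrary kernel code. This restriction on entry paths is what makes it possible to keep the kernel safe.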


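1.3 System calls and C library functions
Every system call provided by the kernel has a corresponding wrapper function in the C library, and the wrapper usually has the same name as the system call. For example, the wrapper for the modctl system call is the modctl function in the C library, implemented in the assembly file modctl.s.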


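1.4 System calls and system commands
System commands sit one layer above the C library: they are executables built on C library functions. For example, the modinfo command calls the C library function modctl() to query information about kernel modules. The C library function in turn wraps the entry into the kernel: modctl() enters the kernel with the syscall instruction (a fast system call instruction, as opposed to int 0x80).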


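1.5 System calls and kernel functions
The only difference between a kernel function and a C library function is that the kernel function is implemented inside the kernel and therefore has to follow the rules of kernel programming. Every system call must ultimately perform a well-defined operation. After a user application enters the kernel through a system call, the kernel function that corresponds to that system call, that is, the system call service routine, is executed. For example, the service routine of the modctl system call is the kernel function modctl().

[Figure: the system call process]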

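2 How Solaris x64 System Calls Are Implemented

Solaris supports two platforms, x64 and SPARC. The kernel is now 64-bit on both, but 32-bit and 64-bit applications are both supported, and so are 32-bit and 64-bit system calls. For simplicity, the rest of this discussion covers only 64-bit system calls on x64.

2.1 AMD64 ABI basics

To understand Solaris x64 system calls you need some basic knowledge of the AMD64 ABI. The ABI document that the Solaris x64 implementation follows is:
System V Application Binary Interface, AMD64 Architecture Processor Supplement
A condensed version of the document is used here: http://www.x86-64.org/documentation/abi.pdf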

A.2 AMD64 Linux Kernel Conventions
...
A.2.1 Calling Conventions

The Linux AMD64 kernel uses internally the same calling conventions as user-level applications (see section 3.2.3 for details). User-level applications that like to call system calls should use the functions from the C library. The interface between the C library and the Linux kernel is the same as for the user-level applications with the following differences: 

1. User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9. 

2. A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11. 

3. The number of the syscall has to be passed in register %rax. 

4. System-calls are limited to six arguments, no argument is passed directly on the stack. 

5. Returning from the syscall, register %rax contains the result of the system-call. A value in the range between -4095 and -1 indicates an error, it is -errno. 

6. Only values of class INTEGER or class MEMORY are passed to the kernel.
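
Item 5 of the excerpt describes the Linux convention for reporting errors in %rax. As a minimal sketch, that rule can be written in C as follows; the helper name decode_linux_style_return is hypothetical and not part of any libc. Note that the Solaris libc wrapper disassembled in section 2.2 does not test a value range: it branches to __cerror with jb, that is, on the carry flag.

```c
#include <assert.h>

/*
 * Hypothetical helper: interprets a raw return value according to item 5
 * of the quoted Linux convention. Values in [-4095, -1] mean failure and
 * hold -errno; any other value is the result of the call.
 * Returns the result on success, or -1 with *err set on failure.
 */
long
decode_linux_style_return(long raw, int *err)
{
	if (raw >= -4095L && raw <= -1L) {
		*err = (int)-raw;	/* raw is -errno */
		return (-1L);
	}
	*err = 0;
	return (raw);
}
```

For example, a raw value of -2 decodes to a return of -1 with *err set to 2, while a raw value of 42 is passed through unchanged.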


The slides from d3s.mff.cuni.cz/teaching/crash_dump_analysis are also a useful reference. [Screenshots of the two key slides]

 

The disassembly of a kernel function below helps illustrate the ABI.

Function prototype:
ibt_status_t ibt_suggest_alt_path(ibt_channel_hdl_t channel, 
     ibt_execution_mode_t mode, 
     ibt_suggest_alt_path_info_t *alt_path, 
     void *priv_data, 
     ibt_priv_data_len_t priv_data_len, 
     ibt_spr_returns_t *ret_args); 

Disassemble it in the running kernel with mdb -k:
root# mdb -k 
> ibt_suggest_alt_path::dis 
ibt_suggest_alt_path:           pushq  %rbp             ; save rbp 
ibt_suggest_alt_path+1:         movq   %rsp,%rbp        ;
ibt_suggest_alt_path+4:         subq   $0x30,%rsp       ;
ibt_suggest_alt_path+8:         movq   %rdi,-0x8(%rbp)  ; arg1 : rdi 
ibt_suggest_alt_path+0xc:       movq   %rsi,-0x10(%rbp) ; arg2 : rsi 
ibt_suggest_alt_path+0x10:      movq   %rdx,-0x18(%rbp) ; arg3 : rdx 
ibt_suggest_alt_path+0x14:      movq   %rcx,-0x20(%rbp) ; arg4 : rcx 
ibt_suggest_alt_path+0x18:      movq   %r8,-0x28(%rbp)  ; arg5 : r8 
ibt_suggest_alt_path+0x1c:      movq   %r9,-0x30(%rbp)  ; arg6 : r9 
ibt_suggest_alt_path+0x20:      pushq  %rbx             ; save rbx 
ibt_suggest_alt_path+0x21:      pushq  %r12             ; save r12 
ibt_suggest_alt_path+0x23:      pushq  %r13             ; save r13 
ibt_suggest_alt_path+0x25:      pushq  %r14             ; save r14 
ibt_suggest_alt_path+0x27:      pushq  %r15             ; save r15 
... 
ibt_suggest_alt_path+0xa11:     popq   %r15             ; restore r15
ibt_suggest_alt_path+0xa13:     popq   %r14             ; restore r14
ibt_suggest_alt_path+0xa15:     popq   %r13             ; restore r13
ibt_suggest_alt_path+0xa17:     popq   %r12             ; restore r12
ibt_suggest_alt_path+0xa19:     popq   %rbx             ; restore rbx
ibt_suggest_alt_path+0xa1a:     leave                   ; restore rsp, rbp
ibt_suggest_alt_path+0xa1b:     ret                     ; 
> $q                            // leave == movq %rbp, %rsp + popq %rbp

 

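2.2 System call numbers

Every system call has a unique system call number. The system call table has room for at most 256 system calls (NSYSCALL, see section 2.3.2). When a system call is retired, its number stays reserved and cannot be assigned to a new system call. All system call numbers are listed in the file /etc/name_to_sysnum. The ABI specifies that the system call number is passed to the kernel in register rax. For example, the system call number of modctl is 152 (= 0x98), and the output of modctl::dis shows that %eax == 0x98 just before the syscall instruction executes.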

root# egrep "modctl" /etc/name_to_sysnum 
modctl                  152 

root# echo "modctl::dis" | mdb /lib/64/libc.so.1 
modctl:                         movq   %rcx,%r10 
modctl+3:                       movl   $0x98,%eax 
modctl+8:                       syscall 
modctl+0xa:                     jb     -0x126d30        <__cerror> 
modctl+0x10:                    xorq   %rax,%rax 
modctl+0x13:                    ret 

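- For the implementation of modctl(), see usr/src/lib/libc/common/sys/modctl.s

The system call numbers are defined in the source file usr/src/uts/common/sys/syscall.h,
for example:
#define     SYS_modctl      152
After entering sys_syscall(), the kernel uses the system call number stored in register rax to find the corresponding kernel function.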
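
The name-to-number mapping in /etc/name_to_sysnum is simple enough to scan by hand. The following is a minimal sketch, assuming the two-column "name number" layout seen in the egrep output above; the function lookup_sysnum is hypothetical and works on text already read into memory rather than on the file itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scans text in "name number" format and returns
 * the number recorded for `name`, or -1 if the name is not present.
 * Scanning stops at the first entry that does not match the format.
 */
int
lookup_sysnum(const char *text, const char *name)
{
	char entry[64];
	int num, used;

	/* " %63s %d%n": skip whitespace, read a name, a number, and count bytes */
	while (sscanf(text, " %63s %d%n", entry, &num, &used) == 2) {
		if (strcmp(entry, name) == 0)
			return (num);
		text += used;	/* advance past the entry just read */
	}
	return (-1);
}
```

Called with the line printed by egrep, lookup_sysnum("modctl                  152\n", "modctl") returns 152.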

2.3 The system call table
The Solaris kernel maintains a system call table in which each element is a struct sysent.

2.3.1 struct sysent

/*
 * Structure of the system-entry table.
 *
 *  Changes to struct sysent should maintain binary compatibility with
 *  loadable system calls, although the interface is currently private.
 *
 *  This means it should only be expanded on the end, and flag values
 *  should not be reused.
 *
 *  It is desirable to keep the size of this struct a power of 2 for quick
 *  indexing.
 */
struct sysent {
    char            sy_narg;        /* total number of arguments */
#ifdef _LP64
    unsigned short  sy_flags;       /* various flags as defined below */
#else
    unsigned char   sy_flags;       /* various flags as defined below */
#endif
    int             (*sy_call)();   /* argp, rvalp-style handler */
    krwlock_t       *sy_lock;       /* lock for loadable system calls */
    int64_t         (*sy_callc)();  /* C-style call handler or wrapper */
};

root# mdb -k 
> ::sizeof struct sysent 
sizeof (struct sysent) = 0x20 
> ::offsetof sysent sy_callc 
offsetof (sysent, sy_callc) = 0x18, sizeof (...->sy_callc) = 8 

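Note: struct sysent is 0x20 (= 32) bytes in size, and the service routine pointer sy_callc sits at offset 0x18 within it. The numbers 0x20 and 0x18 come up again when we analyze the assembly code of sys_syscall().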
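
The 0x20 and 0x18 reported by mdb can be reproduced in user space. The sketch below declares a stand-in for the LP64 variant of struct sysent and checks its layout at compile time. The name sysent_model is hypothetical, krwlock_t is replaced by void because only the pointer size matters, and the checks assume an LP64 target such as amd64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the LP64 variant of struct sysent; not the kernel header. */
struct sysent_model {
	char		sy_narg;		/* offset 0x00 */
	unsigned short	sy_flags;		/* offset 0x02, after 1 pad byte */
	int		(*sy_call)(void);	/* offset 0x08 */
	void		*sy_lock;		/* offset 0x10, krwlock_t * in the kernel */
	int64_t		(*sy_callc)(void);	/* offset 0x18 */
};

/* Both checks hold when pointers are 8 bytes wide (LP64). */
static_assert(sizeof (struct sysent_model) == 0x20,
    "one table entry is 32 bytes");
static_assert(offsetof(struct sysent_model, sy_callc) == 0x18,
    "sy_callc sits at offset 0x18");
```

A size of 32 is a power of 2, which is what the comment in the kernel source asks for: indexing the table then needs only a shift by 5.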

 

2.3.2 The system call table struct sysent sysent[NSYSCALL]

o The macro NSYSCALL is defined in the header file usr/src/uts/common/sys/systm.h:

#define NSYSCALL 256 /* number of system calls */

o sysent[NSYSCALL] is defined in the source file usr/src/uts/common/os/sysent.c:

/*
 * Native sysent table.
 */
struct sysent sysent[NSYSCALL] =
{
        /*  0 */ IF_LP64(
                        SYSENT_NOSYS(),
                        SYSENT_C("indir",       indir,          1)),
        /*  1 */ SYSENT_CI("exit",              rexit,          1),
...
        /* 152 */ SYSENT_CI("modctl",           modctl,         6),
...
        /* 255 */ SYSENT_CI("umount2",          umount2,        2)
...
};

o The macro SYSENT_CI is defined in the source file usr/src/uts/common/os/sysent.c:

#define    SYSENT_CI(name, call, narg)    \
    { (narg), SE_32RVAL1, NULL, NULL, (llfcn_t)(call) }

o The macro SE_32RVAL1 is defined in the header file usr/src/uts/common/sys/systm.h:

#define SE_32RVAL1 0x0 /* handler returns int32_t in rval1 */

o Taking modctl as an example, its entry in the sysent table expands to:

{6, 0x0, NULL, NULL, (llfcn_t)modctl}
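
The expansion can be tried out in user space. The following is a sketch only: sysent_model, the llfcn_t typedef and the handler fake_modctl are stand-ins invented for this example, while the two macros are copied from the definitions quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t (*llfcn_t)(void);	/* stand-in for the kernel typedef */

/* Stand-in for struct sysent with the same five fields in the same order. */
struct sysent_model {
	char		sy_narg;
	unsigned short	sy_flags;
	int		(*sy_call)(void);
	void		*sy_lock;
	llfcn_t		sy_callc;
};

#define	SE_32RVAL1	0x0	/* handler returns int32_t in rval1 */
#define	SYSENT_CI(name, call, narg)	\
	{ (narg), SE_32RVAL1, NULL, NULL, (llfcn_t)(call) }

/* Hypothetical handler standing in for the kernel's modctl(). */
int64_t
fake_modctl(void)
{
	return (0);
}

/* The macro ignores `name` and fills the five fields positionally. */
struct sysent_model modctl_entry = SYSENT_CI("modctl", fake_modctl, 6);
```

After initialization, modctl_entry.sy_narg is 6, sy_flags is 0, sy_call and sy_lock are NULL, and sy_callc points at the handler, which is the same shape as the expanded initializer.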

o Check it with mdb:

> (sysent + 0x20 * 0t152)::print -Ta struct sysent 
fffffffffc243480 struct sysent { 
    fffffffffc243480 char sy_narg = '\006' 
    fffffffffc243482 unsigned short sy_flags = 0 
    fffffffffc243488 int (*)() sy_call = 0 
    fffffffffc243490 krwlock_t *sy_lock = 0 
    fffffffffc243498 int64_t (*)() sy_callc = modctl 
} 
> 

Sure enough, sy_narg = 6 and sy_callc = modctl; in other words, the kernel function modctl takes 6 arguments.

o modctl is declared in usr/src/uts/common/os/sysent.c as follows:

int modctl(int, uintptr_t, uintptr_t, uintptr_t, uintptr_t, uintptr_t);

 

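2.4 The system call entry point sys_syscall

o After the user-mode C library function executes the syscall instruction, control enters the kernel, which starts executing at sys_syscall(). Note that sys_syscall() is implemented in assembly; the source file is:
usr/src/uts/i86pc/ml/syscall_asm_amd64.s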

525 _syscall_invoke:
526    movq    REGOFF_RDI(%rbp), %rdi
527    movq    REGOFF_RSI(%rbp), %rsi
528    movq    REGOFF_RDX(%rbp), %rdx
529    movq    REGOFF_RCX(%rbp), %rcx
530    movq    REGOFF_R8(%rbp), %r8
531    movq    REGOFF_R9(%rbp), %r9
532
533    cmpl    $NSYSCALL, %eax
534    jae    _syscall_ill
535    shll    $SYSENT_SIZE_SHIFT, %eax
536    leaq    sysent(%rax), %rbx
537
538    call    *SY_CALLC(%rbx)
539
540    movq    %rax, %r12
541    movq    %rdx, %r13

o Examine sys_syscall() with mdb:

 1 > sys_syscall::dis 
 2 sys_syscall:       swapgs 
 3 ... 
 4 sys_syscall+0x21d: movq  0x10(%rbp),%rdi 
 5 sys_syscall+0x221: movq  0x18(%rbp),%rsi 
 6 sys_syscall+0x225: movq  0x20(%rbp),%rdx 
 7 sys_syscall+0x229: movq  0x28(%rbp),%rcx 
 8 sys_syscall+0x22d: movq  0x30(%rbp),%r8 
 9 sys_syscall+0x231: movq  0x38(%rbp),%r9 
10 sys_syscall+0x235: cmpl  $0x100,%eax 
11 sys_syscall+0x23a: jae   +0x11a   <0xfffffffffb8014bb> 
12 sys_syscall+0x240: shll  $0x5,%eax 
13 sys_syscall+0x243: leaq  0xfffffffffc242180(%rax),%rbx <sysent> 
14 sys_syscall+0x24a: call  *0x18(%rbx) 
15 sys_syscall+0x24d: movq  %rax,%r12 
16 sys_syscall+0x250: movq  %rdx,%r13 
17 ... 
18 nopop_sys_syscall_swapgs_sysretq:   swapgs 
19 nopop_sys_syscall_swapgs_sysretq+3: sysret 
20 ... 

o These 3 lines of syscall_asm_amd64.s,

535    shll    $SYSENT_SIZE_SHIFT, %eax
536    leaq    sysent(%rax), %rbx
538    call    *SY_CALLC(%rbx)

correspond to these 3 lines of the sys_syscall disassembly:

12 sys_syscall+0x240: shll  $0x5,%eax
13 sys_syscall+0x243: leaq  0xfffffffffc242180(%rax),%rbx <sysent> 
14 sys_syscall+0x24a: call  *0x18(%rbx) 

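12: shifts eax, that is, the system call number, left by 5 bits: eax = eax << 5 = eax * 32, where 32 is sizeof (struct sysent);
13: adds the base address of the system call table sysent to rax and stores the result in rbx, so rbx now points to the sysent entry of this system call;
14: adds 0x18 (the offset of sy_callc) to rbx; the value stored at that address is the start address of the service routine, so call [rbx+0x18] invokes the service routine of the system call.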
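
The address arithmetic of these three instructions can be sketched in C. SYSENT_SIZE_SHIFT and SY_CALLC are the symbol names used in the assembly source, with the values 5 and 0x18 taken from the disassembly; the two function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define	SYSENT_SIZE_SHIFT	5	/* log2(sizeof (struct sysent)) = log2(0x20) */
#define	SY_CALLC		0x18	/* offsetof (struct sysent, sy_callc) */

/* Address of the sysent entry for system call `num` (lines 12 and 13). */
uint64_t
sysent_entry_addr(uint64_t sysent_base, uint32_t num)
{
	return (sysent_base + ((uint64_t)num << SYSENT_SIZE_SHIFT));
}

/* Address of the slot holding the handler pointer (operand of line 14). */
uint64_t
sy_callc_slot_addr(uint64_t sysent_base, uint32_t num)
{
	return (sysent_entry_addr(sysent_base, num) + SY_CALLC);
}
```

With the sysent address 0xfffffffffc242180 from the disassembly and 152 for modctl, the entry address is 0xfffffffffc243480 (base + 0x1300) and the sy_callc slot is 0xfffffffffc243498.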

o For example (using modctl):

> ::sizeof struct sysent 
sizeof (struct sysent) = 0x20 

> ::offsetof struct sysent sy_callc 
offsetof (struct sysent, sy_callc) = 0x18, sizeof (...->sy_callc) = 8 
       
> sysent + 0x20 * 0t152 = J 
                fffffffffc243480 
> fffffffffc243480 + 0x18 = J 
                fffffffffc243498 
> fffffffffc243498/J 
sysent+0x1318:  fffffffffbcad4e0 
 
> fffffffffbcad4e0::whatis 
fffffffffbcad4e0 is modctl, in genunix's text segment 
> fffffffffbcad4e0/i      
modctl: 
modctl:         pushq  %rbp 
 
> sysent + 0x20 * 0t152 ::print -Ta struct sysent 
fffffffffc243480 struct sysent { 
    fffffffffc243480 char sy_narg = '\006' 
    fffffffffc243482 unsigned short sy_flags = 0 
    fffffffffc243488 int (*)() sy_call = 0 
    fffffffffc243490 krwlock_t *sy_lock = 0 
    fffffffffc243498 int64_t (*)() sy_callc = modctl 
} 
> 

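Once sys_syscall() has found the service routine (computed from the system call number, of course), it calls that routine, with the system call arguments prepared in registers rdi, rsi, rdx, rcx, r8 and r9.

Lines 4-9 of the disassembly are what load the argument values saved on the stack into the corresponding registers: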

 4 sys_syscall+0x21d: movq  0x10(%rbp),%rdi 
 5 sys_syscall+0x221: movq  0x18(%rbp),%rsi 
 6 sys_syscall+0x225: movq  0x20(%rbp),%rdx 
 7 sys_syscall+0x229: movq  0x28(%rbp),%rcx 
 8 sys_syscall+0x22d: movq  0x30(%rbp),%r8 
 9 sys_syscall+0x231: movq  0x38(%rbp),%r9 
10 sys_syscall+0x235: cmpl  $0x100,%eax 
11 sys_syscall+0x23a: jae   +0x11a   <0xfffffffffb8014bb> 
12 sys_syscall+0x240: shll  $0x5,%eax 
13 sys_syscall+0x243: leaq  0xfffffffffc242180(%rax),%rbx <sysent> 
14 sys_syscall+0x24a: call  *0x18(%rbx)

The service routine then picks the arguments up from the registers, for example:

> fffffffffbcad4e0::dis 
modctl:                         pushq  %rbp 
modctl+1:                       movq   %rsp,%rbp 
modctl+4:                       subq   $0x30,%rsp 
modctl+8:                       movq   %rdi,-0x8(%rbp) 
modctl+0xc:                     movq   %rsi,-0x10(%rbp) 
modctl+0x10:                    movq   %rdx,-0x18(%rbp) 
modctl+0x14:                    movq   %rcx,-0x20(%rbp) 
modctl+0x18:                    movq   %r8,-0x28(%rbp) 
modctl+0x1c:                    movq   %r9,-0x30(%rbp) 
...

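At this point the arguments that the user program passed to the C library function modctl() have travelled from user space into kernel space and are ready for the kernel's modctl() to use. On the way, of course, sys_syscall() stored them on the stack and loaded them back from it. Once the kernel's modctl() finishes, sys_syscall() returns to the user-space modctl() with the sysret instruction.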
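
The path described in this section can be modelled in plain user-space C. This is a simplified sketch, not kernel code: every type and function name in it is hypothetical, the table entry holds only two fields, and the extra check for an empty slot exists only in the model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	NSYSCALL_MODEL	256	/* same bound as NSYSCALL */

typedef int64_t (*handler_t)(long, long, long, long, long, long);

struct sysent_model {
	char		sy_narg;
	handler_t	sy_callc;
};

/* Saved user registers, playing the role of the frame at (%rbp). */
struct regs_model {
	long rdi, rsi, rdx, rcx, r8, r9;
};

struct sysent_model sysent_tbl[NSYSCALL_MODEL];

/* Hypothetical handler: combines its first two arguments into one value. */
int64_t
fake_modctl(long cmd, long id, long a2, long a3, long a4, long a5)
{
	(void)a2; (void)a3; (void)a4; (void)a5;
	return ((int64_t)cmd * 1000 + id);
}

/* Returns 0 and stores the handler's result, or -1 for an illegal number. */
int
dispatch_model(uint32_t num, const struct regs_model *rp, int64_t *result)
{
	if (num >= NSYSCALL_MODEL)		/* cmpl $NSYSCALL; jae _syscall_ill */
		return (-1);
	if (sysent_tbl[num].sy_callc == NULL)	/* model only: empty slot */
		return (-1);
	/* reload the six saved arguments and call *SY_CALLC(%rbx) */
	*result = sysent_tbl[num].sy_callc(rp->rdi, rp->rsi, rp->rdx,
	    rp->rcx, rp->r8, rp->r9);
	return (0);
}
```

After installing fake_modctl in slot 152, dispatching number 152 with rdi = 2 and rsi = 0x10 calls the handler with those values, while any number of 256 or more is rejected before the table is touched.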

 

3 Observing a System Call in Practice

3.1 The simplest, most direct observation

o On terminal A, start mdb to debug the modinfo command

root# mdb /usr/sbin/modinfo 
> main:b 
> :r -i 16 
mdb: target stopped at: 
ld.so.1`rtld_db_postinit:       pushq  %rbp 
> :c 
mdb: target stopped at: 
ld.so.1`rtld_db_dlactivity:     pushq  %rbp 
mdb: You've got symbols! 
Loading modules: [ ld.so.1 libc.so.1 libuutil.so.1 ] 
> modctl:b 
> modctl::dis 
libc.so.1`modctl:               movq   %rcx,%r10 
libc.so.1`modctl+3:             movl   $0x98,%eax 
libc.so.1`modctl+8:             syscall 
libc.so.1`modctl+0xa:           jb     -0x126d30        <libc.so.1`__cerror> 
libc.so.1`modctl+0x10:          xorq   %rax,%rax 
libc.so.1`modctl+0x13:          ret    
> :s 
mdb: target stopped at: 
ld.so.1`rtld_db_dlactivity+1:   movq   %rsp,%rbp 
> :s 
mdb: target stopped at: 
libc.so.1`modctl+3:     movl   $0x98,%eax 
> :s 
mdb: target stopped at: 
libc.so.1`modctl+8:     syscall 

> // before entering kernel mode, take a look at the registers
> $r 
%rax = 0x0000000000000098       %r8  = 0x0000000000000000 
%rbx = 0xffff80fdae44e2d0       %r9  = 0x0000000000000000 
%rcx = 0x000000061bf989a0       %r10 = 0x000000061bf989a0 
%rdx = 0xffff80fdae44e2d0       %r11 = 0x00007ff91dcbfe90 
%rsi = 0x0000000000000010       %r12 = 0xffff80fdae44e2d8 
%rdi = 0x0000000000000002       %r13 = 0x0000000000000000 
...
> // rax = 0x98 = 152 // system call number of modctl
> // rdi = 0x2        // cmd MODINFO = 2
> // rsi = 0x10 = 16  // mod ID 
> // rdx = 0xffff80fdae44e2d0 // struct modinfo *
> 0xffff80fdae44e2d0::print struct modinfo mi_id mi_name mi_msinfo[0] 
mi_id = 0x10 
mi_name = [ '\0', ... ] 
mi_msinfo[0] = { 
    mi_msinfo[0].msi_linkinfo = [ '\001', ... ] 
    mi_msinfo[0].msi_p0 = 0xae44e438 
} 
> 

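> :s   // do not press Enter yet: pressing Enter executes syscall and enters kernel mode

o On terminal B (the console), start mdb -K and set a breakpoint with modctl:b. Then, as soon as Enter is pressed after > :s on terminal A, execution enters kernel mode and can be observed on terminal B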

root# mdb -K 
kmdb: target stopped at: 
kmdb_enter+0xb: movq   %rax,%rdi 
[5]> modctl:b 
[5]> :c 
root#

o Press Enter on terminal A

> :s 
[the cursor blinks here]; the user program is suspended!
Meanwhile, terminal B has entered kernel mode

root# mdb -K 
kmdb: target stopped at: 
kmdb_enter+0xb: movq   %rax,%rdi 
[5]> modctl:b 
[5]> :c 
root# kmdb: stop at modctl 
kmdb: target stopped at: 
modctl:         pushq  %rbp 
[22]> 

o Inspect the call arguments on terminal B

[22]> $r 
%rax = 0x0000000000001300                 %r9  = 0x0000000000000000 
%rbx = 0xfffffffffc243480   sysent+0x1300 %r10 = 0xffff80fdae44e5d8 
%rcx = 0x000000061bf989a0                 %r11 = 0xffff80fdae44e268 
%rdx = 0xffff80fdae44e2d0                 %r12 = 0x0000000000000000 
%rsi = 0x0000000000000010                 %r13 = 0x0000000000000000 
%rdi = 0x0000000000000002                 %r14 = 0xffffa1c009e87000 
%r8  = 0x0000000000000000                 %r15 = 0xffffa1c00458a4c0 

%rip = 0xfffffffffbcad4e0 modctl 
...

    > // rdi = 0x2 
    > // rsi = 0x10
    > // rdx = 0xffff80fdae44e2d0
[22]> 0xffff80fdae44e2d0::print struct modinfo mi_id mi_name 
mi_id = 0x10 
mi_name = [ '\0', ...]

o Enter :z and :c on terminal B to resume

[22]> 
[22]> :z 
[22]> :c 
Meanwhile, the suspended :s on terminal A resumes:
> :s 
mdb: target stopped at: 
libc.so.1`modctl+0xa:   jb     -0x126d30        <libc.so.1`__cerror> 
> 

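This shows that the system call has finished and control has returned to user mode.

o On terminal A, look at the contents at address 0xffff80fdae44e2d0; the data we expect should have been filled in by the system call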

> 0xffff80fdae44e2d0::print struct modinfo 
{ 
    mi_info = 5 
    mi_state = 3 
    mi_id = 0x10 
    mi_nextid = 0x10 
    mi_base = 0xfffffffffbdbc2d8 
    mi_size = 0x5d198 
    mi_rev = 1 
    mi_loadcnt = 1 
    mi_name = [ "pcie" ] 
    mi_msinfo = [ 
        { 
            msi_linkinfo = [ "PCI Express Framework Module" ] 
            msi_p0 = 0xffffffff 
        },
    ...

Note: mi_base, mi_size, mi_name and mi_msinfo[0] have been filled in as expected.

ID  LOADADDR         SIZE   INFO REV NAMEDESC 
16  fffffffffbdbc2d8 5d198  --   1   pcie (PCI Express Framework Module) 
mdb: target has terminated 
> 

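At this point we have observed the whole round trip from user mode into kernel mode and back to user mode, and the mystery surrounding system calls has been lifted. Next we use DTrace to look more closely at what the kernel service routine does.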

 

3.2 Observing kernel behavior with DTrace

o On terminal A, start mdb to debug the modinfo command

root# mdb /usr/sbin/modinfo 
> main:b 
> :r -i 16 
mdb: target stopped at: 
ld.so.1`rtld_db_postinit:       pushq  %rbp 
> :c 
mdb: target stopped at: 
ld.so.1`rtld_db_dlactivity:     pushq  %rbp 
mdb: You've got symbols! 
Loading modules: [ ld.so.1 libc.so.1 libuutil.so.1 ] 
> modctl:b 
> :c 
mdb: stop at libc.so.1`modctl 
mdb: target stopped at: 
libc.so.1`modctl:       movq   %rcx,%r10 
> :s 
mdb: target stopped at: 
libc.so.1`modctl+3:     movl   $0x98,%eax 
> :s 
mdb: target stopped at: 
libc.so.1`modctl+8:     syscall 
> 
> $r 
%rax = 0x0000000000000098       %r8  = 0x0000000000000000 
%rbx = 0xffff80dbdcb43120       %r9  = 0x0000000000000000 
%rcx = 0x0000000f25c48f90       %r10 = 0x0000000f25c48f90 
%rdx = 0xffff80dbdcb43120       %r11 = 0x00007ffdc84bfe90 
%rsi = 0x0000000000000010       %r12 = 0xffff80dbdcb43128 
%rdi = 0x0000000000000002       %r13 = 0x0000000000000000 
...
> 0xffff80dbdcb43120::print struct modinfo mi_id mi_name 
mi_id = 0x10 
mi_name = [ '\0', ...]

o On terminal B, start the dtrace script

root# ./fook.d

  The code of fook.d is as follows:

#!/usr/sbin/dtrace -qs

syscall::modctl:entry
/execname == "modinfo"/
{
        printf("%s:%s:%s :%s\n", probeprov, probemod, probefunc, probename);
        printf("args: 0x%X, 0x%X, 0x%X, 0x%X, 0x%X, 0x%X\n",
                arg0, arg1, arg2, arg3, arg4, arg5);
        stack();
        printf("\n---------------------------------------------------------\n");
        self->n = 1;
}

syscall::modctl:return
/execname == "modinfo"/
{
        self->n = 0;
}

fbt::modctl:entry
/self->n == 1/
{
        printf("%s:%s:%s :%s\n", probeprov, probemod, probefunc, probename);
        printf("args: 0x%X, 0x%X, 0x%X, 0x%X, 0x%X, 0x%X\n",
                arg0, arg1, arg2, arg3, arg4, arg5);
        stack();
        printf("\n---------------------------------------------------------\n");
}

fbt::modctl_modinfo:entry
/self->n == 1/
{
        printf("%s:%s:%s :%s\n", probeprov, probemod, probefunc, probename);
        printf("args: 0x%X, 0x%X\n", arg0, arg1);
        stack();

        self->mip = (struct modinfo *)arg1;

        /*
         * usr/src/uts/common/sys/modctl.h#421
         * struct modinfo {
         *     int                mi_info;     // Flags for info wanted
         *     int                mi_state;    // Flags for module state
         *     int                mi_id;       // id of this loaded module
         *     int                mi_nextid;   // id of next module or -1
         *     caddr_t            mi_base;     // virtual addr of text
         *     size_t             mi_size;     // size of module in bytes
         *     int                mi_rev;      // loadable modules rev
         *     int                mi_loadcnt;  // # of times loaded
         *     char               mi_name[MODMAXNAMELEN]; // name of module
         *     struct modspecific_info mi_msinfo[MODMAXLINK];
         *                                     // mod specific info
         * };
         *
         * struct modspecific_info {
         *     char    msi_linkinfo[MODMAXLINKINFOLEN]; // name in linkage struct
         *     int     msi_p0;                 // module specific information
         * };
         *
         * usr/src/cmd/modload/modinfo.c#248
         *  static boolean_t
         *  print_mod_cb(ofmt_arg_t *ofarg, char *buf, uint_t bufsize)
         *
         * XXX: Here we cannot use self->mip->mi_id,... directly, so copyin !
         */
        self->mi = (struct modinfo *)(copyin((uintptr_t)(self->mip),
                                             sizeof(struct modinfo)));
        printf("\n");
        printf("ENT: ID       mi->mi_id                     = %d\n",
                self->mi->mi_id);
        printf("ENT: LOADADDR mi->mi_base                   = %p\n",
                self->mi->mi_base);
        printf("ENT: SIZE     mi->mi_size                   = %x\n",
                self->mi->mi_size);
        printf("ENT: INFO     mi->mi_mi_msinfo[0].msi_p0    = %d\n",
                self->mi->mi_msinfo[0].msi_p0);
        printf("ENT: REV      mi->mi_rev                    = %x\n",
                self->mi->mi_rev);
        printf("ENT: NAME     mi->mi_name                   = %s\n",
                stringof(self->mi->mi_name));
        printf("ENT: DESC     mi->mi_msinfo[0].msi_linkinfo = %s\n",
                stringof(self->mi->mi_msinfo[0].msi_linkinfo));
        printf("\n---------------------------------------------------------\n");
}

fbt::modctl_modinfo:return
/self->n == 1/
{
        printf("%s:%s:%s :%s\n", probeprov, probemod, probefunc, probename);

        stack();

        /*
         * XXX: Here we cannot use self->mip->mi_id,... directly, so copyin !
         */
        self->mi = (struct modinfo *)(copyin((uintptr_t)(self->mip),
                                             sizeof(struct modinfo)));
        printf("\n");
        printf("RET: ID       mi->mi_id                     = %d\n",
                self->mi->mi_id);
        printf("RET: LOADADDR mi->mi_base                   = %p\n",
                self->mi->mi_base);
        printf("RET: SIZE     mi->mi_size                   = %x\n",
                self->mi->mi_size);
        printf("RET: INFO     mi->mi_mi_msinfo[0].msi_p0    = %d\n",
                self->mi->mi_msinfo[0].msi_p0);
        printf("RET: REV      mi->mi_rev                    = %x\n",
                self->mi->mi_rev);
        printf("RET: NAME     mi->mi_name                   = %s\n",
                stringof(self->mi->mi_name));
        printf("RET: DESC     mi->mi_msinfo[0].msi_linkinfo = %s\n",
                stringof(self->mi->mi_msinfo[0].msi_linkinfo));
        printf("\n---------------------------------------------------------\n");

        self->mip = 0;
}

fbt::copyin:entry,
fbt::copyout:entry
/self->n == 1/
{
        printf("%s:%s:%s :%s\n", probeprov, probemod, probefunc, probename);
        printf("args: 0x%X, 0x%X, 0x%X\n", arg0, arg1, arg2);
        stack();
        printf("\n---------------------------------------------------------\n");
}

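Note: kernel memory and user memory are strictly separated. When the kernel needs to read user memory it must use copyin(); conversely, when it needs to pass data back to user space it must use copyout().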
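
The idea behind the two routines can be illustrated with a toy user-space model. This is only a sketch under loud assumptions: the real routines are kernel functions that also handle page faults, whereas here a fixed buffer named user_space stands in for user memory and every function name is hypothetical. The argument order (source, destination, length) follows the copyin and copyout probes in the DTrace output of this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A toy "user address space": only this buffer counts as user memory. */
unsigned char user_space[4096];

/* Returns 1 if [addr, addr + len) lies entirely inside user_space. */
int
is_user_range(const void *addr, size_t len)
{
	uintptr_t start = (uintptr_t)addr;
	uintptr_t base = (uintptr_t)user_space;

	return (len <= sizeof (user_space) && start >= base &&
	    start - base <= sizeof (user_space) - len);
}

/* Model of copyin: user -> kernel. 0 on success, -1 on a bad user range. */
int
copyin_model(const void *uaddr, void *kaddr, size_t len)
{
	if (!is_user_range(uaddr, len))
		return (-1);
	memcpy(kaddr, uaddr, len);
	return (0);
}

/* Model of copyout: kernel -> user. 0 on success, -1 on a bad user range. */
int
copyout_model(const void *kaddr, void *uaddr, size_t len)
{
	if (!is_user_range(uaddr, len))
		return (-1);
	memcpy(uaddr, kaddr, len);
	return (0);
}
```

The point of the model is that the user-side address is validated before any byte is copied, so a bad pointer handed in by a user program produces an error return instead of an uncontrolled memory access.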

o On terminal A, run :s (which executes the syscall instruction)

> :s 
mdb: target stopped at: 
libc.so.1`modctl+0xa:   jb     -0x126d30        <libc.so.1`__cerror> 
> 

Meanwhile, terminal B prints the following

root# ./fook.d 
syscall::modctl :entry 
args: 0x2, 0x10, 0xFFFF80DBDCB43120, 0xF25C48F90, 0x0, 0x0 

              unix`sys_syscall+0x24d 

--------------------------------------------------------- 
fbt:genunix:modctl :entry 
args: 0x2, 0x10, 0xFFFF80DBDCB43120, 0xF25C48F90, 0x0, 0x0 

              genunix`dtrace_systrace_syscall+0x14d 
              unix`sys_syscall+0x24d 

--------------------------------------------------------- 
fbt:genunix:modctl_modinfo :entry 
args: 0x10, 0xFFFF80DBDCB43120 

              genunix`modctl+0x4e7 
              genunix`dtrace_systrace_syscall+0x14d 
              unix`sys_syscall+0x24d 

ENT: ID       mi->mi_id                     = 16 
ENT: LOADADDR mi->mi_base                   = 0 
ENT: SIZE     mi->mi_size                   = 0 
ENT: INFO     mi->mi_mi_msinfo[0].msi_p0    = -592170360 
ENT: REV      mi->mi_rev                    = 0 
ENT: NAME     mi->mi_name                   = 
ENT: DESC     mi->mi_msinfo[0].msi_linkinfo = ## 

--------------------------------------------------------- 
fbt:unix:copyin :entry 
args: 0xFFFF80DBDCB43120, 0xFFFFFFFC81AEDA30, 0x1B0 

              genunix`modctl_modinfo+0xa0 
              genunix`modctl+0x4e7 
              genunix`dtrace_systrace_syscall+0x14d 
              unix`sys_syscall+0x24d 

--------------------------------------------------------- 
fbt:unix:copyout :entry 
args: 0xFFFFFFFC81AEDA30, 0xFFFF80DBDCB43120, 0x1B0 

              genunix`modctl_modinfo+0x1e6 
              genunix`modctl+0x4e7 
              genunix`dtrace_systrace_syscall+0x14d 
              unix`sys_syscall+0x24d 

--------------------------------------------------------- 
fbt:genunix:modctl_modinfo :return 

              genunix`modctl+0x4e7 
              genunix`dtrace_systrace_syscall+0x14d 
              unix`sys_syscall+0x24d 

RET: ID       mi->mi_id                     = 16 
RET: LOADADDR mi->mi_base                   = fffffffffbdbc2d8 
RET: SIZE     mi->mi_size                   = 5d198 
RET: INFO     mi->mi_mi_msinfo[0].msi_p0    = -1 
RET: REV      mi->mi_rev                    = 1 
RET: NAME     mi->mi_name                   = pcie 
RET: DESC     mi->mi_msinfo[0].msi_linkinfo = PCI Express Framework Module 

---------------------------------------------------------
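
The third argument of both copyin and copyout in the output above is 0x1B0, which is consistent with one struct modinfo being copied in each direction. The sketch below rebuilds the structure from the field list quoted in fook.d and checks its size. The values 32, 10 and 32 for MODMAXNAMELEN, MODMAXLINK and MODMAXLINKINFOLEN are assumptions that do not appear in the text above, the _model names are hypothetical, and the check assumes an LP64 target.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values of the modctl.h constants; not taken from the text above. */
#define	MODMAXNAMELEN		32
#define	MODMAXLINK		10
#define	MODMAXLINKINFOLEN	32

struct modspecific_info_model {
	char	msi_linkinfo[MODMAXLINKINFOLEN];	/* name in linkage struct */
	int	msi_p0;					/* module specific information */
};

struct modinfo_model {
	int	mi_info;
	int	mi_state;
	int	mi_id;
	int	mi_nextid;
	char	*mi_base;	/* caddr_t in the kernel header */
	size_t	mi_size;
	int	mi_rev;
	int	mi_loadcnt;
	char	mi_name[MODMAXNAMELEN];
	struct modspecific_info_model mi_msinfo[MODMAXLINK];
};

/* 16 + 8 + 8 + 8 + 32 + 10 * 36 = 432 = 0x1B0 on LP64. */
static_assert(sizeof (struct modinfo_model) == 0x1B0,
    "matches the length passed to copyin and copyout");
```

Under these assumptions the arithmetic works out exactly, with no padding needed between or after the fields.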

o On terminal A, look at the contents at address 0xffff80dbdcb43120

> 0xffff80dbdcb43120::print struct modinfo 
{ 
    mi_info = 5 
    mi_state = 3 
    mi_id = 0x10 
    mi_nextid = 0x10 
    mi_base = 0xfffffffffbdbc2d8 
    mi_size = 0x5d198 
    mi_rev = 1 
    mi_loadcnt = 1 
    mi_name = [ "pcie" ] 
    mi_msinfo = [ 
        { 
            msi_linkinfo = [ "PCI Express Framework Module" ] 
            msi_p0 = 0xffffffff 
        },
  ...

This output matches the data observed with DTrace.

o On terminal A, run a dtrace script to observe the user-mode call stack

root# ./foou.d -c "modinfo -i 16" 
ID  LOADADDR         SIZE   INFO REV NAMEDESC 
16  fffffffffbdbc2d8 5d198  --   1   pcie (PCI Express Framework Module) 

pid22782:libc.so.1:modctl :entry 
args: 0x2, 0x10, 0xFFFF80F02D9A1730, 0x927D90DF0, 0x0 

              libc.so.1`modctl 
              modinfo`main+0x3b6 
              modinfo`0x7ffe48701b34

The code of foou.d is as follows:

#!/usr/sbin/dtrace -qs

pid$target::modctl:entry
/execname == "modinfo"/
{
        printf("\n%s:%s:%s :%s\n", probeprov, probemod, probefunc, probename);
        printf("args: 0x%X, 0x%X, 0x%X, 0x%X, 0x%X\n",
                arg0, arg1, arg2, arg3, arg4);
        ustack();
}

o Observe the kernel-mode call stack on terminal B

(1) Start DTrace on terminal B,

root# dtrace -n "fbt::mod_infonull:entry {stack();}" 
dtrace: description 'fbt::mod_infonull:entry ' matched 1 probe 

(2) On terminal A, run the command modinfo -i 16

root# modinfo -i 16 
ID  LOADADDR         SIZE   INFO REV NAMEDESC 
16  fffffffffbdbc2d8 5d198  --   1   pcie (PCI Express Framework Module)

Meanwhile, the output on terminal B is:

root# dtrace -n "fbt::mod_infonull:entry {stack();}" 
dtrace: description 'fbt::mod_infonull:entry ' matched 1 probe 
 CPU     ID                    FUNCTION:NAME 
   3  54642               mod_infonull:entry 
              genunix`mod_info+0x66 
              pcie`_info+0x1f 
              genunix`mod_getinfo+0x5a 
              genunix`modinfo+0x125 
              genunix`modctl_modinfo+0xd5 
              genunix`modctl+0x4e7 
              unix`sys_syscall+0x24d 

^C 

 

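4 Programming directly with system calls

The simple example below shows that, once the system call number and the arguments are in place, the syscall instruction alone is enough to make a system call.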

BITS 64

SECTION .data

Hello:          db "Hello world!", 10
len_Hello:      equ $-Hello

SECTION .text

global _start

_start:
        mov rdi, 1              ; fd = stdout
        mov rsi, Hello          ; *buf = Hello
        mov rdx, len_Hello      ; count = len_Hello
        mov rax, 4              ; write system call number on Solaris
        syscall

        mov rdi, 0              ; status = 0 (exit normally)
        mov rax, 1              ; exit system call number on Solaris
        syscall

Assemble, link and run it as follows:

root# yasm -f elf64 foo.asm 
root# ld -o foo foo.o 
root# ./foo 
Hello world! 
root# echo $? 
0 
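
The same idea can be expressed in C without writing assembly, assuming the platform's C library provides the generic syscall() function and the SYS_write constant in <sys/syscall.h>. That availability is an assumption of this sketch, and raw_write is a hypothetical helper name.

```c
#define	_DEFAULT_SOURCE		/* exposes syscall() in glibc's <unistd.h> */
#include <assert.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Writes msg to fd through the generic syscall() entry point instead of
 * the write() wrapper. Returns the number of bytes written, or -1 on error.
 */
long
raw_write(int fd, const char *msg)
{
	return (syscall(SYS_write, fd, msg, strlen(msg)));
}
```

Calling raw_write(1, "Hello world!\n") prints the greeting and returns its length. Using the symbolic SYS_write instead of a literal keeps the code independent of the numbering, which differs between operating systems (the assembly example above uses 4 for write).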

 

Finally, for how to add a system call to Solaris, see Appendix B, Adding a System Call to Solaris, of the book Solaris Internals (2nd edition).

Recommended reading: The Definitive Guide to Linux System Calls

posted @ 2017-01-16 22:41  veli