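A Brief Analysis of Kprobes (Kernel Probes)
1. What is kprobes
kprobes is a low-level mechanism based on dynamic instrumentation. It can break into almost any kernel code path at runtime and collect debugging and performance data without modifying the source of the code under analysis. You can trap at almost any kernel function address [1]; when that address is hit and a handler has been bound to it, the handler is called as well.
[1] Some kernel functions cannot be probed; see the kprobes blacklist: adb shell cat /sys/kernel/debug/kprobes/blacklist
kprobes currently comes in two types: kprobe and kretprobe. A kprobe can be inserted on virtually any instruction in the kernel, while a kretprobe fires when a specified function returns.
In general, code that uses kprobes is packaged as a kernel module (ko). The module's init function installs (registers) one or more probes, and its exit function unregisters them.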
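2. How kprobe works
On ARM64, registering a kprobe replaces the instruction at the probe point with a brk (breakpoint exception) instruction.
- When the CPU reaches that point it takes an exception, the CPU registers are saved, and control is handed to kprobes (on arm64 through the break hook list analyzed later in this article).
- kprobes runs the pre_handler bound to the probe point, passing it the struct kprobe and the saved registers.
- It then sets up single-stepping: the next instruction to execute is set to a copy of the original instruction at the probe point, and the CPU returns from the exception.
- After that instruction has executed, the CPU traps a second time. Single-stepping is torn down, the post_handler runs, and the CPU returns safely from the exception and continues with the original function.
The basic idea is to expand the execution of one instruction into the sequence kprobe->pre_handler() ---> set up single-stepping ---> original first instruction ---> kprobe->post_handler().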
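The ordering of these steps can be illustrated with a small user-space model. This is only a sketch, not kernel code: the struct `probe_trace` and the functions `record_step`, `first_trap`, `second_trap` and `run_probe_hit` are hypothetical names invented for the illustration, and the "handlers" merely append their name to a trace string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model: a text trace of the steps taken for one probe hit. */
struct probe_trace {
    char text[128];
};

/* Append one step name to the trace, separated by " -> ". */
void record_step(struct probe_trace *t, const char *step)
{
    size_t room = sizeof(t->text) - strlen(t->text) - 1;

    if (t->text[0] != '\0') {
        strncat(t->text, " -> ", room);
        room = sizeof(t->text) - strlen(t->text) - 1;
    }
    strncat(t->text, step, room);
}

/* First trap: the brk planted at the probe point. */
void first_trap(struct probe_trace *t)
{
    record_step(t, "pre_handler");
    record_step(t, "setup_singlestep");
}

/* Second trap: taken right after the original instruction has run. */
void second_trap(struct probe_trace *t)
{
    record_step(t, "post_handler");
}

/* One complete probe hit, in the order described in the text. */
const char *run_probe_hit(struct probe_trace *t)
{
    t->text[0] = '\0';
    first_trap(t);
    record_step(t, "original_insn");   /* executed between the two traps */
    second_trap(t);
    record_step(t, "resume");          /* back in the probed function */
    return t->text;
}
```

Calling `run_probe_hit` on a fresh `struct probe_trace` yields the step order as a single string, which makes the "one instruction becomes two traps" idea easy to see.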
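In a little more detail, the flow looks roughly like the following flowchart:
[Figure: kprobe flowchart]
In addition, when several kprobe instances are registered at the same probed address, the kernel introduces the concept of a kprobe aggregator: a single kprobe instance takes over all the kprobes registered at that address. This article does not analyze that case, in order to keep the flow simple and clear.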
Registration
static char symbol_f2fs_write_pages_enter[MAX_SYMBOL_LEN] = "f2fs_write_data_pages";
module_param_string(symbol_f2fs_write_pages_enter, symbol_f2fs_write_pages_enter, sizeof(symbol_f2fs_write_pages_enter), 0644);
Initialize the struct kprobe and define the probe point:
static struct kprobe kp_enter = {
    .symbol_name = symbol_f2fs_write_pages_enter,
};
Define the pre_handler function:
kp_enter.pre_handler = handler_pre_enter;
register_kprobe is then called. Its call hierarchy and key steps are as follows:
register_kprobe(struct kprobe *p)                /* p = &kp_enter */
    /* Adjust probe address from symbol */
    addr = kprobe_addr(p);
    p->addr = addr;                              /* address of the probed symbol */
    p->flags &= KPROBE_FLAG_DISABLED;            /* the flag is not set at init, so the probe is enabled by default */
    |--prepare_kprobe(p);
        |--arch_prepare_kprobe(p)
            /* copy instruction */
            p->opcode = le32_to_cpu(*p->addr);   /* save the original instruction at the probe point in p->opcode */
            p->ainsn.api.insn = get_insn_slot(); /* allocate a slot; the copy of opcode will live at this address */
            |---arch_prepare_ss_slot(p);
                kprobe_opcode_t *addr = p->ainsn.api.insn;
                void *addrs[] = {addr, addr + 1};
                u32 insns[] = {p->opcode, BRK64_OPCODE_KPROBES_SS};  /* (AARCH64_BREAK_MON | (KPROBES_BRK_SS_IMM << 5)) */
                p->ainsn.api.restore = (unsigned long) p->addr + sizeof(kprobe_opcode_t);  /* address of the next instruction: str x30, [x18],#8 */
                |---aarch64_insn_patch_text(addrs, insns, 2);
                    |---aarch64_insn_patch_text_cb(void *arg)
                        for (i = 0; ret == 0 && i < pp->insn_cnt; i++)   /* insn_cnt = 2 */
                            ret = aarch64_insn_patch_text_nosync(pp->text_addrs[i], pp->new_insns[i]);  /* patch the two instructions in turn */
                            |---aarch64_insn_write(tp, insn);  /* writes p->opcode at addr and BRK64_OPCODE_KPROBES_SS at addr + 1; page faults are disabled during the write and re-enabled afterwards, then the cache is flushed */
    |---arm_kprobe(p);
        |---__arm_kprobe(kp);
            |---arch_arm_kprobe(p);
                void *addr = p->addr;
                u32 insn = BRK64_OPCODE_KPROBES;
                aarch64_insn_patch_text(&addr, &insn, 1);
Disassembly of the original function:
ffffffc010874988 <f2fs_write_data_pages>:
ffffffc010874988: d503233f hint #0x19                 // this instruction is saved in p->opcode
ffffffc01087498c: f800865e str x30, [x18],#8          // this address is saved in p->ainsn.api.restore
ffffffc010874990: a9bf7bfd stp x29, x30, [sp,#-16]!
ffffffc010874994: 910003fd mov x29, sp
ffffffc010874998: f9400008 ldr x8, [x0]
ffffffc01087499c: d5384109 mrs x9, sp_el0
ffffffc0108749a0: 5280008a mov w10, #0x4              // #4
ffffffc0108749a4: f941ad08 ldr x8, [x8,#856]
ffffffc0108749a8: eb09011f cmp x8, x9
ffffffc0108749ac: 52800128 mov w8, #0x9               // #9
ffffffc0108749b0: 1a8a0102 csel w2, w8, w10, eq
ffffffc0108749b4: 94001511 bl ffffffc010879df8 <__f2fs_write_data_pages>
ffffffc0108749b8: a8c17bfd ldp x29, x30, [sp],#16
ffffffc0108749bc: f85f8e5e ldr x30, [x18,#-8]!
ffffffc0108749c0: d50323bf hint #0x1d
ffffffc0108749c4: d65f03c0 ret
After prepare_kprobe(p), the slot contains:
Address                 Instruction
p->ainsn.api.insn       hint #0x19                  (copy of p->opcode)
p->ainsn.api.insn+1     BRK64_OPCODE_KPROBES_SS     (the inserted brk #0x6)
After arm_kprobe(p), the function becomes:
ffffffc010874988 <f2fs_write_data_pages>:
ffffffc010874988: d4200080 brk #0x4                   // BRK64_OPCODE_KPROBES has replaced hint #0x19
ffffffc01087498c: f800865e str x30, [x18],#8          // this address is saved in p->ainsn.api.restore
The two brk instructions are defined as follows:
/*
 * #imm16 values used for BRK instruction generation
 * 0x004: for installing kprobes
 * 0x005: for installing uprobes
 * 0x006: for kprobe software single-step
 * Allowed values for kgdb are 0x400 - 0x7ff
 * 0x100: for triggering a fault on purpose (reserved)
 * 0x400: for dynamic BRK instruction
 * 0x401: for compile time BRK instruction
 * 0x800: kernel-mode BUG() and WARN() traps
 * 0x9xx: tag-based KASAN trap (allowed values 0x900 - 0x9ff)
 */
#define KPROBES_BRK_IMM         0x004
#define KPROBES_BRK_SS_IMM      0x006

/*
 * BRK instruction encoding
 * The #imm16 value should be placed at bits[20:5] within BRK ins
 */
#define AARCH64_BREAK_MON       0xd4200000
#define BRK64_OPCODE_KPROBES    (AARCH64_BREAK_MON | (KPROBES_BRK_IMM << 5))
#define BRK64_OPCODE_KPROBES_SS (AARCH64_BREAK_MON | (KPROBES_BRK_SS_IMM << 5))
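To make the encoding concrete, the following user-space sketch computes the two opcodes from the constants quoted above. The constants are copied from the listing; the helper functions `brk_encode` and `brk_decode_imm` are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>

/* Constants copied from the kernel definitions quoted above. */
#define AARCH64_BREAK_MON   0xd4200000u
#define KPROBES_BRK_IMM     0x004u
#define KPROBES_BRK_SS_IMM  0x006u

/* Hypothetical helper: build a BRK opcode; imm16 occupies bits [20:5]. */
uint32_t brk_encode(uint32_t imm16)
{
    return AARCH64_BREAK_MON | ((imm16 & 0xffffu) << 5);
}

/* Hypothetical helper: recover imm16 from a BRK opcode. */
uint32_t brk_decode_imm(uint32_t insn)
{
    return (insn >> 5) & 0xffffu;
}
```

With these helpers, `brk_encode(KPROBES_BRK_IMM)` gives 0xd4200080 (brk #0x4) and `brk_encode(KPROBES_BRK_SS_IMM)` gives 0xd42000c0 (brk #0x6).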
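So in theory, once the kprobe is enabled, executing f2fs_write_data_pages hits brk #0x4 and enters the software single-step flow.
brk trap flow
The brk instruction raises a software breakpoint exception and records the exception class (EC) and the imm16 value in ESR_ELx; imm16 goes into the ISS field.
In other words, brk ends up as a synchronous exception, and the exception information (EC = 0x3C, imm16) is saved in ESR_ELx.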
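As a rough illustration of what ends up in ESR_ELx, the sketch below builds the syndrome value for a brk and decodes it again. The macro names mirror the kernel's (`ESR_ELx_EC_SHIFT`, `ESR_ELx_EC_BRK64`, `ESR_ELx_IL`, `ESR_ELx_BRK64_ISS_COMMENT_MASK`); the functions `make_brk_esr`, `esr_ec` and `esr_brk_imm` are hypothetical helpers, and the model assumes a 32-bit A64 instruction so that the IL bit is set.

```c
#include <stdint.h>

#define ESR_ELx_EC_SHIFT                26
#define ESR_ELx_EC_MASK                 0x3fu
#define ESR_ELx_EC_BRK64                0x3cu
#define ESR_ELx_IL                      (1u << 25)   /* 32-bit instruction */
#define ESR_ELx_BRK64_ISS_COMMENT_MASK  0xffffu

/* Hypothetical helper: the syndrome a "brk #imm16" would produce. */
uint32_t make_brk_esr(uint32_t imm16)
{
    return (ESR_ELx_EC_BRK64 << ESR_ELx_EC_SHIFT) | ESR_ELx_IL |
           (imm16 & ESR_ELx_BRK64_ISS_COMMENT_MASK);
}

/* Hypothetical helper: exception class, bits [31:26]. */
uint32_t esr_ec(uint32_t esr)
{
    return (esr >> ESR_ELx_EC_SHIFT) & ESR_ELx_EC_MASK;
}

/* Hypothetical helper: the brk immediate, low 16 bits of the ISS. */
uint32_t esr_brk_imm(uint32_t esr)
{
    return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
}
```

Under these assumptions a brk #0x4 corresponds to a syndrome of 0xf2000004, from which EC = 0x3C and imm16 = 0x004 can be read back.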
After brk fires, the exception is routed according to the current exception level. f2fs_write_data_pages runs in kernel mode, i.e. EL1, so the exception information above is saved in ESR_EL1.
The el1_sync vector is therefore taken, and it calls el1_sync_handler:
/*
 * EL1 mode handlers.
 */
    .align  6
SYM_CODE_START_LOCAL_NOALIGN(el1_sync)
    kernel_entry 1
    mov     x0, sp                  // pass sp as the argument (struct pt_regs *)
    bl      el1_sync_handler
    kernel_exit 1
SYM_CODE_END(el1_sync)
This branches to:
#define ESR_ELx_EC_BRK64    (0x3C)

asmlinkage void noinstr el1_sync_handler(struct pt_regs *regs)
{
    unsigned long esr = read_sysreg(esr_el1);    /* read ESR_EL1 */

    switch (ESR_ELx_EC(esr)) {
    case ESR_ELx_EC_DABT_CUR:
    case ESR_ELx_EC_IABT_CUR:
        el1_abort(regs, esr);
        break;
    /*
     * We don't handle ESR_ELx_EC_SP_ALIGN, since we will have hit a
     * recursive exception when trying to push the initial pt_regs.
     */
    case ESR_ELx_EC_PC_ALIGN:
        el1_pc(regs, esr);
        break;
    case ESR_ELx_EC_SYS64:
    case ESR_ELx_EC_UNKNOWN:
        el1_undef(regs);
        break;
    case ESR_ELx_EC_BREAKPT_CUR:
    case ESR_ELx_EC_SOFTSTP_CUR:
    case ESR_ELx_EC_WATCHPT_CUR:
    case ESR_ELx_EC_BRK64:                       /* matches the EC value of brk: 0x3C */
        el1_dbg(regs, esr);
        break;
    case ESR_ELx_EC_FPAC:
        el1_fpac(regs, esr);
        break;
    default:
        el1_inv(regs, esr);
    }
}
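static void noinstr el1_dbg(struct pt_regs *regs, unsigned long esr)
{
    unsigned long far = read_sysreg(far_el1);    /* read the far_el1 register */

    arm64_enter_el1_dbg(regs);    /* lockdep_hardirqs_off(CALLER_ADDR0): lockdep bookkeeping for "hard IRQs off"; not implemented on the current platform */
    do_debug_exception(far, esr, regs);
    arm64_exit_el1_dbg(regs);     /* lockdep_hardirqs_on(CALLER_ADDR0): the matching bookkeeping; not implemented on the current platform */
}
far_el1 is handed on as addr_if_watchpoint and only matters for watchpoints. For a brk exception, the address of the trapping instruction (the first brk) is taken from the saved PC via instruction_pointer(regs), which is what kprobe_handler uses later.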
Next, look at do_debug_exception:
void do_debug_exception(unsigned long addr_if_watchpoint, unsigned int esr,
                        struct pt_regs *regs)
{
    const struct fault_info *inf = esr_to_debug_fault_info(esr);    /* look up the fault info that matches esr */
    unsigned long pc = instruction_pointer(regs);

    if (cortex_a76_erratum_1463225_debug_handler(regs))
        return;

    debug_exception_enter(regs);    /* disable preemption */

    if (user_mode(regs) && !is_ttbr0_addr(pc))
        arm64_apply_bp_hardening();

    if (inf->fn(addr_if_watchpoint, esr, regs)) {
        arm64_notify_die(inf->name, regs, inf->sig, inf->code, pc, esr);
    }

    debug_exception_exit(regs);     /* re-enable preemption */
}
Here is how the fault_info is obtained:
#define DBG_ESR_EVT(x)    (((x) >> 27) & 0x7)

static inline const struct fault_info *esr_to_debug_fault_info(unsigned int esr)
{
    return debug_fault_info + DBG_ESR_EVT(esr);    /* bits 27-29 of esr_el1 */
}

struct fault_info {
    int (*fn)(unsigned long far, unsigned int esr,
              struct pt_regs *regs);
    int sig;
    int code;
    const char *name;
};

/*
 * __refdata because early_brk64 is __init, but the reference to it is
 * clobbered at arch_initcall time.
 * See traps.c and debug-monitors.c:debug_traps_init().
 */
static struct fault_info __refdata debug_fault_info[] = {
    { do_bad,       SIGTRAP, TRAP_HWBKPT, "hardware breakpoint"   },
    { do_bad,       SIGTRAP, TRAP_HWBKPT, "hardware single-step"  },
    { do_bad,       SIGTRAP, TRAP_HWBKPT, "hardware watchpoint"   },
    { do_bad,       SIGKILL, SI_KERNEL,   "unknown 3"             },
    { do_bad,       SIGTRAP, TRAP_BRKPT,  "aarch32 BKPT"          },
    { do_bad,       SIGKILL, SI_KERNEL,   "aarch32 vector catch"  },
    { early_brk64,  SIGTRAP, TRAP_BRKPT,  "aarch64 BRK"           },
    { do_bad,       SIGKILL, SI_KERNEL,   "unknown 7"             },
};
The macro shifts esr right by 27 bits and keeps the low 3 bits. EC sits in bits [31:26] of esr and equals 0x3C, i.e. 0b111100, so bits [29:27] are 0b110 and the table offset is 6.
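The index calculation can be checked with a few lines of user-space C. `DBG_ESR_EVT` is copied from the listing above and the names in the table are those of debug_fault_info[]; the functions `debug_event_index` and `debug_event_name` are hypothetical helpers for this sketch, and the sample syndrome 0xf2000004 assumes a brk #0x4 taken from a 32-bit A64 instruction.

```c
#include <stdint.h>

/* Copied from the kernel listing above. */
#define DBG_ESR_EVT(x)  (((x) >> 27) & 0x7)

/* Entry names in the same order as debug_fault_info[]. */
static const char *const debug_event_names[8] = {
    "hardware breakpoint",
    "hardware single-step",
    "hardware watchpoint",
    "unknown 3",
    "aarch32 BKPT",
    "aarch32 vector catch",
    "aarch64 BRK",
    "unknown 7",
};

/* Hypothetical helper: table index selected by a syndrome value. */
unsigned int debug_event_index(uint32_t esr)
{
    return DBG_ESR_EVT(esr);
}

/* Hypothetical helper: name of the table entry selected by a syndrome value. */
const char *debug_event_name(uint32_t esr)
{
    return debug_event_names[debug_event_index(esr)];
}
```

For the sample syndrome 0xf2000004 the index is 6, which selects the "aarch64 BRK" entry.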
The matching debug_fault_info entry is therefore:
    { early_brk64,  SIGTRAP, TRAP_BRKPT,  "aarch64 BRK"           },
So what does inf->fn point to? Not early_brk64, in fact: the entry is changed later.
To see where it is assigned, start with the trap_init initialization:
void __init trap_init(void)
{
    register_kernel_break_hook(&bug_break_hook);
    register_kernel_break_hook(&fault_break_hook);
#ifdef CONFIG_KASAN_SW_TAGS
    register_kernel_break_hook(&kasan_break_hook);
#endif
    debug_traps_init();
}
fn is bound through hook_debug_fault_code:
#define DBG_ESR_EVT_BRK    0x6

void __init debug_traps_init(void)
{
    hook_debug_fault_code(DBG_ESR_EVT_HWSS, single_step_handler, SIGTRAP,
                          TRAP_TRACE, "single-step handler");
    hook_debug_fault_code(DBG_ESR_EVT_BRK, brk_handler, SIGTRAP,
                          TRAP_BRKPT, "BRK handler");
}

void __init hook_debug_fault_code(int nr,    /* nr = 0x6 here */
                                  int (*fn)(unsigned long, unsigned int, struct pt_regs *),
                                  int sig, int code, const char *name)
{
    BUG_ON(nr < 0 || nr >= ARRAY_SIZE(debug_fault_info));

    debug_fault_info[nr].fn   = fn;      /* brk_handler */
    debug_fault_info[nr].sig  = sig;     /* SIGTRAP */
    debug_fault_info[nr].code = code;    /* TRAP_BRKPT */
    debug_fault_info[nr].name = name;    /* "BRK handler" */
}
Both fn and name have changed; sig and code keep the same values.
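The effect of overwriting a table entry at init time can be modelled in user space. Everything here is a simplified, hypothetical stand-in: the `*_model` functions return distinct marker values (1, 100, 200) purely so that the caller can tell which handler ran, and the table keeps only the fn and name fields.

```c
#include <stddef.h>

typedef int (*debug_fn_model)(unsigned int esr);

struct fault_info_model {
    debug_fn_model fn;
    const char *name;
};

/* Stand-ins for do_bad, early_brk64 and brk_handler; the return values are markers. */
int do_bad_model(unsigned int esr)      { (void)esr; return 1; }
int early_brk64_model(unsigned int esr) { (void)esr; return 100; }
int brk_handler_model(unsigned int esr) { (void)esr; return 200; }

struct fault_info_model debug_fault_info_model[8] = {
    { do_bad_model,      "hardware breakpoint"  },
    { do_bad_model,      "hardware single-step" },
    { do_bad_model,      "hardware watchpoint"  },
    { do_bad_model,      "unknown 3"            },
    { do_bad_model,      "aarch32 BKPT"         },
    { do_bad_model,      "aarch32 vector catch" },
    { early_brk64_model, "aarch64 BRK"          },
    { do_bad_model,      "unknown 7"            },
};

/* Model of hook_debug_fault_code: returns -1 instead of BUG_ON for a bad index. */
int hook_debug_fault_code_model(int nr, debug_fn_model fn, const char *name)
{
    if (nr < 0 || nr >= 8)
        return -1;
    debug_fault_info_model[nr].fn = fn;
    debug_fault_info_model[nr].name = name;
    return 0;
}

/* Model of the inf->fn(...) call in do_debug_exception. */
int dispatch_debug_model(unsigned int esr)
{
    return debug_fault_info_model[(esr >> 27) & 0x7].fn(esr);
}
```

Before the hook is installed a brk syndrome reaches the early handler; after `hook_debug_fault_code_model(6, brk_handler_model, "BRK handler")` the same syndrome reaches the replacement.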
So inf->fn(addr_if_watchpoint, esr, regs) ends up executing brk_handler:
static int brk_handler(unsigned long unused, unsigned int esr,
                       struct pt_regs *regs)
{
    if (call_break_hook(regs, esr) == DBG_HOOK_HANDLED)
        return 0;

    if (user_mode(regs)) {
        send_user_sigtrap(TRAP_BRKPT);
    } else {
        pr_warn("Unexpected kernel BRK exception at EL1\n");
        return -EFAULT;
    }

    return 0;
}

static int call_break_hook(struct pt_regs *regs, unsigned int esr)
{
    struct break_hook *hook;
    struct list_head *list;
    int (*fn)(struct pt_regs *regs, unsigned int esr) = NULL;

    list = user_mode(regs) ? &user_break_hook : &kernel_break_hook;    /* we are on the kernel hook list */

    /*
     * Since brk exception disables interrupt, this function is
     * entirely not preemptible, and we can use rcu list safely here.
     */
    list_for_each_entry_rcu(hook, list, node) {
        unsigned int comment = esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;    /* the ISS field of ESR, which holds the imm value of the brk instruction */

        if ((comment & ~hook->mask) == hook->imm)    /* compare imm to find the matching hook */
            fn = hook->fn;
    }

    return fn ? fn(regs, esr) : DBG_HOOK_ERROR;
}
Now go back and look at how the kernel hooks are registered:
int __init arch_init_kprobes(void)
{
    register_kernel_break_hook(&kprobes_break_hook);
    register_kernel_break_hook(&kprobes_break_ss_hook);

    return 0;
}

void register_kernel_break_hook(struct break_hook *hook)
{
    register_debug_hook(&hook->node, &kernel_break_hook);
}
And the registration data of the two kprobe hooks:
static struct break_hook kprobes_break_hook = {
    .imm = KPROBES_BRK_IMM,       /* the imm of the brk #0x4 that replaces the probed instruction */
    .fn = kprobe_breakpoint_handler,
};

static struct break_hook kprobes_break_ss_hook = {
    .imm = KPROBES_BRK_SS_IMM,    /* the imm of the brk #0x6 placed in the slot */
    .fn = kprobe_breakpoint_ss_handler,
};
So a brk instruction enters a different hook depending on its imm16 value.
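The imm-based dispatch can be sketched in user space as follows. The matching rule `(comment & ~mask) == imm` and the values of `DBG_HOOK_HANDLED` / `DBG_HOOK_ERROR` follow the kernel code quoted above; the array-based hook list, the `*_model` names and the `last_hook_imm` bookkeeping are assumptions made only for this illustration (the kernel uses an RCU-protected linked list).

```c
#include <stddef.h>
#include <stdint.h>

#define DBG_HOOK_HANDLED  0
#define DBG_HOOK_ERROR    1
#define ESR_ELx_BRK64_ISS_COMMENT_MASK 0xffffu

struct break_hook_model {
    uint16_t imm;
    uint16_t mask;
    int (*fn)(uint32_t esr);
};

/* Records which hook ran last, so the caller can observe the dispatch. */
static unsigned int last_hook_imm;

unsigned int last_hook_called(void) { return last_hook_imm; }

int kprobe_breakpoint_model(uint32_t esr)
{
    (void)esr;
    last_hook_imm = 0x004;
    return DBG_HOOK_HANDLED;
}

int kprobe_breakpoint_ss_model(uint32_t esr)
{
    (void)esr;
    last_hook_imm = 0x006;
    return DBG_HOOK_HANDLED;
}

/* Only the two kprobes hooks are registered in this model. */
static const struct break_hook_model kernel_hooks_model[] = {
    { 0x004, 0, kprobe_breakpoint_model },
    { 0x006, 0, kprobe_breakpoint_ss_model },
};

int call_break_hook_model(uint32_t esr)
{
    int (*fn)(uint32_t) = NULL;
    uint32_t comment = esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
    size_t i;

    for (i = 0; i < sizeof(kernel_hooks_model) / sizeof(kernel_hooks_model[0]); i++) {
        const struct break_hook_model *hook = &kernel_hooks_model[i];

        if ((comment & ~(uint32_t)hook->mask) == hook->imm)
            fn = hook->fn;
    }
    return fn ? fn(esr) : DBG_HOOK_ERROR;
}
```

A syndrome whose low 16 bits are 0x004 lands in the first hook, 0x006 in the second, and any immediate without a registered hook in this model returns `DBG_HOOK_ERROR`.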
Execution
First comes brk #0x4, which calls kprobe_breakpoint_handler:
static int __kprobes
kprobe_breakpoint_handler(struct pt_regs *regs, unsigned int esr)
{
    kprobe_handler(regs);
    return DBG_HOOK_HANDLED;
}

static void __kprobes kprobe_handler(struct pt_regs *regs)
{
    struct kprobe *p, *cur_kprobe;
    struct kprobe_ctlblk *kcb;
    unsigned long addr = instruction_pointer(regs);

    kcb = get_kprobe_ctlblk();
    cur_kprobe = kprobe_running();

    p = get_kprobe((kprobe_opcode_t *) addr);

    if (p) {
        if (cur_kprobe) {
            if (reenter_kprobe(p, regs, kcb))
                return;
        } else {
            /* Probe hit */
            set_current_kprobe(p);
            kcb->kprobe_status = KPROBE_HIT_ACTIVE;

            /*
             * If we have no pre-handler or it returned 0, we
             * continue with normal processing.  If we have a
             * pre-handler and it returned non-zero, it will
             * modify the execution path and no need to single
             * stepping. Let's just reset current kprobe and exit.
             */
            if (!p->pre_handler || !p->pre_handler(p, regs)) {    /* calls the pre_handler registered with the kprobe */
                setup_singlestep(p, regs, kcb, 0);                 /* then sets up single-stepping and points pc at the slot that holds the copy of the original instruction */
            } else
                reset_current_kprobe();
        }
    }
    /*
     * The breakpoint instruction was removed right
     * after we hit it.  Another cpu has removed
     * either a probepoint or a debugger breakpoint
     * at this address.  In either case, no further
     * handling of this interrupt is appropriate.
     * Return back to original instruction, and continue.
     */
}
To recap the state:
p->ainsn.api.insn: address of the slot that holds the copy of the original function's first instruction
p->opcode: the original function's first instruction, hint #0x19
kcb->ss_ctx.match_addr: p->ainsn.api.insn + sizeof(kprobe_opcode_t), the address that now holds brk #0x6
static void __kprobes
set_ss_context(struct kprobe_ctlblk *kcb, unsigned long addr)
{
    kcb->ss_ctx.ss_pending = true;
    kcb->ss_ctx.match_addr = addr + sizeof(kprobe_opcode_t);    /* address of the second instruction in the slot: brk #0x6 */
}

static void __kprobes setup_singlestep(struct kprobe *p,
                                       struct pt_regs *regs,
                                       struct kprobe_ctlblk *kcb, int reenter)
{
    unsigned long slot;

    if (reenter) {
        save_previous_kprobe(kcb);
        set_current_kprobe(p);
        kcb->kprobe_status = KPROBE_REENTER;
    } else {
        kcb->kprobe_status = KPROBE_HIT_SS;
    }


    if (p->ainsn.api.insn) {
        /* prepare for single stepping */
        slot = (unsigned long)p->ainsn.api.insn;

        set_ss_context(kcb, slot);    /* mark pending ss: sets kcb->ss_ctx.ss_pending */
        kprobes_save_local_irqflag(kcb, regs);
        instruction_pointer_set(regs, slot);    /* point pc at the slot, i.e. the copy of the original first instruction */
    } else {
        /* insn simulation */
        arch_simulate_insn(p, regs);
    }
}
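Once single-stepping has been set up, the CPU returns from the exception.
It first executes hint #0x19 from the slot. This is PACIASP, which behaves as a nop when pointer authentication is not enabled.
It then executes brk #0x6 and traps again. The flow is similar to the brk #0x4 case above, except that the imm value differs this time, so the handler becomes kprobe_breakpoint_ss_handler:
static struct break_hook kprobes_break_ss_hook = {
    .imm = KPROBES_BRK_SS_IMM,    /* the imm of the brk #0x6 placed in the slot */
    .fn = kprobe_breakpoint_ss_handler,
};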
static int __kprobes
kprobe_breakpoint_ss_handler(struct pt_regs *regs, unsigned int esr)
{
    struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
    int retval;

    /* return error if this is not our step */
    retval = kprobe_ss_hit(kcb, instruction_pointer(regs));    /* safety check: the trapping address must equal kcb->ss_ctx.match_addr and a step must be pending */

    if (retval == DBG_HOOK_HANDLED) {
        kprobes_restore_local_irqflag(kcb, regs);    /* restore the irq flag saved before single-stepping */
        post_kprobe_handler(kcb, regs);              /* run the post_handler */
    }

    return retval;
}

static void __kprobes
post_kprobe_handler(struct kprobe_ctlblk *kcb, struct pt_regs *regs)
{
    struct kprobe *cur = kprobe_running();

    if (!cur)
        return;

    /* return addr restore if non-branching insn */
    if (cur->ainsn.api.restore != 0)    /* p->ainsn.api.restore holds the address of the original function's second instruction */
        instruction_pointer_set(regs, cur->ainsn.api.restore);    /* point pc back at the second instruction of the f2fs function */

    /* restore back original saved kprobe variables and continue */
    if (kcb->kprobe_status == KPROBE_REENTER) {
        restore_previous_kprobe(kcb);
        return;
    }
    /* call post handler */
    kcb->kprobe_status = KPROBE_HIT_SSDONE;
    if (cur->post_handler)
        cur->post_handler(cur, regs, 0);    /* call the post_handler */

    reset_current_kprobe();
}
When this function finishes, pc already points back into the original f2fs function, so execution simply continues there and the hand-over is complete.
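The address bookkeeping across the two traps can be followed with a small user-space model. The field names mirror those in the kernel listings (addr, slot, restore, ss_pending, match_addr), but the structs and the functions `prepare_model`, `first_trap_model` and `second_trap_model` are hypothetical, and the slot address used in the usage note is an invented placeholder, since the real one comes from get_insn_slot().

```c
#include <stdbool.h>
#include <stdint.h>

#define INSN_SIZE 4u    /* sizeof(kprobe_opcode_t) on arm64 */

struct kprobe_model {
    uint64_t addr;      /* probe point, now holding brk #0x4 */
    uint64_t slot;      /* out-of-line copy: original insn, then brk #0x6 */
    uint64_t restore;   /* where to resume: addr + 4 */
};

struct ctlblk_model {
    bool ss_pending;
    uint64_t match_addr;
};

/* Mirrors the address setup done while preparing the probe. */
void prepare_model(struct kprobe_model *p, uint64_t addr, uint64_t slot)
{
    p->addr = addr;
    p->slot = slot;
    p->restore = addr + INSN_SIZE;
}

/* First trap (brk #0x4): arm the step and return the new pc. */
uint64_t first_trap_model(const struct kprobe_model *p, struct ctlblk_model *kcb)
{
    kcb->ss_pending = true;
    kcb->match_addr = p->slot + INSN_SIZE;
    return p->slot;
}

/* Second trap (brk #0x6): check it is our step, then return the resume pc (0 if not ours). */
uint64_t second_trap_model(const struct kprobe_model *p, struct ctlblk_model *kcb,
                           uint64_t pc)
{
    if (!kcb->ss_pending || pc != kcb->match_addr)
        return 0;
    kcb->ss_pending = false;
    return p->restore;
}
```

With the probe at 0xffffffc010874988 and a placeholder slot address, the first trap moves pc to the slot, the copied instruction advances it by 4 onto brk #0x6, and the second trap returns 0xffffffc01087498c, the str x30, [x18],#8 instruction of the original function.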
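This completes the analysis of kprobe registration, the brk trap and handler dispatch, the execution of pre_handler and post_handler, and the final return to the original function.
The kretprobe flow is not traced in detail here; interested readers can work through it on their own.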
4. Using kprobe
See the official sample: /kernel-5.10/samples/kprobes/kprobe_example.c
As the sample shows, adding a probe function with kprobe is quite convenient.
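5. The overhead of kprobe
kprobe involves instruction patching, register save and restore, and extra function calls, so it inevitably has a cost.
According to the official documentation:
Both kprobe and kretprobe carry some overhead, and it is not always negligible.
In addition, a kretprobe costs more than a kprobe, while a kprobe plus a kretprobe on the same function costs about the same as the kretprobe alone.
Our current platform also does not support optimized kprobes to reduce the overhead: CONFIG_OPTPROBES=y is not set.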
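6. Performance problems caused by kprobe
From the analysis above, the kprobe path runs on the current CPU with hard IRQs masked and preemption disabled, and only re-enables preemption and IRQs some time later.
If a time-consuming operation runs while interrupts and preemption are off, the performance of the whole system suffers, and the longer the operation takes, the worse the stutter.
The likely cause is that other related processes on the same CPU cannot get CPU time, which leaves further processes waiting on them and degrades performance. (Adding a delay with only preemption disabled was enough to reproduce the stutter.)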