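Kernel panic essentials
Obtaining a vmcore
What is a kernel dump
Kdump: the tool for capturing kernel dumps
How Kdump works
Configuring Kdump
crash(1): the dump analysis tool
Preparing the environment
Getting the kernel version and system information from the vmcore file
Kernel debuginfo (kernel symbol files)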
Kernel source code
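Differences between RHEL and SLES
Time zone settings
Running the crash utility: against a vmcore or a live system
Approach to dump analysis: where to start
Determining the panic type
System information: sys
Message buffer: log
Types of kernel panic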
Hard lockup
Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
Soft lockup
Kernel panic - not syncing: softlockup: hung tasks
Hung task panic
Kernel panic - not syncing: hung_task: blocked tasks
OOM (out of memory)
Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
NULL pointer / invalid pointer
BUG: unable to handle kernel NULL pointer dereference at 0000000000000650
BUG: unable to handle kernel paging request at ffff88081fc03cd0
MCE(Machine Check Exception)
Kernel panic - not syncing: Fatal Machine Check
NMI
Kernel panic - not syncing: NMI IOCK error: Not continuing
HP Watchdog timer module [hpwdt]
Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details
SysRq
PANIC: "SysRq: Trigger a crashdump"
BUG_ON() assertion
kernel BUG at fs/inode.c:322!
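Understanding the function call stack (backtrace)
Execution trace of the code
CPU register state: pt_regs
Data in stack frames
Kernel stack overflow
Assembly instructions
Calling convention
call/ret/leave instructions
Parameter passing convention
General-purpose registers: caller-saved vs. callee-saved
Cross-referencing the source code
Changelog
Kernel modules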
Taint flags
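How the crash utility loads debug information for kernel modules
Hang analysis
Approach
- Are there no runnable processes?
- Or are many processes trying to run but unable to get a CPU?
- What is uninterruptible sleep
- Cases where even a preemptible kernel cannot be preempted
- Spinlocks
- Basic commands of the crash tool
Processes: ps/task/runq/bt
Memory: kmem/vm/swap/ipcs
I/O: dev/mount/files/fuser
Network: net
crash utility extensions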
PyKdump