Linux Basics: Troubleshooting an Unexpected Crash with vmcore on BClinux 8.2
I. No file is generated under /var/crash
1. Reference configuration:
https://cloud.tencent.cn/developer/article/2367955
2. Adjust the configuration on BCoe8.2
3. Generate a crash manually
i. Reference: parameter details
https://blog.csdn.net/tombaby_come/article/details/134038949
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
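Note: running the two commands above panics the kernel immediately. kdump then dumps the contents of memory into the /var/crash directory, and the host reboots.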
4. Check kdump
i. Reference: how kdump works
https://zhuanlan.zhihu.com/p/684699511
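Before triggering a test crash, it helps to confirm that kdump is actually armed. The following is a minimal sketch, not the configuration from the referenced articles: it assumes the standard kernel interfaces /proc/cmdline and /sys/kernel/kexec_crash_loaded, and the function names (has_crashkernel, crash_kernel_loaded, kdump_preflight) are made up for illustration. Each function takes the file to read as an argument, so it can also be tried against sample files.

```shell
#!/usr/bin/env bash
# Sketch: pre-flight checks before relying on kdump.
# Function names are hypothetical; the default paths are the usual
# kernel interfaces and are assumed to exist on the target host.

# Succeeds if the kernel command line reserves memory via crashkernel=.
has_crashkernel() {
    local cmdline_file="${1:-/proc/cmdline}"
    grep -Eq '(^|[[:space:]])crashkernel=[^[:space:]]+' "$cmdline_file"
}

# Succeeds if a capture kernel is currently loaded (the file contains 1).
crash_kernel_loaded() {
    local flag_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ "$(cat "$flag_file" 2>/dev/null)" = "1" ]
}

# Print a two-line report. This never triggers a crash.
kdump_preflight() {
    if has_crashkernel "$1"; then
        echo "crashkernel: reserved"
    else
        echo "crashkernel: missing"
    fi
    if crash_kernel_loaded "$2"; then
        echo "capture kernel: loaded"
    else
        echo "capture kernel: not loaded"
    fi
}
```

Calling kdump_preflight with no arguments reads the live system. If either line reports a problem, writing c to /proc/sysrq-trigger will most likely panic the host without leaving a vmcore behind. On RHEL 8 family systems, kdumpctl status and systemctl status kdump give the service-level view of the same thing.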
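II. Consistency check between the crash tool's vmlinux and the running kernel
1. Check that /boot/vmlinuz-4.19.0-240.23.35.el8_2.bclinux.x86_64 and /usr/lib/debug/usr/lib/modules/4.19.0-240.23.35.el8_2.bclinux.x86_64/vmlinux come from the same kernel build, i.e. that they carry the same release string (4.19.0-240.23.35.el8_2.bclinux.x86_64). The md5 values of the two files themselves cannot be compared directly: vmlinuz is the compressed boot image, while vmlinux is the uncompressed ELF image with debug symbols.
2. Location of the host kernel's vmlinux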
/usr/lib/debug/usr/lib/modules/4.19.0-240.23.35.el8_2.bclinux.x86_64/vmlinux
3. Location of the vmcore file from the crash
/var/crash/127.0.0.1-2024-05-06-03\:24\:36/vmcore
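The release-string check can be scripted. The sketch below assumes the debuginfo layout shown above (/usr/lib/debug/usr/lib/modules/RELEASE/vmlinux); the helper names are hypothetical. It only manipulates strings, so it is safe to run anywhere.

```shell
#!/usr/bin/env bash
# Sketch: derive and compare kernel release strings.
# Helper names are hypothetical; the path layout is the one used in this article.

# Strip the "vmlinuz-" prefix from a /boot image name to get the release.
release_from_vmlinuz() {
    local name
    name=$(basename "$1")
    printf '%s\n' "${name#vmlinuz-}"
}

# Build the debuginfo vmlinux path for a given kernel release.
vmlinux_for_release() {
    printf '/usr/lib/debug/usr/lib/modules/%s/vmlinux\n' "$1"
}

# Succeeds only when both release strings are non-empty and identical.
same_release() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}
```

In practice the first argument to same_release would be the release recorded in the dump and the second the release of the debuginfo package. The crash utility can print the former with crash --osrelease followed by the vmcore path.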
III. Analyze the vmcore
1. Open the vmcore with the crash tool
[root@NewOSBC8 127.0.0.1-2024-05-06-03:24:36]# crash vmcore /usr/lib/debug/usr/lib/modules/4.19.0-240.23.35.el8_2.bclinux.x86_64/vmlinux

crash 7.2.7-3.el8.1
Copyright (C) 2002-2020 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [178MB]: patching 97096 gdb minimal_symbol values

      KERNEL: /usr/lib/debug/usr/lib/modules/4.19.0-240.23.35.el8_2.bclinux.x86_64/vmlinux
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 2
        DATE: Mon May 6 03:24:31 2024
      UPTIME: 00:12:44
LOAD AVERAGE: 0.00, 0.02, 0.03
       TASKS: 346
    NODENAME: NewOSBC8.2
     RELEASE: 4.19.0-240.23.35.el8_2.bclinux.x86_64
     VERSION: #1 SMP Wed Sep 27 10:49:35 EDT 2023
     MACHINE: x86_64 (1796 Mhz)
      MEMORY: 2 GB
       PANIC: "sysrq: SysRq : Trigger a crash"
         PID: 2289
     COMMAND: "bash"
        TASK: ffff8d1122bf0000 [THREAD_INFO: ffff8d1122bf0000]
         CPU: 0
       STATE: TASK_RUNNING (SYSRQ)

crash> bt
PID: 2289 TASK: ffff8d1122bf0000 CPU: 0 COMMAND: "bash"
 #0 [ffffa2ab80cefbe8] machine_kexec at ffffffff8c25fabe
 #1 [ffffa2ab80cefc40] __crash_kexec at ffffffff8c3658ba
 #2 [ffffa2ab80cefd00] crash_kexec at ffffffff8c36678d
 #3 [ffffa2ab80cefd18] oops_end at ffffffff8c2259fd
 #4 [ffffa2ab80cefd38] no_context at ffffffff8c26fd4e
 #5 [ffffa2ab80cefd90] do_page_fault at ffffffff8c270872
 #6 [ffffa2ab80cefdc0] page_fault at ffffffff8cc0122e
    [exception RIP: sysrq_handle_crash+18]
    RIP: ffffffff8c74eb12 RSP: ffffa2ab80cefe78 RFLAGS: 00010246
    RAX: ffffffff8c74eb00 RBX: 0000000000000063 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff8d1131017108 RDI: 0000000000000063
    RBP: 0000000000000004 R8: 00000000000005ce R9: 000000000000002d
    R10: 0000000000000000 R11: ffffa2ab80cefd30 R12: 0000000000000000
    R13: 0000000000000000 R14: ffffffff8d53c3e0 R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #7 [ffffa2ab80cefe78] __handle_sysrq.cold.10 at ffffffff8c74f6f8
 #8 [ffffa2ab80cefea8] write_sysrq_trigger at ffffffff8c74f5bb
 #9 [ffffa2ab80cefeb8] proc_reg_write at ffffffff8c55de29
#10 [ffffa2ab80cefed0] vfs_write at ffffffff8c4e0db5
#11 [ffffa2ab80ceff00] ksys_write at ffffffff8c4e102f
#12 [ffffa2ab80ceff38] do_syscall_64 at ffffffff8c2041ab
#13 [ffffa2ab80ceff50] entry_SYSCALL_64_after_hwframe at ffffffff8cc000ad
    RIP: 00007f515c78ab28 RSP: 00007ffc1172a678 RFLAGS: 00000246
    RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f515c78ab28
    RDX: 0000000000000002 RSI: 000055b65d8c05c0 RDI: 0000000000000001
    RBP: 000055b65d8c05c0 R8: 000000000000000a R9: 00007f515c81bc80
    R10: 000000000000000a R11: 0000000000000246 R12: 00007f515ca5b6c0
    R13: 0000000000000002 R14: 00007f515ca56880 R15: 0000000000000002
    ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

crash> dis -l sysrq_handle_crash+18
/usr/src/debug/kernel-4.19.0-240.23.35.el8/linux-4.19.0-240.23.35.el8_2.bclinux.x86_64/drivers/tty/sysrq.c: 159
0xffffffff8c74eb12 <sysrq_handle_crash+18>: movb $0x1,0x0

crash> dis -l 0xffffffff8c74eb12
/usr/src/debug/kernel-4.19.0-240.23.35.el8/linux-4.19.0-240.23.35.el8_2.bclinux.x86_64/drivers/tty/sysrq.c: 159
0xffffffff8c74eb12 <sysrq_handle_crash+18>: movb $0x1,0x0

crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM   458790       1.8 GB         ----
         FREE   194411     759.4 MB   42% of TOTAL MEM
         USED   264379         1 GB   57% of TOTAL MEM
       SHARED    50717     198.1 MB   11% of TOTAL MEM
      BUFFERS      530       2.1 MB    0% of TOTAL MEM
       CACHED   103545     404.5 MB   22% of TOTAL MEM
         SLAB    31239       122 MB    6% of TOTAL MEM

   TOTAL HUGE        0            0         ----
    HUGE FREE        0            0    0% of TOTAL HUGE

   TOTAL SWAP   532479         2 GB         ----
    SWAP USED        0            0    0% of TOTAL SWAP
    SWAP FREE   532479         2 GB  100% of TOTAL SWAP

 COMMIT LIMIT   761874       2.9 GB         ----
    COMMITTED   511634         2 GB   67% of TOTAL LIMIT

crash> sys
      KERNEL: /usr/lib/debug/usr/lib/modules/4.19.0-240.23.35.el8_2.bclinux.x86_64/vmlinux
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 2
        DATE: Mon May 6 03:24:31 2024
      UPTIME: 00:12:44
LOAD AVERAGE: 0.00, 0.02, 0.03
       TASKS: 346
    NODENAME: NewOSBC8.2
     RELEASE: 4.19.0-240.23.35.el8_2.bclinux.x86_64
     VERSION: #1 SMP Wed Sep 27 10:49:35 EDT 2023
     MACHINE: x86_64 (1796 Mhz)
      MEMORY: 2 GB
       PANIC: "sysrq: SysRq : Trigger a crash"

crash> p cpu_info:1
per_cpu(cpu_info, 1) = $1 = {
  x86 = 23 '\027',
  x86_vendor = 2 '\002',
  x86_model = 104 'h',
  x86_stepping = 1 '\001',
  x86_tlbsize = 3072,
  x86_virt_bits = 48 '0',
  x86_phys_bits = 45 '-',
  x86_coreid_bits = 0 '\000',
  cu_id = 255 '\377',
  extended_cpuid_level = 2147483680,
  cpuid_level = 16,
  x86_capability = {126614527, 802421759, 0, 129319184, 4277678595, 0, 4195321, 376123396, 557056, 563872169, 15, 0, 0, 17584641, 4, 0, 4194308, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 229696, 0},
  x86_vendor_id = "AuthenticAMD\000\000\000",
  x86_model_id = "AMD Ryzen 7 5700U with Radeon Graphics\000 \000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
  x86_cache_size = 512,
  x86_cache_alignment = 64,
  x86_cache_max_rmid = -1,
  x86_cache_occ_scale = -1,
  x86_power = 256,
  loops_per_jiffy = 1796624,
  x86_max_cores = 1,
  apicid = 2,
  initial_apicid = 2,
  x86_clflush_size = 64,
  booted_cores = 1,
  phys_proc_id = 2,
  logical_proc_id = 1,
  cpu_core_id = 0,
  cpu_index = 1,
  microcode = 0,
  x86_cache_bits = 45 '-',
  initialized = 1,
  cpuinfo_x86_extended_size_rh = 0,
  _rh = {
    cpu_die_id = 0,
    logical_die_id = 1,
    vmx_capability = {0, 0, 0}
  }
}

crash> ps 1489
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
   1489   1382   0  ffff8d110eb20000  IN  11.9 3106588 249348  llvmpipe-1
crash>
crash vmcore /usr/lib/debug/usr/lib/modules/4.19.0-240.23.35.el8_2.bclinux.x86_64/vmlinux
vmcore generation time: DATE: Mon May 6 03:24:31 2024
Panic reason: PANIC: "sysrq: SysRq : Trigger a crash"
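The same session can be replayed without typing each command by hand. This is a sketch under the assumption that the installed crash supports -i (read commands from a file) and -s (suppress the banner), as documented in its man page; the function names and the command file location are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: run the crash commands from this article in batch mode.
# Function names are hypothetical.

# Write the commands to a file. The trailing "exit" matters:
# without it crash drops into its interactive prompt after the file is read.
write_crash_cmds() {
    cat > "$1" <<'EOF'
bt
dis -l sysrq_handle_crash+18
kmem -i
sys
exit
EOF
}

# Run crash against a vmcore/vmlinux pair using the command file.
# Assumes crash and the matching kernel debuginfo are installed.
run_crash_batch() {
    local vmcore="$1" vmlinux="$2" cmds="$3"
    crash -s -i "$cmds" "$vmlinux" "$vmcore"
}
```

A typical use would be write_crash_cmds /tmp/cmds.txt followed by run_crash_batch vmcore VMLINUX_PATH /tmp/cmds.txt, with the output redirected to a file that can be attached to a ticket.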
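2. Inspect the registers at the time of the exception and the function at RIP
i. Work out which process was running and ended up in sysrq_handle_crash, causing the panic. In the bt output above this is PID 2289, COMMAND "bash", which entered the kernel through a write() to /proc/sysrq-trigger (write_sysrq_trigger);
ii. Reference: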
https://blog.csdn.net/weixin_43564241/article/details/130692946
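3. Inspect the code at the faulting address
i. Take the symbol from the "[exception RIP: sysrq_handle_crash+18]" line of the bt output and pass it to dis -l. The result points at drivers/tty/sysrq.c line 159, where the instruction movb $0x1,0x0 writes to address 0. This is the deliberate NULL write that the SysRq "c" handler uses to crash the kernel;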
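Pulling the exception RIP out of a saved bt listing is plain text processing. The sketch below assumes the bt output has been saved to a file or is piped in; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extract "symbol+offset" from the "[exception RIP: ...]" marker.
# Function names are hypothetical. Input is bt output on stdin.

# Print the first exception RIP found, e.g. sysrq_handle_crash+18.
exception_rip() {
    sed -n 's/.*\[exception RIP: \([^]]*\)].*/\1/p' | head -n 1
}

# Print the crash command that disassembles that location with source info.
# Prints nothing and returns non-zero if no marker is present.
dis_command() {
    local rip
    rip=$(exception_rip)
    [ -n "$rip" ] && printf 'dis -l %s\n' "$rip"
}
```

For example, piping the saved bt output into dis_command produces a line that can be pasted at the crash> prompt.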
4. Check memory usage at the time of the crash (kmem -i)
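The kmem -i columns are related by simple arithmetic: TOTAL is PAGES multiplied by the page size, and PERCENTAGE is PAGES divided by the TOTAL MEM page count. A small sketch, assuming 4 KiB pages (the usual x86_64 base page size, not stated in the output itself) and integer arithmetic that truncates; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: reproduce the kmem -i TOTAL and PERCENTAGE columns from PAGES.
# Assumption: 4 KiB pages. Results are truncated to whole numbers.

PAGE_KIB=4

# Convert a page count to MiB (truncated).
pages_to_mib() {
    echo $(( $1 * PAGE_KIB / 1024 ))
}

# Percentage of $1 relative to $2 (truncated).
percent_of() {
    echo $(( $1 * 100 / $2 ))
}

pages_to_mib 194411          # FREE pages from the dump: prints 759
percent_of 194411 458790     # FREE vs TOTAL MEM: prints 42
percent_of 264379 458790     # USED vs TOTAL MEM: prints 57
```

The results line up with the 759.4 MB, 42% and 57% reported in the dump, which is a quick way to confirm that a pasted table has not been garbled.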
5. Triggered from the user side
i. The dump of memory into /var/crash was triggered manually.
Move forward steadily and make every moment count.