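Tracking Down the Cause of an Unexpected Reboot on a CentOS 7 Host
1 Background
We recently brought a physical server into service. The operating system installed by the IT team is: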
CentOS Linux release 7.3.1611 (Core)
Kernel version:
3.10.0-514.el7.x86_64
The host runs Docker; the Docker version is:
Docker version 19.03.6
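While in service, the host crashed and rebooted unexpectedly.
2 Fault Analysis
2.1 Approach
(1) Start with the system log, /var/log/messages, and look for clues.
(2) Suspect a hardware compatibility problem and ask the hardware vendor to check firmware and compatibility.
(3) Suspect an operating system bug. Check whether kdump is enabled and, if it is, whether a kernel crash dump was saved at the time of the crash.
2.2 Analysis in Practice
(1) Check the system log /var/log/messages
The log shows that the system went down at 18:19:01 on 2020-04-01 and came back up at 18:23:19. Beyond that, it contains nothing useful for the analysis.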
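One way to spot the gap between the last pre-crash entry and the next boot is sketched below. It assumes that the first kernel message of each boot contains "Linux version" and that rsyslog writes it to the log file; the function name show_boot_gaps is made up for this illustration.

```shell
# Print every boot marker together with the last line logged before it.
# Assumption: the kernel's first message of a boot contains "Linux version".
show_boot_gaps() {
    awk '
        /kernel:.*Linux version/ {
            if (prev != "") print "last before boot: " prev
            print "boot: " $0
        }
        { prev = $0 }
    ' "$1"
}

# Usage on the affected host (needs read access to the log):
# show_boot_gaps /var/log/messages
```

A clean shutdown normally leaves service stop messages right before the boot marker; an abrupt jump from ordinary activity to a boot marker points to a crash or power loss.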
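(2) Check for hardware compatibility problems
In parallel, send the hardware information collected from iDRAC to the hardware vendor for investigation.
(3) Analyze with kdump
a. Check whether kdump is installed and running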
# systemctl status kdump.service
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) since Thu 2020-04-02 09:01:47 CST; 4h 0min ago
 Main PID: 284294 (code=exited, status=0/SUCCESS)
    Tasks: 0
   Memory: 0B
   CGroup: /system.slice/kdump.service
Note: see Section 3 for how to install the kdump tools.
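Besides the service status, kdump only works if memory was reserved for the capture kernel at boot. The sketch below checks a kernel command line for a crashkernel= parameter; has_crashkernel is a hypothetical helper name, and the commented commands are the usual places to look on a live host.

```shell
# Succeeds if the given kernel command line contains a crashkernel= parameter.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the host itself:
# has_crashkernel "$(cat /proc/cmdline)" && echo "crashkernel memory reserved"
# cat /sys/kernel/kexec_crash_loaded   # 1 means the capture kernel is loaded
# systemctl is-active kdump
```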
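b. Analyze the dump with crash
With the tools from Section 3 installed, analyze the vmcore with the following command (on this host, kdump was already enabled by default before the crash):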
# crash /var/crash/127.0.0.1-2020-04-01-18\:19\:32/vmcore /usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux

crash 7.2.3-10.el7
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2020-04-01-18:19:32/vmcore  [PARTIAL DUMP]
        CPUS: 72
        DATE: Wed Apr  1 18:19:27 2020
      UPTIME: 19 days, 08:32:38
LOAD AVERAGE: 0.29, 0.32, 0.29
       TASKS: 4177
    NODENAME:
     RELEASE: 3.10.0-514.el7.x86_64
     VERSION: #1 SMP Tue Nov 22 16:42:41 UTC 2016
     MACHINE: x86_64  (2600 Mhz)
      MEMORY: 127.5 GB
       PANIC: "kernel BUG at fs/xfs/xfs_aops.c:1062!"
         PID: 92639
     COMMAND: "kworker/u898:3"
        TASK: ffff8810f827bec0  [THREAD_INFO: ffff880106fa4000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 92639  TASK: ffff8810f827bec0  CPU: 1  COMMAND: "kworker/u898:3"
 #0 [ffff880106fa75f0] machine_kexec at ffffffff81059cdb
 #1 [ffff880106fa7650] __crash_kexec at ffffffff81105182
 #2 [ffff880106fa7720] crash_kexec at ffffffff81105270
 #3 [ffff880106fa7738] oops_end at ffffffff8168ee88
 #4 [ffff880106fa7760] die at ffffffff8102e93b
 #5 [ffff880106fa7790] do_trap at ffffffff8168e540
 #6 [ffff880106fa77e0] do_invalid_op at ffffffff8102b144
 #7 [ffff880106fa7890] invalid_op at ffffffff81697e5e
    [exception RIP: xfs_vm_writepage+1419]
    RIP: ffffffffa052b2fb  RSP: ffff880106fa7948  RFLAGS: 00010246
    RAX: 006fffff00040009  RBX: ffff8813abed8fc8  RCX: 000000000000000c
    RDX: 0000000000000008  RSI: ffff880106fa7c40  RDI: ffffea006be56c00
    RBP: ffff880106fa79f0   R8: ffffffffffffffd8   R9: 000000000001a100
    R10: ffff88207ffd7000  R11: 0000000000000000  R12: ffff8813abed8fc8
    R13: ffff880106fa7c40  R14: ffff8813abed8e78  R15: ffffea006be56c00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff880106fa7990] find_get_pages_tag at ffffffff81180981
 #9 [ffff880106fa79f8] __writepage at ffffffff8118b3b3
#10 [ffff880106fa7a10] write_cache_pages at ffffffff8118bed1
#11 [ffff880106fa7b28] generic_writepages at ffffffff8118c19d
#12 [ffff880106fa7b88] xfs_vm_writepages at ffffffffa052a063 [xfs]
#13 [ffff880106fa7bb8] do_writepages at ffffffff8118d24e
#14 [ffff880106fa7bc8] __writeback_single_inode at ffffffff81228730
#15 [ffff880106fa7c08] writeback_sb_inodes at ffffffff8122941e
#16 [ffff880106fa7cb0] __writeback_inodes_wb at ffffffff8122967f
#17 [ffff880106fa7cf8] wb_writeback at ffffffff81229ec3
#18 [ffff880106fa7d70] bdi_writeback_workfn at ffffffff8122bd05
#19 [ffff880106fa7e20] process_one_work at ffffffff810a7f3b
#20 [ffff880106fa7e68] worker_thread at ffffffff810a8d76
#21 [ffff880106fa7ec8] kthread at ffffffff810b052f
#22 [ffff880106fa7f50] ret_from_fork at ffffffff81696518
crash>
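The same session can be run non-interactively, which is handy for attaching a summary to a ticket. This is a minimal sketch, assuming crash reads commands from standard input and that the debug vmlinux lives under /usr/lib/debug as in the command above; vmlinux_for and crash_summary are names invented for this example.

```shell
# Build the path of the debug vmlinux for a given kernel release.
vmlinux_for() {
    printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "$1"
}

# Run a fixed set of crash commands against a dump, then exit.
# $1 = path to the vmcore, $2 = release of the kernel that crashed
crash_summary() {
    # sys: system summary, bt: backtrace of the panic task, log: kernel ring buffer
    printf 'sys\nbt\nlog\nexit\n' | crash -s "$1" "$(vmlinux_for "$2")"
}

# Usage:
# crash_summary /var/crash/127.0.0.1-2020-04-01-18:19:32/vmcore 3.10.0-514.el7.x86_64 > summary.txt
```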
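c. The key line in the backtrace is "exception RIP: xfs_vm_writepage+1419". Searching for it on Google turned up a Red Hat knowledge base article that closely matches these symptoms:
https://access.redhat.com/solutions/2779111
Since the symptoms match, the plan is to schedule downtime, upgrade the kernel as the article recommends, and then monitor whether the host crashes again.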
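After the upgrade and reboot, it is worth confirming that the host really booted into a newer kernel than the affected 3.10.0-514.el7. The sketch below compares two release strings with GNU sort -V; kernel_is_newer is a hypothetical helper, and the release that actually contains the fix should be taken from the Red Hat article rather than from this example.

```shell
# Succeeds if release $1 sorts strictly after release $2 (GNU sort -V ordering).
kernel_is_newer() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Usage after the reboot:
# kernel_is_newer "$(uname -r)" 3.10.0-514.el7.x86_64 && echo "running a newer kernel"
```

Note that the kernel-debuginfo packages from Section 3.3 must match the new kernel if a later dump needs to be analyzed.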
3 Installing the kdump Tools
3.1 Install kexec-tools
yum install kexec-tools
yum install crash
3.2 Configure the kdump Service
vim /etc/kdump.conf
# Directory where the core file is saved
path /var/crash
Then start kdump and enable it at boot:
systemctl start kdump
systemctl enable kdump.service
Reference: https://www.linuxtechi.com/how-to-enable-kdump-on-rhel-7-and-centos-7/
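To double-check where dumps will land, the configured directory can be read back from the file. A minimal sketch: kdump_path is a made-up helper, and it relies on /var/crash being the default that kdump uses when no path line is set.

```shell
# Print the dump directory from a kdump.conf-style file.
# Commented-out lines are ignored; falls back to /var/crash if "path" is unset.
kdump_path() {
    awk '$1 == "path" { p = $2 } END { print (p == "" ? "/var/crash" : p) }' "$1"
}

# Usage:
# ls -l "$(kdump_path /etc/kdump.conf)"
```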
3.3 Install kernel-debuginfo
(1) Download the packages
Search http://debuginfo.centos.org/7/x86_64/ for the RPM packages that match the kernel version:
kernel-debuginfo-3.10.0-514.el7.x86_64.rpm
kernel-debuginfo-common-x86_64-3.10.0-514.el7.x86_64.rpm
(2) Install
rpm -ivh kernel-debuginfo-common-x86_64-3.10.0-514.el7.x86_64.rpm
rpm -ivh kernel-debuginfo-3.10.0-514.el7.x86_64.rpm
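The package names follow directly from the kernel release string, so they can be derived instead of typed by hand. The sketch below assumes the naming pattern shown above and that the release string ends in the architecture (as `uname -r` does on CentOS 7); debuginfo_rpms is a hypothetical helper.

```shell
# Print the two debuginfo package file names for a kernel release,
# in installation order (the common package first).
debuginfo_rpms() {
    local release="$1"
    local arch="${release##*.}"   # text after the last dot, e.g. x86_64
    printf 'kernel-debuginfo-common-%s-%s.rpm\n' "$arch" "$release"
    printf 'kernel-debuginfo-%s.rpm\n' "$release"
}

# Usage:
# debuginfo_rpms "$(uname -r)"
```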