记一次VMware 虚拟机遇到意外重启的内核级排查操作

背景:用户业务虚拟机遇到不明原因导致操作系统重启,引起业务中断

需求:要求排查具体原因,定位问题根源

先来查看虚拟机的事件,事件发生时间: 13:37:21

 再到虚拟机对应的宿主机查看相关日志,宿主机日志看到的时间需+8才能与VCenter上的事件时间对应得上,因此我们过滤05:37分左右的日志

 VMKernel日志:

  hostd日志:

 结合vmkernel和hostd的日志基本只能看到操作系统被重启了,并没有看到直接导致虚拟机重启的原因,因此我们将注意力放在虚拟机操作系统内进行排错

 在/var/crash目录下,能看到自动生成了vmcore日志文件,由于这是存在于生产业务机器上,因此我们将日志拷贝到测试环境

 

在这里我找了一台跟生产业务机一致内核版本的机器用于分析

 

[root@demo ~]# cd /opt/

# 安装与生产业务机一致内核版本的kernel-debug工具
[root@demo opt]# wget http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-3.10.0-957.el7.x86_64.rpm

[root@demo opt]# mkdir test
[root@demo opt]# mv kernel-debuginfo-3.10.0-957.el7.x86_64.rpm  test/

[root@demo opt]# cd test

# rpm2cpio将rpm包转换为cpio,再由cpio提取其内容保存在当前目录下
[root@demo test]# rpm2cpio kernel-debuginfo-3.10.0-957.el7.x86_64.rpm | cpio -idmv

# 查看当前目录树结构,验证rpm已被提取保存在当前目录下
[root@demo test]# tree .|head 
.
├── kernel-debuginfo-3.10.0-957.el7.x86_64.rpm
└── usr
    └── lib
        └── debug
            ├── lib
            │   └── modules
            │       └── 3.10.0-957.el7.x86_64
            │           ├── kernel
            │           │   ├── arch

# 查找rpm内的vmlinux文件路径,将其复制到/opt/下,方便接下来的操作
[root@demo test]# find ./ -name "vmlinux"
[root@demo test]# cp ./usr/lib/debug/lib/modules/3.10.0-957.el7.x86_64/vmlinux /opt/

 前期工作准备完成,接下来开始使用crash工具分析core日志

[root@demo test]# crash /opt/vmlinux /root/vmcore

crash 7.2.3-8.el7
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [710MB]: patching 85605 gdb minimal_symbol values

      KERNEL: /opt/vmlinux                                             
    DUMPFILE: /root/vmcore  [PARTIAL DUMP]
        CPUS: 8
        DATE: Thu Jul 18 13:37:21 2024
      UPTIME: 786 days, 20:01:15
LOAD AVERAGE: 1.36, 1.38, 1.45
       TASKS: 720
    NODENAME: 10-41-16-16
     RELEASE: 3.10.0-957.el7.x86_64
     VERSION: #1 SMP Thu Nov 8 23:39:32 UTC 2018
     MACHINE: x86_64  (2095 Mhz)
      MEMORY: 8 GB
       PANIC: "general protection fault: 0000 [#1] SMP "
         PID: 56383
     COMMAND: "XXXX"
        TASK: ffff98c1fcfbc100  [THREAD_INFO: ffff98c1d3bb8000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt 
PID: 56383  TASK: ffff98c1fcfbc100  CPU: 1   COMMAND: "XXXX"
 #0 [ffff98c3f5643aa8] machine_kexec at ffffffffad663674
 #1 [ffff98c3f5643b08] __crash_kexec at ffffffffad71ce12
 #2 [ffff98c3f5643bd8] crash_kexec at ffffffffad71cf00
 #3 [ffff98c3f5643bf0] oops_end at ffffffffadd6c758
 #4 [ffff98c3f5643c18] die at ffffffffad62f95b
 #5 [ffff98c3f5643c48] do_general_protection at ffffffffadd6c052
 #6 [ffff98c3f5643c80] general_protection at ffffffffadd6b6f8
    [exception RIP: kmem_cache_alloc+116]
    RIP: ffffffffad81bcf4  RSP: ffff98c3f5643d30  RFLAGS: 00010286
    RAX: 0000000000000000  RBX: 0000000000000780  RCX: 0000002447835d81
    RDX: 0000002447835d80  RSI: 0000000000000020  RDI: ffff98c33fc07700
    RBP: ffff98c3f5643d60   R8: 000000000001f120   R9: ffffffffadc279da
    R10: ffff98c3e6e60f78  R11: ffff98c1f534c8c0  R12: 746e657645534620
    R13: 0000000000000020  R14: ffff98c33fc07700  R15: ffff98c33fc07700
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff98c3f5643d68] __build_skb at ffffffffadc279da
 #8 [ffff98c3f5643d88] __netdev_alloc_skb at ffffffffadc27cc9
 #9 [ffff98c3f5643dc0] vmxnet3_rq_rx_complete at ffffffffc036b5f6 [vmxnet3]
#10 [ffff98c3f5643e48] vmxnet3_poll_rx_only at ffffffffc036c0f6 [vmxnet3]
#11 [ffff98c3f5643e78] net_rx_action at ffffffffadc39e9f
#12 [ffff98c3f5643ef8] __do_softirq at ffffffffad6a0f05
#13 [ffff98c3f5643f68] call_softirq at ffffffffadd7832c
#14 [ffff98c3f5643f80] do_softirq at ffffffffad62e675
#15 [ffff98c3f5643fa0] irq_exit at ffffffffad6a1285
#16 [ffff98c3f5643fb8] do_IRQ at ffffffffadd795e6
--- <IRQ stack> ---
#17 [ffff98c1d3bbb208] ret_from_intr at ffffffffadd6b362
    [exception RIP: extract_dns_request+56]
    RIP: ffffffffc15ce188  RSP: ffff98c1d3bbb2b8  RFLAGS: 00000217
    RAX: 00000000003e3016  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 00000000003e3015  RSI: ffff98c1d3bbb330  RDI: 00000000ffffff8f
    RBP: ffff98c1d3bbb2b8   R8: 0000000000000000   R9: ffff98c3d4efde38
    R10: 00000000ffffff8f  R11: 0000000000000001  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffcc  CS: 0010  SS: 0018
#18 [ffff98c1d3bbb2c0] handle_dns_query at ffffffffc15cebea [uniedr_edr]
    RIP: 10ffff98c3f75298  RSP: 0000000000000000  RFLAGS: 01000009
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: ff00000000000000
    RDX: 000000ffff0000ff  RSI: 0000000000000000  RDI: 4500000000000000
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 000000081e000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 3500000000000000
    ORIG_RAX: 00000006c0000001  CS: ffff98c3f75299  SS: 4000000000000000
bt: WARNING: possibly bogus exception frame

# 以上信息可以看到 操作系统奔溃类型为 PANIC: "general protection fault: 0000 [#1] SMP "
# 奔溃时运行的任务ID为:ffff98c1fcfbc100

 接着再根据任务ID查看更详细的信息

crash> bt -t ffff98c1fcfbc100
PID: 56383  TASK: ffff98c1fcfbc100  CPU: 1   COMMAND: "XXXXX"
              START: machine_kexec at ffffffffad663674
  [ffff98c3f5643aa8] machine_kexec at ffffffffad663674
  [ffff98c3f5643b08] __crash_kexec at ffffffffad71ce12
  [ffff98c3f5643b50] __build_skb at ffffffffadc279da
  [ffff98c3f5643b90] kmem_cache_alloc at ffffffffad81bcf4
  [ffff98c3f5643bd8] crash_kexec at ffffffffad71cf00
  [ffff98c3f5643bf0] oops_end at ffffffffadd6c758
  [ffff98c3f5643c18] die at ffffffffad62f95b
  [ffff98c3f5643c48] do_general_protection at ffffffffadd6c052
  [ffff98c3f5643c80] general_protection at ffffffffadd6b6f8
  [ffff98c3f5643cc8] __build_skb at ffffffffadc279da
  [ffff98c3f5643d08] kmem_cache_alloc at ffffffffad81bcf4
  [ffff98c3f5643d68] __build_skb at ffffffffadc279da
  [ffff98c3f5643d88] __netdev_alloc_skb at ffffffffadc27cc9
  [ffff98c3f5643dc0] vmxnet3_rq_rx_complete at ffffffffc036b5f6 [vmxnet3]
  [ffff98c3f5643e48] vmxnet3_poll_rx_only at ffffffffc036c0f6 [vmxnet3]
  [ffff98c3f5643e78] net_rx_action at ffffffffadc39e9f
  [ffff98c3f5643ef8] __do_softirq at ffffffffad6a0f05
  [ffff98c3f5643f68] call_softirq at ffffffffadd7832c
  [ffff98c3f5643f80] do_softirq at ffffffffad62e675
  [ffff98c3f5643fa0] irq_exit at ffffffffad6a1285
  [ffff98c3f5643fb8] do_IRQ at ffffffffadd795e6
  [ffff98c3f5643ff0] ret_from_intr at ffffffffadd6b362
--- <IRQ stack> ---
  [ffff98c1d3bbb208] ret_from_intr at ffffffffadd6b362
    [exception RIP: extract_dns_request+56]
    RIP: ffffffffc15ce188  RSP: ffff98c1d3bbb2b8  RFLAGS: 00000217
    RAX: 00000000003e3016  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 00000000003e3015  RSI: ffff98c1d3bbb330  RDI: 00000000ffffff8f
    RBP: ffff98c1d3bbb2b8   R8: 0000000000000000   R9: ffff98c3d4efde38
    R10: 00000000ffffff8f  R11: 0000000000000001  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffcc  CS: 0010  SS: 0018
  [ffff98c1d3bbb2c0] handle_dns_query at ffffffffc15cebea [uniedr_edr]
    RIP: 10ffff98c3f75298  RSP: 0000000000000000  RFLAGS: 01000009
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: ff00000000000000
    RDX: 000000ffff0000ff  RSI: 0000000000000000  RDI: 4500000000000000
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 000000081e000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 3500000000000000
    ORIG_RAX: 00000006c0000001  CS: ffff98c3f75299  SS: 4000000000000000
bt: WARNING: possibly bogus exception frame

 总结:

  从 START项往下查看,可以看到由machine_kexec调用ffffffffad663674地址,接着发生一系列调用操作,直到触发crash_kexec(系统奔溃函数)后出现保护操作general_protection(与前面看到的奔溃类型一致 ---> PANIC: "general protection fault: 0000 [#1] SMP ")即可确认该日志报错信息与实际奔溃类型一致,而引起此次操作系统重启的原因是:

  COMMAND: "XXXX" 命令导致的(这里的XXXX是笔者人为替换的),而该命令正是操作系统内的应用引起的,因此可以将注意力集中在应用本身进行排查

 

posted @ 2024-07-18 23:00  Ky150  阅读(51)  评论(0编辑  收藏  举报