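Diagnosing Memory Errors on Linux
Some concepts first
DRAM (Dynamic Random Access Memory) is the most common kind of system memory. ECC is short for "Error Checking and Correcting" (also expanded as "Error-Correcting Code"). ECC memory is a memory module that implements this error checking and correcting technique. EDAC stands for Error Detection And Correction.
Memory errors come in two types, CE and UE. CE is short for Correctable Error and UE for Uncorrectable Error. A CE is a recoverable error that does not affect normal operation for the time being, so the module can be replaced at a convenient downtime window. A UE is an unrecoverable memory error and usually brings the machine down.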
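The CE/UE distinction can be read directly from the EDAC counters in sysfs. The following is a minimal sketch, assuming the controller-level ce_count and ue_count files exist under each mc* directory; the function name edac_summary and the state labels are illustrative, not part of any tool.

```shell
# Print corrected (CE) and uncorrected (UE) totals per memory controller.
# $1: EDAC sysfs root (assumed default: /sys/devices/system/edac/mc).
edac_summary() {
    base="${1:-/sys/devices/system/edac/mc}"
    for mc in "$base"/mc*; do
        [ -r "$mc/ce_count" ] || continue
        ce=$(cat "$mc/ce_count")
        # Treat a missing ue_count file as zero.
        ue=$(cat "$mc/ue_count" 2>/dev/null || echo 0)
        if [ "$ue" -gt 0 ]; then
            state="uncorrectable"
        elif [ "$ce" -gt 0 ]; then
            state="correctable"
        else
            state="clean"
        fi
        echo "${mc##*/} ce=$ce ue=$ue $state"
    done
}
```

Calling edac_summary with no argument reads the live counters; passing a directory makes it easy to try against a copied tree.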
System messages log
[root@my-host mg4a]# grep kernel /var/log/messages
Jan 14 19:01:11 my-host kernel: mce: [Hardware Error]: Machine check events logged
Jan 14 19:01:12 my-host kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:5 slot:0 page:0x554c02 offset:0x3c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:0 ha:1 channel_mask:2 rank:0)
[root@my-host mg4a]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch5_ce_count:1
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch5_ce_count:0
[root@my-host mg4a]# dmidecode -t 1
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.
Handle 0x0044, DMI type 1, 27 bytes
System Information
Manufacturer: LENOVO
Product Name: Lenovo System x3750 M4 -[8753IH5]-
Version: 03
Serial Number: 06FF367
UUID: C4EF8080-7926-11E5-8B14-6C0B849B418E
Wake-up Type: Other
SKU Number: XxXxXxX
Family: System X
The messages log from another machine
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 27 13:53:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8de3b1960
Jun 27 13:53:25 irora30 kernel: EDAC MC2: CE page 0x8de3b1, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080a13
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008de3b1960
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Jun 27 14:19:27 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 19:09:23 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 23:59:21 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 02:15:55 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8d9ea5960
Jun 28 02:15:55 irora30 kernel: EDAC MC2: CE page 0x8d9ea5, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008d9ea5960
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 03:08:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8ded39960
Jun 28 03:08:25 irora30 kernel: EDAC MC2: CE page 0x8ded39, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008ded39960
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:45:13 irora30 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
Jun 28 04:44:25 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 09:34:22 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 10:02:30 irora30 ansible-command: Invoked with warn=True executable=None _uses_shell=True _raw_params=df -hl /var|awk 'NR>1 && int($5) > 80' removes=None creates=None chdir=None
Jun 28 14:23:49 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 19:09:25 irora30 auditd[5571]: Audit daemon rotating log files
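In the log above, the page and offset fields of each EDAC line combine into the reported ERROR_ADDRESS: page 0x8de3b1 with offset 0x960 gives 0x8de3b1960. A small sketch of that arithmetic, assuming 4 KiB pages (12 offset bits); the function name edac_phys_addr is made up for illustration.

```shell
# Combine an EDAC page number and offset into a physical address.
# Assumes 4 KiB pages, i.e. the offset occupies the low 12 bits.
edac_phys_addr() {
    printf '0x%x\n' $(( ($1 << 12) | $2 ))
}
```

For example, edac_phys_addr 0x8de3b1 0x960 reproduces the address printed in the first CE record.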
Confirming the fault and locating the faulty memory slot
[root@irora30 ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow5/ch0_ce_count:294
/sys/devices/system/edac/mc/mc3/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow5/ch0_ce_count:0
[root@irora30 ~]#
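- count: any line with a non-zero value indicates recorded memory errors.
- mc: the memory controller index. On the machines shown here there is one per CPU socket (node), so it also tells you which CPU the memory belongs to.
- csrow: the chip-select row. It is not the memory channel; in the first log above, "channel:5 slot:0" lines up with csrow0/ch5_ce_count, so csrow follows the DIMM position within a channel.
- ch*: the memory channel within that csrow.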
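Since only the non-zero counters matter, the listing can be narrowed down and split into its mc, csrow and channel parts. A minimal sketch, assuming the same per-channel ch*_ce_count files that the grep above reads; edac_nonzero_ce is a hypothetical helper name.

```shell
# List per-channel counters whose corrected-error count is non-zero.
# $1: EDAC sysfs root (assumed default: /sys/devices/system/edac/mc).
edac_nonzero_ce() {
    base="${1:-/sys/devices/system/edac/mc}"
    for f in "$base"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue
        n=$(cat "$f")
        [ "$n" -gt 0 ] || continue
        # Split .../mcX/csrowY/chZ_ce_count into its three components.
        ch=${f##*/}; ch=${ch%_ce_count}
        dir=${f%/*}; csrow=${dir##*/}
        dir=${dir%/*}; mc=${dir##*/}
        echo "$mc $csrow $ch ce_count=$n"
    done
}
```

On the machine above this would be expected to print a single line for mc2, csrow5, ch0.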
Installed memory
Memory Component Status

Proc 1 DIMM 1A     16384 MB         1333 MHz
Proc 1 DIMM 2I     Not installed    Not installed
Proc 1 DIMM 3E     Not installed    Not installed
Proc 1 DIMM 4C     Not installed    Not installed
Proc 1 DIMM 5K     Not installed    Not installed
Proc 1 DIMM 6G     Not installed    Not installed
Proc 1 DIMM 7B     16384 MB         1333 MHz
Proc 1 DIMM 8J     Not installed    Not installed
Proc 1 DIMM 9F     Not installed    Not installed
Proc 1 DIMM 10D    Not installed    Not installed
Proc 1 DIMM 11L    Not installed    Not installed
Proc 1 DIMM 12H    Not installed    Not installed
Proc 2 DIMM 1A     16384 MB         1333 MHz
Proc 2 DIMM 2I     Not installed    Not installed
Proc 2 DIMM 3E     Not installed    Not installed
Proc 2 DIMM 4C     Not installed    Not installed
Proc 2 DIMM 5K     Not installed    Not installed
Proc 2 DIMM 6G     Not installed    Not installed
Proc 2 DIMM 7B     16384 MB         1333 MHz
Proc 2 DIMM 8J     Not installed    Not installed
Proc 2 DIMM 9F     Not installed    Not installed
Proc 2 DIMM 10D    Not installed    Not installed
Proc 2 DIMM 11L    Not installed    Not installed
Proc 2 DIMM 12H    Not installed    Not installed
Proc 3 DIMM 1A     16384 MB         1333 MHz
Proc 3 DIMM 2I     Not installed    Not installed
Proc 3 DIMM 3E     Not installed    Not installed
Proc 3 DIMM 4C     Not installed    Not installed
Proc 3 DIMM 5K     Not installed    Not installed
Proc 3 DIMM 6G     Not installed    Not installed
Proc 3 DIMM 7B     16384 MB         1333 MHz
Proc 3 DIMM 8J     Not installed    Not installed
Proc 3 DIMM 9F     Not installed    Not installed
Proc 3 DIMM 10D    Not installed    Not installed
Proc 3 DIMM 11L    Not installed    Not installed
Proc 3 DIMM 12H    Not installed    Not installed
Proc 4 DIMM 1A     16384 MB         1333 MHz
Proc 4 DIMM 2I     Not installed    Not installed
Proc 4 DIMM 3E     Not installed    Not installed
Proc 4 DIMM 4C     Not installed    Not installed
Proc 4 DIMM 5K     Not installed    Not installed
Proc 4 DIMM 6G     Not installed    Not installed
Proc 4 DIMM 7B     16384 MB         1333 MHz
Proc 4 DIMM 8J     Not installed    Not installed
Proc 4 DIMM 9F     Not installed    Not installed
Proc 4 DIMM 10D    Not installed    Not installed
Proc 4 DIMM 11L    Not installed    Not installed
Proc 4 DIMM 12H    Not installed    Not installed
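When an inventory like this is not at hand, a comparable list can usually be pulled from inside Linux with dmidecode -t memory. The sketch below filters that output down to populated slots. It assumes the usual "Size:" and "Locator:" fields of each Memory Device record; the function name installed_dimms is hypothetical, and the locator strings come from the firmware, so they may not match the labels printed on the board.

```shell
# List populated DIMMs from `dmidecode -t memory` text on stdin.
# Each Memory Device record carries a "Size:" line followed by a "Locator:" line.
installed_dimms() {
    awk '$1 == "Size:" { size = $2 " " $3 }
         $1 == "Locator:" {
             # Empty slots report "No Module Installed" as their size.
             if (size != "" && size !~ /^No /) {
                 loc = $0
                 sub(/^[ \t]*Locator:[ \t]*/, "", loc)
                 print loc ": " size
             }
             size = ""
         }'
}
```

Typical use would be: dmidecode -t memory | installed_dimms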
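Detecting server memory faults with the edac tools
With virtualization and in-memory data stores such as Redis and BDB now widespread, more and more servers are configured with large amounts of memory. Take the DELL R620: with two CPUs installed it has 24 memory slots and supports up to 960GB of memory. For memory with error correction, such as ECC and registered (REG) modules, detecting faults is a real headache: a faulty module may keep running for months or even years, yet with bad luck it can bring the machine down at any moment. Fortunately, Linux offers edac-utils, a memory error diagnostic tool that can be used to check a server's memory for latent faults.
The following uses CentOS as an example to introduce the edac-utils tool.
Before using edac-utils you need to understand the server's hardware layout. Taking the DELL R620 as an example (other models such as the HP DL360P G8 and IBM X3650 M4 also use E5-2600 series CPUs and the C600 series chipset, so they are broadly similar), the relationship between each CPU's memory controller, its channels and the memory slots is shown below.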
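Processor 0 (one memory controller)
Channel 0: memory slots A1, A5 and A9
Channel 1: memory slots A2, A6 and A10
Channel 2: memory slots A3, A7 and A11
Channel 3: memory slots A4, A8 and A12
Processor 1 (one memory controller)
Channel 0: memory slots B1, B5 and B9
Channel 1: memory slots B2, B6 and B10
Channel 2: memory slots B3, B7 and B11
Channel 3: memory slots B4, B8 and B12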
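The layout above follows a simple rule: processor 0 uses the letter A and processor 1 the letter B, and the slot number is the DIMM index within the channel times 4, plus the channel number, plus 1. A small sketch of that rule, valid only for this particular layout; the function name r620_slot is made up for illustration.

```shell
# Map (CPU, channel, DIMM index) to a slot label for the layout listed above.
# CPU 0 -> A, CPU 1 -> B; slot number = DIMM index * 4 + channel + 1.
r620_slot() {
    cpu=$1 chan=$2 dimm=$3
    case "$cpu" in
        0) bank=A ;;
        1) bank=B ;;
        *) echo "unknown CPU: $cpu" >&2; return 1 ;;
    esac
    echo "${bank}$(( dimm * 4 + chan + 1 ))"
}
```

For instance, r620_slot 0 0 2 yields A9, the third slot on channel 0 of processor 0.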
1. Install the edac-utils tool
yum install -y libsysfs edac-utils
2. Run the check command; the corrected-error report looks like this
edac-util -v
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: A2
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: A3
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: A4
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: A5
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: A6
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: A7
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: A8
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: A9
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: A10
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: A11
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: A12

mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: B1
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: B2
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: B3
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: B4
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: B5
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: B6
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: B7
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: B8
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#0_DIMM#2: B9
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#1_DIMM#2: B10
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#2_DIMM#2: B11
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#3_DIMM#2: B12
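Where:
mc0 means memory controller 0;
CPU_SrcID#0 means source CPU 0;
Chan#0 (written Channel#0 in some listings) means channel 0;
DIMM#0 means DIMM slot 0 within that channel;
Corrected Errors is the number of errors that have already been corrected.
Using the CPU channel to memory slot mapping listed earlier, each entry returned by edac-utils can be labelled with its slot.
This gives 6312 corrected errors for slot A1, 6459 for slot B1 and 535 for slot B3. Three modules show latent faults; the next step is to contact the vendor to have them replaced.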
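On a machine with many slots it helps to drop the zero lines before matching entries to slots. A minimal sketch, assuming the "N Corrected Errors" line format that edac-util -v prints in the listings in this article; edac_util_nonzero is a hypothetical helper, not part of edac-utils.

```shell
# Keep only DIMM lines whose corrected-error count is non-zero.
# Reads `edac-util -v` style text on stdin, prints "<label> <count>".
edac_util_nonzero() {
    awk '/Corrected Errors/ && !/no DIMM info/ {
        for (i = 2; i < NF; i++)
            if ($(i + 1) == "Corrected" && $i + 0 > 0) {
                # The field before the count is the DIMM label, ending in ":".
                label = $(i - 1)
                sub(/:$/, "", label)
                print label, $i
            }
    }'
}
```

Typical use would be: edac-util -v | edac_util_nonzero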
Mapping for 12 DIMMs
mc0: csrow0: CPU#0Channel#0_DIMM#0: A1
mc0: csrow0: CPU#0Channel#1_DIMM#0: A2
mc0: csrow0: CPU#0Channel#2_DIMM#0: A3
mc0: csrow1: CPU#0Channel#0_DIMM#1: A4
mc0: csrow1: CPU#0Channel#1_DIMM#1: A5
mc0: csrow1: CPU#0Channel#2_DIMM#1: A6

mc1: csrow0: CPU#1Channel#0_DIMM#0: B1
mc1: csrow0: CPU#1Channel#1_DIMM#0: B2
mc1: csrow0: CPU#1Channel#2_DIMM#0: B3
mc1: csrow1: CPU#1Channel#0_DIMM#1: B4
mc1: csrow1: CPU#1Channel#1_DIMM#1: B5
mc1: csrow1: CPU#1Channel#2_DIMM#1: B6
Mapping for 20 DIMMs
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors B1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors C1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors D1
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors A2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors B2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors C2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors D2
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors A3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors B3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: 0 Corrected Errors C3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: 0 Corrected Errors D3
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

4x16 mapping
mc0: csrow0: CPU#0Channel#0_DIMM#0: 0 Corrected Errors 8a
mc0: csrow0: CPU#0Channel#1_DIMM#0: 0 Corrected Errors 5b
mc0: csrow0: CPU#0Channel#2_DIMM#0: 0 Corrected Errors 2c
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU#0Channel#0_DIMM#1: 1 Corrected Errors 7d
mc0: csrow1: CPU#0Channel#1_DIMM#1: 0 Corrected Errors 4e
mc0: csrow1: CPU#0Channel#2_DIMM#1: 0 Corrected Errors 1f
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU#0Channel#0_DIMM#2: 0 Corrected Errors 6G
mc0: csrow2: CPU#0Channel#1_DIMM#2: 0 Corrected Errors 3h
References:
https://www.cnblogs.com/luckyall/p/11225772.html
http://www.voidcn.com/article/p-gvfvakvy-btw.html