EDAC DIMM CE Errors Causing Server Reboots
Server 1:
The following is an EDAC (Error Detection and Correction) log.
Following the documentation above, identify the faulty DIMM:
From the error log:
May 8 09:10:59 localhost kernel: EDAC MC0: CE row 1, channel 0, label "CPU_SrcID#0_Channel#1_DIMM#0": 0 Unknown error(s): memory read on FATAL area : cpu=0 Err=0000:009f (ch=15), addr = 0x6f9326740 => socket=0, Channel=1(mask=2), rank=0
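The last step is to pull the memory module out of the faulty F1 slot, or move it to a different memory slot. After doing so, the system booted without reporting the error again.
Reference posts: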
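
As a rough illustration of reading the log line above, the sketch below pulls the socket, channel, and DIMM numbers out of the `label` field. The function name `parse_edac_label` is hypothetical, and the pattern assumes the label has exactly the shape seen on this server (`CPU_SrcID#<n>_Channel#<n>_DIMM#<n>`). Other boards or drivers may format labels differently, and mapping the result to a physical slot name such as F1 still depends on the motherboard documentation.

```shell
#!/bin/sh
# parse_edac_label: hypothetical helper, not part of the original post.
# Takes one EDAC CE kernel log line as its argument and prints
# "socket=<n> channel=<n> dimm=<n>" taken from the label field.
# Assumes a label shaped like CPU_SrcID#<n>_Channel#<n>_DIMM#<n>;
# lines without such a label produce no output.
parse_edac_label() {
    printf '%s\n' "$1" |
        sed -n 's/.*label "CPU_SrcID#\([0-9][0-9]*\)_Channel#\([0-9][0-9]*\)_DIMM#\([0-9][0-9]*\)".*/socket=\1 channel=\2 dimm=\3/p'
}

# The log line quoted in this post.
line='May 8 09:10:59 localhost kernel: EDAC MC0: CE row 1, channel 0, label "CPU_SrcID#0_Channel#1_DIMM#0": 0 Unknown error(s): memory read on FATAL area : cpu=0 Err=0000:009f (ch=15), addr = 0x6f9326740 => socket=0, Channel=1(mask=2), rank=0'

parse_edac_label "$line"
# prints: socket=0 channel=1 dimm=0
```

The values agree with the `socket=0, Channel=1` fields at the end of the same log line, which is a quick sanity check that the label and the decoded address point at the same DIMM.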
http://blog.tankywoo.com/2014/12/02/edac-dimm-ce-error.html
http://serverfault.com/questions/648240/how-can-i-find-which-memory-have-ce-error
https://blog.csdn.net/odailidong/article/details/46865255
Server 2: the machine's memory modules report errors:
Following the documentation above, identify the faulty DIMM:
```
[root@localhost ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count | wc -l
16
[root@localhost ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow6/ch0_ce_count:1
/sys/devices/system/edac/mc/mc0/csrow7/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow6/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow7/ch0_ce_count:0
[root@localhost ~]# cat /sys/devices/system/edac/mc/mc0/csrow6/ch0_dimm_label
CPU_SrcID#0_Ha#0_Channel#3_DIMM
[root@localhost csrow6]# cat /sys/devices/system/edac/mc/mc0/csrow6/mem_type
Registered-DDR3