[Case] Oracle ORA-29740
Oracle RAC environment
Database version | Oracle 11.2.0.4 |
Number of nodes | 2-node RAC |
Operating system | Red Hat Enterprise Linux Server release 6.9 (Santiago) |
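Symptom:
The instance on node 2 crashed and its VIP failed over to node 1.
The alert log and cssd log entries from the failure window are shown below.
Alert log: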
IPC Send timeout: Terminating pid 34 osid 52694
Thu Jul 02 12:07:39 2020
Communications reconfiguration: instance_number 1
Detected an inconsistent instance membership by instance 1
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmon_52640.trc (incident=304089):
ORA-29740: evicted by instance number 1, group incarnation 6
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_304089/orcl2_lmon_52640_i304089.trc
Thu Jul 02 12:07:41 2020
IPC Send timeout detected. Sender: ospid 52682 [oracle@ze02 (LGWR)]
Receiver: inst 1 binc 460990968 ospid 36977
IPC Send timeout to 1.4 inc 4 for msg type 73 from opid 28
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmon_52640.trc:
ORA-29740: evicted by instance number 1, group incarnation 6
LMON (ospid: 52640): terminating the instance due to error 29740
Thu Jul 02 12:07:42 2020
ORA-1092 : opitsk aborting process
Thu Jul 02 12:07:46 2020
System state dump requested by (instance=2, osid=52640 (LMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_diag_52630_20200702120746.trc
Error messages in cssd.log:
2020-07-07 17:54:58.329: [ CSSD][2708354816]clssgmDiscEndpcl: gipcDestroy 0x1383
2020-07-07 17:54:58.329: [ CSSD][2708354816]clssgmDeadProc: proc 0x7f9198099540
2020-07-07 17:54:58.329: [ CSSD][2708354816]clssgmDestroyProc: cleaning up proc(0x7f9198099540) con(0x1354) skgpid ospid 10669 with 0 clients, refcount 0
2020-07-07 17:54:58.329: [ CSSD][2708354816]clssgmDiscEndpcl: gipcDestroy 0x1354
2020-07-07 17:54:58.618: [ CSSD][2694764288]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2020-07-07 17:54:59.122: [ CSSD][2691610368]clssnmSendingThread: sending join msg to all nodes
2020-07-07 17:54:59.122: [ CSSD][2691610368]clssnmSendingThread: sent 5 join msgs to all nodes
2020-07-07 17:54:59.349: [ CSSD][2699503360]clssnmvDHBValidateNcopy: node 1, ze02, has a disk HB, but no network HB, DHB has rcfg 483520902, wrtcnt, 6414293, LATS 4294746660, lastSeqNo 6414292, uniqueness 1588026083, timestamp 1594115698/1792664848
2020-07-07 17:54:59.618: [ CSSD][2694764288]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2020-07-07 17:55:00.355: [ CSSD][2699503360]clssnmvDHBValidateNcopy: node 1, ze02, has a disk HB, but no network HB, DHB has rcfg 483520902, wrtcnt, 6414294, LATS 4294747670, lastSeqNo 6414293, uniqueness 1588026083, timestamp 1594115699/1792665848
2020-07-07 17:55:00.619: [ CSSD][2694764288]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2020-07-07 17:55:01.355: [ CSSD][2699503360]clssnmvDHBValidateNcopy: node 1, ze02, has a disk HB, but no network HB, DHB has rcfg 483520902, wrtcnt, 6414295, LATS 4294748670, lastSeqNo 6414294, uniqueness 1588026083, timestamp 1594115700/1792666848
2020-07-07 17:55:01.619: [ CSSD][2694764288]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2020-07-07 17:55:02.383: [ CSSD][2699503360]clssnmvDHBValidateNcopy: node 1, ze02, has a disk HB, but no network HB, DHB has rcfg 483520902, wrtcnt, 6414296, LATS 4294749700, lastSeqNo 6414295, uniqueness 1588026083, timestamp 1594115701/1792667848
2020-07-07 17:55:02.619: [ CSSD][2694764288]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2020-07-07 17:55:03.108: [ CSSD][2708354816]clssgmDeadProc: proc 0x7f9198039460
2020-07-07 17:55:03.108: [ CSSD][2708354816]clssgmDestroyProc: cleaning up proc(0x7f9198039460) con(0x12f3) skgpid ospid 10932 with 0 clients, refcount 0
2020-07-07 17:55:03.108: [ CSSD][2708354816]clssgmDiscEndpcl: gipcDestroy 0x12f3
2020-07-07 17:55:03.110: [ CSSD][2708354816]clssscSelect: cookie accept request 0xc85280
2020-07-07 17:55:03.110: [ CSSD][2708354816]clssgmAllocProc: (0x7f919807dbb0) allocated
2020-07-07 17:55:03.110: [ CSSD][2708354816]clssgmClientConnectMsg: properties of cmProc 0x7f919807dbb0 - 1,2,3,4,5
2020-07-07 17:55:03.110: [ CSSD][2708354816]clssgmClientConnectMsg: Connect from con(0x13e3) proc(0x7f919807dbb0) pid(10932) version 11:2:1:4, properties: 1,2,3,4,5
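The key signal in the cssd log is the repeated "has a disk HB, but no network HB" message. As a rough sketch, the helper below counts those messages per node in a log file. The function name count_missing_nhb and the log path in the usage comment are illustrative assumptions, not part of the original case.

```shell
#!/bin/sh
# Count "disk HB but no network HB" messages per node in a CSSD log.
# count_missing_nhb is a hypothetical helper name.
# Usage (the path is an assumption, adjust to your Grid home):
#   count_missing_nhb /path/to/ocssd.log
count_missing_nhb() {
    # -o prints only the matched text, so every occurrence is counted
    # even when several log entries share one physical line.
    grep -o 'node [0-9]*, [^,]*, has a disk HB, but no network HB' "$1" |
        sort | uniq -c
}
```

A steadily growing count for one node suggests that its disk heartbeat is still being written while its network heartbeat is not arriving.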
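Diagnosis:
The RAC cluster experienced a split-brain. The cssd log repeatedly reports "has a disk HB, but no network HB": the disk heartbeat was still being written while the network heartbeat over the interconnect was lost. The cause was that the firewall on the node servers did not accept traffic from the private network addresses and the HAIP addresses.
After the firewall rules were changed to accept these addresses (disabling the firewall also works), CRS automatically restarted the instance on node 2.
Firewall rules: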
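A minimal sketch of the kind of rule change involved, assuming the rules match on source address as in this case. The function print_accept_rules is a hypothetical helper that only prints the commands for review and does not run them. The 169.254.0.0/16 range in the usage comment is the link-local range from which HAIP addresses are assigned; ze01-priv and ze02-priv are the private host names used in this case.

```shell
#!/bin/sh
# Print (do not execute) iptables commands that accept traffic from the
# given source addresses. print_accept_rules is a hypothetical helper.
print_accept_rules() {
    for src in "$@"; do
        # -I inserts at the top of the INPUT chain, so the rule is
        # evaluated before any final REJECT rule.
        echo "iptables -I INPUT -s ${src} -j ACCEPT"
    done
    # On RHEL 6 this saves the running rules to /etc/sysconfig/iptables.
    echo "service iptables save"
}

# Example (run the printed commands as root on each node after review):
#   print_accept_rules ze01-priv ze02-priv 169.254.0.0/16
```

Printing the commands first makes it easier to review them on both nodes before touching a production firewall.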
# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere            state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere            icmp echo-request
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  100.82.16.8          anywhere            state NEW tcp dpt:ssh    # Oracle backup server address
ACCEPT     tcp  --  100.82.16.9          anywhere            state NEW tcp dpt:ssh    # Oracle backup server address
ACCEPT     tcp  --  100.82.16.10         anywhere            state NEW tcp dpt:ssh    # Oracle backup server address
ACCEPT     tcp  --  100.82.10.11         anywhere            state NEW tcp dpt:ssh    # RAC public address
ACCEPT     tcp  --  100.82.10.12         anywhere            state NEW tcp dpt:ssh    # RAC public address
ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:ncube-lm
ACCEPT     all  --  anywhere             anywhere            source IP range 100.82.16.152-100.82.16.153
ACCEPT     all  --  ze01-priv            anywhere            # RAC private address
ACCEPT     all  --  ze02-priv            anywhere            # RAC private address
ACCEPT     all  --  anywhere             anywhere            source IP range 100.82.11.11-100.82.11.16
ACCEPT     all  --  169.254.85.175       anywhere            # HAIP address
ACCEPT     all  --  169.254.180.52       anywhere            # HAIP address
ACCEPT     tcp  --  anywhere             anywhere            source IP range 100.82.16.26-100.82.16.27 state NEW tcp multiport dports zabbix-agent,zabbix-trapper
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited
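Rule order matters here: an ACCEPT rule only helps if it is evaluated before the final REJECT. The sketch below reads `iptables -L INPUT` style output on standard input and checks whether a given source has an ACCEPT rule ahead of the first REJECT. The function name accepted_before_reject is hypothetical, and the check only looks at the source column, so rules expressed as "source IP range" are not detected.

```shell
#!/bin/sh
# Succeed (exit status 0) if the source given as $1 has an ACCEPT rule
# that appears before the first REJECT rule in `iptables -L` output
# read from standard input.
# accepted_before_reject is a hypothetical helper name.
accepted_before_reject() {
    awk -v src="$1" '
        $1 == "REJECT"              { exit }             # stop at the first REJECT
        $1 == "ACCEPT" && $4 == src { found = 1; exit }  # column 4 is the source
        END                         { exit (found ? 0 : 1) }
    '
}

# Example (as root):
#   iptables -L INPUT | accepted_before_reject 169.254.85.175 && echo "HAIP accepted"
```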