[Case] Oracle ORA-29740

Oracle RAC information

Database version: Oracle 11.2.0.4
Nodes: 2-node RAC
Operating system: Red Hat Enterprise Linux Server release 6.9 (Santiago)

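Symptoms:

The instance on node 2 went down, and its VIP failed over to node 1.

The alert log and cssd log entries from the failure window are shown below.

Alert log: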

IPC Send timeout: Terminating pid 34 osid 52694
Thu Jul 02 12:07:39 2020
Communications reconfiguration: instance_number 1
Detected an inconsistent instance membership by instance 1
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmon_52640.trc  (incident=304089):
ORA-29740: evicted by instance number 1, group incarnation 6
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_304089/orcl2_lmon_52640_i304089.trc
Thu Jul 02 12:07:41 2020
IPC Send timeout detected. Sender: ospid 52682 [oracle@ze02 (LGWR)]
Receiver: inst 1 binc 460990968 ospid 36977
IPC Send timeout to 1.4 inc 4 for msg type 73 from opid 28
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmon_52640.trc:
ORA-29740: evicted by instance number 1, group incarnation 6
LMON (ospid: 52640): terminating the instance due to error 29740
Thu Jul 02 12:07:42 2020
ORA-1092 : opitsk aborting process
Thu Jul 02 12:07:46 2020
System state dump requested by (instance=2, osid=52640 (LMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_diag_52630_20200702120746.trc
alert_orcl.log
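
The eviction details can be pulled out of an alert log excerpt like the one above mechanically. The following is a minimal sketch, not an Oracle tool: the function name `find_evictions` is hypothetical, and the regular expression assumes the ORA-29740 message wording exactly as it appears in this excerpt.

```python
import re

# Assumption: the message wording matches the ORA-29740 lines in the excerpt above.
EVICTION_RE = re.compile(
    r"ORA-29740: evicted by instance number (\d+), group incarnation (\d+)"
)


def find_evictions(alert_text):
    """Return distinct (evicting_instance, group_incarnation) pairs in order of first appearance."""
    seen = []
    for match in EVICTION_RE.finditer(alert_text):
        pair = (int(match.group(1)), int(match.group(2)))
        if pair not in seen:
            seen.append(pair)
    return seen


# Lines copied from the alert log excerpt above.
sample = """\
Detected an inconsistent instance membership by instance 1
ORA-29740: evicted by instance number 1, group incarnation 6
LMON (ospid: 52640): terminating the instance due to error 29740
ORA-29740: evicted by instance number 1, group incarnation 6
"""

print(find_evictions(sample))  # [(1, 6)]
```

Here the result says that instance 1 evicted the local instance at group incarnation 6, which is the same information LMON reports before terminating the instance.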

Error messages in cssd.log:

2020-07-07 17:54:58.329: [    CSSD][2708354816]clssgmDiscEndpcl: gipcDestroy 0x1383
2020-07-07 17:54:58.329: [    CSSD][2708354816]clssgmDeadProc: proc 0x7f9198099540
2020-07-07 17:54:58.329: [    CSSD][2708354816]clssgmDestroyProc: cleaning up proc(0x7f9198099540) con(0x1354) skgpid  ospid 10669 with 0 clients, refcount 0
2020-07-07 17:54:58.329: [    CSSD][2708354816]clssgmDiscEndpcl: gipcDestroy 0x1354
2020-07-07 17:54:58.618: [    CSSD][2694764288]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2020-07-07 17:54:59.122: [    CSSD][2691610368]clssnmSendingThread: sending join msg to all nodes
2020-07-07 17:54:59.122: [    CSSD][2691610368]clssnmSendingThread: sent 5 join msgs to all nodes
2020-07-07 17:54:59.349: [    CSSD][2699503360]clssnmvDHBValidateNcopy: node 1, ze02, has a disk HB, but no network HB, DHB has rcfg 483520902, wrtcnt, 6414293, LATS 4294746660, lastSeqNo 6414292, uniqueness 1588026083, timestamp 1594115698/1792664848
2020-07-07 17:54:59.618: [    CSSD][2694764288]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2020-07-07 17:55:00.355: [    CSSD][2699503360]clssnmvDHBValidateNcopy: node 1, ze02, has a disk HB, but no network HB, DHB has rcfg 483520902, wrtcnt, 6414294, LATS 4294747670, lastSeqNo 6414293, uniqueness 1588026083, timestamp 1594115699/1792665848
2020-07-07 17:55:00.619: [    CSSD][2694764288]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2020-07-07 17:55:01.355: [    CSSD][2699503360]clssnmvDHBValidateNcopy: node 1, ze02, has a disk HB, but no network HB, DHB has rcfg 483520902, wrtcnt, 6414295, LATS 4294748670, lastSeqNo 6414294, uniqueness 1588026083, timestamp 1594115700/1792666848
2020-07-07 17:55:01.619: [    CSSD][2694764288]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2020-07-07 17:55:02.383: [    CSSD][2699503360]clssnmvDHBValidateNcopy: node 1, ze02, has a disk HB, but no network HB, DHB has rcfg 483520902, wrtcnt, 6414296, LATS 4294749700, lastSeqNo 6414295, uniqueness 1588026083, timestamp 1594115701/1792667848
2020-07-07 17:55:02.619: [    CSSD][2694764288]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2020-07-07 17:55:03.108: [    CSSD][2708354816]clssgmDeadProc: proc 0x7f9198039460
2020-07-07 17:55:03.108: [    CSSD][2708354816]clssgmDestroyProc: cleaning up proc(0x7f9198039460) con(0x12f3) skgpid  ospid 10932 with 0 clients, refcount 0
2020-07-07 17:55:03.108: [    CSSD][2708354816]clssgmDiscEndpcl: gipcDestroy 0x12f3
2020-07-07 17:55:03.110: [    CSSD][2708354816]clssscSelect: cookie accept request 0xc85280
2020-07-07 17:55:03.110: [    CSSD][2708354816]clssgmAllocProc: (0x7f919807dbb0) allocated
2020-07-07 17:55:03.110: [    CSSD][2708354816]clssgmClientConnectMsg: properties of cmProc 0x7f919807dbb0 - 1,2,3,4,5
2020-07-07 17:55:03.110: [    CSSD][2708354816]clssgmClientConnectMsg: Connect from con(0x13e3) proc(0x7f919807dbb0) pid(10932) version 11:2:1:4, properties: 1,2,3,4,5
cssd log
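
The telling message in this log is the repeated "has a disk HB, but no network HB". A small script can count how often it appears per node, which helps when the log is long. This is only a sketch under stated assumptions: the function name `count_missing_network_hb` is hypothetical, and the pattern relies on the `clssnmvDHBValidateNcopy` message wording seen in the excerpt above.

```python
import re
from collections import Counter

# Assumption: the message wording matches the clssnmvDHBValidateNcopy lines above.
MISSING_NHB_RE = re.compile(
    r"clssnmvDHBValidateNcopy: node (\d+), ([^,\s]+), has a disk HB, but no network HB"
)


def count_missing_network_hb(cssd_text):
    """Count 'disk HB, but no network HB' messages per (node number, node name)."""
    counts = Counter()
    for match in MISSING_NHB_RE.finditer(cssd_text):
        counts[(int(match.group(1)), match.group(2))] += 1
    return counts


# Two entries shortened from the cssd.log excerpt above, plus one unrelated line.
sample = """\
2020-07-07 17:54:59.349: [    CSSD][2699503360]clssnmvDHBValidateNcopy: node 1, ze02, has a disk HB, but no network HB, DHB has rcfg 483520902
2020-07-07 17:54:59.618: [    CSSD][2694764288]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2020-07-07 17:55:00.355: [    CSSD][2699503360]clssnmvDHBValidateNcopy: node 1, ze02, has a disk HB, but no network HB, DHB has rcfg 483520902
"""

for (node_number, node_name), hits in count_missing_network_hb(sample).items():
    print(node_number, node_name, hits)
```

A steadily growing count for one peer, while its disk heartbeat keeps advancing, points at the interconnect path rather than at the node itself.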

 

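Diagnosis:

The RAC cluster went into a split-brain condition: the firewall on the node servers did not accept traffic from the private interconnect addresses or the HAIP addresses. This is consistent with the "has a disk HB, but no network HB" messages in cssd.log.

Once the firewall rules are corrected, or the firewall is turned off, CRS brings the instance on node 2 back up automatically.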

 

Firewall rules:

# iptables -L
Chain INPUT (policy ACCEPT)
target   prot  opt  source            destination
ACCEPT   all   --   anywhere          anywhere      state RELATED,ESTABLISHED
ACCEPT   icmp  --   anywhere          anywhere      icmp echo-request
ACCEPT   all   --   anywhere          anywhere
ACCEPT   tcp   --   100.82.16.8       anywhere      state NEW tcp dpt:ssh   # Oracle backup server address
ACCEPT   tcp   --   100.82.16.9       anywhere      state NEW tcp dpt:ssh   # Oracle backup server address
ACCEPT   tcp   --   100.82.16.10      anywhere      state NEW tcp dpt:ssh   # Oracle backup server address
ACCEPT   tcp   --   100.82.10.11      anywhere      state NEW tcp dpt:ssh   # RAC public address
ACCEPT   tcp   --   100.82.10.12      anywhere      state NEW tcp dpt:ssh   # RAC public address
ACCEPT   tcp   --   anywhere          anywhere      state NEW tcp dpt:ncube-lm
ACCEPT   all   --   anywhere          anywhere      source IP range 100.82.16.152-100.82.16.153
ACCEPT   all   --   ze01-priv         anywhere      # RAC private address
ACCEPT   all   --   ze02-priv         anywhere      # RAC private address
ACCEPT   all   --   anywhere          anywhere      source IP range 100.82.11.11-100.82.11.16
ACCEPT   all   --   169.254.85.175    anywhere      # HAIP address
ACCEPT   all   --   169.254.180.52    anywhere      # HAIP address
ACCEPT   tcp   --   anywhere          anywhere      source IP range 100.82.16.26-100.82.16.27 state NEW tcp multiport dports zabbix-agent,zabbix-trapper
REJECT   all   --   anywhere          anywhere      reject-with icmp-host-prohibited
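
To check a rule set like this one before it reaches production, the accepted sources can be compared against the addresses the cluster needs. The sketch below is illustrative only: `parse_source` and `uncovered` are hypothetical helper names, the accepted sources are typed in by hand from the listing above, and the check looks at source addresses alone (it ignores protocol, interface, and rule order). The ze01-priv and ze02-priv hostnames would have to be resolved to addresses first, so they are left out. The "before" rule set is an assumption, since the post does not show the rules as they were at the time of the failure.

```python
import ipaddress


def parse_source(spec):
    """Turn 'a.b.c.d', 'a.b.c.d/nn' or 'first-last' into an inclusive (first, last) address pair."""
    if "-" in spec:
        first, last = spec.split("-", 1)
        return ipaddress.ip_address(first), ipaddress.ip_address(last)
    network = ipaddress.ip_network(spec, strict=False)
    return network[0], network[-1]


def uncovered(required, accepted_sources):
    """Return the required addresses that no accepted source range contains."""
    ranges = [parse_source(spec) for spec in accepted_sources]
    missing = []
    for addr in required:
        ip = ipaddress.ip_address(addr)
        if not any(first <= ip <= last for first, last in ranges):
            missing.append(addr)
    return missing


# HAIP addresses taken from the listing above.
haip_addresses = ["169.254.85.175", "169.254.180.52"]

# Assumed rule set without the HAIP entries (the pre-fix rules are not shown in the post).
before = ["100.82.11.11-100.82.11.16"]
# Sources taken from the listing above.
after = ["100.82.11.11-100.82.11.16", "169.254.85.175", "169.254.180.52"]

print(uncovered(haip_addresses, before))  # ['169.254.85.175', '169.254.180.52']
print(uncovered(haip_addresses, after))   # []
```

Because HAIP addresses are assigned from the 169.254.0.0/16 link-local range, a site might prefer to accept that whole range on the interconnect interface rather than list individual addresses; whether that is acceptable depends on local security policy.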

 

posted on 2020-07-16 18:03  HelonTian