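Kernel parameters to tune on Oracle RAC systems
Overview
On a two-node Oracle 11.2.0.4 RAC running on RHEL 7, the second database node was evicted from the cluster, and its repeated automatic restarts all ended in failure. The GI alert log gave the eviction reason as loss of private-network communication, and the ASM and database alert logs showed "IPC Send timeout". At the time, ping and traceroute revealed nothing abnormal, but the netstat -s output collected by OSW showed the IP "packet reassembles failed" counter growing sharply during the incident window.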
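The counter growth described above can be checked with a few lines of Python. This is only a minimal sketch: it assumes each snapshot is the raw text of one `netstat -s` run, with the counter printed as "N packet reassembles failed" as Linux net-tools does. The function names and the snapshot labels are hypothetical; splitting an OSW archive file into per-sample snapshots is left to the reader.

```python
import re

# Matches the line "<count> packet reassembles failed" in `netstat -s` output.
REASM_RE = re.compile(r"^\s*(\d+) packet reassembles failed", re.MULTILINE)


def reassembly_failures(netstat_text):
    """Return the reassembly-failure counter from one `netstat -s` output, or None."""
    match = REASM_RE.search(netstat_text)
    return int(match.group(1)) if match else None


def failure_deltas(snapshots):
    """Return (label, increase) pairs for consecutive snapshots.

    `snapshots` is a time-ordered list of (label, netstat_text) tuples.
    Snapshots without the counter line are skipped.
    """
    deltas = []
    previous = None
    for label, text in snapshots:
        current = reassembly_failures(text)
        if current is None:
            continue
        if previous is not None:
            deltas.append((label, current - previous))
        previous = current
    return deltas
```

A large increase between two adjacent samples that coincides with the CRS-1612/1611/1610 messages points at IP fragment reassembly on the interconnect rather than at a link failure.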
Node1 GI ALERT LOG
2017-03-02 11:46:26.607: [/u01/11.2.0/grid/bin/oraagent.bin(121496)]CRS-5011:Check of resource "testdb" failed: details at "(:CLSN00007:)" in "/u01/11.2.0/grid/log/anbob/agent/crsd/oraagent_oracle//oraagent_oracle.log"
2017-03-02 11:46:26.612: [crsd(175117)]CRS-2765:Resource 'ora.testdb.db' has failed on server 'anbob'.
2017-03-02 11:46:42.866: [cssd(172139)]CRS-1612:Network communication with node db2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.620 seconds
2017-03-02 11:46:50.869: [cssd(172139)]CRS-1611:Network communication with node db2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 6.620 seconds
2017-03-02 11:46:54.870: [cssd(172139)]CRS-1610:Network communication with node db2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.620 seconds
2017-03-02 11:46:57.493: [cssd(172139)]CRS-1607:Node db2 is being evicted in cluster incarnation 351512591; details at (:CSSNM00007:) in /u01/11.2.0/grid/log/anbob/cssd/ocssd.log.
2017-03-02 11:46:58.626: [cssd(172139)]CRS-1662:Member kill requested by node db2 for member number 0, group DBHBCRM
Node2 GI ALERT LOG
2017-03-02 11:46:45.378: [cssd(177450)]CRS-1663:Member kill issued by PID 84816 for 1 members, group DBCRM. Details at (:CSSGM00044:) in /u01/11.2.0/grid/log/db2/cssd/ocssd.log.
2017-03-02 11:46:58.982: [cssd(177450)]CRS-1608:This node was evicted by node 1, anbob; details at (:CSSNM00005:) in /u01/11.2.0/grid/log/db2/cssd/ocssd.log.
2017-03-02 11:46:58.983: [cssd(177450)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/11.2.0/grid/log/db2/cssd/ocssd.log
Solution
For an Oracle RAC environment, the Linux kernel parameters that need specific tuning are as follows:
#set new
net.ipv4.conf.eno3.rp_filter = 2
net.ipv4.conf.bond0.rp_filter = 2
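# rp_filter accepts three values:
#   0 = no source validation
#   1 = strict mode (RFC 3704 strict reverse path)
#   2 = loose mode (RFC 3704 loose reverse path)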
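net.ipv4.ipfrag_high_thresh = 67108864 # upper memory threshold for IP fragment reassembly, default 4194304
net.ipv4.ipfrag_low_thresh = 66060288 # lower memory threshold for IP fragment reassembly, default 3145728
net.ipv4.ipfrag_time = 120 # fragment timeout in seconds, default 30
net.ipv4.ipfrag_max_dist = 1024 # maximum allowed distance between valid fragments, default 64
net.core.netdev_budget = 600 # maximum number of packets received in one softirq cycle, default 300
net.core.netdev_max_backlog = 2000 # maximum number of packets allowed to queue when an interface receives packets faster than the kernel can process them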
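After applying the settings, it can help to confirm that the running kernel actually has them. The sketch below compares text in `key = value` form (for example the output of `sysctl -a`, or the contents of a sysctl configuration file) against the values listed above. The helper names are hypothetical, and the interface names eno3 and bond0 come from this particular system, so they need adjusting elsewhere.

```python
# Target values taken from the list above; eno3 and bond0 are site-specific.
RECOMMENDED = {
    "net.ipv4.conf.eno3.rp_filter": "2",
    "net.ipv4.conf.bond0.rp_filter": "2",
    "net.ipv4.ipfrag_high_thresh": "67108864",
    "net.ipv4.ipfrag_low_thresh": "66060288",
    "net.ipv4.ipfrag_time": "120",
    "net.ipv4.ipfrag_max_dist": "1024",
    "net.core.netdev_budget": "600",
    "net.core.netdev_max_backlog": "2000",
}


def parse_sysctl(text):
    """Parse 'key = value' lines into a dict, ignoring '#' comments and blank lines."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def find_mismatches(current_text, recommended=RECOMMENDED):
    """Return {key: (current_value_or_None, wanted_value)} for every differing key."""
    current = parse_sysctl(current_text)
    return {
        key: (current.get(key), wanted)
        for key, wanted in recommended.items()
        if current.get(key) != wanted
    }
```

Feeding the function the captured output of `sysctl -a` on each node should return an empty dict once all values are in effect; a key reported with `None` means the parameter was not present in the input at all.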