导航

从巡检备份失败排查解决数据库故障

Posted on 2016-05-09 15:44  datalife  阅读(814)  评论(0编辑  收藏  举报

最近某业务备份报错:
Starting Control File and SPFILE Autobackup at 09-MAY-16
piece handle=c-335040995-20160509-00 comment=API Version 2.0,MMS Version 5.0.0.0
Finished Control File and SPFILE Autobackup at 09-MAY-16

sql statement: alter system archive log current

released channel: ch00

released channel: ch01

allocated channel: ch00
channel ch00: sid=491 instance=dlsc1 devtype=SBT_TAPE
channel ch00: Veritas NetBackup for Oracle - Release 7.0 (2010010501)

released channel: ch00
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-12001: could not open channel ch01
RMAN-10008: could not create channel context
RMAN-10003: unable to connect to target database
ORA-12170: TNS:Connect timeout occurred
RMAN> RMAN>
Recovery Manager complete.
Script /oracle/nbu_scripts/hot_database_backup.sh
==== ended in error on Mon May 9 09:26:39 BEIST 2016 ====
从备份信息上看数据库在备份完Control File and SPFILE切换归档日志后备份归档出现问题。
RMAN-10003: unable to connect to target database提示不能连接到目标库。
检查节点1,发现节点1正常。节点2无法进入sqlplus和rman下。
oracle@xxxxdb2:/oracle$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.3.0 - Production on Mon May 9 14:46:01 2016

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

Connected to an idle instance.

SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '/dev/rspfile'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
SQL>
SQL> exit
Disconnected
oracle@xxxxdb2:/oracle$ rman target /

Recovery Manager: Release 10.2.0.3.0 - Production on Mon May 9 14:46:22 2016

Copyright (c) 1982, 2005, Oracle. All rights reserved.

connected to target database (not started)

RMAN> exit


Recovery Manager complete.
oracle@xxxxdb2:/oracle$
查看crs状态:
Recovery Manager complete.
oracle@xxxxdb2:/oracle$ crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.

oracle@xxxxdb2:/oracle$
哎,难道是crs没起来?
尝试启动crs
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# ./crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
等了一会以为crs能顺利启动结果还是起不来如下:
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# ./crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
什么情况,执行ocr命令看看ocr的情况结果:
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# ./ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
到这里已经知道啥问题了,存储肯定没认到。
查看数据所在的vg
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# lsvg
rootvg
datavg
archvg
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# lsvg datavg
0516-010 : Volume group must be varied on; use varyonvg command.果然datavg没有激活。到这里别急着激活datavg
因为datavg没有激活很可能是hacmp没有起来,查看hacmp状态,确实没有起来。
root@xxxxdb2:/# /usr/es/sbin/cluster/utilities/clshowsrv -v
Status of the RSCT subsystems used by HACMP:
Subsystem Group PID Status
topsvcs topsvcs inoperative
grpsvcs grpsvcs inoperative
grpglsm grpsvcs inoperative
emsvcs emsvcs inoperative
emaixos emsvcs inoperative
ctrmc rsct 262394 active
Status of the HACMP subsystems:
Subsystem Group PID Status
clcomdES clcomdES 200944 active
clstrmgrES cluster 311646 active
Status of the optional HACMP subsystems:
Subsystem Group PID Status
clinfoES cluster inoperative


Obtaining information via SNMP from Node: xxxxdb1...

_____________________________________________________________________________
Cluster Name: xxxxdb
Cluster State: UP
Cluster Substate: STABLE
_____________________________________________________________________________


Node Name: xxxxdb1 State: UP

Network Name: net_ether_02 State: UP

Address: 192.168.77.194 Label: xxxxdb1_priv State: UP


Node Name: xxxxdb2 State: DOWN

Network Name: net_ether_02 State:
Address: xxxx Label: xxxxdb2 State: DOWN
Address: 192.168.77.195 Label: xxxxdb2_priv State: DOWN

 

Cluster Name: xxxxdb

Resource Group Name: orarg
Startup Policy: Online On All Available Nodes
Fallover Policy: Bring Offline (On Error Node Only)
Fallback Policy: Never Fallback
Site Policy: ignore
Priority Override Information:
Primary Instance POL:
[MORE...5]
问题找到了,启动hacmp吧
smitty clstart
Start Cluster Services on these nodes
启动完毕:
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# lsvg datavg
VOLUME GROUP: datavg VG IDENTIFIER: 00c63cf200004c000000011d0937bc9f
VG STATE: active PP SIZE: 64 megabyte(s)
VG PERMISSION: read/write TOTAL PPs: 7990 (511360 megabytes)
MAX LVs: 256 FREE PPs: 661 (42304 megabytes)
LVs: 44 USED PPs: 7329 (469056 megabytes)
OPEN LVs: 5 QUORUM: 3
TOTAL PVs: 5 VG DESCRIPTORS: 5
STALE PVs: 0 STALE PPs: 0
ACTIVE PVs: 5 AUTO ON: no
Concurrent: Enhanced-Capable Auto-Concurrent: Disabled
VG Mode: Concurrent
Node ID: 2 Active Nodes: 1
MAX PPs per VG: 32768 MAX PVs: 1024
LTG size (Dynamic): 256 kilobyte(s) AUTO SYNC: no
HOT SPARE: no BB POLICY: relocatable
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# ./ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 2
Total space (kbytes) : 130852
Used space (kbytes) : 3300
Available space (kbytes) : 127552
ID : 222055846
Device/File Name : /dev/rocr
Device/File integrity check succeeded

Device/File not configured

Cluster registry integrity check succeeded

看来datavg已挂载
接下来启动crs
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# ./crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# ./crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.xxxx.db application ONLINE ONLINE xxxxdb1
ora....c1.inst application ONLINE ONLINE xxxxdb1
ora....c2.inst application ONLINE ONLINE xxxxdb2
ora....B1.lsnr application ONLINE ONLINE xxxxdb1
ora....db1.gsd application ONLINE ONLINE xxxxdb1
ora....db1.ons application ONLINE ONLINE xxxxdb1
ora....db1.vip application ONLINE ONLINE xxxxdb1
ora....B2.lsnr application ONLINE ONLINE xxxxdb2
ora....db2.gsd application ONLINE ONLINE xxxxdb2
ora....db2.ons application ONLINE ONLINE xxxxdb2
ora....db2.vip application ONLINE ONLINE xxxxdb2

ok到这里问题解决。分析下故障思路
先是从备份信息得到NBU无法连接到节点2目标库-->sqlplus和rman均失败-->crs启动失败-->ocrcheck失败-->datavg没有激活-->hacmp没有启动