[原]Oracle Control File 意外情况研究
差不多4年前吧,那时候才刚接触Oracle,只记得一位Oracle老师说过“在你的Oracle DBA生涯中不应该丢失Control File”,可见Control File是如何的重要。
近日在泡ITPUB的时候发现一贴:能否在不停数据库情况下换掉损坏的控制文件 ,经过多位高现身说法,自己也做了点实验,与大家分享一下:
首先,看一下我的机器情况:
OS 情况:
[root@mailserver ~]# uname -a ; cat /etc/redhat-release ; Linux mailserver 2.6.9-67.ELsmp #1 SMP Fri Nov 16 12:49:06 EST 2007 x86_64 x86_64 x86_64 GNU/Linux CentOS release 4.6 (Final)
Oracle 信息:
sys@mydb(10.168.0.202) SQL> select INSTANCE_NAME,version,database_status from v$instance; INSTANCE_NAME VERSION DATABASE_STATUS ---------------- ----------------- ----------------- mydb 10.2.0.1.0 ACTIVE sys@mydb(10.168.0.202) SQL> select name,open_mode from v$database; NAME OPEN_MODE --------- ---------- MYDB READ WRITE sys@mydb(10.168.0.202) SQL> select name,status from v$controlfile; NAME STATUS -------------------------------------------------- ------- /u02/oradata/mydb/controlfile/control01.ctl /u02/oradata/mydb/controlfile/control02.ctl /u02/oradata/mydb/controlfile/control03.ctl /u02/oradata/mydb/controlfile/control04.ctl
从查询中可以看到我的Oracle有 4 个control file。
在OS中查一下control file的情况:
[oracle@mailserver controlfile]$ pwd /u02/oradata/mydb/controlfile [oracle@mailserver controlfile]$ ls -lth 总用量 31M -rw-r----- 1 oracle oinstall 7.8M 12月 2 14:57 control01.ctl -rw-r----- 1 oracle oinstall 7.8M 12月 2 14:57 control02.ctl -rw-r----- 1 oracle oinstall 7.8M 12月 2 14:57 control03.ctl -rw-r----- 1 oracle oinstall 7.8M 12月 2 14:57 control04.ctl [oracle@mailserver controlfile]$ watch -n1 md5sum * Every 1.0s: md5sum control01.ctl control02.ctl control03.ctl control04.ctl Wed Dec 2 15:11:32 2009 1cbca423c5c18fea3cb1fc13401c49d6 control01.ctl b6a02759b5896da4305d0f2d324073fb control02.ctl b6a02759b5896da4305d0f2d324073fb control03.ctl 1cbca423c5c18fea3cb1fc13401c49d6 control04.ctl
观察 control file 更新情况,这是个空闲库,我把listener也关掉了。我一直以为control file应该都是一样的内容,但是从md5sum的结果来看不是,第1和第4始终保持一样,第二和第三始终保持一样,很有趣的发现
开始做实验: 实验一,模拟Control File 丢失:
[oracle@mailserver controlfile]$ ls control01.ctl control02.ctl control03.ctl control04.ctl [oracle@mailserver controlfile]$ rm -rf control04.ctl [oracle@mailserver controlfile]$ ls control01.ctl control02.ctl control03.ctl [oracle@mailserver controlfile]$ watch -n2 md5sum * Every 2.0s: md5sum control01.ctl control02.ctl control03.ctl Wed Dec 2 15:21:12 2009 2ccb97d9a9ab834d75a3a503960a0cbd control01.ctl a1fc984f27925f0b1ab2b17928289963 control02.ctl a1fc984f27925f0b1ab2b17928289963 control03.ctl
可以观察到 control file 还是在更新的,在session#0 中做一些操作:
sys@mydb(10.168.0.202) SQL> alter system checkpoint ; System altered. sys@mydb(10.168.0.202) SQL> alter system checkpoint ; System altered. sys@mydb(10.168.0.202) SQL> alter system switch logfile ; System altered.
在session#1 中做一下操作:
zhyw@mydb(10.168.0.202) SQL> update ip_state set state=2 where state=1; 130 rows updated. zhyw@mydb(10.168.0.202) SQL> commit; Commit complete. zhyw@mydb(10.168.0.202) SQL> select count(*) from ip_state; COUNT(*) ---------- 189 zhyw@mydb(10.168.0.202) SQL> create table ip_state_bak as select * from ip_state; create table ip_state_bak as select * from ip_state * ERROR at line 1: ORA-00210: cannot open the specified control file ORA-00202: control file: '/u02/oradata/mydb/controlfile/control04.ctl' ORA-27041: unable to open file Linux-x86_64 Error: 2: No such file or directory Additional information: 3
终于看到丢失一个控制文件的影响了 ^_^
尝试复制一个control file 出来,Oracle还在跑啊
[oracle@mailserver controlfile]$ pwd ; ls; /u02/oradata/mydb/controlfile control01.ctl control02.ctl control03.ctl [oracle@mailserver controlfile]$ cp control01.ctl control04.ctl [oracle@mailserver controlfile]$ watch -n2 md5sum * Every 2.0s: md5sum control01.ctl control02.ctl control03.ctl control04.ctl Wed Dec 2 15:30:53 2009 2998c5d3ffe493312284efc5f87c95e7 control01.ctl f77690d016d0f1cbd5256cf0bdda0d93 control02.ctl f77690d016d0f1cbd5256cf0bdda0d93 control03.ctl 49a9c4f48732cded164f474c8464b999 control04.ctl [oracle@mailserver controlfile]$ watch -n2 md5sum * Every 2.0s: md5sum control01.ctl control02.ctl control03.ctl control04.ctl Wed Dec 2 15:31:13 2009 ae6ae06f1395ac9c6783d783be566fb4 control01.ctl 7e56cfb3412a0dc0ed15c55c7e128283 control02.ctl 7e56cfb3412a0dc0ed15c55c7e128283 control03.ctl 49a9c4f48732cded164f474c8464b999 control04.ctl
可见“自作聪明”地Copy一个Control File 出来是“冻结”不动的。过了一阵,Oracle 挂掉了,alert log中出现:
Wed Dec 2 15:32:35 2009 ********************* ATTENTION: ******************** The controlfile header block returned by the OS has a sequence number that is too old. The controlfile might be corrupted. PLEASE DO NOT ATTEMPT TO START UP THE INSTANCE without following the steps below. RE-STARTING THE INSTANCE CAN CAUSE SERIOUS DAMAGE TO THE DATABASE, if the controlfile is truly corrupted. In order to re-start the instance safely, please do the following: (1) Save all copies of the controlfile for later analysis and contact your OS vendor and Oracle support. (2) Mount the instance and issue: ALTER DATABASE BACKUP CONTROLFILE TO TRACE; (3) Unmount the instance. (4) Use the script in the trace file to RE-CREATE THE CONTROLFILE and open the database. ***************************************************** Instance terminated by USER, pid = 2335
看来丢失control file不可怕,最怕control file不一致啊。
此时Oracle挂掉了,control file 当然也不会更新了,做以下操作:
[oracle@mailserver controlfile]$ pwd ; ls ; /u02/oradata/mydb/controlfile control01.ctl control02.ctl control03.ctl control04.ctl [oracle@mailserver controlfile]$ rm -rf control04.ctl [oracle@mailserver controlfile]$ cp control01.ctl control04.ctl [oracle@mailserver controlfile]$ md5sum * f513df299a0789b6e527a2ef6c571d25 control01.ctl 28fd3396e22604ba55b6cbe9d2e8fd40 control02.ctl 28fd3396e22604ba55b6cbe9d2e8fd40 control03.ctl f513df299a0789b6e527a2ef6c571d25 control04.ctl
然后尝试启动Oracle
还是可以起来的 :)
sys@mydb(10.168.0.202) SQL> select name,open_mode from v$database; NAME OPEN_MODE --------- ---------- MYDB READ WRITE sys@mydb(10.168.0.202) SQL> select name,status from v$controlfile; NAME STATUS -------------------------------------------------- ------- /u02/oradata/mydb/controlfile/control01.ctl /u02/oradata/mydb/controlfile/control02.ctl /u02/oradata/mydb/controlfile/control03.ctl /u02/oradata/mydb/controlfile/control04.ctl
再做一个实验,叫实验二,这次是破坏掉其中一个control file,不是“弄丢”噢 :-)
[oracle@mailserver controlfile]$ date +%x%X ; date > control04.ctl 2009年12月02日15时57分32秒 ###### 在 2009年12月02日15时57分32秒 “搞坏”一个Control File ###### alert file 中出现: Wed Dec 2 15:57:34 2009 Hex dump of (file 0, block 1) in trace file /u01/app/admin/mydb/bdump/mydb_arc0_2935.trc Corrupt block relative dba: 0x00000001 (file 0, block 1) Completely zero block found during control file header read Wed Dec 2 15:57:34 2009 Errors in file /u01/app/admin/mydb/bdump/mydb_arc0_2935.trc: ORA-00202: control file: '/u02/oradata/mydb/controlfile/control04.ctl' Wed Dec 2 15:57:34 2009 Errors in file /u01/app/admin/mydb/bdump/mydb_arc0_2935.trc: ORA-00227: corrupt block detected in control file: (block 1, # blocks 1) ORA-00202: control file: '/u02/oradata/mydb/controlfile/control04.ctl'
这证明,我“的确搞坏”了一个控制文件
在一个普通用户中执行一些语句:
zhyw@mydb(10.168.0.202) SQL> update ip_state set state=2 where state=1; 130 rows updated. zhyw@mydb(10.168.0.202) SQL> commit; Commit complete.
可以更新,可以提交.
建表
zhyw@mydb(10.168.0.202) SQL> create table a01 as select * from ip_state; create table a01 as select * from ip_state * ERROR at line 1: ORA-00227: corrupt block detected in control file: (block 1, # blocks 1) ORA-00202: control file: '/u02/oradata/mydb/controlfile/control04.ctl'
看来Control File出问题后,建表是有问题的,不知其他DDL语句有没有影响。
做一个查询:
zhyw@mydb(10.168.0.202) SQL> select sysdate,ip,state from ip_state where rownum<10; SYSDATE IP STATE ------------------- --------------- ---------- 2009-12-02 16:00:52 10.168.10.132 2 2009-12-02 16:00:52 10.168.10.133 2 2009-12-02 16:00:52 10.168.10.134 2 2009-12-02 16:00:52 10.168.10.135 0 2009-12-02 16:00:52 10.168.10.136 0 2009-12-02 16:00:52 10.168.10.137 0 2009-12-02 16:00:52 10.168.10.138 0 2009-12-02 16:00:52 10.168.10.139 2 2009-12-02 16:00:52 10.168.10.140 0
大家可以看到,Oracle在坏掉一个控制文件的情况下坚持近3分钟,从“2009年12月02日15时57分32秒” 到“2009-12-02 16:00:52”
等不及了,强制checkpoint 一下,让Oracle挂掉
sys@mydb(10.168.0.202) SQL> select sysdate from dual; SYSDATE ------------------- 2009-12-02 16:00:56 sys@mydb(10.168.0.202) SQL> alter system checkpoint; ERROR: ORA-03114: not connected to ORACLE alter system checkpoint * ERROR at line 1: ORA-03113: end-of-file on communication channel
此时,alert file 出现:
Hex dump of (file 0, block 1) in trace file /u01/app/admin/mydb/bdump/mydb_ckpt_2909.trc Corrupt block relative dba: 0x00000001 (file 0, block 1) Completely zero block found during control file header read Wed Dec 2 16:00:57 2009 Errors in file /u01/app/admin/mydb/bdump/mydb_ckpt_2909.trc: ORA-00202: control file: '/u02/oradata/mydb/controlfile/control04.ctl' Wed Dec 2 16:00:57 2009 Errors in file /u01/app/admin/mydb/bdump/mydb_ckpt_2909.trc: ORA-00227: corrupt block detected in control file: (block 1, # blocks 1) ORA-00202: control file: '/u02/oradata/mydb/controlfile/control04.ctl' CKPT: terminating instance due to error 227 Instance terminated by CKPT, pid = 2909
总结一下: 1。Control File丢失(当然不是全部丢失啦),系统还是可以checkpoing 和 switch logfile 的。 2。Control File不一致,Oracle马上就挂掉了。 3。Control File破坏掉后,Oracle还是能够坚持一阵子的,select update都没有问题,checkpoint的时候就挂掉了。