[原]Oracle数据文件损坏的模拟和修复(一) |ORA-01578 data block corrupted|

造成数据块损坏的原因通常是由于开启了异步I/O或者增加了写进程,还有可能是硬件引起的,今天模拟一下该问题的发生及修复方法。由于水平有限,那面疏漏,欢迎大家指正。

 

创建测试环境

建立测试表空间:

create tablespace test
datafile 
  '/u02/oradata/logdw/test01.dbf' size 180K autoextend off
logging
segment space management auto
extent management local;

创建测试用户并设置相应的权限:

create user test
identified by test
default tablespace test;
grant connect to test;
grant resource to test;
revoke unlimited tablespace from test;
alter user test quota unlimited on test;

插入测试数据,直到填满 test01.dbf:

declare
  i number;
begin
	for	 i in 1..99999 loop
  	insert into test_table values(i,lpad('Q',30));
  	commit;
  end loop;
end;
 SQL> declare
  2    i number;
  3  begin
  4  for i in 1..99999 loop
  5    insert into test_table values(i,lpad('Q',30));
  6    commit;
  7    end loop;
  8  end;
  9  /
declare
*
ERROR at line 1:
ORA-01653: unable to extend table TEST.TEST_TABLE by 8 in tablespace TEST
ORA-06512: at line 5

test01.dbf 已经填满了数据后,看看一共插入了多少条数据:

select count(*) from test_table;

  COUNT(*)
----------
      2356

如果我想再加一个索引是不会成功的,我为该表空间添加多一个数据文件来扩容。

alter tablespace test add datafile '/u02/oradata/logdw/test02.dbf' size 180K autoextend off;

扩容后的 test 表空间可以创建索引了:

create index idx_test_table on test_table(id);

通过这样一些步骤,现在数据的分布情况是test01.dbf这个数据文件装着数据,test02.dbf装着索引。可以使用查询一下dba_extents 确认一下:

select owner,segment_name,segment_type,extent_id,file_id,block_id,bytes  
from dba_extents 
where owner='TEST';
OWNER SEGMENT_NAME    SEGMENT_TYPE        EXTENT_ID    FILE_ID   BLOCK_ID      BYTES
----- --------------- ------------------ ---------- ---------- ---------- ----------
TEST  TEST_TABLE      TABLE                       0         15          8      65536
TEST  TEST_TABLE      TABLE                       1         15         16      65536
TEST  IDX_TEST_TABLE  INDEX                       0         16          8      65536
TEST  IDX_TEST_TABLE  INDEX                       1         16         16      65536

一条使用索引的典型SQL:

SQL> select count(id) from test_table; 

 COUNT(ID)
----------
      2356

----------------------------------------------------------------------------------------
| Id  | Operation             | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                |     1 |    13 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE       |                |     1 |    13 |            |          |
|   2 |   INDEX FAST FULL SCAN| IDX_TEST_TABLE |  2356 | 30628 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

自此,我们的测试环境就建立完成了。

 

破坏并修复数据文件

在Linux修改二进制文件的方法可以参考我这篇blog《在 Linux下使用vim配合xxd查看并编辑二进制文件》,这里就不熬述了。不同位置的数据块遭到损坏就会报不同的错误,非常幸运,我一个下午就遇到了3个。

 

使用drop和create修复ORA-01578

关闭数据库后打开test02.dbf ,定位到偏移地址 001e9c0 左右的地方修改一下,要注意偏移的位置,不同的位置可能会引发不同的问题,保存并退出。

使用dbv工具检查一下数据文件:

[oracle@logserver logdw]$ dbv file=test02.dbf blocksize=8192

DBVERIFY: Release 11.2.0.1.0 - Production on Wed Jun 23 09:16:35 2010

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - Verification starting : FILE = /u02/oradata/logdw/test02.dbf
Page 15 is marked corrupt
Corrupt block relative dba: 0x0400000f (file 16, block 15)
Bad check value found during dbv: 
Data in bad block:
 type: 6 format: 2 rdba: 0x0400000f
 last change scn: 0x0000.0a71e746 seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0xe7460601
 check value in block header: 0x46a7
 computed block checksum: 0x12d1

Page 16 is marked corrupt
Corrupt block relative dba: 0x04000010 (file 16, block 16)
Bad check value found during dbv: 
Data in bad block:
 type: 6 format: 2 rdba: 0x04000010
 last change scn: 0x0000.0a71e746 seq: 0x2 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0xe7460602
 check value in block header: 0x99f0
 computed block checksum: 0xadca

DBVERIFY - Verification complete

Total Pages Examined         : 23
Total Pages Processed (Data) : 0
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 4
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 10
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 7
Total Pages Marked Corrupt   : 2
Total Pages Influx           : 0
Total Pages Encrypted        : 0
Highest block SCN            : 175236934 (0.175236934)

此时,数据库可以正常启动,但是当接触到坏块的时候就会报ORA-01578:

SQL> startup ; 
ORACLE instance started.

Total System Global Area 1887350784 bytes
Fixed Size                  2214456 bytes
Variable Size            1560282568 bytes
Database Buffers          301989888 bytes
Redo Buffers               22863872 bytes
Database mounted.
Database opened.
SQL> connect test/test
SQL> show user;
USER is "TEST"
SQL> set autotrace on ; 
SQL> select count(id) from test_table;  
select count(id) from test_table
                      *
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 16, block # 15)
ORA-01110: data file 16: '/u02/oradata/logdw/test02.dbf'

SQL> select count(*) from test_table;

  COUNT(*)
----------
      2356

-------------------------------------------------------------------------
| Id  | Operation          | Name       | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |            |     1 |     5   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |            |     1 |            |          |
|   2 |   TABLE ACCESS FULL| TEST_TABLE |  2356 |     5   (0)| 00:00:01 |
-------------------------------------------------------------------------

如果查询不用到索引,就不会有问题,证明我们的数据没有问题,仅是损失了索引。由于场景特殊,我们知道肯定是索引损坏了,数据没有丢失的,通常的场景是我们知道损坏所在的文件编号和块编号,但是不知道对应的是什么对象,可以用以下查询定位到对应的对象上:

sys$logdw@logserver SQL> SELECT tablespace_name, owner,segment_name,segment_type
  2  FROM dba_extents
  3  WHERE file_id = &file
  4  and &block between block_id AND block_id + blocks - 1 ;
Enter value for file: 16
old   3: WHERE file_id = &file
new   3: WHERE file_id = 16
Enter value for block: 15
old   4: and &block between block_id AND block_id + blocks - 1
new   4: and 15 between block_id AND block_id + blocks - 1

TABLESPACE OWNER      SEGMENT_NAME         SEGMENT_TYPE
---------- ---------- -------------------- ------------------
TEST       TEST       IDX_TEST_TABLE       INDEX

由于数据没有丢失,我尝试重建索引。

SQL> alter index IDX_TEST_TABLE rebuild online ; 
alter index IDX_TEST_TABLE rebuild online
*
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-01658: unable to create INITIAL extent for segment in tablespace TEST

空间不足……,于是drop后再create吧:

SQL> drop index IDX_TEST_TABLE ; 

Index dropped.

Elapsed: 00:00:00.57
SQL> create index IDX_TEST_TABLE on test_table(id);

Index created.

Elapsed: 00:00:00.10

我们再看看使用索引的查询:

 SQL> select count(id) from test_table ;

 COUNT(ID)
----------
      2356

----------------------------------------------------------------------------------------
| Id  | Operation             | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE       |                |     1 |     4 |            |          |
|   2 |   INDEX FAST FULL SCAN| IDX_TEST_TABLE |  2356 |  9424 |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

索引恢复正常了。

该方法适用于坏块落在非数据区的情况。

 

使用RMAN修复ORA-01578

如果损坏了仅仅两个数据块(8K*2),导致一个几百兆的索引重建,这样的很不值,我们可以试一下Oracle的blockrecover,但前提是之前有备份。

RMAN> blockrecover datafile 16 block 15; 
RMAN> blockrecover datafile 16 block 16; 
-- 或者组合起来
RMAN> blockrecover datafile 16 block 15 datafile 16 block 16; 
RMAN> blockrecover datafile 16 block 15,16;

如果坏块很多,例如我们就要写很长很长的命令,此时可以试一下:

RMAN> blockrecover corruption list;

这个命令是依靠 v$database_block_corruption 这个表进行工作的。Oracle每当发现一个损坏的数据块就会在这个表中多一条记录,相反地如果没有碰到,就自然不会在这表中显示出来的。

sys$logdw@logserver SQL> select * from v$database_block_corruption;

     FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTIO
---------- ---------- ---------- ------------------ ---------
        16         15          2                  0 CHECKSUM

我们可以借助rman对数据库、文件或表空间进行“扫描”,检查出所有的坏块。

RMAN> backup validate database;
RMAN> backup validate tablespace test;
RMAN> backup validate datafile 16;

如果数据库比较大或者表空间也比较大,backup validate就会长时间占用大量I/O资源,因此范围应该尽量小。

RMAN> backup validate datafile 16;

Starting backup at 2010-06-23 10:42:23
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=71 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00016 name=/u02/oradata/logdw/test02.dbf
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
16   FAILED 0              7            23              175356605 
  File Name: /u02/oradata/logdw/test02.dbf
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              0               
  Index      2              6               
  Other      0              10              

validate found one or more corrupt blocks
See trace file /u01/app/oracle/diag/rdbms/logdw/logdw/trace/logdw_ora_30130.trc for details
Finished backup at 2010-06-23 10:42:26

只要让Oracle直到了哪些是坏块就好办了,是blockrecover即可修复坏块:

RMAN> blockrecover corruption list; 

Starting recover at 2010-06-23 11:00:49
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=136 device type=DISK
searching flashback logs for block images

finished flashback log search, restored 2 blocks

starting media recovery
media recovery complete, elapsed time: 00:00:01

Finished recover at 2010-06-23 11:10:48

再使用索引查询一下数据:

test$logdw@logdw SQL> select count(id) from test_table ;

 COUNT(ID)
----------
      2356

----------------------------------------------------------------------------------------
| Id  | Operation             | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE       |                |     1 |     4 |            |          |
|   2 |   INDEX FAST FULL SCAN| IDX_TEST_TABLE |  2356 |  9424 |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

此时索引idx_test_table可以用了,一切都是在线做的。

该方法需要有rman的备份,如果日常有rman的备份就不用怕了。

 

使用exp和imp尽力挽救数据

如果没有rman备份呢?以下主要参考盖国强的《Oracle中模拟及修复数据块损坏

我们再用同样的方法编辑test01.dbf,定位到最后,将数据破坏。

可以预见test_table的数据已经损坏,索引没有损坏。

test$logdw@logdw SQL> select count(*) from test_table ; 
select count(*) from test_table
                     *
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 15, block # 23)
ORA-01110: data file 15: '/u02/oradata/logdw/test01.dbf'

Elapsed: 00:00:01.93
test$logdw@logdw SQL> select count(id) from test_table ;   

 COUNT(ID)
----------
      2356

----------------------------------------------------------------------------------------
| Id  | Operation             | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE       |                |     1 |     4 |            |          |
|   2 |   INDEX FAST FULL SCAN| IDX_TEST_TABLE |  2356 |  9424 |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

由于没有已经假设没有rman的备份,所以blockrecover已经不可能了,此时可以用vim+xxd重新对test01.dbf进行修改或者借助内部工具 bbed 将坏块修正也是可以的,据说这是Oracle 8i之前的做法,但几乎是mission impossible。
如果我们将任务定位为可以容忍数据丢失,希望能尽量挽救,减少数据丢失。

[oracle@logserver tmp]$ exp test/test file=test_table.dmp tables=test_table         

Export: Release 11.2.0.1.0 - Production on Wed Jun 23 14:08:29 2010

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.


Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Export done in ZHS16GBK character set and AL16UTF16 NCHAR character set

About to export specified tables via Conventional Path ...
. . exporting table                     TEST_TABLE
EXP-00056: ORACLE error 1578 encountered
ORA-01578: ORACLE data block corrupted (file # 15, block # 23)
ORA-01110: data file 15: '/u02/oradata/logdw/test01.dbf'
Export terminated successfully with warnings.

我们使用导出得到一个错误

通过设置内部事件,设置在全表扫描时跳过损坏的数据块

ALTER SYSTEM SET EVENTS='10231 trace name context forever,level 10' ;
ALTER SYSTEM SET EVENTS='10231 trace name context off' ;

exp可以成功执行,将损坏的数据表删除掉,在imp即可挽救部分数据。

在当前的场景中,没有设置内部事件的情况下挽救了 2356 条数据中的 1920 条;而设置了内部事件之后挽救了 2175 条数据。

该方法几乎成了最后的救命稻草了,如果你足够强悍可以试一下用bbed或者其他工具直接对数据文件进行二进制编辑从而手动修复数据文件。

 

小结

遇到ORA-01578后,第一步应首先确定rman备份情况,如果有可用备份,一切都不太糟糕,几条rman命令就可以修复;

如果没有再确定坏块对应的是什么对象,如果是索引等非数据对象,可以考虑drop后再create一次;

如果坏块落在数据上,设置10231内部事件,然后求神拜佛,能救回多少数据只能看人品了。

posted @ 2010-06-24 00:31  killkill  阅读(9680)  评论(3编辑  收藏  举报