Simulating the accidental deletion of a control file

2016-05-14 16:26  abce

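On Linux and Unix, deleting one of the control files by hand while the database is running does not interrupt the running database, as the following test shows: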

#### Delete a control file

$ rm control01.ctl

After the deletion, the alert log reports no errors and the database keeps running normally.

 

Now run the following statements in the database:

SQL> alter system checkpoint;
SQL> alter system switch logfile;
SQL> alter system switch logfile;
SQL> alter system switch logfile;

The corresponding alert log entries show that the database is still working normally:

Sat May 14 07:52:39 2016
Thread 1 advanced to log sequence 64 (LGWR switch)
  Current log# 1 seq# 64 mem# 0: /u01/app/oracle/oradata/db11/redo01.log
Thread 1 advanced to log sequence 65 (LGWR switch)
  Current log# 2 seq# 65 mem# 0: /u01/app/oracle/oradata/db11/redo02.log
Thread 1 advanced to log sequence 66 (LGWR switch)
  Current log# 3 seq# 66 mem# 0: /u01/app/oracle/oradata/db11/redo03.log

 

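This works because the database processes still hold an open file handle on the deleted file and have not released it, as shown below: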

$ ps -ef|grep ckpt|grep -v grep
ora11     4616     1  0 07:51 ?        00:00:00 ora_ckpt_db11
$ cd /proc/4616/fd
$ ls -ltr |grep control
lrwx------ 1 ora11 oinstall 64 May 14 07:55 257 -> /u01/app/oracle/oradata/db11/control02.ctl
lrwx------ 1 ora11 oinstall 64 May 14 07:55 256 -> /u01/app/oracle/oradata/db11/control01.ctl (deleted)
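
The same behavior can be reproduced without Oracle. The following is a minimal Python sketch, not part of the original test: the scratch file and the function name `demo_unlink_while_open` are made up for illustration. It assumes a local POSIX filesystem, where a file that is deleted while still open remains readable and writable through its descriptor.

```python
import os
import tempfile


def demo_unlink_while_open():
    """Delete a file while it is open and keep using its descriptor."""
    # Hypothetical scratch file standing in for control01.ctl.
    fd, path = tempfile.mkstemp(suffix=".ctl")
    try:
        os.write(fd, b"before delete\n")
        os.unlink(path)  # same effect as `rm control01.ctl`

        # The directory entry is gone, but the inode survives while fd is open.
        exists_on_disk = os.path.exists(path)
        link_count = os.fstat(fd).st_nlink  # number of remaining names

        # Writes and reads through the descriptor still succeed.
        os.write(fd, b"after delete\n")
        data = os.pread(fd, 1024, 0)
        return exists_on_disk, link_count, data
    finally:
        os.close(fd)  # only now can the kernel free the disk space


if __name__ == "__main__":
    print(demo_unlink_while_open())
```

The space is only released once the last descriptor is closed, which is why the problem in the database case would typically surface only after the processes holding the file exit.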

 

#### Session 1: trace the process

$ strace -fr -o /tmp/4616.log -p 4616
Process 4616 attached - interrupt to quit
strace stays attached and the terminal appears to hang at this point until the trace is interrupted.

 

#### Session 2: switch the redo logs

SQL> alter system switch logfile;
SQL> alter system switch logfile;

The log switches complete normally:
Sat May 14 07:58:33 2016
Thread 1 advanced to log sequence 67 (LGWR switch)
  Current log# 1 seq# 67 mem# 0: /u01/app/oracle/oradata/db11/redo01.log
Thread 1 advanced to log sequence 68 (LGWR switch)
  Current log# 2 seq# 68 mem# 0: /u01/app/oracle/oradata/db11/redo02.log

 

#### Stop the trace in session 1 (Ctrl+C)

$ strace -fr -o /tmp/4616.log -p 4616
Process 4616 attached - interrupt to quit

Process 4616 detached

 

#### Examine the log produced by session 1: /tmp/4616.log

...
4616       0.000036 gettimeofday({1463183881, 895560}, NULL) = 0
4616       0.000035 pwrite(256, "\25\302\0\0\3\0\0\0\0\0\0\0\0\0\1\4\214C\0\0\2\0\0\0\0\0\0\0\32\0\0\0"..., 16384, 49152) = 16384
4616       0.040894 gettimeofday({1463183881, 936492}, NULL) = 0
4616       0.000044 gettimeofday({1463183881, 936533}, NULL) = 0
4616       0.000079 pwrite(257, "\25\302\0\0\3\0\0\0\0\0\0\0\0\0\1\4\214C\0\0\2\0\0\0\0\0\0\0\32\0\0\0"..., 16384, 49152) = 16384
4616       0.003029 gettimeofday({1463183881, 939643}, NULL) = 0
4616       0.000042 gettimeofday({1463183881, 939697}, NULL) = 0
4616       0.000057 gettimeofday({1463183881, 939740}, NULL) = 0
4616       0.000071 gettimeofday({1463183881, 939815}, NULL) = 0
4616       0.000076 gettimeofday({1463183881, 939888}, NULL) = 0
4616       0.000035 gettimeofday({1463183881, 939922}, NULL) = 0
4616       0.000038 pread(256, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\212\343\0\0\0\0\0\0\0\4 \v~\227\300U"..., 16384, 16384) = 16384
...

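In this output:

4616 is the ID of the traced process (printed because of the -f option).

The second column is a relative timestamp (printed because of the -r option): the time elapsed since the start of the previous system call, e.g. 0.000036 seconds.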
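
To summarize a longer trace, the columns described above can be parsed mechanically. The following Python sketch is only an illustration: the regular expression, `parse_strace_line`, and `count_calls_per_fd` are assumptions written for the line layout shown in this log, and they do not cover every form of strace output (unfinished or resumed calls, for example). The sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# Layout assumed from the log above: pid, relative time, call(args) = return.
LINE_RE = re.compile(
    r"^(?P<pid>\d+)\s+(?P<delta>\d+\.\d+)\s+"
    r"(?P<call>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>-?\d+)"
)

SAMPLE = [
    r'4616       0.000035 pwrite(256, "\25\302\0\0\3\0\0\0\0\0\0\0\0\0\1\4\214C\0\0\2\0\0\0\0\0\0\0\32\0\0\0"..., 16384, 49152) = 16384',
    r'4616       0.040894 gettimeofday({1463183881, 936492}, NULL) = 0',
    r'4616       0.000079 pwrite(257, "\25\302\0\0\3\0\0\0\0\0\0\0\0\0\1\4\214C\0\0\2\0\0\0\0\0\0\0\32\0\0\0"..., 16384, 49152) = 16384',
]


def parse_strace_line(line):
    """Split one trace line into its fields, or return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "delta": float(m.group("delta")),  # seconds since the previous call
        "call": m.group("call"),
        "first_arg": m.group("args").split(",", 1)[0].strip(),
        "ret": int(m.group("ret")),
    }


def count_calls_per_fd(lines, call="pwrite"):
    """Count how often `call` was issued against each file descriptor."""
    counts = Counter()
    for line in lines:
        rec = parse_strace_line(line)
        if rec and rec["call"] == call and rec["first_arg"].isdigit():
            counts[int(rec["first_arg"])] += 1
    return counts


if __name__ == "__main__":
    print(count_calls_per_fd(SAMPLE))
```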

 

Now look at this line:

pread(256, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\212\343\0\0\0\0\0\0\0\4 \v~\227\300U"..., 16384, 16384) = 16384

256 is the file descriptor, which maps to the deleted control01.ctl:

$ ls -ltr |grep control
lrwx------ 1 ora11 oinstall 64 May 14 07:55 257 -> /u01/app/oracle/oradata/db11/control02.ctl
lrwx------ 1 ora11 oinstall 64 May 14 07:55 256 -> /u01/app/oracle/oradata/db11/control01.ctl (deleted)

 

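In this pread call, the first 16384 is the number of bytes requested (one 16384-byte block), the second 16384 is the byte offset within the file, and the third 16384 (after the = sign) is the return value, i.e. the number of bytes actually read.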
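
The argument order is easier to see in a small experiment. The following Python sketch is an illustration only: `write_and_read_block` and the scratch file are made up, and the 16384-byte block size is taken from the trace as an assumption. `os.pwrite` and `os.pread` are thin wrappers around the same positional I/O calls.

```python
import os
import tempfile

BLOCK_SIZE = 16384  # block size assumed from the trace


def write_and_read_block(block_no, payload):
    """Write one block at block_no * BLOCK_SIZE and read it back."""
    # Pad (or cut) the payload to exactly one block.
    block = payload.ljust(BLOCK_SIZE, b"\0")[:BLOCK_SIZE]
    offset = block_no * BLOCK_SIZE
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        # C signature: pwrite(fd, buf, count, offset) -> bytes written
        written = os.pwrite(fd, block, offset)
        # C signature: pread(fd, buf, count, offset) -> bytes read
        data = os.pread(fd, BLOCK_SIZE, offset)
        return written, len(data), offset


if __name__ == "__main__":
    # Block 3 lands at offset 49152, the offset of the pwrite calls in the trace.
    print(write_and_read_block(3, b"example"))
```

Neither call moves the file position, so a process can address any block directly by its offset without a separate seek.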

 

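What can we learn from the process trace above?

1. Process information is visible under /proc, for example /proc/4616/stat, and the open files are listed under /proc/4616/fd.

2. On Linux, the control file is read and written with the positional I/O system calls pread and pwrite (named pread64 and pwrite64 in the kernel and in some strace versions).

3. Each pwrite is issued against both file descriptors (256 and 257), so the same block is written to both control file copies, including the deleted one.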
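
The check done by hand with `ls -ltr | grep control` can also be scripted. The following Python sketch is an assumption-laden illustration rather than an Oracle tool: `deleted_open_files` is a made-up helper that relies on the Linux convention of appending " (deleted)" to the link target in /proc/<pid>/fd, and it needs permission to read that directory (same OS user as the process, or root).

```python
import os
import tempfile


def deleted_open_files(pid):
    """Return {fd: link target} for descriptors whose file has been deleted."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        if target.endswith(" (deleted)"):
            result[int(name)] = target
    return result


if __name__ == "__main__":
    # Self-test on the current process with a hypothetical scratch file.
    fd, path = tempfile.mkstemp(suffix=".ctl")
    os.unlink(path)
    print(deleted_open_files(os.getpid()))
    os.close(fd)
```

Pointed at the ckpt process from this test (PID 4616), such a helper would be expected to report descriptor 256.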