连接ASM实例后出现ORA-1012错误1.2.3

ASM还真是让人不是很踏实,今天又碰到一个和ASM有关的问题。

今天RAC的测试数据库在导入数据时长时间没有响应。通过V$SESSION_WAIT检查发现,系统在等待归档:

SQL> SELECT SID, EVENT FROM V$SESSION_WAIT WHERE SID = 132;

SID EVENT
---------- ----------------------------------------------------------------
132 log file switch (archiving needed)

数据库采用ASM,由于是测试环境,没有给ASM太大的空间,而且为了方便,将归档日志也放在ASM中了。

产生问题的原因很明显了,由于归档日志不断产生,导致ASM空间用完,因此新的归档无法产生,导致数据库处于等待状态。

本来认为是个小问题,于是通过RMAN将所有的归档日志删除。奇怪的是,成功的删除所有的归档日志后,这个问题仍然没有解决。

SQL> SELECT SID, EVENT FROM V$SESSION_WAIT WHERE SID = 132;

SID EVENT
---------- ----------------------------------------------------------------
132 log file switch (archiving needed)

再次检查等待会话,发现仍然在等待归档,怀疑Oracle处于两次重试之间的间歇期,于是手工切换当前实例的日志去激活归档操作:

SQL> alter system switch logfile;

结果这个操作也一直处于等待中。

难道是ASM中的空间并没有释放?于是在当前节点(节点1racnode1)启动dbca,查看ASM的配置,结果居然出现下面的信息:

DBCA could not startup the ASM instance configured on this node. To proceed with ASM diskgroup management you need the ASM instance to be up and running. Do you want to recreate the ASM instance on the node?

奇怪RAC环境一直是在ASM上运行的,怎么会出现这个问题呢?难道ASM实例没有启动:

SQL> host
$ ps -ef|grep ASM
oracle 8205 1 0 12:31:18 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 9734 1 0 12:32:17 ? 0:01 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11007 1 0 12:33:11 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4157 1 0 Apr 13 ? 0:02 asm_lck0_+ASM1
oracle 4077 1 0 Apr 13 ? 0:02 asm_lmon_+ASM1
oracle 4087 1 0 Apr 13 ? 0:03 asm_dbw0_+ASM1
oracle 4079 1 0 Apr 13 ? 0:08 asm_lmd0_+ASM1
oracle 4124 1 0 Apr 13 ? 0:00 asm_rbal_+ASM1
oracle 11098 1 0 12:33:14 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 20253 1 0 11:59:39 ? 2:50 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4578 1 0 Apr 13 ? 0:00 asm_o000_+ASM1
oracle 11131 1 0 12:33:16 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 12954 1 0 12:14:52 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11164 1 0 12:33:17 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4073 1 0 Apr 13 ? 0:01 asm_diag_+ASM1
oracle 7028 1 0 12:10:57 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 9349 1 0 12:32:02 ? 0:09 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11067 1 0 12:33:13 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4085 1 0 Apr 13 ? 0:00 asm_mman_+ASM1
oracle 12754 1 0 Apr 13 ? 0:01 /data/oracle/product/10.2/database/bin/racgimon daemon ora.racnode1.ASM1.asm
oracle 10915 10806 0 15:58:37 pts/4 0:00 grep ASM
oracle 4081 1 0 Apr 13 ? 0:02 asm_lms0_+ASM1
oracle 10773 1 0 12:33:00 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4126 1 0 Apr 13 ? 0:24 asm_gmon_+ASM1
oracle 23205 1 0 15:01:11 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 16362 1 0 12:17:02 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 1075 1 0 12:26:38 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4101 1 0 Apr 13 ? 0:00 asm_lgwr_+ASM1
oracle 4122 1 0 Apr 13 ? 0:00 asm_smon_+ASM1
oracle 11195 1 0 12:33:18 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4120 1 0 Apr 13 ? 0:00 asm_ckpt_+ASM1
oracle 23017 1 0 12:21:22 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 22951 1 0 12:01:32 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4071 1 0 Apr 13 ? 0:00 asm_pmon_+ASM1
oracle 20259 1 2 11:59:40 ? 21:03 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4917 1 1 Apr 13 ? 14:01 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11162 1 0 12:33:17 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 19873 1 0 12:37:51 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11129 1 0 12:33:15 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4075 1 0 Apr 13 ? 0:00 asm_psp0_+ASM1

显然ASM实例已经启动,那么怎么会出现这个问题呢,于是尝试直接连接到ASM实例上:

SQL> exit

 Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options 
断开

$ bash
bash-2.03$ export ORACLE_SID=+ASM1
bash-2.03$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 4 16 16:02:42 2007

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

已连接。
SQL> SELECT * FROM V$ASM_DISKGROUP;
SELECT * FROM V$ASM_DISKGROUP
*
 1 行出现错误:
ORA-01012: not logged on


SQL>

看来是ASM出现了问题。

在节点2上尝试用dbca管理ASM,结果发现ASM尚有40G的剩余空间,这是归档删除后的剩余空间。

而且节点2上直接登陆ASM实例:

$ bash
bash-2.03$ export ORACLE_SID=+ASM2
bash-2.03$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 4 16 16:08:33 2007

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

连接到
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> SET PAGES 100 LINES 120
SQL> COL NAME FORMAT A30
SQL> SELECT GROUP_NUMBER, NAME, TOTAL_MB, FREE_MB FROM V$ASM_DISKGROUP;

GROUP_NUMBER NAME TOTAL_MB FREE_MB
------------ ------------------------------ ---------- ----------
1 DISK 110934 40797

SQL>

看来是节点1上的ASM出现了故障,导致Oracle的实例1无法正常工作。

尝试在节点2上执行日志切换:

SQL> EXIT

 Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options 
断开

bash-2.03$ exit
exit
$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 4 16 16:11:45 2007

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

连接到
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> SELECT INSTANCE_NAME FROM V$INSTANCE;

INSTANCE_NAME
----------------
testrac2

SQL> ALTER SYSTEM SWITCH LOGFILE;

系统已更改。

果然可以,看来问题完全定位到节点1ASM实例上了。怀疑是空间用完导致ASM实例出现了问题。

将节点1上的ASM实例重启就应该可以解决问题,但是我不想损失已经进行了一半的工作,于是尝试通过其他的方法解决无法归档的问题,待工作完成后再重启实例。

登陆节点1上的实例:

$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 4 16 16:15:11 2007

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

连接到
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> SET PAGES 100 LINES 120
SQL> SELECT INSTANCE_NAME FROM V$INSTANCE;

INSTANCE_NAME
----------------
testrac1

SQL> SHOW PARAMETER LOG_ARCHIVE

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
log_archive_config string
log_archive_dest string
log_archive_dest_1 string LOCATION=+DISK/testrac/
log_archive_dest_10 string
log_archive_dest_2 string
log_archive_dest_3 string
log_archive_dest_4 string
log_archive_dest_5 string
log_archive_dest_6 string
log_archive_dest_7 string
log_archive_dest_8 string
log_archive_dest_9 string
log_archive_dest_state_1 string enable
log_archive_dest_state_10 string enable
log_archive_dest_state_2 string enable
log_archive_dest_state_3 string enable
log_archive_dest_state_4 string enable
log_archive_dest_state_5 string enable
log_archive_dest_state_6 string enable
log_archive_dest_state_7 string enable
log_archive_dest_state_8 string enable
log_archive_dest_state_9 string enable
log_archive_duplex_dest string
log_archive_format string %t_%s_%r.dbf
log_archive_local_first boolean TRUE
log_archive_max_processes integer 2
log_archive_min_succeed_dest integer 1
log_archive_start boolean FALSE
log_archive_trace integer 0

观察归档的相关信息,虽然无法不重启数据库的情况下改变归档模式,

不过由于LOG_ARCHIVE_MIN_SUCCEED_DEST参数设置为1

因此可以通过添加第二归档LOCATION的方式来解决位置1无法使用的问题。

由于实例2没有归档的问题,可以仅修改当前实例的参数:

SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 = 'LOCATION=/data/oracle/oradata/testrac' SCOPE = MEMORY SID = 'testrac1';

系统已更改。

修改后,发现刚才一直处于等待状态的ALTER SYSTEM SWITCH LOGFILE命令已经执行完毕,且一直等待归档的会话也改变了等待事件:

SQL> alter system switch logfile;

系统已更改。

SQL> SELECT SID, EVENT FROM V$SESSION_WAIT WHERE SID = 132;

SID EVENT
---------- ----------------------------------------------------------------
132 SQL*Net more data from client

至此,问题基本上解决了,只需要等到导入工作完成后,手工重启节点1的数据库和ASM实例就可以了。

问题出现的时候只是通过V$SESSION_WAIT和数据库的错误信息进行了分析,事后检查ASM和实例1alert文件,再次确定了ASM的问题:

Mon Apr 16 12:01:33 2007
WARNING: allocation failure on disk DISK_0019 for file 392 xnum 2147483648
Mon Apr 16 12:10:58 2007
WARNING: allocation failure on disk DISK_0006 for file 396 xnum 163
Mon Apr 16 12:14:52 2007
WARNING: allocation failure on disk DISK_0004 for file 396 xnum 2147483648
Mon Apr 16 12:17:02 2007
WARNING: allocation failure on disk DISK_0004 for file 397 xnum 2147483648
Mon Apr 16 12:21:22 2007
WARNING: allocation failure on disk DISK_0002 for file 399 xnum 2147483648
Mon Apr 16 12:26:39 2007
WARNING: allocation failure on disk DISK_0038 for file 400 xnum 445
Mon Apr 16 12:31:18 2007
WARNING: allocation failure on disk DISK_0009 for file 400 xnum 2147483648
Mon Apr 16 12:33:00 2007
WARNING: allocation failure on disk DISK_0015 for file 400 xnum 2147483648
Mon Apr 16 12:33:11 2007
WARNING: allocation failure on disk DISK_0038 for file 400 xnum 445
Mon Apr 16 12:33:14 2007
WARNING: allocation failure on disk DISK_0038 for file 400 xnum 445
Mon Apr 16 12:33:15 2007
WARNING: allocation failure on disk DISK_0003 for file 400 xnum 2147483648
Mon Apr 16 12:33:16 2007
WARNING: allocation failure on disk DISK_0017 for file 400 xnum 2147483648
Mon Apr 16 12:33:17 2007
WARNING: allocation failure on disk DISK_0038 for file 400 xnum 445
Mon Apr 16 12:33:17 2007
WARNING: allocation failure on disk DISK_0004 for file 400 xnum 2147483648
Mon Apr 16 12:33:18 2007
WARNING: allocation failure on disk DISK_0038 for file 400 xnum 445
Mon Apr 16 12:33:19 2007
WARNING: allocation failure on disk DISK_0038 for file 400 xnum 445
Mon Apr 16 12:37:51 2007
WARNING: allocation failure on disk DISK_0019 for file 400 xnum 2147483648
Mon Apr 16 15:01:12 2007
WARNING: allocation failure on disk DISK_0006 for file 400 xnum 2147483648
Mon Apr 16 15:23:23 2007
Starting ORACLE instance (normal)
Mon Apr 16 15:26:05 2007
Starting ORACLE instance (normal)
Mon Apr 16 15:30:53 2007
Process PZ99 died, see its trace file
Process PZ99 died, see its trace file
Mon Apr 16 15:31:11 2007
Process PZ99 died, see its trace file
Process PZ99 died, see its trace file
Mon Apr 16 15:58:34 2007
Starting ORACLE instance (normal)

上面是ASM的错误信息,指出了ASM分配空间失败,而下面的数据库实例错误信息:

ORACLE Instance testrac1 - Archival Error
Mon Apr 16 12:33:24 2007
ORA-16014: log 1 sequence# 85 not archived, no available destinations
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Mon Apr 16 12:33:24 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc1_5191.trc:
ORA-16014: log 1 sequence# 85 not archived, no available destinations
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'

则说明ASM的问题导致了归档的错误。

幸好出现问题的是测试库,重启对于测试数据库来说不算问题,但是重启对于产品库来说就不是小问题了,即使这个产品库是RAC环境。

从这个问题上看,感觉ASMbug还是多了一点,这个问题目前还找不到不重启ASM实例就可以解决问题的方法,而ASM给人的感觉太不透明了。

BTW:由于再节点1和节点2上多次尝试启动dbca来管理asm,还碰到了ORA-20错误,对于这个问题metalink有专门的文档描述:Doc ID: Note:308376.1Subject: Dbca Fails To Create Asm Disk Group, then upon restart of ASM instance errors


连接ASM实例后出现ORA-1012错误(二)

前一阵碰到了一个ASM相关的bug,今天又碰到了一次。

问题的现象和上次比较基本一致,只是这次发生故障的节点和上次不一样。

之所以要续写这篇文章,是因为基本上确定了这次产生这个问题的原因。发现问题的时候,状况和上次基本上完全一样。先是归档到ASM上报错,我开始认为是ASM空间已经不足,打算用DBCA登陆检查一下ASM的使用情况,结果报错信息和上面完全一样。从后台ASMalert文件中,也找到了大量的告警信息:WARNING: allocation failure on disk DISK_0000 for file 384 xnum 34

根据上次的经验,我登陆了ASM实例,结果意外的发现了另一个错误:

ORA-00020: 超出最大进程数 (%s)

这个错误是上次没有发现的,莫非进程数已经超过了数据库的限制。

$ ps -ef|grep ASM|grep -v grep|wc -l
40

检查ASM实例启动日志,发现PROCESSES参数是默认值启动的。

没想到过了一段时间,发现居然可以正常的登陆ASM实例,并运行SQL语句,查询PROCESSES初始化参数,发现数据库设置果然是40

SQL> show parameter processes

NAME TYPE VALUE
------------------------------------ ---------------------- -----------------
aq_tm_processes integer 0
db_writer_processes integer 1
gcs_server_processes integer 1
job_queue_processes integer 0
log_archive_max_processes integer 2
processes integer 40

当可以登陆ASM实例时,检查后台ASM启动的进程数,发现已经减少到40以下:

$ ps -ef|grep ASM|grep -v grep|wc -l
39

从上面的现象至少可以断定,本次出现的问题是由于进程数超过初始化参数的设置导致的。

可以在ASM实例的启动初始化参数中给processes参数设置一个较大的值。避免这种情况的产生。


连接ASM实例后出现ORA-1012错误(三) 


继续讨论这个问题的产生原因。

上面一篇文章分析了导致ORA-1012错误的原因,是由于连接ASM实例的进程数超过了ASM初始化参数的设置,从而导致了问题的产生。

这篇文章进一步分析,是什么导致了ASM产生了大量的连接。

首先通过操作系统命令检查:

$ ps -ef|grep ASM|grep -v grep
oracle 17061 1 0 May 14 ? 0:00 asm_psp0_+ASM1
oracle 17059 1 0 May 14 ? 0:05 asm_diag_+ASM1
oracle 17108 1 0 May 14 ? 0:00 asm_smon_+ASM1
oracle 2556 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17057 1 0 May 14 ? 0:03 asm_pmon_+ASM1
oracle 17063 1 0 May 14 ? 0:11 asm_lmon_+ASM1
oracle 17110 1 0 May 14 ? 0:00 asm_rbal_+ASM1
oracle 17067 1 0 May 14 ? 0:08 asm_lms0_+ASM1
oracle 17079 1 0 May 14 ? 0:00 asm_mman_+ASM1
oracle 17104 1 0 May 14 ? 0:01 asm_lgwr_+ASM1
oracle 17420 1 0 May 14 ? 0:12 /data/oracle/product/10.2/database/bin/racgimon daemon ora.racnode1.ASM1.asm
oracle 28629 1 0 Jun 08 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 19622 1 0 Jun 01 ? 0:05 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17065 1 0 May 14 ? 0:19 asm_lmd0_+ASM1
oracle 17112 1 0 May 14 ? 2:10 asm_gmon_+ASM1
oracle 17143 1 0 May 14 ? 0:20 asm_lck0_+ASM1
oracle 1204 1 0 May 31 ? 0:00 asm_o001_+ASM1
oracle 8710 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11907 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 29561 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 23342 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17098 1 0 May 14 ? 0:01 asm_dbw0_+ASM1
oracle 23469 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17106 1 0 May 14 ? 0:01 asm_ckpt_+ASM1
oracle 12134 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 24374 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 28631 1 0 Jun 08 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 22777 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 12888 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 15334 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4977 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 26859 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 9633 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 28634 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 23136 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 736 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 6678 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11080 11079 0 17:54:43 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))


$ ps -ef|grep ASM|grep -v grep|wc -l
39

仅仅是一个测试库,且只有我一个人在测试RMAN的备份,怎么会有这么多的会话呢,登陆ASM实例进行检查:


bash-2.03$ export ORACLE_SID=+ASM1
bash-2.03$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 6 11 18:21:23 2007

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

连接到
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> select count(*) from v$session;

COUNT(*)
----------
38


SQL> select count(*) from v$process;

COUNT(*)
----------
39


SQL> select distinct process from v$session where username is not null;

PROCESS
------------------------
11234
19620
19969
19995
23709
23768

已选择6行。


SQL> select process, count(*) from v$session where username is not null group by process;

PROCESS COUNT(*)
------------------------ ----------
11234 1
19620 1
19969 11
19995 8
23709 1
23768 1

已选择6行。


根据ASM实例上V$SESSION上的PROCESS列,找到连接到ASM实例的操作系统进程。

上面的查询列出了所有非ASM实例后台进程对应的操作系统进程。可以看到,绝大部分进程是由于1996919995两个进程造成的。

下面依次分析这些进程:

$ ps -ef|grep 11234|grep -v grep
oracle 11246 11234 0 18:21:23 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11234 8887 0 18:21:23 pts/1 0:00 sqlplus / as sysdba
$ ps -ef|grep 19620|grep -v grep
oracle 19620 1 0 Jun 01 ? 0:06 ora_asmb_testrac1

这两个进程一个是通过SQLPLUS连接到ASM实例的进程,也就是当前查询的后台进程,另一个是节点testrac1连接到ASM实例的进程。

$ ps -ef|grep 23709|grep -v grep
oracle 23709 1 0 Jun 07 ? 0:56 oracletestrac1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$ ps -ef|grep 23768|grep -v grep
oracle 23768 1 0 Jun 07 ? 0:39 oracletestrac1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))


$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 6 11 18:29:41 2007

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

连接到
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> set pages 100 lines 120
SQL> col program format a25 
SQL> col module format a25
SQL> col event format a30
SQL> col client_info format a20
SQL> select s.program, module, client_info, event from v$session s, v$process p where paddr = p.addr and spid in (23709, 23768);

PROGRAM MODULE CLIENT_INFO EVENT
------------------------- --------------------- ----------------- ----------------------
rman@racnode1 (TNS V1-V3) backup full datafile rman channel=c1 control file parallel write
rman@racnode1 (TNS V1-V3) backup full datafile rman channel=c2 control file parallel write

这两个会话是当时备份出错后,很久无法结束,通过CTRL+强制结束的,导致会话异常结束,Oracle这里并没有将资源完全释放。

不过这两个会话应该不是问题的关键,进程1996919995两个进程在ASM实例上占用了18个会话,下面看看这两个实例在做什么:


$ ps -ef|grep 19969|grep -v grep
oracle 19969 1 0 Jun 01 ? 0:48 ora_arc0_testrac1
$ ps -ef|grep 19995|grep -v grep
oracle 19995 1 0 Jun 01 ? 0:18 ora_arc1_testrac1


居然是Oracle的归档进程,回想最开始错误产生的时候,就是归档进程出现的问题,那么基本上确定,问题是由于归档无法创建新的文件导致问题的产生。

检查实例1上的alert文件,发现问题产生的时候出现大量下面的错误:

ORA-19504: failed to create file "+DISK/testrac/1_307_618591128.dbf"
ORA-17502: ksfdcre:4 Failed to create file +DISK/testrac/1_307_618591128.dbf
ORA-15041: diskgroup space exhausted
Sat Jun 9 10:08:04 2007
ARC1: Error 19504 Creating archive log file to '+DISK/testrac/1_307_618591128.dbf'
ARCH: Archival stopped, error occurred. Will continue retrying
Sat Jun 9 10:08:04 2007
ORACLE Instance testrac1 - Archival Error
Sat Jun 9 10:08:04 2007
ORA-16038: log 1 sequence# 307 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Sat Jun 9 10:08:04 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc1_19995.trc:
ORA-16038: log 1 sequence# 307 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Sat Jun 9 10:08:28 2007
ARCH: Archival stopped, error occurred. Will continue retrying
Sat Jun 9 10:08:28 2007
ORACLE Instance testrac1 - Archival Error
Sat Jun 9 10:08:28 2007
ORA-16014: log 1 sequence# 307 not archived, no available destinations
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Sat Jun 9 10:08:28 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc0_19969.trc:
ORA-16014: log 1 sequence# 307 not archived, no available destinations
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Sat Jun 9 10:13:28 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc1_19995.trc:
ORA-19504: failed to create file "+DISK/testrac/1_307_618591128.dbf"
ORA-17502: ksfdcre:4 Failed to create file +DISK/testrac/1_307_618591128.dbf
ORA-15041: diskgroup space exhausted

显然错误是从ASM磁盘组没有空间,导致归档失败开始的。检查失败归档的进程号,就是上面出现问题的1996919995。显然,Oracle在不断重试归档的过程中,出现了一些错误,导致归档进程连接到ASM实例的会话没有释放,进而导致了下面的错误:

Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc0_19969.trc:
ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '+DISK/testrac/onlinelog/group_2.260.618591151'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/onlinelog/group_2.260.618591151
ORA-00020: maximum number of processes () exceeded

Tue Jun 12 09:29:04 2007
Unexpected communication failure with ASM instance:
ORA-00020: maximum number of processes () exceeded
Unexpected communication failure with ASM instance:
ORA-00020: maximum number of processes () exceeded
Tue Jun 12 09:29:05 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc0_19969.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/onlinelog/group_1.259.618591145
ORA-00020: maximum number of processes () exceeded
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/onlinelog/group_1.258.618591139
ORA-00020: maximum number of processes () exceeded

解决问题的最好方法是重启数据库实例和ASM,如果暂时无法重启,可以使用第一篇文章给出的方法,另外建立一个归档目的地。

或者采用下面的方法清楚异常的会话和进程:

SQL> select count(*) from v$session;

COUNT(*)
----------
38

SQL> 

select 'ALTER SYSTEM KILL SESSION ''' || SID || ', ' || SERIAL# || ''';' from v$session where process in (19995, 19969);


'ALTERSYSTEMKILLSESSION'''||SID||','||SERIAL#||''';'
--------------------------------------------------------------------------------
ALTER SYSTEM KILL SESSION '12, 4';
ALTER SYSTEM KILL SESSION '34, 475';
ALTER SYSTEM KILL SESSION '14, 1';
ALTER SYSTEM KILL SESSION '15, 1';
ALTER SYSTEM KILL SESSION '16, 1';
ALTER SYSTEM KILL SESSION '17, 1';
ALTER SYSTEM KILL SESSION '18, 3';
ALTER SYSTEM KILL SESSION '19, 1';
ALTER SYSTEM KILL SESSION '20, 1';
ALTER SYSTEM KILL SESSION '21, 1';
ALTER SYSTEM KILL SESSION '22, 1';
ALTER SYSTEM KILL SESSION '23, 1';
ALTER SYSTEM KILL SESSION '24, 1';
ALTER SYSTEM KILL SESSION '25, 2';
ALTER SYSTEM KILL SESSION '26, 1';
ALTER SYSTEM KILL SESSION '27, 4';
ALTER SYSTEM KILL SESSION '28, 148';
ALTER SYSTEM KILL SESSION '33, 7';
ALTER SYSTEM KILL SESSION '13, 1';

已选择19行。

SQL> ALTER SYSTEM KILL SESSION '12, 4';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '34, 475';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '14, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '15, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '16, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '17, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '18, 3';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '19, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '20, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '21, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '22, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '23, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '24, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '25, 2';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '26, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '27, 4';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '28, 148';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '33, 7';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '13, 1';

系统已更改。

并通过操作系统命令清楚刚才RMAN失败造成的两个后台进程:

bash-2.03$ kill -9 23709
bash-2.03$ kill -9 28631

无论是清除Oracle实例的后台进程23709,还是清除23768进程对应的ASM实例后台进程28631都可以达到相同的目的。

等待一段时间,Oracle自动清除所有的后台进程,数据库恢复正常。

bash-2.03$ ps -ef|grep ASM|grep -v grep|wc -l
18

 


 




posted on 2015-01-22 16:01  夜拿水果刀  阅读(947)  评论(0编辑  收藏  举报

导航