转发 oracle 19c db switchover hang住
关于数据库切换hang住问题的分析简要过程如下:
分析过程
1. 数据库无法open,检查alter log 发现有告警
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK!
2.检查数据库V$session 的等待事件和锁的情况
SQL> select event,sql_id,PROGRAM,BLOCKING_SESSION,p1,p2,p3 from V$session where sid=236;
EVENT SQL_ID PROGRAM BLOCKING_SESSION P1 P2 P3
-------------------------------------------------- ------------- ------------------------------------------------ ---------------- ---------- ---------- ----------
row cache lock 8vyjutx6hg3wh oracle@dbdb02 (P000) 392 3 0 3
SQL> select sql_text from V$sql where sql_id='8vyjutx6hg3wh';
SQL_TEXT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
update /*+ rule */ undo$ set name=:2,file#=:3,block#=:4,status$=:5,user#=:6,undosqn=:7,xactsqn=:8,scnbas=:9,scnwrp=:10,inst#=:11,ts#=:12,spare1=:13 where us#=:1
SQL> select event,sql_id,PROGRAM,BLOCKING_SESSION,p1,p2,p3 from V$session where sid=392;
EVENT SQL_ID PROGRAM BLOCKING_SESSION P1 P2 P3
-------------------------------------------------- ------------- ------------------------------------------------ ---------------- ---------- ---------- ----------
cursor: pin S wait on X 8vyjutx6hg3wh oracle@dbdb02 (P004) 236 1292341136 1.0136E+12 2.1475E+10
session 392 和236 在update undo$ 的时候互相阻塞形成死活 ,产生等待事件 row cache lock =》cursor: pin S wait on X dui 对应program=oracle@dbdb02 (P004) 为并行进程
3)该问题与BUG 30931981,Bug:30159581 的描述基本一致,两个Bug的最终修复都是在bug 31747989
Bug 30931981 - Open Reset Logs Hangs With 'row cache lock' and 'cursor: pin s wait for x' Waits (Doc ID 30931981.8)
Description
During update of undo$ in ktuscu(), kklmsle() tries to look up a dc_users
rowcache entry to fetch from user$. So kklmsle(), which is used to check
see if we have supplemental logging enabled on the table. But since this
lookup is for a data dictionary object, we should not be having
logging enabled.
REDISCOVERY INFORMATION:
Deadlock between parallel slaves which are performing ktuscu() due to
kklmsle() trying to look up a dc_users rowcache entry. This can manifest in
the alert log as the below trace,
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK!
Bug 30159581 - A DB open hangs after switchover due to a detected deadlock (Doc ID 30159581.8)
Description
This bug manifests as a hang during the opening of a database in read-write mode or during the switchover
of a standby instance to the role of a primary instance. We speed up the opening of the database by using
parallel slaves to perform first-pass transaction recovery. This phase detects all in-flight transactions from
the past and mark them as such for subsequent recovery. The hang occurs due to a deadlock between two
parallel recovery slaves on dictionary cache objects.
但是发现数据库已经有应用补丁30159581 ,并且也尝试应用补丁 30931981
Patch 30159581 : applied on Fri Jun 05 02:51:31 CST 2020
发现依然未能修复该问题。
根据MOS文档
Alter Database Open" hangs with "WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK" (Doc ID 2886874.1)
Alter New Primary Open Hang After Switchover (Doc ID 2663273.1)
Note that this fix can cause / expose the problem described in Bug:31747989
Note that this fix has been superseded by the fix in Bug:31747989
存在补丁不能完全修复的情况,可能会触发bug 31747989中类似的问题, 两个Bug的最终修复都是在bug 31747989
总结:
数据库open阶段对undo$的update的并行进程产生死锁,最终触发bug 31747989
导致数据库hang住无法打开
建议:
应用补丁31747989 代替原补丁30159581 来修复这个Bug.
or
如果发生在没有应用补丁数据库,可以在数据库open的时设置参数setting _min_undosegs_for_parallel_fptr = 0 来避免这个bug