Oracle: Use ON COMMIT Materialized View Refresh with Caution (Part 1)
Background
Last Friday, 2020-08-07, just before the end of the workday, something happened: the customer reported that they could not log in to the xxxx platform. At first nobody realized the database was behind it.
The business side was guessing at this and that, wondering whether some region's IPs had been blocked again. After those possibilities were ruled out, they even considered restarting the Nginx server (not something I know much about),
but the person who maintains Nginx happened to be on leave.
As it turned out, about half an hour before the incident, the monitoring system had mailed out the following error.
At the time I assumed the archive log destination was simply getting full, so I just backed up the archived logs and cleaned out the surplus.
Investigation
After the problem surfaced, even though the database wasn't the main suspect, I still took a look at it.
The alert log was reporting that flash recovery area usage had exceeded its threshold. Something was wrong, definitely wrong.
Apart from the explosive growth in archived redo, and the fact that the archive destination of the recently built standby had also filled up, there were no other errors.
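To put numbers on the recovery area pressure, a quick check along these lines helps (a minimal sketch; v$recovery_file_dest and v$recovery_area_usage are the standard views):

```sql
-- Overall space limit vs. space used in the flash recovery area
SELECT name, space_limit, space_used, space_reclaimable, number_of_files
  FROM v$recovery_file_dest;

-- Breakdown by file type (archived logs, flashback logs, backups, ...)
SELECT file_type, percent_space_used, percent_space_reclaimable, number_of_files
  FROM v$recovery_area_usage;
```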
First, the log switch frequency: compared with the same time window on a normal day, archive generation was more than 100x higher,
which at one point made me suspect an application problem had caused DML activity to surge.
```sql
SELECT TO_CHAR(FIRST_TIME,'YYYY-MM-DD') DAY,
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'00',1,0)),'999') "00",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'01',1,0)),'999') "01",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'02',1,0)),'999') "02",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'03',1,0)),'999') "03",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'04',1,0)),'999') "04",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'05',1,0)),'999') "05",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'06',1,0)),'999') "06",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'07',1,0)),'999') "07",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'08',1,0)),'999') "08",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'09',1,0)),'999') "09",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'10',1,0)),'999') "10",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'11',1,0)),'999') "11",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'12',1,0)),'999') "12",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'13',1,0)),'999') "13",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'14',1,0)),'999') "14",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'15',1,0)),'999') "15",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'16',1,0)),'999') "16",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'17',1,0)),'999') "17",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'18',1,0)),'999') "18",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'19',1,0)),'999') "19",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'20',1,0)),'999') "20",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'21',1,0)),'999') "21",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'22',1,0)),'999') "22",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'23',1,0)),'999') "23"
  FROM V$LOG_HISTORY
 GROUP BY TO_CHAR(FIRST_TIME,'YYYY-MM-DD')
 ORDER BY 1 DESC;
```
At this point I needed to pin down which SQL statement or statements were behind it.
The following query lists the objects with heavy data block change activity:
```sql
SELECT TO_CHAR(BEGIN_INTERVAL_TIME, 'YYYY-MM-DD HH24') SNAP_TIME,
       DHSO.OBJECT_NAME,
       SUM(DB_BLOCK_CHANGES_DELTA) BLOCK_CHANGED
  FROM DBA_HIST_SEG_STAT DHSS,
       DBA_HIST_SEG_STAT_OBJ DHSO,
       DBA_HIST_SNAPSHOT DHS
 WHERE DHS.SNAP_ID = DHSS.SNAP_ID
   AND DHS.INSTANCE_NUMBER = DHSS.INSTANCE_NUMBER
   AND DHSS.OBJ# = DHSO.OBJ#
   AND DHSS.DATAOBJ# = DHSO.DATAOBJ#
   AND BEGIN_INTERVAL_TIME BETWEEN TO_DATE('2020-08-07 16:00', 'YYYY-MM-DD HH24:MI')
                               AND TO_DATE('2020-08-07 17:30', 'YYYY-MM-DD HH24:MI')
 GROUP BY TO_CHAR(BEGIN_INTERVAL_TIME, 'YYYY-MM-DD HH24'), DHSO.OBJECT_NAME
HAVING SUM(DB_BLOCK_CHANGES_DELTA) > 0
 ORDER BY SUM(DB_BLOCK_CHANGES_DELTA) ASC;
```
The key part of the output:
```
SNAP_TIME     |OBJECT_NAME                   |BLOCK_CHANGED
--------------|------------------------------|-------------
............. |..............                |       ......
2020-08-07 16 |..............                |       100480
2020-08-07 16 |..............                |       129184
2020-08-07 16 |..............                |       137024
2020-08-07 17 |..............                |       143520
2020-08-07 17 |..............                |       145440
2020-08-07 16 |..............                |       228624
2020-08-07 16 |..............                |       256704
2020-08-07 17 |..............                |       548336
2020-08-07 17 |..............                |     55019536
2020-08-07 17 |..............                |    103429408
2020-08-07 16 |T_XXXXXXXXXX_INFO             |    103997488
2020-08-07 16 |PK_X_XXXXXX_XXXX              |    195532752

210 rows selected.

Elapsed: 00:00:02.29
```
Within the 16:00 hour of 2020-08-07, the two objects with the most block changes were
T_XXXXXXXXXX_INFO
PK_X_XXXXXX_XXXX
PK_X_XXXXXX_XXXX was easy to look up: it is a primary key index, and the table it belongs to is X_XXXXXX_XXXX.
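That lookup is a simple dictionary query (a minimal sketch, using the masked names from above):

```sql
-- What kind of index is PK_X_XXXXXX_XXXX, and which table does it belong to?
SELECT owner, index_name, index_type, uniqueness, table_name
  FROM dba_indexes
 WHERE index_name = 'PK_X_XXXXXX_XXXX';

-- Cross-check that it backs a primary key constraint
SELECT owner, constraint_name, table_name
  FROM dba_constraints
 WHERE constraint_type = 'P'
   AND index_name = 'PK_X_XXXXXX_XXXX';
```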
I quickly pulled an AWR report for the problem window to look for SQL involving X_XXXXXX_XXXX and T_XXXXXXXXXX_INFO, and to check the wait events while I was at it.
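(For reference, the report came from the standard script shipped with the database, which prompts for the report format and the begin/end snapshot IDs:)

```sql
-- Run as a DBA in SQL*Plus; prompts for report format, days, and snap IDs
@?/rdbms/admin/awrrpt.sql
```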
One glance at the wait events showed a problem:
Event | Waits | Total Wait Time (sec) | Wait Avg (ms) | % DB time | Wait Class
---|---|---|---|---|---
enq: TX - row lock contention | 683 | 223.1K | 326667 | 39.0 | Application
gc buffer busy acquire | 10,048 | 163.8K | 16302 | 28.6 | Cluster
enq: JI - contention | 23,034 | 112.7K | 4892 | 19.7 | Other
buffer busy waits | 1,770 | 20.9K | 11784 | 3.6 | Concurrency
DB CPU | | 20K | | 3.5 |
enq: TX - index contention | 591 | 6630.2 | 11219 | 1.2 | Concurrency
log file switch (checkpoint incomplete) | 481 | 6268.1 | 13031 | 1.1 | Configuration
db file scattered read | 482,552 | 4536 | 9 | .8 | User I/O
gc current block busy | 10,845 | 3853.8 | 355 | .7 | Cluster
direct path read | 832,639 | 3170 | 4 | .6 | User I/O
This system normally shows nowhere near this level of TX lock waits. (In hindsight, the enq: JI - contention event, which serializes materialized view refreshes, was already a clue.) So next: which segment had the worst row lock waits?
Segments by Row Lock Waits

Owner | Tablespace Name | Object Name | Subobject Name | Obj. Type | Row Lock Waits | % of Capture
---|---|---|---|---|---|---
ABCD_EFG | ABCD | X_XXXXXX_XXXX | | TABLE | 681 | 53.71
ABCD_XXXXXXXXS | DYCK_FILETRANS | PK_XXXX_XXXXXXXX_TMP | | INDEX | 321 | 25.32
That is exactly the same object that showed the most block changes earlier. The block change section backs it up as well:
Segments by DB Blocks Changes

Owner | Tablespace Name | Object Name | Subobject Name | Obj. Type | DB Block Changes | % of Capture
---|---|---|---|---|---|---
XXXXX_DATA | USERS | PK_X_XXXXXX_XXXX | | INDEX | 161,129,664 | 64.72
XXXXX_DATA | USERS | T_XXXXXXXXXX_INFO | | TABLE | 85,726,272 | 34.43
Everything matches!
The next step was to search the AWR report for SQL touching the X_XXXXXX_XXXX table:
Elapsed Time (s) | Executions | Elapsed Time per Exec (s) | %Total | %CPU | %IO | SQL Id | SQL Module | SQL Text
---|---|---|---|---|---|---|---|---
226,451.96 | 1,075 | 210.65 | 39.58 | 0.01 | 0.00 | 95nn49n57dqza | JDBC Thin Client | update X_XXXXXX_XXXX set ID=:1...
166,564.13 | 1,368 | 121.76 | 29.12 | 0.14 | 0.00 | 08x3dymqbdbd4 | JDBC Thin Client | ......................
20,730.38 | 294 | 70.51 | 3.62 | 0.01 | 0.03 | 4vs91dcv7u1p6 | JDBC Thin Client | ......................
5,395.58 | 0 | | 0.94 | 5.25 | 94.88 | c7sa42jj690jw | | ......................
5,395.54 | 7 | 770.79 | 0.94 | 5.25 | 94.88 | 1hsz4n5g5dz1j | | ......................
5,205.90 | 38 | 137.00 | 0.91 | 40.58 | 58.45 | 66u44fa687q74 | xxxxxxxxx.exe | ......................
4,619.25 | 17,384 | 0.27 | 0.81 | 0.05 | 0.08 | atwcm5hwwbtj3 | xxxxxxxxxxxxxxxxxx.exe | ......................
4,231.15 | 1 | 4,231.15 | 0.74 | 99.72 | 0.09 | 2d0r4v0y38vm1 | xxxxxxxxxxx.exe | ......................
4,049.48 | 47,215 | 0.09 | 0.71 | 0.14 | 0.20 | cqfs0h217uuc3 | xxxxxxxxxxxxxxxxxx.exe | ......................
2,707.50 | 1 | 2,707.50 | 0.47 | 99.87 | 0.10 | gjj8zbpgta9rv | xxxx.exe | ......................
Then check the execution history of 95nn49n57dqza:
```sql
set lines 150 pages 150
col BEGIN_INTERVAL_TIME for a23
col PLAN_HASH_VALUE for 9999999999
col date_time for a30
col snap_id heading 'SnapId'
col executions_delta heading "No. of exec"
col sql_profile heading "SQL|Profile" for a7
col date_time heading 'Date time'
col avg_lio heading 'LIO/exec' for 99999999999.99
col avg_cputime heading 'CPUTIM/exec' for 9999999.99
col avg_etime heading 'ETIME/exec' for 9999999.99
col avg_pio heading 'PIO/exec' for 9999999.99
col avg_row heading 'ROWs/exec' for 9999999.99
SELECT distinct s.snap_id,
       PLAN_HASH_VALUE,
       to_char(s.BEGIN_INTERVAL_TIME,'yyyy-mm-dd hh24:mi:ss')||
       to_char(s.END_INTERVAL_TIME,'-hh24:mi:ss') Date_Time,
       SQL.executions_delta,
       SQL.buffer_gets_delta/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_lio,
       --SQL.ccwait_delta,
       (SQL.cpu_time_delta/1000000)/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_cputime,
       (SQL.elapsed_time_delta/1000000)/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_etime,
       SQL.DISK_READS_DELTA/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_pio,
       SQL.rows_processed_total/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_row
       --,SQL.sql_profile
  FROM dba_hist_sqlstat SQL, dba_hist_snapshot s
 WHERE SQL.instance_number = (select instance_number from v$instance)
   and SQL.dbid = (select dbid from v$database)
   and s.snap_id = SQL.snap_id
   AND sql_id in ('&SQLID')
 order by s.snap_id
/
```
The result:
```
    SnapId|PLAN_HASH_VALUE|Date time                     |No. of exec|       LIO/exec|CPUTIM/exec| ETIME/exec|   PIO/exec|  ROWs/exec
----------|---------------|------------------------------|-----------|---------------|-----------|-----------|-----------|-----------
     16581|     2297901584|2020-08-07 13:30:02-14:00:09  |        370|           4.04|        .00|        .00|        .00|    5547.76
     16583|     2297901584|2020-08-07 14:30:16-15:00:24  |        563|           4.02|        .00|        .00|        .00|    3648.02
     16584|     2297901584|2020-08-07 15:00:24-15:30:31  |        567|           4.03|        .00|        .00|        .00|    3623.28
     16585|     2297901584|2020-08-07 15:30:31-16:00:39  |        442|           4.05|        .00|        .00|        .00|    4648.97
     16586|     2297901584|2020-08-07 16:00:39-16:30:47  |        429|           4.28|        .00|        .59|        .01|    4790.84
     16587|     2297901584|2020-08-07 16:30:47-17:01:31  |        366|           5.72|        .06|     169.81|        .07|    5616.50
     16587|     2297901584|2020-08-07 16:30:47-17:07:50  |        366|           5.72|        .06|     169.81|        .07|    5616.50
     16588|     2297901584|2020-08-07 17:01:31-17:30:32  |        278|          15.70|        .03|     423.38|        .15|    7395.38
     16588|     2297901584|2020-08-07 17:07:50-17:30:32  |        278|          15.70|        .03|     423.38|        .15|    7395.38
     16589|     2297901584|2020-08-07 17:30:32-18:00:41  |        369|          31.79|        .07|    1180.75|        .00|    5572.59
     16590|     2297901584|2020-08-07 18:00:41-18:30:51  |        535|          44.85|        .06|     909.64|        .00|    3844.53
     16591|     2297901584|2020-08-07 18:30:51-19:00:00  |        201|          55.29|        .01|     192.97|        .00|   10233.94
     16592|     2297901584|2020-08-07 19:00:00-19:30:08  |         26|           4.69|        .00|       1.26|        .00|   79117.23
```
Records exist from about 13:00 on August 7 onward. From roughly 16:30, the SQL's elapsed time per execution climbed sharply, yet the execution plan (the same PLAN_HASH_VALUE throughout) never changed,
and the execution counts show no particularly dramatic swings either.
At this point there seemed to be no good way for me to dig further on my own. What could be established: the problem statement was 95nn49n57dqza, whose updates were generating massive TX lock waits, which is why the xxxx platform either refused logins or took ages to bring up a page.
And the TX contention was severe simply because a single update had gone from under 1 second to hundreds or even thousands of seconds.
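Had the contention still been live, the blockers could also have been read straight out of the session views (a minimal sketch, not captured at the time; GV$SESSION because this is a RAC database):

```sql
-- Who is currently waiting on row locks, and which session is blocking them
SELECT inst_id, sid, serial#, sql_id,
       blocking_instance, blocking_session, seconds_in_wait
  FROM gv$session
 WHERE event = 'enq: TX - row lock contention';
```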
There is a gap in the investigation here: between spotting the severe TX waits in the AWR report and pinning down the SQL, I was so narrowly focused that I overlooked part of the earlier block change result. Besides PK_X_XXXXXX_XXXX there was in fact a second object, T_XXXXXXXXXX_INFO.
Setting T_XXXXXXXXXX_INFO aside and following the original thread, the dictionary and AWR views alone seemed to lead nowhere further; the remaining options were a systemstate dump (ssd) at the database level, or a 10046 trace.
I'd recommend the 10046 trace. Working backwards from the eventual root cause, a 10046 trace would have shown that besides executing 95nn49n57dqza, the sessions were also inserting into the T_XXXXXXXXXX_INFO table.
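Enabling it on a suspect session is straightforward (a minimal sketch; the SID and serial# here are placeholders):

```sql
-- Level-12-equivalent 10046 trace: capture both waits and bind values
EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 1234, serial_num => 5678, waits => TRUE, binds => TRUE);

-- ...let the workload run for a few minutes, then switch it off...
EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => 1234, serial_num => 5678);
```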
Because 95nn49n57dqza had been identified first, I went straight to the application team. They confirmed that around 16:00 they had created a materialized view named T_XXXXXXXXXX_INFO, whose base table is X_XXXXXX_XXXX. So the 10046 trace was never needed.
The creation statement used ON COMMIT refresh. After a quick look through the documentation, I was essentially certain this was the cause, and decided on the spot to drop the materialized view.
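(The refresh mode is also visible in the dictionary; a minimal sketch with the masked name — for an ON COMMIT materialized view, REFRESH_MODE shows COMMIT:)

```sql
SELECT owner, mview_name, refresh_mode, refresh_method, last_refresh_date
  FROM dba_mviews
 WHERE mview_name = 'T_XXXXXXXXXX_INFO';
```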
```
18:54:45 SYS@dyckdb1(2194)> drop MATERIALIZED VIEW MACRO_DATA.T_ENTERPRISE_INFO;
drop MATERIALIZED VIEW MACRO_DATA.T_ENTERPRISE_INFO
*
ERROR at line 1:
ORA-04021: timeout occurred while waiting to lock object

Elapsed: 00:00:49.05
```
```
-- Revoke the grant first, otherwise the materialized view cannot be dropped
18:57:45 SYS@dyckdb2(1269)> revoke select on dyck_pub.t_client_info from MACRO_DATA;
^Crevoke select on dyck_pub.t_client_info from MACRO_DATA
*
ERROR at line 1:
ORA-01013: user requested cancel of current operation

Elapsed: 00:02:49.57

19:02:04 SYS@dyckdb2(1269)> /

Revoke succeeded.

Elapsed: 00:00:09.58

19:02:46 SYS@dyckdb1(2194)> drop MATERIALIZED VIEW MACRO_DATA.T_ENTERPRISE_INFO;

Materialized view dropped.

Elapsed: 00:00:03.82
```
With the materialized view dropped, the problem was resolved.
For reference, here is another query, for finding the SQL that generated the most redo within a specified time window:
```sql
SELECT TO_CHAR(BEGIN_INTERVAL_TIME,'YYYY_MM_DD HH24') WHEN,
       DBMS_LOB.SUBSTR(SQL_TEXT,4000,1) SQL,
       DHSS.INSTANCE_NUMBER INST_ID,
       DHSS.SQL_ID,
       EXECUTIONS_DELTA EXEC_DELTA,
       ROWS_PROCESSED_DELTA ROWS_PROC_DELTA
  FROM DBA_HIST_SQLSTAT DHSS,
       DBA_HIST_SNAPSHOT DHS,
       DBA_HIST_SQLTEXT DHST
 WHERE UPPER(DHST.SQL_TEXT) LIKE '%T_CLIENT_INFO%'
   AND LTRIM(UPPER(DHST.SQL_TEXT)) NOT LIKE 'SELECT%'
   AND DHSS.SNAP_ID = DHS.SNAP_ID
   AND DHSS.INSTANCE_NUMBER = DHS.INSTANCE_NUMBER
   AND DHSS.SQL_ID = DHST.SQL_ID
   AND BEGIN_INTERVAL_TIME BETWEEN TO_DATE('2020-08-07 16:00','YYYY-MM-DD HH24:MI')
                               AND TO_DATE('2020-08-07 17:00','YYYY-MM-DD HH24:MI')
 ORDER BY EXEC_DELTA;
```
This query does surface both the update against the base table and the insert into the materialized view.
Next, we can simulate just how much an ON COMMIT materialized view affects updates, and verify whether the materialized view relationship is what drove the redo volume surge.
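For that follow-up, a test bed along these lines should do (a minimal sketch; all object names here are hypothetical): a base table, a materialized view log, a fast-refresh ON COMMIT materialized view, and a session-level redo measurement around each update + commit.

```sql
-- Base table and the log required for fast refresh
CREATE TABLE t_base (
  id  NUMBER PRIMARY KEY,
  val VARCHAR2(100)
);
CREATE MATERIALIZED VIEW LOG ON t_base WITH PRIMARY KEY, ROWID;

-- The suspect construct: refreshed synchronously as part of every commit
CREATE MATERIALIZED VIEW mv_t_base
  REFRESH FAST ON COMMIT
  AS SELECT id, val FROM t_base;

-- Session redo before the change
SELECT s.value AS redo_size
  FROM v$mystat s JOIN v$statname n ON n.statistic# = s.statistic#
 WHERE n.name = 'redo size';

UPDATE t_base SET val = 'x' WHERE id = 1;
COMMIT;

-- Session redo after: the delta is the cost of update + commit + mview refresh
SELECT s.value AS redo_size
  FROM v$mystat s JOIN v$statname n ON n.statistic# = s.statistic#
 WHERE n.name = 'redo size';
```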