reliable message
reliable message
Table of Contents
1 现象
同事反馈一个Sqlloader 进程加载数据特别慢,平常几分钟运行完的事情,这次跑了3个半小时还没跑完。 查询数据库会话,信息如下:
SID USER_NAME EVENT --- ---------- ------- 1962 STG reliable message
趁此机会,研究一下这个等待事件吧。
2 reliable message问题分析
2.1 事件说明
MOS 上对于reliable message的解释如下:
When a process sends a message using the 'KSR' intra-instance broadcast service, the message publisher waits on this wait-event until all subscribers have consumed the 'reliable message' just sent. The publisher waits on this wait-event for up to one second and then re-tests if all subscribers have consumed the message, or until posted. If the message is not fully consumed the wait recurs, repeating until either the message is consumed or until the waiter is interrupted.
说明此等待事件是发布消息方出现的等待。当消息队列中的消息没有被全部读取的时候,就会等待此事件。 经查阅文档得知,此等待事件,是针对各种channel的。不同的channel 针对不同的情况。也就有不同 的解决方法。而大部分是BUG,需要打补丁,或者升级至更高的版本.workaround,基本上是重启实例, 或者关闭相关的功能。
2.2 查看渠道
从gv$channel_waits 视图里查询问题最严重的 channel. 方法1,可以马上确定有问题的一个或者多个channel. 而方法2虽然也可以,但是略显麻烦。
-
方法1
SELECT CHANNEL, SUM(wait_count) sum_wait_count FROM GV$CHANNEL_WAITS GROUP BY CHANNEL ORDER BY SUM(wait_count) DESC;
查询示例:
CHANNEL SUM_WAIT_COUNT ---------------------------------------------------------------- -------------- Result Cache: Channel 15436686 RBR channel 9393 kxfp control signal channel 7357 MMON remote action broadcast channel 3070 obj broadcast channel 1731 service operations - broadcast channel 2 kill job broadcast - broadcast channel 2 parameters to cluster db instances - broadcast channel 2 quiesce channel 2
从上面查询结果,可以看到 "Result Cache: Channel", 是最有问题的channel.
-
方法2
select to_char(p1, 'XXXXXXXXXXXXXXXX') event_param, count(*), sum(time_waited/1000000) time_waited from gv$active_session_history where event = 'reliable message' group by to_char(p1, 'XXXXXXXXXXXXXXXX') order by time_waited*count(*) desc; -- 取出影响最大的内存地址 select name_ksrcdes from x$ksrcdes where indx in (select name_ksrcctx from x$ksrcctx where addr in (&1)); Enter value for 1: '7ACD8AA60','7ACD8FA88' old 3: where indx in (select name_ksrcctx from x$ksrcctx where addr in (&1)) new 3: where indx in (select name_ksrcctx from x$ksrcctx where addr in ('7ACD8AA60','7ACD8FA88')) NAME_KSRCDES ---------------------------------------------------------------- Result Cache: Channel RBR channel
从上面查询结果来看,已明确定位到有问题的 "Result Cache: Channel". 上面只是一个查询多个 channel的示例。这个例子中只需要查询第一个 addr='7ACD8AA60' 即可。
3 解决办法
3.1 Result Cache: Channel
以下内容三选一:
- 数据库更新到 12.2 或者12.1.0.2.0 Patchset
- 应用补丁 18416368
-
workaround
SQL> alter system set result_cache_max_size=0 scope=both sid='*';
修改参数后,实例需要重启。
3.2 RBR channel
影响版本:11.2.0.3
Bug 15826962 High "reliable message" wait due to "RBR channel"。
最保险的办法是得出进程trace,或者system trace,然后与MOS 文档对照,或者开SR,由Oracle 服务人员帮忙确定。
在以下版本、补丁中得到修复 :
- 11.2.0.4 (Server Patch Set)
- 11.2.0.3.12 (Oct 2014) Database Patch Set Update (DB PSU)
- 11.2.0.3 Bundle Patch 19 for Exadata Database
- 11.2.0.3 Patch 34 on Windows Platforms
- 11.2.0.3 Patch 23 on Windows Platforms
所以解决办法是升级或者打补丁。
3.3 kxfp control signal channel
- 影响版本
- 12.1.0.2
- (no term)
-
现象分析 其实这里并不只是这一个channel等待严重。示例如下:
SQL> select CHANNEL,sum(wait_count) sum_wait_count from GV$CHANNEL_WAITS group by CHANNEL order by sum(wait_count) 2 3 4 / CHANNEL SUM_WAIT_COUNT ---------------------------------------------------------------- -------------- Flashback RVWR init channel 2 quiesce channel 3 PMON actions channel 6 Broker IQ Result Channel 24 kill job broadcast - broadcast channel 54 parameters to cluster db instances - broadcast channel 137 GEN0 ksbxic channel 1035 Flashback Marker channel 1546 LCK0 ksbxic channel 2669 service operations - broadcast channel 7033 MMON remote action broadcast channel 78046 kxfp remote slave spawn channel 157850 Result Cache: Channel 242303 RBR channel 1595647 obj broadcast channel 4105387 kxfp control signal channel 5582125
可以看到除了,kxfp control signal channel 外还有 obj broadcast channel . 这两个是 其他的数倍甚至是数十倍。
同时,建议做一个hang analyze 。 查看trace 文件中是否包含了以下内容:
ervice name: SYS$BACKGROUND Current Wait Stack: 1: waiting for 'CSS group membership query'
如果有,说明CSS 组成员关系查询出现了阻塞,正常情况下应该是非常快的。
以上两个现象,基本可以确定是Oracle BUG: 20470877.
- (no term)
-
解决办法只有更新补丁
Patch 20470877: LONG WAITS FOR "RELIABLE MESSAGE" AFTER A FEW DAYS OF UPTIME
- (no term)
-
workaround
重启实例
Created: 2019-12-26 Thu 13:39
===================
天行健,君子以自强不息
地势坤,君子以厚德载物
===================