A Case of Troubleshooting a Listener Outage
The customer complained that the listener had recently been failing repeatedly and could no longer register services, causing front-end processes to fail to connect.
Based on the approximate time of the most recent occurrence the customer gave us, let's look at the listener log:
[.......]
21-JUN-2017 15:35:14 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVER=DEDICATED)(SERVICE_NAME=app2012)) * (ADDRESS=(PROTOCOL=tcp)
21-JUN-2017 15:35:14 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVER=DEDICATED)(SERVICE_NAME=app2012)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.85)(PORT=57408)) * establish * app2012 * 0
No longer listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.22)(PORT=1521))) <<<<<<
No longer listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.21)(PORT=1521))) <<<<<<
Listener completed notification to CRS on stop <<<<<
21-JUN-2017 15:35:14 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=TRFFAPP2)(USER=oracle))(COMMAND=stop)(ARGUMENTS=64)(SERVICE=LISTENER_TRFFAPP2)(VERSION=169870592)) * stop * 0
21-JUN-2017 15:35:14 * (CONNECT_DATA=(SID=app20122)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.58)(PORT=64878)) * establish * app20122 * 0
TNSLSNR for HPUX: Version 10.2.0.5.0 - Production on 21-JUN-2017 15:35:17
Copyright (c) 1991, 2010, Oracle. All rights reserved.
System parameter file is /oracle/app/oracle/product/10.2.0/dbhome_1/network/admin/listener.ora
Log messages written to /oracle/app/oracle/product/10.2.0/dbhome_1/network/log/listener_trffapp2.log
Trace information written to /oracle/app/oracle/product/10.2.0/dbhome_1/network/trace/listener_trffapp2.trc
Trace level is currently 0
Started with pid=2260
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.22)(PORT=1521)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.21)(PORT=1521)))
Listener completed notification to CRS on start <<<<<<
TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
21-JUN-2017 15:35:17 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVER=DEDICATED)(SERVICE_NAME=app2012)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.94)(PORT=56715)) * establish * app2012 * 12514
TNS-12514: TNS:listener does not currently know of service requested in connect descriptor <<<<<<<<<
21-JUN-2017 15:35:17 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVER=DEDICATED)(SERVICE_NAME=app2012)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.109)(PORT=60547)) * establish * app2012 * 12514
From the above we can see that the listener was working normally until 21-JUN-2017 15:35:14, when the log suddenly recorded that it had stopped listening ("No longer listening on: ....."),
followed by the listener notifying CRS of the stop:
"Listener completed notification to CRS on stop".
Next comes the listener restart: "Started with pid=2260".
Finally there are TNS-12514 errors, meaning the listener did not know of the requested service; this matches the customer's report that the front end could not connect.
However, the log gives no reason why the listener suddenly stopped: there were no errors before it, and the alert log contains nothing either.
So let's first look at the volume of listener requests around the time of the errors:
147 21-JUN-2017 15:27
127 21-JUN-2017 15:31
95 21-JUN-2017 15:32
1217 21-JUN-2017 15:33 <<<high
1443 21-JUN-2017 15:34 <<<high
12265 21-JUN-2017 15:35 <<issue time very high
10607 21-JUN-2017 15:36
We can see that the request rate first dropped before the problem, then rose sharply, peaking exactly at the time of the failure.
This looks very much like a connection storm triggered by the front-end processes. Since the new-connection rate was so high, the next step was to check the OS side;
however, the customer had not deployed OSWatcher at the time, and the problem had already been resolved manually, so we asked them to deploy it as soon as possible.
A few days later the problem recurred; let's look at this occurrence:
608 27-JUN-2017 09:23
333 27-JUN-2017 09:24
809 27-JUN-2017 09:25 <<<<
1306 27-JUN-2017 09:26 <<<<<
1965 27-JUN-2017 09:27 <<<<
4 27-JUN-2017 09:28
1070 27-JUN-2017 09:29 <<<<
1055 27-JUN-2017 09:30 <<<
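Per-minute request counts like the ones above can be tallied from the listener log with a short script. A minimal sketch, assuming the 10g-style log format shown earlier, where each connection entry begins with a `DD-MON-YYYY HH:MM:SS * ` timestamp and contains `* establish *` (the log path would be the real one from the listener banner):

```python
import re
from collections import Counter

# Entries start with a timestamp like "21-JUN-2017 15:35:14 * ...";
# capture everything up to the minute so requests group per minute.
TS = re.compile(r"^(\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}):\d{2} \* ")

def requests_per_minute(lines):
    """Count listener connection ('establish') requests per minute."""
    counts = Counter()
    for line in lines:
        m = TS.match(line)
        if m and "* establish *" in line:
            counts[m.group(1)] += 1
    return counts

# Usage sketch:
#   with open("listener_trffapp2.log") as f:
#       for minute, n in sorted(requests_per_minute(f).items()):
#           print(f"{n:>6} {minute}")
```

Sorting the result by minute reproduces the count-per-minute view used in the analysis above.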
Again there was a flood of new requests around the time of the failure. Now let's look at the OSWatcher output:
zzz ***Tue Jun 27 09:27:12 EAT 2017
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
95 39 0 64868005 88012273 0 0 2 0 0 0 0 94556 519483 34750 8 3 89 <<<<RunQ 95, blockQ 39
95 39 0 64868005 88000105 0 0 0 0 0 0 0 149539 1720220 79082 45 7 48
95 39 0 64868005 87979642 0 0 0 0 0 0 0 147551 1853425 77468 47 5 47
zzz ***Tue Jun 27 09:27:29 EAT 2017
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
86 61 0 66285766 87435149 0 0 2 0 0 0 0 94556 519489 34750 8 3 89
86 61 0 66285766 87354916 0 0 0 0 0 0 0 141514 2691511 78742 33 8 59
86 61 0 66285766 87287447 0 0 0 0 0 0 0 142313 2680929 80039 32 9 59
zzz ***Tue Jun 27 09:27:46 EAT 2017
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
171 61 0 80397655 86770380 0 0 2 0 0 0 0 94556 519494 34750 8 3 89 <<<RunQ 171, blockQ 61
171 61 0 80397655 86733359 0 0 0 0 0 0 0 153462 2252976 99530 62 8 30
171 61 0 80397655 86709517 0 0 0 0 0 0 0 153881 2250919 97705 59 8 33
At the OS level there were very large run and block queues, far exceeding the number of CPUs, and the cluster software (VCS engine_A.log) kept warning about excessive CPU usage:
++++++++++++++++engine_A.log
2017/06/27 09:00:30 VCS WARNING V-16-1-50086 CPU usage on trffapp2 is 83%
2017/06/27 09:12:00 VCS INFO V-16-1-50086 CPU usage on trffapp2 is 66%
2017/06/27 09:17:36 VCS INFO V-16-1-50086 CPU usage on trffapp2 is 62%
2017/06/27 09:18:06 VCS WARNING V-16-1-50086 CPU usage on trffapp2 is 80%
2017/06/27 09:20:36 VCS NOTICE V-16-1-50086 CPU usage on trffapp2 is 74%
2017/06/27 09:21:06 VCS CRITICAL V-16-1-50086 CPU usage on trffapp2 is 91%
2017/06/27 09:22:01 VCS ERROR V-16-2-13027 (trffapp2) Resource(CSSD_trapp) - monitor procedure did not complete within the expected time.
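Vmstat snapshots like the ones above can be scanned automatically for samples where the run queue exceeds the CPU count. A minimal sketch, assuming the HP-UX vmstat layout shown above (data rows start with the `r b w` columns); the CPU count here is a placeholder, not the real value from this host:

```python
NCPUS = 64  # hypothetical CPU count; take the real value from the host

def high_runq_samples(lines, ncpus=NCPUS):
    """Return (run_queue, blocked_queue) pairs where runq > CPU count."""
    flagged = []
    for line in lines:
        fields = line.split()
        # vmstat data rows begin with three integers: r (run queue),
        # b (blocked on I/O), w (swapped); header rows do not.
        if len(fields) >= 3 and all(f.isdigit() for f in fields[:3]):
            runq, blocked = int(fields[0]), int(fields[1])
            if runq > ncpus:
                flagged.append((runq, blocked))
    return flagged
```

Run against the OSWatcher vmstat archive, this quickly isolates the intervals of CPU saturation without reading every snapshot by hand.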
Clearly a sudden spike in OS resource utilization triggered the listener outage. Next, let's analyze the AWR report:
SQL ordered by Gets
947,020,215 43 22,023,725.93 14.71 33793.03 107433.92 7sgm3npmzvwvj JDBC Thin Client SELECT * FROM (SELECT pagetabl...
946,702,135 74 12,793,272.09 14.70 37181.04 122613.98 5aqranh8b8hkf JDBC Thin Client SELECT * FROM (SELECT pagetabl...
475,885,893 33 14,420,784.64 7.39 18047.45 54492.09 50uhrdj7uant0 JDBC Thin Client SELECT * FROM (SELECT pagetabl...
371,382,132 24 15,474,255.50 5.77 15852.55 48278.23 863b3ba2bx6h2 JDBC Thin Client SELECT * FROM (SELECT pagetabl...
317,142,383 24 13,214,265.96 4.93 13056.82 43185.59 f4q2kfd3q1uh6 JDBC Thin Client SELECT * FROM (SELECT pagetabl...
The top-ranked SQL statements are essentially all the same logic, and the first two have abnormally high buffer gets: a single statement accounts for roughly 7225 GB of logical reads. ASH also shows a large number of sessions on CPU, so tuning the top CPU-consuming SQL
should relieve the problem. After the application's calls to this SQL were throttled, the situation eased and the problem did not recur.
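The 7225 GB figure can be sanity-checked from the AWR numbers: each buffer get reads one database block, so multiplying the gets by the block size (the common 8 KB size is assumed here) gives the logical-read volume:

```python
# Logical-read volume for the top SQL from "SQL ordered by Gets".
buffer_gets = 947_020_215  # gets for SQL_ID 7sgm3npmzvwvj (from the AWR)
block_size = 8192          # bytes; assuming the common 8 KB db_block_size

gib = buffer_gets * block_size / 1024**3
print(f"{gib:.0f} GiB of logical reads")  # about 7225 GiB
```

That is over 7 TB of block visits for 43 executions of one statement, which by itself explains the sustained on-CPU load seen in ASH.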