Handling the OEM error "Failed to connect to ASM instance. The connection is closed: The connection is closed"
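Preface
In keeping with the principle of getting to the bottom of every error that shows up: a recently deployed OEM 13c raised the following alert: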
Host=xxxxx1
Target type=Automatic Storage Management
Target name=+ASM1_xxxxx1
Categories=Availability
Message=Failed to connect to ASM instance. The connection is closed: The connection is closed
Severity=Fatal
Event reported time=Aug 9, 2020 10:08:18 AM CST
Operating System=Linux
Platform=x86_64
Associated Incident Id=88
Associated Incident Status=New
Associated Incident Owner=
Associated Incident Acknowledged By Owner=No
Associated Incident Priority=None
Associated Incident Escalation Level=0
Event Type=Target Availability
Event name=Status
Availability status=Down
Root Cause Analysis Status=Neither Cause Nor Symptom
Causal analysis result=Neither a cause nor a symptom
Rule Name=Incident management rule set for all targets,Incident creation rule for a Target Down availability status
Rule Owner=System Generated
Update Details:
Failed to connect to ASM instance. The connection is closed: The connection is closed
Incident created by rule (Name = Incident management rule set for all targets, Incident creation rule for a Target Down availability status [System generated rule]).
As usual, searching Baidu turned up nothing useful.
A search on MOS (My Oracle Support), however, found a match:
EM 13c: Enterprise Manager 13.2 Cloud Control ASM Incident Reported with Message=Failed To Connect To ASM Instance. The Connection Is Closed: The Connection Is Closed (Doc ID 2251591.1)
According to the document, this is a bug.
Verification
The document says that gcagent.log will contain errors like the following (example):
[65336:GC.Executor.126 (osm_instance:+ASM__host.company.com:ofs_performance_metrics) (osm_instance:+ASM__host.company.com:ofs_performance_metrics:Instance_Volume_Performance)] ERROR - The connection is closed: The connection is closed
java.sql.SQLException: The connection is closed: The connection is closed
at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:464)
at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:448)
at oracle.ucp.jdbc.proxy.JDBCConnectionProxyFactory.invoke(JDBCConnectionProxyFactory.java:307)
at oracle.ucp.jdbc.proxy.ConnectionProxyFactory.invoke(ConnectionProxyFactory.java:50)
at com.sun.proxy.$Proxy27.prepareCall(Unknown Source)
On the monitored host (the agent side), the log is at the following location:
[oracle@xxxxx1 log]$ ll $AGENT_HOME/sysman/log/gcagent.log
-rw-r----- 1 oracle oinstall 960998 Aug 10 14:20 /u01/app/oem13c/agent/agent_inst/sysman/log/gcagent.log
Checking the log shows that similar entries are indeed present:
2020-08-09 10:08:18,645 [99899:GC.Executor.23807 (osm_instance:+ASM1_xxxxx1:Response) (osm_instance:+ASM1_xxxxx1:Response:Response)] ERROR - The connection is closed: The connection is closed
java.sql.SQLException: The connection is closed: The connection is closed
at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:464)
at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:448)
at oracle.ucp.jdbc.proxy.JDBCConnectionProxyFactory.invoke(JDBCConnectionProxyFactory.java:307)
at oracle.ucp.jdbc.proxy.ConnectionProxyFactory.invoke(ConnectionProxyFactory.java:50)
at com.sun.proxy.$Proxy31.prepareCall(Unknown Source)
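The log check above can be scripted roughly as follows. This is only a sketch: the function names count_closed_errors and closed_error_targets are made up for illustration, and it assumes each error starts on its own line in the "... ERROR - The connection is closed ..." form seen in this log.

```shell
# Count log lines that report the "connection is closed" error.
# Usage: count_closed_errors /path/to/gcagent.log
count_closed_errors() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -c 'ERROR - The connection is closed' "$1" || true
}

# List the distinct target/metric tags (the text inside the first pair of
# parentheses on each error line) that hit the error.
closed_error_targets() {
    grep 'ERROR - The connection is closed' "$1" \
        | sed -n 's/^[^(]*(\([^)]*\)).*/\1/p' \
        | sort -u
}

# Example (path taken from this post):
#   count_closed_errors "$AGENT_HOME/sysman/log/gcagent.log"
#   closed_error_targets "$AGENT_HOME/sysman/log/gcagent.log"
```

Only the lines carrying "ERROR -" are counted, so the java.sql.SQLException line that follows each error is not counted a second time.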
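The document also notes:
In addition, if a thread dump is taken of the EM agent process, a large number of "Timer-" threads will be observed (they accumulate over time and are never closed or ended). Example: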
jstack <Agent PID>|grep "Timer-"|wc -l
983
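Note: As a rule of thumb, the number of "Timer-" threads should stay constant over time and below 50, but this is only an approximation, since it depends on the number of targets, the monitoring settings, the jobs executed, and many other factors. The key point is that the number of such threads should remain constant as time (days) passes.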
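A minimal sketch of this check follows. The helper names count_timer_threads and classify_timer_count are hypothetical, and the default limit of 50 is only the rule-of-thumb figure quoted above, not a hard threshold.

```shell
# Count lines mentioning "Timer-" in a jstack thread dump read from stdin.
# Equivalent to: grep "Timer-" | wc -l
count_timer_threads() {
    grep -c 'Timer-' || true
}

# Compare a count against a ceiling (default 50, the rule of thumb above).
# Prints "ok" when the count is below the ceiling, otherwise "suspect".
classify_timer_count() {
    count=$1
    limit=${2:-50}
    if [ "$count" -lt "$limit" ]; then
        echo "ok"
    else
        echo "suspect"
    fi
}

# Example:
#   n=$(jstack <Agent PID> | count_timer_threads)
#   classify_timer_count "$n"
```

A single reading above the limit is only a hint. As the note says, what matters is whether the count keeps growing over several days.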
Checking again on the problem node:
[oracle@xxxxx1 ~]# ps -ef | grep java
...(part of the output omitted)...
oracle 7687 7601 3 Aug05 ? 03:34:38 /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/java -Xmx128M -XX:MaxPermSize=128M -server -Djava.security.egd=file:///dev/./urandom -Dsun.lang.ClassLoader.allowArraySyntax=true -XX:-UseLargePages -XX:+UseLinuxPosixThreadCPUClocks -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCompressedOops -Dwatchdog.pid=7601 -cp /u01/app/oem13c/agent/agent_13.3.0.0.0/jdbc/lib/ojdbc7.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/ucp/lib/ucp.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/jsch-0.1.53.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/com.oracle.http_client.http_client_12.1.3.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.xdk_12.1.3/xmlparserv2.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.dms_12.1.3/dms.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.odl_12.1.3/ojdl.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.odl_12.1.3/ojdl2.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/lib/optic.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/log4j-core.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/jlib/gcagent_core.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/emagentSDK-intg.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/emagentSDK.jar oracle.sysman.gcagent.tmmain.TMMain
[oracle@xxxxx1 ~]$ /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/jstack 7687 | grep "Timer-" | wc -l
83
For comparison, a node that raised no alert:
[oracle@xxxxx2 ~]$ ps -ef | grep 13.3
oracle 31845 31753 0 Aug05 ? 00:26:26 /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/java -Xmx128M -XX:MaxPermSize=128M -server -Djava.security.egd=file:///dev/./urandom -Dsun.lang.ClassLoader.allowArraySyntax=true -XX:-UseLargePages -XX:+UseLinuxPosixThreadCPUClocks -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCompressedOops -Dwatchdog.pid=31753 -cp /u01/app/oem13c/agent/agent_13.3.0.0.0/jdbc/lib/ojdbc7.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/ucp/lib/ucp.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/jsch-0.1.53.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/com.oracle.http_client.http_client_12.1.3.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.xdk_12.1.3/xmlparserv2.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.dms_12.1.3/dms.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.odl_12.1.3/ojdl.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.odl_12.1.3/ojdl2.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/lib/optic.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/log4j-core.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/jlib/gcagent_core.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/emagentSDK-intg.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/emagentSDK.jar oracle.sysman.gcagent.tmmain.TMMain
[oracle@xxxxx2 ~]$ /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/jstack 31845 | grep "Timer-" | wc -l
7
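The document gives a rule-of-thumb value of fewer than 50. The problem node has 83 "Timer-" threads, against 7 on the healthy node. Taken as a reference value, this suggests the problem node has very likely hit the bug.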
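Picking the agent PID out of the ps output by eye can be avoided with a small filter. The sketch below assumes the agent JVM is the process whose command line ends with the main class oracle.sysman.gcagent.tmmain.TMMain, as in the ps output in this post. The function name agent_pid_from_ps is hypothetical.

```shell
# Print the PID (second column of `ps -ef`) of the agent JVM, identified by
# its main class. Reads `ps -ef` output from stdin.
# The dots are escaped in the pattern, so the awk process's own command line
# (which contains "oracle\.sysman...") does not match itself.
agent_pid_from_ps() {
    awk '/oracle\.sysman\.gcagent\.tmmain\.TMMain/ { print $2 }'
}

# Example (jstack path taken from this post):
#   pid=$(ps -ef | agent_pid_from_ps)
#   /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/jstack "$pid" | grep "Timer-" | wc -l
```

If more than one agent runs on the host, this prints one PID per line, and the caller has to choose.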
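Resolution
This is caused by a bug that exists in 13.2, 13.3, and 13.4. The bug number differs by release, so the patch differs as well.
For 13.3 the bug is Bug 28406747, and the fix is to apply that patch on the agent side.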
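How to apply the patch
The first question was what exactly to patch. When I applied a PSU to the OMS earlier, it was my first time doing so, and even with experience applying PSUs to databases it took a little fiddling.
This time it is only a small one-off patch. One of the steps in the readme is to shut down the Management Agent, and I got stuck on this for quite a while.
Which Management Agent does it mean?
Logically, the problem is in the agent on the database server, so the patch should go onto the agent there. However,
the word "management" made me think it might mean the agent on the OMS. And if it meant the agents on the database servers, wouldn't that mean stopping and patching the agent on a lot of machines?
To be honest, I was not even sure whether the agent on the OMS is the same thing as the agent on a database server (I later confirmed that it is).
Another round of Baidu and MOS searching turned up nothing this time.
Then it occurred to me that right after the OMS is set up, the OMS server itself already appears under the "Hosts" targets in the web console. So the agent on the OMS and the agent on a DB server
should essentially be the same thing. I therefore tried stopping the agent on the OMS: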
$AGENT_HOME/bin/emctl stop agent
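Sure enough, the OEM web console could still be logged in to, and under the "Hosts" targets the OMS machine itself showed as unhealthy, so they really are the same thing.
In other words, every agent has to be patched, one by one...
That raised another question: once the agent on the OMS is patched, will agents pushed to other servers afterwards already include the new patch?
Enough talk: patch the agent on the OMS first, then push a new agent to a DB server that is not yet monitored and see what happens.
First of all, be sure to read the patch readme and follow its requirements step by step!
Step 1: upgrade the agent's OPatch. The OPatch of the agent on the OMS had already been upgraded when the PSU was applied earlier, so this step was not needed here.
Step 2: set the environment variable: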
[oracle@oem13c agent]$ export ORACLE_HOME=/u01/app/oem13c/agent/agent_13.3.0.0.0
[oracle@oem13c agent]$ /u01/app/oem13c/agent/agent_13.3.0.0.0/OPatch/opatch version
OPatch Version: 13.9.3.3.0
OPatch succeeded.
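A brief aside: the readme calls the directory /u01/app/oem13c/agent/agent_13.3.0.0.0 the agent core home. In my environment, however,
the environment variable AGENT_HOME is set to /u01/app/oem13c/agent/agent_inst, which is called the instance directory when an agent is pushed.
Here /u01/app/oem13c/agent is the agent base directory. AGENT_HOME is set to /u01/app/oem13c/agent/agent_inst because the emctl command lives in the bin folder under that directory.
The one-off patch itself, though, is applied to the agent core home.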
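To keep the two directories apart, the layout can be written down as variables. This is only a sketch of the paths used in this post. The names AGENT_BASE, AGENT_CORE_HOME and AGENT_INST_HOME are made up for illustration, and the paths will differ on other installations.

```shell
# Directory layout as used in this post (adjust to the actual installation).
AGENT_BASE=/u01/app/oem13c/agent                 # agent base directory
AGENT_CORE_HOME=$AGENT_BASE/agent_13.3.0.0.0     # "agent core home": the patch is applied here
AGENT_INST_HOME=$AGENT_BASE/agent_inst           # "instance directory": emctl and sysman/log live here

# opatch works against ORACLE_HOME, so point it at the agent core home.
export ORACLE_HOME=$AGENT_CORE_HOME
# AGENT_HOME follows the convention of this post: the instance directory.
export AGENT_HOME=$AGENT_INST_HOME
```

With these set, $ORACLE_HOME/OPatch/opatch is the patching tool and $AGENT_HOME/bin/emctl is the agent control command.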
Back to the patching.
Step 3: stop the agent:
[oracle@oem13c 28406747]$ export PATH=$ORACLE_HOME/bin:$ORACLE_HOME/OPatch:$PATH
[oracle@oem13c 28406747]$ emctl stop agent
Oracle Enterprise Manager Cloud Control 13c Release 3
Copyright (c) 1996, 2018 Oracle Corporation. All rights reserved.
Stopping agent ... stopped.
[oracle@oem13c 28406747]$ opatch lspatches
25237184;One-off
24470104;
OPatch succeeded.
Step 4: apply the patch directly:
[oracle@oem13c 28406747]$ opatch apply
Oracle Interim Patch Installer version 13.9.3.3.0
Copyright (c) 2020, Oracle Corporation. All rights reserved.
Oracle Home : /u01/app/oem13c/agent/agent_13.3.0.0.0
Central Inventory : /u01/app/oraInventory
from : /u01/app/oem13c/agent/agent_13.3.0.0.0/oraInst.loc
OPatch version : 13.9.3.3.0
OUI version : 13.9.1.0.0
Log file location : /u01/app/oem13c/agent/agent_13.3.0.0.0/cfgtoollogs/opatch/opatch2020-08-10_16-39-09PM_1.log
OPatch detects the Middleware Home as "/u01/app/oem13c/agent"
Verifying environment and performing prerequisite checks...
OPatch continues with these patches: 28406747
Do you want to proceed? [y|n]
y
User Responded with: Y
All checks passed.
Backing up files...
Applying interim patch '28406747' to OH '/u01/app/oem13c/agent/agent_13.3.0.0.0'
Patching component oracle.sysman.agent.ic, 13.3.0.0.0...
Patch 28406747 successfully applied.
Log file location: /u01/app/oem13c/agent/agent_13.3.0.0.0/cfgtoollogs/opatch/opatch2020-08-10_16-39-09PM_1.log
OPatch succeeded.
[oracle@oem13c 28406747]$ opatch lspatches
28406747;
25237184;One-off
24470104;
OPatch succeeded.
Finally, start the agent:
[oracle@oem13c 28406747]$ emctl start agent
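At this point the one-off patch has been applied successfully.
Afterwards I pushed a new agent to an unmonitored DB server and found that the agent deployed there does not include the new patch...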
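Whether a given agent home already carries the patch can be checked from the `opatch lspatches` output. The sketch below assumes the "ID;description" line format shown in the output earlier in this post. The function name has_patch is hypothetical.

```shell
# Succeeds (exit status 0) if the given patch ID starts a line of
# `opatch lspatches` output read from stdin.
has_patch() {
    grep -q "^$1;"
}

# Example (run with ORACLE_HOME pointing at the agent core home):
#   if "$ORACLE_HOME/OPatch/opatch" lspatches | has_patch 28406747; then
#       echo "patch 28406747 present"
#   else
#       echo "patch 28406747 missing"
#   fi
```

Anchoring the match at the start of the line avoids a false positive if the same digits happen to appear inside another patch's description.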
So every agent still has to be patched manually.
The steps are the same each time and not particularly complicated.
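Since the same steps are repeated on every agent, they can be collected into one function. This is a rough sketch, not a replacement for the readme. The names run, patch_agent and DRY_RUN are hypothetical. It only strings together the commands already used above (emctl stop agent, opatch apply, opatch lspatches, emctl start agent) and assumes OPatch on that agent has already been upgraded as the readme requires.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop the agent, apply the patch, list patches, start the agent again.
# Arguments: agent core home, agent instance directory, unzipped patch directory.
patch_agent() {
    core_home=$1
    inst_home=$2
    patch_dir=$3
    ORACLE_HOME=$core_home
    export ORACLE_HOME
    run "$inst_home/bin/emctl" stop agent || return 1
    # opatch apply is run from inside the patch directory, as in the post.
    ( run cd "$patch_dir" && run "$core_home/OPatch/opatch" apply ) || return 1
    run "$core_home/OPatch/opatch" lspatches || return 1
    run "$inst_home/bin/emctl" start agent
}

# Example: preview the commands first, then run them for real.
#   DRY_RUN=1
#   patch_agent /u01/app/oem13c/agent/agent_13.3.0.0.0 /u01/app/oem13c/agent/agent_inst /path/to/28406747
```

As written, opatch apply still asks "Do you want to proceed? [y|n]", so the function is meant to be run interactively on each host. If the apply step fails, the function returns before restarting the agent, so the agent has to be started by hand after checking the OPatch log.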
I will keep watching whether the number of "Timer-" threads becomes abnormal again and whether any further alerts are raised.
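For this follow-up observation, the thread count can be sampled periodically and appended to a history file. The sketch below is one possible way to do it. The names record_timer_sample and timer_trend, and the history file path in the example, are hypothetical, and the "growing" test (newest sample larger than the oldest) is a deliberately crude stand-in for looking at the trend over several days.

```shell
# Append "<date> <time> <count>" to a history file.
# The jstack thread dump is read from stdin.
record_timer_sample() {
    history_file=$1
    count=$(grep -c 'Timer-' || true)
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$count" >> "$history_file"
}

# Print "growing" if the newest sample is larger than the oldest one,
# otherwise "stable". The count is the third field of each line.
timer_trend() {
    awk 'NR == 1 { first = $3 + 0 }
         { last = $3 + 0 }
         END { if (NR > 1 && last > first) print "growing"; else print "stable" }' "$1"
}

# Example (e.g. from cron, once a day):
#   jstack <Agent PID> | record_timer_sample /tmp/timer_threads.history
#   timer_trend /tmp/timer_threads.history
```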