Oracle RAC: crs_stat run as the grid user takes nearly 30 seconds
Symptoms:
Environment: two-node Oracle RAC 11.2.0.4 on physical servers.
Symptom 1:
In this Oracle RAC 11.2.0.4 cluster, some commands run as the grid user take abnormally long to return, close to 30 seconds:
[grid@rac1 ~]$ time crs_stat -t -v
Name Type R/RA F/FT Target State Host
----------------------------------------------------------------------
ora.DATA.dg ora....up.type 0/5 0/ ONLINE ONLINE rac1
ora.FRA.dg ora....up.type 0/5 0/ ONLINE ONLINE rac1
ora....ER.lsnr ora....er.type 0/5 0/ ONLINE ONLINE rac1
ora....N1.lsnr ora....er.type 0/5 0/0 ONLINE ONLINE rac1
ora.OCR.dg ora....up.type 0/5 0/ ONLINE ONLINE rac1
ora.asm ora.asm.type 0/5 0/ ONLINE ONLINE rac1
ora.*.db ora....se.type 0/2 0/1 OFFLINE OFFLINE
ora.*.db ora....se.type 0/2 0/1 ONLINE ONLINE rac1
ora.cvu ora.cvu.type 0/5 0/0 ONLINE ONLINE rac1
ora.gsd ora.gsd.type 0/5 0/ OFFLINE OFFLINE
ora....network ora....rk.type 0/5 0/ ONLINE ONLINE rac1
ora.oc4j ora.oc4j.type 0/1 0/2 ONLINE ONLINE rac1
ora.ons ora.ons.type 0/3 0/ ONLINE ONLINE rac1
ora....SM1.asm application 0/5 0/0 ONLINE ONLINE rac1
ora....C1.lsnr application 0/5 0/0 ONLINE ONLINE rac1
ora.rac1.gsd application 0/5 0/0 OFFLINE OFFLINE
ora.rac1.ons application 0/3 0/0 ONLINE ONLINE rac1
ora.rac1.vip ora....t1.type 0/0 0/0 ONLINE ONLINE rac1
ora....SM2.asm application 0/5 0/0 ONLINE ONLINE rac2
ora....C2.lsnr application 0/5 0/0 ONLINE ONLINE rac2
ora.rac2.gsd application 0/5 0/0 OFFLINE OFFLINE
ora.rac2.ons application 0/3 0/0 ONLINE ONLINE rac2
ora.rac2.vip ora....t1.type 0/0 0/0 ONLINE ONLINE rac2
ora.scan1.vip ora....ip.type 0/0 0/0 ONLINE ONLINE rac1
ora.*.db ora....se.type 0/2 0/1 ONLINE ONLINE rac1
real 0m27.927s
user 0m17.025s
sys 0m10.712s
Symptom 2: Zabbix monitoring alerts occasionally report agent.ping timeouts, and the Zabbix server log is full of errors:
14975:20210730:051954.737 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2-" failed: first network error, wait for 15 seconds
14980:20210730:052049.401 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14978:20210730:052250.841 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE1" failed: first network error, wait for 15 seconds
14979:20210730:052300.585 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:052312.489 resuming Zabbix agent checks on host "ORAC-NODE1": connection restored
14980:20210730:052414.317 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14976:20210730:052557.756 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:052647.517 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14976:20210730:052750.795 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE1" failed: first network error, wait for 15 seconds
14977:20210730:052800.474 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:052812.589 resuming Zabbix agent checks on host "ORAC-NODE1": connection restored
14980:20210730:052911.698 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14978:20210730:052918.528 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE1" failed: first network error, wait for 15 seconds
14979:20210730:052956.146 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:053014.555 resuming Zabbix agent checks on host "ORAC-NODE1": connection restored
14980:20210730:053015.559 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14977:20210730:053156.451 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:053246.504 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14978:20210730:053348.958 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE1" failed: first network error, wait for 15 seconds
14980:20210730:053412.577 resuming Zabbix agent checks on host "RAC-NODE1": connection restored
14975:20210730:053557.316 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2-" failed: first network error, wait for 15 seconds
14980:20210730:053649.374 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14962:20210730:054521.587 item "ORAC-NODE1:oracle.status_offline.process" became not supported: Timeout while executing a shell script.
14975:20210730:055156.431 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:055248.282 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14963:20210730:055347.878 item "ORAC-NODE1:oracle.status_offline.process" became supported
14975:20210730:055358.595 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:055446.042 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14965:20210730:055518.943 item "ORAC-NODE1:oracle.status_online.process" became not supported: Timeout while executing a shell script.
14964:20210730:055519.942 item "ORAC-NODE1:oracle.status_offline.process" became not supported: Timeout while executing a shell script.
The monitoring graphs also drop out from time to time.
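The failing items suggest custom agent checks that shell out to clusterware commands. The actual template was not captured; a hypothetical UserParameter pair along these lines would produce exactly this failure pattern, since every poll inherits the command's 30-second latency:

# /etc/zabbix/zabbix_agentd.d/oracle.conf -- hypothetical reconstruction,
# not the site's actual configuration.
# Each poll runs a clusterware query as the zabbix user, so the item
# times out whenever crs_stat itself takes ~30 seconds.
UserParameter=oracle.status_online.process,/u01/app/11.2.0/grid/bin/crs_stat -t | grep -c ONLINE
UserParameter=oracle.status_offline.process,/u01/app/11.2.0/grid/bin/crs_stat -t | grep -c OFFLINE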
Analysis:
1 Check the Oracle and ASM alert logs
[grid@rac2 trace]$ pwd
/u01/app/grid/diag/asm/+asm/+ASM2/trace
[grid@rac2 trace]$ tail -n 100 alert_+ASM2.log
[oracle@rac2 trace]$ pwd
/u01/app/oracle/diag/rdbms/*/*2/trace
[oracle@rac2 trace]$ tail -n 100 alert_*.log
No relevant anomalies were found in either log.
2 Check server performance
[root@rac2 ~]# iostat -x 1
avg-cpu: %user %nice %system %iowait %steal %idle
          5.76  0.00    0.53    1.22   0.00 92.49
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 19.00 0.00 6.00 0.00 200.00 33.33 0.00 0.00 0.00 0.00 0.00 0.00
up-0 0.00 0.00 2.00 1.00 2.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00
up-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
up-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
up-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
up-4 237.00 0.00 562.00 1.00 203376.00 5.00 361.25 0.31 0.55 0.55 1.00 0.31 17.70
up-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
up-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
up-7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
up-8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 2.00 1.00 2.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 2.00 1.00 2.00 1.00 1.00 0.00 0.33 0.50 0.00 0.33 0.10
sdd 0.00 0.00 2.00 1.00 2.00 1.00 1.00 0.00 0.33 0.50 0.00 0.33 0.10
sde 0.00 0.00 1035.00 3.00 385232.00 39.00 371.17 0.55 0.58 0.57 1.67 0.31 32.40
sdf 0.00 0.00 1082.00 3.00 396736.00 39.00 365.69 0.59 0.60 0.60 1.00 0.32 34.30
sdg 0.00 0.00 1049.00 1.00 384224.00 32.00 365.96 0.55 0.56 0.56 1.00 0.32 33.30
sdh 0.00 0.00 0.00 3.00 0.00 39.00 13.00 0.00 2.00 0.00 2.00 0.33 0.10
sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdj 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
up-10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
up-12 233.00 0.00 490.00 3.00 184336.00 39.00 373.99 0.27 0.56 0.55 1.67 0.32 16.00
up-14 237.00 0.00 563.00 0.00 203744.00 0.00 361.89 0.31 0.55 0.55 0.00 0.33 18.70
up-16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
up-19 0.00 0.00 2.00 1.00 2.00 1.00 1.00 0.00 0.33 0.50 0.00 0.33 0.10
up-21 244.00 0.00 545.00 0.00 200896.00 0.00 368.62 0.33 0.60 0.60 0.00 0.34 18.40
up-23 225.00 0.00 486.00 1.00 180480.00 32.00 370.66 0.29 0.59 0.59 1.00 0.35 17.20
up-25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
up-27 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
up-29 0.00 0.00 2.00 1.00 2.00 1.00 1.00 0.00 0.33 0.50 0.00 0.33 0.10
up-31 242.00 0.00 520.00 2.00 193360.00 34.00 370.49 0.34 0.65 0.65 1.00 0.37 19.30
up-33 0.00 0.00 0.00 3.00 0.00 39.00 13.00 0.01 2.00 0.00 2.00 2.00 0.60
up-35 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 25.00 0.00 200.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

[root@rac2 ~]# pidstat -d 1
Linux 2.6.32-642.el6.x86_64 (rac2) 07/30/2021 _x86_64_ (32 CPU)
08:45:30 PM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
08:45:31 PM      7829 137292.31      0.00      0.00  oracle
08:45:31 PM     11890      2.88      0.00      0.00  ocssd.bin
08:45:31 PM     12525     30.77      0.00      0.00  oracle
08:45:31 PM     16723     46.15      0.00      0.00  oracle
08:45:31 PM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
08:45:32 PM      7829 122400.00      0.00      0.00  oracle
08:45:32 PM     11890    836.00      4.00      0.00  ocssd.bin
08:45:32 PM     12089      0.00     32.00      0.00  ologgerd
08:45:32 PM     12385      0.00      4.00      0.00  orarootagent.bi
08:45:32 PM     12525     48.00      0.00      0.00  oracle
08:45:32 PM     16723     32.00      0.00      0.00  oracle
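The only heavy hitter above is a single oracle process reading over 100 MB/s while device latency stays below a millisecond and %util peaks around 35%, so storage pressure alone does not explain a 30-second command. If needed, the top reader can be mapped back to a concrete process (a quick check, not from the original session; the PID comes from the pidstat sample above):

# Identify the heavy reader seen in pidstat (the PID will differ per run).
ps -fp 7829
# For an Oracle process, the command line shows which instance it belongs
# to, e.g. ora_xxxx_<SID> for a background process.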
3 Generate and review an Oracle AWR report
[oracle@rac1 admin]$ ll /u01/app/oracle/product/11.2.0/db_1/rdbms/admin/awrgrpti.sql
-rw-r--r-- 1 oracle oinstall 6444 Jul 24 2011 /u01/app/oracle/product/11.2.0/db_1/rdbms/admin/awrgrpti.sql
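For reference, awrgrpti.sql generates the RAC-wide (global) AWR report for selected instances; a typical invocation looks like this, with the script driving everything through interactive prompts:

[oracle@rac1 admin]$ sqlplus / as sysdba
SQL> @?/rdbms/admin/awrgrpti.sql
-- Prompts for the report format (html/text), DBID, the list of instance
-- numbers, the number of days of snapshots to list, the begin/end
-- snapshot IDs, and a report file name.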
4 Cluster time synchronization reports CRS-4700: The Cluster Time Synchronization Service is in Observer mode.
Fix:
[grid@rac1 ~]$ cat /opt/synctime.sh
#!/bin/bash
ntpdate ***
hwclock -w
[grid@rac1 ~]$ cluvfy comp clocksync -verbose
Verifying Clock Synchronization across the cluster nodes
Checking if Clusterware is installed on all nodes...
Check of Clusterware install passed
Checking if CTSS Resource is running on all nodes...
Check: CTSS Resource running on all nodes
Node Name                            Status
------------------------------------ ------------------------
rac1                                 passed
Result: CTSS resource check passed
Querying CTSS for time offset on all nodes...
Result: Query of CTSS for time offset passed
Check CTSS state started...
Check: CTSS state
Node Name                            State
------------------------------------ ------------------------
rac1                                 Observer
CTSS is in Observer state. Switching over to clock synchronization checks using NTP
Starting Clock synchronization checks using Network Time Protocol(NTP)...
NTP Configuration file check started...
The NTP configuration file "/etc/ntp.conf" is available on all nodes
NTP Configuration file check passed
Checking daemon liveness...
Check: Liveness for "ntpd"
Node Name                            Running?
------------------------------------ ------------------------
rac1                                 no
Result: Liveness check failed for "ntpd"
PRVF-5494 : The NTP Daemon or Service was not alive on all nodes
PRVF-5415 : Check to see if NTP daemon or service is running failed
Result: Clock synchronization check using Network Time Protocol(NTP) failed
PRVF-9652 : Cluster Time Synchronization Services check failed
Verification of Clock Synchronization across the cluster nodes was unsuccessful on all the specified nodes.
[grid@rac1 ~]$ srvctl status listener
Listener LISTENER is enabled
Listener LISTENER is running on node(s): rac2,rac1
[grid@rac1 ~]$ ssh rac2 date;date
Fri Jul 30 21:56:39 * 2021
Fri Jul 30 21:56:39 * 2021
[grid@rac1 ~]$ crsctl check ctss
CRS-4700: The Cluster Time Synchronization Service is in Observer mode.
[grid@rac2 ~]$ crsctl check ctss
CRS-4700: The Cluster Time Synchronization Service is in Observer mode.
[root@rac1 ~]# mv /etc/ntp.conf /etc/ntp.conf.bak
[grid@rac1 ~]$ crsctl check ctss
CRS-4701: The Cluster Time Synchronization Service is in Active mode.
CRS-4702: Offset (in msec): 0
## Run on node 2
[root@rac2 ~]# mv /etc/ntp.conf /etc/ntp.conf.bk
[grid@rac2 ~]$ crsctl check ctss
CRS-4701: The Cluster Time Synchronization Service is in Active mode.
CRS-4702: Offset (in msec): 0
[grid@rac2 ~]$ cluvfy comp clocksync -verbose
Verifying Clock Synchronization across the cluster nodes
Checking if Clusterware is installed on all nodes...
Check of Clusterware install passed
Checking if CTSS Resource is running on all nodes...
Check: CTSS Resource running on all nodes
Node Name                            Status
------------------------------------ ------------------------
rac2                                 passed
Result: CTSS resource check passed
Querying CTSS for time offset on all nodes...
Result: Query of CTSS for time offset passed
Check CTSS state started...
Check: CTSS state
Node Name                            State
------------------------------------ ------------------------
rac2                                 Active
CTSS is in Active state. Proceeding with check of clock time offsets on all nodes...
Reference Time Offset Limit: 1000.0 msecs
Check: Reference Time Offset
Node Name    Time Offset              Status
------------ ------------------------ ------------------------
rac2         0.0                      passed
Time offset is within the specified limits on the following set of nodes: "[rac2]"
Result: Check of clock time offsets passed
Oracle Cluster Time Synchronization Services check passed
Verification of Clock Synchronization across the cluster nodes was successful.
[grid@rac1 ~]$ time srvctl status asm -a    ## execution time is still unchanged
ASM is running on rac2,rac1
ASM is enabled.
real 0m32.048s
user 0m20.051s
sys 0m11.758s
So clock synchronization was not the cause of the slowness.
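Renaming ntp.conf flips CTSS from Observer to Active immediately, but to keep ntpd from interfering after a reboot it should be stopped and disabled as well. A sketch of Oracle's documented procedure for handing timekeeping over to CTSS on these EL6 hosts (the pid-file cleanup is part of that procedure):

# Run as root on every node (EL6-style service management assumed).
service ntpd stop
chkconfig ntpd off
mv /etc/ntp.conf /etc/ntp.conf.bak   # already done above on both nodes
rm -f /var/run/ntpd.pid              # a stale pid file makes CTSS treat NTP as configured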
5 Check the network (listener) logs
[oracle@rac2 ~]$ tail -n 100 /u01/app/grid/diag/tnslsnr/rac2/listener/alert/log.xml
 host_addr='***'>
 <txt>31-JUL-2021 10:16:55 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=root))(service_name=***)) * (ADDRESS=(PROTOCOL=tcp)(HOST=***)(PORT=49407)) * establish * *** * 0
 </txt>
</msg>
<msg time='2021-07-31T10:17:15.788+08:00' org_id='oracle' comp_id='tnslsnr'
 type='UNKNOWN' level='16' host_id='rac2'
 host_addr='***'>
 <txt>31-JUL-2021 10:17:15 * service_update * **** * 0
 </txt>
Only routine connection and service_update entries; nothing abnormal here either.
6 Trace the command with strace
[grid@rac1 ~]$ strace crs_stat -t -v
getcwd("/home/grid", 4096) = 11
chdir("/u01/app/11.2.0/grid/log/rac1/client") = 0
getcwd("/u01/app/11.2.0/grid/log/rac1/client", 4096) = 37
chdir("/home/grid") = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
access("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", F_OK) = 0
statfs("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12868767, f_bfree=5387069, f_bavail=4731709, f_files=3276800, f_ffree=2483847, f_fsid={-1532779627, -1637007972}, f_namelen=255, f_frsize=4096}) = 0
open("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", O_RDONLY) = 3
close(3) = 0
... (the identical getcwd/chdir/stat/access/statfs/open/close sequence repeats for clsc105037.log, clsc105038.log, clsc105039.log, clsc105040.log, clsc105041.log, and so on) ...
getcwd("/home/grid", 4096) = 11
^Cchdir("/u01/app
[grid@rac1 ~]$ tail -n 100 /u01/app/11.2.0/grid/log/rac1/client/clsc105036.log
Oracle Database 11g Clusterware Release 11.2.0.4.0 - Production
Copyright 1996, 2011 Oracle. All rights reserved.
2021-04-28 08:01:57.908: [ CRSCOMM][3588262272] NAME: `UI_DATA` length=7
2021-04-28 08:01:57.908: [ CRSCOMM][3588262272] Successfully read response
The problem is evident from this trace: crs_stat walks the Grid client log directory, issuing a full stat/access/statfs/open cycle for every clsc*.log file it finds.
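A syscall summary makes the same point in aggregate. This was not part of the original session, but strace's counting mode would quantify it:

# Summarize syscall counts and cumulative time instead of logging each call.
strace -c -f -o /tmp/crs_stat_summary.txt crs_stat -t -v
# Expect stat/access/statfs/open to dominate the table: one full probe
# cycle is issued per clsc*.log file in the client log directory.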
Solution:
The trace is dominated by per-file calls such as access("/u01/app/11.2.0/grid/log/rac1/client/clsc105041.log", F_OK) = 0, so the next step is to examine that directory:
[grid@rac1 client]$ ll | wc -l
576583
[grid@rac1 client]$ du -sh
2.3G .
[grid@rac1 client]$ ll /u01/app/11.2.0/grid/log/rac1/client/clsc105037.log
-rw-r--r-- 1 zabbix zabbix 262 Apr 28 08:02 /u01/app/11.2.0/grid/log/rac1/client/clsc105037.log
[grid@rac1 client]$ ll clsc*.log | wc -l
576561
[root@rac1 client]# find -type f -mtime -1 | wc -l
2328
[root@rac1 client]# ll clsc575437.log
-rw-r--r-- 1 zabbix zabbix 262 Aug 1 10:16 clsc575437.log
[root@rac1 ~]# df -i
Filesystem                            Inodes   IUsed    IFree IUse% Mounted on
/dev/mapper/vgnode110102723-lv_root  3276800  793009  2483791  25% /
tmpfs                                1000000    1024   998976   1% /dev/shm
/dev/sda1                             128016      43   127973   1% /boot
/dev/mapper/vg_node110102723-lv_home 13926400     95 13926305   1% /home
[root@rac1 client]# find -amin -20
./clsc576616.log
./clsc576613.log
./clsc576615.log
./clsc576610.log
./clsc576614.log
./clsc576609.log
./clsc576611.log
./clsc576612.log
[root@rac1 client]# ll -h clsc576612.log
-rw-r--r-- 1 zabbix zabbix 262 Aug 1 22:31 clsc576612.log
[root@rac1 client]# ll clsc5766*.log | wc -l
34
The directory holds an enormous number of clsc*.log files, all owned by user and group zabbix, which points to a Zabbix monitoring item creating them. The creation rate also matches the monitoring interval: one file per minute.
On another, healthy RAC cluster, this directory contains nowhere near this many files.
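To confirm the one-file-per-minute cadence, the newest files' timestamps can be listed directly (a quick check, not from the original session):

cd /u01/app/11.2.0/grid/log/rac1/client
# With a 60-second Zabbix item interval, the newest clsc mtimes should
# step by exactly one minute.
ls -lt --time-style=full-iso | head -10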
[root@rac1 client]# pwd
/u01/app/11.2.0/grid/log/rac1/client
[root@rac1 client]# rm -f clsc5*.log
[root@rac1 client]# ll | wc -l
[grid@rac1 ~]$ time crs_stat -t -v
Name           Type           R/RA   F/FT   Target    State     Host
----------------------------------------------------------------------
ora.DATA.dg    ora....up.type 0/5    0/     ONLINE    ONLINE    rac1
ora.FRA.dg     ora....up.type 0/5    0/     ONLINE    ONLINE    rac1
ora....ER.lsnr ora....er.type 0/5    0/     ONLINE    ONLINE    rac1
ora....N1.lsnr ora....er.type 0/5    0/0    ONLINE    ONLINE    rac1
ora.OCR.dg     ora....up.type 0/5    0/     ONLINE    ONLINE    rac1
ora.asm        ora.asm.type   0/5    0/     ONLINE    ONLINE    rac1
ora.***.db     ora....se.type 0/2    0/1    OFFLINE   OFFLINE
ora.***.db     ora....se.type 0/2    0/1    ONLINE    ONLINE    rac1
ora.cvu        ora.cvu.type   0/5    0/0    ONLINE    ONLINE    rac1
ora.gsd        ora.gsd.type   0/5    0/     OFFLINE   OFFLINE
ora....network ora....rk.type 0/5    0/     ONLINE    ONLINE    rac1
ora.oc4j       ora.oc4j.type  0/1    0/2    ONLINE    ONLINE    rac1
ora.ons        ora.ons.type   0/3    0/     ONLINE    ONLINE    rac1
ora....SM1.asm application    0/5    0/0    ONLINE    ONLINE    rac1
ora....C1.lsnr application    0/5    0/0    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    0/5    0/0    OFFLINE   OFFLINE
ora.rac1.ons   application    0/3    0/0    ONLINE    ONLINE    rac1
ora.rac1.vip   ora....t1.type 0/0    0/0    ONLINE    ONLINE    rac1
ora....SM2.asm application    0/5    0/0    ONLINE    ONLINE    rac2
ora....C2.lsnr application    0/5    0/0    ONLINE    ONLINE    rac2
ora.rac2.gsd   application    0/5    0/0    OFFLINE   OFFLINE
ora.rac2.ons   application    0/3    0/0    ONLINE    ONLINE    rac2
ora.rac2.vip   ora....t1.type 0/0    0/0    ONLINE    ONLINE    rac2
ora.scan1.vip  ora....ip.type 0/0    0/0    ONLINE    ONLINE    rac1
ora.***.db     ora....se.type 0/2    0/1    ONLINE    ONLINE    rac1
real 0m0.049s
user 0m0.014s
sys 0m0.008s
For now, the problem is resolved:
Deleted the offending files: clsc*.log
Re-running the command completes in about real 0m0.049s
A fresh timed trace was also captured for later comparison:
[grid@rac1 ~]$ strace -tt -T -v -o /tmp/strace_crs_20210801.log crs_stat -t -v
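Once the directory starts filling up again, the saved trace can be mined to confirm where the time goes (a hypothetical follow-up; the log name comes from the strace -o command above):

# Count the per-file probes recorded in the saved trace.
grep -c 'access(' /tmp/strace_crs_20210801.log
# The -T flag appended each call's elapsed time in <seconds> at the end
# of its line, so slow calls are easy to spot with a pager or grep.
less /tmp/strace_crs_20210801.log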
After re-enabling Zabbix monitoring, files are still being created at one per minute, so this is a workaround rather than a root-cause fix. If no root cause turns up, a scheduled job can periodically find and delete files of this type; see the sketch below.
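A minimal cleanup sketch, assuming a daily cron job and the paths from this environment (the script name and the one-day retention are arbitrary choices):

#!/bin/bash
# /etc/cron.daily/clean_clsc_logs.sh -- hypothetical name
# Remove Zabbix-created clsc client logs older than one day on this node.
# Using find ... -delete sidesteps the "argument list too long" error that
# a huge rm glob can hit once the directory holds hundreds of thousands of files.
find /u01/app/11.2.0/grid/log/rac1/client -name 'clsc*.log' -user zabbix -mtime +1 -delete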
Open questions:
1 Why do the Zabbix Oracle items on this particular RAC cluster generate so many files? The monitored items include 1521, ora_pmon, asm.process, session_counts, etc.
2 A review of the Zabbix configuration showed nothing special for this Oracle environment, so what is different here?
3 Do the Oracle environment variables or parameters have any special settings that could explain it?