Oracle RAC: handling a crs_stat command that takes 30 seconds for the grid user

Symptoms:

Environment: two-node Oracle RAC 11.2.0.4 on physical servers.

Symptom 1

On this Oracle RAC 11.2.0.4 cluster, some commands run as the grid user take abnormally long to return: nearly 30 seconds.

[grid@rac1 ~]$ time crs_stat -t -v
Name           Type           R/RA   F/FT   Target    State     Host        
----------------------------------------------------------------------
ora.DATA.dg    ora....up.type 0/5    0/     ONLINE    ONLINE    rac1        
ora.FRA.dg     ora....up.type 0/5    0/     ONLINE    ONLINE    rac1        
ora....ER.lsnr ora....er.type 0/5    0/     ONLINE    ONLINE    rac1        
ora....N1.lsnr ora....er.type 0/5    0/0    ONLINE    ONLINE    rac1        
ora.OCR.dg     ora....up.type 0/5    0/     ONLINE    ONLINE    rac1        
ora.asm        ora.asm.type   0/5    0/     ONLINE    ONLINE    rac1        
ora.*.db     ora....se.type 0/2    0/1    OFFLINE   OFFLINE               
ora.*.db     ora....se.type 0/2    0/1    ONLINE    ONLINE    rac1        
ora.cvu        ora.cvu.type   0/5    0/0    ONLINE    ONLINE    rac1        
ora.gsd        ora.gsd.type   0/5    0/     OFFLINE   OFFLINE               
ora....network ora....rk.type 0/5    0/     ONLINE    ONLINE    rac1        
ora.oc4j       ora.oc4j.type  0/1    0/2    ONLINE    ONLINE    rac1        
ora.ons        ora.ons.type   0/3    0/     ONLINE    ONLINE    rac1        
ora....SM1.asm application    0/5    0/0    ONLINE    ONLINE    rac1        
ora....C1.lsnr application    0/5    0/0    ONLINE    ONLINE    rac1        
ora.rac1.gsd   application    0/5    0/0    OFFLINE   OFFLINE               
ora.rac1.ons   application    0/3    0/0    ONLINE    ONLINE    rac1        
ora.rac1.vip   ora....t1.type 0/0    0/0    ONLINE    ONLINE    rac1        
ora....SM2.asm application    0/5    0/0    ONLINE    ONLINE    rac2        
ora....C2.lsnr application    0/5    0/0    ONLINE    ONLINE    rac2        
ora.rac2.gsd   application    0/5    0/0    OFFLINE   OFFLINE               
ora.rac2.ons   application    0/3    0/0    ONLINE    ONLINE    rac2        
ora.rac2.vip   ora....t1.type 0/0    0/0    ONLINE    ONLINE    rac2        
ora.scan1.vip  ora....ip.type 0/0    0/0    ONLINE    ONLINE    rac1        
ora.*.db     ora....se.type 0/2    0/1    ONLINE    ONLINE    rac1        

real	0m27.927s
user	0m17.025s
sys	0m10.712s
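
(Note: crs_stat is deprecated in 11gR2. The equivalent check through the newer interface, which was affected just as badly here, as the srvctl timing later in this post shows, would be:)

[grid@rac1 ~]$ time crsctl status resource -t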

Symptom 2: Zabbix monitoring alerts, occasionally reporting agent.ping timeouts. At the same time, the Zabbix server log is full of errors:

14975:20210730:051954.737 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:052049.401 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14978:20210730:052250.841 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE1" failed: first network error, wait for 15 seconds
14979:20210730:052300.585 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:052312.489 resuming Zabbix agent checks on host "ORAC-NODE1": connection restored
14980:20210730:052414.317 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14976:20210730:052557.756 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:052647.517 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14976:20210730:052750.795 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE1" failed: first network error, wait for 15 seconds
14977:20210730:052800.474 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:052812.589 resuming Zabbix agent checks on host "ORAC-NODE1": connection restored
14980:20210730:052911.698 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14978:20210730:052918.528 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE1" failed: first network error, wait for 15 seconds
14979:20210730:052956.146 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:053014.555 resuming Zabbix agent checks on host "ORAC-NODE1": connection restored
14980:20210730:053015.559 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14977:20210730:053156.451 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:053246.504 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14978:20210730:053348.958 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE1" failed: first network error, wait for 15 seconds
14980:20210730:053412.577 resuming Zabbix agent checks on host "ORAC-NODE1": connection restored
14975:20210730:053557.316 Zabbix agent item "oracle.status_online.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:053649.374 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14962:20210730:054521.587 item "ORAC-NODE1:oracle.status_offline.process" became not supported: Timeout while executing a shell script.
14975:20210730:055156.431 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:055248.282 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14963:20210730:055347.878 item "ORAC-NODE1:oracle.status_offline.process" became supported
14975:20210730:055358.595 Zabbix agent item "oracle.status_offline.process" on host "ORAC-NODE2" failed: first network error, wait for 15 seconds
14980:20210730:055446.042 resuming Zabbix agent checks on host "ORAC-NODE2": connection restored
14965:20210730:055518.943 item "ORAC-NODE1:oracle.status_online.process" became not supported: Timeout while executing a shell script.
14964:20210730:055519.942 item "ORAC-NODE1:oracle.status_offline.process" became not supported: Timeout while executing a shell script.

Occasionally the Zabbix graphs would also break up.

Analysis:

1 Check the Oracle alert logs

[grid@rac2 trace]$ pwd

/u01/app/grid/diag/asm/+asm/+ASM2/trace

[grid@rac2 trace]$ tail -n 100 alert_+ASM2.log

[oracle@rac2 trace]$ pwd

/u01/app/oracle/diag/rdbms/*/*2/trace

[oracle@rac2 trace]$ tail -n 100 alert_*.log

No relevant errors were found.
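
For a quicker pass than tailing, both alert logs can also be scanned for ORA- errors directly, run from the respective trace directories shown above (a sketch):

[grid@rac2 trace]$ grep 'ORA-' alert_+ASM2.log | tail
[oracle@rac2 trace]$ grep 'ORA-' alert_*.log | tail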

2 Check server performance metrics

[root@rac2 ~]# iostat -x 1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.76    0.00    0.53    1.22    0.00   92.49

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    19.00    0.00    6.00     0.00   200.00    33.33     0.00    0.00    0.00    0.00   0.00   0.00
up-0              0.00     0.00    2.00    1.00     2.00     1.00     1.00     0.00    0.00    0.00    0.00   0.00   0.00
up-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
up-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
up-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
up-4            237.00     0.00  562.00    1.00 203376.00     5.00   361.25     0.31    0.55    0.55    1.00   0.31  17.70
up-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
up-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
up-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
up-8              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    2.00    1.00     2.00     1.00     1.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    2.00    1.00     2.00     1.00     1.00     0.00    0.33    0.50    0.00   0.33   0.10
sdd               0.00     0.00    2.00    1.00     2.00     1.00     1.00     0.00    0.33    0.50    0.00   0.33   0.10
sde               0.00     0.00 1035.00    3.00 385232.00    39.00   371.17     0.55    0.58    0.57    1.67   0.31  32.40
sdf               0.00     0.00 1082.00    3.00 396736.00    39.00   365.69     0.59    0.60    0.60    1.00   0.32  34.30
sdg               0.00     0.00 1049.00    1.00 384224.00    32.00   365.96     0.55    0.56    0.56    1.00   0.32  33.30
sdh               0.00     0.00    0.00    3.00     0.00    39.00    13.00     0.00    2.00    0.00    2.00   0.33   0.10
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
up-10             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
up-12           233.00     0.00  490.00    3.00 184336.00    39.00   373.99     0.27    0.56    0.55    1.67   0.32  16.00
up-14           237.00     0.00  563.00    0.00 203744.00     0.00   361.89     0.31    0.55    0.55    0.00   0.33  18.70
up-16             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
up-19             0.00     0.00    2.00    1.00     2.00     1.00     1.00     0.00    0.33    0.50    0.00   0.33   0.10
up-21           244.00     0.00  545.00    0.00 200896.00     0.00   368.62     0.33    0.60    0.60    0.00   0.34  18.40
up-23           225.00     0.00  486.00    1.00 180480.00    32.00   370.66     0.29    0.59    0.59    1.00   0.35  17.20
up-25             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
up-27             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
up-29             0.00     0.00    2.00    1.00     2.00     1.00     1.00     0.00    0.33    0.50    0.00   0.33   0.10
up-31           242.00     0.00  520.00    2.00 193360.00    34.00   370.49     0.34    0.65    0.65    1.00   0.37  19.30
up-33             0.00     0.00    0.00    3.00     0.00    39.00    13.00     0.01    2.00    0.00    2.00   2.00   0.60
up-35             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00   25.00     0.00   200.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
[root@rac2 ~]# pidstat -d 1
Linux 2.6.32-642.el6.x86_64 (rac2)     07/30/2021     _x86_64_    (32 CPU)

08:45:30 PM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
08:45:31 PM      7829 137292.31      0.00      0.00  oracle
08:45:31 PM     11890      2.88      0.00      0.00  ocssd.bin
08:45:31 PM     12525     30.77      0.00      0.00  oracle
08:45:31 PM     16723     46.15      0.00      0.00  oracle

08:45:31 PM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
08:45:32 PM      7829 122400.00      0.00      0.00  oracle
08:45:32 PM     11890    836.00      4.00      0.00  ocssd.bin
08:45:32 PM     12089      0.00     32.00      0.00  ologgerd
08:45:32 PM     12385      0.00      4.00      0.00  orarootagent.bi
08:45:32 PM     12525     48.00      0.00      0.00  oracle
08:45:32 PM     16723     32.00      0.00      0.00  oracle
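
The heavy reader is an oracle server process (PID 7829 in the sample above). To map that OS PID back to a database session, a query along these lines can be run (a sketch; the PID value is taken from the pidstat output):

[oracle@rac2 ~]$ sqlplus -s / as sysdba <<'EOF'
SELECT s.sid, s.serial#, s.username, s.program
  FROM v$process p JOIN v$session s ON s.paddr = p.addr
 WHERE p.spid = '7829';
EOF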

 

3 Generate and review an Oracle AWR report

[oracle@rac1 admin]$ ll /u01/app/oracle/product/11.2.0/db_1/rdbms/admin/awrgrpti.sql

-rw-r--r-- 1 oracle oinstall 6444 Jul 24  2011 /u01/app/oracle/product/11.2.0/db_1/rdbms/admin/awrgrpti.sql
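
awrgrpti.sql produces the RAC-wide (global) AWR report for a chosen set of instances. It is normally run from SQL*Plus and prompts for the report format, instances, and snapshot range (a sketch):

[oracle@rac1 admin]$ sqlplus / as sysdba
SQL> @?/rdbms/admin/awrgrpti.sql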

4 CRS-4700: The Cluster Time Synchronization Service is in Observer mode.

Action:

[grid@rac1 ~]$ cat /opt/synctime.sh 
#!/bin/bash
# sync the OS clock from the (masked) NTP server, then write it to the hardware clock
ntpdate ***
hwclock -w
[grid@rac1 ~]$ cluvfy comp clocksync -verbose

Verifying Clock Synchronization across the cluster nodes 

Checking if Clusterware is installed on all nodes...
Check of Clusterware install passed

Checking if CTSS Resource is running on all nodes...
Check: CTSS Resource running on all nodes
  Node Name                             Status                  
  ------------------------------------  ------------------------
  rac1                                  passed                  
Result: CTSS resource check passed


Querying CTSS for time offset on all nodes...
Result: Query of CTSS for time offset passed

Check CTSS state started...
Check: CTSS state
  Node Name                             State                   
  ------------------------------------  ------------------------
  rac1                                  Observer                
CTSS is in Observer state. Switching over to clock synchronization checks using NTP


Starting Clock synchronization checks using Network Time Protocol(NTP)...

NTP Configuration file check started...
The NTP configuration file "/etc/ntp.conf" is available on all nodes
NTP Configuration file check passed

Checking daemon liveness...

Check: Liveness for "ntpd"
  Node Name                             Running?                
  ------------------------------------  ------------------------
  rac1                                  no                      
Result: Liveness check failed for "ntpd"
PRVF-5494 : The NTP Daemon or Service was not alive on all nodes
PRVF-5415 : Check to see if NTP daemon or service is running failed
Result: Clock synchronization check using Network Time Protocol(NTP) failed


PRVF-9652 : Cluster Time Synchronization Services check failed

Verification of Clock Synchronization across the cluster nodes was unsuccessful on all the specified nodes. 
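
The failure pattern above is the key finding: /etc/ntp.conf is present, so CTSS runs in Observer mode, yet ntpd itself is not alive, so no real time synchronization is taking place. On EL6 the daemon state can be double-checked with (a sketch):

[root@rac1 ~]# service ntpd status
[root@rac1 ~]# chkconfig --list ntpd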

 

[grid@rac1 ~]$ srvctl status listener
Listener LISTENER is enabled
Listener LISTENER is running on node(s): rac2,rac1
[grid@rac1 ~]$ ssh rac2 date;date
Fri Jul 30 21:56:39 * 2021
Fri Jul 30 21:56:39 * 2021
[grid@rac1 ~]$  crsctl check ctss
CRS-4700: The Cluster Time Synchronization Service is in Observer mode.
[grid@rac2 ~]$ crsctl check ctss
CRS-4700: The Cluster Time Synchronization Service is in Observer mode.
[root@rac1 ~]# mv /etc/ntp.conf /etc/ntp.conf.bak
[grid@rac1 ~]$ crsctl check ctss
CRS-4701: The Cluster Time Synchronization Service is in Active mode.
CRS-4702: Offset (in msec): 0
## Run on node 2:
[root@rac2 ~]# mv /etc/ntp.conf /etc/ntp.conf.bk
[grid@rac2 ~]$ crsctl check ctss
CRS-4701: The Cluster Time Synchronization Service is in Active mode.
CRS-4702: Offset (in msec): 0
[grid@rac2 ~]$ cluvfy comp clocksync -verbose

Verifying Clock Synchronization across the cluster nodes 

Checking if Clusterware is installed on all nodes...
Check of Clusterware install passed

Checking if CTSS Resource is running on all nodes...
Check: CTSS Resource running on all nodes
  Node Name                             Status                  
  ------------------------------------  ------------------------
  rac2                                  passed                  
Result: CTSS resource check passed


Querying CTSS for time offset on all nodes...
Result: Query of CTSS for time offset passed

Check CTSS state started...
Check: CTSS state
  Node Name                             State                   
  ------------------------------------  ------------------------
  rac2                                  Active                  
CTSS is in Active state. Proceeding with check of clock time offsets on all nodes...
Reference Time Offset Limit: 1000.0 msecs
Check: Reference Time Offset
  Node Name     Time Offset               Status                  
  ------------  ------------------------  ------------------------
  rac2          0.0                       passed                  

Time offset is within the specified limits on the following set of nodes: "[rac2]"
Result: Check of clock time offsets passed

Oracle Cluster Time Synchronization Services check passed

Verification of Clock Synchronization across the cluster nodes was successful. 

[grid@rac1 ~]$ time srvctl status asm -a    ## execution time is still unchanged
ASM is running on rac2,rac1
ASM is enabled.

real	0m32.048s
user	0m20.051s
sys	0m11.758s

5 Check the listener log

[oracle@rac2 ~]$ tail -n 100 /u01/app/grid/diag/tnslsnr/rac2/listener/alert/log.xml
 host_addr='***'>
 <txt>31-JUL-2021 10:16:55 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=root))(service_name=***)) * (ADDRESS=(PROTOCOL=tcp)(HOST=***)(PORT=49407)) * establish * *** * 0
 </txt>
</msg>
<msg time='2021-07-31T10:17:15.788+08:00' org_id='oracle' comp_id='tnslsnr'
 type='UNKNOWN' level='16' host_id='rac2'
 host_addr='***'>
 <txt>31-JUL-2021 10:17:15 * service_update * **** * 0
 </txt>

 

6 Trace the command execution

[grid@rac1 ~]$ strace crs_stat -t -v

getcwd("/home/grid", 4096)              = 11
chdir("/u01/app/11.2.0/grid/log/rac1/client") = 0
getcwd("/u01/app/11.2.0/grid/log/rac1/client", 4096) = 37
chdir("/home/grid")                     = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
access("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", F_OK) = 0
statfs("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12868767, f_bfree=5387069, f_bavail=4731709, f_files=3276800, f_ffree=2483847, f_fsid={-1532779627, -1637007972}, f_namelen=255, f_frsize=4096}) = 0
open("/u01/app/11.2.0/grid/log/rac1/client/clsc105036.log", O_RDONLY) = 3
close(3)                                = 0
getcwd("/home/grid", 4096)              = 11
chdir("/u01/app/11.2.0/grid/log/rac1/client") = 0
getcwd("/u01/app/11.2.0/grid/log/rac1/client", 4096) = 37
chdir("/home/grid")                     = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105037.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105037.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
access("/u01/app/11.2.0/grid/log/rac1/client/clsc105037.log", F_OK) = 0
statfs("/u01/app/11.2.0/grid/log/rac1/client/clsc105037.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12868767, f_bfree=5387069, f_bavail=4731709, f_files=3276800, f_ffree=2483847, f_fsid={-1532779627, -1637007972}, f_namelen=255, f_frsize=4096}) = 0
open("/u01/app/11.2.0/grid/log/rac1/client/clsc105037.log", O_RDONLY) = 3
close(3)                                = 0
getcwd("/home/grid", 4096)              = 11
chdir("/u01/app/11.2.0/grid/log/rac1/client") = 0
getcwd("/u01/app/11.2.0/grid/log/rac1/client", 4096) = 37
chdir("/home/grid")                     = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105038.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105038.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
access("/u01/app/11.2.0/grid/log/rac1/client/clsc105038.log", F_OK) = 0
statfs("/u01/app/11.2.0/grid/log/rac1/client/clsc105038.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12868767, f_bfree=5387069, f_bavail=4731709, f_files=3276800, f_ffree=2483847, f_fsid={-1532779627, -1637007972}, f_namelen=255, f_frsize=4096}) = 0
open("/u01/app/11.2.0/grid/log/rac1/client/clsc105038.log", O_RDONLY) = 3
close(3)                                = 0
getcwd("/home/grid", 4096)              = 11
chdir("/u01/app/11.2.0/grid/log/rac1/client") = 0
getcwd("/u01/app/11.2.0/grid/log/rac1/client", 4096) = 37
chdir("/home/grid")                     = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105039.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105039.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
access("/u01/app/11.2.0/grid/log/rac1/client/clsc105039.log", F_OK) = 0
statfs("/u01/app/11.2.0/grid/log/rac1/client/clsc105039.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12868767, f_bfree=5387069, f_bavail=4731709, f_files=3276800, f_ffree=2483847, f_fsid={-1532779627, -1637007972}, f_namelen=255, f_frsize=4096}) = 0
open("/u01/app/11.2.0/grid/log/rac1/client/clsc105039.log", O_RDONLY) = 3
close(3)                                = 0
getcwd("/home/grid", 4096)              = 11
chdir("/u01/app/11.2.0/grid/log/rac1/client") = 0
getcwd("/u01/app/11.2.0/grid/log/rac1/client", 4096) = 37
chdir("/home/grid")                     = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105040.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105040.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
access("/u01/app/11.2.0/grid/log/rac1/client/clsc105040.log", F_OK) = 0
statfs("/u01/app/11.2.0/grid/log/rac1/client/clsc105040.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12868767, f_bfree=5387069, f_bavail=4731709, f_files=3276800, f_ffree=2483847, f_fsid={-1532779627, -1637007972}, f_namelen=255, f_frsize=4096}) = 0
open("/u01/app/11.2.0/grid/log/rac1/client/clsc105040.log", O_RDONLY) = 3
close(3)                                = 0
getcwd("/home/grid", 4096)              = 11
chdir("/u01/app/11.2.0/grid/log/rac1/client") = 0
getcwd("/u01/app/11.2.0/grid/log/rac1/client", 4096) = 37
chdir("/home/grid")                     = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105041.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
stat("/u01/app/11.2.0/grid/log/rac1/client/clsc105041.log", {st_mode=S_IFREG|0644, st_size=262, ...}) = 0
access("/u01/app/11.2.0/grid/log/rac1/client/clsc105041.log", F_OK) = 0
statfs("/u01/app/11.2.0/grid/log/rac1/client/clsc105041.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12868767, f_bfree=5387069, f_bavail=4731709, f_files=3276800, f_ffree=2483847, f_fsid={-1532779627, -1637007972}, f_namelen=255, f_frsize=4096}) = 0
open("/u01/app/11.2.0/grid/log/rac1/client/clsc105041.log", O_RDONLY) = 3
close(3)                                = 0
getcwd("/home/grid", 4096)              = 11
^Cchdir("/u01/app
[grid@rac1 ~]$ tail -n 100 /u01/app/11.2.0/grid/log/rac1/client/clsc105036.log
Oracle Database 11g Clusterware Release 11.2.0.4.0 - Production Copyright 1996, 2011 Oracle. All rights reserved.
2021-04-28 08:01:57.908: [ CRSCOMM][3588262272] NAME: `UI_DATA` length=7
2021-04-28 08:01:57.908: [ CRSCOMM][3588262272] Successfully read response

 

The trace output above makes the problem obvious.
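
A syscall summary also quantifies the overhead nicely; strace -c aggregates per-syscall time and call counts (a sketch):

[grid@rac1 ~]$ strace -c -o /tmp/crs_stat_summary.txt crs_stat -t -v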

 

 

Solution:

The command issues a huge number of per-file calls such as access("/u01/app/11.2.0/grid/log/rac1/client/clsc105041.log", F_OK) = 0, cycling through every clsc*.log file in the client log directory.

[grid@rac1 client]$ ll |wc -l
576583
[grid@rac1 client]$ du -sh
2.3G    .
[grid@rac1 client]$ ll /u01/app/11.2.0/grid/log/rac1/client/clsc105037.log
-rw-r--r-- 1 zabbix zabbix 262 Apr 28 08:02 /u01/app/11.2.0/grid/log/rac1/client/clsc105037.log
[grid@rac1 client]$ ll clsc*.log|wc -l
576561
[root@rac1 client]# find -type f -mtime -1|wc -l
2328
[root@rac1 client]# ll clsc575437.log
-rw-r--r-- 1 zabbix zabbix 262 Aug  1 10:16 clsc575437.log
[root@rac1 ~]# df -i
Filesystem             Inodes  IUsed    IFree IUse% Mounted on
/dev/mapper/vgnode110102723-lv_root
                      3276800 793009  2483791   25% /
tmpfs                 1000000   1024   998976    1% /dev/shm
/dev/sda1              128016     43   127973    1% /boot
/dev/mapper/vg_node110102723-lv_home
                     13926400     95 13926305    1% /home
[root@rac1 client]# find -amin -20
./clsc576616.log
./clsc576613.log
./clsc576615.log
./clsc576610.log
./clsc576614.log
./clsc576609.log
./clsc576611.log
./clsc576612.log
[root@rac1 client]# ll -h clsc576612.log
-rw-r--r-- 1 zabbix zabbix 262 Aug  1 22:31 clsc576612.log
[root@rac1 client]# ll clsc5766*.log |wc -l
34

The directory holds a huge number of clsc*.log files, all owned by user and group zabbix. This points to the Zabbix monitoring items as the writer, and the creation rate matches the monitoring interval: one file per minute.

On another, healthy RAC cluster, this directory contains nowhere near as many files.
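
To confirm which process actually creates the files, an audit watch can be placed on the directory (a sketch, assuming auditd is running; the key name clsc_log is illustrative):

[root@rac1 ~]# auditctl -w /u01/app/11.2.0/grid/log/rac1/client -p w -k clsc_log
[root@rac1 ~]# ausearch -k clsc_log -i | tail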

 

[root@rac1 client]# pwd
/u01/app/11.2.0/grid/log/rac1/client
[root@rac1 client]# rm -f clsc5*.log
[root@rac1 client]# ll |wc -l
[grid@rac1 ~]$ time crs_stat -t -v
Name           Type           R/RA   F/FT   Target    State     Host        
----------------------------------------------------------------------
ora.DATA.dg    ora....up.type 0/5    0/     ONLINE    ONLINE    rac1        
ora.FRA.dg     ora....up.type 0/5    0/     ONLINE    ONLINE    rac1        
ora....ER.lsnr ora....er.type 0/5    0/     ONLINE    ONLINE    rac1        
ora....N1.lsnr ora....er.type 0/5    0/0    ONLINE    ONLINE    rac1        
ora.OCR.dg     ora....up.type 0/5    0/     ONLINE    ONLINE    rac1        
ora.asm        ora.asm.type   0/5    0/     ONLINE    ONLINE    rac1        
ora.***.db     ora....se.type 0/2    0/1    OFFLINE   OFFLINE               
ora.***.db     ora....se.type 0/2    0/1    ONLINE    ONLINE    rac1        
ora.cvu        ora.cvu.type   0/5    0/0    ONLINE    ONLINE    rac1        
ora.gsd        ora.gsd.type   0/5    0/     OFFLINE   OFFLINE               
ora....network ora....rk.type 0/5    0/     ONLINE    ONLINE    rac1        
ora.oc4j       ora.oc4j.type  0/1    0/2    ONLINE    ONLINE    rac1        
ora.ons        ora.ons.type   0/3    0/     ONLINE    ONLINE    rac1        
ora....SM1.asm application    0/5    0/0    ONLINE    ONLINE    rac1        
ora....C1.lsnr application    0/5    0/0    ONLINE    ONLINE    rac1        
ora.rac1.gsd   application    0/5    0/0    OFFLINE   OFFLINE               
ora.rac1.ons   application    0/3    0/0    ONLINE    ONLINE    rac1        
ora.rac1.vip   ora....t1.type 0/0    0/0    ONLINE    ONLINE    rac1        
ora....SM2.asm application    0/5    0/0    ONLINE    ONLINE    rac2        
ora....C2.lsnr application    0/5    0/0    ONLINE    ONLINE    rac2        
ora.rac2.gsd   application    0/5    0/0    OFFLINE   OFFLINE               
ora.rac2.ons   application    0/3    0/0    ONLINE    ONLINE    rac2        
ora.rac2.vip   ora....t1.type 0/0    0/0    ONLINE    ONLINE    rac2        
ora.scan1.vip  ora....ip.type 0/0    0/0    ONLINE    ONLINE    rac1        
ora.***.db     ora....se.type 0/2    0/1    ONLINE    ONLINE    rac1        

real    0m0.049s
user    0m0.014s
sys    0m0.008s

 

For now, the problem is resolved.

Fix applied: delete the clsc*.log files.

Running the command again now takes only about real 0m0.049s.

[grid@rac1 ~]$ strace -tt -T -v -o /tmp/strace_crs_20210801.log crs_stat -t -v

After Zabbix monitoring was re-enabled, a new file still appeared every minute, so the root cause has not been eliminated, only the symptom. Until a fundamental fix is found, a scheduled job that finds and deletes this type of file will serve as a workaround, as sketched below.
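
A minimal sketch of such a cleanup job (the schedule and retention window are illustrative):

[root@rac1 ~]# crontab -l
# purge Grid client logs written by the monitoring user, keeping the last day
0 2 * * * find /u01/app/11.2.0/grid/log/rac1/client -name 'clsc*.log' -user zabbix -mtime +1 -delete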

Open questions:

1 Why does the Zabbix Oracle monitoring on this particular RAC cluster generate so many files? The monitored items include the 1521 listener port, ora_pmon, asm.process, session_counts, and so on.

2 A review of the Zabbix configuration found nothing special for the Oracle environment, so what makes this one different?

3 Are any Oracle environment variables or parameters set unusually here?
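
A plausible answer to question 1: every invocation of a CRS client utility (crs_stat, crsctl, srvctl, and the like) appears to write a small clsc*.log client log under $GRID_HOME/log/<node>/client/, owned by the invoking OS user. If a Zabbix item shells out to one of these utilities once a minute, one zabbix-owned file per minute is exactly what accumulates. A hypothetical UserParameter of that shape (names and commands are illustrative, not taken from this environment):

UserParameter=oracle.status_online.process,/u01/app/11.2.0/grid/bin/crsctl status resource -t | grep -c ONLINE
UserParameter=oracle.status_offline.process,/u01/app/11.2.0/grid/bin/crsctl status resource -t | grep -c OFFLINE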

 

posted @ 2021-08-03 11:19 春困秋乏夏打盹