Oracle数据库导致系统CPU使用率100%的案例
Oracle数据库导致系统CPU使用率100%的案例
前言
这个案例只是模拟案例。虽然作为一名系统工程师,稍有些数据库基础,但是碰到和系统相关的问题时,我们还是需要从系统层去做分析处理,希望这个案例能给大家带来些许处理思路。
故障现象
Oracle 19c RAC 2节点CPU使用率100%。(使用top、sar等命令)
故障分析过程
1、查询负载很高前35个进程
# ps aux|head -1;ps aux|grep -v PID|sort -rn -k +3|head -35
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
oracle 21974 13.7 0.0 40453412 33848 ? Rs 10:56 1:27 oracleorcl1 (LOCAL=NO)
root 3204 13.6 0.0 114296 6480 ? Sl 11:06 0:00 /opt/smartstorageadmin/ssacli/bin/ssacli ctrl all show config
oracle 22790 13.6 0.0 40715556 33888 ? Rs 10:56 1:25 oracleorcl1 (LOCAL=NO)
oracle 22317 13.6 0.0 40715556 32004 ? Rs 10:56 1:26 oracleorcl1 (LOCAL=NO)
oracle 22260 13.6 0.0 40453408 33912 ? Rs 10:56 1:26 oracleorcl1 (LOCAL=NO)
oracle 22215 13.6 0.0 40453408 33892 ? Rs 10:56 1:26 oracleorcl1 (LOCAL=NO)
......
oracle 22219 13.4 0.0 40453412 33904 ? Rs 10:56 1:25 oracleorcl1 (LOCAL=NO)
oracle 22156 13.4 0.0 40453872 34464 ? Rs 10:56 1:26 oracleorcl1 (LOCAL=NO)
oracle 22145 13.4 0.0 40453412 33848 ? Rs 10:56 1:25 oracleorcl1 (LOCAL=NO)
oracle 22124 13.4 0.0 40453412 31724 ? Rs 10:56 1:26 oracleorcl1 (LOCAL=NO)
oracle 22109 13.4 0.0 40453408 33836 ? Rs 10:56 1:26 oracleorcl1 (LOCAL=NO)
2、查询前35负载高的sPID
# ps aux| grep -v PID | sort -rn -k +3 | head -35 | awk '{print$2}'
23033
22948
23015
22941
23091
23061
......
23501
23440
23429
23404
23266
23161
3、把得到的spid替换到下面sql语句中
select sql_id from v$session
where paddr in(
select addr from v$process
where spid in(
'23033',
'22948',
'23015',
'22941',
......
'23429',
'23404',
'23266',
'23161'
)
);
SQL_ID
--------------------------
f5kskn9df2h2p
f5kskn9df2h2p
f5kskn9df2h2p
f5kskn9df2h2p
f5kskn9df2h2p
f5kskn9df2h2p
......
f5kskn9df2h2p
f5kskn9df2h2p
f5kskn9df2h2p
f5kskn9df2h2p
f5kskn9df2h2p
f5kskn9df2h2p
35 rows selected.
4、查询“f5kskn9df2h2p”相关的sql_id和数量
> SELECT sid,serial#,sql_id from v$session where sql_id in ( 'f5kskn9df2h2p');
SID SERIAL# SQL_ID
---------- ---------- --------------------------
4609 49390 f5kskn9df2h2p
4722 1377 f5kskn9df2h2p
4723 7168 f5kskn9df2h2p
4724 51376 f5kskn9df2h2p
4725 62475 f5kskn9df2h2p
4726 64323 f5kskn9df2h2p
4727 19318 f5kskn9df2h2p
4728 23453 f5kskn9df2h2p
4729 34066 f5kskn9df2h2p
300 rows selected.
> SELECT count(*) from v$session where sql_id in ( 'f5kskn9df2h2p' );
COUNT(*)
----------
300
5、随机抽一条,查看一下具体是执行的什么内容
> select sql_text from v$sqlarea a,v$session b where a.SQL_ID=b.PREV_SQL_ID and b.SID='4729';
SQL_TEXT
6、思路1:系统层查看相关进程启动时间,发现在同一时间有sh脚本在执行,脚本内容如下:(感觉不准确,可能会误判)
/home/oracle/test$ cat /home/oracle/test/connOracleDB.sh
*******
杀掉所有connOracleDB.sh进程:
# pkill -9 connOracleDB.sh
7、思路2:通过SQL语句查询sql_id对应的系统进程,再查其对应的父进程
> col MACHINE for a30
> select process,MACHINE,PORT,PROGRAM,PADDR from v$session where sql_id = 'f5kskn9df2h2p';
PROCESS MACHINE PORT PROGRAM PADDR
------------------------------------------------ ------------------------------ ---------- ----------------------------------- ----------------
3480 vone1 26564 sqlplus@vone1 (TNS V1-V3) 000000013870A318
3155 vone1 26280 sqlplus@vone1 (TNS V1-V3) 000000016076BF18
3118 vone1 25964 sqlplus@vone1 (TNS V1-V3) 0000000150719888
3393 vone1 26572 sqlplus@vone1 (TNS V1-V3) 00000001686EB510
3244 vone1 26268 sqlplus@vone1 (TNS V1-V3) 00000001587024A8
3170 vone1 25970 sqlplus@vone1 (TNS V1-V3) 00000001487592A0
3457 vone1 26580 sqlplus@vone1 (TNS V1-V3) 0000000160773F58
3338 vone1 26286 sqlplus@vone1 (TNS V1-V3) 00000001507218C8
3218 vone1 25976 sqlplus@vone1 (TNS V1-V3) 000000014081F788
3410 vone1 26592 sqlplus@vone1 (TNS V1-V3) 000000015870A4E8
3139 vone1 26292 sqlplus@vone1 (TNS V1-V3) 00000001487612E0
3148 vone1
/home/oracle/test$ ps -ef| grep 3480
oracle 3480 3399 0 14:10 pts/3 00:00:00 /opt/oracle/products/19.3.0/bin/sqlplus
oracle 23509 21062 0 14:38 pts/4 00:00:00 grep --color=auto 3480
/home/oracle/test$ ps -ef|grep 3399
oracle 3399 1 0 14:10 pts/3 00:00:00 /bin/bash /home/oracle/test/connOracleDB.sh
oracle 3480 3399 0 14:10 pts/3 00:00:00 /opt/oracle/products/19.3.0/bin/sqlplus
oracle 23686 21062 0 14:38 pts/4 00:00:00 grep --color=auto 3399
再查看connOracleDB.sh脚本,分析脚本内容是否会引起异常,杀掉该脚本。
/home/oracle/test$ cat /home/oracle/test/connOracleDB.sh
******
# pkill -9 connOracleDB.sh
8、在sql下执行生成kill的语句
SELECT 'alter system kill session''' || SID || ',' || serial# || ''';' FROM v$session WHERE sql_id IN ('f5kskn9df2h2p');
'ALTERSYSTEMKILLSESSION'''||SID||','||SERIAL#||''';'
------------------------------------------------------
alter system kill session'3,19584' immediate;
alter system kill session'4,26016';
alter system kill session'5,543';
alter system kill session'6,22344';
......
alter system kill session'4727,35784';
alter system kill session'4728,57499';
alter system kill session'4729,45135';
300 rows selected.
后语
问题解决,激起了些许深入学习数据库的兴趣,但是想到那些又长又复杂的sql语句又开始头疼了。。。