使用auto_explain插件分析存储过程运行卡住案例
一. 文章概述
Kingbase中,auto_explain 是对explain强有力的补充,explain配合analyze、buffers等命令查看sql语句执行计划在一般场景下已经足够,
但是procedure、function里的SQL则无法分析到,配置auto_explain将自动在日志中输出procedure、function里每条SQL语句的执行计划结果,从而避免了手动去执行explain。
本文总结了一下auto_explain配置方法和演示使用案例,以及一次故障案例分析场景。(如果只看故障案例可以直接到第五节的“一次故障案例分析”)。
二. auto_explain的配置
auto_explain是session级的
需要关注几个相对重要的参数:
1),auto_explain.sample_rate (real)
auto_explain.sample_rate会让 auto_explain 只解释每个会话中的一部分语句。默认值为 1,表示解释所有的查询。
2),auto_explain.log_min_duration (integer)
auto_explain.log_min_duration是最小语句执行时间(以毫秒计),超过这个时间的语句的计划会被记录输出到日志中。
这个参数设置为0时将记录所有计划,设置为-1(默认值)时禁用记录计划。例如,如果你将它设置为250ms,则所有运行时间等于或超过 250ms 的语句将被记录。显然可以用来记录那些耗时的慢SQL语句。
3),auto_explain.log_analyze (boolean)
auto_explain.log_analyze输出EXPLAIN ANALYZE执行计划,而不是EXPLAIN输出被打印。默认情况下这个参数是关闭的。
4),auto_explain.log_nested_statements (boolean)
auto_explain.log_nested_statements可以开启嵌套语句(调用其他函数内执行的SQL语句)被记录输出到日志。
当它被关闭时,只有最外层查询计划被记录。这个参数默认情况下是关闭的。
完整的参数说明可以参考官网:https://help.kingbase.com.cn/v8/development/sql-plsql/ref-extended-plug-in/auto_explain.html?highlight=auto_explain
三. auto_explain的加载
1,动态加载auto_explain
在ksql中直接键入以下命令
--加载auto_explain
set session_preload_libraries to auto_explain
load 'auto_explain'
--设置auto_explain
set auto_explain.sample_rate = 1
set auto_explain.log_min_duration = 0
set auto_explain.log_analyze = true
set auto_explain.log_nested_statements to on
--查看当前session的auto_explain配置(退出后需要重新执行上述配置和加载):
select name,setting from sys_settings where name like 'auto_explain%';
2,默认加载auto_explain
如需 KingbaseES 数据库启动时默认加载该插件,将其添加到 kingbase.conf 文件的 shared_preload_libraries 中。
如需在数据库启动后使用该插件,在ksql中使用LOAD命令即可。
示例:
shared_preload_libraries = 'auto_explain'
3,关闭auto_explain
将禁用auto_explain。
set autoexplain.log_min_duration to -1
当然如果是动态加载auto_explain,退出ksql则auto_explain失效。
四. auto_explain使用示例
动态加载配置好auto_explain后,执行下列代码
create table t1 (id bigint);
create table t2 (id bigint);
insert into t2 values(generate_series(1,100000));
create index idx_t2_id on t2(id);
create or replace function func_test1()
returns record
LANGUAGE 'plpgsql'
as $body$
declare
v_num int :=1;
sfc record;
begin
delete from t1;
for sfc in (select id from t2 where id >99997) loop
insert into t1(id) select * from t2 where sfc.id=t2.id;
end loop;
return sfc;
end;
$body$;
--测试
select func_test1();
查看kingbase.log里的内容:
2023-10-24 05:34:38.582 EDT [73611] LOG: duration: 0.858 ms plan:
Query Text: select name,setting from sys_settings where name like 'auto_explain%';
Function Scan on pg_show_all_settings a (cost=0.00..12.50 rows=1 width=64) (actual time=0.832..0.852 rows=12 loops=1)
Filter: (name ~~ 'auto_explain%'::text)
Rows Removed by Filter: 490
-- delete from t1;执行计划,执行了1次
2023-10-24 05:34:57.773 EDT [73611] LOG: duration: 0.003 ms plan:
Query Text: delete from t1
Delete on t1 (cost=0.00..32.60 rows=2260 width=6) (actual time=0.002..0.002 rows=0 loops=1)
-> Seq Scan on t1 (cost=0.00..32.60 rows=2260 width=6) (actual time=0.001..0.001 rows=0 loops=1)
2023-10-24 05:34:57.773 EDT [73611] CONTEXT: SQL statement "delete from t1"
PL/pgSQL function func_test1() line 6 at SQL statement
-- insert into t2 values(generate_series(1,10000))执行计划,执行了三次
2023-10-24 05:34:57.773 EDT [73611] LOG: duration: 0.053 ms plan:
Query Text: insert into t1(id) select * from t2 where sfc.id=t2.id
Insert on t1 (cost=0.29..8.31 rows=1 width=8) (actual time=0.053..0.053 rows=0 loops=1)
-> Index Only Scan using idx_t2_id on t2 (cost=0.29..8.31 rows=1 width=8) (actual time=0.011..0.011 rows=1 loops=1)
Index Cond: (id = '99998'::bigint)
Heap Fetches: 1
2023-10-24 05:34:57.773 EDT [73611] CONTEXT: SQL statement "insert into t1(id) select * from t2 where sfc.id=t2.id"
PL/pgSQL function func_test1() line 8 at SQL statement
2023-10-24 05:34:57.774 EDT [73611] LOG: duration: 0.010 ms plan:
Query Text: insert into t1(id) select * from t2 where sfc.id=t2.id
Insert on t1 (cost=0.29..8.31 rows=1 width=8) (actual time=0.009..0.010 rows=0 loops=1)
-> Index Only Scan using idx_t2_id on t2 (cost=0.29..8.31 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=1)
Index Cond: (id = '99999'::bigint)
Heap Fetches: 1
2023-10-24 05:34:57.774 EDT [73611] CONTEXT: SQL statement "insert into t1(id) select * from t2 where sfc.id=t2.id"
PL/pgSQL function func_test1() line 8 at SQL statement
2023-10-24 05:34:57.774 EDT [73611] LOG: duration: 0.005 ms plan:
Query Text: insert into t1(id) select * from t2 where sfc.id=t2.id
Insert on t1 (cost=0.29..8.31 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=1)
-> Index Only Scan using idx_t2_id on t2 (cost=0.29..8.31 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: (id = '100000'::bigint)
Heap Fetches: 1
2023-10-24 05:34:57.774 EDT [73611] CONTEXT: SQL statement "insert into t1(id) select * from t2 where sfc.id=t2.id"
PL/pgSQL function func_test1() line 8 at SQL statement
2023-10-24 05:34:57.774 EDT [73611] LOG: duration: 0.006 ms plan:
Query Text: (select id from t2 where id >99997)
Index Only Scan using idx_t2_id on t2 (cost=0.29..8.35 rows=3 width=8) (actual time=0.003..0.004 rows=3 loops=1)
Index Cond: (id > 99997)
Heap Fetches: 3
2023-10-24 05:34:57.774 EDT [73611] CONTEXT: PL/pgSQL function func_test1() line 7 at FOR over SELECT rows
2023-10-24 05:34:57.774 EDT [73611] LOG: duration: 1.123 ms plan:
Query Text: select func_test1();
Result (cost=0.00..0.26 rows=1 width=32) (actual time=1.112..1.112 rows=1 loops=1)
可以看到每条SQL的执行计划都被做了记录。
五. 一次故障案例分析
auto_explain可以帮助我们查看SQL语句的执行计划,针对执行计划里的信息做出响应的判断和处理。
问题的现象:
出于不泄露某项目代码的缘故,这里仅仅文字描述一下以前遇到过的故障场景;
执行了存储过程A后,一直出在运行中,且服务器中top命令查看kingbase的进程一直100%利用率高消耗。
问题分析过程:
在存储过程A中执行了一个while循环,循环中核心执行的就是先执行了DELETE语句然后是INSERT INTO....SELECT语句(见下截取的日志),
动态加载auto_explain后,在日志中可以明显的看到执行了INSERT INTO....SELECT一次后,第二次循环这条语句再无执行计划内容输出,就可以确定语句是卡在了这条语句处。
由于是CPU高耗问题,直觉和经验就是在某个地方死循环了。从INSERT INTO....SELECT的执行计划来看,优化器连接两表采用的是Nested Loop Left Join问题可能在这里,
接着正式证明猜想。
-> Nested Loop Left Join (cost=156663.07..168405.06 rows=573 width=694) (actual time=1793.426..1793.431 rows=0 loops=1)
问题验证与解决:
KingbaseES允许对单个SQL关闭特定的优化器特性。比如有时优化器为特定查询选择的执行计划并不是最优的,
可以通过设置这些参数强制优化器选择一个更好的执行计划来临时解决这个问题,上面提及的Nested Loop Left Join连接方式,可以通过enable_nestloop命令开启或关闭。
具体参数意义可参考于官网:调整性能参数
https://help.kingbase.com.cn/v8/perfor/sql-optimization/sql-optimization-13.html?highlight=enable_nestloop
我们可以关闭Nested Loop Left Join这种两表连接方式进行验证(优化器会自动采用其他连接方式)。
具体操作就是在INSERT INTO....SELECT语句之前之后分别添加set enable_nestloop to off;set enable_nestloop to on;最终得以验证
最终实测就是这个优化器连接方式的问题,无论如何不应该产生该问题,当然到此问题做了规避处理,最终的修复和分析转到了研发。
2023-09-05 16:26:47 CST [1963648]: [5161-1] user=xxkjb,db=xxkjb,app=kingbase_*&+_,client=[local]LOG: duration: 1793.437 ms plan:
Query Text: INSERT INTO deposit.dhjczx_loan_add_detail
( Loan_BussID ,
Loan_Name ,
Loan_CreditCustomerID ,
Loan_CBODCustomerID ,
Loan_DateBegin ,
Loan_DateEndAppoint ,
Loan_DateEnd ,
Loan_MoneyAll ,
LD_MoneyRemain ,
Loan_LoanAccountType ,
Loan_ContractType ,
Loan_Mainassure ,
Loan_FiveType ,
is_city ,
ui_id ,
big_ui_id ,
ui_name ,
big_ui_name ,
LD_Managerid ,
LD_ManagerName ,
date ,
add_type
)
SELECT t1.Loan_BussID ,
t1.Loan_Name ,
t1.Loan_CreditCustomerID ,
t1.Loan_CBODCustomerID ,
t1.Loan_DateBegin ,
t1.Loan_DateEndAppoint ,
t1.Loan_DateEnd ,
t1.Loan_MoneyAll ,
t2.LD_MoneyRemain ,
t1.Loan_LoanAccountType ,
t1.Loan_ContractType ,
t1.Loan_Mainassure ,
t1.Loan_FiveType ,
'否' ,
t1.Loan_UnitID ,
'' ,
'' ,
'' ,
LD_Managerid ,
'' ,
__targetdate ,
CASE WHEN t3.cust_no_yesterday IS NULL
THEN '新增'
ELSE '部分新增'
END
FROM HistoryData.HD_Loan_Base t1
LEFT JOIN HistoryData.HD_Loan_Detail t2 ON t1.Loan_BussID = t2.LD_BussID
LEFT JOIN ( SELECT t1.* ,
t2.Loan_CreditCustomerID AS cust_no_yesterday
FROM ( SELECT
*
FROM
deposit.dhjczx_loan_add_mid
WHERE
LD_Date = __targetdate
) t1
LEFT JOIN ( SELECT
*
FROM
deposit.dhjczx_loan_add_mid
WHERE
LD_Date = __tempdate
) t2 ON t1.Loan_UnitID = t2.Loan_UnitID
AND t1.Loan_CreditCustomerID = t2.Loan_CreditCustomerID
WHERE ISNULL(t1.LD_MoneyRemain,
0)
- ISNULL(t2.LD_MoneyRemain,
0) > 0
) t3 ON t1.Loan_UnitID = t3.Loan_UnitID
AND t1.Loan_CreditCustomerID = t3.Loan_CreditCustomerID
WHERE t2.LD_Date = __targetdate
AND t3.Loan_CreditCustomerID IS NOT NULL
Insert on dhjczx_loan_add_detail (cost=156663.07..168405.06 rows=573 width=694) (actual time=1793.429..1793.434 rows=0 loops=1)
-> Nested Loop Left Join (cost=156663.07..168405.06 rows=573 width=694) (actual time=1793.426..1793.431 rows=0 loops=1)
Filter: ((ISNULL(dhjczx_loan_add_mid.LD_MoneyRemain, '0'::numeric) - ISNULL(dhjczx_loan_add_mid_1.LD_MoneyRemain, '0'::numeric)) > '0'::numeric)
-> Hash Join (cost=156662.52..162702.15 rows=1718 width=212) (actual time=1793.425..1793.429 rows=0 loops=1)
Hash Cond: ((t2.LD_BussID)::text = (t1.Loan_BussID)::text)
-> Index Scan using idx_HD_Loan_Detail_ld_date on HD_Loan_Detail t2 (cost=0.56..5632.77 rows=104065 width=32) (actual time=0.049..0.050 rows=1 loops=1)
Index Cond: ((LD_Date)::text = ($12)::text)
-> Hash (cost=156467.43..156467.43 rows=15562 width=195) (actual time=1793.356..1793.359 rows=0 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 128kB
-> Hash Join (cost=77815.08..156467.43 rows=15562 width=195) (actual time=1793.355..1793.358 rows=0 loops=1)
Hash Cond: (((t1.Loan_UnitID)::text = (dhjczx_loan_add_mid.Loan_UnitID)::text) AND ((t1.Loan_CreditCustomerID)::text = (dhjczx_loan_add_mid.Loan_CreditCustomerID)::text))
-> Seq Scan on HD_Loan_Base t1 (cost=0.00..73702.77 rows=942777 width=165) (actual time=0.009..0.009 rows=1 loops=1)
-> Hash (cost=76580.62..76580.62 rows=82297 width=30) (actual time=1793.221..1793.222 rows=0 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 1024kB
-> Seq Scan on dhjczx_loan_add_mid (cost=0.00..76580.62 rows=82297 width=30) (actual time=1793.220..1793.220 rows=0 loops=1)
Filter: ((Loan_CreditCustomerID IS NOT NULL) AND ((LD_Date)::text = ($12)::text))
Rows Removed by Filter: 3190217
-> Index Scan using PK_aaa_loan_add_mid_20211009_D3C9DA62 on dhjczx_loan_add_mid dhjczx_loan_add_mid_1 (cost=0.56..3.30 rows=1 width=30) (never executed)
Index Cond: (((Loan_UnitID)::text = (dhjczx_loan_add_mid.Loan_UnitID)::text) AND ((Loan_CreditCustomerID)::text = (dhjczx_loan_add_mid.Loan_CreditCustomerID)::text) AND ((LD_Date)::text = ($13)::text))
2023-09-05 16:26:47 CST [1963648]: [5162-1] user=xxkjb,db=xxkjb,app=kingbase_*&+_,client=[local]CONTEXT: SQL statement "INSERT INTO deposit.dhjczx_loan_add_detail
( Loan_BussID ,
Loan_Name ,
Loan_CreditCustomerID ,
Loan_CBODCustomerID ,
Loan_DateBegin ,
Loan_DateEndAppoint ,
Loan_DateEnd ,
Loan_MoneyAll ,
LD_MoneyRemain ,
Loan_LoanAccountType ,
Loan_ContractType ,
Loan_Mainassure ,
Loan_FiveType ,
is_city ,
ui_id ,
big_ui_id ,
ui_name ,
big_ui_name ,
LD_Managerid ,
LD_ManagerName ,
date ,
add_type
)
SELECT t1.Loan_BussID ,
t1.Loan_Name ,
t1.Loan_CreditCustomerID ,
t1.Loan_CBODCustomerID ,
t1.Loan_DateBegin ,
t1.Loan_DateEndAppoint ,
t1.Loan_DateEnd ,
t1.Loan_MoneyAll ,
t2.LD_MoneyRemain ,
t1.Loan_LoanAccountType ,
t1.Loan_ContractType ,
t1.Loan_Mainassure ,
t1.Loan_FiveType ,
'否' ,
t1.Loan_UnitID ,
'' ,
'' ,
'' ,
LD_Managerid ,
'' ,
__targetdate ,
CASE WHEN t3.cust_no_yesterday IS NULL
THEN '新增'
ELSE '部分新增'
END
FROM HistoryData.HD_Loan_Base t1
LEFT JOIN HistoryData.HD_Loan_Detail t2 ON t1.Loan_BussID = t2.LD_BussID
LEFT JOIN ( SELECT t1.* ,
t2.Loan_CreditCustomerID AS cust_no_yesterday
FROM ( SELECT
*
FROM
deposit.dhjczx_loan_add_mid
WHERE
LD_Date = __targetdate
) t1
LEFT JOIN ( SELECT
*
FROM
deposit.dhjczx_loan_add_mid
WHERE
LD_Date = __tempdate
) t2 ON t1.Loan_UnitID = t2.Loan_UnitID
AND t1.Loan_CreditCustomerID = t2.Loan_CreditCustomerID
WHERE ISNULL(t1.LD_MoneyRemain,
0)
- ISNULL(t2.LD_MoneyRemain,
0) > 0
) t3 ON t1.Loan_UnitID = t3.Loan_UnitID
AND t1.Loan_CreditCustomerID = t3.Loan_CreditCustomerID
WHERE t2.LD_Date = __targetdate
AND t3.Loan_CreditCustomerID IS NOT NULL"
PL/SQL function sp_dhjczx_loan_detail() line 75 at SQL statement