postgresql,MSSQL数据库in对比
问题背景
在优化sqlserver的过程中发现in里面的结果集有500个,速度非常慢,当删除in里面的大部分id时,又可以走索引,那么这个临界值是多少呢?abase又是什么样的呢?
SQLServer 2008
通过修改in里面的数据量来测试是否走执行计划
表数据量:CEDCLASSDTL:549444
--1.in里面包含1000个id时的执行计划:
SET STATISTICS TIME ON
go
SET STATISTICS IO ON
go
SET STATISTICS PROFILE ON
GO
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
......(in总共有1000个id,未完全列举)
|--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1008],0)))
|--Stream Aggregate(DEFINE:([Expr1008]=Count(*)))
|--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1004], [Expr1007]) WITH UNORDERED PREFETCH)
|--Constant Scan(VALUES:(({guid'882EDE67-F19E-4E3A-B75E-003BF34FB2C9'}),({guid'2AD6A41F-2C42-4BA5-BF96-0042AEE35817'}),({guid'...
|--Index Seek(OBJECT:([ntjers].[dbo].[CEDCLASSDTL].[IX_CEDCLASSFK]), SEEK:([ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK]=[Expr1004]) ORDERED FORWARD)
--2.in里面包含65个id时的执行计划:
SET STATISTICS TIME ON
go
SET STATISTICS IO ON
go
SET STATISTICS PROFILE ON
GO
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
......(in总共有65个id,未完全列举)
|--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1008],0)))
|--Stream Aggregate(DEFINE:([Expr1008]=Count(*)))
|--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1004], [Expr1007]) WITH UNORDERED PREFETCH)
|--Constant Scan(VALUES:(({guid'882EDE67-F19E-4E3A-B75E-003BF34FB2C9'}),({guid'2AD6A41F-2C42-4BA5-BF96-0042AEE35817'}),({guid'70CC515F-34D8-4D06-B3EA-00476D3DCE01'}),({guid'...
|--Index Seek(OBJECT:([ntjers].[dbo].[CEDCLASSDTL].[IX_CEDCLASSFK]), SEEK:([ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK]=[Expr1004]) ORDERED FORWARD)
--3.in里面包含64个id时的执行计划:
SET STATISTICS TIME ON
go
SET STATISTICS IO ON
go
SET STATISTICS PROFILE ON
GO
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
......(in总共有64个id,未完全列举)
|--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1004],0)))
|--Stream Aggregate(DEFINE:([Expr1004]=Count(*)))
|--Index Seek(OBJECT:([ntjers].[dbo].[CEDCLASSDTL].[IX_CEDCLASSFK]), SEEK:([ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK]={guid'882EDE67-F19E-4E3A-B75E-003BF34FB2C9'} OR [ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK..
--4.in里面包含10个id时的执行计划:
SET STATISTICS TIME ON
go
SET STATISTICS IO ON
go
SET STATISTICS PROFILE ON
GO
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
......(in总共有10个id,未完全列举)
|--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1004],0)))
|--Stream Aggregate(DEFINE:([Expr1004]=Count(*)))
|--Index Seek(OBJECT:([ntjers].[dbo].[CEDCLASSDTL].[IX_CEDCLASSFK]), SEEK:([ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK]={guid'882EDE67-F19E-4E3A-B75E-003BF34FB2C9'} OR [ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK...
通过1,2的执行计划可以看出,如果在in字句中有超过64个值,那么就会构建一个内部的临时表,然后索引扫描cedclassfk,最后嵌套循环。所以当in里面个数越多嵌套循环也就越费时。当然也可能对临时表进行排序后走Merge join。效率也不会太高。
通过3,4的执行计划可以看出,当in里面的个数小于等于64个的时候会走仅索引扫描。
在多次的测试过程中发现当in里面的个数小于等于64的时候可以走仅索引扫描。当大于64以后就不再走索引。
(对于非常大的列表,比如in里面超过10000个值,仅仅解析是非常昂贵的,而临时表更可取。)
ArteryBase3.5.3
--将数据同步到abase
--in里面包含1000个id时的执行计划
explain analyze
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
'09C34234-7D19-4501-AB1A-0049AD22E76F',
...--未完全列举)
Aggregate (cost=9822.67..9822.68 rows=1 width=0) (actual time=31.761..31.761 rows=1 loops=1)
-> Index Only Scan using ix_cedclassfk on cedclassdtl (cost=0.42..9446.54 rows=150454 width=0) (actual time=0.278..26.161 rows=120662 loops=1)
Index Cond: (cedclassfk = ANY ('{882EDE67-F19E-4E3A-B75E-003BF34FB2C9,2AD6A41F-2C42-4BA5-BF96-0042AEE35817,70CC515F-34D8-4D06-B3EA-00476D3DCE01,09C34234-7D19-4501-AB1A-0049AD22E76F,E30BE907-C42E-4DD8-A9A7-004BBE65962E,96375C84-CFDA-450E-A95A-0061B0398D17,2A8EF0A8-DC91-4D1A-B7D2-0083651305B6,43A07E25-6F0B-4157-A1CA-008DFF8EB402,2CF8262F-3DBF-4201-A3AC-009143E3BBFC,....(1000个id)}'::bpchar[]))
Heap Fetches: 0
Planning time: 55.979 ms
Execution time: 35.019 ms
abase1000个id可以走索引。
abase-in里面最大可以支持多少个id呢?
--创建表
db_ntjers=# create table tab(id int);
CREATE TABLE
--插入10w数据
db_ntjers=# insert into tab select generate_series(1,100000);
INSERT 0 100000
db_ntjers=# create index i_tab_id on tab(id);
CREATE INDEX
--构造子串
select string_agg(id::varchar,',') from (select id from tab limit 1000)ta
--1000个id的执行计划
db_ntjers=# explain analyze select * from tab where id in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18....1000)
Index Only Scan using i_tab_id on tab (cost=0.29..1422.00 rows=1000 width=4) (actual time=0.251..1.097 rows=1000 loops=1)
Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34...1000}'::integer[]))
Heap Fetches: 1000
Planning time: 0.822 ms
Execution time: 1.184 ms
--1w个id的执行计划
db_ntjers=# explain analyze select * from tab where id in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18....10000)
Index Only Scan using i_tab_id on tab (cost=0.29..4252.00 rows=10000 width=4) (actual time=0.018..8.568 rows=10000 loops=1)
Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36...9999,10000}'::integer[]))
Heap Fetches: 10000
Planning time: 5.449 ms
Execution time: 9.110 ms
--10w个id的执行计划
db_ntjers=# explain analyze select * from tab where id in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18....100000)
Index Only Scan using i_tab_id on tab (cost=0.29..32550.00 rows=100000 width=4) (actual time=0.024..110.276 rows=100000 loops=1)
Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36...99999,100000}'::integer[]))
Heap Fetches: 100000
Planning time: 67.080 ms
Execution time: 116.627 ms
--100w个id的执行计划
db_ntjers=# explain analyze select * from tab where id in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18....100000)
Index Only Scan using i_tab_id on tab (cost=0.42..452602.21 rows=632121 width=4) (actual time=0.300..1443.868 rows=1000000 loops=1)
Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48(...)}')
Heap Fetches: 1000000
Planning time: 713.382 ms
Execution time: 1507.180 ms
--数据类型为char也是一样,10wid
db_ntjers=# select string_agg('\''||c_bh||'\'',',') from (select c_bh from tabl_uuid limit 100000)ta
db_ntjers=#explain analyze select * from tabl_uuid where c_bh in('da370980-559e-4d7c-af1a-59ac381e5bdd','2415c46d-8ccc-4b21-9901-8248023edab6',...)
Index Only Scan using i_tabl_uuid_c_bh on tabl_uuid (cost=0.42..49731.99 rows=100000 width=37) (actual time=0.239..723.233 rows=100000 loops=1)
Index Cond: (c_bh = ANY ('{da370980-559e-4d7c-af1a-59ac381e5bdd,2415c46d-8ccc-4b21-9901-8248023edab6,ad55941d-e9fa-4e0b-be39-dd0855514ebd,..'::bpchar[]))
Heap Fetches: 100000
Planning time: 86.628 ms
Execution time: 1286.896 ms
可以看到abase的in里面不管包含多少值都会走索引。此处也可以看出int和char的效率区别,同样是10w数据int的效率是char的10倍。
但是在这个测试中id都是有序的,如果无序呢?
无序的id子串
--构造无序的id子串
select string_agg((random()*100000)::int::text, ',') into arr from generate_series(1, 100000);
--10w个无需的id执行计划
db_ntjers=#explain analyze select * from tab where id in(16008,48047,80229,42403,86136,85790,15910,47880,34498,58849,88691,69997,49239,46005....)
Index Only Scan using i_tab_id on tab (cost=0.29..32550.00 rows=100000 width=4) (actual time=0.179..63.821 rows=63232 loops=1)
Index Cond: (id = ANY ('{16008,48047,80229,42403,86136,85790,15910,47880,34498,58849,88691,69997,49239,46005,....85832}'::integer[]))
Heap Fetches: 63232
Planning time: 76.747 ms
Execution time: 99.788 ms
无序的id也可以走索引。可能有人会注意到无序的要快一点,实际上由于用了random()函数,所以id中可能有重复的id,所以最终返回的结果只有6W+条。
abase为何会有这样的特性呢?
在使用in的时候实际上是转成了=any(array),所以当直接构造一个=any的时候效率会更高。当然在in里里面的个数低于1w的时候in和=any()的区别并不大。到10w和100w的时候区别就较为明显,数据量越大区别越明显。
结语
1.sqlserver的in里面的个数小于等于64的时候能走仅索引扫描,当大于64以后越大效率越低。
2.abase的in实际上转换成了=any(array),和in里面的个数关系不大。