一句T-SQL语句引发的思考

有一网友问:关于ms sqlserver索引优化问题:
有表stress_test(id int, key char(2))
        id 上有普通索引;
        key 上有簇索引;
        id 有有限量的重复;
        key 有无限量的重复;

目前我需要按逻辑和查询表中key=az and key=bw and key=cv 的id

求教高手最有效的查询语句

测试环境：
     hardware:p4 2.6+512m+80g
     software:windows server 2003(enterprise edition)+sqlserver 2000 +sp3a

首先我们建立一个测试的数据，为使数据尽量的分布和随即，我们通过rand()来随机产生2个随机数再组合成一个字符串，首先插入的数据是1,000,000条记录，然后在循环插入到58,000,000条记录。
   因为是随机产生的数据，所以如果你自己测试的数据集和我测试的会不相同，但对索引的优化和运行的效率是相同的。
   下面的“--//测试脚本”是产生测试数据的脚本，你能根据需要修改 @maxgroup， @maxloop的值，比如测试1百万的记录能:

     select @maxgroup=1000
     select @maxloop=1000

如果要测试5千万：

     select @maxgroup=5000
     select @maxloop=10000

所以如果你的server或pc比较慢，请耐心等待.....，
（在我的pc上运行的速度是插入1百万条的时间是1.14m，插入5千八百万条的时间是19.41m，重新建立index的时间是34.36m)

作为一般的研发人员非常容易就想到的语句：

   --语句1

    select a.[id] from
    (select distinct [id] from stress_test where [key] = az) a,
    (select distinct [id] from stress_test where [key] = bw) b ,
    (select distinct [id] from stress_test where [key] = cv) c
    where a.id = b.id and a.id = c.id

   --语句2

     select [id]
     from stress_test
     where [key]=az or [key]=bw or [key]=cv
     group by id having(count(distinct [key])=3)

   --语句5

    select distinct a.[id] from stress_test as a,stress_test as b,stress_test as c
    where a.[key]=az and b.[key]=bw and c.[key]=cv
      and a.[id]=b.[id] and a.[id]=c.[id]

但作为t-sql的所谓“高手”可能会认为这种写法非常“土”，也显得没有水平，所以会选择一些子查询和外连接的写法，按常理子查询的效率是比较高的：

   --语句3

    select distinct [id] from stress_test a where
    not exists (
    select 1 from
     (select az as k union all select bw union all select cv) b
    left join stress_test c on c.id=a.id and b.[k]=c.[key]
    where c.id is null)

   --语句4

     select distinct a.id from stress_test a
     where not exists
     ( select * from keytb c
      where not exists
      ( select * from stress_test b
       where
             b.id = a.id
             and
             c.kf1 = b.[key]
      )
     )

我们先分析这几条语句（针对5千8百万条数据进行分析）：

请大家要特别留心estimated row count的值。

语句1:从执行规划中我们能看出，mssqlserver选择的索引优化非常有规律，先通过clustered index筛选出符合[key]=az条件的id,然后进行hash match,在找出id相等的；依次类推最终检索到符合所有条件的记录。中间的estimated row count的值都不大。

语句2:从执行规划中我们能看出，是先通过clustered index筛选出符合 [key]=az or [key]=bw or [key]=cv 符合所有条件的id，然后分组进行2次hash match 所有的id。我们能看出estimated row count的值是越来越少，从最初的369,262到最后排序的只有402。

语句3:从执行规划中我们能看是非常复杂的，是先通过3组通过constant scan和non-clustered index检索出符合 a.id=c.id and [key]=** 的记录3组，然后分组进行外键匹配，再将3组的数据合并，排序，然后再和一个non-clustered index检索出的记录集进行外键匹配，我们能看出mssqlserver会对所有的记录（5千万条）记录进行分组，estimated row count的值是：58,720,000,所以这句t-sql的瓶颈是对5千万条记录进行分组。

语句4:从执行规划中我们能看和语句3有相似之处，都要对所有的记录（5千万条）记录进行分组，所以这是检索的瓶颈，而且使用的索引都是non-clustered index。

语句5：从执行规划中我们能看出，先通过clustered index检索出符合[key]=az的记录集，然后进行hash match和sorts,因为数量少所以是非常会的，在和通过non-clustered index检索[key]=bw的记录进行inner join,在和通过clustered index检索[key]=cv的记录进行合并,最后是对4百万条数据进行分组检索，如果是6列，我们能看出estimated row count的值是递增，越来越大，最后的分组检索的estimated row count的值是3.46e+15,这已形成巨大的瓶颈。

我们能先测试一下小的数据量（50000条）；

大家能下面测试脚本的：

   select @maxgroup=500
   select @maxloop=100

----------------------------------------------------------------------
|------------------语句 1----语句 2----语句 3----语句 4----语句 5----|
| 5万（3列）        5ms       19ms     37ms     59ms      0ms
| 5万（6列）        1ms       26ms     36ms     36ms     1ms

从测试的的数据来看，语句5的效率是最高的，几乎没有花费时间，而语句2的效率只能说是一般。如果测试到这里就结束了，我们能毫不犹豫的选择语句 5 :-(,继续进行下面的测试.....

我们测试百万条以上的记录：
1.先对1百万条记录进行测试（选取3列）
2.先对1百万条记录进行测试（选取6列）
3.对5千万条数据测试（选取3列）
4.对5千万条数据测试（选取6列）

统计表1：
----------------------------------------------------------------------
|------------------语句 1----语句 2----语句 3----语句 4----语句 5----|
| 1百万（3列）    0.77%     0.41%    49.30%     48.99%     0.52%
| 1百万（6列）     1.61%     0.81%    48.99%     47.44%     1.14%
| 5千万（3列）     0.14%     0.18%    48.88%     48.86%     1.93%
| 5千万（6列）     0.00%     0.00%     0.00%      0.00%   100.00%
统计表2：
----------------------------------------------------------------------
|------------------语句 1----语句 2----语句 3----语句 4----语句 5----|
| 1百万（3列）     9ms       22ms     723ms     753ms      4ms
| 1百万（6列）      15ms      38ms     764ms     773ms     11ms
| 5千万（3列）     575ms     262ms 110117ms 110601ms 12533ms
| 5千万（6列）    1070ms     576ms 107988ms 109704ms     10m以上

测试总结：（我们能比较关注：语句 2和语句 5）
1.在1百万条记录的情况下，语句 5是最快的，但在5千万条记录下是最慢的。这说明index的优化一定的情况下，数据量不同，检索的效率也是不同的。我们平时在写t-sql时一般关注的时index的使用，只要我们写的t-sql是利用clustered index，我们就认为是最优化了，其实这是个误区，我们还要关注estimated row count的值，大量的i/o操作是我们应该关注的，所以我们应该根据数据量的不同选择相应的t-sql语句，不要认为在小数据量下是最高的在大数据量的状态下也许是最慢的:-(。

2.在执行规划中最快的，并不是运行最快的，我们能看在1百万（6列）在这行中，语句 2和语句 5的比例是0.81％：1.14％，但实际的运行效率是，38ms:11ms。所以，我们在选择t-sql是要考虑本地i/o的速度，所以在优化语句时不仅要看执行规划还要计算一下具体的效率。

在测试的语句上加入：

    set statistics time on/off
    set statistics io on/off
是个非常好的调试方法。

3.综合评价，语句 2的效率是最高的，执行效率没有随数据量变化而有非常大的差别。

4.执行规划越简单的语句（语句1），综合效率越高，反之则越低（语句3，语句4）。

5.在平时写t-sql语句时，一定要根据不同的数据量进行测试，虽然都是用clustered index，但检索的效率却大相径庭。

--//测试脚本
use northwind
go
if exists(select * from sysobjects where name=nstress_test and type=u)
drop table stress_test
go
--//定义测试的表stress_test,存放所有的测试数据
create table stress_test([id] int,[key] char(2))

go
--//插入测试的数据
set nocount on
--//变量定义
declare @id int   --//stress_test id 值
declare @key char(2)  --//stress_test [key] 值
declare @maxgroup int  --//组最大的循环数
declare @maxloop int  --//id最大的循环数
declare @tempgroup int  --//临时变量
declare @temploop int  --//临时变量
declare @tempint1 int  --//临时变量
declare @tempint2 int  --//临时变量
declare @rowcount int  --//记录事务提交的行数

--//初始化变量
select @id=1
select @maxgroup=1000
select @maxloop=1000
select @tempgroup=1
select @temploop=1
select @key=
select @rowcount=0

while @temploop<=@maxloop
begin
while @tempgroup<=@maxgroup
begin
  select @tempint1=65+convert(int,rand()*50)
  select @tempint2=65+convert(int,rand()*100)
  if (@tempint1>=122 or @tempint2>=122)
    begin
    select @tempint1=@tempint1-100
    select @tempint2=@tempint2-100

    if (@tempint1<=65 or @tempint2<=65)
     begin
     select @tempint1=@tempint1+57
     select @tempint2=@tempint2+57
    end
   end
  select @key=char(@tempint1)+char(@tempint2)
  if @rowcount=0
  begin tran ins
     insert into stress_test([id],[key])values(@id,@key)
       select @rowcount=@rowcount+1

   if @rowcount>3000 --//判断当行数达到3000条时，开始提交事务
   begin
       commit tran ins
      select @rowcount=0
   end

  select @tempgroup=@tempgroup+1
end
if @rowcount>0
begin
  commit tran ins
  select @rowcount=0
end

select @tempgroup=1
select @id=@id+1
select @temploop=@temploop+1
end
go
--//删除key值为null的记录
delete stress_test where [key]is null
go
--//建立簇索引pk_stress
create clustered index pk_stress on stress_test([key])
--//建立非簇索引ni_stress_id
create nonclustered index ni_stress_id on stress_test([id])
go
--//定义测试的表keytb
if exists(select * from sysobjects where name=nkeytb and type=u)
drop table keytb
go
create table keytb   -----//存放你需要匹配的值的表
(
kf1 varchar(20)
)

--//存放你需要匹配的值，暂定为三个
insert into keytb(kf1) values(az);
insert into keytb(kf1) values(bw);
insert into keytb(kf1) values(cv);

--insert into keytb(kf1) values(du);
--insert into keytb(kf1) values(ex);
--insert into keytb(kf1) values(fy);
go

下面我们就开始测试几种t-sql的index优化问题：

--先对1百万条/1亿条记录进行测试（选取3列）的t-sql:

print 第一种语句：
set statistics time on
set statistics io on
select a.[id] from
(select distinct [id] from stress_test where [key] = az) a,
(select distinct [id] from stress_test where [key] = bw) b ,
(select distinct [id] from stress_test where [key] = cv) c
where a.id = b.id and a.id = c.id
go
print 第二种语句：
select [id]
from stress_test
where [key]=az or [key]=bw or [key]=cv
group by id having(count(distinct [key])=3)
go
print 第三种语句：
select distinct [id] from stress_test a where
not exists (
select 1 from
(select az as k union all select bw union all select cv) b
left join stress_test c on c.id=a.id and b.[k]=c.[key]
where c.id is null)
go
print 第四种语句：
select distinct a.id from stress_test a
where not exists
( select * from keytb c
   where not exists
   ( select * from stress_test b
     where
       b.id = a.id
       and
       c.kf1 = b.[key]
   )
)
go
print 第五种语句：
select distinct a.[id] from stress_test as a,stress_test as b,stress_test as c
where a.[key]=ac and b.[key]=bb and c.[key]=ca
      and a.[id]=b.[id] and a.[id]=c.[id]

go
set statistics time off
set statistics io off

--先对1百万条/1亿条记录进行测试（选取6列）的t-sql:
print 第一种语句：
set statistics time on
set statistics io on
select a.[id] from
(select distinct [id] from stress_test where [key] = az) a,
(select distinct [id] from stress_test where [key] = bw) b ,
(select distinct [id] from stress_test where [key] = cv) c,
(select distinct [id] from stress_test where [key] = du) d,
(select distinct [id] from stress_test where [key] = ex) e,
(select distinct [id] from stress_test where [key] = fy) f
where a.[id] = b.[id] and a.[id] = c.[id] and a.[id] = d.[id] and a.[id] = e.[id] and a.[id] = f.[id]
go
print 第二种语句：
select [id]
from stress_test
where [key]=az or [key]=bw or [key]=cv or [key]=duor [key]=exor [key]=fy
group by id having(count(distinct [key])=6)
go
print 第三种语句：
select distinct [id] from stress_test a where
not exists (
select 1 from
(select az as k union all select bw union all select cvunion all select duunion all select exunion all select fy) b
left join stress_test c on c.id=a.id and b.[k]=c.[key]
where c.id is null)
go
print 第四种语句：
select distinct a.id from stress_test a
where not exists
( select * from keytb c
   where not exists
   ( select * from stress_test b
     where
       b.id = a.id
       and
       c.kf1 = b.[key]
   )
)
go
print 第五种语句：
select distinct a.[id] from stress_test as a,stress_test as b,stress_test as c,stress_test as d,stress_test as e,stress_test as f
where a.[key]=az and b.[key]=bw and c.[key]=cv and d.[key]=du and e.[key]=ex and f.[key]=fy
     and a.[id] = b.[id] and a.[id] = c.[id] and a.[id] = d.[id] and a.[id] = e.[id] and a.[id] = f.[id]

go
set statistics time off
set statistics io off

请参考:

http://expert.csdn.net/expert/topic/2630/2630484.xml?temp=.9921686

posted on 2009-06-18 17:54 瞌睡虫阅读(210) 评论(0) 编辑收藏举报