う三余无梦生

一个统计查询问题

编辑器加载中... --今天在公司遇到一个比较棘手的SQL查询问题。 --寻思了一晚上如果查询,最终总算了有了结果。 --题目:有表1 客户信息表,字段:CustomerID,自增长,CustomerName 主键, --有表2 邮件发送记录记录表,字段 CustomerID 客户ID,EmailID 邮件ID,SendDate 发送时间。要求:查找当天的前X天内发送邮件的记录小于等于Y条的客户,显示客户的信息为 客户ID,客户姓名,邮件总发送数量,最近一封邮件发送时间。 --注意:客户表日增长量为1W条,发送记录为每一个客户每一年邮件发送数量不少于10条。 --今天晚上寻思到两个结果,要证明两个结果的性能好坏。为了节约时间,我们可以先把数据库建立起来。 USE master SET NOCOUNT ON GO ----创建客户表 IF EXISTS(SELECT * FROM SYS.TABLES WHERE [NAME] ='Customer') DROP TABLE Customer CREATE TABLE Customer ( CustomerID INT IDENTITY(1,1) NOT NULL, CustomerName VARCHAR(20) PRIMARY KEY NOT NULL ) GO ---创建发送记录表 IF EXISTS(SELECT * FROM SYS.TABLES WHERE NAME='SendInfo' ) DROP TABLE SendInfo CREATE TABLE SendInfo ( CustomerID INT , SendDate DATETIME , EmailID INT ) GO ------循环插入数据 ------模拟插入100W客户,400W条数据,最后再插入1W个新客户 ------模拟最近4天,每天都为客户发送了一条邮件 DECLARE @num INT, @index INT , @j INT, @id INT SET @index =1 ; SET @num =1000000; SET @j =1; WHILE(@index <=@num) BEGIN INSERT INTO Customer (CustomerName) VALUES ('Cus'+Convert(varchar(20),@index) ); SELECT @id =@@IDENTITY; WHILE(@j<=4) BEGIN INSERT INTO SendInfo(CustomerID, SendDate, EmailID) VALUES ( @id, Dateadd(d,-@j,getdate()),@j); SET @j =@j +1; END SET @J =1 ; SET @index =@index +1; END SET @index = 1; WHILE (@index<= 10000) BEGIN INSERT INTO Customer (CustomerName) VALUES ('Cus'+Convert(varchar(20),@index)+'New' ); SELECT @id =@@IDENTITY; SET @index = @index+1 END GO --------写出查询语句 --当个来看,要把客户的最后一次发送时间和发送总次数查询出来不简单, --问题是还要将发送频率的条件进行过滤,并且是小于等于, --那么这个时间要求的日期内,没有发送记录的客户,也就属于了合理的条件范围内。 --我们来进行分解。 --1、主干是客户 --组建A表,查询客户ID和最后一个发送时间,总发送次数 select c.CustomerID ,c.CustomerName,isnull(sumCount,0) sumCount,lastDate from Customer c left join ( select CustomerId,max(Senddate)lastDate ,count(EmailID)sumCount from sendInfo group by CustomerId ) b on (c.CustomerID = b.CustomerId) GO --2、条件是关键 --根据条件组装一个B表,查找出符合要求的客户ID,这段时间内没有发送记录的客户数量是Y select c.CustomerID,count(s.CustomerId) num from Customer c left join sendInfo s on (s.CustomerId = c.CustomerID and datediff(d,SendDate,getdate()) <= X(天)) group by c.CustomerID having count(s.CustomerId) <= Y(条) GO ---3、接下来便是组装这个条件和查询结果 ----第一种情况,使用连接,讲A,B表连接,通过on进行最后的次数过滤 select getdate();--查询开始时间 select A1.*,B1.Num from ( select c.CustomerID ,c.CustomerName,isnull(sumCount,0) sumCount,lastSendDate from Customer c left join ( select CustomerID,max(SendDate)lastSendDate ,count(EmailID)sumCount from sendInfo group by CustomerID ) b on (c.CustomerID = b.CustomerID) ) A1 inner join ( select c.CustomerID,count(s.CustomerID) num from Customer c left join sendInfo s on (s.CustomerID = c.CustomerID and Datediff(d,SendDate,getdate()) <= 2) group by c.CustomerID ) B1 on (A1.CustomerID= B1.CustomerID and B1.num <=2 ); select getdate();---查询结束时间 GO --第二种,先对B表进行次数过滤,然后将过滤结果的CustomerID再对A表进行in查询 select getdate(); select A1.* from ( select c.CustomerID ,c.CustomerName,isnull(sumCount,0) sumCount,lastDate from Customer c left join ( select CustomerId,max(Senddate)lastDate ,count(EmailID)sumCount from sendInfo group by CustomerId ) b on (c.CustomerID = b.CustomerId) ) A1 where A1.CustomerID in ( select CustomerID from ( select c.CustomerID,count(s.CustomerId) num from Customer c left join sendInfo s on (s.CustomerId = c.CustomerID and datediff(d,SendDate,getdate()) <= 2) group by c.CustomerID having count(s.CustomerId) <= 2 ) B1 ); select getdate(); ----------------------- ----查询结束 --现在数据插入还没有完成。虽然我具体不知道各种查询计划和执行计划对于SQL的比较有什么用, --但是我知道,一个TSQL,他会先编译SQL语句,转换成为他的逻辑执行计划,然后具体到磁盘中执行数据读取, --这一个称之为物理执行计划。逻辑执行计划有一定的顺序。具体顺序如下: --(小弟虽有一本MSSQL2008技术内幕T-SQL查询,但是一直没有时间和精力去读懂他和运用他。 --本身SQL的基础不强,这本书学习起来相当吃力。) --(8) SELECT (9) DISTINCT (11) --(1) FROM --(3) JOIN --(2) ON --(4) WHERE --(5) GROUP BY --(6) WITH {CUBE | ROLLUP} --(7) HAVING --(10) ORDER BY --1. FROM: A Cartesian product (cross join) is performed between the first two tables in the FROM clause, and as a result, virtual table VT1 is generated. --2. ON: The ON filter is applied to VT1. Only rows for which the is TRUE are inserted to VT2. --3. OUTER (join): If an OUTER JOIN is specified (as opposed to a CROSS JOIN or an INNER JOIN), rows from the preserved table or tables for which a match was not found are added to the rows from VT2 as outer rows, generating VT3. If more than two tables appear in the FROM clause, steps 1 through 3 are applied repeatedly between the result of the last join and the next table in the FROM clause until all tables are processed. --4. WHERE: The WHERE filter is applied to VT3. Only rows for which the is TRUE are inserted to VT4. --5. GROUP BY: The rows from VT4 are arranged in groups based on the column list specified in the GROUP BY clause. VT5 is generated. --6. CUBE | ROLLUP: Supergroups (groups of groups) are added to the rows from VT5, generating VT6. --7. HAVING: The HAVING filter is applied to VT6. Only groups for which the is TRUE are inserted to VT7. --8. SELECT: The SELECT list is processed, generating VT8. --9. DISTINCT: Duplicate rows are removed from VT8. VT9 is generated. --10. ORDER BY: The rows from VT9 are sorted according to the column list specified in the ORDER BY clause. A cursor is generated (VC10). --11. TOP: The specified number or percentage of rows is selected from the beginning of VC10. Table VT11 is generated and returned to the caller. --了解这些执行计划的顺序,有利于写出一个好的TSQL,减少笛卡尔乘积 -- 由于现在已经晚上11点了,要休息了。数据库建立已经完成,另外关于物理执行计划这些详细的内容我无法提供,所有大家可以自己去 --查看红色钩钩(执行)按钮旁边的(显示估计的执行计划)按钮,查看每一种查询方案的具体执行计划,去了解每一个过程的资源损耗, --去了解他的解析过程。这很有帮助 ---最终结果是: --方案二:第一次执行用去了:21秒。开始时间 2011-10-13 23:17:46.097 结束时间 2011-10-13 23:18:07.190 -- (1010000 行受影响) -- 第二次执行,用时 21秒。 --方案一:第一次执行 22秒 -- 开始时间:2011-10-13 23:19:48.760 结束时间 2011-10-13 23:22:41.343 -- (1010000 行受影响) -- 第二次执行,用去22秒 ------将次数改成为0的时候,瞬间CPU占用率100%, --方案一:执行了5分30秒,才显示出来1970条数据 --方案二:执行了5分30秒,才显示出2421条数据. --几乎同等的速度,因为在方案一执行过程期间,我进行过其他程序操作,占用了系统资源. ---第三次尝试 num<= 1 ---方案二:执行3秒,完毕,(10000 行受影响) 2011-10-13 23:49:13.753 - 2011-10-13 23:49:17.013 ---方案一:执行3秒,完毕,(10000 行受影响) 2011-10-13 23:48:01.807 - 2011-10-13 23:48:04.990 --后来将第二种方案的最后一个条件,进行了一点更改,执行时间就改成了12秒。 select CustomerID from ( select c.CustomerID,count(s.CustomerId) num from Customer c left join sendInfo s on (s.CustomerId = c.CustomerID and datediff(d,SendDate,getdate()) <= 2) group by c.CustomerID having count(s.CustomerId) <= 2 ) B1 变成 select c.CustomerID from Customer c left join sendInfo s on (s.CustomerId = c.CustomerID and datediff(d,SendDate,getdate()) <= 2) group by c.CustomerID having count(s.CustomerId) <= 2 。 有谁可以解释在num <= 0的时候,出现的奇怪现象吗?有谁可以有更好的解决方案吗?

posted on 2011-10-14 15:19  う三余无梦生  阅读(201)  评论(0编辑  收藏  举报

导航