in vs. exists and not in vs. not exists

一、定义

in: where column in (expression_list or subquery)
根据指定表达式列表或者子查询查找相应的记录。

//表达式列表
select * from emp where job_id in ('job1');
// 子查询
select * from emp where salary in (select salary from emp where depart_id = 'depart2');

not in: where column not in (expression_list or subquery)
查找所有不在表达式列表或子查询中的记录。注意：如果表示列表或子查询中存在NULL数据，查询条件则判断为false，相当于查不到任何数据。

select * from emp where job_id not in ('job1');//1 row in set

select * from emp where salary not in (select salary from emp where depart_id = 'depart2');

select * from emp where job_id not in ('job1', NULL);// empty set

exists: where exists (subquery)
根据子查询查找相应的记录。
如果子查询返回至少一行记录，表明查询条件匹配为True。

//查找有员工的部门
select id from depart d where exists (select * from emp e where d.id = e.depart_id); // 2 rows in set

not exists: where not exists (subquery);

//查找没有员工的部门
select id from depart d where not exists (select * from emp e where d.id = e.depart_id); // 1 rows in set

二、区别

in的查询条件可以是表达式列表也可以是子查询
exists的查询条件只能是子查询

in的子查询方式和exists的子查询方式也不同
in是指定字段是否在子查询结果集，语法为where cloumn in (subquery)。本质是循环子查询的结果集。
exists是子查询是否能返回结果集，语法为 where exists (subquery)。本质是在子查询中做了关联查询。

not in 和 not exists是有很大差别的。
当数据条件不同的时候，两者的执行结果是不一样的。
1、对于not exists查询，内表存在空值对查询结果没有影响；对于not in查询，内表存在空值将导致最终的查询结果为空。
2、对于not exists查询，外表存在空值，存在空值的那条记录最终会输出；对于not in查询，外表存在空值，存在空值的那条记录最终将被过滤，其他数据不受影响。

再进一步讲，oralce这样牛叉的公司，为什么不能把not in也优化为关联查询呢。
（关联查询用到的是hash join。这是一种优化的算法，可以提高关联查询效率。）
本质原因在于，in 或者 not in 要对NULL值进行处理，当子查询为NULL时，条件为永假式，要返回FLASE，为了保证执行结果正确性，只能放弃效率，采用filter过滤还不是hash join，hash join不支持把空值放在hash桶中。

三、性能

in 和 exists 的性能并不总是固定的前者优后者劣，或者前者劣后者优的。
也就是说要根据数据库环境和表的数据量大小，确定该使用 in 还是 exists。

什么情况用 in ？什么情况用 exists？
如果明白了两者查询时底层的算法实现，也就很容易明白该如何选择了。
由于 in 的本质是对结果集做循环，exists 是对表做关联查询。
因此，假如有下面两个sql语句

1) select * from T1 where exists(select 1 from T2 where T1.a=T2.a);
    T1数据量小而T2数据量非常大时，T1<<T2 时，1) 的查询效率高。

2) select * from T1 where T1.a in (select T2.a from T2);
     T1数据量非常大而T2数据量小时，T1>>T2 时，2) 的查询效率高。

而对于 not in 和 not exists，一般情况下选择 not exists，效率总是比not in要高。

四、总结

1.当子查询数据少的时候，使用 in 就可以
2.当子查询数据量特别大的时候，要考虑使用 exist
3.当使用not in 或者 not exists的时候，不考虑空值意义的时候，统一使用 not exists。

ps：以上总结，基本是纸上谈兵，我并没有创造这样的数据环境去测试，可能与实际情况不符，后续需要用真实数据测试。

update:我觉得既然花了2小时了解这个破玩意，不如直接实践一下，彻底搞明白为好，要学就学彻底。

测试语句：

SQL1 : select count(*) from emp T1 where depart_id  in (select id from depart T2);
    
SQL2 : select count(*) from emp T1  where exists (select * from depart T2 where T2.id = T1.depart_id);

SQL3 : select count(*) from emp T1 where depart_id  not in (select id from depart T2);  

SQL4 : select count(*) from emp T1 where not exists (select * from depart T2 where T2.id = T1.depart_id);

环境一：

T1 ＝》100000条记录
T2 ＝》 10条记录

SQL1 :   0.06 sec
SQL2 :   0.19 sec
SQL3 :   0.06 sec
SQL4 :   0.3 sec

环境二：

T1 ＝》10条记录
T2 ＝》 100000条记录

SQL1 :   0.00 sec
SQL2 :   0.01 sec
SQL3 :   0.00 sec
SQL4 :   0.00 sec

环境一：

T1 ＝》100000条记录
T2 ＝》 100000条记录

SQL1 :   0.10 sec
SQL2 :   0.15 sec
SQL3 :   0.16 sec
SQL4 :   0.15 sec

看到这样的结果我表示很蒙蔽，花了两个小时的总结几乎在扯蛋，实际证明差别一点也不大，可能我是用mysql测试的，可能是因为我都是在主键上进行查询的，也可能是我造的数据还不够多，不够悬殊。

总而言之，一句话，数据库本身很强大，在出现性能瓶颈之前我认为没有必要过于考虑sql优化，只要按常规建好主键，建好索引，其他一些语句上的差别一般来说不会有太大问题，而一旦数据量过于大的时候，其实更应该考虑的是如何分库分表，化整为分的思路，继续在算法层面做优化，或者sql层面做优化并不能很大的解决问题。

大道至简

参考：
详解not in与not exists的区别与用法
 oracle中的exists 和not exists 用法详解

posted @ 2016-09-04 14:28 braveliu.site 阅读(214) 评论(0) 编辑收藏举报

刷新页面返回顶部

braveliu.site

享受过程，看淡结果。

in vs. exists and not in vs. not exists

一、定义

二、区别

三、性能

四、总结

测试语句：

环境一：

环境二：

环境一：

公告