代码改变世界

sql中的exists

2021-08-03 10:42  abce  阅读(1007)  评论(0编辑  收藏  举报

刚开始工作的开发,很容易过度的使用in、not in。其实,在有些时候可以用exists、not exists来代替in和not in,实现查询性能的提升。

exists操作符时候会和in操作符产生混淆。因为他们都是用于检查某个属性是否包含在某个集合中。但是相同点也仅限于此。

exists的真正目的是检查子查询是否至少包含一条记录。例如,下面的查询会返回行1和2:

WITH numbers (nr) AS (
  SELECT 1 AS nr UNION ALL
  SELECT 2 AS nr UNION ALL
  SELECT 3 AS nr
), letters (letter, nr) AS (
  SELECT 'A' AS letter, 1 AS nr UNION ALL 
  SELECT 'B' AS letter, 2 AS nr 
)
SELECT * FROM numbers n WHERE  EXISTS (SELECT nr FROM letters WHERE nr= n.nr);

当然,你也可以改写成in:

WITH numbers (nr) AS (
  SELECT 1 AS nr UNION ALL
  SELECT 2 AS nr UNION ALL
  SELECT 3 AS nr
), letters (letter, nr) AS (
  SELECT 'A' AS letter, 1 AS nr UNION ALL 
  SELECT 'B' AS letter, 2 AS nr 
)
SELECT * FROM numbers n WHERE n.nr IN (SELECT nr FROM letters);

这两种写法,都可以返回相同的记录。区别是exists会更快,因为在得到第一条满足条件的记录之后就会停止,而in会查询所有的记录(如果in返回很多行的话)。但是,要注意另一件事,exists使用的关联子查询,它可以使用left join加上一个NULL检查来替换。

 

执行计划

下面来看看,示例的执行计划

not in操作符的执行计划:

=# EXPLAIN ANALYZE SELECT * FROM products WHERE id NOT IN (SELECT product_id FROM bought_products);
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on products  (cost=38.25..50.50 rows=90 width=422) (actual time=0.069..0.072 rows=5 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 2
   SubPlan 1
     ->  Seq Scan on bought_products  (cost=0.00..32.60 rows=2260 width=4) (actual time=0.006..0.009 rows=3 loops=1)
 Planning Time: 0.180 ms
 Execution Time: 0.130 ms
(7 rows)

从执行计划可以看到,因为没有用到任何索引,对表bought_products先走的是全表扫描,然后在参与过滤。

 

not exists的执行计划:

=# EXPLAIN ANALYZE SELECT * FROM products WHERE NOT EXISTS (SELECT product_id FROM bought_products WHERE id = product_id);
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=0.15..47.80 rows=90 width=422) (actual time=0.038..0.045 rows=5 loops=1)
   ->  Seq Scan on products  (cost=0.00..11.80 rows=180 width=422) (actual time=0.012..0.014 rows=7 loops=1)
   ->  Index Only Scan using bought_products_pkey on bought_products  (cost=0.15..0.61 rows=11 width=4) (actual time=0.003..0.003 rows=0 loops=7)
         Index Cond: (product_id = products.id)
         Heap Fetches: 2
 Planning Time: 0.177 ms
 Execution Time: 0.080 ms
(7 rows)

这里走的是,嵌套循环反连接(Nested Loop Anti Join)。

从这两个执行计划来看,最大的区别是执行时间的不同(0.130 ms vs 0.080 ms),计划时间差不多。

所以,当你有些in查询时,可以看看是否可以使用exists来替换。

 

再来看看exists的执行计划:

=# EXPLAIN ANALYZE SELECT * FROM products WHERE EXISTS (SELECT product_id FROM bought_products WHERE id = product_id);
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.15..47.80 rows=90 width=422) (actual time=0.030..0.046 rows=2 loops=1)
   ->  Seq Scan on products  (cost=0.00..11.80 rows=180 width=422) (actual time=0.011..0.013 rows=7 loops=1)
   ->  Index Only Scan using bought_products_pkey on bought_products  (cost=0.15..0.61 rows=11 width=4) (actual time=0.003..0.003 rows=0 loops=7)
         Index Cond: (product_id = products.id)
         Heap Fetches: 2
 Planning Time: 0.261 ms
 Execution Time: 0.084 ms
(7 rows)

正如我们所料,执行计划中使用了半连接(SEMI JOIN)。

如果不是关联子查询呢?

=# EXPLAIN ANALYZE SELECT * FROM products WHERE EXISTS (SELECT product_id FROM bought_products);
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.01..11.81 rows=180 width=422) (actual time=0.013..0.015 rows=7 loops=1)
   One-Time Filter: $0
   InitPlan 1 (returns $0)
     ->  Seq Scan on bought_products  (cost=0.00..32.60 rows=2260 width=0) (actual time=0.007..0.007 rows=1 loops=1)
   ->  Seq Scan on products  (cost=0.01..11.81 rows=180 width=422) (actual time=0.002..0.004 rows=7 loops=1)
 Planning Time: 0.077 ms
 Execution Time: 0.036 ms
(7 rows)

这次,子查询使用了顺序扫描,但是计划时间只用了0.077ms。它将只返回1行,因为它是EXISTS的语义。

如果你将这些统计信息和not in进行比较,可能会发现它们都差不多,可能是因为它们都是小的数据集。对于大的数据集,它们的差异会很明显,尤其是not in会返回大量行的场景。(exists只会返回1行记录)

尽管exists和in看起来类似,但是他们是不同的。exists在子查询返回第一条满足条件的记录后就会停止。如果你的查询很慢,将in换成exists可能会提升很多性能。

此外,exists可以用left join加上null检查来取代。