The IN Pitfall in Hive

Question: why does the result set vanish when NOT IN is used in Hive?

Note: this is an original post. Please credit the source when reposting, thanks!
Straight into the lab >>

hive> select * from a;
OK
1 a1
2 a2
3 a3
Time taken: 0.063 seconds, Fetched: 3 row(s)

hive> select * from b;
OK
1 b1
2 b2
NULL b3
Time taken: 0.063 seconds, Fetched: 3 row(s)

# Match the two tables on id and compute A-B, using a left join
hive> select t1.id,t1.name,t2.name from a t1
> left join b t2 on t1.id = t2.id
> where t2.name is null;
OK
3 a3 NULL
Time taken: 34.123 seconds, Fetched: 1 row(s)

# Match the two tables on id and compute A-B, using NOT IN
hive> select * from a where id not in ( select id from b );
OK
Time taken: 34.123 seconds, Fetched: 0 row(s)

Now it gets strange: why is the result set empty? That can't be right, can it?

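Reason:

In an RDBMS, t1.id IN (select t2.id from b t2) behaves like a semi-join: t1 join b t2 on t1.id = t2.id and t1.id is not null. A NULL never satisfies the equality, so a NULL in the subquery is harmless for IN.
NOT IN is different. id NOT IN (1, 2, NULL) expands to id <> 1 AND id <> 2 AND id <> NULL, and any comparison with NULL evaluates to NULL (unknown) rather than true. As soon as the subquery returns a single NULL, the predicate can be false or NULL but never true, so WHERE discards every row. Our Hive version is already 2.0.0, but it does not screen out the NULLs in the subquery for us. This is standard SQL three-valued logic rather than a Hive-specific defect, so the NULLs have to be excluded by hand.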
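
The effect of a NULL in the subquery can be observed row by row. The sketch below is an addition to the original experiment, not part of it: it loads the same two tables into an in-memory SQLite database through Python's sqlite3 module and returns the value of the NOT IN predicate for every row of a. SQLite stands in for Hive here only because it runs without a cluster; the function name not_in_truth_values is made up for this example.

```python
import sqlite3


def not_in_truth_values():
    """Return (id, value of `id NOT IN (SELECT id FROM b)`) for each row of a."""
    conn = sqlite3.connect(":memory:")
    # Same data as tables a and b in the post; b contains one NULL id.
    conn.executescript("""
        CREATE TABLE a (id INTEGER, name TEXT);
        CREATE TABLE b (id INTEGER, name TEXT);
        INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
        INSERT INTO b VALUES (1, 'b1'), (2, 'b2'), (NULL, 'b3');
    """)
    # Select the predicate itself instead of filtering on it,
    # so the three possible values (1, 0, NULL) become visible.
    rows = conn.execute(
        "SELECT id, id NOT IN (SELECT id FROM b) FROM a ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row_id, value in not_in_truth_values():
        print(row_id, value)
```

Ids 1 and 2 come out as 0 (false) because they are found in b. Id 3 comes out as None (SQL NULL), not 1, and a WHERE clause keeps only rows whose predicate is true, so nothing is returned.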

Correct usage:

# Match the two tables on id and compute A-B, using NOT IN
hive> select * from a where id not in ( select id from b where id is not null );
OK
3 a3
Time taken: 34.123 seconds, Fetched: 1 row(s)
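
As a cross-check, the NULL-filtered NOT IN can be compared with a NOT EXISTS anti-join, which is not affected by NULLs in b because the correlated equality simply never matches them. This is a minimal sketch against in-memory SQLite via Python's sqlite3 module, not a Hive session; NOT EXISTS is a swapped-in alternative that the original post does not use, and whether your Hive version accepts the correlated form is worth checking against its manual. The names SETUP, QUERIES and run_all are made up for this example.

```python
import sqlite3

# Same data as tables a and b in the post.
SETUP = """
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, name TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO b VALUES (1, 'b1'), (2, 'b2'), (NULL, 'b3');
"""

QUERIES = {
    # The fix from the post: remove NULLs before NOT IN sees them.
    "not_in_filtered": (
        "SELECT a.id, a.name FROM a "
        "WHERE a.id NOT IN (SELECT id FROM b WHERE id IS NOT NULL)"
    ),
    # Alternative: a correlated NOT EXISTS; a NULL b.id never equals a.id.
    "not_exists": (
        "SELECT a.id, a.name FROM a "
        "WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id)"
    ),
}


def run_all():
    """Run every query in QUERIES on a fresh database and collect the rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    results = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
    conn.close()
    return results


if __name__ == "__main__":
    for name, rows in run_all().items():
        print(name, rows)
```

Both queries return the single row (3, 'a3'), matching the left join result earlier in the post.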

Feel free to try this experiment yourself:
-- no results
hive> select * from a where id not in (null);
OK
Time taken: 3.603 seconds
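
The same experiment can be extended slightly to show that id in (null) is just as empty as id not in (null): a predicate and its negation both evaluate to NULL, so neither keeps any row. Again this is only a sketch on in-memory SQLite through Python's sqlite3 module, with a hypothetical helper count_rows; the predicate is pasted into the SQL text, which is acceptable for a throwaway demo but not for untrusted input.

```python
import sqlite3


def count_rows(predicate):
    """Count the rows of table a for which the given SQL predicate is true."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (id INTEGER, name TEXT);
        INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    """)
    # Demo only: the predicate string is interpolated directly into the query.
    count = conn.execute(f"SELECT COUNT(*) FROM a WHERE {predicate}").fetchone()[0]
    conn.close()
    return count


if __name__ == "__main__":
    for predicate in ("id NOT IN (NULL)", "id IN (NULL)", "id NOT IN (1)"):
        print(predicate, "->", count_rows(predicate))
```

The first two predicates match 0 rows, while id NOT IN (1) matches 2 rows, since no NULL is involved.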

posted on 2018-11-07 02:47 荒漠依米摩天轮
