Hive sorting and clustering

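1. order by sorts the entire data set; it is the standard SQL sorting clause

order by performs a global sort over its input, so it runs with a single reducer (multiple reducers cannot guarantee a global order).
Because there is only one reducer, large inputs take a long time to process.
Unlike order by in MySQL, in strict mode a limit must be specified; otherwise the query fails.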

• Run set hive.mapred.mode; to show the current mode
• Run set hive.mapred.mode=strict; to change the mode (set hive.mapred.mode=nonstrict; switches back; nonstrict is the default value here)

hive> select * from logs where date='2015-01-02' order by te;
FAILED: SemanticException 1:52 In strict mode,
 if ORDER BY is specified, LIMIT must also be specified. 
Error encountered near token 'te'

 For partitioned tables, the query must also explicitly filter on a partition column:

hive> select * from logs order by te limit 5;                
FAILED: SemanticException [Error 10041]: 
No partition predicate found for Alias "logs" Table "logs"

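2. sort by sorts data locally; it is a Hive extension

There can be multiple reduce tasks. When mapred.reduce.tasks is -1, Hive estimates the number from the input size (it is not determined by the number of distribute by columns); it can also be set by hand: set mapred.reduce.tasks=4;
The rows inside each reduce task are sorted, but there is no global order.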

set mapred.reduce.tasks = 2;
insert overwrite local directory '/root/hive/b'
    select * from logs                         
    sort by te;

 The query above writes its result to the local directory /root/hive/b, which ends up containing two result files: 000000_0 and 000001_0. Each file is sorted by the te column.
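The effect of sort by with two reducers can be sketched in plain Python. This is only a simulation of the semantics, not of Hive's execution engine: the te values are made up, and round-robin assignment is an assumed stand-in for however rows actually reach the reducers when no distribute by is given.

```python
# Simulation only: models SORT BY semantics, not Hive itself.
rows = [7, 3, 9, 1, 8, 2, 6, 4]  # hypothetical values of the te column
num_reducers = 2                  # mirrors: set mapred.reduce.tasks = 2

# Assumption: round-robin assignment stands in for the way rows are
# spread over reducers when no DISTRIBUTE BY is specified.
reducer_inputs = [rows[i::num_reducers] for i in range(num_reducers)]

# Each reducer sorts only its own rows and writes one output file
# (000000_0 and 000001_0 in the example above).
output_files = [sorted(part) for part in reducer_inputs]

# Reading the files one after another is not a globally sorted result.
concatenated = [value for part in output_files for value in part]

print(output_files)
print("globally sorted:", concatenated == sorted(rows))
```

Every file is sorted on its own, yet the concatenation of the two files is not.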

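distribute by properties:

    It splits rows by the specified columns across the reducers, and therefore across the reducers' output files
    distribute by plays the role of the partitioner in MapReduce; by default it is hash-based
    distribute by is usually combined with sort by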

set mapred.reduce.tasks = 2;
insert overwrite local directory '/root/hive/b'
    select * from logs
    distribute by date
    sort by te;
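A rough Python model of distribute by date sort by te may help here. The rows are invented, and zlib.crc32 is used purely as a deterministic stand-in for the hash function, which is an assumption and not what Hive computes internally.

```python
import zlib
from collections import defaultdict

# Hypothetical (date, te) rows standing in for the logs table.
rows = [
    ("2015-01-01", 5), ("2015-01-02", 3), ("2015-01-01", 1),
    ("2015-01-03", 9), ("2015-01-02", 7), ("2015-01-03", 2),
]
num_reducers = 2  # mirrors: set mapred.reduce.tasks = 2


def partition(key: str, n: int) -> int:
    """Pick a reducer for a key. crc32 is an assumed stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % n


# DISTRIBUTE BY date: rows with the same date go to the same reducer.
reducers = defaultdict(list)
for date, te in rows:
    reducers[partition(date, num_reducers)].append((date, te))

# SORT BY te: each reducer sorts only the rows it received.
outputs = {
    r: sorted(part, key=lambda row: row[1]) for r, part in reducers.items()
}

for r in sorted(outputs):
    print(r, outputs[r])
```

All rows sharing a date end up in the same output, and each output is ordered by te, but nothing orders the outputs relative to each other.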

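sort by is not a global sort; the rows are sorted before they are fed to each reducer.

Therefore, if you sort with sort by and mapred.reduce.tasks is greater than 1, sort by only guarantees that each reducer's output is sorted by the given columns, not that the overall result is.

sort by is not affected by whether hive.mapred.mode is strict or nonstrict.

With sort by you can choose the number of reducers (set mapred.reduce.tasks=<number>) and then merge-sort their outputs to obtain the fully sorted result.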
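The final merge step can be illustrated with Python's standard heapq.merge, which combines already sorted inputs into one sorted stream. The two lists are hypothetical reducer outputs; how you would actually read the result files is left out.

```python
import heapq

# Hypothetical outputs of two reducers, each already sorted by SORT BY.
reducer_outputs = [
    [1, 4, 6, 9],
    [2, 3, 7, 8],
]

# Merge the sorted runs; this is the merge phase of a merge sort, so no
# run has to be re-sorted.
merged = list(heapq.merge(*reducer_outputs))

print(merged)  # [1, 2, 3, 4, 6, 7, 8, 9]
```

Because every input is already sorted, the merge only ever compares the current head of each run.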

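Note: a limit clause can greatly reduce the amount of data. With limit n, the number of records sent to the reduce side (a single machine) drops to n * (number of map tasks). Without it, the data may be too large for the query to ever produce a result.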
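The n * (number of map tasks) figure can be checked with a small simulation. The splits and values are invented, and the sketch assumes an ascending sort in which every map task forwards only its n smallest rows.

```python
import heapq

n = 3  # mirrors: limit 3

# Hypothetical input splits, one per map task.
map_splits = [
    [42, 7, 19, 88, 3],
    [15, 64, 1, 27],
    [50, 9, 33, 2, 71],
]

# Assumption: each map task forwards only its n smallest rows.
forwarded = [heapq.nsmallest(n, split) for split in map_splits]
rows_at_reducer = [value for part in forwarded for value in part]

# The single reducer picks the final top n from what it received.
result = heapq.nsmallest(n, rows_at_reducer)

# Reference answer: sort everything, then take the first n rows.
expected = sorted(v for split in map_splits for v in split)[:n]

print(len(rows_at_reducer), result, result == expected)
```

The reducer sees at most n rows per map task, and the answer still matches a full sort followed by limit n.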


hive> set mapred.reduce.tasks;
mapred.reduce.tasks=-1
hive> set mapred.reduce.tasks=2;
hive> set mapred.reduce.tasks;
mapred.reduce.tasks=2
hive> insert overwrite table weather_data2 select year,data from weather_data distribute by year sort by year asc,data desc;

hive> dfs -ls /hive/warehouse/busdata.db/weather_data2;
Found 2 items
-rw-r--r--   1 hadoop supergroup      43647 2019-03-09 16:29 /hive/warehouse/busdata.db/weather_data2/000000_0
-rw-r--r--   1 hadoop supergroup      36470 2019-03-09 16:29 /hive/warehouse/busdata.db/weather_data2/000001_0

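3. cluster by

    When sort by and distribute by use exactly the same columns, they can be shortened to cluster by, which names the columns for both at once.
    Note that cluster by columns are always sorted in ascending order; neither asc nor desc can be specified. It is typically used with bucketed tables.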

set mapred.reduce.tasks = 2;
insert overwrite local directory '/root/hive/b'
    select * from logs
    cluster by date;

4. Miscellaneous

//The five clauses must appear in a strict order:
where → group by → having → order by → limit

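//Difference between where and having:
//where filters first and groups afterwards (it filters the raw rows), so where cannot reference aggregate functions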
hive> select count(*),age from tea where id>18 group by age;

//having groups first and filters afterwards (it filters each group; the condition here uses c, a column already defined in the select list)
hive> select age,count(*) c from tea group by age having c>2;
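The two queries above can be mimicked in Python to make the order of filtering and grouping concrete. The tea rows are invented for illustration only.

```python
from collections import Counter

# Hypothetical (id, age) rows standing in for the tea table.
tea = [
    (10, 20), (19, 20), (20, 20), (21, 20),
    (22, 30), (23, 30), (5, 30),
    (24, 40),
]

# where id > 18 ... group by age: filter the raw rows, then count.
where_result = Counter(age for row_id, age in tea if row_id > 18)

# group by age having c > 2: count every group first, then drop the
# groups whose count is not greater than 2.
group_counts = Counter(age for _, age in tea)
having_result = {age: c for age, c in group_counts.items() if c > 2}

print(dict(where_result))
print(having_result)
```

The where variant changes the counts themselves because rows disappear before grouping, while the having variant keeps full counts and only drops whole groups.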

//A column that does not appear in group by must not appear in select either (aggregate functions excepted)
hive> select ip,sum(load) as c from logs  group by ip sort by c desc limit 5;

//The distinct keyword returns unique values (here, each distinct combination of age and id)
hive> select distinct age,id from tea;
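As a quick illustration with made-up rows, distinct over two columns removes a row only when both values repeat together:

```python
# Hypothetical (age, id) rows standing in for the tea table.
rows = [(20, 1), (20, 1), (20, 2), (30, 2)]

# dict.fromkeys drops exact duplicate pairs and keeps first-seen order.
# The order is kept only for readability; it is not a claim about the
# order in which Hive returns rows.
distinct_rows = list(dict.fromkeys(rows))

print(distinct_rows)  # [(20, 1), (20, 2), (30, 2)]
```

The pairs (20, 1) and (20, 2) both survive even though they share an age, because the combination differs.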

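//Hive releases before 1.2.0 support only union all, not union
//Hive's union all differs somewhat from standard SQL: the number of columns must match and the corresponding column names must match too, but the column types need not be identical (probably thanks to implicit conversion)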
select name,age from tea where id<80
union all
select name,age from stu where age>18;
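To show what union all does compared with a de-duplicating union in standard SQL, here is a small sketch with invented result rows for the two selects:

```python
# Hypothetical (name, age) results of the two select statements.
tea_rows = [("Ann", 30), ("Bob", 25)]
stu_rows = [("Bob", 25), ("Cid", 19)]

# union all: simply append the second result to the first, keeping
# duplicate rows.
union_all = tea_rows + stu_rows

# De-duplicating union (standard SQL semantics): identical rows collapse
# into one.
union_distinct = list(dict.fromkeys(union_all))

print(len(union_all), len(union_distinct))  # 4 3
```

The row ("Bob", 25) appears twice under union all and once after de-duplication.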

posted @ 2019-03-09 16:35 by 我是属车的
