Hive Learning (3): Hive Partitions
1. Hive partitions (PARTITIONED BY)
Partition columns must be declared when the table is defined.
a. Single-partition table DDL:
create table day_table (id int, content string) partitioned by (dt string);
A single-partition table, partitioned by day; the table exposes three columns: id, content, and dt.
Each distinct dt value becomes its own directory on HDFS.
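As a sketch (the dt value below is made up for illustration), the partition column can be used in queries like any other column, and filtering on it prunes directories instead of scanning the whole table:

```sql
-- Hypothetical query against day_table.
-- Because dt is a partition column, this reads only the
-- dt=2019-01-01 directory rather than every partition.
select id, content
from day_table
where dt = '2019-01-01';
```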
b. Double-partition table DDL:
create table day_hour_table (id int, content string) partitioned by (dt string, hour string);
A double-partition table, partitioned by day and hour; the dt and hour columns are added to the table schema.
Data is grouped first into dt directories, then into hour subdirectories within each dt directory.
2. Create two tables, psn2 and psn3
create table psn2 (
    id int,
    name string,
    hobby array<string>,
    address map<string,string>
)
partitioned by (age int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
3. Load data
load data local inpath '/root/data' into table psn2 partition (age=10);
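Judging from the delimiters declared in the DDL and the query output in the next step, the file /root/data looks roughly like this (a reconstruction, not the original file):

```text
1,小明1,lol-book-movie,beijing:changping-shanghai:pudong
2,小明2,lol-book-movie,beijing:changping-shanghai:pudong
3,小明3,lol-book-movie,beijing:changping-shanghai:pudong
```

Note that the age column does not appear in the file; its value comes entirely from the partition spec in the load statement.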
4. Query psn2
hive> select * from psn2;
OK
1	小明1	["lol","book","movie"]	{"beijing":"changping","shanghai":"pudong"}	10
2	小明2	["lol","book","movie"]	{"beijing":"changping","shanghai":"pudong"}	10
3	小明3	["lol","book","movie"]	{"beijing":"changping","shanghai":"pudong"}	10
4	小明4	["lol","book","movie"]	{"beijing":"changping","shanghai":"pudong"}	10
5	小明5	["lol","movie"]	{"beijing":"changping","shanghai":"pudong"}	10
6	小明6	["lol","book","movie"]	{"beijing":"changping","shanghai":"pudong"}	10
7	小明7	["lol","book"]	{"beijing":"changping","shanghai":"pudong"}	10
8	小明8	["lol","book"]	{"beijing":"changping","shanghai":"pudong"}	10
9	小明9	["lol","book","movie"]	{"beijing":"changping","shanghai":"pudong"}	10
Time taken: 7.93 seconds, Fetched: 9 row(s)
In theory a table can be partitioned without limit, but in practice the partitioning scheme should be driven by actual requirements.
For example: partition historical data by day.
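A by-day scheme for historical data might look like this (the table name and file paths here are illustrative, not from the original):

```sql
-- Hypothetical access-log table partitioned by day.
create table access_log (ip string, url string)
partitioned by (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

-- One load per day; each day's data lands in its own dt=... directory.
load data local inpath '/root/log-2019-01-01'
into table access_log partition (dt='2019-01-01');
```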
5. A failing example: psn3
create table psn3 (
    id int,
    name string,
    age int,
    hobby array<string>,
    address map<string,string>
)
partitioned by (age int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
Error: FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns
Cause: a partition column must not also appear in the table's regular column list.
6. Creating a table with two partition columns
create table psn3 (
    id int,
    name string,
    hobby array<string>,
    address map<string,string>
)
partitioned by (age int, sex string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
Note: with two partition columns, a load must supply values for both; supplying only one raises an error.
hive> create table psn3 (
    >     id int,
    >     name string,
    >     hobby array<string>,
    >     address map<string,string>
    > )
    > partitioned by (age int, sex string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY '-'
    > MAP KEYS TERMINATED BY ':'
    > LINES TERMINATED BY '\n';
OK
Time taken: 1.167 seconds
7. Load data into the double-partitioned table
hive> load data local inpath '/root/data' into table psn3 partition (age=10);
FAILED: SemanticException [Error 10006]: Line 1:63 Partition not found '10'
hive> load data local inpath '/root/data' into table psn3 partition (age=10,sex='boy');
Loading data to table default.psn3 partition (age=10, sex=boy)
OK
Time taken: 3.115 seconds
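Once loaded, both partition columns can be filtered on like ordinary columns; a minimal sketch:

```sql
-- Reads only the age=10/sex=boy directory,
-- since both filters are on partition columns.
select id, name
from psn3
where age = 10 and sex = 'boy';
```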
8. Drop a partition
hive> alter table psn3 drop partition (sex='boy');
Dropped the partition age=10/sex=boy
OK
Time taken: 0.195 seconds
9. Conclusions:
When adding a partition, you must build on the existing partition columns, i.e. supply a value for every partition column.
When dropping with a partial partition spec, every existing partition that matches the spec is dropped.
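In other words, a full partition spec removes exactly one partition, while a partial spec matches (and drops) every partition it covers; a sketch:

```sql
-- Drops only the single partition age=10/sex=boy.
alter table psn3 drop partition (age=10, sex='boy');

-- Drops every partition whose sex is 'boy', regardless of age.
alter table psn3 drop partition (sex='boy');
```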
10. Add a partition
When adding, you must specify the full spec, i.e. whether the new partition goes under age=10 or age=20; otherwise Hive reports an error.
hive> alter table psn3 add partition(sex='man');
FAILED: ValidationFailureSemanticException default.psn3: partition spec {sex=man} doesn't contain all (2) partition columns
hive> alter table psn3 add partition(age=10,sex='man');
OK
Time taken: 0.418 seconds
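To check which partitions currently exist (and therefore what a partial drop would match), Hive's SHOW PARTITIONS command lists them:

```sql
-- List every partition of psn3.
show partitions psn3;

-- Hive also accepts a filter on a subset of the partition columns.
show partitions psn3 partition (age=10);
```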
Before the drop below, the age=10 and age=20 directories each contained two subdirectories, boy and man.
11. Run the drop
hive> alter table psn3 drop partition(sex='man');
Dropped the partition age=10/sex=man
OK
Time taken: 0.389 seconds
Loading data with load is really just uploading a file; a partition corresponds to a directory on HDFS.
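Concretely, each partition column adds one level of directories under the table's location on HDFS. Assuming the default warehouse path (the exact path and file names depend on the installation and the loaded files), psn3's layout would look roughly like:

```text
/user/hive/warehouse/psn3/age=10/sex=boy/data
/user/hive/warehouse/psn3/age=10/sex=man/data
/user/hive/warehouse/psn3/age=20/sex=boy/data
/user/hive/warehouse/psn3/age=20/sex=man/data
```

Dropping a partition removes the corresponding directory (and its data, for a managed table).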
12. Insert the query result of one table into another table
create table psn4 (
    id int,
    name string,
    hobby array<string>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
LINES TERMINATED BY '\n';
This kind of insert is compiled into a MapReduce job.
from psn3 insert overwrite table psn4 select id,name,hobby;
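A side benefit of the from ... insert syntax is that a single scan of the source table can feed several inserts at once; a sketch (psn5 is a hypothetical second target with the same schema as psn4):

```sql
-- One pass over psn3 populates two tables.
from psn3
insert overwrite table psn4 select id, name, hobby
insert overwrite table psn5 select id, name, hobby;
```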
13. What this operation is useful for:
1. Copying a table
2. Serving as an intermediate (staging) table