Hive Partitions and Hive Dynamic Partitions

Hive Notes 2: Hive Partitions and Hive Dynamic Partitions

- I. Hive Partitions
- II. Hive Dynamic Partitions

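I. Hive Partitions

Physically, a partition is just a subdirectory under the table's directory; each partition value gets its own subdirectory.

Purpose: partition pruning. A query that filters on the partition column reads only the matching subdirectories, which avoids a full table scan, reduces the amount of data MapReduce has to process, and improves efficiency.

In a typical company Hive deployment, almost every table is partitioned, usually by date or by region.

When querying a partitioned table, remember to filter on the partition column.

More partition levels are not always better. As a rule of thumb, use no more than 3 levels and weigh the choice against the actual business requirements.

Create a partitioned table: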

create external table students_pt1
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
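PARTITIONED BY(pt string)	-- required for a partitioned table; declares the partition column pt with type string
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/student/input1';	-- the LOCATION clause for the external table must come after ROW FORMAT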

Inspect the structure of the partitioned table:

hive> desc students_pt1;
OK
id                  	bigint              	                    
name                	string              	                    
age                 	int                 	                    
gender              	string              	                    
clazz               	string              	                    
pt                  	string              	                    
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
pt                  	string              	                    
Time taken: 0.172 seconds, Fetched: 11 row(s)

The output has one extra column, pt, which holds the partition information.

Add a partition:

alter table students_pt1 add partition(pt='20220216');
-- Adds a partition; on HDFS a new subdirectory pt=20220216 appears under /student/input1

alter table students_pt1 add partition(pt='20220217');
alter table students_pt1 add partition(pt='20220218');
alter table students_pt1 add partition(pt='20220219');
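
Two related commands are sketched below, as an illustration rather than a required step. `add if not exists` makes the statement safe to re-run and can register several partitions at once, and `describe formatted ... partition(...)` shows where a partition lives on HDFS. The partition values simply reuse the dates above.

```sql
-- Register several partitions in one statement;
-- IF NOT EXISTS avoids an error if a partition is already registered.
alter table students_pt1 add if not exists
    partition(pt='20220217')
    partition(pt='20220218');

-- Show the metadata of a single partition, including its Location on HDFS.
describe formatted students_pt1 partition(pt='20220217');
```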

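Drop a partition:

alter table students_pt1 drop partition(pt='20220216');
-- Because this is an external table, the command removes only the metadata of partition pt=20220216; the data and the directory stay on HDFS
-- dfs -rm -r /student/input1/pt=20220216; does the opposite: it deletes the directory and its data, but the partition metadata remains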
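
The reverse situation also comes up: a directory such as pt=20220216 exists under the table location, but the metastore has no matching partition, so queries do not see its data. A minimal sketch of two ways to register it again, assuming the directory is still present on HDFS:

```sql
-- Option 1: re-register the single partition explicitly.
alter table students_pt1 add if not exists partition(pt='20220216');

-- Option 2: let Hive scan the table location and add every
-- partition directory that is missing from the metastore.
msck repair table students_pt1;

-- Confirm which partitions the metastore now knows about.
show partitions students_pt1;
```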

List all partitions of a table:

show partitions students_pt1; -- recommended: reads the partition list directly from the metadata

select distinct pt from students_pt1; -- not recommended: scans the data

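Insert data into a partition

Once a table is partitioned, data must be loaded into one of its partitions (subdirectories).

-- Inserting without a partition clause, as below, raises an error because no target partition is specified
insert into table students_pt1 select * from students;

-- A partition does not have to exist before you insert into it, as long as the
-- statement includes partition(pt='20220218'). Hive creates the subdirectory
-- pt=20220218 automatically and stores the data there.

-- Method 1:
insert into table students_pt1 partition(pt='20220218') select * from students;
-- Method 2:
load data local inpath '/usr/local/soft/data/students.txt' into table students_pt1 partition(pt='20220217');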
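
Note that `insert into` appends, so running Method 1 twice stores every row twice. If the goal is to replace the contents of a partition, `overwrite` can be used instead. This is a sketch that assumes `students` has exactly the five columns id, name, age, gender, clazz, as the `select *` above implies:

```sql
-- Replace (rather than append to) the contents of partition pt=20220218.
insert overwrite table students_pt1 partition(pt='20220218')
select id, name, age, gender, clazz from students;

-- LOAD DATA has the same option: OVERWRITE replaces the files
-- that already exist in the target partition.
load data local inpath '/usr/local/soft/data/students.txt'
overwrite into table students_pt1 partition(pt='20220217');
```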

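Query the data of a partition

The commands below count the records; each returns a single number.

-- Full table scan: not recommended, slow
select count(*) from students_pt1;

-- A where condition on the partition column prunes partitions and avoids the full table scan: fast
select count(*) from students_pt1 where pt='20220218';

-- Range comparisons on the partition column work as well
select count(*) from students_pt1 where pt<='20210112' and pt>='20210110';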
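
One way to check that a filter really prunes partitions is to look at the query's inputs instead of running it. The sketch below uses `explain dependency`, which reports the tables and partitions a query would read. The exact output format differs between Hive versions, so no sample output is shown.

```sql
-- With the filter, only partition pt=20220218 should appear among the inputs.
explain dependency
select count(*) from students_pt1 where pt='20220218';

-- Without a filter on pt, every partition of the table is expected to be listed.
explain dependency
select count(*) from students_pt1;
```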

Query the rows of a partition

select * from students_pt1 where pt='20220220';

Result:

hive> select * from students_pt1 where pt='20220220' limit 10;
OK
1500100001	施笑槐	22	女	文科六班	20220220
1500100002	吕金鹏	24	男	文科六班	20220220
1500100003	单乐蕊	22	女	理科六班	20220220
1500100004	葛德曜	24	男	理科三班	20220220
1500100005	宣谷芹	22	女	理科五班	20220220
1500100006	边昂雄	21	男	理科二班	20220220
1500100007	尚孤风	23	女	文科六班	20220220
1500100008	符半双	22	女	理科六班	20220220
1500100009	沈德昌	21	男	理科一班	20220220
1500100010	羿彦昌	23	男	理科六班	20220220
Time taken: 0.104 seconds, Fetched: 10 row(s)

As shown, every row ends with an extra column, 20220220, which is the value of the partition column pt.

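II. Hive Dynamic Partitions

Sometimes the raw table already contains a date column dt, and we want to split the rows into partitions according to the dates in dt, turning the raw table into a partitioned table.

Older Hive versions ship with dynamic partitioning disabled. Since Hive 0.9.0 hive.exec.dynamic.partition defaults to true, but the mode usually still defaults to strict, so the settings below are worth making explicitly.

Dynamic partitioning: the partitions are derived from the distinct values of one or more columns in the data.

1. Enable dynamic partition support in Hive

-- Enable dynamic partitioning
hive> set hive.exec.dynamic.partition=true;

-- Set the dynamic partition mode: strict (at least one static partition must be given) or nonstrict
-- strict example: insert into table students_pt partition(dt='anhui',pt) select ......,pt from students;
-- nonstrict is the one normally used (the setting is per session: it is lost when you exit Hive and has to be set again)
hive> set hive.exec.dynamic.partition.mode=nonstrict;

-- Allow each mapper/reducer node to create at most 1000 dynamic partitions; adjust to your workload
hive> set hive.exec.max.dynamic.partitions.pernode=1000;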
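
The per-node limit is not the only cap that a large dynamic-partition insert can hit. The sketch below lists the related session settings. The property names are standard Hive configuration properties, while the values are only illustrative assumptions to tune for your own cluster.

```sql
-- Print the current value of a setting (no "=value" part).
set hive.exec.dynamic.partition.mode;

-- Limit on dynamic partitions created by a single mapper or reducer.
set hive.exec.max.dynamic.partitions.pernode=1000;

-- Limit on dynamic partitions created by the whole statement.
-- Assumed value; it should be at least as large as the per-node limit.
set hive.exec.max.dynamic.partitions=2000;

-- Limit on the number of HDFS files all mappers/reducers of one job may create.
set hive.exec.max.created.files=100000;
```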

2. Create the source table and load data (a regular, non-partitioned table)

create table students_dt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    dt string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

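Put the data file in the local directory `/usr/local/soft/data/` and load it

Upload students_dt.txt to the server (for example, by dragging it into the Xshell window)

Load the data into the table just created: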
load data local inpath '/usr/local/soft/data/students_dt.txt' into table students_dt;

3. Create the partitioned table and load data

create table students_dt_p
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

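Insert data using dynamic partitioning

-- The partition column must come last in the select list (with several partition columns, they come last in order):
-- columns are matched by position, not by name
-- So for a dynamic partition insert, write the column names out explicitly
insert into table students_dt_p partition(dt) select id,name,age,gender,clazz,dt from students_dt;

-- For example, the statement below partitions by age, not by the dt column of students_dt
insert into table students_dt_p partition(dt) select id,name,age,gender,dt,age from students_dt;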
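
Because the match is positional, it is worth checking the result after a dynamic partition insert. A small sketch, using only the two tables defined above, that compares the partitions created with the dt values in the source table:

```sql
-- One partition per distinct dt value is expected.
show partitions students_dt_p;

-- Row count per partition in the partitioned table ...
select dt, count(*) as cnt from students_dt_p group by dt;

-- ... should match the row count per dt value in the source table.
select dt, count(*) as cnt from students_dt group by dt;
```

If partitions named after ages (for example dt=22) show up, the select list was in the wrong order, as in the second statement above.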

4. Multi-level partitions (nested subdirectories)

-- Create a regular table
create table students_year_month
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    year string,
    month string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Create the partitioned table; the partition columns are year and month
create table students_year_month_pt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(year string,month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load the local data file into the regular table
load data local inpath '/usr/local/soft/data/students_year_month.txt' into table students_year_month;

-- Insert the rows of the regular table into the partitioned table, partitioning them on the way
insert into table students_year_month_pt partition(year,month) select id,name,age,gender,clazz,year,month from students_year_month;
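
With two partition columns, each partition is a nested directory of the form year=.../month=... under the table directory. The sketch below shows how the two levels can be listed, filtered, and mixed. The values '2022' and '02' are hypothetical and depend on what students_year_month.txt actually contains.

```sql
-- Each partition is listed as year=<value>/month=<value>.
show partitions students_year_month_pt;

-- List only the partitions under one first-level value.
show partitions students_year_month_pt partition(year='2022');

-- Filter on both levels so that only one subdirectory is read.
select count(*) from students_year_month_pt
where year='2022' and month='02';

-- Mixed insert: year is static, month is dynamic. The static column
-- comes first in the partition clause; this form also works in strict mode.
insert into table students_year_month_pt partition(year='2022', month)
select id, name, age, gender, clazz, month
from students_year_month
where year='2022';
```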

Try multi-level partitioning yourself.

An article on partitions by 上单 (Alibaba Cloud): https://developer.aliyun.com/article/81775

posted @ 2022-02-20 10:57 阿伟宝座
