Hive—分区表

一、什么是分区表

　　分区表实际上就是对应一个HDFS文件系统上的独立的文件夹，该文件夹下是该分区所有的数据文件。Hive中的分区就是分目录，把一个大的数据集根据业务需要分割成小的数据集。在查询时通过WHERE子句中的表达式选择查询所需要的指定的分区，这样的查询效率会提高很多。

二、分区表的操作

2.1 分区表基本操作

1．引入分区表（需要根据日期对日志进行管理）

/user/hive/warehouse/log_partition/20170702/20170702.log

/user/hive/warehouse/log_partition/20170703/20170703.log

/user/hive/warehouse/log_partition/20170704/20170704.log

2．创建分区表语法

hive (default)> create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (month string)
row format delimited fields terminated by '\t';

3．加载数据到分区表中

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='202009');
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='202010');
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='202011');

　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　加载数据到分区表

图6-6 分区表

4．查询分区表中数据

单分区查询

hive (default)> select * from dept_partition where month='202009';
OK
dept_partition.deptno   dept_partition.dname    dept_partition.loc      dept_partition.month
10      CCOUNTING       1700    202009
20      RESEARCH        1800    202009
30      SALE    1900    202009
40      OPERATIONS      1700    202009
Time taken: 0.116 seconds, Fetched: 4 row(s)

多分区联合查询

hive (default)> select * from dept_partition where month='202009' union select * from  dept_partition where month='202010';

_u2.deptno _u2.dname _u2.loc _u2.month
10 CCOUNTING 1700 202009
10 CCOUNTING 1700 202010
20 RESEARCH 1800 202009
20 RESEARCH 1800 202010
30 SALE 1900 202009
30 SALE 1900 202010
40 OPERATIONS 1700 202009
40 OPERATIONS 1700 202010
Time taken: 769.082 seconds, Fetched: 8 row(s)

5．增加分区

创建单个分区

hive (default)> alter table dept_partition add partition(month='202112');

同时创建多个分区

hive (default)> alter table dept_partition add partition(month='202112') partition(month='202101');

如果分区已经存在报错。

hive (default)> alter table dept_partition add partition(month='202112') partition(month='202101');
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Partition already exists: Partition(values:[202112], dbName:default, tableName:dept_partition, createTime:0, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:deptno, type:int, comment:null), FieldSchema(name:dname, type:string, comment:null), FieldSchema(name:loc, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=       , field.delim=  }), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:null))

View Code

6．删除分区

删除单个分区

hive (default)> alter table dept_partition drop partition (month='202112');

同时删除多个分区

hive (default)> alter table dept_partition drop partition (month='202012'), partition (month='202101');

7．查看分区表有多少分区

hive (default)> show partitions dept_partition;
OK
partition
month=202009
month=202010
month=202011
month=202012
month=202101
Time taken: 0.891 seconds, Fetched: 5 row(s)

8．查看分区表结构

hive (default)> desc formatted dept_partition;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
deptno                  int                                         
dname                   string                                      
loc                     string                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
month                   string                                      
                 
# Detailed Table Information             
Database:               default                  
Owner:                  atguigu                  
CreateTime:             Tue Feb 23 10:36:21 CST 2021     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop102:9000/user/hive/warehouse/dept_partition         
Table Type:             MANAGED_TABLE            
Table Parameters:                
        transient_lastDdlTime   1614047781          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \t                  
        serialization.format    \t                  
Time taken: 0.379 seconds, Fetched: 34 row(s)

View Code

2.2 分区表注意事项

1．创建二级分区表

hive (default)> create table dept_partition2(deptno int, dname string, loc string) 
partitioned by (month string, day string) 
row format delimited fields terminated by '\t';

2．加载数据

（1）加载数据到二级分区表中

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition2 partition(month='202102',day='23');
Loading data to table default.dept_partition2 partition (month=202102, day=23)
Partition default.dept_partition2{month=202102, day=23} stats: [numFiles=1, numRows=0, totalSize=67, rawDataSize=0]
OK
Time taken: 1.903 seconds

（2）查询分区数据

hive (default)> select * from dept_partition2 where month='202102' and day='23';
OK
dept_partition2.deptno  dept_partition2.dname   dept_partition2.loc     dept_partition2.month   dept_partition2.day
10      CCOUNTING       1700    202102  23
20      RESEARCH        1800    202102  23
30      SALE    1900    202102  23
40      OPERATIONS      1700    202102  23
Time taken: 1.057 seconds, Fetched: 4 row(s)

3．把数据直接上传到分区目录上，让分区表和数据产生关联的三种方式

（1）方式一：上传数据后修复

上传数据

hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=202102/day=22;
hive (default)> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=202102/day=22;

查询数据（查询不到刚上传的数据）

hive (default)> select * from dept_partition2 where month='202102' and day='22';
OK
dept_partition2.deptno  dept_partition2.dname   dept_partition2.loc     dept_partition2.month   dept_partition2.day
Time taken: 0.259 seconds

执行修复命令

hive (default)> msck repair table dept_partition2;
OK
Partitions not in metastore:    dept_partition2:month=202102/day=22
Repair: Added partition to metastore dept_partition2:month=202102/day=22
Time taken: 0.749 seconds, Fetched: 2 row(s)

msck repair table 修复表分区，常用于手动复制目录到hive表的location下，此时Hive元数据中没有记录到该目录是hive的分区，所以查不到该分区数据

再次查询数据

hive (default)> select * from dept_partition2 where month='202102' and day='22';
OK
dept_partition2.deptno  dept_partition2.dname   dept_partition2.loc     dept_partition2.month   dept_partition2.day
10      CCOUNTING       1700    202102  22
20      RESEARCH        1800    202102  22
30      SALE    1900    202102  22
40      OPERATIONS      1700    202102  22
Time taken: 0.147 seconds, Fetched: 4 row(s)

（2）方式二：上传数据后添加分区

上传数据

hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=202102/day=24;
hive (default)> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=202102/day=24;

执行添加分区

hive (default)> alter table dept_partition2 add partition(month='202102',day='24');
OK
Time taken: 0.475 seconds

查询数据

hive (default)> select * from dept_partition2 where month='202102' and day='24';
OK
dept_partition2.deptno  dept_partition2.dname   dept_partition2.loc     dept_partition2.month   dept_partition2.day
10      CCOUNTING       1700    202102  24
20      RESEARCH        1800    202102  24
30      SALE    1900    202102  24
40      OPERATIONS      1700    202102  24
Time taken: 0.177 seconds, Fetched: 4 row(s)

（3）方式三：创建文件夹后load数据到分区

创建目录

hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=202102/day=19;

上传数据

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table dept_partition2 partition(month='202102',day='19'); 
Loading data to table default.dept_partition2 partition (month=202102, day=19)
Partition default.dept_partition2{month=202102, day=19} stats: [numFiles=1, numRows=0, totalSize=67, rawDataSize=0]
OK
Time taken: 0.993 seconds

查询数据

hive (default)> select * from dept_partition2 where month='202102' and day='19';
OK
dept_partition2.deptno  dept_partition2.dname   dept_partition2.loc     dept_partition2.month   dept_partition2.day
10      CCOUNTING       1700    202102  19
20      RESEARCH        1800    202102  19
30      SALE    1900    202102  19
40      OPERATIONS      1700    202102  19
Time taken: 0.176 seconds, Fetched: 4 row(s)

整理atguigu视频

posted @ 2021-02-24 09:03 cqyyck 阅读(670) 评论(0) 收藏举报

刷新页面返回顶部

cqyyck

勤能补拙，好记性不如烂笔头！

Hive—分区表

一、什么是分区表

二、分区表的操作

2.1 分区表基本操作

2.2 分区表注意事项

公告