hive 之 DDL 操作（五）

DDL 操作是用于操作对象和对象的属性，这种对象包括数据库本身，以及数据库对象，像：表、视图等等

1. 数据库

1.1 创建数据库

数据库在 HDFS 上的默认存储路径是 /user/hive/warehouse/*.db

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]  // 库的注释说明
[LOCATION hdfs_path]        // 库在hdfs上的路径
[WITH DBPROPERTIES (property_name=property_value, ...)]; // 库的属性

示例：

create database  if not exists mydb2 
comment 'this is my db' 
location 'hdfs://hadoop1:9000/mydb2' 
with dbproperties('ownner'='rose','tel'='12345','department'='development');

1.2 数据库查询

show databases;
show databases like 'db_hive*';		// 过滤
use databases;	// 使用数据库
desc database db_hive;	// 显示数据库信息
desc database extended db_hive;		// 显示数据库详细信息

1.3 修改数据库

alter table 命令可以修改数据库的 DBPROPERTIES 的键值对，但是数据库的元数据信息不可更改，如：数据库名称、所在目录位置：

// 同名的属性值会覆盖，之前没有的属性会新增
alter database mydb2 set dbproperties('ownner'='tom','empid'='10001');
				
// 查询修改
desc database extended mydb2;

1.4 删除数据库

// 删除空数据库，只能删空库
drop database if exists db_hive2;

hive> drop database db_hive cascade;

2. 数据表

2.1 表分类

管理表：删除操作时，会删除元数据和数据本身
外部表：删除操作时，只会删除元数据
分区表：在建表时，指定了PARTITIONED BY，分区的目的是为了就数据，分散到多个子目录中，在执行查询时，可以只选择查询某些子目录中的数据，加快查询效率
分桶表：建表时指定了 CLUSTERED BY，区别于分区表，把数据分散到多个文件中。可以结合 hive 提供的抽样查询，只查询指定桶的数据

2.2 创建表

创建表的时候，hive 会：

在 hdfs 生成表的路径
向 MySQL 的 metastone 库中插入两条表的信息（元数据）

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name 
[(col_name data_type [COMMENT col_comment], ...)] 	// 字段名称、类型、注释
[COMMENT table_comment] 	// 表注释
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 	// 创建分区表
[CLUSTERED BY (col_name, col_name, ...) 		// 创建分桶表
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
[ROW FORMAT row_format] 	// 表中数据每行的格式，定义数据字段的分隔符，集合元素的分隔符等
[STORED AS file_format] 	//表中的数据要以哪种文件格式来存储，默认为TEXTFILE（文本文件）
[LOCATION hdfs_path]		// 指定表在HDFS上的存储位置

可选字段说明

EXTERNAL：创建外部表，在建表的同时指定实际数据路径 location，hive 创建内部表时，会将数据移动到数据仓库所指向的路径；若是外部表，仅记录数据所在路径，不对数据位置做任何改变。删除表时，内部表元数据、数据都会被删除，而外部表只会删除元数据
COMMENT：为表和列添加注释
PARTITIONED BY：创建分区表
CLUSTERED BY：创建分桶表
SORTED BY：排序，不常用
ROW FORMAT ：数据行格式

DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]

  [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] 

  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

用户在建表的时候可以自定义SerDe或者使用自带的SerDe。如果没有指定ROW FORMAT 或者ROW FORMAT DELIMITED，将会使用自带的SerDe。在建表的时候，用户还需要为表指定列，用户在指定表的列的同时也会指定自定义的SerDe，Hive通过SerDe确定表的具体的列的数据。

SerDe是Serialize/Deserilize的简称，目的是用于序列化和反序列化。

STORED AS：指定存储文件类型，如：sequencefile、rcfile 等，默认 textfile
LOCATION ：指定表在HDFS上的存储位置
LIKE 允许用户复制已存在的表结构，但是不会复制数据

示例

// 创建表
create table if not exists student2(
id int, name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/student2';

// 根据查询结果创建表（查询的结果会添加到新创建的表中）
create table if not exists student3 as select id, name from student;

// 根据已经存在的表结构创建表
create table if not exists student4 like student;

// 查看表类型
hive (hive_1)> desc formatted student;
OK
col_name        data_type       comment
# col_name              data_type               comment

id                      int
name                    string

# Detailed Table Information
Database:               hive_1
Owner:                  hadoop
CreateTime:             Sun Nov 21 18:13:32 CST 2021
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://hadoop1:9000/user/hive/warehouse/hive_1.db/student
Table Type:             MANAGED_TABLE	// 表示为管理表

2.2.1 管理表

管理表也称内部表，默认创建的就是管理表，特点就是 删除表的同时也会删除元数据，与之对应的外部表就只会删除元数据，而不删除数据

2.2.2 外部表

实际生产中使用的大部分是外部表

// 创建一个外部表，有 EXTERNAL 关键字就是外部表
create EXTERNAL table if not exists student2(
id int, name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/student2';

// 查询表类型
hive (default)> desc formatted dept;
Table Type:             EXTERNAL_TABLE

2.2.3 管理表和外部表的相关转换

// 修改内部表 test1 为外部表
alter table test1 set tblproperties('EXTERNAL'='TRUE');

// 修改外部表 test1 为内部表
alter table student2 set tblproperties('EXTERNAL'='FALSE');

注意：'EXTERNAL'='TRUE')和('EXTERNAL'='FALSE')为固定写法，区分大小写！

2.3 分区表

分区表实际是在 hdfs 上创建多个独立的文件夹，hive 的分区即是分目录。将一个大的数据集分隔为多个小的数据集，在查询时通过 where 子句来指定查询分区，而不是全表查询，其目的是为了提高效率。

语法：

// 比普通建表语句多了一个 partitioned by

// 一级分区表，其中 month 为分区字段，string 为其类型
create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (month string)
row format delimited fields terminated by '\t';

// 多级分区表，有多个分区字段 area、province
create external table if not exists default.deptpart2(
deptno int,
dname string,
loc int
)
PARTITIONED BY(area string, province string)
row format delimited fields terminated by '\t';

注意

如果表是个分区表，在导入数据时，必须指定向哪个分区目录导入数据，否则会报错
如果表是多级分区表，在导入数据时，数据必须位于最后一级分区的目录

2.3.1 创建分区

方法一：alter table 表名 add partition(分区字段名=分区字段值);
- 分区创建成功后，会在 hdfs 生成分区路径、也会在 partitions 表中生成分区的元数据
- 创建多个分区：alter table dept_partition add partition(month='201705') partition(month='201704');
方法二：load 命令向分区加载数据，若分区不存在，则会自动创建
方法三：若数据已上传到 hdfs，可使用修复分区命令的方式来自动生成分区的元数据：msck repair table 表名;

2.3.2 其他操作

// 查看分区表有多少分区
show partitions 表名

// 查看分区表结构
desc formatted dept_partition;

// 删除分区
alter table dept_partition drop partition (month='201704');

// 删除多个分区
alter table dept_partition drop partition (month='201705'), partition (month='201706');

// 多分区联合查询
select * from dept_partition where month='201709'
    union
    select * from dept_partition where month='201708'
    union
    select * from dept_partition where month='201707';

2.3.3 练习一

需求：创建一个一级分区表

1、创建分区表：

create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (area string)
row format delimited fields terminated by '\t';

// 查看表描述信息
hive (default)> desc dept_partition;
OK
col_name        data_type       comment
deptno                  int
dname                   string
loc                     string
area                    string

# Partition Information
# col_name              data_type               comment

area                    string
Time taken: 0.398 seconds, Fetched: 9 row(s)

// 查看分区信息
hive (default)> show partitions dept_partition;
OK
partition
area=huanan
Time taken: 0.46 seconds, Fetched: 1 row(s)

2、加载数据：

// 将本地数据加载到分区目录 area='huanan'，
load data local inpath '/home/hadoop/apps/big_source/files/dept.txt' into table default.dept_partition partition(area='huanan');

// hdfs 上数据存储路径
/user/hive/warehouse/dept_partition/area=huanan

3、查询：

// 原本表只有三列，加了分区信息后，变为 4 列，查询时可用 area=huanan 条件来查询
0: jdbc:hive2://hadoop1:10000>  select * from dept_partition where area='huanan';
+------------------------+-----------------------+---------------------+----------------------+--+
| dept_partition.deptno  | dept_partition.dname  | dept_partition.loc  | dept_partition.area  |
+------------------------+-----------------------+---------------------+----------------------+--+
| 10                     | ACCOUNTING            | 1700                | huanan               |
| 20                     | RESEARCH              | 1800                | huanan               |
| 30                     | SALES                 | 1900                | huanan               |
| 40                     | OPERATIONS            | 1700                | huanan               |
+------------------------+-----------------------+---------------------+----------------------+--+
4 rows selected (3.017 seconds)

select * from dept_partition where area='huanan'
    union
    select * from dept_partition where area='huazhong'

2.3.4 练习二修复分区信息

1、创建 hdfs 目录，并上传数据

// 直接 put，可能会导致修复失败
[hadoop@hadoop1 files]$ hadoop fs -mkdir -p /user/hive/warehouse/dept_partition/area=zhong
[hadoop@hadoop1 files]$ hadoop fs -put dept.txt /user/hive/warehouse/dept_partition/area=zhong

// dept.txt
10	ACCOUNTING	1700
20	RESEARCH	1800
30	SALES	1900
40	OPERATIONS	1700

2、查看分区信息、修复分区：

0: jdbc:hive2://hadoop1:10000> show partitions dept_partition;
+--------------+--+
|  partition   |
+--------------+--+
| area=huanan  |
+--------------+--+
1 row selected (0.464 seconds)

0: jdbc:hive2://hadoop1:10000> msck repair table dept_partition;
No rows affected (0.375 seconds)

0: jdbc:hive2://hadoop1:10000> show partitions dept_partition;
+--------------+--+
|  partition   |
+--------------+--+
| area=huanan  |
| area=zhong   |
+--------------+--+
2 rows selected (0.436 seconds)

0: jdbc:hive2://hadoop1:10000> select * from dept_partition where area='zhong';
+------------------------+-----------------------+---------------------+----------------------+--+
| dept_partition.deptno  | dept_partition.dname  | dept_partition.loc  | dept_partition.area  |
+------------------------+-----------------------+---------------------+----------------------+--+
| 10                     | ACCOUNTING            | 1700                | zhong                |
| 20                     | RESEARCH              | 1800                | zhong                |
| 30                     | SALES                 | 1900                | zhong                |
| 40                     | OPERATIONS            | 1700                | zhong                |
+------------------------+-----------------------+---------------------+----------------------+--+
4 rows selected (0.454 seconds)

2.4 分桶表

2.4.1 分桶表创建

在建表的时指定 CLUSTERED BY，那么这个表就是分桶表。区别于分区表，分桶表将数据分散到多个文件，可以结合 hive 提供的抽样查询，只查询指定桶的数据，另外还可以提高 join 查询效率

分桶原理跟 MR 中的 HashPartitioner 的原理一模一样：

MR ：按照 key 的 hash 值去模除以reductTask的个数
hive：按照分桶字段的 hash 值去模除以分桶个数，如：1001、1005、1009 的 hash 值是它本身，三个值对分桶个数取模都是 1，1001/1005/1009 % 4 = 1，那么这三条数据会被分在同一个文件中

// col_name 分桶字段
[CLUSTERED BY (col_name, col_name, ...) 
		分桶的字段，是从表的普通字段中来取
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]

示例

1、创建分桶表：

// 创建 4 个桶
create table stu_buck(id int, name string)
clustered by(id) 
SORTED BY (id desc)
into 4 buckets
row format delimited fields terminated by '\t';

2、设置强制分桶、排序（必须）：

set hive.enforce.bucketing=true;	// 需要打开强制分桶开关
set hive.enforce.sorting=true;		// 需要打开强制排序开关

3、加载数据：

// 使用 load 方式加载数据，发现并没有分桶，这是因为向分桶表导入数据时，必须运行MR程序，才能实现分桶操作，需要使用 insert 方式插入数据
load data local inpath '/home/hadoop/apps/big_source/files/student.txt' into table stu_buck;

创建临时表，通过查询临时表的方式向分桶表插入数据：

// 1.  从 hdfs 或本地磁盘中 load 数据，导入中间表
create table stu_buck_tmp(id int, name string)
row format delimited fields terminated by '\t';

load data local inpath '/home/hadoop/apps/big_source/files/student.txt' into table stu_buck_tmp;

// 通过从中间表查询的方式的完成数据导入
insert into table stu_buck select id, name from stu_buck_tmp;

总结

向分桶表插入数据会运行 mr 程序，需要使用 insert 来插入数据
创建分桶表时，需要打开强制分桶的开关
需要确保 reduce 的数量与表中的 bucket 数量一致，有两种方式
- 让 hive 强制分桶，自动按照分桶表的bucket 进行分桶（推荐）
- 手动指定 reduce 数量，并在 SELECT 后增加CLUSTER BY 语句
```
set mapreduce.job.reduces = num;
set mapreduce.reduce.tasks = num;
```

2.4.2 抽样查询

对于一个大的数据集来说，有时并不需要查询全部，而是抽样部分有代表性的结果即可

格式：

select * from 分桶表 tablesample(bucket x out of y on 分桶表分桶字段);

参数详解：

y：必须是 table 总 bucket（即桶）数目的倍数或者因子，hive 会根据 y 的大小来决定抽样的比例。如：bucket=4，当y=2 时，表示抽取 (4/2) 个 bucket 的数据，当y=8时，抽取(4/8=)1/2个bucket的数据
x：表示从第几个 bucket 开始抽取，如：bucket=4，tablesample(bucket 1 out of 2)，表示总共抽取（4/2=）2个 bucket 的数据，抽取第1(x)个和第3(x+y)个 bucket 的数据。

翻译过来就是：从第 x 桶开始抽样，每间隔 y 桶抽一桶，知道抽满 bucket/y 桶

// 从第1桶(0号桶)开始抽，抽第x+y*(n-1)，一共抽2桶：0、2 号桶
0: jdbc:hive2://hadoop1:10000> select * from stu_buck tablesample(bucket 1 out of 2 on id);
+--------------+----------------+--+
| stu_buck.id  | stu_buck.name  |
+--------------+----------------+--+
| 1016         | ss16           |
| 1012         | ss12           |
| 1008         | ss8            |
| 1004         | ss4            |
| 1014         | ss14           |
| 1010         | ss10           |
| 1006         | ss6            |
| 1002         | ss2            |
+--------------+----------------+--+

// 从第 1 桶开始抽，抽 4 桶：0、1、2、3 桶
0: jdbc:hive2://hadoop1:10000> select * from stu_buck tablesample(bucket 1 out of 1 on id);
+--------------+----------------+--+
| stu_buck.id  | stu_buck.name  |
+--------------+----------------+--+
| 1016         | ss16           |
| 1012         | ss12           |
| 1008         | ss8            |
| 1004         | ss4            |
| 1013         | ss13           |
| 1009         | ss9            |
| 1005         | ss5            |
| 1001         | ss1            |
| 1014         | ss14           |
| 1010         | ss10           |
| 1006         | ss6            |
| 1002         | ss2            |
| 1015         | ss15           |
| 1011         | ss11           |
| 1007         | ss7            |
| 1003         | ss3            |
+--------------+----------------+--+

// 从第 2 桶开始抽，抽 1 桶，即第 2 桶
0: jdbc:hive2://hadoop1:10000> select * from stu_buck tablesample(bucket 2 out of 4 on id);
+--------------+----------------+--+
| stu_buck.id  | stu_buck.name  |
+--------------+----------------+--+
| 1013         | ss13           |
| 1009         | ss9            |
| 1005         | ss5            |
| 1001         | ss1            |
+--------------+----------------+--+

// 从第 2 桶开始抽，抽 0.5 桶
0: jdbc:hive2://hadoop1:10000> select * from stu_buck tablesample(bucket 2 out of 8 on id);
+--------------+----------------+--+
| stu_buck.id  | stu_buck.name  |
+--------------+----------------+--+
| 1009         | ss9            |
| 1001         | ss1            |
+--------------+----------------+--+

以下为 4 个桶数据分布：

// 第 1 桶
[hadoop@hadoop1 apps]$ hadoop fs -cat /user/hive/warehouse/stu_buck/000000_0
1016	ss16
1012	ss12
1008	ss8
1004	ss4

// 第 2 桶
[hadoop@hadoop1 apps]$ hadoop fs -cat /user/hive/warehouse/stu_buck/000001_0
1013	ss13
1009	ss9
1005	ss5
1001	ss1

// 第 3 桶
[hadoop@hadoop1 apps]$ hadoop fs -cat /user/hive/warehouse/stu_buck/000002_0
1014	ss14
1010	ss10
1006	ss6
1002	ss2

// 第 4 桶
[hadoop@hadoop1 apps]$ hadoop fs -cat /user/hive/warehouse/stu_buck/000003_0
1015	ss15
1011	ss11
1007	ss7
1003	ss3

注意：x 的值必须小于等于 y 的值

2.5 修改表 & 删除表

重命名表：

ALTER TABLE table_name RENAME TO new_table_name

增加/修改/替换列信息:

// 更新列
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]

// 替换列
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...) 

// 添加一列
alter table dept_partition add columns(deptdesc string);

// 更新列
alter table dept_partition change column deptdesc desc int;

// 替换列
alter table dept_partition replace columns(deptno string, dname string, loc string);

// 改表的属性
alter table 表名 set tblproperties(属性名=属性值)

删除表：

drop table tableName;
truncate table tableName   // 清空管理表，只清空数据

3. 错误集锦

1、启动 hive 报错 org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. Application

背景：使用 insert 往分桶表插入数据时报以上错误
原因：hadoop 内存不足
解决：调整内存，修改 yarn-site.xml，重启 ResourceManager 和 NodeManager

<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>3.0</value>
</property>

参考：

2、更换 tez 引擎后，insert 数据时报错：org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.

原因：namenode、datanode 时间同步问题
解决：datanode 与 namenode 进行时间同步，每个节点都需要执行

cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
ntpdate pool.ntp.org

posted @ 2021-12-05 20:31 Hubery_Jun 阅读(147) 评论(0) 收藏举报

刷新页面返回顶部

Hubery_Jun