Hive Data Types and Database Operations

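3. Hive Data Types

3.1 Primitive Data Types

Hive data type	Java data type	Size and description
TINYINT	byte	1-byte signed integer
SMALLINT	short	2-byte signed integer
INT	int	4-byte signed integer
BIGINT	long	8-byte signed integer
FLOAT	float	Single-precision floating-point number
DOUBLE	double	Double-precision floating-point number
STRING	String	Character sequence; literals may use single or double quotes
TIMESTAMP		Timestamp (date and time)
BINARY		Byte array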
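
As a small illustration of these types, the sketch below declares one column per primitive type and casts a string to a number. The table name type_demo and its column names are hypothetical and do not come from the original text.

```sql
-- Hypothetical table: one column for each primitive type listed above.
create table if not exists type_demo(
    tiny_col   tinyint,
    small_col  smallint,
    int_col    int,
    big_col    bigint,
    float_col  float,
    double_col double,
    str_col    string,
    ts_col     timestamp,
    bin_col    binary
);

-- Explicit conversion from STRING to INT with cast();
-- a string that is not a valid number yields NULL instead of an error.
select cast('1' as int) + 2, cast('abc' as int);
```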

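3.2 Collection Data Types

Hive data type	Description	Syntax example
STRUCT	Similar to a struct in C; fields are accessed with dot notation	struct<street:string, city:string>
MAP	A set of key-value pairs; values are accessed by key	map<string, int>
ARRAY	An ordered list of elements of one type; elements are accessed by zero-based index	array<string>

// Raw data: complicated.txt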
zhangsan,lisi_wangwu,xiao zhang:20_zhangfei:22,zhong guan cun_beijing

// Create table statement
create table studentInfo(
    name string,
    friends array<string>,
    children map<string, int>,
    address struct<street:string, city:string>
)
row format delimited 
fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';

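// Query (array indexes start at 0, so friends[1] is the second friend; the map is looked up by one of its keys)
select friends[1],children['zhangfei'],address.street from studentInfo;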
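
To try the query, the sample file has to be loaded first. The sketch below assumes complicated.txt sits at the hypothetical local path /tmp/complicated.txt, and then shows a few built-in functions that are commonly used with collection columns.

```sql
-- Hypothetical local path; adjust to wherever complicated.txt is stored.
load data local inpath '/tmp/complicated.txt' into table studentInfo;

-- size() counts collection elements; map_keys() lists the keys of a map.
select name,
       size(friends)      as friend_count,
       map_keys(children) as child_names,
       address.city       as city
from studentInfo;

-- explode() with lateral view turns each array element into its own row.
select name, friend
from studentInfo
lateral view explode(friends) f as friend;
```

For the sample row, friends holds lisi and wangwu, so the second query returns one row per friend.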

4. DDL (Data Definition)

4.1 Creating a Database

Create a database: create database if not exists db_hive;

4.2 Querying Databases

List databases: show databases;
Filter databases: show databases like 'pattern';
Show database information: desc database db_hive;
Show detailed database information: desc database extended db_hive;

4.3 Altering a Database

Add a property: alter database db_hive set dbproperties('CTtime'='2019-06-21');

4.4 Dropping a Database

Drop an empty database: drop database db_hive;
Drop a non-empty database: drop database db_hive cascade;
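
The database statements in sections 4.1 to 4.4 can be strung together into one short lifecycle. This is only a sketch: the database name db_demo, its comment, its HDFS location, and the property key owner_team are all hypothetical.

```sql
-- Hypothetical database name and HDFS location.
create database if not exists db_demo
comment 'scratch database for testing'
location '/user/hive/warehouse/db_demo.db';

-- Attach a custom key-value property, then confirm it is stored.
alter database db_demo set dbproperties('owner_team'='data');
desc database extended db_demo;

-- Switch into the database, switch back, then clean up.
use db_demo;
use default;
drop database if exists db_demo cascade;
```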

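4.5 Creating Tables

4.5.1 Managed Tables (Internal Tables, MANAGED_TABLE)

Copy both the structure and the data of another table: create table student001 as select * from student;
Copy only the structure of another table: create table student001 like student;
Show table information: desc student;
Show detailed information, including whether the table is managed or external: desc formatted student;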
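
The text never defines the student table itself. The sketch below assumes a two-column definition (id, name) and contrasts the two copy styles; the names student_copy and student_empty are hypothetical.

```sql
-- Assumed definition of the source table used throughout this section.
create table if not exists student(id int, name string)
row format delimited fields terminated by '\t';

-- CTAS copies structure and data (it runs a query job).
create table if not exists student_copy as select * from student;

-- LIKE copies only the structure; the new table starts empty.
create table if not exists student_empty like student;

-- Look for "Table Type: MANAGED_TABLE" in the output.
desc formatted student_copy;
```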

4.5.2 External Tables (EXTERNAL_TABLE)

Hive does not fully own the data of an external table. Dropping an external table does not delete the data; only the metadata describing the table is deleted.
Create an external table: create external table dept(deptid int, dname string, loc int) row format delimited fields terminated by '\t';
Create an external table: create external table if not exists default.emp(empno int, ename string, job string, mgr int, hiredate string, sal double, comm double, deptno int) row format delimited fields terminated by '\t';

// Raw data: dept.txt (tab-separated)
10	ACCOUNTING	1700
20	RESEARCH	1800
30	SALES	1900
40	OPERATIONS	1700

// Raw data: emp.txt (tab-separated; an empty field is a missing mgr or comm value)
7369	SMITH	CLERK	7902	1980-12-17	800.00		20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.00	300.00	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.00	500.00	30
7566	JONES	MANAGER	7839	1981-4-2	2975.00		20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.00	1400.00	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.00		30
7782	CLARK	MANAGER	7839	1981-6-9	2450.00		10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.00		20
7839	KING	PRESIDENT		1981-11-17	5000.00		10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.00	0.00	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.00		20
7900	JAMES	CLERK	7698	1981-12-3	950.00		30
7902	FORD	ANALYST	7566	1981-12-3	3000.00		20
7934	MILLER	CLERK	7782	1982-1-23	1300.00		10
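
A possible way to exercise the two external tables is sketched below. The local paths /tmp/dept.txt and /tmp/emp.txt are hypothetical, and the final listing assumes the table lives under the default warehouse directory used elsewhere in this document.

```sql
-- Hypothetical local paths for the two sample files.
load data local inpath '/tmp/dept.txt' into table default.dept;
load data local inpath '/tmp/emp.txt'  into table default.emp;

-- Join employees to their departments.
select e.ename, e.job, d.dname
from default.emp e
join default.dept d on e.deptno = d.deptid;

-- Dropping an external table removes only the metadata;
-- the files should remain under the table's HDFS directory.
drop table default.dept;
dfs -ls /user/hive/warehouse/dept;
```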

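4.5.3 Converting Between Managed and External Tables

Assuming "student002" is an external table, change it to a managed table: alter table student002 set tblproperties('EXTERNAL'='FALSE');
Change a managed table to an external table: alter table student002 set tblproperties('EXTERNAL'='TRUE');
Note: ('EXTERNAL'='TRUE') and ('EXTERNAL'='FALSE') are fixed forms and are case-sensitive!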
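
One way to confirm that a conversion took effect is to check the table type after each change. The sketch assumes student002 has the columns (id, name); the original does not give its definition.

```sql
-- Assumed definition; skipped if student002 already exists.
create table if not exists student002(id int, name string);

alter table student002 set tblproperties('EXTERNAL'='TRUE');
-- Expect "Table Type: EXTERNAL_TABLE" in the output.
desc formatted student002;

alter table student002 set tblproperties('EXTERNAL'='FALSE');
-- Expect "Table Type: MANAGED_TABLE" in the output.
desc formatted student002;
```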

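4.6 Partitioned Tables

A partition in Hive is simply a subdirectory. Each partition of a partitioned table corresponds to a separate directory on HDFS. When a query's WHERE clause selects only the partitions it needs, Hive reads just those directories, which improves query efficiency.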

4.6.1 Basic Operations on Partitioned Tables

Requirement: manage logs by date.
Create a partitioned table: create table stu_partition(id int, name string) partitioned by (month string) row format delimited fields terminated by '\t';
Load data: load data local inpath 'file_path' into table stu_partition partition(month="20190618");
Query a partition: select * from stu_partition where month="20190618";
Add multiple partitions (separated by spaces): alter table stu_partition add partition(month="20190619") partition(month="20190620");
Drop one partition: alter table stu_partition drop partition(month="20190620");
Drop multiple partitions (separated by commas): alter table stu_partition drop partition(month="20190620"),partition(month="20190621");
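
After adding or dropping partitions it helps to inspect what Hive has registered. The statements below are a sketch against the stu_partition table defined above; the partition values are the ones used in this section.

```sql
-- List every partition currently registered for the table.
show partitions stu_partition;

-- Filtering on the partition column reads only the matching directories.
select * from stu_partition where month in ('20190618', '20190619');

-- Show the metadata of a single partition, including its HDFS location.
desc formatted stu_partition partition(month='20190618');
```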

4.6.2 Notes on Partitioned Tables

Create a two-level partitioned table (under a new name, because stu_partition already exists): create table stu_partition2(id int, name string) partitioned by (month string, day string) row format delimited fields terminated by '\t';
Load data: load data local inpath 'file_path' into table stu_partition2 partition(month="201906",day="18");
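
A two-level partition is queried by filtering on both partition columns. The sketch below is self-contained: the table name stu_partition2 and the local path /tmp/student.txt are assumptions made for illustration.

```sql
-- Assumed table name; "if not exists" makes the sketch safe to re-run.
create table if not exists stu_partition2(id int, name string)
partitioned by (month string, day string)
row format delimited fields terminated by '\t';

-- Hypothetical local path.
load data local inpath '/tmp/student.txt'
into table stu_partition2 partition(month='201906', day='18');

-- Each partition column becomes one directory level: .../month=201906/day=18
select * from stu_partition2 where month='201906' and day='18';
show partitions stu_partition2;
```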

4.6.3 Associating Uploaded Data with a Partitioned Table

Method 1:
- Create the partition directory on HDFS: dfs -mkdir -p /user/hive/warehouse/stu_partition/month=20190719;
- Upload the data to HDFS: dfs -put local_file_path/student.txt /user/hive/warehouse/stu_partition/month=20190719;
- Run the repair command: msck repair table stu_partition;
Method 2:
- Create the partition directory on HDFS: dfs -mkdir -p /user/hive/warehouse/stu_partition/month=20190720;
- Upload the data to HDFS: dfs -put local_file_path/student.txt /user/hive/warehouse/stu_partition/month=20190720;
- Add the partition explicitly: alter table stu_partition add partition(month="20190720");
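
A further option, not listed above, is to let load data register the partition itself, since loading into a partition that does not exist yet creates it. The sketch uses a hypothetical local path and partition value, and ends with the checks that apply to any of the methods.

```sql
-- Hypothetical local path and partition value.
load data local inpath '/tmp/student.txt'
into table stu_partition partition(month='20190721');

-- Whichever method is used, confirm that the partition is registered
-- and that its rows are visible.
show partitions stu_partition;
select * from stu_partition where month='20190721';
```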

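4.7 Altering Tables

Rename a table: alter table old_table_name rename to new_table_name;
Rename a column: alter table student001 change column old_column_name new_column_name column_type;
Add multiple columns: alter table student001 add columns (gender string, description string);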
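
A concrete version of these statements might look as follows. It assumes student001 has the columns (id int, name string), which the original does not state, and the new column name stu_name is hypothetical.

```sql
-- Rename column "name" to "stu_name", keeping its type.
alter table student001 change column name stu_name string;

-- Append two new columns after the existing ones.
alter table student001 add columns (gender string, description string);

-- Replace the whole column list (metadata only; data files are untouched).
alter table student001 replace columns (id int, stu_name string);

-- Check the resulting schema.
desc student001;
```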

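5. DML (Data Manipulation)

5.1 Importing Data

Load data into a table (Load): load data [local] inpath 'file_path' [overwrite] into table student [partition(partcol1=val1, ...)];
- "load data": loads data;
- "local": loads from the local file system into the Hive table; if omitted, loads from HDFS;
- "inpath": the path of the data to load;
- "overwrite": overwrites the data already in the table; if omitted, the data is appended;
- "into table": names the target table;
- "student": the specific table;
- "partition": loads into the specified partition;
Insert data into a table from a query (Insert)
- Insert the query result of a single table: insert into table table_name partition(month='20190617') select * from student;
- Insert the query results of multiple tables:
Create a table from a query result: create table if not exists student003 as select id, name from student;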
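
The multi-table bullet above gives no statement. One plausible reading is Hive's multi-insert form, in which a single scan of a source table feeds several insert clauses. The sketch below follows that reading; the target table stu_backup is hypothetical, and stu_partition is the table from section 4.6.1.

```sql
-- Hypothetical target table with the same layout as stu_partition.
create table if not exists stu_backup like stu_partition;

-- The source is scanned once and feeds both inserts.
from stu_partition
insert overwrite table stu_backup partition(month='20190618')
    select id, name where month='20190618'
insert overwrite table stu_backup partition(month='20190619')
    select id, name where month='20190619';
```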

5.1.1 Specifying the Data Path with Location at Table Creation

Create a table with an explicit location on HDFS: create table if not exists student006(id int, name string) row format delimited fields terminated by '\t' location '/user/hive/warehouse/student007';
Upload data to that HDFS location: hadoop fs -put local_path /user/hive/warehouse/student007;
Query the data: select * from student006;

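5.2 Exporting Data

5.2.1 Exporting with Insert

Export query results to the local file system: insert overwrite local directory 'local_path' select * from student;
Export query results to the local file system with formatting: insert overwrite local directory 'local_path' row format delimited fields terminated by '\t' select * from student;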
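
Omitting the local keyword sends the result to an HDFS directory instead of the local file system. The directory below is a hypothetical path chosen for illustration.

```sql
-- Hypothetical HDFS directory; its existing contents are overwritten.
insert overwrite directory '/user/hive/export/student'
row format delimited fields terminated by '\t'
select * from student;

-- Inspect the exported files from the Hive CLI.
dfs -ls /user/hive/export/student;
```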

5.2.2 Exporting to the Local File System with Hadoop Commands

dfs -get /user/hive/warehouse/student/month=201709/student.txt local_path;

5.2.3 Exporting with the Hive Shell

bin/hive -e 'select * from default.student;' > local_path;

5.2.4 Exporting to HDFS with Export

export table default.student to '/user/hive/warehouse/export/student';

5.2.5 Importing Data into a Hive Table with Import

The data must first be exported with EXPORT before it can be imported:
import table student2 partition(month='201907') from '/user/hive/warehouse/export/student';
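
Export and import are easiest to understand as a round trip. In the sketch below the export directory and the table name student_imported are hypothetical, and the export target is assumed not to contain data already.

```sql
-- Export writes both the table metadata and its data files.
export table default.student to '/user/hive/export/student_exp';

-- Import under a new name; Hive creates the table from the exported metadata.
import table student_imported from '/user/hive/export/student_exp';

select * from student_imported;
```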

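5.3 Clearing Table Data

truncate table student;
Note: truncate works only on managed tables; it cannot clear the data of an external table.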

posted @ 2019-06-15 22:36 小a的软件思考
