013_Hive
01: Hive Basic Concepts
Hive translates HQL into MapReduce programs:
Hive's data is stored on HDFS
Hive's analysis is implemented underneath by MapReduce
The resulting jobs run on YARN
Hive architecture
Hive is typically used for data analysis where real-time responses are not required.
Hive compared with a traditional database:
Query language: HQL is SQL-like, so developers who know SQL can pick it up quickly
Data storage location: Hive stores its data on HDFS; a database stores data on local disk
Data update: Hive is read-mostly and rewriting data is discouraged; a database supports frequent updates
Index: Hive has no conventional indexes and relies on scans parallelized by MapReduce; a database builds indexes for fast point lookups
Execution: Hive runs queries as MapReduce jobs; a database uses its own execution engine
Execution latency: Hive has high latency due to job startup overhead and full scans; a database answers small queries with low latency
02 Hive Installation Preparation
Version and new features
Preparation
Start hadoop
#Start the DataNode and NameNode on hadoop102
start-dfs.sh
#Start the ResourceManager on hadoop103
start-yarn.sh
Hadoop-ecosystem web UI addresses and ports
1. HDFS web UI: 50070
2. YARN web UI: 8088
3. JobHistory Server web UI: 19888
4. Zookeeper service port: 2181
5. MySQL service port: 3306
6. HiveServer2 port: 10000
7. Kafka service port: 9092
8. Azkaban web UI: 8443
9. HBase web UI: 16010 (60010 in older versions)
10. Spark web UI: 8080
11. Spark master URL port: 7077
Installation
tar -zxvf apache-hive-1.2.1-bin.tar.gz -C /home/soft/
mv apache-hive-1.2.1-bin/ hive/
Configuration file: cat hive-env.sh
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Set Hive and Hadoop environment variables here. These variables can be used
# to control the execution of Hive. It should be used by admins to configure
# the Hive installation (so that users do not have to set environment variables
# or set command line parameters to get correct behavior).
#
# The hive service being invoked (CLI/HWI etc.) is available via the environment
# variable SERVICE
# Hive Client memory usage can be an issue if a large number of clients
# are running at the same time. The flags below have been useful in
# reducing memory usage:
#
# if [ "$SERVICE" = "cli" ]; then
# if [ -z "$DEBUG" ]; then
# export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit"
# else
# export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
# fi
# fi
# The heap size of the jvm stared by hive shell script can be controlled via:
#
# export HADOOP_HEAPSIZE=1024
#
# Larger heap size may be required when running queries over large number of files or partitions.
# By default hive shell scripts use a heap size of 256 (MB). Larger heap size would also be
# appropriate for hive server (hwi etc).
# Set HADOOP_HOME to point to a specific hadoop install directory
# Hadoop installation directory
HADOOP_HOME=/home/soft/hadoop-2.7.2
# Hive Configuration Directory can be controlled by:
# Hive configuration directory
export HIVE_CONF_DIR=/home/soft/hive/conf
# Folder containing extra ibraries required for hive compilation/execution can be controlled by:
# export HIVE_AUX_JARS_PATH=
Create the warehouse directories Hive needs on HDFS
bin/hadoop fs -mkdir /tmp
bin/hadoop fs -chmod g+w /tmp
bin/hadoop fs -mkdir -p /user/hive/warehouse
bin/hadoop fs -chmod g+w /user/hive/warehouse
Start hive
bin/hive   -- launches the Hive CLI
Show databases
show databases;
Switch to the default database
use default;
List the tables in default
show tables;
Create a table
create table student(id int,name string);
create table student(id int,name string) row format delimited fields terminated by '\t';   -- with a tab field delimiter
Show the table structure
desc student;
Insert data
insert into student values(001,'caoxiaoyang');
The insert runs as a MapReduce job and writes the data to HDFS
Query the table
select * from student;
Exit hive
quit;
03 Loading a Local File into Hive
Create the table
create table student(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
load data local inpath '/home/soft/datas/student.txt' into table student;
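A quick query confirms the load (a sketch; it assumes student.txt holds tab-separated id/name pairs):
select * from student;   -- every line of student.txt should come back as one row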
04 Basic Database Usage (MySQL)
Install MySQL
#check whether mysql is already installed
rpm -qa|grep mysql
#uninstall the bundled mysql-libs
rpm -e --nodeps mysql-libs-5.1.73-7.el6.x86_64
#unzip mysql-libs.zip
unzip mysql-libs.zip
ls
mysql-libs.zip
mysql-libs
#enter mysql-libs
cd ./mysql-libs
#list the mysql-libs files
cd /home/soft/tar/mysql-libs
[root@hadoop102 mysql-libs]# ll
total 76052
-rw-r--r--. 1 root root 18509960 Mar 26 2015 MySQL-client-5.6.24-1.el6.x86_64.rpm
drwxr-xr-x. 4 root root 4096 Oct 23 2013 mysql-connector-java-5.1.27
-rw-r--r--. 1 root root 3575135 Dec 1 2013 mysql-connector-java-5.1.27.tar.gz
-rw-r--r--. 1 root root 55782196 Mar 26 2015 MySQL-server-5.6.24-1.el6.x86_64.rpm
#install the MySQL server
rpm -ivh MySQL-server-5.6.24-1.el6.x86_64.rpm
#view the random root password generated during the install
cat /root/.mysql_secret
#check MySQL status after the install
service mysql status
[root@hadoop102 mysql-libs]# service mysql status
MySQL is not running [FAILED]
#start the MySQL server
service mysql start
[root@hadoop102 mysql-libs]# service mysql status
MySQL running (1822)
Install the MySQL client
#install the MySQL client
rpm -ivh MySQL-client-5.6.24-1.el6.x86_64.rpm
#connect to mysql
mysql -uroot -p (random password from cat /root/.mysql_secret)
#change the password
SET PASSWORD=PASSWORD('123456');
#exit mysql
exit
Configure host access in the MySQL user table
#log in to mysql
mysql -uroot -p123456
#show databases
show databases;
#switch to the mysql database
use mysql;
#list the tables in the mysql database
show tables;
#describe the user table
desc user;
#query the user table
select User, Host, Password from user;
#update the user table: set host to '%' so root can connect from any host
update user set host='%' where host='localhost';
#delete the other host entries for root
mysql>delete from user where Host='hadoop102';
mysql>delete from user where Host='127.0.0.1';
mysql>delete from user where Host='::1';
#flush privileges
mysql>flush privileges;
#quit
quit;
Configure the Hive metastore to use MySQL
#extract the JDBC driver
tar -zxvf mysql-connector-java-5.1.27.tar.gz
#copy the driver into Hive's lib directory
cp mysql-connector-java-5.1.27-bin.jar /home/soft/hive/lib/
#create hive-site.xml in /home/soft/hive/conf/
touch hive-site.xml
#edit hive-site.xml with the following content
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop102:3306/metastore?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
</configuration>
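With createDatabaseIfNotExist=true in the connection URL above, starting Hive once should create the metastore database in MySQL. A quick check (a sketch, using the root/123456 credentials configured above):
#start hive once so it initializes the metastore
bin/hive
#then, from another shell, confirm that a new metastore database exists
mysql -uroot -p123456 -e 'show databases;'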
mysql -uroot -p123456
Resulting host configuration in the mysql user table:
use mysql;
select host,user,password from user;
+------+------+-------------------------------------------+
| host | user | password |
+------+------+-------------------------------------------+
| % | root | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
+------+------+-------------------------------------------+
1 row in set (0.00 sec)
HiveJDBC access via HiveServer2 and beeline (important)
bin/hiveserver2
bin/beeline
!connect jdbc:hive2://hadoop102:10000
root
(press Enter for an empty password)
service mysql status   -- check the local MySQL status
bin/hive -e 'select * from student;'   -- run a query without entering the Hive CLI
bin/hive -f ./hsql.sql   -- run the queries in an external SQL file
Browse the HDFS filesystem from inside the Hive CLI
dfs -ls -R /user;   -- (-lsr also works but is deprecated)
05 Basic Command Operations
List local filesystem contents from inside the Hive CLI
!ls /opt;
View the commands Hive has run previously
[root@node01 ~]# cat .hivehistory
Show the current database and print query column headers (hive-site.xml settings)
<!-- show the current database and print query column headers -->
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
View all Hive configuration settings
set;   -- lists every configuration property and its current value
The default warehouse location is also configured in hive-site.xml:
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
Change Hive's default log configuration
mv hive-log4j.properties.template hive-log4j.properties
vi hive-log4j.properties
hive.log.threshold=ALL
hive.root.logger=INFO,DRFA
hive.log.dir=/home/soft/hive/logs
hive.log.file=hive.log
Configuration precedence
Later sources override earlier ones: hive-default.xml < hive-site.xml < command-line --hiveconf parameters < set commands inside the session. Command-line and set overrides apply only to the current process or session; see the sketch below.
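An illustration of the three user-visible levels, using hive.cli.print.header as the example property:
#file level: the value in hive-site.xml applies to every session
#command-line level: overrides hive-site.xml for this launch only
bin/hive --hiveconf hive.cli.print.header=false
#session level: overrides everything, for the current session only
set hive.cli.print.header=true;
set hive.cli.print.header;   -- print the property's current value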
06 Data Types
Basic data types: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY
Collection data types: STRUCT, MAP, ARRAY
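A sketch of a table that uses all three collection types; the table name, columns, and delimiters here are illustrative:
create table test(
name string,
friends array<string>,
children map<string,int>,
address struct<street:string,city:string>
)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';
-- element access: friends[0], children['tom'], address.city
select friends[1], children['tom'], address.city from test;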
07 DDL Data Definition
Create a database
#check whether HDFS is still in safe mode first
hdfs dfsadmin -safemode get
bin/hiveserver2
bin/beeline
!connect jdbc:hive2://hadoop102:10000
root
(press Enter for an empty password)
Create a database
create database db_hive;
Create it only if it does not already exist
create database if not exists db_hive;
Create it at a specified HDFS location
create database if not exists db_hive2
location '/db_hive2.db';   -- / is the HDFS root directory; browse it at http://192.168.10.120:50070
Alter a database
Only the dbproperties can be changed; the database name and location cannot. A sketch follows.
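The alter statement (the property name and value here are illustrative):
alter database db_hive set dbproperties('createtime'='20190826');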
desc database db_hive;
View the modified properties (they are also visible in the MySQL metastore tables)
desc database extended db_hive;
Query databases
show databases;
show databases like 'db*';
Show database details
desc database db_hive;
use db_hive;
show tables;
Drop a database
Drop an empty database
drop database db_hive;
drop database if exists db_hive;
show databases;
Drop a database that still contains tables
drop database if exists db_hive cascade;
Create tables
Internal (managed) tables
Create-table statement
create table if not exists student2(
id int,
name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/db_hive.db/student2';
#delete a data file under the HDFS directory
hadoop fs -rm /user/hive/warehouse/student2
#delete recursively with -r
hadoop fs -rm -r /user
/* location is where the table's data lives on HDFS */
#migrate data: effectively creates a backup table
create table if not exists student
as select id, name from student2;
#create only the table schema, without copying data
create table if not exists student4 like student;
#show full table details
desc formatted student2;
#drop the table and its data (it is a managed table)
drop table student;
#re-upload the data file
hadoop fs -put ./student.txt /student
External tables
External tables let several teams work against the same underlying data, mainly for queries; dropping one removes only the metadata, not the data.
First create the table as a managed table, for contrast:
hadoop fs -mkdir /student
hadoop fs -put /home/soft/student.txt /student
create table student6(
id int,
name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/student';
drop table student6;   -- it was a managed table, so the drop also deleted /student from HDFS
-- external: recreate the directory, re-upload, and make the table external this time
hadoop fs -mkdir /student
hadoop fs -put /home/soft/student.txt /student
create external table student6(
id int,
name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/student';
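A quick check (a sketch) of the external-table contract: dropping student6 now leaves the data in place.
drop table student6;
dfs -ls /student;   -- student.txt is still there; only the metadata was removed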
Converting between managed and external tables
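A sketch of the conversion, done through table properties (the values must be the uppercase strings 'TRUE'/'FALSE'):
-- managed -> external
alter table student6 set tblproperties('EXTERNAL'='TRUE');
-- external -> managed
alter table student6 set tblproperties('EXTERNAL'='FALSE');
-- verify: desc formatted student6; shows Table Type as MANAGED_TABLE or EXTERNAL_TABLE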
Partition tables
Create a partitioned table.
-- partitioned by (month string)
-- single-level partition table
create table dept_partition(
deptno int,
dname string,
loc string
) partitioned by (month string)
row format delimited fields terminated by '\t';
#load data into a partition
load data local inpath '/home/soft/datas/dept.txt'
into table default.dept_partition partition(month='201709');
#query a single partition
select * from dept_partition where month = '201709';
#union across partitions
select * from dept_partition where month = '201709'
union
select * from dept_partition where month = '201708';
#add partitions
#drop partitions (see the sketch below)
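A sketch of partition maintenance on dept_partition; the month values are illustrative. Note that add takes space-separated partitions while drop takes comma-separated ones:
alter table dept_partition add partition(month='201706') partition(month='201705');
alter table dept_partition drop partition(month='201706'), partition(month='201705');
show partitions dept_partition;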
#create a two-level partition table
create table dept_partition2(
deptno int,
dname string,
loc string
) partitioned by (month string,day string)
row format delimited fields terminated by '\t';
#notes on partition tables
If data files are placed directly into a partition directory on HDFS,
the table exists in hive but shows no rows until the partition is registered in the metastore (see the sketch below).
#hive only changes how the data on HDFS is presented; the files themselves are untouched
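Three ways to register such data (a sketch, assuming a month=201710 directory was created with dfs commands):
-- option 1: repair the whole table
msck repair table dept_partition;
-- option 2: add the partition explicitly
alter table dept_partition add partition(month='201710');
-- option 3: load through hive, which creates the partition for you
load data local inpath '/home/soft/datas/dept.txt' into table dept_partition partition(month='201710');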
08 DML Data Manipulation
Data import
Import method 1: load
Load a local file into hive
-- load a local file into hive
load data local inpath '/home/soft/student.txt' into table default.student;
-- general form: load data local inpath '<local file path>' into table <database>.<table>;
-- local means the file is uploaded from the Linux filesystem
-- load an hdfs file into hive
hadoop fs -put /home/soft/student.txt /   -- upload the data to HDFS first
create table student(id string, name string) row format delimited fields terminated by '\t';
load data inpath '/student.txt' into table student;
------------------------------------------------------------------------------------------------------
hadoop fs -mkdir /student   -- create the /student directory on HDFS (mind the directory permissions)
hadoop fs -put /home/soft/student.txt /student   -- upload local /home/soft/student.txt into the HDFS /student directory
create table student6(
id int,
name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/student';
-- overwrite replaces the data already in the table
load data local inpath '/home/soft/student.txt' overwrite into table student;
Import method 2: insert
Import data with insert
-- create a partitioned table
-- partitioned by (month string)
create table student7(
id int, name string
) partitioned by (month string)
row format delimited fields terminated by '\t';
insert into table student7 partition(month ='201709') values(100,'张三');
#executing this statement launches a MapReduce job:
/**
0: jdbc:hive2://hadoop102:10000> insert into table student7 partition(month ='201709') values(100,'张三');
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1649984531147_0001
INFO : The url to track the job: http://hadoop103:8088/proxy/application_1649984531147_0001/
INFO : Starting Job = job_1649984531147_0001, Tracking URL = http://hadoop103:8088/proxy/application_1649984531147_0001/
INFO : Kill Command = /home/soft/hadoop-2.7.2/bin/hadoop job -kill job_1649984531147_0001
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2022-04-15 02:12:56,326 Stage-1 map = 0%, reduce = 0%
INFO : 2022-04-15 02:13:01,466 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.11 sec
INFO : MapReduce Total cumulative CPU time: 2 seconds 110 msec
INFO : Ended Job = job_1649984531147_0001
INFO : Stage-4 is selected by condition resolver.
INFO : Stage-3 is filtered out by condition resolver.
INFO : Stage-5 is filtered out by condition resolver.
INFO : Moving data to: hdfs://hadoop102:9000/user/hive/warehouse/db_hive.db/student7/month=201709/.hive-staging_hive_2022-04-15_02-12-49_377_6928154102701663989-1/-ext-10000 from hdfs://hadoop102:9000/user/hive/warehouse/db_hive.db/student7/month=201709/.hive-staging_hive_2022-04-15_02-12-49_377_6928154102701663989-1/-ext-10002
INFO : Loading data to table db_hive.student7 partition (month=201709) from hdfs://hadoop102:9000/user/hive/warehouse/db_hive.db/student7/month=201709/.hive-staging_hive_2022-04-15_02-12-49_377_6928154102701663989-1/-ext-10000
INFO : Partition db_hive.student7{month=201709} stats: [numFiles=1, numRows=1, totalSize=7, rawDataSize=6]
No rows affected (13.349 seconds)
*/
# copy rows from one partition into another (insert overwrite as a backup)
insert overwrite table student7 partition(month='201708') select id,name from student7 where month='201709';
/**
execution log:
0: jdbc:hive2://hadoop102:10000> insert overwrite table student7 partition(month='201708') select id,name from student7 where month='201709';
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1649984531147_0003
INFO : The url to track the job: http://hadoop103:8088/proxy/application_1649984531147_0003/
INFO : Starting Job = job_1649984531147_0003, Tracking URL = http://hadoop103:8088/proxy/application_1649984531147_0003/
INFO : Kill Command = /home/soft/hadoop-2.7.2/bin/hadoop job -kill job_1649984531147_0003
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2022-04-15 02:16:54,995 Stage-1 map = 0%, reduce = 0%
INFO : 2022-04-15 02:16:59,092 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.0 sec
INFO : MapReduce Total cumulative CPU time: 1 seconds 0 msec
INFO : Ended Job = job_1649984531147_0003
INFO : Stage-4 is selected by condition resolver.
INFO : Stage-3 is filtered out by condition resolver.
INFO : Stage-5 is filtered out by condition resolver.
INFO : Moving data to: hdfs://hadoop102:9000/user/hive/warehouse/db_hive.db/student7/month=201708/.hive-staging_hive_2022-04-15_02-16-48_516_7282749969946638844-1/-ext-10000 from hdfs://hadoop102:9000/user/hive/warehouse/db_hive.db/student7/month=201708/.hive-staging_hive_2022-04-15_02-16-48_516_7282749969946638844-1/-ext-10002
INFO : Loading data to table db_hive.student7 partition (month=201708) from hdfs://hadoop102:9000/user/hive/warehouse/db_hive.db/student7/month=201708/.hive-staging_hive_2022-04-15_02-16-48_516_7282749969946638844-1/-ext-10000
INFO : Partition db_hive.student7{month=201708} stats: [numFiles=1, numRows=1, totalSize=6, rawDataSize=5]
No rows affected (11.8 seconds)
0: jdbc:hive2://hadoop102:10000>
*/
# multi-insert: one scan of student7 feeds several partitions (each target must be a different partition, and the inner selects take no from clause)
from student7
insert overwrite table student7 partition(month='201707')
select id,name where month = '201709'
insert overwrite table student7 partition(month='201706')
select id,name where month = '201709';
Import method 3: create table ... as select
create table if not exists student8
as select id,name from student;
/**
0: jdbc:hive2://hadoop102:10000> create table if not exists student8
0: jdbc:hive2://hadoop102:10000> as select id,name from student;
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1649984531147_0005
INFO : The url to track the job: http://hadoop103:8088/proxy/application_1649984531147_0005/
INFO : Starting Job = job_1649984531147_0005, Tracking URL = http://hadoop103:8088/proxy/application_1649984531147_0005/
INFO : Kill Command = /home/soft/hadoop-2.7.2/bin/hadoop job -kill job_1649984531147_0005
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2022-04-15 02:35:27,125 Stage-1 map = 0%, reduce = 0%
INFO : 2022-04-15 02:35:31,205 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
INFO : MapReduce Total cumulative CPU time: 890 msec
INFO : Ended Job = job_1649984531147_0005
INFO : Stage-4 is selected by condition resolver.
INFO : Stage-3 is filtered out by condition resolver.
INFO : Stage-5 is filtered out by condition resolver.
INFO : Moving data to: hdfs://hadoop102:9000/user/hive/warehouse/db_hive.db/.hive-staging_hive_2022-04-15_02-35-23_664_6886647803983637797-6/-ext-10001 from hdfs://hadoop102:9000/user/hive/warehouse/db_hive.db/.hive-staging_hive_2022-04-15_02-35-23_664_6886647803983637797-6/-ext-10003
INFO : Moving data to: hdfs://hadoop102:9000/user/hive/warehouse/db_hive.db/student8 from hdfs://hadoop102:9000/user/hive/warehouse/db_hive.db/.hive-staging_hive_2022-04-15_02-35-23_664_6886647803983637797-6/-ext-10001
INFO : Table db_hive.student8 stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
No rows affected (8.649 seconds)
*/
Import method 4: create the table over existing data with location
location loads the data in place
drop table student;
dfs -mkdir /student;
dfs -put /home/soft/student.txt /student;
create table student(
id int,
name string
)row format delimited fields terminated by '\t'
location '/student';
-- location '/student': the HDFS /student directory already contains student.txt
-- browse it at http://192.168.175.102:50070/explorer.html#/
/*
0: jdbc:hive2://hadoop102:10000> create external table student(
0: jdbc:hive2://hadoop102:10000> id int,
0: jdbc:hive2://hadoop102:10000> name string
0: jdbc:hive2://hadoop102:10000> )row format delimited fields terminated by '\t'
0: jdbc:hive2://hadoop102:10000> location '/student';
No rows affected (0.031 seconds)
*/
Import method 5: import
-- import reads a directory previously produced by export (see the export section below)
import table db_hive.student10 from '/export/student';
/*
0: jdbc:hive2://hadoop102:10000> import table db_hive.student10 from '/export/student';
INFO : Copying data from hdfs://hadoop102:9000/export/student/data to hdfs://hadoop102:9000/export/student/.hive-staging_hive_2022-04-15_03-47-35_673_2921046789811998505-8/-ext-10000
INFO : Copying file: hdfs://hadoop102:9000/export/student/data/student.txt
INFO : Loading data to table db_hive.student10 from hdfs://hadoop102:9000/export/student/.hive-staging_hive_2022-04-15_03-47-35_673_2921046789811998505-8/-ext-10000
No rows affected (0.14 seconds)
*/
/* second form: import into a specific partition */
import table db_hive.student10 partition(month='201709') from '/export/student';
Data export
Export method 1: insert
Export data with insert overwrite [local] directory
insert overwrite local directory '/home/soft/datas/export/student'
select * from student;
/**
0: jdbc:hive2://hadoop102:10000> insert overwrite local directory '/home/soft/datas/export/student'
0: jdbc:hive2://hadoop102:10000> select * from student;
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1649984531147_0006
INFO : The url to track the job: http://hadoop103:8088/proxy/application_1649984531147_0006/
INFO : Starting Job = job_1649984531147_0006, Tracking URL = http://hadoop103:8088/proxy/application_1649984531147_0006/
INFO : Kill Command = /home/soft/hadoop-2.7.2/bin/hadoop job -kill job_1649984531147_0006
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2022-04-15 02:55:59,020 Stage-1 map = 0%, reduce = 0%
INFO : 2022-04-15 02:56:04,123 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
INFO : MapReduce Total cumulative CPU time: 1 seconds 210 msec
INFO : Ended Job = job_1649984531147_0006
INFO : Copying data to local directory /home/soft/datas/export/student from hdfs://hadoop102:9000/tmp/hive/root/2114f150-2589-499b-bdee-b97d0c22ccd2/hive_2022-04-15_02-55-55_365_1596929057137095706-8/-mr-10000
INFO : Copying data to local directory /home/soft/datas/export/student from hdfs://hadoop102:9000/tmp/hive/root/2114f150-2589-499b-bdee-b97d0c22ccd2/hive_2022-04-15_02-55-55_365_1596929057137095706-8/-mr-10000
No rows affected (9.809 seconds)
0: jdbc:hive2://hadoop102:10000>
*/
insert overwrite local directory '/home/soft/datas/export/student1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' -- format the exported output with a tab delimiter
select * from student;
/**
0: jdbc:hive2://hadoop102:10000> insert overwrite local directory '/home/soft/datas/export/student1'
0: jdbc:hive2://hadoop102:10000> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
0: jdbc:hive2://hadoop102:10000> select * from student;
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1649984531147_0007
INFO : The url to track the job: http://hadoop103:8088/proxy/application_1649984531147_0007/
INFO : Starting Job = job_1649984531147_0007, Tracking URL = http://hadoop103:8088/proxy/application_1649984531147_0007/
INFO : Kill Command = /home/soft/hadoop-2.7.2/bin/hadoop job -kill job_1649984531147_0007
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2022-04-15 02:59:19,400 Stage-1 map = 0%, reduce = 0%
INFO : 2022-04-15 02:59:22,474 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.91 sec
INFO : MapReduce Total cumulative CPU time: 910 msec
INFO : Ended Job = job_1649984531147_0007
INFO : Copying data to local directory /home/soft/datas/export/student1 from hdfs://hadoop102:9000/tmp/hive/root/2114f150-2589-499b-bdee-b97d0c22ccd2/hive_2022-04-15_02-59-14_993_5708696570995345258-8/-mr-10000
INFO : Copying data to local directory /home/soft/datas/export/student1 from hdfs://hadoop102:9000/tmp/hive/root/2114f150-2589-499b-bdee-b97d0c22ccd2/hive_2022-04-15_02-59-14_993_5708696570995345258-8/-mr-10000
No rows affected (9.552 seconds)
0: jdbc:hive2://hadoop102:10000>
*/
insert overwrite directory '/studentexport'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' -- format the exported output with a tab delimiter
select * from student;
Export method 2: dfs -get
dfs -get <hdfs path> <local path>;   -- download the table's files from HDFS (run inside the Hive CLI)
Export method 3: hive -e plus shell redirection
-- bin/hive -e '<sql to run>' > <output file path>
bin/hive -e 'select * from db_hive.student;' > /home/soft/datas/export/stu.txt
/*
[root@hadoop102 hive]# bin/hive -e 'select * from db_hive.student;' > /home/soft/datas/export/stu.txt
Logging initialized using configuration in file:/home/soft/hive/conf/hive-log4j.properties
OK
Time taken: 0.884 seconds, Fetched: 16 row(s)
[root@hadoop102 hive]#
*/
Export method 4: export table
export table db_hive.student to '/export/student';
An export bundles the table's data together with its metadata (the _metadata file).
/*
0: jdbc:hive2://hadoop102:10000> export table db_hive.student to '/export/student';
INFO : Copying data from file:/tmp/root/2114f150-2589-499b-bdee-b97d0c22ccd2/hive_2022-04-15_03-44-16_463_7703447412327772980-8/-local-10000/_metadata to hdfs://hadoop102:9000/export/student
INFO : Copying file: file:/tmp/root/2114f150-2589-499b-bdee-b97d0c22ccd2/hive_2022-04-15_03-44-16_463_7703447412327772980-8/-local-10000/_metadata
INFO : Copying data from hdfs://hadoop102:9000/student to hdfs://hadoop102:9000/export/student/data
INFO : Copying file: hdfs://hadoop102:9000/student/student.txt
No rows affected (0.15 seconds)
0: jdbc:hive2://hadoop102:10000>
*/
Truncate a table's data (works only on managed tables)
truncate table student;
Hive is a data warehouse tool built on Hadoop: it maps structured data files onto database tables, offers a simple SQL interface, and translates SQL statements into MapReduce jobs. HBase is the Hadoop database: a distributed, scalable big-data store. The names alone make the two easy to confuse, so here is a closer comparison.
Characteristics
Differences between Hive and HBase:
Hive lets people who know SQL run MapReduce jobs. Because it is JDBC compatible, it also integrates with existing SQL tools. Hive queries can take a long time because by default they scan all of a table's data, but the amount scanned can be limited with Hive's partitioning: partitions store slices of the dataset in separate folders, and a filtered query reads only the specified folders (partitions). This can be used, for example, to process only files from a given time range, as long as the file naming encodes the time.
HBase works by storing key/value pairs. It supports four main operations: adding or updating a row, scanning a range of cells, getting a specified row, and deleting a row, column, or column version. Versioning makes historical values retrievable (old versions of a row can be deleted, and HBase compactions then reclaim the space). Although HBase has tables, a schema is required only at the table and column-family level; individual columns need no schema. HBase tables also include increment/counter functionality.
Limitations
Hive does not currently support updates. Also, because Hive runs batch jobs on Hadoop, queries take a long time, typically minutes to hours to return results. Hive requires a predefined schema mapping files and directories onto columns, and it is not ACID compliant.
HBase queries are written in a language of its own that must be learned anew. SQL-like functionality is available through Apache Phoenix, but at the cost of having to provide a schema. HBase is also not fully ACID compliant, although it does support some of the properties. Last but not least, running HBase requires ZooKeeper, a distributed coordination service covering configuration, metadata maintenance, and naming.
Use cases
Hive is suited to analytical queries over data accumulated across a period of time, for example computing trends or processing website logs. Hive should not be used for real-time queries, since it can take a long time to return a result.
HBase is well suited to real-time querying of big data. Facebook uses HBase for messaging and real-time analytics; it can also be used to count Facebook connections.
Summary
Hive and HBase are two different Hadoop-based technologies: Hive is a SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database on top of Hadoop. The two can of course be used together: just as you might use Google for search and Facebook for social networking, Hive serves analytical queries while HBase serves real-time ones, and data can be moved between them.
Hive Queries
Basic queries
create table if not exists dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';
create table if not exists emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
row format delimited fields terminated by '\t';
# load the dept data
load data local inpath '/home/soft/dept.txt' into table dept;
# load the emp data
load data local inpath '/home/soft/emp.txt' into table emp;
/*
0: jdbc:hive2://hadoop102:10000> load data local inpath '/home/soft/dept.txt' into table dept;
INFO : Loading data to table db_hive.dept from file:/home/soft/dept.txt
INFO : Table db_hive.dept stats: [numFiles=1, totalSize=71]
No rows affected (0.151 seconds)
0: jdbc:hive2://hadoop102:10000> load data local inpath '/home/soft/emp.txt' into table emp;
INFO : Loading data to table db_hive.emp from file:/home/soft/emp.txt
INFO : Table db_hive.emp stats: [numFiles=1, totalSize=656]
No rows affected (0.123 seconds)
0: jdbc:hive2://hadoop102:10000>
*/
select * from emp;   -- full-table query
select empno from emp;   -- select one column (emp has empno, not id)
select e.ename from emp e;   -- table alias
select e.ename nn from emp e;   -- column alias
select sal+1 from emp;   -- arithmetic on a column
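A few basic forms that naturally follow (a sketch; standard HQL over the emp table above):
select count(*) cnt from emp;   -- row count
select max(sal) max_sal, min(sal) min_sal, avg(sal) avg_sal, sum(sal) sum_sal from emp;   -- common aggregates
select * from emp limit 5;   -- cap the number of rows returned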