OD Learns Hive: Week 4, 0717

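I. Hive: basic concepts, installation and deployment, and first use

1. Upcoming courses

Hive

Project: hadoop, hive, sqoop, flume, hbase

Offline data analysis for e-commerce

CDH

Storm: a distributed real-time computation framework

Spark:

2. How to learn big data technologies

In class, listen carefully and take plenty of notes;

When you meet a concept that is hard to understand, write it down right away;

After class, practice hands-on, and think hard about any problems you run into;

When you hit a problem, do not make asking someone else your first step;

Treasure each opportunity to ask a question;

Pay attention to how you ask: present your own rough line of thinking;

Summarize often:

Write your summaries up as documents for future reference;

Archive them into your own knowledge base;

For each technical framework:

Reference material: consult the official site; do not rely on other people's blogs;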

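II. Big data technology frameworks

1. Data storage: HDFS (disk-based), Tachyon (memory-based)

Tachyon is usually placed between HDFS and the compute frameworks. Data that does not need to be persisted to HDFS disks is kept in memory instead, so that it can be shared through memory;

2. Data analysis:

MapReduce: offline batch-processing framework

YARN: task allocation and resource management, the operating system of big data

Hive: built by Facebook to analyze massive amounts of structured log data;

Storm:

Spark:

Spark core

Spark Streaming

Spark SQL

Spark MLlib: machine learning library

Spark GraphX: graph computation

3. Efficient real-time data queries:

HBase: a distributed, column-oriented NoSQL database

ElasticSearch: search engine

Solr: search engine

4. Data applications:

Search engines, recommender systems, machine learning (artificial intelligence)

Targeted advertising, game analytics (player retention)

Public security systems (suspect tracking)

Transportation departments (traffic analysis, traffic prediction, detection of accident-prone road sections)

5. Data visualization

Web projects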

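III. Introduction to Hive

A project of the Apache Software Foundation

A data warehouse framework that uses SQL-like statements to read, write, and manage large datasets stored in distributed storage (HDFS, HBase).

Hive can be described in the following parts:

(1) Hive's data is stored on hdfs

(2) Hive reads, writes, and manages data through sql

Under the hood: sql statements are translated into a series of MapReduce jobs

(3) Hive maps data onto tables

Table metadata: table name, owning database, table owner, creation time, last access time, protect mode, retention,

location (where on hdfs the table's data is stored), table type (internal (managed) table, external table), the table's storage format, the table's structure (which columns it has), and the type of each column

hive

show databases;

create table docs(line string);

show tables;

load data local inpath 'path' overwrite into table docs;

describe formatted docs;

Hive table metadata is stored in a relational database: embedded derby, or MySQL

How can the safety of hive table metadata be ensured?

MySQL master-slave backup

(4) Hive's driver engine: Driver

Engine:

Parser: parses SQL into an abstract syntax tree (AST)

Compiler: compiles the AST into a logical execution plan

Optimizer: optimizes the logical execution plan

Executor: calls the underlying execution framework to run the logical execution plan, that is, turns the logical execution plan into a physical execution plan

Execution frameworks: MapReduce, Spark (SparkSQL)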

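IV. Advantages of Hive:

1. Big data is analyzed with sql; anyone who knows sql can get started fairly easily

DBAs: good at sql, not at programming

2. No need to write mapreduce applications any more

3. Hive table metadata can be managed centrally and shared with other frameworks

impala is faster than hive, but less stable, and supports fewer functions than hive

4. Easy to extend

(1) Cluster deployment: the data is stored on hdfs, so scaling hdfs is enough;

(2) show functions; lists hive's built-in functions: 195 functions (0.13.1)

Custom functions: written as UDFs (user-defined functions)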

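V. Use cases for hive:

(1) Offline batch processing: log analysis

(2) High-latency scenarios, where real-time results are not required

For real-time requirements:

Storm

Spark Streaming

(3) Suited to large datasets, not to small ones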

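VI. Hive architecture and its components

HDFS: distributed data storage

Driver engine: parser, compiler, optimizer, executor

MySQL: stores hive table metadata

Client: cli (interactive command-line window), hiveserver2 (jdbc/odbc, uses Thrift), hwi (its upgraded version is hue)

Note: Thrift acts as a translator between different programming languages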

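VII. Installation and deployment

For any big data component, always follow the official site when installing and deploying.

http://hive.apache.org/

https://cwiki.apache.org/confluence/display/Hive/GettingStarted

1. What are the prerequisites?

Network configuration, hostname-to-IP mapping, firewall (in a learning environment, simply turn it off; in production, request firewall openings and configure firewall policies)

JDK 1.7, hadoop 2.x, operating system: Linux

Apache releases:

archive.apache.org/dist

CDH releases:

archive.cloudera.com/cdh5/cdh/5

hive-0.13.1-cdh5.3.6.tar.gz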

2. Modify the configuration

cp hive-env.sh.template hive-env.sh

cp hive-default.xml.template hive-site.xml

cp hive-exec-log4j.properties.template hive-exec-log4j.properties

cp hive-log4j.properties.template hive-log4j.properties

bin/hive

export JAVA_HOME=/opt/modules/jdk1.7.0_67

Error reported at 2783:3 (line 2783, column 3) of hive-site.xml: a <property> tag is missing
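A position like 2783:3 can also be found before starting Hive by running the file through an XML parser. The following is a minimal sketch in Python, not part of the original notes; the function name `find_xml_error` and the sample text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def find_xml_error(xml_text):
    """Return the (line, column) of the first well-formedness error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return err.position
    return None


# Made-up sample: the second entry lacks its opening <property> tag.
BROKEN = """<configuration>
  <property>
    <name>a</name>
    <value>1</value>
  </property>
    <name>b</name>
    <value>2</value>
  </property>
</configuration>"""

position = find_xml_error(BROKEN)
if position is not None:
    print("not well-formed at line %d, column %d" % position)
```

For a real check, the text of the copied hive-site.xml would be read from disk and passed in; the parser used here may count columns differently from the one inside Hive, so the line number is the part to rely on.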

After fixing it, run bin/hive again

bin/hive --service metastore

3. Install mysql

yum -y install mysql-server

The version installed this way is 5.1.17

sudo rpm -Uvh mysql57-community-release-el6-8.noarch.rpm

cd /etc/yum.repos.d/

mysql-community.repo

mysql-community-source.repo

5.6 enabled=1

5.7 enabled=0

sudo yum -y install mysql-community-server

MySQL security settings

sudo mysql_secure_installation

root beifeng

grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;

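4. Install hive

bin/hive connects to hive with the cli client

The hadoop cluster must be running before hive is started

(1) Command: create database test;

Result: the directory /user/hive/warehouse/test.db is created

(2) Commands:

use test;

create table docs(line string);

Result: /user/hive/warehouse/test.db/docs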
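The two results above follow a simple naming pattern: a database becomes a `<name>.db` directory under the warehouse directory, and a table becomes a subdirectory of it. A small Python sketch of that mapping, assuming no custom location was set on the database or table; `table_location` is a hypothetical helper, not a Hive API.

```python
import posixpath

WAREHOUSE_DIR = "/user/hive/warehouse"  # the location seen in the results above


def table_location(database, table, warehouse_dir=WAREHOUSE_DIR):
    """Return the HDFS directory where a table's data is expected to live."""
    if database == "default":
        # Tables of the default database sit directly under the warehouse dir.
        return posixpath.join(warehouse_dir, table)
    return posixpath.join(warehouse_dir, database + ".db", table)


print(table_location("test", "docs"))  # /user/hive/warehouse/test.db/docs
```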

(3) Load data

Command:

load data local inpath '/home/beifeng/Documents/a.txt' overwrite into table docs;

(4) View the metadata

describe docs;

describe extended docs;

describe formatted docs;

Hive installation mode: local embedded derby as the metastore (not used)

The metadata is kept in the embedded derby database, in metastore_db

(5) Word count in hive

create table word_counts as

select word, count(1) as count from

(select explode(split(line, ' ')) as word from docs) w

group by word

order by word;

select * from word_counts;
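The query above splits each line on spaces, turns the pieces into one row per word, groups by word, and sorts by word. To cross-check the result on a small file, roughly the same steps can be sketched in Python. This is an illustration only: `word_counts` is a hypothetical helper, the sample lines are made up, and edge cases such as repeated or trailing spaces may not match Hive's `split` exactly.

```python
from collections import Counter


def word_counts(lines):
    """Count words per the split / explode / group by / order by steps."""
    counts = Counter()
    for line in lines:
        # split(line, ' ') followed by explode: one entry per word
        counts.update(line.split(" "))
    # group by word ... order by word
    return sorted(counts.items())


for word, count in word_counts(["hive on hadoop", "hive on spark"]):
    print(word, count)
```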

Hive installation modes for production:

(1) Local mysql as the metastore

hive-env.sh: JAVA_HOME, HADOOP_HOME

hive-site.xml: the javax.jdo options

https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin

Config Param	Config Value	Comment
javax.jdo.option.ConnectionURL	`jdbc:mysql://<host name>/<database name>?createDatabaseIfNotExist=true`	metadata is stored in a MySQL server
javax.jdo.option.ConnectionDriverName	`com.mysql.jdbc.Driver`	MySQL JDBC driver class
javax.jdo.option.ConnectionUserName	`<user name>`	user name for connecting to MySQL server
javax.jdo.option.ConnectionPassword	`<password>`	password for connecting to MySQL server
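Each row of the table becomes one `<property>` element with a `<name>` and a `<value>` inside the `<configuration>` root of hive-site.xml. A minimal Python sketch that renders such a file is shown here; the host, database name, user name, and password are placeholders invented for the example, and `build_hive_site` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_hive_site(properties):
    """Render a dict of settings as hive-site.xml text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values: host, database, user and password are hypothetical.
JDO_SETTINGS = {
    "javax.jdo.option.ConnectionURL":
        "jdbc:mysql://db.example.com/metastore?createDatabaseIfNotExist=true",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "change-me",
}

print(build_hive_site(JDO_SETTINGS))
```

Building the file through an XML library rather than by hand avoids unbalanced tags and takes care of escaping special characters in values.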

hive.metastore.uris: an empty value means local mode, a non-empty value means remote mode
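The rule for hive.metastore.uris can be checked against a configuration file with a few lines of Python. This is a sketch under the assumption that a missing property counts the same as an empty one; `metastore_mode` is a hypothetical helper, and the host name is the one used elsewhere in these notes.

```python
import xml.etree.ElementTree as ET


def metastore_mode(hive_site_xml):
    """Return "remote" if hive.metastore.uris has a value, else "local"."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "hive.metastore.uris":
            value = (prop.findtext("value") or "").strip()
            return "remote" if value else "local"
    return "local"  # property absent: treated like an empty value


REMOTE = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://beifeng-hadoop-02:9083</value>
  </property>
</configuration>"""

print(metastore_mode(REMOTE))  # remote
print(metastore_mode("<configuration/>"))  # local
```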

Copy the mysql driver into /opt/modules/hive-0.13.1-cdh5.3.6/lib

mycat: a distributed tool for mysql

(2) Remote mysql as the metastore

hive cli -> metastore server -> mysql

jdbc ->

hive.metastore.uris = thrift://beifeng-hadoop-02:9083

Start the metastore: hive --service metastore -p <port_num>

As a resident background process: nohup hive --service metastore > hive_metastore.run.log 2>&1 &

File descriptors used in the redirection: 2 is standard error, 1 is standard output
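The `2>&1` part of the command sends descriptor 2 (standard error) to the same place as descriptor 1 (standard output), so both end up in one log file. The same idea can be sketched in Python with `subprocess`; the child process here is a stand-in written for the example, not the metastore, and `run_merged` is a hypothetical helper.

```python
import subprocess
import sys


def run_merged(argv):
    """Run argv and return its stdout with stderr merged in (like 2>&1)."""
    completed = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # send descriptor 2 to wherever 1 goes
        universal_newlines=True,
    )
    return completed.stdout


# Stand-in child process that writes one line to each stream.
CHILD = "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"
print(run_merged([sys.executable, "-c", CHILD]))
```

Both lines appear in the captured text; their relative order depends on buffering and is not guaranteed.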

View the process information: ps -ef | grep HiveMetaStore

kill -9 processId

kill -9 `ps -ef | grep HiveMetaStore | awk '{print $2}' | head -n 1`
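The one-liner works in three steps: keep the `ps -ef` lines that mention HiveMetaStore, take the second column (the PID), and keep only the first hit. A Python sketch of the same selection logic, run on a made-up `ps -ef` excerpt whose PIDs and columns are illustrative only; `first_pid` is a hypothetical helper.

```python
def first_pid(ps_output, pattern):
    """Mimic `grep pattern | awk '{print $2}' | head -n 1` on `ps -ef` text."""
    for line in ps_output.splitlines():
        if pattern in line:
            return int(line.split()[1])
    return None


# Made-up `ps -ef` excerpt; the PIDs and command lines are illustrative only.
SAMPLE = """UID        PID  PPID  C STIME TTY          TIME CMD
beifeng   4242     1  0 00:40 pts/0    00:00:09 java org.apache.hadoop.hive.metastore.HiveMetaStore
beifeng   4310  4100  0 00:48 pts/0    00:00:00 grep HiveMetaStore
"""

print(first_pid(SAMPLE, "HiveMetaStore"))  # 4242
```

As the sample shows, the `grep` process itself also matches the pattern. Taking only the first line usually picks the metastore, but that ordering is not guaranteed, so it is worth checking the PID before killing it.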

5. hive commands
Common hive database operations:

show functions;

describe function sum;

describe function extended sum;

list

create database if not exists test;

use test;

create table if not exists docs(line string);

add file /home/aaa/a.sql;

list file;

source /home/aaa/a.sql;

add jar /home/aaa/a.jar; -- a UDF program needs its jar

6. Hive shell operations

[beifeng@beifeng-hadoop-02 hive-0.13.1-cdh5.3.6]$ hive --service cli --help;
16/07/17 00:48:55 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect.  Use hive.hmshandler.retry.* instead
usage: hive
 -d,--define <key=value>          Variable subsitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
 -h <hostname>                    connecting to Hive Server on remote host
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable subsitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

drop database test cascade;

hive -e "use test; select * from docs;"

hive -f /home/aaa/a.sql

hive -e "use test; select * from docs;" > hive_result.txt

hive -hiveconf hive.cli.print.header=true -e "use test; select * from docs;"

hive -hiveconf hive.cli.print.header=true -e "use test; select * from docs;" > hive_result.txt

hive -S -hiveconf hive.cli.print.header=true -e "use test; select * from docs;" > hive_result.txt

hive -hiveconf

    hive.cli.print.header=true: whether to print the column headers

    hive.cli.print.current.db=true: whether to show the current database

hive -i

    Default: hive -i ~/.hiverc

set hive.cli.print.header=true;
set hive.cli.print.current.db=true;
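When these invocations are driven from a script, it helps to assemble the argument list in one place. The sketch below does that in Python using only options that appear in the help output above (`-S`, `--hiveconf`, `-e`); it uses the `--hiveconf` spelling from the help text rather than the `-hiveconf` spelling of the examples. `hive_command` is a hypothetical helper, and nothing is executed here.

```python
def hive_command(query, hiveconf=None, silent=False):
    """Build the argument list for a one-shot `hive -e` invocation."""
    cmd = ["hive"]
    if silent:
        cmd.append("-S")  # silent mode
    for key, value in (hiveconf or {}).items():
        cmd += ["--hiveconf", "%s=%s" % (key, value)]
    cmd += ["-e", query]  # SQL from the command line
    return cmd


print(hive_command(
    "use test; select * from docs;",
    hiveconf={"hive.cli.print.header": "true"},
    silent=True,
))
```

On a machine where `hive` is on the PATH, the returned list could be passed to `subprocess.run`, with standard output redirected to a file such as hive_result.txt.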

create table cust_info(
custNo bigint,
custName string,
gender char(1),
custBirth string,
salary double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

hive -e "use test; load"

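When data is loaded into hive, the load succeeds even if the file's format does not match the table definition. Data that does not match the format, however, will not be read back correctly.

schema on read: the schema is enforced when data is read

schema on write: the schema is enforced when data is written

Hive follows schema on read: if the data read back does not match the table definition, NULL is substituted. The benefit is that writing is fast.

Relational databases follow schema on write.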
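Schema on read can be illustrated with a toy reader: the file is stored untouched, and the schema is applied only when a line is parsed, with NULL (None here) standing in for anything that is missing or does not convert. This is a simplified Python sketch modeled on the cust_info table above, not Hive's actual deserializer; `read_row` and `CUST_INFO` are names made up for the example.

```python
# Column names follow cust_info; the Python types approximate the Hive types.
CUST_INFO = [
    ("custNo", int),
    ("custName", str),
    ("gender", str),
    ("custBirth", str),
    ("salary", float),
]


def read_row(line, schema, delimiter="\t"):
    """Apply the schema while reading; bad or missing fields become None."""
    fields = line.rstrip("\n").split(delimiter)
    row = []
    for index, (_name, cast) in enumerate(schema):
        try:
            row.append(cast(fields[index]))
        except (IndexError, ValueError):
            row.append(None)  # schema on read: substitute NULL
    return row


print(read_row("1\tAlice\tF\t1990-01-01\t5000.5", CUST_INFO))
print(read_row("abc\tBob", CUST_INFO))
```

The second line loads without complaint, and the mismatch only shows up as None values when the row is read.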

7. Combining with shell scripts:

posted @ 2016-07-17 09:01 沙漏哟
