BigData——Hive笔记

Hive笔记

hive的存储格式：

1 textfile 没有压缩，但是比较占用存储空间。读取效率最高
2 rcfile 有压缩，空间占用小。比较节省空间。
3 SequenceFile 适合小文件存储。
hive和hdfs的关系：
hive的存储是基于hdfs的，在hive中创建一个库，实际上就是在hdfs上的当前用户
路径下创建了一个目录；
比如 create database sz；
hdfs://master:9000/user/hive/warehouse/sz.db
hive中的库和hdfs的映射关系，是存放在mysql中的；

hive的操作

hive创建库：create database 库名
hive 查看数据名：show databases;
hive切换数据库：use sz；
hive建表：
create [EXTERNAL] table vv_stat_fact //EXTERNAL关键字，标识是外部表的
(
userid string, //表中的字段和类型
stat_date string,
tryvv int,
sucvv int,
ptime float
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' //指定列的分隔符
STORED AS textfile //指定压缩格式，如果需要直接将数据文本加载到表中，压缩格式需要时textfile
location '/testdata/'; //可以自定义数据存储的hdfs的位置目录
例子：
create table wc
(
userid string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS textfile
hive中表就是hdfs上的一个目录。
hive建表，默认的存储路径在/user/hive/warehouse
创建hive的外部表：
create EXTERNAL table wc_ext
(
userid string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS textfile
hive 删除表：drop table wc_ext;
如果是普通表，会删除hdfs上的对应的目录，同时会删除数据文件。
如果是外部表的话，只会删除mysql中元信息，并不会删除hdfs上对应的目录和数据

练习：创建员工表总表
EMPNO\ENAME\JOB\MGR\HIREDATE \SAL\COMM\DEPTNO
字段是：工号，姓名，工作岗位，部门经理，受雇日期，薪金，奖金，部门编号
create table emp
(
EMPNO string,
ENAME string,
JOB string,
MGR string,
HIREDATE string,
SAL int,
COMM int,
DEPTNO string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile
将dept.txt数据插入表字段（DEPTNO、DNAME、LOC）
create table deptl
(
DEPTNO string,
DNAME string,
LOC string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile
location '/testdata/';

hive的分区表：
优点：避免全表扫描，提高查询效率。
创建分区表;
create table dept_p
(
DEPTNO string,
DNAME string,
LOC string
) PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile;
手动增加分区：
alter table dept_p add partition (dt='20190715');
查看表的属性：desc dept_p

hive 加载数据：
load命令

load data [local] inpath '数据的路径' into table 表名称；

给一个表增加/删除分区命令：目标表必须是分区表，建表的时候就要指定
alter table test_table add partition (pt=xxxx)
alter table test_table drop if exists partition(...);
例子：
增加分区
alter table dept_p add partition(dt='20180707');
删除一个分区：
alter table dept_p drop partition(dt='20180707');
如何规划建分区表还是普通表：
如果数据量每天或者有规律的增加，可以预知数据量会非常大。可以采用分区表
如果数据量比较小或者是静态数据。可以创建普通表。

hive语法：

where用法：用于过滤数据，尤其是分区表的数据，添加过滤条件
select * from emp where sal > 1000 and comm > 1000;
分区表同样把分区字段当做where条件过滤字段。
不支持子查询条件:如
select * from emp where sal > (select max(sal) as s from emp ) and comm > 1000;

join的用法：
left join 用法：会按照左边的表进行输出，右边无论有没有关联上，左边都会全部输出
右边没有关联上的，会以null补充
zs 10 10 shanghai
lisi 20 null
假设想要查出员工表emp中部门所在地是纽约的员工的姓名和薪水。
select e.ename,e.sal from emp e left join dept d on e.deptno=d.deptno where d.loc='NEW YORK';
第二种写法：
select e.ename,e.sal,d.loc from emp e left join (select deptno,loc from dept where loc='NEW YORK') d on e.deptno=d.deptno where d.loc is not null

group by 分组聚合：
例子：求每个部门中的平均薪水是多少：
select deptno,round(avg(sal),1) as a_s from emp group by deptno ;
在hive中参与计算的维度，必须要出现在 group by 后面
distinct 去重：

hive 常用的函数：

if(条件表达式,条件为真返回值,否则返回)；
例子，
如果emp中小于1000员工，name每个员工工资+500
select s.ename,s.sal+500 from (select * from emp where sal < 1000) s;
用if函数写
select ename,if(sal<1000,sal+500,sal) as sal from emp;
也可以用case when
select ename,case when sal<1000 then sal+500 else sal end as sal from emp;
null 判断：
select ename,comm from emp where comm is not null;
窗口函数之row_number() over()：
例子：想要查询每个部门中前2名工资高的员工。
select deptno,ename,sal from
(select deptno,ename,sal,row_number() over(partition by deptno order by sal desc ) as rn from emp) e
where e.rn<=2 ;

高级函数：split() 可以对目标字段进行切分，返回数组类型
explode 可以将行转列；
wordcount的例子：

select w.word,count(*) cn from (select explode(split(userid,',')) as w
ord from wc) w group by w.word;

hive笔记3
日期函数：
datediff
例子： select datediff('2019-07-16','2019-07-13')
date_add(string startdate, int days)
例子：date_add('2019-07-16',50)
from_unixtime 将时间戳转换成日期函数：
select from_unixtime(1563261771)

字符串函数：concat
select concat('zhangsan','lisi')

函数练习的例子：数据
2018/6/1,10
2018/6/2,11
2018/6/3,11
2018/6/4,12
2018/6/5,14
2018/6/6,15
2018/6/7,13
2018/6/8,37
2018/6/9,18
2018/6/10,19
2018/6/11,10
2018/6/12,11
2018/6/13,11
2018/6/14,12
2018/6/15,14
2018/6/16,15

需求：想要每隔7天的平均消费金额。
2018-6-1~2018-6-7 20
2018-6-8~2018-6-14 15

第一步：

select split(date_time,'/')

第二步：

select concat_ws('-',split(date_time,'/'))

第三步：对日期求商

select floor(datediff(concat_ws('-',split(date_time,'/')),'2018-6-1')/7)*7

第四步：对日期进行加

select date_add('2018-6-1',cast(floor(datediff(concat_ws('-',split(date_time,'/')),'2018-6-1')/7)*7 as int))
select date_add('2018-6-1',cast(floor(datediff(concat_ws('-',split(date_time,'/')),'2018-6-1')/7)*7+6 as int))

第五步拼接日期：(hive中group by 后面是不允许跟别名，需要跟实际参与分组的字段)

select concat(date_add('2018-6-1',cast(floor(datediff(concat_ws('-',split(date_time,'/')),'2018-6-1')/7)*7 as int)),'~',date_add('2018-6-1',cast(floor(datediff(concat_ws('-',split(date_time,'/')),'2018-6-1')/7)*7+6 as int))) as dt,round(avg(cost),1) 
from dp group by concat(date_add('2018-6-1',cast(floor(datediff(concat_ws('-',split(date_time,'/')),'2018-6-1')/7)*7 as int)),'~',date_add('2018-6-1',cast(floor(datediff(concat_ws('-',split(date_time,'/')),'2018-6-1')/7)*7+6 as int)))

结果

2018-06-01~2018-06-07 12.3
2018-06-08~2018-06-14 16.9
2018-06-15~2018-06-21 16.6
2018-06-22~2018-06-28 23.3
2018-06-29~2018-07-05 37.9
2018-07-06~2018-07-12 47.6
2018-07-13~2018-07-19 54.3
2018-07-20~2018-07-26 36.0
2018-07-27~2018-08-02 51.3
2018-08-03~2018-08-09 75.4
2018-08-10~2018-08-16 80.7

create table if not exists movie(
stat_date string,
userid string,
uid string,
version string,
country string,
province string,
movie_tryvv int,
movie_sucvv int,
movie_ptime int
)
PARTITIONED BY (
dt string)
clustered by (userid) into 3000 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS textfile;

清空表数据：truncate table xxx
加载分区表数据
load data local inpath '/usr/local/soft/datadir/movedata.csv' into tab
le movie_vv partition(dt='20190717');
简单指标：
pv:页面访问次数
uv：用户独立访问次数
select count(*) as pv, count(distinct userid) as uv from movie_vv;

练习：一句sql算出，可以用到if函数
0-5分钟，观影人数。5-20分钟观影人数，50-100

hive的udf：
分三类udf
：udf 输入一行数据返回一行数据，一对一的关系
：UDAF输入多行数据，返回一行数据，通常是聚合函数
：UDTF 输入一行数据，返回多行数据，通常是行转列的操作，或者切割的操作。

第一步：maven添加依赖：

org.apache.hive
hive-exec
1.2.1

第二步：编写udf的程序：
import org.apache.hadoop.hive.ql.exec.UDF;

public class hive_udf_demo extends UDF{
//方法名称必须是：evaluate
//这里的参数 s 就是sql中函数需要传入的那个字段，
public String evaluate(String s){
//在函数内实现自己的业务逻辑。
if (s=="JONES") {
s="shangdan";
}
return s.toLowerCase();
}

}
第三步：将jar包上传到hive中
第四步：hive加载jar包
add jar /usr/local/soft/datadir/hive_udf_low.jar;
第五步：创建hive函数
CREATE TEMPORARY FUNCTION my_lower as 'hadoop04.shujia04.hive_udf_demo
';

hive的shell 操作：
可以创建一个shell脚本，脚本中通过hive -e的方式写sql

!/bin/bash

source /etc/profile
hive -e 'select * from sz.dp'
启动命令：比如shell 文件名称是 hivesql.sh
sh hivesql.sh
也可以通过hql文件写sql，然后执行文件
hive -f test.hql

将命令加入到定时服务器
https://baijiahao.baidu.com/s?id=1609952845993989858&wfr=spider&for=pc
打开定时任务清单：crontab -e

exercise

一将empdata.txt数据插入hive表。
EMPNO\ENAME\JOB\MGR\HIREDATE \SAL\COMM\DEPTNO
字段中文名字依次是：工号，姓名，工作岗位，部门经理，受雇日期，薪金，奖金，部门编号
将dept.txt数据插入表字段（DEPTNO、DNAME、LOC）
10,ACCOUNTING,NEW YORK
10,ACCOUNTING,shanghai
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

然后查询（sql和运行结果截图）：
1．列出至少有4个员工的所有部门编号和名称。

select ss.DEPTNO,d.DNAME from (select DEPTNO from (select DEPTNO,count(*) as cn from emp group 
by DEPTNO)s  where s.cn>=4)ss left join dept d on ss.DEPTNO=d.DEPTNO; 

select ss.DEPTNO,d.DNAME from (select DEPTNO,count(*) as cn from emp group by DEPTNO having count(*) > 4)ss left join dept d on ss.DEPTNO=d.DEPTNO;

2．列出薪金比“SCOTT”多的所有员工。
第一种：

select ENAME,SAL from emp where sal>(select sal from emp where ENAME='SCOTT');

第二种:

select e.ENAME, e.SAL from (
 select ENAME,SAL,1 as cid  from emp)e 
 left join (select SAL,1 as cid  from emp where ENAME='SCOTT' )s 
 on e.cid=s.cid where e.SAL>s.SAL;

3．列出所有员工的姓名及其直接上级的姓名。

select e.ENAME,ee.ENAME from hemp e left join (select empno,ENAME,MGR from hemp ) ee on e.MGR=ee.empno;

4．列出受雇日期早于其直接上级的所有员工。

 select e.ENAME from hemp e left join (select * from hemp ) ee on e.MGR=ee.empno where e.hiredate<ee.hiredate;

5．列出部门名称和这些部门的员工信息，同时列出那些没有员工的部门。

select  * from dept d left join emp  e on d.DEPTNO=e.DEPTNO;

6．列出所有“CLERK”（办事员）的姓名及其部门名称。

select j.ENAME,d.DNAME from (select * from hemp where JOB= "CLERK" ) j left join hdept d on j.deptno=d.deptno;

7．列出最低薪金大于1500的各种工作。

select  distinct JOB from hemp where SAL > 1500;

8．列出在部门“SALES”（销售部）工作的员工的姓名，假定不知道销售部的部门编号

select e.ENAME from hemp e join  hdept d on d.deptno=e.deptno where d.DNAME="SALES";

9．列出薪金高于公司平均薪金的所有员工。

select e.ENAME from (select  *,1 as cid from emp) e
 left join (select avg(SAL) cn,1 as cid from emp) s 
 on e.cid=s.cid 
 where e.sal>s.cn;

10．列出与“SCOTT”从事相同工作的所有员工。

select p.ENAME from (select * from hemp where  ENAME='SCOTT' ) e left join (select * from hemp where  ENAME!='SCOTT' ) p on e.JOB=p.JOB;

11．列出薪金等于部门30中员工的薪金的所有员工的姓名和薪金。

select ep.ENAME,ep.SAL from (select distinct sal from hemp where DEPTNO='30') e left join hemp ep on e.SAL=ep.SAL;

12．列出薪金高于在部门30工作的所有员工的薪金的员工姓名和薪金。

select e.ENAME,e.SAL from (select  *,1 as cid from hemp) e left join (select max(SAL) cn,1 as cid from hemp where  DEPTNO='30') s on e.cid=s.cid where e.sal>s.cn;

13．列出在每个部门工作的员工数量、平均工资和平均服务期限。

select  DEPTNO,count(ENAME) as name ,avg(sal) as avg_sal,round(avg(datediff(current_timestamp,hiredate)),1) as avg_day from hemp group by DEPTNO;

14．列出所有员工的姓名、部门名称和工资。

select e.ENAME,d.DNAME,e.sal from hemp e left join hdept d on e.DEPTNO=d.DEPTNO;

15．列出所有部门的详细信息和部门人数。

select * from hdept d left join (select DEPTNO,count(ENAME) as cn from hemp group by DEPTNO )  e on e.DEPTNO=d.DEPTNO;

16．列出各种工作的最低工资。

select job,min(sal)  from hemp group by job;

17．列出各个部门的MANAGER（经理）的最低薪金。

select DEPTNO,min(SAL)  FROM hemp where job ='MANAGER' group by DEPTNO;

18．列出所有员工的年工资,列出年薪最高的前3个员工。

select ENAME,sal*12+COMM as all_m from hemp order by all_m desc limit 3;
select * from (select ENAME,sal*12+COMM as all_m from hemp) e order by e.all_m desc limit 3;

思考：列出每个部门薪水前两名最高的人员名称以及薪水

select ename,sal,deptno from 
(select ename,sal,deptno,row_number() over(partition by deptno order by sal desc) as num from hemp ) 
where num <=2;

posted @ 2022-09-22 17:20 Christopher·阳阅读(26) 评论(0) 编辑收藏举报

刷新页面返回顶部

Loading

Christopher·Yougn Blog