[TOC]


Hive笔记整理(三)

Hive的函数

Hive函数分类

函数的定义和java、mysql一样,有三种。

UDF(User Definition Function 用户定义函数)

一路输入,一路输出
sin(30°)=1/2

UDAF(User Definition Aggregation Function 聚合函数)

多路输入,一路输出
max min count sum avg等等

UDTF(User Definition Table Function 表函数)

一路输入,多路输出
explode

常用函数

show functions;             列出hive中可用的函数列表
desc function func_name;    查看函数的帮助说明

case when ---->switch或if else
if ---->三元运算符
explode ---->将数组中的元素转换成多行数据
a = [1, 2, 3, 4] explode(a) ===>
1
2
3
4
split ---->就是字符串中的split函数

array ---->
collect_set
collect_list
concat_ws ---->使用给定的字符串来连接元素

row_number ---->分组排序或者二次排序

函数案例

wordcount

分析:
    hello   you
    hello   me
    hello   he
使用mr的的过程
    step1----->split("\t")--->
        ["hello", "you"]
        ["hello", "me"]
        ["hello", "he"]
    step2----->遍历每一个数组,将数组中的每一个值,作为key,value为1写出去<key, 1>
        <"hello", 1>
        <"you", 1>
        <"hello", 1>
        <"me", 1>
        <"hello", 1>
        <"he", 1>
step3,shuffle<span class="hljs-comment">---&gt;</span>
    &lt;"hello", [1, 1, 1]&gt;
    &lt;"you", 1&gt;
    &lt;"me", 1&gt;
    &lt;"he", 1&gt;
step 4, reduce ====&gt;reduceByKey

使用hql
step 1 (mydb1)> select split(line, "\t") from test;
["hello","you"]
["hello","he"]
["hello","me"]
step 2 将数组中的每一行数据转化为多行
(mydb1)> select explode(split(line, "\t")) from test;
hello
you
hello
he
hello
me
step 3 在step2的基础之上进行group by 即可
select
w.word, count(w.word) as count
from (select explode(split(line, "\t")) word from test) w
group by w.word order by count desc;

case when

case when将一下对应的部门名称显示出来:

1--->学工组,2--->行政组,3---->销售组,4---->研发组,5---->其它
hive (mydb1)> select * from t1;
1
2
3
4
5
select
  id,
case id
  when 1 then "学工组"
  when 2 then "行政组"
  when 3 then "销售组"
  when 4 then "研发组"
  else "行政组"
end
from t1;    
分类显示
1   学工组
2   行政组
3   销售组
4   研发组
5   其它

row_number 二次排序

三种连接
    交叉连接
        across join,会有笛卡尔积,所以不用
    内连接(等值连接)
        inner join
        将左表和右表中能够匹配的上的数据做输出
    外链接
        outer join
        左外连接(left outer join)
    右外链接(right outer join)

根据员工、部分、薪资,这三张表,
1、分组显示每一个部分员工的信息(启动显示部分名称,员工姓名,员工性别[男|女],员工薪资),同时分组按照员工薪资降序排序
select
e.name, if(sex == 0, '女', '男') as gender, d.name, s.salary,
row_number() over(partition by e.deptid order by s.salary desc) rank
from t_dept d
left join t_employee e on d.id = e.deptid
left join t_salary s on e.id = s.empid
where s.salary is not null;
2、获取显示部门薪资top2的员工信息
select
tmp.*
from
(select
e.name, if(sex == 0, '女', '男') as gender, d.name, s.salary,
row_number() over(partition by e.deptid order by s.salary desc) rank
from t_dept d
left join t_employee e on d.id = e.deptid
left join t_salary s on e.id = s.empid
where s.salary is not null) tmp
where tmp.rank < 3;
如果查询的是单表,则可以不用子查询,只用用having来获取即可(having rank < 3)

直接看下面的一个例子就可以知道row_number的使用方法了:

hive (mydb2)> create table t9(
            >   id int,
            >   province string,
            >   salary float
            > );
hive (mydb2)> insert into t9 values(1,'gd',18000),(2,'gd',16000),(3,'bj',13000),(4,'gd',15000),(5,'bj',17000),(6,'bj',19000);
hive (mydb2)> select * from t9;
OK
1       gd      18000.0
2       gd      16000.0
3       bj      13000.0
4       gd      15000.0
5       bj      17000.0
6       bj      19000.0
Time taken: 0.097 seconds, Fetched: 6 row(s)
hive (mydb2)> select
            >   id,
            >   province,
            >   salary,
            >   row_number() over(partition by province order by salary desc) as rank
            > from t9;
OK
6       bj      19000.0 1
5       bj      17000.0 2
3       bj      13000.0 3
1       gd      18000.0 1
2       gd      16000.0 2
4       gd      15000.0 3
Time taken: 1.578 seconds, Fetched: 6 row(s)

Hive自定义函数

自定义函数步骤

自定义函数需要遵循的6个步骤:

1°、自定义一个Java类来继承UDF类
2°、覆盖其中的evaluate()的函数,有系统去调用
3°、将写好的程序打成一个jar,上传至服务器
4°、将3°中的jar加载到hive的classpath
hive终端执行add jar jar_path;
5°、给自定义函数设置一个临时的名称,也就是说要创建一个临时的函数
create temporary function 函数名 as '写的evalutor所在类的全类名';
6°、执行函数结束之后,可以手动销毁临时函数,或者不用管,因为当前会话消失,函数自动销毁

UDF案例:要根据用户的birthday,统计对应的生肖和星座

程序代码如下:

package com.uplooking.bigdata.hive.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

@Description(name = "z_c",
value = "FUNC(param1, param2) - 返回给定日期对应的生肖或者星座",
extended = "param1,param2参数可以是一下:\n"
+ "1. param1 is A string in the format of 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'.\n"
+ "2. param1 date value\n"
+ "3. param1 timestamp value\n"
+ "3. param2 0 or 1, 0 means constellation, 1 means zodica\n"
+ "Example:\n "
+ " > SELECT FUNC('2009-07-30', 0) FROM src LIMIT 1;\n" + " 狮子座")
public class ZodicaAndConstellationUDF extends UDF {

<span class="hljs-function"><span class="hljs-keyword">public</span> Text <span class="hljs-title">evaluate</span><span class="hljs-params">(java.sql.Date date, <span class="hljs-keyword">int</span> type)</span> </span>{
    <span class="hljs-keyword">if</span>(type == <span class="hljs-number">0</span>) {<span class="hljs-comment">//星座</span>
        <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> Text(getConstellation(<span class="hljs-keyword">new</span> Date(date.getTime())));
    } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span>(type == <span class="hljs-number">1</span>) { <span class="hljs-comment">//生肖</span>
        <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> Text(getZodica(<span class="hljs-keyword">new</span> Date(date.getTime())));
    }
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">null</span>;
}

<span class="hljs-keyword">public</span> String[] zodiacArr = { <span class="hljs-string">"猴"</span>, <span class="hljs-string">"鸡"</span>, <span class="hljs-string">"狗"</span>, <span class="hljs-string">"猪"</span>, <span class="hljs-string">"鼠"</span>, <span class="hljs-string">"牛"</span>, <span class="hljs-string">"虎"</span>, <span class="hljs-string">"兔"</span>, <span class="hljs-string">"龙"</span>, <span class="hljs-string">"蛇"</span>, <span class="hljs-string">"马"</span>, <span class="hljs-string">"羊"</span> };
<span class="hljs-keyword">public</span> String[] constellationArr = { <span class="hljs-string">"水瓶座"</span>, <span class="hljs-string">"双鱼座"</span>, <span class="hljs-string">"白羊座"</span>, <span class="hljs-string">"金牛座"</span>, <span class="hljs-string">"双子座"</span>, <span class="hljs-string">"巨蟹座"</span>, <span class="hljs-string">"狮子座"</span>, <span class="hljs-string">"×××座"</span>, <span class="hljs-string">"天秤座"</span>, <span class="hljs-string">"天蝎座"</span>, <span class="hljs-string">"射手座"</span>, <span class="hljs-string">"魔羯座"</span> };
<span class="hljs-keyword">public</span> <span class="hljs-keyword">int</span>[] constellationEdgeDay = { <span class="hljs-number">20</span>, <span class="hljs-number">19</span>, <span class="hljs-number">21</span>, <span class="hljs-number">21</span>, <span class="hljs-number">21</span>, <span class="hljs-number">22</span>, <span class="hljs-number">23</span>, <span class="hljs-number">23</span>, <span class="hljs-number">23</span>, <span class="hljs-number">23</span>, <span class="hljs-number">22</span>, <span class="hljs-number">22</span> };
<span class="hljs-comment">/**
 * 根据日期获取生肖
 * <span class="hljs-doctag">@return</span>
 */</span>
<span class="hljs-function"><span class="hljs-keyword">public</span> String <span class="hljs-title">getZodica</span><span class="hljs-params">(Date date)</span> </span>{
    Calendar cal = Calendar.getInstance();
    cal.setTime(date);
    <span class="hljs-keyword">return</span> zodiacArr[cal.get(Calendar.YEAR) % <span class="hljs-number">12</span>];
}
<span class="hljs-comment">/**
 * 根据日期获取星座
 * <span class="hljs-doctag">@return</span>
 */</span>
<span class="hljs-function"><span class="hljs-keyword">public</span> String <span class="hljs-title">getConstellation</span><span class="hljs-params">(Date date)</span> </span>{
    <span class="hljs-keyword">if</span> (date == <span class="hljs-keyword">null</span>) {
        <span class="hljs-keyword">return</span> <span class="hljs-string">""</span>;
    }
    Calendar cal = Calendar.getInstance();
    cal.setTime(date);
    <span class="hljs-keyword">int</span> month = cal.get(Calendar.MONTH);
    <span class="hljs-keyword">int</span> day = cal.get(Calendar.DAY_OF_MONTH);
    <span class="hljs-keyword">if</span> (day &lt; constellationEdgeDay[month]) {
        month = month - <span class="hljs-number">1</span>;
    }
    <span class="hljs-keyword">if</span> (month &gt;= <span class="hljs-number">0</span>) {
        <span class="hljs-keyword">return</span> constellationArr[month];
    }
<span class="hljs-comment">// default to return 魔羯</span>
    <span class="hljs-keyword">return</span> constellationArr[<span class="hljs-number">11</span>];
}

}

注意依赖在笔记最后面。

上传到服务器后,在hive终端中加载到hive的classpath:

add jar /home/uplooking/jars/hive/udf-zc.jar

自定义函数:

create temporary function zc as 'com.uplooking.bigdata.hive.udf.ZodicaAndConstellationUDF';

创建测试用的临时表:

hive (mydb1)>
            > create temporary table tmp(
            > birthday date);

插入测试用的数据:

hive (mydb1)> insert into tmp values('1994-06-21');

在查询中使用函数:

hive (mydb1)> select zc(birthday,0) from tmp;
OK
c0
双子座
Time taken: 0.084 seconds, Fetched: 1 row(s)
hive (mydb1)> select zc(birthday,1) from tmp;
OK
c0
狗
Time taken: 0.044 seconds, Fetched: 1 row(s)

下面是一个更简单的UDF函数,可以参考进行测试:

package cn.xpleaf.hive.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**

  • @author Leaf

  • @date 2018/9/18 下午11:11
    */
    @Description(name = "addUDF", value = "FUNC(num1, num2) - 返回给定两个数的和")
    public class AddUDF extends UDF {

    public Text evaluate(int num1, int num2) {
    return new Text(String.valueOf(num1 + num2));
    }

}

Hive之jdbc

Hive除了提供前面的cli用户接口,还提供了jdbc的用户接口,但是如果需要使用该接口,则需要先启动hiveserver2服务,启动该服务后,可以通过hive提供的beeline继续以cli的方式操作hive(不过需要注意的是,此时是通过jdbc接口进行操作hive的),也可以通过手工编写java代码来进行操作。

启动hiveserver2服务

[uplooking@uplooking01 ~]$ hiveserver2

通过beeline连接hiveserver进行操作

[uplooking@uplooking01 hive]$ beeline
which: no hbase in (/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/jdk/bin:/home/uplooking/bin:/home/uplooking/app/zookeeper/bin:/home/uplooking/app/hadoop/bin:/home/uplooking/app/hadoop/sbin:/home/uplooking/app/hive/bin)
ls: 无法访问/home/uplooking/app/hive/lib/hive-jdbc-*-standalone.jar: 没有那个文件或目录
Beeline version 2.1.0 by Apache Hive
beeline> !connect jdbc:hive2://uplooking01:10000/mydb1
Connecting to jdbc:hive2://uplooking01:10000/mydb1
Enter username for jdbc:hive2://uplooking01:10000/mydb1: uplooking
Enter password for jdbc:hive2://uplooking01:10000/mydb1: *********
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/uplooking/app/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/uplooking/app/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Error: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: uplooking is not allowed to impersonate uplooking (state=,code=0)

可以看到出现错误,解决方案如下:

在执行JDBC的时候,访问不了远程的Hive的ThriftServer服务
报的错误:uplooking不能伪装为uplooking
    是因为版本在进行升级的时候考虑到的安全策略,需要我们手动对uplooking进行配置,需要将
hadoop中的uplooking用户和hive中的uplooking用户进行打通,配置在$HADOOP_HOME/etc/hadoop/core-site.xml
中进行配置:添加一下配置项
    <property>
        <name>hadoop.proxyuser.uplooking.hosts</name>
        <value>*</value>
        <description>这是uplooking用户访问的本机地址</description>
    </property>
    <property>
        <name>hadoop.proxyuser.uplooking.groups</name>
        <value>root</value>
        <description>代理uplooking设置的组用户</description>
    </property>     
配置成功之后,需要同步到集群中的各个节点,
要想让集群重新加载配置信息,至少hdfs需要重启    

这样之后就可以正常使用beeline通过hive提供的jdbc接口来操作hive了:

beeline> !connect jdbc:hive2://uplooking01:10000/mydb1
Connecting to jdbc:hive2://uplooking01:10000/mydb1
Enter username for jdbc:hive2://uplooking01:10000/mydb1: uplooking
Enter password for jdbc:hive2://uplooking01:10000/mydb1: *********
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/uplooking/app/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/uplooking/app/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connected to: Apache Hive (version 2.1.0)
Driver: Hive JDBC (version 2.1.0)
18/03/23 08:00:15 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://uplooking01:10000/mydb1> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| mydb1          |
+----------------+--+
2 rows selected (2.164 seconds)
0: jdbc:hive2://uplooking01:10000/mydb1> show tables;
+-----------+--+
| tab_name  |
+-----------+--+
| t1        |
| t2        |
+-----------+--+
2 rows selected (0.118 seconds)
0: jdbc:hive2://uplooking01:10000/mydb1> select * from t1;
+------------+--+
|  t1.line   |
+------------+--+
| hello you  |
| hello he   |
| hello me   |
+------------+--+
3 rows selected (2.143 seconds)
0: jdbc:hive2://uplooking01:10000/mydb1> 

通过java代码连接hiveserver进行操作

程序代码如下:

package com.uplooking.bigdata.hive.jdbc;

import java.sql.*;

public class HiveJDBC {
public static void main(String[] args) throws Exception {
Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection conn = DriverManager.getConnection("jdbc:hive2://uplooking01:10000/mydb1", "uplooking", "uplooking");
String sql = "select t.word,count(t.word) as count from (select explode(split(line, ' ')) as word from t1) t group by t.word";
PreparedStatement ps = conn.prepareStatement(sql);
ResultSet rs = ps.executeQuery();
while (rs.next()) {
String word = rs.getString("word");
int count = rs.getInt("count");
System.out.println(word + "\t" + count);
}
rs.close();
ps.close();
conn.close();
}
}

程序执行结果如下:

18/03/23 00:48:16 INFO jdbc.Utils: Supplied authorities: uplooking01:10000
18/03/23 00:48:16 INFO jdbc.Utils: Resolved authority: uplooking01:10000
he  1
hello   3
me  1
you 1

在这个过程中,注意观察hiveserver2终端的输出:

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = uplooking_20180323084825_63044683-393d-4625-a3c3-b440109c3d70
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1521765850571_0002, Tracking URL = http://uplooking02:8088/proxy/application_1521765850571_0002/
Kill Command = /home/uplooking/app/hadoop/bin/hadoop job  -kill job_1521765850571_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-03-23 08:48:33,427 Stage-1 map = 0%,  reduce = 0%
2018-03-23 08:48:40,864 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.54 sec
2018-03-23 08:48:48,294 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.84 sec
MapReduce Total cumulative CPU time: 6 seconds 840 msec
Ended Job = job_1521765850571_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.84 sec   HDFS Read: 8870 HDFS Write: 159 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 840 msec
OK

Hive中文注释乱码解决

如果有乱码出现,可以尝试下面的解决方案:

    hive中文注释乱码解决:
    在hive的元数据库中,执行一下脚本
        ALTER TABLE COLUMNS_V2 MODIFY COLUMN COMMENT VARCHAR(256) CHARACTER SET utf8;
        ALTER TABLE TABLE_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
        ALTER TABLE PARTITION_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
        ALTER TABLE PARTITION_KEYS MODIFY COLUMN PKEY_COMMENT VARCHAR(4000) CHARACTER SET utf8;
        ALTER TABLE INDEX_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
    同时将url,加上utf-8
        &useUnicode=true&characterEncoding=UTF-8
          <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://uplooking01:3306/hive?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=UTF-8</value>
          </property>

Hive的maven依赖

<properties>
   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
   <hive-api.version>2.1.0</hive-api.version>
   <hadoop-api.version>2.6.4</hadoop-api.version>
   <hadoop-core.version>1.2.1</hadoop-core.version>
</properties>

<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>${hadoop-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>${hadoop-core.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-serde</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-service</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-metastore</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-common</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-cli</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.thrift</groupId>
<artifactId>libfb303</artifactId>
<version>0.9.0</version>
</dependency>
</dependencies>