A Brief Guide to Using Hive

Specifying Delimiters

When Hive writes query results to a file, the delimiter between columns is '\1' (ASCII code 1, displayed as ^A in vim). Within a column, the delimiter changes with each additional level of nesting: '\2', '\3', '\4', and so on.
Example: if the output columns are of types map<string,string>, array, and int, the delimiters are:
key1 \3 value1 \2 key2 \3 value2 ... \1 value1 \2 value2 \2 value3 ... \1 the int value of the third column
How to specify the delimiters:
INSERT OVERWRITE DIRECTORY 'xx' ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char]
SELECT ...;
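
A minimal sketch of this syntax; the output path, the table name demo_nested, and the column names kv_col, arr_col, int_col are placeholders used only for illustration:

INSERT OVERWRITE DIRECTORY '/tmp/demo_nested_out'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'           -- separator between top-level columns
COLLECTION ITEMS TERMINATED BY ','  -- separator between array elements / map entries
MAP KEYS TERMINATED BY ':'          -- separator between a map key and its value
SELECT kv_col, arr_col, int_col
FROM demo_nested;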

File Upload

Users can use the ADD command to upload auxiliary files, such as data to be loaded into a temporary table or the jar file of a user-written UDF. The client currently limits uploaded local files to 200 MB; if a file exceeds this limit, upload it to HDFS yourself first and then run the ADD command, where file is a fully qualified URI of the form hdfs://host:port/your_file_path.
Uploading files with ADD (a short sketch follows the list):

  1. ADD FILE local_filename
  2. ADD CONF local_conf
  3. ADD JAR local_jar
  4. ADD ARCHIVE local.tar.gz/local.tgz
  5. ADD CACHEFILE/CACHEARCHIVE hdfs://host:port/your_hdfs_file
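
A minimal sketch of the typical UDF workflow; the jar name my_udf.jar, the class com.example.udf.MyLower, and the table some_table are hypothetical:

ADD JAR my_udf.jar;                                                -- hypothetical local jar in the client working directory
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.udf.MyLower';   -- hypothetical UDF class inside the jar
SELECT my_lower(url) FROM some_table LIMIT 10;                     -- some_table is a placeholder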

Several Kinds of Sorting in Hive

Sorting method: description
order by key: global sort by key, performed by a single reducer, so the data volume should not be too large; in the production environment a LIMIT clause must be added, otherwise an error is reported.
sort by key: sorts by key within each individual reducer.
distribute by key: distributes rows to different reducers by the specified field, so each output file contains its own set of keys.
cluster by key: distributes by key and then sorts within each reducer; equivalent to distribute by + sort by.
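
A short sketch contrasting these clauses; t and key are placeholder names:

-- Global sort: a LIMIT is mandatory in the production environment
SELECT * FROM t ORDER BY key LIMIT 100;

-- The following two statements are equivalent
SELECT * FROM t DISTRIBUTE BY key SORT BY key;
SELECT * FROM t CLUSTER BY key;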

TRANSFORM Syntax

Hive's TRANSFORM feature starts the user-provided program as a child process and pipes the input data to the program's standard input (columns are separated by '\t', and the input format is the same as the SELECT result). After the program has processed the data, it writes the result to standard output (also '\t'-separated), which is passed back to Hive. The TRANSFORM syntax can therefore be used to embed custom mapper and reducer programs.

Note: the program in the USING clause must have execute permission, and programs added with ADD FILE do not have execute permission by default. There are two workarounds:
a. Invoke the script through an interpreter: USING 'bash xx.sh' or USING '/home/sharelib/python/bin/python xx.py'
b. Give the script execute permission, pack it into an archive, and add it with ADD ARCHIVE:
ADD ARCHIVE xx.tar.gz;
SELECT TRANSFORM(a, b) USING 'xx.tar.gz/xx.sh' AS(a, b) FROM t;  (note that the script path must be prefixed with the archive name xx.tar.gz, otherwise the file will not be found)
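
A minimal sketch of the syntax itself (t, a, b are placeholder names): any executable that reads tab-separated lines from stdin and writes tab-separated lines to stdout can be plugged in; cat simply passes the rows through unchanged:

SELECT TRANSFORM(a, b)
USING 'cat'        -- identity transform: output equals input
AS (a, b)
FROM t;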

Example Jobs Using the TRANSFORM Syntax

Example 1: a mapper-only job that collects click logs for URLs in a given set

The SQL script:

set mapred.max.split.size=128000000;
set mapred.job.map.capacity=1000;
set mapred.map.memory.limit=1000;
set stream.memory.limit=900;
set mapred.reduce.tasks=100;
set mapred.job.name=word_count_demo;

ADD CACHEARCHIVE hdfs://host:port/your_hdfs_path/python2.7.2.tgz;
ADD FILE pc_vip_1w_norm;
ADD FILE mapper_clicklog_finder.py;
ADD FILE url_normalizer_tool_without_protocal;
ADD FILE mapper_clicklog_finder.sh;

INSERT OVERWRITE DIRECTORY "hdfs://host:port/your_hdfs_path/click_data_1mon_norm" 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
SELECT TRANSFORM(event_click_target_url, event_useragent)
USING 'sh mapper_clicklog_finder.sh'
AS(url, ua)
FROM udw.udw_event
WHERE event_action = 'query_click' AND event_day >= '20180515' AND event_day <= '20180615'
CLUSTER BY url;

The contents of mapper_clicklog_finder.sh:

#!/bin/bash

# Files added with ADD FILE are not executable by default, so grant execute permission first
chmod a+x url_normalizer_tool_without_protocal

# Normalize the incoming rows (read from stdin), then let the python mapper keep only
# those rows whose URL appears in the URL set file pc_vip_1w_norm
./url_normalizer_tool_without_protocal 2>/dev/null \
        | python2.7.2.tgz/python2.7.2/bin/python mapper_clicklog_finder.py pc_vip_1w_norm

The contents of mapper_clicklog_finder.py:

import sys

def load_url_set(filename):
    """Load a newline-separated URL list into a set."""
    urlset = set()
    fp = open(filename, 'rt')

    for line in fp:
        url = line.strip()
        urlset.add(url)

    fp.close()
    return urlset

if __name__ == '__main__':
    filename = sys.argv[1]
    urlset = load_url_set(filename)

    # Input rows arrive on stdin as "url<TAB>ua"; keep only rows whose URL
    # is in the set and echo them back to stdout for Hive.
    for line in sys.stdin:
        fields = line.strip().split('\t')
        if len(fields) != 2:
            continue
        clickurl = fields[0]
        ua_str = fields[1]
        if clickurl in urlset:
            print('%s\t%s' % (clickurl, ua_str))

Example 2: a mapper+reducer job that counts URL occurrences

The SQL script:

set mapred.reduce.tasks=10;

ADD CACHEARCHIVE hdfs://host:port/your_hdfs_path/python2.7.2.tgz;
ADD FILE mapper_task.py;
ADD FILE reducer_task.py;

FROM (
    FROM default.hpb_dws_bhv_news_1d
    SELECT TRANSFORM(rid, r_url) 
    USING 'python2.7.2.tgz/python2.7.2/bin/python mapper_task.py' 
    AS(url, cnt) 
    WHERE event_day >= '20180611' AND event_day <= '20180617' AND r_url != '' 
    CLUSTER BY url) mapper_output
INSERT OVERWRITE DIRECTORY "hdfs://host:port/your_hdfs_path/click_data_1mon" 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
SELECT TRANSFORM(mapper_output.url, mapper_output.cnt) USING 'python2.7.2.tgz/python2.7.2/bin/python reducer_task.py' AS (url, cnt);

The mapper_task.py file:

import sys

# Emit "url<TAB>1" for every input row; rows arrive on stdin as "rid<TAB>r_url".
for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 2:
        continue
    rid, r_url = fields
    sys.stdout.write('%s\t1\n' % r_url)  # the trailing newline is required: one record per line

The reducer_task.py file:

import sys

# Rows arrive sorted and grouped by url (guaranteed by CLUSTER BY url in the SQL),
# so the counts can be accumulated in a single streaming pass.
url_pre = None
cnt_pre = 0

for line in sys.stdin:
    fields = line.strip().split('\t')
    url = fields[0]
    cnt = int(fields[1])
    if url != url_pre:
        # A new url begins: flush the count of the previous one
        if url_pre:
            print('%s\t%d' % (url_pre, cnt_pre))
        cnt_pre = 0
    url_pre = url
    cnt_pre += cnt

# Flush the last url
if url_pre:
    print('%s\t%d' % (url_pre, cnt_pre))

For detailed documentation on TRANSFORM, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
