A Brief Guide to Using Hive
Specifying Delimiters
Delimiters in files written by Hive
Columns are separated by '\1' (ASCII code 1, displayed as ^A in vim); within a column, the delimiter for each deeper nesting level is '\2', '\3', '\4', and so on.
Example: the output columns are map<string,string>, array<string>, and int
key1 \3 value1 \2 key2 \3 value2 ... \1 value1 \2 value2 \2 value3 ... \1 the int value of the third column
How to specify delimiters:
INSERT OVERWRITE DIRECTORY 'xx' ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char]
SELECT ...;
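For example, a minimal sketch based on the syntax above (the table name demo_table and the columns user_id, tag_map, click_list are hypothetical, not from this document):
INSERT OVERWRITE DIRECTORY 'hdfs://host:port/your_hdfs_path/demo_output'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
SELECT user_id, tag_map, click_list FROM demo_table;   -- hypothetical table with a map column and an array column
With these settings, fields are separated by tabs, entries of tag_map and click_list by ',', and keys and values inside tag_map by ':'.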
File Upload
Users can use the ADD command to upload auxiliary files, such as data to be loaded into temporary tables or jar packages of their own UDFs. Currently the size limit for local files uploaded from the client is 200 MB; if a file exceeds this limit, first upload it to HDFS yourself and then run the ADD command, where file is a full-path URI of the form hdfs://host:port/your_file_path.
Uploading files with ADD (a usage sketch follows the list):
- ADD FILE local_filename
- ADD CONF local_conf
- ADD JAR local_jar
- ADD ARCHIVE local.tar.gz/local.tgz
- ADD CACHEFILE/CACHEARCHIVE hdfs://host:port/your_hdfs_file
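As a rough usage sketch (the jar name, class name, file name, and table below are illustrative assumptions, not from this document):
ADD JAR my_udf.jar;                                              -- hypothetical local jar containing a UDF
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.MyLowerUDF';  -- hypothetical UDF class name
ADD FILE stopwords.txt;                                          -- hypothetical data file; scripts run via TRANSFORM can open it by file name
SELECT my_lower(query) FROM demo_table;                          -- hypothetical table and column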
Several Kinds of Sorting in Hive
Sorting method | Description |
---|---|
order by key | Global sort by key using a single reducer; not suitable for large data volumes, and in the production environment a limit must be added, otherwise an error is reported |
sort by key | Sorts by key within each individual reducer |
distribute by key | Distributes rows to different reducers by the specified field, so each output file contains a different set of keys |
cluster by key | Distributes by key and then sorts within each reducer; equivalent to distribute by + sort by |
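To make the differences concrete, a short sketch assuming a hypothetical table demo_table with a column key:
-- ORDER BY: global sort on a single reducer; add LIMIT as required in the production environment
SELECT * FROM demo_table ORDER BY key LIMIT 1000;
-- SORT BY: each reducer's output is sorted, but there is no global ordering
SELECT * FROM demo_table SORT BY key;
-- DISTRIBUTE BY: rows with the same key go to the same reducer (hence the same output file), without sorting
SELECT * FROM demo_table DISTRIBUTE BY key;
-- CLUSTER BY: equivalent to DISTRIBUTE BY key SORT BY key
SELECT * FROM demo_table CLUSTER BY key;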
TRANSFORM Syntax
Hive's TRANSFORM feature starts the user-supplied program as a child process and pipes the input data to the program's standard input (columns are separated by '\t', and the input format is the same as the SELECT result). After processing, the program writes its output (also '\t'-separated) to standard output, which is passed back to Hive. The TRANSFORM syntax can be used to embed custom mapper and reducer programs.
Note: 1. The program after the USING clause must be executable. A program added with add file does not have execute permission by default; there are two workarounds:
a. USING 'bash xx.sh' or USING '/home/sharelib/python/bin/python xx.py'
b. Grant your script execute permission, then pack it into an archive and add it with add archive:
ADD ARCHIVE xx.tar.gz;
SELECT TRANSFORM(a, b) USING 'xx.tar.gz/xx.sh' AS(a, b) FROM t; (note that the script path must be prefixed with the archive name xx.tar.gz, otherwise the file will not be found)
Example Jobs Using the TRANSFORM Syntax
Example 1: a mapper-only job that collects click logs for URLs in a given URL set
SQL script:
set mapred.max.split.size=128000000;
set mapred.job.map.capacity=1000;
set mapred.map.memory.limit=1000;
set stream.memory.limit=900;
set mapred.reduce.tasks=100;
set mapred.job.name=word_count_demo;
ADD CACHEARCHIVE hdfs://host:port/your_hdfs_path/python2.7.2.tgz;
ADD FILE pc_vip_1w_norm;
ADD FILE mapper_clicklog_finder.py;
ADD FILE url_normalizer_tool_without_protocal;
ADD FILE mapper_clicklog_finder.sh;
INSERT OVERWRITE DIRECTORY "hdfs://host:port/your_hdfs_path/click_data_1mon_norm"
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
SELECT TRANSFORM(event_click_target_url, event_useragent)
USING 'sh mapper_clicklog_finder.sh'
AS(url, ua)
FROM udw.udw_event
WHERE event_action = 'query_click' AND event_day >= '20180515' AND event_day <= '20180615'
CLUSTER BY url;
The content of mapper_clicklog_finder.sh is as follows:
#!/bin/bash
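# The rows selected by TRANSFORM arrive on this script's stdin. The normalizer binary shipped
# via ADD FILE is not executable by default, so grant execute permission first, then pipe its
# output into the Python mapper (the URL set file is passed as a command-line argument).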
chmod a+x url_normalizer_tool_without_protocal
./url_normalizer_tool_without_protocal 2>/dev/null \
| python2.7.2.tgz/python2.7.2/bin/python mapper_clicklog_finder.py pc_vip_1w_norm
The content of mapper_clicklog_finder.py is as follows:
import sys
def load_url_set(filename):
    urlset = set()
    fp = open(filename, 'rt')
    for line in fp:
        url = line.strip()
        urlset.add(url)
    fp.close()
    return urlset
if __name__ == '__main__':
    filename = sys.argv[1]
    urlset = load_url_set(filename)
    for line in sys.stdin:
        fields = line.strip().split('\t')
        if len(fields) != 2:
            continue
        clickurl = fields[0]
        ua_str = fields[1]
        if clickurl in urlset:
            print('%s\t%s' % (clickurl, ua_str))
Example 2: a mapper + reducer job that counts URL occurrences
SQL script:
set mapred.reduce.tasks=10;
ADD CACHEARCHIVE hdfs://host:port/your_hdfs_path/python2.7.2.tgz;
ADD FILE mapper_task.py;
ADD FILE reducer_task.py;
FROM (
FROM default.hpb_dws_bhv_news_1d
SELECT TRANSFORM(rid, r_url)
USING 'python2.7.2.tgz/python2.7.2/bin/python mapper_task.py'
AS(url, cnt)
WHERE event_day >= '20180611' AND event_day <= '20180617' AND r_url != ''
CLUSTER BY url) mapper_output
INSERT OVERWRITE DIRECTORY "hdfs://host:port/your_hdfs_path/click_data_1mon"
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
SELECT TRANSFORM(mapper_output.url, mapper_output.cnt) USING 'python2.7.2.tgz/python2.7.2/bin/python reducer_task.py' AS (url, cnt);
The mapper_task.py file:
import sys
for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    rid, r_url = fields
    # each output record must end with a newline so that the downstream sort/reducer sees one URL per line
    sys.stdout.write('%s\t1\n' % r_url)
The reducer_task.py file:
import sys
url_pre = None
cnt_pre = 0
for line in sys.stdin:
    fields = line.strip().split('\t')
    url = fields[0]
    cnt = int(fields[1])
    if url != url_pre:
        if url_pre:
            print('%s\t%d' % (url_pre, cnt_pre))
        cnt_pre = 0
        url_pre = url
    cnt_pre += cnt
if url_pre:
    print('%s\t%d' % (url_pre, cnt_pre))
For detailed documentation on TRANSFORM, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
Author: JeromeWang
Email: yunfeiwang@hust.edu.cn
Source: http://www.cnblogs.com/jeromeblog/
The copyright of this article belongs to the author. Reposting is welcome, but unless you have the author's consent you must retain this statement and provide a clear link to the original article in a prominent place on the page; otherwise the author reserves the right to pursue legal liability.