使用hive streaming 统计Wordcount
一、 编译处理程序
使用python编写脚本
1、编写map对应的脚本
map.py
#!/usr/bin/env python
import sys
for i in sys.stdin:
worlds = i.strip().split()
for word in worlds:
print("%s\t1" % word.lower())
2、编写reduce对应的脚本
reduce.py
#!/usr/bin/env python
import sys
wordDict=dict()
for i in sys.stdin:
i = i.split()[0]
if i in wordDict:
wordDict[i] += 1
else:
wordDict[i] = 1
for i in wordDict.keys():
print(i + ":" + str(wordDict[i]))
二、 创建对应的表结构
1、创建docs表
hive (default)> create table docs(line string);
2、 准备docs表对于的数据
[hduser@yjt test]$ cat f.txt
hadoop beijin
yjt hadoop test
spark hadoop shanghai
yjt
mm
jj
gg
gg
this is a beijin
3、加载数据到docs表
hive (default)> load data local inpath '/data1/studby/test/f.txt' into table docs;
4、准备wordcount表,用于存放最终的数据结果
hive (default)> create table wordcount(word string, count int) row format delimited fields terminated by ':';
三、执行
hive (default)> from (from docs select transform (line) using "/data1/studby/test/map.py" as (word, count) cluster by word) as wc insert overwrite table wordcount select transform(wc.word, wc.count) using "/data1/studby/test/reduce.py" as (word, count);
查看结果
hive (default)> select * from wordcount;
OK
wordcount.word wordcount.count
a 1
mm 1
is 1
shanghai 1
beijin 2
hadoop 3
gg 2
this 1
yjt 2
jj 1
test 1
spark 1
记录学习和生活的酸甜苦辣.....哈哈哈