Using Hadoop Streaming

In this section we implement wordcount in both C++ and Python.

First, a brief introduction to Hadoop Streaming.

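The mapper and the reducer read user data from standard input, process it line by line, and write the results to standard output. The Streaming utility creates a MapReduce job, submits it to the tasktrackers, and monitors the job for as long as it runs.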

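When a file (an executable or a script) is specified as the mapper, each mapper task launches that file as a separate process when the task is initialized. While the mapper task runs, it splits its input into lines and feeds each line to the standard input of that process. At the same time, the mapper collects what the process writes to standard output and converts each line it receives into a key/value pair, which becomes the mapper's output. By default, the part of a line before the first tab is the key and the rest of the line (excluding the tab) is the value. If a line contains no tab, the whole line is the key and the value is null.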
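The default key/value rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not Hadoop's own code; the function name `split_key_value` is made up for this example, and Python's `None` stands in for the "null" value.

```python
def split_key_value(line, sep="\t"):
    """Split one line of mapper output into (key, value).

    Hypothetical helper that mirrors the default rule described above:
    everything before the first tab is the key, the rest is the value.
    """
    line = line.rstrip("\n")
    key, found, value = line.partition(sep)
    if not found:
        # No tab at all: the whole line is the key, the value is "null".
        return line, None
    return key, value


if __name__ == "__main__":
    for sample in ["hello\t1", "a\tb\tc", "no-tab-here"]:
        print(split_key_value(sample))
```

Note that only the first tab counts: in `a\tb\tc` the key is `a` and the value keeps the second tab.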

The reducer works in the same way. (Reference: 董的博客, Dong's blog)
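One thing worth knowing about the reducer side: Hadoop sorts the mapper output by key before handing it to the reducer, so all lines for the same word arrive next to each other. A reducer can use that to add up counts one word at a time instead of keeping every word in memory. The sketch below shows the idea with `itertools.groupby`; the names `parse` and `reduce_sorted` are invented for this example, and it assumes the input really is sorted by key.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from "word<TAB>count" lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep:
            continue
        try:
            yield word, int(count)
        except ValueError:
            continue


def reduce_sorted(lines):
    """Sum the counts per word, assuming the lines are already sorted by word."""
    for word, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # In a real streaming job the lines would come from sys.stdin.
    sample = ["hadoop\t1\n", "hello\t1\n", "hello\t1\n"]
    for word, total in reduce_sorted(sample):
        print("%s\t%d" % (word, total))
```

If the input is not sorted, `groupby` will emit the same word more than once, so this variant only makes sense behind a sort step.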

The C++ code is as follows:

// mapper: emit "word<TAB>1" for every whitespace-separated word
#include <iostream>
#include <string>

using namespace std;

int main() {
    string key;
    string value = "1";
    while(cin>>key) {
        cout<<key<<"\t"<<value<<endl;
    }
    return 0;
}

// reducer: sum the counts for each word

#include <cstdlib>
#include <iostream>
#include <map>
#include <string>

using namespace std;

int main() {
    string key;
    string value;
    // Running total per word; std::map keeps the keys sorted.
    map<string, int> word2count;
    // Each input record is "word<TAB>count". Add the count itself rather
    // than 1, so input that was already partially summed is handled too.
    while (cin >> key >> value) {
        word2count[key] += atoi(value.c_str());
    }
    map<string, int>::iterator it;
    for (it = word2count.begin(); it != word2count.end(); ++it) {
        cout << it->first << "\t" << it->second << endl;
    }
    return 0;
}

The Python code:

#!/usr/bin/env python
# mapper: emit "word<TAB>1" for every word

import sys

for line in sys.stdin:
    # split() with no arguments splits on any whitespace and never
    # returns empty strings, so no extra filtering is needed
    for word in line.split():
        print('%s\t%s' % (word, 1))

#!/usr/bin/env python
# reducer: sum the counts for each word

import sys

word2count = {}

for line in sys.stdin:
    try:
        # Each line is "word<TAB>count"; skip lines that do not fit
        word, count = line.strip().split()
        word2count[word] = word2count.get(word, 0) + int(count)
    except ValueError:
        pass

# Print the totals sorted by word
for word, count in sorted(word2count.items()):
    print('%s\t%s' % (word, count))
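Before submitting a job it helps to check the logic locally. The sketch below imitates the flow in one process: map every line, sort the mapper output, then reduce, which is roughly what piping the mapper through a sort into the reducer would do. It is a simplified stand-in rather than a real Hadoop run, and the function names `map_lines`, `reduce_lines`, and `simulate` are made up for this example.

```python
def map_lines(lines):
    """Mapper step: yield "word<TAB>1" for every word in every line."""
    for line in lines:
        for word in line.split():
            yield "%s\t%s" % (word, 1)


def reduce_lines(lines):
    """Reducer step: sum the counts per word and yield them sorted by word."""
    word2count = {}
    for line in lines:
        try:
            word, count = line.strip().split()
            word2count[word] = word2count.get(word, 0) + int(count)
        except ValueError:
            continue  # skip malformed lines
    for word, count in sorted(word2count.items()):
        yield "%s\t%s" % (word, count)


def simulate(lines):
    """Run map, then sort, then reduce, all in memory."""
    mapped = sorted(map_lines(lines))  # stands in for the sort between map and reduce
    return list(reduce_lines(mapped))


if __name__ == "__main__":
    for out in simulate(["hello world", "hello hadoop"]):
        print(out)
```

The sample input contains "hello" twice and the other two words once, so the output should list three words with "hello" counted as 2.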

posted @ 2017-08-31 16:53 xing-xing
