Python & MapReduce

Writing a Hadoop MapReduce Program in Python

For the original article, see:

http://blog.csdn.net/zhaoyl03/article/details/8657031/

What follows only runs mapper.py and reducer.py once on Windows; nothing here was tested in an actual Hadoop environment.

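Environment setup:

Windows 7 (32-bit)
Install GnuWin32 so that Linux commands can be run from cmd
Install Python 2.7 together with IDLE (Python GUI) so that Python scripts can be run
Add the Python installation directory to the Windows environment variables so that, after changing to the script's directory in a cmd window, a Python script can be run directly by typing its name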

My Python interpreter is installed at: C:\Python27\python.exe

The test scripts are in: E:\PythonTest

Added to the Windows environment variable (PATH): C:\Python27
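To check that the directory really ended up on the search path, something like the following may help. It is only a minimal sketch: the helper name dir_on_path is hypothetical, and it assumes the variable being edited is PATH with ";" as the separator, as is usual on Windows.

```python
import ntpath  # Windows path rules; importable on any OS
import os


def dir_on_path(directory, path_value=None, sep=";"):
    """Return True if `directory` is an entry of a Windows-style PATH string.

    `path_value` defaults to the current PATH (an assumption about which
    variable was edited). Comparison ignores case and trailing backslashes.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    wanted = ntpath.normcase(ntpath.normpath(directory))
    entries = [entry.strip() for entry in path_value.split(sep) if entry.strip()]
    return any(ntpath.normcase(ntpath.normpath(entry)) == wanted for entry in entries)
```

Calling dir_on_path(r"C:\Python27") in a freshly opened cmd session should return True once the variable has been saved. A cmd window that was already open keeps its old PATH value.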

mapper.py:

#!/usr/bin/env python  
  
import sys  
  
# input comes from STDIN (standard input)  
for line in sys.stdin:  
    # remove leading and trailing whitespace  
    line = line.strip()  
    # split the line into words  
    words = line.split()  
    # increase counters  
    for word in words:  
        # write the results to STDOUT (standard output);  
        # what we output here will be the input for the  
        # Reduce step, i.e. the input for reducer.py  
        #  
        # tab-delimited; the trivial word count is 1  
        print '%s\t%s' % (word, 1)
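The map step can also be tried without stdin by wrapping the same logic in a function. The following is a rough sketch rather than part of the original script: the names map_words and format_pairs are hypothetical, and it is written so that it also runs under Python 3 (an assumption; mapper.py above targets Python 2.7).

```python
def map_words(lines):
    """Yield a (word, 1) pair for every whitespace-separated word in `lines`."""
    for line in lines:
        # same strip/split logic as mapper.py
        for word in line.strip().split():
            yield word, 1


def format_pairs(pairs):
    """Render (key, value) pairs as tab-delimited lines, like mapper.py's output."""
    return ["%s\t%s" % (key, value) for key, value in pairs]
```

For example, format_pairs(map_words(["foo bar foo"])) gives the three lines "foo\t1", "bar\t1" and "foo\t1", one per word, in input order. No counting happens in the map step.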

reducer.py:

#!/usr/bin/env python  
  
import sys  
  
current_word = None  
current_count = 0  
word = None  
  
# input comes from STDIN  
for line in sys.stdin:  
    # remove leading and trailing whitespace  
    line = line.strip()  
  
    # parse the input we got from mapper.py;
    # skip blank or malformed lines that have no tab separator
    if '\t' not in line:
        continue
    word, count = line.split('\t', 1)
  
    # convert count (currently a string) to int  
    try:  
        count = int(count)  
    except ValueError:  
        # count was not a number, so silently  
        # ignore/discard this line  
        continue  
  
    # this IF-switch only works because Hadoop sorts map output  
    # by key (here: word) before it is passed to the reducer  
    if current_word == word:  
        current_count += count  
    else:  
        if current_word:  
            # write result to STDOUT  
            print '%s\t%s' % (current_word, current_count)  
        current_count = count  
        current_word = word  
  
# do not forget to output the last word if needed!
# (checking current_word avoids printing "None 0" on empty input)
if current_word:
    print '%s\t%s' % (current_word, current_count)
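As the comment in reducer.py says, the comparison against current_word only works when the input arrives sorted by key. Without Hadoop, that sort has to be supplied some other way. The sketch below imitates the whole map, sort, reduce flow in one process. It is an illustration under stated assumptions, not the original author's method: sorted() stands in for Hadoop's sort, itertools.groupby replaces the manual current_word bookkeeping, the function names are hypothetical, and the code is written for Python 3.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit (word, 1) for each word, like mapper.py."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_phase(sorted_pairs):
    """Sum the counts of consecutive pairs that share a key.

    Like reducer.py, this requires the pairs to be sorted by key.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, sort and reduce locally on an iterable of text lines."""
    # sorted() plays the role of the sort Hadoop performs between the steps
    shuffled = sorted(map_phase(lines), key=itemgetter(0))
    return list(reduce_phase(shuffled))
```

With the made-up input word_count(["foo foo quux", "labs foo bar quux"]), the result is [('bar', 1), ('foo', 3), ('labs', 1), ('quux', 2)]. Skipping the sort would leave the same word scattered across several groups and produce split counts.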

Output of running the two scripts:

posted on 2015-05-07 15:54 快鸟