hadoop jar hadoop-streaming-2.6.4.jar \
-D mapreduce.job.name='test' \
-files /local/path/to/mapper.py,/local/path/to/reducer.py \
-input /test/data/* \
-output /test/output/ \
-mapper 'python /local/path/to/mapper.py' \
-reducer 'python /local/path/to/reducer.py'
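1. The Python files must be distributed to every node.
2. The -mapper and -reducer commands must start with python; otherwise the job fails with an error like: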
Caused by: java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory
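The error code hints at the cause: error=2 is the operating system's ENOENT ("No such file or directory"). A likely explanation is that the task tries to launch mapper.py directly as a program and cannot find it on PATH, whereas python is found. The sketch below reproduces the same kind of failure locally with Python's subprocess module. The script name and the helper try_run are hypothetical, chosen only for illustration; this is not what Hadoop runs internally.

```python
import errno
import subprocess


def try_run(argv):
    """Try to launch argv; return the errno of a launch failure, or None if it started."""
    try:
        subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError as exc:
        return exc.errno
    return None


# Hypothetical script name, assumed not to exist on PATH.
code = try_run(["mapper_not_on_path_example.py"])
print(code == errno.ENOENT)
```

Prefixing the command with python sidesteps the lookup of the script itself, since the interpreter is what gets launched and the script path is only an argument to it.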
mapper.py
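#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
import sys

# Emit "<word>\t1" for every word read from stdin.
for line in sys.stdin:
    line = line.strip()
    # Split on commas, periods, question marks, whitespace and double quotes.
    words = re.split(r'[,.?\s"]', line)
    for word in words:
        # str.strip() takes a set of characters, not a regex, so list them literally.
        word = word.strip(',.?" \t')
        if word:
            print("{0}\t{1}".format(word, 1))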
reducer.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

current_word = None
current_count = 0

# Input arrives sorted by key, so identical words are adjacent.
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        # A new word starts: flush the total of the previous one.
        if current_word:
            print("{0}\t{1}".format(current_word, current_count))
        current_word = word
        current_count = count

# Flush the total of the last word.
if current_word:
    print("{0}\t{1}".format(current_word, current_count))
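Before submitting the job, the word-count logic can be checked locally without a cluster. The following is a minimal sketch, not the scripts above: it reimplements the same map, sort, and reduce steps as plain functions, and it uses itertools.groupby in place of the reducer's manual current_word loop. The function names and the sample input are made up for illustration.

```python
import re
from itertools import groupby


def map_words(lines):
    """Yield (word, 1) pairs, splitting on the same characters as mapper.py."""
    for line in lines:
        for word in re.split(r'[,.?\s"]', line.strip()):
            if word:
                yield word, 1


def reduce_counts(pairs):
    """Sum the counts per word; pairs must already be sorted by word."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # sorted() stands in for Hadoop's shuffle-and-sort phase between map and reduce.
    return list(reduce_counts(sorted(map_words(lines))))


# Hypothetical sample input.
sample = ["hello world, hello hadoop.", 'is "streaming" simple?']
for word, total in word_count(sample):
    print("{0}\t{1}".format(word, total))
```

The sort step matters: like the reducer, groupby only merges adjacent equal keys, so unsorted input would produce several partial counts for the same word.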
Reference (official documentation): https://hadoop.apache.org/docs/r2.7.7/hadoop-streaming/HadoopStreaming.html