Writing a Simple MapReduce Program for Hadoop in Python

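Although the Hadoop framework is written in Java, programs for Hadoop do not have to be written in Java; they can also be implemented in languages such as C++ or Python. The sample program on the official Hadoop website is written in Jython and packaged as a Jar file, which is clearly inconvenient. It does not have to be done that way: we can write the program in plain Python and hook it up to Hadoop.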

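What do we want to do?

We will write a simple MapReduce program in CPython, rather than a Jython program packaged as a Jar file.
Our example mimics WordCount, implemented in Python: it reads text files and counts how often each word occurs. The result is also written as text, with each line containing a word and its number of occurrences, separated by a tab.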

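Prerequisites

Before writing this program you need a working Hadoop cluster; otherwise you will get stuck in the later steps. If you have not set one up yet, the following short tutorials explain how to do so on Ubuntu Linux (they also apply to other Linux distributions and Unix systems):

How to set up a single-node Hadoop cluster with the Hadoop Distributed File System (HDFS) on Ubuntu Linux

How to set up a multi-node Hadoop cluster with the Hadoop Distributed File System (HDFS) on Ubuntu Linux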

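Python MapReduce code

The trick to writing MapReduce code in Python is to use Hadoop Streaming, which passes data between the Map and Reduce steps through STDIN (standard input) and STDOUT (standard output). We simply read input from Python's sys.stdin and write output to sys.stdout; Hadoop Streaming takes care of everything else.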
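
To make the data flow concrete, here is a minimal in-process sketch of what happens between the two steps: lines go to the mapper, the mapper's tab-separated records are sorted by key, and the sorted records go to the reducer. The function names (map_lines, reduce_lines, simulate_streaming) are hypothetical and not part of Hadoop, and the mapper here splits on whitespace only for brevity. A real job distributes these stages across the cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield '%s\t%d' % (word, 1)


def reduce_lines(sorted_records):
    """Sum the counts of consecutive records that share the same key."""
    pairs = (record.split('\t', 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield '%s\t%d' % (word, sum(int(count) for _, count in group))


def simulate_streaming(lines):
    """Local stand-in for the map -> sort -> reduce flow of a streaming job."""
    return list(reduce_lines(sorted(map_lines(lines))))


if __name__ == '__main__':
    for record in simulate_streaming(['a b a', 'b c']):
        print(record)
```

The sort between the two functions plays the role of the shuffle phase, which is why the reducer in this sketch only has to compare neighbouring records.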

Map: mapper.py

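Save the following code as /root/zhangjian/test/mapper.py (any location on your system will do). It reads data from STDIN, splits each line into words, and writes one line per word that maps the word to an occurrence count of 1:
Note: make sure the script is executable (chmod +x /root/zhangjian/test/mapper.py).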

#!/usr/bin/env python

import sys
import re

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words on commas, spaces, colons, and periods
    words = re.split(r',| |:|\.', line)
    for word in words:
        # skip the empty strings that adjacent delimiters produce
        if not word:
            continue
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the Reduce step,
        # i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

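Note that this script does not compute the total number of occurrences of a word. It emits "<word> 1" immediately, even if <word> appears several times in the input; the summing is left to the subsequent Reduce step (or program).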
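
Because the mapper emits one record per token, it matters how a line is split. The delimiter pattern used in mapper.py (comma, space, colon, period) returns empty strings whenever two delimiters are adjacent or a delimiter ends the line. The sketch below shows that behaviour and one way to filter the empty tokens; the helper name tokenize is hypothetical.

```python
import re

# Same delimiter set as mapper.py: comma, space, colon, and period.
DELIMITERS = re.compile(r',| |:|\.')


def tokenize(line):
    """Split a line on the delimiters and drop the empty strings that
    appear when two delimiters are adjacent (for example ', ')."""
    return [word for word in DELIMITERS.split(line.strip()) if word]


if __name__ == '__main__':
    print(DELIMITERS.split('good, space.'))  # includes empty strings
    print(tokenize('good, space.'))          # empty strings removed
```

Without the filter, every empty token would reach the reducer as a word of its own and show up in the final counts.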

Reduce: reducer.py

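Save the following code as /root/zhangjian/test/reducer.py. This script reads the output of mapper.py from STDIN, sums the occurrences of each word, and writes the result to STDOUT.
Again, make sure the script is executable: chmod +x /root/zhangjian/test/reducer.py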

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    try:
        # parse the input we got from mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # the line had no tab or count was not a number, so silently
        # ignore/discard this line
        continue
    word2count[word] = word2count.get(word, 0) + count

# sort the words lexicographically
#
# this step is NOT required, we just do it so that our final output
# will look more like the official Hadoop word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
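
The reducer above keeps every distinct word in a dictionary. Hadoop Streaming sorts the mapper output by key before it reaches the reducer, so an alternative is to sum consecutive records and never hold more than one word in memory. The following is a sketch of that variant, not the code used for the results in this article; the names parse and reduce_sorted are hypothetical.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs, skipping malformed lines."""
    for line in lines:
        word, _, count = line.strip().partition('\t')
        try:
            value = int(count)
        except ValueError:
            # no tab on the line, or the count is not a number
            continue
        yield word, value


def reduce_sorted(lines):
    """Sum counts per word, assuming equal words arrive on consecutive lines."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == '__main__':
    sample = ['a\t1\n', 'a\t1\n', 'b\t1\n', 'not a record\n', 'c\t2\n']
    for word, total in reduce_sorted(sample):
        print('%s\t%d' % (word, total))
```

In a streaming job the same function would be fed sys.stdin instead of the sample list. When testing it locally, the mapper output has to be piped through sort first, because the grouping relies on sorted input.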

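Testing the code

Before using them in a MapReduce job, test mapper.py and reducer.py by hand; otherwise the job may fail without returning any useful result.
Test the Map and Reduce functionality: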

echo "zhangjian come from shandong,shandong is a good space" | /root/zhangjian/test/mapper.py

Output:

zhangjian 1
come 1
from 1
shandong 1
shandong 1
is 1
a 1
good 1
space 1

echo "zhangjian come from shandong,shandong is a good space" | /root/zhangjian/test/mapper.py | /root/zhangjian/test/reducer.py

Output:

a 1
come 1
from 1
good 1
is 1
shandong 2
space 1
zhangjian 1

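To test the MapReduce job we need three files: 1.txt, 2.txt, and 3.txt. For convenience, each file simply contains the sentence above repeated on many lines: 1.txt has 9926 lines, 2.txt has 119112 lines, and 3.txt has 238224 lines.

Store the three files in /root/zhangjian/test/file: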

[root@dev-slave1 file]# ls -l /root/zhangjian/test/file
total 19372
-rw-r--r-- 1 root root 536004 Sep 10 11:05 1.txt
-rw-r--r-- 1 root root 6432048 Sep 10 11:06 2.txt
-rw-r--r-- 1 root root 12864096 Sep 10 11:06 3.txt
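
One possible way to produce such files is sketched below. It assumes each line is exactly the test sentence followed by a newline (54 bytes), which is consistent with the file sizes in the listing: 9926 lines give 536004 bytes, and so on. The function name write_test_files is hypothetical, and the example writes to a temporary directory rather than /root/zhangjian/test/file.

```python
import os
import tempfile

SENTENCE = 'zhangjian come from shandong,shandong is a good space\n'

# File name -> number of lines, as given in the text.
LINE_COUNTS = {'1.txt': 9926, '2.txt': 119112, '3.txt': 238224}


def write_test_files(directory):
    """Write each test file as SENTENCE repeated the requested number of times."""
    paths = []
    for name, count in sorted(LINE_COUNTS.items()):
        path = os.path.join(directory, name)
        with open(path, 'w') as handle:
            handle.write(SENTENCE * count)
        paths.append(path)
    return paths


if __name__ == '__main__':
    # A temporary directory stands in for /root/zhangjian/test/file.
    target = tempfile.mkdtemp()
    for path in write_test_files(target):
        print('%s\t%d bytes' % (path, os.path.getsize(path)))
```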

Before running the MapReduce job, we need to copy the local files to HDFS.

First, create a directory on HDFS to hold the source files to be processed and the final output:

hadoop fs -mkdir -p /test_file

Then copy the local files to HDFS:

hadoop fs -copyFromLocal /root/zhangjian/test/file /test_file

The result looks like this:

[root@dev-slave1 file]# hadoop fs -ls /test_file
Found 1 items
drwxr-xr-x - root supergroup 0 2015-09-10 15:12 /test_file/file

[root@dev-slave1 file]# hadoop fs -ls /test_file/file
Found 3 items
-rw-r--r-- 1 root supergroup 536004 2015-09-10 11:35 /test_file/file/1.txt
-rw-r--r-- 1 root supergroup 6432048 2015-09-10 11:35 /test_file/file/2.txt
-rw-r--r-- 1 root supergroup 12864096 2015-09-10 11:35 /test_file/file/3.txt

Everything is now ready. We run the Python MapReduce job on the Hadoop cluster with the following command:

hadoop jar /root/develop/src/hadoop/hadoop-tools/hadoop-streaming/target/hadoop-streaming-2.7.1.jar -mapper 'python /root/zhangjian/test/mapper.py' -file /root/zhangjian/test/mapper.py -reducer 'python /root/zhangjian/test/reducer.py' -file /root/zhangjian/test/reducer.py -input /test_file/file/* -output /test_file/file/file_output

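The two -file options are required; without them the job fails with: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2

If the run above fails, delete the file_output directory from HDFS before running the job again.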
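
The cleanup and the re-run can be scripted. The sketch below only builds the argument lists; it does not call Hadoop. It makes two assumptions that are not taken from the run in this article: the output directory is removed with hadoop fs -rm -r, and the scripts are shipped with the generic -files option, which the job log recommends in place of the deprecated -file. With -files, the scripts are referenced by base name on the task nodes. The function names are hypothetical.

```python
import os


def build_cleanup_command(output_dir):
    """Arguments for removing a previous output directory from HDFS."""
    return ['hadoop', 'fs', '-rm', '-r', output_dir]


def build_streaming_command(jar, mapper, reducer, input_path, output_dir):
    """Arguments for a streaming job that ships both scripts via -files.

    -files is a generic option, so it is placed before the streaming
    options. Shipped files are referenced by base name on the task nodes.
    """
    return [
        'hadoop', 'jar', jar,
        '-files', '%s,%s' % (mapper, reducer),
        '-mapper', 'python %s' % os.path.basename(mapper),
        '-reducer', 'python %s' % os.path.basename(reducer),
        '-input', input_path,
        '-output', output_dir,
    ]


if __name__ == '__main__':
    output_dir = '/test_file/file/file_output'
    print(build_cleanup_command(output_dir))
    print(build_streaming_command(
        '/root/develop/src/hadoop/hadoop-tools/hadoop-streaming/target/hadoop-streaming-2.7.1.jar',
        '/root/zhangjian/test/mapper.py',
        '/root/zhangjian/test/reducer.py',
        '/test_file/file/*',
        output_dir))
```

Either list can be passed to subprocess.call on a machine where the hadoop command is available. Keeping the arguments as a list avoids shell quoting problems with values such as 'python mapper.py'.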

The job output is as follows:

15/09/10 15:12:18 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/root/zhangjian/test/mapper.py, /root/zhangjian/test/reducer.py, /tmp/hadoop-unjar4712976301095870488/] [] /tmp/streamjob2552627823217601490.jar tmpDir=null
15/09/10 15:12:19 INFO client.RMProxy: Connecting to ResourceManager at dev-master/172.16.10.51:8032
15/09/10 15:12:19 INFO client.RMProxy: Connecting to ResourceManager at dev-master/172.16.10.51:8032
15/09/10 15:12:20 INFO mapred.FileInputFormat: Total input paths to process : 3
15/09/10 15:12:20 INFO mapreduce.JobSubmitter: number of splits:4
15/09/10 15:12:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1440156752555_0040
15/09/10 15:12:20 INFO impl.YarnClientImpl: Submitted application application_1440156752555_0040
15/09/10 15:12:21 INFO mapreduce.Job: The url to track the job: http://dev-master:8088/proxy/application_1440156752555_0040/
15/09/10 15:12:21 INFO mapreduce.Job: Running job: job_1440156752555_0040
15/09/10 15:12:27 INFO mapreduce.Job: Job job_1440156752555_0040 running in uber mode : false
15/09/10 15:12:27 INFO mapreduce.Job: map 0% reduce 0%
15/09/10 15:12:38 INFO mapreduce.Job: map 25% reduce 0%
15/09/10 15:12:39 INFO mapreduce.Job: map 50% reduce 0%
15/09/10 15:12:41 INFO mapreduce.Job: map 75% reduce 0%
15/09/10 15:12:42 INFO mapreduce.Job: map 92% reduce 0%
15/09/10 15:12:43 INFO mapreduce.Job: map 100% reduce 0%

15/09/10 15:13:06 INFO mapreduce.Job: map 100% reduce 69%
15/09/10 15:13:09 INFO mapreduce.Job: map 100% reduce 79%
15/09/10 15:13:12 INFO mapreduce.Job: map 100% reduce 81%
15/09/10 15:13:15 INFO mapreduce.Job: map 100% reduce 100%
15/09/10 15:13:15 INFO mapreduce.Job: Job job_1440156752555_0040 completed successfully
15/09/10 15:13:15 INFO mapreduce.Job: Counters: 50
　　　　File System Counters
　　　　　　　　FILE: Number of bytes read=33053586
　　　　　　　　FILE: Number of bytes written=66702385
　　　　　　　　FILE: Number of read operations=0
　　　　　　　　FILE: Number of large read operations=0
　　　　　　　　FILE: Number of write operations=0
　　　　　　　　HDFS: Number of bytes read=19832870
　　　　　　　　HDFS: Number of bytes written=101
　　　　　　　　HDFS: Number of read operations=15
　　　　　　　　HDFS: Number of large read operations=0
　　　　　　　　HDFS: Number of write operations=2
　　　　Job Counters
　　　　　　　　Failed reduce tasks=1
　　　　　　　　Launched map tasks=4
　　　　　　　　Launched reduce tasks=2
　　　　　　　　Data-local map tasks=4

　　　　　　　　Total time spent by all maps in occupied slots (ms)=88718

　　　　　　　　Total time spent by all reduces in occupied slots (ms)=54844
　　　　　　　　Total time spent by all map tasks (ms)=44359
　　　　　　　　Total time spent by all reduce tasks (ms)=27422
　　　　　　　　Total vcore-seconds taken by all map tasks=44359
　　　　　　　　Total vcore-seconds taken by all reduce tasks=27422
　　　　　　　　Total megabyte-seconds taken by all map tasks=44359000
　　　　　　　　Total megabyte-seconds taken by all reduce tasks=27422000
　　　　Map-Reduce Framework
　　　　　　　　Map input records=367262
　　　　　　　　Map output records=3305358
　　　　　　　　Map output bytes=26442864
　　　　　　　　Map output materialized bytes=33053604
　　　　　　　　Input split bytes=380
　　　　　　　　Combine input records=0
　　　　　　　　Combine output records=0
　　　　　　　　Reduce input groups=8
　　　　　　　　Reduce shuffle bytes=33053604
　　　　　　　　Reduce input records=3305358
　　　　　　　　Reduce output records=8

　　　　　　　　Spilled Records=6610716

　　　　　　　　Shuffled Maps =4
　　　　　　　　Failed Shuffles=0
　　　　　　　　Merged Map outputs=4
　　　　　　　　GC time elapsed (ms)=305
　　　　　　　　CPU time spent (ms)=29160
　　　　　　　　Physical memory (bytes) snapshot=1400758272
　　　　　　　　Virtual memory (bytes) snapshot=7690555392
　　　　　　　　Total committed heap usage (bytes)=1355808768
　　　　Shuffle Errors
　　　　　　　　BAD_ID=0
　　　　　　　　CONNECTION=0
　　　　　　　　IO_ERROR=0
　　　　　　　　WRONG_LENGTH=0
　　　　　　　　WRONG_MAP=0
　　　　　　　　WRONG_REDUCE=0
　　　　File Input Format Counters
　　　　　　　　Bytes Read=19832490
　　　　File Output Format Counters
　　　　　　　　Bytes Written=101

15/09/10 15:13:15 INFO streaming.StreamJob: Output directory: /test_file/file/file_output

Check that the result was written to the output directory in HDFS:

[root@dev-slave1 file]# hadoop fs -ls /test_file/file
Found 4 items
-rw-r--r-- 1 root supergroup 536004 2015-09-10 11:35 /test_file/file/1.txt
-rw-r--r-- 1 root supergroup 6432048 2015-09-10 11:35 /test_file/file/2.txt
-rw-r--r-- 1 root supergroup 12864096 2015-09-10 11:35 /test_file/file/3.txt
drwxr-xr-x - root supergroup 0 2015-09-10 15:13 /test_file/file/file_output

[root@dev-slave1 file]# hadoop fs -ls /test_file/file/file_output
Found 2 items
-rw-r--r-- 1 root supergroup 0 2015-09-10 15:13 /test_file/file/file_output/_SUCCESS
-rw-r--r-- 1 root supergroup 101 2015-09-10 15:13 /test_file/file/file_output/part-00000

[root@dev-slave1 file]# hadoop fs -cat /test_file/file/file_output/part-00000
a 367262
come 367262
from 367262
good 367262
is 367262
shandong 734524
space 367262
zhangjian 367262
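
These numbers can be checked by hand: every input line is the same sentence, so each word's total is its count within one line multiplied by the total number of lines (9926 + 119112 + 238224 = 367262). The short sketch below performs that arithmetic; the function name expected_counts is hypothetical.

```python
import re
from collections import Counter

SENTENCE = 'zhangjian come from shandong,shandong is a good space'

# Number of lines per input file, as given in the text.
LINES_PER_FILE = {'1.txt': 9926, '2.txt': 119112, '3.txt': 238224}


def expected_counts():
    """Per-word totals: occurrences in one line times the number of lines."""
    total_lines = sum(LINES_PER_FILE.values())
    per_line = Counter(w for w in re.split(r',| |:|\.', SENTENCE) if w)
    return dict((word, n * total_lines) for word, n in per_line.items())


if __name__ == '__main__':
    for word, total in sorted(expected_counts().items()):
        print('%s\t%d' % (word, total))
```

The sum of all totals, 3305358, also equals the "Map output records" counter in the job log, since the mapper emits one record per word.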

Download the result from HDFS to the local file system:

hadoop fs -copyToLocal /test_file/file/file_output/part-00000 /root/zhangjian//test_file/file/file_output

Posted 2015-09-10 15:52 by 数据挖掘与算法爱好者
