Implementing MapReduce in Python (2): Running on Hadoop

Goal: deploy the Python scripts from "Implementing MapReduce in Python (1)" to Hadoop and run them as an actual MapReduce job.

1. Make the scripts executable
[tianyc@TeletekHbase ~]$ which python
/usr/bin/python
#Add #!/usr/bin/python (the interpreter path found above) as the first line of map.py and red.py, then set execute permissions.
[tianyc@TeletekHbase python]$ chmod 744 *.py
[tianyc@TeletekHbase python]$ ll
total 20
-rwxr--r--. 1 tianyc NEU 132 Feb 19 00:40 map.py
-rwxr--r--. 1 tianyc NEU 324 Feb 19 00:56 red.py
-rw-r--r--. 1 tianyc NEU 314 Feb 18 22:34 test.dat
[tianyc@TeletekHbase python]$ scp ~/study/mapred/python/*.py s1:~/study/mapred/python
map.py 100% 132 0.1KB/s 00:00
red.py 100% 324 0.3KB/s 00:00
[tianyc@TeletekHbase python]$ scp ~/study/mapred/python/*.py s2:~/study/mapred/python
map.py 100% 132 0.1KB/s 00:00
red.py 100% 324 0.3KB/s 00:00
#On s1 and s2, likewise set the permissions of these two scripts to 744.
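For reference, here is a rough sketch of what map.py and red.py from part (1) might look like. This is a hypothetical reconstruction rather than the original scripts: it assumes the input is whitespace-separated "year value" records and computes the maximum value per year, which is consistent with the output shown in step 4.

#!/usr/bin/python
# map.py (hypothetical sketch): emit "year<TAB>value" for every input line
import sys

for line in sys.stdin:
    fields = line.split()  # assumption: whitespace-separated "year value" records
    if len(fields) >= 2:
        print('%s\t%s' % (fields[0], fields[1]))

#!/usr/bin/python
# red.py (hypothetical sketch): Streaming sorts mapper output by key,
# so all lines for the same year arrive together; track one year at a time.
import sys

current_year, max_val = None, None
for line in sys.stdin:
    year, val = line.strip().split('\t', 1)
    val = int(val)
    if year != current_year:
        if current_year is not None:
            print('%s\t%d' % (current_year, max_val))
        current_year, max_val = year, val
    else:
        max_val = max(max_val, val)
if current_year is not None:
    print('%s\t%d' % (current_year, max_val))  # flush the last year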

2. Put the test file on HDFS
[tianyc@TeletekHbase python]$ ~/hadoop/bin/hadoop dfs -copyFromLocal test.dat test.dat
[tianyc@TeletekHbase python]$ ~/hadoop/bin/hadoop dfs -ls
Found 1 items
-rw-r--r-- 2 tianyc supergroup 314 2013-02-19 03:14 /user/tianyc/test.dat
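Tip: since Streaming scripts just read stdin and write stdout, you can sanity-check them locally with an ordinary shell pipeline before submitting the job (sort stands in for Hadoop's shuffle/sort phase):

[tianyc@TeletekHbase python]$ cat test.dat | ./map.py | sort | ./red.py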

3. Run the MapReduce job
[tianyc@TeletekHbase hadoop]$ ~/hadoop/bin/hadoop jar ~/hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar \
    -mapper ~/study/mapred/python/map.py \
    -reducer ~/study/mapred/python/red.py \
    -input test.dat -output test-output
packageJobJar: [/tmp/hadoop-tianyc/hadoop-unjar5892434231150397369/] [] /tmp/streamjob7683487212899164160.jar tmpDir=null
13/02/20 16:45:23 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/02/20 16:45:23 WARN snappy.LoadSnappy: Snappy native library not loaded
13/02/20 16:45:23 INFO mapred.FileInputFormat: Total input paths to process : 1
13/02/20 16:45:23 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-tianyc/mapred/local]
13/02/20 16:45:23 INFO streaming.StreamJob: Running job: job_201302201459_0010
13/02/20 16:45:23 INFO streaming.StreamJob: To kill this job, run:
13/02/20 16:45:23 INFO streaming.StreamJob: /home/tianyc/hadoop-1.0.4/libexec/../bin/hadoop job -Dmapred.job.tracker=http://TeletekHbase:9001 -kill job_201302201459_0010
13/02/20 16:45:23 INFO streaming.StreamJob: Tracking URL: http://TeletekHbase:50030/jobdetails.jsp?jobid=job_201302201459_0010
13/02/20 16:45:24 INFO streaming.StreamJob: map 0% reduce 0%
13/02/20 16:45:37 INFO streaming.StreamJob: map 50% reduce 0%
13/02/20 16:45:38 INFO streaming.StreamJob: map 100% reduce 0%
13/02/20 16:45:49 INFO streaming.StreamJob: map 100% reduce 100%
13/02/20 16:45:55 INFO streaming.StreamJob: Job complete: job_201302201459_0010
13/02/20 16:45:55 INFO streaming.StreamJob: Output: test-output
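Note: the command above passes absolute local paths to -mapper and -reducer, which only works because step 1 copied the scripts to the same path on every node. Alternatively, Hadoop Streaming can ship the scripts to the cluster for you via the -file option, so no manual scp is needed:

[tianyc@TeletekHbase hadoop]$ ~/hadoop/bin/hadoop jar ~/hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar \
    -mapper map.py -reducer red.py \
    -file ~/study/mapred/python/map.py -file ~/study/mapred/python/red.py \
    -input test.dat -output test-output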
4. Check the results
[tianyc@TeletekHbase hadoop]$ hadoop dfs -ls
Found 7 items
... (unrelated entries omitted)
drwxr-xr-x - tianyc supergroup 0 2013-02-20 16:45 /user/tianyc/test-output
-rw-r--r-- 2 tianyc supergroup 314 2013-02-19 03:14 /user/tianyc/test.dat
[tianyc@TeletekHbase hadoop]$ hadoop dfs -ls test-output
Found 3 items
-rw-r--r-- 2 tianyc supergroup 0 2013-02-20 16:45 /user/tianyc/test-output/_SUCCESS
drwxr-xr-x - tianyc supergroup 0 2013-02-20 16:45 /user/tianyc/test-output/_logs
-rw-r--r-- 2 tianyc supergroup 17 2013-02-20 16:45 /user/tianyc/test-output/part-00000
[tianyc@TeletekHbase hadoop]$ hadoop dfs -cat test-output/part-00000
1949 111
1950 22
[tianyc@TeletekHbase hadoop]$
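To pull the result back to the local file system, use hadoop dfs -get, or -getmerge to concatenate all part-* files into a single local file:

[tianyc@TeletekHbase hadoop]$ hadoop dfs -getmerge test-output ./result.txt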


Other notes:

1. The job may take a while to run; you can check its progress from the command line:
[tianyc@TeletekHbase ~]$ hadoop job -list
1 jobs currently running
JobId State StartTime UserName Priority SchedulingInfo
job_201302180014_0006 1 1361216791160 tianyc NORMAL NA
[tianyc@TeletekHbase ~]$ hadoop job -status job_201302180014_0006

Job: job_201302180014_0006
file: hdfs://master:9000/tmp/hadoop-tianyc/mapred/staging/tianyc/.staging/job_201302180014_0006/job.xml
tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201302180014_0006
map() completion: 1.0
reduce() completion: 0.16666667

Counters: 25
    Job Counters
        Launched reduce tasks=3
        SLOTS_MILLIS_MAPS=13255
        Launched map tasks=2
        Data-local map tasks=2
        SLOTS_MILLIS_REDUCES=8739
    File Input Format Counters
        Bytes Read=472
    FileSystemCounters
        HDFS_BYTES_READ=656
        FILE_BYTES_WRITTEN=68722
    Map-Reduce Framework
        Map output materialized bytes=77
        Map input records=5
        Reduce shuffle bytes=32
        Spilled Records=5
        Map output bytes=55
        Total committed heap usage (bytes)=337780736
        CPU time spent (ms)=2790
        Map input bytes=314
        SPLIT_RAW_BYTES=184
        Combine input records=0
        Reduce input records=0
        Reduce input groups=0
        Combine output records=0
        Physical memory (bytes) snapshot=334417920
        Reduce output records=0
        Virtual memory (bytes) snapshot=1138679808
        Map output records=5

2. If the MR job fails, you can open the JobTracker's web console in a browser to see the detailed failure stack trace. It is usually on port 50030 of the JobTracker (http://jobmaster:50030).
My error message was: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. A fix can be found in this post. The cause: I had originally deployed Hadoop following another post, naming the hosts master, s1 and s2, which no longer matched the HOSTNAME in /etc/sysconfig/network.
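In short, every node's /etc/hosts must map the hostnames Hadoop uses (master, s1, s2) consistently, and each machine's HOSTNAME in /etc/sysconfig/network must be the same name the other nodes use for it. A hypothetical example (the addresses are purely illustrative):

192.168.1.10  master
192.168.1.11  s1
192.168.1.12  s2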

3. Even after making /etc/hosts consistent with /etc/sysconfig/network, problems still occurred. A detailed description is here.
