Hadoop: MapReduce job hangs during execution

While running a MapReduce job, the map phase completed, but reduce stayed stuck at 17%. Here is what it looked like:

[tianyc@TkHbase hadoop]$ hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -mapper /home/tianyc/study/mapred/python/mapper.py -reducer /home/tianyc/study/mapred/python/reduce.py -input 111/* -output 111-output2
packageJobJar: [/tmp/hadoop-tianyc/hadoop-unjar5068413447400834397/] [] /tmp/streamjob7965021791749826156.jar tmpDir=null
13/02/20 15:45:07 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/02/20 15:45:07 WARN snappy.LoadSnappy: Snappy native library not loaded
13/02/20 15:45:07 INFO mapred.FileInputFormat: Total input paths to process : 16
13/02/20 15:45:07 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-tianyc/mapred/local]
13/02/20 15:45:07 INFO streaming.StreamJob: Running job: job_201302201459_0006
13/02/20 15:45:07 INFO streaming.StreamJob: To kill this job, run:
13/02/20 15:45:07 INFO streaming.StreamJob: /home/tianyc/hadoop-1.0.4/libexec/../bin/hadoop job -Dmapred.job.tracker=http://TeletekHbase:9001 -kill job_201302201459_0006
13/02/20 15:45:07 INFO streaming.StreamJob: Tracking URL: http://TkHbase:50030/jobdetails.jsp?jobid=job_201302201459_0006
13/02/20 15:45:08 INFO streaming.StreamJob: map 0% reduce 0%
13/02/20 15:45:21 INFO streaming.StreamJob: map 13% reduce 0%
13/02/20 15:45:22 INFO streaming.StreamJob: map 25% reduce 0%
13/02/20 15:45:28 INFO streaming.StreamJob: map 38% reduce 0%
13/02/20 15:45:30 INFO streaming.StreamJob: map 50% reduce 0%
13/02/20 15:45:34 INFO streaming.StreamJob: map 63% reduce 0%
13/02/20 15:45:36 INFO streaming.StreamJob: map 75% reduce 4%
13/02/20 15:45:39 INFO streaming.StreamJob: map 75% reduce 8%
13/02/20 15:45:40 INFO streaming.StreamJob: map 88% reduce 8%
13/02/20 15:45:42 INFO streaming.StreamJob: map 100% reduce 8%
13/02/20 15:45:48 INFO streaming.StreamJob: map 100% reduce 17%

I went back and forth between Baidu and Google and tried all sorts of fixes:

1. One slave host had an underscore in its hostname; I fixed that.

2. Made the hostname in /etc/hosts match HOSTNAME in /etc/sysconfig/network.

3. Deleted the 127.0.0.1 entry from /etc/hosts.

4. Turned off the firewall.

...
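Fixes 1 to 3 all aim at the same thing: each node's hostname should be a legal name that resolves to the node's real address, not to loopback. A minimal Python sketch of that check, assuming it is run on each node; check_hostname is a hypothetical helper, not part of Hadoop, and the resolver is passed in as a parameter so it can be swapped out for testing:

```python
import ipaddress
import socket


def check_hostname(name=None, resolve=socket.gethostbyname):
    """Return a list of problems found with a node's hostname (empty if none).

    Hypothetical helper mirroring fixes 1-3 above; it is not a Hadoop tool.
    """
    name = name or socket.gethostname()  # default: this machine's own hostname
    problems = []

    # Fix 1: underscores are not legal in hostnames (RFC 952/1123).
    if "_" in name:
        problems.append(f"hostname {name!r} contains an underscore")

    # Fixes 2 and 3: the name must resolve, and not to a 127.x loopback address.
    try:
        addr = resolve(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        problems.append(f"hostname {name!r} does not resolve")
        return problems
    if ipaddress.ip_address(addr).is_loopback:
        problems.append(f"hostname {name!r} resolves to loopback address {addr}")
    return problems
```

Called with no arguments, it checks the machine's own hostname through the system resolver (and therefore through /etc/hosts); an empty list means nothing was flagged.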

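A post I found explained it like this: "The problem happens in the reduce phase, and the message means the reducer cannot fetch the output of the map phase. Since the line Failed fetch notification #1 for task attempt_201110022127_0003_m_000000_0 names the map task whose output could not be fetched, the namenode is working normally: it tells the reduce node to run the reduce, yet the reduce node cannot fetch the data, which can only mean it cannot communicate with those nodes. Because I configured Hadoop with hostnames rather than IPs, I figured the fix was to add every datanode's hostname mapping to the /etc/hosts of every other node. I tried it, and sure enough it worked, so I'm recording it here."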
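The fix quoted above comes down to this: on every node, every other node's hostname must be mapped in /etc/hosts. A minimal Python sketch that parses /etc/hosts-style text and lists the cluster hostnames that are missing or mapped to loopback; the function names are hypothetical, the hostnames come from the logs in this post, and the 192.168.1.x addresses are made up for illustration:

```python
import ipaddress


def parse_hosts(text):
    """Map each hostname or alias in /etc/hosts-format text to its IP address."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)  # keep the first entry seen for a name
    return mapping


def unmapped_or_loopback(text, cluster_hosts):
    """Return cluster hostnames that are missing or mapped to a loopback address."""
    mapping = parse_hosts(text)
    bad = []
    for host in cluster_hosts:
        ip = mapping.get(host)
        if ip is None or ipaddress.ip_address(ip).is_loopback:
            bad.append(host)
    return bad


# Hypothetical hosts file: the addresses are illustrative, not from the post.
SAMPLE_HOSTS = """
127.0.0.1     localhost TkTest
192.168.1.10  TkHbase
192.168.1.11  TkHbase2
"""

print(unmapped_or_loopback(SAMPLE_HOSTS, ["TkHbase", "TkHbase2", "TkTest"]))  # → ['TkTest']
```

On a real node, pass in the contents of /etc/hosts together with the full list of cluster hostnames, and repeat on every node.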

That was a useful hint, but I had already set up /etc/hosts.

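So I turned to the job log on the master node and the task logs on the slave nodes (embarrassingly, I hadn't looked at the logs at all before trying those fixes):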

The job log showed:

2013-02-20 15:52:57,956 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for map task: attempt_201302201459_0006_m_000002_0 running on tracker: tracker_TkTest:127.0.0.1/127.0.0.1:50861 and reduce task: attempt_201302201459_0006_r_000000_0 running on tracker: tracker_TkHbase2:127.0.0.1/127.0.0.1:55837
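The useful part of this line is the pair of hosts: the reducer on TkHbase2 could not fetch the output of a map that ran on TkTest. The 127.0.0.1 in the tracker names is, as far as I can tell, expected: in Hadoop 1.x the tracker name embeds the TaskTracker's local report address (mapred.task.tracker.report.address, which defaults to 127.0.0.1:0). A minimal Python sketch, with a hypothetical function name, that extracts the (reduce host, map host) pairs from such lines so the connection between them can be checked:

```python
import re

# Matches the JobInProgress "Failed fetch notification" line quoted above.
FETCH_RE = re.compile(
    r"Failed fetch notification #(?P<count>\d+) for map task: (?P<map_task>\S+) "
    r"running on tracker: tracker_(?P<map_host>[^:]+):\S+ "
    r"and reduce task: (?P<reduce_task>\S+) "
    r"running on tracker: tracker_(?P<reduce_host>[^:]+):\S+"
)


def failed_fetch_pairs(lines):
    """Return unique (reduce_host, map_host) pairs where the fetch failed."""
    pairs = []
    for line in lines:
        m = FETCH_RE.search(line)
        if m:
            pair = (m["reduce_host"], m["map_host"])
            if pair not in pairs:
                pairs.append(pair)
    return pairs


LOG_LINE = (
    "2013-02-20 15:52:57,956 INFO org.apache.hadoop.mapred.JobInProgress: "
    "Failed fetch notification #1 for map task: attempt_201302201459_0006_m_000002_0 "
    "running on tracker: tracker_TkTest:127.0.0.1/127.0.0.1:50861 and reduce task: "
    "attempt_201302201459_0006_r_000000_0 running on tracker: "
    "tracker_TkHbase2:127.0.0.1/127.0.0.1:55837"
)

print(failed_fetch_pairs([LOG_LINE]))  # → [('TkHbase2', 'TkTest')]
```

Each pair tells you which connection to test: here, from TkHbase2 to TkTest.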

And the task log on one slave node kept printing:

2013-02-20 16:21:37,304 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302201459_0006_r_000000_0 0.16666667% reduce > copy (8 of 16 at 0.00 MB/s) >
2013-02-20 16:21:40,346 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302201459_0006_r_000000_0 0.16666667% reduce > copy (8 of 16 at 0.00 MB/s) >
2013-02-20 16:21:46,378 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302201459_0006_r_000000_0 0.16666667% reduce > copy (8 of 16 at 0.00 MB/s) >
……
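These lines also explain the 17% on the client. The reducer has copied 8 of the 16 map outputs, and in Hadoop 1.x reduce progress is split evenly into copy, sort and reduce phases, so 8/16 × 1/3 ≈ 0.1667, which matches the 0.16666667 in the log. A minimal Python sketch, with a hypothetical function name and threshold, that flags reduce attempts whose copy count has stopped advancing in a TaskTracker log like this one:

```python
import re
from datetime import datetime

# Matches TaskTracker progress lines such as the ones above:
# "... TaskTracker: attempt_..._r_000000_0 0.16666667% reduce > copy (8 of 16 at 0.00 MB/s) >"
COPY_RE = re.compile(
    r"^(?P<ts>\S+ \S+) INFO org\.apache\.hadoop\.mapred\.TaskTracker: "
    r"(?P<attempt>\S+) \S+ reduce > copy \((?P<done>\d+) of (?P<total>\d+) at [\d.]+ MB/s\)"
)


def stalled_copies(lines, min_stall_seconds=60):
    """Return {attempt: (done, total, seconds)} for reducers stuck in the copy phase.

    An attempt counts as stalled when its latest "done" count has not changed
    for at least min_stall_seconds and not all map outputs have been copied.
    """
    first_seen = {}  # (attempt, done) -> timestamp of the first line with that count
    latest = {}  # attempt -> (done, total, timestamp of its most recent line)
    for line in lines:
        m = COPY_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
        done, total = int(m["done"]), int(m["total"])
        first_seen.setdefault((m["attempt"], done), ts)
        latest[m["attempt"]] = (done, total, ts)

    stalled = {}
    for attempt, (done, total, ts) in latest.items():
        seconds = (ts - first_seen[(attempt, done)]).total_seconds()
        if done < total and seconds >= min_stall_seconds:
            stalled[attempt] = (done, total, seconds)
    return stalled


PREFIX = "INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302201459_0006_r_000000_0"
TASK_LOG = [
    f"2013-02-20 16:21:37,304 {PREFIX} 0.16666667% reduce > copy (8 of 16 at 0.00 MB/s) >",
    f"2013-02-20 16:21:40,346 {PREFIX} 0.16666667% reduce > copy (8 of 16 at 0.00 MB/s) >",
    f"2013-02-20 16:21:46,378 {PREFIX} 0.16666667% reduce > copy (8 of 16 at 0.00 MB/s) >",
]

print(stalled_copies(TASK_LOG, min_stall_seconds=5))
```

The three sample lines span only about 9 seconds, so the demo lowers the threshold; on a real log that has been repeating for half an hour, the default of 60 seconds is enough.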

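Eventually, at the very bottom of this post, I found the hint that port 50060 has to be opened in the firewall. I tried that, and only then did the job go through. Another post I had read mentioned it too; I just hadn't noticed at the time. Port 50060 is the TaskTracker's HTTP port (mapred.task.tracker.http.address), and reducers fetch map output from the TaskTrackers over HTTP, so with that port blocked the copy phase can never finish.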
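Because the shuffle is plain HTTP against each TaskTracker, a blocked port shows up as copy progress that never advances. A minimal Python sketch for checking reachability, assuming the default port 50060 and that it is run from each node that hosts reducers; can_connect and check_shuffle_ports are hypothetical helpers:

```python
import socket


def can_connect(host, port=50060, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out (socket.timeout is an OSError), or unresolvable
        return False


def check_shuffle_ports(hosts, port=50060):
    """Map each TaskTracker host to whether its shuffle HTTP port is reachable from here."""
    return {host: can_connect(host, port) for host in hosts}
```

For this cluster that would be check_shuffle_ports(["TkHbase", "TkHbase2", "TkTest"]) run on TkHbase2; a False for TkTest would match the failed fetch in the job log. A bare TCP connect is used rather than a full HTTP request because the only question here is whether the firewall lets the connection through.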

In short, this problem mostly comes down to /etc/hosts entries that don't match the hostnames, or to the firewall.


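Which raises a question: I had already tried turning the firewall off, so why did it still fail? My basics weren't solid. To stop the firewall right away you use service iptables stop; I had used chkconfig iptables off, which only keeps iptables from starting at the next boot and leaves the currently running firewall in place. See here for reference.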

 

posted @ 2013-02-20 16:39  醇酒醉影