MR并行算法编程过程中遇到问题的思考
1. Reducer 类中 reduce函数外定义的变量是在Reducer机器上属于全局变量的,因此,一台机器上reduce函数均可以对该变量的值做出贡献。如代码:(sum和count数据Reducer机器上的全局变量)‘
public static class AvgCalReducer extends Reducer<EntityEntityWritable,FloatWritable,EntityEntityWritable,FloatWritable> { FloatWritable avg; float sum=0; int count=0; public void reduce(EntityEntityWritable key,Iterable<FloatWritable>values,Context context) throws IOException, InterruptedException { System.out.println("reducer starting:"); for (FloatWritable value:values) { sum=sum+value.get(); count++; System.out.println(" key = "+key+" value = "+value.get()); } System.out.println("average:"+sum/count); System.out.println("this reducer ending..."); avg=new FloatWritable(sum/count); context.write(key, avg); } }
如果想使sum和count的值仅通过reduce函数进行改变,即只计算同一个key对应value的sum和count,则需要将sum和count放入reduce函数内,如下:
public static class AvgCalReducer extends Reducer<EntityEntityWritable,FloatWritable,EntityEntityWritable,FloatWritable> { FloatWritable avg; public void reduce(EntityEntityWritable key,Iterable<FloatWritable>values,Context context) throws IOException, InterruptedException { float sum=0; int count=0; System.out.println("reducer starting:"); for (FloatWritable value:values) { sum=sum+value.get(); count++; System.out.println(" key = "+key+" value = "+value.get()); } System.out.println("average:"+sum/count); System.out.println("this reducer ending..."); avg=new FloatWritable(sum/count); context.write(key, avg); } }
2. 对于顺序组合式MapReduce作业:用两个job举例:
Configuration conf1=new Configuration(); Job job1=new Job(conf1,"Job1"); job1.waitForCompletion(true); Configuration conf2=new Configuration(); Job job2=new Job(conf2,"Job2"); job2.waitForCompletion(true);
注意我们之前经常写的System.exit(job.waitForCompletion(true)?0:1)在这里不可以使用,比如第一个job处的(job1.waitForCompletion(true)改成System.exit(job.waitForCompletion(true)?0:1),则系统成功完成job1后正常退出系统,没有机会再去运行job2了。