总访问量: PV

DataScience && DataMining && BigData

MapReduce排序

  • 在map和reduce阶段进行排序时,比较的是k2。v2是不参与排序比较的。如果要想让v2也进行排序,需要把k2和v2组装成新的类,作为k2,才能参与比较。

例子:

    

二次排序:在第一列有序得到前提下第二列進行排序。

思路:先找<k3,v3>在找<k2,v2>之後的mapreduce就容易寫了

方法1:让输出的第一列作为k3,第二列作为v3    关键:输出的v3需要参与排序,此种方式无法实现二次排序

方法2:让1,2列只作为k3,而v3为空。(

方法3:有可能让k3为空,v3为第二列吗? 答案是不能的,假设k3为空,一般情况下k2也为空,则v2中存放的数据进入后每一组都会放入一个value中,目前没有遇到)

      因此,只能选择方法二进行二次排序。

根据前面知识,关键思路:排序和分组是按照k2进行排序和分组的情形需铭记。

第一部分:分部代码

自定义排序:    

 1 private static class TwoInt implements WritableComparable<TwoInt>{
 2 
 3     public int t1;
 4     public int t2;
 5     public void write(DataOutput out) throws IOException {
 6         out.writeInt(t1);        
 7         out.writeInt(t2);        
 8     }
 9 
10     public void set(int t1, int t2) {
11         this.t1=t1;
12         this.t2=t2;
13     }
14 
15     public void readFields(DataInput in) throws IOException {
16         this.t1=in.readInt();
17         this.t2=in.readInt();
18     }
19 
20     
21     public int compareTo(TwoInt o) {
22         if (this.t1 ==o.t1) { //當第一列相等的時候,第二列升序排列
23             return this.t2 -o.t2;
24         }
25         return this.t1-o.t1;//當第一列不相等的時候,按第一列升序排列
26     }
27     
28 }

自定义Mapper类

 1 private static class MyMapper extends Mapper<LongWritable, Text, TwoInt, NullWritable>{
 2     TwoInt K2 = new TwoInt();
 3     @Override
 4     protected void map(LongWritable key, Text value,
 5             Mapper<LongWritable, Text, TwoInt, NullWritable>.Context context)
 6             throws IOException, InterruptedException {
 7            String[] splited = value.toString().split("\t");
 8            K2.set(Integer.parseInt(splited[0]),Integer.parseInt(splited[1]));
 9            context.write(K2, NullWritable.get());
10     }
11 }

自定义Reduce类

1 //按照k2進行排序,分組,此數據分爲6組,在調用Reduce
2 private static class MyReducer extends Reducer<TwoInt, NullWritable, TwoInt, NullWritable>{
3     @Override
4     protected void reduce(TwoInt k2, Iterable<NullWritable> v2s,
5             Reducer<TwoInt, NullWritable, TwoInt, NullWritable>.Context context)
6             throws IOException, InterruptedException {
7         context.write(k2, NullWritable.get());
8     }
9 }

捆绑Map和Reduce在一起

 1 public static void main(String[] args) throws Exception {
 2     Job job = Job.getInstance(new Configuration(), SecondarySortTest.class.getSimpleName());
 3     job.setJarByClass(SecondarySortTest.class);
 4     //1.自定义输入路径
 5     FileInputFormat.setInputPaths(job, new Path(args[0]));
 6     //2.自定义mapper
 7     //job.setInputFormatClass(TextInputFormat.class);
 8     job.setMapperClass(MyMapper.class);
 9     //job.setMapOutputKeyClass(Text.class);
10     //job.setMapOutputValueClass(TrafficWritable.class);
11     
12     //3.自定义reduce
13     job.setReducerClass(MyReducer.class);
14     job.setOutputKeyClass(TwoInt.class);
15     job.setOutputValueClass(NullWritable.class);
16     //4.自定义输出路径
17     FileOutputFormat.setOutputPath(job, new Path(args[1]));
18     //job.setOutputFormatClass(TextOutputFormat.class);//对输出的数据格式化并写入磁盘
19     
20     job.waitForCompletion(true);
21 }

由此,可以完成二次排序的完整代码如下:

 1 package Mapreduce;
 2 
 3 import java.io.DataInput;
 4 import java.io.DataOutput;
 5 import java.io.IOException;
 6 
 7 import org.apache.hadoop.conf.Configuration;
 8 import org.apache.hadoop.fs.Path;
 9 import org.apache.hadoop.io.LongWritable;
10 import org.apache.hadoop.io.NullWritable;
11 import org.apache.hadoop.io.Text;
12 import org.apache.hadoop.io.WritableComparable;
13 import org.apache.hadoop.mapreduce.Job;
14 import org.apache.hadoop.mapreduce.Mapper;
15 import org.apache.hadoop.mapreduce.Reducer;
16 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
17 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
18 
19 public class SecondarySortTest {
20 public static void main(String[] args) throws Exception {
21     Job job = Job.getInstance(new Configuration(), SecondarySortTest.class.getSimpleName());
22     job.setJarByClass(SecondarySortTest.class);
23     //1.自定义输入路径
24     FileInputFormat.setInputPaths(job, new Path(args[0]));
25     //2.自定义mapper
26     //job.setInputFormatClass(TextInputFormat.class);
27     job.setMapperClass(MyMapper.class);
28     //job.setMapOutputKeyClass(Text.class);
29     //job.setMapOutputValueClass(TrafficWritable.class);
30     
31     //3.自定义reduce
32     job.setReducerClass(MyReducer.class);
33     job.setOutputKeyClass(TwoInt.class);
34     job.setOutputValueClass(NullWritable.class);
35     //4.自定义输出路径
36     FileOutputFormat.setOutputPath(job, new Path(args[1]));
37     //job.setOutputFormatClass(TextOutputFormat.class);//对输出的数据格式化并写入磁盘
38     
39     job.waitForCompletion(true);
40 }
41 private static class MyMapper extends Mapper<LongWritable, Text, TwoInt, NullWritable>{
42     TwoInt K2 = new TwoInt();
43     @Override
44     protected void map(LongWritable key, Text value,
45             Mapper<LongWritable, Text, TwoInt, NullWritable>.Context context)
46             throws IOException, InterruptedException {
47            String[] splited = value.toString().split("\t");
48            K2.set(Integer.parseInt(splited[0]),Integer.parseInt(splited[1]));
49            context.write(K2, NullWritable.get());
50     }
51 }
52 //按照k2進行排序,分組,此數據分爲6組,在調用Reduce
53 private static class MyReducer extends Reducer<TwoInt, NullWritable, TwoInt, NullWritable>{
54     @Override
55     protected void reduce(TwoInt k2, Iterable<NullWritable> v2s,
56             Reducer<TwoInt, NullWritable, TwoInt, NullWritable>.Context context)
57             throws IOException, InterruptedException {
58         context.write(k2, NullWritable.get());
59     }
60 }
61 
62 private static class TwoInt implements WritableComparable<TwoInt>{
63     public int t1;
64     public int t2;
65     public void write(DataOutput out) throws IOException {
66         out.writeInt(t1);        
67         out.writeInt(t2);        
68     }
69     public void set(int t1, int t2) {
70         this.t1=t1;
71         this.t2=t2;
72     }
73     public void readFields(DataInput in) throws IOException {
74         this.t1=in.readInt();
75         this.t2=in.readInt();
76     }
77     public int compareTo(TwoInt o) {
78         if (this.t1 ==o.t1) { //當第一列相等的時候,第二列升序排列
79             return this.t2 -o.t2;
80         }
81         return this.t1-o.t1;//當第一列不相等的時候,按第一列升序排列
82     }
83     @Override
84     public String toString() {
85         return t1+"\t"+t2;
86     }
87 }
88 }
二次排序

第二部分:测试代码

(1)准备环境,准备测试数据

[root@neusoft-master filecontent]# vi twoint
3         3
3         2
3         1
2         2
2         1
1         1

 (2)创建文件夹,并将文件上传到HDFS中

[root@neusoft-master filecontent]# hadoop dfs -mkdir /neusoft/

[root@neusoft-master filecontent]# hadoop dfs -put twoint /neusoft/

 

(3)执行jar包,查看中间过程

 [root@neusoft-master filecontent]# hadoop jar SecondarySortTest.jar /neusoft/twoint  /out8

 

(4)查看结果

[root@neusoft-master filecontent]# hadoop dfs -ls /out8

[root@neusoft-master filecontent]# hadoop dfs -text /out8/part-r-00000

 

结果正确。

  • 如果输出有错误的话,或者输出不是数字(有时候是对象),需要查看是否重写了tostring()方法

 注意:如果需求变更为第一列的升序和第二列的降序,只需更改第3行

1     public int compareTo(TwoInt o) {
2         if (this.t1 ==o.t1) { //當第一列相等的時候,第二列降序排列
3             return o.t2-this.t2;
4 } 5 return this.t1-o.t1;//當第一列不相等的時候,按第一列升序排列 6 }

 

 总结:value不能参与排序,如果想参加排序需要放在key中,作为一个新的key进行排序。

 

posted @ 2017-02-21 15:21  CJZhaoSimons  阅读(872)  评论(0编辑  收藏  举报