Simple Sorting with Hadoop

Introduction
Experiment Topic
- Objective:
- Requirements:
Experiment Plan
Conclusion

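Introduction

This post implements the simple sorting exercise from a Hadoop course.

Experiment Topic

Implementing a simple sort

Objective:

Learn how to sort data with MapReduce.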

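Requirements:

Each of the following three txt files contains six values.

s1.txt:

35 12345 21 5 -8 365

s2.txt:

38 156 12 6 -2 -10

s3.txt:

45 2365 68 -15 -18 -30

Write a simple sorting program. With the three files above as input, the sorted output lists each value together with its rank:

Rank    Value (ascending)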
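For reference, the expected ranking can be reproduced without a cluster. The following is a minimal plain-Java sketch and not part of the original experiment: the class name LocalSortReference and the method ranked are hypothetical, and the three input strings are copied from the files above. Each output line is written as rank, tab, value, on the assumption that the job writes its results with Hadoop's default text output, which separates key and value with a tab.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the Hadoop job: sorts the sample data
// in memory so that the job's output can be checked against it.
class LocalSortReference {

    // Contents of s1.txt, s2.txt and s3.txt as given above.
    static final String[] FILES = {
        "35 12345 21 5 -8 365",
        "38 156 12 6 -2 -10",
        "45 2365 68 -15 -18 -30"
    };

    // Returns one "rank<TAB>value" line per number, in ascending order.
    static List<String> ranked(String[] files) {
        List<Integer> numbers = new ArrayList<>();
        for (String content : files) {
            for (String token : content.trim().split("\\s+")) {
                numbers.add(Integer.parseInt(token));
            }
        }
        Collections.sort(numbers);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < numbers.size(); i++) {
            lines.add((i + 1) + "\t" + numbers.get(i));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : ranked(FILES)) {
            System.out.println(line);
        }
    }
}
```

On the sample data this prints 18 lines, from rank 1 (-30) to rank 18 (12345).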

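Experiment Plan

MapReduce sorts as part of its normal processing: by default, it sorts records by key. If the key is an IntWritable, which wraps an int, MapReduce sorts the keys numerically; if the key is a Text, which wraps a String, MapReduce sorts the keys in lexicographic order. This experiment therefore uses IntWritable. The map function converts each number it reads into an IntWritable and emits it as the key (the value is arbitrary). The reduce function receives <key, value-list> and writes the input key as the output value, once for each element in the value-list. The output key (linenum in the code) is a counter shared across all calls to reduce; it records the rank of the current key.

The sample code is modified in two parts, Map and Reduce. In the Map part, the output key type changes from "Text" to "IntWritable", and "itr.nextToken()" becomes "Integer.parseInt(itr.nextToken())", so that each token is emitted as a numeric key; the input value type stays "Text", because each input line arrives as text. In the Reduce part, "Text" also becomes "IntWritable", and "IntWritable()" becomes "IntWritable(1)" so that the rank counter starts at 1. The summing for loop and "result.set(sum)" are removed; in their place, the reducer writes one record per element of the value-list. The key and the value swap roles on output: the number that arrived as the key is written as the value, and the rank is written as the key. The statement "job.setCombinerClass(IntSumReducer.class);" is removed as well, because this reducer emits ranks rather than partial sums and so cannot serve as a combiner.

The modified code is as follows: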
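The effect of the key type and of the rank counter can be illustrated without Hadoop. The sketch below is a plain-Java simulation under stated assumptions, not Hadoop's implementation: a TreeMap stands in for the sorted shuffle, String and Integer stand in for Text and IntWritable (the orderings agree for the ASCII tokens used here), and the names ShuffleSketch, mapAndShuffle and reduce are hypothetical. The input line contains a duplicate so that the value-list handling is visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of the sort job; the TreeMap plays the role of
// the shuffle phase, which hands keys to the reducer in sorted order.
class ShuffleSketch {

    // "Map" + "shuffle": emit each number as a key and group equal keys,
    // keeping only the size of each key's value-list.
    static TreeMap<Integer, Integer> mapAndShuffle(String line) {
        TreeMap<Integer, Integer> grouped = new TreeMap<>();
        for (String token : line.trim().split("\\s+")) {
            grouped.merge(Integer.parseInt(token), 1, Integer::sum);
        }
        return grouped;
    }

    // "Reduce": visit keys in ascending order and write (rank, key) once
    // per element of the value-list; linenum is the running rank.
    static List<String> reduce(TreeMap<Integer, Integer> grouped) {
        List<String> output = new ArrayList<>();
        int linenum = 1;
        for (Map.Entry<Integer, Integer> entry : grouped.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                output.add(linenum + "\t" + entry.getKey());
                linenum++;
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // String keys sort lexicographically, as Text keys would.
        String[] asText = {"35", "5", "-8", "365", "5"};
        Arrays.sort(asText);
        System.out.println(Arrays.toString(asText)); // [-8, 35, 365, 5, 5]

        // Integer keys sort numerically, as IntWritable keys do.
        for (String row : reduce(mapAndShuffle("35 5 -8 365 5"))) {
            System.out.println(row);
        }
    }
}
```

The lexicographic order places "365" before "5", which is why the job needs a numeric key type. In the numeric run, the duplicate 5 is written twice, with the consecutive ranks 2 and 3.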

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  // Map: parse every number on the input line and emit it as the key, so
  // that the framework sorts the numbers; the value is only a placeholder.
  public static class TokenizerMapper
      extends Mapper<Object, Text, IntWritable, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private IntWritable word = new IntWritable();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(Integer.parseInt(itr.nextToken()));
        context.write(word, one);
      }
    }
  }

  // Reduce: keys arrive in ascending order; write (rank, number) once for
  // each occurrence of the number.
  public static class IntSumReducer
      extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    // Rank of the next number to write. It is a global rank only when the
    // job runs with a single reduce task.
    private IntWritable linenum = new IntWritable(1);

    public void reduce(IntWritable key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      for (IntWritable val : values) {
        context.write(linenum, key);
        linenum.set(linenum.get() + 1);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    // A single reduce task keeps the rank counter global.
    job.setNumReduceTasks(1);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(IntWritable.class);
    // Every argument except the last is an input path.
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job,
        new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

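Before entering the code, create a new MapReduce project and a new class whose name matches the code, namely WordCount. After entering the code above, set the input and output paths for the class: in the Arguments window, give the paths of the files that hold the data to be sorted, followed by the output path. Then run the code to obtain the sorting result described in the experiment topic (as shown in the figure in the Conclusion section).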

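Conclusion

The objectives and requirements of this experiment were all met. During the experiment, I became proficient at using the available virtual machine resources to create, on VM, a Master virtual machine and its cluster of slave virtual machines. I was also able to use MapReduce to extract the data from three separate files and sort it by key from smallest to largest, obtaining the result given in the experiment topic (the figure below shows the output of the run).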

![Hadoop simple sort result](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9hdXRvMmRldi5jb2RpbmcubmV0L3AvSW1hZ2VIb3N0aW5nU2VydmljZS9kL0ltYWdlSG9zdGluZ1NlcnZpY2UvZ2l0L3Jhdy9tYXN0ZXIvbWQvY2xpcF9pbWFnZTAwMi5wbmc?x-oss-process=image/format,png)

posted @ 2020-07-14 20:55 赤沙咀-菜虚坤
