11-22 Daily Blog

Our instructor assigned a MapReduce lab, so I am writing it up here.

MapReduce Example: Deduplication
How It Works
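"Data deduplication" is mainly an exercise in using parallel thinking to filter data in a meaningful way. Seemingly complicated tasks, such as counting the number of distinct kinds of data in a large data set or working out visitor locations from website logs, all involve deduplication.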
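The goal of deduplication is that any record appearing more than once in the raw data appears exactly once in the output file. In the MapReduce flow, the <key,value> pairs emitted by map are gathered by the shuffle phase into <key,value-list> and then handed to reduce. The natural idea is to send every copy of the same record to the same reduce machine: no matter how many times the record occurs, it only needs to be written to the final result once. Concretely, the reduce input uses the record itself as the key and places no requirement on the value-list (it can be empty). When reduce receives a <key,value-list>, it copies the input key to the output key, sets the value to an empty value, and emits <key,value>.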
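The data flow described above can be sketched in plain Java without any Hadoop classes. This is only an illustration under assumptions: the class name `DedupSketch`, the helper names `mapAndShuffle` and `reduce`, and the sample records are all hypothetical, and a `TreeMap` stands in for the shuffle phase because it groups by key and keeps the keys sorted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the dedup data flow; no Hadoop classes are involved.
class DedupSketch {

    // "map" + "shuffle": emit each record as a key with a placeholder value,
    // then group the emitted pairs by key. TreeMap keeps the keys sorted.
    static Map<String, List<Boolean>> mapAndShuffle(List<String> records) {
        Map<String, List<Boolean>> grouped = new TreeMap<>();
        for (String record : records) {
            grouped.computeIfAbsent(record, k -> new ArrayList<>()).add(Boolean.TRUE);
        }
        return grouped;
    }

    // "reduce": ignore the value-list and output every key exactly once.
    static List<String> reduce(Map<String, List<Boolean>> grouped) {
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        // Hypothetical sample records, not taken from the lab data set.
        List<String> records = List.of("b", "a", "b", "c", "a");
        System.out.println(reduce(mapAndShuffle(records))); // prints [a, b, c]
    }
}
```

The value-list for "a" and "b" holds two placeholders each, yet the reduce step never looks at it, which is why the real job can use an empty value type.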
In IDEA, create a Java class and write the MapReduce code:
package exper;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

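public class Filter {
    // Mapper: emits the second field of each input line as the key, with an empty value.
    public static class Map extends Mapper<Object, Text, Text, NullWritable> {
        // Reused across map() calls so a new Text is not allocated per record.
        private final Text newKey = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            System.out.println(line);
            // Split on a single space; this must match the delimiter used in the input file.
            String[] arr = line.split(" ");
            if (arr.length < 2) {
                return; // Skip blank or malformed lines instead of failing on arr[1].
            }
            newKey.set(arr[1]);
            context.write(newKey, NullWritable.get());
            System.out.println(newKey);
        }
    }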

    // Reducer: each distinct key arrives once, so writing it once removes the duplicates.
    public static class Reduce extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        public void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        System.out.println("start");
        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(conf, "filter");
        job.setJarByClass(Filter.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // HDFS paths for running against a cluster:
        // Path in = new Path("hdfs://localhost:9000/mapreduce/1in");
        // Path out = new Path("hdfs://localhost:9000/mapreduce/1out");
        // Local paths for running from the IDE; the output directory must not already exist.
        String inPath = "D:\\mapreduce\\1in\\buyer_favorite1.txt";
        String outPath = "file:///D:/mapreduce/1out";
        FileInputFormat.addInputPath(job, new Path(inPath));
        FileOutputFormat.setOutputPath(job, new Path(outPath));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
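
The mapper above takes `arr[1]` after splitting on a space, so the result depends entirely on the delimiter in the input file. The following is a minimal sketch of that extraction step in plain Java, useful for checking the parsing logic outside Hadoop. The class name `FieldExtractSketch`, the helper `extractField`, and the sample lines are hypothetical and are not taken from `buyer_favorite1.txt`.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Stand-alone version of the mapper's "split and take one field" step.
class FieldExtractSketch {

    // Splits the line on the literal delimiter and returns the field at `index`,
    // or Optional.empty() when the line has too few fields.
    static Optional<String> extractField(String line, String delimiter, int index) {
        String[] fields = line.split(Pattern.quote(delimiter));
        if (index < 0 || index >= fields.length) {
            return Optional.empty();
        }
        return Optional.of(fields[index]);
    }

    public static void main(String[] args) {
        // Hypothetical space-separated line: the second field is found.
        System.out.println(extractField("u1 item42 2021-11-22", " ", 1));
        // Hypothetical tab-separated line split on a space: no second field exists.
        System.out.println(extractField("u1\titem42", " ", 1));
        // The same line split on a tab: the second field is found.
        System.out.println(extractField("u1\titem42", "\t", 1));
    }
}
```

If the job output looks wrong or the mapper fails on `arr[1]`, comparing the delimiter in the code with the one actually used in the input file is a reasonable first check.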

posted @ 2021-11-22 21:21 软工新人
