MapReduce Basics (Part 2)

1. Combiner

2. Serialization

3. Sorting

4. Partitioning with Partitioner

5. A Complete Example


1. Combiner

    1) What is a Combiner

    A Combiner is a MapReduce component that sits between the Mapper and the Reducer. It performs a local aggregation of each map task's output before the shuffle, which reduces the amount of data sent over the network and lightens the load on the reduce tasks.

    2) How to use a Combiner

    A Combiner is written exactly like a Reducer: create a class that extends Reducer, put the local-aggregation logic in its reduce method, and register it on the job with job.setCombinerClass(FlowSumCombine.class), as sketched below.
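A minimal sketch of what such a combiner could look like for the flow-summing case study in section 5. Only the name FlowSumCombine appears in the text above; the body here is an assumption, and it simply reuses the reducer's per-phone summation — which is exactly what makes summation safe to run as a combiner:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical combiner for the flow-count job in section 5 (FlowBean is
// defined there). It performs the same per-phone summation as the reducer,
// but locally on each map task's output, before the shuffle.
public class FlowSumCombine extends Reducer<Text, FlowBean, Text, FlowBean> {

	private FlowBean flowBean = new FlowBean();

	@Override
	protected void reduce(Text key, Iterable<FlowBean> values, Context context)
			throws IOException, InterruptedException {
		long up_flow_sum = 0;
		long d_flow_sum = 0;
		for (FlowBean bean : values) {
			up_flow_sum += bean.getUp_flow();
			d_flow_sum += bean.getD_flow();
		}
		flowBean.set(key.toString(), up_flow_sum, d_flow_sum);
		context.write(key, flowBean);
	}
}

// Registered on the job:
// job.setCombinerClass(FlowSumCombine.class);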

    3) Notes on using a Combiner

    (1) A Combiner and a Reducer differ in where they run:

           the Combiner runs on the node of each individual map task;
           the Reducer receives the output of all Mappers globally.

     (2) The Combiner's output key/value types must match the Reducer's input key/value types.

     (3) Use Combiners with great care: the framework may invoke the Combiner zero, one, or several times, so the rule is that the final result must be identical whether or not the Combiner runs at all. Computing an average is the classic case where combiner and reducer logic cannot be the same: averaging the partial averages of (1, 2) and (3, 4, 5) gives (1.5 + 4) / 2 = 2.75, while the true average of 1..5 is 3.


2. Serialization

    1) Overview

    Java's built-in serialization (Serializable) is a heavyweight framework: a serialized object carries a lot of extra information (checksums, headers, the class hierarchy, and so on), which makes it inefficient to ship across the network. Hadoop therefore provides its own compact, efficient serialization mechanism (Writable). The framework already ships Writable implementations for the basic types and null, as listed below:

Java type      Hadoop Writable
---------      ---------------
byte           ByteWritable
short          ShortWritable
int            IntWritable
long           LongWritable
float          FloatWritable
double         DoubleWritable
String         Text
null           NullWritable
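A quick sketch of how these wrapper types are used directly (this is the standard org.apache.hadoop.io API; the class name WritableDemo is just for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
	public static void main(String[] args) {
		Text word = new Text("hello");              // String -> Text
		IntWritable one = new IntWritable(1);       // int    -> IntWritable
		NullWritable nothing = NullWritable.get();  // null placeholder (a singleton)

		// Unwrap back to plain Java values
		String s = word.toString();
		int i = one.get();
		System.out.println(s + "\t" + i + "\t" + nothing);
	}
}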

    

    2) Java serialization

    For Java's own serialization, a class only needs to implement the Serializable interface, for example:

import java.io.Serializable;

public class Student implements Serializable {
	private int id;

	public Student() {
	}
}
    3) Making a custom class serializable for the MapReduce framework

    If a custom bean is to be transported as a key, it must also be comparable, because the shuffle phase of the MapReduce framework always sorts by key. In that case the bean should implement WritableComparable:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class FlowBean implements WritableComparable<FlowBean> {

	private String phoneNbr;
	private long up_flow;
	private long d_flow;
	private long sum_flow;

	public void set(String phoneNbr, long up_flow, long d_flow) {
		this.phoneNbr = phoneNbr;
		this.up_flow = up_flow;
		this.d_flow = d_flow;
		this.sum_flow = up_flow + d_flow;
	}

	/**
	 * Serialization: write the fields out as a byte stream.
	 */
	public void write(DataOutput out) throws IOException {
		out.writeUTF(this.phoneNbr);
		out.writeLong(this.up_flow);
		out.writeLong(this.d_flow);
		out.writeLong(this.sum_flow);
	}

	/**
	 * Deserialization: read the fields back from the byte stream,
	 * in exactly the same order they were written.
	 */
	public void readFields(DataInput in) throws IOException {
		this.phoneNbr = in.readUTF();
		this.up_flow = in.readLong();
		this.d_flow = in.readLong();
		this.sum_flow = in.readLong();
	}

	/**
	 * Sort by total flow, descending; ties are broken on the phone number
	 * so that distinct beans never compare as equal (compareTo must return 0
	 * only for genuinely equal keys).
	 */
	public int compareTo(FlowBean o) {
		int c = Long.compare(o.getSum_flow(), this.sum_flow);
		return c != 0 ? c : this.phoneNbr.compareTo(o.getPhoneNbr());
	}

	// getters, setters and toString omitted here; the complete
	// FlowBean.java is listed in the example in section 5.
}

3. Sorting

    During the spill phase of a MapReduce job, the intermediate results are sorted by key. If we use our own FlowBean as the key and it implements WritableComparable, all we have to do is override compareTo. For example, to sort by sum_flow in descending order, the overridden compareTo looks like this:

public int compareTo(FlowBean o) {
	// Descending by total flow; ties broken on phone number so that
	// distinct beans never compare as equal
	int c = Long.compare(o.getSum_flow(), this.sum_flow);
	return c != 0 ? c : this.phoneNbr.compareTo(o.getPhoneNbr());
}


4. Partitioning with Partitioner

    In MapReduce, the number of reduce tasks equals the number of partitions, and it is set with job.setNumReduceTasks(5). By default, records are distributed across partitions by the framework's hash partitioner; to control the assignment yourself, extend Partitioner<KEY, VALUE> and override getPartition:

public int getPartition(KEY key, VALUE value, int numPartitions)
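Wiring a custom partitioner into a job then takes two calls; this minimal sketch uses the PhonePartitioner and the five partitions from the case study in section 5:

// One reduce task per partition; getPartition must return a value in [0, 5)
job.setPartitionerClass(PhonePartitioner.class);
job.setNumReduceTasks(5);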


5. A Complete Example

Test data (a few sample rows from the phone-traffic log):

1363157985066   13726230503     00-FD-07-A4-72-B8:CMCC  120.196.100.82  i02.c.aliimg.com                24      27      2481    24681   200


1363157995052   13826544101     5C-0E-8B-C7-F1-E0:CMCC  120.197.40.4                    4       0       264     0       200


Requirements:

    (1) For each user (phone number), compute the total upstream traffic, total downstream traffic, and overall total traffic.

    (2) On top of result (1), add a further requirement: sort the statistics by total traffic in descending order.

    (3) Write the aggregated traffic to different output files according to the phone number's home area.


The implementation:

FlowBean.java

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class FlowBean implements WritableComparable<FlowBean> {

	private String phoneNbr;
	private long up_flow;
	private long d_flow;
	private long sum_flow;

	public void set(String phoneNbr, long up_flow, long d_flow) {
		this.phoneNbr = phoneNbr;
		this.up_flow = up_flow;
		this.d_flow = d_flow;
		this.sum_flow = up_flow + d_flow;
	}

	/**
	 * Serialization: write the fields out as a byte stream.
	 */
	public void write(DataOutput out) throws IOException {
		out.writeUTF(this.phoneNbr);
		out.writeLong(this.up_flow);
		out.writeLong(this.d_flow);
		out.writeLong(this.sum_flow);
	}

	/**
	 * Deserialization: read the fields back from the byte stream,
	 * in exactly the same order they were written.
	 */
	public void readFields(DataInput in) throws IOException {
		this.phoneNbr = in.readUTF();
		this.up_flow = in.readLong();
		this.d_flow = in.readLong();
		this.sum_flow = in.readLong();
	}

	/**
	 * Sort by total flow, descending; ties are broken on the phone number
	 * so that distinct beans never compare as equal.
	 */
	public int compareTo(FlowBean o) {
		int c = Long.compare(o.getSum_flow(), this.sum_flow);
		return c != 0 ? c : this.phoneNbr.compareTo(o.getPhoneNbr());
	}

	@Override
	public String toString() {
		return phoneNbr + "\t" + up_flow + "\t" + d_flow + "\t" + sum_flow;
	}

	public String getPhoneNbr() {
		return phoneNbr;
	}

	public void setPhoneNbr(String phoneNbr) {
		this.phoneNbr = phoneNbr;
	}

	public long getUp_flow() {
		return up_flow;
	}

	public void setUp_flow(long up_flow) {
		this.up_flow = up_flow;
	}

	public long getD_flow() {
		return d_flow;
	}

	public void setD_flow(long d_flow) {
		this.d_flow = d_flow;
	}

	public long getSum_flow() {
		return sum_flow;
	}

	public void setSum_flow(long sum_flow) {
		this.sum_flow = sum_flow;
	}
}

PhonePartitioner.java

import java.util.HashMap;

import org.apache.hadoop.mapreduce.Partitioner;

public class PhonePartitioner<KEY, VALUE> extends Partitioner<KEY, VALUE> {

	// Phone-number prefix -> partition (reduce task) index
	private static HashMap<String, Integer> areaMap = new HashMap<>();

	static {
		areaMap.put("136", 0);
		areaMap.put("137", 1);
		areaMap.put("138", 2);
		areaMap.put("139", 3);
	}

	@Override
	public int getPartition(KEY key, VALUE value, int numPartitions) {
		// Look up the first three digits of the phone number;
		// unknown prefixes fall into the catch-all partition 4
		Integer provinceCode = areaMap.get(key.toString().substring(0, 3));
		return provinceCode == null ? 4 : provinceCode;
	}
}

FlowCount.java

import java.io.IOException;
import java.net.URI;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Sums per-phone upstream/downstream traffic and partitions the
// output files by phone-number prefix
public class FlowCount {

	public static class FlowCountMapper extends
			Mapper<LongWritable, Text, Text, FlowBean> {
		private FlowBean flowBean = new FlowBean();

		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {

			String line = value.toString();
			if (StringUtils.isBlank(line)) {
				return;
			}

			// Split the record into fields
			String[] fields = line.split("\\s+");

			// Pick out the fields we need
			String phoneNbr = fields[1];
			long up_flow = Long.parseLong(fields[fields.length - 3]);
			long d_flow = Long.parseLong(fields[fields.length - 2]);

			// Pack the values into a FlowBean and emit it keyed by phone number
			flowBean.set(phoneNbr, up_flow, d_flow);
			context.write(new Text(phoneNbr), flowBean);
		}
	}

	public static class FlowCountReducer extends
			Reducer<Text, FlowBean, Text, FlowBean> {
		private FlowBean flowBean = new FlowBean();

		@Override
		protected void reduce(Text key, Iterable<FlowBean> values,
				Context context) throws IOException, InterruptedException {
			long up_flow_sum = 0;
			long d_flow_sum = 0;
			for (FlowBean bean : values) {
				up_flow_sum += bean.getUp_flow();
				d_flow_sum += bean.getD_flow();
			}
			flowBean.set(key.toString(), up_flow_sum, d_flow_sum);
			context.write(key, flowBean);
		}
	}

	final static String INPUT_PATH = "hdfs://10.202.62.66:8020/wmapreduce/phoneflow/input/phonedata.txt";
	final static String OUT_PATH = "hdfs://10.202.62.66:8020/wmapreduce/phoneflow/output";

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();

		// Delete any previous output so the job can be rerun
		final FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
		if (fileSystem.exists(new Path(OUT_PATH))) {
			fileSystem.delete(new Path(OUT_PATH), true);
		}

		Job job = Job.getInstance(conf, "flowjob");
		job.setJarByClass(FlowCount.class);

		job.setMapperClass(FlowCountMapper.class);
		job.setReducerClass(FlowCountReducer.class);

		// Custom partitioner: one reduce task (and output file) per area code
		job.setPartitionerClass(PhonePartitioner.class);
		job.setNumReduceTasks(5);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(FlowBean.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(FlowBean.class);

		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);

		FileInputFormat.setInputPaths(job, new Path(INPUT_PATH));
		FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));

		job.waitForCompletion(true);
	}
}

FlowCountSort.java

import java.io.IOException;
import java.net.URI;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Sorts records by total traffic, descending: the FlowBean itself is the
// map output key, so the shuffle's key sort does the work (see its compareTo)
public class FlowCountSort {

	public static class FlowCountSortMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
		private FlowBean bean = new FlowBean();

		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			String line = value.toString();

			if (StringUtils.isBlank(line)) {
				return;
			}

			String[] fields = StringUtils.split(line, "\t");
			String phoneNbr = fields[1];
			long up_flow = Long.parseLong(fields[fields.length - 3]);
			long d_flow = Long.parseLong(fields[fields.length - 2]);
			bean.set(phoneNbr, up_flow, d_flow);

			// The bean is the key; there is no meaningful value to carry
			context.write(bean, NullWritable.get());
		}
	}

	public static class FlowCountSortReducer extends Reducer<FlowBean, NullWritable, Text, FlowBean> {
		@Override
		protected void reduce(FlowBean bean, Iterable<NullWritable> values, Context context)
				throws IOException, InterruptedException {
			// Keys arrive already sorted; unpack each back to phone number + bean
			context.write(new Text(bean.getPhoneNbr()), bean);
		}
	}

	// NOTE: per requirement (2) this job is meant to sort the aggregated
	// output of FlowCount, so INPUT_PATH would normally point at that job's
	// output directory rather than the raw data file
	final static String INPUT_PATH = "hdfs://10.202.62.66:8020/wmapreduce/phoneflow/input/phonedata.txt";
	final static String OUT_PATH = "hdfs://10.202.62.66:8020/wmapreduce/phoneflow/output";

	public static void main(String[] args) throws Exception {

		Configuration conf = new Configuration();

		// Delete any previous output so the job can be rerun
		final FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
		if (fileSystem.exists(new Path(OUT_PATH))) {
			fileSystem.delete(new Path(OUT_PATH), true);
		}

		Job job = Job.getInstance(conf, "flowjob");
		job.setJarByClass(FlowCountSort.class);

		job.setMapperClass(FlowCountSortMapper.class);
		job.setReducerClass(FlowCountSortReducer.class);

		job.setMapOutputKeyClass(FlowBean.class);
		job.setMapOutputValueClass(NullWritable.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(FlowBean.class);

		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);

		FileInputFormat.setInputPaths(job, new Path(INPUT_PATH));
		FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));

		job.waitForCompletion(true);
	}
}



