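Big Data Study Notes: Hadoop Programming in Practice with MapReduce
Hadoop Programming in Practice: Implementing Core MapReduce Features
This post follows on from the previous one on HDFS programming and walks through the main data analysis tasks you can build with MapReduce. Day-to-day work rarely touches the underlying theory, but a solid grasp of the MapReduce framework shows how SQL statements are executed under the hood in big data settings, which in turn helps developers tune their SQL and speed up queries. Without further ado, let's get started!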
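1. MapReduce Basics: Implementing Word Count
A basic MapReduce program usually consists of three classes: a Mapper, a Reducer, and an App (driver) class. The Mapper reads the input line by line and can clean the data as it goes; the Reducer aggregates the values of each key according to some logic; the App class holds the entry point and applies the necessary job configuration. The code is as follows: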
App class:
package mapreduce.wc;

/*
 * A simple word count to get started. Note: import everything from the
 * mapreduce packages (new API), not from mapred!
 */
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCApp {
    public static void main(String[] args) throws Exception {
        // Uncomment so that Hadoop treats the caller as the root user
        // System.setProperty("HADOOP_USER_NAME", "root");
        Configuration conf = new Configuration();
        // The loaded configuration points at HDFS, so switch this conf object to the local file system for now
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);

        // Job.getInstance is a static factory method: it builds a new Job from the configuration
        Job job = Job.getInstance(conf);
        // Name the job
        job.setJobName("word count");
        // Register the driver class and the Mapper and Reducer classes
        job.setJarByClass(WCApp.class);
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        // Key/value classes emitted by the Mapper
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Key/value classes emitted by the Reducer; when they match the Mapper's, one pair of calls is enough
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths
        Path outPath = new Path("file:///d:/out");
        FileInputFormat.addInputPath(job, new Path("file:///d:/wc.txt"));
        FileOutputFormat.setOutputPath(job, outPath);
        // Delete the output directory if it already exists, otherwise the job fails
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        // Run the job and wait for it to finish
        job.waitForCompletion(true);
    }
}
Mapper class:
package mapreduce.wc;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get the current line
        String line = value.toString();
        // Split the line into words
        String[] arr = line.split(" ");
        // Emit one (word, 1) pair per word through the context object
        for (String s : arr) {
            context.write(new Text(s), new IntWritable(1));
        }
    }
}
Reducer class:
package mapreduce.wc;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Each reduce call receives all values that share one key; here they are summed
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
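To make the data flow of the three classes easier to picture, here is a minimal stand-alone sketch that imitates map, shuffle and reduce with plain Java collections. It does not use Hadoop at all; the class LocalWordCount and its method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the word count job; not Hadoop code.
class LocalWordCount {
    // Map + shuffle: emit (word, 1) per token and group the 1s by word, sorted by key.
    static Map<String, List<Integer>> mapAndShuffle(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key, like WCReducer.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello hadoop");
        System.out.println(reduce(mapAndShuffle(lines))); // {hadoop=1, hello=2, world=1}
    }
}
```

In the real job the grouping step is done by the framework between the map and reduce phases; only the per-line and per-key logic is written by hand.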
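2. A Fuller MapReduce Exercise: Maximum, Minimum and Average Temperature, with a Combiner
A Combiner performs a first round of aggregation on the map side, so far less data has to cross the network during the shuffle to the reducers, which shortens the job's running time. Setting one up is simple: write the Combiner class, then register it in the App class with setCombinerClass. Keep in mind that the framework may run the combiner zero, one or several times, so its output must be acceptable wherever the Mapper's output is. The code is as follows: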
App class:
package mapreduce.temp;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/*
 * Computes the maximum, minimum and average temperature per year
 */
public class TempApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);

        Job job = Job.getInstance(conf);
        job.setJobName("temp");
        job.setJarByClass(TempApp.class);
        job.setMapperClass(TempMapper.class);
        job.setReducerClass(TempReducer.class);
        // Register the Combiner class
        job.setCombinerClass(TempCombiner.class);
        // Key/value classes emitted by the Mapper
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // Key/value classes emitted by the Reducer; when they match the Mapper's, one pair of calls is enough
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Input and output paths
        Path outPath = new Path("file:///d:/out");
        FileInputFormat.addInputPath(job, new Path("file:///d:/Temp"));
        FileOutputFormat.setOutputPath(job, outPath);
        // Delete the output directory if it already exists, otherwise the job fails
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        // Run the job and wait for it to finish
        job.waitForCompletion(true);
    }
}
Mapper class:
package mapreduce.temp;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class TempMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // Extract the year and temperature fields from the fixed-width record
        String year = line.substring(15, 19);
        String temp = line.substring(87, 92);
        // Drop dirty data: 9999 marks a missing reading
        if (Integer.parseInt(temp) != 9999) {
            context.write(new Text(year), new Text(temp));
        }
    }
}
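Combiner class:
package mapreduce.temp;

/*
 * A Combiner acts as a Reducer on the map side: it pre-aggregates the data to reduce its volume
 */
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class TempCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        int min = Integer.MAX_VALUE;
        long sum = 0;
        long count = 0;
        for (Text value : values) {
            // A value is either a raw temperature from the Mapper (one field) or a
            // "max\tmin\tsum\tcount" record from an earlier combiner pass, because the
            // framework may run the combiner more than once
            String[] arr = value.toString().split("\t");
            if (arr.length == 1) {
                int t = Integer.parseInt(arr[0]);
                max = Math.max(max, t);
                min = Math.min(min, t);
                sum += t;
                count += 1;
            } else {
                max = Math.max(max, Integer.parseInt(arr[0]));
                min = Math.min(min, Integer.parseInt(arr[1]));
                sum += Long.parseLong(arr[2]);
                count += Long.parseLong(arr[3]);
            }
        }
        context.write(key, new Text(max + "\t" + min + "\t" + sum + "\t" + count));
    }
}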
Reducer class:
package mapreduce.temp;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class TempReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        int min = Integer.MAX_VALUE;
        long sum = 0;
        long count = 0;
        for (Text value : values) {
            // Parse the value: normally a "max\tmin\tsum\tcount" record from the Combiner,
            // but a raw temperature (one field) if the framework skipped the combiner
            String[] arr = value.toString().split("\t");
            if (arr.length == 1) {
                int t = Integer.parseInt(arr[0]);
                max = Math.max(max, t);
                min = Math.min(min, t);
                sum += t;
                count += 1;
            } else {
                max = Math.max(max, Integer.parseInt(arr[0]));
                min = Math.min(min, Integer.parseInt(arr[1]));
                sum += Long.parseLong(arr[2]);
                count += Long.parseLong(arr[3]);
            }
        }
        // Integer division: the average is truncated to a whole number
        context.write(key, new Text(max + "\t" + min + "\t" + sum / count));
    }
}
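Why does the Combiner pass sum and count along instead of an average? Because partial results have to be mergeable in any grouping. The sketch below is plain Java with a hypothetical class TempStats (not part of the job above); it merges partial results from two imaginary map tasks and compares the outcome with the naive "average of averages".

```java
// Hypothetical value class illustrating a mergeable partial aggregate; not Hadoop code.
class TempStats {
    final int max;
    final int min;
    final long sum;
    final long count;

    TempStats(int max, int min, long sum, long count) {
        this.max = max;
        this.min = min;
        this.sum = sum;
        this.count = count;
    }

    // A single reading is a partial aggregate of size one.
    static TempStats of(int temp) {
        return new TempStats(temp, temp, temp, 1);
    }

    // Merging is associative and commutative, so it can be applied in any order and any number of times.
    TempStats merge(TempStats other) {
        return new TempStats(Math.max(max, other.max), Math.min(min, other.min),
                sum + other.sum, count + other.count);
    }

    double average() {
        return (double) sum / count;
    }

    public static void main(String[] args) {
        TempStats taskA = TempStats.of(10).merge(TempStats.of(20)); // two readings on one map task
        TempStats taskB = TempStats.of(60);                          // one reading on another
        System.out.println(taskA.merge(taskB).average());            // 30.0, the true average
        System.out.println((taskA.average() + taskB.average()) / 2); // 37.5, the wrong answer
    }
}
```

The same reasoning explains why a single reading and a pre-aggregated record should be interchangeable as input: a combiner's output may be fed back into a combiner.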
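3. Two Ways to Handle Data Skew in Big Data Workloads
The setNumReduceTasks method lets you split the reduce phase into several partitions, which helps keep a flood of data from converging on a single node and overwhelming it. To use it, add one statement to the App class:
// Set the number of reduce tasks
job.setNumReduceTasks(3);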
With this setting the output is written in three partitions. Unless you supply your own subclass of Partitioner, the framework uses HashPartitioner, whose source code is:
public class HashPartitioner<K, V> extends Partitioner<K, V> {

    /** Use {@link Object#hashCode()} to partition. */
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
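As the return statement shows, getPartition relies on the key's hashCode method: it computes the hash of the key, masks off the sign bit, and takes the remainder modulo the number of reduce tasks. If a large share of the records carry the same key, however (think of a Singles' Day promotion), they all produce the same remainder and land in the same partition, so adding reducers alone does not really solve data skew. We have to design a scheme of our own that spreads the data as evenly as possible. Skew fixes usually depend on the actual business scenario, so the two approaches presented here are simply the most common ones: redesigning the key by appending a random number to it, and random partitioning. Both are described in turn below.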
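The effect of the modulo expression on a hot key can be checked with a few lines of plain Java. This is only a sketch: String.hashCode stands in for the hashCode of a Hadoop key type, and the class name and sample keys are invented for the demonstration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo of hash partitioning with a skewed key distribution; not Hadoop code.
class HashPartitionDemo {
    // Same expression as HashPartitioner.getPartition, applied to a String key.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Count how many records each partition would receive.
    static int[] distribution(List<String> keys, int numReduceTasks) {
        int[] sizes = new int[numReduceTasks];
        for (String key : keys) {
            sizes[partition(key, numReduceTasks)]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add("hot_item");        // one key dominates the data set
        }
        for (int i = 0; i < 30; i++) {
            keys.add("item" + i);        // a few ordinary keys
        }
        // Whatever the hash values are, all 1000 hot records share one partition.
        System.out.println(Arrays.toString(distribution(keys, 3)));
    }
}
```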
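Redesigning the key
The App class has to chain two jobs: the output of the first job, counts per salted key, is only an intermediate result, so a second job is needed to turn it into the final result.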
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCApp {
    public static void main(String[] args) throws Exception {
        System.setProperty("HADOOP_USER_NAME", "centos");
        // Initialise the first job
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        Job job = Job.getInstance(conf);
        // Name the job
        job.setJobName("WC");
        // Class that contains the entry point
        job.setJarByClass(WCApp.class);
        // Mapper and Reducer classes
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        // Key/value types emitted by the map phase
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Key/value types emitted by the reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths
        Path outPath = new Path("D:/out");
        FileInputFormat.addInputPath(job, new Path("D:/1.txt"));
        FileOutputFormat.setOutputPath(job, outPath);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        job.setNumReduceTasks(4);
        // Run the first job
        boolean b = job.waitForCompletion(true);
        if (b) {
            // Second job: strip the random suffix and merge the partial counts
            Job job2 = Job.getInstance(conf);
            job2.setJobName("WC2");
            job2.setJarByClass(WCApp.class);
            job2.setMapperClass(WCMapper2.class);
            job2.setReducerClass(WCReducer.class);
            job2.setMapOutputKeyClass(Text.class);
            job2.setMapOutputValueClass(IntWritable.class);
            job2.setOutputKeyClass(Text.class);
            job2.setOutputValueClass(IntWritable.class);
            // The first job's output directory is the second job's input
            Path outPath2 = new Path("D:/out2");
            FileInputFormat.addInputPath(job2, new Path("D:/out"));
            FileOutputFormat.setOutputPath(job2, outPath2);
            if (fs.exists(outPath2)) {
                fs.delete(outPath2, true);
            }
            // Run the second job
            job2.waitForCompletion(true);
        }
    }
}
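When writing the Mapper, note that the Random object should be created only once, so the instantiation goes into the setup method; the map method then appends a random number drawn from it to every key. The code is as follows: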
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.Random;

public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    int num;
    Random r;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        num = context.getNumReduceTasks();
        r = new Random();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] arr = line.split(" ");
        for (String word : arr) {
            context.write(new Text(word + "_" + r.nextInt(num)), new IntWritable(1));
        }
    }
}
Note that a second Mapper class is also needed to split the key apart again on the underscore (_):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WCMapper2 extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each input line is "word_suffix<TAB>count" as written by the first job
        String[] arr = value.toString().split("\t");
        // Cut at the last underscore so that words containing "_" stay intact
        String word = arr[0].substring(0, arr[0].lastIndexOf('_'));
        String count = arr[1];
        context.write(new Text(word), new IntWritable(Integer.parseInt(count)));
    }
}
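The two-stage idea can be tried out without a cluster. The following plain-Java sketch (the class SaltedCountDemo and its method names are hypothetical) counts per salted key first and then strips the salt and merges the partial counts, which is roughly what the two chained jobs do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the two chained jobs; not Hadoop code.
class SaltedCountDemo {
    // Stage 1: count per salted key "word_n", where n is a random number below numSalts.
    static Map<String, Integer> stageOne(List<String> words, int numSalts, Random r) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : words) {
            partial.merge(word + "_" + r.nextInt(numSalts), 1, Integer::sum);
        }
        return partial;
    }

    // Stage 2: drop everything from the last underscore on and add up the partial counts.
    static Map<String, Integer> stageTwo(Map<String, Integer> partial) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map.Entry<String, Integer> e : partial.entrySet()) {
            String salted = e.getKey();
            String word = salted.substring(0, salted.lastIndexOf('_'));
            total.merge(word, e.getValue(), Integer::sum);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            words.add("hot");
        }
        words.add("cold");
        Map<String, Integer> partial = stageOne(words, 4, new Random());
        System.out.println(partial);           // "hot" is spread over up to 4 salted keys
        System.out.println(stageTwo(partial)); // {cold=1, hot=100}
    }
}
```

The final totals do not depend on which random numbers were drawn; only the intermediate distribution does.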
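Random partitioning
This method ignores the key altogether: every record that is read is sent to a randomly chosen partition. Because records with the same key are then scattered across reducers, each reducer produces only a partial result per key, so a second pass is still needed to merge them. The method requires a custom partitioner; the key code is: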
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

import java.util.Random;

public class RandomPartition extends Partitioner<Text, IntWritable> {
    Random r = new Random();

    @Override
    public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
        return r.nextInt(numPartitions);
    }
}
Don't forget to register the Partitioner class in the App class as well:
// Register the custom Partitioner class
job.setPartitionerClass(RandomPartition.class);
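Note: both methods address data skew, but the second one, random partitioning, is the better choice. First, its code is simpler; second, the first method appends an extra string to every key, which adds to the volume of data sent over the network, so it is not recommended. In practice, approach data skew from the partitioning side first.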
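4. Configuring Input and Output Formats
The App class can select among several input formats. If none is specified, the framework uses TextInputFormat. Others include SequenceFileInputFormat, KeyValueTextInputFormat (used mostly in the second of two chained jobs, because MapReduce's default output format separates key and value with "\t"), and DBInputFormat. The output format can be chosen in the same way.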
// Read sequence files
job.setInputFormatClass(SequenceFileInputFormat.class);
// Or read key/value text files (a later call replaces an earlier one)
job.setInputFormatClass(KeyValueTextInputFormat.class);
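To see why KeyValueTextInputFormat fits the output of a previous job, recall that it cuts each line at the first separator (a tab by default) and, if there is no separator, treats the whole line as the key with an empty value. The sketch below imitates that rule in plain Java; the class and method names are invented, and it is not the Hadoop implementation.

```java
// Hypothetical imitation of how a "key<TAB>value" line is divided; not Hadoop code.
class KeyValueLineDemo {
    // Returns {key, value}: the line is cut at the first tab only.
    static String[] split(String line) {
        int pos = line.indexOf('\t');
        if (pos < 0) {
            // No separator: the whole line is the key and the value is empty
            return new String[]{line, ""};
        }
        return new String[]{line.substring(0, pos), line.substring(pos + 1)};
    }

    public static void main(String[] args) {
        String[] kv = split("hello\t42");
        System.out.println(kv[0] + " -> " + kv[1]); // hello -> 42
    }
}
```

With this format the Mapper of the second job receives the word as its key and the count as its value, both as Text, and does not have to split the line itself.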
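The focus here is on DBInputFormat, because it plays an important role in ETL between relational databases and big data frameworks, that is, in importing and exporting data.
First come two custom DBWritable classes, one for reading rows from the database and one for writing results back to it.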
package mapreduce.db;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/*
 * A custom DBWritable class implements both the Writable and the DBWritable interface.
 * This class is used for reading rows from the database, i.e. with DBInputFormat.
 */
public class MyDBWritable implements Writable, DBWritable {
    // The two fields correspond to the two columns of the MySQL table
    int id;
    String line;

    // Constructors, getters, setters and toString
    public MyDBWritable(int id, String line) {
        this.id = id;
        this.line = line;
    }

    public MyDBWritable() {
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getLine() {
        return line;
    }

    public void setLine(String line) {
        this.line = line;
    }

    @Override
    public String toString() {
        return "MyDBWritable{" + "id=" + id + ", line='" + line + '\'' + '}';
    }

    // Hadoop serialisation
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(line);
    }

    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        line = in.readUTF();
    }

    // JDBC side: the parameter is a PreparedStatement, so the setXxx methods can be used
    public void write(PreparedStatement ppst) throws SQLException {
        ppst.setInt(1, id);
        ppst.setString(2, line);
    }

    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getInt(1);
        line = rs.getString(2);
    }
}
package mapreduce.db;

/*
 * This DBWritable is used for exporting the processed data to the database
 */
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class MyDBWritable2 implements Writable, DBWritable {
    String word;
    int count;

    public MyDBWritable2(String word, int count) {
        this.word = word;
        this.count = count;
    }

    public MyDBWritable2() {
    }

    public String getWord() {
        return word;
    }

    public void setWord(String word) {
        this.word = word;
    }

    public int getCount() {
        return count;
    }

    public void setCount(int count) {
        this.count = count;
    }

    @Override
    public String toString() {
        return "MyDBWritable2{" + "word='" + word + '\'' + ", count=" + count + '}';
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readInt();
    }

    public void write(PreparedStatement ppst) throws SQLException {
        ppst.setString(1, word);
        ppst.setInt(2, count);
    }

    public void readFields(ResultSet rs) throws SQLException {
        word = rs.getString(1);
        count = rs.getInt(2);
    }
}
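The one rule that write(DataOutput) and readFields(DataInput) must obey is symmetry: fields are read back in exactly the order they were written. This can be checked with nothing but java.io. The sketch below uses a hypothetical class WordCountRecord that mirrors the two methods without depending on Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writable with the same two fields; not Hadoop code.
class WordCountRecord {
    String word;
    int count;

    WordCountRecord() {
    }

    WordCountRecord(String word, int count) {
        this.word = word;
        this.count = count;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    // Must read the fields in the same order as write() wrote them.
    void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readInt();
    }

    // Serialise to a byte array and deserialise into a fresh object.
    static WordCountRecord roundTrip(WordCountRecord record) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            record.write(new DataOutputStream(bytes));
            WordCountRecord copy = new WordCountRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordCountRecord copy = roundTrip(new WordCountRecord("hadoop", 3));
        System.out.println(copy.word + "\t" + copy.count);
    }
}
```

The no-argument constructor matters for the same reason it does in the classes above: the receiving side first creates an empty object and then fills it through readFields.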
App class:
package mapreduce.db;

/*
 * This App reads rows from a relational database, processes them with MapReduce,
 * and exports the result back to the relational database
 */
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

public class DBApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        Job job = Job.getInstance(conf);
        job.setJobName("DBinput");
        job.setJarByClass(DBApp.class);
        job.setMapperClass(DBMapper.class);
        job.setReducerClass(DBReducer.class);
        job.setInputFormatClass(DBInputFormat.class);
        // Input: the row query and the row-count query; each row is loaded into a MyDBWritable
        DBInputFormat.setInput(job, MyDBWritable.class, "select * from test", "select count(*) from test");
        // Output: table wc with 2 columns
        DBOutputFormat.setOutput(job, "wc", 2);
        // The four JDBC connection settings: driver, URL, username and password
        DBConfiguration.configureDB(job.getConfiguration(), "com.mysql.jdbc.Driver", "jdbc:mysql://s201:3306/big14", "root", "root");
        // Key/value types emitted by the reduce phase
        job.setOutputKeyClass(MyDBWritable2.class);
        job.setOutputValueClass(NullWritable.class);
        // Key/value types emitted by the map phase
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Run the job
        job.waitForCompletion(true);
    }
}
Mapper class:
package mapreduce.db;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class DBMapper extends Mapper<LongWritable, MyDBWritable, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, MyDBWritable value, Context context) throws IOException, InterruptedException {
        // Take the text column of the current row from the MyDBWritable
        String line = value.getLine();
        String[] arr = line.split(" ");
        for (String s : arr) {
            context.write(new Text(s), new IntWritable(1));
        }
    }
}
Reducer class:
package mapreduce.db;

/*
 * Note: the output value of this Reducer can be empty, i.e. NullWritable
 */
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class DBReducer extends Reducer<Text, IntWritable, MyDBWritable2, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Wrap the key and the sum in a MyDBWritable2
        MyDBWritable2 mydb = new MyDBWritable2(key.toString(), sum);
        context.write(mydb, NullWritable.get());
    }
}
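5. Secondary Sort
SQL queries often call for a secondary sort: order by one field first, and where that field has equal values, order by a second field. A MapReduce program can implement the same behaviour at a lower level. Keep these points in mind:
1. Define a custom key class that implements the WritableComparable interface and put the secondary sort logic in its compareTo method.
2. Write a grouping comparator (a subclass of WritableComparator) that tells the framework which keys count as the same key for one reduce call, and register it on the job with setGroupingComparatorClass.
3. Override hashCode as well, so that partitioning sends the keys that belong together to the same partition.
The implementation is as follows: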
Composite key class:
package mapreduce.secondsort;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class Compkey implements WritableComparable<Compkey> {
    // The two fields of the composite key
    int year;
    int temp;

    public Compkey(int year, int temp) {
        this.year = year;
        this.temp = temp;
    }

    public Compkey() {
    }

    public int getYear() {
        return year;
    }

    public void setYear(int year) {
        this.year = year;
    }

    public int getTemp() {
        return temp;
    }

    public void setTemp(int temp) {
        this.temp = temp;
    }

    public int compareTo(Compkey o) {
        // Secondary sort logic: year ascending, then temperature descending
        if (this.getYear() == o.getYear()) {
            return Integer.compare(o.getTemp(), this.getTemp());
        } else {
            return Integer.compare(this.getYear(), o.getYear());
        }
    }

    // Partition by year only, so all records of one year reach the same reducer
    @Override
    public int hashCode() {
        return year;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(temp);
    }

    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        temp = in.readInt();
    }
}
Grouping comparator class:
package mapreduce.secondsort;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class MyGroupingComparator extends WritableComparator {
    // The constructor passes the key class and true so that key instances are created for comparison
    public MyGroupingComparator() {
        super(Compkey.class, true);
    }

    // Keys with the same year belong to the same group
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((Compkey) a).getYear() - ((Compkey) b).getYear();
    }
}
App class:
package mapreduce.secondsort;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/*
 * This app sorts the weather data by year first and by temperature second,
 * with the temperature in descending order
 */
public class SecondApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);

        Job job = Job.getInstance(conf);
        job.setJobName("secondsort");
        job.setJarByClass(SecondApp.class);
        job.setMapperClass(SecondMapper.class);
        job.setReducerClass(SecondReducer.class);
        // Register the grouping comparator so that one reduce call covers one year
        job.setGroupingComparatorClass(MyGroupingComparator.class);
        // Key/value classes emitted by the Mapper
        job.setMapOutputKeyClass(Compkey.class);
        job.setMapOutputValueClass(NullWritable.class);
        // Key/value classes emitted by the Reducer
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths
        Path outPath = new Path("file:///d:/out");
        FileInputFormat.addInputPath(job, new Path("file:///d:/Temp"));
        FileOutputFormat.setOutputPath(job, outPath);
        // Delete the output directory if it already exists, otherwise the job fails
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        // Run the job and wait for it to finish
        job.waitForCompletion(true);
    }
}
Mapper class:
package mapreduce.secondsort;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class SecondMapper extends Mapper<LongWritable, Text, Compkey, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        int year = Integer.parseInt(line.substring(15, 19));
        int temp = Integer.parseInt(line.substring(87, 92));
        // Drop dirty data: 9999 marks a missing temperature reading
        if (temp != 9999) {
            context.write(new Compkey(year, temp), NullWritable.get());
        }
    }
}
Reducer class:
package mapreduce.secondsort;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class SecondReducer extends Reducer<Compkey, NullWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(Compkey key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // No aggregation is needed: the framework refreshes the key object as the values
        // are iterated, so every record of the group is written out in sorted order
        for (NullWritable value : values) {
            context.write(new IntWritable(key.getYear()), new IntWritable(key.getTemp()));
        }
    }
}
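The interplay between the sort order and the grouping rule is easier to see on a handful of records. The following plain-Java sketch uses a hypothetical class SecondarySortDemo and invented sample data; it sorts (year, temp) pairs the way compareTo does and then groups by year only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory imitation of sorting plus grouping; not Hadoop code.
class SecondarySortDemo {
    // Each record is {year, temp}: sort by year ascending, then by temp descending.
    static List<int[]> sort(List<int[]> records) {
        List<int[]> sorted = new ArrayList<>(records);
        sorted.sort((a, b) -> a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(b[1], a[1]));
        return sorted;
    }

    // Group by year only: the first record of each group carries that year's maximum.
    static Map<Integer, Integer> maxPerYear(List<int[]> sorted) {
        Map<Integer, Integer> max = new LinkedHashMap<>();
        for (int[] record : sorted) {
            max.putIfAbsent(record[0], record[1]);
        }
        return max;
    }

    public static void main(String[] args) {
        List<int[]> records = Arrays.asList(
                new int[]{1901, 10}, new int[]{1900, 5}, new int[]{1901, 30}, new int[]{1900, 20});
        for (int[] record : sort(records)) {
            System.out.println(record[0] + "\t" + record[1]);
        }
        System.out.println(maxPerYear(sort(records))); // {1900=20, 1901=30}
    }
}
```

This also shows a common use of the pattern: once the records of a group arrive in descending order, the maximum is simply the first one.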
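6. Implementing Join with MapReduce
Joins are among the most common operations in analysts' SQL: inner join, left outer join, right outer join, full outer join and so on. Implementing them at the MapReduce level makes it clearer how to optimise such SQL statements later and improve query efficiency. There are two ways to implement a join: the map-side join and the reduce-side join.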
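Map-side join
The map-side join is easy to understand. Of the two tables at hand, one is treated as the small table and the other as the large table. Before the large table is read, the small table is loaded into memory, for example into a map data structure; each row of the large table is then concatenated with the matching row held in memory as it is read. The implementation is as follows: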
App class:
package mapreduce.mapjoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/*
 * This App implements a map-side join of two tables: orders and customers
 */
public class MapJoinApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        // Pass the path of the small table to the Mapper through the configuration
        conf.set("small.file.name", "d:/mapjoin/customers.txt");
        FileSystem fs = FileSystem.get(conf);

        Job job = Job.getInstance(conf);
        job.setJobName("mapjoin");
        job.setJarByClass(MapJoinApp.class);
        job.setMapperClass(MapJoinMapper.class);
        // Map-only job: without reduce tasks there is no shuffle at all
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Input and output paths
        Path outPath = new Path("file:///d:/out");
        FileInputFormat.addInputPath(job, new Path("d:/mapjoin/orders.txt"));
        FileOutputFormat.setOutputPath(job, outPath);
        // Delete the output directory if it already exists, otherwise the job fails
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        // Run the job and wait for it to finish
        job.waitForCompletion(true);
    }
}
Mapper class:
package mapreduce.mapjoin;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    // Holds the small table: customer id -> full customer line
    HashMap<String, String> map;

    // Load the small table into memory in setup; its path comes from the job configuration
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        String smallTable = context.getConfiguration().get("small.file.name");
        map = new HashMap<String, String>();
        // Read the small table line by line (a local file, since the job runs in local mode)
        try (BufferedReader br = new BufferedReader(new FileReader(smallTable))) {
            String line;
            while ((line = br.readLine()) != null) {
                String cid = line.split("\t")[0];
                map.put(cid, line);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // The customer id is the fourth field of an order line
        String cid = line.split("\t")[3];
        // Inner join: orders without a matching customer are skipped
        if (map.containsKey(cid)) {
            context.write(new Text(line + "\t" + map.get(cid)), NullWritable.get());
        }
    }
}
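Stripped of the Hadoop plumbing, a map-side join is a hash lookup. The plain-Java sketch below uses a hypothetical class MapJoinDemo and invented sample rows; it assumes, as the Mapper above does, that the customer id is the first field of a customer line and the fourth field of an order line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory imitation of a map-side (hash) join; not Hadoop code.
class MapJoinDemo {
    static List<String> join(List<String> customers, List<String> orders) {
        // Build phase: index the small table by customer id
        Map<String, String> byCid = new HashMap<>();
        for (String customer : customers) {
            byCid.put(customer.split("\t")[0], customer);
        }
        // Probe phase: stream the large table and look up each order's customer
        List<String> joined = new ArrayList<>();
        for (String order : orders) {
            String customer = byCid.get(order.split("\t")[3]);
            if (customer != null) {          // inner join: unmatched orders are dropped
                joined.add(order + "\t" + customer);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList("1\ttom", "2\tjerry");
        List<String> orders = Arrays.asList("10\tno001\t12.5\t1", "11\tno002\t7.0\t3");
        for (String row : join(customers, orders)) {
            System.out.println(row);
        }
    }
}
```

The memory cost is that of the small table only, which is exactly why the approach stops working when both tables are large.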
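Reduce-side join
The case above assumes one large table and one small table. What if both tables are large? Treating one of them as the small table no longer works, because loading it would likely consume a huge amount of memory. In that situation the reduce-side join is the usual choice. The idea: the field whose values match across the two tables becomes the key that the reduce phase groups on, and the values of each key, i.e. the full rows, are concatenated. That raises a question: how do we make sure that one table's rows always come first and the other table's rows always come after? The answer is to tag each record with a numeric flag for its source table and apply a secondary sort, which fixes the order of the two tables. The implementation is as follows:
The five steps of a reduce-side join:
1. Assign a different flag to each record depending on the name of the file it comes from.
2. Implement WritableComparable as a composite key (Compkey) that serialises both fields.
3. Write a WritableComparator so that Compkeys with the same id end up in the same reduce call.
4. Implement compareTo so that the two flags are ordered and one table's records always come before the other's.
5. Override hashCode.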
Composite key class:
package mapreduce.reducejoin;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class Compkey implements WritableComparable<Compkey> {
    int cid;
    int flag;

    public int compareTo(Compkey o) {
        // Same id: higher flag first, so the customer (flag 1) precedes its orders (flag 0)
        if (cid == o.cid) {
            return Integer.compare(o.flag, flag);
        } else {
            return Integer.compare(cid, o.cid);
        }
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(cid);
        out.writeInt(flag);
    }

    public void readFields(DataInput in) throws IOException {
        cid = in.readInt();
        flag = in.readInt();
    }

    public int getCid() {
        return cid;
    }

    public void setCid(int cid) {
        this.cid = cid;
    }

    public int getFlag() {
        return flag;
    }

    public void setFlag(int flag) {
        this.flag = flag;
    }

    public Compkey(int cid, int flag) {
        this.cid = cid;
        this.flag = flag;
    }

    public Compkey() {
    }

    @Override
    public String toString() {
        return "CompKey{" + "cid=" + cid + ", flag=" + flag + '}';
    }

    // Partition by id only, so a customer and its orders reach the same reducer
    @Override
    public int hashCode() {
        return cid;
    }
}
Grouping comparator:
package mapreduce.reducejoin;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class MyGroupingComparator extends WritableComparator {
    public MyGroupingComparator() {
        super(Compkey.class, true);
    }

    // Keys with the same id belong to the same group, whatever their flag
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((Compkey) a).cid - ((Compkey) b).cid;
    }
}
App class:
package mapreduce.reducejoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/*
 * This App implements a reduce-side join with MapReduce
 */
public class ReduceJoinApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        Job job = Job.getInstance(conf);
        // Name the job
        job.setJobName("reduce join");
        // Class that contains the entry point
        job.setJarByClass(ReduceJoinApp.class);
        // Mapper, Reducer and grouping comparator classes
        job.setMapperClass(ReduceJoinMapper.class);
        job.setReducerClass(ReduceJoinReducer.class);
        job.setGroupingComparatorClass(MyGroupingComparator.class);
        // Key/value types emitted by the map phase
        job.setMapOutputKeyClass(Compkey.class);
        job.setMapOutputValueClass(Text.class);
        // Key/value types emitted by the reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Input and output paths; the input directory holds both tables
        Path outPath = new Path("d:/out");
        FileInputFormat.addInputPath(job, new Path("d:/reducejoin"));
        FileOutputFormat.setOutputPath(job, outPath);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        job.setNumReduceTasks(3);
        // Run the job
        job.waitForCompletion(true);
    }
}
Mapper class:
package mapreduce.reducejoin;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class ReduceJoinMapper extends Mapper<LongWritable, Text, Compkey, Text> {
    String path;

    // In setup, get the name of the file that this input split belongs to
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        path = ((FileSplit) context.getInputSplit()).getPath().toString();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] arr = value.toString().split("\t");
        Compkey ck;
        if (path.contains("customers")) {
            // Customer line: the id is the first field, flag 1
            int cid = Integer.parseInt(arr[0]);
            ck = new Compkey(cid, 1);
        } else {
            // Order line: the customer id is the fourth field, flag 0
            int cid = Integer.parseInt(arr[3]);
            ck = new Compkey(cid, 0);
        }
        context.write(ck, value);
    }
}
Reducer class:
package mapreduce.reducejoin;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.Iterator;

public class ReduceJoinReducer extends Reducer<Compkey, Text, Text, NullWritable> {
    @Override
    protected void reduce(Compkey key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Inner join: if the group does not start with a customer record (flag 1),
        // this id has orders but no customer, so skip it
        if (key.getFlag() != 1) {
            return;
        }
        Iterator<Text> it = values.iterator();
        // Thanks to the sort order, the first record is the customer line
        String[] cusArr = it.next().toString().split("\t");
        while (it.hasNext()) {
            // Every remaining record is an order line
            String[] orderArr = it.next().toString().split("\t");
            // Concatenate: customer id, customer name, then order fields 1 and 2
            String out = cusArr[0] + "\t" + cusArr[1] + "\t" + orderArr[1] + "\t" + orderArr[2];
            context.write(new Text(out), NullWritable.get());
        }
    }
}
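The tagging and sorting trick can be reproduced in memory. In this plain-Java sketch (hypothetical class ReduceJoinDemo, invented sample rows), every record is tagged with its customer id and a source flag, the tagged records are sorted by id and then by flag in descending order, and a single pass over the sorted list joins each order to the customer that precedes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory imitation of a reduce-side join; not Hadoop code.
class ReduceJoinDemo {
    static final class Tagged {
        final int cid;
        final int flag;   // 1 = customer line, 0 = order line
        final String line;

        Tagged(int cid, int flag, String line) {
            this.cid = cid;
            this.flag = flag;
            this.line = line;
        }
    }

    static List<String> join(List<String> customers, List<String> orders) {
        List<Tagged> all = new ArrayList<>();
        for (String c : customers) {
            all.add(new Tagged(Integer.parseInt(c.split("\t")[0]), 1, c));
        }
        for (String o : orders) {
            all.add(new Tagged(Integer.parseInt(o.split("\t")[3]), 0, o));
        }
        // Same ordering as the composite key: id ascending, then flag descending
        all.sort((a, b) -> a.cid != b.cid ? Integer.compare(a.cid, b.cid) : Integer.compare(b.flag, a.flag));

        List<String> joined = new ArrayList<>();
        String[] customer = null;
        int customerCid = 0;
        for (Tagged t : all) {
            if (t.flag == 1) {
                customer = t.line.split("\t");   // remember the customer that opens the group
                customerCid = t.cid;
            } else if (customer != null && customerCid == t.cid) {
                String[] order = t.line.split("\t");
                joined.add(customer[0] + "\t" + customer[1] + "\t" + order[1] + "\t" + order[2]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList("1\ttom", "2\tjerry");
        List<String> orders = Arrays.asList("10\tno001\t12.5\t1", "11\tno002\t7.0\t1", "12\tno003\t3.0\t3");
        for (String row : join(customers, orders)) {
            System.out.println(row);
        }
    }
}
```

Only one customer row has to be held in memory at a time, which is the advantage over the map-side join when neither table is small.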
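7. Implementing the TopN Algorithm
Aggregating by some key and then listing the values in descending order to find the top N results is a very common requirement in day-to-day data analysis. At big data scale it clearly takes two chained jobs, and the second job needs a composite key with a custom compareTo. Doing all the sorting in the reduce phase of the second job, however, would push a large amount of data across the network. A better solution is to do the selection on the Mapper side: each map task keeps only its own top N in a sorted data structure such as a TreeSet and emits them at the end, so almost nothing is left to shuffle and the computation becomes much faster. The code is as follows: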
Composite key class:
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class CompKey implements WritableComparable<CompKey> {
    String pass;
    int count;

    public int compareTo(CompKey o) {
        // Count descending; ties are broken by the string in ascending order
        if (o.count == count) {
            return pass.compareTo(o.pass);
        }
        return Integer.compare(o.count, count);
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(pass);
        out.writeInt(count);
    }

    public void readFields(DataInput in) throws IOException {
        pass = in.readUTF();
        count = in.readInt();
    }

    public CompKey(String pass, int count) {
        this.pass = pass;
        this.count = count;
    }

    public CompKey() {
    }

    public String getPass() {
        return pass;
    }

    public void setPass(String pass) {
        this.pass = pass;
    }

    public int getCount() {
        return count;
    }

    public void setCount(int count) {
        this.count = count;
    }
}
App classes:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopApp {
    public static void main(String[] args) throws Exception {
        // Initialise the job
        Configuration conf = new Configuration();
        // Switch to local mode; do this before the file system is initialised
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        Job job = Job.getInstance(conf);
        // Name the job
        job.setJobName("WORDCOUNT");
        // Class that contains the entry point
        job.setJarByClass(TopApp.class);
        // Mapper, Reducer and Combiner classes
        job.setMapperClass(TopMapper.class);
        job.setReducerClass(TopReducer.class);
        job.setCombinerClass(TopReducer.class);
        // Key/value types emitted by the map phase
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Key/value types emitted by the reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths
        Path outPath = new Path("D:/wc/out");
        FileInputFormat.addInputPath(job, new Path("D:/wc/duowan_user.txt"));
        FileOutputFormat.setOutputPath(job, outPath);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        job.setNumReduceTasks(4);
        // Run the job
        boolean b = job.waitForCompletion(true);
    }
}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopApp2 {
    public static void main(String[] args) throws Exception {
        // Initialise the job
        Configuration conf = new Configuration();
        // Switch to local mode; do this before the file system is initialised
        conf.set("fs.defaultFS", "file:///");
        // N is taken from the first command-line argument
        conf.set("topN", args[0]);
        FileSystem fs = FileSystem.get(conf);
        Job job = Job.getInstance(conf);
        // Name the job
        job.setJobName("WORDCOUNT");
        // Class that contains the entry point
        job.setJarByClass(TopApp2.class);
        // Mapper and Reducer classes
        job.setMapperClass(TopMapper2.class);
        job.setReducerClass(TopReducer2.class);
        // Key/value types emitted by the map phase
        job.setMapOutputKeyClass(CompKey.class);
        job.setMapOutputValueClass(NullWritable.class);
        // Key/value types emitted by the reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // The first job wrote "key<TAB>value" lines, so read them as key/value pairs
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // Input and output paths; the input is the first job's output
        Path outPath = new Path("D:/wc/out2");
        FileInputFormat.addInputPath(job, new Path("D:/wc/out"));
        FileOutputFormat.setOutputPath(job, outPath);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        job.setNumReduceTasks(1);
        // Run the job
        boolean b = job.waitForCompletion(true);
    }
}
First Mapper class:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class TopMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] arr = line.split("\t");
        // Drop dirty data: lines with fewer than three fields or an empty third field
        if (arr.length >= 3 && !arr[2].equals("")) {
            context.write(new Text(arr[2]), new IntWritable(1));
        }
    }
}
Second Mapper class:
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.TreeSet;

public class TopMapper2 extends Mapper<Text, Text, CompKey, NullWritable> {
    TreeSet<CompKey> ts;
    int topN;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        ts = new TreeSet<CompKey>();
        topN = Integer.parseInt(context.getConfiguration().get("topN"));
    }

    @Override
    protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        String pass = key.toString();
        int count = Integer.parseInt(value.toString());
        CompKey ck = new CompKey(pass, count);
        ts.add(ck);
        // Keep at most topN entries: the last element is the one with the lowest count
        if (ts.size() > topN) {
            ts.pollLast();
        }
    }

    // Emit this map task's top N once all of its input has been seen
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (CompKey t : ts) {
            context.write(t, NullWritable.get());
        }
    }
}
Reducer class (only one is listed; it serves the first job, and the TopReducer2 referenced by TopApp2 is not shown here):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class TopReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
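The bounded TreeSet is the heart of the Mapper-side selection, and it can be exercised on its own. The sketch below is plain Java; the class TopNDemo, its nested Entry class and the sample counts are made up for illustration. It keeps the N largest counts while scanning, evicting the smallest whenever the set grows beyond N.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical in-memory imitation of the bounded TreeSet used for TopN; not Hadoop code.
class TopNDemo {
    static final class Entry implements Comparable<Entry> {
        final String pass;
        final int count;

        Entry(String pass, int count) {
            this.pass = pass;
            this.count = count;
        }

        // Count descending; ties are broken by the string in ascending order
        @Override
        public int compareTo(Entry o) {
            if (o.count == count) {
                return pass.compareTo(o.pass);
            }
            return Integer.compare(o.count, count);
        }
    }

    static List<String> topN(Map<String, Integer> counts, int n) {
        TreeSet<Entry> ts = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            ts.add(new Entry(e.getKey(), e.getValue()));
            if (ts.size() > n) {
                ts.pollLast();   // evict the entry with the lowest count
            }
        }
        List<String> result = new ArrayList<>();
        for (Entry e : ts) {
            result.add(e.pass + "\t" + e.count);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("abc", 1);
        counts.put("qwerty", 30);
        counts.put("123456", 50);
        counts.put("password", 30);
        for (String row : topN(counts, 3)) {
            System.out.println(row);
        }
    }
}
```

Memory use stays proportional to N rather than to the number of distinct keys, which is what makes the approach suitable for a map task.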
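8. Notes on the Execution Flow Chart of a MapReduce Job
The chart above covers essentially the whole MapReduce process, from choosing the file format to writing the final output. The takeaways are:
1. Choose the InputFormat to suit the situation: for the second of two chained jobs KeyValueTextInputFormat is usually the best fit, and for database ETL work use DBInputFormat.
2. When the input is split, the framework first checks whether the file can be split at all. Uncompressed files always can; among the compressed formats, only bzip2 and indexed LZO are splittable. Avoid feeding the job files in a non-splittable format: a single map task then has to read the whole file, much of it from other nodes, which destroys data locality and causes a lot of unnecessary network I/O.
3. Partitioning happens before sorting. Hash partitioning is the default; if you run into data skew, define your own partitioning algorithm.
4. A Combiner can greatly reduce the amount of data in the later shuffle and is recommended whenever the aggregation allows partial results to be merged.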
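5. The shuffle that feeds the reduce phase is where data travels across the network, because records stored on different nodes have to be brought together on one node. As a rough rule of thumb, around 80% of a MapReduce job's time is spent on network I/O.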