Understanding Hadoop in Depth: Sorting
MapReduce sorts by key by default: when results are written out, keys appear in ascending numeric or dictionary order. In a simple wordcount, for example, the words in the left-hand column of the output come out in dictionary order. Below we focus on two sorting problems that come up often in interviews: total order sort and secondary sort.
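For instance, a wordcount over some hypothetical input would emit its results with the keys already sorted:

    apple   3
    banana  1
    cherry  2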
I. Total order sort
There are generally a few ways to achieve a total order sort:
1. Use a single partition. This is the simplest approach: just set the number of reduce tasks to 1 in the driver. On large datasets, however, it is extremely inefficient, because a single machine has to process all of the output, throwing away the parallelism that MapReduce provides.
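The driver-side change is a single line; nothing else about the job needs to differ:

    job.setNumReduceTasks(1); // one reducer = one partition, so its output file is globally sorted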
2. Write a custom partition function and set the partition boundaries yourself. The crux of this method is how the partitions are divided: if the data is unevenly distributed or the partition function is poorly chosen, the result is data skew. See the example below, which finds the highest temperature recorded in each year.
Temperature data (each line of the actual input file holds a single "year temperature" pair, which is what the mapper below assumes; the sample is shown flattened here):
2004 49 1981 -22 1981 -31 1965 -47 2027 -2 1964 6 2030 38 2016 -33 1963 13 2000 21 2019 0 2049 43 2039 8 1989 -18 2017 49 1952 -47 2016 -28 1991 20 1967 -39 2022 -47 2041 41 2039 -38 2021 33 1969 38 1981 0 1960 -26 2023 -12 1969 12 1996 -31 1954 -36 2026 34 2013 -4 1969 37 1990 -22 2007 -31 1987 -8 1972 -30 2019 -17 2042 -22 2011 21 2033 -25 2013 10 2047 30 2008 -2 2047 -5 1994 14 1960 7 2037 44 1990 -41 2047 32 2048 -22 1977 -27 2049 35 2023 2 1952 -44 1979 -5 1996 47 2033 8 2006 3 2030 32 1967 43 1980 -6 2001 39 2049 -31 2028 -16 2029 31 1962 -21 2043 -7 2040 34 2001 9 1977 -21 2047 1 2022 30 2002 12 1956 38 2009 7 2049 11 1981 18 2014 -29 1967 -15 2019 2 1975 25 1965 21 2013 -36 2024 -44 1959 10 1992 4 1997 15 2042 17 2013 -14 1993 -21 2027 19 2016 -44 1989 -47 1999 -6 1993 -35 1953 -21 1952 12 1969 -45 2036 10 1950 29 2022 8 1985 -45 2044 -48 1981 -12 2033 -42 1973 -49 2011 27 1958 -26 2028 35 2037 41 1955 -36 2001 -11 1965 23 1970 -14 2015 -2 1969 -19 1997 3 2016 -38 2045 9 1974 6 1956 -39 2012 1 2022 -28 1991 -31 1974 -40 1998 43 2007 12 2049 9 2034 -18 1956 48 1974 40 2009 -24 2030 -44 1957 27 1979 -23 2034 29 2024 -34 2034 -10 2007 42 2000 33 1990 -44 2048 -48 1967 -30 1969 12 2030 26 2023 -36 2029 22 2044 -2 2043 -47 2040 -18 1990 -3 1996 -16 1974 -20 2023 -11 1990 -16 1980 13 2013 -8 2001 41 2015 -30 1974 28 2031 13 1991 -33 1985 -6 1979 -34 2041 12 1957 -46 2014 25 1969 18 1958 -39 1955 -46 2031 39 2032 11 1991 38 2035 -43 2005 -1 2000 2 2027 -28 1984 -8 1985 -47 2045 -6 1987 -21 2004 35 1968 -47 1968 -19 1995 -47 1990 46 1987 18 2012 29 1987 -12 2048 -8 1987 26 2010 18 1959 -20 1978 8 1997 38 1963 24 1991 8 2005 -34 2019 -4 2042 43 1951 6 1956 -32 1952 18 2003 -15 1979 29 2026 35 2032 -26 2044 -25 2039 -36 2021 49 2037 6 2000 -22 2027 34 2024 38 2019 15 1954 -27 2016 49 2018 -43 2048 23 1978 9 1977 5 2047 -30 2028 -12 1991 -25 2022 -36 1974 -2 2038 25 2014 10 2000 -7 2033 16 2020 5 1985 7 1951 -1 1958 -8 1963 -3 1972 10 1986 9 1961 3 1972 -20 1979 -39 1958 44 2027 -48 2007 -50 2025 33 1970 22 2044 27 2043 -48 1950 1 2023 31 2041 -39 2040 43 2025 21 2038 39 1998 16 1987 -50 1967 -40 2021 -27 1961 6 1981 22 1990 7 1993 -49 2001 -5 2003 21 1990 47 1986 -19 2031 37 1987 -14 2019 16 2008 45 2044 1 1977 5 1952 10 2047 5 2044 21 2002 29 1992 28 1980 -2 1952 -47 2008 15 2017 17 1970 1 2045 -37 2016 5 1951 -28 1978 5 1954 9 1966 18 1957 45 1998 -26 1989 0 1964 10 2036 -44 2037 -22 1965 12 2035 40 1994 7 2024 7 1961 4 2007 34 1980 -36 1950 -39 1987 24 1983 -4 2007 46 2009 -5 1974 43 2026 26 1966 -21 2006 -21 1977 -3 1979 -31 2021 33 2040 39 2020 47 1953 -42 1955 2 2017 0 1973 31 1955 4 1973 -7 2027 28 1968 -17 2029 -3 2021 13 1991 9 2030 19 1952 -35 1987 14 1954 -18 2027 -23 1989 12 1983 13 1966 -45 2039 33 2014 34 2012 -30 1953 -7 2020 -21 1987 22 2041 45 2046 0 2017 26 1951 9 2000 -4 1973 27 1972 -3 2036 -14 1974 32 1987 -8 1993 3 1969 17 2011 -11 2038 -50 2040 -8 1950 -22 2036 13 2025 29 1986 27 2038 41 1971 37 1970 45 2045 -21 2036 41 1956 1 2042 -48 1955 -28 1967 -34 1999 -42 1952 -9 1962 -15 1974 -19 1959 19 1965 -42 1962 41 2003 -12 2029 14 1969 26 1992 -4 1959 8 1962 -18 2000 8 2025 -20 2048 -15 1996 25 2017 -23 1992 -10 2001 30 1960 45 2034 33 1983 -47 2046 19 2041 -4 1978 -6 1967 -49 1993 8 1987 -11 2009 3 1990 40 1972 -6 2029 -47 1990 3 2036 4 1981 22 2019 37 1980 -47 2003 -42 1965 -6 2007 45 2040 -45 1984 24 2048 -15 1984 -16 1992 -39 2040 -33 1984 -24 2046 28 2023 -3 1956 46 1969 0 1983 -4 2030 -50 2004 -36 1958 16 2025 -22 1957 -6 2001 -24 2014 -49 1965 16 2043 42 1966 -10 
1971 -13 1996 48 1976 11 2026 -43 1982 2 1965 -50 2038 40 2024 -32 1988 3 2004 -45 2039 8 2029 -30 1974 -11 2033 29 1968 -2 2040 -8 1989 -11 1999 7 2001 37 2001 -44 1979 -30 2048 7 1998 -21 2005 49 1975 44 2031 31 1982 12 1987 35 2004 -33 2000 27 2008 34 1970 -26 2047 0 1974 35 1977 -45 1976 19 1956 48 2025 -37 1991 0 2041 -40 1976 38 2016 36 2024 6 2021 14 2005 27 1951 -38 2046 16 1976 26 2044 -44 1989 -47 2025 26 2045 43 2045 -23 2004 30 2044 46 1962 -20 1954 7 1975 -39 1967 18 2038 4 1956 15 2010 -14 2032 -6 1999 19 2024 7 1993 -23 1961 -43 2007 23 1998 9 2027 -29 1950 29 2010 -47 1953 43 2033 -19 1977 28 2013 -36 2001 43 2008 46 2004 19 1985 6 2043 3 2014 -21 1992 7 1990 8 2020 44 1957 -40 2030 5 1996 16 2018 -5 1989 -14 2016 -11 1988 -18 2012 -3 1998 -12 1979 -41 2043 1 1978 -12 1959 -29 2048 -26 1989 -31 2026 33 1960 32 1978 14 2003 36 2012 15 2036 34 2040 -49 1986 7 1982 19 1959 42 2041 23 2037 20 2020 -24 1977 -27 2039 18 2046 2 2017 -23 2012 30 1962 28 1985 42 2023 15 2030 -30 1983 28 1967 26 1990 -11 1968 -50 2038 -11 1995 34 2005 -43 2011 5 1978 9 1952 -48 1955 27 1958 -21 2020 -36 1985 -23 1991 10 1982 -17 1999 3 1999 -25 2005 -11 2048 -14 1985 -18 2006 -5 1970 -21 2026 -26 1956 -20 2043 -50 1982 -24 1998 8 2034 28 1966 -10 2045 5 1968 -49 2001 48 2026 -9 2005 49 2036 39 2027 -45 1972 -24 2009 -49 1961 38 1991 36 1975 37 1978 12 2003 -45 2021 -46 1962 -8 1972 -8 1961 39 2009 23 1995 30 1996 -19 1983 45 1952 19 1974 -24 1992 33 1981 -1 1981 -32 1984 0 2049 -41 2030 13 1993 -27 1980 -45 1964 -10 2013 39 1975 24 1972 43 1977 -33 1962 -44 2016 -22 2029 47 1999 41 2030 -17 2023 36 2018 32 2025 20 1966 14 1986 29 2036 -20 2022 -36 2027 -46 1994 -8 1992 34 2017 1 2021 32 1966 28 1987 -22 1996 26 1991 48 1993 4 1973 -28 1981 -16 2011 45 1963 -14 1986 -50 1984 -26 1980 30 2024 42 1979 31 2030 3 2035 17 2036 30 2017 -43 1997 9 2004 -25 1999 40 1993 16 1965 -42 2043 24 2017 29 2034 -39 1952 -49 2023 26 1999 -31 1986 23 1962 -10 1960 22 2036 -30 2044 38 2014 -50 1986 0 2024 -40 1962 -15 1950 11 2019 30 1980 -16 1992 -18 1994 -40 1989 33 1999 23 1999 -38 2021 -38 2033 17 1995 -2 2034 -9 2017 -36 1956 -41 1961 1 2020 46 1991 -17 2026 2 2004 9 1976 -7 1956 -4 1981 41 2014 0 1975 -41 2005 47 1966 -47 1968 -27 1953 48 2028 32 1963 40 1982 34 2031 27 2008 1 2037 10 2000 -1 2038 -4 2044 -12 1960 -4 2014 10 2038 -42 1964 -48 1994 -47 1953 -30 1987 -24 2038 5 2027 43 1991 7 2015 21 2038 -2 1999 28 2026 -50 1986 25 2041 -24 2029 -1 2008 18 1952 -41 1969 -50 1973 6 1956 -20 1966 -21 1967 44 1967 39 2035 16 1973 -45 2035 38 1958 22 2000 -6 2004 16 2004 16 2037 -38 2028 -47 1957 -41 1985 41 2028 -3 2014 -32 1980 -14 1960 13 2012 10 1960 -27 1983 -6 1953 8 1954 -42 1979 43 1992 -48 1976 19 1964 -11 1970 -14 2042 -10 1990 -36 1987 -8 2023 31 1959 -12 2008 -40 2033 7 2012 46 2002 -3 1992 -35 2044 17 2010 14 2018 -35 1961 26 2004 -24 2045 33 1965 -9 1970 -16 1977 40 2030 -42 2046 -30 1963 36 2019 -47 2020 -12 2026 -27 1994 21 1951 27 1999 -10 1990 36 2003 -8 1984 31 2015 -26 2015 14 1981 -20 1971 -47 2033 -4 1976 -29 2037 25 2013 33 2011 1 2000 -27 2037 31 1960 8 2048 -26 2037 -8 2039 42 1986 -38 2038 13 1984 -44 2049 -43 2012 3 1962 -39 1959 3 1979 -3 1996 -1 1983 27 1950 -43 1957 36 1951 -28 2010 44 2045 -22 2023 0 2038 37 2011 -30 2009 4 1952 47 1965 -35 2005 -35 1954 -9 2040 14 1987 -24 1978 -15 2009 22 1964 48 2003 -38 1969 -20 1983 -47 2030 13 1990 -45 2013 42 1988 -26 2017 9 2041 -43 1964 -20 2005 30 2024 25 2043 26 1993 27 2018 -41 2008 -14 2013 16 2028 44 1967 29 1973 -5 2027 -38 1954 -12 1963 -21 
2008 -3 2049 -14 2022 -34 1976 -39 1976 13 2007 30 2032 -15 2007 -7 2028 -37 2012 29 2029 -7 2002 19 2046 -1 1979 0 2008 -17 1980 42 1986 28 1957 -5 1966 48 1994 43 2047 23 2024 -37 1974 -36 2022 -29 2040 -21 2004 12 1978 40 1982 -22 1984 -8 2030 6 1968 -3 1965 32 1998 -15 2039 10 2033 36 1977 36 2045 43 2045 -17 2021 38 1969 -43 2021 -7 2018 10 2008 40 2012 31 2011 28 1999 -36 1985 -18 2008 4 2040 -46 1954 33 2035 -28 1980 -3 2038 20 1959 29 1979 13 2006 8 2029 22 1962 -44 1978 37 1993 -3 1988 23 1991 39 2013 8 1955 43 1973 0 1976 -3 1963 3 2031 -15 2003 31 2002 16 1981 -44 1959 19 2023 -34 2039 4 1994 -21 1951 36 1997 11 2013 13 1950 32 2020 -12 2016 -22 2009 -38 2031 13 1986 -43 1959 28 2049 10 1954 -45 2018 -1 2008 48 2034 -41 1982 -2 1972 -11 2045 -34 1958 10 1997 31 2013 -13 2025 -19 2038 -32 2041 -21 2013 0 2034 3 2036 -23 2008 -22 2034 3 2042 41 2002 1 2043 -2 1950 19 2041 21 2005 -16 2030 -36 2001 45 1964 33 2027 -25 2046 -5 2044 -42 1965 -37 2004 22 2029 46 1966 7 2008 -48 2016 -22 2033 -28 1999 -33 1987 11 1995 18 1969 -13 2023 9 2018 1 2015 39 2017 31 1975 44 1991 32 2045 10 2046 -35 1952 40 1950 -38 1996 -39 2031 14 2037 -48 2002 41
Map side
package com.heima.hdfs.mr3;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * The mapper simply splits each line of the temperature text. The input key
 * is the byte offset and the input value is the line of text; the output key
 * is the year and the output value is the temperature.
 */
public class MaxTempMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] arr = value.toString().split(" ");
        context.write(new IntWritable(Integer.parseInt(arr[0])),
                      new IntWritable(Integer.parseInt(arr[1])));
    }
}
Reduce side
package com.heima.hdfs.mr3;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Note: records with the same key go to the same partition, and all data in
 * one partition is processed by the same reducer.
 */
public class MaxTempReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable iw : values) {
            max = Math.max(max, iw.get());
        }
        context.write(key, new IntWritable(max));
    }
}
Driver
package com.heima.hdfs.mr3;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTempApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf);
        job.setJobName("MaxTempApp");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setNumReduceTasks(3);
        // custom partitioner that assigns year ranges to reducers
        job.setPartitionerClass(YearPartitioner.class);

        job.setJarByClass(MaxTempApp.class);
        job.setMapperClass(MaxTempMapper.class);
        job.setReducerClass(MaxTempReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        job.waitForCompletion(true);
    }
}
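The driver registers a YearPartitioner that is not shown for this variant (the one shown later in this post is written against the composite key, not IntWritable). A minimal sketch of what it might look like, assuming, as the sample data suggests, that years fall between 1950 and 2049 and should be split into contiguous equal-width ranges so that concatenating the reducer outputs yields a globally sorted result:

package com.heima.hdfs.mr3;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical range partitioner: partition i receives strictly smaller years
// than partition i+1, which is what makes the concatenated outputs globally
// sorted. The year bounds are assumptions based on the sample data.
public class YearPartitioner extends Partitioner<IntWritable, IntWritable> {
    private static final int MIN_YEAR = 1950;
    private static final int MAX_YEAR = 2049;

    @Override
    public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
        int span = (MAX_YEAR - MIN_YEAR + 1 + numPartitions - 1) / numPartitions; // ceiling division
        int p = (key.get() - MIN_YEAR) / span;
        return Math.min(Math.max(p, 0), numPartitions - 1); // clamp out-of-range years
    }
}

With three reducers this puts 1950-1983 in partition 0, 1984-2017 in partition 1, and 2018-2049 in partition 2; if the data were concentrated in one of those ranges, that partition's reducer would become the bottleneck, which is exactly the skew risk described above.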
3. Use Hadoop's sampling mechanism to sample the key space and divide the dataset into reasonably even ranges. The core idea of sampling is to look at only a small subset of the keys to get an approximate distribution of the keys, and to build the partitions from that. Hadoop ships with samplers, so developers do not need to write their own.
Map side
package com.heima.hdfs.allsort;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * Identity mapper: the input is already a sequence file of
 * (year, temperature) pairs, so each pair is passed through unchanged.
 */
public class MaxTempMapper extends Mapper<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void map(IntWritable key, IntWritable value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}
Reduce side
package com.heima.hdfs.allsort;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Same max-temperature reducer as before, now fed from the sequence-file input.
 */
public class MaxTempReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable iw : values) {
            max = Math.max(max, iw.get());
        }
        context.write(key, new IntWritable(max));
    }
}
Driver
package com.heima.hdfs.allsort;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class MaxTempApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf);
        job.setJobName("MaxTempApp");
        job.setNumReduceTasks(3);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setJarByClass(MaxTempApp.class);
        job.setMapperClass(MaxTempMapper.class);
        job.setReducerClass(MaxTempReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        // use the total order partitioner
        job.setPartitionerClass(TotalOrderPartitioner.class);

        // create the sampler; with the sampling frequency set to 1,
        // all 6,000 keys will be selected
        InputSampler.Sampler<IntWritable, IntWritable> sampler =
                new InputSampler.RandomSampler<IntWritable, IntWritable>(1, 100000, 3);

        // write the partition boundaries to the file the partitioner reads
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("e:/mr/tmp/par.lst"));
        InputSampler.writePartitionFile(job, sampler);

        job.waitForCompletion(true);
    }
}
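This job reads a sequence file of (IntWritable, IntWritable) pairs rather than plain text, so the temperature text has to be converted first; that step is not shown above. A minimal sketch of such a converter (the class name and both paths are assumptions for illustration) could be:

package com.heima.hdfs.allsort;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

import java.io.BufferedReader;
import java.io.FileReader;

// Converts the "year temperature" text file into the SequenceFile that
// SequenceFileInputFormat and the sampler expect.
public class Text2Seq {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("e:/mr/tmp/seq/temps.seq")), // assumed output path
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(IntWritable.class));
        BufferedReader br = new BufferedReader(new FileReader("e:/mr/tmp/1.txt")); // assumed input path
        String line;
        while ((line = br.readLine()) != null) {
            String[] arr = line.trim().split(" ");
            if (arr.length < 2) continue; // skip blank or malformed lines
            writer.append(new IntWritable(Integer.parseInt(arr[0])),
                          new IntWritable(Integer.parseInt(arr[1])));
        }
        br.close();
        writer.close();
    }
}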
II. Secondary sort
We know that after the map output passes through the partition function, keys are sorted within each partition, and after the shuffle phase records with the same key end up in the same group, so the keys arrive in sorted order. Sometimes, though, we need to sort by value in addition to key, as in the yearly maximum temperature example above; this is where secondary sorting comes in. As I understand it, a secondary sort can be broken roughly into the following phases.
Map start phase
In the map phase, the InputFormat defined with job.setInputFormatClass() splits the input dataset into splits, and the InputFormat provides a RecordReader implementation. Here we use TextInputFormat, whose RecordReader delivers the byte offset of each line as the key and the line's text as the value. This is why the custom Mapper's input is <LongWritable, Text>. Each <LongWritable, Text> pair is then passed to the custom Mapper's map() method.
Map end phase
At the end of the map phase, the mapper's output is first partitioned by the class set with job.setPartitionerClass(), and each partition maps to one reducer. Within each partition, records are then sorted by the key comparator class set with job.setSortComparatorClass(); as you can see, this by itself is already a two-level sort. If no key comparator class is set via job.setSortComparatorClass(), the compareTo() method implemented by the key is used instead.
Reduce phase
In the reduce phase, after all the map output destined for this reducer has been received, it is again sorted using the key comparator class set with job.setSortComparatorClass(). Then a value iterator is constructed for each key, and this is where grouping comes in: the grouping comparator class is set with job.setGroupingComparatorClass(). Any two keys that this comparator considers equal belong to the same group; their values go into one value iterator, and the key exposed for that iterator is the first key of the group. Finally the Reducer's reduce() method is invoked with each key and its value iterator; as always, the input and output types must match those declared in the custom Reducer.
The running example is still the yearly maximum temperature problem above.
The secondary sort workflow
In this example we need to compare twice: first sort on the first field, then sort records whose first fields are equal on the second field. Accordingly, we construct a composite key with two fields: partitioning routes records by the first field, and the in-partition comparison sorts on the second field. The secondary sort breaks down into the following steps.
1. Define the composite key
Every custom composite key should implement the WritableComparable interface, which extends both the Writable and Comparable interfaces, making the key serializable and comparable. Our composite key orders by year ascending and, within a year, by temperature descending. The implementation:
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class Combokey implements WritableComparable<Combokey> {

    private int year;
    private int temp;

    public int getYear() { return year; }
    public void setYear(int year) { this.year = year; }
    public int getTemp() { return temp; }
    public void setTemp(int temp) { this.temp = temp; }

    /*
     * Key comparison: year ascending, temperature descending.
     */
    @Override
    public int compareTo(Combokey o) {
        System.out.println("Combokey.compareTo()" + o.toString());
        int y0 = o.getYear();
        int t0 = o.getTemp();
        if (year == y0) {
            // same year: temperature descending
            return -(temp - t0);
        } else {
            // different years: year ascending
            return year - y0;
        }
    }

    /*
     * Serialization.
     */
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);  // year
        out.writeInt(temp);  // temperature
    }

    /*
     * Deserialization: read the fields in the order they were written.
     */
    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        temp = in.readInt();
    }

    public String toString() {
        return year + ":" + temp;
    }
}
2. Define the partitioner
The custom partitioner class YearPartitioner is the first comparison of the key. It partitions by year, so records for the same year all land in the same partition.
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class YearPartitioner extends Partitioner<Combokey, NullWritable> {
    @Override
    public int getPartition(Combokey key, NullWritable nullWritable, int numPartitions) {
        System.out.println("YearPartitioner.getPartition" + key);
        int year = key.getYear();
        // same year -> same partition
        return year % numPartitions;
    }
}
3. The key comparator class CombokeyComparator
This is the second comparison of the key. The class extends WritableComparator and sorts all of the keys, applying both orderings of Combokey (year, then temperature) at once.
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class CombokeyComparator extends WritableComparator {
    protected CombokeyComparator() {
        // 'true' asks the parent to instantiate keys for comparison
        super(Combokey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        System.out.println("CombokeyComparator" + a + "," + b);
        Combokey k1 = (Combokey) a;
        Combokey k2 = (Combokey) b;
        // delegate to Combokey.compareTo(): year ascending, temperature descending
        return k1.compareTo(k2);
    }
}
4. The grouping comparator class YearGroupComparator
In the reduce phase, when the value iterator for a key is constructed, any keys with the same year belong to the same group and their values go into a single iterator; groups for different years are ordered by year ascending.
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class YearGroupComparator extends WritableComparator {
    protected YearGroupComparator() {
        super(Combokey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        System.out.println("YearGroupComparator" + a + "," + b);
        Combokey key1 = (Combokey) a;
        Combokey key2 = (Combokey) b;
        // keys compare equal whenever the year matches, so all records
        // for one year form a single group
        return key1.getYear() - key2.getYear();
    }
}
5. Map side. The input key is the byte offset and the input value is the line of text; the output key is the composite key and the output value is NullWritable (an empty placeholder).
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MaxTempMapper extends Mapper<LongWritable, Text, Combokey, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        System.out.println("MaxTempMapper.map");
        String[] arr = value.toString().split(" ");
        // pack year and temperature into the composite key
        Combokey keyout = new Combokey();
        keyout.setYear(Integer.parseInt(arr[0]));
        keyout.setTemp(Integer.parseInt(arr[1]));
        context.write(keyout, NullWritable.get());
    }
}
6. Reduce side. The composite key is split back into a year (the output key) and a temperature (the output value). Because the keys within a group are sorted by temperature descending, the key exposed to reduce() is the group's first key, which carries that year's maximum temperature; writing one record per group therefore yields the yearly maximum.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MaxTempReducer extends Reducer<Combokey, NullWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(Combokey key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // reduce() runs once per year group; the group's key is the first
        // (i.e. highest-temperature) Combokey of that year
        int year = key.getYear();
        int temp = key.getTemp();
        System.out.println("MaxTempReducer.reduce" + year + "," + temp);
        context.write(new IntWritable(year), new IntWritable(temp));
    }
}
7. Driver
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTempApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf);
        job.setJobName("MaxTempApp");
        FileInputFormat.addInputPath(job, new Path("e:/mr/tmp/1.txt"));
        FileOutputFormat.setOutputPath(job, new Path("e:/mr/tmp/out"));

        job.setJarByClass(MaxTempApp.class);
        // mapper and reducer
        job.setMapperClass(MaxTempMapper.class);
        job.setReducerClass(MaxTempReducer.class);
        // map output types
        job.setMapOutputKeyClass(Combokey.class);
        job.setMapOutputValueClass(NullWritable.class);
        // reduce output types
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        // partitioner
        job.setPartitionerClass(YearPartitioner.class);
        // grouping comparator
        job.setGroupingComparatorClass(YearGroupComparator.class);
        // sort comparator
        job.setSortComparatorClass(CombokeyComparator.class);

        job.setNumReduceTasks(3);
        job.waitForCompletion(true);
    }
}
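With three reduce tasks and the year % numPartitions partitioner, the job produces three output files (part-r-00000 through part-r-00002), each containing one (year, maximum temperature) line per year, with years ascending within each file. Note that because this partitioner hashes by year rather than splitting contiguous ranges, the concatenation of the three files is not globally sorted; that is fine here, since the goal of this section is the per-year maximum, not a total order.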