How a Reduce-Side Join Groups and Aggregates by Composite Key
1. Goal of the reduce-side join
Weather station dataset: a table of station IDs and names
StationId StationName
1~hangzhou
2~shanghai
3~beijing
Temperature record dataset
StationId TimeStamp Temperature
3~20200216~6
3~20200215~2
3~20200217~8
1~20200211~9
1~20200210~8
2~20200214~3
2~20200215~4
Goal: join the two datasets above, attaching each station's name to its temperature records via the station ID. The final output:
1~hangzhou~20200211~9
1~hangzhou~20200210~8
2~shanghai~20200214~3
2~shanghai~20200215~4
3~beijing~20200216~6
3~beijing~20200215~2
3~beijing~20200217~8
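Before digging into the grouping mechanics, it helps to see how the map side produces the composite keys that the rest of this post relies on. What follows is a minimal sketch rather than the code from the original example: class and field names such as StationKey and TaggingMapper are my own. The key pairs the stationId with a tag (0 for a station-name record, 1 for a temperature record), so that within a station group the name record always sorts first.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

// Composite key: (stationId, tag); tag 0 = station-name record, tag 1 = temperature record.
public class StationKey implements WritableComparable<StationKey> {
    private final IntWritable stationId = new IntWritable();
    private final IntWritable tag = new IntWritable();

    public void set(int station, int t) {
        stationId.set(station);
        tag.set(t);
    }

    public IntWritable getStationId() {
        return stationId;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        stationId.write(out);
        tag.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        stationId.readFields(in);
        tag.readFields(in);
    }

    // Sort by stationId first, then by tag, so the name record (tag 0)
    // precedes every temperature record (tag 1) of the same station.
    @Override
    public int compareTo(StationKey other) {
        int cmp = stationId.compareTo(other.stationId);
        return cmp != 0 ? cmp : tag.compareTo(other.tag);
    }
}

A single mapper can tag both datasets by counting the "~"-separated fields on each line (station lines have two fields, temperature lines have three):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tags each input line by dataset: 2 fields => station name (tag 0),
// 3 fields => temperature record (tag 1).
public class TaggingMapper extends Mapper<LongWritable, Text, StationKey, Text> {
    private final StationKey outKey = new StationKey();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("~");
        int stationId = Integer.parseInt(fields[0]);
        if (fields.length == 2) {                      // e.g. "1~hangzhou"
            outKey.set(stationId, 0);
            outValue.set(fields[1]);                   // station name
        } else {                                       // e.g. "3~20200216~6"
            outKey.set(stationId, 1);
            outValue.set(fields[1] + "~" + fields[2]); // timestamp~temperature
        }
        context.write(outKey, outValue);
    }
}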
2. The key question: how does reduce group and aggregate?
The map output is sorted by the composite key: ascending by the first field (stationId), and, for equal stationIds, ascending by the second field (the tag). Station-name records and temperature records therefore end up interleaved in one sorted sequence. During the shuffle, as map output is handed to the reducers, it passes through partitioning: records with the same stationId are sent to the same reduce task, and within a reduce task, records sharing a stationId are grouped together. Suppose there are two reduce tasks and the partition function is stationId % 2; the partitioned result is:
Partition 1
<1,0> hangzhou
<1,1> 20200211~9
<1,1> 20200210~8
<3,0> beijing
<3,1> 20200216~6
<3,1> 20200215~2
<3,1> 20200217~8
Partition 2
<2,0> shanghai
<2,1> 20200214~3
<2,1> 20200215~4
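To make the stationId % 2 behavior above concrete, here is a minimal sketch of a partitioner, again with hypothetical names building on the StationKey sketch earlier. It partitions on the stationId alone and ignores the tag; Hadoop's default HashPartitioner would hash the whole composite key and could split a station's name record and temperature records across different reducers.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on stationId only, so all records for one station land in the
// same reduce task; with two reducers this is exactly stationId % 2.
public class StationPartitioner extends Partitioner<StationKey, Text> {
    @Override
    public int getPartition(StationKey key, Text value, int numPartitions) {
        return Math.abs(key.getStationId().get() % numPartitions);
    }
}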
After partitioning, the data within each partition is grouped and aggregated by stationId:
Partition 1
Group 1
<1,0> <hangzhou, 20200211~9, 20200210~8>
Group 2
<3,0> <beijing, 20200216~6, 20200215~2, 20200217~8>
Partition 2
Group 1
<2,0> <shanghai, 20200214~3, 20200215~4>
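Note that sorting still treats <1,0> and <1,1> as distinct keys, so without further help the framework would call reduce() once for the name record and again for the temperature records. What produces the groups shown above is a grouping comparator that compares only the stationId. A sketch, based on the same hypothetical StationKey:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Group keys by stationId alone: <1,0> and <1,1> compare as equal here,
// so their values are delivered to reduce() as a single group.
public class StationGroupingComparator extends WritableComparator {
    protected StationGroupingComparator() {
        super(StationKey.class, true); // true: instantiate keys for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        StationKey k1 = (StationKey) a;
        StationKey k2 = (StationKey) b;
        return k1.getStationId().compareTo(k2.getStationId());
    }
}

With this comparator registered via job.setGroupingComparatorClass(), the two tags collapse into one group at grouping time, while the sort order defined by StationKey.compareTo() still guarantees that the station name is the first value in the group.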
3. How it works under the hood
The decompiled source of Hadoop's Reducer class is shown below:
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by Fernflower decompiler)

package org.apache.hadoop.mapreduce;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.classification.InterfaceAudience.Public;
import org.apache.hadoop.classification.InterfaceStability.Stable;
import org.apache.hadoop.mapreduce.ReduceContext.ValueIterator;
import org.apache.hadoop.mapreduce.task.annotation.Checkpointable;

@Checkpointable
@Public
@Stable
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    public Reducer() {
    }

    protected void setup(Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
    }

    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
        Iterator i$ = values.iterator();

        while(i$.hasNext()) {
            VALUEIN value = i$.next();
            context.write(key, value);
        }
    }

    protected void cleanup(Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
    }

    public void run(Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
        this.setup(context);

        try {
            while(context.nextKey()) {
                this.reduce(context.getCurrentKey(), context.getValues(), context);
                Iterator<VALUEIN> iter = context.getValues().iterator();
                if (iter instanceof ValueIterator) {
                    ((ValueIterator)iter).resetBackupStore();
                }
            }
        } finally {
            this.cleanup(context);
        }
    }

    public abstract class Context implements ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        public Context() {
        }
    }
}
Stepping through with a debugger shows that, after the shuffle, the framework does not call our Reducer's reduce() method directly. Instead it invokes the run() method shown above: run() calls context.nextKey() to advance through the grouped keys, and for each key it passes the key and an iterator over that group's values to this.reduce(), i.e., the reduce function we implement in our own code. This is where the grouping behavior comes from.
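Given run()'s one-call-per-group dispatch, the join itself reduces to reading the first value of each group as the station name (the tag-0 record sorts first) and prepending it to every remaining value. A minimal sketch under the same assumptions and hypothetical names as the earlier snippets:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The tag-0 name record sorts first within each group, so the first value
// is the station name and all subsequent values are temperature records.
public class JoinReducer extends Reducer<StationKey, Text, Text, Text> {
    private final Text outKey = new Text();

    @Override
    protected void reduce(StationKey key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        if (!iter.hasNext()) {
            return; // empty group; should not occur in practice
        }
        String stationName = iter.next().toString(); // e.g. "hangzhou"
        outKey.set(key.getStationId().get() + "~" + stationName);
        while (iter.hasNext()) {
            // e.g. key "1~hangzhou", value "20200211~9"
            context.write(outKey, iter.next());
        }
    }
}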
For a complete worked example of reduce-side grouping and sorting, see:
https://www.cnblogs.com/bclshuai/p/12319490.html
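For completeness, here is a hypothetical driver wiring the sketches above into one job; the linked post contains the full original example, while this snippet only shows where each piece is registered.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: composite key + custom partitioner + grouping comparator = reduce-side join.
public class JoinDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(JoinDriver.class);

        // args[0]: directory holding both the station file and the records file.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(TaggingMapper.class);
        job.setMapOutputKeyClass(StationKey.class);
        job.setMapOutputValueClass(Text.class);

        job.setPartitionerClass(StationPartitioner.class);
        job.setGroupingComparatorClass(StationGroupingComparator.class);

        job.setReducerClass(JoinReducer.class);
        job.setNumReduceTasks(2); // matches the stationId % 2 example above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}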