MapReduce Fundamentals

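MapReduce is one of the core building blocks for processing in the Hadoop framework. Google published a paper on the MapReduce technique in December 2004, and that paper became the origin of the Hadoop processing model.

MapReduce is a programming model that lets us process huge data sets in a parallel and distributed fashion.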

（一）Traditional Way

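How did parallel and distributed processing happen in the traditional way, before the MapReduce framework existed? Take an example: I have a weather log containing the daily average temperature for the years 2000 to 2015, and I want to find the day with the highest temperature in each year.

The traditional way: first, split the data into smaller parts or blocks and store them on different machines. Then, find the highest temperature in the part stored on each machine. Finally, combine the results received from each machine to get the final output. Possible problems:

Critical path problem: if any one machine delays its work, the whole job is delayed.
Reliability problem: what if a machine working on part of the data fails? Managing this failover becomes a problem of its own.
Equal split issue: how do we divide the data evenly so that no machine is overloaded or underutilized?
Single split may fail: if any one machine cannot deliver its output, the result cannot be computed. So there should be a mechanism that guarantees the fault tolerance of the system.
Aggregation of the result: there should be a mechanism to aggregate the results generated by each machine into the final output.

These are the issues we have to take care of ourselves when processing large data sets in parallel the traditional way.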
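To make the three traditional steps concrete, here is a minimal single-process sketch in plain Java. The Reading class, the chunk size and the sample temperatures are all made up for illustration, and each "machine" is just a loop iteration, so none of the failure handling listed above is covered. It tracks only the highest temperature per year, not the date it occurred on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TraditionalMaxTemp {
    // One log record: a year and one day's average temperature (hypothetical layout).
    static final class Reading {
        final int year;
        final double temp;
        Reading(int year, double temp) { this.year = year; this.temp = temp; }
    }

    // Step 1: split the data into chunks of at most chunkSize records, one per machine.
    static List<List<Reading>> split(List<Reading> data, int chunkSize) {
        List<List<Reading>> chunks = new ArrayList<>();
        for (int i = 0; i < data.size(); i += chunkSize) {
            chunks.add(data.subList(i, Math.min(i + chunkSize, data.size())));
        }
        return chunks;
    }

    // Step 2: each machine finds the highest temperature per year in its own chunk.
    static Map<Integer, Double> localMax(List<Reading> chunk) {
        Map<Integer, Double> max = new HashMap<>();
        for (Reading r : chunk) {
            max.merge(r.year, r.temp, Math::max);
        }
        return max;
    }

    // Step 3: combine the partial results from all machines into the final output.
    static Map<Integer, Double> combine(List<Map<Integer, Double>> partials) {
        Map<Integer, Double> result = new HashMap<>();
        for (Map<Integer, Double> partial : partials) {
            for (Map.Entry<Integer, Double> entry : partial.entrySet()) {
                result.merge(entry.getKey(), entry.getValue(), Math::max);
            }
        }
        return result;
    }

    static Map<Integer, Double> yearlyMax(List<Reading> data, int chunkSize) {
        List<Map<Integer, Double>> partials = new ArrayList<>();
        for (List<Reading> chunk : split(data, chunkSize)) {
            partials.add(localMax(chunk)); // on a real cluster this would run on another machine
        }
        return combine(partials);
    }

    public static void main(String[] args) {
        List<Reading> log = Arrays.asList(
            new Reading(2000, 31.5), new Reading(2000, 35.0),
            new Reading(2001, 29.0), new Reading(2001, 33.5),
            new Reading(2000, 28.0));
        System.out.println(yearlyMax(log, 2));
    }
}
```

Everything that makes the traditional way hard (slow machines, failed machines, uneven chunks) lives in the part this sketch skips: actually shipping the chunks to other machines and collecting their answers.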

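The MapReduce framework was created to overcome these issues. MapReduce lets us perform parallel computation without having to worry about reliability, fault tolerance and so on. It gives us the flexibility to write our code logic without caring about the design of the system.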

（二）What is MapReduce?

MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment.

MapReduce consists of two distinct tasks: Map and Reduce. As the name MapReduce suggests, the reducer phase takes place after the mapper phase has completed.

So the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate output.
The output of a Mapper or map job (key-value pairs) is the input to the Reducer.
The reducer receives key-value pairs from multiple map jobs.
Then the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
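The map-then-reduce flow described above can be sketched in plain Java without any Hadoop classes. This is only an illustration applied to the weather example: the "year,temperature" record format, the class name and the sample values are assumptions, and the grouping step that a real framework does across machines is a simple in-memory TreeMap here.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MaxTempMapReduce {
    // Map task: turn one input record ("year,temperature") into a (year, temperature) pair.
    static List<Map.Entry<Integer, Double>> map(String record) {
        String[] fields = record.split(",");
        int year = Integer.parseInt(fields[0].trim());
        double temp = Double.parseDouble(fields[1].trim());
        Map.Entry<Integer, Double> pair = new AbstractMap.SimpleEntry<Integer, Double>(year, temp);
        return Collections.singletonList(pair);
    }

    // Reduce task: collapse all temperatures seen for one year into a single maximum.
    static double reduce(int year, List<Double> temps) {
        double max = Double.NEGATIVE_INFINITY;
        for (double t : temps) {
            max = Math.max(max, t);
        }
        return max;
    }

    static Map<Integer, Double> run(List<String> records) {
        // Group the intermediate pairs by key (the framework does this between the two phases).
        Map<Integer, List<Double>> grouped = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<Integer, Double> pair : map(record)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        // Run the reduce function once per key.
        Map<Integer, Double> output = new TreeMap<>();
        for (Map.Entry<Integer, List<Double>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("2000,31.5", "2001,29.0", "2000,35.0", "2001,33.5");
        System.out.println(run(records)); // prints {2000=35.0, 2001=33.5}
    }
}
```

Four intermediate pairs go in and two come out, which is the "smaller set of tuples" the reducer is supposed to produce.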

（三）A Word Count Example of MapReduce

Let us understand how MapReduce works through an example. We have a text file called example.txt with the following contents:

Dear, Bear, River, Car, Car, River, Deer, Car and Bear

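Now suppose we have to perform a word count on example.txt using MapReduce. That is, we will find the unique words and the number of occurrences of each unique word.

First, we divide the input into three splits, as shown in the figure. This distributes the work among all the map nodes.
Then, we tokenize the words in each mapper and give each token or word a hardcoded value of 1. (The rationale behind the hardcoded value of 1 is that every word, in itself, occurs once.)
Now a list of key-value pairs is created, where the key is the individual word and the value is one. So for the first line (Dear Bear River) we have 3 key-value pairs: Dear, 1; Bear, 1; River, 1. The mapping process is the same on all the nodes.
After the mapper phase, a partition process takes place in which sorting and shuffling happen, so that all the tuples with the same key are sent to the corresponding reducer.
So after the sorting and shuffling phase, each reducer has a unique key and a list of values corresponding to that key. For example: Bear, [1,1]; Car, [1,1,1]; and so on.
Now each Reducer counts the values present in that list of values. As shown in the figure, the reducer gets the list of values [1,1] for the key Bear. It counts the number of ones in the list and gives the final output: Bear, 2.
Finally, all the output key/value pairs are collected and written to the output file.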
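The steps above can be traced in a small plain-Java simulation, with no Hadoop involved. The three splits are an assumption: they take the words of the sample text three at a time with the punctuation and the word "and" left out, which matches the first line (Dear Bear River) quoted above. The class and method names are made up for this sketch.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class WordCountSimulation {
    // Map: tokenize one split and emit (word, 1) for every token.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(split);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle and sort: collect the values of each key into a list; TreeMap keeps keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: add up the list of ones that belongs to one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String split : splits) {
            intermediate.addAll(map(split)); // one map call per split
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("Dear Bear River", "Car Car River", "Deer Car Bear");
        System.out.println(wordCount(splits)); // prints {Bear=2, Car=3, Dear=1, Deer=1, River=2}
    }
}
```

Calling shuffle() on the map output gives Bear=[1, 1] and Car=[1, 1, 1], the same lists used in the walkthrough.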

（四）Advantages of MapReduce

The two biggest advantages of MapReduce are:

1. Parallel Processing:

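In MapReduce, we divide the job among multiple nodes, and each node works on its part of the job simultaneously. So MapReduce is based on the Divide and Conquer paradigm, which helps us process the data using different machines. Because the data is processed by multiple machines in parallel instead of by a single machine, the time taken to process it is reduced tremendously, as shown in the figure below.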
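As a rough single-machine analogy of this divide-and-conquer idea, the sketch below gives each split to its own thread and then merges the partial counts. The threads merely stand in for nodes, and the class and method names are invented for the example. A real cluster spreads the work across machines, which a thread pool does not do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSplits {
    // The work one "node" does on its own split: count the words it sees.
    static Map<String, Integer> countWords(String split) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(split);
        while (tokenizer.hasMoreTokens()) {
            counts.merge(tokenizer.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    // Divide: one task per split, all running at the same time. Conquer: merge the partial counts.
    static Map<String, Integer> countInParallel(List<String> splits) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, splits.size()));
        try {
            List<Callable<Map<String, Integer>>> tasks = new ArrayList<>();
            for (String split : splits) {
                tasks.add(() -> countWords(split));
            }
            Map<String, Integer> total = new TreeMap<>();
            for (Future<Map<String, Integer>> partial : pool.invokeAll(tasks)) {
                for (Map.Entry<String, Integer> entry : partial.get().entrySet()) {
                    total.merge(entry.getKey(), entry.getValue(), Integer::sum);
                }
            }
            return total;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException(ex);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("Dear Bear River", "Car Car River", "Deer Car Bear");
        System.out.println(countInParallel(splits));
    }
}
```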

2. Data Locality:

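Instead of moving the data to the processing unit, in the MapReduce framework we move the processing unit to the data. In the traditional system, we used to bring the data to the processing unit and process it there. But as the data grew and became very huge, bringing that much data to the processing unit posed the following issues:

Moving huge data to the processing unit is costly and degrades network performance.
Processing takes time, because the data is processed by a single unit, which becomes the bottleneck.
The master node can get over-burdened and may fail.

MapReduce lets us overcome these issues by bringing the processing unit to the data. So, as you can see in the above image, the data is distributed among multiple nodes, and each node processes the part of the data residing on it. This gives us the following advantages:

Moving the processing unit to the data is very cost effective.
Processing time is reduced, because all the nodes work on their part of the data in parallel.
Every node gets only a part of the data to process, so there is little chance of a node getting overburdened.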

（五）MapReduce Example Program

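Before jumping into the details, let us have a glance at a MapReduce example program to get a basic idea of how things work in a MapReduce environment in practice. I have taken the same word count example, where we have to find the number of occurrences of each word. And don't worry guys, if you don't understand the code when you look at it for the first time, just bear with me while I walk you through each part of the MapReduce code.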

1. Source code:

package co.edureka.mapreduce;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;

public class WordCount
{
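    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        // Reused across calls so we don't allocate new objects for every token
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // value holds one line of the input file; key is that line's byte offset
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // Emit (word, 1) for every whitespace-separated token
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }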

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum all the counts emitted for this word
            int sum = 0;
            for (IntWritable x : values) {
                sum += x.get();
            }
            // Emit (word, total occurrences)
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated Job(Configuration, String) constructor
        Job job = Job.getInstance(conf, "My Word Count Program");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // Types of the output key/value pairs; the map output uses the same types here
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        Path outputPath = new Path(args[1]);
        // Configure the input and output paths of the job from the command-line arguments
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);
        // Delete the output path from HDFS (recursively) so that we don't have to delete it by hand
        outputPath.getFileSystem(conf).delete(outputPath, true);
        // Wait for the job to finish; exit with 0 on success and 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

2. Explanation of MapReduce Program

The entire MapReduce program can fundamentally be divided into three parts:

Mapper Phase Code
Reducer Phase Code
Driver Code

We will understand the code for each of these three parts in turn.

Mapper code:

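    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        // Reused across calls so we don't allocate new objects for every token
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // value holds one line of the input file; key is that line's byte offset
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // Emit (word, 1) for every whitespace-separated token
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }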

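We have created a class Map that extends the class Mapper, which is already defined in the MapReduce framework.

We define the data types of the input and output key/value pairs in angle brackets after the class declaration.

Both the input and the output of the Mapper are key/value pairs.

Input:

The key is the offset of each line in the text file: LongWritable
The value is each individual line (as shown in the figure): Text

Output:

The key is the tokenized words: Text
The value is hardcoded, in our case it is 1: IntWritable
Example: Dear 1, Bear 1, etc.

We have written Java code in which we tokenize each word and assign it a hardcoded value equal to 1.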
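To see what the mapper's input pairs and tokens look like, here is a small plain-Java sketch with no Hadoop classes. It assumes '\n' line endings, and the toRecords and tokenize helpers are made up for illustration; they only imitate the (offset, line) records that a line-oriented input format hands to map(). One detail worth knowing: StringTokenizer with its default delimiters splits on whitespace only, so punctuation stays attached to a word.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

class MapperInputSketch {
    // Build (byte offset, line) records, the shape of the mapper's input key/value pairs.
    static Map<Long, String> toRecords(String fileContent) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            records.put(offset, line);
            // Advance past the line's bytes plus one byte for the '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // Tokenize a line the same way the Map class does: default delimiters are whitespace only.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String content = "Dear Bear River\nCar Car River\nDeer Car Bear";
        for (Map.Entry<Long, String> record : toRecords(content).entrySet()) {
            System.out.println(record.getKey() + " -> " + tokenize(record.getValue()));
        }
        // Punctuation is not stripped: this prints [Dear,, Bear]
        System.out.println(tokenize("Dear, Bear"));
    }
}
```

With this content the three lines start at offsets 0, 16 and 30. If the input really contains commas, as the sample sentence does, the words would need cleaning (or a tokenizer with extra delimiters) before counting.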

Reducer Code:

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum all the counts emitted for this word
            int sum = 0;
            for (IntWritable x : values) {
                sum += x.get();
            }
            // Emit (word, total occurrences)
            context.write(key, new IntWritable(sum));
        }
    }

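We have created a class Reduce that extends the class Reducer, just like the Mapper.

We define the data types of the input and output key/value pairs in angle brackets after the class declaration, as we did for the Mapper.

Both the input and the output of the Reducer are key-value pairs.

Input:

The key is nothing but the unique words generated after the sorting and shuffling phase: Text
The value is the list of integers corresponding to each key: IntWritable
Example: Bear, [1, 1], etc.

Output:

The key is all the unique words present in the input text file: Text
The value is the number of occurrences of each unique word: IntWritable
Example: Bear, 2; Car, 3, etc.

We aggregate the values present in each list corresponding to each key and produce the final answer.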

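Note that the framework calls reduce() once for each unique word; it does not create a separate reducer for every word. The number of reducers is a job setting, which you can specify in mapred-site.xml (the mapreduce.job.reduces property) or in the driver with job.setNumReduceTasks().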
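How does a key find its reducer when there are fewer reducers than unique words? A minimal sketch of hash partitioning is shown below. The formula mirrors the one used by Hadoop's default HashPartitioner, but the class and method names are made up, and it hashes a Java String, whose hashCode() differs from that of Hadoop's Text, so the actual assignments on a cluster would not match these.

```java
import java.util.Arrays;
import java.util.List;

class PartitionSketch {
    // Pick a reducer index in [0, numReduceTasks) from the key's hash code.
    // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Bear", "Car", "Dear", "Deer", "River");
        int numReduceTasks = 2; // assumed value, e.g. set via job.setNumReduceTasks(2)
        for (String word : words) {
            System.out.println(word + " -> reducer " + partitionFor(word, numReduceTasks));
        }
    }
}
```

Because the partition depends only on the key, every (Car, 1) pair lands on the same reducer no matter which mapper emitted it, and one reducer usually ends up handling several different words.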

Driver Code:

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated Job(Configuration, String) constructor
        Job job = Job.getInstance(conf, "My Word Count Program");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // Types of the output key/value pairs; the map output uses the same types here
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        Path outputPath = new Path(args[1]);
        // Configure the input and output paths of the job from the command-line arguments
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);
        // Delete the output path from HDFS (recursively) so that we don't have to delete it by hand
        outputPath.getFileSystem(conf).delete(outputPath, true);
        // Wait for the job to finish; exit with 0 on success and 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

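In the driver class, we set the configuration of our MapReduce job to run in Hadoop.

We specify the name of the job and the data types of the output key/value pairs; in this program the mapper and the reducer emit the same types.

We also specify the names of the mapper and reducer classes.

The paths of the input and output folders are specified as well.

The method setInputFormatClass() specifies how a Mapper will read the input data, or what the unit of work will be. Here we have chosen TextInputFormat, so that the mapper reads a single line at a time from the input text file.

The main() method is the entry point of the driver. In this method, we instantiate a new Configuration object for the job.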

3. Run the MapReduce code

The command for running the MapReduce code is:

hadoop jar hadoop-mapreduce-example.jar co.edureka.mapreduce.WordCount /sample/input /sample/output

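Now you have a basic understanding of the MapReduce framework. You will have realized how the MapReduce framework helps us write code to process the huge amounts of data present in HDFS.

Original article: Fundamentals of MapReduce with MapReduce Example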

posted @ 2023-05-29 21:08 ImreW