MapReduce Programming Practice (Hadoop 3.1.3)

1. Word Count Task Requirements

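First, create two files, wordfile1.txt and wordfile2.txt, on the local Linux file system. In a real application these files could be very large and would be stored across multiple nodes. To keep the task simple, the two files here contain only a few short lines. Note that the MapReduce word count program written for these two small samples can process large data sets without any modification.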

# Contents of wordfile1.txt:
hadoop@hadoop-master:~$ vim wordfile1.txt
hadoop@hadoop-master:~$ cat wordfile1.txt
I love Spark
I love Hadoop

# Contents of wordfile2.txt:
hadoop@hadoop-master:~$ vim wordfile2.txt
hadoop@hadoop-master:~$ cat wordfile2.txt
Hadoop is good
Spark is fast

Assume that HDFS contains an empty folder /user/hadoop/input. Upload wordfile1.txt and wordfile2.txt to this input folder in HDFS.

hadoop@hadoop-master:~$ hdfs dfs -put /home/hadoop/wordfile* input/

hadoop@hadoop-master:~$ hdfs dfs -ls input/
Found 2 items
-rw-r--r--   1 hadoop supergroup         27 2022-04-23 16:22 input/wordfile1.txt
-rw-r--r--   1 hadoop supergroup         29 2022-04-23 16:22 input/wordfile2.txt

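Now design a word count program that counts how many times each word occurs across all files in the input folder. In other words, the program should produce output of the following form (the order of the lines may differ):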

fast    1
good    1
Hadoop  2
I       2
is      2
love    2
Spark   2

2. Creating the Project in Eclipse

First, start Eclipse. After it starts, a dialog appears that prompts you to choose a workspace.

root@hadoop-master:~# cd /usr/local/eclipse/
root@hadoop-master:/usr/local/eclipse# ./eclipse


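You can accept the default setting, /home/hadoop/workspace, and click the OK button. Because the current Linux login user is hadoop, the default workspace directory is under that user's home directory, /home/hadoop.

After Eclipse starts, its main window appears.

Select File -> New -> Java Project from the menu to start creating a Java project. The New Java Project dialog appears.

In the Project name field, enter the project name WordCount and check Use default location so that all files of this Java project are saved under /home/hadoop/workspace/WordCount. In the JRE section, select a JDK that is already installed on the Linux system, for example jdk1.8.0_162. Then click the "Next>" button at the bottom of the dialog to go to the next step.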

3. Adding the Required JAR Files to the Project

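The next page of the wizard appears.

On this page you load the JAR files that the Java project depends on. These JAR files contain the Hadoop Java API and are all located under the Hadoop installation directory, which in this tutorial is /usr/local/hadoop/share/hadoop. Click the Libraries tab, then click the Add External JARs… button on the right side of the dialog. A file chooser appears.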

At the top of the file chooser is a row of directory buttons (usr, local, hadoop, share, hadoop, mapreduce, and lib). Clicking one of them lists the contents of that directory below.
To write a MapReduce program, you generally need to add the following JAR files to the Java project:

1. hadoop-common-3.1.3.jar and hadoop-nfs-3.1.3.jar in the /usr/local/hadoop/share/hadoop/common directory;

2. all JAR files in the /usr/local/hadoop/share/hadoop/common/lib directory;

3. all JAR files in the /usr/local/hadoop/share/hadoop/mapreduce directory, excluding the jdiff, lib, lib-examples, and sources subdirectories;

4. all JAR files in the /usr/local/hadoop/share/hadoop/mapreduce/lib directory.

After all of them have been added, click the "Finish" button in the lower-right corner of the dialog to finish creating the Java project WordCount.

4. Writing the Java Application

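Next, write the Java application, WordCount.java. In the Package Explorer panel on the left side of the Eclipse window, find the WordCount project you just created, right-click the project name, and select New -> Class from the context menu.

The New Java Class dialog appears.

In this dialog you only need to enter the name of the new Java class in the Name field. Use the name WordCount here and leave everything else at its default, then click the Finish button in the lower-right corner.

Eclipse creates a source file named WordCount.java that contains the code public class WordCount{}. Clear the contents of this file and enter the complete word count program below: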

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Strip generic Hadoop options (such as -D key=value) and keep the application arguments.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Addition is associative and commutative, so the reducer can also serve as the combiner.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Every argument except the last is an input path; the last one is the output directory.
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each input line on whitespace and emits (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts received for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
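
To see what the mapper and reducer compute without a cluster, the same logic can be imitated with plain Java collections. The following is only a rough in-memory sketch, not how Hadoop executes the job: the class LocalWordCountSketch and its methods are hypothetical names introduced for illustration, and the grouping that Hadoop performs during the shuffle is replaced by a TreeMap.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper class (not part of Hadoop): imitates map -> group -> reduce in memory.
class LocalWordCountSketch {

    // "Map" step: one (word, 1) pair per whitespace-separated token, like TokenizerMapper.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // "Group + reduce" step: collect pairs by key and sum their values, like IntSumReducer.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Runs the map step on every line, then reduces all emitted pairs.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "I love Spark", "I love Hadoop",    // wordfile1.txt
                "Hadoop is good", "Spark is fast"); // wordfile2.txt
        count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

The sketch targets JDK 8, the version used in this tutorial, which is why it uses AbstractMap.SimpleEntry and Arrays.asList rather than newer factory methods.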

5. Compiling and Packaging the Program

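Now you can compile the code above. Move the mouse over the Run shortcut button in the toolbar at the top of the Eclipse window, choose Run As from the menu that appears, and then choose Java Application.

A dialog then appears.

Click the OK button in the lower-right corner of the dialog to run the program. When the run finishes, the result is shown in the Console panel at the bottom. Because no arguments are passed in this run, the program only prints its usage message (Usage: wordcount <in> [<in>...] <out>) and exits. The run still compiles the class and creates the launch configuration that is selected during the export below.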

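Next, package the Java application into a JAR file and deploy it to the Hadoop platform. Here the word count program is placed in the /usr/local/hadoop/myapp directory. If this directory does not exist, create it with the following commands: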

hadoop@hadoop-master:~$ cd /usr/local/hadoop/
hadoop@hadoop-master:/usr/local/hadoop$ mkdir myapp

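First, in the Package Explorer panel on the left side of the Eclipse window, right-click the project name WordCount and select Export from the context menu.

The Export dialog appears.

In this dialog, select Runnable JAR file and click the Next> button.

On the next page, Launch configuration sets the main class that runs when the generated JAR file is launched. Select the configuration created by the earlier run, WordCount-WordCount, from the drop-down list. Export destination sets where the JAR file is saved. Here it is set to /usr/local/hadoop/myapp/WordCount.jar. Under Library handling, select Extract required libraries into generated JAR. Then click the Finish button. An information dialog appears.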

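You can ignore the message in this dialog and click the OK button in its lower-right corner to start packaging. When packaging finishes, a warning dialog appears.

You can ignore this warning as well and click the "OK" button. The WordCount project has now been packaged as WordCount.jar. To check the generated file, run the following command in a Linux terminal: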

# The directory now contains a WordCount.jar file
hadoop@hadoop-master:~$ ll -d /usr/local/hadoop/myapp/WordCount.jar 
-rw-r--r-- 1 root root 38487845  4月 23 17:00 /usr/local/hadoop/myapp/WordCount.jar

6. Running the Program

Before running the program, Hadoop must be started.

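Once Hadoop is running, make sure that the HDFS directories of the current Linux user hadoop are in a clean state so that the job does not run into problems. The input directory (/user/hadoop/input in HDFS) should contain only the two word files, and the output directory (/user/hadoop/output) must not exist. An old output directory can be deleted with hdfs dfs -rm -r output. The listing below confirms that only input exists and that it holds the two files: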


hadoop@hadoop-master:~$ hdfs dfs -ls .
Found 1 items
drwxrwxrwx   - hadoop supergroup          0 2022-04-23 16:22 input

hadoop@hadoop-master:~$ hdfs dfs -ls input/
Found 2 items
-rw-r--r--   1 hadoop supergroup         27 2022-04-23 16:22 input/wordfile1.txt
-rw-r--r--   1 hadoop supergroup         29 2022-04-23 16:22 input/wordfile2.txt

Now run the program on Linux with the hadoop jar command:

hadoop@hadoop-master:~$ cd /usr/local/hadoop/
hadoop@hadoop-master:/usr/local/hadoop$ ./bin/hadoop jar ./myapp/WordCount.jar input output
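
The two trailing words in this command, input and output, are the arguments that main() in WordCount receives. As the loop in main() shows, every argument except the last is treated as an input path and the last one as the output directory. The following minimal sketch mirrors that rule in plain Java. JobArgsSketch is a hypothetical class introduced only for illustration, and it deliberately ignores the generic Hadoop options that GenericOptionsParser strips in the real program.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Hadoop class): mirrors how WordCount.main() reads its arguments.
class JobArgsSketch {
    final List<String> inputs; // all arguments except the last
    final String output;       // the last argument

    JobArgsSketch(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException("Usage: wordcount <in> [<in>...] <out>");
        }
        inputs = Arrays.asList(Arrays.copyOfRange(args, 0, args.length - 1));
        output = args[args.length - 1];
    }

    public static void main(String[] args) {
        JobArgsSketch parsed = new JobArgsSketch(new String[] {"input", "output"});
        System.out.println("inputs=" + parsed.inputs + " output=" + parsed.output);
    }
}
```

Under this rule, several input directories can be passed in a single run as long as the output directory comes last.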

When the command finishes successfully, output similar to the following is displayed:

... // some screen output omitted here
2022-04-23 17:19:21,565 INFO mapreduce.Job:  map 0% reduce 0%
2022-04-23 17:19:27,810 INFO mapreduce.Job:  map 100% reduce 0%
2022-04-23 17:19:33,881 INFO mapreduce.Job:  map 100% reduce 100%
2022-04-23 17:19:33,903 INFO mapreduce.Job: Job job_1650685957973_0002 completed successfully
2022-04-23 17:19:33,987 INFO mapreduce.Job: Counters: 53
	File System Counters
		FILE: Number of bytes read=106
		FILE: Number of bytes written=653149
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=300
		HDFS: Number of bytes written=47
		HDFS: Number of read operations=11
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=7885
		Total time spent by all reduces in occupied slots (ms)=2726
		Total time spent by all map tasks (ms)=7885
		Total time spent by all reduce tasks (ms)=2726
		Total vcore-milliseconds taken by all map tasks=7885
		Total vcore-milliseconds taken by all reduce tasks=2726
		Total megabyte-milliseconds taken by all map tasks=8074240
		Total megabyte-milliseconds taken by all reduce tasks=2791424
	Map-Reduce Framework
		Map input records=4
		Map output records=12
		Map output bytes=104
		Map output materialized bytes=112
		Input split bytes=244
		Combine input records=12
		Combine output records=9
		Reduce input groups=7
		Reduce shuffle bytes=112
		Reduce input records=9
		Reduce output records=7
		Spilled Records=18
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=233
		CPU time spent (ms)=3090
		Physical memory (bytes) snapshot=911650816
		Virtual memory (bytes) snapshot=7746342912
		Total committed heap usage (bytes)=754974720
		Peak Map Physical memory (bytes)=338919424
		Peak Map Virtual memory (bytes)=2581024768
		Peak Reduce Physical memory (bytes)=235368448
		Peak Reduce Virtual memory (bytes)=2585690112
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=56
	File Output Format Counters 
		Bytes Written=47
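
Three of the counters above (Map output records=12, Combine output records=9, Reduce output records=7) can be reproduced by hand from the two sample files. The sketch below is a simplified in-memory model, assuming one map task per file and a combiner that runs exactly once over each task's complete output. Hadoop does not guarantee that, since it may run the combiner zero or more times, but the assumption matches the counters of this small job. CombinerCountersSketch is a hypothetical name used only here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory model: one map task per file, each followed by one combiner pass.
class CombinerCountersSketch {

    // Returns {map output records, combine output records, reduce output records}.
    static int[] counters(List<List<String>> files) {
        int mapOutput = 0;
        int combineOutput = 0;
        Map<String, Integer> reduced = new TreeMap<>();
        for (List<String> file : files) {
            Map<String, Integer> combined = new TreeMap<>(); // partial sums of one map task
            for (String line : file) {
                StringTokenizer itr = new StringTokenizer(line);
                while (itr.hasMoreTokens()) {
                    combined.merge(itr.nextToken(), 1, Integer::sum);
                    mapOutput++; // the mapper emits one (word, 1) record per token
                }
            }
            combineOutput += combined.size(); // one record per distinct word in this task
            combined.forEach((word, n) -> reduced.merge(word, n, Integer::sum));
        }
        return new int[] {mapOutput, combineOutput, reduced.size()};
    }

    public static void main(String[] args) {
        int[] c = counters(Arrays.asList(
                Arrays.asList("I love Spark", "I love Hadoop"),     // wordfile1.txt
                Arrays.asList("Hadoop is good", "Spark is fast"))); // wordfile2.txt
        System.out.println("Map output records=" + c[0]);
        System.out.println("Combine output records=" + c[1]);
        System.out.println("Reduce output records=" + c[2]);
    }
}
```

The first file yields 4 distinct words and the second 5, so the combiner reduces 12 map records to 9 before the shuffle, and the reducer merges those into 7 final records.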

The word count result has been written to the /user/hadoop/output directory in HDFS. Run the following command to view it:

hadoop@hadoop-master:/usr/local/hadoop$ hdfs dfs -cat output/*
Hadoop	2
I	2
Spark	2
fast	1
good	1
is	2
love	2
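
The counts match the listing in section 1, but the line order differs: capitalized words come first here. MapReduce sorts the keys before they reach the reducer, and Text keys are compared byte by byte, so uppercase ASCII letters sort before lowercase ones. The sketch below illustrates the difference with plain Java strings. It assumes ASCII-only words, for which the natural String order agrees with a byte-wise comparison, and KeyOrderSketch is a hypothetical class name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper (not a Hadoop class): compares two orderings of the seven result keys.
class KeyOrderSketch {
    static final List<String> WORDS =
            Arrays.asList("fast", "good", "Hadoop", "I", "is", "love", "Spark");

    // Natural String order; for ASCII words this agrees with a byte-wise comparison.
    static List<String> byteOrder() {
        return new ArrayList<>(new TreeSet<>(WORDS));
    }

    // Case-insensitive order, as in the listing in section 1.
    static List<String> caseInsensitive() {
        TreeSet<String> sorted = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        sorted.addAll(WORDS);
        return new ArrayList<>(sorted);
    }

    public static void main(String[] args) {
        System.out.println(byteOrder());       // [Hadoop, I, Spark, fast, good, is, love]
        System.out.println(caseInsensitive()); // [fast, good, Hadoop, I, is, love, Spark]
    }
}
```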

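The word count program has now run successfully. Note that before running WordCount.jar again, you must first delete the output directory in HDFS. Otherwise the job fails with an error.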

Reference: http://dblab.xmu.edu.cn/blog/2481-2/
