hadoop mapreduce 基础实例一记词
mapreduce实现一个简单的单词计数的功能。
一,准备工作:eclipse 安装hadoop 插件:
下载相关版本的hadoop-eclipse-plugin-2.2.0.jar到eclipse/plugins下。
二,实现:
新建mapreduce project
map 用于分词,reduce计数。
package tank.demo; import java.io.IOException; import java.util.StringTokenizer; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; /** * @author tank * @date:2015年1月5日 上午10:03:43 * @description:记词器 * @version :0.1 */ public class WordCount { public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); if (args.length != 2) { System.err.println("Usage: wordcount "); System.exit(2); } Job job = new Job(conf, "word count"); //主类 job.setJarByClass(WordCount.class); job.setMapperClass(TokenizerMapper.class); job.setReducerClass(IntSumReducer.class); //map输出格式 job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); //输出格式 job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } }
打包world-count.jar
三,准备输入数据
hadoop fs -mkdir /user/hadoop/input//建好输入目录
//随便写点数据文件
echo hello my hadoop this is my first application>file1
echo hello world my deer my applicaiton >file2
//拷贝到hdfs中
hadoop fs -put file* /user/hadoop/input
hadoop fs -ls /user/hadoop/input //查看
四,运行
上传到集群环境中:
hadoop jar world-count.jar WordCount input output
截取一段输出如:
15/01/05 11:14:36 INFO mapred.Task: Task:attempt_local1938802295_0001_r_000000_0 is done. And is in the process of committing
15/01/05 11:14:36 INFO mapred.LocalJobRunner:
15/01/05 11:14:36 INFO mapred.Task: Task attempt_local1938802295_0001_r_000000_0 is allowed to commit now
15/01/05 11:14:36 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1938802295_0001_r_000000_0' to hdfs://192.168.183.130:9000/user/hadoop/output/_temporary/0/task_local1938802295_0001_r_000000
15/01/05 11:14:36 INFO mapred.LocalJobRunner: reduce > reduce
15/01/05 11:14:36 INFO mapred.Task: Task 'attempt_local1938802295_0001_r_000000_0' done.
15/01/05 11:14:36 INFO mapreduce.Job: Job job_local1938802295_0001 running in uber mode : false
15/01/05 11:14:36 INFO mapreduce.Job: map 100% reduce 100%
15/01/05 11:14:36 INFO mapreduce.Job: Job job_local1938802295_0001 completed successfully
15/01/05 11:14:36 INFO mapreduce.Job: Counters: 32
File System Counters
FILE: Number of bytes read=17706
FILE: Number of bytes written=597506
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=205
HDFS: Number of bytes written=85
HDFS: Number of read operations=25
HDFS: Number of large read operations=0
HDFS: Number of write operations=5
Map-Reduce Framework
Map input records=2
Map output records=14
Map output bytes=136
Map output materialized bytes=176
Input split bytes=232
Combine input records=0
Combine output records=0
Reduce input groups=10
Reduce shuffle bytes=0
Reduce input records=14
Reduce output records=10
Spilled Records=28
Shuffled Maps =0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=67
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=456536064
File Input Format Counters
Bytes Read=80
File Output Format Counters
Bytes Written=85
查看输出目录下的文件
[hadoop@tank1 ~]$ hadoop fs -cat /user/hadoop/output/part-r-00000
applicaiton 1
application 1
deer 1
first 1
hadoop 1
hello 2
is 1
my 4
this 1
world 1
已经正确统计出单词数量!
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· go语言实现终端里的倒计时
· 如何编写易于单元测试的代码
· 10年+ .NET Coder 心语,封装的思维:从隐藏、稳定开始理解其本质意义
· .NET Core 中如何实现缓存的预热?
· 从 HTTP 原因短语缺失研究 HTTP/2 和 HTTP/3 的设计差异
· 周边上新:园子的第一款马克杯温暖上架
· Open-Sora 2.0 重磅开源!
· 分享 3 个 .NET 开源的文件压缩处理库,助力快速实现文件压缩解压功能!
· Ollama——大语言模型本地部署的极速利器
· DeepSeek如何颠覆传统软件测试?测试工程师会被淘汰吗?
2013-01-05 feathers ui 换肤