MapReduce: tokenizing and reading a file
MapReduce
Example 1: tokenized word count over a file
1.1 First, add the dependencies (jar packages)
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.8.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.3</version>
</dependency>
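The four `<dependency>` entries above belong inside the `<dependencies>` element of the project's pom.xml. A minimal sketch of the surrounding structure (the project's own groupId/artifactId/version here are placeholders, not from the original):

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <!-- placeholder coordinates; use your own project's values -->
    <groupId>com.example</groupId>
    <artifactId>mapreduce</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <!-- the four log4j/Hadoop dependencies listed above go here -->
    </dependencies>
</project>
```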
1.2 Write the Mapper
public class TextMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final static LongWritable one = new LongWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Construct a StringTokenizer to parse the line. Java's default delimiters are
        // the space, tab ('\t'), newline ('\n'), and carriage return ('\r') characters.
        StringTokenizer st = new StringTokenizer(value.toString());
        while (st.hasMoreTokens()) { // true while tokens remain
            word.set(st.nextToken()); // returns the substring up to the next delimiter
            context.write(word, one);
        }
    }
}
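To see exactly what the Mapper emits, StringTokenizer's default-delimiter behavior can be checked in isolation. A standalone sketch (no Hadoop required; the class name `TokenizerDemo` is illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDemo {
    // Split a line the same way the Mapper does, using the default delimiters
    // (space, '\t', '\n', '\r', '\f').
    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Mixed delimiters are all treated the same way.
        System.out.println(tokenize("hello world\thello\nmapreduce"));
        // prints [hello, world, hello, mapreduce]
    }
}
```

Each token becomes one (word, 1) pair written to the context.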
1.3 Write the Reducer
public class TextReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    private final LongWritable result = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long sum = 0; // use long to match LongWritable and avoid overflow
        for (LongWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
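Between map and reduce, the framework shuffles the (word, 1) pairs so that each reduce() call receives one key together with all of its values. The summation the Reducer performs can be sketched in plain Java (a standalone illustration of the logic, not Hadoop code; `ReduceDemo` is a hypothetical name):

```java
import java.util.List;

public class ReduceDemo {
    // Sum the grouped counts for one key, as reduce() does over its Iterable.
    public static long sum(Iterable<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // e.g. the key "hello" seen three times by the Mapper
        System.out.println(sum(List.of(1L, 1L, 1L)));
        // prints 3
    }
}
```

The output pair (key, sum) is what ends up in the job's result files.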
1.4 Write the job driver
public class TextJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(TextJob.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(TextMapper.class);
        // job.setCombinerClass(TextReduce.class); // optional: this reducer can also serve as a combiner
        job.setReducerClass(TextReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
1.5 Run the job on HDFS:
[root@head42 ~]# hadoop jar mapreduce-1.0-SNAPSHOT.jar com.njbd.normal.text1.TextJob /text /output14
Notes: mapreduce-1.0-SNAPSHOT.jar is the packaged Java jar;
com.njbd.normal.text1.TextJob is the fully qualified class name of the job driver;
/text is the input directory and /output14 is the output directory (the output directory must not already exist).