Daily Study

Today's topic: MapReduce

Data Cleaning (ETL)

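"ETL" is short for Extract-Transform-Load. It describes the process of moving data from a source to a destination through extraction (Extract), transformation (Transform), and loading (Load). The term is most common in the context of data warehouses, but it is not limited to them.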

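Before running the core business MapReduce program, the data usually has to be cleaned first to remove records that do not meet the user's requirements. The cleaning step often needs only a Mapper; no Reducer has to run.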

1) Requirement

Remove log lines that have 11 or fewer fields.

2) Requirement analysis

The input data must be filtered in the Map phase according to this rule.
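The rule can be tried out in plain Java before it is wired into a Mapper. The following is only a minimal sketch under stated assumptions: the class name LogRuleSketch, its methods, and the sample lines are hypothetical and not part of the post or of Hadoop, and fields are assumed to be separated by single spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Hadoop API.
final class LogRuleSketch {

    // A line is kept only if its field count is strictly greater than this value.
    static final int MIN_FIELDS_EXCLUSIVE = 11;

    // Returns true when the line has more than 11 space-separated fields.
    static boolean hasEnoughFields(String line) {
        return line.split(" ").length > MIN_FIELDS_EXCLUSIVE;
    }

    // Keeps only the lines that pass the rule, preserving input order.
    static List<String> clean(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (hasEnoughFields(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12", // 12 fields: kept
                "f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11",     // 11 fields: dropped
                "f1 f2 f3");                              // 3 fields: dropped
        System.out.println(clean(sample));
    }
}
```

Because the check is a pure per-line predicate, no record needs to see any other record, which is why a map-only job is enough for this kind of cleaning.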

Implementation:

(1) Write the WebLogMapper class

package com.atguigu.mapreduce.weblog;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WebLogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // 1 Get one line of input
        String line = value.toString();

        // 2 Parse the log line
        boolean result = parseLog(line, context);

        // 3 Skip the line if it is invalid
        if (!result) {
            return;
        }

        // 4 Write valid lines out unchanged
        context.write(value, NullWritable.get());
    }

    // Helper that validates one log line
    private boolean parseLog(String line, Context context) {

        // 1 Split on single spaces
        String[] fields = line.split(" ");

        // 2 A line with more than 11 fields is valid
        return fields.length > 11;
    }
}
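One detail of the field count is worth checking: String.split(" ") treats every single space as a separator, so doubled spaces produce empty fields that still count toward the length, while trailing empty fields are dropped. The sketch below is only an illustration; SplitBehaviorSketch and both method names are hypothetical, and the whitespace-run variant is an assumed alternative, not the rule used in the post.

```java
// Hypothetical demo class for illustration only; not part of the post's code.
final class SplitBehaviorSketch {

    // Field count as computed in parseLog: split on a single space character.
    static int countSingleSpace(String line) {
        return line.split(" ").length;
    }

    // Assumed alternative: trim first, then split on runs of whitespace.
    static int countWhitespaceRuns(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String doubled = "a  b"; // two spaces between a and b
        // The single-space split sees an empty field between the two spaces.
        System.out.println(countSingleSpace(doubled));
        // The whitespace-run split sees only the two real fields.
        System.out.println(countWhitespaceRuns(doubled));
    }
}
```

Which variant is appropriate depends on the log format; if the logs are known to use exactly one space between fields, the two counts agree.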

(2) Write the WebLogDriver class

package com.atguigu.mapreduce.weblog;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WebLogDriver {

    public static void main(String[] args) throws Exception {

        // Set the input and output paths to match the actual paths on your own machine
        args = new String[] { "D:/input/inputlog", "D:/output1" };

        // 1 Get the job instance
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2 Set the jar by the driver class
        job.setJarByClass(WebLogDriver.class);

        // 3 Attach the mapper
        job.setMapperClass(WebLogMapper.class);

        // 4 Set the final output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Set the number of reduce tasks to 0 (map-only job)
        job.setNumReduceTasks(0);

        // 5 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 6 Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
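The driver above overwrites args with fixed local paths, which is convenient for a local test run but ignores anything passed on the command line. One possible variation, shown here purely as a sketch, falls back to defaults only when no paths are supplied. PathArgsSketch and resolvePaths are hypothetical names, and the default paths are simply the ones used in the post.

```java
// Hypothetical helper for illustration only; not part of the Hadoop API.
final class PathArgsSketch {

    // Returns {input, output}: the first two command-line values when present,
    // otherwise the supplied defaults.
    static String[] resolvePaths(String[] args, String defaultInput, String defaultOutput) {
        if (args != null && args.length >= 2) {
            return new String[] { args[0], args[1] };
        }
        return new String[] { defaultInput, defaultOutput };
    }

    public static void main(String[] args) {
        String[] paths = resolvePaths(args, "D:/input/inputlog", "D:/output1");
        System.out.println("input  = " + paths[0]);
        System.out.println("output = " + paths[1]);
    }
}
```

Whichever way the paths are chosen, the output directory must not already exist when the job starts, otherwise the job fails at submission.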

posted @ 2021-12-08 21:32 by 哦心有
