3-2-1, Data Cleaning (ETL) and Counters: A Brief Analysis with Hands-On Cases
ETL
- ETL is short for Extract-Transform-Load. It describes the process of extracting data from a source, transforming it, and loading it into a target. The term is most commonly used in the data-warehouse world, but it is not limited to data warehouses.
- Before running the core business MapReduce job, the data usually has to be cleaned first to remove records that do not satisfy the requirements.
- The cleaning step usually only needs a Mapper program; no Reducer is required (see the map-only sketch below).
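A minimal sketch of what "map-only" means in practice (assuming job is the org.apache.hadoop.mapreduce.Job built in the driver; the full driver in section 3.1 below contains the same line):

// With zero reduce tasks there is no shuffle/sort and no reduce phase;
// each Mapper's output is written directly to the output directory.
job.setNumReduceTasks(0);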
Counters
- Hadoop MapReduce maintains a set of built-in counters (records read, bytes written, and so on) and also lets user code define its own counters via context.getCounter(groupName, counterName).increment(n). The totals of all counters are reported when the job finishes, which makes counters a convenient way to count valid vs. invalid records during cleaning.
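A minimal sketch (not from the original post) of both sides of the counter API. The group and counter names ("ETL log", "valid records", "invalid records") are simply the ones used in the Mapper code below; job is assumed to be the submitted org.apache.hadoop.mapreduce.Job:

// Inside a Mapper or Reducer: bump a user-defined counter.
context.getCounter("ETL log", "valid records").increment(1);

// On the client, after the job has finished: read the totals back.
boolean ok = job.waitForCompletion(true);
long valid = job.getCounters().findCounter("ETL log", "valid records").getValue();
long invalid = job.getCounters().findCounter("ETL log", "invalid records").getValue();
System.out.println("valid=" + valid + ", invalid=" + invalid);

When waitForCompletion(true) runs in verbose mode, the counter totals are also printed in the job summary on the console.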
3.1 Hands-On Data Cleaning Case - Simple Version
[Requirement]
- Clean the input web log: split each line on single spaces and keep only the lines that have more than 11 fields; drop the rest.
[Requirement analysis and code implementation]
- In the Map phase, filter and clean each input record according to the rule above.
- Mapper class (writing 1: simple, but not easy to extend or modify later)
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class etlMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Take one line and split it
        String[] fields = value.toString().split(" ");
        // 2. Check the field count; only lines with more than 11 fields are written out
        // 3. Add counters: context.getCounter(groupName, counterName)
        // Sample line:
        // 194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] "GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1" 304 0 "-" "Mozilla/4.0 (compatible;)"
        if (fields.length > 11) {
            context.write(value, NullWritable.get());
            context.getCounter("ETL log", "valid records").increment(1);   // 3. Counter usage
        } else {
            context.getCounter("ETL log", "invalid records").increment(1);
        }
    }
}
- Mapper class (writing 2: the cleaning logic is pulled out into its own method; slightly more code, but lower coupling and easier to modify later)
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EtlMapper2 extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Read one line
        String line = value.toString();
        // 2. Validate it (the field-count check lives in parseLog)
        boolean res = parseLog(line, context);
        if (!res) {
            return;
        }
        // 3. Write out the valid line
        context.write(value, NullWritable.get());
    }

    public boolean parseLog(String line, Context context) {
        // Split the line
        String[] fields = line.split(" ");
        // Check the field count and update the counters
        if (fields.length > 11) {
            context.getCounter("ETL log", "valid records").increment(1);
            return true;
        } else {
            context.getCounter("ETL log", "invalid records").increment(1);
            return false;
        }
    }
}
- Driver class
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class etlDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1. Get the job object and pass in the configuration
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2. Set the jar (driver) class
        job.setJarByClass(etlDriver.class);
        // 3. Set the Mapper
        job.setMapperClass(etlMapper.class);
        // 4. Set the Mapper output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        // 5. Set the final output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // 5-1. Map-only job: no reduce tasks
        job.setNumReduceTasks(0);
        // 6. Set the input and output directories
        FileInputFormat.setInputPaths(job, new Path("D:\\user\\etl\\input"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\user\\etl\\output"));
        // 7. Submit
        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}
Output results
Additional debugging notes:
3.2 Hands-On Data Cleaning Case - Complex Version
[Requirement]
- Parse each line of web.log into its individual fields; a record is invalid if it has 11 fields or fewer, or if its HTTP status code is 400 or greater. Only valid records are written out.
[Requirement analysis and code implementation]
- Because individual fields of the log file have to be examined, each line of web.log is split and its fields are wrapped in a Bean class. The LogBean code is as follows:
// Wraps the fields of one log record
public class LogBean {
    private String remote_addr;        // Client IP address
    private String remote_user;        // Client user name; "-" means the field is absent
    private String time_local;         // Access time and time zone
    private String request;            // Requested URL and HTTP protocol
    private String status;             // Response status code; 200 means success
    private String body_bytes_sent;    // Size of the body sent to the client
    private String http_referer;       // Page the request was linked from
    private String http_user_agent;    // Client browser information
    private boolean valid = true;      // Whether the record is valid
    public String getRemote_addr() {
        return remote_addr;
    }
    public void setRemote_addr(String remote_addr) {
        this.remote_addr = remote_addr;
    }
    public String getRemote_user() {
        return remote_user;
    }
    public void setRemote_user(String remote_user) {
        this.remote_user = remote_user;
    }
    public String getTime_local() {
        return time_local;
    }
    public void setTime_local(String time_local) {
        this.time_local = time_local;
    }
    public String getRequest() {
        return request;
    }
    public void setRequest(String request) {
        this.request = request;
    }
    public String getStatus() {
        return status;
    }
    public void setStatus(String status) {
        this.status = status;
    }
    public String getBody_bytes_sent() {
        return body_bytes_sent;
    }
    public void setBody_bytes_sent(String body_bytes_sent) {
        this.body_bytes_sent = body_bytes_sent;
    }
    public String getHttp_referer() {
        return http_referer;
    }
    public void setHttp_referer(String http_referer) {
        this.http_referer = http_referer;
    }
    public String getHttp_user_agent() {
        return http_user_agent;
    }
    public void setHttp_user_agent(String http_user_agent) {
        this.http_user_agent = http_user_agent;
    }
    public boolean isValid() {
        return valid;
    }
    public void setValid(boolean valid) {
        this.valid = valid;
    }
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        sb.append(this.valid);
        sb.append("\001").append(this.remote_addr);
        sb.append("\001").append(this.remote_user);
        sb.append("\001").append(this.time_local);
        sb.append("\001").append(this.request);
        sb.append("\001").append(this.status);
        sb.append("\001").append(this.body_bytes_sent);
        sb.append("\001").append(this.http_referer);
        sb.append("\001").append(this.http_user_agent);
        return sb.toString();
    }
}
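A quick, hypothetical usage check of the bean (not part of the job). The fields are joined with the '\001' control character in toString(), a separator commonly chosen because it rarely appears in web logs and is the default field delimiter of Hive tables, so the cleaned output can be loaded into Hive directly:

// Hypothetical quick check of LogBean.toString(); unset fields print as "null".
LogBean bean = new LogBean();
bean.setRemote_addr("194.237.142.21");
bean.setStatus("304");
System.out.println(bean);   // true \u0001 194.237.142.21 \u0001 null \u0001 ... \u0001 304 \u0001 ...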
- As in the previous case, the parsing logic is separated out into its own method and counters are added; only the records that pass the given rules are written out from the Mapper.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line
        String line = value.toString();
        // 2. Parse the line and check whether it is valid
        LogBean bean = parseLog(line);
        if (!bean.isValid()) {
            context.getCounter("ETL log", "invalid records").increment(1);
            return;
        }
        context.getCounter("ETL log", "valid records").increment(1);
        k.set(bean.toString());
        // 3. Write out the valid record
        context.write(k, NullWritable.get());
    }

    // Parse one log line into a LogBean
    private LogBean parseLog(String line) {
        LogBean logBean = new LogBean();
        // 1. Split
        String[] fields = line.split(" ");
        if (fields.length > 11) {
            // 2. Fill in the bean, e.g. for:
            // 194.237.142.21 - - [18/Sep/2013:06:49:18 +0000]
            // "GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1" 304 0 "-" "Mozilla/4.0 (compatible;)"
            logBean.setRemote_addr(fields[0]);
            logBean.setRemote_user(fields[1]);
            logBean.setTime_local(fields[3].substring(1));
            logBean.setRequest(fields[6]);
            logBean.setStatus(fields[8]);
            logBean.setBody_bytes_sent(fields[9]);
            logBean.setHttp_referer(fields[10]);
            if (fields.length > 12) {
                logBean.setHttp_user_agent(fields[11] + " " + fields[12]);
            } else {
                logBean.setHttp_user_agent(fields[11]);
            }
            // A status code of 400 or above is an HTTP error
            if (Integer.parseInt(logBean.getStatus()) >= 400) {
                logBean.setValid(false);
            }
        } else {
            logBean.setValid(false);
        }
        return logBean;
    }
}
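The numeric indices used in parseLog (fields[0], fields[3], fields[6], ...) follow from how the sample line splits on single spaces. A small, self-contained check (the class name SplitIndexDemo is made up for illustration) prints each index and the field it holds:

public class SplitIndexDemo {
    public static void main(String[] args) {
        // The sample line from the comments above.
        String line = "194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] "
                + "\"GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1\" "
                + "304 0 \"-\" \"Mozilla/4.0 (compatible;)\"";
        String[] fields = line.split(" ");
        for (int i = 0; i < fields.length; i++) {
            System.out.println(i + " -> " + fields[i]);
        }
        // 0 -> ip, 1 -> user, 3 -> time (with leading '['), 6 -> request URL,
        // 8 -> status, 9 -> body bytes, 10 -> referer, 11..12 -> user agent
    }
}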
- Driver class
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1. Get the job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2. Set the jar class
        job.setJarByClass(LogDriver.class);
        // 3. Set the Mapper
        job.setMapperClass(LogMapper.class);
        // 4. Set the final output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // 5. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("D:\\user\\etl\\input"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\user\\etl\\output-2"));
        // 6. Submit
        job.waitForCompletion(true);
    }
}