Hadoop-MR Log Cleaning (Part 2)

(Continued from: Hadoop-MR Log Cleaning (Part 1))

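4. groupbycount test

This section writes a Hadoop-MR groupbycount program to test the Hadoop runtime environment. It also serves as a refresher on writing MapReduce programs.

To avoid disturbing the structure of the logparser project, I created a separate groupbycount project, configured the same way as logparser.

Initial structure: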

4.1 Preparing the source file

The source file is a subset of a data set from my work.

Download: 2018-08-29-15-03-13_1959.rar

Sample data:

leeyk99    fr    fr    pc    Boutique    Day Dresses            18    0    0    20180828

leeyk99    fr    fr    pc    Women    Accessories    Phone Cases        4    0    0    20180828

leeyk99    fr    fr    pc    Women    Accessories    Scarves        5    0    0    20180828

leeyk99    fr    fr    pc    Women    Beauty    Beauty Tools    Makeup Tools    1    0    0    20180828

leeyk99    fr    fr    pc    Women    Clothing    Beachwear    Swimwear    4    0    0    20180828

leeyk99    fr    fr    pc    Women    Clothing    Bottoms    Shorts    1    0    0    20180828

Data structure:
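As a rough sketch of how one row breaks into fields, the snippet below splits the first sample row on tabs. It assumes that the runs of spaces in the sample rows were tab characters in the original file, with empty strings standing in for missing category levels. The class name `SampleRowSplit` and the reconstructed row are illustrative and not taken from the project.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the groupbycount project.
class SampleRowSplit {

    // First sample row, reconstructed on the ASSUMPTION that fields are
    // tab-separated and that the two missing category levels are empty strings.
    static final String ROW =
            "leeyk99\tfr\tfr\tpc\tBoutique\tDay Dresses\t\t\t18\t0\t0\t20180828";

    // Splits a row on tabs. The limit of -1 keeps trailing empty fields,
    // which a plain split("\t") would silently drop.
    static String[] fields(String row) {
        return row.split("\t", -1);
    }

    public static void main(String[] args) {
        String[] f = fields(ROW);
        System.out.println(f.length + " fields: " + Arrays.toString(f));
    }
}
```

Only the first field is used as the grouping key in this test, so empty category fields do not affect the count.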

4.2 map function

package com.leeyk99.udp;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class GroupByCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused across map() calls so no new Writable is allocated per record.
    private static final IntWritable ONE = new IntWritable(1);
    private final Text siteTp = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each input line is tab-separated; the first field is the grouping key.
        String[] lineSplit = value.toString().split("\t");
        siteTp.set(lineSplit[0]);
        // Emit (key, 1) so the reducer can sum the ones per key.
        context.write(siteTp, ONE);
    }
}
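The mapper takes `lineSplit[0]` without checking the line, so a blank line would be counted under an empty key. A minimal sketch of a more defensive key extraction follows, written in plain Java so that it runs without Hadoop on the classpath. `GroupKeySketch` and `groupKey` are hypothetical names, and skipping blank lines is an assumption about the desired behavior, not something the job does.

```java
import java.util.Optional;

// Hypothetical helper, not part of the groupbycount project.
class GroupKeySketch {

    // Returns the first tab-separated field, or Optional.empty() when the line
    // is null, blank, or starts with an empty (or whitespace-only) first field.
    static Optional<String> groupKey(String line) {
        if (line == null || line.trim().isEmpty()) {
            return Optional.empty();
        }
        // A limit of 2 stops splitting after the first tab; only field 0 is needed.
        String first = line.split("\t", 2)[0];
        return first.trim().isEmpty() ? Optional.empty() : Optional.of(first);
    }

    public static void main(String[] args) {
        System.out.println(groupKey("leeyk99\tfr\tfr\tpc"));
        System.out.println(groupKey(""));
    }
}
```

Inside a mapper, an empty result would simply mean returning from `map` without calling `context.write`.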

4.3 reduce function

package com.leeyk99.udp;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class GroupByCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum every count emitted for this key. A primitive int avoids boxing on each addition.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
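Taken together, the mapper and reducer count the rows for each distinct first field. The same result can be sketched in plain Java, which is handy for checking the job's output against a small sample. `LocalGroupByCount` is a hypothetical name, and the input rows in `main` (including the key `othersite`) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory equivalent of the job, not part of the project.
class LocalGroupByCount {

    // The "map" step takes the first tab-separated field as the key;
    // the "reduce" step adds 1 for every occurrence of that key.
    static Map<String, Integer> count(List<String> lines) {
        // TreeMap keeps keys sorted, much like reducer input keys arrive in sorted order.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String key = line.split("\t")[0];
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "leeyk99\tfr\tfr\tpc",
                "leeyk99\tfr\tfr\tpc",
                "othersite\tfr\tfr\tpc"); // made-up second key
        System.out.println(count(lines)); // prints {leeyk99=2, othersite=1}
    }
}
```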

4.4 Program entry point

package com.leeyk99.udp;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.File;

public class GroupByCount {

    // Declaring Exception is broad; it covers IOException, InterruptedException and ClassNotFoundException.
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: GroupByCount <input path> <output path>");
            System.exit(-1);
        }

        // Job.getInstance() replaces the Job() constructor, which is deprecated in Hadoop 2.x.
        Job job = Job.getInstance();
        job.setJarByClass(GroupByCount.class);
        job.setJobName("job-GroupByCount");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(GroupByCountMapper.class);
        job.setReducerClass(GroupByCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The job fails if the output directory already exists, so remove it first.
        // delDir uses java.io.File, so this only works when the output path is on the local file system.
        delDir(args[1]);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Recursively deletes a local file or directory.
    private static void delDir(String path) {
        File f = new File(path);
        if (!f.exists()) {
            System.out.println("Output directory does not exist.");
            return;
        }
        if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children != null) { // listFiles() returns null on an I/O error
                for (File child : children) {
                    delDir(child.getPath());
                }
            }
        }
        f.delete(); // deletes a file, or the directory once it is empty
    }
}

4.5 del function

Here it is written in the entry class; it could also be moved into a separate class.

    // Recursively deletes a local file or directory.
    private static void delDir(String path) {
        File f = new File(path);
        if (!f.exists()) {
            System.out.println("Output directory does not exist.");
            return;
        }
        if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children != null) { // listFiles() returns null on an I/O error
                for (File child : children) {
                    delDir(child.getPath());
                }
            }
        }
        f.delete(); // deletes a file, or the directory once it is empty
    }
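As an alternative sketch, the same recursive delete can be written with `java.nio.file`, which throws an exception when a delete fails, whereas `File.delete()` only returns `false`. Like `delDir`, it only touches the local file system. `NioDelete` and `deleteRecursively` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not part of the groupbycount project.
class NioDelete {

    // Deletes a file or a whole directory tree.
    // Returns false if the path did not exist, true once it has been removed.
    static boolean deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return false;
        }
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(dir); // the directory is empty by the time it is post-visited
                return FileVisitResult.CONTINUE;
            }
        });
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Pass the directory to remove as the only argument, e.g. the job's output directory.
        if (args.length == 1) {
            System.out.println(deleteRecursively(Paths.get(args[0])));
        }
    }
}
```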

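4.6 Test

In IDEA, a run configuration is needed (Run/Debug Configurations): click +, create an Application, and choose the main class (usually you type the project or class name, and the lookup dialog already lists the available main classes). Then set the program arguments (Program arguments): input/ output/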
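Once the job finishes, the counts can be checked by reading the reducer output. The sketch below assumes the default text output format, which writes one `key<TAB>value` line per key into files named like `part-r-00000` under the output directory. `ReadCounts` is a hypothetical helper, and the default path in `main` simply follows the `output/` argument.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the groupbycount project.
class ReadCounts {

    // Parses "key<TAB>count" lines from one reducer output file.
    // Lines without a tab are skipped.
    static Map<String, Integer> read(Path partFile) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : Files.readAllLines(partFile, StandardCharsets.UTF_8)) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue;
            }
            counts.put(line.substring(0, tab), Integer.parseInt(line.substring(tab + 1).trim()));
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // ASSUMPTION: a single reducer, so all counts land in part-r-00000.
        Path part = Paths.get(args.length > 0 ? args[0] : "output/part-r-00000");
        if (Files.exists(part)) {
            System.out.println(read(part));
        } else {
            System.out.println("No such file: " + part);
        }
    }
}
```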