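MapReduce Lab Notes

MapReduce Applications

(1) Data deduplication

Every value is set to empty (NullWritable). During the shuffle, records with the same key are grouped into a single reduce call, so writing each key once removes the duplicates.

Source data:

user_id   goods_id   favorite_time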
10181   1000481   2010-04-04 16:54:31  
20001   1001597   2010-04-07 15:07:52  
20001   1001560   2010-04-07 15:08:27  
20042   1001368   2010-04-08 08:20:30

Deduplicate by goods_id to list which products users have favorited.

Map phase

public static class Map extends Mapper<Object, Text, Text, NullWritable> {
    // Copies the goods_id field of each input line to the output key; the value stays empty.
    private final Text newKey = new Text();   // reused output key

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        System.out.println(line);                    // debug: print each input line
        // Split on runs of two or more spaces, so the single space inside the timestamp is kept.
        String[] arr = line.split(" {2,}");
        newKey.set(arr[1]);                          // goods_id becomes the key
        context.write(newKey, NullWritable.get());   // the value is empty
        System.out.println(newKey);                  // debug: print each emitted key
    }
}

Reduce phase

public static class Reduce extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    public void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // All records with the same key arrive in one call, so each key is written exactly once.
        context.write(key, NullWritable.get());
    }
}
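
The grouping that makes this work can be tried without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: a TreeSet stands in for the shuffle, which hands each distinct key to the reducer once and in sorted order. The class name DedupSketch and the fourth input row (a made-up repeat of goods_id 1001597) are assumptions added for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DedupSketch {
    // Mimics map + shuffle + reduce: take goods_id from each line and keep each key once.
    static Set<String> distinctGoodsIds(List<String> lines) {
        Set<String> keys = new TreeSet<>();   // sorted and duplicate-free, like reducer input keys
        for (String line : lines) {
            String[] arr = line.trim().split(" {2,}");   // same split as the Map above
            if (arr.length >= 2) {
                keys.add(arr[1]);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "10181   1000481   2010-04-04 16:54:31",
            "20001   1001597   2010-04-07 15:07:52",
            "20001   1001560   2010-04-07 15:08:27",
            "20042   1001597   2010-04-08 08:20:30");   // hypothetical row with a repeated goods_id
        System.out.println(distinctGoodsIds(lines));    // [1000481, 1001560, 1001597]
    }
}
```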

(2) Computing an average

Source data:

category_id   click_count
52127   5  
52120   93  
52092   93  
52132   38  
52006   462  
52109   28  
52109   43

Result:

category_id   avg_click_count
52006   462  
52009   2615  
52024   347  
52090   11  
52092   93  
52109   35

Map phase

public static class Map extends Mapper<Object, Text, Text, IntWritable> {
    private final Text newKey = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();         // convert the input line to a String
        System.out.println(line);               // debug: print each input line
        String[] arr = line.split(" +");        // " +" splits on one or more spaces
        newKey.set(arr[0]);                     // category_id becomes the key
        int click = Integer.parseInt(arr[1]);   // click count, parsed as int, becomes the value
        context.write(newKey, new IntWritable(click));
    }
}

Reduce phase

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int num = 0;     // running sum of click counts
        int count = 0;   // number of values seen for this key
        for (IntWritable val : values) {
            num += val.get();
            count++;
        }
        int avg = num / count;   // integer division: the fractional part is dropped

        context.write(key, new IntWritable(avg));
    }
}
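
Because the Reduce divides one int by another, the average is truncated: category 52109 has click counts 28 and 43, and (28 + 43) / 2 is reported as 35 rather than 35.5. The plain-Java sketch below shows the two variants side by side. The class and method names are assumptions for illustration, and both methods assume a non-empty list (a reducer always receives at least one value). In a Hadoop job, keeping the fraction would also mean switching the output value type, for example to DoubleWritable.

```java
import java.util.Arrays;
import java.util.List;

class AverageSketch {
    // Integer average, as in the Reduce above: the fractional part is discarded.
    static int truncatedAverage(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    // Floating-point average that keeps the fraction.
    static double exactAverage(List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return (double) sum / values.size();
    }

    public static void main(String[] args) {
        List<Integer> clicks = Arrays.asList(28, 43);   // category 52109 in the source data
        System.out.println(truncatedAverage(clicks));   // 35
        System.out.println(exactAverage(clicks));       // 35.5
    }
}
```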

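(3) Sorting

Key point:

MapReduce sorts records by key by default as part of the shuffle. If the key is an IntWritable (which wraps an int), keys are sorted by numeric value; if the key is a Text (which wraps a String), keys are sorted in dictionary (lexicographic) order.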
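
The difference between the two orders matters for click counts. The sketch below is plain Java, not Hadoop: Integer ordering stands in for IntWritable, and String.compareTo stands in for Text's comparison, which should give the same order for ASCII digit strings like these. The class and method names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortOrderSketch {
    // Numeric order, as with IntWritable keys.
    static List<Integer> numericOrder(List<Integer> keys) {
        List<Integer> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted;
    }

    // Dictionary order, as when the same numbers are emitted as text keys.
    static List<String> dictionaryOrder(List<Integer> keys) {
        List<String> sorted = new ArrayList<>();
        for (int k : keys) {
            sorted.add(String.valueOf(k));
        }
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        List<Integer> clicks = Arrays.asList(100, 97, 96, 104, 103);
        System.out.println(numericOrder(clicks));      // [96, 97, 100, 103, 104]
        System.out.println(dictionaryOrder(clicks));   // [100, 103, 104, 96, 97]
    }
}
```

This is why the sorting job parses the click count and uses it as an IntWritable key instead of passing the raw text through.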

Source data:

goods_id   click_count
1010037 100  
1010102 100  
1010152 97  
1010178 96  
1010280 104  
1010320 103

Result:

click_count   goods_id
96  1010603  
96  1010178  
97  1010637  
97  1010152  
100 1010102

Map phase:

public static class Map extends Mapper<Object, Text, IntWritable, Text> {
    private final Text goods = new Text();
    private final IntWritable num = new IntWritable();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] arr = line.split(" +");
        // Parse the click count and wrap it in an IntWritable key; the framework sorts by key.
        num.set(Integer.parseInt(arr[1]));
        goods.set(arr[0]);
        context.write(num, goods);
    }
}

Reduce phase:

public static class Reduce extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Keys arrive already sorted; write one line per goods_id that shares this click count.
        for (Text val : values) {
            context.write(key, val);
        }
    }
}

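(4) Single-table join

If B is a friend of A and C is a friend of B, then A and C are indirect friends. A single-table join (a self-join) finds these indirect friends by joining the table with itself.

Source data:

user_id   friend_id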
10001   10002  
10002   10005  
10003   10002  
10004   10006  
10005   10007

Result:

friend_id   user_id
10005   10001  
10005   10003

Map phase:

public static class Map extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] arr = line.split("\t");   // split the line on the tab separator
        String mapkey = arr[0];            // user_id
        String mapvalue = arr[1];          // friend_id
        // Left table: keyed by user_id, value tagged with "1".
        context.write(new Text(mapkey), new Text("1+" + mapvalue));
        // Right table: keyed by friend_id, value tagged with "2".
        context.write(new Text(mapvalue), new Text("2+" + mapkey));
    }
}

Reduce phase:

// By the time reduce runs, all values that share a key have been collected into the values Iterable.
// 1. Create two lists, buyer and friends, to hold the two kinds of map output (left table and right table).
// 2. Iterate over values, read each entry into record, and route it to a list by its tag.
// 3. Write the Cartesian product of the two lists.
// Requires java.util.ArrayList and java.util.List.

public static class Reduce extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> buyer = new ArrayList<>();     // left-table values (tag 1)
        List<String> friends = new ArrayList<>();   // right-table values (tag 2)
        for (Text val : values) {
            String record = val.toString();         // the current value
            if (record.isEmpty()) {
                continue;
            }
            char relationtype = record.charAt(0);   // left/right table tag
            String id = record.substring(2);        // skip the tag and the '+' separator
            if ('1' == relationtype) {
                buyer.add(id);
            } else if ('2' == relationtype) {
                friends.add(id);
            }
        }
        // Cartesian product of buyer and friends; if either list is empty nothing is written.
        for (String b : buyer) {
            for (String f : friends) {
                if (!b.equals(f)) {   // compare string contents, not references
                    context.write(new Text(b), new Text(f));
                }
            }
        }
    }
}
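
To check the join logic without running a job, the tagging and the per-key Cartesian product can be simulated in plain Java. This is a sketch under assumptions, not Hadoop code: two TreeMaps stand in for the shuffle (one per tag), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SelfJoinSketch {
    // Returns "friendOfFriend<TAB>user" for every chain user -> friend -> friendOfFriend.
    static List<String> indirectFriends(List<String[]> pairs) {
        // "Map + shuffle": group the tagged values by join key.
        Map<String, List<String>> left = new TreeMap<>();    // key -> friends of key (tag 1)
        Map<String, List<String>> right = new TreeMap<>();   // key -> users who list key as a friend (tag 2)
        for (String[] p : pairs) {
            left.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
            right.computeIfAbsent(p[1], k -> new ArrayList<>()).add(p[0]);
        }
        // "Reduce": for each key, pair every tag-1 value with every tag-2 value.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : left.entrySet()) {
            List<String> users = right.get(e.getKey());
            if (users == null) {
                continue;   // nobody lists this key as a friend, so no chain passes through it
            }
            for (String friendOfFriend : e.getValue()) {
                for (String user : users) {
                    if (!friendOfFriend.equals(user)) {
                        out.add(friendOfFriend + "\t" + user);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[] {"10001", "10002"},
            new String[] {"10002", "10005"},
            new String[] {"10003", "10002"});
        for (String row : indirectFriends(pairs)) {
            System.out.println(row);
        }
    }
}
```

On these three rows the sketch prints 10005 paired with 10001 and with 10003, the same two rows listed in the result above.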

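(5) Map-side join

When to use:

Two tables, where one is very large and the other is small enough to be loaded into memory.

Source data:

orders table
order_id   order_no        user_id   order_time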
52304   111215052630    176474  2011-12-15 04:58:21  
52303   111215052629    178350  2011-12-15 04:45:31  
52302   111215052628    172296  2011-12-15 03:12:23  
52301   111215052627    178348  2011-12-15 02:37:32  
52300   111215052626    174893  2011-12-15 02:18:56  

order_items1 table
item_id   order_id   goods_id
252578  52293   1016840  
252579  52293   1014040  
252580  52294   1014200  
252581  52294   1001012  
252582  52294   1022245  
252583  52294   1014724  
252584  52294   1010731  
252586  52295   1023399

Result:

order_id   user_id   order_time            goods_id
52293   178338  2011-12-15 00:13:07 1016840  
52293   178338  2011-12-15 00:13:07 1014040  
52294   178341  2011-12-15 00:14:37 1010731  
52294   178341  2011-12-15 00:14:37 1014724  
52294   178341  2011-12-15 00:14:37 1022245  
52294   178341  2011-12-15 00:14:37 1014200  
52294   178341  2011-12-15 00:14:37 1001012

Map phase

public static class MyMapper extends Mapper<Object, Text, Text, Text> {
    private Map<String, String> dict = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Locate the cached orders file (getLocalCacheFiles() is deprecated in newer Hadoop releases).
        String fileName = context.getLocalCacheFiles()[0].getName();
        System.out.println(fileName);
        // Read the orders file line by line and load it into memory.
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
            String line;
            while (null != (line = reader.readLine())) {
                // key: order_id; value: user_id + tab + order_time
                String[] str = line.split("\t");
                dict.put(str[0], str[2] + "\t" + str[3]);
            }
        }
    }

    // For each order_items1 row, check whether its order_id is present in dict.
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] kv = value.toString().split("\t");
        if (dict.containsKey(kv[1])) {
            context.write(new Text(kv[1]), new Text(dict.get(kv[1]) + "\t" + kv[2]));
        }
    }
}

Reduce phase

public static class MyReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The join already happened in the mapper; pass each joined row through.
        for (Text text : values) {
            context.write(key, text);
        }
    }
}
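
The core of a map-side join is an in-memory hash lookup. The plain-Java sketch below mirrors setup() and map() above without any Hadoop classes, so the logic can be tested locally. The class and method names are assumptions, and the order_no value "ORDER_NO" is a placeholder because the source does not show that field for order 52293.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapSideJoinSketch {
    // Like setup(): build the lookup order_id -> "user_id<TAB>order_time" from the table held in memory.
    static Map<String, String> loadOrders(List<String> orderLines) {
        Map<String, String> dict = new HashMap<>();
        for (String line : orderLines) {
            String[] f = line.split("\t");
            if (f.length >= 4) {
                dict.put(f[0], f[2] + "\t" + f[3]);
            }
        }
        return dict;
    }

    // Like map(): stream the other table and join each row against the lookup.
    static List<String> join(Map<String, String> dict, List<String> itemLines) {
        List<String> out = new ArrayList<>();
        for (String line : itemLines) {
            String[] kv = line.split("\t");
            if (kv.length >= 3 && dict.containsKey(kv[1])) {
                out.add(kv[1] + "\t" + dict.get(kv[1]) + "\t" + kv[2]);
            }
            // Rows whose order_id is not in the lookup are dropped (inner join).
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> dict = loadOrders(Arrays.asList(
            "52293\tORDER_NO\t178338\t2011-12-15 00:13:07"));   // ORDER_NO is a placeholder
        List<String> items = Arrays.asList(
            "252578\t52293\t1016840",
            "252579\t52293\t1014040",
            "252586\t52295\t1023399");   // no matching order in the lookup
        for (String row : join(dict, items)) {
            System.out.println(row);
        }
    }
}
```

Since every row is matched inside the mapper, the join itself needs no shuffle; the Reduce above only passes rows through.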

(6) Secondary sort

Sort the products by click count (click_num) in descending order, then by goods_id in ascending order, and output all products.

源数据:

goods_id click_num  
1010037 100  
1010102 100  
1010152 97  
1010178 96  
1010280 104  
1010320 103

Result:

click_num   goods_id
------------------------------------------------  
104 1010280  
104 1010510  
------------------------------------------------  
103 1010320  
------------------------------------------------  
100 1010037  
100 1010102  
------------------------------------------------

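Approach:

Every key has to be compared on two levels: sort by the first field, and when the first fields are equal, sort by the second field.

We can build a composite key class, IntPair, with two fields. Partitioning is driven by the first field so that it is sorted first, and the comparison within each partition then orders records by the second field.

The job has four parts: the custom key, the custom partitioner class, the map phase, and the reduce phase.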
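
The two-level comparison at the heart of the composite key can be sketched in plain Java. The source names the class IntPair but does not show it, so the field names and the class name IntPairSketch here are assumptions. In a real job the key would also need to implement Hadoop's WritableComparable (adding write and readFields) and be paired with the custom partitioner; this sketch covers only the ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class IntPairSketch implements Comparable<IntPairSketch> {
    final int clickNum;   // first field: sorted in descending order
    final int goodsId;    // second field: sorted in ascending order

    IntPairSketch(int clickNum, int goodsId) {
        this.clickNum = clickNum;
        this.goodsId = goodsId;
    }

    @Override
    public int compareTo(IntPairSketch other) {
        int byClicks = Integer.compare(other.clickNum, clickNum);   // reversed arguments: descending
        if (byClicks != 0) {
            return byClicks;
        }
        return Integer.compare(goodsId, other.goodsId);             // tie-break: ascending
    }

    @Override
    public String toString() {
        return clickNum + " " + goodsId;
    }

    // Returns a sorted copy, leaving the input list untouched.
    static List<IntPairSketch> sorted(List<IntPairSketch> keys) {
        List<IntPairSketch> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<IntPairSketch> keys = Arrays.asList(
            new IntPairSketch(100, 1010037), new IntPairSketch(100, 1010102),
            new IntPairSketch(97, 1010152), new IntPairSketch(96, 1010178),
            new IntPairSketch(104, 1010280), new IntPairSketch(103, 1010320));
        for (IntPairSketch k : sorted(keys)) {
            System.out.println(k);
        }
    }
}
```

Run on the six source rows, this prints 104 1010280 first and 96 1010178 last, with the two 100-click rows ordered by goods_id.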

Posted 2020-11-17 17:12 by 西西里啊
