MapReduce_PVUV

On the web, most sites need to track PV and UV statistics for their pages. Put simply, PV is the number of times a URL is accessed, while UV is the number of distinct IPs that access it.

In other words, PV is the raw traffic volume: the total hit count, with no deduplication, so repeated visits from the same IP all count.

UV, by contrast, is the visitor count: each IP counts only once, i.e. the IPs are deduplicated.
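
Before the MapReduce versions, here is a minimal plain-Java sketch (class name hypothetical) that makes the distinction concrete on the test data used below: PV counts every record, UV counts distinct IPs.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PvUvSketch {
    public static void main(String[] args) {
        // Each log line is "ip page", matching the test data below.
        List<String> lines = Arrays.asList(
                "192.168.1.1 aa", "192.168.1.2 bb", "192.168.1.3 cc",
                "192.168.1.1 dd", "192.168.1.1 ee");
        long pv = lines.size();                       // PV: every hit counts, no dedup
        Set<String> ips = new HashSet<String>();
        for (String line : lines) {
            ips.add(line.split(" ")[0]);              // UV: dedup by IP
        }
        System.out.println("pv=" + pv + " uv=" + ips.size()); // prints pv=5 uv=3
    }
}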

PV:

Test data (each line: an IP, then a page field):

192.168.1.1 aa
192.168.1.2 bb
192.168.1.3 cc
192.168.1.1 dd
192.168.1.1 ee

 

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class IpPv {
    private static final String INPUT_PATH = "hdfs://h201:9000/user/hadoop/input";
    private static final String OUTPUT_PATH = "hdfs://h201:9000/user/hadoop/output";

    public static class IpPvUvMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        IntWritable one = new IntWritable(1);

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Every input line is one page view, so a single constant key is enough.
            // To count views per IP instead (e.g. for a top-100 report), emit the IP as the key:
            //   String ip = value.toString().split(" ", 5)[0];
            //   context.write(new Text(ip), one);
            context.write(new Text("pv"), one);
        }
    }

    public static class IpPvUvReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, URISyntaxException, ClassNotFoundException, InterruptedException {
        System.out.println(args.length);                        // debug: print the CLI argument count
        Configuration conf = new Configuration();
        conf.set("mapred.jar", "pv.jar");                       // declare the jar name: pv.jar
        final FileSystem fileSystem = FileSystem.get(new URI(OUTPUT_PATH), conf);
        fileSystem.delete(new Path(OUTPUT_PATH), true);         // the output path must not already exist
        Job job = new Job(conf, "PV");
        job.setJarByClass(IpPv.class);
        FileInputFormat.setInputPaths(job, INPUT_PATH);

        // set mapper & reducer class
        job.setMapperClass(IpPvUvMap.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Summing 1s is associative and commutative, so the reducer can double as a combiner.
        job.setCombinerClass(IpPvUvReduce.class);
        job.setReducerClass(IpPvUvReduce.class);

        // set output key/value class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH)); // write results here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
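
As the commented-out lines in the mapper hint, keying by IP instead of the constant "pv" turns the same job into a per-IP hit count, the raw material for a top-100 report. A minimal sketch of that variant mapper (class name hypothetical, reusing IpPv's imports; the sum reducer stays unchanged):

    public static class IpHitCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final IntWritable one = new IntWritable(1);
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Emit the IP itself as the key, so the reducer yields one (ip, count) pair per address.
            String ip = value.toString().split(" ", 5)[0];
            context.write(new Text(ip.trim()), one);
        }
    }

Sorting the resulting (ip, count) pairs by count, in a small follow-up job or locally, then yields the top 100.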

[hadoop@h201 IpPv]$ /usr/jdk1.7.0_25/bin/javac IpPv.java
Note: IpPv.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
[hadoop@h201 IpPv]$ /usr/jdk1.7.0_25/bin/jar cvf pv.jar IpPv*class
added manifest
adding: IpPv.class(in = 2207) (out= 1116)(deflated 49%)
adding: IpPv$IpPvUvMap.class(in = 1682) (out= 658)(deflated 60%)
adding: IpPv$IpPvUvReduce.class(in = 1743) (out= 752)(deflated 56%)
[hadoop@h201 IpPv]$ hadoop jar pv.jar IpPv
18/04/22 20:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/22 20:08:13 INFO client.RMProxy: Connecting to ResourceManager at h201/192.168.121.132:8032
18/04/22 20:08:13 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/04/22 20:08:14 INFO input.FileInputFormat: Total input paths to process : 1
18/04/22 20:08:14 INFO mapreduce.JobSubmitter: number of splits:1
18/04/22 20:08:14 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
18/04/22 20:08:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516635595760_0053
18/04/22 20:08:14 INFO impl.YarnClientImpl: Submitted application application_1516635595760_0053
18/04/22 20:08:14 INFO mapreduce.Job: The url to track the job: http://h201:8088/proxy/application_1516635595760_0053/
18/04/22 20:08:14 INFO mapreduce.Job: Running job: job_1516635595760_0053
18/04/22 20:08:21 INFO mapreduce.Job: Job job_1516635595760_0053 running in uber mode : false
18/04/22 20:08:21 INFO mapreduce.Job:  map 0% reduce 0%
18/04/22 20:08:29 INFO mapreduce.Job:  map 100% reduce 0%
18/04/22 20:08:36 INFO mapreduce.Job:  map 100% reduce 100%
18/04/22 20:08:36 INFO mapreduce.Job: Job job_1516635595760_0053 completed successfully
18/04/22 20:08:36 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=15
                FILE: Number of bytes written=219335
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=182
                HDFS: Number of bytes written=5
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=5545
                Total time spent by all reduces in occupied slots (ms)=3564
                Total time spent by all map tasks (ms)=5545
                Total time spent by all reduce tasks (ms)=3564
                Total vcore-seconds taken by all map tasks=5545
                Total vcore-seconds taken by all reduce tasks=3564
                Total megabyte-seconds taken by all map tasks=5678080
                Total megabyte-seconds taken by all reduce tasks=3649536
        Map-Reduce Framework
                Map input records=5
                Map output records=5
                Map output bytes=35
                Map output materialized bytes=15
                Input split bytes=107
                Combine input records=5
                Combine output records=1
                Reduce input groups=1
                Reduce shuffle bytes=15
                Reduce input records=1
                Reduce output records=1
                Spilled Records=2
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=677
                CPU time spent (ms)=1350
                Physical memory (bytes) snapshot=224731136
                Virtual memory (bytes) snapshot=2147983360
                Total committed heap usage (bytes)=136712192
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=75
        File Output Format Counters
                Bytes Written=5

Result:

[hadoop@h201 ~]$ hadoop fs -cat /user/hadoop/output/part-r-00000
18/04/22 20:08:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
pv      5
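
The counters confirm the flow: Map output records=5 (one "pv" per input line), and the combiner collapsed those five 1s into a single partial sum before the shuffle (Combine input records=5, Combine output records=1).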

UV:

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class IpUv {
    private static final String INPUT_PATH = "hdfs://h201:9000/user/hadoop/input2";
    private static final String OUTPUT_PATH = "hdfs://h201:9000/user/hadoop/output";
    private final static IntWritable one = new IntWritable(1);

    // Job 1: deduplicate. Key every record by its IP; the reducer then emits
    // each distinct IP exactly once.
    public static class IpUvMapper1 extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String ip = value.toString().split(" ", 5)[0];
            context.write(new Text(ip.trim()), one);            // trim() strips whitespace from both ends
        }
    }

    public static class IpUvReducer1 extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            context.write(key, new IntWritable(1));             // one output line per distinct IP
        }
    }

    // Job 2: count job 1's output lines under the single key "uv".
    public static class IpUvMapper2 extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each line of job 1's output ("ip\t1") stands for one distinct IP.
            context.write(new Text("uv"), one);
        }
    }

    public static class IpUvReducer2 extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // values arrives as ("uv", [1, 1, 1, ...]) -- one 1 per distinct IP
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        Configuration conf = new Configuration();
        conf.set("mapred.jar", "uv.jar");
        final FileSystem fileSystem = FileSystem.get(new URI(OUTPUT_PATH), conf);
        fileSystem.delete(new Path(OUTPUT_PATH), true);
        Job job = new Job(conf, "UV");
        job.setJarByClass(IpUv.class);
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // set mapper & reducer class for job 1
        job.setMapperClass(IpUvMapper1.class);
        job.setReducerClass(IpUvReducer1.class);
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

        // Job 2 runs only if job 1 succeeded, reading job 1's output directory.
        if (job.waitForCompletion(true)) {
            Configuration conf1 = new Configuration();
            final FileSystem fileSystem1 = FileSystem.get(new URI(OUTPUT_PATH + "-2"), conf1);
            fileSystem1.delete(new Path(OUTPUT_PATH + "-2"), true);
            Job job1 = new Job(conf1, "UV");
            job1.setJarByClass(IpUv.class);
            FileInputFormat.setInputPaths(job1, OUTPUT_PATH);
            job1.setOutputKeyClass(Text.class);
            job1.setOutputValueClass(IntWritable.class);

            // set mapper & reducer class for job 2
            job1.setMapperClass(IpUvMapper2.class);
            job1.setReducerClass(IpUvReducer2.class);

            FileOutputFormat.setOutputPath(job1, new Path(OUTPUT_PATH + "-2"));
            System.exit(job1.waitForCompletion(true) ? 0 : 1);
        }
    }
}
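
The two-job chain above is the classic dedupe-then-count pattern. As an alternative sketch, UV fits in a single job: reuse IpUvMapper1 to key records by IP, and count the distinct keys in one reducer, exploiting the fact that reduce() is invoked once per group. This only stays correct with exactly one reducer, so the driver must pin job.setNumReduceTasks(1). Class name hypothetical, imports as in IpUv:

    public static class SingleJobUvReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private int distinct = 0;
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
            distinct++;                                               // one reduce() call per distinct IP
        }
        protected void cleanup(Context context) throws IOException, InterruptedException {
            context.write(new Text("uv"), new IntWritable(distinct)); // emit the final count once
        }
    }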

[hadoop@h201 IpUv]$ /usr/jdk1.7.0_25/bin/javac IpUv.java
Note: IpUv.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
[hadoop@h201 IpUv]$ /usr/jdk1.7.0_25/bin/jar cvf uv.jar IpUv*class
added manifest
adding: IpUv.class(in = 2656) (out= 1323)(deflated 50%)
adding: IpUv$IpUvMapper1.class(in = 1563) (out= 609)(deflated 61%)
adding: IpUv$IpUvMapper2.class(in = 1569) (out= 616)(deflated 60%)
adding: IpUv$IpUvReducer1.class(in = 1344) (out= 513)(deflated 61%)
adding: IpUv$IpUvReducer2.class(in = 1624) (out= 684)(deflated 57%)
[hadoop@h201 IpUv]$ hadoop jar uv.jar IpUv
18/04/22 20:20:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/22 20:20:07 INFO client.RMProxy: Connecting to ResourceManager at h201/192.168.121.132:8032
18/04/22 20:20:07 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/04/22 20:20:08 INFO input.FileInputFormat: Total input paths to process : 1
18/04/22 20:20:08 INFO mapreduce.JobSubmitter: number of splits:1
18/04/22 20:20:08 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
18/04/22 20:20:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516635595760_0054
18/04/22 20:20:08 INFO impl.YarnClientImpl: Submitted application application_1516635595760_0054
18/04/22 20:20:08 INFO mapreduce.Job: The url to track the job: http://h201:8088/proxy/application_1516635595760_0054/
18/04/22 20:20:08 INFO mapreduce.Job: Running job: job_1516635595760_0054
18/04/22 20:20:15 INFO mapreduce.Job: Job job_1516635595760_0054 running in uber mode : false
18/04/22 20:20:15 INFO mapreduce.Job:  map 0% reduce 0%
18/04/22 20:20:23 INFO mapreduce.Job:  map 100% reduce 0%
18/04/22 20:20:29 INFO mapreduce.Job:  map 100% reduce 100%
18/04/22 20:20:29 INFO mapreduce.Job: Job job_1516635595760_0054 completed successfully
18/04/22 20:20:29 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=96
                FILE: Number of bytes written=218531
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=182
                HDFS: Number of bytes written=42
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=5370
                Total time spent by all reduces in occupied slots (ms)=3060
                Total time spent by all map tasks (ms)=5370
                Total time spent by all reduce tasks (ms)=3060
                Total vcore-seconds taken by all map tasks=5370
                Total vcore-seconds taken by all reduce tasks=3060
                Total megabyte-seconds taken by all map tasks=5498880
                Total megabyte-seconds taken by all reduce tasks=3133440
        Map-Reduce Framework
                Map input records=5
                Map output records=5
                Map output bytes=80
                Map output materialized bytes=96
                Input split bytes=107
                Combine input records=0
                Combine output records=0
                Reduce input groups=3
                Reduce shuffle bytes=96
                Reduce input records=5
                Reduce output records=3
                Spilled Records=10
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=295
                CPU time spent (ms)=1240
                Physical memory (bytes) snapshot=224301056
                Virtual memory (bytes) snapshot=2147659776
                Total committed heap usage (bytes)=136712192
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=75
        File Output Format Counters
                Bytes Written=42
18/04/22 20:20:29 INFO client.RMProxy: Connecting to ResourceManager at h201/192.168.121.132:8032
18/04/22 20:20:29 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/04/22 20:20:29 INFO input.FileInputFormat: Total input paths to process : 1
18/04/22 20:20:29 INFO mapreduce.JobSubmitter: number of splits:1
18/04/22 20:20:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516635595760_0055
18/04/22 20:20:29 INFO impl.YarnClientImpl: Submitted application application_1516635595760_0055
18/04/22 20:20:29 INFO mapreduce.Job: The url to track the job: http://h201:8088/proxy/application_1516635595760_0055/
18/04/22 20:20:29 INFO mapreduce.Job: Running job: job_1516635595760_0055
18/04/22 20:20:36 INFO mapreduce.Job: Job job_1516635595760_0055 running in uber mode : false
18/04/22 20:20:36 INFO mapreduce.Job:  map 0% reduce 0%
18/04/22 20:20:42 INFO mapreduce.Job:  map 100% reduce 0%
18/04/22 20:20:47 INFO mapreduce.Job:  map 100% reduce 100%
18/04/22 20:20:48 INFO mapreduce.Job: Job job_1516635595760_0055 completed successfully
18/04/22 20:20:48 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=33
                FILE: Number of bytes written=218409
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=155
                HDFS: Number of bytes written=5
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=2962
                Total time spent by all reduces in occupied slots (ms)=3019
                Total time spent by all map tasks (ms)=2962
                Total time spent by all reduce tasks (ms)=3019
                Total vcore-seconds taken by all map tasks=2962
                Total vcore-seconds taken by all reduce tasks=3019
                Total megabyte-seconds taken by all map tasks=3033088
                Total megabyte-seconds taken by all reduce tasks=3091456
        Map-Reduce Framework
                Map input records=3
                Map output records=3
                Map output bytes=21
                Map output materialized bytes=33
                Input split bytes=113
                Combine input records=0
                Combine output records=0
                Reduce input groups=1
                Reduce shuffle bytes=33
                Reduce input records=3
                Reduce output records=1
                Spilled Records=6
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=218
                CPU time spent (ms)=890
                Physical memory (bytes) snapshot=224432128
                Virtual memory (bytes) snapshot=2147622912
                Total committed heap usage (bytes)=136712192
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=42
        File Output Format Counters
                Bytes Written=5

Result:

[hadoop@h201 ~]$ hadoop fs -cat /user/hadoop/output-2/part-r-00000
18/04/22 20:21:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
uv      3
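
Three distinct IPs in the test data give UV = 3, and the counters agree: job 1 deduplicated five input records down to Reduce output records=3, and job 2 then counted those three lines (Map input records=3) under the single key "uv".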

 
