MapReduce_PVUV
On the web, most sites need to gather PV and UV statistics for their pages. Roughly speaking, PV is the number of times a URL is accessed, while UV is the number of distinct IPs that access it.
In short, PV is the page-view count: the raw total of hits, with no deduplication, so repeated visits from the same IP count as repeated hits.
In short, UV is the visitor (session) count: each IP counts as one visit, i.e. the IPs are deduplicated.
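To make the distinction concrete before bringing in Hadoop, here is a minimal plain-Java sketch (the class name and the in-memory log lines are illustrative only) that computes both figures over the same five records used as test data below:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PvUvLocal {
    public static void main(String[] args) {
        // Illustrative log lines in the same "ip payload" format as the test data
        List<String> lines = Arrays.asList(
                "192.168.1.1 aa", "192.168.1.2 bb", "192.168.1.3 cc",
                "192.168.1.1 dd", "192.168.1.1 ee");
        long pv = lines.size();                 // every hit counts: PV = 5
        Set<String> ips = new HashSet<String>();
        for (String line : lines) {
            ips.add(line.split(" ", 2)[0]);     // keep only the IP column
        }
        long uv = ips.size();                   // distinct IPs only: UV = 3
        System.out.println("pv " + pv + ", uv " + uv);
    }
}

The MapReduce programs below follow the same logic, just distributed: PV sums every record, UV deduplicates by IP first.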
PV:
Test data (5 records from 3 distinct IPs):
192.168.1.1 aa
192.168.1.2 bb
192.168.1.3 cc
192.168.1.1 dd
192.168.1.1 ee
package MapReduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class IpPv {
    private static final String INPUT_PATH = "hdfs://h201:9000/user/hadoop/input";
    private static final String OUTPUT_PATH = "hdfs://h201:9000/user/hadoop/output";

    public static class IpPvUvMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        IntWritable one = new IntWritable(1);

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String ip = value.toString().split(" ", 5)[0]; // first column is the IP (unused for PV: every record counts)
            context.write(new Text("pv"), one);            // constant key, so all hits land in one reduce group
        }
    }

    public static class IpPvUvReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
                // if the map output key were the IP instead of the constant "pv",
                // this reducer would produce per-IP counts, from which a top 100 could be derived
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, URISyntaxException, ClassNotFoundException, InterruptedException {
        System.out.println(args.length); // debug: print the argument count
        Configuration conf = new Configuration();
        conf.set("mapred.jar", "pv.jar"); // declare the jar name as pv.jar (mapred.jar is deprecated; mapreduce.job.jar is the modern key)
        final FileSystem fileSystem = FileSystem.get(new URI(OUTPUT_PATH), conf); // open the target file system
        fileSystem.delete(new Path(OUTPUT_PATH), true); // the output path must not already exist, so delete it first
        Job job = new Job(conf, "PV");
        job.setJarByClass(IpPv.class);
        FileInputFormat.setInputPaths(job, INPUT_PATH);

        // set mapper & reducer class
        job.setMapperClass(IpPvUvMap.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setCombinerClass(IpPvUvReduce.class); // pre-sums on the map side (hence Combine input records=5, output records=1 in the log below)
        job.setReducerClass(IpPvUvReduce.class);
        // set output key class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
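As the comment in the reducer hints, keying the map output by the IP instead of the constant "pv" turns the same job into a per-IP hit counter, from which a top 100 could then be derived. A hedged sketch of just that mapper change (the class name is an assumption; the summing reducer above stays as-is):

package MapReduce;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Variant sketch: emit the IP itself as the map key instead of the literal "pv",
// so the unchanged summing reducer produces one hit count per IP.
public class IpCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String ip = value.toString().split(" ", 5)[0]; // first column is the IP
        context.write(new Text(ip.trim()), one);       // key by IP, not by "pv"
    }
}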
[hadoop@h201 IpPv]$ /usr/jdk1.7.0_25/bin/javac IpPv.java
Note: IpPv.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
[hadoop@h201 IpPv]$ /usr/jdk1.7.0_25/bin/jar cvf pv.jar IpPv*class
added manifest
adding: IpPv.class(in = 2207) (out= 1116)(deflated 49%)
adding: IpPv$IpPvUvMap.class(in = 1682) (out= 658)(deflated 60%)
adding: IpPv$IpPvUvReduce.class(in = 1743) (out= 752)(deflated 56%)
[hadoop@h201 IpPv]$ hadoop jar pv.jar IpPv
18/04/22 20:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/22 20:08:13 INFO client.RMProxy: Connecting to ResourceManager at h201/192.168.121.132:8032
18/04/22 20:08:13 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/04/22 20:08:14 INFO input.FileInputFormat: Total input paths to process : 1
18/04/22 20:08:14 INFO mapreduce.JobSubmitter: number of splits:1
18/04/22 20:08:14 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
18/04/22 20:08:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516635595760_0053
18/04/22 20:08:14 INFO impl.YarnClientImpl: Submitted application application_1516635595760_0053
18/04/22 20:08:14 INFO mapreduce.Job: The url to track the job: http://h201:8088/proxy/application_1516635595760_0053/
18/04/22 20:08:14 INFO mapreduce.Job: Running job: job_1516635595760_0053
18/04/22 20:08:21 INFO mapreduce.Job: Job job_1516635595760_0053 running in uber mode : false
18/04/22 20:08:21 INFO mapreduce.Job: map 0% reduce 0%
18/04/22 20:08:29 INFO mapreduce.Job: map 100% reduce 0%
18/04/22 20:08:36 INFO mapreduce.Job: map 100% reduce 100%
18/04/22 20:08:36 INFO mapreduce.Job: Job job_1516635595760_0053 completed successfully
18/04/22 20:08:36 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=15
FILE: Number of bytes written=219335
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=182
HDFS: Number of bytes written=5
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5545
Total time spent by all reduces in occupied slots (ms)=3564
Total time spent by all map tasks (ms)=5545
Total time spent by all reduce tasks (ms)=3564
Total vcore-seconds taken by all map tasks=5545
Total vcore-seconds taken by all reduce tasks=3564
Total megabyte-seconds taken by all map tasks=5678080
Total megabyte-seconds taken by all reduce tasks=3649536
Map-Reduce Framework
Map input records=5
Map output records=5
Map output bytes=35
Map output materialized bytes=15
Input split bytes=107
Combine input records=5
Combine output records=1
Reduce input groups=1
Reduce shuffle bytes=15
Reduce input records=1
Reduce output records=1
Spilled Records=2
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=677
CPU time spent (ms)=1350
Physical memory (bytes) snapshot=224731136
Virtual memory (bytes) snapshot=2147983360
Total committed heap usage (bytes)=136712192
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=75
File Output Format Counters
Bytes Written=5
Result (PV = 5, matching the 5 input records):
[hadoop@h201 ~]$ hadoop fs -cat /user/hadoop/output/part-r-00000
18/04/22 20:08:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
pv 5
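Both runs log a WARN from mapreduce.JobResourceUploader suggesting that the driver implement the Tool interface so Hadoop's generic command-line options are parsed. A hedged sketch of that pattern (the driver class name and the use of command-line input/output paths are assumptions, not part of the original program):

package MapReduce;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch of a Tool-based driver: ToolRunner parses generic options (-D, -files, ...)
// and hands the resulting Configuration to run() via getConf().
public class IpPvDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "PV");
        job.setJarByClass(IpPvDriver.class);
        job.setMapperClass(IpPv.IpPvUvMap.class);      // reuse the mapper/reducer above
        job.setCombinerClass(IpPv.IpPvUvReduce.class);
        job.setReducerClass(IpPv.IpPvUvReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));   // paths from the command line
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new IpPvDriver(), args));
    }
}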
UV:
package MapReduce;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class IpUv {
    private static final String INPUT_PATH = "hdfs://h201:9000/user/hadoop/input2";
    private static final String OUTPUT_PATH = "hdfs://h201:9000/user/hadoop/output";
    private final static IntWritable one = new IntWritable(1);

    // Job 1: deduplicate IPs — the shuffle groups identical IPs into one reduce call
    public static class IpUvMapper1 extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String ip = value.toString().split(" ", 5)[0];
            context.write(new Text(ip.trim()), one); // trim() strips stray whitespace at both ends
        }
    }

    public static class IpUvReducer1 extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> value, Context context) throws IOException, InterruptedException {
            context.write(key, new IntWritable(1)); // one output line per distinct IP
        }
    }

    // Job 2: count the deduplicated lines produced by job 1
    public static class IpUvMapper2 extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable longWritable, Text text, Context context) throws IOException, InterruptedException {
            String ip = text.toString().split("\t")[0]; // key column of job 1's output (unused: each line is one distinct IP)
            context.write(new Text("uv"), one);
        }
    }

    public static class IpUvReducer2 extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            /**
             * uv, [1,1,1,1,1,1]
             */
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        Configuration conf = new Configuration();
        conf.set("mapred.jar", "uv.jar");
        final FileSystem fileSystem = FileSystem.get(new URI(OUTPUT_PATH), conf);
        fileSystem.delete(new Path(OUTPUT_PATH), true);
        Job job = new Job(conf, "UV");
        job.setJarByClass(IpUv.class);
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // set mapper & reducer class
        job.setMapperClass(IpUvMapper1.class);
        job.setReducerClass(IpUvReducer1.class);
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

        if (job.waitForCompletion(true)) {
            Configuration conf1 = new Configuration();
            final FileSystem fileSystem1 = FileSystem.get(new URI(OUTPUT_PATH + "-2"), conf1);
            fileSystem1.delete(new Path(OUTPUT_PATH + "-2"), true);
            Job job1 = new Job(conf1, "UV");
            job1.setJarByClass(IpUv.class);
            FileInputFormat.setInputPaths(job1, OUTPUT_PATH); // job 2 reads job 1's output
            job1.setOutputKeyClass(Text.class);
            job1.setOutputValueClass(IntWritable.class);

            // set mapper & reducer class
            job1.setMapperClass(IpUvMapper2.class);
            job1.setReducerClass(IpUvReducer2.class);

            FileOutputFormat.setOutputPath(job1, new Path(OUTPUT_PATH + "-2"));
            System.exit(job1.waitForCompletion(true) ? 0 : 1);
        }
    }
}
[hadoop@h201 IpUv]$ /usr/jdk1.7.0_25/bin/javac IpUv.java
Note: IpUv.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
[hadoop@h201 IpUv]$ /usr/jdk1.7.0_25/bin/jar cvf uv.jar IpUv*class
added manifest
adding: IpUv.class(in = 2656) (out= 1323)(deflated 50%)
adding: IpUv$IpUvMapper1.class(in = 1563) (out= 609)(deflated 61%)
adding: IpUv$IpUvMapper2.class(in = 1569) (out= 616)(deflated 60%)
adding: IpUv$IpUvReducer1.class(in = 1344) (out= 513)(deflated 61%)
adding: IpUv$IpUvReducer2.class(in = 1624) (out= 684)(deflated 57%)
[hadoop@h201 IpUv]$ hadoop jar uv.jar IpUv
18/04/22 20:20:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/22 20:20:07 INFO client.RMProxy: Connecting to ResourceManager at h201/192.168.121.132:8032
18/04/22 20:20:07 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/04/22 20:20:08 INFO input.FileInputFormat: Total input paths to process : 1
18/04/22 20:20:08 INFO mapreduce.JobSubmitter: number of splits:1
18/04/22 20:20:08 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
18/04/22 20:20:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516635595760_0054
18/04/22 20:20:08 INFO impl.YarnClientImpl: Submitted application application_1516635595760_0054
18/04/22 20:20:08 INFO mapreduce.Job: The url to track the job: http://h201:8088/proxy/application_1516635595760_0054/
18/04/22 20:20:08 INFO mapreduce.Job: Running job: job_1516635595760_0054
18/04/22 20:20:15 INFO mapreduce.Job: Job job_1516635595760_0054 running in uber mode : false
18/04/22 20:20:15 INFO mapreduce.Job: map 0% reduce 0%
18/04/22 20:20:23 INFO mapreduce.Job: map 100% reduce 0%
18/04/22 20:20:29 INFO mapreduce.Job: map 100% reduce 100%
18/04/22 20:20:29 INFO mapreduce.Job: Job job_1516635595760_0054 completed successfully
18/04/22 20:20:29 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=96
FILE: Number of bytes written=218531
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=182
HDFS: Number of bytes written=42
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5370
Total time spent by all reduces in occupied slots (ms)=3060
Total time spent by all map tasks (ms)=5370
Total time spent by all reduce tasks (ms)=3060
Total vcore-seconds taken by all map tasks=5370
Total vcore-seconds taken by all reduce tasks=3060
Total megabyte-seconds taken by all map tasks=5498880
Total megabyte-seconds taken by all reduce tasks=3133440
Map-Reduce Framework
Map input records=5
Map output records=5
Map output bytes=80
Map output materialized bytes=96
Input split bytes=107
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=96
Reduce input records=5
Reduce output records=3
Spilled Records=10
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=295
CPU time spent (ms)=1240
Physical memory (bytes) snapshot=224301056
Virtual memory (bytes) snapshot=2147659776
Total committed heap usage (bytes)=136712192
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=75
File Output Format Counters
Bytes Written=42
18/04/22 20:20:29 INFO client.RMProxy: Connecting to ResourceManager at h201/192.168.121.132:8032
18/04/22 20:20:29 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/04/22 20:20:29 INFO input.FileInputFormat: Total input paths to process : 1
18/04/22 20:20:29 INFO mapreduce.JobSubmitter: number of splits:1
18/04/22 20:20:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516635595760_0055
18/04/22 20:20:29 INFO impl.YarnClientImpl: Submitted application application_1516635595760_0055
18/04/22 20:20:29 INFO mapreduce.Job: The url to track the job: http://h201:8088/proxy/application_1516635595760_0055/
18/04/22 20:20:29 INFO mapreduce.Job: Running job: job_1516635595760_0055
18/04/22 20:20:36 INFO mapreduce.Job: Job job_1516635595760_0055 running in uber mode : false
18/04/22 20:20:36 INFO mapreduce.Job: map 0% reduce 0%
18/04/22 20:20:42 INFO mapreduce.Job: map 100% reduce 0%
18/04/22 20:20:47 INFO mapreduce.Job: map 100% reduce 100%
18/04/22 20:20:48 INFO mapreduce.Job: Job job_1516635595760_0055 completed successfully
18/04/22 20:20:48 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=33
FILE: Number of bytes written=218409
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=155
HDFS: Number of bytes written=5
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2962
Total time spent by all reduces in occupied slots (ms)=3019
Total time spent by all map tasks (ms)=2962
Total time spent by all reduce tasks (ms)=3019
Total vcore-seconds taken by all map tasks=2962
Total vcore-seconds taken by all reduce tasks=3019
Total megabyte-seconds taken by all map tasks=3033088
Total megabyte-seconds taken by all reduce tasks=3091456
Map-Reduce Framework
Map input records=3
Map output records=3
Map output bytes=21
Map output materialized bytes=33
Input split bytes=113
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=33
Reduce input records=3
Reduce output records=1
Spilled Records=6
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=218
CPU time spent (ms)=890
Physical memory (bytes) snapshot=224432128
Virtual memory (bytes) snapshot=2147622912
Total committed heap usage (bytes)=136712192
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=42
File Output Format Counters
Bytes Written=5
Result (UV = 3, matching the 3 distinct IPs in the test data):
[hadoop@h201 ~]$ hadoop fs -cat /user/hadoop/output-2/part-r-00000
18/04/22 20:21:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
uv 3