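MapReduce Weather Dataset
Analyzing a weather dataset with a MapReduce program is a good way to understand how the computation proceeds.
Environment: Hadoop 1.2.1 & CentOS 6.5 x64
1. Preparing the weather dataset
Download link: ftp://ftp3.ncdc.noaa.gov/pub/data
The complete dataset is very large; downloading a subset is enough for everyday experiments.
2. Uploading the weather data to HDFS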
[huser@master 1971]$ ls
034700-99999-1971.gz  273730-99999-1971.gz  338850-99999-1971.gz  943290-99999-1971.gz
035623-99999-1971.gz  273930-99999-1971.gz  338870-99999-1971.gz  943320-99999-1971.gz
035833-99999-1971.gz  274020-99999-1971.gz  338890-99999-1971.gz  943330-99999-1971.gz
035963-99999-1971.gz  274120-99999-1971.gz  338930-99999-1971.gz  943350-99999-1971.gz
036880-99999-1971.gz  274280-99999-1971.gz  338960-99999-1971.gz  943400-99999-1971.gz
040180-16201-1971.gz  274790-99999-1971.gz  338980-99999-1971.gz  943430-99999-1971.gz
041650-99999-1971.gz  274850-99999-1971.gz  339020-99999-1971.gz  943549-99999-1971.gz
041750-99999-1971.gz  275020-99999-1971.gz  339070-99999-1971.gz  943550-99999-1971.gz
042350-99999-1971.gz  275090-99999-1971.gz  339100-99999-1971.gz  943660-99999-1971.gz
061800-99999-1971.gz  275320-99999-1971.gz  339150-99999-1971.gz  943670-99999-1971.gz
[huser@master 1971]$ zcat *.gz > sample.txt
[huser@master hadoop-1.2.1]$ bin/hadoop fs -put /home/huser/hadoop/1971/sample.txt /user/huser/in/
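3. Writing the MapReduce program
The code below is excerpted from Hadoop: The Definitive Guide; it computes the maximum air temperature for each year.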
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Value the dataset uses to mark a missing temperature reading.
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        // Emit (year, temperature) only for present readings with an accepted quality code.
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
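To see what the mapper extracts without starting Hadoop, the fixed-width parsing can be tried on its own. The following is only a sketch: the class `RecordParseSketch` and the helper `syntheticRecord` are hypothetical, and the record is a made-up line padded with zeros, not real NCDC data. Only the character offsets (15-19, 87-92, 92-93) come from the mapper above.

```java
// Sketch only: parses a synthetic fixed-width record with the same
// offsets the mapper uses. The record content is made up for illustration.
class RecordParseSketch {
    static final int MISSING = 9999;

    /** Returns the year field (characters 15..18, zero-based). */
    static String parseYear(String line) {
        return line.substring(15, 19);
    }

    /** Returns the signed temperature field in the dataset's raw units. */
    static int parseTemperature(String line) {
        // Skip an explicit '+' so the result does not depend on how
        // Integer.parseInt treats a leading plus sign.
        int start = (line.charAt(87) == '+') ? 88 : 87;
        return Integer.parseInt(line.substring(start, 92));
    }

    /** Same filter as the mapper: reading present and quality code accepted. */
    static boolean isValid(String line) {
        String quality = line.substring(92, 93);
        return parseTemperature(line) != MISSING && quality.matches("[01459]");
    }

    /** Hypothetical helper: a 93-character line of '0' with three fields filled in. */
    static String syntheticRecord(String year, String signedTemp, String quality) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 93; i++) {
            sb.append('0');
        }
        sb.replace(15, 19, year);
        sb.replace(87, 92, signedTemp);
        sb.replace(92, 93, quality);
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = syntheticRecord("1971", "+0478", "1");
        System.out.println(parseYear(line) + "\t" + parseTemperature(line)
                + "\tvalid=" + isValid(line));
    }
}
```

A record whose temperature field is `+9999` is rejected by `isValid`, which mirrors how the mapper drops missing readings before they reach the reducer.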
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Keep the largest temperature seen for this year.
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Block until the job finishes and report success or failure as the exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
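The data flow that these three classes set up (map output, grouping by key, reduce) can be imitated in memory, which may help when reading the job counters later. This is a rough sketch rather than how Hadoop implements the shuffle: `LocalFlowSketch` and its methods are hypothetical names, and the (year, temperature) pairs are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: an in-memory imitation of shuffle + reduce for (year, temperature) pairs.
class LocalFlowSketch {

    /** Groups map output pairs {year, temperature} by year, as the shuffle would. */
    static Map<String, List<Integer>> shuffle(List<String[]> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : mapOutput) {
            List<Integer> values = grouped.get(pair[0]);
            if (values == null) {
                values = new ArrayList<>();
                grouped.put(pair[0], values);
            }
            values.add(Integer.parseInt(pair[1]));
        }
        return grouped;
    }

    /** Same logic as the reducer: keep the maximum of each group. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int max = Integer.MIN_VALUE;
            for (int v : entry.getValue()) {
                max = Math.max(max, v);
            }
            result.put(entry.getKey(), max);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up map output, not taken from the real dataset.
        List<String[]> mapOutput = new ArrayList<>();
        mapOutput.add(new String[] {"1970", "111"});
        mapOutput.add(new String[] {"1971", "478"});
        mapOutput.add(new String[] {"1971", "-11"});
        mapOutput.add(new String[] {"1970", "78"});
        System.out.println(reduce(shuffle(mapOutput)));
    }
}
```

Each distinct year becomes one group and yields one output record, which is why a single-year input such as sample.txt produces exactly one result line.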
4. Compiling the program
[huser@master bin]$ javac -classpath ../hadoop-core-1.2.1.jar *.java
5. Running the program
[huser@master bin]$ ../bin/hadoop MaxTemperature ./in/sample.txt ./out6
Warning: $HADOOP_HOME is deprecated.
14/04/18 15:31:15 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/04/18 15:31:16 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/04/18 15:31:16 INFO input.FileInputFormat: Total input paths to process : 1
14/04/18 15:31:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/04/18 15:31:16 WARN snappy.LoadSnappy: Snappy native library not loaded
14/04/18 15:31:17 INFO mapred.JobClient: Running job: job_201404181009_0003
14/04/18 15:31:18 INFO mapred.JobClient: map 0% reduce 0%
14/04/18 15:31:33 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000002_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: MaxTemperatureMapper
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassNotFoundException: MaxTemperatureMapper
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
    ... 8 more
14/04/18 15:31:33 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave1:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000002_0&filter=stdout
14/04/18 15:31:33 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave1:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000002_0&filter=stderr
14/04/18 15:31:33 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000003_0, Status : FAILED
14/04/18 15:31:33 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave1:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000003_0&filter=stdout
14/04/18 15:31:33 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave1:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000003_0&filter=stderr
14/04/18 15:31:37 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: MaxTemperatureMapper
    (stack trace identical to the first one, omitted)
14/04/18 15:31:37 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave2:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000000_0&filter=stdout
14/04/18 15:31:37 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave2:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000000_0&filter=stderr
14/04/18 15:31:37 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000001_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: MaxTemperatureMapper
    (stack trace identical to the first one, omitted)
14/04/18 15:31:37 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave2:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000001_0&filter=stdout
14/04/18 15:31:37 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave2:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000001_0&filter=stderr
14/04/18 15:31:41 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000006_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: MaxTemperatureMapper
    (stack trace identical to the first one, omitted)
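The error occurs because the Java program consists of three classes and the map tasks cannot find the classes they need at run time (note the warning "No job jar file set. User classes may not be found." in the log). The classes have to be packaged into a JAR file.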
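The stack trace shows `Configuration.getClassByName` ending in `Class.forName`, that is, the task JVM looks the mapper up by its name on its own classpath. The failure mode can be reproduced in isolation with plain Java. This is a minimal sketch with a hypothetical class `ClassLookupSketch` and a deliberately nonexistent class name; it does not involve Hadoop at all.

```java
// Sketch only: shows how a by-name class lookup fails when the class
// is not on the classpath of the JVM doing the lookup.
class ClassLookupSketch {

    /** Returns true if the named class can be loaded from the current classpath. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A JDK class is always available.
        System.out.println(isLoadable("java.lang.String"));
        // Hypothetical name, chosen so that it is not on the classpath.
        System.out.println(isLoadable("NoSuchMapperForThisSketch"));
    }
}
```

On the cluster, the .class files existed only in the local bin directory on the master, so the same lookup on slave1 and slave2 had nothing to load; shipping a JAR with the job is what makes the classes available there.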
[huser@master bin]$ jar cvf MaxTemperature.jar *.class
added manifest
adding: MaxTemperature.class(in = 1418) (out= 801)(deflated 43%)
adding: MaxTemperatureMapper.class(in = 1876) (out= 804)(deflated 57%)
adding: MaxTemperatureReducer.class(in = 1664) (out= 707)(deflated 57%)
[huser@master bin]$ ls
hadoop                      MaxTemperatureMapper.java    start-jobhistoryserver.sh
hadoop-config.sh            MaxTemperatureReducer.class  start-mapred.sh
hadoop-daemon.sh            MaxTemperatureReducer.java   stop-all.sh
hadoop-daemons.sh           rcc                          stop-balancer.sh
MaxTemperature.class        slaves.sh                    stop-dfs.sh
MaxTemperature.jar          start-all.sh                 stop-jobhistoryserver.sh
MaxTemperature.java         start-balancer.sh            stop-mapred.sh
MaxTemperatureMapper.class  start-dfs.sh                 task-controller
[huser@master bin]$ rm -rf *.class
Running the program as a JAR
[huser@master bin]$ ../bin/hadoop jar MaxTemperature.jar MaxTemperature ./in/sample.txt ./out7
Warning: $HADOOP_HOME is deprecated.
14/04/18 15:42:35 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/04/18 15:42:48 INFO input.FileInputFormat: Total input paths to process : 1
14/04/18 15:42:48 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/04/18 15:42:48 WARN snappy.LoadSnappy: Snappy native library not loaded
14/04/18 15:43:50 INFO mapred.JobClient: Running job: job_201404181009_0005
14/04/18 15:43:52 INFO mapred.JobClient: map 0% reduce 0%
14/04/18 15:51:04 INFO mapred.JobClient: map 1% reduce 0%
14/04/18 15:51:42 INFO mapred.JobClient: map 2% reduce 0%
14/04/18 15:51:43 INFO mapred.JobClient: map 10% reduce 0%
14/04/18 15:52:46 INFO mapred.JobClient: map 11% reduce 0%
14/04/18 15:53:03 INFO mapred.JobClient: map 12% reduce 0%
14/04/18 15:53:14 INFO mapred.JobClient: map 13% reduce 0%
14/04/18 15:53:16 INFO mapred.JobClient: map 14% reduce 0%
14/04/18 15:53:19 INFO mapred.JobClient: map 15% reduce 0%
14/04/18 15:53:22 INFO mapred.JobClient: map 16% reduce 0%
14/04/18 15:53:32 INFO mapred.JobClient: map 18% reduce 0%
14/04/18 15:54:09 INFO mapred.JobClient: map 19% reduce 0%
14/04/18 16:00:36 INFO mapred.JobClient: map 98% reduce 26%
14/04/18 16:00:41 INFO mapred.JobClient: map 98% reduce 30%
14/04/18 16:00:45 INFO mapred.JobClient: map 100% reduce 30%
14/04/18 16:00:56 INFO mapred.JobClient: map 100% reduce 33%
14/04/18 16:01:13 INFO mapred.JobClient: map 100% reduce 100%
14/04/18 16:01:25 INFO mapred.JobClient: Job complete: job_201404181009_0005
14/04/18 16:01:25 INFO mapred.JobClient: Counters: 30
14/04/18 16:01:25 INFO mapred.JobClient: Job Counters
14/04/18 16:01:25 INFO mapred.JobClient: Launched reduce tasks=1
14/04/18 16:01:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=2001708
14/04/18 16:01:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/04/18 16:01:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/04/18 16:01:25 INFO mapred.JobClient: Rack-local map tasks=3
14/04/18 16:01:25 INFO mapred.JobClient: Launched map tasks=11
14/04/18 16:01:25 INFO mapred.JobClient: Data-local map tasks=8
14/04/18 16:01:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=638749
14/04/18 16:01:25 INFO mapred.JobClient: File Output Format Counters
14/04/18 16:01:25 INFO mapred.JobClient: Bytes Written=9
14/04/18 16:01:25 INFO mapred.JobClient: FileSystemCounters
14/04/18 16:01:25 INFO mapred.JobClient: FILE_BYTES_READ=111429430
14/04/18 16:01:25 INFO mapred.JobClient: HDFS_BYTES_READ=1311937676
14/04/18 16:01:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=167764543
14/04/18 16:01:25 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=9
14/04/18 16:01:25 INFO mapred.JobClient: File Input Format Counters
14/04/18 16:01:25 INFO mapred.JobClient: Bytes Read=1311936596
14/04/18 16:01:25 INFO mapred.JobClient: Map-Reduce Framework
14/04/18 16:01:25 INFO mapred.JobClient: Map output materialized bytes=55714697
14/04/18 16:01:25 INFO mapred.JobClient: Map input records=5140229
14/04/18 16:01:25 INFO mapred.JobClient: Reduce shuffle bytes=55714697
14/04/18 16:01:25 INFO mapred.JobClient: Spilled Records=15194901
14/04/18 16:01:25 INFO mapred.JobClient: Map output bytes=45584703
14/04/18 16:01:25 INFO mapred.JobClient: Total committed heap usage (bytes)=2127904768
14/04/18 16:01:25 INFO mapred.JobClient: CPU time spent (ms)=118580
14/04/18 16:01:25 INFO mapred.JobClient: Combine input records=0
14/04/18 16:01:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=1080
14/04/18 16:01:25 INFO mapred.JobClient: Reduce input records=5064967
14/04/18 16:01:25 INFO mapred.JobClient: Reduce input groups=1
14/04/18 16:01:25 INFO mapred.JobClient: Combine output records=0
14/04/18 16:01:25 INFO mapred.JobClient: Physical memory (bytes) snapshot=1685221376
14/04/18 16:01:25 INFO mapred.JobClient: Reduce output records=1
14/04/18 16:01:25 INFO mapred.JobClient: Virtual memory (bytes) snapshot=7951810560
14/04/18 16:01:25 INFO mapred.JobClient: Map output records=5064967
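The counters show `Combine input records=0` and `Reduce input records=5064967`, equal to `Map output records`: every temperature reading was shipped to the single reducer. Because taking a maximum is associative and commutative, the reducer logic could also run as a combiner on each map task's output; in Hadoop this would be an extra driver line such as `job.setCombinerClass(MaxTemperatureReducer.class)`, which the program above does not contain. The sketch below, with a hypothetical class `CombinerSketch` and invented values, only checks the arithmetic behind that idea.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: the maximum of per-split maxima equals the maximum of all values,
// which is the property that makes max usable as a combiner.
class CombinerSketch {

    static int max(List<Integer> values) {
        int m = Integer.MIN_VALUE;
        for (int v : values) {
            m = Math.max(m, v);
        }
        return m;
    }

    /** Maximum computed directly over every value, as the job above does. */
    static int direct(List<Integer> split1, List<Integer> split2) {
        int m = max(split1);
        return Math.max(m, max(split2));
    }

    /** Maximum computed from one pre-aggregated value per split, as with a combiner. */
    static int combined(List<Integer> split1, List<Integer> split2) {
        List<Integer> localMaxima = Arrays.asList(max(split1), max(split2));
        return max(localMaxima);
    }

    public static void main(String[] args) {
        // Made-up temperatures for one year, as seen by two map tasks.
        List<Integer> split1 = Arrays.asList(0, 200, 100);
        List<Integer> split2 = Arrays.asList(250, 150);
        System.out.println(direct(split1, split2) + " " + combined(split1, split2));
    }
}
```

With a combiner, each map task would send one value per year instead of one per record, so the shuffle volume reported under `Reduce shuffle bytes` would be expected to shrink considerably; the actual saving on this cluster was not measured here.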
Viewing the result
[huser@master bin]$ ../bin/hadoop fs -cat ./out7/part-r-00000
Warning: $HADOOP_HOME is deprecated.
1971    478
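The value 478 is the raw field from the records. Assuming, as Hadoop: The Definitive Guide describes for this NCDC format, that air temperature is stored in tenths of a degree Celsius, the maximum for 1971 in this sample is 47.8 °C. A small sketch of the conversion follows; `ResultSketch` and `toCelsius` are hypothetical names, and the tenths-of-a-degree scale is the stated assumption.

```java
// Sketch only: converts one "year<whitespace>value" output line to degrees Celsius,
// assuming the raw value is in tenths of a degree.
class ResultSketch {

    /** Parses the second field of an output line and scales it by 1/10. */
    static double toCelsius(String outputLine) {
        String[] fields = outputLine.trim().split("\\s+");
        return Integer.parseInt(fields[1]) / 10.0;
    }

    public static void main(String[] args) {
        System.out.println(toCelsius("1971\t478"));
    }
}
```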