hadoop常见问题汇集
1 hadoop conf.addResource
http://stackoverflow.com/questions/16017538/how-does-configuration-addresource-method-work-in-hadoop
How does Configuration.addResource() method work in hadoop up vote 0 down vote favorite Does Configuration.addResource() method load resource file like ClassLoader of java or it just encapsulates ClassLoader class.Because I find it can not use String like "../resource.xml" as argument of addResource() to load resource file out of classpath, this property is just the same as ClassLoader. Thx! hadoop shareimprove this question asked Apr 15 '13 at 14:18 foolyoghurt 478 "How does it work" is a different question from "why is my usage not working for me?" Which do you really want to know? – Matt Ball Apr 15 '13 at 14:19 add a comment 1 Answer active oldest votes up vote 2 down vote Browsing the Javadocs and source code for Configuration, Strings are assumed to be classpaths (line 1162), rather than relative to the file system - you should use URLs to reference files on the local file system as follows: conf.addResource(new File("../resource.xml").toURI().toURL()); shareimprove this answer answered Apr 17 '1
2 hadoop MapReduce 读取参数
下面我们先通过一个表格来看下,在hadoop中,使用全局变量或全局文件共享的几种方法
1 使用Configuration的set方法,只适合数据内容比较小的场景
2 将共享文件放在HDFS上,每次都去读取,效率比较低
3 将共享文件放在DistributedCache里,在setup初始化一次后,即可多次使用,缺点是不支持修改操作,仅能读取
下面是第3中方式的介绍
Alternative to deprecated DistributedCache class in Hadoop 2.2.0 As of Hadoop 2.2.0, if you use org.apache.hadoop.filecache.DistributedCache class to load files you want to add to your job as distributed cache, then your compiler will warn you regarding this class being deprecated. In earlier versions of Hadoop, we used DistributedCache class in the following fashion to add files to be available to all mappers and reducers locally: ? 1 2 3 4 5 6 7 // In the main driver class using the new mapreduce API Configuration conf = getConf(); ... DistributedCache.addCacheFile(new Path(filename).toUri(), conf); ... Job job = new Job(conf); ... ? 1 2 // In the mapper class, mostly in the setup method Path[] myCacheFiles = DistributedCache.getLocalCacheFiles(job); But now, with Hadoop 2.2.0, the functionality of addition of files to distributed cache has been moved to the org.apache.hadoop.mapreduce.Job class. You may also notice that the constructor we used to use for the Job class has also been deprecated and instead we should be using the new factory method getInstance(Configuration conf). The alternative solution would look as follows: ? 1 2 3 4 5 6 // In the main driver class using the new mapreduce API Configuration conf = getConf(); ... Job job = Job.getInstance(conf); ... job.addCacheFile(new URI(filename)); ? 1 2 // In the mapper class, mostly in the setup method URI[] localPaths = context.getCacheFiles();
原文链接 http://www.bigdataspeak.com/2014/06/alternative-to-deprecated.html
Hadoop DistributedCache is deprecated - what is the preferred API? http://stackoverflow.com/questions/21239722/hadoop-distributedcache-is-deprecated-what-is-the-preferred-api 大矩阵相乘 http://www.cnblogs.com/zhangchaoyang/articles/4646315.html 如何使用Hadoop的DistributedCache http://blog.itpub.net/29755051/viewspace-1220340/ DistributedCache小记 http://www.cnblogs.com/xuxm2007/p/3344930.html 迭代式MapReduce解决方案(二) DistributedCache http://hongweiyi.com/2012/02/iterative-mapred-distcache/
3 hadoop Mapper 类
Mapper类有四个方法: (1)protected void setup(Context context) (2)protected void map(KEYIN key,VALUEIN value,Context context) (3)protected void cleanup(Context context) (4)public void run(Context context) setup()方法一般是在实例化时用户程序需要做的一些初始化工作(如打开一个全局文件,建立数据库链接等等) cleanup()方法是收尾工作,如关闭文件或者执行map()后的键值对分发等。 map()方法承担主要的处理工作,一般我们些代码的时候主要用到的是map方法。 默认Mapper的run()方法的核心代码如下: public void run(Context context) throws IOException,InterruptedException { setup(context); while(context.nextKeyValue()) map(context.getCurrentKey(),context,context.getCurrentValue(),context); cleanup(context); } setup和cleanup仅仅在初始化Mapper实例和Mapper任务结束时由系统作为回调函数分别各做一次,并不是每次调用map方法时都去执行。所以如果是要处理map中的某些数值数据时,想把代码写在cleanup里面需要特别注意。 Mapper输出结果到reduce阶段之前,还有几个可以自定义的步骤 (1)combiner 每个节点输出的键值可以先进行合并处理。 (2)合并处理之后如果还想将不同key值分配给不同reduce进行处理,称为shuffle洗牌过程,提供了一个partioner类来完成。 (3)如果想将key值自定义进行排序,这边提供了一个sort类,可以自定义进行排序
4 hadoop ChainMapper 和 ChainReducer
import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.chain.ChainMapper; import org.apache.hadoop.mapreduce.lib.chain.ChainReducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public class WCount { public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "wordcount"); job.setJarByClass(WCount.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); ChainMapper.addMapper(job, WCMapper.class, LongWritable.class, Text.class, Text.class, LongWritable.class, conf); ChainReducer.setReducer(job, WCReduce.class, Text.class, LongWritable.class, Text.class, LongWritable.class, conf); ChainReducer.addMapper(job, WCMapper2.class, Text.class, LongWritable.class, LongWritable.class, Text.class, conf); job.waitForCompletion(true); } public static class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> { @Override protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); String[] lineSet = line.split(" "); for (String e : lineSet) { context.write(new Text(e), new LongWritable(1)); } } } public static class WCReduce extends Reducer<Text, LongWritable, Text, LongWritable> { private LongWritable outVla = new LongWritable(); @Override protected void reduce(Text k1, Iterable<LongWritable> v1, Context context) throws IOException, InterruptedException { long sum = 0; for (LongWritable e : v1) { sum += e.get(); } outVla.set(sum); context.write(k1, outVla); } } public static class WCMapper2 extends Mapper<Text, LongWritable, LongWritable, Text> { @Override protected void map(Text key, LongWritable value, Context context) throws IOException, InterruptedException { context.write(value, key); } } }
5 Hadoop JobControl
import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob; import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public class WCount2 { public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); // 第一个job的配置 Job job1 = Job.getInstance(conf, "wordcount1"); job1.setJarByClass(WCount2.class); job1.setMapperClass(WCMapper.class); job1.setMapOutputKeyClass(Text.class); job1.setMapOutputValueClass(LongWritable.class); job1.setReducerClass(WCReduce.class); job1.setOutputKeyClass(Text.class); job1.setOutputValueClass(LongWritable.class); // job1的输入输出文件路径 FileInputFormat.addInputPath(job1, new Path(args[0])); FileOutputFormat.setOutputPath(job1, new Path(args[1])); // 加入控制容器 ControlledJob ctrljob1 = new ControlledJob(conf); ctrljob1.setJob(job1); // 第二个作业的配置 Job job2 = Job.getInstance(conf, "wordcount2"); job2.setJarByClass(WCount2.class); job2.setMapperClass(WCMapper2.class); job2.setMapOutputKeyClass(Text.class); job2.setMapOutputValueClass(Text.class); // 作业2加入控制容器 ControlledJob ctrljob2 = new ControlledJob(conf); ctrljob2.setJob(job2); // 设置多个作业直接的依赖关系 // 如下所写: // 意思为job2的启动,依赖于job1作业的完成 ctrljob2.addDependingJob(ctrljob1); // job2的输入输出文件路径 FileInputFormat.addInputPath(job2, new Path(args[1])); FileOutputFormat.setOutputPath(job2, new Path(args[2])); // 主的控制容器,控制上面的总的两个子作业 JobControl JC = new JobControl("wordcount"); // 添加到总的JobControl里,进行控制 JC.addJob(ctrljob1); JC.addJob(ctrljob2); // 在线程启动,记住一定要有这个 Thread t = new Thread(JC); t.start(); while (true) { if (JC.allFinished()) {// 如果作业成功完成,就打印成功作业的信息 System.out.println(JC.getSuccessfulJobList()); JC.stop(); break; } } } public static class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> { @Override protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); String[] lineSet = line.split(" "); for (String e : lineSet) { context.write(new Text(e), new LongWritable(1)); } } } public static class WCReduce extends Reducer<Text, LongWritable, Text, LongWritable> { private LongWritable outVla = new LongWritable(); @Override protected void reduce(Text k1, Iterable<LongWritable> v1, Context context) throws IOException, InterruptedException { long sum = 0; for (LongWritable e : v1) { sum += e.get(); } outVla.set(sum); context.write(k1, outVla); } } public static class WCMapper2 extends Mapper<LongWritable, Text, Text, Text> { private Text outval = new Text(); private Text outkey = new Text(); @Override protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); String[] lineSet = line.split("\t"); outkey.set(lineSet[1]); outval.set(lineSet[0]); context.write(outkey, outval); } } }
6 hadoop Filesystem closed
We are running a workflow in oozie. It contains two actions: the first is a map reduce job that generates files in the hdfs and the second is a job that should copy the data in the files to a database. Both parts are done successfully but the oozie throws an exception at the end that marks it as a failed process. This is the exception: 2014-05-20 17:29:32,242 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:lpinsight (auth:SIMPLE) cause:java.io.IOException: Filesystem closed 2014-05-20 17:29:32,243 WARN org.apache.hadoop.mapred.Child: Error running child java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:565) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:589) at java.io.FilterInputStream.close(FilterInputStream.java:155) at org.apache.hadoop.util.LineReader.close(LineReader.java:149) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:222) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:421) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332) at org.apache.hadoop.mapred.Child$4.run(Child.java:268) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.mapred.Child.main(Child.java:262) 2014-05-20 17:29:32,256 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task Any idea ? Thanks, Lital 2 Answers active oldest votes up vote 2 down vote Use the below configuration while accessing file system. Configuration conf = new Configuration(); conf.setBoolean("fs.hdfs.impl.disable.cache", true); FileSystem fileSystem = FileSystem.get(conf); shareimprove this answer answered Jun 24 '14 at 12:49 NelsonPaul 12615 add a comment Did you find this question interesting? Try our newsletter Sign up for our newsletter and get our top new questions delivered to your inbox (see an example). up vote 0 down vote I had encountered a similar issue that prompted java.io.IOException: Filesystem closed. Finally, I found I closed the filesystem somewhere else. The hadoop filesystem API returns the same object. So if I closed one filesystem, then all filesystems are closed. I get the solution from this answer shareimprove this answer
参考资料
如何在hadoop中控制map的个数 http://blog.csdn.net/lylcore/article/details/9136555
大数据技术博客 http://lxw1234.com/
Mapper类 http://blog.csdn.net/witsmakemen/article/details/8445133
使用内部jar http://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file
hadoop 集群 Running job 卡住 http://bbs.csdn.net/topics/391031853
Container is running beyond memory limits http://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits
MapReduce使用JobControl管理实例 http://www.cnblogs.com/manhua/p/4136138.html
hadoop jar 和 java -cp 之间的不同 http://blog.csdn.net/aaa1117a8w5s6d/article/details/34120003
hadoop map 读取数据库 http://blog.csdn.net/qwertyu8656/article/details/6426054
虚拟内存超出界限 http://stackoverflow.com/questions/14110428/am-container-is-running-beyond-virtual-memory-limits