Hadoop DistributedCache

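1. Using the Hadoop DistributedCache.

Hadoop has a mechanism called the distributed cache that distributes data to all nodes in the cluster. To save network bandwidth, a file is normally copied to any given node only once per job.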

Where the cached files are copied to is set in mapred-site.xml:

<property>

<name>mapred.local.dir</name>

<value>/home/hadoop/tmp</value>

</property>

Steps:

1. Distribute the data to every node:

Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/hadoop/input/ref.png"), conf); // the path is a path in HDFS

Note: this must be done before the Job is created and conf is passed to it; otherwise the Mapper will not be able to retrieve the path of the data file.

If you get the error "The method addCacheFile(URI, Configuration) in the type DistributedCache is not applicable for the arguments (URI, Configuration)", check that the program imports the right URI and Configuration classes. They should be java.net.URI and org.apache.hadoop.conf.Configuration respectively.
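To illustrate why the cache file has to be added before the Job is created in step 1: as far as I know, a Job works on its own copy of the Configuration it is given, so later changes to the original conf are not visible to it. Below is a minimal plain-Java sketch of that copy-on-construction behavior. It does not use Hadoop at all; `SnapshotJob`, `CacheOrderDemo` and the key `cache.files` are hypothetical stand-ins, not Hadoop classes or property names.

```java
import java.util.Properties;

// Hypothetical stand-in for a job that snapshots its configuration.
// It only illustrates why the cache file must be registered first.
class SnapshotJob {
    private final Properties snapshot = new Properties();

    SnapshotJob(Properties conf) {
        // Copy every entry; later changes to conf are not seen here.
        snapshot.putAll(conf);
    }

    String get(String key) {
        return snapshot.getProperty(key);
    }
}

class CacheOrderDemo {
    // Hypothetical key standing in for wherever the cache file list is stored.
    static final String KEY = "cache.files";

    // Correct order: register the file, then create the job.
    static String addThenCreate(String uri) {
        Properties conf = new Properties();
        conf.setProperty(KEY, uri);
        return new SnapshotJob(conf).get(KEY);
    }

    // Wrong order: the job was created before the file was registered,
    // so its snapshot does not contain the entry and null is returned.
    static String createThenAdd(String uri) {
        Properties conf = new Properties();
        SnapshotJob job = new SnapshotJob(conf);
        conf.setProperty(KEY, uri);
        return job.get(KEY);
    }

    public static void main(String[] args) {
        System.out.println(addThenCreate("/user/hadoop/input/ref.png"));
        System.out.println(createThenAdd("/user/hadoop/input/ref.png"));
    }
}
```

In the real driver the same rule applies: call DistributedCache.addCacheFile on conf first, and only then construct the Job from that conf.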

2. Get the file in each Mapper. Obtain the Path in setup.

private Configuration conf; // kept for FileSystem.getLocal(conf) in the next step
private Path[] refPathFromDistributedCache;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
	conf = context.getConfiguration();
	// Local paths of the cached files on this node.
	refPathFromDistributedCache = DistributedCache.getLocalCacheFiles(conf);
	super.setup(context);
}

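3. Open the file. Because the cached files are stored in the local file system of each node, you must use FileSystem.getLocal(conf); otherwise you will get a file-not-found error.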

FileSystem reffs = FileSystem.getLocal(conf);
FSDataInputStream in = null;
for(Path tmpRefPath : refPathFromDistributedCache) {
	if(tmpRefPath.toString().indexOf("ref.png") != -1) {
		in = reffs.open(tmpRefPath);
		break;
	}
}
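The substring test in the loop above would also match a path that merely contains "ref.png" somewhere, for example a file named ref.png.bak, and it leaves `in` as null when nothing matches. A slightly stricter variant compares only the last path component. The sketch below uses java.nio.file.Path as a stand-in for Hadoop's Path so that it runs without Hadoop; `CacheFilePicker` and `pickByName` are hypothetical names, and the directory in `main` is made up for illustration. With Hadoop's Path, the counterpart of getFileName() would be getName().

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class CacheFilePicker {
    // Returns the first path whose last component equals fileName,
    // or null if the array is null or no entry matches.
    static Path pickByName(Path[] localPaths, String fileName) {
        if (localPaths == null) {
            return null; // nothing was registered in the cache
        }
        for (Path p : localPaths) {
            Path last = p.getFileName();
            if (last != null && last.toString().equals(fileName)) {
                return p;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Illustrative paths only; the real ones come from the cache lookup in setup.
        Path[] paths = {
            Paths.get("/home/hadoop/tmp/cache/ref.png.bak"),
            Paths.get("/home/hadoop/tmp/cache/ref.png")
        };
        System.out.println(pickByName(paths, "ref.png"));
    }
}
```

Checking the result for null before opening it gives a clearer failure than a NullPointerException later when the stream is read.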

2. DistributedCache efficiency comparison.

I ran a test and found that using DistributedCache improved efficiency noticeably.

3. Keeping distributed files in memory.

Hadoop does not appear to offer this; it would take a different framework, such as Spark.

References:

[1]http://www.cnblogs.com/yanzhenxing/archive/2012/08/31/2664761.html

[2]http://stackoverflow.com/questions/13508707/hadoop-filenotfoundexcepion-when-getting-file-from-distributedcache

[3]http://blog.csdn.net/dandingyy/article/details/7569368

posted on 2013-12-11 10:11 by hequn8128
