Accessing HDFS from IDEA and a MapReduce Example
After setting up the highly available Hadoop cluster in the previous post, we can now access HDFS programmatically and run MapReduce examples.
1. Accessing HDFS
Create a Gradle project in IDEA, then add the following dependencies to build.gradle:
dependencies {
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.4'
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.4'
    compile group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '3.1.4'
    compile group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-core', version: '3.1.4'
    compile group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-common', version: '3.1.4'
}
Copy core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml into the project's resources directory, then add the following test code to the main class.
public static void main(String[] args) throws Exception {
    Configuration cfg = new Configuration();
    // Connect to the HA nameservice as user "hadoop"
    FileSystem fs = FileSystem.get(new URI("hdfs://mycluster"), cfg, "hadoop");
    // Download /test/fruits.txt from HDFS to the local D: drive
    fs.copyToLocalFile(new Path("/test/fruits.txt"), new Path("D:/"));
    fs.close();
}
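The snippet above only downloads a file. The MapReduce runs in sections 2.3 and 2.4 read /test/fruits.txt from HDFS, so the file has to be uploaded first; here is a minimal sketch, assuming the local file sits at D:/fruits.txt (a hypothetical path, not from the original setup):

static void uploadFruits() throws Exception {
    Configuration cfg = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://mycluster"), cfg, "hadoop");
    // The local source path is an assumption; adjust to wherever fruits.txt lives.
    fs.copyFromLocalFile(new Path("D:/fruits.txt"), new Path("/test/fruits.txt"));
    fs.close();
}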
2. MapReduce Example
2.1 Mapper, Reducer, and Partitioner Classes
package com.test;

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PartitionerApp {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        LongWritable one = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line is expected to be "<fruit> <count>"
            String line = value.toString();
            String[] words = line.split(" ");
            context.write(new Text(words[0]), new LongWritable(Long.parseLong(words[1])));
        }
    }

    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // Sum all counts for the same fruit
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static class MyPartitioner extends Partitioner<Text, LongWritable> {
        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {
            // Route each known fruit to its own reducer; everything else goes to partition 3
            if (key.toString().equals("Apple")) {
                return 0;
            }
            if (key.toString().equals("Orange")) {
                return 1;
            }
            if (key.toString().equals("Pear")) {
                return 2;
            }
            return 3;
        }
    }
}
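The mapper splits each line on a single space and expects a fruit name followed by a numeric count, and the partitioner routes Apple, Orange, and Pear to reducers 0 through 2. The original post does not show the contents of fruits.txt; an illustrative input file matching those assumptions might look like:

Apple 3
Orange 2
Pear 5
Apple 1
Banana 4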
2.2 Running Hadoop MapReduce Locally on Windows
2.2.1 Download a Hadoop distribution, extract it locally, and set the HADOOP_HOME environment variable.
When running from IDEA, this environment variable is read to locate the bin directory inside it. Hadoop does not actually need to be started locally; the bin directory only has to exist, otherwise an error is reported saying the HADOOP_HOME environment variable cannot be found.
2.2.2 The bin directory is missing winutils.exe and hadoop.dll. Download them separately from https://github.com/steveloughran/winutils and place the files under hadoop\bin.
2.2.3 Copy hadoop.dll to C:\Windows\System32, otherwise the following error is thrown: Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
2.2.4 Remove core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml from the resources directory.
2.2.5 Write the test method windowTest, then call it from main (a minimal call sketch follows the method below).
static void windowTest() throws Exception {
    /* The Hadoop config XML files must be removed from resources at this point */
    System.setProperty("hadoop.home.dir", "D:\\hadoop-3.1.4");
    Configuration cfg = new Configuration();
    Job job = Job.getInstance(cfg, "wordCount");
    job.setJarByClass(PartitionerApp.class);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setNumReduceTasks(2);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path("d:/fruits.txt"));
    String outPath = "d:/mpr.txt";
    // Delete the output directory if it already exists, otherwise the job fails
    if (Files.exists(Paths.get(outPath))) {
        try {
            FileUtil.fullyDelete(new File(outPath));
            //Files.delete(Paths.get(outPath));
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
    FileOutputFormat.setOutputPath(job, new Path(outPath));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
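As referenced above, a minimal sketch of the main method for this local run simply delegates to windowTest:

public static void main(String[] args) throws Exception {
    // Local Windows run: just invoke the test method.
    windowTest();
}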
2.3 Submitting MapReduce from the Local Machine Directly to the Remote Hadoop Cluster
2.3.1 Copy core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml back into the resources directory.
2.3.2 Modify main to submit the job to the remote cluster:
public static void main(String[] args) throws Exception {
    //System.setProperty("hadoop.home.dir", "D:\\hadoop-3.1.4");
    System.setProperty("HADOOP_USER_NAME", "root");
    args = new String[2];
    args[0] = "hdfs://mycluster/test/fruits.txt";
    args[1] = "hdfs://mycluster/output/fruits";
    Configuration configuration = new Configuration();
    configuration.set("fs.defaultFS", "hdfs://mycluster");
    configuration.set("mapreduce.app-submission.cross-platform", "true");
    // Build the project into an executable jar in IDEA, then point the job at that jar
    configuration.set("mapreduce.job.jar", "E:\\hadoop\\out\\artifacts\\hadoop_main_jar\\hadoop.main.jar");
    Path outputPath = new Path(args[1]);
    FileSystem fileSystem = FileSystem.get(configuration);
    // Delete the output directory on HDFS if it already exists
    if (fileSystem.exists(outputPath)) {
        fileSystem.delete(outputPath, true);
        System.out.println("outputPath: " + args[1] + " exists, but has been deleted.");
    }
    Job job = Job.getInstance(configuration, "FruitCount");
    job.setJarByClass(PartitionerApp.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    job.setMapperClass(MyMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setPartitionerClass(MyPartitioner.class);
    job.setNumReduceTasks(4);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
2.4 Running on the Remote Hadoop Cluster
Upload the built executable jar to the Hadoop cluster, then submit the MapReduce job with the command /usr/local/hadoop/bin/hadoop jar hadoop.main.jar com.test.PartitionerApp.
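After the job completes, the output can be inspected on HDFS with the standard shell commands below. Given the four reduce tasks and the partitioner above, the /output/fruits directory (the path used in section 2.3) should contain files part-r-00000 through part-r-00003:

/usr/local/hadoop/bin/hdfs dfs -ls /output/fruits
/usr/local/hadoop/bin/hdfs dfs -cat /output/fruits/part-r-*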
3. Handling Errors When Running MapReduce
Container [pid=2347,containerID=container_1604042858880_0001_01_000007] is running 324233728B beyond the 'VIRTUAL' memory limit. Current usage: 110.7 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.
Cause: a container running on a worker node tried to use too much memory and was killed by the NodeManager.
Solution 1 (recommended):
Increase yarn.nodemanager.vmem-pmem-ratio to 5 or higher.
Solution 2 (not recommended):
Disable the virtual memory check by modifying yarn-site.xml. After the change, be sure to distribute the file to every node and restart the Hadoop cluster.
Set yarn.nodemanager.vmem-check-enabled to false.
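Both properties live in yarn-site.xml. A minimal sketch of the entries, using the ratio of 5 from solution 1 (choose one of the two approaches rather than applying both):

<!-- Solution 1: raise the virtual-to-physical memory ratio -->
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5</value>
</property>
<!-- Solution 2 (not recommended): disable the virtual memory check entirely -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>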