Accessing HDFS from IDEA and a MapReduce Example
After setting up the highly available Hadoop cluster in the previous post, we can now access HDFS programmatically and run MapReduce examples.
1. Accessing HDFS
Create a Gradle project in IDEA, then add the following dependencies to build.gradle:
dependencies {
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.4'
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.4'
    compile group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '3.1.4'
    compile group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-core', version: '3.1.4'
    compile group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-common', version: '3.1.4'
}
Copy core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml into the project's resources directory, then add the following test code to the main class.
public static void main(String[] args) throws Exception {
    Configuration cfg = new Configuration();
    // Connect to the HA nameservice as user "hadoop"
    FileSystem fs = FileSystem.get(new URI("hdfs://mycluster"), cfg, "hadoop");
    // Download /test/fruits.txt from HDFS to the local D: drive
    fs.copyToLocalFile(new Path("/test/fruits.txt"), new Path("D:/"));
    fs.close();
}
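The snippet above only downloads a file. The MapReduce runs in sections 2.3 and 2.4 read /test/fruits.txt from HDFS, so the file has to be uploaded first; here is a minimal sketch, assuming the local file sits at D:/fruits.txt (a hypothetical path, not from the original setup):

static void uploadFruits() throws Exception {
    Configuration cfg = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://mycluster"), cfg, "hadoop");
    // The local source path is an assumption; adjust to wherever fruits.txt lives.
    fs.copyFromLocalFile(new Path("D:/fruits.txt"), new Path("/test/fruits.txt"));
    fs.close();
}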
2. MapReduce Example
2.1 Mapper, Reducer, and Partitioner Classes
package com.test;

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PartitionerApp {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        LongWritable one = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line is expected to be "<fruit> <count>"
            String line = value.toString();
            String[] words = line.split(" ");
            context.write(new Text(words[0]), new LongWritable(Long.parseLong(words[1])));
        }
    }

    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // Sum all counts for the same fruit
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static class MyPartitioner extends Partitioner<Text, LongWritable> {
        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {
            // Route each known fruit to its own reducer; everything else goes to partition 3
            if (key.toString().equals("Apple")) {
                return 0;
            }
            if (key.toString().equals("Orange")) {
                return 1;
            }
            if (key.toString().equals("Pear")) {
                return 2;
            }
            return 3;
        }
    }
}
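The mapper splits each line on a single space and expects a fruit name followed by a numeric count, and the partitioner routes Apple, Orange, and Pear to reducers 0 through 2. The original post does not show the contents of fruits.txt; an illustrative input file matching those assumptions might look like:

Apple 3
Orange 2
Pear 5
Apple 1
Banana 4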
2.2 Running Hadoop MapReduce Locally on Windows
2.2.1 Download a Hadoop distribution, extract it locally, and set the HADOOP_HOME environment variable.
When running from IDEA, this environment variable is read to locate the bin directory inside it. Hadoop does not actually need to be started locally; the bin directory only has to exist, otherwise an error is reported saying the HADOOP_HOME environment variable cannot be found.
2.2.2 The bin directory is missing winutils.exe and hadoop.dll. Download them separately from https://github.com/steveloughran/winutils and place the files under hadoop\bin.
2.2.3 Copy hadoop.dll to C:\Windows\System32, otherwise the following error is thrown: Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
2.2.4 Remove core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml from the resources directory.
2.2.5 Write the test method windowTest, then call it from main (a minimal call sketch follows the method below).
static void windowTest() throws Exception {
    /* The Hadoop config XML files must be removed from resources at this point */
    System.setProperty("hadoop.home.dir", "D:\\hadoop-3.1.4");
    Configuration cfg = new Configuration();
    Job job = Job.getInstance(cfg, "wordCount");
    job.setJarByClass(PartitionerApp.class);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setNumReduceTasks(2);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path("d:/fruits.txt"));
    String outPath = "d:/mpr.txt";
    // Delete the output directory if it already exists, otherwise the job fails
    if (Files.exists(Paths.get(outPath))) {
        try {
            FileUtil.fullyDelete(new File(outPath));
            //Files.delete(Paths.get(outPath));
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
    FileOutputFormat.setOutputPath(job, new Path(outPath));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
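As referenced above, a minimal sketch of the main method for this local run simply delegates to windowTest:

public static void main(String[] args) throws Exception {
    // Local Windows run: just invoke the test method.
    windowTest();
}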
2.3 Submitting MapReduce from the Local Machine Directly to the Remote Hadoop Cluster
2.3.1 Copy core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml back into the resources directory.
2.3.2 Modify main to submit the job to the remote cluster:
public static void main(String[] args) throws Exception {
    //System.setProperty("hadoop.home.dir", "D:\\hadoop-3.1.4");
    System.setProperty("HADOOP_USER_NAME", "root");
    args = new String[2];
    args[0] = "hdfs://mycluster/test/fruits.txt";
    args[1] = "hdfs://mycluster/output/fruits";
    Configuration configuration = new Configuration();
    configuration.set("fs.defaultFS", "hdfs://mycluster");
    configuration.set("mapreduce.app-submission.cross-platform", "true");
    // Build the project into an executable jar in IDEA, then point the job at that jar
    configuration.set("mapreduce.job.jar", "E:\\hadoop\\out\\artifacts\\hadoop_main_jar\\hadoop.main.jar");
    Path outputPath = new Path(args[1]);
    FileSystem fileSystem = FileSystem.get(configuration);
    // Delete the output directory on HDFS if it already exists
    if (fileSystem.exists(outputPath)) {
        fileSystem.delete(outputPath, true);
        System.out.println("outputPath: " + args[1] + " exists, but has been deleted.");
    }
    Job job = Job.getInstance(configuration, "FruitCount");
    job.setJarByClass(PartitionerApp.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    job.setMapperClass(MyMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setPartitionerClass(MyPartitioner.class);
    job.setNumReduceTasks(4);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
2.4 Running on the Remote Hadoop Cluster
Upload the built executable jar to the Hadoop cluster, then submit the MapReduce job with the command /usr/local/hadoop/bin/hadoop jar hadoop.main.jar com.test.PartitionerApp.
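After the job completes, the output can be inspected on HDFS with the standard shell commands below. Given the four reduce tasks and the partitioner above, the /output/fruits directory (the path used in section 2.3) should contain files part-r-00000 through part-r-00003:

/usr/local/hadoop/bin/hdfs dfs -ls /output/fruits
/usr/local/hadoop/bin/hdfs dfs -cat /output/fruits/part-r-*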
3. Handling Errors When Running MapReduce
Container [pid=2347,containerID=container_1604042858880_0001_01_000007] is running 324233728B beyond the 'VIRTUAL' memory limit. Current usage: 110.7 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.
Cause: a container running on a worker node tried to use too much memory and was killed by the NodeManager.
Solution 1 (recommended):
Increase yarn.nodemanager.vmem-pmem-ratio to 5 or higher.
Solution 2 (not recommended):
Disable the virtual memory check by modifying yarn-site.xml. After the change, be sure to distribute the file to every node and restart the Hadoop cluster.
Set yarn.nodemanager.vmem-check-enabled to false.
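Both properties live in yarn-site.xml. A minimal sketch of the entries, using the ratio of 5 from solution 1 (choose one of the two approaches rather than applying both):

<!-- Solution 1: raise the virtual-to-physical memory ratio -->
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5</value>
</property>
<!-- Solution 2 (not recommended): disable the virtual memory check entirely -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>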