
Configuring a MapReduce Experiment on Ubuntu (English Version) [Big Data Processing Technology]

1. Experimental environment

OS: Ubuntu 20.04.4 LTS
JDK: 1.8.0_144
Hadoop: 2.7.2

2. Configuration

2.1 Create a project in IDEA


2.2 Import JAR packages

Add the following JAR packages to the Java project:
(1) All JARs directly under “/usr/local/hadoop/share/hadoop/common” (not including those in its subdirectories)
(2) All JARs under “/usr/local/hadoop/share/hadoop/common/lib”
(3) All JARs directly under “/usr/local/hadoop/share/hadoop/mapreduce” (not including those in its subdirectories)
(4) All JARs under “/usr/local/hadoop/share/hadoop/mapreduce/lib”
(5) All JARs directly under “/usr/local/hadoop/share/hadoop/hdfs” (not including those in its subdirectories)
(6) All JARs under “/usr/local/hadoop/share/hadoop/hdfs/lib”
(7) All JARs directly under “/usr/local/hadoop/share/hadoop/yarn” (not including those in its subdirectories)
(8) All JARs under “/usr/local/hadoop/share/hadoop/yarn/lib”

2.3 Add core-site.xml and hdfs-site.xml to the src directory

Add core-site.xml and hdfs-site.xml to the src folder of the IDEA project.
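A quick way to verify that the two files are actually picked up from the classpath is a minimal check like the sketch below. It assumes the usual pseudo-distributed setting fs.defaultFS = hdfs://localhost:9000 in core-site.xml, and the class name ConfigCheck is only for illustration. If the files are missing, fs.defaultFS falls back to file:/// and the jobs read the local file system instead of HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ConfigCheck {
    public static void main(String[] args) throws Exception {
        // new Configuration() reads core-site.xml from the classpath (the src folder in this project)
        Configuration conf = new Configuration();
        // Expected: hdfs://localhost:9000 if core-site.xml was found, file:/// otherwise
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
        System.out.println("FileSystem   = " + FileSystem.get(conf).getUri());
    }
}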

2.4 Start Hadoop


start-all.sh
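start-all.sh is located in the sbin directory of the Hadoop installation (here /usr/local/hadoop/sbin); in Hadoop 2.x it simply calls start-dfs.sh and start-yarn.sh. Running jps afterwards should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager; if any of them is missing, check the logs under /usr/local/hadoop/logs before continuing.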

3. Program 1

3.1 Program Requirement

You are given multiple input files, each containing one integer per line. Write a MapReduce program that reads all the input files, sorts the integers in ascending order, and writes the result to a new file.
The output format is two numbers per line: the first is the rank, the second is the ranked integer. Equal numbers are not merged; each occurrence gets its own rank. For example, if the input files together contain 32, 654, 32 and 2, the expected output is:
1 2
2 32
3 32
4 654

3.2 Create input file

Create three text files named input1.txt, input2.txt and input3.txt, each containing one integer per line; they will be uploaded to HDFS in the next step.

3.3 Upload input files to HDFS


./bin/hdfs dfs -mkdir input
./bin/hdfs dfs -put input1.txt input
./bin/hdfs dfs -put input2.txt input
./bin/hdfs dfs -put input3.txt input
./bin/hdfs dfs -ls input
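Note that these HDFS paths are relative, so they resolve against the current user's home directory on HDFS; for the hadoop user, input is /user/hadoop/input, which is exactly the inputPath hard-coded in the program in section 3.5.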

3.4 Design process

3.4.1 Map phase

  1. Read the input text line by line as <Object, Text> pairs (byte offset, line content)
  2. Parse the integer from each line, wrap it in an IntWritable, and write <IntWritable, IntWritable> to the Context object: the key is the parsed number, and the value is just a placeholder IntWritable(1), since only the key matters for sorting

3.4.2 Shuffle phase

  1. Partition: default (with the default single reduce task, all keys go to one reducer, so the final output file is globally sorted)
  2. Sort: default (MapReduce sorts by key; IntWritable keys compare numerically, giving ascending order)
  3. Combine: default (no combiner is set)
  4. Group: default (all values with the same key are grouped into one list)

Finally, the grouped <k2, [v2, v2, ...]> pairs are handed to the reduce phase.

3.4.3 Reduce phase

  1. Receive <k2, [v2, v2, v2...]> from the shuffle phase (the keys arrive in ascending order)
  2. For each value in the list, write "lineNum key" to the output file and then increase lineNum by 1, so duplicate numbers each receive their own rank (see the trace below)
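A minimal trace with the illustrative numbers 2, 32, 32, 654 from section 3.1:

Map output:         (2,1) (32,1) (32,1) (654,1)
After shuffle/sort: (2,[1])  (32,[1,1])  (654,[1])
Reduce output:      1 2 | 2 32, 3 32 | 4 654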

3.4.4 Execution phase

  1. Register the Mapper and Reducer classes with a Job object
  2. Set the configuration, the output key/value classes, and the input/output paths
  3. Wait for the job to complete

3.5 Code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class MapReduceSort {
    public static class Map extends Mapper<Object, Text, IntWritable, IntWritable> {
        private static IntWritable data = new IntWritable();
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line holds one integer; parse it and use it as the output key
            String line = value.toString();
            data.set(Integer.parseInt(line));
            // The value is only a placeholder; sorting happens on the key
            context.write(data, new IntWritable(1));
        }
    }

    public static class Reduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        private static IntWritable lineNum = new IntWritable(1);

        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Keys arrive in ascending order; emit one "rank number" line per occurrence
            for (IntWritable val : values){
                context.write(lineNum, key);
                // Increment the rank after every line so duplicate numbers get distinct ranks
                lineNum = new IntWritable(lineNum.get() + 1);
            }
        }
    }
    
    public static void main(String[] args) throws Exception{
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, "sort");
        job.setJarByClass(MapReduceSort.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        String inputPath = "/user/hadoop/input";     // input path
        String outputPath = "/user/hadoop/output";   // output path
        FileInputFormat.addInputPath(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));
        System.exit(job.waitForCompletion(true) ? 0: 1);
    }
}

3.6 Running code in IDEA

./bin/hdfs dfs -ls output
./bin/hdfs dfs -cat output/part-r-00000
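If /user/hadoop/output is left over from an earlier run, remove it first (./bin/hdfs dfs -rm -r output); FileOutputFormat refuses to write into an existing directory, so the job would otherwise fail with an "output directory already exists" error. The Program 2 code below deletes its output directory programmatically for the same reason.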


4. Program 2

4.1 Program Requirement

You are given an input file that contains child-parent pairs, one per line. Write a MapReduce program that mines these relationships and outputs all grandchild-grandparent pairs. For example, the pairs (Steven, Jack) and (Jack, Alice) imply the grandchild-grandparent pair (Steven, Alice).

4.2 Create input file

Steven,Jack
Jone,Lucy
Jone,Jack
Lucy,Mary
Lucy,Frank
Jack,Alice
Jack,Jesse
David,Alice
David,Jesse
Philip,David
Philip,Alma
Mark,David


4.3 Upload input file to HDFS

./bin/hdfs dfs -mkdir winput
./bin/hdfs dfs -put parent.txt winput
./bin/hdfs dfs -ls winput


4.4 Design process

(1) In the map stage, each input line "child,parent" is emitted twice: once as (child, "-parent") and once as (parent, "+child"). The "-"/"+" prefix lets the reduce stage tell whether a value is a parent or a child of its key.
A1,A2 → (A1, -A2) (A2, +A1)
A2,A3 → (A2, -A3) (A3, +A2)
(2) MapReduce automatically groups all values with the same key and passes them to the reduce stage. Using the prefixes, the reducer splits the value list into the key's parents (grandparent candidates) and the key's children (grandchild candidates).
(A2, +A1) (A2, -A3) → (A2: [+A1, -A3])
Joining the two lists for the same key yields the grandchild-grandparent pairs, e.g. (A1, A3).
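Concretely, for the input file in 4.2, the reduce call for the key Jack receives (value order is not guaranteed):

key = Jack
values = [+Steven, +Jone, -Alice, -Jesse]
grandchild  = {Steven, Jone}
grandparent = {Alice, Jesse}
output: Steven Alice, Steven Jesse, Jone Alice, Jone Jesse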

4.5 Code

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class FindGrandRelation {
    public static class Map extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Use "," to separate data, left column is child, right column is parent
            String child = value.toString().split(",")[0];
            String parent = value.toString().split(",")[1];
            // Generate positive and negative key-value and press them into the context
            context.write(new Text(child), new Text("-" + parent));
            context.write(new Text(parent), new Text("+" + child));
            // Get A1,A2  --->  (A1 A2)(A2,A1)
            // A2,A31	---->(A2,A31)(A31,A2)
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            ArrayList<Text> grandparent = new ArrayList<Text>();
            ArrayList<Text> grandchild = new ArrayList<Text>();
            for (Text t : values) {
                // Strip the one-character marker from the front of each value
                String s = t.toString();
                // "-" marks the key's parent, i.e. a grandparent candidate for the key's children
                if (s.startsWith("-")) {
                    grandparent.add(new Text(s.substring(1)));
                } else {
                    // "+" marks the key's child, i.e. a grandchild candidate
                    grandchild.add(new Text(s.substring(1)));
                }
            }
            // Cross-join the two lists: every (grandchild, grandparent) combination is a result pair
            for (Text text : grandchild) {
                for (Text value : grandparent) {
                    context.write(text, value);
                }
            }
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, "get GrandParent Relation");
        job.setJarByClass(FindGrandRelation.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        String inputPath = "/user/hadoop/winput";     // input path
        String outputPath = "/user/hadoop/woutput";   // output path

        // Judge whether the output directory exists. If it exists, delete it
        Path path = new Path(outputPath);
        FileSystem fileSystem = path.getFileSystem(configuration);
        if (fileSystem.exists(path)) {
            fileSystem.delete(path, true);
        }
        
        FileInputFormat.addInputPath(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));
        System.exit(job.waitForCompletion(true) ? 0: 1);
    }
}

4.6 Running code in IDEA

./bin/hdfs dfs -ls woutput
./bin/hdfs dfs -cat woutput/part-r-00000
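For the input file from section 4.2, part-r-00000 should contain these ten grandchild-grandparent pairs (the line order may differ, since it depends on key sorting and on the order in which values arrive at the reducer):

Philip Alice, Philip Jesse, Mark Alice, Mark Jesse
Steven Alice, Steven Jesse, Jone Alice, Jone Jesse
Jone Mary, Jone Frank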


5. Problems and Solutions

Originally, I did not put the two XML files (core-site.xml and hdfs-site.xml) into the src folder, so the program could not connect to HDFS when using paths under "hdfs://localhost:9000/" such as the input directory. Putting the two XML files into the src folder solved the problem.

posted @ 2023-08-09 08:05  LateSpring