Configuring MapReduce on Ubuntu (English Version) [Big Data Processing Technology]
1. Experimental environment

|  | Version |
| --- | --- |
| OS | Ubuntu 20.04.4 LTS |
| JDK | 1.8.0_144 |
| Hadoop | 2.7.2 |
2. Configuration
2.1 Create a project in IDEA
2.2 Import JAR packages
Add the following JAR packages to the Java project:
(1) All JAR packages under the directory “/usr/local/hadoop/share/hadoop/common” (not including JAR packages in its subdirectories)
(2) All JAR packages under the directory “/usr/local/hadoop/share/hadoop/common/lib”
(3) All JAR packages under the directory “/usr/local/hadoop/share/hadoop/mapreduce” (not including JAR packages in its subdirectories)
(4) All JAR packages under the directory “/usr/local/hadoop/share/hadoop/mapreduce/lib”
(5) All JAR packages under the directory “/usr/local/hadoop/share/hadoop/hdfs” (not including JAR packages in its subdirectories)
(6) All JAR packages under the directory “/usr/local/hadoop/share/hadoop/hdfs/lib”
(7) All JAR packages under the directory “/usr/local/hadoop/share/hadoop/yarn” (not including JAR packages in its subdirectories)
(8) All JAR packages under the directory “/usr/local/hadoop/share/hadoop/yarn/lib”
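As a quick sanity check that the imported JARs are actually on the project classpath (this is only an illustration, not a required step of the experiment; the class name is arbitrary), a tiny program can print the Hadoop version, which should be 2.7.2 with the packages above:

import org.apache.hadoop.util.VersionInfo;

public class ClasspathCheck {
    public static void main(String[] args) {
        // Prints the Hadoop version found on the classpath, e.g. 2.7.2
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
    }
}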
2.3 Add core-site.xml and hdfs-site.xml to the src directory
Copy core-site.xml and hdfs-site.xml from the Hadoop configuration directory (etc/hadoop under the Hadoop installation) into the src folder of the IDEA project, so that the program can find the HDFS address (see section 5).
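These files are read from the classpath by the Configuration object. The following stand-alone sketch (not part of the experiment; it assumes the usual pseudo-distributed NameNode address hdfs://localhost:9000, and the class name is arbitrary) shows the equivalent effect of setting the address directly in code and listing an HDFS directory to confirm the connection works:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAccessCheck {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // With core-site.xml and hdfs-site.xml on the classpath this line is unnecessary;
        // hdfs://localhost:9000 is an assumed pseudo-distributed address.
        configuration.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(configuration);
        // List the HDFS home directory used later by the programs
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.println(status.getPath());
        }
    }
}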
2.4 Start Hadoop
start-all.sh
3. Program 1
3.1 Program Requirement
You are given multiple input files, each containing one integer per line. Write a MapReduce program that reads the contents of all input files, sorts the integers in ascending order, and outputs them to a new file.
The output format is two numbers per line: the first number is the rank and the second number is the ranked integer. Do not combine the same numbers; each occurrence is written on its own line. For example, if the input files contain 5, 1 and 3, the output lines are "1 1", "2 3" and "3 5".
3.2 Create input file
Create three local text files, input1.txt, input2.txt and input3.txt, each containing one integer per line.
3.3 Upload the input files to HDFS
./bin/hdfs dfs -mkdir input
./bin/hdfs dfs -put input1.txt input
./bin/hdfs dfs -put input2.txt input
./bin/hdfs dfs -put input3.txt input
./bin/hdfs dfs -ls input
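The same upload can also be done through the HDFS Java API instead of the shell; the sketch below is only an alternative illustration (the class name is arbitrary, and it assumes core-site.xml and hdfs-site.xml are on the classpath so that the default file system is HDFS):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadInputFiles {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        FileSystem fs = FileSystem.get(configuration);
        // Create the HDFS input directory and copy the three local files into it
        Path inputDir = new Path("/user/hadoop/input");
        fs.mkdirs(inputDir);
        for (String f : new String[]{"input1.txt", "input2.txt", "input3.txt"}) {
            // Local file names are assumed to be in the working directory
            fs.copyFromLocalFile(new Path(f), inputDir);
        }
    }
}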
3.4 Design process
3.4.1 Map phase
- Read the input text files as <Object, Text> key-value pairs (one line of text per record)
- Parse the integer from each line, wrap it in an IntWritable object, and write an <IntWritable, IntWritable> pair to the Context object (the first IntWritable is the parsed number; the second IntWritable can be any value, since only the key is used for sorting)
3.4.2 Shuffle phase
- Partition: default
- Sort: default (MapReduce sorts by key; since the key is an IntWritable, keys are compared numerically and come out in ascending order)
- Combine: default (no combiner is set)
- Group: default
Finally, new <k2, v2> pairs are formed and passed to the reduce phase.
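The claim that IntWritable keys are compared numerically (rather than as strings) can be checked with a small stand-alone snippet; this is only an illustration and is not part of the job:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class KeyOrderingDemo {
    public static void main(String[] args) {
        // Numeric comparison: 9 < 10, so the result is negative
        System.out.println(new IntWritable(9).compareTo(new IntWritable(10)));
        // Byte-wise Text comparison gives the opposite order: "9" > "10"
        System.out.println(new Text("9").compareTo(new Text("10")));
    }
}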
3.4.3 Reduce phase
- Receive <k2, [v2, v2, v2, ...]> from the shuffle phase (the keys arrive in sorted order)
- Loop over the values of each key; for every value, write <lineNum, key> (the rank followed by the integer) to the output file
- After all values of a key have been processed, increase lineNum by 1
3.4.4 Execution phase
- Register the Mapper and Reducer classes with a Job object
- Set the configuration, the output key/value classes, and the input/output paths
- Wait for the job to complete
3.5 Code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class MapReduceSort {
    public static class Map extends Mapper<Object, Text, IntWritable, IntWritable> {
        private static IntWritable data = new IntWritable();
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line contains one integer; use it as the map output key
            String line = value.toString();
            data.set(Integer.parseInt(line));
            // The value is a dummy IntWritable; only the key matters for sorting
            context.write(data, new IntWritable(1));
        }
    }

    public static class Reduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        private static IntWritable lineNum = new IntWritable(1);
        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Keys arrive in ascending order; emit one output line per occurrence of the key
            for (IntWritable val : values) {
                context.write(lineNum, key);
            }
            // The rank (lineNum) increases once per distinct key
            lineNum = new IntWritable(lineNum.get() + 1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, "sort");
        job.setJarByClass(MapReduceSort.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        String inputPath = "/user/hadoop/input";   // input path
        String outputPath = "/user/hadoop/output"; // output path
        FileInputFormat.addInputPath(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
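If the job is run a second time, it will fail because the output directory /user/hadoop/output already exists. One possible sketch (the helper class and method names below are hypothetical) is to reuse the existence check that Program 2 applies in section 4.5, calling something like OutputCleaner.deleteIfExists(configuration, outputPath) in main before FileOutputFormat.setOutputPath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputCleaner {
    // Deletes the given HDFS directory if it exists, mirroring the check used in Program 2 (section 4.5)
    public static void deleteIfExists(Configuration configuration, String outputPath) throws Exception {
        Path path = new Path(outputPath);
        FileSystem fileSystem = path.getFileSystem(configuration);
        if (fileSystem.exists(path)) {
            fileSystem.delete(path, true);
        }
    }
}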
3.6 Running code in IDEA
./bin/hdfs dfs -ls output
./bin/hdfs dfs -cat output/part-r-00000
4. Program 2
4.1 Program Requirement
You are given an input file that contains child-parent pairs, one per line. Write a MapReduce program that mines these relationships and outputs all grandchild-grandparent pairs. For example, from the pairs "Steven,Jack" and "Jack,Alice" in the input below, the program should output that Alice is a grandparent of Steven.
4.2 Create input file
Save the following child-parent pairs as parent.txt:
Steven,Jack
Jone,Lucy
Jone,Jack
Lucy,Mary
Lucy,Frank
Jack,Alice
Jack,Jesse
David,Alice
David,Jesse
Philip,David
Philip,Alma
Mark,David
4.3 Upload the input file to HDFS
./bin/hdfs dfs -mkdir winput
./bin/hdfs dfs -put parent.txt winput
./bin/hdfs dfs -ls winput
4.4 Design process
(1) In the map stage, emit each child-parent pair twice: once as (child, "-"+parent) and once in the reverse direction as (parent, "+"+child), and write both pairs to the context. The prefix marks whether a value was produced in the forward or the reverse direction: "-" means the value is a parent of the key, and "+" means it is a child of the key.
This is done so that, in the reduce stage, a person's children and parents meet under the same key and the prefixes tell them apart.
A1,A2 → (A1, -A2) (A2, +A1)
A2,A31 → (A2, -A31) (A31, +A2)
(2) MapReduce automatically groups the different values of the same key and passes them to the reduce stage. From the prefixes in the value list we can tell which values are children of the key (prefix "+", the grandchildren in the output) and which are parents of the key (prefix "-", the grandparents).
(A2, +A1) (A2, -A31) → (A2: A1, A31)
For each key, pair every child value with every parent value; each such pair is a grandchild-grandparent relationship.
4.5 Code
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FindGrandRelation {
    public static class Map extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Each line is "child,parent": the left column is the child, the right column is the parent
            String child = value.toString().split(",")[0];
            String parent = value.toString().split(",")[1];
            // Emit the pair in both directions and write them to the context:
            // A1,A2  -> (A1, -A2) (A2, +A1)
            // A2,A31 -> (A2, -A31) (A31, +A2)
            context.write(new Text(child), new Text("-" + parent));
            context.write(new Text(parent), new Text("+" + child));
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            ArrayList<Text> grandparent = new ArrayList<Text>();
            ArrayList<Text> grandchild = new ArrayList<Text>();
            for (Text t : values) {
                // Process each value in values
                String s = t.toString();
                if (s.startsWith("-")) {
                    // The "-" mark means the value is a parent of the key, i.e. a grandparent candidate
                    grandparent.add(new Text(s.substring(1)));
                } else {
                    // The "+" mark means the value is a child of the key, i.e. a grandchild candidate
                    grandchild.add(new Text(s.substring(1)));
                }
            }
            // Output every grandchild-grandparent combination for this key
            for (Text text : grandchild) {
                for (Text value : grandparent) {
                    context.write(text, value);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, "get GrandParent Relation");
        job.setJarByClass(FindGrandRelation.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        String inputPath = "/user/hadoop/winput";   // input path
        String outputPath = "/user/hadoop/woutput"; // output path
        // If the output directory already exists, delete it so the job can run again
        Path path = new Path(outputPath);
        FileSystem fileSystem = path.getFileSystem(configuration);
        if (fileSystem.exists(path)) {
            fileSystem.delete(path, true);
        }
        FileInputFormat.addInputPath(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
4.6 Running code in IDEA
./bin/hdfs dfs -ls woutput
./bin/hdfs dfs -cat woutput/part-r-00000
5. Problems and Solutions
Originally, I did not put the two XML files in the src folder, and the program could not connect to "hdfs://localhost:9000/input". Copying core-site.xml and hdfs-site.xml into the src folder solved the problem.
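To see which file system the program actually resolves, a small debugging sketch like the one below can help (it is only an illustration; hdfs://localhost:9000 is the assumed pseudo-distributed address from core-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class CheckDefaultFs {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // If core-site.xml is not on the classpath, this prints the local default file:///
        // instead of hdfs://localhost:9000, which explains the connection problem above.
        System.out.println("fs.defaultFS = " + configuration.get("fs.defaultFS"));
        System.out.println("FileSystem   = " + FileSystem.get(configuration).getUri());
    }
}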