【Todo】Finding Common Friends & Spark & Hadoop Interview Questions
I read through the interview questions in this article, <Spark 和hadoop的一些面试题(准备)> ("Some Spark and Hadoop interview questions (prep)"):
http://blog.csdn.net/qiezikuaichuan/article/details/51578743
One of the questions in it is quite good; see:
http://www.aboutyun.com/thread-18826-1-1.html
http://www.cnblogs.com/lucius/p/3483494.html
I think it is worth actually coding it up on Hadoop.
I also think the following passage from the first article is a good summary:
Briefly describe the data mining algorithms you know and their typical use cases:
(1) Classification models
    - Spam filtering: usually handled with Naive Bayes
    - Medical tumor diagnosis: identified with a classification model
(2) Prediction models
    - Wine quality: classification and regression trees (CART) used to predict and judge the quality of red wine
    - Search-engine query volume vs. stock price fluctuations
(3) Association analysis: Walmart's beer and diapers
(4) Cluster analysis: retail customer segmentation
(5) Outlier analysis: detecting fraudulent transactions in payments
(6) Collaborative filtering: e-commerce "you may also like" and recommendation engines
(7) Social network analysis: seed customers in telecom
(8) Text analysis
    - Character recognition: OCR scanning apps (扫描王)
    - Literature and statistics: attributing the authorship of Dream of the Red Chamber (红楼梦)
Back to the common-friends question above: I wrote a program and gave it a try.
It lives in the IntelliJ project HadoopProj, a Maven project with the following dependencies:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.hadoop.my</groupId>
    <artifactId>hadoop-proj</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.3</version>
        </dependency>
    </dependencies>

    <repositories>
        <repository>
            <id>aliyunmaven</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
    </repositories>
</project>
The code is as follows:
package com.hadoop.my;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * Created by baidu on 16/12/3.
 */
public class HadoopProj {

    public static class CommonFriendsMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line looks like "A:B,C,D,F,E,O", i.e. person A's friend list.
            String line = value.toString();
            String[] split = line.split(":");
            String person = split[0];
            String[] friends = split[1].split(",");
            // Emit (friend, person) so the reducer groups everyone who lists that friend.
            for (String f : friends) {
                context.write(new Text(f), new Text(person));
            }
        }
    }

    public static class CommonFriendsReducer extends Reducer<Text, Text, Text, Text> {
        // Input:  <B->A> <B->E> <B->F> ...
        // Output: B    A,E,F,J   (everyone who has B in their friend list)
        @Override
        protected void reduce(Text friend, Iterable<Text> persons, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text person : persons) {
                sb.append(person).append(",");
            }
            context.write(friend, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Read and parse all xxx-site.xml config files on the classpath.
        Configuration conf = new Configuration();
        Job friendJob = Job.getInstance(conf);

        // Locate the jar containing this job's code via the main class's classloader.
        friendJob.setJarByClass(HadoopProj.class);

        // Mapper and reducer classes for this job.
        friendJob.setMapperClass(CommonFriendsMapper.class);
        friendJob.setReducerClass(CommonFriendsReducer.class);

        // Output key/value types of the reducer (also used for the map output,
        // since setMapOutputKeyClass/ValueClass are not called here).
        friendJob.setOutputKeyClass(Text.class);
        friendJob.setOutputValueClass(Text.class);

        // Input path of the files to process.
        FileInputFormat.setInputPaths(friendJob, new Path(args[0]));
        // Output path for the result files.
        FileOutputFormat.setOutputPath(friendJob, new Path(args[1]));

        // Submit the job to the Hadoop cluster and wait for completion.
        boolean res = friendJob.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}
After packaging it into a jar, I uploaded it to the Hadoop machine m42n05.
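The POM above does not configure any packaging plugin, so a plain Maven build is presumably just:

$ mvn clean package

which should leave target/hadoop-proj-1.0-SNAPSHOT.jar (name inferred from the POM; it is referred to as hadoop-proj.jar below). Note that a jar built this way has no Main-Class in its manifest, while a jar exported through IntelliJ's artifact settings usually does; that detail matters for the hadoop jar invocation further down.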
On that machine I also created the input file, with this content:
A:B,C,D,F,E,O
B:A,C,E,K
C:F,A,D,I
D:A,E,F,L
E:B,C,D,M,L
F:A,B,C,D,E,O,M
G:A,C,D,E,F
H:A,C,D,E,O
I:A,O
J:B,O
K:A,C,D
L:D,E,F
M:E,F,G
O:A,H,I,J
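As a quick sanity check of the mapper logic: the first line A:B,C,D,F,E,O should make the mapper emit the (friend, person) pairs

B	A
C	A
D	A
F	A
E	A
O	A

so after the shuffle, the reduce call for key B receives every person who lists B as a friend (here A, E, F, J), which is exactly what shows up in the final output below.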
Commands:
$ hadoop fs -mkdir /input/frienddata
$ hadoop fs -put text.txt /input/frienddata
$ hadoop fs -ls /input/frienddata
Found 1 items
-rw-r--r--   3 work supergroup        142 2016-12-03 17:12 /input/frienddata/text.txt
Copy hadoop-proj.jar to /home/work/data/installed/hadoop-2.7.3/myjars on m42n05.
Run the command:
$ hadoop jar /home/work/data/installed/hadoop-2.7.3/myjars/hadoop-proj.jar com.hadoop.my.HadoopProj /input/frienddata /output/frienddata
It failed with an error:
$ hadoop jar /home/work/data/installed/hadoop-2.7.3/myjars/hadoop-proj.jar com.hadoop.my.HadoopProj /input/frienddata /output/frienddata
16/12/03 17:19:52 INFO client.RMProxy: Connecting to ResourceManager at master.Hadoop/10.117.146.12:8032
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://master.Hadoop:8390/input/frienddata already exists
It looks like the command-line arguments were picked up at the wrong indices. Note how the code reads them:
// Input path of the files to process.
FileInputFormat.setInputPaths(friendJob, new Path(args[0]));
// Output path for the result files.
FileOutputFormat.setOutputPath(friendJob, new Path(args[1]));
In Java, unlike C++, args[0] is already the first real argument; the program name itself does not take a slot. So the class name on the command line is probably the problem: the jar's manifest apparently already names the main class, in which case hadoop jar treats com.hadoop.my.HadoopProj as an ordinary argument. That shifts args[0] to the class name and args[1] to /input/frienddata, which is exactly the path the "output directory already exists" error complains about. Retrying without the class name:
$ hadoop jar /home/work/data/installed/hadoop-2.7.3/myjars/hadoop-proj.jar /input/frienddata /output/frienddata
This time the job ran and produced the following output:
16/12/03 17:24:33 INFO client.RMProxy: Connecting to ResourceManager at master.Hadoop/10.117.146.12:8032
16/12/03 17:24:33 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/12/03 17:24:34 INFO input.FileInputFormat: Total input paths to process : 1
16/12/03 17:24:34 INFO mapreduce.JobSubmitter: number of splits:1
16/12/03 17:24:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1478254572601_0002
16/12/03 17:24:34 INFO impl.YarnClientImpl: Submitted application application_1478254572601_0002
16/12/03 17:24:34 INFO mapreduce.Job: The url to track the job: http://master.Hadoop:8320/proxy/application_1478254572601_0002/
16/12/03 17:24:34 INFO mapreduce.Job: Running job: job_1478254572601_0002
16/12/03 17:24:40 INFO mapreduce.Job: Job job_1478254572601_0002 running in uber mode : false
16/12/03 17:24:40 INFO mapreduce.Job:  map 0% reduce 0%
16/12/03 17:24:45 INFO mapreduce.Job:  map 100% reduce 0%
16/12/03 17:24:49 INFO mapreduce.Job:  map 100% reduce 100%
16/12/03 17:24:50 INFO mapreduce.Job: Job job_1478254572601_0002 completed successfully
16/12/03 17:24:50 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=348
                FILE: Number of bytes written=238531
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=258
                HDFS: Number of bytes written=156
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=2651
                Total time spent by all reduces in occupied slots (ms)=2446
                Total time spent by all map tasks (ms)=2651
                Total time spent by all reduce tasks (ms)=2446
                Total vcore-milliseconds taken by all map tasks=2651
                Total vcore-milliseconds taken by all reduce tasks=2446
                Total megabyte-milliseconds taken by all map tasks=2714624
                Total megabyte-milliseconds taken by all reduce tasks=2504704
        Map-Reduce Framework
                Map input records=14
                Map output records=57
                Map output bytes=228
                Map output materialized bytes=348
                Input split bytes=116
                Combine input records=0
                Combine output records=0
                Reduce input groups=14
                Reduce shuffle bytes=348
                Reduce input records=57
                Reduce output records=14
                Spilled Records=114
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=111
                CPU time spent (ms)=1850
                Physical memory (bytes) snapshot=455831552
                Virtual memory (bytes) snapshot=4239388672
                Total committed heap usage (bytes)=342360064
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=142
        File Output Format Counters
                Bytes Written=156
Check where the output files ended up:
$ hadoop fs -ls /output/frienddata
Found 2 items
-rw-r--r--   3 work supergroup          0 2016-12-03 17:24 /output/frienddata/_SUCCESS
-rw-r--r--   3 work supergroup        156 2016-12-03 17:24 /output/frienddata/part-r-00000
$ hadoop fs -cat /output/frienddata/part-r-00000
A	I,K,C,B,G,F,H,O,D,
B	A,F,J,E,
C	A,E,B,H,F,G,K,
D	G,C,K,A,L,F,E,H,
E	G,M,L,H,A,F,B,D,
F	L,M,D,C,G,A,
G	M,
H	O,
I	O,C,
J	O,
K	B,
L	D,E,
M	E,F,
O	A,H,I,J,F,
Of course, the output can also be merged down into a local file:
$ hdfs dfs -getmerge hdfs://master.Hadoop:8390/output/frienddata /home/work/frienddatatmp
$ cat frienddatatmp
A	I,K,C,B,G,F,H,O,D,
B	A,F,J,E,
C	A,E,B,H,F,G,K,
D	G,C,K,A,L,F,E,H,
E	G,M,L,H,A,F,B,D,
F	L,M,D,C,G,A,
G	M,
H	O,
I	O,C,
J	O,
K	B,
L	D,E,
M	E,F,
O	A,H,I,J,F,
That wraps up this exercise.
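For completeness: this first job only builds the inverted friend list, i.e. for each user, everyone who has that user in their friend list. To get the common friends of each pair of users, the usual approach is a second MapReduce job over this output: the mapper pairs up the people on each line and emits (pair, friend), and the reducer then collects each pair's shared friends. A rough, untested sketch (class names are my own; the driver would mirror the one above, with the input path pointed at /output/frienddata):

public static class FriendPairMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each line of the first job's output looks like "B\tA,F,J,E," (key, tab, value).
        String[] split = value.toString().split("\t");
        String friend = split[0];
        String[] persons = split[1].split(",");
        // Sort so every pair is emitted in one canonical order, e.g. "A-E" rather than "E-A".
        java.util.Arrays.sort(persons);
        for (int i = 0; i < persons.length - 1; i++) {
            for (int j = i + 1; j < persons.length; j++) {
                context.write(new Text(persons[i] + "-" + persons[j]), new Text(friend));
            }
        }
    }
}

public static class FriendPairReducer extends Reducer<Text, Text, Text, Text> {
    // Input:  <A-E -> B> <A-E -> C> <A-E -> D> ...
    // Output: A-E    B C D   (the common friends of A and E)
    @Override
    protected void reduce(Text pair, Iterable<Text> friends, Context context)
            throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        for (Text f : friends) {
            sb.append(f).append(" ");
        }
        context.write(pair, new Text(sb.toString()));
    }
}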