(6) Spark Eclipse Development Environment: WordCount in Java & Python

WordCount in the Spark Eclipse development environment

Video tutorials:

1. Youku

2. YouTube

 

Installing Eclipse

Extract eclipse-jee-mars-2-win32-x86_64.zip.

 

Java WordCount

Extract spark-2.0.0-bin-hadoop2.6.tgz.

Create a Java Project named Spark.

Copy all the jars from the jars directory of spark-2.0.0-bin-hadoop2.6 into the lib directory of the Spark project.

Add them to the Build Path.

package com.bean.spark.wordcount;

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // Create a SparkConf object that holds the configuration of the Spark application.
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("wordcount");

        // Create the context object: Java code uses JavaSparkContext, Scala uses SparkContext.
        // The context connects to the Spark cluster and creates RDDs, accumulators and broadcast variables.
        JavaSparkContext sc = new JavaSparkContext(conf);

        // textFile (defined on SparkContext) reads a text file from HDFS, from the local file
        // system of the cluster nodes, or from any Hadoop-supported file system. It returns a
        // JavaRDD<String> with one element per line of the file.
        JavaRDD<String> lines = sc.textFile("D:/tools/data/wordcount/wordcount.txt");

        // Split each line into words.
        // flatMap is a transformation (its argument implements FlatMapFunction) that emits
        // every word of every line.
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Iterator<String> call(String s) throws Exception {
                return Arrays.asList(s.split(" ")).iterator();
            }
        });

        // Give every word an initial count of 1.
        // mapToPair is a transformation (its argument implements PairFunction; the three type
        // parameters of PairFunction<String, String, Integer> are <input word, Tuple2 key,
        // Tuple2 value>) and returns a new RDD, a JavaPairRDD.
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        // Count how many times each word occurs.
        // reduceByKey is a transformation (its argument implements Function2) that reduces the
        // values of each key; it returns a JavaPairRDD whose keys are the words and whose
        // values are the summed counts.
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Integer call(Integer s1, Integer s2) throws Exception {
                return s1 + s2;
            }
        });

        // Print every (word, count) pair.
        counts.foreach(new VoidFunction<Tuple2<String, Integer>>() {

            private static final long serialVersionUID = 1L;

            @Override
            public void call(Tuple2<String, Integer> wordcount) throws Exception {
                System.out.println(wordcount._1 + " : " + wordcount._2);
            }
        });

        // Write the result to the file system.
        /*
         * HDFS, new API:
         * org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
         * counts.saveAsNewAPIHadoopFile("hdfs://master:9000/data/wordcount/output", Text.class, IntWritable.class, TextOutputFormat.class, new Configuration());
         *
         * Or write to HDFS with the default TextOutputFormat (mind the HDFS write permissions;
         * if you lack them, run: hdfs dfs -chmod -R 777 /data/wordCount/output):
         * counts.saveAsTextFile("hdfs://soy1:9000/data/wordCount/output");
         */
        counts.saveAsTextFile("D:/tools/data/wordcount/output");

        // Close the SparkContext to end the job.
        sc.close();
    }
}
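As a side note, with Spark 2.0's Java API the same pipeline can also be written much more compactly using Java 8 lambdas instead of anonymous classes. The following is only a sketch for comparison (the class name WordCountLambda is made up here; the input path and the logic are the same as above):

package com.bean.spark.wordcount;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCountLambda {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("wordcount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaPairRDD<String, Integer> counts = sc
                .textFile("D:/tools/data/wordcount/wordcount.txt")      // one element per line
                .flatMap(s -> Arrays.asList(s.split(" ")).iterator())   // split each line into words
                .mapToPair(s -> new Tuple2<String, Integer>(s, 1))      // (word, 1)
                .reduceByKey((a, b) -> a + b);                          // sum the counts per word

        counts.foreach(t -> System.out.println(t._1 + " : " + t._2));

        sc.close();
    }
}

Either form can be submitted the same way; the anonymous-class version works on older Java versions, while the lambda form requires Java 8.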

 

 

If running fails

Add the following line to the code; it works as long as it is placed before the JavaSparkContext is initialized:

System.setProperty("hadoop.home.dir", "D:/tools/spark-2.0.0-bin-hadoop2.6");

Also extract hadoop2.6(x64)工具.zip into the D:\tools\spark-2.0.0-bin-hadoop2.6\bin directory.
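For instance, the top of the main method in WordCount above would then start like this (only the placement of the property matters; the path is the Spark installation directory used in this post):

public static void main(String[] args) {
    // Must run before the JavaSparkContext is created, so that Hadoop can find
    // its home directory (and winutils.exe) on Windows.
    System.setProperty("hadoop.home.dir", "D:/tools/spark-2.0.0-bin-hadoop2.6");

    SparkConf conf = new SparkConf();
    conf.setMaster("local");
    conf.setAppName("wordcount");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // ... the rest of the WordCount job stays unchanged ...
}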

 

Python WordCount

Integrate the Python plugin (PyDev) into Eclipse:

Extract pydev.zip and copy the contents of its features and plugins folders into the corresponding Eclipse directories.

# -*- coding: utf-8 -*-

from __future__ import print_function

import os
from operator import add

from pyspark.context import SparkContext

'''
wordcount
'''

if __name__ == "__main__":
    os.environ["HADOOP_HOME"] = "D:/tools/spark-2.0.0-bin-hadoop2.6"

    sc = SparkContext()

    # Read the input file; the RDD contains one element per line.
    # (The trailing map(lambda r: r[0:]) is effectively a no-op and could be dropped.)
    lines = sc.textFile("file:///D:/tools/data/wordcount/wordcount.txt").map(lambda r: r[0:])

    # Split each line into words, map every word to (word, 1), then sum the counts per word.
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)

    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

 

 

Submitting the code to run on the cluster

Java:

[hadoop@master application]$ spark-submit --master spark://master:7077 --class com.bean.spark.wordcount.WordCount spark.jar

Python:

[hadoop@master application]$ spark-submit --master spark://master:7077 wordcount.py

 

posted @ 2017-01-05 10:22  李小新