1. pom.xml:

  Note: many versions of the spark-streaming dependency were tried here and did not work; the version that finally built and ran correctly is the one used in the code below.

    <dependencies>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.5.2</version>
        </dependency>

    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>META-INF/spring.handlers</resource>
                                </transformer>
                                <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>META-INF/spring.schemas</resource>
                                </transformer>
                                <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>JavaStreaming</mainClass>
                                </transformer>
                            </transformers>
                            <createDependencyReducedPom>true</createDependencyReducedPom>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>  

 

2. Implementing wordCount On Line with Spark Streaming:

import java.util.Arrays;

import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;
import scala.Tuple2;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;


public class JavaStreaming {
    private static final Logger logger = LoggerFactory.getLogger(JavaStreaming.class);

    public static void main(String[] args) throws InterruptedException {
        
        logger.info("start..");
        
        // The master (e.g. yarn-cluster) is supplied by spark-submit; for local testing you would
        // also need setMaster("local[2]"), since the socket receiver occupies one thread.
        SparkConf conf = new SparkConf().setAppName("wordCountOnline");
        
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        String host = "localhost";    // placeholder: host/IP of the machine providing the socket stream
        int port = 7777;              // port 7777 was used here (9999 was already occupied, see note (2))
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream(host, port);        // create the Spark Streaming input source
        lines.count().print();
        
        // Split each line into words, returning an Iterable<String>
        System.out.println("words FlatMapFunction");
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;

            public Iterable<String> call(String line) {
                return Arrays.asList(line.split(" "));
            }
        });    
        
        
        // Map each word to a (word, 1) pair
        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        }); 
        
        // Sum the counts for identical words
        JavaPairDStream<String, Integer> word_count = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            
            public Integer call(Integer v1, Integer v2) {
                return v1 + v2;
            }
        });
        
        word_count.count().print();
        word_count.print();
        
        jssc.start();
        jssc.awaitTermination();
    }
}

  (1) When the Spark Streaming program above runs, it is not executed sequentially from top to bottom; the work is carried out in a distributed fashion across the cluster.

  (2) Port 9999 was tried first but turned out to be occupied; switching to port 7777 worked. (The socket can be fed with a tool such as netcat, e.g. nc -lk 7777.)

  (3) The real-time computation above only produces output when new data arrives on the monitored source, and each batch counts only the newly arrived data; to accumulate counts across batches, see the updateStateByKey sketch after these notes.

  (4) To monitor a directory in real time instead, replace the line JavaReceiverInputDStream<String> lines = jssc.socketTextStream(host, port) with:

      JavaDStream<String> lines = jssc.textFileStream("hdfs:///user/data");  

      The job then monitors the target directory for changes in real time; likewise, the word-count pipeline above only runs when new files are added to that directory, and the computation is performed on the newly added files.

      Note: use an absolute path, and all files in the directory must have the same format!
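
  As a follow-up to note (3): the reduceByKey step only counts words within each 10-second batch. If running totals across all batches are wanted, Spark Streaming's updateStateByKey can carry state from one batch to the next. The sketch below is a minimal illustration and not part of the original program: it would be inserted into main() after word_count is built, the HDFS checkpoint path is a hypothetical placeholder (updateStateByKey requires checkpointing to be enabled), and in Spark 1.5.x the Java API uses Guava's Optional for the state parameter. Two extra imports are needed: java.util.List and com.google.common.base.Optional.

        // enable checkpointing, which updateStateByKey requires; the path is a placeholder
        jssc.checkpoint("hdfs:///user/checkpoint");

        // fold each batch's counts for a word into a running total kept as state
        JavaPairDStream<String, Integer> totals = word_count.updateStateByKey(
                new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
                    private static final long serialVersionUID = 1L;

                    public Optional<Integer> call(List<Integer> newCounts, Optional<Integer> state) {
                        int sum = state.or(0);    // previous total, or 0 for a word seen for the first time
                        for (Integer c : newCounts) {
                            sum += c;
                        }
                        return Optional.of(sum);
                    }
                });

        totals.print();    // prints running totals instead of per-batch counts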

 

  3. Submitting and starting the Spark job:

      spark-submit --class JavaStreaming (the main class) --master yarn-cluster spark-streaming-0.0.1-SNAPSHOT.jar (the jar file name)

    List the running Spark jobs:

      yarn application -list

    Kill a Spark job:

      yarn application -kill applicationId    (use the Application-Id shown by yarn application -list)