Spark Streaming Demo

Original post; please credit the source when reposting: https://www.cnblogs.com/agilestyle/p/14701977.html


Project Directory

(screenshot of the project layout)

Maven Dependency

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.fool</groupId>
    <artifactId>hellospark</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.1.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>3.1.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.1.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
            <version>3.1.1</version>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.8.1</version>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>

</project>
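
Note: the POM above only configures the Java compiler, but the demo source is Scala, so a command-line Maven build also needs a Scala compiler plugin under <build><plugins>. A minimal sketch using scala-maven-plugin (the plugin version and goal bindings here are assumptions; the original project may simply have been compiled from an IDE):

<plugin>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>4.5.6</version>
    <executions>
        <execution>
            <goals>
                <!-- compile main and test .scala sources in the normal lifecycle -->
                <goal>compile</goal>
                <goal>testCompile</goal>
            </goals>
        </execution>
    </executions>
</plugin>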


Kafka Setup

wget https://downloads.apache.org/kafka/2.8.0/kafka_2.12-2.8.0.tgz

tar zxvf kafka_2.12-2.8.0.tgz -C ~/app

cd kafka_2.12-2.8.0

# start the zookeeper service
bin/zookeeper-server-start.sh config/zookeeper.properties

# start the Kafka broker service
bin/kafka-server-start.sh config/server.properties

# create topic
bin/kafka-topics.sh --create --topic spark_topic --bootstrap-server localhost:9092

# describe the topic
bin/kafka-topics.sh --describe --topic spark_topic --bootstrap-server localhost:9092

# produce messages
bin/kafka-console-producer.sh --topic spark_topic --bootstrap-server localhost:9092

# consume messages
bin/kafka-console-consumer.sh --topic spark_topic --from-beginning --bootstrap-server localhost:9092

Note: the Spark Streaming job below consumes from Kafka, so keep the ZooKeeper and broker services above running. The checkpoint and output paths in the code also assume a local HDFS at hdfs://127.0.0.1:9000.


SparkStreamingDemo.scala

package org.fool.spark

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.util.parsing.json.JSON

object SparkStreamingDemo {
  def main(args: Array[String]): Unit = {
    // run locally; the direct Kafka stream has no long-running receiver, so a single core suffices
    val sparkConf = new SparkConf().setAppName("SparkStreamingDemo").setMaster("local")

    // micro-batch interval: one batch every 30 seconds
    val streamingContext = new StreamingContext(sparkConf, Seconds(30))

    // checkpoint streaming metadata to HDFS so the job can recover after a failure
    val checkPointDirectory = "hdfs://127.0.0.1:9000/spark/checkpoint"
    streamingContext.checkpoint(checkPointDirectory)

    // standard Kafka consumer settings; auto-commit is disabled so Spark controls the offsets
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark_streaming",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("spark_topic")

    // direct stream: executors read the Kafka partitions directly, no receiver involved
    val stream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    // ETL step: keep only messages that parse as valid JSON, then count and persist each batch
    val etlResultDirectory = "hdfs://127.0.0.1:9000/spark/etl/"
    val etlResult = stream.map(record => record.value()).filter(message => JSON.parseFull(message).isDefined)
    etlResult.count().print()
    etlResult.saveAsTextFiles(etlResultDirectory)

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}

Note: the job processes the stream in micro-batches, one every 30 seconds (the Seconds(30) batch interval above).
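
The demo only checks that each message parses as JSON; the same parse result can also be used to extract fields. Below is a hedged sketch that pulls the name field out of each valid message with the same scala.util.parsing.json.JSON parser (the field name comes from the test message in the next section; this extension is illustrative, not part of the original demo, and would go before streamingContext.start()):

// illustrative extension (assumption: messages are JSON objects carrying a "name" field)
val names = stream
  .map(record => record.value())
  .flatMap(message => JSON.parseFull(message).toList) // drop messages that fail to parse
  .flatMap {
    case fields: Map[_, _] => fields.asInstanceOf[Map[String, Any]].get("name").map(_.toString)
    case _                 => None // valid JSON, but not a JSON object
  }
names.print() // prints a sample of extracted names for each 30s batch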


Run Test
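
With ZooKeeper, Kafka, and HDFS running, start the streaming job first. One way to launch it locally is sketched below, assuming the jar was built with Maven defaults (the jar name follows the artifactId and version in the POM, and --packages fetches the Kafka integration at runtime); no --master flag is needed because the code sets it via setMaster("local"):

spark-submit \
  --class org.fool.spark.SparkStreamingDemo \
  --packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.1.1 \
  target/hellospark-1.0-SNAPSHOT.jar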

Type a message into the Kafka console producer:

{"name":"Caocao","create_time":"2021-06-22 01:45:52.478","update_time":"2021-06-22 02:45:52.478"}


Console Output

(screenshot of the console output)

HDFS Output

(screenshot of the HDFS output)
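
The saved batches can also be inspected from the command line. saveAsTextFiles appends a batch timestamp to the output prefix, so each 30-second batch lands in its own directory under the etlResultDirectory configured in the code (exact directory names depend on the batch times):

hdfs dfs -ls /spark/etl
hdfs dfs -cat '/spark/etl/-*/part-*'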
