This post covers some operations for reading from advanced (high-level) data sources in Spark Streaming.

Install Kafka

Reference: https://dblab.xmu.edu.cn/blog/1096/

Next, test a simple example on Ubuntu by running the following commands in order:

# Enter the Kafka installation directory
cd /usr/local/kafka
bin/zookeeper-server-start.sh config/zookeeper.properties

After this command runs, it does not return to the shell prompt; ZooKeeper starts with its default configuration file. Do not close this terminal. Open a new terminal and enter:

cd /usr/local/kafka
bin/kafka-server-start.sh config/server.properties

The Kafka server is now running; again, do not close this terminal. Open another terminal and enter:

cd /usr/local/kafka
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic dblab

A topic is the category under which messages are published. The command above creates a topic named dblab with a single-node configuration (1 partition, replication factor 1). Use --list to show all existing topics and confirm that the one just created is there:

bin/kafka-topics.sh --list --zookeeper localhost:2181

The dblab topic should appear in the output.

Advanced input sources such as Kafka and Flume rely on separate libraries (jar files).

The import statements can be tested in spark-shell, but Spark's bundled jars do not include these libraries.

To use Kafka, the matching spark-streaming-kafka jar must be downloaded; the version used here is spark-streaming-kafka-0-10_2.12-3.2.0, where 2.12 is the Scala version and 3.2.0 is the Spark version.

Visit the MVNREPOSITORY site (http://mvnrepository.com) and download spark-streaming-kafka-0-10_2.12-3.2.0.jar and spark-token-provider-kafka-0-10_2.12-3.2.0.jar (2.12 is the Scala version, 3.2.0 is the Spark version). Alternatively, both files can be downloaded from the "Software" directory of the "Download Area" on the textbook's companion website. Copy the two files into Spark's jars directory (i.e. /usr/local/spark/jars). In addition, copy kafka-clients-2.6.0.jar from /usr/local/kafka/libs into the same jars directory.

After downloading, copy the file under /usr/local/spark/jars/ (a kafka subdirectory is created for it here):

[atguigu@hadoop102 ~]$ cd /usr/local/spark/jars/
[atguigu@hadoop102 jars]$ mkdir kafka
[atguigu@hadoop102 jars]$ cd
[atguigu@hadoop102 ~]$ cd 下载
[atguigu@hadoop102 下载]$ cp ./spark-streaming-kafka-0-10_2.12-3.2.0.jar /usr/local/spark/jars/kafka/

Then copy all jar files from the libs directory of the Kafka installation into /usr/local/spark/jars/kafka:

[atguigu@hadoop102 下载]$ cd /usr/local/kafka/libs/
[atguigu@hadoop102 libs]$ cp ./* /usr/local/spark/jars/kafka
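
With the jars in place, the import statements mentioned earlier can be checked in spark-shell. The lines below are only a minimal sketch; they assume the jars are on Spark's classpath (either copied into /usr/local/spark/jars, or passed to spark-shell with a --jars list pointing at /usr/local/spark/jars/kafka):

// Quick check inside spark-shell that the Kafka libraries are visible
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.clients.producer.KafkaProducer
// If both imports succeed without an "object kafka010 is not a member of package ..." error,
// the Kafka dependencies have been picked up by Spark.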

Writing a Spark Streaming program that uses a Kafka data source

Steps:

1. Write the producer program

2. Write the consumer program

3. Compile and package the program

4. Run the program

1. Write the producer program

cd /usr/local/spark/mycode
mkdir kafka
cd kafka
mkdir -p src/main/scala
cd src/main/scala
vim KafkaWordProducer.scala

Use the vim editor to create KafkaWordProducer.scala. This program generates a stream of strings made of random integers; each integer is treated as a word and is fed to the KafkaWordCount program for word-frequency counting. Enter the following code in KafkaWordProducer.scala:

import java.util.HashMap
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._
object KafkaWordProducer {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordProducer <metadataBrokerList> <topic> " +
        "<messagesPerSec> <wordsPerMessage>")
      System.exit(1)
    }
    val Array(brokers, topic, messagesPerSec, wordsPerMessage) = args
    // Kafka producer connection properties
    val props = new HashMap[String, Object]()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    // Send some messages
    while(true) {
      (1 to messagesPerSec.toInt).foreach { messageNum =>
        val str = (1 to wordsPerMessage.toInt)
          .map(x => scala.util.Random.nextInt(10).toString)
          .mkString(" ")
        print(str)
        println()
        val message = new ProducerRecord[String, String](topic, null, str)
        producer.send(message)
      }
     Thread.sleep(1000)
    }
  }
}

2. Write the consumer program

Save and exit vim. Then, still in the current directory, create the KafkaWordCount.scala code file:

vim KafkaWordCount.scala
import org.apache.spark._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaWordCount{
  def main(args:Array[String]){
    val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")
    val ssc = new StreamingContext(sc,Seconds(10))
    ssc.checkpoint("file:///usr/local/spark/mycode/kafka/checkpoint") // Set a checkpoint directory; to keep it on HDFS, write something like ssc.checkpoint("/user/hadoop/checkpoint") instead, but Hadoop must be running
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (true: java.lang.Boolean)
    )
    val topics = Array("wordsender")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
    stream.foreachRDD(rdd => {
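      // Record this batch's Kafka offset ranges (captured here but not used further in this example)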
      val offsetRange = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val mapped: RDD[(String, String)] = rdd.map(record => (record.key, record.value))
      val lines = mapped.map(_._2)
      val words = lines.flatMap(_.split(" "))
      val pair = words.map(x => (x,1))
      val wordCounts = pair.reduceByKey(_+_)
      wordCounts.foreach(println)
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
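
If the job runs, each 10-second batch prints (word, count) pairs such as (7,5), meaning the digit 7 appeared 5 times in that batch; the exact numbers vary, since the producer sends random digits.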

Then run the following commands and put the lines below into simple.sbt:

cd /usr/local/spark/mycode/kafka/
vim simple.sbt
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.15"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "3.2.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.2.0"
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "2.6.0"

3. Compile and package the program

cd /usr/local/spark/mycode/kafka/
/usr/local/sbt/sbt package

4. Run the program

Start the cluster:

[atguigu@hadoop102 ~]$ myhadoop.sh start

Start ZooKeeper:

[atguigu@hadoop102 ~]$ cd /usr/local/kafka
[atguigu@hadoop102 kafka]$ bin/zookeeper-server-start.sh config/zookeeper.properties

After this command runs, it does not return to the shell prompt; ZooKeeper starts with its default configuration file. Do not close this terminal. Open a new terminal and enter:

cd /usr/local/kafka
bin/kafka-server-start.sh config/server.properties

The Kafka server is now running; do not close this terminal. Open yet another terminal and enter the following command:

Run the producer:

cd /usr/local/spark/mycode/kafka/
/usr/local/spark/bin/spark-submit --class "KafkaWordProducer" ./target/scala-2.12/simple-project_2.12-1.0.jar localhost:9092 wordsender 3 5

Open a new terminal and run the following commands to launch the KafkaWordCount program and perform the word count:

cd /usr/local/spark/mycode/kafka/
/usr/local/spark/bin/spark-submit  \
--class "KafkaWordCount"  \
./target/scala-2.12/simple-project_2.12-1.0.jar

This kept failing with errors I could not resolve; my conclusion is that it is a version conflict between the components, and I was not able to fix it properly.
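
One way to narrow down a conflict like this is to confirm that the Scala and Spark versions actually in use match the _2.12 suffix and the 3.2.0 version declared in simple.sbt, and that only one copy of kafka-clients ends up on Spark's classpath. The lines below are only a diagnostic sketch that can be pasted into spark-shell; they use nothing beyond the standard Spark/Scala APIs:

// Print the Scala and Spark versions spark-shell is actually running on.
// They should report 2.12.x and 3.2.0 to match scalaVersion and the jar names above;
// a mismatch, or duplicate kafka-clients jars under /usr/local/spark/jars, is a common
// cause of NoSuchMethodError / ClassNotFoundException at runtime.
println(scala.util.Properties.versionString)   // e.g. "version 2.12.15"
println(org.apache.spark.SPARK_VERSION)        // e.g. "3.2.0"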
