This session covers some operations for reading from advanced data sources in Spark Streaming.
Installing Kafka
Reference: https://dblab.xmu.edu.cn/blog/1096/
Next, test a simple example in an Ubuntu environment by running the following commands in order:
# Enter the directory where Kafka is installed
cd /usr/local/kafka
bin/zookeeper-server-start.sh config/zookeeper.properties
After this command runs, it does not return to the shell prompt; ZooKeeper starts serving with the default configuration file. Do not close this terminal. Open a new terminal and enter the following commands:
cd /usr/local/kafka
bin/kafka-server-start.sh config/server.properties
The Kafka server is now running; do not close this terminal either. Open another terminal and enter the following commands:
cd /usr/local/kafka
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic dblab
A topic is the category under which messages are published. The command above creates a topic named dblab with a single-node configuration. You can list all existing topics to check that the one just created is there:
bin/kafka-topics.sh --list --zookeeper localhost:2181
dblab should appear in the output.
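As a side note, the same topic could also be created programmatically through Kafka's AdminClient API instead of the CLI script. Below is a minimal sketch, assuming kafka-clients 2.6.0 is on the classpath and the broker is listening on localhost:9092:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateTopicExample {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // the AdminClient talks to the broker directly, not to ZooKeeper
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    val admin = AdminClient.create(props)
    // 1 partition, replication factor 1: the same settings as the CLI command above
    admin.createTopics(Collections.singleton(new NewTopic("dblab", 1, 1.toShort))).all().get()
    // print the existing topics, equivalent to the --list command
    println(admin.listTopics().names().get())
    admin.close()
  }
}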
Advanced input sources such as Kafka and Flume depend on separate libraries (jar files).
They are tested with import statements in spark-shell, but these classes are not included in Spark's own libraries (a quick import check is shown after the copy steps below).
You need to download the matching spark-streaming-kafka jar before Kafka can be used. The version downloaded here is spark-streaming-kafka-0-10_2.12-3.2.0, where 2.12 is the Scala version and 3.2.0 is the Spark version.
Visit the MVNREPOSITORY site (http://mvnrepository.com) and download spark-streaming-kafka-0-10_2.12-3.2.0.jar and spark-token-provider-kafka-0-10_2.12-3.2.0.jar, where 2.12 is the Scala version number and 3.2.0 is the Spark version number. Alternatively, both files can be downloaded from the "Software" folder in the "Downloads" area of the textbook's official website. Then copy the two files into Spark's jars directory (i.e. "/usr/local/spark/jars"). In addition, copy kafka-clients-2.6.0.jar from "/usr/local/kafka/libs" into the same jars directory.
After downloading, copy the files to /usr/local/spark/jars/:
[atguigu@hadoop102 ~]$ cd /usr/local/spark/jars/
[atguigu@hadoop102 jars]$ mkdir kafka
[atguigu@hadoop102 jars]$ cd
[atguigu@hadoop102 ~]$ cd 下载
[atguigu@hadoop102 下载]$ cp ./spark-streaming-kafka-0-10_2.12-3.2.0.jar /usr/local/spark/jars/kafka/
Next, copy all jar files from the libs directory under the Kafka installation directory into "/usr/local/spark/jars/kafka":
[atguigu@hadoop102 下载]$ cd /usr/local/kafka/libs/
[atguigu@hadoop102 libs]$ cp ./* /usr/local/spark/jars/kafka
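To confirm that the dependencies are now visible, a minimal check (assuming spark-shell is launched with these jars on its classpath) is to try the import that the programs below rely on; it should complete without a "not found" error:

import org.apache.spark.streaming.kafka010.KafkaUtils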
Writing a Spark Streaming program that uses the Kafka data source
Steps:
1. Write the producer program
2. Write the consumer program
3. Compile and package the program
4. Run the program
1. Write the producer program
cd /usr/local/spark/mycode
mkdir kafka
cd kafka
mkdir -p src/main/scala
cd src/main/scala
vim KafkaWordProducer.scala
Use the vim editor to create KafkaWordProducer.scala. This program generates a stream of strings: it produces random integer sequences, and each integer is treated as a word, which the KafkaWordCount program will later use for word-frequency counting. Enter the following code into KafkaWordProducer.scala:
import java.util.HashMap
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._

object KafkaWordProducer {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordProducer <metadataBrokerList> <topic> " +
        "<messagesPerSec> <wordsPerMessage>")
      System.exit(1)
    }
    val Array(brokers, topic, messagesPerSec, wordsPerMessage) = args
    // Kafka producer connection properties
    val props = new HashMap[String, Object]()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    // Send some messages
    while (true) {
      (1 to messagesPerSec.toInt).foreach { messageNum =>
        val str = (1 to wordsPerMessage.toInt)
          .map(x => scala.util.Random.nextInt(10).toString)
          .mkString(" ")
        print(str)
        println()
        val message = new ProducerRecord[String, String](topic, null, str)
        producer.send(message)
      }
      Thread.sleep(1000)
    }
  }
}
2. Write the consumer program
Save the file and exit vim. Then, still in the current directory, create the KafkaWordCount.scala code file:
vim KafkaWordCount.scala
import org.apache.spark._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")
    val ssc = new StreamingContext(sc, Seconds(10))
    // Set the checkpoint directory; if it lives on HDFS, write it as something like
    // ssc.checkpoint("/user/hadoop/checkpoint"), but then Hadoop must be running
    ssc.checkpoint("file:///usr/local/spark/mycode/kafka/checkpoint")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (true: java.lang.Boolean)
    )
    val topics = Array("wordsender")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
    stream.foreachRDD(rdd => {
      val offsetRange = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val maped: RDD[(String, String)] = rdd.map(record => (record.key, record.value))
      val lines = maped.map(_._2)
      val words = lines.flatMap(_.split(" "))
      val pair = words.map(x => (x, 1))
      val wordCounts = pair.reduceByKey(_ + _)
      wordCounts.foreach(println)
    })
    ssc.start
    ssc.awaitTermination
  }
}
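Note that the logic above counts words independently within each 10-second batch, so the printed counts reset every batch; the checkpoint directory set earlier is only strictly required once stateful operations are used. As a side sketch (not part of the submitted program, assuming the same stream object as above), a running count across batches could be kept with updateStateByKey:

val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  // add this batch's occurrences to the running total kept in the state
  Some(values.sum + state.getOrElse(0))
}
val runningCounts = stream
  .map(record => record.value)        // take the message payload
  .flatMap(_.split(" "))              // split into words
  .map(word => (word, 1))
  .updateStateByKey[Int](updateFunc)  // needs the checkpoint directory set above
runningCounts.print()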
Then run the following commands:
cd /usr/local/spark/mycode/kafka/
vim simple.sbt
name := "Simple Project" version := "1.0" scalaVersion := "2.12.15" libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.0" libraryDependencies += "org.apache.spark" %% "spark-streaming" % "3.2.0" % "provided" libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.2.0" libraryDependencies += "org.apache.kafka" % "kafka-clients" % "2.6.0"
3. Compile and package the program
cd /usr/local/spark/mycode/kafka/
/usr/local/sbt/sbt package
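If packaging succeeds, the output jar should appear at ./target/scala-2.12/simple-project_2.12-1.0.jar, which is the path used by the spark-submit commands below.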
4. Run the program
Start the cluster:
[atguigu@hadoop102 ~]$ myhadoop.sh start
Start ZooKeeper:
[atguigu@hadoop102 ~]$ cd /usr/local/kafka
[atguigu@hadoop102 kafka]$ bin/zookeeper-server-start.sh config/zookeeper.properties
After this command runs, it does not return to the shell prompt; ZooKeeper starts serving with the default configuration file. Do not close this terminal. Open a new terminal and enter the following commands:
cd /usr/local/kafka
bin/kafka-server-start.sh config/server.properties
The Kafka server is now running; do not close this terminal. Open another terminal and enter the following commands.
Run the producer:
/usr/local/spark/bin/spark-submit --class "KafkaWordProducer" ./target/scala-2.12/simple-project_2.12-1.0.jar localhost:9092 wordsender 3 5
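The four arguments after the jar correspond to the producer's usage string <metadataBrokerList> <topic> <messagesPerSec> <wordsPerMessage>: it connects to the broker at localhost:9092, publishes to the wordsender topic (the one the consumer subscribes to), and every second sends 3 messages of 5 random digits each.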
Open a new terminal and run the following commands to launch the KafkaWordCount program and perform the word-frequency count:
cd /usr/local/spark/mycode/kafka/
/usr/local/spark/bin/spark-submit \
  --class "KafkaWordCount" \
  ./target/scala-2.12/simple-project_2.12-1.0.jar
It kept failing with errors that I could not resolve; my judgment is that they come from version conflicts between the components, and I have not yet found a good fix.