Consuming Kafka from Spark

Posted on 2023-06-05 20:54  Capterlliar

0. Preface

I had already written the Spark job that processes the data, tested it on a batch read from files, and it produced results. Today I got Kafka working and wired the two together, and there was no output at all. After digging for half the day, it turned out the earlier processing code was the problem: a not-equals had been typed as an equals, so the filter dropped every record. Terrifying. I swear I hadn't touched that code in the meantime, but last time it really did produce output (screams).

1. Setup

On the master node, start ZooKeeper, YARN, HDFS, Kafka, and Spark.

On the slave nodes, start ZooKeeper and Kafka.

(There is some redundancy here.)
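For reference, on a typical install the start commands look roughly like this (script locations and options vary by distribution; adjust to your own setup):

$ zkServer.sh start                                                     # ZooKeeper, on master and slaves
$ start-dfs.sh                                                          # HDFS, on the master
$ start-yarn.sh                                                         # YARN, on the master
$ kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties   # Kafka broker, on every broker node
$ $SPARK_HOME/sbin/start-all.sh                                         # Spark standalone master and workers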

Start a producer on cluster1: $ kafka-console-producer.sh --broker-list localhost:9092 --topic mykafka

On cluster2, start a consumer for debugging: $ kafka-console-consumer.sh --zookeeper cluster1:2181,cluster2:2181,cluster3:2181 --topic mykafka --from-beginning

Now send a message from cluster1 and it should show up on cluster2's screen.

The Spark job (Scala):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Direct {
  val topic = "mykafka"
  val brokerList = "cluster1:9092,cluster2:9092,cluster3:9092"
  val groupId = "test-consumer-group"

  def main(args: Array[String]): Unit = {
    val topics = topic.split(",").toSet
    val conf = new SparkConf()
      .setAppName("Direct")
      .setMaster("local[*]")
        // local[*] runs locally with as many threads as there are cores
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val batchInterval = Seconds(5)
    val kafkaParams: Map[String, String] = Map[String, String](
      "metadata.broker.list" -> brokerList,
      "group.id" -> groupId,
        // start consuming from the earliest available offset
      "auto.offset.reset" -> "smallest"
    )
    val ssc = new StreamingContext(conf, batchInterval)
    val input: InputDStream[(String, String)] = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder](ssc,
        kafkaParams, topics)

    // ... process input here; input is an InputDStream of (key, message) pairs (see the sketch after this block)

    ssc.start()
    ssc.awaitTermination()
  }
}
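
As an illustration of the processing step, here is a minimal word-count sketch (made up for this post, not the original job); it would go where the "process input" placeholder is, before ssc.start():

    // input is a DStream of (key, value) pairs; keep only the message value
    val lines: DStream[String] = input.map(_._2)
    // toy example: count words in each 5-second batch and print a few results
    lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()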
    

The plain Java consumer (old high-level consumer API):

import java.util.*;

import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.javaapi.consumer.ZookeeperConsumerConnector;
import kafka.message.MessageAndMetadata;

public class KafkaConsumerTest {
    public static void main(String[] args) {
        Properties props = new Properties();
        // ZooKeeper connection string
        props.put("zookeeper.connect", "cluster1:2181,cluster2:2181,cluster3:2181");
        // consumer group id
        props.put("group.id", "test-consumer-group");
        // smallest: consume from the beginning
        // largest : consume only messages produced after the consumer starts
        props.put("auto.offset.reset", "smallest");

        ConsumerConfig conf = new ConsumerConfig(props);
        ConsumerConnector consumer = new ZookeeperConsumerConnector(conf);
        Map<String, Integer> topicStreams = new HashMap<String, Integer>();
        // the value is how many streams to return; one stream per topic partition is a reasonable choice
        topicStreams.put("mykafka", 1);
        Map<String, List<KafkaStream<byte[], byte[]>>> messageStreamsMap = consumer.createMessageStreams(topicStreams);
        List<KafkaStream<byte[], byte[]>> messageStreams = messageStreamsMap.get("mykafka");
        for(final KafkaStream<byte[], byte[]> kafkaStream : messageStreams){
            new Thread(new Runnable() {
                public void run() {
                    ArrayList<String> input = new ArrayList<>();   // buffer for the messages read by this thread
                    for(MessageAndMetadata<byte[], byte[]> mm : kafkaStream){
                        String msg = new String(mm.message());
                        // ... process msg here
                    }

                }

            }).start();

        }
    }
}

Note that one configuration talks to ZooKeeper on port 2181, while the other talks to the Kafka brokers on port 9092. Don't mix the two up, or you'll see errors like:

Received -1 when reading from channel, socket has likely been closed.

Package the code into a jar, and make sure it includes its dependencies: neither Spark nor Kafka ships with spark-streaming-kafka. Alternatively, install the corresponding jars on the cluster.

Run the Java code: java -cp <jar> KafkaConsumerTest

Run the Spark code: spark-submit --class Direct <jar>

Dependencies:

<dependencies>
        <!-- Scala dependency -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.10.6</version>
        </dependency>
        <!-- Spark dependencies -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_${scala.version}</artifactId>
            <version>2.0.0-preview</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- MySQL JDBC driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.43</version>
        </dependency>
</dependencies>

Plugins:

<plugins>
            <!-- required for compiling Scala -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>

The first is the plugin Scala compilation needs; the second builds the jar-with-dependencies.

 

2. Consumer group id

The configuration includes a group id, which can be found on that node in ${KafkaHome}/config/consumer.properties.
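
In a stock distribution that file looks roughly like this (the values below are typical defaults; adjust zookeeper.connect to your own cluster):

# ${KafkaHome}/config/consumer.properties
zookeeper.connect=cluster1:2181,cluster2:2181,cluster3:2181
# consumer group id
group.id=test-consumer-group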

What the group is for: a topic has many partitions, and the messages are spread across them. A consumer group keeps track of which consumer reads which partitions and of the offsets consumed, so that the members of the group, taken together, cover the whole topic.

 

3. Scala object

All methods in an object are effectively static methods that can be called directly; correspondingly, a Scala class has no static keyword.

An object and a class with the same name form a companion pair: they can access each other's private members, and the companion object can expose "static" methods for the class. It sounds a bit odd, but it is quite handy in practice.

Kotlin has a similar design (companion objects).
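
A tiny illustration (the names here are made up):

class Account(private val balance: Int)

object Account {                                  // companion object: same name, same file
  def empty: Account = new Account(0)             // used like a static factory method
  def describe(a: Account): String =
    s"balance=${a.balance}"                       // companions may read each other's private members
}

// e.g. Account.describe(Account.empty) returns "balance=0"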

 

4. DStream, InputDStream and RDD

Reference: Converting between DStream, RDD and DataFrame, and why Spark is faster than MapReduce - 赤兔胭脂小吕布 - 博客园 (cnblogs.com)

About DStream: a DStream is a sequence of RDDs; a few arrive every batch interval, which matches how the data is actually generated.

Each receiver-based InputDStream corresponds to one receiver (the direct approach used above does not create receivers).
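
A quick sketch of dropping from a DStream down to its RDDs (ds here is a hypothetical DStream[String], not from the code above):

ds.foreachRDD { rdd =>
  // runs once per batch interval; rdd is a plain RDD and supports the usual RDD API
  println(s"records in this batch: ${rdd.count()}")
}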

 

5. The Maven scope tag

I ran into this because bundling every dependency made the jar huge, 144.8 MB. I wanted to control which dependencies get packaged, but didn't get it working; something to learn later (one possible approach is sketched after the link below).

Reference: Maven scope explained (Maven之scope详解) - satire - 博客园 (cnblogs.com)
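
One approach I have not verified here: mark the artifacts that already exist on the cluster as provided, so they are compiled against but left out of the jar-with-dependencies, e.g.:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.version}</artifactId>
    <version>${spark.version}</version>
    <!-- provided: available at compile time, but not bundled into the fat jar -->
    <scope>provided</scope>
</dependency>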

 

6. The relationship between Kafka and ZooKeeper

Older versions of Kafka depend on ZooKeeper; from v2.8 onward it can run without it (KRaft mode).

Reference: The difference between Kafka's zookeeper and bootstrap-server parameters - Clotho_Lee - 博客园 (cnblogs.com)

Reference: What exactly does ZooKeeper do in Kafka? - 知乎 (zhihu.com)