Following up on the previous post, "Calculating TPS with Flink SQL"

Recap of the problem: every hour, compute the most recent 1-minute TPS at every 10-second mark within the last hour.

Clearly, Flink does not support this kind of triple time window out of the box: neither Flink SQL nor the regular Stream API windows can express it. But the Stream API does let us drop down to the lower-level process method and implement it ourselves.

Let's look at my implementation.

Main class: LateTps

Overview of the main steps:

  1. Kafka source
  2. map function: extract user_id and ts from the value of each KafkaSimpleStringRecord (only ts is actually used here)
  3. assignTimestampsAndWatermarks: assigns timestamps and a fixed-delay watermark
  4. process10m window: computes the per-minute TPS within the window
  5. process10s window: computes the most recent 1-minute TPS at every fixed mark within the window, e.g. over a 10-minute window, the last-1-minute TPS at every 10-second mark
  6. Kafka sink
  • The difference between 4 and 5: in 4 the interval is a whole minute, so each record belongs to exactly one interval; in 5 the interval is configurable, so a record may belong to several overlapping intervals

object LateTps {

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val topic = "user_log"
    val bootstrapServer = "localhost:9092"
    // window size in seconds
    val windowSize: Int = 10 * 60
    // tps output interval in seconds (the results below were produced with 10)
    val intervalSize: Int = 10

    // kafka source for reading data
    val kafkaSource = KafkaSource
      .builder[KafkaSimpleStringRecord]()
      .setTopics(topic)
      .setBootstrapServers(bootstrapServer)
      .setGroupId("late_tps")
      .setStartingOffsets(OffsetsInitializer.latest())
      .setDeserializer(new SimpleKafkaRecordDeserializationSchema())
      .build()

    // add source
    val source = env
      .fromSource(kafkaSource, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)), "kafkaSource")

    // parse data, only get (user_id, ts)
    val stream = source
      .map(new RichMapFunction[KafkaSimpleStringRecord, (String, Long)] {
        var jsonParse: JsonParser = _
        override def open(parameters: Configuration): Unit = {
          jsonParse = new JsonParser
        }
        override def map(element: KafkaSimpleStringRecord): (String, Long) = {

          val json = jsonParse.parse(element.getValue).getAsJsonObject
          val tsStr = json.get("ts").getAsString
          val ts = DateTimeUtil.parse(tsStr).getTime
          val userId = json.get("user_id").getAsString

          (userId, ts)
        }
        override def close(): Unit = {
          jsonParse = null

        }
      })
      // set timestamp and watermark
      .assignTimestampsAndWatermarks(WatermarkStrategy
        .forBoundedOutOfOrderness[(String, Long)](Duration.ofSeconds(5))
        .withTimestampAssigner(new SerializableTimestampAssigner[(String, Long)] {
          override def extractTimestamp(t: (String, Long), l: Long): Long = {
            t._2
          }
        })
        // idle 1 minute
        .withIdleness(Duration.ofMinutes(1))
      )


    // windowSize 10 minutes: emit the per-minute tps
    stream
      .windowAll(TumblingEventTimeWindows.of(Time.seconds(windowSize)))
      .process(new FixedLateTpsProcessAllWindowFunction(windowSize, 60))
      .print("10m")
    // windowSize 10 minutes: emit the last-1-minute tps at every intervalSize seconds
    val process10s = stream
      .windowAll(TumblingEventTimeWindows.of(Time.seconds(windowSize)))
      .process(new AdjustLateTpsProcessAllWindowFunction(windowSize, intervalSize))

    process10s.print("10s")

    val tag = new OutputTag[String]("size")
    val side = process10s.getSideOutput(tag)

    // side tmp result to kafka
    val kafkaSink = KafkaSink.builder[String]()
      .setBootstrapServers(bootstrapServer)
      .setRecordSerializer(KafkaRecordSerializationSchema.builder[String]()
        .setTopic(topic +"_side_sink")
        .setValueSerializationSchema(new SimpleStringSchema())
        .build()
      )
      .build()

    // add sink
    side.sinkTo(kafkaSink)

    // execute task
    env.execute("LateTps")
  }

}


process10m window: per-minute TPS within the window

Computing the per-minute TPS is the simple case, because every record belongs to exactly one minute bucket: the bucket index is (ts / 1000 % windowSize) / 60. We just sum the record count of each minute, and finally store the last minute's TPS in window state, because at the next window's first mark (mark 0) the "most recent 1 minute" is exactly the last minute of the current window.
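
To make the bucket math concrete, here is a tiny sketch of the bucket formula (a hypothetical helper, not part of the job code), assuming epoch-aligned 600-second windows and 60-second buckets:

// minute bucket of a record inside a 600s tumbling window
// e.g. a record at 21:03:25 sits 205 seconds into its window, and 205 / 60 = bucket 3
def bucketOf(tsMillis: Long, windowSize: Int = 600, bucketSize: Int = 60): Int =
  ((tsMillis / 1000 % windowSize) / bucketSize).toInt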

class FixedLateTpsProcessAllWindowFunction(windowSize: Int, intervalSize: Int) extends ProcessAllWindowFunction[(String, Long), (String, String, Int, Double), TimeWindow] {

  // tps of the previous window's last minute
  var lastWindow: ValueState[Double] = _
  var interval: Int = _

  override def open(parameters: Configuration): Unit = {
    lastWindow = getRuntimeContext.getState(new ValueStateDescriptor[Double]("last", classOf[Double]))
    interval = windowSize / intervalSize
  }

  override def process(context: Context, elements: Iterable[(String, Long)], out: Collector[(String, String, Int, Double)]): Unit = {

    // get window
    val windowStart = DateTimeUtil.formatMillis(context.window.getStart, DateTimeUtil.YYYY_MM_DD_HH_MM_SS)
    val windowEnd = DateTimeUtil.formatMillis(context.window.getEnd, DateTimeUtil.YYYY_MM_DD_HH_MM_SS)
    // ValueState[Double] yields 0.0 when empty (Scala unboxes null to 0.0)
    val lastWindowCount = lastWindow.value()

    // init the per-minute count map, one slot per minute bucket
    val map = new util.HashMap[Int, Long]()
    for (i <- 0 until interval) {
      map.put(i, 0)
    }

    // for each element, find its minute bucket within the window
    elements.foreach((e: (String, Long)) => {
      val current: Int = ((e._2 / 1000 % windowSize) / intervalSize).toInt
      map.put(current, map.get(current) + 1)
    })

    // at mark 0 the most recent minute is the previous window's last minute
    out.collect((windowStart, windowEnd, 0, lastWindowCount))
    // at mark i the most recent minute is bucket i - 1
    for (i <- 1 until interval) {
      out.collect((windowStart, windowEnd, i, map.get(i - 1) / 60.0))
    }

    // keep the last minute's tps as the next window's mark-0 output
    lastWindow.update(map.get(interval - 1) / 60.0)

  }

  override def close(): Unit = {
    lastWindow.clear()
  }

}


process10s window: the most recent 1-minute TPS at every fixed mark within the window

This one is slightly trickier than the previous case, because each record belongs to several intervals. With the interval of 10 seconds used in the code (window length 10 minutes), every record belongs to 6 intervals. For example:

Splitting the 10 minutes of data into 600 seconds:
a record at second 61 belongs to these 6 windows: 10-70, 20-80, 30-90, 40-100, 50-110, 60-120

There is one more wrinkle: the first 6 intervals of each window (-60-0, -50-10, -40-20, -30-30, -20-40, -10-50) partially overlap the previous window, so their counts combine the previous window's tail with the part that falls into the current window.
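
To see the overlap concretely, here is a small sketch (a hypothetical helper, not in the job code) that lists the output windows containing a given second, assuming the 600s window and 10s marks used above:

// output window i covers the minute [i * intervalSize - 60, i * intervalSize)
def windowsFor(second: Int, windowSize: Int = 600, intervalSize: Int = 10): Seq[Int] =
  (0 until windowSize / intervalSize).filter { i =>
    val end = i * intervalSize
    second >= end - 60 && second < end
  }

// windowsFor(61) == Vector(7, 8, 9, 10, 11, 12), i.e. the windows ending at 70, 80, ..., 120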


/**
 * Flexible output interval: a record may fall into several overlapping intervals
 * @param windowSize window length in seconds
 * @param intervalSize output interval in seconds
 */
class AdjustLateTpsProcessAllWindowFunction(windowSize: Int, intervalSize: Int) extends ProcessAllWindowFunction[(String, Long), (String, String, Int, String, Double), TimeWindow] {

  val LOG = LoggerFactory.getLogger("AdjustLateTpsProcessAllWindowFunction")
  // per-second counts of the previous window's last minute
  var lastWindow: ValueState[util.HashMap[Int, Long]] = _
  var interval: Int = _
  var tag: OutputTag[String] = _


  override def open(parameters: Configuration): Unit = {
    lastWindow = getRuntimeContext.getState(new ValueStateDescriptor[util.HashMap[Int, Long]]("last", classOf[util.HashMap[Int, Long]]))

    // one count slot per second of the window
    interval = windowSize

    tag = new OutputTag[String]("size")
  }

  override def process(context: Context, elements: Iterable[(String, Long)], out: Collector[(String, String, Int, String, Double)]): Unit = {

    // get window
    val windowStart = DateTimeUtil.formatMillis(context.window.getStart, DateTimeUtil.YYYY_MM_DD_HH_MM_SS)
    val windowEnd = DateTimeUtil.formatMillis(context.window.getEnd, DateTimeUtil.YYYY_MM_DD_HH_MM_SS)
    // state holding the previous window's last-minute counts
    var lastWindowStateMap = lastWindow.value()
    // initialize as zeros for the first window
    if (lastWindowStateMap == null) {
      lastWindowStateMap = initLastWindowState
    }

    // per-second counts for the current window: keys 0 until interval (0-600 here)
    val currentWindowMap = new util.HashMap[Int, Long]()
    // last 60 seconds of the current window, carried over to the next window
    val nextWindowMap = new util.HashMap[Int, Long]()
    for (i <- 0 until interval) {
      currentWindowMap.put(i, 0)
    }
    for (i <- interval - 60 until interval) {
      nextWindowMap.put(i, 0)
    }

    elements.foreach((e: (String, Long)) => {
      // second offset of each record within the window
      val current: Int = (e._2 / 1000 % interval).toInt
      currentWindowMap.put(current, currentWindowMap.get(current) + 1)
    })
    // load next window data
    for(i <- interval - 60 until interval){
      nextWindowMap.put(i, currentWindowMap.get(i))
    }

    // TODO: temporary debug output to the side stream
    currentWindowMap.forEach((a: Int, b: Long) => {
      context.output(tag, windowStart + ", 1," + a + "," + b)
    })
    nextWindowMap.forEach((a: Int, b: Long) => {
      context.output(tag, windowStart + ", 2," + a + "," + b)
    })

    // calculate every output window
    for (window <- 0 until interval / intervalSize) {
      // time range of this output window: the minute ending at window * intervalSize
      val (start, end) = calWindowStartEnd(window)
      var size = 0L
      for (j <- start until end) {
        // seconds before 0 come from the previous window's tail
        if (j < 0) {
          size += lastWindowStateMap.get(interval + j)
        }
        if (currentWindowMap.containsKey(j)) {
          size += currentWindowMap.get(j)
        }
      }
      out.collect((windowStart, windowEnd, window, start + "-" + end, size / 60.0))
    }
    // carry the last minute's counts over to the next window
    lastWindow.update(nextWindowMap)
  }

  // init last window state as zeros; keys match nextWindowMap (the previous window's last 60 seconds)
  private def initLastWindowState: util.HashMap[Int, Long] = {
    val map = new util.HashMap[Int, Long]()
    for (i <- interval - 60 until interval) {
      map.put(i, 0)
    }
    map
  }

  // calculate window start and end
  def calWindowStartEnd(i: Int): (Int, Int) = {
    val end = i * intervalSize
    val start = end - 60
    (start, end)
  }

  override def close(): Unit = {
    lastWindow.clear()
  }

}

Test data

Test data is generated straight into Kafka at roughly 100 records per second, which makes the results easy to verify:

send topic : user_log, size : 5000, message : {"next_page":"03794db5-97bb-4631-b56f-40546eb67505","category_id":10,"user_id":"\t9909210000","item_id":"1010017","price":26.958991852428383,"last_page":"f6c94247-9af5-46e0-a9f7-9d1b9417a302","page":"c88199fc-9abd-4001-9937-2568a13be9a1","position":"710daa6c-ecb1-4826-8c80-e7efa37649ca","sort":"917acf27-2189-4c9f-a823-e4b0fa4c553b","behavior":"pv","ts":"2023-02-17 21:00:50.000"}
send topic : user_log, size : 10000, message : {"next_page":"220c8dd1-f75c-45d5-ac6f-ad2bfab3d829","category_id":10,"user_id":"\t3375150000","item_id":"1010017","price":19.87789592108874,"last_page":"716bb58a-8aaa-40b5-86ed-315bb4714f67","page":"5f278b9f-638e-4f21-b614-1da8e556faee","position":"d010e4a4-c555-4dfc-b9dd-f69ecaaf8528","sort":"a80671a3-1792-4785-b984-f8e2bbf95d38","behavior":"cart","ts":"2023-02-17 21:01:40.000"}
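
The generator itself is not shown in the post; below is a minimal sketch of one (names and fields are assumptions, using the plain kafka-clients producer and hand-built JSON strings) that writes records carrying the current ts at roughly 100 per second:

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.util.{Properties, UUID}

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// hypothetical test-data generator: ~100 messages per second into the user_log topic
object UserLogGenerator {

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    val tsFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
    var count = 0L
    while (true) {
      // send a burst of 100 records, then sleep one second
      for (_ <- 0 until 100) {
        val ts = LocalDateTime.now().format(tsFormat)
        val userId = (9900000000L + count % 10000000L).toString
        val message =
          s"""{"category_id":10,"user_id":"$userId","item_id":"1010017",""" +
            s""""page":"${UUID.randomUUID()}","behavior":"pv","ts":"$ts"}"""
        producer.send(new ProducerRecord[String, String]("user_log", message))
        count += 1
      }
      producer.flush()
      Thread.sleep(1000)
    }
  }
}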

Results

A few records from two adjacent windows. Note that the first marks of the second window (-60-0, -50-10, ...) draw part of their counts from the previous window's tail, which was carried over in state:


# last few records of the first window
10s> (2023-02-17 21:10:00,2023-02-17 21:20:00,56,500-560,101.05)
10s> (2023-02-17 21:10:00,2023-02-17 21:20:00,57,510-570,101.05)
10s> (2023-02-17 21:10:00,2023-02-17 21:20:00,58,520-580,101.05)
10s> (2023-02-17 21:10:00,2023-02-17 21:20:00,59,530-590,101.05)

# first few records of the second window
10s> (2023-02-17 21:20:00,2023-02-17 21:30:00,0,-60-0,101.05)
10s> (2023-02-17 21:20:00,2023-02-17 21:30:00,1,-50-10,101.05)
10s> (2023-02-17 21:20:00,2023-02-17 21:30:00,2,-40-20,100.0)
10s> (2023-02-17 21:20:00,2023-02-17 21:30:00,3,-30-30,100.0)

Full code: see the flink-rookie repo on GitHub.

Feel free to follow the Flink 菜鸟 WeChat official account for occasional posts on Flink development.
