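This post continues the previous one, "Computing TPS with Flink SQL".
Recap of the problem: once an hour, use Flink SQL to compute, for every 10 seconds of the past hour, the TPS over the trailing 1 minute.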
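To pin the target quantity down, here is one way to write it. The notation is mine rather than the original post's: $N(a, b)$ is the number of events whose timestamp falls in $[a, b)$, and $t_0$ is the start of the hour being reported.

```latex
\mathrm{TPS}(t_k) = \frac{N(t_k - 60\,\mathrm{s},\ t_k)}{60},
\qquad t_k = t_0 + 10k\ \mathrm{s},
\qquad k = 0, 1, \dots, 359
```

Under this reading each hourly run emits 360 values, and the first six of them reach back into the previous hour.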
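Flink's built-in windows do not support this kind of three-level time window, neither in SQL nor in the DataStream API. The DataStream API does, however, let you call the lower-level process function and implement the logic yourself.
Here is the code I wrote.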
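Main class LateTps
What the main steps do:
- kafka source
- map: extracts user_id and ts from the value of each KafkaSimpleStringRecord (only ts is actually used)
- assignTimestampsAndWatermarks: assigns timestamps and bounded-out-of-orderness (fixed delay) watermarks
- process10m window: within a given time range, computes the TPS of each minute
- process10s window: within a given time range, computes the trailing 1-minute TPS at every fixed interval. Example: every 10 minutes, compute the trailing 1-minute TPS at every 10 seconds
- kafka sink
The difference between process10m and process10s: in process10m the interval is a whole minute, so each record belongs to exactly one interval; in process10s the interval is configurable and the 1-minute ranges overlap, so a record can belong to several of them.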
import java.time.Duration

import com.google.gson.JsonParser
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.configuration.Configuration
import org.apache.flink.connector.kafka.sink.{KafkaRecordSerializationSchema, KafkaSink}
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
// KafkaSimpleStringRecord, SimpleKafkaRecordDeserializationSchema and DateTimeUtil
// are helper classes from the flink-rookie repository and must be imported from there.

object LateTps {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val topic = "user_log"
    val bootstrapServer = "localhost:9092"
    // window size in seconds (10 minutes)
    val windowSize: Int = 10 * 60
    // interval between TPS points in seconds (10, matching the text and the output shown below)
    val intervalSize: Int = 10

    // kafka source for reading data
    val kafkaSource = KafkaSource
      .builder[KafkaSimpleStringRecord]()
      .setTopics(topic)
      .setBootstrapServers(bootstrapServer)
      .setGroupId("late_tps")
      .setStartingOffsets(OffsetsInitializer.latest())
      .setDeserializer(new SimpleKafkaRecordDeserializationSchema())
      .build()

    // add source
    val source = env
      .fromSource(kafkaSource, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)), "kafkaSource")

    // parse data, keep only (user_id, ts)
    val stream = source
      .map(new RichMapFunction[KafkaSimpleStringRecord, (String, Long)] {
        var jsonParse: JsonParser = _

        override def open(parameters: Configuration): Unit = {
          jsonParse = new JsonParser
        }

        override def map(element: KafkaSimpleStringRecord): (String, Long) = {
          val json = jsonParse.parse(element.getValue).getAsJsonObject
          val tsStr = json.get("ts").getAsString
          val ts = DateTimeUtil.parse(tsStr).getTime
          val userId = json.get("user_id").getAsString
          (userId, ts)
        }

        override def close(): Unit = {
          jsonParse = null
        }
      })
      // set event timestamp and watermark (5 seconds of allowed out-of-orderness)
      .assignTimestampsAndWatermarks(WatermarkStrategy
        .forBoundedOutOfOrderness[(String, Long)](Duration.ofSeconds(5))
        .withTimestampAssigner(new SerializableTimestampAssigner[(String, Long)] {
          override def extractTimestamp(t: (String, Long), l: Long): Long = {
            t._2
          }
        })
        // mark the source idle after 1 minute without data
        .withIdleness(Duration.ofMinutes(1))
      )

    // 10-minute window, emit the TPS of every minute
    val process10m = stream
      .windowAll(TumblingEventTimeWindows.of(Time.seconds(windowSize)))
      .process(new FixedLateTpsProcessAllWindowFunction(windowSize, 60))
      .print("10m")

    // 10-minute window, emit the trailing 1-minute TPS every intervalSize seconds
    val process10s = stream
      .windowAll(TumblingEventTimeWindows.of(Time.seconds(windowSize)))
      .process(new AdjustLateTpsProcessAllWindowFunction(windowSize, intervalSize))
    process10s.print("10s")

    // side output carrying the intermediate per-second counts
    val tag = new OutputTag[String]("size")
    val side = process10s.getSideOutput(tag)

    // write the intermediate side output to kafka
    val kafkaSink = KafkaSink.builder[String]()
      .setBootstrapServers(bootstrapServer)
      .setRecordSerializer(KafkaRecordSerializationSchema.builder[String]()
        .setTopic(topic + "_side_sink")
        .setValueSerializationSchema(new SimpleStringSchema())
        .build()
      )
      .build()

    // add sink
    side.sinkTo(kafkaSink)

    // execute task
    env.execute("LateTps")
  }
}
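process10m window: within a given time range, compute the TPS of each minute
Computing the per-minute TPS is fairly simple, because every record belongs to exactly one interval. The interval index is the record's offset into the window, in seconds, divided by the interval length: (ts - windowStart) / 1000 / 60. That gives the minute each record falls into, and the records of each minute are simply counted. Finally, the value of the last minute is stored in window state, because that minute is reported as point 0 of the next window.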
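The bucketing and carry-over described above can be sketched in plain Scala without Flink. This is a minimal illustration under my own assumptions, not the post's implementation: the names `FixedBuckets`, `bucketIndex` and `windowTps` are hypothetical, and each event is given as its offset in seconds from the window start.

```scala
object FixedBuckets {
  /** Index of the fixed bucket that a second-of-window falls into. */
  def bucketIndex(secondInWindow: Int, bucketSeconds: Int): Int =
    secondInWindow / bucketSeconds

  /**
   * Per-bucket TPS for one window.
   * Returns (points, carry): points(0) is the carry from the previous window,
   * points(i + 1) is the TPS of bucket i, and carry is the TPS of the last
   * bucket, to be reported as point 0 of the next window.
   */
  def windowTps(seconds: Seq[Int],
                windowSeconds: Int,
                bucketSeconds: Int,
                previousCarry: Double): (Vector[Double], Double) = {
    val buckets = windowSeconds / bucketSeconds
    val counts = Array.fill(buckets)(0L)
    // every event lands in exactly one bucket
    seconds.foreach(s => counts(bucketIndex(s, bucketSeconds)) += 1)
    val tps = counts.map(_ / bucketSeconds.toDouble)
    (previousCarry +: tps.init.toVector, tps.last)
  }

  def main(args: Array[String]): Unit = {
    // hypothetical input: 2 events in every second of a 10-minute window
    val seconds = (0 until 600).flatMap(s => Seq(s, s))
    val (points, carry) = windowTps(seconds, 600, 60, previousCarry = 0.0)
    println(points.mkString(", "))
    println(carry)
  }
}
```

With a first window (no carry yet), point 0 is 0.0 and the remaining nine points equal the per-second rate, while the tenth minute is held back for the next window.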
import java.util

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.function.ProcessAllWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

class FixedLateTpsProcessAllWindowFunction(windowSize: Int, intervalSize: Int) extends ProcessAllWindowFunction[(String, Long), (String, String, Int, Double), TimeWindow] {
  // TPS of the last interval of the previous window
  var lastWindow: ValueState[Double] = _
  // number of intervals per window
  var interval: Int = _

  override def open(parameters: Configuration): Unit = {
    lastWindow = getRuntimeContext.getState(new ValueStateDescriptor[Double]("last", classOf[Double]))
    interval = windowSize / intervalSize
  }

  override def process(context: Context, elements: Iterable[(String, Long)], out: Collector[(String, String, Int, Double)]): Unit = {
    // get window bounds
    val windowStart = DateTimeUtil.formatMillis(context.window.getStart, DateTimeUtil.YYYY_MM_DD_HH_MM_SS)
    val windowEnd = DateTimeUtil.formatMillis(context.window.getEnd, DateTimeUtil.YYYY_MM_DD_HH_MM_SS)
    // unset state comes back as null, which Scala unboxes to 0.0
    val lastWindowCount: Double = lastWindow.value()

    // init one counter per interval (the original put key 0 repeatedly)
    val map = new util.HashMap[Int, Long]()
    for (i <- 0 until interval) {
      map.put(i, 0L)
    }

    // count each element in the interval it falls into
    elements.foreach((e: (String, Long)) => {
      // offset into the window in seconds, divided by the interval length
      // (the original used ts / 1000 % interval, which is the second modulo 10, not the minute)
      val current: Int = ((e._2 - context.window.getStart) / 1000 / intervalSize).toInt
      map.put(current, map.get(current) + 1)
    })

    // point 0 is the last interval of the previous window
    out.collect((windowStart, windowEnd, 0, lastWindowCount))
    // point i + 1 is the TPS of interval i of this window
    // (the original read map.get(i + 1), which skipped interval 0 and emitted the last interval twice)
    for (i <- 0 until interval - 1) {
      out.collect((windowStart, windowEnd, i + 1, map.get(i) / intervalSize.toDouble))
    }

    // keep the last interval as point 0 of the next window
    lastWindow.update(map.get(interval - 1) / intervalSize.toDouble)
  }

  override def close(): Unit = {
    lastWindow.clear()
  }
}
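process10s window: within a given time range, compute the trailing 1-minute TPS at every fixed interval
This one is slightly more involved than the previous one, because each record belongs to several intervals. With a 10-second interval (and a 10-minute window), as in the code, each record belongs to 6 intervals. For example:
Split the 10 minutes of data into 600 seconds:
A record at second 61 belongs to these 6 ranges: 10-70, 20-80, 30-90, 40-100, 50-110, 60-120
In addition, the first 6 ranges of every window, -60-0, -50-10, -40-20, -30-30, -20-40 and -10-50, take part (or all) of their data from the previous window, to which the part that falls in the current window is added.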
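The overlap rule and the carry-over from the previous window can be checked with a small plain-Scala sketch. It is only an illustration under stated assumptions, not the post's code: `TrailingIntervals`, `intervalsContaining` and `trailingTps` are hypothetical names, ranges are half-open `[start, end)`, and the range length is assumed to be a multiple of the step.

```scala
object TrailingIntervals {
  /** All [start, end) ranges of length `span`, ending on multiples of `step`, that contain `second`. */
  def intervalsContaining(second: Int, step: Int, span: Int): Seq[(Int, Int)] = {
    // smallest multiple of step that is strictly greater than second
    val firstEnd = (second / step + 1) * step
    (firstEnd until firstEnd + span by step).map(end => (end - span, end))
  }

  /**
   * Trailing TPS at every `step` seconds of one window.
   * `previousTail` holds the per-second counts of the last `span` seconds of the
   * previous window, `current` the per-second counts of this window.
   */
  def trailingTps(previousTail: Vector[Long],
                  current: Vector[Long],
                  step: Int,
                  span: Int): Vector[((Int, Int), Double)] = {
    require(previousTail.length == span, "previousTail must cover exactly span seconds")
    // index 0 of `joined` corresponds to second -span of the current window
    val joined = previousTail ++ current
    (0 until current.length / step).map { i =>
      val end = i * step
      val start = end - span
      val count = joined.slice(start + span, end + span).sum
      ((start, end), count / span.toDouble)
    }.toVector
  }

  def main(args: Array[String]): Unit = {
    println(intervalsContaining(61, 10, 60).mkString(", "))
    // hypothetical data: empty previous window, 100 events per second in this one
    val result = trailingTps(Vector.fill(60)(0L), Vector.fill(600)(100L), 10, 60)
    result.take(8).foreach(println)
  }
}
```

In this sketch the ranges starting before second 0 ramp up from the carried-over tail, and from range 0-60 onward every point depends only on the current window.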
import java.util

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessAllWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import org.slf4j.LoggerFactory

/**
 * Trailing 1-minute TPS emitted at a configurable interval
 * @param windowSize window length in seconds
 * @param intervalSize interval between two TPS points in seconds
 */
class AdjustLateTpsProcessAllWindowFunction(windowSize: Int, intervalSize: Int) extends ProcessAllWindowFunction[(String, Long), (String, String, Int, String, Double), TimeWindow] {
  val LOG = LoggerFactory.getLogger(classOf[AdjustLateTpsProcessAllWindowFunction])
  // per-second counts of the last 60 seconds of the previous window
  var lastWindow: ValueState[util.HashMap[Int, Long]] = _
  // window length in seconds
  var interval: Int = _
  var tag: OutputTag[String] = _

  override def open(parameters: Configuration): Unit = {
    lastWindow = getRuntimeContext.getState(new ValueStateDescriptor[util.HashMap[Int, Long]]("last", classOf[util.HashMap[Int, Long]]))
    interval = windowSize
    tag = new OutputTag[String]("size")
  }

  override def process(context: Context, elements: Iterable[(String, Long)], out: Collector[(String, String, Int, String, Double)]): Unit = {
    // get window bounds
    val windowStart = DateTimeUtil.formatMillis(context.window.getStart, DateTimeUtil.YYYY_MM_DD_HH_MM_SS)
    val windowEnd = DateTimeUtil.formatMillis(context.window.getEnd, DateTimeUtil.YYYY_MM_DD_HH_MM_SS)
    // per-second counts carried over from the previous window
    var lastWindowStateMap = lastWindow.value()
    // first window: no previous data, start from zero
    if (lastWindowStateMap == null) {
      lastWindowStateMap = initLastWindowState
    }

    // per-second counts of the current window: seconds 0 until interval
    val currentWindowMap = new util.HashMap[Int, Long]()
    // counts to hand to the next window: the last 60 seconds of this window
    val nextWindowMap = new util.HashMap[Int, Long]()
    for (i <- 0 until interval) {
      currentWindowMap.put(i, 0L)
    }
    for (i <- interval - 60 until interval) {
      nextWindowMap.put(i, 0L)
    }

    elements.foreach((e: (String, Long)) => {
      // second of the record within the window (tumbling windows are aligned to the epoch)
      val current: Int = (e._2 / 1000 % interval).toInt
      currentWindowMap.put(current, currentWindowMap.get(current) + 1)
    })

    // copy the last 60 seconds for the next window
    for (i <- interval - 60 until interval) {
      nextWindowMap.put(i, currentWindowMap.get(i))
    }

    // todo: temporary, send the raw per-second counts to the side output
    currentWindowMap.forEach((a: Int, b: Long) => {
      context.output(tag, windowStart + ", 1," + a + "," + b)
    })
    nextWindowMap.forEach((a: Int, b: Long) => {
      context.output(tag, windowStart + ", 2," + a + "," + b)
    })

    // compute every TPS point of this window
    for (window <- 0 until interval / intervalSize) {
      // second range [start, end) covered by this point
      val (start, end) = calWindowStartEnd(window)
      var size = 0L
      for (j <- start until end) {
        if (j < 0) {
          // seconds -60 to -1 come from the previous window's state
          // (the original tested j <= 0, which looked up a non-existent key for j == 0)
          size += lastWindowStateMap.get(interval + j)
        } else {
          size += currentWindowMap.get(j)
        }
      }
      out.collect((windowStart, windowEnd, window, start + "-" + end, size / 60.0))
    }

    // keep the last 60 seconds for the next window (update overwrites the old value)
    lastWindow.update(nextWindowMap)
  }

  // zero-filled state for the first window, using the same keys that process() reads
  private def initLastWindowState: util.HashMap[Int, Long] = {
    val map = new util.HashMap[Int, Long]()
    for (i <- interval - 60 until interval) {
      map.put(i, 0L)
    }
    map
  }

  // second range [end - 60, end) of the i-th TPS point
  def calWindowStartEnd(i: Int): (Int, Int) = {
    val end = i * intervalSize
    val start = end - 60
    (start, end)
  }

  override def close(): Unit = {
    lastWindow.clear()
  }
}
Test data set
Generate test data and write it straight to kafka at roughly 100 records per second, which makes the results easy to judge
send topic : user_log, size : 5000, message : {"next_page":"03794db5-97bb-4631-b56f-40546eb67505","category_id":10,"user_id":"\t9909210000","item_id":"1010017","price":26.958991852428383,"last_page":"f6c94247-9af5-46e0-a9f7-9d1b9417a302","page":"c88199fc-9abd-4001-9937-2568a13be9a1","position":"710daa6c-ecb1-4826-8c80-e7efa37649ca","sort":"917acf27-2189-4c9f-a823-e4b0fa4c553b","behavior":"pv","ts":"2023-02-17 21:00:50.000"}
send topic : user_log, size : 10000, message : {"next_page":"220c8dd1-f75c-45d5-ac6f-ad2bfab3d829","category_id":10,"user_id":"\t3375150000","item_id":"1010017","price":19.87789592108874,"last_page":"716bb58a-8aaa-40b5-86ed-315bb4714f67","page":"5f278b9f-638e-4f21-b614-1da8e556faee","position":"d010e4a4-c555-4dfc-b9dd-f69ecaaf8528","sort":"a80671a3-1792-4785-b984-f8e2bbf95d38","behavior":"cart","ts":"2023-02-17 21:01:40.000"}
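The generator itself is not shown in the post. As a rough sketch of what producing such records could look like, the snippet below builds JSON lines at a fixed rate per second. Everything in it is an assumption for illustration: the name `TestDataSketch`, the reduced field set (only `user_id`, `behavior` and `ts`, since the job reads just `user_id` and `ts`) and the fixed start time. It only builds the strings and leaves the Kafka producer out.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object TestDataSketch {
  // same timestamp layout as the sample messages above
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

  /** Builds `perSecond` JSON messages for each of `seconds` consecutive seconds starting at `start`. */
  def messages(start: LocalDateTime, seconds: Int, perSecond: Int): Iterator[String] =
    for {
      s <- Iterator.range(0, seconds)
      n <- Iterator.range(0, perSecond)
    } yield {
      val ts = start.plusSeconds(s).format(fmt)
      s"""{"user_id":"user_$n","behavior":"pv","ts":"$ts"}"""
    }

  def main(args: Array[String]): Unit =
    // hypothetical run: 2 seconds of data, 3 messages per second
    messages(LocalDateTime.of(2023, 2, 17, 21, 0, 0), 2, 3).foreach(println)
}
```

With a constant rate like this, every trailing 1-minute TPS point should settle at the configured per-second rate once a full minute of data is available, which is what makes the output easy to verify by eye.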
Results
Partial output taken from two consecutive windows
# Last few records of the first window
10s> (2023-02-17 21:10:00,2023-02-17 21:20:00,56,500-560,101.05)
10s> (2023-02-17 21:10:00,2023-02-17 21:20:00,57,510-570,101.05)
10s> (2023-02-17 21:10:00,2023-02-17 21:20:00,58,520-580,101.05)
10s> (2023-02-17 21:10:00,2023-02-17 21:20:00,59,530-590,101.05)
# First few records of the second window
10s> (2023-02-17 21:20:00,2023-02-17 21:30:00,0,-60-0,101.05)
10s> (2023-02-17 21:20:00,2023-02-17 21:30:00,1,-50-10,101.05)
10s> (2023-02-17 21:20:00,2023-02-17 21:30:00,2,-40-20,100.0)
10s> (2023-02-17 21:20:00,2023-02-17 21:30:00,3,-30-30,100.0)
For the complete code, see flink-rookie on GitHub