基于spark和flink的电商数据分析项目

本文是原项目的一次重写。主要是用DataFrame代替原来的RDD，并在一些实现上进行优化，还有就是实时流计算改用Flink进行实现。
项目分为用户访问session模块、页面转跳转化率统计、热门商品离线统计和广告流量实时统计四部分组成。

业务需求

用户访问session

该模块主要是对用户访问session进行统计分析，包括session的聚合指标计算、按时间比例随机抽取session、获取每天点击、下单和购买排名前10的品类、并获取top10品类的点击量排名前10的session。主要使用Spark DataFrame。

页面单跳转化率统计

该模块主要是计算关键页面之间的单步跳转转化率，涉及到页面切片算法以及页面流匹配算法。主要使用Spark DataFrame。

热门商品离线统计

该模块主要实现每天统计出各个区域的top3热门商品。主要使用Spark DataFrame。

广告流量实时统计

经过实时黑名单过滤的每天各省各城市广告点击实时统计、每天各省topn热门广告、各广告近1小时内每分钟的点击趋势。主要使用Spark streaming或Flink。

业务数据源

输入表

# 用户表
user_id      用户的ID
username     用户的名称
name         用户的名字
age          用户的年龄
professional 用户的职业
city         用户所在的城市
sex          用户的性别

# 商品表
product_id   商品的ID
product_name 商品的名称
extend_info  商品额外的信息

# 用户访问动作表
date               用户点击行为的日期
user_id            用户的ID
session_id         Session的ID
page_id            某个页面的ID
action_time        点击行为的时间点
search_keyword     用户搜索的关键词
click_category_id  某一个商品品类的ID
click_product_id   某一个商品的ID
order_category_ids 一次订单中所有品类的ID集合
order_product_ids  一次订单中所有商品的ID集合
pay_category_ids   一次支付中所有品类的ID集合
pay_product_ids    一次支付中所有商品的ID集合
city_id            城市ID

输出表

# 聚合统计表
taskid                       当前计算批次的ID
session_count                所有Session的总和
visit_length_1s_3s_ratio     1-3sSession访问时长占比
visit_length_4s_6s_ratio     4-6sSession访问时长占比
visit_length_7s_9s_ratio     7-9sSession访问时长占比
visit_length_10s_30s_ratio   10-30sSession访问时长占比
visit_length_30s_60s_ratio   30-60sSession访问时长占比
visit_length_1m_3m_ratio     1-3mSession访问时长占比
visit_length_3m_10m_ratio    3-10mSession访问时长占比
visit_length_10m_30m_ratio   10-30mSession访问时长占比
visit_length_30m_ratio       30mSession访问时长占比
step_length_1_3_ratio        1-3步长占比
step_length_4_6_ratio        4-6步长占比
step_length_7_9_ratio        7-9步长占比
step_length_10_30_ratio      10-30步长占比
step_length_30_60_ratio      30-60步长占比
step_length_60_ratio         大于60步长占比

# 品类Top10表
taskid
categoryid
clickCount
orderCount
payCount

# Top10 Session
taskid
categoryid
sessionid
clickCount

用户访问Session分析

Session聚合统计

统计出符合条件的session中，各访问时长、步长的占比，并将结果保存到MySQL中。符合条件的session指搜索过某些关键词的用户、访问时间在某个时间段内的用户、年龄在某个范围内的用户、职业在某个范围内的用户、所在某个城市的用户，所发起的session。

除了将原rdd的实现改为DF外，本文还在两方面进行了优化。第一是join前提前filter。原实现是先从用户动作表中计算出访问时长、步长后和用户信息表进行关联后再filter的，这无疑是对一些无关的用户多余地计算了访问时长和步长，也增加了join是shuffle的数据量。第二点是原实现采用accumulator实现个访问时长人数和各步长人数的统计，这会增加driver的负担。而重写后的代码基于DF，且利用when函数对访问时长和步长进行离散化，最后利用聚合函数得出统计结果，让所有统计都在executors中并行执行。

// 原始数据包含“用户访问动作表”中的信息
// 先根据时间范围筛选“用户访问动作表”，然后将它和“UserInfo表”进行inner join，补充用于进一步筛选的信息：age、professional、city、sex
// 根据searchKeywords、clickCategoryIds和上面4个条件对数据进行筛选，得出所需的session。

// 利用spark sql筛选特定时间段的session
spark.sql("select * from user_visit_action where date>='" + startDate + "' and date<='" + endDate + "'")

// 下面代码用于合成SQL语句并用于filter特定类型的session，但有一定的安全隐患，要对输入的参数进行严格的校验，防止SQL注入。
val selectClause = new StringBuilder("SELECT * FROM user_visit_action_to_user_info WHERE 1=1 ")
if (ValidUtils.equal(Constants.PARAM_SEX, sex)){
  selectClause append ("AND sex == '" + sex + "'")
}
if (ValidUtils.in(Constants.PARAM_PROFESSIONALS, professionals)){
  selectClause append ("AND professional in (" + professionals + ")")
}
if (ValidUtils.in(Constants.PARAM_CITIES, cities)){
  selectClause append ("AND cities in (" + cities + ")")
}
if (ValidUtils.in(Constants.PARAM_KEYWORDS, keywords)){
  selectClause append ("AND search_keyword in (" + keywords + ")")
}
if (ValidUtils.in(Constants.PARAM_CATEGORY_IDS, categoryIds)){
  selectClause append ("AND click_category_id in (" + categoryIds + ")")
}
if (ValidUtils.between(Constants.FIELD_AGE, startAge, endAge)){
  selectClause append  ("AND age BETWEEN " + startAge + " AND " + endAge)
}
val sqlQuery = selectClause.toString()

// filter完后与“用户表”建立连接

// 下面进行session聚合计算，结果得到的信息包括sessionid、search_keyword、click_category_id、stepLength、visitLength、session开始时间start、AGE、PROFESSIONAL、CITY、SEX
val timeFmt = "yyyy-MM-dd HH:mm:ss"
val sessionid2ActionsRDD2 = UserVisitActionDF
  .withColumn("action_time", unix_timestamp($"action_time", timeFmt))
  .groupBy("session_id")
  .agg(min("action_time") as "start",
    max("action_time") as "end",
    count("*") as "stepLength")
  .withColumn("visitLength", $"start" - $"end")
  .withColumn("discrete_VL", discretiseVisitLength)
  .withColumn("discrete_SL", discretiseStepLength)

// 离散化 visitLength 和 stepLength
val discretiseVisitLength = when($"visitLength" >= 1 && $"visitLength" <= 3 , Constants.TIME_PERIOD_1s_3s)
.when($"visitLength" >= 4 && $"visitLength" <= 6 , Constants.TIME_PERIOD_4s_6s)
...
.when($"visitLength" >= 1800, Constants.TIME_PERIOD_30m)

// 统计信息，获得每种访问时长的人数。将下面discrete_VL换成stepLength就是每种步长的人数了
val statisticVisitLength = sessionid2ActionsRDD2.groupBy("discrete_VL").agg(count("discrete_VL")).collect()

Session分层抽样

根据各时长、步长的比例抽样。原实现利用rdd和scala自身数据结构和方法来实现，新的实现直接利用dataframe的统计函数sampleBy实现。

用df.stat.sampleBy("colName", fractions, seed)，其中fractions为Map，是每distinct key和其需要抽取的比例，如("a" -> 0.8)就是从key为a的数据中抽80%条

val fractions = HashMap(
  TIME_PERIOD_1s_3s -> 0.1,
  TIME_PERIOD_4s_6s -> 0.1,
  TIME_PERIOD_7s_9s -> 0.1,
  TIME_PERIOD_10s_30s -> 0.1,
  TIME_PERIOD_30s_60s -> 0.1,
  TIME_PERIOD_1m_3m -> 0.1,
  TIME_PERIOD_3m_10m -> 0.1,
  TIME_PERIOD_10m_30m -> 0.1,
  TIME_PERIOD_30m -> 0.1
)
df.stat.sampleBy("time_period", fractions, 2L)

// 如果time_period未知，用下面方式得出map
df.select("time_period")
  .distinct
  .map(x=> (x, 0.8))
  .collectAsMap

Top10热门品类

分别计算出各商品的点击数、下单数、支付次数，然后将三个结果进行连接，并排序。排序规则是点击数大的排前面，相同时下单数大的排前面，然后再相同时支付次数大的排前面。这里的优化点是采用rdd的takeOrdered取前十，它的底层是每个分区一个最小堆，取出每个分区的前10，然后再汇总。这样省去了原来实现当中的sortbykey+take，该方法进行了全排序，效率较低。

// 分别计算出各商品的点击数、下单数、支付次数，然后将三个结果进行连接，并排序。
val clickCategoryId2CountDF = sessionid2detailDF
  .select("clickCategoryId")
  .na.drop()
  .groupBy("clickCategoryId")
  .agg(count("clickCategoryId"))
  .withColumnRenamed("clickCategoryId", "categoryId")

val orderCategoryId2CountDF = sessionid2detailDF
  .select("order_category_ids")
  .na.drop()
  .withColumn("splitted_order_category_ids", split($"order_category_ids", ","))
  .withColumn("single_order_category_ids", explode($"splitted_order_category_ids"))
  .groupBy("single_order_category_ids")
  .agg(count("single_order_category_ids"))
  .withColumnRenamed("single_order_category_ids", "categoryId")

val payCategoryId2Count = sessionid2detailDF
  .select("pay_category_ids")
  .na.drop()
  .withColumn("splitted_pay_category_ids", split($"pay_category_ids", ","))
  .withColumn("single_pay_category_ids", explode($"splitted_pay_category_ids"))
  .groupBy("single_pay_category_ids")
  .agg(count("single_pay_category_ids"))
  .withColumnRenamed("single_pay_category_ids", "categoryId")

val top10CategoryId = clickCategoryId2CountDF.join(orderCategoryId2CountDF, Seq("categoryId"), "left")
  .join(payCategoryId2Count, Seq("categoryId"), "left")
  .na.fill(0L, Seq(""))
  .map(row => {
    (row.getAs[Int]("categoryId"), 
     row.getAs[Int]("count(clickCategoryId)"), 
     row.getAs[Int]("count(single_order_category_ids)"),
     row.getAs[Int]("count(single_pay_category_ids)"))
  })
  .rdd
  .takeOrdered(10)(ordering)

// 补充
implicit val ordering = new Ordering[(Int, Int, Int, Int)] {
  override def compare(x: (Int, Int, Int, Int), y: (Int, Int, Int, Int)): Int = {
    val compare2 = x._2.compareTo(y._2)
    if (compare2 != 0) return compare2
    val compare3 = x._3.compareTo(y._3)
    if (compare3 != 0) return compare3
    val compare4 = x._4.compareTo(y._4)
    if (compare4 != 0) return compare4
    0
  }
}.reverse

Top10活跃Session

对于top10的品类，每一个都要获取对它点击次数排名前10的session。
原代码的实现是先groupByKey，统计出每个sessionid对各品类的点击次数，然后再跟前10热门品类连接来减少数据，然后再用groupBuKey，对每个分组数据toList后排序取前10。这个实现并不太好，首先它一开始的groupByKey对非Top10热门品类的数据进行了统计，这是一种浪费。更好的做法是提前filter，即先利用热门品类这个名单进行filter。然后，原代码在实现filter使用的是将热门品类名单parallelise到集群然后利用join实现过滤。这会触发不必要的shuffle，更好的实现进行broadcast join，将名单广播出去后进行join。然后groupByKey的统计也是一个问题，它没有map side聚合，容易OOM，更好的实现是采用DF的groupby + agg。得出统计数据后利用windowfunction取得各热门品类的前十session。即一次shuffle就可以完成需求，windowfunction在这个并不需要shuffle，因为经过前面的shuffle聚合，df已经具有partitioner了，在原节点就可以计算出topn。

// 把top10CategoryId的名单发到集群
val top10CategoryIdRDD = spark.sparkContext.parallelize(top10CategoryId.map(_._1)).toDF("top10CategoryId")

// 利用broadcast实现过滤，然后进行分组统计
val top10Category2SessionAndCount =     filteredUserVisitActionDF.join(broadcast(top10CategoryIdRDD), $"click_category_id" ===  $"top10CategoryId")
      .groupBy("top10CategoryId", "sessionId")
      .agg(count("click_category_id") as "count")

// 分组取前10
// windowfunction在这个并不需要shuffle，因为经过前面的shuffle聚合，df已经具有partitioner了，在原节点就可以计算出topn。
val windowSpec = Window.partitionBy("top10CategoryId", "sessionId").orderBy(desc("count"))
val top10SessionWithinTop10Category = top10Category2SessionAndCount.select(expr("*"), rank().over(windowSpec).as("rank"))
      .filter($"rank" <= 10)

页面单跳转化率分析

计算关键页面之间的单步跳转转化率。方法是先获取目标页面，如1，2，3，将它们拼接成1_2, 2_3得出两个目标转跳形式。同样需要在df的数据中产生页面转跳。方法是利用windowfunction将数据按sessionid分组，访问时间升序排序，然后利用concat_ws和window的lag函数实现当前页面id与前一条数据的页面id的拼接。集群数据中产生转跳数据后，利用filter筛选出之前的目标转跳形式。最后按这些形式分组统计数量，便得出每种转跳的数量，将它collect为map。另外还需要计算起始页1的数量，简单的filter和count实现。接下来就可以根据这些数据计算转跳率了。遍历目标转跳形式，从map中获取相应的数量，然后除以起始页/上一页的数量，进而得出结果。

// 获取需要查询的页面id，结果如"3,1,4,5,2"
val targetPageFlow = ParamUtils.getParam(taskParam, Constants.PARAM_TARGET_PAGE_FLOW)
// 对需要查询的页面id进行分割，结果如Array("3","1","4","5","2")
val targetPages = targetPageFlow.split(",")

// 构建目标转跳页面id，结果如Array(3_1,1_4,4_5,5_2）
val targetPagePairs = targetPages
  .zip(targetPages.tail)
  .map(item => item._1 + "_" + item._2)
val targetPageFlowBroadcast = spark.sparkContext.broadcast(targetPagePairs)

// 设置将要用到的时间格式和window函数
val timeFmt = "yyyy-MM-dd HH:mm:ss"
val windowSpec = Window
  .partitionBy("session_id")
  .orderBy($"action_time")
val pagesPairFun = concat_ws("_", col("page_id"), lag("page_id", -1).over(windowSpec))

// 计算各目标转跳id的数量
val pageSplitPvMap = df.na.drop(Seq("session_id"))
  .withColumn("action_time", to_timestamp($"action_time", timeFmt))
  .withColumn("pagePairs", pagesPairFun)
  // 下面filter方式，条件少时可行，多时用broadcast jion
  .filter($"pagePairs" isin (targetPageFlowBroadcast.value: _*))
  .groupBy("pagePairs")
  .agg(count("pagePairs"))
  .as[(String, Long)]
  .collect().toMap

// 计算起始页面的点击数
val startPage = targetPages(0)
val startPagePv = df.filter($"page_id" === startPage).count().toDouble
var lastPageSplitPv = startPagePv

// 存储结果的map
val convertRateMap = new mutable.HashMap[String, Double]()

for(targetPage <- targetPagePairs){
  val targetPageSplitPv = pageSplitPvMap(targetPage).toDouble
  val convertRate = "%.2f".format(targetPageSplitPv / lastPageSplitPv).toDouble
  convertRateMap.put(targetPage, convertRate)
  lastPageSplitPv = targetPageSplitPv
}

各区域热门商品统计分析

原数据没有地区列和城市列（有城市id），所以先广播一个地区城市表，然后根据城市id进行join。之后按照地区和商品分组进行计数。最后利用windowfunction取各地区topn。

val cityInfo = Array((0L, "北京", "华北"), (1L, "上海", "华东"),
  (2L, "南京", "华东"), (3L, "广州", "华南"),
  (4L, "三亚", "华南"), (5L, "武汉", "华中"),
  (6L, "长沙", "华中"), (7L, "西安", "西北"),
  (8L, "成都", "西南"), (9L, "哈尔滨", "东北"))

// Row(city_id, city_name, area)
val cityInfoDF = spark.sparkContext.makeRDD(cityInfo).toDF("city_id", "city_name", "area")

// 提取 cityid 和 productid
val cityid2clickActionDF = df.select("city_id", "product_id")
  .na.drop(Seq("product_id"))
  .filter($"product_id" =!= -1L)

// (cityid, cityName, area, productid)
val area_product_clickCount_cityListDF = cityid2clickActionDF.join(broadcast(cityInfoDF), Seq("city_id"), "inner")
  .withColumn("cityId_cityName", concat_ws(":", $"city_id", $"city_name"))
  .groupBy($"area", $"product_id")
  .agg(count("*") as "click_count", collect_set("cityId_cityName") as "city_list")


// 和top10热门session类似，利用window求topn
val windowSpec = Window
  .partitionBy("area", "product_id")
  .orderBy($"click_count".desc)

// 每个地区前三热门商品
val areaTop3ProductDF = area_product_clickCount_cityListDF.withColumn("rank", $"click_count".over(windowSpec))
  .filter($"rank" <= 3)


// productInfo表(对json的操作)
val productInfoDF = df.select("product_id", "product_name", "extend_info")
    .withColumn("product_status", get_json_object($"extend_info", "$.product_status"))
    .withColumn("product_status", when($"product_status" === 0, "Self").otherwise("Third Party"))
    .drop("extend_info")

// 补充信息
val areaTop3ProducFullInfoDF = areaTop3ProductDF.join(productInfoDF, Seq("product_id"), "inner")

广告点击流量实时统计分析

经过实时黑名单过滤的每天各省各城市广告点击实时统计、每天各省topn热门广告、各广告近1小时内每分钟的点击趋势。这部分原代码采用Spark Streaming实现，我将之改为基于Flink的实现。下面会首先介绍Spark Streaming的实现，然后到Flink。

流式数据的格式为：
timestamp	1450702800
province 	Jiangsu	
city 	Nanjing
userid 	100001
adid 	100001

总体流程

创建流，利用预先广播的黑名单过滤信息，然后利用过滤后的信息更新黑名单、计算广告点击流量、统计每天每个省份top3热门广告、统计一个小时窗口内每分钟各广告的点击量。

// 构建Spark上下文
val sparkConf = new SparkConf().setAppName("streamingRecommendingSystem").setMaster("local[*]")

// 创建Spark客户端
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))

// 设置检查点目录
ssc.checkpoint("./streaming_checkpoint")

// --- 此处省略Kafka配置 --- //

// 创建DStream
val adRealTimeLogDStream = KafkaUtils.createDirectStream[String,String](...)

var adRealTimeValueDStream = adRealTimeLogDStream.map(_.value)

// 用于Kafka Stream的线程非安全问题，重新分区切断血统
adRealTimeValueDStream = adRealTimeValueDStream.repartition(400)
// 根据动态黑名单过滤数据。利用findAll来查找MySQL中所有的黑名单用户，然后通过join实现过滤。
val filteredAdRealTimeLogDStream = filterByBlacklist(spark, adRealTimeValueDStream)

// 业务功能一：生成动态黑名单
generateDynamicBlacklist(filteredAdRealTimeLogDStream)

// 业务功能二：计算广告点击流量实时统计结果（yyyyMMdd_province_city_adid,clickCount）
val adRealTimeStatDStream = calculateRealTimeStat(filteredAdRealTimeLogDStream)

// 业务功能三：实时统计每天每个省份top3热门广告
calculateProvinceTop3Ad(spark, adRealTimeStatDStream)

// 业务功能四：实时统计每天每个广告在最近1小时的滑动窗口内的点击趋势（每分钟的点击量）
calculateAdClickCountByWindow(adRealTimeValueDStream)

ssc.start()
ssc.awaitTermination()

实时黑名单

实现实时的动态黑名单机制：将每天对某个广告点击超过100次的用户拉黑。提取出日期（yyyyMMdd）、userid、adid，然后reduceByKey统计这一批数据的结果，并批量插入MySQL。然后过滤出新的黑名单用户，实现为从MySQL中查找每条数据的用户是否对某条广告的点击超过100次，即成为了新的黑名单用户，找到后进行distinct操作得出新增黑名单用户，并更新到MySQL。

// 从 adRealTimeValueDStream 中提取出下面三个值并构建(key, 1L)
val key = datekey + "_" + userid + "_" + adid
// 然后 reduceByKey(_ + _)， 得到这batch每天每个用户对每个广告的点击量
dailyUserAdClickCountDStream.foreachRDD{ rdd =>
      rdd.foreachPartition{ items =>
// items 是 Iterator(key, count)，提取key的值，构成(date,  userid, adid, clickCount)，批量写入mysql
      ... }}
// 之后filter，每条数据到 mysql 查询更新后的(date, userid, adid)的count是否大于100，表示当天某用户对某个广告是否点击超过100次，如果是就true（留下）最后得出新黑名单blacklistDStream。去重后直接批量插入mysql
blacklistDStream.transform(_.distinct())

广告点击实时统计

每天各省各城市各广告的点击流量实时统计。分组，key为日期+省份+城市+广告id，利用updateStateByKey实现累加。新的统计结果更新到MySQL。

// 执行updateStateByKey算子
// spark streaming特有的一种算子，在spark集群内存中，维护一份key的全局状态
// 和黑名单一样，先从string中提取出信息并构建key
val aggregatedDStream = dailyUserAdClickDStream.updateStateByKey[Long]{ (values:Seq[Long], old:Option[Long]) =>
  var clickCount = 0L

  // 如果说，之前是存在这个状态的，那么就以之前的状态作为起点，进行值的累加
  if(old.isDefined) {
    clickCount = old.get
  }

  // values代表了，batch rdd中，每个key对应的所有的值
  for(value <- values) {
    clickCount += value
  }

  Some(clickCount)
}
// 然后和黑名单中一样，批量更新到mysql

统计每天各省top3热门广告

利用上一步得到的结果，即key为日期+省份+城市+广告id，value为累积点击量，进行统计及分组topn。reduceByKey + windowfunction

统计各广告最近1小时内的点击量趋势：各广告最近1小时内各分钟的点击量

同样在累积数据的基础上操作，提取出时间，然后利用固定窗口实现需求。

// 从原始流（未去除黑名单的数据）中提取出timeMinute、adid两个值进行聚合统计
pairDStream.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(60L), Seconds(10L))
// 下面 items 就是 Iterator(timeMinute_adid, count)
aggrRDD.foreachRDD { rdd =>
      rdd.foreachPartition { items => ...}}
// 从key中提取出date、hour和minute写入mysql

Flink实现

Flink的思路是通过三个KeyedProcessFunction来实现的，因为他有state（累积各key的值）和timer（定时删除state）功能。
第一个KeyedProcessFunction是记录每个userId-adId键的量，当达到阈值时对这类信息进行截流，从而实现黑名单的更新和过滤。
第二个是记录每个province的数据量，即每个省的广告点击量
第三个是记录一个map，里面统计每个省的点击量，当进行了一定数量的更新后，就输出一次这个map的前n个kv对（以排好序的string的形式），从而实现topn功能。

// 模块结构
├── Launcher.scala 启动类
├── bean
│   └── AdLog.scala 三个case class
├── constant
│   └── Constant.scala 定义了一些定值字符串
├── function 处理函数，下面介绍。
│   ├── AccProvClick.scala
│   ├── BetterGenerateTopK.scala
│   └── FilterBlackListUser.scala
└── schema
    └── AdLogDeserializationSchema.scala 用于反序列化Kafka信息

Launcher类

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

// kafka配置
val consumerProps = new Properties()
consumerProps.setProperty(KEY_BOOTSTRAP_SERVERS, args(0))
consumerProps.setProperty(KEY_GROUP_ID, args(1))

// kafka消费者
val consumer = new FlinkKafkaConsumer010(
  args(2),
  new AdLogDeserializationSchema(),
  consumerProps
)

// 设置数据源
val adLogStream = env.addSource(consumer)

// 对点击某一广告多于100的用户进行截流，从而一次性完成黑名单过滤和黑名单更新。
val withSideOutput = adLogStream
  .keyBy(adLog => (adLog.userid, adLog.adid))
  .process(new FilterBlackListUser)

// （可选）新增的黑名单流。此处只输出到控制台，有需要可以输出到其他端口。
withSideOutput.getSideOutput(realTimeBlackList)
  .print()
// 在main函数外添加下面代码才能取得sideoutput
// val realTimeBlackList: OutputTag[String] =
//    new OutputTag[String]("black_list")

// 实时统计广告点击量最多的前K个省份。同样此处只输出到控制台，有需要可以输出到其他端口。
withSideOutput
  .keyBy(_.province)
  // 按province进行分组累加的stateful操作
  .process(new AccProvClick) // 这里也可以输出到数据库或者kafka等，从而对这些聚合好的数据进行不同需求的分析
  .keyBy(_.dummyKey)
  .process(new BetterGenerateTopK(10))
  .print()

env.execute("TopK_Province")

AdLog类

广告日志类以及处理过程产生的一些新case class

// 从kafka获取并实现反序列化后的数据
case class AdLog(userid: Int, adid: Int, province: String, city: String, timestamp: Long)
// 经过FilterBlackListUser处理后得到的数据，如果需要对adid、city都进行分组，也可以在这里加属性
case class ProvinceWithCount(province: String, count: Int, dummyKey: Int)

Schema类

class AdLogDeserializationSchema extends DeserializationSchema[AdLog]{

  override def deserialize(bytes: Array[Byte]): AdLog = {
    val json = parse(new String(bytes))
    implicit val formats = DefaultFormats
    json.extract[AdLog]
  }

  // 可以根据接收的AdLog来判断是否需要结束这个数据流。如果不需要这个功能就直接返回false。
  override def isEndOfStream(t: AdLog): Boolean = false

  // 告诉Flink经过反序列化后得到什么类
  override def getProducedType: TypeInformation[AdLog] = TypeInformation.of(AdLog.getClass.asInstanceOf[Class[AdLog]])
}

FilterBlackListUser类

class FilterBlackListUser extends KeyedProcessFunction[(Int, Int), AdLog, ProvinceWithCount] {

  // 存储当前userId-adId键值的广告点击量
  var countState: ValueState[Int] = _
  // 标记当前userId-adId键值是否第一次进入黑名单的flag
  var firstSent: ValueState[Boolean] = _
  // 记录当前userId-adId键值state的生成时间
  var resetTime: ValueState[Long] = _

  // 初始化key state
  override def open(parameters: Configuration): Unit = {

    val countDescriptor = new ValueStateDescriptor[Int]("count", classOf[Int])
    countState = getRuntimeContext
      .getState[Int](countDescriptor)

    val firstSeenDescriptor = new ValueStateDescriptor[Boolean]("firstSent", classOf[Boolean])
    firstSent = getRuntimeContext
      .getState[Boolean](firstSeenDescriptor)

    val resetTimeDescriptor = new ValueStateDescriptor[Long]("resetTime", classOf[Long])
    resetTime = getRuntimeContext
      .getState[Long](resetTimeDescriptor)

  }

  override def processElement(value: AdLog,
                              ctx: KeyedProcessFunction[(Int, Int), AdLog, ProvinceWithCount]#Context,
                              out: Collector[ProvinceWithCount]): Unit = {
    val curCount = countState.value()
    // 第一次处理登记timer，24：00清除state
    if (curCount == 0) {
      val time = (ctx.timerService().currentProcessingTime() / 86400000 + 1) * 86400000
      resetTime.update(time)
      ctx.timerService().registerProcessingTimeTimer(time)
    }

    // 加入黑名单，并在side output输出，但只输出一次
    if (curCount >= 100) {
      // 默认初始为false
      if (!firstSent.value()) {
        firstSent.update(true)
        ctx.output(Launcher.realTimeBlackList, value.userid.toString)
      }
      return
    }
    // 点击次数+1
    countState.update(curCount + 1)
    out.collect(ProvinceWithCount(value.province, 1,1))
  }

  // 到达预定时间时清除state
  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[(Int, Int), AdLog, ProvinceWithCount]#OnTimerContext,
                       out: Collector[ProvinceWithCount]): Unit = {
    if (timestamp == resetTime.value()) {
      firstSent.clear()
      countState.clear()
    }
  }
}

AccProvClick类

代码形式和上面的类几乎一样

class AccProvClick extends KeyedProcessFunction[String, ProvinceWithCount, ProvinceWithCount] {

  // 存储当前province键值的广告点击量
  var countState: ValueState[Int] = _
  var resetTime: ValueState[Long] = _

  override def open //和上面类似

  override def processElement(value: ProvinceWithCount,
                              ctx: KeyedProcessFunction[String, ProvinceWithCount, ProvinceWithCount]#Context,
                              out: Collector[ProvinceWithCount]): Unit = {
    // 和上面类似，如果countState值为0，先设置timer
    val curCount = countState.value() + 1
    countState.update(curCount)
    out.collect(ProvinceWithCount(value.province, curCount, 1))
  }

  override def onTimer // 和上面类似
}

BetterGenerateTopK类

class BetterGenerateTopK(n: Int) extends KeyedProcessFunction[Int, ProvinceWithCount, String] {

  // 存储各省的广告点击量
  var prov2clickTable : MapState[String, Int] = _

  var resetTime: ValueState[Long] = _

  // 每积累到100条更新就发送一次排名结果
  var sendFlag : Int = 0

  override def open(parameters: Configuration): Unit = {
    val prov2clickDescriptor = new MapStateDescriptor[String, Int]("statistic", classOf[String], classOf[Int])
    prov2clickTable = getRuntimeContext
      .getMapState[String, Int](prov2clickDescriptor)

    val resetTimeDescriptor = // 上面类似
  }
  override def processElement(value: ProvinceWithCount,
                              ctx: KeyedProcessFunction[Int, ProvinceWithCount, String]#Context,
                              out: Collector[String]): Unit = {

    if (!prov2clickTable.iterator().hasNext) {
      val time = (ctx.timerService().currentProcessingTime() / 86400000 + 1) * 86400000
      resetTime.update(time)
      ctx.timerService().registerProcessingTimeTimer(time)
    }

    prov2clickTable.put(value.province, value.count)

    sendFlag += 1
    if (sendFlag % 100 == 0){
      sendFlag = 0
      val res = new StringBuilder
      prov2clickTable.iterator()
        .asScala
        .toArray
        .sortBy(_.getValue)
        .takeRight(n)
        .foreach(x => res.append(x.getKey + x.getValue))
      out.collect(res.toString())
    }
  }

  override def onTimer // 和上面类似
}

posted @ 2018-12-30 18:03 justcodeit 阅读(4624) 评论(0) 收藏举报

刷新页面返回顶部

code2one