Precise-search exercise, and formatting data with IDEA

Precise-search exercise

Data:

{"recordMap":{"screenwriter":"","publishtime":"2021-08-21","year":"2021","score":"0.0"},"processDate":"2021-11-03","cid":"c370201","itemId":"7232742","utctime":"1635935657824","useMap":{"actor":"","area":"中国","tags":"动画","language":"国语","director":"","category":"电视剧"},"itemName":"斗罗大陆1 第十部分","stbid":"004601000001004003d400692d45007a"}
{"recordMap":{"screenwriter":"金城哲夫|tetsuo kinjô|佐佐木守|关泽新一|tatsuo miyata","publishtime":"1966-07-17","year":"1966","score":"8.6"},"processDate":"2021-11-03","cid":"c370201","itemId":"6201446","utctime":"1635934986456","useMap":{"actor":"小林昭二|黒部进|石井伊吉|黑部进|sandayuu dokumamushi","area":"日本","tags":"剧情|科幻|经典|动画|动作","language":"日语","director":"圆谷英二|实相寺昭雄|圆谷一|饭岛敏宏|满田穧","category":"电视剧"},"itemName":"奥特曼","stbid":"004603000003207021057c5259c7970e"}
{"recordMap":{"screenwriter":"曹薇","publishtime":"2020-10-22","year":"2020","score":"0.0"},"processDate":"2021-11-03","cid":"c440401","itemId":"6139446","utctime":"1635864238205","useMap":{"actor":"王嘉尔|李荣浩|林俊杰|汪苏泷|孙红雷","area":"中国","tags":"青春|真人秀","language":"国语","director":"陈刚|曹薇|吴寒","category":"综艺"},"itemName":"青春有你 第三季","stbid":"299001201500160705"}
{"recordMap":{"screenwriter":"万秦|刘富源|刘服华|叶天龙|姜璐","publishtime":"","year":"","score":"6.0"},"processDate":"2021-11-03","cid":"c340101","itemId":"5838358","utctime":"1635945230702","useMap":{"actor":"张秉君|张伟|谭笑|陈光|孙尧东","area":"中国|其它地区","tags":"喜剧|剧情|科幻|情感|亲子|动画|家庭|儿童|冒险","language":"国语","director":"丁亮|刘富源|林汇达|邵和麒|paulette victor-lifton","category":"电影"},"itemName":"熊出没之夺宝熊兵","stbid":"0010059900e06800122088238cbe2088"}
{"recordMap":{"screenwriter":"万秦|刘富源|刘服华|叶天龙|姜璐","publishtime":"","year":"","score":"6.0"},"processDate":"2021-11-03","cid":"c350201","itemId":"5838358","utctime":"1635932250572","useMap":{"actor":"张秉君|张伟|谭笑|陈光|孙尧东","area":"中国|其它地区","tags":"喜剧|剧情|科幻|情感|亲子|动画|家庭|儿童|冒险","language":"国语","director":"丁亮|刘富源|林汇达|邵和麒|paulette victor-lifton","category":"电影"},"itemName":"熊出没之夺宝熊兵","stbid":"004401ff0001181001f908a5c87d7b38"}
{"recordMap":{"screenwriter":"劳伦斯先生|史蒂芬·海伦伯格|蒂姆·希尔|kaz|德里克·德莱蒙","publishtime":"2020-10-22","year":"2020","score":"7.7"},"processDate":"2021-11-03","cid":"c430201","itemId":"6439310","utctime":"1635910655061","useMap":{"actor":"罗杰·布帕斯|汤姆·肯尼|比尔·法格巴克|克兰西·布朗|罗里·艾伦","area":"美国","tags":"喜剧|动画|家庭|奇幻","language":"英语","director":"戴夫·坎宁安|sherm cohen|dave cunningham","category":"电视剧"},"itemName":"海绵宝宝 第十三季","stbid":"a568f034560747838fe1f3dd88901fdf"}
{"recordMap":{"screenwriter":"史蒂夫·迪特寇|斯坦·李","publishtime":"1997-09-12","year":"1997","score":"8.5"},"processDate":"2021-11-03","cid":"c440401","itemId":"6358430","utctime":"1635936346029","useMap":{"actor":"克里斯托弗·丹尼尔·巴恩斯|加里·英霍夫|marla rubinoff|rodney saulsberry|帕特里克·莱比奥托","area":"美国","tags":"英雄|剧情|经典|动画","language":"英语","director":"bob richardson","category":"电视剧"},"itemName":"蜘蛛侠 第五季","stbid":"109001212401402848"}
{"recordMap":{"screenwriter":"","publishtime":"2019-01-01","year":"2019","score":"7.3"},"processDate":"2021-11-03","cid":"c330201","itemId":"7093925","utctime":"1635943063080","useMap":{"actor":"罗温·艾金森","area":"中国|英国","tags":"喜剧","language":"","director":"马特·胡德","category":"动漫"},"itemName":"憨豆先生 第三季","stbid":"004203000002089018389c62ab755c4b"}
{"recordMap":{"screenwriter":"张佳|咸瑶","publishtime":"2013-09-29","year":"2013","score":"4.1"},"processDate":"2021-11-03","cid":"c340201","itemId":"6243284","utctime":"1635914079939","useMap":{"actor":"海陆|蒋毅|彭冠英|赵樱子|李泰","area":"中国","tags":"剧情|爱情","language":"国语","director":"高先明","category":"电影,电视剧"},"itemName":"因为爱情有多美","stbid":"004303000002089019093050fd2819cc"}
{"recordMap":{"screenwriter":"","publishtime":"2021-05-17","year":"2021","score":"0.0"},"processDate":"2021-11-03","cid":"c530201","itemId":"7118559","utctime":"1635934418175","useMap":{"actor":"","area":"其它地区","tags":"动画|母婴|益智","language":"","director":"","category":""},"itemName":"汪汪队立大功第七季","stbid":"005603000001005020c0b46077752152"}
{"recordMap":{"screenwriter":"何冀平|何麒|夏祖辉|方桂兰|贡敏","publishtime":"1992-01-01","year":"1992","score":"9.4"},"processDate":"2021-11-03","cid":"c370201","itemId":"5891000","utctime":"1635896489433","useMap":{"actor":"赵雅芝|叶童|陈美琪|石乃文|夏光莉","area":"台湾","tags":"经典|爱情|古装|奇幻","language":"国语","director":"夏祖辉|何麒","category":"电视剧"},"itemName":"新白娘子传奇","stbid":"0046010000010040d6e42c43be384139"}
{"recordMap":{"screenwriter":"","publishtime":"2021-07-09","year":"2021","score":"0.0"},"processDate":"2021-11-03","cid":"c370201","itemId":"7222509","utctime":"1635932997417","useMap":{"actor":"","area":"美国|英国|其它地区","tags":"儿童","language":"","director":"","category":""},"itemName":"挖掘机","stbid":"0046010000010040d6e42c43be386aa6"}
{"recordMap":{"screenwriter":"","publishtime":"2021-06-07","year":"2021","score":"0.0"},"processDate":"2021-11-03","cid":"c530201","itemId":"7183631","utctime":"1635939019916","useMap":{"actor":"","area":"其它地区","tags":"","language":"","director":"","category":"电影"},"itemName":"奶龙:我被小狗狗咬了","stbid":"005603000002703021281055e43e2b37"}
{"recordMap":{"screenwriter":"","publishtime":"2019-03-15","year":"2019","score":"7.2"},"processDate":"2021-11-03","cid":"c530201","itemId":"5871224","utctime":"1635936827239","useMap":{"actor":"曾梦雪|戴景耀|黄甫杰|屠画|向昊","area":"中国","tags":"喜剧|偶像|剧情|爱情|古装","language":"","director":"管健雄","category":"电视剧"},"itemName":"大周小冰人第二季","stbid":"00560300000100502030b46077145a58"}
{"recordMap":{"screenwriter":"","publishtime":"2020-06-22","year":"2020","score":"0.0"},"processDate":"2021-11-03","cid":"c440401","itemId":"5922650","utctime":"1635943994622","useMap":{"actor":"","area":"欧美|英国","tags":"儿童","language":"","director":"","category":"动漫"},"itemName":"海绵宝宝 第十二季","stbid":"109001184200374068"}
{"recordMap":{"screenwriter":"郑春华|菜如山","publishtime":"","year":"","score":"7.4"},"processDate":"2021-11-03","cid":"c610201","itemId":"5936025","utctime":"1635913670012","useMap":{"actor":"姚培华|符冲|范蕾颖|苏光琪|范楚绒","area":"中国","tags":"喜剧|经典|动画","language":"国语","director":"崔世昱","category":"电视剧"},"itemName":"大头儿子和小头爸爸","stbid":"00570300000100702009acbb61753c19"}
{"recordMap":{"screenwriter":"","publishtime":"2021-05-02","year":"2021","score":"0.0"},"processDate":"2021-11-03","cid":"c530201","itemId":"7112761","utctime":"1635934219883","useMap":{"actor":"","area":"其它地区|欧美|英国","tags":"怀旧|儿童|励志|益智","language":"","director":"","category":"电影"},"itemName":"白雪公主","stbid":"00560300001968001120c0365644b295"}
{"recordMap":{"screenwriter":"elana lesser|克利夫 鲁比|cliff ruby|ruth handler|伊拉娜·莱瑟","publishtime":"2016-12-29","year":"2016","score":"7.6"},"processDate":"2021-11-03","cid":"c370201","itemId":"5855244","utctime":"1635932642896","useMap":{"actor":"kelly sheridan|安杰丽卡·休斯顿|cree summer|ian james corlett|马克·海德斯","area":"美国|欧美","tags":"童话|少女|动画|家庭|儿童|冒险","language":"英语","director":"欧文·赫利|owen hurley","category":"动漫"},"itemName":"芭比之长发公主","stbid":"00460300001968000420c0365624c6fb"}
{"recordMap":{"screenwriter":"","publishtime":"2020-09-28","year":"2020","score":"0.0"},"processDate":"2021-11-03","cid":"c350201","itemId":"6208294","utctime":"1635922385481","useMap":{"actor":"","area":"中国","tags":"儿童","language":"","director":"","category":""},"itemName":"贝乐虎儿歌","stbid":"00440300000100402131b460778fab2b"}
{"recordMap":{"screenwriter":"潘俊杰","publishtime":"2020-07-15","year":"2020","score":"0.0"},"processDate":"2021-11-03","cid":"c320201","itemId":"6512678","utctime":"1635907168404","useMap":{"actor":"陆双|祖晴","area":"中国","tags":"剧情|动画|儿童","language":"国语","director":"","category":"电视剧"},"itemName":"猪猪侠之深海小英雄","stbid":"004103000002091020519c62abbd5646"}
{"recordMap":{"screenwriter":"","publishtime":"2021-05-17","year":"2021","score":"0.0"},"processDate":"2021-11-03","cid":"c360201","itemId":"7118559","utctime":"1635897657838","useMap":{"actor":"","area":"其它地区","tags":"动画|母婴|益智","language":"","director":"","category":""},"itemName":"汪汪队立大功第七季","stbid":"9c2f4e8f36e7"}
{"recordMap":{"screenwriter":"","publishtime":"2020-12-15","year":"2020","score":"0.0"},"processDate":"2021-11-03","cid":"c510203","itemId":"7077927","utctime":"1635934708464","useMap":{"actor":"","area":"中国","tags":"动画|励志|冒险","language":"","director":"蔡奕诚","category":""},"itemName":"咖宝车神4","stbid":"64e0ab71101f"}
{"recordMap":{"screenwriter":"瞿绍婷","publishtime":"2020-01-01","year":"2020","score":"0.0"},"processDate":"2021-11-03","cid":"c370201","itemId":"6522114","utctime":"1635930513266","useMap":{"actor":"何炅|撒贝宁|白敬亭|刘昊然|张若昀","area":"中国","tags":"喜剧|真人秀","language":"国语","director":"何舒|陈晓翎","category":"综艺"},"itemName":"明星大侦探 第六季","stbid":"004603000002002020039c62aba2c1ff"}

…………


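Requirements:

Data preparation

1. Precise-search data, in JSON format

What each record means: which title a user watched and when, together with that title's metadata

Data format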

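Top-level field          Second-level field   Type     Description
recordMap (title info)   screenwriter         String   Screenwriter
recordMap (title info)   publishtime          String   Release date
recordMap (title info)   year                 String   Year
recordMap (title info)   score                String   Score
processDate              -                    String   Data processing date
cid                      -                    String   User action ID
itemId                   -                    String   Title ID
itemName                 -                    String   Title name
stbid                    -                    String   User ID
utctime                  -                    String   Timestamp of the user action
useMap (title info)      actor                String   Actors
useMap (title info)      area                 String   Region
useMap (title info)      tags                 String   First-level tags
useMap (title info)      language             String   Language
useMap (title info)      director             String   Director
useMap (title info)      category             String   Category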
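
The field table above can also be written down as an explicit Spark schema, so that the reader does not have to infer it from the file. This is only a sketch: the names `SchemaSketch`, `strings` and `recordSchema` are made up for illustration, and the path is assumed to be the same `data/精准搜索.txt` used in the solution code further down.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object SchemaSketch {
  // Hypothetical helper: a struct whose fields are all nullable strings
  private def strings(names: String*): StructType =
    StructType(names.map(n => StructField(n, StringType, nullable = true)))

  // Mirrors the field table: every leaf field is a String
  val recordSchema: StructType = StructType(Seq(
    StructField("recordMap", strings("screenwriter", "publishtime", "year", "score")),
    StructField("processDate", StringType),
    StructField("cid", StringType),
    StructField("itemId", StringType),
    StructField("itemName", StringType),
    StructField("stbid", StringType),
    StructField("utctime", StringType),
    StructField("useMap", strings("actor", "area", "tags", "language", "director", "category"))
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("schema").getOrCreate()

    // Read the JSON lines with the explicit schema instead of inferring it
    val data = spark.read.schema(recordSchema).json("data/精准搜索.txt")
    data.printSchema()

    spark.stop()
  }
}
```

Supplying the schema skips the extra pass Spark otherwise makes over the file to infer field types, and keeps every column a String as the table specifies.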

Process the data with Spark

1. Count the number of views per year, taking the year from the year field

· Expected output

Year  Views  Rank
1991  1000   1
2001  222    2
1950  111    3

2. Count the number of views per tag; the tags field of a record can hold several tags

· Expected output

Tag   Views  Rank
喜剧  1000   1
冒险  222    2
冒险  111    3
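
Because one record can carry several tags separated by `|`, each record has to be split into one entry per tag before counting. The idea can be checked on plain Scala collections before moving to Spark. In this sketch the three sample strings are copied from `tags` values in the data above, and the names `TagCountSketch` and `countTags` are hypothetical.

```scala
object TagCountSketch {
  // Split each tags string on '|', drop empty entries, then count per tag
  def countTags(tagsColumn: Seq[String]): Map[String, Int] =
    tagsColumn
      .flatMap(_.split("\\|")) // split takes a regex, so the '|' must be escaped
      .filter(_.nonEmpty)      // an empty tags string yields one empty entry
      .groupBy(identity)
      .map { case (tag, hits) => tag -> hits.size }

  def main(args: Array[String]): Unit = {
    val sample = Seq("喜剧|动画|家庭|奇幻", "喜剧|经典|动画", "")

    // Print tags with the highest count first
    countTags(sample).toSeq.sortBy(-_._2).foreach(println)
  }
}
```

The Spark version of the same steps is `split` plus `explode`, followed by a filter and a `groupBy`.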

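If you later run into data that is long, complex and hard to read:

Pull a single record out of the data, create a new file in IDEA (for this exercise, a file ending in .json), and format the record there.

Paste the record into the new file, then choose Code --> Reformat Code from the menu bar at the top, or press Ctrl + Alt + L, to format it.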
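
When no IDE is at hand, a record can be formatted in code instead. A minimal sketch, assuming Jackson is on the classpath (Spark pulls it in as a dependency); the names `PrettyPrintSketch` and `pretty` are hypothetical, and the input is a trimmed-down record built from a few fields of the first data line.

```scala
import com.fasterxml.jackson.databind.ObjectMapper

object PrettyPrintSketch {
  private val mapper = new ObjectMapper()

  // Parse a one-line JSON record and re-serialize it with indentation
  def pretty(line: String): String =
    mapper.writerWithDefaultPrettyPrinter().writeValueAsString(mapper.readTree(line))

  def main(args: Array[String]): Unit = {
    // Trimmed-down version of the first record, for illustration only
    val line = """{"recordMap":{"year":"2021","score":"0.0"},"itemName":"斗罗大陆1 第十部分"}"""
    println(pretty(line))
  }
}
```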

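1. Count the number of views per year, taking the year from the year field

2. Count the number of views per tag; the tags field of a record can hold several tags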

package com.shujia.sql

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.{DataFrame, SparkSession}

object Demo11Year {
  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession
      .builder()
      .master("local")
      .appName("year")
      .config("spark.sql.shuffle.partitions", 1)
      .getOrCreate()

    import spark.implicits._
    import org.apache.spark.sql.functions._

    val data: DataFrame = spark
      .read
      .format("json")
      .load("data/精准搜索.txt")
    
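    // Optimization: cache the DataFrame, since both queries below reuse it
    data.cache()

    // Look at the schema and a sample of the rows first
    data.printSchema()
    data.show()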

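    /**
      * 1. Count the number of views per year, taking the year from recordMap.year
      */
    data
      // Pull out the year
      .select($"recordMap.year" as "year")
      // Drop rows whose year is null or empty
      .where($"year".isNotNull && $"year" =!= "")
      // Count the views per year
      .groupBy($"year")
      .agg(count($"year") as "num")
      // Add the rank, highest count first
      .withColumn("r", row_number() over Window.orderBy($"num".desc))
      .show(1000)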

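    /**
      * 2. Count the number of views per tag; a record can hold several tags
      */

    data
      // Pull out the tags and explode them into one row per tag
      .select(explode(split($"useMap.tags", "\\|")) as "tag")
      // Drop rows whose tag is null or empty
      .where($"tag".isNotNull && $"tag" =!= "")
      // Count the views per tag
      .groupBy($"tag")
      .agg(count($"tag") as "num")
      // Add the rank, highest count first
      .withColumn("r", row_number() over Window.orderBy($"num".desc))
      .show()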

  }
}
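
One point about the ranking column: `row_number()` gives tied counts different ranks, in no guaranteed order. If tied years or tags should share a rank, `rank()` or `dense_rank()` could be used instead. The sketch below compares the three on a tiny in-memory dataset. The counts are made up purely to produce a tie and are not taken from the real file, and the names `RankSketch` and `ranked` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, rank, row_number}

object RankSketch {
  // Add the three ranking variants to a DataFrame that has a "num" column
  def ranked(counts: DataFrame): DataFrame = {
    val byNum = Window.orderBy(col("num").desc)
    counts
      .withColumn("row_number", row_number() over byNum) // ties get distinct numbers
      .withColumn("rank", rank() over byNum)             // ties share a rank, gaps follow
      .withColumn("dense_rank", dense_rank() over byNum) // ties share a rank, no gaps
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("rank").getOrCreate()
    import spark.implicits._

    // Made-up counts, only to show how a tie is ranked
    val counts = Seq(("2021", 5L), ("2020", 5L), ("2019", 3L)).toDF("year", "num")
    ranked(counts).show()

    spark.stop()
  }
}
```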
posted @ 2022-03-17 19:00  赤兔胭脂小吕布