Precise-search requirements; formatting data with IDEA
Precise-search requirements
Data:
{"recordMap":{"screenwriter":"","publishtime":"2021-08-21","year":"2021","score":"0.0"},"processDate":"2021-11-03","cid":"c370201","itemId":"7232742","utctime":"1635935657824","useMap":{"actor":"","area":"中国","tags":"动画","language":"国语","director":"","category":"电视剧"},"itemName":"斗罗大陆1 第十部分","stbid":"004601000001004003d400692d45007a"}
{"recordMap":{"screenwriter":"金城哲夫|tetsuo kinjô|佐佐木守|关泽新一|tatsuo miyata","publishtime":"1966-07-17","year":"1966","score":"8.6"},"processDate":"2021-11-03","cid":"c370201","itemId":"6201446","utctime":"1635934986456","useMap":{"actor":"小林昭二|黒部进|石井伊吉|黑部进|sandayuu dokumamushi","area":"日本","tags":"剧情|科幻|经典|动画|动作","language":"日语","director":"圆谷英二|实相寺昭雄|圆谷一|饭岛敏宏|满田穧","category":"电视剧"},"itemName":"奥特曼","stbid":"004603000003207021057c5259c7970e"}
{"recordMap":{"screenwriter":"曹薇","publishtime":"2020-10-22","year":"2020","score":"0.0"},"processDate":"2021-11-03","cid":"c440401","itemId":"6139446","utctime":"1635864238205","useMap":{"actor":"王嘉尔|李荣浩|林俊杰|汪苏泷|孙红雷","area":"中国","tags":"青春|真人秀","language":"国语","director":"陈刚|曹薇|吴寒","category":"综艺"},"itemName":"青春有你 第三季","stbid":"299001201500160705"}
{"recordMap":{"screenwriter":"万秦|刘富源|刘服华|叶天龙|姜璐","publishtime":"","year":"","score":"6.0"},"processDate":"2021-11-03","cid":"c340101","itemId":"5838358","utctime":"1635945230702","useMap":{"actor":"张秉君|张伟|谭笑|陈光|孙尧东","area":"中国|其它地区","tags":"喜剧|剧情|科幻|情感|亲子|动画|家庭|儿童|冒险","language":"国语","director":"丁亮|刘富源|林汇达|邵和麒|paulette victor-lifton","category":"电影"},"itemName":"熊出没之夺宝熊兵","stbid":"0010059900e06800122088238cbe2088"}
{"recordMap":{"screenwriter":"万秦|刘富源|刘服华|叶天龙|姜璐","publishtime":"","year":"","score":"6.0"},"processDate":"2021-11-03","cid":"c350201","itemId":"5838358","utctime":"1635932250572","useMap":{"actor":"张秉君|张伟|谭笑|陈光|孙尧东","area":"中国|其它地区","tags":"喜剧|剧情|科幻|情感|亲子|动画|家庭|儿童|冒险","language":"国语","director":"丁亮|刘富源|林汇达|邵和麒|paulette victor-lifton","category":"电影"},"itemName":"熊出没之夺宝熊兵","stbid":"004401ff0001181001f908a5c87d7b38"}
{"recordMap":{"screenwriter":"劳伦斯先生|史蒂芬·海伦伯格|蒂姆·希尔|kaz|德里克·德莱蒙","publishtime":"2020-10-22","year":"2020","score":"7.7"},"processDate":"2021-11-03","cid":"c430201","itemId":"6439310","utctime":"1635910655061","useMap":{"actor":"罗杰·布帕斯|汤姆·肯尼|比尔·法格巴克|克兰西·布朗|罗里·艾伦","area":"美国","tags":"喜剧|动画|家庭|奇幻","language":"英语","director":"戴夫·坎宁安|sherm cohen|dave cunningham","category":"电视剧"},"itemName":"海绵宝宝 第十三季","stbid":"a568f034560747838fe1f3dd88901fdf"}
{"recordMap":{"screenwriter":"史蒂夫·迪特寇|斯坦·李","publishtime":"1997-09-12","year":"1997","score":"8.5"},"processDate":"2021-11-03","cid":"c440401","itemId":"6358430","utctime":"1635936346029","useMap":{"actor":"克里斯托弗·丹尼尔·巴恩斯|加里·英霍夫|marla rubinoff|rodney saulsberry|帕特里克·莱比奥托","area":"美国","tags":"英雄|剧情|经典|动画","language":"英语","director":"bob richardson","category":"电视剧"},"itemName":"蜘蛛侠 第五季","stbid":"109001212401402848"}
{"recordMap":{"screenwriter":"","publishtime":"2019-01-01","year":"2019","score":"7.3"},"processDate":"2021-11-03","cid":"c330201","itemId":"7093925","utctime":"1635943063080","useMap":{"actor":"罗温·艾金森","area":"中国|英国","tags":"喜剧","language":"","director":"马特·胡德","category":"动漫"},"itemName":"憨豆先生 第三季","stbid":"004203000002089018389c62ab755c4b"}
{"recordMap":{"screenwriter":"张佳|咸瑶","publishtime":"2013-09-29","year":"2013","score":"4.1"},"processDate":"2021-11-03","cid":"c340201","itemId":"6243284","utctime":"1635914079939","useMap":{"actor":"海陆|蒋毅|彭冠英|赵樱子|李泰","area":"中国","tags":"剧情|爱情","language":"国语","director":"高先明","category":"电影,电视剧"},"itemName":"因为爱情有多美","stbid":"004303000002089019093050fd2819cc"}
{"recordMap":{"screenwriter":"","publishtime":"2021-05-17","year":"2021","score":"0.0"},"processDate":"2021-11-03","cid":"c530201","itemId":"7118559","utctime":"1635934418175","useMap":{"actor":"","area":"其它地区","tags":"动画|母婴|益智","language":"","director":"","category":""},"itemName":"汪汪队立大功第七季","stbid":"005603000001005020c0b46077752152"}
{"recordMap":{"screenwriter":"何冀平|何麒|夏祖辉|方桂兰|贡敏","publishtime":"1992-01-01","year":"1992","score":"9.4"},"processDate":"2021-11-03","cid":"c370201","itemId":"5891000","utctime":"1635896489433","useMap":{"actor":"赵雅芝|叶童|陈美琪|石乃文|夏光莉","area":"台湾","tags":"经典|爱情|古装|奇幻","language":"国语","director":"夏祖辉|何麒","category":"电视剧"},"itemName":"新白娘子传奇","stbid":"0046010000010040d6e42c43be384139"}
{"recordMap":{"screenwriter":"","publishtime":"2021-07-09","year":"2021","score":"0.0"},"processDate":"2021-11-03","cid":"c370201","itemId":"7222509","utctime":"1635932997417","useMap":{"actor":"","area":"美国|英国|其它地区","tags":"儿童","language":"","director":"","category":""},"itemName":"挖掘机","stbid":"0046010000010040d6e42c43be386aa6"}
{"recordMap":{"screenwriter":"","publishtime":"2021-06-07","year":"2021","score":"0.0"},"processDate":"2021-11-03","cid":"c530201","itemId":"7183631","utctime":"1635939019916","useMap":{"actor":"","area":"其它地区","tags":"","language":"","director":"","category":"电影"},"itemName":"奶龙:我被小狗狗咬了","stbid":"005603000002703021281055e43e2b37"}
{"recordMap":{"screenwriter":"","publishtime":"2019-03-15","year":"2019","score":"7.2"},"processDate":"2021-11-03","cid":"c530201","itemId":"5871224","utctime":"1635936827239","useMap":{"actor":"曾梦雪|戴景耀|黄甫杰|屠画|向昊","area":"中国","tags":"喜剧|偶像|剧情|爱情|古装","language":"","director":"管健雄","category":"电视剧"},"itemName":"大周小冰人第二季","stbid":"00560300000100502030b46077145a58"}
{"recordMap":{"screenwriter":"","publishtime":"2020-06-22","year":"2020","score":"0.0"},"processDate":"2021-11-03","cid":"c440401","itemId":"5922650","utctime":"1635943994622","useMap":{"actor":"","area":"欧美|英国","tags":"儿童","language":"","director":"","category":"动漫"},"itemName":"海绵宝宝 第十二季","stbid":"109001184200374068"}
{"recordMap":{"screenwriter":"郑春华|菜如山","publishtime":"","year":"","score":"7.4"},"processDate":"2021-11-03","cid":"c610201","itemId":"5936025","utctime":"1635913670012","useMap":{"actor":"姚培华|符冲|范蕾颖|苏光琪|范楚绒","area":"中国","tags":"喜剧|经典|动画","language":"国语","director":"崔世昱","category":"电视剧"},"itemName":"大头儿子和小头爸爸","stbid":"00570300000100702009acbb61753c19"}
{"recordMap":{"screenwriter":"","publishtime":"2021-05-02","year":"2021","score":"0.0"},"processDate":"2021-11-03","cid":"c530201","itemId":"7112761","utctime":"1635934219883","useMap":{"actor":"","area":"其它地区|欧美|英国","tags":"怀旧|儿童|励志|益智","language":"","director":"","category":"电影"},"itemName":"白雪公主","stbid":"00560300001968001120c0365644b295"}
{"recordMap":{"screenwriter":"elana lesser|克利夫 鲁比|cliff ruby|ruth handler|伊拉娜·莱瑟","publishtime":"2016-12-29","year":"2016","score":"7.6"},"processDate":"2021-11-03","cid":"c370201","itemId":"5855244","utctime":"1635932642896","useMap":{"actor":"kelly sheridan|安杰丽卡·休斯顿|cree summer|ian james corlett|马克·海德斯","area":"美国|欧美","tags":"童话|少女|动画|家庭|儿童|冒险","language":"英语","director":"欧文·赫利|owen hurley","category":"动漫"},"itemName":"芭比之长发公主","stbid":"00460300001968000420c0365624c6fb"}
{"recordMap":{"screenwriter":"","publishtime":"2020-09-28","year":"2020","score":"0.0"},"processDate":"2021-11-03","cid":"c350201","itemId":"6208294","utctime":"1635922385481","useMap":{"actor":"","area":"中国","tags":"儿童","language":"","director":"","category":""},"itemName":"贝乐虎儿歌","stbid":"00440300000100402131b460778fab2b"}
{"recordMap":{"screenwriter":"潘俊杰","publishtime":"2020-07-15","year":"2020","score":"0.0"},"processDate":"2021-11-03","cid":"c320201","itemId":"6512678","utctime":"1635907168404","useMap":{"actor":"陆双|祖晴","area":"中国","tags":"剧情|动画|儿童","language":"国语","director":"","category":"电视剧"},"itemName":"猪猪侠之深海小英雄","stbid":"004103000002091020519c62abbd5646"}
{"recordMap":{"screenwriter":"","publishtime":"2021-05-17","year":"2021","score":"0.0"},"processDate":"2021-11-03","cid":"c360201","itemId":"7118559","utctime":"1635897657838","useMap":{"actor":"","area":"其它地区","tags":"动画|母婴|益智","language":"","director":"","category":""},"itemName":"汪汪队立大功第七季","stbid":"9c2f4e8f36e7"}
{"recordMap":{"screenwriter":"","publishtime":"2020-12-15","year":"2020","score":"0.0"},"processDate":"2021-11-03","cid":"c510203","itemId":"7077927","utctime":"1635934708464","useMap":{"actor":"","area":"中国","tags":"动画|励志|冒险","language":"","director":"蔡奕诚","category":""},"itemName":"咖宝车神4","stbid":"64e0ab71101f"}
{"recordMap":{"screenwriter":"瞿绍婷","publishtime":"2020-01-01","year":"2020","score":"0.0"},"processDate":"2021-11-03","cid":"c370201","itemId":"6522114","utctime":"1635930513266","useMap":{"actor":"何炅|撒贝宁|白敬亭|刘昊然|张若昀","area":"中国","tags":"喜剧|真人秀","language":"国语","director":"何舒|陈晓翎","category":"综艺"},"itemName":"明星大侦探 第六季","stbid":"004603000002002020039c62aba2c1ff"}…………
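Requirements:
Data preparation
1. Precise-search data, in JSON format
Each record describes which film a user watched and when, together with information about that film.
Data format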
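| Top-level field | Second-level field | Type | Description |
| --- | --- | --- | --- |
| recordMap (film info) | screenwriter | String | Screenwriter |
| recordMap (film info) | publishtime | String | Release date |
| recordMap (film info) | year | String | Year |
| recordMap (film info) | score | String | Rating |
| processDate | | String | Data processing date |
| cid | | String | User action id |
| itemId | | String | Film id |
| itemName | | String | Film title |
| stbid | | String | User id |
| utctime | | String | Timestamp of the user action |
| useMap (film info) | actor | String | Actors |
| useMap (film info) | area | String | Region |
| useMap (film info) | tags | String | First-level tags |
| useMap (film info) | language | String | Language |
| useMap (film info) | director | String | Director |
| useMap (film info) | category | String | Category |

Process the data with Spark.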
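Before processing, the field table above can be written down as Scala case classes, which makes the nesting easier to see. This is only a sketch: the class names `RecordMap`, `UseMap` and `WatchRecord`, the object `SchemaSketch` and the helper `multi` are illustrative names that do not appear in the original, and every field is kept as a String exactly as the table states.

```scala
// Hypothetical case classes mirroring the field table; names are illustrative.
final case class RecordMap(screenwriter: String, publishtime: String, year: String, score: String)

final case class UseMap(actor: String, area: String, tags: String,
                        language: String, director: String, category: String)

final case class WatchRecord(recordMap: RecordMap, processDate: String, cid: String,
                             itemId: String, utctime: String, useMap: UseMap,
                             itemName: String, stbid: String)

object SchemaSketch {
  // The first record from the data listing, typed in by hand.
  val sample: WatchRecord = WatchRecord(
    RecordMap("", "2021-08-21", "2021", "0.0"),
    "2021-11-03", "c370201", "7232742", "1635935657824",
    UseMap("", "中国", "动画", "国语", "", "电视剧"),
    "斗罗大陆1 第十部分", "004601000001004003d400692d45007a")

  // Multi-valued fields are '|'-separated strings; split them and drop empty entries.
  def multi(value: String): List[String] = value.split('|').toList.filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    println(sample.recordMap.year)
    println(multi("剧情|科幻|经典|动画|动作"))
  }
}
```

Note that empty strings, not nulls, mark missing values in the sample data, which is why later steps filter on both.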
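1. Count the number of views per year, taking the year from the year field
· Output

| Year | Views | Rank |
| --- | --- | --- |
| 1991 | 1000 | 1 |
| 2001 | 222 | 2 |
| 1950 | 111 | 3 |

2. Count the number of views per tag; the tags field can hold several tags
· Output

| tag | Views | Rank |
| --- | --- | --- |
| 喜剧 | 1000 | 1 |
| 冒险 | 222 | 2 |
| 冒险 | 111 | 3 |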
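The shape of both outputs (key, view count, rank) can be illustrated on a few in-memory values before involving Spark. The sketch below uses plain Scala collections; the object `RankSketch` and its methods are hypothetical names, and breaking ties by key is an assumption made only to keep the result deterministic.

```scala
object RankSketch {
  // Count each non-empty key, sort by count descending (ties by key), then number the rows from 1.
  def countAndRank(keys: Seq[String]): List[(String, Int, Int)] =
    keys
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (k, vs) => (k, vs.size) }
      .toList
      .sortBy { case (k, n) => (-n, k) }
      .zipWithIndex
      .map { case ((k, n), i) => (k, n, i + 1) }

  // Turn '|'-separated tag fields into one flat sequence of tags.
  def explodeTags(tagFields: Seq[String]): Seq[String] =
    tagFields.flatMap(_.split('|').toSeq)

  def main(args: Array[String]): Unit = {
    // year values of the first six sample records (two of them are empty)
    val years = Seq("2021", "1966", "2020", "", "", "2020")
    countAndRank(years).foreach(println)

    // tags values of the first three sample records
    val tags = Seq("动画", "剧情|科幻|经典|动画|动作", "青春|真人秀")
    countAndRank(explodeTags(tags)).foreach(println)
  }
}
```

Each output row is a (key, views, rank) triple, matching the three columns of the tables above.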
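If you later run into data that is long, complex, and hard to read,
pull a single record out of it, create a new file in IDEA (for this exercise, a file ending in .json), and format the record there.
Paste the extracted record into the new file, then choose Code --> Reformat Code from the menu bar at the top, or press Ctrl + Alt + L, to format the data.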
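When IDEA is not at hand, roughly the same effect can be had with a few lines of Scala. The sketch below is not what IDEA does internally; it is a minimal re-indenter that only inserts whitespace and does not validate the input. `JsonPretty` and `format` are hypothetical names, and an empty object or array will come out with a blank line inside it.

```scala
object JsonPretty {
  // Re-indent a compact JSON string. Whitespace outside strings is dropped and re-created.
  def format(json: String, indent: String = "  "): String = {
    val out = new StringBuilder
    var depth = 0
    var inString = false // true while inside a "..." literal
    var escaped = false  // true right after a backslash inside a string
    def newline(): Unit = { out.append('\n'); out.append(indent * depth) }

    for (c <- json) {
      if (inString) {
        out.append(c)
        if (escaped) escaped = false
        else if (c == '\\') escaped = true
        else if (c == '"') inString = false
      } else c match {
        case '"'       => { inString = true; out.append(c) }
        case '{' | '[' => { out.append(c); depth += 1; newline() }
        case '}' | ']' => { depth -= 1; newline(); out.append(c) }
        case ','       => { out.append(c); newline() }
        case ':'       => out.append(": ")
        case ws if ws.isWhitespace => ()
        case other     => out.append(other)
      }
    }
    out.toString
  }

  def main(args: Array[String]): Unit = {
    // The first record from the data listing.
    val record = """{"recordMap":{"screenwriter":"","publishtime":"2021-08-21","year":"2021","score":"0.0"},"processDate":"2021-11-03","cid":"c370201","itemId":"7232742","utctime":"1635935657824","useMap":{"actor":"","area":"中国","tags":"动画","language":"国语","director":"","category":"电视剧"},"itemName":"斗罗大陆1 第十部分","stbid":"004601000001004003d400692d45007a"}"""
    println(format(record))
  }
}
```

Printed this way, the two nested maps (recordMap and useMap) stand out from the six top-level fields, which is the structure the Spark job relies on.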
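1. Count the number of views per year, taking the year from the year field
2. Count the number of views per tag; the tags field can hold several tags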
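package com.shujia.sql

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.{DataFrame, SparkSession}

object Demo11Year {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .master("local")
      .appName("year")
      .config("spark.sql.shuffle.partitions", 1)
      .getOrCreate()

    import spark.implicits._
    import org.apache.spark.sql.functions._

    // Each line of the file is one JSON record
    val data: DataFrame = spark
      .read
      .format("json")
      .load("data/精准搜索.txt")

    // Optimization: the DataFrame is used several times, so cache it
    data.cache()

    // Take a first look at the rough shape of the data
    data.printSchema()
    data.show()

    /**
      * 1. Count the number of views per year, taking the year from the year field
      */
    data
      // Read the year out of the nested recordMap
      .select($"recordMap.year" as "year")
      // Drop records whose year is null or empty
      .where($"year".isNotNull && $"year" =!= "")
      // Count the views for each year
      .groupBy($"year")
      .agg(count($"year") as "num")
      // Add the rank; without partitionBy the window covers all rows,
      // which is fine for a small aggregated result like this one
      .withColumn("r", row_number() over Window.orderBy($"num".desc))
      .show(1000)

    /**
      * 2. Count the number of views per tag; the tags field can hold several tags
      */
    data
      // Split the '|'-separated tags and explode them into one row per tag
      .select(explode(split($"useMap.tags", "\\|")) as "tag")
      // Drop null or empty tags
      .where($"tag".isNotNull && $"tag" =!= "")
      .groupBy($"tag")
      .agg(count($"tag") as "num")
      .withColumn("r", row_number() over Window.orderBy($"num".desc))
      .show()
  }
}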
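One detail of the listing above is worth spelling out: the rank column r comes from row_number(), which gives tied counts different consecutive numbers in an unspecified order. If tied years or tags should share a rank, Spark's rank() and dense_rank() window functions are the usual alternatives. The plain Scala sketch below does not call Spark; it only mimics the three numbering schemes on a list of counts, and `TieRanks` and its method names are hypothetical.

```scala
object TieRanks {
  // row_number: consecutive numbers in the given order, ties get different values.
  def rowNumber(counts: Seq[Int]): Seq[Int] = counts.indices.map(_ + 1)

  // rank: 1 + number of strictly larger counts, so ties share a rank and leave a gap after it.
  def rank(counts: Seq[Int]): Seq[Int] = counts.map(c => counts.count(_ > c) + 1)

  // dense_rank: 1 + number of distinct strictly larger counts, so ties share a rank with no gap.
  def denseRank(counts: Seq[Int]): Seq[Int] =
    counts.map(c => counts.filter(_ > c).distinct.size + 1)

  def main(args: Array[String]): Unit = {
    val counts = Seq(5, 3, 3, 1) // view counts already sorted in descending order
    println(rowNumber(counts))
    println(rank(counts))
    println(denseRank(counts))
  }
}
```

Which scheme fits depends on how the ranking is read downstream; the sample output tables in the requirements show plain 1, 2, 3 numbering, which row_number() reproduces.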