Spark SQL: Flattening a DataFrame with nested JSON
Reference: https://www.soinside.com/question/JjhZCytMUFpTNyk6W7ixZa
(I could not find the true original; judging from the clumsy translation, it was probably scraped from Stack Overflow.)
Flatten a DataFrame whose data looks like this:
val json_string = """{
  "Total Value": 3,
  "Topic": "Example",
  "values": [
    {
      "value1": "#example1",
      "points": [["123", "156"]],
      "properties": {
        "date": "12-04-19",
        "model": "Model example 1"
      }
    },
    {
      "value2": "#example2",
      "points": [["124", "157"]],
      "properties": {
        "date": "12-05-19",
        "model": "Model example 2"
      }
    }
  ]
}"""
The desired output:
+-----------+--------+---------+-----------------+-----------------+----------------------+-----------------------+
|Total Value| Topic  |values 1 |values.points[0] |values.points[1] |values.properties.date|values.properties.model|
+-----------+--------+---------+-----------------+-----------------+----------------------+-----------------------+
|     3     |Example |example1 |       123       |       156       |       12-04-19       |    Model example 1    |
|     3     |Example |example2 |       124       |       157       |       12-05-19       |    Model example 2    |
+-----------+--------+---------+-----------------+-----------------+----------------------+-----------------------+
Solution:
Use Spark's built-in functions, in particular explode (found in the org.apache.spark.sql.functions package).
First, "explode" the outer array open:
scala> import org.apache.spark.sql.functions._
scala> val df = spark.read.json(Seq(json_string).toDS)
scala> var dfd = df.select($"topic", $"total value", explode($"values").as("values"))
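Outside Spark, the effect of explode can be sketched in plain Python (a conceptual illustration only, not Spark API; the record below mirrors the JSON above):

```python
import json

# A record mirroring the JSON above
record = json.loads("""{
  "Total Value": 3, "Topic": "Example",
  "values": [
    {"value1": "#example1", "points": [["123", "156"]],
     "properties": {"date": "12-04-19", "model": "Model example 1"}},
    {"value2": "#example2", "points": [["124", "157"]],
     "properties": {"date": "12-05-19", "model": "Model example 2"}}
  ]
}""")

def explode(row, array_col):
    """Emit one copy of `row` per element of the array column,
    replacing the array with that single element -- which is what
    Spark's explode() does to a DataFrame column."""
    base = {k: v for k, v in row.items() if k != array_col}
    return [{**base, array_col: elem} for elem in row[array_col]]

rows = explode(record, "values")
print(len(rows))                    # 2: one row per array element
print(rows[0]["Topic"])             # Example
print(rows[1]["values"]["value2"])  # #example2
```

Each output row carries the scalar columns unchanged, paired with exactly one element of the exploded array, so row count grows with the array length.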
Then select the desired columns from the second level:
scala> dfd.select($"topic", $"total value",
     |   $"values.points".getItem(0)(0).as("point_0"),
     |   $"values.points".getItem(0)(1).as("point_1"),
     |   $"values.properties.date".as("_date"),
     |   $"values.properties.model".as("_model")).show
+-------+-----------+-------+-------+--------+---------------+
|  topic|total value|point_0|point_1|   _date|         _model|
+-------+-----------+-------+-------+--------+---------------+
|Example|          3|    123|    156|12-04-19|Model example 1|
|Example|          3|    124|    157|12-05-19|Model example 2|
+-------+-----------+-------+-------+--------+---------------+
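The column expressions above are just nested field and array access. In plain Python terms (conceptual only; the hand-written row below stands in for one exploded DataFrame row):

```python
# One exploded row from the example above (hand-written for illustration)
row = {
    "topic": "Example",
    "total value": 3,
    "values": {"value1": "#example1",
               "points": [["123", "156"]],
               "properties": {"date": "12-04-19", "model": "Model example 1"}},
}

# $"values.points".getItem(0)(0) corresponds to nested indexing,
# and $"values.properties.date" to nested key access:
selected = {
    "point_0": row["values"]["points"][0][0],        # getItem(0)(0)
    "point_1": row["values"]["points"][0][1],        # getItem(0)(1)
    "_date":   row["values"]["properties"]["date"],
    "_model":  row["values"]["properties"]["model"],
}
print(selected["point_0"], selected["_date"])  # 123 12-04-19
```

getItem(0) picks the first element of the points array (itself a two-element array), and the second (0)/(1) indexes within it.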
A more detailed example:
scala> val str = "{\"avg_orders_count\":[{\"count\":1.0,\"days\":3},{\"count\":0.6,\"days\":5},{\"count\":0.3,\"days\":10},{\"count\":0.2,\"days\":15},{\"count\":0.1,\"days\":30},{\"count\":0.066,\"days\":45},{\"count\":0.066,\"days\":60},{\"count\":0.053,\"days\":75},{\"count\":0.044,\"days\":90}],\"m_hotel_id\":\"92500636\"}"
str: String = {"avg_orders_count":[{"count":1.0,"days":3},{"count":0.6,"days":5},{"count":0.3,"days":10},{"count":0.2,"days":15},{"count":0.1,"days":30},{"count":0.066,"days":45},{"count":0.066,"days":60},{"count":0.053,"days":75},{"count":0.044,"days":90}],"m_hotel_id":"92500636"}

scala> val rdd = sc.makeRDD(str :: Nil)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at makeRDD at <console>:29

scala> val df = spark.read.json(rdd)

scala> df.show
+--------------------+----------+
|    avg_orders_count|m_hotel_id|
+--------------------+----------+
|[[1.0, 3], [0.6, ...|  92500636|
+--------------------+----------+

scala> df.select($"m_hotel_id", explode($"avg_orders_count")).show
+----------+-----------+
|m_hotel_id|        col|
+----------+-----------+
|  92500636|   [1.0, 3]|
|  92500636|   [0.6, 5]|
|  92500636|  [0.3, 10]|
|  92500636|  [0.2, 15]|
|  92500636|  [0.1, 30]|
|  92500636|[0.066, 45]|
|  92500636|[0.066, 60]|
|  92500636|[0.053, 75]|
|  92500636|[0.044, 90]|
+----------+-----------+

scala> val dfd = df.select($"m_hotel_id", explode($"avg_orders_count").as("ct"))

scala> val dfs = dfd.select($"m_hotel_id", $"ct.count", $"ct.days")
dfs: org.apache.spark.sql.DataFrame = [m_hotel_id: string, count: double ... 1 more field]

scala> dfs.show
+----------+-----+----+
|m_hotel_id|count|days|
+----------+-----+----+
|  92500636|  1.0|   3|
|  92500636|  0.6|   5|
|  92500636|  0.3|  10|
|  92500636|  0.2|  15|
|  92500636|  0.1|  30|
|  92500636|0.066|  45|
|  92500636|0.066|  60|
|  92500636|0.053|  75|
|  92500636|0.044|  90|
+----------+-----+----+