|NO.Z.00018|——————————|BigDataEnd|——|Hadoop&PB级数仓.V02|——|PB数仓.v02|会员活跃度分析|json数据处理&使用内建函数处理|
一、json数据处理
### --- json数据处理
~~~ 数据文件中每行必须是一个完整的 json 串,一个 json串 不能跨越多行。
~~~ # Hive 处理json数据总体来说有三个办法:
~~~ 使用内建的函数get_json_object、json_tuple
~~~ 使用自定义的UDF
~~~ 第三方的SerDe
### --- 使用内建函数处理
~~~ # get_json_object(string json_string, string path)
~~~ 返回值:String
~~~ 说明:解析json字符串json_string,返回path指定的内容;
~~~ 如果输入的json字符串无效,那么返回NUll;函数每次只能返回一个数据项;
~~~ # json_tuple(jsonStr, k1, k2, ...)
~~~ 返回值:所有的输入参数、输出参数都是String;
~~~ 说明:参数为一组键k1,k2,。。。。。和json字符串,返回值的元组
~~~ 该方法比get_json_object高效,因此可以在一次调用中输入多个键;
~~~ explode,使用explod将Hive一行中复杂的 array 或 map 结构拆分成多行。
二、json数据处理测试
### --- 测试数据:
~~~ # 创建数据表
[root@hadoop02 ~]# vim /data/yanqidw/logs/data/weibo.json
user1;18;male;{"id": 1,"ids": [101,102,103],"total_number": 3}
user2;20;female;{"id": 2,"ids": [201,202,203,204],"total_number": 4}
user3;23;male;{"id": 3,"ids": [301,302,303,304,305],"total_number": 5}
user4;17;male;{"id": 4,"ids": [401,402,403,304],"total_number": 5}
user5;35;female;{"id": 5,"ids": [501,502,503],"total_number": 3}
### --- 建表加载数据:
~~~ # 建表语句
CREATE TABLE IF NOT EXISTS jsont1(
username string,
age int,
sex string,
json string
)
row format delimited fields terminated by ';';
~~~ # 加载数据语句
load data local inpath '/data/yanqidw/logs/data/weibo.json' overwrite into table jsont1;
### --- 创建库及表
~~~ # 创建测试库
hive (ods)> create database test;
hive (ods)> use test;
~~~ # 创建表
hive (test)> CREATE TABLE IF NOT EXISTS jsont1(
> username string,
> age int,
> sex string,
> json string
> )
> row format delimited fields terminated by ';';
~~~ # 加载数据
hive (test)> load data local inpath '/data/yanqidw/logs/data/weibo.json' overwrite into table jsont1;
### --- 验证数据是否加载进来
hive (test)> select * from jsont1;
jsont1.username jsont1.age jsont1.sex jsont1.json
user1 18 male {"id": 1,"ids": [101,102,103],"total_number": 3}
user2 20 female {"id": 2,"ids": [201,202,203,204],"total_number": 4}
user3 23 male {"id": 3,"ids": [301,302,303,304,305],"total_number": 5}
user4 17 male {"id": 4,"ids": [401,402,403,304],"total_number": 5}
user5 35 female {"id": 5,"ids": [501,502,503],"total_number": 3}
三、json的处理
### --- 取单层值
~~~ # 语法:-- get 单层值
select username, age, sex, get_json_object(json, "$.id") id,
get_json_object(json, "$.ids") ids,
get_json_object(json, "$.total_number") num
from jsont1;
hive (test)> select username, age, sex, get_json_object(json, "$.id") id,
> get_json_object(json, "$.ids") ids,
> get_json_object(json, "$.total_number") num
> from jsont1;
OK
username age sex id ids num
user1 18 male 1 [101,102,103] 3
user2 20 female 2 [201,202,203,204] 4
user3 23 male 3 [301,302,303,304,305] 5
user4 17 male 4 [401,402,403,304] 5
user5 35 female 5 [501,502,503] 3
### --- 取数组
~~~ # 语法:-- get 数组
select username, age, sex, get_json_object(json, "$.id") id,
get_json_object(json, "$.ids[0]") ids0,
get_json_object(json, "$.ids[1]") ids1,
get_json_object(json, "$.ids[2]") ids2,
get_json_object(json, "$.ids[3]") ids3,
get_json_object(json, "$.total_number") num
from jsont1;
hive (test)> select username, age, sex, get_json_object(json, "$.id") id,
> get_json_object(json, "$.ids[0]") ids0,
> get_json_object(json, "$.ids[1]") ids1,
> get_json_object(json, "$.ids[2]") ids2,
> get_json_object(json, "$.ids[3]") ids3,
> get_json_object(json, "$.total_number") num
> from jsont1;
OK
username age sex id ids0 ids1 ids2 ids3 num
user1 18 male 1 101 102 103 NULL 3
user2 20 female 2 201 202 203 204 4
user3 23 male 3 301 302 303 304 5
user4 17 male 4 401 402 403 304 5
user5 35 female 5 501 502 503 NULL 3
### --- 处理多个字段
~~~ # 语法:-- 使用 json_tuple 一次处理多个字段
select json_tuple(json, 'id', 'ids', 'total_number')
from jsont1;
~~~ 有语法错误
select username, age, sex, json_tuple(json, 'id', 'ids','total_number')
from jsont1;
~~~ # 实例操作
hive (test)> select json_tuple(json, 'id', 'ids', 'total_number')
> from jsont1;
OK
c0 c1 c2
1 [101,102,103] 3
2 [201,202,203,204] 4
3 [301,302,303,304,305] 5
4 [401,402,403,304] 5
5 [501,502,503] 3
~~~ # 语法
~~~ 使用 explode + lateral view
~~~ 在上一步的基础上,再将数据展开
~~~ 第一步,将 [101,102,103] 中的 [ ] 替换掉
~~~ select "[101,102,103]"
~~~ select "101,102,103"
select regexp_replace("[101,102,103]", "\\[|\\]", "");
~~~ # 实例操作
hive (test)> select regexp_replace("[101,102,103]", "\\[|\\]", "");
OK
101,102,103
~~~ # 语法
~~~ 第二步,将上一步的字符串变为数组
select split(regexp_replace("[101,102,103]", "\\[|\\]", ""), ",");
~~~ # 实例操作
hive (test)> select split(regexp_replace("[101,102,103]", "\\[|\\]", ""), ",");
OK
["101","102","103"]
~~~ # 语法
~~~ 第三步,使用explode + lateral view 将数据展开
select username, age, sex, id, ids, num from jsont1
lateral view json_tuple(json, 'id', 'ids', 'total_number') t1 as id, ids, num;
with tmp as(
select username, age, sex, id, ids, num from jsont1
lateral view json_tuple(json, 'id', 'ids', 'total_number') t1 as id, ids, num )
select username, age, sex, id, ids1, num
from tmp
lateral view explode(split(regexp_replace(ids, "\\[|\\]", ""),
",")) t1 as ids1;
~~~ # 实例操作
hive (test)> select username, age, sex, id, ids, num from jsont1
> lateral view json_tuple(json, 'id', 'ids', 'total_number') t1 as id, ids, num;
OK
username age sex id ids num
user1 18 male 1 [101,102,103] 3
user2 20 female 2 [201,202,203,204] 4
user3 23 male 3 [301,302,303,304,305] 5
user4 17 male 4 [401,402,403,304] 5
user5 35 female 5 [501,502,503] 3
~~~ # 将数据展开
hive (test)> with tmp as(
> select username, age, sex, id, ids, num from jsont1
> lateral view json_tuple(json, 'id', 'ids', 'total_number') t1 as id, ids, num )
> select username, age, sex, id, ids1, num
> from tmp
> lateral view explode(split(regexp_replace(ids, "\\[|\\]", ""),
> ",")) t1 as ids1;
OK
username age sex id ids1 num
user1 18 male 1 101 3
user1 18 male 1 102 3
user1 18 male 1 103 3
user2 20 female 2 201 4
user2 20 female 2 202 4
user2 20 female 2 203 4
user2 20 female 2 204 4
user3 23 male 3 301 5
user3 23 male 3 302 5
user3 23 male 3 303 5
user3 23 male 3 304 5
user3 23 male 3 305 5
user4 17 male 4 401 5
user4 17 male 4 402 5
user4 17 male 4 403 5
user4 17 male 4 304 5
user5 35 female 5 501 3
user5 35 female 5 502 3
user5 35 female 5 503 3
### --- 小结:
~~~ json_tuple 优点是一次可以解析多个json字段,对嵌套结果的解析操作复杂;
Walter Savage Landor:strove with none,for none was worth my strife.Nature I loved and, next to Nature, Art:I warm'd both hands before the fire of life.It sinks, and I am ready to depart
——W.S.Landor
分类:
bdv014-PB离线数仓
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 无需6万激活码!GitHub神秘组织3小时极速复刻Manus,手把手教你使用OpenManus搭建本
· Manus爆火,是硬核还是营销?
· 终于写完轮子一部分:tcp代理 了,记录一下
· 别再用vector<bool>了!Google高级工程师:这可能是STL最大的设计失误
· 单元测试从入门到精通