|NO.Z.00018|——————————|BigDataEnd|——|Hadoop&PB级数仓.V02|——|PB数仓.v02|会员活跃度分析|json数据处理&使用内建函数处理|

一、json数据处理
### --- json数据处理
~~~     数据文件中每行必须是一个完整的 json 串,一个 json串 不能跨越多行。

~~~     # Hive 处理json数据总体来说有三个办法:
~~~     使用内建的函数get_json_object、json_tuple
~~~     使用自定义的UDF
~~~     第三方的SerDe
### --- 使用内建函数处理
~~~     # get_json_object(string json_string, string path)

~~~     返回值:String
~~~     说明:解析json字符串json_string,返回path指定的内容;
~~~     如果输入的json字符串无效,那么返回NUll;函数每次只能返回一个数据项;
~~~     # json_tuple(jsonStr, k1, k2, ...)

~~~     返回值:所有的输入参数、输出参数都是String;
~~~     说明:参数为一组键k1k2,。。。。。和json字符串,返回值的元组
~~~     该方法比get_json_object高效,因此可以在一次调用中输入多个键;
~~~     explode,使用explod将Hive一行中复杂的 array 或 map 结构拆分成多行。
二、json数据处理测试
### --- 测试数据:
~~~     # 创建数据表

[root@hadoop02 ~]# vim /data/yanqidw/logs/data/weibo.json
user1;18;male;{"id": 1,"ids": [101,102,103],"total_number": 3}
user2;20;female;{"id": 2,"ids": [201,202,203,204],"total_number": 4}
user3;23;male;{"id": 3,"ids": [301,302,303,304,305],"total_number": 5}
user4;17;male;{"id": 4,"ids": [401,402,403,304],"total_number": 5}
user5;35;female;{"id": 5,"ids": [501,502,503],"total_number": 3}
### --- 建表加载数据:

~~~     # 建表语句
CREATE TABLE IF NOT EXISTS jsont1(
username string,
age int,
sex string,
json string
)
row format delimited fields terminated by ';';
~~~     # 加载数据语句

load data local inpath '/data/yanqidw/logs/data/weibo.json' overwrite into table jsont1;
### --- 创建库及表

~~~     # 创建测试库
hive (ods)> create database test;
hive (ods)> use test;
~~~     # 创建表
hive (test)> CREATE TABLE IF NOT EXISTS jsont1(
           > username string,
           > age int,
           > sex string,
           > json string
           > )
           > row format delimited fields terminated by ';';

~~~     # 加载数据
hive (test)> load data local inpath '/data/yanqidw/logs/data/weibo.json' overwrite into table jsont1;
### --- 验证数据是否加载进来

hive (test)> select * from jsont1;
jsont1.username jsont1.age  jsont1.sex  jsont1.json
user1   18  male    {"id": 1,"ids": [101,102,103],"total_number": 3}
user2   20  female  {"id": 2,"ids": [201,202,203,204],"total_number": 4}
user3   23  male    {"id": 3,"ids": [301,302,303,304,305],"total_number": 5}
user4   17  male    {"id": 4,"ids": [401,402,403,304],"total_number": 5}
user5   35  female  {"id": 5,"ids": [501,502,503],"total_number": 3}
三、json的处理
### --- 取单层值

~~~     # 语法:-- get 单层值
select username, age, sex, get_json_object(json, "$.id") id,
get_json_object(json, "$.ids") ids,
get_json_object(json, "$.total_number") num
from jsont1;
hive (test)> select username, age, sex, get_json_object(json, "$.id") id,
           > get_json_object(json, "$.ids") ids,
           > get_json_object(json, "$.total_number") num
           > from jsont1;
OK
username    age sex id  ids num
user1   18  male    1   [101,102,103]   3
user2   20  female  2   [201,202,203,204]   4
user3   23  male    3   [301,302,303,304,305]   5
user4   17  male    4   [401,402,403,304]   5
user5   35  female  5   [501,502,503]   3
### --- 取数组
~~~     # 语法:-- get 数组

select username, age, sex, get_json_object(json, "$.id") id,
get_json_object(json, "$.ids[0]") ids0,
get_json_object(json, "$.ids[1]") ids1,
get_json_object(json, "$.ids[2]") ids2,
get_json_object(json, "$.ids[3]") ids3,
get_json_object(json, "$.total_number") num
from jsont1;
hive (test)> select username, age, sex, get_json_object(json, "$.id") id,
           > get_json_object(json, "$.ids[0]") ids0,
           > get_json_object(json, "$.ids[1]") ids1,
           > get_json_object(json, "$.ids[2]") ids2,
           > get_json_object(json, "$.ids[3]") ids3,
           > get_json_object(json, "$.total_number") num
           > from jsont1;
OK
username    age sex id  ids0    ids1    ids2    ids3    num
user1   18  male    1   101 102 103 NULL    3
user2   20  female  2   201 202 203 204 4
user3   23  male    3   301 302 303 304 5
user4   17  male    4   401 402 403 304 5
user5   35  female  5   501 502 503 NULL    3
### --- 处理多个字段
~~~     # 语法:-- 使用 json_tuple 一次处理多个字段

select json_tuple(json, 'id', 'ids', 'total_number')
from jsont1;
~~~     有语法错误
select username, age, sex, json_tuple(json, 'id', 'ids','total_number') 
from jsont1;
~~~     # 实例操作

hive (test)> select json_tuple(json, 'id', 'ids', 'total_number')
           > from jsont1;
OK
c0  c1  c2
1   [101,102,103]   3
2   [201,202,203,204]   4
3   [301,302,303,304,305]   5
4   [401,402,403,304]   5
5   [501,502,503]   3
~~~     # 语法
~~~     使用 explode + lateral view
~~~     在上一步的基础上,再将数据展开
~~~     第一步,将 [101,102,103] 中的 [ ] 替换掉
~~~     select "[101,102,103]"
~~~     select "101,102,103"

select regexp_replace("[101,102,103]", "\\[|\\]", "");
~~~     # 实例操作

hive (test)> select regexp_replace("[101,102,103]", "\\[|\\]", "");
OK
101,102,103
~~~     # 语法
~~~     第二步,将上一步的字符串变为数组
select split(regexp_replace("[101,102,103]", "\\[|\\]", ""), ",");

~~~     # 实例操作
hive (test)> select split(regexp_replace("[101,102,103]", "\\[|\\]", ""), ",");
OK
["101","102","103"]
~~~     # 语法
~~~     第三步,使用explode + lateral view 将数据展开

select username, age, sex, id, ids, num from jsont1
lateral view json_tuple(json, 'id', 'ids', 'total_number') t1 as id, ids, num;

with tmp as(
select username, age, sex, id, ids, num from jsont1
lateral view json_tuple(json, 'id', 'ids', 'total_number') t1 as id, ids, num )
select username, age, sex, id, ids1, num 
from tmp
lateral view explode(split(regexp_replace(ids, "\\[|\\]", ""),
",")) t1 as ids1;
~~~     # 实例操作

hive (test)> select username, age, sex, id, ids, num from jsont1
           > lateral view json_tuple(json, 'id', 'ids', 'total_number') t1 as id, ids, num;
OK
username    age sex id  ids num
user1   18  male    1   [101,102,103]   3
user2   20  female  2   [201,202,203,204]   4
user3   23  male    3   [301,302,303,304,305]   5
user4   17  male    4   [401,402,403,304]   5
user5   35  female  5   [501,502,503]   3
~~~     # 将数据展开

hive (test)> with tmp as(
           > select username, age, sex, id, ids, num from jsont1
           > lateral view json_tuple(json, 'id', 'ids', 'total_number') t1 as id, ids, num )
           > select username, age, sex, id, ids1, num 
           > from tmp
           > lateral view explode(split(regexp_replace(ids, "\\[|\\]", ""),
           > ",")) t1 as ids1;
OK
username    age sex id  ids1    num
user1   18  male    1   101 3
user1   18  male    1   102 3
user1   18  male    1   103 3
user2   20  female  2   201 4
user2   20  female  2   202 4
user2   20  female  2   203 4
user2   20  female  2   204 4
user3   23  male    3   301 5
user3   23  male    3   302 5
user3   23  male    3   303 5
user3   23  male    3   304 5
user3   23  male    3   305 5
user4   17  male    4   401 5
user4   17  male    4   402 5
user4   17  male    4   403 5
user4   17  male    4   304 5
user5   35  female  5   501 3
user5   35  female  5   502 3
user5   35  female  5   503 3
### --- 小结:

~~~     json_tuple 优点是一次可以解析多个json字段,对嵌套结果的解析操作复杂;

 
 
 
 
 
 
 
 
 

Walter Savage Landor:strove with none,for none was worth my strife.Nature I loved and, next to Nature, Art:I warm'd both hands before the fire of life.It sinks, and I am ready to depart
                                                                                                                                                   ——W.S.Landor

 

 

posted on   yanqi_vip  阅读(26)  评论(0编辑  收藏  举报

相关博文:
阅读排行:
· 无需6万激活码!GitHub神秘组织3小时极速复刻Manus,手把手教你使用OpenManus搭建本
· Manus爆火,是硬核还是营销?
· 终于写完轮子一部分:tcp代理 了,记录一下
· 别再用vector<bool>了!Google高级工程师:这可能是STL最大的设计失误
· 单元测试从入门到精通
< 2025年3月 >
23 24 25 26 27 28 1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31 1 2 3 4 5

导航

统计

点击右上角即可分享
微信分享提示