hive 截取不固定位置字符串

滴水穿石不是靠力，而是因为不舍昼夜。

hive 截取不固定位置字符串

需求

数据准备

需求解决

需求

数据准备

需求解决

公告

日志表中有如下数据

film:[non_marked]qq:[unit]tailpipe:[withabanana?]80sxxxxx
call:[F_GET_ROWS()]b1:[ROSEWOOD...SIR]b2:[44400002]77.90xxxxx
xxxxxabc[867]xxx[-]xxxx[5309]xxxxx
xxxxxtime:[11271978]favnum:[4]id:[Joe]xxxxx

希望获取一下数据

non_marked	unit	withabanana?
F_GET_ROWS()	ROSEWOOD...SIR	44400002
867	-	5309
11271978	4	Joe

可以看出，截取每行的字符串位置未固定，但是可以借助【】来判断截取位置以及长度。

reate table temp_log_0213
as
select 'xxxxxabc[867]xxx[-]xxxx[5309]xxxxx' msg
union all
select 'xxxxxtime:[11271978]favnum:[4]id:[Joe]xxxxx' msg
union all
select 'call:[F_GET_ROWS()]b1:[ROSEWOOD...SIR]b2:[44400002]77.90xxxxx' msg
union all
select 'film:[non_marked]qq:[unit]tailpipe:[withabanana?]80sxxxxx' msg;

hive 获取字符串位置的函数有一下2个

1、instr(String str, String substr)函数返回字符串第一次出现的索引

select instr("hello hive",'e');
2

2、locate(string substr, string str, int index)返回字符串 substr 在 str 中从 index后查找,首次出现的位置

select locate('e', 'hello hive');
2
select locate('e', 'hello hive', 2);
2
select locate('e', 'hello hive', 3);
10

今天我们使用locate 函数处理这个需求

select msg,
       locate('[', msg, 1)                                             start_id_1,
       locate(']', msg, 1)                                             end_id_1,
       locate('[', msg, locate('[', msg, 1) + 1)                       start_id_2,
       locate(']', msg, locate(']', msg, 1) + 1)                       end_id_2,
       locate('[', msg, locate('[', msg, locate('[', msg, 1) + 1) + 1) startd_id_3,
       locate(']', msg, locate(']', msg, locate(']', msg, 1) + 1) + 1) end_end_id_3
from temp_log_0213;

其实位置为方括号"["索引值+1从该处开始截取，截取长度为"]"位置的索引值减去"["位置处的索引值再减1.结果如下

select substr(msg,
              locate('[', msg, 1) + 1,
              locate(']', msg, 1) - locate('[', msg, 1) - 1)                       as first_value,
       substr(msg,
              locate('[', msg,
                     locate('[', msg, 1) + 1) + 1,
              locate(']', msg, locate(']', msg, 1) + 1) -
              locate('[', msg, locate('[', msg, 1) + 1) - 1)                       as second_value,
       substr(msg,
              locate('[', msg,
                     locate('[', msg, locate('[', msg, 1) + 1) + 1) + 1,
              locate(']', msg, locate(']', msg, locate(']', msg, 1) + 1) + 1) -
              locate('[', msg, locate('[', msg, locate('[', msg, 1) + 1) + 1) - 1) as thirdd_value
from temp_log_0213;

posted @ 2022-02-14 11:04 晓枫的春天阅读(1944) 评论(0) 编辑收藏举报

刷新页面返回顶部

film:[non_marked]qq:[unit]tailpipe:[withabanana?]80sxxxxx

call:[F_GET_ROWS()]b1:[ROSEWOOD...SIR]b2:[44400002]77.90xxxxx

xxxxxabc[867]xxx[-]xxxx[5309]xxxxx

xxxxxtime:[11271978]favnum:[4]id:[Joe]xxxxx