Parsing Hive statements with ANTLR (reposted)
Hive uses ANTLR to parse its SQL.
A parser turns an unstructured string into a structured data structure ("a parser is a function accepting strings as input and returning some structure as output"); see the Parser combinator article on Wikipedia.
Parsers come in two flavors: parser combinators and parser generators. For the distinction, see Wang Yin's article 对 Parser 的误解 ("Misconceptions about parsers").
1. With a parser combinator you write the parser by hand: "a parser combinator is a higher-order function that accepts several parsers as input and returns a new parser as its output". Thrift's parser is an example (a concrete sketch of the idea follows after item 2):
https://github.com/apache/thrift/blob/master/compiler/cpp/src/thrift/main.cc
2. With a parser generator you express the grammar in a dedicated description language, and the tool converts it into parser code automatically, e.g. ANTLR's .g4 grammar files, Calcite's .ftl grammar files, the jison grammars used by Hue, and flex/cup. The drawback is that, because the code is generated, debugging is harder.
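To make "higher-order function that accepts parsers and returns a parser" concrete, here is a minimal, hypothetical sketch in Java. The Parser, Result, literal, and then names are made up for illustration and are not taken from Thrift or any project mentioned above:

import java.util.Optional;
import java.util.function.BiFunction;

// A parser takes an input string and, on success, returns a value plus the unconsumed rest.
interface Parser<T> {
    Optional<Result<T>> parse(String input);

    // Combinator: a higher-order function that takes parsers and returns a new parser.
    default <U, R> Parser<R> then(Parser<U> next, BiFunction<T, U, R> combine) {
        return input -> this.parse(input)
                .flatMap(r1 -> next.parse(r1.rest)
                        .map(r2 -> new Result<>(combine.apply(r1.value, r2.value), r2.rest)));
    }

    // Primitive parser that matches a literal keyword.
    static Parser<String> literal(String kw) {
        return input -> input.startsWith(kw)
                ? Optional.of(new Result<>(kw, input.substring(kw.length())))
                : Optional.empty();
    }
}

// Value produced by a successful parse, plus the remaining input.
class Result<T> {
    final T value;
    final String rest;
    Result(T value, String rest) { this.value = value; this.rest = rest; }
}

public class CombinatorDemo {
    public static void main(String[] args) {
        Parser<String> selectStar = Parser.literal("select ").then(Parser.literal("*"), (a, b) -> a + b);
        System.out.println(selectStar.parse("select * from t1").isPresent()); // true
        System.out.println(selectStar.parse("insert into t1").isPresent());   // false
    }
}

The point is only that small parsers compose into bigger ones through ordinary functions; real combinator libraries add error reporting, backtracking, and so on.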
Parsers built on ANTLR include Hive, Presto, and Spark SQL.
Meituan-Dianping's article:
https://tech.meituan.com/2014/02/12/hive-sql-to-mapreduce.html
and a test case from the Hive source:
https://github.com/apache/hive/blob/branch-1.1/ql/src/test/org/apache/hadoop/hive/ql/parse/TestHiveDecimalParse.java
Hive's grammar files:
Older versions of Hive:
https://github.com/apache/hive/blob/59d8665cba4fe126df026f334d35e5b9885fc42c/parser/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g
Newer versions of Hive:
https://github.com/apache/hive/blob/master/hplsql/src/main/antlr4/org/apache/hive/hplsql/Hplsql.g4
Spark's .g4 file:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
Presto's .g4 file:
https://github.com/prestodb/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4
Confluent ksqlDB's .g4 file:
https://github.com/confluentinc/ksql/blob/master/ksqldb-parser/src/main/antlr4/io/confluent/ksql/parser/SqlBase.g4
Parsers built on Apache Calcite include Apache Flink, MyBatis, and Apache Storm.
Flink's .ftl file:
https://github.com/apache/flink/blob/master/flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl
MyBatis's mapper template generation:
https://github.com/abel533/Mapper/blob/master/generator/src/main/resources/generator/mapper.ftl
Storm's .ftl file:
https://github.com/apache/storm/blob/master/sql/storm-sql-core/src/codegen/includes/parserImpls.ftl
There is also Impala, which uses flex and cup; how to use Impala's parser to parse a query is covered in another post, "Parsing SQL with the Impala parser" (使用Impala parser解析SQL).
Parser test cases:
https://github.com/cloudera/Impala/blob/master/fe/src/test/java/com/cloudera/impala/analysis/ParserTest.java
Source:
https://github.com/apache/impala/blob/master/fe/src/main/jflex/sql-scanner.flex
and
https://github.com/apache/impala/blob/master/fe/src/main/cup/sql-parser.cup
Impala also uses a small amount of ANTLR:
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/ToSqlUtils.java
Finally there is jison, used by Hue; jison is a parser generator for JavaScript:
https://github.com/cloudera/hue/tree/master/desktop/core/src/desktop/js/parse/jison
Let's take Hive's Hplsql.g4 as an example and parse a SQL statement. First generate the lexer and parser sources and compile them:

antlr4 Hplsql.g4
javac Hplsql*.java
Parse a SELECT statement:
grun Hplsql r -tokens
Warning: TestRig moved to org.antlr.v4.gui.TestRig; calling automatically
select * from db1.tb1;
[@0,0:5='select',<T_SELECT>,1:0]
[@1,7:7='*',<'*'>,1:7]
[@2,9:12='from',<T_FROM>,1:9]
[@3,14:16='db1',<L_ID>,1:14]
[@4,17:17='.',<'.'>,1:17]
[@5,18:20='tb1',<L_ID>,1:18]
[@6,21:21=';',<';'>,1:21]
[@7,23:22='<EOF>',<EOF>,2:0]
No method for rule r or it has arguments
You can see the token stream printed out, one token per line in the form [@index,start:stop='text',<token type>,line:column]. (The trailing message "No method for rule r or it has arguments" only means the grammar has no start rule named r; Hplsql.g4's start rule is program.)
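The same token stream can also be produced programmatically with the classes ANTLR generates from Hplsql.g4. A minimal sketch, assuming the generated lexer class is named HplsqlLexer and the ANTLR 4 runtime is on the classpath:

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

public class TokenDump {
    public static void main(String[] args) {
        // HplsqlLexer is the lexer generated by `antlr4 Hplsql.g4`
        HplsqlLexer lexer = new HplsqlLexer(CharStreams.fromString("select * from db1.tb1;"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill(); // tokenize the whole input
        for (Token token : tokens.getTokens()) {
            // prints [@index,start:stop='text',<type>,line:col]; the type is shown numerically
            // here, whereas grun maps it to the symbolic name (e.g. T_SELECT)
            System.out.println(token);
        }
    }
}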
Parse a CREATE TABLE statement:
grun Hplsql r -tokens
Warning: TestRig moved to org.antlr.v4.gui.TestRig; calling automatically
CREATE TABLE IF NOT EXISTS db1.tb1 (
  `f1` string,
  `f2` bigint,
  `f3` string,
  `f4` string,
  `f5` string)
partitioned by(ds string)
stored as parquet
TBLPROPERTIES ("parquet.compression"="SNAPPY");
[@0,0:5='CREATE',<T_CREATE>,1:0]
[@1,7:11='TABLE',<T_TABLE>,1:7]
[@2,13:14='IF',<T_IF>,1:13]
[@3,16:18='NOT',<T_NOT>,1:16]
[@4,20:25='EXISTS',<T_EXISTS>,1:20]
[@5,27:29='db1',<L_ID>,1:27]
[@6,30:30='.',<'.'>,1:30]
[@7,31:33='tb1',<L_ID>,1:31]
[@8,35:35='(',<'('>,1:35]
[@9,39:42='`f1`',<L_ID>,2:2]
[@10,44:49='string',<T_STRING>,2:7]
[@11,50:50=',',<','>,2:13]
[@12,54:57='`f2`',<L_ID>,3:2]
[@13,59:64='bigint',<T_BIGINT>,3:7]
[@14,65:65=',',<','>,3:13]
[@15,69:72='`f3`',<L_ID>,4:2]
[@16,74:79='string',<T_STRING>,4:7]
[@17,80:80=',',<','>,4:13]
[@18,84:87='`f4`',<L_ID>,5:2]
[@19,89:94='string',<T_STRING>,5:7]
[@20,95:95=',',<','>,5:13]
[@21,99:102='`f5`',<L_ID>,6:2]
[@22,104:109='string',<T_STRING>,6:7]
[@23,110:110=')',<')'>,6:13]
[@24,112:122='partitioned',<L_ID>,7:0]
[@25,124:125='by',<T_BY>,7:12]
[@26,126:126='(',<'('>,7:14]
[@27,127:128='ds',<L_ID>,7:15]
[@28,130:135='string',<T_STRING>,7:18]
[@29,136:136=')',<')'>,7:24]
[@30,138:143='stored',<T_STORED>,8:0]
[@31,145:146='as',<T_AS>,8:7]
[@32,148:154='parquet',<L_ID>,8:10]
[@33,156:168='TBLPROPERTIES',<L_ID>,9:0]
[@34,170:170='(',<'('>,9:14]
[@35,171:191='"parquet.compression"',<L_ID>,9:15]
[@36,192:192='=',<'='>,9:36]
[@37,193:200='"SNAPPY"',<L_ID>,9:37]
[@38,201:201=')',<')'>,9:45]
[@39,202:202=';',<';'>,9:46]
[@40,204:203='<EOF>',<EOF>,10:0]
No method for rule r or it has arguments
The above shows how ANTLR parses Hive statements; inside Hive, it is exactly the Java code generated by ANTLR that does this parsing.
Next, let's parse Hive statements from Java code. First add the dependency:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.1.0-cdh5.16.2</version>
</dependency>
The code:
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.ql.Context;
import org.apache.hadoop.hive.ql.lib.*;
import org.apache.hadoop.hive.ql.parse.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.Stack;

public class MyProcessor implements NodeProcessor {

    private static Logger logger = LoggerFactory.getLogger(MyProcessor.class);

    private static Context context = null;

    private final static String HDFS_SESSION_PATH_KEY = "_hive.hdfs.session.path";
    private final static String LOCAL_SESSION_PATH_KEY = "_hive.local.session.path";

    private static String hdfsTemporaryDirectory(HiveConf hiveConf) {
        return hiveConf.get("hadoop.tmp.dir", "/tmp");
    }

    private static String localTemporaryDirectory() {
        return System.getProperty("java.io.tmpdir", "/tmp");
    }

    static {
        // ParseDriver needs a Context whose session paths are set, so fill in defaults
        HiveConf hiveConf = new HiveConf();
        if (hiveConf.get(HDFS_SESSION_PATH_KEY) == null) {
            hiveConf.set(HDFS_SESSION_PATH_KEY, hdfsTemporaryDirectory(hiveConf));
        }
        if (hiveConf.get(LOCAL_SESSION_PATH_KEY) == null) {
            hiveConf.set(LOCAL_SESSION_PATH_KEY, localTemporaryDirectory());
        }
        try {
            context = new Context(hiveConf);
        } catch (IOException e) {
            logger.error("Init hive context fail, message: " + e);
        }
    }

    String tableName = "";
    List<FieldSchema> fieldSchemas;

    public void parse(String query) throws ParseException, SemanticException {
        ParseDriver pd = new ParseDriver();
        ASTNode tree = pd.parse(query, context);
        while ((tree.getToken() == null) && (tree.getChildCount() > 0)) {
            tree = (ASTNode) tree.getChild(0);
        }
        logger.info("start to analyze query: {}, ASTNode: {}", query, tree.dump());

        Map<Rule, NodeProcessor> rules = Maps.newLinkedHashMap();
        Dispatcher disp = new DefaultRuleDispatcher(this, rules, null);
        GraphWalker ogw = new DefaultGraphWalker(disp);
        final List<Node> topNodes = Lists.newArrayList(tree);
        // walk the AST, dispatching every node to process()
        ogw.startWalking(topNodes, null);
        // print what was collected
        System.out.println(tableName);
        System.out.println(fieldSchemas);
    }

    @Override
    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx, Object... nodeOutputs)
            throws SemanticException {
        ASTNode pt = (ASTNode) nd;
        switch (pt.getToken().getType()) {
            case org.apache.hadoop.hive.ql.parse.HiveParser.TOK_CREATETABLE:
                for (Node node : pt.getChildren()) {
                    ASTNode createTableChild = (ASTNode) node;
                    if (createTableChild.getToken().getType() == HiveParser.TOK_TABNAME) {
                        tableName = BaseSemanticAnalyzer.getUnescapedName(createTableChild);
                    } else if (createTableChild.getToken().getType() == HiveParser.TOK_TABCOLLIST) {
                        fieldSchemas = BaseSemanticAnalyzer.getColumns(createTableChild, true);
                    }
                }
        }
        return null;
    }
}
A test case that parses a Hive CREATE TABLE statement:
import org.junit.Test;

public class MyProcessorTest {

    @Test
    public void parse() throws Exception {
        String query = "create table my_table(id int,name string)row format delimited fields terminated by '\\t'";
        MyProcessor processor = new MyProcessor();
        processor.parse(query);
    }
}
Output: the parse method prints the extracted table name and the list of field schemas.
The example above pulls out the table name and columns; other attributes can be taken from the syntax tree in the same way, as sketched below.
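For example, the partition columns could be captured with another check next to the existing TOK_TABNAME/TOK_TABCOLLIST branches. A sketch that could be added to MyProcessor, assuming (as in Hive 1.x) that the `partitioned by` column list sits under a HiveParser.TOK_TABLEPARTCOLS child of TOK_CREATETABLE; verify the token name against the HiveParser of your Hive version:

// Possible extension inside MyProcessor: collect the `partitioned by` columns as well.
// TOK_TABLEPARTCOLS is assumed to wrap the partition column list of a CREATE TABLE node.
private List<FieldSchema> getPartitionColumns(ASTNode createTableNode) throws SemanticException {
    for (Node node : createTableNode.getChildren()) {
        ASTNode child = (ASTNode) node;
        if (child.getToken().getType() == HiveParser.TOK_TABLEPARTCOLS) {
            // same helper as for the regular column list
            return BaseSemanticAnalyzer.getColumns(child, true);
        }
    }
    return java.util.Collections.emptyList();
}

Called from the TOK_CREATETABLE case of process(), this would return the (name, type) pair of the ds partition column in the earlier CREATE TABLE example.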