Hadoop/Impala Study Notes: SQL Reference
Reference: Apache Impala Guide, "Impala SQL Language Reference".
Impala uses the same metastore as Hive: Impala can access tables created with native Impala CREATE TABLE statements as well as tables created through Hive DDL;
Impala supports DML similar to HiveQL;
TRUNCATE is supported (though at least through 5.5 it is not supported on Kudu tables);
UPDATE/UPSERT are supported on Kudu tables starting with 5.10;
Impala provides many built-in functions equivalent to their HiveQL counterparts;
Most clauses are supported, such as WITH, JOIN, aggregation, DISTINCT, UNION ALL, ORDER BY, and LIMIT (correlated subqueries are the exception);
The same data types as Hive are supported, e.g. STRING, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, TIMESTAMP (note the restrictions of the Kudu and Parquet file formats);
Impala supports partitioned tables and external tables;
Impala supports the vast majority of SQL-92.
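To tie the points above together, here is a minimal sketch of native Impala DDL for a partitioned external table; the table name, columns, and HDFS path are hypothetical, chosen only for illustration:

```sql
-- Hypothetical example: an external, partitioned Parquet table over
-- existing HDFS data. Dropping the table leaves the files in place.
CREATE EXTERNAL TABLE logs (
  ts     TIMESTAMP,
  msg    STRING,
  status SMALLINT
)
PARTITIONED BY (day STRING)
STORED AS PARQUET
LOCATION '/data/logs';

-- Register an existing partition directory with the table:
ALTER TABLE logs ADD PARTITION (day = '2017-04-01');
```

Because the table is external and partitioned by `day`, data written by other tools under `/data/logs/day=.../` only needs the ADD PARTITION step to become queryable.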
To inspect various kinds of context information, use the SHOW family of statements, e.g.:

SHOW DATABASES [[LIKE] 'pattern']
SHOW SCHEMAS [[LIKE] 'pattern'] -- an alias for SHOW DATABASES
SHOW TABLES [IN database_name] [[LIKE] 'pattern']
SHOW [AGGREGATE | ANALYTIC] FUNCTIONS [IN database_name] [[LIKE] 'pattern']
SHOW CREATE TABLE [database_name].table_name
SHOW CREATE VIEW [database_name].view_name
SHOW TABLE STATS [database_name.]table_name
SHOW COLUMN STATS [database_name.]table_name
SHOW [RANGE] PARTITIONS [database_name.]table_name
SHOW FILES IN [database_name.]table_name [PARTITION (key_col_expression [, key_col_expression])]
SHOW ROLES
SHOW CURRENT ROLES
SHOW ROLE GRANT GROUP group_name
SHOW GRANT ROLE role_name
Like an RDBMS, Impala supports statistics collection: COMPUTE [INCREMENTAL] STATS [db_name.]table_name. Unlike Hive's ANALYZE TABLE, which collects table and column statistics separately, this gathers both in a single pass. Incremental statistics collection is suited to partitioned tables (statistics should be collected as the last step of an ETL run). The following statements display a table's statistics:

[quickstart.cloudera:21000] > show table stats t1;
Query: show table stats t1
+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                               |
+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
| -1    | 7      | 512B | NOT CACHED   | NOT CACHED        | TEXT   | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/t1 |
+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
Fetched 1 row(s) in 3.17s

[quickstart.cloudera:21000] > show column stats t1;
Query: show column stats t1
+--------+------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+------+------------------+--------+----------+----------+
| x      | INT  | -1               | -1     | 4        | 4        |
+--------+------+------------------+--------+----------+----------+
Fetched 1 row(s) in 0.11s
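The collection step itself looks like the following sketch (the partitioned table name `logs` is hypothetical); the INCREMENTAL form only processes partitions whose statistics are missing or stale, which is why it fits the end of an ETL run:

```sql
-- Collect table and column statistics for t1 in one pass:
COMPUTE STATS t1;

-- For a (hypothetical) partitioned table, only scan new/changed partitions:
COMPUTE INCREMENTAL STATS logs;

-- Verify that #Rows and the column statistics are now populated:
SHOW TABLE STATS t1;
SHOW COLUMN STATS t1;
```

In the SHOW output above, the -1 values mean "statistics not yet collected"; after COMPUTE STATS they are replaced with real counts.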
Optimizer hints are supported, though they are fairly weak; to make good use of them you must first understand execution plans.
They come in several formats (somewhat messy):

SELECT STRAIGHT_JOIN select_list FROM
  join_left_hand_table
  JOIN [{ /* +BROADCAST */ | /* +SHUFFLE */ }]
  join_right_hand_table
  remainder_of_query;

INSERT insert_clauses
  [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
  [/* +CLUSTERED */]
  SELECT remainder_of_query;

SELECT select_list FROM
  table_ref
  /* +{SCHEDULE_CACHE_LOCAL | SCHEDULE_DISK_LOCAL | SCHEDULE_REMOTE} [,RANDOM_REPLICA] */
  remainder_of_query;

-- Recent releases also accept the following formats:
SELECT select_list FROM
  join_left_hand_table
  JOIN -- +BROADCAST|SHUFFLE
  join_right_hand_table
  remainder_of_query;

INSERT insert_clauses
  /* +SHUFFLE|NOSHUFFLE */
  SELECT remainder_of_query;
Some hints affect Impala's own execution; others control HDFS scheduling.
The full set of supported hints is documented in "Query Hints in Impala SELECT Statements".
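As a concrete illustration of the join-distribution hints (table names are hypothetical): /* +BROADCAST */ replicates the right-hand table to every node, which suits a small dimension table, while /* +SHUFFLE */ repartitions both sides by the join key, which suits two large tables. STRAIGHT_JOIN is added so the planner keeps the written join order instead of reordering around the hint:

```sql
-- Small dimension table on the right: broadcast it to all nodes.
SELECT STRAIGHT_JOIN f.id, d.name
FROM fact_sales f
JOIN /* +BROADCAST */ dim_product d ON f.product_id = d.id;

-- Two large tables: partition both sides on the join key instead.
SELECT STRAIGHT_JOIN a.id
FROM big_a a
JOIN /* +SHUFFLE */ big_b b ON a.id = b.id;
```

Comparing EXPLAIN output with and without the hint is the practical way to confirm which distribution strategy the planner actually chose.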
Data movement: the LOAD DATA statement moves data from an HDFS directory into the Impala data directory (loading from the local filesystem is not currently supported, and it is a true move, not a copy). Since Impala can map directly onto HDFS files, when is this operation needed? It is similar in nature to an external table, except that after the move the files are managed by Impala rather than remaining under plain HDFS management.

LOAD DATA INPATH 'hdfs_file_or_directory_path' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)]
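For example (the table name and staging path below are hypothetical), moving a staged HDFS directory into one partition:

```sql
-- Moves (does not copy) the files under the staging directory into the
-- partition's directory under the warehouse; after the statement the
-- source directory is left empty and the files belong to the table.
LOAD DATA INPATH '/staging/sales/2017-04-01'
  INTO TABLE sales PARTITION (day = '2017-04-01');
```

With OVERWRITE, any files already present in the target partition are replaced instead of appended to.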
Session context settings: SET.
SET was originally an impala-shell command; it was later promoted to a SQL statement, mainly so that it can be used through the JDBC and ODBC APIs.
Run with no arguments, it prints the full list of settable options, as follows:

[quickstart.cloudera:21000] > set
                            > ;
Query options (defaults shown in []):
  ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
  ABORT_ON_ERROR: [0]
  ALLOW_UNSUPPORTED_FORMATS: [0]
  APPX_COUNT_DISTINCT: [0]
  BATCH_SIZE: [0]
  BUFFER_POOL_LIMIT: [0]
  COMPRESSION_CODEC: [NONE]
  DEBUG_ACTION: []
  DECIMAL_V2: [0]
  DEFAULT_JOIN_DISTRIBUTION_MODE: [0]
  DEFAULT_ORDER_BY_LIMIT: [-1]
  DEFAULT_SPILLABLE_BUFFER_SIZE: [2097152]
  DISABLE_CACHED_READS: [0]
  DISABLE_CODEGEN: [0]
  DISABLE_CODEGEN_ROWS_THRESHOLD: [50000]
  DISABLE_OUTERMOST_TOPN: [0]
  DISABLE_ROW_RUNTIME_FILTERING: [0]
  DISABLE_STREAMING_PREAGGREGATIONS: [0]
  DISABLE_UNSAFE_SPILLS: [0]
  ENABLE_EXPR_REWRITES: [1]
  EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
  EXPLAIN_LEVEL: [1]
  HBASE_CACHE_BLOCKS: [0]
  HBASE_CACHING: [0]
  MAX_ERRORS: [100]
  MAX_IO_BUFFERS: [0]
  MAX_NUM_RUNTIME_FILTERS: [10]
  MAX_ROW_SIZE: [524288]
  MAX_SCAN_RANGE_LENGTH: [0]
  MEM_LIMIT: [0]
  MIN_SPILLABLE_BUFFER_SIZE: [65536]
  MT_DOP: [0]
  NUM_NODES: [0]
  NUM_SCANNER_THREADS: [0]
  OPTIMIZE_PARTITION_KEY_SCANS: [0]
  PARQUET_ANNOTATE_STRINGS_UTF8: [0]
  PARQUET_ARRAY_RESOLUTION: [2]
  PARQUET_DICTIONARY_FILTERING: [1]
  PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
  PARQUET_FILE_SIZE: [0]
  PARQUET_READ_STATISTICS: [1]
  PREFETCH_MODE: [1]
  QUERY_TIMEOUT_S: [0]
  REPLICA_PREFERENCE: [0]
  REQUEST_POOL: []
  RESERVATION_REQUEST_TIMEOUT: [0]
  RM_INITIAL_MEM: [0]
  RUNTIME_BLOOM_FILTER_SIZE: [1048576]
  RUNTIME_FILTER_MAX_SIZE: [16777216]
  RUNTIME_FILTER_MIN_SIZE: [1048576]
  RUNTIME_FILTER_MODE: [2]
  RUNTIME_FILTER_WAIT_TIME_MS: [0]
  S3_SKIP_INSERT_STAGING: [1]
  SCAN_NODE_CODEGEN_THRESHOLD: [1800000]
  SCHEDULE_RANDOM_REPLICA: [0]
  SCRATCH_LIMIT: [-1]
  SEQ_COMPRESSION_MODE: [0]
  STRICT_MODE: [0]
  SUPPORT_START_OVER: [false]
  SYNC_DDL: [0]
  V_CPU_CORES: [0]
Shell Options
  LIVE_PROGRESS: False
  LIVE_SUMMARY: False
Variables:
  No variables defined.
A value passed as a command-line argument needs no quotes; otherwise quotes are required. All available options are documented in "Query Options for the SET Statement".
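Typical usage looks like the following sketch, whether issued in impala-shell or through JDBC/ODBC; the particular values are illustrative only:

```sql
-- Cap per-query memory on each node for this session and
-- ask for a more verbose EXPLAIN output:
SET MEM_LIMIT=2g;
SET EXPLAIN_LEVEL=2;

-- Setting an option to an empty string resets it to its default:
SET MEM_LIMIT="";
```

Options set this way last only for the current session, which makes SET a convenient way to experiment before changing any service-level configuration.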