Hadoop-Impala学习笔记之入门
CDH quickstart vm包含了单节点的全套hadoop服务生态,可从https://www.cloudera.com/downloads/quickstart_vms/5-13.html下载。如下:
对应的节点如下(不包含Cloudera Navigator):
要学习完整的hadoop生态,最好是使用8C/32GB以上的服务器,4C/16GB勉强能跑、但是很勉强(最好使用2个以上节点)。
impala 使用c++编写(Spark使用Scala编写),采用MPP架构(类似于MariaDB Columnstore,也就是之前的infinidb),由下列组件组成:
Hue是一个Web智能查询分析器,能够进行语法提示,查询Impala、HDFS、HBase。如下:
其中impala服务器由Impala Daemon(执行SQL)、Impala Statestore(监控Daemon状态)、Impala Catalog(将DDL变更传输给Daemon节点,避免了DDL通过Impala执行时运行REFRESH/INVALIDATE METADATA的必要,通过Hive时,仍然需要)组成。impala-shell和mysql客户端类似,执行SQL。
Impala使用和Hive一样的元数据,其可以存储在mysql或postgresql中,称为metastore。
Impala使用HDFS作为主要的存储底层,利用其冗余特性。
Impala还支持Hbase作为存储,通过定义映射到Hbase的表,可以查询HBase表,还可以关联查询HBase和Impala表。
impala可以使用Cloudera Manager或命令行启动:
Cloudera Manager启动如下:
命令行启动(这种方式启动CM是无法监控到其状态的,而且进程也略有不同):
service impala-state-store start/restart/stop
service impala-catalog start/restart/stop
service impala-server start/restart/stop
CM启动后进程如下:
日志信息位于/var/log/impala,如下:
配置可通过CM修改、也可以修改配置文件/etc/default/impala。
impala客户端
[root@quickstart impala]# impala-shell Starting Impala Shell without Kerberos authentication Connected to quickstart.cloudera:21000 Server version: impalad version 2.10.0-cdh5.13.0 RELEASE (build 2511805f1eaa991df1460276c7e9f19d819cd4e4) *********************************************************************************** Welcome to the Impala shell. (Impala Shell v2.10.0-cdh5.13.0 (2511805) built on Wed Oct 4 10:55:37 PDT 2017) The HISTORY command lists all shell commands in chronological order. *********************************************************************************** [quickstart.cloudera:21000] >
全新的Impala实例包含2个库:default(新创建表的默认库)以及_impala_builtins。
可通过show database/show table/select version()查看数据库信息(语法兼容SQL 92/MySQL,大部分NoSQL的参考实现,不同于Oracle)
[quickstart.cloudera:21000] > select version(); Query: select version() Query submitted at: 2019-04-06 04:59:25 (Coordinator: http://quickstart.cloudera:25000) Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=ef4cbbf93a7662e5:f4e103f500000000 +-------------------------------------------------------------------------------------------+ | version() | +-------------------------------------------------------------------------------------------+ | impalad version 2.10.0-cdh5.13.0 RELEASE (build 2511805f1eaa991df1460276c7e9f19d819cd4e4) | | Built on Wed Oct 4 10:55:37 PDT 2017 | +-------------------------------------------------------------------------------------------+ Fetched 1 row(s) in 0.34s [quickstart.cloudera:21000] > show databases; Query: show databases +------------------+----------------------------------------------+ | name | comment | +------------------+----------------------------------------------+ | _impala_builtins | System database for Impala builtin functions | | default | Default Hive database | +------------------+----------------------------------------------+ Fetched 2 row(s) in 0.01s [quickstart.cloudera:21000] > show tables; Query: show tables +------+ | name | +------+ | tab1 | | tab2 | | tab3 | +------+ Fetched 3 row(s) in 0.02s
[quickstart.cloudera:21000] > select * from tab1; Query: select * from tab1 Query submitted at: 2019-04-06 05:05:57 (Coordinator: http://quickstart.cloudera:25000) Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=dd4debde8a589622:81983b200000000 +----+-------+------------+-------------------------------+ | id | col_1 | col_2 | col_3 | +----+-------+------------+-------------------------------+ | 1 | true | 123.123 | 2012-10-24 08:55:00 | | 2 | false | 1243.5 | 2012-10-25 13:40:00 | | 3 | false | 24453.325 | 2008-08-22 09:33:21.123000000 | | 4 | false | 243423.325 | 2007-05-12 22:32:21.334540000 | | 5 | true | 243.325 | 1953-04-22 09:11:33 | +----+-------+------------+-------------------------------+ Fetched 5 row(s) in 6.06s -- 第一次访问特别慢,因为需要加载到内存 [quickstart.cloudera:21000] > select * from tab1; Query: select * from tab1 Query submitted at: 2019-04-06 05:06:05 (Coordinator: http://quickstart.cloudera:25000) Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=21486d50db995bf2:a2d061b400000000 +----+-------+------------+-------------------------------+ | id | col_1 | col_2 | col_3 | +----+-------+------------+-------------------------------+ | 1 | true | 123.123 | 2012-10-24 08:55:00 | | 2 | false | 1243.5 | 2012-10-25 13:40:00 | | 3 | false | 24453.325 | 2008-08-22 09:33:21.123000000 | | 4 | false | 243423.325 | 2007-05-12 22:32:21.334540000 | | 5 | true | 243.325 | 1953-04-22 09:11:33 | +----+-------+------------+-------------------------------+ Fetched 5 row(s) in 0.26s [quickstart.cloudera:21000] > desc tab1; Query: describe tab1 +-------+-----------+---------+ | name | type | comment | +-------+-----------+---------+ | id | int | | | col_1 | boolean | | | col_2 | double | | | col_3 | timestamp | | +-------+-----------+---------+ Fetched 4 row(s) in 0.03s [quickstart.cloudera:21000] > select count(1) from tab1; Query: select count(1) from tab1 Query submitted at: 2019-04-06 05:06:28 (Coordinator: http://quickstart.cloudera:25000) Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=9141cf84d4efd2e3:e667f1d00000000 +----------+ | count(1) | +----------+ | 5 | +----------+ Fetched 1 row(s) in 0.17s [quickstart.cloudera:21000] > create database my_first_impala_db; Query: create database my_first_impala_db Fetched 0 row(s) in 0.08s [quickstart.cloudera:21000] > create table t1 (x int); Query: create table t1 (x int) Fetched 0 row(s) in 0.08s [quickstart.cloudera:21000] > insert into t1 values (1), (3), (2), (4); --支持mysql语法 Query: insert into t1 values (1), (3), (2), (4) Query submitted at: 2019-04-06 05:07:32 (Coordinator: http://quickstart.cloudera:25000) Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=8c41d97919423a19:1f7aeeeb00000000 Modified 4 row(s) in 5.17s [quickstart.cloudera:21000] > insert into t1 select * from t1; --支持insert select Query: insert into t1 select * from t1 Query submitted at: 2019-04-06 05:07:45 (Coordinator: http://quickstart.cloudera:25000) Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=9c452b951b08da49:92d9fee100000000 Modified 4 row(s) in 0.23s
创建基于HDFS文件的表(需要先创建HDFS文件,并加载数据,参见Hadoop-HDFS学习笔记):
[quickstart.cloudera:21000] > DROP TABLE IF EXISTS tab2; Query: drop TABLE IF EXISTS tab2 [quickstart.cloudera:21000] > CREATE EXTERNAL TABLE tab2 > ( > id INT, > col_1 BOOLEAN, > col_2 DOUBLE > ) > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' > LOCATION '/user/cloudera/sample_data/tab2'; Query: create EXTERNAL TABLE tab2 ( id INT, col_1 BOOLEAN, col_2 DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/cloudera/sample_data/tab2' Fetched 0 row(s) in 0.09s [quickstart.cloudera:21000] > DROP TABLE IF EXISTS tab3; Query: drop TABLE IF EXISTS tab3 [quickstart.cloudera:21000] > CREATE TABLE tab3 > ( > id INT, > col_1 BOOLEAN, > col_2 DOUBLE, > month INT, > day INT > ) > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; Query: create TABLE tab3 ( id INT, col_1 BOOLEAN, col_2 DOUBLE, month INT, day INT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' Fetched 0 row(s) in 0.09s
包含关联、子查询、聚合的SQL查询:
SELECT tab2.* FROM tab2, (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2 FROM tab2, tab1 WHERE tab1.id = tab2.id GROUP BY col_1) subquery1 WHERE subquery1.max_col2 = tab2.col_2;
查看SQL的执行计划:
explain SELECT tab2.* FROM tab2, (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2 FROM tab2, tab1 WHERE tab1.id = tab2.id GROUP BY col_1) subquery1 WHERE subquery1.max_col2 = tab2.col_2;
+------------------------------------------------------------------------------------+ | Explain String | +------------------------------------------------------------------------------------+ | Max Per-Host Resource Reservation: Memory=136.00MB | | Per-Host Resource Estimates: Memory=4.34GB | | WARNING: The following tables are missing relevant table and/or column statistics. | | default.tab1, default.tab2 | | | | PLAN-ROOT SINK | | | | | 10:EXCHANGE [UNPARTITIONED] | | | | | 05:HASH JOIN [INNER JOIN, BROADCAST] | | | hash predicates: tab2.col_2 = max(tab2.col_2) | | | runtime filters: RF000 <- max(tab2.col_2) | | | | | |--09:EXCHANGE [BROADCAST] | | | | | | | 08:AGGREGATE [FINALIZE] | | | | output: max:merge(tab2.col_2) | | | | group by: tab1.col_1 | | | | | | | 07:EXCHANGE [HASH(tab1.col_1)] | | | | | | | 04:AGGREGATE [STREAMING] | | | | output: max(tab2.col_2) | | | | group by: tab1.col_1 | | | | | | | 03:HASH JOIN [INNER JOIN, BROADCAST] | | | | hash predicates: tab2.id = tab1.id | | | | runtime filters: RF001 <- tab1.id | | | | | | | |--06:EXCHANGE [BROADCAST] | | | | | | | | | 02:SCAN HDFS [default.tab1] | | | | partitions=1/1 files=1 size=192B | | | | | | | 01:SCAN HDFS [default.tab2] | | | partitions=1/1 files=1 size=158B | | | runtime filters: RF001 -> tab2.id | | | | | 00:SCAN HDFS [default.tab2] | | partitions=1/1 files=1 size=158B | | runtime filters: RF000 -> tab2.col_2 | +------------------------------------------------------------------------------------+ Fetched 41 row(s) in 0.05s
创建基于Parquet的文件,并转换为内部分区表
[quickstart.cloudera:21000] > USE airlines_data; Query: use airlines_data [quickstart.cloudera:21000] > CREATE EXTERNAL TABLE airlines_external > LIKE PARQUET > 'hdfs:/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq' > STORED AS PARQUET LOCATION 'hdfs:/user/impala/staging/airlines'; Query: create EXTERNAL TABLE airlines_external LIKE PARQUET 'hdfs:/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq' STORED AS PARQUET LOCATION 'hdfs:/user/impala/staging/airlines' WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://quickstart.cloudera:8020/user/impala/staging' Fetched 0 row(s) in 0.82s [quickstart.cloudera:21000] > SHOW TABLE STATS airlines_external; Query: show TABLE STATS airlines_external +-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+ | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+ | -1 | 8 | 1.34GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines | +-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+ Fetched 1 row(s) in 3.89s [quickstart.cloudera:21000] > SHOW FILES IN airlines_external; Query: show FILES IN airlines_external +-----------------------------------------------------------------------------------------------------------------------+----------+-----------+ | Path | Size | Partition | +-----------------------------------------------------------------------------------------------------------------------+----------+-----------+ | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq | 252.99MB | | | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq | 13.43MB | | | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq | 252.84MB | | | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq | 63.92MB | | | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq | 183.64MB | | | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq | 240.04MB | | | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq | 211.35MB | | | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq | 151.46MB | | +-----------------------------------------------------------------------------------------------------------------------+----------+-----------+ Fetched 8 row(s) in 0.04s [quickstart.cloudera:21000] > DESCRIBE airlines_external; Query: describe airlines_external +---------------------+--------+-----------------------------+ | name | type | comment | +---------------------+--------+-----------------------------+ | year | int | Inferred from Parquet file. | | month | int | Inferred from Parquet file. | | day | int | Inferred from Parquet file. | | dayofweek | int | Inferred from Parquet file. | | dep_time | int | Inferred from Parquet file. | | crs_dep_time | int | Inferred from Parquet file. |
-- 单表查询速度还是不错的。。。
[quickstart.cloudera:21000] > SELECT COUNT(*) FROM airlines_external;
Query: select COUNT(*) FROM airlines_external
Query submitted at: 2019-04-06 07:08:33 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=a04cd594518cba3a:1666b4e00000000
+-----------+
| count(*) |
+-----------+
| 123534969 |
+-----------+
Fetched 1 row(s) in 0.33s
[quickstart.cloudera:21000] > SElECT NDV(carrier), NDV(flight_num), NDV(tail_num),
> NDV(origin), NDV(dest) FROM airlines_external;
Query: select NDV(carrier), NDV(flight_num), NDV(tail_num),
NDV(origin), NDV(dest) FROM airlines_external
Query submitted at: 2019-04-06 07:08:53 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=df4776d3fa8c1f69:e4e09e6100000000
+--------------+-----------------+---------------+-------------+-----------+
| ndv(carrier) | ndv(flight_num) | ndv(tail_num) | ndv(origin) | ndv(dest) |
+--------------+-----------------+---------------+-------------+-----------+
| 29 | 8463 | 3 | 342 | 349 |
+--------------+-----------------+---------------+-------------+-----------+
Fetched 1 row(s) in 9.33s
[quickstart.cloudera:21000] > SELECT tail_num, COUNT(*) AS howmany FROM airlines_external
> GROUP BY tail_num;
Query: select tail_num, COUNT(*) AS howmany FROM airlines_external
GROUP BY tail_num
Query submitted at: 2019-04-06 07:09:19 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=7d43af89d18c600e:bc464f0f00000000
+----------+-----------+
| tail_num | howmany |
+----------+-----------+
| 715 | 1 |
| 0 | 406405 |
| 112 | 6562 |
| NULL | 123122001 |
+----------+-----------+
Fetched 4 row(s) in 2.42s
-- 大表关联,内存不足报错 [quickstart.cloudera:21000] > SELECT DISTINCT dest FROM airlines_external > WHERE dest NOT IN (SELECT origin FROM airlines_external); Query: select DISTINCT dest FROM airlines_external WHERE dest NOT IN (SELECT origin FROM airlines_external) Query submitted at: 2019-04-06 07:09:35 (Coordinator: http://quickstart.cloudera:25000) Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=314343c761a55f97:61ce9aa500000000 WARNINGS: Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 314343c761a55f97:61ce9aa500000002 Memory left in process limit: -328.00 KB Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB Unclaimed reservations: Reservation=34.00 MB OtherMemory=0 Total=34.00 MB Peak=108.00 MB Fragment 314343c761a55f97:61ce9aa500000000: Reservation=0 OtherMemory=8.00 KB Total=8.00 KB Peak=8.00 KB EXCHANGE_NODE (id=7): Total=0 Peak=0 DataStreamRecvr: Total=0 Peak=0 PLAN_ROOT_SINK: Total=0 Peak=0 CodeGen: Total=0 Peak=0 Fragment 314343c761a55f97:61ce9aa500000003: Reservation=0 OtherMemory=38.51 KB Total=38.51 KB Peak=383.65 KB AGGREGATION_NODE (id=6): Total=21.12 KB Peak=21.12 KB Exprs: Total=21.12 KB Peak=21.12 KB EXCHANGE_NODE (id=5): Total=0 Peak=0 DataStreamRecvr: Total=0 Peak=0 DataStreamSender (dst_id=7): Total=7.52 KB Peak=7.52 KB CodeGen: Total=1.86 KB Peak=347.00 KB Fragment 314343c761a55f97:61ce9aa500000002: Reservation=374.00 MB OtherMemory=14.72 MB Total=388.72 MB Peak=388.72 MB AGGREGATION_NODE (id=3): Reservation=34.00 MB OtherMemory=5.66 MB Total=39.66 MB Peak=39.66 MB Exprs: Total=21.12 KB Peak=21.12 KB HASH_JOIN_NODE (id=2): Reservation=340.00 MB OtherMemory=58.25 KB Total=340.06 MB Peak=340.09 MB Exprs: Total=21.12 KB Peak=21.12 KB Hash Join Builder (join_node_id=2): Total=21.12 KB Peak=29.12 KB Hash Join Builder (join_node_id=2) Exprs: Total=21.12 KB Peak=21.12 KB HDFS_SCAN_NODE (id=0): Total=8.98 MB Peak=9.27 MB EXCHANGE_NODE (id=4): Total=0 Peak=0 DataStreamRecvr: Total=0 Peak=11.65 MB DataStreamSender (dst_id=5): Total=7.52 KB Peak=7.52 KB CodeGen: Total=12.80 KB Peak=2.00 MB Fragment 314343c761a55f97:61ce9aa500000001: Reservation=0 OtherMemory=0 Total=0 Peak=9.34 MB HDFS_SCAN_NODE (id=1): Total=0 Peak=9.32 MB DataStreamSender (dst_id=4): Total=0 Peak=7.52 KB CodeGen: Total=0 Peak=49.00 KBProcess: memory limit exceeded. Limit=512.00 MB Total=512.32 MB Peak=512.32 MB Buffer Pool: Free Buffers: Total=260.00 MB Buffer Pool: Clean Pages: Total=40.00 MB Buffer Pool: Unused Reservation: Total=-300.00 MB RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB RequestPool=root.root: Total=0 Peak=139.93 MB RequestPool=root.cloudera: Total=184.00 B Peak=431.27 KB Query(a34bd5934157257d:2e53c5ce00000000): Reservation=0 ReservationLimit=409.60 MB OtherMemory=184.00 B Total=184.00 B Peak=431.27 KB RequestPool=root.hdfs: Total=422.76 MB Peak=423.60 MB Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB Untracked Memory: Total=89.56 MB WARNING: The following tables are missing relevant table and/or column statistics. airlines_data.airlines_external Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 314343c761a55f97:61ce9aa500000002 Memory left in process limit: -328.00 KB Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB Unclaimed reservations: Reservation=34.00 MB OtherMemory=0 Total=34.00 MB Peak=108.00 MB Fragment 314343c761a55f97:61ce9aa500000000: Reservation=0 OtherMemory=8.00 KB Total=8.00 KB Peak=8.00 KB EXCHANGE_NODE (id=7): Total=0 Peak=0 DataStreamRecvr: Total=0 Peak=0 PLAN_ROOT_SINK: Total=0 Peak=0 CodeGen: Total=0 Peak=0 Fragment 314343c761a55f97:61ce9aa500000003: Reservation=0 OtherMemory=38.51 KB Total=38.51 KB Peak=383.65 KB AGGREGATION_NODE (id=6): Total=21.12 KB Peak=21.12 KB Exprs: Total=21.12 KB Peak=21.12 KB EXCHANGE_NODE (id=5): Total=0 Peak=0 DataStreamRecvr: Total=0 Peak=0 DataStreamSender (dst_id=7): Total=7.52 KB Peak=7.52 KB CodeGen: Total=1.86 KB Peak=347.00 KB Fragment 314343c761a55f97:61ce9aa500000002: Reservation=374.00 MB OtherMemory=14.72 MB Total=388.72 MB Peak=388.72 MB AGGREGATION_NODE (id=3): Reservation=34.00 MB OtherMemory=5.66 MB Total=39.66 MB Peak=39.66 MB Exprs: Total=21.12 KB Peak=21.12 KB HASH_JOIN_NODE (id=2): Reservation=340.00 MB OtherMemory=58.25 KB Total=340.06 MB Peak=340.09 MB Exprs: Total=21.12 KB Peak=21.12 KB Hash Join Builder (join_node_id=2): Total=21.12 KB Peak=29.12 KB Hash Join Builder (join_node_id=2) Exprs: Total=21.12 KB Peak=21.12 KB HDFS_SCAN_NODE (id=0): Total=8.98 MB Peak=9.27 MB EXCHANGE_NODE (id=4): Total=0 Peak=0 DataStreamRecvr: Total=0 Peak=11.65 MB DataStreamSender (dst_id=5): Total=7.52 KB Peak=7.52 KB CodeGen: Total=12.80 KB Peak=2.00 MB Fragment 314343c761a55f97:61ce9aa500000001: Reservation=0 OtherMemory=0 Total=0 Peak=9.34 MB HDFS_SCAN_NODE (id=1): Total=0 Peak=9.32 MB DataStreamSender (dst_id=4): Total=0 Peak=7.52 KB CodeGen: Total=0 Peak=49.00 KBProcess: memory limit exceeded. Limit=512.00 MB Total=512.32 MB Peak=512.32 MB Buffer Pool: Free Buffers: Total=260.00 MB Buffer Pool: Clean Pages: Total=40.00 MB Buffer Pool: Unused Reservation: Total=-300.00 MB RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB RequestPool=root.root: Total=0 Peak=139.93 MB RequestPool=root.cloudera: Total=184.00 B Peak=431.27 KB Query(a34bd5934157257d:2e53c5ce00000000): Reservation=0 ReservationLimit=409.60 MB OtherMemory=184.00 B Total=184.00 B Peak=431.27 KB RequestPool=root.hdfs: Total=422.76 MB Peak=423.60 MB Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB Untracked Memory: Total=89.56 MB (1 of 3 similar)
impala-server:25000可以查看语句执行进度:
CREATE TABLE airlines_data.airlines (month INT, day INT, dayofweek INT, dep_time INT, crs_dep_time INT, arr_time INT, crs_arr_time INT, carrier STRING, flight_num INT, actual_elapsed_time INT, crs_elapsed_time INT, airtime INT, arrdelay INT, depdelay INT, origin STRING, dest STRING, distance INT, taxi_in INT, taxi_out INT, cancelled INT, cancellation_code STRING, diverted INT, carrier_delay INT, weather_delay INT, nas_delay INT, security_delay INT, late_aircraft_delay INT) PARTITIONED BY (year INT) STORED AS PARQUET; INSERT INTO airlines_data.airlines PARTITION (year) SELECT month, day, dayofweek, dep_time, crs_dep_time, arr_time, crs_arr_time, carrier, flight_num, actual_elapsed_time, crs_elapsed_time, airtime, arrdelay, depdelay, origin, dest, distance, taxi_in, taxi_out, cancelled, cancellation_code, diverted, carrier_delay, weather_delay, nas_delay, security_delay, late_aircraft_delay, year FROM airlines_data.airlines_external limit 1000000; -- 1亿的时候内存不足
[quickstart.cloudera:21000] > SHOW TABLE STATS airlines; Query: show TABLE STATS airlines +-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+ | year | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+ | 1990 | 897600 | 1 | 6.62MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=1990 | | 2002 | 49152 | 1 | 527.13KB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=2002 | | 2003 | 53248 | 1 | 603.76KB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=2003 | | Total | 1000000 | 3 | 7.73MB | 0B | | | | | +-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+ Fetched 4 row(s) in 0.14s [quickstart.cloudera:21000] > COMPUTE INCREMENTAL STATS airlines; Query: compute INCREMENTAL STATS airlines WARNINGS: No partitions selected for incremental stats update Fetched 0 row(s) in 0.01s [quickstart.cloudera:21000] > COMPUTE STATS airlines; Query: compute STATS airlines +------------------------------------------+ | summary | +------------------------------------------+ | Updated 3 partition(s) and 27 column(s). | +------------------------------------------+ Fetched 1 row(s) in 2.02s
TPCDS:https://github.com/cloudera/impala-tpcds-kit/tree/master/tpcds-gen(可生成TPC-DS测试数据集、同时包含TPC-DS测试,可生成10TB级别)
官方自带的tpcds-kit customer数据无效了(主要是http://www.tpc.org/tpcds/dsgen/dsgen-download-files.asp链接无效了),脚本在https://github.com/sleberknight/impalascripts-0.6/blob/master/tpcds-setup.sh。
常见问题:
在使用Hadoop建立文件的时候,出现“Cannot create directory /user/hadoop/input. Name node is in safe mode.”问题的原因及解决方案
解决方法:https://www.waitig.com/hadoop-name-node-is-in-safe-mode.html
[cloudera@quickstart ~]$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
put: Permission denied: user=cloudera, access=WRITE, inode="/user/cloudera/sample_data/tab1":hdfs:cloudera:drwxr-xr-x
切换到HDFS用户即可,如下:
-bash-4.1$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
-bash-4.1$ hdfs dfs -put tab2.csv /user/cloudera/sample_data/tab2
-bash-4.1$ hdfs dfs -ls /user/cloudera/sample_data/tab1
Found 1 items
-rw-r--r-- 1 hdfs cloudera 192 2019-04-05 23:06 /user/cloudera/sample_data/tab1/tab1.csv
WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://quickstart.cloudera:8020/user/cloudera/sample_data'
报错分析:
impala-shell运行的时候使用Impala用户,impala对hfds路径没有读写权限;
问题处理:
方法一:对hdfs的目录进行赋权:hadoop fs -chomd -R 777 path
-bash-4.1$ hadoop fs -chmod -R 777 /user/cloudera/sample_data
-bash-4.1$ exit
logout
方法二:创建hadoop用户组,然后将impala加入到hadoop用户组中,同时给impala用户定制权限
[quickstart.cloudera:21000] > CREATE EXTERNAL TABLE airlines_external
> LIKE PARQUET
> 'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
> STORED AS PARQUET LOCATION 'hdfs:staging/airlines';
Query: create EXTERNAL TABLE airlines_external
LIKE PARQUET
'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
STORED AS PARQUET LOCATION 'hdfs:staging/airlines'
ERROR: AnalysisException: null
CAUSED BY: IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq
CAUSED BY: URISyntaxException: Relative path in absolute URI: hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq
解决方法:
路径无效,使用正确的路径,如hdfs:/user/impala/staging/airlines/XXX