Hadoop-Impala学习笔记之入门

CDH quickstart vm包含了单节点的全套hadoop服务生态,可从https://www.cloudera.com/downloads/quickstart_vms/5-13.html下载。如下:

 

对应的节点如下(不包含Cloudera Navigator):

 

要学习完整的hadoop生态,最好是使用8C/32GB以上的服务器,4C/16GB勉强能跑、但是很勉强(最好使用2个以上节点)。

impala 使用c++编写(Spark使用Scala编写),采用MPP架构(类似于MariaDB Columnstore,也就是之前的infinidb),由下列组件组成:

 

Hue是一个Web智能查询分析器,能够进行语法提示,查询Impala、HDFS、HBase。如下:

 

其中impala服务器由Impala Daemon(执行SQL)、Impala Statestore(监控Daemon状态)、Impala Catalog(将DDL变更传输给Daemon节点,避免了DDL通过Impala执行时运行REFRESH/INVALIDATE METADATA的必要,通过Hive时,仍然需要)组成。impala-shell和mysql客户端类似,执行SQL。

Impala使用和Hive一样的元数据,其可以存储在mysql或postgresql中,称为metastore。

Impala使用HDFS作为主要的存储底层,利用其冗余特性。

Impala还支持Hbase作为存储,通过定义映射到Hbase的表,可以查询HBase表,还可以关联查询HBase和Impala表。

impala可以使用Cloudera Manager或命令行启动:

Cloudera Manager启动如下:

命令行启动(这种方式启动CM是无法监控到其状态的,而且进程也略有不同):

service impala-state-store start/restart/stop

service impala-catalog start/restart/stop

service impala-server start/restart/stop

CM启动后进程如下:

 

日志信息位于/var/log/impala,如下:

配置可通过CM修改、也可以修改配置文件/etc/default/impala。

 impala客户端

[root@quickstart impala]# impala-shell 
Starting Impala Shell without Kerberos authentication
Connected to quickstart.cloudera:21000
Server version: impalad version 2.10.0-cdh5.13.0 RELEASE (build 2511805f1eaa991df1460276c7e9f19d819cd4e4)
***********************************************************************************
Welcome to the Impala shell.
(Impala Shell v2.10.0-cdh5.13.0 (2511805) built on Wed Oct  4 10:55:37 PDT 2017)

The HISTORY command lists all shell commands in chronological order.
***********************************************************************************
[quickstart.cloudera:21000] > 

全新的Impala实例包含2个库:default(新创建表的默认库)以及_impala_builtins。

可通过show database/show table/select version()查看数据库信息(语法兼容SQL 92/MySQL,大部分NoSQL的参考实现,不同于Oracle)

[quickstart.cloudera:21000] > select version();
Query: select version()
Query submitted at: 2019-04-06 04:59:25 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=ef4cbbf93a7662e5:f4e103f500000000
+-------------------------------------------------------------------------------------------+
| version()                                                                                 |
+-------------------------------------------------------------------------------------------+
| impalad version 2.10.0-cdh5.13.0 RELEASE (build 2511805f1eaa991df1460276c7e9f19d819cd4e4) |
| Built on Wed Oct  4 10:55:37 PDT 2017                                                     |
+-------------------------------------------------------------------------------------------+
Fetched 1 row(s) in 0.34s
[quickstart.cloudera:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
+------------------+----------------------------------------------+
Fetched 2 row(s) in 0.01s
[quickstart.cloudera:21000] > show tables;
Query: show tables
+------+
| name |
+------+
| tab1 |
| tab2 |
| tab3 |
+------+
Fetched 3 row(s) in 0.02s
[quickstart.cloudera:21000] > select * from tab1;
Query: select * from tab1
Query submitted at: 2019-04-06 05:05:57 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=dd4debde8a589622:81983b200000000
+----+-------+------------+-------------------------------+
| id | col_1 | col_2      | col_3                         |
+----+-------+------------+-------------------------------+
| 1  | true  | 123.123    | 2012-10-24 08:55:00           |
| 2  | false | 1243.5     | 2012-10-25 13:40:00           |
| 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |
| 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5  | true  | 243.325    | 1953-04-22 09:11:33           |
+----+-------+------------+-------------------------------+
Fetched 5 row(s) in 6.06s  -- 第一次访问特别慢,因为需要加载到内存
[quickstart.cloudera:21000] > select * from tab1;
Query: select * from tab1
Query submitted at: 2019-04-06 05:06:05 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=21486d50db995bf2:a2d061b400000000
+----+-------+------------+-------------------------------+
| id | col_1 | col_2      | col_3                         |
+----+-------+------------+-------------------------------+
| 1  | true  | 123.123    | 2012-10-24 08:55:00           |
| 2  | false | 1243.5     | 2012-10-25 13:40:00           |
| 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |
| 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5  | true  | 243.325    | 1953-04-22 09:11:33           |
+----+-------+------------+-------------------------------+
Fetched 5 row(s) in 0.26s
[quickstart.cloudera:21000] > desc tab1;
Query: describe tab1
+-------+-----------+---------+
| name  | type      | comment |
+-------+-----------+---------+
| id    | int       |         |
| col_1 | boolean   |         |
| col_2 | double    |         |
| col_3 | timestamp |         |
+-------+-----------+---------+
Fetched 4 row(s) in 0.03s
[quickstart.cloudera:21000] > select count(1) from tab1;
Query: select count(1) from tab1
Query submitted at: 2019-04-06 05:06:28 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=9141cf84d4efd2e3:e667f1d00000000
+----------+
| count(1) |
+----------+
| 5        |
+----------+
Fetched 1 row(s) in 0.17s
[quickstart.cloudera:21000] > create database my_first_impala_db;
Query: create database my_first_impala_db
Fetched 0 row(s) in 0.08s
[quickstart.cloudera:21000] > create table t1 (x int);
Query: create table t1 (x int)
Fetched 0 row(s) in 0.08s
[quickstart.cloudera:21000] > insert into t1 values (1), (3), (2), (4); --支持mysql语法
Query: insert into t1 values (1), (3), (2), (4)
Query submitted at: 2019-04-06 05:07:32 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=8c41d97919423a19:1f7aeeeb00000000
Modified 4 row(s) in 5.17s
[quickstart.cloudera:21000] > insert into t1 select * from t1;  --支持insert select
Query: insert into t1 select * from t1
Query submitted at: 2019-04-06 05:07:45 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=9c452b951b08da49:92d9fee100000000
Modified 4 row(s) in 0.23s

创建基于HDFS文件的表(需要先创建HDFS文件,并加载数据,参见Hadoop-HDFS学习笔记):

[quickstart.cloudera:21000] > DROP TABLE IF EXISTS tab2;
Query: drop TABLE IF EXISTS tab2
[quickstart.cloudera:21000] > CREATE EXTERNAL TABLE tab2
                            > (
                            >    id INT,
                            >    col_1 BOOLEAN,
                            >    col_2 DOUBLE
                            > )
                            > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
                            > LOCATION '/user/cloudera/sample_data/tab2';
Query: create EXTERNAL TABLE tab2
(
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab2'
Fetched 0 row(s) in 0.09s
[quickstart.cloudera:21000] > DROP TABLE IF EXISTS tab3;
Query: drop TABLE IF EXISTS tab3
[quickstart.cloudera:21000] > CREATE TABLE tab3
                            > (
                            >    id INT,
                            >    col_1 BOOLEAN,
                            >    col_2 DOUBLE,
                            >    month INT,
                            >    day INT
                            > )
                            > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Query: create TABLE tab3
(
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE,
   month INT,
   day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
Fetched 0 row(s) in 0.09s

 包含关联、子查询、聚合的SQL查询:

SELECT tab2.*
FROM tab2,
(SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
FROM tab2, tab1
WHERE tab1.id = tab2.id
GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;

查看SQL的执行计划:

explain SELECT tab2.*
FROM tab2,
(SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
FROM tab2, tab1
WHERE tab1.id = tab2.id
GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=136.00MB                                 |
| Per-Host Resource Estimates: Memory=4.34GB                                         |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| default.tab1, default.tab2                                                         |
|                                                                                    |
| PLAN-ROOT SINK                                                                     |
| |                                                                                  |
| 10:EXCHANGE [UNPARTITIONED]                                                        |
| |                                                                                  |
| 05:HASH JOIN [INNER JOIN, BROADCAST]                                               |
| |  hash predicates: tab2.col_2 = max(tab2.col_2)                                   |
| |  runtime filters: RF000 <- max(tab2.col_2)                                       |
| |                                                                                  |
| |--09:EXCHANGE [BROADCAST]                                                         |
| |  |                                                                               |
| |  08:AGGREGATE [FINALIZE]                                                         |
| |  |  output: max:merge(tab2.col_2)                                                |
| |  |  group by: tab1.col_1                                                         |
| |  |                                                                               |
| |  07:EXCHANGE [HASH(tab1.col_1)]                                                  |
| |  |                                                                               |
| |  04:AGGREGATE [STREAMING]                                                        |
| |  |  output: max(tab2.col_2)                                                      |
| |  |  group by: tab1.col_1                                                         |
| |  |                                                                               |
| |  03:HASH JOIN [INNER JOIN, BROADCAST]                                            |
| |  |  hash predicates: tab2.id = tab1.id                                           |
| |  |  runtime filters: RF001 <- tab1.id                                            |
| |  |                                                                               |
| |  |--06:EXCHANGE [BROADCAST]                                                      |
| |  |  |                                                                            |
| |  |  02:SCAN HDFS [default.tab1]                                                  |
| |  |     partitions=1/1 files=1 size=192B                                          |
| |  |                                                                               |
| |  01:SCAN HDFS [default.tab2]                                                     |
| |     partitions=1/1 files=1 size=158B                                             |
| |     runtime filters: RF001 -> tab2.id                                            |
| |                                                                                  |
| 00:SCAN HDFS [default.tab2]                                                        |
|    partitions=1/1 files=1 size=158B                                                |
|    runtime filters: RF000 -> tab2.col_2                                            |
+------------------------------------------------------------------------------------+
Fetched 41 row(s) in 0.05s

创建基于Parquet的文件,并转换为内部分区表

[quickstart.cloudera:21000] > USE airlines_data;
Query: use airlines_data
[quickstart.cloudera:21000] > CREATE EXTERNAL TABLE airlines_external
                            > LIKE PARQUET
                            > 'hdfs:/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
                            > STORED AS PARQUET LOCATION 'hdfs:/user/impala/staging/airlines';
Query: create EXTERNAL TABLE airlines_external
LIKE PARQUET
'hdfs:/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
STORED AS PARQUET LOCATION 'hdfs:/user/impala/staging/airlines'
WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://quickstart.cloudera:8020/user/impala/staging'

Fetched 0 row(s) in 0.82s
[quickstart.cloudera:21000] > SHOW TABLE STATS airlines_external;
Query: show TABLE STATS airlines_external
+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                     |
+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+
| -1    | 8      | 1.34GB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines |
+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+
Fetched 1 row(s) in 3.89s
[quickstart.cloudera:21000] > SHOW FILES IN airlines_external;
Query: show FILES IN airlines_external
+-----------------------------------------------------------------------------------------------------------------------+----------+-----------+
| Path                                                                                                                  | Size     | Partition |
+-----------------------------------------------------------------------------------------------------------------------+----------+-----------+
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq | 252.99MB |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq | 13.43MB  |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq  | 252.84MB |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq  | 63.92MB  |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq | 183.64MB |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq | 240.04MB |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq | 211.35MB |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq | 151.46MB |           |
+-----------------------------------------------------------------------------------------------------------------------+----------+-----------+
Fetched 8 row(s) in 0.04s
[quickstart.cloudera:21000] > DESCRIBE airlines_external;
Query: describe airlines_external
+---------------------+--------+-----------------------------+
| name                | type   | comment                     |
+---------------------+--------+-----------------------------+
| year                | int    | Inferred from Parquet file. |
| month               | int    | Inferred from Parquet file. |
| day                 | int    | Inferred from Parquet file. |
| dayofweek           | int    | Inferred from Parquet file. |
| dep_time            | int    | Inferred from Parquet file. |
| crs_dep_time        | int    | Inferred from Parquet file. |

-- 单表查询速度还是不错的。。。

[quickstart.cloudera:21000] > SELECT COUNT(*) FROM airlines_external;
Query: select COUNT(*) FROM airlines_external
Query submitted at: 2019-04-06 07:08:33 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=a04cd594518cba3a:1666b4e00000000
+-----------+
| count(*) |
+-----------+
| 123534969 |
+-----------+
Fetched 1 row(s) in 0.33s
[quickstart.cloudera:21000] > SElECT NDV(carrier), NDV(flight_num), NDV(tail_num),
> NDV(origin), NDV(dest) FROM airlines_external;
Query: select NDV(carrier), NDV(flight_num), NDV(tail_num),
NDV(origin), NDV(dest) FROM airlines_external
Query submitted at: 2019-04-06 07:08:53 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=df4776d3fa8c1f69:e4e09e6100000000
+--------------+-----------------+---------------+-------------+-----------+
| ndv(carrier) | ndv(flight_num) | ndv(tail_num) | ndv(origin) | ndv(dest) |
+--------------+-----------------+---------------+-------------+-----------+
| 29 | 8463 | 3 | 342 | 349 |
+--------------+-----------------+---------------+-------------+-----------+
Fetched 1 row(s) in 9.33s
[quickstart.cloudera:21000] > SELECT tail_num, COUNT(*) AS howmany FROM airlines_external
> GROUP BY tail_num;
Query: select tail_num, COUNT(*) AS howmany FROM airlines_external
GROUP BY tail_num
Query submitted at: 2019-04-06 07:09:19 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=7d43af89d18c600e:bc464f0f00000000
+----------+-----------+
| tail_num | howmany |
+----------+-----------+
| 715 | 1 |
| 0 | 406405 |
| 112 | 6562 |
| NULL | 123122001 |
+----------+-----------+
Fetched 4 row(s) in 2.42s

-- 大表关联,内存不足报错
[quickstart.cloudera:21000] > SELECT DISTINCT dest FROM airlines_external
                            > WHERE dest NOT IN (SELECT origin FROM airlines_external);
Query: select DISTINCT dest FROM airlines_external
WHERE dest NOT IN (SELECT origin FROM airlines_external)
Query submitted at: 2019-04-06 07:09:35 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=314343c761a55f97:61ce9aa500000000
WARNINGS: Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 314343c761a55f97:61ce9aa500000002
Memory left in process limit: -328.00 KB
Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB
  Unclaimed reservations: Reservation=34.00 MB OtherMemory=0 Total=34.00 MB Peak=108.00 MB
  Fragment 314343c761a55f97:61ce9aa500000000: Reservation=0 OtherMemory=8.00 KB Total=8.00 KB Peak=8.00 KB
    EXCHANGE_NODE (id=7): Total=0 Peak=0
    DataStreamRecvr: Total=0 Peak=0
    PLAN_ROOT_SINK: Total=0 Peak=0
    CodeGen: Total=0 Peak=0
  Fragment 314343c761a55f97:61ce9aa500000003: Reservation=0 OtherMemory=38.51 KB Total=38.51 KB Peak=383.65 KB
    AGGREGATION_NODE (id=6): Total=21.12 KB Peak=21.12 KB
      Exprs: Total=21.12 KB Peak=21.12 KB
    EXCHANGE_NODE (id=5): Total=0 Peak=0
    DataStreamRecvr: Total=0 Peak=0
    DataStreamSender (dst_id=7): Total=7.52 KB Peak=7.52 KB
    CodeGen: Total=1.86 KB Peak=347.00 KB
  Fragment 314343c761a55f97:61ce9aa500000002: Reservation=374.00 MB OtherMemory=14.72 MB Total=388.72 MB Peak=388.72 MB
    AGGREGATION_NODE (id=3): Reservation=34.00 MB OtherMemory=5.66 MB Total=39.66 MB Peak=39.66 MB
      Exprs: Total=21.12 KB Peak=21.12 KB
    HASH_JOIN_NODE (id=2): Reservation=340.00 MB OtherMemory=58.25 KB Total=340.06 MB Peak=340.09 MB
      Exprs: Total=21.12 KB Peak=21.12 KB
      Hash Join Builder (join_node_id=2): Total=21.12 KB Peak=29.12 KB
        Hash Join Builder (join_node_id=2) Exprs: Total=21.12 KB Peak=21.12 KB
    HDFS_SCAN_NODE (id=0): Total=8.98 MB Peak=9.27 MB
    EXCHANGE_NODE (id=4): Total=0 Peak=0
    DataStreamRecvr: Total=0 Peak=11.65 MB
    DataStreamSender (dst_id=5): Total=7.52 KB Peak=7.52 KB
    CodeGen: Total=12.80 KB Peak=2.00 MB
  Fragment 314343c761a55f97:61ce9aa500000001: Reservation=0 OtherMemory=0 Total=0 Peak=9.34 MB
    HDFS_SCAN_NODE (id=1): Total=0 Peak=9.32 MB
    DataStreamSender (dst_id=4): Total=0 Peak=7.52 KB
    CodeGen: Total=0 Peak=49.00 KBProcess: memory limit exceeded. Limit=512.00 MB Total=512.32 MB Peak=512.32 MB
  Buffer Pool: Free Buffers: Total=260.00 MB
  Buffer Pool: Clean Pages: Total=40.00 MB
  Buffer Pool: Unused Reservation: Total=-300.00 MB
  RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
  RequestPool=root.root: Total=0 Peak=139.93 MB
  RequestPool=root.cloudera: Total=184.00 B Peak=431.27 KB
    Query(a34bd5934157257d:2e53c5ce00000000): Reservation=0 ReservationLimit=409.60 MB OtherMemory=184.00 B Total=184.00 B Peak=431.27 KB
  RequestPool=root.hdfs: Total=422.76 MB Peak=423.60 MB
    Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB
  Untracked Memory: Total=89.56 MB

WARNING: The following tables are missing relevant table and/or column statistics.
airlines_data.airlines_external
Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 314343c761a55f97:61ce9aa500000002
Memory left in process limit: -328.00 KB
Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB
  Unclaimed reservations: Reservation=34.00 MB OtherMemory=0 Total=34.00 MB Peak=108.00 MB
  Fragment 314343c761a55f97:61ce9aa500000000: Reservation=0 OtherMemory=8.00 KB Total=8.00 KB Peak=8.00 KB
    EXCHANGE_NODE (id=7): Total=0 Peak=0
    DataStreamRecvr: Total=0 Peak=0
    PLAN_ROOT_SINK: Total=0 Peak=0
    CodeGen: Total=0 Peak=0
  Fragment 314343c761a55f97:61ce9aa500000003: Reservation=0 OtherMemory=38.51 KB Total=38.51 KB Peak=383.65 KB
    AGGREGATION_NODE (id=6): Total=21.12 KB Peak=21.12 KB
      Exprs: Total=21.12 KB Peak=21.12 KB
    EXCHANGE_NODE (id=5): Total=0 Peak=0
    DataStreamRecvr: Total=0 Peak=0
    DataStreamSender (dst_id=7): Total=7.52 KB Peak=7.52 KB
    CodeGen: Total=1.86 KB Peak=347.00 KB
  Fragment 314343c761a55f97:61ce9aa500000002: Reservation=374.00 MB OtherMemory=14.72 MB Total=388.72 MB Peak=388.72 MB
    AGGREGATION_NODE (id=3): Reservation=34.00 MB OtherMemory=5.66 MB Total=39.66 MB Peak=39.66 MB
      Exprs: Total=21.12 KB Peak=21.12 KB
    HASH_JOIN_NODE (id=2): Reservation=340.00 MB OtherMemory=58.25 KB Total=340.06 MB Peak=340.09 MB
      Exprs: Total=21.12 KB Peak=21.12 KB
      Hash Join Builder (join_node_id=2): Total=21.12 KB Peak=29.12 KB
        Hash Join Builder (join_node_id=2) Exprs: Total=21.12 KB Peak=21.12 KB
    HDFS_SCAN_NODE (id=0): Total=8.98 MB Peak=9.27 MB
    EXCHANGE_NODE (id=4): Total=0 Peak=0
    DataStreamRecvr: Total=0 Peak=11.65 MB
    DataStreamSender (dst_id=5): Total=7.52 KB Peak=7.52 KB
    CodeGen: Total=12.80 KB Peak=2.00 MB
  Fragment 314343c761a55f97:61ce9aa500000001: Reservation=0 OtherMemory=0 Total=0 Peak=9.34 MB
    HDFS_SCAN_NODE (id=1): Total=0 Peak=9.32 MB
    DataStreamSender (dst_id=4): Total=0 Peak=7.52 KB
    CodeGen: Total=0 Peak=49.00 KBProcess: memory limit exceeded. Limit=512.00 MB Total=512.32 MB Peak=512.32 MB
  Buffer Pool: Free Buffers: Total=260.00 MB
  Buffer Pool: Clean Pages: Total=40.00 MB
  Buffer Pool: Unused Reservation: Total=-300.00 MB
  RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
  RequestPool=root.root: Total=0 Peak=139.93 MB
  RequestPool=root.cloudera: Total=184.00 B Peak=431.27 KB
    Query(a34bd5934157257d:2e53c5ce00000000): Reservation=0 ReservationLimit=409.60 MB OtherMemory=184.00 B Total=184.00 B Peak=431.27 KB
  RequestPool=root.hdfs: Total=422.76 MB Peak=423.60 MB
    Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB
  Untracked Memory: Total=89.56 MB (1 of 3 similar)

impala-server:25000可以查看语句执行进度:

CREATE TABLE airlines_data.airlines
(month INT,
day INT,
dayofweek INT,
dep_time INT,
crs_dep_time INT,
arr_time INT,
crs_arr_time INT,
carrier STRING,
flight_num INT,
actual_elapsed_time INT,
crs_elapsed_time INT,
airtime INT,
arrdelay INT,
depdelay INT,
origin STRING,
dest STRING,
distance INT,
taxi_in INT,
taxi_out INT,
cancelled INT,
cancellation_code STRING,
diverted INT,
carrier_delay INT,
weather_delay INT,
nas_delay INT,
security_delay INT,
late_aircraft_delay INT)
PARTITIONED BY (year INT)
STORED AS PARQUET;

INSERT INTO airlines_data.airlines
PARTITION (year)
SELECT
month,
day,
dayofweek,
dep_time,
crs_dep_time,
arr_time,
crs_arr_time,
carrier,
flight_num,
actual_elapsed_time,
crs_elapsed_time,
airtime,
arrdelay,
depdelay,
origin,
dest,
distance,
taxi_in,
taxi_out,
cancelled,
cancellation_code,
diverted,
carrier_delay,
weather_delay,
nas_delay,
security_delay,
late_aircraft_delay,
year
FROM airlines_data.airlines_external limit 1000000; -- 1亿的时候内存不足
[quickstart.cloudera:21000] > SHOW TABLE STATS airlines;
Query: show TABLE STATS airlines
+-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+
| year  | #Rows   | #Files | Size     | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                                |
+-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+
| 1990  | 897600  | 1      | 6.62MB   | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=1990 |
| 2002  | 49152   | 1      | 527.13KB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=2002 |
| 2003  | 53248   | 1      | 603.76KB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=2003 |
| Total | 1000000 | 3      | 7.73MB   | 0B           |                   |         |                   |                                                                                         |
+-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+
Fetched 4 row(s) in 0.14s
[quickstart.cloudera:21000] > COMPUTE INCREMENTAL STATS airlines;
Query: compute INCREMENTAL STATS airlines
WARNINGS: No partitions selected for incremental stats update

Fetched 0 row(s) in 0.01s
[quickstart.cloudera:21000] > COMPUTE STATS airlines;
Query: compute STATS airlines
+------------------------------------------+
| summary                                  |
+------------------------------------------+
| Updated 3 partition(s) and 27 column(s). |
+------------------------------------------+
Fetched 1 row(s) in 2.02s

 

TPCDS:https://github.com/cloudera/impala-tpcds-kit/tree/master/tpcds-gen(可生成TPC-DS测试数据集、同时包含TPC-DS测试,可生成10TB级别)

官方自带的tpcds-kit customer数据无效了(主要是http://www.tpc.org/tpcds/dsgen/dsgen-download-files.asp链接无效了),脚本在https://github.com/sleberknight/impalascripts-0.6/blob/master/tpcds-setup.sh。

 常见问题:

在使用Hadoop建立文件的时候,出现“Cannot create directory /user/hadoop/input. Name node is in safe mode.”问题的原因及解决方案
解决方法:https://www.waitig.com/hadoop-name-node-is-in-safe-mode.html

[cloudera@quickstart ~]$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
put: Permission denied: user=cloudera, access=WRITE, inode="/user/cloudera/sample_data/tab1":hdfs:cloudera:drwxr-xr-x
切换到HDFS用户即可,如下:
-bash-4.1$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
-bash-4.1$ hdfs dfs -put tab2.csv /user/cloudera/sample_data/tab2
-bash-4.1$ hdfs dfs -ls /user/cloudera/sample_data/tab1
Found 1 items
-rw-r--r-- 1 hdfs cloudera 192 2019-04-05 23:06 /user/cloudera/sample_data/tab1/tab1.csv

WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://quickstart.cloudera:8020/user/cloudera/sample_data'
报错分析:
impala-shell运行的时候使用Impala用户,impala对hfds路径没有读写权限;

问题处理:
方法一:对hdfs的目录进行赋权:hadoop fs -chomd -R 777 path
-bash-4.1$ hadoop fs -chmod -R 777 /user/cloudera/sample_data
-bash-4.1$ exit
logout

方法二:创建hadoop用户组,然后将impala加入到hadoop用户组中,同时给impala用户定制权限

[quickstart.cloudera:21000] > CREATE EXTERNAL TABLE airlines_external
> LIKE PARQUET
> 'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
> STORED AS PARQUET LOCATION 'hdfs:staging/airlines';
Query: create EXTERNAL TABLE airlines_external
LIKE PARQUET
'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
STORED AS PARQUET LOCATION 'hdfs:staging/airlines'
ERROR: AnalysisException: null
CAUSED BY: IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq
CAUSED BY: URISyntaxException: Relative path in absolute URI: hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq

解决方法:

路径无效,使用正确的路径,如hdfs:/user/impala/staging/airlines/XXX

posted @ 2019-04-06 22:43  zhjh256  阅读(2082)  评论(0编辑  收藏  举报