Loading

测试数据及常见问题

hive 创建测试表

单个分区键

建表 SQL:

-- 创建分区表
CREATE TABLE my_partitioned_table (
    id INT,
    name STRING
) 
PARTITIONED BY (created_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- 创建第一个分区
ALTER TABLE my_partitioned_table ADD PARTITION (created_date='2023-06-01');

-- 创建第二个分区
ALTER TABLE my_partitioned_table ADD PARTITION (created_date='2023-06-02');

-- 向第一个分区插入数据
INSERT INTO TABLE my_partitioned_table PARTITION (created_date='2023-06-01')
VALUES
(1, 'Alice'),
(2, 'Bob'),
(3, 'Charlie'),
(4, 'David'),
(5, 'Eve'),
(6, 'Frank'),
(7, 'Grace'),
(8, 'Heidi'),
(9, 'Ivan'),
(10, 'Judy');

-- 向第二个分区插入数据
INSERT INTO TABLE my_partitioned_table PARTITION (created_date='2023-06-02')
VALUES
(11, 'Mallory'),
(12, 'Niaj'),
(13, 'Olivia'),
(14, 'Peggy'),
(15, 'Quentin'),
(16, 'Rupert'),
(17, 'Sybil'),
(18, 'Trent'),
(19, 'Ursula'),
(20, 'Victor');

多个分区键

建表 SQL:

-- 创建分区表
CREATE TABLE my_partitioned_table (
    id INT,
    name STRING
) 
PARTITIONED BY (created_date STRING, updated_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- 创建第一个分区
ALTER TABLE my_partitioned_table ADD PARTITION (created_date='2023-06-01', updated_date='2023-06-01');

-- 创建第二个分区
ALTER TABLE my_partitioned_table ADD PARTITION (created_date='2023-06-02', updated_date='2023-06-02');

-- 向第一个分区插入数据
INSERT INTO TABLE my_partitioned_table PARTITION (created_date='2023-06-01', updated_date='2023-06-01')
VALUES
(1, 'Alice'),
(2, 'Bob'),
(3, 'Charlie'),
(4, 'David'),
(5, 'Eve'),
(6, 'Frank'),
(7, 'Grace'),
(8, 'Heidi'),
(9, 'Ivan'),
(10, 'Judy');

-- 向第二个分区插入数据
INSERT INTO TABLE my_partitioned_table PARTITION (created_date='2023-06-02', updated_date='2023-06-02')
VALUES
(11, 'Mallory'),
(12, 'Niaj'),
(13, 'Olivia'),
(14, 'Peggy'),
(15, 'Quentin'),
(16, 'Rupert'),
(17, 'Sybil'),
(18, 'Trent'),
(19, 'Ursula'),
(20, 'Victor');

常见问题

  1. 分区字段配置

  2. 遇到重复数据更新的配置:

    • mysql mysqlwriter 支持参数 writeMode: "update",同时也支持参数 preSql
    • postgresql pgwriter 支持参数 preSql
    • 其他方案为先手动执行 delete 语句,然后启动 datax 进程
  3. 通过 impala 写入数据时,可能会遇到写入 1 行数据,查询后得到的却是 2 行数据,其中 1 行为除分区字段外其他字段值为 null,参考 java - Impala single insert statement creating multiple files - Stack Overflow 并结合 impala 文档描述:

    You might set the NUM_NODES option to 1 briefly, during INSERT or CREATE TABLE AS SELECT statements. Normally, those statements produce one or more data files per data node. If the write operation involves small amounts of data, a Parquet table, and/or a partitioned table, the default behavior could produce many small files when intuitively you might expect only a single output file. SET NUM_NODES=1 turns off the "distributed" aspect of the write operation, making it more likely to produce only one or a few data files.

    即在执行 inert 语句前,先执行语句 SET NUM_NODES = 1 即可,结束后应再次执行 SET NUM_NODES = 0 恢复原来的状态。

posted @ 2024-05-30 14:29  kingron  阅读(21)  评论(0编辑  收藏  举报