测试数据及常见问题

hive 创建测试表

单个分区键

建表 SQL：

-- 创建分区表
CREATE TABLE my_partitioned_table (
    id INT,
    name STRING
) 
PARTITIONED BY (created_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- 创建第一个分区
ALTER TABLE my_partitioned_table ADD PARTITION (created_date='2023-06-01');

-- 创建第二个分区
ALTER TABLE my_partitioned_table ADD PARTITION (created_date='2023-06-02');

-- 向第一个分区插入数据
INSERT INTO TABLE my_partitioned_table PARTITION (created_date='2023-06-01')
VALUES
(1, 'Alice'),
(2, 'Bob'),
(3, 'Charlie'),
(4, 'David'),
(5, 'Eve'),
(6, 'Frank'),
(7, 'Grace'),
(8, 'Heidi'),
(9, 'Ivan'),
(10, 'Judy');

-- 向第二个分区插入数据
INSERT INTO TABLE my_partitioned_table PARTITION (created_date='2023-06-02')
VALUES
(11, 'Mallory'),
(12, 'Niaj'),
(13, 'Olivia'),
(14, 'Peggy'),
(15, 'Quentin'),
(16, 'Rupert'),
(17, 'Sybil'),
(18, 'Trent'),
(19, 'Ursula'),
(20, 'Victor');

多个分区键

建表 SQL：

-- 创建分区表
CREATE TABLE my_partitioned_table (
    id INT,
    name STRING
) 
PARTITIONED BY (created_date STRING, updated_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- 创建第一个分区
ALTER TABLE my_partitioned_table ADD PARTITION (created_date='2023-06-01', updated_date='2023-06-01');

-- 创建第二个分区
ALTER TABLE my_partitioned_table ADD PARTITION (created_date='2023-06-02', updated_date='2023-06-02');

-- 向第一个分区插入数据
INSERT INTO TABLE my_partitioned_table PARTITION (created_date='2023-06-01', updated_date='2023-06-01')
VALUES
(1, 'Alice'),
(2, 'Bob'),
(3, 'Charlie'),
(4, 'David'),
(5, 'Eve'),
(6, 'Frank'),
(7, 'Grace'),
(8, 'Heidi'),
(9, 'Ivan'),
(10, 'Judy');

-- 向第二个分区插入数据
INSERT INTO TABLE my_partitioned_table PARTITION (created_date='2023-06-02', updated_date='2023-06-02')
VALUES
(11, 'Mallory'),
(12, 'Niaj'),
(13, 'Olivia'),
(14, 'Peggy'),
(15, 'Quentin'),
(16, 'Rupert'),
(17, 'Sybil'),
(18, 'Trent'),
(19, 'Ursula'),
(20, 'Victor');

常见问题

分区字段配置
遇到重复数据更新的配置：
- mysql mysqlwriter 支持参数 writeMode: "update"，同时也支持参数 preSql
- postgresql pgwriter 支持参数 preSql
- 其他方案为先手动执行 delete 语句，然后启动 datax 进程
通过 impala 写入数据时，可能会遇到写入 1 行数据，查询后得到的却是 2 行数据，其中 1 行为除分区字段外其他字段值为 null，参考 java - Impala single insert statement creating multiple files - Stack Overflow 并结合 impala 文档描述：

You might set the NUM_NODES option to 1 briefly, during INSERT or CREATE TABLE AS SELECT statements. Normally, those statements produce one or more data files per data node. If the write operation involves small amounts of data, a Parquet table, and/or a partitioned table, the default behavior could produce many small files when intuitively you might expect only a single output file. SET NUM_NODES=1 turns off the "distributed" aspect of the write operation, making it more likely to produce only one or a few data files.

即在执行 inert 语句前，先执行语句 SET NUM_NODES = 1 即可，结束后应再次执行 SET NUM_NODES = 0 恢复原来的状态。

posted @ 2024-05-30 14:29 kingron 阅读(21) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

Loading

WONDERING

everything in the world~

测试数据及常见问题

hive 创建测试表

单个分区键

多个分区键

常见问题

公告