测试数据及常见问题
hive 创建测试表
单个分区键
建表 SQL:
-- 创建分区表
CREATE TABLE my_partitioned_table (
id INT,
name STRING
)
PARTITIONED BY (created_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- 创建第一个分区
ALTER TABLE my_partitioned_table ADD PARTITION (created_date='2023-06-01');
-- 创建第二个分区
ALTER TABLE my_partitioned_table ADD PARTITION (created_date='2023-06-02');
-- 向第一个分区插入数据
INSERT INTO TABLE my_partitioned_table PARTITION (created_date='2023-06-01')
VALUES
(1, 'Alice'),
(2, 'Bob'),
(3, 'Charlie'),
(4, 'David'),
(5, 'Eve'),
(6, 'Frank'),
(7, 'Grace'),
(8, 'Heidi'),
(9, 'Ivan'),
(10, 'Judy');
-- 向第二个分区插入数据
INSERT INTO TABLE my_partitioned_table PARTITION (created_date='2023-06-02')
VALUES
(11, 'Mallory'),
(12, 'Niaj'),
(13, 'Olivia'),
(14, 'Peggy'),
(15, 'Quentin'),
(16, 'Rupert'),
(17, 'Sybil'),
(18, 'Trent'),
(19, 'Ursula'),
(20, 'Victor');
多个分区键
建表 SQL:
-- 创建分区表
CREATE TABLE my_partitioned_table (
id INT,
name STRING
)
PARTITIONED BY (created_date STRING, updated_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- 创建第一个分区
ALTER TABLE my_partitioned_table ADD PARTITION (created_date='2023-06-01', updated_date='2023-06-01');
-- 创建第二个分区
ALTER TABLE my_partitioned_table ADD PARTITION (created_date='2023-06-02', updated_date='2023-06-02');
-- 向第一个分区插入数据
INSERT INTO TABLE my_partitioned_table PARTITION (created_date='2023-06-01', updated_date='2023-06-01')
VALUES
(1, 'Alice'),
(2, 'Bob'),
(3, 'Charlie'),
(4, 'David'),
(5, 'Eve'),
(6, 'Frank'),
(7, 'Grace'),
(8, 'Heidi'),
(9, 'Ivan'),
(10, 'Judy');
-- 向第二个分区插入数据
INSERT INTO TABLE my_partitioned_table PARTITION (created_date='2023-06-02', updated_date='2023-06-02')
VALUES
(11, 'Mallory'),
(12, 'Niaj'),
(13, 'Olivia'),
(14, 'Peggy'),
(15, 'Quentin'),
(16, 'Rupert'),
(17, 'Sybil'),
(18, 'Trent'),
(19, 'Ursula'),
(20, 'Victor');
常见问题
-
遇到重复数据更新的配置:
- mysql mysqlwriter 支持参数
writeMode: "update"
,同时也支持参数preSql
- postgresql pgwriter 支持参数
preSql
- 其他方案为先手动执行 delete 语句,然后启动 datax 进程
- mysql mysqlwriter 支持参数
-
通过 impala 写入数据时,可能会遇到写入 1 行数据,查询后得到的却是 2 行数据,其中 1 行为除分区字段外其他字段值为
null
,参考 java - Impala single insert statement creating multiple files - Stack Overflow 并结合 impala 文档描述:You might set the
NUM_NODES
option to 1 briefly, duringINSERT
orCREATE TABLE AS SELECT
statements. Normally, those statements produce one or more data files per data node. If the write operation involves small amounts of data, a Parquet table, and/or a partitioned table, the default behavior could produce many small files when intuitively you might expect only a single output file.SET NUM_NODES=1
turns off the "distributed" aspect of the write operation, making it more likely to produce only one or a few data files.即在执行
inert
语句前,先执行语句SET NUM_NODES = 1
即可,结束后应再次执行SET NUM_NODES = 0
恢复原来的状态。