doris数据导入-严格模式
转载自:https://zhuanlan.zhihu.com/p/621836873
1、严格模式
对导入中的数据列类型转换进行严格过滤。
列类型转换后为 null 的数据将被过滤。
不含函数计算为 null 的数据。
(1)设置方式
代理加载 Broker Load 通过属性 "strict_mode"="true" 设置:
LOAD LABEL example_db.label1
(
DATA INFILE("bos://my_bucket/input/file.txt")
INTO TABLE `my_table`
COLUMNS TERMINATED BY ","
)
WITH BROKER bos
(
"bos_endpoint" = "http://bj.bcebos.com",
"bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
"bos_secret_accesskey"="yyyyyyyyyyyyyyyyyyyyyyyyyy"
)
PROPERTIES
(
"strict_mode" = "true"
)
流式加载 Stream Load 通过头 -H "strict_mode: true" 设置:
curl --location-trusted -u user:passwd \
-H "strict_mode: true" \
-T 1.txt \
http://host:port/api/example_db/my_table/_stream_load
插入数据通过会话变量设置:
SET enable_insert_strict = true;
INSERT INTO my_table ...;
(2)关闭严格模式
建表
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> CREATE TABLE `example_strict_1` (
> `k1` int COMMENT "第一列",
> `k2` TinyInt COMMENT "第二列",
> `k3` double COMMENT "第三列"
> )
> DISTRIBUTED BY HASH(k1) BUCKETS 2
> PROPERTIES (
> "replication_num" = "1"
> );'
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]#
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> desc example_strict_1;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------+---------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------+------+-------+---------+-------+
| k1 | INT | Yes | true | NULL | |
| k2 | TINYINT | Yes | true | NULL | |
| k3 | DOUBLE | Yes | false | NULL | NONE |
+-------+---------+------+-------+---------+-------+
[root@vm30 ~]#
造数
[root@vm30 ~]# cat > example_strict_1 <<EOL
> 1,\\N,1.1
> 2,abc,1.2
> 3,2000,1.3
> 4,1,1.4
> EOL
[root@vm30 ~]# cat example_strict_1
1,\N,1.1
2,abc,1.2
3,2000,1.3
4,1,1.4
[root@vm30 ~]#
流式加载 - 默认关闭严格模式
[root@vm30 ~]# curl --location-trusted \
> -u root:123456 \
> -H "column_separator:," \
> -T example_strict_1 \
> http://vm30:8030/api/demo/example_strict_1/_stream_load
{
"TxnId": 2024,
"Label": "5bd983e8-0bcf-4fa7-a835-60fd8ec1d4e2",
"TwoPhaseCommit": "false",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 4,
"NumberLoadedRows": 4,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 38,
"LoadTimeMs": 147,
"BeginTxnTimeMs": 3,
"StreamLoadPutTimeMs": 12,
"ReadDataTimeMs": 0,
"WriteDataTimeMs": 59,
"CommitAndPublishTimeMs": 69
}
[root@vm30 ~]#
查询
关闭严格模式下 abc 和 2000 都会转换为空值 null。
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> select * from example_strict_1;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+------+------+------+
| k1 | k2 | k3 |
+------+------+------+
| 3 | NULL | 1.3 |
| 1 | NULL | 1.1 |
| 2 | NULL | 1.2 |
| 4 | 1 | 1.4 |
+------+------+------+
[root@vm30 ~]# cat example_strict_1
1,\N,1.1
2,abc,1.2
3,2000,1.3
4,1,1.4
[root@vm30 ~]#
总结一下:
表结构中,k1
字段类型是 int,k2
字段类型是 TinyInt,k3
字段类型是 double
但是构造的数据中,
# cat example_strict_1
1,\N,1.1
2,abc,1.2
3,2000,1.3
4,1,1.4
第一行的第二列也就是k2字段的值是\N,不符合TinyInt类型
第二行的第二列也就是k2字段的值是abc, 不符合TinyInt类型
第三行的第二列也就是k2字段的值是2000, 不符合TinyInt类型
因为流式加载是默认关闭严格模式的,因此关闭严格模式下 abc 和 2000 都会转换为空值 null。
abc 及 2000 在转换为 TinyInt 后,会因类型或精度问题变为 NULL。在严格模式开启的情况下,这类数据将会被过滤。而如果是关闭状态,则会导入 null。
(3)开启严格模式
建表
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> CREATE TABLE `example_strict_2` (
> `k1` int COMMENT "第一列",
> `k2` TinyInt COMMENT "第二列",
> `k3` double COMMENT "第三列"
> )
> DISTRIBUTED BY HASH(k1) BUCKETS 2
> PROPERTIES (
> "replication_num" = "1"
> );'
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]#
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> desc example_strict_2;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------+---------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------+------+-------+---------+-------+
| k1 | INT | Yes | true | NULL | |
| k2 | TINYINT | Yes | true | NULL | |
| k3 | DOUBLE | Yes | false | NULL | NONE |
+-------+---------+------+-------+---------+-------+
[root@vm30 ~]#
流式加载 - 开启严格模式
[root@vm30 ~]# curl --location-trusted \
> -u root:123456 \
> -H "column_separator:," \
> -H "strict_mode: true" \
> -T example_strict_1 \
> http://vm30:8030/api/demo/example_strict_2/_stream_load
{
"TxnId": 2025,
"Label": "6e86c67f-6496-46d2-a09a-ead19b5ea23c",
"TwoPhaseCommit": "false",
"Status": "Fail",
"Message": "[INTERNAL_ERROR]too many filtered rows",
"NumberTotalRows": 4,
"NumberLoadedRows": 2,
"NumberFilteredRows": 2,
"NumberUnselectedRows": 0,
"LoadBytes": 38,
"LoadTimeMs": 64,
"BeginTxnTimeMs": 2,
"StreamLoadPutTimeMs": 8,
"ReadDataTimeMs": 0,
"WriteDataTimeMs": 50,
"CommitAndPublishTimeMs": 0,
"ErrorURL": "http://10.10.1.31:8040/api/_load_error_log?file=__shard_5/error_log_insert_stmt_f2411f67383594aa-ee11744d7d3da2a3_f2411f67383594aa_ee11744d7d3da2a3"
}
[root@vm30 ~]#
访问错误地址页面显示:
Reason: column(k2) value is incorrect while strict mode is true, src value is 1.2. src line [2 abc 1.2];
Reason: column(k2) value is incorrect while strict mode is true, src value is 1.3. src line [3 2000 1.3];
查询
表未加载任何数据。
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> select * from example_strict_2;'
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]#
多造点数据
[root@vm30 ~]# cat > example_strict_2 <<EOL
> 1,\\N,1.1
> 2,abc,1.2
> 3,2000,1.3
> 4,1,1.4
> 5,2,1.5
> 6,3,1.6
> 7,4,1.7
> EOL
[root@vm30 ~]# cat example_strict_2
1,\N,1.1
2,abc,1.2
3,2000,1.3
4,1,1.4
5,2,1.5
6,3,1.6
7,4,1.7
[root@vm30 ~]#
流式加载
[root@vm30 ~]# curl --location-trusted \
> -u root:123456 \
> -H "column_separator:," \
> -H "strict_mode: true" \
> -T example_strict_2 \
> http://vm30:8030/api/demo/example_strict_2/_stream_load
{
"TxnId": 2026,
"Label": "8aac4480-b126-4469-8304-5302f087ef28",
"TwoPhaseCommit": "false",
"Status": "Fail",
"Message": "[INTERNAL_ERROR]too many filtered rows",
"NumberTotalRows": 7,
"NumberLoadedRows": 5,
"NumberFilteredRows": 2,
"NumberUnselectedRows": 0,
"LoadBytes": 62,
"LoadTimeMs": 86,
"BeginTxnTimeMs": 5,
"StreamLoadPutTimeMs": 16,
"ReadDataTimeMs": 0,
"WriteDataTimeMs": 61,
"CommitAndPublishTimeMs": 0,
"ErrorURL": "http://10.10.1.31:8040/api/_load_error_log?file=__shard_6/error_log_insert_stmt_324199914f5445c5-56e83207cb1f8099_324199914f5445c5_56e83207cb1f8099"
}
[root@vm30 ~]#
(4)数据质量 - 过滤阈值
NumberFilteredRows - 过滤掉的数据行数
因数据质量不合格而被过滤掉的数据。
01.类型错误 - type error - 上例中 abc
02.精度错误 - precision error - 上例中 2000
03.字符串长度超长 - long string length
04.文件列数不匹配 - mismatched file column number - 前几篇中的 (k1, k2, k3) "Message": "[INTERNAL_ERROR]too many filtered rows",
NumberLoadedRows - 正确导入数据行数
NumberUnselectedRows - 前置过滤 Preceding Filter 和过滤 where 掉的数据行数
max_filter_ratio - 最大错误率
max_filter_ratio = NumberFilteredRows / (NumberFilteredRows + NumberLoadedRows)
Doris's import task allows the user to set a maximum error rate (max_filter_ratio). If the error rate of the imported data is below the threshold, those erroneous rows will be ignored and other correct data will be imported.
多丽丝允许用户设置最大错误率 max_filter_ratio。如果导入数据的错误率在这个阈值之下,会忽略这些错误行,并导入其他正确的数据。
章节(3)中的错误率是
[root@vm30 ~]# echo "2 7" | awk '{printf "%.2f\n", $1/$2}'
0.29
[root@vm30 ~]#
流式加载 - 指定错误率阈值 0.3,这次状态 Status 为成功 Success 了。
[root@vm30 ~]# curl --location-trusted \
> -u root:123456 \
> -H "column_separator:," \
> -H "strict_mode: true" \
> -H "max_filter_ratio:0.3" \
> -T example_strict_2 \
> http://vm30:8030/api/demo/example_strict_2/_stream_load
{
"TxnId": 2027,
"Label": "2c432ba9-c980-4f19-9ffa-b42b95df4018",
"TwoPhaseCommit": "false",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 7,
"NumberLoadedRows": 5,
"NumberFilteredRows": 2,
"NumberUnselectedRows": 0,
"LoadBytes": 62,
"LoadTimeMs": 188,
"BeginTxnTimeMs": 4,
"StreamLoadPutTimeMs": 15,
"ReadDataTimeMs": 0,
"WriteDataTimeMs": 58,
"CommitAndPublishTimeMs": 106,
"ErrorURL": "http://10.10.1.31:8040/api/_load_error_log?file=__shard_7/error_log_insert_stmt_c74923d5781e29b5-33c0f24c617770a0_c74923d5781e29b5_33c0f24c617770a0"
}
[root@vm30 ~]#
查询数据
仔细查看,会发现少了两条数据,这两条数据不符合要求
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> select * from example_strict_2;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+------+------+------+
| k1 | k2 | k3 |
+------+------+------+
| 5 | 2 | 1.5 |
| 6 | 3 | 1.6 |
| 1 | NULL | 1.1 |
| 4 | 1 | 1.4 |
| 7 | 4 | 1.7 |
+------+------+------+
[root@vm30 ~]# cat example_strict_2
1,\N,1.1
2,abc,1.2
3,2000,1.3
4,1,1.4
5,2,1.5
6,3,1.6
7,4,1.7
[root@vm30 ~]#