doris数据导入-严格模式

转载自:https://zhuanlan.zhihu.com/p/621836873

1、严格模式

对导入中的数据列类型转换进行严格过滤。

列类型转换后为 null 的数据将被过滤。

不含函数计算为 null 的数据。

(1)设置方式

代理加载 Broker Load 通过属性 "strict_mode"="true" 设置:

LOAD LABEL example_db.label1
(
    DATA INFILE("bos://my_bucket/input/file.txt")
    INTO TABLE `my_table`
    COLUMNS TERMINATED BY ","
)
WITH BROKER bos
(
    "bos_endpoint" = "http://bj.bcebos.com",
    "bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
    "bos_secret_accesskey"="yyyyyyyyyyyyyyyyyyyyyyyyyy"
)
PROPERTIES
(
    "strict_mode" = "true"
)

流式加载 Stream Load 通过头 -H "strict_mode: true" 设置:

curl --location-trusted -u user:passwd \
-H "strict_mode: true" \
-T 1.txt \
http://host:port/api/example_db/my_table/_stream_load

插入数据通过会话变量设置:

SET enable_insert_strict = true;
INSERT INTO my_table ...;

(2)关闭严格模式

建表

[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> CREATE TABLE `example_strict_1` (
>   `k1` int     COMMENT "第一列",
>   `k2` TinyInt COMMENT "第二列",
>   `k3` double  COMMENT "第三列"
> )
> DISTRIBUTED BY HASH(k1) BUCKETS 2
> PROPERTIES (
>     "replication_num" = "1"
> );'
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]#
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> desc example_strict_1;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------+---------+------+-------+---------+-------+
| Field | Type    | Null | Key   | Default | Extra |
+-------+---------+------+-------+---------+-------+
| k1    | INT     | Yes  | true  | NULL    |       |
| k2    | TINYINT | Yes  | true  | NULL    |       |
| k3    | DOUBLE  | Yes  | false | NULL    | NONE  |
+-------+---------+------+-------+---------+-------+
[root@vm30 ~]#

造数

[root@vm30 ~]# cat > example_strict_1 <<EOL
> 1,\\N,1.1
> 2,abc,1.2
> 3,2000,1.3
> 4,1,1.4
> EOL
[root@vm30 ~]# cat example_strict_1
1,\N,1.1
2,abc,1.2
3,2000,1.3
4,1,1.4
[root@vm30 ~]#

流式加载 - 默认关闭严格模式

[root@vm30 ~]# curl --location-trusted      \
> -u root:123456               \
> -H "column_separator:,"      \
> -T example_strict_1          \
> http://vm30:8030/api/demo/example_strict_1/_stream_load
{
    "TxnId": 2024,
    "Label": "5bd983e8-0bcf-4fa7-a835-60fd8ec1d4e2",
    "TwoPhaseCommit": "false",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 4,
    "NumberLoadedRows": 4,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 38,
    "LoadTimeMs": 147,
    "BeginTxnTimeMs": 3,
    "StreamLoadPutTimeMs": 12,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 59,
    "CommitAndPublishTimeMs": 69
}
[root@vm30 ~]#

查询

关闭严格模式下 abc 和 2000 都会转换为空值 null。

[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> select * from example_strict_1;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+------+------+------+
| k1   | k2   | k3   |
+------+------+------+
|    3 | NULL |  1.3 |
|    1 | NULL |  1.1 |
|    2 | NULL |  1.2 |
|    4 |    1 |  1.4 |
+------+------+------+
[root@vm30 ~]# cat example_strict_1
1,\N,1.1
2,abc,1.2
3,2000,1.3
4,1,1.4
[root@vm30 ~]#

总结一下:
表结构中,k1字段类型是 int,k2字段类型是 TinyInt,k3字段类型是 double
但是构造的数据中,

# cat example_strict_1
1,\N,1.1
2,abc,1.2
3,2000,1.3
4,1,1.4

第一行的第二列也就是k2字段的值是\N,不符合TinyInt类型
第二行的第二列也就是k2字段的值是abc, 不符合TinyInt类型
第三行的第二列也就是k2字段的值是2000, 不符合TinyInt类型

因为流式加载是默认关闭严格模式的,因此关闭严格模式下 abc 和 2000 都会转换为空值 null。

abc 及 2000 在转换为 TinyInt 后,会因类型或精度问题变为 NULL。在严格模式开启的情况下,这类数据将会被过滤。而如果是关闭状态,则会导入 null。

(3)开启严格模式

建表

[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> CREATE TABLE `example_strict_2` (
>   `k1` int     COMMENT "第一列",
>   `k2` TinyInt COMMENT "第二列",
>   `k3` double  COMMENT "第三列"
> )
> DISTRIBUTED BY HASH(k1) BUCKETS 2
> PROPERTIES (
>     "replication_num" = "1"
> );'
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]#
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> desc example_strict_2;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------+---------+------+-------+---------+-------+
| Field | Type    | Null | Key   | Default | Extra |
+-------+---------+------+-------+---------+-------+
| k1    | INT     | Yes  | true  | NULL    |       |
| k2    | TINYINT | Yes  | true  | NULL    |       |
| k3    | DOUBLE  | Yes  | false | NULL    | NONE  |
+-------+---------+------+-------+---------+-------+
[root@vm30 ~]#

流式加载 - 开启严格模式

[root@vm30 ~]# curl --location-trusted      \
> -u root:123456               \
> -H "column_separator:,"      \
> -H "strict_mode: true"       \
> -T example_strict_1          \
> http://vm30:8030/api/demo/example_strict_2/_stream_load
{
    "TxnId": 2025,
    "Label": "6e86c67f-6496-46d2-a09a-ead19b5ea23c",
    "TwoPhaseCommit": "false",
    "Status": "Fail",
    "Message": "[INTERNAL_ERROR]too many filtered rows",
    "NumberTotalRows": 4,
    "NumberLoadedRows": 2,
    "NumberFilteredRows": 2,
    "NumberUnselectedRows": 0,
    "LoadBytes": 38,
    "LoadTimeMs": 64,
    "BeginTxnTimeMs": 2,
    "StreamLoadPutTimeMs": 8,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 50,
    "CommitAndPublishTimeMs": 0,
    "ErrorURL": "http://10.10.1.31:8040/api/_load_error_log?file=__shard_5/error_log_insert_stmt_f2411f67383594aa-ee11744d7d3da2a3_f2411f67383594aa_ee11744d7d3da2a3"
}
[root@vm30 ~]#

访问错误地址页面显示:

Reason: column(k2) value is incorrect while strict mode is true, src value is 1.2. src line [2 abc 1.2]; 
Reason: column(k2) value is incorrect while strict mode is true, src value is 1.3. src line [3 2000 1.3]; 

查询

表未加载任何数据。

[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> select * from example_strict_2;'
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]#

多造点数据

[root@vm30 ~]# cat > example_strict_2 <<EOL
> 1,\\N,1.1
> 2,abc,1.2
> 3,2000,1.3
> 4,1,1.4
> 5,2,1.5
> 6,3,1.6
> 7,4,1.7
> EOL
[root@vm30 ~]# cat example_strict_2
1,\N,1.1
2,abc,1.2
3,2000,1.3
4,1,1.4
5,2,1.5
6,3,1.6
7,4,1.7
[root@vm30 ~]#

流式加载

[root@vm30 ~]# curl --location-trusted      \
> -u root:123456               \
> -H "column_separator:,"      \
> -H "strict_mode: true"       \
> -T example_strict_2          \
> http://vm30:8030/api/demo/example_strict_2/_stream_load
{
    "TxnId": 2026,
    "Label": "8aac4480-b126-4469-8304-5302f087ef28",
    "TwoPhaseCommit": "false",
    "Status": "Fail",
    "Message": "[INTERNAL_ERROR]too many filtered rows",
    "NumberTotalRows": 7,
    "NumberLoadedRows": 5,
    "NumberFilteredRows": 2,
    "NumberUnselectedRows": 0,
    "LoadBytes": 62,
    "LoadTimeMs": 86,
    "BeginTxnTimeMs": 5,
    "StreamLoadPutTimeMs": 16,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 61,
    "CommitAndPublishTimeMs": 0,
    "ErrorURL": "http://10.10.1.31:8040/api/_load_error_log?file=__shard_6/error_log_insert_stmt_324199914f5445c5-56e83207cb1f8099_324199914f5445c5_56e83207cb1f8099"
}
[root@vm30 ~]#

(4)数据质量 - 过滤阈值

NumberFilteredRows - 过滤掉的数据行数

因数据质量不合格而被过滤掉的数据。
01.类型错误 - type error - 上例中 abc
02.精度错误 - precision error - 上例中 2000
03.字符串长度超长 - long string length
04.文件列数不匹配 - mismatched file column number - 前几篇中的 (k1, k2, k3) "Message": "[INTERNAL_ERROR]too many filtered rows",

NumberLoadedRows - 正确导入数据行数

NumberUnselectedRows - 前置过滤 Preceding Filter 和过滤 where 掉的数据行数

max_filter_ratio - 最大错误率

max_filter_ratio = NumberFilteredRows / (NumberFilteredRows + NumberLoadedRows)

Doris's import task allows the user to set a maximum error rate (max_filter_ratio). If the error rate of the imported data is below the threshold, those erroneous rows will be ignored and other correct data will be imported.
多丽丝允许用户设置最大错误率 max_filter_ratio。如果导入数据的错误率在这个阈值之下,会忽略这些错误行,并导入其他正确的数据。

章节(3)中的错误率是

[root@vm30 ~]# echo "2 7" | awk '{printf "%.2f\n", $1/$2}'
0.29
[root@vm30 ~]#


流式加载 - 指定错误率阈值 0.3,这次状态 Status 为成功 Success 了。

[root@vm30 ~]# curl --location-trusted      \
> -u root:123456               \
> -H "column_separator:,"      \
> -H "strict_mode: true"       \
> -H "max_filter_ratio:0.3"    \
> -T example_strict_2          \
> http://vm30:8030/api/demo/example_strict_2/_stream_load
{
    "TxnId": 2027,
    "Label": "2c432ba9-c980-4f19-9ffa-b42b95df4018",
    "TwoPhaseCommit": "false",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 7,
    "NumberLoadedRows": 5,
    "NumberFilteredRows": 2,
    "NumberUnselectedRows": 0,
    "LoadBytes": 62,
    "LoadTimeMs": 188,
    "BeginTxnTimeMs": 4,
    "StreamLoadPutTimeMs": 15,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 58,
    "CommitAndPublishTimeMs": 106,
    "ErrorURL": "http://10.10.1.31:8040/api/_load_error_log?file=__shard_7/error_log_insert_stmt_c74923d5781e29b5-33c0f24c617770a0_c74923d5781e29b5_33c0f24c617770a0"
}
[root@vm30 ~]#

查询数据

仔细查看,会发现少了两条数据,这两条数据不符合要求

[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> select * from example_strict_2;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+------+------+------+
| k1   | k2   | k3   |
+------+------+------+
|    5 |    2 |  1.5 |
|    6 |    3 |  1.6 |
|    1 | NULL |  1.1 |
|    4 |    1 |  1.4 |
|    7 |    4 |  1.7 |
+------+------+------+
[root@vm30 ~]# cat example_strict_2
1,\N,1.1
2,abc,1.2
3,2000,1.3
4,1,1.4
5,2,1.5
6,3,1.6
7,4,1.7
[root@vm30 ~]#

posted @ 2023-06-14 14:42  哈喽哈喽111111  阅读(394)  评论(0编辑  收藏  举报