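BE Failure Troubleshooting and Resolution
Preface
Doris cluster size: 3 FE, 10 BE
Doris cluster version: 1.2.3
Symptoms
On one BE host, usage of the root directory (/) reached 92%.
Investigation
Find what is taking up the most space under the root directory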
## Find the four largest entries
cd /
du -BG | sort -nr | head -n4
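The Doris bin directory contains N files named core.***, each taking up a large amount of disk space.
These are presumably core files generated as the BE service repeatedly started and crashed.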
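The du listing reports directories as well as files, and it also descends into other filesystems mounted below /. As a rough sketch, assuming GNU find is available, the largest regular files on the root filesystem alone could be listed like this. The function name largest_files is made up for illustration.

```shell
# List the N largest regular files under a directory, staying on one filesystem.
# Requires GNU find (-printf). Output: "<size in bytes> <path>", largest first.
largest_files() (
    dir=${1:-/}
    count=${2:-4}
    # -xdev keeps find from descending into other mounted filesystems,
    # so data disks mounted below / are not counted against the root volume.
    find "$dir" -xdev -type f -printf '%s %p\n' 2>/dev/null \
        | sort -nr \
        | head -n "$count"
)
```

Calling largest_files / 4 would correspond to the check above, and large core.* files would then show up by path.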
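Check the system resource limits
ulimit -a
The first line of the output, core file size, sets the maximum size of a core file in blocks. If it is set to 0, no core file is generated.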
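The core file size value can also be read on its own with ulimit -c. The sketch below classifies that value and, on Linux, prints the kernel setting that controls where core files are written. The helper name describe_core_limit is hypothetical.

```shell
# Classify a core-size limit as printed by `ulimit -c`.
describe_core_limit() {
    case "$1" in
        0)         echo "disabled" ;;
        unlimited) echo "unlimited" ;;
        *)         echo "limited to $1 blocks" ;;
    esac
}

# Report the soft limit of the current shell.
describe_core_limit "$(ulimit -c)"

# On Linux, core_pattern decides the location and naming of core files.
if [ -r /proc/sys/kernel/core_pattern ]; then
    cat /proc/sys/kernel/core_pattern
fi
```

Note that ulimit reports the limits of the shell it runs in. A service started by systemd does not inherit them, so the limits that apply to a running process are best read from /proc/<pid>/limits.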
Clear the disk alert: delete the core files in the bin directory
cd /<Doris install directory>/bin
rm -f core.*
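Core files are also the main evidence of a crash, so before running the rm above it may be worth recording how many there are and how much space they hold. A minimal sketch, assuming GNU find; core_usage is an illustrative name, not a Doris tool.

```shell
# Print "<file count> <total bytes>" for core.* files directly inside a directory.
core_usage() (
    dir=$1
    find "$dir" -maxdepth 1 -type f -name 'core.*' -printf '%s\n' \
        | awk '{ total += $1; n++ } END { printf "%d %.0f\n", n, total }'
)
```

Running core_usage on the bin directory before and after the cleanup gives a figure to compare with the df -h output.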
Check whether disk usage has dropped
df -h
Root directory usage has dropped and the alert has cleared.
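Because the core files come from the BE service starting and crashing, the next step is to check whether the BE node is healthy and find the root cause.
Check whether the BE service is running
## Check the process
ps -ef | grep be
## Check the service status
systemctl status doris-be
The BE process does not exist and the service status is failed.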
Check the error logs:
be.WARNING
be.INFO
The BE nodes that are down failed one after another. Key error log entries:
W0531 19:57:25.001701 23977 task_worker_pool.cpp:723] failed to publish version|signature=6701994|transaction_id=6701994|error_tablets_num=2|error=[E-3115]
W0531 19:57:44.269308 23992 tablet.cpp:743] fail to find Rowset for version. tablet=3772291.1631060552.a844e74588ec4125-7252d67f4d550f99, version='[339650-339677]
W0531 19:57:44.269470 23992 file_utils.cpp:58] path does exist: /data/doris-storage/data/507/3772291/1631060552/0200000001956503a7432009e21e4e9d8ee251697fcf2aa3_0.dat
W0531 19:59:33.142494 23654 tablet_sink.cpp:156] cancel node channel VNodeChannel[1979559-10003], load_id=a11449a6926c4d80-ba8dd8c3b56638a4, txn_id=6702763, node=10.30.81.38:8060, error message: [INTERNAL_ERROR]wait close failed.
W0531 19:59:33.142549 23654 tablet_sink.cpp:1179] VNodeChannel[1979559-10003], load_id=a11449a6926c4d80-ba8dd8c3b56638a4, txn_id=6702763, node=10.30.81.38:8060, close channel failed, err: [INTERNAL_ERROR]wait close failed.
W0531 19:59:33.315030 23593 brpc_client_cache.h:150] open brpc connection to 10.30.81.38:8060 failed: [E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R1][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R2][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R3][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R4][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R5][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R6][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R7][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R8][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R9][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R10][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880
W0531 19:59:33.315069 23593 tablet_sink.cpp:179] failed to open tablet writer, error=Host is down, error_text=[E111]Fail to connect Socket{id=34359738880 addr=10.30.81.38:8060:8060} (0x0x7faf1e659b00): Connection refused [R1][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R2][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R3][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R4][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R5][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R6][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R7][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R8][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R9][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R10][E112]Not connected to 10.30.81.38:8060 yet,server_id=34359738880 VNodeChannel[1979559-10003],load_id=6987e916c1894c26-9e2877fbfbb43ff2, txn_id=6702764, node=10.30.81.38:8060
W0531 19:59:33.315783 23593 tablet_sink.cpp:156] cancel node channel VNodeChannel[1979559-10003],load_id=6987e916c1894c26-9e2877fbfbb43ff2, txn_id=6702764, node=10.30.81.38:8060, error message: [INTERNAL_ERROR]wait close failed.
W0531 19:59:33.315840 23593 tablet_sink.cpp:1179] VNodeChannel[1979559-10003], load_id=6987e916c1894c26-9e2877fbfbb43ff2, txn_id=6702764, node=10.30.81.38:8060, close channel failed, err: [INTERNAL_ERROR]wait close failed.
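With many repeated entries like these, a quick tally can show which code paths are warning most often and which peers are unreachable. The sketch below assumes the glog-style layout visible in the excerpt (level and date, time, thread id, source location, message) and only counts entries that begin at the start of a line. The function names are illustrative.

```shell
# Count WARNING entries per source location in a BE log read from stdin.
# Output: "<count> <file:line>", most frequent first.
warn_sources() {
    awk '/^W[0-9][0-9][0-9][0-9] / { sub(/\]$/, "", $4); count[$4]++ }
         END { for (src in count) print count[src], src }' \
        | sort -k1,1nr -k2,2
}

# List the distinct endpoints for which opening a brpc connection failed.
unreachable_nodes() {
    sed -n 's/.*open brpc connection to \([0-9.]*:[0-9]*\) failed.*/\1/p' | sort -u
}
```

For example, warn_sources < be.WARNING gives the tally, and unreachable_nodes < be.WARNING lists the peers that refused connections.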
Check the cluster node status:
## Connect to the FE
mysql -uroot -p -hlocalhost -P9030
## Check BE status:
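One way to check BE status from the FE is the SHOW BACKENDS statement, whose output includes a BackendId and an Alive column for each node. The following sketch filters the vertical (\G) output for backends that are not alive. The function name dead_backends is hypothetical, and the exact set of other columns varies between Doris versions, so only these two are used.

```shell
# Read `SHOW BACKENDS\G` output on stdin and print the BackendId of every
# backend whose Alive column is false.
dead_backends() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }          # column names are right-aligned
        $1 == "BackendId" { id = $2 }
        $1 == "Alive" && $2 == "false" { print id }
    '
}

# Possible usage (not executed here):
#   mysql -uroot -p -hlocalhost -P9030 -e 'SHOW BACKENDS\G' | dead_backends
```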
Check the table properties:
## Connect to the FE
show databases;
show tables;
show create table tbl_website_performance_info;
show create table tbl_website_click_log;
Both tables have Merge-on-write enabled in their properties, which confirms the root cause.
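When more tables need checking, reading each SHOW CREATE TABLE output by eye gets tedious. A small sketch that tests a DDL statement for the property, assuming it appears in the usual "key" = "value" form of the PROPERTIES clause. mow_enabled is an illustrative name, and example_db in the usage comment is a placeholder because the database name is not given here.

```shell
# Succeed (exit status 0) when the DDL on stdin enables merge-on-write.
mow_enabled() {
    grep -Eiq '"enable_unique_key_merge_on_write"[[:space:]]*=[[:space:]]*"true"'
}

# Possible usage (not executed here; example_db is a placeholder):
#   mysql -uroot -p -hlocalhost -P9030 -N -e \
#       'SHOW CREATE TABLE example_db.tbl_website_click_log' \
#       | mow_enabled && echo "merge-on-write is enabled"
```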
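Root Cause
When data is loaded in Doris 1.2.3 into a table created with Merge-on-write enabled, a bug is triggered that brings the BE node down. The Doris project has confirmed that the bug is fixed in version 1.2.4.1.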
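Resolution
Option 1: Recreate the tables with enable_unique_key_merge_on_write set to false, which turns Merge-on-write off.
Option 2: Upgrade Doris from 1.2.3 to 1.2.4.1.
After confirming with the business team, Option 1 was chosen.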
- Stop the data load tasks, start the BE node, and check that the BE is healthy.
## Stop the load tasks
## Start the BE node
systemctl status doris-be
systemctl start doris-be
## Check the BE logs
tail -n 100 -f be.INFO
tail -n 100 -f be.WARNING
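A BE that crashes shortly after starting can look healthy for the first few seconds, so it may help to poll the service state instead of checking it once. The helper below is a generic retry loop written for this illustration (wait_until is not a standard command).

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay seconds> <command> [args...]
wait_until() (
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            exit 0          # the command succeeded
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    exit 1                  # never succeeded within the given attempts
)

# Possible usage (not executed here):
#   systemctl start doris-be && wait_until 30 2 systemctl is-active --quiet doris-be
```

The same loop could be pointed at any other readiness check, such as a query against the FE.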
The BE started successfully.
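- Connect to the FE and modify the table property
## Connect to the FE
mysql -uroot -p -hlocalhost -P9030
## Modify the property
ALTER TABLE tbl_website_performance_info SET ("enable_unique_key_merge_on_write" = "false");
This property cannot be changed on an existing table. It can only be specified when the table is created, so the tables have to be recreated.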
Recreate the tables
## View the CREATE statements of the original tables
show create table tbl_website_performance_info;
show create table tbl_website_click_log;
## Copy the statements, change enable_unique_key_merge_on_write to false, and create the tables.
tbl_website_performance_info:
tbl_website_click_log:
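If the CREATE statements are saved to files, the property change can be applied mechanically instead of by hand. This is only a sketch: disable_mow is an illustrative name, and it assumes the property is written as "enable_unique_key_merge_on_write" = "true" in the saved statement. The table name in the statement still has to be changed separately if the new table is created alongside the original.

```shell
# Rewrite a CREATE TABLE statement read from stdin so that merge-on-write is off.
# All other lines and properties pass through unchanged.
disable_mow() {
    sed -E 's/("enable_unique_key_merge_on_write"[[:space:]]*=[[:space:]]*)"true"/\1"false"/'
}
```

For instance, disable_mow < original.sql > new.sql, where both file names are placeholders.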
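Import the data from the original table into the new table
## tbl1 stands for the original table and tbl2 for the new table
## Count the rows in the original table
mysql> select count(*) from tbl1;
## Import the data into the new table
mysql> insert into tbl2 select * from tbl1;
## Check that the new and original tables hold the same number of rows
mysql> select count(*) from tbl1;
mysql> select count(*) from tbl2;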
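The row-count comparison can be scripted so that a mismatch is not missed. In this sketch run_sql wraps the same mysql client invocation used earlier, without -p, so credentials would need to be supplied in whatever way the environment allows. Both function names are made up for illustration, and table names would normally need a database prefix.

```shell
# Run one SQL statement against the FE and print the bare result
# (-N drops the column header, -B selects tab-separated batch output).
run_sql() {
    mysql -uroot -hlocalhost -P9030 -N -B -e "$1"
}

# Print both counts and succeed only when they are equal.
# Exit status 2 means a query itself failed.
same_row_count() (
    a=$(run_sql "SELECT COUNT(*) FROM $1") || exit 2
    b=$(run_sql "SELECT COUNT(*) FROM $2") || exit 2
    echo "$1=$a $2=$b"
    [ "$a" = "$b" ]
)
```

A call such as same_row_count tbl1 tbl2 would then report the two counts and return a non-zero status if they differ.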
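- Restart the load tasks, check that the row counts grow as expected, and check that the BE node is healthy.
## Start the load tasks
## Check that the row counts grow as expected
mysql> select count(*) from tbl1;
mysql> select count(*) from tbl2;
## Check that the BE node is healthy
ps -ef | grep doris-be
systemctl status doris-be
## Check for error logs
tail -n 100 -f be.INFO
tail -n 100 -f be.WARNING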
The load tasks and the BE node are running normally. The failure is resolved.