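Troubleshooting and Fixing a BE Failure

Background

Doris cluster size: 3 FE, 10 BE

Doris cluster version: 1.2.3

Symptom

On one BE host, usage of the / filesystem reached 92%.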

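Investigation

Find what is taking up the most space under the root directory

## List the four largest entries (-BG reports whole gigabytes so that sort -nr can compare the values; -x stays on the root filesystem)

cd /

du -x -BG 2>/dev/null | sort -nr | head -n4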

 

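The bin directory of the Doris installation contained N files named core.***, each taking up a lot of disk space.

They were most likely core dumps written as the BE service was repeatedly started and crashed.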
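Before deleting anything it can help to see exactly which core files exist and how large each one is. The following is a minimal sketch, not part of the original procedure: `list_core_files` is a hypothetical helper name, and the directory in the usage comment is a placeholder for the Doris install directory.

```shell
# Hypothetical helper: list core.* files directly under a directory,
# largest first, one "<size in bytes> <path>" entry per line.
list_core_files() (
    dir=$1
    # -maxdepth 1: look only in the directory itself; -type f: regular files.
    # wc -c prints the byte size of each file; with several files it also
    # prints a "total" line, which is filtered out here.
    find "$dir" -maxdepth 1 -type f -name 'core.*' -exec wc -c {} + 2>/dev/null |
        grep -v ' total$' | sort -nr
)

# Usage (placeholder path):
#   list_core_files "/<Doris install directory>/bin"
```

Sorting by byte size rather than by human-readable units keeps the ordering unambiguous when some files are in megabytes and others in gigabytes.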

Check the system resource limits

ulimit -a

The first line of the output, core file size, is the maximum size of a core file, in blocks. If it is 0, no core file is written.
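To read just the core file limit instead of scanning the full `ulimit -a` output, `ulimit -c` can be queried directly. The sketch below wraps it in a hypothetical helper, `describe_core_limit`, and also prints the Linux `core_pattern` setting when it is readable. It describes the current shell only, which is an assumption worth keeping in mind.

```shell
# Hypothetical helper: describe the core dump settings of the current shell.
describe_core_limit() (
    # Soft limit for core files: "0", a block count, or "unlimited".
    limit=$(ulimit -c)
    case $limit in
        0)         echo "core dumps disabled (ulimit -c = 0)" ;;
        unlimited) echo "core dumps enabled, no size cap" ;;
        *)         echo "core dumps enabled, capped at $limit blocks" ;;
    esac
    # On Linux, core_pattern controls the name and location of the dump file.
    if [ -r /proc/sys/kernel/core_pattern ]; then
        echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
    fi
)

# Usage:
#   describe_core_limit
```

For a service started by systemd, the limit that applies comes from the unit rather than from the login shell; `systemctl show doris-be -p LimitCORE` shows the value for the unit used in this post.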

 

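Clear the disk alert: delete the core files under the bin directory

cd /<Doris install directory>/bin

rm -f core.*

Check that disk space has been freed

df -h

Usage of the root directory dropped and the alert cleared.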
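Deleting every core file also removes the evidence of the crash. As an alternative sketch (not what was done in this incident), the hypothetical helper `prune_core_files` below keeps the newest core files and removes the rest. It assumes core file names contain no newline characters, which holds for the usual `core.<pid>` naming.

```shell
# Hypothetical helper: delete core.* entries in directory $1 but keep the
# newest $2 of them (default 1) for later analysis.
prune_core_files() (
    dir=$1
    keep=${2:-1}
    cd "$dir" || exit 1
    # ls -t lists newest first (-d avoids descending into directories);
    # tail -n +K starts printing at line K, so the first $keep names are skipped.
    ls -td core.* 2>/dev/null | tail -n +"$((keep + 1))" |
        while IFS= read -r f; do
            rm -f -- "$f"
        done
)

# Usage (placeholder path), keeping the two most recent dumps:
#   prune_core_files "/<Doris install directory>/bin" 2
```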

 

 

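Because the core files were produced by the BE service starting and crashing, the next step is to check whether the BE node is healthy and find the root cause.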

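Check whether the BE service is healthy

## Check the process (doris_be is the name of the BE binary; grep be alone would also match unrelated processes)

ps -ef | grep doris_be | grep -v grep

## Check the service status

systemctl status doris-be

The BE process does not exist and the service state is failed.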

 

 

 

Check the error logs:

be.WARNING

be.INFO

The logs show that the failed BE nodes went down one after another. Key error log entries:

W0531 19:57:25.001701 23977 task_worker_pool.cpp:723] failed to publish version|signature=6701994|transaction_id=6701994|error_tablets_num=2|error=[E-3115]

W0531 19:57:44.269308 23992 tablet.cpp:743] fail to find Rowset for version. tablet=3772291.1631060552.a844e74588ec4125-7252d67f4d550f99, version='[339650-339677]

W0531 19:57:44.269470 23992 file_utils.cpp:58] path does exist: /data/doris-storage/data/507/3772291/1631060552/0200000001956503a7432009e21e4e9d8ee251697fcf2aa3_0.dat

W0531 19:59:33.142494 23654 tablet_sink.cpp:156] cancel node channel VNodeChannel[1979559-10003], load_id=a11449a6926c4d80-ba8dd8c3b56638a4, txn_id=6702763, node=10.30.81.38:8060, error message: [INTERNAL_ERROR]wait close failed.

W0531 19:59:33.142549 23654 tablet_sink.cpp:1179] VNodeChannel[1979559-10003], load_id=a11449a6926c4d80-ba8dd8c3b56638a4, txn_id=6702763, node=10.30.81.38:8060, close channel failed, err: [INTERNAL_ERROR]wait close failed.

W0531 19:59:33.315030 23593 brpc_client_cache.h:150] open brpc connection to 10.30.81.38:8060 failed: [E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R1][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R2][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R3][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R4][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R5][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R6][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R7][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R8][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R9][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R10][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880

W0531 19:59:33.315069 23593 tablet_sink.cpp:179] failed to open tablet writer, error=Host is down, error_text=[E111]Fail to connect Socket{id=34359738880 addr=10.30.81.38:8060:8060} (0x0x7faf1e659b00): Connection refused [R1][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R2][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R3][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R4][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R5][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R6][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R7][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R8][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R9][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 [R10][E112]Not connected to 10.30.81.38:8060 yet, server_id=34359738880 VNodeChannel[1979559-10003],load_id=6987e916c1894c26-9e2877fbfbb43ff2, txn_id=6702764, node=10.30.81.38:8060

W0531 19:59:33.315783 23593 tablet_sink.cpp:156] cancel node channel VNodeChannel[1979559-10003],load_id=6987e916c1894c26-9e2877fbfbb43ff2, txn_id=6702764, node=10.30.81.38:8060, error message: [INTERNAL_ERROR]wait close failed.

W0531 19:59:33.315840 23593 tablet_sink.cpp:1179] VNodeChannel[1979559-10003], load_id=6987e916c1894c26-9e2877fbfbb43ff2, txn_id=6702764, node=10.30.81.38:8060, close channel failed, err: [INTERNAL_ERROR]wait close failed.
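When be.WARNING is long, counting entries per source location gives a quick picture of which warnings dominate. The sketch below assumes the glog line layout seen in the entries above (level and date, time, thread id, then `file:line]`); `summarize_glog` is a hypothetical helper name.

```shell
# Hypothetical helper: count log lines per "file:line" source location,
# most frequent first. Reads the named files, or stdin when none are given.
summarize_glog() {
    awk '
        # Field 4 of a glog line looks like "tablet_sink.cpp:156]".
        # Wrapped continuation lines normally fail this check and are skipped.
        $4 ~ /\]$/ { sub(/\]$/, "", $4); count[$4]++ }
        END { for (loc in count) print count[loc], loc }
    ' "$@" | sort -k1,1nr -k2
}

# Usage:
#   summarize_glog be.WARNING | head
```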

 

 

Check the status of the cluster nodes:

## Connect to FE

mysql -uroot -p -hlocalhost -P9030

Check the BE status:
show proc '/backends';
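With ten BE nodes, the backend list is easier to read when only the nodes that are down are shown. The sketch below filters the tab-separated output that `mysql -B` produces. It assumes the header row contains a column named `Alive`, which should be checked against the actual output of this Doris version; `dead_backends` is a hypothetical helper name.

```shell
# Hypothetical helper: read tab-separated SHOW PROC '/backends' output
# (header row included) and print every row whose Alive column is not "true".
dead_backends() {
    awk -F '\t' '
        NR == 1 {
            # Map column names to positions so the filter does not depend
            # on the column order.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Alive" in col)) {
                print "no Alive column in header" > "/dev/stderr"
                exit 2
            }
            next
        }
        tolower($(col["Alive"])) != "true" { print }
    ' "$@"
}

# Usage (-B: tab-separated batch output, -e: run one statement):
#   mysql -uroot -p -hlocalhost -P9030 -B -e "show proc '/backends'" | dead_backends
```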

Check the table properties:

## Run on the FE connection

show databases;

show tables;

show create table tbl_website_performance_info;

show create table tbl_website_click_log;

Both tables have Merge-on-Write enabled in their properties, which confirms the root cause.
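Checking a CREATE statement for this property can be scripted instead of read by eye. This is a small sketch under the assumption that the DDL text is available on stdin or in a file; `has_merge_on_write` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed when the DDL text sets
# enable_unique_key_merge_on_write to "true".
has_merge_on_write() {
    grep -Eiq '"enable_unique_key_merge_on_write"[[:space:]]*=[[:space:]]*"true"' "$@"
}

# Usage, with the SHOW CREATE TABLE output saved to a file first:
#   has_merge_on_write ddl.sql && echo "Merge-on-Write is enabled"
```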

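Cause

On Doris 1.2.3, loading data into a table created with Merge-on-Write enabled triggers a bug that crashes the BE node. According to the Doris project, the bug is fixed in 1.2.4.1.

Fix

Option 1: Recreate the tables with enable_unique_key_merge_on_write set to false, which disables Merge-on-Write.

Option 2: Upgrade Doris from 1.2.3 to 1.2.4.1.

After confirming with the business side, option 1 was chosen.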

 

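  1. Stop the data load tasks and start the BE node. Check whether the BE is healthy.

## Stop the load tasks

## Start the BE node

systemctl start doris-be

systemctl status doris-be

## Check the BE logs

tail -n 100 -f be.INFO

tail -n 100 -f be.WARNING

The BE started successfully.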

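  2. Connect to FE and change the table property

## Connect to FE

mysql -uroot -p -hlocalhost -P9030

## Change the property

ALTER TABLE tbl_website_performance_info SET ("enable_unique_key_merge_on_write" = "false");
ERROR 1105 (HY000): errCode = 2, detailMessage = Alter tablet type not supported;

The statement fails: this property cannot be altered on an existing table and can only be set when the table is created, so the tables have to be recreated.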

Recreate the tables

## Get the CREATE statements of the original tables

show create table tbl_website_performance_info;

show create table tbl_website_click_log;

## Copy each statement, change enable_unique_key_merge_on_write to false, and create the new table.

tbl_website_performance_info:

tbl_website_click_log:
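Editing the copied statement by hand is easy to get wrong, so the change can also be made with sed. The sketch below is an illustration only: `rewrite_ddl_without_mow` and the `_new` suffix are hypothetical, the post does not say how the new tables were named, and the suffix must not contain `/` or `&` because it is pasted into a sed replacement.

```shell
# Hypothetical helper: read a CREATE TABLE statement on stdin and print a copy
# that (1) appends a suffix to the table name and (2) turns Merge-on-Write off.
rewrite_ddl_without_mow() (
    suffix=${1:-_new}
    sed -e "s/^\(CREATE TABLE \`[^\`]*\)\`/\1${suffix}\`/" \
        -e 's/\("enable_unique_key_merge_on_write"[[:space:]]*=[[:space:]]*\)"true"/\1"false"/'
)

# Usage, with the original statement saved to a file first:
#   rewrite_ddl_without_mow _new < tbl_website_click_log.sql
```

The output should still be reviewed before it is run, since only the table name on the first line and the one property are touched.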

 

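Copy the data from the original tables into the new tables

## In the statements below, tbl1 stands for the original table and tbl2 for the new table

## Count the rows in the original table

mysql> select count(*) from tbl1;

## Load the data into the new table

mysql> insert into tbl2 select * from tbl1;

## Check that the new and old tables hold the same number of rows

mysql> select count(*) from tbl1;

mysql> select count(*) from tbl2;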
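The row count comparison can be scripted so that a mismatch is hard to miss. This is a sketch under stated assumptions: the mysql client can log in without a password prompt (for example through `~/.my.cnf`), the host, port and user match the connection command used in this post, and `count_rows`, `same_row_count` and the database name in the usage comment are hypothetical.

```shell
# Hypothetical helper: print the row count of table $1 in database $2.
count_rows() {
    # -N: no header row, -B: plain batch output, -e: run one statement.
    mysql -hlocalhost -P9030 -uroot -N -B -e "select count(*) from $1;" "$2"
}

# Hypothetical helper: succeed only when tables $2 and $3 in database $1
# report the same row count. Exits with 2 if a query fails.
same_row_count() (
    db=$1
    old=$(count_rows "$2" "$db") || exit 2
    new=$(count_rows "$3" "$db") || exit 2
    echo "$2=$old $3=$new"
    [ "$old" = "$new" ]
)

# Usage (placeholder database name):
#   same_row_count my_db tbl1 tbl2 && echo "row counts match"
```

A matching count is only a coarse check; it does not prove that the row contents are identical.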

 

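  3. Restart the load tasks, check that the row counts grow as expected, and check that the BE nodes are healthy.

## Start the load tasks

## Check that the row counts are growing

mysql> select count(*) from tbl1;

mysql> select count(*) from tbl2;

## Check that the BE node is healthy

ps -ef | grep doris_be | grep -v grep

systemctl status doris-be

## Check for error logs

tail -n 100 -f be.INFO

tail -n 100 -f be.WARNING

The load tasks and the BE nodes are running normally. The failure is resolved.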

posted @ 2023-06-13 09:45  大鹏o