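Troubleshooting a Mass Crash of Doris BE Nodes
A new version went live on July 14. On July 16, the BE nodes of the Doris cluster went down one after another within a short period, and restarting them resolved the problem temporarily. On Monday, July 17, when work resumed, the BE nodes began crashing repeatedly, which affected normal use.
Locating the problem:
1. Check the Doris BE node log be.out
The log is shown below. Frame 7# (doris::PlanFragmentExecutor) shows that the crash was triggered by SQL execution, so the next step is to use a core dump to locate the query that brought the BE down.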
*** Aborted at 1689488662 (unix time) try "date -d @1689488662" if you are using GNU date ***
*** SIGSEGV unkown detail explain (@0x0) received by PID 44257 (TID 0x7fb793b90700) from PID 0; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /mnt/disk2/ygl/code/github/apache-doris/be/src/common/signal_handler.h:420
1# 0x00007FB7CC97C400 in /lib64/libc.so.6
2# doris::vectorized::IAggregateFunctionHelper<doris::vectorized::AggregateFunctionCountNotNullUnary>::add_batch(unsigned long, char**, unsigned long, doris::vectorized::IColumn const**, doris::vectorized::Arena*) const at /mnt/disk2/ygl/code/github/apache-doris/be/src/vec/aggregate_functions/aggregate_function.h:151
3# doris::vectorized::AggFnEvaluator::execute_batch_add(doris::vectorized::Block*, unsigned long, char**, doris::vectorized::Arena*) at /mnt/disk2/ygl/code/github/apache-doris/be/src/vec/exprs/vectorized_agg_fn.cpp:131
4# doris::vectorized::AggregationNode::_execute_with_serialized_key(doris::vectorized::Block*) at /mnt/disk2/ygl/code/github/apache-doris/be/src/vec/exec/vaggregation_node.cpp:864
5# std::_Function_handler<doris::Status (doris::vectorized::Block*), std::_Bind_result<doris::Status, doris::Status (doris::vectorized::AggregationNode::*(doris::vectorized::AggregationNode*, std::_Placeholder<1>))(doris::vectorized::Block*)> >::_M_invoke(std::_Any_data const&, doris::vectorized::Block*&&) at /mnt/disk2/ygl/installs/ldbtools/include/c++/11/bits/std_function.h:293
6# doris::vectorized::AggregationNode::open(doris::RuntimeState*) at /mnt/disk2/ygl/code/github/apache-doris/be/src/vec/exec/vaggregation_node.cpp:375
7# doris::PlanFragmentExecutor::open_vectorized_internal() at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/plan_fragment_executor.cpp:286
8# doris::PlanFragmentExecutor::open() at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/plan_fragment_executor.cpp:259
9# doris::FragmentExecState::execute() at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/fragment_mgr.cpp:248
10# doris::FragmentMgr::_exec_actual(std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>) at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/fragment_mgr.cpp:481
11# std::_Function_handler<void (), std::_Bind_result<void, void (doris::FragmentMgr::*(doris::FragmentMgr*, std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>))(std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>)> >::_M_invoke(std::_Any_data const&) at /mnt/disk2/ygl/installs/ldbtools/include/c++/11/bits/std_function.h:291
12# doris::ThreadPool::dispatch_thread() at /mnt/disk2/ygl/code/github/apache-doris/be/src/util/threadpool.cpp:578
13# doris::Thread::supervise_thread(void*) at /mnt/disk2/ygl/code/github/apache-doris/be/src/util/thread.cpp:407
14# start_thread in /lib64/libpthread.so.0
15# clone in /lib64/libc.so.6
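The first line of the trace records the crash time as a Unix timestamp and suggests decoding it with date -d. As a small sketch, the same conversion can be done in Python. The UTC+8 offset used in the second call is an assumption about the cluster's time zone, not something the log states.

```python
from datetime import datetime, timedelta, timezone

def crash_time(unix_seconds, utc_offset_hours=0):
    """Convert the 'Aborted at <n> (unix time)' value to a readable time."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(unix_seconds, tz).strftime("%Y-%m-%d %H:%M:%S")

# Timestamp copied from the first line of the be.out excerpt above.
print(crash_time(1689488662))     # UTC: 2023-07-16 06:24:22
print(crash_time(1689488662, 8))  # assumed UTC+8: 2023-07-16 14:24:22
```

The decoded date matches the first wave of crashes on July 16 described at the top.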
2. How to generate a core dump
- Check whether core dump generation is enabled by running ulimit -a:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1544256
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 655350
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 655350
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
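The first line shows that core file size is unlimited on this server (if it is 0, no core dump is generated). The limit can be changed with the commands below, or by adding ulimit -c unlimited -n 65536 to the BE startup script.
ulimit -c 1024 # limit the core dump file size to 1024 KB
ulimit -c unlimited # do not limit the core dump file size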
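The same limit can also be read from a script, which may be handy when checking many BE hosts. The sketch below uses Python's standard resource module. The helper name describe_core_limit is made up for illustration. Note that getrlimit reports the limit in bytes, while ulimit -c prints blocks, and that the limit is per process, so the check is only meaningful when run in the same environment that starts the BE.

```python
import resource

def describe_core_limit(limit_bytes):
    """Turn a RLIMIT_CORE value into a short human-readable label."""
    if limit_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    if limit_bytes == 0:
        return "disabled"  # no core dump will be written
    return f"{limit_bytes} bytes"

# Soft limit is the one in effect; hard limit is the ceiling it can be raised to.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("soft:", describe_core_limit(soft))
print("hard:", describe_core_limit(hard))
```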
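- Check where the core dump file is written
By default, the core dump file is named core and is written to the directory from which the BE startup script was run, and a newly generated core dump overwrites the old one. If /proc/sys/kernel/core_uses_pid contains 1, the file is instead named core.<pid>. (It is recommended to ask the system administrator to turn this switch on.) If no core dump file is found in the directory from which the BE startup script was run, the system administrator may have changed core_pattern. Run cat /proc/sys/kernel/core_pattern to see where core files are written.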
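The two kernel settings mentioned above can be read and interpreted together. The following is a minimal sketch: the function names are hypothetical, and the handling of a pattern that starts with "|" (the kernel pipes the dump to a helper program instead of writing a file) goes slightly beyond what this write-up covers.

```python
from pathlib import Path

def read_kernel_setting(name, base="/proc/sys/kernel"):
    """Return the trimmed content of a kernel setting file, or None if unreadable."""
    try:
        return (Path(base) / name).read_text().strip()
    except OSError:
        return None

def core_location_hint(core_pattern, core_uses_pid):
    """Describe where a core dump is expected to end up."""
    if core_pattern is None:
        return "core_pattern could not be read"
    if core_pattern.startswith("|"):
        return "piped to a helper program: " + core_pattern[1:].strip()
    name = core_pattern
    # The kernel appends .PID when core_uses_pid is 1 and the pattern has no %p.
    if core_uses_pid == "1" and "%p" not in core_pattern:
        name += ".<pid>"
    if "/" in core_pattern:
        return "written to " + name
    return "written to the crashing process's working directory as " + name

pattern = read_kernel_setting("core_pattern")
uses_pid = read_kernel_setting("core_uses_pid")
print(core_location_hint(pattern, uses_pid))
```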
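3. Use the core dump to locate the problematic SQL
- Install GDB
Opening the core dump file with GDB lets us obtain the corresponding Query ID. GDB download link: http://www.rpmfind.net/linux/centos/7.9.2009/os/x86_64/Packages/gdb-7.6.1-120.el7.x86_64.rpm. Install it with rpm -ivh <rpm package name>. You can also search the same site for a GDB version that suits your system.
- Open the core dump file with GDB
Run the following command in the BE bin directory (if core dumps are written to a different directory, give the path to the core dump file):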
gdb ../lib/palo_be core.xxxx
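- Find the QueryID through the stack frame index
After the previous step, enter the bt command to print the stack, then look for doris::PlanFragmentExecutor (keep pressing Enter to see the next batch of frames). In the output here it appears at frame 449:
#0 add (row_num=0, columns=0x556361d9a9b8, place=0x556361cb7018 "", this=0x55638449a790)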
at /mnt/disk2/ygl/code/github/apache-doris/be/src/vec/common/pod_array.h:342
····················
#447 0x0000561787fe19d8 in doris::ScanNode::prepare(doris::RuntimeState*) ()
at /mnt/disk2/ygl/code/github/apache-doris/be/src/exec/scan_node.cpp:30
#448 0x00005617880bedea in doris::OdbcScanNode::prepare(doris::RuntimeState*) ()
at /mnt/disk2/ygl/code/github/apache-doris/be/src/exec/odbc_scan_node.cpp:57
#449 0x0000561787990495 in doris::PlanFragmentExecutor::prepare(doris::TExecPlanFragmentParams const&, doris::QueryFragmentsCtx*) ()
at /mnt/disk2/ygl/installs/ldbtools/include/c++/11/bits/unique_ptr.h:421
#450 0x000056178790106d in doris::FragmentExecState::prepare (this=this@entry=0x5617b76a2000, params=...)
at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/fragment_mgr.cpp:227
#451 0x0000561787906b87 in doris::FragmentMgr::exec_plan_fragment(doris::TExecPlanFragmentParams const&, std::function<void (doris::PlanFragmentExecutor*)>) () at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/fragment_mgr.cpp:646
#452 0x0000561787908bd0 in doris::FragmentMgr::exec_plan_fragment(doris::TExecPlanFragmentParams const&) ()
at /mnt/disk2/ygl/installs/ldbtools/include/c++/11/tuple:746
#453 0x0000561787a00cb6 in doris::PInternalServiceImpl<doris::PBackendService>::_exec_plan_fragment (this=0x5617906fe4e0,
ser_request=..., version=<optimized out>, compact=<optimized out>)
at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/exec_env.h:150
---Type <return> to continue, or q <return> to quit---q
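Quit
Enter q and press Enter to leave the listing, then enter f 449 to switch to frame 449. Next, enter p _query_id to get the query_id (the hi value is enough for searching), and enter p /x followed by the hi value to convert it to hexadecimal:
(gdb) f 449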
#449 0x0000561787990495 in doris::PlanFragmentExecutor::prepare(doris::TExecPlanFragmentParams const&, doris::QueryFragmentsCtx*) ()
at /mnt/disk2/ygl/installs/ldbtools/include/c++/11/bits/unique_ptr.h:421
421 /mnt/disk2/ygl/installs/ldbtools/include/c++/11/bits/unique_ptr.h: No such file or directory.
(gdb) p _query_id
$1 = {<apache::thrift::TBase> = {_vptr.TBase = 0x56178ca192a0 <vtable for doris::TUniqueId+48>}, hi = -2521141818464581758,
lo = -7080784611811882727}
(gdb) p /x -2521141818464581758
$2 = 0xdd031adfaa094782
Now search fe.audit.log on every FE for this query_id (grep the fe.audit.log file for the corresponding date), as shown below. The Stmt field reveals the problematic SQL:
[root@localhost log]# grep dd031adfaa094782 fe.audit.log.20230717-1
2023-07-17 14:08:06,389 [query] |Client=10.196.166.3:34996|User=root|Db=default_cluster:ssom_doris|State=ERR|Time=3377|ScanBytes=0|ScanRows=0|ReturnRows=0|StmtId=14121202|QueryId=dd031adfaa094782-9dbbffe941e68919|IsQuery=true|feIp=10.196.166.4|Stmt=SELECT columns FROM table_name WHERE del_flag='0' AND ((condition1 = '113.108.173.100' AND condition2 = 3602959022916898816) OR (condition1 = '61.147.93.7' AND condition2 = null) OR (condition1 = '120.197.38.18' AND condition2 = null) OR (condition1 = '61.147.96.60' AND condition2 = null) OR (condition1 = '119.34.177.100' AND condition2 = null) OR (condition1 = '14.125.55.70' AND condition2 = null)....)|CpuTimeMS=0|SqlHash=e332b6574b085aa6a57425e79cbf4104|peakMemoryBytes=0
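The last two steps (turning the signed hi/lo values into the hexadecimal QueryId and pulling Stmt out of the audit log) can be sketched in Python. This is an illustration under assumptions. The "hi-lo" format without zero padding is inferred from the QueryId in the audit line above. The parser assumes fields are separated by "|" and that the statement itself contains no "|". The sample line is a shortened, made-up entry in the same layout.

```python
MASK64 = (1 << 64) - 1  # reinterpret a signed 64-bit value as unsigned

def to_query_id(hi, lo):
    """Format gdb's signed hi/lo pair like the QueryId seen in fe.audit.log."""
    return f"{hi & MASK64:x}-{lo & MASK64:x}"

def parse_audit_line(line):
    """Split an audit line into a dict of key=value fields separated by '|'."""
    fields = {}
    # Part 0 is the timestamp and the '[query]' tag, so skip it.
    for part in line.split("|")[1:]:
        key, sep, value = part.partition("=")  # split on the first '=' only
        if sep:
            fields[key] = value
    return fields

def find_statements(lines, query_id):
    """Return the Stmt of every audit line whose QueryId matches."""
    result = []
    for line in lines:
        fields = parse_audit_line(line)
        if fields.get("QueryId") == query_id:
            result.append(fields.get("Stmt"))
    return result

# hi and lo copied from the gdb output of p _query_id.
query_id = to_query_id(-2521141818464581758, -7080784611811882727)
print(query_id)  # dd031adfaa094782-9dbbffe941e68919

# Shortened, hypothetical audit line using the same field layout.
sample = ("2023-07-17 14:08:06,389 [query] |Client=10.196.166.3:34996|User=root"
          "|QueryId=dd031adfaa094782-9dbbffe941e68919|IsQuery=true"
          "|Stmt=SELECT 1|CpuTimeMS=0")
print(find_statements([sample], query_id))  # ['SELECT 1']
```

In practice the lines would come from reading the dated fe.audit.log file on each FE. Matching on the full hi-lo value avoids false hits that a search on the hi half alone could produce.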
Reference: https://www.jianshu.com/p/60a5df15093c