Exadata X6-2: RS-7445 [Serv CELLSRV hang detected] [It will be restarted]
1. An on-site colleague found an RS-7445 error in the alert history of one of the X6-2 storage nodes.
# cellcli -e list alerthistory
2023-03-27T23:01:44+08:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
2. Check the alert log on that storage node:
2023-03-27T23:01:44.912828+08:00
[RS] Monitoring process /opt/oracle/cell/cellsrv/bin/cellrsomt (pid: 16281) returned with error: 123
[RS] Service CELLSRV will be restarted.
Errors in file /opt/oracle/cell/log/diag/asm/cell/dm01celadm12/trace/rstrc_16269_omt.trc (incident=1):
RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []
Incident details in: /opt/oracle/cell/log/diag/asm/cell/dm01celadm12/incident/incdir_1/rstrc_16269_omt_i1.trc
2023-03-27T23:01:45.172217+08:00
State dump signal delivered to CELLSRV<16314> by pid - 16269, uid - 0
State dump signal delivered to CELLSRV<16314> by RS.
2023-03-27T23:01:45.947036+08:00
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 924221440 bytes with size 16384 bytes membuf 0x6001d80be000, bioreq 0x600003dbf5d0 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 5584060416 bytes with size 131072 bytes membuf 0x601324800000, bioreq 0x6000042926c8 (errno: Input/output error [5])
Write Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 19931332608 bytes with size 512 bytes membuf 0x6001cbb51400, bioreq 0x600004647cf8 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 924221440 bytes with size 16384 bytes membuf 0x6001d6ea2000, bioreq 0x6002cde5cb48 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 924221440 bytes with size 16384 bytes membuf 0x6001d91de000, bioreq 0x600004526538 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 4483727360 bytes with size 16384 bytes membuf 0x6001d885a000, bioreq 0x600003e77518 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 33554432 bytes with size 512 bytes membuf 0x6001cbbece00, bioreq 0x6002cbe1c108 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 33554432 bytes with size 512 bytes membuf 0x6001cbad3400, bioreq 0x6002cc8b5ab8 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 5584060416 bytes with size 131072 bytes membuf 0x601379500000, bioreq 0x6002d0c7f578 (errno: Input/output error [5])
Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 4483727360 bytes with size 16384 bytes membuf 0x6001d7bc6000, bioreq 0x600003da6d60 (errno: Input/output error [5])
Max number of IO Error messages for FD_00_dm01celadm12 have been logged, further IO error messages for this device are temporary disabled
Mon Mar 27 23:01:45 2023 961 msec State dump completed for CELLSRV<16314>
2023-03-27T23:02:13.900399+08:00
[RS] Stopped Service CELLSRV
2023-03-27T23:02:13.911836+08:00
[RS] Started monitoring process /opt/oracle/cell/cellsrv/bin/cellrsomt with pid 12591
[RS] Previously detected 1 hang(s) for service CELLSRV. Using heartbeat timeout of 8 seconds.
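The log shows that when the RS-7445 error was reported, the flash disk /dev/nvme3n1 (cell disk FD_00_dm01celadm12) was returning I/O errors (errno 5), mostly on reads and once on a write.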
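When the alert log is long, counting the I/O error lines per device makes it easier to confirm that a single flash disk is involved. The following is a minimal sketch, not an Oracle tool: the regular expression is written against the "Read Error" / "Write Error" line format seen in the excerpt above, and the function name `summarize_io_errors` and the sample list are illustrative assumptions. Because the cell stops logging after a maximum number of I/O error messages per device, the counts are a lower bound.

```python
import re
from collections import Counter

# Assumption: matches the "Read Error"/"Write Error" lines exactly as they
# appear in the alert log excerpt above; other releases may format them differently.
IO_ERROR_RE = re.compile(
    r"^(?P<op>Read|Write) Error on Cell Disk (?P<celldisk>\S+) "
    r"\((?P<device>[^)]+)\) at device offset (?P<offset>\d+) bytes "
    r"with size (?P<size>\d+) bytes"
)


def summarize_io_errors(lines):
    """Count I/O errors per (cell disk, device, operation) and per (device, offset)."""
    per_disk = Counter()
    per_offset = Counter()
    for line in lines:
        match = IO_ERROR_RE.match(line.strip())
        if match is None:
            continue  # not an I/O error line
        per_disk[(match["celldisk"], match["device"], match["op"])] += 1
        per_offset[(match["device"], int(match["offset"]))] += 1
    return per_disk, per_offset


# Three lines copied from the alert log above, used as sample input.
SAMPLE_LINES = [
    "Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 924221440 bytes with size 16384 bytes membuf 0x6001d80be000, bioreq 0x600003dbf5d0 (errno: Input/output error [5])",
    "Write Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 19931332608 bytes with size 512 bytes membuf 0x6001cbb51400, bioreq 0x600004647cf8 (errno: Input/output error [5])",
    "Read Error on Cell Disk FD_00_dm01celadm12 (/dev/nvme3n1) at device offset 924221440 bytes with size 16384 bytes membuf 0x6001d6ea2000, bioreq 0x6002cde5cb48 (errno: Input/output error [5])",
]

if __name__ == "__main__":
    disks, offsets = summarize_io_errors(SAMPLE_LINES)
    for key, count in disks.most_common():
        print(key, count)
    for key, count in offsets.most_common():
        print(key, count)
```

Repeated failures at the same offset, as in this log, suggest that the same blocks were being retried rather than that the errors were spread across the device.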
3. Searching My Oracle Support (MOS) turns up two notes: "Exadata: Cell Service crash with RS-7445 [SERV CELLSRV HANG DETECTED] during a flash disk failure (Doc ID 2486713.1)" and "Exadata: Database performance issues or outages after a flash disk failure, Cell Service may crash with RS-7445 [Serv CELLSRV hang detected] (Doc ID 2584475.1)".
In short, I/O failures on the flash disk caused the CELLSRV service to hang.
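To see how long the incident lasted, the timestamps from the alert log can be compared directly. This is a small sketch using only the three timestamps quoted in the log above; the constant and function names are illustrative. Note that the log order reflects when each message was written, not necessarily when the failing I/O was first issued.

```python
from datetime import datetime

# Timestamps copied from the alert log excerpt above.
HANG_DETECTED = "2023-03-27T23:01:44.912828+08:00"     # RS reports the CELLSRV hang
IO_ERRORS_LOGGED = "2023-03-27T23:01:45.947036+08:00"  # I/O errors written during the state dump
CELLSRV_STOPPED = "2023-03-27T23:02:13.900399+08:00"   # RS stops CELLSRV before restarting it


def seconds_between(start, end):
    """Return the elapsed seconds between two ISO 8601 timestamps."""
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()


if __name__ == "__main__":
    print("hang detected -> I/O errors logged:", seconds_between(HANG_DETECTED, IO_ERRORS_LOGGED))
    print("hang detected -> CELLSRV stopped:", seconds_between(HANG_DETECTED, CELLSRV_STOPPED))
```

The I/O errors were logged about one second after the hang was detected, and CELLSRV was stopped about 29 seconds after detection.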
4. As a follow-up, the storage server software needs to be upgraded to resolve the CELLSRV hang.