Oracle 11g RAC: CSSD process cannot start in real-time mode

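Problem description

A host rebooted because of a fault. After the reboot, that node could not bring up its cluster stack, while the other nodes continued to serve requests normally.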

Troubleshooting

  1. Check the cluster status

    The CSS service failed to start.

  2. Check the cluster log

[gpnpd(231513)]CRS-2328:GPNPD started on node xxx.
2023-08-14 19:46:09.210:
[cssd(231620)]CRS-1713:CSSD daemon is started in clustered mode
2023-08-14 19:46:09.219:
[cssd(231620)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /opt/oracle/grid/11.2.0/grid/log/db2/cssd/ocssd.log
2023-08-14 19:46:11.034:
[ohasd(229354)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
  3. Check ocssd.log
view /opt/oracle/grid/11.2.0/grid/log/db2/cssd/ocssd.log

view /opt/oracle/grid/11.2.0/grid/log/jcsjdb2/cssd/ocssd.log
2023-08-14 19:49:55.911: [ CSSD][3743803200](:CSSSC00011:)clssscExit: A fatal error occurred during initialization
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = GIPCNM, LogLevel = , TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = GIPCGM, LogLevel = 2, TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = GIPCCM, LogLevel = 2, TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = CLSF, LogLevel = 0, TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = SKGFD, LogLevel = 0, TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = GPNP, LogLevel = 1, TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = OLR, LogLevel = 0, TraceLevel = 0
[ CSSD][1963468608]clsugetconf : Configuration type [4].
2023-08-14 19:59:56.862: [ CSSD][1963468608]clssscmain:Starting CSS daemon, version 11.2.0.4.0, in (clustered) mode with uniqueness value 1692014396
2023-08-14 19:59:56.863: [ CSSD][1963468608]clssscmain:Environment is production
2023-08-14 19:59:56.863: [ CSSD][1963468608]clssscmain:Core file size limit extended
2023-08-14 19:59:56.868: [ CSSD][1963468608]clssscmain:GIPCHA down 0
2023-08-14 19:59:56.870: [ CSSD][1963468608]clssscGetParameterOLR: OLR fetch for parameter logsize (8) failed with rc 21
2023-08-14 19:59:56.870: [ CSSD][1963468608]clssscExtendLimits: The current soft limit for file descriptors is 65536,hard limit is 65536
2023-08-14 19:59:56.870: [ CSSD][1963468608]clssscExtendLimits: The current soft limit for locked memory is 4294967295, hard limit is 4294967295
2023-08-14 19:59:56.871: [ CSSD][1963468608]clssscGetParameterOLR: OLR fetch for parameter priority (15) failed with rc 21
2023-08-14 19:59:56.871: [ CSSD][1963468608]clssscSetPrivEnv: Setting priority to 4
2023-08-14 19:59:56.881: [ CSSD][1963468608]clssscSetPrivEnv: unable to set priority to 4
2023-08-14 19:59:56.881: [ CSSD][1963468608]SLOS: cat=-2, opn=scls_set_priority_realtime, dep=1, loc=setsched
unable to escalate to real time

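ocssd.log shows that the ocssd process could not obtain the elevated priority it requests at startup, so it could not escalate to real-time scheduling and exited.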
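In a long ocssd.log the relevant lines are easy to miss among the startup messages. The filter below is a minimal sketch for pulling them out; the function name `css_priority_errors` is hypothetical, and the patterns are simply the strings seen in the excerpt above, not an official list of Oracle error markers.

```shell
# Hypothetical helper: print the ocssd.log lines related to the failed
# priority escalation. The patterns come from the log excerpt in this post.
css_priority_errors() {
    log_file="$1"
    grep -E 'clssscSetPrivEnv|scls_set_priority_realtime|unable to escalate to real time|clssscExit' "$log_file"
}
```

For example: css_priority_errors /opt/oracle/grid/11.2.0/grid/log/db2/cssd/ocssd.log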

The symptoms closely match those described in the Oracle support note "Linux: GI OCSSD Fails to Start After cgroups Setting Change" (Doc ID 1577784.1).

  4. Check cgconfig.conf: it contains no configuration, only the default comments.
cat /etc/cgconfig.conf 
#
#  Copyright IBM Corporation. 2007
#
#  Authors:     Balbir Singh <balbir@linux.vnet.ibm.com>
#  This program is free software; you can redistribute it and/or modify it
#  under the terms of version 2.1 of the GNU Lesser General Public License
#  as published by the Free Software Foundation.
#
#  This program is distributed in the hope that it would be useful, but
#  WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
#
# By default, we expect systemd mounts everything on boot,
# so there is not much to do.
# See man cgconfig.conf for further details, how to create groups
# on system boot using this file.
  5. Check /sys/fs/cgroup/cpu/cpu.rt_*
cat /sys/fs/cgroup/cpu/cpu.rt_period_us
1000000

cat /sys/fs/cgroup/cpu/cpu.rt_runtime_us
950000

cpu.rt_runtime_us is already at the recommended value of 950000 (with cpu.rt_period_us at 1000000),
so the solution in "Linux: GI OCSSD Fails to Start After cgroups Setting Change" (Doc ID 1577784.1) does not apply here.
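The two files checked above describe only the root cpu cgroup. As far as I understand, on kernels built with real-time group scheduling each child cpu cgroup carries its own cpu.rt_runtime_us, which can be 0 even when the root value is 950000. The sketch below lists the value for every group under a root; the function name `list_rt_runtime` is hypothetical, and the default path assumes the cgroup v1 layout used in this post.

```shell
# Hypothetical helper: print "<rt_runtime_us> <cgroup directory>" for every
# cgroup below the given root (default: the cgroup v1 cpu controller mount).
list_rt_runtime() {
    root="${1:-/sys/fs/cgroup/cpu}"
    find "$root" -type f -name cpu.rt_runtime_us 2>/dev/null | sort |
    while IFS= read -r f; do
        # Strip the file name so only the cgroup directory is shown.
        printf '%s %s\n' "$(cat "$f")" "${f%/cpu.rt_runtime_us}"
    done
}
```

Running list_rt_runtime with no argument and looking for lines that start with 0 shows which groups have no real-time budget.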

  6. Red Hat's official guidance on the relevant CPU settings
    How to configure a RHEL 7 or RHEL 8 system to be able to run programs requiring Real-Time Scheduling

According to that article, real-time processes cannot be created when CPUAccounting is enabled. Checking system.conf showed that CPUAccounting was not enabled there:

find /etc/systemd/system.conf /etc/systemd/system /usr/lib/systemd -type f | xargs grep -e CPUAccounting -e CPUWeight -e StartupCPUWeight -e CPUShares -e StartupCPUShares -e CPUQuota
# Output
/etc/systemd/system.conf: #DefaultCPUAccounting=no
/usr/lib/systemd/system/titanagent.service:CPUQuota=50%

However, /usr/lib/systemd/system/titanagent.service sets CPUQuota=50%. Setting CPUQuota implicitly turns on CPUAccounting, so CPUAccounting was in effect even though it is not explicitly enabled in the configuration checked in step 6.
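The grep in step 6 also matches commented-out defaults such as #DefaultCPUAccounting=no, which makes the one active setting harder to spot. The variant below is a sketch that reports only non-commented CPU directives; the function name `find_cpu_directives` is hypothetical, and the directive list is the one from the command above plus the Default-prefixed form seen in system.conf.

```shell
# Hypothetical helper: report active (non-commented) CPU control directives
# in the files and directories passed as arguments.
find_cpu_directives() {
    grep -rHE '^[[:space:]]*(Default)?(CPUAccounting|CPUWeight|StartupCPUWeight|CPUShares|StartupCPUShares|CPUQuota)=' "$@" 2>/dev/null
}
```

For example: find_cpu_directives /etc/systemd/system.conf /etc/systemd/system /usr/lib/systemd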

  7. After disabling titanagent.service and rebooting the host, the cluster started normally
systemctl stop titanagent.service
systemctl disable titanagent.service
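Once the node is back up, it may be worth confirming that ocssd really is running under a real-time scheduling class. The sketch below reads rows of "pid cls rtprio comm" and flags the ocssd entries; the function name `check_ocssd_sched` is hypothetical, and it assumes the process name starts with ocssd and that RR and FF are the real-time classes reported by ps.

```shell
# Hypothetical helper: read "pid cls rtprio comm" rows on stdin and report,
# for each ocssd process, whether its scheduling class is real-time (RR/FF).
check_ocssd_sched() {
    awk '$4 ~ /^ocssd/ {
        rt = ($2 == "RR" || $2 == "FF") ? "real-time" : "NOT real-time"
        print $1, $4, $2, $3, rt
    }'
}

# Typical use on the repaired node:
#   ps -eo pid,cls,rtprio,comm | check_ocssd_sched
```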
Posted 2023-08-15 15:06 by 树苗叶子