16 Troubleshooting Common Failures (Repost)
Ceph Troubleshooting Approach
Rook runs the Ceph components on top of Kubernetes, so maintaining Ceph means maintaining both the Kubernetes cluster and the Ceph cluster and keeping both of them healthy. Ceph generally depends on Kubernetes, but a healthy Kubernetes cluster does not by itself mean that Ceph is healthy: Ceph ships with its own built-in health reporting.
- For Kubernetes, make sure the related pods in the rook-ceph namespace are in a healthy state (a quick filter for abnormal pods is sketched after the listing below):
[root@m1 ceph]# kubectl get pods -n rook-ceph
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-7jhg5 3/3 Running 3 8d
......
csi-cephfsplugin-provisioner-8658f67749-jxshb 6/6 Running 190 8d
......
csi-rbdplugin-2q747 3/3 Running 16 8d
......
csi-rbdplugin-provisioner-94f699d86-bh4fv 6/6 Running 69 27h
......
prometheus-rook-prometheus-0 3/3 Running 1 27h
rook-ceph-crashcollector-192.168.100.133-778bbd9bc5-slv77 1/1 Running 3 7d6h
......
rook-ceph-mds-myfs-a-5558ffd8db-d526j 1/1 Running 0 92m
rook-ceph-mds-myfs-b-55df4cd74b-z84kq 1/1 Running 0 27h
......
rook-ceph-mgr-a-868b455884-8f4h6 1/1 Running 33 28h
rook-ceph-mon-b-7486b4b679-hbsng 1/1 Running 0 34m
......
rook-ceph-operator-fd756b5dc-xhxf2 1/1 Running 0 8h
rook-ceph-osd-0-66dd4575f7-c64wh 1/1 Running 23 28h
rook-ceph-osd-1-5866f9f558-jq994 1/1 Running 5 28h
......
rook-ceph-rgw-my-store-a-847c97bc4-vhxjx 1/1 Running 3 12h
rook-ceph-rgw-my-store-b-c4bc6b4b6-rhbqz 1/1 Running 0 6d22h
rook-ceph-tools-77bf5b9b7d-9pq6m 1/1 Running 3 7d22h
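As mentioned above, a quick way to pick out unhealthy pods is sketched here; this is a minimal example assuming the default rook-ceph namespace and uses only standard kubectl selectors:
# List only pods that are not in the Running phase (completed Jobs will also appear here)
kubectl -n rook-ceph get pods --field-selector=status.phase!=Running
# Sort pods by the restart count of their first container to spot crash loops
kubectl -n rook-ceph get pods --sort-by='.status.containerStatuses[0].restartCount'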
- If a pod is abnormal, use describe to inspect its Kubernetes events and logs to read the log produced inside the container, which gives a view into what the container is actually doing (see the sketch after the log excerpt below):
[root@m1 ceph]# kubectl -n rook-ceph logs -f csi-cephfsplugin-provisioner-8658f67749-whmrx csi-attacher
I1202 08:44:14.847409 1 main.go:91] Version: v3.0.0
I1202 08:44:14.849004 1 connection.go:153] Connecting to unix:///csi/csi-provisioner.sock
I1202 08:44:14.851324 1 common.go:111] Probing CSI driver for readiness
W1202 08:44:14.853018 1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
I1202 08:44:14.856296 1 leaderelection.go:243] attempting to acquire leader lease rook-ceph/external-attacher-leader-rook-ceph-cephfs-csi-ceph-com...
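A minimal sketch of the describe-then-logs workflow, with the placeholder <pod-name> standing in for whichever pod is misbehaving; note that --previous only returns output if the container has restarted at least once:
# Show the pod's events, container states and last termination reason
kubectl -n rook-ceph describe pod <pod-name>
# Follow the log of one container inside the pod
kubectl -n rook-ceph logs -f <pod-name> -c csi-attacher
# Read the log of the previous container instance after a crash/restart
kubectl -n rook-ceph logs <pod-name> -c csi-attacher --previous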
- As for Ceph itself, Ceph provides plenty of command-line health checks, such as ceph -s for the overall cluster health and ceph health detail for the detailed report; a sketch for digging into the warning shown in this output follows it:
[root@m1 ceph]# ceph -s
  cluster:
    id:     17a413b5-f140-441a-8b35-feec8ae29521
    health: HEALTH_WARN
            2 daemons have recently crashed

  services:
    mon: 3 daemons, quorum b,d,e (age 37m)
    mgr: a(active, since 98m)
    mds: myfs:2 {0=myfs-d=up:active,1=myfs-b=up:active} 2 up:standby-replay
    osd: 5 osds: 5 up (since 41m), 5 in (since 27h)
    rgw: 2 daemons active (my.store.a, my.store.b)

  task status:

  data:
    pools:   16 pools, 353 pgs
    objects: 910 objects, 1.5 GiB
    usage:   10 GiB used, 240 GiB / 250 GiB avail
    pgs:     353 active+clean

  io:
    client:   1.7 KiB/s rd, 3 op/s rd, 0 op/s wr
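The HEALTH_WARN above reports recently crashed daemons; a minimal sketch for drilling into that kind of warning, using standard Ceph health and crash commands (run wherever the ceph CLI is available, e.g. the toolbox pod):
# Expand the health summary into per-check details
ceph health detail
# List recorded daemon crashes and inspect one of them
ceph crash ls
# ceph crash info <crash-id>   # <crash-id> comes from the ls output above
# Acknowledge all crash reports so the warning clears once the cause is handled
ceph crash archive-all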
- In addition, there are a number of tools for inspecting OSD status; a few more generic OSD queries are sketched after these examples:
# View the OSD tree (CRUSH hierarchy)
[root@m1 ceph]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.99518 root default
-5 0.04880 host 192-168-100-133
0 hdd 0.04880 osd.0 up 1.00000 1.00000
-3 0.04880 host 192-168-100-134
1 hdd 0.04880 osd.1 up 1.00000 1.00000
-7 0.04880 host 192-168-100-135
2 hdd 0.04880 osd.2 up 1.00000 1.00000
-9 0.79999 host 192-168-100-136
3 hdd 0.79999 osd.3 up 1.00000 1.00000
-11 0.04880 host 192-168-100-137
4 hdd 0.04880 osd.4 up 1.00000 1.00000
# View per-OSD disk usage
[root@m1 ceph]# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.04880 1.00000 50 GiB 2.0 GiB 1.0 GiB 600 KiB 1023 MiB 48 GiB 4.01 1.00 176 up
1 hdd 0.04880 1.00000 50 GiB 1.8 GiB 784 MiB 732 KiB 1023 MiB 48 GiB 3.53 0.88 175 up
2 hdd 0.04880 1.00000 50 GiB 1.9 GiB 875 MiB 369 KiB 1024 MiB 48 GiB 3.71 0.93 179 up
3 hdd 0.79999 1.00000 50 GiB 2.6 GiB 1.6 GiB 2.2 MiB 1022 MiB 47 GiB 5.24 1.31 353 up
4 hdd 0.04880 1.00000 50 GiB 1.8 GiB 794 MiB 2.1 MiB 1022 MiB 48 GiB 3.55 0.89 176 up
TOTAL 250 GiB 10 GiB 5.0 GiB 5.9 MiB 5.0 GiB 240 GiB 4.01
MIN/MAX VAR: 0.88/1.31 STDDEV: 0.64
# View the OSD status summary
[root@m1 ceph]# ceph osd status
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 192.168.100.133 2053M 47.9G 0 0 0 0 exists,up
1 192.168.100.134 1808M 48.2G 0 0 0 0 exists,up
2 192.168.100.135 1899M 48.1G 0 0 0 0 exists,up
3 192.168.100.136 2684M 47.3G 0 0 7 211 exists,up
4 192.168.100.137 1818M 48.2G 0 0 0 0 exists,up
# View OSD utilization (PG distribution across OSDs)
[root@m1 ceph]# ceph osd utilization
avg 211.8
stddev 70.6127 (expected baseline 13.0169)
min osd.1 with 175 pgs (0.826251 * mean)
max osd.3 with 353 pgs (1.66667 * mean)
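Beyond the commands above, a few more standard Ceph queries are often useful when an OSD looks suspicious; this is a sketch using generic Ceph CLI commands, nothing Rook-specific:
# Per-OSD commit/apply latency, useful for spotting a slow disk
ceph osd perf
# Placement-group summary and any PGs stuck in non-clean states
ceph pg stat
ceph pg dump_stuck
# Pool definitions (replica size, pg_num, flags) behind the PG numbers above
ceph osd pool ls detail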
k8s Troubleshooting
When a pod is in an abnormal state, use describe to inspect its events and logs to read the container logs, then combine the two for further analysis; a namespace-wide event query is sketched below.
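A small sketch of querying events across the whole namespace, which complements per-pod describe when you are not yet sure which pod is at fault; both options are standard kubectl features:
# Recent events in the rook-ceph namespace, oldest first
kubectl -n rook-ceph get events --sort-by=.metadata.creationTimestamp
# Only warning events, to cut through the noise
kubectl -n rook-ceph get events --field-selector type=Warning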
Ceph Troubleshooting
Analyze further by combining the Ceph cluster status with the daemon logs; the ceph commands are typically run from the toolbox pod, as sketched below.