16 Common Fault Troubleshooting (Reposted)

Common Fault Troubleshooting

Ceph Troubleshooting Approach

The Rook-Ceph components run on top of Kubernetes, so maintaining Ceph means maintaining both the Kubernetes cluster and the Ceph cluster and keeping both healthy. Ceph generally depends on Kubernetes, but a healthy Kubernetes cluster does not by itself mean Ceph is healthy: Ceph ships its own built-in mechanisms for tracking and reporting its state.
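Before drilling into individual components, it helps to confirm both layers at a glance. A minimal sketch, assuming the default rook-ceph namespace and that the Rook CephCluster resource is in use (names may differ in your deployment):

# Confirm the Kubernetes side first: nodes must be Ready before the Ceph pods can be healthy
kubectl get nodes

# Confirm the Rook view of Ceph: the CephCluster resource reports the deployment phase and health
kubectl -n rook-ceph get cephcluster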

  • For Kubernetes, make sure the relevant pods are all in a normal (Running) state
[root@m1 ceph]# kubectl get pods -n rook-ceph
NAME                                                        READY   STATUS             RESTARTS   AGE
csi-cephfsplugin-7jhg5                                      3/3     Running            3          8d
......
csi-cephfsplugin-provisioner-8658f67749-jxshb               6/6     Running            190        8d
......
csi-rbdplugin-2q747                                         3/3     Running            16         8d
......
csi-rbdplugin-provisioner-94f699d86-bh4fv                   6/6     Running            69         27h
......
prometheus-rook-prometheus-0                                3/3     Running            1          27h
rook-ceph-crashcollector-192.168.100.133-778bbd9bc5-slv77   1/1     Running            3          7d6h
......
rook-ceph-mds-myfs-a-5558ffd8db-d526j                       1/1     Running            0          92m
rook-ceph-mds-myfs-b-55df4cd74b-z84kq                       1/1     Running            0          27h
......
rook-ceph-mgr-a-868b455884-8f4h6                            1/1     Running            33         28h
rook-ceph-mon-b-7486b4b679-hbsng                            1/1     Running            0          34m
......
rook-ceph-operator-fd756b5dc-xhxf2                          1/1     Running            0          8h
rook-ceph-osd-0-66dd4575f7-c64wh                            1/1     Running            23         28h
rook-ceph-osd-1-5866f9f558-jq994                            1/1     Running            5          28h
......
rook-ceph-rgw-my-store-a-847c97bc4-vhxjx                    1/1     Running            3          12h
rook-ceph-rgw-my-store-b-c4bc6b4b6-rhbqz                    1/1     Running            0          6d22h
rook-ceph-tools-77bf5b9b7d-9pq6m                            1/1     Running            3          7d22h
  • If a pod is abnormal, use kubectl describe to inspect its Kubernetes events and kubectl logs to read the logs of the containers inside it, which together give a view of what is going on inside the container (a describe sketch follows the log output below)
[root@m1 ceph]# kubectl -n rook-ceph logs -f csi-cephfsplugin-provisioner-8658f67749-whmrx csi-attacher
I1202 08:44:14.847409       1 main.go:91] Version: v3.0.0
I1202 08:44:14.849004       1 connection.go:153] Connecting to unix:///csi/csi-provisioner.sock
I1202 08:44:14.851324       1 common.go:111] Probing CSI driver for readiness
W1202 08:44:14.853018       1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
I1202 08:44:14.856296       1 leaderelection.go:243] attempting to acquire leader lease  rook-ceph/external-attacher-leader-rook-ceph-cephfs-csi-ceph-com...
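For the event side mentioned above, a minimal kubectl describe sketch (the pod name is copied from the listing above and will differ in your cluster):

# Scheduling, image-pull, probe, and restart events are listed at the end of the describe output
kubectl -n rook-ceph describe pod csi-cephfsplugin-provisioner-8658f67749-whmrx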
  • For the Ceph side, Ceph provides many built-in diagnostic commands, such as ceph -s for the overall cluster health and ceph health detail for a per-item breakdown (a follow-up sketch appears after the output below)
[root@m1 ceph]# ceph -s
  cluster:
    id:     17a413b5-f140-441a-8b35-feec8ae29521
    health: HEALTH_WARN
            2 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum b,d,e (age 37m)
    mgr: a(active, since 98m)
    mds: myfs:2 {0=myfs-d=up:active,1=myfs-b=up:active} 2 up:standby-replay
    osd: 5 osds: 5 up (since 41m), 5 in (since 27h)
    rgw: 2 daemons active (my.store.a, my.store.b)
 
  task status:
 
  data:
    pools:   16 pools, 353 pgs
    objects: 910 objects, 1.5 GiB
    usage:   10 GiB used, 240 GiB / 250 GiB avail
    pgs:     353 active+clean
 
  io:
    client:   1.7 KiB/s rd, 3 op/s rd, 0 op/s wr
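Here ceph -s reports HEALTH_WARN with "2 daemons have recently crashed"; ceph health detail names the affected daemons, and the crash module can show and then archive the individual reports. A minimal sketch (the crash ID is a placeholder):

# Expand the HEALTH_WARN summary into per-item detail
ceph health detail

# List recent crash reports and inspect one of them (replace <crash-id> with a real ID from the list)
ceph crash ls
ceph crash info <crash-id>

# Archive the reports once they have been handled so the RECENT_CRASH warning clears
ceph crash archive-all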
  • In addition, there are several tools for inspecting OSD state
# View the OSD tree structure
[root@m1 ceph]# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
 -1         0.99518  root default                                       
 -5         0.04880      host 192-168-100-133                           
  0    hdd  0.04880          osd.0                 up   1.00000  1.00000
 -3         0.04880      host 192-168-100-134                           
  1    hdd  0.04880          osd.1                 up   1.00000  1.00000
 -7         0.04880      host 192-168-100-135                           
  2    hdd  0.04880          osd.2                 up   1.00000  1.00000
 -9         0.79999      host 192-168-100-136                           
  3    hdd  0.79999          osd.3                 up   1.00000  1.00000
-11         0.04880      host 192-168-100-137                           
  4    hdd  0.04880          osd.4                 up   1.00000  1.00000

# View per-OSD disk usage
[root@m1 ceph]# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META      AVAIL    %USE  VAR   PGS  STATUS
 0    hdd  0.04880   1.00000   50 GiB  2.0 GiB  1.0 GiB  600 KiB  1023 MiB   48 GiB  4.01  1.00  176      up
 1    hdd  0.04880   1.00000   50 GiB  1.8 GiB  784 MiB  732 KiB  1023 MiB   48 GiB  3.53  0.88  175      up
 2    hdd  0.04880   1.00000   50 GiB  1.9 GiB  875 MiB  369 KiB  1024 MiB   48 GiB  3.71  0.93  179      up
 3    hdd  0.79999   1.00000   50 GiB  2.6 GiB  1.6 GiB  2.2 MiB  1022 MiB   47 GiB  5.24  1.31  353      up
 4    hdd  0.04880   1.00000   50 GiB  1.8 GiB  794 MiB  2.1 MiB  1022 MiB   48 GiB  3.55  0.89  176      up
                       TOTAL  250 GiB   10 GiB  5.0 GiB  5.9 MiB   5.0 GiB  240 GiB  4.01                   
MIN/MAX VAR: 0.88/1.31  STDDEV: 0.64

# View the OSD status summary
[root@m1 ceph]# ceph osd status
ID  HOST              USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE      
 0  192.168.100.133  2053M  47.9G      0        0       0        0   exists,up  
 1  192.168.100.134  1808M  48.2G      0        0       0        0   exists,up  
 2  192.168.100.135  1899M  48.1G      0        0       0        0   exists,up  
 3  192.168.100.136  2684M  47.3G      0        0       7      211   exists,up  
 4  192.168.100.137  1818M  48.2G      0        0       0        0   exists,up 

# View OSD utilization (PG distribution across OSDs)
[root@m1 ceph]# ceph osd utilization
avg 211.8
stddev 70.6127 (expected baseline 13.0169)
min osd.1 with 175 pgs (0.826251 * mean)
max osd.3 with 353 pgs (1.66667 * mean)
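In a Rook deployment the ceph commands above are normally run inside the toolbox pod (rook-ceph-tools in the pod listing earlier); a minimal sketch, assuming the toolbox was deployed with the default deployment name from the Rook examples:

# Open a shell in the toolbox pod, which has ceph.conf and the client keyring mounted
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

# Or run a single command without an interactive shell
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df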

Kubernetes Troubleshooting

When a pod is in an abnormal state, use kubectl describe to check its events and kubectl logs to check the container logs, then combine the two to analyze and locate the fault.
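For pods that keep restarting (several entries in the listing above show high RESTARTS counts), the previous container's log and the recent namespace events are usually the most informative. A minimal sketch (the pod name is taken from the listing above and will differ per cluster):

# Recent events across the namespace, oldest first
kubectl -n rook-ceph get events --sort-by='.lastTimestamp'

# Log of the container instance that ran before the most recent restart
kubectl -n rook-ceph logs rook-ceph-osd-0-66dd4575f7-c64wh --previous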

Ceph Troubleshooting

Analyze further by combining the Ceph status output with the daemon logs.
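Because the Ceph daemons run as pods, their daemon logs are read with kubectl logs against the corresponding deployments; a minimal sketch using the mon and osd names from this cluster (yours will differ):

# Follow the monitor log
kubectl -n rook-ceph logs -f deploy/rook-ceph-mon-b

# Follow an OSD log
kubectl -n rook-ceph logs -f deploy/rook-ceph-osd-0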

Also refer to the official documentation's analysis of common issues.
