A Guide to 20+ Common Ceph Cluster Errors

                                              Author: 尹正杰

Copyright notice: this is an original work. Reproduction without permission is prohibited and may result in legal action.


Q1: A host that still has daemons running cannot be removed

Error message

[root@ceph141 ~]# ceph orch host rm ceph143
Error EINVAL: Not allowed to remove ceph143 from cluster. The following daemons are running in the host:
type                 id             
-------------------- ---------------
node-exporter        ceph143        

Please run 'ceph orch host drain ceph143' to remove daemons from host
[root@ceph141 ~]# 

Cause

	The host to be removed still has daemons deployed on it; drain the host first, then remove it.

Solution

[root@ceph141 ~]# ceph orch host drain ceph143
Scheduled to remove the following daemons from host 'ceph143'
type                 id             
-------------------- ---------------
node-exporter        ceph143        
[root@ceph141 ~]#
[root@ceph141 ~]# ceph orch host drain ceph143
Scheduled to remove the following daemons from host 'ceph143'
type                 id             
-------------------- ---------------
[root@ceph141 ~]# 
[root@ceph141 ~]# ceph orch host rm ceph143
Removed  host 'ceph143'
[root@ceph141 ~]# 

Q2: Docker proxy error

Error message

stat: stderr docker: Error response from daemon: Get "https://quay.io/v2/": proxyconnect tcp: dial tcp 10.0.0.1:7890: connect: connection refused.

Cause

	Docker is configured to use a proxy that is unreachable (connection refused), so pulling the image from quay.io fails.

Solution

	Either comment out Docker's proxy configuration lines, or bring the proxy back up.
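	A minimal sketch of the first option, assuming the proxy is configured through a systemd drop-in file (the path /etc/systemd/system/docker.service.d/http-proxy.conf and the proxy address are assumptions; adjust them to your environment):

# Comment out (or correct) the Environment="HTTP_PROXY=..." / "HTTPS_PROXY=..." lines
vim /etc/systemd/system/docker.service.d/http-proxy.conf

# Reload systemd and restart docker so the change takes effect
systemctl daemon-reload
systemctl restart docker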

Q3: Ceph 19 disallows a replica count of 1 by default

Error message

[root@ceph141 ~]# ceph osd pool set oldboyedu size 1
Error EPERM: configuring pool size as 1 is disabled by default.
[root@ceph141 ~]# 

Cause

	Ceph 19 disallows setting a pool's replica count (size) to 1 by default.

Solution

	Upstream disables a size of 1 by default to prevent data loss; keep at least two replicas in production.
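	If a single replica is truly required in a test environment, a minimal sketch, assuming your release supports the mon_allow_pool_size_one option:

# Allow pools with a single replica (test/lab environments only; data loss risk)
ceph config set global mon_allow_pool_size_one true

# Setting size to 1 additionally requires an explicit confirmation flag
ceph osd pool set oldboyedu size 1 --yes-i-really-mean-it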

Q4: The mon component refuses to delete a pool without confirmation

Error message

[root@ceph141 ~]# ceph osd pool delete yinzhengjie
Error EPERM: WARNING: this will *PERMANENTLY DESTROY* all data stored in pool yinzhengjie.  If you are *ABSOLUTELY CERTAIN* that is what you want, pass the pool name *twice*, followed by --yes-i-really-really-mean-it.
[root@ceph141 ~]# 

Cause

	By default, the mon component refuses to delete a pool unless the request is explicitly confirmed.

Solution

	Pass the pool name twice, followed by the --yes-i-really-really-mean-it flag.

Q5: Pool deletion is not enabled

Error message

[root@ceph141 ~]# ceph osd pool delete yinzhengjie yinzhengjie --yes-i-really-really-mean-it
Error EPERM: pool deletion is disabled; you must first set the mon_allow_pool_delete config option to true before you can destroy a pool
[root@ceph141 ~]# 

Cause

	Pool deletion is disabled by default.

Solution

	Set mon_allow_pool_delete to true, then retry the deletion.
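	A minimal sketch of the two steps, reusing the pool name from the error above:

# Allow the mon daemons to process pool deletions
ceph config set mon mon_allow_pool_delete true

# Retry the deletion
ceph osd pool delete yinzhengjie yinzhengjie --yes-i-really-really-mean-it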

Q6: A pool cannot be deleted while its nodelete flag is true

Error message

[root@ceph141 ~]# ceph osd pool delete oldboyedu oldboyedu --yes-i-really-really-mean-it
Error EPERM: pool deletion is disabled; you must unset nodelete flag for the pool first
[root@ceph141 ~]# 

Cause

	The pool's nodelete flag is set to true, so deletion is refused.

Solution

	Set the nodelete flag back to false.
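	A minimal sketch, reusing the pool name from the error above:

# Clear the nodelete flag, then delete the pool as usual
ceph osd pool set oldboyedu nodelete false
ceph osd pool delete oldboyedu oldboyedu --yes-i-really-really-mean-it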

Q7: 1 pool(s) do not have an application enabled

Error message

[root@ceph141 ~]# ceph -s
  cluster:
    id:     0f06b0e2-b128-11ef-9a37-4971ded8a98b
    health: HEALTH_WARN 
            1 pool(s) do not have an application enabled
 
  services:
    mon: 3 daemons, quorum ceph141,ceph142,ceph143 (age 46h)
    mgr: ceph141.bszrgd(active, since 47h), standbys: ceph143.ihhymg
    osd: 7 osds: 7 up (since 46h), 7 in (since 46h)
 
  data:
    pools:   3 pools, 33 pgs
    objects: 2 objects, 449 KiB
    usage:   531 MiB used, 3.3 TiB / 3.3 TiB avail
    pgs:     45.455% pgs unknown
             3.030% pgs not active
             17 active+clean
             15 unknown
             1  creating+peering
 
[root@ceph141 ~]# 

Cause

	"1 pool(s) do not have an application enabled" means one pool has not declared which application (rbd, cephfs, rgw, ...) it is used for.

Solution

	Check which pools lack an "application" field:
[root@ceph141 ~]# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 22 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 6.98
pool 5 'oldboyedu' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 531 flags hashpspool stripe_width 0 application rbd read_balance_score 2.19
pool 6 'yinzhengjie' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 534 flags hashpspool stripe_width 0 read_balance_score 2.63

[root@ceph141 ~]# 
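	In the output above, the pool 'yinzhengjie' carries no application tag. A minimal sketch of the fix, assuming the pool is meant for RBD (use cephfs or rgw instead if that matches its actual use):

# Tag the pool with its application type so the health warning clears
ceph osd pool application enable yinzhengjie rbd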

Q8: shrinking an image is only allowed with the --allow-shrink flag

Error message

[root@ceph141 ~]# rbd resize -s 4G oldboyedu/wp
rbd: shrinking an image is only allowed with the --allow-shrink flag
[root@ceph141 ~]# 

Cause

	Growing an RBD image needs no extra flag, but shrinking one requires the --allow-shrink flag.

Solution

	Add the --allow-shrink flag when shrinking the image. Shrink images in production with great care!
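	A minimal sketch, reusing the image from the error above:

# Shrink the image to 4 GiB; any data beyond the new size is lost
rbd resize -s 4G oldboyedu/wp --allow-shrink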

Q9: rbd: unmap failed: (16) Device or resource busy

Error message

[root@harbor250 ~]# rbd unmap oldboyedu/mysql80
rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy
[root@harbor250 ~]# 

Cause

	The device is still in use; most likely it has not been unmounted.

Solution

	Unmount the device before removing the mapping.
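	A minimal sketch, assuming the image is mounted at /data (the mount point is hypothetical; check your own with lsblk or mount):

# Identify the mapped device and its mount point
lsblk | grep rbd

# Unmount first (hypothetical mount point), then remove the mapping
umount /data
rbd unmap oldboyedu/mysql80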

Q10: rbd: failed to create snapshot: (122) Disk quota exceeded

Error message

[root@ceph141 ~]# rbd snap create oldboyedu/mysql80  --snap oldboyedu-linux94-hehe
Creating snap: 10% complete...failed.
rbd: failed to create snapshot: (122) Disk quota exceeded
[root@ceph141 ~]# 

Cause

	Creating the snapshot would exceed the snapshot limit configured on the image.

Solution

	- Raise the snapshot limit;
	- Clear the snapshot limit entirely;
	- Delete existing snapshots to free up snapshot slots.
	All three options are sketched below.
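	A minimal sketch of the three options, reusing the image from the error above (the limit value and snapshot name are placeholders):

# Option 1: raise the snapshot limit, e.g. allow up to 10 snapshots
rbd snap limit set oldboyedu/mysql80 --limit 10

# Option 2: remove the snapshot limit entirely
rbd snap limit clear oldboyedu/mysql80

# Option 3: delete an existing snapshot to free a slot
rbd snap ls oldboyedu/mysql80
rbd snap rm oldboyedu/mysql80 --snap <existing-snap-name>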

Q11: The queried user does not exist

Error message

[root@ceph141 ~]# ceph auth get client.yinzhengjie
Error ENOENT: failed to find client.yinzhengjie in keyring
[root@ceph141 ~]# 

Cause

	The user being queried does not exist.

Solution

	Check that the user name is spelled correctly and that the user has actually been created.
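	A minimal sketch for checking which users actually exist:

# List all auth entities and look for the expected client name
ceph auth ls | grep "^client"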

Q12: The user already exists and cannot be created again

Error message

[root@ceph141 ~]# ceph auth get-or-create client.yinzhengjie mon 'allow rwx' osd 'allow r'
Error EINVAL: key for client.yinzhengjie exists but cap mon does not match
[root@ceph141 ~]# 

Cause

	The user already exists, but with caps that differ from the ones requested.

Solution

	- Delete the existing user and recreate it, or
	- Modify the existing user's caps in place.
	Both options are sketched below.
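	A minimal sketch of both options, reusing the user and caps from the error above:

# Option 1: delete the existing user, then re-run get-or-create
ceph auth del client.yinzhengjie
ceph auth get-or-create client.yinzhengjie mon 'allow rwx' osd 'allow r'

# Option 2: update the caps of the existing user in place
ceph auth caps client.yinzhengjie mon 'allow rwx' osd 'allow r'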

Q13: RADOS permission denied

Error message

[root@harbor250 ~]# ceph -s
2024-12-05T20:13:07.356+0800 7fb53c40e640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory

2024-12-05T20:13:07.356+0800 7fb53c40e640 -1 AuthRegistry(0x7fb534063668) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx

2024-12-05T20:13:07.356+0800 7fb53c40e640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory

2024-12-05T20:13:07.356+0800 7fb53c40e640 -1 AuthRegistry(0x7fb53c40cf80) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx

2024-12-05T20:13:07.360+0800 7fb53a9ab640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]

2024-12-05T20:13:07.360+0800 7fb53a1aa640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]

2024-12-05T20:13:07.360+0800 7fb5399a9640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]

2024-12-05T20:13:07.360+0800 7fb53c40e640 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication

[errno 13] RADOS permission denied (error connecting to the cluster)
[root@harbor250 ~]# 

Cause

	The ceph CLI connects as client.admin by default; this node has no admin keyring, so authentication is denied.

Solution

	- Copy the admin keyring (and ceph.conf) into the appropriate directory on the client, or
	- Use "--user" or "--id" to authenticate as a custom user.
	Both options are sketched below.
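	A minimal sketch of both options, assuming ceph141 is an admin node and the commands are run on the client harbor250 (host names follow the examples in this article):

# Option 1: copy the cluster config and admin keyring from an admin node
scp ceph141:/etc/ceph/ceph.conf ceph141:/etc/ceph/ceph.client.admin.keyring /etc/ceph/

# Option 2: authenticate as a custom user whose keyring is already present on the client
ceph -s --id k3s    # equivalent to --user k3s; reads /etc/ceph/ceph.client.k3s.keyring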

Q14: Operation not permitted

Error message

[root@harbor250 ~]# rbd -p linux94 ls  -l --id k3s
2024-12-05T20:21:05.659+0800 7f3e2228d4c0 -1 librbd::api::Image: list_images: error listing v1 images: (1) Operation not permitted

rbd: listing images failed: (1) Operation not permitted
[root@harbor250 ~]# 

Cause

	The ceph user currently in use does not have sufficient permissions.

Solution

	Adjust the caps of the corresponding user.
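	A minimal sketch, granting client.k3s the RBD profile on the linux94 pool (run on an admin node; the exact caps are an assumption, adjust them to your needs):

# Grant rbd-oriented caps on the linux94 pool to client.k3s
ceph auth caps client.k3s mon 'profile rbd' osd 'profile rbd pool=linux94'

# Verify on the client that listing now works
rbd -p linux94 ls -l --id k3s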

Q15: rbd: couldn't connect to the cluster!

Error message

[root@harbor250 ~]# rbd -p oldboyedu ls  -l --id k3s
2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 auth: error parsing file /etc/ceph/ceph.client.k3s.keyring: error setting modifier for [client.k3s] type=key val=aQAcmFFnpogjBRAAVUr1iwjxlXkbTCreXoizrg==: Malformed input

2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 auth: failed to load /etc/ceph/ceph.client.k3s.keyring: (5) Input/output error

2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 auth: error parsing file /etc/ceph/ceph.client.k3s.keyring: error setting modifier for [client.k3s] type=key val=aQAcmFFnpogjBRAAVUr1iwjxlXkbTCreXoizrg==: Malformed input

2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 auth: failed to load /etc/ceph/ceph.client.k3s.keyring: (5) Input/output error

2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 auth: error parsing file /etc/ceph/ceph.client.k3s.keyring: error setting modifier for [client.k3s] type=key val=aQAcmFFnpogjBRAAVUr1iwjxlXkbTCreXoizrg==: Malformed input

2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 auth: failed to load /etc/ceph/ceph.client.k3s.keyring: (5) Input/output error

2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 monclient: keyring not found

rbd: couldn't connect to the cluster!
rbd: listing images failed: (5) Input/output error
[root@harbor250 ~]# 

Cause

	The client cannot connect to the cluster: its keyring file is malformed, so the key cannot be parsed.

Solution

	Check whether the keyring file has been tampered with; if so, re-export it from the cluster.
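	A minimal sketch for re-exporting the keyring, assuming ceph141 is an admin node and harbor250 is the client:

# On the admin node: export the user's keyring again
ceph auth get client.k3s > ceph.client.k3s.keyring

# Copy it back to the client and retry
scp ceph.client.k3s.keyring harbor250:/etc/ceph/
rbd -p oldboyedu ls -l --id k3s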

Q16: ERROR: Parameter problem: File name required, not only the bucket name. Alternatively use --recursive

Error message

[root@ceph141 ~]# s3cmd rm s3://yinzhengjie-bucket
ERROR: Parameter problem: File name required, not only the bucket name. Alternatively use --recursive
[root@ceph141 ~]# 

Cause

	The command names only the bucket, but the bucket still contains objects; s3cmd rm needs either an object name or --recursive.

Solution

	To delete the bucket's contents, add the --recursive option for a recursive delete.

Q17: ERROR: Parameter problem: Please use --force to delete ALL contents of s3://yinzhengjie-bucket

Error message

[root@ceph141 ~]# s3cmd rm s3://yinzhengjie-bucket --recursive
ERROR: Parameter problem: Please use --force to delete ALL contents of s3://yinzhengjie-bucket
[root@ceph141 ~]# 

Cause

	Deleting all contents of a bucket additionally requires explicit confirmation with --force.

Solution

	Add the "--force" option to force the deletion.
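	A minimal sketch; the rb subcommand at the end is an extra step that removes the now-empty bucket itself:

# Recursively and forcibly delete every object in the bucket
s3cmd rm s3://yinzhengjie-bucket --recursive --force

# Optionally remove the empty bucket as well
s3cmd rb s3://yinzhengjie-bucket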

Q18: 2 failed cephadm daemon(s)

Error message

[root@ceph141 ~]# ceph -s
  cluster:
    id:     0f06b0e2-b128-11ef-9a37-4971ded8a98b
    health: HEALTH_WARN
            2 failed cephadm daemon(s)
            
 
  services:
    mon: 3 daemons, quorum ceph141,ceph142,ceph143 (age 46h)
    mgr: ceph141.bszrgd(active, since 47h), standbys: ceph143.ihhymg
    osd: 7 osds: 7 up (since 46h), 7 in (since 46h)
 
  data:
    pools:   3 pools, 33 pgs
    objects: 2 objects, 449 KiB
    usage:   531 MiB used, 3.3 TiB / 3.3 TiB avail
    pgs:     45.455% pgs unknown
             3.030% pgs not active
             17 active+clean
             15 unknown
             1  creating+peering
 
[root@ceph141 ~]# 

Cause

	"2 failed cephadm daemon(s)" means two daemons managed by cephadm are not running properly.

Solution

	Use "ceph orch ps" to find the failed daemons and "ceph orch daemon restart" to restart them, then check whether the services come back up.
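	A minimal sketch; the daemon name node-exporter.ceph143 is hypothetical, substitute the names reported in an error state in your own output:

# Find daemons that are not in the running state
ceph orch ps | grep -v running

# Restart a failed daemon by its full name (hypothetical name shown)
ceph orch daemon restart node-exporter.ceph143

# Re-check cluster health
ceph -s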

Q19: rule 0 already exists

Error message

[root@ceph141 ~]# crushtool -c yinzhengjie-hdd-ssd.file -o yinzhengjie-hdd-ssd.crushmap
rule 0 already exists
[root@ceph141 ~]#

Cause

	A rule with id 0 already exists, meaning the same rule id is defined more than once in the CRUSH map source.

Solution

	Check yinzhengjie-hdd-ssd.file for conflicting rule ids and give each rule a unique id.
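	A minimal sketch for locating the duplicate ids before recompiling (the grep pattern assumes the standard decompiled CRUSH map layout, where each rule block declares its id on the line after the rule header):

# Show every rule and the id it declares; each id must be unique
grep -A 1 '^rule' yinzhengjie-hdd-ssd.file

# After assigning unique ids, recompile
crushtool -c yinzhengjie-hdd-ssd.file -o yinzhengjie-hdd-ssd.crushmap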