A Guide to Common Errors When Using a Ceph Cluster
Author: 尹正杰 (Yin Zhengjie)
Copyright notice: original work; reproduction without permission is prohibited and will be pursued legally.
Table of Contents
- Q1: Host with running daemons cannot be removed
- Q2: Docker proxy error
- Q3: Ceph 19 does not allow a pool size of 1 by default
- Q4: The mon component does not allow deleting a pool
- Q5: Pool deletion is not enabled
- Q6: A pool cannot be deleted when its nodelete flag is true
- Q7: 1 pool(s) do not have an application enabled
- Q8: shrinking an image is only allowed with the --allow-shrink flag
- Q9: rbd: unmap failed: (16) Device or resource busy
- Q10: rbd: failed to create snapshot: (122) Disk quota exceeded
- Q11: The queried user does not exist
- Q12: The user already exists and cannot be created
- Q13: RADOS permission denied
- Q14: Operation not permitted
- Q15: rbd: couldn't connect to the cluster!
- Q16: ERROR: Parameter problem: File name required, not only the bucket name. Alternatively use --recursive
- Q17: ERROR: Parameter problem: Please use --force to delete ALL contents of s3://yinzhengjie-bucket
- Q18: 2 failed cephadm daemon(s)
- Q19: rule 0 already exists
Q1: Host with running daemons cannot be removed
Error message
[root@ceph141 ~]# ceph orch host rm ceph143
Error EINVAL: Not allowed to remove ceph143 from cluster. The following daemons are running in the host:
type id
-------------------- ---------------
node-exporter ceph143
Please run 'ceph orch host drain ceph143' to remove daemons from host
[root@ceph141 ~]#
Cause
The host to be removed still has daemons deployed on it; they must be drained before the host can be removed.
Solution
[root@ceph141 ~]# ceph orch host drain ceph143
Scheduled to remove the following daemons from host 'ceph143'
type id
-------------------- ---------------
node-exporter ceph143
[root@ceph141 ~]#
[root@ceph141 ~]# ceph orch host drain ceph143
Scheduled to remove the following daemons from host 'ceph143'
type id
-------------------- ---------------
[root@ceph141 ~]#
[root@ceph141 ~]# ceph orch host rm ceph143
Removed host 'ceph143'
[root@ceph141 ~]#
Q2: Docker proxy error
Error message
stat: stderr docker: Error response from daemon: Get "https://quay.io/v2/": proxyconnect tcp: dial tcp 10.0.0.1:7890: connect: connection refused.
Cause
Docker is configured to use a proxy that cannot be reached.
Solution
Either comment out the proxy configuration and restart Docker, or bring the proxy back up.
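If you go the first route, a minimal sketch (the drop-in path is an assumption; adjust it to wherever your proxy is actually configured):
# Assumed systemd drop-in holding Docker's proxy settings; your path may differ.
vim /etc/systemd/system/docker.service.d/http-proxy.conf   # comment out the Environment="HTTPS_PROXY=..." line
systemctl daemon-reload
systemctl restart docker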
Q3: Ceph 19 does not allow a pool size of 1 by default
Error message
[root@ceph141 ~]# ceph osd pool set oldboyedu size 1
Error EPERM: configuring pool size as 1 is disabled by default.
[root@ceph141 ~]#
Cause
Ceph 19 does not allow setting the pool size (replica count) to 1 by default.
Solution
Ceph disables size 1 by default to protect against data loss; keep at least two replicas in production.
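If you really need size 1 in a disposable test environment, recent releases expose an override; the exact option below is an assumption based on upstream defaults, so treat this as a sketch and never use it where the data matters:
ceph config set global mon_allow_pool_size_one true
ceph osd pool set oldboyedu size 1 --yes-i-really-mean-it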
Q4: The mon component does not allow deleting a pool
Error message
[root@ceph141 ~]# ceph osd pool delete yinzhengjie
Error EPERM: WARNING: this will *PERMANENTLY DESTROY* all data stored in pool yinzhengjie. If you are *ABSOLUTELY CERTAIN* that is what you want, pass the pool name *twice*, followed by --yes-i-really-really-mean-it.
[root@ceph141 ~]#
Cause
By default the mon component does not allow a pool to be deleted with a bare command.
Solution
Pass the pool name twice, followed by the --yes-i-really-really-mean-it flag.
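For example (this may still hit the error covered in Q5 unless mon_allow_pool_delete is also enabled):
ceph osd pool delete yinzhengjie yinzhengjie --yes-i-really-really-mean-it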
Q5: Pool deletion is not enabled
Error message
[root@ceph141 ~]# ceph osd pool delete yinzhengjie yinzhengjie --yes-i-really-really-mean-it
Error EPERM: pool deletion is disabled; you must first set the mon_allow_pool_delete config option to true before you can destroy a pool
[root@ceph141 ~]#
Cause
Pool deletion is disabled by default.
Solution
Set mon_allow_pool_delete to true.
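One way to do that, then retry the deletion (a sketch):
ceph config set mon mon_allow_pool_delete true
ceph osd pool delete yinzhengjie yinzhengjie --yes-i-really-really-mean-it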
Q6: A pool cannot be deleted when its nodelete flag is true
Error message
[root@ceph141 ~]# ceph osd pool delete oldboyedu oldboyedu --yes-i-really-really-mean-it
Error EPERM: pool deletion is disabled; you must unset nodelete flag for the pool first
[root@ceph141 ~]#
Cause
A pool whose nodelete flag is true cannot be deleted.
Solution
Set the nodelete flag back to false.
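For example (a sketch; depending on the release the flag value may need to be 0 instead of false):
ceph osd pool set oldboyedu nodelete false
ceph osd pool delete oldboyedu oldboyedu --yes-i-really-really-mean-it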
Q7: 1 pool(s) do not have an application enabled
Error message
[root@ceph141 ~]# ceph -s
cluster:
id: 0f06b0e2-b128-11ef-9a37-4971ded8a98b
health: HEALTH_WARN
1 pool(s) do not have an application enabled
services:
mon: 3 daemons, quorum ceph141,ceph142,ceph143 (age 46h)
mgr: ceph141.bszrgd(active, since 47h), standbys: ceph143.ihhymg
osd: 7 osds: 7 up (since 46h), 7 in (since 46h)
data:
pools: 3 pools, 33 pgs
objects: 2 objects, 449 KiB
usage: 531 MiB used, 3.3 TiB / 3.3 TiB avail
pgs: 45.455% pgs unknown
3.030% pgs not active
17 active+clean
15 unknown
1 creating+peering
[root@ceph141 ~]#
Cause
"1 pool(s) do not have an application enabled" means one pool has not declared the type of application that will use it.
Solution
Check which pool is missing the "application" field:
[root@ceph141 ~]# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 22 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 6.98
pool 5 'oldboyedu' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 531 flags hashpspool stripe_width 0 application rbd read_balance_score 2.19
pool 6 'yinzhengjie' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 534 flags hashpspool stripe_width 0 read_balance_score 2.63
[root@ceph141 ~]#
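In the output above, pool 6 'yinzhengjie' has no application tag. Enable one for it, e.g. rbd if the pool is used for block devices (a sketch):
ceph osd pool application enable yinzhengjie rbd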
Q8: shrinking an image is only allowed with the --allow-shrink flag
Error message
[root@ceph141 ~]# rbd resize -s 4G oldboyedu/wp
rbd: shrinking an image is only allowed with the --allow-shrink flag
[root@ceph141 ~]#
Cause
Growing an RBD image needs no extra flag, but shrinking one requires the --allow-shrink flag.
Solution
Add the --allow-shrink flag when resizing the image downward. Be careful with shrinking in production!
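For example:
rbd resize -s 4G oldboyedu/wp --allow-shrink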
Q9: rbd: unmap failed: (16) Device or resource busy
Error message
[root@harbor250 ~]# rbd unmap oldboyedu/mysql80
rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy
[root@harbor250 ~]#
Cause
The device is still in use; most likely it has not been unmounted.
Solution
Unmount the filesystem before unmapping the device.
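A typical sequence (the mount point is an assumption; use the real one reported by mount or df -h):
umount /data/mysql80          # example mount point
rbd unmap oldboyedu/mysql80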
Q10: rbd: failed to create snapshot: (122) Disk quota exceeded
Error message
[root@ceph141 ~]# rbd snap create oldboyedu/mysql80 --snap oldboyedu-linux94-hehe
Creating snap: 10% complete...failed.
rbd: failed to create snapshot: (122) Disk quota exceeded
[root@ceph141 ~]#
Cause
Creating the snapshot would exceed the image's snapshot limit.
Solution (pick one of the following; sketches follow the list)
- raise the snapshot limit;
- clear the snapshot limit entirely;
- delete existing snapshots to free up snapshot slots.
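Sketches of the three options (the limit value and snapshot name are examples):
rbd snap limit set oldboyedu/mysql80 --limit 10      # option 1: raise the limit
rbd snap limit clear oldboyedu/mysql80               # option 2: remove the limit
rbd snap rm oldboyedu/mysql80 --snap some-old-snap   # option 3: delete an existing snapshot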
Q11: The queried user does not exist
Error message
[root@ceph141 ~]# ceph auth get client.yinzhengjie
Error ENOENT: failed to find client.yinzhengjie in keyring
[root@ceph141 ~]#
Cause
The requested user does not exist.
Solution
Check that the user name is spelled correctly, and that the user was actually created.
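To list the users that do exist:
ceph auth ls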
Q12: The user already exists and cannot be created
Error message
[root@ceph141 ~]# ceph auth get-or-create client.yinzhengjie mon 'allow rwx' osd 'allow r'
Error EINVAL: key for client.yinzhengjie exists but cap mon does not match
[root@ceph141 ~]#
Cause
The user already exists, and the requested mon cap does not match the existing one.
Solution (examples follow the list)
- delete the existing user and recreate it; or
- update the existing user's caps in place.
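For example:
ceph auth rm client.yinzhengjie                                   # option 1: remove, then run get-or-create again
ceph auth caps client.yinzhengjie mon 'allow rwx' osd 'allow r'   # option 2: update caps in place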
Q13: RADOS permission denied
Error message
[root@harbor250 ~]# ceph -s
2024-12-05T20:13:07.356+0800 7fb53c40e640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2024-12-05T20:13:07.356+0800 7fb53c40e640 -1 AuthRegistry(0x7fb534063668) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2024-12-05T20:13:07.356+0800 7fb53c40e640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2024-12-05T20:13:07.356+0800 7fb53c40e640 -1 AuthRegistry(0x7fb53c40cf80) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2024-12-05T20:13:07.360+0800 7fb53a9ab640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2024-12-05T20:13:07.360+0800 7fb53a1aa640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2024-12-05T20:13:07.360+0800 7fb5399a9640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2024-12-05T20:13:07.360+0800 7fb53c40e640 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication
[errno 13] RADOS permission denied (error connecting to the cluster)
[root@harbor250 ~]#
Cause
The ceph CLI authenticates as client.admin by default; without a usable keyring on this node, permission is denied.
Solution (examples follow the list)
- copy the admin keyring file into the appropriate directory; or
- use "--user" or "--id" to specify the name of a user whose keyring is available.
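For example (host names and paths follow the environment above; both options assume /etc/ceph/ceph.conf is already in place):
scp ceph141:/etc/ceph/ceph.client.admin.keyring /etc/ceph/   # option 1: copy the admin keyring from an admin node
ceph -s --id k3s                                             # option 2: use a non-admin user whose keyring exists locally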
Q14: Operation not permitted
Error message
[root@harbor250 ~]# rbd -p linux94 ls -l --id k3s
2024-12-05T20:21:05.659+0800 7f3e2228d4c0 -1 librbd::api::Image: list_images: error listing v1 images: (1) Operation not permitted
rbd: listing images failed: (1) Operation not permitted
[root@harbor250 ~]#
Cause
The Ceph user currently in use does not have sufficient caps.
Solution
Adjust the caps of that user.
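A sketch of granting the k3s user RBD access to the linux94 pool (the exact caps are an assumption; scope them to what the client actually needs):
ceph auth caps client.k3s mon 'profile rbd' osd 'profile rbd pool=linux94'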
Q15: rbd: couldn't connect to the cluster!
Error message
[root@harbor250 ~]# rbd -p oldboyedu ls -l --id k3s
2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 auth: error parsing file /etc/ceph/ceph.client.k3s.keyring: error setting modifier for [client.k3s] type=key val=aQAcmFFnpogjBRAAVUr1iwjxlXkbTCreXoizrg==: Malformed input
2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 auth: failed to load /etc/ceph/ceph.client.k3s.keyring: (5) Input/output error
2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 auth: error parsing file /etc/ceph/ceph.client.k3s.keyring: error setting modifier for [client.k3s] type=key val=aQAcmFFnpogjBRAAVUr1iwjxlXkbTCreXoizrg==: Malformed input
2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 auth: failed to load /etc/ceph/ceph.client.k3s.keyring: (5) Input/output error
2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 auth: error parsing file /etc/ceph/ceph.client.k3s.keyring: error setting modifier for [client.k3s] type=key val=aQAcmFFnpogjBRAAVUr1iwjxlXkbTCreXoizrg==: Malformed input
2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 auth: failed to load /etc/ceph/ceph.client.k3s.keyring: (5) Input/output error
2024-12-05T20:24:57.669+0800 7f7c5b5394c0 -1 monclient: keyring not found
rbd: couldn't connect to the cluster!
rbd: listing images failed: (5) Input/output error
[root@harbor250 ~]#
Cause
The client cannot connect to the cluster because its keyring file is malformed.
Solution
Check whether the keyring file has been modified; if it has, re-export it from the cluster.
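A sketch of re-exporting the keyring on an admin node (e.g. ceph141) and copying it back to the client:
ceph auth get client.k3s > /etc/ceph/ceph.client.k3s.keyring   # run on an admin node
scp /etc/ceph/ceph.client.k3s.keyring harbor250:/etc/ceph/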
Q16: ERROR: Parameter problem: File name required, not only the bucket name. Alternatively use --recursive
Error message
[root@ceph141 ~]# s3cmd rm s3://yinzhengjie-bucket
ERROR: Parameter problem: File name required, not only the bucket name. Alternatively use --recursive
[root@ceph141 ~]#
Cause
The bucket still contains objects, so they must be deleted recursively.
Solution
Add the --recursive option to delete the bucket's contents recursively.
Q17: ERROR: Parameter problem: Please use --force to delete ALL contents of s3://yinzhengjie-bucket
Error message
[root@ceph141 ~]# s3cmd rm s3://yinzhengjie-bucket --recursive
ERROR: Parameter problem: Please use --force to delete ALL contents of s3://yinzhengjie-bucket
[root@ceph141 ~]#
Cause
s3cmd requires explicit confirmation before deleting everything in the bucket.
Solution
Add the "--force" option to force the deletion.
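Putting Q16 and Q17 together, the command that empties the bucket is:
s3cmd rm s3://yinzhengjie-bucket --recursive --force
s3cmd rb s3://yinzhengjie-bucket   # optionally remove the now-empty bucket itself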
Q18: 2 failed cephadm daemon(s)
Error message
[root@ceph141 ~]# ceph -s
cluster:
id: 0f06b0e2-b128-11ef-9a37-4971ded8a98b
health: HEALTH_WARN
2 failed cephadm daemon(s)
services:
mon: 3 daemons, quorum ceph141,ceph142,ceph143 (age 46h)
mgr: ceph141.bszrgd(active, since 47h), standbys: ceph143.ihhymg
osd: 7 osds: 7 up (since 46h), 7 in (since 46h)
data:
pools: 3 pools, 33 pgs
objects: 2 objects, 449 KiB
usage: 531 MiB used, 3.3 TiB / 3.3 TiB avail
pgs: 45.455% pgs unknown
3.030% pgs not active
17 active+clean
15 unknown
1 creating+peering
[root@ceph141 ~]#
Cause
"2 failed cephadm daemon(s)" means two daemons managed by cephadm are not running properly.
Solution
Use "ceph orch ps" to find the failed daemons and "ceph orch daemon restart" to restart them, then verify that they come back up.
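For example (the daemon name below is an assumption; use the names that ceph orch ps actually reports as failed):
ceph orch ps                                     # look for daemons in an error/stopped state
ceph orch daemon restart node-exporter.ceph143   # restart a failed daemon by its reported name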
Q19: rule 0 already exists
Error message
[root@ceph141 ~]# crushtool -c yinzhengjie-hdd-ssd.file -o yinzhengjie-hdd-ssd.crushmap
rule 0 already exists
[root@ceph141 ~]#
Cause
A rule with id 0 already exists, meaning the rule id is defined more than once.
Solution
Check yinzhengjie-hdd-ssd.file for conflicting rule ids.
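For reference, in decompiled CRUSH map syntax every rule carries an explicit id that must be unique; a minimal sketch of two rules with distinct ids (rule names and steps are assumptions):
rule replicated_hdd {
    id 1                                    # must not reuse the id of the default rule 0
    type replicated
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated_ssd {
    id 2
    type replicated
    step take default class ssd
    step chooseleaf firstn 0 type host
    step emit
}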