2017-7-14 Ceph三节点(3 mon+9 osd)集群部署
建议大家可以去买《Ceph分布式存储学习指南》,快把Ceph吹上天了,不过我看Ceph确实真的很强啊。三副本存储无单点故障。不需要RAID、分布式可伸缩性的,加权选择不一样大小的磁盘,纠删码机制节省空间,写时复制快速生成Openstack数百个实例,crush算法动态选择数据存储和访问位置,不需要元数据表,自我感应组件故障并自我恢复速度极快,不存在性能瓶颈,是将块存储、对象存储、文件存储与一身的软件定义存储方案。
一、基础环境准备
3台机器,每台机器标配:1G内存、2块网卡、3块20G的SATA裸盘
"$"符号表示三个节点都进行同样配置
$ cat /etc/hosts
ceph-node1 10.20.0.101
ceph-node2 10.20.0.102
ceph-node3 10.20.0.103
$ yum install epel-release
二、安装ceph-deploy
[root@ceph-node1 ~]# ssh-keygen //在node1节点配置免SSH密钥登录到其他节点
[root@ceph-node1 ~]# ssh-copy-id ceph-node2
[root@ceph-node1 ~]# ssh-copy-id ceph-node3
[root@ceph-node1 ~]# yum install ceph-deploy -y
三个节点安装: yum install ceph
[root@ceph01 yum.repos.d]# cat ceph.repo
[ceph]
name=ceph
baseurl=http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/
gpgcheck=0
[ceph-noarch]
name=cephnoarch
baseurl=http://mirrors.aliyun.com/ceph/rpm-jewel/el7/noarch/
gpgcheck=0
[root@ceph01 yum.repos.d]#
[root@ceph-node1 ~]# ceph-deploy new ceph-node1 ceph-node2 ceph-node3 //new命令会生成一个ceph-node1集群,并且会在当前目录生成配置文件和密钥文件
+++++++++++++++++++++++++++++++++++++++
[root@ceph-node1 ~]# ceph -v
ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
[root@ceph-node1 ~]# ceph-deploy mon create-initial //创建第一个monitor
[root@ceph-node1 ~]# ceph status //此时集群处于error状态是正常的
cluster ea54af9f-f286-40b2-933d-9e98e7595f1a
health HEALTH_ERR
[root@ceph-node1 ~]# systemctl start ceph
[root@ceph-node1 ~]# systemctl enable ceph
三、创建对象存储设备OSD,并加入到ceph集群
[root@ceph-node1 ~]# ceph-deploy disk list ceph-node1 //列出ceph-node1已有的磁盘,很奇怪没有列出sdb、sdc、sdd,但是确实存在的
//下面的zap命令慎用,会销毁磁盘中已经存在的分区表和数据。ceph-node1是主机名,同样可以是ceph-node2
[root@ceph-node1 ~]# ceph-deploy disk zap ceph-node1:sdb ceph-node1:sdc ceph-node1:sdd
[root@ceph-node1 ~]# ceph-deploy osd create ceph-node1:sdb ceph-node1:sdc ceph-node1:sdd //擦除磁盘原有数据,并创建新的文件系统,默认是XFS,然后将磁盘的第一个分区作为数据分区,第二个分区作为日志分区。加入到OSD中。
[root@ceph-node1 ~]# ceph status //可以看到集群依旧没有处于健康状态。我们需要再添加一些节点到ceph集群中,以便它能够形成分布式的、冗余的对象存储,这样集群状态才为健康。
cluster ea54af9f-f286-40b2-933d-9e98e7595f1a
health HEALTH_WARN
64 pgs stuck inactive
64 pgs stuck unclean
monmap e1: 1 mons at {ceph-node1=10.20.0.101:6789/0}
election epoch 2, quorum 0 ceph-node1
osdmap e6: 3 osds: 0 up, 0 in
pgmap v7: 64 pgs, 1 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
64 creating
四、纵向扩展多节点Ceph集群,添加Monitor和OSD
注意:Ceph存储集群最少需要一个Monitor处于运行状态,要提供可用性的话,则需要奇数个monitor,比如3个或5个,以形成仲裁(quorum)。
(1)在Ceph-node2和Ceph-node3部署monitor,但是是在ceph-node1执行命令!
[root@ceph-node1 ~]# ceph-deploy mon add ceph-node2
[root@ceph-node1 ~]# ceph-deploy mon add ceph-node3
++++++++++++++++++++++++++
报错:[root@ceph-node1 ~]# ceph-deploy mon create ceph-node2
[ceph-node3][WARNIN] Executing /sbin/chkconfig ceph on
[ceph-node3][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
[ceph-node3][WARNIN] monitor: mon.ceph-node3, might not be running yet
[ceph-node3][INFO ] Running command: ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node3.asok mon_status
[ceph-node2][WARNIN] neither `public_addr` nor `public_network` keys are defined for monitors
解决:①通过在CentOS 7上chkconfig,怀疑节点1并没有远程启动节点2的ceph服务,如果我在node2手动启动的话,应该就可以了
[root@ceph-node2 ~]# systemctl status ceph
● ceph.service - LSB: Start Ceph distributed file system daemons at boot time
Loaded: loaded (/etc/rc.d/init.d/ceph)
Active: inactive (dead)
结果enable后还是失败了
②沃日,原来是书上写错了,在已经添加了监控节点后,后续添加监控节点应该是mon add,真的是醉了!
++++++++++++++++++++++++++++++++
[root@ceph-node1 ~]# ceph status
monmap e3: 3 mons at {ceph-node1=10.20.0.101:6789/0,ceph-node2=10.20.0.102:6789/0,ceph-node3=10.20.0.103:6789/0}
election epoch 8, quorum 0,1,2 ceph-node1,ceph-node2,ceph-node3
(2)添加更多的OSD节点,依然在ceph-node1执行命令即可。
[root@ceph-node1 ~]# ceph-deploy disk list ceph-node2 ceph-node3
//确保磁盘号不要出错,否则的话,容易把系统盘都给格式化了!
[root@ceph-node1 ~]# ceph-deploy disk zap ceph-node2:sdb ceph-node2:sdc ceph-node2:sdd
[root@ceph-node1 ~]# ceph-deploy disk zap ceph-node3:sdb ceph-node3:sdc ceph-node3:sdd
//经过实践,下面的这条命令,osd create最好分两步,prepare和activate,终于为什么不清楚。
[root@ceph-node1 ~]# ceph-deploy osd create ceph-node2:sdb ceph-node2:sdc ceph-node2:sdd
[root@ceph-node1 ~]# ceph-deploy osd create ceph-node3:sdb ceph-node3:sdc ceph-node3:sdd
++++++++++++++++++++++++++++++++++++++++++++
报错:
[ceph-node3][WARNIN] ceph-disk: Error: Command '['/usr/sbin/sgdisk', '--new=2:0:5120M', '--change-name=2:ceph journal', '--partition-guid=2:fa28bc46-55de-464a-8151-9c2b51f9c00d', '--typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106', '--mbrtogpt', '--', '/dev/sdd']' returned non-zero exit status 4
[ceph-node3][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy.osd][ERROR ] Failed to execute command: ceph-disk -v prepare --fs-type xfs --cluster ceph -- /dev/sdd
[ceph_deploy][ERROR ] GenericError: Failed to create 3 OSDs
未解决:原来敲入osd create命令不小心把node2写成node3了,哎我尼玛,后面越来越难办了
+++++++++++++++++++++++++++++
[root@ceph-node1 ~]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.08995 root default
-2 0.02998 host ceph-node1
0 0.00999 osd.0 up 1.00000 1.00000
1 0.00999 osd.1 up 1.00000 1.00000
2 0.00999 osd.2 up 1.00000 1.00000
-3 0.02998 host ceph-node2
3 0.00999 osd.3 down 0 1.00000
4 0.00999 osd.4 down 0 1.00000
8 0.00999 osd.8 down 0 1.00000
-4 0.02998 host ceph-node3
5 0.00999 osd.5 down 0 1.00000
6 0.00999 osd.6 down 0 1.00000
7 0.00999 osd.7 down 0 1.00000
有6个OSD都处于down状态,ceph-deploy osd activate 依然失败,根据之前的报告,osd create的时候就是失败的。
未解决:由于刚部署ceph集群,还没有数据,可以把OSD给清空。
参考文档:http://www.cnblogs.com/zhangzhengyan/p/5839897.html
[root@ceph-node1 ~]# ceph-deploy disk zap ceph-node2:sdb ceph-node2:sdc ceph-node2:sdd
[root@ceph-node1 ~]# ceph-deploy disk zap ceph-node3:sdb ceph-node3:sdc ceph-node3:sdd
(1)从ceph osd tree移走crush map的osd.4,还有其他osd号
[root@ceph-node1 ~]# ceph osd crush remove osd.3
[root@ceph-node1 ~]# ceph osd crush remove osd.4
[root@ceph-node1 ~]# ceph osd crush remove osd.8
[root@ceph-node1 ~]# ceph osd crush remove osd.5
[root@ceph-node1 ~]# ceph osd crush remove osd.6
[root@ceph-node1 ~]# ceph osd crush remove osd.7
(2)[root@ceph-node1 ~]# ceph osd rm 3
[root@ceph-node1 ~]# ceph osd rm 4
[root@ceph-node1 ~]# ceph osd rm 5
[root@ceph-node1 ~]# ceph osd rm 6
[root@ceph-node1 ~]# ceph osd rm 7
[root@ceph-node1 ~]# ceph osd rm 8
[root@ceph-node1 ~]# ceph osd tree //终于清理干净了
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.02998 root default
-2 0.02998 host ceph-node1
0 0.00999 osd.0 up 1.00000 1.00000
1 0.00999 osd.1 up 1.00000 1.00000
2 0.00999 osd.2 up 1.00000 1.00000
-3 0 host ceph-node2
-4 0 host ceph-node3
可以登录到node2和node3,sdb/sdc/sdd都被干掉了,除了还剩GPT格式,现在重新ceph-deploy osd create
我尼玛还是报错呀,[ceph_deploy.osd][ERROR ] Failed to execute command: ceph-disk -v prepare --fs-type xfs --cluster ceph -- /dev/sdd
++++++++++++++++++++++++++++++++++
报错:(1)node1远程激活node2的osd出错。prepare和activate能够取代osd create的步骤
[root@ceph-node1 ~]# ceph-deploy osd prepare ceph-node2:sdb ceph-node2:sdc ceph-node2:sdd
[root@ceph-node1 ~]# ceph-deploy osd activate ceph-node2:sdb ceph-node2:sdc ceph-node2:sdd
[ceph-node2][WARNIN] ceph-disk: Cannot discover filesystem type: device /dev/sdb: Line is truncated:
[ceph-node2][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: ceph-disk -v activate --mark-init sysvinit --mount /dev/sdb
解决:格式分区权限问题,在报错的节点执行ceph-disk activate-all即可。
(2)明明是node2的sdb盘,但是ceph osd tree却发现执行的是node3的sdb盘,当然会报错了
Starting Ceph osd.4 on ceph-node2...
Running as unit ceph-osd.4.1500013086.674713414.service.
Error EINVAL: entity osd.3 exists but key does not match
[root@ceph-node1 ~]# ceph osd tree
3 0 osd.3 down 0 1.00000
解决:[root@ceph-node1 ~]# ceph auth del osd.3
[root@ceph-node1 ~]# ceph osd rm 3
在node2上lsblk发现sdb不正常,没有挂载osd,那么于是
[root@ceph-node1 ~]# ceph-deploy disk zap ceph-node2:sdb
[root@ceph-node1 ~]# ceph-deploy osd prepare ceph-node2:sdb
[root@ceph-node1 ~]# ceph osd tree //至少osd跑到node2上面,而不是node3,还好还好。
-3 0.02998 host ceph-node2
3 0.00999 osd.3 down 0 1.00000
[root@ceph-node1 ~]# ceph-deploy osd activate ceph-node2:sdb //肯定失败,按照上面的经验,必须在Node2上单独激活
[root@ceph-node2 ~]# ceph-disk activate-all
[root@ceph-node1 ~]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.08995 root default
-2 0.02998 host ceph-node1
0 0.00999 osd.0 up 1.00000 1.00000
1 0.00999 osd.1 up 1.00000 1.00000
2 0.00999 osd.2 up 1.00000 1.00000
-3 0.02998 host ceph-node2
4 0.00999 osd.4 up 1.00000 1.00000
5 0.00999 osd.5 up 1.00000 1.00000
3 0.00999 osd.3 up 1.00000 1.00000
-4 0.02998 host ceph-node3
6 0.00999 osd.6 up 1.00000 1.00000
7 0.00999 osd.7 up 1.00000 1.00000
8 0.00999 osd.8 up 1.00000 1.00000
哎,卧槽,终于解决了,之前只是一个小小的盘符写错了,就害得我搞这么久啊,细心点!
=========================================
拓展:
[root@ceph-node1 ~]# lsblk // OSD up的分区都挂载到/var/lib/ceph/osd目录下
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 40G 0 disk
├─sda1 8:1 0 500M 0 part /boot
└─sda2 8:2 0 39.5G 0 part
├─centos-root 253:0 0 38.5G 0 lvm /
└─centos-swap 253:1 0 1G 0 lvm [SWAP]
sdb 8:16 0 20G 0 disk
├─sdb1 8:17 0 15G 0 part /var/lib/ceph/osd/ceph-0
└─sdb2 8:18 0 5G 0 part
sdc 8:32 0 20G 0 disk
├─sdc1 8:33 0 15G 0 part /var/lib/ceph/osd/ceph-1
└─sdc2 8:34 0 5G 0 part
sdd 8:48 0 20G 0 disk
├─sdd1 8:49 0 15G 0 part /var/lib/ceph/osd/ceph-2
└─sdd2 8:50 0 5G 0 part
sr0 11:0 1 1024M 0 rom
解决完问题后,心情真爽啊。。。