Learning GlusterFS, an open-source cloud storage system: installation, volume creation, and client mounting

On duty over the National Day holiday, so I might as well write up the Gluster work from last week.
----------------------------------------- preface -----------------------------------------

What is Gluster?
Gluster is a scalable distributed file system that aggregates disk storage resources from multiple servers into a single global namespace.

Advantages
Scales to several petabytes
Handles thousands of clients
POSIX compatible
Uses commodity hardware
Can use any on-disk filesystem that supports extended attributes
Accessible using industry standard protocols like NFS and SMB
Provides replication, quotas, geo-replication, snapshots and bitrot detection
Allows optimization for different workloads

Open Source

Quick installation steps:

Step 1 – Have at least three nodes
Fedora 26 (or later) on 3 nodes named "server1", "server2" and "server3"
A working network connection
At least two virtual disks, one for the OS installation, and one to be used to serve GlusterFS storage (sdb), on each of these VMs. This will emulate a real-world deployment, where you would want to separate GlusterFS storage from the OS install.
Set up NTP on each of these servers so that the applications running on top of the filesystem behave correctly (an example follows the note below).
Note: GlusterFS stores its dynamically generated configuration files at /var/lib/glusterd. If at any point in time GlusterFS is unable to write to these files (for example, when the backing filesystem is full), it will at minimum cause erratic behavior for your system; or worse, take your system offline completely. It is recommended to create separate partitions for directories such as /var/log to reduce the chances of this happening.
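A minimal sketch of the NTP setup, assuming a systemd-based distribution where chrony is available (substitute ntpd if that is what your environment uses):

yum install -y chrony
systemctl enable --now chronyd
chronyc tracking      # confirm the clock is actually being synchronized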

Step 2 - Format and mount the bricks
Perform this step on all the nodes, "server{1,2,3}"

Note: We are going to use the XFS filesystem for the backend bricks. But Gluster is designed to work on top of any filesystem, which supports extended attributes.

The following examples assume that the brick will be residing on /dev/sdb1.

mkfs.xfs -i size=512 /dev/sdb1
mkdir -p /data/brick1
echo '/dev/sdb1 /data/brick1 xfs defaults 1 2' >> /etc/fstab
mount -a && mount
You should now see sdb1 mounted at /data/brick1
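A quick, optional check that the brick filesystem is in place (standard df/xfsprogs tooling):

df -hT /data/brick1      # should list /dev/sdb1 with type xfs mounted on /data/brick1
xfs_info /data/brick1    # should show the isize=512 setting used above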

Step 3 - Installing GlusterFS
Install the software

yum install glusterfs-server
Start the GlusterFS management daemon:

service glusterd start
service glusterd status

glusterd.service - LSB: glusterfs server
   Loaded: loaded (/etc/rc.d/init.d/glusterd)
   Active: active (running) since Mon, 13 Aug 2012 13:02:11 -0700; 2s ago
  Process: 19254 ExecStart=/etc/rc.d/init.d/glusterd start (code=exited, status=0/SUCCESS)
   CGroup: name=systemd:/system/glusterd.service
       ├ 19260 /usr/sbin/glusterd -p /run/glusterd.pid
       ├ 19304 /usr/sbin/glusterfsd --xlator-option georep-server.listen-port=24009 -s localhost...
       └ 19309 /usr/sbin/glusterfs -f /var/lib/glusterd/nfs/nfs-server.vol -p /var/lib/glusterd/...
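On newer systemd-based releases the same daemon is usually managed with systemctl rather than the SysV scripts shown above (a sketch; adjust to your distribution):

systemctl enable --now glusterd
systemctl status glusterd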
Step 4 - Configure the firewall
The gluster processes on the nodes need to be able to communicate with each other. To simplify this setup, configure the firewall on each node to accept all traffic from the other nodes.

iptables -I INPUT -p all -s <ip-address> -j ACCEPT
where <ip-address> is the address of the other node (add one rule per peer).
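On distributions that ship firewalld instead of plain iptables, a roughly equivalent rule would look like the sketch below; 192.168.56.12 is only a placeholder for the peer's address, and the zone may differ in your environment:

firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="192.168.56.12" accept'
firewall-cmd --reload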

Step 5 - Configure the trusted pool
From "server1"

gluster peer probe server2
gluster peer probe server3
Note: When using hostnames, the first server needs to be probed from one other server to set its hostname.

From "server2"

gluster peer probe server1
Note: Once this pool has been established, only trusted members may probe new servers into the pool. A new server cannot probe the pool, it must be probed from the pool.

Check the peer status on server1

gluster peer status
You should see something like this (the UUID will differ)

Number of Peers: 2

Hostname: server2
Uuid: f0e7b138-4874-4bc0-ab91-54f20c7068b4
State: Peer in Cluster (Connected)

Hostname: server3
Uuid: f0e7b138-4532-4bc0-ab91-54f20c701241
State: Peer in Cluster (Connected)
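The same membership information is also available in a more compact form:

gluster pool list     # lists UUID, hostname and state for every member, including localhost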
Step 6 - Set up a GlusterFS volume
On all servers:

mkdir -p /data/brick1/gv0
From any single server:

gluster volume create gv0 replica 3 server1:/data/brick1/gv0 server2:/data/brick1/gv0 server3:/data/brick1/gv0
gluster volume start gv0
Confirm that the volume shows "Started":

gluster volume info
You should see something like this (the Volume ID will differ):

Volume Name: gv0
Type: Replicate
Volume ID: f25cc3d8-631f-41bd-96e1-3e22a4c6f71f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: server1:/data/brick1/gv0
Brick2: server2:/data/brick1/gv0
Brick3: server3:/data/brick1/gv0
Options Reconfigured:
transport.address-family: inet
Note: If the volume does not show "Started", check the log file /var/log/glusterfs/glusterd.log to debug and diagnose the problem. These logs can be inspected on one or all of the configured servers.
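A sketch of where to start looking when the volume refuses to start (standard gluster CLI and the default log location):

gluster volume status gv0                     # per-brick process and port status
tail -n 50 /var/log/glusterfs/glusterd.log    # recent management-daemon messages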

Step 7 - Testing the GlusterFS volume
For this step, we will use one of the servers to mount the volume. Typically, you would do this from an external machine, known as a "client". Since that would require additional packages to be installed on the client machine, we will use one of the servers to test first, as if it were that "client".

mount -t glusterfs server1:/gv0 /mnt
for i in `seq -w 1 100`; do cp -rp /var/log/messages /mnt/copy-test-$i; done
First, check the client mount point:

ls -lA /mnt/copy* | wc -l
You should see 100 files returned. Next, check the GlusterFS brick mount points on each server:

ls -lA /data/brick1/gv0/copy*
You should see 100 files on each server using the method we listed here. Without replication, in a distribute only volume (not detailed here), you should see about 33 files on each one.
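To convince yourself that the replicas really are identical, you can also compare a checksum on each server (run on server{1,2,3}; the file name follows the copy-test naming used above):

ls /data/brick1/gv0/copy* | wc -l        # expect 100 on every replica
md5sum /data/brick1/gv0/copy-test-001    # the checksum should be the same on all three servers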

My own attempt to build from source did not succeed; on Linux I prefer installing the RPM packages with yum. The approach is as follows:

Pay attention to which yum repository you use:
https://buildlogs.centos.org/centos/7/storage/x86_64/gluster-6/

cat /etc/yum.repos.d/gluster.repo
[gluster]
name=gluster
baseurl=https://buildlogs.centos.org/centos/7/storage/x86_64/gluster-6/
gpgcheck=0
enabled=1
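With the repo file in place, a quick sanity check before installing (package names follow the CentOS Storage SIG convention; adjust if your repo differs):

yum clean all
yum repolist | grep -i gluster
yum install -y glusterfs-server glusterfs-fuse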

The URL may change as new releases come out; check https://download.gluster.org/pub/gluster/glusterfs/ for the current location.

Architecture overview:

The main GlusterFS volume types

Distributed: the distributed volume. Files are spread across the bricks that make up the volume according to a hash algorithm.

Distributed GlusterFS volume: this is the default volume type; if no type is given when the volume is created, a distributed volume is created. Files are distributed across the bricks in the volume, so a given file is stored on brick1 or brick2 but never on both, and there is no data redundancy. The point of this volume type is to scale volume size easily and cheaply, but it also means a brick failure causes complete loss of the data on that brick, and you must rely on the underlying hardware for protection against data loss.


gluster volume create test-volume server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4
Creation of test-volume has been successful
Please start the volume to access data

# gluster volume info

Volume Name: test-volume
Type: Distribute
Status: Created
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: server1:/exp1
Brick2: server2:/exp2
Brick3: server3:/exp3
Brick4: server4:/exp4

Replicated: the replicated volume, similar to RAID 1. The replica count must equal the number of storage servers (bricks) in the volume; availability is high.

Replicated GlusterFS volume: this type overcomes the data-loss problem of the distributed volume. An exact copy of the data is kept on every brick. The number of replicas is chosen when the volume is created, so you need at least two bricks for a 2-replica volume and at least three bricks for a 3-replica volume. A major advantage is that even if one brick fails, the data can still be read from its replicas. Such volumes are used for better reliability and data redundancy.


# gluster volume create test-volume replica 2 transport tcp server1:/exp1 server2:/exp2
Creation of test-volume has been successful
Please start the volume to access data

Striped: the striped volume, similar to RAID 0. The stripe count must equal the number of storage servers (bricks) in the volume; files are split into chunks that are stored across the bricks round-robin, the unit of concurrency is the chunk, and large-file performance is good. In practice this type is rarely used.

Distributed Striped: the distributed striped volume. The number of storage servers (bricks) in the volume must be a multiple of the stripe count (at least 2x); it combines the distributed and striped behaviour.
Distributed Replicated GlusterFS Volume
The distributed replicated volume is a combination of the distributed and replicated volumes and has the features of both:
Some bricks form one replica set and other bricks form further replica sets; within a replica set a single file is kept as identical copies, while different files are hash-distributed across the replica sets, i.e. the distributed layer spans the replicated sets.
The number of brick servers must be a multiple of the replica count and at least 2x, so at least 4 brick servers are needed, and the bricks that make up one replica set should have equal capacity.
"At least 4 brick servers" assumes one brick per server; of course a brick server may host more than one brick, in which case that minimum no longer applies.

In this volume type, files are distributed across replicated sets of bricks. The number of bricks must be a multiple of the replica count, and the order in which the bricks are specified matters, because adjacent bricks become replicas of each other. This type is used when high data availability is needed together with scalable storage. So with eight bricks and a replica count of 2, the first two bricks become replicas of each other, then the next two, and so on; such a volume is denoted 4x2. With eight bricks and a replica count of 4, four bricks become replicas of each other and the volume is denoted 2x4.
By planning the brick order carefully you can spread the failure domains apart.
Creating a distributed replicated volume:

# gluster volume create NEW-VOLNAME [replica COUNT] [transport [tcp | rdma | tcp,rdma]] NEW-BRICK ...

For example, a four-node distributed (replicated) volume with two-way mirroring:

# gluster volume create test-volume replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4

Creation of test-volume has been successful
Please start the volume to access data

As you can see, it is simply a replicated volume extended by adding bricks in multiples of the replica count.
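For example, the replica-2 volume above could be grown by adding one more replica set; server5/server6 below are hypothetical hosts with bricks already prepared:

gluster volume add-brick test-volume server5:/exp5 server6:/exp6
gluster volume rebalance test-volume start     # spread existing files onto the new replica set
gluster volume rebalance test-volume status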

On striped volumes:
Personally, I think both the striped and the distributed striped volume are determined by the brick count, the stripe count and the order of the bricks:

  1. Brick count = stripe count --> striped volume

gluster volume create stripe-volume stripe 3 transport tcp glusterfs01:/brick1/str_volume glusterfs02:/brick2/str_volume glusterfs03:/brick3/str_volume

This creates a plain striped volume, essentially a RAID 0: good performance, but no redundancy, and a single brick failure ruins the files, so in practice it is rarely used.

  2. Brick count = a multiple of the stripe count --> distributed striped volume

# gluster volume create distributed-stripe-volume stripe 2 transport tcp glusterfs01:/brick1/dis_str_volume glusterfs02:/brick2/dis_str_volume glusterfs03:/brick3/dis_str_volume glusterfs04:/brick4/dis_str_volume

There are two further stripe-based volume types:
The striped mirrored volume (deprecated), also called the striped replicated volume.
The striped replicated volume (STRIPE REPLICA volume) is a combination of the striped and replicated volumes and has the features of both: some bricks form one replica set and other bricks form further replica sets; a single file is stored as stripes across two or more replicated sets, and within a replica set each stripe is kept as identical copies, roughly a file-level RAID 01.
The number of brick servers must be a multiple of the replica count, and at least 2x.
The create command should be something like:
# gluster volume create distributed-stripe-volume stripe 2 replica 2 transport tcp \
    glusterfs01:/brick1/dis_str_volume glusterfs02:/brick2/dis_str_volume \
    glusterfs03:/brick3/dis_str_volume glusterfs04:/brick4/dis_str_volume
Distributed striped replicated volume
The distributed striped replicated volume (DISTRIBUTE STRIPE REPLICA volume) combines the distributed, striped and replicated volumes and has the features of all three:
Files are hash-distributed across multiple stripe sets; within a stripe set a single file is stored as stripes across two or more replicated sets, and within each replica set the stripes are kept as identical copies.
The number of brick servers must be a multiple of the replica count, and at least 2x.

The create command should be something like:

gluster volume create distributed-stripe-volume stripe 2 replica 2 transport tcp \
    glusterfs01:/brick1/dis_str_volume glusterfs02:/brick2/dis_str_volume \
    glusterfs03:/brick3/dis_str_volume glusterfs04:/brick4/dis_str_volume \
    glusterfs05:/brick5/dis_str_volume glusterfs06:/brick6/dis_str_volume \
    glusterfs07:/brick7/dis_str_volume glusterfs08:/brick8/dis_str_volume

AFR recovery principles
Data recovery only applies to replicated volumes. AFR self-heal covers three aspects: ENTRY, METADATA and DATA.
The record that describes the state of each replica is called the ChangeLog. It is stored in the extended attributes of every replica file; after being read into memory it is treated as a matrix used to decide whether a heal is needed and which replica should be the source. The initial and normal value is 0 (note: ENTRY, METADATA and DATA each have their own counter).
Taking DATA repair with a redundancy of 2 (two replicas, A and B) as an example, a write breaks down into:
the write operation is issued;
lock;
the ChangeLog of both A and B is incremented by 1 and recorded in each replica's extended attributes;
the write is performed on A and B;
for each replica written successfully the ChangeLog is decremented by 1; for a replica where the write failed the ChangeLog is left unchanged; the result is recorded in each replica's extended attributes;
unlock;
return to the caller; success is returned as long as at least one replica was written successfully.
The above is one complete transaction in AFR. Based on the ChangeLog values recorded on the two replicas, a replica can be in one of the following states:
WISE: the replica's own ChangeLog entry is 0 while the other replica's entry is greater than 0;
INNOCENT: the ChangeLog entries of both replicas are 0;
FOOL: the replica's own ChangeLog entry is greater than 0 while the other replica's entry is 0;
IGNORANT: the replica's ChangeLog is missing.
Recovery falls into the following scenarios:
One node's changelog state is WISE and the others are FOOL or some other non-WISE state: the WISE node is used to heal the others.
All nodes are IGNORANT: trigger a heal manually; the command takes the file with the smallest UID as the source and heals the other zero-sized files.
Multiple nodes are WISE: this is a split-brain. Split-brain files usually cannot be read and return "Input/Output error"; see the log /var/log/glusterfs/glustershd.log.
Split-brain causes and resolutions: https://docs.gluster.org/en/latest/Administrator-Guide/Split-brain-and-ways-to-deal-with-it/
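In practice, the heal and split-brain state of a replicated volume can be checked with the standard heal commands (gv0 is used here as an example volume name):

gluster volume heal gv0 info                 # files pending heal, listed per brick
gluster volume heal gv0 info split-brain     # files that are actually in split-brain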

View a replica file's extended attributes with: getfattr -m . -d -e hex [filename]
The "trusted.afr.xxx" entries are the changelog extended attributes; each value is 12 bytes (shown as 24 hex digits), split into three parts that record the DATA, METADATA and ENTRY changelogs in turn.

[root@glusterfs01 ~]# getfattr -m . -d -e hex /brick1/repl_volume/replica1.txt

Client mounting
A mount from production as an example:
node3.hfvast.com:/HFCloud on /var/lib/one/datastores type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
Manual mount:
mount.glusterfs node3.hfvast.com:/HFCloud /var/lib/one/datastores
Automatic mount via /etc/fstab:
192.168.56.11:/gv1 /mnt/glusterfs glusterfs defaults,_netdev 0 0
  The Gluster Native Client gives GNU/Linux clients high concurrency, good performance and transparent failover. Gluster volumes can also be accessed over NFS v3; the NFS implementations in GNU/Linux clients and other operating systems such as FreeBSD, Mac OS X, Windows 7 (Professional and up) and Windows Server 2003 have been tested extensively, and other NFS client implementations can also be used with the gluster NFS server. With Microsoft Windows or Samba clients, volumes can be accessed over CIFS; for this access method the Samba packages must be installed on the client.
  To summarise: GlusterFS supports three client types, the Gluster Native Client, NFS and CIFS. The Gluster Native Client is a FUSE-based client that runs in user space; it is the officially recommended client and exposes the full GlusterFS feature set.
1. Mounting with the Gluster Native Client
The Gluster Native Client is FUSE-based, so make sure FUSE is available on the client. This is the officially recommended client and supports high concurrency and efficient writes.
Before installing the Gluster Native Client, verify that the FUSE module is loaded on the client and that the required modules are accessible:

[root@localhost ~]# modprobe fuse            # load the FUSE loadable kernel module (LKM) into the Linux kernel
[root@localhost ~]# dmesg | grep -i fuse     # verify that the FUSE module is loaded
[ 569.630373] fuse init (API version 7.22)
Install the Gluster Native Client:

[root@localhost ~]# yum -y install glusterfs-client                      # install the glusterfs client packages
[root@localhost ~]# mkdir /mnt/glusterfs                                 # create the mount point
[root@localhost ~]# mount.glusterfs 192.168.56.11:/gv1 /mnt/glusterfs/   # mount /gv1
[root@localhost ~]# df -h
Filesystem          Size  Used  Avail Use%  Mounted on
/dev/sda2            20G  1.4G    19G   7%  /
devtmpfs            231M     0   231M   0%  /dev
tmpfs               241M     0   241M   0%  /dev/shm
tmpfs               241M  4.6M   236M   2%  /run
tmpfs               241M     0   241M   0%  /sys/fs/cgroup
/dev/sda1           197M   97M   100M  50%  /boot
tmpfs                49M     0    49M   0%  /run/user/0
192.168.56.11:/gv1  4.0G  312M   3.7G   8%  /mnt/glusterfs
[root@localhost ~]# ll /mnt/glusterfs/                                   # list the contents of the mount point
total 100000
-rw-r--r-- 1 root root 102400000 Aug 7 04:30 100M.file
[root@localhost ~]# mount                                                # check the mount information
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
…
192.168.56.11:/gv1 on /mnt/glusterfs type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

=================================================

Options for manual mounting:
When using the mount -t glusterfs command, the following options can be specified. Note that the options must be separated by commas.
backupvolfile-server=server-name        # if the first volfile server fails, the server given here is used as the volfile server to mount the client
volfile-max-fetch-attempts=number       # number of attempts to fetch the volume file while mounting
log-level=loglevel                      # log level
log-file=logfile                        # log file
transport=transport-type                # transport protocol
direct-io-mode=[enable|disable]
use-readdirp=[yes|no]                   # when set to yes, forces the use of readdirp mode in the FUSE kernel module

For example:

# mount -t glusterfs -o backupvolfile-server=volfile_server2,use-readdirp=no,volfile-max-fetch-attempts=2,log-level=WARNING,log-file=/var/log/gluster.log server1:/test-volume /mnt/glusterfs
Automatic mounting:
Besides mounting manually, volumes can be mounted automatically via /etc/fstab.
Syntax: HOSTNAME-OR-IPADDRESS:/VOLNAME MOUNTDIR glusterfs defaults,_netdev 0 0
For example:
192.168.56.11:/gv1 /mnt/glusterfs glusterfs defaults,_netdev 0 0
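After adding the entry, a quick way to verify it without rebooting:

mount -a                   # mounts everything listed in /etc/fstab that is not mounted yet
df -hT /mnt/glusterfs      # should report type fuse.glusterfs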

I will tidy up the administration and maintenance part another day. :)
