Distributed Storage with Ceph (8): Ceph Cluster Application Basics

Managing individual daemons through the admin socket

Distributing the admin keyring to the nodes to be managed

root@ceph-deploy:/var/lib/ceph/ceph-cluster# scp ceph.client.admin.keyring root@ceph-mon-01:/etc/ceph
root@ceph-deploy:/var/lib/ceph/ceph-cluster# scp ceph.client.admin.keyring root@ceph-node-01:/etc/ceph

Managing a node (OSD host) through the admin socket

Verifying admin access on the node

root@ceph-node-01:~# ceph -s
  cluster:
    id:     6e521054-1532-4bc8-9971-7f8ae93e8430
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum ceph-mon-01,ceph-mon-02,ceph-mon-03 (age 56m)
    mgr: ceph-mgr-01(active, since 7d), standbys: ceph-mgr-02
    mds: 2/2 daemons up, 2 standby
    osd: 9 osds: 9 up (since 8d), 9 in (since 8d)
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 161 pgs
    objects: 64 objects, 24 MiB
    usage:   1.4 GiB used, 179 GiB / 180 GiB avail
    pgs:     161 active+clean

Listing the admin socket files on the node

root@ceph-node-01:~# ls -l /var/run/ceph/
total 0
srwxr-xr-x 1 ceph ceph 0 Sep 22 21:41 ceph-osd.0.asok
srwxr-xr-x 1 ceph ceph 0 Sep 22 21:41 ceph-osd.1.asok
srwxr-xr-x 1 ceph ceph 0 Sep 22 21:41 ceph-osd.2.asok

Admin socket help on the node

root@ceph-node-01:~# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok help
{
    "bench": "OSD benchmark: write <count> <size>-byte objects(with <obj_size> <obj_num>), (default count=1G default size=4MB). Results in log.",
    "bluefs debug_inject_read_zeros": "Injects 8K zeros into next BlueFS read. Debug only.",
    "bluefs files list": "print files in bluefs",
    "bluefs stats": "Dump internal statistics for bluefs.",
    "bluestore allocator dump block": "dump allocator free regions",
    "bluestore allocator fragmentation block": "give allocator fragmentation (0-no fragmentation, 1-absolute fragmentation)",
    "bluestore allocator score block": "give score on allocator fragmentation (0-no fragmentation, 1-absolute fragmentation)",
    "bluestore bluefs device info": "Shows space report for bluefs devices. This also includes an estimation for space available to bluefs at main device. alloc_size, if set, specifies the custom bluefs allocation unit size for the estimation above.",
    "cache drop": "Drop all OSD caches",
    "cache status": "Get OSD caches statistics",
    "calc_objectstore_db_histogram": "Generate key value histogram of kvdb(rocksdb) which used by bluestore",
    "cluster_log": "log a message to the cluster log",
    "compact": "Commpact object store's omap. WARNING: Compaction probably slows your requests",
    "config diff": "dump diff of current config and default config",
    "config diff get": "dump diff get <field>: dump diff of current and default config setting <field>",
    "config get": "config get <field>: get the config value",
    "config help": "get config setting schema and descriptions",
    "config set": "config set <field> <val> [<val> ...]: set a config variable",
    "config show": "dump current config settings",
    "config unset": "config unset <field>: unset a config variable",
    "cpu_profiler": "run cpu profiling on daemon",
    "debug dump_missing": "dump missing objects to a named file",
    "debug kick_recovery_wq": "set osd_recovery_delay_start to <val>",
    "deep_scrub": "Trigger a scheduled deep scrub ",
    "dump_blocked_ops": "show the blocked ops currently in flight",
    "dump_blocklist": "dump blocklisted clients and times",
    "dump_historic_ops": "show recent ops",
    "dump_historic_ops_by_duration": "show slowest recent ops, sorted by duration",
    "dump_historic_slow_ops": "show slowest recent ops",
    "dump_mempools": "get mempool stats",
    "dump_objectstore_kv_stats": "print statistics of kvdb which used by bluestore",
    "dump_op_pq_state": "dump op priority queue state",
    "dump_ops_in_flight": "show the ops currently in flight",
    "dump_osd_network": "Dump osd heartbeat network ping times",
    "dump_pg_recovery_stats": "dump pg recovery statistics",
    "dump_pgstate_history": "show recent state history",
    "dump_recovery_reservations": "show recovery reservations",
    "dump_scrub_reservations": "show scrub reservations",
    "dump_scrubs": "print scheduled scrubs",
    "dump_watchers": "show clients which have active watches, and on which objects",
    "flush_journal": "flush the journal to permanent store",
    "flush_pg_stats": "flush pg stats",
    "flush_store_cache": "Flush bluestore internal cache",
    "get_command_descriptions": "list available commands",
    "get_heap_property": "get malloc extension heap property",
    "get_latest_osdmap": "force osd to update the latest map from the mon",
    "get_mapped_pools": "dump pools whose PG(s) are mapped to this OSD.",
    "getomap": "output entire object map",
    "git_version": "get git sha1",
    "heap": "show heap usage info (available only if compiled with tcmalloc)",
    "help": "list available commands",
    "injectargs": "inject configuration arguments into running daemon",
    "injectdataerr": "inject data error to an object",
    "injectfull": "Inject a full disk (optional count times)",
    "injectmdataerr": "inject metadata error to an object",
    "list_devices": "list OSD devices.",
    "list_unfound": "list unfound objects on this pg, perhaps starting at an offset given in JSON",
    "log dump": "dump recent log entries to log file",
    "log flush": "flush log entries to log file",
    "log reopen": "reopen log file",
    "mark_unfound_lost": "mark all unfound objects in this pg as lost, either removing or reverting to a prior version if one is available",
    "objecter_requests": "show in-progress osd requests",
    "ops": "show the ops currently in flight",
    "perf dump": "dump perfcounters value",
    "perf histogram dump": "dump perf histogram values",
    "perf histogram schema": "dump perf histogram schema",
    "perf reset": "perf reset <name>: perf reset all or one perfcounter name",
    "perf schema": "dump perfcounters schema",
    "query": "show details of a specific pg",
    "reset_pg_recovery_stats": "reset pg recovery statistics",
    "rmomapkey": "remove omap key",
    "scrub": "Trigger a scheduled scrub ",
    "scrub_purged_snaps": "Scrub purged_snaps vs snapmapper index",
    "send_beacon": "send OSD beacon to mon immediately",
    "set_heap_property": "update malloc extension heap property",
    "set_recovery_delay": "Delay osd recovery by specified seconds",
    "setomapheader": "set omap header",
    "setomapval": "set omap key",
    "smart": "probe OSD devices for SMART data.",
    "status": "high-level status of OSD",
    "truncobj": "truncate object to length",
    "version": "get ceph version"
}

Checking daemon status through the admin socket (the example below queries the mgr socket on ceph-mgr-01)

root@ceph-mgr-01:~# ceph --admin-daemon /var/run/ceph/ceph-mgr.ceph-mgr-01.asok status
{
    "metadata": {},
    "dentry_count": 0,
    "dentry_pinned_count": 0,
    "id": 0,
    "inst": {
        "name": {
            "type": "mgr",
            "num": 1014372
        },
        "addr": {
            "type": "v1",
            "addr": "172.16.10.225:0",
            "nonce": 19551
        }
    },
    "addr": {
        "type": "v1",
        "addr": "172.16.10.225:0",
        "nonce": 19551
    },
    "inst_str": "mgr.1014372 172.16.10.225:0/19551",
    "addr_str": "172.16.10.225:0/19551",
    "inode_count": 0,
    "mds_epoch": 0,
    "osd_epoch": 626,
    "osd_epoch_barrier": 0,
    "blocklisted": false,
    "fs_name": "cephfs"
}

Managing a mon node through the admin socket

Verifying admin access on the mon node

root@ceph-mon-01:~# ceph crash archive-all
root@ceph-mon-01:~# ceph -s
  cluster:
    id:     6e521054-1532-4bc8-9971-7f8ae93e8430
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum ceph-mon-01,ceph-mon-02,ceph-mon-03 (age 99m)
    mgr: ceph-mgr-01(active, since 7d), standbys: ceph-mgr-02
    mds: 2/2 daemons up, 2 standby
    osd: 9 osds: 9 up (since 9d), 9 in (since 9d)
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 161 pgs
    objects: 64 objects, 24 MiB
    usage:   1.4 GiB used, 179 GiB / 180 GiB avail
    pgs:     161 active+clean

Listing the admin socket file on the mon node

root@ceph-mon-01:~# ls -l /var/run/ceph/
total 0
srwxr-xr-x 1 ceph ceph 0 Oct  1 20:36 ceph-mon.ceph-mon-01.asok

Admin socket help on the mon node

root@ceph-mon-01:~# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon-01.asok help
{
    "add_bootstrap_peer_hint": "add peer address as potential bootstrap peer for cluster bringup",
    "add_bootstrap_peer_hintv": "add peer address vector as potential bootstrap peer for cluster bringup",
    "compact": "cause compaction of monitor's leveldb/rocksdb storage",
    "config diff": "dump diff of current config and default config",
    "config diff get": "dump diff get <field>: dump diff of current and default config setting <field>",
    "config get": "config get <field>: get the config value",
    "config help": "get config setting schema and descriptions",
    "config set": "config set <field> <val> [<val> ...]: set a config variable",
    "config show": "dump current config settings",
    "config unset": "config unset <field>: unset a config variable",
    "connection scores dump": "show the scores used in connectivity-based elections",
    "connection scores reset": "reset the scores used in connectivity-based elections",
    "dump_historic_ops": "dump_historic_ops",
    "dump_mempools": "get mempool stats",
    "get_command_descriptions": "list available commands",
    "git_version": "get git sha1",
    "heap": "show heap usage info (available only if compiled with tcmalloc)",
    "help": "list available commands",
    "injectargs": "inject configuration arguments into running daemon",
    "log dump": "dump recent log entries to log file",
    "log flush": "flush log entries to log file",
    "log reopen": "reopen log file",
    "mon_status": "report status of monitors",
    "ops": "show the ops currently in flight",
    "perf dump": "dump perfcounters value",
    "perf histogram dump": "dump perf histogram values",
    "perf histogram schema": "dump perf histogram schema",
    "perf reset": "perf reset <name>: perf reset all or one perfcounter name",
    "perf schema": "dump perfcounters schema",
    "quorum enter": "force monitor back into quorum",
    "quorum exit": "force monitor out of the quorum",
    "sessions": "list existing sessions",
    "smart": "Query health metrics for underlying device",
    "sync_force": "force sync of and clear monitor store",
    "version": "get ceph version"
}

Checking mon status through the admin socket

root@ceph-mon-01:~# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon-01.asok mon_status
{
    "name": "ceph-mon-01",
    "rank": 0,
    "state": "leader",
    "election_epoch": 962,
    "quorum": [
        0,
        1,
        2
    ],
    "quorum_age": 6172,
    "features": {
        "required_con": "2449958747317026820",
        "required_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus",
            "pacific",
            "elector-pinging"
        ],
        "quorum_con": "4540138297136906239",
        "quorum_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus",
            "pacific",
            "elector-pinging"
        ]
    },
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 3,
        "fsid": "6e521054-1532-4bc8-9971-7f8ae93e8430",
        "modified": "2021-09-07T08:10:34.206861Z",
        "created": "2021-08-29T06:36:59.023456Z",
        "min_mon_release": 16,
        "min_mon_release_name": "pacific",
        "election_strategy": 1,
        "disallowed_leaders: ": "",
        "stretch_mode": false,
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune",
                "nautilus",
                "octopus",
                "pacific",
                "elector-pinging"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "ceph-mon-01",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "172.16.10.148:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "172.16.10.148:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "172.16.10.148:6789/0",
                "public_addr": "172.16.10.148:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 1,
                "name": "ceph-mon-02",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "172.16.10.110:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "172.16.10.110:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "172.16.10.110:6789/0",
                "public_addr": "172.16.10.110:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 2,
                "name": "ceph-mon-03",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "172.16.10.182:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "172.16.10.182:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "172.16.10.182:6789/0",
                "public_addr": "172.16.10.182:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            }
        ]
    },
    "feature_map": {
        "mon": [
            {
                "features": "0x3f01cfb9fffdffff",
                "release": "luminous",
                "num": 1
            }
        ]
    },
    "stretch_mode": false
}

Viewing the mon's runtime configuration through the admin socket

root@ceph-mon-01:~# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon-01.asok config show
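The full config dump runs to several hundred options, so it is usually piped through grep; for example, to check a single setting:

root@ceph-mon-01:~# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon-01.asok config show | grep mon_allow_pool_delete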

Stopping the Ceph cluster

  • Before stopping services, tell the cluster not to mark OSDs out, so that the OSDs are not kicked out of the cluster (and rebalancing triggered) once the node services are stopped.

Set the noout flag on the OSDs

ceph@ceph-deploy:~/ceph-cluster$ ceph osd set noout # set noout before stopping services
noout is set

ceph@ceph-deploy:~/ceph-cluster$ ceph osd unset noout # unset noout after the cluster is back up
noout is unset

Unmount and unmap on the clients

root@ceph-client-01:~# umount  /data/rbd_data/
root@ceph-client-01:~# rbd -p [pool_name] unmap [img_name]

Stop the RGW services

root@ceph-mgr-01:~# systemctl stop ceph-radosgw@rgw.ceph-mgr-01
root@ceph-mgr-02:~# systemctl stop ceph-radosgw@rgw.ceph-mgr-02

Stop the CephFS metadata (MDS) services

root@ceph-mgr-01:~# systemctl stop ceph-mds@ceph-mgr-01
root@ceph-mgr-02:~# systemctl stop ceph-mds@ceph-mgr-02

Stop the Ceph OSDs

root@ceph-node-01:~# systemctl stop ceph-osd@0 ceph-osd@1 ceph-osd@2
root@ceph-node-02:~# systemctl stop ceph-osd@3 ceph-osd@4 ceph-osd@5
root@ceph-node-03:~# systemctl stop ceph-osd@6 ceph-osd@7 ceph-osd@8

Stop the Ceph managers

root@ceph-mgr-01:~# systemctl stop ceph-mgr@ceph-mgr-01
root@ceph-mgr-02:~# systemctl stop ceph-mgr@ceph-mgr-02

Stop the Ceph monitors

root@ceph-mon-01:~# systemctl stop ceph-mon@ceph-mon-01
root@ceph-mon-02:~# systemctl stop ceph-mon@ceph-mon-02

Power off the hosts

  • Power off the Ceph cluster hosts.

Starting the Ceph cluster

Start the Ceph monitors

root@ceph-mon-01:~# systemctl start ceph-mon@ceph-mon-01
root@ceph-mon-02:~# systemctl start ceph-mon@ceph-mon-02

Start the Ceph managers

root@ceph-mgr-01:~# systemctl start ceph-mgr@ceph-mgr-01
root@ceph-mgr-02:~# systemctl start ceph-mgr@ceph-mgr-02

Start the Ceph OSDs

root@ceph-node-01:~# systemctl start ceph-osd@0 ceph-osd@1 ceph-osd@2
root@ceph-node-02:~# systemctl start ceph-osd@3 ceph-osd@4 ceph-osd@5
root@ceph-node-03:~# systemctl start ceph-osd@6 ceph-osd@7 ceph-osd@8

Start the CephFS metadata (MDS) services

root@ceph-mgr-01:~# systemctl start ceph-mds@ceph-mgr-01
root@ceph-mgr-02:~# systemctl start ceph-mds@ceph-mgr-02

Start the RGW services

root@ceph-mgr-01:~# systemctl start ceph-radosgw@rgw.ceph-mgr-01
root@ceph-mgr-02:~# systemctl start ceph-radosgw@rgw.ceph-mgr-02

Remap and remount on the clients

root@ceph-client-01:~# rbd -p [pool_name] map [img_name]
root@ceph-client-01:~# mount /dev/rbd0 /data/rbd_data/

Unset the noout flag on the OSDs

ceph@ceph-deploy:~/ceph-cluster$ ceph osd unset noout
noout is unset

Adding a node

Configure time synchronization

root@ceph-node-x:~# apt -y install chrony
root@ceph-node-x:~# systemctl start chrony
root@ceph-node-x:~# systemctl enable chrony

Add the Ceph APT repository

root@ceph-node-x:~# wget -q -O- 'https://mirrors.tuna.tsinghua.edu.cn/ceph/keys/release.asc' | sudo apt-key add -
root@ceph-node-x:~# echo "deb https://mirrors.tuna.tsinghua.edu.cn/ceph/debian-pacific $(lsb_release -cs) main" >> /etc/apt/sources.list
root@ceph-node-x:~# apt -y update && apt -y upgrade

Install the common packages

root@ceph-node-x:~# apt -y install ceph-common

Configure the ceph user

root@ceph-node-x:~# usermod -s /bin/bash ceph
root@ceph-node-x:~# passwd ceph
root@ceph-node-x:~# echo "ceph ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers

Install the Ceph runtime on the OSD node

ceph@ceph-deploy:~/ceph-cluster$ ceph-deploy install --release pacific ceph-node-x

Zap the disks

ceph@ceph-deploy:~/ceph-cluster$ ceph-deploy disk zap ceph-node-x /dev/vdb /dev/vdc  /dev/vdd

Add the OSDs

ceph@ceph-deploy:~/ceph-cluster$ ceph-deploy osd create ceph-node-x --data  /dev/vdb
ceph@ceph-deploy:~/ceph-cluster$ ceph-deploy osd create ceph-node-x --data  /dev/vdc
ceph@ceph-deploy:~/ceph-cluster$ ceph-deploy osd create ceph-node-x --data  /dev/vdd

Removing a node

Mark the OSD out of the cluster

ceph@ceph-deploy:~/ceph-cluster$ ceph osd out osd.x

Wait for data migration to complete

ceph@ceph-deploy:~/ceph-cluster$ ceph -w

Stop the OSD process

root@ceph-node-x:~# systemctl stop ceph-osd@x

Remove the OSD

ceph@ceph-deploy:~/ceph-cluster$ ceph osd rm x
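Note that ceph osd rm only removes the OSD from the OSD map; its CRUSH entry and cephx key remain. A minimal cleanup sketch, assuming the OSD id is x — either remove the leftovers individually, or let ceph osd purge (available since Luminous) do out + crush remove + auth del + rm in one step:

ceph@ceph-deploy:~/ceph-cluster$ ceph osd crush remove osd.x   # remove the CRUSH map entry
ceph@ceph-deploy:~/ceph-cluster$ ceph auth del osd.x           # remove the cephx key
ceph@ceph-deploy:~/ceph-cluster$ ceph osd purge x --yes-i-really-mean-it   # alternative: single-command removal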

Repeat the steps above for the other OSDs on this host

  1. Mark the OSD out of the cluster
  2. Wait for data migration to complete
  3. Stop the osd.x process
  4. Remove the OSD

Take the host offline

root@ceph-node-x:~# shutdown -h now

Ceph configuration file (ceph.conf)

[global]
fsid = 6e521054-1532-4bc8-9971-7f8ae93e8430
public_network = 172.16.10.0/24
cluster_network = 172.16.10.0/24
mon_initial_members = ceph-mon-01
mon_host = 172.16.10.148
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
mon_max_pg_per_osd = 300
mon_allow_pool_delete = true
osd_pool_default_size = 3
osd pool default min size = 1
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
osd_max_pg_per_osd_hard_ratio = 2.5
ms_dispatch_throttle_bytes = 2097152000
ms_bind_before_connect = true
bluestore default buffered write = true
rgw enable usage log = true
rgw usage log tick interval = 30
rgw usage log flush threshold = 1024
rgw usage max shards = 32
rgw usage max user shards = 1

[mon]
mon max pg per osd = 1000
mon max pool pg num = 300000
mon_allow_pool_delete = true
mon clock drift allowed = 1
mon osd down out interval = 600
mon clock drift warn backoff = 30
 
[osd]
mon_osd_nearfull_ratio = .85
mon_osd_full_ratio = .95
osd_backfill_full_ratio = .90
osd journal size = 20000
osd_max_write_size = 512
osd_client_message_size_cap = 2147483648
osd op threads = 16
osd disk threads = 4
osd_map_cache_size = 1024
osd map cache bl size = 128
osd mon heartbeat interval = 40
objecter inflight ops = 0
journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000
osd crush update on start = false
osd recovery op priority = 3
osd recovery max active = 10
osd max backfills = 5
osd max scrubs = 2
bluestore_cache_meta_ratio = 0.8
bluestore_cache_kv_ratio = 0.2
bluestore_csum_type = none
 
[rgw]
rgw override bucket index max shards = 8
 
[mds]
mds cache memory limit = 4G
mds reconnect timeout = 30
mds decay halflife = 10
 
[client]
rbd cache = true
rbd cache size = 335544320
rbd cache max dirty = 134217728
rbd cache target dirty = 235544320
rbd cache max dirty age = 30
rbd cache max dirty object = 8
rbd cache writethrough until flush = false

Replicated pools

  • Replicated pool (replicated): defines how many copies of each object the cluster keeps. The default is three replicas (one primary plus two copies) for high availability; replicated is Ceph's default pool type.

Reads from a replicated pool

  1. The client sends a read request, and RADOS forwards it to the primary OSD.
  2. The primary OSD reads the data from its local disk and returns it, completing the read request.

Writes to a replicated pool

  • When a client performs a write, Ceph uses the CRUSH algorithm to compute the PG ID and the primary OSD for the object. The primary OSD then uses the configured replica count, the object name, the pool name and the cluster map to determine the PG's secondary OSDs, and replicates the data to those replica OSDs.
    1. The client application issues a write request, and RADOS sends it to the primary OSD.
    2. The primary OSD identifies the replica OSDs and sends the data to each of them.
    3. The replica OSDs write the data and report completion back to the primary OSD.
    4. The primary OSD reports completion of the write to the client application.

Erasure-coded pools

  • Erasure-coded pool (erasure code): each object is stored as N = K + M chunks, where K is the number of data chunks and M the number of coding chunks, so the pool width is K + M. The data lives in the K data chunks while the M coding chunks provide redundancy, so up to M chunks can fail without losing data; the on-disk footprint is K + M chunks, which saves considerable space compared with replication. A common layout is 8+4, i.e. 8 data chunks plus 4 coding chunks: of the 12 chunks, 8 hold data and 4 provide redundancy, so only 1/3 of the space is spent on redundancy, far less than the 3x overhead of the default replicated pool, at the cost of tolerating no more than the allowed number of failed chunks. Not every application can use erasure-coded pools: RBD only supports replicated pools, while radosgw can use erasure-coded pools. (A hedged example of creating an erasure-coded pool with a custom profile follows this list.)
  • Ceph has supported erasure-coded pools since the Firefly release, but they are not recommended for production use. Erasure coding reduces the total disk space needed to store the data, but reads and writes cost more CPU than with replicated pools. RGW supports erasure-coded pools; RBD does not. Erasure-coded pools can lower the up-front total cost of ownership.
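A minimal sketch of creating an erasure-coded pool with a custom profile; the profile name ec-profile-42, pool name ec-pool-test and the k/m values are illustrative, and crush-failure-domain=osd is chosen because this lab cluster only has three OSD hosts:

ceph@ceph-deploy:~/ceph-cluster$ ceph osd erasure-code-profile set ec-profile-42 k=4 m=2 crush-failure-domain=osd
ceph@ceph-deploy:~/ceph-cluster$ ceph osd erasure-code-profile get ec-profile-42
ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool create ec-pool-test 16 16 erasure ec-profile-42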

Writes to an erasure-coded pool

  • Data is encoded on the primary OSD and then distributed to the corresponding OSDs.
    1. Split the data into the appropriate chunks and encode them.
    2. Write each encoded chunk to its OSD.

Reads from an erasure-coded pool

  1. The chunks are retrieved from the corresponding OSDs and decoded.
  2. If any chunks are missing, Ceph automatically reads from the OSDs holding the coding chunks and reconstructs the data.

Creating an erasure-coded pool

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool create erasure-testpool 16 16
pool 'erasure-testpool' created

Viewing the erasure-code profile

ceph@ceph-deploy:~/ceph-cluster$  ceph osd erasure-code-profile get default
k=2    # k is the number of data chunks, i.e. how many pieces the original object is split into; with k=2 a 10KB object is split into k pieces of 5KB each
m=2    # m is the number of coding chunks computed by the encoding function; with 2 coding chunks, up to 2 OSDs in the PG can fail without losing data
plugin=jerasure  # default erasure-code plugin
technique=reed_sol_van 

Writing test data

ceph@ceph-deploy:~/ceph-cluster$ sudo rados put -p erasure-testpool testfile1  /var/log/syslog

Verifying data placement

ceph@ceph-deploy:~/ceph-cluster$ ceph osd map erasure-testpool testfile1
osdmap e1092 pool 'erasure-testpool' (22) object 'testfile1' -> pg 22.3a643fcb (22.b) -> up ([1,3,6], p1) acting ([1,3,6], p1)

Checking the current PG states

ceph@ceph-deploy:~/ceph-cluster$ ceph pg ls-by-pool erasure-testpool | awk '{print $1,$2,$15}'
PG OBJECTS ACTING
22.0 0 [3,0,8]p3
22.1 0 [6,1,3]p6
22.2 0 [6,4,2]p6
22.3 0 [4,8,2]p4
22.4 0 [2,5,8]p2
22.5 0 [4,1,6]p4
22.6 0 [8,4,1]p8
22.7 0 [2,6,5]p2
22.8 0 [3,8,2]p3
22.9 0 [1,5,7]p1
22.a 0 [7,2,3]p7
22.b 1 [1,3,6]p1
22.c 0 [3,1,7]p3
22.d 0 [4,6,0]p4
22.e 0 [5,7,1]p5
22.f 0 [5,8,1]p5
  
* NOTE: afterwards

Reading the data back

ceph@ceph-deploy:~/ceph-cluster$ rados --pool erasure-testpool get testfile1 -  # print the object's contents to stdout
ceph@ceph-deploy:~/ceph-cluster$ rados get -p erasure-testpool testfile1 /tmp/testfile1 # download the object to /tmp/testfile1

PG and PGP

  • PG = Placement Group
  • PGP = Placement Group for Placement purpose; PGP is essentially the set of permutations of PG-to-OSD placements.
  • Placement groups are the internal data structure Ceph uses to store data in each pool across multiple OSDs. They add an indirection layer between the OSD daemons and Ceph clients: CRUSH dynamically maps every object to a placement group and then maps every placement group to one or more OSD daemons, which is what allows data to be rebalanced when new OSDs come online. (A short example of inspecting a pool's pg_num and pgp_num follows this list.)
  • Relative to the pool, a PG is a virtual component: it is the virtual layer through which objects are mapped into the pool.
  • The number of placement groups in a pool can be chosen by the administrator.
  • For scalability and performance, Ceph subdivides each pool into multiple placement groups, maps every individual object to a placement group, and assigns each placement group a primary OSD.
  • A pool consists of a series of placement groups, and CRUSH distributes the PGs evenly and pseudo-randomly (hash-based, so repeated calculations give the same result) across the cluster's OSDs according to the cluster map and cluster state.
  • If an OSD fails or the cluster needs rebalancing, Ceph moves or copies entire placement groups instead of addressing every object individually.
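As a quick check, a pool's pg_num and pgp_num can be read back with ceph osd pool get; the wgsrbd pool used later in this post serves as the example here:

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool get wgsrbd pg_num
ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool get wgsrbd pgp_num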

The relationship between PGs and OSDs

  • Ceph assigns placement groups (PGs) to OSDs using the CRUSH algorithm.
  • When a client stores an object, CRUSH maps that object to a placement group (PG).

Calculating PG counts

  • https://docs.ceph.com/en/mimic/rados/configuration/pool-pg-config-ref/
  • The number of placement groups (PGs) is specified by the administrator when the pool is created; CRUSH then creates and uses them. The PG count should be a power of 2. Keep each OSD below roughly 250 PGs; the official guidance is about 100 PGs per OSD. The rule of thumb is total OSDs times 100 divided by the replica count (osd pool default size): for example, with 10 OSDs and a pool size of 4 replicas, roughly (100 * 10) / 4 = 250 PGs per pool, rounded to 256.
  • In general, the PG count should reflect a reasonable granularity of the data.
    • For example: in a pool with 256 PGs, each PG holds roughly 1/256 of the pool's data.
  • The PG count affects performance whenever PGs have to be moved from one OSD to another.
    • With too few PGs, each OSD holds relatively more data, and the network load Ceph generates while synchronizing that data will hurt the cluster's throughput.
    • With too many PGs, Ceph spends excessive CPU and memory tracking PG state.
  • The PG count plays an important role in how the cluster distributes and rebalances data.
    • Durable storage and a complete, even data distribution across all OSDs call for more placement groups, but the count should be kept to the minimum needed for maximum performance, to save CPU and memory.
    • As a rule of thumb, for a RADOS cluster with more than 50 OSDs, aim for roughly 50-100 PGs per OSD to balance resource usage against data durability and distribution; in larger clusters each OSD can carry 100-200 PGs.
    • To decide how many PGs a pool should use, compute the cluster-wide total with the formulas below and round the per-pool value to the nearest power of 2 (a small shell sketch of this calculation follows the list):
      • Total OSDs * PGs per OSD / replication factor => total PGs
      • total disks x PGs per disk / replica count => total PGs for the cluster (slightly above a power of 2)
    • Official formula
      • Total PGs = (Total_number_of_OSD * 100) / max_replication_count
    • Per-pool PG calculation, e.g. 100 OSDs, 3 replicas, 5 pools:
      • Total PGs = 100 * 100 / 3 = 3333
      • PGs per pool = 3333 / 5 ≈ 667, rounded to the nearest power of 2 = 512, so specify pg 512 when creating each pool
  • A RADOS cluster typically hosts multiple pools, so the administrator must also consider how many PGs each OSD ends up serving once the PGs of all pools are added together.
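A small shell sketch of the calculation above; the input values are the 100-OSD example and should be adjusted for the actual cluster:

#!/usr/bin/env bash
# PG sizing helper: total PGs = OSDs * PGs-per-OSD / replicas, split per pool,
# then rounded to the nearest power of two (example values from the list above).
OSDS=100; PG_PER_OSD=100; REPLICAS=3; POOLS=5

total=$(( OSDS * PG_PER_OSD / REPLICAS ))   # 3333
per_pool=$(( total / POOLS ))               # 666

# round to the nearest power of two
p=1
while (( p * 2 <= per_pool )); do p=$(( p * 2 )); done
(( per_pool - p > p * 2 - per_pool )) && p=$(( p * 2 ))

echo "total PGs ~= ${total}, per-pool pg_num ~= ${per_pool}, rounded to ${p}"   # prints 512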

PG states

Peering

  • The OSDs that hold a PG are bringing themselves into agreement about the PG's data and state; Peering is the state shown while that negotiation between the OSDs is in progress.

Activating

  • Peering has completed and the PG is waiting for all of its replicas to persist and acknowledge the peering result (info, log, etc.).

Clean

  • Clean state: the PG has no objects awaiting repair and its size equals the pool's replica count, i.e. the PG's acting set and up set are the same group of OSDs with identical contents. (An example of comparing the two sets follows this list.)
  • Acting set: the PG's current primary OSD plus the other replica OSDs that are actively serving it; the OSDs in the acting set handle the users' read and write requests.
  • Up set: the set of OSDs that CRUSH currently maps the PG to. When an OSD fails it has to be replaced and the data synchronized to the new member: for example, if a PG lives on OSD1, OSD2 and OSD3 and OSD3 fails and is replaced by OSD4, the up set becomes OSD1, OSD2, OSD4 while OSD1, OSD2, OSD3 may temporarily remain the acting set; once the data has been copied to OSD4, the acting set converges to the up set and the two match again.
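One way to see the two sets side by side is ceph pg dump pgs_brief, whose UP and ACTING columns should list the same OSDs for a clean PG:

root@ceph-mon-01:~# ceph pg dump pgs_brief | head -5   # the UP and ACTING columns show the two sets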

Active

  • Ready/active state: the primary OSD and the replica OSDs are all operating normally and the PG can serve client reads and writes; a healthy PG is normally in the active+clean state.

Degraded

  • A PG enters the degraded state once one of its OSDs is marked down; every PG mapped to that OSD becomes degraded.
  • If that OSD can be restarted and completes peering, the PGs that use it return to the clean state.
  • If the OSD stays marked down beyond mon_osd_down_out_interval (600 seconds by default), it is marked out of the cluster, and Ceph then starts recovery on the degraded PGs until every PG that was degraded because of that OSD is clean again.
  • Data is recovered from the PG's primary OSD; if the primary itself failed, one of the remaining replica OSDs is first promoted to primary.

Stale

  • Under normal operation, every primary OSD periodically reports to the monitors (mon) the minimal statistics for all PGs it is primary for. If for any reason an OSD cannot send these reports to the monitors, or another OSD reports it as down, every PG whose primary is that OSD is immediately marked stale, meaning its primary no longer holds the latest data. If it is a replica OSD that goes down, Ceph simply repairs the PG and does not mark it stale.

undersized

  • A PG becomes undersized when its current number of replicas drops below the value configured for the pool. For example, if both replica OSDs are down, the PG is left with only its primary, which no longer meets Ceph's minimum requirement of a primary plus a replica OSD, so every PG using those OSDs stays undersized until replacement OSDs are added or recovery completes.

Scrubbing

  • Scrub is Ceph's data-scrubbing state, a mechanism for guaranteeing data integrity. The OSDs periodically start a scrub thread that scans part of their objects and compares them with the other replicas to detect inconsistencies; if a mismatch is found, an error is raised for the operator to resolve by hand. Scrubbing works per PG: for each PG, Ceph walks all of its objects and builds a summary resembling a metadata digest (object sizes, attributes, etc.) called a scrubmap, then compares the primary's scrubmap with the replicas' to check whether any object is missing or mismatched. Scans come in two flavours: a lightweight scan, also called a light scrub, shallow scrub or simply scrub, and a deep scan.
  • A light scrub (daily) compares object sizes and attributes; a deep scrub (weekly) also reads the data and verifies its consistency with checksums. A PG undergoing a deep scrub is in the scrubbing+deep state. (A hedged example of triggering scrubs by hand follows this list.)
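Scrubs can also be kicked off by hand per PG; a minimal sketch, reusing PG 22.b from the erasure-testpool example earlier in this post:

ceph@ceph-deploy:~/ceph-cluster$ ceph pg scrub 22.b        # schedule a light scrub of one PG
ceph@ceph-deploy:~/ceph-cluster$ ceph pg deep-scrub 22.b   # schedule a deep scrub of the same PG
ceph@ceph-deploy:~/ceph-cluster$ ceph -s                   # watch for scrubbing / scrubbing+deep in the pgs section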

Recovering

  • Recovering: the cluster is migrating or synchronizing objects and their replicas. This can happen because a new OSD was added to the cluster or an OSD went down, causing CRUSH to remap PGs to different OSDs; while a PG resynchronizes its data because of such an OSD change, it is marked recovering.

Backfilling

  • Backfilling: backfill is a special case of recovery. After peering completes, if some PG replicas in the up set cannot be caught up by incremental, log-based recovery (for example because the OSDs carrying them were offline too long, or a new OSD joined the cluster and the PG was migrated wholesale), they are synchronized by copying the primary's entire current object set; a PG in this full-copy phase is in the backfilling state.

Backfill-toofull

  • A PG that needs to be backfilled sits on an OSD that lacks free space; while the backfill is suspended, the PG shows the backfill_toofull state.

Client data read/write flow

  • When reading or writing objects, a Ceph client first retrieves the cluster map from a Ceph monitor, then accesses the specified pool and issues reads or writes against objects in the pool's PGs.
  • The pool's CRUSH rule and its number of PGs are the key factors that determine how Ceph places data. With the latest cluster map the client knows every monitor and OSD in the cluster and their current state, but it still does not know where any particular object is stored.
  • To read or write an object in a pool, the client supplies the object name; the hash of the object name, the number of PGs in the pool and the pool name are fed to CRUSH, which computes the PG ID and that PG's primary OSD, after which the client can read or write the object on that OSD.
  • The write path in detail (see the example after this list):
    1. The application sends a request for an object to the Ceph client, including the object and the pool. The Ceph client hashes the object name and derives from that hash the PG the object belongs to, completing the pool-to-PG mapping.
      • The application supplies the pool ID and object ID.
      • The Ceph client hashes the object ID.
      • The Ceph client takes that hash modulo the pool's PG count to obtain the PG ID.
      • The Ceph client hashes the pool ID (the pool's numeric ID).
      • The Ceph client combines the pool ID and the PG ID.
    2. The client then uses the PG ID and the CRUSH/cluster map as inputs to a second calculation that yields the PG's primary OSD, completing the PG-to-OSD mapping.
      • The Ceph client fetches the latest cluster map from the MON.
      • The Ceph client computes the object's PG ID as in step 1.
      • The Ceph client then uses CRUSH to compute the IDs of the PG's primary and replica OSDs, after which it can read and write data on those OSDs.
    3. The client sends its read or write request to the primary OSD; for a write, the Ceph servers take care of replicating the object from the primary OSD to the replica OSDs.
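The whole mapping chain can be inspected with ceph osd map, which prints the object's PG ID together with the up and acting OSD sets (pool name wgsrbd and object name testobject are placeholders; the erasure-code example earlier in this post uses the same command):

ceph@ceph-deploy:~/ceph-cluster$ ceph osd map wgsrbd testobject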

Ceph pool operations

Official operations manual

  •  http://docs.ceph.org.cn/rados/ 

Common pool commands

  • ceph osd pool create <poolname> <pg_num> <pgp_num> {replicated|erasure}   # create a pool
  • ceph osd pool ls [detail]   # list pools
  • ceph osd lspools   # list pools
  • ceph osd pool stats [pool name]   # show a pool's I/O statistics
  • ceph osd pool rename <old-name> <new-name>   # rename a pool
  • ceph osd pool get [pool name] size   # get the pool's replica count; the default is 3 (one primary plus two copies)
  • ceph osd pool get [pool name] min_size   # get the pool's minimum replica count
  • ceph osd pool get [pool name] pg_num   # show the current PG count
  • ceph osd pool get [pool name] crush_rule   # show the pool's CRUSH rule; the default is the replicated rule (replicated_rule)
  • ceph osd pool get [pool name] nodelete   # whether pool deletion is blocked; deletion is possible by default
  • ceph osd pool get [pool name] nopgchange   # whether changing the pool's pg_num and pgp_num is blocked
  • ceph osd pool set [pool name] pg_num 64   # change the PG count of the given pool
  • ceph osd pool get [pool name] nosizechange   # whether changing the pool size is blocked; changes are allowed by default
  • ceph osd pool get-quota [pool name]   # show the pool's quota
  • ceph osd pool set-quota [pool name] max_bytes   21474836480   # set the pool's maximum size, in bytes
  • ceph osd pool set-quota [pool name] max_objects 1000   # set the pool's maximum object count
  • ceph osd pool get [pool name] noscrub   # whether light scrubbing is disabled; the default is false, i.e. scrubbing stays enabled
  • ceph osd pool set [pool name] noscrub true   # disable light scrubbing for the given pool
  • ceph osd pool get [pool name] nodeep-scrub   # whether deep scrubbing is disabled; the default is false, i.e. deep scrubbing stays enabled
  • ceph osd pool set [pool name] nodeep-scrub true   # disable deep scrubbing for the given pool
  • ceph osd pool get [pool name] scrub_min_interval   # the pool's minimum scrub interval; unset by default, falling back to osd_scrub_min_interval from the config
  • ceph osd pool get [pool name] scrub_max_interval   # the pool's maximum scrub interval; unset by default, falling back to osd_scrub_max_interval from the config
  • ceph osd pool get [pool name] deep_scrub_interval   # the pool's deep scrub interval; unset by default, falling back to osd_deep_scrub_interval from the config
  • rados df   # show pool usage
  • ceph daemon osd.x config show | grep scrub   # show an OSD's scrub-related options on a node
    • osd_deep_scrub_interval: 604800.000000   # deep scrub interval, 604800 s = 7 days
    • osd_max_scrubs: 1   # maximum number of simultaneous scrub operations per OSD daemon
    • osd_scrub_invalid_stats: true   # whether scrubbing also runs when stats are flagged invalid
    • osd_scrub_max_interval: 604800.000000   # maximum scrub interval, 604800 s = 7 days
    • osd_scrub_min_interval: 86400.000000   # minimum scrub interval, 86400 s = 1 day

Deleting a pool

Create a test pool

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool create wgspool01 32 32
pool 'wgspool01' created

Make sure the nodelete flag is false

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool set wgspool01 nodelete false
set pool 27 nodelete to false # the default is false

Set the cluster parameter (mon_allow_pool_delete)

ceph@ceph-deploy:~/ceph-cluster$ ceph tell mon.* injectargs --mon-allow-pool-delete=true
mon.ceph-mon-01: {}
mon.ceph-mon-01: mon_allow_pool_delete = 'true' 
mon.ceph-mon-02: {}
mon.ceph-mon-02: mon_allow_pool_delete = 'true' 
mon.ceph-mon-03: {}
mon.ceph-mon-03: mon_allow_pool_delete = 'true' 

Delete the pool

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool rm wgspool01 wgspool01  --yes-i-really-really-mean-it
pool 'wgspool01' removed

Restore the cluster parameter

ceph@ceph-deploy:~/ceph-cluster$ ceph tell mon.* injectargs --mon-allow-pool-delete=false
mon.ceph-mon-01: {}
mon.ceph-mon-01: mon_allow_pool_delete = 'false' 
mon.ceph-mon-02: {}
mon.ceph-mon-02: mon_allow_pool_delete = 'false' 
mon.ceph-mon-03: {}
mon.ceph-mon-03: mon_allow_pool_delete = 'false'

Pool quotas

Viewing a pool's quota

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool get-quota wgsrbd
quotas for pool 'wgsrbd':
  max objects: N/A   # no object-count limit by default
  max bytes  : N/A   # no size limit by default (bytes)

Setting the maximum object count

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool set-quota wgsrbd max_objects 1000
set-quota max_objects = 1000 for pool wgsrbd

Setting the maximum pool size

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool set-quota wgsrbd max_bytes 10737418240
set-quota max_bytes = 10737418240 for pool wgsrbd   

Verifying the quota

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool get-quota wgsrbd
quotas for pool 'wgsrbd':
  max objects: 1k objects  (current num objects: 18 objects)
  max bytes  : 10 GiB  (current num bytes: 25026596 bytes)

Pool snapshots

  • https://docs.ceph.com/en/pacific/rbd/rbd-snapshot/?highlight=snap
  • pool snapshot: a snapshot of the entire pool; every object in the pool is affected.
  • self-managed snapshot: a snapshot managed by the user, where which objects are affected is under the user's control; the "user" here is usually an application such as librbd.
  • A pool that already holds RBD images can no longer take pool snapshots, because such a pool is in unmanaged snaps mode; a pool without images can take pool snapshots.

Creating a pool snapshot with the ceph command

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool mksnap [pool name] [snap name]
ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool mksnap cephfs-data cephfs-data-snap
created pool cephfs-data snap cephfs-data-snap

Creating a pool snapshot with the rados command

ceph@ceph-deploy:~/ceph-cluster$ rados -p [pool name] mksnap [snap name]
ceph@ceph-deploy:~/ceph-cluster$ rados -p cephfs-data mksnap cephfs-data-snap02
created pool cephfs-data snap cephfs-data-snap02

Listing the snapshots

ceph@ceph-deploy:~/ceph-cluster$ rados lssnap -p cephfs-data
1       cephfs-data-snap        2021.10.25 13:44:57
2       cephfs-data-snap02      2021.10.25 13:46:00
2 snaps

Rolling back to a snapshot

ceph@ceph-deploy:~/ceph-cluster$ rados rollback -p cephfs-data testfile  cephfs-data-snap02
rolled back pool cephfs-data to snapshot cephfs-data-snap02

Deleting a snapshot

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool rmsnap cephfs-data cephfs-data-snap02
removed pool cephfs-data snap cephfs-data-snap02

Data compression

  • With the BlueStore storage engine, Ceph supports inline compression, i.e. data is compressed as it is written, which helps save disk space. Compression can be enabled or disabled on every Ceph pool created on BlueStore OSDs. It is disabled by default and has to be configured and enabled explicitly.

Enabling compression and choosing an algorithm

  • The available algorithms are none, zlib, lz4, zstd and snappy; the default is snappy.
  • zstd achieves a better compression ratio but costs more CPU.
  • lz4 and snappy have comparatively low CPU overhead.
  • zlib is not recommended.
ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool set <pool-name> compression_algorithm snappy

Choosing the compression mode

  • none: never compress data; this is the default.
  • passive: do not compress data unless the write carries a compressible hint.
  • aggressive: compress data unless the write carries an incompressible hint.
  • force: try to compress in all cases, even when the client hints that the data is incompressible.
ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool set <pool name> compression_mode aggressive

Per-pool compression parameters

  • compression_algorithm   # compression algorithm
  • compression_mode   # compression mode
  • compression_required_ratio   # required ratio of compressed size to original size; default .875
  • compression_max_blob_size   # chunks larger than this are broken into smaller blobs before being compressed; overrides the global bluestore compression max blob size settings
  • compression_min_blob_size   # chunks smaller than this are not compressed; overrides the global bluestore compression min blob size settings (see the sketch below)
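A hedged example of tuning these per pool; the pool name wgsrbd and the values are illustrative:

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool set wgsrbd compression_required_ratio 0.9
ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool set wgsrbd compression_min_blob_size 8192
ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool get wgsrbd compression_mode   # read a setting back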

Global compression options

  • bluestore_compression_algorithm   # compression algorithm
  • bluestore_compression_mode   # compression mode
  • bluestore_compression_required_ratio   # required ratio of compressed size to original size; default .875
  • bluestore_compression_min_blob_size   # chunks smaller than this are not compressed; default 0
  • bluestore_compression_max_blob_size   # chunks larger than this are split into smaller blobs before compression; default 0
  • bluestore_compression_min_blob_size_ssd   # default 8k
  • bluestore_compression_max_blob_size_ssd   # default 64k
  • bluestore_compression_min_blob_size_hdd   # default 128k
  • bluestore_compression_max_blob_size_hdd   # default 512k (a hedged example of setting these cluster-wide follows this list)
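A hedged sketch of setting the global defaults cluster-wide through the centralized config database on Pacific (values are illustrative; the per-pool settings above still take precedence):

ceph@ceph-deploy:~/ceph-cluster$ ceph config set osd bluestore_compression_algorithm snappy
ceph@ceph-deploy:~/ceph-cluster$ ceph config set osd bluestore_compression_mode aggressive
ceph@ceph-deploy:~/ceph-cluster$ ceph config get osd.0 bluestore_compression_mode   # verify on one OSD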

Changing the compression algorithm

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool set [pool name] compression_algorithm snappy

Changing the compression mode

ceph@ceph-deploy:~/ceph-cluster$ ceph osd pool set [pool name] compression_mode passive
posted @ 2021-11-22 14:24  小吉猫  阅读(335)  评论(0编辑  收藏  举报