NetApp Storage Solution and Inspection Commands
I. MCC Overview
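Clustered MetroCluster (MCC) is the active-active storage solution provided by NetApp Data ONTAP. In the original design, a single FAS/V-Series dual-controller system was stretched between data centers to form a geographically separated HA pair, with only one controller node at each site. The two sites were linked through additional FC-VI cluster adapters, and the SAS disk shelves at the two sites were connected through FibreBridge SAS-to-FC bridges. Within 500 m, or inside the same machine room, the sites are connected directly through Fibre Channel switches; beyond 500 m (up to 100 km), they are connected through Fibre Channel and DWDM switches.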
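MetroCluster later evolved from this architecture. Two FAS/V-Series dual-controller arrays are deployed, one at site A and one at site B. Controller A of array A is clustered with controller A of array B, and controller B of array A is clustered with controller B of array B. This makes full use of the resources in both data centers, and both sites serve storage at the same time; however, controllers A and B inside the same array do not form a cluster. If either controller in a cross-site pair fails, the hosts at the failed site have to access the surviving controller at the remote site. If both controllers in a cross-site pair fail at the same time, service is interrupted.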
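NetApp Data ONTAP 8.3 introduced a four-controller active-active solution that supports distances of up to 200 km. A four-controller MetroCluster first forms two local clusters from two HA pairs, and then builds a four-node cluster across the two local clusters. Each controller keeps its in-memory write log in NVRAM, and NVRAM mirrors the log entries that have not yet been flushed to disk. If a node fails, its HA partner can take over service; if a whole site fails, the HA pair at the remote site can take over. When the log reaches a certain watermark, or a system operation triggers a flush, the data written to disk is synchronized by SyncMirror, which writes to both the primary and the secondary site. If the disks at one site fail, the disks at the other site can still serve data, so a site failover can be performed without interrupting service.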
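The write path described above can be pictured with a deliberately simplified toy model. This is only a sketch for intuition, not how ONTAP is implemented: the `Controller` class, the `watermark` value of 3, and the use of plain Python lists as "plexes" are all assumptions made for illustration.

```python
class Controller:
    """Toy controller: a local NVRAM log plus copies held on behalf of partners."""

    def __init__(self, name):
        self.name = name
        self.nvram = []           # writes logged locally but not yet on disk
        self.mirrored = {}        # partner name -> copy of that partner's log
        self.mirror_targets = []  # controllers that receive a copy of our log

    def log_write(self, block, plexes, watermark=3):
        """Log a write, mirror the log to partners, flush at the watermark."""
        self.nvram.append(block)
        for peer in self.mirror_targets:
            peer.mirrored[self.name] = list(self.nvram)
        if len(self.nvram) >= watermark:  # hypothetical watermark
            self.flush(plexes)

    def flush(self, plexes):
        """Write the logged blocks to every plex (dual write), then clear the log."""
        for plex in plexes:
            plex.extend(self.nvram)
        self.nvram.clear()
        for peer in self.mirror_targets:
            peer.mirrored[self.name] = []


# Hypothetical node names: A1 with its HA partner A2 and its remote partner B1.
a1, a2, b1 = Controller("A1"), Controller("A2"), Controller("B1")
a1.mirror_targets = [a2, b1]
local_plex, remote_plex = [], []

for block in ("w1", "w2", "w3"):
    a1.log_write(block, [local_plex, remote_plex])
```

Before the third write, both partners hold a copy of the unflushed log, which is what lets either of them take over. After the flush, both plexes hold the same blocks and the logs are empty.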
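MetroCluster protects data with mirroring and clustering across two separate locations. Each cluster synchronously mirrors its data and its Storage Virtual Machine (SVM) configuration to the other cluster. If a disaster strikes one site, the administrator can activate the mirrored SVMs and resume service from the other site. In addition, the nodes inside each cluster are configured as an HA pair, which provides local failover.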
NetApp MetroCluster is built on NetApp SyncMirror, combined with the Cluster_remote feature and controller Clustered Failover.
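- Clustered Failover: provides high-availability failover between the primary storage and the disaster recovery storage. The takeover decision is made by the administrator with a single command.
- SyncMirror: keeps a real-time copy of the data on the remote storage, so that after a takeover the data can be served from the remote storage alone.
- ClusterRemote: provides the management mechanism used to declare a disaster and to initiate takeover by the remote storage.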
II. Common MCC Inspection Commands
1. Check system health status
cluster1::> system health status show
Status
---------------
ok
2. Check cluster status
cluster1::> cluster show
Node                  Health  Eligibility
--------------------- ------- ------------
cluster1-01           true    true
cluster1-02           true    true
2 entries were displayed.
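When this check is run regularly, the output can be screened automatically. The following is a minimal sketch, assuming the output has already been captured as text and has the three-column layout shown above; the function names are illustrative and not part of any NetApp tool.

```python
def parse_cluster_show(output):
    """Turn `cluster show` text into one dict per node row."""
    nodes = []
    for line in output.splitlines():
        fields = line.split()
        # A data row has exactly three fields: node, health, eligibility.
        if len(fields) == 3 and fields[1] in ("true", "false") and fields[2] in ("true", "false"):
            nodes.append({
                "node": fields[0],
                "healthy": fields[1] == "true",
                "eligible": fields[2] == "true",
            })
    return nodes


def unhealthy_nodes(output):
    """Return the names of nodes that are not both healthy and eligible."""
    return [n["node"] for n in parse_cluster_show(output)
            if not (n["healthy"] and n["eligible"])]


SAMPLE = """\
Node                  Health  Eligibility
--------------------- ------- ------------
cluster1-01           true    true
cluster1-02           true    true
2 entries were displayed.
"""

print(unhealthy_nodes(SAMPLE))
```

For the sample above the list is empty; any node name that appears in it deserves a closer look.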
3. Check cluster statistics
cluster1::> cluster statistics show
Counter                      Value         Delta
---------------- ----------------- -------------
CPU Busy:                       0%             -
Operations:
  Total:                         0             -
  NFS:                           0             -
  CIFS:                          0             -
Data Network:
  Busy:                         0%             -
  Received:                 5.78GB             -
  Sent:                     13.7GB             -
Cluster Network:
  Busy:                         0%             -
  Received:                  967KB             -
  Sent:                      979KB             -
Storage Disk:
  Read:                     6.38PB             -
  Write:                    6.26PB             -
4. View aggregate and RAID group information
cluster1::> aggr show
Aggregate        Size Available Used% State   #Vols Nodes        RAID Status
------------ -------- --------- ----- ------- ----- ------------ ------------
aggr0_A1      953.8GB   247.3GB   74% online      1 cluster1-01  raid4,
                                                                 mirrored,
                                                                 normal
aggr0_A2      953.8GB   247.3GB   74% online      1 cluster1-02  raid4,
                                                                 mirrored,
                                                                 normal
aggr_data_A1  68.93TB   16.04TB   77% online     32 cluster1-01  mixed_raid_
                                                                 type,
                                                                 mirrored,
                                                                 hybrid,
                                                                 normal
aggr_data_A2  68.93TB   14.77TB   79% online     31 cluster1-02  mixed_raid_
                                                                 type,
                                                                 mirrored,
                                                                 hybrid,
                                                                 normal
4 entries were displayed.
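Capacity is usually the first thing to look at in this output. As a hedged sketch, the snippet below flags aggregates whose Used% reaches a threshold or whose state is not `online`. The 75% threshold is an arbitrary assumption, not a NetApp recommendation, and the regular expression assumes each aggregate row starts with the name followed by size, available space, and Used% on the same line.

```python
import re

# First line of an aggregate row: name, size, available, used%, state.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<size>[\d.]+[KMGTP]B)\s+(?P<avail>[\d.]+[KMGTP]B)"
    r"\s+(?P<used>\d+)%\s+(?P<state>\S+)"
)


def aggregates_to_review(output, threshold=75):
    """Return (name, used%, state) for aggregates that are full or not online."""
    hits = []
    for line in output.splitlines():
        m = ROW.match(line)  # continuation lines start with spaces and do not match
        if m and (int(m["used"]) >= threshold or m["state"] != "online"):
            hits.append((m["name"], int(m["used"]), m["state"]))
    return hits


SAMPLE = """\
Aggregate        Size Available Used% State   #Vols Nodes        RAID Status
------------ -------- --------- ----- ------- ----- ------------ ------------
aggr0_A1      953.8GB   247.3GB   74% online      1 cluster1-01  raid4,
                                                                 mirrored,
                                                                 normal
aggr_data_A1  68.93TB   16.04TB   77% online     32 cluster1-01  mixed_raid_
aggr_data_A2  68.93TB   14.77TB   79% online     31 cluster1-02  mixed_raid_
"""

for name, used, state in aggregates_to_review(SAMPLE):
    print(f"{name}: {used}% used, state {state}")
```

With the sample data, only the two data aggregates (77% and 79%) are reported; the root aggregates at 74% stay below the assumed threshold.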
5. View node information
cluster1::> node show
Node        Health Eligibility Uptime         Model       Owner    Location
----------- ------ ----------- -------------- ----------- -------- ---------------
cluster1-01 true   true        369 days 19:12 FAS8040              gz_idc
cluster1-02 true   true        369 days 19:23 FAS8040              gz_idc
2 entries were displayed.
6. View version information
cluster1::> version
NetApp Release 8.3.2P9: Fri Jan 06 05:54:05 UTC 2017
7. View serial numbers
cluster1::> system license show

Serial Number: 1-80-023992
Owner: cluster1
Package           Type    Description           Expiration
----------------- ------- --------------------- --------------------
Base              license Cluster Base License  -

Serial Number: 1-81-0000000000000451515******
Package           Type    Description           Expiration
----------------- ------- --------------------- --------------------
NFS               license NFS License           -
iSCSI             license iSCSI License         -

Serial Number: 1-81-0000000000000451515******
Owner: cluster1-02
Package           Type    Description           Expiration
----------------- ------- --------------------- --------------------
NFS               license NFS License           -
iSCSI             license iSCSI License         -
5 entries were displayed.
8. View subsystem health status
cluster1::> system health subsystem show
Subsystem         Health
----------------- ------------------
SAS-connect       ok
Environment       ok
Memory            ok
Service-Processor ok
Switch-Health     ok
CIFS-NDO          ok
Motherboard       ok
IO                ok
MetroCluster      ok
MetroCluster_Node ok
FHM-Switch        ok
FHM-Bridge        ok
12 entries were displayed.
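With a dozen subsystems, it is easy to miss a single row that is not `ok`. A small sketch, assuming the two-column layout shown above and treating any value other than `ok` as worth reporting (the function name is illustrative):

```python
def failing_subsystems(output):
    """Map subsystem name -> health for every row whose health is not 'ok'."""
    failing = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # prompt line, blank line, or the entry-count footer
        name, health = fields
        if name == "Subsystem" or set(name) == {"-"}:
            continue  # header row or dashed ruler
        if health != "ok":
            failing[name] = health
    return failing


SAMPLE = """\
Subsystem         Health
----------------- ------------------
SAS-connect       ok
Environment       ok
MetroCluster      ok
MetroCluster_Node ok
FHM-Switch        ok
FHM-Bridge        ok
"""

print(failing_subsystems(SAMPLE))
```

An empty dictionary means every listed subsystem reported `ok`.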
9. View MCC cluster status and node status
cluster1::> metrocluster show
Configuration: fabric
Cluster                        Configuration State    Mode
------------------------------ ---------------------- ------------------------
 Local: cluster1               configured             normal
Remote: cluster1_dr            configured             normal

cluster1::> metrocluster node show
DR                               Configuration  DR
Group Cluster Node               State          Mirroring Mode
----- ------- ------------------ -------------- --------- --------------------
1     cluster1
              cluster1-01        configured     enabled   normal
              cluster1-02        configured     enabled   normal
      cluster1_dr
              cluster1_dr-01     configured     enabled   normal
              cluster1_dr-02     configured     enabled   normal
4 entries were displayed.
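For the node-level view, the healthy state in the output above is `configured` / `enabled` / `normal` on every node. The sketch below reports any node that deviates from that triple. It assumes node rows end with the three status columns and that the mirroring column reads either `enabled` or `disabled`; rows that do not fit this shape are ignored.

```python
EXPECTED = ("configured", "enabled", "normal")


def abnormal_dr_nodes(output):
    """Return (node, (state, mirroring, mode)) for nodes not in the expected state."""
    problems = []
    for line in output.splitlines():
        fields = line.split()
        # Node rows end with: node, configuration state, DR mirroring, mode.
        if len(fields) >= 4 and fields[-2] in ("enabled", "disabled"):
            node, status = fields[-4], tuple(fields[-3:])
            if status != EXPECTED:
                problems.append((node, status))
    return problems


SAMPLE = """\
DR                               Configuration  DR
Group Cluster Node               State          Mirroring Mode
----- ------- ------------------ -------------- --------- --------------------
1     cluster1
              cluster1-01        configured     enabled   normal
              cluster1-02        configured     enabled   normal
      cluster1_dr
              cluster1_dr-01     configured     enabled   normal
              cluster1_dr-02     configured     enabled   normal
4 entries were displayed.
"""

print(abnormal_dr_nodes(SAMPLE))
```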
10. View controller status
cluster1::> system controller show
Controller Name           System ID     Serial Number     Model    Status
------------------------- ------------- ----------------- -------- -----------
cluster1-01               536964819     451515******      FAS8040  ok
cluster1-02               536961600     451515******      FAS8040  ok
2 entries were displayed.
11. View failed disks
cluster1::> storage disk show -broken
There are no entries matching your query.
12. View spare disks
cluster1::> storage disk show -spare

Original Owner: cluster1-01
Checksum Compatibility: block
                                                        Usable Physical
Disk            HA Shelf Bay Chan Pool   Type     RPM     Size     Size Owner
--------------- ------------ ---- ------ ----- ------ -------- -------- --------
1.30.11         3a    30  11 A    Pool0  SAS    10000   1.09TB   1.09TB cluster1-01
1.30.13         3a    30  13 A    Pool0  SAS    10000   1.09TB   1.09TB cluster1-01
1.31.4          3a    31   4 A    Pool0  SAS    10000   1.09TB   1.09TB cluster1-01
1.32.20         4b    32  20 B    Pool0  SAS    10000   1.09TB   1.09TB cluster1-01
1.32.23         3a    32  23 A    Pool0  SAS    10000   1.09TB   1.09TB cluster1-01
1.33.0          3a    33   0 A    Pool0  SAS    10000   1.09TB   1.09TB cluster1-01
1.33.1          3a    33   1 A    Pool0  SAS    10000   1.09TB   1.09TB cluster1-01
1.33.10         4b    33  10 B    Pool0  SAS    10000   1.09TB   1.09TB cluster1-01
2.42.22         3a    42  22 A    Pool1  SAS    10000   1.09TB   1.09TB cluster1-01
2.42.23         4b    42  23 B    Pool1  SAS    10000   1.09TB   1.09TB cluster1-01
2.43.2          4b    43   2 B    Pool1  SAS    10000   1.09TB   1.09TB cluster1-01
2.43.22         3b    43  22 A    Pool1  SAS    10000   1.09TB   1.09TB cluster1-01
2.43.23         4b    43  23 B    Pool1  SAS    10000   1.09TB   1.09TB cluster1-01
3.11.21         4b    11  21 B    Pool0  SSD        -  372.4GB  372.6GB cluster1-01
4.20.21         3a    20  21 A    Pool1  SSD        -  372.4GB  372.6GB cluster1-01
4.21.14         3a    21  14 A    Pool1  SAS    10000   1.09TB   1.09TB cluster1-01

Original Owner: cluster1-02
Checksum Compatibility: block
                                                        Usable Physical
Disk            HA Shelf Bay Chan Pool   Type     RPM     Size     Size Owner
--------------- ------------ ---- ------ ----- ------ -------- -------- --------
2.44.23         3b    44  23 A    Pool1  SAS    10000   1.09TB   1.09TB cluster1-02
3.12.21         4a    12  21 B    Pool0  SSD        -  372.4GB  372.6GB cluster1-02
4.23.21         3b    23  21 A    Pool1  SSD        -  372.4GB  372.6GB cluster1-02
5.60.23         3b    60  23 B    Pool1  SAS    10000   1.09TB   1.09TB cluster1-02
20 entries were displayed.
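In a mirrored configuration the two plexes draw their spares from different pools, so a per-owner, per-pool, per-type count is often more telling than the total number of entries. The following is a rough sketch that tallies spare rows from captured output. The regular expression assumes the eleven-column row layout shown above, and the sample contains only a few of the rows from that output.

```python
import re
from collections import Counter

# disk, HA, shelf, bay, channel, pool, type, RPM, usable size, physical size, owner
ROW = re.compile(
    r"^\s*(?P<disk>\d+\.\d+\.\d+)\s+\S+\s+\d+\s+\d+\s+[AB]\s+"
    r"(?P<pool>Pool\d)\s+(?P<type>\S+)\s+\S+\s+\S+\s+\S+\s+(?P<owner>\S+)\s*$"
)


def count_spares(output):
    """Count spare disks per (owner, pool, disk type)."""
    counts = Counter()
    for line in output.splitlines():
        m = ROW.match(line)
        if m:
            counts[(m["owner"], m["pool"], m["type"])] += 1
    return counts


SAMPLE = """\
1.30.11         3a    30  11 A    Pool0  SAS    10000   1.09TB   1.09TB cluster1-01
1.30.13         3a    30  13 A    Pool0  SAS    10000   1.09TB   1.09TB cluster1-01
2.42.22         3a    42  22 A    Pool1  SAS    10000   1.09TB   1.09TB cluster1-01
3.11.21         4b    11  21 B    Pool0  SSD        -  372.4GB  372.6GB cluster1-01
2.44.23         3b    44  23 A    Pool1  SAS    10000   1.09TB   1.09TB cluster1-02
"""

for key, n in sorted(count_spares(SAMPLE).items()):
    print(key, n)
```

A combination that is missing from the result means that owner has no spare of that type in that pool.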
13. Check the SAS bridges for faults
cluster1::> storage bridge show
                                       Is        Monitor
Bridge                   Symbolic Name Monitored Status  Vendor Model                 Bridge WWN
------------------------ ------------- --------- ------- ------ --------------------- ----------------
ATTO_10.0.15.17          BRIDGE_B_1    true      ok      Atto   FibreBridge 6500N     2000001086627bc0
ATTO_10.0.15.18          BRIDGE_B_2    true      ok      Atto   FibreBridge 6500N     2000001086630f0e
ATTO_10.0.15.19          BRIDGE_B_3    true      ok      Atto   FibreBridge 6500N     2000001086630edc
ATTO_10.0.15.20          BRIDGE_B_4    true      ok      Atto   FibreBridge 6500N     2000001086630ed2
ATTO_10.0.15.6           BRIDGE_A_1    true      ok      Atto   FibreBridge 6500N     2000001086630eb4
ATTO_10.0.15.7           BRIDGE_A_2    true      ok      Atto   FibreBridge 6500N     2000001086630efa
ATTO_10.0.15.8           BRIDGE_A_3    true      ok      Atto   FibreBridge 6500N     2000001086630f18
ATTO_10.0.15.9           BRIDGE_A_4    true      ok      Atto   FibreBridge 6500N     2000001086630ef0
ATTO_FibreBridge6500N_10 -             false     -       Atto   FibreBridge6500N      200000108663e514
ATTO_FibreBridge6500N_11 -             false     -       Atto   FibreBridge6500N      200000108663e3f2
ATTO_FibreBridge6500N_12 -             false     -       Atto   FibreBridge6500N      200000108663e488
ATTO_FibreBridge6500N_13 -             false     -       Atto   FibreBridge6500N      20000010866114ec
ATTO_FibreBridge6500N_14 -             false     -       Atto   FibreBridge6500N      2000001086627bc0
ATTO_FibreBridge6500N_7  -             false     -       Atto   FibreBridge6500N      2000001086630e96
ATTO_FibreBridge6500N_9  -             false     -       Atto   FibreBridge6500N      200000108663e4c4
15 entries were displayed.
14. Check the Fibre Channel switches for faults
cluster1::> storage switch show
                      Symbolic                                      Is        Monitor
Switch                Name     Vendor  Model       Switch WWN       Monitored Status
--------------------- -------- ------- ----------- ---------------- --------- -------
Brocade_10.0.15.10    SW_A_1   Brocade Brocade6505 100050eb1a88327f true      ok
Brocade_10.0.15.11    SW_A_2   Brocade Brocade6505 100050eb1a881582 true      ok
Brocade_10.0.15.21    SW_B_3   Brocade Brocade6505 100050eb1a882f69 true      ok
Brocade_10.0.15.22    SW_B_4   Brocade Brocade6505 100050eb1a881522 true      ok
4 entries were displayed.
15. View failover status
cluster1::> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
cluster1-01    cluster1-02    true     Connected to cluster1-02
cluster1-02    cluster1-01    true     Connected to cluster1-01
2 entries were displayed.
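The column that matters most here is Takeover Possible. A minimal sketch that lists nodes for which takeover is reported as `false`, together with the state description from the same line, might look as follows. It assumes the four-column layout above and ignores wrapped continuation lines of long descriptions.

```python
def takeover_blocked(output):
    """Return (node, description) for rows where Takeover Possible is 'false'."""
    blocked = []
    for line in output.splitlines():
        fields = line.split(None, 3)  # node, partner, possible, description
        if len(fields) >= 3 and fields[2] in ("true", "false"):
            if fields[2] == "false":
                description = fields[3].strip() if len(fields) > 3 else ""
                blocked.append((fields[0], description))
    return blocked


SAMPLE = """\
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
cluster1-01    cluster1-02    true     Connected to cluster1-02
cluster1-02    cluster1-01    true     Connected to cluster1-01
2 entries were displayed.
"""

print(takeover_blocked(SAMPLE))
```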
16. View critical and error event logs
cluster1::> event log show -severity critical
There are no entries matching your query.

cluster1::> event log show -severity error
Time                Node             Severity      Event
------------------- ---------------- ------------- ---------------------------
3/6/2018 02:28:30   cluster1-02      ERROR         asup.post.drop: AutoSupport message (HA Group Notification from cluster1-02 (MANAGEMENT_LOG) INFO) for host (0) was not posted to NetApp. The system will drop the message.
3/6/2018 01:28:18   cluster1-02      ERROR         asup.post.drop: AutoSupport message (HA Group Notification from cluster1-02 (PERFORMANCE DATA) INFO) for host (0) was not posted to NetApp. The system will drop the message.
3/6/2018 00:00:07   cluster1-02      ERROR         mgmtgwd.certificate.expired: A digital certificate with Fully Qualified Domain Name (FQDN) cluster1, Serial Number 5589765F, Certificate Authority 'cluster1' and type server for Vserver cluster1 has expired.
3/6/2018 00:00:07   cluster1-02      ERROR         mgmtgwd.certificate.expired: A digital certificate with Fully Qualified Domain Name (FQDN) UC_SVM2, Serial Number 55A03966, Certificate Authority 'SVM2' and type server for Vserver SVM2 has expired.
3/6/2018 00:00:07   cluster1-02      ERROR         mgmtgwd.certificate.expired: A digital certificate with Fully Qualified Domain Name (FQDN) UC_SVM, Serial Number 559FFD76, Certificate Authority 'SVM' and type server for Vserver SVM has expired.
3/6/2018 00:00:07   cluster1-02      ERROR         mgmtgwd.certificate.expired: A digital certificate with Fully Qualified Domain Name (FQDN) UCS_SVM_DR, Serial Number 545845C16E278, Certificate Authority 'SVM_DR' and type server for Vserver SVM_DR-mc has expired.
3/6/2018 00:00:07   cluster1-02      ERROR         mgmtgwd.certificate.expired: A digital certificate with Fully Qualified Domain Name (FQDN) UCS_SVM2_DR, Serial Number 545845A7B01FA, Certificate Authority 'SVM2_DR' and type server for Vserver SVM2_DR-mc has expired.
7 entries were displayed.
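Error logs tend to repeat the same few event names, as the output above shows. Grouping by event name makes the report shorter. The sketch below counts events per name; it assumes each event starts on a line of the form date, time, node, severity, then an event name followed by a colon. The sample lines are copied from the output above.

```python
import re
from collections import Counter

# date time, node, severity, then the dotted event name before the first colon
EVENT = re.compile(r"^\d+/\d+/\d+ \d+:\d+:\d+\s+(\S+)\s+(\S+)\s+([\w.]+):")


def count_events(output):
    """Count log entries per event name."""
    counts = Counter()
    for line in output.splitlines():
        m = EVENT.match(line)
        if m:
            counts[m.group(3)] += 1
    return counts


SAMPLE = """\
3/6/2018 02:28:30 cluster1-02 ERROR asup.post.drop: AutoSupport message (HA Group Notification from cluster1-02 (MANAGEMENT_LOG) INFO) for host (0) was not posted to NetApp. The system will drop the message.
3/6/2018 01:28:18 cluster1-02 ERROR asup.post.drop: AutoSupport message (HA Group Notification from cluster1-02 (PERFORMANCE DATA) INFO) for host (0) was not posted to NetApp. The system will drop the message.
3/6/2018 00:00:07 cluster1-02 ERROR mgmtgwd.certificate.expired: A digital certificate with Fully Qualified Domain Name (FQDN) cluster1, Serial Number 5589765F, Certificate Authority 'cluster1' and type server for Vserver cluster1 has expired.
"""

for name, n in count_events(SAMPLE).most_common():
    print(n, name)
```

Applied to the full output above, this would reduce seven entries to two event names: dropped AutoSupport messages and expired certificates.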
17. View the status of the volumes in an aggregate
cluster1::> vol show -aggregate aggr_data_A1
18. View LUN information and LUN details
cluster1::> lun show
cluster1::> lun show -v
19. View initiator group (mapping) information and details
cluster1::> igroup show
cluster1::> igroup show -v
20. View LUN mappings
cluster1::> lun show -m
21. Enter the shell of a node
cluster1::> run -node cluster1-01
Type 'exit' or 'Ctrl-D' to return to the CLI
cluster1-01>
22. View spare disks from the node shell
cluster1-01> vol status -s

Local spares

Pool1 spare disks

RAID Disk Device           HA SHELF BAY CHAN Pool Type   RPM Used (MB/blks)     Phys (MB/blks)
--------- ------           ------------- ---- ---- ---- ----- --------------     --------------
Spare disks for block checksum
spare     SW_B_3:6.126L41  3a    21  14 FC:A   1  SAS  10000 1142352/2339537408 1144641/2344225968 (not zeroed)
spare     SW_B_3:7.126L75  3a    42  22 FC:A   1  SAS  10000 1142352/2339537408 1144641/2344225968
spare     SW_B_3:7.126L101 3b    43  22 FC:A   1  SAS  10000 1142352/2339537408 1144641/2344225968
spare     SW_B_4:7.126L76  4b    42  23 FC:B   1  SAS  10000 1142352/2339537408 1144641/2344225968
spare     SW_B_4:7.126L29  4b    43   2 FC:B   1  SAS  10000 1142352/2339537408 1144641/2344225968
spare     SW_B_4:7.126L50  4b    43  23 FC:B   1  SAS  10000 1142352/2339537408 1144641/2344225968
spare     SW_B_3:6.126L22  3a    20  21 FC:A   1  SSD    N/A 381304/780910592   381554/781422768

Pool0 spare disks

RAID Disk Device           HA SHELF BAY CHAN Pool Type   RPM Used (MB/blks)     Phys (MB/blks)
--------- ------           ------------- ---- ---- ---- ----- --------------     --------------
Spare disks for block checksum
spare     SW_A_1:7.126L12  3a    30  11 FC:A   0  SAS  10000 1142352/2339537408 1144641/2344225968
spare     SW_A_1:7.126L14  3a    30  13 FC:A   0  SAS  10000 1142352/2339537408 1144641/2344225968
spare     SW_A_1:7.126L31  3a    31   4 FC:A   0  SAS  10000 1142352/2339537408 1144641/2344225968
spare     SW_A_1:7.126L76  3a    32  23 FC:A   0  SAS  10000 1142352/2339537408 1144641/2344225968
spare     SW_A_1:7.126L79  3a    33   0 FC:A   0  SAS  10000 1142352/2339537408 1144641/2344225968
spare     SW_A_1:7.126L80  3a    33   1 FC:A   0  SAS  10000 1142352/2339537408 1144641/2344225968
spare     SW_A_2:7.126L73  4b    32  20 FC:B   0  SAS  10000 1142352/2339537408 1144641/2344225968
spare     SW_A_2:7.126L37  4b    33  10 FC:B   0  SAS  10000 1142352/2339537408 1144641/2344225968
spare     SW_A_2:6.126L74  4b    11  21 FC:B   0  SSD    N/A 381304/780910592   381554/781422768
23. View failed disks from the node shell
cluster1-01> vol status -f

Broken disks (empty)
24. Show disks without ownership
cluster1-01> disk show -n
disk show : No unassigned disks
25. Assign disk ownership (commonly used after a disk replacement)
cluster1-01> disk assign all
26. View location information for all disks
cluster1-01> storage show disk -p
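The cluster-shell checks in this section can also be collected in one pass over SSH instead of being typed one by one. The following is only a sketch under several assumptions: the management address and user name passed to `run_inspection` are placeholders, key-based SSH login to the cluster management interface is already set up, and the command list is a subset chosen for illustration.

```python
import subprocess

# Read-only cluster-shell commands taken from the sections above.
INSPECTION_COMMANDS = [
    "system health status show",
    "cluster show",
    "system health subsystem show",
    "metrocluster show",
    "metrocluster node show",
    "storage disk show -broken",
    "storage failover show",
    "event log show -severity error",
]


def ssh_argv(user, host, command):
    """Build the argument list for running one command over ssh."""
    return ["ssh", f"{user}@{host}", command]


def run_inspection(user, host, runner=subprocess.run):
    """Run every inspection command and return {command: captured stdout}."""
    results = {}
    for command in INSPECTION_COMMANDS:
        done = runner(ssh_argv(user, host, command),
                      capture_output=True, text=True, timeout=60)
        results[command] = done.stdout
    return results


# Example call (placeholder user and address, not executed here):
# report = run_inspection("admin", "192.0.2.10")
```

The `runner` parameter exists so that the function can be exercised with a stand-in during testing; in normal use it defaults to `subprocess.run`. The captured text can then be passed to whatever parsing or reporting is used for each command.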