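The disks in our Ceph cluster servers are all attached in passthrough mode. When a disk develops bad sectors and has to be replaced, the device names do not necessarily follow the physical slot order, so it is not obvious which physical disk in the server is the failing one. Pulling disks by trial and error risks removing a healthy one, which costs more than it saves. This is a short note on how to locate the failing disk.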
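1. Approach to locating the disk
This applies to a failing disk that is degraded but has not yet gone offline.
1.1 Find the failing disk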
[root@a01r1n06 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 3.7T 0 disk
├─sda1 8:1 0 2M 0 part
├─sda2 8:2 0 1G 0 part /boot
├─sda3 8:3 0 62.5G 0 part [SWAP]
└─sda4 8:4 0 3.6T 0 part /
sdb 8:16 0 14.6T 0 disk /public/home/test
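lsblk lists the block devices but does not show which one is unhealthy. One possible check for bad sectors is to look at the SMART attribute table. The following is a minimal sketch, assuming an ATA disk whose `smartctl -A` output uses the usual ten-column attribute table and reports attributes 5, 197 and 198; the function name `bad_sector_attrs` is made up for this example.

```shell
# Read `smartctl -A <device>` output on stdin and print the sector-related
# attributes (IDs 5, 197, 198) whose raw value is greater than zero.
# Assumption: ten-column ATA attribute table with RAW_VALUE in the last column.
bad_sector_attrs() {
    awk '$1 ~ /^(5|197|198)$/ && ($10 + 0) > 0 { print $2 "=" $10 }'
}

# Example usage (needs root and smartmontools):
#   smartctl -A /dev/sdb | bad_sector_attrs
```

No output means none of the three counters is above zero, which on its own does not prove the disk is healthy.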
1.2 Use smartctl to find the disk's serial number (SN)
[root@a01r1n06 ~]# smartctl --all /dev/sdad
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-862.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
Smartctl open device: /dev/sdad failed: No such device
[root@a01r1n06 ~]# smartctl --all /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-862.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: WUH721816ALE6L4
Serial Number: 2CJSSATJ ## disk serial number
LU WWN Device Id: 5 000cca 2a1e6fb30
Add. Product Id: 202116
Firmware Version: PCGAW232
User Capacity: 16,000,900,661,248 bytes [16.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: Unknown(0x0ffc) (unknown minor revision code: 0x009c)
SATA Version is: SATA >3.2 (0x1ff), 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Tue Jul 6 14:50:20 2021 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 101) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
.....
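Only the serial number is needed from this output for the next step. A small sketch for pulling it out in a script follows; the helper name `smart_serial` is hypothetical, and it assumes the "Serial Number:" (ATA) or "Serial number:" (SCSI) line that smartctl prints in its information section.

```shell
# Read `smartctl -i` (or `smartctl --all`) output on stdin and print only
# the serial number from the first "Serial Number:" line.
smart_serial() {
    awk -F': *' '/^Serial [Nn]umber:/ { print $2; exit }'
}

# Example usage (needs root and smartmontools):
#   sn=$(smartctl -i /dev/sdb | smart_serial)
#   echo "$sn"
```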
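1.3 Use the RAID controller tool to map the serial number to a slot in the server
Installing StorCLI
For installation, see -> https://www.xxshell.com/2800.html
Finding the disk's slot with StorCLI
Filter the output on the disk SN to find its slot (WKD26RCS below is an example serial; use the one found in step 1.2):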
./storcli64 /call /eall /sall show all |grep -5 -i WKD26RCS
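The grep above prints context lines that still have to be read by eye. A sketch that prints only the drive path follows. It assumes that each drive section in the `show all` output begins with a line starting with "Drive /cX/eY/sZ" and that the serial appears on a line of the form "SN = <serial>"; check this against the output of your StorCLI version. The helper name `slot_for_serial` is made up for this example.

```shell
# Read `storcli64 /call /eall /sall show all` output on stdin and print the
# drive path (/cX/eY/sZ) of the drive whose SN field equals $1.
# Assumption: "Drive /cX/eY/sZ ..." header lines and "SN = <serial>" lines.
slot_for_serial() {
    awk -v sn="$1" '
        # Remember the most recent drive path seen in a section header.
        $1 == "Drive" && $2 ~ /^\/c[0-9]+\// { path = $2 }
        # When the serial matches, report the remembered path and stop.
        $1 == "SN" && $2 == "=" && $3 == sn { print path; exit }
    '
}

# Example usage:
#   ./storcli64 /call /eall /sall show all | slot_for_serial WKD26RCS
```

The printed path can be passed directly to the locate command in the next step.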
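1.4 Turn on the disk's locate LED and replace the disk
Run ./storcli64 /c0/e0/s23 start locate to turn on the locate LED (controller 0, enclosure 0, slot 23 in this example).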
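Once the disk has been swapped, the LED can be switched off again with the matching `stop locate` command. Below is a small sketch of a wrapper around both commands; the function name `locate_led` and the `DRY_RUN` variable are made up for this example, and it assumes `storcli64` is in the current directory as in the commands above.

```shell
# Hypothetical wrapper around the StorCLI locate commands.
#   $1 = start | stop
#   $2 = drive path such as /c0/e0/s23
# With DRY_RUN=1 the command is only printed, not executed.
locate_led() {
    case "$1" in
        start|stop) ;;
        *) echo "usage: locate_led start|stop /cX/eY/sZ" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "./storcli64 $2 $1 locate"
    else
        ./storcli64 "$2" "$1" locate
    fi
}

# Example usage:
#   locate_led start /c0/e0/s23   # before walking to the rack
#   locate_led stop  /c0/e0/s23   # after the disk has been replaced
```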
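2. Locating the disk through IPMI
If the server does not have StorCLI installed but does have an IPMI management interface, the disk can be located through IPMI instead.
Find the entry with the matching disk serial number and click "Locate Physical Device" (定位物理设备).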