HDFS块检查命令Fsck机理的分析
前言
在HDFS中,所有的文件都是以block块的概念而存在的,那么在这样海量的文件数据的情况下,难免会发生一些文件块损坏的现象,那么有什么好的办法去发现呢.答案是使用HDFS的fsck相关的命令.这个命令独立于dfsadmin的命令,可能会让部分人不知道HDFS中还存在这样的命令,本文就来深度挖掘一下这个命令的特殊的用处和内在机理的实现.
Fsck命令
其实说到fsck命令本身,熟悉Linux操作系统的人,可能或多或少听到过或使用过这个命令.Fsck命令的全称为file system check,更加类似的是一种修复命令.当然,本文不会讲大量的关于操作系统的fsck怎么用,而是HDFS下的fsck的使用,在bin/hdfs fsck下还是有很多可选参数的.
Fsck参数使用
本人在测试集群中输入hdfs fsck命令,获取了帮助信息,在此信息中展示了最全的参数使用说明:
$ hdfs fsck
Usage: hdfs fsck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]]
<path> start checking from this path
-move move corrupted files to /lost+found
-delete delete corrupted files
-files print out files being checked
-openforwrite print out files opened for write
-includeSnapshots include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
-list-corruptfileblocks print out list of missing blocks and files they belong to
-blocks print out block report
-locations print out locations for every block
-racks print out network topology for data-node locations
-storagepolicies print out storage policy summary for the blocks
-blockId print out which file this blockId belongs to, locations (nodes, racks) of this block, and other diagnostics info (under replicated, corrupted or not, etc)
简单的总结一下,首先是必填参数和命令名:
bin/hdfs fsck <path>
然后是一堆的可选参数:
- -move: 移动损坏的文件到/lost+found目录下
- -delete: 删除损坏的文件
- -files: 输出正在被检测的文件
- -openforwrite: 输出检测中的正在被写的文件
- -includeSnapshots: 检测的文件包括系统snapShot快照目录下的
- -list-corruptfileblocks: 输出损坏的块及其所属的文件
- -blocks: 输出block的详细报告
- -locations: 输出block的位置信息
- -racks: 输出block的网络拓扑结构信息
- -storagepolicies: 输出block的存储策略信息
- -blockId: 输出指定blockId所属块的状况,位置等信息
具体参数功能对应到相应的程序会在下文的分析中进行详细的阐述.
Fsck过程调用
Fsck过程的调用指的是从终端机器输入到最终fsck在HDFS内部被执行的整个过程.中间穿过的类的其实不多,本人做了一张简图:
上图的调用形式,可以说是三层调用的结构.DFSck就是暴露在最外层的类.我们再来规整规整中间的过程.
- 输入fsck 直接调用到的就是此类.DFSck内部会发送http请求的方式,根据参数构造URL请求地址,发送到下一个处理对象中.
- 下一个处理对象就是FsckServlet.FsckServlet在这里相当于一个过渡者,马上调用真正操作类NamenodeFsck.
- NamenodeFsck在这里会取出请求参数,然后在HDFS内部做真正的fsck检测操作.
Fsck原理分析
Fsck原理分析将会展示更加细致的fsck过程调用.按照上小节的提到的3层调用,同样我们也分为3个层次的渐近性的分析.
DFSck请求构造
你可以把此类想象成DFSAdmin.首先进入命令输入处理入口方法:
public int run(final String[] args) throws IOException {
if (args.length == 0) {
printUsage(System.err);
return -1;
}
try {
return UserGroupInformation.getCurrentUser().doAs(
new PrivilegedExceptionAction<Integer>() {
@Override
public Integer run() throws Exception {
return doWork(args);
}
});
} catch (InterruptedException e) {
throw new IOException(e);
}
}
在doWork方法中,马上就看到了对于参数的判别分类,同时开始构造不同的参数请求.
private int doWork(final String[] args) throws IOException {
final StringBuilder url = new StringBuilder();
url.append("/fsck?ugi=").append(ugi.getShortUserName());
String dir = null;
boolean doListCorruptFileBlocks = false;
for (int idx = 0; idx < args.length; idx++) {
if (args[idx].equals("-move")) { url.append("&move=1"); }
else if (args[idx].equals("-delete")) { url.append("&delete=1"); }
else if (args[idx].equals("-files")) { url.append("&files=1"); }
else if (args[idx].equals("-openforwrite")) { url.append("&openforwrite=1"); }
else if (args[idx].equals("-blocks")) { url.append("&blocks=1"); }
else if (args[idx].equals("-locations")) { url.append("&locations=1"); }
else if (args[idx].equals("-racks")) { url.append("&racks=1"); }
else if (args[idx].equals("-storagepolicies")) { url.append("&storagepolicies=1"); }
...
不同类型的参数后面接的参数值也不一定相同,比如-blockId后面则会跟连续的blockId.
...
} else if (args[idx].equals("-blockId")) {
StringBuilder sb = new StringBuilder();
idx++;
while(idx < args.length && !args[idx].startsWith("-")){
sb.append(args[idx]);
sb.append(" ");
idx++;
}
url.append("&blockId=").append(URLEncoder.encode(sb.toString(), "UTF-8"));
...
请求url构造好之后,就会发起请求
URL path = new URL(url.toString());
URLConnection connection;
try {
connection = connectionFactory.openConnection(path, isSpnegoEnabled);
} catch (AuthenticationException e) {
throw new IOException(e);
}
然后获取响应回复,直接输出到终端上.
InputStream stream = connection.getInputStream();
BufferedReader input = new BufferedReader(new InputStreamReader(stream, "UTF-8"));
String line = null;
String lastLine = null;
int errCode = -1;
try {
while ((line = input.readLine()) != null) {
out.println(line);
lastLine = line;
}
} finally {
input.close();
}
OK,DFSck最外层面的调用过就走通了.
FsckServlet请求处理
上个步骤中url请求会转到FsckServlet中处理,类似代理人的角色,然后马上调用NamenodeFsck进行处理
/** Handle fsck request */
@Override
public void doGet(HttpServletRequest request, HttpServletResponse response
) throws IOException {
@SuppressWarnings("unchecked")
final Map<String,String[]> pmap = request.getParameterMap();
...
final UserGroupInformation ugi = getUGI(request, conf);
try {
ugi.doAs(new PrivilegedExceptionAction<Object>() {
@Override
public Object run() throws Exception {
NameNode nn = NameNodeHttpServer.getNameNodeFromContext(context);
final FSNamesystem namesystem = nn.getNamesystem();
final BlockManager bm = namesystem.getBlockManager();
final int totalDatanodes =
namesystem.getNumberOfDatanodes(DatanodeReportType.LIVE);
new NamenodeFsck(conf, nn,
bm.getDatanodeManager().getNetworkTopology(), pmap, out,
totalDatanodes, remoteAddress).fsck();
return null;
}
});
} catch (InterruptedException e) {
response.sendError(400, e.getMessage());
}
}
NamenodeFsck的fsck处理
上节中最后一个步骤最终调用的就是NamenodeFsck的fsck方法.在进入这个方法之前,先看一下,这个类的一些关键变量.
private String lostFound = null;
private boolean lfInited = false;
private boolean lfInitedOk = false;
private boolean showFiles = false;
private boolean showOpenFiles = false;
private boolean showBlocks = false;
private boolean showLocations = false;
private boolean showRacks = false;
private boolean showStoragePolcies = false;
private boolean showCorruptFileBlocks = false;
这些布尔类型的变量对应的就是控制fsck帮助信息所展示的各个参数.个人感觉fsck方法内部的处理顺序看起来有点乱,为了便于大家的理解,这里对指定参数进行指定分析的方式,就不转行对照的描述了.
-list-corruptfileblocks
第一个参数方法-list-corruptfileblocks,展示丢失/损坏的块.
if (showCorruptFileBlocks) {
listCorruptFileBlocks();
return;
}
然后调用到同名方法listCorruptFileBlocks.
private void listCorruptFileBlocks() throws IOException {
Collection<FSNamesystem.CorruptFileBlockInfo> corruptFiles = namenode.
getNamesystem().listCorruptFileBlocks(path, currentCookie);
int numCorruptFiles = corruptFiles.size();
...
out.println("Cookie:\t" + currentCookie[0]);
for (FSNamesystem.CorruptFileBlockInfo c : corruptFiles) {
out.println(c.toString());
}
out.println("\n\nThe filesystem under path '" + path + "' has " + filler
+ " CORRUPT files");
out.println();
}
此方法最终会调用到FSNamesystem的listCorruptFileBlocks方法,注意这里还传入了一个特别的参数currentCookie.这个参数的作用可是非常的巧妙的.进入FSNamesystem的方法,首先初始化对象损坏文件块对象:
ArrayList<CorruptFileBlockInfo> corruptFiles = new ArrayList<CorruptFileBlockInfo>();
方法返回的对象也即是此对象.
然后进入关键的损坏文件的判断逻辑
// Do a quick check if there are any corrupt files without taking the lock
if (blockManager.getMissingBlocksCount() == 0) {
if (cookieTab[0] == null) {
cookieTab[0] = String.valueOf(getIntCookie(cookieTab[0]));
}
if (LOG.isDebugEnabled()) {
LOG.debug("there are no corrupt file blocks.");
}
return corruptFiles;
}
blockManager的getMissingBlocksCount方法取的就是损坏块队列的大小.
public long getMissingBlocksCount() {
// not locking
return this.neededReplications.getCorruptBlockSize();
}
如果此方法的Count返回值有值,就是大于0,则方法执行继续
// 获取损坏块的block迭代器
final Iterator<Block> blkIterator = blockManager.getCorruptReplicaBlockIterator();
// 取出cookie值作为标记位,跳过标记下标之前的文件,代表已经浏览过
int skip = getIntCookie(cookieTab[0]);
for (int i = 0; i < skip && blkIterator.hasNext(); i++) {
blkIterator.next();
}
while (blkIterator.hasNext()) {
Block blk = blkIterator.next();
final INode inode = (INode)blockManager.getBlockCollection(blk);
//更新skip跳过值
skip++;
if (inode != null && blockManager.countNodes(blk).liveReplicas() == 0) {
String src = FSDirectory.getFullPathName(inode);
if (src.startsWith(path)){
corruptFiles.add(new CorruptFileBlockInfo(src, blk));
count++;
if (count >= DEFAULT_MAX_CORRUPT_FILEBLOCKS_RETURNED)
break;
}
}
}
//更新cookie标记值
cookieTab[0] = String.valueOf(skip);
cookie的作用就是如上注释所说,获取到此返回损坏文件列表后,会在上一方法中将结果输出
for (FSNamesystem.CorruptFileBlockInfo c : corruptFiles)
{
out.println(c.toString());
}
fsck -path默认处理方法
fsck的默认处理方法指的就是fsck+path的方法,为什么紧接着讲这个方法呢,因为fsck的path方法处理也包括了扫描损坏块的方法,但是在逻辑上与-list-corruptfiles竟然还不一样,这一点本人在阅读的时候,也是感到比较吃惊的.首先大家传入的path会被传入到内部方法check中处理
Result res = new Result(conf);
check(path, file, res);
out.println(res);
out.println(" Number of data-nodes:\t\t" + totalDatanodes);
out.println(" Number of racks:\t\t" + networktopology.getNumOfRacks());
然后会进行目录,文件的判断,如果是目录,则进行递归调用
if (file.isDir()) {
// 如果快照目录包含此路径,则递归快照目录下的path
if (snapshottableDirs != null && snapshottableDirs.contains(path)) {
String snapshotPath = (path.endsWith(Path.SEPARATOR) ? path : path
+ Path.SEPARATOR)
+ HdfsConstants.DOT_SNAPSHOT_DIR;
HdfsFileStatus snapshotFileInfo = namenode.getRpcServer().getFileInfo(
snapshotPath);
check(snapshotPath, snapshotFileInfo, res);
}
...
do {
assert lastReturnedName != null;
thisListing = namenode.getRpcServer().getListing(
path, lastReturnedName, false);
if (thisListing == null) {
return;
}
HdfsFileStatus[] files = thisListing.getPartialListing();
//递归变量此path的子文件,如果此path是目录的话
for (int i = 0; i < files.length; i++) {
check(path, files[i], res);
}
lastReturnedName = thisListing.getLastName();
} while (thisListing.hasMore());
return;
}
在接下来的分析检测文件时,会进行相应指标的统计值更新
isOpen = blocks.isUnderConstruction();
if (isOpen && !showOpenFiles) {
// We collect these stats about open files to report with default options
res.totalOpenFilesSize += fileLen;
res.totalOpenFilesBlocks += blocks.locatedBlockCount();
res.totalOpenFiles++;
return;
}
res.totalFiles++;
res.totalSize += fileLen;
res.totalBlocks += blocks.locatedBlockCount();
下面是关键的判断path下所属的block块中的损坏块的判断逻辑:
...
for (LocatedBlock lBlk : blocks.getLocatedBlocks()) {
ExtendedBlock block = lBlk.getBlock();
boolean isCorrupt = lBlk.isCorrupt();
String blkName = block.toString();
...
这里直接利用了LocatedBlock内部的isCorrupt的方法,然后进行corrupt计数累加
// Check if block is Corrupt
if (isCorrupt) {
corrupt++;
res.corruptBlocks++;
out.print("\n" + path + ": CORRUPT blockpool " + block.getBlockPoolId() +
" block " + block.getBlockName()+"\n");
}
而且在这里,missing块的判断逻辑是独立于corrupt块的.
// 重新进行块副本数的统计
NumberReplicas numberReplicas =
namenode.getNamesystem().getBlockManager().countNodes(block.getLocalBlock());
// 获取存在的副本数
int liveReplicas = numberReplicas.liveReplicas();
// 如果当前副本数确实为0,则表明已经是missing块
if (liveReplicas == 0) {
report.append(" MISSING!");
res.addMissing(block.toString(), block.getNumBytes());
missing++;
missize += block.getNumBytes();
} else {
重新回顾以上check方法中的这2类块判断逻辑,第二个missing块的判断逻辑,我个人认为是没有问题的,但是第一个corrupt的判断我个人感觉可能有点问题,尽管说LocatedBlock提供了内部方法isCorrupt,但是我在查询isCorrupt的调用处时发现绝大多数情况下都是false参数默认传入的,而且在数据实时性和有效性上而言,这个方法是没有-list-corruptfiles参数来的快与准的(个人观点,可能理解不同).因为-list-corruptfiles直接是从FSNamesystem类中取的,一方面代表的已经是最新的损坏数据情况了.
fsck -delete/-move
这2个命令作用是找到损坏块之后,打算要做什么事情,就是下面2行代码所控制的:
...
} else {
if (doMove) copyBlocksToLostFound(parent, file, blocks);
if (doDelete) deleteCorruptedFile(path);
}
...
LostFound指的是/lost+found目录,下,就是说-move参数会将损坏块文件,移至此目录下,而-delet则会调用直接删除的方法
private void deleteCorruptedFile(String path) {
try {
namenode.getRpcServer().delete(path, true);
LOG.info("Fsck: deleted corrupt file " + path);
} catch (Exception e) {
LOG.error("Fsck: error deleting corrupted file " + path, e);
internalError = true;
}
}
其实这2个命令的还是比较有用的.如果集群中存在大量损坏块数据的情况时,如果不及时进行清理,会出现大量客户端读写操作的失败,因为元数据虽然存在,但是真实数据已经损坏,读写操作必然会抛出异常.
fsck辅助显示参数
以上几个是fsck的主要参数,下面是一些辅助的次要一些的参数.
-locations/-racks
if (showLocations || showRacks) { StringBuilder sb = new StringBuilder("["); for (int j = 0; j < locs.length; j++) { if (j > 0) { sb.append(", "); } if (showRacks) sb.append(NodeBase.getPath(locs[j])); else sb.append(locs[j]); } sb.append(']'); report.append(" " + sb.toString()); }
-storagepolicies
if (this.showStoragePolcies) { storageTypeSummary = new StoragePolicySummary( namenode.getNamesystem().getBlockManager().getStoragePolicies()); } ... if (this.showStoragePolcies) { out.print(storageTypeSummary.toString()); }
-includeSnapshots
此参数会获取到namenode快照中的目录信息if (snapshottableDirs != null) { SnapshottableDirectoryStatus[] snapshotDirs = namenode.getRpcServer() .getSnapshottableDirListing(); if (snapshotDirs != null) { for (SnapshottableDirectoryStatus dir : snapshotDirs) { snapshottableDirs.add(dir.getFullPath().toString()); } } }
在这些参数执行期间,会伴随着输出结果的直接输出,所以你会看到路线的信息被展示范,输出的最末端,会给出总结报告,如下所示
Total size: 88.13 KB
Total dirs: 14
Total files: 20
Total symlinks: 0
Total blocks (validated): 20 (avg. block size 4512 B)
********************************
UNDER MIN REPL'D BLOCKS: 20 (100.0 %)
dfs.namenode.replication.min: 1
CORRUPT FILES: 20
MISSING BLOCKS: 20
MISSING SIZE: 88.13 KB
CORRUPT BLOCKS: 20
********************************
Minimally replicated blocks: 0 (0.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 0
Average block replication: 0.0
Corrupt blocks: 20
Missing replicas: 0
Number of data-nodes: 0
Number of racks: 0
FSCK ended at Tue Mar 29 11:10:33 CST 2016 in 10 milliseconds
The filesystem under path '/' is CORRUPT
OK,NamenodeFsck的处理过程和参数控制就是如上所述,方法集中在fsck和check2个方法内,其间根据所选参数进行选择性中间结果输出,下面是一张简图
希望本文能给大家对HDFS的fsck命令相关的理解与使用带来帮助.