HDFS File Write: Source Code Analysis
I was asked about the HDFS file write flow in a recent first-round interview, and afterwards I went through the flow described in Hadoop: The Definitive Guide alongside the source code. The second-round interview asked the same question and I still missed some details, so this note is dedicated to that part. Having sorted it out, I can see why it makes such a good interview question: it is full of details and naturally leads into other topics.
The Write Flow
The overall flow is shown in the figure below (from Hadoop: The Definitive Guide).
- The client creates the file by calling create() on DistributedFileSystem, which issues an RPC to the namenode (namenode.create); after checking permissions and other preconditions, the namenode creates the file.
- create() returns an FSDataOutputStream, a wrapper around DFSOutputStream, which handles the namenode and datanode communication required for writing data.
- As data is written, DFSOutputStream splits it into packets and appends them to the data queue, which is consumed by the DataStreamer (the DataStreamer also asks the namenode to allocate new blocks).
- The datanodes chosen for a block form a pipeline. The DataStreamer streams the packets to the first datanode, which stores each packet and forwards it to the next datanode in the pipeline.
- DFSOutputStream also maintains an ack queue; a packet is removed from it only after every datanode in the pipeline has acknowledged it (a simplified sketch of this two-queue handoff follows this list).
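Before diving into the source, here is a minimal, self-contained sketch of the two-queue handoff described above. It is deliberately not Hadoop code (the class name and packet representation are made up for illustration); it only shows the shape of the interaction: the writing thread enqueues packets onto a data queue, while a streamer thread dequeues them, "sends" them, and parks them on an ack queue until they are acknowledged.

import java.util.LinkedList;

public class QueueHandoffSketch {                      // illustrative only, not Hadoop code
    static final LinkedList<byte[]> dataQueue = new LinkedList<>();
    static final LinkedList<byte[]> ackQueue  = new LinkedList<>();

    public static void main(String[] args) throws InterruptedException {
        final int packets = 3;
        Thread streamer = new Thread(() -> {           // plays the role of DataStreamer
            try {
                for (int sent = 0; sent < packets; sent++) {
                    byte[] packet;
                    synchronized (dataQueue) {
                        while (dataQueue.isEmpty()) {
                            dataQueue.wait();          // wait for the writer to enqueue a packet
                        }
                        packet = dataQueue.removeFirst();
                        ackQueue.addLast(packet);      // keep it until the pipeline acks it
                    }
                    // ... here the real DataStreamer writes the packet to the first datanode ...
                    synchronized (dataQueue) {
                        ackQueue.remove(packet);       // "ack received": safe to drop the packet
                        dataQueue.notifyAll();
                    }
                }
            } catch (InterruptedException ignored) { }
        });
        streamer.start();

        for (int i = 0; i < packets; i++) {            // plays the role of writeChunk()/writeChunkImpl()
            synchronized (dataQueue) {
                dataQueue.addLast(new byte[64 * 1024]);
                dataQueue.notifyAll();
            }
        }
        streamer.join();
    }
}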
Source Code Analysis
The code below is from Hadoop 2.7.7.
FileSystem.get() returns the DistributedFileSystem matching the given URI; calling create() then creates the file on the namenode and returns an output stream.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;   // FileSystem, FSDataOutputStream, Path

Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(URI.create(hdfsDirectory), conf);   // a DistributedFileSystem for an hdfs:// URI
FSDataOutputStream outputStream = hdfs.create(new Path(hdfsDirectory));
outputStream.writeInt(1);
outputStream.close();
In DistributedFileSystem.create(), the DFSClient member dfs is used to call dfs.create(). The call is wrapped in a FileSystemLinkResolver: when doCall() throws an UnresolvedLinkException (the given Path could not be resolved, e.g. it crosses a symlink), the path is re-resolved and next() is invoked with the resulting FileSystem and Path. In other words, if resolving the given path fails, create is retried against the re-resolved path. The returned DFSOutputStream is then wrapped in an FSDataOutputStream for the caller to write to.
final DFSOutputStream out = dfs.create(getPathName(f), permission,
overwrite ? EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE)
: EnumSet.of(CreateFlag.CREATE),
true, replication, blockSize, progress, bufferSize, null,
favoredNodes);
return dfs.createWrappedOutputStream(out, statistics);
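The doCall()/next() pair mentioned above lives inside an anonymous FileSystemLinkResolver. The following is a small self-contained sketch of that retry pattern with hypothetical names (it is not the real org.apache.hadoop.fs.FileSystemLinkResolver; resolveLink() and the exception type are stand-ins):

import java.io.IOException;

// Sketch of the doCall()/next() retry pattern that FileSystemLinkResolver provides.
abstract class LinkResolverSketch<T> {
    static class UnresolvedLinkSketchException extends IOException { }

    abstract T doCall(String path) throws IOException;          // first attempt on the original path
    abstract T next(String resolvedPath) throws IOException;    // retry after re-resolving the path

    final T resolve(String path) throws IOException {
        try {
            return doCall(path);
        } catch (UnresolvedLinkSketchException e) {
            return next(resolveLink(path));                      // hypothetical re-resolution step
        }
    }

    String resolveLink(String path) {
        return path;                                             // placeholder: the real code follows the symlink
    }
}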
dfs.create() calls DFSOutputStream.newStreamForCreate(…).
public DFSOutputStream create(String src,
//...
InetSocketAddress[] favoredNodes) throws IOException {
//...
final DFSOutputStream result = DFSOutputStream.newStreamForCreate(this,
src, masked, flag, createParent, replication, blockSize, progress,
buffersize, dfsClientConf.createChecksum(checksumOpt),
getFavoredNodesStr(favoredNodes));
// ...
return result;
}
In DFSOutputStream.newStreamForCreate(…), an RPC is first made to the namenode to create the file (dfsClient.namenode.create); then a DFSOutputStream is constructed and started. At this point the file exists on the namenode and writing can begin.
static DFSOutputStream newStreamForCreate(DFSClient dfsClient, String src,
FsPermission masked, EnumSet<CreateFlag> flag, boolean createParent,
short replication, long blockSize, Progressable progress, int buffersize,
DataChecksum checksum, String[] favoredNodes) throws IOException {
// ...
stat = dfsClient.namenode.create(src, masked, dfsClient.clientName,
new EnumSetWritable<CreateFlag>(flag), createParent, replication,
blockSize, SUPPORTED_CRYPTO_VERSIONS);
// ...
}
// ...
final DFSOutputStream out = new DFSOutputStream(dfsClient, src, stat,
flag, progress, checksum, favoredNodes);
out.start();
return out;
}
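out.start() is where the background thread comes in: as far as I can tell from the 2.7.x source, it does little more than launch the DataStreamer daemon, roughly:

private synchronized void start() {
    streamer.start();   // the DataStreamer daemon created in the DFSOutputStream constructor
}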
Having a DFSOutputStream does not mean data goes straight to the datanodes. Writes enter the overridden writeChunk(), whose actual logic is in writeChunkImpl(). The main steps are: if currentPacket is null, create a new DFSPacket; write the chunk's checksum and data into it; and once the packet holds its maximum number of chunks (or the block boundary is reached), queue the current DFSPacket onto dataQueue.
@Override
protected synchronized void writeChunk(byte[] b, int offset, int len,
byte[] checksum, int ckoff, int cklen) throws IOException {
TraceScope scope =
dfsClient.getPathTraceScope("DFSOutputStream#writeChunk", src);
try {
writeChunkImpl(b, offset, len, checksum, ckoff, cklen);
} finally {
scope.close();
}
}
private synchronized void writeChunkImpl(byte[] b, int offset, int len,
byte[] checksum, int ckoff, int cklen) throws IOException {
//...
if (currentPacket == null) {
currentPacket = createPacket(packetSize, chunksPerPacket,
bytesCurBlock, currentSeqno++, false);
}
//...
currentPacket.writeChecksum(checksum, ckoff, cklen);
currentPacket.writeData(b, offset, len);
currentPacket.incNumChunks();
bytesCurBlock += len;
// If packet is full, enqueue it for transmission
//
if (currentPacket.getNumChunks() == currentPacket.getMaxChunks() ||
bytesCurBlock == blockSize) {
//...
waitAndQueueCurrentPacket();
adjustChunkBoundary();
// if encountering a block boundary, send an empty packet to
// indicate the end of block and reset bytesCurBlock.
//
if (bytesCurBlock == blockSize) {
currentPacket = createPacket(0, 0, bytesCurBlock, currentSeqno++, true);
currentPacket.setSyncBlock(shouldSyncBlock);
waitAndQueueCurrentPacket();
bytesCurBlock = 0;
lastFlushOffset = 0;
}
}
}
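waitAndQueueCurrentPacket(), called above, is also where client-side back-pressure is applied: the writer blocks while dataQueue plus ackQueue already hold the maximum number of in-flight packets (dfs.client.write.max-packets-in-flight). The version below is a paraphrase of the 2.7.x implementation with tracing stripped out and interrupt handling simplified, so treat it as an approximation rather than the exact code:

private void waitAndQueueCurrentPacket() throws IOException {
    synchronized (dataQueue) {
        // block while too many packets are already queued or awaiting acks
        while (!isClosed() && dataQueue.size() + ackQueue.size()
                > dfsClient.getConf().writeMaxPackets) {
            try {
                dataQueue.wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;   // simplified: give up waiting and queue the packet anyway
            }
        }
        checkClosed();
        queueCurrentPacket();   // appends currentPacket to dataQueue, resets it, notifies the streamer
    }
}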
So DFSOutputStream does not write to the datanodes directly in writeChunk(); it builds DFSPackets and pushes them onto dataQueue. The DataStreamer, running in a separate thread, drains dataQueue and writes the packets to the datanodes.
The DataStreamer's write logic lives mainly in its thread body run(). This part is not covered in much detail in Hadoop: The Definitive Guide; the flow in the code is roughly:
- Wait until the previous packet has been sent, then take a new packet from dataQueue (or create a heartbeat packet if the queue is empty);
- If the current stage is PIPELINE_SETUP_CREATE: ask the namenode (via RPC) to allocate a new block, open a socket connection to the first datanode of the pipeline, build the output stream and send the WRITE_BLOCK request; then start a new ResponseProcessor and move the stage to DATA_STREAMING;
- Check whether the current packet is the last packet of the block; if so, wait for all outstanding acks and change the stage to PIPELINE_CLOSE;
- Move the packet from dataQueue to ackQueue and write it to the output stream;
- If the packet was the last one of the block, finish the block: close the ResponseProcessor, close the streams, clear the pipeline and set the stage back to PIPELINE_SETUP_CREATE.
Note that the DFSClient does not use Hadoop RPC to talk to the datanodes when writing data: it connects to the datanode directly over a socket (the address is built with NetUtils.createSocketAddr()) and sends an Op.WRITE_BLOCK request, telling the datanode to start receiving a block.
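The connection setup lives in DataStreamer.createBlockOutputStream(). The excerpt below is a paraphrase of the 2.7.x code (SASL negotiation, timeouts and error handling elided, and the long writeBlock argument list reduced to a comment), written from memory to show that this is the data transfer protocol over a raw socket rather than Hadoop RPC; double-check the details against the actual source:

// open a TCP connection to the first datanode of the pipeline
Socket s = createSocketForPipeline(nodes[0], nodes.length, dfsClient);
OutputStream unbufOut = NetUtils.getOutputStream(s, writeTimeout);
InputStream unbufIn = NetUtils.getInputStream(s);

DataOutputStream out = new DataOutputStream(
    new BufferedOutputStream(unbufOut, HdfsConstants.SMALL_BUFFER_SIZE));
DataInputStream blockReplyStream = new DataInputStream(unbufIn);

// new Sender(out).writeBlock(...) then writes the Op.WRITE_BLOCK request header,
// carrying the block, the remaining pipeline nodes and the checksum type; the
// datanode's DataXceiver reads it and opens its own mirror connection to the
// next node of the pipeline.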
ResponseProcessor handles the acks returned by the datanodes in a separate thread (a new ResponseProcessor, and thus a new thread, is created for each block). When an ack is received successfully, the corresponding packet is removed from ackQueue.
//
// The DataStreamer class is responsible for sending data packets to the
// datanodes in the pipeline. It retrieves a new blockid and block locations
// from the namenode, and starts streaming packets to the pipeline of
// Datanodes. Every packet has a sequence number associated with
// it. When all the packets for a block are sent out and acks for each
// if them are received, the DataStreamer closes the current block.
//
class DataStreamer extends Daemon {
//...
/*
* streamer thread is the only thread that opens streams to datanode,
* and closes them. Any error recovery is also done by this thread.
*/
@Override
public void run() {
// ...
DFSPacket one;
try {
// ...
synchronized (dataQueue) {
// wait for a packet to be sent.
if (dataQueue.isEmpty()) {
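// nothing queued: send a heartbeat packet to keep the pipeline connection alive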
one = createHeartbeatPacket();
assert one != null;
} else {
one = dataQueue.getFirst(); // regular data packet
}
}
// get new block from namenode.
if (stage == BlockConstructionStage.PIPELINE_SETUP_CREATE) {
setPipeline(nextBlockOutputStream());
// nextBlockOutputStream() asks the namenode via RPC to allocate a new block,
// then opens a socket to the first datanode of the pipeline and sends the
// WRITE_BLOCK request telling it to enter block-receiving mode
initDataStreaming(); // starts the ResponseProcessor thread and sets the stage to DATA_STREAMING
} else if (stage == BlockConstructionStage.PIPELINE_SETUP_APPEND) {
// append mode for an existing block
setupPipelineForAppendOrRecovery(); // updates the block via a namenode RPC:
// LocatedBlock lb = dfsClient.namenode.
// updateBlockForPipeline(block.getCurrentBlock(), dfsClient.clientName);
initDataStreaming();
}
long lastByteOffsetInBlock = one.getLastByteOffsetBlock();
if (lastByteOffsetInBlock > blockSize) {
throw new IOException("BlockSize " + blockSize +
" is smaller than data size. " +
" Offset of packet in block " +
lastByteOffsetInBlock +
" Aborting file " + src);
}
// the current packet (one) is the last packet of its block
if (one.isLastPacketInBlock()) {
// wait for all data packets have been successfully acked
synchronized (dataQueue) {
while (!streamerClosed && !hasError &&
ackQueue.size() != 0 && dfsClient.clientRunning) {
try {
// wait for acks to arrive from datanodes
dataQueue.wait(1000);
} catch (InterruptedException e) {
DFSClient.LOG.warn("Caught exception ", e);
}
}
}
if (streamerClosed || hasError || !dfsClient.clientRunning) {
continue;
}
stage = BlockConstructionStage.PIPELINE_CLOSE;
}
Span span = null;
synchronized (dataQueue) {
// move packet from dataQueue to ackQueue
if (!one.isHeartbeatPacket()) {
span = scope.detach();
one.setTraceSpan(span);
dataQueue.removeFirst();
ackQueue.addLast(one);
dataQueue.notifyAll();
}
}
// this is where the data is actually written to the first datanode
// write out data to remote datanode
TraceScope writeScope = Trace.startSpan("writeTo", span);
try {
one.writeTo(blockStream);
blockStream.flush();
} catch (IOException e) {
// HDFS-3398 treat primary DN is down since client is unable to
// write to primary DN. If a failed or restarting node has already
// been recorded by the responder, the following call will have no
// effect. Pipeline recovery can handle only one node error at a
// time. If the primary node fails again during the recovery, it
// will be taken out then.
tryMarkPrimaryDatanodeFailed();
throw e;
} finally {
writeScope.close();
}
lastPacket = Time.monotonicNow();
// update bytesSent
long tmpBytesSent = one.getLastByteOffsetBlock();
if (bytesSent < tmpBytesSent) {
bytesSent = tmpBytesSent;
}
if (streamerClosed || hasError || !dfsClient.clientRunning) {
continue;
}
// after the last packet of the block has been written, finish the block write
if (one.isLastPacketInBlock()) {
// wait for the close packet has been acked
synchronized (dataQueue) {
while (!streamerClosed && !hasError &&
ackQueue.size() != 0 && dfsClient.clientRunning) {
dataQueue.wait(1000);// wait for acks to arrive from datanodes
}
}
if (streamerClosed || hasError || !dfsClient.clientRunning) {
continue;
}
endBlock();
}
if (progress != null) { progress.progress(); }
// This is used by unit test to trigger race conditions.
if (artificialSlowdown != 0 && dfsClient.clientRunning) {
Thread.sleep(artificialSlowdown);
}
}
private class ResponseProcessor extends Daemon {
}
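The body of ResponseProcessor is omitted above; its run() loop is roughly the following (paraphrased from the 2.7.x source, with error bookkeeping and datanode-restart handling removed): it reads one PipelineAck at a time from blockReplyStream, checks that every datanode in the pipeline replied SUCCESS and that the sequence number matches the head of ackQueue, then removes that packet from ackQueue and wakes up threads waiting on dataQueue.

@Override
public void run() {
    PipelineAck ack = new PipelineAck();
    while (!responderClosed && dfsClient.clientRunning && !isLastPacketInBlock) {
        try {
            ack.readFields(blockReplyStream);          // one ack per packet, sent back by the first datanode
            long seqno = ack.getSeqno();

            // every reply in the ack must be SUCCESS, otherwise the pipeline is in error
            for (int i = ack.getNumOfReplies() - 1; i >= 0; i--) {
                if (PipelineAck.getStatusFromHeader(ack.getReply(i)) != Status.SUCCESS) {
                    errorIndex = i;                    // remember the first bad datanode
                    throw new IOException("Bad response for seqno " + seqno);
                }
            }

            DFSPacket one;
            synchronized (dataQueue) {
                one = ackQueue.getFirst();             // acks arrive in order, so the head must match
            }
            if (one.getSeqno() != seqno) {
                throw new IOException("ResponseProcessor: expecting seqno " + one.getSeqno());
            }
            isLastPacketInBlock = one.isLastPacketInBlock();

            synchronized (dataQueue) {
                lastAckedSeqno = seqno;
                ackQueue.removeFirst();                // the packet is now on all datanodes of the pipeline
                dataQueue.notifyAll();                 // unblock waitAndQueueCurrentPacket() / run()
            }
        } catch (Exception e) {
            // (paraphrase) mark the error and let the DataStreamer run pipeline recovery
            hasError = true;
            responderClosed = true;
        }
    }
}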
Handling Write Failures
When a datanode fails while data is being written, the pipeline is closed and all packets on the ack queue are pushed back to the front of the data queue so that nothing is lost. The current block on the surviving datanodes is marked (given a new identity) and this is reported to the namenode, so that the incomplete block on the failed datanode can be correctly deleted. The failed datanode is removed from the pipeline, and the remaining datanodes finish writing the block. When the namenode notices the block is under-replicated, it arranges for a new replica on another node.
Multiple datanodes may fail at the same time, but as long as dfs.namenode.replication.min replicas are successfully written, the write is treated as successful; the block is then replicated asynchronously across the cluster until its target replication factor is reached.
When writing files, keep in mind that writes in Hadoop can fail; hflush() and hsync() can guarantee the safety of the data at the price of higher overhead.
Coherency Model
A coherency model for a filesystem describes the data visibility of reads and writes for a file.
For efficiency, HDFS does not provide POSIX consistency semantics. This shows up mainly in the following ways.
A file is visible in the filesystem as soon as it is created:
Path p = new Path("p");
fs.create(p);
assertThat(fs.exists(p), is(true));
However, data written to the file is not necessarily visible right away. In the code below, even though the content has already been flushed, reading the file length immediately afterwards still returns 0:
Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
assertThat(fs.getFileStatus(p).getLen(), is(0L));
Once more than a block's worth of data has been written, the first block becomes visible to new readers; it is always the block currently being written that others cannot see.
HDFS provides a way to force visibility: replace out.flush() above with out.hflush(). When hflush() returns, HDFS guarantees the data written so far has reached all the datanodes in the write pipeline and is visible to new readers. Calling close() performs an implicit hflush().
However, hflush() does not guarantee that the datanodes have written the data to disk, only that it is in their memory; for that stronger guarantee, hsync() is needed.
From a safety point of view, without hflush() or hsync() a client or system failure can lose recently written data. Whether to call them is therefore a trade-off between data safety and throughput.
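A small usage sketch contrasting the durability levels discussed above (the path and data are illustrative, reusing the fs and Path pattern from the earlier snippets):

Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.hflush();   // data has reached every datanode in the pipeline (memory) and is visible to new readers
out.hsync();    // additionally asks the datanodes to flush the block data to disk (more expensive)
out.close();    // close() performs an implicit hflush() before closing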