hadoop写入,读取源码分析
读取
先看读取的操作
FileSystem hdfs = FileSystem.get(new Configuration());
Path path = new Path("/testfile");// reading
FSDataInputStream dis = hdfs.open(path);
byte[] writeBuf = new byte[1024];
int len = dis.read(writeBuf);
System.out.println(new String(writeBuf, 0, len, "UTF-8"));
dis.close();
hdfs.close();
首先创建了一个分布式文件管理系统,这里有一个FileSystem只是规定了一个抽象的文件系统概念,HDFS只是其中一个实现。
看一下FileSystem的创建
public abstract class FileSystem extends Configured implements Closeable{
...
public static FileSystem get(Configuration conf) throws IOException {
return get(getDefaultUri(conf), conf);
}
}
调用了另一个get
public static FileSystem get(URI uri, Configuration conf) throws IOException {
String scheme = uri.getScheme();
String authority = uri.getAuthority();
if (scheme == null && authority == null) { // use default FS
return get(conf);
}
if (scheme != null && authority == null) { // no authority
URI defaultUri = getDefaultUri(conf);
if (scheme.equals(defaultUri.getScheme()) // if scheme matches default
&& defaultUri.getAuthority() != null) { // & default has authority
return get(defaultUri, conf); // return default
}
}
String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
if (conf.getBoolean(disableCacheName, false)) {
return createFileSystem(uri, conf);
}
return CACHE.get(uri, conf);
}
最后调用了
Class<?> clazz = getFileSystemClass(uri.getScheme(), conf);
通过配置文件fs.defaultFS的值来获取相应的文件系统对象的class文件
这里使用的是 DistributedFileSystem
https://blog.csdn.net/chengyuqiang/article/details/78636721
Open()方法
之后调用了FileSystem的open方法。进入open看一下
public abstract class FileSystem extends Configured implements Closeable{
...
public abstract FSDataInputStream open(Path f, int bufferSize)
throws IOException;
}
看到是一个方法,返回一个FSDataInputStream
FSDataInputStream dis = hdfs.open(path);
看一下实际调用的是哪个类的方法
public class DFSInputStream extends FSInputStream
implements ByteBufferReadable, CanSetDropBehind, CanSetReadahead,
HasEnhancedByteBufferAccess, CanUnbuffer {
public class DistributedFileSystem extends FileSystem {
...
@Override
public FSDataInputStream open(Path f, final int bufferSize)
throws IOException {
statistics.incrementReadOps(1);
Path absF = fixRelativePart(f);
return new FileSystemLinkResolver<FSDataInputStream>() {
@Override
public FSDataInputStream doCall(final Path p)
throws IOException, UnresolvedLinkException {
final DFSInputStream dfsis =
dfs.open(getPathName(p), bufferSize, verifyChecksum);
return dfs.createWrappedInputStream(dfsis);
}
@Override
public FSDataInputStream next(final FileSystem fs, final Path p)
throws IOException {
return fs.open(p, bufferSize);
}
}.resolve(this, absF);
}
}
调用了resolve
public T resolve(final FileSystem filesys, final Path path)
throws IOException {
int count = 0;
T in = null;
Path p = path;
// Assumes path belongs to this FileSystem.
// Callers validate this by passing paths through FileSystem#checkPath
FileSystem fs = filesys;
for (boolean isLink = true; isLink;) {
try {
in = doCall(p);
isLink = false;
} catch (UnresolvedLinkException e) {
if (!filesys.resolveSymlinks) {
throw new IOException("Path " + path + " contains a symlink"
+ " and symlink resolution is disabled ("
+ CommonConfigurationKeys.FS_CLIENT_RESOLVE_REMOTE_SYMLINKS_KEY
+ ").", e);
}
if (!FileSystem.areSymlinksEnabled()) {
throw new IOException("Symlink resolution is disabled in" +
" this version of Hadoop.");
}
if (count++ > FsConstants.MAX_PATH_LINKS) {
throw new IOException("Possible cyclic loop while " +
"following symbolic link " + path);
}
// Resolve the first unresolved path component
p = FSLinkResolver.qualifySymlinkTarget(fs.getUri(), p,
filesys.resolveLink(p));
fs = FileSystem.getFSofPath(p, filesys.getConf());
// Have to call next if it's a new FS
if (!fs.equals(filesys)) {
return next(fs, p);
}
// Else, we keep resolving with this filesystem
}
}
// Successful call, path was fully resolved
return in;
}
}
这里的in调用了dfs.open()在上一个调用中实现了,进去看一下
public DFSInputStream open(String src, int buffersize, boolean verifyChecksum)
throws IOException, UnresolvedLinkException {
checkOpen();
// Get block info from namenode
return new DFSInputStream(this, src, buffersize, verifyChecksum);
}
// 进入DFSInputStream
DFSInputStream(DFSClient dfsClient, String src, int buffersize, boolean verifyChecksum
) throws IOException, UnresolvedLinkException {
this.dfsClient = dfsClient;
this.verifyChecksum = verifyChecksum;
this.buffersize = buffersize;
this.src = src;
this.cachingStrategy =
dfsClient.getDefaultReadCachingStrategy();
openInfo();
}
// 进入openInfo
/**
* Grab the open-file info from namenode
*/
synchronized void openInfo() throws IOException, UnresolvedLinkException {
lastBlockBeingWrittenLength = fetchLocatedBlocksAndGetLastBlockLength();
int retriesForLastBlockLength = dfsClient.getConf().retryTimesForGetLastBlockLength;
while (retriesForLastBlockLength > 0) {
// Getting last block length as -1 is a special case. When cluster
// restarts, DNs may not report immediately. At this time partial block
// locations will not be available with NN for getting the length. Lets
// retry for 3 times to get the length.
if (lastBlockBeingWrittenLength == -1) {
DFSClient.LOG.warn("Last block locations not available. "
+ "Datanodes might not have reported blocks completely."
+ " Will retry for " + retriesForLastBlockLength + " times");
waitFor(dfsClient.getConf().retryIntervalForGetLastBlockLength);
lastBlockBeingWrittenLength = fetchLocatedBlocksAndGetLastBlockLength();
} else {
break;
}
retriesForLastBlockLength--;
}
if (retriesForLastBlockLength == 0) {
throw new IOException("Could not obtain the last block locations.");
}
}
到这里基本就调用完了,创建了一个DFSInputStream对象
public class DFSInputStream extends FSInputStream
implements ByteBufferReadable, CanSetDropBehind, CanSetReadahead,
HasEnhancedByteBufferAccess, CanUnbuffer {
...
}
dis.read(writeBuf);
@Override
public synchronized int read(final ByteBuffer buf) throws IOException {
ReaderStrategy byteBufferReader = new ByteBufferStrategy(buf);
return readWithStrategy(byteBufferReader, 0, buf.remaining());
}
private int readWithStrategy(ReaderStrategy strategy, int off, int len) throws IOException {
dfsClient.checkOpen();
if (closed) {
throw new IOException("Stream closed");
}
Map<ExtendedBlock,Set<DatanodeInfo>> corruptedBlockMap
= new HashMap<ExtendedBlock, Set<DatanodeInfo>>();
failures = 0;
if (pos < getFileLength()) {
int retries = 2;
while (retries > 0) {
try {
// currentNode can be left as null if previous read had a checksum
// error on the same block. See HDFS-3067
if (pos > blockEnd || currentNode == null) {
currentNode = blockSeekTo(pos);
}
int realLen = (int) Math.min(len, (blockEnd - pos + 1L));
if (locatedBlocks.isLastBlockComplete()) {
realLen = (int) Math.min(realLen,
locatedBlocks.getFileLength() - pos);
}
int result = readBuffer(strategy, off, realLen, corruptedBlockMap);
if (result >= 0) {
pos += result;
} else {
// got a EOS from reader though we expect more data on it.
throw new IOException("Unexpected EOS from the reader");
}
if (dfsClient.stats != null) {
dfsClient.stats.incrementBytesRead(result);
}
return result;
} catch (ChecksumException ce) {
throw ce;
} catch (IOException e) {
if (retries == 1) {
DFSClient.LOG.warn("DFS Read", e);
}
blockEnd = -1;
if (currentNode != null) { addToDeadNodes(currentNode); }
if (--retries == 0) {
throw e;
}
} finally {
// Check if need to report block replicas corruption either read
// was successful or ChecksumException occured.
reportCheckSumFailure(corruptedBlockMap,
currentLocatedBlock.getLocations().length);
}
}
}
return -1;
}
readBuffer方法比较简单,直接调用BlockReader的read方法直接读取数据。BlockReader的read方法就根据请求的块起始偏移量,长度,通过socket连接DataNode,获取块内容,BlockReader的read方法不会做缓存优化。
接下来看一下mapreduce的读取
其已经知道了文件具体信息,所以直接调用另一个read()
/**
* Read bytes starting from the specified position.
*
* @param position start read from this position
* @param buffer read buffer
* @param offset offset into buffer
* @param length number of bytes to read
*
* @return actual number of bytes read
*/
@Override
public int read(long position, byte[] buffer, int offset, int length)
throws IOException {
// sanity checks
dfsClient.checkOpen();
if (closed) {
throw new IOException("Stream closed");
}
failures = 0;
long filelen = getFileLength();
if ((position < 0) || (position >= filelen)) {
return -1;
}
int realLen = length;
if ((position + length) > filelen) {
realLen = (int)(filelen - position);
}
// determine the block and byte range within the block
// corresponding to position and realLen
List<LocatedBlock> blockRange = getBlockRange(position, realLen);
int remaining = realLen;
Map<ExtendedBlock,Set<DatanodeInfo>> corruptedBlockMap
= new HashMap<ExtendedBlock, Set<DatanodeInfo>>();
for (LocatedBlock blk : blockRange) {
long targetStart = position - blk.getStartOffset();
long bytesToRead = Math.min(remaining, blk.getBlockSize() - targetStart);
try {
if (dfsClient.isHedgedReadsEnabled()) {
hedgedFetchBlockByteRange(blk, targetStart, targetStart + bytesToRead
- 1, buffer, offset, corruptedBlockMap);
} else {
fetchBlockByteRange(blk, targetStart, targetStart + bytesToRead - 1,
buffer, offset, corruptedBlockMap);
}
} finally {
// Check and report if any block replicas are corrupted.
// BlockMissingException may be caught if all block replicas are
// corrupted.
reportCheckSumFailure(corruptedBlockMap, blk.getLocations().length);
}
remaining -= bytesToRead;
position += bytesToRead;
offset += bytesToRead;
}
assert remaining == 0 : "Wrong number of bytes read.";
if (dfsClient.stats != null) {
dfsClient.stats.incrementBytesRead(realLen);
}
return realLen;
}
这里在2.4以后增加了一个hedgedRead,通过判断一个节点的速度来决定要不要跳过。
http://chengfeng96.com/blog/2018/11/25/HDFS里的Hedged-Read源码以及局限性分析/
在这先看不开启的
private void fetchBlockByteRange(LocatedBlock block, long start, long end,
byte[] buf, int offset,
Map<ExtendedBlock, Set<DatanodeInfo>> corruptedBlockMap)
throws IOException {
block = getBlockAt(block.getStartOffset(), false);
while (true) {
DNAddrPair addressPair = chooseDataNode(block, null);
try {
actualGetFromOneDataNode(addressPair, block, start, end, buf, offset,
corruptedBlockMap);
return;
} catch (IOException e) {
// Ignore. Already processed inside the function.
// Loop through to try the next node.
}
}
}
private void actualGetFromOneDataNode(final DNAddrPair datanode,
LocatedBlock block, final long start, final long end, byte[] buf,
int offset, Map<ExtendedBlock, Set<DatanodeInfo>> corruptedBlockMap)
throws IOException {
...
int nread = reader.readAll(buf, offset, len);
...
}
调用BlockReader readAll()方法
public interface BlockReader extends ByteBufferReadable {
...
}
通过工厂模式产生BlockReaderLocalLegacy(还没有看和其他之间的区别)
public synchronized int read(byte[] buf, int off, int len) throws IOException {
if (LOG.isTraceEnabled()) {
LOG.trace("read off " + off + " len " + len);
}
if (!verifyChecksum) {
return dataIn.read(buf, off, len);
}
int nRead = fillSlowReadBuffer(slowReadBuff.capacity());
if (nRead > 0) {
// Possible that buffer is filled with a larger read than we need, since
// we tried to read as much as possible at once
nRead = Math.min(len, nRead);
slowReadBuff.get(buf, off, nRead);
}
return nRead;
}
\\ 进入readWithBounceBuffer
private synchronized int readWithBounceBuffer(byte arr[], int off, int len,
boolean canSkipChecksum) throws IOException {
createDataBufIfNeeded();
if (!dataBuf.hasRemaining()) {
dataBuf.position(0);
dataBuf.limit(maxReadaheadLength);
fillDataBuf(canSkipChecksum);
}
if (dataBuf.remaining() == 0) return -1;
int toRead = Math.min(dataBuf.remaining(), len);
dataBuf.get(arr, off, toRead);
return toRead;
}
在这里filldatebuff会根据是否需要校验对databuf进行校验,教研也是根据chunk进行教研。
private synchronized boolean fillDataBuf(boolean canSkipChecksum)
throws IOException {
createDataBufIfNeeded();
final int slop = (int)(dataPos % bytesPerChecksum);
final long oldDataPos = dataPos;
dataBuf.limit(maxReadaheadLength);
if (canSkipChecksum) {
dataBuf.position(slop);
fillBuffer(dataBuf, canSkipChecksum);
} else {
dataPos -= slop;
dataBuf.position(0);
fillBuffer(dataBuf, canSkipChecksum);
}
dataBuf.limit(dataBuf.position());
dataBuf.position(Math.min(dataBuf.position(), slop));
if (LOG.isTraceEnabled()) {
LOG.trace("loaded " + dataBuf.remaining() + " bytes into bounce " +
"buffer from offset " + oldDataPos + " of " + block);
}
return dataBuf.limit() != maxReadaheadLength;
}
最后会读取databuf中的内容
总结
写了一些,进行一下阶段性总结,之后看看哪里还需要深入。
首先读取会创建一个指向文件的流,这个流是通过分布式文件系统创建的。通过这个流会对数据进行读取。这里的抽象是
- FileSystem
- DSInputStream
具体的实例
- DistributedFileSystem
- DFSInputStream
收获:
减轻了对源码的恐惧,同时对于具体源码实现也有了一定的了解,下一步要在过完其他源码的基础上,再深一步的去了解细节的处理。
下一步计划
- 写入源码
- mapreduce的任务过程
- 底层读取的内容以及异常的处理
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 阿里最新开源QwQ-32B,效果媲美deepseek-r1满血版,部署成本又又又降低了!
· 单线程的Redis速度为什么快?
· SQL Server 2025 AI相关能力初探
· AI编程工具终极对决:字节Trae VS Cursor,谁才是开发者新宠?
· 展开说说关于C#中ORM框架的用法!