Hadoop write and read source-code analysis

Reading
Let's look at the read operation first.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Get the file system configured via fs.defaultFS (HDFS in this case)
FileSystem hdfs = FileSystem.get(new Configuration());
Path path = new Path("/testfile"); // file to read
FSDataInputStream dis = hdfs.open(path);
byte[] readBuf = new byte[1024];
int len = dis.read(readBuf);
System.out.println(new String(readBuf, 0, len, "UTF-8"));
dis.close();

hdfs.close();

First a distributed file system instance is created. FileSystem itself only defines an abstract file-system interface; HDFS is just one of its implementations.
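Because client code is written against the abstract FileSystem API, the same calls work against any implementation; the URI scheme decides which concrete class backs them. A minimal sketch (the hdfs://namenode:9000 address is a placeholder for illustration):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// The URI scheme decides which concrete implementation backs the abstract API
FileSystem local = FileSystem.get(URI.create("file:///"), conf);             // LocalFileSystem
FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode:9000"), conf); // DistributedFileSystem

System.out.println(local.getClass().getSimpleName()); // e.g. LocalFileSystem
System.out.println(hdfs.getClass().getSimpleName());  // e.g. DistributedFileSystem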
Now let's look at how the FileSystem instance is created.

public abstract class FileSystem extends Configured implements Closeable{
...
  public static FileSystem get(Configuration conf) throws IOException {
    return get(getDefaultUri(conf), conf);
  }
}

This delegates to the get(URI, Configuration) overload:

    public static FileSystem get(URI uri, Configuration conf) throws IOException {
    String scheme = uri.getScheme();
    String authority = uri.getAuthority();

    if (scheme == null && authority == null) {     // use default FS
      return get(conf);
    }

    if (scheme != null && authority == null) {     // no authority
      URI defaultUri = getDefaultUri(conf);
      if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
          && defaultUri.getAuthority() != null) {  // & default has authority
        return get(defaultUri, conf);              // return default
      }
    }
    
    String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
    if (conf.getBoolean(disableCacheName, false)) {
      return createFileSystem(uri, conf);
    }

    return CACHE.get(uri, conf);
  }

Both the cached path (CACHE.get) and the uncached path eventually reach createFileSystem, which calls:

    Class<?> clazz = getFileSystemClass(uri.getScheme(), conf);

The class of the concrete file-system implementation is looked up from the scheme, which for a default URI comes from the fs.defaultFS configuration value. In this case it resolves to DistributedFileSystem.
https://blog.csdn.net/chengyuqiang/article/details/78636721
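As a reminder of where that value comes from, a client would normally pick fs.defaultFS up from core-site.xml on the classpath; setting it programmatically looks like the sketch below (the host and port are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// Usually provided by core-site.xml; shown inline here only for illustration
conf.set("fs.defaultFS", "hdfs://namenode:9000");

// getDefaultUri(conf) now yields hdfs://namenode:9000, so FileSystem.get(conf)
// resolves the "hdfs" scheme to DistributedFileSystem
FileSystem fs = FileSystem.get(conf);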

The open() method

Next, FileSystem's open method is called. Stepping into open:

public abstract class FileSystem extends Configured implements Closeable{
...
 public abstract FSDataInputStream open(Path f, int bufferSize)
    throws IOException;
}

We can see it is an abstract method that returns an FSDataInputStream.

FSDataInputStream dis = hdfs.open(path);
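Note that the client code calls the single-argument open(Path), while the abstract method above also takes a buffer size. The one-argument overload in FileSystem simply fills in a default buffer size; roughly, as a paraphrased sketch (the exact default comes from io.file.buffer.size and may differ between versions):

// Simplified sketch of FileSystem#open(Path): delegate to open(Path, int)
// using the configured io.file.buffer.size (4096 if unset)
public FSDataInputStream open(Path f) throws IOException {
  return open(f, getConf().getInt("io.file.buffer.size", 4096));
}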

Let's see which class's open is actually invoked. Since the configured file system is DistributedFileSystem, it is its override:

public class DistributedFileSystem extends FileSystem {
...
  @Override
  public FSDataInputStream open(Path f, final int bufferSize)
      throws IOException {
    statistics.incrementReadOps(1);
    Path absF = fixRelativePart(f);
    return new FileSystemLinkResolver<FSDataInputStream>() {
      @Override
      public FSDataInputStream doCall(final Path p)
          throws IOException, UnresolvedLinkException {
        final DFSInputStream dfsis =
          dfs.open(getPathName(p), bufferSize, verifyChecksum);
        return dfs.createWrappedInputStream(dfsis);
      }
      @Override
      public FSDataInputStream next(final FileSystem fs, final Path p)
          throws IOException {
        return fs.open(p, bufferSize);
      }
    }.resolve(this, absF);
  }
}

open() then calls resolve on the FileSystemLinkResolver:

 public T resolve(final FileSystem filesys, final Path path)
      throws IOException {
    int count = 0;
    T in = null;
    Path p = path;
    // Assumes path belongs to this FileSystem.
    // Callers validate this by passing paths through FileSystem#checkPath
    FileSystem fs = filesys;
    for (boolean isLink = true; isLink;) {
      try {
        in = doCall(p);
        isLink = false;
      } catch (UnresolvedLinkException e) {
        if (!filesys.resolveSymlinks) {
          throw new IOException("Path " + path + " contains a symlink"
              + " and symlink resolution is disabled ("
              + CommonConfigurationKeys.FS_CLIENT_RESOLVE_REMOTE_SYMLINKS_KEY
              + ").", e);
        }
        if (!FileSystem.areSymlinksEnabled()) {
          throw new IOException("Symlink resolution is disabled in" +
              " this version of Hadoop.");
        }
        if (count++ > FsConstants.MAX_PATH_LINKS) {
          throw new IOException("Possible cyclic loop while " +
                                "following symbolic link " + path);
        }
        // Resolve the first unresolved path component
        p = FSLinkResolver.qualifySymlinkTarget(fs.getUri(), p,
            filesys.resolveLink(p));
        fs = FileSystem.getFSofPath(p, filesys.getConf());
        // Have to call next if it's a new FS
        if (!fs.equals(filesys)) {
          return next(fs, p);
        }
        // Else, we keep resolving with this filesystem
      }
    }
    // Successful call, path was fully resolved
    return in;
  }
}

Here in = doCall(p) runs the doCall defined in the previous snippet, which calls dfs.open() (dfs being the DFSClient). Stepping into DFSClient.open:

  public DFSInputStream open(String src, int buffersize, boolean verifyChecksum)
      throws IOException, UnresolvedLinkException {
    checkOpen();
    //    Get block info from namenode
    return new DFSInputStream(this, src, buffersize, verifyChecksum);
  }

  // Step into the DFSInputStream constructor
  DFSInputStream(DFSClient dfsClient, String src, int buffersize, boolean verifyChecksum
                 ) throws IOException, UnresolvedLinkException {
                 ) throws IOException, UnresolvedLinkException {
    this.dfsClient = dfsClient;
    this.verifyChecksum = verifyChecksum;
    this.buffersize = buffersize;
    this.src = src;
    this.cachingStrategy =
        dfsClient.getDefaultReadCachingStrategy();
    openInfo();
  }

  // Step into openInfo
  /**
   * Grab the open-file info from namenode
   */
  synchronized void openInfo() throws IOException, UnresolvedLinkException {
    lastBlockBeingWrittenLength = fetchLocatedBlocksAndGetLastBlockLength();
    int retriesForLastBlockLength = dfsClient.getConf().retryTimesForGetLastBlockLength;
    while (retriesForLastBlockLength > 0) {
      // Getting last block length as -1 is a special case. When cluster
      // restarts, DNs may not report immediately. At this time partial block
      // locations will not be available with NN for getting the length. Lets
      // retry for 3 times to get the length.
      if (lastBlockBeingWrittenLength == -1) {
        DFSClient.LOG.warn("Last block locations not available. "
            + "Datanodes might not have reported blocks completely."
            + " Will retry for " + retriesForLastBlockLength + " times");
        waitFor(dfsClient.getConf().retryIntervalForGetLastBlockLength);
        lastBlockBeingWrittenLength = fetchLocatedBlocksAndGetLastBlockLength();
      } else {
        break;
      }
      retriesForLastBlockLength--;
    }
    if (retriesForLastBlockLength == 0) {
      throw new IOException("Could not obtain the last block locations.");
    }
  }

At this point the open() call chain is essentially complete: a DFSInputStream object has been created, and openInfo() has already fetched the file's block locations from the NameNode.

public class DFSInputStream extends FSInputStream
implements ByteBufferReadable, CanSetDropBehind, CanSetReadahead,
    HasEnhancedByteBufferAccess, CanUnbuffer {
    ...
    }

Now back to the read call in the client code:

dis.read(readBuf);

Both the byte[] overload and the ByteBuffer overload of DFSInputStream.read wrap the destination in a ReaderStrategy and funnel into readWithStrategy; the ByteBuffer variant is quoted here:

  @Override
  public synchronized int read(final ByteBuffer buf) throws IOException {
    ReaderStrategy byteBufferReader = new ByteBufferStrategy(buf);

    return readWithStrategy(byteBufferReader, 0, buf.remaining());
  }
private int readWithStrategy(ReaderStrategy strategy, int off, int len) throws IOException {
    dfsClient.checkOpen();
    if (closed) {
      throw new IOException("Stream closed");
    }
    Map<ExtendedBlock,Set<DatanodeInfo>> corruptedBlockMap 
      = new HashMap<ExtendedBlock, Set<DatanodeInfo>>();
    failures = 0;
    if (pos < getFileLength()) {
      int retries = 2;
      while (retries > 0) {
        try {
          // currentNode can be left as null if previous read had a checksum
          // error on the same block. See HDFS-3067
          if (pos > blockEnd || currentNode == null) {
            currentNode = blockSeekTo(pos);
          }
          int realLen = (int) Math.min(len, (blockEnd - pos + 1L));
          if (locatedBlocks.isLastBlockComplete()) {
            realLen = (int) Math.min(realLen,
                locatedBlocks.getFileLength() - pos);
          }
          int result = readBuffer(strategy, off, realLen, corruptedBlockMap);
          
          if (result >= 0) {
            pos += result;
          } else {
            // got a EOS from reader though we expect more data on it.
            throw new IOException("Unexpected EOS from the reader");
          }
          if (dfsClient.stats != null) {
            dfsClient.stats.incrementBytesRead(result);
          }
          return result;
        } catch (ChecksumException ce) {
          throw ce;            
        } catch (IOException e) {
          if (retries == 1) {
            DFSClient.LOG.warn("DFS Read", e);
          }
          blockEnd = -1;
          if (currentNode != null) { addToDeadNodes(currentNode); }
          if (--retries == 0) {
            throw e;
          }
        } finally {
          // Check if need to report block replicas corruption either read
          // was successful or ChecksumException occured.
          reportCheckSumFailure(corruptedBlockMap, 
              currentLocatedBlock.getLocations().length);
        }
      }
    }
    return -1;
  }

The readBuffer method is straightforward: it simply delegates to BlockReader's read method. Given the requested offset into the block and the length, BlockReader.read connects to a DataNode over a socket and fetches the block contents; it does not do any caching optimization of its own.
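One practical consequence of readWithStrategy above is that a single read() never crosses a block boundary (realLen is clamped to blockEnd - pos + 1), so it may return fewer bytes than the buffer holds. A robust client therefore loops, or uses Hadoop's IOUtils helper; a minimal sketch:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

FileSystem fs = FileSystem.get(new Configuration());
InputStream in = null;
try {
  in = fs.open(new Path("/testfile"));
  // copyBytes keeps calling read() until EOF, so partial reads at block
  // boundaries are handled for us
  IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
  IOUtils.closeStream(in);
}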

Next, let's look at the read path that MapReduce uses.

Because it already knows the exact position in the file it needs (for example its input split's offset), it calls the positional read() overload directly:

/**
   * Read bytes starting from the specified position.
   * 
   * @param position start read from this position
   * @param buffer read buffer
   * @param offset offset into buffer
   * @param length number of bytes to read
   * 
   * @return actual number of bytes read
   */
  @Override
  public int read(long position, byte[] buffer, int offset, int length)
    throws IOException {
    // sanity checks
    dfsClient.checkOpen();
    if (closed) {
      throw new IOException("Stream closed");
    }
    failures = 0;
    long filelen = getFileLength();
    if ((position < 0) || (position >= filelen)) {
      return -1;
    }
    int realLen = length;
    if ((position + length) > filelen) {
      realLen = (int)(filelen - position);
    }
    
    // determine the block and byte range within the block
    // corresponding to position and realLen
    List<LocatedBlock> blockRange = getBlockRange(position, realLen);
    int remaining = realLen;
    Map<ExtendedBlock,Set<DatanodeInfo>> corruptedBlockMap 
      = new HashMap<ExtendedBlock, Set<DatanodeInfo>>();
    for (LocatedBlock blk : blockRange) {
      long targetStart = position - blk.getStartOffset();
      long bytesToRead = Math.min(remaining, blk.getBlockSize() - targetStart);
      try {
        if (dfsClient.isHedgedReadsEnabled()) {
          hedgedFetchBlockByteRange(blk, targetStart, targetStart + bytesToRead
              - 1, buffer, offset, corruptedBlockMap);
        } else {
          fetchBlockByteRange(blk, targetStart, targetStart + bytesToRead - 1,
              buffer, offset, corruptedBlockMap);
        }
      } finally {
        // Check and report if any block replicas are corrupted.
        // BlockMissingException may be caught if all block replicas are
        // corrupted.
        reportCheckSumFailure(corruptedBlockMap, blk.getLocations().length);
      }

      remaining -= bytesToRead;
      position += bytesToRead;
      offset += bytesToRead;
    }
    assert remaining == 0 : "Wrong number of bytes read.";
    if (dfsClient.stats != null) {
      dfsClient.stats.incrementBytesRead(realLen);
    }
    return realLen;
  }
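For reference, this is the pread-style API a caller such as a MapReduce record reader would use; unlike the streaming read, it does not move the stream's current position. A minimal sketch (the offsets and lengths below are arbitrary placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
FSDataInputStream in = fs.open(new Path("/testfile"));

byte[] buf = new byte[128];
// Read up to 128 bytes starting at byte 1024 of the file, without seeking;
// the stream's own position is left untouched
int n = in.read(1024L, buf, 0, buf.length);
// readFully(position, buffer) is the variant that loops until the buffer is full
in.readFully(2048L, buf);
in.close();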
  

Hadoop 2.4 added hedged reads here: if the DataNode currently being read from is too slow, the client decides, based on a time threshold, to issue a second read for the same range against another replica and uses whichever response arrives first.
http://chengfeng96.com/blog/2018/11/25/HDFS里的Hedged-Read源码以及局限性分析/
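Hedged reads are off by default; as far as I understand they are enabled on the client via DFSClient settings roughly like the following (key names as in Hadoop 2.4+, shown only as an illustrative sketch):

Configuration conf = new Configuration();
// A thread pool size > 0 turns hedged reads on
conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
// Wait this long (ms) for the first DataNode before hedging to another replica
conf.setLong("dfs.client.hedged.read.threshold.millis", 500);
FileSystem fs = FileSystem.get(conf);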
For now, let's follow the path with hedged reads disabled:

  private void fetchBlockByteRange(LocatedBlock block, long start, long end,
      byte[] buf, int offset,
      Map<ExtendedBlock, Set<DatanodeInfo>> corruptedBlockMap)
      throws IOException {
    block = getBlockAt(block.getStartOffset(), false);
    while (true) {
      DNAddrPair addressPair = chooseDataNode(block, null);
      try {
        actualGetFromOneDataNode(addressPair, block, start, end, buf, offset,
            corruptedBlockMap);
        return;
      } catch (IOException e) {
        // Ignore. Already processed inside the function.
        // Loop through to try the next node.
      }
    }
  }
 private void actualGetFromOneDataNode(final DNAddrPair datanode,
      LocatedBlock block, final long start, final long end, byte[] buf,
      int offset, Map<ExtendedBlock, Set<DatanodeInfo>> corruptedBlockMap)
      throws IOException {
      ...
      int nread = reader.readAll(buf, offset, len);
      ...
      }

This calls the BlockReader's readAll() method.

public interface BlockReader extends ByteBufferReadable {
...
}

The concrete reader is produced via a factory (BlockReaderFactory); in this trace it is a BlockReaderLocalLegacy (I have not yet compared it with the other BlockReader implementations).
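Which BlockReader the factory hands back depends on client configuration; to the best of my knowledge, short-circuit local reads and the legacy local reader are toggled roughly by settings like these (an illustrative sketch, not an exhaustive list):

Configuration conf = new Configuration();
// Allow the client to bypass the DataNode and read local block files directly
conf.setBoolean("dfs.client.read.shortcircuit", true);
// Use the older BlockReaderLocalLegacy implementation instead of BlockReaderLocal
conf.setBoolean("dfs.client.use.legacy.blockreader.local", true);

In any case, the read() method of the legacy local reader looks like this: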

  public synchronized int read(byte[] buf, int off, int len) throws IOException {
    if (LOG.isTraceEnabled()) {
      LOG.trace("read off " + off + " len " + len);
    }
    if (!verifyChecksum) {
      return dataIn.read(buf, off, len);
    }

    int nRead = fillSlowReadBuffer(slowReadBuff.capacity());

    if (nRead > 0) {
      // Possible that buffer is filled with a larger read than we need, since
      // we tried to read as much as possible at once
      nRead = Math.min(len, nRead);
      slowReadBuff.get(buf, off, nRead);
    }

    return nRead;
  }

  // Step into readWithBounceBuffer (this method is in BlockReaderLocal)
  private synchronized int readWithBounceBuffer(byte arr[], int off, int len,
      boolean canSkipChecksum) throws IOException {
    createDataBufIfNeeded();
    if (!dataBuf.hasRemaining()) {
      dataBuf.position(0);
      dataBuf.limit(maxReadaheadLength);
      fillDataBuf(canSkipChecksum);
    }
    if (dataBuf.remaining() == 0) return -1;
    int toRead = Math.min(dataBuf.remaining(), len);
    dataBuf.get(arr, off, toRead);
    return toRead;
  }

Here fillDataBuf will, depending on whether verification can be skipped, verify the checksums of the data it loads into dataBuf; verification is done chunk by chunk (see the small sketch after the code below).

 private synchronized boolean fillDataBuf(boolean canSkipChecksum)
      throws IOException {
    createDataBufIfNeeded();
    final int slop = (int)(dataPos % bytesPerChecksum);
    final long oldDataPos = dataPos;
    dataBuf.limit(maxReadaheadLength);
    if (canSkipChecksum) {
      dataBuf.position(slop);
      fillBuffer(dataBuf, canSkipChecksum);
    } else {
      dataPos -= slop;
      dataBuf.position(0);
      fillBuffer(dataBuf, canSkipChecksum);
    }
    dataBuf.limit(dataBuf.position());
    dataBuf.position(Math.min(dataBuf.position(), slop));
    if (LOG.isTraceEnabled()) {
      LOG.trace("loaded " + dataBuf.remaining() + " bytes into bounce " +
          "buffer from offset " + oldDataPos + " of " + block);
    }
    return dataBuf.limit() != maxReadaheadLength;
  }

Finally the requested bytes are copied out of dataBuf to the caller.
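To make the chunk alignment in fillDataBuf concrete: each checksum covers a fixed-size chunk of data (bytesPerChecksum, 512 bytes by default via dfs.bytes-per-checksum), so when verification is needed the read position is pulled back to the previous chunk boundary. A tiny sketch of that arithmetic, assuming the 512-byte default:

// Assume the default chunk size; in real code this comes from the block's
// DataChecksum (dfs.bytes-per-checksum, 512 by default)
final int bytesPerChecksum = 512;
long dataPos = 1300;                              // where we want to read from
int slop = (int) (dataPos % bytesPerChecksum);    // 1300 % 512 = 276 bytes past the boundary
long chunkAlignedPos = dataPos - slop;            // 1024: start of the chunk that must be verified
System.out.println(slop + " " + chunkAlignedPos); // prints: 276 1024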

Summary

That covers a fair amount, so here is an interim summary; later I will see which parts deserve a deeper look.

Reading starts by creating a stream that points at the file; that stream is created by the distributed file system, and the data is then read through it. The abstractions here are:

  1. FileSystem
  2. FSInputStream (handed to the client wrapped in an FSDataInputStream)

The concrete implementations are:

  1. DistributedFileSystem
  2. DFSInputStream

The overall call chain: FileSystem.get() creates a DistributedFileSystem; its open() creates a DFSInputStream; the client's read() call goes through the DFSInputStream to the BlockReader's readAll(), which finally reads the bytes out of dataBuf.

Takeaways:

This exercise reduced my fear of the source code and gave me a reasonable grasp of the concrete implementation. The next step is to finish going through the rest of the source and then dig deeper into the details.

Next steps

  1. The write-path source code
  2. The MapReduce job flow
  3. What the lower-level read actually does, and how exceptions are handled