Shuffle

Shuffle

将数据从map端拷贝到reduce端的过程。

  1. map生成的数据会放入内存缓冲区,大小为100M,当达到80M时溢写到磁盘中。直到map把数据计算完,然后将缓存中数据也保存到磁盘中。当map生成的数据存在分区时,磁盘中保存的数据也会分区。
  2. 本地文件合并,图中存在3个分区。
  3. 这3个分区会被shuffle线程分别复制到不同reduce节点上。不同Map节点的相同分区的数据会合并到一个Reduce节点上。

 

Hadoop 序列化

Map程序从HDFS读取数据、写入本地磁盘,Shuffle程序拷贝数据,Reduce程序读取数据、写入HDFS 都会涉及到磁盘IO 甚至网络IO。可见IO占用的比重非常高。IO步骤无法省略,所以为了加速而优化序列化操作。

默认情况下,java会把整个对象的信息以及父类的信息进行保存,这样会额外消耗性能。Hadoop为了效率自己设计了一套序列化机制

补充:

  1. Text 等价于 java.lang.String 的Writable,针对 UTF-8 序列
  2. NullWritable 是单例,获取实例使用 NullWritable.get()

这些类都实现了这个接口

public interface WritableComparable<T> extends Writable, Comparable<T> {
}

Hadoop 序列化特点:

  1. 紧凑 高效利用空间
  2. 快速 读写数据额外开销小
  3. 可拓展 可透明地读取老格式地数据
  4. 互操作 支持多语言地交互

java 序列化地不足:

  1. 不精简,附加信息多,不太适合随机访问
  2. 存储空间大,递归地输出类的父类描述,直到不再有父类

 

/**
 * 分别使用java自带的序列化方法和hadoop自定义的序列化方法,比较二进制文件的大小
 */
package org.example.mapreduce;

import org.apache.hadoop.io.Writable;
import java.io.*;

public class TestSerialize {
    public static void main(String[] args) {

        SerializeJava java = new SerializeJava();
        java.setId(1L);
        java.setName("123");

        try(FileOutputStream fileOutputStream = new FileOutputStream("C:\\code\\java");
            ObjectOutputStream objectOutputStream = new ObjectOutputStream(fileOutputStream)) {

            objectOutputStream.writeObject(java);

        } catch (Exception e) {
            e.printStackTrace();
        }

        SerializeHadoop hadoop = new SerializeHadoop();
        hadoop.setId(1L);
        hadoop.setName("123");

        try(FileOutputStream fileOutputStream = new FileOutputStream("C:\\code\\hadoop");
            ObjectOutputStream objectOutputStream = new ObjectOutputStream(fileOutputStream)) {

            hadoop.write(objectOutputStream);

        } catch (Exception e) {
            e.printStackTrace();
        }

    }
}

@Data
class SerializeJava implements Serializable {
    private static final long serialVersionUID = 1L;

    private Long id;
    private String name;
}

@Data
class SerializeHadoop implements Writable { // 如果需要排序则实现 WritableComparable
    private Long id;
    private String name;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(this.id);
        out.writeUTF(this.name);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // 注意:读取属性的顺序要与写属性的顺序一致
        this.id = in.readLong();
        this.name = in.readUTF();
    }
}

可见使用 Writable 序列化的文件远小于 java 原生的序列化

InputFormat 分析

这是一个抽象类,是MapReduce输入数据的顶层基类。

@Public
@Stable
public abstract class InputFormat<K, V> {
    public InputFormat() {
    }
    
    // 逻辑切分输入文件
    public abstract List<InputSplit> getSplits(JobContext var1) throws IOException, InterruptedException;

    // 为每一行创建一个行阅读器。在开始分割之前会调用 RecordReader 的 initialize(InputSplit, TaskAttemptContext) 方法
    // 每一个InputSplit都有一个RecordReader,作用是把InputSplit中的数据解析成Record,即<k1, v1>
    public abstract RecordReader<K, V> createRecordReader(InputSplit var1, TaskAttemptContext var2) throws IOException, InterruptedException;
}

FileInputFormat 对getSplits 的实现如下:

public class FileInputFormat {

  /** 
   * Generate the list of files and make them into FileSplits.
   * @param job the job context
   * @throws IOException
   */
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    StopWatch sw = new StopWatch().start();
    // 获取分割任务的最小最大值
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    // 创建List,用于保存生成的 InputSplit
    List<InputSplit> splits = new ArrayList<InputSplit>();
    // 获取输入文件列表
    List<FileStatus> files = listStatus(job);

    // 标记是否忽略目录
    boolean ignoreDirs = !getInputDirRecursive(job)
      && job.getConfiguration().getBoolean(INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, false);
    // 迭代输入文件列表
    for (FileStatus file: files) {
      // 是否忽略子目录,默认不忽略
      if (ignoreDirs && file.isDirectory()) {
        continue;
      }
      // 获取文件或目录的路径
      Path path = file.getPath();
      // 获取文件或目录的长度
      long length = file.getLen();
      // 如果是文件
      if (length != 0) {
        // 用于保存文件Block块所在位置的数组
        BlockLocation[] blkLocations;
        if (file instanceof LocatedFileStatus) {
          blkLocations = ((LocatedFileStatus) file).getBlockLocations();
        } else {
          FileSystem fs = path.getFileSystem(job.getConfiguration());
          blkLocations = fs.getFileBlockLocations(file, 0, length);
        }
        // 判断文件是否支持切割,默认为true
        if (isSplitable(job, path)) {
          // 获取文件Block大小,默认128M
          long blockSize = file.getBlockSize();
          // 计算split大小,默认split逻辑切片的大小和Block size相等
          long splitSize = computeSplitSize(blockSize, minSize, maxSize);

          // 待处理的文件剩余字节大小,也就是这个文件的原始大小
          long bytesRemaining = length;
          
          // SPLIT_SLOP=1.1,当剩余文件的大小与splitSize的比值还大于1.1时旧继续切分,否则直接作为下一个InputSplit
          while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            // 生成split并添加到数组中
            splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                        blkLocations[blkIndex].getHosts(),
                        blkLocations[blkIndex].getCachedHosts()));
            bytesRemaining -= splitSize;
          }
          // 处理最后一块数据
          if (bytesRemaining != 0) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                       blkLocations[blkIndex].getHosts(),
                       blkLocations[blkIndex].getCachedHosts()));
          }
        } else { // 如果是压缩文件,不支持切割
          if (LOG.isDebugEnabled()) {
            // Log only if the file is big enough to be splitted
            if (length > Math.min(file.getBlockSize(), minSize)) {
              LOG.debug("File is not splittable so no parallelization "
                  + "is possible: " + file.getPath());
            }
          }
          // 不支持切割,则把整个文件作为一个InputSplit
          splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                      blkLocations[0].getCachedHosts()));
        }
      } else { 
        //Create empty hosts array for zero length files
        splits.add(makeSplit(path, 0, length, new String[0]));
      }
    }
    // Save the number of input files for metrics/loadgen
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    sw.stop();
    if (LOG.isDebugEnabled()) {
      LOG.debug("Total # of splits generated by getSplits: " + splits.size()
          + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
    }
    return splits;
  }
}

 TextInputFormat中的createRecordReader的实现

/*
  Q 文件会被切分为block块,如果一行文本被分割到了不同的block,如何处理?由哪一个InputSplit处理?
  A 会被前一个InputSplit处理
*/

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  // 针对每一个InputSplit都会创建一个RecordReader
  @Override
  public RecordReader<LongWritable, Text> 
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    // 获取换行符,默认没有配置
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    // 创建一个阅读器
    return new LineRecordReader(recordDelimiterBytes);
  }
    
}


public class LineRecordReader extends RecordReader<LongWritable, Text> {

  // 初始化方法,框架再调用split之前调用
  public void initialize(InputSplit genericSplit,
                         TaskAttemptContext context) throws IOException {
    // 获取传来的InputSplit,将其转为子类FileSplit
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    // MAX_LINE_LENGTH 默认没有配置,所以默认会取 Integer.MAX_VALUE
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    // 获取起始位置
    start = split.getStart();
    // 获取结束位置
    end = start + split.getLength();
    // 获取 InputSplit 的路径
    final Path file = split.getPath();

    // open the file and seek to the start of the split
    final FutureDataInputStreamBuilder builder =
        file.getFileSystem(job).openFile(file);
    FutureIOSupport.propagateOptions(builder, job,
        MRJobConfig.INPUT_FILE_OPTION_PREFIX,
        MRJobConfig.INPUT_FILE_MANDATORY_PREFIX);
    fileIn = FutureIOSupport.awaitFuture(builder.build());
    
    // 获取文件的压缩信息
    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    // 如果是压缩文件,则执行以下语句
    if (null!=codec) {
      isCompressedInput = true;
      decompressor = CodecPool.getDecompressor(codec);
      if (codec instanceof SplittableCompressionCodec) {
        final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec)codec).createInputStream(
            fileIn, decompressor, start, end,
            SplittableCompressionCodec.READ_MODE.BYBLOCK);
        in = new CompressedSplitLineReader(cIn, job,
            this.recordDelimiterBytes);
        start = cIn.getAdjustedStart();
        end = cIn.getAdjustedEnd();
        filePosition = cIn;
      } else {
        if (start != 0) {
          // So we have a split that is only part of a file stored using
          // a Compression codec that cannot be split.
          throw new IOException("Cannot seek in " +
              codec.getClass().getSimpleName() + " compressed stream");
        }

        in = new SplitLineReader(codec.createInputStream(fileIn,
            decompressor), job, this.recordDelimiterBytes);
        filePosition = fileIn;
      }
    } else { // 如果是非压缩文件则跳转到此处
      fileIn.seek(start);
      // 针对未压缩文件,创建一个阅读器读取一行一行的数据
      in = new UncompressedSplitLineReader(
          fileIn, job, this.recordDelimiterBytes, split.getLength());
      filePosition = fileIn;
    }
    // If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method. 因为 next() 会多读取一行,所以一行数据被拆分到两个InputSplit中也没有问题
    // 如果 start 不等于0,表示不是第一个InputSplit,所以把start的值重置为第二行的起始位置
    if (start != 0) {
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    this.pos = start;
  }


  // 核心方法,会被框架调用,没调用一次,就会读取一行数据,返回true表示最终获取到 <k1, v1>
  public boolean nextKeyValue() throws IOException {
    if (key == null) {
      key = new LongWritable();
    }
    // k1 就是每一行的起始位
    key.set(pos);
    if (value == null) {
      value = new Text();
    }
    int newSize = 0;
    // We always read one extra line, which lies outside the upper
    // split limit i.e. (end - 1)
    while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
      if (pos == 0) {
        newSize = skipUtfByteOrderMark();
      } else {
        // 读取一行数据,赋值给value,也就是v1
        newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
        pos += newSize;
      }

      if ((newSize == 0) || (newSize < maxLineLength)) {
        break;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + 
               (pos - newSize));
    }
    if (newSize == 0) {
      key = null;
      value = null;
      return false;
    } else {
      return true;
    }
  }
    
}

当一个InputSplit中最后一行不完整时,会从下一个InputSplit读取到完整的一行数据。

OutputFormat 分析

OutputFormat:输出数据顶层基类(抽象类)

FileOutputFormat:文件数据处理基类(抽象类,继承自 OutputFormat,重写了 checkOutputSpecs 和 getOutputCommitter )

TextOutputFormat:默认文本文件处理类(实现类,继承自 FileOutputFormat,重写了 getRecordWriter )

public abstract class OutputFormat<K, V> {

  /** 
   * Get the {@link RecordWriter} for the given task.
   */
  public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context) 
      throws IOException, InterruptedException;

  /** 
   * Check for validity of the output-specification for the job.
   *  
   * <p>This is to validate the output specification for the job when it is
   * a job is submitted.  Typically checks that it does not already exist,
   * throwing an exception when it already exists, so that output is not
   * overwritten.</p>
   *
   * Implementations which write to filesystems which support delegation
   * tokens usually collect the tokens for the destination path(s)
   * and attach them to the job context's JobConf.
   */
  public abstract void checkOutputSpecs(JobContext context)
      throws IOException, InterruptedException;

  /**
   * Get the output committer for this output format. This is responsible
   * for ensuring the output is committed correctly.
   */
  public abstract 
  OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException, InterruptedException;
}

 

一个1G的文件,会产生多少个map任务? 

1G / 128M = 8 所以会产生8个map任务。

1000个文件,每个文件100KB,会产生多少个map任务?

一个文件再小都会占用一个block块,1000个InputSplit和map任务。

 一个140M的文件,会产生多少个map任务?

 140 / 128 < 1.1  只会产生一个InputSplit,生成一个Map任务。

 

posted @ 2022-05-30 16:44  某某人8265  阅读(404)  评论(0编辑  收藏  举报