Apache Hudi Source Code Analysis - HoodieTableSink

As you can see, the sink splits into several pipelines (a small configuration sketch follows this list):

bulk_insert

append

default (upsert)

compact, clean
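For orientation, here is a minimal sketch of how a job selects among these modes, assuming the Flink SQL connector; the table name, path, and schema are hypothetical, and the write.operation option decides which pipeline the sink assembles:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HudiSinkModeDemo {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());
    tEnv.executeSql(
        "CREATE TABLE hudi_demo ("
            + "  id STRING,"
            + "  name STRING,"
            + "  `partition` STRING,"
            + "  PRIMARY KEY (id) NOT ENFORCED"
            + ") PARTITIONED BY (`partition`) WITH ("
            + "  'connector' = 'hudi',"
            + "  'path' = 'file:///tmp/hudi_demo',"
            // 'upsert' is the default; 'insert' takes the append pipeline,
            // 'bulk_insert' takes the bulk-insert pipeline (batch execution mode)
            + "  'write.operation' = 'upsert'"
            + ")");
  }
}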

 

 

 

The Javadoc comments in the Pipelines class are quite good.

bulk_insert

The input is first shuffled by partition path, then sorted by partition path, so that all records of one partition path are clustered together.

Each write task then writes one partition path at a time, which effectively reduces the number of small files. Both stages can be toggled via config options, as sketched below.
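A small sketch of toggling the two stages through the same FlinkOptions the code below reads (the defaults are true; treat the exact defaults as an assumption of this sketch):

import org.apache.flink.configuration.Configuration;
import org.apache.hudi.configuration.FlinkOptions;

public class BulkInsertConf {
  // Disabling either flag trades small-file control for raw throughput.
  public static Configuration bulkInsertConf() {
    Configuration conf = new Configuration();
    conf.setBoolean(FlinkOptions.WRITE_BULK_INSERT_SHUFFLE_INPUT, true); // shuffle by partition path
    conf.setBoolean(FlinkOptions.WRITE_BULK_INSERT_SORT_INPUT, true);    // sort by partition path
    return conf;
  }
}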

  /**
   * Bulk insert the input dataset at once.
   *
   * <p>By default, the input dataset would shuffle by the partition path first then
   * sort by the partition path before passing around to the write function.
   * The whole pipeline looks like the following:
   *
   * <pre>
   *      | input1 | ===\     /=== |sorter| === | task1 | (p1, p2)
   *                   shuffle
   *      | input2 | ===/     \=== |sorter| === | task2 | (p3, p4)
   *
   *      Note: Both input1 and input2's dataset come from partitions: p1, p2, p3, p4
   * </pre>
   *
   * <p>The write task switches to new file handle each time it receives a record
   * from the different partition path, the shuffle and sort would reduce small files.
   *
   * <p>The bulk insert should be run in batch execution mode.
   *
   * @param conf       The configuration
   * @param rowType    The input row type
   * @param dataStream The input data stream
   * @return the bulk insert data stream sink
   */

The code logic accordingly splits into three blocks: shuffle, sort, and write.

    final String[] partitionFields = FilePathUtils.extractPartitionKeys(conf);
    if (partitionFields.length > 0) {
      RowDataKeyGen rowDataKeyGen = RowDataKeyGen.instance(conf, rowType);
      if (conf.getBoolean(FlinkOptions.WRITE_BULK_INSERT_SHUFFLE_INPUT)) {

        // shuffle by partition keys
        // use #partitionCustom instead of #keyBy to avoid duplicate sort operations,
        // see BatchExecutionUtils#applyBatchExecutionSettings for details.
        Partitioner<String> partitioner = (key, channels) ->
            KeyGroupRangeAssignment.assignKeyToParallelOperator(key, StreamGraphGenerator.DEFAULT_LOWER_BOUND_MAX_PARALLELISM, channels);
        dataStream = dataStream.partitionCustom(partitioner, rowDataKeyGen::getPartitionPath);
      }
      if (conf.getBoolean(FlinkOptions.WRITE_BULK_INSERT_SORT_INPUT)) {
        SortOperatorGen sortOperatorGen = new SortOperatorGen(rowType, partitionFields);
        // sort by partition keys
        dataStream = dataStream
            .transform("partition_key_sorter",
                TypeInformation.of(RowData.class),
                sortOperatorGen.createSortOperator())
            .setParallelism(conf.getInteger(FlinkOptions.WRITE_TASKS));
        ExecNodeUtil.setManagedMemoryWeight(dataStream.getTransformation(),
            conf.getInteger(FlinkOptions.WRITE_SORT_MEMORY) * 1024L * 1024L);
      }
    }
    return dataStream
        .transform("hoodie_bulk_insert_write",
            TypeInformation.of(Object.class),
            operatorFactory) //写入operator
        // follow the parallelism of upstream operators to avoid shuffle
        .setParallelism(conf.getInteger(FlinkOptions.WRITE_TASKS))
        .addSink(DummySink.INSTANCE) //最后加上DummySink
        .name("dummy");

The operatorFactory here ultimately creates:

BulkInsertWriteFunction

  @Override
  public void processElement(I value, Context ctx, Collector<Object> out) throws IOException {
    this.writerHelper.write((RowData) value);
  }

BucketBulkInsertWriterHelper

If consecutive records share the same fileId, the file handle for lastFileId is reused directly, which is efficient; this is why the input needs to be sorted first:

  public void write(RowData tuple) throws IOException {
    try {
      RowData record = tuple.getRow(1, this.recordArity);
      String recordKey = keyGen.getRecordKey(record);
      String partitionPath = keyGen.getPartitionPath(record);
      String fileId = tuple.getString(0).toString();
      if ((lastFileId == null) || !lastFileId.equals(fileId)) {
        LOG.info("Creating new file for partition path " + partitionPath);
        handle = getRowCreateHandle(partitionPath, fileId);
        lastFileId = fileId;
      }
      handle.write(recordKey, partitionPath, record); // write a single row into the parquet file
    } catch (Throwable throwable) {
      LOG.error("Global error thrown while trying to write records in HoodieRowDataCreateHandle", throwable);
      throw throwable;
    }
  }

Otherwise getRowCreateHandle is called; the variant below, from the base writer helper, caches one handle per partition path:

  private HoodieRowDataCreateHandle getRowCreateHandle(String partitionPath) throws IOException {
    if (!handles.containsKey(partitionPath)) { // no handle cached for this partition path
      // if records are sorted, we can close all existing handles
      if (isInputSorted) {
        close(); // when sorted, earlier handles will never be reused, so close them
      }
      HoodieRowDataCreateHandle rowCreateHandle = new HoodieRowDataCreateHandle(hoodieTable, writeConfig, partitionPath, getNextFileId(),
          instantTime, taskPartitionId, taskId, taskEpochId, rowType);
      handles.put(partitionPath, rowCreateHandle); // create a new handle and cache it
    } else if (!handles.get(partitionPath).canWrite()) { // a handle exists but is already full
      // even if there is a handle to the partition path, it could have reached its max size threshold. So, we close the handle here and
      // create a new one.
      writeStatusList.add(handles.remove(partitionPath).close()); // close the current handle and collect its write status
      HoodieRowDataCreateHandle rowCreateHandle = new HoodieRowDataCreateHandle(hoodieTable, writeConfig, partitionPath, getNextFileId(),
          instantTime, taskPartitionId, taskId, taskEpochId, rowType);
      handles.put(partitionPath, rowCreateHandle); // create and cache a replacement handle
    }
    return handles.get(partitionPath); // fetch the handle from the cache
  }

Note that even here the file is written one record at a time.

It is called BulkInsert because BulkInsertWriteOperator implements BoundedOneInput:

public class BulkInsertWriteOperator<I>
    extends AbstractWriteOperator<I>
    implements BoundedOneInput {

The main piece of BoundedOneInput is the endInput function:

  /**
   * End input action for batch source.
   */
  public void endInput() {
    final List<WriteStatus> writeStatus = this.writerHelper.getWriteStatuses(this.taskID);

    final WriteMetadataEvent event = WriteMetadataEvent.builder()
        .taskID(taskID)
        .instantTime(this.writerHelper.getInstantTime())
        .writeStatus(writeStatus)
        .lastBatch(true)
        .endInput(true)
        .build();
    this.eventGateway.sendEventToCoordinator(event);
  }

As you can see, this sends an event to the coordinator to update the metadata.

So it is bulk insert in the sense that endInput is invoked once, after all data has been written, to update the metadata.

 

append

Append mode does not handle updates or deduplication,

so the flow is simple: just write.

The Javadoc below draws a shuffle, but there is no deliberate partitioning; with equal upstream and downstream parallelism there should be no actual shuffle step.

This mode can produce many small files, since there is neither partitioning by partition path nor sorting.

The lack of partitioning is the main factor; the missing sort has little impact on the file count, because new files are only created at snapshot time.

  /**
   * Insert the dataset with append mode(no upsert or deduplication).
   *
   * <p>The input dataset would be rebalanced among the write tasks:
   *
   * <pre>
   *      | input1 | ===\     /=== | task1 | (p1, p2, p3, p4)
   *                   shuffle
   *      | input2 | ===/     \=== | task2 | (p1, p2, p3, p4)
   *
   *      Note: Both input1 and input2's dataset come from partitions: p1, p2, p3, p4
   * </pre>
   *
   * <p>The write task switches to new file handle each time it receives a record
   * from the different partition path, so there may be many small files.
   *
   * @param conf       The configuration
   * @param rowType    The input row type
   * @param dataStream The input data stream
   * @param bounded    Whether the input stream is bounded
   * @return the appending data stream sink
   */

 

  public static DataStreamSink<Object> append(
      Configuration conf,
      RowType rowType,
      DataStream<RowData> dataStream,
      boolean bounded) {
    WriteOperatorFactory<RowData> operatorFactory = AppendWriteOperator.getFactory(conf, rowType);

    return dataStream
        .transform("hoodie_append_write", TypeInformation.of(Object.class), operatorFactory) //直接Append Operator写入
        .uid("uid_hoodie_stream_write" + conf.getString(FlinkOptions.TABLE_NAME))
        .setParallelism(conf.getInteger(FlinkOptions.WRITE_TASKS))
        .addSink(DummySink.INSTANCE)
        .name("dummy");
  }

 

AppendWriteFunction

The processElement logic is the same as in BulkInsert, no difference there.

What differs: besides batch mode, i.e. the BoundedOneInput interface with flushData in endInput,

Append also supports streaming, flushing data in snapshotState:

  @Override
  public void snapshotState() {
    // Based on the fact that the coordinator starts the checkpoint first,
    // it would check the validity.
    // wait for the buffer data flush out and request a new instant
    flushData(false);
  }

  @Override
  public void processElement(I value, Context ctx, Collector<Object> out) throws Exception {
    if (this.writerHelper == null) {
      initWriterHelper();
    }
    this.writerHelper.write((RowData) value);
  }

  /**
   * End input action for batch source.
   */
  public void endInput() {
    flushData(true);
    this.writeStatuses.clear();
  }

The flush logic is largely the same as in BulkInsert:

  private void flushData(boolean endInput) {
    final List<WriteStatus> writeStatus;
    final String instant;
    if (this.writerHelper != null) { // data has been written since the last flush
      writeStatus = this.writerHelper.getWriteStatuses(this.taskID);
      instant = this.writerHelper.getInstantTime();
    } else {
      writeStatus = Collections.emptyList();
      instant = instantToWrite(false);
      LOG.info("No data to write in subtask [{}] for instant [{}]", taskID, instant);
    }
    final WriteMetadataEvent event = WriteMetadataEvent.builder()
        .taskID(taskID)
        .instantTime(instant)
        .writeStatus(writeStatus)
        .lastBatch(true)
        .endInput(endInput)
        .build();
    this.eventGateway.sendEventToCoordinator(event); // send the write statuses to the coordinator
    // nullify the write helper for next ckp
    this.writerHelper = null;
    this.writeStatuses.addAll(writeStatus);
    // blocks flushing until the coordinator starts a new instant
    this.confirming = true;
  }

Note that getWriteStatuses closes the file handles,

so each snapshot produces a new batch of files:

  public List<HoodieInternalWriteStatus> getHoodieWriteStatuses() throws IOException {
    close();
    return writeStatusList;
  }

The split between the BulkInsert and Append modes is not very clean; most of their logic is duplicated:

BulkInsert shuffles and sorts by partition path, but only supports batch.

Append writes directly, but supports both streaming and batch.

Why can't they be merged into one?

 

bootstrap

There are two variants, boundedBootstrap and streamBootstrap:

  /**
   * Constructs bootstrap pipeline.
   * The bootstrap operator loads the existing data index (primary key to file id mapping),
   * then send the indexing data set to subsequent operator(usually the bucket assign operator).
   *
   * @param conf               The configuration
   * @param rowType            The row type
   * @param defaultParallelism The default parallelism
   * @param dataStream         The data stream
   * @param bounded            Whether the source is bounded
   * @param overwrite          Whether it is insert overwrite
   */
  public static DataStream<HoodieRecord> bootstrap(
      Configuration conf,
      RowType rowType,
      int defaultParallelism,
      DataStream<RowData> dataStream,
      boolean bounded,
      boolean overwrite) {
    final boolean globalIndex = conf.getBoolean(FlinkOptions.INDEX_GLOBAL_ENABLED);
    if (overwrite || OptionsResolver.isBucketIndexType(conf)) {
      return rowDataToHoodieRecord(conf, rowType, dataStream); // just convert to HoodieRecord
    } else if (bounded && !globalIndex && OptionsResolver.isPartitionedTable(conf)) {
      return boundedBootstrap(conf, rowType, defaultParallelism, dataStream);
    } else {
      return streamBootstrap(conf, rowType, defaultParallelism, dataStream, bounded);
    }
  }

streamBootstrap

  private static DataStream<HoodieRecord> streamBootstrap(
      Configuration conf,
      RowType rowType,
      int defaultParallelism,
      DataStream<RowData> dataStream,
      boolean bounded) {
    DataStream<HoodieRecord> dataStream1 = rowDataToHoodieRecord(conf, rowType, dataStream); // convert to HoodieRecord

    if (conf.getBoolean(FlinkOptions.INDEX_BOOTSTRAP_ENABLED) || bounded) {
      dataStream1 = dataStream1
          .transform(
              "index_bootstrap",
              TypeInformation.of(HoodieRecord.class),
              new BootstrapOperator<>(conf)) // use BootstrapOperator
          .setParallelism(conf.getOptional(FlinkOptions.INDEX_BOOTSTRAP_TASKS).orElse(defaultParallelism))
          .uid("uid_index_bootstrap_" + conf.getString(FlinkOptions.TABLE_NAME));
    }

    return dataStream1;
  }

BootstrapOperator

The main logic lives in initializeState; processElement merely forwards records.

The state stores lastPendingInstant, the latest instant that has not yet completed.

The core step is preLoadIndexRecords:

  @Override
  public void snapshotState(StateSnapshotContext context) throws Exception {
    lastInstantTime = this.ckpMetadata.lastPendingInstant(); // save the latest pending instant into state
    instantState.update(Collections.singletonList(lastInstantTime));
  }

  @Override
  public void initializeState(StateInitializationContext context) throws Exception {
    ListStateDescriptor<String> instantStateDescriptor = new ListStateDescriptor<>(
        "instantStateDescriptor",
        Types.STRING
    );
    instantState = context.getOperatorStateStore().getListState(instantStateDescriptor);

    if (context.isRestored()) {
      Iterator<String> instantIterator = instantState.get().iterator();
      if (instantIterator.hasNext()) {
        lastInstantTime = instantIterator.next(); // restore the last pending instant from state
      }
    }

    this.hadoopConf = StreamerUtil.getHadoopConf();
    this.writeConfig = StreamerUtil.getHoodieClientConfig(this.conf, true);
    this.hoodieTable = FlinkTables.createTable(writeConfig, hadoopConf, getRuntimeContext());
    this.ckpMetadata = CkpMetadata.getInstance(hoodieTable.getMetaClient().getFs(), this.writeConfig.getBasePath());
    this.aggregateManager = getRuntimeContext().getGlobalAggregateManager();

    preLoadIndexRecords(); // load the existing index before processing any elements
  }

  @Override
  @SuppressWarnings("unchecked")
  public void processElement(StreamRecord<I> element) throws Exception {
    output.collect((StreamRecord<O>) element);
  }

preLoadIndexRecords

For every partition under this table whose path matches the pattern, i.e. whose index needs loading, loadRecords is called.

Then it waits until all parallel tasks finish loading records; initialization does not return before this step completes, so the job does not really start until the index is fully loaded. (A sketch of the aggregation behind waitForBootstrapReady follows the code below.)

  /**
   * Load the index records before {@link #processElement}.
   */
  protected void preLoadIndexRecords() throws Exception {
    String basePath = hoodieTable.getMetaClient().getBasePath();
    int taskID = getRuntimeContext().getIndexOfThisSubtask();
    LOG.info("Start loading records in table {} into the index state, taskId = {}", basePath, taskID);
    for (String partitionPath : FSUtils.getAllFoldersWithPartitionMetaFile(FSUtils.getFs(basePath, hadoopConf), basePath)) {
      if (pattern.matcher(partitionPath).matches()) {
        loadRecords(partitionPath);
      }
    }
    // wait for the other bootstrap tasks to finish bootstrapping.
    waitForBootstrapReady(getRuntimeContext().getIndexOfThisSubtask());
  }

  /**
   * Wait for other bootstrap tasks to finish the index bootstrap.
   */
  private void waitForBootstrapReady(int taskID) {
    int taskNum = getRuntimeContext().getNumberOfParallelSubtasks();
    int readyTaskNum = 1;
    while (taskNum != readyTaskNum) {
      try {
        readyTaskNum = aggregateManager.updateGlobalAggregate(BootstrapAggFunction.NAME, taskID, new BootstrapAggFunction());
        LOG.info("Waiting for other bootstrap tasks to complete, taskId = {}.", taskID);

        TimeUnit.SECONDS.sleep(5);
      } catch (Exception e) {
        LOG.warn("Update global task bootstrap summary error", e);
      }
    }
  }
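BootstrapAggFunction itself is not shown here; conceptually it is a Flink AggregateFunction, driven through the GlobalAggregateManager, that counts how many distinct tasks have reported ready. A minimal illustrative version (my own sketch, not Hudi's actual class):

import java.util.HashSet;
import java.util.Set;
import org.apache.flink.api.common.functions.AggregateFunction;

// Each task reports its own task ID; the global aggregate returns how many
// distinct tasks have reported so far. waitForBootstrapReady loops until
// this count equals the parallelism.
public class ReadyTasksAggFunction implements AggregateFunction<Integer, Set<Integer>, Integer> {
  @Override
  public Set<Integer> createAccumulator() {
    return new HashSet<>();
  }

  @Override
  public Set<Integer> add(Integer taskId, Set<Integer> acc) {
    acc.add(taskId);
    return acc;
  }

  @Override
  public Integer getResult(Set<Integer> acc) {
    return acc.size();
  }

  @Override
  public Set<Integer> merge(Set<Integer> a, Set<Integer> b) {
    a.addAll(b);
    return a;
  }
}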

loadRecords

  protected void loadRecords(String partitionPath) throws Exception {

    HoodieTimeline commitsTimeline = this.hoodieTable.getMetaClient().getCommitsTimeline();
    if (!StringUtils.isNullOrEmpty(lastInstantTime)) {
      // lastInstantTime is restored from state; only read the timeline after it
      commitsTimeline = commitsTimeline.findInstantsAfter(lastInstantTime);
    }
    // the latest completed instant on the timeline
    Option<HoodieInstant> latestCommitTime = commitsTimeline.filterCompletedInstants().lastInstant();

    if (latestCommitTime.isPresent()) {
      BaseFileUtils fileUtils = BaseFileUtils.getInstance(this.hoodieTable.getBaseFileFormat());
      Schema schema = new TableSchemaResolver(this.hoodieTable.getMetaClient()).getTableAvroSchema();

      List<FileSlice> fileSlices = this.hoodieTable.getSliceView()
          .getLatestFileSlicesBeforeOrOn(partitionPath, latestCommitTime.get().getTimestamp(), true)
          .collect(toList()); // slices before-or-on the latest commit; anything later is uncommitted

      for (FileSlice fileSlice : fileSlices) {
        if (!shouldLoadFile(fileSlice.getFileId(), maxParallelism, parallelism, taskID)) {
          continue;
        }
        LOG.info("Load records from {}.", fileSlice);

        // load parquet records: read the base file first
        fileSlice.getBaseFile().ifPresent(baseFile -> {
          // filter out crushed files
          if (!isValidFile(baseFile.getFileStatus())) {
            return;
          }
          try (ClosableIterator<HoodieKey> iterator = fileUtils.getHoodieKeyIterator(this.hadoopConf, new Path(baseFile.getPath()))) {
            iterator.forEachRemaining(hoodieKey -> {
              // emit an IndexRecord per HoodieKey, mapping every record key of this file slice to the slice
              output.collect(new StreamRecord(new IndexRecord(generateHoodieRecord(hoodieKey, fileSlice))));
            });
          }
        });

        // load avro log records next
        List<String> logPaths = fileSlice.getLogFiles()
            // filter out crushed files
            .filter(logFile -> isValidFile(logFile.getFileStatus()))
            .map(logFile -> logFile.getPath().toString())
            .collect(toList());
        // scanner over the merged log files
        HoodieMergedLogRecordScanner scanner = FormatUtils.logScanner(logPaths, schema, latestCommitTime.get().getTimestamp(),
            writeConfig, hadoopConf);

        try { // same as the base file: emit IndexRecords
          for (String recordKey : scanner.getRecords().keySet()) {
            output.collect(new StreamRecord(new IndexRecord(generateHoodieRecord(new HoodieKey(recordKey, partitionPath), fileSlice))));
          }
        } catch (Exception e) {
          throw new HoodieException(String.format("Error when loading record keys from files: %s", logPaths), e);
        } finally {
          scanner.close();
        }
      }
    }
  }

 

boundedBootstrap

  /**
   * Constructs bootstrap pipeline for batch execution mode.
   * The indexing data set is loaded before the actual data write
   * in order to support batch UPSERT.
   */
  private static DataStream<HoodieRecord> boundedBootstrap(
      Configuration conf,
      RowType rowType,
      int defaultParallelism,
      DataStream<RowData> dataStream) {
    final RowDataKeyGen rowDataKeyGen = RowDataKeyGen.instance(conf, rowType);
    // shuffle by partition keys
    dataStream = dataStream
        .keyBy(rowDataKeyGen::getPartitionPath); // key by partition path

    return rowDataToHoodieRecord(conf, rowType, dataStream) // convert to HoodieRecord
        .transform(
            "batch_index_bootstrap",
            TypeInformation.of(HoodieRecord.class),
            new BatchBootstrapOperator<>(conf)) // use BatchBootstrapOperator
        .setParallelism(conf.getOptional(FlinkOptions.INDEX_BOOTSTRAP_TASKS).orElse(defaultParallelism))
        .uid("uid_batch_index_bootstrap_" + conf.getString(FlinkOptions.TABLE_NAME));
  }

BatchBootstrapOperator

The difference from BootstrapOperator above:

here preLoadIndexRecords is a no-op,

and instead, each record's partition path is loaded on demand in processElement:

  @Override
  protected void preLoadIndexRecords() {
    // no operation
  }

  @Override
  @SuppressWarnings("unchecked")
  public void processElement(StreamRecord<I> element) throws Exception {
    final HoodieRecord<?> record = (HoodieRecord<?>) element.getValue();
    final String partitionPath = record.getKey().getPartitionPath();

    if (haveSuccessfulCommits && !partitionPathSet.contains(partitionPath)) {
      loadRecords(partitionPath);
      partitionPathSet.add(partitionPath);
    }

    // send the trigger record
    output.collect((StreamRecord<O>) element);
  }

So in the end, besides the normal record stream, bootstrap also outputs an IndexRecord stream.

 

hoodieStreamWrite

First shuffle by record key,

then assign a bucket to each record,

then shuffle again by bucket id, i.e. file group, so that a file group is only ever written by a single task, avoiding write conflicts.

  /**
   * The streaming write pipeline.
   *
   * <p>The input dataset shuffles by the primary key first then
   * shuffles by the file group ID before passing around to the write function.
   * The whole pipeline looks like the following:
   *
   * <pre>
   *      | input1 | ===\     /=== | bucket assigner | ===\     /=== | task1 |
   *                   shuffle(by PK)                    shuffle(by bucket ID)
   *      | input2 | ===/     \=== | bucket assigner | ===/     \=== | task2 |
   *
   *      Note: a file group must be handled by one write task to avoid write conflict.
   * </pre>
   *
   * <p>The bucket assigner assigns the inputs to suitable file groups, the write task caches
   * and flushes the data set to disk.
   *
   * @param conf               The configuration
   * @param defaultParallelism The default parallelism
   * @param dataStream         The input data stream
   * @return the stream write data stream pipeline
   */

The code logic is quite clear:

      WriteOperatorFactory<HoodieRecord>
              operatorFactory = StreamWriteOperator.getFactory(conf);
      return dataStream
          // Key-by record key, to avoid multiple subtasks write to a bucket at the same time
          .keyBy(HoodieRecord::getRecordKey) // shuffle by record key
          .transform(
              "bucket_assigner",
              TypeInformation.of(HoodieRecord.class),
              new KeyedProcessOperator<>(new BucketAssignFunction<>(conf))) // assign buckets
          .uid("uid_bucket_assigner_" + conf.getString(FlinkOptions.TABLE_NAME))
          .setParallelism(conf.getOptional(FlinkOptions.BUCKET_ASSIGN_TASKS).orElse(defaultParallelism))
          // shuffle by fileId(bucket id)
          .keyBy(record -> record.getCurrentLocation().getFileId()) // shuffle by bucket id
          .transform("stream_write", TypeInformation.of(Object.class), operatorFactory) // streaming write
          .uid("uid_stream_write" + conf.getString(FlinkOptions.TABLE_NAME))
          .setParallelism(conf.getInteger(FlinkOptions.WRITE_TASKS));

BucketAssignFunction

Two streams arrive here: the IndexRecord stream coming from bootstrap, and the normal data record stream.

For IndexRecords the logic is simple: store each record key's location in keyed state via the KeyedStateStore (a minimal sketch of this pattern follows).
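A minimal sketch of that keyed-state pattern (simplified types: the hypothetical IndexEvent and a String fileId stand in for Hudi's IndexRecord and HoodieRecordGlobalLocation; the stream is assumed keyed by record key):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keyed by record key: IndexEvent elements write the location into state,
// data records read it back to decide between UPDATE and INSERT.
public class IndexStateSketch extends KeyedProcessFunction<String, Object, Object> {
  private transient ValueState<String> indexState; // simplified: fileId as String

  @Override
  public void open(Configuration parameters) {
    indexState = getRuntimeContext().getState(
        new ValueStateDescriptor<>("indexState", String.class));
  }

  @Override
  public void processElement(Object value, Context ctx, Collector<Object> out) throws Exception {
    if (value instanceof IndexEvent) {        // bootstrap stream
      indexState.update(((IndexEvent) value).fileId);
    } else {                                  // data stream
      String oldFileId = indexState.value();
      // oldFileId == null -> INSERT; otherwise UPDATE against oldFileId
      out.collect(value);
    }
  }

  public static class IndexEvent {
    public String fileId;
  }
}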

 

 

For normal data records, there are several cases.

If the record's current partition path differs from the one stored in the index:

    - with a global index, first emit a delete record for the old partition path, then call getNewRecordLocation for the current partition path, i.e. assign a new file group;

    - without a global index, the index is scoped to each partition and other partitions can be ignored, so getNewRecordLocation is called directly.

If the location is unchanged, it is simply an update:

  private void processRecord(HoodieRecord<?> record, Collector<O> out) throws Exception {
    // 1. put the record into the BucketAssigner;
    // 2. look up the state for location, if the record has a location, just send it out;
    // 3. if it is an INSERT, decide the location using the BucketAssigner then send it out.
    final HoodieKey hoodieKey = record.getKey();
    final String recordKey = hoodieKey.getRecordKey();
    final String partitionPath = hoodieKey.getPartitionPath();
    final HoodieRecordLocation location;

    // Only changing records need looking up the index for the location,
    // append only records are always recognized as INSERT.
    HoodieRecordGlobalLocation oldLoc = indexState.value();
    if (isChangingRecords && oldLoc != null) {
      // Set up the instant time as "U" to mark the bucket as an update bucket.
      if (!Objects.equals(oldLoc.getPartitionPath(), partitionPath)) {
        if (globalIndex) {
          // if partition path changes, emit a delete record for old partition path,
          // then update the index state using location with new partition path.
          HoodieRecord<?> deleteRecord = new HoodieAvroRecord<>(new HoodieKey(recordKey, oldLoc.getPartitionPath()),
              payloadCreation.createDeletePayload((BaseAvroPayload) record.getData()));
          deleteRecord.setCurrentLocation(oldLoc.toLocal("U"));
          deleteRecord.seal();
          out.collect((O) deleteRecord);
        }
        location = getNewRecordLocation(partitionPath);
      } else {
        location = oldLoc.toLocal("U");
        this.bucketAssigner.addUpdate(partitionPath, location.getFileId());
      }
    } else {
      location = getNewRecordLocation(partitionPath);
    }
    // always refresh the index
    if (isChangingRecords) {
      updateIndexState(partitionPath, location);
    }
    record.setCurrentLocation(location);
    out.collect((O) record);
  }

getNewRecordLocation in fact calls bucketAssigner.addInsert.

Next, let's look at bucketAssigner.addInsert and bucketAssigner.addUpdate.

addInsert

Note that newFileAssignStates and bucketInfoMap are cleared on every checkpoint (per the comments in the code: bucketInfoMap when the checkpoint starts, newFileAssignStates when the checkpoint-success notification arrives),

because all file handles are closed at checkpoint time:

  public BucketInfo addInsert(String partitionPath) {
    // for new inserts, compute buckets depending on how many records we have for each partition
    SmallFileAssign smallFileAssign = getSmallFileAssign(partitionPath);

    // first look for a small file; if one can still take records, assign to it as an UPDATE
    if (smallFileAssign != null && smallFileAssign.assign()) {
      return new BucketInfo(BucketType.UPDATE, smallFileAssign.getFileId(), partitionPath);
    }

    // check whether this partition path already has an assignable new file; if so, reuse it as an UPDATE
    if (newFileAssignStates.containsKey(partitionPath)) {
      NewFileAssignState newFileAssignState = newFileAssignStates.get(partitionPath);
      if (newFileAssignState.canAssign()) {
        newFileAssignState.assign();
        final String key = StreamerUtil.generateBucketKey(partitionPath, newFileAssignState.fileId);
        if (bucketInfoMap.containsKey(key)) {
          // the newFileAssignStates is cleaned asynchronously when received the checkpoint success notification,
          // the records processed within the time range:
          // (start checkpoint, checkpoint success(and instant committed))
          // should still be assigned to the small buckets of last checkpoint instead of new one.

          // the bucketInfoMap is cleaned when checkpoint starts.

          // A promotion: when the HoodieRecord can record whether it is an UPDATE or INSERT,
          // we can always return an UPDATE BucketInfo here, and there is no need to record the
          // UPDATE bucket through calling #addUpdate.
          return bucketInfoMap.get(key);
        }
        return new BucketInfo(BucketType.UPDATE, newFileAssignState.fileId, partitionPath);
      }
    }
    // otherwise create a brand-new INSERT bucket
    BucketInfo bucketInfo = new BucketInfo(BucketType.INSERT, createFileIdOfThisTask(), partitionPath);
    final String key = StreamerUtil.generateBucketKey(partitionPath, bucketInfo.getFileIdPrefix());
    bucketInfoMap.put(key, bucketInfo);
    NewFileAssignState newFileAssignState = new NewFileAssignState(bucketInfo.getFileIdPrefix(), writeProfile.getRecordsPerBucket());
    newFileAssignState.assign();
    newFileAssignStates.put(partitionPath, newFileAssignState);
    return bucketInfo;
  }

addUpdate

The logic here is simple: for an update the fileId is already known, so the BucketInfo can be built directly:

  public BucketInfo addUpdate(String partitionPath, String fileIdHint) {
    final String key = StreamerUtil.generateBucketKey(partitionPath, fileIdHint);
    if (!bucketInfoMap.containsKey(key)) {
      BucketInfo bucketInfo = new BucketInfo(BucketType.UPDATE, fileIdHint, partitionPath);
      bucketInfoMap.put(key, bucketInfo);
    }
    // else do nothing because the bucket already exists.
    return bucketInfoMap.get(key);
  }

 

StreamWriteFunction

The core logic here is very simple:

processElement only buffers the record,

and all buckets are flushed when a checkpoint is taken:

  @Override
  public void open(Configuration parameters) throws IOException {
    this.tracer = new TotalSizeTracer(this.config);
    initBuffer();
    initWriteFunction();
  }

  @Override
  public void snapshotState() {
    // Based on the fact that the coordinator starts the checkpoint first,
    // it would check the validity.
    // wait for the buffer data flush out and request a new instant
    flushRemaining(false);
  }

  @Override
  public void processElement(I value, ProcessFunction<I, Object>.Context ctx, Collector<Object> out) throws Exception {
    bufferRecord((HoodieRecord<?>) value);
  }

bufferRecord

Adds the record carried by the element to the buffer of its bucket,

then performs two extra checks, flushBucket and flushBuffer:

if the bucket grows too large, that bucket is flushed; if the whole buffer is too large, the largest bucket is found and flushed.

  /**
   * Buffers the given record.
   *
   * <p>Flush the data bucket first if the bucket records size is greater than
   * the configured value {@link FlinkOptions#WRITE_BATCH_SIZE}.
   *
   * <p>Flush the max size data bucket if the total buffer size exceeds the configured
   * threshold {@link FlinkOptions#WRITE_TASK_MAX_SIZE}.
   *
   * @param value HoodieRecord
   */
  protected void bufferRecord(HoodieRecord<?> value) {
    final String bucketID = getBucketID(value);

    DataBucket bucket = this.buckets.computeIfAbsent(bucketID,
        k -> new DataBucket(this.config.getDouble(FlinkOptions.WRITE_BATCH_SIZE), value));
    final DataItem item = DataItem.fromHoodieRecord(value);

    bucket.records.add(item); // buffer the record into its bucket

    boolean flushBucket = bucket.detector.detect(item); // bucket exceeded FlinkOptions#WRITE_BATCH_SIZE?
    boolean flushBuffer = this.tracer.trace(bucket.detector.lastRecordSize); // total buffer exceeded FlinkOptions#WRITE_TASK_MAX_SIZE?
    if (flushBucket) {
      if (flushBucket(bucket)) {
        this.tracer.countDown(bucket.detector.totalSize);
        bucket.reset();
      }
    } else if (flushBuffer) {
      // find the max size bucket and flush it out
      List<DataBucket> sortedBuckets = this.buckets.values().stream()
          .sorted((b1, b2) -> Long.compare(b2.detector.totalSize, b1.detector.totalSize))
          .collect(Collectors.toList());
      final DataBucket bucketToFlush = sortedBuckets.get(0);
      if (flushBucket(bucketToFlush)) {
        this.tracer.countDown(bucketToFlush.detector.totalSize);
        bucketToFlush.reset();
      } else {
        LOG.warn("The buffer size hits the threshold {}, but still flush the max size data bucket failed!", this.tracer.maxBufferSize);
      }
    }
  }

flushRemaining

  private void flushRemaining(boolean endInput) {
    this.currentInstant = instantToWrite(hasData()); // obtain the instant to write to

    final List<WriteStatus> writeStatus;
    if (buckets.size() > 0) {
      writeStatus = new ArrayList<>();
      this.buckets.values()
          // The records are partitioned by the bucket ID and each batch sent to
          // the writer belongs to one bucket.
          .forEach(bucket -> {
            List<HoodieRecord> records = bucket.writeBuffer();
            if (records.size() > 0) {
              if (config.getBoolean(FlinkOptions.PRE_COMBINE)) {
                records = FlinkWriteHelper.newInstance().deduplicateRecords(records, (HoodieIndex) null, -1); // deduplicate among the records buffered in this bucket
              }
              bucket.preWrite(records);
              writeStatus.addAll(writeFunction.apply(records, currentInstant)); // the actual write happens through writeFunction
              records.clear();
              bucket.reset();
            }
          });
    } else {
      LOG.info("No data to write in subtask [{}] for instant [{}]", taskID, currentInstant);
      writeStatus = Collections.emptyList();
    }
    final WriteMetadataEvent event = WriteMetadataEvent.builder()
        .taskID(taskID)
        .instantTime(currentInstant)
        .writeStatus(writeStatus)
        .lastBatch(true)
        .endInput(endInput)
        .build(); // send the write statuses to the coordinator to update the metadata

    this.eventGateway.sendEventToCoordinator(event);
    this.buckets.clear();
    this.tracer.reset();
    this.writeClient.cleanHandles();
    this.writeStatuses.addAll(writeStatus);
    // blocks flushing until the coordinator starts a new instant
    this.confirming = true;
  }
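A note on the deduplication step above: FlinkWriteHelper.deduplicateRecords essentially groups the buffered records by record key and pre-combines duplicates. A simplified sketch of the idea (hypothetical Record type with an explicit ordering field; Hudi's real version delegates to the payload's preCombine):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class PreCombineSketch {
  // Hypothetical record: key plus an ordering field used to pick the winner.
  public static final class Record {
    final String recordKey;
    final long orderingValue;
    Record(String recordKey, long orderingValue) {
      this.recordKey = recordKey;
      this.orderingValue = orderingValue;
    }
  }

  // Keep, per record key, the record with the largest ordering value,
  // mirroring what a typical payload preCombine does.
  public static List<Record> deduplicate(List<Record> records) {
    Map<String, Record> byKey = new LinkedHashMap<>();
    for (Record r : records) {
      byKey.merge(r.recordKey, r,
          (oldRec, newRec) -> oldRec.orderingValue >= newRec.orderingValue ? oldRec : newRec);
    }
    return new ArrayList<>(byKey.values());
  }
}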

As for the writeFunction, let's pick one path and trace it down:

HoodieFlinkWriteClient.upsert

HoodieFlinkCopyOnWriteTable.upsert

FlinkUpsertCommitActionExecutor.execute

FlinkWriteHelper.write

BaseFlinkCommitActionExecutor.execute

 protected Iterator<List<WriteStatus>> handleUpdateInternal(HoodieMergeHandle<?, ?, ?, ?> upsertHandle, String fileId)
      throws IOException {
    if (upsertHandle.getOldFilePath() == null) {
      throw new HoodieUpsertException(
          "Error in finding the old file path at commit " + instantTime + " for fileId: " + fileId);
    } else {
      FlinkMergeHelper.newInstance().runMerge(table, upsertHandle); // the merge ultimately happens here
    }
    return Collections.singletonList(upsertHandle.writeStatuses()).iterator();
  }

runMerge

The producer reads records through readerIterator;

the consumer feeds them into the mergeHandle:

      ThreadLocal<BinaryEncoder> encoderCache = new ThreadLocal<>();
      ThreadLocal<BinaryDecoder> decoderCache = new ThreadLocal<>();
      // producer-consumer pattern
      wrapper = new BoundedInMemoryExecutor<>(table.getConfig().getWriteBufferLimitBytes(), new IteratorBasedQueueProducer<>(readerIterator),
          Option.of(new UpdateHandler(mergeHandle)), record -> {
        if (!externalSchemaTransformation) {
          return record;
        }
        return transformRecordBasedOnNewSchema(gReader, gWriter, encoderCache, decoderCache, (GenericRecord) record);
      });
      wrapper.execute();
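So BoundedInMemoryExecutor decouples reading the old file (producer) from writing the merged result (consumer) through a bounded in-memory queue. A bare-bones JDK sketch of the same pattern (illustrative only, not Hudi's actual executor):

import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

public final class BoundedQueueSketch {
  private static final Object EOF = new Object(); // end-of-stream marker

  // Producer drains the iterator into a bounded queue; consumer applies
  // the handler (the analogue of UpdateHandler feeding mergeHandle).
  public static <T> void run(Iterator<T> reader, Consumer<T> handler) throws InterruptedException {
    BlockingQueue<Object> queue = new ArrayBlockingQueue<>(1024); // bounded buffer
    Thread producer = new Thread(() -> {
      try {
        while (reader.hasNext()) {
          queue.put(reader.next()); // blocks when the buffer is full
        }
        queue.put(EOF);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    producer.start();
    Object item;
    while ((item = queue.take()) != EOF) { // blocks when the buffer is empty
      @SuppressWarnings("unchecked")
      T record = (T) item;
      handler.accept(record);
    }
    producer.join();
  }
}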

mergeHandle

The merge handle is created in HoodieFlinkWriteClient.upsert; see getOrCreateWriteHandle:

    final boolean isDelta = table.getMetaClient().getTableType().equals(HoodieTableType.MERGE_ON_READ);
    final HoodieWriteHandle<?, ?, ?, ?> writeHandle;
    if (isDelta) { // MERGE_ON_READ: append to log files
      writeHandle = new FlinkAppendHandle<>(config, instantTime, table, partitionPath, fileID, recordItr,
          table.getTaskContextSupplier());
    } else if (loc.getInstantTime().equals("I")) { //insert
      writeHandle = new FlinkCreateHandle<>(config, instantTime, table, partitionPath,
          fileID, table.getTaskContextSupplier());
    } else { // update: merge into the existing base file
      writeHandle = insertClustering
          ? new FlinkConcatHandle<>(config, instantTime, table, recordItr, partitionPath,
              fileID, table.getTaskContextSupplier())
          : new FlinkMergeHandle<>(config, instantTime, table, recordItr, partitionPath,
              fileID, table.getTaskContextSupplier());
    }

There are several handle types:

FlinkAppendHandle, FlinkCreateHandle, and FlinkMergeHandle, which largely reuse HoodieAppendHandle, HoodieCreateHandle, and HoodieMergeHandle.

 

StreamWriteOperatorCoordinator

Being a coordinator, it first has to handle the events sent by the operators.

There are three kinds: endInput, bootstrap, and writeMeta:

  @Override
  public void handleEventFromOperator(int i, OperatorEvent operatorEvent) {
    ValidationUtils.checkState(operatorEvent instanceof WriteMetadataEvent,
        "The coordinator can only handle WriteMetaEvent");
    WriteMetadataEvent event = (WriteMetadataEvent) operatorEvent;

    if (event.isEndInput()) {
      // handle end input event synchronously
      handleEndInputEvent(event);
    } else {
      executor.execute(
          () -> {
            if (event.isBootstrap()) {
              handleBootstrapEvent(event);
            } else {
              handleWriteMetaEvent(event);
            }
          }, "handle write metadata event for instant %s", this.instant
      );
    }
  }

The eventBuffer is an array whose size equals the task parallelism, one slot per task.

So the eventBuffer makes it easy to see the event status of every task:

this.eventBuffer = new WriteMetadataEvent[this.parallelism];

The logic of the three event handlers:

  private void handleBootstrapEvent(WriteMetadataEvent event) {
    this.eventBuffer[event.getTaskID()] = event; // put the event into its task's slot
    if (Arrays.stream(eventBuffer).allMatch(evt -> evt != null && evt.isBootstrap())) { // every task has sent its bootstrap event
      // start to initialize the instant.
      initInstant(event.getInstantTime()); // initialize a new instant
    }
  }

  private void handleEndInputEvent(WriteMetadataEvent event) {
    addEventToBuffer(event); // buffer the event
    if (allEventsReceived()) { // every task has sent its endInput event
      // start to commit the instant.
      commitInstant(this.instant); // commit this instant
      // The executor thread inherits the classloader of the #handleEventFromOperator
      // caller, which is a AppClassLoader.
      Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
      // sync Hive synchronously if it is enabled in batch mode.
      syncHive();
    }
  }

  private void handleWriteMetaEvent(WriteMetadataEvent event) {
    // the write task does not block after checkpointing(and before it receives a checkpoint success event),
    // if it checkpoints succeed then flushes the data buffer again before this coordinator receives a checkpoint
    // success event, the data buffer would flush with an older instant time.
    ValidationUtils.checkState(
        HoodieTimeline.compareTimestamps(this.instant, HoodieTimeline.GREATER_THAN_OR_EQUALS, event.getInstantTime()),
        String.format("Receive an unexpected event for instant %s from task %d",
            event.getInstantTime(), event.getTaskID()));

    addEventToBuffer(event); // plain write-metadata events are simply buffered
  }

Next,

initInstant

  private void initInstant(String instant) {
    HoodieTimeline completedTimeline =
        StreamerUtil.createMetaClient(conf).getActiveTimeline().filterCompletedInstants(); // all completed instants
    executor.execute(() -> {
      if (instant.equals("") || completedTimeline.containsInstant(instant)) {
        // the last instant committed successfully
        reset(); // the instant has already completed, so just clear the eventBuffer
      } else {
        LOG.info("Recommit instant {}", instant);
        commitInstant(instant); // not completed yet: recommit it first
      }
      // starts a new instant
      startInstant(); // then start a fresh instant
      // upgrade downgrade
      this.writeClient.upgradeDowngrade(this.instant);
    }, "initialize instant %s", instant);
  }

startInstant

  private void startInstant() {
    // put the assignment in front of metadata generation,
    // because the instant request from write task is asynchronous.
    this.instant = this.writeClient.startCommit(tableState.commitAction, this.metaClient); // create a new instant on the timeline
    this.metaClient.getActiveTimeline().transitionRequestedToInflight(tableState.commitAction, this.instant); // transition the instant from requested to inflight
    this.ckpMetadata.startInstant(this.instant); // write the start-instant checkpoint metadata file
  }

commitInstant is triggered by the endInput event, i.e. for batch writes (and, as notifyCheckpointComplete below shows, on every completed checkpoint in streaming mode):

  private boolean commitInstant(String instant, long checkpointId) {

    List<WriteStatus> writeResults = Arrays.stream(eventBuffer)
        .filter(Objects::nonNull)
        .map(WriteMetadataEvent::getWriteStatuses)
        .flatMap(Collection::stream)
        .collect(Collectors.toList()); // extract the write results from the eventBuffer

    if (writeResults.size() == 0) { // nothing to commit, just send the ack
      sendCommitAckEvents(checkpointId);
      return false;
    }
    doCommit(instant, writeResults); // commit
    return true;
  }

doCommit

  private void doCommit(String instant, List<WriteStatus> writeResults) {
      final Map<String, List<String>> partitionToReplacedFileIds = tableState.isOverwrite
          ? writeClient.getPartitionToReplacedFileIds(tableState.operationType, writeResults)
          : Collections.emptyMap();
      boolean success = writeClient.commit(instant, writeResults, Option.of(checkpointCommitMetadata),
          tableState.commitAction, partitionToReplacedFileIds); // commit the metadata
      if (success) {
        reset();
        this.ckpMetadata.commitInstant(instant); // write the commit-instant checkpoint metadata file
        LOG.info("Commit instant [{}] success!", instant);
      } else {
        throw new HoodieException(String.format("Commit instant [%s] failed!", instant));
      }
  }

HoodieFlinkWriteClient

  public boolean commit(String instantTime, List<WriteStatus> writeStatuses, Option<Map<String, String>> extraMetadata, String commitActionType, Map<String, List<String>> partitionToReplacedFileIds) {
    List<HoodieWriteStat> writeStats = writeStatuses.parallelStream().map(WriteStatus::getStat).collect(Collectors.toList());
    return commitStats(instantTime, writeStats, extraMetadata, commitActionType, partitionToReplacedFileIds);
  }

BaseHoodieWriteClient

  public boolean commitStats(String instantTime, List<HoodieWriteStat> stats, Option<Map<String, String>> extraMetadata,
                             String commitActionType, Map<String, List<String>> partitionToReplaceFileIds) {
    // Create a Hoodie table which encapsulated the commits and files visible
    HoodieTable table = createTable(config, hadoopConf);
    HoodieCommitMetadata metadata = CommitUtils.buildMetadata(stats, partitionToReplaceFileIds,
        extraMetadata, operationType, config.getWriteSchema(), commitActionType);
    HoodieInstant inflightInstant = new HoodieInstant(State.INFLIGHT, table.getMetaClient().getCommitActionType(), instantTime);
    HeartbeatUtils.abortIfHeartbeatExpired(instantTime, table, heartbeatClient, config);
    this.txnManager.beginTransaction(Option.of(inflightInstant),
        lastCompletedTxnAndMetadata.isPresent() ? Option.of(lastCompletedTxnAndMetadata.get().getLeft()) : Option.empty());
    try {
      preCommit(inflightInstant, metadata); // mainly resolves write conflicts
      commit(table, commitActionType, instantTime, metadata, stats); // commit: persist the instant on the timeline
      // delete markers, auto clean, auto archive
      postCommit(table, metadata, instantTime, extraMetadata, false);
      LOG.info("Committed " + instantTime);
      releaseResources();
    } catch (IOException e) {
      throw new HoodieCommitException("Failed to complete commit " + config.getBasePath() + " at time " + instantTime, e);
    } finally {
      this.txnManager.endTransaction(Option.of(inflightInstant)); // metadata changes are guarded by the transaction lock
    }
    // do this outside of lock since compaction, clustering can be time taking and we don't need a lock for the entire execution period
    runTableServicesInline(table, metadata, extraMetadata); // run configured table services such as compaction or clustering
    emitCommitMetrics(instantTime, metadata, commitActionType);

 

Second, the coordinator must respond to checkpoints.

For streaming writes, every completed checkpoint commits the current instant and starts a new one:

  @Override
  public void notifyCheckpointComplete(long checkpointId) {
    executor.execute(
        () -> {
          // The executor thread inherits the classloader of the #notifyCheckpointComplete
          // caller, which is a AppClassLoader.
          Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
          // for streaming mode, commits the ever received events anyway,
          // the stream write task snapshot and flush the data buffer synchronously in sequence,
          // so a successful checkpoint subsumes the old one(follows the checkpoint subsuming contract)
          final boolean committed = commitInstant(this.instant, checkpointId); // commit the instant on checkpoint completion

          if (tableState.scheduleCompaction) {
            // if async compaction is on, schedule the compaction
            CompactionUtil.scheduleCompaction(metaClient, writeClient, tableState.isDeltaTimeCompaction, committed);
          }

          if (committed) {
            // start new instant.
            startInstant(); // start the next instant
            // sync Hive if is enabled
            syncHiveAsync();
          }
        }, "commits the instant %s", this.instant
    );
  }

  @Override
  public void notifyCheckpointAborted(long checkpointId) {
    if (checkpointId == this.checkpointId) {
      executor.execute(() -> {
        this.ckpMetadata.abortInstant(this.instant); // mark the instant aborted in the checkpoint metadata
      }, "abort instant %s", this.instant);
    }
  }

 
