[Apache Doris] Apache Doris 元数据设计及DDL操作源码阅读

元数据设计

image

如上图,Doris 的元数据主要存储4类数据:

  • 用户数据信息。包括数据库、表的 Schema、分片信息等。
  • 各类作业信息。如导入作业,Clone 作业、SchemaChange 作业等。
  • 用户及权限信息
  • 集群及节点信息

image

元数据目录

元数据目录通过 FE 的配置项 meta_dir 指定。
bdb/ 目录下为 bdbje 的数据存放目录。
image/ 目录下为 image 文件的存放目录。
image.[logid] 是最新的 image 文件。后缀 logid 表明 image 所包含的最后一条日志的 id。

image.ckpt 是正在写入的 image 文件,如果写入成功,会重命名为 image.[logid],并替换掉旧的 image 文件。

VERSION 文件中记录着 cluster_id。cluster_id 唯一标识一个 Doris 集群。是在 leader 第一次启动时随机生成的一个 32 位整型。也可以通过 fe 配置项 cluster_id 来指定一个 cluster id。

ROLE 文件中记录的 FE 自身的角色。只有 FOLLOWER 和 OBSERVER 两种。其中 FOLLOWER 表示 FE 为一个可选举的节点。(注意:即使是 leader 节点,其角色也为 FOLLOWER)

DDL相关源代码阅读

启动MySQL服务

org.apache.doris.qe.QeService

if (nioEnabled) {
    mysqlServer = new NMysqlServer(port, scheduler);
} else {
    mysqlServer = new MysqlServer(port, scheduler);
}

DDL代码调用过程

image

org.apache.doris.qe.ConnectProcessor#dispatch 命令识别

switch (command) {
    case COM_INIT_DB:
        handleInitDb();
        break;
    case COM_QUIT:
        handleQuit();
        break;
    case COM_QUERY:
        handleQuery();
        break;
    case COM_FIELD_LIST:
        handleFieldList();
        break;
    case COM_PING:
        handlePing();
        break;
    default:
        ctx.getState().setError("Unsupported command(" + command + ")");
        LOG.warn("Unsupported command(" + command + ")");
        break;

org.apache.doris.qe.ConnectProcessor#analyze 词法语法解析

// Parse statement with parser generated by CUP&FLEX
SqlScanner input = new SqlScanner(new StringReader(originStmt), ctx.getSessionVariable().getSqlMode());
SqlParser parser = new SqlParser(input);

从连接中读取原始语句字符串
词法解析文件
• fe/fe-core/src/main/jflex/sql_scanner.flex
• 语法解析文件
• fe/fe-core/src/main/cup/sql_parser.cup

所有语法实现类:

StatementBase     [vim org/apache/doris/analysis/StatementBase.java +33]
      ├── ExportStmt    [vim org/apache/doris/analysis/ExportStmt.java +48]
      ├── ImportColumnsStmt     [vim org/apache/doris/analysis/ImportColumnsStmt.java +21]
      ├── ImportDeleteOnStmt    [vim org/apache/doris/analysis/ImportDeleteOnStmt.java +19]
      ├── ImportSequenceStmt    [vim org/apache/doris/analysis/ImportSequenceStmt.java +19]
      ├── ImportWhereStmt       [vim org/apache/doris/analysis/ImportWhereStmt.java +19]
      ├── KillStmt      [vim org/apache/doris/analysis/KillStmt.java +19]
      ├── SetStmt       [vim org/apache/doris/analysis/SetStmt.java +24]
      ├── UseStmt       [vim org/apache/doris/analysis/UseStmt.java +33]
      ├── QueryStmt     [vim org/apache/doris/analysis/QueryStmt.java +38]
      │   ├── SelectStmt        [vim org/apache/doris/analysis/SelectStmt.java +65]
      │   └── SetOperationStmt  [vim org/apache/doris/analysis/SetOperationStmt.java +36]
      ├── ShowStmt      [vim org/apache/doris/analysis/ShowStmt.java +22]
      │   ├── AdminShowConfigStmt       [vim org/apache/doris/analysis/AdminShowConfigStmt.java +33]
      │   ├── AdminShowDataSkewStmt     [vim org/apache/doris/analysis/AdminShowDataSkewStmt.java +32]
      │   ├── AdminShowReplicaDistributionStmt  [vim org/apache/doris/analysis/AdminShowReplicaDistributionStmt.java +34]
      │   ├── AdminShowReplicaStatusStmt        [vim org/apache/doris/analysis/AdminShowReplicaStatusStmt.java +39]
      │   ├── DescribeStmt      [vim org/apache/doris/analysis/DescribeStmt.java +54]
      │   ├── HelpStmt  [vim org/apache/doris/analysis/HelpStmt.java +26]
      │   ├── ShowAlterStmt     [vim org/apache/doris/analysis/ShowAlterStmt.java +46]
      │   ├── ShowAuthorStmt    [vim org/apache/doris/analysis/ShowAuthorStmt.java +23]
      │   ├── ShowBackendsStmt  [vim org/apache/doris/analysis/ShowBackendsStmt.java +30]
      │   ├── ShowBackupStmt    [vim org/apache/doris/analysis/ShowBackupStmt.java +38]
      │   ├── ShowBrokerStmt    [vim org/apache/doris/analysis/ShowBrokerStmt.java +30]
      │   ├── ShowCharsetStmt   [vim org/apache/doris/analysis/ShowCharsetStmt.java +23]
      │   ├── ShowClusterStmt   [vim org/apache/doris/analysis/ShowClusterStmt.java +34]
      │   ├── ShowCollationStmt [vim org/apache/doris/analysis/ShowCollationStmt.java +24]
      │   ├── ShowColumnStatsStmt       [vim org/apache/doris/analysis/ShowColumnStatsStmt.java +28]
      │   ├── ShowColumnStmt    [vim org/apache/doris/analysis/ShowColumnStmt.java +28]
      │   ├── ShowCreateDbStmt  [vim org/apache/doris/analysis/ShowCreateDbStmt.java +36]
      │   ├── ShowCreateFunctionStmt    [vim org/apache/doris/analysis/ShowCreateFunctionStmt.java +32]
      │   ├── ShowCreateRoutineLoadStmt [vim org/apache/doris/analysis/ShowCreateRoutineLoadStmt.java +24]
      │   ├── ShowCreateTableStmt       [vim org/apache/doris/analysis/ShowCreateTableStmt.java +29]
      │   ├── ShowDataStmt      [vim org/apache/doris/analysis/ShowDataStmt.java +56]
      │   ├── ShowDbIdStmt      [vim org/apache/doris/analysis/ShowDbIdStmt.java +29]
      │   ├── ShowDbStmt        [vim org/apache/doris/analysis/ShowDbStmt.java +27]
      │   ├── ShowDeleteStmt    [vim org/apache/doris/analysis/ShowDeleteStmt.java +31]
      │   ├── ShowDynamicPartitionStmt  [vim org/apache/doris/analysis/ShowDynamicPartitionStmt.java +29]
      │   ├── ShowEncryptKeysStmt       [vim org/apache/doris/analysis/ShowEncryptKeysStmt.java +32]
      │   ├── ShowEnginesStmt   [vim org/apache/doris/analysis/ShowEnginesStmt.java +23]
      │   ├── ShowEventsStmt    [vim org/apache/doris/analysis/ShowEventsStmt.java +23]
      │   ├── ShowExportStmt    [vim org/apache/doris/analysis/ShowExportStmt.java +40]
      │   ├── ShowFrontendsStmt [vim org/apache/doris/analysis/ShowFrontendsStmt.java +30]
      │   ├── ShowFunctionsStmt [vim org/apache/doris/analysis/ShowFunctionsStmt.java +32]
      │   ├── ShowGrantsStmt    [vim org/apache/doris/analysis/ShowGrantsStmt.java +32]
      │   ├── ShowIndexStmt     [vim org/apache/doris/analysis/ShowIndexStmt.java +32]
      │   ├── ShowLoadProfileStmt       [vim org/apache/doris/analysis/ShowLoadProfileStmt.java +27]
      │   ├── ShowLoadStmt      [vim org/apache/doris/analysis/ShowLoadStmt.java +42]
      │   ├── ShowLoadWarningsStmt      [vim org/apache/doris/analysis/ShowLoadWarningsStmt.java +36]
      │   ├── ShowMigrationsStmt        [vim org/apache/doris/analysis/ShowMigrationsStmt.java +31]
      │   ├── ShowOpenTableStmt [vim org/apache/doris/analysis/ShowOpenTableStmt.java +23]
      │   ├── ShowPartitionIdStmt       [vim org/apache/doris/analysis/ShowPartitionIdStmt.java +29]
      │   ├── ShowPartitionsStmt        [vim org/apache/doris/analysis/ShowPartitionsStmt.java +49]
      │   ├── ShowPluginsStmt   [vim org/apache/doris/analysis/ShowPluginsStmt.java +23]
      │   ├── ShowProcStmt      [vim org/apache/doris/analysis/ShowProcStmt.java +32]
      │   ├── ShowProcedureStmt [vim org/apache/doris/analysis/ShowProcedureStmt.java +23]
      │   ├── ShowProcesslistStmt       [vim org/apache/doris/analysis/ShowProcesslistStmt.java +24]
      │   ├── ShowQueryProfileStmt      [vim org/apache/doris/analysis/ShowQueryProfileStmt.java +27]
      │   ├── ShowRepositoriesStmt      [vim org/apache/doris/analysis/ShowRepositoriesStmt.java +25]
      │   ├── ShowResourcesStmt [vim org/apache/doris/analysis/ShowResourcesStmt.java +37]
      │   ├── ShowRestoreStmt   [vim org/apache/doris/analysis/ShowRestoreStmt.java +38]
      │   ├── ShowRolesStmt     [vim org/apache/doris/analysis/ShowRolesStmt.java +29]
      │   ├── ShowRollupStmt    [vim org/apache/doris/analysis/ShowRollupStmt.java +28]
      │   ├── ShowRoutineLoadStmt       [vim org/apache/doris/analysis/ShowRoutineLoadStmt.java +34]
      │   ├── ShowRoutineLoadTaskStmt   [vim org/apache/doris/analysis/ShowRoutineLoadTaskStmt.java +32]
      │   ├── ShowSmallFilesStmt        [vim org/apache/doris/analysis/ShowSmallFilesStmt.java +32]
      │   ├── ShowSnapshotStmt  [vim org/apache/doris/analysis/ShowSnapshotStmt.java +29]
      │   ├── ShowSqlBlockRuleStmt      [vim org/apache/doris/analysis/ShowSqlBlockRuleStmt.java +31]
      │   ├── ShowStatusStmt    [vim org/apache/doris/analysis/ShowStatusStmt.java +23]
      │   ├── ShowStreamLoadStmt        [vim org/apache/doris/analysis/ShowStreamLoadStmt.java +39]
      │   ├── ShowSyncJobStmt   [vim org/apache/doris/analysis/ShowSyncJobStmt.java +33]
      │   ├── ShowTableIdStmt   [vim org/apache/doris/analysis/ShowTableIdStmt.java +30]
      │   ├── ShowTableStatsStmt        [vim org/apache/doris/analysis/ShowTableStatsStmt.java +32]
      │   ├── ShowTableStatusStmt       [vim org/apache/doris/analysis/ShowTableStatusStmt.java +35]
      │   ├── ShowTableStmt     [vim org/apache/doris/analysis/ShowTableStmt.java +34]
      │   ├── ShowTabletStmt    [vim org/apache/doris/analysis/ShowTabletStmt.java +39]
      │   ├── ShowTransactionStmt       [vim org/apache/doris/analysis/ShowTransactionStmt.java +35]
      │   ├── ShowTrashDiskStmt [vim org/apache/doris/analysis/ShowTrashDiskStmt.java +33]
      │   ├── ShowTrashStmt     [vim org/apache/doris/analysis/ShowTrashStmt.java +36]
      │   ├── ShowTriggersStmt  [vim org/apache/doris/analysis/ShowTriggersStmt.java +23]
      │   ├── ShowUserPropertyStmt      [vim org/apache/doris/analysis/ShowUserPropertyStmt.java +42]
      │   ├── ShowUserStmt      [vim org/apache/doris/analysis/ShowUserStmt.java +25]
      │   ├── ShowVariablesStmt [vim org/apache/doris/analysis/ShowVariablesStmt.java +29]
      │   ├── ShowViewStmt      [vim org/apache/doris/analysis/ShowViewStmt.java +39]
      │   ├── ShowWarningStmt   [vim org/apache/doris/analysis/ShowWarningStmt.java +23]
      │   └── ShowWhiteListStmt [vim org/apache/doris/analysis/ShowWhiteListStmt.java +23]
      ├── TransactionStmt       [vim org/apache/doris/analysis/TransactionStmt.java +22]
      │   ├── TransactionBeginStmt      [vim org/apache/doris/analysis/TransactionBeginStmt.java +24]
      │   ├── TransactionCommitStmt     [vim org/apache/doris/analysis/TransactionCommitStmt.java +19]
      │   └── TransactionRollbackStmt   [vim org/apache/doris/analysis/TransactionRollbackStmt.java +19]
      ├── UnsupportedStmt       [vim org/apache/doris/analysis/UnsupportedStmt.java +22]
      │   └── EmptyStmt [vim org/apache/doris/analysis/EmptyStmt.java +19]
      └── DdlStmt       [vim org/apache/doris/analysis/DdlStmt.java +19]
          ├── AdminCancelRepairTableStmt        [vim org/apache/doris/analysis/AdminCancelRepairTableStmt.java +33]
          ├── AdminCheckTabletsStmt     [vim org/apache/doris/analysis/AdminCheckTabletsStmt.java +33]
          ├── AdminCleanTrashStmt       [vim org/apache/doris/analysis/AdminCleanTrashStmt.java +34]
          ├── AdminRepairTableStmt      [vim org/apache/doris/analysis/AdminRepairTableStmt.java +33]
          ├── AdminSetConfigStmt        [vim org/apache/doris/analysis/AdminSetConfigStmt.java +32]
          ├── AdminSetReplicaStatusStmt [vim org/apache/doris/analysis/AdminSetReplicaStatusStmt.java +30]
          ├── AlterClusterStmt  [vim org/apache/doris/analysis/AlterClusterStmt.java +29]
          ├── AlterColumnStatsStmt      [vim org/apache/doris/analysis/AlterColumnStatsStmt.java +33]
          ├── AlterDatabasePropertyStmt [vim org/apache/doris/analysis/AlterDatabasePropertyStmt.java +24]
          ├── AlterDatabaseQuotaStmt    [vim org/apache/doris/analysis/AlterDatabaseQuotaStmt.java +30]
          ├── AlterDatabaseRename       [vim org/apache/doris/analysis/AlterDatabaseRename.java +34]
          ├── AlterRoutineLoadStmt      [vim org/apache/doris/analysis/AlterRoutineLoadStmt.java +34]
          ├── AlterSqlBlockRuleStmt     [vim org/apache/doris/analysis/AlterSqlBlockRuleStmt.java +31]
          ├── AlterSystemStmt   [vim org/apache/doris/analysis/AlterSystemStmt.java +28]
          ├── AlterTableStatsStmt       [vim org/apache/doris/analysis/AlterTableStatsStmt.java +33]
          ├── AlterTableStmt    [vim org/apache/doris/analysis/AlterTableStmt.java +37]
          ├── CancelLoadStmt    [vim org/apache/doris/analysis/CancelLoadStmt.java +26]
          ├── CreateClusterStmt [vim org/apache/doris/analysis/CreateClusterStmt.java +33]
          ├── CreateDataSyncJobStmt     [vim org/apache/doris/analysis/CreateDataSyncJobStmt.java +36]
          ├── CreateDbStmt      [vim org/apache/doris/analysis/CreateDbStmt.java +31]
          ├── CreateEncryptKeyStmt      [vim org/apache/doris/analysis/CreateEncryptKeyStmt.java +30]
          ├── CreateFileStmt    [vim org/apache/doris/analysis/CreateFileStmt.java +35]
          ├── CreateFunctionStmt        [vim org/apache/doris/analysis/CreateFunctionStmt.java +47]
          ├── CreateMaterializedViewStmt        [vim org/apache/doris/analysis/CreateMaterializedViewStmt.java +43]
          ├── CreateRepositoryStmt      [vim org/apache/doris/analysis/CreateRepositoryStmt.java +28]
          ├── CreateResourceStmt        [vim org/apache/doris/analysis/CreateResourceStmt.java +32]
          ├── CreateRoleStmt    [vim org/apache/doris/analysis/CreateRoleStmt.java +28]
          ├── CreateRoutineLoadStmt     [vim org/apache/doris/analysis/CreateRoutineLoadStmt.java +48]
          ├── CreateSqlBlockRuleStmt    [vim org/apache/doris/analysis/CreateSqlBlockRuleStmt.java +37]
          ├── CreateTableAsSelectStmt   [vim org/apache/doris/analysis/CreateTableAsSelectStmt.java +26]
          ├── CreateTableLikeStmt       [vim org/apache/doris/analysis/CreateTableLikeStmt.java +31]
          ├── CreateTableStmt   [vim org/apache/doris/analysis/CreateTableStmt.java +56]
          ├── CreateUserStmt    [vim org/apache/doris/analysis/CreateUserStmt.java +36]
          ├── DeleteStmt        [vim org/apache/doris/analysis/DeleteStmt.java +35]
          ├── DropClusterStmt   [vim org/apache/doris/analysis/DropClusterStmt.java +31]
          ├── DropDbStmt        [vim org/apache/doris/analysis/DropDbStmt.java +30]
          ├── DropEncryptKeyStmt        [vim org/apache/doris/analysis/DropEncryptKeyStmt.java +28]
          ├── DropFileStmt      [vim org/apache/doris/analysis/DropFileStmt.java +34]
          ├── DropFunctionStmt  [vim org/apache/doris/analysis/DropFunctionStmt.java +27]
          ├── DropMaterializedViewStmt  [vim org/apache/doris/analysis/DropMaterializedViewStmt.java +29]
          ├── DropRepositoryStmt        [vim org/apache/doris/analysis/DropRepositoryStmt.java +27]
          ├── DropResourceStmt  [vim org/apache/doris/analysis/DropResourceStmt.java +27]
          ├── DropRoleStmt      [vim org/apache/doris/analysis/DropRoleStmt.java +28]
          ├── DropSqlBlockRuleStmt      [vim org/apache/doris/analysis/DropSqlBlockRuleStmt.java +30]
          ├── DropTableStmt     [vim org/apache/doris/analysis/DropTableStmt.java +28]
          ├── DropUserStmt      [vim org/apache/doris/analysis/DropUserStmt.java +27]
          ├── EnterStmt [vim org/apache/doris/analysis/EnterStmt.java +25]
          ├── GrantStmt [vim org/apache/doris/analysis/GrantStmt.java +39]
          ├── InsertStmt        [vim org/apache/doris/analysis/InsertStmt.java +66]
          ├── InstallPluginStmt [vim org/apache/doris/analysis/InstallPluginStmt.java +31]
          ├── LinkDbStmt        [vim org/apache/doris/analysis/LinkDbStmt.java +31]
          ├── LoadStmt  [vim org/apache/doris/analysis/LoadStmt.java +45]
          ├── MigrateDbStmt     [vim org/apache/doris/analysis/MigrateDbStmt.java +29]
          ├── PauseRoutineLoadStmt      [vim org/apache/doris/analysis/PauseRoutineLoadStmt.java +26]
          ├── PauseSyncJobStmt  [vim org/apache/doris/analysis/PauseSyncJobStmt.java +22]
          ├── RecoverDbStmt     [vim org/apache/doris/analysis/RecoverDbStmt.java +33]
          ├── RecoverPartitionStmt      [vim org/apache/doris/analysis/RecoverPartitionStmt.java +32]
          ├── RecoverTableStmt  [vim org/apache/doris/analysis/RecoverTableStmt.java +32]
          ├── ResumeRoutineLoadStmt     [vim org/apache/doris/analysis/ResumeRoutineLoadStmt.java +26]
          ├── ResumeSyncJobStmt [vim org/apache/doris/analysis/ResumeSyncJobStmt.java +22]
          ├── RevokeStmt        [vim org/apache/doris/analysis/RevokeStmt.java +32]
          ├── SetUserPropertyStmt       [vim org/apache/doris/analysis/SetUserPropertyStmt.java +31]
          ├── StopRoutineLoadStmt       [vim org/apache/doris/analysis/StopRoutineLoadStmt.java +23]
          ├── StopSyncJobStmt   [vim org/apache/doris/analysis/StopSyncJobStmt.java +22]
          ├── SyncStmt  [vim org/apache/doris/analysis/SyncStmt.java +22]
          ├── TruncateTableStmt [vim org/apache/doris/analysis/TruncateTableStmt.java +27]
          ├── UninstallPluginStmt       [vim org/apache/doris/analysis/UninstallPluginStmt.java +28]
          ├── UpdateStmt        [vim org/apache/doris/analysis/UpdateStmt.java +35]
          ├── AbstractBackupStmt        [vim org/apache/doris/analysis/AbstractBackupStmt.java +36]
          │   ├── BackupStmt    [vim org/apache/doris/analysis/BackupStmt.java +29]
          │   └── RestoreStmt   [vim org/apache/doris/analysis/RestoreStmt.java +33]
          ├── BaseViewStmt      [vim org/apache/doris/analysis/BaseViewStmt.java +39]
          │   ├── AlterViewStmt [vim org/apache/doris/analysis/AlterViewStmt.java +31]
          │   └── CreateViewStmt        [vim org/apache/doris/analysis/CreateViewStmt.java +33]
          └── CancelStmt        [vim org/apache/doris/analysis/CancelStmt.java +19]
              ├── CancelAlterSystemStmt [vim org/apache/doris/analysis/CancelAlterSystemStmt.java +28]
              ├── CancelAlterTableStmt  [vim org/apache/doris/analysis/CancelAlterTableStmt.java +31]
              └── CancelBackupStmt      [vim org/apache/doris/analysis/CancelBackupStmt.java +30]

org.apache.doris.qe.StmtExecutor#execute(TUniqueId)

analyze(context.getSessionVariable().toThrift()); //语义解析
if (isForwardToMaster()) {
    forwardToMaster();  //转发处理
    if (masterOpExecutor != null && masterOpExecutor.getQueryId() != null) {
        context.setQueryId(masterOpExecutor.getQueryId());
    }
    return;
} else {
    LOG.debug("no need to transfer to Master. stmt: {}", context.getStmtId());
}


//命令执行
else if (parsedStmt instanceof DdlStmt) {
    handleDdlStmt();
} else if (parsedStmt instanceof ShowStmt) {
    handleShow();
} else if (parsedStmt instanceof KillStmt) {
    handleKill();
}

语义解析

判断含义的正确性

@Override
public void analyze(Analyzer analyzer) throws UserException {

    super.analyze(analyzer);
    tableName.analyze(analyzer);   
    
    checkTblPriv(ConnectContext.get(), tableName.getDb(),
        tableName.getTbl(), PrivPredicate.CREATE)
        
    analyzeEngineName();  
      
    keysDesc.analyze(columnDefs);
    
    for (ColumnDef columnDef : columnDefs) {
        columnDef.analyze(engineName.equals("olap"));
    }   
    
    partitionDesc.analyze(columnDefs, properties);
    
    distributionDesc.analyze(columnSet);
}

验证名称是否合法
权限是否正确
分区是否合法
列类型是否合法

转发处理

Master、Follower、Observer
只有Master有元数据的修改能力
所有需要修改元数据的操作,需要转发到Master去执行
转发类型:
FORWARD_NO_SYNC
FORWARD_WITH_SYNC
NO_FORWARD
DDL 采用 FORWARD_WITH_SYNC

命令执行

org.apache.doris.qe.DdlExecutor#execute()

//根据语句类型执行相应的函数

if (ddlStmt instanceof CreateTableStmt) {
    catalog.createTable((CreateTableStmt) ddlStmt);
} 

支持多种表类型, 除了olap 表, 其余都为映射表

if (engineName.equals("olap")) {
    createOlapTable(db, stmt);
    return;
} else if (engineName.equals("odbc")) {
    createOdbcTable(db, stmt);
    return;
} else if (engineName.equals("mysql")) {
    createMysqlTable(db, stmt);
    return;
} else if (engineName.equals("broker")) {
    createBrokerTable(db, stmt);
    return;
} else if (engineName.equalsIgnoreCase("elasticsearch") || engineName.equalsIgnoreCase("es")) {
    createEsTable(db, stmt);
    return;
} else if (engineName.equalsIgnoreCase("hive")) {
    createHiveTable(db, stmt);
    return;
}

org.apache.doris.catalog.Catalog#createOlapTable

//将语法对象转为元数据对象
String tableName = stmt.getTableName();
LOG.debug("begin create olap table: {}", tableName);

// create columns
List<Column> baseSchema = stmt.getColumns();
validateColumns(baseSchema);

// create partition info
PartitionDesc partitionDesc = stmt.getPartitionDesc();
PartitionInfo partitionInfo = null;

//创建table对象
long tableId = Catalog.getCurrentCatalog().getNextId();
OlapTable olapTable = new OlapTable(tableId, tableName, baseSchema, keysType, partitionInfo,
        distributionInfo, indexes);
        
  // 创建Partition 对象
  if (partitionInfo.getType() == PartitionType.UNPARTITIONED) {
        // this is a 1-level partitioned table
        // use table name as partition name
        String partitionName = tableName;
        long partitionId = partitionNameToId.get(partitionName);
        // create partition
        Partition partition = createPartitionWithIndices()
        olapTable.addPartition(partition);
    }
    
    
//添加元数据并进行持久化
Pair<Boolean, Boolean> result = db.createTableWithLock(olapTable, false, stmt.isSetIfNotExists());

org.apache.doris.catalog.Catalog#createPartitionWithIndices

分区是表的实体

image

// create base index first.
Preconditions.checkArgument(baseIndexId != -1);
MaterializedIndex baseIndex = new MaterializedIndex(baseIndexId, IndexState.NORMAL);

// create partition with base index
Partition partition = new Partition(partitionId, partitionName, baseIndex, distributionInfo);

// add to index map
Map<Long, MaterializedIndex> indexMap = new HashMap<>();
indexMap.put(baseIndexId, baseIndex);

// create rollup index if has
for (long indexId : indexIdToMeta.keySet()) {
    if (indexId == baseIndexId) {
        continue;
    }

    MaterializedIndex rollup = new MaterializedIndex(indexId, IndexState.NORMAL);
    indexMap.put(indexId, rollup);
}

for (Map.Entry<Long, MaterializedIndex> entry : indexMap.entrySet()) {
// create tablets
int schemaHash = indexMeta.getSchemaHash();
TabletMeta tabletMeta = new TabletMeta(dbId, tableId, partitionId, indexId, schemaHash, storageMedium);
createTablets(clusterName, index, ReplicaState.NORMAL, distributionInfo, version, versionHash,
        replicaAlloc, tabletMeta, tabletIdSet);
        
    
// add create replica task for olap
short shortKeyColumnCount = indexMeta.getShortKeyColumnCount();
TStorageType storageType = indexMeta.getStorageType();
List<Column> schema = indexMeta.getSchema();
KeysType keysType = indexMeta.getKeysType();
int totalTaskNum = index.getTablets().size() * totalReplicaNum;
MarkedCountDownLatch<Long, Long> countDownLatch = new MarkedCountDownLatch<Long, Long>(totalTaskNum);
AgentBatchTask batchTask = new AgentBatchTask();
for (Tablet tablet : index.getTablets()) {
    long tabletId = tablet.getId();
    for (Replica replica : tablet.getReplicas()) {
        long backendId = replica.getBackendId();
        countDownLatch.addMark(backendId, tabletId);
        CreateReplicaTask task = new CreateReplicaTask(backendId, dbId, tableId,
                partitionId, indexId, tabletId,
                shortKeyColumnCount, schemaHash,
                version, versionHash,
                keysType,
                storageType, storageMedium,
                schema, bfColumns, bfFpp,
                countDownLatch,
                indexes,
                isInMemory,
                tabletType);
        task.setStorageFormat(storageFormat);
        batchTask.addTask(task);
        // add to AgentTaskQueue for handling finish report.
        // not for resending task
        AgentTaskQueue.addTask(task);
    }
}
AgentTaskExecutor.submit(batchTask);    
       
}

整体流程:

  • 创建Partition 对象
  • 创建MaterializedIndex对象
  • 对于每个MaterializedIndex对象 创建创建Tablet
  • 创建replica并下发任务到BE

// estimate timeout
long timeout = Config.tablet_create_timeout_second * 1000L * totalTaskNum;
timeout = Math.min(timeout, Config.max_create_table_timeout_second * 1000);
try {
    ok = countDownLatch.await(timeout, TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
    LOG.warn("InterruptedException: ", e);
    ok = false;
}

等待BE执行任务完成

org.apache.doris.catalog.Database#createTableWithLock

idToTable.put(table.getId(), table);
nameToTable.put(table.getName(), table);
lowerCaseToTableName.put(tableName.toLowerCase(), tableName);

if (!isReplay) {
    // Write edit log
    CreateTableInfo info = new CreateTableInfo(fullQualifiedName, table);
    Catalog.getCurrentCatalog().getEditLog().logCreateTable(info);
}
if (table.getType() == TableType.ELASTICSEARCH) {
    Catalog.getCurrentCatalog().getEsRepository().registerTable((EsTable) table);
}

将table添加到DataBase对象里
判断是否replay
写入元数据日志

流程总结:

image

FE与BE交互

image

FE与BE交互流程图

FE 发送任务
BE执行
BE汇报执行结果
FE汇总结果

AgentBatchTask batchTask = new AgentBatchTask();
for (Tablet tablet : index.getTablets()) {
    long tabletId = tablet.getId();
    for (Replica replica : tablet.getReplicas()) {
        long backendId = replica.getBackendId();
        countDownLatch.addMark(backendId, tabletId);
        CreateReplicaTask task = new CreateReplicaTask(backendId, dbId, tableId,
                partitionId, indexId, tabletId,
                shortKeyColumnCount, schemaHash,
                version, versionHash,
                keysType,
                storageType, storageMedium,
                schema, bfColumns, bfFpp,
                countDownLatch,
                indexes,
                isInMemory,
                tabletType);
        task.setStorageFormat(storageFormat);
        batchTask.addTask(task);
        // add to AgentTaskQueue for handling finish report.
        // not for resending task
        AgentTaskQueue.addTask(task);
    }
}
AgentTaskExecutor.submit(batchTask);

AgentBatchTask:
收集Task并按照Be分组

AgentTaskExecutor:
发送AgentBatchTask

AgentTaskQueue:
处理任务完成的上报

BE任务接收
be/src/agent/agent_server.cpp
接收Task

// resend request when something is wrong(BE may need some logic to guarantee idempotence.
void AgentServer::submit_tasks(TAgentResult& agent_result,
                               const std::vector<TAgentTaskRequest>& tasks) {
    Status ret_st;
    // TODO check master_info here if it is the same with that of heartbeat rpc
    if (_master_info.network_address.hostname == "" || _master_info.network_address.port == 0) {
        Status ret_st = Status::Cancelled("Have not get FE Master heartbeat yet");
        ret_st.to_thrift(&agent_result.status);
        return;
    }

    for (auto task : tasks) {
        VLOG_RPC << "submit one task: " << apache::thrift::ThriftDebugString(task).c_str();
        TTaskType::type task_type = task.task_type;
        int64_t signature = task.signature;

#define HANDLE_TYPE(t_task_type, work_pool, req_member)                         \
    case t_task_type:                                                           \
        if (task.__isset.req_member) {                                          \
            work_pool->submit_task(task);                                       \
        } else {                                                                \
            ret_st = Status::InvalidArgument(strings::Substitute(               \
                    "task(signature=$0) has wrong request member", signature)); \
        }                                                                       \
        break;

       ...

    ret_st.to_thrift(&agent_result.status);
}

工作线程

while (_is_work) {
    TAgentTaskRequest agent_task_req;
    TCreateTabletReq create_tablet_req;
    {
        lock_guard<Mutex> worker_thread_lock(_worker_thread_lock);
        while (_is_work && _tasks.empty()) {
            _worker_thread_condition_variable.wait();
        }
        if (!_is_work) {
            return;
        }
        //从队列中取出任务
        agent_task_req = _tasks.front();
        create_tablet_req = agent_task_req.create_tablet_req;
        _tasks.pop_front();
      
        //执行
        OLAPStatus create_status = _env->storage_engine()->create_tablet(create_tablet_req);
      
      
        TFinishTaskRequest finish_task_request;
        finish_task_request.__set_finish_tablet_infos(finish_tablet_infos);
        finish_task_request.__set_backend(_backend);
        finish_task_request.__set_report_version(_s_report_version);
        finish_task_request.__set_task_type(agent_task_req.task_type);
        finish_task_request.__set_signature(agent_task_req.signature);
        finish_task_request.__set_task_status(task_status);
        //汇报结果
        _finish_task(finish_task_request);
   }   

处理任务汇报

org.apache.doris.service.FrontendServiceImpl#finishTask
org.apache.doris.master.MasterImpl#finishTask
FE、BE通过Thrift协议通信

错误处理

org.apache.doris.task.AgentTaskQueue 存储正在执行的Task

image

org.apache.doris.master.ReportHandler#handleReport
org.apache.doris.master.ReportHandler#taskReport
BE: Report tasks/olap tablet/disk state to the master server
FE master 处理任务,超时会进行重试

private static void taskReport(long backendId, Map<TTaskType, Set<Long>> runningTasks) {
    ...
    
    // to escape sending duplicate agent task to be
    if (task.shouldResend(taskReportTime)) {
        batchTask.addTask(task);
    }
    ...
}

元数据持久化

image

Edit类似WAL

BDBJE 分布式KV存储

元数据持久化:org.apache.doris.catalog.Database#createTableWithLock

public Pair<Boolean, Boolean> createTableWithLock(Table table, boolean isReplay, boolean setIfNotExist) {
   ...
   //更新内存
   nameToTable.put(table.getName(), table);
    // Write edit log
    //构建元数据日志
    CreateTableInfo info = new CreateTableInfo(fullQualifiedName, table);
    //写入元数据日志
    Catalog.getCurrentCatalog().getEditLog().logCreateTable(info);  
   ...
    
}

元数据回放

元数据回放发生在FE leader 给 其他FE节点同步的时候

image

逐一回放元数据
在内存中复原元数据
org.apache.doris.catalog.Catalog#replayCreateTable

public void replayCreateTable(String dbName, Table table) {
    Database db = this.fullNameToDb.get(dbName);
    db.createTableWithLock(table, true, false);
    ...
}

如何实现一个新的语句

fe/fe-core/src/main/cup/sql_parser.cup 语法文件

KW_CREATE opt_external:isExternal KW_TABLE opt_if_not_exists:ifNotExists table_name:name
        LPAREN column_definition_list:columns COMMA index_definition_list:indexes RPAREN opt_engine:engineName
        opt_keys:keys
        opt_comment:tableComment
        opt_partition:partition
        opt_distribution:distribution
        opt_rollup:index
        opt_properties:tblProperties
        opt_ext_properties:extProperties
{:
    RESULT = new CreateTableStmt(ifNotExists, isExternal, name, columns, indexes, engineName, keys, partition,
    distribution, tblProperties, extProperties, tableComment, index);
:}

fe/fe-core/src/main/jflex/sql_scanner.flex 词法文件

keywordMap.put("create", new Integer(SqlParserSymbols.KW_CREATE));
keywordMap.put("cross", new Integer(SqlParserSymbols.KW_CROSS));
keywordMap.put("cube", new Integer(SqlParserSymbols.KW_CUBE));
keywordMap.put("current", new Integer(SqlParserSymbols.KW_CURRENT));
keywordMap.put("current_user", new Integer(SqlParserSymbols.KW_CURRENT_USER));
keywordMap.put("data", new Integer(SqlParserSymbols.KW_DATA));
keywordMap.put("database", new Integer(SqlParserSymbols.KW_DATABASE));

词法语法的代码生成:

cd fe/ && mvn clean install –DskipTests
• SqlScanner.java
• SqlParser.java
• SqlParserSymbols.java

实现新语句步骤总结:

  1. 定义词法语法文件
  2. 实现对应的语句类,比如CreateTableStmt
  3. 实现元数据修改的方法,如Catalog.createTable()
  4. 定义对应操作的元数据日志类,如CreateTableInfo
  5. 实现元数据日志的写入
  6. 实现对应的replay方法,如Catalog.replayCreateTable()
posted @ 2021-11-18 21:35  chaplinthink  阅读(1734)  评论(0编辑  收藏  举报