Paimon Lookup Store Implementation
This post is part of a series:

1. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
2. Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores
3. Analysis of Apache Hudi's core features
4. Paimon's write path
5. Paimon's read path
6. Paimon compaction implementation
7. Understanding the Paimon changelog producer
8. Paimon lookup store implementation (this post)
9. Paimon Deletion Vector

The Lookup Store is mainly used in Paimon's lookup compaction and lookup join scenarios. It converts remote columnar files into a local format suited to KV lookups.
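To make this concrete, here is a minimal sketch of the write/read cycle. The `put`/`close`/`lookup` signatures are taken from the snippets later in this post; the `createWriter`/`createReader` helpers and the file-handling details are hypothetical, for illustration only.

```java
// Hypothetical wiring; only put/close/lookup mirror the snippets in this post.
LookupStoreWriter writer = createWriter(localFile);  // assumed helper
writer.put(key1, value1); // keys are assumed to arrive in sorted order for the
writer.put(key2, value2); // sort-based store, since the reader binary-searches
LookupStoreFactory.Context context = writer.close(); // flushes blocks, writes index + footer

LookupStoreReader reader = createReader(localFile, context); // assumed helper
byte[] value = reader.lookup(key1); // null when the key is absent
```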
Hash
Based on https://github.com/linkedin/PalDB
Sort
Based on https://github.com/dain/leveldb, introduced in https://github.com/apache/paimon/pull/3770
Overall file layout:
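A sketch of the on-disk layout, reconstructed from the writer code walked through below (block trailers, bloom filter, index block, and footer):

```
+------------------------------+
| data block 1 | trailer (5 B) |  <- one trailer per block: compression type + CRC32
| data block 2 | trailer (5 B) |
| ...                          |
+------------------------------+
| bloom filter (optional)      |
+------------------------------+
| index block  | trailer (5 B) |  <- one entry per data block: lastKey -> BlockHandle
+------------------------------+
| footer                       |  <- bloom filter handle + index block handle
+------------------------------+
```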
Advantages over the hash file format:

- One-pass writing, avoiding file merges
- Sequential writing preserves the original key order; subsequent lookups that follow key order get better cache efficiency
SortLookupStoreWriter
put
```java
@Override
public void put(byte[] key, byte[] value) throws IOException {
    dataBlockWriter.add(key, value);
    if (bloomFilter != null) {
        bloomFilter.addHash(MurmurHashUtils.hashBytes(key));
    }
    lastKey = key;
    // Flush once the BlockWriter reaches the threshold (default cache-page-size = 64 KB).
    if (dataBlockWriter.memory() > blockSize) {
        flush();
    }
    recordCount++;
}
```
flush
```java
private void flush() throws IOException {
    if (dataBlockWriter.size() == 0) {
        return;
    }
    // Write the data block to the file and record its position and length.
    BlockHandle blockHandle = writeBlock(dataBlockWriter);
    MemorySlice handleEncoding = writeBlockHandle(blockHandle);
    // Write the BlockHandle into the index writer (which is itself a BlockWriter),
    // keyed by the last key of the flushed block.
    indexBlockWriter.add(lastKey, handleEncoding.copyBytes());
}
```
writeBlock
```java
private BlockHandle writeBlock(BlockWriter blockWriter) throws IOException {
    // Close the block. This returns the block's complete backing array; the array
    // inside blockWriter is not released here and will be reused.
    MemorySlice block = blockWriter.finish();
    totalUncompressedSize += block.length();

    // Attempt to compress the block.
    BlockCompressionType blockCompressionType = BlockCompressionType.NONE;
    if (blockCompressor != null) {
        int maxCompressedSize = blockCompressor.getMaxCompressedSize(block.length());
        byte[] compressed = allocateReuseBytes(maxCompressedSize + 5);
        int offset = encodeInt(compressed, 0, block.length());
        int compressedSize =
                offset
                        + blockCompressor.compress(
                                block.getHeapMemory(),
                                block.offset(),
                                block.length(),
                                compressed,
                                offset);

        // Don't use the compressed data if it saves less than 12.5%.
        if (compressedSize < block.length() - (block.length() / 8)) {
            block = new MemorySlice(MemorySegment.wrap(compressed), 0, compressedSize);
            blockCompressionType = this.compressionType;
        }
    }
    totalCompressedSize += block.length();

    // Create the block trailer. Every block carries a trailer recording the
    // compression type and a CRC32 checksum.
    BlockTrailer blockTrailer =
            new BlockTrailer(blockCompressionType, crc32c(block, blockCompressionType));
    MemorySlice trailer = BlockTrailer.writeBlockTrailer(blockTrailer);

    // Create a handle to this block: the BlockHandle records the block's
    // starting position and length.
    BlockHandle blockHandle = new BlockHandle(position, block.length());

    // Append the block data to the file.
    writeSlice(block);

    // Write the trailer: 5 bytes.
    writeSlice(trailer);

    // Clean up state.
    blockWriter.reset();

    return blockHandle;
}
```
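As a concrete example of the 12.5% rule above: for a 65536-byte block, `block.length() / 8` is 8192, so the compressed form is used only if it is smaller than 65536 - 8192 = 57344 bytes; otherwise the block is written uncompressed.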
close
```java
public LookupStoreFactory.Context close() throws IOException {
    // Flush the current data block.
    flush();
    LOG.info("Number of record: {}", recordCount);

    // Write the bloom filter.
    @Nullable BloomFilterHandle bloomFilterHandle = null;
    if (bloomFilter != null) {
        MemorySegment buffer = bloomFilter.getBuffer();
        bloomFilterHandle =
                new BloomFilterHandle(position, buffer.size(), bloomFilter.expectedEntries());
        writeSlice(MemorySlice.wrap(buffer));
        LOG.info("Bloom filter size: {} bytes", bloomFilter.getBuffer().size());
    }

    // Write the index block to the file.
    BlockHandle indexBlockHandle = writeBlock(indexBlockWriter);

    // Write the footer, which records the bloom filter handle and the index block handle.
    Footer footer = new Footer(bloomFilterHandle, indexBlockHandle);
    MemorySlice footerEncoding = Footer.writeFooter(footer);
    writeSlice(footerEncoding);

    // Finally, close the file.
    fileOutputStream.close();
    LOG.info("totalUncompressedSize: {}", MemorySize.ofBytes(totalUncompressedSize));
    LOG.info("totalCompressedSize: {}", MemorySize.ofBytes(totalCompressedSize));
    return new SortContext(position);
}
```
BlockWriter
add
```java
public void add(byte[] key, byte[] value) {
    int startPosition = block.size();
    // Write the key length, then the key.
    block.writeVarLenInt(key.length);
    block.writeBytes(key);
    // Write the value length, then the value.
    block.writeVarLenInt(value.length);
    block.writeBytes(value);
    int endPosition = block.size();

    // Record the start position of each KV pair in an int array as the index.
    positions.add(startPosition);

    // Track alignment: the block stays "aligned" only while every KV pair
    // has the same encoded length.
    if (aligned) {
        int currentSize = endPosition - startPosition;
        if (alignedSize == 0) {
            alignedSize = currentSize;
        } else {
            aligned = alignedSize == currentSize;
        }
    }
}
```
- Here, `block` is a growable MemorySegment backed by a `byte[]`: when a write exceeds the current array's length, the array is expanded.
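The grow-on-write behavior is roughly the following (a simplified stand-in for the writer's internal buffer, not Paimon's actual implementation):

```java
// Illustrative sketch of a grow-on-write byte buffer.
class GrowableBuffer {
    private byte[] bytes;
    private int size;

    GrowableBuffer(int initialCapacity) {
        this.bytes = new byte[initialCapacity];
    }

    void writeBytes(byte[] data) {
        if (size + data.length > bytes.length) {
            // Grow the backing array (at least doubling) and copy the old content over.
            int newCapacity = Math.max(bytes.length * 2, size + data.length);
            bytes = java.util.Arrays.copyOf(bytes, newCapacity);
        }
        System.arraycopy(data, 0, bytes, size, data.length);
        size += data.length;
    }

    int size() {
        return size;
    }
}
```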
finish
```java
public MemorySlice finish() throws IOException {
    if (positions.isEmpty()) {
        throw new IllegalStateException();
    }
    // When every record written through this BlockWriter has the same length,
    // there is no need to store the per-record position index; a single aligned
    // record size suffices, and readers can compute each offset themselves.
    if (aligned) {
        block.writeInt(alignedSize);
    } else {
        for (int i = 0; i < positions.size(); i++) {
            block.writeInt(positions.get(i));
        }
        block.writeInt(positions.size());
    }
    block.writeByte(aligned ? ALIGNED.toByte() : UNALIGNED.toByte());
    return block.toSlice();
}
```
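Combined with BlockReader#iterator shown later, the resulting block layout is:

```
Aligned block (all records the same length):
  | record 0 | record 1 | ... | alignedSize (4 B) | ALIGNED (1 B) |

Unaligned block:
  | record 0 | ... | pos 0 (4 B) | ... | pos N-1 (4 B) | N (4 B) | UNALIGNED (1 B) |
```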
Summary
The write path for the whole file is straightforward: data is written out block by block, and each block's position is recorded as the index.
SortLookupStoreReader
The read path mainly serves to check whether a key exists and to fetch the corresponding value (or row position).
```java
public byte[] lookup(byte[] key) throws IOException {
    // First consult the bloom filter to rule out missing keys early.
    if (bloomFilter != null && !bloomFilter.testHash(MurmurHashUtils.hashBytes(key))) {
        return null;
    }
    MemorySlice keySlice = MemorySlice.wrap(key);
    // Seek the index to the block containing the key.
    indexBlockIterator.seekTo(keySlice);
    // If the index iterator has no next entry, the key does not exist in this file.
    if (indexBlockIterator.hasNext()) {
        // Using the BlockHandle read from the index block, load the corresponding data block.
        BlockIterator current = getNextBlock();
        // Binary search again inside the data block; if a matching key exists,
        // return the corresponding value.
        if (current.seekTo(keySlice)) {
            return current.next().getValue().copyBytes();
        }
    }
    return null;
}
```
- A single key lookup performs two binary searches: one over the index block and one over the data block.
BlockReader
```java
// Create an iterator over a block.
public BlockIterator iterator() {
    // The last byte records whether the block is aligned; the 4 bytes before it
    // hold either the aligned record size or the number of position entries.
    BlockAlignedType alignedType =
            BlockAlignedType.fromByte(block.readByte(block.length() - 1));
    int intValue = block.readInt(block.length() - 5);
    if (alignedType == ALIGNED) {
        return new AlignedIterator(block.slice(0, block.length() - 5), intValue, comparator);
    } else {
        int indexLength = intValue * 4;
        int indexOffset = block.length() - 5 - indexLength;
        MemorySlice data = block.slice(0, indexOffset);
        MemorySlice index = block.slice(indexOffset, indexLength);
        return new UnalignedIterator(data, index, comparator);
    }
}
```
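For intuition, `seekTo(mid)` used in the binary search below resolves a record's offset differently in the two iterator types; a sketch of the idea (not the exact Paimon classes):

```java
// Illustrative sketch of per-record offset resolution.
// Aligned: every record has the same encoded size, so the offset is a multiplication.
static int alignedOffset(int record, int recordSize) {
    return record * recordSize;
}

// Unaligned: the block tail stores one 4-byte start position per record, so the
// offset is looked up in that position index (readInt semantics assumed here).
static int unalignedOffset(MemorySlice index, int record) {
    return index.readInt(record * 4);
}
```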
SliceComparator
A keyComparator is passed in to compare keys during the binary search over the index. The comparison is not performed directly on the raw bytes; instead, ordering is defined over MemorySlice: each key field is deserialized from the MemorySegment and cast to Comparable for comparison.
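For example, with a two-field key `(id INT, name STRING)`, comparing `(1, "a")` against `(1, "b")` first reads and compares the `id` fields (equal), then the `name` fields, where `"a".compareTo("b") < 0` decides the ordering; a null field sorts before a non-null one.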
```java
public SliceComparator(RowType rowType) {
    int bitSetInBytes = calculateBitSetInBytes(rowType.getFieldCount());
    this.reader1 = new RowReader(bitSetInBytes);
    this.reader2 = new RowReader(bitSetInBytes);
    this.fieldReaders = new FieldReader[rowType.getFieldCount()];
    for (int i = 0; i < rowType.getFieldCount(); i++) {
        fieldReaders[i] = createFieldReader(rowType.getTypeAt(i));
    }
}

@Override
public int compare(MemorySlice slice1, MemorySlice slice2) {
    reader1.pointTo(slice1.segment(), slice1.offset());
    reader2.pointTo(slice2.segment(), slice2.offset());
    for (int i = 0; i < fieldReaders.length; i++) {
        boolean isNull1 = reader1.isNullAt(i);
        boolean isNull2 = reader2.isNullAt(i);
        if (!isNull1 || !isNull2) {
            if (isNull1) {
                return -1;
            } else if (isNull2) {
                return 1;
            } else {
                FieldReader fieldReader = fieldReaders[i];
                Object o1 = fieldReader.readField(reader1, i);
                Object o2 = fieldReader.readField(reader2, i);
                @SuppressWarnings({"unchecked", "rawtypes"})
                int comp = ((Comparable) o1).compareTo(o2);
                if (comp != 0) {
                    return comp;
                }
            }
        }
    }
    return 0;
}
```
The seek is a plain binary search, which works because keys were written in sorted order. Note that even when no exact match is found, `polled` is left pointing at the smallest visited key greater than the target; for the index block, whose entries map each data block's last key to its BlockHandle, this positions the iterator at the block that could contain the key, which is exactly what `lookup` relies on.
```java
public boolean seekTo(MemorySlice targetKey) {
    int left = 0;
    int right = recordCount - 1;
    while (left <= right) {
        int mid = left + (right - left) / 2;
        // For an aligned iterator, seek directly to record * recordSize.
        // For an unaligned iterator, jump via the position index written by the writer.
        seekTo(mid);
        // Read one key-value pair.
        BlockEntry midEntry = readEntry();
        int compare = comparator.compare(midEntry.getKey(), targetKey);
        if (compare == 0) {
            polled = midEntry;
            return true;
        } else if (compare > 0) {
            polled = midEntry;
            right = mid - 1;
        } else {
            left = mid + 1;
        }
    }
    return false;
}
```
Summary
The lookup process:
- First check the bloom filter.
- Binary search the index block to find the BlockHandle of the data block that may contain the key.
- Using that handle, read the corresponding data block and search it for the matching key-value pair.
This post is from cnblogs, by Aitozi. When reposting, please cite the original link: https://www.cnblogs.com/Aitozi/p/18514737