InfluxDB CRUD Source Code Analysis

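Introduction to InfluxDB

InfluxDB is currently the top-ranked time series database on DB-Engines. Based on a reading of the InfluxDB source code, this article briefly introduces its internal module design and implementation mechanisms. My understanding is limited, so corrections are welcome.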

InfluxDB basic concepts

https://docs.influxdata.com/influxdb/v1.4/concepts/glossary

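The standard data model of a time series database:
Like the ordinary relational databases we use, InfluxDB has databases and tables (called measurements in InfluxDB). Each record follows this data model: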

timestamp
tags:
  tag_key: tag_value
  tag_key2: tag_value2
fields:
  field_key: value

An insert statement for monitoring CPU usage looks like this:

insert cpu_info,host=10.1.1.1,cpu_core=1 value=0.2,sys=0.1,user=0.1 1515413665013940928

The resulting record is:

timestamp: 1515413665013940928
tags:
  host(tag_key): 10.1.1.1(tag_value)
  cpu_core: 1
fields:
  value: 0.2
  sys: 0.1
  user: 0.1

Internally, InfluxDB still stores data as key/value pairs:

key: measurement + timestamp + tags + field_key
value: field_value
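
As a rough illustration of the key/value idea above, the sketch below flattens one point into one entry per field. The `Point` type, the `seriesKey` and `flatten` helpers, and the exact key layout are assumptions made for this example and are not InfluxDB's actual encoding. Sorting the tags is a choice made here so that the same tag set always produces the same string.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Point is a hypothetical, simplified view of one line-protocol record.
type Point struct {
	Measurement string
	Tags        map[string]string
	Fields      map[string]float64
	Timestamp   int64
}

// seriesKey joins the measurement with its tags, sorted by tag key so that
// the same tag set always yields the same string.
func seriesKey(p Point) string {
	keys := make([]string, 0, len(p.Tags))
	for k := range p.Tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(p.Measurement)
	for _, k := range keys {
		b.WriteString("," + k + "=" + p.Tags[k])
	}
	return b.String()
}

// flatten turns one point into one key/value entry per field.
// The "series#field@timestamp" layout is illustrative only.
func flatten(p Point) map[string]float64 {
	out := make(map[string]float64, len(p.Fields))
	sk := seriesKey(p)
	for field, v := range p.Fields {
		out[fmt.Sprintf("%s#%s@%d", sk, field, p.Timestamp)] = v
	}
	return out
}

func main() {
	p := Point{
		Measurement: "cpu_info",
		Tags:        map[string]string{"host": "10.1.1.1", "cpu_core": "1"},
		Fields:      map[string]float64{"value": 0.2, "sys": 0.1, "user": 0.1},
		Timestamp:   1515413665013940928,
	}
	for k, v := range flatten(p) {
		fmt.Println(k, "=>", v)
	}
}
```

One record with three fields therefore becomes three key/value entries that share the same series key and timestamp.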

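InfluxDB provides InfluxQL, a SQL-like language, for querying the time series data stored in the database. For example:

select * from cpu_info where time>0 and host='10.1.1.1'

Other statements are covered in the official documentation: https://docs.influxdata.com/influxdb/v1.4/query_language/data_download/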

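Module design

httpd: All of InfluxDB's API requests are served to the outside through the httpd interface.
influxql: InfluxDB implements its own SQL-like parser module, which parses the statements submitted when reading or writing data.
meta: metainfo records all of InfluxDB's metadata and dumps it to a single file. The metadata includes database names, retention policies, shard groups, users, and so on.
index: Index information for tags.
retention: Automatic cleanup of expired data.
tsdb: The core module of InfluxDB, the storage engine layer. Its design is broadly similar to an LSM tree, and InfluxDB calls it a TSM tree (Time-Structured Merge Tree). Many articles already cover LSM trees, so they are not described in detail here.

The sections below describe how each part of the storage engine works.

InfluxDB partitions data by time. Each partition is called a shard, and each shard has its own independent storage engine. The official documentation states that the reason for this design is fast deletion of expired data: once data is split into shards, deleting it only requires removing all of the shard's data files.

Deleting expired data is an important feature for a time series database, for example when performance metrics only need to be kept for the most recent month or few months.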

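Database creation flow

Given InfluxDB's module design, creating a database proceeds as follows:

The request enters the httpd module and is routed to the matching handler based on the call, POST /query.
The influxql parser parses the user input and produces an influxql.CreateDatabaseStatement object.
Based on the influxql.CreateDatabaseStatement, the meta module updates the metadata and regenerates the meta.db file, overwriting the previous one.

The meta module stores and maintains two kinds of metadata:

Database -> retention policy -> shardGroups: metadata about where data is stored. Most importantly, it records the startTime and endTime of each shard, so that every select query can use metainfo to determine which shards to read from.
User information, such as which user is an admin and whether a user has read or write permission on a database. InfluxDB's account and permission model is fairly simple; see the official documentation for details.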

Data insertion flow

Take the following data as an example:

name: cpu_info
time                core_num host     sys user value
----                -------- ----     --- ---- -----
1516326021157346829 2        10.1.1.1 0.1 0.1  0.2
1516326032353077597 1        10.1.1.1 0.2 0.1  0.3
1516326046959517000 3        10.1.1.1 0.1 0.4  0.5
1516326094282729378 1        10.1.1.2 0.3 0.6  0.9

When an insert is executed:

insert cpu_info,host=10.1.1.1,core_num=4 value=0.2,user=0.1,sys=0.1

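The flow breaks down as follows:

Use the information in meta.db to determine which shard to write to, or whether a new shard needs to be created.
Once the shard is found, write the data to the Cache. Data in the Cache is visible to select queries. When the Cache size reaches a threshold, it is flushed to disk as a TSM file, and the corresponding WAL data is deleted.
Write the data to the WAL. WAL data is not normally read; when InfluxDB restarts after a crash, the WAL is replayed to rebuild the Cache so that no data is lost.
Only after the data has been written to the WAL successfully does the call return success to the caller.

An insert writes directly to the in-memory Cache and appends to the WAL (write-ahead log) file, which is why InfluxDB's write performance is so good.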
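
The write path above can be sketched with an in-memory model. Everything here is an assumption for illustration: the `Engine` type, its fields, and the value-count threshold are hypothetical, and the WAL is modeled as a slice of strings rather than a file.

```go
package main

import (
	"errors"
	"fmt"
)

// Value is one timestamped field value.
type Value struct {
	Timestamp int64
	V         float64
}

// Engine is a hypothetical, in-memory stand-in for one shard's engine.
type Engine struct {
	cache     map[string][]Value // key: series key plus field key
	cacheSize int
	maxCache  int      // snapshot threshold, counted in values
	wal       []string // stands in for the append-only WAL file
	tsmFiles  []map[string][]Value
}

func NewEngine(maxCache int) *Engine {
	return &Engine{cache: map[string][]Value{}, maxCache: maxCache}
}

// Write applies the values to the cache, appends them to the WAL, and only
// then reports success. A full cache is snapshotted into a new "TSM file".
func (e *Engine) Write(key string, vals []Value) error {
	if key == "" {
		return errors.New("empty key")
	}
	e.cache[key] = append(e.cache[key], vals...)
	e.cacheSize += len(vals)
	e.wal = append(e.wal, fmt.Sprintf("write %s %v", key, vals))
	if e.cacheSize >= e.maxCache {
		e.snapshot()
	}
	return nil
}

// snapshot persists the cache and drops the WAL entries it made redundant.
func (e *Engine) snapshot() {
	e.tsmFiles = append(e.tsmFiles, e.cache)
	e.cache = map[string][]Value{}
	e.cacheSize = 0
	e.wal = nil
}

func main() {
	e := NewEngine(3)
	// The key string is illustrative, not the real on-disk key format.
	_ = e.Write("cpu_info,host=10.1.1.1,core_num=4 value", []Value{{1, 0.2}})
	fmt.Println("cached values:", e.cacheSize, "wal entries:", len(e.wal))
}
```

The sketch keeps the two properties the text relies on: a write touches only memory plus an append-only log, and the log can be discarded once the cache it protects has been persisted.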

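Query flow

Querying involves much more than inserting and is considerably more complex, largely because indexes come into play. The following query is used as the running example:

# select
select * from cpu_info where time>1000000 and host='10.1.1.1'

The query proceeds as follows:

Parsing InfluxQL

After the httpd module receives the select request, the InfluxQL parser turns it into a SelectStatement structure. The statement above, select * from cpu_info where time>1000000 and host='10.1.1.1', is parsed into several parts:

Sources: cpu_info, a Measurement object in influxql whose attributes include Database, RetentionPolicy, and the measurement Name.
Fields: Wildcard, that is, the * in the statement above.
Condition: the condition expression, an Expr object in InfluxDB. The where clause becomes a tree of BinaryExpr nodes: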

// BinaryExpr represents an operation between two expressions.
type BinaryExpr struct {
    Op  Token
    LHS Expr
    RHS Expr
}
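
To make the tree shape concrete, the sketch below builds the condition for the example query by hand and extracts the tag equality from it. The types are simplified local stand-ins (for instance, `Token` is a string here), and `tagEquals` is a hypothetical helper, not a function from the influxql package.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the influxql AST types.
type Expr interface{}

type Token string

const (
	AND Token = "AND"
	EQ  Token = "="
	GT  Token = ">"
)

type BinaryExpr struct {
	Op  Token
	LHS Expr
	RHS Expr
}

type VarRef struct{ Val string }        // a field, tag, or time reference
type StringLiteral struct{ Val string } // e.g. '10.1.1.1'
type IntegerLiteral struct{ Val int64 }

// tagEquals walks AND-connected conditions and collects `ref = 'string'`
// comparisons, which are the ones an index lookup can answer.
func tagEquals(e Expr, out map[string]string) {
	b, ok := e.(*BinaryExpr)
	if !ok {
		return
	}
	switch b.Op {
	case AND:
		tagEquals(b.LHS, out)
		tagEquals(b.RHS, out)
	case EQ:
		ref, okRef := b.LHS.(*VarRef)
		lit, okLit := b.RHS.(*StringLiteral)
		if okRef && okLit {
			out[ref.Val] = lit.Val
		}
	}
}

func main() {
	// time > 1000000 AND host = '10.1.1.1'
	cond := &BinaryExpr{
		Op:  AND,
		LHS: &BinaryExpr{Op: GT, LHS: &VarRef{Val: "time"}, RHS: &IntegerLiteral{Val: 1000000}},
		RHS: &BinaryExpr{Op: EQ, LHS: &VarRef{Val: "host"}, RHS: &StringLiteral{Val: "10.1.1.1"}},
	}
	tags := map[string]string{}
	tagEquals(cond, tags)
	fmt.Println(tags) // map[host:10.1.1.1]
}
```

The time comparison is left in the tree on purpose: it selects shards and filters blocks, while the tag comparison is what gets resolved through the index.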

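This is the simplest InfluxQL scenario. As of the current version, InfluxQL is a fairly complete structured query language that supports many of the aggregations commonly needed for time series workloads.

Locating the relevant shards

Using the time condition and the source measurement, the matching shards are looked up in meta.db. The steps below are then run against each shard to read the requested data.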
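
Shard selection amounts to an interval overlap test. The sketch below assumes each shard covers a half-open window [Start, End) in Unix nanoseconds. `ShardInfo` and `shardsInRange` are hypothetical names used only for this example.

```go
package main

import "fmt"

// ShardInfo is a hypothetical metadata record; times are Unix nanoseconds
// and the shard covers [Start, End).
type ShardInfo struct {
	ID         uint64
	Start, End int64
}

// shardsInRange returns the shards whose window overlaps [tmin, tmax].
func shardsInRange(all []ShardInfo, tmin, tmax int64) []ShardInfo {
	var hit []ShardInfo
	for _, s := range all {
		if s.Start <= tmax && s.End > tmin {
			hit = append(hit, s)
		}
	}
	return hit
}

func main() {
	all := []ShardInfo{{1, 0, 100}, {2, 100, 200}, {3, 200, 300}}
	// An open-ended condition such as time > 150 has no upper bound,
	// so a very large tmax is used here.
	for _, s := range shardsInRange(all, 150, 1<<62) {
		fmt.Println("query shard", s.ID)
	}
}
```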

read seriesKey from memory index

From the Condition (time>1000000, host='10.1.1.1') we need to resolve seriesKeys. InfluxDB stores a seriesKey internally as a string of the form measurementName,tag_key1=tag_value1,tag_key2=tag_value2,... This seriesKey is the index key in InfluxDB.

For the request above, three seriesKeys are resolved:

cpu_info,host=10.1.1.1,core_num=1
cpu_info,host=10.1.1.1,core_num=2
cpu_info,host=10.1.1.1,core_num=3

Here is how host='10.1.1.1' is mapped directly to those seriesKeys.

InfluxDB has an index module that stores the metadata needed to map a tag_key/tag_value pair to seriesKeys. The data model is roughly:

measurement_name ->
{tag_key1: {tag_value1: [seriesKey1, seriesKey2], ...}, ...}

As you might expect, this structure consumes a large amount of memory when there are many distinct tags, so tag design is extremely important when using InfluxDB.
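
The nested mapping above can be written down as a small inverted index. This is a minimal sketch: the `Index` type and its `Add` and `Lookup` methods are hypothetical, and the real index module holds considerably more state.

```go
package main

import "fmt"

// Index maps measurement -> tag key -> tag value -> series keys.
// This nested-map layout is an illustrative simplification.
type Index map[string]map[string]map[string][]string

// Add registers a series key under every one of its tag pairs.
func (idx Index) Add(measurement, seriesKey string, tags map[string]string) {
	if idx[measurement] == nil {
		idx[measurement] = map[string]map[string][]string{}
	}
	for k, v := range tags {
		if idx[measurement][k] == nil {
			idx[measurement][k] = map[string][]string{}
		}
		idx[measurement][k][v] = append(idx[measurement][k][v], seriesKey)
	}
}

// Lookup returns the series keys whose tag `key` equals `value`.
func (idx Index) Lookup(measurement, key, value string) []string {
	return idx[measurement][key][value]
}

func main() {
	idx := Index{}
	rows := []struct{ host, core string }{
		{"10.1.1.1", "2"}, {"10.1.1.1", "1"}, {"10.1.1.1", "3"}, {"10.1.1.2", "1"},
	}
	for _, r := range rows {
		sk := fmt.Sprintf("cpu_info,host=%s,core_num=%s", r.host, r.core)
		idx.Add("cpu_info", sk, map[string]string{"host": r.host, "core_num": r.core})
	}
	for _, sk := range idx.Lookup("cpu_info", "host", "10.1.1.1") {
		fmt.Println(sk)
	}
}
```

Every series key is stored once per tag pair, so the structure grows with the number of distinct tag combinations, which is the memory cost the text warns about.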

read block data

With the seriesKeys in hand, we combine the SelectStatement parsed from select * from cpu_info where time>1000000 and host='10.1.1.1':

fields: * 
Condition: time > 1000000

with the seriesKeys obtained from the index:

cpu_info,host=10.1.1.1,core_num=1
cpu_info,host=10.1.1.1,core_num=2
cpu_info,host=10.1.1.1,core_num=3

and use both to fetch the block data we want from the TSM files.

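TSM file storage structure

Since block data has come up, here is a brief look at how a TSM file is laid out. The official documentation describes the file as a header, blocks, an index, and a footer.

The two main parts are Blocks and Index. Blocks hold the actual data blocks, and Index holds the mapping to those blocks. Within an index entry:

The Key is the seriesKey obtained in the earlier step plus the field key.
The Offset points to the position of the corresponding block within the TSM file.

With these two properties, a seriesKey can be resolved to a specific block.

How do we get from a seriesKey to the position of its index entry in the TSM file?

An additional data structure, IndirectIndex (an indirect index), is kept in memory.
The offsets in IndirectIndex store the start position of every index record.
To find an index position, a binary search runs over all positions in offsets, reading the key of each probed index entry and comparing it with the seriesKey from the current select. This relies on the index in the TSM file being stored in sorted order. The code is as follows: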

// We use a binary search across our indirect offsets (pointers to all the keys
// in the index slice).
i := sort.Search(len(d.offsets), func(i int) bool {
    // i is the position in offsets we are at so get offset it points to
    offset := d.offsets[i]

    // It's pointing to the start of the key which is a 2 byte length
    keyLen := int32(binary.BigEndian.Uint16(d.b[offset : offset+2]))

    // See if it matches
    // Compare the key passed in with d.b[offset+2:offset+2+keyLen].
    return bytes.Compare(d.b[offset+2:offset+2+keyLen], key) >= 0
})

Once the index position is known, the Offset recorded in the TSM index gives the location of the corresponding block data within the TSM file, so the values can be read directly.
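
The excerpt above depends on fields of the surrounding type, so here is a self-contained version of the same binary search that can be run on its own. It is a sketch under simplifying assumptions: each index entry is reduced to a 2-byte big-endian key length followed by the key, and the `indirectIndex` type with its `add` and `search` methods is a local stand-in for illustration.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"sort"
)

// indirectIndex is a simplified stand-in: b holds index entries laid out as
// [2-byte key length][key] (real entries carry more fields), and offsets
// points at the start of each entry. Keys must be added in sorted order.
type indirectIndex struct {
	b       []byte
	offsets []int
}

// add appends one entry and records where it starts.
func (d *indirectIndex) add(key string) {
	d.offsets = append(d.offsets, len(d.b))
	var l [2]byte
	binary.BigEndian.PutUint16(l[:], uint16(len(key)))
	d.b = append(d.b, l[:]...)
	d.b = append(d.b, key...)
}

// search returns the byte offset of the entry for key, or -1 if absent.
func (d *indirectIndex) search(key []byte) int {
	keyAt := func(i int) []byte {
		off := d.offsets[i]
		n := int(binary.BigEndian.Uint16(d.b[off : off+2]))
		return d.b[off+2 : off+2+n]
	}
	// Find the first entry whose key is >= the key we are looking for.
	i := sort.Search(len(d.offsets), func(i int) bool {
		return bytes.Compare(keyAt(i), key) >= 0
	})
	if i < len(d.offsets) && bytes.Equal(keyAt(i), key) {
		return d.offsets[i]
	}
	return -1
}

func main() {
	d := &indirectIndex{}
	// Illustrative keys, already in sorted order.
	for _, k := range []string{"cpu#sys", "cpu#user", "cpu#value"} {
		d.add(k)
	}
	fmt.Println(d.search([]byte("cpu#user")), d.search([]byte("cpu#idle")))
}
```

Note that sort.Search only finds the first entry that is not smaller than the key, so an explicit equality check is needed before treating the result as a hit.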

read from cache

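The cache has a simple structure: it is a hash table keyed by seriesKey, so fetching the values for a seriesKey takes O(1) time, which is very fast.

Why do other LSM tree storage engines use a skip list for the in-memory structure (the Memtable in LevelDB) while InfluxDB uses a hash table? My guess is that LevelDB has to support range scans and therefore needs an ordered structure. An InfluxDB select statement does not allow conditions such as > or < on a tag key; all tag values are stored as strings and must match exactly, so there is no such requirement.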

merge block files and cache data

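Like every LSM-style storage engine, InfluxDB's TSM tree engine trades some query performance for high write concurrency.

Merge block data: after data is written to the cache and the amount written reaches a threshold, it is dumped to disk as a TSM file. There is also a notion of compaction level: the first write to disk produces a level 1 file. As data is merged later, lower-level TSM files are compacted into a new TSM file one level higher, so the compaction level increases. In the listing below, the shard with shardId 126 has two TSM files, 000000717-000000002.tsm and 000000717-000000003.tsm, where 000000002.tsm denotes a level 2 compaction file and 000000003.tsm denotes a level 3 compaction file.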

├── ps_retention_10m
│   ├── 126
│   │   ├── 000000717-000000002.tsm
│   │   └── 000000717-000000003.tsm

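The same seriesKey may be stored in TSM files at two different compaction levels, so a read has to walk the blocks from both files and deduplicate the results. Compaction is the most important part of the TSM tree storage engine, but space is limited and it is not covered further here.

Besides merging block data across TSM files, the result must also be merged with the cache data, deduplicating on the rule that later writes overwrite earlier ones.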
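
The "later write overwrites earlier write" rule can be sketched as a merge keyed by timestamp. This is an illustration under assumptions: `mergeValues` is a hypothetical helper, the caller is assumed to pass sources ordered from oldest to newest, and the real engine merges sorted blocks rather than going through a map.

```go
package main

import (
	"fmt"
	"sort"
)

// Value is one timestamped field value.
type Value struct {
	Timestamp int64
	V         float64
}

// mergeValues combines values for one key from several sources, ordered from
// oldest to newest (for example: older TSM file, newer TSM file, cache).
// When timestamps collide, the value from the later source wins.
func mergeValues(sources ...[]Value) []Value {
	latest := map[int64]float64{}
	for _, src := range sources {
		for _, v := range src {
			latest[v.Timestamp] = v.V // later sources overwrite earlier ones
		}
	}
	out := make([]Value, 0, len(latest))
	for ts, v := range latest {
		out = append(out, Value{Timestamp: ts, V: v})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Timestamp < out[j].Timestamp })
	return out
}

func main() {
	olderFile := []Value{{1, 0.1}, {2, 0.2}}
	newerFile := []Value{{2, 0.25}, {3, 0.3}}
	cache := []Value{{3, 0.35}, {4, 0.4}}
	for _, v := range mergeValues(olderFile, newerFile, cache) {
		fmt.Println(v.Timestamp, v.V)
	}
}
```

The extra work done here at read time is the query-side cost that the write-optimized design accepts.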

Data deletion flow

Suppose the user runs the following InfluxQL:

delete from cpu_info where host='10.1.1.1' and time>10000000;

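As with select, the flow is:

Use the condition and metainfo to find all matching shards, then perform the following steps on each shard.
Use the condition to fetch the matching seriesKeys from the index.
Delete the cache data: remove data from the cache by seriesKey and time range, and record the deletion in the WAL.
Delete the data in TSM files: a tombstone file is created to mark which data has been deleted; the TSM file itself is not modified.
After the delete completes, check whether all data for a seriesKey has been removed; if so, delete its index entry as well.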
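
The tombstone step can be sketched as a read-time filter. The `Tombstone` type, its field names, and `applyTombstones` are assumptions for illustration and do not reflect the actual tombstone file format.

```go
package main

import (
	"fmt"
	"math"
)

// Value is one timestamped field value.
type Value struct {
	Timestamp int64
	V         float64
}

// Tombstone marks [Min, Max] of one key as deleted without rewriting the
// data file. Field names here are illustrative.
type Tombstone struct {
	Key      string
	Min, Max int64
}

// applyTombstones filters out values covered by a tombstone at read time.
func applyTombstones(key string, vals []Value, ts []Tombstone) []Value {
	var out []Value
	for _, v := range vals {
		deleted := false
		for _, t := range ts {
			if t.Key == key && v.Timestamp >= t.Min && v.Timestamp <= t.Max {
				deleted = true
				break
			}
		}
		if !deleted {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	key := "cpu_info,host=10.1.1.1,core_num=1 value" // illustrative key
	vals := []Value{{5000000, 0.1}, {20000000, 0.2}}
	// delete from cpu_info where host='10.1.1.1' and time>10000000
	ts := []Tombstone{{Key: key, Min: 10000001, Max: math.MaxInt64}}
	for _, v := range applyTombstones(key, vals, ts) {
		fmt.Println(v.Timestamp, v.V)
	}
}
```

Recording the deleted range separately keeps the delete cheap, since the existing data file stays untouched and reads simply skip the marked ranges.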

posted @ 2022-01-21 14:19 梧桐花落
