InfluxDB Configuration Tuning

Background

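InfluxDB is the top-ranked time series database on DB-Engines and is used at some scale within the ApsaraDB team. This article shares the tuning we did while using InfluxDB and the problems we ran into while operating it. It is organized into two parts: configuration tuning and troubleshooting techniques.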

Configuration tuning

Tuning InfluxDB's configuration here means adjusting a handful of configuration parameters to improve InfluxDB's performance.

Performance tuning

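InfluxDB's storage engine is the TSM Tree. Its overall design is similar to an LSM Tree, with some optimizations to the data storage layout for time series workloads. The value of cache-snapshot-memory-size should be increased.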

[data]
  # CacheSnapshotMemorySize is the size at which the engine will
  # snapshot the cache and write it to a TSM file, freeing up memory
  cache-snapshot-memory-size = 562144000

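cache-snapshot-memory-size controls the size of the cache in this LSM-style design. Once the cache reaches the threshold, it is flushed to disk as a TSM file, which starts at level 0. Two TSM files of the same level are compacted into one TSM file of level + 1; that is, two level 0 TSM files produce one level 1 TSM file. This design is where the TSM tree's write amplification comes from.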

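Because InfluxDB always compacts two lower-level files into one TSM file of the next level up, a smaller cache means the cache is dumped to TSM files more often, compaction runs more often, and write amplification becomes more pronounced. Under a high write rate, this makes InfluxDB's throughput drop very noticeably.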
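The relationship between cache size and write amplification can be illustrated with a toy model. The sketch below follows only the simplified description given here (flush a level 0 file whenever the cache fills, merge any two files of the same level into one file of the next level); the real InfluxDB compaction planner is more involved, and the function name and byte sizes are made up for illustration.

```python
def simulate(total_bytes, cache_size):
    """Toy model of snapshot + pairwise compaction.

    Returns (snapshots, compactions, bytes_rewritten_by_compaction).
    Assumption: exactly two files of the same level are merged into
    one file of level + 1, as described in the text.
    """
    pending = {}        # level -> size of the one file waiting at that level
    snapshots = 0
    compactions = 0
    rewritten = 0

    for _ in range(total_bytes // cache_size):
        snapshots += 1
        level, size = 0, cache_size
        # A new file "carries" upward like binary addition: as long as a
        # file already waits at this level, merge the two and move up.
        while level in pending:
            size += pending.pop(level)
            level += 1
            compactions += 1
            rewritten += size   # the merged file is written out in full
        pending[level] = size

    return snapshots, compactions, rewritten


if __name__ == "__main__":
    for cache in (256, 128):
        snaps, comps, rewritten = simulate(1024, cache)
        print(cache, snaps, comps, rewritten / 1024)
```

In this model, halving the cache size doubles the number of snapshots and adds one more compaction level, so the same data is rewritten one additional time.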

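An important reason why more frequent compaction hurts is that the data in TSM files is encoded to reduce disk usage. Compaction has to decode that data, which brings a noticeable performance cost.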

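To mitigate this, InfluxDB distinguishes two kinds of compaction: optimize compaction and full compaction. A full compaction first decodes the blocks in the TSM files, then re-encodes the decoded point values into blocks in time order, according to the number of points stored per block, and writes them to a new TSM file; the performance cost is very significant. An optimize compaction does not read the data inside blocks; it only stitches multiple blocks together, which reduces the cost.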

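In some scenarios InfluxDB currently still chooses full compaction. Optimize compaction speeds compaction up and reduces its resource consumption, but it causes read amplification: a query has to read several blocks to collect the data it needs to return.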
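The trade-off between the two compaction types can be sketched in a few lines. This is an illustration only, not InfluxDB's code: a block is modeled as a plain time-sorted list of (timestamp, value) pairs standing in for an encoded TSM block, and all function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, float]   # (timestamp, value)
Block = List[Point]         # stand-in for an encoded block, sorted by time


def full_compact(blocks: List[Block], points_per_block: int) -> List[Block]:
    """Decode every block, order all points by time, re-encode fixed-size blocks."""
    points = sorted(p for block in blocks for p in block)
    return [points[i:i + points_per_block]
            for i in range(0, len(points), points_per_block)]


def optimize_compact(blocks: List[Block]) -> List[Block]:
    """Leave block contents untouched; only place the blocks side by side."""
    return sorted((b for b in blocks if b), key=lambda b: b[0][0])


def blocks_touched(blocks: List[Block], start: int, end: int) -> int:
    """Count the blocks whose time range overlaps the query range [start, end]."""
    return sum(1 for b in blocks if b and b[0][0] <= end and b[-1][0] >= start)


if __name__ == "__main__":
    # Four blocks whose time ranges overlap each other.
    source = [[(1, 0.1), (5, 0.5)], [(2, 0.2), (6, 0.6)],
              [(3, 0.3), (7, 0.7)], [(4, 0.4), (8, 0.8)]]
    print(blocks_touched(full_compact(source, 4), 1, 4))
    print(blocks_touched(optimize_compact(source), 1, 4))
```

With overlapping input blocks, the fully compacted layout answers the range query from a single block, while the stitched layout has to visit every block; that difference is the read amplification, and the sort plus re-chunking in full_compact stands in for the decode and re-encode cost.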

Notes

In theory, the larger cache-snapshot-memory-size is, the better, but keep your hardware configuration in mind, available memory in particular.

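The right value for cache-snapshot-memory-size depends on the number of tags being written concurrently. If the number of tags is large, be sure to increase this value. If there are few tags and only a handful of them are written at high frequency, a somewhat lower value will not affect performance much.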

Timeout settings

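This problem mainly exists in older versions of InfluxDB (below 1.5) and carries some risk. InfluxDB's control over memory usage is rather crude. Avoid running ad hoc queries against a production database: once a query consumes too much memory, InfluxDB can easily run out of memory (OOM). Moreover, with the current inverted index design, InfluxDB restarts very slowly after being OOM-killed, because startup has to scan the TSM files again to rebuild the in-memory inverted index, and the data is all encoded; with a large data set, startup is extremely slow. It is therefore recommended to enable the query limits shown in the configuration below. Under normal conditions requests that large do not occur; if you do have such special queries, adjust the limits at your discretion.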

[coordinator]
  # The maximum time a query is allowed to execute before being killed by the system.  This limit
  # can help prevent run away queries.  Setting the value to 0 disables the limit.
  query-timeout = "120s"
  
  # The maximum number of points a SELECT can process.  A value of 0 will make
  # the maximum point count unlimited.  This will only be checked every 10 seconds so queries will not
  # be aborted immediately when hitting the limit.
  max-select-point = 1000000
  
  # The maximum number of series a SELECT can run.  A value of 0 will make the maximum series
  # count unlimited.
  max-select-series = 1000000
  
[http]
  # The maximum number of rows the system can return from a non-chunked query.  A value of 0
  # makes the row count unlimited.
  max-row-limit = 1000000

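InfluxDB's memory usage is driven mainly by the following:

When the result set is large, the data is cached in memory and returned to the client all at once after the computation finishes. InfluxDB's internal operators exchange data as a streaming pipeline, but results are streamed back to the client only if the client passes the "chunked" parameter in its request. It is therefore recommended to always set max-select-point, query-timeout, and max-row-limit to guard against accidental queries. Reportedly, versions after 1.6 can kill a running query so that it can be stopped promptly; I have not looked into this in detail, so it is not covered here.
If a query has no filter condition on tags, each query copies the full set of series in memory, so the filter conditions of a query need to be stated explicitly.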
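Both points can be applied on the client side. The sketch below assumes the InfluxDB 1.x HTTP /query endpoint with its chunked and chunk_size parameters, and assumes that a chunked response arrives as one JSON document per line. The host, database name (mydb), measurement (cpu), and tag (host) are hypothetical, and the helper names are not part of any InfluxDB client library.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, db, query, chunk_size=10000):
    """Build a /query URL that asks the server to stream results in chunks."""
    params = {
        "db": db,
        "q": query,
        "chunked": "true",
        "chunk_size": str(chunk_size),
    }
    return f"{base_url}/query?{urlencode(params)}"


def iter_rows(lines):
    """Yield result rows from a chunked response, one JSON document per line.

    Rows are yielded as they arrive, so the client never has to hold
    the whole result set in memory.
    """
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        for result in doc.get("results", []):
            for series in result.get("series", []):
                for row in series.get("values", []):
                    yield row


# The query restricts both the tag (host) and the time range, so the
# server does not have to consider every series in the database.
QUERY = "SELECT mean(usage) FROM cpu WHERE host = 'web-01' AND time > now() - 1h GROUP BY time(1m)"
URL = build_query_url("http://localhost:8086", "mydb", QUERY)
```

To consume a real response, the object returned by urllib.request.urlopen(URL) can be passed directly to iter_rows, since iterating over it yields the response line by line.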

OOM problems do not occur in normal time series usage of InfluxDB, but it is easy to fall into this pit by accident.

Data level

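max-series-per-database can be set to 0. As the comment says, this parameter controls the maximum number of series per database.

max-values-per-tag can be set to 0. As the comment says, this parameter controls the number of tag values per tag.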

[data]
  # The maximum series allowed per database before writes are dropped.  This limit can prevent
  # high cardinality issues at the database level.  This limit can be disabled by setting it to
  # 0.
  max-series-per-database = 0
  
  # The maximum number of tag values per tag that are allowed before writes are dropped.  This limit
  # can prevent high cardinality tag values from being written to a measurement.  This limit can be
  # disabled by setting it to 0.
  # max-values-per-tag = 0

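The underlying reason these two parameters need to be controlled is the implementation of the inverted index in InfluxDB's internal design: if the user's data schema is poorly designed, memory usage becomes excessive. Users are therefore advised not to set these two parameters to 0, and instead to estimate the number of series the workload will produce and set the limits accordingly.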
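One rough way to produce such an estimate is to multiply the number of distinct values of each tag key within a measurement, then sum over measurements. The sketch below computes this upper bound and checks it against candidate limits; the schema, the function names, and the limit values passed in are all assumptions for illustration, and the real series count is usually lower because not every tag combination occurs.

```python
from math import prod


def series_upper_bound(measurements):
    """Upper bound on series count.

    measurements maps measurement name -> {tag key: distinct value count}.
    A measurement without tags contributes exactly one series.
    """
    return sum(prod(tags.values()) for tags in measurements.values())


def check_limits(measurements, max_series_per_database, max_values_per_tag):
    """Return a list of human-readable problems; a limit of 0 means 'unlimited'."""
    problems = []
    for name, tags in measurements.items():
        for key, count in tags.items():
            if max_values_per_tag and count > max_values_per_tag:
                problems.append(
                    f"{name}.{key}: {count} tag values exceed {max_values_per_tag}")
    total = series_upper_bound(measurements)
    if max_series_per_database and total > max_series_per_database:
        problems.append(
            f"estimated {total} series exceed {max_series_per_database}")
    return problems


if __name__ == "__main__":
    # Hypothetical schema: 500 hosts with 64 cores and 8 disk devices each.
    schema = {"cpu": {"host": 500, "core": 64},
              "disk": {"host": 500, "device": 8}}
    print(series_upper_bound(schema))
    print(check_limits(schema, 1000000, 100000))
```

A tag with unbounded values, such as a user ID or request ID, shows up immediately in this kind of check, which is usually a sign that the value belongs in a field rather than a tag.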

Security level

In a production environment, reporting-disabled should be set to true. This improves the security of the whole system and also helps with risk control.

# Once every 24 hours InfluxDB will report usage data to usage.influxdata.com
# The data includes a random ID, os, arch, version, the number of series and other
# usage data. No data from user databases is ever transmitted.
# Change this option to true to disable reporting.
reporting-disabled = true

Posted 2022-01-17 17:00 by 梧桐花落
