列存储压缩技巧,除公共除数或者同时减去最小数,字符串压缩的话,直接去重后用数字ID压缩
Column-store compression
At a high level, doc values are essentially a serialized column-store. As we discussed in the last section, column-stores excel at certain operations because the data is naturally laid out in a fashion that is amenable to those queries.
But they also excel at compressing data, particularly numbers. This is important for both saving space on disk and for faster access. Modern CPU’s are many orders of magnitude faster than disk drives (although the gap is narrowing quickly with upcoming NVMe drives). That means it is often advantageous to minimize the amount of data that must be read from disk, even if it requires extra CPU cycles to decompress.
To see how it can help compression, take this set of doc values for a numeric field:
Doc Terms ----------------------------------------------------------------- Doc_1 | 100 Doc_2 | 1000 Doc_3 | 1500 Doc_4 | 1200 Doc_5 | 300 Doc_6 | 1900 Doc_7 | 4200 -----------------------------------------------------------------
The column-stride layout means we have a contiguous block of numbers:[100,1000,1500,1200,300,1900,4200]
.
xxx
Doc values use several tricks like this. In order, the following compression schemes are checked:
- If all values are identical (or missing), set a flag and record the value
- If there are fewer than 256 values, a simple table encoding is used
- If there are > 256 values, check to see if there is a common divisor
- If there is no common divisor, encode everything as an offset from the smallest value
You’ll note that these compression schemes are not "traditional" general purpose compression like DEFLATE or LZ4. Because the structure of column-stores are rigid and well-defined, we can achieve higher compression by using specialized schemes rather than the more general compression algorithms like LZ4.

You may be thinking "Well that’s great for numbers, but what about strings?" Strings are encoded similarly, with the help of an ordinal table. The strings are de-duplicated and sorted into a table, assigned an ID, and then those ID’s are used as numeric doc values. Which means strings enjoy many of the same compression benefits that numerics do.
The ordinal table itself has some compression tricks, such as using fixed, variable or prefix-encoded strings.
摘自:https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 记一次.NET内存居高不下排查解决与启示
· 探究高空视频全景AR技术的实现原理
· 理解Rust引用及其生命周期标识(上)
· 浏览器原生「磁吸」效果!Anchor Positioning 锚点定位神器解析
· 没有源码,如何修改代码逻辑?
· 全程不用写代码,我用AI程序员写了一个飞机大战
· MongoDB 8.0这个新功能碉堡了,比商业数据库还牛
· 记一次.NET内存居高不下排查解决与启示
· 白话解读 Dapr 1.15:你的「微服务管家」又秀新绝活了
· DeepSeek 开源周回顾「GitHub 热点速览」