[IR] BWT+MTF+AC
BWT (Burrows–Wheeler_transform)数据转换算法
MTF(Move-to-front transform)数据转换
良心PPT: bwt_based_compression_verbin.ppt
BWT Idea:
压缩技术主要的工作方式就是找到重复的模式,进行紧密的编码。
BWT(Burrows–Wheeler_transform)将原来的文本转换为一个相似的文本,转换后使得相同的字符位置连续或者相邻;
之后可以使用其他技术如:Move-to-front transform 和 游程编码(RLE) 进行文本压缩。
一般压缩可以将文本先使用Burrows–Wheeler transform生成局部相关性很好的序列,再使用MTF减少信息熵,最后再进行压缩。
Burrows–Wheeler transform + Run-length coding
Ori:rabcabcababaabacabcabcabcababaa$
BWT:aabbbbccacccrcbaaaaaaaaaabbbbba$
RLE:aab4ccac3rcba10b5a$
Encoding:
1. All rotations
#BANANAS
S#BANANA
AS#BANAN
NAS#BANA
ANAS#BAN
NANAS#BA
ANANAS#B
BANANAS#
2. Sort the rows and pick up the last col.
#BANANAS
ANANAS#B
ANAS#BAN
AS#BANAN
BANANAS#
NANAS#BA
NAS#BANA
S#BANANA
正好是个行列一致的Matrix。
ori=[#BANANAS]T
bwt=[SBNN#AAA]T
Decoding:
while( 循环len(bwt)次 ) {
"前插一列,字典排序"
}
Result:
#BANANAS
ANANAS#B
ANAS#BAN
AS#BANAN
BANANAS#
NANAS#BA
NAS#BANA
S#BANANA
MTF Idea:
可见,BWT转换将文本转换为局部相关性很好的序列。
而刚好MTF转换就是利用了空间局部性原理减少信息熵。
即:最近访问的字符总是出现在“recently used symbols”的前面位置,如果字符的空间局部性较好,编码之后就会出现很多小的数字,如”0“或”1“。
所以,在如下List列中,使用的字符即可提前到head pos处,如此一来,index标号比较小,省空间。
这样,就方便了下一步压缩策略。
The main idea is that each symbol in the data is replaced by its index in the stack of “recently used symbols”.
For example, long sequences of identical symbols are replaced by as many zeroes, whereas when a symbol that has not been used in a long time appears, it is replaced with a large number.
Thus at the end the data is transformed into a sequence of integers; if the data exhibits a lot of local correlations, then these integers tend to be small.
Encoding:
先建立字符集大小的List,“recently used symbols”,这里只考虑26个小写字母a~z。
From: https://en.wikipedia.org/wiki/Move-to-front_transform
Input Stream | Sequence | List | Operation |
---|---|---|---|
bananaaa | 1 | (abcdefghijklmnopqrstuvwxyz) | b放到head位置 |
bananaaa | 1,1 | (bacdefghijklmnopqrstuvwxyz) | a放到head位置 |
bananaaa | 1,1,13 | (abcdefghijklmnopqrstuvwxyz) | n放到head位置 |
bananaaa | 1,1,13,1 | (nabcdefghijklmopqrstuvwxyz) | a放到head位置 |
bananaaa | 1,1,13,1,1 | (anbcdefghijklmopqrstuvwxyz) | n放到head位置 |
bananaaa | 1,1,13,1,1,1 | (nabcdefghijklmopqrstuvwxyz) | a放到head位置 |
bananaaa | 1,1,13,1,1,1,0 | (anbcdefghijklmopqrstuvwxyz) | a放到head位置 |
bananaaa | 1,1,13,1,1,1,0,0 | (anbcdefghijklmopqrstuvwxyz) | a放到head位置 |
Final |
1,1,13,1,1,1,0,0 |
(anbcdefghijklmopqrstuvwxyz) |
|
ori=[b a n a n a a a]T
mtf=[1 1 13 1 1 1 0 0]T
Decoding:
Input Stream | Sequence | List |
---|---|---|
1,1,13,1,1,1,0,0 | b |
(abcdefghijklmnopqrstuvwxyz) |
1,1,13,1,1,1,0,0 | ba | (bacdefghijklmnopqrstuvwxyz) |
1,1,13,1,1,1,0,0 | ban | (abcdefghijklmnopqrstuvwxyz) |
1,1,13,1,1,1,0,0 | bana | (nabcdefghijklmopqrstuvwxyz) |
1,1,13,1,1,1,0,0 | banan | (anbcdefghijklmopqrstuvwxyz) |
1,1,13,1,1,1,0,0 | banana | (nabcdefghijklmopqrstuvwxyz) |
1,1,13,1,1,1,0,0 | bananaa | (anbcdefghijklmopqrstuvwxyz) |
1,1,13,1,1,1,0,0 | bananaaa | (anbcdefghijklmopqrstuvwxyz) |
可见,这里的思想就在于,利用了变化过程中的顺序性!
这样一来,string变为了数字。如果之前的string是对应一堆大的数字的话,那么,这次变为了一堆小数字,且0,1较多,Entropy小了!
Burrows–Wheeler transform + Move-To-Front
Ori:rabcabcababaabacabcabcabcababaa$
BWT:aabbbbccacccrcbaaaaaaaaaabbbbba$
MTF:0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,$
ARI:<(6+1)bytes>
问:MTF是否确实提高了压缩率呢?一堆小数字对压缩来说总会是好事。减小了信息熵,便自然想到了huffman或者arithmetic coding。
过程如下:
Input Stream | Sequence | List |
---|---|---|
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0 | (abcdefghijklmnopqrstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0 | (abcdefghijklmnopqrstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1 | (abcdefghijklmnopqrstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0 | (bacdefghijklmnopqrstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0,2 | (bacdefghijklmnopqrstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0,2,0 | (cbadefghijklmnopqrstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0,2,0,2 | (cbadefghijklmnopqrstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0,2,0,2,1 | (acbdefghijklmnopqrstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0,2,0,2,1,0,0,0 | (cabdefghijklmnopqrstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0,2,0,2,1,0,0,0,17 | (cbadefghijklmnopqrstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1 | (rcbadefghijklmnopqstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2 | (rcbadefghijklmnopqstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3 | (brcadefghijklmnopqstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0 | (abrcdefghijklmnopqstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1 | (abrcdefghijklmnopqstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0 | (barcdefghijklmnopqstuvwxyz) |
aabbbbccacccrcbaaaaaaaaaabbbbba$ | 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1 | (barcdefghijklmnopqstuvwxyz) |
Final |
0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1 |
继续采用Arithmetic coding处理:
32 in total(暂未考虑$)
0: 22/32
1: 5/32
2: 3/32
3: 1/32
17: 1/32
使用了8 Bytes。
计算Entropy:
计算式:(-22/32)*log2(22/32)+(-5/32)*log2(5/32)+(-3/32)*log2(3/32)+(-1/32)*log2(1/32)+(-1/32)*log2(1/32)
对比:
大约每个字符平均采用1.43bits来进行编码,也就是:
1.43*32=45.76 约为46bits,即6Bytes
加上结尾符:6+1($) = 7Bytes
Finally, 15 Bytes 搞定!
RLE:aab4ccac3rcba10b5a$
共18 Bytes!
总结:
有很多种组合方式,以上只是探讨了BWT+MTF+Arithmetic Coding的效果。
BWT+Run-Length --> Run-length FM-Index
参考: [IR] Advanced XML Compression - XBW
都利用了同样的压缩思想(Run-length encoding)。这里,通过B、S列替代了Last列。B列更有利于压缩。
注意:这里是'1'为首,'0'为尾。
通过B'、S列替代了Front列。B'列更有利于压缩。但S需要在使用前排序,也就是S-->C列。
注意:S、C列中每个elem代表的是block。且相同elem内部具有保序性。
index | Last | B | S* | B' | Front | C* |
1 | e | 1 | e | 1 | $ | $ |
2 | e | 0 | d | 1 | _ | _ |
3 | d | 1 | _ | 1 | _ | _ |
4 | _ | 1 | n | 1 | a | a |
5 | n | 1 | r | 1 | d | d |
6 | r | 1 | h | 1 | e | e |
7 | r | 0 | t | 0 | e | e |
8 | h | 1 | $ | 1 | e | h |
9 | h | 0 | a | 0 | e | n |
10 | t | 1 | e | 1 | h | r |
11 | $ | 1 | _ | 0 | h | t |
12 | a | 1 | 1 | n | ||
13 | e | 1 | 1 | r | ||
14 | e | 0 | 0 | r | ||
15 | _ | 1 | 1 | t |
如此这般,
B+B' = 15bits + 15bits = 2 Bytes.
S:11 Bytes
In total: 2+11+1(表len)=14 Bytes.
原来的方式:
Last列:15 Bytes
C Table (for Front列):row:9 * col:2 = 18 Bytes
$ | 1 |
_ | 2 |
a | 4 |
d | 5 |
e | 6 |
h | 10 |
n | 12 |
r | 13 |
t | 15 |
In total: 15+18=33 Bytes.
时间换空间,还是有点效果。
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· AI与.NET技术实操系列:向量存储与相似性搜索在 .NET 中的实现
· 基于Microsoft.Extensions.AI核心库实现RAG应用
· Linux系列:如何用heaptrack跟踪.NET程序的非托管内存泄露
· 开发者必知的日志记录最佳实践
· SQL Server 2025 AI相关能力初探
· 震惊!C++程序真的从main开始吗?99%的程序员都答错了
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理
· 单元测试从入门到精通
· 上周热点回顾(3.3-3.9)
· winform 绘制太阳,地球,月球 运作规律