[IR] Dictionary Coding
Lempel–Ziv–Welch
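LZ77 is a dictionary-based data compression algorithm, proposed by two Israeli researchers, Jacob Ziv and Abraham Lempel, in their 1977 paper "A Universal Algorithm for Sequential Data Compression".
Statistical compression codes such as Huffman coding need prior knowledge of the source, namely its character frequencies, before they can compress. In most cases this prior knowledge is hard to obtain in advance.
A more general-purpose compression code is therefore very useful, and LZ77 was designed to fill that need. Its core idea: exploit the repeated structure inside the data itself to compress it.
LZ77: refers to previously processed data as the dictionary, i.e. the dictionary is implicit in the data itself.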
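To make "previously processed data as the dictionary" concrete, here is a minimal sketch of a sliding-window coder. It is an illustration under stated assumptions, not the exact scheme from the paper: the function names, the `window` size of 16, the brute-force match search, and the `(offset, length, next_char)` triple layout are all choices made for this example.

```python
def lz77_encode(data: str, window: int = 16):
    """Encode data as (offset, length, next_char) triples (illustrative layout)."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        # Only the last `window` characters act as the dictionary.
        for j in range(max(0, i - window), i):
            k = 0
            # Always leave one character to serve as next_char.
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decode(triples):
    """Rebuild the text by copying from the already decoded output."""
    buf = []
    for off, length, ch in triples:
        for _ in range(length):
            buf.append(buf[-off])  # char-by-char copy, so overlapping matches work
        buf.append(ch)
    return "".join(buf)


text = "TOBEORNOTTOBEORTOBEORNOT"
assert lz77_decode(lz77_encode(text)) == text
```

No dictionary is ever transmitted: the decoder rebuilds every match from text it has already produced.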
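After proposing the sliding-window-based LZ77 algorithm, Jacob Ziv and Abraham Lempel introduced the LZ78 algorithm in a paper published in 1978.
Unlike LZ77, LZ78 maintains the previously seen strings in a dynamic, tree-structured dictionary.
LZ78: uses an explicit dictionary, i.e. the dictionary is kept outside the data.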
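A small sketch of the explicit-dictionary idea follows. It is a simplification for illustration: a flat Python `dict` stands in for the tree-structured dictionary, and the function names and the `(phrase_index, next_char)` output pairs are assumptions of this example rather than something taken from the paper.

```python
def lz78_encode(data: str):
    """Encode data as (phrase_index, next_char) pairs; index 0 is the empty phrase."""
    dictionary = {"": 0}
    out, phrase = [], ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the current match
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)  # learn the new phrase
            phrase = ""
    if phrase:
        # Input ended in the middle of a known phrase: flush it with no next_char.
        out.append((dictionary[phrase], ""))
    return out


def lz78_decode(pairs):
    """Rebuild the same phrase list while decoding."""
    phrases, out = [""], []
    for idx, ch in pairs:
        p = phrases[idx] + ch
        phrases.append(p)
        out.append(p)
    return "".join(out)


print(lz78_encode("ABABAB"))  # [(0, 'A'), (0, 'B'), (1, 'B'), (3, '')]
```

Both sides grow the same dictionary in the same order, so only the index/character pairs need to be sent.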
The algorithms in the LZ family are all variants of LZ77 or LZ78, with optimizations added on top:
- LZ77: LZSS, LZR, LZB, LZH
- LZ78: LZW, LZC, LZT, LZMW, LZJ, LZFG
LZW Encoding:
Video: https://www.youtube.com/watch?v=nW7OARbr7OI
"TO BE OR NOT TO BE OR TO BE OR NOT"
Idea:
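We start from a known dictionary in which the letters A to Z are numbered 1 to 26 (so B = 2, E = 5, O = 15, T = 20). Spaces in the input are ignored, and # marks the end of the input.
Newly discovered patterns are then added to the dictionary dynamically, numbered from 27 onward, as shown below: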
current | next | code | new entry | entry code |
T | O | 20 | TO | 27 |
O | B | 15 | OB | 28 |
B | E | 2 | BE | 29 |
E | O | 5 | EO | 30 |
O | R | 15 | OR | 31 |
R | N | 18 | RN | 32 |
N | O | 14 | NO | 33 |
O | T | 15 | OT | 34 |
T | T | 20 | TT | 35 |
TO | B | 27 | TOB | 36 |
BE | O | 29 | BEO | 37 |
OR | T | 31 | ORT | 38 |
TOB | E | 36 | TOBE | 39 |
EO | R | 30 | EOR | 40 |
RN | O | 32 | RNO | 41 |
OT | # | 34 | N/A | N/A |
The current and next columns are read from the input; the code column is the output.
The table has 16 rows, so the original 24 bytes shrink to 16 codes, i.e. 16 bytes if each code is stored in one byte.
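The encoding table can be reproduced with a few lines of Python. This is a minimal sketch, assuming the initial dictionary is A = 1 through Z = 26 and that the input contains only uppercase letters with the spaces already removed; the function name `lzw_encode` is chosen for this example.

```python
def lzw_encode(text: str):
    """Return the list of LZW codes for an uppercase A-Z string."""
    # Initial dictionary: A=1 ... Z=26; new patterns start at 27.
    dictionary = {chr(ord("A") + i): i + 1 for i in range(26)}
    next_code = 27
    codes, current = [], ""
    for ch in text:
        if current + ch in dictionary:
            current += ch  # longest known pattern keeps growing
        else:
            codes.append(dictionary[current])      # the "code" column
            dictionary[current + ch] = next_code   # the "new entry" columns
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])  # last row: end of input (#)
    return codes


codes = lzw_encode("TOBEORNOTTOBEORTOBEORNOT")
print(codes)
# [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34]
print(len(codes))  # 16
```

Each appended code corresponds to one row of the table. Since the largest code here is 41, a fixed 6 bits per code would also be enough, so one byte per code is a conservative estimate.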
LZW Decoding:
code | prev | output | new entry | entry code |
20 | N/A | T | N/A | N/A |
15 | T | O | TO | 27 |
2 | O | B | OB | 28 |
5 | B | E | BE | 29 |
15 | E | O | EO | 30 |
18 | O | R | OR | 31 |
14 | R | N | RN | 32 |
15 | N | O | NO | 33 |
20 | O | T | OT | 34 |
27 | T | TO | TT | 35 |
29 | TO | BE | TOB | 36 |
31 | BE | OR | BEO | 37 |
36 | OR | TOB | ORT | 38 |
30 | TOB | EO | TOBE | 39 |
32 | EO | RN | EOR | 40 |
34 | RN | OT | RNO | 41 |
The code column is the decoder's input; the output column is the recovered text.
Each row corresponds one-to-one to a row of the encoding table.
Decoding is simply the process of rebuilding that table.
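The table-rebuilding process can be sketched as follows, again assuming the initial dictionary A = 1 through Z = 26. The function name `lzw_decode` is chosen for this example. The branch for a code that is not yet in the dictionary never fires on this input, but it is the standard LZW corner case and is included so the sketch works on inputs such as "AAA".

```python
def lzw_decode(codes):
    """Rebuild the text (and the dictionary) from a list of LZW codes."""
    if not codes:
        return ""
    # Same initial dictionary as the encoder, but mapping code -> string.
    dictionary = {i + 1: chr(ord("A") + i) for i in range(26)}
    next_code = 27
    prev = dictionary[codes[0]]  # first row: nothing to add yet
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # Code refers to the entry being built right now.
            entry = prev + prev[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        out.append(entry)
        # New entry = previous output + first character of current output.
        dictionary[next_code] = prev + entry[0]
        next_code += 1
        prev = entry
    return "".join(out)


codes = [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34]
print(lzw_decode(codes))  # TOBEORNOTTOBEORTOBEORNOT
```

The decoder adds each dictionary entry one step later than the encoder did, which is why the first row of the decoding table has no new entry.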