[IR] Compression

Relationship: vocabulary size vs. collection size

Heaps’ law: M = kT^b
M is the size of the vocabulary, T is the number of tokens in the collection
Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5

log M = log k + b*log T
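
As a rough illustration of Heaps' law, the sketch below estimates vocabulary size from collection size. The function name `heaps_vocabulary_size` and the default values k = 44 and b = 0.49 are assumptions picked from within the typical range, not values given in these notes.

```python
import math


def heaps_vocabulary_size(num_tokens: int, k: float = 44.0, b: float = 0.49) -> float:
    """Estimate the vocabulary size M = k * T^b (Heaps' law).

    k and b are illustrative defaults (assumptions), not fitted values.
    """
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    return k * num_tokens ** b


def heaps_log_form(num_tokens: int, k: float = 44.0, b: float = 0.49) -> float:
    """Same law in log space: log10 M = log10 k + b * log10 T."""
    return math.log10(k) + b * math.log10(num_tokens)


if __name__ == "__main__":
    for tokens in (10_000, 1_000_000, 100_000_000):
        print(tokens, round(heaps_vocabulary_size(tokens)))
```

In log-log space the law is a straight line with slope b and intercept log k, which is why k and b are usually fitted by least squares on (log T, log M) pairs.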

Relationship: the collection frequency of each term vs. the rank of that term

Zipf’s law: cf_i = K/i

i.e. the most frequent term (the) occurs cf₁ times

The i-th most frequent term has frequency proportional to 1/i.

log cf_i = log K - log i
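
A small sketch of what Zipf's law predicts, assuming the constant K equals the frequency cf₁ of the top-ranked term. The function name `zipf_frequencies` is hypothetical and the numbers are made up for illustration.

```python
def zipf_frequencies(cf_1: float, num_ranks: int) -> list[float]:
    """Predicted collection frequencies cf_i = K / i for ranks 1..num_ranks.

    Assumes K = cf_1, the frequency of the most frequent term.
    """
    return [cf_1 / rank for rank in range(1, num_ranks + 1)]


if __name__ == "__main__":
    # Illustrative value only: pretend the top term occurs 1,000,000 times.
    for rank, cf in enumerate(zipf_frequencies(1_000_000, 5), start=1):
        print(rank, round(cf))
```

The second term is predicted to occur half as often as the first, the third a third as often, and so on, which gives a line of slope -1 in log-log space.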

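1). Dictionary as a string: the term text is pulled out into one long string, and each entry in the terms table keeps only a pointer into that string (size: 3 B).
    11.2 MB → 7.6 MB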
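
A minimal sketch of the dictionary-as-a-string idea, assuming the terms are sorted and each term ends where the next one begins. The function names are hypothetical, and Python integers stand in for the fixed-width pointers.

```python
def build_dictionary_string(terms):
    """Concatenate sorted terms into one string and record each start offset."""
    offsets = []
    pieces = []
    position = 0
    for term in sorted(terms):
        offsets.append(position)
        pieces.append(term)
        position += len(term)
    return "".join(pieces), offsets


def term_at(dictionary, offsets, index):
    """Recover the index-th term: it ends where the next term starts."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(dictionary)
    return dictionary[start:end]


def lookup(dictionary, offsets, term):
    """Binary search over the pointer table; returns the term index or -1."""
    low, high = 0, len(offsets) - 1
    while low <= high:
        mid = (low + high) // 2
        candidate = term_at(dictionary, offsets, mid)
        if candidate == term:
            return mid
        if candidate < term:
            low = mid + 1
        else:
            high = mid - 1
    return -1
```

Because the pointer table stays sorted, binary search still works, while the fixed-width term field (and its padding) is gone.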

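2). Blocking: with block size k = 4, three of every four term pointers are dropped, saving 3 × 3 B = 9 B, at the cost of 4 × 1 B for the per-term terminator bytes, for a net saving of 5 B per block.
    7.6 MB → 7.1 MB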
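
The blocking arithmetic can be checked with a few lines. The vocabulary size of 400,000 terms used here is an assumption (it is not stated in these notes); with it, the saving matches the 7.6 → 7.1 drop when the sizes are read as MB. The function name is hypothetical.

```python
def blocking_saving_bytes(num_terms, k, pointer_bytes=3, marker_bytes=1):
    """Net bytes saved by keeping one term pointer per block of k terms.

    Each block drops (k - 1) pointers but adds one marker byte per term.
    """
    num_blocks = num_terms // k
    saved_per_block = (k - 1) * pointer_bytes - k * marker_bytes
    return num_blocks * saved_per_block


if __name__ == "__main__":
    # 400,000 terms is an assumed vocabulary size, used only for illustration.
    saved = blocking_saving_bytes(400_000, k=4)
    print(saved / 1_000_000, "MB saved")
```

Larger k saves more space but slows lookup, since a block must be scanned linearly after the binary search lands on it.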

3). Front coding: exploits the redundancy of prefixes shared by consecutive sorted terms.
    7.1 MB → 5.9 MB
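
A sketch of one simple front-coding variant, assuming each term is stored as the length of the prefix it shares with the previous term plus the remaining suffix. Textbook versions usually factor out one common prefix per block instead; this per-term variant is swapped in here only because it is shorter to show. Function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_terms):
    """Encode each term as (shared prefix length with previous term, suffix)."""
    encoded = []
    previous = ""
    for term in sorted_terms:
        shared = common_prefix_len(previous, term)
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def front_decode(encoded):
    """Rebuild the term list from (prefix length, suffix) pairs."""
    terms = []
    previous = ""
    for shared, suffix in encoded:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms
```

For a run such as automata, automate, automatic, automation, only the first term is stored in full; the rest shrink to one or two characters plus a small count.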

Postings compression, as follows:

1). Seq1 + 1000 = Seq3

A small list is used to represent a large list.

2). Simple9

0110 (selector ID), 3 (three segments), 9 (bits per segment), 1 (number of wasted bits at the end).

So 4 + 3*9 + 1 = 32 bits = 4 bytes.
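
A hedged sketch of Simple9 packing. The layout is an assumption: the 4-bit selector sits in the top bits of a 32-bit word, values are packed from the low bits upward, and selectors are numbered in table order, so the 3 × 9 case gets selector 6 (0110) as in the example above. Function names are hypothetical.

```python
# (number of values, bits per value) for each of the nine selectors.
SIMPLE9_CONFIGS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5),
                   (4, 7), (3, 9), (2, 14), (1, 28)]


def simple9_pack_word(values, start=0):
    """Pack values[start:] into one 32-bit word; return (word, values consumed)."""
    for selector, (count, bits) in enumerate(SIMPLE9_CONFIGS):
        chunk = values[start:start + count]
        if len(chunk) < count:
            continue  # not enough values left to fill this layout
        if all(0 <= v < (1 << bits) for v in chunk):
            word = selector << 28
            for slot, value in enumerate(chunk):
                word |= value << (slot * bits)
            return word, count
    raise ValueError("value is negative or needs more than 28 bits")


def simple9_unpack_word(word):
    """Read the selector, then split the 28 data bits accordingly."""
    count, bits = SIMPLE9_CONFIGS[word >> 28]
    mask = (1 << bits) - 1
    return [(word >> (slot * bits)) & mask for slot in range(count)]


def simple9_encode(values):
    words, i = [], 0
    while i < len(values):
        word, used = simple9_pack_word(values, i)
        words.append(word)
        i += used
    return words


def simple9_decode(words):
    return [v for word in words for v in simple9_unpack_word(word)]
```

The packer is greedy: it tries the densest layout first and falls back to wider slots until every value in the chunk fits.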

3). Gap (if the average gap of a term is G)

About log₂G bits per gap; in practice this relies on a variable-length encoding such as the variable byte codes that follow.
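
A minimal sketch of gap encoding, assuming docIDs are positive and strictly increasing within a postings list. The function names are hypothetical.

```python
def to_gaps(doc_ids):
    """Replace each docID by its difference from the previous docID."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Recover docIDs by taking a running sum of the gaps."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids
```

Frequent terms have small gaps, so a code that spends fewer bits on small numbers compresses their postings well.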

4). Variable Byte codes.

A control bit is added to each byte, so one complete number is represented as: (0 + data, 0 + data, ..., 1 + last data)
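
A sketch of variable byte coding using the convention described above: 7 data bits per byte, control bit 0 on every byte except the last, and control bit 1 on the last byte of each number. Function names are hypothetical.

```python
def vb_encode_number(n):
    """Encode one non-negative integer as variable byte code."""
    if n < 0:
        raise ValueError("n must be non-negative")
    groups = [n % 128]
    n //= 128
    while n > 0:
        groups.insert(0, n % 128)  # more significant 7-bit groups go first
        n //= 128
    groups[-1] += 128  # set the control bit on the last byte
    return bytes(groups)


def vb_encode(numbers):
    """Encode a sequence of numbers (for example, gaps) back to back."""
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(data):
    """Decode a byte stream produced by vb_encode."""
    numbers = []
    current = 0
    for byte in data:
        if byte < 128:
            current = 128 * current + byte  # control bit 0: number continues
        else:
            numbers.append(128 * current + (byte - 128))  # control bit 1: done
            current = 0
    return numbers
```

For example, 824 = 6 × 128 + 56 becomes the two bytes 00000110 10111000, while the gap 5 fits in the single byte 10000101.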

5). Elias-γ code

6). Elias-δ code

7). Golomb code

Omitted for now.
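
For reference, the two Elias codes can be sketched in a few lines. This assumes the convention in which the γ code is a unary length (that many 1s, then a 0) followed by the offset, i.e. the binary form with its leading 1 removed; other texts write the unary part with 0s instead. The δ code here stores the bit length with γ and then the offset. Bit strings are used for readability, and all function names are hypothetical.

```python
def gamma_encode(n):
    """Elias-gamma code of n >= 1 as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("gamma code is defined for n >= 1")
    offset = bin(n)[3:]  # binary digits of n without the leading 1
    return "1" * len(offset) + "0" + offset


def gamma_decode(bits, pos=0):
    """Decode one gamma code starting at pos; return (value, next position)."""
    length = 0
    while bits[pos] == "1":
        length += 1
        pos += 1
    pos += 1  # skip the 0 that terminates the unary length
    offset = bits[pos:pos + length]
    return int("1" + offset, 2), pos + length


def delta_encode(n):
    """Elias-delta code: gamma code of the bit length, then the offset."""
    if n < 1:
        raise ValueError("delta code is defined for n >= 1")
    offset = bin(n)[3:]
    return gamma_encode(len(offset) + 1) + offset


def delta_decode(bits, pos=0):
    """Decode one delta code starting at pos; return (value, next position)."""
    num_bits, pos = gamma_decode(bits, pos)
    length = num_bits - 1
    offset = bits[pos:pos + length]
    return int("1" + offset, 2), pos + length
```

Under this convention 13 (binary 1101) has offset 101 and γ code 1110101. Both codes are prefix-free, so a stream of gaps can be decoded by calling the decoder repeatedly with the returned position.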

posted @ 2016-11-05 15:04 郝壹贰叁
