ZSTD Notes
References
- Recommended file sizes (or size ranges) for dictionary compression to work well:
- Dictionaries are most effective on small data, from about 50 bytes up to a few KB.
- Once you get into the 100 KB - 1 MB range, dictionary effectiveness starts to drop.
- Below roughly 50 bytes, however, there is usually not enough compressible redundancy to offset the frame header.
- Train the dictionary on samples of the files you intend to compress.
- The total size of the samples should be 10x to 100x the dictionary size (default 110 KB); see the example command after the links below.
- Online documentation:
- zstd(1) — zstd — Debian unstable — Debian Manpages
- The official site links implementations in multiple languages:
- Zstandard - Real-time data compression algorithm
- Download the latest release:
- Central Repository: com/github/luben/zstd-jni
- Usage examples:
- GitHub - luben/ZstdAndroidExample
- ZstdAndroidExample/MainActivity.java at master · luben/ZstdAndroidExample · GitHub
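Translating the sizing guidance above into a command, a first training run might look like the following sketch (the samples/ path is a placeholder; 112640 bytes is the documented default --maxdict, i.e. ~110 KB):

zstd --train --maxdict=112640 samples/* -o samples.dic

For that target size, samples/* should total roughly 1.1 MB to 11 MB (10x to 100x the dictionary size).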
Testing compression ratios for different dictionary sizes
Sample set size: 102 MB (107,155,190 bytes); sample count: 173,842
Compression ratio without a dictionary
zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress request/request/* --output-dir-flat req-c
Even with the multithreading options set, CPU usage is still only about 12.4%! And I/O is essentially zero.
Raising the memory limit doesn't increase usage either: a flat 16.8 MB with no fluctuation at all.
Reading the input takes 20-odd minutes.
Once the progress display appears, CPU actually drops to about 4%, memory rises to 33.8 MB, and I/O sits around 1.2 MB.
Started 10:16:07, finished 10:43:46; total compression time 00:27:39.
173842 files compressed : 60.55% ( 102 MiB => 61.9 MiB)
Try training with ZSTD's minimum dictionary size, 256 bytes
zstd --verbose --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress --train --train-cover --maxdict=256 request/request/* -o req.256.dic
CPU usage only about 12.4%; very little I/O.
Start time: 11:31:43
! Warning : setting manual memory limit for dictionary training data at 0 MB
Training samples set too large (102 MB); training on 0 MB only...
Trying 82 different sets of parameters
d=6
Total number of training samples is 1 and is invalid. Failed to initialize context
dictionary training failed : Src size is incorrect
Per the warning above, the -M1024 memory limit capped the training data at 0 MB, so effectively no samples were loaded; retry without -M:
zstd --verbose --ultra -22 -T0 --auto-threads=logical --trace noc.log --progress --train --train-cover --maxdict=256 request/request/* -o req.256.dic
CPU usage only about 12.4%; very little I/O.
Start time: 11:56:32
End time: 14:07:38
Training time: 02:11:06
Sample-to-dictionary size ratio: 107,155,190 / 256 ≈ 418,574x
k=146
d=6
steps=40
split=100
zstd --verbose --ultra -22 -T0 --auto-threads=logical --progress -D req.256.dic --output-dir-flat req-c-256 request/request/*
173842 files compressed : 49.99% ( 102 MiB => 51.1 MiB)
Start time: 14:24:46
End time: 15:02:44
Compression time: 00:37:58
zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.256.dic -o req.256.c.dic
Compressed dictionary size = 101.56% ( 256 B => 260 B, req.256.c.dic)
Savings efficiency = (100 - 49.99) / 256 × 1024 = 200.04, i.e. ~200 points of compressed-size reduction per KB of dictionary
Improvement over no dictionary = 60.55 - 49.99 = 10.56 points
Improvement efficiency over no dictionary = (60.55 - 49.99) / 256 × 1024 = 42.24, i.e. ~42.24 points per KB of dictionary
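All the per-KB figures in these notes follow the same ad-hoc formula, where $R_0$ is the baseline compressed-size percentage (100 for the "savings" rows, 60.55 - the no-dictionary ratio - for the "improvement" rows) and $S$ is the dictionary size in bytes used in that row:

$$\text{efficiency per KB} = \frac{R_0 - R_{\text{dict}}}{S} \times 1024$$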
Average sample size: 107,155,190 / 173,842 = 616 bytes
zstd --verbose --train --train-cover --maxdict=616 request/request/* -o req.616.dic
CPU usage only about 12.4%; very little I/O.
Start time: 18:51:07
First training message appeared at: 19:12:56 (reading took ≈22 minutes)
Training time = 03:56:51
Total training time = 04:18:40
Sample-to-dictionary size ratio: ≈173,842x
k=242
d=6
steps=40
split=100
zstd -D req.616.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-616
173842 files compressed : 37.78% ( 102 MiB => 38.6 MiB)
zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.616.dic -o req.616.c.dic
Compressed dictionary size = 89.12% ( 616 B => 549 B, req.616.c.dic)
Savings efficiency = (100 - 37.78) / 549 × 1024 = 116.053, i.e. ~116 points per KB of dictionary
Improvement over no dictionary = 60.55 - 37.78 = 22.77 points
Improvement efficiency over no dictionary = (60.55 - 37.78) / 549 × 1024 = 42.471, i.e. ~42 points per KB
Improvement over the 256 B dictionary = 49.99 - 37.78 = 12.21 points
Improvement efficiency over the 256 B dictionary = (49.99 - 37.78) / (549 - 256) × 1024 = 42.672, i.e. ~42 points per KB
Set the dictionary size to 10x the average sample size: 6,166 bytes = 6.02 KB
zstd --verbose --train --train-cover --maxdict=6166 request/request/* -o req.6166.dic
CPU usage only about 12.4%; very little I/O.
Start time: 19:10:41
First training message appeared at: 19:33:20 (reading took ≈22 minutes)
Total training time = 03:59:06
Sample-to-dictionary size ratio: 107,155,190 / 6,166 ≈ 17,378x
k=1250
d=8
steps=40
split=100
zstd -D req.6166.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-6166
173842 files compressed : 21.22% ( 102 MiB => 21.7 MiB)
zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.6166.dic -o req.6166.c.dic
Compressed dictionary size = 43.01% ( 6.02 KiB => 2.59 KiB, req.6166.c.dic)
Savings efficiency = (100 - 21.22) / 2652 × 1024 = 30.419, i.e. ~30 points per KB of dictionary
Dictionary size growth relative to the average-size dictionary (616 B, 549 B compressed) = 2652 / 549 = 4.831x
Efficiency relative to the average-size dictionary drops to 30.419 / 116.053 = 26.2%, i.e. a 73.8% reduction
Improvement over no dictionary = 60.55 - 21.22 = 39.33 points
Improvement efficiency over no dictionary = (60.55 - 21.22) / 2652 × 1024 = 15.186, i.e. ~15 points per KB
Improvement over the 256 B dictionary = 49.99 - 21.22 = 28.77 points
Improvement efficiency over the 256 B dictionary = (49.99 - 21.22) / (2652 - 256) × 1024 = 12.296, i.e. ~12 points per KB
Manually delete the visible strings at the end of the dictionary and test compression again
Original (uncompressed) dictionary: 6,166 bytes; compressed: 2,652 bytes (after compression no meaningful visible strings remain)
After deleting the visible strings: 151 bytes (a hex editor makes the deletion precise)
Compression does not fail immediately at startup, but eventually aborts with a memory error:
zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress -D req.6166.YeThin.dic --output-dir-flat req-c-6166.YeThin ../request/request/*
Compression start time: 12:11:13
zstd: error 11 : Allocation error : not enough memory
zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log --progress -D req.6166.YeThin.dic --output-dir-flat req-c-6166.YeThin ../request/request/*
Dropping the -M1024 option doesn't help either:
zstd: error 11 : Allocation error : not enough memory
zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.6166.YeThin.dic -o req.6166.YeThin.c.dic
Compressed dictionary size = 108.61% ( 151 B => 164 B, req.6166.YeThin.c.dic)
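A quick way to check whether a hand-edited dictionary is still usable at all is a single-file round trip (a sketch; sample.bin stands in for any one of the request files):

zstd -D req.6166.YeThin.dic sample.bin -o sample.bin.zst
zstd -d -D req.6166.YeThin.dic sample.bin.zst -o sample.out
cmp sample.bin sample.out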
Set the dictionary size to 100x the average sample size: 61,666 bytes = 60.22 KB
zstd --verbose --train --train-cover --maxdict=61666 request/request/* -o req.61666.dic
CPU usage only about 12.4%; very little I/O.
Start time: 19:35:56
First training message appeared at: 19:57:20 (reading took ≈22 minutes)
Total training time = 03:44:13
Sample-to-dictionary size ratio: 107,155,190 / 61,666 ≈ 1,737x
k=1970
d=6
steps=40
split=100
zstd -D req.61666.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-61666
173842 files compressed : 18.49% ( 102 MiB => 18.9 MiB)
zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.61666.dic -o req.61666.c.dic
Compressed dictionary size = 24.78% ( 60.2 KiB => 14.9 KiB, req.61666.c.dic)
Savings efficiency = (100 - 18.49) / 15278 × 1024 = 5.463, i.e. ~5 points per KB of dictionary
Dictionary size growth relative to the average-size dictionary = 15278 / 549 = 27.829x
Efficiency relative to the average-size dictionary drops to 5.463 / 116.053 = 4.7%, i.e. a 95.3% reduction
Improvement over no dictionary = 60.55 - 18.49 = 42.06 points
Improvement efficiency over no dictionary = (60.55 - 18.49) / 15278 × 1024 = 2.819, i.e. ~2.8 points per KB
Improvement over the 256 B dictionary = 49.99 - 18.49 = 31.5 points
Improvement efficiency over the 256 B dictionary = (49.99 - 18.49) / (15278 - 256) × 1024 = 2.147, i.e. ~2 points per KB
Set the dictionary size to 1,000x the average sample size: 616,000 bytes = 587.47 KB
zstd --verbose --train --train-cover --maxdict=616000 request/request/* -o req.616000.dic
CPU usage only about 12.4%; very little I/O.
Start time: 18:59:18
First training message appeared at: 19:21:37 (reading took ≈22 minutes)
Total training time = 03:57:46
Sample-to-dictionary size ratio: ≈173x
k=1778
d=8
steps=40
split=100
zstd -D req.616000.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-616000
173842 files compressed : 16.00% ( 102 MiB => 16.4 MiB)
zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.616000.dic -o req.616000.c.dic
Compressed dictionary size = 19.99% ( 602 KiB => 120 KiB, req.616000.c.dic)
Savings efficiency = (100 - 16.00) / 123151 × 1024 = 0.698, i.e. ~0.7 points per KB of dictionary
Dictionary size growth relative to the average-size dictionary = 123151 / 549 = 224.319x
Efficiency relative to the average-size dictionary drops to 0.698 / 116.053 = 0.6%, i.e. a 99.4% reduction
Improvement over no dictionary = 60.55 - 16.00 = 44.55 points
Improvement efficiency over no dictionary = (60.55 - 16.00) / 123151 × 1024 = 0.37, i.e. ~0.37 points per KB
Improvement over the 256 B dictionary = 49.99 - 16.00 = 33.99 points
Improvement efficiency over the 256 B dictionary = (49.99 - 16.00) / (123151 - 256) × 1024 = 0.283, i.e. ~0.28 points per KB
Set the dictionary size to 10,000x the average sample size: 6,160,000 bytes = 5.87 MB
zstd --verbose --train --train-cover --maxdict=6160000 request/request/* -o req.6160000.dic
CPU usage only about 12.4%; very little I/O.
Start time: 18:57:15
First training message appeared at: 19:19:35 (reading took ≈22 minutes)
Total training time = 03:58:28
Sample-to-dictionary size ratio: ≈17.3x
k=1922
d=8
steps=40
split=100
zstd -D req.6160000.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-6160000
173842 files compressed : 10.64% ( 102 MiB => 10.9 MiB)
zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.6160000.dic -o req.6160000.c.dic
Compressed dictionary size = 15.15% ( 5.87 MiB => 912 KiB, req.6160000.c.dic)
Savings efficiency = (100 - 10.64) / 933457 × 1024 = 0.098, i.e. ~0.1 points per KB of dictionary
Dictionary size growth relative to the average-size dictionary = 933457 / 549 = 1700.286x
Efficiency relative to the average-size dictionary drops to 0.098 / 116.053 = 0.1%, i.e. a 99.9% reduction
Improvement over no dictionary = 60.55 - 10.64 = 49.91 points
Improvement efficiency over no dictionary = (60.55 - 10.64) / 933457 × 1024 = 0.055, i.e. ~0.055 points per KB
Improvement over the 256 B dictionary = 49.99 - 10.64 = 39.35 points
Improvement efficiency over the 256 B dictionary = (49.99 - 10.64) / (933457 - 256) × 1024 = 0.043, i.e. ~0.043 points per KB
Sample set size: 107 KB (110,148 bytes); sample count: 306
Average sample size: 110,148 / 306 = 359.96 bytes
The trained dictionary always comes out at 23,788 bytes; every maxdict above that size yields the same dictionary!
zstd --verbose --train --train-cover --maxdict=110148 req/* -o req.110148.dic
zstd --verbose --train --train-cover --maxdict=110141 req/* -o req.110141.dic
zstd --verbose --train --train-cover --maxdict=108KB req/* -o req.108KB.dic
Dictionary size = 23,788 bytes
Multiple of the average sample size = 23788 / 360 ≈ 66x
Sample-to-dictionary size ratio: 110148 / 23788 = 4.63x
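To confirm that runs with different --maxdict values above 23788 really produce the same dictionary, comparing checksums is enough (a sketch; md5sum is from GNU coreutils):

md5sum req.110148.dic req.110141.dic req.108KB.dic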
zstd --ultra -22 --progress req.108KB.dic -o req.108KB.c.dic
req.108KB.dic : 50.25% ( 23.2 KiB => 11.7 KiB, req.108KB.c.dic)
Compressed dictionary size = 11.6 KB (11,954 bytes)
zstd -D req.108KB.dic --ultra -22 --progress req/* --output-dir-flat req-c-108KB
306 files compressed : 17.13% ( 108 KiB => 18.4 KiB)
Retraining with the dictionary size pinned to 23788 actually compresses even better!
zstd --verbose --train --train-cover --maxdict=23788 req/* -o req.23788.dic
Dictionary size = 23,788 bytes
zstd --ultra -22 --progress req.23788.dic -o req.23788.c.dic
req.23788.dic : 28.52% ( 23.2 KiB => 6.63 KiB, req.23788.c.dic)
Compressed dictionary size = 6.62 KB (6,785 bytes)
zstd -D req.23788.dic --ultra -22 --progress req/* --output-dir-flat req-c-23788
306 files compressed : 14.37% ( 108 KiB => 15.5 KiB)
Test an arbitrary dictionary size larger than the average sample: 888 bytes
zstd --verbose --train --train-cover --maxdict=888 req/* -o req.888.dic
Dictionary size = 888 bytes
zstd --ultra -22 --progress req.888.dic -o req.888.c.dic
req.888.dic : 65.99% ( 888 B => 586 B, req.888.c.dic)
Compressed dictionary size = 586 bytes
zstd -D req.888.dic --ultra -22 --progress req/* --output-dir-flat req-c-888
306 files compressed : 32.51% ( 108 KiB => 35.0 KiB)
Set the dictionary size to the average file size: 110148 / 306 = 359.96 ≈ 360 bytes
zstd --verbose --train --train-cover --maxdict=360 req/* -o req.360.dic
Dictionary size = 360 bytes
zstd --ultra -22 --progress req.360.dic -o req.360.c.dic
req.360.dic : 94.44% ( 360 B => 340 B, req.360.c.dic)
Compressed dictionary size = 340 bytes
zstd -D req.360.dic --ultra -22 --progress req/* --output-dir-flat req-c-360
306 files compressed : 47.99% ( 108 KiB => 51.6 KiB)
Try training with ZSTD's minimum dictionary size, 256 bytes
zstd --verbose --train --train-cover --maxdict=256 req/* -o req.256.dic
Dictionary size = 256 bytes
zstd --ultra -22 --progress req.256.dic -o req.256.c.dic
req.256.dic :102.34% ( 256 B => 262 B, req.256.c.dic)
Compressed dictionary size = 262 bytes (it actually got bigger!)
zstd -D req.256.dic --ultra -22 --progress req/* --output-dir-flat req-c-256
306 files compressed : 58.17% ( 108 KiB => 62.6 KiB)
Compression ratio without a dictionary
zstd --ultra -22 --progress req/* --output-dir-flat req-c
306 files compressed : 74.81% ( 108 KiB => 80.5 KiB)
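For completeness: decompressing any of these outputs requires passing the same dictionary via -D; a sketch using the 23788 B dictionary:

zstd -d -D req.23788.dic req-c-23788/* --output-dir-flat req-d-23788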
Lessons learned about dictionary-training parameters
The dictionaries trained with the following three option forms all have identical MD5s; evidently I still haven't figured out how to use these options (see the sketch after this list):
--train-cover=shrink=2
--train-cover=shrink
--train-cover
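Going by the option syntax quoted further down (--train-cover[=k=#,d=#,steps=#,split=#,shrink[=#]]), the sub-parameters apparently need to be combined into a single option rather than repeated; a sketch with arbitrary values:

zstd --train-cover=k=200,d=8,steps=40,split=75,shrink=2 req/* -o req.cover.dic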
Dictionary training cannot saturate CPU or memory
zstd --verbose -T0 --auto-threads=logical --train -M1024 --train-cover --maxdict=616 request/request/* -o req.616.dic
Even with the multithreading options set, CPU usage is still only about 12.4%!
Raising the memory limit doesn't increase usage either: a flat 16.8 MB with no fluctuation at all.
Help notes on the relevant parameters
An easy-to-understand introduction to k-nearest neighbors (KNN) and K-D trees, with handwritten digit recognition - 简书 (Jianshu)
"Kd-tree" is short for k-dimension tree.
Nearest-neighbor search in a KD-tree
zstd(1) — zstd — Debian unstable — Debian Manpages
Training works if there is some correlation in a family of small data samples. The more specific a dictionary is to the data, the more efficient it is (there is no universal dictionary).
Hence, deploying one dictionary per type of data will provide the greatest benefits.
Dictionary gains are mostly effective in the first few KB. Then, the compression algorithm will gradually use previously decoded content to better compress the rest of the file.
--train-cover[=k=#,d=#,steps=#,split=#,shrink[=#]]
If split is not specified or split <= 0, then the default value of 100 is used.
If shrink flag is not used, then the default value for shrinkDict of 0 is used.
If shrink is not specified, then the default value for shrinkDictMaxRegression of 1 is used.
Having shrink enabled takes a truncated dictionary of minimum size and doubles in size
until the compression ratio of the truncated dictionary is at most shrinkDictMaxRegression% worse than the compression ratio of the largest dictionary.
In other words: with shrink enabled, zstd starts from a minimum-size truncated dictionary and keeps doubling it until its compression ratio is no more than N% (the regression limit) worse than that of the full-size dictionary.
Messages shown when the dictionary size is poorly chosen
! Warning : data size of samples too small for target dictionary size
! Samples should be about 100x larger than target dictionary size
Trying 5 different sets of parameters
WARNING: The maximum dictionary size 112640 is too large compared to the source size 82775!
size(source)/size(dictionary) = 0.734863, but it should be >= 10!
This may lead to a subpar dictionary!
We recommend training on sources at least 10x, and preferably 100x the size of the dictionary!
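Given those warnings, it is worth checking the total sample size against the 10x/100x rule before training; a sketch using GNU coreutils:

du -cb request/request/* | tail -n1   # total bytes; should be >= 10x (ideally 100x) the --maxdict value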
Zstandard CLI help text
*** Zstandard CLI (64-bit) v1.5.4, by Yann Collet ***
Compress or decompress the INPUT file(s); reads from STDIN if INPUT is `-` or not provided.
Usage: zstd [OPTIONS...] [INPUT... | -] [-o OUTPUT]
Options:
-o OUTPUT Write output to a single file, OUTPUT.
-k, --keep Preserve INPUT file(s). [Default]
--rm Remove INPUT file(s) after successful (de)compression.
-# Desired compression level, where `#` is a number between 1 and 19;
lower numbers provide faster compression, higher numbers yield
better compression ratios. [Default: 3]
-d, --decompress Perform decompression.
-D DICT Use DICT as the dictionary for compression or decompression.
-f, --force Disable input and output checks. Allows overwriting existing files,
receiving input from the console, printing output to STDOUT, and
operating on links, block devices, etc. Unrecognized formats will be
passed through as-is.
-h Display short usage and exit.
-H, --help Display full help and exit.
-V, --version Display the program version and exit.
Advanced options:
-c, --stdout Write to STDOUT (even if it is a console) and keep the INPUT file(s).
-v, --verbose Enable verbose output; pass multiple times to increase verbosity.
-q, --quiet Suppress warnings; pass twice to suppress errors.
--trace LOG Log tracing information to LOG.
--[no-]progress Forcibly show/hide the progress counter. NOTE: Any (de)compressed
output to terminal will mix with progress counter text.
-r Operate recursively on directories.
--filelist LIST Read a list of files to operate on from LIST.
--output-dir-flat DIR Store processed files in DIR.
--[no-]asyncio Use asynchronous IO. [Default: Enabled]
--[no-]check Add XXH64 integrity checksums during compression. [Default: Add, Validate]
If `-d` is present, ignore/validate checksums during decompression.
-- Treat remaining arguments after `--` as files.
Advanced compression options:
--ultra Enable levels beyond 19, up to 22; requires more memory.
--fast[=#] Use very fast compression levels. [Default: 1]
--adapt Dynamically adapt compression level to I/O conditions.
--long[=#] Enable long distance matching with window log #. [Default: 27]
--patch-from=REF Use REF as the reference point for Zstandard's diff engine.
-T# Spawn # compression threads. [Default: 1; pass 0 for core count.]
--single-thread Share a single thread for I/O and compression (slightly different than `-T1`).
--auto-threads={physical|logical}
Use physical/logical cores when using `-T0`. [Default: Physical]
-B# Set job size to #. [Default: 0 (automatic)]
--rsyncable Compress using a rsync-friendly method (`-B` sets block size).
--exclude-compressed Only compress files that are not already compressed.
--stream-size=# Specify size of streaming input from STDIN.
--size-hint=# Optimize compression parameters for streaming input of approximately size #.
--target-compressed-block-size=#
Generate compressed blocks of approximately # size.
--no-dictID Don't write `dictID` into the header (dictionary compression only).
--[no-]compress-literals Force (un)compressed literals.
--[no-]row-match-finder Explicitly enable/disable the fast, row-based matchfinder for
the 'greedy', 'lazy', and 'lazy2' strategies.
--format=zstd Compress files to the `.zst` format. [Default]
--format=gzip Compress files to the `.gz` format.
--format=xz Compress files to the `.xz` format.
--format=lzma Compress files to the `.lzma` format.
Advanced decompression options:
-l Print information about Zstandard-compressed files.
--test Test compressed file integrity.
-M# Set the memory usage limit to # megabytes.
--[no-]sparse Enable sparse mode. [Default: Enabled for files, disabled for STDOUT.]
--[no-]pass-through Pass through uncompressed files as-is. [Default: Disabled]
Dictionary builder:
--train Create a dictionary from a training set of files.
--train-cover[=k=#,d=#,steps=#,split=#,shrink[=#]]
Use the cover algorithm (with optional arguments).
--train-fastcover[=k=#,d=#,f=#,steps=#,split=#,accel=#,shrink[=#]]
Use the fast cover algorithm (with optional arguments).
--train-legacy[=s=#] Use the legacy algorithm with selectivity #. [Default: 9]
-o NAME Use NAME as dictionary name. [Default: dictionary]
--maxdict=# Limit dictionary to specified size #. [Default: 112640]
--dictID=# Force dictionary ID to #. [Default: Random]
Benchmark options:
-b# Perform benchmarking with compression level #. [Default: 3]
-e# Test all compression levels up to #; starting level is `-b#`. [Default: 1]
-i# Set the minimum evaluation to time # seconds. [Default: 3]
-B# Cut file into independent chunks of size #. [Default: No chunking]
-S Output one benchmark result per input file. [Default: Consolidated result]
--priority=rt Set process priority to real-time.