bzip2 compresses files using the Burrows-Wheeler block-sorting text compression algorithm and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical compressors. bzip2 is built on top of libbzip2, a flexible library for handling compressed data in the bzip2 format.

To use any part of the library, you need to #include <bzlib.h> in your sources.

 

MEMORY MANAGEMENT

bzip2 compresses large files in blocks. The block size affects both the compression ratio achieved, and the amount of memory needed for compression and decompression. The flags -1 through -9 specify the block size to be 100,000 bytes through 900,000 bytes (the default) respectively. At decompression time, the block size used for compression is read from the header of the compressed file, and bunzip2 then allocates itself just enough memory to decompress the file. Since block sizes are stored in compressed files, it follows that the flags -1 to -9 are irrelevant to and so ignored during decompression.

Compression and decompression requirements, in bytes, can be estimated as:

Compression:   400k + ( 8 x block size )
Decompression: 100k + ( 4 x block size ), or 100k + ( 2.5 x block size ) when the -s flag is given

In general, try to use the largest block size memory constraints allow, since that maximises the compression achieved. Compression and decompression speed are virtually unaffected by block size.
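The two estimates above are easy to evaluate by hand, but a small sketch makes the numbers concrete. The helper names below are hypothetical (they are not part of libbzip2), and "k" is assumed to mean 1000 bytes; the functions only evaluate the rule-of-thumb formulas quoted above.

```c
/* Hypothetical helpers, NOT part of libbzip2. They only evaluate the
   rule-of-thumb formulas from the text. Assumption: "k" means 1000 bytes. */

/* Block size in bytes for a flag -1 .. -9; returns 0 if out of range. */
unsigned long bz_block_bytes(int level)
{
    if (level < 1 || level > 9) return 0;
    return 100000UL * (unsigned long)level;
}

/* Compression: 400k + ( 8 x block size ). */
unsigned long bz_compress_estimate(int level)
{
    unsigned long b = bz_block_bytes(level);
    return b ? 400000UL + 8UL * b : 0;
}

/* Decompression: 100k + ( 4 x block size ),
   or 100k + ( 2.5 x block size ) when small_mode (-s) is non-zero. */
unsigned long bz_decompress_estimate(int level, int small_mode)
{
    unsigned long b = bz_block_bytes(level);
    if (!b) return 0;
    return 100000UL + (small_mode ? (5UL * b) / 2UL : 4UL * b);
}
```

For the default -9, this gives roughly 7.6 million bytes for compression and 3.7 million bytes for decompression (about 2.35 million with -s).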

  • Low-level interface

 BZ2_bzCompressInit

typedef struct {
   char *next_in;
   unsigned int avail_in;
   unsigned int total_in_lo32;       // low 32 bits of the input count
   unsigned int total_in_hi32;       // high 32 bits; gives a 64-bit count, see item 6 below

   char *next_out;
   unsigned int avail_out;
   unsigned int total_out_lo32;
   unsigned int total_out_hi32;

   void *state;                      // see item 2 below

   void *(*bzalloc)(void *,int,int); // see item 3 below; a function pointer whose target returns a pointer
   void (*bzfree)(void *,void *);    // function pointer to the matching free routine (see Pa1); both can point to your own functions instead of the system malloc()/free()
   void *opaque;                     // passed as the first argument to the two functions above
} bz_stream;

int BZ2_bzCompressInit ( bz_stream *strm,
                         int blockSize100k,
                         int verbosity,
                         int workFactor );

 

  1. Prepares for compression. The bz_stream structure holds all data pertaining to the compression activity. A bz_stream structure should be allocated and initialised prior to the call. The fields of bz_stream comprise the entirety of the user-visible data.
  2. state is a pointer to the private data structures required for compression.
  3. Custom memory allocators are supported, via fields bzalloc, bzfree, and opaque. The value opaque is passed as the first argument to all calls to bzalloc and bzfree, but is otherwise ignored by the library.
  4. The call bzalloc ( opaque, n, m ) is expected to return a pointer p to n * m bytes of memory, and bzfree ( opaque, p ) should free that memory.
  5. If you don't want to use a custom memory allocator, set bzalloc, bzfree and opaque to NULL, and the library will then use the standard malloc / free routines.
  6. Before calling BZ2_bzCompressInit, fields bzalloc, bzfree and opaque should be filled appropriately, as just described. Upon return, the internal state will have been allocated and initialised, and total_in_lo32, total_in_hi32, total_out_lo32 and total_out_hi32 will have been set to zero. These four fields are used by the library to inform the caller of the total amount of data passed into and out of the library, respectively. You should not try to change them. As of version 1.0, 64-bit counts are maintained, even on 32-bit platforms, using the _hi32 fields to store the upper 32 bits of the count. So, for example, the total amount of data in is (total_in_hi32 << 32) + total_in_lo32.
  7. Parameter blockSize100k specifies the block size to be used for compression. It should be a value between 1 and 9 inclusive, and the actual block size used is 100000 x this figure. 9 gives the best compression but takes most memory.
  8. Parameter verbosity should be set to a number between 0 and 4 inclusive. 0 is silent, and greater numbers give increasingly verbose monitoring/debugging output. If the library has been compiled with -DBZ_NO_STDIO, no such output will appear for any verbosity setting.
  9. Parameter workFactor controls how the compression phase behaves when presented with worst case, highly repetitive, input data. If compression runs into difficulties caused by repetitive data, the library switches from the standard sorting algorithm to a fallback algorithm. The fallback is slower than the standard algorithm by perhaps a factor of three, but always behaves reasonably, no matter how bad the input. Lower values of workFactor reduce the amount of effort the standard algorithm will expend before resorting to the fallback. You should set this parameter carefully: too low, and many inputs will be handled by the fallback algorithm and so compress rather slowly; too high, and your average-to-worst case compression times can become very large. The default value of 30 gives reasonable behaviour over a wide range of circumstances. Allowable values range from 0 to 250 inclusive. 0 is a special case, equivalent to using the default value of 30. Note that the compressed output generated is the same regardless of whether or not the fallback algorithm is used. Be aware also that this parameter may disappear entirely in future versions of the library. In principle it should be possible to devise a good way to automatically choose which algorithm to use. Such a mechanism would render the parameter obsolete.
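Items 3 to 6 can be sketched without linking against the library. The block below is only an illustration: alloc_stats, counting_alloc, counting_free and bz_total are hypothetical names, not libbzip2 functions. The two allocator functions follow the bzalloc / bzfree signatures from the struct above. bz_total joins the two 32-bit halves of a count; note that in C the high half has to be widened to 64 bits before shifting, because shifting a 32-bit unsigned int by 32 is undefined.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bookkeeping record, handed to the library through `opaque`. */
typedef struct {
    unsigned long calls;  /* number of successful allocations */
    unsigned long live;   /* allocations not yet freed */
} alloc_stats;

/* Same shape as bzalloc: return n * m bytes, or NULL on failure. */
void *counting_alloc(void *opaque, int n, int m)
{
    alloc_stats *st = (alloc_stats *)opaque;
    void *p;
    if (n <= 0 || m <= 0) return NULL;
    if ((size_t)n > SIZE_MAX / (size_t)m) return NULL;  /* n * m would overflow */
    p = malloc((size_t)n * (size_t)m);
    if (p != NULL && st != NULL) { st->calls++; st->live++; }
    return p;
}

/* Same shape as bzfree: release memory obtained from counting_alloc. */
void counting_free(void *opaque, void *p)
{
    alloc_stats *st = (alloc_stats *)opaque;
    if (p == NULL) return;
    if (st != NULL) st->live--;
    free(p);
}

/* Join the _hi32 and _lo32 halves into one 64-bit count.
   The cast happens before the shift so the shift is done in 64 bits. */
uint64_t bz_total(unsigned int hi32, unsigned int lo32)
{
    return ((uint64_t)hi32 << 32) | (uint64_t)lo32;
}

/* Wiring it up (needs <bzlib.h>, so it is not compiled here):
     alloc_stats st = { 0, 0 };
     bz_stream strm;
     strm.bzalloc = counting_alloc;
     strm.bzfree  = counting_free;
     strm.opaque  = &st;
     then call BZ2_bzCompressInit(&strm, 9, 0, 0) and check for BZ_OK. */
```

Passing the statistics record through opaque avoids global state, which is presumably the reason the library forwards that pointer untouched.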
In bzlib.c:
int BZ_API(BZ2_bzCompressInit)
                    ( bz_stream* strm,
                      int blockSize100k,
                      int verbosity,
                      int workFactor )
{
   Int32 n;
   EState* s;   // struct EState; to be discussed later

   if (!bz_config_ok()) return BZ_CONFIG_ERROR;

   if (strm == NULL ||
       blockSize100k < 1 || blockSize100k > 9 ||   // item 7 above: bzip2 compresses files in blocks; the block size is set by the -1 .. -9 flag
       workFactor < 0 || workFactor > 250)
     return BZ_PARAM_ERROR;

   if (workFactor == 0) workFactor = 30;
   if (strm->bzalloc == NULL) strm->bzalloc = default_bzalloc;   // items 3 and 5 above: if NULL, fall back to the system malloc()
   if (strm->bzfree == NULL) strm->bzfree = default_bzfree;

   s = BZALLOC( sizeof(EState) );   // #define BZALLOC(nnn) (strm->bzalloc)(strm->opaque,(nnn),1)
   if (s == NULL) return BZ_MEM_ERROR;
   s->strm = strm;

   s->arr1 = NULL;
   s->arr2 = NULL;
   s->ftab = NULL;

   n = 100000 * blockSize100k;
   s->arr1 = BZALLOC( n * sizeof(UInt32) );
   s->arr2 = BZALLOC( (n+BZ_N_OVERSHOOT) * sizeof(UInt32) );
   s->ftab = BZALLOC( 65537 * sizeof(UInt32) );

   if (s->arr1 == NULL || s->arr2 == NULL || s->ftab == NULL) {
      if (s->arr1 != NULL) BZFREE(s->arr1);   // #define BZFREE(ppp)  (strm->bzfree)(strm->opaque,(ppp))
      if (s->arr2 != NULL) BZFREE(s->arr2);
      if (s->ftab != NULL) BZFREE(s->ftab);
      if (s != NULL) BZFREE(s);
      return BZ_MEM_ERROR;
   }

   s->blockNo = 0;
   s->state = BZ_S_INPUT;
   s->mode = BZ_M_RUNNING;
   s->combinedCRC = 0;
   s->blockSize100k = blockSize100k;
   s->nblockMAX = 100000 * blockSize100k - 19;
   s->verbosity = verbosity;
   s->workFactor = workFactor;

   s->block = (UChar*)s->arr2;
   s->mtfv = (UInt16*)s->arr1;
   s->zbits = NULL;
   s->ptr = (UInt32*)s->arr1;

   strm->state = s;               // item 2 above
   strm->total_in_lo32 = 0;       // item 6 above
   strm->total_in_hi32 = 0;
   strm->total_out_lo32 = 0;
   strm->total_out_hi32 = 0;
   init_RL ( s );
   prepare_new_block ( s );
   return BZ_OK;
}
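The three BZALLOC calls for arr1, arr2 and ftab account for most of the compression memory, so it may help to add them up. The sketch below does only that arithmetic. bz_init_work_bytes is a hypothetical name, UInt32 is assumed to be 4 bytes, and the value 34 for BZ_N_OVERSHOOT is an assumption taken from bzlib_private.h in bzip2 1.0.x; check it against your copy of the source. The EState structure itself is not counted.

```c
/* Assumption: BZ_N_OVERSHOOT is 34 in bzlib_private.h (bzip2 1.0.x). */
#define SKETCH_N_OVERSHOOT 34UL

/* Assumption: sizeof(UInt32) is 4. */
#define SKETCH_UINT32_BYTES 4UL

/* Hypothetical helper: bytes requested for arr1, arr2 and ftab by the
   three BZALLOC calls in BZ2_bzCompressInit. Returns 0 if blockSize100k
   is outside 1..9, mirroring the BZ_PARAM_ERROR check. */
unsigned long bz_init_work_bytes(int blockSize100k)
{
    unsigned long n;
    if (blockSize100k < 1 || blockSize100k > 9) return 0;
    n = 100000UL * (unsigned long)blockSize100k;
    return n * SKETCH_UINT32_BYTES                          /* arr1 */
         + (n + SKETCH_N_OVERSHOOT) * SKETCH_UINT32_BYTES   /* arr2 */
         + 65537UL * SKETCH_UINT32_BYTES;                   /* ftab */
}
```

Under these assumptions, blockSize100k = 9 comes to 7,462,284 bytes, which is in line with the 400k + ( 8 x block size ) estimate from the memory management section: two arrays of 4 bytes per block byte give the factor of 8.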

 

BZ2_bzCompress 

to be continued...