tree: 77ed9c6ada9beec168bcc007b512f375c39b20ee [path history] [tgz]

std/bzip2/README.md

Bzip2

Bzip2 is a general purpose compression format, using Huffman codes, Run Length Encoding, the Move To Front Transform and the Burrows Wheeler Transform. There is no official specification, but Joe Tsai has written comprehensive documentation of the wire format.

It is a block based format, which means that when decoding each block, the whole compressed block has to be read before a single byte of the uncompressed block can be written.

Like zlib (but unlike gzip), bzip2 just holds compressed bytes. It does not hold additional metadata like the original file's name or modification time.

Wire Format Worked Example

Let's compress “abraca”. Bzip2 compresses poorly on such a short input file, converting 6 uncompressed bytes to 0x2b = 43 compressed bytes. Nonetheless, the “abraca” uncompressed text matches the example from the 1994 BWT technical report (by M. Burrows and D. J. Wheeler) that introduced their eponymous Burrows Wheeler Transform.

$ hd abraca.txt
00000000  61 62 72 61 63 61                                 |abraca|
00000006
$ hd abraca.txt.bz2
00000000  42 5a 68 39 31 41 59 26  53 59 76 a7 09 95 00 00  |BZh91AY&SYv.....|
00000010  00 81 80 38 00 10 00 20  00 21 9a 68 33 4d 30 91  |...8... .!.h3M0.|
00000020  e2 ee 48 a7 0a 12 0e d4  e1 32 a0                 |..H......2.|
0000002b

Here are the bits of those 43 bytes with various bitslices labeled from (a) to (s). The bzip2 file format works in bits instead of bytes. It reads bits from MSB (Most Significant Bit) to LSB (Least Significant Bit) order and does not pad to byte boundaries (except at the end of the file).

offset  xoffset ASCII   hex     binary          bitslice
000000  0x0000  B       0x42    0b_0100_0010    aaaa_aaaa
000001  0x0001  Z       0x5A    0b_0101_1010    aaaa_aaaa
000002  0x0002  h       0x68    0b_0110_1000    bbbb_bbbb
000003  0x0003  9       0x39    0b_0011_1001    cccc_cccc
000004  0x0004  1       0x31    0b_0011_0001    dddd_dddd
000005  0x0005  A       0x41    0b_0100_0001    dddd_dddd
000006  0x0006  Y       0x59    0b_0101_1001    dddd_dddd
000007  0x0007  &       0x26    0b_0010_0110    dddd_dddd
000008  0x0008  S       0x53    0b_0101_0011    dddd_dddd
000009  0x0009  Y       0x59    0b_0101_1001    dddd_dddd
000010  0x000A  v       0x76    0b_0111_0110    eeee_eeee
000011  0x000B  .       0xA7    0b_1010_0111    eeee_eeee
000012  0x000C  .       0x09    0b_0000_1001    eeee_eeee
000013  0x000D  .       0x95    0b_1001_0101    eeee_eeee
000014  0x000E  .       0x00    0b_0000_0000    fggg_gggg
000015  0x000F  .       0x00    0b_0000_0000    gggg_gggg
000016  0x0010  .       0x00    0b_0000_0000    gggg_gggg
000017  0x0011  .       0x81    0b_1000_0001    ghhh_hhhh    popcount(h) = 2
000018  0x0012  .       0x80    0b_1000_0000    hhhh_hhhh
000019  0x0013  8       0x38    0b_0011_1000    hiii_iiii    popcount(i) = 3
000020  0x0014  .       0x00    0b_0000_0000    iiii_iiii
000021  0x0015  .       0x10    0b_0001_0000    ijjj_jjjj    popcount(j) = 1
000022  0x0016  .       0x00    0b_0000_0000    jjjj_jjjj
000023  0x0017          0x20    0b_0010_0000    jkkk_llll
000024  0x0018  .       0x00    0b_0000_0000    llll_llll
000025  0x0019  !       0x21    0b_0010_0001    lllm_nnnn
000026  0x001A  .       0x9A    0b_1001_1010    nnnn_nnnn
000027  0x001B  h       0x68    0b_0110_1000    nnnn_nnno
000028  0x001C  3       0x33    0b_0011_0011    oooo_oooo
000029  0x001D  M       0x4D    0b_0100_1101    oooo_oooo
000030  0x001E  0       0x30    0b_0011_0000    oopp_pppp
000031  0x001F  .       0x91    0b_1001_0001    pppp_pppp
000032  0x0020  .       0xE2    0b_1110_0010    pppq_qqqq
000033  0x0021  .       0xEE    0b_1110_1110    qqqq_qqqq
000034  0x0022  H       0x48    0b_0100_1000    qqqq_qqqq
000035  0x0023  .       0xA7    0b_1010_0111    qqqq_qqqq
000036  0x0024  .       0x0A    0b_0000_1010    qqqq_qqqq
000037  0x0025  .       0x12    0b_0001_0010    qqqq_qqqq
000038  0x0026  .       0x0E    0b_0000_1110    qqqr_rrrr
000039  0x0027  .       0xD4    0b_1101_0100    rrrr_rrrr
000040  0x0028  .       0xE1    0b_1110_0001    rrrr_rrrr
000041  0x0029  2       0x32    0b_0011_0010    rrrr_rrrr
000042  0x002A  .       0xA0    0b_1010_0000    rrrs_ssss

(a) A 16-bit magic number “BZ” meaning the start of the bzip2 file.

(b) The ‘h’ byte means Huffman compression. The original bzip (as in, without the “2” in “bzip2”) used a ‘0’ byte here to indicate arithmetic codes instead of Huffman codes, but arithmetic/bzip was not widely used because of patent concerns. This package (and worked example) only considers Huffman/bzip2.

(d) A 48-bit magic number 0x314159265359 (a homage to π) means the start of a block.

(e) A 32-bit checksum. Refer to section 2.2.3.1 “BlockHeader” of Tsai's document which notes, “It is the same [CRC-32] checksum used in GZip (RFC 1952, section 8), but is slightly different due to the bit-packing differences noted in section 2.1”.

(f) A “block randomized” bit, deprecated and not used here.

(g) The 24-bit index of the original string in the sorted list of BWT rotations. The BWT technical report calls this number I. In this case, the value is 0x000001.

(h) A 16-bit bitmap of which byte values are present in the uncompressed text. In this case, 0b_0000_0011_0000_0000 means that there are values in the 7th and 8th rows (with 16 bytes per row). Its popcount (population count, the number of set bits) is 2, which means there are 2 further 16-bit bitmaps immediately following.

(i) The 16-bit bitmap for the 7th row in (h). 0b_0111_0000_0000_0000 means that the 0x61, 0x62 and 0x63 bytes (ASCII ‘a’, ‘b’ and ‘c’) are present.

(j) The 16-bit bitmap for the 8th row in (h). 0b_0010_0000_0000_0000 means that the 0x72 byte (ASCII ‘r’) is present.

(k) The 3-bit number of Huffman codes. In this case, 0b_010 = 2, the minimum valid value.

(l) The 15-bit number of sections. Each section is 50 symbols. Each symbol either produces one byte of uncompressed data (roughly speaking) or are one of three special symbols: RUNA, RUNB (both Run Length Encoding related) or EOB (End Of Block). The selected Huffman code can change at section boundaries.

(m) The array of unary encoded Huffman code selectors (one for each section). In this case, an array of one element, [0]. Applying the MTFT (Move To Front Transform) produces the same values: [0]. This part of abraca.txt.bz2 isn‘t very enlightening but there’s a more interesting example in the “MTFT Reprised” section below.

(n) The first Huffman code. More details below.

(o) The second Huffman code. For abraca.txt.bz2, the second Huffman code is never referenced by the selectors in the (m) bitslice but, for the record, the (o) bitslice is identical to the (n) bitslice and therefore produces the same Huffman code.

(p) The bitstream to feed the Huffman codes. More details below.

(q) A 48-bit magic number 0x177245385090 (a homage to sqrt(π)) means no further blocks.

(r) A 32-bit checksum. Refer to section 2.2.4 “StreamFooter” of Tsai's document.

(s) Padding up to a byte boundary.

Huffman Codes

The (h), (i) and (j) bitslices collectively state that there are 4 byte values present in the uncompressed output: ‘a’, ‘b’, ‘c’ and ‘r’. The relevant MTFT uses 1 fewer symbols (m01, m02 and m03) than this count, since an m00 operation (see below) is redundant with the RUNA and RUNB symbols. Adding the 3 special symbols (RUNA, RUNB, EOB) means that each Huffman code covers 6 symbols.

Here's the (n) bitslice again:

offset  xoffset ASCII   hex     binary          bitslice
000025  0x0019  !       0x21    0b_0010_0001    lllm_nnnn
000026  0x001A  .       0x9A    0b_1001_1010    nnnn_nnnn
000027  0x001B  h       0x68    0b_0110_1000    nnnn_nnno

Its bits are “0001100110100110100”. Adding spaces gives “00011 0 0 11 0 10 0 11 0 10 0”, which represents the sequence [3+0, +0, -1+0, +1+0, -1+0, +1+0]. Adding up the deltas gives the code lengths [3, 3, 2, 3, 2, 3] and so the canonical Huffman code (with RUNA and RUNB before the mNN symbols and EOB after) is:

code  length  symbol
100   3       RUNA
101   3       RUNB
00    2       m01
110   3       m02
01    2       m03
111   3       EOB

Here's the (p) bitslice again.

offset  xoffset ASCII   hex     binary          bitslice
000030  0x001E  0       0x30    0b_0011_0000    oopp_pppp
000031  0x001F  .       0x91    0b_1001_0001    pppp_pppp
000032  0x0020  .       0xE2    0b_1110_0010    pppq_qqqq

Its bits are “11000010010001111”. Adding spaces, giving “110 00 01 00 100 01 111”, makes it clearer that applying the Huffman code from the (n) bitslice to the (p) bitslice produces the symbol sequence [m02, m01, m03, m01, RUNA, m03, EOB].

Move To Front Transform

MTFT operates on a byte array (let‘s call it the “order” buffer) that’s initialized with with the bytes-in-use in their natural order: [‘a’, ‘b’, ‘c’, ‘r’]. Each mNN symbol emits the N'th element (using 0-based indexing) and also rearranges the order buffer: moving that element to the start and pushing the N earlier symbols one to the right.

There is no explicit m00 symbol. Instead, a combination of RUNA and RUNB symbols produce a number R meaning to repeat the 0th element R times (and leave the order buffer unchanged). For the abraca.txt.bz2 example, we have one RUNA symbol but no RUNB symbols, which is effectively an m00 repeated once. Here's the production:

order  symbol  emission
abcr   m02     c
cabr   m01     a
acbr   m03     r
racb   m01     a
arcb   RUNA    a
arcb   m03     b
barc   EOB

Tsai calls this RUNA / RUNB mechanism RLE2 (and goes into further detail on their semantics) and calls a later Run Length Encoding mechanism RLE1. Later when decoding, but when encoding, RLE1 occurs before RLE2.

Burrows Wheeler Transform

Reading down the emission column gives “caraab”, which the BWT technical report calls L, the last column of the sorted list of permutations of the uncompressed text. We can compute its histogram (‘a’ occurs 3 times, ‘b’, ‘c’ and ‘r’ all occur 1 time) in a first pass over L, and thus the cumulative counts:

Byte  Cumulative
a     0
b     3
c     4
r     5

We then compute T in a second pass over L, by assigning an incrementing value (relative to each cumulative count) to all the ‘a’ elements in L (assigning 0, 1 and 2), all the ‘b’ elements (assigning 3), all the ‘c’ elements (assinging 4) and all the ‘r’ elements (assigning 5).

Index  L  T
0      c  4
1      a  0
2      r  5
3      a  1
4      a  2
5      b  3

Starting with the I value given in the (g) bitslice, which was 1, calculate I, T[I], T[T[I]], T[T[T[I]]], etc to give [1, 0, 4, 2, 5, 3]. Indexing L with these values gives “acarba” and reversing that gives “abraca”, the uncompressed text.

Alternative BWT Inversion

It's not in the BWT technical report, but instead of walking I, T[I], T[T[I]], etc, an alternative but equivalent algorithm works on U, the inverse of the T vector. For example T[4] == 2 and so U[2] == 4.

Index  L  T  F  U
0      c  4  a  1
1      a  0  a  3
2      r  5  a  4
3      a  1  b  5
4      a  2  c  0
5      b  3  r  2

Calculate U[I], U[U[I]], U[U[U[I]]], etc to give [3, 5, 2, 4, 0, 1]. Indexing L with these values gives “abraca”, the uncompressed text (without being reversed).

Final Run Length Encoding

It‘s not used in abraca.txt.bz2 but after BWT inversion, there’s a final Run Length Encoding step (which Tsai calls RLE1), separate from the RUNA and RUNB ops mentioned above. It's triggered by four consecutive identical bytes and when decoding, converts strings like “AAAA\x03BBBB\x00CCCD” to “AAAAAAABBBBCCCD”.

Another Example

Here's a larger file, 942 bytes uncompressed and 568 bytes bzip2-compressed.

$ cat romeo.txt     | head
Romeo and Juliet
Excerpt from Act 2, Scene 2

JULIET
O Romeo, Romeo! wherefore art thou Romeo?
Deny thy father and refuse thy name;
Or, if thou wilt not, be but sworn my love,
And I'll no longer be a Capulet.

ROMEO
$ hd  romeo.txt.bz2 | head
00000000  42 5a 68 39 31 41 59 26  53 59 f8 7d 63 30 00 00  |BZh91AY&SY.}c0..|
00000010  8c 5f 80 00 10 60 85 10  18 be 77 9e 8a 3f ef df  |._...`....w..?..|
00000020  f0 40 02 30 e3 bb 00 22  9e a0 68 01 a1 a0 00 00  |.@.0..."..h.....|
00000030  00 02 26 13 23 4d 21 20  6d 41 a0 d0 d0 d0 f2 80  |..&.#M! mA......|
00000040  2a 9e f5 4f 50 f5 46 4c  23 13 09 89 82 60 4c 09  |*..OP.FL#....`L.|
00000050  84 89 1a 9a 05 31 92 27  92 6c 88 68 7a 23 26 83  |.....1.'.l.hz#&.|
00000060  4a b3 4b 3a 06 9b ec c7  19 bc 9a 88 60 6a 6b 42  |J.K:........`jkB|
00000070  5a c6 8e ad c7 c3 8f a8  f0 91 d1 db e3 1b 25 b5  |Z.............%.|
00000080  4e 06 cd f3 08 a9 af e8  82 2d 5b 54 fc 35 a9 0e  |N........-[T.5..|
00000090  13 01 17 7b 4c 98 47 24  25 d9 bd aa cd 31 da ee  |...{L.G$%....1..|

Where abraca.txt.bz2 had 3 + 1 = 4 possible uncompressed bytes (‘a’, ‘b’, ‘c’ and ‘r’), romeo.txt.bz2 has 1 + 5 + 4 + 12 + 7 + 14 + 10 = 53. This is represented in the wire format by the (h), (i) and (j) bitslices for abraca.txt.bz2 and the 8 equivalent bitslices in romeo.txt.bz2 (where we have repeatedly alternated the (i) and (j) bitslice labels).

offset  xoffset ASCII   hex     binary          bitslice
000017  0x0011  _       0x5F    0b_0101_1111    ghhh_hhhh    popcount(h) =  7
000018  0x0012  .       0x80    0b_1000_0000    hhhh_hhhh
000019  0x0013  .       0x00    0b_0000_0000    hiii_iiii    popcount(i) =  1
000020  0x0014  .       0x10    0b_0001_0000    iiii_iiii
000021  0x0015  `       0x60    0b_0110_0000    ijjj_jjjj    popcount(j) =  5
000022  0x0016  .       0x85    0b_1000_0101    jjjj_jjjj
000023  0x0017  .       0x10    0b_0001_0000    jiii_iiii    popcount(i) =  4
000024  0x0018  .       0x18    0b_0001_1000    iiii_iiii
000025  0x0019  .       0xBE    0b_1011_1110    ijjj_jjjj    popcount(j) = 12
000026  0x001A  w       0x77    0b_0111_0111    jjjj_jjjj
000027  0x001B  .       0x9E    0b_1001_1110    jiii_iiii    popcount(i) =  7
000028  0x001C  .       0x8A    0b_1000_1010    iiii_iiii
000029  0x001D  ?       0x3F    0b_0011_1111    ijjj_jjjj    popcount(j) = 14
000030  0x001E  .       0xEF    0b_1110_1111    jjjj_jjjj
000031  0x001F  .       0xDF    0b_1101_1111    jiii_iiii    popcount(i) = 10
000032  0x0020  .       0xF0    0b_1111_0000    iiii_iiii
000033  0x0021  @       0x40    0b_0100_0000    ikkk_llll

Similarly, where abraca.txt.bz2 had 1 section (per the (l) bitslice), each selecting from 2 possible Huffman codes (per the (k) bitslice), romeo.txt.bz2 has 17 sections, each selecting from 4 possible Huffman codes.

offset  xoffset ASCII   hex     binary          bitslice
000033  0x0021  @       0x40    0b_0100_0000    ikkk_llll
000034  0x0022  .       0x02    0b_0000_0010    llll_llll
000035  0x0023  0       0x30    0b_0011_0000    lllm_mmmm

MTFT Reprised

Bzip2 uses the Move To Front Transform in two places:

for the selectors array encoded in the (m) bitslice and
for the conversion from the symbols (the Huffman code's output) to generating the L string.

Use 2 is discussed above but the Use 1 values for abraca.txt.bz2 is not very educational, since the (m) bitslice was literally only one bit long. For romeo.txt.bz2, the equivalent (m) bitslice is 29 bits long and makes for a more interesting example.

offset  xoffset ASCII   hex     binary          bitslice
000035  0x0023  0       0x30    0b_0011_0000    lllm_mmmm
000036  0x0024  .       0xE3    0b_1110_0011    mmmm_mmmm
000037  0x0025  .       0xBB    0b_1011_1011    mmmm_mmmm
000038  0x0026  .       0x00    0b_0000_0000    mmmm_mmmm

The bits are “10000111000111011101100000000” which is the unary encoding of the 17-element selector array (before applying the MTFT): [1, 0, 0, 0, 3, 0, 0, 3, 3, 2, 0, 0, 0, 0, 0, 0, 0].

If we label the 4 Huffman codes to select from as w, x, y and z then the MTFT converts those 17 elements to “xxxxzzzywzzzzzzzz”, per the table below. As before, each N value (pre-MTFT) emits the N'th element (using 0-based indexing) and also rearranges the order buffer: moving that element to the start and pushing the N earlier symbols one to the right.

order  pre  post
wxyz   1    x
xwyz   0    x
xwyz   0    x
xwyz   0    x
xwyz   3    z
zxwy   0    z
zxwy   0    z
zxwy   3    y
yzxw   3    w
wyzx   2    z
zwyx   0    z
zwyx   0    z
zwyx   0    z
zwyx   0    z
zwyx   0    z
zwyx   0    z
zwyx   0    z

Even More Examples

Appendix A of Tsai's document gives even more wire format worked examples, presented in great detail.