| # Bzip2 |
| |
| Bzip2 is a general purpose compression format, using Huffman codes, Run Length |
| Encoding, the Move To Front Transform and the Burrows Wheeler Transform. There |
| is no official specification, but Joe Tsai has written [comprehensive |
| documentation](https://github.com/dsnet/compress/raw/master/doc/bzip2-format.pdf) |
| of the wire format. |
| |
| It is a block based format, which means that when decoding each block, the |
| whole compressed block has to be read before a single byte of the uncompressed |
| block can be written. |
| |
| Like zlib (but unlike gzip), bzip2 just holds compressed bytes. It does not |
| hold additional metadata like the original file's name or modification time. |
| |
| |
| # Wire Format Worked Example |
| |
| Let's compress "abraca". Bzip2 compresses poorly on such a short input file, |
| converting 6 uncompressed bytes to 0x2b = 43 compressed bytes. Nonetheless, the |
| "abraca" uncompressed text matches the example from the 1994 [BWT technical |
| report](http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.pdf) (by M. |
| Burrows and D. J. Wheeler) that introduced their eponymous Burrows Wheeler |
| Transform. |
| |
| $ hd abraca.txt |
| 00000000 61 62 72 61 63 61 |abraca| |
| 00000006 |
| $ hd abraca.txt.bz2 |
| 00000000 42 5a 68 39 31 41 59 26 53 59 76 a7 09 95 00 00 |BZh91AY&SYv.....| |
| 00000010 00 81 80 38 00 10 00 20 00 21 9a 68 33 4d 30 91 |...8... .!.h3M0.| |
| 00000020 e2 ee 48 a7 0a 12 0e d4 e1 32 a0 |..H......2.| |
| 0000002b |
| |
| Here are the bits of those 43 bytes with various bitslices labeled from (a) to |
| (s). The bzip2 file format works in bits instead of bytes. It reads bits from |
| MSB (Most Significant Bit) to LSB (Least Significant Bit) order and does not |
| pad to byte boundaries (except at the end of the file). |
| |
| offset xoffset ASCII hex binary bitslice |
| 000000 0x0000 B 0x42 0b_0100_0010 aaaa_aaaa |
| 000001 0x0001 Z 0x5A 0b_0101_1010 aaaa_aaaa |
| 000002 0x0002 h 0x68 0b_0110_1000 bbbb_bbbb |
| 000003 0x0003 9 0x39 0b_0011_1001 cccc_cccc |
| 000004 0x0004 1 0x31 0b_0011_0001 dddd_dddd |
| 000005 0x0005 A 0x41 0b_0100_0001 dddd_dddd |
| 000006 0x0006 Y 0x59 0b_0101_1001 dddd_dddd |
| 000007 0x0007 & 0x26 0b_0010_0110 dddd_dddd |
| 000008 0x0008 S 0x53 0b_0101_0011 dddd_dddd |
| 000009 0x0009 Y 0x59 0b_0101_1001 dddd_dddd |
| 000010 0x000A v 0x76 0b_0111_0110 eeee_eeee |
| 000011 0x000B . 0xA7 0b_1010_0111 eeee_eeee |
| 000012 0x000C . 0x09 0b_0000_1001 eeee_eeee |
| 000013 0x000D . 0x95 0b_1001_0101 eeee_eeee |
| 000014 0x000E . 0x00 0b_0000_0000 fggg_gggg |
| 000015 0x000F . 0x00 0b_0000_0000 gggg_gggg |
| 000016 0x0010 . 0x00 0b_0000_0000 gggg_gggg |
| 000017 0x0011 . 0x81 0b_1000_0001 ghhh_hhhh popcount(h) = 2 |
| 000018 0x0012 . 0x80 0b_1000_0000 hhhh_hhhh |
| 000019 0x0013 8 0x38 0b_0011_1000 hiii_iiii popcount(i) = 3 |
| 000020 0x0014 . 0x00 0b_0000_0000 iiii_iiii |
| 000021 0x0015 . 0x10 0b_0001_0000 ijjj_jjjj popcount(j) = 1 |
| 000022 0x0016 . 0x00 0b_0000_0000 jjjj_jjjj |
| 000023 0x0017 0x20 0b_0010_0000 jkkk_llll |
| 000024 0x0018 . 0x00 0b_0000_0000 llll_llll |
| 000025 0x0019 ! 0x21 0b_0010_0001 lllm_nnnn |
| 000026 0x001A . 0x9A 0b_1001_1010 nnnn_nnnn |
| 000027 0x001B h 0x68 0b_0110_1000 nnnn_nnno |
| 000028 0x001C 3 0x33 0b_0011_0011 oooo_oooo |
| 000029 0x001D M 0x4D 0b_0100_1101 oooo_oooo |
| 000030 0x001E 0 0x30 0b_0011_0000 oopp_pppp |
| 000031 0x001F . 0x91 0b_1001_0001 pppp_pppp |
| 000032 0x0020 . 0xE2 0b_1110_0010 pppq_qqqq |
| 000033 0x0021 . 0xEE 0b_1110_1110 qqqq_qqqq |
| 000034 0x0022 H 0x48 0b_0100_1000 qqqq_qqqq |
| 000035 0x0023 . 0xA7 0b_1010_0111 qqqq_qqqq |
| 000036 0x0024 . 0x0A 0b_0000_1010 qqqq_qqqq |
| 000037 0x0025 . 0x12 0b_0001_0010 qqqq_qqqq |
| 000038 0x0026 . 0x0E 0b_0000_1110 qqqr_rrrr |
| 000039 0x0027 . 0xD4 0b_1101_0100 rrrr_rrrr |
| 000040 0x0028 . 0xE1 0b_1110_0001 rrrr_rrrr |
| 000041 0x0029 2 0x32 0b_0011_0010 rrrr_rrrr |
| 000042 0x002A . 0xA0 0b_1010_0000 rrrs_ssss |
| |
| (a) A 16-bit magic number "BZ" meaning the start of the bzip2 file. |
| |
| (b) The 'h' byte means Huffman compression. The original bzip (as in, without |
| the "2" in "bzip2") used a '0' byte here to indicate arithmetic codes instead |
| of Huffman codes, but arithmetic/bzip was not widely used because of patent |
| concerns. This package (and worked example) only considers Huffman/bzip2. |
| |
| (c) The '9' byte means the uncompressed block size ranges up to 900000 |
| (inclusive). |
| |
| (d) A 48-bit magic number 0x314159265359 (a homage to π) means the start of a |
| block. |
| |
| (e) A 32-bit checksum. Refer to section 2.2.3.1 "BlockHeader" of [Tsai's |
| document](https://github.com/dsnet/compress/raw/master/doc/bzip2-format.pdf) |
| which notes, "It is the same [CRC-32] checksum used in GZip (RFC 1952, section |
| 8), but is slightly different due to the bit-packing differences noted in |
| section 2.1". |
| |
| (f) A "block randomized" bit, deprecated and not used here. |
| |
| (g) The 24-bit index of the original string in the sorted list of BWT |
| rotations. The [BWT technical |
| report](http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.pdf) calls this |
| number *I*. In this case, the value is 0x000001. |
| |
| (h) A 16-bit bitmap of which byte values are present in the uncompressed text. |
| In this case, `0b_0000_0011_0000_0000` means that there are values in the 7th |
| and 8th rows (with 16 bytes per row). Its popcount (population count, the |
| number of set bits) is 2, which means there are 2 further 16-bit bitmaps |
| immediately following. |
| |
| (i) The 16-bit bitmap for the 7th row in (h). `0b_0111_0000_0000_0000` means |
| that the `0x61`, `0x62` and `0x63` bytes (ASCII 'a', 'b' and 'c') are present. |
| |
| (j) The 16-bit bitmap for the 8th row in (h). `0b_0010_0000_0000_0000` means |
| that the `0x72` byte (ASCII 'r') is present. |
| |
| (k) The 3-bit number of Huffman codes. In this case, `0b_010 = 2`, the minimum |
| valid value. |
| |
| (l) The 15-bit number of sections. Each section is 50 symbols. Each symbol |
| either produces one byte of uncompressed data (roughly speaking) or are one of |
| three special symbols: RUNA, RUNB (both Run Length Encoding related) or EOB |
| (End Of Block). The selected Huffman code can change at section boundaries. |
| |
| (m) The array of unary encoded Huffman code selectors (one for each section). |
| In this case, an array of one element, [0]. Applying the MTFT (Move To Front |
| Transform) produces the same values: [0]. This part of `abraca.txt.bz2` isn't |
| very enlightening but there's a more interesting example in the "MTFT Reprised" |
| section below. |
| |
| (n) The first Huffman code. More details below. |
| |
| (o) The second Huffman code. For `abraca.txt.bz2`, the second Huffman code is |
| never referenced by the selectors in the (m) bitslice but, for the record, the |
| (o) bitslice is identical to the (n) bitslice and therefore produces the same |
| Huffman code. |
| |
| (p) The bitstream to feed the Huffman codes. More details below. |
| |
| (q) A 48-bit magic number 0x177245385090 (a homage to sqrt(π)) means no further |
| blocks. |
| |
| (r) A 32-bit checksum. Refer to section 2.2.4 "StreamFooter" of [Tsai's |
| document](https://github.com/dsnet/compress/raw/master/doc/bzip2-format.pdf). |
| |
| (s) Padding up to a byte boundary. |
| |
| |
| ## Huffman Codes |
| |
| The (h), (i) and (j) bitslices collectively state that there are 4 byte values |
| present in the uncompressed output: 'a', 'b', 'c' and 'r'. The relevant MTFT |
| uses 1 fewer symbols (m01, m02 and m03) than this count, since an m00 operation |
| (see below) is redundant with the RUNA and RUNB symbols. Adding the 3 special |
| symbols (RUNA, RUNB, EOB) means that each Huffman code covers 6 symbols. |
| |
| Here's the (n) bitslice again: |
| |
| offset xoffset ASCII hex binary bitslice |
| 000025 0x0019 ! 0x21 0b_0010_0001 lllm_nnnn |
| 000026 0x001A . 0x9A 0b_1001_1010 nnnn_nnnn |
| 000027 0x001B h 0x68 0b_0110_1000 nnnn_nnno |
| |
| Its bits are "0001100110100110100". Adding spaces gives "00011 0 0 11 0 10 0 11 |
| 0 10 0", which represents the sequence [3+0, +0, -1+0, +1+0, -1+0, +1+0]. |
| Adding up the deltas gives the code lengths [3, 3, 2, 3, 2, 3] and so the |
| [canonical Huffman code](https://en.wikipedia.org/wiki/Canonical_Huffman_code) |
| (with RUNA and RUNB before the mNN symbols and EOB after) is: |
| |
| code length symbol |
| 100 3 RUNA |
| 101 3 RUNB |
| 00 2 m01 |
| 110 3 m02 |
| 01 2 m03 |
| 111 3 EOB |
| |
| Here's the (p) bitslice again. |
| |
| offset xoffset ASCII hex binary bitslice |
| 000030 0x001E 0 0x30 0b_0011_0000 oopp_pppp |
| 000031 0x001F . 0x91 0b_1001_0001 pppp_pppp |
| 000032 0x0020 . 0xE2 0b_1110_0010 pppq_qqqq |
| |
| Its bits are "11000010010001111". Adding spaces, giving "110 00 01 00 100 01 |
| 111", makes it clearer that applying the Huffman code from the (n) bitslice to |
| the (p) bitslice produces the symbol sequence [m02, m01, m03, m01, RUNA, m03, |
| EOB]. |
| |
| |
| ## Move To Front Transform |
| |
| MTFT operates on a byte array (let's call it the "order" buffer) that's |
| initialized with with the bytes-in-use in their natural order: ['a', 'b', 'c', |
| 'r']. Each mNN symbol emits the N'th element (using 0-based indexing) and also |
| rearranges the order buffer: moving that element to the start and pushing the N |
| earlier symbols one to the right. |
| |
| There is no explicit m00 symbol. Instead, a combination of RUNA and RUNB |
| symbols produce a number R meaning to repeat the 0th element R times (and leave |
| the order buffer unchanged). For the `abraca.txt.bz2` example, we have one RUNA |
| symbol but no RUNB symbols, which is effectively an m00 repeated once. Here's |
| the production: |
| |
| order symbol emission |
| abcr m02 c |
| cabr m01 a |
| acbr m03 r |
| racb m01 a |
| arcb RUNA a |
| arcb m03 b |
| barc EOB |
| |
| Tsai calls this RUNA / RUNB mechanism RLE2 (and goes into further detail on |
| their semantics) and calls a later Run Length Encoding mechanism RLE1. Later |
| when decoding, but when encoding, RLE1 occurs before RLE2. |
| |
| |
| ## Burrows Wheeler Transform |
| |
| Reading down the emission column gives "caraab", which the [BWT technical |
| report](http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.pdf) calls *L*, |
| the last column of the sorted list of permutations of the uncompressed text. We |
| can compute its histogram ('a' occurs 3 times, 'b', 'c' and 'r' all occur 1 |
| time) in a first pass over *L*, and thus the cumulative counts: |
| |
| Byte Cumulative |
| a 0 |
| b 3 |
| c 4 |
| r 5 |
| |
| We then compute *T* in a second pass over *L*, by assigning an incrementing |
| value (relative to each cumulative count) to all the 'a' elements in *L* |
| (assigning 0, 1 and 2), all the 'b' elements (assigning 3), all the 'c' |
| elements (assinging 4) and all the 'r' elements (assigning 5). |
| |
| Index L T |
| 0 c 4 |
| 1 a 0 |
| 2 r 5 |
| 3 a 1 |
| 4 a 2 |
| 5 b 3 |
| |
| Starting with the *I* value given in the (g) bitslice, which was 1, calculate |
| *I*, *T[I]*, *T[T[I]]*, *T[T[T[I]]]*, etc to give [1, 0, 4, 2, 5, 3]. Indexing |
| *L* with these values gives "acarba" and reversing that gives "abraca", the |
| uncompressed text. |
| |
| |
| ## Alternative BWT Inversion |
| |
| It's not in the BWT technical report, but instead of walking *I*, *T[I]*, |
| *T[T[I]]*, etc, an alternative but equivalent algorithm works on *U*, the |
| inverse of the *T* vector. For example *T[4] == 2* and so *U[2] == 4*. |
| |
| Index L T F U |
| 0 c 4 a 1 |
| 1 a 0 a 3 |
| 2 r 5 a 4 |
| 3 a 1 b 5 |
| 4 a 2 c 0 |
| 5 b 3 r 2 |
| |
| Calculate *U[I]*, *U[U[I]]*, *U[U[U[I]]]*, etc to give [3, 5, 2, 4, 0, 1]. |
| Indexing *L* with these values gives "abraca", the uncompressed text (without |
| being reversed). |
| |
| |
| ## Final Run Length Encoding |
| |
| It's not used in `abraca.txt.bz2` but after BWT inversion, there's a final Run |
| Length Encoding step (which Tsai calls RLE1), separate from the RUNA and RUNB |
| ops mentioned above. It's triggered by four consecutive identical bytes and |
| when decoding, converts strings like "AAAA\x03BBBB\x00CCCD" to |
| "AAAAAAABBBBCCCD". |
| |
| |
| # Another Example |
| |
| Here's a larger file, 942 bytes uncompressed and 568 bytes bzip2-compressed. |
| |
| $ cat romeo.txt | head |
| Romeo and Juliet |
| Excerpt from Act 2, Scene 2 |
| |
| JULIET |
| O Romeo, Romeo! wherefore art thou Romeo? |
| Deny thy father and refuse thy name; |
| Or, if thou wilt not, be but sworn my love, |
| And I'll no longer be a Capulet. |
| |
| ROMEO |
| $ hd romeo.txt.bz2 | head |
| 00000000 42 5a 68 39 31 41 59 26 53 59 f8 7d 63 30 00 00 |BZh91AY&SY.}c0..| |
| 00000010 8c 5f 80 00 10 60 85 10 18 be 77 9e 8a 3f ef df |._...`....w..?..| |
| 00000020 f0 40 02 30 e3 bb 00 22 9e a0 68 01 a1 a0 00 00 |.@.0..."..h.....| |
| 00000030 00 02 26 13 23 4d 21 20 6d 41 a0 d0 d0 d0 f2 80 |..&.#M! mA......| |
| 00000040 2a 9e f5 4f 50 f5 46 4c 23 13 09 89 82 60 4c 09 |*..OP.FL#....`L.| |
| 00000050 84 89 1a 9a 05 31 92 27 92 6c 88 68 7a 23 26 83 |.....1.'.l.hz#&.| |
| 00000060 4a b3 4b 3a 06 9b ec c7 19 bc 9a 88 60 6a 6b 42 |J.K:........`jkB| |
| 00000070 5a c6 8e ad c7 c3 8f a8 f0 91 d1 db e3 1b 25 b5 |Z.............%.| |
| 00000080 4e 06 cd f3 08 a9 af e8 82 2d 5b 54 fc 35 a9 0e |N........-[T.5..| |
| 00000090 13 01 17 7b 4c 98 47 24 25 d9 bd aa cd 31 da ee |...{L.G$%....1..| |
| |
| Where `abraca.txt.bz2` had 3 + 1 = 4 possible uncompressed bytes ('a', 'b', 'c' |
| and 'r'), `romeo.txt.bz2` has 1 + 5 + 4 + 12 + 7 + 14 + 10 = 53. This is |
| represented in the wire format by the (h), (i) and (j) bitslices for |
| `abraca.txt.bz2` and the 8 equivalent bitslices in `romeo.txt.bz2` (where we |
| have repeatedly alternated the (i) and (j) bitslice labels). |
| |
| offset xoffset ASCII hex binary bitslice |
| 000017 0x0011 _ 0x5F 0b_0101_1111 ghhh_hhhh popcount(h) = 7 |
| 000018 0x0012 . 0x80 0b_1000_0000 hhhh_hhhh |
| 000019 0x0013 . 0x00 0b_0000_0000 hiii_iiii popcount(i) = 1 |
| 000020 0x0014 . 0x10 0b_0001_0000 iiii_iiii |
| 000021 0x0015 ` 0x60 0b_0110_0000 ijjj_jjjj popcount(j) = 5 |
| 000022 0x0016 . 0x85 0b_1000_0101 jjjj_jjjj |
| 000023 0x0017 . 0x10 0b_0001_0000 jiii_iiii popcount(i) = 4 |
| 000024 0x0018 . 0x18 0b_0001_1000 iiii_iiii |
| 000025 0x0019 . 0xBE 0b_1011_1110 ijjj_jjjj popcount(j) = 12 |
| 000026 0x001A w 0x77 0b_0111_0111 jjjj_jjjj |
| 000027 0x001B . 0x9E 0b_1001_1110 jiii_iiii popcount(i) = 7 |
| 000028 0x001C . 0x8A 0b_1000_1010 iiii_iiii |
| 000029 0x001D ? 0x3F 0b_0011_1111 ijjj_jjjj popcount(j) = 14 |
| 000030 0x001E . 0xEF 0b_1110_1111 jjjj_jjjj |
| 000031 0x001F . 0xDF 0b_1101_1111 jiii_iiii popcount(i) = 10 |
| 000032 0x0020 . 0xF0 0b_1111_0000 iiii_iiii |
| 000033 0x0021 @ 0x40 0b_0100_0000 ikkk_llll |
| |
| Similarly, where `abraca.txt.bz2` had 1 section (per the (l) bitslice), each |
| selecting from 2 possible Huffman codes (per the (k) bitslice), `romeo.txt.bz2` |
| has 17 sections, each selecting from 4 possible Huffman codes. |
| |
| offset xoffset ASCII hex binary bitslice |
| 000033 0x0021 @ 0x40 0b_0100_0000 ikkk_llll |
| 000034 0x0022 . 0x02 0b_0000_0010 llll_llll |
| 000035 0x0023 0 0x30 0b_0011_0000 lllm_mmmm |
| |
| |
| ## MTFT Reprised |
| |
| Bzip2 uses the [Move To Front |
| Transform](https://en.wikipedia.org/wiki/Move-to-front_transform) in two |
| places: |
| |
| 1. for the selectors array encoded in the (m) bitslice and |
| 2. for the conversion from the symbols (the Huffman code's output) to |
| generating the *L* string. |
| |
| Use 2 is discussed above but the Use 1 values for `abraca.txt.bz2` is not very |
| educational, since the (m) bitslice was literally only one bit long. For |
| `romeo.txt.bz2`, the equivalent (m) bitslice is 29 bits long and makes for a |
| more interesting example. |
| |
| offset xoffset ASCII hex binary bitslice |
| 000035 0x0023 0 0x30 0b_0011_0000 lllm_mmmm |
| 000036 0x0024 . 0xE3 0b_1110_0011 mmmm_mmmm |
| 000037 0x0025 . 0xBB 0b_1011_1011 mmmm_mmmm |
| 000038 0x0026 . 0x00 0b_0000_0000 mmmm_mmmm |
| |
| The bits are "10000111000111011101100000000" which is the unary encoding of the |
| 17-element selector array (before applying the MTFT): [1, 0, 0, 0, 3, 0, 0, 3, |
| 3, 2, 0, 0, 0, 0, 0, 0, 0]. |
| |
| If we label the 4 Huffman codes to select from as w, x, y and z then the MTFT |
| converts those 17 elements to "xxxxzzzywzzzzzzzz", per the table below. As |
| before, each N value (pre-MTFT) emits the N'th element (using 0-based indexing) |
| and also rearranges the order buffer: moving that element to the start and |
| pushing the N earlier symbols one to the right. |
| |
| order pre post |
| wxyz 1 x |
| xwyz 0 x |
| xwyz 0 x |
| xwyz 0 x |
| xwyz 3 z |
| zxwy 0 z |
| zxwy 0 z |
| zxwy 3 y |
| yzxw 3 w |
| wyzx 2 z |
| zwyx 0 z |
| zwyx 0 z |
| zwyx 0 z |
| zwyx 0 z |
| zwyx 0 z |
| zwyx 0 z |
| zwyx 0 z |
| |
| |
| # Even More Examples |
| |
| Appendix A of [Tsai's |
| document](https://github.com/dsnet/compress/raw/master/doc/bzip2-format.pdf) |
| gives even more wire format worked examples, presented in great detail. |