| # Deflate |
| |
| Deflate is a general purpose compression format, using Huffman codes and |
| Lempel-Ziv style length-distance back-references. It is specified in [RFC |
| 1951](https://www.ietf.org/rfc/rfc1951.txt). |
| |
| Gzip ([RFC 1952](https://www.ietf.org/rfc/rfc1952.txt)) and zlib ([RFC |
| 1950](https://www.ietf.org/rfc/rfc1950.txt)) are two thin wrappers (with |
| similar purpose, but different wire formats) around the raw deflate encoding. |
| Gzip (the file format) is used by the `gzip` and `tar` command line tools and |
| by the HTTP network protocol. Zlib is used by the ELF executable and PNG image |
| file formats. |
| |
| [Zip](https://support.pkware.com/display/PKZIP/APPNOTE) (also known as PKZIP) |
| is another wrapper, similar to combining gzip with tar, that can compress |
| multiple files into a single archive. Zip is widely used by the ECMA Office |
| Open XML format, the OASIS Open Document Format for Office Applications and the |
| Java JAR format. |
| |
| Wrangling those formats that build on deflate (gzip, zip and zlib) is not |
| provided by this package. For zlib, look at the `std/zlib` package instead. The |
| other formats are TODO. |
| |
| For example, look at `test/data/romeo.txt*`. First, the uncompressed text: |
| |
| $ xxd test/data/romeo.txt |
| 00000000: 526f 6d65 6f20 616e 6420 4a75 6c69 6574 Romeo and Juliet |
| 00000010: 0a45 7863 6572 7074 2066 726f 6d20 4163 .Excerpt from Ac |
| etc |
| 00000390: 740a 536f 2073 7475 6d62 6c65 7374 206f t.So stumblest o |
| 000003a0: 6e20 6d79 2063 6f75 6e73 656c 3f0a n my counsel?. |
| |
| The raw deflate encoding: |
| |
| $ xxd test/data/romeo.txt.deflate |
| 00000000: 4d53 c16e db30 0cbd f32b d853 2e46 0e3d MS.n.0...+.S.F.= |
| 00000010: 2e87 20ed 0234 c5ba 0049 861e 861d 149b .. ..4...I...... |
| etc |
| 00000200: 7d13 8ffd b9a3 24bb 68f4 eb30 7a59 610d }.....$.h..0zYa. |
| 00000210: 7f01 .. |
| |
| The gzip format wraps a variable length header (in this case, 20 bytes) and 8 |
| byte footer around the raw deflate data. The header contains the NUL-terminated |
| C string name of the original, uncompressed file, "romeo.txt", amongst other |
| data. The footer contains a 4 byte CRC32 checksum and a 4 byte length of the |
| uncompressed file (0x3ae = 942 bytes). |
| |
| $ xxd test/data/romeo.txt.gz |
| 00000000: 1f8b 0808 26d8 5d59 0003 726f 6d65 6f2e ....&.]Y..romeo. |
| 00000010: 7478 7400 4d53 c16e db30 0cbd f32b d853 txt.MS.n.0...+.S |
| etc |
| 00000210: de5d 2c67 7d13 8ffd b9a3 24bb 68f4 eb30 .],g}.....$.h..0 |
| 00000220: 7a59 610d 7f01 ef07 e5ab ae03 0000 zYa........... |
| |
| The zlib format wraps a 2 byte header and 4 byte footer around the raw deflate |
| data. The footer contains a 4 byte Adler32 checksum. TODO: move this to |
| std/zlib/README.md. |
| |
| $ xxd test/data/romeo.txt.zlib |
| 00000000: 789c 4d53 c16e db30 0cbd f32b d853 2e46 x.MS.n.0...+.S.F |
| 00000010: 0e3d 2e87 20ed 0234 c5ba 0049 861e 861d .=.. ..4...I.... |
| etc |
| 00000200: 2c67 7d13 8ffd b9a3 24bb 68f4 eb30 7a59 ,g}.....$.h..0zY |
| 00000210: 610d 7f01 57bb 3ede a...W.>. |
| |
| |
| # Wire Format Worked Example |
| |
| Consider `test/data/romeo.txt.deflate`. The relevant spec is RFC 1951. |
| |
| offset xoffset ASCII hex binary |
| 000000 0x0000 M 0x4D 0b_0100_1101 |
| 000001 0x0001 S 0x53 0b_0101_0011 |
| 000002 0x0002 . 0xC1 0b_1100_0001 |
| 000003 0x0003 n 0x6E 0b_0110_1110 |
| 000004 0x0004 . 0xDB 0b_1101_1011 |
| 000005 0x0005 0 0x30 0b_0011_0000 |
| 000006 0x0006 . 0x0C 0b_0000_1100 |
| 000007 0x0007 . 0xBD 0b_1011_1101 |
| 000008 0x0008 . 0xF3 0b_1111_0011 |
| 000009 0x0009 + 0x2B 0b_0010_1011 |
| 000010 0x000A . 0xD8 0b_1101_1000 |
| 000011 0x000B S 0x53 0b_0101_0011 |
| 000012 0x000C . 0x2E 0b_0010_1110 |
| 000013 0x000D F 0x46 0b_0100_0110 |
| 000014 0x000E . 0x0E 0b_0000_1110 |
| 000015 0x000F = 0x3D 0b_0011_1101 |
| 000016 0x0010 . 0x2E 0b_0010_1110 |
| 000017 0x0011 . 0x87 0b_1000_0111 |
| 000018 0x0012 0x20 0b_0010_0000 |
| 000019 0x0013 . 0xED 0b_1110_1101 |
| etc |
| 000522 0x020A . 0xEB 0b_1110_1011 |
| 000523 0x020B 0 0x30 0b_0011_0000 |
| 000524 0x020C z 0x7A 0b_0111_1010 |
| 000525 0x020D Y 0x59 0b_0101_1001 |
| 000526 0x020E a 0x61 0b_0110_0001 |
| 000527 0x020F . 0x0D 0b_0000_1101 |
| 000528 0x0210 . 0x7F 0b_0111_1111 |
| 000529 0x0211 . 0x01 0b_0000_0001 |
| |
| Starting at the top: |
| |
| offset xoffset ASCII hex binary |
| 000000 0x0000 M 0x4D 0b_0100_1101 |
| |
| As per the RFC section 3.2.3. Details of block format: |
| |
| - BFINAL is the first (LSB) bit, 0b1. This block is the final block. |
| - BTYPE is the next two bits in natural (MSB to LSB) order, 0b10, which means |
| a block that's compressed with dynamic Huffman codes. |
| |
| There are 3 block types: uncompressed, compressed with fixed Huffman codes and |
| compressed with dynamic Huffman codes. The first type is straightforward: |
| containing a uint16 length `N` (and its complement, for error detection) and |
| then `N` literal bytes. The second type is the same as the third type but with |
| built-in `lcode` and `dcode` tables (see below). The third type is the |
| interesting part of the deflate format, and its deconstruction continues below. |
| |
| |
| ## Dynamic Huffman Blocks |
| |
| The bit stream is now at: |
| |
| offset xoffset ASCII hex binary |
| 000000 0x0000 M 0x4D 0b_0100_1... |
| 000001 0x0001 S 0x53 0b_0101_0011 |
| 000002 0x0002 . 0xC1 0b_1100_0001 |
| |
| As per the RFC section 3.2.7. Compression with dynamic Huffman codes, three |
| variables (5, 5 and 4 bits) are read in the same natural (MSB to LSB) order: |
| |
| - HLIT = 0b01001 = 9 |
| - HDIST = 0b10011 = 19 |
| - HCLEN = 0b1010 = 10 |
| |
| Adding the implicit biases give: |
| |
| nlit = 9 + 257 = 266 |
| ndist = 19 + 1 = 20 |
| nclen = 10 + 4 = 14 |
| |
| |
| ## Code Length Code Lengths |
| |
| The bit stream is now at: |
| |
| offset xoffset ASCII hex binary |
| 000002 0x0002 . 0xC1 0b_1100_000. |
| 000003 0x0003 n 0x6E 0b_0110_1110 |
| 000004 0x0004 . 0xDB 0b_1101_1011 |
| 000005 0x0005 0 0x30 0b_0011_0000 |
| 000006 0x0006 . 0x0C 0b_0000_1100 |
| 000007 0x0007 . 0xBD 0b_1011_1101 |
| |
| There are 19 possible code length code lengths (this is not a typo), but the |
| wire format order is shuffled (in the idiosyncratic order: 16, 17, 18, 0, 8, |
| ..., 15) so that later elements are more likely to be zero (and hence |
| compressible). There are `nclen` (in this case, 14) explicit code length code |
| lengths, 3 bits each. |
| |
| clcode_lengths[16] = 0b000 = 0 |
| clcode_lengths[17] = 0b100 = 4 |
| clcode_lengths[18] = 0b101 = 5 |
| clcode_lengths[ 0] = 0b011 = 3 |
| clcode_lengths[ 8] = 0b011 = 3 |
| clcode_lengths[ 7] = 0b011 = 3 |
| clcode_lengths[ 9] = 0b011 = 3 |
| clcode_lengths[ 6] = 0b011 = 3 |
| clcode_lengths[10] = 0b000 = 0 |
| clcode_lengths[ 5] = 0b011 = 3 |
| clcode_lengths[11] = 0b000 = 0 |
| clcode_lengths[ 4] = 0b011 = 3 |
| clcode_lengths[12] = 0b000 = 0 |
| clcode_lengths[ 3] = 0b101 = 5 |
| |
| The remaining (19 - `nclen`) = 5 entries are implicitly zero: |
| |
| clcode_lengths[13] = 0 |
| clcode_lengths[ 2] = 0 |
| clcode_lengths[14] = 0 |
| clcode_lengths[ 1] = 0 |
| clcode_lengths[15] = 0 |
| |
| Undoing the shuffle gives: |
| |
| clcode_lengths[ 0] = 3 |
| clcode_lengths[ 1] = 0 |
| clcode_lengths[ 2] = 0 |
| clcode_lengths[ 3] = 5 |
| clcode_lengths[ 4] = 3 |
| clcode_lengths[ 5] = 3 |
| clcode_lengths[ 6] = 3 |
| clcode_lengths[ 7] = 3 |
| clcode_lengths[ 8] = 3 |
| clcode_lengths[ 9] = 3 |
| clcode_lengths[10] = 0 |
| clcode_lengths[11] = 0 |
| clcode_lengths[12] = 0 |
| clcode_lengths[13] = 0 |
| clcode_lengths[14] = 0 |
| clcode_lengths[15] = 0 |
| clcode_lengths[16] = 0 |
| clcode_lengths[17] = 4 |
| clcode_lengths[18] = 5 |
| |
| For `clcode`s, there are 7 + 1 + 2 = 10 non-zero entries: 7 3-bit codes, 1 |
| 4-bit code and 2 5-bit codes. The canonical Huffman encoding's map from bits to |
| `clcode`s is: |
| |
| bits clcode |
| 0b000 0 |
| 0b001 4 |
| 0b010 5 |
| 0b011 6 |
| 0b100 7 |
| 0b101 8 |
| 0b110 9 |
| 0b1110 17 |
| 0b11110 3 |
| 0b11111 18 |
| |
| Call that Huffman table `H-CL`. |
| |
| |
| ## Lcodes and Dcodes |
| |
| The bit stream is now at: |
| |
| offset xoffset ASCII hex binary |
| 000007 0x0007 . 0xBD 0b_1011_1... |
| 000008 0x0008 . 0xF3 0b_1111_0011 |
| 000009 0x0009 + 0x2B 0b_0010_1011 |
| 000010 0x000A . 0xD8 0b_1101_1000 |
| 000011 0x000B S 0x53 0b_0101_0011 |
| 000012 0x000C . 0x2E 0b_0010_1110 |
| 000013 0x000D F 0x46 0b_0100_0110 |
| |
| When decoding via a Huffman table, bits are read in LSB to MSB order, |
| right-to-left in this debugging output. Extra bits, if any, are then read in |
| the other, natural order (MSB to LSB). Decoding via `H-CL` gives: |
| |
| bits clcode |
| 0b1110 17, 3 extra bits = 0b111 = 7 |
| 0b001 4 |
| 0b11111 18, 7 extra bits = 0b0001010 = 10 |
| 0b001 4 |
| 0b101 8 |
| 0b1110 17, 3 extra bits = 0b010 = 2 |
| 0b100 7 |
| 0b1110 17, 3 extra bits = 0b001 = 1 |
| 0b011 6 |
| 0b000 0 |
| |
| Still in the RFC section 3.2.7., this means: |
| |
| lcode_lengths has 3 + 7 = 10 consecutive zeroes |
| lcode_lengths[ 10] = 4 |
| lcode_lengths has 11 + 10 = 21 consecutive zeroes |
| lcode_lengths[ 32] = 4 |
| lcode_lengths[ 33] = 8 |
| lcode_lengths has 3 + 2 = 5 consecutive zeroes |
| lcode_lengths[ 39] = 7 |
| lcode_lengths has 3 + 1 = 4 consecutive zeroes |
| lcode_lengths[ 44] = 6 |
| lcode_lengths[ 45] = 0 |
| |
| After decoding the first 10 code lengths, the bit stream is now at: |
| |
| offset xoffset ASCII hex binary |
| 000013 0x000D F 0x46 0b_01.._.... |
| |
| Continuing to read all `nlit` (266) `lcode` lengths and `ndist` (20) `dcode` |
| lengths (zeroes are not shown here): |
| |
| lcode_lengths[ 10] = 4 |
| lcode_lengths[ 32] = 4 |
| lcode_lengths[ 33] = 8 |
| lcode_lengths[ 39] = 7 |
| lcode_lengths[ 44] = 6 |
| lcode_lengths[ 46] = 7 |
| lcode_lengths[ 50] = 8 |
| lcode_lengths[ 58] = 9 |
| lcode_lengths[ 59] = 7 |
| lcode_lengths[ 63] = 7 |
| etc |
| lcode_lengths[264] = 9 |
| lcode_lengths[265] = 9 |
| |
| and |
| |
| dcode_lengths[ 5] = 6 |
| dcode_lengths[ 6] = 5 |
| dcode_lengths[ 8] = 4 |
| dcode_lengths[ 9] = 5 |
| dcode_lengths[10] = 3 |
| etc |
| dcode_lengths[18] = 3 |
| dcode_lengths[19] = 6 |
| |
| The `H-CL` table is no longer used from here onwards. |
| |
| For `lcode`s, there are 6 + 9 + 10 + 16 + 10 + 12 = 63 non-zero entries: 6 |
| 4-bit codes, 9 5-bit codes, 10 6-bit codes, 16 7-bit codes, 10 8-bit codes and |
| 12 9-bit codes. The canonical Huffman encoding's map from bits to `lcode` |
| values is: |
| |
| bits lcode |
| 0b0000 10 |
| 0b0001 32 |
| 0b0010 101 |
| 0b0011 116 |
| 0b0100 257 |
| 0b0101 259 |
| 0b01100 97 |
| 0b01101 104 |
| 0b01110 105 |
| 0b01111 110 |
| 0b10000 111 |
| 0b10001 114 |
| 0b10010 115 |
| 0b10011 258 |
| 0b10100 260 |
| 0b101010 44 |
| 0b101011 99 |
| 0b101100 100 |
| 0b101101 102 |
| 0b101110 108 |
| 0b101111 109 |
| 0b110000 112 |
| 0b110001 117 |
| 0b110010 119 |
| 0b110011 121 |
| 0b1101000 39 |
| 0b1101001 46 |
| 0b1101010 59 |
| 0b1101011 63 |
| 0b1101100 65 |
| 0b1101101 69 |
| 0b1101110 73 |
| 0b1101111 79 |
| 0b1110000 82 |
| 0b1110001 83 |
| 0b1110010 84 |
| 0b1110011 98 |
| 0b1110100 103 |
| 0b1110101 261 |
| 0b1110110 262 |
| 0b1110111 263 |
| 0b11110000 33 |
| 0b11110001 50 |
| 0b11110010 66 |
| 0b11110011 67 |
| 0b11110100 72 |
| 0b11110101 74 |
| 0b11110110 77 |
| 0b11110111 87 |
| 0b11111000 107 |
| 0b11111001 118 |
| 0b111110100 58 |
| 0b111110101 68 |
| 0b111110110 76 |
| 0b111110111 78 |
| 0b111111000 85 |
| 0b111111001 91 |
| 0b111111010 93 |
| 0b111111011 120 |
| 0b111111100 122 |
| 0b111111101 256 |
| 0b111111110 264 |
| 0b111111111 265 |
| |
| Call that Huffman table `H-L`. |
| |
| For `dcode`s, there are 5 + 4 + 3 + 2 = 14 non-zero entries: 5 3-bit codes, 4 |
| 4-bit codes, 3 5-bit codes and 2 6-bit codes. The canonical Huffman encoding's |
| map from bits to `dcode` values is: |
| |
| bits dcode |
| 0b000 10 |
| 0b001 14 |
| 0b010 16 |
| 0b011 17 |
| 0b100 18 |
| 0b1010 8 |
| 0b1011 12 |
| 0b1100 13 |
| 0b1101 15 |
| 0b11100 6 |
| 0b11101 9 |
| 0b11110 11 |
| 0b111110 5 |
| 0b111111 19 |
| |
| Call that Huffman table `H-D`. |
| |
| |
| ## Decoding Literals |
| |
| The bit stream is now at: |
| |
| offset xoffset ASCII hex binary |
| 000052 0x0034 . 0xE7 0b_1..._.... |
| 000053 0x0035 C 0x43 0b_0100_0011 |
| 000054 0x0036 . 0xE8 0b_1110_1000 |
| 000055 0x0037 ) 0x29 0b_0010_1001 |
| 000056 0x0038 . 0xA0 0b_1010_0000 |
| 000057 0x0039 . 0xF1 0b_1111_0001 |
| |
| Decoding from that bit stream via `H-L` gives some literal bytes (where the |
| `lcode` value is < 256): |
| |
| bits lcode |
| 0b1110000 82 (ASCII 'R') |
| 0b10000 111 (ASCII 'o') |
| 0b101111 109 (ASCII 'm') |
| 0b0010 101 (ASCII 'e') |
| 0b10000 111 (ASCII 'o') |
| 0b0001 32 (ASCII ' ') |
| 0b01100 97 (ASCII 'a') |
| 0b01111 110 (ASCII 'n') |
| |
| |
| ## Decoding Back-References |
| |
| This continues, up until we get to the "O Romeo, Romeo! wherefore art thou |
| Romeo?" line. |
| |
| The bit stream is now at: |
| |
| offset xoffset ASCII hex binary |
| 000089 0x0059 . 0xC1 0b_11.._.... |
| 000090 0x005A . 0x1E 0b_0001_1110 |
| 000091 0x005B . 0x0F 0b_0000_1111 |
| 000092 0x005C . 0xF9 0b_1111_1001 |
| |
| Decoding from that bit stream via `H-L` gives: |
| |
| bits lcode |
| 0b1101111 79 (ASCII 'O') |
| 0b0001 32 (ASCII ' ') |
| 0b1110000 82 (ASCII 'R') |
| 0b10011 258 |
| |
| This 258 is our first non-literal `lcode`, the start of a length-distance |
| back-reference. We first need to decode the length. The RFC section 3.2.5. |
| Compressed blocks (length and distance codes) says that an `lcode` of 258 means |
| that length=4 with no extra bits. |
| |
| The bit stream is now at: |
| |
| offset xoffset ASCII hex binary |
| 000092 0x005C . 0xF9 0b_111._.... |
| 000093 0x005D Y 0x59 0b_0101_1001 |
| |
| We next decode via `H-D` instead of `H-L`. |
| |
| bits dcode |
| 0b11110 11 |
| |
| Again, from section 3.2.5., a `dcode` of 11 means a distance in the range [49, |
| 64], 49 plus 4 extra bits. The next 4 bits are 0b0110=6, so the distance is 55. |
| We therefore copy length=4 bytes from distance=55 bytes ago: "omeo". |
| |
| The bit stream is now at: |
| |
| offset xoffset ASCII hex binary |
| 000093 0x005D Y 0x59 0b_01.._.... |
| 000094 0x005E U 0x55 0b_0101_0101 |
| 000095 0x005F > 0x3E 0b_0011_1110 |
| |
| We go back to decoding via `H-L`, which gives |
| |
| bits lcode |
| 0b101010 44 (ASCII ',') |
| 0b10100 260 |
| |
| This is another non-literal. Section 3.2.5. says that an `lcode` of 260 means |
| length=6 with no extra bits. |
| |
| The bit stream is now at: |
| |
| offset xoffset ASCII hex binary |
| 000095 0x005F > 0x3E 0b_0011_111. |
| |
| Decode with `H-D`. |
| |
| bits dcode |
| 0b111110 5 |
| |
| Again, from section 3.2.5., a `dcode` of 5 means a distance in the range [7, |
| 8], 7 plus 1 extra bit. The next 1 bits are 0b0=0, so the distance is 7. We |
| therefore copy length=6 bytes from distance=7 bytes ago: " Romeo". |
| |
| The bit stream is now at: |
| |
| offset xoffset ASCII hex binary |
| 000096 0x0060 . 0x0F 0b_0000_1111 |
| |
| We go back to decoding via `H-L`, which gives |
| |
| bits lcode |
| 0b11110000 33 (ASCII '!') |
| |
| And on we go. |
| |
| |
| ## End of Block |
| |
| The block finishes with the bit stream at: |
| |
| offset xoffset ASCII hex binary |
| 000522 0x020A . 0xEB 0b_1110_101. |
| 000523 0x020B 0 0x30 0b_0011_0000 |
| 000524 0x020C z 0x7A 0b_0111_1010 |
| 000525 0x020D Y 0x59 0b_0101_1001 |
| 000526 0x020E a 0x61 0b_0110_0001 |
| 000527 0x020F . 0x0D 0b_0000_1101 |
| 000528 0x0210 . 0x7F 0b_0111_1111 |
| 000529 0x0211 . 0x01 0b_0000_0001 |
| |
| The decoding is: |
| |
| bits lcode |
| 0b101011 99 (ASCII 'c') |
| 0b10000 111 (ASCII 'o') |
| 0b110001 117 (ASCII 'u') |
| 0b01111 110 (ASCII 'n') |
| 0b0100 257 (length=3 + 0_extra_bits...) |
| |
| bits dcode |
| 0b1101 15 (distance=193 + 6_extra_bits, 0b000010 in MSB to LSB order; |
| length=3, distance=195: copy "sel") |
| |
| bits lcode |
| 0b1101011 63 (ASCII '?') |
| 0b0000 10 (ASCII '\n') |
| 0b111111101 256 (end of block) |
| |
| In other words, the block ends with "counsel?\n". |
| |
| |
| # More Wire Format Examples |
| |
| See `test/data/artificial/deflate-*.commentary.txt` |