tree: 089b0cdb08d530baa9347b9ee827e4e16f353767 [path history] [tgz]

std/deflate/README.md

Deflate

Deflate is a general purpose compression format, using Huffman codes and Lempel-Ziv style length-distance back-references. It is specified in RFC 1951.

Gzip (RFC 1952) and zlib (RFC 1950) are two thin wrappers (with similar purpose, but different wire formats) around the raw deflate encoding. Gzip (the file format) is used by the gzip and tar command line tools and by the HTTP network protocol. Zlib is used by the ELF executable and PNG image file formats.

Zip (also known as PKZIP) is another wrapper, similar to combining gzip with tar, that can compress multiple files into a single archive. Zip is widely used by the ECMA Office Open XML format, the OASIS Open Document Format for Office Applications and the Java JAR format.

Wrangling those formats that build on deflate (gzip, zip and zlib) is not provided by this package. For zlib, look at the std/zlib package instead. The other formats are TODO.

For example, look at test/data/romeo.txt*. First, the uncompressed text:

$ xxd test/data/romeo.txt
00000000: 526f 6d65 6f20 616e 6420 4a75 6c69 6574  Romeo and Juliet
00000010: 0a45 7863 6572 7074 2066 726f 6d20 4163  .Excerpt from Ac
etc
00000390: 740a 536f 2073 7475 6d62 6c65 7374 206f  t.So stumblest o
000003a0: 6e20 6d79 2063 6f75 6e73 656c 3f0a       n my counsel?.

The raw deflate encoding:

$ xxd test/data/romeo.txt.deflate
00000000: 4d53 c16e db30 0cbd f32b d853 2e46 0e3d  MS.n.0...+.S.F.=
00000010: 2e87 20ed 0234 c5ba 0049 861e 861d 149b  .. ..4...I......
etc
00000200: 7d13 8ffd b9a3 24bb 68f4 eb30 7a59 610d  }.....$.h..0zYa.
00000210: 7f01                                     ..

The gzip format wraps a variable length header (in this case, 20 bytes) and 8 byte footer around the raw deflate data. The header contains the NUL-terminated C string name of the original, uncompressed file, “romeo.txt”, amongst other data. The footer contains a 4 byte CRC32 checksum and a 4 byte length of the uncompressed file (0x3ae = 942 bytes).

$ xxd test/data/romeo.txt.gz
00000000: 1f8b 0808 26d8 5d59 0003 726f 6d65 6f2e  ....&.]Y..romeo.
00000010: 7478 7400 4d53 c16e db30 0cbd f32b d853  txt.MS.n.0...+.S
etc
00000210: de5d 2c67 7d13 8ffd b9a3 24bb 68f4 eb30  .],g}.....$.h..0
00000220: 7a59 610d 7f01 ef07 e5ab ae03 0000       zYa...........

The zlib format wraps a 2 byte header and 4 byte footer around the raw deflate data. The footer contains a 4 byte Adler32 checksum. TODO: move this to std/zlib/README.md.

$ xxd test/data/romeo.txt.zlib
00000000: 789c 4d53 c16e db30 0cbd f32b d853 2e46  x.MS.n.0...+.S.F
00000010: 0e3d 2e87 20ed 0234 c5ba 0049 861e 861d  .=.. ..4...I....
etc
00000200: 2c67 7d13 8ffd b9a3 24bb 68f4 eb30 7a59  ,g}.....$.h..0zY
00000210: 610d 7f01 57bb 3ede                      a...W.>.

Wire Format Worked Example

Consider test/data/romeo.txt.deflate. The relevant spec is RFC 1951.

offset  xoffset ASCII   hex     binary
000000  0x0000  M       0x4D    0b_0100_1101
000001  0x0001  S       0x53    0b_0101_0011
000002  0x0002  .       0xC1    0b_1100_0001
000003  0x0003  n       0x6E    0b_0110_1110
000004  0x0004  .       0xDB    0b_1101_1011
000005  0x0005  0       0x30    0b_0011_0000
000006  0x0006  .       0x0C    0b_0000_1100
000007  0x0007  .       0xBD    0b_1011_1101
000008  0x0008  .       0xF3    0b_1111_0011
000009  0x0009  +       0x2B    0b_0010_1011
000010  0x000A  .       0xD8    0b_1101_1000
000011  0x000B  S       0x53    0b_0101_0011
000012  0x000C  .       0x2E    0b_0010_1110
000013  0x000D  F       0x46    0b_0100_0110
000014  0x000E  .       0x0E    0b_0000_1110
000015  0x000F  =       0x3D    0b_0011_1101
000016  0x0010  .       0x2E    0b_0010_1110
000017  0x0011  .       0x87    0b_1000_0111
000018  0x0012          0x20    0b_0010_0000
000019  0x0013  .       0xED    0b_1110_1101
etc
000522  0x020A  .       0xEB    0b_1110_1011
000523  0x020B  0       0x30    0b_0011_0000
000524  0x020C  z       0x7A    0b_0111_1010
000525  0x020D  Y       0x59    0b_0101_1001
000526  0x020E  a       0x61    0b_0110_0001
000527  0x020F  .       0x0D    0b_0000_1101
000528  0x0210  .       0x7F    0b_0111_1111
000529  0x0211  .       0x01    0b_0000_0001

Starting at the top:

offset  xoffset ASCII   hex     binary
000000  0x0000  M       0x4D    0b_0100_1101

As per the RFC section 3.2.3. Details of block format:

BFINAL is the first (LSB) bit, 0b1. This block is the final block.
BTYPE is the next two bits in natural (MSB to LSB) order, 0b10, which means a block that's compressed with dynamic Huffman codes.

There are 3 block types: uncompressed, compressed with fixed Huffman codes and compressed with dynamic Huffman codes. The first type is straightforward: containing a uint16 length N (and its complement, for error detection) and then N literal bytes. The second type is the same as the third type but with built-in lcode and dcode tables (see below). The third type is the interesting part of the deflate format, and its deconstruction continues below.

Dynamic Huffman Blocks

The bit stream is now at:

offset  xoffset ASCII   hex     binary
000000  0x0000  M       0x4D    0b_0100_1...
000001  0x0001  S       0x53    0b_0101_0011
000002  0x0002  .       0xC1    0b_1100_0001

As per the RFC section 3.2.7. Compression with dynamic Huffman codes, three variables (5, 5 and 4 bits) are read in the same natural (MSB to LSB) order:

HLIT = 0b01001 = 9
HDIST = 0b10011 = 19
HCLEN = 0b1010 = 10

Adding the implicit biases give:

nlit  =  9 + 257 = 266
ndist = 19 +   1 =  20
nclen = 10 +   4 =  14

Code Length Code Lengths

The bit stream is now at:

offset  xoffset ASCII   hex     binary
000002  0x0002  .       0xC1    0b_1100_000.
000003  0x0003  n       0x6E    0b_0110_1110
000004  0x0004  .       0xDB    0b_1101_1011
000005  0x0005  0       0x30    0b_0011_0000
000006  0x0006  .       0x0C    0b_0000_1100
000007  0x0007  .       0xBD    0b_1011_1101

There are 19 possible code length code lengths (this is not a typo), but the wire format order is shuffled (in the idiosyncratic order: 16, 17, 18, 0, 8, ..., 15) so that later elements are more likely to be zero (and hence compressible). There are nclen (in this case, 14) explicit code length code lengths, 3 bits each.

clcode_lengths[16] = 0b000 = 0
clcode_lengths[17] = 0b100 = 4
clcode_lengths[18] = 0b101 = 5
clcode_lengths[ 0] = 0b011 = 3
clcode_lengths[ 8] = 0b011 = 3
clcode_lengths[ 7] = 0b011 = 3
clcode_lengths[ 9] = 0b011 = 3
clcode_lengths[ 6] = 0b011 = 3
clcode_lengths[10] = 0b000 = 0
clcode_lengths[ 5] = 0b011 = 3
clcode_lengths[11] = 0b000 = 0
clcode_lengths[ 4] = 0b011 = 3
clcode_lengths[12] = 0b000 = 0
clcode_lengths[ 3] = 0b101 = 5

The remaining (19 - nclen) = 5 entries are implicitly zero:

clcode_lengths[13] = 0
clcode_lengths[ 2] = 0
clcode_lengths[14] = 0
clcode_lengths[ 1] = 0
clcode_lengths[15] = 0

Undoing the shuffle gives:

clcode_lengths[ 0] = 3
clcode_lengths[ 1] = 0
clcode_lengths[ 2] = 0
clcode_lengths[ 3] = 5
clcode_lengths[ 4] = 3
clcode_lengths[ 5] = 3
clcode_lengths[ 6] = 3
clcode_lengths[ 7] = 3
clcode_lengths[ 8] = 3
clcode_lengths[ 9] = 3
clcode_lengths[10] = 0
clcode_lengths[11] = 0
clcode_lengths[12] = 0
clcode_lengths[13] = 0
clcode_lengths[14] = 0
clcode_lengths[15] = 0
clcode_lengths[16] = 0
clcode_lengths[17] = 4
clcode_lengths[18] = 5

For clcodes, there are 7 + 1 + 2 = 10 non-zero entries: 7 3-bit codes, 1 4-bit code and 2 5-bit codes. The canonical Huffman encoding's map from bits to clcodes is:

bits        clcode
0b000        0
0b001        4
0b010        5
0b011        6
0b100        7
0b101        8
0b110        9
0b1110      17
0b11110      3
0b11111     18

Call that Huffman table H-CL.

Lcodes and Dcodes

The bit stream is now at:

offset  xoffset ASCII   hex     binary
000007  0x0007  .       0xBD    0b_1011_1...
000008  0x0008  .       0xF3    0b_1111_0011
000009  0x0009  +       0x2B    0b_0010_1011
000010  0x000A  .       0xD8    0b_1101_1000
000011  0x000B  S       0x53    0b_0101_0011
000012  0x000C  .       0x2E    0b_0010_1110
000013  0x000D  F       0x46    0b_0100_0110

When decoding via a Huffman table, bits are read in LSB to MSB order, right-to-left in this debugging output. Extra bits, if any, are then read in the other, natural order (MSB to LSB). Decoding via H-CL gives:

bits        clcode
0b1110      17, 3 extra bits = 0b111 = 7
0b001        4
0b11111     18, 7 extra bits = 0b0001010 = 10
0b001        4
0b101        8
0b1110      17, 3 extra bits = 0b010 = 2
0b100        7
0b1110      17, 3 extra bits = 0b001 = 1
0b011        6
0b000        0

Still in the RFC section 3.2.7., this means:

lcode_lengths has 3 + 7 = 10 consecutive zeroes
lcode_lengths[ 10] = 4
lcode_lengths has 11 + 10 = 21 consecutive zeroes
lcode_lengths[ 32] = 4
lcode_lengths[ 33] = 8
lcode_lengths has 3 + 2 = 5 consecutive zeroes
lcode_lengths[ 39] = 7
lcode_lengths has 3 + 1 = 4 consecutive zeroes
lcode_lengths[ 44] = 6
lcode_lengths[ 45] = 0

After decoding the first 10 code lengths, the bit stream is now at:

offset  xoffset ASCII   hex     binary
000013  0x000D  F       0x46    0b_01.._....

Continuing to read all nlit (266) lcode lengths and ndist (20) dcode lengths (zeroes are not shown here):

lcode_lengths[ 10] = 4
lcode_lengths[ 32] = 4
lcode_lengths[ 33] = 8
lcode_lengths[ 39] = 7
lcode_lengths[ 44] = 6
lcode_lengths[ 46] = 7
lcode_lengths[ 50] = 8
lcode_lengths[ 58] = 9
lcode_lengths[ 59] = 7
lcode_lengths[ 63] = 7
etc
lcode_lengths[264] = 9
lcode_lengths[265] = 9

and

dcode_lengths[ 5] = 6
dcode_lengths[ 6] = 5
dcode_lengths[ 8] = 4
dcode_lengths[ 9] = 5
dcode_lengths[10] = 3
etc
dcode_lengths[18] = 3
dcode_lengths[19] = 6

The H-CL table is no longer used from here onwards.

For lcodes, there are 6 + 9 + 10 + 16 + 10 + 12 = 63 non-zero entries: 6 4-bit codes, 9 5-bit codes, 10 6-bit codes, 16 7-bit codes, 10 8-bit codes and 12 9-bit codes. The canonical Huffman encoding's map from bits to lcode values is:

bits        lcode
0b0000       10
0b0001       32
0b0010      101
0b0011      116
0b0100      257
0b0101      259
0b01100      97
0b01101     104
0b01110     105
0b01111     110
0b10000     111
0b10001     114
0b10010     115
0b10011     258
0b10100     260
0b101010     44
0b101011     99
0b101100    100
0b101101    102
0b101110    108
0b101111    109
0b110000    112
0b110001    117
0b110010    119
0b110011    121
0b1101000    39
0b1101001    46
0b1101010    59
0b1101011    63
0b1101100    65
0b1101101    69
0b1101110    73
0b1101111    79
0b1110000    82
0b1110001    83
0b1110010    84
0b1110011    98
0b1110100   103
0b1110101   261
0b1110110   262
0b1110111   263
0b11110000   33
0b11110001   50
0b11110010   66
0b11110011   67
0b11110100   72
0b11110101   74
0b11110110   77
0b11110111   87
0b11111000  107
0b11111001  118
0b111110100  58
0b111110101  68
0b111110110  76
0b111110111  78
0b111111000  85
0b111111001  91
0b111111010  93
0b111111011 120
0b111111100 122
0b111111101 256
0b111111110 264
0b111111111 265

Call that Huffman table H-L.

For dcodes, there are 5 + 4 + 3 + 2 = 14 non-zero entries: 5 3-bit codes, 4 4-bit codes, 3 5-bit codes and 2 6-bit codes. The canonical Huffman encoding's map from bits to dcode values is:

bits        dcode
0b000       10
0b001       14
0b010       16
0b011       17
0b100       18
0b1010       8
0b1011      12
0b1100      13
0b1101      15
0b11100      6
0b11101      9
0b11110     11
0b111110     5
0b111111    19

Call that Huffman table H-D.

Decoding Literals

The bit stream is now at:

offset  xoffset ASCII   hex     binary
000052  0x0034  .       0xE7    0b_1..._....
000053  0x0035  C       0x43    0b_0100_0011
000054  0x0036  .       0xE8    0b_1110_1000
000055  0x0037  )       0x29    0b_0010_1001
000056  0x0038  .       0xA0    0b_1010_0000
000057  0x0039  .       0xF1    0b_1111_0001

Decoding from that bit stream via H-L gives some literal bytes (where the lcode value is < 256):

bits        lcode
0b1110000    82 (ASCII 'R')
0b10000     111 (ASCII 'o')
0b101111    109 (ASCII 'm')
0b0010      101 (ASCII 'e')
0b10000     111 (ASCII 'o')
0b0001       32 (ASCII ' ')
0b01100      97 (ASCII 'a')
0b01111     110 (ASCII 'n')

Decoding Back-References

This continues, up until we get to the “O Romeo, Romeo! wherefore art thou Romeo?” line.

The bit stream is now at:

offset  xoffset ASCII   hex     binary
000089  0x0059  .       0xC1    0b_11.._....
000090  0x005A  .       0x1E    0b_0001_1110
000091  0x005B  .       0x0F    0b_0000_1111
000092  0x005C  .       0xF9    0b_1111_1001

Decoding from that bit stream via H-L gives:

bits        lcode
0b1101111    79 (ASCII 'O')
0b0001       32 (ASCII ' ')
0b1110000    82 (ASCII 'R')
0b10011     258

This 258 is our first non-literal lcode, the start of a length-distance back-reference. We first need to decode the length. The RFC section 3.2.5. Compressed blocks (length and distance codes) says that an lcode of 258 means that length=4 with no extra bits.

The bit stream is now at:

offset  xoffset ASCII   hex     binary
000092  0x005C  .       0xF9    0b_111._....
000093  0x005D  Y       0x59    0b_0101_1001

We next decode via H-D instead of H-L.

bits        dcode
0b11110     11

Again, from section 3.2.5., a dcode of 11 means a distance in the range [49, 64], 49 plus 4 extra bits. The next 4 bits are 0b0110=6, so the distance is 55. We therefore copy length=4 bytes from distance=55 bytes ago: “omeo”.

The bit stream is now at:

offset  xoffset ASCII   hex     binary
000093  0x005D  Y       0x59    0b_01.._....
000094  0x005E  U       0x55    0b_0101_0101
000095  0x005F  >       0x3E    0b_0011_1110

We go back to decoding via H-L, which gives

bits        lcode
0b101010     44 (ASCII ',')
0b10100     260

This is another non-literal. Section 3.2.5. says that an lcode of 260 means length=6 with no extra bits.

The bit stream is now at:

offset  xoffset ASCII   hex     binary
000095  0x005F  >       0x3E    0b_0011_111.

Decode with H-D.

bits        dcode
0b111110     5

Again, from section 3.2.5., a dcode of 5 means a distance in the range [7, 8], 7 plus 1 extra bit. The next 1 bits are 0b0=0, so the distance is 7. We therefore copy length=6 bytes from distance=7 bytes ago: " Romeo".

The bit stream is now at:

offset  xoffset ASCII   hex     binary
000096  0x0060  .       0x0F    0b_0000_1111

We go back to decoding via H-L, which gives

bits        lcode
0b11110000   33 (ASCII '!')

And on we go.

End of Block

The block finishes with the bit stream at:

offset  xoffset ASCII   hex     binary
000522  0x020A  .       0xEB    0b_1110_101.
000523  0x020B  0       0x30    0b_0011_0000
000524  0x020C  z       0x7A    0b_0111_1010
000525  0x020D  Y       0x59    0b_0101_1001
000526  0x020E  a       0x61    0b_0110_0001
000527  0x020F  .       0x0D    0b_0000_1101
000528  0x0210  .       0x7F    0b_0111_1111
000529  0x0211  .       0x01    0b_0000_0001

The decoding is:

bits        lcode
0b101011     99 (ASCII 'c')
0b10000     111 (ASCII 'o')
0b110001    117 (ASCII 'u')
0b01111     110 (ASCII 'n')
0b0100      257 (length=3 + 0_extra_bits...)

bits        dcode
0b1101      15  (distance=193 + 6_extra_bits, 0b000010 in MSB to LSB order;
                 length=3, distance=195: copy "sel")

bits        lcode
0b1101011    63 (ASCII '?')
0b0000       10 (ASCII '\n')
0b111111101 256 (end of block)

In other words, the block ends with “counsel?\n”.

More Wire Format Examples

See test/data/artificial/deflate-*.commentary.txt