tree: 60b2345138b154deb53496266a4e8d742c184c5a

std/lzw/README.md

LZW (Lempel Ziv Welch) Compression

LZW is a general purpose compression algorithm, not specific to GIF, PDF, TIFF or even to graphics. In practice, there are two incompatible implementations, LSB (Least Significant Bits) and MSB (Most Significant Bits) first.

The GIF format uses LSB first. The literal width can vary between 2 and 8 bits, inclusive. There is no EarlyChange option.

The PDF and TIFF formats use MSB first. The literal width is fixed at 8 bits. There is an EarlyChange option that PDF sometimes uses (see Table 3.7 in the PDF 1.4 spec) and TIFF always uses.

Codes

An LZW encoding is a stream of codes, each code emitting zero or more bytes. There are four types of codes: literals, copies, clear and end.

Literal codes emit exactly one byte. For a given literal width L, there are (1 << L) literal codes. For example, if L is at least 7, then the 0x41 code emits the ASCII ‘A’ byte. If L is exactly 7, the highest literal code is 0x7F.

There is exactly one clear code and one end code, whose encoded values are right after the literal codes: CLEAR = ((1 << L) + 0) and END = ((1 << L) + 1). For 8 bit wide literals, CLEAR is 0x100 and END is 0x101. The clear code is discussed below. The end code means no more data. Both emit zero bytes.

Copy codes (with encoded values ranging from (END + 1) to 0xFFF, or equivalently, up to 4095 or up to 12 bits) emit two or more bytes, bytes that were previously emitted in the decoded stream. This is essentially a lookup in a key-value table. For example, the table could state that the 0x200 code emits “foo” and the 0x300 code emits “bar”.

Valid copy codes are in the range [END + 1, N], where N starts at END (i.e. the table's key range starts empty) and grows over time, up to a maximum value of 0xFFF. A clear code resets N to its starting value, END. In effect, a clear code clears the key-value table.

There is no explicit way to “increase N” or “add this key-value to the table”. Instead, as long as the table isn't full (i.e. that N < 0xFFF), every literal or copy code, other than the first one in the stream or the first one after a clear code, implicitly adds a key-value pair (with a key of N) and then increments N by 1.

The first literal or copy code in the stream (or after a clear code) merely increments N to (END + 1) without adding a key-value pair. In practice, it is equivalent to have that first code set a fake key-value pair for N == END, provided that table lookup algorithm ensures that “END means an end code” takes precendence.

Codes Example

The value in each key-value pair consists of a prefix (being 1 or more bytes) and a suffix (exactly 1 byte). The prefix equals the output emitted by the previous code: the one that was processed immediately before the current code. The suffix is the first byte of the output emitted by the current code.

For example, if a literal T code (which emits “T”) is followed by a literal O (which emits “O”), then the value in the implicit key-value pair is the entire “T” concatenated with the first byte of “O”, so the value is “TO”.

For example, if a copy 0x104 code (which emits “BE”) is followed by a copy 0x106 code (which emits “OR”), then the value is “BEO”.

For example, if a copy 0x10E code (which emits “TOBE”) is followed by a literal Y (which emits “Y”), then the value is “TOBEY”.

It is possible for a copy code to equal N, the very key to be implicitly added to the table. The prefix is, once again, the output emitted by the previous code. The suffix byte is, once again, the first byte of the output emitted by the current code, which in this case is the first byte of the prefix.

For example, if the previous code emitted “OTX”, then a copy code of N would both emit “OTXO” and add a (key=N, value=“OXTO”) pair to the table.

One possible encoding of “TOBEORNOTTOBEORTOBEORNOTXOTXOTXOOTXOOOTXOOOTOBEY” is in the table below. With dots to punctuate each code's emitted output, that is “T.O.B.E.O.R.N.O.T.TO.BE.OR.TOB.EO.RN.OT.X.OTX.OTXO.OTXOO.OTXOOO.TOBE.Y”.

 Code   Emits    Key    Value   Pre1+Suf1  LM1  /Q  %Q   PreQ+SufQ
    T       T
    O       O  0x102       TO      T O       1   0   1      - TO.
    B       B  0x103       OB      O B       1   0   1      - OB.
    E       E  0x104       BE      B E       1   0   1      - BE.
    O       O  0x105       EO      E O       1   0   1      - EO.
    R       R  0x106       OR      O R       1   0   1      - OR.
    N       N  0x107       RN      R N       1   0   1      - RN.
    O       O  0x108       NO      N O       1   0   1      - NO.
    T       T  0x109       OT      O T       1   0   1      - OT.
0x102      TO  0x10A       TT      T T       1   0   1      - TT.
0x104      BE  0x10B      TOB  0x102 B       2   0   2      - TOB
0x106      OR  0x10C      BEO  0x104 O       2   0   2      - BEO
0x10B     TOB  0x10D      ORT  0x106 T       2   0   2      - ORT
0x105      EO  0x10E     TOBE  0x10B E       3   1   0  0x10B E..
0x107      RN  0x10F      EOR  0x105 R       2   0   2      - EOR
0x109      OT  0x110      RNO  0x107 O       2   0   2      - RNO
    X       X  0x111      OTX  0x109 X       2   0   2      - OTX
0x111     OTX  0x112       XO      X O       1   0   1      - XO.
0x113    OTXO  0x113     OTXO  0x111 O       3   1   0  0x111 O..
0x114   OTXOO  0x114    OTXOO  0x113 O       4   1   1  0x111 OO.
0x115  OTXOOO  0x115   OTXOOO  0x114 O       5   1   2  0x111 OOO
0x10E    TOBE  0x116  OTXOOOT  0x115 T       6   2   0  0x115 T..
    Y       Y  0x117    TOBEY  0x10E Y       4   1   1  0x10B EY.
0x101   -end-

See script/print-lzw-example.go for the code that generated that table, including three implementations of a simplified core of the LZW algorithm.

The first four columns (Code, Emits, Key, Value) are discussed above. The remaining columns are discussed below.

Prefix+Suffix Representation

A naive implementation of the key-value table, with the complete value stored contiguously for each key, would have quadratic worst case memory requirements, as each successive value can be up to 1 byte longer than the previous longest.

In contrast, the Pre1+Suf1 columns are a compact (3 bytes per key-value pair for a uint16 prefix and uint8 suffix) representation of the values. Each value is the concatenation of a variable length prefix (stored as a table key) and a 1-byte suffix. For example, the value for key 0x10C is “BE” + “O”, and “BE” is the value for the key 0x104.

Note that the Pre1 column is exactly the same as the Code column, shifted vertically by one row. Note also that the Suf1 column is exactly the same as the final byte of the Value column.

To reconstruct a code‘s value, start with the suffix and work backwards, walking the prefix chain producing one byte at a time. Either reconstruct the value in an intermediate buffer and then once fully reconstructed (and its length is known) copy it to the output buffer, or walk the chain twice (the first pass calculates the value length, the second pass reconstructs the value directly in the output buffer, skipping a copy) or store each value’s length along with its prefix and suffix (an additional uint16 stored per key-value pair) and reconstruct the value directly in the output buffer in one pass.

Some of the computations are slightly easier, avoiding multiple “+1”s and “-1”s in the program, if we store the “length minus 1” of each value, not the length. This is the LM1 column. Note that the LM1 of all literal codes are 0.

Multi-Byte Suffixes

It can be faster (albeit with higher memory requirements) to store up-to-Q-byte suffixes instead of 1-byte suffixes, for some positive integer Q: the quantum of bytes per read or write. In effect, changing Q (usually at compile time, not at run time) is a classic performance versus memory trade-off.

On modern CPUs that can do unaligned 32 or 64 bit loads and stores, a Q of 4 or 8 can perform very well. To keep this worked example short and simple, we use a Q of 3, meaning that we store 1, 2 or 3 byte suffixes per key, and the prefix length is a multiple of 3. For example:

The value “BE” is the "" prefix and the 2-byte “BE” suffix.
The value “TOB” is the "" prefix and the 3-byte “TOB” suffix.
The value “OTXOOOT” is the “OTXOOO” prefix and the 1-byte “T” suffix.

The number of steps (or equivalently, Q-byte chunks) in the prefix chain is (LM1 / Q), which is the “/Q” column in the example above.

The suffix length is ((LM1 % Q) + 1). The first part, (LM1 % Q), is the “%Q” column in the example above. When Q is 8, the “/Q” and “%Q” operations are simple bitwise operations, such as “>> 3” or “& 7”. When Q is 1 (a degenerate case), the “%Q” column is all zeroes and all suffixes are 1 byte long, which is unsurprisingly equivalent to the 1-byte suffix representation described above.

When copying the suffix to a buffer (either an intermediate buffer or a final buffer), it is often unnecessary to copy exactly suffix length bytes, provided that there is enough destination buffer space for all Q bytes. Even if we write excess bytes, decoding subsequent codes will overwrite the excess, or else we'll get an end code and the excess will be ignored. Again, on modern CPUs, it can be faster to write at least 7 bytes (i.e. exactly 8) than to write exactly 7.

Even so, the (LM1 % Q) value is still necessary to calculate where to start writing those Q-byte chunk. After that, reconstructing the value proceeds in the same way as in the 1-byte suffix algorithm. It just involves larger chunks. For example, the value for the key 0x116 is the suffix “T” preceded by the value for 0x115, which is the suffix “OOO” preceded by the value for 0x111, which is the suffix “OTX” with no prefix: the value is “OTX”+“OOO”+“T”.

The PreQ column is more complicated than the Pre1 column. It is no longer just a vertically shifted Code column, and can now contain “no prefix” values (or equivalently, the “/Q” column is 0). When inserting a new key-value pair, we only set the N‘th prefix equal to the previously seen code (the Code column shifted by one row) if the suffix for that previous code was a full Q bytes (or equivalently, if the current row’s “%Q” value is 0). Otherwise, we copy the previous code's prefix and suffix, and then extend the suffix with one more byte. For example, the PreQ+SufQ value for the 0x117 key is based on the PreQ+SufQ value (the same PreQ, an extended SufQ) of the previously seen code, namely 0x10E.

Varying Bit Widths

The description above has discussed an LZW stream as a stream of codes. In the actual wire format, those codes are packed into a stream of bytes, and a code can straddle byte boundaries.

Specifically, the next code in the stream takes B bits, where B is the smallest number of that bits that can distinguish all valid codes: those codes in the range [0, N]. If N is 0x101, B is 9. If N is 0x3FF, B is 10 bits. If N is 0x400, B is 11. Remember that each successive code increments N, up to a maximum N of 0xFFF, or 12 bits.

The literal width can also vary in the range [2..8] for LZW as used by GIF. In that case, B starts off as 1 more than the literal width. For example, with a literal width of 2, the four literal codes are 0x0, 0x1, 0x2 and 0x3, which are followed by a clear code of 0x4 and an end code of 0x5, so N is 0x5 and B is 3.

Early Change

Unfortunately, some implementations misinterpreted the algorithm, and incremented B before instead of after emitting a bit-packed code for N == ((1 << B) - 1). Accordingly, LZW as used in TIFF and sometimes in PDF (depending on the EarlyChange bit) increment B one iteration earlier. The maximum value of B is still 12.

This misinterpretation is unfixable due to backwards compatibility,

LSB and MSB

The bits are then packed into bytes in a predetermined ordering. For GIF, the bits are packed Least Significant Bits first. For PDF and TIFF, Most Significant Bits first. With a 8 bit literal width (and therefore a 9 bit initial code width), the two literal codes T (0x54) and O (0x4F) followed by an end code 0x101 would be encoded as 27 bits, or 4 bytes (with 5 padding bits, which are ignored but are conventionally set to zero).

In LSB first, it would be four bytes, 0x54 0x9E 0x04 0x04:

offset  xoffset ASCII   hex     binary
000000  0x0000  T       0x54    0b_0101_0100
000001  0x0001  .       0x9E    0b_1001_1110
000002  0x0002  .       0x04    0b_0000_0100
000003  0x0003  .       0x04    0b_0000_0100

These four bytes contain three 9-bit codes:

binary                              width   code
0b_...._...._...._...0_0101_0100     9      0x0054
0b_...._...._...._..00_1001_111.     9      0x004F
0b_...._...._...._.100_0000_01..     9      0x0101

In MSB first, it would again be four bytes, 0x2A 0x13 0xE0 0x20:

offset  xoffset ASCII   hex     binary
000000  0x0000  *       0x2A    0b_0010_1010
000001  0x0001  .       0x13    0b_0001_0011
000002  0x0002  .       0xE0    0b_1110_0000
000003  0x0003          0x20    0b_0010_0000

Again, these four bytes contain three 9-bit codes:

binary                              width   code
0b_0010_1010_0..._...._...._....     9      0x0054
0b_.001_0011_11.._...._...._....     9      0x004F
0b_..10_0000_001._...._...._....     9      0x0101

See std/gif/README.md for a complete worked example of LSB-first encoded LZW data, including a varying B width.

More Wire Format Examples

See test/data/artificial/gif-*.commentary.txt