LZMA

Lempel–Ziv–Markov chain algorithm (LZMA) is a general purpose compression format, using range coding and Lempel-Ziv style length-distance back-references. Various descriptions of the file format are at:

Byms and Packets

The compressed stream is a stream of “byms” (binary symbols, either 0 or 1; yeah, it's an awkward term, but less ambiguous than using “bits” for both compressed and uncompressed things). Byms are conceptually very similar to bits (binary digits, ubiquitous in computing, a 0.5 KiB file holds 512 bytes, which is 4096 bits), but are encoded using an adaptive binary range-coder so that N byms usually compresses to fewer than N bits.

The stream is divided into packets, each packet describing either a single byte, or an LZ77 sequence with its length and distance implicitly or explicitly encoded. Each part of each packet is modeled with independent contexts, so the probability predictions for each bym are correlated with the values of that bym (and related byms from the same field) in previous packets of the same type.

There are 7 types of packets:

Symbols                  Meaning
0         ,literal       LITERAL byte (8_byms)
1,0       ,len ,dist     MATCH
1,1,0,0                  SHORTREP   len = 1, dist =     Most Recently Used
1,1,0,1   ,len           LONGREP[0]          dist =     Most Recently Used
1,1,1,0   ,len           LONGREP[1]          dist = 2nd Most Recently Used
1,1,1,1,0 ,len           LONGREP[2]          dist = 3rd Most Recently Used
1,1,1,1,1 ,len           LONGREP[3]          dist = 4th Most Recently Used

The length encoding:

Symbols         Value
0   ,3_byms     Ranges from  2 ..=   9
1,0 ,3_byms     Ranges from 10 ..=  17
1,1 ,8_byms     Ranges from 18 ..= 273

The distance encoding starts with a 6-bym “Slot” value, which determines how many further byms are needed. For small Slot values, there are up to 5 extra byms, all context-encoded like the rest of LZMA. For large Slot values, there are N extra byms. The first (N - 4) of them are encoded with a fixed 50% probability and the remaining 4 byms are context-encoded. The largest encodable distance-biased-by-1 is 0xFFFF_FFFF, a (2 + 26 + 4) bit number.

Slot (decimal)   Distance (binary), biased by 1        Extra byms
0                0                                      0
1                1                                      0
2                10                                     0
3                11                                     0
4                10 x                                   1
5                11 x                                   1
6                10 xx                                  2
7                11 xx                                  2
8                10 xxx                                 3
9                11 xxx                                 4
10               10 xxxx                                5
11               11 xxxx                                4
12               10 xxxxx                               5
13               11 xxxxx                               5
14               10 yy zzzz                             2+4
15               11 yy zzzz                             2+4
16               10 yyy zzzz                            3+4
17               11 yyy zzzz                            3+4
18               10 yyyy zzzz                           4+4
...              ...                                   ...
61               11 yyyyyyyyyyyyyyyyyyyyyyyyy zzzz     25+4
62               10 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz    26+4
63               11 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz    26+4

“xxxx” means up-to-5 byms are encoded with a “reverse” binary tree. Each Slot has its own binary tree probabilities.

“yyyy” means up-to-26 byms. Each has a fixed 50% probability.

“zzzz” means four byms encoded with a “reverse” binary tree. All Slots use the same “align” binary tree probabilities.

“Biased by 1” means that slot=2 implies biasedDistance=2 so distance=3. A biasedDistance of 0xFFFF_FFFF means the End of Stream. Otherwise, the corrected (unbiased) distance ranges in [1 ..= 0xFFFF_FFFF].

Distance is also capped by the dictionary size, stated in the LZMA header.

State

In addition to the hundreds of contextual probabilities (see below), there's also a “State” variable with 12 possible values. The initial State is zero. Each packet causes a state transtition. The left-most column is the current State, the other columns hold the next State, depending on the packet:

State     LITERAL   MATCH     LONGREP   SHORTREP
0         0         7         8         9
1         0         7         8         9
2         0         7         8         9
3         0         7         8         9
4         1         7         8         9
5         2         7         8         9
6         3         7         8         9
7         4         10        11        11
8         5         10        11        11
9         6         10        11        11
10        4         10        11        11
11        5         10        11        11

Equivalently, but looking backwards instead of forwards, each State embodies the 1st, 2nd, 3rd and 4th MRUP (Most Recently Used Packet). The ? question mark means every possible packet. Some States (2, 5 and 11) have two rows - multiple possible histories could lead to that State:

State     1MRUP         2MRUP         3MRUP         4MRUP
0         LITERAL       LITERAL       LITERAL       ?
1         LITERAL       LITERAL       MATCH         ?
2a        LITERAL       LITERAL       LONGREP       ?
2b        LITERAL       LITERAL       SHORTREP      NON-LITERAL
3         LITERAL       LITERAL       SHORTREP      LITERAL
4         LITERAL       MATCH         ?             ?
5a        LITERAL       LONGREP       ?             ?
5b        LITERAL       SHORTREP      NON-LITERAL   ?
6         LITERAL       SHORTREP      LITERAL       ?
7         MATCH         LITERAL       ?             ?
8         LONGREP       LITERAL       ?             ?
9         SHORTREP      LITERAL       ?             ?
10        MATCH         NON-LITERAL   ?             ?
11a       LONGREP       NON-LITERAL   ?             ?
11b       SHORTREP      NON-LITERAL   ?             ?

A State‘s 12 possible values can also be aggregated into whether the State is less than or at least 7: whether the last packet was a LITERAL or a NON-LITERAL (a Lempel-Ziv copy-from-history). When decoding a LITERAL after a NON-LITERAL (as opposed to a LITERAL after a LITERAL, i.e. having decodeLiteral() branch on State < 7), there is information in the next historical byte after the previous NON-LITERAL packet’s copy-source. That byte almost always does not equal the about-to-be-decoded literal byte, because if it did equal, the copy would have been longer.

Algorithm Overview

The core loop's body (with AlgOveNN line numbers for others to reference) looks conceptually like this:

AlgOve00    if decodeTheNextBym() == 0 {
AlgOve01        // Decode a LITERAL.
AlgOve02        literal = decodeLiteral()
AlgOve03        emitLiteral(literal)
AlgOve04        continue
AlgOve..
AlgOve20    } else if decodeTheNextBym() == 0 {
AlgOve21        // Decode a MATCH.
AlgOve22        len = decodeLen()
AlgOve23        slot = decodeSlot(min(len-2, 3))
AlgOve24        distBiasedBy1 = decodeDistBiasedBy1(slot)
AlgOve25        if distBiasedBy1 == 0xFFFF_FFFF {
AlgOve26            break  // End of Stream.
AlgOve27        }
AlgOve28        mrud = (1 + distBiasedBy1, mrud[0], mrud[1], mrud[2])
AlgOve29        goto labelDoTheLZCopy
AlgOve..
AlgOve40    } else if decodeTheNextBym() == 0 {
AlgOve41        if decodeTheNextBym() == 0 {
AlgOve42            // Decode a SHORTREP.
AlgOve43            len = 1
AlgOve44            goto labelDoTheLZCopy
AlgOve45        }
AlgOve46        // Decode a LONGREP[0].
AlgOve..
AlgOve60    } else if decodeTheNextBym() == 0 {
AlgOve61        // Decode a LONGREP[1].
AlgOve62        mrud = (mrud[1], mrud[0], mrud[2], mrud[3])
AlgOve63    } else if decodeTheNextBym() == 0 {
AlgOve64        // Decode a LONGREP[2].
AlgOve65        mrud = (mrud[2], mrud[0], mrud[1], mrud[3])
AlgOve66    } else {
AlgOve67        // Decode a LONGREP[3].
AlgOve68        mrud = (mrud[3], mrud[0], mrud[1], mrud[2])
AlgOve69    }
AlgOve..
AlgOve80    len = decodeLen()
AlgOve..
AlgOve90    labelDoTheLZCopy:
AlgOve91    // mrud[0] has been set to what will be (after the emitCopy
AlgOve92    // call) the most recently used distance. mrud[1] is the 2nd
AlgOve93    // most recently used, mrud[2] is the 3rd, mrud[3] is the 4th.
AlgOve94    emitCopy(len, mrud[0])

Conceptually means that some details have been elided. The State variable needs updating after each iteration, but that‘s not shown. decodeTheNextBym() also depends not only on the State but also which line number you’re at. The probabilities used for “is it a LITERAL or NON-LITERAL” are different from the probabilities used for “is it a LONGREP[2] or LONGREP[3]”.

decodeLen expands to:

AlgOveNN.0  if decodeTheNextBym() == 0 {
AlgOveNN.1      // Decode a low length.
AlgOveNN.2      len = decodeMultipleByms(3) + 2
AlgOveNN.3  } else if decodeTheNextBym() == 0 {
AlgOveNN.4      // Decode a middle length.
AlgOveNN.5      len = decodeMultipleByms(3) + 10
AlgOveNN.6  } else {
AlgOveNN.7      // Decode a high length.
AlgOveNN.8      len = decodeMultipleByms(8) + 18
AlgOveNN.9  }

Range Coding and `decodeTheNextBym`

The codec also tracks two uint32 variables, called bits and range, that represent an interval (a position and width) on the number line between zero and one. The literal bitstream (the 0s and 1s in the file, read in MSB to LSB order) is translated into a bym stream, where each bym is either 0 or 1. If both bym values have equal (50%) probability then the bym stream is basically the same as the bitstream. If symbols have unequal probability then we can achieve compression: the literal bitstream can be shorter (in terms of number of bits) then the bym stream.

Specifically, if prob is the probability (scaled so that (1 << 11) = 2048 means 100% probability and so 1024 means 50% probability) that the next bym is 0, and range is at least (1 << 24), then (to over-simplify for now) decodeTheNextBym() is:

threshold = (range >> 11) * prob
if bits < threshold {
    bym = 0
} else {
    bym = 1
}

The probability prob depends on the State variable as well as additional context, such as the high bits of the most recent decoded byte and the low bits of the number of decoded bytes. Implementations maintain an array with hundreds of contextual probabilities (each a uint16 in the range 1 ..= 2047), indexed by a prob_index variable (calculated from the State and other factors). Expanding (marked by ¶) the over-simplified pseudo-code above:

prob = probs_array[prob_index]             // ¶
threshold = (range >> 11) * prob
if bits < threshold {
    bym = 0
} else {
    bym = 1
}

LZMA probabilties also adapt. Producing a 0 (or 1) bym increases (or decreases) the relevant contextual probability: the prediction that, the next time we're in the same context, that next bym is 0. Expanding again:

prob = probs_array[prob_index]
threshold = (range >> 11) * prob
if bits < threshold {
    bym = 0
    prob += (2048 - prob) >> 5             // ¶
    probs_array[prob_index] = prob         // ¶
} else {
    bym = 1
    prob -= prob >> 5                      // ¶
    probs_array[prob_index] = prob         // ¶
}

We also need to update the bits and range. The range shrinks every time we decode a bym and, if small enough, then grows by reading the next source byte. Expanding one last time, decodeTheNextBym() (which also updates bits, range and the relevant contextual probability) is:

prob = probs_array[prob_index]
threshold = (range >> 11) * prob
if bits < threshold {
    bym = 0
    range = threshold                      // ¶
    prob += (2048 - prob) >> 5
    probs_array[prob_index] = prob
} else {
    bym = 1
    bits -= threshold                      // ¶
    range -= threshold                     // ¶
    prob -= prob >> 5
    probs_array[prob_index] = prob
}
if (range >> 24) == 0 {                    // ¶
    bits = (bits << 8) | src.read_byte()   // ¶
    range <<= 8                            // ¶
}                                          // ¶

Binary Trees and `decodeMultipleByms`

Decoding N byms (e.g. decoding a literal, a length, a slot or a distance) is basically N sequential calls to decodeTheNextBym, using ((1 << N) - 1) different probabilties. This can be visualized as a balanced binary tree with ((1 << N) - 1) branch nodes and (1 << N) leaf nodes. Each branch node has its own probability for taking one or the other of its two child nodes.

There's 1 probability for the first bym.
There's 2 probabilities for the second bym. The one that is used depends on the value of the first bym.
There's 4 probabilities for the third bym. The one that is used depends on the values of the first two byms.
Etc.
There's (1 << (N - 1)) probabilities for the Nth bym. The one that is used depends on the values of all previous byms.
There's (1 << N) leaf nodes, one for each possible N-bit value.

These probabilities can be flattened to a (1 << N) element array, whose first element is unused (or available for repurposing, such as for what the reference implementation calls the “choice” probabilities). For example, if N is 3 then the 8-element array could be labeled as u1223333, where u is unused, 1 marks the probability for the first bym, 2 marks the probabilities for the second bym, etc.

Algorithmically, we loop N times, walking the tree from the root to a leaf. The final N-bit value is captured in the tree node number (after applying a bit mask or, equivalently, a subtraction):

tree_node = 1
while tree_node < (1 << N) {
    bym = decodeTheNextBym()  // Using prob_index = tree_node.
    tree_node = (tree_node << 1) | bym
}
tree_node &= ((1 << N) - 1)  // Equivalent to "tree_node -= (1 << N)".

That's a “forward” binary tree, where the bits occur in MSB to LSB order. A “reverse” binary tree uses the opposite order.

LZMA

Byms and Packets

State

Algorithm Overview

Range Coding and decodeTheNextBym

Binary Trees and decodeMultipleByms

Further Reading

Range Coding and `decodeTheNextBym`

Binary Trees and `decodeMultipleByms`