| # LZMA |
| |
| Lempel–Ziv–Markov chain algorithm (LZMA) is a general purpose compression |
| format, using range coding and Lempel-Ziv style length-distance |
| back-references. Various descriptions of the file format are at: |
| |
| - [LZMA-SDK reference implementation](https://raw.githubusercontent.com/jljusten/LZMA-SDK/781863cdf592da3e97420f50de5dac056ad352a5/CPP/7zip/Bundles/LzmaSpec/LzmaSpec.cpp) |
| - [LZMA-SDK specification](https://raw.githubusercontent.com/jljusten/LZMA-SDK/781863cdf592da3e97420f50de5dac056ad352a5/DOC/lzma-specification.txt) |
| - [Lzip Manual - Stream Format](https://www.nongnu.org/lzip/manual/lzip_manual.html#Stream-format) |
| - [Wikipedia (LZMA)](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm) |
| |
| |
| ## Byms and Packets |
| |
| The compressed stream is a stream of "byms" (binary symbols, either 0 or 1; |
| yeah, it's an awkward term, but less ambiguous than using "bits" for both |
| compressed and uncompressed things). Byms are conceptually very similar to bits |
| (binary digits, ubiquitous in computing, a 0.5 KiB file holds 512 bytes, which |
| is 4096 bits), but are encoded using an adaptive binary range-coder so that N |
| byms usually compresses to fewer than N bits. |
| |
| The stream is divided into packets, each packet describing either a single |
| byte, or an LZ77 sequence with its length and distance implicitly or explicitly |
| encoded. Each part of each packet is modeled with independent contexts, so the |
| probability predictions for each bym are correlated with the values of that bym |
| (and related byms from the same field) in previous packets of the same type. |
| |
| There are 7 types of packets: |
| |
| Symbols Meaning |
| 0 ,literal LITERAL byte (8_byms) |
| 1,0 ,len ,dist MATCH |
| 1,1,0,0 SHORTREP len = 1, dist = Most Recently Used |
| 1,1,0,1 ,len LONGREP[0] dist = Most Recently Used |
| 1,1,1,0 ,len LONGREP[1] dist = 2nd Most Recently Used |
| 1,1,1,1,0 ,len LONGREP[2] dist = 3rd Most Recently Used |
| 1,1,1,1,1 ,len LONGREP[3] dist = 4th Most Recently Used |
| |
| The length encoding: |
| |
| Symbols Value |
| 0 ,3_byms Ranges from 2 ..= 9 |
| 1,0 ,3_byms Ranges from 10 ..= 17 |
| 1,1 ,8_byms Ranges from 18 ..= 273 |
| |
| The distance encoding starts with a 6-bym "Slot" value, which determines how |
| many further byms are needed. For small Slot values, there are up to 5 extra |
| byms, all context-encoded like the rest of LZMA. For large Slot values, there |
| are N extra byms. The first (N - 4) of them are encoded with a fixed 50% |
| probability and the remaining 4 byms are context-encoded. The largest encodable |
| distance-biased-by-1 is `0xFFFF_FFFF`, a (2 + 26 + 4) bit number. |
| |
| Slot (decimal) Distance (binary), biased by 1 Extra byms |
| 0 0 0 |
| 1 1 0 |
| 2 10 0 |
| 3 11 0 |
| 4 10 x 1 |
| 5 11 x 1 |
| 6 10 xx 2 |
| 7 11 xx 2 |
| 8 10 xxx 3 |
| 9 11 xxx 4 |
| 10 10 xxxx 5 |
| 11 11 xxxx 4 |
| 12 10 xxxxx 5 |
| 13 11 xxxxx 5 |
| 14 10 yy zzzz 2+4 |
| 15 11 yy zzzz 2+4 |
| 16 10 yyy zzzz 3+4 |
| 17 11 yyy zzzz 3+4 |
| 18 10 yyyy zzzz 4+4 |
| ... ... ... |
| 61 11 yyyyyyyyyyyyyyyyyyyyyyyyy zzzz 25+4 |
| 62 10 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz 26+4 |
| 63 11 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz 26+4 |
| |
| "xxxx" means up-to-5 byms are encoded with a "reverse" binary tree. Each Slot |
| has its own binary tree probabilities. |
| |
| "yyyy" means up-to-26 byms. Each has a fixed 50% probability. |
| |
| "zzzz" means four byms encoded with a "reverse" binary tree. All Slots use the |
| same "align" binary tree probabilities. |
| |
| "Biased by 1" means that slot=2 implies biasedDistance=2 so distance=3. A |
| biasedDistance of `0xFFFF_FFFF` means the End of Stream. Otherwise, the |
| corrected (unbiased) distance ranges in `[1 ..= 0xFFFF_FFFF]`. |
| |
| Distance is also capped by the dictionary size, stated in the LZMA header. |
| |
| |
| ## State |
| |
| In addition to the hundreds of contextual probabilities (see below), there's |
| also a "State" variable with 12 possible values. The initial State is zero. |
| Each packet causes a state transtition. The left-most column is the current |
| State, the other columns hold the next State, depending on the packet: |
| |
| State LITERAL MATCH LONGREP SHORTREP |
| 0 0 7 8 9 |
| 1 0 7 8 9 |
| 2 0 7 8 9 |
| 3 0 7 8 9 |
| 4 1 7 8 9 |
| 5 2 7 8 9 |
| 6 3 7 8 9 |
| 7 4 10 11 11 |
| 8 5 10 11 11 |
| 9 6 10 11 11 |
| 10 4 10 11 11 |
| 11 5 10 11 11 |
| |
| Equivalently, but looking backwards instead of forwards, each State embodies |
| the 1st, 2nd, 3rd and 4th MRUP (Most Recently Used Packet). The ? question mark |
| means every possible packet. Some States (2, 5 and 11) have two rows - multiple |
| possible histories could lead to that State: |
| |
| State 1MRUP 2MRUP 3MRUP 4MRUP |
| 0 LITERAL LITERAL LITERAL ? |
| 1 LITERAL LITERAL MATCH ? |
| 2a LITERAL LITERAL LONGREP ? |
| 2b LITERAL LITERAL SHORTREP NON-LITERAL |
| 3 LITERAL LITERAL SHORTREP LITERAL |
| 4 LITERAL MATCH ? ? |
| 5a LITERAL LONGREP ? ? |
| 5b LITERAL SHORTREP NON-LITERAL ? |
| 6 LITERAL SHORTREP LITERAL ? |
| 7 MATCH LITERAL ? ? |
| 8 LONGREP LITERAL ? ? |
| 9 SHORTREP LITERAL ? ? |
| 10 MATCH NON-LITERAL ? ? |
| 11a LONGREP NON-LITERAL ? ? |
| 11b SHORTREP NON-LITERAL ? ? |
| |
| A State's 12 possible values can also be aggregated into whether the State is |
| less than or at least 7: whether the last packet was a LITERAL or a NON-LITERAL |
| (a Lempel-Ziv copy-from-history). When decoding a LITERAL after a NON-LITERAL |
| (as opposed to a LITERAL after a LITERAL, i.e. having `decodeLiteral()` branch |
| on `State < 7`), there is information in the next historical byte after the |
| previous NON-LITERAL packet's copy-source. That byte almost always does not |
| equal the about-to-be-decoded literal byte, because if it did equal, the copy |
| would have been longer. |
| |
| |
| ## Algorithm Overview |
| |
| The core loop's body (with AlgOveNN line numbers for others to reference) looks |
| conceptually like this: |
| |
| AlgOve00 if decodeTheNextBym() == 0 { |
| AlgOve01 // Decode a LITERAL. |
| AlgOve02 literal = decodeLiteral() |
| AlgOve03 emitLiteral(literal) |
| AlgOve04 continue |
| AlgOve.. |
| AlgOve20 } else if decodeTheNextBym() == 0 { |
| AlgOve21 // Decode a MATCH. |
| AlgOve22 len = decodeLen() |
| AlgOve23 slot = decodeSlot(min(len-2, 3)) |
| AlgOve24 distBiasedBy1 = decodeDistBiasedBy1(slot) |
| AlgOve25 if distBiasedBy1 == 0xFFFF_FFFF { |
| AlgOve26 break // End of Stream. |
| AlgOve27 } |
| AlgOve28 mrud = (1 + distBiasedBy1, mrud[0], mrud[1], mrud[2]) |
| AlgOve29 goto labelDoTheLZCopy |
| AlgOve.. |
| AlgOve40 } else if decodeTheNextBym() == 0 { |
| AlgOve41 if decodeTheNextBym() == 0 { |
| AlgOve42 // Decode a SHORTREP. |
| AlgOve43 len = 1 |
| AlgOve44 goto labelDoTheLZCopy |
| AlgOve45 } |
| AlgOve46 // Decode a LONGREP[0]. |
| AlgOve.. |
| AlgOve60 } else if decodeTheNextBym() == 0 { |
| AlgOve61 // Decode a LONGREP[1]. |
| AlgOve62 mrud = (mrud[1], mrud[0], mrud[2], mrud[3]) |
| AlgOve63 } else if decodeTheNextBym() == 0 { |
| AlgOve64 // Decode a LONGREP[2]. |
| AlgOve65 mrud = (mrud[2], mrud[0], mrud[1], mrud[3]) |
| AlgOve66 } else { |
| AlgOve67 // Decode a LONGREP[3]. |
| AlgOve68 mrud = (mrud[3], mrud[0], mrud[1], mrud[2]) |
| AlgOve69 } |
| AlgOve.. |
| AlgOve80 len = decodeLen() |
| AlgOve.. |
| AlgOve90 labelDoTheLZCopy: |
| AlgOve91 // mrud[0] has been set to what will be (after the emitCopy |
| AlgOve92 // call) the most recently used distance. mrud[1] is the 2nd |
| AlgOve93 // most recently used, mrud[2] is the 3rd, mrud[3] is the 4th. |
| AlgOve94 emitCopy(len, mrud[0]) |
| |
| Conceptually means that some details have been elided. The `State` variable |
| needs updating after each iteration, but that's not shown. `decodeTheNextBym()` |
| also depends not only on the `State` but also which line number you're at. The |
| probabilities used for "is it a LITERAL or NON-LITERAL" are different from the |
| probabilities used for "is it a LONGREP[2] or LONGREP[3]". |
| |
| `decodeLen` expands to: |
| |
| AlgOveNN.0 if decodeTheNextBym() == 0 { |
| AlgOveNN.1 // Decode a low length. |
| AlgOveNN.2 len = decodeMultipleByms(3) + 2 |
| AlgOveNN.3 } else if decodeTheNextBym() == 0 { |
| AlgOveNN.4 // Decode a middle length. |
| AlgOveNN.5 len = decodeMultipleByms(3) + 10 |
| AlgOveNN.6 } else { |
| AlgOveNN.7 // Decode a high length. |
| AlgOveNN.8 len = decodeMultipleByms(8) + 18 |
| AlgOveNN.9 } |
| |
| |
| ## Range Coding and `decodeTheNextBym` |
| |
| The codec also tracks two `uint32` variables, called `bits` and `range`, that |
| represent an interval (a position and width) on the number line between zero |
| and one. The literal bitstream (the 0s and 1s in the file, read in MSB to LSB |
| order) is translated into a `bym` stream, where each `bym` is either 0 or 1. If |
| both `bym` values have equal (50%) probability then the `bym` stream is |
| basically the same as the bitstream. If symbols have unequal probability then |
| we can achieve compression: the literal bitstream can be shorter (in terms of |
| number of bits) then the `bym` stream. |
| |
| Specifically, if `prob` is the probability (scaled so that `(1 << 11) = 2048` |
| means 100% probability and so `1024` means 50% probability) that the next `bym` |
| is 0, and `range` is at least `(1 << 24)`, then (to over-simplify for now) |
| `decodeTheNextBym()` is: |
| |
| threshold = (range >> 11) * prob |
| if bits < threshold { |
| bym = 0 |
| } else { |
| bym = 1 |
| } |
| |
| The probability `prob` depends on the `State` variable as well as additional |
| context, such as the high bits of the most recent decoded byte and the low bits |
| of the number of decoded bytes. Implementations maintain an array with hundreds |
| of contextual probabilities (each a `uint16` in the range `1 ..= 2047`), |
| indexed by a `prob_index` variable (calculated from the `State` and other |
| factors). Expanding (marked by ¶) the over-simplified pseudo-code above: |
| |
| prob = probs_array[prob_index] // ¶ |
| threshold = (range >> 11) * prob |
| if bits < threshold { |
| bym = 0 |
| } else { |
| bym = 1 |
| } |
| |
| LZMA probabilties also adapt. Producing a 0 (or 1) `bym` increases (or |
| decreases) the relevant contextual probability: the prediction that, the next |
| time we're in the same context, that next `bym` is 0. Expanding again: |
| |
| prob = probs_array[prob_index] |
| threshold = (range >> 11) * prob |
| if bits < threshold { |
| bym = 0 |
| prob += (2048 - prob) >> 5 // ¶ |
| probs_array[prob_index] = prob // ¶ |
| } else { |
| bym = 1 |
| prob -= prob >> 5 // ¶ |
| probs_array[prob_index] = prob // ¶ |
| } |
| |
| We also need to update the `bits` and `range`. The `range` shrinks every time |
| we decode a `bym` and, if small enough, then grows by reading the next source |
| byte. Expanding one last time, `decodeTheNextBym()` (which also updates `bits`, |
| `range` and the relevant contextual probability) is: |
| |
| prob = probs_array[prob_index] |
| threshold = (range >> 11) * prob |
| if bits < threshold { |
| bym = 0 |
| range = threshold // ¶ |
| prob += (2048 - prob) >> 5 |
| probs_array[prob_index] = prob |
| } else { |
| bym = 1 |
| bits -= threshold // ¶ |
| range -= threshold // ¶ |
| prob -= prob >> 5 |
| probs_array[prob_index] = prob |
| } |
| if (range >> 24) == 0 { // ¶ |
| bits = (bits << 8) | src.read_byte() // ¶ |
| range <<= 8 // ¶ |
| } // ¶ |
| |
| |
| ## Binary Trees and `decodeMultipleByms` |
| |
| Decoding N byms (e.g. decoding a literal, a length, a slot or a distance) is |
| basically N sequential calls to `decodeTheNextBym`, using `((1 << N) - 1)` |
| different probabilties. This can be visualized as a balanced binary tree with |
| `((1 << N) - 1)` branch nodes and `(1 << N)` leaf nodes. Each branch node has |
| its own probability for taking one or the other of its two child nodes. |
| |
| - There's 1 probability for the first bym. |
| - There's 2 probabilities for the second bym. The one that is used depends on |
| the value of the first bym. |
| - There's 4 probabilities for the third bym. The one that is used depends on |
| the values of the first two byms. |
| - Etc. |
| - There's `(1 << (N - 1))` probabilities for the Nth bym. The one that is used |
| depends on the values of all previous byms. |
| - There's `(1 << N)` leaf nodes, one for each possible N-bit value. |
| |
| These probabilities can be flattened to a `(1 << N)` element array, whose first |
| element is unused (or available for repurposing, such as for what the reference |
| implementation calls the "choice" probabilities). For example, if N is 3 then |
| the 8-element array could be labeled as `u1223333`, where `u` is unused, `1` |
| marks the probability for the first bym, `2` marks the probabilities for the |
| second bym, etc. |
| |
| Algorithmically, we loop N times, walking the tree from the root to a leaf. The |
| final N-bit value is captured in the tree node number (after applying a bit |
| mask or, equivalently, a subtraction): |
| |
| tree_node = 1 |
| while tree_node < (1 << N) { |
| bym = decodeTheNextBym() // Using prob_index = tree_node. |
| tree_node = (tree_node << 1) | bym |
| } |
| tree_node &= ((1 << N) - 1) // Equivalent to "tree_node -= (1 << N)". |
| |
| That's a "forward" binary tree, where the bits occur in MSB to LSB order. A |
| "reverse" binary tree uses the opposite order. |