| # Random Access Compression: RAC |
| |
| Status: Draft (as of September 2019). There is no compatibility guarantee yet. |
| |
| |
| ## Overview |
| |
| RAC is a *compressed* file format that allows *random access* (not just |
| sequential access) to the decompressed contents. In comparison to some other |
| popular compression formats, all four of the |
| [Zlib](https://www.ietf.org/rfc/rfc1950.txt), |
| [Brotli](https://www.ietf.org/rfc/rfc7932.txt), |
| [LZ4](https://github.com/lz4/lz4/blob/master/doc/lz4_Frame_format.md) and |
| [Zstandard](https://www.ietf.org/rfc/rfc8478.txt) specifications explicitly |
| contain the identical phrase: "the data format defined by this specification |
| does not attempt to allow random access to compressed data". |
| |
| Compression means that the derived file is typically smaller than the original |
| file. Random access means that, starting from the compressed file, it is |
| possible to reconstruct the half-open byte range `[di .. dj)` of the |
| decompressed file without always having to first decompress all of `[0 .. di)`. |
| |
| Conceptually, the decompressed file is partitioned into non-overlapping chunks. |
| Each compressed chunk can be decompressed independently (although possibly |
| sharing additional context, such as a LZ77 prefix dictionary). A RAC file also |
| contains a hierarchical index of those chunks. |
| |
| RAC is a container format, and while it supports common compression codecs like |
| Zlib, Brotli, LZ4 and Zstandard, it is not tied to any particular compression |
| codec. |
| |
| Non-goals for version 1 include: |
| |
| - Filesystem metadata like file names, file sizes and modification times. |
| - Multiple source files. |
| |
| There is the capability (see attributes and reserved `TTag`s, below) but no |
| promise to address these in a future RAC version. There might not be a need to, |
| as other designs such as [SquashFS](http://squashfs.sourceforge.net/) and EROFS |
| ([Extendable Read-Only File System](https://lkml.org/lkml/2018/5/31/306)) |
| already exist. |
| |
| Non-goals in general include: |
| |
| - Encryption. |
| - Streaming decompression. |
| |
| A companion document has further discussion of [RAC related |
| work](/doc/spec/rac-related-work.md). |
| |
| |
| ### Glossary |
| |
| - `CBias` is the delta added to a `CPointer` to produce a `COffset`. |
| - `DBias` is the delta added to a `DPointer` to produce a `DOffset`. |
| - `CFile` is the compressed file. |
| - `DFile` is the decompressed file. |
| - `CFileSize` is the size of the `CFile`. |
| - `DFileSize` is the size of the `DFile`. |
| - `COffset` is a byte offset in `CSpace`. |
| - `DOffset` is a byte offset in `DSpace`. |
| - `CPointer` is a relative `COffset`, prior to bias-correction. |
| - `DPointer` is a relative `DOffset`, prior to bias-correction. |
| - `CRange` is a `Range` in `CSpace`. |
| - `DRange` is a `Range` in `DSpace`. |
| - `CSpace` means that byte offsets refer to the `CFile`. |
| - `DSpace` means that byte offsets refer to the `DFile`. |
| |
| `Range` is a pair of byte offsets `[i .. j)`, in either `CSpace` or `DSpace`. |
| It is half-open, containing every byte offset `x` such that `(i <= x)` and `(x |
| < j)`. It is invalid to have `(i > j)`. The size of a `Range` equals `(j - i)`. |
| |
| All bytes are 8 bits and unless explicitly specified otherwise, all fixed-size |
| unsigned integers (e.g. `uint32_t`, `uint64_t`) are encoded little-endian. |
| Within those unsigned integers, bit 0 is the least significant bit and e.g. bit |
| 31 is the most significant bit of a `uint32_t`. |
| |
| The maximum supported `CFileSize` and the maximum supported `DFileSize` are the |
| same number: `0x0000_FFFF_FFFF_FFFF`, which is `((1 << 48) - 1)`. |
| |
| |
| ## File Structure |
| |
| A RAC file (the `CFile`) must be at least 32 bytes long, and start with the 3 |
| byte `Magic` (see below), so that no valid RAC file can also be e.g. a valid |
| JPEG file. |
| |
| The `CFile` contains a tree of `Node`s. Each `Node` is either a `Branch Node` |
| (pointing to between 1 and 255 child `Node`s) or a `Leaf Node`. There must be |
| at least one `Branch Node`, called the `Root Node`. Parsing a `CFile` requires |
| knowing the `CFileSize` in order to identify the `Root Node`, which is either |
| at the start or the end of the `CFile`. |
| |
| Each `Node` has a `DRange`, which can be empty. `Branch Node`s can also have |
| attributes: elements that aren't child `Node`s, which must have an empty |
| `DRange`. Empty elements contain metadata or other decompression context such |
| as a shared dictionary. |
| |
| Each `Leaf Node` also has 3 `CRange`s (`Primary`, `Secondary` and `Tertiary`), |
| any or all of which may be empty. The contents of the `CFile`, within those |
| `CRange`s, are decompressed according to the `Codec` (see below) to reconstruct |
| that part of the `DFile` within the `Leaf Node`'s `DRange`. |
| |
| |
| ## Branch Nodes |
| |
| A `Branch Node`'s encoding in the `CFile` has a variable byte size, between 32 |
| and 4096 inclusive, depending on its number of children. Specifically, it |
| occupies `((Arity * 16) + 16)` bytes, grouped into 8 byte segments (but not |
| necessarily 8 byte aligned), starting at a `COffset` called its `Branch |
| COffset`: |
| |
| +-+-+-+-+-+-+-+-+ |
| |0|1|2|3|4|5|6|7| |
| +-+-+-+-+-+-+-+-+ |
| |Magic|A|Che|0|T| Magic, Arity, Checksum, Reserved (0), TTag[0] |
| | DPtr[1] |0|T| DPtr[1], Reserved (0), TTag[1] |
| | DPtr[2] |0|T| DPtr[2], Reserved (0), TTag[2] |
| | ... |0|.| ..., Reserved (0), ... |
| | DPtr[A-2] |0|T| DPtr[Arity-2], Reserved (0), TTag[Arity-2] |
| | DPtr[A-1] |0|T| DPtr[Arity-1], Reserved (0), TTag[Arity-1] |
| | DPtr[A] |0|C| DPtr[Arity] a.k.a. DPtrMax, Reserved (0), CodecByte |
| | CPtr[0] |L|S| CPtr[0], CLen[0], STag[0] |
| | CPtr[1] |L|S| CPtr[1], CLen[1], STag[1] |
| | CPtr[2] |L|S| CPtr[2], CLen[2], STag[2] |
| | ... |L|.| ..., ..., ... |
| | CPtr[A-2] |L|S| CPtr[Arity-2], CLen[Arity-2], STag[Arity-2] |
| | CPtr[A-1] |L|S| CPtr[Arity-1], CLen[Arity-1], STag[Arity-1] |
| | CPtr[A] |V|A| CPtr[Arity] a.k.a. CPtrMax, Version, Arity |
| +-+-+-+-+-+-+-+-+ |
| |
| See the "Examples" section below for example RAC files and, in particular, |
| example `Branch Node`s. |
| |
| For the `(XPtr | Other6 | Other7)` 8 byte fields, the `XPtr` occupies the low |
| 48 bits (as a little-endian `uint64_t`) and the `Other` fields occupy the high |
| 16 bits. |
| |
| The `CPtr` and `DPtr` values are what is explicitly written in the `CFile`'s |
| bytes. These are added to a `Branch Node`'s implicit `Branch CBias` and `Branch |
| DBias` values to give the implicit `COff` and `DOff` values: `COff[i]` and |
| `DOff[i]` are defined to be `(Branch_CBias + CPtr[i])` and `(Branch_DBias + |
| DPtr[i])`. |
| |
| `CPtrMax` is another name for `CPtr[Arity]`, and `COffMax` is defined to be |
| `(Branch_CBias + CPtrMax)`. Likewise for `DPtrMax` and `DOffMax`. |
| |
| The `DPtr[0]` value is implicit, and always equals zero, so that `DOff[0]` |
| always equals the `Branch DBias`. |
| |
| - For the `Root Node`, the `DPtrMax` also sets the `DFileSize`. The `Branch |
| CBias` and `Branch DBias` are both zero. The `Branch COffset` is determined |
| by the "Root Node" section below. |
| - For a child `Branch Node`, the `Branch COffset`, `Branch CBias` and `Branch |
| DBias` are given by the parent `Branch Node`. See the "Search Within a |
| Branch Node" section below. |
| |
| |
| ### Magic |
| |
| `Magic` is the three bytes `"\x72\xC3\x63"`, which is invalid UTF-8 but is |
| `"rÃc"` in ISO 8859-1. The tilde isn't particularly meaningful, other than |
| `"rÃc"` being a nonsensical word (with nonsensical capitalization) that is |
| unlikely to appear in other files. |
| |
| Every `Branch Node` must start with these `Magic` bytes, not just a `Branch |
| Node` positioned at the start of the `CFile`. |
| |
| |
| ### Arity |
| |
| `Arity` is the `Branch Node`'s number of elements: the number of child `Node`s |
| plus the number of attributes. Having zero children is invalid. |
| |
| The `Arity` byte is given twice: the fourth byte and the final byte of the |
| `Branch Node`. The two values must match. |
| |
| The repetition lets a RAC reader determine the size of the `Branch Node` data |
| (as the size depends on the `Arity`), given either its start or its end offset |
| in `CSpace`. For almost all `Branch Node`s, we will know its start offset (its |
| `Branch COffset`), but for a `Root Node` at the end of a `CFile`, we will only |
| know its end offset. |
| |
| |
| ### Checksum |
| |
| `Checksum` is a checksum of the `Branch Node`'s bytes. It is not a checksum of |
| the `CFile` or `DFile` contents pointed to by a `Branch Node`. Content |
| checksums are a `Codec`-specific consideration. |
| |
| The little-endian `uint16_t` `Checksum` value is the low 16 bits XOR'ed with |
| the high 16 bits of the `uint32_t` CRC-32 IEEE checksum of the `((Arity * 16) + |
| 10)` bytes immediately after the `Checksum`. The 4 bytes immediately before the |
| `Checksum` are not considered: the `Magic` bytes have only one valid value and |
| the `Arity` byte near the start is replicated by the `Arity` byte at the end. |
| |
| |
| ### Reserved (0) |
| |
| The `Reserved (0)` bytes must have the value `0x00`. |
| |
| |
| ### COffs and DOffs, STags and TTags |
| |
| For every `a` in the half-open range `[0 .. Arity)`, the `a`'th element has two |
| tags, `STag[a]` and `TTag[a]`, and a `DRange` of `[DOff[a] .. DOff[a+1])`. The |
| `DOff` values must be non-decreasing: see the "Branch Node Validation" section |
| below. |
| |
| A `TTag[a]` of `0xFE` means that that element is a `Branch Node` child. |
| |
| A `TTag[a]` of `0xFD` means that there is no child, but is instead a `Codec |
| Element` attribute, whose `DRange` must be empty, and the rest of this section |
| does not apply: the `STag` is ignored. |
| |
| A `TTag[a]` in the half-open range `[0xC0 .. 0xFD)` is reserved. Otherwise, the |
| element is a `Leaf Node` child. |
| |
| A child `Branch Node`'s `SubBranch COffset` is defined to be `COff[a]`. Its |
| `SubBranch DBias` and `SubBranch DOffMax` are defined to be `DOff[a]` and |
| `DOff[a+1]`. |
| |
| - When `(STag[a] < Arity)`, it is a `CBiasing Branch Node`. The `SubBranch |
| CBias` is defined to be `(Branch_CBias + CPtr[STag[a]])`. This expression |
| is equivalent to `COff[STag[a]]`. |
| - When `(STag[a] >= Arity)`, it is a `CNeutral Branch Node`. The `SubBranch |
| CBias` is defined to be `(Branch_CBias)`. |
| |
| A child `Leaf Node`'s `STag[a]` and `TTag[a]` values are also called its `Leaf |
| STag` and `Leaf TTag`. It also has: |
| |
| - A `Primary CRange`, equal to `MakeCRange(a)`. |
| - A `Secondary CRange`, equal to `MakeCRange(STag[a])`. |
| - A `Tertiary CRange`, equal to `MakeCRange(TTag[a])`. |
| |
| The `MakeCRange(i)` function defines a `CRange`. If `(i >= Arity)` then that |
| `CRange` is the empty range `[COffMax .. COffMax)`. Otherwise, the lower bound |
| is `COff[i]` and the upper bound is: |
| |
| - `COffMax` when `CLen[i]` is zero. |
| - The minimum of `COffMax` and `(COff[i] + (CLen[i] * 1024))` when `CLen[i]` |
| is non-zero. |
| |
| In other words, the `COffMax` value clamps the `CRange` upper bound. The `CLen` |
| value, if non-zero, combines with the `COff` value to apply another clamp. The |
| `CLen` is given in units of 1024 bytes, but the `(COff[i] + (CLen[i] * 1024))` |
| value is not necessarily quantized to 1024 byte boundaries. |
| |
| Note that, since `Arity` is at most 255, an `STag[a]` of `0xFF` always results |
| in a `CNeutral Branch Node` or an empty `Secondary CRange`. Likewise, a |
| `TTag[a]` of `0xFF` always results in an empty `Tertiary CRange`. |
| |
| |
| ### COffMax |
| |
| `COffMax` is an inclusive upper bound on every `COff` in a `Branch Node` and in |
| its descendent `Branch Node`s. A child `Branch Node` must not have a larger |
| `COffMax` than the parent `Branch Node`'s `COffMax`, and the `Root Node`'s |
| `COffMax` must equal the `CFileSize`. See the "Branch Node Validation" section |
| below. |
| |
| A RAC file can therefore be incrementally modified, if the RAC writer only |
| appends new `CFile` bytes and does not re-write existing `CFile` bytes, so that |
| the `CFileSize` increases. Even if the old (smaller) RAC file's `Root Node` was |
| at the `CFile` start, the new (larger) `CFileSize` means that those starting |
| bytes are an obsolete `Root Node` (but still a valid `Branch Node`). The new |
| `Root Node` is therefore located at the end of the new RAC file. |
| |
| Concatenating RAC files (concatenating in `DSpace`) involves concatenating the |
| RAC files in `CSpace` and then appending a new `Root Node` with `CBiasing |
| Branch Node`s pointing to each source RAC file's `Root Node`. |
| |
| |
| ### Version |
| |
| `Version` must have the value `0x01`, indicating version 1 of the RAC format. |
| The `0x00` value is reserved, although future editions may use other positive |
| values. |
| |
| |
| ### Codec |
| |
| `Codec`s define specializations of RAC, such as "RAC + Zlib" or "RAC + Brotli". |
| It is valid for a "RAC + Zstandard" only decoder to reject a "RAC + Brotli" |
| file, even if it is a valid RAC file. Recall that RAC is just a container, and |
| not tied to any particular compression codec. |
| |
| There are two categories of `Codec`s: the high `0x80` bit of the `Codec Byte` |
| being `0` or `1` denotes a `Short` or `Long` codec respectively. `Short Codec`s |
| are represented by a single byte (the `Codec Byte`). `Long Codec`s use that |
| `Codec Byte` to locate 7 additional bytes: a `Codec Element`. |
| |
| For both `Short Codec`s and `Long Codec`s, the second-highest `0x40` bit of the |
| `Codec Byte` is the `Mix Bit`. A `Mix Bit` of `0` means that all of the |
| `Node`'s descendents have exactly the same `Codec` (not just the same `Codec |
| Byte`; `Short Codec`s and `Long Codec`s are not considered exactly the same |
| even if they represent the same compression algorithm). A `Mix Bit` of `1` |
| means that descendents may have a different `Codec`. |
| |
| For `Short Codec`s, the remaining low 6 bits correspond to a specific |
| compression algorithm: |
| |
| - `0x00` means "RAC + Zeroes". |
| - `0x01` means "RAC + Zlib". |
| - `0x02` means "RAC + LZ4". |
| - `0x03` means "RAC + Zstandard". |
| - All other values are reserved. |
| |
| For `Long Codec`s, the remaining low 6 bits of the `Codec Byte` define a number |
| `c64`. The lowest index `i` out of `(c64 + (64 * 0))`, `(c64 + (64 * 1))`, |
| `(c64 + (64 * 2))` and `(c64 + (64 * 3))` such that `TTag[i]` is `0xFD` locates |
| the 7 bytes that identifies the `Codec`. The location is what would otherwise |
| occupy the `i`th element's `CPtr | CLen` space. It is invalid for no such `i` |
| to exist. |
| |
| `Long Codec` values are not reserved by this specification, other than 7 NUL |
| bytes also means "RAC + Zeroes". Users may define their own values for |
| arbitrary compression algorithms. Maintaining a centralized registry mapping |
| values to algorithms is out of scope of this specification, although we suggest |
| a convention that the 7 bytes form a human-readable (ASCII) string, padded with |
| trailing NUL bytes. For example, a hypothetical "Middle Out version 2" |
| compression algorithm that typically used the ".mdo2" file extension might be |
| represented by the 7 bytes `"mdo2\x00\x00\x00"`. |
| |
| |
| ### Branch Node Validation |
| |
| The first time that a RAC reader visits any particular `Branch Node`, it must |
| check that the `Magic` matches, the two `Arity` values match and are non-zero, |
| there is at least one child `Node` (not just non-`Node` attributes), the |
| computed checksum matches the listed `Checksum` and that the RAC reader accepts |
| the `Version`. For `Long Codec`s, there must exist an `0xFD` `TTag` as per the |
| previous section. |
| |
| It must also check that all of its `DOff` values are sorted: `(DOff[a] <= |
| DOff[a+1])` for every `a` in the half-open range `[0 .. Arity)`. By induction, |
| this means that all of its `DOff` values do not exceed `DOffMax`, and again by |
| induction, therefore do not exceed `DFileSize`. |
| |
| It must also check that, other than `Codec Element` attributes, all of its |
| `COff` values do not exceed `COffMax` (and again by induction, therefore do not |
| exceed `CFileSize`). Other than that, `COff` values do not have to be sorted: |
| successive `Node`s (in `DSpace`) can be out of order (in `CSpace`), allowing |
| for incrementally modified RAC files. |
| |
| For the `Root Node`, its `COffMax` must equal the `CFileSize`. Recall that |
| parsing a `CFile` requires knowing the `CFileSize`, and also that a `Root |
| Node`'s `Branch CBias` is zero, so its `COffMax` equals its `CPtrMax`. |
| |
| For a child `Branch Node`, if the parent's `Codec` does not have the `Mix Bit` |
| set then the child's `Codec` must equal its parent's. Furthermore, its |
| `Version` must be less than or equal to its parent's `Version`, its `COffMax` |
| must be less than or equal to its parent's `COffMax`, and its `DOffMax` must |
| equal its parent's `SubBranch DOffMax`. The `DOffMax` condition is equivalent |
| to checking that the parent and child agree on the child's size in `DSpace`. |
| The parent states that it is its `(DPtr[a+1] - DPtr[a])` and the child states |
| that it is its `DPtrMax`. |
| |
| One conservative way to check `Branch Node`s' validity on first visit is to |
| check them on every visit, as validating any particular `Branch Node` is |
| idempotent, but other ways are acceptable. |
| |
| |
| ## Root Node |
| |
| The `Root Node` might be at the start of the `CFile`, as this might optimize |
| alignment of `Branch Node`s and of `CRange`s. All `Branch Node`s' sizes are |
| multiples of 16 bytes, and a maximal `Branch Node` is exactly 4096 bytes. |
| |
| The `Root Node` might be at the end of the `CFile`, as this allows one-pass |
| (streaming) encoding of a RAC file. It also allows appending to, concatenating |
| or incrementally modifying existing RAC files relatively cheaply. |
| |
| To find the `Root Node`, first look for a valid `Root Node` at the `CFile` |
| start. If and only if that fails, look at the `CFile` end. If that also fails, |
| it is not a valid RAC file. |
| |
| |
| ### Root Node at the CFile Start |
| |
| The fourth byte of the `CFile` gives the `Arity`, assuming the `Root Node` is |
| at the `CFile` start. If it is zero, fail over to the `CFile` end. A RAC writer |
| that does not want to place the `Root Node` at the `CFile` start should |
| intentionally write a zero to the fourth byte. |
| |
| Otherwise, that `Arity` defines the size in bytes of any starting `Root Node`: |
| `((Arity * 16) + 16)`. If the `CFileSize` is less than this size, fail over to |
| the `CFile` end. |
| |
| If those starting bytes form a valid `Root Node` (as per the "Branch Node |
| Validation" section), including having a `CPtrMax` equal to the `CFileSize`, |
| then we have indeed found our `Root Node`: its `Branch COffset` is zero. |
| Otherwise, fail over to the `CFile` end. |
| |
| |
| ### Root Node at the CFile End |
| |
| If there is no valid `Root Node` at the `CFile` start then the last byte of the |
| `CFile` gives the `Root Node`'s `Arity`. This does not necessarily need to |
| match the fourth byte of the `CFile`. |
| |
| As before, that `Arity` defines the size in bytes of any ending `Root Node`: |
| `((Arity * 16) + 16)`. If the `CFileSize` is less than this size, it is not a |
| valid RAC file. |
| |
| If those ending bytes form a valid `Root Node` (as per the "Branch Node |
| Validation" section), including having a `CPtrMax` equal to the `CFileSize`, |
| then we have indeed found our `Root Node`: its `Branch COffset` is the |
| `CFileSize` minus the size implied by the `Arity`. Otherwise, it is not a valid |
| RAC file. |
| |
| |
| ## DRange Reconstruction |
| |
| To reconstruct the `DRange` `[di .. dj)`, if that `DRange` is empty then the |
| request is trivially satisfied. |
| |
| Otherwise, if `(dj > DFileSize)` then reject the request. |
| |
| Otherwise, start at the `Root Node` and continue to the next section to find |
| the `Leaf Node` containing the `DOffset` `di`. |
| |
| |
| ### Search Within a Branch Node |
| |
| Load (and validate) the `Branch Node` given its `Branch COffset`, `Branch |
| CBias` and `Branch DBias`. |
| |
| Find the largest child index `a` such that `(a < Arity)` and `(DOff[a] <= di)` |
| and `(DOff[a+1] > di)`, then examine `TTag[a]` to see if the child is a `Leaf |
| Node`. If so, skip to the next section. |
| |
| For a `Branch Node` child, let `CRemaining` equal this `Branch Node`'s (the |
| parents') `COffMax` minus the `SubBranch COffset`. It invalid for `CRemaining` |
| to be less than 4, or to be less than the child's size implied by the child's |
| `Arity` byte at a `COffset` equal to `(SubBranch_COffset + 3)`. |
| |
| The `SubBranch COffset`, `SubBranch CBias` and `SubBranch DBias` from the |
| parent `Branch Node` become the `Branch COffset`, `Branch CBias` and `Branch |
| DBias` of the child `Branch Node`. In order to rule out infinite loops, at |
| least one of these two conditions must hold: |
| |
| - The child's `Branch COffset` is less than the parent's `Branch COffset`. |
| - The child's `DPtrMax` is less than the parent's `DPtrMax`. |
| |
| It is valid for one of those conditions to hold between a parent-child pair and |
| the other condition to hold between the next parent-child pair. |
| |
| Repeat this "Search Within a Branch Node" section with the child `Branch Node`. |
| |
| |
| ### Decompressing a Leaf Node |
| |
| If a `Leaf Node`'s `DRange` is empty, decompression is a no-op and skip the |
| rest of this section. Specifically, a low-level library that iterates over a |
| RAC file's chunks, without actually performing decompression, should skip over |
| empty chunks instead of yielding them to its caller. |
| |
| Otherwise, decompression combines the `Primary CRange`, `Secondary CRange` and |
| `Tertiary CRange` slices of the `CFile`, and the `Leaf STag` and `Leaf TTag` |
| values, in a `Codec`-specific way to produce decompressed data. |
| |
| There are two general principles, although specific `Codec`s can overrule: |
| |
| - The `Codec` may produce fewer bytes than the `DRange` size. In that case, |
| the remaining bytes (in `DSpace`) are set to NUL (`memset` to zero). |
| - The `Codec` may consume fewer bytes than each `CRange` size, and the |
| compressed data typically contains an explicit "end of data" marker. In |
| that case, the remaining bytes (in `CSpace`) are ignored. Padding allows |
| `COff` values to optionally be aligned. |
| |
| It is invalid to produce more bytes than the `DRange` size or to consume more |
| bytes than each `CRange` size. |
| |
| |
| ### Continuing the Reconstruction |
| |
| If decompressing that `Leaf Node` did not fully reconstruct `[di .. dj)`, |
| advance through the `Node` tree in depth first search order, decompressing |
| every `Leaf Node` along the way, until we have gone up to or beyond `dj`. |
| |
| During that traversal, `Node`s with an empty `DRange` should be skipped, even |
| if they are `Branch Node`s. |
| |
| |
| # Specific Codecs |
| |
| |
| ## Common Dictionary Format |
| |
| Unless otherwise noted, codecs use this common dictionary format. |
| |
| If a `Leaf Node`'s `Secondary CRange` is empty then there is no dictionary. |
| Otherwise, the `Secondary CRange` must be at least 8 bytes long: |
| |
| - 4 byte little-endian `uint32_t` `Dictionary Length`. The high 2 bits are |
| reserved and must be zero. The maximum (inclusive) `Dictionary Length` is |
| therefore `((1 << 30) - 1)`, or `1_073_741_823` bytes. |
| - `Dictionary Length` bytes `Dictionary`. |
| - 4 byte little-endian `uint32_t` `Dictionary Checksum`, the Checksum` is |
| CRC-32 IEEE checksum over the `Dictionary`'s bytes (excluding both the 4 |
| byte prefix and the 4 byte suffix). |
| - Padding (ignored). |
| |
| The `Leaf TTag` must be `0xFF`. All other `Leaf TTag` values (below `0xC0`) are |
| reserved. The empty `Tertiary CRange` is ignored. The `Leaf STag` value is also |
| ignored, other than deriving the `Secondary CRange`. |
| |
| |
| ## RAC + Zeroes |
| |
| The `CRange`s are ignored. The `DRange` is filled with NUL bytes (`memset` to |
| zero). |
| |
| |
| ## RAC + Zlib |
| |
| The `CFile` data in the `Leaf Node`'s `Primary CRange` is decompressed as Zlib |
| (RFC 1950), possibly referencing a LZ77 prefix dictionary (what the RFC calls a |
| "preset dictionary") wrapped in RAC's common dictionary format, described |
| above. |
| |
| |
| ## RAC + LZ4 |
| |
| TODO. |
| |
| |
| ## RAC + Zstandard |
| |
| The `CFile` data in the `Leaf Node`'s `Primary CRange` is decompressed as |
| Zstandard (RFC 8478), possibly referencing a dictionary wrapped in RAC's common |
| dictionary format, described above. After unwrapping, the dictionary's bytes |
| can be either a "raw" or "trained" dictionary, as per RFC 8478 section 5. |
| |
| |
| # Examples |
| |
| These examples display RAC files in the format of the `hexdump -C` command line |
| tool. They are deliberately very short, for ease of understanding. Realistic |
| RAC files, with larger chunk sizes, would typically exhibit much better |
| compression ratios. |
| |
| |
| ## Zlib Example (Root Node at End) |
| |
| The first example is relatively simple. The root node (located at the CFile |
| end) only has one child: a leaf node whose compressed contents starts at |
| position `0x04`. Decompressing that chunk produces the 6 bytes |
| ["More!\n"](https://play.golang.org/p/ZSYmQLgv1RR). |
| |
| 00000000 72 c3 63 00 78 9c 01 06 00 f9 ff 4d 6f 72 65 21 |r.c.x......More!| |
| 00000010 0a 07 42 01 bf 72 c3 63 01 65 a9 00 ff 06 00 00 |..B..r.c.e......| |
| 00000020 00 00 00 00 01 04 00 00 00 00 00 01 ff 35 00 00 |.............5..| |
| 00000030 00 00 00 01 01 |.....| |
| |
| |
| ## Zlib Example (Root Node at Start) |
| |
| The second example consists of a root node with four children: one metadata |
| node (a shared dictionary) and three data nodes. The shared dictionary, |
| "\x20sheep.\n" is `0x00000008` bytes long and its CRC-32 IEEE checksum is |
| `0x477A8DD0`. The third child's (the second data node)'s compressed contents |
| starts at position `0x75`. Decompressing that chunk, together with that shared |
| dictionary, produces the 11 bytes ["Two |
| sheep.\n"](https://play.golang.org/p/Jh9Wyp6PLID). The complete decoding of all |
| three data chunks is "One sheep.\nTwo sheep.\nThree sheep.\n". |
| |
| 00000000 72 c3 63 04 37 39 00 ff 00 00 00 00 00 00 00 ff |r.c.79..........| |
| 00000010 0b 00 00 00 00 00 00 ff 16 00 00 00 00 00 00 ff |................| |
| 00000020 23 00 00 00 00 00 00 01 50 00 00 00 00 00 01 ff |#.......P.......| |
| 00000030 60 00 00 00 00 00 01 00 75 00 00 00 00 00 01 00 |`.......u.......| |
| 00000040 8a 00 00 00 00 00 01 00 a1 00 00 00 00 00 01 04 |................| |
| 00000050 08 00 00 00 20 73 68 65 65 70 2e 0a d0 8d 7a 47 |.... sheep....zG| |
| 00000060 78 f9 0b e0 02 6e f2 cf 4b 85 31 01 01 00 00 ff |x....n..K.1.....| |
| 00000070 ff 17 21 03 90 78 f9 0b e0 02 6e 0a 29 cf 87 31 |..!..x....n.)..1| |
| 00000080 01 01 00 00 ff ff 18 0c 03 a8 78 f9 0b e0 02 6e |..........x....n| |
| 00000090 0a c9 28 4a 4d 85 71 00 01 00 00 ff ff 21 6e 04 |..(JM.q......!n.| |
| 000000a0 66 |f| |
| |
| |
| ## Zlib Example (Concatenation) |
| |
| The third example demonstrates concatenating two RAC files: the two examples |
| above. The decompressed form of the resultant RAC file is the concatenation of |
| the two decompressed forms: "One sheep.\nTwo sheep.\nThree sheep.\nMore!\n". |
| Its 41 decompressed bytes consists of the "sheep" RAC file's 35 bytes and then |
| the "more" RAC file's 6 bytes. In hexadecimal, `0x29 = 0x23 + 0x06`, and the |
| `0x29` and `0x23` numbers can be seen in the compressed form's bytes. |
| |
| The compressed form (what's shown in `hexdump -C` format below) is the |
| concatenation of the two compressed forms, plus a new root node (64 bytes |
| starting at position `0xD6`). Even though the overall RAC file starts with the |
| "sheep" RAC file, whose root node was at its start, those opening bytes are no |
| longer a valid root node for the larger file. |
| |
| That new root node has 3 children: 1 metadata node and 2 branch nodes. The |
| metadata node (one whose `DRange` is empty) is required because one of the |
| original RAC files' root node is not located at its start. Walking to that |
| child branch node needs two `COffset` values: one for the embedded RAC file's |
| start and one for the embedded RAC file's root node. |
| |
| 00000000 72 c3 63 04 37 39 00 ff 00 00 00 00 00 00 00 ff |r.c.79..........| |
| 00000010 0b 00 00 00 00 00 00 ff 16 00 00 00 00 00 00 ff |................| |
| 00000020 23 00 00 00 00 00 00 01 50 00 00 00 00 00 01 ff |#.......P.......| |
| 00000030 60 00 00 00 00 00 01 00 75 00 00 00 00 00 01 00 |`.......u.......| |
| 00000040 8a 00 00 00 00 00 01 00 a1 00 00 00 00 00 01 04 |................| |
| 00000050 08 00 00 00 20 73 68 65 65 70 2e 0a d0 8d 7a 47 |.... sheep....zG| |
| 00000060 78 f9 0b e0 02 6e f2 cf 4b 85 31 01 01 00 00 ff |x....n..K.1.....| |
| 00000070 ff 17 21 03 90 78 f9 0b e0 02 6e 0a 29 cf 87 31 |..!..x....n.)..1| |
| 00000080 01 01 00 00 ff ff 18 0c 03 a8 78 f9 0b e0 02 6e |..........x....n| |
| 00000090 0a c9 28 4a 4d 85 71 00 01 00 00 ff ff 21 6e 04 |..(JM.q......!n.| |
| 000000a0 66 72 c3 63 00 78 9c 01 06 00 f9 ff 4d 6f 72 65 |fr.c.x......More| |
| 000000b0 21 0a 07 42 01 bf 72 c3 63 01 65 a9 00 ff 06 00 |!..B..r.c.e.....| |
| 000000c0 00 00 00 00 00 01 04 00 00 00 00 00 01 ff 35 00 |..............5.| |
| 000000d0 00 00 00 00 01 01 72 c3 63 03 83 16 00 ff 00 00 |......r.c.......| |
| 000000e0 00 00 00 00 00 fe 23 00 00 00 00 00 00 fe 29 00 |......#.......).| |
| 000000f0 00 00 00 00 00 01 a1 00 00 00 00 00 00 ff 00 00 |................| |
| 00000100 00 00 00 00 04 01 b6 00 00 00 00 00 04 00 16 01 |................| |
| 00000110 00 00 00 00 01 03 |......| |
| |
| |
| Focusing on that new root node: |
| |
| 000000d0 72 c3 63 03 83 16 00 ff 00 00 |
| 000000e0 00 00 00 00 00 fe 23 00 00 00 00 00 00 fe 29 00 |
| 000000f0 00 00 00 00 00 01 a1 00 00 00 00 00 00 ff 00 00 |
| 00000100 00 00 00 00 04 01 b6 00 00 00 00 00 04 00 16 01 |
| 00000110 00 00 00 00 01 03 |
| |
| Re-formatting to highlight the groups-of-8-bytes structure and reprise the |
| "Branch Nodes" section's diagram: |
| |
| 000000d6 72 c3 63 03 83 16 00 ff |Magic|A|Che|0|T| |
| 000000de 00 00 00 00 00 00 00 fe | DPtr[1] |0|T| |
| 000000e6 23 00 00 00 00 00 00 fe | DPtr[2] |0|T| |
| 000000ee 29 00 00 00 00 00 00 01 | DPtr[A] |0|C| |
| 000000f6 a1 00 00 00 00 00 00 ff | CPtr[0] |L|S| |
| 000000fe 00 00 00 00 00 00 04 01 | CPtr[1] |L|S| |
| 00000106 b6 00 00 00 00 00 04 00 | CPtr[2] |L|S| |
| 0000010e 16 01 00 00 00 00 01 03 | CPtr[A] |V|A| |
| |
| Its `CodecByte` (and therefore its `Short Codec`) is `0x01`, "RAC + Zlib", its |
| Version is `0x01` and its `Arity` is `0x03`. The `DPtr` values are `0x0000` |
| (implicit), `0x0000`, `0x0023` and `0x0029`. The `CPtr` values are `0x00A1`, |
| `0x0000`, `0x00B6` and `0x0116` (the size of the whole RAC file). Note that the |
| `CPtr` values are not sorted. The last two children's `TTag`s are `0xFE` and |
| `0xFE` and their `STag`s are `0x01` and `0x00`, which means that they are both |
| `CBiasing Branch Node`s. |
| |
| |
| # Reference Implementation |
| |
| C programming language libraries: |
| |
| - TODO. Follow [this GitHub issue](https://github.com/google/wuffs/issues/22) |
| for updates. |
| |
| Go programming language libraries: |
| |
| - [RAC](https://godoc.org/github.com/google/wuffs/lib/rac) |
| - [RAC + Zlib](https://godoc.org/github.com/google/wuffs/lib/raczlib) |
| |
| Command line tool, installable via `go get |
| github.com/google/wuffs/cmd/ractool`: |
| |
| - [ractool](https://godoc.org/github.com/google/wuffs/cmd/ractool) |
| |
| |
| # Acknowledgements |
| |
| I (Nigel Tao) thank Robert Obryk, Sanjay Ghemawat and Sean Klein for their |
| review. |
| |
| |
| --- |
| |
| Updated on September 2019. |