Random Access Compression (RAC) Related Work

This is a companion document to the Random Access Compression (RAC) Specification. Its contents aren't necessary for implementing the RAC file format, but gives further context on what RAC is and is not.

In general, RAC differs from most other archive or compression formats in a number of ways. This doesn't mean that RAC is necessarily better or worse than these other designs, just that different designs make different trade-offs.

For example, a key feature is that RAC supports shared compression dictionaries, embedded within the archive file. This can claw back some of the compression ratio lost when splitting a source file into independently compressed chunks. The trade-off is that decompressing a RAC file therefore requires random access to the compressed file, not just sequential access. Unlike some other formats discussed below, RAC cannot do streaming (one pass) decoding. Decompression might not be able to start until the entire compressed file is available: the root node of the index might be at the end of the compressed file.

Similarly, unlike Gzip or XZ, concatenating two RAC files together does not produce a valid RAC file. A new root node also has to be appended afterwards, which is cheap and easy, but not a no-op. Nonetheless, that new root node allows for incremental, append-only modification of a RAC file: removing, replacing or inserting bytes in the middle of the decompressed form by only appending to the compressed form.

A few more differences between RAC and most if not all of the other file formats or algorithms discussed further below:

  • RAC itself isn‘t tied to any one underlying compression codec. Obviously, “RAC + Zlib” is tied to Zlib, and “RAC + Brotli” is tied to Brotli, but if somebody invents a new “MiddleOut” compression codec, one that wasn’t random access per se, it would be easy to have “RAC + MiddleOut”.
  • RAC seek points are identified by numbers (i.e. DOffsets), not strings (i.e. file names). Some other formats' seek points are positioned only at source file boundaries, as they're primarily used to compress part of a file system, containing directories and (variable length) files. They cannot seek to a relative offset (in what RAC calls DSpace) within a source file.

These differences apply to established archive formats like Tar and Zip, and newer archive formats like Śiva.

Software Products

XZ supports random access and multiple underlying compression codecs like RAC. Out of all the formats listed here, it is the one most similar to RAC, but as one of XZ‘s explicit goals is to support streaming decoding, XZ does not support shared dictionaries or incremental modification, as discussed above. Also, unlike RAC, finding the record that contains any given DOffset requires a linear scan of all previous records, even if those records don’t need decompressing.

DictZip supports random access like RAC, but is tied to the Gzip compression codec (every valid DictZip file is also a valid Gzip file), and due to size restrictions in the Gzip extension format, a DictZip file's decompressed size cannot exceed about 1.8 GiB. CSIO seems to do something similar. Idzip avoids the 1.8 GiB limitation on the overall size (although retaining the 64 KiB limit on individual chunks) by not requiring the index to fit inside a single Gzip extension block.

BGZF is very similar to DictZip, but also avoids the 1.8 GiB limitation. It places the uncompressed size of each chunk together with each chunk, instead of having a single contiguous index separate from chunk data. Unless the index is also stored externally, random access is achieved through following a linked list of size offsets. Some more discussion is on Heng Li's blog.

Xflate is similar to BGZF, but extends the Deflate format instead of Gzip (every valid Xflate file is also a valid Deflate file). Gzip extends Deflate with metadata (where BGZF stashes its index information) and a checksum. Deflate has no capacity for metadata per se, but Xflate exploits the fact that there are many ways to encode an empty Deflate block (empty meaning that decoding it produces zero bytes).

Bzip2 and RAZip are other formats based around the same idea of partitioning the source file into independently compressed chunks, and for each chunk, recording both its compressed and uncompressed size. Once again, random access is achieved through following a linked list of size offsets.

Riegeli is a sequence-of-records format. A RAC file arguably contains what Riegeli calls records (which are not files, as they don't have names) and both formats support numerical seek points, and multiple compression codecs, but Riegeli seeks to a point in what RAC calls CSpace, whereas RAC seeks to a point in DSpace.

QCOW supports random access like RAC, but compression does not seem to be the foremost concern. That specification mentions compression but does not give any particular codec. A separate document suggests that the codec is Zlib (with no other option). Like other RAC-related work, QCOW does not support shared dictionaries.

The examples/zran.c program in the zlib-the-library source code repository (and e.g. the jzran Java language port) builds an in-memory index, not a persistent (disk-backed) one, and building the index involves first decompressing the whole file. It also requires storing (in memory) the entire state of the decompressor, including 32 KiB of history, per seek point. For coarse grained seek points (e.g. once every 1 MiB), that overhead can be acceptable, but for fine grained seek points (e.g. once every 16 KiB), that overhead is prohibitive.

The examples/dictionaryRandomAccess.c program in the lz4-the-library source code repository is much closer in spirit to RAC, and unlike the zlib example, does not require saving decompressor state: its chunks are independently compressed. It supports dictionaries, but they have to be supplied out-of-band, separate from the compressed file. It is also tied to the LZ4 compression format, hard-codes what RAC would call the chunk DRange size to a fixed value, and the file format is endian-dependent. This is a proof of concept (the implementation's maximum DFileSize is 1MiB), not a production quality file format.

RAC differs from the LevelDB Table format, even if the LevelDB Table string keys are re-purposed to encode both numerical DOffset seek points and identifiers for shared compression dictionaries. A LevelDB Table index is flat, unlike RAC‘s hierarchical index. In both cases, a maliciously written index could contain multiple entries for the same key, and different decoders could therefore produce different output for the same source file - a potential security vulnerability. For LevelDB Tables, verifying an index key’s uniqueness requires scanning every key in the file, which can be relatively expensive, and is therefore not done in practice. In comparison, RAC‘s “Branch Node Validation” process (see below) ensures that exactly one Leaf Node contains any given byte offset di in the decompressed file, and the RAC index’s hierarchical nature places a scalable upper bound on the scanning cost of verifying that, based on that portion of the index tree visited for any given request [di .. dj).

zfp primarily concerns lossy compression of spatial floating point data.

Academic Publications

Kerbiriou and Chikhi's “Parallel decompression of gzip-compressed files and random access to DNA sequences” does not modify the Gzip / Deflate file format. Seek points are located in CSpace, not DSpace. It uses a heuristic approach to determine block boundaries near an arbitrary seek point, which is susceptible to false positive matches. The primary use case is biometric data in the FASTQ file format, limited to ASCII. Reconstruction can be lossy: “The method is near-exact at low compression levels, but often a large fraction of sequences contains undetermined characters at higher compression levels”.

Kreft and Navarro's “LZ77-like Compression with Fast Random Access” constrains the more general Lempel-Ziv approach so that back-reference spans always end at phrase boundaries. Combining with an efficient data structure for phrase boundaries, a bitmap with rank and select operations, allows for backwards reconstruction (starting from a phrase endpoint). However, in terms of compression ratios, “Our least-space variants take 2.5–4.0 times the space of p7zip, and 1.8–2.5 times the space of lz77”.

Fredriksson and Nikitin's “Simple Random Access Compression” provides random access to short substrings (as short as a single symbol) of the decompressed file, focusing on natural language source text (where a symbol corresponds to e.g. an English language word). It is unclear how well their technique performs on decompressing large substrings of binary data. For example, while their reported compression ratios appear competitive with gzip for an English text file, they are substantially worse for XML data.

Zhang and Bockelman's “Exploring compression techniques for ROOT IO” has a section literally titled “Random Access Compression”, which appears to apply to structured floating point data (from the High Energy Physics domain), and the reported compression ratios are dramatically worse (2.88x) with what they call RAC than without. This document discusses lossless compression of arbitrary binary data, with a small (e.g. 1.02x overhead).

Rodler's “Compression with Fast Random Access” sounds relevant based on its title, but primarily concerns lossy compression of volumetric data.

Robert and Nadarajan's “New Algorithms For Random Access Text Compression” cannot compress aribrary data, only 7-bit ASCII. The largest document they consider is also only 17 KiB in size.