Re-do RAC codec values, from 8 bits to 56 bits

commit: 4006157f9f646ea58dc7219f2bc2b06a0d7d767a [log] [tgz]
author: Nigel Tao <nigeltao@golang.org> Sat Sep 07 09:09:18 2019 +1000
committer: Nigel Tao <nigeltao@golang.org> Sat Sep 07 10:25:33 2019 +1000
tree: 43501e495e4c792f7718f76db404f07595e287e4
parent: e7c7e7bcf95895eef37172d40cf42602348d2bdf [diff]
diff --git a/doc/spec/rac-spec.md b/doc/spec/rac-spec.md
index 3aff1b9..a14fef2 100644
--- a/doc/spec/rac-spec.md
+++ b/doc/spec/rac-spec.md

@@ -1,6 +1,6 @@
 # Random Access Compression: RAC
 
-Status: Draft (as of August 2019). There is no compatibility guarantee yet.
+Status: Draft (as of September 2019). There is no compatibility guarantee yet.
 
 
 ## Overview
@@ -32,10 +32,11 @@
   - Filesystem metadata like file names, file sizes and modification times.
   - Multiple source files.
 
-There is the capability (see reserved `TTag`s, below) but no promise to address
-these in a future RAC version. There might not be a need to, as other designs
-such as [SquashFS](http://squashfs.sourceforge.net/) and EROFS ([Extendable
-Read-Only File System](https://lkml.org/lkml/2018/5/31/306)) already exist.
+There is the capability (see attributes and reserved `TTag`s, below) but no
+promise to address these in a future RAC version. There might not be a need to,
+as other designs such as [SquashFS](http://squashfs.sourceforge.net/) and EROFS
+([Extendable Read-Only File System](https://lkml.org/lkml/2018/5/31/306))
+already exist.
 
 Non-goals in general include:
 
@@ -88,8 +89,10 @@
 knowing the `CFileSize` in order to identify the `Root Node`, which is either
 at the start or the end of the `CFile`.
 
-Each `Node` has a `DRange`. An empty `DRange` means that the `Node` contains
-metadata or other decompression context such as a shared dictionary.
+Each `Node` has a `DRange`, which can be empty. `Branch Node`s can also have
+attributes: elements that aren't child `Node`s, which must have an empty
+`DRange`. Empty elements contain metadata or other decompression context such
+as a shared dictionary.
 
 Each `Leaf Node` also has 3 `CRange`s (`Primary`, `Secondary` and `Tertiary`),
 any or all of which may be empty. The contents of the `CFile`, within those
@@ -114,7 +117,7 @@
     | ...       |0|.|  ...,                        Reserved (0),  ...
     | DPtr[A-2] |0|T|  DPtr[Arity-2],              Reserved (0),  TTag[Arity-2]
     | DPtr[A-1] |0|T|  DPtr[Arity-1],              Reserved (0),  TTag[Arity-1]
-    | DPtr[A]   |0|C|  DPtr[Arity] a.k.a. DPtrMax, Reserved (0),  Codec
+    | DPtr[A]   |0|C|  DPtr[Arity] a.k.a. DPtrMax, Reserved (0),  CodecByte
     | CPtr[0]   |L|S|  CPtr[0],                    CLen[0],       STag[0]
     | CPtr[1]   |L|S|  CPtr[1],                    CLen[1],       STag[1]
     | CPtr[2]   |L|S|  CPtr[2],                    CLen[2],       STag[2]
@@ -164,7 +167,8 @@
 
 ### Arity
 
-`Arity` is the `Branch Node`'s number of children. Zero is invalid.
+`Arity` is the `Branch Node`'s number of elements: the number of child `Node`s
+plus the number of attributes. Having zero children is invalid.
 
 The `Arity` byte is given twice: the fourth byte and the final byte of the
 `Branch Node`. The two values must match.
@@ -196,14 +200,19 @@
 
 ### COffs and DOffs, STags and TTags
 
-For every `a` in the half-open range `[0 .. Arity)`, the `a`'th child `Node`
-has two tags, `STag[a]` and `TTag[a]`, and a `DRange` of `[DOff[a] ..
-DOff[a+1])`. The `DOff` values must be non-decreasing: see the "Branch Node
-Validation" section below.
+For every `a` in the half-open range `[0 .. Arity)`, the `a`'th element has two
+tags, `STag[a]` and `TTag[a]`, and a `DRange` of `[DOff[a] .. DOff[a+1])`. The
+`DOff` values must be non-decreasing: see the "Branch Node Validation" section
+below.
 
-A `TTag[a]` of `0xFE` means that child is a `Branch Node`. A `TTag[a]` in the
-half-open range `[0xC0 .. 0xFE)` is reserved. Otherwise, the child is a `Leaf
-Node`.
+A `TTag[a]` of `0xFE` means that that element is a `Branch Node` child.
+
+A `TTag[a]` of `0xFD` means that there is no child, but is instead a `Codec
+Element` attribute, whose `DRange` must be empty, and the rest of this section
+does not apply: the `STag` is ignored.
+
+A `TTag[a]` in the half-open range `[0xC0 .. 0xFD)` is reserved. Otherwise, the
+element is a `Leaf Node` child.
 
 A child `Branch Node`'s `SubBranch COffset` is defined to be `COff[a]`. Its
 `SubBranch DBias` and `SubBranch DOffMax` are defined to be `DOff[a]` and
@@ -272,47 +281,78 @@
 `Codec`s define specializations of RAC, such as "RAC + Zlib" or "RAC + Brotli".
 It is valid for a "RAC + Zstandard" only decoder to reject a "RAC + Brotli"
 file, even if it is a valid RAC file. Recall that RAC is just a container, and
-not tied to any particular compression codec. For the `Codec` byte in a `Branch
-Node`:
+not tied to any particular compression codec.
+
+There are two categories of `Codec`s: the high `0x80` bit of the `Codec Byte`
+being `0` or `1` denotes a `Short` or `Long` codec respectively. `Short Codec`s
+are represented by a single byte (the `Codec Byte`). `Long Codec`s use that
+`Codec Byte` to locate 7 additional bytes: a `Codec Element`.
+
+For both `Short Codec`s and `Long Codec`s, the second-highest `0x40` bit of the
+`Codec Byte` is the `Mix Bit`. A `Mix Bit` of `0` means that all of the
+`Node`'s descendents have exactly the same `Codec` (not just the same `Codec
+Byte`; `Short Codec`s and `Long Codec`s are not considered exactly the same
+even if they represent the same compression algorithm). A `Mix Bit` of `1`
+means that descendents may have a different `Codec`.
+
+For `Short Codec`s, the remaining low 6 bits correspond to a specific
+compression algorithm:
 
   - `0x00` means "RAC + Zeroes".
   - `0x01` means "RAC + Zlib".
   - `0x02` means "RAC + Brotli".
   - `0x04` means "RAC + ZStandard".
-  - Any other value less than 0x08 means that all of this `Branch Node`'s
-    children must be `Branch Node`s and not `Leaf Node`s and that no child's
-    `Codec` byte can have a bit set that is not set in this `Codec` byte.
-  - All other values, 0x08 or greater, are reserved.
+  - All other values are reserved.
+
+For `Long Codec`s, the remaining low 6 bits of the `Codec Byte` define a number
+`c64`. The lowest index `i` out of `(c64 + (64 * 0))`, `(c64 + (64 * 1))`,
+`(c64 + (64 * 2))` and `(c64 + (64 * 3))` such that `TTag[i]` is `0xFD` locates
+the 7 bytes that identifies the `Codec`. The location is what would otherwise
+occupy the `i`th element's `CPtr | CLen` space. It is invalid for no such `i`
+to exist.
+
+`Long Codec` values are not reserved by this specification, other than 7 NUL
+bytes also means "RAC + Zeroes". Users may define their own values for
+arbitrary compression algorithms. Maintaining a centralized registry mapping
+values to algorithms is out of scope of this specification, although we suggest
+a convention that the 7 bytes form a human-readable (ASCII) string, padded with
+trailing NUL bytes. For example, a hypothetical "Middle Out version 2"
+compression algorithm that typically used the ".mdo2" file extension might be
+represented by the 7 bytes `"mdo2\x00\x00\x00"`.
 
 
 ### Branch Node Validation
 
 The first time that a RAC reader visits any particular `Branch Node`, it must
 check that the `Magic` matches, the two `Arity` values match and are non-zero,
-the computed checksum matches the listed `Checksum` and that the RAC reader
-accepts the `Version` and the `Codec`.
+there is at least one child `Node` (not just non-`Node` attributes), the
+computed checksum matches the listed `Checksum` and that the RAC reader accepts
+the `Version`. For `Long Codec`s, there must exist an `0xFD` `TTag` as per the
+previous section.
 
 It must also check that all of its `DOff` values are sorted: `(DOff[a] <=
 DOff[a+1])` for every `a` in the half-open range `[0 .. Arity)`. By induction,
 this means that all of its `DOff` values do not exceed `DOffMax`, and again by
 induction, therefore do not exceed `DFileSize`.
 
-It must also check that all of its `COff` values do not exceed `COffMax` (and
-again by induction, therefore do not exceed `CFileSize`). Other than that,
-`COff` values do not have to be sorted: successive `Node`s (in `DSpace`) can be
-out of order (in `CSpace`), allowing for incrementally modified RAC files.
+It must also check that, other than `Codec Element` attributes, all of its
+`COff` values do not exceed `COffMax` (and again by induction, therefore do not
+exceed `CFileSize`). Other than that, `COff` values do not have to be sorted:
+successive `Node`s (in `DSpace`) can be out of order (in `CSpace`), allowing
+for incrementally modified RAC files.
 
 For the `Root Node`, its `COffMax` must equal the `CFileSize`. Recall that
 parsing a `CFile` requires knowing the `CFileSize`, and also that a `Root
 Node`'s `Branch CBias` is zero, so its `COffMax` equals its `CPtrMax`.
 
-For a child `Branch Node`, its `Codec` bits must be a subset of its parent's
-`Codec` bits, its `Version` must be less than or equal to its parent's
-`Version`, its `COffMax` must be less than or equal to its parent's `COffMax`,
-and its `DOffMax` must equal its parent's `SubBranch DOffMax`. The `DOffMax`
-condition is equivalent to checking that the parent and child agree on the
-child's size in `DSpace`. The parent states that it is its `(DPtr[a+1] -
-DPtr[a])` and the child states that it is its `DPtrMax`.
+For a child `Branch Node`, if the parent's `Codec` does not have the `Mix Bit`
+set then the child's `Codec` must equal its parent's. Furthermore, its
+`Version` must be less than or equal to its parent's `Version`, its `COffMax`
+must be less than or equal to its parent's `COffMax`, and its `DOffMax` must
+equal its parent's `SubBranch DOffMax`. The `DOffMax` condition is equivalent
+to checking that the parent and child agree on the child's size in `DSpace`.
+The parent states that it is its `(DPtr[a+1] - DPtr[a])` and the child states
+that it is its `DPtrMax`.
 
 One conservative way to check `Branch Node`s' validity on first visit is to
 check them on every visit, as validating any particular `Branch Node` is
@@ -618,12 +658,13 @@
     00000104    b4 00 00 00 00 00 04 00    | CPtr[2]   |L|S|
     0000010c    14 01 00 00 00 00 01 03    | CPtr[A]   |V|A|
 
-Its `Codec` is `0x01`, "RAC + Zlib", its Version is `0x01` and its `Arity` is
-`0x03`. The `DPtr` values are `0x0000` (implicit), `0x0000`, `0x0023` and
-`0x0029`. The `CPtr` values are `0x009F`, `0x0000`, `0x00B4` and `0x0114` (the
-size of the whole RAC file). Note that the `CPtr` values are not sorted. The
-last two children's `TTag`s are `0xFE` and `0xFE` and their `STag`s are `0x01`
-and `0x00`, which means that they are both `CBiasing Branch Node`s.
+Its `CodecByte` (and therefore its `Codec`) is `0x01`, "RAC + Zlib", its
+Version is `0x01` and its `Arity` is `0x03`. The `DPtr` values are `0x0000`
+(implicit), `0x0000`, `0x0023` and `0x0029`. The `CPtr` values are `0x009F`,
+`0x0000`, `0x00B4` and `0x0114` (the size of the whole RAC file). Note that the
+`CPtr` values are not sorted. The last two children's `TTag`s are `0xFE` and
+`0xFE` and their `STag`s are `0x01` and `0x00`, which means that they are both
+`CBiasing Branch Node`s.
 
 
 # Reference Implementation
@@ -652,4 +693,4 @@
 
 ---
 
-Updated on August 2019.
+Updated on September 2019.

diff --git a/lib/rac/chunk_reader.go b/lib/rac/chunk_reader.go
index 28dc0ed..f0b1fd1 100644
--- a/lib/rac/chunk_reader.go
+++ b/lib/rac/chunk_reader.go

@@ -25,12 +25,14 @@
 	_ = b[7] // Early bounds check to guarantee safety of reads below.
 	u := uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 | uint64(b[3])<<24 |
 		uint64(b[4])<<32 | uint64(b[5])<<40 | uint64(b[6])<<48 | uint64(b[7])<<56
-	return int64(u & 0xFFFFFFFFFFFF)
+	const mask = (1 << 48) - 1
+	return int64(u & mask)
 }
 
-// bitwiseSubset returns whether every 1-bit in inner is also set in outer.
-func bitwiseSubset(outer Codec, inner Codec) bool {
-	return outer == (outer | inner)
+func u64LE(b []byte) uint64 {
+	_ = b[7] // Early bounds check to guarantee safety of reads below.
+	return uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 | uint64(b[3])<<24 |
+		uint64(b[4])<<32 | uint64(b[5])<<40 | uint64(b[6])<<48 | uint64(b[7])<<56
 }
 
 // Range is the half-open range [low, high). It is invalid for low to be
@@ -78,11 +80,40 @@
 // true.
 type rNode [4096]byte
 
-func (b *rNode) arity() int     { return int(b[3]) }
-func (b *rNode) codec() Codec   { return Codec(b[(8*int(b[3]))+7]) }
-func (b *rNode) cPtrMax() int64 { return u48LE(b[(16*int(b[3]))+8:]) }
-func (b *rNode) dPtrMax() int64 { return u48LE(b[8*int(b[3]):]) }
-func (b *rNode) version() uint8 { return b[(16*int(b[3]))+14] }
+func (b *rNode) arity() int           { return int(b[3]) }
+func (b *rNode) codecHasMixBit() bool { return b[(8*int(b[3]))+7]&0x40 != 0 }
+func (b *rNode) cPtrMax() int64       { return u48LE(b[(16*int(b[3]))+8:]) }
+func (b *rNode) dPtrMax() int64       { return u48LE(b[8*int(b[3]):]) }
+func (b *rNode) version() uint8       { return b[(16*int(b[3]))+14] }
+
+// codec returns the 1-byte (short) or 7-byte (long) codec as a valid Codec, or
+// CodecInvalid.
+//
+// The (uint64) Codec value returned does not have the Mix Bit (the 62nd bit)
+// set, regardless of whether it's set in the rNode's bytes.
+func (b *rNode) codec() Codec {
+	arity := int(b[3])
+	cByte := b[(8*arity)+7]
+
+	// Look for a short codec.
+	if (cByte & 0x80) == 0 {
+		return Codec(cByte&0x3F) << 56
+	}
+	cByte &= 0x3F
+
+	// Look for a long codec.
+	const highBit, low56Bits = 0x8000000000000000, 0x00FFFFFFFFFFFFFF
+	for j := uint8(0); j < 4; j++ {
+		i := int(cByte | (j << 6))
+		if (i < arity) && (b.tTag(i) == 0xFD) {
+			base := (8 * arity) + 8
+			codec := Codec(u64LE(b[(8*i)+base:]))
+			return (codec & low56Bits) | highBit
+		}
+	}
+
+	return CodecInvalid
+}
 
 func (b *rNode) cLen(i int) uint8 {
 	base := (8 * int(b[3])) + 14
@@ -178,37 +209,49 @@
 	}
 
 	// Check that the "Reserved (0)" bytes are zero and that the TTag values
-	// aren't in the reserved range [0xC0, 0xFE).
+	// aren't in the reserved range [0xC0, 0xFD).
+	hasChildren := false
 	for i := 0; i < arity; i++ {
 		if b[(8*i)+6] != 0 {
 			return false
 		}
-		if tTag := b[(8*i)+7]; (0xC0 <= tTag) && (tTag < 0xFE) {
+		if tTag := b[(8*i)+7]; (0xC0 <= tTag) && (tTag < 0xFD) {
 			return false
+		} else if tTag != 0xFD {
+			hasChildren = true
 		}
 	}
-	if b[(8*arity)+6] != 0 {
+	if !hasChildren || (b[(8*arity)+6] != 0) {
 		return false
 	}
 
 	// Check that the DPtr values are non-decreasing. The first DPtr value is
 	// implicitly zero.
-	prev := u48LE(b[8*1:])
-	for i := 2; i <= arity; i++ {
+	prev := int64(0)
+	for i := 1; i <= arity; i++ {
 		curr := u48LE(b[8*i:])
 		if curr < prev {
 			return false
+		} else if curr != prev {
+			if tTag := b[(8*i)+7]; tTag == 0xFD {
+				return false
+			}
 		}
 		prev = curr
 	}
 
-	// Check that no CPtr value exceeds CPtrMax (the final CPtr value).
+	// Check that no CPtr value exceeds CPtrMax (the final CPtr value), other
+	// than 0xFD Codec Entries.
 	base := (8 * arity) + 8
 	cPtrMax := u48LE(b[size-8:])
 	for i := 0; i < arity; i++ {
-		if cPtr := u48LE(b[(8*i)+base:]); cPtr > cPtrMax {
-			return false
+		if cPtr := u48LE(b[(8*i)+base:]); cPtr <= cPtrMax {
+			continue
 		}
+		if tTag := b[(8*i)+7]; tTag == 0xFD {
+			continue
+		}
+		return false
 	}
 
 	// Check the the version is non-zero.
@@ -225,7 +268,8 @@
 
 	// Further checking of the codec, version, COffMax and DOffMax requires
 	// more context, and is done in loadAndValidate.
-	return true
+
+	return b.codec().Valid()
 }
 
 // ChunkReader parses a RAC file.
@@ -431,7 +475,7 @@
 }
 
 func (r *ChunkReader) loadAndValidate(cOffset int64,
-	parentCodec Codec, parentVersion uint8, parentCOffMax int64,
+	parentCodec Codec, parentCodecHasMixBit bool, parentVersion uint8, parentCOffMax int64,
 	childCBias int64, childDSize int64) error {
 
 	if (cOffset < 0) || ((r.CompressedSize - 4) < cOffset) {
@@ -467,7 +511,7 @@
 
 	// Validate the parent and child codec, version, COffMax and DOffMax.
 	childVersion := r.currNode.version()
-	if !bitwiseSubset(parentCodec, r.currNode.codec()) ||
+	if !parentChildCodecsValid(parentCodec, r.currNode.codec(), parentCodecHasMixBit) ||
 		(parentVersion < childVersion) ||
 		(parentCOffMax < (childCBias + r.currNode.cPtrMax())) ||
 		(childDSize != r.currNode.dPtrMax()) {
@@ -553,6 +597,7 @@
 		}
 
 		parentCodec := r.currNode.codec()
+		parentCodecHasMixBit := r.currNode.codecHasMixBit()
 		parentVersion := r.currNode.version()
 		parentCOffMax := cBias + r.currNode.cPtrMax()
 		childCOffset := r.currNode.cOff(i, cBias)
@@ -564,7 +609,7 @@
 		childDSize := r.currNode.dSize(i)
 
 		if err := r.loadAndValidate(childCOffset,
-			parentCodec, parentVersion, parentCOffMax,
+			parentCodec, parentCodecHasMixBit, parentVersion, parentCOffMax,
 			childCBias, childDSize); err != nil {
 			return err
 		}

diff --git a/lib/rac/chunk_writer.go b/lib/rac/chunk_writer.go
index f665b23..f07442e 100644
--- a/lib/rac/chunk_writer.go
+++ b/lib/rac/chunk_writer.go

@@ -15,6 +15,7 @@
 package rac
 
 import (
+	"errors"
 	"hash/crc32"
 	"io"
 	"sort"
@@ -346,6 +347,9 @@
 	dRangeSize uint64, codec Codec, primary []byte,
 	secondary OptResource, tertiary OptResource) error {
 
+	if w.err != nil {
+		return w.err
+	}
 	if dRangeSize == 0 {
 		return nil
 	}
@@ -361,6 +365,15 @@
 	if err := w.initialize(); err != nil {
 		return err
 	}
+	if len(w.leafNodes) == 0 {
+		if !codec.Valid() {
+			return errInvalidCodec
+		}
+		w.codec = codec
+	} else if w.codec != codec {
+		w.err = errors.New("rac: TODO: support writing multiple Codecs")
+		return w.err
+	}
 	if err := w.write(primary); err != nil {
 		return err
 	}
@@ -413,7 +426,7 @@
 		_, err := w.Writer.Write(emptyRACFile[:])
 		return err
 	}
-	rootNode := gather(w.leafNodes)
+	rootNode := gather(w.leafNodes, w.codec.isLong())
 	indexSize := rootNode.calcEncodedSize(0, w.IndexLocation == IndexLocationAtEnd)
 
 	nw := &nodeWriter{
@@ -505,6 +518,9 @@
 	if arity == 0 {
 		return accumulator
 	}
+	if n.codec.isLong() {
+		arity++
+	}
 	size := (arity * 16) + 16
 	cLength := calcCLength(size)
 
@@ -552,12 +568,25 @@
 		}
 	}
 
+	if !n.codec.Valid() {
+		return errors.New("rac: TODO: support writing multiple Codecs")
+	}
+
 	buf, dPtr := w.buffer[:], uint64(0)
 	arity := uint64(len(n.children) + len(n.resources))
+	if n.codec.isLong() {
+		arity++
+	}
 	if arity > 0xFF {
 		return errInternalArityIsTooLarge
 	}
 
+	// Codec Element 'DPtr|Reserved0|TTag', if present.
+	if n.codec.isLong() {
+		putU64LE(buf, 0xFD00000000000000)
+		buf = buf[8:]
+	}
+
 	// DPtr's. We write resources before regular children (non-resources), so
 	// that any TTag that refers to a resource always avoids the [0xC0, 0xFD]
 	// reserved zone. The ratio of resources to regulars is at most 2:1, as
@@ -577,10 +606,18 @@
 	}
 	buf = buf[8*len(n.children):]
 
-	// DPtrMax.
-	putU64LE(buf, dPtr|(uint64(n.codec)<<56))
+	// DPtrMax and the CodecByte. By construction, a Long Codec's 'c64' value
+	// is always 0, as the Codec Element is always in position 0.
+	codecHighByte := uint64(n.codec) & 0xFF00000000000000
+	putU64LE(buf, dPtr|codecHighByte)
 	buf = buf[8:]
 
+	// Codec Element 'CPtr|CLen|STag', if present.
+	if n.codec.isLong() {
+		putU64LE(buf, uint64(n.codec)&0x00FFFFFFFFFFFFFF)
+		buf = buf[8:]
+	}
+
 	// CPtr's. While the RAC format allows otherwise, this nodeWriter always
 	// keeps the CBias at zero, so that a CPtr equals a COffset.
 	for i, res := range n.resources {
@@ -670,13 +707,22 @@
 // that the root node always directly contains the first and last leaf nodes as
 // children, as those leaves presumably contain the most commonly accessed
 // parts of the decompressed file.
-func gather(nodes []wNode) wNode {
+//
+// If doing this TODO, we'd also have to change the "codec = codecMixBit |
+// CodecZeroes" line below, as it assumes that no branch nodes have both branch
+// node children and leaf node children.
+func gather(nodes []wNode, codecIsLong bool) wNode {
 	if len(nodes) == 0 {
 		panic("gather: no nodes")
 	}
 
 	resources := map[OptResource]bool{}
 
+	arityBudget := 0xFF
+	if codecIsLong {
+		arityBudget = 0xFE
+	}
+
 	for {
 		i, j, arity, newNodes := 0, 0, 0, []wNode(nil)
 		for ; j < len(nodes); j++ {
@@ -685,7 +731,7 @@
 			new2 := (o.secondary != 0) && !resources[o.secondary]
 			new3 := (o.tertiary != 0) && !resources[o.tertiary]
 			arity += 1 + btoi(new2) + btoi(new3)
-			if arity <= 0xFF {
+			if arity <= arityBudget {
 				if new2 {
 					resources[o.secondary] = true
 				}
@@ -727,9 +773,19 @@
 
 func makeBranch(children []wNode, resMap map[OptResource]bool) wNode {
 	dRangeSize, codec := uint64(0), Codec(0)
-	for _, c := range children {
+	for i, c := range children {
 		dRangeSize += c.dRangeSize
-		codec |= c.codec
+		if i == 0 {
+			codec = c.codec
+		} else if codec != c.codec {
+			// We construct the node tree so that a branch node's children are
+			// either all branch nodes or all leaf nodes. If they are all leaf
+			// nodes (with the same Codec), or all branch nodes with the same
+			// Codec, the parent uses the same Codec. If they are all branch
+			// nodes, with different Codecs, we set the Mix bit and since it
+			// has no leaf nodes, we might as well use CodecZeroes.
+			codec = codecMixBit | CodecZeroes
+		}
 	}
 
 	resList := []int(nil)

diff --git a/lib/rac/rac.go b/lib/rac/rac.go
index 8933ced..093fa9e 100644
--- a/lib/rac/rac.go
+++ b/lib/rac/rac.go

@@ -39,14 +39,71 @@
 
 // Codec is the compression codec for the RAC file.
 //
+// For leaf nodes, there are two categories of valid Codecs: Short and Long. A
+// Short Codec's uint64 value's high 2 bits and low 56 bits must be zero. A
+// Long Codec's uint64 value's high 8 bits must be one and then 7 zeroes.
+// Symbolically, Short and Long match:
+//   - 0b00??????_00000000_00000000_00000000_00000000_00000000_00000000_00000000
+//   - 0b10000000_????????_????????_????????_????????_????????_????????_????????
+//
+// In terms of the RAC file format, a Short Codec fits in a single byte: the
+// Codec Byte in the middle of a branch node. A Long Codec uses that Codec Byte
+// to locate 7 other bytes, a location which would otherwise form a "CPtr|CLen"
+// value. When converting from the 7 bytes in the wire format to this Go type
+// (a uint64 value), they are read little-endian: the "CPtr" bytes are the low
+// 6 bytes, the "CLen" is the second-highest byte and the highest byte is
+// hard-coded to 0x80.
+//
+// For example, the 7 bytes "mdo2\x00\x00\x00" would correspond to a Codec
+// value of 0x8000_0000_326F_646D.
+//
+// The Mix Bit is not part of the uint64 representation. Neither is a Long
+// Codec's 'c64' index. This package's exported API deals with leaf nodes.
+// Branch nodes' wire formats are internal implementation details.
+//
 // See the RAC specification for further discussion.
-type Codec uint8
+type Codec uint64
+
+func (c Codec) isLong() bool  { return int64(c) < 0 }
+func (c Codec) isShort() bool { return int64(c) >= 0 }
+
+// Valid returns whether c matches the Short or Long pattern.
+func (c Codec) Valid() bool {
+	if (c >> 63) == 0 {
+		return ((c << 8) == 0) && ((c >> 62) == 0)
+	}
+	return (c >> 56) == 0x80
+}
+
+func (c Codec) name() string {
+	if c.isShort() && c.Valid() {
+		switch c >> 56 {
+		case 0:
+			return "Zeroes"
+		case 1:
+			return "Zlib"
+		case 2:
+			return "Brotli"
+		case 4:
+			return "Zstandard"
+		}
+	}
+	return ""
+}
+
+func parentChildCodecsValid(parent Codec, child Codec, parentHasMixBit bool) bool {
+	return (parent == child) || parentHasMixBit
+}
 
 const (
-	CodecZeroes    = Codec(0x00)
-	CodecZlib      = Codec(0x01)
-	CodecBrotli    = Codec(0x02)
-	CodecZstandard = Codec(0x04)
+	CodecZeroes    = Codec(0x00 << 56)
+	CodecZlib      = Codec(0x01 << 56)
+	CodecBrotli    = Codec(0x02 << 56)
+	CodecZstandard = Codec(0x04 << 56)
+
+	codecMixBit     = Codec(1 << 62)
+	codecLongZeroes = Codec(1 << 63)
+	CodecInvalid    = Codec((1 << 64) - 1)
 )
 
 // IndexLocation is whether the index is at the start or end of the RAC file.

diff --git a/lib/rac/rac_test.go b/lib/rac/rac_test.go
index a351d81..7daa197 100644
--- a/lib/rac/rac_test.go
+++ b/lib/rac/rac_test.go

@@ -594,3 +594,53 @@
 		t.Fatalf("got %q, want %q", got, want)
 	}
 }
+
+func TestLongCodec(t *testing.T) {
+	const codec = Codec(0x80000000326F646D) // "mdo2" backwards, with a high bit.
+	buf := &bytes.Buffer{}
+	w := &ChunkWriter{
+		Writer:        buf,
+		IndexLocation: IndexLocationAtStart,
+		TempFile:      &bytes.Buffer{},
+	}
+	if err := w.AddChunk(0x66, codec, []byte{0xAA, 0xBB}, 0, 0); err != nil {
+		t.Fatalf("AddChunk: %v", err)
+	}
+	if err := w.AddChunk(0x77, codec, []byte{0xCC}, 0, 0); err != nil {
+		t.Fatalf("AddChunk: %v", err)
+	}
+	if err := w.Close(); err != nil {
+		t.Fatalf("Close: %v", err)
+	}
+
+	encoded := buf.Bytes()
+	gotHexDump := hex.Dump(encoded)
+
+	const wantHexDump = "" +
+		"00000000  72 c3 63 03 3d c9 00 fd  00 00 00 00 00 00 00 ff  |r.c.=...........|\n" +
+		"00000010  66 00 00 00 00 00 00 ff  dd 00 00 00 00 00 00 80  |f...............|\n" +
+		"00000020  6d 64 6f 32 00 00 00 00  40 00 00 00 00 00 01 ff  |mdo2....@.......|\n" +
+		"00000030  42 00 00 00 00 00 01 ff  43 00 00 00 00 00 01 03  |B.......C.......|\n" +
+		"00000040  aa bb cc                                          |...|\n"
+
+	if gotHexDump != wantHexDump {
+		t.Fatalf("\ngot:\n%s\nwant:\n%s", gotHexDump, wantHexDump)
+	}
+
+	r := &ChunkReader{
+		ReadSeeker:     bytes.NewReader(encoded),
+		CompressedSize: int64(len(encoded)),
+	}
+	for i := 0; i < 2; i++ {
+		c, err := r.NextChunk()
+		if err != nil {
+			t.Fatalf("i=%d: %v", i, err)
+		}
+		if got, want := c.Codec, codec; got != want {
+			t.Fatalf("i=%d: Codec: got 0x%X, want 0x%X", i, got, want)
+		}
+	}
+	if _, err := r.NextChunk(); err != io.EOF {
+		t.Fatalf("got %v, want %v", err, io.EOF)
+	}
+}

diff --git a/lib/rac/reader.go b/lib/rac/reader.go
index d64bb92..dc50e17 100644
--- a/lib/rac/reader.go
+++ b/lib/rac/reader.go

@@ -368,7 +368,7 @@
 		return r.err
 	}
 
-	if chunk.Codec == 0 {
+	if (chunk.Codec == CodecZeroes) || (chunk.Codec == codecLongZeroes) {
 		r.currChunk.N = 0
 		r.dRange = chunk.DRange
 		r.zeroes = zeroesReader(r.dRange.Size())
@@ -384,16 +384,12 @@
 		}
 	}
 	if codecReader == nil {
-		name := ""
-		switch chunk.Codec {
-		case CodecZlib:
-			name = " (Zlib)"
-		case CodecBrotli:
-			name = " (Brotli)"
-		case CodecZstandard:
-			name = " (Zstandard)"
+		name0, name1, name2 := "", chunk.Codec.name(), ""
+		if name1 != "" {
+			name0, name2 = " (", ")"
 		}
-		r.err = fmt.Errorf("rac: no matching CodecReader for Codec 0x%02X%s", chunk.Codec, name)
+		r.err = fmt.Errorf("rac: no matching CodecReader for Codec 0x%X%s%s%s",
+			chunk.Codec, name0, name1, name2)
 		return r.err
 	}
 

diff --git a/lib/raczlib/raczlib_test.go b/lib/raczlib/raczlib_test.go
index aba447d..66bfd54 100644
--- a/lib/raczlib/raczlib_test.go
+++ b/lib/raczlib/raczlib_test.go

@@ -161,7 +161,7 @@
 	buf[0x37] = 0x00 // CBiasing with COff[0], which is encLen0.
 
 	// Codec and Version.
-	buf[0x1F] = byte(rac.CodecZlib)
+	buf[0x1F] = byte(rac.CodecZlib >> 56)
 	buf[0x3E] = 0x01
 
 	// Checksum.
commit	4006157f9f646ea58dc7219f2bc2b06a0d7d767a	[log] [tgz]
author	Nigel Tao <nigeltao@golang.org>	Sat Sep 07 09:09:18 2019 +1000
committer	Nigel Tao <nigeltao@golang.org>	Sat Sep 07 10:25:33 2019 +1000
tree	43501e495e4c792f7718f76db404f07595e287e4
parent	e7c7e7bcf95895eef37172d40cf42602348d2bdf [diff]