Add doc/note/tokens.md

commit: 478d1b817cb25a81f4eb8a006cab208f48942b22
author: Nigel Tao <nigeltao@golang.org> Wed Apr 08 23:03:51 2020 +1000
committer: Nigel Tao <nigeltao@golang.org> Wed Apr 08 23:31:18 2020 +1000
tree: b120bc90fff142c3ebed4b83fb622dceab63a302
parent: df3acfbb21cc991e87019cf01d08a1a97b7ddd52
diff --git a/doc/note/tokens.md b/doc/note/tokens.md
new file mode 100644
index 0000000..8a9bde0
--- /dev/null
+++ b/doc/note/tokens.md

@@ -0,0 +1,156 @@
+# Tokens
+
+Some Wuffs codecs transform between byte streams and token streams. For
+example, a JSON decoder might tokenize `[1,true,"abc\txyz"]` as
+
+```
+Bytes    [  1  ,  t  r  u  e  ,  "  a  b  c  \  t  x  y  z  "  ]
+Tokens   0  1  2  3.........  4  5  6......  7...  8......  9  10
+Chains   -  -  -  ----------  -  ----------------------------  -
+```
+
+The tokens partition the bytes. Each byte belongs to exactly one token. Each
+token spans zero or more bytes.
+
+Just as [I/O buffers](/doc/note/io-input-output.md) and
+[coroutines](/doc/note/coroutines.md) mean that a byte stream doesn't need to
+be entirely in memory at any point, token buffers similarly allow a token
+stream to be produced and consumed incrementally.
+
+Nonetheless, conceptually, if the byte stream was a single contiguous slice
+then each token would correspond to a sub-slice, one whose length was the token
+length and whose position was the sum of all previous tokens' lengths.
+
+Consecutive tokens can also form token chains, which capture higher level
+concepts. For example, the JSON string `"abc\txyz"` can correspond to multiple
+tokens. One reason is that the maximum token length is `65535` bytes, but JSON
+strings can be longer. Another reason is that different parts of the encoded
+string are treated differently when reconstructing the decoded string:
+
+- Decoding the first or last `"` is a no-op.
+- Decoding `abc` or `xyz` is a `memcpy`.
+- Decoding the backslash-escaped `\t` should produce a tab character.
+
+
+## Representation
+
+A token is just a `uint64_t`. The broad divisions are:
+
+- Bits 63 .. 18 (46 bits) is the value.
+- Bits 17 .. 16  (2 bits) is `LP` and `LN` (`link_prev` and `link_next`).
+- Bits 15 ..  0 (16 bits) is the length.
+
+The `LP` and `LN` bits denote tokens that are part of a multi-token chain:
+
+- `LP` means that this token is not the first (there is a previous token).
+- `LN` means that this token is not the last  (there is a next     token).
+
+A stand-alone token will have both link bits set to zero.
+
+```
++-----+-------------+-------+-------------+-----+-----+-----------+
+|  1  |      21     |   3   |      21     |  1  |  1  |     16    |
++-----+-------------+-------+-------------+-----+-----+-----------+
+[..................value..................]  LP    LN     length
+[..1..|..........~value_extension.........]
+[..0..|.value_major.|.....value_minor.....]
+[..0..|......0......|..VBC..|.....VBD.....]
+```
+
+The value bits can be sub-divided in multiple ways. First, the high bit:
+
+- Bits 63 .. 63  (1 bit)  denote an extended (1) or simple (0) token.
+
+
+### Extended Tokens
+
+- Bits 62 .. 18 (45 bits) is the bitwise-not (~) of the `value_extension`.
+
+Extended tokens are typically part of a multi-token chain whose first token is
+a simple token that provides the semantics for each `value_extension`.
+
+
+### Simple Tokens
+
+- Bits 62 .. 42 (21 bits) is the `value_major`.
+- Bits 41 .. 18 (24 bits) is the `value_minor`.
+
+The `value_major` is a 21-bit [Base38](/doc/note/base38-and-fourcc.md) number.
+For example, an HTML tokenizer might produce a combination of "base" tokens
+(see below) and tokens whose `value_major` is `0x109B0B`, the Base38 encoding
+of `html`. The `value_major` forms a namespace that distinguishes e.g.
+HTML-specific tokens from JSON-specific tokens.
+
+If `value_major` is non-zero then `value_minor` has whatever meaning the
+tokenizer's package assigns to it.
+
+
+### VBCs and VBDs
+
+A zero `value_major` is reserved for Wuffs' built-in "base" package. The
+`value_minor` is further sub-divided:
+
+ - Bits 41 .. 39  (3 bits) is the `VBC` (`value_base_category`).
+ - Bits 38 .. 18 (21 bits) is the `VBD` (`value_base_detail`).
+
+The high 46 bits (bits 63 .. 18) only have `VBC` and `VBD` semantics when the
+high 22 bits (the `extended` and `value_major` parts) are all zero. An
+equivalent test is that the high 25 bits (the notional `VBC`), as either an
+unsigned integer or a sign-extended integer, is in the range `0 ..= 7`.
+
+These `VBC`s organize tokens into broad groups, generally applicable (as
+opposed to being e.g. HTML-specific or JSON-specific). For example, strings and
+numbers are two `VBC`s. Structure is another, for container boundaries like the
+start and end of HTML elements and JSON arrays.
+
+Filler is yet another `VBC`. Such tokens can generally be ignored (other than
+accumulating their length). Filler is most often encountered as whitespace, but
+also includes JSON commas (which are [structurally
+inessential](https://www.tbray.org/ongoing/When/201x/2016/08/20/Fixing-JSON#p-1))
+and comments.
+
+The `VBD` semantics depend on the `VBC`. For example, at 21 bits, the `VBD` can
+hold every valid Unicode code point, up to U+10FFFF. A `\t` or `\u2603` in a
+JSON string can each be represented by a single `VBC__UNICODE_CODE_POINT` token
+whose `VBD` is `0x0009` or `0x2603`, meaning the Unicode code points U+0009
+CHARACTER TABULATION (the ASCII tab character) or U+2603 SNOWMAN.
+
+More details on the `VBC` and `VBD` bit assignments are in the [`source
+code`](/internal/cgen/base/token-public.h).
+
+
+## SAX/Pull versus DOM/Push
+
+For file formats that conceptually decode into a node tree, such as HTML or
+JSON, Wuffs typically provides a
+[SAX](https://en.wikipedia.org/wiki/Simple_API_for_XML)-like pull parser, not a
+[DOM](https://en.wikipedia.org/wiki/Document_Object_Model)-like push parser.
+There are general reasons for [favoring pull
+parsers](https://github.com/raphlinus/pulldown-cmark/blob/master/README.md#why-a-pull-parser),
+but also, Wuffs code cannot [dynamically allocate
+memory](/doc/note/memory-safety.md), nor does Wuffs have an `unsafe` keyword or
+a foreign function interface, so a caller cannot pass arbitrary callbacks into
+Wuffs code. Instead, Wuffs just outputs tokens and tokens are just `uint64_t`s.
+
+The [example/jsonfindptrs](/example/jsonfindptrs/jsonfindptrs.cc) program
+demonstrates creating a traditional DOM-like node tree from a SAX-like token
+stream. The [example/jsonptr](/example/jsonptr/jsonptr.cc) program demonstrates
+a different, lower-level approach that works directly on tokens, where the
+entire program (not just the Wuffs library) never calls `malloc`.
+
+
+## Example Token Stream
+
+```
+$ gcc script/print-json-token-debug-format.c && \
+>   ./a.out -all-tokens -human-readable < test/data/json-things.formatted.json
+pos=0x00000000  len=0x0001  link=0b00  vbc=1:Structure........  vbd=0x004011
+pos=0x00000001  len=0x0005  link=0b00  vbc=0:Filler...........  vbd=0x000000
+pos=0x00000006  len=0x0001  link=0b01  vbc=2:String...........  vbd=0x000013
+pos=0x00000007  len=0x0002  link=0b11  vbc=2:String...........  vbd=0x000021
+pos=0x00000009  len=0x0001  link=0b10  vbc=2:String...........  vbd=0x000013
+etc
+pos=0x00000094  len=0x0001  link=0b10  vbc=2:String...........  vbd=0x000013
+pos=0x00000095  len=0x0001  link=0b00  vbc=0:Filler...........  vbd=0x000000
+pos=0x00000096  len=0x0001  link=0b00  vbc=1:Structure........  vbd=0x001042
+```

diff --git a/internal/cgen/base/token-public.h b/internal/cgen/base/token-public.h
index 1484552..aa37f13 100644
--- a/internal/cgen/base/token-public.h
+++ b/internal/cgen/base/token-public.h

@@ -16,63 +16,10 @@
 
 // ---------------- Tokens
 
+// wuffs_base__token is an element of a byte stream's tokenization.
+//
+// See https://github.com/google/wuffs/blob/master/doc/note/tokens.md
 typedef struct {
-  // The repr's 64 bits are divided as:
-  //
-  // +-----+-------------+-------+-------------+-----+-----+-----------+
-  // |  1  |      21     |   3   |      21     |  1  |  1  |     16    |
-  // +-----+-------------+-------+-------------+-----+-----+-----------+
-  // [..................value..................]  LP    LN     length
-  // [..1..|..........~value_extension.........]
-  // [..0..|.value_major.|.....value_minor.....]
-  // [..0..|.........VBC.........|.....VBD.....]
-  //
-  // The broad divisions are:
-  //  - Bits 63 .. 18 (46 bits) is the value.
-  //  - Bits 17 .. 16 ( 2 bits) is LP and LN (link_prev and link_next).
-  //  - Bits 15 ..  0 (16 bits) is the length.
-  //
-  // ----
-  //
-  // The value bits can be sub-divided in multiple ways. First, the high bit:
-  //  - Bits 63 .. 63 ( 1 bits) is an extended (1) or simple (0) token.
-  //
-  // For extended tokens:
-  //  - Bits 62 .. 18 (45 bits) is the bitwise-not (~) of the value_extension.
-  //
-  // For simple tokens:
-  //  - Bits 62 .. 42 (21 bits) is the value_major.
-  //  - Bits 41 .. 18 (24 bits) is the value_minor.
-  //  - Bits 62 .. 39 (24 bits) is the VBC (value_base_category).
-  //  - Bits 38 .. 18 (21 bits) is the VBD (value_base_detail).
-  //
-  // The value_major is a 21-bit [Base38](doc/note/base38-and-fourcc.md) value.
-  // If all of its bits are zero (special cased for Wuffs' built-in "base"
-  // package) then the value_minor is further sub-divided:
-  //  - Bits 41 .. 39 ( 3 bits) is the VBC (value_base_category).
-  //  - Bits 38 .. 18 (21 bits) is the VBD (value_base_detail).
-  //
-  // The high 46 bits (bits 63 .. 18) only have VBC and VBD semantics when the
-  // high 22 bits (the value_major) are all zero. An equivalent test is that
-  // the high 25 bits (the notional VBC) has a value in the range 0 ..= 7.
-  //
-  // At 21 bits, the VBD can hold every valid Unicode code point.
-  //
-  // If value_major is non-zero then value_minor has whatever arbitrary meaning
-  // the tokenizer's package assigns to it.
-  //
-  // ----
-  //
-  // Multiple consecutive tokens can form a larger conceptual unit. For
-  // example, an "abc\tz" string is a single higher level concept but at the
-  // lower level, it could consist of multiple tokens: the quotes '"', the
-  // ASCII texts "abc" and "z" and the backslash-escaped tab '\t'. The LP and
-  // LN (link_prev and link_next) bits denote tokens that are part of a
-  // multi-token chain:
-  //  - LP means that this token is not the first (there is a previous token).
-  //  - LN means that this token is not the last  (there is a next     token).
-  //
-  // In particular, a stand-alone token will have both link bits set to zero.
   uint64_t repr;
 
 #ifdef __cplusplus
@@ -151,6 +98,10 @@
 // CONVERT_1_DST_4_SRC_BACKSLASH_X means a source like "\\x23\\x67\\xAB", where
 // 12 src bytes encode 3 dst bytes.
 //
+// Post-processing may further transform those D destination bytes (e.g. treat
+// "\\xFF" as the Unicode code point U+00FF instead of the byte 0xFF), but that
+// is out of scope of this VBD's semantics.
+//
 // When src is the empty string, multiple conversion algorithms are applicable
 // (so these bits are not necessarily mutually exclusive), all producing the
 // same empty dst string.

diff --git a/internal/cgen/data.go b/internal/cgen/data.go
index 0537a67..7d5d0e6 100644
--- a/internal/cgen/data.go
+++ b/internal/cgen/data.go

@@ -382,13 +382,7 @@
 	""
 
 const baseTokenPublicH = "" +
-	"// ---------------- Tokens\n\ntypedef struct {\n  // The repr's 64 bits are divided as:\n  //\n  // +-----+-------------+-------+-------------+-----+-----+-----------+\n  // |  1  |      21     |   3   |      21     |  1  |  1  |     16    |\n  // +-----+-------------+-------+-------------+-----+-----+-----------+\n  // [..................value..................]  LP    LN     length\n  // [..1..|..........~value_extension.........]\n  // [..0..|.value_major.|.....value_minor.....]\n  // [..0..|.........VBC.........|.....VBD.....]\n  //\n  // The broad divisions are:\n  //  - Bits 63 .. 18 (46 bits) is the value.\n  //  - Bits 17 .. 16 ( 2 bits) is LP and LN (link_prev and link_next).\n  //  - Bits 15 ..  0 (16 bits) is the length.\n  //\n  " +
-	"" +
-	"// ----\n  //\n  // The value bits can be sub-divided in multiple ways. First, the high bit:\n  //  - Bits 63 .. 63 ( 1 bits) is an extended (1) or simple (0) token.\n  //\n  // For extended tokens:\n  //  - Bits 62 .. 18 (45 bits) is the bitwise-not (~) of the value_extension.\n  //\n  // For simple tokens:\n  //  - Bits 62 .. 42 (21 bits) is the value_major.\n  //  - Bits 41 .. 18 (24 bits) is the value_minor.\n  //  - Bits 62 .. 39 (24 bits) is the VBC (value_base_category).\n  //  - Bits 38 .. 18 (21 bits) is the VBD (value_base_detail).\n  //\n  // The value_major is a 21-bit [Base38](doc/note/base38-and-fourcc.md) value.\n  // If all of its bits are zero (special cased for Wuffs' built-in \"base\"\n  // package) then the value_minor is further sub-divided:\n  //  - Bits 41 .. 39 ( 3 bits) is the VBC (value_base_category).\n  //  - Bits 38 .. 18 (21 bits) is the VBD (value_base_detail).\n  //\n  // The high 46 bits (bits 63 .. 18) only have VBC and VBD semantics when the\n  // high 22 bits (the value_major) are all zero. An eq" +
-	"uivalent test is that\n  // the high 25 bits (the notional VBC) has a value in the range 0 ..= 7.\n  //\n  // At 21 bits, the VBD can hold every valid Unicode code point.\n  //\n  // If value_major is non-zero then value_minor has whatever arbitrary meaning\n  // the tokenizer's package assigns to it.\n  //\n  " +
-	"" +
-	"// ----\n  //\n  // Multiple consecutive tokens can form a larger conceptual unit. For\n  // example, an \"abc\\tz\" string is a single higher level concept but at the\n  // lower level, it could consist of multiple tokens: the quotes '\"', the\n  // ASCII texts \"abc\" and \"z\" and the backslash-escaped tab '\\t'. The LP and\n  // LN (link_prev and link_next) bits denote tokens that are part of a\n  // multi-token chain:\n  //  - LP means that this token is not the first (there is a previous token).\n  //  - LN means that this token is not the last  (there is a next     token).\n  //\n  // In particular, a stand-alone token will have both link bits set to zero.\n  uint64_t repr;\n\n#ifdef __cplusplus\n  inline int64_t value() const;\n  inline int64_t value_extension() const;\n  inline int64_t value_major() const;\n  inline int64_t value_base_category() const;\n  inline uint64_t value_minor() const;\n  inline uint64_t value_base_detail() const;\n  inline bool link_prev() const;\n  inline bool link_next() const;\n  inline uint64_t length() " +
-	"const;\n#endif  // __cplusplus\n\n} wuffs_base__token;\n\nstatic inline wuffs_base__token  //\nwuffs_base__make_token(uint64_t repr) {\n  wuffs_base__token ret;\n  ret.repr = repr;\n  return ret;\n}\n\n  " +
+	"// ---------------- Tokens\n\n// wuffs_base__token is an element of a byte stream's tokenization.\n//\n// See https://github.com/google/wuffs/blob/master/doc/note/tokens.md\ntypedef struct {\n  uint64_t repr;\n\n#ifdef __cplusplus\n  inline int64_t value() const;\n  inline int64_t value_extension() const;\n  inline int64_t value_major() const;\n  inline int64_t value_base_category() const;\n  inline uint64_t value_minor() const;\n  inline uint64_t value_base_detail() const;\n  inline bool link_prev() const;\n  inline bool link_next() const;\n  inline uint64_t length() const;\n#endif  // __cplusplus\n\n} wuffs_base__token;\n\nstatic inline wuffs_base__token  //\nwuffs_base__make_token(uint64_t repr) {\n  wuffs_base__token ret;\n  ret.repr = repr;\n  return ret;\n}\n\n  " +
 	"" +
 	"// --------\n\n#define WUFFS_BASE__TOKEN__LENGTH__MAX_INCL 0xFFFF\n\n#define WUFFS_BASE__TOKEN__VALUE__SHIFT 18\n#define WUFFS_BASE__TOKEN__VALUE_EXTENSION__SHIFT 18\n#define WUFFS_BASE__TOKEN__VALUE_MAJOR__SHIFT 42\n#define WUFFS_BASE__TOKEN__VALUE_BASE_CATEGORY__SHIFT 39\n#define WUFFS_BASE__TOKEN__VALUE_MINOR__SHIFT 18\n#define WUFFS_BASE__TOKEN__VALUE_BASE_DETAIL__SHIFT 18\n#define WUFFS_BASE__TOKEN__LINK__SHIFT 16\n#define WUFFS_BASE__TOKEN__LENGTH__SHIFT 0\n\n#define WUFFS_BASE__TOKEN__LINK_PREV 0x20000\n#define WUFFS_BASE__TOKEN__LINK_NEXT 0x10000\n\n  " +
 	"" +
@@ -398,8 +392,8 @@
 	"" +
 	"// --------\n\n#define WUFFS_BASE__TOKEN__VBD__STRUCTURE__PUSH 0x00001\n#define WUFFS_BASE__TOKEN__VBD__STRUCTURE__POP 0x00002\n#define WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_NONE 0x00010\n#define WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST 0x00020\n#define WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_DICT 0x00040\n#define WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_NONE 0x01000\n#define WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST 0x02000\n#define WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_DICT 0x04000\n\n" +
 	"" +
-	"// --------\n\n// \"DEFINITELY_FOO\" means that the destination bytes (and also the source\n// bytes, for 1_DST_1_SRC_COPY) are in the FOO format. Definitely means that\n// the lack of the bit is conservative: it is valid for all-ASCII strings to\n// have neither DEFINITELY_UTF_8 or DEFINITELY_ASCII bits set.\n#define WUFFS_BASE__TOKEN__VBD__STRING__DEFINITELY_UTF_8 0x00001\n#define WUFFS_BASE__TOKEN__VBD__STRING__DEFINITELY_ASCII 0x00002\n\n// \"CONVERT_D_DST_S_SRC\" means that multiples of S source bytes (possibly\n// padded) produces multiples of D destination bytes. For example,\n// CONVERT_1_DST_4_SRC_BACKSLASH_X means a source like \"\\\\x23\\\\x67\\\\xAB\", where\n// 12 src bytes encode 3 dst bytes.\n//\n// When src is the empty string, multiple conversion algorithms are applicable\n// (so these bits are not necessarily mutually exclusive), all producing the\n// same empty dst string.\n#define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_0_DST_1_SRC_DROP 0x00010\n#define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY 0x00020\n#" +
-	"define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_2_SRC_HEXADECIMAL 0x00040\n#define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_4_SRC_BACKSLASH_X 0x00080\n#define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_3_DST_4_SRC_BASE_64_STD 0x00100\n#define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_3_DST_4_SRC_BASE_64_URL 0x00200\n#define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_4_DST_5_SRC_ASCII_85 0x00400\n\n  " +
+	"// --------\n\n// \"DEFINITELY_FOO\" means that the destination bytes (and also the source\n// bytes, for 1_DST_1_SRC_COPY) are in the FOO format. Definitely means that\n// the lack of the bit is conservative: it is valid for all-ASCII strings to\n// have neither DEFINITELY_UTF_8 or DEFINITELY_ASCII bits set.\n#define WUFFS_BASE__TOKEN__VBD__STRING__DEFINITELY_UTF_8 0x00001\n#define WUFFS_BASE__TOKEN__VBD__STRING__DEFINITELY_ASCII 0x00002\n\n// \"CONVERT_D_DST_S_SRC\" means that multiples of S source bytes (possibly\n// padded) produces multiples of D destination bytes. For example,\n// CONVERT_1_DST_4_SRC_BACKSLASH_X means a source like \"\\\\x23\\\\x67\\\\xAB\", where\n// 12 src bytes encode 3 dst bytes.\n//\n// Post-processing may further transform those D destination bytes (e.g. treat\n// \"\\\\xFF\" as the Unicode code point U+00FF instead of the byte 0xFF), but that\n// is out of scope of this VBD's semantics.\n//\n// When src is the empty string, multiple conversion algorithms are applicable\n// (so these bits are not necessarily mutual" +
+	"ly exclusive), all producing the\n// same empty dst string.\n#define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_0_DST_1_SRC_DROP 0x00010\n#define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY 0x00020\n#define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_2_SRC_HEXADECIMAL 0x00040\n#define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_4_SRC_BACKSLASH_X 0x00080\n#define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_3_DST_4_SRC_BASE_64_STD 0x00100\n#define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_3_DST_4_SRC_BASE_64_URL 0x00200\n#define WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_4_DST_5_SRC_ASCII_85 0x00400\n\n  " +
 	"" +
 	"// --------\n\n#define WUFFS_BASE__TOKEN__VBD__LITERAL__UNDEFINED 0x00001\n#define WUFFS_BASE__TOKEN__VBD__LITERAL__NULL 0x00002\n#define WUFFS_BASE__TOKEN__VBD__LITERAL__FALSE 0x00004\n#define WUFFS_BASE__TOKEN__VBD__LITERAL__TRUE 0x00008\n\n  " +
 	"" +

diff --git a/release/c/wuffs-unsupported-snapshot.c b/release/c/wuffs-unsupported-snapshot.c
index b5cca28..9b0b950 100644
--- a/release/c/wuffs-unsupported-snapshot.c
+++ b/release/c/wuffs-unsupported-snapshot.c

@@ -1875,63 +1875,10 @@
 
 // ---------------- Tokens
 
+// wuffs_base__token is an element of a byte stream's tokenization.
+//
+// See https://github.com/google/wuffs/blob/master/doc/note/tokens.md
 typedef struct {
-  // The repr's 64 bits are divided as:
-  //
-  // +-----+-------------+-------+-------------+-----+-----+-----------+
-  // |  1  |      21     |   3   |      21     |  1  |  1  |     16    |
-  // +-----+-------------+-------+-------------+-----+-----+-----------+
-  // [..................value..................]  LP    LN     length
-  // [..1..|..........~value_extension.........]
-  // [..0..|.value_major.|.....value_minor.....]
-  // [..0..|.........VBC.........|.....VBD.....]
-  //
-  // The broad divisions are:
-  //  - Bits 63 .. 18 (46 bits) is the value.
-  //  - Bits 17 .. 16 ( 2 bits) is LP and LN (link_prev and link_next).
-  //  - Bits 15 ..  0 (16 bits) is the length.
-  //
-  // ----
-  //
-  // The value bits can be sub-divided in multiple ways. First, the high bit:
-  //  - Bits 63 .. 63 ( 1 bits) is an extended (1) or simple (0) token.
-  //
-  // For extended tokens:
-  //  - Bits 62 .. 18 (45 bits) is the bitwise-not (~) of the value_extension.
-  //
-  // For simple tokens:
-  //  - Bits 62 .. 42 (21 bits) is the value_major.
-  //  - Bits 41 .. 18 (24 bits) is the value_minor.
-  //  - Bits 62 .. 39 (24 bits) is the VBC (value_base_category).
-  //  - Bits 38 .. 18 (21 bits) is the VBD (value_base_detail).
-  //
-  // The value_major is a 21-bit [Base38](doc/note/base38-and-fourcc.md) value.
-  // If all of its bits are zero (special cased for Wuffs' built-in "base"
-  // package) then the value_minor is further sub-divided:
-  //  - Bits 41 .. 39 ( 3 bits) is the VBC (value_base_category).
-  //  - Bits 38 .. 18 (21 bits) is the VBD (value_base_detail).
-  //
-  // The high 46 bits (bits 63 .. 18) only have VBC and VBD semantics when the
-  // high 22 bits (the value_major) are all zero. An equivalent test is that
-  // the high 25 bits (the notional VBC) has a value in the range 0 ..= 7.
-  //
-  // At 21 bits, the VBD can hold every valid Unicode code point.
-  //
-  // If value_major is non-zero then value_minor has whatever arbitrary meaning
-  // the tokenizer's package assigns to it.
-  //
-  // ----
-  //
-  // Multiple consecutive tokens can form a larger conceptual unit. For
-  // example, an "abc\tz" string is a single higher level concept but at the
-  // lower level, it could consist of multiple tokens: the quotes '"', the
-  // ASCII texts "abc" and "z" and the backslash-escaped tab '\t'. The LP and
-  // LN (link_prev and link_next) bits denote tokens that are part of a
-  // multi-token chain:
-  //  - LP means that this token is not the first (there is a previous token).
-  //  - LN means that this token is not the last  (there is a next     token).
-  //
-  // In particular, a stand-alone token will have both link bits set to zero.
   uint64_t repr;
 
 #ifdef __cplusplus
@@ -2010,6 +1957,10 @@
 // CONVERT_1_DST_4_SRC_BACKSLASH_X means a source like "\\x23\\x67\\xAB", where
 // 12 src bytes encode 3 dst bytes.
 //
+// Post-processing may further transform those D destination bytes (e.g. treat
+// "\\xFF" as the Unicode code point U+00FF instead of the byte 0xFF), but that
+// is out of scope of this VBD's semantics.
+//
 // When src is the empty string, multiple conversion algorithms are applicable
 // (so these bits are not necessarily mutually exclusive), all producing the
 // same empty dst string.