Tokens

Some Wuffs codecs transform between byte streams and token streams. For example, a JSON decoder might transform [1,true,"abc\txyz"] as

Bytes    [  1  ,  t  r  u  e  ,  "  a  b  c  \  t  x  y  z  "  ]
Tokens   0  1  2  3.........  4  5  6......  7...  8......  9  10
Chains   -  -  -  ----------  -  ----------------------------  -

The tokens partition the bytes. Each byte belongs to exactly one token. Each token spans zero or more bytes.

Just as I/O buffers and coroutines mean that a byte stream doesn't need to be entirely in memory at any point, token buffers similarly allow a token stream to be produced and consumed incrementally.

Nonetheless, conceptually, if the byte stream was a single contiguous slice then each token would correspond to a sub-slice, one whose length was the token length and whose position was the sum of all previous tokens' lengths.

Consecutive tokens can also form token chains, which capture higher level concepts. For example, the JSON string "abc\txyz" can correspond to multiple tokens. One reason is that the maximum token length is 65535 bytes, but JSON strings can be longer. Another reason is that different parts of the encoded string are treated differently when reconstructing the decoded string:

  • Decoding the first or last " is a no-op.
  • Decoding abc or xyz is a memcpy.
  • Decoding the backslash-escaped \t should produce a tab character.

Representation

A token is just a uint64_t. The broad divisions are:

  • Bits 63 .. 18 (46 bits) is the value.
  • Bits 17 .. 16 (2 bits) is LP and LN (link_prev and link_next).
  • Bits 15 .. 0 (16 bits) is the length.

The LP and LN bits denote tokens that are part of a multi-token chain:

  • LP means that this token is not the first (there is a previous token).
  • LN means that this token is not the last (there is a next token).

A stand-alone token will have both link bits set to zero.

+-----+-------------+-------+-------------+-----+-----+-----------+
|  1  |      21     |   3   |      21     |  1  |  1  |     16    |
+-----+-------------+-------+-------------+-----+-----+-----------+
[..................value..................]  LP    LN     length
[..1..|..........~value_extension.........]
[..0..|.value_major.|.....value_minor.....]
[..0..|......0......|..VBC..|.....VBD.....]

The value bits can be sub-divided in multiple ways. First, the high bit:

  • Bits 63 .. 63 (1 bit) denote an extended (1) or simple (0) token.

Extended Tokens

  • Bits 62 .. 18 (45 bits) is the bitwise-not (~) of the value_extension.

Extended tokens are typically part of a multi-token chain whose first token is a simple token that provides the semantics for each value_extension.

Simple Tokens

  • Bits 62 .. 42 (21 bits) is the value_major.
  • Bits 41 .. 18 (24 bits) is the value_minor.

The value_major is a 21-bit Base38 number. For example, an HTML tokenizer might produce a combination of “base” tokens (see below) and tokens whose value_major is 0x109B0B, the Base38 encoding of html. The value_major forms a namespace that distinguishes e.g. HTML-specific tokens from JSON-specific tokens.

If value_major is non-zero then value_minor has whatever meaning the tokenizer's package assigns to it.

VBCs and VBDs

A zero value_major is reserved for Wuffs' built-in “base” package. The value_minor is further sub-divided:

  • Bits 41 .. 39 (3 bits) is the VBC (value_base_category).
  • Bits 38 .. 18 (21 bits) is the VBD (value_base_detail).

The high 46 bits (bits 63 .. 18) only have VBC and VBD semantics when the high 22 bits (the extended and value_major parts) are all zero. An equivalent test is that the high 25 bits (the notional VBC), as either an unsigned integer or a sign-extended integer, is in the range 0 ..= 7.

These VBCs organize tokens into broad groups, generally applicable (as opposed to being e.g. HTML-specific or JSON-specific). For example, strings and numbers are two VBCs. Structure is another, for container boundaries like the start and end of HTML elements and JSON arrays.

Filler is yet another VBC. Such tokens can generally be ignored (other than accumulating their length). Filler is most often encountered as whitespace, but also includes JSON commas (which are structurally inessential) and comments.

The VBD semantics depend on the VBC. For example, at 21 bits, the VBD can hold every valid Unicode code point, up to U+10FFFF. A \t or \u2603 in a JSON string can each be represented by a single VBC__UNICODE_CODE_POINT token whose VBD is 0x0009 or 0x2603, meaning the Unicode code points U+0009 CHARACTER TABULATION (the ASCII tab character) or U+2603 SNOWMAN.

More details on the VBC and VBD bit assignments are in the source code.

SAX/Pull versus DOM/Push

For file formats that conceptually decode into a node tree, such as HTML or JSON, Wuffs typically provides a SAX-like pull parser, not a DOM-like push parser. There are general reasons for favoring pull parsers, but also, Wuffs code cannot dynamically allocate memory, nor does Wuffs have an unsafe keyword or a foreign function interface, so a caller cannot pass arbitrary callbacks into Wuffs code. Instead, Wuffs just outputs tokens and tokens are just uint64_ts.

The example/jsonfindptrs program demonstrates creating a traditional DOM-like node tree from a SAX-like token stream. The example/jsonptr program demonstrates a different, lower-level approach that works directly on tokens, where the entire program (not just the Wuffs library) never calls malloc.

Example Token Stream

$ gcc script/print-json-token-debug-format.c && \
>   ./a.out -all-tokens -human-readable < test/data/json-things.formatted.json
pos=0x00000000  len=0x0001  link=0b00  vbc=1:Structure........  vbd=0x004011
pos=0x00000001  len=0x0005  link=0b00  vbc=0:Filler...........  vbd=0x000000
pos=0x00000006  len=0x0001  link=0b01  vbc=2:String...........  vbd=0x000013
pos=0x00000007  len=0x0002  link=0b11  vbc=2:String...........  vbd=0x000021
pos=0x00000009  len=0x0001  link=0b10  vbc=2:String...........  vbd=0x000013
etc
pos=0x00000094  len=0x0001  link=0b10  vbc=2:String...........  vbd=0x000013
pos=0x00000095  len=0x0001  link=0b00  vbc=0:Filler...........  vbd=0x000000
pos=0x00000096  len=0x0001  link=0b00  vbc=1:Structure........  vbd=0x001042