Some Wuffs codecs transform between byte streams and token streams. For example, a JSON decoder might transform
Bytes [ 1 , t r u e , " a b c \ t x y z " ] Tokens 0 1 2 3......... 4 5 6...... 7... 8...... 9 10 Chains - - - ---------- - ---------------------------- -
The tokens partition the bytes. Each byte belongs to exactly one token. Each token spans zero or more bytes.
Nonetheless, conceptually, if the byte stream was a single contiguous slice then each token would correspond to a sub-slice, one whose length was the token length and whose position was the sum of all previous tokens' lengths.
Consecutive tokens can also form token chains, which capture higher level concepts. For example, the JSON string
"abc\txyz" can correspond to multiple tokens. One reason is that the maximum token length is
65535 bytes, but JSON strings can be longer. Another reason is that different parts of the encoded string are treated differently when reconstructing the decoded string:
"is a no-op.
\tshould produce a tab character.
A token is just a
uint64_t. The broad divisions are:
LN bits denote tokens that are part of a multi-token chain:
LPmeans that this token is not the first (there is a previous token).
LNmeans that this token is not the last (there is a next token).
A stand-alone token will have both link bits set to zero.
+-----+-------------+-------+-------------+-----+-----+-----------+ | 1 | 21 | 3 | 21 | 1 | 1 | 16 | +-----+-------------+-------+-------------+-----+-----+-----------+ [..................value..................] LP LN length [..1..|..........~value_extension.........] [..0..|.value_major.|.....value_minor.....] [..0..|......0......|..VBC..|.....VBD.....]
The value bits can be sub-divided in multiple ways. First, the high bit:
Extended tokens are typically part of a multi-token chain whose first token is a simple token that provides the semantics for each
value_major is a 21-bit Base38 number. For example, an HTML tokenizer might produce a combination of “base” tokens (see below) and tokens whose
0x109B0B, the Base38 encoding of
value_major forms a namespace that distinguishes e.g. HTML-specific tokens from JSON-specific tokens.
value_major is non-zero then
value_minor has whatever meaning the tokenizer's package assigns to it.
value_major is reserved for Wuffs' built-in “base” package. The
value_minor is further sub-divided:
The high 46 bits (bits 63 .. 18) only have
VBD semantics when the high 22 bits (the
value_major parts) are all zero. An equivalent test is that the high 25 bits (the notional
VBC), as either an unsigned integer or a sign-extended integer, is in the range
0 ..= 7.
VBCs organize tokens into broad groups, generally applicable (as opposed to being e.g. HTML-specific or JSON-specific). For example, strings and numbers are two
VBCs. Structure is another, for container boundaries like the start and end of HTML elements and JSON arrays.
Filler is yet another
VBC. Such tokens can generally be ignored (other than accumulating their length). Filler is most often encountered as whitespace, but also includes JSON commas (which are structurally inessential) and comments.
VBD semantics depend on the
VBC. For example, at 21 bits, the
VBD can hold every valid Unicode code point, up to U+10FFFF. A
\u2603 in a JSON string can each be represented by a single
VBC__UNICODE_CODE_POINT token whose
0x2603, meaning the Unicode code points U+0009 CHARACTER TABULATION (the ASCII tab character) or U+2603 SNOWMAN.
More details on the
VBD bit assignments are in the
For file formats that conceptually decode into a node tree, such as HTML or JSON, Wuffs typically provides a SAX-like pull parser, not a DOM-like push parser. There are general reasons for favoring pull parsers, but also, Wuffs code cannot dynamically allocate memory, nor does Wuffs have an
unsafe keyword or a foreign function interface, so a caller cannot pass arbitrary callbacks into Wuffs code. Instead, Wuffs just outputs tokens and tokens are just
The example/jsonfindptrs program demonstrates creating a traditional DOM-like node tree from a SAX-like token stream. The example/jsonptr program demonstrates a different, lower-level approach that works directly on tokens, where the entire program (not just the Wuffs library) never calls
$ gcc script/print-json-token-debug-format.c && \ > ./a.out -all-tokens -human-readable < test/data/json-things.formatted.json pos=0x00000000 len=0x0001 link=0b00 vbc=1:Structure........ vbd=0x004011 pos=0x00000001 len=0x0005 link=0b00 vbc=0:Filler........... vbd=0x000000 pos=0x00000006 len=0x0001 link=0b01 vbc=2:String........... vbd=0x000013 pos=0x00000007 len=0x0002 link=0b11 vbc=2:String........... vbd=0x000021 pos=0x00000009 len=0x0001 link=0b10 vbc=2:String........... vbd=0x000013 etc pos=0x00000094 len=0x0001 link=0b10 vbc=2:String........... vbd=0x000013 pos=0x00000095 len=0x0001 link=0b00 vbc=0:Filler........... vbd=0x000000 pos=0x00000096 len=0x0001 link=0b00 vbc=1:Structure........ vbd=0x001042