Some Wuffs codecs transform between byte streams and token streams. For example, a JSON decoder might transform
Bytes [ 1 , t r u e , " a b c \ t x y z " ] Tokens 0 1 2 3......... 4 5 6...... 7... 8...... 9 10 Chains - - - ---------- - ---------------------------- -
The tokens partition the bytes. Each byte belongs to exactly one token. Each token spans zero or more bytes.
Just as I/O buffers and coroutines mean that a byte stream doesn't need to be entirely in memory at any point, token buffers similarly allow a token stream to be produced and consumed incrementally.
Nonetheless, conceptually, if the byte stream was a single contiguous slice then each token would correspond to a sub-slice, one whose length was the token length and whose position was the sum of all previous tokens' lengths.
Consecutive tokens can also form token chains, which capture higher level concepts. For example, the JSON string
"abc\txyz" can correspond to multiple tokens. One reason is that the maximum token length is
65535 bytes, but JSON strings can be longer. Another reason is that different parts of the encoded string are treated differently when reconstructing the decoded string:
"is a no-op.
\tshould produce a tab character.
A token is just a
uint64_t. The broad divisions are:
17 ..= 63(47 bits) are the value.
16(1 bit) is the
0 ..= 15(16 bits) are the length.
continued bit is whether that the token chain for this token also contains the next token. The final token in a token chain, including stand-alone tokens, will have the
continued bit set to zero.
+-----+-------------+-------+-------------+-----+-----------+ | 1 | 21 | 4 | 21 | 1 | 16 | +-----+-------------+-------+-------------+-----+-----------+ [..................value..................] con length [..1..|..........~value_extension.........] [..0..|.value_major.|.....value_minor.....] [..0..|......0......|..VBC..|.....VBD.....]
The value bits can be sub-divided in multiple ways. First, the high bit:
63(1 bit) denotes an extended (1) or simple (0) token.
17 ..= 62(46 bits) are the bitwise-not (~) of the
Extended tokens are typically part of a multi-token chain whose first token is a simple token that provides the semantics for each
42 ..= 62(21 bits) are the
17 ..= 41(25 bits) are the
value_major is a 21-bit Base38 number. For example, an HTML tokenizer might produce a combination of “base” tokens (see below) and tokens whose
0x109B0B, the Base38 encoding of
value_major forms a namespace that distinguishes e.g. HTML-specific tokens from JSON-specific tokens.
value_major is non-zero then
value_minor has whatever meaning the tokenizer‘s package assigns to it. For example, combining its low 18 bits with a follow-up extended token’s 46
value_extension bits could form a 64-bit overall value, whose semnatics depend on the
value_minor's high 7 bits.
value_major is reserved for Wuffs' built-in “base” package. The
value_minor is further sub-divided:
38 ..= 41(4 bits) are the
17 ..= 37(21 bits) are the
The high 47 bits (bits
17 ..= 63) only have
VBD semantics when the high 22 bits (the
value_major parts) are all zero. An equivalent test is that the high 26 bits (the notional
VBC), as either an unsigned integer or a sign-extended integer, is in the range
0 ..= 15.
VBCs organize tokens into broad groups, generally applicable (as opposed to being e.g. HTML-specific or JSON-specific). For example, strings and numbers are two
VBCs. Structure is another, for container boundaries like the start and end of HTML elements and JSON arrays.
Filler is yet another
VBC. Such tokens can generally be ignored (other than accumulating their length). Filler is most often encountered as whitespace, but also includes JSON commas (which are structurally inessential) and comments.
VBD semantics depend on the
VBC. For example, at 21 bits, the
VBD can hold every valid Unicode code point, up to U+10FFFF. A
\u2603 in a JSON string can each be represented by a single
VBC__UNICODE_CODE_POINT token whose
0x2603, meaning the Unicode code points U+0009 CHARACTER TABULATION (the ASCII tab character) or U+2603 SNOWMAN.
More details on the
VBD bit assignments are in the
For file formats that conceptually decode into a node tree, such as HTML or JSON, Wuffs typically provides a SAX-like pull parser, not a DOM-like push parser. There are general reasons for favoring pull parsers, but also, Wuffs code cannot dynamically allocate memory, nor does Wuffs have an
unsafe keyword or a foreign function interface, so a caller cannot pass arbitrary callbacks into Wuffs code. Instead, Wuffs just outputs tokens and tokens are just
The example/jsonfindptrs program demonstrates creating a traditional DOM-like node tree from a SAX-like token stream. The example/jsonptr program demonstrates a lower-level approach that works directly on tokens, where the entire program (not just the Wuffs library) never calls
$ gcc script/print-json-token-debug-format.c -o pjtdf $ ./pjtdf -all-tokens -human-readable < test/data/json-things.formatted.json pos=0x00000000 len=0x0001 con=0 vbc=1:Structure........ vbd=0x004011 pos=0x00000001 len=0x0005 con=0 vbc=0:Filler........... vbd=0x000000 pos=0x00000006 len=0x0001 con=1 vbc=2:String........... vbd=0x000113 pos=0x00000007 len=0x0002 con=1 vbc=2:String........... vbd=0x000203 pos=0x00000009 len=0x0001 con=0 vbc=2:String........... vbd=0x000113 etc pos=0x00000090 len=0x0002 con=1 vbc=3:UnicodeCodePoint. vbd=0x00000A pos=0x00000092 len=0x0002 con=1 vbc=3:UnicodeCodePoint. vbd=0x00000A pos=0x00000094 len=0x0001 con=0 vbc=2:String........... vbd=0x000113 pos=0x00000095 len=0x0001 con=0 vbc=0:Filler........... vbd=0x000000 pos=0x00000096 len=0x0001 con=0 vbc=1:Structure........ vbd=0x001042