Some Wuffs codecs transform between byte streams and token streams. For example, a JSON decoder might transform [1,true,"abc\txyz"]
as
Bytes [ 1 , t r u e , " a b c \ t x y z " ] Tokens 0 1 2 3......... 4 5 6...... 7... 8...... 9 10 Chains - - - ---------- - ---------------------------- -
The tokens partition the bytes. Each byte belongs to exactly one token. Each token spans zero or more bytes.
Just as I/O buffers and coroutines mean that a byte stream doesn't need to be entirely in memory at any point, token buffers similarly allow a token stream to be produced and consumed incrementally.
Nonetheless, conceptually, if the byte stream was a single contiguous slice then each token would correspond to a sub-slice, one whose length was the token length and whose position was the sum of all previous tokens' lengths.
Consecutive tokens can also form token chains, which capture higher level concepts. For example, the JSON string "abc\txyz"
can correspond to multiple tokens. One reason is that the maximum token length is 65535
bytes, but JSON strings can be longer. Another reason is that different parts of the encoded string are treated differently when reconstructing the decoded string:
"
is a no-op.abc
or xyz
is a memcpy
.\t
should produce a tab character.A token is just a uint64_t
. The broad divisions are:
17 ..= 63
(47 bits) are the value.16
(1 bit) is the continued
bit.0 ..= 15
(16 bits) are the length.The continued
bit is whether that the token chain for this token also contains the next token. The final token in a token chain, including stand-alone tokens, will have the continued
bit set to zero.
+-----+-------------+-------+-------------+-----+-----------+ | 1 | 21 | 4 | 21 | 1 | 16 | +-----+-------------+-------+-------------+-----+-----------+ [..................value..................] con length [..1..|..........~value_extension.........] [..0..|.value_major.|.....value_minor.....] [..0..|......0......|..VBC..|.....VBD.....]
The value bits can be sub-divided in multiple ways. First, the high bit:
63
(1 bit) denotes an extended (1) or simple (0) token.17 ..= 62
(46 bits) are the bitwise-not (~) of the value_extension
.Extended tokens are typically part of a multi-token chain whose first token is a simple token that provides the semantics for each value_extension
.
42 ..= 62
(21 bits) are the value_major
.17 ..= 41
(25 bits) are the value_minor
.The value_major
is a 21-bit Base38 number. For example, an HTML tokenizer might produce a combination of “base” tokens (see below) and tokens whose value_major
is 0x109B0B
, the Base38 encoding of html
. The value_major
forms a namespace that distinguishes e.g. HTML-specific tokens from JSON-specific tokens.
If value_major
is non-zero then value_minor
has whatever meaning the tokenizer‘s package assigns to it. For example, combining its low 18 bits with a follow-up extended token’s 46 value_extension
bits could form a 64-bit overall value, whose semnatics depend on the value_minor
's high 7 bits.
A zero value_major
is reserved for Wuffs' built-in “base” package. The value_minor
is further sub-divided:
38 ..= 41
(4 bits) are the VBC
(value_base_category
).17 ..= 37
(21 bits) are the VBD
(value_base_detail
).The high 47 bits (bits 17 ..= 63
) only have VBC
and VBD
semantics when the high 22 bits (the extended
and value_major
parts) are all zero. An equivalent test is that the high 26 bits (the notional VBC
), as either an unsigned integer or a sign-extended integer, is in the range 0 ..= 15
.
These VBC
s organize tokens into broad groups, generally applicable (as opposed to being e.g. HTML-specific or JSON-specific). For example, strings and numbers are two VBC
s. Structure is another, for container boundaries like the start and end of HTML elements and JSON arrays.
Filler is yet another VBC
. Such tokens can generally be ignored (other than accumulating their length). Filler is most often encountered as whitespace, but also includes JSON commas (which are structurally inessential) and comments.
The VBD
semantics depend on the VBC
. For example, at 21 bits, the VBD
can hold every valid Unicode code point, up to U+10FFFF. A \t
or \u2603
in a JSON string can each be represented by a single VBC__UNICODE_CODE_POINT
token whose VBD
is 0x0009
or 0x2603
, meaning the Unicode code points U+0009 CHARACTER TABULATION (the ASCII tab character) or U+2603 SNOWMAN.
More details on the VBC
and VBD
bit assignments are in the source code
.
For file formats that conceptually decode into a node tree, such as HTML or JSON, Wuffs typically provides a SAX-like pull parser, not a DOM-like push parser. There are general reasons for favoring pull parsers, but also, Wuffs code cannot dynamically allocate memory, nor does Wuffs have an unsafe
keyword or a foreign function interface, so a caller cannot pass arbitrary callbacks into Wuffs code. Instead, Wuffs just outputs tokens and tokens are just uint64_t
s.
The example/jsonfindptrs program demonstrates creating a traditional DOM-like node tree from a SAX-like token stream. The example/jsonptr program demonstrates a lower-level approach that works directly on tokens, where the entire program (not just the Wuffs library) never calls malloc
.
$ gcc script/print-json-token-debug-format.c -o pjtdf $ ./pjtdf -all-tokens -human-readable < test/data/json-things.formatted.json pos=0x00000000 len=0x0001 con=0 vbc=1:Structure........ vbd=0x004011 pos=0x00000001 len=0x0005 con=0 vbc=0:Filler........... vbd=0x000000 pos=0x00000006 len=0x0001 con=1 vbc=2:String........... vbd=0x000113 pos=0x00000007 len=0x0002 con=1 vbc=2:String........... vbd=0x000203 pos=0x00000009 len=0x0001 con=0 vbc=2:String........... vbd=0x000113 etc pos=0x00000090 len=0x0002 con=1 vbc=3:UnicodeCodePoint. vbd=0x00000A pos=0x00000092 len=0x0002 con=1 vbc=3:UnicodeCodePoint. vbd=0x00000A pos=0x00000094 len=0x0001 con=0 vbc=2:String........... vbd=0x000113 pos=0x00000095 len=0x0001 con=0 vbc=0:Filler........... vbd=0x000000 pos=0x00000096 len=0x0001 con=0 vbc=1:Structure........ vbd=0x001042