blob: 3480b6cc4ae0b155b006aed2a1b9055113efd7e0 [file] [log] [blame] [view]
# Tokens
Some Wuffs codecs transform between byte streams and token streams. For
example, a JSON decoder might transform `[1,true,"abc\txyz"]` as
```
Bytes [ 1 , t r u e , " a b c \ t x y z " ]
Tokens 0 1 2 3......... 4 5 6...... 7... 8...... 9 10
Chains - - - ---------- - ---------------------------- -
```
The tokens partition the bytes. Each byte belongs to exactly one token. Each
token spans zero or more bytes.
Just as [I/O buffers](/doc/note/io-input-output.md) and
[coroutines](/doc/note/coroutines.md) mean that a byte stream doesn't need to
be entirely in memory at any point, token buffers similarly allow a token
stream to be produced and consumed incrementally.
Nonetheless, conceptually, if the byte stream was a single contiguous slice
then each token would correspond to a sub-slice, one whose length was the token
length and whose position was the sum of all previous tokens' lengths.
Consecutive tokens can also form token chains, which capture higher level
concepts. For example, the JSON string `"abc\txyz"` can correspond to multiple
tokens. One reason is that the maximum token length is `65535` bytes, but JSON
strings can be longer. Another reason is that different parts of the encoded
string are treated differently when reconstructing the decoded string:
- Decoding the first or last `"` is a no-op.
- Decoding `abc` or `xyz` is a `memcpy`.
- Decoding the backslash-escaped `\t` should produce a tab character.
## Representation
A token is just a `uint64_t`. The broad divisions are:
- Bits `17 ..= 63` (47 bits) are the value.
- Bit `16` (1 bit) is the `continued` bit.
- Bits `0 ..= 15` (16 bits) are the length.
The `continued` bit is whether that the token chain for this token also
contains the next token. The final token in a token chain, including
stand-alone tokens, will have the `continued` bit set to zero.
```
+-----+-------------+-------+-------------+-----+-----------+
| 1 | 21 | 4 | 21 | 1 | 16 |
+-----+-------------+-------+-------------+-----+-----------+
[..................value..................] con length
[..1..|..........~value_extension.........]
[..0..|.value_major.|.....value_minor.....]
[..0..|......0......|..VBC..|.....VBD.....]
```
The value bits can be sub-divided in multiple ways. First, the high bit:
- Bit `63` (1 bit) denotes an extended (1) or simple (0) token.
### Extended Tokens
- Bits `17 ..= 62` (46 bits) are the bitwise-not (~) of the `value_extension`.
Extended tokens are typically part of a multi-token chain whose first token is
a simple token that provides the semantics for each `value_extension`.
### Simple Tokens
- Bits `42 ..= 62` (21 bits) are the `value_major`.
- Bits `17 ..= 41` (25 bits) are the `value_minor`.
The `value_major` is a 21-bit [Base38](/doc/note/base38-and-fourcc.md) number.
For example, an HTML tokenizer might produce a combination of "base" tokens
(see below) and tokens whose `value_major` is `0x109B0B`, the Base38 encoding
of `html`. The `value_major` forms a namespace that distinguishes e.g.
HTML-specific tokens from JSON-specific tokens.
If `value_major` is non-zero then `value_minor` has whatever meaning the
tokenizer's package assigns to it. For example, combining its low 18 bits with
a follow-up extended token's 46 `value_extension` bits could form a 64-bit
overall value, whose semnatics depend on the `value_minor`'s high 7 bits.
### VBCs and VBDs
A zero `value_major` is reserved for Wuffs' built-in "base" package. The
`value_minor` is further sub-divided:
- Bits `38 ..= 41` (4 bits) are the `VBC` (`value_base_category`).
- Bits `17 ..= 37` (21 bits) are the `VBD` (`value_base_detail`).
The high 47 bits (bits `17 ..= 63`) only have `VBC` and `VBD` semantics when
the high 22 bits (the `extended` and `value_major` parts) are all zero. An
equivalent test is that the high 26 bits (the notional `VBC`), as either an
unsigned integer or a sign-extended integer, is in the range `0 ..= 15`.
These `VBC`s organize tokens into broad groups, generally applicable (as
opposed to being e.g. HTML-specific or JSON-specific). For example, strings and
numbers are two `VBC`s. Structure is another, for container boundaries like the
start and end of HTML elements and JSON arrays.
Filler is yet another `VBC`. Such tokens can generally be ignored (other than
accumulating their length). Filler is most often encountered as whitespace, but
also includes JSON commas (which are [structurally
inessential](https://www.tbray.org/ongoing/When/201x/2016/08/20/Fixing-JSON#p-1))
and comments.
The `VBD` semantics depend on the `VBC`. For example, at 21 bits, the `VBD` can
hold every valid Unicode code point, up to U+10FFFF. A `\t` or `\u2603` in a
JSON string can each be represented by a single `VBC__UNICODE_CODE_POINT` token
whose `VBD` is `0x0009` or `0x2603`, meaning the Unicode code points U+0009
CHARACTER TABULATION (the ASCII tab character) or U+2603 SNOWMAN.
More details on the `VBC` and `VBD` bit assignments are in the [`source
code`](/internal/cgen/base/token-public.h).
## SAX/Pull versus DOM/Push
For file formats that conceptually decode into a node tree, such as HTML or
JSON, Wuffs typically provides a
[SAX](https://en.wikipedia.org/wiki/Simple_API_for_XML)-like pull parser, not a
[DOM](https://en.wikipedia.org/wiki/Document_Object_Model)-like push parser.
There are general reasons for [favoring pull
parsers](https://github.com/raphlinus/pulldown-cmark/blob/master/README.md#why-a-pull-parser),
but also, Wuffs code cannot [dynamically allocate
memory](/doc/note/memory-safety.md), nor does Wuffs have an `unsafe` keyword or
a foreign function interface, so a caller cannot pass arbitrary callbacks into
Wuffs code. Instead, Wuffs just outputs tokens and tokens are just `uint64_t`s.
The [example/jsonfindptrs](/example/jsonfindptrs/jsonfindptrs.cc) program
demonstrates creating a traditional DOM-like node tree from a SAX-like token
stream. The [example/jsonptr](/example/jsonptr/jsonptr.cc) program demonstrates
a lower-level approach that works directly on tokens, where the entire program
(not just the Wuffs library) never calls `malloc`.
## Example Token Stream
```
$ gcc script/print-json-token-debug-format.c -o pjtdf
$ ./pjtdf -all-tokens -human-readable < test/data/json-things.formatted.json
pos=0x00000000 len=0x0001 con=0 vbc=1:Structure........ vbd=0x004011
pos=0x00000001 len=0x0005 con=0 vbc=0:Filler........... vbd=0x000000
pos=0x00000006 len=0x0001 con=1 vbc=2:String........... vbd=0x000113
pos=0x00000007 len=0x0002 con=1 vbc=2:String........... vbd=0x000203
pos=0x00000009 len=0x0001 con=0 vbc=2:String........... vbd=0x000113
etc
pos=0x00000090 len=0x0002 con=1 vbc=3:UnicodeCodePoint. vbd=0x00000A
pos=0x00000092 len=0x0002 con=1 vbc=3:UnicodeCodePoint. vbd=0x00000A
pos=0x00000094 len=0x0001 con=0 vbc=2:String........... vbd=0x000113
pos=0x00000095 len=0x0001 con=0 vbc=0:Filler........... vbd=0x000000
pos=0x00000096 len=0x0001 con=0 vbc=1:Structure........ vbd=0x001042
```