# Base38 and FourCC Codes
Both of these encode a four-character string such as `"JPEG"` as a `uint32_t`
value. Computers can compare two integer values faster than they can compare
two arbitrary strings.
Both schemes maintain ordering: if two four-character strings `s` and `t`
satisfy `(s < t)`, and those strings have valid numerical encodings, then the
numerical values also satisfy `(encoding(s) < encoding(t))`.
## FourCC
FourCC codes are not specific to Wuffs. For example, the AVI multimedia
container format can hold various sub-formats, such as "H264" or "YV12",
distinguished in the overall file format by their FourCC code.
The FourCC encoding is the straightforward sequence of each character's ASCII
encoding. The FourCC code for `"JPEG"` is `0x4A504547`, since `'J'` is `0x4A`,
`'P'` is `0x50`, etc. This is essentially 8 bits for each character, 32 bits
overall. The big-endian representation of this number is exactly the ASCII (and
UTF-8) string `"JPEG"`.
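
As a minimal sketch of the big-endian packing described above, the following C helper builds a FourCC code from four ASCII bytes. The name `fourcc_be` is hypothetical (it is not a Wuffs API) and the function assumes the caller passes at least four bytes.

```c
#include <stdint.h>

// Hypothetical helper, not part of any Wuffs API: packs four bytes into a
// uint32_t with the first character in the most significant byte.
// The caller must supply at least four bytes; no validation is performed.
static uint32_t fourcc_be(const char s[4]) {
  return ((uint32_t)(uint8_t)s[0] << 24) |
         ((uint32_t)(uint8_t)s[1] << 16) |
         ((uint32_t)(uint8_t)s[2] << 8) |
         ((uint32_t)(uint8_t)s[3] << 0);
}
```

With this sketch, `fourcc_be("JPEG")` evaluates to `0x4A504547`.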
Other FourCC documentation sometimes uses a little-endian convention, so that
the `{0x4A, 0x50, 0x45, 0x47}` bytes on the wire for `"JPEG"` correspond to
the number `0x4745504A` (little-endian) instead of `0x4A504547` (big-endian).
Wuffs uses the big-endian interpretation, as it maintains ordering.
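
The ordering argument can be checked with a small C sketch that interprets the same four bytes both ways and compares the result against `memcmp`. All function names here are hypothetical and chosen for illustration only.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helpers, not part of any Wuffs API: interpret the same four
// bytes as a big-endian or as a little-endian 32-bit number.
static uint32_t fourcc_bytes_as_be(const char s[4]) {
  return ((uint32_t)(uint8_t)s[0] << 24) | ((uint32_t)(uint8_t)s[1] << 16) |
         ((uint32_t)(uint8_t)s[2] << 8) | ((uint32_t)(uint8_t)s[3]);
}

static uint32_t fourcc_bytes_as_le(const char s[4]) {
  return ((uint32_t)(uint8_t)s[3] << 24) | ((uint32_t)(uint8_t)s[2] << 16) |
         ((uint32_t)(uint8_t)s[1] << 8) | ((uint32_t)(uint8_t)s[0]);
}

// Each function returns 1 if "a < b as four-byte strings" agrees with
// "a < b as numbers" under the given interpretation, and 0 otherwise.
static int be_agrees_with_string_order(const char a[4], const char b[4]) {
  return (memcmp(a, b, 4) < 0) ==
         (fourcc_bytes_as_be(a) < fourcc_bytes_as_be(b));
}

static int le_agrees_with_string_order(const char a[4], const char b[4]) {
  return (memcmp(a, b, 4) < 0) ==
         (fourcc_bytes_as_le(a) < fourcc_bytes_as_le(b));
}
```

For the pair `"ABCZ"` and `"ZBCA"`, the big-endian check returns 1, while the little-endian check returns 0 because the last character becomes the most significant byte.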
## Base38
Base38 is a tighter encoding than FourCC, fitting four characters into 21 bits
instead of 32 bits. This is achieved by using a smaller alphabet of 38 possible
values (space, `0`-`9`, `?` and `a`-`z`), so that it cannot distinguish between e.g. an
upper case 'X' and a lower case 'x'. There's also the happy coincidence that
`38 ** 4 = 0x1FD110 = 2085136` is slightly smaller than `2 ** 21 = 0x200000 =
2097152`.
The base38 encoding of `"JPEG"` is `0x122FF6`, which is `1191926`, which is
`((21 * (38 ** 3)) + (27 * (38 ** 2)) + (16 * (38 ** 1)) + (18 * (38 ** 0)))`.
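
The arithmetic above can be sketched in C as follows. This assumes the 38 digit values are assigned in the order the alphabet is listed (space is 0, `0`-`9` are 1 to 10, `?` is 11, `a`-`z` are 12 to 37), which is consistent with the worked `"JPEG"` example, and that upper case letters fold to lower case. The names `base38_digit`, `base38_encode` and `BASE38_INVALID` are hypothetical, not the Wuffs API.

```c
#include <stdint.h>

// Hypothetical sentinel for strings containing characters outside the
// alphabet. Valid encodings are always below (38 ** 4), so it cannot clash.
#define BASE38_INVALID 0xFFFFFFFFu

// Maps one character to its assumed base38 digit (0 ..= 37), or -1 if the
// character is not in the alphabet. Upper case folds to lower case.
static int base38_digit(char c) {
  if (c == ' ') return 0;
  if (c >= '0' && c <= '9') return 1 + (c - '0');
  if (c == '?') return 11;
  if (c >= 'a' && c <= 'z') return 12 + (c - 'a');
  if (c >= 'A' && c <= 'Z') return 12 + (c - 'A');
  return -1;
}

// Encodes exactly four characters, most significant digit first.
static uint32_t base38_encode(const char s[4]) {
  uint32_t v = 0;
  for (int i = 0; i < 4; i++) {
    int d = base38_digit(s[i]);
    if (d < 0) return BASE38_INVALID;
    v = (v * 38) + (uint32_t)d;
  }
  return v;
}
```

Under these assumptions, `base38_encode("JPEG")` and `base38_encode("jpeg")` both evaluate to `0x122FF6`.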
Using only 21 bits means that we can use base38 values to partition the set of
possible `uint32_t` values into file-format specific enumerations. Each package
(i.e. Wuffs implementation of a specific file format) can define up to 1024
different values in its own namespace, without conflicting with other
packages (assuming that there aren't e.g. two `"JPEG"` Wuffs packages in the
same library). The conventional `uint32_t` packing is:
- Bit `31` (1 bit) is reserved (zero).
- Bits `10 ..= 30` (21 bits) are the base38 value, shifted by 10.
- Bits `0 ..= 9` (10 bits) are the enumeration value.
For example:
- [Quirk values](/doc/note/quirks.md) use this `((base38 << 10) | enumeration)`
scheme.
- [Tokens](/doc/note/tokens.md) assign 21 out of 64 bits for a Base38 value.
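
The conventional `uint32_t` packing, `((base38 << 10) | enumeration)`, can be sketched in C as below. The function names are hypothetical and the masking of out-of-range inputs is an illustrative choice, not something the text above specifies.

```c
#include <stdint.h>

// Hypothetical helpers, not part of any Wuffs API.

// Packs a 21-bit base38 value and a 10-bit enumeration value into a
// uint32_t. Bit 31 stays zero because the base38 field is masked to 21 bits.
static uint32_t pack_base38_enum(uint32_t base38, uint32_t enumeration) {
  return ((base38 & 0x1FFFFFu) << 10) | (enumeration & 0x3FFu);
}

// Extracts bits 10 ..= 30, the base38 value.
static uint32_t unpack_base38(uint32_t packed) {
  return (packed >> 10) & 0x1FFFFFu;
}

// Extracts bits 0 ..= 9, the enumeration value.
static uint32_t unpack_enumeration(uint32_t packed) {
  return packed & 0x3FFu;
}
```

For example, packing the `"JPEG"` base38 value `0x122FF6` with enumeration value 5 gives `0x48BFD805`.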