Fully rework LUT generation
As you may have noticed, libgrapheme is currently two versions behind
on Unicode. This is because the Unicode algorithms are massively
overhauled with each release, and the existing data model I developed
has reached its limits.
For each algorithm, it is necessary to extract properties from multiple
files, and the existing handling of two coinciding properties is
kind of a hack that complicates the code.
The only solution was to fully rethink the data generation, including
the compression. Here's what's changed:
1) Multiple properties are now possible, using a bitfield
approach.
2) Data compression is facilitated by a third dictionary stage.
For the provided first port of the character properties, we
reduce the LUT size from 35K to 23K, making it possible for
the tables to reside in the L1 cache, which promises better
performance.
3) We don't need any of the ugly postprocessing, magic
'temporary' classes, etc., to work around overly rigid
data structures.
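
To illustrate points 1) and 2), here is a minimal sketch of a
three-stage lookup that returns a property bitfield. It is not the
generated code from gen2/: the property names, table names, block size
and the toy range of 16 code points are all assumptions chosen to keep
the example small.

```c
#include <stdint.h>

/* Hypothetical properties: one bit each, so several can be set at once. */
enum {
	PROP_A = 1 << 0,
	PROP_B = 1 << 1,
	PROP_C = 1 << 2,
};

#define BLOCK_BITS 2 /* toy block size of 4 code points (assumption) */
#define BLOCK_MASK ((1u << BLOCK_BITS) - 1)
#define CP_LIMIT   16u /* toy code point range 0..15 (assumption) */

/* Stage 3: dictionary holding each distinct bitfield exactly once. */
static const uint8_t dict[] = { 0, PROP_A, PROP_A | PROP_B, PROP_C };

/* Stage 2: deduplicated blocks of dictionary indices. */
static const uint8_t stage2[] = {
	0, 0, 0, 0, /* block 0: no properties */
	1, 1, 2, 3, /* block 1 */
};

/* Stage 1: maps the high bits of a code point to a block in stage 2. */
static const uint8_t stage1[] = { 0, 1, 0, 1 };

static uint8_t
get_props(uint32_t cp)
{
	uint8_t block, idx;

	if (cp >= CP_LIMIT) {
		return 0; /* outside the toy range: no properties */
	}

	block = stage1[cp >> BLOCK_BITS];
	idx = stage2[((uint32_t)block << BLOCK_BITS) | (cp & BLOCK_MASK)];

	return dict[idx];
}
```

In this sketch, a caller tests a single property with a mask, e.g.
get_props(cp) & PROP_B, so two properties on the same code point need
no special casing. The dictionary stage pays off when the bitfields
are wider than the indices stored in stage 2 and only a few distinct
combinations occur.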
The old infrastructure remains in gen/; the new one is put in gen2/.
Once everything is fully ported, gen/ will be removed and gen2/
renamed to gen/.
Step by step, this will allow us to port libgrapheme to the latest
Unicode version.
Signed-off-by: Laslo Hunhold <dev@frign.de>
5 files changed