| --- |
| layout: default |
| title: Preparsed UCD |
| parent: Design Docs |
| --- |
| |
| <!-- |
| © 2016 and later: Unicode, Inc. and others. |
| License & terms of use: http://www.unicode.org/copyright.html |
| --> |
| |
| # Preparsed UCD |
| |
| ## What |
| |
| A text file with preparsed UCD ([Unicode Character |
| Database](http://www.unicode.org/ucd/)) data. |
| |
| * Preparser script: |
| [tools/unicode/py/**preparseucd.py**](https://github.com/unicode-org/icu/blob/master/tools/unicode/py/preparseucd.py) |
| * ppucd.txt output: |
| [icu4c/source/data/unidata/**ppucd.txt**](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/ppucd.txt) |
| ([raw text |
| version](https://raw.githubusercontent.com/unicode-org/icu/master/icu4c/source/data/unidata/ppucd.txt)) |
| * Parser for ppucd.txt: |
| [icu4c/source/tools/toolutil/**ppucd.h**](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/toolutil/ppucd.h) |
| & |
| [.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/toolutil/ppucd.cpp) |
| * genprops tool rewritten to use that: |
| [tools/unicode/c/**genprops**](https://github.com/unicode-org/icu/tree/master/tools/unicode/c/genprops) |
| |
| ## Syntax |
| |
| ``` |
| # Preparsed UCD generated by ICU preparseucd.py |
| ``` |
| |
| Comments must occupy a whole line that starts with #. Inline comments are not supported. |
| |
| ``` |
| ucd;10.0.0 |
| ``` |
| |
| Data lines start with a type keyword. Data fields are semicolon-separated. The |
| number of fields per line is highly variable. |
| |
| The ucd line should be the first data line. It provides the Unicode version |
| number. |
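A minimal sketch of the line-level rules so far, in Python. The function names `parse_line` and `read_ucd_version` are hypothetical and are not taken from preparseucd.py or ppucd.cpp. The sketch assumes that fields never contain a semicolon and that only line ends are trimmed.

```python
def parse_line(line):
    """Return (line_type, fields) for a data line, or None for a blank or comment line."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None  # whole-line comment or empty line
    fields = line.split(";")
    return fields[0], fields[1:]


def read_ucd_version(lines):
    """Return the version from the ucd line, which is expected to be the first data line."""
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        line_type, fields = parsed
        if line_type != "ucd" or not fields:
            raise ValueError("expected a ucd line first, got %r" % line)
        return fields[0]
    raise ValueError("no data lines found")
```

Spaces inside fields are kept on purpose, since values such as name prefixes can contain them.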
| |
| ``` |
| property;Binary;Alpha;Alphabetic |
| property;Enumerated;bc;Bidi_Class |
| ``` |
| |
| Property lines define properties with a type and two or more aliases. |
| |
| ``` |
| binary;N;No;F;False |
| binary;Y;Yes;T;True |
| value;bc;ON;Other_Neutral |
| ``` |
| |
| Property value lines define the values of enumerated and catalog properties, |
| with the property short name and two or more aliases for each value. |
| |
| There is only one shared definition of the values and aliases for binary |
| properties. |
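As an illustration, the property and value lines could be collected into alias lookup tables roughly as follows. This is a sketch, not the ICU implementation. `build_alias_tables` is a hypothetical name, and it assumes that the first alias on each line is the short name, as in the samples above.

```python
def build_alias_tables(lines):
    """Collect alias lookups from property, binary, and value lines.

    Assumption: the first alias listed on each line is the short name.
    """
    prop_aliases = {}    # any property alias -> (property type, short name)
    binary_aliases = {}  # any binary value alias -> short value name (shared)
    value_aliases = {}   # property short name -> {any value alias -> short value name}
    for line in lines:
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        kind = fields[0]
        if kind == "property":
            prop_type, short = fields[1], fields[2]
            for alias in fields[2:]:
                prop_aliases[alias] = (prop_type, short)
        elif kind == "binary":
            short = fields[1]
            for alias in fields[1:]:
                binary_aliases[alias] = short
        elif kind == "value":
            pname, short = fields[1], fields[2]
            table = value_aliases.setdefault(pname, {})
            for alias in fields[2:]:
                table[alias] = short
    return prop_aliases, binary_aliases, value_aliases
```

With tables like these, a parser can normalize any alias in a later data line to its short name before storing the value.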
| |
| ``` |
| defaults;0000..10FFFF;age=NA;bc=L;blk=NB;bpt=n;cf=<code point>;dm=<code point>;dt=None;ea=N;FC_NFKC=<code point>;gc=Cn;GCB=XX;gcm=Cn;hst=NA;InPC=NA;InSC=Other;jg=No_Joining_Group;jt=U;lb=XX;lc=<code point>;NFC_QC=Y;NFD_QC=Y;NFKC_CF=<code point>;NFKC_QC=Y;NFKD_QC=Y;nt=None;SB=XX;sc=Zzzz;scf=<code point>;scx=<script>;slc=<code point>;stc=<code point>;suc=<code point>;tc=<code point>;uc=<code point>;vo=R;WB=XX |
| ``` |
| |
| After the version, property, and property value lines, and before other data |
| lines, the defaults line defines default values for all code points |
| (corresponding to @missing data in the UCD). Any properties not mentioned here |
| default to null values according to their type, such as False or the empty |
| string. |
| |
| The general syntax of this line is the same as for the following data lines: |
| |
| 1. Line type keyword. |
| 2. Code point or start..end range (inclusive end). |
| 3. Zero or more property values. |
| * Binary values are given by their property name alone if True ("Alpha"), |
| or with a minus sign prepended if False ("-Alpha"). |
| * Other values are given as "pname=value" pairs, where pname is the |
| property name. |
| * In the ppucd.txt file, short names of properties and values are used, |
| but parsers should be prepared to accept any of the aliases according to |
| the earlier sections of the file. |
| * In the ppucd.txt file, properties are listed in sorted order, but this |
| is not required by the syntax. |
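The three parts of a data line can be parsed as sketched below. The helper names are hypothetical. Binary values are stored as Python booleans and all other values as unparsed strings, and alias normalization is left out. The sketch covers defaults, block, cp, and unassigned lines only, since algnamesrange lines use a different field layout.

```python
def parse_range(field):
    """Parse 'XXXX' or 'XXXX..YYYY' (hex, inclusive end) into (start, end)."""
    start, sep, end = field.partition("..")
    first = int(start, 16)
    last = int(end, 16) if sep else first
    return first, last


def parse_props(fields):
    """Parse property fields: 'Alpha' -> True, '-Alpha' -> False, 'gc=Lo' -> 'Lo'."""
    props = {}
    for field in fields:
        name, sep, value = field.partition("=")
        if sep:
            props[name] = value          # pname=value pair
        elif name.startswith("-"):
            props[name[1:]] = False      # binary property explicitly False
        else:
            props[name] = True           # binary property True
    return props


def parse_data_line(line):
    """Return (line_type, start, end, props) for a defaults/block/cp/unassigned line."""
    fields = line.split(";")
    start, end = parse_range(fields[1])
    return fields[0], start, end, parse_props(fields[2:])
```

Because `partition("=")` splits at the first equals sign only, a value that itself contains "=" would be kept intact.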
| |
| ``` |
| block;20000..2A6DF;age=3.1;Alpha;blk=CJK_Ext_B;ea=W;gc=Lo;Gr_Base;IDC;Ideo;IDS;lb=ID;SB=LE;sc=Hani;UIdeo;vo=U;XIDC;XIDS |
| # 20000..2A6D6 CJK Unified Ideographs Extension B |
| algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH- |
| cp;20001;nt=Nu;nv=7 |
| cp;20064;nt=Nu;nv=4 |
| unassigned;2A6D7..2A6DF;ea=W;lb=ID;vo=U |
| # No block |
| unassigned;2A6E0..2A6FF;ea=W;lb=ID;vo=U |
| algnamesrange;AC00..D7A3;hangul |
| ``` |
| |
| Block lines specify a Unicode Block and provide an opportunity for compact data |
| lines for ranges inside the block, by listing common property values once for |
| the whole block. Block properties override the defaults for cp and unassigned |
| lines with code point ranges inside the block. The file syntax and parser do not |
| require the presence of block lines. |
| |
| cp lines provide the data for a code point or range. They override the |
| default+block properties. Properties that are not mentioned fall back to the |
| block, then to the defaults. |
| |
| Unassigned lines (new in ICU 60 for Unicode 10) provide the data for an |
| unassigned code point or range (gc=Cn). They override only the default |
| properties and do not inherit the block properties, except that the blk=Block |
| property still applies if the range is inside a block. Properties that are not |
| mentioned fall back to the defaults. |
| |
| A range is considered inside a block if it is fully inside the range of the last |
| defined block. Otherwise it is considered outside a block and falls back only to |
| the defaults. This is the case even if the range is inside an earlier block, to |
| simplify parsing & processing (such data lines should be avoided). |
| |
| A range inside the block for which there is no data line inherits all of the |
| default+block properties (see Han blocks). Note that this is very different from |
| the behavior of an unassigned line, in particular since such blocks typically |
| default to gc!=Cn. |
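The inheritance rules above can be summarized in a small sketch. `resolve` is a hypothetical helper, not the ppucd.cpp logic. It assumes that properties are held in plain dictionaries and that `block` describes the last block line seen, or is None if there was none.

```python
def resolve(defaults, block, line_type, start, end, props):
    """Compute the effective properties for a cp or unassigned line.

    block is None or a (block_start, block_end, block_props) tuple
    for the last block line seen.
    """
    result = dict(defaults)
    # A range counts as inside only if it lies fully within the last block.
    inside = block is not None and block[0] <= start and end <= block[1]
    if inside:
        block_props = block[2]
        if line_type == "cp":
            result.update(block_props)           # defaults, then block
        elif line_type == "unassigned" and "blk" in block_props:
            result["blk"] = block_props["blk"]   # only blk is inherited
    result.update(props)                         # the line's own values win
    return result
```

A range inside the block that has no data line of its own corresponds to calling `resolve` with line type "cp" and an empty `props` dictionary, which yields the default+block properties.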
| |
| Non-default properties for unassigned ranges inside and outside of blocks are |
| typically for [complex |
| defaults](http://www.unicode.org/reports/tr44/#Default_Values_Table) and for |
| noncharacters. |
| |
| ppucd.txt data lines are in code point order, although this should not be |
| strictly required. |
| |
| Assigned characters normally have their unique na=Name property value. For |
| Hangul syllables with their algorithmically computed names, the entire range is |
| covered by the line "algnamesrange;AC00..D7A3;hangul". For ranges of ideographic |
| characters, a line like "algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-" |
| provides a Name prefix which is to be followed by the code point in uppercase |
| hexadecimal with at least four digits (as printed by the C format %04lX). |
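A sketch of how a consumer might use algnamesrange lines for the ideographic case. The function names are hypothetical. Hangul syllable names need the jamo-based algorithm from the Unicode Standard, which is deliberately left out here.

```python
def parse_algnamesrange(line):
    """Parse 'algnamesrange;start..end;type[;prefix]' into (start, end, type, prefix)."""
    fields = line.split(";")
    if fields[0] != "algnamesrange":
        raise ValueError("not an algnamesrange line: %r" % line)
    start, sep, end = fields[1].partition("..")
    first = int(start, 16)
    last = int(end, 16) if sep else first
    prefix = fields[3] if len(fields) > 3 else None
    return first, last, fields[2], prefix


def algorithmic_name(ranges, cp):
    """Return the algorithmic name for cp, or None if no range covers it."""
    for first, last, kind, prefix in ranges:
        if first <= cp <= last:
            if kind == "han":
                # Prefix followed by the code point in uppercase hex, at least 4 digits.
                return "%s%04X" % (prefix, cp)
            # Hangul syllable names are not sketched here.
            raise NotImplementedError("name algorithm for %r" % kind)
    return None
```

Usage: parse each algnamesrange line once into a list, then call `algorithmic_name(ranges, cp)` for code points that have no explicit na=Name value.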
| |
| ## Why not UCD .txt files? |
| |
| See [UAX #44 "Unicode Character Database"](http://www.unicode.org/reports/tr44/) |
| |
| Nontrivial parsing: |
| |
| * The UCD has grown from a couple of semicolon-delimited files plus an |
| informative "Property dump" (early PropList.txt) to a collection of dozens |
| of files with a variety of (now more regular) formats. |
| * Related properties are scattered over several files. |
| * Full information for Numeric_Value and Numeric_Type requires parsing two |
| files. |
| * Default values are "hidden" in comments. |
| * The UCD folder structure (which file where) has changed over time. |
| * UCD filenames change during each Unicode beta period. (A detailed version |
| number is inserted into each filename.) |
| * Many files are bloated with comments that show the General Category and name |
| of each character or range start/end; if the data were combined into a |
| single file, then all properties for a character or range would be listed |
| together, without need for such comments. |
| |
| Nontrivial patching: Adding characters (e.g., PUA or proposed/draft) requires |
| adding data in many of the UCD files. |
| |
| ICU already preprocesses some of the UCD .txt files. We strip comments from some |
| files (because they are huge) and in some files merge adjacent same-property |
| code points into ranges. |
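For illustration only, merging adjacent code points with the same property value into ranges can look like the following. This is a generic sketch with a hypothetical function name, not the code of any ICU script, and it assumes the input is sorted by code point.

```python
def merge_into_ranges(pairs):
    """Merge (code point, value) pairs, sorted by code point, into (start, end, value) ranges."""
    ranges = []
    for cp, value in pairs:
        last = ranges[-1] if ranges else None
        if last is not None and last[1] == cp - 1 and last[2] == value:
            last[1] = cp                      # extend the current range
        else:
            ranges.append([cp, cp, value])    # start a new range
    return [tuple(r) for r in ranges]
```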
| |
| Some changes are manual, such as updating and adding ranges of algorithmic |
| character names. |
| |
| Then we run several tools, most of them twice, to parse different sets of .txt |
| files and write several output files. We use several Python and shell scripts, |
| and a "log" (unidata/changes.txt) with details of what was changed and run in |
| each Unicode version upgrade. |
| |
| Markus has done ICU Unicode updates since about 2002. Someone else might have a |
| hard time picking this up for maintenance and future Unicode version updates. |
| |
| ### Why not UCD XML files? |
| |
| See [UAX #42 "Unicode Character Database in |
| XML"](http://www.unicode.org/reports/tr42/) |
| |
| Good: The UCD XML file format stores all properties in a single file with a |
| relatively simple structure, with property values as XML attributes. |
| |
| Issues: |
| |
| * **Missing data** which is needed for ICU |
| * Name_Alias added in UCD 5.0 but missing in UCD XML as of UCD 6.1 beta. |
| * Script_Extensions added in UCD 6.0 but not "blessed" as a Unicode |
| property as of UCD 6.1. Useful, used in ICU, but not available in UCD |
| XML. |
| * Adopting UCD XML would require either continuing to parse some UCD .txt |
| files or writing another tool to merge more data into the XML. |
| * Dependency on third party |
| * Lag time between UCD .txt vs. XML availability during beta. |
| * Unable to fix/update/extend XML generator tools. |
| * For new properties, need to wait for standardization (UAX #42), tool |
| update, and XML publication. |
| * Will not support custom/nonstandard data. |
| * Could be simpler: Parsing XML is easy in Java, Python, etc. and doable in |
| C++ (we have a "poor man's" XML parser), but not as easy as |
| `line.split(";")`. |
| * There is no need for complex structure for the UCD. |
| * Could be easier to read for humans: Defaults for all of Unicode are not |
| stored in one place. Instead, each `<group>` carries them, which makes it hard |
| to see which values are specific to each group. "Fluffy" XML also makes for |
| longer text lines and more horizontal scrolling. |
| * Hard to diff: The XML format can be used in different ways, and Unicode |
| publishes different forms of the same data. Also, the precise XML text |
| depends on the XML formatting code used. |
| * For diffing, a special tool needs to be run, parse old & new XML data, |
| compare values and generate a diff report. Unicode publishes some of |
| those too. |
| * Some data still requires nontrivial parsing. |
| * For algorithmic character names, the range needs to be determined by |
| collecting a contiguous sequence of elements with a shared name pattern. |
| There is not even any special notation for the algorithmic names for |
| Hangul syllables. |
| * Minor: Unnecessary data (for ICU) |
| * Precomputed Hangul syllable names |
| * Irrelevant contributory properties like "Other_Xyz" |
| * Properties not used by ICU |
| * Minor, just awkward: Blocks are treated as auxiliary data, rather than as a |
| core means to organize and store the data. On the other hand, the "grouped" |
| XML files also use them as the basis for the `<group>` elements and associated |
| compaction. (The "flat" files don't.) |
| |
| ## Goals |
| |
| * Single file with all data relevant for ICU. |
| * Very easy to parse and use the data in C/C++ tools. |
| * Easily human readable. |
| * Easy-to-read diffs from standard diff tools. |
| * Compact file format. |
| * Conversion tool easy to write, maintain, extend. |
| * Convert from UCD .txt files because those are maintained directly by the UTC |
| & editorial committee. No waiting for third party to convert the files. |
| * Able to extend for new kinds of data. |
| * Easy format for manual data fixes/additions (e.g., PUA or proposed/draft). |
| * Move much of the parsing from scattered C code into one Python script. |
| |
| ## Details |
| |
| * All-Unicode defaults in one place, but only list non-null default values. |
| (`blk=No_Block, cf=<code point>, ...`) |
| * Line-oriented, always semicolon-separated, with type-of-line in the first |
| field. |
| * Block properties override defaults. A block line lists only the few |
| properties for which code points in the block share a common, non-default value. |
| * Effective because blocks represent actual allocation & organization of |
| Unicode. Maintained by UTC. |
| * Code point/range properties override default+block properties. |
| * Algorithmic names stored as ranges with type & shared name prefixes (for |
| CJK). |
| * No gratuitous white space or syntax characters. |
| * Mostly key=value, simpler format for binary properties. Easy to read. |
| * Comment lines with headings from NamesList.txt further improve readability. |
| (There are few of them, so no significant size bloat.) |
| * Simple, stable file generation allows diffing. |
| * E.g., list properties in sorted order of property names. |
| * No need to implement/store properties that are not used in ICU. (But format |
| & tool are easy to extend.) |
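A sketch of stable line generation, assuming properties are held in a dictionary with True/False for binary properties and strings otherwise. `format_data_line` is a hypothetical name. The case-insensitive sort is an inference from the sample lines earlier in this document (for example `gc=Lo;Gr_Base` and `IDC;Ideo;IDS`), not a documented rule.

```python
def format_data_line(line_type, start, end, props):
    """Format one data line with properties in a stable, sorted order."""
    if start == end:
        rng = "%04X" % start
    else:
        rng = "%04X..%04X" % (start, end)
    fields = [line_type, rng]
    # Assumption: sort by property name, ignoring case, as the samples suggest.
    for name in sorted(props, key=str.lower):
        value = props[name]
        if value is True:
            fields.append(name)                      # binary True
        elif value is False:
            fields.append("-" + name)                # binary False
        else:
            fields.append("%s=%s" % (name, value))   # pname=value
    return ";".join(fields)
```

Since the output depends only on the data and not on dictionary insertion order, regenerating the file after a small data change yields a small diff.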
| |
| ## Plan |
| |
| * (done) Write Python tool to preparse UCD .txt files and generate one output |
| ppucd.txt file. |
| * (done) Subsume existing ucdcopy.py. |
| * (done) Write toolutil C++ parser for ppucd.txt, add ppucd.txt to the unidata |
| folder. |
| * (done) Merge genbidi, gencase, gennames, gennorm into genprops |
| * Replace scattered many-.txt parsers with calls to the toolutil ppucd.txt |
| parser. |
| * Generate all output files in one genprops invocation. |
| * Update makeprops.sh (delete half of it) & changes.txt. |
| * (done) Make preparseucd.py also parse uchar.h & uscript.h and write the |
| property names data header file. (was: ~~Change genpname/preparse.pl to read |
| ppucd.txt rather than Property\[Value\]Aliases.txt.~~) |
| * (done) Consider changing pnames_data.h so that minor changes don't change |
| most of the file contents. |
| * (done) Write wiki/Markus/ReviewTicket8972 with diff links. |
| * 2019-sep-27: The old Trac server is going away. I copied the wiki page |
| contents into a comment on |
| [ICU-8972](https://unicode-org.atlassian.net/browse/ICU-8972). |
| * Move UCD tests from cintltst to intltest, change to use the toolutil |
| ppucd.txt parser. ([ticket |
| #9041](https://unicode-org.atlassian.net/browse/ICU-9041)) |
| * Change Java UCD tests to parse & use ppucd.txt. (ticket #9041) |
| * (partially done) Change Python preparser to not copy input UCD .txt files |
| any more, delete them from unidata & Java. (ticket #9041) |
| |
| ## Other tool improvements |
| |
| **Bad**: Through **ICU 4.8**, the process was |
| |
| build & install ICU -> build Unicode tools -> run genpname -> build & install |
| ICU (now with updated property names) -> build Unicode tools -> run UCD parsers |
| -> build & install ICU (now also with case properties & normalization etc.) -> |
| build Unicode tools -> run genuca -> build & install ICU |
| |
| It should be possible to |
| |
| 1. merge the Unicode tools into one binary |
| 2. parameterize the relevant properties code (property name lookup, case & some |
| other properties, NFC) |
| 3. inject newly built data into the common library for the next part of the |
| merged Unicode tool's processing. |
| |
| **ICU 49**: |
| |
| build & install ICU -> build Unicode tools -> run genprops -> build & install |
| ICU (now with updated properties) -> build Unicode tools -> run genuca -> build |
| & install ICU |
| |
| genprops builds the property (value) names data and injects it into the live |
| ppucd.txt parser for further processing. |
| |
| **Goal**: |
| |
| build & install ICU -> build Unicode tool -> run it -> build & install ICU (now |
| with all updated Unicode data) |
| |
| Requires [ticket #9040](https://unicode-org.atlassian.net/browse/ICU-9040), |
| could be "hard". |