 ---
 layout: default
 title: Preparsed UCD
 parent: Design Docs
 ---

 <!--
 © 2016 and later: Unicode, Inc. and others.
 License & terms of use: http://www.unicode.org/copyright.html
 -->

 # Preparsed UCD

 ## What

 A text file with preparsed UCD ([Unicode Character
 Database](http://www.unicode.org/ucd/)) data.

 *   Preparser script:
     [tools/unicode/py/**preparseucd.py**](https://github.com/unicode-org/icu/blob/master/tools/unicode/py/preparseucd.py)
 *   ppucd.txt output:
     [icu4c/source/data/unidata/**ppucd.txt**](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/ppucd.txt)
     ([raw text
     version](https://raw.githubusercontent.com/unicode-org/icu/master/icu4c/source/data/unidata/ppucd.txt))
 *   Parser for ppucd.txt:
     [icu4c/source/tools/toolutil/**ppucd.h**](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/toolutil/ppucd.h)
     &
     [.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/toolutil/ppucd.cpp)
 *   genprops tool rewritten to use that:
     [tools/unicode/c/**genprops**](https://github.com/unicode-org/icu/tree/master/tools/unicode/c/genprops)

 ## Syntax

 ```
 # Preparsed UCD generated by ICU preparseucd.py
 ```

 Only whole-line comments starting with #, no inline comments.

 ```
 ucd;10.0.0
 ```

 Data lines start with a type keyword. Data fields are semicolon-separated. The
 number of fields per line is highly variable.

 The ucd line should be the first data line. It provides the Unicode version
 number.
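As a rough illustration of how little machinery this syntax needs, here is a minimal Python sketch of a line reader. It is not the toolutil C++ parser; the function name `read_data_lines` is hypothetical, and skipping blank lines is an assumption that the text above does not state.

```python
def read_data_lines(text):
    """Yield (line_type, fields) for each data line in ppucd.txt-style text."""
    for line in text.splitlines():
        # Assumption: blank lines are skipped like whole-line comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        yield fields[0], fields[1:]


sample = """\
# Preparsed UCD generated by ICU preparseucd.py
ucd;10.0.0
property;Binary;Alpha;Alphabetic
property;Enumerated;bc;Bidi_Class
"""

lines = list(read_data_lines(sample))
first_type, first_fields = lines[0]
if first_type != "ucd":
    raise ValueError("expected the ucd line to be the first data line")
unicode_version = first_fields[0]
print(unicode_version)
```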

 ```
 property;Binary;Alpha;Alphabetic
 property;Enumerated;bc;Bidi_Class
 ```

 Property lines define properties with a type and two or more aliases.

 ```
 binary;N;No;F;False
 binary;Y;Yes;T;True
 value;bc;ON;Other_Neutral
 ```

 Property value lines define the values of enumerated and catalog properties,
 with the property short name and two or more aliases for each value.

 There is only one shared definition of the values and aliases for binary
 properties.
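The property, binary, and value lines could be collected into alias lookup tables along the following lines. This is only a sketch: the function and table names are hypothetical, and it assumes that the first alias on each line is the short name, as in the sample lines above.

```python
def build_alias_tables(data_lines):
    """Map every alias to its short name, for properties and their values."""
    prop_types = {}      # property short name -> type keyword, e.g. "Binary"
    prop_aliases = {}    # any property alias -> property short name
    binary_aliases = {}  # any binary value alias -> short value ("N" or "Y")
    value_aliases = {}   # property short name -> {value alias -> short value}
    for line in data_lines:
        fields = line.split(";")
        kind = fields[0]
        if kind == "property":
            # property;<type>;<short name>;<more aliases...>
            short = fields[2]
            prop_types[short] = fields[1]
            for alias in fields[2:]:
                prop_aliases[alias] = short
        elif kind == "binary":
            # binary;<short value>;<more aliases...>, shared by all binary properties
            for alias in fields[1:]:
                binary_aliases[alias] = fields[1]
        elif kind == "value":
            # value;<property short name>;<short value>;<more aliases...>
            table = value_aliases.setdefault(fields[1], {})
            for alias in fields[2:]:
                table[alias] = fields[2]
    return prop_types, prop_aliases, binary_aliases, value_aliases


prop_types, prop_aliases, binary_aliases, value_aliases = build_alias_tables([
    "property;Binary;Alpha;Alphabetic",
    "property;Enumerated;bc;Bidi_Class",
    "binary;N;No;F;False",
    "binary;Y;Yes;T;True",
    "value;bc;ON;Other_Neutral",
])
```

With these tables, a parser can normalize a long name such as `Bidi_Class=Other_Neutral` to the short form `bc=ON` before further processing.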

 ```
 defaults;0000..10FFFF;age=NA;bc=L;blk=NB;bpt=n;cf=<code point>;dm=<code point>;dt=None;ea=N;FC_NFKC=<code point>;gc=Cn;GCB=XX;gcm=Cn;hst=NA;InPC=NA;InSC=Other;jg=No_Joining_Group;jt=U;lb=XX;lc=<code point>;NFC_QC=Y;NFD_QC=Y;NFKC_CF=<code point>;NFKC_QC=Y;NFKD_QC=Y;nt=None;SB=XX;sc=Zzzz;scf=<code point>;scx=<script>;slc=<code point>;stc=<code point>;suc=<code point>;tc=<code point>;uc=<code point>;vo=R;WB=XX
 ```

 After the version, property, and property value lines, and before other data
 lines, the defaults line defines default values for all code points
 (corresponding to @missing data in the UCD). Any properties not mentioned here
 default to null values according to their type, such as False or the empty
 string.

 The general syntax of this line is the same as for the following data lines:

 1.  Line type keyword.
 2.  Code point or start..end range (inclusive end).
 3.  Zero or more property values.
     *   Binary values are given by their property name alone if True ("Alpha"),
         or with a minus sign prepended ("-Alpha").
     *   Other values are given as "pname=value" pairs, where pname is the
         property name.
     *   In the ppucd.txt file, short names of properties and values are used,
         but parsers should be prepared to accept any of the aliases according to
         the earlier sections of the file.
     *   In the ppucd.txt file, properties are listed in sorted order, but this
         is not required by the syntax.
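The three-part line syntax above can be sketched in Python as follows. The function names are hypothetical, and representing binary values as `True`/`False` in a dict is an illustrative choice, not how the C++ parser stores them. Alias normalization is left out for brevity.

```python
def parse_range(field):
    """Parse 'XXXX' or 'XXXX..YYYY' into an inclusive (start, end) pair."""
    start, sep, end = field.partition("..")
    start_cp = int(start, 16)
    end_cp = int(end, 16) if sep else start_cp
    return start_cp, end_cp


def parse_props(fields):
    """Parse property fields into a dict of name -> value."""
    props = {}
    for field in fields:
        if "=" in field:
            # "pname=value" pair; split at the first '=' only.
            name, _, value = field.partition("=")
            props[name] = value
        elif field.startswith("-"):
            props[field[1:]] = False  # "-Alpha": binary property is False
        else:
            props[field] = True       # "Alpha": binary property is True
    return props


def parse_data_line(line):
    """Return (line_type, start, end, props) for a defaults/block/cp/unassigned line."""
    fields = line.split(";")
    start, end = parse_range(fields[1])
    return fields[0], start, end, parse_props(fields[2:])


print(parse_data_line("cp;20001;nt=Nu;nv=7"))
```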

 ```
 block;20000..2A6DF;age=3.1;Alpha;blk=CJK_Ext_B;ea=W;gc=Lo;Gr_Base;IDC;Ideo;IDS;lb=ID;SB=LE;sc=Hani;UIdeo;vo=U;XIDC;XIDS
 # 20000..2A6D6 CJK Unified Ideographs Extension B
 algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-
 cp;20001;nt=Nu;nv=7
 cp;20064;nt=Nu;nv=4
 unassigned;2A6D7..2A6DF;ea=W;lb=ID;vo=U
 # No block
 unassigned;2A6E0..2A6FF;ea=W;lb=ID;vo=U
 algnamesrange;AC00..D7A3;hangul
 ```

 Block lines specify a Unicode Block and provide an opportunity for compact data
 lines for ranges inside the block, by listing common property values once for
 the whole block. Block properties override the defaults for cp and unassigned
 lines with code point ranges inside the block. The file syntax and parser do not
 require the presence of block lines.

 cp lines provide the data for a code point or range. They override the
 default+block properties. Properties that are not mentioned fall back to the
 block, then to the defaults.

 Unassigned lines (new in ICU 60 for Unicode 10) provide the data for an
 unassigned code point or range (gc=Cn). They override only the defaults, not
 the block properties: properties that are not mentioned fall back to the
 defaults. The one exception is the blk=Block property, which still applies to
 an unassigned range inside a block.

 A range is considered inside a block if it is fully inside the range of the last
 defined block. Otherwise it is considered outside a block and falls back only to
 the defaults. This is the case even if the range is inside an earlier block, to
 simplify parsing & processing (such data lines should be avoided).

 A range inside the block for which there is no data line inherits all of the
 default+block properties (see Han blocks). Note that this is very different from
 the behavior of an unassigned line, in particular since such blocks typically
 default to gc!=Cn.
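The fallback rules for cp and unassigned lines might be sketched like this in Python. The function `effective_props` and the tuple layout for the current block are hypothetical; property dicts are assumed to use short names, and the sample defaults are a small subset of the real defaults line.

```python
def effective_props(line_type, start, end, props, defaults, block):
    """Resolve the full property set for a cp or unassigned line.

    block is (block_start, block_end, block_props) for the last block line
    seen so far, or None if no block line has been seen yet.
    """
    result = dict(defaults)
    inside = block is not None and block[0] <= start and end <= block[1]
    if inside:
        if line_type == "cp":
            result.update(block[2])          # cp: defaults, then block
        elif line_type == "unassigned" and "blk" in block[2]:
            result["blk"] = block[2]["blk"]  # unassigned: only blk from the block
    result.update(props)                     # the line's own values win
    return result


defaults = {"blk": "NB", "ea": "N", "gc": "Cn", "lb": "XX"}
block = (0x20000, 0x2A6DF, {"blk": "CJK_Ext_B", "ea": "W", "gc": "Lo", "Alpha": True})

cp_props = effective_props("cp", 0x20001, 0x20001, {"nt": "Nu", "nv": "7"}, defaults, block)
un_inside = effective_props("unassigned", 0x2A6D7, 0x2A6DF, {"ea": "W", "lb": "ID"}, defaults, block)
un_outside = effective_props("unassigned", 0x2A6E0, 0x2A6FF, {"ea": "W", "lb": "ID"}, defaults, block)
```

Here `cp_props` picks up gc=Lo and Alpha from the block, while `un_inside` keeps gc=Cn from the defaults and takes only blk=CJK_Ext_B from the block. `un_outside` is not fully inside the last block, so it keeps blk=NB.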

 Non-default properties for unassigned ranges inside and outside of blocks are
 typically for [complex
 defaults](http://www.unicode.org/reports/tr44/#Default_Values_Table) and for
 noncharacters.

 ppucd.txt data lines are in code point order, although this should not be
 strictly required.

 Assigned characters normally have their unique na=Name property value. For
 Hangul syllables with their algorithmically computed names, the entire range is
 covered by the line "algnamesrange;AC00..D7A3;hangul". For ranges of ideographic
 characters, a line like "algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-"
 provides a Name prefix which is to be followed by the code point (in hex like
 %04lX).
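A sketch of how a consumer might compute names from algnamesrange lines, in Python. The function names are hypothetical. The Hangul part follows the standard syllable name generation from the Unicode Standard (jamo short names combined by index arithmetic), which is not spelled out in ppucd.txt itself.

```python
HANGUL_BASE = 0xAC00
# Jamo short names for leading consonants, vowels, and trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S", "SS", "",
          "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]


def hangul_syllable_name(cp):
    """Compute the name of a precomposed Hangul syllable in AC00..D7A3."""
    l_index, rest = divmod(cp - HANGUL_BASE, len(JAMO_V) * len(JAMO_T))
    v_index, t_index = divmod(rest, len(JAMO_T))
    return "HANGUL SYLLABLE " + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index]


def parse_algnamesrange(line):
    """Return (start, end, kind, prefix); prefix is empty for hangul."""
    fields = line.split(";")
    start, _, end = fields[1].partition("..")
    prefix = fields[3] if len(fields) > 3 else ""
    return int(start, 16), int(end, 16), fields[2], prefix


def algorithmic_name(cp, ranges):
    """Return the algorithmic name for cp, or None if no range covers it."""
    for start, end, kind, prefix in ranges:
        if start <= cp <= end:
            if kind == "han":
                return prefix + "%04X" % cp  # prefix followed by the hex code point
            if kind == "hangul":
                return hangul_syllable_name(cp)
    return None


ranges = [
    parse_algnamesrange("algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-"),
    parse_algnamesrange("algnamesrange;AC00..D7A3;hangul"),
]
print(algorithmic_name(0x20001, ranges))
print(algorithmic_name(0xAC00, ranges))
```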

 ## Why not UCD .txt files?

 See [UAX #44 "Unicode Character Database"](http://www.unicode.org/reports/tr44/)

 Nontrivial parsing:

 *   The UCD has grown from a couple of semicolon-delimited files plus an
     informative "Property dump" (early PropList.txt) to a collection of dozens
     of files with a variety of (now more regular) formats.
 *   Related properties are scattered over several files.
 *   Full information for Numeric_Value and Numeric_Type requires parsing two
     files.
 *   Default values are "hidden" in comments.
 *   The UCD folder structure (which file where) has changed over time.
 *   UCD filenames change during each Unicode beta period. (A detailed version
     number is inserted into each filename.)
 *   Many files are bloated with comments that show the General Category and name
     of each character or range start/end; if the data were combined into a
     single file, then all properties for a character or range would be listed
     together, without need for such comments.

 Nontrivial patching: Adding characters (e.g., PUA or proposed/draft) requires
 adding data in many of the UCD files.

 ICU already preprocesses some of the UCD .txt files. We strip comments from some
 files (because they are huge) and in some files merge adjacent same-property
 code points into ranges.

 Some changes are manual, such as updating and adding ranges of algorithmic
 character names.

 Then we run several tools, most of them twice, to parse different sets of .txt
 files and write several output files. We use several Python and shell scripts,
 and a "log" (unidata/changes.txt) with details of what was changed and run in
 each Unicode version upgrade.

 Markus has done ICU Unicode updates since about 2002. Someone else might have a
 hard time picking this up for maintenance and future Unicode version updates.

 ### Why not UCD XML files?

 See [UAX #42 "Unicode Character Database in
 XML"](http://www.unicode.org/reports/tr42/)

 Good: The UCD XML file format stores all properties in a single file with a
 relatively simple structure, with property values as XML attributes.

 Issues:

 *   **Missing data** which is needed for ICU
     *   Name_Alias added in UCD 5.0 but missing in UCD XML as of UCD 6.1 beta.
     *   Script_Extensions added in UCD 6.0 but not "blessed" as a Unicode
         property as of UCD 6.1. Useful, used in ICU, but not available in UCD
         XML.
     *   Adopting UCD XML would require either still parsing some UCD .txt
         files or writing another tool to merge more data into the XML.
 *   Dependency on third party
     *   Lag time between UCD .txt vs. XML availability during beta.
     *   Unable to fix/update/extend XML generator tools.
     *   For new properties, need to wait for standardization (UAX #42), tool
         update, and XML publication.
     *   Will not support custom/nonstandard data.
 *   Could be simpler: Parsing XML is easy in Java, Python, etc. and doable in
     C++ (we have a "poor man's" XML parser), but not as easy as
     `line.split(";")`.
     *   There is no need for complex structure for the UCD.
 *   Could be easier to read for humans: Defaults for all of Unicode are not
     stored in one place; instead each `<group>` carries them, making it hard to
     see which values are specific to each group. "Fluffy" XML makes for longer
     text lines and more horizontal scrolling.
 *   Hard to diff: The XML format can be used in different ways, and Unicode
     publishes different forms of the same data. Also, the precise XML text
     depends on the XML formatting code used.
     *   For diffing, a special tool needs to be run, parse old & new XML data,
         compare values and generate a diff report. Unicode publishes some of
         those too.
 *   Some data still requires nontrivial parsing.
     *   For algorithmic character names, the range needs to be determined by
         collecting a contiguous sequence of elements with a shared name pattern.
         There is not even any special notation for the algorithmic names for
         Hangul syllables.
 *   Minor: Unnecessary data (for ICU)
     *   Precomputed Hangul syllable names
     *   Irrelevant contributory properties like "Other_Xyz"
     *   Properties not used by ICU
 *   Minor, just awkward: Blocks are treated as auxiliary data, rather than as a
     core means to organize and store the data. On the other hand, the "grouped"
     XML files also use them as the basis for the `<group>` elements and associated
     compaction. (The "flat" files don't.)

 ## Goals

 *   Single file with all data relevant for ICU.
 *   Very easy to parse and use the data in C/C++ tools.
 *   Easily human readable.
 *   Easy-to-read diffs from standard diff tools.
 *   Compact file format.
 *   Conversion tool easy to write, maintain, extend.
 *   Convert from UCD .txt files because those are maintained directly by the UTC
     & editorial committee. No waiting for third party to convert the files.
 *   Able to extend for new kinds of data.
 *   Easy format for manual data fixes/additions (e.g., PUA or proposed/draft).
 *   Move much of the parsing from scattered C code into one Python script.

 ## Details

 *   All-Unicode defaults in one place, but only list non-null default values.
     (`blk=No_Block, cf=<code point>, ...`)
 *   Line-oriented, always semicolon-separated, with type-of-line in the first
     field.
 *   Block properties override defaults; a block line lists only the few
     properties whose values are common within the block and differ from the
     defaults.
     *   Effective because blocks represent actual allocation & organization of
         Unicode. Maintained by UTC.
 *   Code point/range properties override default+block properties.
 *   Algorithmic names stored as ranges with type & shared name prefixes (for
     CJK).
 *   No gratuitous white space or syntax characters.
 *   Mostly key=value, simpler format for binary properties. Easy to read.
 *   Comment lines with headings from NamesList.txt further improve readability.
     (There are few of them, so no significant size bloat.)
 *   Simple, stable file generation allows diffing.
     *   E.g., list properties in sorted order of property names.
 *   No need to implement/store properties that are not used in ICU. (But format
     & tool are easy to extend.)
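To illustrate the "stable generation" point, here is a hedged Python sketch of writing one data line with properties in sorted order. The function `format_data_line` is hypothetical and is not taken from preparseucd.py. Sorting case-insensitively is an assumption; it happens to reproduce the order seen in the sample lines of the Syntax section.

```python
def format_data_line(line_type, start, end, props):
    """Serialize one data line with properties in a stable, sorted order."""
    if start == end:
        cp_range = "%04X" % start
    else:
        cp_range = "%04X..%04X" % (start, end)
    fields = [line_type, cp_range]
    # Assumption: case-insensitive ordering of property names.
    for name in sorted(props, key=str.lower):
        value = props[name]
        if value is True:
            fields.append(name)           # binary True: name alone
        elif value is False:
            fields.append("-" + name)     # binary False: minus sign prepended
        else:
            fields.append("%s=%s" % (name, value))
    return ";".join(fields)


line = format_data_line("cp", 0x20001, 0x20001, {"nv": "7", "nt": "Nu"})
print(line)
```

Because the output depends only on the property values and not on dict insertion order, regenerating the file for a new Unicode version changes only the lines whose data changed, which keeps standard diffs readable.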

 ## Plan

 *   (done) Write Python tool to preparse UCD .txt files and generate one output
     ppucd.txt file.
 *   (done) Subsume existing ucdcopy.py.
 *   (done) Write toolutil C++ parser for ppucd.txt, add ppucd.txt to the unidata
     folder.
 *   (done) Merge genbidi, gencase, gennames, gennorm into genprops
     *   Replace scattered many-.txt parsers with calls to the toolutil ppucd.txt
         parser.
     *   Generate all output files in one genprops invocation.
     *   Update makeprops.sh (delete half of it) & changes.txt.
 *   (done) Make preparseucd.py also parse uchar.h & uscript.h and write the
     property names data header file. (was: ~~Change genpname/preparse.pl to read
     ppucd.txt rather than Property\[Value\]Aliases.txt.~~)
 *   (done) Consider changing pnames_data.h so that minor changes don't change
     most of the file contents.
 *   (done) Write wiki/Markus/ReviewTicket8972 with diff links.
     *   2019-sep-27: The old Trac server is going away. I copied the wiki page
         contents into a comment on
         [ICU-8972](https://unicode-org.atlassian.net/browse/ICU-8972).
 *   Move UCD tests from cintltst to intltest, change to use the toolutil
     ppucd.txt parser. ([ticket
     #9041](https://unicode-org.atlassian.net/browse/ICU-9041))
 *   Change Java UCD tests to parse & use ppucd.txt. (ticket #9041)
 *   (partially done) Change Python preparser to not copy input UCD .txt files
     any more, delete them from unidata & Java. (ticket #9041)

 ## Other tool improvements

 **Bad**: Until **ICU 4.8**, the process was

 build & install ICU -> build Unicode tools -> run genpname -> build & install
 ICU (now with updated property names) -> build Unicode tools -> run UCD parsers
 -> build & install ICU (now also with case properties & normalization etc.) ->
 build Unicode tools -> run genuca -> build & install ICU

 It should be possible to

 1.  merge the Unicode tools into one binary
 2.  parameterize the relevant properties code (property name lookup, case & some
     other properties, NFC)
 3.  inject newly built data into the common library for the next part of the
     merged Unicode tool's processing.

 **ICU 49**:

 build & install ICU -> build Unicode tools -> run genprops -> build & install
 ICU (now with updated properties) -> build Unicode tools -> run genuca -> build
 & install ICU

 genprops builds the property (value) names data and injects it into the live
 ppucd.txt parser for further processing.

 **Goal**:

 build & install ICU -> build Unicode tool -> run it -> build & install ICU (now
 with all updated Unicode data)

 Requires [ticket #9040](https://unicode-org.atlassian.net/browse/ICU-9040),
 could be "hard".