
 # Properties

 ## Overview

 Text processing requires that a program treat text appropriately. If text is
 exchanged between several systems, it is important for them to process the text
 consistently. This is done by assigning each character, or a range of
 characters, attributes or properties used for text processing, and by defining
 standard algorithms for at least the basic text operations.

 Traditionally, such attributes and algorithms have not been well-defined for
 most character sets, and text processing had to rely on ad-hoc solutions. Over
 time, standards were created for querying properties of the system codepage.
 However, the set of these properties was limited. Their data was not coordinated
 among implementations, and standard algorithms were not available.

 It is one of the strengths of Unicode that it not only defines a very large
 character set, but also assigns a comprehensive set of properties and usage
 notes to all characters. It defines standard algorithms for critical text
 processing, and the data is publicly provided and kept up-to-date. See
 <http://www.unicode.org/> for more information.

 Sample code is available in the ICU source code library at
 [icu4c/source/samples/props/props.cpp](http://source.icu-project.org/repos/icu/trunk/icu4c/source/samples/props/props.cpp).
 See also the source code for the [Unicode
 browser](http://source.icu-project.org/repos/icu/icuapps/trunk/ubrowse/) demo
 application, which can be used
 [online](http://demo.icu-project.org/icu-bin/ubrowse) to browse Unicode
 characters with their properties.

 ## Unicode Character Database properties in ICU APIs

 The following table shows all Unicode Character Database properties (except for
 purely "extracted" ones and Unihan properties) and the corresponding ICU APIs.
 Most of the time, ICU4C provides functions in
 icu4c/source/common/unicode/uchar.h and ICU4J provides parallel functions in the
 com.ibm.icu.lang.UCharacter class. Properties of a single Unicode character are
 accessed by its 21-bit code point value (type: UChar32=int32_t in C/C++, int in
 Java).

 [Surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point)
 mostly have default property values, except for the General_Category (gc=Cs).

 For integer values outside the Unicode code point range (negative or ≥
 0x110000), most API functions return null values (false, 0, etc.). API functions
 that map a code point to another (e.g., u_foldCase()/UCharacter.foldCase())
 normally return out-of-range values unchanged (that is, they map them to
 themselves), just as they do for unassigned code points and, more generally, for
 code points that have no specific mappings. In particular, -1 (=U_SENTINEL in
 ICU4C) is mapped to -1.

 Most properties are also available via UnicodeSet APIs and patterns. See the
 Lookup section below.

 See the [Unicode Character
 Database](http://www.unicode.org/reports/tr44/#Properties) itself for
 comparison. The UCD files
 [PropertyAliases.txt](http://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
 and
 [PropertyValueAliases.txt](http://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
 list all properties and their values by name and type.

 Most properties that use binary, integer, or enumerated values are available via
 the functions u_hasBinaryProperty and u_getIntPropertyValue, which take UProperty
 enum constants to select the property. (ICU4J UCharacter member functions do not
 have the "u_" prefix.) The constant names include the long property name
 according to PropertyAliases.txt, e.g., UCHAR_LINE_BREAK. Corresponding property
 value enum constant names often contain the short property name and the long
 value name, e.g., U_LB_LINE_FEED. For enumeration/integer type properties, the
 table below also lists the enumeration result type.

 Some UnicodeSet APIs use the same UProperty constants. Other UnicodeSet APIs and
 UnicodeSet and regular expression patterns use the long or short property
 aliases and property value aliases (see PropertyAliases.txt and
 PropertyValueAliases.txt).

 There is one pseudo-property, UCHAR_GENERAL_CATEGORY_MASK, for which the APIs do
 not use a single value but a bit-set (a mask) of zero or more values, with each
 bit corresponding to one UCHAR_GENERAL_CATEGORY value. This allows ICU to
 represent property value aliases for multiple general categories, like "Letters"
 (which stands for "Uppercase Letters", "Lowercase Letters", etc.). In other
 words, there are two ICU properties for the same Unicode property, one
 delivering single values (for per-code point lookup) and the other delivering
 sets of values (for use with value aliases and UnicodeSet).

 | UCD Name (see PropertyAliases.txt) | Type | Note | ICU4C uchar.h / ICU4J UCharacter | UCD File (.txt) |
 |------------------------------------|-----------------------------------------------|-----|------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------|
 | Age | Unicode version | (U) | C: u_charAge fills in UVersionInfo; Java: getAge returns a VersionInfo reference | DerivedAge |
 | Alphabetic | binary | (U) | u_isUAlphabetic, UCHAR_ALPHABETIC | DerivedCoreProperties |
 | ASCII_Hex_Digit | binary | (U) | UCHAR_ASCII_HEX_DIGIT | PropList |
 | Bidi_Class | enum UCharDirection | (U) | u_charDirection, UCHAR_BIDI_CLASS | UnicodeData |
 | Bidi_Control | binary | (U) | UCHAR_BIDI_CONTROL | PropList |
 | Bidi_Mirrored | binary | (U) | u_isMirrored, UCHAR_BIDI_MIRRORED | UnicodeData |
 | Bidi_Mirroring_Glyph | code point |  | u_charMirror | BidiMirroring |
 | Block | enum UBlockCode (growing) | (U) | ublock_getCode, UCHAR_BLOCK | Blocks |
 | Canonical_Combining_Class | 0..255 | (U) | u_getCombiningClass, UCHAR_CANONICAL_COMBINING_CLASS | UnicodeData |
 | Case_Folding | Unicode string |  | u_strFoldCase (ustring.h) | CaseFolding |
 | Case_Ignorable | binary | (U) | UCHAR_CASE_IGNORABLE | DerivedCoreProperties |
 | Cased | binary | (U) | UCHAR_CASED | DerivedCoreProperties |
 | Changes_When_Casefolded | binary | (U) | UCHAR_CHANGES_WHEN_CASEFOLDED | DerivedCoreProperties |
 | Changes_When_Casemapped | binary | (U) | UCHAR_CHANGES_WHEN_CASEMAPPED | DerivedCoreProperties |
 | Changes_When_NFKC_Casefolded | binary | (U) | UCHAR_CHANGES_WHEN_NFKC_CASEFOLDED | DerivedNormalizationProps |
 | Changes_When_Lowercased | binary | (U) | UCHAR_CHANGES_WHEN_LOWERCASED | DerivedCoreProperties |
 | Changes_When_Titlecased | binary | (U) | UCHAR_CHANGES_WHEN_TITLECASED | DerivedCoreProperties |
 | Changes_When_Uppercased | binary | (U) | UCHAR_CHANGES_WHEN_UPPERCASED | DerivedCoreProperties |
 | Composition_Exclusion | binary | (c) | contributes to Full_Composition_Exclusion | CompositionExclusions |
 | Dash | binary | (U) | UCHAR_DASH | PropList |
 | Decomposition_Mapping | Unicode string |  | NFKC Normalizer2::getRawDecomposition() | UnicodeData |
 | Decomposition_Type | enum UDecompositionType | (U) | UCHAR_DECOMPOSITION_TYPE | UnicodeData |
 | Default_Ignorable_Code_Point | binary | (U) | UCHAR_DEFAULT_IGNORABLE_CODE_POINT | DerivedCoreProperties |
 | Deprecated | binary | (U) | UCHAR_DEPRECATED | PropList |
 | Diacritic | binary | (U) | UCHAR_DIACRITIC | PropList |
 | East_Asian_Width | enum UEastAsianWidth | (U) | UCHAR_EAST_ASIAN_WIDTH | EastAsianWidth |
 | Expands_On_NF* | binary |  | available via normalization API (normalizer2.h) | DerivedNormalizationProps |
 | Extender | binary | (U) | UCHAR_EXTENDER | PropList |
 | FC_NFKC_Closure | Unicode string |  | u_getFC_NFKC_Closure | DerivedNormalizationProps |
 | Full_Composition_Exclusion | binary | (U) | UCHAR_FULL_COMPOSITION_EXCLUSION | DerivedNormalizationProps |
 | General_Category | enum (<= 32 values) | (U) | u_charType, UCHAR_GENERAL_CATEGORY, UCHAR_GENERAL_CATEGORY_MASK, UCharCategory | UnicodeData |
 | Grapheme_Base | binary | (U) | UCHAR_GRAPHEME_BASE | DerivedCoreProperties |
 | Grapheme_Cluster_Break | enum UGraphemeClusterBreak | (U) | UCHAR_GRAPHEME_CLUSTER_BREAK | GraphemeBreakProperty |
 | Grapheme_Extend | binary | (U) | UCHAR_GRAPHEME_EXTEND | DerivedCoreProperties |
 | Grapheme_Link | binary | (U) | UCHAR_GRAPHEME_LINK | DerivedCoreProperties |
 | Hangul_Syllable_Type | enum UHangulSyllableType | (U) | UCHAR_HANGUL_SYLLABLE_TYPE | HangulSyllableType |
 | Hex_Digit | binary | (U) | UCHAR_HEX_DIGIT | PropList |
 | Hyphen | binary | (U) | UCHAR_HYPHEN | PropList |
 | ID_Continue | binary | (U) | UCHAR_ID_CONTINUE | DerivedCoreProperties |
 | ID_Start | binary | (U) | UCHAR_ID_START | DerivedCoreProperties |
 | Ideographic | binary | (U) | UCHAR_IDEOGRAPHIC | PropList |
 | IDS_Binary_Operator | binary | (U) | UCHAR_IDS_BINARY_OPERATOR | PropList |
 | IDS_Trinary_Operator | binary | (U) | UCHAR_IDS_TRINARY_OPERATOR | PropList |
 | Indic_Matra_Category | (enum) |  | provisional, not yet supported | IndicMatraCategory |
 | Indic_Syllabic_Category | (enum) |  | provisional, not yet supported | IndicSyllabicCategory |
 | ISO_Comment | ASCII string |  | u_getISOComment | UnicodeData |
 | Jamo_Short_Name | ASCII string | (c) | contributes to Name | Jamo |
 | Join_Control | binary | (U) | UCHAR_JOIN_CONTROL | PropList |
 | Joining_Group | enum UJoiningGroup | (U) | UCHAR_JOINING_GROUP | ArabicShaping |
 | Joining_Type | enum UJoiningType | (U) | UCHAR_JOINING_TYPE | ArabicShaping |
 | Line_Break | enum ULineBreak | (U) | UCHAR_LINE_BREAK | LineBreak |
 | Logical_Order_Exception | binary | (U) | UCHAR_LOGICAL_ORDER_EXCEPTION | PropList |
 | Lowercase | binary | (U) | u_isULowercase, UCHAR_LOWERCASE | DerivedCoreProperties |
 | Lowercase_Mapping | Unicode string + conditions |  | available via u_strToLower (ustring.h) | UnicodeData + SpecialCasing |
 | Math | binary | (U) | UCHAR_MATH | DerivedCoreProperties |
 | Name | ASCII string | (U) | u_charName(U_UNICODE_CHAR_NAME or U_EXTENDED_CHAR_NAME) | UnicodeData |
 | Name_Alias | ASCII string |  | u_charName(U_CHAR_NAME_ALIAS) | NameAliases |
 | NF*_QuickCheck | enum UNormalizationCheckResult (no/maybe/yes) | (U) | UCHAR_NF*_QUICK_CHECK and available via quickCheck (normalizer2.h) | DerivedNormalizationProps |
 | NFKC_Casefold | Unicode string |  | available via normalization API (normalizer2.h "nfkc_cf") | DerivedNormalizationProps |
 | Noncharacter_Code_Point | binary | (U) | UCHAR_NONCHARACTER_CODE_POINT, U_IS_UNICODE_NONCHAR (utf.h) | PropList |
 | Numeric_Type | enum UNumericType | (U) | UCHAR_NUMERIC_TYPE | UnicodeData |
 | Numeric_Value | double | (U) | u_getNumericValue; Java/UnicodeSet: only non-negative integers, no fractions | UnicodeData |
 | Other_Alphabetic | binary | (c) | contributes to Alphabetic | PropList |
 | Other_Default_Ignorable_Code_Point | binary | (c) | contributes to Default_Ignorable_Code_Point | PropList |
 | Other_Grapheme_Extend | binary | (c) | contributes to Grapheme_Extend | PropList |
 | Other_Lowercase | binary | (c) | contributes to Lowercase | PropList |
 | Other_Math | binary | (c) | contributes to Math | PropList |
 | Other_Uppercase | binary | (c) | contributes to Uppercase | PropList |
 | Pattern_Syntax | binary | (U) | UCHAR_PATTERN_SYNTAX | PropList |
 | Pattern_White_Space | binary | (U) | UCHAR_PATTERN_WHITE_SPACE | PropList |
 | Quotation_Mark | binary | (U) | UCHAR_QUOTATION_MARK | PropList |
 | Radical | binary | (U) | UCHAR_RADICAL | PropList |
 | Script | enum UScriptCode (growing) | (U) | uscript_getCode (uscript.h), UCHAR_SCRIPT | Scripts |
 | Script_Extensions (provisional) | list of enum UScriptCode (growing) | (U) | uscript_getScriptExtensions & uscript_hasScript (uscript.h), UCHAR_SCRIPT_EXTENSIONS; UnicodeSet [:scx=Arab:] is a superset of [:sc=Arab:] | ScriptExtensions |
 | Sentence_Break | enum USentenceBreak | (U) | UCHAR_SENTENCE_BREAK | SentenceBreakProperty |
 | Simple_Case_Folding | code point |  | u_foldCase | CaseFolding |
 | Simple_Lowercase_Mapping | code point |  | u_tolower | UnicodeData |
 | Simple_Titlecase_Mapping | code point |  | u_totitle | UnicodeData |
 | Simple_Uppercase_Mapping | code point |  | u_toupper | UnicodeData |
 | Soft_Dotted | binary | (U) | UCHAR_SOFT_DOTTED | PropList |
 | Special_Case_Condition | conditions |  | available via u_strToLower etc. (ustring.h) | SpecialCasing |
 | STerm | binary | (U) | UCHAR_S_TERM | PropList |
 | Terminal_Punctuation | binary | (U) | UCHAR_TERMINAL_PUNCTUATION | PropList |
 | Titlecase_Mapping | Unicode string + conditions |  | u_strToTitle (ustring.h) | UnicodeData + SpecialCasing |
 | Unicode_1_Name | ASCII string | (U) | u_charName(U_UNICODE_10_CHAR_NAME or U_EXTENDED_CHAR_NAME) | UnicodeData |
 | Unified_Ideograph | binary | (U) | UCHAR_UNIFIED_IDEOGRAPH | PropList |
 | Uppercase | binary | (U) | u_isUUppercase, UCHAR_UPPERCASE | DerivedCoreProperties |
 | Uppercase_Mapping | Unicode string + conditions |  | u_strToUpper (ustring.h) | UnicodeData + SpecialCasing |
 | White_Space | binary | (U) | u_isUWhiteSpace, UCHAR_WHITE_SPACE | PropList |
 | Word_Break | enum UWordBreakValues | (U) | UCHAR_WORD_BREAK | WordBreakProperty |
 | XID_Continue | binary | (U) | UCHAR_XID_CONTINUE | DerivedCoreProperties |
 | XID_Start | binary | (U) | UCHAR_XID_START | DerivedCoreProperties |

 Notes:

 1.  (c) - This property only **contributes** to "real" properties (mostly
     "Other_..." properties), so there is no direct support for this property in
     ICU.

 2.  (U) - This property is available via the UnicodeSet APIs and patterns. Any
     property available in UnicodeSet is also available in regular expressions.
     Properties which are not available in UnicodeSet are generally those that
     are not available through a UProperty selector.

 ## Customization

 ICU does not provide the means to modify properties at runtime. The properties
 are provided exactly as specified by a recent version of the Unicode Standard
 (as published in the [Character
 Database](http://www.unicode.org/unicode/onlinedat/online.html) ).

 For custom sets and maps, it is easiest to make UnicodeSet or
 UCPTrie/CodePointTrie objects with the desired values.

 However, if an application requires custom properties (for example, for [Private
 Use](http://www.unicode.org/glossary/) characters), then it is possible to
 change or add them at build-time. This is doable but not easy.

 It is done by modifying the Character Database files copied into the ICU source
 tree at
 [icu4c/source/data/unidata](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/unidata).
 Since ICU 49, most of the properties have been combined into one file,
 unidata/ppucd.txt (see the [Preparsed
 UCD](http://site.icu-project.org/design/props/ppucd) design doc). Some of the
 remaining UCD files are still inputs, others are only used for unit tests.

 To add a character to such a file, a line must be inserted into the file with
 the format used in that file (see the online documentation on the [Unicode
 site](http://www.unicode.org/reports/tr44/) for more information). After
 modifying one or more of these files, the ICU data needs to be rebuilt, and the
 resulting files need to be checked into the ICU source tree. The files are
 processed by special ICU tools outside of the normal ICU build. The
 [unidata/changes.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/changes.txt)
 file documents the process that has been used for the last several Unicode
 version updates; skip the file preparation and API update steps.

 Any available Unicode code point (0 to 10FFFF hexadecimal) can be used. Code point values
 should be written with either 4, 5, or 6 hex digits. The minimum number of
 digits possible should be used (but no fewer than 4). Note that the Unicode
 Standard specifies that the 32 code points U+FDD0..U+FDEF and the 34 code points
 U+...xFFFE and U+...xFFFF (where x=0, 1, 2, ..., F, 10) are not characters,
 therefore they should not be added to any of the character database files.
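The formatting and noncharacter rules above can be checked mechanically before a line is added to a data file. The following is a minimal sketch using only the C++ standard library; `formatCodePoint` and `isNoncharacter` are hypothetical helper names, not part of ICU or its build tools.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a code point (expected range 0..0x10FFFF) in upper-case hex with
// the minimum number of digits, but never fewer than 4.
std::string formatCodePoint(int32_t c) {
    char buf[8];
    std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(c));
    return buf;
}

// True for the 66 noncharacters: U+FDD0..U+FDEF, plus the last two code
// points (xFFFE and xFFFF) of each of the 17 planes.
bool isNoncharacter(int32_t c) {
    if (c < 0 || c > 0x10FFFF) {
        return false;  // not a code point at all
    }
    return (0xFDD0 <= c && c <= 0xFDEF) || (c & 0xFFFE) == 0xFFFE;
}
```

For example, `formatCodePoint(0x41)` yields `0041` and `formatCodePoint(0x1F600)` yields `1F600`.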

 ## Lookup

 For lookup by code point, iterate through the string, fetch code points, and
 either call the unicode/uchar.h / UCharacter or similar functions, or use
 dedicated sets and maps. For binary properties, and sets in general, there are
 also more efficient methods for iterating over substrings.

 ### Binary property from code point

 Call one of the binary-property functions. Alternatively, make a UnicodeSet for
 the property (remember to freeze() it) or for a custom set of characters, and
 call contains().

 ### Binary property over string

 It is often useful to partition a string into substrings where every character
 has the property, and substrings where every character does not have the
 property. For example, to split the string at separator characters, remove
 certain types of characters, trim white space, etc. Use a UnicodeSet with its
 span() and spanBack() methods (available in C++ in UTF-8 versions). In Java, you
 can also use a UnicodeSetSpanner.

 ### Enumerated property from code point

 Call one of the int-property functions. Alternatively, build a UCPTrie /
 CodePointTrie (new in ICU 63) via its mutable version and build method, then use
 that to get the int value for each code point.

 ### Enumerated property over string

 The easiest approach is to iterate over the code points of the string and call
 per-code point lookup methods (or use a code point trie).

 The UCPTrie / CodePointTrie (new in ICU 63) also offers C macros and a Java
 String iterator class where the iteration and data lookup are integrated to
 avoid redundancies in validation and range checks.

 The UTF-16 code point macros and the Java String iterator also provide the code
 point as output, because it has to be fetched or assembled anyway.

 The UTF-8 macros do not assemble the code point because that would be some
 amount of extra work, but often only the lookup value is used and the code point
 is not needed. When it is needed after all, it is possible to take advantage of
 the macros having validated the byte sequence: If the sequence was ill-formed,
 then the trie's error value is set. Therefore, if a value other than the trie
 error value was returned, then the sequence was well-formed, and the code point
 can be fetched without revalidating the sequence (e.g., via U8_NEXT_UNSAFE()).
 Since the length of the sequence (1..4 bytes) is also known from the iteration
 (string index before/after next() call), an even simpler piece of code can be
 used. (See for example the ICU-internal function codePointFromValidUTF8() in
 normalizer2impl.cpp.)
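To show why a known-valid sequence of known length needs so little code, here is a standalone sketch written for this guide. It is not the ICU-internal function itself, and the name `codePointFromValidUtf8` is only illustrative. It assumes that the caller has already established that the bytes are well-formed and knows the sequence length.

```cpp
#include <cstddef>
#include <cstdint>

// Assembles the code point from a UTF-8 sequence that is known to be
// well-formed and whose length (1..4 bytes) is known from the iteration.
// No validation is done here; both preconditions are the caller's job.
int32_t codePointFromValidUtf8(const uint8_t *start, size_t length) {
    uint8_t lead = start[0];
    switch (length) {
    case 1:
        return lead;
    case 2:
        return ((lead & 0x1F) << 6) | (start[1] & 0x3F);
    case 3:
        return ((lead & 0x0F) << 12) | ((start[1] & 0x3F) << 6) |
               (start[2] & 0x3F);
    case 4:
        return ((lead & 0x07) << 18) | ((start[1] & 0x3F) << 12) |
               ((start[2] & 0x3F) << 6) | (start[3] & 0x3F);
    default:
        return -1;  // unreachable when the preconditions hold
    }
}
```

Because the length selects the case directly, there are no checks on lead or trail byte ranges, which is the saving compared with a validating decoder.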

 ### Code point trie most-optimized UTF-16 access

 UTF-16 text processing can be further optimized by detecting surrogate pairs and
 assembling supplementary code points only when there is non-trivial data
 available.

 At build time, iterate over all supplementary code points
 (umutablecptrie_getRange() / MutableCodePointTrie.getRange() starting from
 U+10000) to see if there is non-trivial data for any of the supplementary code
 points associated with a lead surrogate. If so, then set a special
 (application-specific) value for the lead surrogate.

 At runtime, use UCPTRIE_FAST_BMP_GET() per code *unit*. If there is non-trivial
 data and the code unit is a lead surrogate, then check if a trail surrogate
 follows. If so, assemble the supplementary code point with
 U16_GET_SUPPLEMENTARY() and look up its value with UCPTRIE_FAST_SUPP_GET();
 otherwise deal with the unpaired surrogate in some way. (Java CodePointTrie.Fast
 and java.lang.Character have equivalent methods.)

 If there is only trivial data for lead and trail surrogates, then processing can
 often skip them. (In this case, there will be two data lookups, one for the lead
 surrogate and one for the trail surrogate, but they are fast, and this
 optimization speeds up the more common BMP characters by not checking for
 surrogates each time.)

 For example, in normalization or case mapping all characters that do not have
 any mappings are simply copied as is.
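The control flow of this optimization can be sketched without a real trie. In the standalone example below, `bmpValue` and `suppValue` are toy stand-ins for UCPTRIE_FAST_BMP_GET() and UCPTRIE_FAST_SUPP_GET(), `kLeadWithData` plays the role of the application-specific lead surrogate value set at build time, and the data (only U+0041 and U+1F600 are non-trivial) is invented for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

constexpr uint16_t kTrivial = 0;
constexpr uint16_t kLeadWithData = 0xFFFF;  // application-specific marker

// Toy supplementary lookup: only U+1F600 has non-trivial data.
uint16_t suppValue(int32_t c) { return c == 0x1F600 ? 1 : kTrivial; }

// Toy per-code-unit lookup: 'A' has data, and 0xD83D (the lead surrogate
// of U+1F600) carries the marker that was set at build time.
uint16_t bmpValue(char16_t unit) {
    if (unit == u'A') return 1;
    if (unit == 0xD83D) return kLeadWithData;
    return kTrivial;
}

// Same arithmetic as U16_GET_SUPPLEMENTARY().
int32_t supplementary(char16_t lead, char16_t trail) {
    return (static_cast<int32_t>(lead) << 10) + trail -
           ((0xD800 << 10) + 0xDC00 - 0x10000);
}

// Counts the code points in s that have non-trivial data.
size_t countNonTrivial(const std::u16string &s) {
    size_t count = 0;
    for (size_t i = 0; i < s.size(); ++i) {
        uint16_t v = bmpValue(s[i]);
        if (v == kTrivial) continue;  // fast path, no surrogate check
        if (v != kLeadWithData) { ++count; continue; }
        // Marked lead surrogate: pair it up only now.
        if (i + 1 < s.size() && 0xDC00 <= s[i + 1] && s[i + 1] <= 0xDFFF) {
            int32_t c = supplementary(s[i], s[i + 1]);
            ++i;
            if (suppValue(c) != kTrivial) ++count;
        }
        // An unpaired lead surrogate is treated as trivial here.
    }
    return count;
}
```

Surrogates whose supplementary code points all have trivial data fall through the fast path as two cheap lookups, so the surrogate test only runs for the few marked lead surrogates.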

 ## Properties in ICU Rule Syntax

 ICU rule syntaxes should use the Unicode Pattern_White_Space set as syntactic
 "spaces" to allow for the usage of white space characters outside of the normal
 ASCII range while still maintaining backward compatibility. See
 <http://www.unicode.org/reports/tr31/#Pattern_Syntax> for more information.
	# Properties

	## Overview

	Text processing requires that a program treat text appropriately. If text is
	exchanged between several systems, it is important for them to process the text
	consistently. This is done by assigning each character, or a range of
	characters, attributes or properties used for text processing, and by defining
	standard algorithms for at least the basic text operations.

	Traditionally, such attributes and algorithms have not been well-defined for
	most character sets, and text processing had to rely on ad-hoc solutions. Over
	time, standards were created for querying properties of the system codepage.
	However, the set of these properties was limited. Their data was not coordinated
	among implementations, and standard algorithms were not available.

	It is one of the strengths of Unicode that it not only defines a very large
	character set, but also assigns a comprehensive set of properties and usage
	notes to all characters. It defines standard algorithms for critical text
	processing, and the data is publicly provided and kept up-to-date. See
	<http://www.unicode.org/> for more information.

	Sample code is available in the ICU source code library at
	[icu4c/source/samples/props/props.cpp](http://source.icu-project.org/repos/icu/trunk/icu4c/source/samples/props/props.cpp)
	. See also the source code for the [Unicode
	browser](http://source.icu-project.org/repos/icu/icuapps/trunk/ubrowse/) demo
	application, which can be used
	[online](http://demo.icu-project.org/icu-bin/ubrowse) to browse Unicode
	characters with their properties.

	## Unicode Character Database properties in ICU APIs

	The following table shows all Unicode Character Database properties (except for
	purely "extracted" ones and Unihan properties) and the corresponding ICU APIs.
	Most of the time, ICU4C provides functions in
	icu4c/source/common/unicode/uchar.h and ICU4J provides parallel functions in the
	com.ibm.icu.lang.UCharacter class. Properties of a single Unicode character are
	accessed by its 21-bit code point value (type: UChar32=int32_t in C/C++, int in
	Java).

	[Surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point)
	mostly have default property values, except for the General_Category (gc=Cs).

	For integer values outside the Unicode code point range (negative or ≥
	0x110000), most API functions return null values (false, 0, etc.). API functions
	that map a code point to another (e.g., u_foldCase()/UCharacter.foldCase())
	normally return out-of-range values (i.e., map them to themselves), just like
	for unassigned code points or generally code points that have no specific
	mappings. In particular, -1 (=U_SENTINEL in ICU4C) is mapped to -1.

	Most properties are also available via UnicodeSet APIs and patterns. See the
	Lookup section below.

	See the [Unicode Character
	Database](http://www.unicode.org/reports/tr44/#Properties) itself for
	comparison. The UCD files
	[PropertyAliases.txt](http://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
	and
	[PropertyValueAliases.txt](http://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
	list all properties and their values by name and type.

	Most properties that use binary, integer, or enumerated values are available via
	functions u_hasBinaryProperty and u_getIntPropertyValue which take UProperty
	enum constants to select the property. (ICU4J UCharacter member functions do not
	have the "u_" prefix.) The constant names include the long property name
	according to PropertyAliases.txt, e.g., UCHAR_LINE_BREAK. Corresponding property
	value enum constant names often contain the short property name and the long
	value name, e.g., U_LB_LINE_FEED. For enumeration/integer type properties, the
	enumeration result type is also listed here.

	Some UnicodeSet APIs use the same UProperty constants. Other UnicodeSet APIs and
	UnicodeSet and regular expression patterns use the long or short property
	aliases and property value aliases (see PropertyAliases.txt and
	PropertyValueAliases.txt).

	There is one pseudo-property, UCHAR_GENERAL_CATEGORY_MASK for which the APIs do
	not use a single value but a bit-set (a mask) of zero or more values, with each
	bit corresponding to one UCHAR_GENERAL_CATEGORY value. This allows ICU to
	represent property value aliases for multiple general categories, like "Letters"
	(which stands for "Uppercase Letters", "Lowercase Letters", etc.). In other
	words, there are two ICU properties for the same Unicode property, one
	delivering single values (for per-code point lookup) and the other delivering
	sets of values (for use with value aliases and UnicodeSet).

	\| UCD Name(see PropertyAliases.txt) \| Type \| \| ICU4C uchar.hICU4J UCharacter \| UCD File (.txt) \|
	\|------------------------------------\|-----------------------------------------------\|-----\|------------------------------------------------------------------------------------------------------------------------------------------\|-----------------------------\|
	\| Age \| Unicode version \| (U) \| C: u_charAge fills in UVersionInfoJava: getAge returns a VersionInfo reference \| DerivedAge \|
	\| Alphabetic \| binary \| (U) \| u_isUAlphabetic, UCHAR_ALPHABETIC \| DerivedCoreProperties \|
	\| ASCII_Hex_Digit \| binary \| (U) \| UCHAR_ASCII_HEX_DIGIT \| PropList \|
	\| Bidi_Class \| enum UCharDirection \| (U) \| u_charDirection, UCHAR_BIDI_CLASS \| UnicodeData \|
	\| Bidi_Control \| binary \| (U) \| UCHAR_BIDI_CONTROL \| PropList \|
	\| Bidi_Mirrored \| binary \| (U) \| u_isMirrored, UCHAR_BIDI_MIRRORED \| UnicodeData \|
	\| Bidi_Mirroring_Glyph \| code point \| \| u_charMirror \| BidiMirroring \|
	\| Block \| enum UBlockCode (growing) \| (U) \| ublock_getCode, UCHAR_BLOCK \| Blocks \|
	\| Canonical_Combining_Class \| 0..255 \| (U) \| u_getCombiningClass, UCHAR_CANONICAL_COMBINING_CLASS \| UnicodeData \|
	\| Case_Folding \| Unicode string \| \| u_strFoldCase (ustring.h) \| CaseFolding \|
	\| Case_Ignorable \| binary \| (U) \| UCHAR_CASE_IGNORABLE \| DerivedCoreProperties \|
	\| Cased \| binary \| (U) \| UCHAR_CASED \| DerivedCoreProperties \|
	\| Changes_When_Casefolded \| binary \| (U) \| UCHAR_CHANGES_WHEN_CASEFOLDED \| DerivedCoreProperties \|
	\| Changes_When_Casemapped \| binary \| (U) \| UCHAR_CHANGES_WHEN_CASEMAPPED \| DerivedCoreProperties \|
	\| Changes_When_NFKC_Casefolded \| binary \| (U) \| UCHAR_CHANGES_WHEN_NFKC_CASEFOLDED \| DerivedNormalizationProps \|
	\| Changes_When_Lowercased \| binary \| (U) \| UCHAR_CHANGES_WHEN_LOWERCASED \| DerivedCoreProperties \|
	\| Changes_When_Titlecased \| binary \| (U) \| UCHAR_CHANGES_WHEN_TITLECASED \| DerivedCoreProperties \|
	\| Changes_When_Uppercased \| binary \| (U) \| UCHAR_CHANGES_WHEN_UPPERCASED \| DerivedCoreProperties \|
	\| Composition_Exclusion \| binary \| (c) \| contributes to Full_Composition_Exclusion \| CompositionExclusions \|
	\| Dash \| binary \| (U) \| UCHAR_DASH \| PropList \|
	\| Decomposition_Mapping \| Unicode string \| \| NFKC Normalizer2::getRawDecomposition() \| UnicodeData \|
	\| Decomposition_Type \| enum UDecompositionType \| (U) \| UCHAR_DECOMPOSITION_TYPE \| UnicodeData \|
	\| Default_Ignorable_Code_Point \| binary \| (U) \| UCHAR_DEFAULT_IGNORABLE_CODE_POINT \| DerivedCoreProperties \|
	\| Deprecated \| binary \| (U) \| UCHAR_DEPRECATED \| PropList \|
	\| Diacritic \| binary \| (U) \| UCHAR_DIACRITIC \| PropList \|
	\| East_Asian_Width \| enum UEastAsianWidth \| (U) \| UCHAR_EAST_ASIAN_WIDTH \| EastAsianWidth \|
	\| Expands_On_NF* \| binary \| \| available via normalization API (normalizer2.h) \| DerivedNormalizationProps \|
	\| Extender \| binary \| (U) \| UCHAR_EXTENDER \| PropList \|
	\| FC_NFKC_Closure \| Unicode string \| \| u_getFC_NFKC_Closure \| DerivedNormalizationProps \|
	\| Full_Composition_Exclusion \| binary \| (U) \| UCHAR_FULL_COMPOSITION_EXCLUSION \| DerivedNormalizationProps \|
	\| General_Category \| enum (<= 32 values) \| (U) \| u_charType, UCHAR_GENERAL_CATEGORY, UCHAR_GENERAL_CATEGORY_MASK, UCharCategory \| UnicodeData \|
	\| Grapheme_Base \| binary \| (U) \| UCHAR_GRAPHEME_BASE \| DerivedCoreProperties \|
	\| Grapheme_Cluster_Break \| enum UGraphemeClusterBreak \| (U) \| UCHAR_GRAPHEME_CLUSTER_BREAK \| GraphemeBreakProperty \|
	\| Grapheme_Extend \| binary \| (U) \| UCHAR_GRAPHEME_EXTEND \| DerivedCoreProperties \|
	\| Grapheme_Link \| binary \| (U) \| UCHAR_GRAPHEME_LINK \| DerivedCoreProperties \|
	\| Hangul_Syllable_Type \| enum UHangulSyllableType \| (U) \| UCHAR_HANGUL_SYLLABLE_TYPE \| HangulSyllableType \|
	\| Hex_Digit \| binary \| (U) \| UCHAR_HEX_DIGIT \| PropList \|
	\| Hyphen \| binary \| (U) \| UCHAR_HYPHEN \| PropList \|
	\| ID_Continue \| binary \| (U) \| UCHAR_ID_CONTINUE \| DerivedCoreProperties \|
	\| ID_Start \| binary \| (U) \| UCHAR_ID_START \| DerivedCoreProperties \|
	\| Ideographic \| binary \| (U) \| UCHAR_IDEOGRAPHIC \| PropList \|
	\| IDS_Binary_Operator \| binary \| (U) \| UCHAR_IDS_BINARY_OPERATOR \| PropList \|
	\| IDS_Triary_Operator \| binary \| (U) \| UCHAR_IDS_TRINARY_OPERATOR \| PropList \|
	\| Indic_Matra_Category \| (enum) \| \| provisional, not yet supported \| IndicMatraCategory \|
	\| Indic_Syllabic_Category \| (enum) \| \| provisional, not yet supported \| IndicSyllabicCategory \|
	\| ISO_Comment \| ASCII string \| \| u_getISOComment \| UnicodeData \|
	\| Jamo_Short_Name \| ASCII string \| (c) \| contributes to Name \| Jamo \|
	\| Join_Control \| binary \| (U) \| UCHAR_JOIN_CONTROL \| PropList \|
	\| Joining_Group \| enum UJoiningGroup \| (U) \| UCHAR_JOINING_GROUP \| ArabicShaping \|
	\| Joining_Type \| enum UJoiningType \| (U) \| UCHAR_JOINING_TYPE \| ArabicShaping \|
	\| Line_Break \| enum ULineBreak \| (U) \| UCHAR_LINE_BREAK \| LineBreak \|
	\| Logical_Order_Exception \| binary \| (U) \| UCHAR_LOGICAL_ORDER_EXCEPTION \| PropList \|
	\| Lowercase \| binary \| (U) \| u_isULowercase, UCHAR_LOWERCASE \| DerivedCoreProperties \|
	\| Lowercase_Mapping \| Unicode string + conditions \| \| available via u_strToLower (ustring.h) \| UnicodeData + SpecialCasing \|
	\| Math \| binary \| (U) \| UCHAR_MATH \| DerivedCoreProperties \|
	\| Name \| ASCII string \| (U) \| u_charName(U_UNICODE_CHAR_NAME or U_EXTENDED_CHAR_NAME) \| UnicodeData \|
	\| Name_Alias \| ASCII string \| \| u_charName(U_CHAR_NAME_ALIAS) \| NameAliases \|
	\| NF_QuickCheck \| enum UNormalizationCheckResult (no/maybe/yes) \| (U) \| UCHAR_NF_QUICK_CHECK and available via quickCheck (normalizer2.h) \| DerivedNormalizationProps \|
	\| NFKC_Casefold \| Unicode string \| \| available via normalization API (normalizer2.h "nfkc_cf") \| DerivedNormalizationProps \|
	\| Noncharacter_Code_Point \| binary \| (U) \| UCHAR_NONCHARACTER_CODE_POINT, U_IS_UNICODE_NONCHAR (utf.h) \| PropList \|
	\| Numeric_Type \| enum UNumericType \| (U) \| UCHAR_NUMERIC_TYPE \| UnicodeData \|
	\| Numeric_Value \| double \| (U) \| u_getNumericValueJava/UnicodeSet: only non-negative integers, no fractions \| UnicodeData \|
	\| Other_Alphabetic \| binary \| (c) \| contributes to Alphabetic \| PropList \|
	\| Other_Default_Ignorable_Code_Point \| binary \| (c) \| contributes to Default_Ignorable_Code_Point \| PropList \|
	\| Other_Grapheme_Extend \| binary \| (c) \| contributes to Grapheme_Extend \| PropList \|
	\| Other_Lowercase \| binary \| (c) \| contributes to Lowercase \| PropList \|
	\| Other_Math \| binary \| (c) \| contributes to Math \| PropList \|
	\| Other_Uppercase \| binary \| (c) \| contributes to Uppercase \| PropList \|
	\| Pattern_Syntax \| binary \| (U) \| UCHAR_PATTERN_SYNTAX \| PropList \|
	\| Pattern_White_Space \| binary \| (U) \| UCHAR_PATTERN_WHITE_SPACE \| PropList \|
	\| Quotation_Mark \| binary \| (U) \| UCHAR_QUOTATION_MARK \| PropList \|
	\| Radical \| binary \| (U) \| UCHAR_RADICAL \| PropList \|
	\| Script \| enum UScriptCode (growing) \| (U) \| uscript_getCode (uscript.h), UCHAR_SCRIPT \| Scripts \|
	\| Script_Extensions (provisional) \| list of enum UScriptCode (growing) \| (U) \| uscript_getScriptExtensions & uscript_hasScript (uscript.h), UCHAR_SCRIPT_EXTENSIONS; UnicodeSet [:scx=Arab:] is a superset of [:sc=Arab:] \| ScriptExtensions \|
	\| Sentence_Break \| enum USentenceBreak \| (U) \| UCHAR_SENTENCE_BREAK \| SentenceBreakProperty \|
	\| Simple_Case_Folding \| code point \| \| u_foldCase \| CaseFolding \|
	\| Simple_Lowercase_Mapping \| code point \| \| u_tolower \| UnicodeData \|
	\| Simple_Titlecase_Mapping \| code point \| \| u_totitle \| UnicodeData \|
	\| Simple_Uppercase_Mapping \| code point \| \| u_toupper \| UnicodeData \|
	\| Soft_Dotted \| binary \| (U) \| UCHAR_SOFT_DOTTED \| PropList \|
	\| Special_Case_Condition \| conditions \| \| available via u_strToLower etc. (ustring.h) \| SpecialCasing \|
	\| STerm \| binary \| (U) \| UCHAR_S_TERM \| PropList \|
	\| Terminal_Punctuation \| binary \| (U) \| UCHAR_TERMINAL_PUNCTUATION \| PropList \|
	\| Titlecase_Mapping \| Unicode string + conditions \| \| u_strToTitle (ustring.h) \| UnicodeData + SpecialCasing \|
	\| Unicode_1_Name \| ASCII string \| (U) \| u_charName(U_UNICODE_10_CHAR_NAME or U_EXTENDED_CHAR_NAME) \| UnicodeData \|
	\| Unified_Ideograph \| binary \| (U) \| UCHAR_UNIFIED_IDEOGRAPH \| PropList \|
	\| Uppercase \| binary \| (U) \| u_isUUppercase, UCHAR_UPPERCASE \| DerivedCoreProperties \|
	\| Uppercase_Mapping \| Unicode string + conditions \| \| u_strToUpper (ustring.h) \| UnicodeData + SpecialCasing \|
	\| White_Space \| binary \| (U) \| u_isUWhiteSpace, UCHAR_WHITE_SPACE \| PropList \|
	\| Word_Break \| enum UWordBreakValues \| (U) \| UCHAR_WORD_BREAK \| WordBreakProperty \|
	\| XID_Continue \| binary \| (U) \| UCHAR_XID_CONTINUE \| DerivedCoreProperties \|
	\| XID_Start \| binary \| (U) \| UCHAR_XID_START \| DerivedCoreProperties \|

	Notes:

	1. (c) - This property only contributes to "real" properties (mostly
	"Other_..." properties), so there is no direct support for this property in
	ICU.

	2. (U) - This property is available via the UnicodeSet APIs and patterns. Any
	property available in UnicodeSet is also available in regular expressions.
	Properties which are not available in UnicodeSet are generally those that
	are not available through a UProperty selector.

	## Customization

	ICU does not provide the means to modify properties at runtime. The properties
	are provided exactly as specified by a recent version of the Unicode Standard
	(as published in the [Character
	Database](http://www.unicode.org/unicode/onlinedat/online.html)).

	For custom sets and maps, it is easiest to make UnicodeSet or
	UCPTrie/CodePointTrie objects with the desired values.

	However, if an application requires custom properties (for example, for [Private
	Use](http://www.unicode.org/glossary/) characters), then it is possible to
	change or add them at build time, although this is not easy.

	It is done by modifying the Character Database files copied into the ICU source
	tree at
	[icu4c/source/data/unidata](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/unidata).
	Since ICU 49, most of the properties have been combined into one file,
	unidata/ppucd.txt (see the [Preparsed
	UCD](http://site.icu-project.org/design/props/ppucd) design doc). Some of the
	remaining UCD files are still inputs; others are only used for unit tests.

	To add a character to such a file, a line must be inserted into the file with
	the format used in that file (see the online documentation on the [Unicode
	site](http://www.unicode.org/reports/tr44/) for more information). After
	modifying one or more of these files, the ICU data needs to be rebuilt, and the
	resulting files need to be checked into the ICU source tree. The files are
	processed by special ICU tools outside of the normal ICU build. The
	[unidata/changes.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/changes.txt)
	file documents the process that has been used for the last several Unicode
	version updates; skip the file preparation and API update steps.

	Any available Unicode code point (U+0000..U+10FFFF) can be used. Code point values
	should be written with 4, 5, or 6 hex digits: use the minimum number of
	digits possible, but no fewer than 4. Note that the Unicode
	Standard specifies that the 32 code points U+FDD0..U+FDEF and the 34 code points
	U+...xFFFE and U+...xFFFF (where x=0, 1, 2, ..., F, 10) are not characters;
	therefore, they should not be added to any of the character database files.
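The two rules in the preceding paragraph (minimum-width hex notation and the exclusion of noncharacters) can be sketched in plain C++ as a pre-check before editing a data file. This is a minimal sketch, not part of ICU: `isNoncharacter` and `formatCodePoint` are hypothetical helper names. ICU itself offers the `U_IS_UNICODE_NONCHAR` macro (utf.h) listed in the table above for the noncharacter test.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: true for the 66 noncharacters, that is,
// U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.
bool isNoncharacter(uint32_t c) {
    if (c > 0x10FFFF) {
        return false;  // outside the Unicode code space altogether
    }
    return (c >= 0xFDD0 && c <= 0xFDEF) || (c & 0xFFFE) == 0xFFFE;
}

// Hypothetical helper: formats a code point with the minimum number of
// hex digits, but never fewer than 4 (so 4, 5, or 6 digits).
std::string formatCodePoint(uint32_t c) {
    assert(c <= 0x10FFFF);
    char buffer[8];  // up to 6 hex digits plus the terminating NUL
    std::snprintf(buffer, sizeof buffer, "%04X", static_cast<unsigned>(c));
    return buffer;
}
```

For example, `formatCodePoint(0x41)` yields `0041` and `formatCodePoint(0x1F600)` yields `1F600`, while `isNoncharacter(0x1FFFE)` is true, so that code point should not be added.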

	## Lookup

	For lookup by code point, iterate through the string, fetch code points, and
	either call the functions in unicode/uchar.h (C/C++), UCharacter (Java), or
	similar, or use dedicated sets and maps. For binary properties, and sets in
	general, there are also more efficient methods for iterating over substrings.

	### Binary property from code point

	Call one of the binary-property functions, such as u_hasBinaryProperty() (C) or
	UCharacter.hasBinaryProperty() (Java). Alternatively, make a UnicodeSet for
	the property (remember to freeze() it) or for a custom set of characters, and
	call contains().

	### Binary property over string

	It is often useful to partition a string into substrings where every character
	has the property, and substrings where no character has the property. This
	helps, for example, to split the string at separator characters, remove
	certain types of characters, or trim white space. Use a UnicodeSet with its
	span() and spanBack() methods (C++ also offers UTF-8 versions, spanUTF8() and
	spanBackUTF8()). In Java, you can also use a UnicodeSetSpanner.
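The alternating pattern behind this kind of partitioning can be sketched without ICU. In the sketch below, a plain predicate stands in for the UnicodeSet, and the hypothetical `spanWhile` function plays the role that span() would play with a "contained" or "not contained" condition. It works on UTF-16 code units only, so it is an illustration of the control flow under that simplifying assumption, not a replacement for the real API. `isAsciiSeparator` and `splitAtSeparators` are likewise made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Predicate = bool (*)(char16_t);

// Example predicate (an assumption for this sketch): space and comma separate fields.
bool isAsciiSeparator(char16_t c) { return c == u' ' || c == u','; }

// Stand-in for a span call: returns the end of the run that begins at `start`
// and whose code units all satisfy (contained == true) or all fail the predicate.
size_t spanWhile(const std::u16string &s, size_t start, bool contained, Predicate inSet) {
    assert(start <= s.size());
    size_t limit = start;
    while (limit < s.size() && inSet(s[limit]) == contained) {
        ++limit;
    }
    return limit;
}

// Splits a string at runs of separator characters by alternating the two span modes.
std::vector<std::u16string> splitAtSeparators(const std::u16string &s, Predicate isSeparator) {
    std::vector<std::u16string> pieces;
    size_t pos = 0;
    while (pos < s.size()) {
        pos = spanWhile(s, pos, true, isSeparator);            // skip separators
        size_t limit = spanWhile(s, pos, false, isSeparator);  // find the next field
        if (limit > pos) {
            pieces.push_back(s.substr(pos, limit - pos));
        }
        pos = limit;
    }
    return pieces;
}
```

Trimming white space follows the same idea: one forward span over the leading run and one backward span over the trailing run.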

	### Enumerated property from code point

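	Call one of the int-property functions, such as u_getIntPropertyValue() (C) or
	UCharacter.getIntPropertyValue() (Java). Alternatively, build a UCPTrie /
	CodePointTrie (new in ICU 63) via its mutable version (UMutableCPTrie /
	MutableCodePointTrie) and build method, then use that to get the int value for
	each code point.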

	### Enumerated property over string

	The easiest approach is to iterate over the code points of the string and call
	per-code point lookup methods (or use a code point trie).

	The UCPTrie / CodePointTrie (new in ICU 63) also offers C macros and a Java
	String iterator class where the iteration and data lookup are integrated to
	avoid redundancies in validation and range checks.

	The UTF-16 code point macros and the Java String iterator also provide the code
	point as output, because it has to be fetched or assembled anyway.

	The UTF-8 macros do not assemble the code point because that would be some
	amount of extra work, and often only the lookup value is used and the code point
	is not needed. When it is needed after all, it is possible to take advantage of
	the macros having validated the byte sequence: If the sequence was ill-formed,
	then the lookup yields the trie's error value. Therefore, if a value other than
	the trie error value was returned, then the sequence was well-formed, and the
	code point can be fetched without revalidating the sequence (e.g., via
	U8_NEXT_UNSAFE()).
	Since the length of the sequence (1..4 bytes) is also known from the iteration
	(string index before/after next() call), an even simpler piece of code can be
	used. (See for example the ICU-internal function codePointFromValidUTF8() in
	normalizer2impl.cpp.)
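As an illustration of that "even simpler piece of code", the sketch below assembles a code point from a byte sequence that is already known to be well-formed and whose length is known. It is written from the UTF-8 bit layout and is not a copy of the ICU-internal function; the name `codePointFromValidUtf8` and its signature are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Assembles a code point from a UTF-8 sequence that has already been validated.
// `length` is the number of bytes in the sequence (1..4), known from the
// string index before and after the lookup. No validation is repeated here.
uint32_t codePointFromValidUtf8(const uint8_t *start, int length) {
    assert(1 <= length && length <= 4);
    uint32_t lead = start[0];
    switch (length) {
    case 1:  // 0xxxxxxx
        return lead;
    case 2:  // 110xxxxx 10xxxxxx
        return ((lead & 0x1F) << 6) | (start[1] & 0x3F);
    case 3:  // 1110xxxx 10xxxxxx 10xxxxxx
        return ((lead & 0x0F) << 12) | ((start[1] & 0x3F) << 6) | (start[2] & 0x3F);
    default:  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return ((lead & 0x07) << 18) | ((start[1] & 0x3F) << 12) |
               ((start[2] & 0x3F) << 6) | (start[3] & 0x3F);
    }
}
```

Because the caller guarantees well-formedness, the function only masks and shifts. Passing unvalidated input would produce meaningless results.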

	### Code point trie most-optimized UTF-16 access

	UTF-16 text processing can be further optimized by detecting surrogate pairs and
	assembling supplementary code points only when there is non-trivial data
	available.

	At build time, iterate over all supplementary code points
	(umutablecptrie_getRange() / MutableCodePointTrie.getRange() starting from
	U+10000) to see if there is non-trivial data for any of the supplementary code
	points associated with a lead surrogate. If so, then set a special
	(application-specific) value for the lead surrogate.

	At runtime, use UCPTRIE_FAST_BMP_GET() per code unit. If there is non-trivial
	data and the code unit is a lead surrogate, then check if a trail surrogate
	follows. If so, assemble the supplementary code point with
	U16_GET_SUPPLEMENTARY() and look up its value with UCPTRIE_FAST_SUPP_GET();
	otherwise deal with the unpaired surrogate in some way. (Java CodePointTrie.Fast
	and java.lang.Character have equivalent methods.)

	If there is only trivial data for lead and trail surrogates, then processing can
	often skip them. (In this case, there will be two data lookups, one for the lead
	surrogate and one for the trail surrogate, but they are fast, and this
	optimization speeds up the more common BMP characters by not checking for
	surrogates each time.)

	For example, in normalization or case mapping, all characters that do not have
	any mappings are simply copied as is.
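The build-time and runtime steps described in this section can be sketched with a toy lookup table in place of a real code point trie. Everything in the sketch is an assumption made for illustration: `ToyTable` uses a flat array for the BMP and a `std::map` for supplementary code points, `LEAD_WITH_DATA` is the application-specific marker value, and `sumValues` is a made-up consumer. With ICU, the per-unit lookup would be UCPTRIE_FAST_BMP_GET() and the supplementary lookup UCPTRIE_FAST_SUPP_GET().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

constexpr uint16_t TRIVIAL = 0;               // "no data" value
constexpr uint16_t LEAD_WITH_DATA = 0xFFFF;   // application-specific marker (reserved)

struct ToyTable {
    std::vector<uint16_t> bmp = std::vector<uint16_t>(0x10000, TRIVIAL);
    std::map<uint32_t, uint16_t> supplementary;

    void set(uint32_t c, uint16_t value) {
        assert(c <= 0x10FFFF && value != LEAD_WITH_DATA);
        if (c < 0x10000) {
            bmp[c] = value;
            return;
        }
        supplementary[c] = value;
        // Build-time step: flag the lead surrogate that starts this code point.
        bmp[0xD7C0 + (c >> 10)] = LEAD_WITH_DATA;
    }
};

// Runtime step: one lookup per code unit; surrogates are examined only when
// the looked-up value says that there is something to find.
uint32_t sumValues(const std::u16string &s, const ToyTable &table) {
    uint32_t sum = 0;
    for (size_t i = 0; i < s.size(); ++i) {
        uint16_t value = table.bmp[s[i]];
        if (value == TRIVIAL) {
            continue;  // common case, including surrogates without data
        }
        if (value != LEAD_WITH_DATA) {
            sum += value;  // BMP character with data
            continue;
        }
        // s[i] is a lead surrogate; check whether a trail surrogate follows.
        if (i + 1 < s.size() && (s[i + 1] & 0xFC00) == 0xDC00) {
            uint32_t c = ((uint32_t(s[i]) - 0xD800) << 10) + (s[i + 1] - 0xDC00) + 0x10000;
            ++i;
            auto it = table.supplementary.find(c);
            if (it != table.supplementary.end()) {
                sum += it->second;
            }
        }
        // An unpaired lead surrogate is simply treated as having no data here.
    }
    return sum;
}
```

Note that the marker only says that some code point under that lead surrogate has data, so the supplementary lookup can still come back empty.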

	## Properties in ICU Rule Syntax

	ICU rule syntaxes should use the Unicode Pattern_White_Space set as syntactic
	"spaces" to allow for the usage of white space characters outside of the normal
	ASCII range while still maintaining backward compatibility. See
	<http://www.unicode.org/reports/tr31/#Pattern_Syntax> for more information.