| # Properties |
| |
| ## Overview |
| |
| Text processing requires that a program treat text appropriately. If text is |
| exchanged between several systems, it is important for them to process the text |
| consistently. This is done by assigning each character, or a range of |
| characters, attributes or properties used for text processing, and by defining |
| standard algorithms for at least the basic text operations. |
| |
| Traditionally, such attributes and algorithms have not been well-defined for |
| most character sets, and text processing had to rely on ad-hoc solutions. Over |
| time, standards were created for querying properties of the system codepage. |
| However, the set of these properties was limited. Their data was not coordinated |
| among implementations, and standard algorithms were not available. |
| |
| It is one of the strengths of Unicode that it not only defines a very large |
| character set, but also assigns a comprehensive set of properties and usage |
| notes to all characters. It defines standard algorithms for critical text |
| processing, and the data is publicly provided and kept up-to-date. See |
| <http://www.unicode.org/> for more information. |
| |
| Sample code is available in the ICU source code library at |
| [icu4c/source/samples/props/props.cpp](http://source.icu-project.org/repos/icu/trunk/icu4c/source/samples/props/props.cpp) |
| . See also the source code for the [Unicode |
| browser](http://source.icu-project.org/repos/icu/icuapps/trunk/ubrowse/) demo |
| application, which can be used |
| [online](http://demo.icu-project.org/icu-bin/ubrowse) to browse Unicode |
| characters with their properties. |
| |
| ## Unicode Character Database properties in ICU APIs |
| |
| The following table shows all Unicode Character Database properties (except for |
| purely "extracted" ones and Unihan properties) and the corresponding ICU APIs. |
| Most of the time, ICU4C provides functions in |
| icu4c/source/common/unicode/uchar.h and ICU4J provides parallel functions in the |
| com.ibm.icu.lang.UCharacter class. Properties of a single Unicode character are |
| accessed by its 21-bit code point value (type: UChar32=int32_t in C/C++, int in |
| Java). |
| |
| [Surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point) |
| mostly have default property values, except for the General_Category (gc=Cs). |
| |
| For integer values outside the Unicode code point range (negative or ≥ |
| 0x110000), most API functions return null values (false, 0, etc.). API functions |
| that map a code point to another (e.g., u_foldCase()/UCharacter.foldCase()) |
| normally return out-of-range values (i.e., map them to themselves), just like |
| for unassigned code points or generally code points that have no specific |
| mappings. In particular, -1 (=U_SENTINEL in ICU4C) is mapped to -1. |
| |
| Most properties are also available via UnicodeSet APIs and patterns. See the |
| Lookup section below. |
| |
| See the [Unicode Character |
| Database](http://www.unicode.org/reports/tr44/#Properties) itself for |
| comparison. The UCD files |
| [PropertyAliases.txt](http://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt) |
| and |
| [PropertyValueAliases.txt](http://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt) |
| list all properties and their values by name and type. |
| |
| Most properties that use binary, integer, or enumerated values are available via |
| functions u_hasBinaryProperty and u_getIntPropertyValue which take UProperty |
| enum constants to select the property. (ICU4J UCharacter member functions do not |
| have the "u_" prefix.) The constant names include the long property name |
| according to PropertyAliases.txt, e.g., UCHAR_LINE_BREAK. Corresponding property |
| value enum constant names often contain the short property name and the long |
| value name, e.g., U_LB_LINE_FEED. For enumeration/integer type properties, the |
| enumeration result type is also listed here. |
| |
| Some UnicodeSet APIs use the same UProperty constants. Other UnicodeSet APIs and |
| UnicodeSet and regular expression patterns use the long or short property |
| aliases and property value aliases (see PropertyAliases.txt and |
| PropertyValueAliases.txt). |
| |
| There is one pseudo-property, UCHAR_GENERAL_CATEGORY_MASK for which the APIs do |
| not use a single value but a bit-set (a mask) of zero or more values, with each |
| bit corresponding to one UCHAR_GENERAL_CATEGORY value. This allows ICU to |
| represent property value aliases for multiple general categories, like "Letters" |
| (which stands for "Uppercase Letters", "Lowercase Letters", etc.). In other |
| words, there are two ICU properties for the same Unicode property, one |
| delivering single values (for per-code point lookup) and the other delivering |
| sets of values (for use with value aliases and UnicodeSet). |
| |
| | UCD Name(see PropertyAliases.txt) | Type | | ICU4C uchar.hICU4J UCharacter | UCD File (.txt) | |
| |------------------------------------|-----------------------------------------------|-----|------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------| |
| | Age | Unicode version | (U) | C: u_charAge fills in UVersionInfoJava: getAge returns a VersionInfo reference | DerivedAge | |
| | Alphabetic | binary | (U) | u_isUAlphabetic, UCHAR_ALPHABETIC | DerivedCoreProperties | |
| | ASCII_Hex_Digit | binary | (U) | UCHAR_ASCII_HEX_DIGIT | PropList | |
| | Bidi_Class | enum UCharDirection | (U) | u_charDirection, UCHAR_BIDI_CLASS | UnicodeData | |
| | Bidi_Control | binary | (U) | UCHAR_BIDI_CONTROL | PropList | |
| | Bidi_Mirrored | binary | (U) | u_isMirrored, UCHAR_BIDI_MIRRORED | UnicodeData | |
| | Bidi_Mirroring_Glyph | code point | | u_charMirror | BidiMirroring | |
| | Block | enum UBlockCode (growing) | (U) | ublock_getCode, UCHAR_BLOCK | Blocks | |
| | Canonical_Combining_Class | 0..255 | (U) | u_getCombiningClass, UCHAR_CANONICAL_COMBINING_CLASS | UnicodeData | |
| | Case_Folding | Unicode string | | u_strFoldCase (ustring.h) | CaseFolding | |
| | Case_Ignorable | binary | (U) | UCHAR_CASE_IGNORABLE | DerivedCoreProperties | |
| | Cased | binary | (U) | UCHAR_CASED | DerivedCoreProperties | |
| | Changes_When_Casefolded | binary | (U) | UCHAR_CHANGES_WHEN_CASEFOLDED | DerivedCoreProperties | |
| | Changes_When_Casemapped | binary | (U) | UCHAR_CHANGES_WHEN_CASEMAPPED | DerivedCoreProperties | |
| | Changes_When_NFKC_Casefolded | binary | (U) | UCHAR_CHANGES_WHEN_NFKC_CASEFOLDED | DerivedNormalizationProps | |
| | Changes_When_Lowercased | binary | (U) | UCHAR_CHANGES_WHEN_LOWERCASED | DerivedCoreProperties | |
| | Changes_When_Titlecased | binary | (U) | UCHAR_CHANGES_WHEN_TITLECASED | DerivedCoreProperties | |
| | Changes_When_Uppercased | binary | (U) | UCHAR_CHANGES_WHEN_UPPERCASED | DerivedCoreProperties | |
| | Composition_Exclusion | binary | (c) | contributes to Full_Composition_Exclusion | CompositionExclusions | |
| | Dash | binary | (U) | UCHAR_DASH | PropList | |
| | Decomposition_Mapping | Unicode string | | NFKC Normalizer2::getRawDecomposition() | UnicodeData | |
| | Decomposition_Type | enum UDecompositionType | (U) | UCHAR_DECOMPOSITION_TYPE | UnicodeData | |
| | Default_Ignorable_Code_Point | binary | (U) | UCHAR_DEFAULT_IGNORABLE_CODE_POINT | DerivedCoreProperties | |
| | Deprecated | binary | (U) | UCHAR_DEPRECATED | PropList | |
| | Diacritic | binary | (U) | UCHAR_DIACRITIC | PropList | |
| | East_Asian_Width | enum UEastAsianWidth | (U) | UCHAR_EAST_ASIAN_WIDTH | EastAsianWidth | |
| | Expands_On_NF* | binary | | available via normalization API (normalizer2.h) | DerivedNormalizationProps | |
| | Extender | binary | (U) | UCHAR_EXTENDER | PropList | |
| | FC_NFKC_Closure | Unicode string | | u_getFC_NFKC_Closure | DerivedNormalizationProps | |
| | Full_Composition_Exclusion | binary | (U) | UCHAR_FULL_COMPOSITION_EXCLUSION | DerivedNormalizationProps | |
| | General_Category | enum (<= 32 values) | (U) | u_charType, UCHAR_GENERAL_CATEGORY, UCHAR_GENERAL_CATEGORY_MASK, UCharCategory | UnicodeData | |
| | Grapheme_Base | binary | (U) | UCHAR_GRAPHEME_BASE | DerivedCoreProperties | |
| | Grapheme_Cluster_Break | enum UGraphemeClusterBreak | (U) | UCHAR_GRAPHEME_CLUSTER_BREAK | GraphemeBreakProperty | |
| | Grapheme_Extend | binary | (U) | UCHAR_GRAPHEME_EXTEND | DerivedCoreProperties | |
| | Grapheme_Link | binary | (U) | UCHAR_GRAPHEME_LINK | DerivedCoreProperties | |
| | Hangul_Syllable_Type | enum UHangulSyllableType | (U) | UCHAR_HANGUL_SYLLABLE_TYPE | HangulSyllableType | |
| | Hex_Digit | binary | (U) | UCHAR_HEX_DIGIT | PropList | |
| | Hyphen | binary | (U) | UCHAR_HYPHEN | PropList | |
| | ID_Continue | binary | (U) | UCHAR_ID_CONTINUE | DerivedCoreProperties | |
| | ID_Start | binary | (U) | UCHAR_ID_START | DerivedCoreProperties | |
| | Ideographic | binary | (U) | UCHAR_IDEOGRAPHIC | PropList | |
| | IDS_Binary_Operator | binary | (U) | UCHAR_IDS_BINARY_OPERATOR | PropList | |
| | IDS_Triary_Operator | binary | (U) | UCHAR_IDS_TRINARY_OPERATOR | PropList | |
| | Indic_Matra_Category | (enum) | | provisional, not yet supported | IndicMatraCategory | |
| | Indic_Syllabic_Category | (enum) | | provisional, not yet supported | IndicSyllabicCategory | |
| | ISO_Comment | ASCII string | | u_getISOComment | UnicodeData | |
| | Jamo_Short_Name | ASCII string | (c) | contributes to Name | Jamo | |
| | Join_Control | binary | (U) | UCHAR_JOIN_CONTROL | PropList | |
| | Joining_Group | enum UJoiningGroup | (U) | UCHAR_JOINING_GROUP | ArabicShaping | |
| | Joining_Type | enum UJoiningType | (U) | UCHAR_JOINING_TYPE | ArabicShaping | |
| | Line_Break | enum ULineBreak | (U) | UCHAR_LINE_BREAK | LineBreak | |
| | Logical_Order_Exception | binary | (U) | UCHAR_LOGICAL_ORDER_EXCEPTION | PropList | |
| | Lowercase | binary | (U) | u_isULowercase, UCHAR_LOWERCASE | DerivedCoreProperties | |
| | Lowercase_Mapping | Unicode string + conditions | | available via u_strToLower (ustring.h) | UnicodeData + SpecialCasing | |
| | Math | binary | (U) | UCHAR_MATH | DerivedCoreProperties | |
| | Name | ASCII string | (U) | u_charName(U_UNICODE_CHAR_NAME or U_EXTENDED_CHAR_NAME) | UnicodeData | |
| | Name_Alias | ASCII string | | u_charName(U_CHAR_NAME_ALIAS) | NameAliases | |
| | NF*_QuickCheck | enum UNormalizationCheckResult (no/maybe/yes) | (U) | UCHAR_NF*_QUICK_CHECK and available via quickCheck (normalizer2.h) | DerivedNormalizationProps | |
| | NFKC_Casefold | Unicode string | | available via normalization API (normalizer2.h "nfkc_cf") | DerivedNormalizationProps | |
| | Noncharacter_Code_Point | binary | (U) | UCHAR_NONCHARACTER_CODE_POINT, U_IS_UNICODE_NONCHAR (utf.h) | PropList | |
| | Numeric_Type | enum UNumericType | (U) | UCHAR_NUMERIC_TYPE | UnicodeData | |
| | Numeric_Value | double | (U) | u_getNumericValueJava/UnicodeSet: only non-negative integers, no fractions | UnicodeData | |
| | Other_Alphabetic | binary | (c) | contributes to Alphabetic | PropList | |
| | Other_Default_Ignorable_Code_Point | binary | (c) | contributes to Default_Ignorable_Code_Point | PropList | |
| | Other_Grapheme_Extend | binary | (c) | contributes to Grapheme_Extend | PropList | |
| | Other_Lowercase | binary | (c) | contributes to Lowercase | PropList | |
| | Other_Math | binary | (c) | contributes to Math | PropList | |
| | Other_Uppercase | binary | (c) | contributes to Uppercase | PropList | |
| | Pattern_Syntax | binary | (U) | UCHAR_PATTERN_SYNTAX | PropList | |
| | Pattern_White_Space | binary | (U) | UCHAR_PATTERN_WHITE_SPACE | PropList | |
| | Quotation_Mark | binary | (U) | UCHAR_QUOTATION_MARK | PropList | |
| | Radical | binary | (U) | UCHAR_RADICAL | PropList | |
| | Script | enum UScriptCode (growing) | (U) | uscript_getCode (uscript.h), UCHAR_SCRIPT | Scripts | |
| | Script_Extensions (provisional) | list of enum UScriptCode (growing) | (U) | uscript_getScriptExtensions & uscript_hasScript (uscript.h), UCHAR_SCRIPT_EXTENSIONSUnicodeSet [:scx=Arab:] is a superset of [:sc=Arab:] | ScriptExtensions | |
| | Sentence_Break | enum USentenceBreak | (U) | UCHAR_SENTENCE_BREAK | SentenceBreakProperty | |
| | Simple_Case_Folding | code point | | u_foldCase | CaseFolding | |
| | Simple_Lowercase_ Mapping | code point | | u_tolower | UnicodeData | |
| | Simple_Titlecase_ Mapping | code point | | u_totitle | UnicodeData | |
| | Simple_Uppercase_ Mapping | code point | | u_toupper | UnicodeData | |
| | Soft_Dotted | binary | (U) | UCHAR_SOFT_DOTTED | PropList | |
| | Special_Case_Condition | conditions | | available via u_strToLower etc. (ustring.h) | SpecialCasing | |
| | STerm | binary | (U) | UCHAR_S_TERM | PropList | |
| | Terminal_Punctuation | binary | (U) | UCHAR_TERMINAL_PUNCTUATION | PropList | |
| | Titlecase_Mapping | Unicode string + conditions | | u_strToTitle (ustring.h) | UnicodeData + SpecialCasing | |
| | Unicode_1_Name | ASCII string | (U) | u_charName(U_UNICODE_10_CHAR_NAME or U_EXTENDED_CHAR_NAME) | UnicodeData | |
| | Unified_Ideograph | binary | (U) | UCHAR_UNIFIED_IDEOGRAPH | PropList | |
| | Uppercase | binary | (U) | u_isUUppercase, UCHAR_UPPERCASE | DerivedCoreProperties | |
| | Uppercase_Mapping | Unicode string + conditions | | u_strToUpper (ustring.h) | UnicodeData + SpecialCasing | |
| | White_Space | binary | (U) | u_isUWhiteSpace, UCHAR_WHITE_SPACE | PropList | |
| | Word_Break | enum UWordBreakValues | (U) | UCHAR_WORD_BREAK | WordBreakProperty | |
| | XID_Continue | binary | (U) | UCHAR_XID_CONTINUE | DerivedCoreProperties | |
| | XID_Start | binary | (U) | UCHAR_XID_START | DerivedCoreProperties | |
| |
| Notes: |
| |
| 1. (c) - This property only **contributes** to "real" properties (mostly |
| "Other_..." properties), so there is no direct support for this property in |
| ICU. |
| |
| 2. (U) - This property is available via the UnicodeSet APIs and patterns. Any |
| property available in UnicodeSet is also available in regular expressions. |
| Properties which are not available in UnicodeSet are generally those that |
| are not available through a UProperty selector. |
| |
| ## Customization |
| |
| ICU does not provide the means to modify properties at runtime. The properties |
| are provided exactly as specified by a recent version of the Unicode Standard |
| (as published in the [Character |
| Database](http://www.unicode.org/unicode/onlinedat/online.html) ). |
| |
| For custom sets and maps, it is easiest to make UnicodeSet or |
| UCPTrie/CodePointTrie objects with the desired values. |
| |
| However, if an application requires custom properties (for example, for [Private |
| Use](http://www.unicode.org/glossary/) characters), then it is possible to |
| change or add them at build-time. This is doable but not easy. |
| |
| It is done by modifying the Character Database files copied into the ICU source |
| tree at |
| [icu4c/source/data/unidata](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/unidata). |
| Since ICU 49, most of the properties have been combined into one file, |
| unidata/ppucd.txt (see the [Preparsed |
| UCD](http://site.icu-project.org/design/props/ppucd) design doc). Some of the |
| remaining UCD files are still inputs, others are only used for unit tests. |
| |
| To add a character to such a file, a line must be inserted into the file with |
| the format used in that file (see the online documentation on the [Unicode |
| site](http://www.unicode.org/reports/tr44/) for more information). After |
| modifying one or more of these files, the ICU data needs to be rebuilt, and the |
| resulting files need to be checked into the ICU source tree. The files are |
| processed by special ICU tools outside of the normal ICU build. The |
| [unidata/changes.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/changes.txt) |
| file documents the process that has been used for the last several Unicode |
| version updates; skip the file preparation and API update steps. |
| |
| Any available Unicode code point (0 to 10FFFF16) can be used. Code point values |
| should be written with either 4, 5, or 6 hex digits. The minimum number of |
| digits possible should be used (but no fewer than 4). Note that the Unicode |
| Standard specifies that the 32 code points U+FDD0..U+FDEF and the 34 code points |
| U+...xFFFE and U+...xFFFF (where x=0, 1, 2, ..., F, 10) are not characters, |
| therefore they should not be added to any of the character database files. |
| |
| ## Lookup |
| |
| For lookup by code point, iterate through the string, fetch code points, and |
| either call the unicode/uchar.h / UCharacter or similar functions, or use |
| dedicated sets and maps. For binary properties, and sets in general, there are |
| also more efficient methods for iterating over substrings. |
| |
| ### Binary property from code point |
| |
| Call one of the binary-property functions. Alternatively, make a UnicodeSet for |
| the property (remember to freeze() it) or for a custom set of characters, and |
| call contains(). |
| |
| ### Binary property over string |
| |
| It is often useful to partition a string into substrings where every character |
| has the property, and substrings where every character does not have the |
| property. For example, to split the string at separator characters, remove |
| certain types of characters, trim white space, etc. Use a UnicodeSet with its |
| span() and spanBack() methods (available in C++ in UTF-8 versions). In Java, you |
| can also use a UnicodeSetSpanner. |
| |
| ### Enumerated property from code point |
| |
| Call one of the int-property functions. Alternatively, build a UCPTrie / |
| CodePointTrie (new in ICU 63) via its mutable version and build method, then use |
| that to get the int value for each code point. |
| |
| ### Enumerated property over string |
| |
| Easiest is to iterate over code points of the string and call per-code point |
| lookup methods (or use a code point trie). |
| |
| The UCPTrie / CodePointTrie (new in ICU 63) also offers C macros and a Java |
| String iterator class where the iteration and data lookup are integrated to |
| avoid redundancies in validation and range checks. |
| |
| The UTF-16 code point macros and the Java String iterator also provide the code |
| point as output, because it has to be fetched or assembled anyway. |
| |
| The UTF-8 macros do not assemble the code point because that would be some |
| amount of extra work, but often only the lookup value is used and the code point |
| is not needed. When it is needed after all, it is possible to take advantage of |
| the macros having validated the byte sequence: If the sequence was ill-formed, |
| then the trie's error value is set. Therefore, if a value other than the trie |
| error value was returned, then the sequence was well-formed, and the code point |
| can be fetched without revalidating the sequence (e.g., via U8_NEXT_UNSAFE()). |
| Since the length of the sequence (1..4 bytes) is also known from the iteration |
| (string index before/after next() call), an even simpler piece of code can be |
| used. (See for example the ICU-internal function codePointFromValidUTF8() in |
| normalizer2impl.cpp.) |
| |
| ### Code point trie most-optimized UTF-16 access |
| |
| UTF-16 text processing can be further optimized by detecting surrogate pairs and |
| assembling supplementary code points only when there is non-trivial data |
| available. |
| |
| At build time, iterate over all supplementary code points |
| (umutablecptrie_getRange() / MutableCodePointTrie.getRange() starting from |
| U+10000) to see if there is non-trivial data for any of the supplementary code |
| points associated with a lead surrogate. If so, then set a special |
| (application-specific) value for the lead surrogate. |
| |
| At runtime, use UCPTRIE_FAST_BMP_GET() per code *unit*. If there is non-trivial |
| data and the code unit is a lead surrogate, then check if a trail surrogate |
| follows. If so, assemble the supplementary code point with |
| U16_GET_SUPPLEMENTARY() and look up its value with UCPTRIE_FAST_SUPP_GET(); |
| otherwise deal with the unpaired surrogate in some way. (Java CodePointTrie.Fast |
| and java.lang.Character have equivalent methods.) |
| |
| If there is only trivial data for lead and trail surrogates, then processing can |
| often skip them. (In this case, there will be two data lookups, one for the lead |
| surrogate and one for the trail surrogate, but they are fast, and this |
| optimization speeds up the more common BMP characters by not checking for |
| surrogates each time.) |
| |
| For example, in normalization or case mapping all characters that do not have |
| any mappings are simply copied as is. |
| |
| ## Properties in ICU Rule Syntax |
| |
| ICU rule syntaxes should use the Unicode Pattern_White_Space set as syntactic |
| "spaces" to allow for the usage of white space characters outside of the normal |
| ASCII range while still maintaining backward compatibility. See |
| <http://www.unicode.org/reports/tr31/#Pattern_Syntax> for more information. |